* [PATCH v1 00/22] vfio: Adopt iommufd
@ 2023-08-30 10:37 Zhenzhong Duan
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

Hi All,

Now that the kernel-side iommufd cdev and hot reset features have been
queued, and hwpt alloc has been added to Jason's for_next branch [1], I'd
like to post a new version matching the kernel-side updates, with the RFC
flag removed. The QEMU code can be found at [2]; we look forward to more
comments!


We have done extensive testing with different combinations, e.g.:

- PCI devices were tested
- FD passing and hot reset, with some tricks
- device hotplug with legacy and iommufd backends
- with or without vIOMMU for legacy and iommufd backends
- devices linked to different iommufds
- VFIO migration with an E800 NIC (no dirty sync support) passed through
- platform, ccw and ap were only compile-tested due to environment limitations


Given some iommufd kernel limitations, the iommufd backend is
not yet fully on par with the legacy backend w.r.t. features like:
- p2p mappings (you will see related error traces)
- dirty page sync
- etc.


Changelog:
v1:
- Allocate a hwpt explicitly instead of using the auto hwpt
- Elaborate the iommufd code per Nicolin's comments
- Consolidate two patches and drop as.c
- Fix typos and rename functions

I didn't list the changelog of the RFC stage; see [3] if anyone is interested.


[1] https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
[2] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_cdev_v1
[3] https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02529.html


--------------------------------------------------------------------------

With the introduction of iommufd, the Linux kernel provides a generic
interface for userspace drivers to propagate their DMA mappings to the
kernel for assigned devices. This series ports the VFIO devices onto the
/dev/iommu uapi and lets it coexist with the legacy implementation.

This QEMU integration is the result of a collaborative work between
Yi Liu, Yi Sun, Nicolin Chen and Eric Auger.

At the QEMU level, interactions with /dev/iommu are abstracted by a new
iommufd object (compiled in with the CONFIG_IOMMUFD option).

Any QEMU device (e.g. a vfio device) wishing to use /dev/iommu must be
linked with an iommufd object. In this series, the vfio-pci device is
granted this capability (other VFIO devices are not yet ready):

It gets a new optional parameter named iommufd, which allows passing an
iommufd object:

    -object iommufd,id=iommufd0
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0

Note that /dev/iommu and the VFIO cdev can be opened externally by a
management layer. In such a case the fds are passed:
  
    -object iommufd,id=iommufd0,fd=22
    -device vfio-pci,iommufd=iommufd0,fd=23

If the fd parameter is not passed, the fd is opened by QEMU.
See https://www.mail-archive.com/qemu-devel@nongnu.org/msg937155.html
for a detailed discussion of this requirement.
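
As a rough illustration only (not code from this series), a management
layer could pre-open both fds and let QEMU inherit them across exec; the
cdev path, fd numbers and QEMU binary name below are placeholders:

    /*
     * Hypothetical sketch: pre-open the iommufd and the VFIO cdev, then
     * exec QEMU with the inherited fd numbers on its command line.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* No O_CLOEXEC, so both fds stay open across the exec below. */
        int iommufd = open("/dev/iommu", O_RDWR);
        int devfd = open("/dev/vfio/devices/vfio0", O_RDWR);
        char obj[64], dev[64];

        if (iommufd < 0 || devfd < 0) {
            perror("open");
            return 1;
        }
        snprintf(obj, sizeof(obj), "iommufd,id=iommufd0,fd=%d", iommufd);
        snprintf(dev, sizeof(dev), "vfio-pci,iommufd=iommufd0,fd=%d", devfd);
        execlp("qemu-system-x86_64", "qemu-system-x86_64",
               "-object", obj, "-device", dev, (char *)NULL);
        perror("execlp");
        return 1;
    }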

If no iommufd option is passed to the vfio-pci device, iommufd is not
used and the end user gets the behavior of the legacy vfio iommu
interfaces:

    -device vfio-pci,host=0000:02:00.0

While the legacy kernel interface is group-centric, the new iommufd
interface is device-centric, relying on device fd and iommufd.

To support both interfaces in the QEMU VFIO device we reworked the vfio
container abstraction so that the generic VFIO code can use either
backend.

The VFIOContainer object becomes a base object derived into
a) the legacy VFIO container and
b) the new iommufd based container.

The base object implements generic code such as the code related to the
memory listener and address space management, whereas the derived objects
implement the callbacks specific to each backend, legacy and iommufd.
Indeed, each backend has its own way to set up a secure context and its
own DMA management interface. The diagram below shows how it looks with
both BEs.

                    VFIO                           AddressSpace/Memory
    +-------+  +----------+  +-----+  +-----+
    |  pci  |  | platform |  |  ap |  | ccw |
    +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
        |           |           |        |        |   AddressSpace       |
        |           |           |        |        +------------+---------+
    +---V-----------V-----------V--------V----+               /
    |           VFIOAddressSpace              | <------------+
    |                  |                      |  MemoryListener
    |          VFIOContainer list             |
    +-------+----------------------------+----+
            |                            |
            |                            |
    +-------V------+            +--------V----------+
    |   iommufd    |            |    vfio legacy    |
    |  container   |            |     container     |
    +-------+------+            +--------+----------+
            |                            |
            | /dev/iommu                 | /dev/vfio/vfio
            | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
Userspace   |                            |
============+============================+===========================
Kernel      |  device fd                 |
            +---------------+            | group/container fd
            | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
            |  ATTACH_IOAS) |            | device fd
            |               |            |
            |       +-------V------------V-----------------+
    iommufd |       |                vfio                  |
(map/unmap  |       +---------+--------------------+-------+
ioas_copy)  |                 |                    | map/unmap
            |                 |                    |
     +------V------+    +-----V------+      +------V--------+
     | iommufd core|    |  device    |      |  vfio iommu   |
     +-------------+    +------------+      +---------------+

[Secure Context setup]
- iommufd BE: uses the device fd and iommufd to set up the secure context
              (bind_iommufd, attach_ioas)
- vfio legacy BE: uses the group fd and container fd to set up the secure
                  context (set_container, set_iommu)
[Device access]
- iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
- vfio legacy BE: device fd is retrieved from group fd ioctl
[DMA Mapping flow]
1. VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
2. VFIO populates DMA map/unmap via the container BEs
   *) iommufd BE: uses iommufd
   *) vfio legacy BE: uses container fd
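
For reference, a minimal sketch of the raw uapi sequence the iommufd BE
boils down to, based on the structures added in patch 02. This is not
taken from the QEMU code; the cdev path is a placeholder and error
handling is omitted for brevity:

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>
    #include <linux/vfio.h>

    /* Bind one cdev to an iommufd, attach it to a hwpt, map one range. */
    static int iommufd_setup_and_map(void *buf, size_t len, uint64_t iova)
    {
        int iommufd = open("/dev/iommu", O_RDWR);
        int devfd = open("/dev/vfio/devices/vfio0", O_RDWR);

        /* Secure context, step 1: bind the device to the iommufd. */
        struct vfio_device_bind_iommufd bind = {
            .argsz = sizeof(bind), .iommufd = iommufd,
        };
        ioctl(devfd, VFIO_DEVICE_BIND_IOMMUFD, &bind);

        /* Allocate an IOAS to hold the IOVA -> memory mappings. */
        struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
        ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);

        /*
         * v1 allocates a hwpt explicitly (see changelog) instead of
         * relying on the auto hwpt created at attach time.
         */
        struct iommu_hwpt_alloc hwpt = {
            .size = sizeof(hwpt),
            .dev_id = bind.out_devid,
            .pt_id = alloc.out_ioas_id,
        };
        ioctl(iommufd, IOMMU_HWPT_ALLOC, &hwpt);

        /* Secure context, step 2: attach the device to the hwpt. */
        struct vfio_device_attach_iommufd_pt attach = {
            .argsz = sizeof(attach), .pt_id = hwpt.out_hwpt_id,
        };
        ioctl(devfd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);

        /* DMA mapping: what the MemoryListener add path ends up doing. */
        struct iommu_ioas_map map = {
            .size = sizeof(map),
            .flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
                     IOMMU_IOAS_MAP_WRITEABLE,
            .ioas_id = alloc.out_ioas_id,
            .user_va = (uintptr_t)buf,
            .length = len,
            .iova = iova,
        };
        return ioctl(iommufd, IOMMU_IOAS_MAP, &map);
    }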


Thanks,
Yi, Yi, Eric, Zhenzhong


Eric Auger (8):
  scripts/update-linux-headers: Add iommufd.h
  vfio/common: Introduce vfio_container_add|del_section_window()
  vfio/container: Introduce vfio_[attach/detach]_device
  vfio/platform: Use vfio_[attach/detach]_device
  vfio/ap: Use vfio_[attach/detach]_device
  vfio/ccw: Use vfio_[attach/detach]_device
  backends/iommufd: Introduce the iommufd object
  vfio/pci: Allow the selection of a given iommu backend

Yi Liu (5):
  vfio/common: Move IOMMU agnostic helpers to a separate file
  vfio/common: Move legacy VFIO backend code into separate container.c
  vfio: Add base container
  util/char_dev: Add open_cdev()
  vfio/iommufd: Implement the iommufd backend

Zhenzhong Duan (9):
  Update linux-header to support iommufd cdev and hwpt alloc
  vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  vfio/common: Add a vfio device iterator
  vfio/common: Refactor vfio_viommu_preset() to be group agnostic
  vfio/common: Simplify vfio_viommu_preset()
  Add iommufd configure option
  vfio/iommufd: Add vfio device iterator callback for iommufd
  vfio/pci: Adapt vfio pci hot reset support with iommufd BE
  vfio/pci: Make vfio cdev pre-openable by passing a file handle

 MAINTAINERS                           |   13 +
 backends/Kconfig                      |    4 +
 backends/iommufd.c                    |  291 ++++
 backends/meson.build                  |    3 +
 backends/trace-events                 |   13 +
 hw/vfio/ap.c                          |   68 +-
 hw/vfio/ccw.c                         |  120 +-
 hw/vfio/common.c                      | 1948 +++----------------------
 hw/vfio/container-base.c              |  160 ++
 hw/vfio/container.c                   | 1208 +++++++++++++++
 hw/vfio/helpers.c                     |  626 ++++++++
 hw/vfio/iommufd.c                     |  554 +++++++
 hw/vfio/meson.build                   |    6 +
 hw/vfio/pci.c                         |  319 +++-
 hw/vfio/platform.c                    |   43 +-
 hw/vfio/spapr.c                       |   22 +-
 hw/vfio/trace-events                  |   21 +-
 include/hw/vfio/vfio-common.h         |  111 +-
 include/hw/vfio/vfio-container-base.h |  158 ++
 include/qemu/char_dev.h               |   16 +
 include/standard-headers/linux/fuse.h |    3 +
 include/sysemu/iommufd.h              |   49 +
 linux-headers/linux/iommufd.h         |  444 ++++++
 linux-headers/linux/kvm.h             |   13 +-
 linux-headers/linux/vfio.h            |  148 +-
 meson.build                           |    6 +
 meson_options.txt                     |    2 +
 qapi/qom.json                         |   18 +-
 qemu-options.hx                       |   13 +
 scripts/meson-buildoptions.sh         |    3 +
 scripts/update-linux-headers.sh       |    3 +-
 util/chardev_open.c                   |   61 +
 util/meson.build                      |    1 +
 33 files changed, 4395 insertions(+), 2073 deletions(-)
 create mode 100644 backends/iommufd.c
 create mode 100644 hw/vfio/container-base.c
 create mode 100644 hw/vfio/container.c
 create mode 100644 hw/vfio/helpers.c
 create mode 100644 hw/vfio/iommufd.c
 create mode 100644 include/hw/vfio/vfio-container-base.h
 create mode 100644 include/qemu/char_dev.h
 create mode 100644 include/sysemu/iommufd.h
 create mode 100644 linux-headers/linux/iommufd.h
 create mode 100644 util/chardev_open.c

-- 
2.34.1




* [PATCH v1 01/22] scripts/update-linux-headers: Add iommufd.h
@ 2023-08-30 10:37 ` Zhenzhong Duan
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan, Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini

From: Eric Auger <eric.auger@redhat.com>

Update the script to import iommufd.h

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 scripts/update-linux-headers.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 35a64bb501..34295c0fe5 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -161,7 +161,8 @@ done
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
 for header in const.h stddef.h kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \
-              psci.h psp-sev.h userfaultfd.h memfd.h mman.h nvme_ioctl.h vduse.h; do
+              psci.h psp-sev.h userfaultfd.h memfd.h mman.h nvme_ioctl.h \
+              vduse.h iommufd.h; do
     cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
 
-- 
2.34.1




* [PATCH v1 02/22] Update linux-header to support iommufd cdev and hwpt alloc
@ 2023-08-30 10:37 ` Zhenzhong Duan
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan, Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	open list:Overall KVM CPUs

From https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
branch: for_next
commit id: eb501c2d96cfce6b42528e8321ea085ec605e790

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
Note this is a placeholder patch.

 include/standard-headers/linux/fuse.h |   3 +
 linux-headers/linux/iommufd.h         | 444 ++++++++++++++++++++++++++
 linux-headers/linux/kvm.h             |  13 +-
 linux-headers/linux/vfio.h            | 148 ++++++++-
 4 files changed, 604 insertions(+), 4 deletions(-)
 create mode 100644 linux-headers/linux/iommufd.h

diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h
index 35c131a107..2c8b8de9c2 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -206,6 +206,7 @@
  *  - add extension header
  *  - add FUSE_EXT_GROUPS
  *  - add FUSE_CREATE_SUPP_GROUP
+ *  - add FUSE_HAS_EXPIRE_ONLY
  */
 
 #ifndef _LINUX_FUSE_H
@@ -365,6 +366,7 @@ struct fuse_file_lock {
  * FUSE_HAS_INODE_DAX:  use per inode DAX
  * FUSE_CREATE_SUPP_GROUP: add supplementary group info to create, mkdir,
  *			symlink and mknod (single group that matches parent)
+ * FUSE_HAS_EXPIRE_ONLY: kernel supports expiry-only entry invalidation
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -402,6 +404,7 @@ struct fuse_file_lock {
 #define FUSE_SECURITY_CTX	(1ULL << 32)
 #define FUSE_HAS_INODE_DAX	(1ULL << 33)
 #define FUSE_CREATE_SUPP_GROUP	(1ULL << 34)
+#define FUSE_HAS_EXPIRE_ONLY	(1ULL << 35)
 
 /**
  * CUSE INIT request/reply flags
diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
new file mode 100644
index 0000000000..218bf7ac98
--- /dev/null
+++ b/linux-headers/linux/iommufd.h
@@ -0,0 +1,444 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _IOMMUFD_H
+#define _IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOMMUFD_TYPE (';')
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl interface follows a general format to allow for extensibility. Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger, structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has
+ *    non-zero in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *    understood, however a known field has a value the kernel does not
+ *    understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *    correct.
+ *  - ENOENT: An ID or IOVA provided does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Mathematics overflowed.
+ *
+ * As well as additional errnos, within specific ioctls.
+ */
+enum {
+	IOMMUFD_CMD_BASE = 0x80,
+	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_ALLOW_IOVAS,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_UNMAP,
+	IOMMUFD_CMD_OPTION,
+	IOMMUFD_CMD_VFIO_IOAS,
+	IOMMUFD_CMD_HWPT_ALLOC,
+	IOMMUFD_CMD_GET_HW_INFO,
+};
+
+/**
+ * struct iommu_destroy - ioctl(IOMMU_DESTROY)
+ * @size: sizeof(struct iommu_destroy)
+ * @id: iommufd object ID to destroy. Can be any destroyable object type.
+ *
+ * Destroy any object held within iommufd.
+ */
+struct iommu_destroy {
+	__u32 size;
+	__u32 id;
+};
+#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
+
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_iova_range - ioctl(IOMMU_IOVA_RANGE)
+ * @start: First IOVA
+ * @last: Inclusive last IOVA
+ *
+ * An interval in IOVA space.
+ */
+struct iommu_iova_range {
+	__aligned_u64 start;
+	__aligned_u64 last;
+};
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @num_iovas: Input/Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @allowed_iovas: Pointer to the output array of struct iommu_iova_range
+ * @out_iova_alignment: Minimum alignment required for mapping IOVA
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
+ * is not allowed. num_iovas will be set to the total number of iovas and
+ * the allowed_iovas[] will be filled in as space permits.
+ *
+ * The allowed ranges are dependent on the HW path the DMA operation takes, and
+ * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
+ * full range, and each attached device will narrow the ranges based on that
+ * device's HW restrictions. Detaching a device can widen the ranges. Userspace
+ * should query ranges after every attach/detach to know what IOVAs are valid
+ * for mapping.
+ *
+ * On input num_iovas is the length of the allowed_iovas array. On output it is
+ * the total number of iovas filled in. The ioctl will return -EMSGSIZE and set
+ * num_iovas to the required value if num_iovas is too small. In this case the
+ * caller should allocate a larger output array and re-issue the ioctl.
+ *
+ * out_iova_alignment returns the minimum IOVA alignment that can be given
+ * to IOMMU_IOAS_MAP/COPY. IOVA's must satisfy::
+ *
+ *   starting_iova % out_iova_alignment == 0
+ *   (starting_iova + length) % out_iova_alignment == 0
+ *
+ * out_iova_alignment can be 1 indicating any IOVA is allowed. It cannot
+ * be higher than the system PAGE_SIZE.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 allowed_iovas;
+	__aligned_u64 out_iova_alignment;
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * struct iommu_ioas_allow_iovas - ioctl(IOMMU_IOAS_ALLOW_IOVAS)
+ * @size: sizeof(struct iommu_ioas_allow_iovas)
+ * @ioas_id: IOAS ID to allow IOVAs from
+ * @num_iovas: Input/Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @allowed_iovas: Pointer to array of struct iommu_iova_range
+ *
+ * Ensure a range of IOVAs are always available for allocation. If this call
+ * succeeds then IOMMU_IOAS_IOVA_RANGES will never return a list of IOVA ranges
+ * that are narrower than the ranges provided here. This call will fail if
+ * IOMMU_IOAS_IOVA_RANGES is currently narrower than the given ranges.
+ *
+ * When an IOAS is first created the IOVA_RANGES will be maximally sized, and as
+ * devices are attached the IOVA will narrow based on the device restrictions.
+ * When an allowed range is specified any narrowing will be refused, ie device
+ * attachment can fail if the device requires limiting within the allowed range.
+ *
+ * Automatic IOVA allocation is also impacted by this call. MAP will only
+ * allocate within the allowed IOVAs if they are present.
+ *
+ * This call replaces the entire allowed list with the given list.
+ */
+struct iommu_ioas_allow_iovas {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 allowed_iovas;
+};
+#define IOMMU_IOAS_ALLOW_IOVAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOW_IOVAS)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location based on
+ * the reserved and allowed lists will be automatically selected and returned in
+ * iova.
+ *
+ * If IOMMU_IOAS_MAP_FIXED_IOVA is specified then the iova range must currently
+ * be unused, existing IOVA cannot be replaced.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ *
+ * This may be used to efficiently clone a subset of an IOAS to another, or as a
+ * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
+ * establishing equivalent new mappings, as internal resources are shared, and
+ * the kernel will pin the user memory only once.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap, and return back the bytes unmapped
+ *
+ * Unmap an IOVA range. The iova/length must be a superset of a previously
+ * mapped range used with IOMMU_IOAS_MAP or IOMMU_IOAS_COPY. Splitting or
+ * truncating ranges is not allowed. The values 0 to U64_MAX will unmap
+ * everything.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
+
+/**
+ * enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
+ *                       ioctl(IOMMU_OPTION_HUGE_PAGES)
+ * @IOMMU_OPTION_RLIMIT_MODE:
+ *    Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
+ *    to invoke this. Value 0 (default) is user based accouting, 1 uses process
+ *    based accounting. Global option, object_id must be 0
+ * @IOMMU_OPTION_HUGE_PAGES:
+ *    Value 1 (default) allows contiguous pages to be combined when generating
+ *    iommu mappings. Value 0 disables combining, everything is mapped to
+ *    PAGE_SIZE. This can be useful for benchmarking.  This is a per-IOAS
+ *    option, the object_id must be the IOAS ID.
+ */
+enum iommufd_option {
+	IOMMU_OPTION_RLIMIT_MODE = 0,
+	IOMMU_OPTION_HUGE_PAGES = 1,
+};
+
+/**
+ * enum iommufd_option_ops - ioctl(IOMMU_OPTION_OP_SET) and
+ *                           ioctl(IOMMU_OPTION_OP_GET)
+ * @IOMMU_OPTION_OP_SET: Set the option's value
+ * @IOMMU_OPTION_OP_GET: Get the option's value
+ */
+enum iommufd_option_ops {
+	IOMMU_OPTION_OP_SET = 0,
+	IOMMU_OPTION_OP_GET = 1,
+};
+
+/**
+ * struct iommu_option - iommu option multiplexer
+ * @size: sizeof(struct iommu_option)
+ * @option_id: One of enum iommufd_option
+ * @op: One of enum iommufd_option_ops
+ * @__reserved: Must be 0
+ * @object_id: ID of the object if required
+ * @val64: Option value to set or value returned on get
+ *
+ * Change a simple option value. This multiplexor allows controlling options
+ * on objects. IOMMU_OPTION_OP_SET will load an option and IOMMU_OPTION_OP_GET
+ * will return the current value.
+ */
+struct iommu_option {
+	__u32 size;
+	__u32 option_id;
+	__u16 op;
+	__u16 __reserved;
+	__u32 object_id;
+	__aligned_u64 val64;
+};
+#define IOMMU_OPTION _IO(IOMMUFD_TYPE, IOMMUFD_CMD_OPTION)
+
+/**
+ * enum iommufd_vfio_ioas_op - IOMMU_VFIO_IOAS_* ioctls
+ * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
+ */
+enum iommufd_vfio_ioas_op {
+	IOMMU_VFIO_IOAS_GET = 0,
+	IOMMU_VFIO_IOAS_SET = 1,
+	IOMMU_VFIO_IOAS_CLEAR = 2,
+};
+
+/**
+ * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
+ * @size: sizeof(struct iommu_vfio_ioas)
+ * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
+ *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
+ * @op: One of enum iommufd_vfio_ioas_op
+ * @__reserved: Must be 0
+ *
+ * The VFIO compatibility support uses a single ioas because VFIO APIs do not
+ * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
+ * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
+ * compatibility ioas, either by taking what is already set, or auto creating
+ * one. From then on VFIO will continue to use that ioas and is not effected by
+ * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
+ */
+struct iommu_vfio_ioas {
+	__u32 size;
+	__u32 ioas_id;
+	__u16 op;
+	__u16 __reserved;
+};
+#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+
+/**
+ * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
+ * @size: sizeof(struct iommu_hwpt_alloc)
+ * @flags: Must be 0
+ * @dev_id: The device to allocate this HWPT for
+ * @pt_id: The IOAS to connect this HWPT to
+ * @out_hwpt_id: The ID of the new HWPT
+ * @__reserved: Must be 0
+ *
+ * Explicitly allocate a hardware page table object. This is the same object
+ * type that is returned by iommufd_device_attach() and represents the
+ * underlying iommu driver's iommu_domain kernel object.
+ *
+ * A HWPT will be created with the IOVA mappings from the given IOAS.
+ */
+struct iommu_hwpt_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 dev_id;
+	__u32 pt_id;
+	__u32 out_hwpt_id;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
+
+/**
+ * struct iommu_hw_info_vtd - Intel VT-d hardware information
+ *
+ * @flags: Must be 0
+ * @__reserved: Must be 0
+ *
+ * @cap_reg: Value of Intel VT-d capability register defined in VT-d spec
+ *           section 11.4.2 Capability Register.
+ * @ecap_reg: Value of Intel VT-d capability register defined in VT-d spec
+ *            section 11.4.3 Extended Capability Register.
+ *
+ * User needs to understand the Intel VT-d specification to decode the
+ * register value.
+ */
+struct iommu_hw_info_vtd {
+	__u32 flags;
+	__u32 __reserved;
+	__aligned_u64 cap_reg;
+	__aligned_u64 ecap_reg;
+};
+
+/**
+ * enum iommu_hw_info_type - IOMMU Hardware Info Types
+ * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
+ *                           info
+ * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ */
+enum iommu_hw_info_type {
+	IOMMU_HW_INFO_TYPE_NONE,
+	IOMMU_HW_INFO_TYPE_INTEL_VTD,
+};
+
+/**
+ * struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
+ * @size: sizeof(struct iommu_hw_info)
+ * @flags: Must be 0
+ * @dev_id: The device bound to the iommufd
+ * @data_len: Input the length of a user buffer in bytes. Output the length of
+ *            data that kernel supports
+ * @data_uptr: User pointer to a user-space buffer used by the kernel to fill
+ *             the iommu type specific hardware information data
+ * @out_data_type: Output the iommu hardware info type as defined in the enum
+ *                 iommu_hw_info_type.
+ * @__reserved: Must be 0
+ *
+ * Query an iommu type specific hardware information data from an iommu behind
+ * a given device that has been bound to iommufd. This hardware info data will
+ * be used to sync capabilities between the virtual iommu and the physical
+ * iommu, e.g. a nested translation setup needs to check the hardware info, so
+ * a guest stage-1 page table can be compatible with the physical iommu.
+ *
+ * To capture an iommu type specific hardware information data, @data_uptr and
+ * its length @data_len must be provided. Trailing bytes will be zeroed if the
+ * user buffer is larger than the data that kernel has. Otherwise, kernel only
+ * fills the buffer using the given length in @data_len. If the ioctl succeeds,
+ * @data_len will be updated to the length that kernel actually supports,
+ * @out_data_type will be filled to decode the data filled in the buffer
+ * pointed by @data_uptr. Input @data_len == zero is allowed.
+ */
+struct iommu_hw_info {
+	__u32 size;
+	__u32 flags;
+	__u32 dev_id;
+	__u32 data_len;
+	__aligned_u64 data_uptr;
+	__u32 out_data_type;
+	__u32 __reserved;
+};
+#define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
+#endif
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 1f3f3333a4..0d74ee999a 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1414,9 +1414,16 @@ struct kvm_device_attr {
 	__u64	addr;		/* userspace address of attr data */
 };
 
-#define  KVM_DEV_VFIO_GROUP			1
-#define   KVM_DEV_VFIO_GROUP_ADD			1
-#define   KVM_DEV_VFIO_GROUP_DEL			2
+#define  KVM_DEV_VFIO_FILE			1
+
+#define   KVM_DEV_VFIO_FILE_ADD			1
+#define   KVM_DEV_VFIO_FILE_DEL			2
+
+/* KVM_DEV_VFIO_GROUP aliases are for compile time uapi compatibility */
+#define  KVM_DEV_VFIO_GROUP	KVM_DEV_VFIO_FILE
+
+#define   KVM_DEV_VFIO_GROUP_ADD	KVM_DEV_VFIO_FILE_ADD
+#define   KVM_DEV_VFIO_GROUP_DEL	KVM_DEV_VFIO_FILE_DEL
 #define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 16db89071e..7326ace436 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -677,11 +677,60 @@ enum {
  * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
  *					      struct vfio_pci_hot_reset_info)
  *
+ * This command is used to query the affected devices in the hot reset for
+ * a given device.
+ *
+ * This command always reports the segment, bus, and devfn information for
+ * each affected device, and selectively reports the group_id or devid per
+ * the way how the calling device is opened.
+ *
+ *	- If the calling device is opened via the traditional group/container
+ *	  API, group_id is reported.  User should check if it has owned all
+ *	  the affected devices and provides a set of group fds to prove the
+ *	  ownership in VFIO_DEVICE_PCI_HOT_RESET ioctl.
+ *
+ *	- If the calling device is opened as a cdev, devid is reported.
+ *	  Flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is set to indicate this
+ *	  data type.  All the affected devices should be represented in
+ *	  the dev_set, ex. bound to a vfio driver, and also be owned by
+ *	  this interface which is determined by the following conditions:
+ *	  1) Has a valid devid within the iommufd_ctx of the calling device.
+ *	     Ownership cannot be determined across separate iommufd_ctx and
+ *	     the cdev calling conventions do not support a proof-of-ownership
+ *	     model as provided in the legacy group interface.  In this case
+ *	     valid devid with value greater than zero is provided in the return
+ *	     structure.
+ *	  2) Does not have a valid devid within the iommufd_ctx of the calling
+ *	     device, but belongs to the same IOMMU group as the calling device
+ *	     or another opened device that has a valid devid within the
+ *	     iommufd_ctx of the calling device.  This provides implicit ownership
+ *	     for devices within the same DMA isolation context.  In this case
+ *	     the devid value of VFIO_PCI_DEVID_OWNED is provided in the return
+ *	     structure.
+ *
+ *	  A devid value of VFIO_PCI_DEVID_NOT_OWNED is provided in the return
+ *	  structure for affected devices where device is NOT represented in the
+ *	  dev_set or ownership is not available.  Such devices prevent the use
+ *	  of VFIO_DEVICE_PCI_HOT_RESET ioctl outside of the proof-of-ownership
+ *	  calling conventions (ie. via legacy group accessed devices).  Flag
+ *	  VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED would be set when all the
+ *	  affected devices are represented in the dev_set and also owned by
+ *	  the user.  This flag is available only when
+ *	  flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is set, otherwise reserved.
+ *	  When set, user could invoke VFIO_DEVICE_PCI_HOT_RESET with a zero
+ *	  length fd array on the calling device as the ownership is validated
+ *	  by iommufd_ctx.
+ *
  * Return: 0 on success, -errno on failure:
  *	-enospc = insufficient buffer, -enodev = unsupported for device.
  */
 struct vfio_pci_dependent_device {
-	__u32	group_id;
+	union {
+		__u32   group_id;
+		__u32	devid;
+#define VFIO_PCI_DEVID_OWNED		0
+#define VFIO_PCI_DEVID_NOT_OWNED	-1
+	};
 	__u16	segment;
 	__u8	bus;
 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
@@ -690,6 +739,8 @@ struct vfio_pci_dependent_device {
 struct vfio_pci_hot_reset_info {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_PCI_HOT_RESET_FLAG_DEV_ID		(1 << 0)
+#define VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED	(1 << 1)
 	__u32	count;
 	struct vfio_pci_dependent_device	devices[];
 };
@@ -700,6 +751,24 @@ struct vfio_pci_hot_reset_info {
  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
  *				    struct vfio_pci_hot_reset)
  *
+ * A PCI hot reset results in either a bus or slot reset which may affect
+ * other devices sharing the bus/slot.  The calling user must have
+ * ownership of the full set of affected devices as determined by the
+ * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl.
+ *
+ * When called on a device file descriptor acquired through the vfio
+ * group interface, the user is required to provide proof of ownership
+ * of those affected devices via the group_fds array in struct
+ * vfio_pci_hot_reset.
+ *
+ * When called on a direct cdev opened vfio device, the flags field of
+ * struct vfio_pci_hot_reset_info reports the ownership status of the
+ * affected devices and this ioctl must be called with an empty group_fds
+ * array.  See above INFO ioctl definition for ownership requirements.
+ *
+ * Mixed usage of legacy groups and cdevs across the set of affected
+ * devices is not supported.
+ *
  * Return: 0 on success, -errno on failure.
  */
 struct vfio_pci_hot_reset {
@@ -828,6 +897,83 @@ struct vfio_device_feature {
 
 #define VFIO_DEVICE_FEATURE		_IO(VFIO_TYPE, VFIO_BASE + 17)
 
+/*
+ * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 18,
+ *				   struct vfio_device_bind_iommufd)
+ * @argsz:	 User filled size of this data.
+ * @flags:	 Must be 0.
+ * @iommufd:	 iommufd to bind.
+ * @out_devid:	 The device id generated by this bind. devid is a handle for
+ *		 this device/iommufd bond and can be used in IOMMUFD commands.
+ *
+ * Bind a vfio_device to the specified iommufd.
+ *
+ * User is restricted from accessing the device before the binding operation
+ * is completed.  Only allowed on cdev fds.
+ *
+ * Unbind is automatically conducted when device fd is closed.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_bind_iommufd {
+	__u32		argsz;
+	__u32		flags;
+	__s32		iommufd;
+	__u32		out_devid;
+};
+
+#define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+/*
+ * VFIO_DEVICE_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 19,
+ *					struct vfio_device_attach_iommufd_pt)
+ * @argsz:	User filled size of this data.
+ * @flags:	Must be 0.
+ * @pt_id:	Input the target id which can represent an ioas or a hwpt
+ *		allocated via iommufd subsystem.
+ *		Output the input ioas id or the attached hwpt id which could
+ *		be the specified hwpt itself or a hwpt automatically created
+ *		for the specified ioas by kernel during the attachment.
+ *
+ * Associate the device with an address space within the bound iommufd.
+ * Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close.  This is only
+ * allowed on cdev fds.
+ *
+ * If a vfio device is currently attached to a valid hw_pagetable, without doing
+ * a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl
+ * passing in another hw_pagetable (hwpt) id is allowed. This action, also known
+ * as a hw_pagetable replacement, will replace the device's currently attached
+ * hw_pagetable with a new hw_pagetable corresponding to the given pt_id.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_attach_iommufd_pt {
+	__u32	argsz;
+	__u32	flags;
+	__u32	pt_id;
+};
+
+#define VFIO_DEVICE_ATTACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/*
+ * VFIO_DEVICE_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 20,
+ *					struct vfio_device_detach_iommufd_pt)
+ * @argsz:	User filled size of this data.
+ * @flags:	Must be 0.
+ *
+ * Remove the association of the device and its current associated address
+ * space.  After it, the device should be in a blocking DMA state.  This is only
+ * allowed on cdev fds.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_detach_iommufd_pt {
+	__u32	argsz;
+	__u32	flags;
+};
+
+#define VFIO_DEVICE_DETACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /*
  * Provide support for setting a PCI VF Token, which is used as a shared
  * secret between PF and VF drivers.  This feature may only be set on a
-- 
2.34.1




* [PATCH v1 03/22] vfio/common: Move IOMMU agnostic helpers to a separate file
@ 2023-08-30 10:37 ` Zhenzhong Duan
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun, Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

Move low-level iommu agnostic helpers to a separate helpers.c
file. They relate to regions, interrupts and device/region
capabilities.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c              | 569 --------------------------------
 hw/vfio/helpers.c             | 598 ++++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 include/hw/vfio/vfio-common.h |   2 +
 4 files changed, 601 insertions(+), 569 deletions(-)
 create mode 100644 hw/vfio/helpers.c

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9aac21abb7..9ca695837f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -61,84 +61,6 @@ static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
 static int vfio_kvm_device_fd = -1;
 #endif
 
-/*
- * Common VFIO interrupt disable
- */
-void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
-        .index = index,
-        .start = 0,
-        .count = 0,
-    };
-
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
-
-void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
-        .index = index,
-        .start = 0,
-        .count = 1,
-    };
-
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
-
-void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
-        .index = index,
-        .start = 0,
-        .count = 1,
-    };
-
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
-
-static inline const char *action_to_str(int action)
-{
-    switch (action) {
-    case VFIO_IRQ_SET_ACTION_MASK:
-        return "MASK";
-    case VFIO_IRQ_SET_ACTION_UNMASK:
-        return "UNMASK";
-    case VFIO_IRQ_SET_ACTION_TRIGGER:
-        return "TRIGGER";
-    default:
-        return "UNKNOWN ACTION";
-    }
-}
-
-static const char *index_to_str(VFIODevice *vbasedev, int index)
-{
-    if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
-        return NULL;
-    }
-
-    switch (index) {
-    case VFIO_PCI_INTX_IRQ_INDEX:
-        return "INTX";
-    case VFIO_PCI_MSI_IRQ_INDEX:
-        return "MSI";
-    case VFIO_PCI_MSIX_IRQ_INDEX:
-        return "MSIX";
-    case VFIO_PCI_ERR_IRQ_INDEX:
-        return "ERR";
-    case VFIO_PCI_REQ_IRQ_INDEX:
-        return "REQ";
-    default:
-        return NULL;
-    }
-}
-
 static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
 {
     switch (container->iommu_type) {
@@ -162,160 +84,6 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
     }
 }
 
-int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
-                           int action, int fd, Error **errp)
-{
-    struct vfio_irq_set *irq_set;
-    int argsz, ret = 0;
-    const char *name;
-    int32_t *pfd;
-
-    argsz = sizeof(*irq_set) + sizeof(*pfd);
-
-    irq_set = g_malloc0(argsz);
-    irq_set->argsz = argsz;
-    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | action;
-    irq_set->index = index;
-    irq_set->start = subindex;
-    irq_set->count = 1;
-    pfd = (int32_t *)&irq_set->data;
-    *pfd = fd;
-
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
-        ret = -errno;
-    }
-    g_free(irq_set);
-
-    if (!ret) {
-        return 0;
-    }
-
-    error_setg_errno(errp, -ret, "VFIO_DEVICE_SET_IRQS failure");
-
-    name = index_to_str(vbasedev, index);
-    if (name) {
-        error_prepend(errp, "%s-%d: ", name, subindex);
-    } else {
-        error_prepend(errp, "index %d-%d: ", index, subindex);
-    }
-    error_prepend(errp,
-                  "Failed to %s %s eventfd signaling for interrupt ",
-                  fd < 0 ? "tear down" : "set up", action_to_str(action));
-    return ret;
-}
-
-/*
- * IO Port/MMIO - Beware of the endians, VFIO is always little endian
- */
-void vfio_region_write(void *opaque, hwaddr addr,
-                       uint64_t data, unsigned size)
-{
-    VFIORegion *region = opaque;
-    VFIODevice *vbasedev = region->vbasedev;
-    union {
-        uint8_t byte;
-        uint16_t word;
-        uint32_t dword;
-        uint64_t qword;
-    } buf;
-
-    switch (size) {
-    case 1:
-        buf.byte = data;
-        break;
-    case 2:
-        buf.word = cpu_to_le16(data);
-        break;
-    case 4:
-        buf.dword = cpu_to_le32(data);
-        break;
-    case 8:
-        buf.qword = cpu_to_le64(data);
-        break;
-    default:
-        hw_error("vfio: unsupported write size, %u bytes", size);
-        break;
-    }
-
-    if (pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
-        error_report("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
-                     ",%d) failed: %m",
-                     __func__, vbasedev->name, region->nr,
-                     addr, data, size);
-    }
-
-    trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
-
-    /*
-     * A read or write to a BAR always signals an INTx EOI.  This will
-     * do nothing if not pending (including not in INTx mode).  We assume
-     * that a BAR access is in response to an interrupt and that BAR
-     * accesses will service the interrupt.  Unfortunately, we don't know
-     * which access will service the interrupt, so we're potentially
-     * getting quite a few host interrupts per guest interrupt.
-     */
-    vbasedev->ops->vfio_eoi(vbasedev);
-}
-
-uint64_t vfio_region_read(void *opaque,
-                          hwaddr addr, unsigned size)
-{
-    VFIORegion *region = opaque;
-    VFIODevice *vbasedev = region->vbasedev;
-    union {
-        uint8_t byte;
-        uint16_t word;
-        uint32_t dword;
-        uint64_t qword;
-    } buf;
-    uint64_t data = 0;
-
-    if (pread(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
-        error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %m",
-                     __func__, vbasedev->name, region->nr,
-                     addr, size);
-        return (uint64_t)-1;
-    }
-    switch (size) {
-    case 1:
-        data = buf.byte;
-        break;
-    case 2:
-        data = le16_to_cpu(buf.word);
-        break;
-    case 4:
-        data = le32_to_cpu(buf.dword);
-        break;
-    case 8:
-        data = le64_to_cpu(buf.qword);
-        break;
-    default:
-        hw_error("vfio: unsupported read size, %u bytes", size);
-        break;
-    }
-
-    trace_vfio_region_read(vbasedev->name, region->nr, addr, size, data);
-
-    /* Same as write above */
-    vbasedev->ops->vfio_eoi(vbasedev);
-
-    return data;
-}
-
-const MemoryRegionOps vfio_region_ops = {
-    .read = vfio_region_read,
-    .write = vfio_region_write,
-    .endianness = DEVICE_LITTLE_ENDIAN,
-    .valid = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-    },
-    .impl = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-    },
-};
-
 /*
  * Device state interfaces
  */
@@ -1916,30 +1684,6 @@ static void vfio_listener_release(VFIOContainer *container)
     }
 }
 
-static struct vfio_info_cap_header *
-vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
-{
-    struct vfio_info_cap_header *hdr;
-
-    for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
-        if (hdr->id == id) {
-            return hdr;
-        }
-    }
-
-    return NULL;
-}
-
-struct vfio_info_cap_header *
-vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
 static struct vfio_info_cap_header *
 vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
 {
@@ -1950,16 +1694,6 @@ vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
     return vfio_get_cap((void *)info, info->cap_offset, id);
 }
 
-struct vfio_info_cap_header *
-vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
 bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
                              unsigned int *avail)
 {
@@ -1981,232 +1715,6 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
     return true;
 }
 
-static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
-                                          struct vfio_region_info *info)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_region_info_cap_sparse_mmap *sparse;
-    int i, j;
-
-    hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
-    if (!hdr) {
-        return -ENODEV;
-    }
-
-    sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header);
-
-    trace_vfio_region_sparse_mmap_header(region->vbasedev->name,
-                                         region->nr, sparse->nr_areas);
-
-    region->mmaps = g_new0(VFIOMmap, sparse->nr_areas);
-
-    for (i = 0, j = 0; i < sparse->nr_areas; i++) {
-        if (sparse->areas[i].size) {
-            trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
-                                            sparse->areas[i].offset +
-                                            sparse->areas[i].size - 1);
-            region->mmaps[j].offset = sparse->areas[i].offset;
-            region->mmaps[j].size = sparse->areas[i].size;
-            j++;
-        }
-    }
-
-    region->nr_mmaps = j;
-    region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap));
-
-    return 0;
-}
-
-int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
-                      int index, const char *name)
-{
-    struct vfio_region_info *info;
-    int ret;
-
-    ret = vfio_get_region_info(vbasedev, index, &info);
-    if (ret) {
-        return ret;
-    }
-
-    region->vbasedev = vbasedev;
-    region->flags = info->flags;
-    region->size = info->size;
-    region->fd_offset = info->offset;
-    region->nr = index;
-
-    if (region->size) {
-        region->mem = g_new0(MemoryRegion, 1);
-        memory_region_init_io(region->mem, obj, &vfio_region_ops,
-                              region, name, region->size);
-
-        if (!vbasedev->no_mmap &&
-            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
-
-            ret = vfio_setup_region_sparse_mmaps(region, info);
-
-            if (ret) {
-                region->nr_mmaps = 1;
-                region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
-                region->mmaps[0].offset = 0;
-                region->mmaps[0].size = region->size;
-            }
-        }
-    }
-
-    g_free(info);
-
-    trace_vfio_region_setup(vbasedev->name, index, name,
-                            region->flags, region->fd_offset, region->size);
-    return 0;
-}
-
-static void vfio_subregion_unmap(VFIORegion *region, int index)
-{
-    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
-                            region->mmaps[index].offset,
-                            region->mmaps[index].offset +
-                            region->mmaps[index].size - 1);
-    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
-    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
-    object_unparent(OBJECT(&region->mmaps[index].mem));
-    region->mmaps[index].mmap = NULL;
-}
-
-int vfio_region_mmap(VFIORegion *region)
-{
-    int i, prot = 0;
-    char *name;
-
-    if (!region->mem) {
-        return 0;
-    }
-
-    prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
-    prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
-                                     MAP_SHARED, region->vbasedev->fd,
-                                     region->fd_offset +
-                                     region->mmaps[i].offset);
-        if (region->mmaps[i].mmap == MAP_FAILED) {
-            int ret = -errno;
-
-            trace_vfio_region_mmap_fault(memory_region_name(region->mem), i,
-                                         region->fd_offset +
-                                         region->mmaps[i].offset,
-                                         region->fd_offset +
-                                         region->mmaps[i].offset +
-                                         region->mmaps[i].size - 1, ret);
-
-            region->mmaps[i].mmap = NULL;
-
-            for (i--; i >= 0; i--) {
-                vfio_subregion_unmap(region, i);
-            }
-
-            return ret;
-        }
-
-        name = g_strdup_printf("%s mmaps[%d]",
-                               memory_region_name(region->mem), i);
-        memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
-                                          memory_region_owner(region->mem),
-                                          name, region->mmaps[i].size,
-                                          region->mmaps[i].mmap);
-        g_free(name);
-        memory_region_add_subregion(region->mem, region->mmaps[i].offset,
-                                    &region->mmaps[i].mem);
-
-        trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
-                               region->mmaps[i].offset,
-                               region->mmaps[i].offset +
-                               region->mmaps[i].size - 1);
-    }
-
-    return 0;
-}
-
-void vfio_region_unmap(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            vfio_subregion_unmap(region, i);
-        }
-    }
-}
-
-void vfio_region_exit(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
-        }
-    }
-
-    trace_vfio_region_exit(region->vbasedev->name, region->nr);
-}
-
-void vfio_region_finalize(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            munmap(region->mmaps[i].mmap, region->mmaps[i].size);
-            object_unparent(OBJECT(&region->mmaps[i].mem));
-        }
-    }
-
-    object_unparent(OBJECT(region->mem));
-
-    g_free(region->mem);
-    g_free(region->mmaps);
-
-    trace_vfio_region_finalize(region->vbasedev->name, region->nr);
-
-    region->mem = NULL;
-    region->mmaps = NULL;
-    region->nr_mmaps = 0;
-    region->size = 0;
-    region->flags = 0;
-    region->nr = 0;
-}
-
-void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            memory_region_set_enabled(&region->mmaps[i].mem, enabled);
-        }
-    }
-
-    trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem),
-                                        enabled);
-}
-
 void vfio_reset_handler(void *opaque)
 {
     VFIOGroup *group;
@@ -2905,83 +2413,6 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     close(vbasedev->fd);
 }
 
-int vfio_get_region_info(VFIODevice *vbasedev, int index,
-                         struct vfio_region_info **info)
-{
-    size_t argsz = sizeof(struct vfio_region_info);
-
-    *info = g_malloc0(argsz);
-
-    (*info)->index = index;
-retry:
-    (*info)->argsz = argsz;
-
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
-        g_free(*info);
-        *info = NULL;
-        return -errno;
-    }
-
-    if ((*info)->argsz > argsz) {
-        argsz = (*info)->argsz;
-        *info = g_realloc(*info, argsz);
-
-        goto retry;
-    }
-
-    return 0;
-}
-
-int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
-                             uint32_t subtype, struct vfio_region_info **info)
-{
-    int i;
-
-    for (i = 0; i < vbasedev->num_regions; i++) {
-        struct vfio_info_cap_header *hdr;
-        struct vfio_region_info_cap_type *cap_type;
-
-        if (vfio_get_region_info(vbasedev, i, info)) {
-            continue;
-        }
-
-        hdr = vfio_get_region_info_cap(*info, VFIO_REGION_INFO_CAP_TYPE);
-        if (!hdr) {
-            g_free(*info);
-            continue;
-        }
-
-        cap_type = container_of(hdr, struct vfio_region_info_cap_type, header);
-
-        trace_vfio_get_dev_region(vbasedev->name, i,
-                                  cap_type->type, cap_type->subtype);
-
-        if (cap_type->type == type && cap_type->subtype == subtype) {
-            return 0;
-        }
-
-        g_free(*info);
-    }
-
-    *info = NULL;
-    return -ENODEV;
-}
-
-bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
-{
-    struct vfio_region_info *info = NULL;
-    bool ret = false;
-
-    if (!vfio_get_region_info(vbasedev, region, &info)) {
-        if (vfio_get_region_info_cap(info, cap_type)) {
-            ret = true;
-        }
-        g_free(info);
-    }
-
-    return ret;
-}
-
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
new file mode 100644
index 0000000000..4338456b08
--- /dev/null
+++ b/hw/vfio/helpers.c
@@ -0,0 +1,598 @@
+/*
+ * low level and IOMMU backend agnostic helpers used by VFIO devices,
+ * related to regions, interrupts, capabilities
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "hw/hw.h"
+#include "trace.h"
+#include "qapi/error.h"
+
+/*
+ * Common VFIO interrupt disable
+ */
+void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+        .index = index,
+        .start = 0,
+        .count = 0,
+    };
+
+    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
+        .index = index,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
+        .index = index,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+static inline const char *action_to_str(int action)
+{
+    switch (action) {
+    case VFIO_IRQ_SET_ACTION_MASK:
+        return "MASK";
+    case VFIO_IRQ_SET_ACTION_UNMASK:
+        return "UNMASK";
+    case VFIO_IRQ_SET_ACTION_TRIGGER:
+        return "TRIGGER";
+    default:
+        return "UNKNOWN ACTION";
+    }
+}
+
+static const char *index_to_str(VFIODevice *vbasedev, int index)
+{
+    if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
+        return NULL;
+    }
+
+    switch (index) {
+    case VFIO_PCI_INTX_IRQ_INDEX:
+        return "INTX";
+    case VFIO_PCI_MSI_IRQ_INDEX:
+        return "MSI";
+    case VFIO_PCI_MSIX_IRQ_INDEX:
+        return "MSIX";
+    case VFIO_PCI_ERR_IRQ_INDEX:
+        return "ERR";
+    case VFIO_PCI_REQ_IRQ_INDEX:
+        return "REQ";
+    default:
+        return NULL;
+    }
+}
+
+int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
+                           int action, int fd, Error **errp)
+{
+    struct vfio_irq_set *irq_set;
+    int argsz, ret = 0;
+    const char *name;
+    int32_t *pfd;
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | action;
+    irq_set->index = index;
+    irq_set->start = subindex;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+    *pfd = fd;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
+        ret = -errno;
+    }
+    g_free(irq_set);
+
+    if (!ret) {
+        return 0;
+    }
+
+    error_setg_errno(errp, -ret, "VFIO_DEVICE_SET_IRQS failure");
+
+    name = index_to_str(vbasedev, index);
+    if (name) {
+        error_prepend(errp, "%s-%d: ", name, subindex);
+    } else {
+        error_prepend(errp, "index %d-%d: ", index, subindex);
+    }
+    error_prepend(errp,
+                  "Failed to %s %s eventfd signaling for interrupt ",
+                  fd < 0 ? "tear down" : "set up", action_to_str(action));
+    return ret;
+}
+
+/*
+ * IO Port/MMIO - Beware of the endians, VFIO is always little endian
+ */
+void vfio_region_write(void *opaque, hwaddr addr,
+                       uint64_t data, unsigned size)
+{
+    VFIORegion *region = opaque;
+    VFIODevice *vbasedev = region->vbasedev;
+    union {
+        uint8_t byte;
+        uint16_t word;
+        uint32_t dword;
+        uint64_t qword;
+    } buf;
+
+    switch (size) {
+    case 1:
+        buf.byte = data;
+        break;
+    case 2:
+        buf.word = cpu_to_le16(data);
+        break;
+    case 4:
+        buf.dword = cpu_to_le32(data);
+        break;
+    case 8:
+        buf.qword = cpu_to_le64(data);
+        break;
+    default:
+        hw_error("vfio: unsupported write size, %u bytes", size);
+        break;
+    }
+
+    if (pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+        error_report("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
+                     ",%d) failed: %m",
+                     __func__, vbasedev->name, region->nr,
+                     addr, data, size);
+    }
+
+    trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
+
+    /*
+     * A read or write to a BAR always signals an INTx EOI.  This will
+     * do nothing if not pending (including not in INTx mode).  We assume
+     * that a BAR access is in response to an interrupt and that BAR
+     * accesses will service the interrupt.  Unfortunately, we don't know
+     * which access will service the interrupt, so we're potentially
+     * getting quite a few host interrupts per guest interrupt.
+     */
+    vbasedev->ops->vfio_eoi(vbasedev);
+}
+
+uint64_t vfio_region_read(void *opaque,
+                          hwaddr addr, unsigned size)
+{
+    VFIORegion *region = opaque;
+    VFIODevice *vbasedev = region->vbasedev;
+    union {
+        uint8_t byte;
+        uint16_t word;
+        uint32_t dword;
+        uint64_t qword;
+    } buf;
+    uint64_t data = 0;
+
+    if (pread(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+        error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %m",
+                     __func__, vbasedev->name, region->nr,
+                     addr, size);
+        return (uint64_t)-1;
+    }
+    switch (size) {
+    case 1:
+        data = buf.byte;
+        break;
+    case 2:
+        data = le16_to_cpu(buf.word);
+        break;
+    case 4:
+        data = le32_to_cpu(buf.dword);
+        break;
+    case 8:
+        data = le64_to_cpu(buf.qword);
+        break;
+    default:
+        hw_error("vfio: unsupported read size, %u bytes", size);
+        break;
+    }
+
+    trace_vfio_region_read(vbasedev->name, region->nr, addr, size, data);
+
+    /* Same as write above */
+    vbasedev->ops->vfio_eoi(vbasedev);
+
+    return data;
+}
+
+const MemoryRegionOps vfio_region_ops = {
+    .read = vfio_region_read,
+    .write = vfio_region_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
+struct vfio_info_cap_header *
+vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+
+    for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+struct vfio_info_cap_header *
+vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
+{
+    if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
+        return NULL;
+    }
+
+    return vfio_get_cap((void *)info, info->cap_offset, id);
+}
+
+struct vfio_info_cap_header *
+vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id)
+{
+    if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) {
+        return NULL;
+    }
+
+    return vfio_get_cap((void *)info, info->cap_offset, id);
+}
+
+static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
+                                          struct vfio_region_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_region_info_cap_sparse_mmap *sparse;
+    int i, j;
+
+    hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
+    if (!hdr) {
+        return -ENODEV;
+    }
+
+    sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header);
+
+    trace_vfio_region_sparse_mmap_header(region->vbasedev->name,
+                                         region->nr, sparse->nr_areas);
+
+    region->mmaps = g_new0(VFIOMmap, sparse->nr_areas);
+
+    for (i = 0, j = 0; i < sparse->nr_areas; i++) {
+        if (sparse->areas[i].size) {
+            trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
+                                            sparse->areas[i].offset +
+                                            sparse->areas[i].size - 1);
+            region->mmaps[j].offset = sparse->areas[i].offset;
+            region->mmaps[j].size = sparse->areas[i].size;
+            j++;
+        }
+    }
+
+    region->nr_mmaps = j;
+    region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap));
+
+    return 0;
+}
+
+int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
+                      int index, const char *name)
+{
+    struct vfio_region_info *info;
+    int ret;
+
+    ret = vfio_get_region_info(vbasedev, index, &info);
+    if (ret) {
+        return ret;
+    }
+
+    region->vbasedev = vbasedev;
+    region->flags = info->flags;
+    region->size = info->size;
+    region->fd_offset = info->offset;
+    region->nr = index;
+
+    if (region->size) {
+        region->mem = g_new0(MemoryRegion, 1);
+        memory_region_init_io(region->mem, obj, &vfio_region_ops,
+                              region, name, region->size);
+
+        if (!vbasedev->no_mmap &&
+            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
+
+            ret = vfio_setup_region_sparse_mmaps(region, info);
+
+            if (ret) {
+                region->nr_mmaps = 1;
+                region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+                region->mmaps[0].offset = 0;
+                region->mmaps[0].size = region->size;
+            }
+        }
+    }
+
+    g_free(info);
+
+    trace_vfio_region_setup(vbasedev->name, index, name,
+                            region->flags, region->fd_offset, region->size);
+    return 0;
+}
+
+static void vfio_subregion_unmap(VFIORegion *region, int index)
+{
+    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
+                            region->mmaps[index].offset,
+                            region->mmaps[index].offset +
+                            region->mmaps[index].size - 1);
+    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
+    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
+    object_unparent(OBJECT(&region->mmaps[index].mem));
+    region->mmaps[index].mmap = NULL;
+}
+
+int vfio_region_mmap(VFIORegion *region)
+{
+    int i, prot = 0;
+    char *name;
+
+    if (!region->mem) {
+        return 0;
+    }
+
+    prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
+    prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
+                                     MAP_SHARED, region->vbasedev->fd,
+                                     region->fd_offset +
+                                     region->mmaps[i].offset);
+        if (region->mmaps[i].mmap == MAP_FAILED) {
+            int ret = -errno;
+
+            trace_vfio_region_mmap_fault(memory_region_name(region->mem), i,
+                                         region->fd_offset +
+                                         region->mmaps[i].offset,
+                                         region->fd_offset +
+                                         region->mmaps[i].offset +
+                                         region->mmaps[i].size - 1, ret);
+
+            region->mmaps[i].mmap = NULL;
+
+            for (i--; i >= 0; i--) {
+                vfio_subregion_unmap(region, i);
+            }
+
+            return ret;
+        }
+
+        name = g_strdup_printf("%s mmaps[%d]",
+                               memory_region_name(region->mem), i);
+        memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
+                                          memory_region_owner(region->mem),
+                                          name, region->mmaps[i].size,
+                                          region->mmaps[i].mmap);
+        g_free(name);
+        memory_region_add_subregion(region->mem, region->mmaps[i].offset,
+                                    &region->mmaps[i].mem);
+
+        trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
+                               region->mmaps[i].offset,
+                               region->mmaps[i].offset +
+                               region->mmaps[i].size - 1);
+    }
+
+    return 0;
+}
+
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            vfio_subregion_unmap(region, i);
+        }
+    }
+}
+
+void vfio_region_exit(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
+        }
+    }
+
+    trace_vfio_region_exit(region->vbasedev->name, region->nr);
+}
+
+void vfio_region_finalize(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            munmap(region->mmaps[i].mmap, region->mmaps[i].size);
+            object_unparent(OBJECT(&region->mmaps[i].mem));
+        }
+    }
+
+    object_unparent(OBJECT(region->mem));
+
+    g_free(region->mem);
+    g_free(region->mmaps);
+
+    trace_vfio_region_finalize(region->vbasedev->name, region->nr);
+
+    region->mem = NULL;
+    region->mmaps = NULL;
+    region->nr_mmaps = 0;
+    region->size = 0;
+    region->flags = 0;
+    region->nr = 0;
+}
+
+void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            memory_region_set_enabled(&region->mmaps[i].mem, enabled);
+        }
+    }
+
+    trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem),
+                                        enabled);
+}
+
+int vfio_get_region_info(VFIODevice *vbasedev, int index,
+                         struct vfio_region_info **info)
+{
+    size_t argsz = sizeof(struct vfio_region_info);
+
+    *info = g_malloc0(argsz);
+
+    (*info)->index = index;
+retry:
+    (*info)->argsz = argsz;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if ((*info)->argsz > argsz) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+
+        goto retry;
+    }
+
+    return 0;
+}
+
+int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
+                             uint32_t subtype, struct vfio_region_info **info)
+{
+    int i;
+
+    for (i = 0; i < vbasedev->num_regions; i++) {
+        struct vfio_info_cap_header *hdr;
+        struct vfio_region_info_cap_type *cap_type;
+
+        if (vfio_get_region_info(vbasedev, i, info)) {
+            continue;
+        }
+
+        hdr = vfio_get_region_info_cap(*info, VFIO_REGION_INFO_CAP_TYPE);
+        if (!hdr) {
+            g_free(*info);
+            continue;
+        }
+
+        cap_type = container_of(hdr, struct vfio_region_info_cap_type, header);
+
+        trace_vfio_get_dev_region(vbasedev->name, i,
+                                  cap_type->type, cap_type->subtype);
+
+        if (cap_type->type == type && cap_type->subtype == subtype) {
+            return 0;
+        }
+
+        g_free(*info);
+    }
+
+    *info = NULL;
+    return -ENODEV;
+}
+
+bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
+{
+    struct vfio_region_info *info = NULL;
+    bool ret = false;
+
+    if (!vfio_get_region_info(vbasedev, region, &info)) {
+        if (vfio_get_region_info_cap(info, cap_type)) {
+            ret = true;
+        }
+        g_free(info);
+    }
+
+    return ret;
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af297a0..3746c9f984 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -1,5 +1,6 @@
 vfio_ss = ss.source_set()
 vfio_ss.add(files(
+  'helpers.c',
   'common.c',
   'spapr.c',
   'migration.c',
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index da43d27352..5e376c436e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -243,6 +243,8 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
                              unsigned int *avail);
 struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
+struct vfio_info_cap_header *
+vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 03/22] vfio/common: Move IOMMU agnostic helpers to a separate file Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-20 11:23   ` Eric Auger
  2023-09-21  8:28   ` Cédric Le Goater
  2023-08-30 10:37 ` [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd Zhenzhong Duan
                   ` (19 subsequent siblings)
  23 siblings, 2 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

From: Eric Auger <eric.auger@redhat.com>

Introduce helper functions that isolate the code used for
VFIO_SPAPR_TCE_v2_IOMMU. That code is IOMMU backend specific, whereas
the rest of the code in the callers, i.e. vfio_listener_region_add|del,
is not.
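
For illustration, the SPAPR specific handling in the listeners then
reduces to plain helper calls, roughly (condensed from the hunks below):

    /* vfio_listener_region_add(), condensed sketch */
    if (vfio_container_add_section_window(container, section, &err)) {
        goto fail;
    }

    /* vfio_listener_region_del(), condensed sketch */
    vfio_container_del_section_window(container, section);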

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c | 156 +++++++++++++++++++++++++++--------------------
 1 file changed, 89 insertions(+), 67 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9ca695837f..67150e4575 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -796,6 +796,92 @@ static bool vfio_get_section_iova_range(VFIOContainer *container,
     return true;
 }
 
+static int vfio_container_add_section_window(VFIOContainer *container,
+                                             MemoryRegionSection *section,
+                                             Error **errp)
+{
+    VFIOHostDMAWindow *hostwin;
+    hwaddr pgsize = 0;
+    int ret;
+
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return 0;
+    }
+
+    /* For now intersections are not allowed, we may relax this later */
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (ranges_overlap(hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1,
+                           section->offset_within_address_space,
+                           int128_get64(section->size))) {
+            error_setg(errp,
+                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
+                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                    int128_get64(section->size) - 1,
+                hostwin->min_iova, hostwin->max_iova);
+            return -EINVAL;
+        }
+    }
+
+    ret = vfio_spapr_create_window(container, section, &pgsize);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
+        return ret;
+    }
+
+    vfio_host_win_add(container, section->offset_within_address_space,
+                      section->offset_within_address_space +
+                      int128_get64(section->size) - 1, pgsize);
+#ifdef CONFIG_KVM
+    if (kvm_enabled()) {
+        VFIOGroup *group;
+        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
+        struct kvm_vfio_spapr_tce param;
+        struct kvm_device_attr attr = {
+            .group = KVM_DEV_VFIO_GROUP,
+            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
+            .addr = (uint64_t)(unsigned long)&param,
+        };
+
+        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
+                                          &param.tablefd)) {
+            QLIST_FOREACH(group, &container->group_list, container_next) {
+                param.groupfd = group->fd;
+                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+                    error_report("vfio: failed to setup fd %d "
+                                 "for a group with fd %d: %s",
+                                 param.tablefd, param.groupfd,
+                                 strerror(errno));
+                    return 0;
+                }
+                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
+            }
+        }
+    }
+#endif
+    return 0;
+}
+
+static void vfio_container_del_section_window(VFIOContainer *container,
+                                              MemoryRegionSection *section)
+{
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return;
+    }
+
+    vfio_spapr_remove_window(container,
+                             section->offset_within_address_space);
+    if (vfio_host_win_del(container,
+                          section->offset_within_address_space,
+                          section->offset_within_address_space +
+                          int128_get64(section->size) - 1) < 0) {
+        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
+                 __func__, section->offset_within_address_space);
+    }
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -822,62 +908,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
         return;
     }
 
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        hwaddr pgsize = 0;
-
-        /* For now intersections are not allowed, we may relax this later */
-        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-            if (ranges_overlap(hostwin->min_iova,
-                               hostwin->max_iova - hostwin->min_iova + 1,
-                               section->offset_within_address_space,
-                               int128_get64(section->size))) {
-                error_setg(&err,
-                    "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
-                    "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
-                    section->offset_within_address_space,
-                    section->offset_within_address_space +
-                        int128_get64(section->size) - 1,
-                    hostwin->min_iova, hostwin->max_iova);
-                goto fail;
-            }
-        }
-
-        ret = vfio_spapr_create_window(container, section, &pgsize);
-        if (ret) {
-            error_setg_errno(&err, -ret, "Failed to create SPAPR window");
-            goto fail;
-        }
-
-        vfio_host_win_add(container, section->offset_within_address_space,
-                          section->offset_within_address_space +
-                          int128_get64(section->size) - 1, pgsize);
-#ifdef CONFIG_KVM
-        if (kvm_enabled()) {
-            VFIOGroup *group;
-            IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
-            struct kvm_vfio_spapr_tce param;
-            struct kvm_device_attr attr = {
-                .group = KVM_DEV_VFIO_GROUP,
-                .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
-                .addr = (uint64_t)(unsigned long)&param,
-            };
-
-            if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
-                                              &param.tablefd)) {
-                QLIST_FOREACH(group, &container->group_list, container_next) {
-                    param.groupfd = group->fd;
-                    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-                        error_report("vfio: failed to setup fd %d "
-                                     "for a group with fd %d: %s",
-                                     param.tablefd, param.groupfd,
-                                     strerror(errno));
-                        return;
-                    }
-                    trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
-                }
-            }
-        }
-#endif
+    if (vfio_container_add_section_window(container, section, &err)) {
+        goto fail;
     }
 
     hostwin = vfio_find_hostwin(container, iova, end);
@@ -1094,17 +1126,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
     memory_region_unref(section->mr);
 
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        vfio_spapr_remove_window(container,
-                                 section->offset_within_address_space);
-        if (vfio_host_win_del(container,
-                              section->offset_within_address_space,
-                              section->offset_within_address_space +
-                              int128_get64(section->size) - 1) < 0) {
-            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
-                     __func__, section->offset_within_address_space);
-        }
-    }
+    vfio_container_del_section_window(container, section);
 }
 
 static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window() Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-20 11:49   ` Eric Auger
  2023-09-20 21:39   ` Alex Williamson
  2023-08-30 10:37 ` [PATCH v1 06/22] vfio/common: Add a vfio device iterator Zhenzhong Duan
                   ` (18 subsequent siblings)
  23 siblings, 2 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

...which will be used by both the legacy and iommufd backends.
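
The legacy group path simply wraps the new helpers, while an iommufd
backend is expected to register its device cdev fd the same way. A rough
sketch (the iommufd caller is illustrative only and not part of this
patch):

    /* legacy backend, as done in this patch */
    static void vfio_kvm_device_add_group(VFIOGroup *group)
    {
        vfio_kvm_device_add_fd(group->fd);
    }

    /* iommufd backend, illustrative: pass the device cdev fd instead */
    vfio_kvm_device_add_fd(vbasedev->fd);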

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c              | 44 +++++++++++++++++++++++------------
 include/hw/vfio/vfio-common.h |  3 +++
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 67150e4575..949ad6714a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1759,17 +1759,17 @@ void vfio_reset_handler(void *opaque)
     }
 }
 
-static void vfio_kvm_device_add_group(VFIOGroup *group)
+int vfio_kvm_device_add_fd(int fd)
 {
 #ifdef CONFIG_KVM
     struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_ADD,
-        .addr = (uint64_t)(unsigned long)&group->fd,
+        .group = KVM_DEV_VFIO_FILE,
+        .attr = KVM_DEV_VFIO_FILE_ADD,
+        .addr = (uint64_t)(unsigned long)&fd,
     };
 
     if (!kvm_enabled()) {
-        return;
+        return 0;
     }
 
     if (vfio_kvm_device_fd < 0) {
@@ -1779,37 +1779,51 @@ static void vfio_kvm_device_add_group(VFIOGroup *group)
 
         if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
             error_report("Failed to create KVM VFIO device: %m");
-            return;
+            return -ENODEV;
         }
 
         vfio_kvm_device_fd = cd.fd;
     }
 
     if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to add group %d to KVM VFIO device: %m",
-                     group->groupid);
+        error_report("Failed to add fd %d to KVM VFIO device: %m",
+                     fd);
+        return -errno;
     }
 #endif
+    return 0;
 }
 
-static void vfio_kvm_device_del_group(VFIOGroup *group)
+static void vfio_kvm_device_add_group(VFIOGroup *group)
+{
+    vfio_kvm_device_add_fd(group->fd);
+}
+
+int vfio_kvm_device_del_fd(int fd)
 {
 #ifdef CONFIG_KVM
     struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_DEL,
-        .addr = (uint64_t)(unsigned long)&group->fd,
+        .group = KVM_DEV_VFIO_FILE,
+        .attr = KVM_DEV_VFIO_FILE_DEL,
+        .addr = (uint64_t)(unsigned long)&fd,
     };
 
     if (vfio_kvm_device_fd < 0) {
-        return;
+        return -EINVAL;
     }
 
     if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to remove group %d from KVM VFIO device: %m",
-                     group->groupid);
+        error_report("Failed to remove fd %d from KVM VFIO device: %m",
+                     fd);
+        return -EBADF;
     }
 #endif
+    return 0;
+}
+
+static void vfio_kvm_device_del_group(VFIOGroup *group)
+{
+    vfio_kvm_device_del_fd(group->fd);
 }
 
 static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5e376c436e..598c3ce079 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -220,6 +220,9 @@ struct vfio_device_info *vfio_get_device_info(int fd);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
 
+int vfio_kvm_device_add_fd(int fd);
+int vfio_kvm_device_del_fd(int fd);
+
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 06/22] vfio/common: Add a vfio device iterator
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-20 12:25   ` Eric Auger
  2023-09-20 22:16   ` Alex Williamson
  2023-08-30 10:37 ` [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic Zhenzhong Duan
                   ` (17 subsequent siblings)
  23 siblings, 2 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

With a vfio device iterator added, we can make several migration and reset
related functions group agnostic, e.g.:
vfio_mig_active
vfio_migratable_device_num
vfio_devices_all_dirty_tracking
vfio_devices_all_device_dirty_tracking
vfio_devices_all_running_and_mig_active
vfio_devices_dma_logging_stop
vfio_devices_dma_logging_start
vfio_devices_query_dirty_bitmap
vfio_reset_handler

Without it, we would need to add container specific callback variants of
the above functions just because they iterate devices per group.

Also move the reset handler registration/unregistration to a place that
is not group specific, i.e. when the first vfio address space is created
instead of when the first group is added.
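
The iteration idiom used throughout the patch is (sketch, matching the
hunks below):

    VFIODevice *vbasedev = NULL;

    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
        /* visit each device of every group in the container */
    }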

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c | 224 ++++++++++++++++++++++++++---------------------
 1 file changed, 122 insertions(+), 102 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 949ad6714a..51c6e7598e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -84,6 +84,26 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
     }
 }
 
+static VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
+                                                VFIODevice *curr)
+{
+    VFIOGroup *group;
+
+    if (!curr) {
+        group = QLIST_FIRST(&container->group_list);
+    } else {
+        if (curr->next.le_next) {
+            return curr->next.le_next;
+        }
+        group = curr->group->container_next.le_next;
+    }
+
+    if (!group) {
+        return NULL;
+    }
+    return QLIST_FIRST(&group->device_list);
+}
+
 /*
  * Device state interfaces
  */
@@ -112,17 +132,22 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 
 bool vfio_mig_active(void)
 {
-    VFIOGroup *group;
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
     VFIODevice *vbasedev;
 
-    if (QLIST_EMPTY(&vfio_group_list)) {
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
         return false;
     }
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->migration_blocker) {
-                return false;
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            vbasedev = NULL;
+            while ((vbasedev = vfio_container_dev_iter_next(container,
+                                                            vbasedev))) {
+                if (vbasedev->migration_blocker) {
+                    return false;
+                }
             }
         }
     }
@@ -133,14 +158,19 @@ static Error *multiple_devices_migration_blocker;
 
 static unsigned int vfio_migratable_device_num(void)
 {
-    VFIOGroup *group;
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
     VFIODevice *vbasedev;
     unsigned int device_num = 0;
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->migration) {
-                device_num++;
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            vbasedev = NULL;
+            while ((vbasedev = vfio_container_dev_iter_next(container,
+                                                            vbasedev))) {
+                if (vbasedev->migration) {
+                    device_num++;
+                }
             }
         }
     }
@@ -207,8 +237,7 @@ static void vfio_set_migration_error(int err)
 
 static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev;
+    VFIODevice *vbasedev = NULL;
     MigrationState *ms = migrate_get_current();
 
     if (ms->state != MIGRATION_STATUS_ACTIVE &&
@@ -216,19 +245,17 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
         return false;
     }
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            VFIOMigration *migration = vbasedev->migration;
+    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
+        VFIOMigration *migration = vbasedev->migration;
 
-            if (!migration) {
-                return false;
-            }
+        if (!migration) {
+            return false;
+        }
 
-            if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
-                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
-                return false;
-            }
+        if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
+            (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+             migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
+            return false;
         }
     }
     return true;
@@ -236,14 +263,11 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 
 static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev;
+    VFIODevice *vbasedev = NULL;
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (!vbasedev->dirty_pages_supported) {
-                return false;
-            }
+    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
+        if (!vbasedev->dirty_pages_supported) {
+            return false;
         }
     }
 
@@ -256,27 +280,24 @@ static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
  */
 static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev;
+    VFIODevice *vbasedev = NULL;
 
     if (!migration_is_active(migrate_get_current())) {
         return false;
     }
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            VFIOMigration *migration = vbasedev->migration;
+    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
+        VFIOMigration *migration = vbasedev->migration;
 
-            if (!migration) {
-                return false;
-            }
+        if (!migration) {
+            return false;
+        }
 
-            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
-                continue;
-            } else {
-                return false;
-            }
+        if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+            migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+            continue;
+        } else {
+            return false;
         }
     }
     return true;
@@ -1243,25 +1264,22 @@ static void vfio_devices_dma_logging_stop(VFIOContainer *container)
     uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
                               sizeof(uint64_t))] = {};
     struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
-    VFIODevice *vbasedev;
-    VFIOGroup *group;
+    VFIODevice *vbasedev = NULL;
 
     feature->argsz = sizeof(buf);
     feature->flags = VFIO_DEVICE_FEATURE_SET |
                      VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (!vbasedev->dirty_tracking) {
-                continue;
-            }
+    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
+        if (!vbasedev->dirty_tracking) {
+            continue;
+        }
 
-            if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
-                warn_report("%s: Failed to stop DMA logging, err %d (%s)",
-                             vbasedev->name, -errno, strerror(errno));
-            }
-            vbasedev->dirty_tracking = false;
+        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+            warn_report("%s: Failed to stop DMA logging, err %d (%s)",
+                        vbasedev->name, -errno, strerror(errno));
         }
+        vbasedev->dirty_tracking = false;
     }
 }
 
@@ -1336,8 +1354,7 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
 {
     struct vfio_device_feature *feature;
     VFIODirtyRanges ranges;
-    VFIODevice *vbasedev;
-    VFIOGroup *group;
+    VFIODevice *vbasedev = NULL;
     int ret = 0;
 
     vfio_dirty_tracking_init(container, &ranges);
@@ -1347,21 +1364,19 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
         return -errno;
     }
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->dirty_tracking) {
-                continue;
-            }
+    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
+        if (vbasedev->dirty_tracking) {
+            continue;
+        }
 
-            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
-            if (ret) {
-                ret = -errno;
-                error_report("%s: Failed to start DMA logging, err %d (%s)",
-                             vbasedev->name, ret, strerror(errno));
-                goto out;
-            }
-            vbasedev->dirty_tracking = true;
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
+        if (ret) {
+            ret = -errno;
+            error_report("%s: Failed to start DMA logging, err %d (%s)",
+                         vbasedev->name, ret, strerror(errno));
+            goto out;
         }
+        vbasedev->dirty_tracking = true;
     }
 
 out:
@@ -1440,22 +1455,19 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
                                            VFIOBitmap *vbmap, hwaddr iova,
                                            hwaddr size)
 {
-    VFIODevice *vbasedev;
-    VFIOGroup *group;
+    VFIODevice *vbasedev = NULL;
     int ret;
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            ret = vfio_device_dma_logging_report(vbasedev, iova, size,
-                                                 vbmap->bitmap);
-            if (ret) {
-                error_report("%s: Failed to get DMA logging report, iova: "
-                             "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
-                             ", err: %d (%s)",
-                             vbasedev->name, iova, size, ret, strerror(-ret));
+    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
+        ret = vfio_device_dma_logging_report(vbasedev, iova, size,
+                                             vbmap->bitmap);
+        if (ret) {
+            error_report("%s: Failed to get DMA logging report, iova: "
+                         "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
+                         ", err: %d (%s)",
+                         vbasedev->name, iova, size, ret, strerror(-ret));
 
-                return ret;
-            }
+            return ret;
         }
     }
 
@@ -1739,21 +1751,30 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
 
 void vfio_reset_handler(void *opaque)
 {
-    VFIOGroup *group;
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
     VFIODevice *vbasedev;
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->dev->realized) {
-                vbasedev->ops->vfio_compute_needs_reset(vbasedev);
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            vbasedev = NULL;
+            while ((vbasedev = vfio_container_dev_iter_next(container,
+                                                            vbasedev))) {
+                if (vbasedev->dev->realized) {
+                    vbasedev->ops->vfio_compute_needs_reset(vbasedev);
+                }
             }
         }
     }
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->dev->realized && vbasedev->needs_reset) {
-                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            vbasedev = NULL;
+            while ((vbasedev = vfio_container_dev_iter_next(container,
+                                                            vbasedev))) {
+                if (vbasedev->dev->realized && vbasedev->needs_reset) {
+                    vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+                }
             }
         }
     }
@@ -1841,6 +1862,10 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     space->as = as;
     QLIST_INIT(&space->containers);
 
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
+        qemu_register_reset(vfio_reset_handler, NULL);
+    }
+
     QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
 
     return space;
@@ -1852,6 +1877,9 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
         QLIST_REMOVE(space, list);
         g_free(space);
     }
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
+        qemu_unregister_reset(vfio_reset_handler, NULL);
+    }
 }
 
 /*
@@ -2317,10 +2345,6 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
         goto close_fd_exit;
     }
 
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_register_reset(vfio_reset_handler, NULL);
-    }
-
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
     return group;
@@ -2349,10 +2373,6 @@ void vfio_put_group(VFIOGroup *group)
     trace_vfio_put_group(group->fd);
     close(group->fd);
     g_free(group);
-
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_unregister_reset(vfio_reset_handler, NULL);
-    }
 }
 
 struct vfio_device_info *vfio_get_device_info(int fd)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (5 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 06/22] vfio/common: Add a vfio device iterator Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-20 13:00   ` Eric Auger
  2023-09-20 22:51   ` Alex Williamson
  2023-08-30 10:37 ` [PATCH v1 08/22] vfio/common: Move legacy VFIO backend code into separate container.c Zhenzhong Duan
                   ` (16 subsequent siblings)
  23 siblings, 2 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

Rework vfio_viommu_preset() to look the device up through the VFIO
address space and container lists instead of dereferencing
vbasedev->group, so that it doesn't need to be moved into container.c
by the following patch.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 51c6e7598e..fda5fc87b9 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -219,7 +219,22 @@ void vfio_unblock_multiple_devices_migration(void)
 
 bool vfio_viommu_preset(VFIODevice *vbasedev)
 {
-    return vbasedev->group->container->space->as != &address_space_memory;
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+    VFIODevice *tmp_dev;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            tmp_dev = NULL;
+            while ((tmp_dev = vfio_container_dev_iter_next(container,
+                                                           tmp_dev))) {
+                if (vbasedev == tmp_dev) {
+                    return space->as != &address_space_memory;
+                }
+            }
+        }
+    }
+    g_assert_not_reached();
 }
 
 static void vfio_set_migration_error(int err)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 08/22] vfio/common: Move legacy VFIO backend code into separate container.c
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (6 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-20 13:12   ` Eric Auger
  2023-08-30 10:37 ` [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device Zhenzhong Duan
                   ` (15 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

Move all the code that really depends on the legacy VFIO container/group
into a separate file: container.c. What remains in common.c is the code
related to VFIOAddressSpace, MemoryListeners, migration and all other
general operations.

Also move the struct VFIOBitmap declaration to vfio-common.h so that
container.c can use it.
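
Functions still needed by both files become non-static and are declared
in vfio-common.h; a rough sketch of the shared interface (not the exact
list, see the header diff):

    /* shared between common.c and container.c (sketch) */
    typedef struct {
        unsigned long *bitmap;
        hwaddr size;
        hwaddr pages;
    } VFIOBitmap;

    int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size);
    void vfio_host_win_add(VFIOContainer *container, hwaddr min_iova,
                           hwaddr max_iova, uint64_t iova_pgsizes);
    bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container);
    bool vfio_devices_all_running_and_mig_active(VFIOContainer *container);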

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c              | 1085 +--------------------------------
 hw/vfio/container.c           | 1085 +++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |    1 +
 include/hw/vfio/vfio-common.h |   45 ++
 4 files changed, 1147 insertions(+), 1069 deletions(-)
 create mode 100644 hw/vfio/container.c

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fda5fc87b9..044710fc1f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -45,8 +45,6 @@
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
 
-VFIOGroupList vfio_group_list =
-    QLIST_HEAD_INITIALIZER(vfio_group_list);
 static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
@@ -58,63 +56,14 @@ static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
  * initialized, this file descriptor is only released on QEMU exit and
  * we'll re-use it should another vfio device be attached before then.
  */
-static int vfio_kvm_device_fd = -1;
+int vfio_kvm_device_fd = -1;
 #endif
 
-static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
-{
-    switch (container->iommu_type) {
-    case VFIO_TYPE1v2_IOMMU:
-    case VFIO_TYPE1_IOMMU:
-        /*
-         * We support coordinated discarding of RAM via the RamDiscardManager.
-         */
-        return ram_block_uncoordinated_discard_disable(state);
-    default:
-        /*
-         * VFIO_SPAPR_TCE_IOMMU most probably works just fine with
-         * RamDiscardManager, however, it is completely untested.
-         *
-         * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does
-         * completely the opposite of managing mapping/pinning dynamically as
-         * required by RamDiscardManager. We would have to special-case sections
-         * with a RamDiscardManager.
-         */
-        return ram_block_discard_disable(state);
-    }
-}
-
-static VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
-                                                VFIODevice *curr)
-{
-    VFIOGroup *group;
-
-    if (!curr) {
-        group = QLIST_FIRST(&container->group_list);
-    } else {
-        if (curr->next.le_next) {
-            return curr->next.le_next;
-        }
-        group = curr->group->container_next.le_next;
-    }
-
-    if (!group) {
-        return NULL;
-    }
-    return QLIST_FIRST(&group->device_list);
-}
-
 /*
  * Device state interfaces
  */
 
-typedef struct {
-    unsigned long *bitmap;
-    hwaddr size;
-    hwaddr pages;
-} VFIOBitmap;
-
-static int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size)
+int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size)
 {
     vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
     vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
@@ -127,9 +76,6 @@ static int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size)
     return 0;
 }
 
-static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                                 uint64_t size, ram_addr_t ram_addr);
-
 bool vfio_mig_active(void)
 {
     VFIOAddressSpace *space;
@@ -276,7 +222,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
     return true;
 }
 
-static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
+bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
 {
     VFIODevice *vbasedev = NULL;
 
@@ -293,7 +239,7 @@ static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
  * Check if all VFIO devices are running and migration is active, which is
  * essentially equivalent to the migration being in pre-copy phase.
  */
-static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
+bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 {
     VFIODevice *vbasedev = NULL;
 
@@ -318,150 +264,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
     return true;
 }
 
-static int vfio_dma_unmap_bitmap(VFIOContainer *container,
-                                 hwaddr iova, ram_addr_t size,
-                                 IOMMUTLBEntry *iotlb)
-{
-    struct vfio_iommu_type1_dma_unmap *unmap;
-    struct vfio_bitmap *bitmap;
-    VFIOBitmap vbmap;
-    int ret;
-
-    ret = vfio_bitmap_alloc(&vbmap, size);
-    if (ret) {
-        return ret;
-    }
-
-    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
-
-    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
-    unmap->iova = iova;
-    unmap->size = size;
-    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
-    bitmap = (struct vfio_bitmap *)&unmap->data;
-
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
-     * to qemu_real_host_page_size.
-     */
-    bitmap->pgsize = qemu_real_host_page_size();
-    bitmap->size = vbmap.size;
-    bitmap->data = (__u64 *)vbmap.bitmap;
-
-    if (vbmap.size > container->max_dirty_bitmap_size) {
-        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap.size);
-        ret = -E2BIG;
-        goto unmap_exit;
-    }
-
-    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
-    if (!ret) {
-        cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap,
-                iotlb->translated_addr, vbmap.pages);
-    } else {
-        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
-    }
-
-unmap_exit:
-    g_free(unmap);
-    g_free(vbmap.bitmap);
-
-    return ret;
-}
-
-/*
- * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
- */
-static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size,
-                          IOMMUTLBEntry *iotlb)
-{
-    struct vfio_iommu_type1_dma_unmap unmap = {
-        .argsz = sizeof(unmap),
-        .flags = 0,
-        .iova = iova,
-        .size = size,
-    };
-    bool need_dirty_sync = false;
-    int ret;
-
-    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
-        if (!vfio_devices_all_device_dirty_tracking(container) &&
-            container->dirty_pages_supported) {
-            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
-        }
-
-        need_dirty_sync = true;
-    }
-
-    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
-        /*
-         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
-         * v4.15) where an overflow in its wrap-around check prevents us from
-         * unmapping the last page of the address space.  Test for the error
-         * condition and re-try the unmap excluding the last page.  The
-         * expectation is that we've never mapped the last page anyway and this
-         * unmap request comes via vIOMMU support which also makes it unlikely
-         * that this page is used.  This bug was introduced well after type1 v2
-         * support was introduced, so we shouldn't need to test for v1.  A fix
-         * is queued for kernel v5.0 so this workaround can be removed once
-         * affected kernels are sufficiently deprecated.
-         */
-        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
-            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
-            trace_vfio_dma_unmap_overflow_workaround();
-            unmap.size -= 1ULL << ctz64(container->pgsizes);
-            continue;
-        }
-        error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
-        return -errno;
-    }
-
-    if (need_dirty_sync) {
-        ret = vfio_get_dirty_bitmap(container, iova, size,
-                                    iotlb->translated_addr);
-        if (ret) {
-            return ret;
-        }
-    }
-
-    return 0;
-}
-
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                        ram_addr_t size, void *vaddr, bool readonly)
-{
-    struct vfio_iommu_type1_dma_map map = {
-        .argsz = sizeof(map),
-        .flags = VFIO_DMA_MAP_FLAG_READ,
-        .vaddr = (__u64)(uintptr_t)vaddr,
-        .iova = iova,
-        .size = size,
-    };
-
-    if (!readonly) {
-        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
-    }
-
-    /*
-     * Try the mapping; if it fails with EBUSY, unmap the region and try
-     * again.  This shouldn't be necessary, but we sometimes see it in
-     * the VGA ROM space.
-     */
-    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
-         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
-        return 0;
-    }
-
-    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
-    return -errno;
-}
-
-static void vfio_host_win_add(VFIOContainer *container,
-                              hwaddr min_iova, hwaddr max_iova,
-                              uint64_t iova_pgsizes)
+void vfio_host_win_add(VFIOContainer *container, hwaddr min_iova,
+                       hwaddr max_iova, uint64_t iova_pgsizes)
 {
     VFIOHostDMAWindow *hostwin;
 
@@ -482,8 +286,8 @@ static void vfio_host_win_add(VFIOContainer *container,
     QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
 }
 
-static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
-                             hwaddr max_iova)
+int vfio_host_win_del(VFIOContainer *container,
+                      hwaddr min_iova, hwaddr max_iova)
 {
     VFIOHostDMAWindow *hostwin;
 
@@ -832,92 +636,6 @@ static bool vfio_get_section_iova_range(VFIOContainer *container,
     return true;
 }
 
-static int vfio_container_add_section_window(VFIOContainer *container,
-                                             MemoryRegionSection *section,
-                                             Error **errp)
-{
-    VFIOHostDMAWindow *hostwin;
-    hwaddr pgsize = 0;
-    int ret;
-
-    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
-        return 0;
-    }
-
-    /* For now intersections are not allowed, we may relax this later */
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-        if (ranges_overlap(hostwin->min_iova,
-                           hostwin->max_iova - hostwin->min_iova + 1,
-                           section->offset_within_address_space,
-                           int128_get64(section->size))) {
-            error_setg(errp,
-                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
-                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
-                section->offset_within_address_space,
-                section->offset_within_address_space +
-                    int128_get64(section->size) - 1,
-                hostwin->min_iova, hostwin->max_iova);
-            return -EINVAL;
-        }
-    }
-
-    ret = vfio_spapr_create_window(container, section, &pgsize);
-    if (ret) {
-        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
-        return ret;
-    }
-
-    vfio_host_win_add(container, section->offset_within_address_space,
-                      section->offset_within_address_space +
-                      int128_get64(section->size) - 1, pgsize);
-#ifdef CONFIG_KVM
-    if (kvm_enabled()) {
-        VFIOGroup *group;
-        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
-        struct kvm_vfio_spapr_tce param;
-        struct kvm_device_attr attr = {
-            .group = KVM_DEV_VFIO_GROUP,
-            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
-            .addr = (uint64_t)(unsigned long)&param,
-        };
-
-        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
-                                          &param.tablefd)) {
-            QLIST_FOREACH(group, &container->group_list, container_next) {
-                param.groupfd = group->fd;
-                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-                    error_report("vfio: failed to setup fd %d "
-                                 "for a group with fd %d: %s",
-                                 param.tablefd, param.groupfd,
-                                 strerror(errno));
-                    return 0;
-                }
-                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
-            }
-        }
-    }
-#endif
-    return 0;
-}
-
-static void vfio_container_del_section_window(VFIOContainer *container,
-                                              MemoryRegionSection *section)
-{
-    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
-        return;
-    }
-
-    vfio_spapr_remove_window(container,
-                             section->offset_within_address_space);
-    if (vfio_host_win_del(container,
-                          section->offset_within_address_space,
-                          section->offset_within_address_space +
-                          int128_get64(section->size) - 1) < 0) {
-        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
-                 __func__, section->offset_within_address_space);
-    }
-}
-
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -1165,33 +883,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
     vfio_container_del_section_window(container, section);
 }
 
-static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
-{
-    int ret;
-    struct vfio_iommu_type1_dirty_bitmap dirty = {
-        .argsz = sizeof(dirty),
-    };
-
-    if (!container->dirty_pages_supported) {
-        return 0;
-    }
-
-    if (start) {
-        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
-    } else {
-        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
-    }
-
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
-    if (ret) {
-        ret = -errno;
-        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
-                     dirty.flags, errno);
-    }
-
-    return ret;
-}
-
 typedef struct VFIODirtyRanges {
     hwaddr min32;
     hwaddr max32;
@@ -1466,9 +1157,9 @@ static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
     return 0;
 }
 
-static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
-                                           VFIOBitmap *vbmap, hwaddr iova,
-                                           hwaddr size)
+int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
+                                    VFIOBitmap *vbmap, hwaddr iova,
+                                    hwaddr size)
 {
     VFIODevice *vbasedev = NULL;
     int ret;
@@ -1489,45 +1180,8 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
     return 0;
 }
 
-static int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
-                                   hwaddr iova, hwaddr size)
-{
-    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
-    struct vfio_iommu_type1_dirty_bitmap_get *range;
-    int ret;
-
-    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
-
-    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
-    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
-    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
-    range->iova = iova;
-    range->size = size;
-
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize
-     * to qemu_real_host_page_size.
-     */
-    range->bitmap.pgsize = qemu_real_host_page_size();
-    range->bitmap.size = vbmap->size;
-    range->bitmap.data = (__u64 *)vbmap->bitmap;
-
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
-    if (ret) {
-        ret = -errno;
-        error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
-                " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
-                (uint64_t)range->size, errno);
-    }
-
-    g_free(dbitmap);
-
-    return ret;
-}
-
-static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                                 uint64_t size, ram_addr_t ram_addr)
+int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                          uint64_t size, ram_addr_t ram_addr)
 {
     bool all_device_dirty_tracking =
         vfio_devices_all_device_dirty_tracking(container);
@@ -1716,7 +1370,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
     }
 }
 
-static const MemoryListener vfio_memory_listener = {
+const MemoryListener vfio_memory_listener = {
     .name = "vfio",
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
@@ -1725,45 +1379,6 @@ static const MemoryListener vfio_memory_listener = {
     .log_sync = vfio_listener_log_sync,
 };
 
-static void vfio_listener_release(VFIOContainer *container)
-{
-    memory_listener_unregister(&container->listener);
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        memory_listener_unregister(&container->prereg_listener);
-    }
-}
-
-static struct vfio_info_cap_header *
-vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
-bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
-                             unsigned int *avail)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_iommu_type1_info_dma_avail *cap;
-
-    /* If the capability cannot be found, assume no DMA limiting */
-    hdr = vfio_get_iommu_type1_info_cap(info,
-                                        VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL);
-    if (hdr == NULL) {
-        return false;
-    }
-
-    if (avail != NULL) {
-        cap = (void *) hdr;
-        *avail = cap->avail;
-    }
-
-    return true;
-}
-
 void vfio_reset_handler(void *opaque)
 {
     VFIOAddressSpace *space;
@@ -1830,11 +1445,6 @@ int vfio_kvm_device_add_fd(int fd)
     return 0;
 }
 
-static void vfio_kvm_device_add_group(VFIOGroup *group)
-{
-    vfio_kvm_device_add_fd(group->fd);
-}
-
 int vfio_kvm_device_del_fd(int fd)
 {
 #ifdef CONFIG_KVM
@@ -1857,12 +1467,7 @@ int vfio_kvm_device_del_fd(int fd)
     return 0;
 }
 
-static void vfio_kvm_device_del_group(VFIOGroup *group)
-{
-    vfio_kvm_device_del_fd(group->fd);
-}
-
-static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
+VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
 {
     VFIOAddressSpace *space;
 
@@ -1886,7 +1491,7 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     return space;
 }
 
-static void vfio_put_address_space(VFIOAddressSpace *space)
+void vfio_put_address_space(VFIOAddressSpace *space)
 {
     if (QLIST_EMPTY(&space->containers)) {
         QLIST_REMOVE(space, list);
@@ -1897,499 +1502,6 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
     }
 }
 
-/*
- * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
- */
-static int vfio_get_iommu_type(VFIOContainer *container,
-                               Error **errp)
-{
-    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
-                          VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
-    int i;
-
-    for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
-        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
-            return iommu_types[i];
-        }
-    }
-    error_setg(errp, "No available IOMMU models");
-    return -EINVAL;
-}
-
-static int vfio_init_container(VFIOContainer *container, int group_fd,
-                               Error **errp)
-{
-    int iommu_type, ret;
-
-    iommu_type = vfio_get_iommu_type(container, errp);
-    if (iommu_type < 0) {
-        return iommu_type;
-    }
-
-    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
-    if (ret) {
-        error_setg_errno(errp, errno, "Failed to set group container");
-        return -errno;
-    }
-
-    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
-        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-            /*
-             * On sPAPR, although the IOMMU subdriver always advertises v1 and
-             * v2, the running platform may not support v2 and there is no
-             * way to guess it until an IOMMU group gets added to the container.
-             * So in case it fails with v2, try v1 as a fallback.
-             */
-            iommu_type = VFIO_SPAPR_TCE_IOMMU;
-            continue;
-        }
-        error_setg_errno(errp, errno, "Failed to set iommu for container");
-        return -errno;
-    }
-
-    container->iommu_type = iommu_type;
-    return 0;
-}
-
-static int vfio_get_iommu_info(VFIOContainer *container,
-                               struct vfio_iommu_type1_info **info)
-{
-
-    size_t argsz = sizeof(struct vfio_iommu_type1_info);
-
-    *info = g_new0(struct vfio_iommu_type1_info, 1);
-again:
-    (*info)->argsz = argsz;
-
-    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
-        g_free(*info);
-        *info = NULL;
-        return -errno;
-    }
-
-    if (((*info)->argsz > argsz)) {
-        argsz = (*info)->argsz;
-        *info = g_realloc(*info, argsz);
-        goto again;
-    }
-
-    return 0;
-}
-
-static struct vfio_info_cap_header *
-vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
-{
-    struct vfio_info_cap_header *hdr;
-    void *ptr = info;
-
-    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
-        return NULL;
-    }
-
-    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
-        if (hdr->id == id) {
-            return hdr;
-        }
-    }
-
-    return NULL;
-}
-
-static void vfio_get_iommu_info_migration(VFIOContainer *container,
-                                         struct vfio_iommu_type1_info *info)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_iommu_type1_info_cap_migration *cap_mig;
-
-    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
-    if (!hdr) {
-        return;
-    }
-
-    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
-                            header);
-
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty.
-     */
-    if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
-        container->dirty_pages_supported = true;
-        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
-        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
-    }
-}
-
-static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
-                                  Error **errp)
-{
-    VFIOContainer *container;
-    int ret, fd;
-    VFIOAddressSpace *space;
-
-    space = vfio_get_address_space(as);
-
-    /*
-     * VFIO is currently incompatible with discarding of RAM insofar as the
-     * madvise to purge (zap) the page from QEMU's address space does not
-     * interact with the memory API and therefore leaves stale virtual to
-     * physical mappings in the IOMMU if the page was previously pinned.  We
-     * therefore set discarding broken for each group added to a container,
-     * whether the container is used individually or shared.  This provides
-     * us with options to allow devices within a group to opt-in and allow
-     * discarding, so long as it is done consistently for a group (for instance
-     * if the device is an mdev device where it is known that the host vendor
-     * driver will never pin pages outside of the working set of the guest
-     * driver, which would thus not be discarding candidates).
-     *
-     * The first opportunity to induce pinning occurs here where we attempt to
-     * attach the group to existing containers within the AddressSpace.  If any
-     * pages are already zapped from the virtual address space, such as from
-     * previous discards, new pinning will cause valid mappings to be
-     * re-established.  Likewise, when the overall MemoryListener for a new
-     * container is registered, a replay of mappings within the AddressSpace
-     * will occur, re-establishing any previously zapped pages as well.
-     *
-     * In particular, virtio-balloon is currently only prevented from
-     * discarding new memory; it does not yet set
-     * ram_block_discard_set_required() and therefore neither stops us here
-     * nor deals with the sudden memory consumption of inflated memory.
-     *
-     * We do support discarding of memory coordinated via the RamDiscardManager
-     * with some IOMMU types. vfio_ram_block_discard_disable() handles the
-     * details once we know which type of IOMMU we are using.
-     */
-
-    QLIST_FOREACH(container, &space->containers, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            ret = vfio_ram_block_discard_disable(container, true);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "Cannot set discarding of RAM broken");
-                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
-                          &container->fd)) {
-                    error_report("vfio: error disconnecting group %d from"
-                                 " container", group->groupid);
-                }
-                return ret;
-            }
-            group->container = container;
-            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-            vfio_kvm_device_add_group(group);
-            return 0;
-        }
-    }
-
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
-    if (fd < 0) {
-        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
-        ret = -errno;
-        goto put_space_exit;
-    }
-
-    ret = ioctl(fd, VFIO_GET_API_VERSION);
-    if (ret != VFIO_API_VERSION) {
-        error_setg(errp, "supported vfio version: %d, "
-                   "reported version: %d", VFIO_API_VERSION, ret);
-        ret = -EINVAL;
-        goto close_fd_exit;
-    }
-
-    container = g_malloc0(sizeof(*container));
-    container->space = space;
-    container->fd = fd;
-    container->error = NULL;
-    container->dirty_pages_supported = false;
-    container->dma_max_mappings = 0;
-    QLIST_INIT(&container->giommu_list);
-    QLIST_INIT(&container->hostwin_list);
-    QLIST_INIT(&container->vrdl_list);
-
-    ret = vfio_init_container(container, group->fd, errp);
-    if (ret) {
-        goto free_container_exit;
-    }
-
-    ret = vfio_ram_block_discard_disable(container, true);
-    if (ret) {
-        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
-        goto free_container_exit;
-    }
-
-    switch (container->iommu_type) {
-    case VFIO_TYPE1v2_IOMMU:
-    case VFIO_TYPE1_IOMMU:
-    {
-        struct vfio_iommu_type1_info *info;
-
-        ret = vfio_get_iommu_info(container, &info);
-        if (ret) {
-            error_setg_errno(errp, -ret, "Failed to get VFIO IOMMU info");
-            goto enable_discards_exit;
-        }
-
-        if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
-            container->pgsizes = info->iova_pgsizes;
-        } else {
-            container->pgsizes = qemu_real_host_page_size();
-        }
-
-        if (!vfio_get_info_dma_avail(info, &container->dma_max_mappings)) {
-            container->dma_max_mappings = 65535;
-        }
-        vfio_get_iommu_info_migration(container, info);
-        g_free(info);
-
-        /*
-         * FIXME: We should parse VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
-         * information to get the actual window extent rather than assume
-         * a 64-bit IOVA address space.
-         */
-        vfio_host_win_add(container, 0, (hwaddr)-1, container->pgsizes);
-
-        break;
-    }
-    case VFIO_SPAPR_TCE_v2_IOMMU:
-    case VFIO_SPAPR_TCE_IOMMU:
-    {
-        struct vfio_iommu_spapr_tce_info info;
-        bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU;
-
-        /*
-         * The host kernel code implementing VFIO_IOMMU_DISABLE is called
-         * when the container fd is closed, so we do not call it explicitly
-         * in this file.
-         */
-        if (!v2) {
-            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-            if (ret) {
-                error_setg_errno(errp, errno, "failed to enable container");
-                ret = -errno;
-                goto enable_discards_exit;
-            }
-        } else {
-            container->prereg_listener = vfio_prereg_listener;
-
-            memory_listener_register(&container->prereg_listener,
-                                     &address_space_memory);
-            if (container->error) {
-                memory_listener_unregister(&container->prereg_listener);
-                ret = -1;
-                error_propagate_prepend(errp, container->error,
-                    "RAM memory listener initialization failed: ");
-                goto enable_discards_exit;
-            }
-        }
-
-        info.argsz = sizeof(info);
-        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
-        if (ret) {
-            error_setg_errno(errp, errno,
-                             "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
-            ret = -errno;
-            if (v2) {
-                memory_listener_unregister(&container->prereg_listener);
-            }
-            goto enable_discards_exit;
-        }
-
-        if (v2) {
-            container->pgsizes = info.ddw.pgsizes;
-            /*
-             * A just-created container comes with a default window.
-             * To keep region_add/del simple, remove this window now and
-             * let the iommu_listener callbacks create/remove windows as
-             * needed.
-             */
-            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "failed to remove existing window");
-                goto enable_discards_exit;
-            }
-        } else {
-            /* The default table uses 4K pages */
-            container->pgsizes = 0x1000;
-            vfio_host_win_add(container, info.dma32_window_start,
-                              info.dma32_window_start +
-                              info.dma32_window_size - 1,
-                              0x1000);
-        }
-    }
-    }
-
-    vfio_kvm_device_add_group(group);
-
-    QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&space->containers, container, next);
-
-    group->container = container;
-    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-
-    container->listener = vfio_memory_listener;
-
-    memory_listener_register(&container->listener, container->space->as);
-
-    if (container->error) {
-        ret = -1;
-        error_propagate_prepend(errp, container->error,
-            "memory listener initialization failed: ");
-        goto listener_release_exit;
-    }
-
-    container->initialized = true;
-
-    return 0;
-listener_release_exit:
-    QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(container, next);
-    vfio_kvm_device_del_group(group);
-    vfio_listener_release(container);
-
-enable_discards_exit:
-    vfio_ram_block_discard_disable(container, false);
-
-free_container_exit:
-    g_free(container);
-
-close_fd_exit:
-    close(fd);
-
-put_space_exit:
-    vfio_put_address_space(space);
-
-    return ret;
-}
-
-static void vfio_disconnect_container(VFIOGroup *group)
-{
-    VFIOContainer *container = group->container;
-
-    QLIST_REMOVE(group, container_next);
-    group->container = NULL;
-
-    /*
-     * Explicitly release the listener first before unset container,
-     * since unset may destroy the backend container if it's the last
-     * group.
-     */
-    if (QLIST_EMPTY(&container->group_list)) {
-        vfio_listener_release(container);
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
-        error_report("vfio: error disconnecting group %d from container",
-                     group->groupid);
-    }
-
-    if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = container->space;
-        VFIOGuestIOMMU *giommu, *tmp;
-        VFIOHostDMAWindow *hostwin, *next;
-
-        QLIST_REMOVE(container, next);
-
-        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(
-                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
-            QLIST_REMOVE(giommu, giommu_next);
-            g_free(giommu);
-        }
-
-        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
-                           next) {
-            QLIST_REMOVE(hostwin, hostwin_next);
-            g_free(hostwin);
-        }
-
-        trace_vfio_disconnect_container(container->fd);
-        close(container->fd);
-        g_free(container);
-
-        vfio_put_address_space(space);
-    }
-}
-
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
-{
-    VFIOGroup *group;
-    char path[32];
-    struct vfio_group_status status = { .argsz = sizeof(status) };
-
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        if (group->groupid == groupid) {
-            /* Found it.  Now is it already in the right context? */
-            if (group->container->space->as == as) {
-                return group;
-            } else {
-                error_setg(errp, "group %d used in multiple address spaces",
-                           group->groupid);
-                return NULL;
-            }
-        }
-    }
-
-    group = g_malloc0(sizeof(*group));
-
-    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
-    if (group->fd < 0) {
-        error_setg_errno(errp, errno, "failed to open %s", path);
-        goto free_group_exit;
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
-        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
-        goto close_fd_exit;
-    }
-
-    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
-        error_setg(errp, "group %d is not viable", groupid);
-        error_append_hint(errp,
-                          "Please ensure all devices within the iommu_group "
-                          "are bound to their vfio bus driver.\n");
-        goto close_fd_exit;
-    }
-
-    group->groupid = groupid;
-    QLIST_INIT(&group->device_list);
-
-    if (vfio_connect_container(group, as, errp)) {
-        error_prepend(errp, "failed to setup container for group %d: ",
-                      groupid);
-        goto close_fd_exit;
-    }
-
-    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
-
-    return group;
-
-close_fd_exit:
-    close(group->fd);
-
-free_group_exit:
-    g_free(group);
-
-    return NULL;
-}
-
-void vfio_put_group(VFIOGroup *group)
-{
-    if (!group || !QLIST_EMPTY(&group->device_list)) {
-        return;
-    }
-
-    if (!group->ram_block_discard_allowed) {
-        vfio_ram_block_discard_disable(group->container, false);
-    }
-    vfio_kvm_device_del_group(group);
-    vfio_disconnect_container(group);
-    QLIST_REMOVE(group, next);
-    trace_vfio_put_group(group->fd);
-    close(group->fd);
-    g_free(group);
-}
-
 struct vfio_device_info *vfio_get_device_info(int fd)
 {
     struct vfio_device_info *info;
@@ -2413,168 +1525,3 @@ retry:
 
     return info;
 }
-
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp)
-{
-    g_autofree struct vfio_device_info *info = NULL;
-    int fd;
-
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
-    if (fd < 0) {
-        error_setg_errno(errp, errno, "error getting device from group %d",
-                         group->groupid);
-        error_append_hint(errp,
-                      "Verify all devices in group %d are bound to vfio-<bus> "
-                      "or pci-stub and not already in use\n", group->groupid);
-        return fd;
-    }
-
-    info = vfio_get_device_info(fd);
-    if (!info) {
-        error_setg_errno(errp, errno, "error getting device info");
-        close(fd);
-        return -1;
-    }
-
-    /*
-     * Set discarding of RAM as not broken for this group if the driver knows
-     * the device operates compatibly with discarding.  Setting must be
-     * consistent per group, but since compatibility is really only possible
-     * with mdev currently, we expect singleton groups.
-     */
-    if (vbasedev->ram_block_discard_allowed !=
-        group->ram_block_discard_allowed) {
-        if (!QLIST_EMPTY(&group->device_list)) {
-            error_setg(errp, "Inconsistent setting of support for discarding "
-                       "RAM (e.g., balloon) within group");
-            close(fd);
-            return -1;
-        }
-
-        if (!group->ram_block_discard_allowed) {
-            group->ram_block_discard_allowed = true;
-            vfio_ram_block_discard_disable(group->container, false);
-        }
-    }
-
-    vbasedev->fd = fd;
-    vbasedev->group = group;
-    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
-
-    vbasedev->num_irqs = info->num_irqs;
-    vbasedev->num_regions = info->num_regions;
-    vbasedev->flags = info->flags;
-
-    trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
-
-    vbasedev->reset_works = !!(info->flags & VFIO_DEVICE_FLAGS_RESET);
-
-    return 0;
-}
-
-void vfio_put_base_device(VFIODevice *vbasedev)
-{
-    if (!vbasedev->group) {
-        return;
-    }
-    QLIST_REMOVE(vbasedev, next);
-    vbasedev->group = NULL;
-    trace_vfio_put_base_device(vbasedev->fd);
-    close(vbasedev->fd);
-}
-
-/*
- * Interfaces for IBM EEH (Enhanced Error Handling)
- */
-static bool vfio_eeh_container_ok(VFIOContainer *container)
-{
-    /*
-     * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
-     * implementation is broken if there are multiple groups in a
-     * container.  The hardware works in units of Partitionable
-     * Endpoints (== IOMMU groups) and the EEH operations naively
-     * iterate across all groups in the container, without any logic
-     * to make sure the groups have their state synchronized.  For
-     * certain operations (ENABLE) that might be ok, until an error
-     * occurs, but for others (GET_STATE) it's clearly broken.
-     */
-
-    /*
-     * XXX Once fixed kernels exist, test for them here
-     */
-
-    if (QLIST_EMPTY(&container->group_list)) {
-        return false;
-    }
-
-    if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) {
-        return false;
-    }
-
-    return true;
-}
-
-static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
-{
-    struct vfio_eeh_pe_op pe_op = {
-        .argsz = sizeof(pe_op),
-        .op = op,
-    };
-    int ret;
-
-    if (!vfio_eeh_container_ok(container)) {
-        error_report("vfio/eeh: EEH_PE_OP 0x%x: "
-                     "kernel requires a container with exactly one group", op);
-        return -EPERM;
-    }
-
-    ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op);
-    if (ret < 0) {
-        error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op);
-        return -errno;
-    }
-
-    return ret;
-}
-
-static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
-{
-    VFIOAddressSpace *space = vfio_get_address_space(as);
-    VFIOContainer *container = NULL;
-
-    if (QLIST_EMPTY(&space->containers)) {
-        /* No containers to act on */
-        goto out;
-    }
-
-    container = QLIST_FIRST(&space->containers);
-
-    if (QLIST_NEXT(container, next)) {
-        /* We don't yet have logic to synchronize EEH state across
-         * multiple containers */
-        container = NULL;
-        goto out;
-    }
-
-out:
-    vfio_put_address_space(space);
-    return container;
-}
-
-bool vfio_eeh_as_ok(AddressSpace *as)
-{
-    VFIOContainer *container = vfio_eeh_as_container(as);
-
-    return (container != NULL) && vfio_eeh_container_ok(container);
-}
-
-int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
-{
-    VFIOContainer *container = vfio_eeh_as_container(as);
-
-    if (!container) {
-        return -ENODEV;
-    }
-    return vfio_eeh_container_op(container, op);
-}
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
new file mode 100644
index 0000000000..175cdbbdff
--- /dev/null
+++ b/hw/vfio/container.c
@@ -0,0 +1,1085 @@
+/*
+ * generic functions used by VFIO devices
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#ifdef CONFIG_KVM
+#include <linux/kvm.h>
+#endif
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "exec/ram_addr.h"
+#include "hw/hw.h"
+#include "qemu/error-report.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/reset.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "migration/migration.h"
+
+VFIOGroupList vfio_group_list =
+    QLIST_HEAD_INITIALIZER(vfio_group_list);
+
+static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
+{
+    switch (container->iommu_type) {
+    case VFIO_TYPE1v2_IOMMU:
+    case VFIO_TYPE1_IOMMU:
+        /*
+         * We support coordinated discarding of RAM via the RamDiscardManager.
+         */
+        return ram_block_uncoordinated_discard_disable(state);
+    default:
+        /*
+         * VFIO_SPAPR_TCE_IOMMU most probably works just fine with
+         * RamDiscardManager; however, it is completely untested.
+         *
+         * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does
+         * completely the opposite of managing mapping/pinning dynamically as
+         * required by RamDiscardManager. We would have to special-case sections
+         * with a RamDiscardManager.
+         */
+        return ram_block_discard_disable(state);
+    }
+}
+
+VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
+                                         VFIODevice *curr)
+{
+    VFIOGroup *group;
+
+    if (!curr) {
+        group = QLIST_FIRST(&container->group_list);
+    } else {
+        if (curr->next.le_next) {
+            return curr->next.le_next;
+        }
+        group = curr->group->container_next.le_next;
+    }
+
+    if (!group) {
+        return NULL;
+    }
+    return QLIST_FIRST(&group->device_list);
+}
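
A minimal usage sketch (illustration only, not part of the diff) of the iterator above; a caller starts from NULL and stops once NULL comes back:

    VFIODevice *vbasedev = NULL;

    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
        /* inspect each device of every group in this container */
    }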
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+                                 hwaddr iova, ram_addr_t size,
+                                 IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    VFIOBitmap vbmap;
+    int ret;
+
+    ret = vfio_bitmap_alloc(&vbmap, size);
+    if (ret) {
+        return ret;
+    }
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->iova = iova;
+    unmap->size = size;
+    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
+     * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
+     * to qemu_real_host_page_size.
+     */
+    bitmap->pgsize = qemu_real_host_page_size();
+    bitmap->size = vbmap.size;
+    bitmap->data = (__u64 *)vbmap.bitmap;
+
+    if (vbmap.size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap.size);
+        ret = -E2BIG;
+        goto unmap_exit;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (!ret) {
+        cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap,
+                iotlb->translated_addr, vbmap.pages);
+    } else {
+        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+    }
+
+unmap_exit:
+    g_free(unmap);
+    g_free(vbmap.bitmap);
+
+    return ret;
+}
+
+/*
+ * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
+ */
+int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
+                   ram_addr_t size, IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = iova,
+        .size = size,
+    };
+    bool need_dirty_sync = false;
+    int ret;
+
+    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
+        if (!vfio_devices_all_device_dirty_tracking(container) &&
+            container->dirty_pages_supported) {
+            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+        }
+
+        need_dirty_sync = true;
+    }
+
+    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        /*
+         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
+         * v4.15) where an overflow in its wrap-around check prevents us from
+         * unmapping the last page of the address space.  Test for the error
+         * condition and re-try the unmap excluding the last page.  The
+         * expectation is that we've never mapped the last page anyway and this
+         * unmap request comes via vIOMMU support which also makes it unlikely
+         * that this page is used.  This bug was introduced well after type1 v2
+         * support was introduced, so we shouldn't need to test for v1.  A fix
+         * is queued for kernel v5.0 so this workaround can be removed once
+         * affected kernels are sufficiently deprecated.
+         */
+        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
+            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
+            trace_vfio_dma_unmap_overflow_workaround();
+            unmap.size -= 1ULL << ctz64(container->pgsizes);
+            continue;
+        }
+        error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
+        return -errno;
+    }
+
+    if (need_dirty_sync) {
+        ret = vfio_get_dirty_bitmap(container, iova, size,
+                                    iotlb->translated_addr);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly)
+{
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_READ,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (!readonly) {
+        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+    }
+
+    /*
+     * Try the mapping; if it fails with EBUSY, unmap the region and try
+     * again.  This shouldn't be necessary, but we sometimes see it in
+     * the VGA ROM space.
+     */
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
+         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
+        return 0;
+    }
+
+    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+    return -errno;
+}
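
For illustration only, a sketch of how the exported map/unmap pair is meant to be used on a guest range described by hypothetical iova, size and vaddr values; passing a NULL IOMMUTLBEntry to the unmap skips the dirty-bitmap handling shown above:

    if (vfio_dma_map(container, iova, size, vaddr, false)) {
        /* -errno was returned and an error has already been reported */
    }

    /* ... later, tear the mapping down without a dirty sync ... */
    vfio_dma_unmap(container, iova, size, NULL);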
+
+int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
+{
+    int ret;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (!container->dirty_pages_supported) {
+        return 0;
+    }
+
+    if (start) {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+    } else {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        ret = -errno;
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+    }
+
+    return ret;
+}
+
+int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
+                            hwaddr iova, hwaddr size)
+{
+    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+    struct vfio_iommu_type1_dirty_bitmap_get *range;
+    int ret;
+
+    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+    range->iova = iova;
+    range->size = size;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
+     * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize
+     * to qemu_real_host_page_size.
+     */
+    range->bitmap.pgsize = qemu_real_host_page_size();
+    range->bitmap.size = vbmap->size;
+    range->bitmap.data = (__u64 *)vbmap->bitmap;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    if (ret) {
+        ret = -errno;
+        error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
+                " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
+                (uint64_t)range->size, errno);
+    }
+
+    g_free(dbitmap);
+
+    return ret;
+}
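
Putting the dirty-tracking helpers together, a rough sketch of syncing one range into QEMU's RAM dirty bitmap, under the assumption that the container-based query path is taken (as vfio_get_dirty_bitmap() does when per-device dirty tracking is not in use); iova, size and ram_addr are hypothetical values describing the range:

    VFIOBitmap vbmap;
    int ret;

    ret = vfio_bitmap_alloc(&vbmap, size);
    if (ret) {
        return ret;
    }

    ret = vfio_query_dirty_bitmap(container, &vbmap, iova, size);
    if (!ret) {
        /* one bit per host page, folded into QEMU's dirty RAM tracking */
        cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr,
                                               vbmap.pages);
    }
    g_free(vbmap.bitmap);
    return ret;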
+
+static void vfio_listener_release(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener);
+    }
+}
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp)
+{
+    VFIOHostDMAWindow *hostwin;
+    hwaddr pgsize = 0;
+    int ret;
+
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return 0;
+    }
+
+    /* For now intersections are not allowed, we may relax this later */
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (ranges_overlap(hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1,
+                           section->offset_within_address_space,
+                           int128_get64(section->size))) {
+            error_setg(errp,
+                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
+                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                    int128_get64(section->size) - 1,
+                hostwin->min_iova, hostwin->max_iova);
+            return -EINVAL;
+        }
+    }
+
+    ret = vfio_spapr_create_window(container, section, &pgsize);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
+        return ret;
+    }
+
+    vfio_host_win_add(container, section->offset_within_address_space,
+                      section->offset_within_address_space +
+                      int128_get64(section->size) - 1, pgsize);
+#ifdef CONFIG_KVM
+    if (kvm_enabled()) {
+        VFIOGroup *group;
+        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
+        struct kvm_vfio_spapr_tce param;
+        struct kvm_device_attr attr = {
+            .group = KVM_DEV_VFIO_GROUP,
+            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
+            .addr = (uint64_t)(unsigned long)&param,
+        };
+
+        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
+                                          &param.tablefd)) {
+            QLIST_FOREACH(group, &container->group_list, container_next) {
+                param.groupfd = group->fd;
+                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+                    error_report("vfio: failed to setup fd %d "
+                                 "for a group with fd %d: %s",
+                                 param.tablefd, param.groupfd,
+                                 strerror(errno));
+                    return 0;
+                }
+                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
+            }
+        }
+    }
+#endif
+    return 0;
+}
+
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return;
+    }
+
+    vfio_spapr_remove_window(container,
+                             section->offset_within_address_space);
+    if (vfio_host_win_del(container,
+                          section->offset_within_address_space,
+                          section->offset_within_address_space +
+                          int128_get64(section->size) - 1) < 0) {
+        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
+                 __func__, section->offset_within_address_space);
+    }
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    return vfio_get_cap((void *)info, info->cap_offset, id);
+}
+
+bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
+                             unsigned int *avail)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_dma_avail *cap;
+
+    /* If the capability cannot be found, assume no DMA limiting */
+    hdr = vfio_get_iommu_type1_info_cap(info,
+                                        VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL);
+    if (hdr == NULL) {
+        return false;
+    }
+
+    if (avail != NULL) {
+        cap = (void *) hdr;
+        *avail = cap->avail;
+    }
+
+    return true;
+}
+
+static void vfio_kvm_device_add_group(VFIOGroup *group)
+{
+    vfio_kvm_device_add_fd(group->fd);
+}
+
+static void vfio_kvm_device_del_group(VFIOGroup *group)
+{
+    vfio_kvm_device_del_fd(group->fd);
+}
+
+/*
+ * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
+ */
+static int vfio_get_iommu_type(VFIOContainer *container,
+                               Error **errp)
+{
+    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+                          VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
+        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+            return iommu_types[i];
+        }
+    }
+    error_setg(errp, "No available IOMMU models");
+    return -EINVAL;
+}
+
+static int vfio_init_container(VFIOContainer *container, int group_fd,
+                               Error **errp)
+{
+    int iommu_type, ret;
+
+    iommu_type = vfio_get_iommu_type(container, errp);
+    if (iommu_type < 0) {
+        return iommu_type;
+    }
+
+    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
+    if (ret) {
+        error_setg_errno(errp, errno, "Failed to set group container");
+        return -errno;
+    }
+
+    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
+        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+            /*
+             * On sPAPR, although the IOMMU subdriver always advertises v1 and
+             * v2, the running platform may not support v2 and there is no
+             * way to guess it until an IOMMU group gets added to the container.
+             * So in case it fails with v2, try v1 as a fallback.
+             */
+            iommu_type = VFIO_SPAPR_TCE_IOMMU;
+            continue;
+        }
+        error_setg_errno(errp, errno, "Failed to set iommu for container");
+        return -errno;
+    }
+
+    container->iommu_type = iommu_type;
+    return 0;
+}
+
+static int vfio_get_iommu_info(VFIOContainer *container,
+                               struct vfio_iommu_type1_info **info)
+{
+
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+    *info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if (((*info)->argsz > argsz)) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto again;
+    }
+
+    return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+                                         struct vfio_iommu_type1_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+    if (!hdr) {
+        return;
+    }
+
+    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+                            header);
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
+     * qemu_real_host_page_size to mark those dirty.
+     */
+    if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
+        container->dirty_pages_supported = true;
+        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+    }
+}
+
+static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
+                                  Error **errp)
+{
+    VFIOContainer *container;
+    int ret, fd;
+    VFIOAddressSpace *space;
+
+    space = vfio_get_address_space(as);
+
+    /*
+     * VFIO is currently incompatible with discarding of RAM insofar as the
+     * madvise to purge (zap) the page from QEMU's address space does not
+     * interact with the memory API and therefore leaves stale virtual to
+     * physical mappings in the IOMMU if the page was previously pinned.  We
+     * therefore set discarding broken for each group added to a container,
+     * whether the container is used individually or shared.  This provides
+     * us with options to allow devices within a group to opt-in and allow
+     * discarding, so long as it is done consistently for a group (for instance
+     * if the device is an mdev device where it is known that the host vendor
+     * driver will never pin pages outside of the working set of the guest
+     * driver, which would thus not be discarding candidates).
+     *
+     * The first opportunity to induce pinning occurs here where we attempt to
+     * attach the group to existing containers within the AddressSpace.  If any
+     * pages are already zapped from the virtual address space, such as from
+     * previous discards, new pinning will cause valid mappings to be
+     * re-established.  Likewise, when the overall MemoryListener for a new
+     * container is registered, a replay of mappings within the AddressSpace
+     * will occur, re-establishing any previously zapped pages as well.
+     *
+     * In particular, virtio-balloon is currently only prevented from
+     * discarding new memory; it does not yet set
+     * ram_block_discard_set_required() and therefore neither stops us here
+     * nor deals with the sudden memory consumption of inflated memory.
+     *
+     * We do support discarding of memory coordinated via the RamDiscardManager
+     * with some IOMMU types. vfio_ram_block_discard_disable() handles the
+     * details once we know which type of IOMMU we are using.
+     */
+
+    QLIST_FOREACH(container, &space->containers, next) {
+        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+            ret = vfio_ram_block_discard_disable(container, true);
+            if (ret) {
+                error_setg_errno(errp, -ret,
+                                 "Cannot set discarding of RAM broken");
+                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
+                          &container->fd)) {
+                    error_report("vfio: error disconnecting group %d from"
+                                 " container", group->groupid);
+                }
+                return ret;
+            }
+            group->container = container;
+            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+            vfio_kvm_device_add_group(group);
+            return 0;
+        }
+    }
+
+    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
+        ret = -errno;
+        goto put_space_exit;
+    }
+
+    ret = ioctl(fd, VFIO_GET_API_VERSION);
+    if (ret != VFIO_API_VERSION) {
+        error_setg(errp, "supported vfio version: %d, "
+                   "reported version: %d", VFIO_API_VERSION, ret);
+        ret = -EINVAL;
+        goto close_fd_exit;
+    }
+
+    container = g_malloc0(sizeof(*container));
+    container->space = space;
+    container->fd = fd;
+    container->error = NULL;
+    container->dirty_pages_supported = false;
+    container->dma_max_mappings = 0;
+    QLIST_INIT(&container->giommu_list);
+    QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->vrdl_list);
+
+    ret = vfio_init_container(container, group->fd, errp);
+    if (ret) {
+        goto free_container_exit;
+    }
+
+    ret = vfio_ram_block_discard_disable(container, true);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
+        goto free_container_exit;
+    }
+
+    switch (container->iommu_type) {
+    case VFIO_TYPE1v2_IOMMU:
+    case VFIO_TYPE1_IOMMU:
+    {
+        struct vfio_iommu_type1_info *info;
+
+        ret = vfio_get_iommu_info(container, &info);
+        if (ret) {
+            error_setg_errno(errp, -ret, "Failed to get VFIO IOMMU info");
+            goto enable_discards_exit;
+        }
+
+        if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
+            container->pgsizes = info->iova_pgsizes;
+        } else {
+            container->pgsizes = qemu_real_host_page_size();
+        }
+
+        if (!vfio_get_info_dma_avail(info, &container->dma_max_mappings)) {
+            container->dma_max_mappings = 65535;
+        }
+        vfio_get_iommu_info_migration(container, info);
+        g_free(info);
+
+        /*
+         * FIXME: We should parse VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
+         * information to get the actual window extent rather than assume
+         * a 64-bit IOVA address space.
+         */
+        vfio_host_win_add(container, 0, (hwaddr)-1, container->pgsizes);
+
+        break;
+    }
+    case VFIO_SPAPR_TCE_v2_IOMMU:
+    case VFIO_SPAPR_TCE_IOMMU:
+    {
+        struct vfio_iommu_spapr_tce_info info;
+        bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU;
+
+        /*
+         * The host kernel code implementing VFIO_IOMMU_DISABLE is called
+         * when container fd is closed so we do not call it explicitly
+         * in this file.
+         */
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_setg_errno(errp, errno, "failed to enable container");
+                ret = -errno;
+                goto enable_discards_exit;
+            }
+        } else {
+            container->prereg_listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                memory_listener_unregister(&container->prereg_listener);
+                ret = -1;
+                error_propagate_prepend(errp, container->error,
+                    "RAM memory listener initialization failed: ");
+                goto enable_discards_exit;
+            }
+        }
+
+        info.argsz = sizeof(info);
+        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
+        if (ret) {
+            error_setg_errno(errp, errno,
+                             "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
+            ret = -errno;
+            if (v2) {
+                memory_listener_unregister(&container->prereg_listener);
+            }
+            goto enable_discards_exit;
+        }
+
+        if (v2) {
+            container->pgsizes = info.ddw.pgsizes;
+            /*
+             * There is a default window in just created container.
+             * To make region_add/del simpler, we better remove this
+             * window now and let those iommu_listener callbacks
+             * create/remove them when needed.
+             */
+            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
+            if (ret) {
+                error_setg_errno(errp, -ret,
+                                 "failed to remove existing window");
+                goto enable_discards_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            container->pgsizes = 0x1000;
+            vfio_host_win_add(container, info.dma32_window_start,
+                              info.dma32_window_start +
+                              info.dma32_window_size - 1,
+                              0x1000);
+        }
+    }
+    }
+
+    vfio_kvm_device_add_group(group);
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&space->containers, container, next);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    container->listener = vfio_memory_listener;
+
+    memory_listener_register(&container->listener, container->space->as);
+
+    if (container->error) {
+        ret = -1;
+        error_propagate_prepend(errp, container->error,
+            "memory listener initialization failed: ");
+        goto listener_release_exit;
+    }
+
+    container->initialized = true;
+
+    return 0;
+listener_release_exit:
+    QLIST_REMOVE(group, container_next);
+    QLIST_REMOVE(container, next);
+    vfio_kvm_device_del_group(group);
+    vfio_listener_release(container);
+
+enable_discards_exit:
+    vfio_ram_block_discard_disable(container, false);
+
+free_container_exit:
+    g_free(container);
+
+close_fd_exit:
+    close(fd);
+
+put_space_exit:
+    vfio_put_address_space(space);
+
+    return ret;
+}
+
+static void vfio_disconnect_container(VFIOGroup *group)
+{
+    VFIOContainer *container = group->container;
+
+    QLIST_REMOVE(group, container_next);
+    group->container = NULL;
+
+    /*
+     * Explicitly release the listener first before unset container,
+     * since unset may destroy the backend container if it's the last
+     * group.
+     */
+    if (QLIST_EMPTY(&container->group_list)) {
+        vfio_listener_release(container);
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+        error_report("vfio: error disconnecting group %d from container",
+                     group->groupid);
+    }
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        VFIOAddressSpace *space = container->space;
+        VFIOGuestIOMMU *giommu, *tmp;
+        VFIOHostDMAWindow *hostwin, *next;
+
+        QLIST_REMOVE(container, next);
+
+        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+            memory_region_unregister_iommu_notifier(
+                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
+            QLIST_REMOVE(giommu, giommu_next);
+            g_free(giommu);
+        }
+
+        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+                           next) {
+            QLIST_REMOVE(hostwin, hostwin_next);
+            g_free(hostwin);
+        }
+
+        trace_vfio_disconnect_container(container->fd);
+        close(container->fd);
+        g_free(container);
+
+        vfio_put_address_space(space);
+    }
+}
+
+VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+{
+    VFIOGroup *group;
+    char path[32];
+    struct vfio_group_status status = { .argsz = sizeof(status) };
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        if (group->groupid == groupid) {
+            /* Found it.  Now is it already in the right context? */
+            if (group->container->space->as == as) {
+                return group;
+            } else {
+                error_setg(errp, "group %d used in multiple address spaces",
+                           group->groupid);
+                return NULL;
+            }
+        }
+    }
+
+    group = g_malloc0(sizeof(*group));
+
+    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
+    group->fd = qemu_open_old(path, O_RDWR);
+    if (group->fd < 0) {
+        error_setg_errno(errp, errno, "failed to open %s", path);
+        goto free_group_exit;
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
+        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
+        goto close_fd_exit;
+    }
+
+    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_setg(errp, "group %d is not viable", groupid);
+        error_append_hint(errp,
+                          "Please ensure all devices within the iommu_group "
+                          "are bound to their vfio bus driver.\n");
+        goto close_fd_exit;
+    }
+
+    group->groupid = groupid;
+    QLIST_INIT(&group->device_list);
+
+    if (vfio_connect_container(group, as, errp)) {
+        error_prepend(errp, "failed to setup container for group %d: ",
+                      groupid);
+        goto close_fd_exit;
+    }
+
+    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+
+    return group;
+
+close_fd_exit:
+    close(group->fd);
+
+free_group_exit:
+    g_free(group);
+
+    return NULL;
+}
+
+void vfio_put_group(VFIOGroup *group)
+{
+    if (!group || !QLIST_EMPTY(&group->device_list)) {
+        return;
+    }
+
+    if (!group->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(group->container, false);
+    }
+    vfio_kvm_device_del_group(group);
+    vfio_disconnect_container(group);
+    QLIST_REMOVE(group, next);
+    trace_vfio_put_group(group->fd);
+    close(group->fd);
+    g_free(group);
+}
+
+int vfio_get_device(VFIOGroup *group, const char *name,
+                    VFIODevice *vbasedev, Error **errp)
+{
+    g_autofree struct vfio_device_info *info = NULL;
+    int fd;
+
+    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "error getting device from group %d",
+                         group->groupid);
+        error_append_hint(errp,
+                      "Verify all devices in group %d are bound to vfio-<bus> "
+                      "or pci-stub and not already in use\n", group->groupid);
+        return fd;
+    }
+
+    info = vfio_get_device_info(fd);
+    if (!info) {
+        error_setg_errno(errp, errno, "error getting device info");
+        close(fd);
+        return -1;
+    }
+
+    /*
+     * Set discarding of RAM as not broken for this group if the driver knows
+     * the device operates compatibly with discarding.  Setting must be
+     * consistent per group, but since compatibility is really only possible
+     * with mdev currently, we expect singleton groups.
+     */
+    if (vbasedev->ram_block_discard_allowed !=
+        group->ram_block_discard_allowed) {
+        if (!QLIST_EMPTY(&group->device_list)) {
+            error_setg(errp, "Inconsistent setting of support for discarding "
+                       "RAM (e.g., balloon) within group");
+            close(fd);
+            return -1;
+        }
+
+        if (!group->ram_block_discard_allowed) {
+            group->ram_block_discard_allowed = true;
+            vfio_ram_block_discard_disable(group->container, false);
+        }
+    }
+
+    vbasedev->fd = fd;
+    vbasedev->group = group;
+    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
+
+    vbasedev->num_irqs = info->num_irqs;
+    vbasedev->num_regions = info->num_regions;
+    vbasedev->flags = info->flags;
+
+    trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
+
+    vbasedev->reset_works = !!(info->flags & VFIO_DEVICE_FLAGS_RESET);
+
+    return 0;
+}
+
+void vfio_put_base_device(VFIODevice *vbasedev)
+{
+    if (!vbasedev->group) {
+        return;
+    }
+    QLIST_REMOVE(vbasedev, next);
+    vbasedev->group = NULL;
+    trace_vfio_put_base_device(vbasedev->fd);
+    close(vbasedev->fd);
+}
+
+/*
+ * Interfaces for IBM EEH (Enhanced Error Handling)
+ */
+static bool vfio_eeh_container_ok(VFIOContainer *container)
+{
+    /*
+     * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
+     * implementation is broken if there are multiple groups in a
+     * container.  The hardware works in units of Partitionable
+     * Endpoints (== IOMMU groups) and the EEH operations naively
+     * iterate across all groups in the container, without any logic
+     * to make sure the groups have their state synchronized.  For
+     * certain operations (ENABLE) that might be ok, until an error
+     * occurs, but for others (GET_STATE) it's clearly broken.
+     */
+
+    /*
+     * XXX Once fixed kernels exist, test for them here
+     */
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        return false;
+    }
+
+    if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) {
+        return false;
+    }
+
+    return true;
+}
+
+static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
+{
+    struct vfio_eeh_pe_op pe_op = {
+        .argsz = sizeof(pe_op),
+        .op = op,
+    };
+    int ret;
+
+    if (!vfio_eeh_container_ok(container)) {
+        error_report("vfio/eeh: EEH_PE_OP 0x%x: "
+                     "kernel requires a container with exactly one group", op);
+        return -EPERM;
+    }
+
+    ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op);
+    if (ret < 0) {
+        error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op);
+        return -errno;
+    }
+
+    return ret;
+}
+
+static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
+{
+    VFIOAddressSpace *space = vfio_get_address_space(as);
+    VFIOContainer *container = NULL;
+
+    if (QLIST_EMPTY(&space->containers)) {
+        /* No containers to act on */
+        goto out;
+    }
+
+    container = QLIST_FIRST(&space->containers);
+
+    if (QLIST_NEXT(container, next)) {
+        /*
+         * We don't yet have logic to synchronize EEH state across
+         * multiple containers
+         */
+        container = NULL;
+        goto out;
+    }
+
+out:
+    vfio_put_address_space(space);
+    return container;
+}
+
+bool vfio_eeh_as_ok(AddressSpace *as)
+{
+    VFIOContainer *container = vfio_eeh_as_container(as);
+
+    return (container != NULL) && vfio_eeh_container_ok(container);
+}
+
+int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
+{
+    VFIOContainer *container = vfio_eeh_as_container(as);
+
+    if (!container) {
+        return -ENODEV;
+    }
+    return vfio_eeh_container_op(container, op);
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 3746c9f984..2a6912c940 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'helpers.c',
   'common.c',
+  'container.c',
   'spapr.c',
   'migration.c',
 ))
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 598c3ce079..bb7f9fe9c4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -33,6 +33,8 @@
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
+extern const MemoryListener vfio_memory_listener;
+
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
     VFIO_DEVICE_TYPE_PLATFORM = 1,
@@ -196,6 +198,38 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
+typedef struct {
+    unsigned long *bitmap;
+    hwaddr size;
+    hwaddr pages;
+} VFIOBitmap;
+
+void vfio_host_win_add(VFIOContainer *container,
+                       hwaddr min_iova, hwaddr max_iova,
+                       uint64_t iova_pgsizes);
+int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+                      hwaddr max_iova);
+VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
+void vfio_put_address_space(VFIOAddressSpace *space);
+bool vfio_devices_all_running_and_saving(VFIOContainer *container);
+
+/* container->fd */
+VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
+                                         VFIODevice *curr);
+int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
+                   ram_addr_t size, IOMMUTLBEntry *iotlb);
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly);
+int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
+int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
+                            hwaddr iova, hwaddr size);
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp);
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section);
+
 void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
@@ -220,6 +254,8 @@ struct vfio_device_info *vfio_get_device_info(int fd);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
 
+extern int vfio_kvm_device_fd;
+
 int vfio_kvm_device_add_fd(int fd);
 int vfio_kvm_device_del_fd(int fd);
 
@@ -260,4 +296,13 @@ int vfio_spapr_remove_window(VFIOContainer *container,
 bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
 void vfio_migration_exit(VFIODevice *vbasedev);
 
+int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size);
+bool vfio_devices_all_running_and_mig_active(VFIOContainer *container);
+bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container);
+int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
+                                    VFIOBitmap *vbmap, hwaddr iova,
+                                    hwaddr size);
+int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr);
+
 #endif /* HW_VFIO_VFIO_COMMON_H */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (7 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 08/22] vfio/common: Move legacy VFIO backend code into separate container.c Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-20 13:33   ` Eric Auger
  2023-09-21  9:44   ` Cédric Le Goater
  2023-08-30 10:37 ` [PATCH v1 10/22] vfio/platform: Use vfio_[attach/detach]_device Zhenzhong Duan
                   ` (14 subsequent siblings)
  23 siblings, 2 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

From: Eric Auger <eric.auger@redhat.com>

We want the VFIO devices to be able to use two different
IOMMU backends, the legacy VFIO one and the new iommufd one.

Introduce vfio_[attach/detach]_device which aim at hiding the
underlying IOMMU backend (IOCTLs, datatypes, ...).

Once vfio_attach_device completes, the device is attached
to a security context and its fd can be used. Conversely,
when vfio_detach_device completes, the device has been
detached from the security context.

In this patch, only the vfio-pci device gets converted to use
the new API. Subsequent patches will handle other devices.
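
The expected call pattern is roughly the following (an illustrative
sketch only, not part of the patch; the example_* function names are
made up, the real conversion is in the vfio-pci hunks below):

/*
 * Sketch of a device backend using the new API.  vfio_attach_device()
 * resolves the iommu group, sets up (or reuses) the container and
 * opens the device fd; vfio_detach_device() undoes all of that.
 */
static void example_realize(VFIOPCIDevice *vdev, PCIDevice *pdev,
                            char *name, Error **errp)
{
    VFIODevice *vbasedev = &vdev->vbasedev;

    vbasedev->type = VFIO_DEVICE_TYPE_PCI;
    vbasedev->dev = DEVICE(vdev);

    if (vfio_attach_device(name, vbasedev,
                           pci_device_iommu_address_space(pdev), errp)) {
        return;                 /* errp has been set by the helper */
    }

    /* vbasedev->fd is now valid for region and IRQ setup */
}

static void example_unrealize(VFIOPCIDevice *vdev)
{
    /* drops the device, group and container references */
    vfio_detach_device(&vdev->vbasedev);
}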

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/container.c           | 66 +++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 50 ++++----------------------
 hw/vfio/trace-events          |  2 +-
 include/hw/vfio/vfio-common.h |  3 ++
 4 files changed, 76 insertions(+), 45 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 175cdbbdff..74556da0c7 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -1083,3 +1083,69 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
     }
     return vfio_eeh_container_op(container, op);
 }
+
+static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
+{
+    char *tmp, group_path[PATH_MAX], *group_name;
+    int ret, groupid;
+    ssize_t len;
+
+    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
+    len = readlink(tmp, group_path, sizeof(group_path));
+    g_free(tmp);
+
+    if (len <= 0 || len >= sizeof(group_path)) {
+        ret = len < 0 ? -errno : -ENAMETOOLONG;
+        error_setg_errno(errp, -ret, "no iommu_group found");
+        return ret;
+    }
+
+    group_path[len] = 0;
+
+    group_name = basename(group_path);
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_setg_errno(errp, errno, "failed to read %s", group_path);
+        return -errno;
+    }
+    return groupid;
+}
+
+int vfio_attach_device(char *name, VFIODevice *vbasedev,
+                       AddressSpace *as, Error **errp)
+{
+    int groupid = vfio_device_groupid(vbasedev, errp);
+    VFIODevice *vbasedev_iter;
+    VFIOGroup *group;
+    int ret;
+
+    if (groupid < 0) {
+        return groupid;
+    }
+
+    group = vfio_get_group(groupid, as, errp);
+    if (!group) {
+        return -ENOENT;
+    }
+
+    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
+            error_setg(errp, "device is already attached");
+            vfio_put_group(group);
+            return -EBUSY;
+        }
+    }
+    ret = vfio_get_device(group, name, vbasedev, errp);
+    if (ret) {
+        vfio_put_group(group);
+    }
+
+    return ret;
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+    VFIOGroup *group = vbasedev->group;
+
+    vfio_put_base_device(vbasedev);
+    vfio_put_group(group);
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a205c6b113..34f65ecd17 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2828,10 +2828,10 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 
 static void vfio_put_device(VFIOPCIDevice *vdev)
 {
+    vfio_detach_device(&vdev->vbasedev);
+
     g_free(vdev->vbasedev.name);
     g_free(vdev->msix);
-
-    vfio_put_base_device(&vdev->vbasedev);
 }
 
 static void vfio_err_notifier_handler(void *opaque)
@@ -2978,13 +2978,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
-    VFIODevice *vbasedev_iter;
-    VFIOGroup *group;
-    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
+    char *tmp, *subsys;
     Error *err = NULL;
-    ssize_t len;
     struct stat st;
-    int groupid;
     int i, ret;
     bool is_mdev;
     char uuid[UUID_FMT_LEN];
@@ -3015,38 +3011,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vbasedev->type = VFIO_DEVICE_TYPE_PCI;
     vbasedev->dev = DEVICE(vdev);
 
-    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len <= 0 || len >= sizeof(group_path)) {
-        error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG,
-                         "no iommu_group found");
-        goto error;
-    }
-
-    group_path[len] = 0;
-
-    group_name = basename(group_path);
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_setg_errno(errp, errno, "failed to read %s", group_path);
-        goto error;
-    }
-
-    trace_vfio_realize(vbasedev->name, groupid);
-
-    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
-    if (!group) {
-        goto error;
-    }
-
-    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-            error_setg(errp, "device is already attached");
-            vfio_put_group(group);
-            goto error;
-        }
-    }
+    trace_vfio_realize(vbasedev->name);
 
     /*
      * Mediated devices *might* operate compatibly with discarding of RAM, but
@@ -3065,7 +3030,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (vbasedev->ram_block_discard_allowed && !is_mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
-        vfio_put_group(group);
         goto error;
     }
 
@@ -3076,10 +3040,10 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         name = g_strdup(vbasedev->name);
     }
 
-    ret = vfio_get_device(group, name, vbasedev, errp);
+    ret = vfio_attach_device(name, vbasedev,
+                             pci_device_iommu_address_space(pdev), errp);
     g_free(name);
     if (ret) {
-        vfio_put_group(group);
         goto error;
     }
 
@@ -3318,7 +3282,6 @@ error:
 static void vfio_instance_finalize(Object *obj)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(obj);
-    VFIOGroup *group = vdev->vbasedev.group;
 
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
@@ -3332,7 +3295,6 @@ static void vfio_instance_finalize(Object *obj)
      * g_free(vdev->igd_opregion);
      */
     vfio_put_device(vdev);
-    vfio_put_group(group);
 }
 
 static void vfio_exitfn(PCIDevice *pdev)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ee7509e68e..8016d9f0d2 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -37,7 +37,7 @@ vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, int
 vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %s"
 vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
 vfio_populate_device_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
-vfio_realize(const char *name, int group_id) " (%s) group %d"
+vfio_realize(const char *name) " (%s)"
 vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
 vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
 vfio_pci_reset(const char *name) " (%s)"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bb7f9fe9c4..a29dfe7723 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -253,6 +253,9 @@ void vfio_put_group(VFIOGroup *group);
 struct vfio_device_info *vfio_get_device_info(int fd);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_attach_device(char *name, VFIODevice *vbasedev,
+                       AddressSpace *as, Error **errp);
+void vfio_detach_device(VFIODevice *vbasedev);
 
 extern int vfio_kvm_device_fd;
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 10/22] vfio/platform: Use vfio_[attach/detach]_device
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (8 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-21 12:17   ` Cédric Le Goater
  2023-08-30 10:37 ` [PATCH v1 11/22] vfio/ap: " Zhenzhong Duan
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

From: Eric Auger <eric.auger@redhat.com>

Let the vfio-platform device use vfio_attach_device() and
vfio_detach_device(), thereby hiding the details of the underlying
IOMMU backend.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/platform.c   | 43 ++++---------------------------------------
 hw/vfio/trace-events |  2 +-
 2 files changed, 5 insertions(+), 40 deletions(-)

diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 5af73f9287..5c08c39315 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -529,12 +529,7 @@ static VFIODeviceOps vfio_platform_ops = {
  */
 static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev_iter;
-    char *tmp, group_path[PATH_MAX], *group_name;
-    ssize_t len;
     struct stat st;
-    int groupid;
     int ret;
 
     /* @sysfsdev takes precedence over @host */
@@ -557,47 +552,17 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
         return -errno;
     }
 
-    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
+    trace_vfio_platform_base_device_init(vbasedev->name);
 
-    if (len < 0 || len >= sizeof(group_path)) {
-        ret = len < 0 ? -errno : -ENAMETOOLONG;
-        error_setg_errno(errp, -ret, "no iommu_group found");
-        return ret;
-    }
-
-    group_path[len] = 0;
-
-    group_name = basename(group_path);
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_setg_errno(errp, errno, "failed to read %s", group_path);
-        return -errno;
-    }
-
-    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
-
-    group = vfio_get_group(groupid, &address_space_memory, errp);
-    if (!group) {
-        return -ENOENT;
-    }
-
-    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-            error_setg(errp, "device is already attached");
-            vfio_put_group(group);
-            return -EBUSY;
-        }
-    }
-    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+    ret = vfio_attach_device(vbasedev->name, vbasedev,
+                             &address_space_memory, errp);
     if (ret) {
-        vfio_put_group(group);
         return ret;
     }
 
     ret = vfio_populate_device(vbasedev, errp);
     if (ret) {
-        vfio_put_group(group);
+        vfio_detach_device(vbasedev);
     }
 
     return ret;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8016d9f0d2..bd32970854 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -124,7 +124,7 @@ vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
 
 # platform.c
-vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
+vfio_platform_base_device_init(char *name) "%s"
 vfio_platform_realize(char *name, char *compat) "vfio device %s, compat = %s"
 vfio_platform_eoi(int pin, int fd) "EOI IRQ pin %d (fd=%d)"
 vfio_platform_intp_mmap_enable(int pin) "IRQ #%d still active, stay in slow path"
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 11/22] vfio/ap: Use vfio_[attach/detach]_device
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (9 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 10/22] vfio/platform: Use vfio_[attach/detach]_device Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-08-30 10:37 ` [PATCH v1 12/22] vfio/ccw: " Zhenzhong Duan
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan, Tony Krowiak, Halil Pasic, Jason Herne,
	Thomas Huth, open list:vfio-ap

From: Eric Auger <eric.auger@redhat.com>

Let the vfio-ap device use vfio_attach_device() and
vfio_detach_device(), thereby hiding the details of the underlying
IOMMU backend.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/ap.c | 68 +++++++++-------------------------------------------
 1 file changed, 11 insertions(+), 57 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 6e21d1da5a..16ea7fb3c2 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -53,40 +53,6 @@ struct VFIODeviceOps vfio_ap_ops = {
     .vfio_compute_needs_reset = vfio_ap_compute_needs_reset,
 };
 
-static void vfio_ap_put_device(VFIOAPDevice *vapdev)
-{
-    g_free(vapdev->vdev.name);
-    vfio_put_base_device(&vapdev->vdev);
-}
-
-static VFIOGroup *vfio_ap_get_group(VFIOAPDevice *vapdev, Error **errp)
-{
-    GError *gerror = NULL;
-    char *symlink, *group_path;
-    int groupid;
-
-    symlink = g_strdup_printf("%s/iommu_group", vapdev->vdev.sysfsdev);
-    group_path = g_file_read_link(symlink, &gerror);
-    g_free(symlink);
-
-    if (!group_path) {
-        error_setg(errp, "%s: no iommu_group found for %s: %s",
-                   TYPE_VFIO_AP_DEVICE, vapdev->vdev.sysfsdev, gerror->message);
-        g_error_free(gerror);
-        return NULL;
-    }
-
-    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
-        error_setg(errp, "vfio: failed to read %s", group_path);
-        g_free(group_path);
-        return NULL;
-    }
-
-    g_free(group_path);
-
-    return vfio_get_group(groupid, &address_space_memory, errp);
-}
-
 static void vfio_ap_req_notifier_handler(void *opaque)
 {
     VFIOAPDevice *vapdev = opaque;
@@ -189,22 +155,15 @@ static void vfio_ap_unregister_irq_notifier(VFIOAPDevice *vapdev,
 static void vfio_ap_realize(DeviceState *dev, Error **errp)
 {
     int ret;
-    char *mdevid;
     Error *err = NULL;
-    VFIOGroup *vfio_group;
     APDevice *apdev = AP_DEVICE(dev);
     VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev);
+    VFIODevice *vbasedev = &vapdev->vdev;
 
-    vfio_group = vfio_ap_get_group(vapdev, errp);
-    if (!vfio_group) {
-        return;
-    }
-
-    vapdev->vdev.ops = &vfio_ap_ops;
-    vapdev->vdev.type = VFIO_DEVICE_TYPE_AP;
-    mdevid = basename(vapdev->vdev.sysfsdev);
-    vapdev->vdev.name = g_strdup_printf("%s", mdevid);
-    vapdev->vdev.dev = dev;
+    vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+    vbasedev->ops = &vfio_ap_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_AP;
+    vbasedev->dev = dev;
 
     /*
      * vfio-ap devices operate in a way compatible with discarding of
@@ -214,9 +173,11 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
      */
     vapdev->vdev.ram_block_discard_allowed = true;
 
-    ret = vfio_get_device(vfio_group, mdevid, &vapdev->vdev, errp);
+    ret = vfio_attach_device(vbasedev->name, vbasedev,
+                             &address_space_memory, errp);
     if (ret) {
-        goto out_get_dev_err;
+        g_free(vbasedev->name);
+        return;
     }
 
     vfio_ap_register_irq_notifier(vapdev, VFIO_AP_REQ_IRQ_INDEX, &err);
@@ -227,23 +188,16 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
          */
         error_report_err(err);
     }
-
-    return;
-
-out_get_dev_err:
-    vfio_ap_put_device(vapdev);
-    vfio_put_group(vfio_group);
 }
 
 static void vfio_ap_unrealize(DeviceState *dev)
 {
     APDevice *apdev = AP_DEVICE(dev);
     VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev);
-    VFIOGroup *group = vapdev->vdev.group;
 
     vfio_ap_unregister_irq_notifier(vapdev, VFIO_AP_REQ_IRQ_INDEX);
-    vfio_ap_put_device(vapdev);
-    vfio_put_group(group);
+    vfio_detach_device(&vapdev->vdev);
+    g_free(vapdev->vdev.name);
 }
 
 static Property vfio_ap_properties[] = {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 12/22] vfio/ccw: Use vfio_[attach/detach]_device
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (10 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 11/22] vfio/ap: " Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-21 12:19   ` Cédric Le Goater
  2023-08-30 10:37 ` [PATCH v1 13/22] vfio: Add base container Zhenzhong Duan
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan, Eric Farman, Matthew Rosato, Thomas Huth,
	open list:vfio-ccw

From: Eric Auger <eric.auger@redhat.com>

Let the vfio-ccw device use vfio_attach_device() and
vfio_detach_device(), thereby hiding the details of the underlying
IOMMU backend.

Now that all the devices have been migrated to the new
vfio_attach_device/vfio_detach_device API, turn the legacy
functions into static functions, local to container.c.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/ccw.c                 | 120 ++++++++--------------------------
 hw/vfio/container.c           |  10 +--
 include/hw/vfio/vfio-common.h |   5 --
 3 files changed, 33 insertions(+), 102 deletions(-)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 1e2fce83b0..f078e014fa 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -572,88 +572,15 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev)
     g_free(vcdev->io_region);
 }
 
-static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
-{
-    g_free(vcdev->vdev.name);
-    vfio_put_base_device(&vcdev->vdev);
-}
-
-static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
-                                Error **errp)
-{
-    S390CCWDevice *cdev = S390_CCW_DEVICE(vcdev);
-    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
-                                 cdev->hostid.ssid,
-                                 cdev->hostid.devid);
-    VFIODevice *vbasedev;
-
-    QLIST_FOREACH(vbasedev, &group->device_list, next) {
-        if (strcmp(vbasedev->name, name) == 0) {
-            error_setg(errp, "vfio: subchannel %s has already been attached",
-                       name);
-            goto out_err;
-        }
-    }
-
-    /*
-     * All vfio-ccw devices are believed to operate in a way compatible with
-     * discarding of memory in RAM blocks, ie. pages pinned in the host are
-     * in the current working set of the guest driver and therefore never
-     * overlap e.g., with pages available to the guest balloon driver.  This
-     * needs to be set before vfio_get_device() for vfio common to handle
-     * ram_block_discard_disable().
-     */
-    vcdev->vdev.ram_block_discard_allowed = true;
-
-    if (vfio_get_device(group, cdev->mdevid, &vcdev->vdev, errp)) {
-        goto out_err;
-    }
-
-    vcdev->vdev.ops = &vfio_ccw_ops;
-    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
-    vcdev->vdev.name = name;
-    vcdev->vdev.dev = DEVICE(vcdev);
-
-    return;
-
-out_err:
-    g_free(name);
-}
-
-static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
-{
-    char *tmp, group_path[PATH_MAX];
-    ssize_t len;
-    int groupid;
-
-    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
-                          cdev->hostid.cssid, cdev->hostid.ssid,
-                          cdev->hostid.devid, cdev->mdevid);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len <= 0 || len >= sizeof(group_path)) {
-        error_setg(errp, "vfio: no iommu_group found");
-        return NULL;
-    }
-
-    group_path[len] = 0;
-
-    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
-        error_setg(errp, "vfio: failed to read %s", group_path);
-        return NULL;
-    }
-
-    return vfio_get_group(groupid, &address_space_memory, errp);
-}
-
 static void vfio_ccw_realize(DeviceState *dev, Error **errp)
 {
-    VFIOGroup *group;
-    S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
-    VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
+    CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev);
+    S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
+    VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
+    VFIODevice *vbasedev = &vcdev->vdev;
     Error *err = NULL;
+    int ret;
 
     /* Call the class init function for subchannel. */
     if (cdc->realize) {
@@ -663,14 +590,25 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    group = vfio_ccw_get_group(cdev, &err);
-    if (!group) {
-        goto out_group_err;
-    }
+    vbasedev->ops = &vfio_ccw_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
+    vbasedev->name = g_strdup(cdev->mdevid);
+    vbasedev->dev = &vcdev->cdev.parent_obj.parent_obj;
 
-    vfio_ccw_get_device(group, vcdev, &err);
-    if (err) {
-        goto out_device_err;
+    /*
+     * All vfio-ccw devices are believed to operate in a way compatible with
+     * discarding of memory in RAM blocks, ie. pages pinned in the host are
+     * in the current working set of the guest driver and therefore never
+     * overlap e.g., with pages available to the guest balloon driver.  This
+     * needs to be set before vfio_get_device() for vfio common to handle
+     * ram_block_discard_disable().
+     */
+    vbasedev->ram_block_discard_allowed = true;
+
+    ret = vfio_attach_device(vbasedev->name, vbasedev,
+                             &address_space_memory, errp);
+    if (ret) {
+        goto out_attach_dev_err;
     }
 
     vfio_ccw_get_region(vcdev, &err);
@@ -708,10 +646,9 @@ out_irq_notifier_err:
 out_io_notifier_err:
     vfio_ccw_put_region(vcdev);
 out_region_err:
-    vfio_ccw_put_device(vcdev);
-out_device_err:
-    vfio_put_group(group);
-out_group_err:
+    vfio_detach_device(vbasedev);
+out_attach_dev_err:
+    g_free(vbasedev->name);
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
     }
@@ -724,14 +661,13 @@ static void vfio_ccw_unrealize(DeviceState *dev)
     S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
     VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
-    VFIOGroup *group = vcdev->vdev.group;
 
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX);
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX);
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX);
     vfio_ccw_put_region(vcdev);
-    vfio_ccw_put_device(vcdev);
-    vfio_put_group(group);
+    vfio_detach_device(&vcdev->vdev);
+    g_free(vcdev->vdev.name);
 
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 74556da0c7..c71fddc09a 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -837,7 +837,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
     char path[32];
@@ -900,7 +900,7 @@ free_group_exit:
     return NULL;
 }
 
-void vfio_put_group(VFIOGroup *group)
+static void vfio_put_group(VFIOGroup *group)
 {
     if (!group || !QLIST_EMPTY(&group->device_list)) {
         return;
@@ -917,8 +917,8 @@ void vfio_put_group(VFIOGroup *group)
     g_free(group);
 }
 
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp)
+static int vfio_get_device(VFIOGroup *group, const char *name,
+                           VFIODevice *vbasedev, Error **errp)
 {
     g_autofree struct vfio_device_info *info = NULL;
     int fd;
@@ -976,7 +976,7 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     return 0;
 }
 
-void vfio_put_base_device(VFIODevice *vbasedev)
+static void vfio_put_base_device(VFIODevice *vbasedev)
 {
     if (!vbasedev->group) {
         return;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a29dfe7723..95bcafdaf6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -230,7 +230,6 @@ int vfio_container_add_section_window(VFIOContainer *container,
 void vfio_container_del_section_window(VFIOContainer *container,
                                        MemoryRegionSection *section);
 
-void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
 void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
@@ -248,11 +247,7 @@ void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
-void vfio_put_group(VFIOGroup *group);
 struct vfio_device_info *vfio_get_device_info(int fd);
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp);
 int vfio_attach_device(char *name, VFIODevice *vbasedev,
                        AddressSpace *as, Error **errp);
 void vfio_detach_device(VFIODevice *vbasedev);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 13/22] vfio: Add base container
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (11 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 12/22] vfio/ccw: " Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-19 17:23   ` Cédric Le Goater
  2023-08-30 10:37 ` [PATCH v1 14/22] vfio/common: Simplify vfio_viommu_preset() Zhenzhong Duan
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun, Zhenzhong Duan, Daniel Henrique Barboza, David Gibson,
	Greg Kurz, Harsh Prateek Bora, open list:sPAPR (pseries)

From: Yi Liu <yi.l.liu@intel.com>

Abstract the VFIOContainer to be a base object. It is supposed to be
embedded into the legacy VFIO container and, later on, into the new
iommufd based container.

The base container implements generic code such as the code related to
the memory listener and address space management. The
VFIOIOMMUBackendOpsClass implements callbacks that depend on which
kernel userspace interface (legacy VFIO or iommufd) is being used.

'common.c' and vfio device code only manipulate the base container with
wrapper functions that call the functions defined in
VFIOIOMMUBackendOpsClass. Existing 'container.c' code is converted to
implement the legacy container ops functions.

Below is the base container. It is named VFIOContainer; the old
VFIOContainer is replaced with VFIOLegacyContainer.

struct VFIOContainer {
    VFIOIOMMUBackendOpsClass *ops;
    VFIOAddressSpace *space;
    MemoryListener listener;
    Error *error;
    bool initialized;
    bool dirty_pages_supported;
    uint64_t dirty_pgsizes;
    uint64_t max_dirty_bitmap_size;
    unsigned long pgsizes;
    unsigned int dma_max_mappings;
    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
    QLIST_ENTRY(VFIOContainer) next;
};

struct VFIOLegacyContainer {
    VFIOContainer bcontainer;
    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
    MemoryListener prereg_listener;
    unsigned iommu_type;
    QLIST_HEAD(, VFIOGroup) group_list;
};
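
As an illustration of how the two objects relate (a sketch only; the
helper names below are made up, the real implementation is in the diff
that follows), a backend embeds the base object and recovers its own
type from a generic VFIOContainer pointer, while common code only ever
dispatches through the ops:

/* Recover the derived type from the embedded base container. */
static VFIOLegacyContainer *to_legacy_container(VFIOContainer *bcontainer)
{
    return container_of(bcontainer, VFIOLegacyContainer, bcontainer);
}

static int legacy_dma_map(VFIOContainer *bcontainer, hwaddr iova,
                          ram_addr_t size, void *vaddr, bool readonly)
{
    VFIOLegacyContainer *container = to_legacy_container(bcontainer);

    /* ... issue VFIO_IOMMU_MAP_DMA on container->fd ... */
    return 0;
}

/*
 * Generic code (common.c, memory listener) only sees the base type:
 *     vfio_container_dma_map(bcontainer, iova, size, vaddr, readonly);
 * which forwards to bcontainer->ops->dma_map(), i.e. legacy_dma_map()
 * for the legacy backend.
 */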

Co-authored-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c                      |  72 +++++---
 hw/vfio/container-base.c              | 160 +++++++++++++++++
 hw/vfio/container.c                   | 247 ++++++++++++++++----------
 hw/vfio/meson.build                   |   1 +
 hw/vfio/spapr.c                       |  22 +--
 hw/vfio/trace-events                  |   4 +-
 include/hw/vfio/vfio-common.h         |  85 ++-------
 include/hw/vfio/vfio-container-base.h | 155 ++++++++++++++++
 8 files changed, 540 insertions(+), 206 deletions(-)
 create mode 100644 hw/vfio/container-base.c
 create mode 100644 include/hw/vfio/vfio-container-base.h

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 044710fc1f..86b6af5740 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -379,19 +379,20 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
          * of vaddr will always be there, even if the memory object is
          * destroyed and its backing memory munmap-ed.
          */
-        ret = vfio_dma_map(container, iova,
-                           iotlb->addr_mask + 1, vaddr,
-                           read_only);
+        ret = vfio_container_dma_map(container, iova,
+                                     iotlb->addr_mask + 1, vaddr,
+                                     read_only);
         if (ret) {
-            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+            error_report("vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%s)",
                          container, iova,
                          iotlb->addr_mask + 1, vaddr, ret, strerror(-ret));
         }
     } else {
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
+        ret = vfio_container_dma_unmap(container, iova,
+                                       iotlb->addr_mask + 1, iotlb);
         if (ret) {
-            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+            error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%s)",
                          container, iova,
                          iotlb->addr_mask + 1, ret, strerror(-ret));
@@ -407,14 +408,15 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
+    VFIOContainer *container = vrdl->container;
     const hwaddr size = int128_get64(section->size);
     const hwaddr iova = section->offset_within_address_space;
     int ret;
 
     /* Unmap with a single call. */
-    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
+    ret = vfio_container_dma_unmap(container, iova, size , NULL);
     if (ret) {
-        error_report("%s: vfio_dma_unmap() failed: %s", __func__,
+        error_report("%s: vfio_container_dma_unmap() failed: %s", __func__,
                      strerror(-ret));
     }
 }
@@ -424,6 +426,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
+    VFIOContainer *container = vrdl->container;
     const hwaddr end = section->offset_within_region +
                        int128_get64(section->size);
     hwaddr start, next, iova;
@@ -442,8 +445,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
                section->offset_within_address_space;
         vaddr = memory_region_get_ram_ptr(section->mr) + start;
 
-        ret = vfio_dma_map(vrdl->container, iova, next - start,
-                           vaddr, section->readonly);
+        ret = vfio_container_dma_map(container, iova, next - start,
+                                     vaddr, section->readonly);
         if (ret) {
             /* Rollback */
             vfio_ram_discard_notify_discard(rdl, section);
@@ -756,10 +759,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
         }
     }
 
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
-                       vaddr, section->readonly);
+    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
+                                 vaddr, section->readonly);
     if (ret) {
-        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+        error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
                    "0x%"HWADDR_PRIx", %p) = %d (%s)",
                    container, iova, int128_get64(llsize), vaddr, ret,
                    strerror(-ret));
@@ -775,7 +778,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
 fail:
     if (memory_region_is_ram_device(section->mr)) {
-        error_report("failed to vfio_dma_map. pci p2p may not work");
+        error_report("failed to vfio_container_dma_map. pci p2p may not work");
         return;
     }
     /*
@@ -860,18 +863,20 @@ static void vfio_listener_region_del(MemoryListener *listener,
         if (int128_eq(llsize, int128_2_64())) {
             /* The unmap ioctl doesn't accept a full 64-bit span. */
             llsize = int128_rshift(llsize, 1);
-            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+            ret = vfio_container_dma_unmap(container, iova,
+                                           int128_get64(llsize), NULL);
             if (ret) {
-                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                              "0x%"HWADDR_PRIx") = %d (%s)",
                              container, iova, int128_get64(llsize), ret,
                              strerror(-ret));
             }
             iova += int128_get64(llsize);
         }
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+        ret = vfio_container_dma_unmap(container, iova,
+                                       int128_get64(llsize), NULL);
         if (ret) {
-            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+            error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%s)",
                          container, iova, int128_get64(llsize), ret,
                          strerror(-ret));
@@ -1103,7 +1108,7 @@ static void vfio_listener_log_global_start(MemoryListener *listener)
     if (vfio_devices_all_device_dirty_tracking(container)) {
         ret = vfio_devices_dma_logging_start(container);
     } else {
-        ret = vfio_set_dirty_page_tracking(container, true);
+        ret = vfio_container_set_dirty_page_tracking(container, true);
     }
 
     if (ret) {
@@ -1121,7 +1126,7 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     if (vfio_devices_all_device_dirty_tracking(container)) {
         vfio_devices_dma_logging_stop(container);
     } else {
-        ret = vfio_set_dirty_page_tracking(container, false);
+        ret = vfio_container_set_dirty_page_tracking(container, false);
     }
 
     if (ret) {
@@ -1204,7 +1209,7 @@ int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
     if (all_device_dirty_tracking) {
         ret = vfio_devices_query_dirty_bitmap(container, &vbmap, iova, size);
     } else {
-        ret = vfio_query_dirty_bitmap(container, &vbmap, iova, size);
+        ret = vfio_container_query_dirty_bitmap(container, &vbmap, iova, size);
     }
 
     if (ret) {
@@ -1214,8 +1219,7 @@ int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
     dirty_pages = cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr,
                                                          vbmap.pages);
 
-    trace_vfio_get_dirty_bitmap(container->fd, iova, size, vbmap.size,
-                                ram_addr, dirty_pages);
+    trace_vfio_get_dirty_bitmap(iova, size, vbmap.size, ram_addr, dirty_pages);
 out:
     g_free(vbmap.bitmap);
 
@@ -1525,3 +1529,25 @@ retry:
 
     return info;
 }
+
+int vfio_attach_device(char *name, VFIODevice *vbasedev,
+                       AddressSpace *as, Error **errp)
+{
+    const VFIOIOMMUBackendOpsClass *ops;
+
+    ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
+                  object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
+    if (!ops) {
+        error_setg(errp, "VFIO IOMMU Backend not found!");
+        return -ENODEV;
+    }
+    return ops->attach_device(name, vbasedev, as, errp);
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+    if (!vbasedev->container) {
+        return;
+    }
+    vbasedev->container->ops->detach_device(vbasedev);
+}
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
new file mode 100644
index 0000000000..876e95c6dd
--- /dev/null
+++ b/hw/vfio/container-base.c
@@ -0,0 +1,160 @@
+/*
+ * VFIO BASE CONTAINER
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ * Copyright Red Hat, Inc. 2023
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "hw/vfio/vfio-container-base.h"
+
+VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
+                                 VFIODevice *curr)
+{
+    if (!container->ops->dev_iter_next) {
+        return NULL;
+    }
+
+    return container->ops->dev_iter_next(container, curr);
+}
+
+int vfio_container_dma_map(VFIOContainer *container,
+                           hwaddr iova, ram_addr_t size,
+                           void *vaddr, bool readonly)
+{
+    if (!container->ops->dma_map) {
+        return -EINVAL;
+    }
+
+    return container->ops->dma_map(container, iova, size, vaddr, readonly);
+}
+
+int vfio_container_dma_unmap(VFIOContainer *container,
+                             hwaddr iova, ram_addr_t size,
+                             IOMMUTLBEntry *iotlb)
+{
+    if (!container->ops->dma_unmap) {
+        return -EINVAL;
+    }
+
+    return container->ops->dma_unmap(container, iova, size, iotlb);
+}
+
+int vfio_container_set_dirty_page_tracking(VFIOContainer *container,
+                                            bool start)
+{
+    /* Fallback to all pages dirty if dirty page sync isn't supported */
+    if (!container->ops->set_dirty_page_tracking) {
+        return 0;
+    }
+
+    return container->ops->set_dirty_page_tracking(container, start);
+}
+
+int vfio_container_query_dirty_bitmap(VFIOContainer *container,
+                                      VFIOBitmap *vbmap,
+                                      hwaddr iova, hwaddr size)
+{
+    if (!container->ops->query_dirty_bitmap) {
+        return -EINVAL;
+    }
+
+    return container->ops->query_dirty_bitmap(container, vbmap, iova, size);
+}
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp)
+{
+    if (!container->ops->add_window) {
+        return 0;
+    }
+
+    return container->ops->add_window(container, section, errp);
+}
+
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    if (!container->ops->del_window) {
+        return;
+    }
+
+    return container->ops->del_window(container, section);
+}
+
+void vfio_container_init(VFIOContainer *container,
+                         VFIOAddressSpace *space,
+                         struct VFIOIOMMUBackendOpsClass *ops)
+{
+    container->ops = ops;
+    container->space = space;
+    container->error = NULL;
+    container->dirty_pages_supported = false;
+    container->dma_max_mappings = 0;
+    QLIST_INIT(&container->giommu_list);
+    QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->vrdl_list);
+}
+
+void vfio_container_destroy(VFIOContainer *container)
+{
+    VFIORamDiscardListener *vrdl, *vrdl_tmp;
+    VFIOGuestIOMMU *giommu, *tmp;
+    VFIOHostDMAWindow *hostwin, *next;
+
+    QLIST_SAFE_REMOVE(container, next);
+
+    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
+        RamDiscardManager *rdm;
+
+        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
+        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
+        QLIST_REMOVE(vrdl, next);
+        g_free(vrdl);
+    }
+
+    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+        memory_region_unregister_iommu_notifier(
+                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
+        QLIST_REMOVE(giommu, giommu_next);
+        g_free(giommu);
+    }
+
+    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+                       next) {
+        QLIST_REMOVE(hostwin, hostwin_next);
+        g_free(hostwin);
+    }
+}
+
+static const TypeInfo vfio_iommu_backend_ops_type_info = {
+    .name = TYPE_VFIO_IOMMU_BACKEND_OPS,
+    .parent = TYPE_OBJECT,
+    .abstract = true,
+    .class_size = sizeof(VFIOIOMMUBackendOpsClass),
+};
+
+static void vfio_iommu_backend_ops_register_types(void)
+{
+    type_register_static(&vfio_iommu_backend_ops_type_info);
+}
+type_init(vfio_iommu_backend_ops_register_types);
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index c71fddc09a..bb29b3612d 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -42,7 +42,8 @@
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
 
-static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
+static int vfio_ram_block_discard_disable(VFIOLegacyContainer *container,
+                                          bool state)
 {
     switch (container->iommu_type) {
     case VFIO_TYPE1v2_IOMMU:
@@ -65,11 +66,18 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
     }
 }
 
-VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
-                                         VFIODevice *curr)
+static VFIODevice *vfio_legacy_dev_iter_next(VFIOContainer *bcontainer,
+                                             VFIODevice *curr)
 {
     VFIOGroup *group;
 
+    assert(object_class_dynamic_cast(OBJECT_CLASS(bcontainer->ops),
+                                     TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
+
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer,
+                                                  bcontainer);
+
     if (!curr) {
         group = QLIST_FIRST(&container->group_list);
     } else {
@@ -85,10 +93,11 @@ VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
     return QLIST_FIRST(&group->device_list);
 }
 
-static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
                                  hwaddr iova, ram_addr_t size,
                                  IOMMUTLBEntry *iotlb)
 {
+    VFIOContainer *bcontainer = &container->bcontainer;
     struct vfio_iommu_type1_dma_unmap *unmap;
     struct vfio_bitmap *bitmap;
     VFIOBitmap vbmap;
@@ -116,7 +125,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
     bitmap->size = vbmap.size;
     bitmap->data = (__u64 *)vbmap.bitmap;
 
-    if (vbmap.size > container->max_dirty_bitmap_size) {
+    if (vbmap.size > bcontainer->max_dirty_bitmap_size) {
         error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap.size);
         ret = -E2BIG;
         goto unmap_exit;
@@ -140,9 +149,13 @@ unmap_exit:
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
-int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
-                   ram_addr_t size, IOMMUTLBEntry *iotlb)
+static int vfio_legacy_dma_unmap(VFIOContainer *bcontainer, hwaddr iova,
+                          ram_addr_t size, IOMMUTLBEntry *iotlb)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer,
+                                                  bcontainer);
+
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
         .flags = 0,
@@ -152,9 +165,9 @@ int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
     bool need_dirty_sync = false;
     int ret;
 
-    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
-        if (!vfio_devices_all_device_dirty_tracking(container) &&
-            container->dirty_pages_supported) {
+    if (iotlb && vfio_devices_all_running_and_mig_active(bcontainer)) {
+        if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
+            bcontainer->dirty_pages_supported) {
             return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
         }
 
@@ -176,8 +189,8 @@ int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
          */
         if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
             container->iommu_type == VFIO_TYPE1v2_IOMMU) {
-            trace_vfio_dma_unmap_overflow_workaround();
-            unmap.size -= 1ULL << ctz64(container->pgsizes);
+            trace_vfio_legacy_dma_unmap_overflow_workaround();
+            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
             continue;
         }
         error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
@@ -185,7 +198,7 @@ int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
     }
 
     if (need_dirty_sync) {
-        ret = vfio_get_dirty_bitmap(container, iova, size,
+        ret = vfio_get_dirty_bitmap(bcontainer, iova, size,
                                     iotlb->translated_addr);
         if (ret) {
             return ret;
@@ -195,9 +208,13 @@ int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
     return 0;
 }
 
-int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                 ram_addr_t size, void *vaddr, bool readonly)
+static int vfio_legacy_dma_map(VFIOContainer *bcontainer, hwaddr iova,
+                               ram_addr_t size, void *vaddr, bool readonly)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer,
+                                                  bcontainer);
+
     struct vfio_iommu_type1_dma_map map = {
         .argsz = sizeof(map),
         .flags = VFIO_DMA_MAP_FLAG_READ,
@@ -216,7 +233,8 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
      * the VGA ROM space.
      */
     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
+        (errno == EBUSY &&
+         vfio_legacy_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
         return 0;
     }
@@ -225,14 +243,18 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
-int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
+static int vfio_legacy_set_dirty_page_tracking(VFIOContainer *bcontainer,
+                                               bool start)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer,
+                                                  bcontainer);
     int ret;
     struct vfio_iommu_type1_dirty_bitmap dirty = {
         .argsz = sizeof(dirty),
     };
 
-    if (!container->dirty_pages_supported) {
+    if (!bcontainer->dirty_pages_supported) {
         return 0;
     }
 
@@ -252,9 +274,13 @@ int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     return ret;
 }
 
-int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
-                            hwaddr iova, hwaddr size)
+static int vfio_legacy_query_dirty_bitmap(VFIOContainer *bcontainer,
+                                          VFIOBitmap *vbmap,
+                                          hwaddr iova, hwaddr size)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer,
+                                                  bcontainer);
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
     int ret;
@@ -289,18 +315,24 @@ int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
     return ret;
 }
 
-static void vfio_listener_release(VFIOContainer *container)
+static void vfio_listener_release(VFIOLegacyContainer *container)
 {
-    memory_listener_unregister(&container->listener);
+    VFIOContainer *bcontainer = &container->bcontainer;
+
+    memory_listener_unregister(&bcontainer->listener);
     if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
         memory_listener_unregister(&container->prereg_listener);
     }
 }
 
-int vfio_container_add_section_window(VFIOContainer *container,
-                                      MemoryRegionSection *section,
-                                      Error **errp)
+static int
+vfio_legacy_add_section_window(VFIOContainer *bcontainer,
+                               MemoryRegionSection *section,
+                               Error **errp)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer,
+                                                  bcontainer);
     VFIOHostDMAWindow *hostwin;
     hwaddr pgsize = 0;
     int ret;
@@ -310,7 +342,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
     }
 
     /* For now intersections are not allowed, we may relax this later */
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
         if (ranges_overlap(hostwin->min_iova,
                            hostwin->max_iova - hostwin->min_iova + 1,
                            section->offset_within_address_space,
@@ -332,7 +364,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
         return ret;
     }
 
-    vfio_host_win_add(container, section->offset_within_address_space,
+    vfio_host_win_add(bcontainer, section->offset_within_address_space,
                       section->offset_within_address_space +
                       int128_get64(section->size) - 1, pgsize);
 #ifdef CONFIG_KVM
@@ -365,16 +397,21 @@ int vfio_container_add_section_window(VFIOContainer *container,
     return 0;
 }
 
-void vfio_container_del_section_window(VFIOContainer *container,
-                                       MemoryRegionSection *section)
+static void
+vfio_legacy_del_section_window(VFIOContainer *bcontainer,
+                               MemoryRegionSection *section)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer,
+                                                  bcontainer);
+
     if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
         return;
     }
 
     vfio_spapr_remove_window(container,
                              section->offset_within_address_space);
-    if (vfio_host_win_del(container,
+    if (vfio_host_win_del(bcontainer,
                           section->offset_within_address_space,
                           section->offset_within_address_space +
                           int128_get64(section->size) - 1) < 0) {
@@ -427,7 +464,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
 /*
  * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
  */
-static int vfio_get_iommu_type(VFIOContainer *container,
+static int vfio_get_iommu_type(VFIOLegacyContainer *container,
                                Error **errp)
 {
     int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
@@ -443,7 +480,7 @@ static int vfio_get_iommu_type(VFIOContainer *container,
     return -EINVAL;
 }
 
-static int vfio_init_container(VFIOContainer *container, int group_fd,
+static int vfio_init_container(VFIOLegacyContainer *container, int group_fd,
                                Error **errp)
 {
     int iommu_type, ret;
@@ -478,7 +515,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     return 0;
 }
 
-static int vfio_get_iommu_info(VFIOContainer *container,
+static int vfio_get_iommu_info(VFIOLegacyContainer *container,
                                struct vfio_iommu_type1_info **info)
 {
 
@@ -522,11 +559,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
     return NULL;
 }
 
-static void vfio_get_iommu_info_migration(VFIOContainer *container,
-                                         struct vfio_iommu_type1_info *info)
+static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container,
+                                          struct vfio_iommu_type1_info *info)
 {
     struct vfio_info_cap_header *hdr;
     struct vfio_iommu_type1_info_cap_migration *cap_mig;
+    VFIOContainer *bcontainer = &container->bcontainer;
 
     hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
     if (!hdr) {
@@ -541,16 +579,19 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container,
      * qemu_real_host_page_size to mark those dirty.
      */
     if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
-        container->dirty_pages_supported = true;
-        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
-        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+        bcontainer->dirty_pages_supported = true;
+        bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
     }
 }
 
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                                   Error **errp)
 {
-    VFIOContainer *container;
+    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
+        object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
+    VFIOContainer *bcontainer;
+    VFIOLegacyContainer *container;
     int ret, fd;
     VFIOAddressSpace *space;
 
@@ -587,7 +628,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * details once we know which type of IOMMU we are using.
      */
 
-    QLIST_FOREACH(container, &space->containers, next) {
+    QLIST_FOREACH(bcontainer, &space->containers, next) {
+        container = container_of(bcontainer, VFIOLegacyContainer, bcontainer);
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             ret = vfio_ram_block_discard_disable(container, true);
             if (ret) {
@@ -623,14 +665,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     container = g_malloc0(sizeof(*container));
-    container->space = space;
     container->fd = fd;
-    container->error = NULL;
-    container->dirty_pages_supported = false;
-    container->dma_max_mappings = 0;
-    QLIST_INIT(&container->giommu_list);
-    QLIST_INIT(&container->hostwin_list);
-    QLIST_INIT(&container->vrdl_list);
+    bcontainer = &container->bcontainer;
+    vfio_container_init(bcontainer, space, ops);
 
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
@@ -656,13 +693,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
 
         if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
-            container->pgsizes = info->iova_pgsizes;
+            bcontainer->pgsizes = info->iova_pgsizes;
         } else {
-            container->pgsizes = qemu_real_host_page_size();
+            bcontainer->pgsizes = qemu_real_host_page_size();
         }
 
-        if (!vfio_get_info_dma_avail(info, &container->dma_max_mappings)) {
-            container->dma_max_mappings = 65535;
+        if (!vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings)) {
+            bcontainer->dma_max_mappings = 65535;
         }
         vfio_get_iommu_info_migration(container, info);
         g_free(info);
@@ -672,7 +709,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
          * information to get the actual window extent rather than assume
          * a 64-bit IOVA address space.
          */
-        vfio_host_win_add(container, 0, (hwaddr)-1, container->pgsizes);
+        vfio_host_win_add(bcontainer, 0, (hwaddr)-1, bcontainer->pgsizes);
 
         break;
     }
@@ -699,10 +736,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
             memory_listener_register(&container->prereg_listener,
                                      &address_space_memory);
-            if (container->error) {
+            if (bcontainer->error) {
                 memory_listener_unregister(&container->prereg_listener);
                 ret = -1;
-                error_propagate_prepend(errp, container->error,
+                error_propagate_prepend(errp, bcontainer->error,
                     "RAM memory listener initialization failed: ");
                 goto enable_discards_exit;
             }
@@ -721,7 +758,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
 
         if (v2) {
-            container->pgsizes = info.ddw.pgsizes;
+            bcontainer->pgsizes = info.ddw.pgsizes;
             /*
              * There is a default window in just created container.
              * To make region_add/del simpler, we better remove this
@@ -736,8 +773,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             }
         } else {
             /* The default table uses 4K pages */
-            container->pgsizes = 0x1000;
-            vfio_host_win_add(container, info.dma32_window_start,
+            bcontainer->pgsizes = 0x1000;
+            vfio_host_win_add(bcontainer, info.dma32_window_start,
                               info.dma32_window_start +
                               info.dma32_window_size - 1,
                               0x1000);
@@ -748,28 +785,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     vfio_kvm_device_add_group(group);
 
     QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&space->containers, container, next);
+    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
 
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
-    container->listener = vfio_memory_listener;
+    bcontainer->listener = vfio_memory_listener;
 
-    memory_listener_register(&container->listener, container->space->as);
+    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
 
-    if (container->error) {
+    if (bcontainer->error) {
         ret = -1;
-        error_propagate_prepend(errp, container->error,
+        error_propagate_prepend(errp, bcontainer->error,
             "memory listener initialization failed: ");
         goto listener_release_exit;
     }
 
-    container->initialized = true;
+    bcontainer->initialized = true;
 
     return 0;
 listener_release_exit:
     QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(container, next);
+    QLIST_REMOVE(bcontainer, next);
     vfio_kvm_device_del_group(group);
     vfio_listener_release(container);
 
@@ -790,7 +827,8 @@ put_space_exit:
 
 static void vfio_disconnect_container(VFIOGroup *group)
 {
-    VFIOContainer *container = group->container;
+    VFIOLegacyContainer *container = group->container;
+    VFIOContainer *bcontainer = &container->bcontainer;
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
@@ -810,25 +848,9 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 
     if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = container->space;
-        VFIOGuestIOMMU *giommu, *tmp;
-        VFIOHostDMAWindow *hostwin, *next;
-
-        QLIST_REMOVE(container, next);
-
-        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(
-                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
-            QLIST_REMOVE(giommu, giommu_next);
-            g_free(giommu);
-        }
-
-        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
-                           next) {
-            QLIST_REMOVE(hostwin, hostwin_next);
-            g_free(hostwin);
-        }
+        VFIOAddressSpace *space = bcontainer->space;
 
+        vfio_container_destroy(bcontainer);
         trace_vfio_disconnect_container(container->fd);
         close(container->fd);
         g_free(container);
@@ -840,13 +862,15 @@ static void vfio_disconnect_container(VFIOGroup *group)
 static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
+    VFIOContainer *bcontainer;
     char path[32];
     struct vfio_group_status status = { .argsz = sizeof(status) };
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
         if (group->groupid == groupid) {
             /* Found it.  Now is it already in the right context? */
-            if (group->container->space->as == as) {
+            bcontainer = &group->container->bcontainer;
+            if (bcontainer->space->as == as) {
                 return group;
             } else {
                 error_setg(errp, "group %d used in multiple address spaces",
@@ -990,7 +1014,7 @@ static void vfio_put_base_device(VFIODevice *vbasedev)
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
-static bool vfio_eeh_container_ok(VFIOContainer *container)
+static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
 {
     /*
      * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
@@ -1018,7 +1042,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container)
     return true;
 }
 
-static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
+static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op)
 {
     struct vfio_eeh_pe_op pe_op = {
         .argsz = sizeof(pe_op),
@@ -1041,19 +1065,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
     return ret;
 }
 
-static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
+static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
 {
     VFIOAddressSpace *space = vfio_get_address_space(as);
-    VFIOContainer *container = NULL;
+    VFIOLegacyContainer *container = NULL;
+    VFIOContainer *bcontainer = NULL;
 
     if (QLIST_EMPTY(&space->containers)) {
         /* No containers to act on */
         goto out;
     }
 
-    container = QLIST_FIRST(&space->containers);
+    bcontainer = QLIST_FIRST(&space->containers);
+    container = container_of(bcontainer, VFIOLegacyContainer, bcontainer);
 
-    if (QLIST_NEXT(container, next)) {
+    if (QLIST_NEXT(bcontainer, next)) {
         /*
          * We don't yet have logic to synchronize EEH state across
          * multiple containers
@@ -1069,14 +1095,14 @@ out:
 
 bool vfio_eeh_as_ok(AddressSpace *as)
 {
-    VFIOContainer *container = vfio_eeh_as_container(as);
+    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
 
     return (container != NULL) && vfio_eeh_container_ok(container);
 }
 
 int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
 {
-    VFIOContainer *container = vfio_eeh_as_container(as);
+    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
 
     if (!container) {
         return -ENODEV;
@@ -1110,8 +1136,8 @@ static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
     return groupid;
 }
 
-int vfio_attach_device(char *name, VFIODevice *vbasedev,
-                       AddressSpace *as, Error **errp)
+static int vfio_legacy_attach_device(char *name, VFIODevice *vbasedev,
+                                     AddressSpace *as, Error **errp)
 {
     int groupid = vfio_device_groupid(vbasedev, errp);
     VFIODevice *vbasedev_iter;
@@ -1137,15 +1163,46 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
     ret = vfio_get_device(group, name, vbasedev, errp);
     if (ret) {
         vfio_put_group(group);
+        return ret;
     }
+    vbasedev->container = &group->container->bcontainer;
 
     return ret;
 }
 
-void vfio_detach_device(VFIODevice *vbasedev)
+static void vfio_legacy_detach_device(VFIODevice *vbasedev)
 {
     VFIOGroup *group = vbasedev->group;
 
     vfio_put_base_device(vbasedev);
     vfio_put_group(group);
+    vbasedev->container = NULL;
+}
+
+static void vfio_iommu_backend_legacy_ops_class_init(ObjectClass *oc,
+                                                     void *data) {
+    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(oc);
+
+    ops->dev_iter_next = vfio_legacy_dev_iter_next;
+    ops->dma_map = vfio_legacy_dma_map;
+    ops->dma_unmap = vfio_legacy_dma_unmap;
+    ops->attach_device = vfio_legacy_attach_device;
+    ops->detach_device = vfio_legacy_detach_device;
+    ops->set_dirty_page_tracking = vfio_legacy_set_dirty_page_tracking;
+    ops->query_dirty_bitmap = vfio_legacy_query_dirty_bitmap;
+    ops->add_window = vfio_legacy_add_section_window;
+    ops->del_window = vfio_legacy_del_section_window;
+}
+
+static const TypeInfo vfio_iommu_backend_legacy_ops_type = {
+    .name = TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS,
+
+    .parent = TYPE_VFIO_IOMMU_BACKEND_OPS,
+    .class_init = vfio_iommu_backend_legacy_ops_class_init,
+    .abstract = true,
+};
+static void vfio_iommu_backend_legacy_ops_register_types(void)
+{
+    type_register_static(&vfio_iommu_backend_legacy_ops_type);
 }
+type_init(vfio_iommu_backend_legacy_ops_register_types);
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 2a6912c940..eb6ce6229d 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'helpers.c',
   'common.c',
+  'container-base.c',
   'container.c',
   'spapr.c',
   'migration.c',
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 9ec1e95f6d..7647e7d492 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
 static void vfio_prereg_listener_region_add(MemoryListener *listener,
                                             MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            prereg_listener);
+    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
+                                                  prereg_listener);
     const hwaddr gpa = section->offset_within_address_space;
     hwaddr end;
     int ret;
@@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
          * can gracefully fail.  Runtime, there's not much we can do other
          * than throw a hardware error.
          */
-        if (!container->initialized) {
-            if (!container->error) {
-                error_setg_errno(&container->error, -ret,
+        if (!container->bcontainer.initialized) {
+            if (!container->bcontainer.error) {
+                error_setg_errno(&container->bcontainer.error, -ret,
                                  "Memory registering failed");
             }
         } else {
@@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
 static void vfio_prereg_listener_region_del(MemoryListener *listener,
                                             MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            prereg_listener);
+    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
+                                                  prereg_listener);
     const hwaddr gpa = section->offset_within_address_space;
     hwaddr end;
     int ret;
@@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
     .region_del = vfio_prereg_listener_region_del,
 };
 
-int vfio_spapr_create_window(VFIOContainer *container,
+int vfio_spapr_create_window(VFIOLegacyContainer *container,
                              MemoryRegionSection *section,
                              hwaddr *pgsize)
 {
@@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container,
     if (pagesize > rampagesize) {
         pagesize = rampagesize;
     }
-    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
+    pgmask = container->bcontainer.pgsizes & (pagesize | (pagesize - 1));
     pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
     if (!pagesize) {
         error_report("Host doesn't support page size 0x%"PRIx64
                      ", the supported mask is 0x%lx",
                      memory_region_iommu_get_min_page_size(iommu_mr),
-                     container->pgsizes);
+                     container->bcontainer.pgsizes);
         return -EINVAL;
     }
 
@@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
     return 0;
 }
 
-int vfio_spapr_remove_window(VFIOContainer *container,
+int vfio_spapr_remove_window(VFIOLegacyContainer *container,
                              hwaddr offset_within_address_space)
 {
     struct vfio_iommu_spapr_tce_remove remove = {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index bd32970854..1692bcd8f1 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -119,8 +119,8 @@ vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Re
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%08x"
-vfio_dma_unmap_overflow_workaround(void) ""
-vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start, uint64_t dirty_pages) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64" dirty_pages=%"PRIu64
+vfio_legacy_dma_unmap_overflow_workaround(void) ""
+vfio_get_dirty_bitmap(uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start, uint64_t dirty_pages) "iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64" dirty_pages=%"PRIu64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
 
 # platform.c
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 95bcafdaf6..b1a76dcc9c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -30,6 +30,7 @@
 #include <linux/vfio.h>
 #endif
 #include "sysemu/sysemu.h"
+#include "hw/vfio/vfio-container-base.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -74,64 +75,22 @@ typedef struct VFIOMigration {
     bool initial_data_sent;
 } VFIOMigration;
 
-typedef struct VFIOAddressSpace {
-    AddressSpace *as;
-    QLIST_HEAD(, VFIOContainer) containers;
-    QLIST_ENTRY(VFIOAddressSpace) list;
-} VFIOAddressSpace;
-
 struct VFIOGroup;
 
-typedef struct VFIOContainer {
-    VFIOAddressSpace *space;
+typedef struct VFIOLegacyContainer {
+    VFIOContainer bcontainer;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
-    MemoryListener listener;
     MemoryListener prereg_listener;
     unsigned iommu_type;
-    Error *error;
-    bool initialized;
-    bool dirty_pages_supported;
-    uint64_t dirty_pgsizes;
-    uint64_t max_dirty_bitmap_size;
-    unsigned long pgsizes;
-    unsigned int dma_max_mappings;
-    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
-    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
-    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
-    QLIST_ENTRY(VFIOContainer) next;
-} VFIOContainer;
-
-typedef struct VFIOGuestIOMMU {
-    VFIOContainer *container;
-    IOMMUMemoryRegion *iommu_mr;
-    hwaddr iommu_offset;
-    IOMMUNotifier n;
-    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
-} VFIOGuestIOMMU;
-
-typedef struct VFIORamDiscardListener {
-    VFIOContainer *container;
-    MemoryRegion *mr;
-    hwaddr offset_within_address_space;
-    hwaddr size;
-    uint64_t granularity;
-    RamDiscardListener listener;
-    QLIST_ENTRY(VFIORamDiscardListener) next;
-} VFIORamDiscardListener;
-
-typedef struct VFIOHostDMAWindow {
-    hwaddr min_iova;
-    hwaddr max_iova;
-    uint64_t iova_pgsizes;
-    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
-} VFIOHostDMAWindow;
+} VFIOLegacyContainer;
 
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef struct VFIODevice {
     QLIST_ENTRY(VFIODevice) next;
     struct VFIOGroup *group;
+    VFIOContainer *container;
     char *sysfsdev;
     char *name;
     DeviceState *dev;
@@ -165,7 +124,7 @@ struct VFIODeviceOps {
 typedef struct VFIOGroup {
     int fd;
     int groupid;
-    VFIOContainer *container;
+    VFIOLegacyContainer *container;
     QLIST_HEAD(, VFIODevice) device_list;
     QLIST_ENTRY(VFIOGroup) next;
     QLIST_ENTRY(VFIOGroup) container_next;
@@ -198,37 +157,13 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
-typedef struct {
-    unsigned long *bitmap;
-    hwaddr size;
-    hwaddr pages;
-} VFIOBitmap;
-
-void vfio_host_win_add(VFIOContainer *container,
+void vfio_host_win_add(VFIOContainer *bcontainer,
                        hwaddr min_iova, hwaddr max_iova,
                        uint64_t iova_pgsizes);
-int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
                       hwaddr max_iova);
 VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
 void vfio_put_address_space(VFIOAddressSpace *space);
-bool vfio_devices_all_running_and_saving(VFIOContainer *container);
-
-/* container->fd */
-VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
-                                         VFIODevice *curr);
-int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
-                   ram_addr_t size, IOMMUTLBEntry *iotlb);
-int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                 ram_addr_t size, void *vaddr, bool readonly);
-int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
-int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
-                            hwaddr iova, hwaddr size);
-
-int vfio_container_add_section_window(VFIOContainer *container,
-                                      MemoryRegionSection *section,
-                                      Error **errp);
-void vfio_container_del_section_window(VFIOContainer *container,
-                                       MemoryRegionSection *section);
 
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
@@ -285,10 +220,10 @@ vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
-int vfio_spapr_create_window(VFIOContainer *container,
+int vfio_spapr_create_window(VFIOLegacyContainer *container,
                              MemoryRegionSection *section,
                              hwaddr *pgsize);
-int vfio_spapr_remove_window(VFIOContainer *container,
+int vfio_spapr_remove_window(VFIOLegacyContainer *container,
                              hwaddr offset_within_address_space);
 
 bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
new file mode 100644
index 0000000000..b18fa92146
--- /dev/null
+++ b/include/hw/vfio/vfio-container-base.h
@@ -0,0 +1,155 @@
+/*
+ * VFIO BASE CONTAINER
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ * Copyright Red Hat, Inc. 2023
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_VFIO_VFIO_BASE_CONTAINER_H
+#define HW_VFIO_VFIO_BASE_CONTAINER_H
+
+#include "exec/memory.h"
+#ifndef CONFIG_USER_ONLY
+#include "exec/hwaddr.h"
+#endif
+
+typedef struct VFIOContainer VFIOContainer;
+
+typedef struct VFIOAddressSpace {
+    AddressSpace *as;
+    QLIST_HEAD(, VFIOContainer) containers;
+    QLIST_ENTRY(VFIOAddressSpace) list;
+} VFIOAddressSpace;
+
+typedef struct VFIOGuestIOMMU {
+    VFIOContainer *container;
+    IOMMUMemoryRegion *iommu_mr;
+    hwaddr iommu_offset;
+    IOMMUNotifier n;
+    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
+} VFIOGuestIOMMU;
+
+typedef struct VFIORamDiscardListener {
+    VFIOContainer *container;
+    MemoryRegion *mr;
+    hwaddr offset_within_address_space;
+    hwaddr size;
+    uint64_t granularity;
+    RamDiscardListener listener;
+    QLIST_ENTRY(VFIORamDiscardListener) next;
+} VFIORamDiscardListener;
+
+typedef struct VFIOHostDMAWindow {
+    hwaddr min_iova;
+    hwaddr max_iova;
+    uint64_t iova_pgsizes;
+    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
+} VFIOHostDMAWindow;
+
+typedef struct {
+    unsigned long *bitmap;
+    hwaddr size;
+    hwaddr pages;
+} VFIOBitmap;
+
+typedef struct VFIODevice VFIODevice;
+typedef struct VFIOIOMMUBackendOpsClass VFIOIOMMUBackendOpsClass;
+
+/*
+ * This is the base object for vfio container backends
+ */
+struct VFIOContainer {
+    VFIOIOMMUBackendOpsClass *ops;
+    VFIOAddressSpace *space;
+    MemoryListener listener;
+    Error *error;
+    bool initialized;
+    bool dirty_pages_supported;
+    uint64_t dirty_pgsizes;
+    uint64_t max_dirty_bitmap_size;
+    unsigned long pgsizes;
+    unsigned int dma_max_mappings;
+    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
+    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
+    QLIST_ENTRY(VFIOContainer) next;
+};
+
+VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
+                                 VFIODevice *curr);
+int vfio_container_dma_map(VFIOContainer *container,
+                           hwaddr iova, ram_addr_t size,
+                           void *vaddr, bool readonly);
+int vfio_container_dma_unmap(VFIOContainer *container,
+                             hwaddr iova, ram_addr_t size,
+                             IOMMUTLBEntry *iotlb);
+bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
+int vfio_container_set_dirty_page_tracking(VFIOContainer *container,
+                                            bool start);
+int vfio_container_query_dirty_bitmap(VFIOContainer *container,
+                                      VFIOBitmap *vbmap,
+                                      hwaddr iova, hwaddr size);
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp);
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section);
+
+void vfio_container_init(VFIOContainer *container,
+                         VFIOAddressSpace *space,
+                         struct VFIOIOMMUBackendOpsClass *ops);
+void vfio_container_destroy(VFIOContainer *container);
+
+#define TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS "vfio-iommu-backend-legacy-ops"
+#define TYPE_VFIO_IOMMU_BACKEND_OPS "vfio-iommu-backend-ops"
+
+DECLARE_CLASS_CHECKERS(VFIOIOMMUBackendOpsClass,
+                       VFIO_IOMMU_BACKEND_OPS, TYPE_VFIO_IOMMU_BACKEND_OPS)
+
+struct VFIOIOMMUBackendOpsClass {
+    /*< private >*/
+    ObjectClass parent_class;
+
+    /*< public >*/
+    /* required */
+    VFIODevice *(*dev_iter_next)(VFIOContainer *container, VFIODevice *curr);
+    int (*dma_map)(VFIOContainer *container,
+                   hwaddr iova, ram_addr_t size,
+                   void *vaddr, bool readonly);
+    int (*dma_unmap)(VFIOContainer *container,
+                     hwaddr iova, ram_addr_t size,
+                     IOMMUTLBEntry *iotlb);
+    int (*attach_device)(char *name, VFIODevice *vbasedev,
+                         AddressSpace *as, Error **errp);
+    void (*detach_device)(VFIODevice *vbasedev);
+    /* migration feature */
+    int (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
+    int (*query_dirty_bitmap)(VFIOContainer *bcontainer, VFIOBitmap *vbmap,
+                              hwaddr iova, hwaddr size);
+
+    /* SPAPR specific */
+    int (*add_window)(VFIOContainer *container,
+                      MemoryRegionSection *section,
+                      Error **errp);
+    void (*del_window)(VFIOContainer *container,
+                       MemoryRegionSection *section);
+};
+
+
+#endif /* HW_VFIO_VFIO_BASE_CONTAINER_H */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 14/22] vfio/common: Simplify vfio_viommu_preset()
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (12 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 13/22] vfio: Add base container Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-19 16:01   ` Cédric Le Goater
  2023-08-30 10:37 ` [PATCH v1 15/22] Add iommufd configure option Zhenzhong Duan
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

Commit "vfio/container-base: Introduce [attach/detach]_device container callbacks"
add support to link to address space, we can utilize it to simplify
vfio_viommu_preset().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c | 17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 86b6af5740..6c3e98d5fd 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -165,22 +165,7 @@ void vfio_unblock_multiple_devices_migration(void)
 
 bool vfio_viommu_preset(VFIODevice *vbasedev)
 {
-    VFIOAddressSpace *space;
-    VFIOContainer *container;
-    VFIODevice *tmp_dev;
-
-    QLIST_FOREACH(space, &vfio_address_spaces, list) {
-        QLIST_FOREACH(container, &space->containers, next) {
-            tmp_dev = NULL;
-            while ((tmp_dev = vfio_container_dev_iter_next(container,
-                                                           tmp_dev))) {
-                if (vbasedev == tmp_dev) {
-                    return space->as != &address_space_memory;
-                }
-            }
-        }
-    }
-    g_assert_not_reached();
+    return vbasedev->container->space->as != &address_space_memory;
 }
 
 static void vfio_set_migration_error(int err)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 15/22] Add iommufd configure option
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (13 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 14/22] vfio/common: Simplify vfio_viommu_preset() Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-19 17:07   ` Cédric Le Goater
  2023-08-30 10:37 ` [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object Zhenzhong Duan
                   ` (8 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé

This adds "--enable-iommufd/--disable-iommufd" to enable or disable
iommufd support, enabled by default.
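
For example, a hypothetical build invocation enabling the feature
explicitly might look like the following (the target list is purely
illustrative):

    ../configure --enable-iommufd --target-list=x86_64-softmmu
    ninja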

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 meson.build                   | 6 ++++++
 meson_options.txt             | 2 ++
 scripts/meson-buildoptions.sh | 3 +++
 3 files changed, 11 insertions(+)

diff --git a/meson.build b/meson.build
index 98e68ef0b1..6526d8cc9b 100644
--- a/meson.build
+++ b/meson.build
@@ -574,6 +574,10 @@ have_tpm = get_option('tpm') \
   .require(targetos != 'windows', error_message: 'TPM emulation only available on POSIX systems') \
   .allowed()
 
+have_iommufd = get_option('iommufd') \
+  .require(targetos == 'linux', error_message: 'iommufd is supported only on Linux') \
+  .allowed()
+
 # vhost
 have_vhost_user = get_option('vhost_user') \
   .disable_auto_if(targetos != 'linux') \
@@ -2129,6 +2133,7 @@ endif
 config_host_data.set('CONFIG_SNAPPY', snappy.found())
 config_host_data.set('CONFIG_TPM', have_tpm)
 config_host_data.set('CONFIG_TSAN', get_option('tsan'))
+config_host_data.set('CONFIG_IOMMUFD', have_iommufd)
 config_host_data.set('CONFIG_USB_LIBUSB', libusb.found())
 config_host_data.set('CONFIG_VDE', vde.found())
 config_host_data.set('CONFIG_VHOST_NET', have_vhost_net)
@@ -4051,6 +4056,7 @@ summary_info += {'vhost-user-crypto support': have_vhost_user_crypto}
 summary_info += {'vhost-user-blk server support': have_vhost_user_blk_server}
 summary_info += {'vhost-vdpa support': have_vhost_vdpa}
 summary_info += {'build guest agent': have_ga}
+summary_info += {'iommufd support': have_iommufd}
 summary(summary_info, bool_yn: true, section: 'Configurable features')
 
 # Compilation information
diff --git a/meson_options.txt b/meson_options.txt
index aaea5ddd77..aed91d173b 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -105,6 +105,8 @@ option('dbus_display', type: 'feature', value: 'auto',
        description: '-display dbus support')
 option('tpm', type : 'feature', value : 'auto',
        description: 'TPM support')
+option('iommufd', type : 'feature', value : 'auto',
+       description: 'iommufd support')
 
 # Do not enable it by default even for Mingw32, because it doesn't
 # work on Wine.
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 9da3fe299b..719401ffb0 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -113,6 +113,7 @@ meson_options_help() {
   printf "%s\n" '  hax             HAX acceleration support'
   printf "%s\n" '  hvf             HVF acceleration support'
   printf "%s\n" '  iconv           Font glyph conversion support'
+  printf "%s\n" '  iommufd         iommufd support'
   printf "%s\n" '  jack            JACK sound support'
   printf "%s\n" '  keyring         Linux keyring support'
   printf "%s\n" '  kvm             KVM acceleration support'
@@ -325,6 +326,8 @@ _meson_option_parse() {
     --enable-install-blobs) printf "%s" -Dinstall_blobs=true ;;
     --disable-install-blobs) printf "%s" -Dinstall_blobs=false ;;
     --interp-prefix=*) quote_sh "-Dinterp_prefix=$2" ;;
+    --enable-iommufd) printf "%s" -Diommufd=enabled ;;
+    --disable-iommufd) printf "%s" -Diommufd=disabled ;;
     --enable-jack) printf "%s" -Djack=enabled ;;
     --disable-jack) printf "%s" -Djack=disabled ;;
     --enable-keyring) printf "%s" -Dkeyring=enabled ;;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (14 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 15/22] Add iommufd configure option Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-22  7:15   ` Cédric Le Goater
  2023-08-30 10:37 ` [PATCH v1 17/22] util/char_dev: Add open_cdev() Zhenzhong Duan
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé,
	Eduardo Habkost

From: Eric Auger <eric.auger@redhat.com>

Introduce an iommufd object which handles the interaction
with the host /dev/iommu device.

The /dev/iommu device may already have been opened outside of QEMU, in
which case its fd can be passed directly along with the iommufd object.

This allows the iommufd object to be shared across several
subsystems (VFIO, VDPA, ...). For example, libvirt would open
/dev/iommu once.

If no fd is passed along with the iommufd object, /dev/iommu is
opened by the QEMU code itself.

The CONFIG_IOMMUFD option must be set to compile this new object.
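
As a rough sketch only (not part of this patch), a consumer such as a
VFIO backend is expected to drive the object along these lines;
example_map_one_range() is a hypothetical helper and error handling is
kept minimal:

    /* Hypothetical illustration of the backend API usage */
    static int example_map_one_range(IOMMUFDBackend *be, hwaddr iova,
                                     ram_addr_t size, void *vaddr,
                                     bool readonly, Error **errp)
    {
        uint32_t ioas_id;
        int ret;

        /* Opens /dev/iommu unless an external fd was provided */
        ret = iommufd_backend_connect(be, errp);
        if (ret) {
            return ret;
        }
        /* Allocate an I/O address space to hold the mappings */
        ret = iommufd_backend_get_ioas(be, &ioas_id);
        if (ret) {
            goto out_disconnect;
        }
        ret = iommufd_backend_map_dma(be, ioas_id, iova, size, vaddr, readonly);
        if (!ret) {
            ret = iommufd_backend_unmap_dma(be, ioas_id, iova, size);
        }
        iommufd_backend_put_ioas(be, ioas_id);
    out_disconnect:
        iommufd_backend_disconnect(be);
        return ret;
    }

A real backend would keep the IOAS alive across mappings instead of
tearing everything down per range; the sketch only shows the call order.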

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 MAINTAINERS              |   7 +
 backends/Kconfig         |   4 +
 backends/iommufd.c       | 291 +++++++++++++++++++++++++++++++++++++++
 backends/meson.build     |   3 +
 backends/trace-events    |  13 ++
 include/sysemu/iommufd.h |  49 +++++++
 qapi/qom.json            |  18 ++-
 qemu-options.hx          |  13 ++
 8 files changed, 397 insertions(+), 1 deletion(-)
 create mode 100644 backends/iommufd.c
 create mode 100644 include/sysemu/iommufd.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 6111b6b4d9..04663fbb6f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2079,6 +2079,13 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s390x@nongnu.org
 
+iommufd
+M: Yi Liu <yi.l.liu@intel.com>
+M: Eric Auger <eric.auger@redhat.com>
+S: Supported
+F: backends/iommufd.c
+F: include/sysemu/iommufd.h
+
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
 S: Supported
diff --git a/backends/Kconfig b/backends/Kconfig
index f35abc1609..2cb23f62fa 100644
--- a/backends/Kconfig
+++ b/backends/Kconfig
@@ -1 +1,5 @@
 source tpm/Kconfig
+
+config IOMMUFD
+    bool
+    depends on VFIO
diff --git a/backends/iommufd.c b/backends/iommufd.c
new file mode 100644
index 0000000000..07ea434424
--- /dev/null
+++ b/backends/iommufd.c
@@ -0,0 +1,291 @@
+/*
+ * iommufd container backend
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ * Copyright Red Hat, Inc. 2023
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "sysemu/iommufd.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qerror.h"
+#include "qemu/module.h"
+#include "qom/object_interfaces.h"
+#include "qemu/error-report.h"
+#include "monitor/monitor.h"
+#include "trace.h"
+#include <sys/ioctl.h>
+#include <linux/iommufd.h>
+
+static void iommufd_backend_init(Object *obj)
+{
+    IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
+
+    be->fd = -1;
+    be->users = 0;
+    be->owned = true;
+    qemu_mutex_init(&be->lock);
+}
+
+static void iommufd_backend_finalize(Object *obj)
+{
+    IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
+
+    if (be->owned) {
+        close(be->fd);
+        be->fd = -1;
+    }
+}
+
+static void iommufd_backend_set_fd(Object *obj, const char *str, Error **errp)
+{
+    IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
+    int fd = -1;
+
+    fd = monitor_fd_param(monitor_cur(), str, errp);
+    if (fd == -1) {
+        error_prepend(errp, "Could not parse remote object fd %s:", str);
+        return;
+    }
+    qemu_mutex_lock(&be->lock);
+    be->fd = fd;
+    be->owned = false;
+    qemu_mutex_unlock(&be->lock);
+    trace_iommu_backend_set_fd(be->fd);
+}
+
+static void iommufd_backend_class_init(ObjectClass *oc, void *data)
+{
+    object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
+}
+
+int iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
+{
+    int fd, ret = 0;
+
+    qemu_mutex_lock(&be->lock);
+    if (be->users == UINT32_MAX) {
+        error_setg(errp, "too many connections");
+        ret = -E2BIG;
+        goto out;
+    }
+    if (be->owned && !be->users) {
+        fd = qemu_open_old("/dev/iommu", O_RDWR);
+        if (fd < 0) {
+            error_setg_errno(errp, errno, "/dev/iommu opening failed");
+            ret = fd;
+            goto out;
+        }
+        be->fd = fd;
+    }
+    be->users++;
+out:
+    trace_iommufd_backend_connect(be->fd, be->owned,
+                                  be->users, ret);
+    qemu_mutex_unlock(&be->lock);
+    return ret;
+}
+
+void iommufd_backend_disconnect(IOMMUFDBackend *be)
+{
+    qemu_mutex_lock(&be->lock);
+    if (!be->users) {
+        goto out;
+    }
+    be->users--;
+    if (!be->users && be->owned) {
+        close(be->fd);
+        be->fd = -1;
+    }
+out:
+    trace_iommufd_backend_disconnect(be->fd, be->users);
+    qemu_mutex_unlock(&be->lock);
+}
+
+static int iommufd_backend_alloc_ioas(int fd, uint32_t *ioas)
+{
+    int ret;
+    struct iommu_ioas_alloc alloc_data  = {
+        .size = sizeof(alloc_data),
+        .flags = 0,
+    };
+
+    ret = ioctl(fd, IOMMU_IOAS_ALLOC, &alloc_data);
+    if (ret) {
+        error_report("Failed to allocate ioas %m");
+    }
+
+    *ioas = alloc_data.out_ioas_id;
+    trace_iommufd_backend_alloc_ioas(fd, *ioas, ret);
+
+    return ret;
+}
+
+void iommufd_backend_free_id(int fd, uint32_t id)
+{
+    int ret;
+    struct iommu_destroy des = {
+        .size = sizeof(des),
+        .id = id,
+    };
+
+    ret = ioctl(fd, IOMMU_DESTROY, &des);
+    trace_iommufd_backend_free_id(fd, id, ret);
+    if (ret) {
+        error_report("Failed to free id: %u %m", id);
+    }
+}
+
+int iommufd_backend_get_ioas(IOMMUFDBackend *be, uint32_t *ioas_id)
+{
+    int ret;
+
+    ret = iommufd_backend_alloc_ioas(be->fd, ioas_id);
+    trace_iommufd_backend_get_ioas(be->fd, *ioas_id, ret);
+    return ret;
+}
+
+void iommufd_backend_put_ioas(IOMMUFDBackend *be, uint32_t ioas)
+{
+    trace_iommufd_backend_put_ioas(be->fd, ioas);
+    iommufd_backend_free_id(be->fd, ioas);
+}
+
+int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas,
+                              hwaddr iova, ram_addr_t size)
+{
+    int ret;
+    struct iommu_ioas_unmap unmap = {
+        .size = sizeof(unmap),
+        .ioas_id = ioas,
+        .iova = iova,
+        .length = size,
+    };
+
+    ret = ioctl(be->fd, IOMMU_IOAS_UNMAP, &unmap);
+    trace_iommufd_backend_unmap_dma(be->fd, ioas, iova, size, ret);
+    if (ret && errno == ENOENT) {
+        ret = 0;
+    }
+    if (ret) {
+        error_report("IOMMU_IOAS_UNMAP failed: %s", strerror(errno));
+    }
+    return !ret ? 0 : -errno;
+}
+
+int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas, hwaddr iova,
+                            ram_addr_t size, void *vaddr, bool readonly)
+{
+    int ret;
+    struct iommu_ioas_map map = {
+        .size = sizeof(map),
+        .flags = IOMMU_IOAS_MAP_READABLE |
+                 IOMMU_IOAS_MAP_FIXED_IOVA,
+        .ioas_id = ioas,
+        .__reserved = 0,
+        .user_va = (int64_t)vaddr,
+        .iova = iova,
+        .length = size,
+    };
+
+    if (!readonly) {
+        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+    }
+
+    ret = ioctl(be->fd, IOMMU_IOAS_MAP, &map);
+    trace_iommufd_backend_map_dma(be->fd, ioas, iova, size,
+                                  vaddr, readonly, ret);
+    if (ret) {
+        error_report("IOMMU_IOAS_MAP failed: %s", strerror(errno));
+    }
+    return !ret ? 0 : -errno;
+}
+
+int iommufd_backend_copy_dma(IOMMUFDBackend *be, uint32_t src_ioas,
+                             uint32_t dst_ioas, hwaddr iova,
+                             ram_addr_t size, bool readonly)
+{
+    int ret;
+    struct iommu_ioas_copy copy = {
+        .size = sizeof(copy),
+        .flags = IOMMU_IOAS_MAP_READABLE |
+                 IOMMU_IOAS_MAP_FIXED_IOVA,
+        .dst_ioas_id = dst_ioas,
+        .src_ioas_id = src_ioas,
+        .length = size,
+        .dst_iova = iova,
+        .src_iova = iova,
+    };
+
+    if (!readonly) {
+        copy.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+    }
+
+    ret = ioctl(be->fd, IOMMU_IOAS_COPY, &copy);
+    trace_iommufd_backend_copy_dma(be->fd, src_ioas, dst_ioas,
+                                   iova, size, readonly, ret);
+    if (ret) {
+        error_report("IOMMU_IOAS_COPY failed: %s", strerror(errno));
+    }
+    return !ret ? 0 : -errno;
+}
+
+int iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id,
+                               uint32_t pt_id, uint32_t *out_hwpt)
+{
+    int ret;
+    struct iommu_hwpt_alloc alloc_hwpt = {
+        .size = sizeof(struct iommu_hwpt_alloc),
+        .flags = 0,
+        .dev_id = dev_id,
+        .pt_id = pt_id,
+        .__reserved = 0,
+    };
+
+    ret = ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
+    trace_iommufd_backend_alloc_hwpt(iommufd, dev_id, pt_id, ret);
+
+    if (ret) {
+        error_report("IOMMU_HWPT_ALLOC failed: %s", strerror(errno));
+    } else {
+        *out_hwpt = alloc_hwpt.out_hwpt_id;
+    }
+    return !ret ? 0 : -errno;
+}
+
+static const TypeInfo iommufd_backend_info = {
+    .name = TYPE_IOMMUFD_BACKEND,
+    .parent = TYPE_OBJECT,
+    .instance_size = sizeof(IOMMUFDBackend),
+    .instance_init = iommufd_backend_init,
+    .instance_finalize = iommufd_backend_finalize,
+    .class_size = sizeof(IOMMUFDBackendClass),
+    .class_init = iommufd_backend_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_USER_CREATABLE },
+        { }
+    }
+};
+
+static void register_types(void)
+{
+    type_register_static(&iommufd_backend_info);
+}
+
+type_init(register_types);
diff --git a/backends/meson.build b/backends/meson.build
index 914c7c4afb..29dc147c8e 100644
--- a/backends/meson.build
+++ b/backends/meson.build
@@ -20,6 +20,9 @@ if have_vhost_user
   system_ss.add(when: 'CONFIG_VIRTIO', if_true: files('vhost-user.c'))
 endif
 system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost.c'))
+if have_iommufd
+  system_ss.add(files('iommufd.c'))
+endif
 if have_vhost_user_crypto
   system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost-user.c'))
 endif
diff --git a/backends/trace-events b/backends/trace-events
index 652eb76a57..093e3eb1da 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -5,3 +5,16 @@ dbus_vmstate_pre_save(void)
 dbus_vmstate_post_load(int version_id) "version_id: %d"
 dbus_vmstate_loading(const char *id) "id: %s"
 dbus_vmstate_saving(const char *id) "id: %s"
+
+# iommufd.c
+iommufd_backend_connect(int fd, bool owned, uint32_t users, int ret) "fd=%d owned=%d users=%d (%d)"
+iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
+iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
+iommufd_backend_get_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)"
+iommufd_backend_put_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
+iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
+iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
+iommufd_backend_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, uint64_t iova, uint64_t size, bool readonly, int ret) " iommufd=%d src_ioas=%d dst_ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" readonly=%d (%d)"
+iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)"
+iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
+iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u (%d)"
diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
new file mode 100644
index 0000000000..f3bd212170
--- /dev/null
+++ b/include/sysemu/iommufd.h
@@ -0,0 +1,49 @@
+#ifndef SYSEMU_IOMMUFD_H
+#define SYSEMU_IOMMUFD_H
+
+#include "qom/object.h"
+#include "qemu/thread.h"
+#include "exec/hwaddr.h"
+#include "exec/cpu-common.h"
+
+#define TYPE_IOMMUFD_BACKEND "iommufd"
+OBJECT_DECLARE_TYPE(IOMMUFDBackend, IOMMUFDBackendClass,
+                    IOMMUFD_BACKEND)
+#define IOMMUFD_BACKEND(obj) \
+    OBJECT_CHECK(IOMMUFDBackend, (obj), TYPE_IOMMUFD_BACKEND)
+#define IOMMUFD_BACKEND_GET_CLASS(obj) \
+    OBJECT_GET_CLASS(IOMMUFDBackendClass, (obj), TYPE_IOMMUFD_BACKEND)
+#define IOMMUFD_BACKEND_CLASS(klass) \
+    OBJECT_CLASS_CHECK(IOMMUFDBackendClass, (klass), TYPE_IOMMUFD_BACKEND)
+struct IOMMUFDBackendClass {
+    ObjectClass parent_class;
+};
+
+struct IOMMUFDBackend {
+    Object parent;
+
+    /*< protected >*/
+    int fd;            /* /dev/iommu file descriptor */
+    bool owned;        /* is the /dev/iommu opened internally */
+    QemuMutex lock;
+    uint32_t users;
+
+    /*< public >*/
+};
+
+int iommufd_backend_connect(IOMMUFDBackend *be, Error **errp);
+void iommufd_backend_disconnect(IOMMUFDBackend *be);
+
+int iommufd_backend_get_ioas(IOMMUFDBackend *be, uint32_t *ioas_id);
+void iommufd_backend_put_ioas(IOMMUFDBackend *be, uint32_t ioas_id);
+void iommufd_backend_free_id(int fd, uint32_t id);
+int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas,
+                              hwaddr iova, ram_addr_t size);
+int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas, hwaddr iova,
+                            ram_addr_t size, void *vaddr, bool readonly);
+int iommufd_backend_copy_dma(IOMMUFDBackend *be, uint32_t src_ioas,
+                             uint32_t dst_ioas, hwaddr iova,
+                             ram_addr_t size, bool readonly);
+int iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id,
+                               uint32_t pt_id, uint32_t *out_hwpt);
+#endif
diff --git a/qapi/qom.json b/qapi/qom.json
index fa3e88c8e6..2646ac4cca 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -779,6 +779,18 @@
 { 'struct': 'VfioUserServerProperties',
   'data': { 'socket': 'SocketAddress', 'device': 'str' } }
 
+##
+# @IOMMUFDProperties:
+#
+# Properties for iommufd backend objects.
+#
+# @fd: file descriptor name
+#
+# Since: 8.2
+##
+{ 'struct': 'IOMMUFDProperties',
+  'data': { '*fd': 'str' } }
+
 ##
 # @RngProperties:
 #
@@ -933,6 +945,8 @@
     'qtest',
     'rng-builtin',
     'rng-egd',
+    { 'name': 'iommufd',
+      'if': 'CONFIG_IOMMUFD' },
     { 'name': 'rng-random',
       'if': 'CONFIG_POSIX' },
     'secret',
@@ -1014,7 +1028,9 @@
       'tls-creds-x509':             'TlsCredsX509Properties',
       'tls-cipher-suites':          'TlsCredsProperties',
       'x-remote-object':            'RemoteObjectProperties',
-      'x-vfio-user-server':         'VfioUserServerProperties'
+      'x-vfio-user-server':         'VfioUserServerProperties',
+      'iommufd':                    { 'type': 'IOMMUFDProperties',
+                                      'if': 'CONFIG_IOMMUFD' }
   } }
 
 ##
diff --git a/qemu-options.hx b/qemu-options.hx
index 29b98c3d4c..827dd085ee 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5098,6 +5098,19 @@ SRST
 
         The ``share`` boolean option is on by default with memfd.
 
+#ifdef CONFIG_IOMMUFD
+    ``-object iommufd,id=id[,fd=fd]``
+        Creates an iommufd backend which allows control of DMA mapping
+        through the /dev/iommu device.
+
+        The ``id`` parameter is a unique ID which frontends (such as
+        vfio-pci or vdpa) will use to connect with the iommufd backend.
+
+        The ``fd`` parameter is an optional pre-opened file descriptor
+        obtained by opening /dev/iommu. Usually the iommufd is shared
+        across all subsystems, bringing the benefit of centralized
+        reference counting.
+#endif
     ``-object rng-builtin,id=id``
         Creates a random number generator backend which obtains entropy
         from QEMU builtin functions. The ``id`` parameter is a unique ID
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 17/22] util/char_dev: Add open_cdev()
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (15 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-20 12:39   ` Daniel P. Berrangé
  2023-08-30 10:37 ` [PATCH v1 18/22] vfio/iommufd: Implement the iommufd backend Zhenzhong Duan
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

/dev/vfio/devices/vfioX may not exist. In that case it is still possible
to open /dev/char/$major:$minor instead. Add a helper function to abstract
the cdev open.
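
For illustration, a minimal sketch of a hypothetical caller; the vfio0 node
name and the way the major:minor pair is obtained are assumptions for
brevity, not part of this patch:

    #include "qemu/osdep.h"
    #include "qemu/char_dev.h"

    /*
     * Hypothetical caller: in the real series the major:minor pair is read
     * from the device's sysfs "dev" attribute; here it is passed in directly.
     */
    static int example_open_vfio_cdev(unsigned int dev_major,
                                      unsigned int dev_minor)
    {
        /*
         * Try the canonical path first; if /dev/vfio/devices/vfio0 does not
         * exist, open_cdev() falls back to /dev/char/<major>:<minor>.
         */
        return open_cdev("/dev/vfio/devices/vfio0",
                         makedev(dev_major, dev_minor));
    }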

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 MAINTAINERS             |  6 ++++
 include/qemu/char_dev.h | 16 +++++++++++
 util/chardev_open.c     | 61 +++++++++++++++++++++++++++++++++++++++++
 util/meson.build        |  1 +
 4 files changed, 84 insertions(+)
 create mode 100644 include/qemu/char_dev.h
 create mode 100644 util/chardev_open.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 04663fbb6f..74d18593fe 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3372,6 +3372,12 @@ S: Maintained
 F: include/qemu/iova-tree.h
 F: util/iova-tree.c
 
+cdev Open
+M: Yi Liu <yi.l.liu@intel.com>
+S: Maintained
+F: include/qemu/char_dev.h
+F: util/chardev_open.c
+
 elf2dmp
 M: Viktor Prutyanov <viktor.prutyanov@phystech.edu>
 S: Maintained
diff --git a/include/qemu/char_dev.h b/include/qemu/char_dev.h
new file mode 100644
index 0000000000..6580d351c6
--- /dev/null
+++ b/include/qemu/char_dev.h
@@ -0,0 +1,16 @@
+/*
+ * QEMU Chardev Helper
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_CHARDEV_HELPERS_H
+#define QEMU_CHARDEV_HELPERS_H
+
+int open_cdev(const char *devpath, dev_t cdev);
+#endif
diff --git a/util/chardev_open.c b/util/chardev_open.c
new file mode 100644
index 0000000000..d03e415131
--- /dev/null
+++ b/util/chardev_open.c
@@ -0,0 +1,61 @@
+/*
+ * Copyright (C) 2023 Intel Corporation.
+ * Copyright (c) 2019, Mellanox Technologies. All rights reserved.
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Copied from
+ * https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
+ *
+ */
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+#include "qemu/osdep.h"
+#include "qemu/char_dev.h"
+
+static int open_cdev_internal(const char *path, dev_t cdev)
+{
+    struct stat st;
+    int fd;
+
+    fd = qemu_open_old(path, O_RDWR);
+    if (fd == -1) {
+        return -1;
+    }
+    if (fstat(fd, &st) || !S_ISCHR(st.st_mode) ||
+        (cdev != 0 && st.st_rdev != cdev)) {
+        close(fd);
+        return -1;
+    }
+    return fd;
+}
+
+static int open_cdev_robust(dev_t cdev)
+{
+    char *devpath;
+    int ret;
+
+    /*
+     * This assumes that udev is being used and is creating the /dev/char/
+     * symlinks.
+     */
+    devpath = g_strdup_printf("/dev/char/%u:%u", major(cdev), minor(cdev));
+    ret = open_cdev_internal(devpath, cdev);
+    g_free(devpath);
+    return ret;
+}
+
+int open_cdev(const char *devpath, dev_t cdev)
+{
+    int fd;
+
+    fd = open_cdev_internal(devpath, cdev);
+    if (fd == -1 && cdev != 0) {
+        return open_cdev_robust(cdev);
+    }
+    return fd;
+}
diff --git a/util/meson.build b/util/meson.build
index a375160286..d5313d858f 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -107,6 +107,7 @@ if have_block
     util_ss.add(files('filemonitor-stub.c'))
   endif
   util_ss.add(when: 'CONFIG_LINUX', if_true: files('vfio-helpers.c'))
+  util_ss.add(when: 'CONFIG_LINUX', if_true: files('chardev_open.c'))
 endif
 
 if cpu == 'aarch64'
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 18/22] vfio/iommufd: Implement the iommufd backend
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (16 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 17/22] util/char_dev: Add open_cdev() Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-08-30 10:37 ` [PATCH v1 19/22] vfio/iommufd: Add vfio device iterator callback for iommufd Zhenzhong Duan
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

Add the iommufd backend. The IOMMUFD container class is implemented
based on the new /dev/iommu user API. This backend obviously depends
on CONFIG_IOMMUFD.

So far, the iommufd backend doesn't support dirty page sync yet due
to missing support in the host kernel.
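
For orientation, a condensed, standalone sketch of the ioctl sequence this
backend drives; the /dev/vfio/devices/vfio0 path is an assumption for
illustration and error handling is omitted:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>
    #include <linux/vfio.h>

    static int example_cdev_attach(void)
    {
        int iommufd = open("/dev/iommu", O_RDWR);
        int devfd = open("/dev/vfio/devices/vfio0", O_RDWR);
        struct vfio_device_bind_iommufd bind = {
            .argsz = sizeof(bind), .iommufd = iommufd,
        };
        struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
        struct iommu_hwpt_alloc hwpt = { .size = sizeof(hwpt) };
        struct vfio_device_attach_iommufd_pt attach = {
            .argsz = sizeof(attach),
        };

        /* Bind the device to the iommufd, then IOAS alloc -> HWPT alloc -> attach */
        ioctl(devfd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
        ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);
        hwpt.dev_id = bind.out_devid;
        hwpt.pt_id = alloc.out_ioas_id;
        ioctl(iommufd, IOMMU_HWPT_ALLOC, &hwpt);
        attach.pt_id = hwpt.out_hwpt_id;
        /* DMA is then mapped on the IOAS with IOMMU_IOAS_MAP */
        return ioctl(devfd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
    }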

Co-authored-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/common.c                      |  12 +-
 hw/vfio/iommufd.c                     | 521 ++++++++++++++++++++++++++
 hw/vfio/meson.build                   |   3 +
 hw/vfio/trace-events                  |  12 +
 include/hw/vfio/vfio-common.h         |  25 ++
 include/hw/vfio/vfio-container-base.h |   3 +
 6 files changed, 574 insertions(+), 2 deletions(-)
 create mode 100644 hw/vfio/iommufd.c

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6c3e98d5fd..b40cd8544d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -45,7 +45,7 @@
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
 
-static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
+VFIOAddressSpaceList vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
 #ifdef CONFIG_KVM
@@ -1520,8 +1520,16 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
 {
     const VFIOIOMMUBackendOpsClass *ops;
 
-    ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
+#ifdef CONFIG_IOMMUFD
+    if (vbasedev->iommufd) {
+        ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
+                  object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_IOMMUFD_OPS));
+    } else
+#endif
+    {
+        ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
                   object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
+    }
     if (!ops) {
         error_setg(errp, "VFIO IOMMU Backend not found!");
         return -ENODEV;
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
new file mode 100644
index 0000000000..876d0e4928
--- /dev/null
+++ b/hw/vfio/iommufd.c
@@ -0,0 +1,521 @@
+/*
+ * iommufd container backend
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ * Copyright Red Hat, Inc. 2023
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include <linux/iommufd.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "sysemu/iommufd.h"
+#include "hw/qdev-core.h"
+#include "sysemu/reset.h"
+#include "qemu/cutils.h"
+#include "qemu/char_dev.h"
+
+static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova,
+                       ram_addr_t size, void *vaddr, bool readonly)
+{
+    VFIOIOMMUFDContainer *container =
+        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+
+    return iommufd_backend_map_dma(container->be,
+                                   container->ioas_id,
+                                   iova, size, vaddr, readonly);
+}
+
+static int iommufd_unmap(VFIOContainer *bcontainer,
+                         hwaddr iova, ram_addr_t size,
+                         IOMMUTLBEntry *iotlb)
+{
+    VFIOIOMMUFDContainer *container =
+        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+
+    /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */
+    return iommufd_backend_unmap_dma(container->be,
+                                     container->ioas_id, iova, size);
+}
+
+static void vfio_kvm_device_add_device(VFIODevice *vbasedev)
+{
+    vfio_kvm_device_add_fd(vbasedev->fd);
+}
+
+static void vfio_kvm_device_del_device(VFIODevice *vbasedev)
+{
+    vfio_kvm_device_del_fd(vbasedev->fd);
+}
+
+static int iommufd_connect_and_bind(VFIODevice *vbasedev, Error **errp)
+{
+    IOMMUFDBackend *iommufd = vbasedev->iommufd;
+    struct vfio_device_bind_iommufd bind = {
+        .argsz = sizeof(bind),
+        .flags = 0,
+    };
+    int ret;
+
+    ret = iommufd_backend_connect(iommufd, errp);
+    if (ret) {
+        return ret;
+    }
+
+    /*
+     * Add device to kvm-vfio to be prepared for the tracking
+     * in KVM. Especially for some emulated devices, it requires
+     * to have kvm information in the device open.
+     */
+    vfio_kvm_device_add_device(vbasedev);
+
+    /* Bind device to iommufd */
+    bind.iommufd = iommufd->fd;
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
+    if (ret) {
+        error_setg_errno(errp, errno, "error bind device fd=%d to iommufd=%d",
+                         vbasedev->fd, bind.iommufd);
+        goto err_bind;
+    }
+
+    vbasedev->devid = bind.out_devid;
+    trace_vfio_iommufd_bind_device(bind.iommufd, vbasedev->name,
+                                   vbasedev->fd, vbasedev->devid);
+    return ret;
+err_bind:
+    vfio_kvm_device_del_device(vbasedev);
+    iommufd_backend_disconnect(iommufd);
+    return ret;
+}
+
+static void iommufd_unbind_and_disconnect(VFIODevice *vbasedev)
+{
+    /* Unbind is automatically conducted when device fd is closed */
+    vfio_kvm_device_del_device(vbasedev);
+    iommufd_backend_disconnect(vbasedev->iommufd);
+}
+
+static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
+{
+    long int ret = -ENOTTY;
+    char *path, *vfio_dev_path = NULL, *vfio_path = NULL;
+    DIR *dir = NULL;
+    struct dirent *dent;
+    gchar *contents;
+    struct stat st;
+    gsize length;
+    int major, minor;
+    dev_t vfio_devt;
+
+    path = g_strdup_printf("%s/vfio-dev", sysfs_path);
+    if (stat(path, &st) < 0) {
+        error_setg_errno(errp, errno, "no such host device");
+        goto out_free_path;
+    }
+
+    dir = opendir(path);
+    if (!dir) {
+        error_setg_errno(errp, errno, "couldn't open dirrectory %s", path);
+        goto out_free_path;
+    }
+
+    while ((dent = readdir(dir))) {
+        if (!strncmp(dent->d_name, "vfio", 4)) {
+            vfio_dev_path = g_strdup_printf("%s/%s/dev", path, dent->d_name);
+            break;
+        }
+    }
+
+    if (!vfio_dev_path) {
+        error_setg(errp, "failed to find vfio-dev/vfioX/dev");
+        goto out_close_dir;
+    }
+
+    if (!g_file_get_contents(vfio_dev_path, &contents, &length, NULL)) {
+        error_setg(errp, "failed to load \"%s\"", vfio_dev_path);
+        goto out_free_dev_path;
+    }
+
+    if (sscanf(contents, "%d:%d", &major, &minor) != 2) {
+        error_setg(errp, "failed to get major:minor for \"%s\"", vfio_dev_path);
+        goto out_free_dev_path;
+    }
+    g_free(contents);
+    vfio_devt = makedev(major, minor);
+
+    vfio_path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name);
+    ret = open_cdev(vfio_path, vfio_devt);
+    if (ret < 0) {
+        error_setg(errp, "Failed to open %s", vfio_path);
+    }
+
+    trace_vfio_iommufd_get_devicefd(vfio_path, ret);
+    g_free(vfio_path);
+
+out_free_dev_path:
+    g_free(vfio_dev_path);
+out_close_dir:
+    closedir(dir);
+out_free_path:
+    if (*errp) {
+        error_prepend(errp, VFIO_MSG_PREFIX, path);
+    }
+    g_free(path);
+
+    return ret;
+}
+
+static VFIOIOASHwpt *vfio_container_get_hwpt(VFIOIOMMUFDContainer *container,
+                                             uint32_t hwpt_id)
+{
+    VFIOIOASHwpt *hwpt;
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        if (hwpt->hwpt_id == hwpt_id) {
+            return hwpt;
+        }
+    }
+
+    hwpt = g_malloc0(sizeof(*hwpt));
+
+    hwpt->hwpt_id = hwpt_id;
+    QLIST_INIT(&hwpt->device_list);
+    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
+
+    return hwpt;
+}
+
+static void vfio_container_put_hwpt(IOMMUFDBackend *be, VFIOIOASHwpt *hwpt)
+{
+    QLIST_REMOVE(hwpt, next);
+    iommufd_backend_free_id(be->fd, hwpt->hwpt_id);
+    g_free(hwpt);
+}
+
+static int __vfio_device_attach_hwpt(VFIODevice *vbasedev, uint32_t hwpt_id,
+                                     Error **errp)
+{
+    struct vfio_device_attach_iommufd_pt attach_data = {
+        .argsz = sizeof(attach_data),
+        .flags = 0,
+        .pt_id = hwpt_id,
+    };
+    int ret;
+
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);
+    if (ret) {
+        error_setg_errno(errp, errno,
+                         "[iommufd=%d] error attach %s (%d) to hwpt_id=%d",
+                         vbasedev->iommufd->fd, vbasedev->name, vbasedev->fd,
+                         hwpt_id);
+    }
+    return ret;
+}
+
+static int __vfio_device_detach_hwpt(VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_device_detach_iommufd_pt detach_data = {
+        .argsz = sizeof(detach_data),
+        .flags = 0,
+    };
+    int ret;
+
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOMMUFD_PT, &detach_data);
+    if (ret) {
+        error_setg_errno(errp, errno, "detach %s from ioas failed",
+                         vbasedev->name);
+    }
+    return ret;
+}
+
+static int vfio_device_attach_container(VFIODevice *vbasedev,
+                                        VFIOIOMMUFDContainer *container,
+                                        Error **errp)
+{
+    int ret, iommufd = vbasedev->iommufd->fd;
+    VFIOIOASHwpt *hwpt;
+    uint32_t hwpt_id;
+    Error *err = NULL;
+
+    /* try to attach to an existing hwpt in this container */
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        ret = __vfio_device_attach_hwpt(vbasedev, hwpt->hwpt_id, &err);
+        if (ret) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vfio_iommufd_fail_attach_existing_hwpt(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            goto found_hwpt;
+        }
+    }
+
+    ret = iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
+                                     container->ioas_id, &hwpt_id);
+
+    if (ret) {
+        error_setg_errno(errp, errno, "error alloc shadow hwpt");
+        return ret;
+    }
+
+    /* Attach device to an hwpt within iommufd */
+    ret = __vfio_device_attach_hwpt(vbasedev, hwpt_id, errp);
+    if (ret) {
+        iommufd_backend_free_id(iommufd, hwpt_id);
+        return ret;
+    }
+
+    hwpt = vfio_container_get_hwpt(container, hwpt_id);
+found_hwpt:
+    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, next);
+    vbasedev->hwpt = hwpt;
+
+    trace_vfio_iommufd_attach_device(iommufd, vbasedev->name, vbasedev->fd,
+                                     container->ioas_id, hwpt->hwpt_id);
+    return ret;
+}
+
+static void vfio_device_detach_container(VFIODevice *vbasedev,
+                                         VFIOIOMMUFDContainer *container,
+                                         Error **errp)
+{
+    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
+
+    __vfio_device_detach_hwpt(vbasedev, errp);
+    QLIST_REMOVE(vbasedev, next);
+    vbasedev->hwpt = NULL;
+    if (QLIST_EMPTY(&hwpt->device_list)) {
+        vfio_container_put_hwpt(vbasedev->iommufd, hwpt);
+    }
+
+    trace_vfio_iommufd_detach_device(container->be->fd, vbasedev->name,
+                                     container->ioas_id);
+}
+
+static void vfio_iommufd_container_destroy(VFIOIOMMUFDContainer *container)
+{
+    VFIOContainer *bcontainer = &container->bcontainer;
+
+    if (!QLIST_EMPTY(&container->hwpt_list)) {
+        return;
+    }
+    memory_listener_unregister(&bcontainer->listener);
+    vfio_container_destroy(bcontainer);
+    iommufd_backend_put_ioas(container->be, container->ioas_id);
+    g_free(container);
+}
+
+static int vfio_ram_block_discard_disable(bool state)
+{
+    /*
+     * We support coordinated discarding of RAM via the RamDiscardManager.
+     */
+    return ram_block_uncoordinated_discard_disable(state);
+}
+
+static int iommufd_attach_device(char *name, VFIODevice *vbasedev,
+                                 AddressSpace *as, Error **errp)
+{
+    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
+        object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_IOMMUFD_OPS));
+    VFIOContainer *bcontainer;
+    VFIOIOMMUFDContainer *container;
+    VFIOAddressSpace *space;
+    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
+    int ret, devfd;
+    uint32_t ioas_id;
+    Error *err = NULL;
+
+    devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp);
+    if (devfd < 0) {
+        return devfd;
+    }
+    vbasedev->fd = devfd;
+
+    ret = iommufd_connect_and_bind(vbasedev, errp);
+    if (ret) {
+        goto err_connect_bind;
+    }
+
+    space = vfio_get_address_space(as);
+
+    /* try to attach to an existing container in this space */
+    QLIST_FOREACH(bcontainer, &space->containers, next) {
+        container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+        if (bcontainer->ops != ops || vbasedev->iommufd != container->be) {
+            continue;
+        }
+        if (vfio_device_attach_container(vbasedev, container, &err)) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vfio_iommufd_fail_attach_existing_container(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            ret = vfio_ram_block_discard_disable(true);
+            if (ret) {
+                error_setg(errp,
+                              "Cannot set discarding of RAM broken (%d)", ret);
+                goto err_discard_disable;
+            }
+            goto found_container;
+        }
+    }
+
+    /* Need to allocate a new dedicated container */
+    ret = iommufd_backend_get_ioas(vbasedev->iommufd, &ioas_id);
+    if (ret < 0) {
+        error_setg_errno(errp, errno, "Failed to alloc ioas");
+        goto err_get_ioas;
+    }
+
+    trace_vfio_iommufd_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
+
+    container = g_malloc0(sizeof(*container));
+    container->be = vbasedev->iommufd;
+    container->ioas_id = ioas_id;
+    QLIST_INIT(&container->hwpt_list);
+
+    bcontainer = &container->bcontainer;
+    vfio_container_init(bcontainer, space, ops);
+
+    ret = vfio_device_attach_container(vbasedev, container, errp);
+    if (ret) {
+        goto err_attach_container;
+    }
+
+    ret = vfio_ram_block_discard_disable(true);
+    if (ret) {
+        goto err_discard_disable;
+    }
+
+    /*
+     * TODO: for now iommufd BE is on par with vfio iommu type1, so it's
+     * fine to add the whole range as window. For SPAPR, below code
+     * should be updated.
+     */
+    vfio_host_win_add(bcontainer, 0, (hwaddr)-1, 4096);
+    bcontainer->pgsizes = 4096;
+
+    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
+
+    bcontainer->listener = vfio_memory_listener;
+
+    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+
+    bcontainer->initialized = true;
+
+found_container:
+    /*
+     * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
+     * for discarding incompatibility check as well?
+     */
+    if (vbasedev->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(false);
+    }
+
+    ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info);
+    if (ret) {
+        error_setg_errno(errp, errno, "error getting device info");
+        goto err_discard_disable;
+    }
+
+    vbasedev->group = 0;
+    vbasedev->num_irqs = dev_info.num_irqs;
+    vbasedev->num_regions = dev_info.num_regions;
+    vbasedev->flags = dev_info.flags;
+    vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
+    vbasedev->container = bcontainer;
+
+    trace_vfio_iommufd_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
+                                   vbasedev->num_regions, vbasedev->flags);
+    return 0;
+
+err_discard_disable:
+    vfio_device_detach_container(vbasedev, container, &err);
+    if (err) {
+        error_report_err(err);
+    }
+err_attach_container:
+    vfio_iommufd_container_destroy(container);
+err_get_ioas:
+    vfio_put_address_space(space);
+    iommufd_unbind_and_disconnect(vbasedev);
+err_connect_bind:
+    close(vbasedev->fd);
+    return ret;
+}
+
+static void iommufd_detach_device(VFIODevice *vbasedev)
+{
+    VFIOContainer *bcontainer = vbasedev->container;
+    VFIOIOMMUFDContainer *container;
+    VFIOAddressSpace *space = bcontainer->space;
+    Error *err = NULL;
+
+    if (!bcontainer) {
+        return;
+    }
+
+    if (!vbasedev->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(false);
+    }
+
+    container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+    vfio_device_detach_container(vbasedev, container, &err);
+    if (err) {
+        error_report_err(err);
+    }
+    if (QLIST_EMPTY(&container->hwpt_list)) {
+        vfio_iommufd_container_destroy(container);
+        vfio_put_address_space(space);
+    }
+    vbasedev->container = NULL;
+    iommufd_unbind_and_disconnect(vbasedev);
+    close(vbasedev->fd);
+}
+
+static void vfio_iommu_backend_iommufd_ops_class_init(ObjectClass *oc,
+                                                     void *data) {
+    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(oc);
+
+    ops->dma_map = iommufd_map;
+    ops->dma_unmap = iommufd_unmap;
+    ops->attach_device = iommufd_attach_device;
+    ops->detach_device = iommufd_detach_device;
+}
+
+static const TypeInfo vfio_iommu_backend_iommufd_ops_type = {
+    .name = TYPE_VFIO_IOMMU_BACKEND_IOMMUFD_OPS,
+
+    .parent = TYPE_VFIO_IOMMU_BACKEND_OPS,
+    .class_init = vfio_iommu_backend_iommufd_ops_class_init,
+    .abstract = true,
+};
+static void vfio_iommu_backend_iommufd_ops_register_types(void)
+{
+    type_register_static(&vfio_iommu_backend_iommufd_ops_type);
+}
+type_init(vfio_iommu_backend_iommufd_ops_register_types);
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index eb6ce6229d..9cae2c9e21 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -7,6 +7,9 @@ vfio_ss.add(files(
   'spapr.c',
   'migration.c',
 ))
+if have_iommufd
+  vfio_ss.add(files('iommufd.c'))
+endif
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
   'pci-quirks.c',
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 1692bcd8f1..60b56f23a1 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -167,3 +167,15 @@ vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer
 vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
+
+#iommufd.c
+
+vfio_iommufd_get_devicefd(const char *dev, int devfd) " %s (fd=%d)"
+vfio_iommufd_bind_device(int iommufd, const char *name, int devfd, int devid) " [iommufd=%d] Successfully bound device %s (fd=%d): output devid=%d"
+vfio_iommufd_fail_attach_existing_hwpt(const char *msg) " %s"
+vfio_iommufd_attach_device(int iommufd, const char *name, int devfd, int ioasid, int hwptid) " [iommufd=%d] Successfully attached device %s (%d) to ioasid=%d: output hwptid=%d"
+vfio_iommufd_detach_device(int iommufd, const char *name, int ioasid) " [iommufd=%d] Detached %s from ioasid=%d"
+vfio_iommufd_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD container with ioasid=%d"
+vfio_iommufd_device_info(char *name, int devfd, int num_irqs, int num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d"
+vfio_iommufd_fail_attach_existing_container(const char *msg) " %s"
+vfio_iommufd_container_reset(char *name) " Successfully reset %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b1a76dcc9c..027a59a13a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -85,6 +85,26 @@ typedef struct VFIOLegacyContainer {
     QLIST_HEAD(, VFIOGroup) group_list;
 } VFIOLegacyContainer;
 
+#ifdef CONFIG_IOMMUFD
+typedef struct VFIOIOASHwpt {
+    uint32_t hwpt_id;
+    QLIST_HEAD(, VFIODevice) device_list;
+    QLIST_ENTRY(VFIOIOASHwpt) next;
+} VFIOIOASHwpt;
+
+typedef struct IOMMUFDBackend IOMMUFDBackend;
+
+typedef struct VFIOIOMMUFDContainer {
+    VFIOContainer bcontainer;
+    IOMMUFDBackend *be;
+    uint32_t ioas_id;
+    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
+} VFIOIOMMUFDContainer;
+#endif
+
+typedef QLIST_HEAD(VFIOAddressSpaceList, VFIOAddressSpace) VFIOAddressSpaceList;
+extern VFIOAddressSpaceList vfio_address_spaces;
+
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef struct VFIODevice {
@@ -110,6 +130,11 @@ typedef struct VFIODevice {
     OnOffAuto pre_copy_dirty_page_tracking;
     bool dirty_pages_supported;
     bool dirty_tracking;
+#ifdef CONFIG_IOMMUFD
+    int devid;
+    VFIOIOASHwpt *hwpt;
+    IOMMUFDBackend *iommufd;
+#endif
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index b18fa92146..51aff4af05 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -117,6 +117,9 @@ void vfio_container_init(VFIOContainer *container,
 void vfio_container_destroy(VFIOContainer *container);
 
 #define TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS "vfio-iommu-backend-legacy-ops"
+#ifdef CONFIG_IOMMUFD
+#define TYPE_VFIO_IOMMU_BACKEND_IOMMUFD_OPS "vfio-iommu-backend-iommufd-ops"
+#endif
 #define TYPE_VFIO_IOMMU_BACKEND_OPS "vfio-iommu-backend-ops"
 
 DECLARE_CLASS_CHECKERS(VFIOIOMMUBackendOpsClass,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 19/22] vfio/iommufd: Add vfio device iterator callback for iommufd
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (17 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 18/22] vfio/iommufd: Implement the iommufd backend Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-08-30 10:37 ` [PATCH v1 20/22] vfio/pci: Adapt vfio pci hot reset support with iommufd BE Zhenzhong Duan
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

The way to get a VFIO device pointer differs between the legacy
container and the iommufd container. With iommufd backend support
added, it's time to add the corresponding iterator for iommufd.
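
With this in place, a hypothetical caller could walk every device of an
iommufd container through the container ops, e.g. (sketch only, assuming
the dev_iter_next callback added for the legacy backend earlier in the
series):

    /* Sketch: visit each VFIODevice attached to an iommufd container */
    VFIODevice *vbasedev = NULL;

    while ((vbasedev = bcontainer->ops->dev_iter_next(bcontainer, vbasedev))) {
        /* e.g. inspect vbasedev->dirty_pages_supported, mark for reset, ... */
    }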

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 876d0e4928..dd24e76e39 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -36,6 +36,34 @@
 #include "qemu/cutils.h"
 #include "qemu/char_dev.h"
 
+static VFIODevice *iommufd_dev_iter_next(VFIOContainer *bcontainer,
+                                           VFIODevice *curr)
+{
+
+    VFIOIOASHwpt *hwpt;
+
+    assert(object_class_dynamic_cast(OBJECT_CLASS(bcontainer->ops),
+                                     TYPE_VFIO_IOMMU_BACKEND_IOMMUFD_OPS));
+
+    VFIOIOMMUFDContainer *container = container_of(bcontainer,
+                                                   VFIOIOMMUFDContainer,
+                                                   bcontainer);
+
+    if (!curr) {
+        hwpt = QLIST_FIRST(&container->hwpt_list);
+    } else {
+        if (curr->next.le_next) {
+            return curr->next.le_next;
+        }
+        hwpt = curr->hwpt->next.le_next;
+    }
+
+    if (!hwpt) {
+        return NULL;
+    }
+    return QLIST_FIRST(&hwpt->device_list);
+}
+
 static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova,
                        ram_addr_t size, void *vaddr, bool readonly)
 {
@@ -501,6 +529,7 @@ static void vfio_iommu_backend_iommufd_ops_class_init(ObjectClass *oc,
                                                      void *data) {
     VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(oc);
 
+    ops->dev_iter_next = iommufd_dev_iter_next;
     ops->dma_map = iommufd_map;
     ops->dma_unmap = iommufd_unmap;
     ops->attach_device = iommufd_attach_device;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 20/22] vfio/pci: Adapt vfio pci hot reset support with iommufd BE
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (18 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 19/22] vfio/iommufd: Add vfio device iterator callback for iommufd Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-08-30 10:37 ` [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend Zhenzhong Duan
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

As the PCI hot reset path needs to reference PCI-specific functions
and data structures, adding container-level callback functions
for the legacy and iommufd backends and referencing those PCI-specific
functions/data from there is no better than implementing reset support
with the iommufd backend directly in pci.c.

This way we can also share the common bus reset and system reset
paths between the different backends.
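
For reference, a condensed sketch of how the iommufd path classifies each
dependent device by the devid reported with
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO (not a literal quote of the patch; see
vfio_pci_hot_reset_iommufd() below):

    if (devices[i].devid == VFIO_PCI_DEVID_NOT_OWNED) {
        /* a dependent device is not owned by us: the hot reset is refused */
    } else if (devices[i].devid == vdev->vbasedev.devid ||
               devices[i].devid == VFIO_PCI_DEVID_OWNED) {
        /* the device being reset, or owned without its own devid: skip */
    } else {
        /* another cdev-bound device: look it up by devid and pre-reset it */
    }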

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/pci.c        | 224 +++++++++++++++++++++++++++++++++++++++----
 hw/vfio/trace-events |   1 +
 2 files changed, 208 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 34f65ecd17..3a8fee3c99 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,6 +42,7 @@
 #include "qapi/error.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
+#include "linux/iommufd.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -2378,22 +2379,13 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)
     return (strcmp(tmp, name) == 0);
 }
 
-static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
+static int vfio_pci_get_pci_hot_reset_info(VFIOPCIDevice *vdev,
+                                       struct vfio_pci_hot_reset_info **info_p)
 {
-    VFIOGroup *group;
     struct vfio_pci_hot_reset_info *info;
-    struct vfio_pci_dependent_device *devices;
-    struct vfio_pci_hot_reset *reset;
-    int32_t *fds;
-    int ret, i, count;
-    bool multi = false;
-
-    trace_vfio_pci_hot_reset(vdev->vbasedev.name, single ? "one" : "multi");
+    int ret, count;
 
-    if (!single) {
-        vfio_pci_pre_reset(vdev);
-    }
-    vdev->vbasedev.needs_reset = false;
+    assert(info_p && !*info_p);
 
     info = g_malloc0(sizeof(*info));
     info->argsz = sizeof(*info);
@@ -2401,24 +2393,53 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
     if (ret && errno != ENOSPC) {
         ret = -errno;
+        g_free(info);
         if (!vdev->has_pm_reset) {
             error_report("vfio: Cannot reset device %s, "
                          "no available reset mechanism.", vdev->vbasedev.name);
         }
-        goto out_single;
+        return ret;
     }
 
     count = info->count;
-    info = g_realloc(info, sizeof(*info) + (count * sizeof(*devices)));
-    info->argsz = sizeof(*info) + (count * sizeof(*devices));
-    devices = &info->devices[0];
+    info = g_realloc(info, sizeof(*info) + (count * sizeof(info->devices[0])));
+    info->argsz = sizeof(*info) + (count * sizeof(info->devices[0]));
 
     ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
     if (ret) {
         ret = -errno;
+        g_free(info);
         error_report("vfio: hot reset info failed: %m");
+        return ret;
+    }
+
+    *info_p = info;
+    return 0;
+}
+
+static int vfio_pci_hot_reset_legacy(VFIOPCIDevice *vdev, bool single)
+{
+    VFIOGroup *group;
+    struct vfio_pci_hot_reset_info *info = NULL;
+    struct vfio_pci_dependent_device *devices;
+    struct vfio_pci_hot_reset *reset;
+    int32_t *fds;
+    int ret, i, count;
+    bool multi = false;
+
+    trace_vfio_pci_hot_reset(vdev->vbasedev.name, single ? "one" : "multi");
+
+    if (!single) {
+        vfio_pci_pre_reset(vdev);
+    }
+    vdev->vbasedev.needs_reset = false;
+
+    ret = vfio_pci_get_pci_hot_reset_info(vdev, &info);
+
+    if (ret) {
         goto out_single;
     }
+    devices = &info->devices[0];
 
     trace_vfio_pci_hot_reset_has_dep_devices(vdev->vbasedev.name);
 
@@ -2560,6 +2581,175 @@ out_single:
     return ret;
 }
 
+#ifdef CONFIG_IOMMUFD
+static VFIODevice *vfio_pci_find_by_iommufd_devid(__u32 devid)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *bcontainer;
+    VFIOIOMMUFDContainer *container;
+    VFIOIOASHwpt *hwpt;
+    VFIODevice *vbasedev_iter;
+    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
+        object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_IOMMUFD_OPS));
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(bcontainer, &space->containers, next) {
+            if (bcontainer->ops != ops) {
+                continue;
+            }
+            container = container_of(bcontainer, VFIOIOMMUFDContainer,
+                                     bcontainer);
+            QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+                QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, next) {
+                    if (devid == vbasedev_iter->devid) {
+                        return vbasedev_iter;
+                    }
+                }
+            }
+        }
+    }
+    return NULL;
+}
+
+static int vfio_pci_hot_reset_iommufd(VFIOPCIDevice *vdev, bool single)
+{
+    struct vfio_pci_hot_reset_info *info = NULL;
+    struct vfio_pci_dependent_device *devices;
+    struct vfio_pci_hot_reset *reset;
+    int ret, i;
+    bool multi = false;
+
+    trace_vfio_pci_hot_reset(vdev->vbasedev.name, single ? "one" : "multi");
+
+    if (!single) {
+        vfio_pci_pre_reset(vdev);
+    }
+    vdev->vbasedev.needs_reset = false;
+
+    ret = vfio_pci_get_pci_hot_reset_info(vdev, &info);
+
+    if (ret) {
+        goto out_single;
+    }
+
+    assert(info->flags & VFIO_PCI_HOT_RESET_FLAG_DEV_ID);
+
+    devices = &info->devices[0];
+
+    if (!(info->flags & VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED)) {
+        if (!vdev->has_pm_reset) {
+            for (i = 0; i < info->count; i++) {
+                if (devices[i].devid == VFIO_PCI_DEVID_NOT_OWNED) {
+                    error_report("vfio: Cannot reset device %s, "
+                                 "depends on device %04x:%02x:%02x.%x "
+                                 "which is not owned.",
+                                 vdev->vbasedev.name, devices[i].segment,
+                                 devices[i].bus, PCI_SLOT(devices[i].devfn),
+                                 PCI_FUNC(devices[i].devfn));
+                }
+            }
+        }
+        ret = -EPERM;
+        goto out_single;
+    }
+
+    trace_vfio_pci_hot_reset_has_dep_devices(vdev->vbasedev.name);
+
+    for (i = 0; i < info->count; i++) {
+        VFIOPCIDevice *tmp;
+        VFIODevice *vbasedev_iter;
+
+        trace_vfio_pci_hot_reset_dep_devices_iommufd(devices[i].segment,
+                                             devices[i].bus,
+                                             PCI_SLOT(devices[i].devfn),
+                                             PCI_FUNC(devices[i].devfn),
+                                             devices[i].devid);
+
+        /*
+         * If a VFIO cdev device is resettable, all the dependent devices
+         * are either bound to same iommufd or within same iommu_groups as
+         * one of the iommufd bound devices.
+         */
+        assert(devices[i].devid != VFIO_PCI_DEVID_NOT_OWNED);
+
+        if (devices[i].devid == vdev->vbasedev.devid ||
+            devices[i].devid == VFIO_PCI_DEVID_OWNED) {
+            continue;
+        }
+
+        vbasedev_iter = vfio_pci_find_by_iommufd_devid(devices[i].devid);
+        if (!vbasedev_iter || !vbasedev_iter->dev->realized ||
+            vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
+            continue;
+        }
+        tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
+        if (single) {
+            ret = -EINVAL;
+            goto out_single;
+        }
+        vfio_pci_pre_reset(tmp);
+        tmp->vbasedev.needs_reset = false;
+        multi = true;
+    }
+
+    if (!single && !multi) {
+        ret = -EINVAL;
+        goto out_single;
+    }
+
+    /* Use zero length array for hot reset with iommufd backend */
+    reset = g_malloc0(sizeof(*reset));
+    reset->argsz = sizeof(*reset);
+
+    /* Bus reset! */
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_PCI_HOT_RESET, reset);
+    g_free(reset);
+
+    trace_vfio_pci_hot_reset_result(vdev->vbasedev.name,
+                                    ret ? strerror(errno) : "Success");
+
+    /* Re-enable INTx on affected devices */
+    for (i = 0; i < info->count; i++) {
+        VFIOPCIDevice *tmp;
+        VFIODevice *vbasedev_iter;
+
+        if (devices[i].devid == vdev->vbasedev.devid ||
+            devices[i].devid == VFIO_PCI_DEVID_OWNED) {
+            continue;
+        }
+
+        vbasedev_iter = vfio_pci_find_by_iommufd_devid(devices[i].devid);
+        if (!vbasedev_iter || !vbasedev_iter->dev->realized ||
+            vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
+            continue;
+        }
+        tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
+        vfio_pci_post_reset(tmp);
+    }
+out_single:
+    if (!single) {
+        vfio_pci_post_reset(vdev);
+    }
+    g_free(info);
+
+    return ret;
+}
+#endif
+
+static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
+{
+#ifdef CONFIG_IOMMUFD
+    if (vdev->vbasedev.iommufd) {
+        return vfio_pci_hot_reset_iommufd(vdev, single);
+    } else
+#endif
+    {
+        return vfio_pci_hot_reset_legacy(vdev, single);
+    }
+}
+
+
+
 /*
  * We want to differentiate hot reset of multiple in-use devices vs hot reset
  * of a single in-use device.  VFIO_DEVICE_RESET will already handle the case
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 60b56f23a1..c4f3b337b8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -34,6 +34,7 @@ vfio_check_af_flr(const char *name) "%s Supports FLR via AF cap"
 vfio_pci_hot_reset(const char *name, const char *type) " (%s) %s"
 vfio_pci_hot_reset_has_dep_devices(const char *name) "%s: hot reset dependent devices:"
 vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, int group_id) "\t%04x:%02x:%02x.%x group %d"
+vfio_pci_hot_reset_dep_devices_iommufd(int domain, int bus, int slot, int function, int dev_id) "\t%04x:%02x:%02x.%x devid %d"
 vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %s"
 vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
 vfio_populate_device_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (19 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 20/22] vfio/pci: Adapt vfio pci hot reset support with iommufd BE Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-06 18:10   ` Jason Gunthorpe
  2023-08-30 10:37 ` [PATCH v1 22/22] vfio/pci: Make vfio cdev pre-openable by passing a file handle Zhenzhong Duan
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

From: Eric Auger <eric.auger@redhat.com>

Now that we support two types of IOMMU backends, let's add the capability
to select one of them. The selection depends on whether an iommufd object
has been linked with the vfio-pci device:

If the user wants to use the legacy backend, they shall not
link the vfio-pci device with any iommufd object:

-device vfio-pci,host=0000:02:00.0

This is called the legacy mode/backend.

If the user wants to use the iommufd backend (/dev/iommu), they
shall pass an iommufd object id in the vfio-pci device options:

 -object iommufd,id=iommufd0
 -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0

Note the /dev/iommu device may have been pre-opened by a
management tool such as libvirt. This mode is no longer considered
for the legacy backend, so let's remove the "TODO" comment.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/pci.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 3a8fee3c99..99265253f8 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -43,6 +43,7 @@
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
 #include "linux/iommufd.h"
+#include "sysemu/iommufd.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -3611,11 +3612,10 @@ static Property vfio_pci_dev_properties[] = {
                                    qdev_prop_nv_gpudirect_clique, uint8_t),
     DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
                                 OFF_AUTOPCIBAR_OFF),
-    /*
-     * TODO - support passed fds... is this necessary?
-     * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
-     * DEFINE_PROP_STRING("vfiogroupfd, VFIOPCIDevice, vfiogroupfd_name),
-     */
+#ifdef CONFIG_IOMMUFD
+    DEFINE_PROP_LINK("iommufd", VFIOPCIDevice, vbasedev.iommufd,
+                     TYPE_IOMMUFD_BACKEND, IOMMUFDBackend *),
+#endif
     DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v1 22/22] vfio/pci: Make vfio cdev pre-openable by passing a file handle
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (20 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend Zhenzhong Duan
@ 2023-08-30 10:37 ` Zhenzhong Duan
  2023-09-14  9:04 ` [PATCH v1 00/22] vfio: Adopt iommufd Eric Auger
  2023-09-15 12:42 ` Cédric Le Goater
  23 siblings, 0 replies; 109+ messages in thread
From: Zhenzhong Duan @ 2023-08-30 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Zhenzhong Duan

This gives management tools like libvirt a chance to open the vfio
cdev with privilege and pass the FD to QEMU. This way QEMU never needs
the privilege to open a VFIO or iommu cdev node.

Add a helper function vfio_device_get_name() to check the fd and get
the device name; it will also be used by other vfio devices.

There is no easy way to check if a device is an mdev with FD passing,
so fail the x-balloon-allowed check unconditionally in this case.

There is also no easy way to get the BDF as the name with FD passing,
so we fake a name of the form VFIO_FD<fd>.
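
As a sketch (the fd numbers are purely illustrative, standing for
descriptors the management layer has already opened for /dev/iommu and
for the vfio cdev respectively), the invocation could then look like:

 -object iommufd,id=iommufd0,fd=10
 -device vfio-pci,iommufd=iommufd0,fd=11

With the above, the faked device name would read VFIO_FD11.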

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/helpers.c             | 28 ++++++++++++++++++++++++++++
 hw/vfio/iommufd.c             | 12 ++++++++----
 hw/vfio/pci.c                 | 35 ++++++++++++++++++++++++++++-------
 include/hw/vfio/vfio-common.h |  1 +
 4 files changed, 65 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index 4338456b08..1a27efb075 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -596,3 +596,31 @@ bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
 
     return ret;
 }
+
+int vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
+{
+    struct stat st;
+
+    if (vbasedev->fd < 0) {
+        if (stat(vbasedev->sysfsdev, &st) < 0) {
+            error_setg_errno(errp, errno, "no such host device");
+            error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev);
+            return -errno;
+        }
+        vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+    }
+#ifdef CONFIG_IOMMUFD
+    else {
+        if (!vbasedev->iommufd) {
+            error_setg(errp, "Use FD passing only with iommufd backend");
+            return -EINVAL;
+        }
+        /*
+         * Give a name with fd so any function printing out vbasedev->name
+         * will not break.
+         */
+        vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+    }
+#endif
+    return 0;
+}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index dd24e76e39..2cd2daebf4 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -376,11 +376,15 @@ static int iommufd_attach_device(char *name, VFIODevice *vbasedev,
     uint32_t ioas_id;
     Error *err = NULL;
 
-    devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp);
-    if (devfd < 0) {
-        return devfd;
+    if (vbasedev->fd < 0) {
+        devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp);
+        if (devfd < 0) {
+            return devfd;
+        }
+        vbasedev->fd = devfd;
+    } else {
+        devfd = vbasedev->fd;
     }
-    vbasedev->fd = devfd;
 
     ret = iommufd_connect_and_bind(vbasedev, errp);
     if (ret) {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 99265253f8..eff52b5014 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -44,6 +44,7 @@
 #include "migration/qemu-file.h"
 #include "linux/iommufd.h"
 #include "sysemu/iommufd.h"
+#include "monitor/monitor.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -3171,18 +3172,23 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     VFIODevice *vbasedev = &vdev->vbasedev;
     char *tmp, *subsys;
     Error *err = NULL;
-    struct stat st;
     int i, ret;
     bool is_mdev;
     char uuid[UUID_FMT_LEN];
     char *name;
 
-    if (!vbasedev->sysfsdev) {
+    if (vbasedev->fd < 0 && !vbasedev->sysfsdev) {
         if (!(~vdev->host.domain || ~vdev->host.bus ||
               ~vdev->host.slot || ~vdev->host.function)) {
             error_setg(errp, "No provided host device");
+#ifdef CONFIG_IOMMUFD
+            error_append_hint(errp, "Use -device vfio-pci,host=DDDD:BB:DD.F, "
+                              "-device vfio-pci,sysfsdev=PATH_TO_DEVICE "
+                              "or -device vfio-pci,fd=DEVICE_FD\n");
+#else
             error_append_hint(errp, "Use -device vfio-pci,host=DDDD:BB:DD.F "
                               "or -device vfio-pci,sysfsdev=PATH_TO_DEVICE\n");
+#endif
             return;
         }
         vbasedev->sysfsdev =
@@ -3191,13 +3197,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                             vdev->host.slot, vdev->host.function);
     }
 
-    if (stat(vbasedev->sysfsdev, &st) < 0) {
-        error_setg_errno(errp, errno, "no such host device");
-        error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev);
+    if (vfio_device_get_name(vbasedev, errp)) {
         return;
     }
-
-    vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
     vbasedev->ops = &vfio_pci_ops;
     vbasedev->type = VFIO_DEVICE_TYPE_PCI;
     vbasedev->dev = DEVICE(vdev);
@@ -3559,6 +3561,7 @@ static void vfio_instance_init(Object *obj)
     vdev->host.bus = ~0U;
     vdev->host.slot = ~0U;
     vdev->host.function = ~0U;
+    vdev->vbasedev.fd = -1;
 
     vdev->nv_gpudirect_clique = 0xFF;
 
@@ -3619,6 +3622,21 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+#ifdef CONFIG_IOMMUFD
+static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
+{
+    VFIOPCIDevice *vdev = VFIO_PCI(obj);
+    int fd = -1;
+
+    fd = monitor_fd_param(monitor_cur(), str, errp);
+    if (fd == -1) {
+        error_prepend(errp, "Could not parse remote object fd %s:", str);
+        return;
+    }
+    vdev->vbasedev.fd = fd;
+}
+#endif
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3626,6 +3644,9 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     device_class_set_props(dc, vfio_pci_dev_properties);
+#ifdef CONFIG_IOMMUFD
+    object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
+#endif
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 027a59a13a..41c8eeaa54 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -242,6 +242,7 @@ struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 struct vfio_info_cap_header *
 vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id);
+int vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend
  2023-08-30 10:37 ` [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend Zhenzhong Duan
@ 2023-09-06 18:10   ` Jason Gunthorpe
  2023-09-06 19:09     ` Alex Williamson
  0 siblings, 1 reply; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-06 18:10 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, nicolinc, joao.m.martins,
	eric.auger, peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng

On Wed, Aug 30, 2023 at 06:37:53PM +0800, Zhenzhong Duan wrote:
> Note the /dev/iommu device may have been pre-opened by a
> management tool such as libvirt. This mode is no more considered
> for the legacy backend. So let's remove the "TODO" comment.

Can you show an example of that syntax too?

Also, the vfio device should be openable externally as well

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend
  2023-09-06 18:10   ` Jason Gunthorpe
@ 2023-09-06 19:09     ` Alex Williamson
  2023-09-07  1:10       ` Jason Gunthorpe
  0 siblings, 1 reply; 109+ messages in thread
From: Alex Williamson @ 2023-09-06 19:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zhenzhong Duan, qemu-devel, clg, nicolinc, joao.m.martins,
	eric.auger, peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng

On Wed, 6 Sep 2023 15:10:39 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Aug 30, 2023 at 06:37:53PM +0800, Zhenzhong Duan wrote:
> > Note the /dev/iommu device may have been pre-opened by a
> > management tool such as libvirt. This mode is no more considered
> > for the legacy backend. So let's remove the "TODO" comment.  
> 
> Can you show an example of that syntax too?

Unless you're just looking for something in the commit log, patch 16/
added the following to the qemu help output:

+#ifdef CONFIG_IOMMUFD
+    ``-object iommufd,id=id[,fd=fd]``
+        Creates an iommufd backend which allows control of DMA mapping
+        through the /dev/iommu device.
+
+        The ``id`` parameter is a unique ID which frontends (such as
+        vfio-pci or vdpa) will use to connect with the iommufd backend.
+
+        The ``fd`` parameter is an optional pre-opened file descriptor
+        resulting from /dev/iommu opening. Usually the iommufd is shared
+        across all subsystems, bringing the benefit of centralized
+        reference counting.
+#endif
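
So with a pre-opened /dev/iommu the command line would presumably look
something like this (a sketch; the fd number is simply whatever descriptor
the management layer has already opened and handed to QEMU):

    -object iommufd,id=iommufd0,fd=10
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0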
 
> Also, the vfio device should be openable externally as well

Appears to be added in the very next patch in the series.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend
  2023-09-06 19:09     ` Alex Williamson
@ 2023-09-07  1:10       ` Jason Gunthorpe
  2023-09-07  2:27         ` Duan, Zhenzhong
  0 siblings, 1 reply; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-07  1:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Zhenzhong Duan, qemu-devel, clg, nicolinc, joao.m.martins,
	eric.auger, peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng

On Wed, Sep 06, 2023 at 01:09:26PM -0600, Alex Williamson wrote:
> On Wed, 6 Sep 2023 15:10:39 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Aug 30, 2023 at 06:37:53PM +0800, Zhenzhong Duan wrote:
> > > Note the /dev/iommu device may have been pre-opened by a
> > > management tool such as libvirt. This mode is no more considered
> > > for the legacy backend. So let's remove the "TODO" comment.  
> > 
> > Can you show an example of that syntax too?
> 
> Unless you're just looking for something in the commit log, 

Yeah, I was thinking the commit log

> patch 16/ added the following to the qemu help output:
> 
> +#ifdef CONFIG_IOMMUFD
> +    ``-object iommufd,id=id[,fd=fd]``
> +        Creates an iommufd backend which allows control of DMA mapping
> +        through the /dev/iommu device.
> +
> +        The ``id`` parameter is a unique ID which frontends (such as
> +        vfio-pci or vdpa) will use to connect with the iommufd backend.
> +
> +        The ``fd`` parameter is an optional pre-opened file descriptor
> +        resulting from /dev/iommu opening. Usually the iommufd is shared
> +        across all subsystems, bringing the benefit of centralized
> +        reference counting.
> +#endif
>  
> > Also, the vfio device should be openable externally as well
> 
> Appears to be added in the very next patch in the series.  Thanks,

Indeed, I got confused because this patch removed the TODO - that removal
could reasonably be pushed to the next patch, along with a bit more detail
in its commit message

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend
  2023-09-07  1:10       ` Jason Gunthorpe
@ 2023-09-07  2:27         ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-07  2:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: qemu-devel, clg, nicolinc, Martins, Joao, eric.auger, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P



>-----Original Message-----
>From: Jason Gunthorpe <jgg@nvidia.com>
>Sent: Thursday, September 7, 2023 9:11 AM
>To: Alex Williamson <alex.williamson@redhat.com>
>Subject: Re: [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu
>backend
>
>On Wed, Sep 06, 2023 at 01:09:26PM -0600, Alex Williamson wrote:
>> On Wed, 6 Sep 2023 15:10:39 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> > On Wed, Aug 30, 2023 at 06:37:53PM +0800, Zhenzhong Duan wrote:
>> > > Note the /dev/iommu device may have been pre-opened by a
>> > > management tool such as libvirt. This mode is no more considered
>> > > for the legacy backend. So let's remove the "TODO" comment.
>> >
>> > Can you show an example of that syntax too?
>>
>> Unless you're just looking for something in the commit log,
>
>Yeah, I was thinking the commit log
>
>> patch 16/ added the following to the qemu help output:
>>
>> +#ifdef CONFIG_IOMMUFD
>> +    ``-object iommufd,id=id[,fd=fd]``
>> +        Creates an iommufd backend which allows control of DMA mapping
>> +        through the /dev/iommu device.
>> +
>> +        The ``id`` parameter is a unique ID which frontends (such as
>> +        vfio-pci or vdpa) will use to connect with the iommufd backend.
>> +
>> +        The ``fd`` parameter is an optional pre-opened file descriptor
>> +        resulting from /dev/iommu opening. Usually the iommufd is shared
>> +        across all subsystems, bringing the benefit of centralized
>> +        reference counting.
>> +#endif

Thanks for pointing out this issue.
I can think of two choices:
1. squash this patch into PATCH16
2. keep this patch separate and pull the fd passing related changes from PATCH16 into this one
Please kindly suggest which way is preferred by the community.

Btw: I only enabled fd passing for the vfio-pci device; let me know if it's preferred
to include all other vfio devices in this series, and I'll add them.

>>
>> > Also, the vfio device should be openable externally as well
>>
>> Appears to be added in the very next patch in the series.  Thanks,
>
>Indeed, I got confused because this removed the TODO - that could
>reasonably be pushed to the next patch and include a bit more detail
>in the commit message

Good idea, will fix.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 00/22] vfio: Adopt iommufd
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (21 preceding siblings ...)
  2023-08-30 10:37 ` [PATCH v1 22/22] vfio/pci: Make vfio cdev pre-openable by passing a file handle Zhenzhong Duan
@ 2023-09-14  9:04 ` Eric Auger
  2023-09-14  9:27   ` Duan, Zhenzhong
  2023-09-15 12:42 ` Cédric Le Goater
  23 siblings, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-14  9:04 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

Hi Zhenzhong

On 8/30/23 12:37, Zhenzhong Duan wrote:
> Hi All,
>
> As the kernel side iommufd cdev and hot reset feature have been queued,
> also hwpt alloc has been added in Jason's for_next branch [1], I'd like
> to update a new version matching kernel side update and with rfc flag
> removed. Qemu code can be found at [2], look forward more comments!
>
>
> We have done wide test with different combinations, e.g:
>
> - PCI device were tested
> - FD passing and hot reset with some trick.
> - device hotplug test with legacy and iommufd backends
> - with or without vIOMMU for legacy and iommufd backends
> - divices linked to different iommufds
> - VFIO migration with a E800 net card(no dirty sync support) passthrough
> - platform, ccw and ap were only compile-tested due to environment limit
>
>
> Given some iommufd kernel limitations, the iommufd backend is
> not yet fully on par with the legacy backend w.r.t. features like:
> - p2p mappings (you will see related error traces)
> - dirty page sync
> - and etc.
>
>
> Changelog:
> v1:
> - Alloc hwpt instead of using auto hwpt
> - elaborate iommufd code per Nicolin
> - consolidate two patches and drop as.c
> - typo error fix and function rename
>
> I didn't list change log of rfc stage, see [3] if anyone is interested.
>
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
> [2] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_cdev_v1
> [3] https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02529.html

Do you have a branch to share?

It does not apply to upstream

Thanks

Eric
>
>
> --------------------------------------------------------------------------
>
> With the introduction of iommufd, the Linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>
> This QEMU integration is the result of a collaborative work between
> Yi Liu, Yi Sun, Nicolin Chen and Eric Auger.
>
> At QEMU level, interactions with the /dev/iommu are abstracted by a new
> iommufd object (compiled in with the CONFIG_IOMMUFD option).
>
> Any QEMU device (e.g. vfio device) wishing to use /dev/iommu must be
> linked with an iommufd object. In this series, the vfio-pci device is
> granted with such capability (other VFIO devices are not yet ready):
>
> It gets a new optional parameter named iommufd which allows to pass
> an iommufd object:
>
>     -object iommufd,id=iommufd0
>     -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
>
> Note the /dev/iommu and vfio cdev can be externally opened by a
> management layer. In such a case the fd is passed:
>   
>     -object iommufd,id=iommufd0,fd=22
>     -device vfio-pci,iommufd=iommufd0,fd=23
>
> If the fd parameter is not passed, the fd is opened by QEMU.
> See https://www.mail-archive.com/qemu-devel@nongnu.org/msg937155.html
> for detailed discuss on this requirement.
>
> If no iommufd option is passed to the vfio-pci device, iommufd is not
> used and the end-user gets the behavior based on the legacy vfio iommu
> interfaces:
>
>     -device vfio-pci,host=0000:02:00.0
>
> While the legacy kernel interface is group-centric, the new iommufd
> interface is device-centric, relying on device fd and iommufd.
>
> To support both interfaces in the QEMU VFIO device we reworked the vfio
> container abstraction so that the generic VFIO code can use either
> backend.
>
> The VFIOContainer object becomes a base object derived into
> a) the legacy VFIO container and
> b) the new iommufd based container.
>
> The base object implements generic code such as code related to
> memory_listener and address space management whereas the derived
> objects implement callbacks specific to either BE, legacy and
> iommufd. Indeed each backend has its own way to setup secure context
> and dma management interface. The below diagram shows how it looks
> like with both BEs.
>
>                     VFIO                           AddressSpace/Memory
>     +-------+  +----------+  +-----+  +-----+
>     |  pci  |  | platform |  |  ap |  | ccw |
>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>         |           |           |        |        |   AddressSpace       |
>         |           |           |        |        +------------+---------+
>     +---V-----------V-----------V--------V----+               /
>     |           VFIOAddressSpace              | <------------+
>     |                  |                      |  MemoryListener
>     |          VFIOContainer list             |
>     +-------+----------------------------+----+
>             |                            |
>             |                            |
>     +-------V------+            +--------V----------+
>     |   iommufd    |            |    vfio legacy    |
>     |  container   |            |     container     |
>     +-------+------+            +--------+----------+
>             |                            |
>             | /dev/iommu                 | /dev/vfio/vfio
>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
> Userspace   |                            |
> ============+============================+===========================
> Kernel      |  device fd                 |
>             +---------------+            | group/container fd
>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>             |  ATTACH_IOAS) |            | device fd
>             |               |            |
>             |       +-------V------------V-----------------+
>     iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
> ioas_copy)  |                 |                    | map/unmap
>             |                 |                    |
>      +------V------+    +-----V------+      +------V--------+
>      | iommfd core |    |  device    |      |  vfio iommu   |
>      +-------------+    +------------+      +---------------+
>
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>               (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> 1. VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> 2. VFIO populates DMA map/unmap via the container BEs
>    *) iommufd BE: uses iommufd
>    *) vfio legacy BE: uses container fd
>
>
> Thanks,
> Yi, Yi, Eric, Zhenzhong
>
>
> Eric Auger (8):
>   scripts/update-linux-headers: Add iommufd.h
>   vfio/common: Introduce vfio_container_add|del_section_window()
>   vfio/container: Introduce vfio_[attach/detach]_device
>   vfio/platform: Use vfio_[attach/detach]_device
>   vfio/ap: Use vfio_[attach/detach]_device
>   vfio/ccw: Use vfio_[attach/detach]_device
>   backends/iommufd: Introduce the iommufd object
>   vfio/pci: Allow the selection of a given iommu backend
>
> Yi Liu (5):
>   vfio/common: Move IOMMU agnostic helpers to a separate file
>   vfio/common: Move legacy VFIO backend code into separate container.c
>   vfio: Add base container
>   util/char_dev: Add open_cdev()
>   vfio/iommufd: Implement the iommufd backend
>
> Zhenzhong Duan (9):
>   Update linux-header to support iommufd cdev and hwpt alloc
>   vfio/common: Extract out vfio_kvm_device_[add/del]_fd
>   vfio/common: Add a vfio device iterator
>   vfio/common: Refactor vfio_viommu_preset() to be group agnostic
>   vfio/common: Simplify vfio_viommu_preset()
>   Add iommufd configure option
>   vfio/iommufd: Add vfio device iterator callback for iommufd
>   vfio/pci: Adapt vfio pci hot reset support with iommufd BE
>   vfio/pci: Make vfio cdev pre-openable by passing a file handle
>
>  MAINTAINERS                           |   13 +
>  backends/Kconfig                      |    4 +
>  backends/iommufd.c                    |  291 ++++
>  backends/meson.build                  |    3 +
>  backends/trace-events                 |   13 +
>  hw/vfio/ap.c                          |   68 +-
>  hw/vfio/ccw.c                         |  120 +-
>  hw/vfio/common.c                      | 1948 +++----------------------
>  hw/vfio/container-base.c              |  160 ++
>  hw/vfio/container.c                   | 1208 +++++++++++++++
>  hw/vfio/helpers.c                     |  626 ++++++++
>  hw/vfio/iommufd.c                     |  554 +++++++
>  hw/vfio/meson.build                   |    6 +
>  hw/vfio/pci.c                         |  319 +++-
>  hw/vfio/platform.c                    |   43 +-
>  hw/vfio/spapr.c                       |   22 +-
>  hw/vfio/trace-events                  |   21 +-
>  include/hw/vfio/vfio-common.h         |  111 +-
>  include/hw/vfio/vfio-container-base.h |  158 ++
>  include/qemu/char_dev.h               |   16 +
>  include/standard-headers/linux/fuse.h |    3 +
>  include/sysemu/iommufd.h              |   49 +
>  linux-headers/linux/iommufd.h         |  444 ++++++
>  linux-headers/linux/kvm.h             |   13 +-
>  linux-headers/linux/vfio.h            |  148 +-
>  meson.build                           |    6 +
>  meson_options.txt                     |    2 +
>  qapi/qom.json                         |   18 +-
>  qemu-options.hx                       |   13 +
>  scripts/meson-buildoptions.sh         |    3 +
>  scripts/update-linux-headers.sh       |    3 +-
>  util/chardev_open.c                   |   61 +
>  util/meson.build                      |    1 +
>  33 files changed, 4395 insertions(+), 2073 deletions(-)
>  create mode 100644 backends/iommufd.c
>  create mode 100644 hw/vfio/container-base.c
>  create mode 100644 hw/vfio/container.c
>  create mode 100644 hw/vfio/helpers.c
>  create mode 100644 hw/vfio/iommufd.c
>  create mode 100644 include/hw/vfio/vfio-container-base.h
>  create mode 100644 include/qemu/char_dev.h
>  create mode 100644 include/sysemu/iommufd.h
>  create mode 100644 linux-headers/linux/iommufd.h
>  create mode 100644 util/chardev_open.c
>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 00/22] vfio: Adopt iommufd
  2023-09-14  9:04 ` [PATCH v1 00/22] vfio: Adopt iommufd Eric Auger
@ 2023-09-14  9:27   ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-14  9:27 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Thursday, September 14, 2023 5:04 PM
>To: Duan, Zhenzhong <zhenzhong.duan@intel.com>; qemu-devel@nongnu.org
>Cc: alex.williamson@redhat.com; clg@redhat.com; jgg@nvidia.com;
>nicolinc@nvidia.com; Martins, Joao <joao.m.martins@oracle.com>;
>peterx@redhat.com; jasowang@redhat.com; Tian, Kevin <kevin.tian@intel.com>;
>Liu, Yi L <yi.l.liu@intel.com>; Sun, Yi Y <yi.y.sun@intel.com>; Peng, Chao P
><chao.p.peng@intel.com>
>Subject: Re: [PATCH v1 00/22] vfio: Adopt iommufd
>
>Hi Zhenzhong
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> Hi All,
>>
>> As the kernel side iommufd cdev and hot reset feature have been queued,
>> also hwpt alloc has been added in Jason's for_next branch [1], I'd like
>> to update a new version matching kernel side update and with rfc flag
>> removed. Qemu code can be found at [2], look forward more comments!
>>
>>
>> We have done wide test with different combinations, e.g:
>>
>> - PCI device were tested
>> - FD passing and hot reset with some trick.
>> - device hotplug test with legacy and iommufd backends
>> - with or without vIOMMU for legacy and iommufd backends
>> - divices linked to different iommufds
>> - VFIO migration with a E800 net card(no dirty sync support) passthrough
>> - platform, ccw and ap were only compile-tested due to environment limit
>>
>>
>> Given some iommufd kernel limitations, the iommufd backend is
>> not yet fully on par with the legacy backend w.r.t. features like:
>> - p2p mappings (you will see related error traces)
>> - dirty page sync
>> - and etc.
>>
>>
>> Changelog:
>> v1:
>> - Alloc hwpt instead of using auto hwpt
>> - elaborate iommufd code per Nicolin
>> - consolidate two patches and drop as.c
>> - typo error fix and function rename
>>
>> I didn't list change log of rfc stage, see [3] if anyone is interested.
>>
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
>> [2] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_cdev_v1
>> [3] https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02529.html
>
>Do you have a branch to share?
>
>It does not apply to upstream

Sure, https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_cdev_v1_rebased
I think this one is already based on today's upstream.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 02/22] Update linux-header to support iommufd cdev and hwpt alloc
  2023-08-30 10:37 ` [PATCH v1 02/22] Update linux-header to support iommufd cdev and hwpt alloc Zhenzhong Duan
@ 2023-09-14 14:46   ` Eric Auger
  2023-09-15  3:02     ` Duan, Zhenzhong
  0 siblings, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-14 14:46 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	open list:Overall KVM CPUs

Hi Zhenzhong,

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
> branch: for_next
> commit id: eb501c2d96cfce6b42528e8321ea085ec605e790
I see that in your branch you have now updated against v6.6-rc1. However
you should run a full ./scripts/update-linux-headers.sh, i.e. not only
import the changes in linux-headers/linux/iommufd.h, as this patch seems
to do, but also import all the changes brought in by this Linux version.
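
For instance, something like the following (assuming the usual invocation
from the QEMU source tree, with the kernel tree path as the first argument;
the path is only an example):

    ./scripts/update-linux-headers.sh ~/src/linux-v6.6-rc1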

Thanks

Eric
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> Note this is a placeholder patch.
>
>  include/standard-headers/linux/fuse.h |   3 +
>  linux-headers/linux/iommufd.h         | 444 ++++++++++++++++++++++++++
>  linux-headers/linux/kvm.h             |  13 +-
>  linux-headers/linux/vfio.h            | 148 ++++++++-
>  4 files changed, 604 insertions(+), 4 deletions(-)
>  create mode 100644 linux-headers/linux/iommufd.h
>
> diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h
> index 35c131a107..2c8b8de9c2 100644
> --- a/include/standard-headers/linux/fuse.h
> +++ b/include/standard-headers/linux/fuse.h
> @@ -206,6 +206,7 @@
>   *  - add extension header
>   *  - add FUSE_EXT_GROUPS
>   *  - add FUSE_CREATE_SUPP_GROUP
> + *  - add FUSE_HAS_EXPIRE_ONLY
>   */
>  
>  #ifndef _LINUX_FUSE_H
> @@ -365,6 +366,7 @@ struct fuse_file_lock {
>   * FUSE_HAS_INODE_DAX:  use per inode DAX
>   * FUSE_CREATE_SUPP_GROUP: add supplementary group info to create, mkdir,
>   *			symlink and mknod (single group that matches parent)
> + * FUSE_HAS_EXPIRE_ONLY: kernel supports expiry-only entry invalidation
>   */
>  #define FUSE_ASYNC_READ		(1 << 0)
>  #define FUSE_POSIX_LOCKS	(1 << 1)
> @@ -402,6 +404,7 @@ struct fuse_file_lock {
>  #define FUSE_SECURITY_CTX	(1ULL << 32)
>  #define FUSE_HAS_INODE_DAX	(1ULL << 33)
>  #define FUSE_CREATE_SUPP_GROUP	(1ULL << 34)
> +#define FUSE_HAS_EXPIRE_ONLY	(1ULL << 35)
>  
>  /**
>   * CUSE INIT request/reply flags
> diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
> new file mode 100644
> index 0000000000..218bf7ac98
> --- /dev/null
> +++ b/linux-headers/linux/iommufd.h
> @@ -0,0 +1,444 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
> + */
> +#ifndef _IOMMUFD_H
> +#define _IOMMUFD_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +#define IOMMUFD_TYPE (';')
> +
> +/**
> + * DOC: General ioctl format
> + *
> + * The ioctl interface follows a general format to allow for extensibility. Each
> + * ioctl is passed in a structure pointer as the argument providing the size of
> + * the structure in the first u32. The kernel checks that any structure space
> + * beyond what it understands is 0. This allows userspace to use the backward
> + * compatible portion while consistently using the newer, larger, structures.
> + *
> + * ioctls use a standard meaning for common errnos:
> + *
> + *  - ENOTTY: The IOCTL number itself is not supported at all
> + *  - E2BIG: The IOCTL number is supported, but the provided structure has
> + *    non-zero in a part the kernel does not understand.
> + *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
> + *    understood, however a known field has a value the kernel does not
> + *    understand or support.
> + *  - EINVAL: Everything about the IOCTL was understood, but a field is not
> + *    correct.
> + *  - ENOENT: An ID or IOVA provided does not exist.
> + *  - ENOMEM: Out of memory.
> + *  - EOVERFLOW: Mathematics overflowed.
> + *
> + * As well as additional errnos, within specific ioctls.
> + */
> +enum {
> +	IOMMUFD_CMD_BASE = 0x80,
> +	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
> +	IOMMUFD_CMD_IOAS_ALLOC,
> +	IOMMUFD_CMD_IOAS_ALLOW_IOVAS,
> +	IOMMUFD_CMD_IOAS_COPY,
> +	IOMMUFD_CMD_IOAS_IOVA_RANGES,
> +	IOMMUFD_CMD_IOAS_MAP,
> +	IOMMUFD_CMD_IOAS_UNMAP,
> +	IOMMUFD_CMD_OPTION,
> +	IOMMUFD_CMD_VFIO_IOAS,
> +	IOMMUFD_CMD_HWPT_ALLOC,
> +	IOMMUFD_CMD_GET_HW_INFO,
> +};
> +
> +/**
> + * struct iommu_destroy - ioctl(IOMMU_DESTROY)
> + * @size: sizeof(struct iommu_destroy)
> + * @id: iommufd object ID to destroy. Can be any destroyable object type.
> + *
> + * Destroy any object held within iommufd.
> + */
> +struct iommu_destroy {
> +	__u32 size;
> +	__u32 id;
> +};
> +#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
> +
> +/**
> + * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
> + * @size: sizeof(struct iommu_ioas_alloc)
> + * @flags: Must be 0
> + * @out_ioas_id: Output IOAS ID for the allocated object
> + *
> + * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
> + * to memory mapping.
> + */
> +struct iommu_ioas_alloc {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 out_ioas_id;
> +};
> +#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
> +
> +/**
> + * struct iommu_iova_range - ioctl(IOMMU_IOVA_RANGE)
> + * @start: First IOVA
> + * @last: Inclusive last IOVA
> + *
> + * An interval in IOVA space.
> + */
> +struct iommu_iova_range {
> +	__aligned_u64 start;
> +	__aligned_u64 last;
> +};
> +
> +/**
> + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> + * @size: sizeof(struct iommu_ioas_iova_ranges)
> + * @ioas_id: IOAS ID to read ranges from
> + * @num_iovas: Input/Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @allowed_iovas: Pointer to the output array of struct iommu_iova_range
> + * @out_iova_alignment: Minimum alignment required for mapping IOVA
> + *
> + * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
> + * is not allowed. num_iovas will be set to the total number of iovas and
> + * the allowed_iovas[] will be filled in as space permits.
> + *
> + * The allowed ranges are dependent on the HW path the DMA operation takes, and
> + * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
> + * full range, and each attached device will narrow the ranges based on that
> + * device's HW restrictions. Detaching a device can widen the ranges. Userspace
> + * should query ranges after every attach/detach to know what IOVAs are valid
> + * for mapping.
> + *
> + * On input num_iovas is the length of the allowed_iovas array. On output it is
> + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and set
> + * num_iovas to the required value if num_iovas is too small. In this case the
> + * caller should allocate a larger output array and re-issue the ioctl.
> + *
> + * out_iova_alignment returns the minimum IOVA alignment that can be given
> + * to IOMMU_IOAS_MAP/COPY. IOVA's must satisfy::
> + *
> + *   starting_iova % out_iova_alignment == 0
> + *   (starting_iova + length) % out_iova_alignment == 0
> + *
> + * out_iova_alignment can be 1 indicating any IOVA is allowed. It cannot
> + * be higher than the system PAGE_SIZE.
> + */
> +struct iommu_ioas_iova_ranges {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u32 num_iovas;
> +	__u32 __reserved;
> +	__aligned_u64 allowed_iovas;
> +	__aligned_u64 out_iova_alignment;
> +};
> +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> +
> +/**
> + * struct iommu_ioas_allow_iovas - ioctl(IOMMU_IOAS_ALLOW_IOVAS)
> + * @size: sizeof(struct iommu_ioas_allow_iovas)
> + * @ioas_id: IOAS ID to allow IOVAs from
> + * @num_iovas: Input/Output total number of ranges in the IOAS
> + * @__reserved: Must be 0
> + * @allowed_iovas: Pointer to array of struct iommu_iova_range
> + *
> + * Ensure a range of IOVAs are always available for allocation. If this call
> + * succeeds then IOMMU_IOAS_IOVA_RANGES will never return a list of IOVA ranges
> + * that are narrower than the ranges provided here. This call will fail if
> + * IOMMU_IOAS_IOVA_RANGES is currently narrower than the given ranges.
> + *
> + * When an IOAS is first created the IOVA_RANGES will be maximally sized, and as
> + * devices are attached the IOVA will narrow based on the device restrictions.
> + * When an allowed range is specified any narrowing will be refused, ie device
> + * attachment can fail if the device requires limiting within the allowed range.
> + *
> + * Automatic IOVA allocation is also impacted by this call. MAP will only
> + * allocate within the allowed IOVAs if they are present.
> + *
> + * This call replaces the entire allowed list with the given list.
> + */
> +struct iommu_ioas_allow_iovas {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u32 num_iovas;
> +	__u32 __reserved;
> +	__aligned_u64 allowed_iovas;
> +};
> +#define IOMMU_IOAS_ALLOW_IOVAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOW_IOVAS)
> +
> +/**
> + * enum iommufd_ioas_map_flags - Flags for map and copy
> + * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
> + *                             IOVA to place the mapping at
> + * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
> + * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
> + */
> +enum iommufd_ioas_map_flags {
> +	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
> +	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
> +	IOMMU_IOAS_MAP_READABLE = 1 << 2,
> +};
> +
> +/**
> + * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
> + * @size: sizeof(struct iommu_ioas_map)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @ioas_id: IOAS ID to change the mapping of
> + * @__reserved: Must be 0
> + * @user_va: Userspace pointer to start mapping from
> + * @length: Number of bytes to map
> + * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
> + *        then this must be provided as input.
> + *
> + * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
> + * mapping will be established at iova, otherwise a suitable location based on
> + * the reserved and allowed lists will be automatically selected and returned in
> + * iova.
> + *
> + * If IOMMU_IOAS_MAP_FIXED_IOVA is specified then the iova range must currently
> + * be unused, existing IOVA cannot be replaced.
> + */
> +struct iommu_ioas_map {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 ioas_id;
> +	__u32 __reserved;
> +	__aligned_u64 user_va;
> +	__aligned_u64 length;
> +	__aligned_u64 iova;
> +};
> +#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
> +
> +/**
> + * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
> + * @size: sizeof(struct iommu_ioas_copy)
> + * @flags: Combination of enum iommufd_ioas_map_flags
> + * @dst_ioas_id: IOAS ID to change the mapping of
> + * @src_ioas_id: IOAS ID to copy from
> + * @length: Number of bytes to copy and map
> + * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
> + *            set then this must be provided as input.
> + * @src_iova: IOVA to start the copy
> + *
> + * Copy an already existing mapping from src_ioas_id and establish it in
> + * dst_ioas_id. The src iova/length must exactly match a range used with
> + * IOMMU_IOAS_MAP.
> + *
> + * This may be used to efficiently clone a subset of an IOAS to another, or as a
> + * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
> + * establishing equivalent new mappings, as internal resources are shared, and
> + * the kernel will pin the user memory only once.
> + */
> +struct iommu_ioas_copy {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 dst_ioas_id;
> +	__u32 src_ioas_id;
> +	__aligned_u64 length;
> +	__aligned_u64 dst_iova;
> +	__aligned_u64 src_iova;
> +};
> +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> +
> +/**
> + * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
> + * @size: sizeof(struct iommu_ioas_unmap)
> + * @ioas_id: IOAS ID to change the mapping of
> + * @iova: IOVA to start the unmapping at
> + * @length: Number of bytes to unmap, and return back the bytes unmapped
> + *
> + * Unmap an IOVA range. The iova/length must be a superset of a previously
> + * mapped range used with IOMMU_IOAS_MAP or IOMMU_IOAS_COPY. Splitting or
> + * truncating ranges is not allowed. The values 0 to U64_MAX will unmap
> + * everything.
> + */
> +struct iommu_ioas_unmap {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__aligned_u64 iova;
> +	__aligned_u64 length;
> +};
> +#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
> +
> +/**
> + * enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
> + *                       ioctl(IOMMU_OPTION_HUGE_PAGES)
> + * @IOMMU_OPTION_RLIMIT_MODE:
> + *    Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
> + *    to invoke this. Value 0 (default) is user based accouting, 1 uses process
> + *    based accounting. Global option, object_id must be 0
> + * @IOMMU_OPTION_HUGE_PAGES:
> + *    Value 1 (default) allows contiguous pages to be combined when generating
> + *    iommu mappings. Value 0 disables combining, everything is mapped to
> + *    PAGE_SIZE. This can be useful for benchmarking.  This is a per-IOAS
> + *    option, the object_id must be the IOAS ID.
> + */
> +enum iommufd_option {
> +	IOMMU_OPTION_RLIMIT_MODE = 0,
> +	IOMMU_OPTION_HUGE_PAGES = 1,
> +};
> +
> +/**
> + * enum iommufd_option_ops - ioctl(IOMMU_OPTION_OP_SET) and
> + *                           ioctl(IOMMU_OPTION_OP_GET)
> + * @IOMMU_OPTION_OP_SET: Set the option's value
> + * @IOMMU_OPTION_OP_GET: Get the option's value
> + */
> +enum iommufd_option_ops {
> +	IOMMU_OPTION_OP_SET = 0,
> +	IOMMU_OPTION_OP_GET = 1,
> +};
> +
> +/**
> + * struct iommu_option - iommu option multiplexer
> + * @size: sizeof(struct iommu_option)
> + * @option_id: One of enum iommufd_option
> + * @op: One of enum iommufd_option_ops
> + * @__reserved: Must be 0
> + * @object_id: ID of the object if required
> + * @val64: Option value to set or value returned on get
> + *
> + * Change a simple option value. This multiplexor allows controlling options
> + * on objects. IOMMU_OPTION_OP_SET will load an option and IOMMU_OPTION_OP_GET
> + * will return the current value.
> + */
> +struct iommu_option {
> +	__u32 size;
> +	__u32 option_id;
> +	__u16 op;
> +	__u16 __reserved;
> +	__u32 object_id;
> +	__aligned_u64 val64;
> +};
> +#define IOMMU_OPTION _IO(IOMMUFD_TYPE, IOMMUFD_CMD_OPTION)
> +
> +/**
> + * enum iommufd_vfio_ioas_op - IOMMU_VFIO_IOAS_* ioctls
> + * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
> + * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
> + * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
> + */
> +enum iommufd_vfio_ioas_op {
> +	IOMMU_VFIO_IOAS_GET = 0,
> +	IOMMU_VFIO_IOAS_SET = 1,
> +	IOMMU_VFIO_IOAS_CLEAR = 2,
> +};
> +
> +/**
> + * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
> + * @size: sizeof(struct iommu_vfio_ioas)
> + * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
> + *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
> + * @op: One of enum iommufd_vfio_ioas_op
> + * @__reserved: Must be 0
> + *
> + * The VFIO compatibility support uses a single ioas because VFIO APIs do not
> + * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
> + * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
> + * compatibility ioas, either by taking what is already set, or auto creating
> + * one. From then on VFIO will continue to use that ioas and is not effected by
> + * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
> + */
> +struct iommu_vfio_ioas {
> +	__u32 size;
> +	__u32 ioas_id;
> +	__u16 op;
> +	__u16 __reserved;
> +};
> +#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
> +
> +/**
> + * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
> + * @size: sizeof(struct iommu_hwpt_alloc)
> + * @flags: Must be 0
> + * @dev_id: The device to allocate this HWPT for
> + * @pt_id: The IOAS to connect this HWPT to
> + * @out_hwpt_id: The ID of the new HWPT
> + * @__reserved: Must be 0
> + *
> + * Explicitly allocate a hardware page table object. This is the same object
> + * type that is returned by iommufd_device_attach() and represents the
> + * underlying iommu driver's iommu_domain kernel object.
> + *
> + * A HWPT will be created with the IOVA mappings from the given IOAS.
> + */
> +struct iommu_hwpt_alloc {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 dev_id;
> +	__u32 pt_id;
> +	__u32 out_hwpt_id;
> +	__u32 __reserved;
> +};
> +#define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
> +
> +/**
> + * struct iommu_hw_info_vtd - Intel VT-d hardware information
> + *
> + * @flags: Must be 0
> + * @__reserved: Must be 0
> + *
> + * @cap_reg: Value of Intel VT-d capability register defined in VT-d spec
> + *           section 11.4.2 Capability Register.
> + * @ecap_reg: Value of Intel VT-d capability register defined in VT-d spec
> + *            section 11.4.3 Extended Capability Register.
> + *
> + * User needs to understand the Intel VT-d specification to decode the
> + * register value.
> + */
> +struct iommu_hw_info_vtd {
> +	__u32 flags;
> +	__u32 __reserved;
> +	__aligned_u64 cap_reg;
> +	__aligned_u64 ecap_reg;
> +};
> +
> +/**
> + * enum iommu_hw_info_type - IOMMU Hardware Info Types
> + * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
> + *                           info
> + * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> + */
> +enum iommu_hw_info_type {
> +	IOMMU_HW_INFO_TYPE_NONE,
> +	IOMMU_HW_INFO_TYPE_INTEL_VTD,
> +};
> +
> +/**
> + * struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
> + * @size: sizeof(struct iommu_hw_info)
> + * @flags: Must be 0
> + * @dev_id: The device bound to the iommufd
> + * @data_len: Input the length of a user buffer in bytes. Output the length of
> + *            data that kernel supports
> + * @data_uptr: User pointer to a user-space buffer used by the kernel to fill
> + *             the iommu type specific hardware information data
> + * @out_data_type: Output the iommu hardware info type as defined in the enum
> + *                 iommu_hw_info_type.
> + * @__reserved: Must be 0
> + *
> + * Query an iommu type specific hardware information data from an iommu behind
> + * a given device that has been bound to iommufd. This hardware info data will
> + * be used to sync capabilities between the virtual iommu and the physical
> + * iommu, e.g. a nested translation setup needs to check the hardware info, so
> + * a guest stage-1 page table can be compatible with the physical iommu.
> + *
> + * To capture an iommu type specific hardware information data, @data_uptr and
> + * its length @data_len must be provided. Trailing bytes will be zeroed if the
> + * user buffer is larger than the data that kernel has. Otherwise, kernel only
> + * fills the buffer using the given length in @data_len. If the ioctl succeeds,
> + * @data_len will be updated to the length that kernel actually supports,
> + * @out_data_type will be filled to decode the data filled in the buffer
> + * pointed by @data_uptr. Input @data_len == zero is allowed.
> + */
> +struct iommu_hw_info {
> +	__u32 size;
> +	__u32 flags;
> +	__u32 dev_id;
> +	__u32 data_len;
> +	__aligned_u64 data_uptr;
> +	__u32 out_data_type;
> +	__u32 __reserved;
> +};
> +#define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
> +#endif
> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> index 1f3f3333a4..0d74ee999a 100644
> --- a/linux-headers/linux/kvm.h
> +++ b/linux-headers/linux/kvm.h
> @@ -1414,9 +1414,16 @@ struct kvm_device_attr {
>  	__u64	addr;		/* userspace address of attr data */
>  };
>  
> -#define  KVM_DEV_VFIO_GROUP			1
> -#define   KVM_DEV_VFIO_GROUP_ADD			1
> -#define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define  KVM_DEV_VFIO_FILE			1
> +
> +#define   KVM_DEV_VFIO_FILE_ADD			1
> +#define   KVM_DEV_VFIO_FILE_DEL			2
> +
> +/* KVM_DEV_VFIO_GROUP aliases are for compile time uapi compatibility */
> +#define  KVM_DEV_VFIO_GROUP	KVM_DEV_VFIO_FILE
> +
> +#define   KVM_DEV_VFIO_GROUP_ADD	KVM_DEV_VFIO_FILE_ADD
> +#define   KVM_DEV_VFIO_GROUP_DEL	KVM_DEV_VFIO_FILE_DEL
>  #define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 16db89071e..7326ace436 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -677,11 +677,60 @@ enum {
>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>   *					      struct vfio_pci_hot_reset_info)
>   *
> + * This command is used to query the affected devices in the hot reset for
> + * a given device.
> + *
> + * This command always reports the segment, bus, and devfn information for
> + * each affected device, and selectively reports the group_id or devid per
> + * the way how the calling device is opened.
> + *
> + *	- If the calling device is opened via the traditional group/container
> + *	  API, group_id is reported.  User should check if it has owned all
> + *	  the affected devices and provides a set of group fds to prove the
> + *	  ownership in VFIO_DEVICE_PCI_HOT_RESET ioctl.
> + *
> + *	- If the calling device is opened as a cdev, devid is reported.
> + *	  Flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is set to indicate this
> + *	  data type.  All the affected devices should be represented in
> + *	  the dev_set, ex. bound to a vfio driver, and also be owned by
> + *	  this interface which is determined by the following conditions:
> + *	  1) Has a valid devid within the iommufd_ctx of the calling device.
> + *	     Ownership cannot be determined across separate iommufd_ctx and
> + *	     the cdev calling conventions do not support a proof-of-ownership
> + *	     model as provided in the legacy group interface.  In this case
> + *	     valid devid with value greater than zero is provided in the return
> + *	     structure.
> + *	  2) Does not have a valid devid within the iommufd_ctx of the calling
> + *	     device, but belongs to the same IOMMU group as the calling device
> + *	     or another opened device that has a valid devid within the
> + *	     iommufd_ctx of the calling device.  This provides implicit ownership
> + *	     for devices within the same DMA isolation context.  In this case
> + *	     the devid value of VFIO_PCI_DEVID_OWNED is provided in the return
> + *	     structure.
> + *
> + *	  A devid value of VFIO_PCI_DEVID_NOT_OWNED is provided in the return
> + *	  structure for affected devices where device is NOT represented in the
> + *	  dev_set or ownership is not available.  Such devices prevent the use
> + *	  of VFIO_DEVICE_PCI_HOT_RESET ioctl outside of the proof-of-ownership
> + *	  calling conventions (ie. via legacy group accessed devices).  Flag
> + *	  VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED would be set when all the
> + *	  affected devices are represented in the dev_set and also owned by
> + *	  the user.  This flag is available only when
> + *	  flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is set, otherwise reserved.
> + *	  When set, user could invoke VFIO_DEVICE_PCI_HOT_RESET with a zero
> + *	  length fd array on the calling device as the ownership is validated
> + *	  by iommufd_ctx.
> + *
>   * Return: 0 on success, -errno on failure:
>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
>   */
>  struct vfio_pci_dependent_device {
> -	__u32	group_id;
> +	union {
> +		__u32   group_id;
> +		__u32	devid;
> +#define VFIO_PCI_DEVID_OWNED		0
> +#define VFIO_PCI_DEVID_NOT_OWNED	-1
> +	};
>  	__u16	segment;
>  	__u8	bus;
>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> @@ -690,6 +739,8 @@ struct vfio_pci_dependent_device {
>  struct vfio_pci_hot_reset_info {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_PCI_HOT_RESET_FLAG_DEV_ID		(1 << 0)
> +#define VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED	(1 << 1)
>  	__u32	count;
>  	struct vfio_pci_dependent_device	devices[];
>  };
> @@ -700,6 +751,24 @@ struct vfio_pci_hot_reset_info {
>   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
>   *				    struct vfio_pci_hot_reset)
>   *
> + * A PCI hot reset results in either a bus or slot reset which may affect
> + * other devices sharing the bus/slot.  The calling user must have
> + * ownership of the full set of affected devices as determined by the
> + * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl.
> + *
> + * When called on a device file descriptor acquired through the vfio
> + * group interface, the user is required to provide proof of ownership
> + * of those affected devices via the group_fds array in struct
> + * vfio_pci_hot_reset.
> + *
> + * When called on a direct cdev opened vfio device, the flags field of
> + * struct vfio_pci_hot_reset_info reports the ownership status of the
> + * affected devices and this ioctl must be called with an empty group_fds
> + * array.  See above INFO ioctl definition for ownership requirements.
> + *
> + * Mixed usage of legacy groups and cdevs across the set of affected
> + * devices is not supported.
> + *
>   * Return: 0 on success, -errno on failure.
>   */
>  struct vfio_pci_hot_reset {
> @@ -828,6 +897,83 @@ struct vfio_device_feature {
>  
>  #define VFIO_DEVICE_FEATURE		_IO(VFIO_TYPE, VFIO_BASE + 17)
>  
> +/*
> + * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 18,
> + *				   struct vfio_device_bind_iommufd)
> + * @argsz:	 User filled size of this data.
> + * @flags:	 Must be 0.
> + * @iommufd:	 iommufd to bind.
> + * @out_devid:	 The device id generated by this bind. devid is a handle for
> + *		 this device/iommufd bond and can be used in IOMMUFD commands.
> + *
> + * Bind a vfio_device to the specified iommufd.
> + *
> + * User is restricted from accessing the device before the binding operation
> + * is completed.  Only allowed on cdev fds.
> + *
> + * Unbind is automatically conducted when device fd is closed.
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_device_bind_iommufd {
> +	__u32		argsz;
> +	__u32		flags;
> +	__s32		iommufd;
> +	__u32		out_devid;
> +};
> +
> +#define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 18)
> +
> +/*
> + * VFIO_DEVICE_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 19,
> + *					struct vfio_device_attach_iommufd_pt)
> + * @argsz:	User filled size of this data.
> + * @flags:	Must be 0.
> + * @pt_id:	Input the target id which can represent an ioas or a hwpt
> + *		allocated via iommufd subsystem.
> + *		Output the input ioas id or the attached hwpt id which could
> + *		be the specified hwpt itself or a hwpt automatically created
> + *		for the specified ioas by kernel during the attachment.
> + *
> + * Associate the device with an address space within the bound iommufd.
> + * Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close.  This is only
> + * allowed on cdev fds.
> + *
> + * If a vfio device is currently attached to a valid hw_pagetable, without doing
> + * a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl
> + * passing in another hw_pagetable (hwpt) id is allowed. This action, also known
> + * as a hw_pagetable replacement, will replace the device's currently attached
> + * hw_pagetable with a new hw_pagetable corresponding to the given pt_id.
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_device_attach_iommufd_pt {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u32	pt_id;
> +};
> +
> +#define VFIO_DEVICE_ATTACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 19)
> +
> +/*
> + * VFIO_DEVICE_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 20,
> + *					struct vfio_device_detach_iommufd_pt)
> + * @argsz:	User filled size of this data.
> + * @flags:	Must be 0.
> + *
> + * Remove the association of the device and its current associated address
> + * space.  After it, the device should be in a blocking DMA state.  This is only
> + * allowed on cdev fds.
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_device_detach_iommufd_pt {
> +	__u32	argsz;
> +	__u32	flags;
> +};
> +
> +#define VFIO_DEVICE_DETACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 20)
> +
>  /*
>   * Provide support for setting a PCI VF Token, which is used as a shared
>   * secret between PF and VF drivers.  This feature may only be set on a



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 02/22] Update linux-header to support iommufd cdev and hwpt alloc
  2023-09-14 14:46   ` Eric Auger
@ 2023-09-15  3:02     ` Duan, Zhenzhong
  2023-09-20 11:04       ` Eric Auger
  0 siblings, 1 reply; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-15  3:02 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	open list:Overall KVM CPUs

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Thursday, September 14, 2023 10:46 PM
>Subject: Re: [PATCH v1 02/22] Update linux-header to support iommufd cdev and
>hwpt alloc
>
>Hi Zhenzhong,
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
>> branch: for_next
>> commit id: eb501c2d96cfce6b42528e8321ea085ec605e790
>I see that in your branch you have now updated against v6.6-rc1. However
>you should run a full ./scripts/update-linux-headers.sh,
>ie. not only importing the changes in linux-headers/linux/iommufd.h as
>it seems to do but also import all changes brought with this linux version.

Found the reason: the base is already against v6.6-rc1. [PATCH v1 01/22] added
iommufd.h to the update script, and this patch imports it.
I agree the subject is confusing; it needs to be something like "Update iommufd.h to linux-header".
I'll fix the subject in the next version, thanks for pointing it out.

BR.
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 00/22] vfio: Adopt iommufd
  2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
                   ` (22 preceding siblings ...)
  2023-09-14  9:04 ` [PATCH v1 00/22] vfio: Adopt iommufd Eric Auger
@ 2023-09-15 12:42 ` Cédric Le Goater
  2023-09-15 13:14   ` Duan, Zhenzhong
  2023-09-18 11:51   ` Jason Gunthorpe
  23 siblings, 2 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-15 12:42 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

On 8/30/23 12:37, Zhenzhong Duan wrote:
> Hi All,
> 
> As the kernel side iommufd cdev and hot reset feature have been queued,
> also hwpt alloc has been added in Jason's for_next branch [1], I'd like
> to update a new version matching kernel side update and with rfc flag
> removed. Qemu code can be found at [2], look forward more comments!

FYI, I have started cleaning up the VFIO support in QEMU PPC. First
is the removal of nvlink2, which was dropped from the kernel 2.5 years
ago. Next is probably removal of all the PPC bits in VFIO. Code is
bitrotting and AFAICT VFIO has been broken on these platforms since
5.18 or so.

The consequences on this patchset should be less movement of code
between files. I think this is something we should reduce to maintain
history.

Thanks,

C.
  

> 
> 
> We have done wide test with different combinations, e.g:
> 
> - PCI device were tested
> - FD passing and hot reset with some trick.
> - device hotplug test with legacy and iommufd backends
> - with or without vIOMMU for legacy and iommufd backends
> - divices linked to different iommufds
> - VFIO migration with a E800 net card(no dirty sync support) passthrough
> - platform, ccw and ap were only compile-tested due to environment limit
> 
> 
> Given some iommufd kernel limitations, the iommufd backend is
> not yet fully on par with the legacy backend w.r.t. features like:
> - p2p mappings (you will see related error traces)
> - dirty page sync
> - and etc.
> 
> 
> Changelog:
> v1:
> - Alloc hwpt instead of using auto hwpt
> - elaborate iommufd code per Nicolin
> - consolidate two patches and drop as.c
> - typo error fix and function rename
> 
> I didn't list change log of rfc stage, see [3] if anyone is interested.
> 
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
> [2] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_cdev_v1
> [3] https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02529.html
> 
> 
> --------------------------------------------------------------------------
> 
> With the introduction of iommufd, the Linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> 
> This QEMU integration is the result of a collaborative work between
> Yi Liu, Yi Sun, Nicolin Chen and Eric Auger.
> 
> At QEMU level, interactions with the /dev/iommu are abstracted by a new
> iommufd object (compiled in with the CONFIG_IOMMUFD option).
> 
> Any QEMU device (e.g. vfio device) wishing to use /dev/iommu must be
> linked with an iommufd object. In this series, the vfio-pci device is
> granted with such capability (other VFIO devices are not yet ready):
> 
> It gets a new optional parameter named iommufd which allows to pass
> an iommufd object:
> 
>      -object iommufd,id=iommufd0
>      -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
> 
> Note the /dev/iommu and vfio cdev can be externally opened by a
> management layer. In such a case the fd is passed:
>    
>      -object iommufd,id=iommufd0,fd=22
>      -device vfio-pci,iommufd=iommufd0,fd=23
> 
> If the fd parameter is not passed, the fd is opened by QEMU.
> See https://www.mail-archive.com/qemu-devel@nongnu.org/msg937155.html
> for detailed discuss on this requirement.
> 
> If no iommufd option is passed to the vfio-pci device, iommufd is not
> used and the end-user gets the behavior based on the legacy vfio iommu
> interfaces:
> 
>      -device vfio-pci,host=0000:02:00.0
> 
> While the legacy kernel interface is group-centric, the new iommufd
> interface is device-centric, relying on device fd and iommufd.
> 
> To support both interfaces in the QEMU VFIO device we reworked the vfio
> container abstraction so that the generic VFIO code can use either
> backend.
> 
> The VFIOContainer object becomes a base object derived into
> a) the legacy VFIO container and
> b) the new iommufd based container.
> 
> The base object implements generic code such as the code related to the
> memory_listener and address space management, whereas the derived
> objects implement callbacks specific to each BE, legacy and
> iommufd. Indeed, each backend has its own way to set up the secure context
> and the DMA management interface. The diagram below shows how it looks
> with both BEs.
> 
>                      VFIO                           AddressSpace/Memory
>      +-------+  +----------+  +-----+  +-----+
>      |  pci  |  | platform |  |  ap |  | ccw |
>      +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>          |           |           |        |        |   AddressSpace       |
>          |           |           |        |        +------------+---------+
>      +---V-----------V-----------V--------V----+               /
>      |           VFIOAddressSpace              | <------------+
>      |                  |                      |  MemoryListener
>      |          VFIOContainer list             |
>      +-------+----------------------------+----+
>              |                            |
>              |                            |
>      +-------V------+            +--------V----------+
>      |   iommufd    |            |    vfio legacy    |
>      |  container   |            |     container     |
>      +-------+------+            +--------+----------+
>              |                            |
>              | /dev/iommu                 | /dev/vfio/vfio
>              | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
> Userspace   |                            |
> ============+============================+===========================
> Kernel      |  device fd                 |
>              +---------------+            | group/container fd
>              | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>              |  ATTACH_IOAS) |            | device fd
>              |               |            |
>              |       +-------V------------V-----------------+
>      iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
> ioas_copy)  |                 |                    | map/unmap
>              |                 |                    |
>       +------V------+    +-----V------+      +------V--------+
>       | iommfd core |    |  device    |      |  vfio iommu   |
>       +-------------+    +------------+      +---------------+
> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>                (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                    (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> 1. VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> 2. VFIO populates DMA map/unmap via the container BEs
>     *) iommufd BE: uses iommufd
>     *) vfio legacy BE: uses container fd
> 
> 
> Thanks,
> Yi, Yi, Eric, Zhenzhong
> 
> 
> Eric Auger (8):
>    scripts/update-linux-headers: Add iommufd.h
>    vfio/common: Introduce vfio_container_add|del_section_window()
>    vfio/container: Introduce vfio_[attach/detach]_device
>    vfio/platform: Use vfio_[attach/detach]_device
>    vfio/ap: Use vfio_[attach/detach]_device
>    vfio/ccw: Use vfio_[attach/detach]_device
>    backends/iommufd: Introduce the iommufd object
>    vfio/pci: Allow the selection of a given iommu backend
> 
> Yi Liu (5):
>    vfio/common: Move IOMMU agnostic helpers to a separate file
>    vfio/common: Move legacy VFIO backend code into separate container.c
>    vfio: Add base container
>    util/char_dev: Add open_cdev()
>    vfio/iommufd: Implement the iommufd backend
> 
> Zhenzhong Duan (9):
>    Update linux-header to support iommufd cdev and hwpt alloc
>    vfio/common: Extract out vfio_kvm_device_[add/del]_fd
>    vfio/common: Add a vfio device iterator
>    vfio/common: Refactor vfio_viommu_preset() to be group agnostic
>    vfio/common: Simplify vfio_viommu_preset()
>    Add iommufd configure option
>    vfio/iommufd: Add vfio device iterator callback for iommufd
>    vfio/pci: Adapt vfio pci hot reset support with iommufd BE
>    vfio/pci: Make vfio cdev pre-openable by passing a file handle
> 
>   MAINTAINERS                           |   13 +
>   backends/Kconfig                      |    4 +
>   backends/iommufd.c                    |  291 ++++
>   backends/meson.build                  |    3 +
>   backends/trace-events                 |   13 +
>   hw/vfio/ap.c                          |   68 +-
>   hw/vfio/ccw.c                         |  120 +-
>   hw/vfio/common.c                      | 1948 +++----------------------
>   hw/vfio/container-base.c              |  160 ++
>   hw/vfio/container.c                   | 1208 +++++++++++++++
>   hw/vfio/helpers.c                     |  626 ++++++++
>   hw/vfio/iommufd.c                     |  554 +++++++
>   hw/vfio/meson.build                   |    6 +
>   hw/vfio/pci.c                         |  319 +++-
>   hw/vfio/platform.c                    |   43 +-
>   hw/vfio/spapr.c                       |   22 +-
>   hw/vfio/trace-events                  |   21 +-
>   include/hw/vfio/vfio-common.h         |  111 +-
>   include/hw/vfio/vfio-container-base.h |  158 ++
>   include/qemu/char_dev.h               |   16 +
>   include/standard-headers/linux/fuse.h |    3 +
>   include/sysemu/iommufd.h              |   49 +
>   linux-headers/linux/iommufd.h         |  444 ++++++
>   linux-headers/linux/kvm.h             |   13 +-
>   linux-headers/linux/vfio.h            |  148 +-
>   meson.build                           |    6 +
>   meson_options.txt                     |    2 +
>   qapi/qom.json                         |   18 +-
>   qemu-options.hx                       |   13 +
>   scripts/meson-buildoptions.sh         |    3 +
>   scripts/update-linux-headers.sh       |    3 +-
>   util/chardev_open.c                   |   61 +
>   util/meson.build                      |    1 +
>   33 files changed, 4395 insertions(+), 2073 deletions(-)
>   create mode 100644 backends/iommufd.c
>   create mode 100644 hw/vfio/container-base.c
>   create mode 100644 hw/vfio/container.c
>   create mode 100644 hw/vfio/helpers.c
>   create mode 100644 hw/vfio/iommufd.c
>   create mode 100644 include/hw/vfio/vfio-container-base.h
>   create mode 100644 include/qemu/char_dev.h
>   create mode 100644 include/sysemu/iommufd.h
>   create mode 100644 linux-headers/linux/iommufd.h
>   create mode 100644 util/chardev_open.c
> 



^ permalink raw reply	[flat|nested] 109+ messages in thread
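
As a rough illustration of the two userspace paths in the diagram above, the sketch below
walks the legacy BE side: secure context setup through the group and container fds, device
fd retrieval from the group, and a DMA map through the container fd. The group id, BDF and
mapping parameters are placeholders and error handling is omitted; it only mirrors the
[Secure Context setup]/[Device access]/[DMA Mapping flow] steps quoted above, it is not
code from the series. (The iommufd BE side instead goes through BIND_IOMMUFD/ATTACH_IOAS
on the device cdev and /dev/iommu.)

/*
 * Sketch of the legacy BE path: group/container fds for the secure context,
 * group fd ioctl for the device fd, container fd for DMA mappings.
 * Placeholders: group id, BDF, mapping parameters.
 */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int legacy_setup_and_map(void *vaddr, __u64 iova, __u64 size)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/42", O_RDWR);          /* /dev/vfio/$group_id */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (__u64)(uintptr_t)vaddr,
        .iova = iova,
        .size = size,
    };
    int device;

    /* [Secure Context setup]: set_container + set_iommu */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

    /* [Device access]: device fd is retrieved from the group fd ioctl */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:02:00.0");

    /* [DMA Mapping flow]: map/unmap go through the container fd */
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    return device;
}
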

* RE: [PATCH v1 00/22] vfio: Adopt iommufd
  2023-09-15 12:42 ` Cédric Le Goater
@ 2023-09-15 13:14   ` Duan, Zhenzhong
  2023-09-18 11:51   ` Jason Gunthorpe
  1 sibling, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-15 13:14 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Friday, September 15, 2023 8:43 PM
>Subject: Re: [PATCH v1 00/22] vfio: Adopt iommufd
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> Hi All,
>>
>> As the kernel side iommufd cdev and hot reset feature have been queued,
>> also hwpt alloc has been added in Jason's for_next branch [1], I'd like
>> to update a new version matching kernel side update and with rfc flag
>> removed. Qemu code can be found at [2], look forward more comments!
>
>FYI, I have started cleaning up the VFIO support in QEMU PPC. First
>is the removal of nvlink2, which was dropped from the kernel 2.5 years
>ago. Next is probably removal of all the PPC bits in VFIO. Code is
>bitrotting and AFAICT VFIO has been broken on these platforms since
>5.18 or so.
>
>The consequences on this patchset should be less movement of code
>between files. I think this is something we should reduce to maintain
>history.

Glad to know I'll need to move less code. I'll rebase this patchset
after you finish.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 00/22] vfio: Adopt iommufd
  2023-09-15 12:42 ` Cédric Le Goater
  2023-09-15 13:14   ` Duan, Zhenzhong
@ 2023-09-18 11:51   ` Jason Gunthorpe
  2023-09-18 12:23     ` Cédric Le Goater
  1 sibling, 1 reply; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-18 11:51 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, nicolinc,
	joao.m.martins, eric.auger, peterx, jasowang, kevin.tian,
	yi.l.liu, yi.y.sun, chao.p.peng

On Fri, Sep 15, 2023 at 02:42:48PM +0200, Cédric Le Goater wrote:
> On 8/30/23 12:37, Zhenzhong Duan wrote:
> > Hi All,
> > 
> > As the kernel side iommufd cdev and hot reset feature have been queued,
> > also hwpt alloc has been added in Jason's for_next branch [1], I'd like
> > to update a new version matching kernel side update and with rfc flag
> > removed. Qemu code can be found at [2], look forward more comments!
> 
> FYI, I have started cleaning up the VFIO support in QEMU PPC. First
> is the removal of nvlink2, which was dropped from the kernel 2.5 years
> ago. Next is probably removal of all the PPC bits in VFIO. Code is
> bitrotting and AFAICT VFIO has been broken on these platforms since
> 5.18 or so.

It was fixed since then - at least one company (not IBM) still cares
about vfio on ppc, though I think it is for a DPDK use case not VFIO.

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 00/22] vfio: Adopt iommufd
  2023-09-18 11:51   ` Jason Gunthorpe
@ 2023-09-18 12:23     ` Cédric Le Goater
  2023-09-18 17:56       ` Jason Gunthorpe
  0 siblings, 1 reply; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-18 12:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, nicolinc,
	joao.m.martins, eric.auger, peterx, jasowang, kevin.tian,
	yi.l.liu, yi.y.sun, chao.p.peng

On 9/18/23 13:51, Jason Gunthorpe wrote:
> On Fri, Sep 15, 2023 at 02:42:48PM +0200, Cédric Le Goater wrote:
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> Hi All,
>>>
>>> As the kernel side iommufd cdev and hot reset feature have been queued,
>>> also hwpt alloc has been added in Jason's for_next branch [1], I'd like
>>> to update a new version matching kernel side update and with rfc flag
>>> removed. Qemu code can be found at [2], look forward more comments!
>>
>> FYI, I have started cleaning up the VFIO support in QEMU PPC. First
>> is the removal of nvlink2, which was dropped from the kernel 2.5 years
>> ago. Next is probably removal of all the PPC bits in VFIO. Code is
>> bitrotting and AFAICT VFIO has been broken on these platforms since
>> 5.18 or so.
> 
> It was fixed since then - at least one company (not IBM) still cares
> about vfio on ppc, though I think it is for a DPDK use case not VFIO.

Indeed.
I just checked on a POWER9 box running a debian sid (6.4) and device
assignment of a simple NIC (e1000e) in a ubuntu 23.04 guest worked
correctly. Using a 6.6-rc1 on the host worked also. One improvement
would be to reflect in the Kconfig files that CONFIG_IOMMUFD is not
supported on PPC so that it can not be selected.

Thanks,

C.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 00/22] vfio: Adopt iommufd
  2023-09-18 12:23     ` Cédric Le Goater
@ 2023-09-18 17:56       ` Jason Gunthorpe
  0 siblings, 0 replies; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-18 17:56 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, nicolinc,
	joao.m.martins, eric.auger, peterx, jasowang, kevin.tian,
	yi.l.liu, yi.y.sun, chao.p.peng

On Mon, Sep 18, 2023 at 02:23:48PM +0200, Cédric Le Goater wrote:
> On 9/18/23 13:51, Jason Gunthorpe wrote:
> > On Fri, Sep 15, 2023 at 02:42:48PM +0200, Cédric Le Goater wrote:
> > > On 8/30/23 12:37, Zhenzhong Duan wrote:
> > > > Hi All,
> > > > 
> > > > As the kernel side iommufd cdev and hot reset feature have been queued,
> > > > also hwpt alloc has been added in Jason's for_next branch [1], I'd like
> > > > to update a new version matching kernel side update and with rfc flag
> > > > removed. Qemu code can be found at [2], look forward more comments!
> > > 
> > > FYI, I have started cleaning up the VFIO support in QEMU PPC. First
> > > is the removal of nvlink2, which was dropped from the kernel 2.5 years
> > > ago. Next is probably removal of all the PPC bits in VFIO. Code is
> > > bitrotting and AFAICT VFIO has been broken on these platforms since
> > > 5.18 or so.
> > 
> > It was fixed since then - at least one company (not IBM) still cares
> > about vfio on ppc, though I think it is for a DPDK use case not VFIO.
> 
> Indeed.
> I just checked on a POWER9 box running a debian sid (6.4) and device
> assignment of a simple NIC (e1000e) in a ubuntu 23.04 guest worked
> correctly. Using a 6.6-rc1 on the host worked also. One improvement
> would be to reflect in the Kconfig files that CONFIG_IOMMUFD is not
> supported on PPC so that it can not be selected.

When we did this I thought there were other iommu drivers on Power
that did work with VFIO (fsl_pamu specifically), but it turns out that
the ppc iommu driver doesn't support VFIO and the VFIO FSL stuff is for
ARM only.

So it could be done...

These days I believe we have the capacity to do the PPC stuff without
making it so special - it would be a lot of work but the road is pretty
clear. At least if qemu wants to remove PPC VFIO support I would not
object.

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 14/22] vfio/common: Simplify vfio_viommu_preset()
  2023-08-30 10:37 ` [PATCH v1 14/22] vfio/common: Simplify vfio_viommu_preset() Zhenzhong Duan
@ 2023-09-19 16:01   ` Cédric Le Goater
  2023-09-20  2:59     ` Duan, Zhenzhong
  0 siblings, 1 reply; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-19 16:01 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

On 8/30/23 12:37, Zhenzhong Duan wrote:
> Commit "vfio/container-base: Introduce [attach/detach]_device container callbacks"
> added support to link the device to its address space; we can utilize it to
> simplify vfio_viommu_preset().
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

This looks like a revert of patch 07. Can it be avoided in v2 ?

Thanks,

C.

> ---
>   hw/vfio/common.c | 17 +----------------
>   1 file changed, 1 insertion(+), 16 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 86b6af5740..6c3e98d5fd 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -165,22 +165,7 @@ void vfio_unblock_multiple_devices_migration(void)
>   
>   bool vfio_viommu_preset(VFIODevice *vbasedev)
>   {
> -    VFIOAddressSpace *space;
> -    VFIOContainer *container;
> -    VFIODevice *tmp_dev;
> -
> -    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> -        QLIST_FOREACH(container, &space->containers, next) {
> -            tmp_dev = NULL;
> -            while ((tmp_dev = vfio_container_dev_iter_next(container,
> -                                                           tmp_dev))) {
> -                if (vbasedev == tmp_dev) {
> -                    return space->as != &address_space_memory;
> -                }
> -            }
> -        }
> -    }
> -    g_assert_not_reached();
> +    return vbasedev->container->space->as != &address_space_memory;
>   }
>   
>   static void vfio_set_migration_error(int err)



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-08-30 10:37 ` [PATCH v1 15/22] Add iommufd configure option Zhenzhong Duan
@ 2023-09-19 17:07   ` Cédric Le Goater
  2023-09-20  3:42     ` Duan, Zhenzhong
  0 siblings, 1 reply; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-19 17:07 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
	Thomas Huth, Philippe Mathieu-Daudé

On 8/30/23 12:37, Zhenzhong Duan wrote:
> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> iommufd support, enabled by default.

Why would someone want to disable support at compile time ? It might
have been useful for dev but now QEMU should self-adjust at runtime
depending only on the host capabilities AFAIUI. Am I missing something ?

Thanks,

C.


> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   meson.build                   | 6 ++++++
>   meson_options.txt             | 2 ++
>   scripts/meson-buildoptions.sh | 3 +++
>   3 files changed, 11 insertions(+)
> 
> diff --git a/meson.build b/meson.build
> index 98e68ef0b1..6526d8cc9b 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -574,6 +574,10 @@ have_tpm = get_option('tpm') \
>     .require(targetos != 'windows', error_message: 'TPM emulation only available on POSIX systems') \
>     .allowed()
>   
> +have_iommufd = get_option('iommufd') \
> +  .require(targetos == 'linux', error_message: 'iommufd is supported only on Linux') \
> +  .allowed()
> +
>   # vhost
>   have_vhost_user = get_option('vhost_user') \
>     .disable_auto_if(targetos != 'linux') \
> @@ -2129,6 +2133,7 @@ endif
>   config_host_data.set('CONFIG_SNAPPY', snappy.found())
>   config_host_data.set('CONFIG_TPM', have_tpm)
>   config_host_data.set('CONFIG_TSAN', get_option('tsan'))
> +config_host_data.set('CONFIG_IOMMUFD', have_iommufd)
>   config_host_data.set('CONFIG_USB_LIBUSB', libusb.found())
>   config_host_data.set('CONFIG_VDE', vde.found())
>   config_host_data.set('CONFIG_VHOST_NET', have_vhost_net)
> @@ -4051,6 +4056,7 @@ summary_info += {'vhost-user-crypto support': have_vhost_user_crypto}
>   summary_info += {'vhost-user-blk server support': have_vhost_user_blk_server}
>   summary_info += {'vhost-vdpa support': have_vhost_vdpa}
>   summary_info += {'build guest agent': have_ga}
> +summary_info += {'iommufd support': have_iommufd}
>   summary(summary_info, bool_yn: true, section: 'Configurable features')
>   
>   # Compilation information
> diff --git a/meson_options.txt b/meson_options.txt
> index aaea5ddd77..aed91d173b 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -105,6 +105,8 @@ option('dbus_display', type: 'feature', value: 'auto',
>          description: '-display dbus support')
>   option('tpm', type : 'feature', value : 'auto',
>          description: 'TPM support')
> +option('iommufd', type : 'feature', value : 'auto',
> +       description: 'iommufd support')
>   
>   # Do not enable it by default even for Mingw32, because it doesn't
>   # work on Wine.
> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
> index 9da3fe299b..719401ffb0 100644
> --- a/scripts/meson-buildoptions.sh
> +++ b/scripts/meson-buildoptions.sh
> @@ -113,6 +113,7 @@ meson_options_help() {
>     printf "%s\n" '  hax             HAX acceleration support'
>     printf "%s\n" '  hvf             HVF acceleration support'
>     printf "%s\n" '  iconv           Font glyph conversion support'
> +  printf "%s\n" '  iommufd         iommufd support'
>     printf "%s\n" '  jack            JACK sound support'
>     printf "%s\n" '  keyring         Linux keyring support'
>     printf "%s\n" '  kvm             KVM acceleration support'
> @@ -325,6 +326,8 @@ _meson_option_parse() {
>       --enable-install-blobs) printf "%s" -Dinstall_blobs=true ;;
>       --disable-install-blobs) printf "%s" -Dinstall_blobs=false ;;
>       --interp-prefix=*) quote_sh "-Dinterp_prefix=$2" ;;
> +    --enable-iommufd) printf "%s" -Diommufd=enabled ;;
> +    --disable-iommufd) printf "%s" -Diommufd=disabled ;;
>       --enable-jack) printf "%s" -Djack=enabled ;;
>       --disable-jack) printf "%s" -Djack=disabled ;;
>       --enable-keyring) printf "%s" -Dkeyring=enabled ;;



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 13/22] vfio: Add base container
  2023-08-30 10:37 ` [PATCH v1 13/22] vfio: Add base container Zhenzhong Duan
@ 2023-09-19 17:23   ` Cédric Le Goater
  2023-09-20  8:48     ` Duan, Zhenzhong
                       ` (2 more replies)
  0 siblings, 3 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-19 17:23 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Yi Sun, Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Abstract the VFIOContainer to be a base object. It is supposed to be
> embedded into the legacy VFIO container and, later on, into the new iommufd
> based container.
> 
> The base container implements generic code such as the code related to the
> memory_listener and address space management. The VFIOContainerOps
> implements the callbacks that depend on which kernel userspace interface is used.
> 
> 'common.c' and the vfio device code only manipulate the base container through
> wrapper functions that call the functions defined in VFIOContainerOpsClass.
> The existing 'container.c' code is converted to implement the legacy container
> ops functions.
> 
> Below is the base container. It is named VFIOContainer; the old VFIOContainer
> is replaced with VFIOLegacyContainer.

Usually, we introduce the new interface on its own, port the current models
on top of the new interface, wire the new models into the current
implementation and remove the old implementation. Then, we can start
adding extensions to support other implementations.

spapr should be taken care of separately, following the principle above.
With my PPC hat on, I would not even read such a massive change, too risky
for the subsystem. This patch will need (much) further splitting to be
understandable and acceptable.

Also, please include the .h file first, it helps in reading. Have you
considered using an InterfaceClass ?

Thanks,

C.
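
For reference, a minimal sketch of what the ops type could look like if modelled as a QOM
InterfaceClass rather than an abstract object; the interface name and the callback subset
shown here are hypothetical and only meant to illustrate the shape of such a change, not to
propose an actual implementation.

/*
 * Hypothetical sketch: container backend ops as a QOM interface.
 * The type name and callback list are illustrative only.
 */
#include "qemu/osdep.h"
#include "qemu/module.h"
#include "qom/object.h"
#include "hw/vfio/vfio-common.h"

#define TYPE_VFIO_IOMMU "vfio-iommu"

typedef struct VFIOIOMMUClass {
    InterfaceClass parent_class;

    int (*dma_map)(VFIOContainer *container, hwaddr iova, ram_addr_t size,
                   void *vaddr, bool readonly);
    int (*dma_unmap)(VFIOContainer *container, hwaddr iova, ram_addr_t size,
                     IOMMUTLBEntry *iotlb);
    int (*attach_device)(char *name, VFIODevice *vbasedev,
                         AddressSpace *as, Error **errp);
    void (*detach_device)(VFIODevice *vbasedev);
} VFIOIOMMUClass;

static const TypeInfo vfio_iommu_interface_info = {
    .name = TYPE_VFIO_IOMMU,
    .parent = TYPE_INTERFACE,
    .class_size = sizeof(VFIOIOMMUClass),
};

static void vfio_iommu_register_types(void)
{
    type_register_static(&vfio_iommu_interface_info);
}
type_init(vfio_iommu_register_types);

A concrete container type (legacy or iommufd) would then list { TYPE_VFIO_IOMMU } in the
.interfaces array of its own TypeInfo and fill in the callbacks from its class_init.
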

> 
> struct VFIOContainer {
>      VFIOIOMMUBackendOpsClass *ops;
>      VFIOAddressSpace *space;
>      MemoryListener listener;
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
>      unsigned int dma_max_mappings;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>      QLIST_ENTRY(VFIOContainer) next;
> };
> 
> struct VFIOLegacyContainer {
>      VFIOContainer bcontainer;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener prereg_listener;
>      unsigned iommu_type;
>      QLIST_HEAD(, VFIOGroup) group_list;
> };
> 
> Co-authored-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/vfio/common.c                      |  72 +++++---
>   hw/vfio/container-base.c              | 160 +++++++++++++++++
>   hw/vfio/container.c                   | 247 ++++++++++++++++----------
>   hw/vfio/meson.build                   |   1 +
>   hw/vfio/spapr.c                       |  22 +--
>   hw/vfio/trace-events                  |   4 +-
>   include/hw/vfio/vfio-common.h         |  85 ++-------
>   include/hw/vfio/vfio-container-base.h | 155 ++++++++++++++++
>   8 files changed, 540 insertions(+), 206 deletions(-)
>   create mode 100644 hw/vfio/container-base.c
>   create mode 100644 include/hw/vfio/vfio-container-base.h
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 044710fc1f..86b6af5740 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -379,19 +379,20 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>            * of vaddr will always be there, even if the memory object is
>            * destroyed and its backing memory munmap-ed.
>            */
> -        ret = vfio_dma_map(container, iova,
> -                           iotlb->addr_mask + 1, vaddr,
> -                           read_only);
> +        ret = vfio_container_dma_map(container, iova,
> +                                     iotlb->addr_mask + 1, vaddr,
> +                                     read_only);
>           if (ret) {
> -            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> +            error_report("vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
>                            "0x%"HWADDR_PRIx", %p) = %d (%s)",
>                            container, iova,
>                            iotlb->addr_mask + 1, vaddr, ret, strerror(-ret));
>           }
>       } else {
> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
> +        ret = vfio_container_dma_unmap(container, iova,
> +                                       iotlb->addr_mask + 1, iotlb);
>           if (ret) {
> -            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +            error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                            "0x%"HWADDR_PRIx") = %d (%s)",
>                            container, iova,
>                            iotlb->addr_mask + 1, ret, strerror(-ret));
> @@ -407,14 +408,15 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>   {
>       VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>                                                   listener);
> +    VFIOContainer *container = vrdl->container;
>       const hwaddr size = int128_get64(section->size);
>       const hwaddr iova = section->offset_within_address_space;
>       int ret;
>   
>       /* Unmap with a single call. */
> -    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
> +    ret = vfio_container_dma_unmap(container, iova, size , NULL);
>       if (ret) {
> -        error_report("%s: vfio_dma_unmap() failed: %s", __func__,
> +        error_report("%s: vfio_container_dma_unmap() failed: %s", __func__,
>                        strerror(-ret));
>       }
>   }
> @@ -424,6 +426,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>   {
>       VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>                                                   listener);
> +    VFIOContainer *container = vrdl->container;
>       const hwaddr end = section->offset_within_region +
>                          int128_get64(section->size);
>       hwaddr start, next, iova;
> @@ -442,8 +445,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>                  section->offset_within_address_space;
>           vaddr = memory_region_get_ram_ptr(section->mr) + start;
>   
> -        ret = vfio_dma_map(vrdl->container, iova, next - start,
> -                           vaddr, section->readonly);
> +        ret = vfio_container_dma_map(container, iova, next - start,
> +                                     vaddr, section->readonly);
>           if (ret) {
>               /* Rollback */
>               vfio_ram_discard_notify_discard(rdl, section);
> @@ -756,10 +759,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           }
>       }
>   
> -    ret = vfio_dma_map(container, iova, int128_get64(llsize),
> -                       vaddr, section->readonly);
> +    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
> +                                 vaddr, section->readonly);
>       if (ret) {
> -        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> +        error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
>                      "0x%"HWADDR_PRIx", %p) = %d (%s)",
>                      container, iova, int128_get64(llsize), vaddr, ret,
>                      strerror(-ret));
> @@ -775,7 +778,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>   
>   fail:
>       if (memory_region_is_ram_device(section->mr)) {
> -        error_report("failed to vfio_dma_map. pci p2p may not work");
> +        error_report("failed to vfio_container_dma_map. pci p2p may not work");
>           return;
>       }
>       /*
> @@ -860,18 +863,20 @@ static void vfio_listener_region_del(MemoryListener *listener,
>           if (int128_eq(llsize, int128_2_64())) {
>               /* The unmap ioctl doesn't accept a full 64-bit span. */
>               llsize = int128_rshift(llsize, 1);
> -            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +            ret = vfio_container_dma_unmap(container, iova,
> +                                           int128_get64(llsize), NULL);
>               if (ret) {
> -                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +                error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                                "0x%"HWADDR_PRIx") = %d (%s)",
>                                container, iova, int128_get64(llsize), ret,
>                                strerror(-ret));
>               }
>               iova += int128_get64(llsize);
>           }
> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +        ret = vfio_container_dma_unmap(container, iova,
> +                                       int128_get64(llsize), NULL);
>           if (ret) {
> -            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +            error_report("vfio_container_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                            "0x%"HWADDR_PRIx") = %d (%s)",
>                            container, iova, int128_get64(llsize), ret,
>                            strerror(-ret));
> @@ -1103,7 +1108,7 @@ static void vfio_listener_log_global_start(MemoryListener *listener)
>       if (vfio_devices_all_device_dirty_tracking(container)) {
>           ret = vfio_devices_dma_logging_start(container);
>       } else {
> -        ret = vfio_set_dirty_page_tracking(container, true);
> +        ret = vfio_container_set_dirty_page_tracking(container, true);
>       }
>   
>       if (ret) {
> @@ -1121,7 +1126,7 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
>       if (vfio_devices_all_device_dirty_tracking(container)) {
>           vfio_devices_dma_logging_stop(container);
>       } else {
> -        ret = vfio_set_dirty_page_tracking(container, false);
> +        ret = vfio_container_set_dirty_page_tracking(container, false);
>       }
>   
>       if (ret) {
> @@ -1204,7 +1209,7 @@ int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>       if (all_device_dirty_tracking) {
>           ret = vfio_devices_query_dirty_bitmap(container, &vbmap, iova, size);
>       } else {
> -        ret = vfio_query_dirty_bitmap(container, &vbmap, iova, size);
> +        ret = vfio_container_query_dirty_bitmap(container, &vbmap, iova, size);
>       }
>   
>       if (ret) {
> @@ -1214,8 +1219,7 @@ int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>       dirty_pages = cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr,
>                                                            vbmap.pages);
>   
> -    trace_vfio_get_dirty_bitmap(container->fd, iova, size, vbmap.size,
> -                                ram_addr, dirty_pages);
> +    trace_vfio_get_dirty_bitmap(iova, size, vbmap.size, ram_addr, dirty_pages);
>   out:
>       g_free(vbmap.bitmap);
>   
> @@ -1525,3 +1529,25 @@ retry:
>   
>       return info;
>   }
> +
> +int vfio_attach_device(char *name, VFIODevice *vbasedev,
> +                       AddressSpace *as, Error **errp)
> +{
> +    const VFIOIOMMUBackendOpsClass *ops;
> +
> +    ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
> +                  object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
> +    if (!ops) {
> +        error_setg(errp, "VFIO IOMMU Backend not found!");
> +        return -ENODEV;
> +    }
> +    return ops->attach_device(name, vbasedev, as, errp);
> +}
> +
> +void vfio_detach_device(VFIODevice *vbasedev)
> +{
> +    if (!vbasedev->container) {
> +        return;
> +    }
> +    vbasedev->container->ops->detach_device(vbasedev);
> +}
> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
> new file mode 100644
> index 0000000000..876e95c6dd
> --- /dev/null
> +++ b/hw/vfio/container-base.c
> @@ -0,0 +1,160 @@
> +/*
> + * VFIO BASE CONTAINER
> + *
> + * Copyright (C) 2023 Intel Corporation.
> + * Copyright Red Hat, Inc. 2023
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Eric Auger <eric.auger@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
> +#include "hw/vfio/vfio-container-base.h"
> +
> +VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> +                                 VFIODevice *curr)
> +{
> +    if (!container->ops->dev_iter_next) {
> +        return NULL;
> +    }
> +
> +    return container->ops->dev_iter_next(container, curr);
> +}
> +
> +int vfio_container_dma_map(VFIOContainer *container,
> +                           hwaddr iova, ram_addr_t size,
> +                           void *vaddr, bool readonly)
> +{
> +    if (!container->ops->dma_map) {
> +        return -EINVAL;
> +    }
> +
> +    return container->ops->dma_map(container, iova, size, vaddr, readonly);
> +}
> +
> +int vfio_container_dma_unmap(VFIOContainer *container,
> +                             hwaddr iova, ram_addr_t size,
> +                             IOMMUTLBEntry *iotlb)
> +{
> +    if (!container->ops->dma_unmap) {
> +        return -EINVAL;
> +    }
> +
> +    return container->ops->dma_unmap(container, iova, size, iotlb);
> +}
> +
> +int vfio_container_set_dirty_page_tracking(VFIOContainer *container,
> +                                            bool start)
> +{
> +    /* Fallback to all pages dirty if dirty page sync isn't supported */
> +    if (!container->ops->set_dirty_page_tracking) {
> +        return 0;
> +    }
> +
> +    return container->ops->set_dirty_page_tracking(container, start);
> +}
> +
> +int vfio_container_query_dirty_bitmap(VFIOContainer *container,
> +                                      VFIOBitmap *vbmap,
> +                                      hwaddr iova, hwaddr size)
> +{
> +    if (!container->ops->query_dirty_bitmap) {
> +        return -EINVAL;
> +    }
> +
> +    return container->ops->query_dirty_bitmap(container, vbmap, iova, size);
> +}
> +
> +int vfio_container_add_section_window(VFIOContainer *container,
> +                                      MemoryRegionSection *section,
> +                                      Error **errp)
> +{
> +    if (!container->ops->add_window) {
> +        return 0;
> +    }
> +
> +    return container->ops->add_window(container, section, errp);
> +}
> +
> +void vfio_container_del_section_window(VFIOContainer *container,
> +                                       MemoryRegionSection *section)
> +{
> +    if (!container->ops->del_window) {
> +        return;
> +    }
> +
> +    return container->ops->del_window(container, section);
> +}
> +
> +void vfio_container_init(VFIOContainer *container,
> +                         VFIOAddressSpace *space,
> +                         struct VFIOIOMMUBackendOpsClass *ops)
> +{
> +    container->ops = ops;
> +    container->space = space;
> +    container->error = NULL;
> +    container->dirty_pages_supported = false;
> +    container->dma_max_mappings = 0;
> +    QLIST_INIT(&container->giommu_list);
> +    QLIST_INIT(&container->hostwin_list);
> +    QLIST_INIT(&container->vrdl_list);
> +}
> +
> +void vfio_container_destroy(VFIOContainer *container)
> +{
> +    VFIORamDiscardListener *vrdl, *vrdl_tmp;
> +    VFIOGuestIOMMU *giommu, *tmp;
> +    VFIOHostDMAWindow *hostwin, *next;
> +
> +    QLIST_SAFE_REMOVE(container, next);
> +
> +    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
> +        RamDiscardManager *rdm;
> +
> +        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
> +        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
> +        QLIST_REMOVE(vrdl, next);
> +        g_free(vrdl);
> +    }
> +
> +    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> +        memory_region_unregister_iommu_notifier(
> +                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
> +        QLIST_REMOVE(giommu, giommu_next);
> +        g_free(giommu);
> +    }
> +
> +    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
> +                       next) {
> +        QLIST_REMOVE(hostwin, hostwin_next);
> +        g_free(hostwin);
> +    }
> +}
> +
> +static const TypeInfo vfio_iommu_backend_ops_type_info = {
> +    .name = TYPE_VFIO_IOMMU_BACKEND_OPS,
> +    .parent = TYPE_OBJECT,
> +    .abstract = true,
> +    .class_size = sizeof(VFIOIOMMUBackendOpsClass),
> +};
> +
> +static void vfio_iommu_backend_ops_register_types(void)
> +{
> +    type_register_static(&vfio_iommu_backend_ops_type_info);
> +}
> +type_init(vfio_iommu_backend_ops_register_types);
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index c71fddc09a..bb29b3612d 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -42,7 +42,8 @@
>   VFIOGroupList vfio_group_list =
>       QLIST_HEAD_INITIALIZER(vfio_group_list);
>   
> -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
> +static int vfio_ram_block_discard_disable(VFIOLegacyContainer *container,
> +                                          bool state)
>   {
>       switch (container->iommu_type) {
>       case VFIO_TYPE1v2_IOMMU:
> @@ -65,11 +66,18 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>       }
>   }
>   
> -VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> -                                         VFIODevice *curr)
> +static VFIODevice *vfio_legacy_dev_iter_next(VFIOContainer *bcontainer,
> +                                             VFIODevice *curr)
>   {
>       VFIOGroup *group;
>   
> +    assert(object_class_dynamic_cast(OBJECT_CLASS(bcontainer->ops),
> +                                     TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
> +
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer,
> +                                                  bcontainer);
> +
>       if (!curr) {
>           group = QLIST_FIRST(&container->group_list);
>       } else {
> @@ -85,10 +93,11 @@ VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>       return QLIST_FIRST(&group->device_list);
>   }
>   
> -static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> +static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
>                                    hwaddr iova, ram_addr_t size,
>                                    IOMMUTLBEntry *iotlb)
>   {
> +    VFIOContainer *bcontainer = &container->bcontainer;
>       struct vfio_iommu_type1_dma_unmap *unmap;
>       struct vfio_bitmap *bitmap;
>       VFIOBitmap vbmap;
> @@ -116,7 +125,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>       bitmap->size = vbmap.size;
>       bitmap->data = (__u64 *)vbmap.bitmap;
>   
> -    if (vbmap.size > container->max_dirty_bitmap_size) {
> +    if (vbmap.size > bcontainer->max_dirty_bitmap_size) {
>           error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap.size);
>           ret = -E2BIG;
>           goto unmap_exit;
> @@ -140,9 +149,13 @@ unmap_exit:
>   /*
>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>    */
> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
> -                   ram_addr_t size, IOMMUTLBEntry *iotlb)
> +static int vfio_legacy_dma_unmap(VFIOContainer *bcontainer, hwaddr iova,
> +                          ram_addr_t size, IOMMUTLBEntry *iotlb)
>   {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer,
> +                                                  bcontainer);
> +
>       struct vfio_iommu_type1_dma_unmap unmap = {
>           .argsz = sizeof(unmap),
>           .flags = 0,
> @@ -152,9 +165,9 @@ int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>       bool need_dirty_sync = false;
>       int ret;
>   
> -    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
> -        if (!vfio_devices_all_device_dirty_tracking(container) &&
> -            container->dirty_pages_supported) {
> +    if (iotlb && vfio_devices_all_running_and_mig_active(bcontainer)) {
> +        if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
> +            bcontainer->dirty_pages_supported) {
>               return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>           }
>   
> @@ -176,8 +189,8 @@ int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>            */
>           if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
>               container->iommu_type == VFIO_TYPE1v2_IOMMU) {
> -            trace_vfio_dma_unmap_overflow_workaround();
> -            unmap.size -= 1ULL << ctz64(container->pgsizes);
> +            trace_vfio_legacy_dma_unmap_overflow_workaround();
> +            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
>               continue;
>           }
>           error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
> @@ -185,7 +198,7 @@ int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>       }
>   
>       if (need_dirty_sync) {
> -        ret = vfio_get_dirty_bitmap(container, iova, size,
> +        ret = vfio_get_dirty_bitmap(bcontainer, iova, size,
>                                       iotlb->translated_addr);
>           if (ret) {
>               return ret;
> @@ -195,9 +208,13 @@ int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>       return 0;
>   }
>   
> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> -                 ram_addr_t size, void *vaddr, bool readonly)
> +static int vfio_legacy_dma_map(VFIOContainer *bcontainer, hwaddr iova,
> +                               ram_addr_t size, void *vaddr, bool readonly)
>   {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer,
> +                                                  bcontainer);
> +
>       struct vfio_iommu_type1_dma_map map = {
>           .argsz = sizeof(map),
>           .flags = VFIO_DMA_MAP_FLAG_READ,
> @@ -216,7 +233,8 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>        * the VGA ROM space.
>        */
>       if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
> +        (errno == EBUSY &&
> +         vfio_legacy_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
>            ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>           return 0;
>       }
> @@ -225,14 +243,18 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>       return -errno;
>   }
>   
> -int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
> +static int vfio_legacy_set_dirty_page_tracking(VFIOContainer *bcontainer,
> +                                               bool start)
>   {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer,
> +                                                  bcontainer);
>       int ret;
>       struct vfio_iommu_type1_dirty_bitmap dirty = {
>           .argsz = sizeof(dirty),
>       };
>   
> -    if (!container->dirty_pages_supported) {
> +    if (!bcontainer->dirty_pages_supported) {
>           return 0;
>       }
>   
> @@ -252,9 +274,13 @@ int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>       return ret;
>   }
>   
> -int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
> -                            hwaddr iova, hwaddr size)
> +static int vfio_legacy_query_dirty_bitmap(VFIOContainer *bcontainer,
> +                                          VFIOBitmap *vbmap,
> +                                          hwaddr iova, hwaddr size)
>   {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer,
> +                                                  bcontainer);
>       struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>       struct vfio_iommu_type1_dirty_bitmap_get *range;
>       int ret;
> @@ -289,18 +315,24 @@ int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
>       return ret;
>   }
>   
> -static void vfio_listener_release(VFIOContainer *container)
> +static void vfio_listener_release(VFIOLegacyContainer *container)
>   {
> -    memory_listener_unregister(&container->listener);
> +    VFIOContainer *bcontainer = &container->bcontainer;
> +
> +    memory_listener_unregister(&bcontainer->listener);
>       if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>           memory_listener_unregister(&container->prereg_listener);
>       }
>   }
>   
> -int vfio_container_add_section_window(VFIOContainer *container,
> -                                      MemoryRegionSection *section,
> -                                      Error **errp)
> +static int
> +vfio_legacy_add_section_window(VFIOContainer *bcontainer,
> +                               MemoryRegionSection *section,
> +                               Error **errp)
>   {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer,
> +                                                  bcontainer);
>       VFIOHostDMAWindow *hostwin;
>       hwaddr pgsize = 0;
>       int ret;
> @@ -310,7 +342,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
>       }
>   
>       /* For now intersections are not allowed, we may relax this later */
> -    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
>           if (ranges_overlap(hostwin->min_iova,
>                              hostwin->max_iova - hostwin->min_iova + 1,
>                              section->offset_within_address_space,
> @@ -332,7 +364,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
>           return ret;
>       }
>   
> -    vfio_host_win_add(container, section->offset_within_address_space,
> +    vfio_host_win_add(bcontainer, section->offset_within_address_space,
>                         section->offset_within_address_space +
>                         int128_get64(section->size) - 1, pgsize);
>   #ifdef CONFIG_KVM
> @@ -365,16 +397,21 @@ int vfio_container_add_section_window(VFIOContainer *container,
>       return 0;
>   }
>   
> -void vfio_container_del_section_window(VFIOContainer *container,
> -                                       MemoryRegionSection *section)
> +static void
> +vfio_legacy_del_section_window(VFIOContainer *bcontainer,
> +                               MemoryRegionSection *section)
>   {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer,
> +                                                  bcontainer);
> +
>       if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>           return;
>       }
>   
>       vfio_spapr_remove_window(container,
>                                section->offset_within_address_space);
> -    if (vfio_host_win_del(container,
> +    if (vfio_host_win_del(bcontainer,
>                             section->offset_within_address_space,
>                             section->offset_within_address_space +
>                             int128_get64(section->size) - 1) < 0) {
> @@ -427,7 +464,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
>   /*
>    * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>    */
> -static int vfio_get_iommu_type(VFIOContainer *container,
> +static int vfio_get_iommu_type(VFIOLegacyContainer *container,
>                                  Error **errp)
>   {
>       int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> @@ -443,7 +480,7 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>       return -EINVAL;
>   }
>   
> -static int vfio_init_container(VFIOContainer *container, int group_fd,
> +static int vfio_init_container(VFIOLegacyContainer *container, int group_fd,
>                                  Error **errp)
>   {
>       int iommu_type, ret;
> @@ -478,7 +515,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>       return 0;
>   }
>   
> -static int vfio_get_iommu_info(VFIOContainer *container,
> +static int vfio_get_iommu_info(VFIOLegacyContainer *container,
>                                  struct vfio_iommu_type1_info **info)
>   {
>   
> @@ -522,11 +559,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
>       return NULL;
>   }
>   
> -static void vfio_get_iommu_info_migration(VFIOContainer *container,
> -                                         struct vfio_iommu_type1_info *info)
> +static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container,
> +                                          struct vfio_iommu_type1_info *info)
>   {
>       struct vfio_info_cap_header *hdr;
>       struct vfio_iommu_type1_info_cap_migration *cap_mig;
> +    VFIOContainer *bcontainer = &container->bcontainer;
>   
>       hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
>       if (!hdr) {
> @@ -541,16 +579,19 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container,
>        * qemu_real_host_page_size to mark those dirty.
>        */
>       if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
> -        container->dirty_pages_supported = true;
> -        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> -        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
> +        bcontainer->dirty_pages_supported = true;
> +        bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> +        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
>       }
>   }
>   
>   static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>                                     Error **errp)
>   {
> -    VFIOContainer *container;
> +    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
> +        object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
> +    VFIOContainer *bcontainer;
> +    VFIOLegacyContainer *container;
>       int ret, fd;
>       VFIOAddressSpace *space;
>   
> @@ -587,7 +628,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>        * details once we know which type of IOMMU we are using.
>        */
>   
> -    QLIST_FOREACH(container, &space->containers, next) {
> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
> +        container = container_of(bcontainer, VFIOLegacyContainer, bcontainer);
>           if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>               ret = vfio_ram_block_discard_disable(container, true);
>               if (ret) {
> @@ -623,14 +665,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       }
>   
>       container = g_malloc0(sizeof(*container));
> -    container->space = space;
>       container->fd = fd;
> -    container->error = NULL;
> -    container->dirty_pages_supported = false;
> -    container->dma_max_mappings = 0;
> -    QLIST_INIT(&container->giommu_list);
> -    QLIST_INIT(&container->hostwin_list);
> -    QLIST_INIT(&container->vrdl_list);
> +    bcontainer = &container->bcontainer;
> +    vfio_container_init(bcontainer, space, ops);
>   
>       ret = vfio_init_container(container, group->fd, errp);
>       if (ret) {
> @@ -656,13 +693,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>           }
>   
>           if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
> -            container->pgsizes = info->iova_pgsizes;
> +            bcontainer->pgsizes = info->iova_pgsizes;
>           } else {
> -            container->pgsizes = qemu_real_host_page_size();
> +            bcontainer->pgsizes = qemu_real_host_page_size();
>           }
>   
> -        if (!vfio_get_info_dma_avail(info, &container->dma_max_mappings)) {
> -            container->dma_max_mappings = 65535;
> +        if (!vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings)) {
> +            bcontainer->dma_max_mappings = 65535;
>           }
>           vfio_get_iommu_info_migration(container, info);
>           g_free(info);
> @@ -672,7 +709,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>            * information to get the actual window extent rather than assume
>            * a 64-bit IOVA address space.
>            */
> -        vfio_host_win_add(container, 0, (hwaddr)-1, container->pgsizes);
> +        vfio_host_win_add(bcontainer, 0, (hwaddr)-1, bcontainer->pgsizes);
>   
>           break;
>       }
> @@ -699,10 +736,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>   
>               memory_listener_register(&container->prereg_listener,
>                                        &address_space_memory);
> -            if (container->error) {
> +            if (bcontainer->error) {
>                   memory_listener_unregister(&container->prereg_listener);
>                   ret = -1;
> -                error_propagate_prepend(errp, container->error,
> +                error_propagate_prepend(errp, bcontainer->error,
>                       "RAM memory listener initialization failed: ");
>                   goto enable_discards_exit;
>               }
> @@ -721,7 +758,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>           }
>   
>           if (v2) {
> -            container->pgsizes = info.ddw.pgsizes;
> +            bcontainer->pgsizes = info.ddw.pgsizes;
>               /*
>                * There is a default window in just created container.
>                * To make region_add/del simpler, we better remove this
> @@ -736,8 +773,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>               }
>           } else {
>               /* The default table uses 4K pages */
> -            container->pgsizes = 0x1000;
> -            vfio_host_win_add(container, info.dma32_window_start,
> +            bcontainer->pgsizes = 0x1000;
> +            vfio_host_win_add(bcontainer, info.dma32_window_start,
>                                 info.dma32_window_start +
>                                 info.dma32_window_size - 1,
>                                 0x1000);
> @@ -748,28 +785,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       vfio_kvm_device_add_group(group);
>   
>       QLIST_INIT(&container->group_list);
> -    QLIST_INSERT_HEAD(&space->containers, container, next);
> +    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
>   
>       group->container = container;
>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>   
> -    container->listener = vfio_memory_listener;
> +    bcontainer->listener = vfio_memory_listener;
>   
> -    memory_listener_register(&container->listener, container->space->as);
> +    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
>   
> -    if (container->error) {
> +    if (bcontainer->error) {
>           ret = -1;
> -        error_propagate_prepend(errp, container->error,
> +        error_propagate_prepend(errp, bcontainer->error,
>               "memory listener initialization failed: ");
>           goto listener_release_exit;
>       }
>   
> -    container->initialized = true;
> +    bcontainer->initialized = true;
>   
>       return 0;
>   listener_release_exit:
>       QLIST_REMOVE(group, container_next);
> -    QLIST_REMOVE(container, next);
> +    QLIST_REMOVE(bcontainer, next);
>       vfio_kvm_device_del_group(group);
>       vfio_listener_release(container);
>   
> @@ -790,7 +827,8 @@ put_space_exit:
>   
>   static void vfio_disconnect_container(VFIOGroup *group)
>   {
> -    VFIOContainer *container = group->container;
> +    VFIOLegacyContainer *container = group->container;
> +    VFIOContainer *bcontainer = &container->bcontainer;
>   
>       QLIST_REMOVE(group, container_next);
>       group->container = NULL;
> @@ -810,25 +848,9 @@ static void vfio_disconnect_container(VFIOGroup *group)
>       }
>   
>       if (QLIST_EMPTY(&container->group_list)) {
> -        VFIOAddressSpace *space = container->space;
> -        VFIOGuestIOMMU *giommu, *tmp;
> -        VFIOHostDMAWindow *hostwin, *next;
> -
> -        QLIST_REMOVE(container, next);
> -
> -        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(
> -                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
> -            QLIST_REMOVE(giommu, giommu_next);
> -            g_free(giommu);
> -        }
> -
> -        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
> -                           next) {
> -            QLIST_REMOVE(hostwin, hostwin_next);
> -            g_free(hostwin);
> -        }
> +        VFIOAddressSpace *space = bcontainer->space;
>   
> +        vfio_container_destroy(bcontainer);
>           trace_vfio_disconnect_container(container->fd);
>           close(container->fd);
>           g_free(container);
> @@ -840,13 +862,15 @@ static void vfio_disconnect_container(VFIOGroup *group)
>   static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>   {
>       VFIOGroup *group;
> +    VFIOContainer *bcontainer;
>       char path[32];
>       struct vfio_group_status status = { .argsz = sizeof(status) };
>   
>       QLIST_FOREACH(group, &vfio_group_list, next) {
>           if (group->groupid == groupid) {
>               /* Found it.  Now is it already in the right context? */
> -            if (group->container->space->as == as) {
> +            bcontainer = &group->container->bcontainer;
> +            if (bcontainer->space->as == as) {
>                   return group;
>               } else {
>                   error_setg(errp, "group %d used in multiple address spaces",
> @@ -990,7 +1014,7 @@ static void vfio_put_base_device(VFIODevice *vbasedev)
>   /*
>    * Interfaces for IBM EEH (Enhanced Error Handling)
>    */
> -static bool vfio_eeh_container_ok(VFIOContainer *container)
> +static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
>   {
>       /*
>        * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
> @@ -1018,7 +1042,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container)
>       return true;
>   }
>   
> -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
> +static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op)
>   {
>       struct vfio_eeh_pe_op pe_op = {
>           .argsz = sizeof(pe_op),
> @@ -1041,19 +1065,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
>       return ret;
>   }
>   
> -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
> +static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
>   {
>       VFIOAddressSpace *space = vfio_get_address_space(as);
> -    VFIOContainer *container = NULL;
> +    VFIOLegacyContainer *container = NULL;
> +    VFIOContainer *bcontainer = NULL;
>   
>       if (QLIST_EMPTY(&space->containers)) {
>           /* No containers to act on */
>           goto out;
>       }
>   
> -    container = QLIST_FIRST(&space->containers);
> +    bcontainer = QLIST_FIRST(&space->containers);
> +    container = container_of(bcontainer, VFIOLegacyContainer, bcontainer);
>   
> -    if (QLIST_NEXT(container, next)) {
> +    if (QLIST_NEXT(bcontainer, next)) {
>           /*
>            * We don't yet have logic to synchronize EEH state across
>            * multiple containers
> @@ -1069,14 +1095,14 @@ out:
>   
>   bool vfio_eeh_as_ok(AddressSpace *as)
>   {
> -    VFIOContainer *container = vfio_eeh_as_container(as);
> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>   
>       return (container != NULL) && vfio_eeh_container_ok(container);
>   }
>   
>   int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>   {
> -    VFIOContainer *container = vfio_eeh_as_container(as);
> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>   
>       if (!container) {
>           return -ENODEV;
> @@ -1110,8 +1136,8 @@ static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
>       return groupid;
>   }
>   
> -int vfio_attach_device(char *name, VFIODevice *vbasedev,
> -                       AddressSpace *as, Error **errp)
> +static int vfio_legacy_attach_device(char *name, VFIODevice *vbasedev,
> +                                     AddressSpace *as, Error **errp)
>   {
>       int groupid = vfio_device_groupid(vbasedev, errp);
>       VFIODevice *vbasedev_iter;
> @@ -1137,15 +1163,46 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
>       ret = vfio_get_device(group, name, vbasedev, errp);
>       if (ret) {
>           vfio_put_group(group);
> +        return ret;
>       }
> +    vbasedev->container = &group->container->bcontainer;
>   
>       return ret;
>   }
>   
> -void vfio_detach_device(VFIODevice *vbasedev)
> +static void vfio_legacy_detach_device(VFIODevice *vbasedev)
>   {
>       VFIOGroup *group = vbasedev->group;
>   
>       vfio_put_base_device(vbasedev);
>       vfio_put_group(group);
> +    vbasedev->container = NULL;
> +}
> +
> +static void vfio_iommu_backend_legacy_ops_class_init(ObjectClass *oc,
> +                                                     void *data) {
> +    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(oc);
> +
> +    ops->dev_iter_next = vfio_legacy_dev_iter_next;
> +    ops->dma_map = vfio_legacy_dma_map;
> +    ops->dma_unmap = vfio_legacy_dma_unmap;
> +    ops->attach_device = vfio_legacy_attach_device;
> +    ops->detach_device = vfio_legacy_detach_device;
> +    ops->set_dirty_page_tracking = vfio_legacy_set_dirty_page_tracking;
> +    ops->query_dirty_bitmap = vfio_legacy_query_dirty_bitmap;
> +    ops->add_window = vfio_legacy_add_section_window;
> +    ops->del_window = vfio_legacy_del_section_window;
> +}
> +
> +static const TypeInfo vfio_iommu_backend_legacy_ops_type = {
> +    .name = TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS,
> +
> +    .parent = TYPE_VFIO_IOMMU_BACKEND_OPS,
> +    .class_init = vfio_iommu_backend_legacy_ops_class_init,
> +    .abstract = true,
> +};
> +static void vfio_iommu_backend_legacy_ops_register_types(void)
> +{
> +    type_register_static(&vfio_iommu_backend_legacy_ops_type);
>   }
> +type_init(vfio_iommu_backend_legacy_ops_register_types);
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index 2a6912c940..eb6ce6229d 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>   vfio_ss.add(files(
>     'helpers.c',
>     'common.c',
> +  'container-base.c',
>     'container.c',
>     'spapr.c',
>     'migration.c',
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index 9ec1e95f6d..7647e7d492 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
>   static void vfio_prereg_listener_region_add(MemoryListener *listener,
>                                               MemoryRegionSection *section)
>   {
> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> -                                            prereg_listener);
> +    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
> +                                                  prereg_listener);
>       const hwaddr gpa = section->offset_within_address_space;
>       hwaddr end;
>       int ret;
> @@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
>            * can gracefully fail.  Runtime, there's not much we can do other
>            * than throw a hardware error.
>            */
> -        if (!container->initialized) {
> -            if (!container->error) {
> -                error_setg_errno(&container->error, -ret,
> +        if (!container->bcontainer.initialized) {
> +            if (!container->bcontainer.error) {
> +                error_setg_errno(&container->bcontainer.error, -ret,
>                                    "Memory registering failed");
>               }
>           } else {
> @@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
>   static void vfio_prereg_listener_region_del(MemoryListener *listener,
>                                               MemoryRegionSection *section)
>   {
> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> -                                            prereg_listener);
> +    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
> +                                                  prereg_listener);
>       const hwaddr gpa = section->offset_within_address_space;
>       hwaddr end;
>       int ret;
> @@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
>       .region_del = vfio_prereg_listener_region_del,
>   };
>   
> -int vfio_spapr_create_window(VFIOContainer *container,
> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>                                MemoryRegionSection *section,
>                                hwaddr *pgsize)
>   {
> @@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container,
>       if (pagesize > rampagesize) {
>           pagesize = rampagesize;
>       }
> -    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
> +    pgmask = container->bcontainer.pgsizes & (pagesize | (pagesize - 1));
>       pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
>       if (!pagesize) {
>           error_report("Host doesn't support page size 0x%"PRIx64
>                        ", the supported mask is 0x%lx",
>                        memory_region_iommu_get_min_page_size(iommu_mr),
> -                     container->pgsizes);
> +                     container->bcontainer.pgsizes);
>           return -EINVAL;
>       }
>   
> @@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>       return 0;
>   }
>   
> -int vfio_spapr_remove_window(VFIOContainer *container,
> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>                                hwaddr offset_within_address_space)
>   {
>       struct vfio_iommu_spapr_tce_remove remove = {
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bd32970854..1692bcd8f1 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -119,8 +119,8 @@ vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Re
>   vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
>   vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>   vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%08x"
> -vfio_dma_unmap_overflow_workaround(void) ""
> -vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start, uint64_t dirty_pages) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64" dirty_pages=%"PRIu64
> +vfio_legacy_dma_unmap_overflow_workaround(void) ""
> +vfio_get_dirty_bitmap(uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start, uint64_t dirty_pages) "iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64" dirty_pages=%"PRIu64
>   vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
>   
>   # platform.c
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 95bcafdaf6..b1a76dcc9c 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -30,6 +30,7 @@
>   #include <linux/vfio.h>
>   #endif
>   #include "sysemu/sysemu.h"
> +#include "hw/vfio/vfio-container-base.h"
>   
>   #define VFIO_MSG_PREFIX "vfio %s: "
>   
> @@ -74,64 +75,22 @@ typedef struct VFIOMigration {
>       bool initial_data_sent;
>   } VFIOMigration;
>   
> -typedef struct VFIOAddressSpace {
> -    AddressSpace *as;
> -    QLIST_HEAD(, VFIOContainer) containers;
> -    QLIST_ENTRY(VFIOAddressSpace) list;
> -} VFIOAddressSpace;
> -
>   struct VFIOGroup;
>   
> -typedef struct VFIOContainer {
> -    VFIOAddressSpace *space;
> +typedef struct VFIOLegacyContainer {
> +    VFIOContainer bcontainer;
>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> -    MemoryListener listener;
>       MemoryListener prereg_listener;
>       unsigned iommu_type;
> -    Error *error;
> -    bool initialized;
> -    bool dirty_pages_supported;
> -    uint64_t dirty_pgsizes;
> -    uint64_t max_dirty_bitmap_size;
> -    unsigned long pgsizes;
> -    unsigned int dma_max_mappings;
> -    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> -    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>       QLIST_HEAD(, VFIOGroup) group_list;
> -    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
> -    QLIST_ENTRY(VFIOContainer) next;
> -} VFIOContainer;
> -
> -typedef struct VFIOGuestIOMMU {
> -    VFIOContainer *container;
> -    IOMMUMemoryRegion *iommu_mr;
> -    hwaddr iommu_offset;
> -    IOMMUNotifier n;
> -    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> -} VFIOGuestIOMMU;
> -
> -typedef struct VFIORamDiscardListener {
> -    VFIOContainer *container;
> -    MemoryRegion *mr;
> -    hwaddr offset_within_address_space;
> -    hwaddr size;
> -    uint64_t granularity;
> -    RamDiscardListener listener;
> -    QLIST_ENTRY(VFIORamDiscardListener) next;
> -} VFIORamDiscardListener;
> -
> -typedef struct VFIOHostDMAWindow {
> -    hwaddr min_iova;
> -    hwaddr max_iova;
> -    uint64_t iova_pgsizes;
> -    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> -} VFIOHostDMAWindow;
> +} VFIOLegacyContainer;
>   
>   typedef struct VFIODeviceOps VFIODeviceOps;
>   
>   typedef struct VFIODevice {
>       QLIST_ENTRY(VFIODevice) next;
>       struct VFIOGroup *group;
> +    VFIOContainer *container;
>       char *sysfsdev;
>       char *name;
>       DeviceState *dev;
> @@ -165,7 +124,7 @@ struct VFIODeviceOps {
>   typedef struct VFIOGroup {
>       int fd;
>       int groupid;
> -    VFIOContainer *container;
> +    VFIOLegacyContainer *container;
>       QLIST_HEAD(, VFIODevice) device_list;
>       QLIST_ENTRY(VFIOGroup) next;
>       QLIST_ENTRY(VFIOGroup) container_next;
> @@ -198,37 +157,13 @@ typedef struct VFIODisplay {
>       } dmabuf;
>   } VFIODisplay;
>   
> -typedef struct {
> -    unsigned long *bitmap;
> -    hwaddr size;
> -    hwaddr pages;
> -} VFIOBitmap;
> -
> -void vfio_host_win_add(VFIOContainer *container,
> +void vfio_host_win_add(VFIOContainer *bcontainer,
>                          hwaddr min_iova, hwaddr max_iova,
>                          uint64_t iova_pgsizes);
> -int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
> +int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
>                         hwaddr max_iova);
>   VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
>   void vfio_put_address_space(VFIOAddressSpace *space);
> -bool vfio_devices_all_running_and_saving(VFIOContainer *container);
> -
> -/* container->fd */
> -VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> -                                         VFIODevice *curr);
> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
> -                   ram_addr_t size, IOMMUTLBEntry *iotlb);
> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> -                 ram_addr_t size, void *vaddr, bool readonly);
> -int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
> -int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
> -                            hwaddr iova, hwaddr size);
> -
> -int vfio_container_add_section_window(VFIOContainer *container,
> -                                      MemoryRegionSection *section,
> -                                      Error **errp);
> -void vfio_container_del_section_window(VFIOContainer *container,
> -                                       MemoryRegionSection *section);
>   
>   void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>   void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
> @@ -285,10 +220,10 @@ vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id);
>   #endif
>   extern const MemoryListener vfio_prereg_listener;
>   
> -int vfio_spapr_create_window(VFIOContainer *container,
> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>                                MemoryRegionSection *section,
>                                hwaddr *pgsize);
> -int vfio_spapr_remove_window(VFIOContainer *container,
> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>                                hwaddr offset_within_address_space);
>   
>   bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> new file mode 100644
> index 0000000000..b18fa92146
> --- /dev/null
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -0,0 +1,155 @@
> +/*
> + * VFIO BASE CONTAINER
> + *
> + * Copyright (C) 2023 Intel Corporation.
> + * Copyright Red Hat, Inc. 2023
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Eric Auger <eric.auger@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_VFIO_VFIO_BASE_CONTAINER_H
> +#define HW_VFIO_VFIO_BASE_CONTAINER_H
> +
> +#include "exec/memory.h"
> +#ifndef CONFIG_USER_ONLY
> +#include "exec/hwaddr.h"
> +#endif
> +
> +typedef struct VFIOContainer VFIOContainer;
> +
> +typedef struct VFIOAddressSpace {
> +    AddressSpace *as;
> +    QLIST_HEAD(, VFIOContainer) containers;
> +    QLIST_ENTRY(VFIOAddressSpace) list;
> +} VFIOAddressSpace;
> +
> +typedef struct VFIOGuestIOMMU {
> +    VFIOContainer *container;
> +    IOMMUMemoryRegion *iommu_mr;
> +    hwaddr iommu_offset;
> +    IOMMUNotifier n;
> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> +} VFIOGuestIOMMU;
> +
> +typedef struct VFIORamDiscardListener {
> +    VFIOContainer *container;
> +    MemoryRegion *mr;
> +    hwaddr offset_within_address_space;
> +    hwaddr size;
> +    uint64_t granularity;
> +    RamDiscardListener listener;
> +    QLIST_ENTRY(VFIORamDiscardListener) next;
> +} VFIORamDiscardListener;
> +
> +typedef struct VFIOHostDMAWindow {
> +    hwaddr min_iova;
> +    hwaddr max_iova;
> +    uint64_t iova_pgsizes;
> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> +} VFIOHostDMAWindow;
> +
> +typedef struct {
> +    unsigned long *bitmap;
> +    hwaddr size;
> +    hwaddr pages;
> +} VFIOBitmap;
> +
> +typedef struct VFIODevice VFIODevice;
> +typedef struct VFIOIOMMUBackendOpsClass VFIOIOMMUBackendOpsClass;
> +
> +/*
> + * This is the base object for vfio container backends
> + */
> +struct VFIOContainer {
> +    VFIOIOMMUBackendOpsClass *ops;
> +    VFIOAddressSpace *space;
> +    MemoryListener listener;
> +    Error *error;
> +    bool initialized;
> +    bool dirty_pages_supported;
> +    uint64_t dirty_pgsizes;
> +    uint64_t max_dirty_bitmap_size;
> +    unsigned long pgsizes;
> +    unsigned int dma_max_mappings;
> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
> +    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
> +    QLIST_ENTRY(VFIOContainer) next;
> +};
> +
> +VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> +                                 VFIODevice *curr);
> +int vfio_container_dma_map(VFIOContainer *container,
> +                           hwaddr iova, ram_addr_t size,
> +                           void *vaddr, bool readonly);
> +int vfio_container_dma_unmap(VFIOContainer *container,
> +                             hwaddr iova, ram_addr_t size,
> +                             IOMMUTLBEntry *iotlb);
> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
> +int vfio_container_set_dirty_page_tracking(VFIOContainer *container,
> +                                            bool start);
> +int vfio_container_query_dirty_bitmap(VFIOContainer *container,
> +                                      VFIOBitmap *vbmap,
> +                                      hwaddr iova, hwaddr size);
> +int vfio_container_add_section_window(VFIOContainer *container,
> +                                      MemoryRegionSection *section,
> +                                      Error **errp);
> +void vfio_container_del_section_window(VFIOContainer *container,
> +                                       MemoryRegionSection *section);
> +
> +void vfio_container_init(VFIOContainer *container,
> +                         VFIOAddressSpace *space,
> +                         struct VFIOIOMMUBackendOpsClass *ops);
> +void vfio_container_destroy(VFIOContainer *container);
> +
> +#define TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS "vfio-iommu-backend-legacy-ops"
> +#define TYPE_VFIO_IOMMU_BACKEND_OPS "vfio-iommu-backend-ops"
> +
> +DECLARE_CLASS_CHECKERS(VFIOIOMMUBackendOpsClass,
> +                       VFIO_IOMMU_BACKEND_OPS, TYPE_VFIO_IOMMU_BACKEND_OPS)
> +
> +struct VFIOIOMMUBackendOpsClass {
> +    /*< private >*/
> +    ObjectClass parent_class;
> +
> +    /*< public >*/
> +    /* required */
> +    VFIODevice *(*dev_iter_next)(VFIOContainer *container, VFIODevice *curr);
> +    int (*dma_map)(VFIOContainer *container,
> +                   hwaddr iova, ram_addr_t size,
> +                   void *vaddr, bool readonly);
> +    int (*dma_unmap)(VFIOContainer *container,
> +                     hwaddr iova, ram_addr_t size,
> +                     IOMMUTLBEntry *iotlb);
> +    int (*attach_device)(char *name, VFIODevice *vbasedev,
> +                         AddressSpace *as, Error **errp);
> +    void (*detach_device)(VFIODevice *vbasedev);
> +    /* migration feature */
> +    int (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
> +    int (*query_dirty_bitmap)(VFIOContainer *bcontainer, VFIOBitmap *vbmap,
> +                              hwaddr iova, hwaddr size);
> +
> +    /* SPAPR specific */
> +    int (*add_window)(VFIOContainer *container,
> +                      MemoryRegionSection *section,
> +                      Error **errp);
> +    void (*del_window)(VFIOContainer *container,
> +                       MemoryRegionSection *section);
> +};
> +
> +
> +#endif /* HW_VFIO_VFIO_BASE_CONTAINER_H */
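
As a rough illustration, the generic wrappers declared above are expected to
simply dispatch through the ops class stored in the base container. A minimal
sketch follows (the actual container-base.c implementation added by this patch
is not quoted here, so the exact body is an assumption):

/* Sketch: generic wrapper forwarding to the selected backend callback */
int vfio_container_dma_map(VFIOContainer *container,
                           hwaddr iova, ram_addr_t size,
                           void *vaddr, bool readonly)
{
    if (!container->ops->dma_map) {
        return -EINVAL;
    }
    return container->ops->dma_map(container, iova, size, vaddr, readonly);
}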



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 14/22] vfio/common: Simplify vfio_viommu_preset()
  2023-09-19 16:01   ` Cédric Le Goater
@ 2023-09-20  2:59     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-20  2:59 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Wednesday, September 20, 2023 12:01 AM
>Subject: Re: [PATCH v1 14/22] vfio/common: Simplify vfio_viommu_preset()
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> Commit "vfio/container-base: Introduce [attach/detach]_device container
>callbacks"
>> add support to link to address space, we can utilize it to simplify
>> vfio_viommu_preset().
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>
>This looks like a revert of patch 07. Can it be avoided in v2 ?

Yes, I will redesign the related part so that I could have this patch dropped.

Thanks
Zhenzhong 

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 15/22] Add iommufd configure option
  2023-09-19 17:07   ` Cédric Le Goater
@ 2023-09-20  3:42     ` Duan, Zhenzhong
  2023-09-20 12:19       ` Cédric Le Goater
  2023-09-20 18:01       ` Alex Williamson
  0 siblings, 2 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-20  3:42 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Wednesday, September 20, 2023 1:08 AM
>Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
>> iommufd support, enabled by default.
>
>Why would someone want to disable support at compile time ? It might

For those users who only want to support the legacy container feature?
Let me know if you still prefer to drop this patch; I'm fine with that.

>have been useful for dev but now QEMU should self-adjust at runtime
>depending only on the host capabilities AFAIUI. Am I missing something ?

IOMMUFD doesn't support all features of the legacy container, so QEMU
doesn't self-adjust at runtime by checking whether the host supports IOMMUFD.
We need to specify IOMMUFD explicitly in order to use it, as below:

    -object iommufd,id=iommufd0
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 13/22] vfio: Add base container
  2023-09-19 17:23   ` Cédric Le Goater
@ 2023-09-20  8:48     ` Duan, Zhenzhong
  2023-09-20 12:57       ` Cédric Le Goater
  2023-09-20 13:53     ` Eric Auger
  2023-09-20 17:31     ` Eric Auger
  2 siblings, 1 reply; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-20  8:48 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Yi Sun, Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Wednesday, September 20, 2023 1:24 AM
>Subject: Re: [PATCH v1 13/22] vfio: Add base container
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Abstract the VFIOContainer to be a base object. It is supposed to be
>> embedded by legacy VFIO container and later on, into the new iommufd
>> based container.
>>
>> The base container implements generic code such as code related to
>> memory_listener and address space management. The VFIOContainerOps
>> implements callbacks that depend on the kernel user space being used.
>>
>> 'common.c' and vfio device code only manipulates the base container with
>> wrapper functions that calls the functions defined in VFIOContainerOpsClass.
>> Existing 'container.c' code is converted to implement the legacy container
>> ops functions.
>>
>> Below is the base container. It's named as VFIOContainer, old VFIOContainer
>> is replaced with VFIOLegacyContainer.
>
>Usualy, we introduce the new interface solely, port the current models
>on top of the new interface, wire the new models in the current
>implementation and remove the old implementation. Then, we can start
>adding extensions to support other implementations.

Not sure if I understand your point correctly. Do you mean to introduce
a new type for the base container as below:

static const TypeInfo vfio_container_info = {
    .parent             = TYPE_OBJECT,
    .name               = TYPE_VFIO_CONTAINER,
    .class_size         = sizeof(VFIOContainerClass),
    .instance_size      = sizeof(VFIOContainer),
    .abstract           = true,
    .interfaces = (InterfaceInfo[]) {
        { TYPE_VFIO_IOMMU_BACKEND_OPS },
        { }
    }
};

and a new interface as below:

static const TypeInfo vfio_iommu_backend_ops_info = {
    .name = TYPE_VFIO_IOMMU_BACKEND_OPS,
    .parent = TYPE_INTERFACE,
    .class_size = sizeof(VFIOIOMMUBackendOpsClass),
};

struct VFIOIOMMUBackendOpsClass {
    InterfaceClass parent;
    VFIODevice *(*dev_iter_next)(VFIOContainer *container, VFIODevice *curr);
    int (*dma_map)(VFIOContainer *container,
    ......
};

and legacy container on top of TYPE_VFIO_CONTAINER?

static const TypeInfo vfio_legacy_container_info = {
    .parent = TYPE_VFIO_CONTAINER,
    .name = TYPE_VFIO_LEGACY_CONTAINER,
    .class_init = vfio_legacy_container_class_init,
};

This object style is rejected early in RFCv1.
See https://lore.kernel.org/kvm/20220414104710.28534-8-yi.l.liu@intel.com/

>
>spapr should be taken care of separatly following the principle above.
>With my PPC hat, I would not even read such a massive change, too risky
>for the subsystem. This path will need (much) further splitting to be
>understandable and acceptable.

I'll dig into this and try to split it. Meanwhile, many of the changes
just rename a parameter or function name for code readability.
For example:

-int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
-                   ram_addr_t size, IOMMUTLBEntry *iotlb)
+static int vfio_legacy_dma_unmap(VFIOContainer *bcontainer, hwaddr iova,
+                          ram_addr_t size, IOMMUTLBEntry *iotlb)

-        ret = vfio_get_dirty_bitmap(container, iova, size,
+        ret = vfio_get_dirty_bitmap(bcontainer, iova, size,

Let me know if you think such changes are unnecessary; dropping them would
reduce this patch considerably.

>
>Also, please include the .h file first, it helps in reading.

Do you mean to put the struct declaration earlier in the patch description?

> Have you considered using an InterfaceClass ?

See above; with the object style rejected, it looks hard to use an InterfaceClass.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 02/22] Update linux-header to support iommufd cdev and hwpt alloc
  2023-09-15  3:02     ` Duan, Zhenzhong
@ 2023-09-20 11:04       ` Eric Auger
  2023-09-20 11:15         ` Duan, Zhenzhong
  0 siblings, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 11:04 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	open list:Overall KVM CPUs



On 9/15/23 05:02, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Thursday, September 14, 2023 10:46 PM
>> Subject: Re: [PATCH v1 02/22] Update linux-header to support iommufd cdev and
>> hwpt alloc
>>
>> Hi Zhenzhong,
>>
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> From https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
>>> branch: for_next
>>> commit id: eb501c2d96cfce6b42528e8321ea085ec605e790
>> I see that in your branch you have now updated against v6.6-rc1. However
>> you should run a full ./scripts/update-linux-headers.sh,
>> ie. not only importing the changes in linux-headers/linux/iommufd.h as
>> it seems to do but also import all changes brought with this linux version.
> Found reason. The base is already against v6.6-rc1, [PATCH v1 01/22] added
> Iommufd.h into script and this patch added it.
> I agree the subject is confusing, need to be like "Update iommufd.h to linux-header"
> I'll fix the subject in next version, thanks for point out.

OK, I see
da3c22c74a3c  linux-headers: Update to Linux v6.6-rc1 (8 days ago)
<Thomas Huth>
now. So you need to mention the sha1 against which you ran
./scripts/update-linux-headers.sh, and in that case you can state that,
given that [PATCH v1 01/22] "scripts/update-linux-headers: Add iommufd.h"
added the iommufd export and given Thomas' patch, only
iommufd.h is added.
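
For example, something along these lines, run from the QEMU tree (the paths
are assumptions):

    (cd ~/src/linux && git rev-parse HEAD)  # sha1 to quote in the commit message
    ./scripts/update-linux-headers.sh ~/src/linux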

Thanks

Eric
>
> BR.
> Zhenzhong
>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 02/22] Update linux-header to support iommufd cdev and hwpt alloc
  2023-09-20 11:04       ` Eric Auger
@ 2023-09-20 11:15         ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-20 11:15 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
	open list:Overall KVM CPUs



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, September 20, 2023 7:05 PM
>Subject: Re: [PATCH v1 02/22] Update linux-header to support iommufd cdev and
>hwpt alloc
>
>
>
>On 9/15/23 05:02, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Sent: Thursday, September 14, 2023 10:46 PM
>>> Subject: Re: [PATCH v1 02/22] Update linux-header to support iommufd cdev
>and
>>> hwpt alloc
>>>
>>> Hi Zhenzhong,
>>>
>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>> From https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
>>>> branch: for_next
>>>> commit id: eb501c2d96cfce6b42528e8321ea085ec605e790
>>> I see that in your branch you have now updated against v6.6-rc1. However
>>> you should run a full ./scripts/update-linux-headers.sh,
>>> ie. not only importing the changes in linux-headers/linux/iommufd.h as
>>> it seems to do but also import all changes brought with this linux version.
>> Found reason. The base is already against v6.6-rc1, [PATCH v1 01/22] added
>> Iommufd.h into script and this patch added it.
>> I agree the subject is confusing, need to be like "Update iommufd.h to linux-
>header"
>> I'll fix the subject in next version, thanks for point out.
>
>OK I see
>da3c22c74a3c  linux-headers: Update to Linux v6.6-rc1 (8 days ago)
><Thomas Huth>
>now. So you need to add the sha1 against which you ran
>./scripts/update-linux-headers.sh and in that case you can precise that
>given [PATCH v1 01/22] scripts/update-linux-headers: Add iommufd.h added
>iommufd export and given Thomas' patch, only
>iommufd.h is added.

Sure, will make it clear in v2.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-08-30 10:37 ` [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window() Zhenzhong Duan
@ 2023-09-20 11:23   ` Eric Auger
  2023-09-20 12:18     ` Duan, Zhenzhong
  2023-09-21  8:28   ` Cédric Le Goater
  1 sibling, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 11:23 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

Hi Zhenzhong,
On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Eric Auger <eric.auger@redhat.com>
>
> Introduce helper functions that isolate the code used for
> VFIO_SPAPR_TCE_v2_IOMMU. This code reliance is IOMMU backend
> specific whereas the rest of the code in the callers, ie.
this last sentence should be rephrased into something like
Those helpers hide implementation details beneath the container object
and make the vfio_listener_region_add/del() implementations more
readable (I think). No code change intended.

Thanks

Eric
> vfio_listener_region_add|del is not.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/common.c | 156 +++++++++++++++++++++++++++--------------------
>  1 file changed, 89 insertions(+), 67 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9ca695837f..67150e4575 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -796,6 +796,92 @@ static bool vfio_get_section_iova_range(VFIOContainer *container,
>      return true;
>  }
>  
> +static int vfio_container_add_section_window(VFIOContainer *container,
> +                                             MemoryRegionSection *section,
> +                                             Error **errp)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +    hwaddr pgsize = 0;
> +    int ret;
> +
> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
> +        return 0;
> +    }
> +
> +    /* For now intersections are not allowed, we may relax this later */
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (ranges_overlap(hostwin->min_iova,
> +                           hostwin->max_iova - hostwin->min_iova + 1,
> +                           section->offset_within_address_space,
> +                           int128_get64(section->size))) {
> +            error_setg(errp,
> +                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
> +                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                    int128_get64(section->size) - 1,
> +                hostwin->min_iova, hostwin->max_iova);
> +            return -EINVAL;
> +        }
> +    }
> +
> +    ret = vfio_spapr_create_window(container, section, &pgsize);
> +    if (ret) {
> +        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
> +        return ret;
> +    }
> +
> +    vfio_host_win_add(container, section->offset_within_address_space,
> +                      section->offset_within_address_space +
> +                      int128_get64(section->size) - 1, pgsize);
> +#ifdef CONFIG_KVM
> +    if (kvm_enabled()) {
> +        VFIOGroup *group;
> +        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
> +        struct kvm_vfio_spapr_tce param;
> +        struct kvm_device_attr attr = {
> +            .group = KVM_DEV_VFIO_GROUP,
> +            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
> +            .addr = (uint64_t)(unsigned long)&param,
> +        };
> +
> +        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
> +                                          &param.tablefd)) {
> +            QLIST_FOREACH(group, &container->group_list, container_next) {
> +                param.groupfd = group->fd;
> +                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> +                    error_report("vfio: failed to setup fd %d "
> +                                 "for a group with fd %d: %s",
> +                                 param.tablefd, param.groupfd,
> +                                 strerror(errno));
> +                    return 0;
> +                }
> +                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
> +            }
> +        }
> +    }
> +#endif
> +    return 0;
> +}
> +
> +static void vfio_container_del_section_window(VFIOContainer *container,
> +                                              MemoryRegionSection *section)
> +{
> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
> +        return;
> +    }
> +
> +    vfio_spapr_remove_window(container,
> +                             section->offset_within_address_space);
> +    if (vfio_host_win_del(container,
> +                          section->offset_within_address_space,
> +                          section->offset_within_address_space +
> +                          int128_get64(section->size) - 1) < 0) {
> +        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> +                 __func__, section->offset_within_address_space);
> +    }
> +}
> +
>  static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> @@ -822,62 +908,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          return;
>      }
>  
> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> -        hwaddr pgsize = 0;
> -
> -        /* For now intersections are not allowed, we may relax this later */
> -        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> -            if (ranges_overlap(hostwin->min_iova,
> -                               hostwin->max_iova - hostwin->min_iova + 1,
> -                               section->offset_within_address_space,
> -                               int128_get64(section->size))) {
> -                error_setg(&err,
> -                    "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
> -                    "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
> -                    section->offset_within_address_space,
> -                    section->offset_within_address_space +
> -                        int128_get64(section->size) - 1,
> -                    hostwin->min_iova, hostwin->max_iova);
> -                goto fail;
> -            }
> -        }
> -
> -        ret = vfio_spapr_create_window(container, section, &pgsize);
> -        if (ret) {
> -            error_setg_errno(&err, -ret, "Failed to create SPAPR window");
> -            goto fail;
> -        }
> -
> -        vfio_host_win_add(container, section->offset_within_address_space,
> -                          section->offset_within_address_space +
> -                          int128_get64(section->size) - 1, pgsize);
> -#ifdef CONFIG_KVM
> -        if (kvm_enabled()) {
> -            VFIOGroup *group;
> -            IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
> -            struct kvm_vfio_spapr_tce param;
> -            struct kvm_device_attr attr = {
> -                .group = KVM_DEV_VFIO_GROUP,
> -                .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
> -                .addr = (uint64_t)(unsigned long)&param,
> -            };
> -
> -            if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
> -                                              &param.tablefd)) {
> -                QLIST_FOREACH(group, &container->group_list, container_next) {
> -                    param.groupfd = group->fd;
> -                    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> -                        error_report("vfio: failed to setup fd %d "
> -                                     "for a group with fd %d: %s",
> -                                     param.tablefd, param.groupfd,
> -                                     strerror(errno));
> -                        return;
> -                    }
> -                    trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
> -                }
> -            }
> -        }
> -#endif
> +    if (vfio_container_add_section_window(container, section, &err)) {
> +        goto fail;
>      }
>  
>      hostwin = vfio_find_hostwin(container, iova, end);
> @@ -1094,17 +1126,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>  
>      memory_region_unref(section->mr);
>  
> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> -        vfio_spapr_remove_window(container,
> -                                 section->offset_within_address_space);
> -        if (vfio_host_win_del(container,
> -                              section->offset_within_address_space,
> -                              section->offset_within_address_space +
> -                              int128_get64(section->size) - 1) < 0) {
> -            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> -                     __func__, section->offset_within_address_space);
> -        }
> -    }
> +    vfio_container_del_section_window(container, section);
>  }
>  
>  static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-08-30 10:37 ` [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd Zhenzhong Duan
@ 2023-09-20 11:49   ` Eric Auger
  2023-09-21  2:04     ` Duan, Zhenzhong
  2023-09-21  8:42     ` Cédric Le Goater
  2023-09-20 21:39   ` Alex Williamson
  1 sibling, 2 replies; 109+ messages in thread
From: Eric Auger @ 2023-09-20 11:49 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

Hi Zhenzhong,

On 8/30/23 12:37, Zhenzhong Duan wrote:
> ...which will be used by both legacy and iommufd backend.
I prefer genuine sentences in the commit msg. Also you explain what you
do but not why.

suggestion: Introduce two new helpers, vfio_kvm_device_[add/del]_fd
which take as input a file descriptor which can be either a group fd or
a cdev fd. This uses the new KVM_DEV_VFIO_FILE VFIO KVM device group,
which aliases to the legacy KVM_DEV_VFIO_GROUP.

vfio_kvm_device_add/del_group then call those new helpers.



>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/common.c              | 44 +++++++++++++++++++++++------------
>  include/hw/vfio/vfio-common.h |  3 +++
>  2 files changed, 32 insertions(+), 15 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 67150e4575..949ad6714a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1759,17 +1759,17 @@ void vfio_reset_handler(void *opaque)
>      }
>  }
>  
> -static void vfio_kvm_device_add_group(VFIOGroup *group)
> +int vfio_kvm_device_add_fd(int fd)
>  {
>  #ifdef CONFIG_KVM
>      struct kvm_device_attr attr = {
> -        .group = KVM_DEV_VFIO_GROUP,
> -        .attr = KVM_DEV_VFIO_GROUP_ADD,
> -        .addr = (uint64_t)(unsigned long)&group->fd,
> +        .group = KVM_DEV_VFIO_FILE,
> +        .attr = KVM_DEV_VFIO_FILE_ADD,
> +        .addr = (uint64_t)(unsigned long)&fd,
>      };
>  
>      if (!kvm_enabled()) {
> -        return;
> +        return 0;
>      }
>  
>      if (vfio_kvm_device_fd < 0) {
> @@ -1779,37 +1779,51 @@ static void vfio_kvm_device_add_group(VFIOGroup *group)
>  
>          if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
>              error_report("Failed to create KVM VFIO device: %m");
> -            return;
> +            return -ENODEV;
can't you return -errno?
>          }
>  
>          vfio_kvm_device_fd = cd.fd;
>      }
>  
>      if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> -        error_report("Failed to add group %d to KVM VFIO device: %m",
> -                     group->groupid);
> +        error_report("Failed to add fd %d to KVM VFIO device: %m",
> +                     fd);
> +        return -errno;
>      }
>  #endif
> +    return 0;
>  }
>  
> -static void vfio_kvm_device_del_group(VFIOGroup *group)
> +static void vfio_kvm_device_add_group(VFIOGroup *group)
> +{
> +    vfio_kvm_device_add_fd(group->fd);
Since vfio_kvm_device_add_fd now returns an error value, it's a pity not
to use it and propagate it. Also you could fill an errp with the error
msg and use it in vfio_connect_container(), although that would introduce
new error handling there.
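
For illustration, one possible shape of such propagation (a sketch only; the
errp parameter and the matching change in vfio_connect_container() are
assumptions, not something this patch does):

static int vfio_kvm_device_add_group(VFIOGroup *group, Error **errp)
{
    int ret = vfio_kvm_device_add_fd(group->fd);

    if (ret) {
        /* turn the negative errno into an Error for the caller */
        error_setg_errno(errp, -ret,
                         "Failed to add group %d to KVM VFIO device",
                         group->groupid);
    }
    return ret;
}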
> +}
> +
> +int vfio_kvm_device_del_fd(int fd)
not sure we want this to return an error. But if we do, I think it would
be nicer to propagate the error up.
>  {
>  #ifdef CONFIG_KVM
>      struct kvm_device_attr attr = {
> -        .group = KVM_DEV_VFIO_GROUP,
> -        .attr = KVM_DEV_VFIO_GROUP_DEL,
> -        .addr = (uint64_t)(unsigned long)&group->fd,
> +        .group = KVM_DEV_VFIO_FILE,
> +        .attr = KVM_DEV_VFIO_FILE_DEL,
> +        .addr = (uint64_t)(unsigned long)&fd,
>      };
>  
>      if (vfio_kvm_device_fd < 0) {
> -        return;
> +        return -EINVAL;
>      }
>  
>      if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> -        error_report("Failed to remove group %d from KVM VFIO device: %m",
> -                     group->groupid);
> +        error_report("Failed to remove fd %d from KVM VFIO device: %m",
> +                     fd);
> +        return -EBADF;
-errno?
>      }
>  #endif
> +    return 0;
> +}
> +
> +static void vfio_kvm_device_del_group(VFIOGroup *group)
> +{
> +    vfio_kvm_device_del_fd(group->fd);
>  }
>  
>  static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 5e376c436e..598c3ce079 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -220,6 +220,9 @@ struct vfio_device_info *vfio_get_device_info(int fd);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
>  
> +int vfio_kvm_device_add_fd(int fd);
> +int vfio_kvm_device_del_fd(int fd);
> +
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>  extern VFIOGroupList vfio_group_list;
Thanks

Eric



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-09-20 11:23   ` Eric Auger
@ 2023-09-20 12:18     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-20 12:18 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, September 20, 2023 7:23 PM
>Subject: Re: [PATCH v1 04/22] vfio/common: Introduce
>vfio_container_add|del_section_window()
>
>Hi Zhenzhong,
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Eric Auger <eric.auger@redhat.com>
>>
>> Introduce helper functions that isolate the code used for
>> VFIO_SPAPR_TCE_v2_IOMMU. This code reliance is IOMMU backend
>> specific whereas the rest of the code in the callers, ie.
>this last sentence should be rephrased into something like
>Those helpers hide implementation details beneath the container object
>and make the vfio_listener_region_add/del() implementations more
>readable ( I think). No code change intended.

Thanks for your suggestion, will use it in v2.

BR.
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20  3:42     ` Duan, Zhenzhong
@ 2023-09-20 12:19       ` Cédric Le Goater
  2023-09-20 12:51         ` Jason Gunthorpe
  2023-09-21  2:11         ` Duan, Zhenzhong
  2023-09-20 18:01       ` Alex Williamson
  1 sibling, 2 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-20 12:19 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé

On 9/20/23 05:42, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Sent: Wednesday, September 20, 2023 1:08 AM
>> Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>>
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
>>> iommufd support, enabled by default.
>>
>> Why would someone want to disable support at compile time ? It might
> 
> For those users who only want to support legacy container feature?
> Let me know if you still prefer to drop this patch, I'm fine with that.

I think it is too early.

>> have been useful for dev but now QEMU should self-adjust at runtime
>> depending only on the host capabilities AFAIUI. Am I missing something ?
> 
> IOMMUFD doesn't support all features of legacy container, so QEMU
> doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> We need to specify it explicitly to use IOMMUFD as below:
> 
>      -object iommufd,id=iommufd0
>      -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0

OK. I am not sure this is the correct interface yet. At first glance,
I wouldn't introduce a new object for a simple backend depending on a
kernel interface. I would tend to prefer a "iommu-something" property
of the vfio-pci device with string values: "legacy", "iommufd", "default"
and define the various interfaces (the ops you proposed) for each
depending on the user preference and the capabilities of the host and
possibly the device.
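
Something like (hypothetical syntax, just to illustrate the idea):

    -device vfio-pci,host=0000:02:00.0,iommu-backend=iommufd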

I might be wrong and this might have been discussed before. If so, it
should go in the cover letter with other things : what is this patchset
providing to VFIO (multiple iommu backends), how it is reaching that
goal, how is it organized, how do we deal with the special case (spapr),
what's the user interface, etc.


   
Thanks,

C.


> Thanks
> Zhenzhong
> 



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 06/22] vfio/common: Add a vfio device iterator
  2023-08-30 10:37 ` [PATCH v1 06/22] vfio/common: Add a vfio device iterator Zhenzhong Duan
@ 2023-09-20 12:25   ` Eric Auger
  2023-09-21  2:27     ` Duan, Zhenzhong
  2023-09-20 22:16   ` Alex Williamson
  1 sibling, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 12:25 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

Hi Zhenzhong,

On 8/30/23 12:37, Zhenzhong Duan wrote:
> With a vfio device iterator added, we can make some migration and reset
> related functions group agnostic.
> E.x:
> vfio_mig_active
> vfio_migratable_device_num
> vfio_devices_all_dirty_tracking
> vfio_devices_all_device_dirty_tracking
> vfio_devices_all_running_and_mig_active
> vfio_devices_dma_logging_stop
> vfio_devices_dma_logging_start
> vfio_devices_query_dirty_bitmap
> vfio_reset_handler
>
> Or else we need to add container specific callback variants for above
> functions just because they iterate devices based on group.
>
> Move the reset handler registration/unregistration to a place that is not
> group specific, saying first vfio address space created instead of the
> first group.
I would move the reset handler registration/unregistration changes to a
separate patch.
Besides, I don't quite get what you mean by
"saying first vfio address space created instead of the first group."
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/common.c | 224 ++++++++++++++++++++++++++---------------------
>  1 file changed, 122 insertions(+), 102 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 949ad6714a..51c6e7598e 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -84,6 +84,26 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>      }
>  }
>  
I would add a comment:
iterate on all devices from all groups attached to a container
> +static VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> +                                                VFIODevice *curr)
> +{
> +    VFIOGroup *group;
> +
> +    if (!curr) {
> +        group = QLIST_FIRST(&container->group_list);
> +    } else {
> +        if (curr->next.le_next) {
> +            return curr->next.le_next;
> +        }
> +        group = curr->group->container_next.le_next;
> +    }
> +
> +    if (!group) {
> +        return NULL;
> +    }
> +    return QLIST_FIRST(&group->device_list);
> +}


> +
>  /*
>   * Device state interfaces
>   */
> @@ -112,17 +132,22 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>  
>  bool vfio_mig_active(void)
>  {
> -    VFIOGroup *group;
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
>      VFIODevice *vbasedev;
>  
> -    if (QLIST_EMPTY(&vfio_group_list)) {
> +    if (QLIST_EMPTY(&vfio_address_spaces)) {
>          return false;
>      }
>  
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->migration_blocker) {
> -                return false;
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            vbasedev = NULL;
> +            while ((vbasedev = vfio_container_dev_iter_next(container,
> +                                                            vbasedev))) {

Couldn't you use an extra define such as:
#define CONTAINER_FOREACH_DEV(container, vbasedev) \
    for ((vbasedev) = vfio_container_dev_iter_next((container), NULL); \
         (vbasedev); \
         (vbasedev) = vfio_container_dev_iter_next((container), (vbasedev)))
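
so that the loop above would read something like (untested):

            CONTAINER_FOREACH_DEV(container, vbasedev) {
                if (vbasedev->migration_blocker) {
                    return false;
                }
            }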

> +                if (vbasedev->migration_blocker) {
> +                    return false;
> +                }
>              }
>          }
>      }
> @@ -133,14 +158,19 @@ static Error *multiple_devices_migration_blocker;
>  
>  static unsigned int vfio_migratable_device_num(void)
>  {
> -    VFIOGroup *group;
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
>      VFIODevice *vbasedev;
>      unsigned int device_num = 0;
>  
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->migration) {
> -                device_num++;
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            vbasedev = NULL;
> +            while ((vbasedev = vfio_container_dev_iter_next(container,
> +                                                            vbasedev))) {
> +                if (vbasedev->migration) {
> +                    device_num++;
> +                }
>              }
>          }
>      }
> @@ -207,8 +237,7 @@ static void vfio_set_migration_error(int err)
>  
>  static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  {
> -    VFIOGroup *group;
> -    VFIODevice *vbasedev;
> +    VFIODevice *vbasedev = NULL;
>      MigrationState *ms = migrate_get_current();
>  
>      if (ms->state != MIGRATION_STATUS_ACTIVE &&
> @@ -216,19 +245,17 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>          return false;
>      }
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            VFIOMigration *migration = vbasedev->migration;
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        VFIOMigration *migration = vbasedev->migration;
>  
> -            if (!migration) {
> -                return false;
> -            }
> +        if (!migration) {
> +            return false;
> +        }
>  
> -            if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> -                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
> -                return false;
> -            }
> +        if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> +            (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +             migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
> +            return false;
>          }
>      }
>      return true;
> @@ -236,14 +263,11 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  
>  static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>  {
> -    VFIOGroup *group;
> -    VFIODevice *vbasedev;
> +    VFIODevice *vbasedev = NULL;
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (!vbasedev->dirty_pages_supported) {
> -                return false;
> -            }
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        if (!vbasedev->dirty_pages_supported) {
> +            return false;
>          }
>      }
>  
> @@ -256,27 +280,24 @@ static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>   */
>  static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>  {
> -    VFIOGroup *group;
> -    VFIODevice *vbasedev;
> +    VFIODevice *vbasedev = NULL;
>  
>      if (!migration_is_active(migrate_get_current())) {
>          return false;
>      }
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            VFIOMigration *migration = vbasedev->migration;
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        VFIOMigration *migration = vbasedev->migration;
>  
> -            if (!migration) {
> -                return false;
> -            }
> +        if (!migration) {
> +            return false;
> +        }
>  
> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> -                continue;
> -            } else {
> -                return false;
> -            }
> +        if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +            migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> +            continue;
> +        } else {
> +            return false;
>          }
>      }
>      return true;
> @@ -1243,25 +1264,22 @@ static void vfio_devices_dma_logging_stop(VFIOContainer *container)
>      uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
>                                sizeof(uint64_t))] = {};
>      struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
> -    VFIODevice *vbasedev;
> -    VFIOGroup *group;
> +    VFIODevice *vbasedev = NULL;
>  
>      feature->argsz = sizeof(buf);
>      feature->flags = VFIO_DEVICE_FEATURE_SET |
>                       VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (!vbasedev->dirty_tracking) {
> -                continue;
> -            }
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        if (!vbasedev->dirty_tracking) {
> +            continue;
> +        }
>  
> -            if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> -                warn_report("%s: Failed to stop DMA logging, err %d (%s)",
> -                             vbasedev->name, -errno, strerror(errno));
> -            }
> -            vbasedev->dirty_tracking = false;
> +        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +            warn_report("%s: Failed to stop DMA logging, err %d (%s)",
> +                        vbasedev->name, -errno, strerror(errno));
>          }
> +        vbasedev->dirty_tracking = false;
>      }
>  }
>  
> @@ -1336,8 +1354,7 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>  {
>      struct vfio_device_feature *feature;
>      VFIODirtyRanges ranges;
> -    VFIODevice *vbasedev;
> -    VFIOGroup *group;
> +    VFIODevice *vbasedev = NULL;
>      int ret = 0;
>  
>      vfio_dirty_tracking_init(container, &ranges);
> @@ -1347,21 +1364,19 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>          return -errno;
>      }
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->dirty_tracking) {
> -                continue;
> -            }
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        if (vbasedev->dirty_tracking) {
> +            continue;
> +        }
>  
> -            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
> -            if (ret) {
> -                ret = -errno;
> -                error_report("%s: Failed to start DMA logging, err %d (%s)",
> -                             vbasedev->name, ret, strerror(errno));
> -                goto out;
> -            }
> -            vbasedev->dirty_tracking = true;
> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
> +        if (ret) {
> +            ret = -errno;
> +            error_report("%s: Failed to start DMA logging, err %d (%s)",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto out;
>          }
> +        vbasedev->dirty_tracking = true;
>      }
>  
>  out:
> @@ -1440,22 +1455,19 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
>                                             VFIOBitmap *vbmap, hwaddr iova,
>                                             hwaddr size)
>  {
> -    VFIODevice *vbasedev;
> -    VFIOGroup *group;
> +    VFIODevice *vbasedev = NULL;
>      int ret;
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            ret = vfio_device_dma_logging_report(vbasedev, iova, size,
> -                                                 vbmap->bitmap);
> -            if (ret) {
> -                error_report("%s: Failed to get DMA logging report, iova: "
> -                             "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
> -                             ", err: %d (%s)",
> -                             vbasedev->name, iova, size, ret, strerror(-ret));
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        ret = vfio_device_dma_logging_report(vbasedev, iova, size,
> +                                             vbmap->bitmap);
> +        if (ret) {
> +            error_report("%s: Failed to get DMA logging report, iova: "
> +                         "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
> +                         ", err: %d (%s)",
> +                         vbasedev->name, iova, size, ret, strerror(-ret));
>  
> -                return ret;
> -            }
> +            return ret;
>          }
>      }
>  
> @@ -1739,21 +1751,30 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>  
>  void vfio_reset_handler(void *opaque)
>  {
> -    VFIOGroup *group;
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
>      VFIODevice *vbasedev;
>  
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->dev->realized) {
> -                vbasedev->ops->vfio_compute_needs_reset(vbasedev);
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            vbasedev = NULL;
> +            while ((vbasedev = vfio_container_dev_iter_next(container,
> +                                                            vbasedev))) {
> +                if (vbasedev->dev->realized) {
> +                    vbasedev->ops->vfio_compute_needs_reset(vbasedev);
> +                }
>              }
>          }
>      }
>  
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->dev->realized && vbasedev->needs_reset) {
> -                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            vbasedev = NULL;
> +            while ((vbasedev = vfio_container_dev_iter_next(container,
> +                                                            vbasedev))) {
> +                if (vbasedev->dev->realized && vbasedev->needs_reset) {
> +                    vbasedev->ops->vfio_hot_reset_multi(vbasedev);
> +                    }
>              }
>          }
>      }
> @@ -1841,6 +1862,10 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
>      space->as = as;
>      QLIST_INIT(&space->containers);
>  
> +    if (QLIST_EMPTY(&vfio_address_spaces)) {
> +        qemu_register_reset(vfio_reset_handler, NULL);
> +    }
> +
>      QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
>  
>      return space;
> @@ -1852,6 +1877,9 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
>          QLIST_REMOVE(space, list);
>          g_free(space);
>      }
> +    if (QLIST_EMPTY(&vfio_address_spaces)) {
> +        qemu_unregister_reset(vfio_reset_handler, NULL);
> +    }
>  }
>  
>  /*
> @@ -2317,10 +2345,6 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>          goto close_fd_exit;
>      }
>  
> -    if (QLIST_EMPTY(&vfio_group_list)) {
> -        qemu_register_reset(vfio_reset_handler, NULL);
> -    }
> -
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
>      return group;
> @@ -2349,10 +2373,6 @@ void vfio_put_group(VFIOGroup *group)
>      trace_vfio_put_group(group->fd);
>      close(group->fd);
>      g_free(group);
> -
> -    if (QLIST_EMPTY(&vfio_group_list)) {
> -        qemu_unregister_reset(vfio_reset_handler, NULL);
> -    }
>  }
>  
>  struct vfio_device_info *vfio_get_device_info(int fd)
Thanks

Eric



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 17/22] util/char_dev: Add open_cdev()
  2023-08-30 10:37 ` [PATCH v1 17/22] util/char_dev: Add open_cdev() Zhenzhong Duan
@ 2023-09-20 12:39   ` Daniel P. Berrangé
  2023-09-20 12:53     ` Jason Gunthorpe
  2023-09-21  2:37     ` Duan, Zhenzhong
  0 siblings, 2 replies; 109+ messages in thread
From: Daniel P. Berrangé @ 2023-09-20 12:39 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, jgg, nicolinc, joao.m.martins,
	eric.auger, peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng

On Wed, Aug 30, 2023 at 06:37:49PM +0800, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> /dev/vfio/devices/vfioX may not exist. In that case it is still possible
> to open /dev/char/$major:$minor instead. Add helper function to abstract
> the cdev open.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  MAINTAINERS             |  6 ++++
>  include/qemu/char_dev.h | 16 +++++++++++
>  util/chardev_open.c     | 61 +++++++++++++++++++++++++++++++++++++++++

Using the same naming scheme for the .c and .h is strongly desired.

>  util/meson.build        |  1 +
>  4 files changed, 84 insertions(+)
>  create mode 100644 include/qemu/char_dev.h
>  create mode 100644 util/chardev_open.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 04663fbb6f..74d18593fe 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3372,6 +3372,12 @@ S: Maintained
>  F: include/qemu/iova-tree.h
>  F: util/iova-tree.c
>  
> +cdev Open
> +M: Yi Liu <yi.l.liu@intel.com>
> +S: Maintained
> +F: include/qemu/char_dev.h
> +F: util/chardev_open.c
> +


> diff --git a/util/chardev_open.c b/util/chardev_open.c
> new file mode 100644
> index 0000000000..d03e415131
> --- /dev/null
> +++ b/util/chardev_open.c
> @@ -0,0 +1,61 @@
> +/*
> + * Copyright (C) 2023 Intel Corporation.
> + * Copyright (c) 2019, Mellanox Technologies. All rights reserved.
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Copied from
> + * https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
> + *
> + */

Since this is GPL-2.0-only, IMHO it would be preferable to keep it
out of the util/ directory, as we're aiming not to add further 2.0-only
code, except for specific subdirs. This only appears to be used
by code under hw/vfio/, which is one of the dirs still permitting
2.0-only code. So I think it is better to keep this file where it is used.

> +#ifndef _GNU_SOURCE
> +#define _GNU_SOURCE
> +#endif

This is set globally for building all files in QEMU

> +#include "qemu/osdep.h"
> +#include "qemu/char_dev.h"
> +
> +static int open_cdev_internal(const char *path, dev_t cdev)
> +{
> +    struct stat st;
> +    int fd;
> +
> +    fd = qemu_open_old(path, O_RDWR);
> +    if (fd == -1) {
> +        return -1;
> +    }
> +    if (fstat(fd, &st) || !S_ISCHR(st.st_mode) ||
> +        (cdev != 0 && st.st_rdev != cdev)) {
> +        close(fd);
> +        return -1;
> +    }
> +    return fd;
> +}
> +
> +static int open_cdev_robust(dev_t cdev)
> +{
> +    char *devpath;

g_autofree for this...

> +    int ret;
> +
> +    /*
> +     * This assumes that udev is being used and is creating the /dev/char/
> +     * symlinks.
> +     */
> +    devpath = g_strdup_printf("/dev/char/%u:%u", major(cdev), minor(cdev));
> +    ret = open_cdev_internal(devpath, cdev);
> +    g_free(devpath);

...avoids the need for g_free, and also avoids the need for
the intermediate 'ret' variable.
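
i.e. something like (untested):

    static int open_cdev_robust(dev_t cdev)
    {
        /*
         * This assumes that udev is being used and is creating the
         * /dev/char/ symlinks.
         */
        g_autofree char *devpath = g_strdup_printf("/dev/char/%u:%u",
                                                   major(cdev), minor(cdev));

        return open_cdev_internal(devpath, cdev);
    }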

> +    return ret;
> +}
> +
> +int open_cdev(const char *devpath, dev_t cdev)
> +{
> +    int fd;
> +
> +    fd = open_cdev_internal(devpath, cdev);
> +    if (fd == -1 && cdev != 0) {
> +        return open_cdev_robust(cdev);
> +    }
> +    return fd;
> +}
> diff --git a/util/meson.build b/util/meson.build
> index a375160286..d5313d858f 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -107,6 +107,7 @@ if have_block
>      util_ss.add(files('filemonitor-stub.c'))
>    endif
>    util_ss.add(when: 'CONFIG_LINUX', if_true: files('vfio-helpers.c'))
> +  util_ss.add(when: 'CONFIG_LINUX', if_true: files('chardev_open.c'))
>  endif
>  
>  if cpu == 'aarch64'
> -- 
> 2.34.1
> 
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 12:19       ` Cédric Le Goater
@ 2023-09-20 12:51         ` Jason Gunthorpe
  2023-09-20 13:01           ` Daniel P. Berrangé
  2023-09-20 13:02           ` Cédric Le Goater
  2023-09-21  2:11         ` Duan, Zhenzhong
  1 sibling, 2 replies; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-20 12:51 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Duan, Zhenzhong, qemu-devel, alex.williamson, nicolinc, Martins,
	Joao, eric.auger, peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun,
	Yi Y, Peng, Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé

On Wed, Sep 20, 2023 at 02:19:42PM +0200, Cédric Le Goater wrote:
> On 9/20/23 05:42, Duan, Zhenzhong wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Cédric Le Goater <clg@redhat.com>
> > > Sent: Wednesday, September 20, 2023 1:08 AM
> > > Subject: Re: [PATCH v1 15/22] Add iommufd configure option
> > > 
> > > On 8/30/23 12:37, Zhenzhong Duan wrote:
> > > > This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> > > > iommufd support, enabled by default.
> > > 
> > > Why would someone want to disable support at compile time ? It might
> > 
> > For those users who only want to support legacy container feature?
> > Let me know if you still prefer to drop this patch, I'm fine with that.
> 
> I think it is too early.
> 
> > > have been useful for dev but now QEMU should self-adjust at runtime
> > > depending only on the host capabilities AFAIUI. Am I missing something ?
> > 
> > IOMMUFD doesn't support all features of legacy container, so QEMU
> > doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> > We need to specify it explicitly to use IOMMUFD as below:
> > 
> >      -object iommufd,id=iommufd0
> >      -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
> 
> OK. I am not sure this is the correct interface yet. At first glance,
> I wouldn't introduce a new object for a simple backend depending on a
> kernel interface. I would tend to prefer a "iommu-something" property
> of the vfio-pci device with string values: "legacy", "iommufd", "default"
> and define the various interfaces (the ops you proposed) for each
> depending on the user preference and the capabilities of the host and
> possibly the device.

I think the idea came from Alex? The major point is to be able to have
libvirt open /dev/iommufd and FD pass it into qemu and then share that
single FD across all VFIOs. qemu will typically not be able to
self-open /dev/iommufd as it is root-only.

So the object is not exactly for the backend, the object is for the
file descriptor.

Adding a legacy/iommufd option to the vfio-pci device string doesn't
address these needs.

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 17/22] util/char_dev: Add open_cdev()
  2023-09-20 12:39   ` Daniel P. Berrangé
@ 2023-09-20 12:53     ` Jason Gunthorpe
  2023-09-20 12:56       ` Daniel P. Berrangé
  2023-09-21  2:37     ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-20 12:53 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, nicolinc,
	joao.m.martins, eric.auger, peterx, jasowang, kevin.tian,
	yi.l.liu, yi.y.sun, chao.p.peng

On Wed, Sep 20, 2023 at 01:39:02PM +0100, Daniel P. Berrangé wrote:

> > diff --git a/util/chardev_open.c b/util/chardev_open.c
> > new file mode 100644
> > index 0000000000..d03e415131
> > --- /dev/null
> > +++ b/util/chardev_open.c
> > @@ -0,0 +1,61 @@
> > +/*
> > + * Copyright (C) 2023 Intel Corporation.
> > + * Copyright (c) 2019, Mellanox Technologies. All rights reserved.
> > + *
> > + * Authors: Yi Liu <yi.l.liu@intel.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.  See
> > + * the COPYING file in the top-level directory.
> > + *
> > + * Copied from
> > + * https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
> > + *
> > + */
> 
> Since this is GPL-2.0-only, IMHO it would be preferable to keep it
> out of the util/ directory, as we're aiming not to add further 2.0-only
> code, except for specific subdirs. This only appears to be used
> by code under hw/vfio/, which is one of the dirs still permitting
> 2.0-only code. So I think it is better to keep this file where it is used.

The copyright comment above is not fully accurate.

The original code is under the "OpenIB" dual license, you can choose
to take it using the OpenIB BSD license text:

 *      Redistribution and use in source and binary forms, with or
 *      without modification, are permitted provided that the following
 *      conditions are met:
 *
 *      - Redistributions of source code must retain the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer.
 *
 *      - Redistributions in binary form must reproduce the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer in the documentation and/or other materials
 *        provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.

And drop reference to GPL if that is what qemu desires.

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 17/22] util/char_dev: Add open_cdev()
  2023-09-20 12:53     ` Jason Gunthorpe
@ 2023-09-20 12:56       ` Daniel P. Berrangé
  0 siblings, 0 replies; 109+ messages in thread
From: Daniel P. Berrangé @ 2023-09-20 12:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, nicolinc,
	joao.m.martins, eric.auger, peterx, jasowang, kevin.tian,
	yi.l.liu, yi.y.sun, chao.p.peng

On Wed, Sep 20, 2023 at 09:53:46AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 20, 2023 at 01:39:02PM +0100, Daniel P. Berrangé wrote:
> 
> > > diff --git a/util/chardev_open.c b/util/chardev_open.c
> > > new file mode 100644
> > > index 0000000000..d03e415131
> > > --- /dev/null
> > > +++ b/util/chardev_open.c
> > > @@ -0,0 +1,61 @@
> > > +/*
> > > + * Copyright (C) 2023 Intel Corporation.
> > > + * Copyright (c) 2019, Mellanox Technologies. All rights reserved.
> > > + *
> > > + * Authors: Yi Liu <yi.l.liu@intel.com>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2.  See
> > > + * the COPYING file in the top-level directory.
> > > + *
> > > + * Copied from
> > > + * https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
> > > + *
> > > + */
> > 
> > Since this is GPL-2.0-only, IMHO it would be preferable to keep it
> > out of the util/ directory, as we're aiming not to add further 2.0-only
> > code, except for specific subdirs. This only appears to be used
> > by code under hw/vfio/, which is one of the dirs still permitting
> > 2.0-only code. So I think it is better to keep this file where it is used.
> 
> The copyright comment above is not fully accurate.
> 
> The original code is under the "OpenIB" dual license, you can choose
> to take it using the OpenIB BSD license text:
> 
>  *      Redistribution and use in source and binary forms, with or
>  *      without modification, are permitted provided that the following
>  *      conditions are met:
>  *
>  *      - Redistributions of source code must retain the above
>  *        copyright notice, this list of conditions and the following
>  *        disclaimer.
>  *
>  *      - Redistributions in binary form must reproduce the above
>  *        copyright notice, this list of conditions and the following
>  *        disclaimer in the documentation and/or other materials
>  *        provided with the distribution.
>  *
>  * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>  * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
>  * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
>  * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
>  * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
>  * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
>  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
>  * SOFTWARE.
> 
> And drop reference to GPL if that is what qemu desires.

Simplest is probably just to copy the original license header as-is,
and thus preserve the GPL OR BSD choice.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 13/22] vfio: Add base container
  2023-09-20  8:48     ` Duan, Zhenzhong
@ 2023-09-20 12:57       ` Cédric Le Goater
  2023-09-20 13:58         ` Eric Auger
  2023-09-21  2:51         ` Duan, Zhenzhong
  0 siblings, 2 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-20 12:57 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Yi Sun, Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)

On 9/20/23 10:48, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Sent: Wednesday, September 20, 2023 1:24 AM
>> Subject: Re: [PATCH v1 13/22] vfio: Add base container
>>
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> Abstract the VFIOContainer to be a base object. It is supposed to be
>>> embedded by legacy VFIO container and later on, into the new iommufd
>>> based container.
>>>
>>> The base container implements generic code such as code related to
>>> memory_listener and address space management. The VFIOContainerOps
>>> implements callbacks that depend on the kernel user space being used.
>>>
>>> 'common.c' and vfio device code only manipulates the base container with
>>> wrapper functions that calls the functions defined in VFIOContainerOpsClass.
>>> Existing 'container.c' code is converted to implement the legacy container
>>> ops functions.
>>>
>>> Below is the base container. It's named as VFIOContainer, old VFIOContainer
>>> is replaced with VFIOLegacyContainer.
>>
>> Usualy, we introduce the new interface solely, port the current models
>> on top of the new interface, wire the new models in the current
>> implementation and remove the old implementation. Then, we can start
>> adding extensions to support other implementations.
> 
> Not sure if I understand your point correctly. Do you mean to introduce
> a new type for the base container as below:
> 
> static const TypeInfo vfio_container_info = {
>      .parent             = TYPE_OBJECT,
>      .name               = TYPE_VFIO_CONTAINER,
>      .class_size         = sizeof(VFIOContainerClass),
>      .instance_size      = sizeof(VFIOContainer),
>      .abstract           = true,
>      .interfaces = (InterfaceInfo[]) {
>          { TYPE_VFIO_IOMMU_BACKEND_OPS },
>          { }
>      }
> };
> 
> and a new interface as below:
> 
> static const TypeInfo vfio_iommu_backend_ops_info = {
>      .name = TYPE_VFIO_IOMMU_BACKEND_OPS,
>      .parent = TYPE_INTERFACE,
>      .class_size = sizeof(VFIOIOMMUBackendOpsClass),
> };
> 
> struct VFIOIOMMUBackendOpsClass {
>      InterfaceClass parent;
>      VFIODevice *(*dev_iter_next)(VFIOContainer *container, VFIODevice *curr);
>      int (*dma_map)(VFIOContainer *container,
>      ......
> };
> 
> and legacy container on top of TYPE_VFIO_CONTAINER?
> 
> static const TypeInfo vfio_legacy_container_info = {
>      .parent = TYPE_VFIO_CONTAINER,
>      .name = TYPE_VFIO_LEGACY_CONTAINER,
>      .class_init = vfio_legacy_container_class_init,
> };
> 
> This object style was rejected early in RFCv1.
> See https://lore.kernel.org/kvm/20220414104710.28534-8-yi.l.liu@intel.com/

Ouch, this was long ago and I was not aware :/ Bear with me, I will
probably ask the same questions. Nevertheless, we could improve the
cover letter and the flow of changes in the patchset to help the reader.

>> spapr should be taken care of separately, following the principle above.
>> With my PPC hat on, I would not even read such a massive change, too risky
>> for the subsystem. This patch will need (much) further splitting to be
>> understandable and acceptable.
> 
> I'll dig into this and try to split it.

I know I am asking for a lot of work. Thanks for that.

> Meanwhile, there are many changes
> just renaming the parameter or function name for code readability.
> For example:
> 
> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
> -                   ram_addr_t size, IOMMUTLBEntry *iotlb)
> +static int vfio_legacy_dma_unmap(VFIOContainer *bcontainer, hwaddr iova,
> +                          ram_addr_t size, IOMMUTLBEntry *iotlb)
> 
> -        ret = vfio_get_dirty_bitmap(container, iova, size,
> +        ret = vfio_get_dirty_bitmap(bcontainer, iova, size,
> 
> Let me know if you think such changes are unnecessary; dropping them would
> reduce this patch considerably.

Cleanups, renames, some code reshuffling, anything preparing the ground for
the new abstraction is good to have first and can be merged very quickly
if there are no functional changes. It reduces the overall patchset and
eases the coming reviews.

You can send such series independently. That's fine.

> 
>>
>> Also, please include the .h file first, it helps in reading.
> 
> Do you mean to put the struct declaration earlier in the patch description?

Just add to your .gitconfig :

[diff]
	orderFile = /path/to/qemu/scripts/git.orderfile

It should be enough

>> Have you considered using an InterfaceClass ?
> 
> See above, with object style rejected, it looks hard to use InterfaceClass.

I am not convinced by the QOM approach. I will dig into the past arguments
and let's see what we come up with.

Thanks,

C.


> Thanks
> Zhenzhong
> 



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic
  2023-08-30 10:37 ` [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic Zhenzhong Duan
@ 2023-09-20 13:00   ` Eric Auger
  2023-09-21  2:52     ` Duan, Zhenzhong
  2023-09-20 22:51   ` Alex Williamson
  1 sibling, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 13:00 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng



On 8/30/23 12:37, Zhenzhong Duan wrote:
> So that it doesn't need to be moved into container.c as done
> in following patch.
This is a bit weird to refer to container.c, which is not yet created. I
would suggest just reusing the commit title as the commit msg; also, this
will make it easier to handle multiple IOMMU BEs.

Eric
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/common.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 51c6e7598e..fda5fc87b9 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -219,7 +219,22 @@ void vfio_unblock_multiple_devices_migration(void)
>  
>  bool vfio_viommu_preset(VFIODevice *vbasedev)
>  {
> -    return vbasedev->group->container->space->as != &address_space_memory;
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +    VFIODevice *tmp_dev;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            tmp_dev = NULL;
> +            while ((tmp_dev = vfio_container_dev_iter_next(container,
> +                                                           tmp_dev))) {
> +                if (vbasedev == tmp_dev) {
> +                    return space->as != &address_space_memory;
> +                }
> +            }
> +        }
> +    }
> +    g_assert_not_reached();
>  }
>  
>  static void vfio_set_migration_error(int err)



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 12:51         ` Jason Gunthorpe
@ 2023-09-20 13:01           ` Daniel P. Berrangé
  2023-09-20 13:07             ` Jason Gunthorpe
  2023-09-20 13:02           ` Cédric Le Goater
  1 sibling, 1 reply; 109+ messages in thread
From: Daniel P. Berrangé @ 2023-09-20 13:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Cédric Le Goater, Duan, Zhenzhong, qemu-devel,
	alex.williamson, nicolinc, Martins, Joao, eric.auger, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Paolo Bonzini, Marc-André Lureau, Thomas Huth,
	Philippe Mathieu-Daudé

On Wed, Sep 20, 2023 at 09:51:03AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 20, 2023 at 02:19:42PM +0200, Cédric Le Goater wrote:
> > On 9/20/23 05:42, Duan, Zhenzhong wrote:
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Cédric Le Goater <clg@redhat.com>
> > > > Sent: Wednesday, September 20, 2023 1:08 AM
> > > > Subject: Re: [PATCH v1 15/22] Add iommufd configure option
> > > > 
> > > > On 8/30/23 12:37, Zhenzhong Duan wrote:
> > > > > This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> > > > > iommufd support, enabled by default.
> > > > 
> > > > Why would someone want to disable support at compile time ? It might
> > > 
> > > For those users who only want to support legacy container feature?
> > > Let me know if you still prefer to drop this patch, I'm fine with that.
> > 
> > I think it is too early.
> > 
> > > > have been useful for dev but now QEMU should self-adjust at runtime
> > > > depending only on the host capabilities AFAIUI. Am I missing something ?
> > > 
> > > IOMMUFD doesn't support all features of legacy container, so QEMU
> > > doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> > > We need to specify it explicitly to use IOMMUFD as below:
> > > 
> > >      -object iommufd,id=iommufd0
> > >      -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
> > 
> > OK. I am not sure this is the correct interface yet. At first glance,
> > I wouldn't introduce a new object for a simple backend depending on a
> > kernel interface. I would tend to prefer a "iommu-something" property
> > of the vfio-pci device with string values: "legacy", "iommufd", "default"
> > and define the various interfaces (the ops you proposed) for each
> > depending on the user preference and the capabilities of the host and
> > possibly the device.
> 
> I think the idea came from Alex? The major point is to be able to have
> libvirt open /dev/iommufd and FD pass it into qemu and then share that
> single FD across all VFIOs. qemu will typically not be able to
> self-open /dev/iommufd as it is root-only.
> 
> So the object is not exactly for the backend, the object is for the
> file descriptor.

Assuming we must have the exact same FD used for all vfio-pci devices,
then using -object iommufd is the least worst way to get that FD
injected into QEMU from libvirt. It is a little sucky in that when
hotplugging/unplugging devices, libvirt has to think about whether or
not it has to object_add/object_del the iommufd object. Again, I don't see
better options considering the need to have a single global FD.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 12:51         ` Jason Gunthorpe
  2023-09-20 13:01           ` Daniel P. Berrangé
@ 2023-09-20 13:02           ` Cédric Le Goater
  2023-09-20 17:37             ` Eric Auger
  2023-09-21  4:00             ` Duan, Zhenzhong
  1 sibling, 2 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-20 13:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Duan, Zhenzhong, qemu-devel, alex.williamson, nicolinc, Martins,
	Joao, eric.auger, peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun,
	Yi Y, Peng, Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé

On 9/20/23 14:51, Jason Gunthorpe wrote:
> On Wed, Sep 20, 2023 at 02:19:42PM +0200, Cédric Le Goater wrote:
>> On 9/20/23 05:42, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Cédric Le Goater <clg@redhat.com>
>>>> Sent: Wednesday, September 20, 2023 1:08 AM
>>>> Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>>>>
>>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>>> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
>>>>> iommufd support, enabled by default.
>>>>
>>>> Why would someone want to disable support at compile time ? It might
>>>
>>> For those users who only want to support legacy container feature?
>>> Let me know if you still prefer to drop this patch, I'm fine with that.
>>
>> I think it is too early.
>>
>>>> have been useful for dev but now QEMU should self-adjust at runtime
>>>> depending only on the host capabilities AFAIUI. Am I missing something ?
>>>
>>> IOMMUFD doesn't support all features of legacy container, so QEMU
>>> doesn't self-adjust at runtime by checking if host supports IOMMUFD.
>>> We need to specify it explicitly to use IOMMUFD as below:
>>>
>>>       -object iommufd,id=iommufd0
>>>       -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
>>
>> OK. I am not sure this is the correct interface yet. At first glance,
>> I wouldn't introduce a new object for a simple backend depending on a
>> kernel interface. I would tend to prefer a "iommu-something" property
>> of the vfio-pci device with string values: "legacy", "iommufd", "default"
>> and define the various interfaces (the ops you proposed) for each
>> depending on the user preference and the capabilities of the host and
>> possibly the device.
> 
> I think the idea came from Alex? The major point is to be able to have
> libvirt open /dev/iommufd and FD pass it into qemu 

ok.

> and then share that single FD across all VFIOs. 

I will ask Alex to help me catch up on the topic.

> qemu will typically not be able to
> self-open /dev/iommufd as it is root-only.

I don't understand, we open multiple fds to KVM devices. This is the same.

> 
> So the object is not exactly for the backend, the object is for the
> file descriptor.
got it.

> 
> Adding a legacy/iommufd option to the vfio-pci device string doesn't
> address these needs.

I agree.

Thanks,

C.

> Jason
> 



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 13:01           ` Daniel P. Berrangé
@ 2023-09-20 13:07             ` Jason Gunthorpe
  0 siblings, 0 replies; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-20 13:07 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Cédric Le Goater, Duan, Zhenzhong, qemu-devel,
	alex.williamson, nicolinc, Martins, Joao, eric.auger, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P,
	Paolo Bonzini, Marc-André Lureau, Thomas Huth,
	Philippe Mathieu-Daudé

On Wed, Sep 20, 2023 at 02:01:39PM +0100, Daniel P. Berrangé wrote:

> Assuming we must have the exact same FD used for all vfio-pci devices,
> then using -object iommufd is the least worst way to get that FD
> injected into QEMU from libvirt. 

Yes, same FD. It is a shared resource.

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 08/22] vfio/common: Move legacy VFIO backend code into separate container.c
  2023-08-30 10:37 ` [PATCH v1 08/22] vfio/common: Move legacy VFIO backend code into separate container.c Zhenzhong Duan
@ 2023-09-20 13:12   ` Eric Auger
  2023-09-21  3:02     ` Duan, Zhenzhong
  0 siblings, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 13:12 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

Hi,

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> Move all the code really dependent on the legacy VFIO container/group
> into a separate file: container.c. What does remain in common.c is
> the code related to VFIOAddressSpace, MemoryListeners, migration and
> all other general operations.
>
> Move struct VFIOBitmap declaration to vfio-common.h also for container.c
> usage.
note: this may be done in the 3rd patch since vfio_bitmap_alloc could
land in helpers.c
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> ---
>  hw/vfio/common.c              | 1085 +--------------------------------
>  hw/vfio/container.c           | 1085 +++++++++++++++++++++++++++++++++
>  hw/vfio/meson.build           |    1 +
>  include/hw/vfio/vfio-common.h |   45 ++
>  4 files changed, 1147 insertions(+), 1069 deletions(-)
>  create mode 100644 hw/vfio/container.c
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fda5fc87b9..044710fc1f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -45,8 +45,6 @@
>  #include "migration/qemu-file.h"
>  #include "sysemu/tpm.h"
>  
> -VFIOGroupList vfio_group_list =
> -    QLIST_HEAD_INITIALIZER(vfio_group_list);
>  static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
>      QLIST_HEAD_INITIALIZER(vfio_address_spaces);
>  
> @@ -58,63 +56,14 @@ static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
>   * initialized, this file descriptor is only released on QEMU exit and
>   * we'll re-use it should another vfio device be attached before then.
>   */
> -static int vfio_kvm_device_fd = -1;
> +int vfio_kvm_device_fd = -1;
>  #endif
>  
> -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
> -{
> -    switch (container->iommu_type) {
> -    case VFIO_TYPE1v2_IOMMU:
> -    case VFIO_TYPE1_IOMMU:
> -        /*
> -         * We support coordinated discarding of RAM via the RamDiscardManager.
> -         */
> -        return ram_block_uncoordinated_discard_disable(state);
> -    default:
> -        /*
> -         * VFIO_SPAPR_TCE_IOMMU most probably works just fine with
> -         * RamDiscardManager, however, it is completely untested.
> -         *
> -         * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does
> -         * completely the opposite of managing mapping/pinning dynamically as
> -         * required by RamDiscardManager. We would have to special-case sections
> -         * with a RamDiscardManager.
> -         */
> -        return ram_block_discard_disable(state);
> -    }
> -}
> -
> -static VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> -                                                VFIODevice *curr)
> -{
> -    VFIOGroup *group;
> -
> -    if (!curr) {
> -        group = QLIST_FIRST(&container->group_list);
> -    } else {
> -        if (curr->next.le_next) {
> -            return curr->next.le_next;
> -        }
> -        group = curr->group->container_next.le_next;
> -    }
> -
> -    if (!group) {
> -        return NULL;
> -    }
> -    return QLIST_FIRST(&group->device_list);
> -}
> -
>  /*
>   * Device state interfaces
>   */
>  
> -typedef struct {
> -    unsigned long *bitmap;
> -    hwaddr size;
> -    hwaddr pages;
> -} VFIOBitmap;
> -
> -static int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size)
> +int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size)
>  {
>      vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
>      vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
> @@ -127,9 +76,6 @@ static int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size)
>      return 0;
>  }
>  
> -static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> -                                 uint64_t size, ram_addr_t ram_addr);
> -
>  bool vfio_mig_active(void)
>  {
>      VFIOAddressSpace *space;
> @@ -276,7 +222,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>      return true;
>  }
>  
> -static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
> +bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>  {
>      VFIODevice *vbasedev = NULL;
>  
> @@ -293,7 +239,7 @@ static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>   * Check if all VFIO devices are running and migration is active, which is
>   * essentially equivalent to the migration being in pre-copy phase.
>   */
> -static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> +bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>  {
>      VFIODevice *vbasedev = NULL;
>  
> @@ -318,150 +264,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>      return true;
>  }
>  
> -static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> -                                 hwaddr iova, ram_addr_t size,
> -                                 IOMMUTLBEntry *iotlb)
> -{
> -    struct vfio_iommu_type1_dma_unmap *unmap;
> -    struct vfio_bitmap *bitmap;
> -    VFIOBitmap vbmap;
> -    int ret;
> -
> -    ret = vfio_bitmap_alloc(&vbmap, size);
> -    if (ret) {
> -        return ret;
> -    }
> -
> -    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> -
> -    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> -    unmap->iova = iova;
> -    unmap->size = size;
> -    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> -    bitmap = (struct vfio_bitmap *)&unmap->data;
> -
> -    /*
> -     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
> -     * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
> -     * to qemu_real_host_page_size.
> -     */
> -    bitmap->pgsize = qemu_real_host_page_size();
> -    bitmap->size = vbmap.size;
> -    bitmap->data = (__u64 *)vbmap.bitmap;
> -
> -    if (vbmap.size > container->max_dirty_bitmap_size) {
> -        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap.size);
> -        ret = -E2BIG;
> -        goto unmap_exit;
> -    }
> -
> -    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
> -    if (!ret) {
> -        cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap,
> -                iotlb->translated_addr, vbmap.pages);
> -    } else {
> -        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
> -    }
> -
> -unmap_exit:
> -    g_free(unmap);
> -    g_free(vbmap.bitmap);
> -
> -    return ret;
> -}
> -
> -/*
> - * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
> - */
> -static int vfio_dma_unmap(VFIOContainer *container,
> -                          hwaddr iova, ram_addr_t size,
> -                          IOMMUTLBEntry *iotlb)
> -{
> -    struct vfio_iommu_type1_dma_unmap unmap = {
> -        .argsz = sizeof(unmap),
> -        .flags = 0,
> -        .iova = iova,
> -        .size = size,
> -    };
> -    bool need_dirty_sync = false;
> -    int ret;
> -
> -    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
> -        if (!vfio_devices_all_device_dirty_tracking(container) &&
> -            container->dirty_pages_supported) {
> -            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> -        }
> -
> -        need_dirty_sync = true;
> -    }
> -
> -    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> -        /*
> -         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> -         * v4.15) where an overflow in its wrap-around check prevents us from
> -         * unmapping the last page of the address space.  Test for the error
> -         * condition and re-try the unmap excluding the last page.  The
> -         * expectation is that we've never mapped the last page anyway and this
> -         * unmap request comes via vIOMMU support which also makes it unlikely
> -         * that this page is used.  This bug was introduced well after type1 v2
> -         * support was introduced, so we shouldn't need to test for v1.  A fix
> -         * is queued for kernel v5.0 so this workaround can be removed once
> -         * affected kernels are sufficiently deprecated.
> -         */
> -        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
> -            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
> -            trace_vfio_dma_unmap_overflow_workaround();
> -            unmap.size -= 1ULL << ctz64(container->pgsizes);
> -            continue;
> -        }
> -        error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
> -        return -errno;
> -    }
> -
> -    if (need_dirty_sync) {
> -        ret = vfio_get_dirty_bitmap(container, iova, size,
> -                                    iotlb->translated_addr);
> -        if (ret) {
> -            return ret;
> -        }
> -    }
> -
> -    return 0;
> -}
> -
> -static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> -                        ram_addr_t size, void *vaddr, bool readonly)
> -{
> -    struct vfio_iommu_type1_dma_map map = {
> -        .argsz = sizeof(map),
> -        .flags = VFIO_DMA_MAP_FLAG_READ,
> -        .vaddr = (__u64)(uintptr_t)vaddr,
> -        .iova = iova,
> -        .size = size,
> -    };
> -
> -    if (!readonly) {
> -        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
> -    }
> -
> -    /*
> -     * Try the mapping, if it fails with EBUSY, unmap the region and try
> -     * again.  This shouldn't be necessary, but we sometimes see it in
> -     * the VGA ROM space.
> -     */
> -    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
> -         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
> -        return 0;
> -    }
> -
> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> -    return -errno;
> -}
> -
> -static void vfio_host_win_add(VFIOContainer *container,
> -                              hwaddr min_iova, hwaddr max_iova,
> -                              uint64_t iova_pgsizes)
> +void vfio_host_win_add(VFIOContainer *container, hwaddr min_iova,
> +                       hwaddr max_iova, uint64_t iova_pgsizes)
>  {
>      VFIOHostDMAWindow *hostwin;
>  
> @@ -482,8 +286,8 @@ static void vfio_host_win_add(VFIOContainer *container,
>      QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
>  }
>  
> -static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
> -                             hwaddr max_iova)
> +int vfio_host_win_del(VFIOContainer *container,
> +                      hwaddr min_iova, hwaddr max_iova)
>  {
>      VFIOHostDMAWindow *hostwin;
>  
> @@ -832,92 +636,6 @@ static bool vfio_get_section_iova_range(VFIOContainer *container,
>      return true;
>  }
>  
> -static int vfio_container_add_section_window(VFIOContainer *container,
> -                                             MemoryRegionSection *section,
> -                                             Error **errp)
> -{
> -    VFIOHostDMAWindow *hostwin;
> -    hwaddr pgsize = 0;
> -    int ret;
> -
> -    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
> -        return 0;
> -    }
> -
> -    /* For now intersections are not allowed, we may relax this later */
> -    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> -        if (ranges_overlap(hostwin->min_iova,
> -                           hostwin->max_iova - hostwin->min_iova + 1,
> -                           section->offset_within_address_space,
> -                           int128_get64(section->size))) {
> -            error_setg(errp,
> -                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
> -                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
> -                section->offset_within_address_space,
> -                section->offset_within_address_space +
> -                    int128_get64(section->size) - 1,
> -                hostwin->min_iova, hostwin->max_iova);
> -            return -EINVAL;
> -        }
> -    }
> -
> -    ret = vfio_spapr_create_window(container, section, &pgsize);
> -    if (ret) {
> -        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
> -        return ret;
> -    }
> -
> -    vfio_host_win_add(container, section->offset_within_address_space,
> -                      section->offset_within_address_space +
> -                      int128_get64(section->size) - 1, pgsize);
> -#ifdef CONFIG_KVM
> -    if (kvm_enabled()) {
> -        VFIOGroup *group;
> -        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
> -        struct kvm_vfio_spapr_tce param;
> -        struct kvm_device_attr attr = {
> -            .group = KVM_DEV_VFIO_GROUP,
> -            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
> -            .addr = (uint64_t)(unsigned long)&param,
> -        };
> -
> -        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
> -                                          &param.tablefd)) {
> -            QLIST_FOREACH(group, &container->group_list, container_next) {
> -                param.groupfd = group->fd;
> -                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> -                    error_report("vfio: failed to setup fd %d "
> -                                 "for a group with fd %d: %s",
> -                                 param.tablefd, param.groupfd,
> -                                 strerror(errno));
> -                    return 0;
> -                }
> -                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
> -            }
> -        }
> -    }
> -#endif
> -    return 0;
> -}
> -
> -static void vfio_container_del_section_window(VFIOContainer *container,
> -                                              MemoryRegionSection *section)
> -{
> -    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
> -        return;
> -    }
> -
> -    vfio_spapr_remove_window(container,
> -                             section->offset_within_address_space);
> -    if (vfio_host_win_del(container,
> -                          section->offset_within_address_space,
> -                          section->offset_within_address_space +
> -                          int128_get64(section->size) - 1) < 0) {
> -        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> -                 __func__, section->offset_within_address_space);
> -    }
> -}
> -
>  static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
> @@ -1165,33 +883,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      vfio_container_del_section_window(container, section);
>  }
>  
> -static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
> -{
> -    int ret;
> -    struct vfio_iommu_type1_dirty_bitmap dirty = {
> -        .argsz = sizeof(dirty),
> -    };
> -
> -    if (!container->dirty_pages_supported) {
> -        return 0;
> -    }
> -
> -    if (start) {
> -        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> -    } else {
> -        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> -    }
> -
> -    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> -    if (ret) {
> -        ret = -errno;
> -        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> -                     dirty.flags, errno);
> -    }
> -
> -    return ret;
> -}
> -
>  typedef struct VFIODirtyRanges {
>      hwaddr min32;
>      hwaddr max32;
> @@ -1466,9 +1157,9 @@ static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
>      return 0;
>  }
>  
> -static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
> -                                           VFIOBitmap *vbmap, hwaddr iova,
> -                                           hwaddr size)
> +int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
> +                                    VFIOBitmap *vbmap, hwaddr iova,
> +                                    hwaddr size)
>  {
>      VFIODevice *vbasedev = NULL;
>      int ret;
> @@ -1489,45 +1180,8 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
>      return 0;
>  }
>  
> -static int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
> -                                   hwaddr iova, hwaddr size)
> -{
> -    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> -    struct vfio_iommu_type1_dirty_bitmap_get *range;
> -    int ret;
> -
> -    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
> -
> -    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> -    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> -    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> -    range->iova = iova;
> -    range->size = size;
> -
> -    /*
> -     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
> -     * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize
> -     * to qemu_real_host_page_size.
> -     */
> -    range->bitmap.pgsize = qemu_real_host_page_size();
> -    range->bitmap.size = vbmap->size;
> -    range->bitmap.data = (__u64 *)vbmap->bitmap;
> -
> -    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> -    if (ret) {
> -        ret = -errno;
> -        error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
> -                " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
> -                (uint64_t)range->size, errno);
> -    }
> -
> -    g_free(dbitmap);
> -
> -    return ret;
> -}
> -
> -static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> -                                 uint64_t size, ram_addr_t ram_addr)
> +int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> +                          uint64_t size, ram_addr_t ram_addr)
>  {
>      bool all_device_dirty_tracking =
>          vfio_devices_all_device_dirty_tracking(container);
> @@ -1716,7 +1370,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>      }
>  }
>  
> -static const MemoryListener vfio_memory_listener = {
> +const MemoryListener vfio_memory_listener = {
>      .name = "vfio",
>      .region_add = vfio_listener_region_add,
>      .region_del = vfio_listener_region_del,
> @@ -1725,45 +1379,6 @@ static const MemoryListener vfio_memory_listener = {
>      .log_sync = vfio_listener_log_sync,
>  };
>  
> -static void vfio_listener_release(VFIOContainer *container)
> -{
> -    memory_listener_unregister(&container->listener);
> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> -        memory_listener_unregister(&container->prereg_listener);
> -    }
> -}
> -
> -static struct vfio_info_cap_header *
> -vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> -{
> -    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> -        return NULL;
> -    }
> -
> -    return vfio_get_cap((void *)info, info->cap_offset, id);
> -}
> -
> -bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
> -                             unsigned int *avail)
> -{
> -    struct vfio_info_cap_header *hdr;
> -    struct vfio_iommu_type1_info_dma_avail *cap;
> -
> -    /* If the capability cannot be found, assume no DMA limiting */
> -    hdr = vfio_get_iommu_type1_info_cap(info,
> -                                        VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL);
> -    if (hdr == NULL) {
> -        return false;
> -    }
> -
> -    if (avail != NULL) {
> -        cap = (void *) hdr;
> -        *avail = cap->avail;
> -    }
> -
> -    return true;
> -}
> -
>  void vfio_reset_handler(void *opaque)
>  {
>      VFIOAddressSpace *space;
> @@ -1830,11 +1445,6 @@ int vfio_kvm_device_add_fd(int fd)
>      return 0;
>  }
>  
> -static void vfio_kvm_device_add_group(VFIOGroup *group)
> -{
> -    vfio_kvm_device_add_fd(group->fd);
> -}
> -
>  int vfio_kvm_device_del_fd(int fd)
>  {
>  #ifdef CONFIG_KVM
> @@ -1857,12 +1467,7 @@ int vfio_kvm_device_del_fd(int fd)
>      return 0;
>  }
>  
> -static void vfio_kvm_device_del_group(VFIOGroup *group)
> -{
> -    vfio_kvm_device_del_fd(group->fd);
> -}
> -
> -static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
> +VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
>  {
>      VFIOAddressSpace *space;
>  
> @@ -1886,7 +1491,7 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
>      return space;
>  }
>  
> -static void vfio_put_address_space(VFIOAddressSpace *space)
> +void vfio_put_address_space(VFIOAddressSpace *space)
>  {
>      if (QLIST_EMPTY(&space->containers)) {
>          QLIST_REMOVE(space, list);
> @@ -1897,499 +1502,6 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
>      }
>  }
>  
> -/*
> - * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
> - */
> -static int vfio_get_iommu_type(VFIOContainer *container,
> -                               Error **errp)
> -{
> -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> -                          VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
> -    int i;
> -
> -    for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
> -        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> -            return iommu_types[i];
> -        }
> -    }
> -    error_setg(errp, "No available IOMMU models");
> -    return -EINVAL;
> -}
> -
> -static int vfio_init_container(VFIOContainer *container, int group_fd,
> -                               Error **errp)
> -{
> -    int iommu_type, ret;
> -
> -    iommu_type = vfio_get_iommu_type(container, errp);
> -    if (iommu_type < 0) {
> -        return iommu_type;
> -    }
> -
> -    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
> -    if (ret) {
> -        error_setg_errno(errp, errno, "Failed to set group container");
> -        return -errno;
> -    }
> -
> -    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
> -        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> -            /*
> -             * On sPAPR, despite the IOMMU subdriver always advertises v1 and
> -             * v2, the running platform may not support v2 and there is no
> -             * way to guess it until an IOMMU group gets added to the container.
> -             * So in case it fails with v2, try v1 as a fallback.
> -             */
> -            iommu_type = VFIO_SPAPR_TCE_IOMMU;
> -            continue;
> -        }
> -        error_setg_errno(errp, errno, "Failed to set iommu for container");
> -        return -errno;
> -    }
> -
> -    container->iommu_type = iommu_type;
> -    return 0;
> -}
> -
> -static int vfio_get_iommu_info(VFIOContainer *container,
> -                               struct vfio_iommu_type1_info **info)
> -{
> -
> -    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> -
> -    *info = g_new0(struct vfio_iommu_type1_info, 1);
> -again:
> -    (*info)->argsz = argsz;
> -
> -    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> -        g_free(*info);
> -        *info = NULL;
> -        return -errno;
> -    }
> -
> -    if (((*info)->argsz > argsz)) {
> -        argsz = (*info)->argsz;
> -        *info = g_realloc(*info, argsz);
> -        goto again;
> -    }
> -
> -    return 0;
> -}
> -
> -static struct vfio_info_cap_header *
> -vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> -{
> -    struct vfio_info_cap_header *hdr;
> -    void *ptr = info;
> -
> -    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> -        return NULL;
> -    }
> -
> -    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> -        if (hdr->id == id) {
> -            return hdr;
> -        }
> -    }
> -
> -    return NULL;
> -}
> -
> -static void vfio_get_iommu_info_migration(VFIOContainer *container,
> -                                         struct vfio_iommu_type1_info *info)
> -{
> -    struct vfio_info_cap_header *hdr;
> -    struct vfio_iommu_type1_info_cap_migration *cap_mig;
> -
> -    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
> -    if (!hdr) {
> -        return;
> -    }
> -
> -    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
> -                            header);
> -
> -    /*
> -     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
> -     * qemu_real_host_page_size to mark those dirty.
> -     */
> -    if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
> -        container->dirty_pages_supported = true;
> -        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> -        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
> -    }
> -}
> -
> -static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> -                                  Error **errp)
> -{
> -    VFIOContainer *container;
> -    int ret, fd;
> -    VFIOAddressSpace *space;
> -
> -    space = vfio_get_address_space(as);
> -
> -    /*
> -     * VFIO is currently incompatible with discarding of RAM insofar as the
> -     * madvise to purge (zap) the page from QEMU's address space does not
> -     * interact with the memory API and therefore leaves stale virtual to
> -     * physical mappings in the IOMMU if the page was previously pinned.  We
> -     * therefore set discarding broken for each group added to a container,
> -     * whether the container is used individually or shared.  This provides
> -     * us with options to allow devices within a group to opt-in and allow
> -     * discarding, so long as it is done consistently for a group (for instance
> -     * if the device is an mdev device where it is known that the host vendor
> -     * driver will never pin pages outside of the working set of the guest
> -     * driver, which would thus not be discarding candidates).
> -     *
> -     * The first opportunity to induce pinning occurs here where we attempt to
> -     * attach the group to existing containers within the AddressSpace.  If any
> -     * pages are already zapped from the virtual address space, such as from
> -     * previous discards, new pinning will cause valid mappings to be
> -     * re-established.  Likewise, when the overall MemoryListener for a new
> -     * container is registered, a replay of mappings within the AddressSpace
> -     * will occur, re-establishing any previously zapped pages as well.
> -     *
> -     * Especially virtio-balloon is currently only prevented from discarding
> -     * new memory, it will not yet set ram_block_discard_set_required() and
> -     * therefore, neither stops us here or deals with the sudden memory
> -     * consumption of inflated memory.
> -     *
> -     * We do support discarding of memory coordinated via the RamDiscardManager
> -     * with some IOMMU types. vfio_ram_block_discard_disable() handles the
> -     * details once we know which type of IOMMU we are using.
> -     */
> -
> -    QLIST_FOREACH(container, &space->containers, next) {
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> -            ret = vfio_ram_block_discard_disable(container, true);
> -            if (ret) {
> -                error_setg_errno(errp, -ret,
> -                                 "Cannot set discarding of RAM broken");
> -                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> -                          &container->fd)) {
> -                    error_report("vfio: error disconnecting group %d from"
> -                                 " container", group->groupid);
> -                }
> -                return ret;
> -            }
> -            group->container = container;
> -            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> -            vfio_kvm_device_add_group(group);
> -            return 0;
> -        }
> -    }
> -
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> -    if (fd < 0) {
> -        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
> -        ret = -errno;
> -        goto put_space_exit;
> -    }
> -
> -    ret = ioctl(fd, VFIO_GET_API_VERSION);
> -    if (ret != VFIO_API_VERSION) {
> -        error_setg(errp, "supported vfio version: %d, "
> -                   "reported version: %d", VFIO_API_VERSION, ret);
> -        ret = -EINVAL;
> -        goto close_fd_exit;
> -    }
> -
> -    container = g_malloc0(sizeof(*container));
> -    container->space = space;
> -    container->fd = fd;
> -    container->error = NULL;
> -    container->dirty_pages_supported = false;
> -    container->dma_max_mappings = 0;
> -    QLIST_INIT(&container->giommu_list);
> -    QLIST_INIT(&container->hostwin_list);
> -    QLIST_INIT(&container->vrdl_list);
> -
> -    ret = vfio_init_container(container, group->fd, errp);
> -    if (ret) {
> -        goto free_container_exit;
> -    }
> -
> -    ret = vfio_ram_block_discard_disable(container, true);
> -    if (ret) {
> -        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
> -        goto free_container_exit;
> -    }
> -
> -    switch (container->iommu_type) {
> -    case VFIO_TYPE1v2_IOMMU:
> -    case VFIO_TYPE1_IOMMU:
> -    {
> -        struct vfio_iommu_type1_info *info;
> -
> -        ret = vfio_get_iommu_info(container, &info);
> -        if (ret) {
> -            error_setg_errno(errp, -ret, "Failed to get VFIO IOMMU info");
> -            goto enable_discards_exit;
> -        }
> -
> -        if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
> -            container->pgsizes = info->iova_pgsizes;
> -        } else {
> -            container->pgsizes = qemu_real_host_page_size();
> -        }
> -
> -        if (!vfio_get_info_dma_avail(info, &container->dma_max_mappings)) {
> -            container->dma_max_mappings = 65535;
> -        }
> -        vfio_get_iommu_info_migration(container, info);
> -        g_free(info);
> -
> -        /*
> -         * FIXME: We should parse VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> -         * information to get the actual window extent rather than assume
> -         * a 64-bit IOVA address space.
> -         */
> -        vfio_host_win_add(container, 0, (hwaddr)-1, container->pgsizes);
> -
> -        break;
> -    }
> -    case VFIO_SPAPR_TCE_v2_IOMMU:
> -    case VFIO_SPAPR_TCE_IOMMU:
> -    {
> -        struct vfio_iommu_spapr_tce_info info;
> -        bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU;
> -
> -        /*
> -         * The host kernel code implementing VFIO_IOMMU_DISABLE is called
> -         * when container fd is closed so we do not call it explicitly
> -         * in this file.
> -         */
> -        if (!v2) {
> -            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -            if (ret) {
> -                error_setg_errno(errp, errno, "failed to enable container");
> -                ret = -errno;
> -                goto enable_discards_exit;
> -            }
> -        } else {
> -            container->prereg_listener = vfio_prereg_listener;
> -
> -            memory_listener_register(&container->prereg_listener,
> -                                     &address_space_memory);
> -            if (container->error) {
> -                memory_listener_unregister(&container->prereg_listener);
> -                ret = -1;
> -                error_propagate_prepend(errp, container->error,
> -                    "RAM memory listener initialization failed: ");
> -                goto enable_discards_exit;
> -            }
> -        }
> -
> -        info.argsz = sizeof(info);
> -        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
> -        if (ret) {
> -            error_setg_errno(errp, errno,
> -                             "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
> -            ret = -errno;
> -            if (v2) {
> -                memory_listener_unregister(&container->prereg_listener);
> -            }
> -            goto enable_discards_exit;
> -        }
> -
> -        if (v2) {
> -            container->pgsizes = info.ddw.pgsizes;
> -            /*
> -             * There is a default window in just created container.
> -             * To make region_add/del simpler, we better remove this
> -             * window now and let those iommu_listener callbacks
> -             * create/remove them when needed.
> -             */
> -            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
> -            if (ret) {
> -                error_setg_errno(errp, -ret,
> -                                 "failed to remove existing window");
> -                goto enable_discards_exit;
> -            }
> -        } else {
> -            /* The default table uses 4K pages */
> -            container->pgsizes = 0x1000;
> -            vfio_host_win_add(container, info.dma32_window_start,
> -                              info.dma32_window_start +
> -                              info.dma32_window_size - 1,
> -                              0x1000);
> -        }
> -    }
> -    }
> -
> -    vfio_kvm_device_add_group(group);
> -
> -    QLIST_INIT(&container->group_list);
> -    QLIST_INSERT_HEAD(&space->containers, container, next);
> -
> -    group->container = container;
> -    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> -
> -    container->listener = vfio_memory_listener;
> -
> -    memory_listener_register(&container->listener, container->space->as);
> -
> -    if (container->error) {
> -        ret = -1;
> -        error_propagate_prepend(errp, container->error,
> -            "memory listener initialization failed: ");
> -        goto listener_release_exit;
> -    }
> -
> -    container->initialized = true;
> -
> -    return 0;
> -listener_release_exit:
> -    QLIST_REMOVE(group, container_next);
> -    QLIST_REMOVE(container, next);
> -    vfio_kvm_device_del_group(group);
> -    vfio_listener_release(container);
> -
> -enable_discards_exit:
> -    vfio_ram_block_discard_disable(container, false);
> -
> -free_container_exit:
> -    g_free(container);
> -
> -close_fd_exit:
> -    close(fd);
> -
> -put_space_exit:
> -    vfio_put_address_space(space);
> -
> -    return ret;
> -}
> -
> -static void vfio_disconnect_container(VFIOGroup *group)
> -{
> -    VFIOContainer *container = group->container;
> -
> -    QLIST_REMOVE(group, container_next);
> -    group->container = NULL;
> -
> -    /*
> -     * Explicitly release the listener first before unset container,
> -     * since unset may destroy the backend container if it's the last
> -     * group.
> -     */
> -    if (QLIST_EMPTY(&container->group_list)) {
> -        vfio_listener_release(container);
> -    }
> -
> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> -        error_report("vfio: error disconnecting group %d from container",
> -                     group->groupid);
> -    }
> -
> -    if (QLIST_EMPTY(&container->group_list)) {
> -        VFIOAddressSpace *space = container->space;
> -        VFIOGuestIOMMU *giommu, *tmp;
> -        VFIOHostDMAWindow *hostwin, *next;
> -
> -        QLIST_REMOVE(container, next);
> -
> -        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(
> -                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
> -            QLIST_REMOVE(giommu, giommu_next);
> -            g_free(giommu);
> -        }
> -
> -        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
> -                           next) {
> -            QLIST_REMOVE(hostwin, hostwin_next);
> -            g_free(hostwin);
> -        }
> -
> -        trace_vfio_disconnect_container(container->fd);
> -        close(container->fd);
> -        g_free(container);
> -
> -        vfio_put_address_space(space);
> -    }
> -}
> -
> -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
> -{
> -    VFIOGroup *group;
> -    char path[32];
> -    struct vfio_group_status status = { .argsz = sizeof(status) };
> -
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        if (group->groupid == groupid) {
> -            /* Found it.  Now is it already in the right context? */
> -            if (group->container->space->as == as) {
> -                return group;
> -            } else {
> -                error_setg(errp, "group %d used in multiple address spaces",
> -                           group->groupid);
> -                return NULL;
> -            }
> -        }
> -    }
> -
> -    group = g_malloc0(sizeof(*group));
> -
> -    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> -    if (group->fd < 0) {
> -        error_setg_errno(errp, errno, "failed to open %s", path);
> -        goto free_group_exit;
> -    }
> -
> -    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
> -        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
> -        goto close_fd_exit;
> -    }
> -
> -    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
> -        error_setg(errp, "group %d is not viable", groupid);
> -        error_append_hint(errp,
> -                          "Please ensure all devices within the iommu_group "
> -                          "are bound to their vfio bus driver.\n");
> -        goto close_fd_exit;
> -    }
> -
> -    group->groupid = groupid;
> -    QLIST_INIT(&group->device_list);
> -
> -    if (vfio_connect_container(group, as, errp)) {
> -        error_prepend(errp, "failed to setup container for group %d: ",
> -                      groupid);
> -        goto close_fd_exit;
> -    }
> -
> -    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
> -
> -    return group;
> -
> -close_fd_exit:
> -    close(group->fd);
> -
> -free_group_exit:
> -    g_free(group);
> -
> -    return NULL;
> -}
> -
> -void vfio_put_group(VFIOGroup *group)
> -{
> -    if (!group || !QLIST_EMPTY(&group->device_list)) {
> -        return;
> -    }
> -
> -    if (!group->ram_block_discard_allowed) {
> -        vfio_ram_block_discard_disable(group->container, false);
> -    }
> -    vfio_kvm_device_del_group(group);
> -    vfio_disconnect_container(group);
> -    QLIST_REMOVE(group, next);
> -    trace_vfio_put_group(group->fd);
> -    close(group->fd);
> -    g_free(group);
> -}
> -
>  struct vfio_device_info *vfio_get_device_info(int fd)
>  {
>      struct vfio_device_info *info;
> @@ -2413,168 +1525,3 @@ retry:
>  
>      return info;
>  }
> -
> -int vfio_get_device(VFIOGroup *group, const char *name,
> -                    VFIODevice *vbasedev, Error **errp)
> -{
> -    g_autofree struct vfio_device_info *info = NULL;
> -    int fd;
> -
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> -    if (fd < 0) {
> -        error_setg_errno(errp, errno, "error getting device from group %d",
> -                         group->groupid);
> -        error_append_hint(errp,
> -                      "Verify all devices in group %d are bound to vfio-<bus> "
> -                      "or pci-stub and not already in use\n", group->groupid);
> -        return fd;
> -    }
> -
> -    info = vfio_get_device_info(fd);
> -    if (!info) {
> -        error_setg_errno(errp, errno, "error getting device info");
> -        close(fd);
> -        return -1;
> -    }
> -
> -    /*
> -     * Set discarding of RAM as not broken for this group if the driver knows
> -     * the device operates compatibly with discarding.  Setting must be
> -     * consistent per group, but since compatibility is really only possible
> -     * with mdev currently, we expect singleton groups.
> -     */
> -    if (vbasedev->ram_block_discard_allowed !=
> -        group->ram_block_discard_allowed) {
> -        if (!QLIST_EMPTY(&group->device_list)) {
> -            error_setg(errp, "Inconsistent setting of support for discarding "
> -                       "RAM (e.g., balloon) within group");
> -            close(fd);
> -            return -1;
> -        }
> -
> -        if (!group->ram_block_discard_allowed) {
> -            group->ram_block_discard_allowed = true;
> -            vfio_ram_block_discard_disable(group->container, false);
> -        }
> -    }
> -
> -    vbasedev->fd = fd;
> -    vbasedev->group = group;
> -    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
> -
> -    vbasedev->num_irqs = info->num_irqs;
> -    vbasedev->num_regions = info->num_regions;
> -    vbasedev->flags = info->flags;
> -
> -    trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
> -
> -    vbasedev->reset_works = !!(info->flags & VFIO_DEVICE_FLAGS_RESET);
> -
> -    return 0;
> -}
> -
> -void vfio_put_base_device(VFIODevice *vbasedev)
> -{
> -    if (!vbasedev->group) {
> -        return;
> -    }
> -    QLIST_REMOVE(vbasedev, next);
> -    vbasedev->group = NULL;
> -    trace_vfio_put_base_device(vbasedev->fd);
> -    close(vbasedev->fd);
> -}
> -
> -/*
> - * Interfaces for IBM EEH (Enhanced Error Handling)
> - */
> -static bool vfio_eeh_container_ok(VFIOContainer *container)
> -{
> -    /*
> -     * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
> -     * implementation is broken if there are multiple groups in a
> -     * container.  The hardware works in units of Partitionable
> -     * Endpoints (== IOMMU groups) and the EEH operations naively
> -     * iterate across all groups in the container, without any logic
> -     * to make sure the groups have their state synchronized.  For
> -     * certain operations (ENABLE) that might be ok, until an error
> -     * occurs, but for others (GET_STATE) it's clearly broken.
> -     */
> -
> -    /*
> -     * XXX Once fixed kernels exist, test for them here
> -     */
> -
> -    if (QLIST_EMPTY(&container->group_list)) {
> -        return false;
> -    }
> -
> -    if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) {
> -        return false;
> -    }
> -
> -    return true;
> -}
> -
> -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
> -{
> -    struct vfio_eeh_pe_op pe_op = {
> -        .argsz = sizeof(pe_op),
> -        .op = op,
> -    };
> -    int ret;
> -
> -    if (!vfio_eeh_container_ok(container)) {
> -        error_report("vfio/eeh: EEH_PE_OP 0x%x: "
> -                     "kernel requires a container with exactly one group", op);
> -        return -EPERM;
> -    }
> -
> -    ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op);
> -    if (ret < 0) {
> -        error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op);
> -        return -errno;
> -    }
> -
> -    return ret;
> -}
> -
> -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
> -{
> -    VFIOAddressSpace *space = vfio_get_address_space(as);
> -    VFIOContainer *container = NULL;
> -
> -    if (QLIST_EMPTY(&space->containers)) {
> -        /* No containers to act on */
> -        goto out;
> -    }
> -
> -    container = QLIST_FIRST(&space->containers);
> -
> -    if (QLIST_NEXT(container, next)) {
> -        /* We don't yet have logic to synchronize EEH state across
> -         * multiple containers */
> -        container = NULL;
> -        goto out;
> -    }
> -
> -out:
> -    vfio_put_address_space(space);
> -    return container;
> -}
> -
> -bool vfio_eeh_as_ok(AddressSpace *as)
> -{
> -    VFIOContainer *container = vfio_eeh_as_container(as);
> -
> -    return (container != NULL) && vfio_eeh_container_ok(container);
> -}
> -
> -int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
> -{
> -    VFIOContainer *container = vfio_eeh_as_container(as);
> -
> -    if (!container) {
> -        return -ENODEV;
> -    }
> -    return vfio_eeh_container_op(container, op);
> -}
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> new file mode 100644
> index 0000000000..175cdbbdff
> --- /dev/null
> +++ b/hw/vfio/container.c
> @@ -0,0 +1,1085 @@
> +/*
> + * generic functions used by VFIO devices
> + *
> + * Copyright Red Hat, Inc. 2012
> + *
> + * Authors:
> + *  Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Based on qemu-kvm device-assignment:
> + *  Adapted for KVM by Qumranet.
> + *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
> + *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
> + *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
> + *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
> + *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#ifdef CONFIG_KVM
> +#include <linux/kvm.h>
> +#endif
> +#include <linux/vfio.h>
> +
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/vfio/vfio.h"
> +#include "exec/address-spaces.h"
> +#include "exec/memory.h"
> +#include "exec/ram_addr.h"
> +#include "hw/hw.h"
> +#include "qemu/error-report.h"
> +#include "qemu/range.h"
> +#include "sysemu/kvm.h"
> +#include "sysemu/reset.h"
> +#include "trace.h"
> +#include "qapi/error.h"
> +#include "migration/migration.h"
> +
> +VFIOGroupList vfio_group_list =
> +    QLIST_HEAD_INITIALIZER(vfio_group_list);
> +
> +static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
> +{
> +    switch (container->iommu_type) {
> +    case VFIO_TYPE1v2_IOMMU:
> +    case VFIO_TYPE1_IOMMU:
> +        /*
> +         * We support coordinated discarding of RAM via the RamDiscardManager.
> +         */
> +        return ram_block_uncoordinated_discard_disable(state);
> +    default:
> +        /*
> +         * VFIO_SPAPR_TCE_IOMMU most probably works just fine with
> +         * RamDiscardManager, however, it is completely untested.
> +         *
> +         * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does
> +         * completely the opposite of managing mapping/pinning dynamically as
> +         * required by RamDiscardManager. We would have to special-case sections
> +         * with a RamDiscardManager.
> +         */
> +        return ram_block_discard_disable(state);
> +    }
> +}
> +
> +VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> +                                         VFIODevice *curr)
> +{
> +    VFIOGroup *group;
> +
> +    if (!curr) {
> +        group = QLIST_FIRST(&container->group_list);
> +    } else {
> +        if (curr->next.le_next) {
> +            return curr->next.le_next;
> +        }
> +        group = curr->group->container_next.le_next;
> +    }
> +
> +    if (!group) {
> +        return NULL;
> +    }
> +    return QLIST_FIRST(&group->device_list);
> +}
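
vfio_container_dev_iter_next() is new in this series and flattens the per-group device lists into a single walk over the container. A minimal caller sketch, assuming the start-from-NULL / stop-at-NULL convention the *_all_* helpers above already rely on:

    /* Sketch: visit every VFIODevice attached to the container, whichever
     * VFIOGroup it belongs to.  Passing curr == NULL yields the first
     * device; a NULL return marks the end of the walk. */
    VFIODevice *vbasedev = NULL;

    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
        /* e.g. inspect vbasedev->dirty_pages_supported or vbasedev->migration */
    }
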
> +
> +static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> +                                 hwaddr iova, ram_addr_t size,
> +                                 IOMMUTLBEntry *iotlb)
> +{
> +    struct vfio_iommu_type1_dma_unmap *unmap;
> +    struct vfio_bitmap *bitmap;
> +    VFIOBitmap vbmap;
> +    int ret;
> +
> +    ret = vfio_bitmap_alloc(&vbmap, size);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
> +
> +    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
> +    unmap->iova = iova;
> +    unmap->size = size;
> +    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
> +    bitmap = (struct vfio_bitmap *)&unmap->data;
> +
> +    /*
> +     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
> +     * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
> +     * to qemu_real_host_page_size.
> +     */
> +    bitmap->pgsize = qemu_real_host_page_size();
> +    bitmap->size = vbmap.size;
> +    bitmap->data = (__u64 *)vbmap.bitmap;
> +
> +    if (vbmap.size > container->max_dirty_bitmap_size) {
> +        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap.size);
> +        ret = -E2BIG;
> +        goto unmap_exit;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
> +    if (!ret) {
> +        cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap,
> +                iotlb->translated_addr, vbmap.pages);
> +    } else {
> +        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
> +    }
> +
> +unmap_exit:
> +    g_free(unmap);
> +    g_free(vbmap.bitmap);
> +
> +    return ret;
> +}
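
For readers unfamiliar with the sizing here, a worked example of the bitmap attached to the unmap request (assuming a 4 KiB qemu_real_host_page_size(), one bit per host page):

    /* Example, assuming 4 KiB host pages:
     *   unmapping size = 1 GiB  ->  1 GiB / 4 KiB = 262144 pages
     *   vbmap.pages    = 262144
     *   vbmap.size     = 262144 bits / 8 = 32 KiB of bitmap data,
     * which must not exceed container->max_dirty_bitmap_size, otherwise the
     * request is rejected with -E2BIG as in the check above. */
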
> +
> +/*
> + * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
> + */
> +int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
> +                   ram_addr_t size, IOMMUTLBEntry *iotlb)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = 0,
> +        .iova = iova,
> +        .size = size,
> +    };
> +    bool need_dirty_sync = false;
> +    int ret;
> +
> +    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
> +        if (!vfio_devices_all_device_dirty_tracking(container) &&
> +            container->dirty_pages_supported) {
> +            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> +        }
> +
> +        need_dirty_sync = true;
> +    }
> +
> +    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        /*
> +         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> +         * v4.15) where an overflow in its wrap-around check prevents us from
> +         * unmapping the last page of the address space.  Test for the error
> +         * condition and re-try the unmap excluding the last page.  The
> +         * expectation is that we've never mapped the last page anyway and this
> +         * unmap request comes via vIOMMU support which also makes it unlikely
> +         * that this page is used.  This bug was introduced well after type1 v2
> +         * support was introduced, so we shouldn't need to test for v1.  A fix
> +         * is queued for kernel v5.0 so this workaround can be removed once
> +         * affected kernels are sufficiently deprecated.
> +         */
> +        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
> +            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
> +            trace_vfio_dma_unmap_overflow_workaround();
> +            unmap.size -= 1ULL << ctz64(container->pgsizes);
> +            continue;
> +        }
> +        error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
> +        return -errno;
> +    }
> +
> +    if (need_dirty_sync) {
> +        ret = vfio_get_dirty_bitmap(container, iova, size,
> +                                    iotlb->translated_addr);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}
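
A concrete instance of the wrap-around retry in the loop above, with illustrative numbers (assuming VFIO_TYPE1v2_IOMMU and a smallest supported page size of 4 KiB, i.e. 1ULL << ctz64(container->pgsizes) == 0x1000):

    /* Illustration only:
     *   unmap.iova = 0xffffffff00000000, unmap.size = 0x100000000
     *   iova + size wraps to 0          -> affected kernels return EINVAL
     *   retry with size = 0xfffff000    -> everything except the very last
     *                                      4 KiB page gets unmapped
     */
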
> +
> +int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> +                 ram_addr_t size, void *vaddr, bool readonly)
> +{
> +    struct vfio_iommu_type1_dma_map map = {
> +        .argsz = sizeof(map),
> +        .flags = VFIO_DMA_MAP_FLAG_READ,
> +        .vaddr = (__u64)(uintptr_t)vaddr,
> +        .iova = iova,
> +        .size = size,
> +    };
> +
> +    if (!readonly) {
> +        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
> +    }
> +
> +    /*
> +     * Try the mapping, if it fails with EBUSY, unmap the region and try
> +     * again.  This shouldn't be necessary, but we sometimes see it in
> +     * the VGA ROM space.
> +     */
> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> +        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
> +         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
> +        return 0;
> +    }
> +
> +    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> +    return -errno;
> +}
> +
> +int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
> +{
> +    int ret;
> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
> +        .argsz = sizeof(dirty),
> +    };
> +
> +    if (!container->dirty_pages_supported) {
> +        return 0;
> +    }
> +
> +    if (start) {
> +        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
> +    } else {
> +        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
> +    if (ret) {
> +        ret = -errno;
> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
> +                     dirty.flags, errno);
> +    }
> +
> +    return ret;
> +}
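
Worth keeping in mind when reading on: the same VFIO_IOMMU_DIRTY_PAGES ioctl is used for three operations, selected by the flags field:

    /* dirty.flags selects the operation:
     *   VFIO_IOMMU_DIRTY_PAGES_FLAG_START       - begin tracking (above)
     *   VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP        - end tracking (above)
     *   VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP  - read a bitmap back
     *                                             (vfio_query_dirty_bitmap() below)
     */
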
> +
> +int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
> +                            hwaddr iova, hwaddr size)
> +{
> +    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
> +    struct vfio_iommu_type1_dirty_bitmap_get *range;
> +    int ret;
> +
> +    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
> +
> +    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
> +    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
> +    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
> +    range->iova = iova;
> +    range->size = size;
> +
> +    /*
> +     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
> +     * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize
> +     * to qemu_real_host_page_size.
> +     */
> +    range->bitmap.pgsize = qemu_real_host_page_size();
> +    range->bitmap.size = vbmap->size;
> +    range->bitmap.data = (__u64 *)vbmap->bitmap;
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
> +    if (ret) {
> +        ret = -errno;
> +        error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
> +                " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
> +                (uint64_t)range->size, errno);
> +    }
> +
> +    g_free(dbitmap);
> +
> +    return ret;
> +}
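
A minimal sketch of the expected calling pattern, modelled on vfio_dma_unmap_bitmap() above (error handling trimmed; the in-tree consumer is presumably vfio_get_dirty_bitmap(), whose body is not part of this hunk):

    /* Sketch: query the container's dirty bitmap for [iova, iova + size) and
     * fold it into QEMU's dirty memory tracking at ram_addr. */
    VFIOBitmap vbmap;
    int ret;

    ret = vfio_bitmap_alloc(&vbmap, size);
    if (ret) {
        return ret;
    }

    ret = vfio_query_dirty_bitmap(container, &vbmap, iova, size);
    if (!ret) {
        cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr,
                                               vbmap.pages);
    }
    g_free(vbmap.bitmap);

    return ret;
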
> +
> +static void vfio_listener_release(VFIOContainer *container)
> +{
> +    memory_listener_unregister(&container->listener);
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +        memory_listener_unregister(&container->prereg_listener);
> +    }
> +}
> +
> +int vfio_container_add_section_window(VFIOContainer *container,
> +                                      MemoryRegionSection *section,
> +                                      Error **errp)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +    hwaddr pgsize = 0;
> +    int ret;
> +
> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
> +        return 0;
> +    }
> +
> +    /* For now intersections are not allowed, we may relax this later */
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (ranges_overlap(hostwin->min_iova,
> +                           hostwin->max_iova - hostwin->min_iova + 1,
> +                           section->offset_within_address_space,
> +                           int128_get64(section->size))) {
> +            error_setg(errp,
> +                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
> +                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                    int128_get64(section->size) - 1,
> +                hostwin->min_iova, hostwin->max_iova);
> +            return -EINVAL;
> +        }
> +    }
> +
> +    ret = vfio_spapr_create_window(container, section, &pgsize);
> +    if (ret) {
> +        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
> +        return ret;
> +    }
> +
> +    vfio_host_win_add(container, section->offset_within_address_space,
> +                      section->offset_within_address_space +
> +                      int128_get64(section->size) - 1, pgsize);
> +#ifdef CONFIG_KVM
> +    if (kvm_enabled()) {
> +        VFIOGroup *group;
> +        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
> +        struct kvm_vfio_spapr_tce param;
> +        struct kvm_device_attr attr = {
> +            .group = KVM_DEV_VFIO_GROUP,
> +            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
> +            .addr = (uint64_t)(unsigned long)&param,
> +        };
> +
> +        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
> +                                          &param.tablefd)) {
> +            QLIST_FOREACH(group, &container->group_list, container_next) {
> +                param.groupfd = group->fd;
> +                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> +                    error_report("vfio: failed to setup fd %d "
> +                                 "for a group with fd %d: %s",
> +                                 param.tablefd, param.groupfd,
> +                                 strerror(errno));
> +                    return 0;
> +                }
> +                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
> +            }
> +        }
> +    }
> +#endif
> +    return 0;
> +}
> +
> +void vfio_container_del_section_window(VFIOContainer *container,
> +                                       MemoryRegionSection *section)
> +{
> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
> +        return;
> +    }
> +
> +    vfio_spapr_remove_window(container,
> +                             section->offset_within_address_space);
> +    if (vfio_host_win_del(container,
> +                          section->offset_within_address_space,
> +                          section->offset_within_address_space +
> +                          int128_get64(section->size) - 1) < 0) {
> +        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> +                 __func__, section->offset_within_address_space);
> +    }
> +}
> +
> +static struct vfio_info_cap_header *
> +vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> +{
> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> +        return NULL;
> +    }
> +
> +    return vfio_get_cap((void *)info, info->cap_offset, id);
> +}
> +
> +bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
> +                             unsigned int *avail)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_iommu_type1_info_dma_avail *cap;
> +
> +    /* If the capability cannot be found, assume no DMA limiting */
> +    hdr = vfio_get_iommu_type1_info_cap(info,
> +                                        VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL);
> +    if (hdr == NULL) {
> +        return false;
> +    }
> +
> +    if (avail != NULL) {
> +        cap = (void *) hdr;
> +        *avail = cap->avail;
> +    }
> +
> +    return true;
> +}
> +
> +static void vfio_kvm_device_add_group(VFIOGroup *group)
> +{
> +    vfio_kvm_device_add_fd(group->fd);
> +}
> +
> +static void vfio_kvm_device_del_group(VFIOGroup *group)
> +{
> +    vfio_kvm_device_del_fd(group->fd);
> +}
> +
> +/*
> + * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
> + */
> +static int vfio_get_iommu_type(VFIOContainer *container,
> +                               Error **errp)
> +{
> +    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> +                          VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
> +    int i;
> +
> +    for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
> +        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> +            return iommu_types[i];
> +        }
> +    }
> +    error_setg(errp, "No available IOMMU models");
> +    return -EINVAL;
> +}
> +
> +static int vfio_init_container(VFIOContainer *container, int group_fd,
> +                               Error **errp)
> +{
> +    int iommu_type, ret;
> +
> +    iommu_type = vfio_get_iommu_type(container, errp);
> +    if (iommu_type < 0) {
> +        return iommu_type;
> +    }
> +
> +    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
> +    if (ret) {
> +        error_setg_errno(errp, errno, "Failed to set group container");
> +        return -errno;
> +    }
> +
> +    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
> +        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +            /*
> +             * On sPAPR, although the IOMMU subdriver always advertises v1
> +             * and v2, the running platform may not support v2 and there is
> +             * no way to tell until an IOMMU group gets added to the
> +             * container.  So if setting v2 fails, try v1 as a fallback.
> +             */
> +            iommu_type = VFIO_SPAPR_TCE_IOMMU;
> +            continue;
> +        }
> +        error_setg_errno(errp, errno, "Failed to set iommu for container");
> +        return -errno;
> +    }
> +
> +    container->iommu_type = iommu_type;
> +    return 0;
> +}
> +
> +static int vfio_get_iommu_info(VFIOContainer *container,
> +                               struct vfio_iommu_type1_info **info)
> +{
> +
> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> +
> +    *info = g_new0(struct vfio_iommu_type1_info, 1);
> +again:
> +    (*info)->argsz = argsz;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> +        g_free(*info);
> +        *info = NULL;
> +        return -errno;
> +    }
> +
> +    if (((*info)->argsz > argsz)) {
> +        argsz = (*info)->argsz;
> +        *info = g_realloc(*info, argsz);
> +        goto again;
> +    }
> +
> +    return 0;
> +}
> +
> +static struct vfio_info_cap_header *
> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    void *ptr = info;
> +
> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> +        return NULL;
> +    }
> +
> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> +        if (hdr->id == id) {
> +            return hdr;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static void vfio_get_iommu_info_migration(VFIOContainer *container,
> +                                         struct vfio_iommu_type1_info *info)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_iommu_type1_info_cap_migration *cap_mig;
> +
> +    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
> +    if (!hdr) {
> +        return;
> +    }
> +
> +    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
> +                            header);
> +
> +    /*
> +     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
> +     * qemu_real_host_page_size to mark those dirty.
> +     */
> +    if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
> +        container->dirty_pages_supported = true;
> +        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> +        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
> +    }
> +}
> +
> +static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> +                                  Error **errp)
> +{
> +    VFIOContainer *container;
> +    int ret, fd;
> +    VFIOAddressSpace *space;
> +
> +    space = vfio_get_address_space(as);
> +
> +    /*
> +     * VFIO is currently incompatible with discarding of RAM insofar as the
> +     * madvise to purge (zap) the page from QEMU's address space does not
> +     * interact with the memory API and therefore leaves stale virtual to
> +     * physical mappings in the IOMMU if the page was previously pinned.  We
> +     * therefore set discarding broken for each group added to a container,
> +     * whether the container is used individually or shared.  This provides
> +     * us with options to allow devices within a group to opt-in and allow
> +     * discarding, so long as it is done consistently for a group (for instance
> +     * if the device is an mdev device where it is known that the host vendor
> +     * driver will never pin pages outside of the working set of the guest
> +     * driver, which would thus not be discarding candidates).
> +     *
> +     * The first opportunity to induce pinning occurs here where we attempt to
> +     * attach the group to existing containers within the AddressSpace.  If any
> +     * pages are already zapped from the virtual address space, such as from
> +     * previous discards, new pinning will cause valid mappings to be
> +     * re-established.  Likewise, when the overall MemoryListener for a new
> +     * container is registered, a replay of mappings within the AddressSpace
> +     * will occur, re-establishing any previously zapped pages as well.
> +     *
> +     * In particular, virtio-balloon is currently only prevented from
> +     * discarding new memory; it does not yet set
> +     * ram_block_discard_set_required() and therefore neither stops us here
> +     * nor deals with the sudden memory consumption of inflated memory.
> +     *
> +     * We do support discarding of memory coordinated via the RamDiscardManager
> +     * with some IOMMU types. vfio_ram_block_discard_disable() handles the
> +     * details once we know which type of IOMMU we are using.
> +     */
> +
> +    QLIST_FOREACH(container, &space->containers, next) {
> +        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +            ret = vfio_ram_block_discard_disable(container, true);
> +            if (ret) {
> +                error_setg_errno(errp, -ret,
> +                                 "Cannot set discarding of RAM broken");
> +                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> +                          &container->fd)) {
> +                    error_report("vfio: error disconnecting group %d from"
> +                                 " container", group->groupid);
> +                }
> +                return ret;
> +            }
> +            group->container = container;
> +            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +            vfio_kvm_device_add_group(group);
> +            return 0;
> +        }
> +    }
> +
> +    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
> +        ret = -errno;
> +        goto put_space_exit;
> +    }
> +
> +    ret = ioctl(fd, VFIO_GET_API_VERSION);
> +    if (ret != VFIO_API_VERSION) {
> +        error_setg(errp, "supported vfio version: %d, "
> +                   "reported version: %d", VFIO_API_VERSION, ret);
> +        ret = -EINVAL;
> +        goto close_fd_exit;
> +    }
> +
> +    container = g_malloc0(sizeof(*container));
> +    container->space = space;
> +    container->fd = fd;
> +    container->error = NULL;
> +    container->dirty_pages_supported = false;
> +    container->dma_max_mappings = 0;
> +    QLIST_INIT(&container->giommu_list);
> +    QLIST_INIT(&container->hostwin_list);
> +    QLIST_INIT(&container->vrdl_list);
> +
> +    ret = vfio_init_container(container, group->fd, errp);
> +    if (ret) {
> +        goto free_container_exit;
> +    }
> +
> +    ret = vfio_ram_block_discard_disable(container, true);
> +    if (ret) {
> +        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
> +        goto free_container_exit;
> +    }
> +
> +    switch (container->iommu_type) {
> +    case VFIO_TYPE1v2_IOMMU:
> +    case VFIO_TYPE1_IOMMU:
> +    {
> +        struct vfio_iommu_type1_info *info;
> +
> +        ret = vfio_get_iommu_info(container, &info);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to get VFIO IOMMU info");
> +            goto enable_discards_exit;
> +        }
> +
> +        if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
> +            container->pgsizes = info->iova_pgsizes;
> +        } else {
> +            container->pgsizes = qemu_real_host_page_size();
> +        }
> +
> +        if (!vfio_get_info_dma_avail(info, &container->dma_max_mappings)) {
> +            container->dma_max_mappings = 65535;
> +        }
> +        vfio_get_iommu_info_migration(container, info);
> +        g_free(info);
> +
> +        /*
> +         * FIXME: We should parse VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> +         * information to get the actual window extent rather than assume
> +         * a 64-bit IOVA address space.
> +         */
> +        vfio_host_win_add(container, 0, (hwaddr)-1, container->pgsizes);
> +
> +        break;
> +    }
> +    case VFIO_SPAPR_TCE_v2_IOMMU:
> +    case VFIO_SPAPR_TCE_IOMMU:
> +    {
> +        struct vfio_iommu_spapr_tce_info info;
> +        bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU;
> +
> +        /*
> +         * The host kernel code implementing VFIO_IOMMU_DISABLE is called
> +         * when container fd is closed so we do not call it explicitly
> +         * in this file.
> +         */
> +        if (!v2) {
> +            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +            if (ret) {
> +                error_setg_errno(errp, errno, "failed to enable container");
> +                ret = -errno;
> +                goto enable_discards_exit;
> +            }
> +        } else {
> +            container->prereg_listener = vfio_prereg_listener;
> +
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                memory_listener_unregister(&container->prereg_listener);
> +                ret = -1;
> +                error_propagate_prepend(errp, container->error,
> +                    "RAM memory listener initialization failed: ");
> +                goto enable_discards_exit;
> +            }
> +        }
> +
> +        info.argsz = sizeof(info);
> +        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
> +        if (ret) {
> +            error_setg_errno(errp, errno,
> +                             "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
> +            ret = -errno;
> +            if (v2) {
> +                memory_listener_unregister(&container->prereg_listener);
> +            }
> +            goto enable_discards_exit;
> +        }
> +
> +        if (v2) {
> +            container->pgsizes = info.ddw.pgsizes;
> +            /*
> +             * There is a default window in just created container.
> +             * To make region_add/del simpler, we better remove this
> +             * window now and let those iommu_listener callbacks
> +             * create/remove them when needed.
> +             */
> +            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
> +            if (ret) {
> +                error_setg_errno(errp, -ret,
> +                                 "failed to remove existing window");
> +                goto enable_discards_exit;
> +            }
> +        } else {
> +            /* The default table uses 4K pages */
> +            container->pgsizes = 0x1000;
> +            vfio_host_win_add(container, info.dma32_window_start,
> +                              info.dma32_window_start +
> +                              info.dma32_window_size - 1,
> +                              0x1000);
> +        }
> +    }
> +    }
> +
> +    vfio_kvm_device_add_group(group);
> +
> +    QLIST_INIT(&container->group_list);
> +    QLIST_INSERT_HEAD(&space->containers, container, next);
> +
> +    group->container = container;
> +    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +
> +    container->listener = vfio_memory_listener;
> +
> +    memory_listener_register(&container->listener, container->space->as);
> +
> +    if (container->error) {
> +        ret = -1;
> +        error_propagate_prepend(errp, container->error,
> +            "memory listener initialization failed: ");
> +        goto listener_release_exit;
> +    }
> +
> +    container->initialized = true;
> +
> +    return 0;
> +listener_release_exit:
> +    QLIST_REMOVE(group, container_next);
> +    QLIST_REMOVE(container, next);
> +    vfio_kvm_device_del_group(group);
> +    vfio_listener_release(container);
> +
> +enable_discards_exit:
> +    vfio_ram_block_discard_disable(container, false);
> +
> +free_container_exit:
> +    g_free(container);
> +
> +close_fd_exit:
> +    close(fd);
> +
> +put_space_exit:
> +    vfio_put_address_space(space);
> +
> +    return ret;
> +}
> +
> +static void vfio_disconnect_container(VFIOGroup *group)
> +{
> +    VFIOContainer *container = group->container;
> +
> +    QLIST_REMOVE(group, container_next);
> +    group->container = NULL;
> +
> +    /*
> +     * Explicitly release the listener first before unset container,
> +     * since unset may destroy the backend container if it's the last
> +     * group.
> +     */
> +    if (QLIST_EMPTY(&container->group_list)) {
> +        vfio_listener_release(container);
> +    }
> +
> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> +        error_report("vfio: error disconnecting group %d from container",
> +                     group->groupid);
> +    }
> +
> +    if (QLIST_EMPTY(&container->group_list)) {
> +        VFIOAddressSpace *space = container->space;
> +        VFIOGuestIOMMU *giommu, *tmp;
> +        VFIOHostDMAWindow *hostwin, *next;
> +
> +        QLIST_REMOVE(container, next);
> +
> +        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> +            memory_region_unregister_iommu_notifier(
> +                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
> +            QLIST_REMOVE(giommu, giommu_next);
> +            g_free(giommu);
> +        }
> +
> +        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
> +                           next) {
> +            QLIST_REMOVE(hostwin, hostwin_next);
> +            g_free(hostwin);
> +        }
> +
> +        trace_vfio_disconnect_container(container->fd);
> +        close(container->fd);
> +        g_free(container);
> +
> +        vfio_put_address_space(space);
> +    }
> +}
> +
> +VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
> +{
> +    VFIOGroup *group;
> +    char path[32];
> +    struct vfio_group_status status = { .argsz = sizeof(status) };
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        if (group->groupid == groupid) {
> +            /* Found it.  Now is it already in the right context? */
> +            if (group->container->space->as == as) {
> +                return group;
> +            } else {
> +                error_setg(errp, "group %d used in multiple address spaces",
> +                           group->groupid);
> +                return NULL;
> +            }
> +        }
> +    }
> +
> +    group = g_malloc0(sizeof(*group));
> +
> +    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> +    group->fd = qemu_open_old(path, O_RDWR);
> +    if (group->fd < 0) {
> +        error_setg_errno(errp, errno, "failed to open %s", path);
> +        goto free_group_exit;
> +    }
> +
> +    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
> +        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
> +        goto close_fd_exit;
> +    }
> +
> +    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
> +        error_setg(errp, "group %d is not viable", groupid);
> +        error_append_hint(errp,
> +                          "Please ensure all devices within the iommu_group "
> +                          "are bound to their vfio bus driver.\n");
> +        goto close_fd_exit;
> +    }
> +
> +    group->groupid = groupid;
> +    QLIST_INIT(&group->device_list);
> +
> +    if (vfio_connect_container(group, as, errp)) {
> +        error_prepend(errp, "failed to setup container for group %d: ",
> +                      groupid);
> +        goto close_fd_exit;
> +    }
> +
> +    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
> +
> +    return group;
> +
> +close_fd_exit:
> +    close(group->fd);
> +
> +free_group_exit:
> +    g_free(group);
> +
> +    return NULL;
> +}
> +
> +void vfio_put_group(VFIOGroup *group)
> +{
> +    if (!group || !QLIST_EMPTY(&group->device_list)) {
> +        return;
> +    }
> +
> +    if (!group->ram_block_discard_allowed) {
> +        vfio_ram_block_discard_disable(group->container, false);
> +    }
> +    vfio_kvm_device_del_group(group);
> +    vfio_disconnect_container(group);
> +    QLIST_REMOVE(group, next);
> +    trace_vfio_put_group(group->fd);
> +    close(group->fd);
> +    g_free(group);
> +}
> +
> +int vfio_get_device(VFIOGroup *group, const char *name,
> +                    VFIODevice *vbasedev, Error **errp)
> +{
> +    g_autofree struct vfio_device_info *info = NULL;
> +    int fd;
> +
> +    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno, "error getting device from group %d",
> +                         group->groupid);
> +        error_append_hint(errp,
> +                      "Verify all devices in group %d are bound to vfio-<bus> "
> +                      "or pci-stub and not already in use\n", group->groupid);
> +        return fd;
> +    }
> +
> +    info = vfio_get_device_info(fd);
> +    if (!info) {
> +        error_setg_errno(errp, errno, "error getting device info");
> +        close(fd);
> +        return -1;
> +    }
> +
> +    /*
> +     * Set discarding of RAM as not broken for this group if the driver knows
> +     * the device operates compatibly with discarding.  Setting must be
> +     * consistent per group, but since compatibility is really only possible
> +     * with mdev currently, we expect singleton groups.
> +     */
> +    if (vbasedev->ram_block_discard_allowed !=
> +        group->ram_block_discard_allowed) {
> +        if (!QLIST_EMPTY(&group->device_list)) {
> +            error_setg(errp, "Inconsistent setting of support for discarding "
> +                       "RAM (e.g., balloon) within group");
> +            close(fd);
> +            return -1;
> +        }
> +
> +        if (!group->ram_block_discard_allowed) {
> +            group->ram_block_discard_allowed = true;
> +            vfio_ram_block_discard_disable(group->container, false);
> +        }
> +    }
> +
> +    vbasedev->fd = fd;
> +    vbasedev->group = group;
> +    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
> +
> +    vbasedev->num_irqs = info->num_irqs;
> +    vbasedev->num_regions = info->num_regions;
> +    vbasedev->flags = info->flags;
> +
> +    trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
> +
> +    vbasedev->reset_works = !!(info->flags & VFIO_DEVICE_FLAGS_RESET);
> +
> +    return 0;
> +}
> +
> +void vfio_put_base_device(VFIODevice *vbasedev)
> +{
> +    if (!vbasedev->group) {
> +        return;
> +    }
> +    QLIST_REMOVE(vbasedev, next);
> +    vbasedev->group = NULL;
> +    trace_vfio_put_base_device(vbasedev->fd);
> +    close(vbasedev->fd);
> +}
> +
> +/*
> + * Interfaces for IBM EEH (Enhanced Error Handling)
> + */
> +static bool vfio_eeh_container_ok(VFIOContainer *container)
> +{
> +    /*
> +     * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
> +     * implementation is broken if there are multiple groups in a
> +     * container.  The hardware works in units of Partitionable
> +     * Endpoints (== IOMMU groups) and the EEH operations naively
> +     * iterate across all groups in the container, without any logic
> +     * to make sure the groups have their state synchronized.  For
> +     * certain operations (ENABLE) that might be ok, until an error
> +     * occurs, but for others (GET_STATE) it's clearly broken.
> +     */
> +
> +    /*
> +     * XXX Once fixed kernels exist, test for them here
> +     */
> +
> +    if (QLIST_EMPTY(&container->group_list)) {
> +        return false;
> +    }
> +
> +    if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) {
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
> +{
> +    struct vfio_eeh_pe_op pe_op = {
> +        .argsz = sizeof(pe_op),
> +        .op = op,
> +    };
> +    int ret;
> +
> +    if (!vfio_eeh_container_ok(container)) {
> +        error_report("vfio/eeh: EEH_PE_OP 0x%x: "
> +                     "kernel requires a container with exactly one group", op);
> +        return -EPERM;
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op);
> +    if (ret < 0) {
> +        error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op);
> +        return -errno;
> +    }
> +
> +    return ret;
> +}
> +
> +static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
> +{
> +    VFIOAddressSpace *space = vfio_get_address_space(as);
> +    VFIOContainer *container = NULL;
> +
> +    if (QLIST_EMPTY(&space->containers)) {
> +        /* No containers to act on */
> +        goto out;
> +    }
> +
> +    container = QLIST_FIRST(&space->containers);
> +
> +    if (QLIST_NEXT(container, next)) {
> +        /*
> +         * We don't yet have logic to synchronize EEH state across
> +         * multiple containers
> +         */
> +        container = NULL;
> +        goto out;
> +    }
> +
> +out:
> +    vfio_put_address_space(space);
> +    return container;
> +}
> +
> +bool vfio_eeh_as_ok(AddressSpace *as)
> +{
> +    VFIOContainer *container = vfio_eeh_as_container(as);
> +
> +    return (container != NULL) && vfio_eeh_container_ok(container);
> +}
> +
> +int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
> +{
> +    VFIOContainer *container = vfio_eeh_as_container(as);
> +
> +    if (!container) {
> +        return -ENODEV;
> +    }
> +    return vfio_eeh_container_op(container, op);
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index 3746c9f984..2a6912c940 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>  vfio_ss.add(files(
>    'helpers.c',
>    'common.c',
> +  'container.c',
>    'spapr.c',
>    'migration.c',
>  ))
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 598c3ce079..bb7f9fe9c4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -33,6 +33,8 @@
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> +extern const MemoryListener vfio_memory_listener;
> +
>  enum {
>      VFIO_DEVICE_TYPE_PCI = 0,
>      VFIO_DEVICE_TYPE_PLATFORM = 1,
> @@ -196,6 +198,38 @@ typedef struct VFIODisplay {
>      } dmabuf;
>  } VFIODisplay;
>  
> +typedef struct {
> +    unsigned long *bitmap;
> +    hwaddr size;
> +    hwaddr pages;
> +} VFIOBitmap;
> +
> +void vfio_host_win_add(VFIOContainer *container,
> +                       hwaddr min_iova, hwaddr max_iova,
> +                       uint64_t iova_pgsizes);
> +int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
> +                      hwaddr max_iova);
> +VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
> +void vfio_put_address_space(VFIOAddressSpace *space);
> +bool vfio_devices_all_running_and_saving(VFIOContainer *container);
> +
> +/* container->fd */
> +VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> +                                         VFIODevice *curr);
> +int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
> +                   ram_addr_t size, IOMMUTLBEntry *iotlb);
> +int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> +                 ram_addr_t size, void *vaddr, bool readonly);
> +int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
> +int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
> +                            hwaddr iova, hwaddr size);
> +
> +int vfio_container_add_section_window(VFIOContainer *container,
> +                                      MemoryRegionSection *section,
> +                                      Error **errp);
> +void vfio_container_del_section_window(VFIOContainer *container,
> +                                       MemoryRegionSection *section);
> +
>  void vfio_put_base_device(VFIODevice *vbasedev);
>  void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>  void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
> @@ -220,6 +254,8 @@ struct vfio_device_info *vfio_get_device_info(int fd);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
>  
> +extern int vfio_kvm_device_fd;
> +
>  int vfio_kvm_device_add_fd(int fd);
>  int vfio_kvm_device_del_fd(int fd);
>  
> @@ -260,4 +296,13 @@ int vfio_spapr_remove_window(VFIOContainer *container,
>  bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
>  void vfio_migration_exit(VFIODevice *vbasedev);
>  
> +int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size);
> +bool vfio_devices_all_running_and_mig_active(VFIOContainer *container);
> +bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container);
> +int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
> +                                    VFIOBitmap *vbmap, hwaddr iova,
> +                                    hwaddr size);
> +int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> +                                 uint64_t size, ram_addr_t ram_addr);
> +
>  #endif /* HW_VFIO_VFIO_COMMON_H */
Thanks

Eric



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device
  2023-08-30 10:37 ` [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device Zhenzhong Duan
@ 2023-09-20 13:33   ` Eric Auger
  2023-09-21  3:08     ` Duan, Zhenzhong
  2023-09-21  9:44   ` Cédric Le Goater
  1 sibling, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 13:33 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

Hi Zhenzhong,

In the commit title I would replace vfio/container with vfio/pci to match
the next patches.

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Eric Auger <eric.auger@redhat.com>
>
> We want the VFIO devices to be able to use two different
> IOMMU callbacks, the legacy VFIO one and the new iommufd one.
s/callbacks/backends
>
> Introduce vfio_[attach/detach]_device which aim at hiding the
> underlying IOMMU backend (IOCTLs, datatypes, ...).

At the moment only the implementation based on the legacy
container/group exists. Let's use it from the vfio-pci device.
>
> Once vfio_attach_device completes, the device is attached
> to a security context and its fd can be used. Conversely
> When vfio_detach_device completes, the device has been
> detached to the security context.
from the security context
>
> In this patch, only the vfio-pci device gets converted to use
> the new API. Subsequent patches will handle other devices.
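For readers skimming the diff: on the vfio-pci side the conversion boils
down to replacing the explicit group lookup/attach sequence with the new
helpers, roughly as follows (condensed from the pci.c hunks further down,
error handling elided):

    /* realize path */
    ret = vfio_attach_device(name, vbasedev,
                             pci_device_iommu_address_space(pdev), errp);
    if (ret) {
        goto error;
    }

    /* teardown path */
    vfio_detach_device(&vdev->vbasedev);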
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/container.c           | 66 +++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 | 50 ++++----------------------
>  hw/vfio/trace-events          |  2 +-
>  include/hw/vfio/vfio-common.h |  3 ++
>  4 files changed, 76 insertions(+), 45 deletions(-)
>
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 175cdbbdff..74556da0c7 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -1083,3 +1083,69 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>      }
>      return vfio_eeh_container_op(container, op);
>  }
> +
> +static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
> +{
> +    char *tmp, group_path[PATH_MAX], *group_name;
> +    int ret, groupid;
> +    ssize_t len;
> +
> +    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
> +    len = readlink(tmp, group_path, sizeof(group_path));
> +    g_free(tmp);
> +
> +    if (len <= 0 || len >= sizeof(group_path)) {
> +        ret = len < 0 ? -errno : -ENAMETOOLONG;
> +        error_setg_errno(errp, -ret, "no iommu_group found");
> +        return ret;
> +    }
> +
> +    group_path[len] = 0;
> +
> +    group_name = basename(group_path);
> +    if (sscanf(group_name, "%d", &groupid) != 1) {
> +        error_setg_errno(errp, errno, "failed to read %s", group_path);
> +        return -errno;
> +    }
> +    return groupid;
> +}
> +
> +int vfio_attach_device(char *name, VFIODevice *vbasedev,
> +                       AddressSpace *as, Error **errp)
> +{
> +    int groupid = vfio_device_groupid(vbasedev, errp);
> +    VFIODevice *vbasedev_iter;
> +    VFIOGroup *group;
> +    int ret;
> +
> +    if (groupid < 0) {
> +        return groupid;
> +    }
> +
> +    group = vfio_get_group(groupid, as, errp);
> +    if (!group) {
> +        return -ENOENT;
> +    }
> +
> +    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> +        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
> +            error_setg(errp, "device is already attached");
> +            vfio_put_group(group);
> +            return -EBUSY;
> +        }
> +    }
> +    ret = vfio_get_device(group, name, vbasedev, errp);
> +    if (ret) {
> +        vfio_put_group(group);
> +    }
> +
> +    return ret;
> +}
> +
> +void vfio_detach_device(VFIODevice *vbasedev)
> +{
> +    VFIOGroup *group = vbasedev->group;
> +
> +    vfio_put_base_device(vbasedev);
> +    vfio_put_group(group);
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a205c6b113..34f65ecd17 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2828,10 +2828,10 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
>  
>  static void vfio_put_device(VFIOPCIDevice *vdev)
>  {
> +    vfio_detach_device(&vdev->vbasedev);
> +
>      g_free(vdev->vbasedev.name);
>      g_free(vdev->msix);
> -
> -    vfio_put_base_device(&vdev->vbasedev);
>  }
>  
>  static void vfio_err_notifier_handler(void *opaque)
> @@ -2978,13 +2978,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>      VFIODevice *vbasedev = &vdev->vbasedev;
> -    VFIODevice *vbasedev_iter;
> -    VFIOGroup *group;
> -    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
> +    char *tmp, *subsys;
>      Error *err = NULL;
> -    ssize_t len;
>      struct stat st;
> -    int groupid;
>      int i, ret;
>      bool is_mdev;
>      char uuid[UUID_FMT_LEN];
> @@ -3015,38 +3011,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      vbasedev->type = VFIO_DEVICE_TYPE_PCI;
>      vbasedev->dev = DEVICE(vdev);
>  
> -    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
> -    len = readlink(tmp, group_path, sizeof(group_path));
> -    g_free(tmp);
> -
> -    if (len <= 0 || len >= sizeof(group_path)) {
> -        error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG,
> -                         "no iommu_group found");
> -        goto error;
> -    }
> -
> -    group_path[len] = 0;
> -
> -    group_name = basename(group_path);
> -    if (sscanf(group_name, "%d", &groupid) != 1) {
> -        error_setg_errno(errp, errno, "failed to read %s", group_path);
> -        goto error;
> -    }
> -
> -    trace_vfio_realize(vbasedev->name, groupid);
> -
> -    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
> -    if (!group) {
> -        goto error;
> -    }
> -
> -    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> -        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
> -            error_setg(errp, "device is already attached");
> -            vfio_put_group(group);
> -            goto error;
> -        }
> -    }
> +    trace_vfio_realize(vbasedev->name);
>  
>      /*
>       * Mediated devices *might* operate compatibly with discarding of RAM, but
> @@ -3065,7 +3030,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      if (vbasedev->ram_block_discard_allowed && !is_mdev) {
>          error_setg(errp, "x-balloon-allowed only potentially compatible "
>                     "with mdev devices");
> -        vfio_put_group(group);
>          goto error;
>      }
>  
> @@ -3076,10 +3040,10 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          name = g_strdup(vbasedev->name);
>      }
>  
> -    ret = vfio_get_device(group, name, vbasedev, errp);
> +    ret = vfio_attach_device(name, vbasedev,
> +                             pci_device_iommu_address_space(pdev), errp);
>      g_free(name);
>      if (ret) {
> -        vfio_put_group(group);
>          goto error;
>      }
>  
> @@ -3318,7 +3282,6 @@ error:
>  static void vfio_instance_finalize(Object *obj)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(obj);
> -    VFIOGroup *group = vdev->vbasedev.group;
>  
>      vfio_display_finalize(vdev);
>      vfio_bars_finalize(vdev);
> @@ -3332,7 +3295,6 @@ static void vfio_instance_finalize(Object *obj)
>       * g_free(vdev->igd_opregion);
>       */
>      vfio_put_device(vdev);
> -    vfio_put_group(group);
>  }
>  
>  static void vfio_exitfn(PCIDevice *pdev)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ee7509e68e..8016d9f0d2 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -37,7 +37,7 @@ vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, int
>  vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %s"
>  vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
>  vfio_populate_device_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
> -vfio_realize(const char *name, int group_id) " (%s) group %d"
> +vfio_realize(const char *name) " (%s)"
I am not sure this trace point is useful anymore without the id. Some
tracepoints should be backend specific to keep their usefulness, and
should be called from container.c/iommufd.c instead of from the generic
function.
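As an illustration (hypothetical tracepoint names, just to make the point
concrete), the group id could stay visible in a legacy-backend tracepoint
emitted from container.c, while iommufd.c would later add its own with
whatever handle that backend has:

    /* hw/vfio/container.c (legacy backend) */
    trace_vfio_legacy_attach_device(vbasedev->name, groupid);

    /* hw/vfio/iommufd.c (iommufd backend, hypothetical) */
    trace_vfio_iommufd_attach_device(vbasedev->name, devfd);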
>  vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
>  vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
>  vfio_pci_reset(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index bb7f9fe9c4..a29dfe7723 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -253,6 +253,9 @@ void vfio_put_group(VFIOGroup *group);
>  struct vfio_device_info *vfio_get_device_info(int fd);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +int vfio_attach_device(char *name, VFIODevice *vbasedev,
> +                       AddressSpace *as, Error **errp);
> +void vfio_detach_device(VFIODevice *vbasedev);
>  
>  extern int vfio_kvm_device_fd;
>  
Thanks

Eric



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 13/22] vfio: Add base container
  2023-09-19 17:23   ` Cédric Le Goater
  2023-09-20  8:48     ` Duan, Zhenzhong
@ 2023-09-20 13:53     ` Eric Auger
  2023-09-21  3:12       ` Duan, Zhenzhong
  2023-09-20 17:31     ` Eric Auger
  2 siblings, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 13:53 UTC (permalink / raw)
  To: Cédric Le Goater, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, peterx, jasowang,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, Yi Sun,
	Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)

Hi Cedric,

On 9/19/23 19:23, Cédric Le Goater wrote:
> On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Abstract the VFIOContainer to be a base object. It is supposed to be
>> embedded by legacy VFIO container and later on, into the new iommufd
>> based container.
>>
>> The base container implements generic code such as code related to
>> memory_listener and address space management. The VFIOContainerOps
>> implements callbacks that depend on the kernel user space being used.
>>
>> 'common.c' and vfio device code only manipulates the base container with
>> wrapper functions that calls the functions defined in
>> VFIOContainerOpsClass.
>> Existing 'container.c' code is converted to implement the legacy
>> container
>> ops functions.
>>
>> Below is the base container. It's named as VFIOContainer, old
>> VFIOContainer
>> is replaced with VFIOLegacyContainer.
>
> Usually, we introduce the new interface solely, port the current models
> on top of the new interface, wire the new models into the current
> implementation and remove the old implementation. Then, we can start
> adding extensions to support other implementations.
> spapr should be taken care of separately following the principle above.
> With my PPC hat on, I would not even read such a massive change, too risky
> for the subsystem. This patch will need (much) further splitting to be
> understandable and acceptable.
>
> Also, please include the .h file first, it helps in reading. Have you
> considered using an InterfaceClass ?
In the transition from v1 -> v2, I removed the QOMification of the
VFIOContainer, following David Gibson's advice. QOM objects are visible
from the user interface and there was no interest in exposing the
container that way. Does that answer your question?

- remove the QOMification of the VFIOContainer and simply use standard ops (David)

Unfortunately the cover-letter change log has disappeared in this new
version. Zhenzhong, I think it is useful for understanding how the
series has evolved.
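To spell out what "standard ops" (the changelog entry above) means here,
as opposed to a QOM InterfaceClass: a plain struct of function pointers
that each backend fills in and that the base container dispatches
through. A minimal sketch for illustration only, not the exact code of
this series (which, in the hunks quoted below, still wraps the backend
ops in a QOM ops class):

    typedef struct VFIOIOMMUOps {
        int (*dma_map)(VFIOContainer *bcontainer, hwaddr iova,
                       ram_addr_t size, void *vaddr, bool readonly);
        int (*dma_unmap)(VFIOContainer *bcontainer, hwaddr iova,
                         ram_addr_t size, IOMMUTLBEntry *iotlb);
        /* ... set_dirty_page_tracking, add/del_window, ... */
    } VFIOIOMMUOps;

    static const VFIOIOMMUOps vfio_legacy_ops = {
        .dma_map   = vfio_legacy_dma_map,
        .dma_unmap = vfio_legacy_dma_unmap,
    };

The dispatch is the same either way; the difference is that a plain ops
struct does not create a user-visible QOM type.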

Thanks

Eric

>
> Thanks,
>
> C.
>
>>
>> struct VFIOContainer {
>>      VFIOIOMMUBackendOpsClass *ops;
>>      VFIOAddressSpace *space;
>>      MemoryListener listener;
>>      Error *error;
>>      bool initialized;
>>      bool dirty_pages_supported;
>>      uint64_t dirty_pgsizes;
>>      uint64_t max_dirty_bitmap_size;
>>      unsigned long pgsizes;
>>      unsigned int dma_max_mappings;
>>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>      QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>>      QLIST_ENTRY(VFIOContainer) next;
>> };
>>
>> struct VFIOLegacyContainer {
>>      VFIOContainer bcontainer;
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      MemoryListener prereg_listener;
>>      unsigned iommu_type;
>>      QLIST_HEAD(, VFIOGroup) group_list;
>> };
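To make the split concrete: generic code only ever sees the base
VFIOContainer and calls thin wrappers that dispatch through the ops,
while the legacy implementation recovers its derived type with
container_of(). A condensed view of the hunks quoted below:

    /* container-base.c: generic wrapper */
    int vfio_container_dma_map(VFIOContainer *container, hwaddr iova,
                               ram_addr_t size, void *vaddr, bool readonly)
    {
        if (!container->ops->dma_map) {
            return -EINVAL;
        }
        return container->ops->dma_map(container, iova, size, vaddr, readonly);
    }

    /* container.c: legacy backend implementation */
    static int vfio_legacy_dma_map(VFIOContainer *bcontainer, hwaddr iova,
                                   ram_addr_t size, void *vaddr, bool readonly)
    {
        VFIOLegacyContainer *container =
            container_of(bcontainer, VFIOLegacyContainer, bcontainer);
        /* ... VFIO_IOMMU_MAP_DMA ioctl on container->fd ... */
    }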
>>
>> Co-authored-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/vfio/common.c                      |  72 +++++---
>>   hw/vfio/container-base.c              | 160 +++++++++++++++++
>>   hw/vfio/container.c                   | 247 ++++++++++++++++----------
>>   hw/vfio/meson.build                   |   1 +
>>   hw/vfio/spapr.c                       |  22 +--
>>   hw/vfio/trace-events                  |   4 +-
>>   include/hw/vfio/vfio-common.h         |  85 ++-------
>>   include/hw/vfio/vfio-container-base.h | 155 ++++++++++++++++
>>   8 files changed, 540 insertions(+), 206 deletions(-)
>>   create mode 100644 hw/vfio/container-base.c
>>   create mode 100644 include/hw/vfio/vfio-container-base.h
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 044710fc1f..86b6af5740 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -379,19 +379,20 @@ static void vfio_iommu_map_notify(IOMMUNotifier
>> *n, IOMMUTLBEntry *iotlb)
>>            * of vaddr will always be there, even if the memory object is
>>            * destroyed and its backing memory munmap-ed.
>>            */
>> -        ret = vfio_dma_map(container, iova,
>> -                           iotlb->addr_mask + 1, vaddr,
>> -                           read_only);
>> +        ret = vfio_container_dma_map(container, iova,
>> +                                     iotlb->addr_mask + 1, vaddr,
>> +                                     read_only);
>>           if (ret) {
>> -            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> +            error_report("vfio_container_dma_map(%p,
>> 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx", %p) = %d (%s)",
>>                            container, iova,
>>                            iotlb->addr_mask + 1, vaddr, ret,
>> strerror(-ret));
>>           }
>>       } else {
>> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1,
>> iotlb);
>> +        ret = vfio_container_dma_unmap(container, iova,
>> +                                       iotlb->addr_mask + 1, iotlb);
>>           if (ret) {
>> -            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +            error_report("vfio_container_dma_unmap(%p,
>> 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%s)",
>>                            container, iova,
>>                            iotlb->addr_mask + 1, ret, strerror(-ret));
>> @@ -407,14 +408,15 @@ static void
>> vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>>   {
>>       VFIORamDiscardListener *vrdl = container_of(rdl,
>> VFIORamDiscardListener,
>>                                                   listener);
>> +    VFIOContainer *container = vrdl->container;
>>       const hwaddr size = int128_get64(section->size);
>>       const hwaddr iova = section->offset_within_address_space;
>>       int ret;
>>         /* Unmap with a single call. */
>> -    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
>> +    ret = vfio_container_dma_unmap(container, iova, size , NULL);
>>       if (ret) {
>> -        error_report("%s: vfio_dma_unmap() failed: %s", __func__,
>> +        error_report("%s: vfio_container_dma_unmap() failed: %s",
>> __func__,
>>                        strerror(-ret));
>>       }
>>   }
>> @@ -424,6 +426,7 @@ static int
>> vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>>   {
>>       VFIORamDiscardListener *vrdl = container_of(rdl,
>> VFIORamDiscardListener,
>>                                                   listener);
>> +    VFIOContainer *container = vrdl->container;
>>       const hwaddr end = section->offset_within_region +
>>                          int128_get64(section->size);
>>       hwaddr start, next, iova;
>> @@ -442,8 +445,8 @@ static int
>> vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>>                  section->offset_within_address_space;
>>           vaddr = memory_region_get_ram_ptr(section->mr) + start;
>>   -        ret = vfio_dma_map(vrdl->container, iova, next - start,
>> -                           vaddr, section->readonly);
>> +        ret = vfio_container_dma_map(container, iova, next - start,
>> +                                     vaddr, section->readonly);
>>           if (ret) {
>>               /* Rollback */
>>               vfio_ram_discard_notify_discard(rdl, section);
>> @@ -756,10 +759,10 @@ static void
>> vfio_listener_region_add(MemoryListener *listener,
>>           }
>>       }
>>   -    ret = vfio_dma_map(container, iova, int128_get64(llsize),
>> -                       vaddr, section->readonly);
>> +    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
>> +                                 vaddr, section->readonly);
>>       if (ret) {
>> -        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> +        error_setg(&err, "vfio_container_dma_map(%p,
>> 0x%"HWADDR_PRIx", "
>>                      "0x%"HWADDR_PRIx", %p) = %d (%s)",
>>                      container, iova, int128_get64(llsize), vaddr, ret,
>>                      strerror(-ret));
>> @@ -775,7 +778,7 @@ static void
>> vfio_listener_region_add(MemoryListener *listener,
>>     fail:
>>       if (memory_region_is_ram_device(section->mr)) {
>> -        error_report("failed to vfio_dma_map. pci p2p may not work");
>> +        error_report("failed to vfio_container_dma_map. pci p2p may
>> not work");
>>           return;
>>       }
>>       /*
>> @@ -860,18 +863,20 @@ static void
>> vfio_listener_region_del(MemoryListener *listener,
>>           if (int128_eq(llsize, int128_2_64())) {
>>               /* The unmap ioctl doesn't accept a full 64-bit span. */
>>               llsize = int128_rshift(llsize, 1);
>> -            ret = vfio_dma_unmap(container, iova,
>> int128_get64(llsize), NULL);
>> +            ret = vfio_container_dma_unmap(container, iova,
>> +                                           int128_get64(llsize), NULL);
>>               if (ret) {
>> -                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +                error_report("vfio_container_dma_unmap(%p,
>> 0x%"HWADDR_PRIx", "
>>                                "0x%"HWADDR_PRIx") = %d (%s)",
>>                                container, iova, int128_get64(llsize),
>> ret,
>>                                strerror(-ret));
>>               }
>>               iova += int128_get64(llsize);
>>           }
>> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize),
>> NULL);
>> +        ret = vfio_container_dma_unmap(container, iova,
>> +                                       int128_get64(llsize), NULL);
>>           if (ret) {
>> -            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +            error_report("vfio_container_dma_unmap(%p,
>> 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%s)",
>>                            container, iova, int128_get64(llsize), ret,
>>                            strerror(-ret));
>> @@ -1103,7 +1108,7 @@ static void
>> vfio_listener_log_global_start(MemoryListener *listener)
>>       if (vfio_devices_all_device_dirty_tracking(container)) {
>>           ret = vfio_devices_dma_logging_start(container);
>>       } else {
>> -        ret = vfio_set_dirty_page_tracking(container, true);
>> +        ret = vfio_container_set_dirty_page_tracking(container, true);
>>       }
>>         if (ret) {
>> @@ -1121,7 +1126,7 @@ static void
>> vfio_listener_log_global_stop(MemoryListener *listener)
>>       if (vfio_devices_all_device_dirty_tracking(container)) {
>>           vfio_devices_dma_logging_stop(container);
>>       } else {
>> -        ret = vfio_set_dirty_page_tracking(container, false);
>> +        ret = vfio_container_set_dirty_page_tracking(container, false);
>>       }
>>         if (ret) {
>> @@ -1204,7 +1209,7 @@ int vfio_get_dirty_bitmap(VFIOContainer
>> *container, uint64_t iova,
>>       if (all_device_dirty_tracking) {
>>           ret = vfio_devices_query_dirty_bitmap(container, &vbmap,
>> iova, size);
>>       } else {
>> -        ret = vfio_query_dirty_bitmap(container, &vbmap, iova, size);
>> +        ret = vfio_container_query_dirty_bitmap(container, &vbmap,
>> iova, size);
>>       }
>>         if (ret) {
>> @@ -1214,8 +1219,7 @@ int vfio_get_dirty_bitmap(VFIOContainer
>> *container, uint64_t iova,
>>       dirty_pages =
>> cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr,
>>                                                            vbmap.pages);
>>   -    trace_vfio_get_dirty_bitmap(container->fd, iova, size,
>> vbmap.size,
>> -                                ram_addr, dirty_pages);
>> +    trace_vfio_get_dirty_bitmap(iova, size, vbmap.size, ram_addr,
>> dirty_pages);
>>   out:
>>       g_free(vbmap.bitmap);
>>   @@ -1525,3 +1529,25 @@ retry:
>>         return info;
>>   }
>> +
>> +int vfio_attach_device(char *name, VFIODevice *vbasedev,
>> +                       AddressSpace *as, Error **errp)
>> +{
>> +    const VFIOIOMMUBackendOpsClass *ops;
>> +
>> +    ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
>> +                 
>> object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
>> +    if (!ops) {
>> +        error_setg(errp, "VFIO IOMMU Backend not found!");
>> +        return -ENODEV;
>> +    }
>> +    return ops->attach_device(name, vbasedev, as, errp);
>> +}
>> +
>> +void vfio_detach_device(VFIODevice *vbasedev)
>> +{
>> +    if (!vbasedev->container) {
>> +        return;
>> +    }
>> +    vbasedev->container->ops->detach_device(vbasedev);
>> +}
>> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>> new file mode 100644
>> index 0000000000..876e95c6dd
>> --- /dev/null
>> +++ b/hw/vfio/container-base.c
>> @@ -0,0 +1,160 @@
>> +/*
>> + * VFIO BASE CONTAINER
>> + *
>> + * Copyright (C) 2023 Intel Corporation.
>> + * Copyright Red Hat, Inc. 2023
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *          Eric Auger <eric.auger@redhat.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License
>> along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/vfio/vfio-container-base.h"
>> +
>> +VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> +                                 VFIODevice *curr)
>> +{
>> +    if (!container->ops->dev_iter_next) {
>> +        return NULL;
>> +    }
>> +
>> +    return container->ops->dev_iter_next(container, curr);
>> +}
>> +
>> +int vfio_container_dma_map(VFIOContainer *container,
>> +                           hwaddr iova, ram_addr_t size,
>> +                           void *vaddr, bool readonly)
>> +{
>> +    if (!container->ops->dma_map) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return container->ops->dma_map(container, iova, size, vaddr,
>> readonly);
>> +}
>> +
>> +int vfio_container_dma_unmap(VFIOContainer *container,
>> +                             hwaddr iova, ram_addr_t size,
>> +                             IOMMUTLBEntry *iotlb)
>> +{
>> +    if (!container->ops->dma_unmap) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return container->ops->dma_unmap(container, iova, size, iotlb);
>> +}
>> +
>> +int vfio_container_set_dirty_page_tracking(VFIOContainer *container,
>> +                                            bool start)
>> +{
>> +    /* Fallback to all pages dirty if dirty page sync isn't
>> supported */
>> +    if (!container->ops->set_dirty_page_tracking) {
>> +        return 0;
>> +    }
>> +
>> +    return container->ops->set_dirty_page_tracking(container, start);
>> +}
>> +
>> +int vfio_container_query_dirty_bitmap(VFIOContainer *container,
>> +                                      VFIOBitmap *vbmap,
>> +                                      hwaddr iova, hwaddr size)
>> +{
>> +    if (!container->ops->query_dirty_bitmap) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return container->ops->query_dirty_bitmap(container, vbmap,
>> iova, size);
>> +}
>> +
>> +int vfio_container_add_section_window(VFIOContainer *container,
>> +                                      MemoryRegionSection *section,
>> +                                      Error **errp)
>> +{
>> +    if (!container->ops->add_window) {
>> +        return 0;
>> +    }
>> +
>> +    return container->ops->add_window(container, section, errp);
>> +}
>> +
>> +void vfio_container_del_section_window(VFIOContainer *container,
>> +                                       MemoryRegionSection *section)
>> +{
>> +    if (!container->ops->del_window) {
>> +        return;
>> +    }
>> +
>> +    return container->ops->del_window(container, section);
>> +}
>> +
>> +void vfio_container_init(VFIOContainer *container,
>> +                         VFIOAddressSpace *space,
>> +                         struct VFIOIOMMUBackendOpsClass *ops)
>> +{
>> +    container->ops = ops;
>> +    container->space = space;
>> +    container->error = NULL;
>> +    container->dirty_pages_supported = false;
>> +    container->dma_max_mappings = 0;
>> +    QLIST_INIT(&container->giommu_list);
>> +    QLIST_INIT(&container->hostwin_list);
>> +    QLIST_INIT(&container->vrdl_list);
>> +}
>> +
>> +void vfio_container_destroy(VFIOContainer *container)
>> +{
>> +    VFIORamDiscardListener *vrdl, *vrdl_tmp;
>> +    VFIOGuestIOMMU *giommu, *tmp;
>> +    VFIOHostDMAWindow *hostwin, *next;
>> +
>> +    QLIST_SAFE_REMOVE(container, next);
>> +
>> +    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
>> +        RamDiscardManager *rdm;
>> +
>> +        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
>> +        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
>> +        QLIST_REMOVE(vrdl, next);
>> +        g_free(vrdl);
>> +    }
>> +
>> +    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next,
>> tmp) {
>> +        memory_region_unregister_iommu_notifier(
>> +                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
>> +        QLIST_REMOVE(giommu, giommu_next);
>> +        g_free(giommu);
>> +    }
>> +
>> +    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
>> +                       next) {
>> +        QLIST_REMOVE(hostwin, hostwin_next);
>> +        g_free(hostwin);
>> +    }
>> +}
>> +
>> +static const TypeInfo vfio_iommu_backend_ops_type_info = {
>> +    .name = TYPE_VFIO_IOMMU_BACKEND_OPS,
>> +    .parent = TYPE_OBJECT,
>> +    .abstract = true,
>> +    .class_size = sizeof(VFIOIOMMUBackendOpsClass),
>> +};
>> +
>> +static void vfio_iommu_backend_ops_register_types(void)
>> +{
>> +    type_register_static(&vfio_iommu_backend_ops_type_info);
>> +}
>> +type_init(vfio_iommu_backend_ops_register_types);
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index c71fddc09a..bb29b3612d 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -42,7 +42,8 @@
>>   VFIOGroupList vfio_group_list =
>>       QLIST_HEAD_INITIALIZER(vfio_group_list);
>>   -static int vfio_ram_block_discard_disable(VFIOContainer
>> *container, bool state)
>> +static int vfio_ram_block_discard_disable(VFIOLegacyContainer
>> *container,
>> +                                          bool state)
>>   {
>>       switch (container->iommu_type) {
>>       case VFIO_TYPE1v2_IOMMU:
>> @@ -65,11 +66,18 @@ static int
>> vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>>       }
>>   }
>>   -VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> -                                         VFIODevice *curr)
>> +static VFIODevice *vfio_legacy_dev_iter_next(VFIOContainer *bcontainer,
>> +                                             VFIODevice *curr)
>>   {
>>       VFIOGroup *group;
>>   +    assert(object_class_dynamic_cast(OBJECT_CLASS(bcontainer->ops),
>> +                                    
>> TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
>> +
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>> +
>>       if (!curr) {
>>           group = QLIST_FIRST(&container->group_list);
>>       } else {
>> @@ -85,10 +93,11 @@ VFIODevice
>> *vfio_container_dev_iter_next(VFIOContainer *container,
>>       return QLIST_FIRST(&group->device_list);
>>   }
>>   -static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>> +static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
>>                                    hwaddr iova, ram_addr_t size,
>>                                    IOMMUTLBEntry *iotlb)
>>   {
>> +    VFIOContainer *bcontainer = &container->bcontainer;
>>       struct vfio_iommu_type1_dma_unmap *unmap;
>>       struct vfio_bitmap *bitmap;
>>       VFIOBitmap vbmap;
>> @@ -116,7 +125,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer
>> *container,
>>       bitmap->size = vbmap.size;
>>       bitmap->data = (__u64 *)vbmap.bitmap;
>>   -    if (vbmap.size > container->max_dirty_bitmap_size) {
>> +    if (vbmap.size > bcontainer->max_dirty_bitmap_size) {
>>           error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
>> vbmap.size);
>>           ret = -E2BIG;
>>           goto unmap_exit;
>> @@ -140,9 +149,13 @@ unmap_exit:
>>   /*
>>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used
>> on x86
>>    */
>> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>> -                   ram_addr_t size, IOMMUTLBEntry *iotlb)
>> +static int vfio_legacy_dma_unmap(VFIOContainer *bcontainer, hwaddr
>> iova,
>> +                          ram_addr_t size, IOMMUTLBEntry *iotlb)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>> +
>>       struct vfio_iommu_type1_dma_unmap unmap = {
>>           .argsz = sizeof(unmap),
>>           .flags = 0,
>> @@ -152,9 +165,9 @@ int vfio_dma_unmap(VFIOContainer *container,
>> hwaddr iova,
>>       bool need_dirty_sync = false;
>>       int ret;
>>   -    if (iotlb &&
>> vfio_devices_all_running_and_mig_active(container)) {
>> -        if (!vfio_devices_all_device_dirty_tracking(container) &&
>> -            container->dirty_pages_supported) {
>> +    if (iotlb && vfio_devices_all_running_and_mig_active(bcontainer)) {
>> +        if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
>> +            bcontainer->dirty_pages_supported) {
>>               return vfio_dma_unmap_bitmap(container, iova, size,
>> iotlb);
>>           }
>>   @@ -176,8 +189,8 @@ int vfio_dma_unmap(VFIOContainer *container,
>> hwaddr iova,
>>            */
>>           if (errno == EINVAL && unmap.size && !(unmap.iova +
>> unmap.size) &&
>>               container->iommu_type == VFIO_TYPE1v2_IOMMU) {
>> -            trace_vfio_dma_unmap_overflow_workaround();
>> -            unmap.size -= 1ULL << ctz64(container->pgsizes);
>> +            trace_vfio_legacy_dma_unmap_overflow_workaround();
>> +            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
>>               continue;
>>           }
>>           error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
>> @@ -185,7 +198,7 @@ int vfio_dma_unmap(VFIOContainer *container,
>> hwaddr iova,
>>       }
>>         if (need_dirty_sync) {
>> -        ret = vfio_get_dirty_bitmap(container, iova, size,
>> +        ret = vfio_get_dirty_bitmap(bcontainer, iova, size,
>>                                       iotlb->translated_addr);
>>           if (ret) {
>>               return ret;
>> @@ -195,9 +208,13 @@ int vfio_dma_unmap(VFIOContainer *container,
>> hwaddr iova,
>>       return 0;
>>   }
>>   -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> -                 ram_addr_t size, void *vaddr, bool readonly)
>> +static int vfio_legacy_dma_map(VFIOContainer *bcontainer, hwaddr iova,
>> +                               ram_addr_t size, void *vaddr, bool
>> readonly)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>> +
>>       struct vfio_iommu_type1_dma_map map = {
>>           .argsz = sizeof(map),
>>           .flags = VFIO_DMA_MAP_FLAG_READ,
>> @@ -216,7 +233,8 @@ int vfio_dma_map(VFIOContainer *container, hwaddr
>> iova,
>>        * the VGA ROM space.
>>        */
>>       if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
>> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size,
>> NULL) == 0 &&
>> +        (errno == EBUSY &&
>> +         vfio_legacy_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
>>            ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>>           return 0;
>>       }
>> @@ -225,14 +243,18 @@ int vfio_dma_map(VFIOContainer *container,
>> hwaddr iova,
>>       return -errno;
>>   }
>>   -int vfio_set_dirty_page_tracking(VFIOContainer *container, bool
>> start)
>> +static int vfio_legacy_set_dirty_page_tracking(VFIOContainer
>> *bcontainer,
>> +                                               bool start)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>>       int ret;
>>       struct vfio_iommu_type1_dirty_bitmap dirty = {
>>           .argsz = sizeof(dirty),
>>       };
>>   -    if (!container->dirty_pages_supported) {
>> +    if (!bcontainer->dirty_pages_supported) {
>>           return 0;
>>       }
>>   @@ -252,9 +274,13 @@ int vfio_set_dirty_page_tracking(VFIOContainer
>> *container, bool start)
>>       return ret;
>>   }
>>   -int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap
>> *vbmap,
>> -                            hwaddr iova, hwaddr size)
>> +static int vfio_legacy_query_dirty_bitmap(VFIOContainer *bcontainer,
>> +                                          VFIOBitmap *vbmap,
>> +                                          hwaddr iova, hwaddr size)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>>       struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>>       struct vfio_iommu_type1_dirty_bitmap_get *range;
>>       int ret;
>> @@ -289,18 +315,24 @@ int vfio_query_dirty_bitmap(VFIOContainer
>> *container, VFIOBitmap *vbmap,
>>       return ret;
>>   }
>>   -static void vfio_listener_release(VFIOContainer *container)
>> +static void vfio_listener_release(VFIOLegacyContainer *container)
>>   {
>> -    memory_listener_unregister(&container->listener);
>> +    VFIOContainer *bcontainer = &container->bcontainer;
>> +
>> +    memory_listener_unregister(&bcontainer->listener);
>>       if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>           memory_listener_unregister(&container->prereg_listener);
>>       }
>>   }
>>   -int vfio_container_add_section_window(VFIOContainer *container,
>> -                                      MemoryRegionSection *section,
>> -                                      Error **errp)
>> +static int
>> +vfio_legacy_add_section_window(VFIOContainer *bcontainer,
>> +                               MemoryRegionSection *section,
>> +                               Error **errp)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>>       VFIOHostDMAWindow *hostwin;
>>       hwaddr pgsize = 0;
>>       int ret;
>> @@ -310,7 +342,7 @@ int
>> vfio_container_add_section_window(VFIOContainer *container,
>>       }
>>         /* For now intersections are not allowed, we may relax this
>> later */
>> -    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
>>           if (ranges_overlap(hostwin->min_iova,
>>                              hostwin->max_iova - hostwin->min_iova + 1,
>>                              section->offset_within_address_space,
>> @@ -332,7 +364,7 @@ int
>> vfio_container_add_section_window(VFIOContainer *container,
>>           return ret;
>>       }
>>   -    vfio_host_win_add(container,
>> section->offset_within_address_space,
>> +    vfio_host_win_add(bcontainer, section->offset_within_address_space,
>>                         section->offset_within_address_space +
>>                         int128_get64(section->size) - 1, pgsize);
>>   #ifdef CONFIG_KVM
>> @@ -365,16 +397,21 @@ int
>> vfio_container_add_section_window(VFIOContainer *container,
>>       return 0;
>>   }
>>   -void vfio_container_del_section_window(VFIOContainer *container,
>> -                                       MemoryRegionSection *section)
>> +static void
>> +vfio_legacy_del_section_window(VFIOContainer *bcontainer,
>> +                               MemoryRegionSection *section)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>> +
>>       if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>>           return;
>>       }
>>         vfio_spapr_remove_window(container,
>>                                section->offset_within_address_space);
>> -    if (vfio_host_win_del(container,
>> +    if (vfio_host_win_del(bcontainer,
>>                             section->offset_within_address_space,
>>                             section->offset_within_address_space +
>>                             int128_get64(section->size) - 1) < 0) {
>> @@ -427,7 +464,7 @@ static void vfio_kvm_device_del_group(VFIOGroup
>> *group)
>>   /*
>>    * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>>    */
>> -static int vfio_get_iommu_type(VFIOContainer *container,
>> +static int vfio_get_iommu_type(VFIOLegacyContainer *container,
>>                                  Error **errp)
>>   {
>>       int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>> @@ -443,7 +480,7 @@ static int vfio_get_iommu_type(VFIOContainer
>> *container,
>>       return -EINVAL;
>>   }
>>   -static int vfio_init_container(VFIOContainer *container, int
>> group_fd,
>> +static int vfio_init_container(VFIOLegacyContainer *container, int
>> group_fd,
>>                                  Error **errp)
>>   {
>>       int iommu_type, ret;
>> @@ -478,7 +515,7 @@ static int vfio_init_container(VFIOContainer
>> *container, int group_fd,
>>       return 0;
>>   }
>>   -static int vfio_get_iommu_info(VFIOContainer *container,
>> +static int vfio_get_iommu_info(VFIOLegacyContainer *container,
>>                                  struct vfio_iommu_type1_info **info)
>>   {
>>   @@ -522,11 +559,12 @@ vfio_get_iommu_info_cap(struct
>> vfio_iommu_type1_info *info, uint16_t id)
>>       return NULL;
>>   }
>>   -static void vfio_get_iommu_info_migration(VFIOContainer *container,
>> -                                         struct
>> vfio_iommu_type1_info *info)
>> +static void vfio_get_iommu_info_migration(VFIOLegacyContainer
>> *container,
>> +                                          struct
>> vfio_iommu_type1_info *info)
>>   {
>>       struct vfio_info_cap_header *hdr;
>>       struct vfio_iommu_type1_info_cap_migration *cap_mig;
>> +    VFIOContainer *bcontainer = &container->bcontainer;
>>         hdr = vfio_get_iommu_info_cap(info,
>> VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
>>       if (!hdr) {
>> @@ -541,16 +579,19 @@ static void
>> vfio_get_iommu_info_migration(VFIOContainer *container,
>>        * qemu_real_host_page_size to mark those dirty.
>>        */
>>       if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
>> -        container->dirty_pages_supported = true;
>> -        container->max_dirty_bitmap_size =
>> cap_mig->max_dirty_bitmap_size;
>> -        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
>> +        bcontainer->dirty_pages_supported = true;
>> +        bcontainer->max_dirty_bitmap_size =
>> cap_mig->max_dirty_bitmap_size;
>> +        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
>>       }
>>   }
>>     static int vfio_connect_container(VFIOGroup *group, AddressSpace
>> *as,
>>                                     Error **errp)
>>   {
>> -    VFIOContainer *container;
>> +    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
>> +        object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
>> +    VFIOContainer *bcontainer;
>> +    VFIOLegacyContainer *container;
>>       int ret, fd;
>>       VFIOAddressSpace *space;
>>   @@ -587,7 +628,8 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>        * details once we know which type of IOMMU we are using.
>>        */
>>   -    QLIST_FOREACH(container, &space->containers, next) {
>> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
>> +        container = container_of(bcontainer, VFIOLegacyContainer,
>> bcontainer);
>>           if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER,
>> &container->fd)) {
>>               ret = vfio_ram_block_discard_disable(container, true);
>>               if (ret) {
>> @@ -623,14 +665,9 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>       }
>>         container = g_malloc0(sizeof(*container));
>> -    container->space = space;
>>       container->fd = fd;
>> -    container->error = NULL;
>> -    container->dirty_pages_supported = false;
>> -    container->dma_max_mappings = 0;
>> -    QLIST_INIT(&container->giommu_list);
>> -    QLIST_INIT(&container->hostwin_list);
>> -    QLIST_INIT(&container->vrdl_list);
>> +    bcontainer = &container->bcontainer;
>> +    vfio_container_init(bcontainer, space, ops);
>>         ret = vfio_init_container(container, group->fd, errp);
>>       if (ret) {
>> @@ -656,13 +693,13 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>           }
>>             if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
>> -            container->pgsizes = info->iova_pgsizes;
>> +            bcontainer->pgsizes = info->iova_pgsizes;
>>           } else {
>> -            container->pgsizes = qemu_real_host_page_size();
>> +            bcontainer->pgsizes = qemu_real_host_page_size();
>>           }
>>   -        if (!vfio_get_info_dma_avail(info,
>> &container->dma_max_mappings)) {
>> -            container->dma_max_mappings = 65535;
>> +        if (!vfio_get_info_dma_avail(info,
>> &bcontainer->dma_max_mappings)) {
>> +            bcontainer->dma_max_mappings = 65535;
>>           }
>>           vfio_get_iommu_info_migration(container, info);
>>           g_free(info);
>> @@ -672,7 +709,7 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>            * information to get the actual window extent rather than
>> assume
>>            * a 64-bit IOVA address space.
>>            */
>> -        vfio_host_win_add(container, 0, (hwaddr)-1,
>> container->pgsizes);
>> +        vfio_host_win_add(bcontainer, 0, (hwaddr)-1,
>> bcontainer->pgsizes);
>>             break;
>>       }
>> @@ -699,10 +736,10 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>                 memory_listener_register(&container->prereg_listener,
>>                                        &address_space_memory);
>> -            if (container->error) {
>> +            if (bcontainer->error) {
>>                  
>> memory_listener_unregister(&container->prereg_listener);
>>                   ret = -1;
>> -                error_propagate_prepend(errp, container->error,
>> +                error_propagate_prepend(errp, bcontainer->error,
>>                       "RAM memory listener initialization failed: ");
>>                   goto enable_discards_exit;
>>               }
>> @@ -721,7 +758,7 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>           }
>>             if (v2) {
>> -            container->pgsizes = info.ddw.pgsizes;
>> +            bcontainer->pgsizes = info.ddw.pgsizes;
>>               /*
>>                * There is a default window in just created container.
>>                * To make region_add/del simpler, we better remove this
>> @@ -736,8 +773,8 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>               }
>>           } else {
>>               /* The default table uses 4K pages */
>> -            container->pgsizes = 0x1000;
>> -            vfio_host_win_add(container, info.dma32_window_start,
>> +            bcontainer->pgsizes = 0x1000;
>> +            vfio_host_win_add(bcontainer, info.dma32_window_start,
>>                                 info.dma32_window_start +
>>                                 info.dma32_window_size - 1,
>>                                 0x1000);
>> @@ -748,28 +785,28 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>       vfio_kvm_device_add_group(group);
>>         QLIST_INIT(&container->group_list);
>> -    QLIST_INSERT_HEAD(&space->containers, container, next);
>> +    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
>>         group->container = container;
>>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>   -    container->listener = vfio_memory_listener;
>> +    bcontainer->listener = vfio_memory_listener;
>>   -    memory_listener_register(&container->listener,
>> container->space->as);
>> +    memory_listener_register(&bcontainer->listener,
>> bcontainer->space->as);
>>   -    if (container->error) {
>> +    if (bcontainer->error) {
>>           ret = -1;
>> -        error_propagate_prepend(errp, container->error,
>> +        error_propagate_prepend(errp, bcontainer->error,
>>               "memory listener initialization failed: ");
>>           goto listener_release_exit;
>>       }
>>   -    container->initialized = true;
>> +    bcontainer->initialized = true;
>>         return 0;
>>   listener_release_exit:
>>       QLIST_REMOVE(group, container_next);
>> -    QLIST_REMOVE(container, next);
>> +    QLIST_REMOVE(bcontainer, next);
>>       vfio_kvm_device_del_group(group);
>>       vfio_listener_release(container);
>>   @@ -790,7 +827,8 @@ put_space_exit:
>>     static void vfio_disconnect_container(VFIOGroup *group)
>>   {
>> -    VFIOContainer *container = group->container;
>> +    VFIOLegacyContainer *container = group->container;
>> +    VFIOContainer *bcontainer = &container->bcontainer;
>>         QLIST_REMOVE(group, container_next);
>>       group->container = NULL;
>> @@ -810,25 +848,9 @@ static void vfio_disconnect_container(VFIOGroup
>> *group)
>>       }
>>         if (QLIST_EMPTY(&container->group_list)) {
>> -        VFIOAddressSpace *space = container->space;
>> -        VFIOGuestIOMMU *giommu, *tmp;
>> -        VFIOHostDMAWindow *hostwin, *next;
>> -
>> -        QLIST_REMOVE(container, next);
>> -
>> -        QLIST_FOREACH_SAFE(giommu, &container->giommu_list,
>> giommu_next, tmp) {
>> -            memory_region_unregister_iommu_notifier(
>> -                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
>> -            QLIST_REMOVE(giommu, giommu_next);
>> -            g_free(giommu);
>> -        }
>> -
>> -        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list,
>> hostwin_next,
>> -                           next) {
>> -            QLIST_REMOVE(hostwin, hostwin_next);
>> -            g_free(hostwin);
>> -        }
>> +        VFIOAddressSpace *space = bcontainer->space;
>>   +        vfio_container_destroy(bcontainer);
>>           trace_vfio_disconnect_container(container->fd);
>>           close(container->fd);
>>           g_free(container);
>> @@ -840,13 +862,15 @@ static void vfio_disconnect_container(VFIOGroup
>> *group)
>>   static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as,
>> Error **errp)
>>   {
>>       VFIOGroup *group;
>> +    VFIOContainer *bcontainer;
>>       char path[32];
>>       struct vfio_group_status status = { .argsz = sizeof(status) };
>>         QLIST_FOREACH(group, &vfio_group_list, next) {
>>           if (group->groupid == groupid) {
>>               /* Found it.  Now is it already in the right context? */
>> -            if (group->container->space->as == as) {
>> +            bcontainer = &group->container->bcontainer;
>> +            if (bcontainer->space->as == as) {
>>                   return group;
>>               } else {
>>                   error_setg(errp, "group %d used in multiple address
>> spaces",
>> @@ -990,7 +1014,7 @@ static void vfio_put_base_device(VFIODevice
>> *vbasedev)
>>   /*
>>    * Interfaces for IBM EEH (Enhanced Error Handling)
>>    */
>> -static bool vfio_eeh_container_ok(VFIOContainer *container)
>> +static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
>>   {
>>       /*
>>        * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
>> @@ -1018,7 +1042,7 @@ static bool vfio_eeh_container_ok(VFIOContainer
>> *container)
>>       return true;
>>   }
>>   -static int vfio_eeh_container_op(VFIOContainer *container,
>> uint32_t op)
>> +static int vfio_eeh_container_op(VFIOLegacyContainer *container,
>> uint32_t op)
>>   {
>>       struct vfio_eeh_pe_op pe_op = {
>>           .argsz = sizeof(pe_op),
>> @@ -1041,19 +1065,21 @@ static int
>> vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
>>       return ret;
>>   }
>>   -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
>> +static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
>>   {
>>       VFIOAddressSpace *space = vfio_get_address_space(as);
>> -    VFIOContainer *container = NULL;
>> +    VFIOLegacyContainer *container = NULL;
>> +    VFIOContainer *bcontainer = NULL;
>>         if (QLIST_EMPTY(&space->containers)) {
>>           /* No containers to act on */
>>           goto out;
>>       }
>>   -    container = QLIST_FIRST(&space->containers);
>> +    bcontainer = QLIST_FIRST(&space->containers);
>> +    container = container_of(bcontainer, VFIOLegacyContainer,
>> bcontainer);
>>   -    if (QLIST_NEXT(container, next)) {
>> +    if (QLIST_NEXT(bcontainer, next)) {
>>           /*
>>            * We don't yet have logic to synchronize EEH state across
>>            * multiple containers
>> @@ -1069,14 +1095,14 @@ out:
>>     bool vfio_eeh_as_ok(AddressSpace *as)
>>   {
>> -    VFIOContainer *container = vfio_eeh_as_container(as);
>> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>>         return (container != NULL) && vfio_eeh_container_ok(container);
>>   }
>>     int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>>   {
>> -    VFIOContainer *container = vfio_eeh_as_container(as);
>> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>>         if (!container) {
>>           return -ENODEV;
>> @@ -1110,8 +1136,8 @@ static int vfio_device_groupid(VFIODevice
>> *vbasedev, Error **errp)
>>       return groupid;
>>   }
>>   -int vfio_attach_device(char *name, VFIODevice *vbasedev,
>> -                       AddressSpace *as, Error **errp)
>> +static int vfio_legacy_attach_device(char *name, VFIODevice *vbasedev,
>> +                                     AddressSpace *as, Error **errp)
>>   {
>>       int groupid = vfio_device_groupid(vbasedev, errp);
>>       VFIODevice *vbasedev_iter;
>> @@ -1137,15 +1163,46 @@ int vfio_attach_device(char *name, VFIODevice
>> *vbasedev,
>>       ret = vfio_get_device(group, name, vbasedev, errp);
>>       if (ret) {
>>           vfio_put_group(group);
>> +        return ret;
>>       }
>> +    vbasedev->container = &group->container->bcontainer;
>>         return ret;
>>   }
>>   -void vfio_detach_device(VFIODevice *vbasedev)
>> +static void vfio_legacy_detach_device(VFIODevice *vbasedev)
>>   {
>>       VFIOGroup *group = vbasedev->group;
>>         vfio_put_base_device(vbasedev);
>>       vfio_put_group(group);
>> +    vbasedev->container = NULL;
>> +}
>> +
>> +static void vfio_iommu_backend_legacy_ops_class_init(ObjectClass *oc,
>> +                                                     void *data) {
>> +    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(oc);
>> +
>> +    ops->dev_iter_next = vfio_legacy_dev_iter_next;
>> +    ops->dma_map = vfio_legacy_dma_map;
>> +    ops->dma_unmap = vfio_legacy_dma_unmap;
>> +    ops->attach_device = vfio_legacy_attach_device;
>> +    ops->detach_device = vfio_legacy_detach_device;
>> +    ops->set_dirty_page_tracking = vfio_legacy_set_dirty_page_tracking;
>> +    ops->query_dirty_bitmap = vfio_legacy_query_dirty_bitmap;
>> +    ops->add_window = vfio_legacy_add_section_window;
>> +    ops->del_window = vfio_legacy_del_section_window;
>> +}
>> +
>> +static const TypeInfo vfio_iommu_backend_legacy_ops_type = {
>> +    .name = TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS,
>> +
>> +    .parent = TYPE_VFIO_IOMMU_BACKEND_OPS,
>> +    .class_init = vfio_iommu_backend_legacy_ops_class_init,
>> +    .abstract = true,
>> +};
>> +static void vfio_iommu_backend_legacy_ops_register_types(void)
>> +{
>> +    type_register_static(&vfio_iommu_backend_legacy_ops_type);
>>   }
>> +type_init(vfio_iommu_backend_legacy_ops_register_types);
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index 2a6912c940..eb6ce6229d 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>>   vfio_ss.add(files(
>>     'helpers.c',
>>     'common.c',
>> +  'container-base.c',
>>     'container.c',
>>     'spapr.c',
>>     'migration.c',
>> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
>> index 9ec1e95f6d..7647e7d492 100644
>> --- a/hw/vfio/spapr.c
>> +++ b/hw/vfio/spapr.c
>> @@ -39,8 +39,8 @@ static void
>> *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
>>   static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>                                               MemoryRegionSection
>> *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            prereg_listener);
>> +    VFIOLegacyContainer *container = container_of(listener,
>> VFIOLegacyContainer,
>> +                                                  prereg_listener);
>>       const hwaddr gpa = section->offset_within_address_space;
>>       hwaddr end;
>>       int ret;
>> @@ -83,9 +83,9 @@ static void
>> vfio_prereg_listener_region_add(MemoryListener *listener,
>>            * can gracefully fail.  Runtime, there's not much we can
>> do other
>>            * than throw a hardware error.
>>            */
>> -        if (!container->initialized) {
>> -            if (!container->error) {
>> -                error_setg_errno(&container->error, -ret,
>> +        if (!container->bcontainer.initialized) {
>> +            if (!container->bcontainer.error) {
>> +                error_setg_errno(&container->bcontainer.error, -ret,
>>                                    "Memory registering failed");
>>               }
>>           } else {
>> @@ -97,8 +97,8 @@ static void
>> vfio_prereg_listener_region_add(MemoryListener *listener,
>>   static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>                                               MemoryRegionSection
>> *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            prereg_listener);
>> +    VFIOLegacyContainer *container = container_of(listener,
>> VFIOLegacyContainer,
>> +                                                  prereg_listener);
>>       const hwaddr gpa = section->offset_within_address_space;
>>       hwaddr end;
>>       int ret;
>> @@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
>>       .region_del = vfio_prereg_listener_region_del,
>>   };
>>   -int vfio_spapr_create_window(VFIOContainer *container,
>> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>>                                MemoryRegionSection *section,
>>                                hwaddr *pgsize)
>>   {
>> @@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer
>> *container,
>>       if (pagesize > rampagesize) {
>>           pagesize = rampagesize;
>>       }
>> -    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
>> +    pgmask = container->bcontainer.pgsizes & (pagesize | (pagesize -
>> 1));
>>       pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
>>       if (!pagesize) {
>>           error_report("Host doesn't support page size 0x%"PRIx64
>>                        ", the supported mask is 0x%lx",
>>                        memory_region_iommu_get_min_page_size(iommu_mr),
>> -                     container->pgsizes);
>> +                     container->bcontainer.pgsizes);
>>           return -EINVAL;
>>       }
>>   @@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer
>> *container,
>>       return 0;
>>   }
>>   -int vfio_spapr_remove_window(VFIOContainer *container,
>> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>>                                hwaddr offset_within_address_space)
>>   {
>>       struct vfio_iommu_spapr_tce_remove remove = {
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index bd32970854..1692bcd8f1 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -119,8 +119,8 @@ vfio_region_unmap(const char *name, unsigned long
>> offset, unsigned long end) "Re
>>   vfio_region_sparse_mmap_header(const char *name, int index, int
>> nr_areas) "Device %s region %d: %d sparse mmap entries"
>>   vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>   vfio_get_dev_region(const char *name, int index, uint32_t type,
>> uint32_t subtype) "%s index %d, %08x/%08x"
>> -vfio_dma_unmap_overflow_workaround(void) ""
>> -vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t
>> bitmap_size, uint64_t start, uint64_t dirty_pages) "container fd=%d,
>> iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64"
>> start=0x%"PRIx64" dirty_pages=%"PRIu64
>> +vfio_legacy_dma_unmap_overflow_workaround(void) ""
>> +vfio_get_dirty_bitmap(uint64_t iova, uint64_t size, uint64_t
>> bitmap_size, uint64_t start, uint64_t dirty_pages) "iova=0x%"PRIx64"
>> size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64"
>> dirty_pages=%"PRIu64
>>   vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end)
>> "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
>>     # platform.c
>> diff --git a/include/hw/vfio/vfio-common.h
>> b/include/hw/vfio/vfio-common.h
>> index 95bcafdaf6..b1a76dcc9c 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -30,6 +30,7 @@
>>   #include <linux/vfio.h>
>>   #endif
>>   #include "sysemu/sysemu.h"
>> +#include "hw/vfio/vfio-container-base.h"
>>     #define VFIO_MSG_PREFIX "vfio %s: "
>>   @@ -74,64 +75,22 @@ typedef struct VFIOMigration {
>>       bool initial_data_sent;
>>   } VFIOMigration;
>>   -typedef struct VFIOAddressSpace {
>> -    AddressSpace *as;
>> -    QLIST_HEAD(, VFIOContainer) containers;
>> -    QLIST_ENTRY(VFIOAddressSpace) list;
>> -} VFIOAddressSpace;
>> -
>>   struct VFIOGroup;
>>   -typedef struct VFIOContainer {
>> -    VFIOAddressSpace *space;
>> +typedef struct VFIOLegacyContainer {
>> +    VFIOContainer bcontainer;
>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> -    MemoryListener listener;
>>       MemoryListener prereg_listener;
>>       unsigned iommu_type;
>> -    Error *error;
>> -    bool initialized;
>> -    bool dirty_pages_supported;
>> -    uint64_t dirty_pgsizes;
>> -    uint64_t max_dirty_bitmap_size;
>> -    unsigned long pgsizes;
>> -    unsigned int dma_max_mappings;
>> -    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>> -    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>       QLIST_HEAD(, VFIOGroup) group_list;
>> -    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>> -    QLIST_ENTRY(VFIOContainer) next;
>> -} VFIOContainer;
>> -
>> -typedef struct VFIOGuestIOMMU {
>> -    VFIOContainer *container;
>> -    IOMMUMemoryRegion *iommu_mr;
>> -    hwaddr iommu_offset;
>> -    IOMMUNotifier n;
>> -    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> -} VFIOGuestIOMMU;
>> -
>> -typedef struct VFIORamDiscardListener {
>> -    VFIOContainer *container;
>> -    MemoryRegion *mr;
>> -    hwaddr offset_within_address_space;
>> -    hwaddr size;
>> -    uint64_t granularity;
>> -    RamDiscardListener listener;
>> -    QLIST_ENTRY(VFIORamDiscardListener) next;
>> -} VFIORamDiscardListener;
>> -
>> -typedef struct VFIOHostDMAWindow {
>> -    hwaddr min_iova;
>> -    hwaddr max_iova;
>> -    uint64_t iova_pgsizes;
>> -    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
>> -} VFIOHostDMAWindow;
>> +} VFIOLegacyContainer;
>>     typedef struct VFIODeviceOps VFIODeviceOps;
>>     typedef struct VFIODevice {
>>       QLIST_ENTRY(VFIODevice) next;
>>       struct VFIOGroup *group;
>> +    VFIOContainer *container;
>>       char *sysfsdev;
>>       char *name;
>>       DeviceState *dev;
>> @@ -165,7 +124,7 @@ struct VFIODeviceOps {
>>   typedef struct VFIOGroup {
>>       int fd;
>>       int groupid;
>> -    VFIOContainer *container;
>> +    VFIOLegacyContainer *container;
>>       QLIST_HEAD(, VFIODevice) device_list;
>>       QLIST_ENTRY(VFIOGroup) next;
>>       QLIST_ENTRY(VFIOGroup) container_next;
>> @@ -198,37 +157,13 @@ typedef struct VFIODisplay {
>>       } dmabuf;
>>   } VFIODisplay;
>>   -typedef struct {
>> -    unsigned long *bitmap;
>> -    hwaddr size;
>> -    hwaddr pages;
>> -} VFIOBitmap;
>> -
>> -void vfio_host_win_add(VFIOContainer *container,
>> +void vfio_host_win_add(VFIOContainer *bcontainer,
>>                          hwaddr min_iova, hwaddr max_iova,
>>                          uint64_t iova_pgsizes);
>> -int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
>> +int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
>>                         hwaddr max_iova);
>>   VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
>>   void vfio_put_address_space(VFIOAddressSpace *space);
>> -bool vfio_devices_all_running_and_saving(VFIOContainer *container);
>> -
>> -/* container->fd */
>> -VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> -                                         VFIODevice *curr);
>> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>> -                   ram_addr_t size, IOMMUTLBEntry *iotlb);
>> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> -                 ram_addr_t size, void *vaddr, bool readonly);
>> -int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
>> -int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap
>> *vbmap,
>> -                            hwaddr iova, hwaddr size);
>> -
>> -int vfio_container_add_section_window(VFIOContainer *container,
>> -                                      MemoryRegionSection *section,
>> -                                      Error **errp);
>> -void vfio_container_del_section_window(VFIOContainer *container,
>> -                                       MemoryRegionSection *section);
>>     void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>>   void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
>> @@ -285,10 +220,10 @@ vfio_get_cap(void *ptr, uint32_t cap_offset,
>> uint16_t id);
>>   #endif
>>   extern const MemoryListener vfio_prereg_listener;
>>   -int vfio_spapr_create_window(VFIOContainer *container,
>> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>>                                MemoryRegionSection *section,
>>                                hwaddr *pgsize);
>> -int vfio_spapr_remove_window(VFIOContainer *container,
>> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>>                                hwaddr offset_within_address_space);
>>     bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
>> diff --git a/include/hw/vfio/vfio-container-base.h
>> b/include/hw/vfio/vfio-container-base.h
>> new file mode 100644
>> index 0000000000..b18fa92146
>> --- /dev/null
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -0,0 +1,155 @@
>> +/*
>> + * VFIO BASE CONTAINER
>> + *
>> + * Copyright (C) 2023 Intel Corporation.
>> + * Copyright Red Hat, Inc. 2023
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *          Eric Auger <eric.auger@redhat.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License
>> along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_VFIO_VFIO_BASE_CONTAINER_H
>> +#define HW_VFIO_VFIO_BASE_CONTAINER_H
>> +
>> +#include "exec/memory.h"
>> +#ifndef CONFIG_USER_ONLY
>> +#include "exec/hwaddr.h"
>> +#endif
>> +
>> +typedef struct VFIOContainer VFIOContainer;
>> +
>> +typedef struct VFIOAddressSpace {
>> +    AddressSpace *as;
>> +    QLIST_HEAD(, VFIOContainer) containers;
>> +    QLIST_ENTRY(VFIOAddressSpace) list;
>> +} VFIOAddressSpace;
>> +
>> +typedef struct VFIOGuestIOMMU {
>> +    VFIOContainer *container;
>> +    IOMMUMemoryRegion *iommu_mr;
>> +    hwaddr iommu_offset;
>> +    IOMMUNotifier n;
>> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> +} VFIOGuestIOMMU;
>> +
>> +typedef struct VFIORamDiscardListener {
>> +    VFIOContainer *container;
>> +    MemoryRegion *mr;
>> +    hwaddr offset_within_address_space;
>> +    hwaddr size;
>> +    uint64_t granularity;
>> +    RamDiscardListener listener;
>> +    QLIST_ENTRY(VFIORamDiscardListener) next;
>> +} VFIORamDiscardListener;
>> +
>> +typedef struct VFIOHostDMAWindow {
>> +    hwaddr min_iova;
>> +    hwaddr max_iova;
>> +    uint64_t iova_pgsizes;
>> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
>> +} VFIOHostDMAWindow;
>> +
>> +typedef struct {
>> +    unsigned long *bitmap;
>> +    hwaddr size;
>> +    hwaddr pages;
>> +} VFIOBitmap;
>> +
>> +typedef struct VFIODevice VFIODevice;
>> +typedef struct VFIOIOMMUBackendOpsClass VFIOIOMMUBackendOpsClass;
>> +
>> +/*
>> + * This is the base object for vfio container backends
>> + */
>> +struct VFIOContainer {
>> +    VFIOIOMMUBackendOpsClass *ops;
>> +    VFIOAddressSpace *space;
>> +    MemoryListener listener;
>> +    Error *error;
>> +    bool initialized;
>> +    bool dirty_pages_supported;
>> +    uint64_t dirty_pgsizes;
>> +    uint64_t max_dirty_bitmap_size;
>> +    unsigned long pgsizes;
>> +    unsigned int dma_max_mappings;
>> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>> +    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>> +    QLIST_ENTRY(VFIOContainer) next;
>> +};
>> +
>> +VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> +                                 VFIODevice *curr);
>> +int vfio_container_dma_map(VFIOContainer *container,
>> +                           hwaddr iova, ram_addr_t size,
>> +                           void *vaddr, bool readonly);
>> +int vfio_container_dma_unmap(VFIOContainer *container,
>> +                             hwaddr iova, ram_addr_t size,
>> +                             IOMMUTLBEntry *iotlb);
>> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer
>> *container);
>> +int vfio_container_set_dirty_page_tracking(VFIOContainer *container,
>> +                                            bool start);
>> +int vfio_container_query_dirty_bitmap(VFIOContainer *container,
>> +                                      VFIOBitmap *vbmap,
>> +                                      hwaddr iova, hwaddr size);
>> +int vfio_container_add_section_window(VFIOContainer *container,
>> +                                      MemoryRegionSection *section,
>> +                                      Error **errp);
>> +void vfio_container_del_section_window(VFIOContainer *container,
>> +                                       MemoryRegionSection *section);
>> +
>> +void vfio_container_init(VFIOContainer *container,
>> +                         VFIOAddressSpace *space,
>> +                         struct VFIOIOMMUBackendOpsClass *ops);
>> +void vfio_container_destroy(VFIOContainer *container);
>> +
>> +#define TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS
>> "vfio-iommu-backend-legacy-ops"
>> +#define TYPE_VFIO_IOMMU_BACKEND_OPS "vfio-iommu-backend-ops"
>> +
>> +DECLARE_CLASS_CHECKERS(VFIOIOMMUBackendOpsClass,
>> +                       VFIO_IOMMU_BACKEND_OPS,
>> TYPE_VFIO_IOMMU_BACKEND_OPS)
>> +
>> +struct VFIOIOMMUBackendOpsClass {
>> +    /*< private >*/
>> +    ObjectClass parent_class;
>> +
>> +    /*< public >*/
>> +    /* required */
>> +    VFIODevice *(*dev_iter_next)(VFIOContainer *container,
>> VFIODevice *curr);
>> +    int (*dma_map)(VFIOContainer *container,
>> +                   hwaddr iova, ram_addr_t size,
>> +                   void *vaddr, bool readonly);
>> +    int (*dma_unmap)(VFIOContainer *container,
>> +                     hwaddr iova, ram_addr_t size,
>> +                     IOMMUTLBEntry *iotlb);
>> +    int (*attach_device)(char *name, VFIODevice *vbasedev,
>> +                         AddressSpace *as, Error **errp);
>> +    void (*detach_device)(VFIODevice *vbasedev);
>> +    /* migration feature */
>> +    int (*set_dirty_page_tracking)(VFIOContainer *container, bool
>> start);
>> +    int (*query_dirty_bitmap)(VFIOContainer *bcontainer, VFIOBitmap
>> *vbmap,
>> +                              hwaddr iova, hwaddr size);
>> +
>> +    /* SPAPR specific */
>> +    int (*add_window)(VFIOContainer *container,
>> +                      MemoryRegionSection *section,
>> +                      Error **errp);
>> +    void (*del_window)(VFIOContainer *container,
>> +                       MemoryRegionSection *section);
>> +};
>> +
>> +
>> +#endif /* HW_VFIO_VFIO_BASE_CONTAINER_H */
>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 13/22] vfio: Add base container
  2023-09-20 12:57       ` Cédric Le Goater
@ 2023-09-20 13:58         ` Eric Auger
  2023-09-21  2:51         ` Duan, Zhenzhong
  1 sibling, 0 replies; 109+ messages in thread
From: Eric Auger @ 2023-09-20 13:58 UTC (permalink / raw)
  To: Cédric Le Goater, Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, peterx, jasowang,
	Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P, Yi Sun,
	Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)

Hi Cédric,

On 9/20/23 14:57, Cédric Le Goater wrote:
> On 9/20/23 10:48, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Sent: Wednesday, September 20, 2023 1:24 AM
>>> Subject: Re: [PATCH v1 13/22] vfio: Add base container
>>>
>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> Abstract the VFIOContainer to be a base object. It is supposed to be
>>>> embedded by legacy VFIO container and later on, into the new iommufd
>>>> based container.
>>>>
>>>> The base container implements generic code such as code related to
>>>> memory_listener and address space management. The VFIOContainerOps
>>>> implements callbacks that depend on the kernel user space being used.
>>>>
>>>> 'common.c' and vfio device code only manipulates the base container
>>>> with
>>>> wrapper functions that calls the functions defined in
>>>> VFIOContainerOpsClass.
>>>> Existing 'container.c' code is converted to implement the legacy
>>>> container
>>>> ops functions.
>>>>
>>>> Below is the base container. It's named as VFIOContainer, old
>>>> VFIOContainer
>>>> is replaced with VFIOLegacyContainer.
>>>
>>> Usually, we introduce the new interface on its own, port the current
>>> models on top of the new interface, wire the new models into the current
>>> implementation and remove the old implementation. Then, we can start
>>> adding extensions to support other implementations.
>>
>> Not sure if I understand your point correctly. Do you mean to introduce
>> a new type for the base container as below:
>>
>> static const TypeInfo vfio_container_info = {
>>      .parent             = TYPE_OBJECT,
>>      .name               = TYPE_VFIO_CONTAINER,
>>      .class_size         = sizeof(VFIOContainerClass),
>>      .instance_size      = sizeof(VFIOContainer),
>>      .abstract           = true,
>>      .interfaces = (InterfaceInfo[]) {
>>          { TYPE_VFIO_IOMMU_BACKEND_OPS },
>>          { }
>>      }
>> };
>>
>> and a new interface as below:
>>
>> static const TypeInfo vfio_iommu_backend_ops_info = {
>>      .name = TYPE_VFIO_IOMMU_BACKEND_OPS,
>>      .parent = TYPE_INTERFACE,
>>      .class_size = sizeof(VFIOIOMMUBackendOpsClass),
>> };
>>
>> struct VFIOIOMMUBackendOpsClass {
>>      InterfaceClass parent;
>>      VFIODevice *(*dev_iter_next)(VFIOContainer *container,
>> VFIODevice *curr);
>>      int (*dma_map)(VFIOContainer *container,
>>      ......
>> };
>>
>> and legacy container on top of TYPE_VFIO_CONTAINER?
>>
>> static const TypeInfo vfio_legacy_container_info = {
>>      .parent = TYPE_VFIO_CONTAINER,
>>      .name = TYPE_VFIO_LEGACY_CONTAINER,
>>      .class_init = vfio_legacy_container_class_init,
>> };
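>>
>> (For illustration, vfio_legacy_container_class_init above would then
>> wire the interface callbacks, roughly along these lines; the
>> vfio_legacy_* callbacks are placeholder names:
>>
>> static void vfio_legacy_container_class_init(ObjectClass *klass, void *data)
>> {
>>     /* fetch the interface class embedded in this container class */
>>     VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(klass);
>>
>>     ops->dma_map = vfio_legacy_dma_map;
>>     ops->dma_unmap = vfio_legacy_dma_unmap;
>>     ops->attach_device = vfio_legacy_attach_device;
>>     ops->detach_device = vfio_legacy_detach_device;
>> }
>> )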
>>
>> This object style was rejected early on, at the RFCv1 stage.
>> See
>> https://lore.kernel.org/kvm/20220414104710.28534-8-yi.l.liu@intel.com/
>
> ouch. That was long ago and I was not aware :/ Bear with me, I will
> probably ask the same questions. Nevertheless, we could improve the
> cover letter and the flow of changes in the patchset to help the reader.
>
>>> spapr should be taken care of separately, following the principle above.
>>> With my PPC hat on, I would not even read such a massive change, too risky
>>> for the subsystem. This patch will need (much) further splitting to be
>>> understandable and acceptable.
>>
>> I'll dig into this and try to split it.
>
> I know I am asking for a lot of work. Thanks for that.
>
>> Meanwhile, there are many changes that just rename a parameter or
>> function name for code readability.
>> For example:
>>
>> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>> -                   ram_addr_t size, IOMMUTLBEntry *iotlb)
>> +static int vfio_legacy_dma_unmap(VFIOContainer *bcontainer, hwaddr
>> iova,
>> +                          ram_addr_t size, IOMMUTLBEntry *iotlb)
>>
>> -        ret = vfio_get_dirty_bitmap(container, iova, size,
>> +        ret = vfio_get_dirty_bitmap(bcontainer, iova, size,
>>
>> Let me know if you think such changes are unnecessary; dropping them
>> would shrink this patch considerably.
>
> Cleanups, renames, some code reshuffling, anything preparing the ground
> for the new abstraction is good to have first and can be merged very
> quickly if there are no functional changes. It reduces the overall
> patchset and eases the coming reviews.
>
> You can send such series independently. That's fine.
>
>>
>>>
>>> Also, please include the .h file first, it helps in reading.
>>
>> Do you mean to put the struct declaration earlier in the patch description?
>
> Just add this to your .gitconfig:
>
> [diff]
>     orderFile = /path/to/qemu/scripts/git.orderfile
>
> It should be enough
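>
> For a one-off run, the order file can also be passed directly as a diff
> option, which should be equivalent to the diff.orderFile setting:
>
>     git diff -O scripts/git.orderfile
>     git format-patch -O scripts/git.orderfile -1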
>
>>> Have you considered using an InterfaceClass ?
>>
>> See above; with the object style rejected, it looks hard to use an
>> InterfaceClass.
>
> I am not convinced by the QOM approach. I will dig into the past
> arguments and see what we come up with.

Here is the reference:
https://lore.kernel.org/all/YmuFv2s5TPuw7K%2Fu@yekko/

Eric

>
> Thanks,
>
> C.
>
>
>> Thanks
>> Zhenzhong
>>
>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 13/22] vfio: Add base container
  2023-09-19 17:23   ` Cédric Le Goater
  2023-09-20  8:48     ` Duan, Zhenzhong
  2023-09-20 13:53     ` Eric Auger
@ 2023-09-20 17:31     ` Eric Auger
  2023-09-21  3:35       ` Duan, Zhenzhong
  2 siblings, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 17:31 UTC (permalink / raw)
  To: Cédric Le Goater, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, peterx, jasowang,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, Yi Sun,
	Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)

Hi Zhenzhong,

On 9/19/23 19:23, Cédric Le Goater wrote:
> On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Abstract the VFIOContainer to be a base object. It is supposed to be
>> embedded by legacy VFIO container and later on, into the new iommufd
>> based container.
>>
>> The base container implements generic code such as code related to
>> memory_listener and address space management. The VFIOContainerOps
>> implements callbacks that depend on the kernel user space being used.
>>
>> 'common.c' and vfio device code only manipulates the base container with
>> wrapper functions that calls the functions defined in
>> VFIOContainerOpsClass.
>> Existing 'container.c' code is converted to implement the legacy
>> container
>> ops functions.
>>
>> Below is the base container. It's named as VFIOContainer, old
>> VFIOContainer
>> is replaced with VFIOLegacyContainer.
>
> Usually, we introduce the new interface on its own, port the current
> models on top of the new interface, wire the new models into the current
> implementation and remove the old implementation. Then, we can start
> adding extensions to support other implementations.
>
> spapr should be taken care of separately, following the principle above.
> With my PPC hat on, I would not even read such a massive change, too risky
> for the subsystem. This patch will need (much) further splitting to be
> understandable and acceptable.
We might split this patch by:
1) introducing VFIOLegacyContainer encapsulating the base VFIOContainer,
without using the ops in the first place: common.c would call the
vfio_container_* wrappers with a hardcoded legacy implementation,
i.e. retrieving the legacy container with container_of (see the rough
sketch below);
2) introducing the BE interface without using it;
3) switching over to the new BE interface.

Obviously this needs to be further tried out. If you wish, I can try to
split it that way ... Please let me know.
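
A rough sketch of what the transitional wrapper in 1) could look like
(purely illustrative; it assumes vfio_legacy_dma_map() still takes the
legacy container at that stage):

int vfio_container_dma_map(VFIOContainer *bcontainer, hwaddr iova,
                           ram_addr_t size, void *vaddr, bool readonly)
{
    /* no ops indirection yet: the legacy backend is the only one */
    VFIOLegacyContainer *container = container_of(bcontainer,
                                                  VFIOLegacyContainer,
                                                  bcontainer);

    return vfio_legacy_dma_map(container, iova, size, vaddr, readonly);
}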

Eric

>
> Also, please include the .h file first, it helps in reading. Have you
> considered using an InterfaceClass ?
>
> Thanks,
>
> C.
>
>>
>> struct VFIOContainer {
>>      VFIOIOMMUBackendOpsClass *ops;
>>      VFIOAddressSpace *space;
>>      MemoryListener listener;
>>      Error *error;
>>      bool initialized;
>>      bool dirty_pages_supported;
>>      uint64_t dirty_pgsizes;
>>      uint64_t max_dirty_bitmap_size;
>>      unsigned long pgsizes;
>>      unsigned int dma_max_mappings;
>>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>      QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>>      QLIST_ENTRY(VFIOContainer) next;
>> };
>>
>> struct VFIOLegacyContainer {
>>      VFIOContainer bcontainer;
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      MemoryListener prereg_listener;
>>      unsigned iommu_type;
>>      QLIST_HEAD(, VFIOGroup) group_list;
>> };
>>
>> Co-authored-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/vfio/common.c                      |  72 +++++---
>>   hw/vfio/container-base.c              | 160 +++++++++++++++++
>>   hw/vfio/container.c                   | 247 ++++++++++++++++----------
>>   hw/vfio/meson.build                   |   1 +
>>   hw/vfio/spapr.c                       |  22 +--
>>   hw/vfio/trace-events                  |   4 +-
>>   include/hw/vfio/vfio-common.h         |  85 ++-------
>>   include/hw/vfio/vfio-container-base.h | 155 ++++++++++++++++
>>   8 files changed, 540 insertions(+), 206 deletions(-)
>>   create mode 100644 hw/vfio/container-base.c
>>   create mode 100644 include/hw/vfio/vfio-container-base.h
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 044710fc1f..86b6af5740 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -379,19 +379,20 @@ static void vfio_iommu_map_notify(IOMMUNotifier
>> *n, IOMMUTLBEntry *iotlb)
>>            * of vaddr will always be there, even if the memory object is
>>            * destroyed and its backing memory munmap-ed.
>>            */
>> -        ret = vfio_dma_map(container, iova,
>> -                           iotlb->addr_mask + 1, vaddr,
>> -                           read_only);
>> +        ret = vfio_container_dma_map(container, iova,
>> +                                     iotlb->addr_mask + 1, vaddr,
>> +                                     read_only);
>>           if (ret) {
>> -            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> +            error_report("vfio_container_dma_map(%p,
>> 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx", %p) = %d (%s)",
>>                            container, iova,
>>                            iotlb->addr_mask + 1, vaddr, ret,
>> strerror(-ret));
>>           }
>>       } else {
>> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1,
>> iotlb);
>> +        ret = vfio_container_dma_unmap(container, iova,
>> +                                       iotlb->addr_mask + 1, iotlb);
>>           if (ret) {
>> -            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +            error_report("vfio_container_dma_unmap(%p,
>> 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%s)",
>>                            container, iova,
>>                            iotlb->addr_mask + 1, ret, strerror(-ret));
>> @@ -407,14 +408,15 @@ static void
>> vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>>   {
>>       VFIORamDiscardListener *vrdl = container_of(rdl,
>> VFIORamDiscardListener,
>>                                                   listener);
>> +    VFIOContainer *container = vrdl->container;
>>       const hwaddr size = int128_get64(section->size);
>>       const hwaddr iova = section->offset_within_address_space;
>>       int ret;
>>         /* Unmap with a single call. */
>> -    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
>> +    ret = vfio_container_dma_unmap(container, iova, size , NULL);
>>       if (ret) {
>> -        error_report("%s: vfio_dma_unmap() failed: %s", __func__,
>> +        error_report("%s: vfio_container_dma_unmap() failed: %s",
>> __func__,
>>                        strerror(-ret));
>>       }
>>   }
>> @@ -424,6 +426,7 @@ static int
>> vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>>   {
>>       VFIORamDiscardListener *vrdl = container_of(rdl,
>> VFIORamDiscardListener,
>>                                                   listener);
>> +    VFIOContainer *container = vrdl->container;
>>       const hwaddr end = section->offset_within_region +
>>                          int128_get64(section->size);
>>       hwaddr start, next, iova;
>> @@ -442,8 +445,8 @@ static int
>> vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>>                  section->offset_within_address_space;
>>           vaddr = memory_region_get_ram_ptr(section->mr) + start;
>>   -        ret = vfio_dma_map(vrdl->container, iova, next - start,
>> -                           vaddr, section->readonly);
>> +        ret = vfio_container_dma_map(container, iova, next - start,
>> +                                     vaddr, section->readonly);
>>           if (ret) {
>>               /* Rollback */
>>               vfio_ram_discard_notify_discard(rdl, section);
>> @@ -756,10 +759,10 @@ static void
>> vfio_listener_region_add(MemoryListener *listener,
>>           }
>>       }
>>   -    ret = vfio_dma_map(container, iova, int128_get64(llsize),
>> -                       vaddr, section->readonly);
>> +    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
>> +                                 vaddr, section->readonly);
>>       if (ret) {
>> -        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> +        error_setg(&err, "vfio_container_dma_map(%p,
>> 0x%"HWADDR_PRIx", "
>>                      "0x%"HWADDR_PRIx", %p) = %d (%s)",
>>                      container, iova, int128_get64(llsize), vaddr, ret,
>>                      strerror(-ret));
>> @@ -775,7 +778,7 @@ static void
>> vfio_listener_region_add(MemoryListener *listener,
>>     fail:
>>       if (memory_region_is_ram_device(section->mr)) {
>> -        error_report("failed to vfio_dma_map. pci p2p may not work");
>> +        error_report("failed to vfio_container_dma_map. pci p2p may
>> not work");
>>           return;
>>       }
>>       /*
>> @@ -860,18 +863,20 @@ static void
>> vfio_listener_region_del(MemoryListener *listener,
>>           if (int128_eq(llsize, int128_2_64())) {
>>               /* The unmap ioctl doesn't accept a full 64-bit span. */
>>               llsize = int128_rshift(llsize, 1);
>> -            ret = vfio_dma_unmap(container, iova,
>> int128_get64(llsize), NULL);
>> +            ret = vfio_container_dma_unmap(container, iova,
>> +                                           int128_get64(llsize), NULL);
>>               if (ret) {
>> -                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +                error_report("vfio_container_dma_unmap(%p,
>> 0x%"HWADDR_PRIx", "
>>                                "0x%"HWADDR_PRIx") = %d (%s)",
>>                                container, iova, int128_get64(llsize),
>> ret,
>>                                strerror(-ret));
>>               }
>>               iova += int128_get64(llsize);
>>           }
>> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize),
>> NULL);
>> +        ret = vfio_container_dma_unmap(container, iova,
>> +                                       int128_get64(llsize), NULL);
>>           if (ret) {
>> -            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> +            error_report("vfio_container_dma_unmap(%p,
>> 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%s)",
>>                            container, iova, int128_get64(llsize), ret,
>>                            strerror(-ret));
>> @@ -1103,7 +1108,7 @@ static void
>> vfio_listener_log_global_start(MemoryListener *listener)
>>       if (vfio_devices_all_device_dirty_tracking(container)) {
>>           ret = vfio_devices_dma_logging_start(container);
>>       } else {
>> -        ret = vfio_set_dirty_page_tracking(container, true);
>> +        ret = vfio_container_set_dirty_page_tracking(container, true);
>>       }
>>         if (ret) {
>> @@ -1121,7 +1126,7 @@ static void
>> vfio_listener_log_global_stop(MemoryListener *listener)
>>       if (vfio_devices_all_device_dirty_tracking(container)) {
>>           vfio_devices_dma_logging_stop(container);
>>       } else {
>> -        ret = vfio_set_dirty_page_tracking(container, false);
>> +        ret = vfio_container_set_dirty_page_tracking(container, false);
>>       }
>>         if (ret) {
>> @@ -1204,7 +1209,7 @@ int vfio_get_dirty_bitmap(VFIOContainer
>> *container, uint64_t iova,
>>       if (all_device_dirty_tracking) {
>>           ret = vfio_devices_query_dirty_bitmap(container, &vbmap,
>> iova, size);
>>       } else {
>> -        ret = vfio_query_dirty_bitmap(container, &vbmap, iova, size);
>> +        ret = vfio_container_query_dirty_bitmap(container, &vbmap,
>> iova, size);
>>       }
>>         if (ret) {
>> @@ -1214,8 +1219,7 @@ int vfio_get_dirty_bitmap(VFIOContainer
>> *container, uint64_t iova,
>>       dirty_pages =
>> cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr,
>>                                                            vbmap.pages);
>>   -    trace_vfio_get_dirty_bitmap(container->fd, iova, size,
>> vbmap.size,
>> -                                ram_addr, dirty_pages);
>> +    trace_vfio_get_dirty_bitmap(iova, size, vbmap.size, ram_addr,
>> dirty_pages);
>>   out:
>>       g_free(vbmap.bitmap);
>>   @@ -1525,3 +1529,25 @@ retry:
>>         return info;
>>   }
>> +
>> +int vfio_attach_device(char *name, VFIODevice *vbasedev,
>> +                       AddressSpace *as, Error **errp)
>> +{
>> +    const VFIOIOMMUBackendOpsClass *ops;
>> +
>> +    ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
>> +                 
>> object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
>> +    if (!ops) {
>> +        error_setg(errp, "VFIO IOMMU Backend not found!");
>> +        return -ENODEV;
>> +    }
>> +    return ops->attach_device(name, vbasedev, as, errp);
>> +}
>> +
>> +void vfio_detach_device(VFIODevice *vbasedev)
>> +{
>> +    if (!vbasedev->container) {
>> +        return;
>> +    }
>> +    vbasedev->container->ops->detach_device(vbasedev);
>> +}
>> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>> new file mode 100644
>> index 0000000000..876e95c6dd
>> --- /dev/null
>> +++ b/hw/vfio/container-base.c
>> @@ -0,0 +1,160 @@
>> +/*
>> + * VFIO BASE CONTAINER
>> + *
>> + * Copyright (C) 2023 Intel Corporation.
>> + * Copyright Red Hat, Inc. 2023
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *          Eric Auger <eric.auger@redhat.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License
>> along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/vfio/vfio-container-base.h"
>> +
>> +VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> +                                 VFIODevice *curr)
>> +{
>> +    if (!container->ops->dev_iter_next) {
>> +        return NULL;
>> +    }
>> +
>> +    return container->ops->dev_iter_next(container, curr);
>> +}
>> +
>> +int vfio_container_dma_map(VFIOContainer *container,
>> +                           hwaddr iova, ram_addr_t size,
>> +                           void *vaddr, bool readonly)
>> +{
>> +    if (!container->ops->dma_map) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return container->ops->dma_map(container, iova, size, vaddr,
>> readonly);
>> +}
>> +
>> +int vfio_container_dma_unmap(VFIOContainer *container,
>> +                             hwaddr iova, ram_addr_t size,
>> +                             IOMMUTLBEntry *iotlb)
>> +{
>> +    if (!container->ops->dma_unmap) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return container->ops->dma_unmap(container, iova, size, iotlb);
>> +}
>> +
>> +int vfio_container_set_dirty_page_tracking(VFIOContainer *container,
>> +                                            bool start)
>> +{
>> +    /* Fallback to all pages dirty if dirty page sync isn't
>> supported */
>> +    if (!container->ops->set_dirty_page_tracking) {
>> +        return 0;
>> +    }
>> +
>> +    return container->ops->set_dirty_page_tracking(container, start);
>> +}
>> +
>> +int vfio_container_query_dirty_bitmap(VFIOContainer *container,
>> +                                      VFIOBitmap *vbmap,
>> +                                      hwaddr iova, hwaddr size)
>> +{
>> +    if (!container->ops->query_dirty_bitmap) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return container->ops->query_dirty_bitmap(container, vbmap,
>> iova, size);
>> +}
>> +
>> +int vfio_container_add_section_window(VFIOContainer *container,
>> +                                      MemoryRegionSection *section,
>> +                                      Error **errp)
>> +{
>> +    if (!container->ops->add_window) {
>> +        return 0;
>> +    }
>> +
>> +    return container->ops->add_window(container, section, errp);
>> +}
>> +
>> +void vfio_container_del_section_window(VFIOContainer *container,
>> +                                       MemoryRegionSection *section)
>> +{
>> +    if (!container->ops->del_window) {
>> +        return;
>> +    }
>> +
>> +    return container->ops->del_window(container, section);
>> +}
>> +
>> +void vfio_container_init(VFIOContainer *container,
>> +                         VFIOAddressSpace *space,
>> +                         struct VFIOIOMMUBackendOpsClass *ops)
>> +{
>> +    container->ops = ops;
>> +    container->space = space;
>> +    container->error = NULL;
>> +    container->dirty_pages_supported = false;
>> +    container->dma_max_mappings = 0;
>> +    QLIST_INIT(&container->giommu_list);
>> +    QLIST_INIT(&container->hostwin_list);
>> +    QLIST_INIT(&container->vrdl_list);
>> +}
>> +
>> +void vfio_container_destroy(VFIOContainer *container)
>> +{
>> +    VFIORamDiscardListener *vrdl, *vrdl_tmp;
>> +    VFIOGuestIOMMU *giommu, *tmp;
>> +    VFIOHostDMAWindow *hostwin, *next;
>> +
>> +    QLIST_SAFE_REMOVE(container, next);
>> +
>> +    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
>> +        RamDiscardManager *rdm;
>> +
>> +        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
>> +        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
>> +        QLIST_REMOVE(vrdl, next);
>> +        g_free(vrdl);
>> +    }
>> +
>> +    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next,
>> tmp) {
>> +        memory_region_unregister_iommu_notifier(
>> +                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
>> +        QLIST_REMOVE(giommu, giommu_next);
>> +        g_free(giommu);
>> +    }
>> +
>> +    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
>> +                       next) {
>> +        QLIST_REMOVE(hostwin, hostwin_next);
>> +        g_free(hostwin);
>> +    }
>> +}
>> +
>> +static const TypeInfo vfio_iommu_backend_ops_type_info = {
>> +    .name = TYPE_VFIO_IOMMU_BACKEND_OPS,
>> +    .parent = TYPE_OBJECT,
>> +    .abstract = true,
>> +    .class_size = sizeof(VFIOIOMMUBackendOpsClass),
>> +};
>> +
>> +static void vfio_iommu_backend_ops_register_types(void)
>> +{
>> +    type_register_static(&vfio_iommu_backend_ops_type_info);
>> +}
>> +type_init(vfio_iommu_backend_ops_register_types);
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index c71fddc09a..bb29b3612d 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -42,7 +42,8 @@
>>   VFIOGroupList vfio_group_list =
>>       QLIST_HEAD_INITIALIZER(vfio_group_list);
>>   -static int vfio_ram_block_discard_disable(VFIOContainer
>> *container, bool state)
>> +static int vfio_ram_block_discard_disable(VFIOLegacyContainer
>> *container,
>> +                                          bool state)
>>   {
>>       switch (container->iommu_type) {
>>       case VFIO_TYPE1v2_IOMMU:
>> @@ -65,11 +66,18 @@ static int
>> vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>>       }
>>   }
>>   -VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> -                                         VFIODevice *curr)
>> +static VFIODevice *vfio_legacy_dev_iter_next(VFIOContainer *bcontainer,
>> +                                             VFIODevice *curr)
>>   {
>>       VFIOGroup *group;
>>   +    assert(object_class_dynamic_cast(OBJECT_CLASS(bcontainer->ops),
>> +                                    
>> TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
>> +
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>> +
>>       if (!curr) {
>>           group = QLIST_FIRST(&container->group_list);
>>       } else {
>> @@ -85,10 +93,11 @@ VFIODevice
>> *vfio_container_dev_iter_next(VFIOContainer *container,
>>       return QLIST_FIRST(&group->device_list);
>>   }
>>   -static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>> +static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
>>                                    hwaddr iova, ram_addr_t size,
>>                                    IOMMUTLBEntry *iotlb)
>>   {
>> +    VFIOContainer *bcontainer = &container->bcontainer;
>>       struct vfio_iommu_type1_dma_unmap *unmap;
>>       struct vfio_bitmap *bitmap;
>>       VFIOBitmap vbmap;
>> @@ -116,7 +125,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer
>> *container,
>>       bitmap->size = vbmap.size;
>>       bitmap->data = (__u64 *)vbmap.bitmap;
>>   -    if (vbmap.size > container->max_dirty_bitmap_size) {
>> +    if (vbmap.size > bcontainer->max_dirty_bitmap_size) {
>>           error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
>> vbmap.size);
>>           ret = -E2BIG;
>>           goto unmap_exit;
>> @@ -140,9 +149,13 @@ unmap_exit:
>>   /*
>>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used
>> on x86
>>    */
>> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>> -                   ram_addr_t size, IOMMUTLBEntry *iotlb)
>> +static int vfio_legacy_dma_unmap(VFIOContainer *bcontainer, hwaddr
>> iova,
>> +                          ram_addr_t size, IOMMUTLBEntry *iotlb)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>> +
>>       struct vfio_iommu_type1_dma_unmap unmap = {
>>           .argsz = sizeof(unmap),
>>           .flags = 0,
>> @@ -152,9 +165,9 @@ int vfio_dma_unmap(VFIOContainer *container,
>> hwaddr iova,
>>       bool need_dirty_sync = false;
>>       int ret;
>>   -    if (iotlb &&
>> vfio_devices_all_running_and_mig_active(container)) {
>> -        if (!vfio_devices_all_device_dirty_tracking(container) &&
>> -            container->dirty_pages_supported) {
>> +    if (iotlb && vfio_devices_all_running_and_mig_active(bcontainer)) {
>> +        if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
>> +            bcontainer->dirty_pages_supported) {
>>               return vfio_dma_unmap_bitmap(container, iova, size,
>> iotlb);
>>           }
>>   @@ -176,8 +189,8 @@ int vfio_dma_unmap(VFIOContainer *container,
>> hwaddr iova,
>>            */
>>           if (errno == EINVAL && unmap.size && !(unmap.iova +
>> unmap.size) &&
>>               container->iommu_type == VFIO_TYPE1v2_IOMMU) {
>> -            trace_vfio_dma_unmap_overflow_workaround();
>> -            unmap.size -= 1ULL << ctz64(container->pgsizes);
>> +            trace_vfio_legacy_dma_unmap_overflow_workaround();
>> +            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
>>               continue;
>>           }
>>           error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
>> @@ -185,7 +198,7 @@ int vfio_dma_unmap(VFIOContainer *container,
>> hwaddr iova,
>>       }
>>         if (need_dirty_sync) {
>> -        ret = vfio_get_dirty_bitmap(container, iova, size,
>> +        ret = vfio_get_dirty_bitmap(bcontainer, iova, size,
>>                                       iotlb->translated_addr);
>>           if (ret) {
>>               return ret;
>> @@ -195,9 +208,13 @@ int vfio_dma_unmap(VFIOContainer *container,
>> hwaddr iova,
>>       return 0;
>>   }
>>   -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> -                 ram_addr_t size, void *vaddr, bool readonly)
>> +static int vfio_legacy_dma_map(VFIOContainer *bcontainer, hwaddr iova,
>> +                               ram_addr_t size, void *vaddr, bool
>> readonly)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>> +
>>       struct vfio_iommu_type1_dma_map map = {
>>           .argsz = sizeof(map),
>>           .flags = VFIO_DMA_MAP_FLAG_READ,
>> @@ -216,7 +233,8 @@ int vfio_dma_map(VFIOContainer *container, hwaddr
>> iova,
>>        * the VGA ROM space.
>>        */
>>       if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
>> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size,
>> NULL) == 0 &&
>> +        (errno == EBUSY &&
>> +         vfio_legacy_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
>>            ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>>           return 0;
>>       }
>> @@ -225,14 +243,18 @@ int vfio_dma_map(VFIOContainer *container,
>> hwaddr iova,
>>       return -errno;
>>   }
>>   -int vfio_set_dirty_page_tracking(VFIOContainer *container, bool
>> start)
>> +static int vfio_legacy_set_dirty_page_tracking(VFIOContainer
>> *bcontainer,
>> +                                               bool start)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>>       int ret;
>>       struct vfio_iommu_type1_dirty_bitmap dirty = {
>>           .argsz = sizeof(dirty),
>>       };
>>   -    if (!container->dirty_pages_supported) {
>> +    if (!bcontainer->dirty_pages_supported) {
>>           return 0;
>>       }
>>   @@ -252,9 +274,13 @@ int vfio_set_dirty_page_tracking(VFIOContainer
>> *container, bool start)
>>       return ret;
>>   }
>>   -int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap
>> *vbmap,
>> -                            hwaddr iova, hwaddr size)
>> +static int vfio_legacy_query_dirty_bitmap(VFIOContainer *bcontainer,
>> +                                          VFIOBitmap *vbmap,
>> +                                          hwaddr iova, hwaddr size)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>>       struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>>       struct vfio_iommu_type1_dirty_bitmap_get *range;
>>       int ret;
>> @@ -289,18 +315,24 @@ int vfio_query_dirty_bitmap(VFIOContainer
>> *container, VFIOBitmap *vbmap,
>>       return ret;
>>   }
>>   -static void vfio_listener_release(VFIOContainer *container)
>> +static void vfio_listener_release(VFIOLegacyContainer *container)
>>   {
>> -    memory_listener_unregister(&container->listener);
>> +    VFIOContainer *bcontainer = &container->bcontainer;
>> +
>> +    memory_listener_unregister(&bcontainer->listener);
>>       if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>           memory_listener_unregister(&container->prereg_listener);
>>       }
>>   }
>>   -int vfio_container_add_section_window(VFIOContainer *container,
>> -                                      MemoryRegionSection *section,
>> -                                      Error **errp)
>> +static int
>> +vfio_legacy_add_section_window(VFIOContainer *bcontainer,
>> +                               MemoryRegionSection *section,
>> +                               Error **errp)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>>       VFIOHostDMAWindow *hostwin;
>>       hwaddr pgsize = 0;
>>       int ret;
>> @@ -310,7 +342,7 @@ int
>> vfio_container_add_section_window(VFIOContainer *container,
>>       }
>>         /* For now intersections are not allowed, we may relax this
>> later */
>> -    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
>>           if (ranges_overlap(hostwin->min_iova,
>>                              hostwin->max_iova - hostwin->min_iova + 1,
>>                              section->offset_within_address_space,
>> @@ -332,7 +364,7 @@ int
>> vfio_container_add_section_window(VFIOContainer *container,
>>           return ret;
>>       }
>>   -    vfio_host_win_add(container,
>> section->offset_within_address_space,
>> +    vfio_host_win_add(bcontainer, section->offset_within_address_space,
>>                         section->offset_within_address_space +
>>                         int128_get64(section->size) - 1, pgsize);
>>   #ifdef CONFIG_KVM
>> @@ -365,16 +397,21 @@ int
>> vfio_container_add_section_window(VFIOContainer *container,
>>       return 0;
>>   }
>>   -void vfio_container_del_section_window(VFIOContainer *container,
>> -                                       MemoryRegionSection *section)
>> +static void
>> +vfio_legacy_del_section_window(VFIOContainer *bcontainer,
>> +                               MemoryRegionSection *section)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer,
>> +                                                  bcontainer);
>> +
>>       if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>>           return;
>>       }
>>         vfio_spapr_remove_window(container,
>>                                section->offset_within_address_space);
>> -    if (vfio_host_win_del(container,
>> +    if (vfio_host_win_del(bcontainer,
>>                             section->offset_within_address_space,
>>                             section->offset_within_address_space +
>>                             int128_get64(section->size) - 1) < 0) {
>> @@ -427,7 +464,7 @@ static void vfio_kvm_device_del_group(VFIOGroup
>> *group)
>>   /*
>>    * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>>    */
>> -static int vfio_get_iommu_type(VFIOContainer *container,
>> +static int vfio_get_iommu_type(VFIOLegacyContainer *container,
>>                                  Error **errp)
>>   {
>>       int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>> @@ -443,7 +480,7 @@ static int vfio_get_iommu_type(VFIOContainer
>> *container,
>>       return -EINVAL;
>>   }
>>   -static int vfio_init_container(VFIOContainer *container, int
>> group_fd,
>> +static int vfio_init_container(VFIOLegacyContainer *container, int
>> group_fd,
>>                                  Error **errp)
>>   {
>>       int iommu_type, ret;
>> @@ -478,7 +515,7 @@ static int vfio_init_container(VFIOContainer
>> *container, int group_fd,
>>       return 0;
>>   }
>>   -static int vfio_get_iommu_info(VFIOContainer *container,
>> +static int vfio_get_iommu_info(VFIOLegacyContainer *container,
>>                                  struct vfio_iommu_type1_info **info)
>>   {
>>   @@ -522,11 +559,12 @@ vfio_get_iommu_info_cap(struct
>> vfio_iommu_type1_info *info, uint16_t id)
>>       return NULL;
>>   }
>>   -static void vfio_get_iommu_info_migration(VFIOContainer *container,
>> -                                         struct
>> vfio_iommu_type1_info *info)
>> +static void vfio_get_iommu_info_migration(VFIOLegacyContainer
>> *container,
>> +                                          struct
>> vfio_iommu_type1_info *info)
>>   {
>>       struct vfio_info_cap_header *hdr;
>>       struct vfio_iommu_type1_info_cap_migration *cap_mig;
>> +    VFIOContainer *bcontainer = &container->bcontainer;
>>         hdr = vfio_get_iommu_info_cap(info,
>> VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
>>       if (!hdr) {
>> @@ -541,16 +579,19 @@ static void
>> vfio_get_iommu_info_migration(VFIOContainer *container,
>>        * qemu_real_host_page_size to mark those dirty.
>>        */
>>       if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
>> -        container->dirty_pages_supported = true;
>> -        container->max_dirty_bitmap_size =
>> cap_mig->max_dirty_bitmap_size;
>> -        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
>> +        bcontainer->dirty_pages_supported = true;
>> +        bcontainer->max_dirty_bitmap_size =
>> cap_mig->max_dirty_bitmap_size;
>> +        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
>>       }
>>   }
>>     static int vfio_connect_container(VFIOGroup *group, AddressSpace
>> *as,
>>                                     Error **errp)
>>   {
>> -    VFIOContainer *container;
>> +    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(
>> +        object_class_by_name(TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS));
>> +    VFIOContainer *bcontainer;
>> +    VFIOLegacyContainer *container;
>>       int ret, fd;
>>       VFIOAddressSpace *space;
>>   @@ -587,7 +628,8 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>        * details once we know which type of IOMMU we are using.
>>        */
>>   -    QLIST_FOREACH(container, &space->containers, next) {
>> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
>> +        container = container_of(bcontainer, VFIOLegacyContainer,
>> bcontainer);
>>           if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER,
>> &container->fd)) {
>>               ret = vfio_ram_block_discard_disable(container, true);
>>               if (ret) {
>> @@ -623,14 +665,9 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>       }
>>         container = g_malloc0(sizeof(*container));
>> -    container->space = space;
>>       container->fd = fd;
>> -    container->error = NULL;
>> -    container->dirty_pages_supported = false;
>> -    container->dma_max_mappings = 0;
>> -    QLIST_INIT(&container->giommu_list);
>> -    QLIST_INIT(&container->hostwin_list);
>> -    QLIST_INIT(&container->vrdl_list);
>> +    bcontainer = &container->bcontainer;
>> +    vfio_container_init(bcontainer, space, ops);
>>         ret = vfio_init_container(container, group->fd, errp);
>>       if (ret) {
>> @@ -656,13 +693,13 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>           }
>>             if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
>> -            container->pgsizes = info->iova_pgsizes;
>> +            bcontainer->pgsizes = info->iova_pgsizes;
>>           } else {
>> -            container->pgsizes = qemu_real_host_page_size();
>> +            bcontainer->pgsizes = qemu_real_host_page_size();
>>           }
>>   -        if (!vfio_get_info_dma_avail(info,
>> &container->dma_max_mappings)) {
>> -            container->dma_max_mappings = 65535;
>> +        if (!vfio_get_info_dma_avail(info,
>> &bcontainer->dma_max_mappings)) {
>> +            bcontainer->dma_max_mappings = 65535;
>>           }
>>           vfio_get_iommu_info_migration(container, info);
>>           g_free(info);
>> @@ -672,7 +709,7 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>            * information to get the actual window extent rather than
>> assume
>>            * a 64-bit IOVA address space.
>>            */
>> -        vfio_host_win_add(container, 0, (hwaddr)-1,
>> container->pgsizes);
>> +        vfio_host_win_add(bcontainer, 0, (hwaddr)-1,
>> bcontainer->pgsizes);
>>             break;
>>       }
>> @@ -699,10 +736,10 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>                 memory_listener_register(&container->prereg_listener,
>>                                        &address_space_memory);
>> -            if (container->error) {
>> +            if (bcontainer->error) {
>>                  
>> memory_listener_unregister(&container->prereg_listener);
>>                   ret = -1;
>> -                error_propagate_prepend(errp, container->error,
>> +                error_propagate_prepend(errp, bcontainer->error,
>>                       "RAM memory listener initialization failed: ");
>>                   goto enable_discards_exit;
>>               }
>> @@ -721,7 +758,7 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>           }
>>             if (v2) {
>> -            container->pgsizes = info.ddw.pgsizes;
>> +            bcontainer->pgsizes = info.ddw.pgsizes;
>>               /*
>>                * There is a default window in just created container.
>>                * To make region_add/del simpler, we better remove this
>> @@ -736,8 +773,8 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>               }
>>           } else {
>>               /* The default table uses 4K pages */
>> -            container->pgsizes = 0x1000;
>> -            vfio_host_win_add(container, info.dma32_window_start,
>> +            bcontainer->pgsizes = 0x1000;
>> +            vfio_host_win_add(bcontainer, info.dma32_window_start,
>>                                 info.dma32_window_start +
>>                                 info.dma32_window_size - 1,
>>                                 0x1000);
>> @@ -748,28 +785,28 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>       vfio_kvm_device_add_group(group);
>>         QLIST_INIT(&container->group_list);
>> -    QLIST_INSERT_HEAD(&space->containers, container, next);
>> +    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
>>         group->container = container;
>>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>   -    container->listener = vfio_memory_listener;
>> +    bcontainer->listener = vfio_memory_listener;
>>   -    memory_listener_register(&container->listener,
>> container->space->as);
>> +    memory_listener_register(&bcontainer->listener,
>> bcontainer->space->as);
>>   -    if (container->error) {
>> +    if (bcontainer->error) {
>>           ret = -1;
>> -        error_propagate_prepend(errp, container->error,
>> +        error_propagate_prepend(errp, bcontainer->error,
>>               "memory listener initialization failed: ");
>>           goto listener_release_exit;
>>       }
>>   -    container->initialized = true;
>> +    bcontainer->initialized = true;
>>         return 0;
>>   listener_release_exit:
>>       QLIST_REMOVE(group, container_next);
>> -    QLIST_REMOVE(container, next);
>> +    QLIST_REMOVE(bcontainer, next);
>>       vfio_kvm_device_del_group(group);
>>       vfio_listener_release(container);
>>   @@ -790,7 +827,8 @@ put_space_exit:
>>     static void vfio_disconnect_container(VFIOGroup *group)
>>   {
>> -    VFIOContainer *container = group->container;
>> +    VFIOLegacyContainer *container = group->container;
>> +    VFIOContainer *bcontainer = &container->bcontainer;
>>         QLIST_REMOVE(group, container_next);
>>       group->container = NULL;
>> @@ -810,25 +848,9 @@ static void vfio_disconnect_container(VFIOGroup
>> *group)
>>       }
>>         if (QLIST_EMPTY(&container->group_list)) {
>> -        VFIOAddressSpace *space = container->space;
>> -        VFIOGuestIOMMU *giommu, *tmp;
>> -        VFIOHostDMAWindow *hostwin, *next;
>> -
>> -        QLIST_REMOVE(container, next);
>> -
>> -        QLIST_FOREACH_SAFE(giommu, &container->giommu_list,
>> giommu_next, tmp) {
>> -            memory_region_unregister_iommu_notifier(
>> -                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
>> -            QLIST_REMOVE(giommu, giommu_next);
>> -            g_free(giommu);
>> -        }
>> -
>> -        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list,
>> hostwin_next,
>> -                           next) {
>> -            QLIST_REMOVE(hostwin, hostwin_next);
>> -            g_free(hostwin);
>> -        }
>> +        VFIOAddressSpace *space = bcontainer->space;
>>   +        vfio_container_destroy(bcontainer);
>>           trace_vfio_disconnect_container(container->fd);
>>           close(container->fd);
>>           g_free(container);
>> @@ -840,13 +862,15 @@ static void vfio_disconnect_container(VFIOGroup
>> *group)
>>   static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as,
>> Error **errp)
>>   {
>>       VFIOGroup *group;
>> +    VFIOContainer *bcontainer;
>>       char path[32];
>>       struct vfio_group_status status = { .argsz = sizeof(status) };
>>         QLIST_FOREACH(group, &vfio_group_list, next) {
>>           if (group->groupid == groupid) {
>>               /* Found it.  Now is it already in the right context? */
>> -            if (group->container->space->as == as) {
>> +            bcontainer = &group->container->bcontainer;
>> +            if (bcontainer->space->as == as) {
>>                   return group;
>>               } else {
>>                   error_setg(errp, "group %d used in multiple address
>> spaces",
>> @@ -990,7 +1014,7 @@ static void vfio_put_base_device(VFIODevice
>> *vbasedev)
>>   /*
>>    * Interfaces for IBM EEH (Enhanced Error Handling)
>>    */
>> -static bool vfio_eeh_container_ok(VFIOContainer *container)
>> +static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
>>   {
>>       /*
>>        * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
>> @@ -1018,7 +1042,7 @@ static bool vfio_eeh_container_ok(VFIOContainer
>> *container)
>>       return true;
>>   }
>>   -static int vfio_eeh_container_op(VFIOContainer *container,
>> uint32_t op)
>> +static int vfio_eeh_container_op(VFIOLegacyContainer *container,
>> uint32_t op)
>>   {
>>       struct vfio_eeh_pe_op pe_op = {
>>           .argsz = sizeof(pe_op),
>> @@ -1041,19 +1065,21 @@ static int
>> vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
>>       return ret;
>>   }
>>   -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
>> +static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
>>   {
>>       VFIOAddressSpace *space = vfio_get_address_space(as);
>> -    VFIOContainer *container = NULL;
>> +    VFIOLegacyContainer *container = NULL;
>> +    VFIOContainer *bcontainer = NULL;
>>         if (QLIST_EMPTY(&space->containers)) {
>>           /* No containers to act on */
>>           goto out;
>>       }
>>   -    container = QLIST_FIRST(&space->containers);
>> +    bcontainer = QLIST_FIRST(&space->containers);
>> +    container = container_of(bcontainer, VFIOLegacyContainer,
>> bcontainer);
>>   -    if (QLIST_NEXT(container, next)) {
>> +    if (QLIST_NEXT(bcontainer, next)) {
>>           /*
>>            * We don't yet have logic to synchronize EEH state across
>>            * multiple containers
>> @@ -1069,14 +1095,14 @@ out:
>>     bool vfio_eeh_as_ok(AddressSpace *as)
>>   {
>> -    VFIOContainer *container = vfio_eeh_as_container(as);
>> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>>         return (container != NULL) && vfio_eeh_container_ok(container);
>>   }
>>     int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>>   {
>> -    VFIOContainer *container = vfio_eeh_as_container(as);
>> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>>         if (!container) {
>>           return -ENODEV;
>> @@ -1110,8 +1136,8 @@ static int vfio_device_groupid(VFIODevice
>> *vbasedev, Error **errp)
>>       return groupid;
>>   }
>>   -int vfio_attach_device(char *name, VFIODevice *vbasedev,
>> -                       AddressSpace *as, Error **errp)
>> +static int vfio_legacy_attach_device(char *name, VFIODevice *vbasedev,
>> +                                     AddressSpace *as, Error **errp)
>>   {
>>       int groupid = vfio_device_groupid(vbasedev, errp);
>>       VFIODevice *vbasedev_iter;
>> @@ -1137,15 +1163,46 @@ int vfio_attach_device(char *name, VFIODevice
>> *vbasedev,
>>       ret = vfio_get_device(group, name, vbasedev, errp);
>>       if (ret) {
>>           vfio_put_group(group);
>> +        return ret;
>>       }
>> +    vbasedev->container = &group->container->bcontainer;
>>         return ret;
>>   }
>>   -void vfio_detach_device(VFIODevice *vbasedev)
>> +static void vfio_legacy_detach_device(VFIODevice *vbasedev)
>>   {
>>       VFIOGroup *group = vbasedev->group;
>>         vfio_put_base_device(vbasedev);
>>       vfio_put_group(group);
>> +    vbasedev->container = NULL;
>> +}
>> +
>> +static void vfio_iommu_backend_legacy_ops_class_init(ObjectClass *oc,
>> +                                                     void *data) {
>> +    VFIOIOMMUBackendOpsClass *ops = VFIO_IOMMU_BACKEND_OPS_CLASS(oc);
>> +
>> +    ops->dev_iter_next = vfio_legacy_dev_iter_next;
>> +    ops->dma_map = vfio_legacy_dma_map;
>> +    ops->dma_unmap = vfio_legacy_dma_unmap;
>> +    ops->attach_device = vfio_legacy_attach_device;
>> +    ops->detach_device = vfio_legacy_detach_device;
>> +    ops->set_dirty_page_tracking = vfio_legacy_set_dirty_page_tracking;
>> +    ops->query_dirty_bitmap = vfio_legacy_query_dirty_bitmap;
>> +    ops->add_window = vfio_legacy_add_section_window;
>> +    ops->del_window = vfio_legacy_del_section_window;
>> +}
>> +
>> +static const TypeInfo vfio_iommu_backend_legacy_ops_type = {
>> +    .name = TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS,
>> +
>> +    .parent = TYPE_VFIO_IOMMU_BACKEND_OPS,
>> +    .class_init = vfio_iommu_backend_legacy_ops_class_init,
>> +    .abstract = true,
>> +};
>> +static void vfio_iommu_backend_legacy_ops_register_types(void)
>> +{
>> +    type_register_static(&vfio_iommu_backend_legacy_ops_type);
>>   }
>> +type_init(vfio_iommu_backend_legacy_ops_register_types);
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index 2a6912c940..eb6ce6229d 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>>   vfio_ss.add(files(
>>     'helpers.c',
>>     'common.c',
>> +  'container-base.c',
>>     'container.c',
>>     'spapr.c',
>>     'migration.c',
>> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
>> index 9ec1e95f6d..7647e7d492 100644
>> --- a/hw/vfio/spapr.c
>> +++ b/hw/vfio/spapr.c
>> @@ -39,8 +39,8 @@ static void
>> *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
>>   static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>                                               MemoryRegionSection
>> *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            prereg_listener);
>> +    VFIOLegacyContainer *container = container_of(listener,
>> VFIOLegacyContainer,
>> +                                                  prereg_listener);
>>       const hwaddr gpa = section->offset_within_address_space;
>>       hwaddr end;
>>       int ret;
>> @@ -83,9 +83,9 @@ static void
>> vfio_prereg_listener_region_add(MemoryListener *listener,
>>            * can gracefully fail.  Runtime, there's not much we can
>> do other
>>            * than throw a hardware error.
>>            */
>> -        if (!container->initialized) {
>> -            if (!container->error) {
>> -                error_setg_errno(&container->error, -ret,
>> +        if (!container->bcontainer.initialized) {
>> +            if (!container->bcontainer.error) {
>> +                error_setg_errno(&container->bcontainer.error, -ret,
>>                                    "Memory registering failed");
>>               }
>>           } else {
>> @@ -97,8 +97,8 @@ static void
>> vfio_prereg_listener_region_add(MemoryListener *listener,
>>   static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>                                               MemoryRegionSection
>> *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            prereg_listener);
>> +    VFIOLegacyContainer *container = container_of(listener,
>> VFIOLegacyContainer,
>> +                                                  prereg_listener);
>>       const hwaddr gpa = section->offset_within_address_space;
>>       hwaddr end;
>>       int ret;
>> @@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
>>       .region_del = vfio_prereg_listener_region_del,
>>   };
>>   -int vfio_spapr_create_window(VFIOContainer *container,
>> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>>                                MemoryRegionSection *section,
>>                                hwaddr *pgsize)
>>   {
>> @@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer
>> *container,
>>       if (pagesize > rampagesize) {
>>           pagesize = rampagesize;
>>       }
>> -    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
>> +    pgmask = container->bcontainer.pgsizes & (pagesize | (pagesize -
>> 1));
>>       pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
>>       if (!pagesize) {
>>           error_report("Host doesn't support page size 0x%"PRIx64
>>                        ", the supported mask is 0x%lx",
>>                        memory_region_iommu_get_min_page_size(iommu_mr),
>> -                     container->pgsizes);
>> +                     container->bcontainer.pgsizes);
>>           return -EINVAL;
>>       }
>>   @@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer
>> *container,
>>       return 0;
>>   }
>>   -int vfio_spapr_remove_window(VFIOContainer *container,
>> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>>                                hwaddr offset_within_address_space)
>>   {
>>       struct vfio_iommu_spapr_tce_remove remove = {
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index bd32970854..1692bcd8f1 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -119,8 +119,8 @@ vfio_region_unmap(const char *name, unsigned long
>> offset, unsigned long end) "Re
>>   vfio_region_sparse_mmap_header(const char *name, int index, int
>> nr_areas) "Device %s region %d: %d sparse mmap entries"
>>   vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>   vfio_get_dev_region(const char *name, int index, uint32_t type,
>> uint32_t subtype) "%s index %d, %08x/%08x"
>> -vfio_dma_unmap_overflow_workaround(void) ""
>> -vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t
>> bitmap_size, uint64_t start, uint64_t dirty_pages) "container fd=%d,
>> iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64"
>> start=0x%"PRIx64" dirty_pages=%"PRIu64
>> +vfio_legacy_dma_unmap_overflow_workaround(void) ""
>> +vfio_get_dirty_bitmap(uint64_t iova, uint64_t size, uint64_t
>> bitmap_size, uint64_t start, uint64_t dirty_pages) "iova=0x%"PRIx64"
>> size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64"
>> dirty_pages=%"PRIu64
>>   vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end)
>> "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
>>     # platform.c
>> diff --git a/include/hw/vfio/vfio-common.h
>> b/include/hw/vfio/vfio-common.h
>> index 95bcafdaf6..b1a76dcc9c 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -30,6 +30,7 @@
>>   #include <linux/vfio.h>
>>   #endif
>>   #include "sysemu/sysemu.h"
>> +#include "hw/vfio/vfio-container-base.h"
>>     #define VFIO_MSG_PREFIX "vfio %s: "
>>   @@ -74,64 +75,22 @@ typedef struct VFIOMigration {
>>       bool initial_data_sent;
>>   } VFIOMigration;
>>   -typedef struct VFIOAddressSpace {
>> -    AddressSpace *as;
>> -    QLIST_HEAD(, VFIOContainer) containers;
>> -    QLIST_ENTRY(VFIOAddressSpace) list;
>> -} VFIOAddressSpace;
>> -
>>   struct VFIOGroup;
>>   -typedef struct VFIOContainer {
>> -    VFIOAddressSpace *space;
>> +typedef struct VFIOLegacyContainer {
>> +    VFIOContainer bcontainer;
>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> -    MemoryListener listener;
>>       MemoryListener prereg_listener;
>>       unsigned iommu_type;
>> -    Error *error;
>> -    bool initialized;
>> -    bool dirty_pages_supported;
>> -    uint64_t dirty_pgsizes;
>> -    uint64_t max_dirty_bitmap_size;
>> -    unsigned long pgsizes;
>> -    unsigned int dma_max_mappings;
>> -    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>> -    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>       QLIST_HEAD(, VFIOGroup) group_list;
>> -    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>> -    QLIST_ENTRY(VFIOContainer) next;
>> -} VFIOContainer;
>> -
>> -typedef struct VFIOGuestIOMMU {
>> -    VFIOContainer *container;
>> -    IOMMUMemoryRegion *iommu_mr;
>> -    hwaddr iommu_offset;
>> -    IOMMUNotifier n;
>> -    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> -} VFIOGuestIOMMU;
>> -
>> -typedef struct VFIORamDiscardListener {
>> -    VFIOContainer *container;
>> -    MemoryRegion *mr;
>> -    hwaddr offset_within_address_space;
>> -    hwaddr size;
>> -    uint64_t granularity;
>> -    RamDiscardListener listener;
>> -    QLIST_ENTRY(VFIORamDiscardListener) next;
>> -} VFIORamDiscardListener;
>> -
>> -typedef struct VFIOHostDMAWindow {
>> -    hwaddr min_iova;
>> -    hwaddr max_iova;
>> -    uint64_t iova_pgsizes;
>> -    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
>> -} VFIOHostDMAWindow;
>> +} VFIOLegacyContainer;
>>     typedef struct VFIODeviceOps VFIODeviceOps;
>>     typedef struct VFIODevice {
>>       QLIST_ENTRY(VFIODevice) next;
>>       struct VFIOGroup *group;
>> +    VFIOContainer *container;
>>       char *sysfsdev;
>>       char *name;
>>       DeviceState *dev;
>> @@ -165,7 +124,7 @@ struct VFIODeviceOps {
>>   typedef struct VFIOGroup {
>>       int fd;
>>       int groupid;
>> -    VFIOContainer *container;
>> +    VFIOLegacyContainer *container;
>>       QLIST_HEAD(, VFIODevice) device_list;
>>       QLIST_ENTRY(VFIOGroup) next;
>>       QLIST_ENTRY(VFIOGroup) container_next;
>> @@ -198,37 +157,13 @@ typedef struct VFIODisplay {
>>       } dmabuf;
>>   } VFIODisplay;
>>   -typedef struct {
>> -    unsigned long *bitmap;
>> -    hwaddr size;
>> -    hwaddr pages;
>> -} VFIOBitmap;
>> -
>> -void vfio_host_win_add(VFIOContainer *container,
>> +void vfio_host_win_add(VFIOContainer *bcontainer,
>>                          hwaddr min_iova, hwaddr max_iova,
>>                          uint64_t iova_pgsizes);
>> -int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
>> +int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
>>                         hwaddr max_iova);
>>   VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
>>   void vfio_put_address_space(VFIOAddressSpace *space);
>> -bool vfio_devices_all_running_and_saving(VFIOContainer *container);
>> -
>> -/* container->fd */
>> -VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> -                                         VFIODevice *curr);
>> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>> -                   ram_addr_t size, IOMMUTLBEntry *iotlb);
>> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> -                 ram_addr_t size, void *vaddr, bool readonly);
>> -int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
>> -int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap
>> *vbmap,
>> -                            hwaddr iova, hwaddr size);
>> -
>> -int vfio_container_add_section_window(VFIOContainer *container,
>> -                                      MemoryRegionSection *section,
>> -                                      Error **errp);
>> -void vfio_container_del_section_window(VFIOContainer *container,
>> -                                       MemoryRegionSection *section);
>>     void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>>   void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
>> @@ -285,10 +220,10 @@ vfio_get_cap(void *ptr, uint32_t cap_offset,
>> uint16_t id);
>>   #endif
>>   extern const MemoryListener vfio_prereg_listener;
>>   -int vfio_spapr_create_window(VFIOContainer *container,
>> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>>                                MemoryRegionSection *section,
>>                                hwaddr *pgsize);
>> -int vfio_spapr_remove_window(VFIOContainer *container,
>> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>>                                hwaddr offset_within_address_space);
>>     bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
>> diff --git a/include/hw/vfio/vfio-container-base.h
>> b/include/hw/vfio/vfio-container-base.h
>> new file mode 100644
>> index 0000000000..b18fa92146
>> --- /dev/null
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -0,0 +1,155 @@
>> +/*
>> + * VFIO BASE CONTAINER
>> + *
>> + * Copyright (C) 2023 Intel Corporation.
>> + * Copyright Red Hat, Inc. 2023
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *          Eric Auger <eric.auger@redhat.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License
>> along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_VFIO_VFIO_BASE_CONTAINER_H
>> +#define HW_VFIO_VFIO_BASE_CONTAINER_H
>> +
>> +#include "exec/memory.h"
>> +#ifndef CONFIG_USER_ONLY
>> +#include "exec/hwaddr.h"
>> +#endif
>> +
>> +typedef struct VFIOContainer VFIOContainer;
>> +
>> +typedef struct VFIOAddressSpace {
>> +    AddressSpace *as;
>> +    QLIST_HEAD(, VFIOContainer) containers;
>> +    QLIST_ENTRY(VFIOAddressSpace) list;
>> +} VFIOAddressSpace;
>> +
>> +typedef struct VFIOGuestIOMMU {
>> +    VFIOContainer *container;
>> +    IOMMUMemoryRegion *iommu_mr;
>> +    hwaddr iommu_offset;
>> +    IOMMUNotifier n;
>> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> +} VFIOGuestIOMMU;
>> +
>> +typedef struct VFIORamDiscardListener {
>> +    VFIOContainer *container;
>> +    MemoryRegion *mr;
>> +    hwaddr offset_within_address_space;
>> +    hwaddr size;
>> +    uint64_t granularity;
>> +    RamDiscardListener listener;
>> +    QLIST_ENTRY(VFIORamDiscardListener) next;
>> +} VFIORamDiscardListener;
>> +
>> +typedef struct VFIOHostDMAWindow {
>> +    hwaddr min_iova;
>> +    hwaddr max_iova;
>> +    uint64_t iova_pgsizes;
>> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
>> +} VFIOHostDMAWindow;
>> +
>> +typedef struct {
>> +    unsigned long *bitmap;
>> +    hwaddr size;
>> +    hwaddr pages;
>> +} VFIOBitmap;
>> +
>> +typedef struct VFIODevice VFIODevice;
>> +typedef struct VFIOIOMMUBackendOpsClass VFIOIOMMUBackendOpsClass;
>> +
>> +/*
>> + * This is the base object for vfio container backends
>> + */
>> +struct VFIOContainer {
>> +    VFIOIOMMUBackendOpsClass *ops;
>> +    VFIOAddressSpace *space;
>> +    MemoryListener listener;
>> +    Error *error;
>> +    bool initialized;
>> +    bool dirty_pages_supported;
>> +    uint64_t dirty_pgsizes;
>> +    uint64_t max_dirty_bitmap_size;
>> +    unsigned long pgsizes;
>> +    unsigned int dma_max_mappings;
>> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>> +    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>> +    QLIST_ENTRY(VFIOContainer) next;
>> +};
>> +
>> +VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> +                                 VFIODevice *curr);
>> +int vfio_container_dma_map(VFIOContainer *container,
>> +                           hwaddr iova, ram_addr_t size,
>> +                           void *vaddr, bool readonly);
>> +int vfio_container_dma_unmap(VFIOContainer *container,
>> +                             hwaddr iova, ram_addr_t size,
>> +                             IOMMUTLBEntry *iotlb);
>> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer
>> *container);
>> +int vfio_container_set_dirty_page_tracking(VFIOContainer *container,
>> +                                            bool start);
>> +int vfio_container_query_dirty_bitmap(VFIOContainer *container,
>> +                                      VFIOBitmap *vbmap,
>> +                                      hwaddr iova, hwaddr size);
>> +int vfio_container_add_section_window(VFIOContainer *container,
>> +                                      MemoryRegionSection *section,
>> +                                      Error **errp);
>> +void vfio_container_del_section_window(VFIOContainer *container,
>> +                                       MemoryRegionSection *section);
>> +
>> +void vfio_container_init(VFIOContainer *container,
>> +                         VFIOAddressSpace *space,
>> +                         struct VFIOIOMMUBackendOpsClass *ops);
>> +void vfio_container_destroy(VFIOContainer *container);
>> +
>> +#define TYPE_VFIO_IOMMU_BACKEND_LEGACY_OPS
>> "vfio-iommu-backend-legacy-ops"
>> +#define TYPE_VFIO_IOMMU_BACKEND_OPS "vfio-iommu-backend-ops"
>> +
>> +DECLARE_CLASS_CHECKERS(VFIOIOMMUBackendOpsClass,
>> +                       VFIO_IOMMU_BACKEND_OPS,
>> TYPE_VFIO_IOMMU_BACKEND_OPS)
>> +
>> +struct VFIOIOMMUBackendOpsClass {
>> +    /*< private >*/
>> +    ObjectClass parent_class;
>> +
>> +    /*< public >*/
>> +    /* required */
>> +    VFIODevice *(*dev_iter_next)(VFIOContainer *container,
>> VFIODevice *curr);
>> +    int (*dma_map)(VFIOContainer *container,
>> +                   hwaddr iova, ram_addr_t size,
>> +                   void *vaddr, bool readonly);
>> +    int (*dma_unmap)(VFIOContainer *container,
>> +                     hwaddr iova, ram_addr_t size,
>> +                     IOMMUTLBEntry *iotlb);
>> +    int (*attach_device)(char *name, VFIODevice *vbasedev,
>> +                         AddressSpace *as, Error **errp);
>> +    void (*detach_device)(VFIODevice *vbasedev);
>> +    /* migration feature */
>> +    int (*set_dirty_page_tracking)(VFIOContainer *container, bool
>> start);
>> +    int (*query_dirty_bitmap)(VFIOContainer *bcontainer, VFIOBitmap
>> *vbmap,
>> +                              hwaddr iova, hwaddr size);
>> +
>> +    /* SPAPR specific */
>> +    int (*add_window)(VFIOContainer *container,
>> +                      MemoryRegionSection *section,
>> +                      Error **errp);
>> +    void (*del_window)(VFIOContainer *container,
>> +                       MemoryRegionSection *section);
>> +};
>> +
>> +
>> +#endif /* HW_VFIO_VFIO_BASE_CONTAINER_H */
>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 13:02           ` Cédric Le Goater
@ 2023-09-20 17:37             ` Eric Auger
  2023-09-20 17:49               ` Jason Gunthorpe
  2023-09-21  4:00             ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-20 17:37 UTC (permalink / raw)
  To: Cédric Le Goater, Jason Gunthorpe
  Cc: Duan, Zhenzhong, qemu-devel, alex.williamson, nicolinc, Martins,
	Joao, peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé



On 9/20/23 15:02, Cédric Le Goater wrote:
> On 9/20/23 14:51, Jason Gunthorpe wrote:
>> On Wed, Sep 20, 2023 at 02:19:42PM +0200, Cédric Le Goater wrote:
>>> On 9/20/23 05:42, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Cédric Le Goater <clg@redhat.com>
>>>>> Sent: Wednesday, September 20, 2023 1:08 AM
>>>>> Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>>>>>
>>>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>>>> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
>>>>>> iommufd support, enabled by default.
>>>>>
>>>>> Why would someone want to disable support at compile time ? It might
>>>>
>>>> For those users who only want to support legacy container feature?
>>>> Let me know if you still prefer to drop this patch, I'm fine with
>>>> that.
>>>
>>> I think it is too early.
>>>
>>>>> have been useful for dev but now QEMU should self-adjust at runtime
>>>>> depending only on the host capabilities AFAIUI. Am I missing
>>>>> something ?
>>>>
>>>> IOMMUFD doesn't support all features of legacy container, so QEMU
>>>> doesn't self-adjust at runtime by checking if host supports IOMMUFD.
>>>> We need to specify it explicitly to use IOMMUFD as below:
>>>>
>>>>       -object iommufd,id=iommufd0
>>>>       -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
>>>
>>> OK. I am not sure this is the correct interface yet. At first glance,
>>> I wouldn't introduce a new object for a simple backend depending on a
>>> kernel interface. I would tend to prefer a "iommu-something" property
>>> of the vfio-pci device with string values: "legacy", "iommufd",
>>> "default"
>>> and define the various interfaces (the ops you proposed) for each
>>> depending on the user preference and the capabilities of the host and
>>> possibly the device.
>>
>> I think the idea came from Alex? The major point is to be able to have
>> libvirt open /dev/iommufd and FD pass it into qemu 
>
> ok.
>
>> and then share that single FD across all VFIOs. 
>
> I will ask Alex to help me catch up on the topic.
>
>> qemu will typically not be able to
>> self-open /dev/iommufd as it is root-only.
>
> I don't understand, we open multiple fds to KVM devices. This is the
> same.
Actually qemu opens the /dev/iommu in case no fd is passed along with
the iommufd object. This is done in [PATCH v1 16/22] "backends/iommufd:
Introduce the iommufd object", in iommufd_backend_connect(). I don't
understand either.
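
For illustration, a rough sketch of the fallback pattern described
above (hypothetical code, not the actual iommufd_backend_connect()
body from patch 16/22):

    #include <fcntl.h>

    /* Use the fd handed over by the management layer if there is one,
     * otherwise open the character device directly; opening it
     * requires sufficient privileges. */
    static int iommufd_connect_sketch(int passed_fd)
    {
        if (passed_fd >= 0) {
            return passed_fd;
        }
        return open("/dev/iommu", O_RDWR);
    }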

Thanks

Eric

>
>>
>> So the object is not exactly for the backend, the object is for the
>> file descriptor.
> got it.
>
>>
>> Adding a legacy/iommufd option to the vfio-pci device string doesn't
>> address these needs.
>
> I agree.
>
> Thanks,
>
> C.
>
>> Jason
>>
>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 17:37             ` Eric Auger
@ 2023-09-20 17:49               ` Jason Gunthorpe
  2023-09-20 18:17                 ` Alex Williamson
  0 siblings, 1 reply; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-20 17:49 UTC (permalink / raw)
  To: Eric Auger
  Cc: Cédric Le Goater, Duan, Zhenzhong, qemu-devel,
	alex.williamson, nicolinc, Martins, Joao, peterx, jasowang, Tian,
	Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Thomas Huth, Philippe Mathieu-Daudé

On Wed, Sep 20, 2023 at 07:37:53PM +0200, Eric Auger wrote:

> >> qemu will typically not be able to
> >> self-open /dev/iommufd as it is root-only.
> >
> > I don't understand, we open multiple fds to KVM devices. This is the
> > same.
> Actually qemu opens the /dev/iommu in case no fd is passed along with
> the iommufd object. This is done in
> [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object, in
> 
> iommufd_backend_connect(). I don't understand either.

The char dev node is root-only, so this automatic behavior is fine
but not useful if QEMU is running in a sandbox.

I'm not sure what "multiple fds to KVM devices" means; I don't know
anything about KVM devices.

The iommufd design requires one open of /dev/iommu to be shared
across all the VFIO devices.
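
A minimal sketch of what that sharing looks like with the option
syntax proposed in this series (device addresses made up for the
example):

    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0 \
    -device vfio-pci,host=0000:03:00.0,iommufd=iommufd0

Both vfio-pci devices reference the same iommufd object, i.e. a single
open of /dev/iommu.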

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20  3:42     ` Duan, Zhenzhong
  2023-09-20 12:19       ` Cédric Le Goater
@ 2023-09-20 18:01       ` Alex Williamson
  2023-09-20 18:12         ` Jason Gunthorpe
  2023-09-20 18:15         ` Daniel P. Berrangé
  1 sibling, 2 replies; 109+ messages in thread
From: Alex Williamson @ 2023-09-20 18:01 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: Cédric Le Goater, qemu-devel, jgg, nicolinc, Martins, Joao,
	eric.auger, peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
	Laine Stump

On Wed, 20 Sep 2023 03:42:20 +0000
"Duan, Zhenzhong" <zhenzhong.duan@intel.com> wrote:

> >-----Original Message-----
> >From: Cédric Le Goater <clg@redhat.com>
> >Sent: Wednesday, September 20, 2023 1:08 AM
> >Subject: Re: [PATCH v1 15/22] Add iommufd configure option
> >
> >On 8/30/23 12:37, Zhenzhong Duan wrote:  
> >> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> >> iommufd support, enabled by default.  
> >
> >Why would someone want to disable support at compile time ? It might  
> 
> For those users who only want to support legacy container feature?
> Let me know if you still prefer to drop this patch, I'm fine with that.
> 
> >have been useful for dev but now QEMU should self-adjust at runtime
> >depending only on the host capabilities AFAIUI. Am I missing something ?  
> 
> IOMMUFD doesn't support all features of legacy container, so QEMU
> doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> We need to specify it explicitly to use IOMMUFD as below:
> 
>     -object iommufd,id=iommufd0
>     -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0

There's an important point here that maybe we've let slip for too long.
Laine had asked in an internal forum whether the switch to IOMMUFD was
visible to the guest.  I replied that it wasn't, but this note about
IOMMUFD vs container features jogged my memory that I think we still
lack p2p support with IOMMUFD, i.e. IOMMU mapping of device MMIO.  It
seemed like there was something else too, but I don't recall without
some research.

Ideally we'd have feature parity and libvirt could simply use the
native IOMMUFD interface whenever both the kernel and QEMU support it.

Without that parity, when does libvirt decide to use IOMMUFD?

How would libvirt know if some future IOMMUFD does have parity?

Does the XML direct this through some new interpretation of the driver
field? e.g. "vfio-container" vs "vfio-iommufd" where "vfio" becomes an
alias or priority preference.  Thanks,

Alex
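
(As a purely hypothetical illustration of the driver-name idea above,
not an existing libvirt interface, the hostdev driver element could be
directed like this:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio-iommufd'/>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      </source>
    </hostdev>

with "vfio" remaining an alias that lets libvirt pick whichever backend
is available.)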



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 18:01       ` Alex Williamson
@ 2023-09-20 18:12         ` Jason Gunthorpe
  2023-09-20 20:29           ` Alex Williamson
  2023-09-20 18:15         ` Daniel P. Berrangé
  1 sibling, 1 reply; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-20 18:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Duan, Zhenzhong, Cédric Le Goater, qemu-devel, nicolinc,
	Martins, Joao, eric.auger, peterx, jasowang, Tian, Kevin, Liu,
	Yi L, Sun, Yi Y, Peng, Chao P, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Thomas Huth, Philippe Mathieu-Daudé,
	Laine Stump

On Wed, Sep 20, 2023 at 12:01:42PM -0600, Alex Williamson wrote:
> On Wed, 20 Sep 2023 03:42:20 +0000
> "Duan, Zhenzhong" <zhenzhong.duan@intel.com> wrote:
> 
> > >-----Original Message-----
> > >From: Cédric Le Goater <clg@redhat.com>
> > >Sent: Wednesday, September 20, 2023 1:08 AM
> > >Subject: Re: [PATCH v1 15/22] Add iommufd configure option
> > >
> > >On 8/30/23 12:37, Zhenzhong Duan wrote:  
> > >> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> > >> iommufd support, enabled by default.  
> > >
> > >Why would someone want to disable support at compile time ? It might  
> > 
> > For those users who only want to support legacy container feature?
> > Let me know if you still prefer to drop this patch, I'm fine with that.
> > 
> > >have been useful for dev but now QEMU should self-adjust at runtime
> > >depending only on the host capabilities AFAIUI. Am I missing something ?  
> > 
> > IOMMUFD doesn't support all features of legacy container, so QEMU
> > doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> > We need to specify it explicitly to use IOMMUFD as below:
> > 
> >     -object iommufd,id=iommufd0
> >     -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
> 
> There's an important point here that maybe we've let slip for too long.
> Laine had asked in an internal forum whether the switch to IOMMUFD was
> visible to the guest.  I replied that it wasn't, but this note about
> IOMMUFD vs container features jogged my memory that I think we still
> lack p2p support with IOMMUFD, ie. IOMMU mapping of device MMIO.  It
> seemed like there was something else too, but I don't recall without
> some research.

I think p2p is the only guest visible one.

I still expect to solve it :\

> Ideally we'd have feature parity and libvirt could simply use the
> native IOMMUFD interface whenever both the kernel and QEMU support it.
> 
> Without that parity, when does libvirt decide to use IOMMUFD?
> 
> How would libvirt know if some future IOMMUFD does have parity?

At this point I think it is reasonable that iommufd is explicitly
opted into.

The next step would be automatic for single PCI device VMs (p2p is not
relevant).

The final step would be automatic if the kernel supports P2P. I expect
libvirt will be able to detect this from an opened /dev/iommu.

Jason
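
A minimal sketch of such a probe, assuming libvirt only needs to know whether
/dev/iommu can be opened at all; detecting P2P (MMIO mapping) parity would
additionally need a capability query on the opened fd, which is assumed here
and is not something this thread defines:

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Sketch only: true when the iommufd cdev exists and is openable.  A real
 * parity check would also query the (future, assumed) P2P capability
 * through the opened fd before choosing the iommufd backend. */
static bool iommufd_usable(void)
{
    int fd = open("/dev/iommu", O_RDWR);

    if (fd < 0) {
        return false;   /* no iommufd support, or no permission */
    }
    close(fd);
    return true;
}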


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 18:01       ` Alex Williamson
  2023-09-20 18:12         ` Jason Gunthorpe
@ 2023-09-20 18:15         ` Daniel P. Berrangé
  1 sibling, 0 replies; 109+ messages in thread
From: Daniel P. Berrangé @ 2023-09-20 18:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Duan, Zhenzhong, Cédric Le Goater, qemu-devel, jgg,
	nicolinc, Martins, Joao, eric.auger, peterx, jasowang, Tian,
	Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P, Paolo Bonzini,
	Marc-André Lureau, Thomas Huth, Philippe Mathieu-Daudé,
	Laine Stump

On Wed, Sep 20, 2023 at 12:01:42PM -0600, Alex Williamson wrote:
> On Wed, 20 Sep 2023 03:42:20 +0000
> "Duan, Zhenzhong" <zhenzhong.duan@intel.com> wrote:
> 
> > >-----Original Message-----
> > >From: Cédric Le Goater <clg@redhat.com>
> > >Sent: Wednesday, September 20, 2023 1:08 AM
> > >Subject: Re: [PATCH v1 15/22] Add iommufd configure option
> > >
> > >On 8/30/23 12:37, Zhenzhong Duan wrote:  
> > >> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> > >> iommufd support, enabled by default.  
> > >
> > >Why would someone want to disable support at compile time ? It might  
> > 
> > For those users who only want to support legacy container feature?
> > Let me know if you still prefer to drop this patch, I'm fine with that.
> > 
> > >have been useful for dev but now QEMU should self-adjust at runtime
> > >depending only on the host capabilities AFAIUI. Am I missing something ?  
> > 
> > IOMMUFD doesn't support all features of legacy container, so QEMU
> > doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> > We need to specify it explicitly to use IOMMUFD as below:
> > 
> >     -object iommufd,id=iommufd0
> >     -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
> 
> There's an important point here that maybe we've let slip for too long.
> Laine had asked in an internal forum whether the switch to IOMMUFD was
> visible to the guest.  I replied that it wasn't, but this note about
> IOMMUFD vs container features jogged my memory that I think we still
> lack p2p support with IOMMUFD, ie. IOMMU mapping of device MMIO.  It
> seemed like there was something else too, but I don't recall without
> some research.
> 
> Ideally we'd have feature parity and libvirt could simply use the
> native IOMMUFD interface whenever both the kernel and QEMU support it.
> 
> Without that parity, when does libvirt decide to use IOMMUFD?
> 
> How would libvirt know if some future IOMMUFD does have parity?
> 
> Does the XML direct this through some new interpretation of the driver
> field? ex. "vfio-container" vs "vfio-iommufd" where "vfio" becomes an
> alias or priority preference.  Thanks,

Right now a host device would have


  <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
   ...
  </hostdev>

where model could also accept 'vfio-ccw' / 'vfio-ap' on s390x IIUC.

If the use of IOMMUFD has guest ABI feature differences, then we
would need to treat this as a new device model in libvirt, i.e. add a
vfio-iommu-pci model. Does this iommufd work with vfio-ccw / vfio-ap
too? If so we'd need new models for those too in libvirt.
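
Purely as an illustration of that new-model route (the model name follows
the suggestion above and is not an agreed interface), such a hostdev might
then look like:

  <hostdev mode='subsystem' type='mdev' model='vfio-iommu-pci'>
   ...
  </hostdev>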

The downside of this is that it means no application is going to
use iommufd mode without explicit coding being done to make it
aware of the new model in libvirt.

If we /want/ apps to move over to the iommufd approach in a finite,
short timeframe then IMHO achieving feature parity is critical,
as feature parity would let libvirt switch over to it automatically
and avoid the pain of updating any apps. This would be my preference,
as exposing the iommufd concept to apps feels wrong - this is
ideally an internal impl detail. Again, we must have feature parity
for this to work, though.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 17:49               ` Jason Gunthorpe
@ 2023-09-20 18:17                 ` Alex Williamson
  2023-09-20 18:19                   ` Jason Gunthorpe
  0 siblings, 1 reply; 109+ messages in thread
From: Alex Williamson @ 2023-09-20 18:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric Auger, Cédric Le Goater, Duan, Zhenzhong, qemu-devel,
	nicolinc, Martins, Joao, peterx, jasowang, Tian, Kevin, Liu,
	Yi L, Sun, Yi Y, Peng, Chao P, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Thomas Huth, Philippe Mathieu-Daudé

On Wed, 20 Sep 2023 14:49:19 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 20, 2023 at 07:37:53PM +0200, Eric Auger wrote:
> 
> > >> qemu will typically not be able to
> > >> self-open /dev/iommufd as it is root-only.  
> > >
> > > I don't understand, we open multiple fds to KVM devices. This is the
> > > same.  
> > Actually qemu opens the /dev/iommu in case no fd is passed along with
> > the iommufd object. This is done in
> > [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object, in
> > 
> > iommufd_backend_connect(). I don't understand either.  
> 
The char dev node is root-only, so this automatic behavior is fine
but not useful if QEMU is running in a sandbox.

I'm not sure what "multiple fds to KVM devices" means; I don't know
anything about KVM devices.

Looking at a local VM, the only kvm-related open file is /dev/kvm,
which kvm_init() opens directly.  The other tun/tap/vhost files are
all passed by fd.  We have a bunch of anon_inodes representing eventfds
and vcpus sourced from /dev/kvm, but the only other direct files are disk
images and the created pid file.
 
> The iommufd design requires one open of the /dev/iommu to be shared
> across all the vfios.

"requires"?  It's certainly of limited value to have multiple iommufd
instances rather than create multiple address spaces within a single
iommufd, but what exactly precludes an iommufd per device if QEMU, or
any other userspace so desired?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 18:17                 ` Alex Williamson
@ 2023-09-20 18:19                   ` Jason Gunthorpe
  2023-09-21  3:43                     ` Duan, Zhenzhong
  2023-09-26  6:05                     ` Tian, Kevin
  0 siblings, 2 replies; 109+ messages in thread
From: Jason Gunthorpe @ 2023-09-20 18:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eric Auger, Cédric Le Goater, Duan, Zhenzhong, qemu-devel,
	nicolinc, Martins, Joao, peterx, jasowang, Tian, Kevin, Liu,
	Yi L, Sun, Yi Y, Peng, Chao P, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Thomas Huth, Philippe Mathieu-Daudé

On Wed, Sep 20, 2023 at 12:17:24PM -0600, Alex Williamson wrote:

> > The iommufd design requires one open of the /dev/iommu to be shared
> > across all the vfios.
> 
> "requires"?  It's certainly of limited value to have multiple iommufd
> instances rather than create multiple address spaces within a single
> iommufd, but what exactly precludes an iommufd per device if QEMU, or
> any other userspace so desired?  Thanks,

From the kernel side, "requires" is too strong, I suppose.

Not sure about these QEMU patches, though?

Jason


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 18:12         ` Jason Gunthorpe
@ 2023-09-20 20:29           ` Alex Williamson
  0 siblings, 0 replies; 109+ messages in thread
From: Alex Williamson @ 2023-09-20 20:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Duan, Zhenzhong, Cédric Le Goater, qemu-devel, nicolinc,
	Martins, Joao, eric.auger, peterx, jasowang, Tian, Kevin, Liu,
	Yi L, Sun, Yi Y, Peng, Chao P, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Thomas Huth, Philippe Mathieu-Daudé,
	Laine Stump

On Wed, 20 Sep 2023 15:12:59 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 20, 2023 at 12:01:42PM -0600, Alex Williamson wrote:
> > On Wed, 20 Sep 2023 03:42:20 +0000
> > "Duan, Zhenzhong" <zhenzhong.duan@intel.com> wrote:
> >   
> > > >-----Original Message-----
> > > >From: Cédric Le Goater <clg@redhat.com>
> > > >Sent: Wednesday, September 20, 2023 1:08 AM
> > > >Subject: Re: [PATCH v1 15/22] Add iommufd configure option
> > > >
> > > >On 8/30/23 12:37, Zhenzhong Duan wrote:    
> > > >> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
> > > >> iommufd support, enabled by default.    
> > > >
> > > >Why would someone want to disable support at compile time ? It might    
> > > 
> > > For those users who only want to support legacy container feature?
> > > Let me know if you still prefer to drop this patch, I'm fine with that.
> > >   
> > > >have been useful for dev but now QEMU should self-adjust at runtime
> > > >depending only on the host capabilities AFAIUI. Am I missing something ?    
> > > 
> > > IOMMUFD doesn't support all features of legacy container, so QEMU
> > > doesn't self-adjust at runtime by checking if host supports IOMMUFD.
> > > We need to specify it explicitly to use IOMMUFD as below:
> > > 
> > >     -object iommufd,id=iommufd0
> > >     -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0  
> > 
> > There's an important point here that maybe we've let slip for too long.
> > Laine had asked in an internal forum whether the switch to IOMMUFD was
> > visible to the guest.  I replied that it wasn't, but this note about
> > IOMMUFD vs container features jogged my memory that I think we still
> > lack p2p support with IOMMUFD, ie. IOMMU mapping of device MMIO.  It
> > seemed like there was something else too, but I don't recall without
> > some research.  
> 
> I think p2p is the only guest visible one.
> 
> I still expect to solve it :\
> 
> > Ideally we'd have feature parity and libvirt could simply use the
> > native IOMMUFD interface whenever both the kernel and QEMU support it.
> > 
> > Without that parity, when does libvirt decide to use IOMMUFD?
> > 
> > How would libvirt know if some future IOMMUFD does have parity?  
> 
> At this point I think it is reasonable that iommufd is explicitly
> opted into.
> 
> The next step would be automatic for single PCI device VMs (p2p is not
> relavent)

And when a second PCI device is hot-plugged into the VM and it behaves
differently from a VM with multiple statically attached devices?  Seems
like it's an opt-in until full p2p support, then an opt-out for
potential bugs.  Thanks,

Alex

> The final step would be automatic if kernel supports P2P. I expect
> libvirt will be able to detect this from an open'd /dev/iommu.
> 
> Jason
> 



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-08-30 10:37 ` [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd Zhenzhong Duan
  2023-09-20 11:49   ` Eric Auger
@ 2023-09-20 21:39   ` Alex Williamson
  2023-09-21  6:03     ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Alex Williamson @ 2023-09-20 21:39 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

On Wed, 30 Aug 2023 18:37:37 +0800
Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:

> ...which will be used by both legacy and iommufd backend.

+1 to Eric's comments regarding complete sentences in the commit log
and suggested description.

> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/common.c              | 44 +++++++++++++++++++++++------------
>  include/hw/vfio/vfio-common.h |  3 +++
>  2 files changed, 32 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 67150e4575..949ad6714a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1759,17 +1759,17 @@ void vfio_reset_handler(void *opaque)
>      }
>  }
>  
> -static void vfio_kvm_device_add_group(VFIOGroup *group)
> +int vfio_kvm_device_add_fd(int fd)

Returning int vs void looks gratuitous; nothing uses the return value
in this series.

>  {
>  #ifdef CONFIG_KVM
>      struct kvm_device_attr attr = {
> -        .group = KVM_DEV_VFIO_GROUP,
> -        .attr = KVM_DEV_VFIO_GROUP_ADD,
> -        .addr = (uint64_t)(unsigned long)&group->fd,
> +        .group = KVM_DEV_VFIO_FILE,
> +        .attr = KVM_DEV_VFIO_FILE_ADD,
> +        .addr = (uint64_t)(unsigned long)&fd,
>      };
>  
>      if (!kvm_enabled()) {
> -        return;
> +        return 0;
>      }
>  
>      if (vfio_kvm_device_fd < 0) {
> @@ -1779,37 +1779,51 @@ static void vfio_kvm_device_add_group(VFIOGroup *group)
>  
>          if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
>              error_report("Failed to create KVM VFIO device: %m");
> -            return;
> +            return -ENODEV;
>          }
>  
>          vfio_kvm_device_fd = cd.fd;
>      }
>  
>      if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> -        error_report("Failed to add group %d to KVM VFIO device: %m",
> -                     group->groupid);
> +        error_report("Failed to add fd %d to KVM VFIO device: %m",
> +                     fd);

It's not nearly as useful to report an fd# in the error log vs the
group#.  Thanks,

Alex

> +        return -errno;
>      }
>  #endif
> +    return 0;
>  }
>  
> -static void vfio_kvm_device_del_group(VFIOGroup *group)
> +static void vfio_kvm_device_add_group(VFIOGroup *group)
> +{
> +    vfio_kvm_device_add_fd(group->fd);
> +}
> +
> +int vfio_kvm_device_del_fd(int fd)
>  {
>  #ifdef CONFIG_KVM
>      struct kvm_device_attr attr = {
> -        .group = KVM_DEV_VFIO_GROUP,
> -        .attr = KVM_DEV_VFIO_GROUP_DEL,
> -        .addr = (uint64_t)(unsigned long)&group->fd,
> +        .group = KVM_DEV_VFIO_FILE,
> +        .attr = KVM_DEV_VFIO_FILE_DEL,
> +        .addr = (uint64_t)(unsigned long)&fd,
>      };
>  
>      if (vfio_kvm_device_fd < 0) {
> -        return;
> +        return -EINVAL;
>      }
>  
>      if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> -        error_report("Failed to remove group %d from KVM VFIO device: %m",
> -                     group->groupid);
> +        error_report("Failed to remove fd %d from KVM VFIO device: %m",
> +                     fd);
> +        return -EBADF;
>      }
>  #endif
> +    return 0;
> +}
> +
> +static void vfio_kvm_device_del_group(VFIOGroup *group)
> +{
> +    vfio_kvm_device_del_fd(group->fd);
>  }
>  
>  static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 5e376c436e..598c3ce079 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -220,6 +220,9 @@ struct vfio_device_info *vfio_get_device_info(int fd);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
>  
> +int vfio_kvm_device_add_fd(int fd);
> +int vfio_kvm_device_del_fd(int fd);
> +
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>  extern VFIOGroupList vfio_group_list;



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 06/22] vfio/common: Add a vfio device iterator
  2023-08-30 10:37 ` [PATCH v1 06/22] vfio/common: Add a vfio device iterator Zhenzhong Duan
  2023-09-20 12:25   ` Eric Auger
@ 2023-09-20 22:16   ` Alex Williamson
  2023-09-21  2:16     ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Alex Williamson @ 2023-09-20 22:16 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

On Wed, 30 Aug 2023 18:37:38 +0800
Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:

> With a vfio device iterator added, we can make some migration and reset
> related functions group agnostic.
> E.x:
> vfio_mig_active
> vfio_migratable_device_num
> vfio_devices_all_dirty_tracking
> vfio_devices_all_device_dirty_tracking
> vfio_devices_all_running_and_mig_active
> vfio_devices_dma_logging_stop
> vfio_devices_dma_logging_start
> vfio_devices_query_dirty_bitmap
> vfio_reset_handler
> 
> Or else we need to add container specific callback variants for above
> functions just because they iterate devices based on group.
> 
> Move the reset handler registration/unregistration to a place that is not
> group specific, saying first vfio address space created instead of the
> first group.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/common.c | 224 ++++++++++++++++++++++++++---------------------
>  1 file changed, 122 insertions(+), 102 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 949ad6714a..51c6e7598e 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -84,6 +84,26 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>      }
>  }
>  
> +static VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
> +                                                VFIODevice *curr)
> +{
> +    VFIOGroup *group;
> +
> +    if (!curr) {
> +        group = QLIST_FIRST(&container->group_list);
> +    } else {
> +        if (curr->next.le_next) {
> +            return curr->next.le_next;
> +        }


VFIODevice *device = QLIST_NEXT(curr, next);

if (device) {
    return device;
}

> +        group = curr->group->container_next.le_next;


group = QLIST_NEXT(curr->group, container_next);

> +    }
> +
> +    if (!group) {
> +        return NULL;
> +    }
> +    return QLIST_FIRST(&group->device_list);
> +}
> +
>  /*
>   * Device state interfaces
>   */
> @@ -112,17 +132,22 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>  
>  bool vfio_mig_active(void)
>  {
> -    VFIOGroup *group;
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
>      VFIODevice *vbasedev;
>  
> -    if (QLIST_EMPTY(&vfio_group_list)) {
> +    if (QLIST_EMPTY(&vfio_address_spaces)) {
>          return false;
>      }
>  
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->migration_blocker) {
> -                return false;
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            vbasedev = NULL;
> +            while ((vbasedev = vfio_container_dev_iter_next(container,
> +                                                            vbasedev))) {
> +                if (vbasedev->migration_blocker) {
> +                    return false;
> +                }

It appears easy to avoid setting vbasedev in the loop iterator and
to improve the scope of vbasedev:

VFIODevice *vbasedev = vfio_container_dev_iter_next(container, NULL);

while (vbasedev) {
    if (vbasedev->migration_blocker) {
        return false;
    }

    vbasedev = vfio_container_dev_iter_next(container, vbasedev);
}

>              }
>          }
>      }
> @@ -133,14 +158,19 @@ static Error *multiple_devices_migration_blocker;
>  
>  static unsigned int vfio_migratable_device_num(void)
>  {
> -    VFIOGroup *group;
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
>      VFIODevice *vbasedev;
>      unsigned int device_num = 0;
>  
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->migration) {
> -                device_num++;
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            vbasedev = NULL;
> +            while ((vbasedev = vfio_container_dev_iter_next(container,
> +                                                            vbasedev))) {
> +                if (vbasedev->migration) {
> +                    device_num++;
> +                }

Same as above.

>              }
>          }
>      }
> @@ -207,8 +237,7 @@ static void vfio_set_migration_error(int err)
>  
>  static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  {
> -    VFIOGroup *group;
> -    VFIODevice *vbasedev;
> +    VFIODevice *vbasedev = NULL;
>      MigrationState *ms = migrate_get_current();
>  
>      if (ms->state != MIGRATION_STATUS_ACTIVE &&
> @@ -216,19 +245,17 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>          return false;
>      }
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            VFIOMigration *migration = vbasedev->migration;
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        VFIOMigration *migration = vbasedev->migration;

Similar, and all the other loops below.

>  
> -            if (!migration) {
> -                return false;
> -            }
> +        if (!migration) {
> +            return false;
> +        }
>  
> -            if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> -                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
> -                return false;
> -            }
> +        if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> +            (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +             migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
> +            return false;
>          }
>      }
>      return true;
> @@ -236,14 +263,11 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>  
>  static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>  {
> -    VFIOGroup *group;
> -    VFIODevice *vbasedev;
> +    VFIODevice *vbasedev = NULL;
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (!vbasedev->dirty_pages_supported) {
> -                return false;
> -            }
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        if (!vbasedev->dirty_pages_supported) {
> +            return false;
>          }
>      }
>  
> @@ -256,27 +280,24 @@ static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>   */
>  static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>  {
> -    VFIOGroup *group;
> -    VFIODevice *vbasedev;
> +    VFIODevice *vbasedev = NULL;
>  
>      if (!migration_is_active(migrate_get_current())) {
>          return false;
>      }
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            VFIOMigration *migration = vbasedev->migration;
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        VFIOMigration *migration = vbasedev->migration;
>  
> -            if (!migration) {
> -                return false;
> -            }
> +        if (!migration) {
> +            return false;
> +        }
>  
> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> -                continue;
> -            } else {
> -                return false;
> -            }
> +        if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +            migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> +            continue;
> +        } else {
> +            return false;
>          }
>      }
>      return true;
> @@ -1243,25 +1264,22 @@ static void vfio_devices_dma_logging_stop(VFIOContainer *container)
>      uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
>                                sizeof(uint64_t))] = {};
>      struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
> -    VFIODevice *vbasedev;
> -    VFIOGroup *group;
> +    VFIODevice *vbasedev = NULL;
>  
>      feature->argsz = sizeof(buf);
>      feature->flags = VFIO_DEVICE_FEATURE_SET |
>                       VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (!vbasedev->dirty_tracking) {
> -                continue;
> -            }
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        if (!vbasedev->dirty_tracking) {
> +            continue;
> +        }
>  
> -            if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> -                warn_report("%s: Failed to stop DMA logging, err %d (%s)",
> -                             vbasedev->name, -errno, strerror(errno));
> -            }
> -            vbasedev->dirty_tracking = false;
> +        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> +            warn_report("%s: Failed to stop DMA logging, err %d (%s)",
> +                        vbasedev->name, -errno, strerror(errno));
>          }
> +        vbasedev->dirty_tracking = false;
>      }
>  }
>  
> @@ -1336,8 +1354,7 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>  {
>      struct vfio_device_feature *feature;
>      VFIODirtyRanges ranges;
> -    VFIODevice *vbasedev;
> -    VFIOGroup *group;
> +    VFIODevice *vbasedev = NULL;
>      int ret = 0;
>  
>      vfio_dirty_tracking_init(container, &ranges);
> @@ -1347,21 +1364,19 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
>          return -errno;
>      }
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->dirty_tracking) {
> -                continue;
> -            }
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        if (vbasedev->dirty_tracking) {
> +            continue;
> +        }
>  
> -            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
> -            if (ret) {
> -                ret = -errno;
> -                error_report("%s: Failed to start DMA logging, err %d (%s)",
> -                             vbasedev->name, ret, strerror(errno));
> -                goto out;
> -            }
> -            vbasedev->dirty_tracking = true;
> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
> +        if (ret) {
> +            ret = -errno;
> +            error_report("%s: Failed to start DMA logging, err %d (%s)",
> +                         vbasedev->name, ret, strerror(errno));
> +            goto out;
>          }
> +        vbasedev->dirty_tracking = true;
>      }
>  
>  out:
> @@ -1440,22 +1455,19 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
>                                             VFIOBitmap *vbmap, hwaddr iova,
>                                             hwaddr size)
>  {
> -    VFIODevice *vbasedev;
> -    VFIOGroup *group;
> +    VFIODevice *vbasedev = NULL;
>      int ret;
>  
> -    QLIST_FOREACH(group, &container->group_list, container_next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            ret = vfio_device_dma_logging_report(vbasedev, iova, size,
> -                                                 vbmap->bitmap);
> -            if (ret) {
> -                error_report("%s: Failed to get DMA logging report, iova: "
> -                             "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
> -                             ", err: %d (%s)",
> -                             vbasedev->name, iova, size, ret, strerror(-ret));
> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
> +        ret = vfio_device_dma_logging_report(vbasedev, iova, size,
> +                                             vbmap->bitmap);
> +        if (ret) {
> +            error_report("%s: Failed to get DMA logging report, iova: "
> +                         "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
> +                         ", err: %d (%s)",
> +                         vbasedev->name, iova, size, ret, strerror(-ret));
>  
> -                return ret;
> -            }
> +            return ret;
>          }
>      }
>  
> @@ -1739,21 +1751,30 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>  
>  void vfio_reset_handler(void *opaque)
>  {
> -    VFIOGroup *group;
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
>      VFIODevice *vbasedev;
>  
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->dev->realized) {
> -                vbasedev->ops->vfio_compute_needs_reset(vbasedev);
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            vbasedev = NULL;
> +            while ((vbasedev = vfio_container_dev_iter_next(container,
> +                                                            vbasedev))) {
> +                if (vbasedev->dev->realized) {
> +                    vbasedev->ops->vfio_compute_needs_reset(vbasedev);
> +                }
>              }
>          }
>      }
>  
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->dev->realized && vbasedev->needs_reset) {
> -                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            vbasedev = NULL;
> +            while ((vbasedev = vfio_container_dev_iter_next(container,
> +                                                            vbasedev))) {
> +                if (vbasedev->dev->realized && vbasedev->needs_reset) {
> +                    vbasedev->ops->vfio_hot_reset_multi(vbasedev);
> +                    }
>              }
>          }
>      }
> @@ -1841,6 +1862,10 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
>      space->as = as;
>      QLIST_INIT(&space->containers);
>  
> +    if (QLIST_EMPTY(&vfio_address_spaces)) {
> +        qemu_register_reset(vfio_reset_handler, NULL);
> +    }
> +

We could just have a vfio_device_list to avoid iterating either
containers and group or address spaces.  Thanks,

Alex
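
As a rough sketch of that idea (the list name, the global_next link and the
registration point are assumptions, nothing this series defines), the helpers
above could iterate a flat device list directly:

typedef QLIST_HEAD(VFIODeviceList, VFIODevice) VFIODeviceList;
extern VFIODeviceList vfio_device_list;   /* devices added when realized */

/* VFIODevice would gain:  QLIST_ENTRY(VFIODevice) global_next; */

bool vfio_mig_active(void)
{
    VFIODevice *vbasedev;

    if (QLIST_EMPTY(&vfio_device_list)) {
        return false;
    }

    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
        if (vbasedev->migration_blocker) {
            return false;
        }
    }
    return true;
}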

>      QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
>  
>      return space;
> @@ -1852,6 +1877,9 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
>          QLIST_REMOVE(space, list);
>          g_free(space);
>      }
> +    if (QLIST_EMPTY(&vfio_address_spaces)) {
> +        qemu_unregister_reset(vfio_reset_handler, NULL);
> +    }
>  }
>  
>  /*
> @@ -2317,10 +2345,6 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>          goto close_fd_exit;
>      }
>  
> -    if (QLIST_EMPTY(&vfio_group_list)) {
> -        qemu_register_reset(vfio_reset_handler, NULL);
> -    }
> -
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
>      return group;
> @@ -2349,10 +2373,6 @@ void vfio_put_group(VFIOGroup *group)
>      trace_vfio_put_group(group->fd);
>      close(group->fd);
>      g_free(group);
> -
> -    if (QLIST_EMPTY(&vfio_group_list)) {
> -        qemu_unregister_reset(vfio_reset_handler, NULL);
> -    }
>  }
>  
>  struct vfio_device_info *vfio_get_device_info(int fd)



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic
  2023-08-30 10:37 ` [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic Zhenzhong Duan
  2023-09-20 13:00   ` Eric Auger
@ 2023-09-20 22:51   ` Alex Williamson
  2023-09-21  6:13     ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Alex Williamson @ 2023-09-20 22:51 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, clg, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

On Wed, 30 Aug 2023 18:37:39 +0800
Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:

> So that it doesn't need to be moved into container.c as done
> in following patch.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/common.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 51c6e7598e..fda5fc87b9 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -219,7 +219,22 @@ void vfio_unblock_multiple_devices_migration(void)
>  
>  bool vfio_viommu_preset(VFIODevice *vbasedev)
>  {
> -    return vbasedev->group->container->space->as != &address_space_memory;
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +    VFIODevice *tmp_dev;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            tmp_dev = NULL;
> +            while ((tmp_dev = vfio_container_dev_iter_next(container,
> +                                                           tmp_dev))) {
> +                if (vbasedev == tmp_dev) {
> +                    return space->as != &address_space_memory;
> +                }
> +            }
> +        }
> +    }
> +    g_assert_not_reached();

Should the VFIODevice just have a pointer to the VFIOAddressSpace?
Thanks,

Alex
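
A sketch of that suggestion, assuming VFIODevice gains a backpointer that is
set when the device is attached (the field name is an assumption):

/* In VFIODevice:  VFIOAddressSpace *space; */

bool vfio_viommu_preset(VFIODevice *vbasedev)
{
    return vbasedev->space->as != &address_space_memory;
}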


>  }
>  
>  static void vfio_set_migration_error(int err)



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-09-20 11:49   ` Eric Auger
@ 2023-09-21  2:04     ` Duan, Zhenzhong
  2023-09-21  8:42     ` Cédric Le Goater
  1 sibling, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  2:04 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, September 20, 2023 7:49 PM
>Subject: Re: [PATCH v1 05/22] vfio/common: Extract out
>vfio_kvm_device_[add/del]_fd
>
>Hi Zhenzhong,
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> ...which will be used by both legacy and iommufd backend.
>I prefer genuine sentences in the commit msg. Also you explain what you
>do but not why.
>
>suggestion: Introduce two new helpers, vfio_kvm_device_[add/del]_fd
>which take as input a file descriptor which can be either a group fd or
>a cdev fd. This uses the new KVM_DEV_VFIO_FILE VFIO KVM device group,
>which aliases to the legacy KVM_DEV_VFIO_GROUP.
>
>vfio_kvm_device_add/del_group then call those new helpers.

Thanks, will update in v2.
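
For reference, the aliasing mentioned above means the legacy names are kept
as aliases of the new file-based ones in the uAPI headers, roughly:

#define KVM_DEV_VFIO_GROUP      KVM_DEV_VFIO_FILE
#define KVM_DEV_VFIO_GROUP_ADD  KVM_DEV_VFIO_FILE_ADD
#define KVM_DEV_VFIO_GROUP_DEL  KVM_DEV_VFIO_FILE_DEL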

>
>
>
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/vfio/common.c              | 44 +++++++++++++++++++++++------------
>>  include/hw/vfio/vfio-common.h |  3 +++
>>  2 files changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 67150e4575..949ad6714a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -1759,17 +1759,17 @@ void vfio_reset_handler(void *opaque)
>>      }
>>  }
>>
>> -static void vfio_kvm_device_add_group(VFIOGroup *group)
>> +int vfio_kvm_device_add_fd(int fd)
>>  {
>>  #ifdef CONFIG_KVM
>>      struct kvm_device_attr attr = {
>> -        .group = KVM_DEV_VFIO_GROUP,
>> -        .attr = KVM_DEV_VFIO_GROUP_ADD,
>> -        .addr = (uint64_t)(unsigned long)&group->fd,
>> +        .group = KVM_DEV_VFIO_FILE,
>> +        .attr = KVM_DEV_VFIO_FILE_ADD,
>> +        .addr = (uint64_t)(unsigned long)&fd,
>>      };
>>
>>      if (!kvm_enabled()) {
>> -        return;
>> +        return 0;
>>      }
>>
>>      if (vfio_kvm_device_fd < 0) {
>> @@ -1779,37 +1779,51 @@ static void
>vfio_kvm_device_add_group(VFIOGroup *group)
>>
>>          if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
>>              error_report("Failed to create KVM VFIO device: %m");
>> -            return;
>> +            return -ENODEV;
>can't you return -errno?
Will fix.

>>          }
>>
>>          vfio_kvm_device_fd = cd.fd;
>>      }
>>
>>      if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> -        error_report("Failed to add group %d to KVM VFIO device: %m",
>> -                     group->groupid);
>> +        error_report("Failed to add fd %d to KVM VFIO device: %m",
>> +                     fd);
>> +        return -errno;
>>      }
>>  #endif
>> +    return 0;
>>  }
>>
>> -static void vfio_kvm_device_del_group(VFIOGroup *group)
>> +static void vfio_kvm_device_add_group(VFIOGroup *group)
>> +{
>> +    vfio_kvm_device_add_fd(group->fd);
>Since vfio_kvm_device_add_fd now returns an error value, it's a pity not
>to use it and propagate it. Also you could fill an errp with the error
>msg and use it in vfio_connect_container(). But this is a new error
>handling there.

What about having vfio_kvm_device_add_fd() return void, like
vfio_kvm_device_add_group()? I just realized vfio_connect_container()
doesn't act on any failure of vfio_kvm_device_add_group(), so propagating
the error to vfio_connect_container() would only serve to print it out
there, which vfio_kvm_device_add_fd() already does.
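
For comparison, the errp-filling shape Eric suggests could look roughly like
the sketch below (it omits the KVM VFIO device creation path that the patch
keeps, and it is not what this series currently does):

/* needs qapi/error.h for error_setg_errno() */
int vfio_kvm_device_add_fd(int fd, Error **errp)
{
#ifdef CONFIG_KVM
    struct kvm_device_attr attr = {
        .group = KVM_DEV_VFIO_FILE,
        .attr = KVM_DEV_VFIO_FILE_ADD,
        .addr = (uint64_t)(unsigned long)&fd,
    };

    if (!kvm_enabled()) {
        return 0;
    }

    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
        error_setg_errno(errp, errno,
                         "Failed to add fd %d to KVM VFIO device", fd);
        return -errno;
    }
#endif
    return 0;
}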

>> +}
>> +
>> +int vfio_kvm_device_del_fd(int fd)
>not sure we want this to return an error. But if we do, I think it would
>be nicer to propagate the error up.

Same question as above.

>>  {
>>  #ifdef CONFIG_KVM
>>      struct kvm_device_attr attr = {
>> -        .group = KVM_DEV_VFIO_GROUP,
>> -        .attr = KVM_DEV_VFIO_GROUP_DEL,
>> -        .addr = (uint64_t)(unsigned long)&group->fd,
>> +        .group = KVM_DEV_VFIO_FILE,
>> +        .attr = KVM_DEV_VFIO_FILE_DEL,
>> +        .addr = (uint64_t)(unsigned long)&fd,
>>      };
>>
>>      if (vfio_kvm_device_fd < 0) {
>> -        return;
>> +        return -EINVAL;
>>      }
>>
>>      if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> -        error_report("Failed to remove group %d from KVM VFIO device: %m",
>> -                     group->groupid);
>> +        error_report("Failed to remove fd %d from KVM VFIO device: %m",
>> +                     fd);
>> +        return -EBADF;
>-errno?
Sure.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 12:19       ` Cédric Le Goater
  2023-09-20 12:51         ` Jason Gunthorpe
@ 2023-09-21  2:11         ` Duan, Zhenzhong
  1 sibling, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  2:11 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Wednesday, September 20, 2023 8:20 PM
>Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>
>On 9/20/23 05:42, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Sent: Wednesday, September 20, 2023 1:08 AM
>>> Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>>>
>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
>>>> iommufd support, enabled by default.
>>>
>>> Why would someone want to disable support at compile time ? It might
>>
>> For those users who only want to support legacy container feature?
>> Let me know if you still prefer to drop this patch, I'm fine with that.
>
>I think it is too early.
>
>>> have been useful for dev but now QEMU should self-adjust at runtime
>>> depending only on the host capabilities AFAIUI. Am I missing something ?
>>
>> IOMMUFD doesn't support all features of legacy container, so QEMU
>> doesn't self-adjust at runtime by checking if host supports IOMMUFD.
>> We need to specify it explicitly to use IOMMUFD as below:
>>
>>      -object iommufd,id=iommufd0
>>      -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
>
>OK. I am not sure this is the correct interface yet. At first glance,
>I wouldn't introduce a new object for a simple backend depending on a
>kernel interface. I would tend to prefer a "iommu-something" property
>of the vfio-pci device with string values: "legacy", "iommufd", "default"
>and define the various interfaces (the ops you proposed) for each
>depending on the user preference and the capabilities of the host and
>possibly the device.
>
>I might be wrong and this might have been discussed before. If so, it
>should go in the cover letter with other things : what is this patchset
>providing to VFIO (multiple iommu backends), how it is reaching that
>goal, how is it organized, how do we deal with the special case (spapr),
>what's the user interface, etc.

Got it, I'll add the "how is it organized, how do we deal with the special case (spapr)"
part. The other parts seem to be in the cover letter already; there is a diagram showing
the architecture of VFIO, the legacy BE and the IOMMUFD BE, etc.

Thanks
Zhenzhong
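
To make the property-based alternative quoted above concrete, it would
presumably look something like this on the command line (the property name
and its values are hypothetical):

    -device vfio-pci,host=0000:02:00.0,iommu-backend=legacy
    -device vfio-pci,host=0000:02:00.0,iommu-backend=iommufd

as opposed to linking an explicitly created iommufd object via the iommufd=
property, as this series does.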

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 06/22] vfio/common: Add a vfio device iterator
  2023-09-20 22:16   ` Alex Williamson
@ 2023-09-21  2:16     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  2:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, clg, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P



>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Subject: Re: [PATCH v1 06/22] vfio/common: Add a vfio device iterator
>
>On Wed, 30 Aug 2023 18:37:38 +0800
>Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
>
>> With a vfio device iterator added, we can make some migration and reset
>> related functions group agnostic.
>> E.x:
>> vfio_mig_active
>> vfio_migratable_device_num
>> vfio_devices_all_dirty_tracking
>> vfio_devices_all_device_dirty_tracking
>> vfio_devices_all_running_and_mig_active
>> vfio_devices_dma_logging_stop
>> vfio_devices_dma_logging_start
>> vfio_devices_query_dirty_bitmap
>> vfio_reset_handler
>>
>> Or else we need to add container specific callback variants for above
>> functions just because they iterate devices based on group.
>>
>> Move the reset handler registration/unregistration to a place that is not
>> group specific, saying first vfio address space created instead of the
>> first group.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/vfio/common.c | 224 ++++++++++++++++++++++++++---------------------
>>  1 file changed, 122 insertions(+), 102 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 949ad6714a..51c6e7598e 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -84,6 +84,26 @@ static int vfio_ram_block_discard_disable(VFIOContainer
>*container, bool state)
>>      }
>>  }
>>
>> +static VFIODevice *vfio_container_dev_iter_next(VFIOContainer *container,
>> +                                                VFIODevice *curr)
>> +{
>> +    VFIOGroup *group;
>> +
>> +    if (!curr) {
>> +        group = QLIST_FIRST(&container->group_list);
>> +    } else {
>> +        if (curr->next.le_next) {
>> +            return curr->next.le_next;
>> +        }
>
>
>VFIODevice *device = QLIST_NEXT(curr, next);
>
>if (device) {
>    return device;
>}
>
>> +        group = curr->group->container_next.le_next;
>
>
>group = QLIST_NEXT(curr->group, container_next);
>
>> +    }
>> +
>> +    if (!group) {
>> +        return NULL;
>> +    }
>> +    return QLIST_FIRST(&group->device_list);
>> +}
>> +
>>  /*
>>   * Device state interfaces
>>   */
>> @@ -112,17 +132,22 @@ static int vfio_get_dirty_bitmap(VFIOContainer
>*container, uint64_t iova,
>>
>>  bool vfio_mig_active(void)
>>  {
>> -    VFIOGroup *group;
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>>      VFIODevice *vbasedev;
>>
>> -    if (QLIST_EMPTY(&vfio_group_list)) {
>> +    if (QLIST_EMPTY(&vfio_address_spaces)) {
>>          return false;
>>      }
>>
>> -    QLIST_FOREACH(group, &vfio_group_list, next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            if (vbasedev->migration_blocker) {
>> -                return false;
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            vbasedev = NULL;
>> +            while ((vbasedev = vfio_container_dev_iter_next(container,
>> +                                                            vbasedev))) {
>> +                if (vbasedev->migration_blocker) {
>> +                    return false;
>> +                }
>
>Appears easy to avoid setting vbasedev in the loop iterator and
>improving the scope of vbasedev:
>
>VFIODevice *vbasedev = vfio_container_dev_iter_next(container, NULL);
>
>while (vbasedev) {
>    if (vbasedev->migration_blocker) {
>        return false;
>    }
>
>    vbasedev = vfio_container_dev_iter_next(container, vbasedev);
>}
>
>>              }
>>          }
>>      }
>> @@ -133,14 +158,19 @@ static Error *multiple_devices_migration_blocker;
>>
>>  static unsigned int vfio_migratable_device_num(void)
>>  {
>> -    VFIOGroup *group;
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>>      VFIODevice *vbasedev;
>>      unsigned int device_num = 0;
>>
>> -    QLIST_FOREACH(group, &vfio_group_list, next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            if (vbasedev->migration) {
>> -                device_num++;
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            vbasedev = NULL;
>> +            while ((vbasedev = vfio_container_dev_iter_next(container,
>> +                                                            vbasedev))) {
>> +                if (vbasedev->migration) {
>> +                    device_num++;
>> +                }
>
>Same as above.
>
>>              }
>>          }
>>      }
>> @@ -207,8 +237,7 @@ static void vfio_set_migration_error(int err)
>>
>>  static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>  {
>> -    VFIOGroup *group;
>> -    VFIODevice *vbasedev;
>> +    VFIODevice *vbasedev = NULL;
>>      MigrationState *ms = migrate_get_current();
>>
>>      if (ms->state != MIGRATION_STATUS_ACTIVE &&
>> @@ -216,19 +245,17 @@ static bool
>vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>          return false;
>>      }
>>
>> -    QLIST_FOREACH(group, &container->group_list, container_next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            VFIOMigration *migration = vbasedev->migration;
>> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
>> +        VFIOMigration *migration = vbasedev->migration;
>
>Similar, and all the other loops below.
>
>>
>> -            if (!migration) {
>> -                return false;
>> -            }
>> +        if (!migration) {
>> +            return false;
>> +        }
>>
>> -            if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
>> -                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> -                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>> -                return false;
>> -            }
>> +        if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
>> +            (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +             migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>> +            return false;
>>          }
>>      }
>>      return true;
>> @@ -236,14 +263,11 @@ static bool
>vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>
>>  static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>>  {
>> -    VFIOGroup *group;
>> -    VFIODevice *vbasedev;
>> +    VFIODevice *vbasedev = NULL;
>>
>> -    QLIST_FOREACH(group, &container->group_list, container_next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            if (!vbasedev->dirty_pages_supported) {
>> -                return false;
>> -            }
>> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
>> +        if (!vbasedev->dirty_pages_supported) {
>> +            return false;
>>          }
>>      }
>>
>> @@ -256,27 +280,24 @@ static bool
>vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
>>   */
>>  static bool vfio_devices_all_running_and_mig_active(VFIOContainer
>*container)
>>  {
>> -    VFIOGroup *group;
>> -    VFIODevice *vbasedev;
>> +    VFIODevice *vbasedev = NULL;
>>
>>      if (!migration_is_active(migrate_get_current())) {
>>          return false;
>>      }
>>
>> -    QLIST_FOREACH(group, &container->group_list, container_next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            VFIOMigration *migration = vbasedev->migration;
>> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
>> +        VFIOMigration *migration = vbasedev->migration;
>>
>> -            if (!migration) {
>> -                return false;
>> -            }
>> +        if (!migration) {
>> +            return false;
>> +        }
>>
>> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> -                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>> -                continue;
>> -            } else {
>> -                return false;
>> -            }
>> +        if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +            migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>> +            continue;
>> +        } else {
>> +            return false;
>>          }
>>      }
>>      return true;
>> @@ -1243,25 +1264,22 @@ static void
>vfio_devices_dma_logging_stop(VFIOContainer *container)
>>      uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature),
>>                                sizeof(uint64_t))] = {};
>>      struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
>> -    VFIODevice *vbasedev;
>> -    VFIOGroup *group;
>> +    VFIODevice *vbasedev = NULL;
>>
>>      feature->argsz = sizeof(buf);
>>      feature->flags = VFIO_DEVICE_FEATURE_SET |
>>                       VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
>>
>> -    QLIST_FOREACH(group, &container->group_list, container_next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            if (!vbasedev->dirty_tracking) {
>> -                continue;
>> -            }
>> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
>> +        if (!vbasedev->dirty_tracking) {
>> +            continue;
>> +        }
>>
>> -            if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>> -                warn_report("%s: Failed to stop DMA logging, err %d (%s)",
>> -                             vbasedev->name, -errno, strerror(errno));
>> -            }
>> -            vbasedev->dirty_tracking = false;
>> +        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
>> +            warn_report("%s: Failed to stop DMA logging, err %d (%s)",
>> +                        vbasedev->name, -errno, strerror(errno));
>>          }
>> +        vbasedev->dirty_tracking = false;
>>      }
>>  }
>>
>> @@ -1336,8 +1354,7 @@ static int
>vfio_devices_dma_logging_start(VFIOContainer *container)
>>  {
>>      struct vfio_device_feature *feature;
>>      VFIODirtyRanges ranges;
>> -    VFIODevice *vbasedev;
>> -    VFIOGroup *group;
>> +    VFIODevice *vbasedev = NULL;
>>      int ret = 0;
>>
>>      vfio_dirty_tracking_init(container, &ranges);
>> @@ -1347,21 +1364,19 @@ static int
>vfio_devices_dma_logging_start(VFIOContainer *container)
>>          return -errno;
>>      }
>>
>> -    QLIST_FOREACH(group, &container->group_list, container_next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            if (vbasedev->dirty_tracking) {
>> -                continue;
>> -            }
>> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
>> +        if (vbasedev->dirty_tracking) {
>> +            continue;
>> +        }
>>
>> -            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
>> -            if (ret) {
>> -                ret = -errno;
>> -                error_report("%s: Failed to start DMA logging, err %d (%s)",
>> -                             vbasedev->name, ret, strerror(errno));
>> -                goto out;
>> -            }
>> -            vbasedev->dirty_tracking = true;
>> +        ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
>> +        if (ret) {
>> +            ret = -errno;
>> +            error_report("%s: Failed to start DMA logging, err %d (%s)",
>> +                         vbasedev->name, ret, strerror(errno));
>> +            goto out;
>>          }
>> +        vbasedev->dirty_tracking = true;
>>      }
>>
>>  out:
>> @@ -1440,22 +1455,19 @@ static int
>vfio_devices_query_dirty_bitmap(VFIOContainer *container,
>>                                             VFIOBitmap *vbmap, hwaddr iova,
>>                                             hwaddr size)
>>  {
>> -    VFIODevice *vbasedev;
>> -    VFIOGroup *group;
>> +    VFIODevice *vbasedev = NULL;
>>      int ret;
>>
>> -    QLIST_FOREACH(group, &container->group_list, container_next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            ret = vfio_device_dma_logging_report(vbasedev, iova, size,
>> -                                                 vbmap->bitmap);
>> -            if (ret) {
>> -                error_report("%s: Failed to get DMA logging report, iova: "
>> -                             "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
>> -                             ", err: %d (%s)",
>> -                             vbasedev->name, iova, size, ret, strerror(-ret));
>> +    while ((vbasedev = vfio_container_dev_iter_next(container, vbasedev))) {
>> +        ret = vfio_device_dma_logging_report(vbasedev, iova, size,
>> +                                             vbmap->bitmap);
>> +        if (ret) {
>> +            error_report("%s: Failed to get DMA logging report, iova: "
>> +                         "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
>> +                         ", err: %d (%s)",
>> +                         vbasedev->name, iova, size, ret, strerror(-ret));
>>
>> -                return ret;
>> -            }
>> +            return ret;
>>          }
>>      }
>>
>> @@ -1739,21 +1751,30 @@ bool vfio_get_info_dma_avail(struct
>vfio_iommu_type1_info *info,
>>
>>  void vfio_reset_handler(void *opaque)
>>  {
>> -    VFIOGroup *group;
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>>      VFIODevice *vbasedev;
>>
>> -    QLIST_FOREACH(group, &vfio_group_list, next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            if (vbasedev->dev->realized) {
>> -                vbasedev->ops->vfio_compute_needs_reset(vbasedev);
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            vbasedev = NULL;
>> +            while ((vbasedev = vfio_container_dev_iter_next(container,
>> +                                                            vbasedev))) {
>> +                if (vbasedev->dev->realized) {
>> +                    vbasedev->ops->vfio_compute_needs_reset(vbasedev);
>> +                }
>>              }
>>          }
>>      }
>>
>> -    QLIST_FOREACH(group, &vfio_group_list, next) {
>> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -            if (vbasedev->dev->realized && vbasedev->needs_reset) {
>> -                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            vbasedev = NULL;
>> +            while ((vbasedev = vfio_container_dev_iter_next(container,
>> +                                                            vbasedev))) {
>> +                if (vbasedev->dev->realized && vbasedev->needs_reset) {
>> +                    vbasedev->ops->vfio_hot_reset_multi(vbasedev);
>> +                }
>>              }
>>          }
>>      }
>> @@ -1841,6 +1862,10 @@ static VFIOAddressSpace
>*vfio_get_address_space(AddressSpace *as)
>>      space->as = as;
>>      QLIST_INIT(&space->containers);
>>
>> +    if (QLIST_EMPTY(&vfio_address_spaces)) {
>> +        qemu_register_reset(vfio_reset_handler, NULL);
>> +    }
>> +
>
>We could just have a vfio_device_list to avoid iterating either
>containers and group or address spaces.  Thanks,

Good idea! Will do.
A vfio_device_list can be used by both BEs and I can have this
patch dropped.
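
A rough sketch of what I have in mind (the list and field names below are
tentative, not final):

/*
 * vfio-common.h: one global list shared by both backends; VFIODevice
 * would gain a QLIST_ENTRY(VFIODevice) global_next field and be added
 * to the list at attach time, removed at detach time.
 */
typedef QLIST_HEAD(VFIODeviceList, VFIODevice) VFIODeviceList;
extern VFIODeviceList vfio_device_list;

/* e.g. vfio_reset_handler() then no longer cares about groups: */
void vfio_reset_handler(void *opaque)
{
    VFIODevice *vbasedev;

    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
        if (vbasedev->dev->realized) {
            vbasedev->ops->vfio_compute_needs_reset(vbasedev);
        }
    }

    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
        if (vbasedev->dev->realized && vbasedev->needs_reset) {
            vbasedev->ops->vfio_hot_reset_multi(vbasedev);
        }
    }
}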

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 06/22] vfio/common: Add a vfio device iterator
  2023-09-20 12:25   ` Eric Auger
@ 2023-09-21  2:27     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  2:27 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, September 20, 2023 8:26 PM
>Subject: Re: [PATCH v1 06/22] vfio/common: Add a vfio device iterator
>
>Hi Zhenzhong,
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> With a vfio device iterator added, we can make some migration and reset
>> related functions group agnostic.
>> E.x:
>> vfio_mig_active
>> vfio_migratable_device_num
>> vfio_devices_all_dirty_tracking
>> vfio_devices_all_device_dirty_tracking
>> vfio_devices_all_running_and_mig_active
>> vfio_devices_dma_logging_stop
>> vfio_devices_dma_logging_start
>> vfio_devices_query_dirty_bitmap
>> vfio_reset_handler
>>
>> Or else we need to add container specific callback variants for above
>> functions just because they iterate devices based on group.
>>
>> Move the reset handler registration/unregistration to a place that is not
>> group specific, saying first vfio address space created instead of the
>> first group.
>I would move the reset handler registration/unregistration changes in a
>separate patch.

Got it.

>Besides, I don't get what you mean by
>"saying first vfio address space created instead of the first group."

Before this patch, the reset handler is registered at first group creation;
after this patch, it's registered at first address space creation.
The main purpose is to make this code group agnostic.
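
On the unregistration side, assuming it mirrors the hunk above and moves into
vfio_put_address_space(), the result would roughly be:

static void vfio_put_address_space(VFIOAddressSpace *space)
{
    if (QLIST_EMPTY(&space->containers)) {
        QLIST_REMOVE(space, list);
        g_free(space);
    }
    if (QLIST_EMPTY(&vfio_address_spaces)) {
        qemu_unregister_reset(vfio_reset_handler, NULL);
    }
}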

For the device iteration part of this patch, I plan to follow Alex's
suggestion to use vfio_device_list for both BEs. Thanks for your
time on this patch.

BRs.
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 17/22] util/char_dev: Add open_cdev()
  2023-09-20 12:39   ` Daniel P. Berrangé
  2023-09-20 12:53     ` Jason Gunthorpe
@ 2023-09-21  2:37     ` Duan, Zhenzhong
  1 sibling, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  2:37 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, alex.williamson, clg, jgg, nicolinc, Martins, Joao,
	eric.auger, peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P



>-----Original Message-----
>From: Daniel P. Berrangé <berrange@redhat.com>
>Sent: Wednesday, September 20, 2023 8:39 PM
>Subject: Re: [PATCH v1 17/22] util/char_dev: Add open_cdev()
>
>On Wed, Aug 30, 2023 at 06:37:49PM +0800, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> /dev/vfio/devices/vfioX may not exist. In that case it is still possible
>> to open /dev/char/$major:$minor instead. Add helper function to abstract
>> the cdev open.
>>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  MAINTAINERS             |  6 ++++
>>  include/qemu/char_dev.h | 16 +++++++++++
>>  util/chardev_open.c     | 61 +++++++++++++++++++++++++++++++++++++++++
>
>Using the same naming scheme for the .c and .h is strongly desired.

Got it.

>
>>  util/meson.build        |  1 +
>>  4 files changed, 84 insertions(+)
>>  create mode 100644 include/qemu/char_dev.h
>>  create mode 100644 util/chardev_open.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 04663fbb6f..74d18593fe 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3372,6 +3372,12 @@ S: Maintained
>>  F: include/qemu/iova-tree.h
>>  F: util/iova-tree.c
>>
>> +cdev Open
>> +M: Yi Liu <yi.l.liu@intel.com>
>> +S: Maintained
>> +F: include/qemu/char_dev.h
>> +F: util/chardev_open.c
>> +
>
>
>> diff --git a/util/chardev_open.c b/util/chardev_open.c
>> new file mode 100644
>> index 0000000000..d03e415131
>> --- /dev/null
>> +++ b/util/chardev_open.c
>> @@ -0,0 +1,61 @@
>> +/*
>> + * Copyright (C) 2023 Intel Corporation.
>> + * Copyright (c) 2019, Mellanox Technologies. All rights reserved.
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Copied from
>> + * https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
>> + *
>> + */
>
>Since this is GPL-2.0-only, IMHO it would be preferrable to keep it
>out of the util/ directory, as we're aiming to not add further 2.0
>only code, except for specific subdirs. This only appears to be used
>by code under hw/vfio/, whcih is one of the dirs still permitting
>2.0-only code. So I think better to keep this file where it is used.

I'll copy the original license header to preserve the GPL OR BSD choice.
As it's no longer restricted to GPL-2.0-only, I plan to keep it in util/.
Let me know if you still prefer to move it to hw/vfio/.

>
>> +#ifndef _GNU_SOURCE
>> +#define _GNU_SOURCE
>> +#endif
>
>This is set globally for building all files in QEMU

Will remove it.

>
>> +#include "qemu/osdep.h"
>> +#include "qemu/char_dev.h"
>> +
>> +static int open_cdev_internal(const char *path, dev_t cdev)
>> +{
>> +    struct stat st;
>> +    int fd;
>> +
>> +    fd = qemu_open_old(path, O_RDWR);
>> +    if (fd == -1) {
>> +        return -1;
>> +    }
>> +    if (fstat(fd, &st) || !S_ISCHR(st.st_mode) ||
>> +        (cdev != 0 && st.st_rdev != cdev)) {
>> +        close(fd);
>> +        return -1;
>> +    }
>> +    return fd;
>> +}
>> +
>> +static int open_cdev_robust(dev_t cdev)
>> +{
>> +    char *devpath;
>
>g_autofree for this...

Will do.

>
>> +    int ret;
>> +
>> +    /*
>> +     * This assumes that udev is being used and is creating the /dev/char/
>> +     * symlinks.
>> +     */
>> +    devpath = g_strdup_printf("/dev/char/%u:%u", major(cdev), minor(cdev));
>> +    ret = open_cdev_internal(devpath, cdev);
>> +    g_free(devpath);
>
>...avoids the need for g_free, and also avoids the need for
>the intermediate 'ret' variable.

Yes.
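
For reference, the reworked helper would then look roughly like:

static int open_cdev_robust(dev_t cdev)
{
    g_autofree char *devpath = NULL;

    /*
     * This assumes that udev is being used and is creating the /dev/char/
     * symlinks.
     */
    devpath = g_strdup_printf("/dev/char/%u:%u", major(cdev), minor(cdev));

    return open_cdev_internal(devpath, cdev);
}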

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 13/22] vfio: Add base container
  2023-09-20 12:57       ` Cédric Le Goater
  2023-09-20 13:58         ` Eric Auger
@ 2023-09-21  2:51         ` Duan, Zhenzhong
  1 sibling, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  2:51 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Yi Sun, Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Wednesday, September 20, 2023 8:58 PM
>Subject: Re: [PATCH v1 13/22] vfio: Add base container
>
>On 9/20/23 10:48, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Sent: Wednesday, September 20, 2023 1:24 AM
>>> Subject: Re: [PATCH v1 13/22] vfio: Add base container
>>>
>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> Abstract the VFIOContainer to be a base object. It is supposed to be
>>>> embedded by legacy VFIO container and later on, into the new iommufd
>>>> based container.
>>>>
>>>> The base container implements generic code such as code related to
>>>> memory_listener and address space management. The VFIOContainerOps
>>>> implements callbacks that depend on the kernel user space being used.
>>>>
>>>> 'common.c' and vfio device code only manipulates the base container with
>>>> wrapper functions that calls the functions defined in VFIOContainerOpsClass.
>>>> Existing 'container.c' code is converted to implement the legacy container
>>>> ops functions.
>>>>
>>>> Below is the base container. It's named as VFIOContainer, old VFIOContainer
>>>> is replaced with VFIOLegacyContainer.
>>>
>>> Usualy, we introduce the new interface solely, port the current models
>>> on top of the new interface, wire the new models in the current
>>> implementation and remove the old implementation. Then, we can start
>>> adding extensions to support other implementations.
>>
>> Not sure if I understand your point correctly. Do you mean to introduce
>> a new type for the base container as below:
>>
>> static const TypeInfo vfio_container_info = {
>>      .parent             = TYPE_OBJECT,
>>      .name               = TYPE_VFIO_CONTAINER,
>>      .class_size         = sizeof(VFIOContainerClass),
>>      .instance_size      = sizeof(VFIOContainer),
>>      .abstract           = true,
>>      .interfaces = (InterfaceInfo[]) {
>>          { TYPE_VFIO_IOMMU_BACKEND_OPS },
>>          { }
>>      }
>> };
>>
>> and a new interface as below:
>>
>>      static const TypeInfo vfio_iommu_backend_ops_info = {
>>      .name = TYPE_VFIO_IOMMU_BACKEND_OPS,
>>      .parent = TYPE_INTERFACE,
>>      .class_size = sizeof(VFIOIOMMUBackendOpsClass),
>> };
>>
>> struct VFIOIOMMUBackendOpsClass {
>>      InterfaceClass parent;
>>      VFIODevice *(*dev_iter_next)(VFIOContainer *container, VFIODevice *curr);
>>      int (*dma_map)(VFIOContainer *container,
>>      ......
>> };
>>
>> and legacy container on top of TYPE_VFIO_CONTAINER?
>>
>> static const TypeInfo vfio_legacy_container_info = {
>>      .parent = TYPE_VFIO_CONTAINER,
>>      .name = TYPE_VFIO_LEGACY_CONTAINER,
>>      .class_init = vfio_legacy_container_class_init,
>> };
>>
>> This object style is rejected early in RFCv1.
>> See https://lore.kernel.org/kvm/20220414104710.28534-8-yi.l.liu@intel.com/
>
>Ouch, this was long ago and I was not aware :/ Bear with me, I will
>probably ask the same questions. Nevertheless, we could improve the
>cover and the flow of changes in the patchset to help the reader.

Sure.

>
>>> spapr should be taken care of separatly following the principle above.
>>> With my PPC hat, I would not even read such a massive change, too risky
>>> for the subsystem. This path will need (much) further splitting to be
>>> understandable and acceptable.
>>
>> I'll dig into this and try to split it.
>
>I know I am asking for a lot of work. Thanks for that.

Np, all comments, suggestions, etc are appreciated 😊

>
>> Meanwhile, there are many changes
>> just renaming the parameter or function name for code readability.
>> For example:
>>
>> -int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
>> -                   ram_addr_t size, IOMMUTLBEntry *iotlb)
>> +static int vfio_legacy_dma_unmap(VFIOContainer *bcontainer, hwaddr iova,
>> +                          ram_addr_t size, IOMMUTLBEntry *iotlb)
>>
>> -        ret = vfio_get_dirty_bitmap(container, iova, size,
>> +        ret = vfio_get_dirty_bitmap(bcontainer, iova, size,
>>
>> Let me know if you think such changes are unnecessary; dropping them
>> would reduce this patch considerably.
>
>Cleanups, renames, some code reshuffling, anything preparing ground for
>the new abstraction is good to have first and can be merged very quickly
>if there are no functional changes. It reduces the overall patchset and
>eases the coming reviews.
>
>You can send such series independently. That's fine.

Got it.

>
>>
>>>
>>> Also, please include the .h file first, it helps in reading.
>>
>> Do you mean to put struct declaration earlier in patch description?
>
>Just add to your .gitconfig :
>
>[diff]
>	orderFile = /path/to/qemu/scripts/git.orderfile
>
>It should be enough

Understood.

>
>>> Have you considered using an InterfaceClass ?
>>
>> See above, with object style rejected, it looks hard to use InterfaceClass.
>
>I am not convinced by the QOM approach. I will dig in the past arguments
>and let's see what we come with.

Thanks for your time.

BRs.
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic
  2023-09-20 13:00   ` Eric Auger
@ 2023-09-21  2:52     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  2:52 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, September 20, 2023 9:01 PM
>Subject: Re: [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to
>be group agnostic
>
>
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> So that it doesn't need to be moved into container.c as done
>> in following patch.
>This is a bit weird, referring to container.c which is not yet created. I
>would suggest just reusing the commit title as the commit msg; this will
>also make it easier to handle multiple IOMMU BEs.

Will fix, thanks Eric.

BRs.
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 08/22] vfio/common: Move legacy VFIO backend code into separate container.c
  2023-09-20 13:12   ` Eric Auger
@ 2023-09-21  3:02     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  3:02 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, September 20, 2023 9:12 PM
>Subject: Re: [PATCH v1 08/22] vfio/common: Move legacy VFIO backend code
>into separate container.c
>
>Hi,
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Move all the code really dependent on the legacy VFIO container/group
>> into a separate file: container.c. What does remain in common.c is
>> the code related to VFIOAddressSpace, MemoryListeners, migration and
>> all other general operations.
>>
>> Move struct VFIOBitmap declaration to vfio-common.h also for containter.c
>> usage.
>note: this may be done in the 3rd patch since vfio_bitmap_alloc could
>land in helpers.c

Good idea, will do.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device
  2023-09-20 13:33   ` Eric Auger
@ 2023-09-21  3:08     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  3:08 UTC (permalink / raw)
  To: eric.auger, qemu-devel
  Cc: alex.williamson, clg, jgg, nicolinc, Martins, Joao, peterx,
	jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, September 20, 2023 9:33 PM
>Subject: Re: [PATCH v1 09/22] vfio/container: Introduce
>vfio_[attach/detach]_device
>
>Hi Zhenzhong,
>
>In the commit title I would replace vfio/container by vfio/pci to match
>next patches

Make sense, will do.

>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Eric Auger <eric.auger@redhat.com>
>>
>> We want the VFIO devices to be able to use two different
>> IOMMU callbacks, the legacy VFIO one and the new iommufd one.
>s/callbacks/backends
>>
>> Introduce vfio_[attach/detach]_device which aim at hiding the
>> underlying IOMMU backend (IOCTLs, datatypes, ...).
>
......

>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index ee7509e68e..8016d9f0d2 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -37,7 +37,7 @@ vfio_pci_hot_reset_dep_devices(int domain, int bus, int
>slot, int function, int
>>  vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot
>reset: %s"
>>  vfio_populate_device_config(const char *name, unsigned long size, unsigned
>long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx,
>flags: 0x%lx"
>>  vfio_populate_device_get_irq_info_failure(const char *errstr)
>"VFIO_DEVICE_GET_IRQ_INFO failure: %s"
>> -vfio_realize(const char *name, int group_id) " (%s) group %d"
>> +vfio_realize(const char *name) " (%s)"
>I am not sure this trace point is useful anymore, without the id. Some
>tracepoints shall be BE specific to keep their usefulness and should be
>called from container.c/iommufd.c instead of in the generic function.

Previously I used this trace event just to hint that vfio realize was starting.
I agree with you that making it BE specific would show more useful information.
I'll fix it in v2.
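
For illustration only (the names below are made up, not what will land in v2),
container.c could keep the group id in its own trace point:

# hw/vfio/trace-events (sketch)
vfio_attach_device_legacy(const char *name, int groupid) " (%s) group %d"

emitted from the legacy attach path once the group id is known, while
iommufd.c would get its own variant printing the iommufd/devid details.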

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 13/22] vfio: Add base container
  2023-09-20 13:53     ` Eric Auger
@ 2023-09-21  3:12       ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  3:12 UTC (permalink / raw)
  To: eric.auger, Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, peterx, jasowang,
	Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P, Yi Sun,
	Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, September 20, 2023 9:54 PM
>Subject: Re: [PATCH v1 13/22] vfio: Add base container
>
>Hi Cedric,
>
>On 9/19/23 19:23, Cédric Le Goater wrote:
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> Abstract the VFIOContainer to be a base object. It is supposed to be
>>> embedded by legacy VFIO container and later on, into the new iommufd
>>> based container.
>>>
>>> The base container implements generic code such as code related to
>>> memory_listener and address space management. The VFIOContainerOps
>>> implements callbacks that depend on the kernel user space being used.
>>>
>>> 'common.c' and vfio device code only manipulates the base container with
>>> wrapper functions that calls the functions defined in
>>> VFIOContainerOpsClass.
>>> Existing 'container.c' code is converted to implement the legacy
>>> container
>>> ops functions.
>>>
>>> Below is the base container. It's named as VFIOContainer, old
>>> VFIOContainer
>>> is replaced with VFIOLegacyContainer.
>>
>> Usualy, we introduce the new interface solely, port the current models
>> on top of the new interface, wire the new models in the current
>> implementation and remove the old implementation. Then, we can start
>> adding extensions to support other implementations.
>> spapr should be taken care of separatly following the principle above.
>> With my PPC hat, I would not even read such a massive change, too risky
>> for the subsystem. This path will need (much) further splitting to be
>> understandable and acceptable.
>>
>> Also, please include the .h file first, it helps in reading. Have you
>> considered using an InterfaceClass ?
>in the transition from v1 -> v2, I removed the QOMification of the
>VFIOContainer, following David Gibson's advice. QOM objects are visible
>from the user interface and there was no interest in that. Does it
>answer your question?
>
>- remove the QOMification of the VFIOContainer and simply use standard ops
>(David)
>
>Unfortunately the coverletter log history has disappeared in this new version.
>Zhenzhong, I think it is useful to understand how the series moves on.

I had archived it at https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02529.html
to keep the cover letter cleaner, but it looks like I was wrong. I'll restore
the whole changelog in v2.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 13/22] vfio: Add base container
  2023-09-20 17:31     ` Eric Auger
@ 2023-09-21  3:35       ` Duan, Zhenzhong
  2023-09-21  6:28         ` Eric Auger
  2023-09-21 17:20         ` Eric Auger
  0 siblings, 2 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  3:35 UTC (permalink / raw)
  To: eric.auger, Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, peterx, jasowang,
	Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P, Yi Sun,
	Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Thursday, September 21, 2023 1:31 AM
>Subject: Re: [PATCH v1 13/22] vfio: Add base container
>
>Hi Zhenzhong,
>
>On 9/19/23 19:23, Cédric Le Goater wrote:
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> Abstract the VFIOContainer to be a base object. It is supposed to be
>>> embedded by legacy VFIO container and later on, into the new iommufd
>>> based container.
>>>
>>> The base container implements generic code such as code related to
>>> memory_listener and address space management. The VFIOContainerOps
>>> implements callbacks that depend on the kernel user space being used.
>>>
>>> 'common.c' and vfio device code only manipulates the base container with
>>> wrapper functions that calls the functions defined in
>>> VFIOContainerOpsClass.
>>> Existing 'container.c' code is converted to implement the legacy
>>> container
>>> ops functions.
>>>
>>> Below is the base container. It's named as VFIOContainer, old
>>> VFIOContainer
>>> is replaced with VFIOLegacyContainer.
>>
>> Usualy, we introduce the new interface solely, port the current models
>> on top of the new interface, wire the new models in the current
>> implementation and remove the old implementation. Then, we can start
>> adding extensions to support other implementations.
>>
>> spapr should be taken care of separatly following the principle above.
>> With my PPC hat, I would not even read such a massive change, too risky
>> for the subsystem. This path will need (much) further splitting to be
>> understandable and acceptable.
>We might split this patch by
>1) introducing VFIOLegacyContainer encapsulating the base VFIOContainer,
>without using the ops in a first place:
> common.c would call vfio_container_* with harcoded legacy
>implementation, ie. retrieving the legacy container with container_of.
>2) we would introduce the BE interface without using it.
>3) we would use the new BE interface
>
>Obviously this needs to be further tried out. If you wish I can try to
>split it that way ... Please let me know

Sure, thanks for your help; glad we can cooperate to move this series forward.
I just updated the branch, rebased onto the latest upstream, for you to pick up
at https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_cdev_v1_rebased
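
To make sure we mean the same thing for step 1), here is a very rough sketch
(all names tentative, not necessarily what will be in v2):

/* base container, in vfio-common.h */
typedef struct VFIOContainer {
    VFIOAddressSpace *space;
    MemoryListener listener;
    QLIST_ENTRY(VFIOContainer) next;
} VFIOContainer;

/* legacy container, only known to container.c */
typedef struct VFIOLegacyContainer {
    VFIOContainer bcontainer;
    int fd;                         /* /dev/vfio/vfio container fd */
    unsigned iommu_type;
    QLIST_HEAD(, VFIOGroup) group_list;
} VFIOLegacyContainer;

/* step 1: common.c hardcodes the legacy implementation, no ops yet */
int vfio_container_dma_unmap(VFIOContainer *bcontainer, hwaddr iova,
                             ram_addr_t size, IOMMUTLBEntry *iotlb)
{
    VFIOLegacyContainer *container =
        container_of(bcontainer, VFIOLegacyContainer, bcontainer);

    return vfio_legacy_dma_unmap(container, iova, size, iotlb);
}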

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 18:19                   ` Jason Gunthorpe
@ 2023-09-21  3:43                     ` Duan, Zhenzhong
  2023-09-26  6:05                     ` Tian, Kevin
  1 sibling, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  3:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Eric Auger, Cédric Le Goater, qemu-devel, nicolinc, Martins,
	Joao, peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé



>-----Original Message-----
>From: Jason Gunthorpe <jgg@nvidia.com>
>Sent: Thursday, September 21, 2023 2:20 AM
>Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>
>On Wed, Sep 20, 2023 at 12:17:24PM -0600, Alex Williamson wrote:
>
>> > The iommufd design requires one open of the /dev/iommu to be shared
>> > across all the vfios.
>>
>> "requires"?  It's certainly of limited value to have multiple iommufd
>> instances rather than create multiple address spaces within a single
>> iommufd, but what exactly precludes an iommufd per device if QEMU, or
>> any other userspace so desired?  Thanks,
>
>From the kernel side requires is too strong I suppose
>
>Not sure about these qemu patches though?

I have tested this series with multiple IOMMUFDs and with a mix of
IOMMUFD/legacy BEs linked to different VFIO devices; everything works fine.
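
For example, configurations along these lines were exercised (the host BDFs
other than 02:00.0 are just placeholders here):

    -object iommufd,id=iommufd0 \
    -object iommufd,id=iommufd1 \
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0 \
    -device vfio-pci,host=0000:03:00.0,iommufd=iommufd1 \
    -device vfio-pci,host=0000:04:00.0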

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 13:02           ` Cédric Le Goater
  2023-09-20 17:37             ` Eric Auger
@ 2023-09-21  4:00             ` Duan, Zhenzhong
  1 sibling, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  4:00 UTC (permalink / raw)
  To: Cédric Le Goater, Jason Gunthorpe
  Cc: qemu-devel, alex.williamson, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Wednesday, September 20, 2023 9:02 PM
>Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>
>On 9/20/23 14:51, Jason Gunthorpe wrote:
>> On Wed, Sep 20, 2023 at 02:19:42PM +0200, Cédric Le Goater wrote:
>>> On 9/20/23 05:42, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Cédric Le Goater <clg@redhat.com>
>>>>> Sent: Wednesday, September 20, 2023 1:08 AM
>>>>> Subject: Re: [PATCH v1 15/22] Add iommufd configure option
>>>>>
>>>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>>>> This adds "--enable-iommufd/--disable-iommufd" to enable or disable
>>>>>> iommufd support, enabled by default.
>>>>>
>>>>> Why would someone want to disable support at compile time ? It might
>>>>
>>>> For those users who only want to support legacy container feature?
>>>> Let me know if you still prefer to drop this patch, I'm fine with that.
>>>
>>> I think it is too early.
>>>
>>>>> have been useful for dev but now QEMU should self-adjust at runtime
>>>>> depending only on the host capabilities AFAIUI. Am I missing something ?
>>>>
>>>> IOMMUFD doesn't support all features of legacy container, so QEMU
>>>> doesn't self-adjust at runtime by checking if host supports IOMMUFD.
>>>> We need to specify it explicitly to use IOMMUFD as below:
>>>>
>>>>       -object iommufd,id=iommufd0
>>>>       -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
>>>
>>> OK. I am not sure this is the correct interface yet. At first glance,
>>> I wouldn't introduce a new object for a simple backend depending on a
>>> kernel interface. I would tend to prefer a "iommu-something" property
>>> of the vfio-pci device with string values: "legacy", "iommufd", "default"
>>> and define the various interfaces (the ops you proposed) for each
>>> depending on the user preference and the capabilities of the host and
>>> possibly the device.
>>
>> I think the idea came from Alex? The major point is to be able to have
>> libvirt open /dev/iommufd and FD pass it into qemu
>
>ok.
>
>> and then share that single FD across all VFIOs.
>
>I will ask Alex to help me catch up on the topic.
>
>> qemu will typically not be able to
>> self-open /dev/iommufd as it is root-only.
>
>I don't understand, we open multiple fds to KVM devices. This is the same.

There are two slight differences:

1. Different group:
$ ll /dev/kvm
crw-rw----+ 1 root kvm 10, 232 Sep 18 14:23 /dev/kvm
$ ll /dev/iommu
crw-rw---- 1 root root 10, 124 Sep 12 14:14 /dev/iommu

2. Default cgroup device allowed list:
#cgroup_device_acl = [
#    "/dev/null", "/dev/full", "/dev/zero",
#    "/dev/random", "/dev/urandom",
#    "/dev/ptmx", "/dev/kvm"
#]

By default, libvirt creates the QEMU instance with user/group libvirt-qemu/kvm,
so QEMU has permission to open /dev/kvm, but not /dev/iommu.

When an unprivileged user tries to open /dev/kvm, it's not permitted:

duan@duan-server-S2600BT:~$ qemu-system-x86_64 -accel kvm
Could not access KVM kernel module: Permission denied
qemu-system-x86_64: -accel kvm: failed to initialize kvm: Permission denied
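
Just for illustration, if one really wanted QEMU to open /dev/iommu by itself
(not what this series relies on, it uses fd passing from the management layer
instead), both the node permissions (e.g. via a udev rule) and the libvirt
device ACL would have to change, along the lines of:

#    /etc/libvirt/qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm", "/dev/iommu"
]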

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-09-20 21:39   ` Alex Williamson
@ 2023-09-21  6:03     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  6:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, clg, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P



>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Sent: Thursday, September 21, 2023 5:40 AM
>Subject: Re: [PATCH v1 05/22] vfio/common: Extract out
>vfio_kvm_device_[add/del]_fd
>
>On Wed, 30 Aug 2023 18:37:37 +0800
>Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
>
>> ...which will be used by both legacy and iommufd backend.
>
>+1 to Eric's comments regarding complete sentences in the commit log
>and suggested description.

Will fix.

>
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/vfio/common.c              | 44 +++++++++++++++++++++++------------
>>  include/hw/vfio/vfio-common.h |  3 +++
>>  2 files changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 67150e4575..949ad6714a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -1759,17 +1759,17 @@ void vfio_reset_handler(void *opaque)
>>      }
>>  }
>>
>> -static void vfio_kvm_device_add_group(VFIOGroup *group)
>> +int vfio_kvm_device_add_fd(int fd)
>
>Returning int vs void looks gratuitous, nothing uses the return value
>in this series.

Will return void.

>
>>  {
>>  #ifdef CONFIG_KVM
>>      struct kvm_device_attr attr = {
>> -        .group = KVM_DEV_VFIO_GROUP,
>> -        .attr = KVM_DEV_VFIO_GROUP_ADD,
>> -        .addr = (uint64_t)(unsigned long)&group->fd,
>> +        .group = KVM_DEV_VFIO_FILE,
>> +        .attr = KVM_DEV_VFIO_FILE_ADD,
>> +        .addr = (uint64_t)(unsigned long)&fd,
>>      };
>>
>>      if (!kvm_enabled()) {
>> -        return;
>> +        return 0;
>>      }
>>
>>      if (vfio_kvm_device_fd < 0) {
>> @@ -1779,37 +1779,51 @@ static void
>vfio_kvm_device_add_group(VFIOGroup *group)
>>
>>          if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
>>              error_report("Failed to create KVM VFIO device: %m");
>> -            return;
>> +            return -ENODEV;
>>          }
>>
>>          vfio_kvm_device_fd = cd.fd;
>>      }
>>
>>      if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> -        error_report("Failed to add group %d to KVM VFIO device: %m",
>> -                     group->groupid);
>> +        error_report("Failed to add fd %d to KVM VFIO device: %m",
>> +                     fd);
>
>It's not nearly as useful to report an fd# in the error log vs the
>group#.  Thanks,

What about checking the return value of vfio_kvm_device_add_fd() and calling
error_report() in vfio_kvm_device_add_group()? That would duplicate the error
report though. Is that acceptable?
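
i.e. roughly:

static void vfio_kvm_device_add_group(VFIOGroup *group)
{
    if (vfio_kvm_device_add_fd(group->fd)) {
        error_report("Failed to add group %d to KVM VFIO device",
                     group->groupid);
    }
}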

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic
  2023-09-20 22:51   ` Alex Williamson
@ 2023-09-21  6:13     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21  6:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, clg, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P



>-----Original Message-----
>From: Alex Williamson <alex.williamson@redhat.com>
>Sent: Thursday, September 21, 2023 6:51 AM
>Subject: Re: [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to
>be group agnostic
>
>On Wed, 30 Aug 2023 18:37:39 +0800
>Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
>
>> So that it doesn't need to be moved into container.c as done
>> in following patch.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/vfio/common.c | 17 ++++++++++++++++-
>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 51c6e7598e..fda5fc87b9 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -219,7 +219,22 @@ void vfio_unblock_multiple_devices_migration(void)
>>
>>  bool vfio_viommu_preset(VFIODevice *vbasedev)
>>  {
>> -    return vbasedev->group->container->space->as != &address_space_memory;
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +    VFIODevice *tmp_dev;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            tmp_dev = NULL;
>> +            while ((tmp_dev = vfio_container_dev_iter_next(container,
>> +                                                           tmp_dev))) {
>> +                if (vbasedev == tmp_dev) {
>> +                    return space->as != &address_space_memory;
>> +                }
>> +            }
>> +        }
>> +    }
>> +    g_assert_not_reached();
>
>Should the VFIODevice just have a pointer to the VFIOAddressSpace?

After "[PATCH v1 13/22] vfio: Add base container", VFIODevice has pointer
to base container which has pointer further to VFIOAddressSpace.
Is that meet your expectation?
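
i.e., assuming the device keeps a pointer to the base container (say
vbasedev->bcontainer), vfio_viommu_preset() could simply become:

bool vfio_viommu_preset(VFIODevice *vbasedev)
{
    return vbasedev->bcontainer->space->as != &address_space_memory;
}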

Thanks
Zhenzhong



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 13/22] vfio: Add base container
  2023-09-21  3:35       ` Duan, Zhenzhong
@ 2023-09-21  6:28         ` Eric Auger
  2023-09-21 17:20         ` Eric Auger
  1 sibling, 0 replies; 109+ messages in thread
From: Eric Auger @ 2023-09-21  6:28 UTC (permalink / raw)
  To: Duan, Zhenzhong, Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, peterx, jasowang,
	Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P, Yi Sun,
	Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)


Hi,
On 9/21/23 05:35, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Thursday, September 21, 2023 1:31 AM
>> Subject: Re: [PATCH v1 13/22] vfio: Add base container
>>
>> Hi Zhenzhong,
>>
>> On 9/19/23 19:23, Cédric Le Goater wrote:
>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> Abstract the VFIOContainer to be a base object. It is supposed to be
>>>> embedded by legacy VFIO container and later on, into the new iommufd
>>>> based container.
>>>>
>>>> The base container implements generic code such as code related to
>>>> memory_listener and address space management. The VFIOContainerOps
>>>> implements callbacks that depend on the kernel user space being used.
>>>>
>>>> 'common.c' and vfio device code only manipulates the base container with
>>>> wrapper functions that calls the functions defined in
>>>> VFIOContainerOpsClass.
>>>> Existing 'container.c' code is converted to implement the legacy
>>>> container
>>>> ops functions.
>>>>
>>>> Below is the base container. It's named as VFIOContainer, old
>>>> VFIOContainer
>>>> is replaced with VFIOLegacyContainer.
>>> Usualy, we introduce the new interface solely, port the current models
>>> on top of the new interface, wire the new models in the current
>>> implementation and remove the old implementation. Then, we can start
>>> adding extensions to support other implementations.
>>>
>>> spapr should be taken care of separatly following the principle above.
>>> With my PPC hat, I would not even read such a massive change, too risky
>>> for the subsystem. This path will need (much) further splitting to be
>>> understandable and acceptable.
>> We might split this patch by
>> 1) introducing VFIOLegacyContainer encapsulating the base VFIOContainer,
>> without using the ops in a first place:
>>  common.c would call vfio_container_* with harcoded legacy
>> implementation, ie. retrieving the legacy container with container_of.
>> 2) we would introduce the BE interface without using it.
>> 3) we would use the new BE interface
>>
>> Obviously this needs to be further tried out. If you wish I can try to
>> split it that way ... Please let me know
> Sure, thanks for your help, glad that I can cooperate with you to move
> this series forward.
> I just updated the branch which rebased to newest upstream for you to pick at https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_cdev_v1_rebased 

OK thanks. Let me do the exercise.

Eric
>
> Thanks
> Zhenzhong



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-08-30 10:37 ` [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window() Zhenzhong Duan
  2023-09-20 11:23   ` Eric Auger
@ 2023-09-21  8:28   ` Cédric Le Goater
  2023-09-21 10:14     ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-21  8:28 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

Hello Zhenzhong,

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Eric Auger <eric.auger@redhat.com>
> 
> Introduce helper functions that isolate the code used for
> VFIO_SPAPR_TCE_v2_IOMMU. This code reliance is IOMMU backend
> specific whereas the rest of the code in the callers, ie.
> vfio_listener_region_add|del is not.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/vfio/common.c | 156 +++++++++++++++++++++++++++--------------------
>   1 file changed, 89 insertions(+), 67 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9ca695837f..67150e4575 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -796,6 +796,92 @@ static bool vfio_get_section_iova_range(VFIOContainer *container,
>       return true;
>   }
>   
> +static int vfio_container_add_section_window(VFIOContainer *container,
> +                                             MemoryRegionSection *section,
> +                                             Error **errp)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +    hwaddr pgsize = 0;
> +    int ret;
> +
> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
> +        return 0;
> +    }

This test makes me think that we should register a specific backend
for the pseries machines, implementing the add/del_window handler,
since others do not need it. Correct ?

It would avoid this ugly test. Let's keep that in mind when the
backends are introduced.

> +
> +    /* For now intersections are not allowed, we may relax this later */
> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +        if (ranges_overlap(hostwin->min_iova,
> +                           hostwin->max_iova - hostwin->min_iova + 1,
> +                           section->offset_within_address_space,
> +                           int128_get64(section->size))) {
> +            error_setg(errp,
> +                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
> +                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
> +                section->offset_within_address_space,
> +                section->offset_within_address_space +
> +                    int128_get64(section->size) - 1,
> +                hostwin->min_iova, hostwin->max_iova);
> +            return -EINVAL;
> +        }
> +    }
> +
> +    ret = vfio_spapr_create_window(container, section, &pgsize);
> +    if (ret) {
> +        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
> +        return ret;
> +    }
> +
> +    vfio_host_win_add(container, section->offset_within_address_space,
> +                      section->offset_within_address_space +
> +                      int128_get64(section->size) - 1, pgsize);
> +#ifdef CONFIG_KVM

the ifdef test doesn't seem useful because the compiler should compile
out the section below since, in that case, kvm_enabled() is defined as :

   #define kvm_enabled()           (0)

> +    if (kvm_enabled()) {
> +        VFIOGroup *group;
> +        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
> +        struct kvm_vfio_spapr_tce param;
> +        struct kvm_device_attr attr = {
> +            .group = KVM_DEV_VFIO_GROUP,
> +            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
> +            .addr = (uint64_t)(unsigned long)&param,
> +        };
> +
> +        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
> +                                          &param.tablefd)) {
> +            QLIST_FOREACH(group, &container->group_list, container_next) {
> +                param.groupfd = group->fd;
> +                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> +                    error_report("vfio: failed to setup fd %d "
> +                                 "for a group with fd %d: %s",
> +                                 param.tablefd, param.groupfd,
> +                                 strerror(errno));
> +                    return 0;

hmm, the code bails out directly without undoing previous actions. we should
return some error at least.

> +                }
> +                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
> +            }
> +        }
> +    }
> +#endif
> +    return 0;
> +}
> +
> +static void vfio_container_del_section_window(VFIOContainer *container,
> +                                              MemoryRegionSection *section)
> +{
> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
> +        return;
> +    }
> +
> +    vfio_spapr_remove_window(container,
> +                             section->offset_within_address_space);
> +    if (vfio_host_win_del(container,
> +                          section->offset_within_address_space,
> +                          section->offset_within_address_space +
> +                          int128_get64(section->size) - 1) < 0) {
> +        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> +                 __func__, section->offset_within_address_space);
> +    }
> +}
> +
>   static void vfio_listener_region_add(MemoryListener *listener,
>                                        MemoryRegionSection *section)
>   {
> @@ -822,62 +908,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           return;
>       }
>   
> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> -        hwaddr pgsize = 0;
> -
> -        /* For now intersections are not allowed, we may relax this later */
> -        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> -            if (ranges_overlap(hostwin->min_iova,
> -                               hostwin->max_iova - hostwin->min_iova + 1,
> -                               section->offset_within_address_space,
> -                               int128_get64(section->size))) {
> -                error_setg(&err,
> -                    "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
> -                    "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
> -                    section->offset_within_address_space,
> -                    section->offset_within_address_space +
> -                        int128_get64(section->size) - 1,
> -                    hostwin->min_iova, hostwin->max_iova);
> -                goto fail;
> -            }
> -        }
> -
> -        ret = vfio_spapr_create_window(container, section, &pgsize);
> -        if (ret) {
> -            error_setg_errno(&err, -ret, "Failed to create SPAPR window");
> -            goto fail;
> -        }
> -
> -        vfio_host_win_add(container, section->offset_within_address_space,
> -                          section->offset_within_address_space +
> -                          int128_get64(section->size) - 1, pgsize);
> -#ifdef CONFIG_KVM
> -        if (kvm_enabled()) {
> -            VFIOGroup *group;
> -            IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
> -            struct kvm_vfio_spapr_tce param;
> -            struct kvm_device_attr attr = {
> -                .group = KVM_DEV_VFIO_GROUP,
> -                .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
> -                .addr = (uint64_t)(unsigned long)&param,
> -            };
> -
> -            if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
> -                                              &param.tablefd)) {
> -                QLIST_FOREACH(group, &container->group_list, container_next) {
> -                    param.groupfd = group->fd;
> -                    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
> -                        error_report("vfio: failed to setup fd %d "
> -                                     "for a group with fd %d: %s",
> -                                     param.tablefd, param.groupfd,
> -                                     strerror(errno));
> -                        return;
> -                    }
> -                    trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
> -                }
> -            }
> -        }
> -#endif
> +    if (vfio_container_add_section_window(container, section, &err)) {
> +        goto fail;

That's not exactly the same as the return above when the ioctl call
>fails. There don't seem to be major consequences though. Let's keep
it that way.

>       }
>   
>       hostwin = vfio_find_hostwin(container, iova, end);
> @@ -1094,17 +1126,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>   
>       memory_region_unref(section->mr);
>   
> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> -        vfio_spapr_remove_window(container,
> -                                 section->offset_within_address_space);
> -        if (vfio_host_win_del(container,
> -                              section->offset_within_address_space,
> -                              section->offset_within_address_space +
> -                              int128_get64(section->size) - 1) < 0) {
> -            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
> -                     __func__, section->offset_within_address_space);
> -        }
> -    }
> +    vfio_container_del_section_window(container, section);
>   }
>   
>   static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)

>PPC is in the way. Maybe we could move these two routines into pseries to
help a little. I will look into it.

Thanks,

C.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-09-20 11:49   ` Eric Auger
  2023-09-21  2:04     ` Duan, Zhenzhong
@ 2023-09-21  8:42     ` Cédric Le Goater
  2023-09-21 10:22       ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-21  8:42 UTC (permalink / raw)
  To: eric.auger, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, peterx, jasowang,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

On 9/20/23 13:49, Eric Auger wrote:
> Hi Zhenzhong,
> 
> On 8/30/23 12:37, Zhenzhong Duan wrote:
>> ...which will be used by both legacy and iommufd backend.
> I prefer genuine sentences in the commit msg. Also you explain what you
> do but not why.
> 
> suggestion: Introduce two new helpers, vfio_kvm_device_[add/del]_fd
> which take as input a file descriptor which can be either a group fd or
> a cdev fd. This uses the new KVM_DEV_VFIO_FILE VFIO KVM device group,
> which aliases to the legacy KVM_DEV_VFIO_GROUP.

Ah yes. I didn't understand the reason for the 's/GROUP/FILE/' change in the
VFIO KVM device ioctls. Thanks for clarifying.

What about pre-6.6 kernels without KVM_DEV_VFIO_FILE support ?
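
(FWIW, if I read the linux-headers update correctly, the old names are kept as
aliases with the same values, e.g.:

#define KVM_DEV_VFIO_FILE          1
#define  KVM_DEV_VFIO_FILE_ADD     1
#define  KVM_DEV_VFIO_FILE_DEL     2
/* aliases kept for uapi compatibility */
#define KVM_DEV_VFIO_GROUP         KVM_DEV_VFIO_FILE
#define  KVM_DEV_VFIO_GROUP_ADD    KVM_DEV_VFIO_FILE_ADD
#define  KVM_DEV_VFIO_GROUP_DEL    KVM_DEV_VFIO_FILE_DEL

so the ioctl values seen by older kernels do not change, but worth double
checking.)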

C.


> 
> vfio_kvm_device_add/del_group then call those new helpers.
> 
> 
> 
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/vfio/common.c              | 44 +++++++++++++++++++++++------------
>>   include/hw/vfio/vfio-common.h |  3 +++
>>   2 files changed, 32 insertions(+), 15 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 67150e4575..949ad6714a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -1759,17 +1759,17 @@ void vfio_reset_handler(void *opaque)
>>       }
>>   }
>>   
>> -static void vfio_kvm_device_add_group(VFIOGroup *group)
>> +int vfio_kvm_device_add_fd(int fd)
>>   {
>>   #ifdef CONFIG_KVM
>>       struct kvm_device_attr attr = {
>> -        .group = KVM_DEV_VFIO_GROUP,
>> -        .attr = KVM_DEV_VFIO_GROUP_ADD,
>> -        .addr = (uint64_t)(unsigned long)&group->fd,
>> +        .group = KVM_DEV_VFIO_FILE,
>> +        .attr = KVM_DEV_VFIO_FILE_ADD,
>> +        .addr = (uint64_t)(unsigned long)&fd,
>>       };
>>   
>>       if (!kvm_enabled()) {
>> -        return;
>> +        return 0;
>>       }
>>   
>>       if (vfio_kvm_device_fd < 0) {
>> @@ -1779,37 +1779,51 @@ static void vfio_kvm_device_add_group(VFIOGroup *group)
>>   
>>           if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
>>               error_report("Failed to create KVM VFIO device: %m");
>> -            return;
>> +            return -ENODEV;
> can't you return -errno?
>>           }
>>   
>>           vfio_kvm_device_fd = cd.fd;
>>       }
>>   
>>       if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> -        error_report("Failed to add group %d to KVM VFIO device: %m",
>> -                     group->groupid);
>> +        error_report("Failed to add fd %d to KVM VFIO device: %m",
>> +                     fd);
>> +        return -errno;
>>       }
>>   #endif
>> +    return 0;
>>   }
>>   
>> -static void vfio_kvm_device_del_group(VFIOGroup *group)
>> +static void vfio_kvm_device_add_group(VFIOGroup *group)
>> +{
>> +    vfio_kvm_device_add_fd(group->fd);
> Since vfio_kvm_device_add_fd now returns an error value, it's a pity not
> to use it and propagate it. Also you could fill an errp with the error
> msg and use it in vfio_connect_container(). But this is a new error
> handling there.
>> +}
>> +
>> +int vfio_kvm_device_del_fd(int fd)
> not sure we want this to return an error. But if we do, I think it would
> be nicer to propagate the error up.
>>   {
>>   #ifdef CONFIG_KVM
>>       struct kvm_device_attr attr = {
>> -        .group = KVM_DEV_VFIO_GROUP,
>> -        .attr = KVM_DEV_VFIO_GROUP_DEL,
>> -        .addr = (uint64_t)(unsigned long)&group->fd,
>> +        .group = KVM_DEV_VFIO_FILE,
>> +        .attr = KVM_DEV_VFIO_FILE_DEL,
>> +        .addr = (uint64_t)(unsigned long)&fd,
>>       };
>>   
>>       if (vfio_kvm_device_fd < 0) {
>> -        return;
>> +        return -EINVAL;
>>       }
>>   
>>       if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> -        error_report("Failed to remove group %d from KVM VFIO device: %m",
>> -                     group->groupid);
>> +        error_report("Failed to remove fd %d from KVM VFIO device: %m",
>> +                     fd);
>> +        return -EBADF;
> -errno?
>>       }
>>   #endif
>> +    return 0;
>> +}
>> +
>> +static void vfio_kvm_device_del_group(VFIOGroup *group)
>> +{
>> +    vfio_kvm_device_del_fd(group->fd);
>>   }
>>   
>>   static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 5e376c436e..598c3ce079 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -220,6 +220,9 @@ struct vfio_device_info *vfio_get_device_info(int fd);
>>   int vfio_get_device(VFIOGroup *group, const char *name,
>>                       VFIODevice *vbasedev, Error **errp);
>>   
>> +int vfio_kvm_device_add_fd(int fd);
>> +int vfio_kvm_device_del_fd(int fd);
>> +
>>   extern const MemoryRegionOps vfio_region_ops;
>>   typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>>   extern VFIOGroupList vfio_group_list;
> Thanks
> 
> Eric
> 



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device
  2023-08-30 10:37 ` [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device Zhenzhong Duan
  2023-09-20 13:33   ` Eric Auger
@ 2023-09-21  9:44   ` Cédric Le Goater
  2023-09-21 10:26     ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-21  9:44 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Eric Auger <eric.auger@redhat.com>
> 
> We want the VFIO devices to be able to use two different
> IOMMU callbacks, the legacy VFIO one and the new iommufd one.
> 
> Introduce vfio_[attach/detach]_device which aim at hiding the
> underlying IOMMU backend (IOCTLs, datatypes, ...).
> 
> Once vfio_attach_device completes, the device is attached
> to a security context and its fd can be used. Conversely
> When vfio_detach_device completes, the device has been
> detached to the security context.
> 
> In this patch, only the vfio-pci device gets converted to use
> the new API. Subsequent patches will handle other devices.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/vfio/container.c           | 66 +++++++++++++++++++++++++++++++++++
>   hw/vfio/pci.c                 | 50 ++++----------------------
>   hw/vfio/trace-events          |  2 +-
>   include/hw/vfio/vfio-common.h |  3 ++
>   4 files changed, 76 insertions(+), 45 deletions(-)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 175cdbbdff..74556da0c7 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -1083,3 +1083,69 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>       }
>       return vfio_eeh_container_op(container, op);
>   }
> +
> +static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
> +{
> +    char *tmp, group_path[PATH_MAX], *group_name;
> +    int ret, groupid;
> +    ssize_t len;
> +
> +    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
> +    len = readlink(tmp, group_path, sizeof(group_path));
> +    g_free(tmp);
> +
> +    if (len <= 0 || len >= sizeof(group_path)) {
> +        ret = len < 0 ? -errno : -ENAMETOOLONG;
> +        error_setg_errno(errp, -ret, "no iommu_group found");
> +        return ret;
> +    }
> +
> +    group_path[len] = 0;
> +
> +    group_name = basename(group_path);
> +    if (sscanf(group_name, "%d", &groupid) != 1) {
> +        error_setg_errno(errp, errno, "failed to read %s", group_path);
> +        return -errno;
> +    }
> +    return groupid;
> +}

VFIO has 4 other  routines reading the iommu_group from sysfs :

   vfio_ccw_get_group()
   vfio_ap_get_group()
   vfio_base_device_init()
   sysfs_find_group_file()

which could use this helper. Thanks for introducing it !
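For instance, assuming the helper gets exported from container.c, the ccw
lookup could presumably shrink to something like this rough sketch (the
signature change is only a guess, and it is untested):

    /* Sketch only: relies on vfio_device_groupid() being exported and on
     * the caller's VFIODevice having sysfsdev set. */
    static VFIOGroup *vfio_ccw_get_group(VFIODevice *vbasedev, Error **errp)
    {
        int groupid = vfio_device_groupid(vbasedev, errp);

        if (groupid < 0) {
            return NULL;    /* errp is already set by the helper */
        }

        return vfio_get_group(groupid, &address_space_memory, errp);
    }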



> +
> +int vfio_attach_device(char *name, VFIODevice *vbasedev,
> +                       AddressSpace *as, Error **errp)
> +{
> +    int groupid = vfio_device_groupid(vbasedev, errp);
> +    VFIODevice *vbasedev_iter;
> +    VFIOGroup *group;
> +    int ret;
> +
> +    if (groupid < 0) {
> +        return groupid;
> +    }
> +
> +    group = vfio_get_group(groupid, as, errp);
> +    if (!group) {
> +        return -ENOENT;
> +    }
> +
> +    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> +        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
> +            error_setg(errp, "device is already attached");
> +            vfio_put_group(group);
> +            return -EBUSY;
> +        }
> +    }
> +    ret = vfio_get_device(group, name, vbasedev, errp);
> +    if (ret) {
> +        vfio_put_group(group);
> +    }
> +
> +    return ret;
> +}
> +
> +void vfio_detach_device(VFIODevice *vbasedev)
> +{
> +    VFIOGroup *group = vbasedev->group;
> +
> +    vfio_put_base_device(vbasedev);
> +    vfio_put_group(group);
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a205c6b113..34f65ecd17 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2828,10 +2828,10 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
>   
>   static void vfio_put_device(VFIOPCIDevice *vdev)
>   {
> +    vfio_detach_device(&vdev->vbasedev);
> +
>       g_free(vdev->vbasedev.name);
>       g_free(vdev->msix);
> -
> -    vfio_put_base_device(&vdev->vbasedev);
>   }
>   
>   static void vfio_err_notifier_handler(void *opaque)
> @@ -2978,13 +2978,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>   {
>       VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>       VFIODevice *vbasedev = &vdev->vbasedev;
> -    VFIODevice *vbasedev_iter;
> -    VFIOGroup *group;
> -    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
> +    char *tmp, *subsys;
>       Error *err = NULL;
> -    ssize_t len;
>       struct stat st;
> -    int groupid;
>       int i, ret;
>       bool is_mdev;
>       char uuid[UUID_FMT_LEN];
> @@ -3015,38 +3011,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>       vbasedev->type = VFIO_DEVICE_TYPE_PCI;
>       vbasedev->dev = DEVICE(vdev);
>   
> -    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
> -    len = readlink(tmp, group_path, sizeof(group_path));
> -    g_free(tmp);
> -
> -    if (len <= 0 || len >= sizeof(group_path)) {
> -        error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG,
> -                         "no iommu_group found");
> -        goto error;
> -    }
> -
> -    group_path[len] = 0;
> -
> -    group_name = basename(group_path);
> -    if (sscanf(group_name, "%d", &groupid) != 1) {
> -        error_setg_errno(errp, errno, "failed to read %s", group_path);
> -        goto error;
> -    }
> -
> -    trace_vfio_realize(vbasedev->name, groupid);
> -
> -    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
> -    if (!group) {
> -        goto error;
> -    }
> -
> -    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> -        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
> -            error_setg(errp, "device is already attached");
> -            vfio_put_group(group);
> -            goto error;
> -        }
> -    }
> +    trace_vfio_realize(vbasedev->name);

I would move the trace event after vfio_attach_device() and print out the group.
Or simply add trace events in vfio_detach/attach_device().
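Something along these lines for the second option (the event name, format
and placement are only a suggestion, sketch untested):

    # hw/vfio/trace-events
    vfio_attach_device(const char *name, int group_id) " (%s) group %d"

    /* at the end of vfio_attach_device(), once the device is attached */
    ret = vfio_get_device(group, name, vbasedev, errp);
    if (ret) {
        vfio_put_group(group);
        return ret;
    }

    trace_vfio_attach_device(vbasedev->name, groupid);
    return 0;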

This is a general comment on the VFIO PCI routines which do not use a 'vfio_pci'
prefix and I find it confusing, sometimes. Like this call stack :

   vfio_put_device()
     vfio_detach_device()
       vfio_put_base_device()

I think we should rename vfio_put_device() to vfio_pci_put_device(). This is
not for this series.

Thanks,

C.

>   
>       /*
>        * Mediated devices *might* operate compatibly with discarding of RAM, but
> @@ -3065,7 +3030,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>       if (vbasedev->ram_block_discard_allowed && !is_mdev) {
>           error_setg(errp, "x-balloon-allowed only potentially compatible "
>                      "with mdev devices");
> -        vfio_put_group(group);
>           goto error;
>       }
>   
> @@ -3076,10 +3040,10 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>           name = g_strdup(vbasedev->name);
>       }
>   
> -    ret = vfio_get_device(group, name, vbasedev, errp);
> +    ret = vfio_attach_device(name, vbasedev,
> +                             pci_device_iommu_address_space(pdev), errp);
>       g_free(name);
>       if (ret) {
> -        vfio_put_group(group);
>           goto error;
>       }
>   
> @@ -3318,7 +3282,6 @@ error:
>   static void vfio_instance_finalize(Object *obj)
>   {
>       VFIOPCIDevice *vdev = VFIO_PCI(obj);
> -    VFIOGroup *group = vdev->vbasedev.group;
>   
>       vfio_display_finalize(vdev);
>       vfio_bars_finalize(vdev);
> @@ -3332,7 +3295,6 @@ static void vfio_instance_finalize(Object *obj)
>        * g_free(vdev->igd_opregion);
>        */
>       vfio_put_device(vdev);
> -    vfio_put_group(group);
>   }
>   
>   static void vfio_exitfn(PCIDevice *pdev)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ee7509e68e..8016d9f0d2 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -37,7 +37,7 @@ vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, int
>   vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %s"
>   vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
>   vfio_populate_device_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
> -vfio_realize(const char *name, int group_id) " (%s) group %d"
> +vfio_realize(const char *name) " (%s)"
>   vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
>   vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
>   vfio_pci_reset(const char *name) " (%s)"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index bb7f9fe9c4..a29dfe7723 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -253,6 +253,9 @@ void vfio_put_group(VFIOGroup *group);
>   struct vfio_device_info *vfio_get_device_info(int fd);
>   int vfio_get_device(VFIOGroup *group, const char *name,
>                       VFIODevice *vbasedev, Error **errp);
> +int vfio_attach_device(char *name, VFIODevice *vbasedev,
> +                       AddressSpace *as, Error **errp);
> +void vfio_detach_device(VFIODevice *vbasedev);
>   
>   extern int vfio_kvm_device_fd;
>   



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-09-21  8:28   ` Cédric Le Goater
@ 2023-09-21 10:14     ` Duan, Zhenzhong
  2023-09-21 10:55       ` Cédric Le Goater
  2023-09-27  2:08       ` Duan, Zhenzhong
  0 siblings, 2 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21 10:14 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P

Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Thursday, September 21, 2023 4:29 PM
>Subject: Re: [PATCH v1 04/22] vfio/common: Introduce
>vfio_container_add|del_section_window()
>
>Hello Zhenzhong,
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Eric Auger <eric.auger@redhat.com>
>>
>> Introduce helper functions that isolate the code used for
>> VFIO_SPAPR_TCE_v2_IOMMU. This code is IOMMU backend
>> specific, whereas the rest of the code in the callers, i.e.
>> vfio_listener_region_add|del, is not.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/vfio/common.c | 156 +++++++++++++++++++++++++++--------------------
>>   1 file changed, 89 insertions(+), 67 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 9ca695837f..67150e4575 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -796,6 +796,92 @@ static bool
>vfio_get_section_iova_range(VFIOContainer *container,
>>       return true;
>>   }
>>
>> +static int vfio_container_add_section_window(VFIOContainer *container,
>> +                                             MemoryRegionSection *section,
>> +                                             Error **errp)
>> +{
>> +    VFIOHostDMAWindow *hostwin;
>> +    hwaddr pgsize = 0;
>> +    int ret;
>> +
>> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        return 0;
>> +    }
>
>This test makes me think that we should register a specific backend
>for the pseries machines, implementing the add/del_window handler,
>since others do not need it. Correct ?

Yes, introducing a specific backend could help remove the above check.
But as each backend has a VFIOIOMMUBackendOps, we would still need the
same check as above to select the Ops.
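Roughly what I have in mind, as a sketch only (the structure and names are
a guess at this stage, the real ops will come with the backend patches):
the per-call iommu_type test would become a NULL-handler check, but picking
the right ops when the container is set up still needs to look at the type.

    /* Sketch only: hypothetical ops structure and container->ops field */
    typedef struct VFIOIOMMUBackendOps {
        int (*attach_device)(char *name, VFIODevice *vbasedev,
                             AddressSpace *as, Error **errp);
        void (*detach_device)(VFIODevice *vbasedev);
        /* only the spapr flavour of the legacy backend would set these */
        int (*add_window)(VFIOContainer *container,
                          MemoryRegionSection *section, Error **errp);
        void (*del_window)(VFIOContainer *container,
                           MemoryRegionSection *section);
    } VFIOIOMMUBackendOps;

    static int vfio_container_add_section_window(VFIOContainer *container,
                                                 MemoryRegionSection *section,
                                                 Error **errp)
    {
        if (!container->ops->add_window) {
            return 0;    /* nothing to do for non-spapr backends */
        }

        return container->ops->add_window(container, section, errp);
    }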

>
>It would avoid this ugly test. Let's keep that in mind when the
>backends are introduced.
>
>> +
>> +    /* For now intersections are not allowed, we may relax this later */
>> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +        if (ranges_overlap(hostwin->min_iova,
>> +                           hostwin->max_iova - hostwin->min_iova + 1,
>> +                           section->offset_within_address_space,
>> +                           int128_get64(section->size))) {
>> +            error_setg(errp,
>> +                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
>> +                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
>> +                section->offset_within_address_space,
>> +                section->offset_within_address_space +
>> +                    int128_get64(section->size) - 1,
>> +                hostwin->min_iova, hostwin->max_iova);
>> +            return -EINVAL;
>> +        }
>> +    }
>> +
>> +    ret = vfio_spapr_create_window(container, section, &pgsize);
>> +    if (ret) {
>> +        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
>> +        return ret;
>> +    }
>> +
>> +    vfio_host_win_add(container, section->offset_within_address_space,
>> +                      section->offset_within_address_space +
>> +                      int128_get64(section->size) - 1, pgsize);
>> +#ifdef CONFIG_KVM
>
>the ifdef test doesn't seem useful because the compiler should compile
>out the section below since, in that case, kvm_enabled() is defined as :
>
>   #define kvm_enabled()           (0)

Looks so, I'll remove it in v2.
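For the record, the pattern this relies on, illustrated standalone (not the
actual vfio code): kvm_enabled() folds to a constant 0 when CONFIG_KVM is
not set, so the branch is discarded at compile time, as long as every
symbol used inside it remains declared in both configurations.

    /* Illustration only */
    #ifndef CONFIG_KVM
    #define kvm_enabled() (0)
    #endif

    void do_kvm_only_setup(void);   /* must stay declared in both configs */

    static void example(void)
    {
        if (kvm_enabled()) {        /* constant-false without CONFIG_KVM  */
            do_kvm_only_setup();    /* dead code, removed by the compiler */
        }
    }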

>
>> +    if (kvm_enabled()) {
>> +        VFIOGroup *group;
>> +        IOMMUMemoryRegion *iommu_mr =
>IOMMU_MEMORY_REGION(section->mr);
>> +        struct kvm_vfio_spapr_tce param;
>> +        struct kvm_device_attr attr = {
>> +            .group = KVM_DEV_VFIO_GROUP,
>> +            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
>> +            .addr = (uint64_t)(unsigned long)&param,
>> +        };
>> +
>> +        if (!memory_region_iommu_get_attr(iommu_mr,
>IOMMU_ATTR_SPAPR_TCE_FD,
>> +                                          &param.tablefd)) {
>> +            QLIST_FOREACH(group, &container->group_list, container_next) {
>> +                param.groupfd = group->fd;
>> +                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> +                    error_report("vfio: failed to setup fd %d "
>> +                                 "for a group with fd %d: %s",
>> +                                 param.tablefd, param.groupfd,
>> +                                 strerror(errno));
>> +                    return 0;
>
>hmm, the code bails out directly without undoing previous actions. we should
>return some error at least.

I think Eric doesn't intend any functional change in this patch, just a
refactoring of this code into two wrapper functions. In fact the original
code just returns void if ioctl() fails. It is not clear whether that's
intentional or a bug.
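If we did want to propagate the failure, I suppose it would just be
something like this (sketch only, switching error_report() to errp):

    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
        ret = -errno;
        error_setg_errno(errp, errno,
                         "vfio: failed to setup fd %d for a group with fd %d",
                         param.tablefd, param.groupfd);
        return ret;
    }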

>
>> +                }
>> +                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
>> +            }
>> +        }
>> +    }
>> +#endif
>> +    return 0;
>> +}
>> +
>> +static void vfio_container_del_section_window(VFIOContainer *container,
>> +                                              MemoryRegionSection *section)
>> +{
>> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>> +        return;
>> +    }
>> +
>> +    vfio_spapr_remove_window(container,
>> +                             section->offset_within_address_space);
>> +    if (vfio_host_win_del(container,
>> +                          section->offset_within_address_space,
>> +                          section->offset_within_address_space +
>> +                          int128_get64(section->size) - 1) < 0) {
>> +        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
>> +                 __func__, section->offset_within_address_space);
>> +    }
>> +}
>> +
>>   static void vfio_listener_region_add(MemoryListener *listener,
>>                                        MemoryRegionSection *section)
>>   {
>> @@ -822,62 +908,8 @@ static void vfio_listener_region_add(MemoryListener
>*listener,
>>           return;
>>       }
>>
>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> -        hwaddr pgsize = 0;
>> -
>> -        /* For now intersections are not allowed, we may relax this later */
>> -        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> -            if (ranges_overlap(hostwin->min_iova,
>> -                               hostwin->max_iova - hostwin->min_iova + 1,
>> -                               section->offset_within_address_space,
>> -                               int128_get64(section->size))) {
>> -                error_setg(&err,
>> -                    "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
>> -                    "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
>> -                    section->offset_within_address_space,
>> -                    section->offset_within_address_space +
>> -                        int128_get64(section->size) - 1,
>> -                    hostwin->min_iova, hostwin->max_iova);
>> -                goto fail;
>> -            }
>> -        }
>> -
>> -        ret = vfio_spapr_create_window(container, section, &pgsize);
>> -        if (ret) {
>> -            error_setg_errno(&err, -ret, "Failed to create SPAPR window");
>> -            goto fail;
>> -        }
>> -
>> -        vfio_host_win_add(container, section->offset_within_address_space,
>> -                          section->offset_within_address_space +
>> -                          int128_get64(section->size) - 1, pgsize);
>> -#ifdef CONFIG_KVM
>> -        if (kvm_enabled()) {
>> -            VFIOGroup *group;
>> -            IOMMUMemoryRegion *iommu_mr =
>IOMMU_MEMORY_REGION(section->mr);
>> -            struct kvm_vfio_spapr_tce param;
>> -            struct kvm_device_attr attr = {
>> -                .group = KVM_DEV_VFIO_GROUP,
>> -                .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
>> -                .addr = (uint64_t)(unsigned long)&param,
>> -            };
>> -
>> -            if (!memory_region_iommu_get_attr(iommu_mr,
>IOMMU_ATTR_SPAPR_TCE_FD,
>> -                                              &param.tablefd)) {
>> -                QLIST_FOREACH(group, &container->group_list, container_next) {
>> -                    param.groupfd = group->fd;
>> -                    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>> -                        error_report("vfio: failed to setup fd %d "
>> -                                     "for a group with fd %d: %s",
>> -                                     param.tablefd, param.groupfd,
>> -                                     strerror(errno));
>> -                        return;
>> -                    }
>> -                    trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
>> -                }
>> -            }
>> -        }
>> -#endif
>> +    if (vfio_container_add_section_window(container, section, &err)) {
>> +        goto fail;
>
>That's not exactly the same as the return above when the ioctl call
>fails. There don't seem to be many consequences though. Let's keep
>it that way.
OK.

>
>>       }
>>
>>       hostwin = vfio_find_hostwin(container, iova, end);
>> @@ -1094,17 +1126,7 @@ static void
>vfio_listener_region_del(MemoryListener *listener,
>>
>>       memory_region_unref(section->mr);
>>
>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> -        vfio_spapr_remove_window(container,
>> -                                 section->offset_within_address_space);
>> -        if (vfio_host_win_del(container,
>> -                              section->offset_within_address_space,
>> -                              section->offset_within_address_space +
>> -                              int128_get64(section->size) - 1) < 0) {
>> -            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
>> -                     __func__, section->offset_within_address_space);
>> -        }
>> -    }
>> +    vfio_container_del_section_window(container, section);
>>   }
>>
>>   static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>
>PPC is in the way. Maybe we could move these two routines into pseries to
>help a little. I will look into it.
Do you mean PPC cleanup?

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-09-21  8:42     ` Cédric Le Goater
@ 2023-09-21 10:22       ` Duan, Zhenzhong
  2023-09-21 10:53         ` Cédric Le Goater
  0 siblings, 1 reply; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21 10:22 UTC (permalink / raw)
  To: Cédric Le Goater, eric.auger, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, peterx, jasowang,
	Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Thursday, September 21, 2023 4:42 PM
>Subject: Re: [PATCH v1 05/22] vfio/common: Extract out
>vfio_kvm_device_[add/del]_fd
>
>On 9/20/23 13:49, Eric Auger wrote:
>> Hi Zhenzhong,
>>
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> ...which will be used by both legacy and iommufd backend.
>> I prefer genuine sentences in the commit msg. Also you explain what you
>> do but not why.
>>
>> suggestion: Introduce two new helpers, vfio_kvm_device_[add/del]_fd
>> which take as input a file descriptor which can be either a group fd or
>> a cdev fd. This uses the new KVM_DEV_VFIO_FILE VFIO KVM device group,
>> which aliases to the legacy KVM_DEV_VFIO_GROUP.
>
>Ah yes. I didn't understand why the 's/GROUP/FILE/' change in the
>VFIO KVM device ioctls. Thanks for clarifying.
>
>What about pre-6.6 kernels without KVM_DEV_VFIO_FILE support ?
They are purely aliases. See the commit below:

commit da3c22c74a3c6cbd26df40b2f6798a2d41be80ac
Author: Thomas Huth <thuth@redhat.com>
Date:   Tue Sep 12 11:24:40 2023 +0200

    linux-headers: Update to Linux v6.6-rc1

    This update contains the required header changes for the
    "target/s390x: AP-passthrough for PV guests" patch from
    Steffen Eiden.

    Message-ID: <20230912093432.180041-1-thuth@redhat.com>
    Signed-off-by: Thomas Huth <thuth@redhat.com>

diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 1f3f3333a4..0d74ee999a 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1414,9 +1414,16 @@ struct kvm_device_attr {
        __u64   addr;           /* userspace address of attr data */
 };

-#define  KVM_DEV_VFIO_GROUP                    1
-#define   KVM_DEV_VFIO_GROUP_ADD                       1
-#define   KVM_DEV_VFIO_GROUP_DEL                       2
+#define  KVM_DEV_VFIO_FILE                     1
+
+#define   KVM_DEV_VFIO_FILE_ADD                        1
+#define   KVM_DEV_VFIO_FILE_DEL                        2
+
+/* KVM_DEV_VFIO_GROUP aliases are for compile time uapi compatibility */
+#define  KVM_DEV_VFIO_GROUP    KVM_DEV_VFIO_FILE
+
+#define   KVM_DEV_VFIO_GROUP_ADD       KVM_DEV_VFIO_FILE_ADD
+#define   KVM_DEV_VFIO_GROUP_DEL       KVM_DEV_VFIO_FILE_DEL
 #define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE             3

Pre-6.6 kernels that do not support KVM_DEV_VFIO_FILE also do not support
IOMMUFD. So I think that's fine.
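FWIW, with the alias in place the add helper from patch 05 ends up looking
roughly like the sketch below (simplified: error reporting and the lazy
creation of the KVM VFIO device are left out):

    int vfio_kvm_device_add_fd(int fd)
    {
    #ifdef CONFIG_KVM
        struct kvm_device_attr attr = {
            .group = KVM_DEV_VFIO_FILE,
            .attr = KVM_DEV_VFIO_FILE_ADD,
            .addr = (uint64_t)(unsigned long)&fd,
        };

        if (vfio_kvm_device_fd < 0) {
            return 0;           /* no KVM VFIO device, nothing to do */
        }

        if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
            return -errno;      /* same path for a group fd or a cdev fd */
        }
    #endif
        return 0;
    }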

Thanks
Zhenzhong

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device
  2023-09-21  9:44   ` Cédric Le Goater
@ 2023-09-21 10:26     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21 10:26 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Thursday, September 21, 2023 5:45 PM
>Subject: Re: [PATCH v1 09/22] vfio/container: Introduce
>vfio_[attach/detach]_device
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Eric Auger <eric.auger@redhat.com>
>>
>> We want the VFIO devices to be able to use two different
>> IOMMU callbacks, the legacy VFIO one and the new iommufd one.
>>
>> Introduce vfio_[attach/detach]_device which aim at hiding the
>> underlying IOMMU backend (IOCTLs, datatypes, ...).
>>
>> Once vfio_attach_device completes, the device is attached
>> to a security context and its fd can be used. Conversely,
>> when vfio_detach_device completes, the device has been
>> detached from the security context.
>>
>> In this patch, only the vfio-pci device gets converted to use
>> the new API. Subsequent patches will handle other devices.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/vfio/container.c           | 66 +++++++++++++++++++++++++++++++++++
>>   hw/vfio/pci.c                 | 50 ++++----------------------
>>   hw/vfio/trace-events          |  2 +-
>>   include/hw/vfio/vfio-common.h |  3 ++
>>   4 files changed, 76 insertions(+), 45 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 175cdbbdff..74556da0c7 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -1083,3 +1083,69 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>>       }
>>       return vfio_eeh_container_op(container, op);
>>   }
>> +
>> +static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    char *tmp, group_path[PATH_MAX], *group_name;
>> +    int ret, groupid;
>> +    ssize_t len;
>> +
>> +    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
>> +    len = readlink(tmp, group_path, sizeof(group_path));
>> +    g_free(tmp);
>> +
>> +    if (len <= 0 || len >= sizeof(group_path)) {
>> +        ret = len < 0 ? -errno : -ENAMETOOLONG;
>> +        error_setg_errno(errp, -ret, "no iommu_group found");
>> +        return ret;
>> +    }
>> +
>> +    group_path[len] = 0;
>> +
>> +    group_name = basename(group_path);
>> +    if (sscanf(group_name, "%d", &groupid) != 1) {
>> +        error_setg_errno(errp, errno, "failed to read %s", group_path);
>> +        return -errno;
>> +    }
>> +    return groupid;
>> +}
>
>VFIO has 4 other  routines reading the iommu_group from sysfs :
>
>   vfio_ccw_get_group()
>   vfio_ap_get_group()
>   vfio_base_device_init()
>   sysfs_find_group_file()
>
>which could use this helper. Thanks for introducing it !
>
>
>
>> +
>> +int vfio_attach_device(char *name, VFIODevice *vbasedev,
>> +                       AddressSpace *as, Error **errp)
>> +{
>> +    int groupid = vfio_device_groupid(vbasedev, errp);
>> +    VFIODevice *vbasedev_iter;
>> +    VFIOGroup *group;
>> +    int ret;
>> +
>> +    if (groupid < 0) {
>> +        return groupid;
>> +    }
>> +
>> +    group = vfio_get_group(groupid, as, errp);
>> +    if (!group) {
>> +        return -ENOENT;
>> +    }
>> +
>> +    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>> +        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>> +            error_setg(errp, "device is already attached");
>> +            vfio_put_group(group);
>> +            return -EBUSY;
>> +        }
>> +    }
>> +    ret = vfio_get_device(group, name, vbasedev, errp);
>> +    if (ret) {
>> +        vfio_put_group(group);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +void vfio_detach_device(VFIODevice *vbasedev)
>> +{
>> +    VFIOGroup *group = vbasedev->group;
>> +
>> +    vfio_put_base_device(vbasedev);
>> +    vfio_put_group(group);
>> +}
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index a205c6b113..34f65ecd17 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2828,10 +2828,10 @@ static void vfio_populate_device(VFIOPCIDevice
>*vdev, Error **errp)
>>
>>   static void vfio_put_device(VFIOPCIDevice *vdev)
>>   {
>> +    vfio_detach_device(&vdev->vbasedev);
>> +
>>       g_free(vdev->vbasedev.name);
>>       g_free(vdev->msix);
>> -
>> -    vfio_put_base_device(&vdev->vbasedev);
>>   }
>>
>>   static void vfio_err_notifier_handler(void *opaque)
>> @@ -2978,13 +2978,9 @@ static void vfio_realize(PCIDevice *pdev, Error
>**errp)
>>   {
>>       VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>       VFIODevice *vbasedev = &vdev->vbasedev;
>> -    VFIODevice *vbasedev_iter;
>> -    VFIOGroup *group;
>> -    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
>> +    char *tmp, *subsys;
>>       Error *err = NULL;
>> -    ssize_t len;
>>       struct stat st;
>> -    int groupid;
>>       int i, ret;
>>       bool is_mdev;
>>       char uuid[UUID_FMT_LEN];
>> @@ -3015,38 +3011,7 @@ static void vfio_realize(PCIDevice *pdev, Error
>**errp)
>>       vbasedev->type = VFIO_DEVICE_TYPE_PCI;
>>       vbasedev->dev = DEVICE(vdev);
>>
>> -    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
>> -    len = readlink(tmp, group_path, sizeof(group_path));
>> -    g_free(tmp);
>> -
>> -    if (len <= 0 || len >= sizeof(group_path)) {
>> -        error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG,
>> -                         "no iommu_group found");
>> -        goto error;
>> -    }
>> -
>> -    group_path[len] = 0;
>> -
>> -    group_name = basename(group_path);
>> -    if (sscanf(group_name, "%d", &groupid) != 1) {
>> -        error_setg_errno(errp, errno, "failed to read %s", group_path);
>> -        goto error;
>> -    }
>> -
>> -    trace_vfio_realize(vbasedev->name, groupid);
>> -
>> -    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev),
>errp);
>> -    if (!group) {
>> -        goto error;
>> -    }
>> -
>> -    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>> -        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>> -            error_setg(errp, "device is already attached");
>> -            vfio_put_group(group);
>> -            goto error;
>> -        }
>> -    }
>> +    trace_vfio_realize(vbasedev->name);
>
>I would move the trace event after vfio_attach_device() and print out the group.
>Or simply add trace events in vfio_detach/attach_device().
>
>This is a general comment on the VFIO PCI routines which do not use a 'vfio_pci'
>prefix and I find it confusing, sometimes. Like this call stack :
>
>   vfio_put_device()
>     vfio_detach_device()
>       vfio_put_base_device()
>
>I think we should rename vfio_put_device() to vfio_pci_put_device(). This is
>not for this series.
Good suggestion! I have been confused by this function too.
I can help if you have not done that yet.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-09-21 10:22       ` Duan, Zhenzhong
@ 2023-09-21 10:53         ` Cédric Le Goater
  0 siblings, 0 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-21 10:53 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, peterx, jasowang,
	Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P

On 9/21/23 12:22, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Sent: Thursday, September 21, 2023 4:42 PM
>> Subject: Re: [PATCH v1 05/22] vfio/common: Extract out
>> vfio_kvm_device_[add/del]_fd
>>
>> On 9/20/23 13:49, Eric Auger wrote:
>>> Hi Zhenzhong,
>>>
>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>> ...which will be used by both legacy and iommufd backend.
>>> I prefer genuine sentences in the commit msg. Also you explain what you
>>> do but not why.
>>>
>>> suggestion: Introduce two new helpers, vfio_kvm_device_[add/del]_fd
>>> which take as input a file descriptor which can be either a group fd or
>>> a cdev fd. This uses the new KVM_DEV_VFIO_FILE VFIO KVM device group,
>>> which aliases to the legacy KVM_DEV_VFIO_GROUP.
>>
>> Ah yes. I didn't understand why the 's/GROUP/FILE/' change in the
>> VFIO KVM device ioctls. Thanks for clarifying.
>>
>> What about pre-6.6 kernels without KVM_DEV_VFIO_FILE support ?
> They are purely aliases. See the commit below:

Ah. I missed that. Thanks again.

C.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-09-21 10:14     ` Duan, Zhenzhong
@ 2023-09-21 10:55       ` Cédric Le Goater
  2023-09-27  2:08       ` Duan, Zhenzhong
  1 sibling, 0 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-21 10:55 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P

On 9/21/23 12:14, Duan, Zhenzhong wrote:
> Hi Cédric,
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Sent: Thursday, September 21, 2023 4:29 PM
>> Subject: Re: [PATCH v1 04/22] vfio/common: Introduce
>> vfio_container_add|del_section_window()
>>
>> Hello Zhenzhong,
>>
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> From: Eric Auger <eric.auger@redhat.com>
>>>
>>> Introduce helper functions that isolate the code used for
>>> VFIO_SPAPR_TCE_v2_IOMMU. This code is IOMMU backend
>>> specific, whereas the rest of the code in the callers, i.e.
>>> vfio_listener_region_add|del, is not.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>    hw/vfio/common.c | 156 +++++++++++++++++++++++++++--------------------
>>>    1 file changed, 89 insertions(+), 67 deletions(-)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 9ca695837f..67150e4575 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -796,6 +796,92 @@ static bool
>> vfio_get_section_iova_range(VFIOContainer *container,
>>>        return true;
>>>    }
>>>
>>> +static int vfio_container_add_section_window(VFIOContainer *container,
>>> +                                             MemoryRegionSection *section,
>>> +                                             Error **errp)
>>> +{
>>> +    VFIOHostDMAWindow *hostwin;
>>> +    hwaddr pgsize = 0;
>>> +    int ret;
>>> +
>>> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>>> +        return 0;
>>> +    }
>>
>> This test makes me think that we should register a specific backend
>> for the pseries machines, implementing the add/del_window handler,
>> since others do not need it. Correct ?
> 
> Yes, introducing a specific backend could help remove the above check.
> But as each backend has a VFIOIOMMUBackendOps, we would still need the
> same check as above to select the Ops.
> 
>>
>> It would avoid this ugly test. Let's keep that in mind when the
>> backends are introduced.
>>
>>> +
>>> +    /* For now intersections are not allowed, we may relax this later */
>>> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>>> +        if (ranges_overlap(hostwin->min_iova,
>>> +                           hostwin->max_iova - hostwin->min_iova + 1,
>>> +                           section->offset_within_address_space,
>>> +                           int128_get64(section->size))) {
>>> +            error_setg(errp,
>>> +                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
>>> +                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
>>> +                section->offset_within_address_space,
>>> +                section->offset_within_address_space +
>>> +                    int128_get64(section->size) - 1,
>>> +                hostwin->min_iova, hostwin->max_iova);
>>> +            return -EINVAL;
>>> +        }
>>> +    }
>>> +
>>> +    ret = vfio_spapr_create_window(container, section, &pgsize);
>>> +    if (ret) {
>>> +        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
>>> +        return ret;
>>> +    }
>>> +
>>> +    vfio_host_win_add(container, section->offset_within_address_space,
>>> +                      section->offset_within_address_space +
>>> +                      int128_get64(section->size) - 1, pgsize);
>>> +#ifdef CONFIG_KVM
>>
>> the ifdef test doesn't seem useful because the compiler should compile
>> out the section below since, in that case, kvm_enabled() is defined as :
>>
>>    #define kvm_enabled()           (0)
> 
> Looks so, I'll remove it in v2.
> 
>>
>>> +    if (kvm_enabled()) {
>>> +        VFIOGroup *group;
>>> +        IOMMUMemoryRegion *iommu_mr =
>> IOMMU_MEMORY_REGION(section->mr);
>>> +        struct kvm_vfio_spapr_tce param;
>>> +        struct kvm_device_attr attr = {
>>> +            .group = KVM_DEV_VFIO_GROUP,
>>> +            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
>>> +            .addr = (uint64_t)(unsigned long)&param,
>>> +        };
>>> +
>>> +        if (!memory_region_iommu_get_attr(iommu_mr,
>> IOMMU_ATTR_SPAPR_TCE_FD,
>>> +                                          &param.tablefd)) {
>>> +            QLIST_FOREACH(group, &container->group_list, container_next) {
>>> +                param.groupfd = group->fd;
>>> +                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>>> +                    error_report("vfio: failed to setup fd %d "
>>> +                                 "for a group with fd %d: %s",
>>> +                                 param.tablefd, param.groupfd,
>>> +                                 strerror(errno));
>>> +                    return 0;
>>
>> hmm, the code bails out directly without undoing previous actions. we should
>> return some error at least.
> 
> I think Eric doesn't intend any functional change in this patch, just a
> refactoring of this code into two wrapper functions. In fact the original
> code just returns void if ioctl() fails. It is not clear whether that's
> intentional or a bug.
> 
>>
>>> +                }
>>> +                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
>>> +            }
>>> +        }
>>> +    }
>>> +#endif
>>> +    return 0;
>>> +}
>>> +
>>> +static void vfio_container_del_section_window(VFIOContainer *container,
>>> +                                              MemoryRegionSection *section)
>>> +{
>>> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>>> +        return;
>>> +    }
>>> +
>>> +    vfio_spapr_remove_window(container,
>>> +                             section->offset_within_address_space);
>>> +    if (vfio_host_win_del(container,
>>> +                          section->offset_within_address_space,
>>> +                          section->offset_within_address_space +
>>> +                          int128_get64(section->size) - 1) < 0) {
>>> +        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
>>> +                 __func__, section->offset_within_address_space);
>>> +    }
>>> +}
>>> +
>>>    static void vfio_listener_region_add(MemoryListener *listener,
>>>                                         MemoryRegionSection *section)
>>>    {
>>> @@ -822,62 +908,8 @@ static void vfio_listener_region_add(MemoryListener
>> *listener,
>>>            return;
>>>        }
>>>
>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>> -        hwaddr pgsize = 0;
>>> -
>>> -        /* For now intersections are not allowed, we may relax this later */
>>> -        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>>> -            if (ranges_overlap(hostwin->min_iova,
>>> -                               hostwin->max_iova - hostwin->min_iova + 1,
>>> -                               section->offset_within_address_space,
>>> -                               int128_get64(section->size))) {
>>> -                error_setg(&err,
>>> -                    "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
>>> -                    "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
>>> -                    section->offset_within_address_space,
>>> -                    section->offset_within_address_space +
>>> -                        int128_get64(section->size) - 1,
>>> -                    hostwin->min_iova, hostwin->max_iova);
>>> -                goto fail;
>>> -            }
>>> -        }
>>> -
>>> -        ret = vfio_spapr_create_window(container, section, &pgsize);
>>> -        if (ret) {
>>> -            error_setg_errno(&err, -ret, "Failed to create SPAPR window");
>>> -            goto fail;
>>> -        }
>>> -
>>> -        vfio_host_win_add(container, section->offset_within_address_space,
>>> -                          section->offset_within_address_space +
>>> -                          int128_get64(section->size) - 1, pgsize);
>>> -#ifdef CONFIG_KVM
>>> -        if (kvm_enabled()) {
>>> -            VFIOGroup *group;
>>> -            IOMMUMemoryRegion *iommu_mr =
>> IOMMU_MEMORY_REGION(section->mr);
>>> -            struct kvm_vfio_spapr_tce param;
>>> -            struct kvm_device_attr attr = {
>>> -                .group = KVM_DEV_VFIO_GROUP,
>>> -                .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
>>> -                .addr = (uint64_t)(unsigned long)&param,
>>> -            };
>>> -
>>> -            if (!memory_region_iommu_get_attr(iommu_mr,
>> IOMMU_ATTR_SPAPR_TCE_FD,
>>> -                                              &param.tablefd)) {
>>> -                QLIST_FOREACH(group, &container->group_list, container_next) {
>>> -                    param.groupfd = group->fd;
>>> -                    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
>>> -                        error_report("vfio: failed to setup fd %d "
>>> -                                     "for a group with fd %d: %s",
>>> -                                     param.tablefd, param.groupfd,
>>> -                                     strerror(errno));
>>> -                        return;
>>> -                    }
>>> -                    trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
>>> -                }
>>> -            }
>>> -        }
>>> -#endif
>>> +    if (vfio_container_add_section_window(container, section, &err)) {
>>> +        goto fail;
>>
>> That's not exactly the same as the return above when the ioctl call
>> fails. There don't seem to be many consequences though. Let's keep
>> it that way.
> OK.
> 
>>
>>>        }
>>>
>>>        hostwin = vfio_find_hostwin(container, iova, end);
>>> @@ -1094,17 +1126,7 @@ static void
>> vfio_listener_region_del(MemoryListener *listener,
>>>
>>>        memory_region_unref(section->mr);
>>>
>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>> -        vfio_spapr_remove_window(container,
>>> -                                 section->offset_within_address_space);
>>> -        if (vfio_host_win_del(container,
>>> -                              section->offset_within_address_space,
>>> -                              section->offset_within_address_space +
>>> -                              int128_get64(section->size) - 1) < 0) {
>>> -            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
>>> -                     __func__, section->offset_within_address_space);
>>> -        }
>>> -    }
>>> +    vfio_container_del_section_window(container, section);
>>>    }
>>>
>>>    static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>>
>> PPC is in the way. Maybe we could move these two routines into pseries to
>> help a little. I will look into it.
> Do you mean PPC cleanup?

I will see if we can move the implementation of the spapr routines out of VFIO.
Don't wait for me. It can be addressed in parallel.

Thanks,

C.


> 
> Thanks
> Zhenzhong
> 



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 10/22] vfio/platform: Use vfio_[attach/detach]_device
  2023-08-30 10:37 ` [PATCH v1 10/22] vfio/platform: Use vfio_[attach/detach]_device Zhenzhong Duan
@ 2023-09-21 12:17   ` Cédric Le Goater
  0 siblings, 0 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-21 12:17 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Eric Auger <eric.auger@redhat.com>
> 
> Let the vfio-platform device use vfio_attach_device() and
> vfio_detach_device(), hence hiding the details of the used
> IOMMU backend.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/vfio/platform.c   | 43 ++++---------------------------------------
>   hw/vfio/trace-events |  2 +-
>   2 files changed, 5 insertions(+), 40 deletions(-)

Nice cleanup. As said earlier, the realize trace events could be kept.

Thanks,

C.


> 
> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
> index 5af73f9287..5c08c39315 100644
> --- a/hw/vfio/platform.c
> +++ b/hw/vfio/platform.c
> @@ -529,12 +529,7 @@ static VFIODeviceOps vfio_platform_ops = {
>    */
>   static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
>   {
> -    VFIOGroup *group;
> -    VFIODevice *vbasedev_iter;
> -    char *tmp, group_path[PATH_MAX], *group_name;
> -    ssize_t len;
>       struct stat st;
> -    int groupid;
>       int ret;
>   
>       /* @sysfsdev takes precedence over @host */
> @@ -557,47 +552,17 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
>           return -errno;
>       }
>   
> -    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
> -    len = readlink(tmp, group_path, sizeof(group_path));
> -    g_free(tmp);
> +    trace_vfio_platform_base_device_init(vbasedev->name);
>   
> -    if (len < 0 || len >= sizeof(group_path)) {
> -        ret = len < 0 ? -errno : -ENAMETOOLONG;
> -        error_setg_errno(errp, -ret, "no iommu_group found");
> -        return ret;
> -    }
> -
> -    group_path[len] = 0;
> -
> -    group_name = basename(group_path);
> -    if (sscanf(group_name, "%d", &groupid) != 1) {
> -        error_setg_errno(errp, errno, "failed to read %s", group_path);
> -        return -errno;
> -    }
> -
> -    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
> -
> -    group = vfio_get_group(groupid, &address_space_memory, errp);
> -    if (!group) {
> -        return -ENOENT;
> -    }
> -
> -    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> -        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
> -            error_setg(errp, "device is already attached");
> -            vfio_put_group(group);
> -            return -EBUSY;
> -        }
> -    }
> -    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
> +    ret = vfio_attach_device(vbasedev->name, vbasedev,
> +                             &address_space_memory, errp);
>       if (ret) {
> -        vfio_put_group(group);
>           return ret;
>       }
>   
>       ret = vfio_populate_device(vbasedev, errp);
>       if (ret) {
> -        vfio_put_group(group);
> +        vfio_detach_device(vbasedev);
>       }
>   
>       return ret;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 8016d9f0d2..bd32970854 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -124,7 +124,7 @@ vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size
>   vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
>   
>   # platform.c
> -vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> +vfio_platform_base_device_init(char *name) "%s"
>   vfio_platform_realize(char *name, char *compat) "vfio device %s, compat = %s"
>   vfio_platform_eoi(int pin, int fd) "EOI IRQ pin %d (fd=%d)"
>   vfio_platform_intp_mmap_enable(int pin) "IRQ #%d still active, stay in slow path"



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 12/22] vfio/ccw: Use vfio_[attach/detach]_device
  2023-08-30 10:37 ` [PATCH v1 12/22] vfio/ccw: " Zhenzhong Duan
@ 2023-09-21 12:19   ` Cédric Le Goater
  2023-09-21 13:00     ` Duan, Zhenzhong
  0 siblings, 1 reply; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-21 12:19 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Eric Farman, Matthew Rosato, Thomas Huth, open list:vfio-ccw

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Eric Auger <eric.auger@redhat.com>
> 
> Let the vfio-ccw device use vfio_attach_device() and
> vfio_detach_device(), hence hiding the details of the used
> IOMMU backend.
> 
> Also, now that all the devices have been migrated to use the new
> vfio_attach_device/vfio_detach_device API, let's turn the
> legacy functions into static functions, local to container.c.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Zhenzhong,

Could you please resend 1-12 independently as a prereq series for iommufd
support ? I think there wouldn't be much to say and they could be merged
pretty quickly.

Thanks,

C.

> ---
>   hw/vfio/ccw.c                 | 120 ++++++++--------------------------
>   hw/vfio/container.c           |  10 +--
>   include/hw/vfio/vfio-common.h |   5 --
>   3 files changed, 33 insertions(+), 102 deletions(-)
> 
> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
> index 1e2fce83b0..f078e014fa 100644
> --- a/hw/vfio/ccw.c
> +++ b/hw/vfio/ccw.c
> @@ -572,88 +572,15 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev)
>       g_free(vcdev->io_region);
>   }
>   
> -static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
> -{
> -    g_free(vcdev->vdev.name);
> -    vfio_put_base_device(&vcdev->vdev);
> -}
> -
> -static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
> -                                Error **errp)
> -{
> -    S390CCWDevice *cdev = S390_CCW_DEVICE(vcdev);
> -    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
> -                                 cdev->hostid.ssid,
> -                                 cdev->hostid.devid);
> -    VFIODevice *vbasedev;
> -
> -    QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -        if (strcmp(vbasedev->name, name) == 0) {
> -            error_setg(errp, "vfio: subchannel %s has already been attached",
> -                       name);
> -            goto out_err;
> -        }
> -    }
> -
> -    /*
> -     * All vfio-ccw devices are believed to operate in a way compatible with
> -     * discarding of memory in RAM blocks, ie. pages pinned in the host are
> -     * in the current working set of the guest driver and therefore never
> -     * overlap e.g., with pages available to the guest balloon driver.  This
> -     * needs to be set before vfio_get_device() for vfio common to handle
> -     * ram_block_discard_disable().
> -     */
> -    vcdev->vdev.ram_block_discard_allowed = true;
> -
> -    if (vfio_get_device(group, cdev->mdevid, &vcdev->vdev, errp)) {
> -        goto out_err;
> -    }
> -
> -    vcdev->vdev.ops = &vfio_ccw_ops;
> -    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
> -    vcdev->vdev.name = name;
> -    vcdev->vdev.dev = DEVICE(vcdev);
> -
> -    return;
> -
> -out_err:
> -    g_free(name);
> -}
> -
> -static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
> -{
> -    char *tmp, group_path[PATH_MAX];
> -    ssize_t len;
> -    int groupid;
> -
> -    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
> -                          cdev->hostid.cssid, cdev->hostid.ssid,
> -                          cdev->hostid.devid, cdev->mdevid);
> -    len = readlink(tmp, group_path, sizeof(group_path));
> -    g_free(tmp);
> -
> -    if (len <= 0 || len >= sizeof(group_path)) {
> -        error_setg(errp, "vfio: no iommu_group found");
> -        return NULL;
> -    }
> -
> -    group_path[len] = 0;
> -
> -    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
> -        error_setg(errp, "vfio: failed to read %s", group_path);
> -        return NULL;
> -    }
> -
> -    return vfio_get_group(groupid, &address_space_memory, errp);
> -}
> -
>   static void vfio_ccw_realize(DeviceState *dev, Error **errp)
>   {
> -    VFIOGroup *group;
> -    S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
> -    VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
> +    CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev);
> +    S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
> +    VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
>       S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
> +    VFIODevice *vbasedev = &vcdev->vdev;
>       Error *err = NULL;
> +    int ret;
>   
>       /* Call the class init function for subchannel. */
>       if (cdc->realize) {
> @@ -663,14 +590,25 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
>           }
>       }
>   
> -    group = vfio_ccw_get_group(cdev, &err);
> -    if (!group) {
> -        goto out_group_err;
> -    }
> +    vbasedev->ops = &vfio_ccw_ops;
> +    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
> +    vbasedev->name = g_strdup(cdev->mdevid);
> +    vbasedev->dev = &vcdev->cdev.parent_obj.parent_obj;
>   
> -    vfio_ccw_get_device(group, vcdev, &err);
> -    if (err) {
> -        goto out_device_err;
> +    /*
> +     * All vfio-ccw devices are believed to operate in a way compatible with
> +     * discarding of memory in RAM blocks, ie. pages pinned in the host are
> +     * in the current working set of the guest driver and therefore never
> +     * overlap e.g., with pages available to the guest balloon driver.  This
> +     * needs to be set before vfio_get_device() for vfio common to handle
> +     * ram_block_discard_disable().
> +     */
> +    vbasedev->ram_block_discard_allowed = true;
> +
> +    ret = vfio_attach_device(vbasedev->name, vbasedev,
> +                             &address_space_memory, errp);
> +    if (ret) {
> +        goto out_attach_dev_err;
>       }
>   
>       vfio_ccw_get_region(vcdev, &err);
> @@ -708,10 +646,9 @@ out_irq_notifier_err:
>   out_io_notifier_err:
>       vfio_ccw_put_region(vcdev);
>   out_region_err:
> -    vfio_ccw_put_device(vcdev);
> -out_device_err:
> -    vfio_put_group(group);
> -out_group_err:
> +    vfio_detach_device(vbasedev);
> +out_attach_dev_err:
> +    g_free(vbasedev->name);
>       if (cdc->unrealize) {
>           cdc->unrealize(cdev);
>       }
> @@ -724,14 +661,13 @@ static void vfio_ccw_unrealize(DeviceState *dev)
>       S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
>       VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
>       S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
> -    VFIOGroup *group = vcdev->vdev.group;
>   
>       vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX);
>       vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX);
>       vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX);
>       vfio_ccw_put_region(vcdev);
> -    vfio_ccw_put_device(vcdev);
> -    vfio_put_group(group);
> +    vfio_detach_device(&vcdev->vdev);
> +    g_free(vcdev->vdev.name);
>   
>       if (cdc->unrealize) {
>           cdc->unrealize(cdev);
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 74556da0c7..c71fddc09a 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -837,7 +837,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>       }
>   }
>   
> -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
> +static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>   {
>       VFIOGroup *group;
>       char path[32];
> @@ -900,7 +900,7 @@ free_group_exit:
>       return NULL;
>   }
>   
> -void vfio_put_group(VFIOGroup *group)
> +static void vfio_put_group(VFIOGroup *group)
>   {
>       if (!group || !QLIST_EMPTY(&group->device_list)) {
>           return;
> @@ -917,8 +917,8 @@ void vfio_put_group(VFIOGroup *group)
>       g_free(group);
>   }
>   
> -int vfio_get_device(VFIOGroup *group, const char *name,
> -                    VFIODevice *vbasedev, Error **errp)
> +static int vfio_get_device(VFIOGroup *group, const char *name,
> +                           VFIODevice *vbasedev, Error **errp)
>   {
>       g_autofree struct vfio_device_info *info = NULL;
>       int fd;
> @@ -976,7 +976,7 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>       return 0;
>   }
>   
> -void vfio_put_base_device(VFIODevice *vbasedev)
> +static void vfio_put_base_device(VFIODevice *vbasedev)
>   {
>       if (!vbasedev->group) {
>           return;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a29dfe7723..95bcafdaf6 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -230,7 +230,6 @@ int vfio_container_add_section_window(VFIOContainer *container,
>   void vfio_container_del_section_window(VFIOContainer *container,
>                                          MemoryRegionSection *section);
>   
> -void vfio_put_base_device(VFIODevice *vbasedev);
>   void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>   void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
>   void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
> @@ -248,11 +247,7 @@ void vfio_region_unmap(VFIORegion *region);
>   void vfio_region_exit(VFIORegion *region);
>   void vfio_region_finalize(VFIORegion *region);
>   void vfio_reset_handler(void *opaque);
> -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
> -void vfio_put_group(VFIOGroup *group);
>   struct vfio_device_info *vfio_get_device_info(int fd);
> -int vfio_get_device(VFIOGroup *group, const char *name,
> -                    VFIODevice *vbasedev, Error **errp);
>   int vfio_attach_device(char *name, VFIODevice *vbasedev,
>                          AddressSpace *as, Error **errp);
>   void vfio_detach_device(VFIODevice *vbasedev);



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 12/22] vfio/ccw: Use vfio_[attach/detach]_device
  2023-09-21 12:19   ` Cédric Le Goater
@ 2023-09-21 13:00     ` Duan, Zhenzhong
  2023-09-21 13:24       ` Cédric Le Goater
  0 siblings, 1 reply; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-21 13:00 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Eric Farman, Matthew Rosato, Thomas Huth,
	open list:vfio-ccw



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Thursday, September 21, 2023 8:20 PM
>Subject: Re: [PATCH v1 12/22] vfio/ccw: Use vfio_[attach/detach]_device
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> From: Eric Auger <eric.auger@redhat.com>
>>
>> Let the vfio-ccw device use vfio_attach_device() and
>> vfio_detach_device(), hence hiding the details of the used
>> IOMMU backend.
>>
>> Also, now that all the devices have been migrated to use the new
>> vfio_attach_device/vfio_detach_device API, let's turn the
>> legacy functions into static functions, local to container.c.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>
>Zhenzhong,
>
>Could you please resend 1-12 independently as a prereq series for iommufd
>support ? I think there wouldn't be much to say and they could be merged
>pretty quickly.

Got it, will do.
Note I want to replace "[PATCH v1 06/22] vfio/common: Add a vfio device iterator"
with a vfio_device_list that will be used by both BEs; that may take some time.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 12/22] vfio/ccw: Use vfio_[attach/detach]_device
  2023-09-21 13:00     ` Duan, Zhenzhong
@ 2023-09-21 13:24       ` Cédric Le Goater
  0 siblings, 0 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-21 13:24 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Eric Farman, Matthew Rosato, Thomas Huth,
	open list:vfio-ccw

On 9/21/23 15:00, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Sent: Thursday, September 21, 2023 8:20 PM
>> Subject: Re: [PATCH v1 12/22] vfio/ccw: Use vfio_[attach/detach]_device
>>
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> From: Eric Auger <eric.auger@redhat.com>
>>>
>>> Let the vfio-ccw device use vfio_attach_device() and
>>> vfio_detach_device(), hence hiding the details of the used
>>> IOMMU backend.
>>>
>>> Also, now that all the devices have been migrated to use the new
>>> vfio_attach_device/vfio_detach_device API, let's turn the
>>> legacy functions into static functions, local to container.c.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>
>> Zhenzhong,
>>
>> Could you please resend 1-12 independently as a prereq series for iommufd
>> support ? I think there wouldn't be much to say and they could be merged
>> pretty quickly.
> 
> Got it, will do.
> Note I want to replace "[PATCH v1 06/22] vfio/common: Add a vfio device iterator"
> with a vfio_device_list that will be used by both BEs; that may take some time.

Sure. Not today :) When you can.

Thanks,

C.



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 13/22] vfio: Add base container
  2023-09-21  3:35       ` Duan, Zhenzhong
  2023-09-21  6:28         ` Eric Auger
@ 2023-09-21 17:20         ` Eric Auger
  2023-09-22  2:52           ` Duan, Zhenzhong
  1 sibling, 1 reply; 109+ messages in thread
From: Eric Auger @ 2023-09-21 17:20 UTC (permalink / raw)
  To: Duan, Zhenzhong, Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, peterx, jasowang,
	Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P, Yi Sun,
	Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)

Hi Zhenzhong,
On 9/21/23 05:35, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Thursday, September 21, 2023 1:31 AM
>> Subject: Re: [PATCH v1 13/22] vfio: Add base container
>>
>> Hi Zhenzhong,
>>
>> On 9/19/23 19:23, Cédric Le Goater wrote:
>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> Abstract the VFIOContainer to be a base object. It is supposed to be
>>>> embedded in the legacy VFIO container and, later on, in the new iommufd
>>>> based container.
>>>>
>>>> The base container implements generic code such as code related to
>>>> memory_listener and address space management. The VFIOContainerOps
>>>> implements callbacks that depend on the kernel userspace interface being used.
>>>>
>>>> 'common.c' and vfio device code only manipulate the base container with
>>>> wrapper functions that call the functions defined in
>>>> VFIOContainerOpsClass.
>>>> Existing 'container.c' code is converted to implement the legacy
>>>> container ops functions.
>>>>
>>>> Below is the base container. It's named VFIOContainer; the old
>>>> VFIOContainer is replaced with VFIOLegacyContainer.
>>> Usually, we introduce the new interface on its own, port the current models
>>> on top of the new interface, wire the new models into the current
>>> implementation and remove the old implementation. Then, we can start
>>> adding extensions to support other implementations.
>>>
>>> spapr should be taken care of separately following the principle above.
>>> With my PPC hat on, I would not even read such a massive change, too risky
>>> for the subsystem. This patch will need (much) further splitting to be
>>> understandable and acceptable.
>> We might split this patch by
>> 1) introducing VFIOLegacyContainer encapsulating the base VFIOContainer,
>> without using the ops at first:
>>  common.c would call vfio_container_* with hardcoded legacy
>> implementation, i.e. retrieving the legacy container with container_of.
>> 2) we would introduce the BE interface without using it.
>> 3) we would use the new BE interface
>>
>> Obviously this needs to be further tried out. If you wish I can try to
>> split it that way ... Please let me know
> Sure, thanks for your help, glad to cooperate with you to move
> this series forward.
> I just updated the branch, rebased onto the newest upstream, for you to pick up at https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_cdev_v1_rebased

I have spent most of my day reshuffling this single patch into numerous
ones (16!). This should help the review.
I was short of time. This compiles, and the end code should be identical to
the original one. Besides, this deserves some additional review on your
end, commit msg tuning, ...

But at least it is a move forward. Feel free to incorporate that in your
next respin.

Please find that work on the following branch

https://github.com/eauger/qemu/tree/iommufd_cdev_v1_rebased_split

Thanks

Eric
>
> Thanks
> Zhenzhong



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 13/22] vfio: Add base container
  2023-09-21 17:20         ` Eric Auger
@ 2023-09-22  2:52           ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-22  2:52 UTC (permalink / raw)
  To: eric.auger, Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, peterx, jasowang,
	Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng, Chao P, Yi Sun,
	Daniel Henrique Barboza, David Gibson, Greg Kurz,
	Harsh Prateek Bora, open list:sPAPR (pseries)



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Friday, September 22, 2023 1:20 AM
>Subject: Re: [PATCH v1 13/22] vfio: Add base container
>
>Hi Zhenzhong,
>On 9/21/23 05:35, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Sent: Thursday, September 21, 2023 1:31 AM
>>> Subject: Re: [PATCH v1 13/22] vfio: Add base container
>>>
>>> Hi Zhenzhong,
>>>
>>> On 9/19/23 19:23, Cédric Le Goater wrote:
>>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>>
>>>>> Abstract the VFIOContainer to be a base object. It is supposed to be
>>>>> embedded in the legacy VFIO container and, later on, in the new iommufd
>>>>> based container.
>>>>>
>>>>> The base container implements generic code such as code related to
>>>>> memory_listener and address space management. The VFIOContainerOps
>>>>> implements callbacks that depend on the kernel userspace interface being used.
>>>>>
>>>>> 'common.c' and vfio device code only manipulate the base container with
>>>>> wrapper functions that call the functions defined in
>>>>> VFIOContainerOpsClass.
>>>>> Existing 'container.c' code is converted to implement the legacy
>>>>> container ops functions.
>>>>>
>>>>> Below is the base container. It's named VFIOContainer; the old
>>>>> VFIOContainer is replaced with VFIOLegacyContainer.
>>>> Usually, we introduce the new interface on its own, port the current models
>>>> on top of the new interface, wire the new models into the current
>>>> implementation and remove the old implementation. Then, we can start
>>>> adding extensions to support other implementations.
>>>>
>>>> spapr should be taken care of separately following the principle above.
>>>> With my PPC hat on, I would not even read such a massive change, too risky
>>>> for the subsystem. This patch will need (much) further splitting to be
>>>> understandable and acceptable.
>>> We might split this patch by
>>> 1) introducing VFIOLegacyContainer encapsulating the base VFIOContainer,
>>> without using the ops at first:
>>> common.c would call vfio_container_* with hardcoded legacy
>>> implementation, i.e. retrieving the legacy container with container_of.
>>> 2) we would introduce the BE interface without using it.
>>> 3) we would use the new BE interface
>>>
>>> Obviously this needs to be further tried out. If you wish I can try to
>>> split it that way ... Please let me know
>> Sure, thanks for your help, glad to cooperate with you to move
>> this series forward.
>> I just updated the branch, rebased onto the newest upstream, for you to pick up at
>https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_cdev_v1_rebased
>
>I have spent most of my day reshuffling this single patch into numerous
>ones (16!). This should help the review.
>I was short of time. This compiles, and the end code should be identical to
>the original one. Besides, this deserves some additional review on your
>end, commit msg tuning, ...
>
>But at least it is a move forward. Feel free to incorporate that in your
>next respin.
>
>Please find that work on the following branch
>
>https://github.com/eauger/qemu/tree/iommufd_cdev_v1_rebased_split

Thanks Eric, you have done such quick and awesome work. Let me study
your changes and integrate them with my other changes. I will get back to you
then.
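
For reference, the layering being discussed boils down to roughly the
following sketch (type and field names are illustrative, not the final code):

    /* base object, kernel-interface agnostic (step 1 of the split) */
    typedef struct VFIOContainer {
        VFIOAddressSpace *space;
        QLIST_HEAD(, VFIODevice) device_list;
    } VFIOContainer;

    /* legacy /dev/vfio/vfio backend embedding the base object */
    typedef struct VFIOLegacyContainer {
        VFIOContainer bcontainer;
        int fd;                  /* /dev/vfio/vfio container fd */
        unsigned iommu_type;
    } VFIOLegacyContainer;

    /* step 1: common.c hardcodes the legacy implementation and simply
     * retrieves the legacy container with container_of; the BE ops
     * interface only comes in at steps 2 and 3 */
    static VFIOLegacyContainer *to_legacy_container(VFIOContainer *bcontainer)
    {
        return container_of(bcontainer, VFIOLegacyContainer, bcontainer);
    }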

BRs.
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object
  2023-08-30 10:37 ` [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object Zhenzhong Duan
@ 2023-09-22  7:15   ` Cédric Le Goater
  2023-09-22  8:39     ` Duan, Zhenzhong
  0 siblings, 1 reply; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-22  7:15 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, joao.m.martins, eric.auger,
	peterx, jasowang, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng,
	Paolo Bonzini, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé,
	Eduardo Habkost

On 8/30/23 12:37, Zhenzhong Duan wrote:
> From: Eric Auger <eric.auger@redhat.com>
> 
> Introduce an iommufd object which allows the interaction
> with the host /dev/iommu device.
> 
> The /dev/iommu may have already been pre-opened outside of QEMU,
> in which case the fd can be passed directly along with the
> iommufd object.
> 
> This allows the iommufd object to be shared across several
> subsystems (VFIO, VDPA, ...). For example, libvirt would open
> the /dev/iommu once.
> 
> If no fd is passed along with the iommufd object, the /dev/iommu
> is opened by the qemu code.
> 
> The CONFIG_IOMMUFD option must be set to compile this new object.
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   MAINTAINERS              |   7 +
>   backends/Kconfig         |   4 +
>   backends/iommufd.c       | 291 +++++++++++++++++++++++++++++++++++++++
>   backends/meson.build     |   3 +
>   backends/trace-events    |  13 ++
>   include/sysemu/iommufd.h |  49 +++++++
>   qapi/qom.json            |  18 ++-
>   qemu-options.hx          |  13 ++
>   8 files changed, 397 insertions(+), 1 deletion(-)
>   create mode 100644 backends/iommufd.c
>   create mode 100644 include/sysemu/iommufd.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6111b6b4d9..04663fbb6f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2079,6 +2079,13 @@ F: hw/vfio/ap.c
>   F: docs/system/s390x/vfio-ap.rst
>   L: qemu-s390x@nongnu.org
>   
> +iommufd
> +M: Yi Liu <yi.l.liu@intel.com>
> +M: Eric Auger <eric.auger@redhat.com>
> +S: Supported
> +F: backends/iommufd.c
> +F: include/sysemu/iommufd.h
> +
>   vhost
>   M: Michael S. Tsirkin <mst@redhat.com>
>   S: Supported
> diff --git a/backends/Kconfig b/backends/Kconfig
> index f35abc1609..2cb23f62fa 100644
> --- a/backends/Kconfig
> +++ b/backends/Kconfig
> @@ -1 +1,5 @@
>   source tpm/Kconfig
> +
> +config IOMMUFD
> +    bool
> +    depends on VFIO
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> new file mode 100644
> index 0000000000..07ea434424
> --- /dev/null
> +++ b/backends/iommufd.c
> @@ -0,0 +1,291 @@
> +/*
> + * iommufd container backend
> + *
> + * Copyright (C) 2023 Intel Corporation.
> + * Copyright Red Hat, Inc. 2023
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Eric Auger <eric.auger@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "sysemu/iommufd.h"
> +#include "qapi/error.h"
> +#include "qapi/qmp/qerror.h"
> +#include "qemu/module.h"
> +#include "qom/object_interfaces.h"
> +#include "qemu/error-report.h"
> +#include "monitor/monitor.h"
> +#include "trace.h"
> +#include <sys/ioctl.h>
> +#include <linux/iommufd.h>
> +
> +static void iommufd_backend_init(Object *obj)
> +{
> +    IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
> +
> +    be->fd = -1;
> +    be->users = 0;
> +    be->owned = true;
> +    qemu_mutex_init(&be->lock);
> +}
> +
> +static void iommufd_backend_finalize(Object *obj)
> +{
> +    IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
> +
> +    if (be->owned) {
> +        close(be->fd);
> +        be->fd = -1;
> +    }
> +}
> +
> +static void iommufd_backend_set_fd(Object *obj, const char *str, Error **errp)
> +{
> +    IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
> +    int fd = -1;
> +
> +    fd = monitor_fd_param(monitor_cur(), str, errp);
> +    if (fd == -1) {
> +        error_prepend(errp, "Could not parse remote object fd %s:", str);
> +        return;
> +    }
> +    qemu_mutex_lock(&be->lock);
> +    be->fd = fd;
> +    be->owned = false;
> +    qemu_mutex_unlock(&be->lock);
> +    trace_iommu_backend_set_fd(be->fd);
> +}
> +
> +static void iommufd_backend_class_init(ObjectClass *oc, void *data)
> +{
> +    object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
> +}
> +
> +int iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
> +{
> +    int fd, ret = 0;
> +
> +    qemu_mutex_lock(&be->lock);
> +    if (be->users == UINT32_MAX) {
> +        error_setg(errp, "too many connections");
> +        ret = -E2BIG;
> +        goto out;
> +    }
> +    if (be->owned && !be->users) {
> +        fd = qemu_open_old("/dev/iommu", O_RDWR);
> +        if (fd < 0) {
> +            error_setg_errno(errp, errno, "/dev/iommu opening failed");
> +            ret = fd;
> +            goto out;
> +        }
> +        be->fd = fd;
> +    }
> +    be->users++;
> +out:
> +    trace_iommufd_backend_connect(be->fd, be->owned,
> +                                  be->users, ret);
> +    qemu_mutex_unlock(&be->lock);
> +    return ret;
> +}
> +
> +void iommufd_backend_disconnect(IOMMUFDBackend *be)
> +{
> +    qemu_mutex_lock(&be->lock);
> +    if (!be->users) {
> +        goto out;
> +    }
> +    be->users--;
> +    if (!be->users && be->owned) {
> +        close(be->fd);
> +        be->fd = -1;
> +    }
> +out:
> +    trace_iommufd_backend_disconnect(be->fd, be->users);
> +    qemu_mutex_unlock(&be->lock);
> +}
> +
> +static int iommufd_backend_alloc_ioas(int fd, uint32_t *ioas)
> +{
> +    int ret;
> +    struct iommu_ioas_alloc alloc_data  = {
> +        .size = sizeof(alloc_data),
> +        .flags = 0,
> +    };
> +
> +    ret = ioctl(fd, IOMMU_IOAS_ALLOC, &alloc_data);
> +    if (ret) {
> +        error_report("Failed to allocate ioas %m");
> +    }
> +
> +    *ioas = alloc_data.out_ioas_id;
> +    trace_iommufd_backend_alloc_ioas(fd, *ioas, ret);
> +
> +    return ret;
> +}
> +
> +void iommufd_backend_free_id(int fd, uint32_t id)
> +{
> +    int ret;
> +    struct iommu_destroy des = {
> +        .size = sizeof(des),
> +        .id = id,
> +    };
> +
> +    ret = ioctl(fd, IOMMU_DESTROY, &des);
> +    trace_iommufd_backend_free_id(fd, id, ret);
> +    if (ret) {
> +        error_report("Failed to free id: %u %m", id);
> +    }
> +}
> +
> +int iommufd_backend_get_ioas(IOMMUFDBackend *be, uint32_t *ioas_id)
> +{
> +    int ret;
> +
> +    ret = iommufd_backend_alloc_ioas(be->fd, ioas_id);
> +    trace_iommufd_backend_get_ioas(be->fd, *ioas_id, ret);
> +    return ret;
> +}
> +
> +void iommufd_backend_put_ioas(IOMMUFDBackend *be, uint32_t ioas)
> +{
> +    trace_iommufd_backend_put_ioas(be->fd, ioas);
> +    iommufd_backend_free_id(be->fd, ioas);
> +}
> +
> +int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas,
> +                              hwaddr iova, ram_addr_t size)
> +{
> +    int ret;
> +    struct iommu_ioas_unmap unmap = {
> +        .size = sizeof(unmap),
> +        .ioas_id = ioas,
> +        .iova = iova,
> +        .length = size,
> +    };
> +
> +    ret = ioctl(be->fd, IOMMU_IOAS_UNMAP, &unmap);
> +    trace_iommufd_backend_unmap_dma(be->fd, ioas, iova, size, ret);
> +    if (ret && errno == ENOENT) {
> +        ret = 0;
> +    }
> +    if (ret) {
> +        error_report("IOMMU_IOAS_UNMAP failed: %s", strerror(errno));
> +    }
> +    return !ret ? 0 : -errno;
> +}
> +
> +int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas, hwaddr iova,
> +                            ram_addr_t size, void *vaddr, bool readonly)
> +{
> +    int ret;
> +    struct iommu_ioas_map map = {
> +        .size = sizeof(map),
> +        .flags = IOMMU_IOAS_MAP_READABLE |
> +                 IOMMU_IOAS_MAP_FIXED_IOVA,
> +        .ioas_id = ioas,
> +        .__reserved = 0,
> +        .user_va = (int64_t)vaddr,

This needs an extra cast (uintptr_t)

Thanks,

C.

> +        .iova = iova,
> +        .length = size,
> +    };
> +
> +    if (!readonly) {
> +        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
> +    }
> +
> +    ret = ioctl(be->fd, IOMMU_IOAS_MAP, &map);
> +    trace_iommufd_backend_map_dma(be->fd, ioas, iova, size,
> +                                  vaddr, readonly, ret);
> +    if (ret) {
> +        error_report("IOMMU_IOAS_MAP failed: %s", strerror(errno));
> +    }
> +    return !ret ? 0 : -errno;
> +}
> +
> +int iommufd_backend_copy_dma(IOMMUFDBackend *be, uint32_t src_ioas,
> +                             uint32_t dst_ioas, hwaddr iova,
> +                             ram_addr_t size, bool readonly)
> +{
> +    int ret;
> +    struct iommu_ioas_copy copy = {
> +        .size = sizeof(copy),
> +        .flags = IOMMU_IOAS_MAP_READABLE |
> +                 IOMMU_IOAS_MAP_FIXED_IOVA,
> +        .dst_ioas_id = dst_ioas,
> +        .src_ioas_id = src_ioas,
> +        .length = size,
> +        .dst_iova = iova,
> +        .src_iova = iova,
> +    };
> +
> +    if (!readonly) {
> +        copy.flags |= IOMMU_IOAS_MAP_WRITEABLE;
> +    }
> +
> +    ret = ioctl(be->fd, IOMMU_IOAS_COPY, &copy);
> +    trace_iommufd_backend_copy_dma(be->fd, src_ioas, dst_ioas,
> +                                   iova, size, readonly, ret);
> +    if (ret) {
> +        error_report("IOMMU_IOAS_COPY failed: %s", strerror(errno));
> +    }
> +    return !ret ? 0 : -errno;
> +}
> +
> +int iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id,
> +                               uint32_t pt_id, uint32_t *out_hwpt)
> +{
> +    int ret;
> +    struct iommu_hwpt_alloc alloc_hwpt = {
> +        .size = sizeof(struct iommu_hwpt_alloc),
> +        .flags = 0,
> +        .dev_id = dev_id,
> +        .pt_id = pt_id,
> +        .__reserved = 0,
> +    };
> +
> +    ret = ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
> +    trace_iommufd_backend_alloc_hwpt(iommufd, dev_id, pt_id, ret);
> +
> +    if (ret) {
> +        error_report("IOMMU_HWPT_ALLOC failed: %s", strerror(errno));
> +    } else {
> +        *out_hwpt = alloc_hwpt.out_hwpt_id;
> +    }
> +    return !ret ? 0 : -errno;
> +}
> +
> +static const TypeInfo iommufd_backend_info = {
> +    .name = TYPE_IOMMUFD_BACKEND,
> +    .parent = TYPE_OBJECT,
> +    .instance_size = sizeof(IOMMUFDBackend),
> +    .instance_init = iommufd_backend_init,
> +    .instance_finalize = iommufd_backend_finalize,
> +    .class_size = sizeof(IOMMUFDBackendClass),
> +    .class_init = iommufd_backend_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +        { TYPE_USER_CREATABLE },
> +        { }
> +    }
> +};
> +
> +static void register_types(void)
> +{
> +    type_register_static(&iommufd_backend_info);
> +}
> +
> +type_init(register_types);
> diff --git a/backends/meson.build b/backends/meson.build
> index 914c7c4afb..29dc147c8e 100644
> --- a/backends/meson.build
> +++ b/backends/meson.build
> @@ -20,6 +20,9 @@ if have_vhost_user
>     system_ss.add(when: 'CONFIG_VIRTIO', if_true: files('vhost-user.c'))
>   endif
>   system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost.c'))
> +if have_iommufd
> +  system_ss.add(files('iommufd.c'))
> +endif
>   if have_vhost_user_crypto
>     system_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('cryptodev-vhost-user.c'))
>   endif
> diff --git a/backends/trace-events b/backends/trace-events
> index 652eb76a57..093e3eb1da 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -5,3 +5,16 @@ dbus_vmstate_pre_save(void)
>   dbus_vmstate_post_load(int version_id) "version_id: %d"
>   dbus_vmstate_loading(const char *id) "id: %s"
>   dbus_vmstate_saving(const char *id) "id: %s"
> +
> +# iommufd.c
> +iommufd_backend_connect(int fd, bool owned, uint32_t users, int ret) "fd=%d owned=%d users=%d (%d)"
> +iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
> +iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
> +iommufd_backend_get_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)"
> +iommufd_backend_put_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
> +iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
> +iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
> +iommufd_backend_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, uint64_t iova, uint64_t size, bool readonly, int ret) " iommufd=%d src_ioas=%d dst_ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" readonly=%d (%d)"
> +iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)"
> +iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
> +iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u (%d)"
> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
> new file mode 100644
> index 0000000000..f3bd212170
> --- /dev/null
> +++ b/include/sysemu/iommufd.h
> @@ -0,0 +1,49 @@
> +#ifndef SYSEMU_IOMMUFD_H
> +#define SYSEMU_IOMMUFD_H
> +
> +#include "qom/object.h"
> +#include "qemu/thread.h"
> +#include "exec/hwaddr.h"
> +#include "exec/cpu-common.h"
> +
> +#define TYPE_IOMMUFD_BACKEND "iommufd"
> +OBJECT_DECLARE_TYPE(IOMMUFDBackend, IOMMUFDBackendClass,
> +                    IOMMUFD_BACKEND)
> +#define IOMMUFD_BACKEND(obj) \
> +    OBJECT_CHECK(IOMMUFDBackend, (obj), TYPE_IOMMUFD_BACKEND)
> +#define IOMMUFD_BACKEND_GET_CLASS(obj) \
> +    OBJECT_GET_CLASS(IOMMUFDBackendClass, (obj), TYPE_IOMMUFD_BACKEND)
> +#define IOMMUFD_BACKEND_CLASS(klass) \
> +    OBJECT_CLASS_CHECK(IOMMUFDBackendClass, (klass), TYPE_IOMMUFD_BACKEND)
> +struct IOMMUFDBackendClass {
> +    ObjectClass parent_class;
> +};
> +
> +struct IOMMUFDBackend {
> +    Object parent;
> +
> +    /*< protected >*/
> +    int fd;            /* /dev/iommu file descriptor */
> +    bool owned;        /* is the /dev/iommu opened internally */
> +    QemuMutex lock;
> +    uint32_t users;
> +
> +    /*< public >*/
> +};
> +
> +int iommufd_backend_connect(IOMMUFDBackend *be, Error **errp);
> +void iommufd_backend_disconnect(IOMMUFDBackend *be);
> +
> +int iommufd_backend_get_ioas(IOMMUFDBackend *be, uint32_t *ioas_id);
> +void iommufd_backend_put_ioas(IOMMUFDBackend *be, uint32_t ioas_id);
> +void iommufd_backend_free_id(int fd, uint32_t id);
> +int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas,
> +                              hwaddr iova, ram_addr_t size);
> +int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas, hwaddr iova,
> +                            ram_addr_t size, void *vaddr, bool readonly);
> +int iommufd_backend_copy_dma(IOMMUFDBackend *be, uint32_t src_ioas,
> +                             uint32_t dst_ioas, hwaddr iova,
> +                             ram_addr_t size, bool readonly);
> +int iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id,
> +                               uint32_t pt_id, uint32_t *out_hwpt);
> +#endif
> diff --git a/qapi/qom.json b/qapi/qom.json
> index fa3e88c8e6..2646ac4cca 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -779,6 +779,18 @@
>   { 'struct': 'VfioUserServerProperties',
>     'data': { 'socket': 'SocketAddress', 'device': 'str' } }
>   
> +##
> +# @IOMMUFDProperties:
> +#
> +# Properties for IOMMUFDbackend objects.
> +#
> +# fd: file descriptor name
> +#
> +# Since: 7.2
> +##
> +{ 'struct': 'IOMMUFDProperties',
> +        'data': { '*fd': 'str' } }
> +
>   ##
>   # @RngProperties:
>   #
> @@ -933,6 +945,8 @@
>       'qtest',
>       'rng-builtin',
>       'rng-egd',
> +    { 'name': 'iommufd',
> +      'if': 'CONFIG_IOMMUFD' },
>       { 'name': 'rng-random',
>         'if': 'CONFIG_POSIX' },
>       'secret',
> @@ -1014,7 +1028,9 @@
>         'tls-creds-x509':             'TlsCredsX509Properties',
>         'tls-cipher-suites':          'TlsCredsProperties',
>         'x-remote-object':            'RemoteObjectProperties',
> -      'x-vfio-user-server':         'VfioUserServerProperties'
> +      'x-vfio-user-server':         'VfioUserServerProperties',
> +      'iommufd':                    { 'type': 'IOMMUFDProperties',
> +                                      'if': 'CONFIG_IOMMUFD' }
>     } }
>   
>   ##
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 29b98c3d4c..827dd085ee 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -5098,6 +5098,19 @@ SRST
>   
>           The ``share`` boolean option is on by default with memfd.
>   
> +#ifdef CONFIG_IOMMUFD
> +    ``-object iommufd,id=id[,fd=fd]``
> +        Creates an iommufd backend which allows control of DMA mapping
> +        through the /dev/iommu device.
> +
> +        The ``id`` parameter is a unique ID which frontends (such as
> +        vfio-pci or vdpa) will use to connect with the iommufd backend.
> +
> +        The ``fd`` parameter is an optional pre-opened file descriptor
> +        resulting from /dev/iommu opening. Usually the iommufd is shared
> +        across all subsystems, bringing the benefit of centralized
> +        reference counting.
> +#endif
>       ``-object rng-builtin,id=id``
>           Creates a random number generator backend which obtains entropy
>           from QEMU builtin functions. The ``id`` parameter is a unique ID
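
As a rough usage sketch of the API declared in include/sysemu/iommufd.h above
(error handling trimmed, the calling function itself is hypothetical), a
consumer such as a VFIO container backend would connect, allocate an IOAS and
then map/unmap DMA through it:

    static int iommufd_usage_sketch(IOMMUFDBackend *be, hwaddr iova,
                                    ram_addr_t size, void *vaddr, Error **errp)
    {
        uint32_t ioas_id;

        if (iommufd_backend_connect(be, errp)) {
            return -1;
        }
        if (iommufd_backend_get_ioas(be, &ioas_id)) {
            iommufd_backend_disconnect(be);
            return -1;
        }
        if (iommufd_backend_map_dma(be, ioas_id, iova, size, vaddr, false)) {
            iommufd_backend_put_ioas(be, ioas_id);
            iommufd_backend_disconnect(be);
            return -1;
        }
        /* ... later: iommufd_backend_unmap_dma(), then put_ioas() and
         * disconnect() on teardown ... */
        return 0;
    }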



^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object
  2023-09-22  7:15   ` Cédric Le Goater
@ 2023-09-22  8:39     ` Duan, Zhenzhong
  0 siblings, 0 replies; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-22  8:39 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P, Paolo Bonzini, Eric Blake, Markus Armbruster,
	Daniel P. Berrangé,
	Eduardo Habkost



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Friday, September 22, 2023 3:16 PM
...
>> +int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas,
>hwaddr iova,
>> +                            ram_addr_t size, void *vaddr, bool readonly)
>> +{
>> +    int ret;
>> +    struct iommu_ioas_map map = {
>> +        .size = sizeof(map),
>> +        .flags = IOMMU_IOAS_MAP_READABLE |
>> +                 IOMMU_IOAS_MAP_FIXED_IOVA,
>> +        .ioas_id = ioas,
>> +        .__reserved = 0,
>> +        .user_va = (int64_t)vaddr,
>
>This needs an extra cast (uintptr_t)

Will fix.
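
For reference, the fixed initializer would presumably end up as something
like the following (exact form to be confirmed in the respin):

    .user_va = (uintptr_t)vaddr,   /* instead of (int64_t)vaddr */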

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 15/22] Add iommufd configure option
  2023-09-20 18:19                   ` Jason Gunthorpe
  2023-09-21  3:43                     ` Duan, Zhenzhong
@ 2023-09-26  6:05                     ` Tian, Kevin
  1 sibling, 0 replies; 109+ messages in thread
From: Tian, Kevin @ 2023-09-26  6:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Eric Auger, Cédric Le Goater, Duan, Zhenzhong, qemu-devel,
	nicolinc, Martins, Joao, peterx, jasowang, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 21, 2023 2:20 AM
> 
> On Wed, Sep 20, 2023 at 12:17:24PM -0600, Alex Williamson wrote:
> 
> > > The iommufd design requires one open of the /dev/iommu to be shared
> > > across all the vfios.
> >
> > "requires"?  It's certainly of limited value to have multiple iommufd
> > instances rather than create multiple address spaces within a single
> > iommufd, but what exactly precludes an iommufd per device if QEMU, or
> > any other userspace so desired?  Thanks,
> 
> From the kernel side, "requires" is too strong, I suppose
> 

Agree. But given the limited value, let's stay with one iommufd per QEMU
instance to reduce the maintenance burden.

It is also more future-friendly towards nested translation or the
need for centralized PASID tracking when supporting SIOV/ENQCMD, etc.

Supporting those new features across multiple iommufds would
incur unnecessary complexity.
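
As an illustration of that model (the PCI addresses are placeholders), a
single iommufd object is created once and every assigned device links to it:

    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=0000:01:00.0,iommufd=iommufd0 \
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0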


^ permalink raw reply	[flat|nested] 109+ messages in thread

* RE: [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-09-21 10:14     ` Duan, Zhenzhong
  2023-09-21 10:55       ` Cédric Le Goater
@ 2023-09-27  2:08       ` Duan, Zhenzhong
  2023-09-27  6:50         ` Cédric Le Goater
  1 sibling, 1 reply; 109+ messages in thread
From: Duan, Zhenzhong @ 2023-09-27  2:08 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P

Hi Cédric,

>-----Original Message-----
>From: Duan, Zhenzhong
>Sent: Thursday, September 21, 2023 6:14 PM
>Subject: RE: [PATCH v1 04/22] vfio/common: Introduce
>vfio_container_add|del_section_window()
>
>Hi Cédric,
>
>>-----Original Message-----
>>From: Cédric Le Goater <clg@redhat.com>
>>Sent: Thursday, September 21, 2023 4:29 PM
>>Subject: Re: [PATCH v1 04/22] vfio/common: Introduce
>>vfio_container_add|del_section_window()
>>
>>Hello Zhenzhong,
>>
>>On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> From: Eric Auger <eric.auger@redhat.com>
>>>
>>> Introduce helper functions that isolate the code used for
>>> VFIO_SPAPR_TCE_v2_IOMMU. This code is IOMMU backend
>>> specific, whereas the rest of the code in the callers, i.e.
>>> vfio_listener_region_add|del, is not.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>   hw/vfio/common.c | 156 +++++++++++++++++++++++++++--------------------
>>>   1 file changed, 89 insertions(+), 67 deletions(-)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 9ca695837f..67150e4575 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -796,6 +796,92 @@ static bool
>>vfio_get_section_iova_range(VFIOContainer *container,
>>>       return true;
>>>   }
>>>
>>> +static int vfio_container_add_section_window(VFIOContainer *container,
>>> +                                             MemoryRegionSection *section,
>>> +                                             Error **errp)
>>> +{
>>> +    VFIOHostDMAWindow *hostwin;
>>> +    hwaddr pgsize = 0;
>>> +    int ret;
>>> +
>>> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>>> +        return 0;
>>> +    }
>>
>>This test makes me think that we should register a specific backend
>>for the pseries machines, implementing the add/del_window handler,
>>since others do not need it. Correct ?
>
>Yes, introducing a specific backend could help remove the above check.
>But each backend has a VFIOIOMMUBackendOps, so we would need the same check
>as above to select the ops.
>
>>
>>It would avoid this ugly test. Let's keep that in mind when the
>>backends are introduced.
>>
>>> +
>>> +    /* For now intersections are not allowed, we may relax this later */
>>> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>>> +        if (ranges_overlap(hostwin->min_iova,
>>> +                           hostwin->max_iova - hostwin->min_iova + 1,
>>> +                           section->offset_within_address_space,
>>> +                           int128_get64(section->size))) {
>>> +            error_setg(errp,
>>> +                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
>>> +                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
>>> +                section->offset_within_address_space,
>>> +                section->offset_within_address_space +
>>> +                    int128_get64(section->size) - 1,
>>> +                hostwin->min_iova, hostwin->max_iova);
>>> +            return -EINVAL;
>>> +        }
>>> +    }
>>> +
>>> +    ret = vfio_spapr_create_window(container, section, &pgsize);
>>> +    if (ret) {
>>> +        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
>>> +        return ret;
>>> +    }
>>> +
>>> +    vfio_host_win_add(container, section->offset_within_address_space,
>>> +                      section->offset_within_address_space +
>>> +                      int128_get64(section->size) - 1, pgsize);
>>> +#ifdef CONFIG_KVM
>>
>>the ifdef test doesn't seem useful because the compiler should compile
>>out the section below since, in that case, kvm_enabled() is defined as :
>>
>>   #define kvm_enabled()           (0)
>
>Looks so, I'll remove it in v2.

Forgot to let you know: in the end I failed to remove the ifdef test in v2 due to
many "undeclared" compile errors. I guess the reason is that the compiler still
checks the symbols in the dead branch before optimizing it away.

For example:
error: ‘KVM_DEV_VFIO_GROUP’ undeclared
error: ‘vfio_kvm_device_fd’ undeclared
...

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-09-27  2:08       ` Duan, Zhenzhong
@ 2023-09-27  6:50         ` Cédric Le Goater
  0 siblings, 0 replies; 109+ messages in thread
From: Cédric Le Goater @ 2023-09-27  6:50 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel
  Cc: alex.williamson, jgg, nicolinc, Martins, Joao, eric.auger,
	peterx, jasowang, Tian, Kevin, Liu, Yi L, Sun, Yi Y, Peng,
	Chao P

On 9/27/23 04:08, Duan, Zhenzhong wrote:
> Hi Cédric,
> 
>> -----Original Message-----
>> From: Duan, Zhenzhong
>> Sent: Thursday, September 21, 2023 6:14 PM
>> Subject: RE: [PATCH v1 04/22] vfio/common: Introduce
>> vfio_container_add|del_section_window()
>>
>> Hi Cédric,
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Sent: Thursday, September 21, 2023 4:29 PM
>>> Subject: Re: [PATCH v1 04/22] vfio/common: Introduce
>>> vfio_container_add|del_section_window()
>>>
>>> Hello Zhenzhong,
>>>
>>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>>
>>>> Introduce helper functions that isolate the code used for
>>>> VFIO_SPAPR_TCE_v2_IOMMU. This code is IOMMU backend
>>>> specific, whereas the rest of the code in the callers, i.e.
>>>> vfio_listener_region_add|del, is not.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>    hw/vfio/common.c | 156 +++++++++++++++++++++++++++--------------------
>>>>    1 file changed, 89 insertions(+), 67 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 9ca695837f..67150e4575 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -796,6 +796,92 @@ static bool
>>> vfio_get_section_iova_range(VFIOContainer *container,
>>>>        return true;
>>>>    }
>>>>
>>>> +static int vfio_container_add_section_window(VFIOContainer *container,
>>>> +                                             MemoryRegionSection *section,
>>>> +                                             Error **errp)
>>>> +{
>>>> +    VFIOHostDMAWindow *hostwin;
>>>> +    hwaddr pgsize = 0;
>>>> +    int ret;
>>>> +
>>>> +    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>>>> +        return 0;
>>>> +    }
>>>
>>> This test makes me think that we should register a specific backend
>>> for the pseries machines, implementing the add/del_window handler,
>>> since others do not need it. Correct ?
>>
>> Yes, introducing a specific backend could help remove the above check.
>> But each backend has a VFIOIOMMUBackendOps, so we would need the same check
>> as above to select the ops.
>>
>>>
>>> It would avoid this ugly test. Let's keep that in mind when the
>>> backends are introduced.
>>>
>>>> +
>>>> +    /* For now intersections are not allowed, we may relax this later */
>>>> +    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>>>> +        if (ranges_overlap(hostwin->min_iova,
>>>> +                           hostwin->max_iova - hostwin->min_iova + 1,
>>>> +                           section->offset_within_address_space,
>>>> +                           int128_get64(section->size))) {
>>>> +            error_setg(errp,
>>>> +                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
>>>> +                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
>>>> +                section->offset_within_address_space,
>>>> +                section->offset_within_address_space +
>>>> +                    int128_get64(section->size) - 1,
>>>> +                hostwin->min_iova, hostwin->max_iova);
>>>> +            return -EINVAL;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    ret = vfio_spapr_create_window(container, section, &pgsize);
>>>> +    if (ret) {
>>>> +        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    vfio_host_win_add(container, section->offset_within_address_space,
>>>> +                      section->offset_within_address_space +
>>>> +                      int128_get64(section->size) - 1, pgsize);
>>>> +#ifdef CONFIG_KVM
>>>
>>> the ifdef test doesn't seem useful because the compiler should compile
>>> out the section below since, in that case, kvm_enabled() is defined as :
>>>
>>>    #define kvm_enabled()           (0)
>>
>> Looks so, I'll remove it in v2.
> 
> Forgot to let you know: in the end I failed to remove the ifdef test in v2 due to
> many "undeclared" compile errors. I guess the reason is that the compiler still
> checks the symbols in the dead branch before optimizing it away.
> 
> For example:
> error: ‘KVM_DEV_VFIO_GROUP’ undeclared
> error: ‘vfio_kvm_device_fd’ undeclared


Yes. It would need helpers to hide the kernel structs and defines.
Let's address it later, after the backends are introduced.
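
A minimal sketch of such a helper (the prototype is assumed from patch 05's
vfio_kvm_device_[add/del]_fd; the stub variant is the interesting part):

    #ifdef CONFIG_KVM
    int vfio_kvm_device_add_fd(int fd, Error **errp);
    int vfio_kvm_device_del_fd(int fd, Error **errp);
    #else
    static inline int vfio_kvm_device_add_fd(int fd, Error **errp)
    {
        return 0;   /* nothing to register without KVM support */
    }
    static inline int vfio_kvm_device_del_fd(int fd, Error **errp)
    {
        return 0;
    }
    #endif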

Thanks for looking into it.

C.



^ permalink raw reply	[flat|nested] 109+ messages in thread

end of thread, other threads:[~2023-09-27  6:51 UTC | newest]

Thread overview: 109+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-30 10:37 [PATCH v1 00/22] vfio: Adopt iommufd Zhenzhong Duan
2023-08-30 10:37 ` [PATCH v1 01/22] scripts/update-linux-headers: Add iommufd.h Zhenzhong Duan
2023-08-30 10:37 ` [PATCH v1 02/22] Update linux-header to support iommufd cdev and hwpt alloc Zhenzhong Duan
2023-09-14 14:46   ` Eric Auger
2023-09-15  3:02     ` Duan, Zhenzhong
2023-09-20 11:04       ` Eric Auger
2023-09-20 11:15         ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 03/22] vfio/common: Move IOMMU agnostic helpers to a separate file Zhenzhong Duan
2023-08-30 10:37 ` [PATCH v1 04/22] vfio/common: Introduce vfio_container_add|del_section_window() Zhenzhong Duan
2023-09-20 11:23   ` Eric Auger
2023-09-20 12:18     ` Duan, Zhenzhong
2023-09-21  8:28   ` Cédric Le Goater
2023-09-21 10:14     ` Duan, Zhenzhong
2023-09-21 10:55       ` Cédric Le Goater
2023-09-27  2:08       ` Duan, Zhenzhong
2023-09-27  6:50         ` Cédric Le Goater
2023-08-30 10:37 ` [PATCH v1 05/22] vfio/common: Extract out vfio_kvm_device_[add/del]_fd Zhenzhong Duan
2023-09-20 11:49   ` Eric Auger
2023-09-21  2:04     ` Duan, Zhenzhong
2023-09-21  8:42     ` Cédric Le Goater
2023-09-21 10:22       ` Duan, Zhenzhong
2023-09-21 10:53         ` Cédric Le Goater
2023-09-20 21:39   ` Alex Williamson
2023-09-21  6:03     ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 06/22] vfio/common: Add a vfio device iterator Zhenzhong Duan
2023-09-20 12:25   ` Eric Auger
2023-09-21  2:27     ` Duan, Zhenzhong
2023-09-20 22:16   ` Alex Williamson
2023-09-21  2:16     ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 07/22] vfio/common: Refactor vfio_viommu_preset() to be group agnostic Zhenzhong Duan
2023-09-20 13:00   ` Eric Auger
2023-09-21  2:52     ` Duan, Zhenzhong
2023-09-20 22:51   ` Alex Williamson
2023-09-21  6:13     ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 08/22] vfio/common: Move legacy VFIO backend code into separate container.c Zhenzhong Duan
2023-09-20 13:12   ` Eric Auger
2023-09-21  3:02     ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 09/22] vfio/container: Introduce vfio_[attach/detach]_device Zhenzhong Duan
2023-09-20 13:33   ` Eric Auger
2023-09-21  3:08     ` Duan, Zhenzhong
2023-09-21  9:44   ` Cédric Le Goater
2023-09-21 10:26     ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 10/22] vfio/platform: Use vfio_[attach/detach]_device Zhenzhong Duan
2023-09-21 12:17   ` Cédric Le Goater
2023-08-30 10:37 ` [PATCH v1 11/22] vfio/ap: " Zhenzhong Duan
2023-08-30 10:37 ` [PATCH v1 12/22] vfio/ccw: " Zhenzhong Duan
2023-09-21 12:19   ` Cédric Le Goater
2023-09-21 13:00     ` Duan, Zhenzhong
2023-09-21 13:24       ` Cédric Le Goater
2023-08-30 10:37 ` [PATCH v1 13/22] vfio: Add base container Zhenzhong Duan
2023-09-19 17:23   ` Cédric Le Goater
2023-09-20  8:48     ` Duan, Zhenzhong
2023-09-20 12:57       ` Cédric Le Goater
2023-09-20 13:58         ` Eric Auger
2023-09-21  2:51         ` Duan, Zhenzhong
2023-09-20 13:53     ` Eric Auger
2023-09-21  3:12       ` Duan, Zhenzhong
2023-09-20 17:31     ` Eric Auger
2023-09-21  3:35       ` Duan, Zhenzhong
2023-09-21  6:28         ` Eric Auger
2023-09-21 17:20         ` Eric Auger
2023-09-22  2:52           ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 14/22] vfio/common: Simplify vfio_viommu_preset() Zhenzhong Duan
2023-09-19 16:01   ` Cédric Le Goater
2023-09-20  2:59     ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 15/22] Add iommufd configure option Zhenzhong Duan
2023-09-19 17:07   ` Cédric Le Goater
2023-09-20  3:42     ` Duan, Zhenzhong
2023-09-20 12:19       ` Cédric Le Goater
2023-09-20 12:51         ` Jason Gunthorpe
2023-09-20 13:01           ` Daniel P. Berrangé
2023-09-20 13:07             ` Jason Gunthorpe
2023-09-20 13:02           ` Cédric Le Goater
2023-09-20 17:37             ` Eric Auger
2023-09-20 17:49               ` Jason Gunthorpe
2023-09-20 18:17                 ` Alex Williamson
2023-09-20 18:19                   ` Jason Gunthorpe
2023-09-21  3:43                     ` Duan, Zhenzhong
2023-09-26  6:05                     ` Tian, Kevin
2023-09-21  4:00             ` Duan, Zhenzhong
2023-09-21  2:11         ` Duan, Zhenzhong
2023-09-20 18:01       ` Alex Williamson
2023-09-20 18:12         ` Jason Gunthorpe
2023-09-20 20:29           ` Alex Williamson
2023-09-20 18:15         ` Daniel P. Berrangé
2023-08-30 10:37 ` [PATCH v1 16/22] backends/iommufd: Introduce the iommufd object Zhenzhong Duan
2023-09-22  7:15   ` Cédric Le Goater
2023-09-22  8:39     ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 17/22] util/char_dev: Add open_cdev() Zhenzhong Duan
2023-09-20 12:39   ` Daniel P. Berrangé
2023-09-20 12:53     ` Jason Gunthorpe
2023-09-20 12:56       ` Daniel P. Berrangé
2023-09-21  2:37     ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 18/22] vfio/iommufd: Implement the iommufd backend Zhenzhong Duan
2023-08-30 10:37 ` [PATCH v1 19/22] vfio/iommufd: Add vfio device iterator callback for iommufd Zhenzhong Duan
2023-08-30 10:37 ` [PATCH v1 20/22] vfio/pci: Adapt vfio pci hot reset support with iommufd BE Zhenzhong Duan
2023-08-30 10:37 ` [PATCH v1 21/22] vfio/pci: Allow the selection of a given iommu backend Zhenzhong Duan
2023-09-06 18:10   ` Jason Gunthorpe
2023-09-06 19:09     ` Alex Williamson
2023-09-07  1:10       ` Jason Gunthorpe
2023-09-07  2:27         ` Duan, Zhenzhong
2023-08-30 10:37 ` [PATCH v1 22/22] vfio/pci: Make vfio cdev pre-openable by passing a file handle Zhenzhong Duan
2023-09-14  9:04 ` [PATCH v1 00/22] vfio: Adopt iommufd Eric Auger
2023-09-14  9:27   ` Duan, Zhenzhong
2023-09-15 12:42 ` Cédric Le Goater
2023-09-15 13:14   ` Duan, Zhenzhong
2023-09-18 11:51   ` Jason Gunthorpe
2023-09-18 12:23     ` Cédric Le Goater
2023-09-18 17:56       ` Jason Gunthorpe
