* [PATCH v4 00/15] Prerequisite changes for IOMMUFD support
@ 2023-10-04 15:43 Eric Auger
  2023-10-04 15:43 ` [PATCH v4 01/15] scripts/update-linux-headers: Add iommufd.h Eric Auger
                   ` (15 more replies)
  0 siblings, 16 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

Hi All,

With the agreement of Zhenzhong, here is a v4 respin of the IOMMUFD
prerequisite series. This applies on top of vfio-next:
https://github.com/legoater/qemu/, branch vfio-next.

Per Cédric's suggestion, the IOMMUFD patchset v1[1] is now split
into two series, this prerequisite series and the new IOMMUFD backend
introduction support series. Hopefully this will ease the review.

The main purpose of this series is to make "common.c" group agnostic:
all group-related code is moved into container.c. This prepares for the
next series: abstracting a base container, adding a new backend, etc.

This series can be found at 
https://github.com/eauger/qemu/tree/prereq_v4

Tests done:
- PCI devices were tested
- device hotplug test
- with or without vIOMMU
- VFIO migration with an E800 net card (no dirty sync support) passed through
- platform and ccw were only compile-tested due to environment limitations

Zhenzhong, Yi, Eric

[1] https://lore.kernel.org/all/20230830103754.36461-1-zhenzhong.duan@intel.com/t/#u

Changelog:

v4:
- include qemu/error-report.h in helpers.c
- in ap.c, fix the wrongly added
  vfio_detach_device(vbasedev) and g_free(vbasedev->name);
  also added error_prepend
- simplified vbasedev setting in ccw.c
- vfio_detach_device: dropped check on
  !vbasedev->container
- container.c: restore dropped comment

v3:
- rebased on vfio-next as suggested by Cedric
- added vfio/common: Propagate KVM_SET_DEVICE_ATTR error if any
- collected Cedric's R-b
- Fix some error paths in vfio/pci which now properly detach the device
  and also free the vbasedev->name
- Fix vfio/ccw migration (hopefully) [Matthew inputs]
- Split [PATCH v2 11/12] vfio/common: Introduce two kinds of VFIO device lists
  into 3 patches

v2:
- Refine patch description per Eric
- return errno and errp in vfio_kvm_device_[add/del]_fd per Eric
- make memory listener register/deregister in a separate patch per Eric
- Include the .h file first per Cédric
- Add trace event in vfio_attach_device per Cédric
- drop the change to vfio_viommu_preset by refactor per Cédric
- Introduce global VFIO device list and per container list per Alex

Note: the changelog entries below are from the full IOMMUFD series:

v1:
- Alloc hwpt instead of using auto hwpt
- elaborate iommufd code per Nicolin
- consolidate two patches and drop as.c
- typo error fix and function rename

rfcv4:
- rebase on top of v8.0.3
- Add one patch from Yi which is about vfio device add in kvm
- Remove IOAS_COPY optimization and focus on functions in this patchset
- Fix wrong name issue reported and fix suggested by Matthew
- Fix compilation issue reported and fix suggested by Nicolin
- Use query_dirty_bitmap callback to replace get_dirty_bitmap for better
granularity
- Add dev_iter_next() callback to avoid adding so many callbacks
  at container scope, add VFIODevice.hwpt to support that
- Restore all functions back to common from container whenever possible,
  mainly migration and reset related functions
- Add --enable/disable-iommufd config option, enabled by default in linux
- Remove VFIODevice.hwpt_next as it's redundant with VFIODevice.next
- Adapt new VFIO_DEVICE_PCI_HOT_RESET uAPI for IOMMUFD backed device
- vfio_kvm_device_add/del_group call vfio_kvm_device_add/del_fd to remove
redundant code
- Add FD passing support for vfio device backed by IOMMUFD
- Fix hot unplug resource leak issue in vfio_legacy_detach_device()
- Fix FD leak in vfio_get_devicefd()

rfcv3:
- rebase on top of v7.2.0
- Fix the compilation with CONFIG_IOMMUFD unset by using true classes for
  VFIO backends
- Fix use after free in error path, reported by Alister
- Split common.c in several steps to ease the review

rfcv2:
- remove the first three patches of rfcv1
- add open cdev helper suggested by Jason
- remove the QOMification of the VFIOContainer and simply use standard ops
(David)
- add "-object iommufd" suggested by Alex

Thanks
Zhenzhong


Eric Auger (7):
  scripts/update-linux-headers: Add iommufd.h
  vfio/common: Propagate KVM_SET_DEVICE_ATTR error if any
  vfio/common: Introduce vfio_container_add|del_section_window()
  vfio/pci: Introduce vfio_[attach/detach]_device
  vfio/platform: Use vfio_[attach/detach]_device
  vfio/ap: Use vfio_[attach/detach]_device
  vfio/ccw: Use vfio_[attach/detach]_device

Yi Liu (2):
  vfio/common: Move IOMMU agnostic helpers to a separate file
  vfio/common: Move legacy VFIO backend code into separate container.c

Zhenzhong Duan (6):
  linux-headers: Add iommufd.h
  vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  vfio/common: Move VFIO reset handler registration to a group agnostic
    function
  vfio/common: Introduce a per container device list
  vfio/common: Store the parent container in VFIODevice
  vfio/common: Introduce a global VFIODevice list

 include/hw/vfio/vfio-common.h   |   60 +-
 linux-headers/linux/iommufd.h   |  444 ++++++++
 hw/vfio/ap.c                    |   69 +-
 hw/vfio/ccw.c                   |  122 +-
 hw/vfio/common.c                | 1851 ++-----------------------------
 hw/vfio/container.c             | 1157 +++++++++++++++++++
 hw/vfio/helpers.c               |  612 ++++++++++
 hw/vfio/pci.c                   |   67 +-
 hw/vfio/platform.c              |   43 +-
 hw/vfio/meson.build             |    2 +
 hw/vfio/trace-events            |    4 +-
 scripts/update-linux-headers.sh |    3 +-
 12 files changed, 2458 insertions(+), 1976 deletions(-)
 create mode 100644 linux-headers/linux/iommufd.h
 create mode 100644 hw/vfio/container.c
 create mode 100644 hw/vfio/helpers.c

-- 
2.41.0




* [PATCH v4 01/15] scripts/update-linux-headers: Add iommufd.h
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 02/15] linux-headers: " Eric Auger
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

Update the script to import iommufd.h

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 scripts/update-linux-headers.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 35a64bb501..34295c0fe5 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -161,7 +161,8 @@ done
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
 for header in const.h stddef.h kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \
-              psci.h psp-sev.h userfaultfd.h memfd.h mman.h nvme_ioctl.h vduse.h; do
+              psci.h psp-sev.h userfaultfd.h memfd.h mman.h nvme_ioctl.h \
+              vduse.h iommufd.h; do
     cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
 
-- 
2.41.0




* [PATCH v4 02/15] linux-headers: Add iommufd.h
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
  2023-10-04 15:43 ` [PATCH v4 01/15] scripts/update-linux-headers: Add iommufd.h Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 03/15] vfio/common: Move IOMMU agnostic helpers to a separate file Eric Auger
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

From: Zhenzhong Duan <zhenzhong.duan@intel.com>

Since commit da3c22c74a3c ("linux-headers: Update to Linux v6.6-rc1"),
linux-headers has been updated to v6.6-rc1.

As the previous patch added iommufd.h to update-linux-headers.sh,
run the script again against tag v6.6-rc1 to have iommufd.h included.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 linux-headers/linux/iommufd.h | 444 ++++++++++++++++++++++++++++++++++
 1 file changed, 444 insertions(+)
 create mode 100644 linux-headers/linux/iommufd.h
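
Side note for reviewers, not part of the patch: the "General ioctl format"
comment in the imported header says each structure carries its own size in
the first u32 and that the kernel rejects non-zero bytes it does not
understand with E2BIG. A minimal userspace sketch of that rule follows. The
demo_* names and both structure layouts are hypothetical, and the -EINVAL
return for an undersized structure is an assumption of this sketch, not
something the header states.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "old" layout that the mock kernel side understands. */
struct demo_cmd_v1 {
    uint32_t size;
    uint32_t flags;
};

/* Hypothetical "new" layout that a newer userspace might pass. */
struct demo_cmd_v2 {
    uint32_t size;
    uint32_t flags;
    uint32_t extra;     /* unknown to the v1 side */
    uint32_t reserved;
};

/*
 * Mock of the size check described in the header comment: a structure
 * larger than the known layout is accepted only if every byte past the
 * known part is zero, otherwise the call fails with -E2BIG.
 * Returning -EINVAL for a too-small structure is an assumption here.
 */
static int demo_check_size(const void *arg, size_t known)
{
    const unsigned char *p = arg;
    uint32_t size;

    /* The size field is always the first u32 of the structure. */
    memcpy(&size, p, sizeof(size));
    if (size < known) {
        return -EINVAL;
    }
    for (size_t i = known; i < size; i++) {
        if (p[i] != 0) {
            return -E2BIG;
        }
    }
    return 0;
}
```

With a zero-initialised struct demo_cmd_v2 whose size field is set to
sizeof(struct demo_cmd_v2), demo_check_size() against the v1 layout returns
0; setting the extra field to a non-zero value makes it return -E2BIG. This
is what lets userspace consistently use the newer, larger structures against
an older kernel.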

diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
new file mode 100644
index 0000000000..218bf7ac98
--- /dev/null
+++ b/linux-headers/linux/iommufd.h
@@ -0,0 +1,444 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _IOMMUFD_H
+#define _IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOMMUFD_TYPE (';')
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl interface follows a general format to allow for extensibility. Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger, structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has
+ *    non-zero in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *    understood, however a known field has a value the kernel does not
+ *    understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *    correct.
+ *  - ENOENT: An ID or IOVA provided does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Mathematics overflowed.
+ *
+ * As well as additional errnos, within specific ioctls.
+ */
+enum {
+	IOMMUFD_CMD_BASE = 0x80,
+	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_ALLOW_IOVAS,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_UNMAP,
+	IOMMUFD_CMD_OPTION,
+	IOMMUFD_CMD_VFIO_IOAS,
+	IOMMUFD_CMD_HWPT_ALLOC,
+	IOMMUFD_CMD_GET_HW_INFO,
+};
+
+/**
+ * struct iommu_destroy - ioctl(IOMMU_DESTROY)
+ * @size: sizeof(struct iommu_destroy)
+ * @id: iommufd object ID to destroy. Can be any destroyable object type.
+ *
+ * Destroy any object held within iommufd.
+ */
+struct iommu_destroy {
+	__u32 size;
+	__u32 id;
+};
+#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
+
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_iova_range - ioctl(IOMMU_IOVA_RANGE)
+ * @start: First IOVA
+ * @last: Inclusive last IOVA
+ *
+ * An interval in IOVA space.
+ */
+struct iommu_iova_range {
+	__aligned_u64 start;
+	__aligned_u64 last;
+};
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @num_iovas: Input/Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @allowed_iovas: Pointer to the output array of struct iommu_iova_range
+ * @out_iova_alignment: Minimum alignment required for mapping IOVA
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
+ * is not allowed. num_iovas will be set to the total number of iovas and
+ * the allowed_iovas[] will be filled in as space permits.
+ *
+ * The allowed ranges are dependent on the HW path the DMA operation takes, and
+ * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
+ * full range, and each attached device will narrow the ranges based on that
+ * device's HW restrictions. Detaching a device can widen the ranges. Userspace
+ * should query ranges after every attach/detach to know what IOVAs are valid
+ * for mapping.
+ *
+ * On input num_iovas is the length of the allowed_iovas array. On output it is
+ * the total number of iovas filled in. The ioctl will return -EMSGSIZE and set
+ * num_iovas to the required value if num_iovas is too small. In this case the
+ * caller should allocate a larger output array and re-issue the ioctl.
+ *
+ * out_iova_alignment returns the minimum IOVA alignment that can be given
+ * to IOMMU_IOAS_MAP/COPY. IOVA's must satisfy::
+ *
+ *   starting_iova % out_iova_alignment == 0
+ *   (starting_iova + length) % out_iova_alignment == 0
+ *
+ * out_iova_alignment can be 1 indicating any IOVA is allowed. It cannot
+ * be higher than the system PAGE_SIZE.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 allowed_iovas;
+	__aligned_u64 out_iova_alignment;
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * struct iommu_ioas_allow_iovas - ioctl(IOMMU_IOAS_ALLOW_IOVAS)
+ * @size: sizeof(struct iommu_ioas_allow_iovas)
+ * @ioas_id: IOAS ID to allow IOVAs from
+ * @num_iovas: Input/Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @allowed_iovas: Pointer to array of struct iommu_iova_range
+ *
+ * Ensure a range of IOVAs are always available for allocation. If this call
+ * succeeds then IOMMU_IOAS_IOVA_RANGES will never return a list of IOVA ranges
+ * that are narrower than the ranges provided here. This call will fail if
+ * IOMMU_IOAS_IOVA_RANGES is currently narrower than the given ranges.
+ *
+ * When an IOAS is first created the IOVA_RANGES will be maximally sized, and as
+ * devices are attached the IOVA will narrow based on the device restrictions.
+ * When an allowed range is specified any narrowing will be refused, ie device
+ * attachment can fail if the device requires limiting within the allowed range.
+ *
+ * Automatic IOVA allocation is also impacted by this call. MAP will only
+ * allocate within the allowed IOVAs if they are present.
+ *
+ * This call replaces the entire allowed list with the given list.
+ */
+struct iommu_ioas_allow_iovas {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 allowed_iovas;
+};
+#define IOMMU_IOAS_ALLOW_IOVAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOW_IOVAS)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location based on
+ * the reserved and allowed lists will be automatically selected and returned in
+ * iova.
+ *
+ * If IOMMU_IOAS_MAP_FIXED_IOVA is specified then the iova range must currently
+ * be unused, existing IOVA cannot be replaced.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ *
+ * This may be used to efficiently clone a subset of an IOAS to another, or as a
+ * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
+ * establishing equivalent new mappings, as internal resources are shared, and
+ * the kernel will pin the user memory only once.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap, and return back the bytes unmapped
+ *
+ * Unmap an IOVA range. The iova/length must be a superset of a previously
+ * mapped range used with IOMMU_IOAS_MAP or IOMMU_IOAS_COPY. Splitting or
+ * truncating ranges is not allowed. The values 0 to U64_MAX will unmap
+ * everything.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
+
+/**
+ * enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
+ *                       ioctl(IOMMU_OPTION_HUGE_PAGES)
+ * @IOMMU_OPTION_RLIMIT_MODE:
+ *    Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
+ *    to invoke this. Value 0 (default) is user based accouting, 1 uses process
+ *    based accounting. Global option, object_id must be 0
+ * @IOMMU_OPTION_HUGE_PAGES:
+ *    Value 1 (default) allows contiguous pages to be combined when generating
+ *    iommu mappings. Value 0 disables combining, everything is mapped to
+ *    PAGE_SIZE. This can be useful for benchmarking.  This is a per-IOAS
+ *    option, the object_id must be the IOAS ID.
+ */
+enum iommufd_option {
+	IOMMU_OPTION_RLIMIT_MODE = 0,
+	IOMMU_OPTION_HUGE_PAGES = 1,
+};
+
+/**
+ * enum iommufd_option_ops - ioctl(IOMMU_OPTION_OP_SET) and
+ *                           ioctl(IOMMU_OPTION_OP_GET)
+ * @IOMMU_OPTION_OP_SET: Set the option's value
+ * @IOMMU_OPTION_OP_GET: Get the option's value
+ */
+enum iommufd_option_ops {
+	IOMMU_OPTION_OP_SET = 0,
+	IOMMU_OPTION_OP_GET = 1,
+};
+
+/**
+ * struct iommu_option - iommu option multiplexer
+ * @size: sizeof(struct iommu_option)
+ * @option_id: One of enum iommufd_option
+ * @op: One of enum iommufd_option_ops
+ * @__reserved: Must be 0
+ * @object_id: ID of the object if required
+ * @val64: Option value to set or value returned on get
+ *
+ * Change a simple option value. This multiplexor allows controlling options
+ * on objects. IOMMU_OPTION_OP_SET will load an option and IOMMU_OPTION_OP_GET
+ * will return the current value.
+ */
+struct iommu_option {
+	__u32 size;
+	__u32 option_id;
+	__u16 op;
+	__u16 __reserved;
+	__u32 object_id;
+	__aligned_u64 val64;
+};
+#define IOMMU_OPTION _IO(IOMMUFD_TYPE, IOMMUFD_CMD_OPTION)
+
+/**
+ * enum iommufd_vfio_ioas_op - IOMMU_VFIO_IOAS_* ioctls
+ * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
+ */
+enum iommufd_vfio_ioas_op {
+	IOMMU_VFIO_IOAS_GET = 0,
+	IOMMU_VFIO_IOAS_SET = 1,
+	IOMMU_VFIO_IOAS_CLEAR = 2,
+};
+
+/**
+ * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
+ * @size: sizeof(struct iommu_vfio_ioas)
+ * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
+ *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
+ * @op: One of enum iommufd_vfio_ioas_op
+ * @__reserved: Must be 0
+ *
+ * The VFIO compatibility support uses a single ioas because VFIO APIs do not
+ * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
+ * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
+ * compatibility ioas, either by taking what is already set, or auto creating
+ * one. From then on VFIO will continue to use that ioas and is not effected by
+ * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
+ */
+struct iommu_vfio_ioas {
+	__u32 size;
+	__u32 ioas_id;
+	__u16 op;
+	__u16 __reserved;
+};
+#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+
+/**
+ * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
+ * @size: sizeof(struct iommu_hwpt_alloc)
+ * @flags: Must be 0
+ * @dev_id: The device to allocate this HWPT for
+ * @pt_id: The IOAS to connect this HWPT to
+ * @out_hwpt_id: The ID of the new HWPT
+ * @__reserved: Must be 0
+ *
+ * Explicitly allocate a hardware page table object. This is the same object
+ * type that is returned by iommufd_device_attach() and represents the
+ * underlying iommu driver's iommu_domain kernel object.
+ *
+ * A HWPT will be created with the IOVA mappings from the given IOAS.
+ */
+struct iommu_hwpt_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 dev_id;
+	__u32 pt_id;
+	__u32 out_hwpt_id;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
+
+/**
+ * struct iommu_hw_info_vtd - Intel VT-d hardware information
+ *
+ * @flags: Must be 0
+ * @__reserved: Must be 0
+ *
+ * @cap_reg: Value of Intel VT-d capability register defined in VT-d spec
+ *           section 11.4.2 Capability Register.
+ * @ecap_reg: Value of Intel VT-d capability register defined in VT-d spec
+ *            section 11.4.3 Extended Capability Register.
+ *
+ * User needs to understand the Intel VT-d specification to decode the
+ * register value.
+ */
+struct iommu_hw_info_vtd {
+	__u32 flags;
+	__u32 __reserved;
+	__aligned_u64 cap_reg;
+	__aligned_u64 ecap_reg;
+};
+
+/**
+ * enum iommu_hw_info_type - IOMMU Hardware Info Types
+ * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
+ *                           info
+ * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ */
+enum iommu_hw_info_type {
+	IOMMU_HW_INFO_TYPE_NONE,
+	IOMMU_HW_INFO_TYPE_INTEL_VTD,
+};
+
+/**
+ * struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
+ * @size: sizeof(struct iommu_hw_info)
+ * @flags: Must be 0
+ * @dev_id: The device bound to the iommufd
+ * @data_len: Input the length of a user buffer in bytes. Output the length of
+ *            data that kernel supports
+ * @data_uptr: User pointer to a user-space buffer used by the kernel to fill
+ *             the iommu type specific hardware information data
+ * @out_data_type: Output the iommu hardware info type as defined in the enum
+ *                 iommu_hw_info_type.
+ * @__reserved: Must be 0
+ *
+ * Query an iommu type specific hardware information data from an iommu behind
+ * a given device that has been bound to iommufd. This hardware info data will
+ * be used to sync capabilities between the virtual iommu and the physical
+ * iommu, e.g. a nested translation setup needs to check the hardware info, so
+ * a guest stage-1 page table can be compatible with the physical iommu.
+ *
+ * To capture an iommu type specific hardware information data, @data_uptr and
+ * its length @data_len must be provided. Trailing bytes will be zeroed if the
+ * user buffer is larger than the data that kernel has. Otherwise, kernel only
+ * fills the buffer using the given length in @data_len. If the ioctl succeeds,
+ * @data_len will be updated to the length that kernel actually supports,
+ * @out_data_type will be filled to decode the data filled in the buffer
+ * pointed by @data_uptr. Input @data_len == zero is allowed.
+ */
+struct iommu_hw_info {
+	__u32 size;
+	__u32 flags;
+	__u32 dev_id;
+	__u32 data_len;
+	__aligned_u64 data_uptr;
+	__u32 out_data_type;
+	__u32 __reserved;
+};
+#define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
+#endif
-- 
2.41.0
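
Side note, not part of the patch: the IOMMU_IOAS_IOVA_RANGES documentation
in the header above requires both ends of a mapping to be multiples of
out_iova_alignment. A small sketch of that check, written as a plain helper
that needs no kernel headers. The name demo_iova_range_aligned is
hypothetical, and rejecting a range whose end wraps past 64 bits is a choice
made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check the two conditions quoted in the header comment:
 *   starting_iova % out_iova_alignment == 0
 *   (starting_iova + length) % out_iova_alignment == 0
 * An alignment of 1 accepts any IOVA. An alignment of 0 and a range
 * whose end overflows 64 bits are treated as invalid in this sketch.
 */
static bool demo_iova_range_aligned(uint64_t iova, uint64_t length,
                                    uint64_t alignment)
{
    if (alignment == 0) {
        return false;
    }
    if (length > UINT64_MAX - iova) {
        return false;   /* iova + length would wrap */
    }
    return iova % alignment == 0 && (iova + length) % alignment == 0;
}
```

For example, with an alignment of 0x1000 the range (0x1000, 0x2000) passes
while (0x1000, 0x800) does not, since its end lands at 0x1800. A caller
would typically run such a check on values it intends to pass to
IOMMU_IOAS_MAP or IOMMU_IOAS_COPY, after querying the alignment.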




* [PATCH v4 03/15] vfio/common: Move IOMMU agnostic helpers to a separate file
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
  2023-10-04 15:43 ` [PATCH v4 01/15] scripts/update-linux-headers: Add iommufd.h Eric Auger
  2023-10-04 15:43 ` [PATCH v4 02/15] linux-headers: " Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 04/15] vfio/common: Propagate KVM_SET_DEVICE_ATTR error if any Eric Auger
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

From: Yi Liu <yi.l.liu@intel.com>

Move low-level IOMMU agnostic helpers to a separate helpers.c
file. They relate to regions, interrupts, device/region
capabilities, etc.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>

---
v3 -> v4:
- added #include "qemu/error-report.h"
---
 include/hw/vfio/vfio-common.h |   9 +
 hw/vfio/common.c              | 588 --------------------------------
 hw/vfio/helpers.c             | 612 ++++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 4 files changed, 622 insertions(+), 588 deletions(-)
 create mode 100644 hw/vfio/helpers.c
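
Side note, not part of the patch: vfio_region_write(), one of the helpers
moved here, converts the value to little endian according to the access size
before handing the buffer to pwrite(), because VFIO is always little endian.
The sketch below shows the same idea in portable C without QEMU's
cpu_to_le16/32/64 helpers. The name demo_store_le is hypothetical, and
returning 0 for an unsupported size (where the QEMU code calls hw_error())
is a simplification for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Store the low 'size' bytes of 'data' into 'buf' in little-endian order,
 * independent of host endianness. Only the access sizes handled by the
 * region accessors (1, 2, 4 and 8 bytes) are accepted.
 * Returns the number of bytes written, or 0 for an unsupported size.
 */
static size_t demo_store_le(uint8_t *buf, uint64_t data, unsigned size)
{
    if (size != 1 && size != 2 && size != 4 && size != 8) {
        return 0;
    }
    for (unsigned i = 0; i < size; i++) {
        /* Least significant byte first. */
        buf[i] = (uint8_t)(data >> (8 * i));
    }
    return size;
}
```

Storing 0x11223344 with size 4 yields the bytes 44 33 22 11 on any host,
which is the layout the device expects.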

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e9b8954595..e0483893d1 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -196,6 +196,12 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
+typedef struct {
+    unsigned long *bitmap;
+    hwaddr size;
+    hwaddr pages;
+} VFIOBitmap;
+
 void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
@@ -245,6 +251,8 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
                              unsigned int *avail);
 struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
+struct vfio_info_cap_header *
+vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
@@ -257,4 +265,5 @@ int vfio_spapr_remove_window(VFIOContainer *container,
 bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
 void vfio_migration_exit(VFIODevice *vbasedev);
 
+int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size);
 #endif /* HW_VFIO_VFIO_COMMON_H */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 134649226d..4e122fc4e4 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -62,84 +62,6 @@ static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
 static int vfio_kvm_device_fd = -1;
 #endif
 
-/*
- * Common VFIO interrupt disable
- */
-void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
-        .index = index,
-        .start = 0,
-        .count = 0,
-    };
-
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
-
-void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
-        .index = index,
-        .start = 0,
-        .count = 1,
-    };
-
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
-
-void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
-{
-    struct vfio_irq_set irq_set = {
-        .argsz = sizeof(irq_set),
-        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
-        .index = index,
-        .start = 0,
-        .count = 1,
-    };
-
-    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
-}
-
-static inline const char *action_to_str(int action)
-{
-    switch (action) {
-    case VFIO_IRQ_SET_ACTION_MASK:
-        return "MASK";
-    case VFIO_IRQ_SET_ACTION_UNMASK:
-        return "UNMASK";
-    case VFIO_IRQ_SET_ACTION_TRIGGER:
-        return "TRIGGER";
-    default:
-        return "UNKNOWN ACTION";
-    }
-}
-
-static const char *index_to_str(VFIODevice *vbasedev, int index)
-{
-    if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
-        return NULL;
-    }
-
-    switch (index) {
-    case VFIO_PCI_INTX_IRQ_INDEX:
-        return "INTX";
-    case VFIO_PCI_MSI_IRQ_INDEX:
-        return "MSI";
-    case VFIO_PCI_MSIX_IRQ_INDEX:
-        return "MSIX";
-    case VFIO_PCI_ERR_IRQ_INDEX:
-        return "ERR";
-    case VFIO_PCI_REQ_IRQ_INDEX:
-        return "REQ";
-    default:
-        return NULL;
-    }
-}
-
 static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
 {
     switch (container->iommu_type) {
@@ -163,183 +85,10 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
     }
 }
 
-int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
-                           int action, int fd, Error **errp)
-{
-    struct vfio_irq_set *irq_set;
-    int argsz, ret = 0;
-    const char *name;
-    int32_t *pfd;
-
-    argsz = sizeof(*irq_set) + sizeof(*pfd);
-
-    irq_set = g_malloc0(argsz);
-    irq_set->argsz = argsz;
-    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | action;
-    irq_set->index = index;
-    irq_set->start = subindex;
-    irq_set->count = 1;
-    pfd = (int32_t *)&irq_set->data;
-    *pfd = fd;
-
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
-        ret = -errno;
-    }
-    g_free(irq_set);
-
-    if (!ret) {
-        return 0;
-    }
-
-    error_setg_errno(errp, -ret, "VFIO_DEVICE_SET_IRQS failure");
-
-    name = index_to_str(vbasedev, index);
-    if (name) {
-        error_prepend(errp, "%s-%d: ", name, subindex);
-    } else {
-        error_prepend(errp, "index %d-%d: ", index, subindex);
-    }
-    error_prepend(errp,
-                  "Failed to %s %s eventfd signaling for interrupt ",
-                  fd < 0 ? "tear down" : "set up", action_to_str(action));
-    return ret;
-}
-
-/*
- * IO Port/MMIO - Beware of the endians, VFIO is always little endian
- */
-void vfio_region_write(void *opaque, hwaddr addr,
-                       uint64_t data, unsigned size)
-{
-    VFIORegion *region = opaque;
-    VFIODevice *vbasedev = region->vbasedev;
-    union {
-        uint8_t byte;
-        uint16_t word;
-        uint32_t dword;
-        uint64_t qword;
-    } buf;
-
-    switch (size) {
-    case 1:
-        buf.byte = data;
-        break;
-    case 2:
-        buf.word = cpu_to_le16(data);
-        break;
-    case 4:
-        buf.dword = cpu_to_le32(data);
-        break;
-    case 8:
-        buf.qword = cpu_to_le64(data);
-        break;
-    default:
-        hw_error("vfio: unsupported write size, %u bytes", size);
-        break;
-    }
-
-    if (pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
-        error_report("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
-                     ",%d) failed: %m",
-                     __func__, vbasedev->name, region->nr,
-                     addr, data, size);
-    }
-
-    trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
-
-    /*
-     * A read or write to a BAR always signals an INTx EOI.  This will
-     * do nothing if not pending (including not in INTx mode).  We assume
-     * that a BAR access is in response to an interrupt and that BAR
-     * accesses will service the interrupt.  Unfortunately, we don't know
-     * which access will service the interrupt, so we're potentially
-     * getting quite a few host interrupts per guest interrupt.
-     */
-    vbasedev->ops->vfio_eoi(vbasedev);
-}
-
-uint64_t vfio_region_read(void *opaque,
-                          hwaddr addr, unsigned size)
-{
-    VFIORegion *region = opaque;
-    VFIODevice *vbasedev = region->vbasedev;
-    union {
-        uint8_t byte;
-        uint16_t word;
-        uint32_t dword;
-        uint64_t qword;
-    } buf;
-    uint64_t data = 0;
-
-    if (pread(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
-        error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %m",
-                     __func__, vbasedev->name, region->nr,
-                     addr, size);
-        return (uint64_t)-1;
-    }
-    switch (size) {
-    case 1:
-        data = buf.byte;
-        break;
-    case 2:
-        data = le16_to_cpu(buf.word);
-        break;
-    case 4:
-        data = le32_to_cpu(buf.dword);
-        break;
-    case 8:
-        data = le64_to_cpu(buf.qword);
-        break;
-    default:
-        hw_error("vfio: unsupported read size, %u bytes", size);
-        break;
-    }
-
-    trace_vfio_region_read(vbasedev->name, region->nr, addr, size, data);
-
-    /* Same as write above */
-    vbasedev->ops->vfio_eoi(vbasedev);
-
-    return data;
-}
-
-const MemoryRegionOps vfio_region_ops = {
-    .read = vfio_region_read,
-    .write = vfio_region_write,
-    .endianness = DEVICE_LITTLE_ENDIAN,
-    .valid = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-    },
-    .impl = {
-        .min_access_size = 1,
-        .max_access_size = 8,
-    },
-};
-
 /*
  * Device state interfaces
  */
 
-typedef struct {
-    unsigned long *bitmap;
-    hwaddr size;
-    hwaddr pages;
-} VFIOBitmap;
-
-static int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size)
-{
-    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
-    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
-                                         BITS_PER_BYTE;
-    vbmap->bitmap = g_try_malloc0(vbmap->size);
-    if (!vbmap->bitmap) {
-        return -ENOMEM;
-    }
-
-    return 0;
-}
-
 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
                                  uint64_t size, ram_addr_t ram_addr);
 
@@ -1994,30 +1743,6 @@ static void vfio_listener_release(VFIOContainer *container)
     }
 }
 
-static struct vfio_info_cap_header *
-vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
-{
-    struct vfio_info_cap_header *hdr;
-
-    for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
-        if (hdr->id == id) {
-            return hdr;
-        }
-    }
-
-    return NULL;
-}
-
-struct vfio_info_cap_header *
-vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
 static struct vfio_info_cap_header *
 vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
 {
@@ -2028,16 +1753,6 @@ vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
     return vfio_get_cap((void *)info, info->cap_offset, id);
 }
 
-struct vfio_info_cap_header *
-vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
 bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
                              unsigned int *avail)
 {
@@ -2059,232 +1774,6 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
     return true;
 }
 
-static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
-                                          struct vfio_region_info *info)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_region_info_cap_sparse_mmap *sparse;
-    int i, j;
-
-    hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
-    if (!hdr) {
-        return -ENODEV;
-    }
-
-    sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header);
-
-    trace_vfio_region_sparse_mmap_header(region->vbasedev->name,
-                                         region->nr, sparse->nr_areas);
-
-    region->mmaps = g_new0(VFIOMmap, sparse->nr_areas);
-
-    for (i = 0, j = 0; i < sparse->nr_areas; i++) {
-        if (sparse->areas[i].size) {
-            trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
-                                            sparse->areas[i].offset +
-                                            sparse->areas[i].size - 1);
-            region->mmaps[j].offset = sparse->areas[i].offset;
-            region->mmaps[j].size = sparse->areas[i].size;
-            j++;
-        }
-    }
-
-    region->nr_mmaps = j;
-    region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap));
-
-    return 0;
-}
-
-int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
-                      int index, const char *name)
-{
-    struct vfio_region_info *info;
-    int ret;
-
-    ret = vfio_get_region_info(vbasedev, index, &info);
-    if (ret) {
-        return ret;
-    }
-
-    region->vbasedev = vbasedev;
-    region->flags = info->flags;
-    region->size = info->size;
-    region->fd_offset = info->offset;
-    region->nr = index;
-
-    if (region->size) {
-        region->mem = g_new0(MemoryRegion, 1);
-        memory_region_init_io(region->mem, obj, &vfio_region_ops,
-                              region, name, region->size);
-
-        if (!vbasedev->no_mmap &&
-            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
-
-            ret = vfio_setup_region_sparse_mmaps(region, info);
-
-            if (ret) {
-                region->nr_mmaps = 1;
-                region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
-                region->mmaps[0].offset = 0;
-                region->mmaps[0].size = region->size;
-            }
-        }
-    }
-
-    g_free(info);
-
-    trace_vfio_region_setup(vbasedev->name, index, name,
-                            region->flags, region->fd_offset, region->size);
-    return 0;
-}
-
-static void vfio_subregion_unmap(VFIORegion *region, int index)
-{
-    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
-                            region->mmaps[index].offset,
-                            region->mmaps[index].offset +
-                            region->mmaps[index].size - 1);
-    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
-    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
-    object_unparent(OBJECT(&region->mmaps[index].mem));
-    region->mmaps[index].mmap = NULL;
-}
-
-int vfio_region_mmap(VFIORegion *region)
-{
-    int i, prot = 0;
-    char *name;
-
-    if (!region->mem) {
-        return 0;
-    }
-
-    prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
-    prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
-                                     MAP_SHARED, region->vbasedev->fd,
-                                     region->fd_offset +
-                                     region->mmaps[i].offset);
-        if (region->mmaps[i].mmap == MAP_FAILED) {
-            int ret = -errno;
-
-            trace_vfio_region_mmap_fault(memory_region_name(region->mem), i,
-                                         region->fd_offset +
-                                         region->mmaps[i].offset,
-                                         region->fd_offset +
-                                         region->mmaps[i].offset +
-                                         region->mmaps[i].size - 1, ret);
-
-            region->mmaps[i].mmap = NULL;
-
-            for (i--; i >= 0; i--) {
-                vfio_subregion_unmap(region, i);
-            }
-
-            return ret;
-        }
-
-        name = g_strdup_printf("%s mmaps[%d]",
-                               memory_region_name(region->mem), i);
-        memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
-                                          memory_region_owner(region->mem),
-                                          name, region->mmaps[i].size,
-                                          region->mmaps[i].mmap);
-        g_free(name);
-        memory_region_add_subregion(region->mem, region->mmaps[i].offset,
-                                    &region->mmaps[i].mem);
-
-        trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
-                               region->mmaps[i].offset,
-                               region->mmaps[i].offset +
-                               region->mmaps[i].size - 1);
-    }
-
-    return 0;
-}
-
-void vfio_region_unmap(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            vfio_subregion_unmap(region, i);
-        }
-    }
-}
-
-void vfio_region_exit(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
-        }
-    }
-
-    trace_vfio_region_exit(region->vbasedev->name, region->nr);
-}
-
-void vfio_region_finalize(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            munmap(region->mmaps[i].mmap, region->mmaps[i].size);
-            object_unparent(OBJECT(&region->mmaps[i].mem));
-        }
-    }
-
-    object_unparent(OBJECT(region->mem));
-
-    g_free(region->mem);
-    g_free(region->mmaps);
-
-    trace_vfio_region_finalize(region->vbasedev->name, region->nr);
-
-    region->mem = NULL;
-    region->mmaps = NULL;
-    region->nr_mmaps = 0;
-    region->size = 0;
-    region->flags = 0;
-    region->nr = 0;
-}
-
-void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            memory_region_set_enabled(&region->mmaps[i].mem, enabled);
-        }
-    }
-
-    trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem),
-                                        enabled);
-}
-
 void vfio_reset_handler(void *opaque)
 {
     VFIOGroup *group;
@@ -2983,83 +2472,6 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     close(vbasedev->fd);
 }
 
-int vfio_get_region_info(VFIODevice *vbasedev, int index,
-                         struct vfio_region_info **info)
-{
-    size_t argsz = sizeof(struct vfio_region_info);
-
-    *info = g_malloc0(argsz);
-
-    (*info)->index = index;
-retry:
-    (*info)->argsz = argsz;
-
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
-        g_free(*info);
-        *info = NULL;
-        return -errno;
-    }
-
-    if ((*info)->argsz > argsz) {
-        argsz = (*info)->argsz;
-        *info = g_realloc(*info, argsz);
-
-        goto retry;
-    }
-
-    return 0;
-}
-
-int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
-                             uint32_t subtype, struct vfio_region_info **info)
-{
-    int i;
-
-    for (i = 0; i < vbasedev->num_regions; i++) {
-        struct vfio_info_cap_header *hdr;
-        struct vfio_region_info_cap_type *cap_type;
-
-        if (vfio_get_region_info(vbasedev, i, info)) {
-            continue;
-        }
-
-        hdr = vfio_get_region_info_cap(*info, VFIO_REGION_INFO_CAP_TYPE);
-        if (!hdr) {
-            g_free(*info);
-            continue;
-        }
-
-        cap_type = container_of(hdr, struct vfio_region_info_cap_type, header);
-
-        trace_vfio_get_dev_region(vbasedev->name, i,
-                                  cap_type->type, cap_type->subtype);
-
-        if (cap_type->type == type && cap_type->subtype == subtype) {
-            return 0;
-        }
-
-        g_free(*info);
-    }
-
-    *info = NULL;
-    return -ENODEV;
-}
-
-bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
-{
-    struct vfio_region_info *info = NULL;
-    bool ret = false;
-
-    if (!vfio_get_region_info(vbasedev, region, &info)) {
-        if (vfio_get_region_info_cap(info, cap_type)) {
-            ret = true;
-        }
-        g_free(info);
-    }
-
-    return ret;
-}
-
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
new file mode 100644
index 0000000000..7e5da21b31
--- /dev/null
+++ b/hw/vfio/helpers.c
@@ -0,0 +1,612 @@
+/*
+ * low level and IOMMU backend agnostic helpers used by VFIO devices,
+ * related to regions, interrupts, capabilities
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "hw/hw.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+
+/*
+ * Common VFIO interrupt disable
+ */
+void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+        .index = index,
+        .start = 0,
+        .count = 0,
+    };
+
+    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
+        .index = index,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
+        .index = index,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+static inline const char *action_to_str(int action)
+{
+    switch (action) {
+    case VFIO_IRQ_SET_ACTION_MASK:
+        return "MASK";
+    case VFIO_IRQ_SET_ACTION_UNMASK:
+        return "UNMASK";
+    case VFIO_IRQ_SET_ACTION_TRIGGER:
+        return "TRIGGER";
+    default:
+        return "UNKNOWN ACTION";
+    }
+}
+
+static const char *index_to_str(VFIODevice *vbasedev, int index)
+{
+    if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
+        return NULL;
+    }
+
+    switch (index) {
+    case VFIO_PCI_INTX_IRQ_INDEX:
+        return "INTX";
+    case VFIO_PCI_MSI_IRQ_INDEX:
+        return "MSI";
+    case VFIO_PCI_MSIX_IRQ_INDEX:
+        return "MSIX";
+    case VFIO_PCI_ERR_IRQ_INDEX:
+        return "ERR";
+    case VFIO_PCI_REQ_IRQ_INDEX:
+        return "REQ";
+    default:
+        return NULL;
+    }
+}
+
+int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
+                           int action, int fd, Error **errp)
+{
+    struct vfio_irq_set *irq_set;
+    int argsz, ret = 0;
+    const char *name;
+    int32_t *pfd;
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | action;
+    irq_set->index = index;
+    irq_set->start = subindex;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+    *pfd = fd;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
+        ret = -errno;
+    }
+    g_free(irq_set);
+
+    if (!ret) {
+        return 0;
+    }
+
+    error_setg_errno(errp, -ret, "VFIO_DEVICE_SET_IRQS failure");
+
+    name = index_to_str(vbasedev, index);
+    if (name) {
+        error_prepend(errp, "%s-%d: ", name, subindex);
+    } else {
+        error_prepend(errp, "index %d-%d: ", index, subindex);
+    }
+    error_prepend(errp,
+                  "Failed to %s %s eventfd signaling for interrupt ",
+                  fd < 0 ? "tear down" : "set up", action_to_str(action));
+    return ret;
+}
+
+/*
+ * IO Port/MMIO - Beware of the endians, VFIO is always little endian
+ */
+void vfio_region_write(void *opaque, hwaddr addr,
+                       uint64_t data, unsigned size)
+{
+    VFIORegion *region = opaque;
+    VFIODevice *vbasedev = region->vbasedev;
+    union {
+        uint8_t byte;
+        uint16_t word;
+        uint32_t dword;
+        uint64_t qword;
+    } buf;
+
+    switch (size) {
+    case 1:
+        buf.byte = data;
+        break;
+    case 2:
+        buf.word = cpu_to_le16(data);
+        break;
+    case 4:
+        buf.dword = cpu_to_le32(data);
+        break;
+    case 8:
+        buf.qword = cpu_to_le64(data);
+        break;
+    default:
+        hw_error("vfio: unsupported write size, %u bytes", size);
+        break;
+    }
+
+    if (pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+        error_report("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
+                     ",%d) failed: %m",
+                     __func__, vbasedev->name, region->nr,
+                     addr, data, size);
+    }
+
+    trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
+
+    /*
+     * A read or write to a BAR always signals an INTx EOI.  This will
+     * do nothing if not pending (including not in INTx mode).  We assume
+     * that a BAR access is in response to an interrupt and that BAR
+     * accesses will service the interrupt.  Unfortunately, we don't know
+     * which access will service the interrupt, so we're potentially
+     * getting quite a few host interrupts per guest interrupt.
+     */
+    vbasedev->ops->vfio_eoi(vbasedev);
+}
+
+uint64_t vfio_region_read(void *opaque,
+                          hwaddr addr, unsigned size)
+{
+    VFIORegion *region = opaque;
+    VFIODevice *vbasedev = region->vbasedev;
+    union {
+        uint8_t byte;
+        uint16_t word;
+        uint32_t dword;
+        uint64_t qword;
+    } buf;
+    uint64_t data = 0;
+
+    if (pread(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+        error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %m",
+                     __func__, vbasedev->name, region->nr,
+                     addr, size);
+        return (uint64_t)-1;
+    }
+    switch (size) {
+    case 1:
+        data = buf.byte;
+        break;
+    case 2:
+        data = le16_to_cpu(buf.word);
+        break;
+    case 4:
+        data = le32_to_cpu(buf.dword);
+        break;
+    case 8:
+        data = le64_to_cpu(buf.qword);
+        break;
+    default:
+        hw_error("vfio: unsupported read size, %u bytes", size);
+        break;
+    }
+
+    trace_vfio_region_read(vbasedev->name, region->nr, addr, size, data);
+
+    /* Same as write above */
+    vbasedev->ops->vfio_eoi(vbasedev);
+
+    return data;
+}
+
+const MemoryRegionOps vfio_region_ops = {
+    .read = vfio_region_read,
+    .write = vfio_region_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
+int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size)
+{
+    vbmap->pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size();
+    vbmap->size = ROUND_UP(vbmap->pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                         BITS_PER_BYTE;
+    vbmap->bitmap = g_try_malloc0(vbmap->size);
+    if (!vbmap->bitmap) {
+        return -ENOMEM;
+    }
+
+    return 0;
+}
+
+struct vfio_info_cap_header *
+vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+
+    for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+struct vfio_info_cap_header *
+vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
+{
+    if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
+        return NULL;
+    }
+
+    return vfio_get_cap((void *)info, info->cap_offset, id);
+}
+
+struct vfio_info_cap_header *
+vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id)
+{
+    if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) {
+        return NULL;
+    }
+
+    return vfio_get_cap((void *)info, info->cap_offset, id);
+}
+
+static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
+                                          struct vfio_region_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_region_info_cap_sparse_mmap *sparse;
+    int i, j;
+
+    hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
+    if (!hdr) {
+        return -ENODEV;
+    }
+
+    sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header);
+
+    trace_vfio_region_sparse_mmap_header(region->vbasedev->name,
+                                         region->nr, sparse->nr_areas);
+
+    region->mmaps = g_new0(VFIOMmap, sparse->nr_areas);
+
+    for (i = 0, j = 0; i < sparse->nr_areas; i++) {
+        if (sparse->areas[i].size) {
+            trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
+                                            sparse->areas[i].offset +
+                                            sparse->areas[i].size - 1);
+            region->mmaps[j].offset = sparse->areas[i].offset;
+            region->mmaps[j].size = sparse->areas[i].size;
+            j++;
+        }
+    }
+
+    region->nr_mmaps = j;
+    region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap));
+
+    return 0;
+}
+
+int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
+                      int index, const char *name)
+{
+    struct vfio_region_info *info;
+    int ret;
+
+    ret = vfio_get_region_info(vbasedev, index, &info);
+    if (ret) {
+        return ret;
+    }
+
+    region->vbasedev = vbasedev;
+    region->flags = info->flags;
+    region->size = info->size;
+    region->fd_offset = info->offset;
+    region->nr = index;
+
+    if (region->size) {
+        region->mem = g_new0(MemoryRegion, 1);
+        memory_region_init_io(region->mem, obj, &vfio_region_ops,
+                              region, name, region->size);
+
+        if (!vbasedev->no_mmap &&
+            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
+
+            ret = vfio_setup_region_sparse_mmaps(region, info);
+
+            if (ret) {
+                region->nr_mmaps = 1;
+                region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+                region->mmaps[0].offset = 0;
+                region->mmaps[0].size = region->size;
+            }
+        }
+    }
+
+    g_free(info);
+
+    trace_vfio_region_setup(vbasedev->name, index, name,
+                            region->flags, region->fd_offset, region->size);
+    return 0;
+}
+
+static void vfio_subregion_unmap(VFIORegion *region, int index)
+{
+    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
+                            region->mmaps[index].offset,
+                            region->mmaps[index].offset +
+                            region->mmaps[index].size - 1);
+    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
+    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
+    object_unparent(OBJECT(&region->mmaps[index].mem));
+    region->mmaps[index].mmap = NULL;
+}
+
+int vfio_region_mmap(VFIORegion *region)
+{
+    int i, prot = 0;
+    char *name;
+
+    if (!region->mem) {
+        return 0;
+    }
+
+    prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
+    prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
+                                     MAP_SHARED, region->vbasedev->fd,
+                                     region->fd_offset +
+                                     region->mmaps[i].offset);
+        if (region->mmaps[i].mmap == MAP_FAILED) {
+            int ret = -errno;
+
+            trace_vfio_region_mmap_fault(memory_region_name(region->mem), i,
+                                         region->fd_offset +
+                                         region->mmaps[i].offset,
+                                         region->fd_offset +
+                                         region->mmaps[i].offset +
+                                         region->mmaps[i].size - 1, ret);
+
+            region->mmaps[i].mmap = NULL;
+
+            for (i--; i >= 0; i--) {
+                vfio_subregion_unmap(region, i);
+            }
+
+            return ret;
+        }
+
+        name = g_strdup_printf("%s mmaps[%d]",
+                               memory_region_name(region->mem), i);
+        memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
+                                          memory_region_owner(region->mem),
+                                          name, region->mmaps[i].size,
+                                          region->mmaps[i].mmap);
+        g_free(name);
+        memory_region_add_subregion(region->mem, region->mmaps[i].offset,
+                                    &region->mmaps[i].mem);
+
+        trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
+                               region->mmaps[i].offset,
+                               region->mmaps[i].offset +
+                               region->mmaps[i].size - 1);
+    }
+
+    return 0;
+}
+
+void vfio_region_unmap(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            vfio_subregion_unmap(region, i);
+        }
+    }
+}
+
+void vfio_region_exit(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
+        }
+    }
+
+    trace_vfio_region_exit(region->vbasedev->name, region->nr);
+}
+
+void vfio_region_finalize(VFIORegion *region)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            munmap(region->mmaps[i].mmap, region->mmaps[i].size);
+            object_unparent(OBJECT(&region->mmaps[i].mem));
+        }
+    }
+
+    object_unparent(OBJECT(region->mem));
+
+    g_free(region->mem);
+    g_free(region->mmaps);
+
+    trace_vfio_region_finalize(region->vbasedev->name, region->nr);
+
+    region->mem = NULL;
+    region->mmaps = NULL;
+    region->nr_mmaps = 0;
+    region->size = 0;
+    region->flags = 0;
+    region->nr = 0;
+}
+
+void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled)
+{
+    int i;
+
+    if (!region->mem) {
+        return;
+    }
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            memory_region_set_enabled(&region->mmaps[i].mem, enabled);
+        }
+    }
+
+    trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem),
+                                        enabled);
+}
+
+int vfio_get_region_info(VFIODevice *vbasedev, int index,
+                         struct vfio_region_info **info)
+{
+    size_t argsz = sizeof(struct vfio_region_info);
+
+    *info = g_malloc0(argsz);
+
+    (*info)->index = index;
+retry:
+    (*info)->argsz = argsz;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if ((*info)->argsz > argsz) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+
+        goto retry;
+    }
+
+    return 0;
+}
+
+int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
+                             uint32_t subtype, struct vfio_region_info **info)
+{
+    int i;
+
+    for (i = 0; i < vbasedev->num_regions; i++) {
+        struct vfio_info_cap_header *hdr;
+        struct vfio_region_info_cap_type *cap_type;
+
+        if (vfio_get_region_info(vbasedev, i, info)) {
+            continue;
+        }
+
+        hdr = vfio_get_region_info_cap(*info, VFIO_REGION_INFO_CAP_TYPE);
+        if (!hdr) {
+            g_free(*info);
+            continue;
+        }
+
+        cap_type = container_of(hdr, struct vfio_region_info_cap_type, header);
+
+        trace_vfio_get_dev_region(vbasedev->name, i,
+                                  cap_type->type, cap_type->subtype);
+
+        if (cap_type->type == type && cap_type->subtype == subtype) {
+            return 0;
+        }
+
+        g_free(*info);
+    }
+
+    *info = NULL;
+    return -ENODEV;
+}
+
+bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
+{
+    struct vfio_region_info *info = NULL;
+    bool ret = false;
+
+    if (!vfio_get_region_info(vbasedev, region, &info)) {
+        if (vfio_get_region_info_cap(info, cap_type)) {
+            ret = true;
+        }
+        g_free(info);
+    }
+
+    return ret;
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af297a0..3746c9f984 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -1,5 +1,6 @@
 vfio_ss = ss.source_set()
 vfio_ss.add(files(
+  'helpers.c',
   'common.c',
   'spapr.c',
   'migration.c',
-- 
2.41.0




* [PATCH v4 04/15] vfio/common: Propagate KVM_SET_DEVICE_ATTR error if any
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (2 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 03/15] vfio/common: Move IOMMU agnostic helpers to a separate file Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 05/15] vfio/common: Introduce vfio_container_add|del_section_window() Eric Auger
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

In the VFIO_SPAPR_TCE_v2_IOMMU container case, when
KVM_SET_DEVICE_ATTR fails, we currently don't propagate the
error, unlike in the vfio_spapr_create_window() failure
case. Let's align the code. Take the opportunity to
reword the error message and make it more explicit.
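
For readers unfamiliar with the pattern, here is a minimal standalone
sketch of "record the error and jump to the common failure path" as
opposed to "print and return". It does not use the QEMU Error API:
SketchError, sketch_set_attr() and sketch_region_add() are hypothetical
stand-ins invented for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an error object: errno value plus message. */
typedef struct {
    int code;
    char msg[128];
} SketchError;

/* Hypothetical stand-in for the ioctl(); fails for negative fds. */
static int sketch_set_attr(int groupfd)
{
    if (groupfd < 0) {
        errno = EBADF;
        return -1;
    }
    return 0;
}

/*
 * Walk the group fds; on the first failure, record the error in *err and
 * jump to the common failure path instead of printing and returning as if
 * nothing had happened.
 */
static int sketch_region_add(const int *groupfds, int n, SketchError *err)
{
    int i;

    assert(groupfds != NULL && err != NULL);

    for (i = 0; i < n; i++) {
        if (sketch_set_attr(groupfds[i])) {
            err->code = errno; /* save errno before any other call */
            snprintf(err->msg, sizeof(err->msg),
                     "failed GROUP_SET_SPAPR_TCE for group fd %d: %s",
                     groupfds[i], strerror(err->code));
            goto fail;
        }
    }
    return 0;

fail:
    /* The caller decides what to do with the recorded error. */
    return -err->code;
}
```

With this shape the caller sees the failure and can handle it on the
same path as every other error, which is the alignment this patch is
after.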

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>

---
 hw/vfio/common.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4e122fc4e4..c54a72ec80 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -878,11 +878,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
                 QLIST_FOREACH(group, &container->group_list, container_next) {
                     param.groupfd = group->fd;
                     if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-                        error_report("vfio: failed to setup fd %d "
-                                     "for a group with fd %d: %s",
-                                     param.tablefd, param.groupfd,
-                                     strerror(errno));
-                        return;
+                        error_setg_errno(&err, errno,
+                                         "vfio: failed GROUP_SET_SPAPR_TCE for "
+                                         "KVM VFIO device %d and group fd %d",
+                                         param.tablefd, param.groupfd);
+                        goto fail;
                     }
                     trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
                 }
-- 
2.41.0




* [PATCH v4 05/15] vfio/common: Introduce vfio_container_add|del_section_window()
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (3 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 04/15] vfio/common: Propagate KVM_SET_DEVICE_ATTR error if any Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 06/15] vfio/common: Extract out vfio_kvm_device_[add/del]_fd Eric Auger
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

Introduce helper functions that isolate the code used for
VFIO_SPAPR_TCE_v2_IOMMU.

These helpers hide implementation details beneath the container object
and make the vfio_listener_region_add/del() implementations more
readable. No functional change intended.
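
As an illustration of the "no intersection with an existing host
window" rule the add helper enforces, here is a small standalone sketch.
It is not the QEMU code: SketchWindow, sketch_section_overlaps() and
sketch_check_windows() are hypothetical names, and the overlap test is
written on inclusive bounds rather than via ranges_overlap().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host DMA window, described by inclusive IOVA bounds. */
typedef struct {
    uint64_t min_iova;
    uint64_t max_iova;
} SketchWindow;

/*
 * Return true when [start, start + size - 1] intersects the window.
 * Working on inclusive end points avoids overflow for ranges that end
 * at UINT64_MAX.
 */
static bool sketch_section_overlaps(const SketchWindow *win,
                                    uint64_t start, uint64_t size)
{
    uint64_t last;

    assert(size > 0 && start <= UINT64_MAX - (size - 1));
    last = start + size - 1;

    return start <= win->max_iova && win->min_iova <= last;
}

/* Return 0 if the section may be added, -EINVAL on any intersection. */
static int sketch_check_windows(const SketchWindow *wins, int n,
                                uint64_t start, uint64_t size)
{
    int i;

    for (i = 0; i < n; i++) {
        if (sketch_section_overlaps(&wins[i], start, size)) {
            return -EINVAL;
        }
    }
    return 0;
}
```

A section that merely touches the byte after a window's max_iova is
accepted; sharing even a single byte is rejected.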

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/common.c | 156 +++++++++++++++++++++++++++--------------------
 1 file changed, 89 insertions(+), 67 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c54a72ec80..0397788aa5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -807,6 +807,92 @@ static bool vfio_get_section_iova_range(VFIOContainer *container,
     return true;
 }
 
+static int vfio_container_add_section_window(VFIOContainer *container,
+                                             MemoryRegionSection *section,
+                                             Error **errp)
+{
+    VFIOHostDMAWindow *hostwin;
+    hwaddr pgsize = 0;
+    int ret;
+
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return 0;
+    }
+
+    /* For now intersections are not allowed, we may relax this later */
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (ranges_overlap(hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1,
+                           section->offset_within_address_space,
+                           int128_get64(section->size))) {
+            error_setg(errp,
+                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing "
+                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                    int128_get64(section->size) - 1,
+                hostwin->min_iova, hostwin->max_iova);
+            return -EINVAL;
+        }
+    }
+
+    ret = vfio_spapr_create_window(container, section, &pgsize);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
+        return ret;
+    }
+
+    vfio_host_win_add(container, section->offset_within_address_space,
+                      section->offset_within_address_space +
+                      int128_get64(section->size) - 1, pgsize);
+#ifdef CONFIG_KVM
+    if (kvm_enabled()) {
+        VFIOGroup *group;
+        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
+        struct kvm_vfio_spapr_tce param;
+        struct kvm_device_attr attr = {
+            .group = KVM_DEV_VFIO_GROUP,
+            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
+            .addr = (uint64_t)(unsigned long)&param,
+        };
+
+        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
+                                          &param.tablefd)) {
+            QLIST_FOREACH(group, &container->group_list, container_next) {
+                param.groupfd = group->fd;
+                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+                    error_setg_errno(errp, errno,
+                                     "vfio: failed GROUP_SET_SPAPR_TCE for "
+                                     "KVM VFIO device %d and group fd %d",
+                                     param.tablefd, param.groupfd);
+                    return -errno;
+                }
+                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
+            }
+        }
+    }
+#endif
+    return 0;
+}
+
+static void vfio_container_del_section_window(VFIOContainer *container,
+                                              MemoryRegionSection *section)
+{
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return;
+    }
+
+    vfio_spapr_remove_window(container,
+                             section->offset_within_address_space);
+    if (vfio_host_win_del(container,
+                          section->offset_within_address_space,
+                          section->offset_within_address_space +
+                          int128_get64(section->size) - 1) < 0) {
+        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
+                 __func__, section->offset_within_address_space);
+    }
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -833,62 +919,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
         return;
     }
 
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        hwaddr pgsize = 0;
-
-        /* For now intersections are not allowed, we may relax this later */
-        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-            if (ranges_overlap(hostwin->min_iova,
-                               hostwin->max_iova - hostwin->min_iova + 1,
-                               section->offset_within_address_space,
-                               int128_get64(section->size))) {
-                error_setg(&err,
-                    "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
-                    "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
-                    section->offset_within_address_space,
-                    section->offset_within_address_space +
-                        int128_get64(section->size) - 1,
-                    hostwin->min_iova, hostwin->max_iova);
-                goto fail;
-            }
-        }
-
-        ret = vfio_spapr_create_window(container, section, &pgsize);
-        if (ret) {
-            error_setg_errno(&err, -ret, "Failed to create SPAPR window");
-            goto fail;
-        }
-
-        vfio_host_win_add(container, section->offset_within_address_space,
-                          section->offset_within_address_space +
-                          int128_get64(section->size) - 1, pgsize);
-#ifdef CONFIG_KVM
-        if (kvm_enabled()) {
-            VFIOGroup *group;
-            IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
-            struct kvm_vfio_spapr_tce param;
-            struct kvm_device_attr attr = {
-                .group = KVM_DEV_VFIO_GROUP,
-                .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
-                .addr = (uint64_t)(unsigned long)&param,
-            };
-
-            if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
-                                              &param.tablefd)) {
-                QLIST_FOREACH(group, &container->group_list, container_next) {
-                    param.groupfd = group->fd;
-                    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-                        error_setg_errno(&err, errno,
-                                         "vfio: failed GROUP_SET_SPAPR_TCE for "
-                                         "KVM VFIO device %d and group fd %d",
-                                         param.tablefd, param.groupfd);
-                        goto fail;
-                    }
-                    trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
-                }
-            }
-        }
-#endif
+    if (vfio_container_add_section_window(container, section, &err)) {
+        goto fail;
     }
 
     hostwin = vfio_find_hostwin(container, iova, end);
@@ -1105,17 +1137,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
     memory_region_unref(section->mr);
 
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        vfio_spapr_remove_window(container,
-                                 section->offset_within_address_space);
-        if (vfio_host_win_del(container,
-                              section->offset_within_address_space,
-                              section->offset_within_address_space +
-                              int128_get64(section->size) - 1) < 0) {
-            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
-                     __func__, section->offset_within_address_space);
-        }
-    }
+    vfio_container_del_section_window(container, section);
 }
 
 static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
-- 
2.41.0




* [PATCH v4 06/15] vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (4 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 05/15] vfio/common: Introduce vfio_container_add|del_section_window() Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 07/15] vfio/pci: Introduce vfio_[attach/detach]_device Eric Auger
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

From: Zhenzhong Duan <zhenzhong.duan@intel.com>

Introduce two new helpers, vfio_kvm_device_[add/del]_fd, which
take a file descriptor that can be either a group fd or a cdev fd.
They use the new KVM_DEV_VFIO_FILE VFIO KVM device group,
which aliases to the legacy KVM_DEV_VFIO_GROUP.

vfio_kvm_device_[add/del]_group now call these new helpers.
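
The layering can be sketched in standalone C as follows. This is only
an illustration under assumptions: sketch_backend_add() is a
hypothetical stand-in for the KVM_SET_DEVICE_ATTR ioctl, errors are
passed through a plain character buffer instead of the Error API, and
the group wrapper returns the error code (the real wrapper returns void
and reports the error) so that the behaviour is easy to check.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the ioctl(); fails for negative fds. */
static int sketch_backend_add(int fd)
{
    if (fd < 0) {
        errno = EBADF;
        return -1;
    }
    return 0;
}

/* Generic helper: accepts any fd (group or cdev), returns 0 or -errno. */
static int sketch_kvm_device_add_fd(int fd, char *errbuf, size_t len)
{
    assert(errbuf != NULL && len > 0);
    errbuf[0] = '\0';

    if (sketch_backend_add(fd)) {
        int saved = errno; /* later calls may clobber errno */

        snprintf(errbuf, len, "Failed to add fd %d to KVM VFIO device", fd);
        return -saved;
    }
    return 0;
}

/* Hypothetical group object: just an fd and an ID. */
typedef struct {
    int fd;
    int groupid;
} SketchGroup;

/* Legacy wrapper: delegates to the fd helper and adds group context. */
static int sketch_kvm_device_add_group(const SketchGroup *group,
                                       char *out, size_t len)
{
    char err[96];
    int ret = sketch_kvm_device_add_fd(group->fd, err, sizeof(err));

    if (ret) {
        snprintf(out, len, "group ID %d: %s", group->groupid, err);
    }
    return ret;
}
```

The point of the split is that the fd helper knows nothing about
groups, so a future cdev-based caller can reuse it unchanged while the
group wrapper keeps the group-specific context in its message.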

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/vfio/vfio-common.h |  3 ++
 hw/vfio/common.c              | 69 +++++++++++++++++++++++------------
 2 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e0483893d1..c4e7c3b4a7 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -226,6 +226,9 @@ struct vfio_device_info *vfio_get_device_info(int fd);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
 
+int vfio_kvm_device_add_fd(int fd, Error **errp);
+int vfio_kvm_device_del_fd(int fd, Error **errp);
+
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0397788aa5..d8ed432cb6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1818,17 +1818,17 @@ void vfio_reset_handler(void *opaque)
     }
 }
 
-static void vfio_kvm_device_add_group(VFIOGroup *group)
+int vfio_kvm_device_add_fd(int fd, Error **errp)
 {
 #ifdef CONFIG_KVM
     struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_ADD,
-        .addr = (uint64_t)(unsigned long)&group->fd,
+        .group = KVM_DEV_VFIO_FILE,
+        .attr = KVM_DEV_VFIO_FILE_ADD,
+        .addr = (uint64_t)(unsigned long)&fd,
     };
 
     if (!kvm_enabled()) {
-        return;
+        return 0;
     }
 
     if (vfio_kvm_device_fd < 0) {
@@ -1837,38 +1837,61 @@ static void vfio_kvm_device_add_group(VFIOGroup *group)
         };
 
         if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
-            error_report("Failed to create KVM VFIO device: %m");
-            return;
+            error_setg_errno(errp, errno, "Failed to create KVM VFIO device");
+            return -errno;
         }
 
         vfio_kvm_device_fd = cd.fd;
     }
 
     if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to add group %d to KVM VFIO device: %m",
-                     group->groupid);
+        error_setg_errno(errp, errno, "Failed to add fd %d to KVM VFIO device",
+                         fd);
+        return -errno;
     }
 #endif
+    return 0;
+}
+
+int vfio_kvm_device_del_fd(int fd, Error **errp)
+{
+#ifdef CONFIG_KVM
+    struct kvm_device_attr attr = {
+        .group = KVM_DEV_VFIO_FILE,
+        .attr = KVM_DEV_VFIO_FILE_DEL,
+        .addr = (uint64_t)(unsigned long)&fd,
+    };
+
+    if (vfio_kvm_device_fd < 0) {
+        error_setg(errp, "KVM VFIO device isn't created yet");
+        return -EINVAL;
+    }
+
+    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+        error_setg_errno(errp, errno,
+                         "Failed to remove fd %d from KVM VFIO device", fd);
+        return -errno;
+    }
+#endif
+    return 0;
+}
+
+static void vfio_kvm_device_add_group(VFIOGroup *group)
+{
+    Error *err = NULL;
+
+    if (vfio_kvm_device_add_fd(group->fd, &err)) {
+        error_reportf_err(err, "group ID %d: ", group->groupid);
+    }
 }
 
 static void vfio_kvm_device_del_group(VFIOGroup *group)
 {
-#ifdef CONFIG_KVM
-    struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_DEL,
-        .addr = (uint64_t)(unsigned long)&group->fd,
-    };
+    Error *err = NULL;
 
-    if (vfio_kvm_device_fd < 0) {
-        return;
+    if (vfio_kvm_device_del_fd(group->fd, &err)) {
+        error_reportf_err(err, "group ID %d: ", group->groupid);
     }
-
-    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to remove group %d from KVM VFIO device: %m",
-                     group->groupid);
-    }
-#endif
 }
 
 static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
-- 
2.41.0




* [PATCH v4 07/15] vfio/pci: Introduce vfio_[attach/detach]_device
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (5 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 06/15] vfio/common: Extract out vfio_kvm_device_[add/del]_fd Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 08/15] vfio/platform: Use vfio_[attach/detach]_device Eric Auger
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

We want the VFIO devices to be able to use two different
IOMMU backends, the legacy VFIO one and the new iommufd one.

Introduce vfio_[attach/detach]_device, which aim to hide the
underlying IOMMU backend (IOCTLs, datatypes, ...).

Once vfio_attach_device completes, the device is attached
to a security context and its fd can be used. Conversely,
when vfio_detach_device completes, the device has been
detached from the security context.

At the moment only the implementation based on the legacy
container/group exists. Let's use it from the vfio-pci device.
Subsequent patches will handle other devices.

We also take the opportunity to properly free
vbasedev->name on failure.
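
For illustration, the group ID lookup that moves into the attach path
boils down to taking the last component of the iommu_group symlink
target and parsing it as a number. Below is a standalone sketch of that
step only. sketch_parse_groupid() is a hypothetical name, the example
path is illustrative, and strtol() is used here instead of the
sscanf() call in the patch, so trailing garbage is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the numeric group ID from a symlink target such as
 * "../../../kernel/iommu_groups/26".
 * Returns the group ID, or -EINVAL if the last path component is not a
 * non-negative decimal number that fits in an int.
 */
static int sketch_parse_groupid(const char *link_target)
{
    const char *base;
    char *end;
    long id;

    assert(link_target != NULL);

    /* Keep only the last path component. */
    base = strrchr(link_target, '/');
    base = base ? base + 1 : link_target;

    errno = 0;
    id = strtol(base, &end, 10);
    if (errno || end == base || *end != '\0' || id < 0 || id > INT_MAX) {
        return -EINVAL;
    }
    return (int)id;
}
```

Returning a negative errno value on failure and the ID otherwise lets a
caller use a single "groupid < 0" test, the same convention
vfio_attach_device() relies on for vfio_device_groupid().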

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>

---

v2 -> v3:
- added trace_vfio_detach_device
- added a comment explaining why we pass @name to vfio_attach_device
  although vbasedev->name is populated
- free vbasedev->name and detach_device if needed
---
 include/hw/vfio/vfio-common.h |  3 ++
 hw/vfio/common.c              | 74 +++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 67 +++++++------------------------
 hw/vfio/trace-events          |  3 +-
 4 files changed, 94 insertions(+), 53 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c4e7c3b4a7..12fbfbc37d 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -225,6 +225,9 @@ void vfio_put_group(VFIOGroup *group);
 struct vfio_device_info *vfio_get_device_info(int fd);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_attach_device(char *name, VFIODevice *vbasedev,
+                       AddressSpace *as, Error **errp);
+void vfio_detach_device(VFIODevice *vbasedev);
 
 int vfio_kvm_device_add_fd(int fd, Error **errp);
 int vfio_kvm_device_del_fd(int fd, Error **errp);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d8ed432cb6..f4c33c9858 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -2611,3 +2611,77 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
     }
     return vfio_eeh_container_op(container, op);
 }
+
+static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
+{
+    char *tmp, group_path[PATH_MAX], *group_name;
+    int ret, groupid;
+    ssize_t len;
+
+    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
+    len = readlink(tmp, group_path, sizeof(group_path));
+    g_free(tmp);
+
+    if (len <= 0 || len >= sizeof(group_path)) {
+        ret = len < 0 ? -errno : -ENAMETOOLONG;
+        error_setg_errno(errp, -ret, "no iommu_group found");
+        return ret;
+    }
+
+    group_path[len] = 0;
+
+    group_name = basename(group_path);
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_setg_errno(errp, errno, "failed to read %s", group_path);
+        return -errno;
+    }
+    return groupid;
+}
+
+/*
+ * vfio_attach_device: attach a device to a security context
+ * @name and @vbasedev->name are likely to be different depending
+ * on the type of the device, hence the need for passing @name
+ */
+int vfio_attach_device(char *name, VFIODevice *vbasedev,
+                       AddressSpace *as, Error **errp)
+{
+    int groupid = vfio_device_groupid(vbasedev, errp);
+    VFIODevice *vbasedev_iter;
+    VFIOGroup *group;
+    int ret;
+
+    if (groupid < 0) {
+        return groupid;
+    }
+
+    trace_vfio_attach_device(vbasedev->name, groupid);
+
+    group = vfio_get_group(groupid, as, errp);
+    if (!group) {
+        return -ENOENT;
+    }
+
+    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
+            error_setg(errp, "device is already attached");
+            vfio_put_group(group);
+            return -EBUSY;
+        }
+    }
+    ret = vfio_get_device(group, name, vbasedev, errp);
+    if (ret) {
+        vfio_put_group(group);
+    }
+
+    return ret;
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+    VFIOGroup *group = vbasedev->group;
+
+    trace_vfio_detach_device(vbasedev->name, group->groupid);
+    vfio_put_base_device(vbasedev);
+    vfio_put_group(group);
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 898296fd54..60e10d0eee 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2895,10 +2895,10 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 
 static void vfio_pci_put_device(VFIOPCIDevice *vdev)
 {
+    vfio_detach_device(&vdev->vbasedev);
+
     g_free(vdev->vbasedev.name);
     g_free(vdev->msix);
-
-    vfio_put_base_device(&vdev->vbasedev);
 }
 
 static void vfio_err_notifier_handler(void *opaque)
@@ -3045,13 +3045,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
-    VFIODevice *vbasedev_iter;
-    VFIOGroup *group;
-    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
+    char *tmp, *subsys;
     Error *err = NULL;
-    ssize_t len;
     struct stat st;
-    int groupid;
     int i, ret;
     bool is_mdev;
     char uuid[UUID_FMT_LEN];
@@ -3082,39 +3078,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vbasedev->type = VFIO_DEVICE_TYPE_PCI;
     vbasedev->dev = DEVICE(vdev);
 
-    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len <= 0 || len >= sizeof(group_path)) {
-        error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG,
-                         "no iommu_group found");
-        goto error;
-    }
-
-    group_path[len] = 0;
-
-    group_name = basename(group_path);
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_setg_errno(errp, errno, "failed to read %s", group_path);
-        goto error;
-    }
-
-    trace_vfio_realize(vbasedev->name, groupid);
-
-    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
-    if (!group) {
-        goto error;
-    }
-
-    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-            error_setg(errp, "device is already attached");
-            vfio_put_group(group);
-            goto error;
-        }
-    }
-
     /*
      * Mediated devices *might* operate compatibly with discarding of RAM, but
      * we cannot know for certain, it depends on whether the mdev vendor driver
@@ -3132,7 +3095,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (vbasedev->ram_block_discard_allowed && !is_mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
-        vfio_put_group(group);
         goto error;
     }
 
@@ -3143,17 +3105,17 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         name = g_strdup(vbasedev->name);
     }
 
-    ret = vfio_get_device(group, name, vbasedev, errp);
+    ret = vfio_attach_device(name, vbasedev,
+                             pci_device_iommu_address_space(pdev), errp);
     g_free(name);
     if (ret) {
-        vfio_put_group(group);
         goto error;
     }
 
     vfio_populate_device(vdev, &err);
     if (err) {
         error_propagate(errp, err);
-        goto error;
+        goto out_detach;
     }
 
     /* Get a copy of config space */
@@ -3163,7 +3125,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
         ret = ret < 0 ? -errno : -EFAULT;
         error_setg_errno(errp, -ret, "failed to read device config space");
-        goto error;
+        goto out_detach;
     }
 
     /* vfio emulates a lot for us, but some bits need extra love */
@@ -3182,7 +3144,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (vdev->vendor_id != PCI_ANY_ID) {
         if (vdev->vendor_id >= 0xffff) {
             error_setg(errp, "invalid PCI vendor ID provided");
-            goto error;
+            goto out_detach;
         }
         vfio_add_emulated_word(vdev, PCI_VENDOR_ID, vdev->vendor_id, ~0);
         trace_vfio_pci_emulated_vendor_id(vbasedev->name, vdev->vendor_id);
@@ -3193,7 +3155,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (vdev->device_id != PCI_ANY_ID) {
         if (vdev->device_id > 0xffff) {
             error_setg(errp, "invalid PCI device ID provided");
-            goto error;
+            goto out_detach;
         }
         vfio_add_emulated_word(vdev, PCI_DEVICE_ID, vdev->device_id, ~0);
         trace_vfio_pci_emulated_device_id(vbasedev->name, vdev->device_id);
@@ -3204,7 +3166,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (vdev->sub_vendor_id != PCI_ANY_ID) {
         if (vdev->sub_vendor_id > 0xffff) {
             error_setg(errp, "invalid PCI subsystem vendor ID provided");
-            goto error;
+            goto out_detach;
         }
         vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_VENDOR_ID,
                                vdev->sub_vendor_id, ~0);
@@ -3215,7 +3177,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (vdev->sub_device_id != PCI_ANY_ID) {
         if (vdev->sub_device_id > 0xffff) {
             error_setg(errp, "invalid PCI subsystem device ID provided");
-            goto error;
+            goto out_detach;
         }
         vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_ID, vdev->sub_device_id, ~0);
         trace_vfio_pci_emulated_sub_device_id(vbasedev->name,
@@ -3248,7 +3210,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vfio_msix_early_setup(vdev, &err);
     if (err) {
         error_propagate(errp, err);
-        goto error;
+        goto out_detach;
     }
 
     vfio_bars_register(vdev);
@@ -3364,14 +3326,16 @@ out_deregister:
 out_teardown:
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
+out_detach:
+    vfio_detach_device(vbasedev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
+    g_free(vbasedev->name);
 }
 
 static void vfio_instance_finalize(Object *obj)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(obj);
-    VFIOGroup *group = vdev->vbasedev.group;
 
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
@@ -3385,7 +3349,6 @@ static void vfio_instance_finalize(Object *obj)
      * g_free(vdev->igd_opregion);
      */
     vfio_pci_put_device(vdev);
-    vfio_put_group(group);
 }
 
 static void vfio_exitfn(PCIDevice *pdev)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ba3c5a0e2..8ac13eb106 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -37,7 +37,8 @@ vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, int
 vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %s"
 vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
 vfio_populate_device_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
-vfio_realize(const char *name, int group_id) " (%s) group %d"
+vfio_attach_device(const char *name, int group_id) " (%s) group %d"
+vfio_detach_device(const char *name, int group_id) " (%s) group %d"
 vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
 vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
 vfio_pci_reset(const char *name) " (%s)"
-- 
2.41.0




* [PATCH v4 08/15] vfio/platform: Use vfio_[attach/detach]_device
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (6 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 07/15] vfio/pci: Introduce vfio_[attach/detach]_device Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 09/15] vfio/ap: " Eric Auger
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

Let the vfio-platform device use vfio_attach_device() and
vfio_detach_device(), thereby hiding the details of the
IOMMU backend in use.

Drop the trace event for vfio-platform as we have a similar
one in vfio_attach_device.
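
The resulting init flow is "attach, then populate, and detach again if
populate fails". A standalone sketch of that unwinding is shown below.
All names (SketchDevice, sketch_attach(), sketch_detach(),
sketch_populate(), sketch_base_device_init()) are hypothetical and the
populate_should_fail field exists only to exercise the failure path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state. */
typedef struct {
    bool attached;
    bool populate_should_fail; /* knob to exercise the failure path */
} SketchDevice;

/* Attach fails with -EBUSY if the device is already attached. */
static int sketch_attach(SketchDevice *dev)
{
    if (dev->attached) {
        return -EBUSY;
    }
    dev->attached = true;
    return 0;
}

/* Detach is only valid on an attached device. */
static void sketch_detach(SketchDevice *dev)
{
    assert(dev->attached);
    dev->attached = false;
}

static int sketch_populate(SketchDevice *dev)
{
    return dev->populate_should_fail ? -EINVAL : 0;
}

/* Attach, populate, and undo the attach if populate fails. */
static int sketch_base_device_init(SketchDevice *dev)
{
    int ret = sketch_attach(dev);

    if (ret) {
        return ret; /* nothing to undo */
    }

    ret = sketch_populate(dev);
    if (ret) {
        sketch_detach(dev); /* single call undoes the whole attach */
    }
    return ret;
}
```

The caller no longer needs to know which resources the attach step
acquired; one detach call releases whatever the backend set up.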

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/platform.c   | 43 +++----------------------------------------
 hw/vfio/trace-events |  1 -
 2 files changed, 3 insertions(+), 41 deletions(-)

diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 5af73f9287..8e3d4ac458 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -529,12 +529,7 @@ static VFIODeviceOps vfio_platform_ops = {
  */
 static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev_iter;
-    char *tmp, group_path[PATH_MAX], *group_name;
-    ssize_t len;
     struct stat st;
-    int groupid;
     int ret;
 
     /* @sysfsdev takes precedence over @host */
@@ -557,47 +552,15 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
         return -errno;
     }
 
-    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len < 0 || len >= sizeof(group_path)) {
-        ret = len < 0 ? -errno : -ENAMETOOLONG;
-        error_setg_errno(errp, -ret, "no iommu_group found");
-        return ret;
-    }
-
-    group_path[len] = 0;
-
-    group_name = basename(group_path);
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_setg_errno(errp, errno, "failed to read %s", group_path);
-        return -errno;
-    }
-
-    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
-
-    group = vfio_get_group(groupid, &address_space_memory, errp);
-    if (!group) {
-        return -ENOENT;
-    }
-
-    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-            error_setg(errp, "device is already attached");
-            vfio_put_group(group);
-            return -EBUSY;
-        }
-    }
-    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+    ret = vfio_attach_device(vbasedev->name, vbasedev,
+                             &address_space_memory, errp);
     if (ret) {
-        vfio_put_group(group);
         return ret;
     }
 
     ret = vfio_populate_device(vbasedev, errp);
     if (ret) {
-        vfio_put_group(group);
+        vfio_detach_device(vbasedev);
     }
 
     return ret;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 8ac13eb106..0eb2387cf2 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -121,7 +121,6 @@ vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
 
 # platform.c
-vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
 vfio_platform_realize(char *name, char *compat) "vfio device %s, compat = %s"
 vfio_platform_eoi(int pin, int fd) "EOI IRQ pin %d (fd=%d)"
 vfio_platform_intp_mmap_enable(int pin) "IRQ #%d still active, stay in slow path"
-- 
2.41.0




* [PATCH v4 09/15] vfio/ap: Use vfio_[attach/detach]_device
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (7 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 08/15] vfio/platform: Use vfio_[attach/detach]_device Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 10/15] vfio/ccw: " Eric Auger
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

Let the vfio-ap device use vfio_attach_device() and
vfio_detach_device(), hence hiding the details of the used
IOMMU backend.

We take the opportunity to use g_path_get_basename(), which
is preferred, as suggested by
3e015d815b ("use g_path_get_basename instead of basename")

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>

---

v3 -> v4:
- Removed vfio_detach_device(vbasedev) and g_free(vbasedev->name);
  which do not match the intent
- added error_prepend

v2 -> v3:
- Mention g_path_get_basename in commit message and properly free
  vbasedev->name, call vfio_detach_device
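
Illustration only, on the ownership rule behind the new g_free() calls:
g_path_get_basename() returns a newly allocated string, whereas
basename() returns a pointer tied to its argument (and the POSIX
flavour may modify that argument). A minimal libc-only sketch, assuming
a hypothetical helper path_basename_dup() that covers only the common
case (no trailing-slash or empty-path handling, which the GLib function
does provide):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not a GLib function: return a heap copy of the
 * last path component. The caller owns the result and must free() it.
 */
char *path_basename_dup(const char *path)
{
    const char *slash;
    const char *base;
    char *copy;

    assert(path);
    slash = strrchr(path, '/');
    base = slash ? slash + 1 : path;

    copy = malloc(strlen(base) + 1);
    if (copy) {
        strcpy(copy, base);
    }
    return copy;
}
```

Because the device owns the returned name, the patch frees it with
g_free() both in vfio_ap_unrealize() and under the error label of
vfio_ap_realize().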
---
 hw/vfio/ap.c | 69 ++++++++++------------------------------------------
 1 file changed, 13 insertions(+), 56 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 6e21d1da5a..22e564f4f7 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -53,40 +53,6 @@ struct VFIODeviceOps vfio_ap_ops = {
     .vfio_compute_needs_reset = vfio_ap_compute_needs_reset,
 };
 
-static void vfio_ap_put_device(VFIOAPDevice *vapdev)
-{
-    g_free(vapdev->vdev.name);
-    vfio_put_base_device(&vapdev->vdev);
-}
-
-static VFIOGroup *vfio_ap_get_group(VFIOAPDevice *vapdev, Error **errp)
-{
-    GError *gerror = NULL;
-    char *symlink, *group_path;
-    int groupid;
-
-    symlink = g_strdup_printf("%s/iommu_group", vapdev->vdev.sysfsdev);
-    group_path = g_file_read_link(symlink, &gerror);
-    g_free(symlink);
-
-    if (!group_path) {
-        error_setg(errp, "%s: no iommu_group found for %s: %s",
-                   TYPE_VFIO_AP_DEVICE, vapdev->vdev.sysfsdev, gerror->message);
-        g_error_free(gerror);
-        return NULL;
-    }
-
-    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
-        error_setg(errp, "vfio: failed to read %s", group_path);
-        g_free(group_path);
-        return NULL;
-    }
-
-    g_free(group_path);
-
-    return vfio_get_group(groupid, &address_space_memory, errp);
-}
-
 static void vfio_ap_req_notifier_handler(void *opaque)
 {
     VFIOAPDevice *vapdev = opaque;
@@ -189,22 +155,15 @@ static void vfio_ap_unregister_irq_notifier(VFIOAPDevice *vapdev,
 static void vfio_ap_realize(DeviceState *dev, Error **errp)
 {
     int ret;
-    char *mdevid;
     Error *err = NULL;
-    VFIOGroup *vfio_group;
     APDevice *apdev = AP_DEVICE(dev);
     VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev);
+    VFIODevice *vbasedev = &vapdev->vdev;
 
-    vfio_group = vfio_ap_get_group(vapdev, errp);
-    if (!vfio_group) {
-        return;
-    }
-
-    vapdev->vdev.ops = &vfio_ap_ops;
-    vapdev->vdev.type = VFIO_DEVICE_TYPE_AP;
-    mdevid = basename(vapdev->vdev.sysfsdev);
-    vapdev->vdev.name = g_strdup_printf("%s", mdevid);
-    vapdev->vdev.dev = dev;
+    vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+    vbasedev->ops = &vfio_ap_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_AP;
+    vbasedev->dev = dev;
 
     /*
      * vfio-ap devices operate in a way compatible with discarding of
@@ -214,9 +173,10 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
      */
     vapdev->vdev.ram_block_discard_allowed = true;
 
-    ret = vfio_get_device(vfio_group, mdevid, &vapdev->vdev, errp);
+    ret = vfio_attach_device(vbasedev->name, vbasedev,
+                             &address_space_memory, errp);
     if (ret) {
-        goto out_get_dev_err;
+        goto error;
     }
 
     vfio_ap_register_irq_notifier(vapdev, VFIO_AP_REQ_IRQ_INDEX, &err);
@@ -228,22 +188,19 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
         error_report_err(err);
     }
 
-    return;
-
-out_get_dev_err:
-    vfio_ap_put_device(vapdev);
-    vfio_put_group(vfio_group);
+error:
+    error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
+    g_free(vbasedev->name);
 }
 
 static void vfio_ap_unrealize(DeviceState *dev)
 {
     APDevice *apdev = AP_DEVICE(dev);
     VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev);
-    VFIOGroup *group = vapdev->vdev.group;
 
     vfio_ap_unregister_irq_notifier(vapdev, VFIO_AP_REQ_IRQ_INDEX);
-    vfio_ap_put_device(vapdev);
-    vfio_put_group(group);
+    vfio_detach_device(&vapdev->vdev);
+    g_free(vapdev->vdev.name);
 }
 
 static Property vfio_ap_properties[] = {
-- 
2.41.0




* [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (8 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 09/15] vfio/ap: " Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-08 10:21   ` Duan, Zhenzhong
  2023-10-04 15:43 ` [PATCH v4 11/15] vfio/common: Move VFIO reset handler registration to a group agnostic function Eric Auger
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

Let the vfio-ccw device use vfio_attach_device() and
vfio_detach_device(), hence hiding the details of the used
IOMMU backend.

Note that the conversion replaces the error message
"vfio: subchannel %s has already been attached" (featuring
cssid.ssid.devid) with the more generic "device is already attached".

Also, now that all the devices have been converted to the new
vfio_attach_device()/vfio_detach_device() API, let's turn the
legacy functions into static functions, local to container.c.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>

---

v3 -> v4:
- simplified vbasedev->dev setting

v2 -> v3:
- Hopefully fix confusion between vbasedev->name, mdevid and sysfsdev
  while taking into account Matthew's comment
  https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
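
Illustration only: to make the relationship between the identifiers
concrete, here is a minimal sketch of how the name and sysfsdev strings
are derived. The two format strings are the ones used in the patch; the
helper ccw_format_ids(), the buffer handling and the sample values are
assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the subchannel name ("cssid.ssid.devid")
 * and the sysfs path of the mediated device below it.
 * Returns 0 on success, -1 if either buffer is too small.
 */
int ccw_format_ids(unsigned cssid, unsigned ssid, unsigned devid,
                   const char *mdevid,
                   char *name, size_t name_len,
                   char *sysfsdev, size_t sysfsdev_len)
{
    int n;

    assert(mdevid && name && sysfsdev);

    n = snprintf(name, name_len, "%x.%x.%04x", cssid, ssid, devid);
    if (n < 0 || (size_t)n >= name_len) {
        return -1;
    }

    n = snprintf(sysfsdev, sysfsdev_len, "/sys/bus/css/devices/%s/%s",
                 name, mdevid);
    if (n < 0 || (size_t)n >= sysfsdev_len) {
        return -1;
    }
    return 0;
}
```

In the patch, vbasedev->name holds the subchannel id, while
vfio_attach_device() is passed cdev->mdevid as the device name.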
---
 include/hw/vfio/vfio-common.h |   5 --
 hw/vfio/ccw.c                 | 122 +++++++++-------------------------
 hw/vfio/common.c              |  10 +--
 3 files changed, 37 insertions(+), 100 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 12fbfbc37d..c486bdef2a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -202,7 +202,6 @@ typedef struct {
     hwaddr pages;
 } VFIOBitmap;
 
-void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
 void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
@@ -220,11 +219,7 @@ void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
-void vfio_put_group(VFIOGroup *group);
 struct vfio_device_info *vfio_get_device_info(int fd);
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp);
 int vfio_attach_device(char *name, VFIODevice *vbasedev,
                        AddressSpace *as, Error **errp);
 void vfio_detach_device(VFIODevice *vbasedev);
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 1e2fce83b0..6ec35fedc9 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -572,88 +572,15 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev)
     g_free(vcdev->io_region);
 }
 
-static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
-{
-    g_free(vcdev->vdev.name);
-    vfio_put_base_device(&vcdev->vdev);
-}
-
-static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
-                                Error **errp)
-{
-    S390CCWDevice *cdev = S390_CCW_DEVICE(vcdev);
-    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
-                                 cdev->hostid.ssid,
-                                 cdev->hostid.devid);
-    VFIODevice *vbasedev;
-
-    QLIST_FOREACH(vbasedev, &group->device_list, next) {
-        if (strcmp(vbasedev->name, name) == 0) {
-            error_setg(errp, "vfio: subchannel %s has already been attached",
-                       name);
-            goto out_err;
-        }
-    }
-
-    /*
-     * All vfio-ccw devices are believed to operate in a way compatible with
-     * discarding of memory in RAM blocks, ie. pages pinned in the host are
-     * in the current working set of the guest driver and therefore never
-     * overlap e.g., with pages available to the guest balloon driver.  This
-     * needs to be set before vfio_get_device() for vfio common to handle
-     * ram_block_discard_disable().
-     */
-    vcdev->vdev.ram_block_discard_allowed = true;
-
-    if (vfio_get_device(group, cdev->mdevid, &vcdev->vdev, errp)) {
-        goto out_err;
-    }
-
-    vcdev->vdev.ops = &vfio_ccw_ops;
-    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
-    vcdev->vdev.name = name;
-    vcdev->vdev.dev = DEVICE(vcdev);
-
-    return;
-
-out_err:
-    g_free(name);
-}
-
-static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
-{
-    char *tmp, group_path[PATH_MAX];
-    ssize_t len;
-    int groupid;
-
-    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
-                          cdev->hostid.cssid, cdev->hostid.ssid,
-                          cdev->hostid.devid, cdev->mdevid);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len <= 0 || len >= sizeof(group_path)) {
-        error_setg(errp, "vfio: no iommu_group found");
-        return NULL;
-    }
-
-    group_path[len] = 0;
-
-    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
-        error_setg(errp, "vfio: failed to read %s", group_path);
-        return NULL;
-    }
-
-    return vfio_get_group(groupid, &address_space_memory, errp);
-}
-
 static void vfio_ccw_realize(DeviceState *dev, Error **errp)
 {
-    VFIOGroup *group;
     S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
     VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
+    VFIODevice *vbasedev = &vcdev->vdev;
     Error *err = NULL;
+    char *name;
+    int ret;
 
     /* Call the class init function for subchannel. */
     if (cdc->realize) {
@@ -663,14 +590,31 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    group = vfio_ccw_get_group(cdev, &err);
-    if (!group) {
-        goto out_group_err;
-    }
+    name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
+                           vcdev->cdev.hostid.ssid,
+                           vcdev->cdev.hostid.devid);
+    vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
+                                         name,
+                                         cdev->mdevid);
+    vbasedev->ops = &vfio_ccw_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
+    vbasedev->name = name;
+    vbasedev->dev = dev;
 
-    vfio_ccw_get_device(group, vcdev, &err);
-    if (err) {
-        goto out_device_err;
+    /*
+     * All vfio-ccw devices are believed to operate in a way compatible with
+     * discarding of memory in RAM blocks, ie. pages pinned in the host are
+     * in the current working set of the guest driver and therefore never
+     * overlap e.g., with pages available to the guest balloon driver.  This
+     * needs to be set before vfio_get_device() for vfio common to handle
+     * ram_block_discard_disable().
+     */
+    vbasedev->ram_block_discard_allowed = true;
+
+    ret = vfio_attach_device(cdev->mdevid, vbasedev,
+                             &address_space_memory, errp);
+    if (ret) {
+        goto out_attach_dev_err;
     }
 
     vfio_ccw_get_region(vcdev, &err);
@@ -708,10 +652,9 @@ out_irq_notifier_err:
 out_io_notifier_err:
     vfio_ccw_put_region(vcdev);
 out_region_err:
-    vfio_ccw_put_device(vcdev);
-out_device_err:
-    vfio_put_group(group);
-out_group_err:
+    vfio_detach_device(vbasedev);
+out_attach_dev_err:
+    g_free(vbasedev->name);
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
     }
@@ -724,14 +667,13 @@ static void vfio_ccw_unrealize(DeviceState *dev)
     S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
     VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
-    VFIOGroup *group = vcdev->vdev.group;
 
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX);
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX);
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX);
     vfio_ccw_put_region(vcdev);
-    vfio_ccw_put_device(vcdev);
-    vfio_put_group(group);
+    vfio_detach_device(&vcdev->vdev);
+    g_free(vcdev->vdev.name);
 
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f4c33c9858..56cfe94d97 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -2335,7 +2335,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
     char path[32];
@@ -2402,7 +2402,7 @@ free_group_exit:
     return NULL;
 }
 
-void vfio_put_group(VFIOGroup *group)
+static void vfio_put_group(VFIOGroup *group)
 {
     if (!group || !QLIST_EMPTY(&group->device_list)) {
         return;
@@ -2447,8 +2447,8 @@ retry:
     return info;
 }
 
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp)
+static int vfio_get_device(VFIOGroup *group, const char *name,
+                           VFIODevice *vbasedev, Error **errp)
 {
     g_autofree struct vfio_device_info *info = NULL;
     int fd;
@@ -2506,7 +2506,7 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     return 0;
 }
 
-void vfio_put_base_device(VFIODevice *vbasedev)
+static void vfio_put_base_device(VFIODevice *vbasedev)
 {
     if (!vbasedev->group) {
         return;
-- 
2.41.0




* [PATCH v4 11/15] vfio/common: Move VFIO reset handler registration to a group agnostic function
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (9 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 10/15] vfio/ccw: " Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:43 ` [PATCH v4 12/15] vfio/common: Introduce a per container device list Eric Auger
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

From: Zhenzhong Duan <zhenzhong.duan@intel.com>

Move the reset handler registration/unregistration to a place that is not
group specific. vfio_[get/put]_address_space are the best places for that
purpose.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
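
Illustration only, not part of the commit: the register-on-first,
unregister-on-last pattern can be sketched as follows. This is a
standalone model, assuming a hypothetical singly linked list and a
counter standing in for qemu_register_reset()/qemu_unregister_reset().
Unlike vfio_put_address_space(), space_put() here always removes the
entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for VFIOAddressSpace. */
struct space {
    struct space *next;
};

struct space *spaces;       /* stand-in for vfio_address_spaces */
int reset_registrations;    /* 1 while the reset handler is registered */

struct space *space_get(void)
{
    struct space *s = calloc(1, sizeof(*s));

    assert(s);
    if (!spaces) {
        reset_registrations++;  /* first entry: register the handler */
    }
    s->next = spaces;
    spaces = s;
    return s;
}

void space_put(struct space *s)
{
    struct space **p;

    /* Unlink s from the list, wherever it sits. */
    for (p = &spaces; *p; p = &(*p)->next) {
        if (*p == s) {
            *p = s->next;
            break;
        }
    }
    free(s);
    if (!spaces) {
        reset_registrations--;  /* last entry gone: unregister */
    }
}
```

Keying the registration on the address space list rather than on the
group list is what makes it independent of the backend.
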
 hw/vfio/common.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 56cfe94d97..019da387d2 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1909,6 +1909,10 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     space->as = as;
     QLIST_INIT(&space->containers);
 
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
+        qemu_register_reset(vfio_reset_handler, NULL);
+    }
+
     QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
 
     return space;
@@ -1920,6 +1924,9 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
         QLIST_REMOVE(space, list);
         g_free(space);
     }
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
+        qemu_unregister_reset(vfio_reset_handler, NULL);
+    }
 }
 
 /*
@@ -2385,10 +2392,6 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
         goto close_fd_exit;
     }
 
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_register_reset(vfio_reset_handler, NULL);
-    }
-
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
     return group;
@@ -2417,10 +2420,6 @@ static void vfio_put_group(VFIOGroup *group)
     trace_vfio_put_group(group->fd);
     close(group->fd);
     g_free(group);
-
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_unregister_reset(vfio_reset_handler, NULL);
-    }
 }
 
 struct vfio_device_info *vfio_get_device_info(int fd)
-- 
2.41.0




* [PATCH v4 12/15] vfio/common: Introduce a per container device list
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (10 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 11/15] vfio/common: Move VFIO reset handler registration to a group agnostic function Eric Auger
@ 2023-10-04 15:43 ` Eric Auger
  2023-10-04 15:44 ` [PATCH v4 13/15] vfio/common: Store the parent container in VFIODevice Eric Auger
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:43 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

From: Zhenzhong Duan <zhenzhong.duan@intel.com>

Several functions need to iterate over the VFIO devices attached to
a given container.  This is currently achieved by iterating over the
groups attached to the container and then over the devices in the group.
Let's introduce a per container device list that simplifies this
search.

The per-container list is used in the following functions:
vfio_devices_all_dirty_tracking
vfio_devices_all_device_dirty_tracking
vfio_devices_all_running_and_mig_active
vfio_section_is_vfio_pci
vfio_devices_dma_logging_stop
vfio_devices_dma_logging_start
vfio_devices_query_dirty_bitmap

This will also ease the move to IOMMUFD by hiding the group
specificity.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
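
Illustration only, not part of the commit: the effect of the second
list entry can be sketched as below. This is a simplified standalone
model with hypothetical structures and plain pointers instead of the
QLIST macros; it shows one device linked into two lists and the nested
walk collapsing into a single loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the QEMU structures. */
struct device {
    struct device *group_next;      /* link in the owning group's list */
    struct device *container_next;  /* link in the container-wide list */
    bool dirty_pages_supported;
};

struct group {
    struct group *container_next;
    struct device *devices;
};

struct container {
    struct group *groups;
    struct device *devices;         /* the new per-container list */
};

/* Link a device into both lists; g must already belong to c. */
void device_attach(struct container *c, struct group *g, struct device *d)
{
    d->group_next = g->devices;
    g->devices = d;
    d->container_next = c->devices;
    c->devices = d;
}

/* Old style: walk every group, then every device of that group. */
bool all_dirty_capable_nested(const struct container *c)
{
    const struct group *g;
    const struct device *d;

    for (g = c->groups; g; g = g->container_next) {
        for (d = g->devices; d; d = d->group_next) {
            if (!d->dirty_pages_supported) {
                return false;
            }
        }
    }
    return true;
}

/* New style: a single walk over the container's own device list. */
bool all_dirty_capable_flat(const struct container *c)
{
    const struct device *d;

    for (d = c->devices; d; d = d->container_next) {
        if (!d->dirty_pages_supported) {
            return false;
        }
    }
    return true;
}
```

Both walks visit the same set of devices as long as every attach and
detach keeps the two lists in sync.
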
 include/hw/vfio/vfio-common.h |   2 +
 hw/vfio/common.c              | 145 +++++++++++++++-------------------
 2 files changed, 67 insertions(+), 80 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c486bdef2a..8ca70dd821 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -98,6 +98,7 @@ typedef struct VFIOContainer {
     QLIST_HEAD(, VFIOGroup) group_list;
     QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
     QLIST_ENTRY(VFIOContainer) next;
+    QLIST_HEAD(, VFIODevice) device_list;
 } VFIOContainer;
 
 typedef struct VFIOGuestIOMMU {
@@ -129,6 +130,7 @@ typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef struct VFIODevice {
     QLIST_ENTRY(VFIODevice) next;
+    QLIST_ENTRY(VFIODevice) container_next;
     struct VFIOGroup *group;
     char *sysfsdev;
     char *name;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 019da387d2..ef9dc7c747 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -218,7 +218,6 @@ bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
 
 static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 {
-    VFIOGroup *group;
     VFIODevice *vbasedev;
     MigrationState *ms = migrate_get_current();
 
@@ -227,19 +226,17 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
         return false;
     }
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            VFIOMigration *migration = vbasedev->migration;
+    QLIST_FOREACH(vbasedev, &container->device_list, container_next) {
+        VFIOMigration *migration = vbasedev->migration;
 
-            if (!migration) {
-                return false;
-            }
+        if (!migration) {
+            return false;
+        }
 
-            if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
-                (vfio_device_state_is_running(vbasedev) ||
-                 vfio_device_state_is_precopy(vbasedev))) {
-                return false;
-            }
+        if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
+            (vfio_device_state_is_running(vbasedev) ||
+             vfio_device_state_is_precopy(vbasedev))) {
+            return false;
         }
     }
     return true;
@@ -247,14 +244,11 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 
 static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
 {
-    VFIOGroup *group;
     VFIODevice *vbasedev;
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (!vbasedev->dirty_pages_supported) {
-                return false;
-            }
+    QLIST_FOREACH(vbasedev, &container->device_list, container_next) {
+        if (!vbasedev->dirty_pages_supported) {
+            return false;
         }
     }
 
@@ -267,27 +261,24 @@ static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
  */
 static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 {
-    VFIOGroup *group;
     VFIODevice *vbasedev;
 
     if (!migration_is_active(migrate_get_current())) {
         return false;
     }
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            VFIOMigration *migration = vbasedev->migration;
+    QLIST_FOREACH(vbasedev, &container->device_list, container_next) {
+        VFIOMigration *migration = vbasedev->migration;
 
-            if (!migration) {
-                return false;
-            }
+        if (!migration) {
+            return false;
+        }
 
-            if (vfio_device_state_is_running(vbasedev) ||
-                vfio_device_state_is_precopy(vbasedev)) {
-                continue;
-            } else {
-                return false;
-            }
+        if (vfio_device_state_is_running(vbasedev) ||
+            vfio_device_state_is_precopy(vbasedev)) {
+            continue;
+        } else {
+            return false;
         }
     }
     return true;
@@ -1187,20 +1178,17 @@ static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
 {
     VFIOPCIDevice *pcidev;
     VFIODevice *vbasedev;
-    VFIOGroup *group;
     Object *owner;
 
     owner = memory_region_owner(section->mr);
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
-                continue;
-            }
-            pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
-            if (OBJECT(pcidev) == owner) {
-                return true;
-            }
+    QLIST_FOREACH(vbasedev, &container->device_list, container_next) {
+        if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
+            continue;
+        }
+        pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+        if (OBJECT(pcidev) == owner) {
+            return true;
         }
     }
 
@@ -1296,24 +1284,21 @@ static void vfio_devices_dma_logging_stop(VFIOContainer *container)
                               sizeof(uint64_t))] = {};
     struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
     VFIODevice *vbasedev;
-    VFIOGroup *group;
 
     feature->argsz = sizeof(buf);
     feature->flags = VFIO_DEVICE_FEATURE_SET |
                      VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP;
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (!vbasedev->dirty_tracking) {
-                continue;
-            }
-
-            if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
-                warn_report("%s: Failed to stop DMA logging, err %d (%s)",
-                             vbasedev->name, -errno, strerror(errno));
-            }
-            vbasedev->dirty_tracking = false;
+    QLIST_FOREACH(vbasedev, &container->device_list, container_next) {
+        if (!vbasedev->dirty_tracking) {
+            continue;
         }
+
+        if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+            warn_report("%s: Failed to stop DMA logging, err %d (%s)",
+                        vbasedev->name, -errno, strerror(errno));
+        }
+        vbasedev->dirty_tracking = false;
     }
 }
 
@@ -1396,7 +1381,6 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
     struct vfio_device_feature *feature;
     VFIODirtyRanges ranges;
     VFIODevice *vbasedev;
-    VFIOGroup *group;
     int ret = 0;
 
     vfio_dirty_tracking_init(container, &ranges);
@@ -1406,21 +1390,19 @@ static int vfio_devices_dma_logging_start(VFIOContainer *container)
         return -errno;
     }
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->dirty_tracking) {
-                continue;
-            }
-
-            ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
-            if (ret) {
-                ret = -errno;
-                error_report("%s: Failed to start DMA logging, err %d (%s)",
-                             vbasedev->name, ret, strerror(errno));
-                goto out;
-            }
-            vbasedev->dirty_tracking = true;
+    QLIST_FOREACH(vbasedev, &container->device_list, container_next) {
+        if (vbasedev->dirty_tracking) {
+            continue;
         }
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature);
+        if (ret) {
+            ret = -errno;
+            error_report("%s: Failed to start DMA logging, err %d (%s)",
+                         vbasedev->name, ret, strerror(errno));
+            goto out;
+        }
+        vbasedev->dirty_tracking = true;
     }
 
 out:
@@ -1500,21 +1482,18 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
                                            hwaddr size)
 {
     VFIODevice *vbasedev;
-    VFIOGroup *group;
     int ret;
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            ret = vfio_device_dma_logging_report(vbasedev, iova, size,
-                                                 vbmap->bitmap);
-            if (ret) {
-                error_report("%s: Failed to get DMA logging report, iova: "
-                             "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
-                             ", err: %d (%s)",
-                             vbasedev->name, iova, size, ret, strerror(-ret));
+    QLIST_FOREACH(vbasedev, &container->device_list, container_next) {
+        ret = vfio_device_dma_logging_report(vbasedev, iova, size,
+                                             vbmap->bitmap);
+        if (ret) {
+            error_report("%s: Failed to get DMA logging report, iova: "
+                         "0x%" HWADDR_PRIx ", size: 0x%" HWADDR_PRIx
+                         ", err: %d (%s)",
+                         vbasedev->name, iova, size, ret, strerror(-ret));
 
-                return ret;
-            }
+            return ret;
         }
     }
 
@@ -2648,6 +2627,7 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
     int groupid = vfio_device_groupid(vbasedev, errp);
     VFIODevice *vbasedev_iter;
     VFIOGroup *group;
+    VFIOContainer *container;
     int ret;
 
     if (groupid < 0) {
@@ -2671,8 +2651,12 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
     ret = vfio_get_device(group, name, vbasedev, errp);
     if (ret) {
         vfio_put_group(group);
+        return ret;
     }
 
+    container = group->container;
+    QLIST_INSERT_HEAD(&container->device_list, vbasedev, container_next);
+
     return ret;
 }
 
@@ -2680,6 +2664,7 @@ void vfio_detach_device(VFIODevice *vbasedev)
 {
     VFIOGroup *group = vbasedev->group;
 
+    QLIST_REMOVE(vbasedev, container_next);
     trace_vfio_detach_device(vbasedev->name, group->groupid);
     vfio_put_base_device(vbasedev);
     vfio_put_group(group);
-- 
2.41.0




* [PATCH v4 13/15] vfio/common: Store the parent container in VFIODevice
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (11 preceding siblings ...)
  2023-10-04 15:43 ` [PATCH v4 12/15] vfio/common: Introduce a per container device list Eric Auger
@ 2023-10-04 15:44 ` Eric Auger
  2023-10-04 17:05   ` Cédric Le Goater
  2023-10-04 15:44 ` [PATCH v4 14/15] vfio/common: Introduce a global VFIODevice list Eric Auger
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:44 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

From: Zhenzhong Duan <zhenzhong.duan@intel.com>

Let's store the parent container within the VFIODevice.
This simplifies the logic in vfio_viommu_preset() and
has the benefit of hiding the group specificity, which
is useful when migrating to IOMMUFD.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

---

v3 -> v4:
- Dropped check on !vbasedev->container
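
For readers skimming the series, here is a minimal standalone sketch of
the idea, not the QEMU code itself. The types Device, Group, Container
and Space and the helpers device_attach(), device_detach() and
viommu_preset() are hypothetical, heavily reduced stand-ins; only the
pointer relationships mirror the patch. The container is cached in the
device at attach time, so callers no longer need to go through the
group to reach it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the QEMU types (assumptions). */
typedef struct AddressSpace { const char *name; } AddressSpace;
typedef struct Space { AddressSpace *as; } Space;
typedef struct Container { Space *space; } Container;
typedef struct Group { Container *container; } Group;

typedef struct Device {
    Group *group;          /* only meaningful for the legacy backend */
    Container *container;  /* cached at attach time */
} Device;

/* Stand-in for the system memory address space. */
static AddressSpace system_memory = { "memory" };

/* Attach: remember the group and cache its container in the device. */
static void device_attach(Device *dev, Group *group)
{
    assert(group->container != NULL);
    dev->group = group;
    dev->container = group->container;
}

/* Detach: clear the cached pointers so stale use is easy to spot. */
static void device_detach(Device *dev)
{
    dev->container = NULL;
    dev->group = NULL;
}

/* Group-agnostic check: only the cached container is dereferenced. */
static bool viommu_preset(const Device *dev)
{
    return dev->container->space->as != &system_memory;
}
```

A caller attaches a device to a group whose container sits in a non
system-memory address space and viommu_preset() then returns true,
without the caller ever touching dev->group.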
---
 include/hw/vfio/vfio-common.h | 1 +
 hw/vfio/common.c              | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8ca70dd821..bf12e40667 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -132,6 +132,7 @@ typedef struct VFIODevice {
     QLIST_ENTRY(VFIODevice) next;
     QLIST_ENTRY(VFIODevice) container_next;
     struct VFIOGroup *group;
+    VFIOContainer *container;
     char *sysfsdev;
     char *name;
     DeviceState *dev;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ef9dc7c747..b14f04c9b0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -184,7 +184,7 @@ void vfio_unblock_multiple_devices_migration(void)
 
 bool vfio_viommu_preset(VFIODevice *vbasedev)
 {
-    return vbasedev->group->container->space->as != &address_space_memory;
+    return vbasedev->container->space->as != &address_space_memory;
 }
 
 static void vfio_set_migration_error(int err)
@@ -2655,6 +2655,7 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
     }
 
     container = group->container;
+    vbasedev->container = container;
     QLIST_INSERT_HEAD(&container->device_list, vbasedev, container_next);
 
     return ret;
@@ -2665,6 +2666,7 @@ void vfio_detach_device(VFIODevice *vbasedev)
     VFIOGroup *group = vbasedev->group;
 
     QLIST_REMOVE(vbasedev, container_next);
+    vbasedev->container = NULL;
     trace_vfio_detach_device(vbasedev->name, group->groupid);
     vfio_put_base_device(vbasedev);
     vfio_put_group(group);
-- 
2.41.0




* [PATCH v4 14/15] vfio/common: Introduce a global VFIODevice list
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (12 preceding siblings ...)
  2023-10-04 15:44 ` [PATCH v4 13/15] vfio/common: Store the parent container in VFIODevice Eric Auger
@ 2023-10-04 15:44 ` Eric Auger
  2023-10-04 17:05   ` Cédric Le Goater
  2023-10-04 15:44 ` [PATCH v4 15/15] vfio/common: Move legacy VFIO backend code into separate container.c Eric Auger
  2023-10-04 19:58 ` [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Cédric Le Goater
  15 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:44 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

From: Zhenzhong Duan <zhenzhong.duan@intel.com>

Some functions iterate over all the VFIODevices. This is currently
achieved by iterating over all groups and, within each group, over
its devices. Let's introduce a global list of VFIODevices to simplify
that scan.

This will also be useful while migrating to IOMMUFD by hiding the
group specificity.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
---
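
A minimal standalone sketch of the pattern, assuming the BSD
<sys/queue.h> LIST macros as a stand-in for QEMU's QLIST macros. The
Device type and the helpers device_register(), device_unregister(),
count_devices() and any_blocker() are hypothetical names for
illustration only. The point it tries to show: a device carries one
link field per list it can sit on, and a walk over the global list has
to use the same link field that was used for insertion (global_next
here).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical reduced device: one link field per list membership. */
typedef struct Device {
    LIST_ENTRY(Device) group_next;   /* link in the owning group's list */
    LIST_ENTRY(Device) global_next;  /* link in the global device list */
    int has_blocker;
} Device;

LIST_HEAD(DeviceList, Device);

static struct DeviceList all_devices = LIST_HEAD_INITIALIZER(all_devices);

/* Insert into the global list through the global_next link. */
static void device_register(Device *dev)
{
    LIST_INSERT_HEAD(&all_devices, dev, global_next);
}

/* Unlink from the global list, again through global_next. */
static void device_unregister(Device *dev)
{
    LIST_REMOVE(dev, global_next);
}

/* Flat scan: no outer loop over groups is needed any more. */
static size_t count_devices(void)
{
    Device *dev;
    size_t n = 0;

    LIST_FOREACH(dev, &all_devices, global_next) {
        n++;
    }
    return n;
}

/* Returns 1 if any registered device has a blocker set, else 0. */
static int any_blocker(void)
{
    Device *dev;

    LIST_FOREACH(dev, &all_devices, global_next) {
        if (dev->has_blocker) {
            return 1;
        }
    }
    return 0;
}
```

Walking all_devices through group_next instead would follow the
per-group links and visit the wrong set of devices, which is why the
link field name matters as much as the list head.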
 include/hw/vfio/vfio-common.h |  2 ++
 hw/vfio/common.c              | 45 +++++++++++++++--------------------
 2 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bf12e40667..54905b9dd4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -131,6 +131,7 @@ typedef struct VFIODeviceOps VFIODeviceOps;
 typedef struct VFIODevice {
     QLIST_ENTRY(VFIODevice) next;
     QLIST_ENTRY(VFIODevice) container_next;
+    QLIST_ENTRY(VFIODevice) global_next;
     struct VFIOGroup *group;
     VFIOContainer *container;
     char *sysfsdev;
@@ -232,6 +233,7 @@ int vfio_kvm_device_del_fd(int fd, Error **errp);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
+typedef QLIST_HEAD(VFIODeviceList, VFIODevice) VFIODeviceList;
 extern VFIOGroupList vfio_group_list;
 
 bool vfio_mig_active(void);
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b14f04c9b0..5dd03763d2 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -48,6 +48,8 @@
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
+static VFIODeviceList vfio_device_list =
+    QLIST_HEAD_INITIALIZER(vfio_device_list);
 static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
@@ -94,18 +96,15 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
 
 bool vfio_mig_active(void)
 {
-    VFIOGroup *group;
     VFIODevice *vbasedev;
 
-    if (QLIST_EMPTY(&vfio_group_list)) {
+    if (QLIST_EMPTY(&vfio_device_list)) {
         return false;
     }
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->migration_blocker) {
-                return false;
-            }
+    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
+        if (vbasedev->migration_blocker) {
+            return false;
         }
     }
     return true;
@@ -120,19 +119,16 @@ static Error *multiple_devices_migration_blocker;
  */
 static bool vfio_multiple_devices_migration_is_supported(void)
 {
-    VFIOGroup *group;
     VFIODevice *vbasedev;
     unsigned int device_num = 0;
     bool all_support_p2p = true;
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->migration) {
-                device_num++;
+    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
+        if (vbasedev->migration) {
+            device_num++;
 
-                if (!(vbasedev->migration->mig_flags & VFIO_MIGRATION_P2P)) {
-                    all_support_p2p = false;
-                }
+            if (!(vbasedev->migration->mig_flags & VFIO_MIGRATION_P2P)) {
+                all_support_p2p = false;
             }
         }
     }
@@ -1777,22 +1773,17 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
 
 void vfio_reset_handler(void *opaque)
 {
-    VFIOGroup *group;
     VFIODevice *vbasedev;
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->dev->realized) {
-                vbasedev->ops->vfio_compute_needs_reset(vbasedev);
-            }
+    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
+        if (vbasedev->dev->realized) {
+            vbasedev->ops->vfio_compute_needs_reset(vbasedev);
         }
     }
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->dev->realized && vbasedev->needs_reset) {
-                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
-            }
+    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
+        if (vbasedev->dev->realized && vbasedev->needs_reset) {
+            vbasedev->ops->vfio_hot_reset_multi(vbasedev);
         }
     }
 }
@@ -2657,6 +2648,7 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
     container = group->container;
     vbasedev->container = container;
     QLIST_INSERT_HEAD(&container->device_list, vbasedev, container_next);
+    QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
 
     return ret;
 }
@@ -2665,6 +2657,7 @@ void vfio_detach_device(VFIODevice *vbasedev)
 {
     VFIOGroup *group = vbasedev->group;
 
+    QLIST_REMOVE(vbasedev, global_next);
     QLIST_REMOVE(vbasedev, container_next);
     vbasedev->container = NULL;
     trace_vfio_detach_device(vbasedev->name, group->groupid);
-- 
2.41.0




* [PATCH v4 15/15] vfio/common: Move legacy VFIO backend code into separate container.c
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (13 preceding siblings ...)
  2023-10-04 15:44 ` [PATCH v4 14/15] vfio/common: Introduce a global VFIODevice list Eric Auger
@ 2023-10-04 15:44 ` Eric Auger
  2023-10-04 19:58 ` [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Cédric Le Goater
  15 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-04 15:44 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, zhenzhong.duan,
	alex.williamson, clg, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

From: Yi Liu <yi.l.liu@intel.com>

Move all the code that really depends on the legacy VFIO container/group
into a separate file: container.c. What remains in common.c is
the code related to VFIOAddressSpace, MemoryListeners, migration and
all other general operations.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>

---

v3 -> v4:
- added dropped comment
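
One of the helpers moved out of common.c by this patch walks the
capability chain returned with the IOMMU info. For readers who have not
seen that pattern, here is a minimal standalone sketch of such a walk.
The struct layouts (struct info, struct cap_header), the flag value
INFO_FLAG_CAPS and the helpers find_cap() and put_cap() are simplified
assumptions for illustration, not the kernel uAPI definitions. Each
capability header stores the offset of the next one relative to the
start of the buffer, and an offset of 0 ends the chain.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INFO_FLAG_CAPS (1u << 0)   /* hypothetical flag value */

/* Simplified info header; the real structure has more fields. */
struct info {
    uint32_t argsz;       /* total size of the buffer in bytes */
    uint32_t flags;
    uint32_t cap_offset;  /* offset of the first capability, 0 if none */
};

/* Simplified capability header. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;        /* offset of the next capability, 0 at the end */
};

/*
 * Return the offset of the capability with the given id, or 0 if the
 * buffer advertises no capabilities or the id is not in the chain.
 * memcpy() is used to read headers so alignment does not matter.
 */
static uint32_t find_cap(const unsigned char *buf, uint16_t id)
{
    struct info info;
    struct cap_header hdr;
    uint32_t off;

    memcpy(&info, buf, sizeof(info));
    if (!(info.flags & INFO_FLAG_CAPS)) {
        return 0;
    }

    for (off = info.cap_offset;
         off != 0 && off + sizeof(hdr) <= info.argsz;
         off = hdr.next) {
        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.id == id) {
            return off;
        }
    }
    return 0;
}

/* Test helper: write a capability header at the given offset. */
static void put_cap(unsigned char *buf, uint32_t off,
                    uint16_t id, uint32_t next)
{
    struct cap_header hdr = { .id = id, .version = 1, .next = next };

    memcpy(buf + off, &hdr, sizeof(hdr));
}
```

The bounds check against argsz is an addition of this sketch. A robust
parser would also want to guard against cycles in the chain, which this
sketch does not attempt.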
---
 include/hw/vfio/vfio-common.h |   35 +
 hw/vfio/common.c              | 1151 +-------------------------------
 hw/vfio/container.c           | 1157 +++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |    1 +
 4 files changed, 1209 insertions(+), 1135 deletions(-)
 create mode 100644 hw/vfio/container.c

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 54905b9dd4..7780b9073a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -206,6 +206,30 @@ typedef struct {
     hwaddr pages;
 } VFIOBitmap;
 
+void vfio_host_win_add(VFIOContainer *container,
+                       hwaddr min_iova, hwaddr max_iova,
+                       uint64_t iova_pgsizes);
+int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+                      hwaddr max_iova);
+VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
+void vfio_put_address_space(VFIOAddressSpace *space);
+bool vfio_devices_all_running_and_saving(VFIOContainer *container);
+
+/* container->fd */
+int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
+                   ram_addr_t size, IOMMUTLBEntry *iotlb);
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly);
+int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
+int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
+                            hwaddr iova, hwaddr size);
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp);
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section);
+
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
 void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
@@ -235,6 +259,10 @@ extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 typedef QLIST_HEAD(VFIODeviceList, VFIODevice) VFIODeviceList;
 extern VFIOGroupList vfio_group_list;
+extern VFIODeviceList vfio_device_list;
+
+extern const MemoryListener vfio_memory_listener;
+extern int vfio_kvm_device_fd;
 
 bool vfio_mig_active(void);
 int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp);
@@ -272,4 +300,11 @@ bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
 void vfio_migration_exit(VFIODevice *vbasedev);
 
 int vfio_bitmap_alloc(VFIOBitmap *vbmap, hwaddr size);
+bool vfio_devices_all_running_and_mig_active(VFIOContainer *container);
+bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container);
+int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
+                                    VFIOBitmap *vbmap, hwaddr iova,
+                                    hwaddr size);
+int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                          uint64_t size, ram_addr_t ram_addr);
 #endif /* HW_VFIO_VFIO_COMMON_H */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5dd03763d2..9e61de03ee 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -46,9 +46,7 @@
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
 
-VFIOGroupList vfio_group_list =
-    QLIST_HEAD_INITIALIZER(vfio_group_list);
-static VFIODeviceList vfio_device_list =
+VFIODeviceList vfio_device_list =
     QLIST_HEAD_INITIALIZER(vfio_device_list);
 static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
@@ -61,39 +59,13 @@ static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
  * initialized, this file descriptor is only released on QEMU exit and
  * we'll re-use it should another vfio device be attached before then.
  */
-static int vfio_kvm_device_fd = -1;
+int vfio_kvm_device_fd = -1;
 #endif
 
-static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
-{
-    switch (container->iommu_type) {
-    case VFIO_TYPE1v2_IOMMU:
-    case VFIO_TYPE1_IOMMU:
-        /*
-         * We support coordinated discarding of RAM via the RamDiscardManager.
-         */
-        return ram_block_uncoordinated_discard_disable(state);
-    default:
-        /*
-         * VFIO_SPAPR_TCE_IOMMU most probably works just fine with
-         * RamDiscardManager, however, it is completely untested.
-         *
-         * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does
-         * completely the opposite of managing mapping/pinning dynamically as
-         * required by RamDiscardManager. We would have to special-case sections
-         * with a RamDiscardManager.
-         */
-        return ram_block_discard_disable(state);
-    }
-}
-
 /*
  * Device state interfaces
  */
 
-static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                                 uint64_t size, ram_addr_t ram_addr);
-
 bool vfio_mig_active(void)
 {
     VFIODevice *vbasedev;
@@ -238,7 +210,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
     return true;
 }
 
-static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
+bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
 {
     VFIODevice *vbasedev;
 
@@ -255,7 +227,7 @@ static bool vfio_devices_all_device_dirty_tracking(VFIOContainer *container)
  * Check if all VFIO devices are running and migration is active, which is
  * essentially equivalent to the migration being in pre-copy phase.
  */
-static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
+bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
 {
     VFIODevice *vbasedev;
 
@@ -280,150 +252,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
     return true;
 }
 
-static int vfio_dma_unmap_bitmap(VFIOContainer *container,
-                                 hwaddr iova, ram_addr_t size,
-                                 IOMMUTLBEntry *iotlb)
-{
-    struct vfio_iommu_type1_dma_unmap *unmap;
-    struct vfio_bitmap *bitmap;
-    VFIOBitmap vbmap;
-    int ret;
-
-    ret = vfio_bitmap_alloc(&vbmap, size);
-    if (ret) {
-        return ret;
-    }
-
-    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
-
-    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
-    unmap->iova = iova;
-    unmap->size = size;
-    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
-    bitmap = (struct vfio_bitmap *)&unmap->data;
-
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
-     * to qemu_real_host_page_size.
-     */
-    bitmap->pgsize = qemu_real_host_page_size();
-    bitmap->size = vbmap.size;
-    bitmap->data = (__u64 *)vbmap.bitmap;
-
-    if (vbmap.size > container->max_dirty_bitmap_size) {
-        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap.size);
-        ret = -E2BIG;
-        goto unmap_exit;
-    }
-
-    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
-    if (!ret) {
-        cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap,
-                iotlb->translated_addr, vbmap.pages);
-    } else {
-        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
-    }
-
-unmap_exit:
-    g_free(unmap);
-    g_free(vbmap.bitmap);
-
-    return ret;
-}
-
-/*
- * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
- */
-static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size,
-                          IOMMUTLBEntry *iotlb)
-{
-    struct vfio_iommu_type1_dma_unmap unmap = {
-        .argsz = sizeof(unmap),
-        .flags = 0,
-        .iova = iova,
-        .size = size,
-    };
-    bool need_dirty_sync = false;
-    int ret;
-
-    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
-        if (!vfio_devices_all_device_dirty_tracking(container) &&
-            container->dirty_pages_supported) {
-            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
-        }
-
-        need_dirty_sync = true;
-    }
-
-    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
-        /*
-         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
-         * v4.15) where an overflow in its wrap-around check prevents us from
-         * unmapping the last page of the address space.  Test for the error
-         * condition and re-try the unmap excluding the last page.  The
-         * expectation is that we've never mapped the last page anyway and this
-         * unmap request comes via vIOMMU support which also makes it unlikely
-         * that this page is used.  This bug was introduced well after type1 v2
-         * support was introduced, so we shouldn't need to test for v1.  A fix
-         * is queued for kernel v5.0 so this workaround can be removed once
-         * affected kernels are sufficiently deprecated.
-         */
-        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
-            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
-            trace_vfio_dma_unmap_overflow_workaround();
-            unmap.size -= 1ULL << ctz64(container->pgsizes);
-            continue;
-        }
-        error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
-        return -errno;
-    }
-
-    if (need_dirty_sync) {
-        ret = vfio_get_dirty_bitmap(container, iova, size,
-                                    iotlb->translated_addr);
-        if (ret) {
-            return ret;
-        }
-    }
-
-    return 0;
-}
-
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                        ram_addr_t size, void *vaddr, bool readonly)
-{
-    struct vfio_iommu_type1_dma_map map = {
-        .argsz = sizeof(map),
-        .flags = VFIO_DMA_MAP_FLAG_READ,
-        .vaddr = (__u64)(uintptr_t)vaddr,
-        .iova = iova,
-        .size = size,
-    };
-
-    if (!readonly) {
-        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
-    }
-
-    /*
-     * Try the mapping, if it fails with EBUSY, unmap the region and try
-     * again.  This shouldn't be necessary, but we sometimes see it in
-     * the VGA ROM space.
-     */
-    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
-         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
-        return 0;
-    }
-
-    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
-    return -errno;
-}
-
-static void vfio_host_win_add(VFIOContainer *container,
-                              hwaddr min_iova, hwaddr max_iova,
-                              uint64_t iova_pgsizes)
+void vfio_host_win_add(VFIOContainer *container, hwaddr min_iova,
+                       hwaddr max_iova, uint64_t iova_pgsizes)
 {
     VFIOHostDMAWindow *hostwin;
 
@@ -444,8 +274,8 @@ static void vfio_host_win_add(VFIOContainer *container,
     QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
 }
 
-static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
-                             hwaddr max_iova)
+int vfio_host_win_del(VFIOContainer *container,
+                      hwaddr min_iova, hwaddr max_iova)
 {
     VFIOHostDMAWindow *hostwin;
 
@@ -794,92 +624,6 @@ static bool vfio_get_section_iova_range(VFIOContainer *container,
     return true;
 }
 
-static int vfio_container_add_section_window(VFIOContainer *container,
-                                             MemoryRegionSection *section,
-                                             Error **errp)
-{
-    VFIOHostDMAWindow *hostwin;
-    hwaddr pgsize = 0;
-    int ret;
-
-    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
-        return 0;
-    }
-
-    /* For now intersections are not allowed, we may relax this later */
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-        if (ranges_overlap(hostwin->min_iova,
-                           hostwin->max_iova - hostwin->min_iova + 1,
-                           section->offset_within_address_space,
-                           int128_get64(section->size))) {
-            error_setg(errp,
-                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
-                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
-                section->offset_within_address_space,
-                section->offset_within_address_space +
-                    int128_get64(section->size) - 1,
-                hostwin->min_iova, hostwin->max_iova);
-            return -EINVAL;
-        }
-    }
-
-    ret = vfio_spapr_create_window(container, section, &pgsize);
-    if (ret) {
-        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
-        return ret;
-    }
-
-    vfio_host_win_add(container, section->offset_within_address_space,
-                      section->offset_within_address_space +
-                      int128_get64(section->size) - 1, pgsize);
-#ifdef CONFIG_KVM
-    if (kvm_enabled()) {
-        VFIOGroup *group;
-        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
-        struct kvm_vfio_spapr_tce param;
-        struct kvm_device_attr attr = {
-            .group = KVM_DEV_VFIO_GROUP,
-            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
-            .addr = (uint64_t)(unsigned long)&param,
-        };
-
-        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
-                                          &param.tablefd)) {
-            QLIST_FOREACH(group, &container->group_list, container_next) {
-                param.groupfd = group->fd;
-                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-                    error_setg_errno(errp, errno,
-                                     "vfio: failed GROUP_SET_SPAPR_TCE for "
-                                     "KVM VFIO device %d and group fd %d",
-                                     param.tablefd, param.groupfd);
-                    return -errno;
-                }
-                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
-            }
-        }
-    }
-#endif
-    return 0;
-}
-
-static void vfio_container_del_section_window(VFIOContainer *container,
-                                              MemoryRegionSection *section)
-{
-    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
-        return;
-    }
-
-    vfio_spapr_remove_window(container,
-                             section->offset_within_address_space);
-    if (vfio_host_win_del(container,
-                          section->offset_within_address_space,
-                          section->offset_within_address_space +
-                          int128_get64(section->size) - 1) < 0) {
-        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
-                 __func__, section->offset_within_address_space);
-    }
-}
-
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -1127,33 +871,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
     vfio_container_del_section_window(container, section);
 }
 
-static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
-{
-    int ret;
-    struct vfio_iommu_type1_dirty_bitmap dirty = {
-        .argsz = sizeof(dirty),
-    };
-
-    if (!container->dirty_pages_supported) {
-        return 0;
-    }
-
-    if (start) {
-        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
-    } else {
-        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
-    }
-
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
-    if (ret) {
-        ret = -errno;
-        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
-                     dirty.flags, errno);
-    }
-
-    return ret;
-}
-
 typedef struct VFIODirtyRanges {
     hwaddr min32;
     hwaddr max32;
@@ -1473,9 +1190,9 @@ static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
     return 0;
 }
 
-static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
-                                           VFIOBitmap *vbmap, hwaddr iova,
-                                           hwaddr size)
+int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
+                                    VFIOBitmap *vbmap, hwaddr iova,
+                                    hwaddr size)
 {
     VFIODevice *vbasedev;
     int ret;
@@ -1496,45 +1213,8 @@ static int vfio_devices_query_dirty_bitmap(VFIOContainer *container,
     return 0;
 }
 
-static int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
-                                   hwaddr iova, hwaddr size)
-{
-    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
-    struct vfio_iommu_type1_dirty_bitmap_get *range;
-    int ret;
-
-    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
-
-    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
-    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
-    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
-    range->iova = iova;
-    range->size = size;
-
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize
-     * to qemu_real_host_page_size.
-     */
-    range->bitmap.pgsize = qemu_real_host_page_size();
-    range->bitmap.size = vbmap->size;
-    range->bitmap.data = (__u64 *)vbmap->bitmap;
-
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
-    if (ret) {
-        ret = -errno;
-        error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
-                " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
-                (uint64_t)range->size, errno);
-    }
-
-    g_free(dbitmap);
-
-    return ret;
-}
-
-static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                                 uint64_t size, ram_addr_t ram_addr)
+int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                          uint64_t size, ram_addr_t ram_addr)
 {
     bool all_device_dirty_tracking =
         vfio_devices_all_device_dirty_tracking(container);
@@ -1723,7 +1403,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
     }
 }
 
-static const MemoryListener vfio_memory_listener = {
+const MemoryListener vfio_memory_listener = {
     .name = "vfio",
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
@@ -1732,45 +1412,6 @@ static const MemoryListener vfio_memory_listener = {
     .log_sync = vfio_listener_log_sync,
 };
 
-static void vfio_listener_release(VFIOContainer *container)
-{
-    memory_listener_unregister(&container->listener);
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        memory_listener_unregister(&container->prereg_listener);
-    }
-}
-
-static struct vfio_info_cap_header *
-vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
-bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
-                             unsigned int *avail)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_iommu_type1_info_dma_avail *cap;
-
-    /* If the capability cannot be found, assume no DMA limiting */
-    hdr = vfio_get_iommu_type1_info_cap(info,
-                                        VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL);
-    if (hdr == NULL) {
-        return false;
-    }
-
-    if (avail != NULL) {
-        cap = (void *) hdr;
-        *avail = cap->avail;
-    }
-
-    return true;
-}
-
 void vfio_reset_handler(void *opaque)
 {
     VFIODevice *vbasedev;
@@ -1846,25 +1487,7 @@ int vfio_kvm_device_del_fd(int fd, Error **errp)
     return 0;
 }
 
-static void vfio_kvm_device_add_group(VFIOGroup *group)
-{
-    Error *err = NULL;
-
-    if (vfio_kvm_device_add_fd(group->fd, &err)) {
-        error_reportf_err(err, "group ID %d: ", group->groupid);
-    }
-}
-
-static void vfio_kvm_device_del_group(VFIOGroup *group)
-{
-    Error *err = NULL;
-
-    if (vfio_kvm_device_del_fd(group->fd, &err)) {
-        error_reportf_err(err, "group ID %d: ", group->groupid);
-    }
-}
-
-static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
+VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
 {
     VFIOAddressSpace *space;
 
@@ -1888,7 +1511,7 @@ static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     return space;
 }
 
-static void vfio_put_address_space(VFIOAddressSpace *space)
+void vfio_put_address_space(VFIOAddressSpace *space)
 {
     if (QLIST_EMPTY(&space->containers)) {
         QLIST_REMOVE(space, list);
@@ -1899,499 +1522,6 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
     }
 }
 
-/*
- * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
- */
-static int vfio_get_iommu_type(VFIOContainer *container,
-                               Error **errp)
-{
-    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
-                          VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
-    int i;
-
-    for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
-        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
-            return iommu_types[i];
-        }
-    }
-    error_setg(errp, "No available IOMMU models");
-    return -EINVAL;
-}
-
-static int vfio_init_container(VFIOContainer *container, int group_fd,
-                               Error **errp)
-{
-    int iommu_type, ret;
-
-    iommu_type = vfio_get_iommu_type(container, errp);
-    if (iommu_type < 0) {
-        return iommu_type;
-    }
-
-    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
-    if (ret) {
-        error_setg_errno(errp, errno, "Failed to set group container");
-        return -errno;
-    }
-
-    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
-        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-            /*
-             * On sPAPR, despite the IOMMU subdriver always advertises v1 and
-             * v2, the running platform may not support v2 and there is no
-             * way to guess it until an IOMMU group gets added to the container.
-             * So in case it fails with v2, try v1 as a fallback.
-             */
-            iommu_type = VFIO_SPAPR_TCE_IOMMU;
-            continue;
-        }
-        error_setg_errno(errp, errno, "Failed to set iommu for container");
-        return -errno;
-    }
-
-    container->iommu_type = iommu_type;
-    return 0;
-}
-
-static int vfio_get_iommu_info(VFIOContainer *container,
-                               struct vfio_iommu_type1_info **info)
-{
-
-    size_t argsz = sizeof(struct vfio_iommu_type1_info);
-
-    *info = g_new0(struct vfio_iommu_type1_info, 1);
-again:
-    (*info)->argsz = argsz;
-
-    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
-        g_free(*info);
-        *info = NULL;
-        return -errno;
-    }
-
-    if (((*info)->argsz > argsz)) {
-        argsz = (*info)->argsz;
-        *info = g_realloc(*info, argsz);
-        goto again;
-    }
-
-    return 0;
-}
-
-static struct vfio_info_cap_header *
-vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
-{
-    struct vfio_info_cap_header *hdr;
-    void *ptr = info;
-
-    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
-        return NULL;
-    }
-
-    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
-        if (hdr->id == id) {
-            return hdr;
-        }
-    }
-
-    return NULL;
-}
-
-static void vfio_get_iommu_info_migration(VFIOContainer *container,
-                                         struct vfio_iommu_type1_info *info)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_iommu_type1_info_cap_migration *cap_mig;
-
-    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
-    if (!hdr) {
-        return;
-    }
-
-    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
-                            header);
-
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty.
-     */
-    if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
-        container->dirty_pages_supported = true;
-        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
-        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
-    }
-}
-
-static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
-                                  Error **errp)
-{
-    VFIOContainer *container;
-    int ret, fd;
-    VFIOAddressSpace *space;
-
-    space = vfio_get_address_space(as);
-
-    /*
-     * VFIO is currently incompatible with discarding of RAM insofar as the
-     * madvise to purge (zap) the page from QEMU's address space does not
-     * interact with the memory API and therefore leaves stale virtual to
-     * physical mappings in the IOMMU if the page was previously pinned.  We
-     * therefore set discarding broken for each group added to a container,
-     * whether the container is used individually or shared.  This provides
-     * us with options to allow devices within a group to opt-in and allow
-     * discarding, so long as it is done consistently for a group (for instance
-     * if the device is an mdev device where it is known that the host vendor
-     * driver will never pin pages outside of the working set of the guest
-     * driver, which would thus not be discarding candidates).
-     *
-     * The first opportunity to induce pinning occurs here where we attempt to
-     * attach the group to existing containers within the AddressSpace.  If any
-     * pages are already zapped from the virtual address space, such as from
-     * previous discards, new pinning will cause valid mappings to be
-     * re-established.  Likewise, when the overall MemoryListener for a new
-     * container is registered, a replay of mappings within the AddressSpace
-     * will occur, re-establishing any previously zapped pages as well.
-     *
-     * Especially virtio-balloon is currently only prevented from discarding
-     * new memory, it will not yet set ram_block_discard_set_required() and
-     * therefore, neither stops us here or deals with the sudden memory
-     * consumption of inflated memory.
-     *
-     * We do support discarding of memory coordinated via the RamDiscardManager
-     * with some IOMMU types. vfio_ram_block_discard_disable() handles the
-     * details once we know which type of IOMMU we are using.
-     */
-
-    QLIST_FOREACH(container, &space->containers, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            ret = vfio_ram_block_discard_disable(container, true);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "Cannot set discarding of RAM broken");
-                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
-                          &container->fd)) {
-                    error_report("vfio: error disconnecting group %d from"
-                                 " container", group->groupid);
-                }
-                return ret;
-            }
-            group->container = container;
-            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-            vfio_kvm_device_add_group(group);
-            return 0;
-        }
-    }
-
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
-    if (fd < 0) {
-        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
-        ret = -errno;
-        goto put_space_exit;
-    }
-
-    ret = ioctl(fd, VFIO_GET_API_VERSION);
-    if (ret != VFIO_API_VERSION) {
-        error_setg(errp, "supported vfio version: %d, "
-                   "reported version: %d", VFIO_API_VERSION, ret);
-        ret = -EINVAL;
-        goto close_fd_exit;
-    }
-
-    container = g_malloc0(sizeof(*container));
-    container->space = space;
-    container->fd = fd;
-    container->error = NULL;
-    container->dirty_pages_supported = false;
-    container->dma_max_mappings = 0;
-    QLIST_INIT(&container->giommu_list);
-    QLIST_INIT(&container->hostwin_list);
-    QLIST_INIT(&container->vrdl_list);
-
-    ret = vfio_init_container(container, group->fd, errp);
-    if (ret) {
-        goto free_container_exit;
-    }
-
-    ret = vfio_ram_block_discard_disable(container, true);
-    if (ret) {
-        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
-        goto free_container_exit;
-    }
-
-    switch (container->iommu_type) {
-    case VFIO_TYPE1v2_IOMMU:
-    case VFIO_TYPE1_IOMMU:
-    {
-        struct vfio_iommu_type1_info *info;
-
-        ret = vfio_get_iommu_info(container, &info);
-        if (ret) {
-            error_setg_errno(errp, -ret, "Failed to get VFIO IOMMU info");
-            goto enable_discards_exit;
-        }
-
-        if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
-            container->pgsizes = info->iova_pgsizes;
-        } else {
-            container->pgsizes = qemu_real_host_page_size();
-        }
-
-        if (!vfio_get_info_dma_avail(info, &container->dma_max_mappings)) {
-            container->dma_max_mappings = 65535;
-        }
-        vfio_get_iommu_info_migration(container, info);
-        g_free(info);
-
-        /*
-         * FIXME: We should parse VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
-         * information to get the actual window extent rather than assume
-         * a 64-bit IOVA address space.
-         */
-        vfio_host_win_add(container, 0, (hwaddr)-1, container->pgsizes);
-
-        break;
-    }
-    case VFIO_SPAPR_TCE_v2_IOMMU:
-    case VFIO_SPAPR_TCE_IOMMU:
-    {
-        struct vfio_iommu_spapr_tce_info info;
-        bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU;
-
-        /*
-         * The host kernel code implementing VFIO_IOMMU_DISABLE is called
-         * when container fd is closed so we do not call it explicitly
-         * in this file.
-         */
-        if (!v2) {
-            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-            if (ret) {
-                error_setg_errno(errp, errno, "failed to enable container");
-                ret = -errno;
-                goto enable_discards_exit;
-            }
-        } else {
-            container->prereg_listener = vfio_prereg_listener;
-
-            memory_listener_register(&container->prereg_listener,
-                                     &address_space_memory);
-            if (container->error) {
-                memory_listener_unregister(&container->prereg_listener);
-                ret = -1;
-                error_propagate_prepend(errp, container->error,
-                    "RAM memory listener initialization failed: ");
-                goto enable_discards_exit;
-            }
-        }
-
-        info.argsz = sizeof(info);
-        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
-        if (ret) {
-            error_setg_errno(errp, errno,
-                             "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
-            ret = -errno;
-            if (v2) {
-                memory_listener_unregister(&container->prereg_listener);
-            }
-            goto enable_discards_exit;
-        }
-
-        if (v2) {
-            container->pgsizes = info.ddw.pgsizes;
-            /*
-             * There is a default window in just created container.
-             * To make region_add/del simpler, we better remove this
-             * window now and let those iommu_listener callbacks
-             * create/remove them when needed.
-             */
-            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "failed to remove existing window");
-                goto enable_discards_exit;
-            }
-        } else {
-            /* The default table uses 4K pages */
-            container->pgsizes = 0x1000;
-            vfio_host_win_add(container, info.dma32_window_start,
-                              info.dma32_window_start +
-                              info.dma32_window_size - 1,
-                              0x1000);
-        }
-    }
-    }
-
-    vfio_kvm_device_add_group(group);
-
-    QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&space->containers, container, next);
-
-    group->container = container;
-    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-
-    container->listener = vfio_memory_listener;
-
-    memory_listener_register(&container->listener, container->space->as);
-
-    if (container->error) {
-        ret = -1;
-        error_propagate_prepend(errp, container->error,
-            "memory listener initialization failed: ");
-        goto listener_release_exit;
-    }
-
-    container->initialized = true;
-
-    return 0;
-listener_release_exit:
-    QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(container, next);
-    vfio_kvm_device_del_group(group);
-    vfio_listener_release(container);
-
-enable_discards_exit:
-    vfio_ram_block_discard_disable(container, false);
-
-free_container_exit:
-    g_free(container);
-
-close_fd_exit:
-    close(fd);
-
-put_space_exit:
-    vfio_put_address_space(space);
-
-    return ret;
-}
-
-static void vfio_disconnect_container(VFIOGroup *group)
-{
-    VFIOContainer *container = group->container;
-
-    QLIST_REMOVE(group, container_next);
-    group->container = NULL;
-
-    /*
-     * Explicitly release the listener first before unset container,
-     * since unset may destroy the backend container if it's the last
-     * group.
-     */
-    if (QLIST_EMPTY(&container->group_list)) {
-        vfio_listener_release(container);
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
-        error_report("vfio: error disconnecting group %d from container",
-                     group->groupid);
-    }
-
-    if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = container->space;
-        VFIOGuestIOMMU *giommu, *tmp;
-        VFIOHostDMAWindow *hostwin, *next;
-
-        QLIST_REMOVE(container, next);
-
-        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(
-                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
-            QLIST_REMOVE(giommu, giommu_next);
-            g_free(giommu);
-        }
-
-        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
-                           next) {
-            QLIST_REMOVE(hostwin, hostwin_next);
-            g_free(hostwin);
-        }
-
-        trace_vfio_disconnect_container(container->fd);
-        close(container->fd);
-        g_free(container);
-
-        vfio_put_address_space(space);
-    }
-}
-
-static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
-{
-    VFIOGroup *group;
-    char path[32];
-    struct vfio_group_status status = { .argsz = sizeof(status) };
-
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        if (group->groupid == groupid) {
-            /* Found it.  Now is it already in the right context? */
-            if (group->container->space->as == as) {
-                return group;
-            } else {
-                error_setg(errp, "group %d used in multiple address spaces",
-                           group->groupid);
-                return NULL;
-            }
-        }
-    }
-
-    group = g_malloc0(sizeof(*group));
-
-    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
-    if (group->fd < 0) {
-        error_setg_errno(errp, errno, "failed to open %s", path);
-        goto free_group_exit;
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
-        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
-        goto close_fd_exit;
-    }
-
-    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
-        error_setg(errp, "group %d is not viable", groupid);
-        error_append_hint(errp,
-                          "Please ensure all devices within the iommu_group "
-                          "are bound to their vfio bus driver.\n");
-        goto close_fd_exit;
-    }
-
-    group->groupid = groupid;
-    QLIST_INIT(&group->device_list);
-
-    if (vfio_connect_container(group, as, errp)) {
-        error_prepend(errp, "failed to setup container for group %d: ",
-                      groupid);
-        goto close_fd_exit;
-    }
-
-    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
-
-    return group;
-
-close_fd_exit:
-    close(group->fd);
-
-free_group_exit:
-    g_free(group);
-
-    return NULL;
-}
-
-static void vfio_put_group(VFIOGroup *group)
-{
-    if (!group || !QLIST_EMPTY(&group->device_list)) {
-        return;
-    }
-
-    if (!group->ram_block_discard_allowed) {
-        vfio_ram_block_discard_disable(group->container, false);
-    }
-    vfio_kvm_device_del_group(group);
-    vfio_disconnect_container(group);
-    QLIST_REMOVE(group, next);
-    trace_vfio_put_group(group->fd);
-    close(group->fd);
-    g_free(group);
-}
-
 struct vfio_device_info *vfio_get_device_info(int fd)
 {
     struct vfio_device_info *info;
@@ -2415,252 +1545,3 @@ retry:
 
     return info;
 }
-
-static int vfio_get_device(VFIOGroup *group, const char *name,
-                           VFIODevice *vbasedev, Error **errp)
-{
-    g_autofree struct vfio_device_info *info = NULL;
-    int fd;
-
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
-    if (fd < 0) {
-        error_setg_errno(errp, errno, "error getting device from group %d",
-                         group->groupid);
-        error_append_hint(errp,
-                      "Verify all devices in group %d are bound to vfio-<bus> "
-                      "or pci-stub and not already in use\n", group->groupid);
-        return fd;
-    }
-
-    info = vfio_get_device_info(fd);
-    if (!info) {
-        error_setg_errno(errp, errno, "error getting device info");
-        close(fd);
-        return -1;
-    }
-
-    /*
-     * Set discarding of RAM as not broken for this group if the driver knows
-     * the device operates compatibly with discarding.  Setting must be
-     * consistent per group, but since compatibility is really only possible
-     * with mdev currently, we expect singleton groups.
-     */
-    if (vbasedev->ram_block_discard_allowed !=
-        group->ram_block_discard_allowed) {
-        if (!QLIST_EMPTY(&group->device_list)) {
-            error_setg(errp, "Inconsistent setting of support for discarding "
-                       "RAM (e.g., balloon) within group");
-            close(fd);
-            return -1;
-        }
-
-        if (!group->ram_block_discard_allowed) {
-            group->ram_block_discard_allowed = true;
-            vfio_ram_block_discard_disable(group->container, false);
-        }
-    }
-
-    vbasedev->fd = fd;
-    vbasedev->group = group;
-    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
-
-    vbasedev->num_irqs = info->num_irqs;
-    vbasedev->num_regions = info->num_regions;
-    vbasedev->flags = info->flags;
-
-    trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
-
-    vbasedev->reset_works = !!(info->flags & VFIO_DEVICE_FLAGS_RESET);
-
-    return 0;
-}
-
-static void vfio_put_base_device(VFIODevice *vbasedev)
-{
-    if (!vbasedev->group) {
-        return;
-    }
-    QLIST_REMOVE(vbasedev, next);
-    vbasedev->group = NULL;
-    trace_vfio_put_base_device(vbasedev->fd);
-    close(vbasedev->fd);
-}
-
-/*
- * Interfaces for IBM EEH (Enhanced Error Handling)
- */
-static bool vfio_eeh_container_ok(VFIOContainer *container)
-{
-    /*
-     * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
-     * implementation is broken if there are multiple groups in a
-     * container.  The hardware works in units of Partitionable
-     * Endpoints (== IOMMU groups) and the EEH operations naively
-     * iterate across all groups in the container, without any logic
-     * to make sure the groups have their state synchronized.  For
-     * certain operations (ENABLE) that might be ok, until an error
-     * occurs, but for others (GET_STATE) it's clearly broken.
-     */
-
-    /*
-     * XXX Once fixed kernels exist, test for them here
-     */
-
-    if (QLIST_EMPTY(&container->group_list)) {
-        return false;
-    }
-
-    if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) {
-        return false;
-    }
-
-    return true;
-}
-
-static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
-{
-    struct vfio_eeh_pe_op pe_op = {
-        .argsz = sizeof(pe_op),
-        .op = op,
-    };
-    int ret;
-
-    if (!vfio_eeh_container_ok(container)) {
-        error_report("vfio/eeh: EEH_PE_OP 0x%x: "
-                     "kernel requires a container with exactly one group", op);
-        return -EPERM;
-    }
-
-    ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op);
-    if (ret < 0) {
-        error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op);
-        return -errno;
-    }
-
-    return ret;
-}
-
-static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
-{
-    VFIOAddressSpace *space = vfio_get_address_space(as);
-    VFIOContainer *container = NULL;
-
-    if (QLIST_EMPTY(&space->containers)) {
-        /* No containers to act on */
-        goto out;
-    }
-
-    container = QLIST_FIRST(&space->containers);
-
-    if (QLIST_NEXT(container, next)) {
-        /* We don't yet have logic to synchronize EEH state across
-         * multiple containers */
-        container = NULL;
-        goto out;
-    }
-
-out:
-    vfio_put_address_space(space);
-    return container;
-}
-
-bool vfio_eeh_as_ok(AddressSpace *as)
-{
-    VFIOContainer *container = vfio_eeh_as_container(as);
-
-    return (container != NULL) && vfio_eeh_container_ok(container);
-}
-
-int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
-{
-    VFIOContainer *container = vfio_eeh_as_container(as);
-
-    if (!container) {
-        return -ENODEV;
-    }
-    return vfio_eeh_container_op(container, op);
-}
-
-static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
-{
-    char *tmp, group_path[PATH_MAX], *group_name;
-    int ret, groupid;
-    ssize_t len;
-
-    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len <= 0 || len >= sizeof(group_path)) {
-        ret = len < 0 ? -errno : -ENAMETOOLONG;
-        error_setg_errno(errp, -ret, "no iommu_group found");
-        return ret;
-    }
-
-    group_path[len] = 0;
-
-    group_name = basename(group_path);
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_setg_errno(errp, errno, "failed to read %s", group_path);
-        return -errno;
-    }
-    return groupid;
-}
-
-/*
- * vfio_attach_device: attach a device to a security context
- * @name and @vbasedev->name are likely to be different depending
- * on the type of the device, hence the need for passing @name
- */
-int vfio_attach_device(char *name, VFIODevice *vbasedev,
-                       AddressSpace *as, Error **errp)
-{
-    int groupid = vfio_device_groupid(vbasedev, errp);
-    VFIODevice *vbasedev_iter;
-    VFIOGroup *group;
-    VFIOContainer *container;
-    int ret;
-
-    if (groupid < 0) {
-        return groupid;
-    }
-
-    trace_vfio_attach_device(vbasedev->name, groupid);
-
-    group = vfio_get_group(groupid, as, errp);
-    if (!group) {
-        return -ENOENT;
-    }
-
-    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-            error_setg(errp, "device is already attached");
-            vfio_put_group(group);
-            return -EBUSY;
-        }
-    }
-    ret = vfio_get_device(group, name, vbasedev, errp);
-    if (ret) {
-        vfio_put_group(group);
-        return ret;
-    }
-
-    container = group->container;
-    vbasedev->container = container;
-    QLIST_INSERT_HEAD(&container->device_list, vbasedev, container_next);
-    QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
-
-    return ret;
-}
-
-void vfio_detach_device(VFIODevice *vbasedev)
-{
-    VFIOGroup *group = vbasedev->group;
-
-    QLIST_REMOVE(vbasedev, global_next);
-    QLIST_REMOVE(vbasedev, container_next);
-    vbasedev->container = NULL;
-    trace_vfio_detach_device(vbasedev->name, group->groupid);
-    vfio_put_base_device(vbasedev);
-    vfio_put_group(group);
-}
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
new file mode 100644
index 0000000000..7a3f108478
--- /dev/null
+++ b/hw/vfio/container.c
@@ -0,0 +1,1157 @@
+/*
+ * generic functions used by VFIO devices
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#ifdef CONFIG_KVM
+#include <linux/kvm.h>
+#endif
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "exec/ram_addr.h"
+#include "hw/hw.h"
+#include "qemu/error-report.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/reset.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "migration/migration.h"
+
+VFIOGroupList vfio_group_list =
+    QLIST_HEAD_INITIALIZER(vfio_group_list);
+
+static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
+{
+    switch (container->iommu_type) {
+    case VFIO_TYPE1v2_IOMMU:
+    case VFIO_TYPE1_IOMMU:
+        /*
+         * We support coordinated discarding of RAM via the RamDiscardManager.
+         */
+        return ram_block_uncoordinated_discard_disable(state);
+    default:
+        /*
+         * VFIO_SPAPR_TCE_IOMMU most probably works just fine with
+         * RamDiscardManager, however, it is completely untested.
+         *
+         * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does
+         * completely the opposite of managing mapping/pinning dynamically as
+         * required by RamDiscardManager. We would have to special-case sections
+         * with a RamDiscardManager.
+         */
+        return ram_block_discard_disable(state);
+    }
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+                                 hwaddr iova, ram_addr_t size,
+                                 IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    VFIOBitmap vbmap;
+    int ret;
+
+    ret = vfio_bitmap_alloc(&vbmap, size);
+    if (ret) {
+        return ret;
+    }
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->iova = iova;
+    unmap->size = size;
+    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
+     * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
+     * to qemu_real_host_page_size.
+     */
+    bitmap->pgsize = qemu_real_host_page_size();
+    bitmap->size = vbmap.size;
+    bitmap->data = (__u64 *)vbmap.bitmap;
+
+    if (vbmap.size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64, vbmap.size);
+        ret = -E2BIG;
+        goto unmap_exit;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (!ret) {
+        cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap,
+                iotlb->translated_addr, vbmap.pages);
+    } else {
+        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+    }
+
+unmap_exit:
+    g_free(unmap);
+    g_free(vbmap.bitmap);
+
+    return ret;
+}
+
+/*
+ * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
+ */
+int vfio_dma_unmap(VFIOContainer *container, hwaddr iova,
+                   ram_addr_t size, IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = iova,
+        .size = size,
+    };
+    bool need_dirty_sync = false;
+    int ret;
+
+    if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
+        if (!vfio_devices_all_device_dirty_tracking(container) &&
+            container->dirty_pages_supported) {
+            return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+        }
+
+        need_dirty_sync = true;
+    }
+
+    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        /*
+         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
+         * v4.15) where an overflow in its wrap-around check prevents us from
+         * unmapping the last page of the address space.  Test for the error
+         * condition and re-try the unmap excluding the last page.  The
+         * expectation is that we've never mapped the last page anyway and this
+         * unmap request comes via vIOMMU support which also makes it unlikely
+         * that this page is used.  This bug was introduced well after type1 v2
+         * support was introduced, so we shouldn't need to test for v1.  A fix
+         * is queued for kernel v5.0 so this workaround can be removed once
+         * affected kernels are sufficiently deprecated.
+         */
+        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
+            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
+            trace_vfio_dma_unmap_overflow_workaround();
+            unmap.size -= 1ULL << ctz64(container->pgsizes);
+            continue;
+        }
+        error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
+        return -errno;
+    }
+
+    if (need_dirty_sync) {
+        ret = vfio_get_dirty_bitmap(container, iova, size,
+                                    iotlb->translated_addr);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly)
+{
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_READ,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (!readonly) {
+        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+    }
+
+    /*
+     * Try the mapping, if it fails with EBUSY, unmap the region and try
+     * again.  This shouldn't be necessary, but we sometimes see it in
+     * the VGA ROM space.
+     */
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
+         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
+        return 0;
+    }
+
+    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+    return -errno;
+}
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp)
+{
+    VFIOHostDMAWindow *hostwin;
+    hwaddr pgsize = 0;
+    int ret;
+
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return 0;
+    }
+
+    /* For now intersections are not allowed, we may relax this later */
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (ranges_overlap(hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1,
+                           section->offset_within_address_space,
+                           int128_get64(section->size))) {
+            error_setg(errp,
+                "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing "
+                "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                    int128_get64(section->size) - 1,
+                hostwin->min_iova, hostwin->max_iova);
+            return -EINVAL;
+        }
+    }
+
+    ret = vfio_spapr_create_window(container, section, &pgsize);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
+        return ret;
+    }
+
+    vfio_host_win_add(container, section->offset_within_address_space,
+                      section->offset_within_address_space +
+                      int128_get64(section->size) - 1, pgsize);
+#ifdef CONFIG_KVM
+    if (kvm_enabled()) {
+        VFIOGroup *group;
+        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
+        struct kvm_vfio_spapr_tce param;
+        struct kvm_device_attr attr = {
+            .group = KVM_DEV_VFIO_GROUP,
+            .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
+            .addr = (uint64_t)(unsigned long)&param,
+        };
+
+        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
+                                          &param.tablefd)) {
+            QLIST_FOREACH(group, &container->group_list, container_next) {
+                param.groupfd = group->fd;
+                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+                    error_setg_errno(errp, errno,
+                                     "vfio: failed GROUP_SET_SPAPR_TCE for "
+                                     "KVM VFIO device %d and group fd %d",
+                                     param.tablefd, param.groupfd);
+                    return -errno;
+                }
+                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
+            }
+        }
+    }
+#endif
+    return 0;
+}
+
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return;
+    }
+
+    vfio_spapr_remove_window(container,
+                             section->offset_within_address_space);
+    if (vfio_host_win_del(container,
+                          section->offset_within_address_space,
+                          section->offset_within_address_space +
+                          int128_get64(section->size) - 1) < 0) {
+        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
+                 __func__, section->offset_within_address_space);
+    }
+}
+
+int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
+{
+    int ret;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (!container->dirty_pages_supported) {
+        return 0;
+    }
+
+    if (start) {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+    } else {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        ret = -errno;
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+    }
+
+    return ret;
+}
+
+int vfio_query_dirty_bitmap(VFIOContainer *container, VFIOBitmap *vbmap,
+                            hwaddr iova, hwaddr size)
+{
+    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+    struct vfio_iommu_type1_dirty_bitmap_get *range;
+    int ret;
+
+    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+    range->iova = iova;
+    range->size = size;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
+     * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize
+     * to qemu_real_host_page_size.
+     */
+    range->bitmap.pgsize = qemu_real_host_page_size();
+    range->bitmap.size = vbmap->size;
+    range->bitmap.data = (__u64 *)vbmap->bitmap;
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    if (ret) {
+        ret = -errno;
+        error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
+                " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
+                (uint64_t)range->size, errno);
+    }
+
+    g_free(dbitmap);
+
+    return ret;
+}
+
+static void vfio_listener_release(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener);
+    }
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    return vfio_get_cap((void *)info, info->cap_offset, id);
+}
+
+bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
+                             unsigned int *avail)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_dma_avail *cap;
+
+    /* If the capability cannot be found, assume no DMA limiting */
+    hdr = vfio_get_iommu_type1_info_cap(info,
+                                        VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL);
+    if (hdr == NULL) {
+        return false;
+    }
+
+    if (avail != NULL) {
+        cap = (void *) hdr;
+        *avail = cap->avail;
+    }
+
+    return true;
+}
+
+static void vfio_kvm_device_add_group(VFIOGroup *group)
+{
+    Error *err = NULL;
+
+    if (vfio_kvm_device_add_fd(group->fd, &err)) {
+        error_reportf_err(err, "group ID %d: ", group->groupid);
+    }
+}
+
+static void vfio_kvm_device_del_group(VFIOGroup *group)
+{
+    Error *err = NULL;
+
+    if (vfio_kvm_device_del_fd(group->fd, &err)) {
+        error_reportf_err(err, "group ID %d: ", group->groupid);
+    }
+}
+
+/*
+ * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
+ */
+static int vfio_get_iommu_type(VFIOContainer *container,
+                               Error **errp)
+{
+    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+                          VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
+        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+            return iommu_types[i];
+        }
+    }
+    error_setg(errp, "No available IOMMU models");
+    return -EINVAL;
+}
+
+static int vfio_init_container(VFIOContainer *container, int group_fd,
+                               Error **errp)
+{
+    int iommu_type, ret;
+
+    iommu_type = vfio_get_iommu_type(container, errp);
+    if (iommu_type < 0) {
+        return iommu_type;
+    }
+
+    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
+    if (ret) {
+        error_setg_errno(errp, errno, "Failed to set group container");
+        return -errno;
+    }
+
+    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
+        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+            /*
+             * On sPAPR, although the IOMMU subdriver always advertises v1 and
+             * v2, the running platform may not support v2 and there is no
+             * way to guess it until an IOMMU group gets added to the container.
+             * So in case it fails with v2, try v1 as a fallback.
+             */
+            iommu_type = VFIO_SPAPR_TCE_IOMMU;
+            continue;
+        }
+        error_setg_errno(errp, errno, "Failed to set iommu for container");
+        return -errno;
+    }
+
+    container->iommu_type = iommu_type;
+    return 0;
+}
+
+static int vfio_get_iommu_info(VFIOContainer *container,
+                               struct vfio_iommu_type1_info **info)
+{
+
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+    *info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if (((*info)->argsz > argsz)) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto again;
+    }
+
+    return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+                                          struct vfio_iommu_type1_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+    if (!hdr) {
+        return;
+    }
+
+    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+                            header);
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
+     * qemu_real_host_page_size to mark those dirty.
+     */
+    if (cap_mig->pgsize_bitmap & qemu_real_host_page_size()) {
+        container->dirty_pages_supported = true;
+        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+    }
+}
+
+static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
+                                  Error **errp)
+{
+    VFIOContainer *container;
+    int ret, fd;
+    VFIOAddressSpace *space;
+
+    space = vfio_get_address_space(as);
+
+    /*
+     * VFIO is currently incompatible with discarding of RAM insofar as the
+     * madvise to purge (zap) the page from QEMU's address space does not
+     * interact with the memory API and therefore leaves stale virtual to
+     * physical mappings in the IOMMU if the page was previously pinned.  We
+     * therefore set discarding broken for each group added to a container,
+     * whether the container is used individually or shared.  This provides
+     * us with options to allow devices within a group to opt-in and allow
+     * discarding, so long as it is done consistently for a group (for instance
+     * if the device is an mdev device where it is known that the host vendor
+     * driver will never pin pages outside of the working set of the guest
+     * driver, which would thus not be discarding candidates).
+     *
+     * The first opportunity to induce pinning occurs here where we attempt to
+     * attach the group to existing containers within the AddressSpace.  If any
+     * pages are already zapped from the virtual address space, such as from
+     * previous discards, new pinning will cause valid mappings to be
+     * re-established.  Likewise, when the overall MemoryListener for a new
+     * container is registered, a replay of mappings within the AddressSpace
+     * will occur, re-establishing any previously zapped pages as well.
+     *
+     * In particular, virtio-balloon is currently only prevented from
+     * discarding new memory; it does not yet set
+     * ram_block_discard_set_required() and therefore neither stops us here
+     * nor deals with the sudden memory consumption of inflated memory.
+     *
+     * We do support discarding of memory coordinated via the RamDiscardManager
+     * with some IOMMU types. vfio_ram_block_discard_disable() handles the
+     * details once we know which type of IOMMU we are using.
+     */
+
+    QLIST_FOREACH(container, &space->containers, next) {
+        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+            ret = vfio_ram_block_discard_disable(container, true);
+            if (ret) {
+                error_setg_errno(errp, -ret,
+                                 "Cannot set discarding of RAM broken");
+                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
+                          &container->fd)) {
+                    error_report("vfio: error disconnecting group %d from"
+                                 " container", group->groupid);
+                }
+                return ret;
+            }
+            group->container = container;
+            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+            vfio_kvm_device_add_group(group);
+            return 0;
+        }
+    }
+
+    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
+        ret = -errno;
+        goto put_space_exit;
+    }
+
+    ret = ioctl(fd, VFIO_GET_API_VERSION);
+    if (ret != VFIO_API_VERSION) {
+        error_setg(errp, "supported vfio version: %d, "
+                   "reported version: %d", VFIO_API_VERSION, ret);
+        ret = -EINVAL;
+        goto close_fd_exit;
+    }
+
+    container = g_malloc0(sizeof(*container));
+    container->space = space;
+    container->fd = fd;
+    container->error = NULL;
+    container->dirty_pages_supported = false;
+    container->dma_max_mappings = 0;
+    QLIST_INIT(&container->giommu_list);
+    QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->vrdl_list);
+
+    ret = vfio_init_container(container, group->fd, errp);
+    if (ret) {
+        goto free_container_exit;
+    }
+
+    ret = vfio_ram_block_discard_disable(container, true);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
+        goto free_container_exit;
+    }
+
+    switch (container->iommu_type) {
+    case VFIO_TYPE1v2_IOMMU:
+    case VFIO_TYPE1_IOMMU:
+    {
+        struct vfio_iommu_type1_info *info;
+
+        ret = vfio_get_iommu_info(container, &info);
+        if (ret) {
+            error_setg_errno(errp, -ret, "Failed to get VFIO IOMMU info");
+            goto enable_discards_exit;
+        }
+
+        if (info->flags & VFIO_IOMMU_INFO_PGSIZES) {
+            container->pgsizes = info->iova_pgsizes;
+        } else {
+            container->pgsizes = qemu_real_host_page_size();
+        }
+
+        if (!vfio_get_info_dma_avail(info, &container->dma_max_mappings)) {
+            container->dma_max_mappings = 65535;
+        }
+        vfio_get_iommu_info_migration(container, info);
+        g_free(info);
+
+        /*
+         * FIXME: We should parse VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
+         * information to get the actual window extent rather than assume
+         * a 64-bit IOVA address space.
+         */
+        vfio_host_win_add(container, 0, (hwaddr)-1, container->pgsizes);
+
+        break;
+    }
+    case VFIO_SPAPR_TCE_v2_IOMMU:
+    case VFIO_SPAPR_TCE_IOMMU:
+    {
+        struct vfio_iommu_spapr_tce_info info;
+        bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU;
+
+        /*
+         * The host kernel code implementing VFIO_IOMMU_DISABLE is called
+         * when container fd is closed so we do not call it explicitly
+         * in this file.
+         */
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_setg_errno(errp, errno, "failed to enable container");
+                ret = -errno;
+                goto enable_discards_exit;
+            }
+        } else {
+            container->prereg_listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                memory_listener_unregister(&container->prereg_listener);
+                ret = -1;
+                error_propagate_prepend(errp, container->error,
+                    "RAM memory listener initialization failed: ");
+                goto enable_discards_exit;
+            }
+        }
+
+        info.argsz = sizeof(info);
+        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
+        if (ret) {
+            error_setg_errno(errp, errno,
+                             "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
+            ret = -errno;
+            if (v2) {
+                memory_listener_unregister(&container->prereg_listener);
+            }
+            goto enable_discards_exit;
+        }
+
+        if (v2) {
+            container->pgsizes = info.ddw.pgsizes;
+            /*
+             * There is a default window in a just-created container.
+             * To keep region_add/del simple, remove this window now
+             * and let the iommu_listener callbacks create/remove
+             * windows when needed.
+             */
+            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
+            if (ret) {
+                error_setg_errno(errp, -ret,
+                                 "failed to remove existing window");
+                goto enable_discards_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            container->pgsizes = 0x1000;
+            vfio_host_win_add(container, info.dma32_window_start,
+                              info.dma32_window_start +
+                              info.dma32_window_size - 1,
+                              0x1000);
+        }
+    }
+    }
+
+    vfio_kvm_device_add_group(group);
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&space->containers, container, next);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    container->listener = vfio_memory_listener;
+
+    memory_listener_register(&container->listener, container->space->as);
+
+    if (container->error) {
+        ret = -1;
+        error_propagate_prepend(errp, container->error,
+            "memory listener initialization failed: ");
+        goto listener_release_exit;
+    }
+
+    container->initialized = true;
+
+    return 0;
+listener_release_exit:
+    QLIST_REMOVE(group, container_next);
+    QLIST_REMOVE(container, next);
+    vfio_kvm_device_del_group(group);
+    vfio_listener_release(container);
+
+enable_discards_exit:
+    vfio_ram_block_discard_disable(container, false);
+
+free_container_exit:
+    g_free(container);
+
+close_fd_exit:
+    close(fd);
+
+put_space_exit:
+    vfio_put_address_space(space);
+
+    return ret;
+}
+
+static void vfio_disconnect_container(VFIOGroup *group)
+{
+    VFIOContainer *container = group->container;
+
+    QLIST_REMOVE(group, container_next);
+    group->container = NULL;
+
+    /*
+     * Explicitly release the listener first before unset container,
+     * since unset may destroy the backend container if it's the last
+     * group.
+     */
+    if (QLIST_EMPTY(&container->group_list)) {
+        vfio_listener_release(container);
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+        error_report("vfio: error disconnecting group %d from container",
+                     group->groupid);
+    }
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        VFIOAddressSpace *space = container->space;
+        VFIOGuestIOMMU *giommu, *tmp;
+        VFIOHostDMAWindow *hostwin, *next;
+
+        QLIST_REMOVE(container, next);
+
+        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+            memory_region_unregister_iommu_notifier(
+                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
+            QLIST_REMOVE(giommu, giommu_next);
+            g_free(giommu);
+        }
+
+        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+                           next) {
+            QLIST_REMOVE(hostwin, hostwin_next);
+            g_free(hostwin);
+        }
+
+        trace_vfio_disconnect_container(container->fd);
+        close(container->fd);
+        g_free(container);
+
+        vfio_put_address_space(space);
+    }
+}
+
+static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+{
+    VFIOGroup *group;
+    char path[32];
+    struct vfio_group_status status = { .argsz = sizeof(status) };
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        if (group->groupid == groupid) {
+            /* Found it.  Now is it already in the right context? */
+            if (group->container->space->as == as) {
+                return group;
+            } else {
+                error_setg(errp, "group %d used in multiple address spaces",
+                           group->groupid);
+                return NULL;
+            }
+        }
+    }
+
+    group = g_malloc0(sizeof(*group));
+
+    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
+    group->fd = qemu_open_old(path, O_RDWR);
+    if (group->fd < 0) {
+        error_setg_errno(errp, errno, "failed to open %s", path);
+        goto free_group_exit;
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
+        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
+        goto close_fd_exit;
+    }
+
+    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_setg(errp, "group %d is not viable", groupid);
+        error_append_hint(errp,
+                          "Please ensure all devices within the iommu_group "
+                          "are bound to their vfio bus driver.\n");
+        goto close_fd_exit;
+    }
+
+    group->groupid = groupid;
+    QLIST_INIT(&group->device_list);
+
+    if (vfio_connect_container(group, as, errp)) {
+        error_prepend(errp, "failed to setup container for group %d: ",
+                      groupid);
+        goto close_fd_exit;
+    }
+
+    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+
+    return group;
+
+close_fd_exit:
+    close(group->fd);
+
+free_group_exit:
+    g_free(group);
+
+    return NULL;
+}
+
+static void vfio_put_group(VFIOGroup *group)
+{
+    if (!group || !QLIST_EMPTY(&group->device_list)) {
+        return;
+    }
+
+    if (!group->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(group->container, false);
+    }
+    vfio_kvm_device_del_group(group);
+    vfio_disconnect_container(group);
+    QLIST_REMOVE(group, next);
+    trace_vfio_put_group(group->fd);
+    close(group->fd);
+    g_free(group);
+}
+
+static int vfio_get_device(VFIOGroup *group, const char *name,
+                           VFIODevice *vbasedev, Error **errp)
+{
+    g_autofree struct vfio_device_info *info = NULL;
+    int fd;
+
+    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "error getting device from group %d",
+                         group->groupid);
+        error_append_hint(errp,
+                      "Verify all devices in group %d are bound to vfio-<bus> "
+                      "or pci-stub and not already in use\n", group->groupid);
+        return fd;
+    }
+
+    info = vfio_get_device_info(fd);
+    if (!info) {
+        error_setg_errno(errp, errno, "error getting device info");
+        close(fd);
+        return -1;
+    }
+
+    /*
+     * Set discarding of RAM as not broken for this group if the driver knows
+     * the device operates compatibly with discarding.  Setting must be
+     * consistent per group, but since compatibility is really only possible
+     * with mdev currently, we expect singleton groups.
+     */
+    if (vbasedev->ram_block_discard_allowed !=
+        group->ram_block_discard_allowed) {
+        if (!QLIST_EMPTY(&group->device_list)) {
+            error_setg(errp, "Inconsistent setting of support for discarding "
+                       "RAM (e.g., balloon) within group");
+            close(fd);
+            return -1;
+        }
+
+        if (!group->ram_block_discard_allowed) {
+            group->ram_block_discard_allowed = true;
+            vfio_ram_block_discard_disable(group->container, false);
+        }
+    }
+
+    vbasedev->fd = fd;
+    vbasedev->group = group;
+    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
+
+    vbasedev->num_irqs = info->num_irqs;
+    vbasedev->num_regions = info->num_regions;
+    vbasedev->flags = info->flags;
+
+    trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
+
+    vbasedev->reset_works = !!(info->flags & VFIO_DEVICE_FLAGS_RESET);
+
+    return 0;
+}
+
+static void vfio_put_base_device(VFIODevice *vbasedev)
+{
+    if (!vbasedev->group) {
+        return;
+    }
+    QLIST_REMOVE(vbasedev, next);
+    vbasedev->group = NULL;
+    trace_vfio_put_base_device(vbasedev->fd);
+    close(vbasedev->fd);
+}
+
+/*
+ * Interfaces for IBM EEH (Enhanced Error Handling)
+ */
+static bool vfio_eeh_container_ok(VFIOContainer *container)
+{
+    /*
+     * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
+     * implementation is broken if there are multiple groups in a
+     * container.  The hardware works in units of Partitionable
+     * Endpoints (== IOMMU groups) and the EEH operations naively
+     * iterate across all groups in the container, without any logic
+     * to make sure the groups have their state synchronized.  For
+     * certain operations (ENABLE) that might be ok, until an error
+     * occurs, but for others (GET_STATE) it's clearly broken.
+     */
+
+    /*
+     * XXX Once fixed kernels exist, test for them here
+     */
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        return false;
+    }
+
+    if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) {
+        return false;
+    }
+
+    return true;
+}
+
+static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
+{
+    struct vfio_eeh_pe_op pe_op = {
+        .argsz = sizeof(pe_op),
+        .op = op,
+    };
+    int ret;
+
+    if (!vfio_eeh_container_ok(container)) {
+        error_report("vfio/eeh: EEH_PE_OP 0x%x: "
+                     "kernel requires a container with exactly one group", op);
+        return -EPERM;
+    }
+
+    ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op);
+    if (ret < 0) {
+        error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op);
+        return -errno;
+    }
+
+    return ret;
+}
+
+static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
+{
+    VFIOAddressSpace *space = vfio_get_address_space(as);
+    VFIOContainer *container = NULL;
+
+    if (QLIST_EMPTY(&space->containers)) {
+        /* No containers to act on */
+        goto out;
+    }
+
+    container = QLIST_FIRST(&space->containers);
+
+    if (QLIST_NEXT(container, next)) {
+        /*
+         * We don't yet have logic to synchronize EEH state across
+         * multiple containers
+         */
+        container = NULL;
+        goto out;
+    }
+
+out:
+    vfio_put_address_space(space);
+    return container;
+}
+
+bool vfio_eeh_as_ok(AddressSpace *as)
+{
+    VFIOContainer *container = vfio_eeh_as_container(as);
+
+    return (container != NULL) && vfio_eeh_container_ok(container);
+}
+
+int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
+{
+    VFIOContainer *container = vfio_eeh_as_container(as);
+
+    if (!container) {
+        return -ENODEV;
+    }
+    return vfio_eeh_container_op(container, op);
+}
+
+static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
+{
+    char *tmp, group_path[PATH_MAX], *group_name;
+    int ret, groupid;
+    ssize_t len;
+
+    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
+    len = readlink(tmp, group_path, sizeof(group_path));
+    g_free(tmp);
+
+    if (len <= 0 || len >= sizeof(group_path)) {
+        ret = len < 0 ? -errno : -ENAMETOOLONG;
+        error_setg_errno(errp, -ret, "no iommu_group found");
+        return ret;
+    }
+
+    group_path[len] = 0;
+
+    group_name = basename(group_path);
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_setg_errno(errp, errno, "failed to read %s", group_path);
+        return -errno;
+    }
+    return groupid;
+}
+
+/*
+ * vfio_attach_device: attach a device to a security context
+ * @name and @vbasedev->name are likely to be different depending
+ * on the type of the device, hence the need for passing @name
+ */
+int vfio_attach_device(char *name, VFIODevice *vbasedev,
+                       AddressSpace *as, Error **errp)
+{
+    int groupid = vfio_device_groupid(vbasedev, errp);
+    VFIODevice *vbasedev_iter;
+    VFIOGroup *group;
+    VFIOContainer *container;
+    int ret;
+
+    if (groupid < 0) {
+        return groupid;
+    }
+
+    trace_vfio_attach_device(vbasedev->name, groupid);
+
+    group = vfio_get_group(groupid, as, errp);
+    if (!group) {
+        return -ENOENT;
+    }
+
+    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
+            error_setg(errp, "device is already attached");
+            vfio_put_group(group);
+            return -EBUSY;
+        }
+    }
+    ret = vfio_get_device(group, name, vbasedev, errp);
+    if (ret) {
+        vfio_put_group(group);
+        return ret;
+    }
+
+    container = group->container;
+    vbasedev->container = container;
+    QLIST_INSERT_HEAD(&container->device_list, vbasedev, container_next);
+    QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
+
+    return ret;
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+    VFIOGroup *group = vbasedev->group;
+
+    QLIST_REMOVE(vbasedev, global_next);
+    QLIST_REMOVE(vbasedev, container_next);
+    vbasedev->container = NULL;
+    trace_vfio_detach_device(vbasedev->name, group->groupid);
+    vfio_put_base_device(vbasedev);
+    vfio_put_group(group);
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 3746c9f984..2a6912c940 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'helpers.c',
   'common.c',
+  'container.c',
   'spapr.c',
   'migration.c',
 ))
-- 
2.41.0
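
For illustration only, and not part of the posted patch: the capability
lookup in vfio_get_iommu_info_cap() walks a chain of headers whose "next"
fields are offsets from the start of the info buffer, with 0 ending the
chain. Below is a minimal sketch of that walk. It assumes hypothetical
stand-in types (struct info_header, struct cap_header, INFO_FLAG_CAPS)
rather than the real <linux/vfio.h> definitions, and it adds a bounds
check against argsz that the patch itself does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vfio_info_cap_header. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;          /* offset of next header from buffer start, 0 ends */
};

/* Hypothetical stand-in for the fixed part of the IOMMU info struct. */
struct info_header {
    uint32_t argsz;         /* total size of the buffer in bytes */
    uint32_t flags;
    uint32_t cap_offset;    /* offset of the first capability, 0 if none */
    uint32_t pad;
};

#define INFO_FLAG_CAPS (1u << 0)    /* illustrative flag value */

/* Return the first capability with the given id, or NULL if absent. */
static struct cap_header *find_cap(void *buf, uint16_t id)
{
    struct info_header *info = buf;
    uint8_t *base = buf;
    uint32_t off;

    if (!(info->flags & INFO_FLAG_CAPS)) {
        return NULL;
    }

    for (off = info->cap_offset; off != 0; ) {
        struct cap_header *hdr;

        /* Extra bounds check (an assumption, not present in the patch). */
        if (off > info->argsz || info->argsz - off < sizeof(*hdr)) {
            return NULL;
        }
        hdr = (struct cap_header *)(base + off);
        if (hdr->id == id) {
            return hdr;
        }
        off = hdr->next;
    }
    return NULL;
}

/* Build a two-entry chain and look up present and absent ids. */
static int demo_find_cap(void)
{
    uint8_t *buf = calloc(1, 64);
    struct info_header *info;
    struct cap_header *c1, *c2;
    int ok;

    if (!buf) {
        return 0;
    }
    info = (struct info_header *)buf;
    info->argsz = 64;
    info->flags = INFO_FLAG_CAPS;
    info->cap_offset = 16;

    c1 = (struct cap_header *)(buf + 16);
    c1->id = 1;
    c1->next = 32;
    c2 = (struct cap_header *)(buf + 32);
    c2->id = 3;
    c2->next = 0;

    ok = find_cap(buf, 1) == c1 && find_cap(buf, 3) == c2 &&
         find_cap(buf, 2) == NULL;
    free(buf);
    return ok;
}
```

demo_find_cap() returns 1 when both chained capabilities are found and the
missing id yields NULL.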




* Re: [PATCH v4 13/15] vfio/common: Store the parent container in VFIODevice
  2023-10-04 15:44 ` [PATCH v4 13/15] vfio/common: Store the parent container in VFIODevice Eric Auger
@ 2023-10-04 17:05   ` Cédric Le Goater
  0 siblings, 0 replies; 25+ messages in thread
From: Cédric Le Goater @ 2023-10-04 17:05 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, zhenzhong.duan,
	alex.williamson, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

On 10/4/23 17:44, Eric Auger wrote:
> From: Zhenzhong Duan <zhenzhong.duan@intel.com>
> 
> Let's store the parent container within the VFIODevice.
> This simplifies the logic in vfio_viommu_preset() and
> has the benefit of hiding the group specificity, which
> is useful for IOMMUFD migration.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

with the re-added check on !vbasedev->container,

Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
> 
> v3 -> v4:
> - Dropped check on !vbasedev->container
> ---
>   include/hw/vfio/vfio-common.h | 1 +
>   hw/vfio/common.c              | 4 +++-
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 8ca70dd821..bf12e40667 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -132,6 +132,7 @@ typedef struct VFIODevice {
>       QLIST_ENTRY(VFIODevice) next;
>       QLIST_ENTRY(VFIODevice) container_next;
>       struct VFIOGroup *group;
> +    VFIOContainer *container;
>       char *sysfsdev;
>       char *name;
>       DeviceState *dev;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ef9dc7c747..b14f04c9b0 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -184,7 +184,7 @@ void vfio_unblock_multiple_devices_migration(void)
>   
>   bool vfio_viommu_preset(VFIODevice *vbasedev)
>   {
> -    return vbasedev->group->container->space->as != &address_space_memory;
> +    return vbasedev->container->space->as != &address_space_memory;
>   }
>   
>   static void vfio_set_migration_error(int err)
> @@ -2655,6 +2655,7 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
>       }
>   
>       container = group->container;
> +    vbasedev->container = container;
>       QLIST_INSERT_HEAD(&container->device_list, vbasedev, container_next);
>   
>       return ret;
> @@ -2665,6 +2666,7 @@ void vfio_detach_device(VFIODevice *vbasedev)
>       VFIOGroup *group = vbasedev->group;
>   
>       QLIST_REMOVE(vbasedev, container_next);
> +    vbasedev->container = NULL;
>       trace_vfio_detach_device(vbasedev->name, group->groupid);
>       vfio_put_base_device(vbasedev);
>       vfio_put_group(group);
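
For illustration only: a minimal sketch of the idea in this patch, using
hypothetical stand-in types (Device, Group, Container, Space) instead of
the QEMU structures. It shows that a device-to-container back-pointer gives
the same answer as going through the group, and it assumes that the check
on a NULL container discussed above makes detach a no-op for a device that
was never attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QEMU types touched by the patch. */
typedef struct AddressSpace { const char *name; } AddressSpace;
typedef struct Space { AddressSpace *as; } Space;
typedef struct Container { Space *space; } Container;
typedef struct Group { Container *container; } Group;
typedef struct Device {
    Group *group;
    Container *container;   /* back-pointer added by the patch */
} Device;

static AddressSpace system_memory = { "memory" };

/* Old lookup: reach the container through the group. */
static bool viommu_preset_via_group(const Device *dev)
{
    return dev->group->container->space->as != &system_memory;
}

/* New lookup: use the back-pointer, with no group knowledge needed. */
static bool viommu_preset(const Device *dev)
{
    return dev->container->space->as != &system_memory;
}

static void device_attach(Device *dev, Group *group)
{
    dev->group = group;
    dev->container = group->container;
}

static void device_detach(Device *dev)
{
    if (!dev->container) {  /* assumed guard: never attached, or detached */
        return;
    }
    dev->container = NULL;
    dev->group = NULL;
}

/* Attach one device behind system memory and one behind a vIOMMU. */
static int demo_backpointer(void)
{
    AddressSpace viommu_as = { "viommu" };
    Space plain = { &system_memory }, translated = { &viommu_as };
    Container c0 = { &plain }, c1 = { &translated };
    Group g0 = { &c0 }, g1 = { &c1 };
    Device d0 = { NULL, NULL }, d1 = { NULL, NULL };
    int ok;

    device_attach(&d0, &g0);
    device_attach(&d1, &g1);
    ok = !viommu_preset(&d0) && viommu_preset(&d1) &&
         viommu_preset(&d0) == viommu_preset_via_group(&d0) &&
         viommu_preset(&d1) == viommu_preset_via_group(&d1);

    device_detach(&d1);
    device_detach(&d1);     /* second call is a no-op thanks to the guard */
    return ok && d1.container == NULL;
}
```

demo_backpointer() returns 1 when both lookups agree and the repeated
detach is harmless.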




* Re: [PATCH v4 14/15] vfio/common: Introduce a global VFIODevice list
  2023-10-04 15:44 ` [PATCH v4 14/15] vfio/common: Introduce a global VFIODevice list Eric Auger
@ 2023-10-04 17:05   ` Cédric Le Goater
  0 siblings, 0 replies; 25+ messages in thread
From: Cédric Le Goater @ 2023-10-04 17:05 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, zhenzhong.duan,
	alex.williamson, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

On 10/4/23 17:44, Eric Auger wrote:
> From: Zhenzhong Duan <zhenzhong.duan@intel.com>
> 
> Some functions iterate over all the VFIODevices. This is currently
> achieved by iterating over all groups/devices. Let's
> introduce a global list of VFIODevices simplifying that scan.
> 
> This will also be useful while migrating to IOMMUFD by hiding the
> group specificity.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   include/hw/vfio/vfio-common.h |  2 ++
>   hw/vfio/common.c              | 45 +++++++++++++++--------------------
>   2 files changed, 21 insertions(+), 26 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index bf12e40667..54905b9dd4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -131,6 +131,7 @@ typedef struct VFIODeviceOps VFIODeviceOps;
>   typedef struct VFIODevice {
>       QLIST_ENTRY(VFIODevice) next;
>       QLIST_ENTRY(VFIODevice) container_next;
> +    QLIST_ENTRY(VFIODevice) global_next;
>       struct VFIOGroup *group;
>       VFIOContainer *container;
>       char *sysfsdev;
> @@ -232,6 +233,7 @@ int vfio_kvm_device_del_fd(int fd, Error **errp);
>   
>   extern const MemoryRegionOps vfio_region_ops;
>   typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> +typedef QLIST_HEAD(VFIODeviceList, VFIODevice) VFIODeviceList;
>   extern VFIOGroupList vfio_group_list;
>   
>   bool vfio_mig_active(void);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index b14f04c9b0..5dd03763d2 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -48,6 +48,8 @@
>   
>   VFIOGroupList vfio_group_list =
>       QLIST_HEAD_INITIALIZER(vfio_group_list);
> +static VFIODeviceList vfio_device_list =
> +    QLIST_HEAD_INITIALIZER(vfio_device_list);
>   static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
>       QLIST_HEAD_INITIALIZER(vfio_address_spaces);
>   
> @@ -94,18 +96,15 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>   
>   bool vfio_mig_active(void)
>   {
> -    VFIOGroup *group;
>       VFIODevice *vbasedev;
>   
> -    if (QLIST_EMPTY(&vfio_group_list)) {
> +    if (QLIST_EMPTY(&vfio_device_list)) {
>           return false;
>       }
>   
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->migration_blocker) {
> -                return false;
> -            }
> +    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
> +        if (vbasedev->migration_blocker) {
> +            return false;
>           }
>       }
>       return true;
> @@ -120,19 +119,16 @@ static Error *multiple_devices_migration_blocker;
>    */
>   static bool vfio_multiple_devices_migration_is_supported(void)
>   {
> -    VFIOGroup *group;
>       VFIODevice *vbasedev;
>       unsigned int device_num = 0;
>       bool all_support_p2p = true;
>   
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->migration) {
> -                device_num++;
> +    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
> +        if (vbasedev->migration) {
> +            device_num++;
>   
> -                if (!(vbasedev->migration->mig_flags & VFIO_MIGRATION_P2P)) {
> -                    all_support_p2p = false;
> -                }
> +            if (!(vbasedev->migration->mig_flags & VFIO_MIGRATION_P2P)) {
> +                all_support_p2p = false;
>               }
>           }
>       }
> @@ -1777,22 +1773,17 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>   
>   void vfio_reset_handler(void *opaque)
>   {
> -    VFIOGroup *group;
>       VFIODevice *vbasedev;
>   
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->dev->realized) {
> -                vbasedev->ops->vfio_compute_needs_reset(vbasedev);
> -            }
> +    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
> +        if (vbasedev->dev->realized) {
> +            vbasedev->ops->vfio_compute_needs_reset(vbasedev);
>           }
>       }
>   
> -    QLIST_FOREACH(group, &vfio_group_list, next) {
> -        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> -            if (vbasedev->dev->realized && vbasedev->needs_reset) {
> -                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
> -            }
> +    QLIST_FOREACH(vbasedev, &vfio_device_list, global_next) {
> +        if (vbasedev->dev->realized && vbasedev->needs_reset) {
> +            vbasedev->ops->vfio_hot_reset_multi(vbasedev);
>           }
>       }
>   }
> @@ -2657,6 +2648,7 @@ int vfio_attach_device(char *name, VFIODevice *vbasedev,
>       container = group->container;
>       vbasedev->container = container;
>       QLIST_INSERT_HEAD(&container->device_list, vbasedev, container_next);
> +    QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>   
>       return ret;
>   }
> @@ -2665,6 +2657,7 @@ void vfio_detach_device(VFIODevice *vbasedev)
>   {
>       VFIOGroup *group = vbasedev->group;
>   
> +    QLIST_REMOVE(vbasedev, global_next);
>       QLIST_REMOVE(vbasedev, container_next);
>       vbasedev->container = NULL;
>       trace_vfio_detach_device(vbasedev->name, group->groupid);
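
As a side note for readers, the pattern introduced here (one device sitting on several lists at once) can be sketched with the BSD <sys/queue.h> macros, from which QEMU's QLIST_* macros are derived. The Device and Group types and the attach(), detach() and count_devices() helpers below are hypothetical simplifications, not QEMU code. The point to notice is that the link field given to LIST_FOREACH has to be the one that was used to insert into that particular list.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>   /* BSD list macros, the model for QLIST_* */

/* Hypothetical, simplified device: one link field per list it sits on. */
typedef struct Device {
    int id;
    LIST_ENTRY(Device) next;         /* link in the per-group list */
    LIST_ENTRY(Device) global_next;  /* link in the global list */
} Device;

LIST_HEAD(DeviceList, Device);

typedef struct Group {
    struct DeviceList device_list;
} Group;

static struct DeviceList global_list = LIST_HEAD_INITIALIZER(global_list);

/* Put the device on both its group list and the global list. */
static void attach(Group *group, Device *dev)
{
    assert(group != NULL && dev != NULL);
    LIST_INSERT_HEAD(&group->device_list, dev, next);
    LIST_INSERT_HEAD(&global_list, dev, global_next);
}

/* Unlink the device from both lists, each through its own field. */
static void detach(Device *dev)
{
    LIST_REMOVE(dev, global_next);
    LIST_REMOVE(dev, next);
}

/* Count every attached device without looking at any group. */
static int count_devices(void)
{
    Device *dev;
    int n = 0;

    /* Walk the global list through the global_next links. */
    LIST_FOREACH(dev, &global_list, global_next) {
        n++;
    }
    return n;
}
```

Walking the global list through the per-group field would compile, since both fields have the same type, but it would follow the group chain after the first element and could skip devices that belong to other groups.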




* Re: [PATCH v4 00/15] Prerequisite changes for IOMMUFD support
  2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
                   ` (14 preceding siblings ...)
  2023-10-04 15:44 ` [PATCH v4 15/15] vfio/common: Move legacy VFIO backend code into separate container.c Eric Auger
@ 2023-10-04 19:58 ` Cédric Le Goater
  2023-10-05  6:55   ` Eric Auger
  15 siblings, 1 reply; 25+ messages in thread
From: Cédric Le Goater @ 2023-10-04 19:58 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, zhenzhong.duan,
	alex.williamson, jgg, nicolinc, joao.m.martins, peterx,
	kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato, aik

On 10/4/23 17:43, Eric Auger wrote:
> Hi All,
> 
> With the agreement of Zhenzhong, here is a v4 respin of the IOMMUFD
> prerequisite series. This applies on top of vfio-next:
> https://github.com/legoater/qemu/, branch vfio-next.
> 
> Per Cédric's suggestion, the IOMMUFD patchset v1[1] is now split
> into two series, this prerequisite series and the new IOMMUFD backend
> introduction support series. Hopefully this will ease the review.
> 
> The main purpose of this series is to make "common.c" group agnostic:
> all group-related code is moved into container.c. This prepares the
> next series: abstract base container, new backend, etc.
> 
> This series can be found at
> https://github.com/eauger/qemu/tree/prereq_v4
> 
> Tests done:
> - PCI devices were tested
> - device hotplug test
> - with or without vIOMMU
> - VFIO migration with an E800 net card (no dirty sync support) passthrough
> - platform and ccw were only compile-tested due to environment limitations
> 
> Zhenzhong, Yi, Eric

LGTM, I did a small adjustment in vfio_detach_device() in patches 13+15.

Applied to vfio-next.

Thanks,

C.


> [1] https://lore.kernel.org/all/20230830103754.36461-1-zhenzhong.duan@intel.com/t/#u
> 
> Changelog:
> 
> v4:
> - include qemu/error-report.h in helpers.c
> - in ap.c, fix the wrongly added
>    vfio_detach_device(vbasedev) and g_free(vbasedev->name);
>    also added error_prepend
> - simplified vbasedev setting in ccw.c
> - vfio_detach_device: dropped check on
>    !vbasedev->container
> - container.c: restore dropped comment
> 
> v3:
> - rebased on vfio-next as suggested by Cedric
> - added vfio/common: Propagate KVM_SET_DEVICE_ATTR error if any
> - collected Cedric's R-b
> - Fix some error paths in vfio/pci which now properly detach the device
>    and also free the vbasedev->name
> - Fix vfio/ccw migration (hopefully) [Matthew inputs]
> - Split [PATCH v2 11/12] vfio/common: Introduce two kinds of VFIO device lists
>    into 3 patches
> 
> v2:
> - Refine patch description per Eric
> - return errno and errp in vfio_kvm_device_[add/del]_fd per Eric
> - make memory listener register/deregister in a separate patch per Eric
> - Include the .h file first per Cédric
> - Add trace event in vfio_attach_device per Cédric
> - drop the change to vfio_viommu_preset by refactor per Cédric
> - Introduce global VFIO device list and per container list per Alex
> 
> Note: the changelog below is from the full IOMMUFD series:
> 
> v1:
> - Alloc hwpt instead of using auto hwpt
> - elaborate iommufd code per Nicolin
> - consolidate two patches and drop as.c
> - typo error fix and function rename
> 
> rfcv4:
> - rebase on top of v8.0.3
> - Add one patch from Yi which is about vfio device add in kvm
> - Remove IOAS_COPY optimization and focus on functions in this patchset
> - Fix wrong name issue reported and fix suggested by Matthew
> - Fix compilation issue reported and fix suggested by Nicolin
> - Use query_dirty_bitmap callback to replace get_dirty_bitmap for better
> granularity
> - Add dev_iter_next() callback to avoid adding so many callbacks
>    at container scope, add VFIODevice.hwpt to support that
> - Restore all functions back to common from container whenever possible,
>    mainly migration and reset related functions
> - Add --enable/disable-iommufd config option, enabled by default in linux
> - Remove VFIODevice.hwpt_next as it's redundant with VFIODevice.next
> - Adapt new VFIO_DEVICE_PCI_HOT_RESET uAPI for IOMMUFD backed device
> - vfio_kvm_device_add/del_group call vfio_kvm_device_add/del_fd to remove
> redundant code
> - Add FD passing support for vfio device backed by IOMMUFD
> - Fix hot unplug resource leak issue in vfio_legacy_detach_device()
> - Fix FD leak in vfio_get_devicefd()
> 
> rfcv3:
> - rebase on top of v7.2.0
> - Fix the compilation with CONFIG_IOMMUFD unset by using true classes for
>    VFIO backends
> - Fix use after free in error path, reported by Alister
> - Split common.c in several steps to ease the review
> 
> rfcv2:
> - remove the first three patches of rfcv1
> - add open cdev helper suggested by Jason
> - remove the QOMification of the VFIOContainer and simply use standard ops
> (David)
> - add "-object iommufd" suggested by Alex
> 
> Thanks
> Zhenzhong
> 
> 
> Eric Auger (7):
>    scripts/update-linux-headers: Add iommufd.h
>    vfio/common: Propagate KVM_SET_DEVICE_ATTR error if any
>    vfio/common: Introduce vfio_container_add|del_section_window()
>    vfio/pci: Introduce vfio_[attach/detach]_device
>    vfio/platform: Use vfio_[attach/detach]_device
>    vfio/ap: Use vfio_[attach/detach]_device
>    vfio/ccw: Use vfio_[attach/detach]_device
> 
> Yi Liu (2):
>    vfio/common: Move IOMMU agnostic helpers to a separate file
>    vfio/common: Move legacy VFIO backend code into separate container.c
> 
> Zhenzhong Duan (6):
>    linux-headers: Add iommufd.h
>    vfio/common: Extract out vfio_kvm_device_[add/del]_fd
>    vfio/common: Move VFIO reset handler registration to a group agnostic
>      function
>    vfio/common: Introduce a per container device list
>    vfio/common: Store the parent container in VFIODevice
>    vfio/common: Introduce a global VFIODevice list
> 
>   include/hw/vfio/vfio-common.h   |   60 +-
>   linux-headers/linux/iommufd.h   |  444 ++++++++
>   hw/vfio/ap.c                    |   69 +-
>   hw/vfio/ccw.c                   |  122 +-
>   hw/vfio/common.c                | 1851 ++-----------------------------
>   hw/vfio/container.c             | 1157 +++++++++++++++++++
>   hw/vfio/helpers.c               |  612 ++++++++++
>   hw/vfio/pci.c                   |   67 +-
>   hw/vfio/platform.c              |   43 +-
>   hw/vfio/meson.build             |    2 +
>   hw/vfio/trace-events            |    4 +-
>   scripts/update-linux-headers.sh |    3 +-
>   12 files changed, 2458 insertions(+), 1976 deletions(-)
>   create mode 100644 linux-headers/linux/iommufd.h
>   create mode 100644 hw/vfio/container.c
>   create mode 100644 hw/vfio/helpers.c
> 




* Re: [PATCH v4 00/15] Prerequisite changes for IOMMUFD support
  2023-10-04 19:58 ` [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Cédric Le Goater
@ 2023-10-05  6:55   ` Eric Auger
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2023-10-05  6:55 UTC (permalink / raw)
  To: Cédric Le Goater, eric.auger.pro, qemu-devel,
	zhenzhong.duan, alex.williamson, jgg, nicolinc, joao.m.martins,
	peterx, kevin.tian, yi.l.liu, yi.y.sun, chao.p.peng, mjrosato,
	aik

Hi Cédric,

On 10/4/23 21:58, Cédric Le Goater wrote:
> On 10/4/23 17:43, Eric Auger wrote:
>> Hi All,
>>
>> With the agreement of Zhenzhong, here is a v4 respin of the IOMMUFD
>> prerequisite series. This applies on top of vfio-next:
>> https://github.com/legoater/qemu/, branch vfio-next.
>>
>> Per Cédric's suggestion, the IOMMUFD patchset v1[1] is now split
>> into two series, this prerequisite series and the new IOMMUFD backend
>> introduction support series. Hopefully this will ease the review.
>>
>> The main purpose of this series is to make "common.c" group agnostic:
>> all group-related code is moved into container.c. This prepares the
>> next series: abstract base container, new backend, etc.
>>
>> This series can be found at
>> https://github.com/eauger/qemu/tree/prereq_v4
>>
>> Tests done:
>> - PCI devices were tested
>> - device hotplug test
>> - with or without vIOMMU
>> - VFIO migration with an E800 net card (no dirty sync support) passthrough
>> - platform and ccw were only compile-tested due to environment limitations
>>
>> Zhenzhong, Yi, Eric
>
> LGTM, I did a small adjustment in vfio_detach_device() in patches 13+15.
>
> Applied to vfio-next.

Thanks!

Eric




* RE: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
  2023-10-04 15:43 ` [PATCH v4 10/15] vfio/ccw: " Eric Auger
@ 2023-10-08 10:21   ` Duan, Zhenzhong
  2023-10-08 17:45     ` Eric Auger
  0 siblings, 1 reply; 25+ messages in thread
From: Duan, Zhenzhong @ 2023-10-08 10:21 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, alex.williamson, clg,
	jgg, nicolinc, Martins, Joao, peterx, Tian, Kevin, Liu, Yi L,
	Sun, Yi Y, Peng, Chao P, mjrosato, aik

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, October 4, 2023 11:44 PM
>Subject: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>
>Let the vfio-ccw device use vfio_attach_device() and
>vfio_detach_device(), hence hiding the details of the used
>IOMMU backend.
>
>Note that the migration reduces the following trace
>"vfio: subchannel %s has already been attached" (featuring
>cssid.ssid.devid) into "device is already attached"
>
>Also now all the devices have been migrated to use the new
>vfio_attach_device/vfio_detach_device API, let's turn the
>legacy functions into static functions, local to container.c.
>
>Signed-off-by: Eric Auger <eric.auger@redhat.com>
>Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
>
>---
>
>v3:
>- simplified vbasedev->dev setting
>
>v2 -> v3:
>- Hopefully fix confusion between vbasedev->name, mdevid and sysfsdev
>  while taking into account Matthew's comment
>  https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
>---
> include/hw/vfio/vfio-common.h |   5 --
> hw/vfio/ccw.c                 | 122 +++++++++-------------------------
> hw/vfio/common.c              |  10 +--
> 3 files changed, 37 insertions(+), 100 deletions(-)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>index 12fbfbc37d..c486bdef2a 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -202,7 +202,6 @@ typedef struct {
>     hwaddr pages;
> } VFIOBitmap;
>
>-void vfio_put_base_device(VFIODevice *vbasedev);
> void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
> void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
> void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
>@@ -220,11 +219,7 @@ void vfio_region_unmap(VFIORegion *region);
> void vfio_region_exit(VFIORegion *region);
> void vfio_region_finalize(VFIORegion *region);
> void vfio_reset_handler(void *opaque);
>-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>-void vfio_put_group(VFIOGroup *group);
> struct vfio_device_info *vfio_get_device_info(int fd);
>-int vfio_get_device(VFIOGroup *group, const char *name,
>-                    VFIODevice *vbasedev, Error **errp);
> int vfio_attach_device(char *name, VFIODevice *vbasedev,
>                        AddressSpace *as, Error **errp);
> void vfio_detach_device(VFIODevice *vbasedev);
>diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
>index 1e2fce83b0..6ec35fedc9 100644
>--- a/hw/vfio/ccw.c
>+++ b/hw/vfio/ccw.c
>@@ -572,88 +572,15 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev)
>     g_free(vcdev->io_region);
> }
>
>-static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
>-{
>-    g_free(vcdev->vdev.name);
>-    vfio_put_base_device(&vcdev->vdev);
>-}
>-
>-static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
>-                                Error **errp)
>-{
>-    S390CCWDevice *cdev = S390_CCW_DEVICE(vcdev);
>-    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>-                                 cdev->hostid.ssid,
>-                                 cdev->hostid.devid);
>-    VFIODevice *vbasedev;
>-
>-    QLIST_FOREACH(vbasedev, &group->device_list, next) {
>-        if (strcmp(vbasedev->name, name) == 0) {
>-            error_setg(errp, "vfio: subchannel %s has already been attached",
>-                       name);
>-            goto out_err;
>-        }
>-    }
>-
>-    /*
>-     * All vfio-ccw devices are believed to operate in a way compatible with
>-     * discarding of memory in RAM blocks, ie. pages pinned in the host are
>-     * in the current working set of the guest driver and therefore never
>-     * overlap e.g., with pages available to the guest balloon driver.  This
>-     * needs to be set before vfio_get_device() for vfio common to handle
>-     * ram_block_discard_disable().
>-     */
>-    vcdev->vdev.ram_block_discard_allowed = true;
>-
>-    if (vfio_get_device(group, cdev->mdevid, &vcdev->vdev, errp)) {
>-        goto out_err;
>-    }
>-
>-    vcdev->vdev.ops = &vfio_ccw_ops;
>-    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
>-    vcdev->vdev.name = name;
>-    vcdev->vdev.dev = DEVICE(vcdev);
>-
>-    return;
>-
>-out_err:
>-    g_free(name);
>-}
>-
>-static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
>-{
>-    char *tmp, group_path[PATH_MAX];
>-    ssize_t len;
>-    int groupid;
>-
>-    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
>-                          cdev->hostid.cssid, cdev->hostid.ssid,
>-                          cdev->hostid.devid, cdev->mdevid);
>-    len = readlink(tmp, group_path, sizeof(group_path));
>-    g_free(tmp);
>-
>-    if (len <= 0 || len >= sizeof(group_path)) {
>-        error_setg(errp, "vfio: no iommu_group found");
>-        return NULL;
>-    }
>-
>-    group_path[len] = 0;
>-
>-    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
>-        error_setg(errp, "vfio: failed to read %s", group_path);
>-        return NULL;
>-    }
>-
>-    return vfio_get_group(groupid, &address_space_memory, errp);
>-}
>-
> static void vfio_ccw_realize(DeviceState *dev, Error **errp)
> {
>-    VFIOGroup *group;
>     S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
>     VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
>     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
>+    VFIODevice *vbasedev = &vcdev->vdev;
>     Error *err = NULL;
>+    char *name;
>+    int ret;
>
>     /* Call the class init function for subchannel. */
>     if (cdc->realize) {
>@@ -663,14 +590,31 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
>         }
>     }
>
>-    group = vfio_ccw_get_group(cdev, &err);
>-    if (!group) {
>-        goto out_group_err;
>-    }
>+    name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
>+                           vcdev->cdev.hostid.ssid,
>+                           vcdev->cdev.hostid.devid);
>+    vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
>+                                         name,
>+                                         cdev->mdevid);

Hopefully it is not too late to include this in v5.
I don't think we need to re-assign sysfsdev: it is a user property, so we had
better keep the original user value. This also looks like a memory leak.

>+    vbasedev->ops = &vfio_ccw_ops;
>+    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
>+    vbasedev->name = name;

There is a potential failure when a second mdev device under the
same cssid.ssid.devid is attached. We can use cdev->mdevid as the name.

Maybe you can use v2 of this patch; as I remember, both issues were already addressed in v2.

Thanks
Zhenzhong
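
The two remarks above can be illustrated with a small stand-alone sketch. It is not the vfio-ccw code: the Device type and the dup_string(), set_identity() and already_attached() helpers are hypothetical, and it assumes that the attach path rejects a device whose name matches one that is already attached.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified device. */
typedef struct Device {
    char *sysfsdev;  /* stands in for a string property set by the user */
    char *name;      /* identifier used for duplicate detection */
} Device;

/* Return a heap copy of s (caller frees). */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *out = malloc(len);

    assert(out != NULL);
    memcpy(out, s, len);
    return out;
}

/*
 * Name the device after its mdev id. dev->sysfsdev is left untouched,
 * so the user's string is neither overwritten nor leaked.
 */
static void set_identity(Device *dev, const char *mdevid)
{
    free(dev->name);  /* free(NULL) is a no-op */
    dev->name = dup_string(mdevid);
}

/* Duplicate detection keyed on the device name. */
static int already_attached(Device *const *devs, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(devs[i]->name, name) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Keyed on the mdev id, two mdevs under the same subchannel get distinct names. Keyed on cssid.ssid.devid, the second one would collide with the first.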





* Re: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
  2023-10-08 10:21   ` Duan, Zhenzhong
@ 2023-10-08 17:45     ` Eric Auger
  2023-10-09  1:25       ` Duan, Zhenzhong
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2023-10-08 17:45 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger.pro, qemu-devel, alex.williamson,
	clg, jgg, nicolinc, Martins, Joao, peterx, Tian, Kevin, Liu,
	Yi L, Sun, Yi Y, Peng, Chao P, mjrosato, aik

Hi Zhenzhong,
On 10/8/23 12:21, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Wednesday, October 4, 2023 11:44 PM
>> Subject: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>>
>> Let the vfio-ccw device use vfio_attach_device() and
>> vfio_detach_device(), hence hiding the details of the used
>> IOMMU backend.
>>
>> Note that the migration reduces the following trace
>> "vfio: subchannel %s has already been attached" (featuring
>> cssid.ssid.devid) into "device is already attached"
>>
>> Also now all the devices have been migrated to use the new
>> vfio_attach_device/vfio_detach_device API, let's turn the
>> legacy functions into static functions, local to container.c.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>
>> ---
>>
>> v3:
>> - simplified vbasedev->dev setting
>>
>> v2 -> v3:
>> - Hopefully fix confusion between vbasedev->name, mdevid and sysfsdev
>>  while taking into account Matthew's comment
>>  https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
>> ---
>> include/hw/vfio/vfio-common.h |   5 --
>> hw/vfio/ccw.c                 | 122 +++++++++-------------------------
>> hw/vfio/common.c              |  10 +--
>> 3 files changed, 37 insertions(+), 100 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 12fbfbc37d..c486bdef2a 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -202,7 +202,6 @@ typedef struct {
>>     hwaddr pages;
>> } VFIOBitmap;
>>
>> -void vfio_put_base_device(VFIODevice *vbasedev);
>> void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>> void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
>> void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
>> @@ -220,11 +219,7 @@ void vfio_region_unmap(VFIORegion *region);
>> void vfio_region_exit(VFIORegion *region);
>> void vfio_region_finalize(VFIORegion *region);
>> void vfio_reset_handler(void *opaque);
>> -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>> -void vfio_put_group(VFIOGroup *group);
>> struct vfio_device_info *vfio_get_device_info(int fd);
>> -int vfio_get_device(VFIOGroup *group, const char *name,
>> -                    VFIODevice *vbasedev, Error **errp);
>> int vfio_attach_device(char *name, VFIODevice *vbasedev,
>>                        AddressSpace *as, Error **errp);
>> void vfio_detach_device(VFIODevice *vbasedev);
>> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
>> index 1e2fce83b0..6ec35fedc9 100644
>> --- a/hw/vfio/ccw.c
>> +++ b/hw/vfio/ccw.c
>> @@ -572,88 +572,15 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev)
>>     g_free(vcdev->io_region);
>> }
>>
>> -static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
>> -{
>> -    g_free(vcdev->vdev.name);
>> -    vfio_put_base_device(&vcdev->vdev);
>> -}
>> -
>> -static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
>> -                                Error **errp)
>> -{
>> -    S390CCWDevice *cdev = S390_CCW_DEVICE(vcdev);
>> -    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>> -                                 cdev->hostid.ssid,
>> -                                 cdev->hostid.devid);
>> -    VFIODevice *vbasedev;
>> -
>> -    QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> -        if (strcmp(vbasedev->name, name) == 0) {
>> -            error_setg(errp, "vfio: subchannel %s has already been attached",
>> -                       name);
>> -            goto out_err;
>> -        }
>> -    }
>> -
>> -    /*
>> -     * All vfio-ccw devices are believed to operate in a way compatible with
>> -     * discarding of memory in RAM blocks, ie. pages pinned in the host are
>> -     * in the current working set of the guest driver and therefore never
>> -     * overlap e.g., with pages available to the guest balloon driver.  This
>> -     * needs to be set before vfio_get_device() for vfio common to handle
>> -     * ram_block_discard_disable().
>> -     */
>> -    vcdev->vdev.ram_block_discard_allowed = true;
>> -
>> -    if (vfio_get_device(group, cdev->mdevid, &vcdev->vdev, errp)) {
>> -        goto out_err;
>> -    }
>> -
>> -    vcdev->vdev.ops = &vfio_ccw_ops;
>> -    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
>> -    vcdev->vdev.name = name;
>> -    vcdev->vdev.dev = DEVICE(vcdev);
>> -
>> -    return;
>> -
>> -out_err:
>> -    g_free(name);
>> -}
>> -
>> -static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
>> -{
>> -    char *tmp, group_path[PATH_MAX];
>> -    ssize_t len;
>> -    int groupid;
>> -
>> -    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
>> -                          cdev->hostid.cssid, cdev->hostid.ssid,
>> -                          cdev->hostid.devid, cdev->mdevid);
>> -    len = readlink(tmp, group_path, sizeof(group_path));
>> -    g_free(tmp);
>> -
>> -    if (len <= 0 || len >= sizeof(group_path)) {
>> -        error_setg(errp, "vfio: no iommu_group found");
>> -        return NULL;
>> -    }
>> -
>> -    group_path[len] = 0;
>> -
>> -    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
>> -        error_setg(errp, "vfio: failed to read %s", group_path);
>> -        return NULL;
>> -    }
>> -
>> -    return vfio_get_group(groupid, &address_space_memory, errp);
>> -}
>> -
>> static void vfio_ccw_realize(DeviceState *dev, Error **errp)
>> {
>> -    VFIOGroup *group;
>>     S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
>>     VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
>>     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
>> +    VFIODevice *vbasedev = &vcdev->vdev;
>>     Error *err = NULL;
>> +    char *name;
>> +    int ret;
>>
>>     /* Call the class init function for subchannel. */
>>     if (cdc->realize) {
>> @@ -663,14 +590,31 @@ static void vfio_ccw_realize(DeviceState *dev, Error
>> **errp)
>>         }
>>     }
>>
>> -    group = vfio_ccw_get_group(cdev, &err);
>> -    if (!group) {
>> -        goto out_group_err;
>> -    }
>> +    name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
>> +                           vcdev->cdev.hostid.ssid,
>> +                           vcdev->cdev.hostid.devid);
>> +    vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
>> +                                         name,
>> +                                         cdev->mdevid);
> I hope it is not too late for you to include this in v5.
> I think there is no need to re-assign sysfsdev: it is a user property, so we
> had better keep the original user value. It also looks like a memory leak.
OK I removed it.
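
For reference, the leak pattern pointed out above can be sketched in plain C
as below. This is only an illustration, not the QEMU code: struct
demo_vbasedev, demo_strdup() and demo_set_sysfsdev_if_unset() are
hypothetical names, and malloc()/free() stand in for
g_strdup_printf()/g_free(). It shows one way to keep a user-supplied value
instead of overwriting (and leaking) it; the patch itself simply drops the
assignment.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the sysfsdev field of VFIODevice. */
struct demo_vbasedev {
    char *sysfsdev; /* heap-allocated, owned by the struct; NULL if unset */
};

/* Return a heap copy of s, or NULL on allocation failure. */
static char *demo_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy) {
        memcpy(copy, s, len);
    }
    return copy;
}

/*
 * Set sysfsdev only when the user did not provide one.
 * Returns 1 if the value was set, 0 if the existing value was kept,
 * -1 on allocation failure.  An unconditional
 * "d->sysfsdev = demo_strdup(path)" would leak the previous string.
 */
static int demo_set_sysfsdev_if_unset(struct demo_vbasedev *d,
                                      const char *path)
{
    if (d->sysfsdev) {
        return 0; /* keep the user-supplied value */
    }
    d->sysfsdev = demo_strdup(path);
    return d->sysfsdev ? 1 : -1;
}
```

With a user-provided sysfsdev the helper returns 0 and leaves the pointer
untouched, so nothing is leaked and the user value survives.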
>
>> +    vbasedev->ops = &vfio_ccw_ops;
>> +    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
>> +    vbasedev->name = name;
> There is a potential failure when a second mdev device under the
> same cssid.ssid.devid is attached. We can use cdev->mdevid as the name.
But this matches the existing vfio_ccw_get_device() code, where
vcdev->vdev.name = name; and
name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
                                 cdev->hostid.ssid,
                                 cdev->hostid.devid);

cdev->mdevid is passed as the first argument of vfio_attach_device() instead.

I think this also matches
https://lore.kernel.org/all/PH7PR11MB67222DD282F98E03095FBA8A92C1A@PH7PR11MB6722.namprd11.prod.outlook.com/
no?

Thanks

Eric

>
> Maybe you can use v2 of this patch, I remember these two issues are already addressed in v2.
>
> Thanks
> Zhenzhong
>
>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
  2023-10-08 17:45     ` Eric Auger
@ 2023-10-09  1:25       ` Duan, Zhenzhong
  2023-10-09  8:14         ` Eric Auger
  0 siblings, 1 reply; 25+ messages in thread
From: Duan, Zhenzhong @ 2023-10-09  1:25 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, alex.williamson, clg,
	jgg, nicolinc, Martins, Joao, peterx, Tian, Kevin, Liu, Yi L,
	Sun, Yi Y, Peng, Chao P, mjrosato, aik



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Monday, October 9, 2023 1:46 AM
>Subject: Re: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>
>Hi Zhenzhong,
>On 10/8/23 12:21, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Sent: Wednesday, October 4, 2023 11:44 PM
>>> Subject: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>>>
>>> Let the vfio-ccw device use vfio_attach_device() and
>>> vfio_detach_device(), hence hiding the details of the used
>>> IOMMU backend.
>>>
>>> Note that the migration reduces the following trace
>>> "vfio: subchannel %s has already been attached" (featuring
>>> cssid.ssid.devid) into "device is already attached"
>>>
>>> Also now all the devices have been migrated to use the new
>>> vfio_attach_device/vfio_detach_device API, let's turn the
>>> legacy functions into static functions, local to container.c.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>>
>>> ---
>>>
>>> v3:
>>> - simplified vbasedev->dev setting
>>>
>>> v2 -> v3:
>>> - Hopefully fix confusion between vbasedev->name, mdevid and sysfsdev
>>>  while taking into account Matthew's comment
>>>  https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
>>> ---
>>> include/hw/vfio/vfio-common.h |   5 --
>>> hw/vfio/ccw.c                 | 122 +++++++++-------------------------
>>> hw/vfio/common.c              |  10 +--
>>> 3 files changed, 37 insertions(+), 100 deletions(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index 12fbfbc37d..c486bdef2a 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -202,7 +202,6 @@ typedef struct {
>>>     hwaddr pages;
>>> } VFIOBitmap;
>>>
>>> -void vfio_put_base_device(VFIODevice *vbasedev);
>>> void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>>> void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
>>> void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
>>> @@ -220,11 +219,7 @@ void vfio_region_unmap(VFIORegion *region);
>>> void vfio_region_exit(VFIORegion *region);
>>> void vfio_region_finalize(VFIORegion *region);
>>> void vfio_reset_handler(void *opaque);
>>> -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>>> -void vfio_put_group(VFIOGroup *group);
>>> struct vfio_device_info *vfio_get_device_info(int fd);
>>> -int vfio_get_device(VFIOGroup *group, const char *name,
>>> -                    VFIODevice *vbasedev, Error **errp);
>>> int vfio_attach_device(char *name, VFIODevice *vbasedev,
>>>                        AddressSpace *as, Error **errp);
>>> void vfio_detach_device(VFIODevice *vbasedev);
>>> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
>>> index 1e2fce83b0..6ec35fedc9 100644
>>> --- a/hw/vfio/ccw.c
>>> +++ b/hw/vfio/ccw.c
>>> @@ -572,88 +572,15 @@ static void vfio_ccw_put_region(VFIOCCWDevice
>>> *vcdev)
>>>     g_free(vcdev->io_region);
>>> }
>>>
>>> -static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
>>> -{
>>> -    g_free(vcdev->vdev.name);
>>> -    vfio_put_base_device(&vcdev->vdev);
>>> -}
>>> -
>>> -static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
>>> -                                Error **errp)
>>> -{
>>> -    S390CCWDevice *cdev = S390_CCW_DEVICE(vcdev);
>>> -    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>>> -                                 cdev->hostid.ssid,
>>> -                                 cdev->hostid.devid);
>>> -    VFIODevice *vbasedev;
>>> -
>>> -    QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>> -        if (strcmp(vbasedev->name, name) == 0) {
>>> -            error_setg(errp, "vfio: subchannel %s has already been attached",
>>> -                       name);
>>> -            goto out_err;
>>> -        }
>>> -    }
>>> -
>>> -    /*
>>> -     * All vfio-ccw devices are believed to operate in a way compatible with
>>> -     * discarding of memory in RAM blocks, ie. pages pinned in the host are
>>> -     * in the current working set of the guest driver and therefore never
>>> -     * overlap e.g., with pages available to the guest balloon driver.  This
>>> -     * needs to be set before vfio_get_device() for vfio common to handle
>>> -     * ram_block_discard_disable().
>>> -     */
>>> -    vcdev->vdev.ram_block_discard_allowed = true;
>>> -
>>> -    if (vfio_get_device(group, cdev->mdevid, &vcdev->vdev, errp)) {
>>> -        goto out_err;
>>> -    }
>>> -
>>> -    vcdev->vdev.ops = &vfio_ccw_ops;
>>> -    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
>>> -    vcdev->vdev.name = name;
>>> -    vcdev->vdev.dev = DEVICE(vcdev);
>>> -
>>> -    return;
>>> -
>>> -out_err:
>>> -    g_free(name);
>>> -}
>>> -
>>> -static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
>>> -{
>>> -    char *tmp, group_path[PATH_MAX];
>>> -    ssize_t len;
>>> -    int groupid;
>>> -
>>> -    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
>>> -                          cdev->hostid.cssid, cdev->hostid.ssid,
>>> -                          cdev->hostid.devid, cdev->mdevid);
>>> -    len = readlink(tmp, group_path, sizeof(group_path));
>>> -    g_free(tmp);
>>> -
>>> -    if (len <= 0 || len >= sizeof(group_path)) {
>>> -        error_setg(errp, "vfio: no iommu_group found");
>>> -        return NULL;
>>> -    }
>>> -
>>> -    group_path[len] = 0;
>>> -
>>> -    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
>>> -        error_setg(errp, "vfio: failed to read %s", group_path);
>>> -        return NULL;
>>> -    }
>>> -
>>> -    return vfio_get_group(groupid, &address_space_memory, errp);
>>> -}
>>> -
>>> static void vfio_ccw_realize(DeviceState *dev, Error **errp)
>>> {
>>> -    VFIOGroup *group;
>>>     S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
>>>     VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
>>>     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
>>> +    VFIODevice *vbasedev = &vcdev->vdev;
>>>     Error *err = NULL;
>>> +    char *name;
>>> +    int ret;
>>>
>>>     /* Call the class init function for subchannel. */
>>>     if (cdc->realize) {
>>> @@ -663,14 +590,31 @@ static void vfio_ccw_realize(DeviceState *dev, Error
>>> **errp)
>>>         }
>>>     }
>>>
>>> -    group = vfio_ccw_get_group(cdev, &err);
>>> -    if (!group) {
>>> -        goto out_group_err;
>>> -    }
>>> +    name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
>>> +                           vcdev->cdev.hostid.ssid,
>>> +                           vcdev->cdev.hostid.devid);
>>> +    vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
>>> +                                         name,
>>> +                                         cdev->mdevid);
>> I hope it is not too late for you to include this in v5.
>> I think there is no need to re-assign sysfsdev: it is a user property, so we
>> had better keep the original user value. It also looks like a memory leak.
>OK I removed it.
>>
>>> +    vbasedev->ops = &vfio_ccw_ops;
>>> +    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
>>> +    vbasedev->name = name;
>> There is a potential failure when a second mdev device under the
>> same cssid.ssid.devid is attached. We can use cdev->mdevid as the name.
>But this matches the existing vfio_ccw_get_device() code, where
>vcdev->vdev.name = name; and
>name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>                                 cdev->hostid.ssid,
>                                 cdev->hostid.devid);

I suspect this is a bug in the existing code.

>
>cdev->mdevid is passed as the first argument of vfio_attach_device() instead.

vfio_attach_device() uses cdev->mdevid only to get the device FD, nothing more.

If we use cssid.ssid.devid as the name, then different mdevs under the same
cssid.ssid.devid will have the same name, and attaching the second mdev will
fail in vfio_attach_device():

    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
            error_setg(errp, "device is already attached");
            vfio_put_group(group);
            return -EBUSY;
        }
    }
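
The collision can be sketched in plain C as follows. This is a minimal
illustration under stated assumptions, not QEMU code: struct demo_dev,
demo_already_attached() and demo_subchannel_name() are hypothetical names,
and it assumes two mdevs can exist under one subchannel. Only the
"%x.%x.%04x" format and the strcmp() check are taken from the code above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an entry of group->device_list. */
struct demo_dev {
    const char *name;
};

/* Same strcmp()-based duplicate check as the loop quoted above. */
static bool demo_already_attached(const struct demo_dev *list, size_t n,
                                  const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i].name, name) == 0) {
            return true;
        }
    }
    return false;
}

/* Build the "cssid.ssid.devid" string with the format used in ccw.c. */
static void demo_subchannel_name(char *buf, size_t len, unsigned cssid,
                                 unsigned ssid, unsigned devid)
{
    snprintf(buf, len, "%x.%x.%04x", cssid, ssid, devid);
}
```

Two devices named from the same cssid.ssid.devid compare equal, so the second
one is reported as already attached, whereas naming them by their (distinct)
mdevid strings would not collide.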

>
>I think this also matches
>https://lore.kernel.org/all/PH7PR11MB67222DD282F98E03095FBA8A92C1A@PH7PR11MB6722.namprd11.prod.outlook.com/
>no?

It doesn't match what Matthew suggested: https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
  2023-10-09  1:25       ` Duan, Zhenzhong
@ 2023-10-09  8:14         ` Eric Auger
  2023-10-09  8:25           ` Duan, Zhenzhong
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2023-10-09  8:14 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger.pro, qemu-devel, alex.williamson,
	clg, jgg, nicolinc, Martins, Joao, peterx, Tian, Kevin, Liu,
	Yi L, Sun, Yi Y, Peng, Chao P, mjrosato, aik

Hi Zhenzhong,

On 10/9/23 03:25, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Monday, October 9, 2023 1:46 AM
>> Subject: Re: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>>
>> Hi Zhenzhong,
>> On 10/8/23 12:21, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Sent: Wednesday, October 4, 2023 11:44 PM
>>>> Subject: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>>>>
>>>> Let the vfio-ccw device use vfio_attach_device() and
>>>> vfio_detach_device(), hence hiding the details of the used
>>>> IOMMU backend.
>>>>
>>>> Note that the migration reduces the following trace
>>>> "vfio: subchannel %s has already been attached" (featuring
>>>> cssid.ssid.devid) into "device is already attached"
>>>>
>>>> Also now all the devices have been migrated to use the new
>>>> vfio_attach_device/vfio_detach_device API, let's turn the
>>>> legacy functions into static functions, local to container.c.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>>>
>>>> ---
>>>>
>>>> v3:
>>>> - simplified vbasedev->dev setting
>>>>
>>>> v2 -> v3:
>>>> - Hopefully fix confusion between vbasedev->name, mdevid and sysfsdev
>>>>  while taking into account Matthew's comment
>>>>  https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
>>>> ---
>>>> include/hw/vfio/vfio-common.h |   5 --
>>>> hw/vfio/ccw.c                 | 122 +++++++++-------------------------
>>>> hw/vfio/common.c              |  10 +--
>>>> 3 files changed, 37 insertions(+), 100 deletions(-)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index 12fbfbc37d..c486bdef2a 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -202,7 +202,6 @@ typedef struct {
>>>>     hwaddr pages;
>>>> } VFIOBitmap;
>>>>
>>>> -void vfio_put_base_device(VFIODevice *vbasedev);
>>>> void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>>>> void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
>>>> void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
>>>> @@ -220,11 +219,7 @@ void vfio_region_unmap(VFIORegion *region);
>>>> void vfio_region_exit(VFIORegion *region);
>>>> void vfio_region_finalize(VFIORegion *region);
>>>> void vfio_reset_handler(void *opaque);
>>>> -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>>>> -void vfio_put_group(VFIOGroup *group);
>>>> struct vfio_device_info *vfio_get_device_info(int fd);
>>>> -int vfio_get_device(VFIOGroup *group, const char *name,
>>>> -                    VFIODevice *vbasedev, Error **errp);
>>>> int vfio_attach_device(char *name, VFIODevice *vbasedev,
>>>>                        AddressSpace *as, Error **errp);
>>>> void vfio_detach_device(VFIODevice *vbasedev);
>>>> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
>>>> index 1e2fce83b0..6ec35fedc9 100644
>>>> --- a/hw/vfio/ccw.c
>>>> +++ b/hw/vfio/ccw.c
>>>> @@ -572,88 +572,15 @@ static void vfio_ccw_put_region(VFIOCCWDevice
>>>> *vcdev)
>>>>     g_free(vcdev->io_region);
>>>> }
>>>>
>>>> -static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
>>>> -{
>>>> -    g_free(vcdev->vdev.name);
>>>> -    vfio_put_base_device(&vcdev->vdev);
>>>> -}
>>>> -
>>>> -static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
>>>> -                                Error **errp)
>>>> -{
>>>> -    S390CCWDevice *cdev = S390_CCW_DEVICE(vcdev);
>>>> -    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>>>> -                                 cdev->hostid.ssid,
>>>> -                                 cdev->hostid.devid);
>>>> -    VFIODevice *vbasedev;
>>>> -
>>>> -    QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>> -        if (strcmp(vbasedev->name, name) == 0) {
>>>> -            error_setg(errp, "vfio: subchannel %s has already been attached",
>>>> -                       name);
>>>> -            goto out_err;
>>>> -        }
>>>> -    }
>>>> -
>>>> -    /*
>>>> -     * All vfio-ccw devices are believed to operate in a way compatible with
>>>> -     * discarding of memory in RAM blocks, ie. pages pinned in the host are
>>>> -     * in the current working set of the guest driver and therefore never
>>>> -     * overlap e.g., with pages available to the guest balloon driver.  This
>>>> -     * needs to be set before vfio_get_device() for vfio common to handle
>>>> -     * ram_block_discard_disable().
>>>> -     */
>>>> -    vcdev->vdev.ram_block_discard_allowed = true;
>>>> -
>>>> -    if (vfio_get_device(group, cdev->mdevid, &vcdev->vdev, errp)) {
>>>> -        goto out_err;
>>>> -    }
>>>> -
>>>> -    vcdev->vdev.ops = &vfio_ccw_ops;
>>>> -    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
>>>> -    vcdev->vdev.name = name;
>>>> -    vcdev->vdev.dev = DEVICE(vcdev);
>>>> -
>>>> -    return;
>>>> -
>>>> -out_err:
>>>> -    g_free(name);
>>>> -}
>>>> -
>>>> -static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
>>>> -{
>>>> -    char *tmp, group_path[PATH_MAX];
>>>> -    ssize_t len;
>>>> -    int groupid;
>>>> -
>>>> -    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
>>>> -                          cdev->hostid.cssid, cdev->hostid.ssid,
>>>> -                          cdev->hostid.devid, cdev->mdevid);
>>>> -    len = readlink(tmp, group_path, sizeof(group_path));
>>>> -    g_free(tmp);
>>>> -
>>>> -    if (len <= 0 || len >= sizeof(group_path)) {
>>>> -        error_setg(errp, "vfio: no iommu_group found");
>>>> -        return NULL;
>>>> -    }
>>>> -
>>>> -    group_path[len] = 0;
>>>> -
>>>> -    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
>>>> -        error_setg(errp, "vfio: failed to read %s", group_path);
>>>> -        return NULL;
>>>> -    }
>>>> -
>>>> -    return vfio_get_group(groupid, &address_space_memory, errp);
>>>> -}
>>>> -
>>>> static void vfio_ccw_realize(DeviceState *dev, Error **errp)
>>>> {
>>>> -    VFIOGroup *group;
>>>>     S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
>>>>     VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
>>>>     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
>>>> +    VFIODevice *vbasedev = &vcdev->vdev;
>>>>     Error *err = NULL;
>>>> +    char *name;
>>>> +    int ret;
>>>>
>>>>     /* Call the class init function for subchannel. */
>>>>     if (cdc->realize) {
>>>> @@ -663,14 +590,31 @@ static void vfio_ccw_realize(DeviceState *dev, Error
>>>> **errp)
>>>>         }
>>>>     }
>>>>
>>>> -    group = vfio_ccw_get_group(cdev, &err);
>>>> -    if (!group) {
>>>> -        goto out_group_err;
>>>> -    }
>>>> +    name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
>>>> +                           vcdev->cdev.hostid.ssid,
>>>> +                           vcdev->cdev.hostid.devid);
>>>> +    vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
>>>> +                                         name,
>>>> +                                         cdev->mdevid);
>>> I hope it is not too late for you to include this in v5.
>>> I think there is no need to re-assign sysfsdev: it is a user property, so we
>>> had better keep the original user value. It also looks like a memory leak.
>> OK I removed it.
>>>> +    vbasedev->ops = &vfio_ccw_ops;
>>>> +    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
>>>> +    vbasedev->name = name;
>>> There is a potential failure when a second mdev device under the
>>> same cssid.ssid.devid is attached. We can use cdev->mdevid as the name.
>> But this matches the existing vfio_ccw_get_device() code, where
>> vcdev->vdev.name = name; and
>> name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>>                                  cdev->hostid.ssid,
>>                                  cdev->hostid.devid);
> I suspect this is a bug in the existing code.
Then I would prefer we fix it separately. This patch migrates the
existing code with no functional change intended.

>
>> cdev->mdevid is passed as the first argument of vfio_attach_device() instead.
> vfio_attach_device() uses cdev->mdevid only to get the device FD, nothing more.
>
> If we use cssid.ssid.devid as the name, then different mdevs under the same
> cssid.ssid.devid will have the same name, and attaching the second mdev will
> fail in vfio_attach_device():
>
>     QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>         if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>             error_setg(errp, "device is already attached");
>             vfio_put_group(group);
>             return -EBUSY;
>         }
>     }
I get your point but this conversion matches the existing
vfio_ccw_get_device() code:
    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
                                 cdev->hostid.ssid,
                                 cdev->hostid.devid);
    VFIODevice *vbasedev;

    QLIST_FOREACH(vbasedev, &group->device_list, next) {
        if (strcmp(vbasedev->name, name) == 0) {
            error_setg(errp, "vfio: subchannel %s has already been attached",
                       name);
            goto out_err;
        }
    }

>
>> I think this also matches
>> https://lore.kernel.org/all/PH7PR11MB67222DD282F98E03095FBA8A92C1A@PH7PR11MB6722.namprd11.prod.outlook.com/
>> no?
> It doesn't match what Matthew suggested: https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
That was RFC v3. At that time we did not pass any "name" argument to

vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)

and vbasedev->name was used when calling vfio_get_device(), whereas we now
use cdev->mdevid. Besides, Matthew ran some basic tests on PATCH v3:
https://lore.kernel.org/all/33b7803c-f231-d4fb-d9d9-26a097a89e93@redhat.com/
So I would be tempted to leave it as is (without the sysfsdev overwrite,
which came from Matthew's suggestion in
https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/).

Thanks

Eric

>
> Thanks
> Zhenzhong
>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
  2023-10-09  8:14         ` Eric Auger
@ 2023-10-09  8:25           ` Duan, Zhenzhong
  0 siblings, 0 replies; 25+ messages in thread
From: Duan, Zhenzhong @ 2023-10-09  8:25 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, alex.williamson, clg,
	jgg, nicolinc, Martins, Joao, peterx, Tian, Kevin, Liu, Yi L,
	Sun, Yi Y, Peng, Chao P, mjrosato, aik

Hi Eric,

>-----Original Message-----
>From: qemu-devel-bounces+zhenzhong.duan=intel.com@nongnu.org
><qemu-devel-bounces+zhenzhong.duan=intel.com@nongnu.org> On Behalf Of Eric Auger
>Sent: Monday, October 9, 2023 4:15 PM
>Subject: Re: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>
>Hi Zhenzhong,
>
>On 10/9/23 03:25, Duan, Zhenzhong wrote:
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Sent: Monday, October 9, 2023 1:46 AM
>>> Subject: Re: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>>>
>>> Hi Zhenzhong,
>>> On 10/8/23 12:21, Duan, Zhenzhong wrote:
>>>> Hi Eric,
>>>>
>>>>> -----Original Message-----
>>>>> From: Eric Auger <eric.auger@redhat.com>
>>>>> Sent: Wednesday, October 4, 2023 11:44 PM
>>>>> Subject: [PATCH v4 10/15] vfio/ccw: Use vfio_[attach/detach]_device
>>>>>
>>>>> Let the vfio-ccw device use vfio_attach_device() and
>>>>> vfio_detach_device(), hence hiding the details of the used
>>>>> IOMMU backend.
>>>>>
>>>>> Note that the migration reduces the following trace
>>>>> "vfio: subchannel %s has already been attached" (featuring
>>>>> cssid.ssid.devid) into "device is already attached"
>>>>>
>>>>> Also now all the devices have been migrated to use the new
>>>>> vfio_attach_device/vfio_detach_device API, let's turn the
>>>>> legacy functions into static functions, local to container.c.
>>>>>
>>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>>>>
>>>>> ---
>>>>>
>>>>> v3:
>>>>> - simplified vbasedev->dev setting
>>>>>
>>>>> v2 -> v3:
>>>>> - Hopefully fix confusion between vbasedev->name, mdevid and sysfsdev
>>>>>  while taking into account Matthew's comment
>>>>>  https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
>>>>> ---
>>>>> include/hw/vfio/vfio-common.h |   5 --
>>>>> hw/vfio/ccw.c                 | 122 +++++++++-------------------------
>>>>> hw/vfio/common.c              |  10 +--
>>>>> 3 files changed, 37 insertions(+), 100 deletions(-)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>> index 12fbfbc37d..c486bdef2a 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -202,7 +202,6 @@ typedef struct {
>>>>>     hwaddr pages;
>>>>> } VFIOBitmap;
>>>>>
>>>>> -void vfio_put_base_device(VFIODevice *vbasedev);
>>>>> void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>>>>> void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
>>>>> void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
>>>>> @@ -220,11 +219,7 @@ void vfio_region_unmap(VFIORegion *region);
>>>>> void vfio_region_exit(VFIORegion *region);
>>>>> void vfio_region_finalize(VFIORegion *region);
>>>>> void vfio_reset_handler(void *opaque);
>>>>> -VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>>>>> -void vfio_put_group(VFIOGroup *group);
>>>>> struct vfio_device_info *vfio_get_device_info(int fd);
>>>>> -int vfio_get_device(VFIOGroup *group, const char *name,
>>>>> -                    VFIODevice *vbasedev, Error **errp);
>>>>> int vfio_attach_device(char *name, VFIODevice *vbasedev,
>>>>>                        AddressSpace *as, Error **errp);
>>>>> void vfio_detach_device(VFIODevice *vbasedev);
>>>>> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
>>>>> index 1e2fce83b0..6ec35fedc9 100644
>>>>> --- a/hw/vfio/ccw.c
>>>>> +++ b/hw/vfio/ccw.c
>>>>> @@ -572,88 +572,15 @@ static void vfio_ccw_put_region(VFIOCCWDevice
>>>>> *vcdev)
>>>>>     g_free(vcdev->io_region);
>>>>> }
>>>>>
>>>>> -static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
>>>>> -{
>>>>> -    g_free(vcdev->vdev.name);
>>>>> -    vfio_put_base_device(&vcdev->vdev);
>>>>> -}
>>>>> -
>>>>> -static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
>>>>> -                                Error **errp)
>>>>> -{
>>>>> -    S390CCWDevice *cdev = S390_CCW_DEVICE(vcdev);
>>>>> -    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>>>>> -                                 cdev->hostid.ssid,
>>>>> -                                 cdev->hostid.devid);
>>>>> -    VFIODevice *vbasedev;
>>>>> -
>>>>> -    QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>>> -        if (strcmp(vbasedev->name, name) == 0) {
>>>>> -            error_setg(errp, "vfio: subchannel %s has already been attached",
>>>>> -                       name);
>>>>> -            goto out_err;
>>>>> -        }
>>>>> -    }
>>>>> -
>>>>> -    /*
>>>>> -     * All vfio-ccw devices are believed to operate in a way compatible with
>>>>> -     * discarding of memory in RAM blocks, ie. pages pinned in the host are
>>>>> -     * in the current working set of the guest driver and therefore never
>>>>> -     * overlap e.g., with pages available to the guest balloon driver.  This
>>>>> -     * needs to be set before vfio_get_device() for vfio common to handle
>>>>> -     * ram_block_discard_disable().
>>>>> -     */
>>>>> -    vcdev->vdev.ram_block_discard_allowed = true;
>>>>> -
>>>>> -    if (vfio_get_device(group, cdev->mdevid, &vcdev->vdev, errp)) {
>>>>> -        goto out_err;
>>>>> -    }
>>>>> -
>>>>> -    vcdev->vdev.ops = &vfio_ccw_ops;
>>>>> -    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
>>>>> -    vcdev->vdev.name = name;
>>>>> -    vcdev->vdev.dev = DEVICE(vcdev);
>>>>> -
>>>>> -    return;
>>>>> -
>>>>> -out_err:
>>>>> -    g_free(name);
>>>>> -}
>>>>> -
>>>>> -static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
>>>>> -{
>>>>> -    char *tmp, group_path[PATH_MAX];
>>>>> -    ssize_t len;
>>>>> -    int groupid;
>>>>> -
>>>>> -    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
>>>>> -                          cdev->hostid.cssid, cdev->hostid.ssid,
>>>>> -                          cdev->hostid.devid, cdev->mdevid);
>>>>> -    len = readlink(tmp, group_path, sizeof(group_path));
>>>>> -    g_free(tmp);
>>>>> -
>>>>> -    if (len <= 0 || len >= sizeof(group_path)) {
>>>>> -        error_setg(errp, "vfio: no iommu_group found");
>>>>> -        return NULL;
>>>>> -    }
>>>>> -
>>>>> -    group_path[len] = 0;
>>>>> -
>>>>> -    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
>>>>> -        error_setg(errp, "vfio: failed to read %s", group_path);
>>>>> -        return NULL;
>>>>> -    }
>>>>> -
>>>>> -    return vfio_get_group(groupid, &address_space_memory, errp);
>>>>> -}
>>>>> -
>>>>> static void vfio_ccw_realize(DeviceState *dev, Error **errp)
>>>>> {
>>>>> -    VFIOGroup *group;
>>>>>     S390CCWDevice *cdev = S390_CCW_DEVICE(dev);
>>>>>     VFIOCCWDevice *vcdev = VFIO_CCW(cdev);
>>>>>     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
>>>>> +    VFIODevice *vbasedev = &vcdev->vdev;
>>>>>     Error *err = NULL;
>>>>> +    char *name;
>>>>> +    int ret;
>>>>>
>>>>>     /* Call the class init function for subchannel. */
>>>>>     if (cdc->realize) {
>>>>> @@ -663,14 +590,31 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
>>>>>         }
>>>>>     }
>>>>>
>>>>> -    group = vfio_ccw_get_group(cdev, &err);
>>>>> -    if (!group) {
>>>>> -        goto out_group_err;
>>>>> -    }
>>>>> +    name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
>>>>> +                           vcdev->cdev.hostid.ssid,
>>>>> +                           vcdev->cdev.hostid.devid);
>>>>> +    vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
>>>>> +                                         name,
>>>>> +                                         cdev->mdevid);
>>>> I hope it is not too late for you to include this in v5.
>>>> I think there is no need to re-assign sysfsdev: it is a user property, so we
>>>> had better keep the original user value. It also looks like a memory leak.
>>> OK I removed it.
>>>>> +    vbasedev->ops = &vfio_ccw_ops;
>>>>> +    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
>>>>> +    vbasedev->name = name;
>>>> There is a potential failure when a second mdev device under the
>>>> same cssid.ssid.devid is attached. We can use cdev->mdevid as the name.
>>> But this matches the existing vfio_ccw_get_device() code, where
>>> vcdev->vdev.name = name; and
>>> name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>>>                                  cdev->hostid.ssid,
>>>                                  cdev->hostid.devid);
>> I suspect this is a bug in the existing code.
>Then I would prefer we fix it separately. This patch migrates the
>existing code with no functional change intended.

OK, makes sense. I'll do that after your v5 gets in.

>
>>
>>> cdev->mdevid is passed as first arg of vfio_attach_device() instead .
>> vfio_attach_device() uses cdev->mdevid to get device FD, nothing more.
>>
>> If we use cssid.ssid.devid as the name, then different mdevs under the same
>> cssid.ssid.devid will have the same name, and the second mdev will fail to
>> attach in vfio_attach_device():
>>
>>     QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>>         if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
>>             error_setg(errp, "device is already attached");
>>             vfio_put_group(group);
>>             return -EBUSY;
>>         }
>>     }
>I get your point but this conversion matches the existing
>vfio_ccw_get_device() code:
>    char *name = g_strdup_printf("%x.%x.%04x", cdev->hostid.cssid,
>                                 cdev->hostid.ssid,
>                                 cdev->hostid.devid);
>    VFIODevice *vbasedev;
>
>    QLIST_FOREACH(vbasedev, &group->device_list, next) {
>        if (strcmp(vbasedev->name, name) == 0) {
>            error_setg(errp, "vfio: subchannel %s has already been attached",
>                       name);
>            goto out_err;
>        }
>    }
>
>>
>>> I think this also matches
>>> https://lore.kernel.org/all/PH7PR11MB67222DD282F98E03095FBA8A92C1A@PH7PR11MB6722.namprd11.prod.outlook.com/
>>> no?
>> It doesn't match what Matthew suggested:
>> https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
>This was RFC v3. At that time we did not pass any "name" arg to
>
>vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
>and vbasedev->name was used when calling vfio_get_device() while we now
>use cdev->mdevid. Besides, Matthew ran some basic tests on PATCH v3:
>https://lore.kernel.org/all/33b7803c-f231-d4fb-d9d9-26a097a89e93@redhat.com/
>So I would be tempted to leave it as is (without the sysfsdev overwrite,
>which came from Matthew's suggestion in
>https://lore.kernel.org/qemu-devel/6e04ab8f-dc84-e9c2-deea-2b6b31678b53@linux.ibm.com/
>). Thanks, Eric

OK, then go ahead.

Thanks
Zhenzhong

Thread overview: 25+ messages
2023-10-04 15:43 [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Eric Auger
2023-10-04 15:43 ` [PATCH v4 01/15] scripts/update-linux-headers: Add iommufd.h Eric Auger
2023-10-04 15:43 ` [PATCH v4 02/15] linux-headers: " Eric Auger
2023-10-04 15:43 ` [PATCH v4 03/15] vfio/common: Move IOMMU agnostic helpers to a separate file Eric Auger
2023-10-04 15:43 ` [PATCH v4 04/15] vfio/common: Propagate KVM_SET_DEVICE_ATTR error if any Eric Auger
2023-10-04 15:43 ` [PATCH v4 05/15] vfio/common: Introduce vfio_container_add|del_section_window() Eric Auger
2023-10-04 15:43 ` [PATCH v4 06/15] vfio/common: Extract out vfio_kvm_device_[add/del]_fd Eric Auger
2023-10-04 15:43 ` [PATCH v4 07/15] vfio/pci: Introduce vfio_[attach/detach]_device Eric Auger
2023-10-04 15:43 ` [PATCH v4 08/15] vfio/platform: Use vfio_[attach/detach]_device Eric Auger
2023-10-04 15:43 ` [PATCH v4 09/15] vfio/ap: " Eric Auger
2023-10-04 15:43 ` [PATCH v4 10/15] vfio/ccw: " Eric Auger
2023-10-08 10:21   ` Duan, Zhenzhong
2023-10-08 17:45     ` Eric Auger
2023-10-09  1:25       ` Duan, Zhenzhong
2023-10-09  8:14         ` Eric Auger
2023-10-09  8:25           ` Duan, Zhenzhong
2023-10-04 15:43 ` [PATCH v4 11/15] vfio/common: Move VFIO reset handler registration to a group agnostic function Eric Auger
2023-10-04 15:43 ` [PATCH v4 12/15] vfio/common: Introduce a per container device list Eric Auger
2023-10-04 15:44 ` [PATCH v4 13/15] vfio/common: Store the parent container in VFIODevice Eric Auger
2023-10-04 17:05   ` Cédric Le Goater
2023-10-04 15:44 ` [PATCH v4 14/15] vfio/common: Introduce a global VFIODevice list Eric Auger
2023-10-04 17:05   ` Cédric Le Goater
2023-10-04 15:44 ` [PATCH v4 15/15] vfio/common: Move legacy VFIO backend code into separate container.c Eric Auger
2023-10-04 19:58 ` [PATCH v4 00/15] Prerequisite changes for IOMMUFD support Cédric Le Goater
2023-10-05  6:55   ` Eric Auger
