* [RFC 00/18] vfio: Adopt iommufd
@ 2022-04-14 10:46 ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

With the introduction of iommufd[1], the Linux kernel provides a generic
interface for userspace drivers to propagate their DMA mappings to the
kernel for assigned devices. This series ports the VFIO devices onto the
/dev/iommu uAPI and lets it coexist with the legacy implementation.
Other devices such as vdpa and vfio mdev are not considered yet.

For vfio devices, the new interface is tied to the device fd and iommufd,
as the iommufd solution is device-centric. This is different from legacy
vfio, which is group-centric. To support both interfaces in QEMU, this
series introduces the iommu backend concept in the form of different
container classes. The existing vfio container is named the legacy
container (equivalent to the legacy iommu backend in this series), while
the new iommufd-based container is named the iommufd container (also
referred to as the iommufd backend in this series). The two backend types
have their own ways to set up a secure context and their own DMA
management interfaces. The diagram below shows how it looks with both BEs.

                    VFIO                           AddressSpace/Memory
    +-------+  +----------+  +-----+  +-----+
    |  pci  |  | platform |  |  ap |  | ccw |
    +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
        |           |           |        |        |   AddressSpace       |
        |           |           |        |        +------------+---------+
    +---V-----------V-----------V--------V----+               /
    |           VFIOAddressSpace              | <------------+
    |                  |                      |  MemoryListener
    |          VFIOContainer list             |
    +-------+----------------------------+----+
            |                            |
            |                            |
    +-------V------+            +--------V----------+
    |   iommufd    |            |    vfio legacy    |
    |  container   |            |     container     |
    +-------+------+            +--------+----------+
            |                            |
            | /dev/iommu                 | /dev/vfio/vfio
            | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
 Userspace  |                            |
 ===========+============================+================================
 Kernel     |  device fd                 |
            +---------------+            | group/container fd
            | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
            |  ATTACH_IOAS) |            | device fd
            |               |            |
            |       +-------V------------V-----------------+
    iommufd |       |                vfio                  |
(map/unmap  |       +---------+--------------------+-------+
 ioas_copy) |                 |                    | map/unmap
            |                 |                    |
     +------V------+    +-----V------+      +------V--------+
     | iommufd core|    |  device    |      |  vfio iommu   |
     +-------------+    +------------+      +---------------+

[Secure Context setup]
- iommufd BE: uses device fd and iommufd to setup secure context
              (bind_iommufd, attach_ioas)
- vfio legacy BE: uses group fd and container fd to setup secure context
                  (set_container, set_iommu)
[Device access]
- iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
- vfio legacy BE: device fd is retrieved from group fd ioctl
[DMA Mapping flow]
- VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
- VFIO populates DMA map/unmap via the container BEs
  *) iommufd BE: uses iommufd
  *) vfio legacy BE: uses container fd

This series QOM-ifies the VFIOContainer object, which acts as a base class
for a container. This base class is derived into the legacy VFIO container
and the new iommufd-based container. The base class implements generic
code, such as the memory listener and address space management, whereas
the derived classes implement callbacks that depend on the kernel uAPI
being used.

The backend is selected on a per-device basis using the new iommufd
option (on/off/auto). By default, the iommufd backend is selected if it
is supported by both the host and QEMU (iommufd Kconfig). This option
currently exists only for the vfio-pci device; for other device types
the legacy BE is chosen by default.

Tests done:
- PCI and platform devices were tested
- ccw and ap were only compile-tested
- limited device hotplug tests
- vIOMMU tests were run for both the legacy and iommufd backends
  (limited tests)

This series was co-developed by Eric Auger and me based on the
exploration iommufd kernel[2]; the complete code of this series is
available in [3]. As the iommufd kernel support is at an early stage
(only the iommufd generic interface is on the mailing list), this series
has not yet brought the iommufd backend fully on par with the legacy
backend w.r.t. features like p2p mappings, coherency tracking, live
migration, etc. Nor does this series support PCI devices without FLR, as
the kernel does not support VFIO_DEVICE_PCI_HOT_RESET when userspace is
using iommufd; the kernel needs to be updated to accept a device fd list
for reset in that case. Related work is in progress by Jason[4].

TODOs:
- Add a DMA alias check for the iommufd BE (group level)
- Make pci.c BE-agnostic. Needs a kernel change as well to fix the
  VFIO_DEVICE_PCI_HOT_RESET gap
- Clean up the VFIODevice fields, as the struct is used by both BEs
- Add locks
- Replace list with g_tree
- More tests

Patch Overview:

- Preparation:
  0001-scripts-update-linux-headers-Add-iommufd.h.patch
  0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
  0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
  0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
  0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch
  0006-vfio-common-Split-common.c-into-common.c-container.c.patch

- Introduce container object and convert existing vfio to use it:
  0007-vfio-Add-base-object-for-VFIOContainer.patch
  0008-vfio-container-Introduce-vfio_attach-detach_device.patch
  0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
  0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
  0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
  0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
  0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch

- Introduce iommufd based container:
  0014-hw-iommufd-Creation.patch
  0015-vfio-iommufd-Implement-iommufd-backend.patch
  0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch

- Add backend selection for vfio-pci:
  0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
  0018-vfio-pci-Add-an-iommufd-option.patch

[1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
[2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
[3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
[4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-vfio_mdev_no_group_jgg@nvidia.com/

Base commit: 4bf58c7 virtio-iommu: use-after-free fix

Thanks,
Yi & Eric

Eric Auger (12):
  scripts/update-linux-headers: Add iommufd.h
  linux-headers: Import latest vfio.h and iommufd.h
  hw/vfio/pci: fix vfio_pci_hot_reset_result trace point
  vfio/pci: Use vbasedev local variable in vfio_realize()
  vfio/container: Introduce vfio_[attach/detach]_device
  vfio/platform: Use vfio_[attach/detach]_device
  vfio/ap: Use vfio_[attach/detach]_device
  vfio/ccw: Use vfio_[attach/detach]_device
  vfio/container-obj: Introduce [attach/detach]_device container
    callbacks
  vfio/container-obj: Introduce VFIOContainer reset callback
  vfio/as: Allow the selection of a given iommu backend
  vfio/pci: Add an iommufd option

Yi Liu (6):
  vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr
  vfio/common: Split common.c into common.c, container.c and as.c
  vfio: Add base object for VFIOContainer
  hw/iommufd: Creation
  vfio/iommufd: Implement iommufd backend
  vfio/iommufd: Add IOAS_COPY_DMA support

 MAINTAINERS                          |    7 +
 hw/Kconfig                           |    1 +
 hw/iommufd/Kconfig                   |    4 +
 hw/iommufd/iommufd.c                 |  209 +++
 hw/iommufd/meson.build               |    1 +
 hw/iommufd/trace-events              |   11 +
 hw/iommufd/trace.h                   |    1 +
 hw/meson.build                       |    1 +
 hw/vfio/ap.c                         |   62 +-
 hw/vfio/as.c                         | 1042 ++++++++++++
 hw/vfio/ccw.c                        |  118 +-
 hw/vfio/common.c                     | 2340 ++------------------------
 hw/vfio/container-obj.c              |  221 +++
 hw/vfio/container.c                  | 1308 ++++++++++++++
 hw/vfio/iommufd.c                    |  570 +++++++
 hw/vfio/meson.build                  |    6 +
 hw/vfio/migration.c                  |    4 +-
 hw/vfio/pci.c                        |  133 +-
 hw/vfio/platform.c                   |   42 +-
 hw/vfio/spapr.c                      |   22 +-
 hw/vfio/trace-events                 |   11 +
 include/hw/iommufd/iommufd.h         |   37 +
 include/hw/vfio/vfio-common.h        |   96 +-
 include/hw/vfio/vfio-container-obj.h |  169 ++
 linux-headers/linux/iommufd.h        |  223 +++
 linux-headers/linux/vfio.h           |   84 +
 meson.build                          |    1 +
 scripts/update-linux-headers.sh      |    2 +-
 28 files changed, 4258 insertions(+), 2468 deletions(-)
 create mode 100644 hw/iommufd/Kconfig
 create mode 100644 hw/iommufd/iommufd.c
 create mode 100644 hw/iommufd/meson.build
 create mode 100644 hw/iommufd/trace-events
 create mode 100644 hw/iommufd/trace.h
 create mode 100644 hw/vfio/as.c
 create mode 100644 hw/vfio/container-obj.c
 create mode 100644 hw/vfio/container.c
 create mode 100644 hw/vfio/iommufd.c
 create mode 100644 include/hw/iommufd/iommufd.h
 create mode 100644 include/hw/vfio/vfio-container-obj.h
 create mode 100644 linux-headers/linux/iommufd.h

-- 
2.27.0



* [RFC 01/18] scripts/update-linux-headers: Add iommufd.h
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:46   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Update the script to import iommufd.h

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 scripts/update-linux-headers.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 839a5ec614..a89b83e6d6 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -160,7 +160,7 @@ done
 
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
-for header in kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \
+for header in kvm.h vfio.h iommufd.h vfio_ccw.h vfio_zdev.h vhost.h \
               psci.h psp-sev.h userfaultfd.h mman.h; do
     cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
-- 
2.27.0



* [RFC 02/18] linux-headers: Import latest vfio.h and iommufd.h
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:46   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Imported from https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 linux-headers/linux/iommufd.h | 223 ++++++++++++++++++++++++++++++++++
 linux-headers/linux/vfio.h    |  84 +++++++++++++
 2 files changed, 307 insertions(+)
 create mode 100644 linux-headers/linux/iommufd.h

diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
new file mode 100644
index 0000000000..6c3cd9e259
--- /dev/null
+++ b/linux-headers/linux/iommufd.h
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _IOMMUFD_H
+#define _IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOMMUFD_TYPE (';')
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl mechanism follows a general format to allow for extensibility. Each
+ * ioctl is passed a structure pointer as the argument, providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has
+ *    non-zero bytes in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *    understood, however a known field has a value the kernel does not
+ *    understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *    correct.
+ *  - ENOENT: An ID or IOVA provided does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Arithmetic overflowed.
+ *
+ * Individual ioctls may also return additional errnos.
+ */
+enum {
+	IOMMUFD_CMD_BASE = 0x80,
+	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_UNMAP,
+	IOMMUFD_CMD_VFIO_IOAS,
+};
+
+/**
+ * struct iommu_destroy - ioctl(IOMMU_DESTROY)
+ * @size: sizeof(struct iommu_destroy)
+ * @id: iommufd object ID to destroy. Can be any destroyable object type.
+ *
+ * Destroy any object held within iommufd.
+ */
+struct iommu_destroy {
+	__u32 size;
+	__u32 id;
+};
+#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
+
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @out_num_iovas: Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
+ *                   of out_num_iovas or the length implied by size.
+ * @out_valid_iovas.start: First IOVA in the allowed range
+ * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
+ * not allowed. out_num_iovas will be set to the total number of iovas
+ * and the out_valid_iovas[] will be filled in as space permits.
+ * size should include the allocated flex array.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 out_num_iovas;
+	__u32 __reserved;
+	struct iommu_valid_iovas {
+		__aligned_u64 start;
+		__aligned_u64 last;
+	} out_valid_iovas[];
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location will be
+ * automatically selected and returned in iova.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap
+ *
+ * Unmap an IOVA range. The iova/length must exactly match a range
+ * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX.
+ * In the latter case all IOVAs will be unmapped.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
+
+/**
+ * enum iommufd_vfio_ioas_op
+ * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
+ */
+enum iommufd_vfio_ioas_op {
+	IOMMU_VFIO_IOAS_GET = 0,
+	IOMMU_VFIO_IOAS_SET = 1,
+	IOMMU_VFIO_IOAS_CLEAR = 2,
+};
+
+/**
+ * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
+ * @size: sizeof(struct iommu_vfio_ioas)
+ * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
+ *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
+ * @op: One of enum iommufd_vfio_ioas_op
+ * @__reserved: Must be 0
+ *
+ * The VFIO compatibility support uses a single ioas because VFIO APIs do not
+ * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
+ * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
+ * compatibility ioas, either by taking what is already set, or auto creating
+ * one. From then on VFIO will continue to use that ioas and is not affected by
+ * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
+ */
+struct iommu_vfio_ioas {
+	__u32 size;
+	__u32 ioas_id;
+	__u16 op;
+	__u16 __reserved;
+};
+#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+#endif
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index e680594f27..0e7b1159ca 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -190,6 +190,90 @@ struct vfio_group_status {
 
 /* --------------- IOCTLs for DEVICE file descriptors --------------- */
 
+/*
+ * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19,
+ *				struct vfio_device_bind_iommufd)
+ *
+ * Bind a vfio_device to the specified iommufd
+ *
+ * The user should provide a device cookie when calling this ioctl. The
+ * cookie is carried only in events (e.g. I/O faults) reported to userspace
+ * via iommufd. The user should use the devid returned by this ioctl to
+ * identify the target device in other ioctls (e.g. capability queries).
+ *
+ * The user is not allowed to access the device before the binding
+ * operation completes.
+ *
+ * Unbind happens automatically when the device fd is closed.
+ *
+ * Input parameters:
+ *	- iommufd;
+ *	- dev_cookie;
+ *
+ * Output parameters:
+ *	- devid;
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_bind_iommufd {
+	__u32		argsz;
+	__u32		flags;
+	__aligned_u64	dev_cookie;
+	__s32		iommufd;
+	__u32		out_devid;
+};
+
+#define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/*
+ * VFIO_DEVICE_ATTACH_IOAS - _IOW(VFIO_TYPE, VFIO_BASE + 20,
+ *				  struct vfio_device_attach_ioas)
+ *
+ * Attach a vfio device to the specified IOAS.
+ *
+ * Multiple vfio devices can be attached to the same IOAS page table. A
+ * device can be attached to only one IOAS at a time.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @iommufd:	iommufd where the ioas comes from.
+ * @ioas_id:	Input the target I/O address space page table.
+ * @out_hwpt_id:	Output the hw page table id
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_attach_ioas {
+	__u32	argsz;
+	__u32	flags;
+	__s32	iommufd;
+	__u32	ioas_id;
+	__u32	out_hwpt_id;
+};
+
+#define VFIO_DEVICE_ATTACH_IOAS	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
+/*
+ * VFIO_DEVICE_DETACH_IOAS - _IOW(VFIO_TYPE, VFIO_BASE + 21,
+ *				  struct vfio_device_detach_ioas)
+ *
+ * Detach a vfio device from the specified IOAS.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @iommufd:	iommufd where the ioas comes from.
+ * @ioas_id:	Input the target I/O address space page table.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_detach_ioas {
+	__u32	argsz;
+	__u32	flags;
+	__s32	iommufd;
+	__u32	ioas_id;
+};
+
+#define VFIO_DEVICE_DETACH_IOAS	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /**
  * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
  *						struct vfio_device_info)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 02/18] linux-headers: Import latest vfio.h and iommufd.h
@ 2022-04-14 10:46   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

From: Eric Auger <eric.auger@redhat.com>

Imported from https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 linux-headers/linux/iommufd.h | 223 ++++++++++++++++++++++++++++++++++
 linux-headers/linux/vfio.h    |  84 +++++++++++++
 2 files changed, 307 insertions(+)
 create mode 100644 linux-headers/linux/iommufd.h

diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
new file mode 100644
index 0000000000..6c3cd9e259
--- /dev/null
+++ b/linux-headers/linux/iommufd.h
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _IOMMUFD_H
+#define _IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOMMUFD_TYPE (';')
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl mechanism follows a general format to allow for extensibility. Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger, structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has
+ *    non-zero in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *    understood, however a known field has a value the kernel does not
+ *    understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *    correct.
+ *  - ENOENT: An ID or IOVA provided does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Arithmetic overflowed.
+ *
+ * As well as additional errnos within specific ioctls.
+ */
+enum {
+	IOMMUFD_CMD_BASE = 0x80,
+	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+	IOMMUFD_CMD_IOAS_ALLOC,
+	IOMMUFD_CMD_IOAS_IOVA_RANGES,
+	IOMMUFD_CMD_IOAS_MAP,
+	IOMMUFD_CMD_IOAS_COPY,
+	IOMMUFD_CMD_IOAS_UNMAP,
+	IOMMUFD_CMD_VFIO_IOAS,
+};
+
+/**
+ * struct iommu_destroy - ioctl(IOMMU_DESTROY)
+ * @size: sizeof(struct iommu_destroy)
+ * @id: iommufd object ID to destroy. Can be any destroyable object type.
+ *
+ * Destroy any object held within iommufd.
+ */
+struct iommu_destroy {
+	__u32 size;
+	__u32 id;
+};
+#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
+
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @out_num_iovas: Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
+ *                   of out_num_iovas or the length implied by size.
+ * @out_valid_iovas.start: First IOVA in the allowed range
+ * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
+ * not allowed. out_num_iovas will be set to the total number of iovas
+ * and the out_valid_iovas[] will be filled in as space permits.
+ * size should include the allocated flex array.
+ */
+struct iommu_ioas_iova_ranges {
+	__u32 size;
+	__u32 ioas_id;
+	__u32 out_num_iovas;
+	__u32 __reserved;
+	struct iommu_valid_iovas {
+		__aligned_u64 start;
+		__aligned_u64 last;
+	} out_valid_iovas[];
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ *                             IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+	IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+	IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+	IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location will be
+ * automatically selected and returned in iova.
+ */
+struct iommu_ioas_map {
+	__u32 size;
+	__u32 flags;
+	__u32 ioas_id;
+	__u32 __reserved;
+	__aligned_u64 user_va;
+	__aligned_u64 length;
+	__aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from
+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *            set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ */
+struct iommu_ioas_copy {
+	__u32 size;
+	__u32 flags;
+	__u32 dst_ioas_id;
+	__u32 src_ioas_id;
+	__aligned_u64 length;
+	__aligned_u64 dst_iova;
+	__aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap
+ *
+ * Unmap an IOVA range. The iova/length must exactly match a range
+ * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX.
+ * In the latter case all IOVAs will be unmapped.
+ */
+struct iommu_ioas_unmap {
+	__u32 size;
+	__u32 ioas_id;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
+
+/**
+ * enum iommufd_vfio_ioas_op
+ * @IOMMU_VFIO_IOAS_GET: Get the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_SET: Change the current compatibility IOAS
+ * @IOMMU_VFIO_IOAS_CLEAR: Disable VFIO compatibility
+ */
+enum iommufd_vfio_ioas_op {
+	IOMMU_VFIO_IOAS_GET = 0,
+	IOMMU_VFIO_IOAS_SET = 1,
+	IOMMU_VFIO_IOAS_CLEAR = 2,
+};
+
+/**
+ * struct iommu_vfio_ioas - ioctl(IOMMU_VFIO_IOAS)
+ * @size: sizeof(struct iommu_vfio_ioas)
+ * @ioas_id: For IOMMU_VFIO_IOAS_SET the input IOAS ID to set
+ *           For IOMMU_VFIO_IOAS_GET will output the IOAS ID
+ * @op: One of enum iommufd_vfio_ioas_op
+ * @__reserved: Must be 0
+ *
+ * The VFIO compatibility support uses a single ioas because VFIO APIs do not
+ * support the ID field. Set or Get the IOAS that VFIO compatibility will use.
+ * When VFIO_GROUP_SET_CONTAINER is used on an iommufd it will get the
+ * compatibility ioas, either by taking what is already set, or auto creating
+ * one. From then on VFIO will continue to use that ioas and is not affected by
+ * this ioctl. SET or CLEAR does not destroy any auto-created IOAS.
+ */
+struct iommu_vfio_ioas {
+	__u32 size;
+	__u32 ioas_id;
+	__u16 op;
+	__u16 __reserved;
+};
+#define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
+#endif
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index e680594f27..0e7b1159ca 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -190,6 +190,90 @@ struct vfio_group_status {
 
 /* --------------- IOCTLs for DEVICE file descriptors --------------- */
 
+/*
+ * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19,
+ *				struct vfio_device_bind_iommufd)
+ *
+ * Bind a vfio_device to the specified iommufd
+ *
+ * The user should provide a device cookie when calling this ioctl. The
+ * cookie is carried only in events (e.g. I/O faults) reported to userspace
+ * via iommufd. The user should use the devid returned by this ioctl to
+ * identify the target device in other ioctls (e.g. capability queries).
+ *
+ * The user is not allowed to access the device before the binding
+ * operation completes.
+ *
+ * Unbind happens automatically when the device fd is closed.
+ *
+ * Input parameters:
+ *	- iommufd;
+ *	- dev_cookie;
+ *
+ * Output parameters:
+ *	- devid;
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_bind_iommufd {
+	__u32		argsz;
+	__u32		flags;
+	__aligned_u64	dev_cookie;
+	__s32		iommufd;
+	__u32		out_devid;
+};
+
+#define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/*
+ * VFIO_DEVICE_ATTACH_IOAS - _IOW(VFIO_TYPE, VFIO_BASE + 20,
+ *				  struct vfio_device_attach_ioas)
+ *
+ * Attach a vfio device to the specified IOAS.
+ *
+ * Multiple vfio devices can be attached to the same IOAS page table. A
+ * device can be attached to only one IOAS at a time.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @iommufd:	iommufd where the ioas comes from.
+ * @ioas_id:	Input the target I/O address space page table.
+ * @out_hwpt_id:	Output the hw page table id
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_attach_ioas {
+	__u32	argsz;
+	__u32	flags;
+	__s32	iommufd;
+	__u32	ioas_id;
+	__u32	out_hwpt_id;
+};
+
+#define VFIO_DEVICE_ATTACH_IOAS	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
+/*
+ * VFIO_DEVICE_DETACH_IOAS - _IOW(VFIO_TYPE, VFIO_BASE + 21,
+ *				  struct vfio_device_detach_ioas)
+ *
+ * Detach a vfio device from the specified IOAS.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @iommufd:	iommufd where the ioas comes from.
+ * @ioas_id:	Input the target I/O address space page table.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_detach_ioas {
+	__u32	argsz;
+	__u32	flags;
+	__s32	iommufd;
+	__u32	ioas_id;
+};
+
+#define VFIO_DEVICE_DETACH_IOAS	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /**
  * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
  *						struct vfio_device_info)
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 03/18] hw/vfio/pci: fix vfio_pci_hot_reset_result trace point
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:46   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Properly output the errno string.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 67a183f17b..e26e65bb1f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2337,7 +2337,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     g_free(reset);
 
     trace_vfio_pci_hot_reset_result(vdev->vbasedev.name,
-                                    ret ? "%m" : "Success");
+                                    ret ? strerror(errno) : "Success");
 
 out:
     /* Re-enable INTx on affected devices */
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 03/18] hw/vfio/pci: fix vfio_pci_hot_reset_result trace point
@ 2022-04-14 10:46   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

From: Eric Auger <eric.auger@redhat.com>

Properly output the errno string.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 67a183f17b..e26e65bb1f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2337,7 +2337,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     g_free(reset);
 
     trace_vfio_pci_hot_reset_result(vdev->vbasedev.name,
-                                    ret ? "%m" : "Success");
+                                    ret ? strerror(errno) : "Success");
 
 out:
     /* Re-enable INTx on affected devices */
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 04/18] vfio/pci: Use vbasedev local variable in vfio_realize()
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:46   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Use a VFIODevice local variable to improve code readability.

No functional change intended.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/pci.c | 49 +++++++++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e26e65bb1f..e707329394 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2803,6 +2803,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIODevice *vbasedev = &vdev->vbasedev;
     VFIODevice *vbasedev_iter;
     VFIOGroup *group;
     char *tmp, *subsys, group_path[PATH_MAX], *group_name;
@@ -2813,7 +2814,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     int i, ret;
     bool is_mdev;
 
-    if (!vdev->vbasedev.sysfsdev) {
+    if (!vbasedev->sysfsdev) {
         if (!(~vdev->host.domain || ~vdev->host.bus ||
               ~vdev->host.slot || ~vdev->host.function)) {
             error_setg(errp, "No provided host device");
@@ -2821,24 +2822,24 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                               "or -device vfio-pci,sysfsdev=PATH_TO_DEVICE\n");
             return;
         }
-        vdev->vbasedev.sysfsdev =
+        vbasedev->sysfsdev =
             g_strdup_printf("/sys/bus/pci/devices/%04x:%02x:%02x.%01x",
                             vdev->host.domain, vdev->host.bus,
                             vdev->host.slot, vdev->host.function);
     }
 
-    if (stat(vdev->vbasedev.sysfsdev, &st) < 0) {
+    if (stat(vbasedev->sysfsdev, &st) < 0) {
         error_setg_errno(errp, errno, "no such host device");
-        error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.sysfsdev);
+        error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev);
         return;
     }
 
-    vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
-    vdev->vbasedev.ops = &vfio_pci_ops;
-    vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
-    vdev->vbasedev.dev = DEVICE(vdev);
+    vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+    vbasedev->ops = &vfio_pci_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_PCI;
+    vbasedev->dev = DEVICE(vdev);
 
-    tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
+    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
     g_free(tmp);
 
@@ -2856,7 +2857,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
-    trace_vfio_realize(vdev->vbasedev.name, groupid);
+    trace_vfio_realize(vbasedev->name, groupid);
 
     group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
     if (!group) {
@@ -2864,7 +2865,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vdev->vbasedev.name) == 0) {
+        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
             error_setg(errp, "device is already attached");
             vfio_put_group(group);
             goto error;
@@ -2877,22 +2878,22 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
      * stays in sync with the active working set of the guest driver.  Prevent
      * the x-balloon-allowed option unless this is minimally an mdev device.
      */
-    tmp = g_strdup_printf("%s/subsystem", vdev->vbasedev.sysfsdev);
+    tmp = g_strdup_printf("%s/subsystem", vbasedev->sysfsdev);
     subsys = realpath(tmp, NULL);
     g_free(tmp);
     is_mdev = subsys && (strcmp(subsys, "/sys/bus/mdev") == 0);
     free(subsys);
 
-    trace_vfio_mdev(vdev->vbasedev.name, is_mdev);
+    trace_vfio_mdev(vbasedev->name, is_mdev);
 
-    if (vdev->vbasedev.ram_block_discard_allowed && !is_mdev) {
+    if (vbasedev->ram_block_discard_allowed && !is_mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
         vfio_put_group(group);
         goto error;
     }
 
-    ret = vfio_get_device(group, vdev->vbasedev.name, &vdev->vbasedev, errp);
+    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
     if (ret) {
         vfio_put_group(group);
         goto error;
@@ -2905,7 +2906,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     /* Get a copy of config space */
-    ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+    ret = pread(vbasedev->fd, vdev->pdev.config,
                 MIN(pci_config_size(&vdev->pdev), vdev->config_size),
                 vdev->config_offset);
     if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
@@ -2933,7 +2934,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             goto error;
         }
         vfio_add_emulated_word(vdev, PCI_VENDOR_ID, vdev->vendor_id, ~0);
-        trace_vfio_pci_emulated_vendor_id(vdev->vbasedev.name, vdev->vendor_id);
+        trace_vfio_pci_emulated_vendor_id(vbasedev->name, vdev->vendor_id);
     } else {
         vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID);
     }
@@ -2944,7 +2945,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             goto error;
         }
         vfio_add_emulated_word(vdev, PCI_DEVICE_ID, vdev->device_id, ~0);
-        trace_vfio_pci_emulated_device_id(vdev->vbasedev.name, vdev->device_id);
+        trace_vfio_pci_emulated_device_id(vbasedev->name, vdev->device_id);
     } else {
         vdev->device_id = pci_get_word(pdev->config + PCI_DEVICE_ID);
     }
@@ -2956,7 +2957,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
         vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_VENDOR_ID,
                                vdev->sub_vendor_id, ~0);
-        trace_vfio_pci_emulated_sub_vendor_id(vdev->vbasedev.name,
+        trace_vfio_pci_emulated_sub_vendor_id(vbasedev->name,
                                               vdev->sub_vendor_id);
     }
 
@@ -2966,7 +2967,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             goto error;
         }
         vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_ID, vdev->sub_device_id, ~0);
-        trace_vfio_pci_emulated_sub_device_id(vdev->vbasedev.name,
+        trace_vfio_pci_emulated_sub_device_id(vbasedev->name,
                                               vdev->sub_device_id);
     }
 
@@ -3025,7 +3026,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             goto out_teardown;
         }
 
-        ret = vfio_get_dev_region_info(&vdev->vbasedev,
+        ret = vfio_get_dev_region_info(vbasedev,
                         VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_INTEL,
                         VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &opregion);
         if (ret) {
@@ -3101,9 +3102,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     if (!pdev->failover_pair_id) {
-        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        ret = vfio_migration_probe(vbasedev, errp);
         if (ret) {
-            error_report("%s: Migration disabled", vdev->vbasedev.name);
+            error_report("%s: Migration disabled", vbasedev->name);
         }
     }
 
@@ -3120,7 +3121,7 @@ out_teardown:
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
 error:
-    error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+    error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
 }
 
 static void vfio_instance_finalize(Object *obj)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 04/18] vfio/pci: Use vbasedev local variable in vfio_realize()
@ 2022-04-14 10:46   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

From: Eric Auger <eric.auger@redhat.com>

Use a VFIODevice local variable to improve code readability.

No functional change intended.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/pci.c | 49 +++++++++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e26e65bb1f..e707329394 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2803,6 +2803,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+    VFIODevice *vbasedev = &vdev->vbasedev;
     VFIODevice *vbasedev_iter;
     VFIOGroup *group;
     char *tmp, *subsys, group_path[PATH_MAX], *group_name;
@@ -2813,7 +2814,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     int i, ret;
     bool is_mdev;
 
-    if (!vdev->vbasedev.sysfsdev) {
+    if (!vbasedev->sysfsdev) {
         if (!(~vdev->host.domain || ~vdev->host.bus ||
               ~vdev->host.slot || ~vdev->host.function)) {
             error_setg(errp, "No provided host device");
@@ -2821,24 +2822,24 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                               "or -device vfio-pci,sysfsdev=PATH_TO_DEVICE\n");
             return;
         }
-        vdev->vbasedev.sysfsdev =
+        vbasedev->sysfsdev =
             g_strdup_printf("/sys/bus/pci/devices/%04x:%02x:%02x.%01x",
                             vdev->host.domain, vdev->host.bus,
                             vdev->host.slot, vdev->host.function);
     }
 
-    if (stat(vdev->vbasedev.sysfsdev, &st) < 0) {
+    if (stat(vbasedev->sysfsdev, &st) < 0) {
         error_setg_errno(errp, errno, "no such host device");
-        error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.sysfsdev);
+        error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev);
         return;
     }
 
-    vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
-    vdev->vbasedev.ops = &vfio_pci_ops;
-    vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
-    vdev->vbasedev.dev = DEVICE(vdev);
+    vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+    vbasedev->ops = &vfio_pci_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_PCI;
+    vbasedev->dev = DEVICE(vdev);
 
-    tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
+    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
     len = readlink(tmp, group_path, sizeof(group_path));
     g_free(tmp);
 
@@ -2856,7 +2857,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
-    trace_vfio_realize(vdev->vbasedev.name, groupid);
+    trace_vfio_realize(vbasedev->name, groupid);
 
     group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
     if (!group) {
@@ -2864,7 +2865,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vdev->vbasedev.name) == 0) {
+        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
             error_setg(errp, "device is already attached");
             vfio_put_group(group);
             goto error;
@@ -2877,22 +2878,22 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
      * stays in sync with the active working set of the guest driver.  Prevent
      * the x-balloon-allowed option unless this is minimally an mdev device.
      */
-    tmp = g_strdup_printf("%s/subsystem", vdev->vbasedev.sysfsdev);
+    tmp = g_strdup_printf("%s/subsystem", vbasedev->sysfsdev);
     subsys = realpath(tmp, NULL);
     g_free(tmp);
     is_mdev = subsys && (strcmp(subsys, "/sys/bus/mdev") == 0);
     free(subsys);
 
-    trace_vfio_mdev(vdev->vbasedev.name, is_mdev);
+    trace_vfio_mdev(vbasedev->name, is_mdev);
 
-    if (vdev->vbasedev.ram_block_discard_allowed && !is_mdev) {
+    if (vbasedev->ram_block_discard_allowed && !is_mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
         vfio_put_group(group);
         goto error;
     }
 
-    ret = vfio_get_device(group, vdev->vbasedev.name, &vdev->vbasedev, errp);
+    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
     if (ret) {
         vfio_put_group(group);
         goto error;
@@ -2905,7 +2906,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     /* Get a copy of config space */
-    ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+    ret = pread(vbasedev->fd, vdev->pdev.config,
                 MIN(pci_config_size(&vdev->pdev), vdev->config_size),
                 vdev->config_offset);
     if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
@@ -2933,7 +2934,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             goto error;
         }
         vfio_add_emulated_word(vdev, PCI_VENDOR_ID, vdev->vendor_id, ~0);
-        trace_vfio_pci_emulated_vendor_id(vdev->vbasedev.name, vdev->vendor_id);
+        trace_vfio_pci_emulated_vendor_id(vbasedev->name, vdev->vendor_id);
     } else {
         vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID);
     }
@@ -2944,7 +2945,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             goto error;
         }
         vfio_add_emulated_word(vdev, PCI_DEVICE_ID, vdev->device_id, ~0);
-        trace_vfio_pci_emulated_device_id(vdev->vbasedev.name, vdev->device_id);
+        trace_vfio_pci_emulated_device_id(vbasedev->name, vdev->device_id);
     } else {
         vdev->device_id = pci_get_word(pdev->config + PCI_DEVICE_ID);
     }
@@ -2956,7 +2957,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
         vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_VENDOR_ID,
                                vdev->sub_vendor_id, ~0);
-        trace_vfio_pci_emulated_sub_vendor_id(vdev->vbasedev.name,
+        trace_vfio_pci_emulated_sub_vendor_id(vbasedev->name,
                                               vdev->sub_vendor_id);
     }
 
@@ -2966,7 +2967,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             goto error;
         }
         vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_ID, vdev->sub_device_id, ~0);
-        trace_vfio_pci_emulated_sub_device_id(vdev->vbasedev.name,
+        trace_vfio_pci_emulated_sub_device_id(vbasedev->name,
                                               vdev->sub_device_id);
     }
 
@@ -3025,7 +3026,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
             goto out_teardown;
         }
 
-        ret = vfio_get_dev_region_info(&vdev->vbasedev,
+        ret = vfio_get_dev_region_info(vbasedev,
                         VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_INTEL,
                         VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &opregion);
         if (ret) {
@@ -3101,9 +3102,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     if (!pdev->failover_pair_id) {
-        ret = vfio_migration_probe(&vdev->vbasedev, errp);
+        ret = vfio_migration_probe(vbasedev, errp);
         if (ret) {
-            error_report("%s: Migration disabled", vdev->vbasedev.name);
+            error_report("%s: Migration disabled", vbasedev->name);
         }
     }
 
@@ -3120,7 +3121,7 @@ out_teardown:
     vfio_teardown_msi(vdev);
     vfio_bars_exit(vdev);
 error:
-    error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+    error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
 }
 
 static void vfio_instance_finalize(Object *obj)
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 05/18] vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:46   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

Rename the VFIOGuestIOMMU "iommu" field to "iommu_mr", making it clearer
that it is an IOMMU memory region.

No functional change intended.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/common.c              | 16 ++++++++--------
 include/hw/vfio/vfio-common.h |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 080046e3f5..b05f68b5c7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -992,7 +992,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
          * device emulation the VFIO iommu handles to use).
          */
         giommu = g_malloc0(sizeof(*giommu));
-        giommu->iommu = iommu_mr;
+        giommu->iommu_mr = iommu_mr;
         giommu->iommu_offset = section->offset_within_address_space -
                                section->offset_within_region;
         giommu->container = container;
@@ -1007,7 +1007,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
                             int128_get64(llend),
                             iommu_idx);
 
-        ret = memory_region_iommu_set_page_size_mask(giommu->iommu,
+        ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr,
                                                      container->pgsizes,
                                                      &err);
         if (ret) {
@@ -1022,7 +1022,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
             goto fail;
         }
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
-        memory_region_iommu_replay(giommu->iommu, &giommu->n);
+        memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
 
         return;
     }
@@ -1128,7 +1128,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
         VFIOGuestIOMMU *giommu;
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (MEMORY_REGION(giommu->iommu) == section->mr &&
+            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
                 giommu->n.start == section->offset_within_region) {
                 memory_region_unregister_iommu_notifier(section->mr,
                                                         &giommu->n);
@@ -1393,11 +1393,11 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
         VFIOGuestIOMMU *giommu;
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (MEMORY_REGION(giommu->iommu) == section->mr &&
+            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
                 giommu->n.start == section->offset_within_region) {
                 Int128 llend;
                 vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
-                int idx = memory_region_iommu_attrs_to_index(giommu->iommu,
+                int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
                                                        MEMTXATTRS_UNSPECIFIED);
 
                 llend = int128_add(int128_make64(section->offset_within_region),
@@ -1410,7 +1410,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
                                     section->offset_within_region,
                                     int128_get64(llend),
                                     idx);
-                memory_region_iommu_replay(giommu->iommu, &gdn.n);
+                memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
                 break;
             }
         }
@@ -2246,7 +2246,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
         QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
             memory_region_unregister_iommu_notifier(
-                    MEMORY_REGION(giommu->iommu), &giommu->n);
+                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
             QLIST_REMOVE(giommu, giommu_next);
             g_free(giommu);
         }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8af11b0a76..e573f5a9f1 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -98,7 +98,7 @@ typedef struct VFIOContainer {
 
 typedef struct VFIOGuestIOMMU {
     VFIOContainer *container;
-    IOMMUMemoryRegion *iommu;
+    IOMMUMemoryRegion *iommu_mr;
     hwaddr iommu_offset;
     IOMMUNotifier n;
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
-- 
2.27.0




* [RFC 06/18] vfio/common: Split common.c into common.c, container.c and as.c
  2022-04-14 10:46 ` Yi Liu
                   ` (5 preceding siblings ...)
  (?)
@ 2022-04-14 10:46 ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

Before introducing support for the new /dev/iommu backend
in VFIO, let's split the common.c file into three parts:

- common.c keeps backend-agnostic code unrelated to DMA mapping
- as.c is created and contains code related to VFIOAddressSpace and
  MemoryListeners. This code will be backend-agnostic.
- container.c is created and will contain code related to the legacy
  VFIO backend (containers, groups, ...).

No functional change intended.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/as.c                  |  868 ++++++++++++
 hw/vfio/common.c              | 2340 +++------------------------------
 hw/vfio/container.c           | 1193 +++++++++++++++++
 hw/vfio/meson.build           |    2 +
 include/hw/vfio/vfio-common.h |   28 +
 5 files changed, 2278 insertions(+), 2153 deletions(-)
 create mode 100644 hw/vfio/as.c
 create mode 100644 hw/vfio/container.c

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
new file mode 100644
index 0000000000..4181182808
--- /dev/null
+++ b/hw/vfio/as.c
@@ -0,0 +1,868 @@
+/*
+ * generic functions used by VFIO devices
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#ifdef CONFIG_KVM
+#include <linux/kvm.h>
+#endif
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "exec/ram_addr.h"
+#include "hw/hw.h"
+#include "qemu/error-report.h"
+#include "qemu/main-loop.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/reset.h"
+#include "sysemu/runstate.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "migration/migration.h"
+
+static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
+    QLIST_HEAD_INITIALIZER(vfio_address_spaces);
+
+void vfio_host_win_add(VFIOContainer *container,
+                       hwaddr min_iova, hwaddr max_iova,
+                       uint64_t iova_pgsizes)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (ranges_overlap(hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1,
+                           min_iova,
+                           max_iova - min_iova + 1)) {
+            hw_error("%s: Overlapped IOMMU are not enabled", __func__);
+        }
+    }
+
+    hostwin = g_malloc0(sizeof(*hostwin));
+
+    hostwin->min_iova = min_iova;
+    hostwin->max_iova = max_iova;
+    hostwin->iova_pgsizes = iova_pgsizes;
+    QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
+}
+
+int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+                      hwaddr max_iova)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
+            QLIST_REMOVE(hostwin, hostwin_next);
+            g_free(hostwin);
+            return 0;
+        }
+    }
+
+    return -1;
+}
+
+static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+{
+    return (!memory_region_is_ram(section->mr) &&
+            !memory_region_is_iommu(section->mr)) ||
+           memory_region_is_protected(section->mr) ||
+           /*
+            * Sizing an enabled 64-bit BAR can cause spurious mappings to
+            * addresses in the upper part of the 64-bit address space.  These
+            * are never accessed by the CPU and beyond the address width of
+            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
+            */
+           section->offset_within_address_space & (1ULL << 63);
+}
+
+/* Called with rcu_read_lock held.  */
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
+                               ram_addr_t *ram_addr, bool *read_only)
+{
+    MemoryRegion *mr;
+    hwaddr xlat;
+    hwaddr len = iotlb->addr_mask + 1;
+    bool writable = iotlb->perm & IOMMU_WO;
+
+    /*
+     * The IOMMU TLB entry we have just covers translation through
+     * this IOMMU to its immediate target.  We need to translate
+     * it the rest of the way through to memory.
+     */
+    mr = address_space_translate(&address_space_memory,
+                                 iotlb->translated_addr,
+                                 &xlat, &len, writable,
+                                 MEMTXATTRS_UNSPECIFIED);
+    if (!memory_region_is_ram(mr)) {
+        error_report("iommu map to non memory area %"HWADDR_PRIx"",
+                     xlat);
+        return false;
+    } else if (memory_region_has_ram_discard_manager(mr)) {
+        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr);
+        MemoryRegionSection tmp = {
+            .mr = mr,
+            .offset_within_region = xlat,
+            .size = int128_make64(len),
+        };
+
+        /*
+         * Malicious VMs can map memory into the IOMMU, which is expected
+         * to remain discarded. vfio will pin all pages, populating memory.
+         * Disallow that. vmstate priorities make sure any RamDiscardManager
+         * were already restored before IOMMUs are restored.
+         */
+        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
+            error_report("iommu map to discarded memory (e.g., unplugged via"
+                         " virtio-mem): %"HWADDR_PRIx"",
+                         iotlb->translated_addr);
+            return false;
+        }
+
+        /*
+         * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
+         * pages will remain pinned inside vfio until unmapped, resulting in a
+         * higher memory consumption than expected. If memory would get
+         * populated again later, there would be an inconsistency between pages
+         * pinned by vfio and pages seen by QEMU. This is the case until
+         * unmapped from the IOMMU (e.g., during device reset).
+         *
+         * With malicious guests, we really only care about pinning more memory
+         * than expected. RLIMIT_MEMLOCK set for the user/process can never be
+         * exceeded and can be used to mitigate this problem.
+         */
+        warn_report_once("Using vfio with vIOMMUs and coordinated discarding of"
+                         " RAM (e.g., virtio-mem) works, however, malicious"
+                         " guests can trigger pinning of more memory than"
+                         " intended via an IOMMU. It's possible to mitigate "
+                         "by setting/adjusting RLIMIT_MEMLOCK.");
+    }
+
+    /*
+     * Translation truncates length to the IOMMU page size,
+     * check that it did not truncate too much.
+     */
+    if (len & iotlb->addr_mask) {
+        error_report("iommu has granularity incompatible with target AS");
+        return false;
+    }
+
+    if (vaddr) {
+        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    }
+
+    if (ram_addr) {
+        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
+    }
+
+    if (read_only) {
+        *read_only = !writable || mr->readonly;
+    }
+
+    return true;
+}
+
+static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    VFIOContainer *container = giommu->container;
+    hwaddr iova = iotlb->iova + giommu->iommu_offset;
+    void *vaddr;
+    int ret;
+
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
+
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+        return;
+    }
+
+    rcu_read_lock();
+
+    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
+        bool read_only;
+
+        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
+            goto out;
+        }
+        /*
+         * vaddr is only valid until rcu_read_unlock(). But after
+         * vfio_dma_map has set up the mapping the pages will be
+         * pinned by the kernel. This makes sure that the RAM backend
+         * of vaddr will always be there, even if the memory object is
+         * destroyed and its backing memory munmap-ed.
+         */
+        ret = vfio_dma_map(container, iova,
+                           iotlb->addr_mask + 1, vaddr,
+                           read_only);
+        if (ret) {
+            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                         container, iova,
+                         iotlb->addr_mask + 1, vaddr, ret);
+        }
+    } else {
+        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
+        if (ret) {
+            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iova,
+                         iotlb->addr_mask + 1, ret);
+        }
+    }
+out:
+    rcu_read_unlock();
+}
+
+static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
+                                            MemoryRegionSection *section)
+{
+    VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
+                                                listener);
+    const hwaddr size = int128_get64(section->size);
+    const hwaddr iova = section->offset_within_address_space;
+    int ret;
+
+    /* Unmap with a single call. */
+    ret = vfio_dma_unmap(vrdl->container, iova, size, NULL);
+    if (ret) {
+        error_report("%s: vfio_dma_unmap() failed: %s", __func__,
+                     strerror(-ret));
+    }
+}
+
+static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
+                                            MemoryRegionSection *section)
+{
+    VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
+                                                listener);
+    const hwaddr end = section->offset_within_region +
+                       int128_get64(section->size);
+    hwaddr start, next, iova;
+    void *vaddr;
+    int ret;
+
+    /*
+     * Map in (aligned within memory region) minimum granularity, so we can
+     * unmap in minimum granularity later.
+     */
+    for (start = section->offset_within_region; start < end; start = next) {
+        next = ROUND_UP(start + 1, vrdl->granularity);
+        next = MIN(next, end);
+
+        iova = start - section->offset_within_region +
+               section->offset_within_address_space;
+        vaddr = memory_region_get_ram_ptr(section->mr) + start;
+
+        ret = vfio_dma_map(vrdl->container, iova, next - start,
+                           vaddr, section->readonly);
+        if (ret) {
+            /* Rollback */
+            vfio_ram_discard_notify_discard(rdl, section);
+            return ret;
+        }
+    }
+    return 0;
+}
+
+static void vfio_register_ram_discard_listener(VFIOContainer *container,
+                                               MemoryRegionSection *section)
+{
+    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
+    VFIORamDiscardListener *vrdl;
+
+    /* Ignore some corner cases not relevant in practice. */
+    g_assert(QEMU_IS_ALIGNED(section->offset_within_region, TARGET_PAGE_SIZE));
+    g_assert(QEMU_IS_ALIGNED(section->offset_within_address_space,
+                             TARGET_PAGE_SIZE));
+    g_assert(QEMU_IS_ALIGNED(int128_get64(section->size), TARGET_PAGE_SIZE));
+
+    vrdl = g_new0(VFIORamDiscardListener, 1);
+    vrdl->container = container;
+    vrdl->mr = section->mr;
+    vrdl->offset_within_address_space = section->offset_within_address_space;
+    vrdl->size = int128_get64(section->size);
+    vrdl->granularity = ram_discard_manager_get_min_granularity(rdm,
+                                                                section->mr);
+
+    g_assert(vrdl->granularity && is_power_of_2(vrdl->granularity));
+    g_assert(container->pgsizes &&
+             vrdl->granularity >= 1ULL << ctz64(container->pgsizes));
+
+    ram_discard_listener_init(&vrdl->listener,
+                              vfio_ram_discard_notify_populate,
+                              vfio_ram_discard_notify_discard, true);
+    ram_discard_manager_register_listener(rdm, &vrdl->listener, section);
+    QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next);
+
+    /*
+     * Sanity-check if we have a theoretically problematic setup where we could
+     * exceed the maximum number of possible DMA mappings over time. We assume
+     * that each mapped section in the same address space as a RamDiscardManager
+     * section consumes exactly one DMA mapping, with the exception of
+     * RamDiscardManager sections; i.e., we don't expect to have gIOMMU sections
+     * in the same address space as RamDiscardManager sections.
+     *
+     * We assume that each section in the address space consumes one memslot.
+     * We take the number of KVM memory slots as a best guess for the maximum
+     * number of sections in the address space we could have over time,
+     * also consuming DMA mappings.
+     */
+    if (container->dma_max_mappings) {
+        unsigned int vrdl_count = 0, vrdl_mappings = 0, max_memslots = 512;
+
+#ifdef CONFIG_KVM
+        if (kvm_enabled()) {
+            max_memslots = kvm_get_max_memslots();
+        }
+#endif
+
+        QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
+            hwaddr start, end;
+
+            start = QEMU_ALIGN_DOWN(vrdl->offset_within_address_space,
+                                    vrdl->granularity);
+            end = ROUND_UP(vrdl->offset_within_address_space + vrdl->size,
+                           vrdl->granularity);
+            vrdl_mappings += (end - start) / vrdl->granularity;
+            vrdl_count++;
+        }
+
+        if (vrdl_mappings + max_memslots - vrdl_count >
+            container->dma_max_mappings) {
+            warn_report("%s: possibly running out of DMA mappings. E.g., try"
+                        " increasing the 'block-size' of virtio-mem devices."
+                        " Maximum possible DMA mappings: %d, Maximum possible"
+                        " memslots: %d", __func__, container->dma_max_mappings,
+                        max_memslots);
+        }
+    }
+}
+
+static void vfio_unregister_ram_discard_listener(VFIOContainer *container,
+                                                 MemoryRegionSection *section)
+{
+    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
+    VFIORamDiscardListener *vrdl = NULL;
+
+    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
+        if (vrdl->mr == section->mr &&
+            vrdl->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!vrdl) {
+        hw_error("vfio: Trying to unregister missing RAM discard listener");
+    }
+
+    ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
+    QLIST_REMOVE(vrdl, next);
+    g_free(vrdl);
+}
+
+static void vfio_listener_region_add(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    hwaddr iova, end;
+    Int128 llend, llsize;
+    void *vaddr;
+    int ret;
+    VFIOHostDMAWindow *hostwin;
+    bool hostwin_found;
+    Error *err = NULL;
+
+    if (vfio_listener_skipped_section(section)) {
+        trace_vfio_listener_region_add_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space &
+                  ~qemu_real_host_page_mask) !=
+                 (section->offset_within_region & ~qemu_real_host_page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        if (memory_region_is_ram_device(section->mr)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                qemu_real_host_page_size);
+        }
+        return;
+    }
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    if (vfio_container_add_section_window(container, section, &err)) {
+        goto fail;
+    }
+
+    hostwin_found = false;
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
+            hostwin_found = true;
+            break;
+        }
+    }
+
+    if (!hostwin_found) {
+        error_setg(&err, "Container %p can't map guest IOVA region"
+                   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
+        goto fail;
+    }
+
+    memory_region_ref(section->mr);
+
+    if (memory_region_is_iommu(section->mr)) {
+        VFIOGuestIOMMU *giommu;
+        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
+        int iommu_idx;
+
+        trace_vfio_listener_region_add_iommu(iova, end);
+        /*
+         * FIXME: For VFIO iommu types which have KVM acceleration to
+         * avoid bouncing all map/unmaps through qemu this way, this
+         * would be the right place to wire that up (tell the KVM
+         * device emulation the VFIO iommu handles to use).
+         */
+        giommu = g_malloc0(sizeof(*giommu));
+        giommu->iommu_mr = iommu_mr;
+        giommu->iommu_offset = section->offset_within_address_space -
+                               section->offset_within_region;
+        giommu->container = container;
+        llend = int128_add(int128_make64(section->offset_within_region),
+                           section->size);
+        llend = int128_sub(llend, int128_one());
+        iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
+                                                       MEMTXATTRS_UNSPECIFIED);
+        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
+                            IOMMU_NOTIFIER_IOTLB_EVENTS,
+                            section->offset_within_region,
+                            int128_get64(llend),
+                            iommu_idx);
+
+        ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr,
+                                                     container->pgsizes,
+                                                     &err);
+        if (ret) {
+            g_free(giommu);
+            goto fail;
+        }
+
+        ret = memory_region_register_iommu_notifier(section->mr, &giommu->n,
+                                                    &err);
+        if (ret) {
+            g_free(giommu);
+            goto fail;
+        }
+        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
+        memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
+
+        return;
+    }
+
+    /* Here we assume that memory_region_is_ram(section->mr)==true */
+
+    /*
+     * For RAM memory regions with a RamDiscardManager, we only want to map the
+     * actually populated parts - and update the mapping whenever we're notified
+     * about changes.
+     */
+    if (memory_region_has_ram_discard_manager(section->mr)) {
+        vfio_register_ram_discard_listener(container, section);
+        return;
+    }
+
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    trace_vfio_listener_region_add_ram(iova, end, vaddr);
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+
+        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                pgmask + 1);
+            return;
+        }
+    }
+
+    ret = vfio_dma_map(container, iova, int128_get64(llsize),
+                       vaddr, section->readonly);
+    if (ret) {
+        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                   container, iova, int128_get64(llsize), vaddr, ret);
+        if (memory_region_is_ram_device(section->mr)) {
+            /* Allow unexpected mappings not to be fatal for RAM devices */
+            error_report_err(err);
+            return;
+        }
+        goto fail;
+    }
+
+    return;
+
+fail:
+    if (memory_region_is_ram_device(section->mr)) {
+        error_report("vfio_dma_map() failed; PCI peer-to-peer may not work");
+        return;
+    }
+    /*
+     * On the initfn path, store the first error in the container so we
+     * can fail gracefully.  At runtime, there's not much we can do other
+     * than throw a hardware error.
+     */
+    if (!container->initialized) {
+        if (!container->error) {
+            error_propagate_prepend(&container->error, err,
+                                    "Region %s: ",
+                                    memory_region_name(section->mr));
+        } else {
+            error_free(err);
+        }
+    } else {
+        error_report_err(err);
+        hw_error("vfio: DMA mapping failed, unable to continue");
+    }
+}
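For ram_device regions, vfio_listener_region_add() above refuses to map unless both the IOVA and the size are aligned to the host IOMMU's minimum supported page size, derived from the `iova_pgsizes` bitmap via `ctz64()`. A minimal standalone sketch of that check in plain C (helper names hypothetical; QEMU's `ctz64()` is modeled here with the GCC/Clang builtin):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest supported IOMMU page: lowest set bit of the page-size bitmap. */
static uint64_t min_page_mask(uint64_t iova_pgsizes)
{
    return (1ULL << __builtin_ctzll(iova_pgsizes)) - 1;
}

/* Mirror of the listener's check: both IOVA and size must be aligned. */
static bool can_dma_map(uint64_t iova, uint64_t size, uint64_t iova_pgsizes)
{
    uint64_t pgmask = min_page_mask(iova_pgsizes);

    return !(iova & pgmask) && !(size & pgmask);
}
```

With a 4K-only IOMMU (`iova_pgsizes == 0x1000`), a 4K-aligned 4K mapping passes, while a mapping starting at 0x2800 is skipped rather than sent to the kernel, matching the `trace_vfio_listener_region_add_no_dma_map` early return.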
+
+static void vfio_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    hwaddr iova, end;
+    Int128 llend, llsize;
+    int ret;
+    bool try_unmap = true;
+
+    if (vfio_listener_skipped_section(section)) {
+        trace_vfio_listener_region_del_skip(
+                section->offset_within_address_space,
+                section->offset_within_address_space +
+                int128_get64(int128_sub(section->size, int128_one())));
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space &
+                  ~qemu_real_host_page_mask) !=
+                 (section->offset_within_region & ~qemu_real_host_page_mask))) {
+        error_report("%s received unaligned region", __func__);
+        return;
+    }
+
+    if (memory_region_is_iommu(section->mr)) {
+        VFIOGuestIOMMU *giommu;
+
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
+                giommu->n.start == section->offset_within_region) {
+                memory_region_unregister_iommu_notifier(section->mr,
+                                                        &giommu->n);
+                QLIST_REMOVE(giommu, giommu_next);
+                g_free(giommu);
+                break;
+            }
+        }
+
+        /*
+         * FIXME: We assume the one big unmap below is adequate to
+         * remove any individual page mappings in the IOMMU which
+         * might have been copied into VFIO. This works for a page table
+         * based IOMMU where a big unmap flattens a large range of IO-PTEs.
+         * That may not be true for all IOMMU types.
+         */
+    }
+
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return;
+    }
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    trace_vfio_listener_region_del(iova, end);
+
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask;
+        VFIOHostDMAWindow *hostwin;
+        bool hostwin_found = false;
+
+        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+            if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
+                hostwin_found = true;
+                break;
+            }
+        }
+        assert(hostwin_found); /* or region_add() would have failed */
+
+        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
+    } else if (memory_region_has_ram_discard_manager(section->mr)) {
+        vfio_unregister_ram_discard_listener(container, section);
+        /* Unregistering will trigger an unmap. */
+        try_unmap = false;
+    }
+
+    if (try_unmap) {
+        if (int128_eq(llsize, int128_2_64())) {
+            /* The unmap ioctl doesn't accept a full 64-bit span. */
+            llsize = int128_rshift(llsize, 1);
+            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+            if (ret) {
+                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                             "0x%"HWADDR_PRIx") = %d (%m)",
+                             container, iova, int128_get64(llsize), ret);
+            }
+            iova += int128_get64(llsize);
+        }
+        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+        if (ret) {
+            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iova, int128_get64(llsize), ret);
+        }
+    }
+
+    memory_region_unref(section->mr);
+
+    vfio_container_del_section_window(container, section);
+}
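The unmap path above special-cases a span of exactly 2^64 bytes: the unmap ioctl carries a 64-bit size, so a full 64-bit span is unrepresentable and is issued as two half-space unmaps instead. A rough sketch of the same arithmetic in plain C (helper name hypothetical; the real code does this with QEMU's Int128 helpers):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Split a 2^64-byte unmap into two half-space (iova, size) pairs, as
 * vfio_listener_region_del() does when llsize == Int128 2^64.
 */
static void split_full_span(uint64_t iova, uint64_t out_iova[2],
                            uint64_t out_size[2])
{
    uint64_t half = 1ULL << 63;     /* half of the 2^64-byte span */

    out_iova[0] = iova;
    out_size[0] = half;
    out_iova[1] = iova + half;      /* wraps modulo 2^64, like the Int128 math */
    out_size[1] = half;
}
```

The two halves jointly cover the whole space: the end of the second half wraps back to the start in 64-bit arithmetic.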
+
+static void vfio_listener_log_global_start(MemoryListener *listener)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+    vfio_set_dirty_page_tracking(container, true);
+}
+
+static void vfio_listener_log_global_stop(MemoryListener *listener)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+    vfio_set_dirty_page_tracking(container, false);
+}
+
+typedef struct {
+    IOMMUNotifier n;
+    VFIOGuestIOMMU *giommu;
+} vfio_giommu_dirty_notifier;
+
+static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    vfio_giommu_dirty_notifier *gdn = container_of(n,
+                                                vfio_giommu_dirty_notifier, n);
+    VFIOGuestIOMMU *giommu = gdn->giommu;
+    VFIOContainer *container = giommu->container;
+    hwaddr iova = iotlb->iova + giommu->iommu_offset;
+    ram_addr_t translated_addr;
+
+    trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
+
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+        return;
+    }
+
+    rcu_read_lock();
+    if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
+        int ret;
+
+        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
+                                    translated_addr);
+        if (ret) {
+            error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iova,
+                         iotlb->addr_mask + 1, ret);
+        }
+    }
+    rcu_read_unlock();
+}
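The dirty notifier above converts the IOVA reported by the guest IOMMU into the container's address space by adding `giommu->iommu_offset`, which region_add computed as `offset_within_address_space - offset_within_region`. A tiny sketch of that translation, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Delta between where the section sits in the address space and where it
 * sits inside the IOMMU memory region (giommu->iommu_offset).
 */
static uint64_t iommu_offset(uint64_t offset_within_address_space,
                             uint64_t offset_within_region)
{
    return offset_within_address_space - offset_within_region;
}

/* Container-relative IOVA for an IOTLB entry, as in the dirty notifier. */
static uint64_t translate_iova(uint64_t iotlb_iova, uint64_t offset)
{
    return iotlb_iova + offset;
}
```

For example, a section mapped at 0x80000000 in the address space with region offset 0 translates a guest IOVA of 0x1000 to 0x80001000 before the dirty bitmap query.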
+
+static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
+                                             void *opaque)
+{
+    const hwaddr size = int128_get64(section->size);
+    const hwaddr iova = section->offset_within_address_space;
+    const ram_addr_t ram_addr = memory_region_get_ram_addr(section->mr) +
+                                section->offset_within_region;
+    VFIORamDiscardListener *vrdl = opaque;
+
+    /*
+     * Sync the whole mapped region (spanning multiple individual mappings)
+     * in one go.
+     */
+    return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr);
+}
+
+static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
+                                                   MemoryRegionSection *section)
+{
+    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
+    VFIORamDiscardListener *vrdl = NULL;
+
+    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
+        if (vrdl->mr == section->mr &&
+            vrdl->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!vrdl) {
+        hw_error("vfio: Trying to sync missing RAM discard listener");
+    }
+
+    /*
+     * We can only synchronize the bitmap for parts that are actually
+     * mapped - which correspond to populated parts. Replay all populated
+     * parts.
+     */
+    return ram_discard_manager_replay_populated(rdm, section,
+                                              vfio_ram_discard_get_dirty_bitmap,
+                                                vrdl);
+}
+
+static int vfio_sync_dirty_bitmap(VFIOContainer *container,
+                                  MemoryRegionSection *section)
+{
+    ram_addr_t ram_addr;
+
+    if (memory_region_is_iommu(section->mr)) {
+        VFIOGuestIOMMU *giommu;
+
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
+                giommu->n.start == section->offset_within_region) {
+                Int128 llend;
+                vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
+                int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
+                                                       MEMTXATTRS_UNSPECIFIED);
+
+                llend = int128_add(int128_make64(section->offset_within_region),
+                                   section->size);
+                llend = int128_sub(llend, int128_one());
+
+                iommu_notifier_init(&gdn.n,
+                                    vfio_iommu_map_dirty_notify,
+                                    IOMMU_NOTIFIER_MAP,
+                                    section->offset_within_region,
+                                    int128_get64(llend),
+                                    idx);
+                memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
+                break;
+            }
+        }
+        return 0;
+    } else if (memory_region_has_ram_discard_manager(section->mr)) {
+        return vfio_sync_ram_discard_listener_dirty_bitmap(container, section);
+    }
+
+    ram_addr = memory_region_get_ram_addr(section->mr) +
+               section->offset_within_region;
+
+    return vfio_get_dirty_bitmap(container,
+                   REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
+                   int128_get64(section->size), ram_addr);
+}
+
+static void vfio_listener_log_sync(MemoryListener *listener,
+        MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+    if (vfio_listener_skipped_section(section) ||
+        !container->dirty_pages_supported) {
+        return;
+    }
+
+    if (vfio_devices_all_dirty_tracking(container)) {
+        vfio_sync_dirty_bitmap(container, section);
+    }
+}
+
+const MemoryListener vfio_memory_listener = {
+    .name = "vfio",
+    .region_add = vfio_listener_region_add,
+    .region_del = vfio_listener_region_del,
+    .log_global_start = vfio_listener_log_global_start,
+    .log_global_stop = vfio_listener_log_global_stop,
+    .log_sync = vfio_listener_log_sync,
+};
+
+VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
+{
+    VFIOAddressSpace *space;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        if (space->as == as) {
+            return space;
+        }
+    }
+
+    /* No suitable VFIOAddressSpace, create a new one */
+    space = g_malloc0(sizeof(*space));
+    space->as = as;
+    QLIST_INIT(&space->containers);
+
+    QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
+
+    return space;
+}
+
+void vfio_put_address_space(VFIOAddressSpace *space)
+{
+    if (QLIST_EMPTY(&space->containers)) {
+        QLIST_REMOVE(space, list);
+        g_free(space);
+    }
+}
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b05f68b5c7..892aa47113 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -20,42 +20,13 @@
 
 #include "qemu/osdep.h"
 #include <sys/ioctl.h>
-#ifdef CONFIG_KVM
-#include <linux/kvm.h>
-#endif
 #include <linux/vfio.h>
 
 #include "hw/vfio/vfio-common.h"
 #include "hw/vfio/vfio.h"
-#include "exec/address-spaces.h"
-#include "exec/memory.h"
-#include "exec/ram_addr.h"
 #include "hw/hw.h"
-#include "qemu/error-report.h"
-#include "qemu/main-loop.h"
-#include "qemu/range.h"
-#include "sysemu/kvm.h"
-#include "sysemu/reset.h"
-#include "sysemu/runstate.h"
 #include "trace.h"
 #include "qapi/error.h"
-#include "migration/migration.h"
-
-VFIOGroupList vfio_group_list =
-    QLIST_HEAD_INITIALIZER(vfio_group_list);
-static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
-    QLIST_HEAD_INITIALIZER(vfio_address_spaces);
-
-#ifdef CONFIG_KVM
-/*
- * We have a single VFIO pseudo device per KVM VM.  Once created it lives
- * for the life of the VM.  Closing the file descriptor only drops our
- * reference to it and the device's reference to kvm.  Therefore once
- * initialized, this file descriptor is only released on QEMU exit and
- * we'll re-use it should another vfio device be attached before then.
- */
-static int vfio_kvm_device_fd = -1;
-#endif
 
 /*
  * Common VFIO interrupt disable
@@ -135,29 +106,6 @@ static const char *index_to_str(VFIODevice *vbasedev, int index)
     }
 }
 
-static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
-{
-    switch (container->iommu_type) {
-    case VFIO_TYPE1v2_IOMMU:
-    case VFIO_TYPE1_IOMMU:
-        /*
-         * We support coordinated discarding of RAM via the RamDiscardManager.
-         */
-        return ram_block_uncoordinated_discard_disable(state);
-    default:
-        /*
-         * VFIO_SPAPR_TCE_IOMMU most probably works just fine with
-         * RamDiscardManager, however, it is completely untested.
-         *
-         * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does
-         * completely the opposite of managing mapping/pinning dynamically as
-         * required by RamDiscardManager. We would have to special-case sections
-         * with a RamDiscardManager.
-         */
-        return ram_block_discard_disable(state);
-    }
-}
-
 int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
                            int action, int fd, Error **errp)
 {
@@ -312,2115 +260,296 @@ const MemoryRegionOps vfio_region_ops = {
     },
 };
 
-/*
- * Device state interfaces
- */
-
-bool vfio_mig_active(void)
+static struct vfio_info_cap_header *
+vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev;
-
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        return false;
-    }
+    struct vfio_info_cap_header *hdr;
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->migration_blocker) {
-                return false;
-            }
+    for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
         }
     }
-    return true;
+
+    return NULL;
 }
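vfio_get_cap() above walks a chain of `vfio_info_cap_header` entries linked by byte offsets from the start of the info buffer, stopping when an offset of zero points back at the buffer itself (`hdr == ptr`). A self-contained sketch of the same walk, using a local struct that mirrors the uapi header layout (names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout mirrors struct vfio_info_cap_header from <linux/vfio.h>. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;      /* byte offset of next capability; 0 ends the chain */
};

/* Follow byte offsets from the buffer start until the offset is 0. */
static struct cap_header *find_cap(void *ptr, uint32_t cap_offset, uint16_t id)
{
    struct cap_header *hdr;

    for (hdr = (struct cap_header *)((char *)ptr + cap_offset); hdr != ptr;
         hdr = (struct cap_header *)((char *)ptr + hdr->next)) {
        if (hdr->id == id) {
            return hdr;
        }
    }
    return NULL;
}
```

The region, IOMMU, and device wrappers that follow only differ in which flag they check before handing the buffer to this walk.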
 
-static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
+struct vfio_info_cap_header *
+vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev;
-    MigrationState *ms = migrate_get_current();
-
-    if (!migration_is_setup_or_active(ms->state)) {
-        return false;
+    if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
+        return NULL;
     }
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            VFIOMigration *migration = vbasedev->migration;
+    return vfio_get_cap((void *)info, info->cap_offset, id);
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
 
-            if (!migration) {
-                return false;
-            }
+    return vfio_get_cap((void *)info, info->cap_offset, id);
+}
 
-            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
-                && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
-                return false;
-            }
-        }
+struct vfio_info_cap_header *
+vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id)
+{
+    if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) {
+        return NULL;
     }
-    return true;
+
+    return vfio_get_cap((void *)info, info->cap_offset, id);
 }
 
-static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
+                             unsigned int *avail)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev;
-    MigrationState *ms = migrate_get_current();
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_dma_avail *cap;
 
-    if (!migration_is_setup_or_active(ms->state)) {
+    /* If the capability cannot be found, assume no DMA limiting */
+    hdr = vfio_get_iommu_type1_info_cap(info,
+                                        VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL);
+    if (hdr == NULL) {
         return false;
     }
 
-    QLIST_FOREACH(group, &container->group_list, container_next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            VFIOMigration *migration = vbasedev->migration;
-
-            if (!migration) {
-                return false;
-            }
-
-            if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
-                (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
-                continue;
-            } else {
-                return false;
-            }
-        }
+    if (avail != NULL) {
+        cap = (void *) hdr;
+        *avail = cap->avail;
     }
+
     return true;
 }
 
-static int vfio_dma_unmap_bitmap(VFIOContainer *container,
-                                 hwaddr iova, ram_addr_t size,
-                                 IOMMUTLBEntry *iotlb)
+static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
+                                          struct vfio_region_info *info)
 {
-    struct vfio_iommu_type1_dma_unmap *unmap;
-    struct vfio_bitmap *bitmap;
-    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
-    int ret;
+    struct vfio_info_cap_header *hdr;
+    struct vfio_region_info_cap_sparse_mmap *sparse;
+    int i, j;
 
-    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+    hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
+    if (!hdr) {
+        return -ENODEV;
+    }
 
-    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
-    unmap->iova = iova;
-    unmap->size = size;
-    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
-    bitmap = (struct vfio_bitmap *)&unmap->data;
+    sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header);
 
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty. Hence set bitmap_pgsize
-     * to qemu_real_host_page_size.
-     */
+    trace_vfio_region_sparse_mmap_header(region->vbasedev->name,
+                                         region->nr, sparse->nr_areas);
 
-    bitmap->pgsize = qemu_real_host_page_size;
-    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                   BITS_PER_BYTE;
+    region->mmaps = g_new0(VFIOMmap, sparse->nr_areas);
 
-    if (bitmap->size > container->max_dirty_bitmap_size) {
-        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
-                     (uint64_t)bitmap->size);
-        ret = -E2BIG;
-        goto unmap_exit;
-    }
+    for (i = 0, j = 0; i < sparse->nr_areas; i++) {
+        trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
+                                            sparse->areas[i].offset +
+                                            sparse->areas[i].size);
 
-    bitmap->data = g_try_malloc0(bitmap->size);
-    if (!bitmap->data) {
-        ret = -ENOMEM;
-        goto unmap_exit;
+        if (sparse->areas[i].size) {
+            region->mmaps[j].offset = sparse->areas[i].offset;
+            region->mmaps[j].size = sparse->areas[i].size;
+            j++;
+        }
     }
 
-    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
-    if (!ret) {
-        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
-                iotlb->translated_addr, pages);
-    } else {
-        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
-    }
+    region->nr_mmaps = j;
+    region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap));
 
-    g_free(bitmap->data);
-unmap_exit:
-    g_free(unmap);
-    return ret;
+    return 0;
 }
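The sparse-mmap setup above copies only the areas with a non-zero size, compacting them to the front of the array with a second index `j` and then shrinking the allocation. The core of that loop, as a standalone sketch with a simplified area type:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct vfio_region_sparse_mmap_area. */
struct area {
    uint64_t offset;
    uint64_t size;
};

/*
 * Keep only areas with a non-zero size, compacting them to the front,
 * and return the new count (the loop in vfio_setup_region_sparse_mmaps).
 */
static int compact_areas(struct area *areas, int nr)
{
    int i, j = 0;

    for (i = 0; i < nr; i++) {
        if (areas[i].size) {
            areas[j++] = areas[i];
        }
    }
    return j;
}
```

In the real function, `j` then becomes `region->nr_mmaps` and the array is `g_realloc()`ed down to exactly that many entries.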
 
-/*
- * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
- */
-static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size,
-                          IOMMUTLBEntry *iotlb)
+int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
+                      int index, const char *name)
 {
-    struct vfio_iommu_type1_dma_unmap unmap = {
-        .argsz = sizeof(unmap),
-        .flags = 0,
-        .iova = iova,
-        .size = size,
-    };
+    struct vfio_region_info *info;
+    int ret;
 
-    if (iotlb && container->dirty_pages_supported &&
-        vfio_devices_all_running_and_saving(container)) {
-        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    ret = vfio_get_region_info(vbasedev, index, &info);
+    if (ret) {
+        return ret;
     }
 
-    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
-        /*
-         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
-         * v4.15) where an overflow in its wrap-around check prevents us from
-         * unmapping the last page of the address space.  Test for the error
-         * condition and re-try the unmap excluding the last page.  The
-         * expectation is that we've never mapped the last page anyway and this
-         * unmap request comes via vIOMMU support which also makes it unlikely
-         * that this page is used.  This bug was introduced well after type1 v2
-         * support was introduced, so we shouldn't need to test for v1.  A fix
-         * is queued for kernel v5.0 so this workaround can be removed once
-         * affected kernels are sufficiently deprecated.
-         */
-        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
-            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
-            trace_vfio_dma_unmap_overflow_workaround();
-            unmap.size -= 1ULL << ctz64(container->pgsizes);
-            continue;
-        }
-        error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
-        return -errno;
-    }
+    region->vbasedev = vbasedev;
+    region->flags = info->flags;
+    region->size = info->size;
+    region->fd_offset = info->offset;
+    region->nr = index;
 
-    return 0;
-}
+    if (region->size) {
+        region->mem = g_new0(MemoryRegion, 1);
+        memory_region_init_io(region->mem, obj, &vfio_region_ops,
+                              region, name, region->size);
 
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                        ram_addr_t size, void *vaddr, bool readonly)
-{
-    struct vfio_iommu_type1_dma_map map = {
-        .argsz = sizeof(map),
-        .flags = VFIO_DMA_MAP_FLAG_READ,
-        .vaddr = (__u64)(uintptr_t)vaddr,
-        .iova = iova,
-        .size = size,
-    };
+        if (!vbasedev->no_mmap &&
+            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
 
-    if (!readonly) {
-        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
-    }
+            ret = vfio_setup_region_sparse_mmaps(region, info);
 
-    /*
-     * Try the mapping, if it fails with EBUSY, unmap the region and try
-     * again.  This shouldn't be necessary, but we sometimes see it in
-     * the VGA ROM space.
-     */
-    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
-         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
-        return 0;
+            if (ret) {
+                region->nr_mmaps = 1;
+                region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+                region->mmaps[0].offset = 0;
+                region->mmaps[0].size = region->size;
+            }
+        }
     }
 
-    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
-    return -errno;
+    g_free(info);
+
+    trace_vfio_region_setup(vbasedev->name, index, name,
+                            region->flags, region->fd_offset, region->size);
+    return 0;
 }
 
-static void vfio_host_win_add(VFIOContainer *container,
-                              hwaddr min_iova, hwaddr max_iova,
-                              uint64_t iova_pgsizes)
+static void vfio_subregion_unmap(VFIORegion *region, int index)
 {
-    VFIOHostDMAWindow *hostwin;
-
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-        if (ranges_overlap(hostwin->min_iova,
-                           hostwin->max_iova - hostwin->min_iova + 1,
-                           min_iova,
-                           max_iova - min_iova + 1)) {
-            hw_error("%s: Overlapped IOMMU are not enabled", __func__);
-        }
-    }
-
-    hostwin = g_malloc0(sizeof(*hostwin));
-
-    hostwin->min_iova = min_iova;
-    hostwin->max_iova = max_iova;
-    hostwin->iova_pgsizes = iova_pgsizes;
-    QLIST_INSERT_HEAD(&container->hostwin_list, hostwin, hostwin_next);
+    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
+                            region->mmaps[index].offset,
+                            region->mmaps[index].offset +
+                            region->mmaps[index].size - 1);
+    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
+    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
+    object_unparent(OBJECT(&region->mmaps[index].mem));
+    region->mmaps[index].mmap = NULL;
 }
 
-static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
-                             hwaddr max_iova)
+int vfio_region_mmap(VFIORegion *region)
 {
-    VFIOHostDMAWindow *hostwin;
+    int i, prot = 0;
+    char *name;
 
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-        if (hostwin->min_iova == min_iova && hostwin->max_iova == max_iova) {
-            QLIST_REMOVE(hostwin, hostwin_next);
-            g_free(hostwin);
-            return 0;
-        }
+    if (!region->mem) {
+        return 0;
     }
 
-    return -1;
-}
-
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
-{
-    return (!memory_region_is_ram(section->mr) &&
-            !memory_region_is_iommu(section->mr)) ||
-           memory_region_is_protected(section->mr) ||
-           /*
-            * Sizing an enabled 64-bit BAR can cause spurious mappings to
-            * addresses in the upper part of the 64-bit address space.  These
-            * are never accessed by the CPU and beyond the address width of
-            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
-            */
-           section->offset_within_address_space & (1ULL << 63);
-}
+    prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
+    prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
 
-/* Called with rcu_read_lock held.  */
-static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                               ram_addr_t *ram_addr, bool *read_only)
-{
-    MemoryRegion *mr;
-    hwaddr xlat;
-    hwaddr len = iotlb->addr_mask + 1;
-    bool writable = iotlb->perm & IOMMU_WO;
+    for (i = 0; i < region->nr_mmaps; i++) {
+        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
+                                     MAP_SHARED, region->vbasedev->fd,
+                                     region->fd_offset +
+                                     region->mmaps[i].offset);
+        if (region->mmaps[i].mmap == MAP_FAILED) {
+            int ret = -errno;
 
-    /*
-     * The IOMMU TLB entry we have just covers translation through
-     * this IOMMU to its immediate target.  We need to translate
-     * it the rest of the way through to memory.
-     */
-    mr = address_space_translate(&address_space_memory,
-                                 iotlb->translated_addr,
-                                 &xlat, &len, writable,
-                                 MEMTXATTRS_UNSPECIFIED);
-    if (!memory_region_is_ram(mr)) {
-        error_report("iommu map to non memory area %"HWADDR_PRIx"",
-                     xlat);
-        return false;
-    } else if (memory_region_has_ram_discard_manager(mr)) {
-        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr);
-        MemoryRegionSection tmp = {
-            .mr = mr,
-            .offset_within_region = xlat,
-            .size = int128_make64(len),
-        };
-
-        /*
-         * Malicious VMs can map memory into the IOMMU, which is expected
-         * to remain discarded. vfio will pin all pages, populating memory.
-         * Disallow that. vmstate priorities make sure any RamDiscardManager
-         * were already restored before IOMMUs are restored.
-         */
-        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
-            error_report("iommu map to discarded memory (e.g., unplugged via"
-                         " virtio-mem): %"HWADDR_PRIx"",
-                         iotlb->translated_addr);
-            return false;
-        }
+            trace_vfio_region_mmap_fault(memory_region_name(region->mem), i,
+                                         region->fd_offset +
+                                         region->mmaps[i].offset,
+                                         region->fd_offset +
+                                         region->mmaps[i].offset +
+                                         region->mmaps[i].size - 1, ret);
 
-        /*
-         * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
-         * pages will remain pinned inside vfio until unmapped, resulting in a
-         * higher memory consumption than expected. If memory would get
-         * populated again later, there would be an inconsistency between pages
-         * pinned by vfio and pages seen by QEMU. This is the case until
-         * unmapped from the IOMMU (e.g., during device reset).
-         *
-         * With malicious guests, we really only care about pinning more memory
-         * than expected. RLIMIT_MEMLOCK set for the user/process can never be
-         * exceeded and can be used to mitigate this problem.
-         */
-        warn_report_once("Using vfio with vIOMMUs and coordinated discarding of"
-                         " RAM (e.g., virtio-mem) works, however, malicious"
-                         " guests can trigger pinning of more memory than"
-                         " intended via an IOMMU. It's possible to mitigate "
-                         " by setting/adjusting RLIMIT_MEMLOCK.");
-    }
+            region->mmaps[i].mmap = NULL;
 
-    /*
-     * Translation truncates length to the IOMMU page size,
-     * check that it did not truncate too much.
-     */
-    if (len & iotlb->addr_mask) {
-        error_report("iommu has granularity incompatible with target AS");
-        return false;
-    }
+            for (i--; i >= 0; i--) {
+                vfio_subregion_unmap(region, i);
+            }
 
-    if (vaddr) {
-        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
-    }
+            return ret;
+        }
 
-    if (ram_addr) {
-        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
-    }
+        name = g_strdup_printf("%s mmaps[%d]",
+                               memory_region_name(region->mem), i);
+        memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
+                                          memory_region_owner(region->mem),
+                                          name, region->mmaps[i].size,
+                                          region->mmaps[i].mmap);
+        g_free(name);
+        memory_region_add_subregion(region->mem, region->mmaps[i].offset,
+                                    &region->mmaps[i].mem);
 
-    if (read_only) {
-        *read_only = !writable || mr->readonly;
+        trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
+                               region->mmaps[i].offset,
+                               region->mmaps[i].offset +
+                               region->mmaps[i].size - 1);
     }
 
-    return true;
+    return 0;
 }
 
-static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+void vfio_region_unmap(VFIORegion *region)
 {
-    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
-    VFIOContainer *container = giommu->container;
-    hwaddr iova = iotlb->iova + giommu->iommu_offset;
-    void *vaddr;
-    int ret;
-
-    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
-                                iova, iova + iotlb->addr_mask);
+    int i;
 
-    if (iotlb->target_as != &address_space_memory) {
-        error_report("Wrong target AS \"%s\", only system memory is allowed",
-                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+    if (!region->mem) {
         return;
     }
 
-    rcu_read_lock();
-
-    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        bool read_only;
-
-        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
-            goto out;
-        }
-        /*
-         * vaddr is only valid until rcu_read_unlock(). But after
-         * vfio_dma_map has set up the mapping the pages will be
-         * pinned by the kernel. This makes sure that the RAM backend
-         * of vaddr will always be there, even if the memory object is
-         * destroyed and its backing memory munmap-ed.
-         */
-        ret = vfio_dma_map(container, iova,
-                           iotlb->addr_mask + 1, vaddr,
-                           read_only);
-        if (ret) {
-            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                         container, iova,
-                         iotlb->addr_mask + 1, vaddr, ret);
-        }
-    } else {
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
-        if (ret) {
-            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova,
-                         iotlb->addr_mask + 1, ret);
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            vfio_subregion_unmap(region, i);
         }
     }
-out:
-    rcu_read_unlock();
 }
 
-static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
-                                            MemoryRegionSection *section)
+void vfio_region_exit(VFIORegion *region)
 {
-    VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
-                                                listener);
-    const hwaddr size = int128_get64(section->size);
-    const hwaddr iova = section->offset_within_address_space;
-    int ret;
+    int i;
 
-    /* Unmap with a single call. */
-    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
-    if (ret) {
-        error_report("%s: vfio_dma_unmap() failed: %s", __func__,
-                     strerror(-ret));
+    if (!region->mem) {
+        return;
     }
-}
-
-static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
-                                            MemoryRegionSection *section)
-{
-    VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
-                                                listener);
-    const hwaddr end = section->offset_within_region +
-                       int128_get64(section->size);
-    hwaddr start, next, iova;
-    void *vaddr;
-    int ret;
 
-    /*
-     * Map in (aligned within memory region) minimum granularity, so we can
-     * unmap in minimum granularity later.
-     */
-    for (start = section->offset_within_region; start < end; start = next) {
-        next = ROUND_UP(start + 1, vrdl->granularity);
-        next = MIN(next, end);
-
-        iova = start - section->offset_within_region +
-               section->offset_within_address_space;
-        vaddr = memory_region_get_ram_ptr(section->mr) + start;
-
-        ret = vfio_dma_map(vrdl->container, iova, next - start,
-                           vaddr, section->readonly);
-        if (ret) {
-            /* Rollback */
-            vfio_ram_discard_notify_discard(rdl, section);
-            return ret;
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
         }
     }
-    return 0;
+
+    trace_vfio_region_exit(region->vbasedev->name, region->nr);
 }
 
-static void vfio_register_ram_discard_listener(VFIOContainer *container,
-                                               MemoryRegionSection *section)
+void vfio_region_finalize(VFIORegion *region)
 {
-    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
-    VFIORamDiscardListener *vrdl;
-
-    /* Ignore some corner cases not relevant in practice. */
-    g_assert(QEMU_IS_ALIGNED(section->offset_within_region, TARGET_PAGE_SIZE));
-    g_assert(QEMU_IS_ALIGNED(section->offset_within_address_space,
-                             TARGET_PAGE_SIZE));
-    g_assert(QEMU_IS_ALIGNED(int128_get64(section->size), TARGET_PAGE_SIZE));
-
-    vrdl = g_new0(VFIORamDiscardListener, 1);
-    vrdl->container = container;
-    vrdl->mr = section->mr;
-    vrdl->offset_within_address_space = section->offset_within_address_space;
-    vrdl->size = int128_get64(section->size);
-    vrdl->granularity = ram_discard_manager_get_min_granularity(rdm,
-                                                                section->mr);
-
-    g_assert(vrdl->granularity && is_power_of_2(vrdl->granularity));
-    g_assert(container->pgsizes &&
-             vrdl->granularity >= 1ULL << ctz64(container->pgsizes));
-
-    ram_discard_listener_init(&vrdl->listener,
-                              vfio_ram_discard_notify_populate,
-                              vfio_ram_discard_notify_discard, true);
-    ram_discard_manager_register_listener(rdm, &vrdl->listener, section);
-    QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next);
+    int i;
 
-    /*
-     * Sanity-check if we have a theoretically problematic setup where we could
-     * exceed the maximum number of possible DMA mappings over time. We assume
-     * that each mapped section in the same address space as a RamDiscardManager
-     * section consumes exactly one DMA mapping, with the exception of
-     * RamDiscardManager sections; i.e., we don't expect to have gIOMMU sections
-     * in the same address space as RamDiscardManager sections.
-     *
-     * We assume that each section in the address space consumes one memslot.
-     * We take the number of KVM memory slots as a best guess for the maximum
-     * number of sections in the address space we could have over time,
-     * also consuming DMA mappings.
-     */
-    if (container->dma_max_mappings) {
-        unsigned int vrdl_count = 0, vrdl_mappings = 0, max_memslots = 512;
-
-#ifdef CONFIG_KVM
-        if (kvm_enabled()) {
-            max_memslots = kvm_get_max_memslots();
-        }
-#endif
-
-        QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
-            hwaddr start, end;
-
-            start = QEMU_ALIGN_DOWN(vrdl->offset_within_address_space,
-                                    vrdl->granularity);
-            end = ROUND_UP(vrdl->offset_within_address_space + vrdl->size,
-                           vrdl->granularity);
-            vrdl_mappings += (end - start) / vrdl->granularity;
-            vrdl_count++;
-        }
-
-        if (vrdl_mappings + max_memslots - vrdl_count >
-            container->dma_max_mappings) {
-            warn_report("%s: possibly running out of DMA mappings. E.g., try"
-                        " increasing the 'block-size' of virtio-mem devies."
-                        " Maximum possible DMA mappings: %d, Maximum possible"
-                        " memslots: %d", __func__, container->dma_max_mappings,
-                        max_memslots);
-        }
-    }
-}
-
-static void vfio_unregister_ram_discard_listener(VFIOContainer *container,
-                                                 MemoryRegionSection *section)
-{
-    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
-    VFIORamDiscardListener *vrdl = NULL;
-
-    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
-        if (vrdl->mr == section->mr &&
-            vrdl->offset_within_address_space ==
-            section->offset_within_address_space) {
-            break;
-        }
-    }
-
-    if (!vrdl) {
-        hw_error("vfio: Trying to unregister missing RAM discard listener");
-    }
-
-    ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
-    QLIST_REMOVE(vrdl, next);
-    g_free(vrdl);
-}
-
-static void vfio_listener_region_add(MemoryListener *listener,
-                                     MemoryRegionSection *section)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-    hwaddr iova, end;
-    Int128 llend, llsize;
-    void *vaddr;
-    int ret;
-    VFIOHostDMAWindow *hostwin;
-    bool hostwin_found;
-    Error *err = NULL;
-
-    if (vfio_listener_skipped_section(section)) {
-        trace_vfio_listener_region_add_skip(
-                section->offset_within_address_space,
-                section->offset_within_address_space +
-                int128_get64(int128_sub(section->size, int128_one())));
-        return;
-    }
-
-    if (unlikely((section->offset_within_address_space &
-                  ~qemu_real_host_page_mask) !=
-                 (section->offset_within_region & ~qemu_real_host_page_mask))) {
-        error_report("%s received unaligned region", __func__);
-        return;
-    }
-
-    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
-    llend = int128_make64(section->offset_within_address_space);
-    llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
-
-    if (int128_ge(int128_make64(iova), llend)) {
-        if (memory_region_is_ram_device(section->mr)) {
-            trace_vfio_listener_region_add_no_dma_map(
-                memory_region_name(section->mr),
-                section->offset_within_address_space,
-                int128_getlo(section->size),
-                qemu_real_host_page_size);
-        }
-        return;
-    }
-    end = int128_get64(int128_sub(llend, int128_one()));
-
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        hwaddr pgsize = 0;
-
-        /* For now intersections are not allowed, we may relax this later */
-        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-            if (ranges_overlap(hostwin->min_iova,
-                               hostwin->max_iova - hostwin->min_iova + 1,
-                               section->offset_within_address_space,
-                               int128_get64(section->size))) {
-                error_setg(&err,
-                    "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
-                    "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
-                    section->offset_within_address_space,
-                    section->offset_within_address_space +
-                        int128_get64(section->size) - 1,
-                    hostwin->min_iova, hostwin->max_iova);
-                goto fail;
-            }
-        }
-
-        ret = vfio_spapr_create_window(container, section, &pgsize);
-        if (ret) {
-            error_setg_errno(&err, -ret, "Failed to create SPAPR window");
-            goto fail;
-        }
-
-        vfio_host_win_add(container, section->offset_within_address_space,
-                          section->offset_within_address_space +
-                          int128_get64(section->size) - 1, pgsize);
-#ifdef CONFIG_KVM
-        if (kvm_enabled()) {
-            VFIOGroup *group;
-            IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
-            struct kvm_vfio_spapr_tce param;
-            struct kvm_device_attr attr = {
-                .group = KVM_DEV_VFIO_GROUP,
-                .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
-                .addr = (uint64_t)(unsigned long)&param,
-            };
-
-            if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
-                                              &param.tablefd)) {
-                QLIST_FOREACH(group, &container->group_list, container_next) {
-                    param.groupfd = group->fd;
-                    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-                        error_report("vfio: failed to setup fd %d "
-                                     "for a group with fd %d: %s",
-                                     param.tablefd, param.groupfd,
-                                     strerror(errno));
-                        return;
-                    }
-                    trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
-                }
-            }
-        }
-#endif
-    }
-
-    hostwin_found = false;
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-        if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
-            hostwin_found = true;
-            break;
-        }
-    }
-
-    if (!hostwin_found) {
-        error_setg(&err, "Container %p can't map guest IOVA region"
-                   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
-        goto fail;
-    }
-
-    memory_region_ref(section->mr);
-
-    if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
-        int iommu_idx;
-
-        trace_vfio_listener_region_add_iommu(iova, end);
-        /*
-         * FIXME: For VFIO iommu types which have KVM acceleration to
-         * avoid bouncing all map/unmaps through qemu this way, this
-         * would be the right place to wire that up (tell the KVM
-         * device emulation the VFIO iommu handles to use).
-         */
-        giommu = g_malloc0(sizeof(*giommu));
-        giommu->iommu_mr = iommu_mr;
-        giommu->iommu_offset = section->offset_within_address_space -
-                               section->offset_within_region;
-        giommu->container = container;
-        llend = int128_add(int128_make64(section->offset_within_region),
-                           section->size);
-        llend = int128_sub(llend, int128_one());
-        iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
-                                                       MEMTXATTRS_UNSPECIFIED);
-        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
-                            IOMMU_NOTIFIER_IOTLB_EVENTS,
-                            section->offset_within_region,
-                            int128_get64(llend),
-                            iommu_idx);
-
-        ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr,
-                                                     container->pgsizes,
-                                                     &err);
-        if (ret) {
-            g_free(giommu);
-            goto fail;
-        }
-
-        ret = memory_region_register_iommu_notifier(section->mr, &giommu->n,
-                                                    &err);
-        if (ret) {
-            g_free(giommu);
-            goto fail;
-        }
-        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
-        memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
-
-        return;
-    }
-
-    /* Here we assume that memory_region_is_ram(section->mr)==true */
-
-    /*
-     * For RAM memory regions with a RamDiscardManager, we only want to map the
-     * actually populated parts - and update the mapping whenever we're notified
-     * about changes.
-     */
-    if (memory_region_has_ram_discard_manager(section->mr)) {
-        vfio_register_ram_discard_listener(container, section);
-        return;
-    }
-
-    vaddr = memory_region_get_ram_ptr(section->mr) +
-            section->offset_within_region +
-            (iova - section->offset_within_address_space);
-
-    trace_vfio_listener_region_add_ram(iova, end, vaddr);
-
-    llsize = int128_sub(llend, int128_make64(iova));
-
-    if (memory_region_is_ram_device(section->mr)) {
-        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
-
-        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
-            trace_vfio_listener_region_add_no_dma_map(
-                memory_region_name(section->mr),
-                section->offset_within_address_space,
-                int128_getlo(section->size),
-                pgmask + 1);
-            return;
-        }
-    }
-
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
-                       vaddr, section->readonly);
-    if (ret) {
-        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                   container, iova, int128_get64(llsize), vaddr, ret);
-        if (memory_region_is_ram_device(section->mr)) {
-            /* Allow unexpected mappings not to be fatal for RAM devices */
-            error_report_err(err);
-            return;
-        }
-        goto fail;
-    }
-
-    return;
-
-fail:
-    if (memory_region_is_ram_device(section->mr)) {
-        error_report("failed to vfio_dma_map. pci p2p may not work");
-        return;
-    }
-    /*
-     * On the initfn path, store the first error in the container so we
-     * can gracefully fail.  Runtime, there's not much we can do other
-     * than throw a hardware error.
-     */
-    if (!container->initialized) {
-        if (!container->error) {
-            error_propagate_prepend(&container->error, err,
-                                    "Region %s: ",
-                                    memory_region_name(section->mr));
-        } else {
-            error_free(err);
-        }
-    } else {
-        error_report_err(err);
-        hw_error("vfio: DMA mapping failed, unable to continue");
-    }
-}
-
-static void vfio_listener_region_del(MemoryListener *listener,
-                                     MemoryRegionSection *section)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-    hwaddr iova, end;
-    Int128 llend, llsize;
-    int ret;
-    bool try_unmap = true;
-
-    if (vfio_listener_skipped_section(section)) {
-        trace_vfio_listener_region_del_skip(
-                section->offset_within_address_space,
-                section->offset_within_address_space +
-                int128_get64(int128_sub(section->size, int128_one())));
-        return;
-    }
-
-    if (unlikely((section->offset_within_address_space &
-                  ~qemu_real_host_page_mask) !=
-                 (section->offset_within_region & ~qemu_real_host_page_mask))) {
-        error_report("%s received unaligned region", __func__);
-        return;
-    }
-
-    if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-
-        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
-                giommu->n.start == section->offset_within_region) {
-                memory_region_unregister_iommu_notifier(section->mr,
-                                                        &giommu->n);
-                QLIST_REMOVE(giommu, giommu_next);
-                g_free(giommu);
-                break;
-            }
-        }
-
-        /*
-         * FIXME: We assume the one big unmap below is adequate to
-         * remove any individual page mappings in the IOMMU which
-         * might have been copied into VFIO. This works for a page table
-         * based IOMMU where a big unmap flattens a large range of IO-PTEs.
-         * That may not be true for all IOMMU types.
-         */
-    }
-
-    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
-    llend = int128_make64(section->offset_within_address_space);
-    llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
-
-    if (int128_ge(int128_make64(iova), llend)) {
-        return;
-    }
-    end = int128_get64(int128_sub(llend, int128_one()));
-
-    llsize = int128_sub(llend, int128_make64(iova));
-
-    trace_vfio_listener_region_del(iova, end);
-
-    if (memory_region_is_ram_device(section->mr)) {
-        hwaddr pgmask;
-        VFIOHostDMAWindow *hostwin;
-        bool hostwin_found = false;
-
-        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-            if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
-                hostwin_found = true;
-                break;
-            }
-        }
-        assert(hostwin_found); /* or region_add() would have failed */
-
-        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
-        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
-    } else if (memory_region_has_ram_discard_manager(section->mr)) {
-        vfio_unregister_ram_discard_listener(container, section);
-        /* Unregistering will trigger an unmap. */
-        try_unmap = false;
-    }
-
-    if (try_unmap) {
-        if (int128_eq(llsize, int128_2_64())) {
-            /* The unmap ioctl doesn't accept a full 64-bit span. */
-            llsize = int128_rshift(llsize, 1);
-            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
-            if (ret) {
-                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                             "0x%"HWADDR_PRIx") = %d (%m)",
-                             container, iova, int128_get64(llsize), ret);
-            }
-            iova += int128_get64(llsize);
-        }
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
-        if (ret) {
-            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova, int128_get64(llsize), ret);
-        }
-    }
-
-    memory_region_unref(section->mr);
-
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        vfio_spapr_remove_window(container,
-                                 section->offset_within_address_space);
-        if (vfio_host_win_del(container,
-                              section->offset_within_address_space,
-                              section->offset_within_address_space +
-                              int128_get64(section->size) - 1) < 0) {
-            hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
-                     __func__, section->offset_within_address_space);
-        }
-    }
-}
-
-static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
-{
-    int ret;
-    struct vfio_iommu_type1_dirty_bitmap dirty = {
-        .argsz = sizeof(dirty),
-    };
-
-    if (start) {
-        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
-    } else {
-        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
-    }
-
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
-    if (ret) {
-        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
-                     dirty.flags, errno);
-    }
-}
-
-static void vfio_listener_log_global_start(MemoryListener *listener)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-
-    vfio_set_dirty_page_tracking(container, true);
-}
-
-static void vfio_listener_log_global_stop(MemoryListener *listener)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-
-    vfio_set_dirty_page_tracking(container, false);
-}
-
-static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                                 uint64_t size, ram_addr_t ram_addr)
-{
-    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
-    struct vfio_iommu_type1_dirty_bitmap_get *range;
-    uint64_t pages;
-    int ret;
-
-    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
-
-    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
-    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
-    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
-    range->iova = iova;
-    range->size = size;
-
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty. Hence set bitmap's pgsize
-     * to qemu_real_host_page_size.
-     */
-    range->bitmap.pgsize = qemu_real_host_page_size;
-
-    pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size;
-    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
-                                         BITS_PER_BYTE;
-    range->bitmap.data = g_try_malloc0(range->bitmap.size);
-    if (!range->bitmap.data) {
-        ret = -ENOMEM;
-        goto err_out;
-    }
-
-    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
-    if (ret) {
-        error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
-                " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
-                (uint64_t)range->size, errno);
-        goto err_out;
-    }
-
-    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data,
-                                            ram_addr, pages);
-
-    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
-                                range->bitmap.size, ram_addr);
-err_out:
-    g_free(range->bitmap.data);
-    g_free(dbitmap);
-
-    return ret;
-}
-
-typedef struct {
-    IOMMUNotifier n;
-    VFIOGuestIOMMU *giommu;
-} vfio_giommu_dirty_notifier;
-
-static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
-{
-    vfio_giommu_dirty_notifier *gdn = container_of(n,
-                                                vfio_giommu_dirty_notifier, n);
-    VFIOGuestIOMMU *giommu = gdn->giommu;
-    VFIOContainer *container = giommu->container;
-    hwaddr iova = iotlb->iova + giommu->iommu_offset;
-    ram_addr_t translated_addr;
-
-    trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
-
-    if (iotlb->target_as != &address_space_memory) {
-        error_report("Wrong target AS \"%s\", only system memory is allowed",
-                     iotlb->target_as->name ? iotlb->target_as->name : "none");
-        return;
-    }
-
-    rcu_read_lock();
-    if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
-        int ret;
-
-        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
-                                    translated_addr);
-        if (ret) {
-            error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova,
-                         iotlb->addr_mask + 1, ret);
-        }
-    }
-    rcu_read_unlock();
-}
-
-static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
-                                             void *opaque)
-{
-    const hwaddr size = int128_get64(section->size);
-    const hwaddr iova = section->offset_within_address_space;
-    const ram_addr_t ram_addr = memory_region_get_ram_addr(section->mr) +
-                                section->offset_within_region;
-    VFIORamDiscardListener *vrdl = opaque;
-
-    /*
-     * Sync the whole mapped region (spanning multiple individual mappings)
-     * in one go.
-     */
-    return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr);
-}
-
-static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
-                                                   MemoryRegionSection *section)
-{
-    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
-    VFIORamDiscardListener *vrdl = NULL;
-
-    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
-        if (vrdl->mr == section->mr &&
-            vrdl->offset_within_address_space ==
-            section->offset_within_address_space) {
-            break;
-        }
-    }
-
-    if (!vrdl) {
-        hw_error("vfio: Trying to sync missing RAM discard listener");
-    }
-
-    /*
-     * We only want/can synchronize the bitmap for actually mapped parts -
-     * which correspond to populated parts. Replay all populated parts.
-     */
-    return ram_discard_manager_replay_populated(rdm, section,
-                                              vfio_ram_discard_get_dirty_bitmap,
-                                                &vrdl);
-}
-
-static int vfio_sync_dirty_bitmap(VFIOContainer *container,
-                                  MemoryRegionSection *section)
-{
-    ram_addr_t ram_addr;
-
-    if (memory_region_is_iommu(section->mr)) {
-        VFIOGuestIOMMU *giommu;
-
-        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
-                giommu->n.start == section->offset_within_region) {
-                Int128 llend;
-                vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
-                int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
-                                                       MEMTXATTRS_UNSPECIFIED);
-
-                llend = int128_add(int128_make64(section->offset_within_region),
-                                   section->size);
-                llend = int128_sub(llend, int128_one());
-
-                iommu_notifier_init(&gdn.n,
-                                    vfio_iommu_map_dirty_notify,
-                                    IOMMU_NOTIFIER_MAP,
-                                    section->offset_within_region,
-                                    int128_get64(llend),
-                                    idx);
-                memory_region_iommu_replay(giommu->iommu_mr, &gdn.n);
-                break;
-            }
-        }
-        return 0;
-    } else if (memory_region_has_ram_discard_manager(section->mr)) {
-        return vfio_sync_ram_discard_listener_dirty_bitmap(container, section);
-    }
-
-    ram_addr = memory_region_get_ram_addr(section->mr) +
-               section->offset_within_region;
-
-    return vfio_get_dirty_bitmap(container,
-                   REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
-                   int128_get64(section->size), ram_addr);
-}
-
-static void vfio_listener_log_sync(MemoryListener *listener,
-        MemoryRegionSection *section)
-{
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-
-    if (vfio_listener_skipped_section(section) ||
-        !container->dirty_pages_supported) {
-        return;
-    }
-
-    if (vfio_devices_all_dirty_tracking(container)) {
-        vfio_sync_dirty_bitmap(container, section);
-    }
-}
-
-static const MemoryListener vfio_memory_listener = {
-    .name = "vfio",
-    .region_add = vfio_listener_region_add,
-    .region_del = vfio_listener_region_del,
-    .log_global_start = vfio_listener_log_global_start,
-    .log_global_stop = vfio_listener_log_global_stop,
-    .log_sync = vfio_listener_log_sync,
-};
-
-static void vfio_listener_release(VFIOContainer *container)
-{
-    memory_listener_unregister(&container->listener);
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-        memory_listener_unregister(&container->prereg_listener);
-    }
-}
-
-static struct vfio_info_cap_header *
-vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
-{
-    struct vfio_info_cap_header *hdr;
-
-    for (hdr = ptr + cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
-        if (hdr->id == id) {
-            return hdr;
-        }
-    }
-
-    return NULL;
-}
-
-struct vfio_info_cap_header *
-vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
-static struct vfio_info_cap_header *
-vfio_get_iommu_type1_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
-struct vfio_info_cap_header *
-vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id)
-{
-    if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS)) {
-        return NULL;
-    }
-
-    return vfio_get_cap((void *)info, info->cap_offset, id);
-}
-
-bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
-                             unsigned int *avail)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_iommu_type1_info_dma_avail *cap;
-
-    /* If the capability cannot be found, assume no DMA limiting */
-    hdr = vfio_get_iommu_type1_info_cap(info,
-                                        VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL);
-    if (hdr == NULL) {
-        return false;
-    }
-
-    if (avail != NULL) {
-        cap = (void *) hdr;
-        *avail = cap->avail;
-    }
-
-    return true;
-}
-
-static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
-                                          struct vfio_region_info *info)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_region_info_cap_sparse_mmap *sparse;
-    int i, j;
-
-    hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
-    if (!hdr) {
-        return -ENODEV;
-    }
-
-    sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header);
-
-    trace_vfio_region_sparse_mmap_header(region->vbasedev->name,
-                                         region->nr, sparse->nr_areas);
-
-    region->mmaps = g_new0(VFIOMmap, sparse->nr_areas);
-
-    for (i = 0, j = 0; i < sparse->nr_areas; i++) {
-        trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
-                                            sparse->areas[i].offset +
-                                            sparse->areas[i].size);
-
-        if (sparse->areas[i].size) {
-            region->mmaps[j].offset = sparse->areas[i].offset;
-            region->mmaps[j].size = sparse->areas[i].size;
-            j++;
-        }
-    }
-
-    region->nr_mmaps = j;
-    region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap));
-
-    return 0;
-}
-
-int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
-                      int index, const char *name)
-{
-    struct vfio_region_info *info;
-    int ret;
-
-    ret = vfio_get_region_info(vbasedev, index, &info);
-    if (ret) {
-        return ret;
-    }
-
-    region->vbasedev = vbasedev;
-    region->flags = info->flags;
-    region->size = info->size;
-    region->fd_offset = info->offset;
-    region->nr = index;
-
-    if (region->size) {
-        region->mem = g_new0(MemoryRegion, 1);
-        memory_region_init_io(region->mem, obj, &vfio_region_ops,
-                              region, name, region->size);
-
-        if (!vbasedev->no_mmap &&
-            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
-
-            ret = vfio_setup_region_sparse_mmaps(region, info);
-
-            if (ret) {
-                region->nr_mmaps = 1;
-                region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
-                region->mmaps[0].offset = 0;
-                region->mmaps[0].size = region->size;
-            }
-        }
-    }
-
-    g_free(info);
-
-    trace_vfio_region_setup(vbasedev->name, index, name,
-                            region->flags, region->fd_offset, region->size);
-    return 0;
-}
-
-static void vfio_subregion_unmap(VFIORegion *region, int index)
-{
-    trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
-                            region->mmaps[index].offset,
-                            region->mmaps[index].offset +
-                            region->mmaps[index].size - 1);
-    memory_region_del_subregion(region->mem, &region->mmaps[index].mem);
-    munmap(region->mmaps[index].mmap, region->mmaps[index].size);
-    object_unparent(OBJECT(&region->mmaps[index].mem));
-    region->mmaps[index].mmap = NULL;
-}
-
-int vfio_region_mmap(VFIORegion *region)
-{
-    int i, prot = 0;
-    char *name;
-
-    if (!region->mem) {
-        return 0;
-    }
-
-    prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
-    prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
-                                     MAP_SHARED, region->vbasedev->fd,
-                                     region->fd_offset +
-                                     region->mmaps[i].offset);
-        if (region->mmaps[i].mmap == MAP_FAILED) {
-            int ret = -errno;
-
-            trace_vfio_region_mmap_fault(memory_region_name(region->mem), i,
-                                         region->fd_offset +
-                                         region->mmaps[i].offset,
-                                         region->fd_offset +
-                                         region->mmaps[i].offset +
-                                         region->mmaps[i].size - 1, ret);
-
-            region->mmaps[i].mmap = NULL;
-
-            for (i--; i >= 0; i--) {
-                vfio_subregion_unmap(region, i);
-            }
-
-            return ret;
-        }
-
-        name = g_strdup_printf("%s mmaps[%d]",
-                               memory_region_name(region->mem), i);
-        memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
-                                          memory_region_owner(region->mem),
-                                          name, region->mmaps[i].size,
-                                          region->mmaps[i].mmap);
-        g_free(name);
-        memory_region_add_subregion(region->mem, region->mmaps[i].offset,
-                                    &region->mmaps[i].mem);
-
-        trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
-                               region->mmaps[i].offset,
-                               region->mmaps[i].offset +
-                               region->mmaps[i].size - 1);
-    }
-
-    return 0;
-}
-
-void vfio_region_unmap(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            vfio_subregion_unmap(region, i);
-        }
-    }
-}
-
-void vfio_region_exit(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
-        }
-    }
-
-    trace_vfio_region_exit(region->vbasedev->name, region->nr);
-}
-
-void vfio_region_finalize(VFIORegion *region)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            munmap(region->mmaps[i].mmap, region->mmaps[i].size);
-            object_unparent(OBJECT(&region->mmaps[i].mem));
-        }
-    }
-
-    object_unparent(OBJECT(region->mem));
-
-    g_free(region->mem);
-    g_free(region->mmaps);
-
-    trace_vfio_region_finalize(region->vbasedev->name, region->nr);
-
-    region->mem = NULL;
-    region->mmaps = NULL;
-    region->nr_mmaps = 0;
-    region->size = 0;
-    region->flags = 0;
-    region->nr = 0;
-}
-
-void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled)
-{
-    int i;
-
-    if (!region->mem) {
-        return;
-    }
-
-    for (i = 0; i < region->nr_mmaps; i++) {
-        if (region->mmaps[i].mmap) {
-            memory_region_set_enabled(&region->mmaps[i].mem, enabled);
-        }
-    }
-
-    trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem),
-                                        enabled);
-}
-
-void vfio_reset_handler(void *opaque)
-{
-    VFIOGroup *group;
-    VFIODevice *vbasedev;
-
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->dev->realized) {
-                vbasedev->ops->vfio_compute_needs_reset(vbasedev);
-            }
-        }
-    }
-
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        QLIST_FOREACH(vbasedev, &group->device_list, next) {
-            if (vbasedev->dev->realized && vbasedev->needs_reset) {
-                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
-            }
-        }
-    }
-}
-
-static void vfio_kvm_device_add_group(VFIOGroup *group)
-{
-#ifdef CONFIG_KVM
-    struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_ADD,
-        .addr = (uint64_t)(unsigned long)&group->fd,
-    };
-
-    if (!kvm_enabled()) {
-        return;
-    }
-
-    if (vfio_kvm_device_fd < 0) {
-        struct kvm_create_device cd = {
-            .type = KVM_DEV_TYPE_VFIO,
-        };
-
-        if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
-            error_report("Failed to create KVM VFIO device: %m");
-            return;
-        }
-
-        vfio_kvm_device_fd = cd.fd;
-    }
-
-    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to add group %d to KVM VFIO device: %m",
-                     group->groupid);
-    }
-#endif
-}
-
-static void vfio_kvm_device_del_group(VFIOGroup *group)
-{
-#ifdef CONFIG_KVM
-    struct kvm_device_attr attr = {
-        .group = KVM_DEV_VFIO_GROUP,
-        .attr = KVM_DEV_VFIO_GROUP_DEL,
-        .addr = (uint64_t)(unsigned long)&group->fd,
-    };
-
-    if (vfio_kvm_device_fd < 0) {
-        return;
-    }
-
-    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
-        error_report("Failed to remove group %d from KVM VFIO device: %m",
-                     group->groupid);
-    }
-#endif
-}
-
-static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
-{
-    VFIOAddressSpace *space;
-
-    QLIST_FOREACH(space, &vfio_address_spaces, list) {
-        if (space->as == as) {
-            return space;
-        }
-    }
-
-    /* No suitable VFIOAddressSpace, create a new one */
-    space = g_malloc0(sizeof(*space));
-    space->as = as;
-    QLIST_INIT(&space->containers);
-
-    QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
-
-    return space;
-}
-
-static void vfio_put_address_space(VFIOAddressSpace *space)
-{
-    if (QLIST_EMPTY(&space->containers)) {
-        QLIST_REMOVE(space, list);
-        g_free(space);
-    }
-}
-
-/*
- * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
- */
-static int vfio_get_iommu_type(VFIOContainer *container,
-                               Error **errp)
-{
-    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
-                          VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
-    int i;
-
-    for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
-        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
-            return iommu_types[i];
-        }
-    }
-    error_setg(errp, "No available IOMMU models");
-    return -EINVAL;
-}
-
-static int vfio_init_container(VFIOContainer *container, int group_fd,
-                               Error **errp)
-{
-    int iommu_type, ret;
-
-    iommu_type = vfio_get_iommu_type(container, errp);
-    if (iommu_type < 0) {
-        return iommu_type;
-    }
-
-    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
-    if (ret) {
-        error_setg_errno(errp, errno, "Failed to set group container");
-        return -errno;
-    }
-
-    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
-        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-            /*
-             * On sPAPR, despite the IOMMU subdriver always advertises v1 and
-             * v2, the running platform may not support v2 and there is no
-             * way to guess it until an IOMMU group gets added to the container.
-             * So in case it fails with v2, try v1 as a fallback.
-             */
-            iommu_type = VFIO_SPAPR_TCE_IOMMU;
-            continue;
-        }
-        error_setg_errno(errp, errno, "Failed to set iommu for container");
-        return -errno;
-    }
-
-    container->iommu_type = iommu_type;
-    return 0;
-}
-
-static int vfio_get_iommu_info(VFIOContainer *container,
-                               struct vfio_iommu_type1_info **info)
-{
-
-    size_t argsz = sizeof(struct vfio_iommu_type1_info);
-
-    *info = g_new0(struct vfio_iommu_type1_info, 1);
-again:
-    (*info)->argsz = argsz;
-
-    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
-        g_free(*info);
-        *info = NULL;
-        return -errno;
-    }
-
-    if (((*info)->argsz > argsz)) {
-        argsz = (*info)->argsz;
-        *info = g_realloc(*info, argsz);
-        goto again;
-    }
-
-    return 0;
-}
-
-static struct vfio_info_cap_header *
-vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
-{
-    struct vfio_info_cap_header *hdr;
-    void *ptr = info;
-
-    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
-        return NULL;
-    }
-
-    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
-        if (hdr->id == id) {
-            return hdr;
-        }
-    }
-
-    return NULL;
-}
-
-static void vfio_get_iommu_info_migration(VFIOContainer *container,
-                                         struct vfio_iommu_type1_info *info)
-{
-    struct vfio_info_cap_header *hdr;
-    struct vfio_iommu_type1_info_cap_migration *cap_mig;
-
-    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
-    if (!hdr) {
+    if (!region->mem) {
         return;
     }
 
-    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
-                            header);
-
-    /*
-     * cpu_physical_memory_set_dirty_lebitmap() supports pages in bitmap of
-     * qemu_real_host_page_size to mark those dirty.
-     */
-    if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) {
-        container->dirty_pages_supported = true;
-        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
-        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
-    }
-}
-
-static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
-                                  Error **errp)
-{
-    VFIOContainer *container;
-    int ret, fd;
-    VFIOAddressSpace *space;
-
-    space = vfio_get_address_space(as);
-
-    /*
-     * VFIO is currently incompatible with discarding of RAM insofar as the
-     * madvise to purge (zap) the page from QEMU's address space does not
-     * interact with the memory API and therefore leaves stale virtual to
-     * physical mappings in the IOMMU if the page was previously pinned.  We
-     * therefore set discarding broken for each group added to a container,
-     * whether the container is used individually or shared.  This provides
-     * us with options to allow devices within a group to opt-in and allow
-     * discarding, so long as it is done consistently for a group (for instance
-     * if the device is an mdev device where it is known that the host vendor
-     * driver will never pin pages outside of the working set of the guest
-     * driver, which would thus not be discarding candidates).
-     *
-     * The first opportunity to induce pinning occurs here where we attempt to
-     * attach the group to existing containers within the AddressSpace.  If any
-     * pages are already zapped from the virtual address space, such as from
-     * previous discards, new pinning will cause valid mappings to be
-     * re-established.  Likewise, when the overall MemoryListener for a new
-     * container is registered, a replay of mappings within the AddressSpace
-     * will occur, re-establishing any previously zapped pages as well.
-     *
-     * Especially virtio-balloon is currently only prevented from discarding
-     * new memory, it will not yet set ram_block_discard_set_required() and
-     * therefore, neither stops us here or deals with the sudden memory
-     * consumption of inflated memory.
-     *
-     * We do support discarding of memory coordinated via the RamDiscardManager
-     * with some IOMMU types. vfio_ram_block_discard_disable() handles the
-     * details once we know which type of IOMMU we are using.
-     */
-
-    QLIST_FOREACH(container, &space->containers, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            ret = vfio_ram_block_discard_disable(container, true);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "Cannot set discarding of RAM broken");
-                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
-                          &container->fd)) {
-                    error_report("vfio: error disconnecting group %d from"
-                                 " container", group->groupid);
-                }
-                return ret;
-            }
-            group->container = container;
-            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-            vfio_kvm_device_add_group(group);
-            return 0;
-        }
-    }
-
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
-    if (fd < 0) {
-        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
-        ret = -errno;
-        goto put_space_exit;
-    }
-
-    ret = ioctl(fd, VFIO_GET_API_VERSION);
-    if (ret != VFIO_API_VERSION) {
-        error_setg(errp, "supported vfio version: %d, "
-                   "reported version: %d", VFIO_API_VERSION, ret);
-        ret = -EINVAL;
-        goto close_fd_exit;
-    }
-
-    container = g_malloc0(sizeof(*container));
-    container->space = space;
-    container->fd = fd;
-    container->error = NULL;
-    container->dirty_pages_supported = false;
-    container->dma_max_mappings = 0;
-    QLIST_INIT(&container->giommu_list);
-    QLIST_INIT(&container->hostwin_list);
-    QLIST_INIT(&container->vrdl_list);
-
-    ret = vfio_init_container(container, group->fd, errp);
-    if (ret) {
-        goto free_container_exit;
-    }
-
-    ret = vfio_ram_block_discard_disable(container, true);
-    if (ret) {
-        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
-        goto free_container_exit;
-    }
-
-    switch (container->iommu_type) {
-    case VFIO_TYPE1v2_IOMMU:
-    case VFIO_TYPE1_IOMMU:
-    {
-        struct vfio_iommu_type1_info *info;
-
-        /*
-         * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
-         * IOVA whatsoever.  That's not actually true, but the current
-         * kernel interface doesn't tell us what it can map, and the
-         * existing Type1 IOMMUs generally support any IOVA we're
-         * going to actually try in practice.
-         */
-        ret = vfio_get_iommu_info(container, &info);
-
-        if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
-            /* Assume 4k IOVA page size */
-            info->iova_pgsizes = 4096;
-        }
-        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
-        container->pgsizes = info->iova_pgsizes;
-
-        /* The default in the kernel ("dma_entry_limit") is 65535. */
-        container->dma_max_mappings = 65535;
-        if (!ret) {
-            vfio_get_info_dma_avail(info, &container->dma_max_mappings);
-            vfio_get_iommu_info_migration(container, info);
-        }
-        g_free(info);
-        break;
-    }
-    case VFIO_SPAPR_TCE_v2_IOMMU:
-    case VFIO_SPAPR_TCE_IOMMU:
-    {
-        struct vfio_iommu_spapr_tce_info info;
-        bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU;
-
-        /*
-         * The host kernel code implementing VFIO_IOMMU_DISABLE is called
-         * when container fd is closed so we do not call it explicitly
-         * in this file.
-         */
-        if (!v2) {
-            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-            if (ret) {
-                error_setg_errno(errp, errno, "failed to enable container");
-                ret = -errno;
-                goto enable_discards_exit;
-            }
-        } else {
-            container->prereg_listener = vfio_prereg_listener;
-
-            memory_listener_register(&container->prereg_listener,
-                                     &address_space_memory);
-            if (container->error) {
-                memory_listener_unregister(&container->prereg_listener);
-                ret = -1;
-                error_propagate_prepend(errp, container->error,
-                    "RAM memory listener initialization failed: ");
-                goto enable_discards_exit;
-            }
-        }
-
-        info.argsz = sizeof(info);
-        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
-        if (ret) {
-            error_setg_errno(errp, errno,
-                             "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
-            ret = -errno;
-            if (v2) {
-                memory_listener_unregister(&container->prereg_listener);
-            }
-            goto enable_discards_exit;
-        }
-
-        if (v2) {
-            container->pgsizes = info.ddw.pgsizes;
-            /*
-             * There is a default window in just created container.
-             * To make region_add/del simpler, we better remove this
-             * window now and let those iommu_listener callbacks
-             * create/remove them when needed.
-             */
-            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "failed to remove existing window");
-                goto enable_discards_exit;
-            }
-        } else {
-            /* The default table uses 4K pages */
-            container->pgsizes = 0x1000;
-            vfio_host_win_add(container, info.dma32_window_start,
-                              info.dma32_window_start +
-                              info.dma32_window_size - 1,
-                              0x1000);
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            munmap(region->mmaps[i].mmap, region->mmaps[i].size);
+            object_unparent(OBJECT(&region->mmaps[i].mem));
         }
     }
-    }
-
-    vfio_kvm_device_add_group(group);
-
-    QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&space->containers, container, next);
-
-    group->container = container;
-    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-
-    container->listener = vfio_memory_listener;
-
-    memory_listener_register(&container->listener, container->space->as);
-
-    if (container->error) {
-        ret = -1;
-        error_propagate_prepend(errp, container->error,
-            "memory listener initialization failed: ");
-        goto listener_release_exit;
-    }
-
-    container->initialized = true;
-
-    return 0;
-listener_release_exit:
-    QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(container, next);
-    vfio_kvm_device_del_group(group);
-    vfio_listener_release(container);
-
-enable_discards_exit:
-    vfio_ram_block_discard_disable(container, false);
-
-free_container_exit:
-    g_free(container);
-
-close_fd_exit:
-    close(fd);
-
-put_space_exit:
-    vfio_put_address_space(space);
-
-    return ret;
-}
-
-static void vfio_disconnect_container(VFIOGroup *group)
-{
-    VFIOContainer *container = group->container;
-
-    QLIST_REMOVE(group, container_next);
-    group->container = NULL;
-
-    /*
-     * Explicitly release the listener first before unset container,
-     * since unset may destroy the backend container if it's the last
-     * group.
-     */
-    if (QLIST_EMPTY(&container->group_list)) {
-        vfio_listener_release(container);
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
-        error_report("vfio: error disconnecting group %d from container",
-                     group->groupid);
-    }
-
-    if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = container->space;
-        VFIOGuestIOMMU *giommu, *tmp;
-        VFIOHostDMAWindow *hostwin, *next;
 
-        QLIST_REMOVE(container, next);
-
-        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(
-                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
-            QLIST_REMOVE(giommu, giommu_next);
-            g_free(giommu);
-        }
+    object_unparent(OBJECT(region->mem));
 
-        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
-                           next) {
-            QLIST_REMOVE(hostwin, hostwin_next);
-            g_free(hostwin);
-        }
+    g_free(region->mem);
+    g_free(region->mmaps);
 
-        trace_vfio_disconnect_container(container->fd);
-        close(container->fd);
-        g_free(container);
+    trace_vfio_region_finalize(region->vbasedev->name, region->nr);
 
-        vfio_put_address_space(space);
-    }
+    region->mem = NULL;
+    region->mmaps = NULL;
+    region->nr_mmaps = 0;
+    region->size = 0;
+    region->flags = 0;
+    region->nr = 0;
 }
 
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled)
 {
-    VFIOGroup *group;
-    char path[32];
-    struct vfio_group_status status = { .argsz = sizeof(status) };
-
-    QLIST_FOREACH(group, &vfio_group_list, next) {
-        if (group->groupid == groupid) {
-            /* Found it.  Now is it already in the right context? */
-            if (group->container->space->as == as) {
-                return group;
-            } else {
-                error_setg(errp, "group %d used in multiple address spaces",
-                           group->groupid);
-                return NULL;
-            }
-        }
-    }
-
-    group = g_malloc0(sizeof(*group));
-
-    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
-    if (group->fd < 0) {
-        error_setg_errno(errp, errno, "failed to open %s", path);
-        goto free_group_exit;
-    }
-
-    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
-        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
-        goto close_fd_exit;
-    }
-
-    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
-        error_setg(errp, "group %d is not viable", groupid);
-        error_append_hint(errp,
-                          "Please ensure all devices within the iommu_group "
-                          "are bound to their vfio bus driver.\n");
-        goto close_fd_exit;
-    }
-
-    group->groupid = groupid;
-    QLIST_INIT(&group->device_list);
-
-    if (vfio_connect_container(group, as, errp)) {
-        error_prepend(errp, "failed to setup container for group %d: ",
-                      groupid);
-        goto close_fd_exit;
-    }
-
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_register_reset(vfio_reset_handler, NULL);
-    }
-
-    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
-
-    return group;
-
-close_fd_exit:
-    close(group->fd);
-
-free_group_exit:
-    g_free(group);
-
-    return NULL;
-}
+    int i;
 
-void vfio_put_group(VFIOGroup *group)
-{
-    if (!group || !QLIST_EMPTY(&group->device_list)) {
+    if (!region->mem) {
         return;
     }
 
-    if (!group->ram_block_discard_allowed) {
-        vfio_ram_block_discard_disable(group->container, false);
-    }
-    vfio_kvm_device_del_group(group);
-    vfio_disconnect_container(group);
-    QLIST_REMOVE(group, next);
-    trace_vfio_put_group(group->fd);
-    close(group->fd);
-    g_free(group);
-
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_unregister_reset(vfio_reset_handler, NULL);
-    }
-}
-
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp)
-{
-    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
-    int ret, fd;
-
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
-    if (fd < 0) {
-        error_setg_errno(errp, errno, "error getting device from group %d",
-                         group->groupid);
-        error_append_hint(errp,
-                      "Verify all devices in group %d are bound to vfio-<bus> "
-                      "or pci-stub and not already in use\n", group->groupid);
-        return fd;
-    }
-
-    ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info);
-    if (ret) {
-        error_setg_errno(errp, errno, "error getting device info");
-        close(fd);
-        return ret;
-    }
-
-    /*
-     * Set discarding of RAM as not broken for this group if the driver knows
-     * the device operates compatibly with discarding.  Setting must be
-     * consistent per group, but since compatibility is really only possible
-     * with mdev currently, we expect singleton groups.
-     */
-    if (vbasedev->ram_block_discard_allowed !=
-        group->ram_block_discard_allowed) {
-        if (!QLIST_EMPTY(&group->device_list)) {
-            error_setg(errp, "Inconsistent setting of support for discarding "
-                       "RAM (e.g., balloon) within group");
-            close(fd);
-            return -1;
-        }
-
-        if (!group->ram_block_discard_allowed) {
-            group->ram_block_discard_allowed = true;
-            vfio_ram_block_discard_disable(group->container, false);
+    for (i = 0; i < region->nr_mmaps; i++) {
+        if (region->mmaps[i].mmap) {
+            memory_region_set_enabled(&region->mmaps[i].mem, enabled);
         }
     }
 
-    vbasedev->fd = fd;
-    vbasedev->group = group;
-    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
-
-    vbasedev->num_irqs = dev_info.num_irqs;
-    vbasedev->num_regions = dev_info.num_regions;
-    vbasedev->flags = dev_info.flags;
-
-    trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
-                          dev_info.num_irqs);
-
-    vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
-    return 0;
-}
-
-void vfio_put_base_device(VFIODevice *vbasedev)
-{
-    if (!vbasedev->group) {
-        return;
-    }
-    QLIST_REMOVE(vbasedev, next);
-    vbasedev->group = NULL;
-    trace_vfio_put_base_device(vbasedev->fd);
-    close(vbasedev->fd);
+    trace_vfio_region_mmaps_set_enabled(memory_region_name(region->mem),
+                                        enabled);
 }
 
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
@@ -2499,98 +628,3 @@ bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
 
     return ret;
 }
-
-/*
- * Interfaces for IBM EEH (Enhanced Error Handling)
- */
-static bool vfio_eeh_container_ok(VFIOContainer *container)
-{
-    /*
-     * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
-     * implementation is broken if there are multiple groups in a
-     * container.  The hardware works in units of Partitionable
-     * Endpoints (== IOMMU groups) and the EEH operations naively
-     * iterate across all groups in the container, without any logic
-     * to make sure the groups have their state synchronized.  For
-     * certain operations (ENABLE) that might be ok, until an error
-     * occurs, but for others (GET_STATE) it's clearly broken.
-     */
-
-    /*
-     * XXX Once fixed kernels exist, test for them here
-     */
-
-    if (QLIST_EMPTY(&container->group_list)) {
-        return false;
-    }
-
-    if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) {
-        return false;
-    }
-
-    return true;
-}
-
-static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
-{
-    struct vfio_eeh_pe_op pe_op = {
-        .argsz = sizeof(pe_op),
-        .op = op,
-    };
-    int ret;
-
-    if (!vfio_eeh_container_ok(container)) {
-        error_report("vfio/eeh: EEH_PE_OP 0x%x: "
-                     "kernel requires a container with exactly one group", op);
-        return -EPERM;
-    }
-
-    ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op);
-    if (ret < 0) {
-        error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op);
-        return -errno;
-    }
-
-    return ret;
-}
-
-static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
-{
-    VFIOAddressSpace *space = vfio_get_address_space(as);
-    VFIOContainer *container = NULL;
-
-    if (QLIST_EMPTY(&space->containers)) {
-        /* No containers to act on */
-        goto out;
-    }
-
-    container = QLIST_FIRST(&space->containers);
-
-    if (QLIST_NEXT(container, next)) {
-        /* We don't yet have logic to synchronize EEH state across
-         * multiple containers */
-        container = NULL;
-        goto out;
-    }
-
-out:
-    vfio_put_address_space(space);
-    return container;
-}
-
-bool vfio_eeh_as_ok(AddressSpace *as)
-{
-    VFIOContainer *container = vfio_eeh_as_container(as);
-
-    return (container != NULL) && vfio_eeh_container_ok(container);
-}
-
-int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
-{
-    VFIOContainer *container = vfio_eeh_as_container(as);
-
-    if (!container) {
-        return -ENODEV;
-    }
-    return vfio_eeh_container_op(container, op);
-}
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
new file mode 100644
index 0000000000..9c665c1720
--- /dev/null
+++ b/hw/vfio/container.c
@@ -0,0 +1,1193 @@
+/*
+ * generic functions used by VFIO devices
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#ifdef CONFIG_KVM
+#include <linux/kvm.h>
+#endif
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "exec/ram_addr.h"
+#include "hw/hw.h"
+#include "qemu/error-report.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/reset.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "migration/migration.h"
+
+#ifdef CONFIG_KVM
+/*
+ * We have a single VFIO pseudo device per KVM VM.  Once created it lives
+ * for the life of the VM.  Closing the file descriptor only drops our
+ * reference to it and the device's reference to kvm.  Therefore once
+ * initialized, this file descriptor is only released on QEMU exit and
+ * we'll re-use it should another vfio device be attached before then.
+ */
+static int vfio_kvm_device_fd = -1;
+#endif
+
+VFIOGroupList vfio_group_list =
+    QLIST_HEAD_INITIALIZER(vfio_group_list);
+
+/*
+ * Device state interfaces
+ */
+
+bool vfio_mig_active(void)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->migration_blocker) {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+    MigrationState *ms = migrate_get_current();
+
+    if (!migration_is_setup_or_active(ms->state)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            VFIOMigration *migration = vbasedev->migration;
+
+            if (!migration) {
+                return false;
+            }
+
+            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
+                && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+    MigrationState *ms = migrate_get_current();
+
+    if (!migration_is_setup_or_active(ms->state)) {
+        return false;
+    }
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            VFIOMigration *migration = vbasedev->migration;
+
+            if (!migration) {
+                return false;
+            }
+
+            if ((migration->device_state & VFIO_DEVICE_STATE_SAVING) &&
+                (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
+                continue;
+            } else {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
+static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+                                 hwaddr iova, ram_addr_t size,
+                                 IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
+    int ret;
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    unmap->iova = iova;
+    unmap->size = size;
+    unmap->flags |= VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects the dirty bitmap to
+     * track pages of qemu_real_host_page_size, so set bitmap_pgsize to
+     * qemu_real_host_page_size.
+     */
+
+    bitmap->pgsize = qemu_real_host_page_size;
+    bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                   BITS_PER_BYTE;
+
+    if (bitmap->size > container->max_dirty_bitmap_size) {
+        error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
+                     (uint64_t)bitmap->size);
+        ret = -E2BIG;
+        goto unmap_exit;
+    }
+
+    bitmap->data = g_try_malloc0(bitmap->size);
+    if (!bitmap->data) {
+        ret = -ENOMEM;
+        goto unmap_exit;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+    if (!ret) {
+        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
+                iotlb->translated_addr, pages);
+    } else {
+        error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+    }
+
+    g_free(bitmap->data);
+unmap_exit:
+    g_free(unmap);
+    return ret;
+}
+
+/*
+ * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
+ */
+int vfio_dma_unmap(VFIOContainer *container,
+                   hwaddr iova, ram_addr_t size,
+                   IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (iotlb && container->dirty_pages_supported &&
+        vfio_devices_all_running_and_saving(container)) {
+        return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
+    }
+
+    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        /*
+         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
+         * v4.15) where an overflow in its wrap-around check prevents us from
+         * unmapping the last page of the address space.  Test for the error
+         * condition and re-try the unmap excluding the last page.  The
+         * expectation is that we've never mapped the last page anyway and this
+         * unmap request comes via vIOMMU support which also makes it unlikely
+         * that this page is used.  This bug was introduced well after type1 v2
+         * support was introduced, so we shouldn't need to test for v1.  A fix
+         * is queued for kernel v5.0 so this workaround can be removed once
+         * affected kernels are sufficiently deprecated.
+         */
+        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
+            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
+            trace_vfio_dma_unmap_overflow_workaround();
+            unmap.size -= 1ULL << ctz64(container->pgsizes);
+            continue;
+        }
+        error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
+        return -errno;
+    }
+
+    return 0;
+}
+
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly)
+{
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_READ,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (!readonly) {
+        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+    }
+
+    /*
+     * Try the mapping, if it fails with EBUSY, unmap the region and try
+     * again.  This shouldn't be necessary, but we sometimes see it in
+     * the VGA ROM space.
+     */
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
+        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
+         ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
+        return 0;
+    }
+
+    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+    return -errno;
+}
+
+void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
+{
+    int ret;
+    struct vfio_iommu_type1_dirty_bitmap dirty = {
+        .argsz = sizeof(dirty),
+    };
+
+    if (start) {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
+    } else {
+        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+    if (ret) {
+        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
+                     dirty.flags, errno);
+    }
+}
+
+int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                          uint64_t size, ram_addr_t ram_addr)
+{
+    struct vfio_iommu_type1_dirty_bitmap *dbitmap;
+    struct vfio_iommu_type1_dirty_bitmap_get *range;
+    uint64_t pages;
+    int ret;
+
+    dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
+
+    dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
+    dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
+    range = (struct vfio_iommu_type1_dirty_bitmap_get *)&dbitmap->data;
+    range->iova = iova;
+    range->size = size;
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects the dirty bitmap to
+     * track pages of qemu_real_host_page_size, so set the bitmap's pgsize
+     * to qemu_real_host_page_size.
+     */
+    range->bitmap.pgsize = qemu_real_host_page_size;
+
+    pages = REAL_HOST_PAGE_ALIGN(range->size) / qemu_real_host_page_size;
+    range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
+                                         BITS_PER_BYTE;
+    range->bitmap.data = g_try_malloc0(range->bitmap.size);
+    if (!range->bitmap.data) {
+        ret = -ENOMEM;
+        goto err_out;
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+    if (ret) {
+        error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
+                " size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
+                (uint64_t)range->size, errno);
+        goto err_out;
+    }
+
+    cpu_physical_memory_set_dirty_lebitmap((unsigned long *)range->bitmap.data,
+                                            ram_addr, pages);
+
+    trace_vfio_get_dirty_bitmap(container->fd, range->iova, range->size,
+                                range->bitmap.size, ram_addr);
+err_out:
+    g_free(range->bitmap.data);
+    g_free(dbitmap);
+
+    return ret;
+}
+
+static void vfio_listener_release(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->listener);
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->prereg_listener);
+    }
+}
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp)
+{
+    VFIOHostDMAWindow *hostwin;
+    hwaddr pgsize = 0;
+    int ret;
+
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return 0;
+    }
+
+    /* For now intersections are not allowed, we may relax this later */
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (ranges_overlap(hostwin->min_iova,
+                           hostwin->max_iova - hostwin->min_iova + 1,
+                           section->offset_within_address_space,
+                           int128_get64(section->size))) {
+            error_setg(errp,
+                    "region [0x%"PRIx64",0x%"PRIx64"] overlaps with existing"
+                    "host DMA window [0x%"PRIx64",0x%"PRIx64"]",
+                    section->offset_within_address_space,
+                    section->offset_within_address_space +
+                        int128_get64(section->size) - 1,
+                    hostwin->min_iova, hostwin->max_iova);
+            return -1;
+        }
+    }
+
+    ret = vfio_spapr_create_window(container, section, &pgsize);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Failed to create SPAPR window");
+        return ret;
+    }
+
+    vfio_host_win_add(container, section->offset_within_address_space,
+                      section->offset_within_address_space +
+                      int128_get64(section->size) - 1, pgsize);
+#ifdef CONFIG_KVM
+    if (kvm_enabled()) {
+        VFIOGroup *group;
+        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
+        struct kvm_vfio_spapr_tce param;
+        struct kvm_device_attr attr = {
+                .group = KVM_DEV_VFIO_GROUP,
+                .attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE,
+                .addr = (uint64_t)(unsigned long)&param,
+        };
+
+        if (!memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_SPAPR_TCE_FD,
+                                          &param.tablefd)) {
+            QLIST_FOREACH(group, &container->group_list, container_next) {
+                param.groupfd = group->fd;
+                if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+                    error_report("vfio: failed to setup fd %d "
+                                 "for a group with fd %d: %s",
+                                 param.tablefd, param.groupfd,
+                                 strerror(errno));
+                    return -1;
+                }
+                trace_vfio_spapr_group_attach(param.groupfd, param.tablefd);
+            }
+        }
+    }
+#endif
+    return 0;
+}
+
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
+        return;
+    }
+
+    vfio_spapr_remove_window(container,
+                             section->offset_within_address_space);
+    if (vfio_host_win_del(container,
+                          section->offset_within_address_space,
+                          section->offset_within_address_space +
+                          int128_get64(section->size) - 1) < 0) {
+        hw_error("%s: Cannot delete missing window at %"HWADDR_PRIx,
+                 __func__, section->offset_within_address_space);
+    }
+}
+
+void vfio_reset_handler(void *opaque)
+{
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->dev->realized) {
+                vbasedev->ops->vfio_compute_needs_reset(vbasedev);
+            }
+        }
+    }
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            if (vbasedev->dev->realized && vbasedev->needs_reset) {
+                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+            }
+        }
+    }
+}
+
+static void vfio_kvm_device_add_group(VFIOGroup *group)
+{
+#ifdef CONFIG_KVM
+    struct kvm_device_attr attr = {
+        .group = KVM_DEV_VFIO_GROUP,
+        .attr = KVM_DEV_VFIO_GROUP_ADD,
+        .addr = (uint64_t)(unsigned long)&group->fd,
+    };
+
+    if (!kvm_enabled()) {
+        return;
+    }
+
+    if (vfio_kvm_device_fd < 0) {
+        struct kvm_create_device cd = {
+            .type = KVM_DEV_TYPE_VFIO,
+        };
+
+        if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) {
+            error_report("Failed to create KVM VFIO device: %m");
+            return;
+        }
+
+        vfio_kvm_device_fd = cd.fd;
+    }
+
+    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+        error_report("Failed to add group %d to KVM VFIO device: %m",
+                     group->groupid);
+    }
+#endif
+}
+
+static void vfio_kvm_device_del_group(VFIOGroup *group)
+{
+#ifdef CONFIG_KVM
+    struct kvm_device_attr attr = {
+        .group = KVM_DEV_VFIO_GROUP,
+        .attr = KVM_DEV_VFIO_GROUP_DEL,
+        .addr = (uint64_t)(unsigned long)&group->fd,
+    };
+
+    if (vfio_kvm_device_fd < 0) {
+        return;
+    }
+
+    if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) {
+        error_report("Failed to remove group %d from KVM VFIO device: %m",
+                     group->groupid);
+    }
+#endif
+}
+
+/*
+ * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
+ */
+static int vfio_get_iommu_type(VFIOContainer *container,
+                               Error **errp)
+{
+    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+                          VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
+        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+            return iommu_types[i];
+        }
+    }
+    error_setg(errp, "No available IOMMU models");
+    return -EINVAL;
+}
+
+static int vfio_init_container(VFIOContainer *container, int group_fd,
+                               Error **errp)
+{
+    int iommu_type, ret;
+
+    iommu_type = vfio_get_iommu_type(container, errp);
+    if (iommu_type < 0) {
+        return iommu_type;
+    }
+
+    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
+    if (ret) {
+        error_setg_errno(errp, errno, "Failed to set group container");
+        return -errno;
+    }
+
+    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
+        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+            /*
+             * On sPAPR, although the IOMMU subdriver always advertises v1
+             * and v2, the running platform may not support v2, and there is
+             * no way to know until an IOMMU group is added to the container.
+             * So if setting v2 fails, fall back to v1.
+             */
+            iommu_type = VFIO_SPAPR_TCE_IOMMU;
+            continue;
+        }
+        error_setg_errno(errp, errno, "Failed to set iommu for container");
+        return -errno;
+    }
+
+    container->iommu_type = iommu_type;
+    return 0;
+}
+
+static int vfio_get_iommu_info(VFIOContainer *container,
+                               struct vfio_iommu_type1_info **info)
+{
+
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+    *info = g_new0(struct vfio_iommu_type1_info, 1);
+again:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if (((*info)->argsz > argsz)) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto again;
+    }
+
+    return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+static void vfio_get_iommu_info_migration(VFIOContainer *container,
+                                         struct vfio_iommu_type1_info *info)
+{
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_migration *cap_mig;
+
+    hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
+    if (!hdr) {
+        return;
+    }
+
+    cap_mig = container_of(hdr, struct vfio_iommu_type1_info_cap_migration,
+                            header);
+
+    /*
+     * cpu_physical_memory_set_dirty_lebitmap() expects the dirty bitmap to
+     * track pages of qemu_real_host_page_size.
+     */
+    if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) {
+        container->dirty_pages_supported = true;
+        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+    }
+}
+
+static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
+{
+    switch (container->iommu_type) {
+    case VFIO_TYPE1v2_IOMMU:
+    case VFIO_TYPE1_IOMMU:
+        /*
+         * We support coordinated discarding of RAM via the RamDiscardManager.
+         */
+        return ram_block_uncoordinated_discard_disable(state);
+    default:
+        /*
+         * VFIO_SPAPR_TCE_IOMMU most probably works just fine with
+         * RamDiscardManager; however, it is completely untested.
+         *
+         * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" is the
+         * complete opposite of managing mapping/pinning dynamically as
+         * required by RamDiscardManager. We would have to special-case sections
+         * with a RamDiscardManager.
+         */
+        return ram_block_discard_disable(state);
+    }
+}
+
+static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
+                                  Error **errp)
+{
+    VFIOContainer *container;
+    int ret, fd;
+    VFIOAddressSpace *space;
+
+    space = vfio_get_address_space(as);
+
+    /*
+     * VFIO is currently incompatible with discarding of RAM insofar as the
+     * madvise to purge (zap) the page from QEMU's address space does not
+     * interact with the memory API and therefore leaves stale virtual to
+     * physical mappings in the IOMMU if the page was previously pinned.  We
+     * therefore set discarding broken for each group added to a container,
+     * whether the container is used individually or shared.  This provides
+     * us with options to allow devices within a group to opt-in and allow
+     * discarding, so long as it is done consistently for a group (for instance
+     * if the device is an mdev device where it is known that the host vendor
+     * driver will never pin pages outside of the working set of the guest
+     * driver, which would thus not be discarding candidates).
+     *
+     * The first opportunity to induce pinning occurs here where we attempt to
+     * attach the group to existing containers within the AddressSpace.  If any
+     * pages are already zapped from the virtual address space, such as from
+     * previous discards, new pinning will cause valid mappings to be
+     * re-established.  Likewise, when the overall MemoryListener for a new
+     * container is registered, a replay of mappings within the AddressSpace
+     * will occur, re-establishing any previously zapped pages as well.
+     *
+     * In particular, virtio-balloon is currently only prevented from
+     * discarding new memory; it does not yet set
+     * ram_block_discard_set_required() and therefore neither stops us here
+     * nor deals with the sudden memory consumption of inflated memory.
+     *
+     * We do support discarding of memory coordinated via the RamDiscardManager
+     * with some IOMMU types. vfio_ram_block_discard_disable() handles the
+     * details once we know which type of IOMMU we are using.
+     */
+
+    QLIST_FOREACH(container, &space->containers, next) {
+        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+            ret = vfio_ram_block_discard_disable(container, true);
+            if (ret) {
+                error_setg_errno(errp, -ret,
+                                 "Cannot set discarding of RAM broken");
+                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
+                          &container->fd)) {
+                    error_report("vfio: error disconnecting group %d from"
+                                 " container", group->groupid);
+                }
+                return ret;
+            }
+            group->container = container;
+            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+            vfio_kvm_device_add_group(group);
+            return 0;
+        }
+    }
+
+    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
+        ret = -errno;
+        goto put_space_exit;
+    }
+
+    ret = ioctl(fd, VFIO_GET_API_VERSION);
+    if (ret != VFIO_API_VERSION) {
+        error_setg(errp, "supported vfio version: %d, "
+                   "reported version: %d", VFIO_API_VERSION, ret);
+        ret = -EINVAL;
+        goto close_fd_exit;
+    }
+
+    container = g_malloc0(sizeof(*container));
+    container->space = space;
+    container->fd = fd;
+    container->error = NULL;
+    container->dirty_pages_supported = false;
+    container->dma_max_mappings = 0;
+    QLIST_INIT(&container->giommu_list);
+    QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->vrdl_list);
+
+    ret = vfio_init_container(container, group->fd, errp);
+    if (ret) {
+        goto free_container_exit;
+    }
+
+    ret = vfio_ram_block_discard_disable(container, true);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
+        goto free_container_exit;
+    }
+
+    switch (container->iommu_type) {
+    case VFIO_TYPE1v2_IOMMU:
+    case VFIO_TYPE1_IOMMU:
+    {
+        struct vfio_iommu_type1_info *info;
+
+        /*
+         * FIXME: This assumes that a Type1 IOMMU can map any 64-bit
+         * IOVA whatsoever.  That's not actually true, but the current
+         * kernel interface doesn't tell us what it can map, and the
+         * existing Type1 IOMMUs generally support any IOVA we're
+         * going to actually try in practice.
+         */
+        ret = vfio_get_iommu_info(container, &info);
+
+        if (ret || !(info->flags & VFIO_IOMMU_INFO_PGSIZES)) {
+            /* Assume 4k IOVA page size */
+            info->iova_pgsizes = 4096;
+        }
+        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
+        container->pgsizes = info->iova_pgsizes;
+
+        /* The default in the kernel ("dma_entry_limit") is 65535. */
+        container->dma_max_mappings = 65535;
+        if (!ret) {
+            vfio_get_info_dma_avail(info, &container->dma_max_mappings);
+            vfio_get_iommu_info_migration(container, info);
+        }
+        g_free(info);
+        break;
+    }
+    case VFIO_SPAPR_TCE_v2_IOMMU:
+    case VFIO_SPAPR_TCE_IOMMU:
+    {
+        struct vfio_iommu_spapr_tce_info info;
+        bool v2 = container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU;
+
+        /*
+         * The host kernel code implementing VFIO_IOMMU_DISABLE is called
+         * when the container fd is closed, so we do not call it explicitly
+         * in this file.
+         */
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_setg_errno(errp, errno, "failed to enable container");
+                ret = -errno;
+                goto enable_discards_exit;
+            }
+        } else {
+            container->prereg_listener = vfio_prereg_listener;
+
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                memory_listener_unregister(&container->prereg_listener);
+                ret = -1;
+                error_propagate_prepend(errp, container->error,
+                    "RAM memory listener initialization failed: ");
+                goto enable_discards_exit;
+            }
+        }
+
+        info.argsz = sizeof(info);
+        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
+        if (ret) {
+            error_setg_errno(errp, errno,
+                             "VFIO_IOMMU_SPAPR_TCE_GET_INFO failed");
+            ret = -errno;
+            if (v2) {
+                memory_listener_unregister(&container->prereg_listener);
+            }
+            goto enable_discards_exit;
+        }
+
+        if (v2) {
+            container->pgsizes = info.ddw.pgsizes;
+            /*
+             * A just-created container has a default DMA window. To keep
+             * region_add/del simple, remove it now and let the
+             * iommu_listener callbacks create/remove windows as needed.
+             */
+            ret = vfio_spapr_remove_window(container, info.dma32_window_start);
+            if (ret) {
+                error_setg_errno(errp, -ret,
+                                 "failed to remove existing window");
+                goto enable_discards_exit;
+            }
+        } else {
+            /* The default table uses 4K pages */
+            container->pgsizes = 0x1000;
+            vfio_host_win_add(container, info.dma32_window_start,
+                              info.dma32_window_start +
+                              info.dma32_window_size - 1,
+                              0x1000);
+        }
+    }
+    }
+
+    vfio_kvm_device_add_group(group);
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&space->containers, container, next);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    container->listener = vfio_memory_listener;
+
+    memory_listener_register(&container->listener, container->space->as);
+
+    if (container->error) {
+        ret = -1;
+        error_propagate_prepend(errp, container->error,
+            "memory listener initialization failed: ");
+        goto listener_release_exit;
+    }
+
+    container->initialized = true;
+
+    return 0;
+listener_release_exit:
+    QLIST_REMOVE(group, container_next);
+    QLIST_REMOVE(container, next);
+    vfio_kvm_device_del_group(group);
+    vfio_listener_release(container);
+
+enable_discards_exit:
+    vfio_ram_block_discard_disable(container, false);
+
+free_container_exit:
+    g_free(container);
+
+close_fd_exit:
+    close(fd);
+
+put_space_exit:
+    vfio_put_address_space(space);
+
+    return ret;
+}
+
+static void vfio_disconnect_container(VFIOGroup *group)
+{
+    VFIOContainer *container = group->container;
+
+    QLIST_REMOVE(group, container_next);
+    group->container = NULL;
+
+    /*
+     * Explicitly release the listener before unsetting the container,
+     * since the unset may destroy the backend container if this is the
+     * last group.
+     */
+    if (QLIST_EMPTY(&container->group_list)) {
+        vfio_listener_release(container);
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+        error_report("vfio: error disconnecting group %d from container",
+                     group->groupid);
+    }
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        VFIOAddressSpace *space = container->space;
+        VFIOGuestIOMMU *giommu, *tmp;
+        VFIOHostDMAWindow *hostwin, *next;
+
+        QLIST_REMOVE(container, next);
+
+        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+            memory_region_unregister_iommu_notifier(
+                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
+            QLIST_REMOVE(giommu, giommu_next);
+            g_free(giommu);
+        }
+
+        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+                           next) {
+            QLIST_REMOVE(hostwin, hostwin_next);
+            g_free(hostwin);
+        }
+
+        trace_vfio_disconnect_container(container->fd);
+        close(container->fd);
+        g_free(container);
+
+        vfio_put_address_space(space);
+    }
+}
+
+VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+{
+    VFIOGroup *group;
+    char path[32];
+    struct vfio_group_status status = { .argsz = sizeof(status) };
+
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        if (group->groupid == groupid) {
+            /* Found it.  Now is it already in the right context? */
+            if (group->container->space->as == as) {
+                return group;
+            } else {
+                error_setg(errp, "group %d used in multiple address spaces",
+                           group->groupid);
+                return NULL;
+            }
+        }
+    }
+
+    group = g_malloc0(sizeof(*group));
+
+    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
+    group->fd = qemu_open_old(path, O_RDWR);
+    if (group->fd < 0) {
+        error_setg_errno(errp, errno, "failed to open %s", path);
+        goto free_group_exit;
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
+        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
+        goto close_fd_exit;
+    }
+
+    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_setg(errp, "group %d is not viable", groupid);
+        error_append_hint(errp,
+                          "Please ensure all devices within the iommu_group "
+                          "are bound to their vfio bus driver.\n");
+        goto close_fd_exit;
+    }
+
+    group->groupid = groupid;
+    QLIST_INIT(&group->device_list);
+
+    if (vfio_connect_container(group, as, errp)) {
+        error_prepend(errp, "failed to setup container for group %d: ",
+                      groupid);
+        goto close_fd_exit;
+    }
+
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        qemu_register_reset(vfio_reset_handler, NULL);
+    }
+
+    QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+
+    return group;
+
+close_fd_exit:
+    close(group->fd);
+
+free_group_exit:
+    g_free(group);
+
+    return NULL;
+}
+
+void vfio_put_group(VFIOGroup *group)
+{
+    if (!group || !QLIST_EMPTY(&group->device_list)) {
+        return;
+    }
+
+    if (!group->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(group->container, false);
+    }
+    vfio_kvm_device_del_group(group);
+    vfio_disconnect_container(group);
+    QLIST_REMOVE(group, next);
+    trace_vfio_put_group(group->fd);
+    close(group->fd);
+    g_free(group);
+
+    if (QLIST_EMPTY(&vfio_group_list)) {
+        qemu_unregister_reset(vfio_reset_handler, NULL);
+    }
+}
+
+int vfio_get_device(VFIOGroup *group, const char *name,
+                    VFIODevice *vbasedev, Error **errp)
+{
+    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
+    int ret, fd;
+
+    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "error getting device from group %d",
+                         group->groupid);
+        error_append_hint(errp,
+                      "Verify all devices in group %d are bound to vfio-<bus> "
+                      "or pci-stub and not already in use\n", group->groupid);
+        return fd;
+    }
+
+    ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info);
+    if (ret) {
+        error_setg_errno(errp, errno, "error getting device info");
+        close(fd);
+        return ret;
+    }
+
+    /*
+     * Set discarding of RAM as not broken for this group if the driver knows
+     * the device operates compatibly with discarding.  Setting must be
+     * consistent per group, but since compatibility is really only possible
+     * with mdev currently, we expect singleton groups.
+     */
+    if (vbasedev->ram_block_discard_allowed !=
+        group->ram_block_discard_allowed) {
+        if (!QLIST_EMPTY(&group->device_list)) {
+            error_setg(errp, "Inconsistent setting of support for discarding "
+                       "RAM (e.g., balloon) within group");
+            close(fd);
+            return -1;
+        }
+
+        if (!group->ram_block_discard_allowed) {
+            group->ram_block_discard_allowed = true;
+            vfio_ram_block_discard_disable(group->container, false);
+        }
+    }
+
+    vbasedev->fd = fd;
+    vbasedev->group = group;
+    QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
+
+    vbasedev->num_irqs = dev_info.num_irqs;
+    vbasedev->num_regions = dev_info.num_regions;
+    vbasedev->flags = dev_info.flags;
+
+    trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
+                          dev_info.num_irqs);
+
+    vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
+    return 0;
+}
+
+void vfio_put_base_device(VFIODevice *vbasedev)
+{
+    if (!vbasedev->group) {
+        return;
+    }
+    QLIST_REMOVE(vbasedev, next);
+    vbasedev->group = NULL;
+    trace_vfio_put_base_device(vbasedev->fd);
+    close(vbasedev->fd);
+}
+
+/* FIXME: should the code below be moved to common.c? */
+/*
+ * Interfaces for IBM EEH (Enhanced Error Handling)
+ */
+static bool vfio_eeh_container_ok(VFIOContainer *container)
+{
+    /*
+     * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
+     * implementation is broken if there are multiple groups in a
+     * container.  The hardware works in units of Partitionable
+     * Endpoints (== IOMMU groups) and the EEH operations naively
+     * iterate across all groups in the container, without any logic
+     * to make sure the groups have their state synchronized.  For
+     * certain operations (ENABLE) that might be ok, until an error
+     * occurs, but for others (GET_STATE) it's clearly broken.
+     */
+
+    /*
+     * XXX Once fixed kernels exist, test for them here
+     */
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        return false;
+    }
+
+    if (QLIST_NEXT(QLIST_FIRST(&container->group_list), container_next)) {
+        return false;
+    }
+
+    return true;
+}
+
+static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
+{
+    struct vfio_eeh_pe_op pe_op = {
+        .argsz = sizeof(pe_op),
+        .op = op,
+    };
+    int ret;
+
+    if (!vfio_eeh_container_ok(container)) {
+        error_report("vfio/eeh: EEH_PE_OP 0x%x: "
+                     "kernel requires a container with exactly one group", op);
+        return -EPERM;
+    }
+
+    ret = ioctl(container->fd, VFIO_EEH_PE_OP, &pe_op);
+    if (ret < 0) {
+        error_report("vfio/eeh: EEH_PE_OP 0x%x failed: %m", op);
+        return -errno;
+    }
+
+    return ret;
+}
+
+static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
+{
+    VFIOAddressSpace *space = vfio_get_address_space(as);
+    VFIOContainer *container = NULL;
+
+    if (QLIST_EMPTY(&space->containers)) {
+        /* No containers to act on */
+        goto out;
+    }
+
+    container = QLIST_FIRST(&space->containers);
+
+    if (QLIST_NEXT(container, next)) {
+        /*
+         * We don't yet have logic to synchronize EEH state across
+         * multiple containers.
+         */
+        container = NULL;
+        goto out;
+    }
+
+out:
+    vfio_put_address_space(space);
+    return container;
+}
+
+bool vfio_eeh_as_ok(AddressSpace *as)
+{
+    VFIOContainer *container = vfio_eeh_as_container(as);
+
+    return (container != NULL) && vfio_eeh_container_ok(container);
+}
+
+int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
+{
+    VFIOContainer *container = vfio_eeh_as_container(as);
+
+    if (!container) {
+        return -ENODEV;
+    }
+    return vfio_eeh_container_op(container, op);
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af297a0..e3b6d6e2cb 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -1,6 +1,8 @@
 vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
+  'as.c',
+  'container.c',
   'spapr.c',
   'migration.c',
 ))
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e573f5a9f1..03ff7944cb 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -33,6 +33,8 @@
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
+extern const MemoryListener vfio_memory_listener;
+
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
     VFIO_DEVICE_TYPE_PLATFORM = 1,
@@ -190,6 +192,32 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
+void vfio_host_win_add(VFIOContainer *container,
+                       hwaddr min_iova, hwaddr max_iova,
+                       uint64_t iova_pgsizes);
+int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+                      hwaddr max_iova);
+VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
+void vfio_put_address_space(VFIOAddressSpace *space);
+bool vfio_devices_all_running_and_saving(VFIOContainer *container);
+bool vfio_devices_all_dirty_tracking(VFIOContainer *container);
+
+/* container->fd */
+int vfio_dma_unmap(VFIOContainer *container,
+                   hwaddr iova, ram_addr_t size,
+                   IOMMUTLBEntry *iotlb);
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly);
+void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
+int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                          uint64_t size, ram_addr_t ram_addr);
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp);
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section);
+
 void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
 void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
-- 
2.27.0




* [RFC 07/18] vfio: Add base object for VFIOContainer
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:46   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

Qomify the VFIOContainer object which acts as a base class for a
container. This base class is derived into the legacy VFIO container
and later on, into the new iommufd based container.

The base class implements generic code such as the code related to the
memory_listener and address space management, whereas the derived
class implements the callbacks that depend on the kernel userspace
interface being used.

'as.c' only manipulates the base class object with wrapper functions
that call the right class functions. Existing 'container.c' code is
converted to implement the legacy container class functions.

Existing migration code only works with the legacy container, and
'spapr.c' is not backend agnostic either.

Below is the new base object. It takes over the VFIOContainer name,
while the old VFIOContainer is renamed VFIOLegacyContainer.

struct VFIOContainer {
    /* private */
    Object parent_obj;

    VFIOAddressSpace *space;
    MemoryListener listener;
    Error *error;
    bool initialized;
    bool dirty_pages_supported;
    uint64_t dirty_pgsizes;
    uint64_t max_dirty_bitmap_size;
    unsigned long pgsizes;
    unsigned int dma_max_mappings;
    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
    QLIST_ENTRY(VFIOContainer) next;
};

struct VFIOLegacyContainer {
    VFIOContainer obj;
    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
    MemoryListener prereg_listener;
    unsigned iommu_type;
    QLIST_HEAD(, VFIOGroup) group_list;
};

Co-authored-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                         |  48 +++---
 hw/vfio/container-obj.c              | 195 +++++++++++++++++++++++
 hw/vfio/container.c                  | 224 ++++++++++++++++-----------
 hw/vfio/meson.build                  |   1 +
 hw/vfio/migration.c                  |   4 +-
 hw/vfio/pci.c                        |   4 +-
 hw/vfio/spapr.c                      |  22 +--
 include/hw/vfio/vfio-common.h        |  78 ++--------
 include/hw/vfio/vfio-container-obj.h | 154 ++++++++++++++++++
 9 files changed, 540 insertions(+), 190 deletions(-)
 create mode 100644 hw/vfio/container-obj.c
 create mode 100644 include/hw/vfio/vfio-container-obj.h

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 4181182808..37423d2c89 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -215,9 +215,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
          * of vaddr will always be there, even if the memory object is
          * destroyed and its backing memory munmap-ed.
          */
-        ret = vfio_dma_map(container, iova,
-                           iotlb->addr_mask + 1, vaddr,
-                           read_only);
+        ret = vfio_container_dma_map(container, iova,
+                                     iotlb->addr_mask + 1, vaddr,
+                                     read_only);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
@@ -225,7 +225,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
+        ret = vfio_container_dma_unmap(container, iova,
+                                       iotlb->addr_mask + 1, iotlb);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -242,12 +243,13 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
+    VFIOContainer *container = vrdl->container;
     const hwaddr size = int128_get64(section->size);
     const hwaddr iova = section->offset_within_address_space;
     int ret;
 
     /* Unmap with a single call. */
-    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
+    ret = vfio_container_dma_unmap(container, iova, size , NULL);
     if (ret) {
         error_report("%s: vfio_dma_unmap() failed: %s", __func__,
                      strerror(-ret));
@@ -259,6 +261,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
+    VFIOContainer *container = vrdl->container;
     const hwaddr end = section->offset_within_region +
                        int128_get64(section->size);
     hwaddr start, next, iova;
@@ -277,8 +280,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
                section->offset_within_address_space;
         vaddr = memory_region_get_ram_ptr(section->mr) + start;
 
-        ret = vfio_dma_map(vrdl->container, iova, next - start,
-                           vaddr, section->readonly);
+        ret = vfio_container_dma_map(container, iova, next - start,
+                                     vaddr, section->readonly);
         if (ret) {
             /* Rollback */
             vfio_ram_discard_notify_discard(rdl, section);
@@ -530,8 +533,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
         }
     }
 
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
-                       vaddr, section->readonly);
+    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
+                                 vaddr, section->readonly);
     if (ret) {
         error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                    "0x%"HWADDR_PRIx", %p) = %d (%m)",
@@ -656,7 +659,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
         if (int128_eq(llsize, int128_2_64())) {
             /* The unmap ioctl doesn't accept a full 64-bit span. */
             llsize = int128_rshift(llsize, 1);
-            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+            ret = vfio_container_dma_unmap(container, iova,
+                                           int128_get64(llsize), NULL);
             if (ret) {
                 error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                              "0x%"HWADDR_PRIx") = %d (%m)",
@@ -664,7 +668,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
             }
             iova += int128_get64(llsize);
         }
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+        ret = vfio_container_dma_unmap(container, iova,
+                                       int128_get64(llsize), NULL);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -681,14 +686,14 @@ static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-    vfio_set_dirty_page_tracking(container, true);
+    vfio_container_set_dirty_page_tracking(container, true);
 }
 
 static void vfio_listener_log_global_stop(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-    vfio_set_dirty_page_tracking(container, false);
+    vfio_container_set_dirty_page_tracking(container, false);
 }
 
 typedef struct {
@@ -717,8 +722,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
         int ret;
 
-        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
-                                    translated_addr);
+        ret = vfio_container_get_dirty_bitmap(container, iova,
+                                              iotlb->addr_mask + 1,
+                                              translated_addr);
         if (ret) {
             error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -742,11 +748,13 @@ static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
      * Sync the whole mapped region (spanning multiple individual mappings)
      * in one go.
      */
-    return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr);
+    return vfio_container_get_dirty_bitmap(vrdl->container, iova,
+                                           size, ram_addr);
 }
 
-static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
-                                                   MemoryRegionSection *section)
+static int
+vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
+                                            MemoryRegionSection *section)
 {
     RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
     VFIORamDiscardListener *vrdl = NULL;
@@ -810,7 +818,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
     ram_addr = memory_region_get_ram_addr(section->mr) +
                section->offset_within_region;
 
-    return vfio_get_dirty_bitmap(container,
+    return vfio_container_get_dirty_bitmap(container,
                    REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
                    int128_get64(section->size), ram_addr);
 }
@@ -825,7 +833,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
         return;
     }
 
-    if (vfio_devices_all_dirty_tracking(container)) {
+    if (vfio_container_devices_all_dirty_tracking(container)) {
         vfio_sync_dirty_bitmap(container, section);
     }
 }
diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
new file mode 100644
index 0000000000..40c1e2a2b5
--- /dev/null
+++ b/hw/vfio/container-obj.c
@@ -0,0 +1,195 @@
+/*
+ * VFIO CONTAINER BASE OBJECT
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qom/object.h"
+#include "qapi/visitor.h"
+#include "hw/vfio/vfio-container-obj.h"
+
+bool vfio_container_check_extension(VFIOContainer *container,
+                                    VFIOContainerFeature feat)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->check_extension) {
+        return false;
+    }
+
+    return vccs->check_extension(container, feat);
+}
+
+int vfio_container_dma_map(VFIOContainer *container,
+                           hwaddr iova, ram_addr_t size,
+                           void *vaddr, bool readonly)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->dma_map) {
+        return -EINVAL;
+    }
+
+    return vccs->dma_map(container, iova, size, vaddr, readonly);
+}
+
+int vfio_container_dma_unmap(VFIOContainer *container,
+                             hwaddr iova, ram_addr_t size,
+                             IOMMUTLBEntry *iotlb)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->dma_unmap) {
+        return -EINVAL;
+    }
+
+    return vccs->dma_unmap(container, iova, size, iotlb);
+}
+
+void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
+                                            bool start)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->set_dirty_page_tracking) {
+        return;
+    }
+
+    vccs->set_dirty_page_tracking(container, start);
+}
+
+bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->devices_all_dirty_tracking) {
+        return false;
+    }
+
+    return vccs->devices_all_dirty_tracking(container);
+}
+
+int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                    uint64_t size, ram_addr_t ram_addr)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->get_dirty_bitmap) {
+        return -EINVAL;
+    }
+
+    return vccs->get_dirty_bitmap(container, iova, size, ram_addr);
+}
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->add_window) {
+        return 0;
+    }
+
+    return vccs->add_window(container, section, errp);
+}
+
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->del_window) {
+        return;
+    }
+
+    vccs->del_window(container, section);
+}
+
+void vfio_container_init(void *_container, size_t instance_size,
+                         const char *mrtypename,
+                         VFIOAddressSpace *space)
+{
+    VFIOContainer *container;
+
+    object_initialize(_container, instance_size, mrtypename);
+    container = VFIO_CONTAINER_OBJ(_container);
+
+    container->space = space;
+    container->error = NULL;
+    container->dirty_pages_supported = false;
+    container->dma_max_mappings = 0;
+    QLIST_INIT(&container->giommu_list);
+    QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->vrdl_list);
+}
+
+void vfio_container_destroy(VFIOContainer *container)
+{
+    VFIORamDiscardListener *vrdl, *vrdl_tmp;
+    VFIOGuestIOMMU *giommu, *tmp;
+    VFIOHostDMAWindow *hostwin, *next;
+
+    QLIST_SAFE_REMOVE(container, next);
+
+    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
+        RamDiscardManager *rdm;
+
+        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
+        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
+        QLIST_REMOVE(vrdl, next);
+        g_free(vrdl);
+    }
+
+    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+        memory_region_unregister_iommu_notifier(
+                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
+        QLIST_REMOVE(giommu, giommu_next);
+        g_free(giommu);
+    }
+
+    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+                       next) {
+        QLIST_REMOVE(hostwin, hostwin_next);
+        g_free(hostwin);
+    }
+
+    object_unref(&container->parent_obj);
+}
+
+static const TypeInfo vfio_container_info = {
+    .parent             = TYPE_OBJECT,
+    .name               = TYPE_VFIO_CONTAINER_OBJ,
+    .class_size         = sizeof(VFIOContainerClass),
+    .instance_size      = sizeof(VFIOContainer),
+    .abstract           = true,
+};
+
+static void vfio_container_register_types(void)
+{
+    type_register_static(&vfio_container_info);
+}
+
+type_init(vfio_container_register_types)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 9c665c1720..79972064d3 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -50,6 +50,8 @@
 static int vfio_kvm_device_fd = -1;
 #endif
 
+#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
+
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
 
@@ -76,8 +78,10 @@ bool vfio_mig_active(void)
     return true;
 }
 
-bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
+static bool vfio_devices_all_dirty_tracking(VFIOContainer *bcontainer)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     VFIOGroup *group;
     VFIODevice *vbasedev;
     MigrationState *ms = migrate_get_current();
@@ -103,7 +107,7 @@ bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
     return true;
 }
 
-bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+static bool vfio_devices_all_running_and_saving(VFIOLegacyContainer *container)
 {
     VFIOGroup *group;
     VFIODevice *vbasedev;
@@ -132,10 +136,11 @@ bool vfio_devices_all_running_and_saving(VFIOContainer *container)
     return true;
 }
 
-static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
                                  hwaddr iova, ram_addr_t size,
                                  IOMMUTLBEntry *iotlb)
 {
+    VFIOContainer *bcontainer = &container->obj;
     struct vfio_iommu_type1_dma_unmap *unmap;
     struct vfio_bitmap *bitmap;
     uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
@@ -159,7 +164,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
     bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
                    BITS_PER_BYTE;
 
-    if (bitmap->size > container->max_dirty_bitmap_size) {
+    if (bitmap->size > bcontainer->max_dirty_bitmap_size) {
         error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
                      (uint64_t)bitmap->size);
         ret = -E2BIG;
@@ -189,10 +194,12 @@ unmap_exit:
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
-int vfio_dma_unmap(VFIOContainer *container,
-                   hwaddr iova, ram_addr_t size,
-                   IOMMUTLBEntry *iotlb)
+static int vfio_dma_unmap(VFIOContainer *bcontainer,
+                          hwaddr iova, ram_addr_t size,
+                          IOMMUTLBEntry *iotlb)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
         .flags = 0,
@@ -200,7 +207,7 @@ int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
-    if (iotlb && container->dirty_pages_supported &&
+    if (iotlb && bcontainer->dirty_pages_supported &&
         vfio_devices_all_running_and_saving(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
@@ -221,7 +228,7 @@ int vfio_dma_unmap(VFIOContainer *container,
         if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
             container->iommu_type == VFIO_TYPE1v2_IOMMU) {
             trace_vfio_dma_unmap_overflow_workaround();
-            unmap.size -= 1ULL << ctz64(container->pgsizes);
+            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
             continue;
         }
         error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
@@ -231,9 +238,22 @@ int vfio_dma_unmap(VFIOContainer *container,
     return 0;
 }
 
-int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                 ram_addr_t size, void *vaddr, bool readonly)
+static bool vfio_legacy_container_check_extension(VFIOContainer *bcontainer,
+                                                  VFIOContainerFeature feat)
 {
+    switch (feat) {
+    case VFIO_FEAT_LIVE_MIGRATION:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static int vfio_dma_map(VFIOContainer *bcontainer, hwaddr iova,
+                        ram_addr_t size, void *vaddr, bool readonly)
+{
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     struct vfio_iommu_type1_dma_map map = {
         .argsz = sizeof(map),
         .flags = VFIO_DMA_MAP_FLAG_READ,
@@ -252,7 +272,7 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
      * the VGA ROM space.
      */
     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
+        (errno == EBUSY && vfio_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
         return 0;
     }
@@ -261,8 +281,10 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
-void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
+static void vfio_set_dirty_page_tracking(VFIOContainer *bcontainer, bool start)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     int ret;
     struct vfio_iommu_type1_dirty_bitmap dirty = {
         .argsz = sizeof(dirty),
@@ -281,9 +303,11 @@ void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     }
 }
 
-int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                          uint64_t size, ram_addr_t ram_addr)
+static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
     uint64_t pages;
@@ -333,18 +357,23 @@ err_out:
     return ret;
 }
 
-static void vfio_listener_release(VFIOContainer *container)
+static void vfio_listener_release(VFIOLegacyContainer *container)
 {
-    memory_listener_unregister(&container->listener);
+    VFIOContainer *bcontainer = &container->obj;
+
+    memory_listener_unregister(&bcontainer->listener);
     if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
         memory_listener_unregister(&container->prereg_listener);
     }
 }
 
-int vfio_container_add_section_window(VFIOContainer *container,
-                                      MemoryRegionSection *section,
-                                      Error **errp)
+static int
+vfio_legacy_container_add_section_window(VFIOContainer *bcontainer,
+                                         MemoryRegionSection *section,
+                                         Error **errp)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     VFIOHostDMAWindow *hostwin;
     hwaddr pgsize = 0;
     int ret;
@@ -354,7 +383,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
     }
 
     /* For now intersections are not allowed, we may relax this later */
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
         if (ranges_overlap(hostwin->min_iova,
                            hostwin->max_iova - hostwin->min_iova + 1,
                            section->offset_within_address_space,
@@ -376,7 +405,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
         return ret;
     }
 
-    vfio_host_win_add(container, section->offset_within_address_space,
+    vfio_host_win_add(bcontainer, section->offset_within_address_space,
                       section->offset_within_address_space +
                       int128_get64(section->size) - 1, pgsize);
 #ifdef CONFIG_KVM
@@ -409,16 +438,20 @@ int vfio_container_add_section_window(VFIOContainer *container,
     return 0;
 }
 
-void vfio_container_del_section_window(VFIOContainer *container,
-                                       MemoryRegionSection *section)
+static void
+vfio_legacy_container_del_section_window(VFIOContainer *bcontainer,
+                                         MemoryRegionSection *section)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
+
     if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
         return;
     }
 
     vfio_spapr_remove_window(container,
                              section->offset_within_address_space);
-    if (vfio_host_win_del(container,
+    if (vfio_host_win_del(bcontainer,
                           section->offset_within_address_space,
                           section->offset_within_address_space +
                           int128_get64(section->size) - 1) < 0) {
@@ -505,7 +538,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
 /*
  * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
  */
-static int vfio_get_iommu_type(VFIOContainer *container,
+static int vfio_get_iommu_type(VFIOLegacyContainer *container,
                                Error **errp)
 {
     int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
@@ -521,7 +554,7 @@ static int vfio_get_iommu_type(VFIOContainer *container,
     return -EINVAL;
 }
 
-static int vfio_init_container(VFIOContainer *container, int group_fd,
+static int vfio_init_container(VFIOLegacyContainer *container, int group_fd,
                                Error **errp)
 {
     int iommu_type, ret;
@@ -556,7 +589,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     return 0;
 }
 
-static int vfio_get_iommu_info(VFIOContainer *container,
+static int vfio_get_iommu_info(VFIOLegacyContainer *container,
                                struct vfio_iommu_type1_info **info)
 {
 
@@ -600,11 +633,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
     return NULL;
 }
 
-static void vfio_get_iommu_info_migration(VFIOContainer *container,
-                                         struct vfio_iommu_type1_info *info)
+static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container,
+                                          struct vfio_iommu_type1_info *info)
 {
     struct vfio_info_cap_header *hdr;
     struct vfio_iommu_type1_info_cap_migration *cap_mig;
+    VFIOContainer *bcontainer = &container->obj;
 
     hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
     if (!hdr) {
@@ -619,13 +653,14 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container,
      * qemu_real_host_page_size to mark those dirty.
      */
     if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) {
-        container->dirty_pages_supported = true;
-        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
-        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+        bcontainer->dirty_pages_supported = true;
+        bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
     }
 }
 
-static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
+static int
+vfio_ram_block_discard_disable(VFIOLegacyContainer *container, bool state)
 {
     switch (container->iommu_type) {
     case VFIO_TYPE1v2_IOMMU:
@@ -651,7 +686,8 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                                   Error **errp)
 {
-    VFIOContainer *container;
+    VFIOContainer *bcontainer;
+    VFIOLegacyContainer *container;
     int ret, fd;
     VFIOAddressSpace *space;
 
@@ -688,7 +724,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * details once we know which type of IOMMU we are using.
      */
 
-    QLIST_FOREACH(container, &space->containers, next) {
+    QLIST_FOREACH(bcontainer, &space->containers, next) {
+        container = container_of(bcontainer, VFIOLegacyContainer, obj);
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             ret = vfio_ram_block_discard_disable(container, true);
             if (ret) {
@@ -724,14 +761,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     container = g_malloc0(sizeof(*container));
-    container->space = space;
     container->fd = fd;
-    container->error = NULL;
-    container->dirty_pages_supported = false;
-    container->dma_max_mappings = 0;
-    QLIST_INIT(&container->giommu_list);
-    QLIST_INIT(&container->hostwin_list);
-    QLIST_INIT(&container->vrdl_list);
+    bcontainer = &container->obj;
+    vfio_container_init(bcontainer, sizeof(*bcontainer),
+                        TYPE_VFIO_LEGACY_CONTAINER, space);
 
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
@@ -763,13 +796,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             /* Assume 4k IOVA page size */
             info->iova_pgsizes = 4096;
         }
-        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
-        container->pgsizes = info->iova_pgsizes;
+        vfio_host_win_add(bcontainer, 0, (hwaddr)-1, info->iova_pgsizes);
+        bcontainer->pgsizes = info->iova_pgsizes;
 
         /* The default in the kernel ("dma_entry_limit") is 65535. */
-        container->dma_max_mappings = 65535;
+        bcontainer->dma_max_mappings = 65535;
         if (!ret) {
-            vfio_get_info_dma_avail(info, &container->dma_max_mappings);
+            vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings);
             vfio_get_iommu_info_migration(container, info);
         }
         g_free(info);
@@ -798,10 +831,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
             memory_listener_register(&container->prereg_listener,
                                      &address_space_memory);
-            if (container->error) {
+            if (bcontainer->error) {
                 memory_listener_unregister(&container->prereg_listener);
                 ret = -1;
-                error_propagate_prepend(errp, container->error,
+                error_propagate_prepend(errp, bcontainer->error,
                     "RAM memory listener initialization failed: ");
                 goto enable_discards_exit;
             }
@@ -820,7 +853,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
 
         if (v2) {
-            container->pgsizes = info.ddw.pgsizes;
+            bcontainer->pgsizes = info.ddw.pgsizes;
             /*
              * There is a default window in just created container.
              * To make region_add/del simpler, we better remove this
@@ -835,8 +868,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             }
         } else {
             /* The default table uses 4K pages */
-            container->pgsizes = 0x1000;
-            vfio_host_win_add(container, info.dma32_window_start,
+            bcontainer->pgsizes = 0x1000;
+            vfio_host_win_add(bcontainer, info.dma32_window_start,
                               info.dma32_window_start +
                               info.dma32_window_size - 1,
                               0x1000);
@@ -847,28 +880,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     vfio_kvm_device_add_group(group);
 
     QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&space->containers, container, next);
+    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
 
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
-    container->listener = vfio_memory_listener;
+    bcontainer->listener = vfio_memory_listener;
 
-    memory_listener_register(&container->listener, container->space->as);
+    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
 
-    if (container->error) {
+    if (bcontainer->error) {
         ret = -1;
-        error_propagate_prepend(errp, container->error,
+        error_propagate_prepend(errp, bcontainer->error,
             "memory listener initialization failed: ");
         goto listener_release_exit;
     }
 
-    container->initialized = true;
+    bcontainer->initialized = true;
 
     return 0;
 listener_release_exit:
     QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(container, next);
+    QLIST_REMOVE(bcontainer, next);
     vfio_kvm_device_del_group(group);
     vfio_listener_release(container);
 
@@ -889,7 +922,8 @@ put_space_exit:
 
 static void vfio_disconnect_container(VFIOGroup *group)
 {
-    VFIOContainer *container = group->container;
+    VFIOLegacyContainer *container = group->container;
+    VFIOContainer *bcontainer = &container->obj;
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
@@ -909,25 +943,9 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 
     if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = container->space;
-        VFIOGuestIOMMU *giommu, *tmp;
-        VFIOHostDMAWindow *hostwin, *next;
-
-        QLIST_REMOVE(container, next);
-
-        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(
-                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
-            QLIST_REMOVE(giommu, giommu_next);
-            g_free(giommu);
-        }
-
-        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
-                           next) {
-            QLIST_REMOVE(hostwin, hostwin_next);
-            g_free(hostwin);
-        }
+        VFIOAddressSpace *space = bcontainer->space;
 
+        vfio_container_destroy(bcontainer);
         trace_vfio_disconnect_container(container->fd);
         close(container->fd);
         g_free(container);
@@ -939,13 +957,15 @@ static void vfio_disconnect_container(VFIOGroup *group)
 VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
+    VFIOContainer *bcontainer;
     char path[32];
     struct vfio_group_status status = { .argsz = sizeof(status) };
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
         if (group->groupid == groupid) {
             /* Found it.  Now is it already in the right context? */
-            if (group->container->space->as == as) {
+            bcontainer = &group->container->obj;
+            if (bcontainer->space->as == as) {
                 return group;
             } else {
                 error_setg(errp, "group %d used in multiple address spaces",
@@ -1098,7 +1118,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
-static bool vfio_eeh_container_ok(VFIOContainer *container)
+static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
 {
     /*
      * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
@@ -1126,7 +1146,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container)
     return true;
 }
 
-static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
+static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op)
 {
     struct vfio_eeh_pe_op pe_op = {
         .argsz = sizeof(pe_op),
@@ -1149,19 +1169,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
     return ret;
 }
 
-static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
+static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
 {
     VFIOAddressSpace *space = vfio_get_address_space(as);
-    VFIOContainer *container = NULL;
+    VFIOLegacyContainer *container = NULL;
+    VFIOContainer *bcontainer = NULL;
 
     if (QLIST_EMPTY(&space->containers)) {
         /* No containers to act on */
         goto out;
     }
 
-    container = QLIST_FIRST(&space->containers);
+    bcontainer = QLIST_FIRST(&space->containers);
+    container = container_of(bcontainer, VFIOLegacyContainer, obj);
 
-    if (QLIST_NEXT(container, next)) {
+    if (QLIST_NEXT(bcontainer, next)) {
         /*
          * We don't yet have logic to synchronize EEH state across
          * multiple containers.
@@ -1177,17 +1199,45 @@ out:
 
 bool vfio_eeh_as_ok(AddressSpace *as)
 {
-    VFIOContainer *container = vfio_eeh_as_container(as);
+    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
 
     return (container != NULL) && vfio_eeh_container_ok(container);
 }
 
 int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
 {
-    VFIOContainer *container = vfio_eeh_as_container(as);
+    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
 
     if (!container) {
         return -ENODEV;
     }
     return vfio_eeh_container_op(container, op);
 }
+
+static void vfio_legacy_container_class_init(ObjectClass *klass,
+                                             void *data)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass);
+
+    vccs->dma_map = vfio_dma_map;
+    vccs->dma_unmap = vfio_dma_unmap;
+    vccs->devices_all_dirty_tracking = vfio_devices_all_dirty_tracking;
+    vccs->set_dirty_page_tracking = vfio_set_dirty_page_tracking;
+    vccs->get_dirty_bitmap = vfio_get_dirty_bitmap;
+    vccs->add_window = vfio_legacy_container_add_section_window;
+    vccs->del_window = vfio_legacy_container_del_section_window;
+    vccs->check_extension = vfio_legacy_container_check_extension;
+}
+
+static const TypeInfo vfio_legacy_container_info = {
+    .parent = TYPE_VFIO_CONTAINER_OBJ,
+    .name = TYPE_VFIO_LEGACY_CONTAINER,
+    .class_init = vfio_legacy_container_class_init,
+};
+
+static void vfio_register_types(void)
+{
+    type_register_static(&vfio_legacy_container_info);
+}
+
+type_init(vfio_register_types)
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index e3b6d6e2cb..df4fa2b695 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
   'as.c',
+  'container-obj.c',
   'container.c',
   'spapr.c',
   'migration.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ff6b45de6b..cbbde177c3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -856,11 +856,11 @@ int64_t vfio_mig_bytes_transferred(void)
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
-    VFIOContainer *container = vbasedev->group->container;
+    VFIOLegacyContainer *container = vbasedev->group->container;
     struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
-    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+    if (!vbasedev->enable_migration || !container->obj.dirty_pages_supported) {
         goto add_blocker;
     }
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e707329394..a00a485e46 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3101,7 +3101,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
-    if (!pdev->failover_pair_id) {
+    if (!pdev->failover_pair_id &&
+        vfio_container_check_extension(&vbasedev->group->container->obj,
+                                       VFIO_FEAT_LIVE_MIGRATION)) {
         ret = vfio_migration_probe(vbasedev, errp);
         if (ret) {
             error_report("%s: Migration disabled", vbasedev->name);
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 04c6e67f8f..cdcd9e05ba 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
 static void vfio_prereg_listener_region_add(MemoryListener *listener,
                                             MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            prereg_listener);
+    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
+                                                  prereg_listener);
     const hwaddr gpa = section->offset_within_address_space;
     hwaddr end;
     int ret;
@@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
          * can gracefully fail.  Runtime, there's not much we can do other
          * than throw a hardware error.
          */
-        if (!container->initialized) {
-            if (!container->error) {
-                error_setg_errno(&container->error, -ret,
+        if (!container->obj.initialized) {
+            if (!container->obj.error) {
+                error_setg_errno(&container->obj.error, -ret,
                                  "Memory registering failed");
             }
         } else {
@@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
 static void vfio_prereg_listener_region_del(MemoryListener *listener,
                                             MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            prereg_listener);
+    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
+                                                  prereg_listener);
     const hwaddr gpa = section->offset_within_address_space;
     hwaddr end;
     int ret;
@@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
     .region_del = vfio_prereg_listener_region_del,
 };
 
-int vfio_spapr_create_window(VFIOContainer *container,
+int vfio_spapr_create_window(VFIOLegacyContainer *container,
                              MemoryRegionSection *section,
                              hwaddr *pgsize)
 {
@@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container,
     if (pagesize > rampagesize) {
         pagesize = rampagesize;
     }
-    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
+    pgmask = container->obj.pgsizes & (pagesize | (pagesize - 1));
     pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
     if (!pagesize) {
         error_report("Host doesn't support page size 0x%"PRIx64
                      ", the supported mask is 0x%lx",
                      memory_region_iommu_get_min_page_size(iommu_mr),
-                     container->pgsizes);
+                     container->obj.pgsizes);
         return -EINVAL;
     }
 
@@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
     return 0;
 }
 
-int vfio_spapr_remove_window(VFIOContainer *container,
+int vfio_spapr_remove_window(VFIOLegacyContainer *container,
                              hwaddr offset_within_address_space)
 {
     struct vfio_iommu_spapr_tce_remove remove = {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 03ff7944cb..02a6f36a9e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -30,6 +30,7 @@
 #include <linux/vfio.h>
 #endif
 #include "sysemu/sysemu.h"
+#include "hw/vfio/vfio-container-obj.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -70,58 +71,15 @@ typedef struct VFIOMigration {
     uint64_t pending_bytes;
 } VFIOMigration;
 
-typedef struct VFIOAddressSpace {
-    AddressSpace *as;
-    QLIST_HEAD(, VFIOContainer) containers;
-    QLIST_ENTRY(VFIOAddressSpace) list;
-} VFIOAddressSpace;
-
 struct VFIOGroup;
 
-typedef struct VFIOContainer {
-    VFIOAddressSpace *space;
+typedef struct VFIOLegacyContainer {
+    VFIOContainer obj;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
-    MemoryListener listener;
     MemoryListener prereg_listener;
     unsigned iommu_type;
-    Error *error;
-    bool initialized;
-    bool dirty_pages_supported;
-    uint64_t dirty_pgsizes;
-    uint64_t max_dirty_bitmap_size;
-    unsigned long pgsizes;
-    unsigned int dma_max_mappings;
-    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
-    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
-    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
-    QLIST_ENTRY(VFIOContainer) next;
-} VFIOContainer;
-
-typedef struct VFIOGuestIOMMU {
-    VFIOContainer *container;
-    IOMMUMemoryRegion *iommu_mr;
-    hwaddr iommu_offset;
-    IOMMUNotifier n;
-    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
-} VFIOGuestIOMMU;
-
-typedef struct VFIORamDiscardListener {
-    VFIOContainer *container;
-    MemoryRegion *mr;
-    hwaddr offset_within_address_space;
-    hwaddr size;
-    uint64_t granularity;
-    RamDiscardListener listener;
-    QLIST_ENTRY(VFIORamDiscardListener) next;
-} VFIORamDiscardListener;
-
-typedef struct VFIOHostDMAWindow {
-    hwaddr min_iova;
-    hwaddr max_iova;
-    uint64_t iova_pgsizes;
-    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
-} VFIOHostDMAWindow;
+} VFIOLegacyContainer;
 
 typedef struct VFIODeviceOps VFIODeviceOps;
 
@@ -159,7 +117,7 @@ struct VFIODeviceOps {
 typedef struct VFIOGroup {
     int fd;
     int groupid;
-    VFIOContainer *container;
+    VFIOLegacyContainer *container;
     QLIST_HEAD(, VFIODevice) device_list;
     QLIST_ENTRY(VFIOGroup) next;
     QLIST_ENTRY(VFIOGroup) container_next;
@@ -192,31 +150,13 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
-void vfio_host_win_add(VFIOContainer *container,
+void vfio_host_win_add(VFIOContainer *bcontainer,
                        hwaddr min_iova, hwaddr max_iova,
                        uint64_t iova_pgsizes);
-int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
                       hwaddr max_iova);
 VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
 void vfio_put_address_space(VFIOAddressSpace *space);
-bool vfio_devices_all_running_and_saving(VFIOContainer *container);
-bool vfio_devices_all_dirty_tracking(VFIOContainer *container);
-
-/* container->fd */
-int vfio_dma_unmap(VFIOContainer *container,
-                   hwaddr iova, ram_addr_t size,
-                   IOMMUTLBEntry *iotlb);
-int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                 ram_addr_t size, void *vaddr, bool readonly);
-void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
-int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                          uint64_t size, ram_addr_t ram_addr);
-
-int vfio_container_add_section_window(VFIOContainer *container,
-                                      MemoryRegionSection *section,
-                                      Error **errp);
-void vfio_container_del_section_window(VFIOContainer *container,
-                                       MemoryRegionSection *section);
 
 void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
@@ -263,10 +203,10 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
-int vfio_spapr_create_window(VFIOContainer *container,
+int vfio_spapr_create_window(VFIOLegacyContainer *container,
                              MemoryRegionSection *section,
                              hwaddr *pgsize);
-int vfio_spapr_remove_window(VFIOContainer *container,
+int vfio_spapr_remove_window(VFIOLegacyContainer *container,
                              hwaddr offset_within_address_space);
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
new file mode 100644
index 0000000000..7ffbbb299f
--- /dev/null
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -0,0 +1,154 @@
+/*
+ * VFIO CONTAINER BASE OBJECT
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_VFIO_VFIO_CONTAINER_OBJ_H
+#define HW_VFIO_VFIO_CONTAINER_OBJ_H
+
+#include "qom/object.h"
+#include "exec/memory.h"
+#include "qemu/queue.h"
+#include "qemu/thread.h"
+#ifndef CONFIG_USER_ONLY
+#include "exec/hwaddr.h"
+#endif
+
+#define TYPE_VFIO_CONTAINER_OBJ "qemu:vfio-base-container-obj"
+#define VFIO_CONTAINER_OBJ(obj) \
+        OBJECT_CHECK(VFIOContainer, (obj), TYPE_VFIO_CONTAINER_OBJ)
+#define VFIO_CONTAINER_OBJ_CLASS(klass) \
+        OBJECT_CLASS_CHECK(VFIOContainerClass, (klass), \
+                         TYPE_VFIO_CONTAINER_OBJ)
+#define VFIO_CONTAINER_OBJ_GET_CLASS(obj) \
+        OBJECT_GET_CLASS(VFIOContainerClass, (obj), \
+                         TYPE_VFIO_CONTAINER_OBJ)
+
+typedef enum VFIOContainerFeature {
+    VFIO_FEAT_LIVE_MIGRATION,
+} VFIOContainerFeature;
+
+typedef struct VFIOContainer VFIOContainer;
+
+typedef struct VFIOAddressSpace {
+    AddressSpace *as;
+    QLIST_HEAD(, VFIOContainer) containers;
+    QLIST_ENTRY(VFIOAddressSpace) list;
+} VFIOAddressSpace;
+
+typedef struct VFIOGuestIOMMU {
+    VFIOContainer *container;
+    IOMMUMemoryRegion *iommu_mr;
+    hwaddr iommu_offset;
+    IOMMUNotifier n;
+    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
+} VFIOGuestIOMMU;
+
+typedef struct VFIORamDiscardListener {
+    VFIOContainer *container;
+    MemoryRegion *mr;
+    hwaddr offset_within_address_space;
+    hwaddr size;
+    uint64_t granularity;
+    RamDiscardListener listener;
+    QLIST_ENTRY(VFIORamDiscardListener) next;
+} VFIORamDiscardListener;
+
+typedef struct VFIOHostDMAWindow {
+    hwaddr min_iova;
+    hwaddr max_iova;
+    uint64_t iova_pgsizes;
+    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
+} VFIOHostDMAWindow;
+
+/*
+ * This is the base object for vfio container backends
+ */
+struct VFIOContainer {
+    /* private */
+    Object parent_obj;
+
+    VFIOAddressSpace *space;
+    MemoryListener listener;
+    Error *error;
+    bool initialized;
+    bool dirty_pages_supported;
+    uint64_t dirty_pgsizes;
+    uint64_t max_dirty_bitmap_size;
+    unsigned long pgsizes;
+    unsigned int dma_max_mappings;
+    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
+    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
+    QLIST_ENTRY(VFIOContainer) next;
+};
+
+typedef struct VFIOContainerClass {
+    /* private */
+    ObjectClass parent_class;
+
+    /* required */
+    bool (*check_extension)(VFIOContainer *container,
+                            VFIOContainerFeature feat);
+    int (*dma_map)(VFIOContainer *container,
+                   hwaddr iova, ram_addr_t size,
+                   void *vaddr, bool readonly);
+    int (*dma_unmap)(VFIOContainer *container,
+                     hwaddr iova, ram_addr_t size,
+                     IOMMUTLBEntry *iotlb);
+    /* migration feature */
+    bool (*devices_all_dirty_tracking)(VFIOContainer *container);
+    void (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
+    int (*get_dirty_bitmap)(VFIOContainer *container, uint64_t iova,
+                            uint64_t size, ram_addr_t ram_addr);
+
+    /* SPAPR specific */
+    int (*add_window)(VFIOContainer *container,
+                      MemoryRegionSection *section,
+                      Error **errp);
+    void (*del_window)(VFIOContainer *container,
+                       MemoryRegionSection *section);
+} VFIOContainerClass;
+
+bool vfio_container_check_extension(VFIOContainer *container,
+                                    VFIOContainerFeature feat);
+int vfio_container_dma_map(VFIOContainer *container,
+                           hwaddr iova, ram_addr_t size,
+                           void *vaddr, bool readonly);
+int vfio_container_dma_unmap(VFIOContainer *container,
+                             hwaddr iova, ram_addr_t size,
+                             IOMMUTLBEntry *iotlb);
+bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
+void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
+                                            bool start);
+int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                    uint64_t size, ram_addr_t ram_addr);
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp);
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section);
+
+void vfio_container_init(void *_container, size_t instance_size,
+                         const char *mrtypename,
+                         VFIOAddressSpace *space);
+void vfio_container_destroy(VFIOContainer *container);
+#endif /* HW_VFIO_VFIO_CONTAINER_OBJ_H */
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 07/18] vfio: Add base object for VFIOContainer
@ 2022-04-14 10:46   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:46 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

QOMify the VFIOContainer object, which acts as the base class for a
container. This base class is derived into the legacy VFIO container
and, later in the series, into the new iommufd based container.

The base class implements generic code such as the memory listener
and address space management, whereas the derived classes implement
the callbacks that depend on the kernel userspace API being used.

'as.c' manipulates only the base class object, through wrapper
functions that dispatch to the right class callbacks. The existing
'container.c' code is converted to implement the legacy container
class functions.

The existing migration code works only with the legacy container,
and 'spapr.c' is not backend agnostic either.

Below is the new base object. It is named VFIOContainer; the old
VFIOContainer is renamed VFIOLegacyContainer.

struct VFIOContainer {
    /* private */
    Object parent_obj;

    VFIOAddressSpace *space;
    MemoryListener listener;
    Error *error;
    bool initialized;
    bool dirty_pages_supported;
    uint64_t dirty_pgsizes;
    uint64_t max_dirty_bitmap_size;
    unsigned long pgsizes;
    unsigned int dma_max_mappings;
    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
    QLIST_ENTRY(VFIOContainer) next;
};

struct VFIOLegacyContainer {
    VFIOContainer obj;
    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
    MemoryListener prereg_listener;
    unsigned iommu_type;
    QLIST_HEAD(, VFIOGroup) group_list;
};

Co-authored-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                         |  48 +++---
 hw/vfio/container-obj.c              | 195 +++++++++++++++++++++++
 hw/vfio/container.c                  | 224 ++++++++++++++++-----------
 hw/vfio/meson.build                  |   1 +
 hw/vfio/migration.c                  |   4 +-
 hw/vfio/pci.c                        |   4 +-
 hw/vfio/spapr.c                      |  22 +--
 include/hw/vfio/vfio-common.h        |  78 ++--------
 include/hw/vfio/vfio-container-obj.h | 154 ++++++++++++++++++
 9 files changed, 540 insertions(+), 190 deletions(-)
 create mode 100644 hw/vfio/container-obj.c
 create mode 100644 include/hw/vfio/vfio-container-obj.h

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 4181182808..37423d2c89 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -215,9 +215,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
          * of vaddr will always be there, even if the memory object is
          * destroyed and its backing memory munmap-ed.
          */
-        ret = vfio_dma_map(container, iova,
-                           iotlb->addr_mask + 1, vaddr,
-                           read_only);
+        ret = vfio_container_dma_map(container, iova,
+                                     iotlb->addr_mask + 1, vaddr,
+                                     read_only);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
@@ -225,7 +225,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          iotlb->addr_mask + 1, vaddr, ret);
         }
     } else {
-        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
+        ret = vfio_container_dma_unmap(container, iova,
+                                       iotlb->addr_mask + 1, iotlb);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -242,12 +243,13 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
+    VFIOContainer *container = vrdl->container;
     const hwaddr size = int128_get64(section->size);
     const hwaddr iova = section->offset_within_address_space;
     int ret;
 
     /* Unmap with a single call. */
-    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
+    ret = vfio_container_dma_unmap(container, iova, size, NULL);
     if (ret) {
         error_report("%s: vfio_dma_unmap() failed: %s", __func__,
                      strerror(-ret));
@@ -259,6 +261,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
+    VFIOContainer *container = vrdl->container;
     const hwaddr end = section->offset_within_region +
                        int128_get64(section->size);
     hwaddr start, next, iova;
@@ -277,8 +280,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
                section->offset_within_address_space;
         vaddr = memory_region_get_ram_ptr(section->mr) + start;
 
-        ret = vfio_dma_map(vrdl->container, iova, next - start,
-                           vaddr, section->readonly);
+        ret = vfio_container_dma_map(container, iova, next - start,
+                                     vaddr, section->readonly);
         if (ret) {
             /* Rollback */
             vfio_ram_discard_notify_discard(rdl, section);
@@ -530,8 +533,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
         }
     }
 
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
-                       vaddr, section->readonly);
+    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
+                                 vaddr, section->readonly);
     if (ret) {
         error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                    "0x%"HWADDR_PRIx", %p) = %d (%m)",
@@ -656,7 +659,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
         if (int128_eq(llsize, int128_2_64())) {
             /* The unmap ioctl doesn't accept a full 64-bit span. */
             llsize = int128_rshift(llsize, 1);
-            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+            ret = vfio_container_dma_unmap(container, iova,
+                                           int128_get64(llsize), NULL);
             if (ret) {
                 error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                              "0x%"HWADDR_PRIx") = %d (%m)",
@@ -664,7 +668,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
             }
             iova += int128_get64(llsize);
         }
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+        ret = vfio_container_dma_unmap(container, iova,
+                                       int128_get64(llsize), NULL);
         if (ret) {
             error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -681,14 +686,14 @@ static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-    vfio_set_dirty_page_tracking(container, true);
+    vfio_container_set_dirty_page_tracking(container, true);
 }
 
 static void vfio_listener_log_global_stop(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 
-    vfio_set_dirty_page_tracking(container, false);
+    vfio_container_set_dirty_page_tracking(container, false);
 }
 
 typedef struct {
@@ -717,8 +722,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
         int ret;
 
-        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
-                                    translated_addr);
+        ret = vfio_container_get_dirty_bitmap(container, iova,
+                                              iotlb->addr_mask + 1,
+                                              translated_addr);
         if (ret) {
             error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx") = %d (%m)",
@@ -742,11 +748,13 @@ static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
      * Sync the whole mapped region (spanning multiple individual mappings)
      * in one go.
      */
-    return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr);
+    return vfio_container_get_dirty_bitmap(vrdl->container, iova,
+                                           size, ram_addr);
 }
 
-static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
-                                                   MemoryRegionSection *section)
+static int
+vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
+                                            MemoryRegionSection *section)
 {
     RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
     VFIORamDiscardListener *vrdl = NULL;
@@ -810,7 +818,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
     ram_addr = memory_region_get_ram_addr(section->mr) +
                section->offset_within_region;
 
-    return vfio_get_dirty_bitmap(container,
+    return vfio_container_get_dirty_bitmap(container,
                    REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
                    int128_get64(section->size), ram_addr);
 }
@@ -825,7 +833,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
         return;
     }
 
-    if (vfio_devices_all_dirty_tracking(container)) {
+    if (vfio_container_devices_all_dirty_tracking(container)) {
         vfio_sync_dirty_bitmap(container, section);
     }
 }
diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
new file mode 100644
index 0000000000..40c1e2a2b5
--- /dev/null
+++ b/hw/vfio/container-obj.c
@@ -0,0 +1,195 @@
+/*
+ * VFIO CONTAINER BASE OBJECT
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qom/object.h"
+#include "qapi/visitor.h"
+#include "hw/vfio/vfio-container-obj.h"
+
+bool vfio_container_check_extension(VFIOContainer *container,
+                                    VFIOContainerFeature feat)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->check_extension) {
+        return false;
+    }
+
+    return vccs->check_extension(container, feat);
+}
+
+int vfio_container_dma_map(VFIOContainer *container,
+                           hwaddr iova, ram_addr_t size,
+                           void *vaddr, bool readonly)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->dma_map) {
+        return -EINVAL;
+    }
+
+    return vccs->dma_map(container, iova, size, vaddr, readonly);
+}
+
+int vfio_container_dma_unmap(VFIOContainer *container,
+                             hwaddr iova, ram_addr_t size,
+                             IOMMUTLBEntry *iotlb)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->dma_unmap) {
+        return -EINVAL;
+    }
+
+    return vccs->dma_unmap(container, iova, size, iotlb);
+}
+
+void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
+                                            bool start)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->set_dirty_page_tracking) {
+        return;
+    }
+
+    vccs->set_dirty_page_tracking(container, start);
+}
+
+bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->devices_all_dirty_tracking) {
+        return false;
+    }
+
+    return vccs->devices_all_dirty_tracking(container);
+}
+
+int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                    uint64_t size, ram_addr_t ram_addr)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->get_dirty_bitmap) {
+        return -EINVAL;
+    }
+
+    return vccs->get_dirty_bitmap(container, iova, size, ram_addr);
+}
+
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->add_window) {
+        return 0;
+    }
+
+    return vccs->add_window(container, section, errp);
+}
+
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->del_window) {
+        return;
+    }
+
+    vccs->del_window(container, section);
+}
+
+void vfio_container_init(void *_container, size_t instance_size,
+                         const char *mrtypename,
+                         VFIOAddressSpace *space)
+{
+    VFIOContainer *container;
+
+    object_initialize(_container, instance_size, mrtypename);
+    container = VFIO_CONTAINER_OBJ(_container);
+
+    container->space = space;
+    container->error = NULL;
+    container->dirty_pages_supported = false;
+    container->dma_max_mappings = 0;
+    QLIST_INIT(&container->giommu_list);
+    QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->vrdl_list);
+}
+
+void vfio_container_destroy(VFIOContainer *container)
+{
+    VFIORamDiscardListener *vrdl, *vrdl_tmp;
+    VFIOGuestIOMMU *giommu, *tmp;
+    VFIOHostDMAWindow *hostwin, *next;
+
+    QLIST_SAFE_REMOVE(container, next);
+
+    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
+        RamDiscardManager *rdm;
+
+        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
+        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
+        QLIST_REMOVE(vrdl, next);
+        g_free(vrdl);
+    }
+
+    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+        memory_region_unregister_iommu_notifier(
+                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
+        QLIST_REMOVE(giommu, giommu_next);
+        g_free(giommu);
+    }
+
+    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+                       next) {
+        QLIST_REMOVE(hostwin, hostwin_next);
+        g_free(hostwin);
+    }
+
+    object_unref(&container->parent_obj);
+}
+
+static const TypeInfo vfio_container_info = {
+    .parent             = TYPE_OBJECT,
+    .name               = TYPE_VFIO_CONTAINER_OBJ,
+    .class_size         = sizeof(VFIOContainerClass),
+    .instance_size      = sizeof(VFIOContainer),
+    .abstract           = true,
+};
+
+static void vfio_container_register_types(void)
+{
+    type_register_static(&vfio_container_info);
+}
+
+type_init(vfio_container_register_types)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 9c665c1720..79972064d3 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -50,6 +50,8 @@
 static int vfio_kvm_device_fd = -1;
 #endif
 
+#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
+
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
 
@@ -76,8 +78,10 @@ bool vfio_mig_active(void)
     return true;
 }
 
-bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
+static bool vfio_devices_all_dirty_tracking(VFIOContainer *bcontainer)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     VFIOGroup *group;
     VFIODevice *vbasedev;
     MigrationState *ms = migrate_get_current();
@@ -103,7 +107,7 @@ bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
     return true;
 }
 
-bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+static bool vfio_devices_all_running_and_saving(VFIOLegacyContainer *container)
 {
     VFIOGroup *group;
     VFIODevice *vbasedev;
@@ -132,10 +136,11 @@ bool vfio_devices_all_running_and_saving(VFIOContainer *container)
     return true;
 }
 
-static int vfio_dma_unmap_bitmap(VFIOContainer *container,
+static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
                                  hwaddr iova, ram_addr_t size,
                                  IOMMUTLBEntry *iotlb)
 {
+    VFIOContainer *bcontainer = &container->obj;
     struct vfio_iommu_type1_dma_unmap *unmap;
     struct vfio_bitmap *bitmap;
     uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
@@ -159,7 +164,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
     bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
                    BITS_PER_BYTE;
 
-    if (bitmap->size > container->max_dirty_bitmap_size) {
+    if (bitmap->size > bcontainer->max_dirty_bitmap_size) {
         error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
                      (uint64_t)bitmap->size);
         ret = -E2BIG;
@@ -189,10 +194,12 @@ unmap_exit:
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
-int vfio_dma_unmap(VFIOContainer *container,
-                   hwaddr iova, ram_addr_t size,
-                   IOMMUTLBEntry *iotlb)
+static int vfio_dma_unmap(VFIOContainer *bcontainer,
+                          hwaddr iova, ram_addr_t size,
+                          IOMMUTLBEntry *iotlb)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
         .flags = 0,
@@ -200,7 +207,7 @@ int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
-    if (iotlb && container->dirty_pages_supported &&
+    if (iotlb && bcontainer->dirty_pages_supported &&
         vfio_devices_all_running_and_saving(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
@@ -221,7 +228,7 @@ int vfio_dma_unmap(VFIOContainer *container,
         if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
             container->iommu_type == VFIO_TYPE1v2_IOMMU) {
             trace_vfio_dma_unmap_overflow_workaround();
-            unmap.size -= 1ULL << ctz64(container->pgsizes);
+            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
             continue;
         }
         error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
@@ -231,9 +238,22 @@ int vfio_dma_unmap(VFIOContainer *container,
     return 0;
 }
 
-int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                 ram_addr_t size, void *vaddr, bool readonly)
+static bool vfio_legacy_container_check_extension(VFIOContainer *bcontainer,
+                                                  VFIOContainerFeature feat)
 {
+    switch (feat) {
+    case VFIO_FEAT_LIVE_MIGRATION:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static int vfio_dma_map(VFIOContainer *bcontainer, hwaddr iova,
+                       ram_addr_t size, void *vaddr, bool readonly)
+{
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     struct vfio_iommu_type1_dma_map map = {
         .argsz = sizeof(map),
         .flags = VFIO_DMA_MAP_FLAG_READ,
@@ -252,7 +272,7 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
      * the VGA ROM space.
      */
     if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
-        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
+        (errno == EBUSY && vfio_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
          ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
         return 0;
     }
@@ -261,8 +281,10 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
-void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
+static void vfio_set_dirty_page_tracking(VFIOContainer *bcontainer, bool start)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     int ret;
     struct vfio_iommu_type1_dirty_bitmap dirty = {
         .argsz = sizeof(dirty),
@@ -281,9 +303,11 @@ void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     }
 }
 
-int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                          uint64_t size, ram_addr_t ram_addr)
+static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
+                                 uint64_t size, ram_addr_t ram_addr)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
     struct vfio_iommu_type1_dirty_bitmap_get *range;
     uint64_t pages;
@@ -333,18 +357,23 @@ err_out:
     return ret;
 }
 
-static void vfio_listener_release(VFIOContainer *container)
+static void vfio_listener_release(VFIOLegacyContainer *container)
 {
-    memory_listener_unregister(&container->listener);
+    VFIOContainer *bcontainer = &container->obj;
+
+    memory_listener_unregister(&bcontainer->listener);
     if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
         memory_listener_unregister(&container->prereg_listener);
     }
 }
 
-int vfio_container_add_section_window(VFIOContainer *container,
-                                      MemoryRegionSection *section,
-                                      Error **errp)
+static int
+vfio_legacy_container_add_section_window(VFIOContainer *bcontainer,
+                                         MemoryRegionSection *section,
+                                         Error **errp)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     VFIOHostDMAWindow *hostwin;
     hwaddr pgsize = 0;
     int ret;
@@ -354,7 +383,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
     }
 
     /* For now intersections are not allowed, we may relax this later */
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
         if (ranges_overlap(hostwin->min_iova,
                            hostwin->max_iova - hostwin->min_iova + 1,
                            section->offset_within_address_space,
@@ -376,7 +405,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
         return ret;
     }
 
-    vfio_host_win_add(container, section->offset_within_address_space,
+    vfio_host_win_add(bcontainer, section->offset_within_address_space,
                       section->offset_within_address_space +
                       int128_get64(section->size) - 1, pgsize);
 #ifdef CONFIG_KVM
@@ -409,16 +438,20 @@ int vfio_container_add_section_window(VFIOContainer *container,
     return 0;
 }
 
-void vfio_container_del_section_window(VFIOContainer *container,
-                                       MemoryRegionSection *section)
+static void
+vfio_legacy_container_del_section_window(VFIOContainer *bcontainer,
+                                         MemoryRegionSection *section)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
+
     if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
         return;
     }
 
     vfio_spapr_remove_window(container,
                              section->offset_within_address_space);
-    if (vfio_host_win_del(container,
+    if (vfio_host_win_del(bcontainer,
                           section->offset_within_address_space,
                           section->offset_within_address_space +
                           int128_get64(section->size) - 1) < 0) {
@@ -505,7 +538,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
 /*
  * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
  */
-static int vfio_get_iommu_type(VFIOContainer *container,
+static int vfio_get_iommu_type(VFIOLegacyContainer *container,
                                Error **errp)
 {
     int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
@@ -521,7 +554,7 @@ static int vfio_get_iommu_type(VFIOContainer *container,
     return -EINVAL;
 }
 
-static int vfio_init_container(VFIOContainer *container, int group_fd,
+static int vfio_init_container(VFIOLegacyContainer *container, int group_fd,
                                Error **errp)
 {
     int iommu_type, ret;
@@ -556,7 +589,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     return 0;
 }
 
-static int vfio_get_iommu_info(VFIOContainer *container,
+static int vfio_get_iommu_info(VFIOLegacyContainer *container,
                                struct vfio_iommu_type1_info **info)
 {
 
@@ -600,11 +633,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
     return NULL;
 }
 
-static void vfio_get_iommu_info_migration(VFIOContainer *container,
-                                         struct vfio_iommu_type1_info *info)
+static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container,
+                                          struct vfio_iommu_type1_info *info)
 {
     struct vfio_info_cap_header *hdr;
     struct vfio_iommu_type1_info_cap_migration *cap_mig;
+    VFIOContainer *bcontainer = &container->obj;
 
     hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
     if (!hdr) {
@@ -619,13 +653,14 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container,
      * qemu_real_host_page_size to mark those dirty.
      */
     if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) {
-        container->dirty_pages_supported = true;
-        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
-        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
+        bcontainer->dirty_pages_supported = true;
+        bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
+        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
     }
 }
 
-static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
+static int
+vfio_ram_block_discard_disable(VFIOLegacyContainer *container, bool state)
 {
     switch (container->iommu_type) {
     case VFIO_TYPE1v2_IOMMU:
@@ -651,7 +686,8 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                                   Error **errp)
 {
-    VFIOContainer *container;
+    VFIOContainer *bcontainer;
+    VFIOLegacyContainer *container;
     int ret, fd;
     VFIOAddressSpace *space;
 
@@ -688,7 +724,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * details once we know which type of IOMMU we are using.
      */
 
-    QLIST_FOREACH(container, &space->containers, next) {
+    QLIST_FOREACH(bcontainer, &space->containers, next) {
+        container = container_of(bcontainer, VFIOLegacyContainer, obj);
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             ret = vfio_ram_block_discard_disable(container, true);
             if (ret) {
@@ -724,14 +761,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     container = g_malloc0(sizeof(*container));
-    container->space = space;
     container->fd = fd;
-    container->error = NULL;
-    container->dirty_pages_supported = false;
-    container->dma_max_mappings = 0;
-    QLIST_INIT(&container->giommu_list);
-    QLIST_INIT(&container->hostwin_list);
-    QLIST_INIT(&container->vrdl_list);
+    bcontainer = &container->obj;
+    vfio_container_init(bcontainer, sizeof(*bcontainer),
+                        TYPE_VFIO_LEGACY_CONTAINER, space);
 
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
@@ -763,13 +796,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             /* Assume 4k IOVA page size */
             info->iova_pgsizes = 4096;
         }
-        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
-        container->pgsizes = info->iova_pgsizes;
+        vfio_host_win_add(bcontainer, 0, (hwaddr)-1, info->iova_pgsizes);
+        bcontainer->pgsizes = info->iova_pgsizes;
 
         /* The default in the kernel ("dma_entry_limit") is 65535. */
-        container->dma_max_mappings = 65535;
+        bcontainer->dma_max_mappings = 65535;
         if (!ret) {
-            vfio_get_info_dma_avail(info, &container->dma_max_mappings);
+            vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings);
             vfio_get_iommu_info_migration(container, info);
         }
         g_free(info);
@@ -798,10 +831,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
             memory_listener_register(&container->prereg_listener,
                                      &address_space_memory);
-            if (container->error) {
+            if (bcontainer->error) {
                 memory_listener_unregister(&container->prereg_listener);
                 ret = -1;
-                error_propagate_prepend(errp, container->error,
+                error_propagate_prepend(errp, bcontainer->error,
                     "RAM memory listener initialization failed: ");
                 goto enable_discards_exit;
             }
@@ -820,7 +853,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
 
         if (v2) {
-            container->pgsizes = info.ddw.pgsizes;
+            bcontainer->pgsizes = info.ddw.pgsizes;
             /*
              * There is a default window in just created container.
              * To make region_add/del simpler, we better remove this
@@ -835,8 +868,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             }
         } else {
             /* The default table uses 4K pages */
-            container->pgsizes = 0x1000;
-            vfio_host_win_add(container, info.dma32_window_start,
+            bcontainer->pgsizes = 0x1000;
+            vfio_host_win_add(bcontainer, info.dma32_window_start,
                               info.dma32_window_start +
                               info.dma32_window_size - 1,
                               0x1000);
@@ -847,28 +880,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     vfio_kvm_device_add_group(group);
 
     QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&space->containers, container, next);
+    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
 
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
-    container->listener = vfio_memory_listener;
+    bcontainer->listener = vfio_memory_listener;
 
-    memory_listener_register(&container->listener, container->space->as);
+    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
 
-    if (container->error) {
+    if (bcontainer->error) {
         ret = -1;
-        error_propagate_prepend(errp, container->error,
+        error_propagate_prepend(errp, bcontainer->error,
             "memory listener initialization failed: ");
         goto listener_release_exit;
     }
 
-    container->initialized = true;
+    bcontainer->initialized = true;
 
     return 0;
 listener_release_exit:
     QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(container, next);
+    QLIST_REMOVE(bcontainer, next);
     vfio_kvm_device_del_group(group);
     vfio_listener_release(container);
 
@@ -889,7 +922,8 @@ put_space_exit:
 
 static void vfio_disconnect_container(VFIOGroup *group)
 {
-    VFIOContainer *container = group->container;
+    VFIOLegacyContainer *container = group->container;
+    VFIOContainer *bcontainer = &container->obj;
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
@@ -909,25 +943,9 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 
     if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = container->space;
-        VFIOGuestIOMMU *giommu, *tmp;
-        VFIOHostDMAWindow *hostwin, *next;
-
-        QLIST_REMOVE(container, next);
-
-        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(
-                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
-            QLIST_REMOVE(giommu, giommu_next);
-            g_free(giommu);
-        }
-
-        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
-                           next) {
-            QLIST_REMOVE(hostwin, hostwin_next);
-            g_free(hostwin);
-        }
+        VFIOAddressSpace *space = bcontainer->space;
 
+        vfio_container_destroy(bcontainer);
         trace_vfio_disconnect_container(container->fd);
         close(container->fd);
         g_free(container);
@@ -939,13 +957,15 @@ static void vfio_disconnect_container(VFIOGroup *group)
 VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
+    VFIOContainer *bcontainer;
     char path[32];
     struct vfio_group_status status = { .argsz = sizeof(status) };
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
         if (group->groupid == groupid) {
             /* Found it.  Now is it already in the right context? */
-            if (group->container->space->as == as) {
+            bcontainer = &group->container->obj;
+            if (bcontainer->space->as == as) {
                 return group;
             } else {
                 error_setg(errp, "group %d used in multiple address spaces",
@@ -1098,7 +1118,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
-static bool vfio_eeh_container_ok(VFIOContainer *container)
+static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
 {
     /*
      * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
@@ -1126,7 +1146,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container)
     return true;
 }
 
-static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
+static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op)
 {
     struct vfio_eeh_pe_op pe_op = {
         .argsz = sizeof(pe_op),
@@ -1149,19 +1169,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
     return ret;
 }
 
-static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
+static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
 {
     VFIOAddressSpace *space = vfio_get_address_space(as);
-    VFIOContainer *container = NULL;
+    VFIOLegacyContainer *container = NULL;
+    VFIOContainer *bcontainer = NULL;
 
     if (QLIST_EMPTY(&space->containers)) {
         /* No containers to act on */
         goto out;
     }
 
-    container = QLIST_FIRST(&space->containers);
+    bcontainer = QLIST_FIRST(&space->containers);
+    container = container_of(bcontainer, VFIOLegacyContainer, obj);
 
-    if (QLIST_NEXT(container, next)) {
+    if (QLIST_NEXT(bcontainer, next)) {
         /*
          * We don't yet have logic to synchronize EEH state across
          * multiple containers.
@@ -1177,17 +1199,45 @@ out:
 
 bool vfio_eeh_as_ok(AddressSpace *as)
 {
-    VFIOContainer *container = vfio_eeh_as_container(as);
+    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
 
     return (container != NULL) && vfio_eeh_container_ok(container);
 }
 
 int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
 {
-    VFIOContainer *container = vfio_eeh_as_container(as);
+    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
 
     if (!container) {
         return -ENODEV;
     }
     return vfio_eeh_container_op(container, op);
 }
+
+static void vfio_legacy_container_class_init(ObjectClass *klass,
+                                             void *data)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass);
+
+    vccs->dma_map = vfio_dma_map;
+    vccs->dma_unmap = vfio_dma_unmap;
+    vccs->devices_all_dirty_tracking = vfio_devices_all_dirty_tracking;
+    vccs->set_dirty_page_tracking = vfio_set_dirty_page_tracking;
+    vccs->get_dirty_bitmap = vfio_get_dirty_bitmap;
+    vccs->add_window = vfio_legacy_container_add_section_window;
+    vccs->del_window = vfio_legacy_container_del_section_window;
+    vccs->check_extension = vfio_legacy_container_check_extension;
+}
+
+static const TypeInfo vfio_legacy_container_info = {
+    .parent = TYPE_VFIO_CONTAINER_OBJ,
+    .name = TYPE_VFIO_LEGACY_CONTAINER,
+    .class_init = vfio_legacy_container_class_init,
+};
+
+static void vfio_register_types(void)
+{
+    type_register_static(&vfio_legacy_container_info);
+}
+
+type_init(vfio_register_types)
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index e3b6d6e2cb..df4fa2b695 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
 vfio_ss.add(files(
   'common.c',
   'as.c',
+  'container-obj.c',
   'container.c',
   'spapr.c',
   'migration.c',
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ff6b45de6b..cbbde177c3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -856,11 +856,11 @@ int64_t vfio_mig_bytes_transferred(void)
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
 {
-    VFIOContainer *container = vbasedev->group->container;
+    VFIOLegacyContainer *container = vbasedev->group->container;
     struct vfio_region_info *info = NULL;
     int ret = -ENOTSUP;
 
-    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+    if (!vbasedev->enable_migration || !container->obj.dirty_pages_supported) {
         goto add_blocker;
     }
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e707329394..a00a485e46 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3101,7 +3101,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
-    if (!pdev->failover_pair_id) {
+    if (!pdev->failover_pair_id &&
+        vfio_container_check_extension(&vbasedev->group->container->obj,
+                                       VFIO_FEAT_LIVE_MIGRATION)) {
         ret = vfio_migration_probe(vbasedev, errp);
         if (ret) {
             error_report("%s: Migration disabled", vbasedev->name);
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 04c6e67f8f..cdcd9e05ba 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
 static void vfio_prereg_listener_region_add(MemoryListener *listener,
                                             MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            prereg_listener);
+    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
+                                                  prereg_listener);
     const hwaddr gpa = section->offset_within_address_space;
     hwaddr end;
     int ret;
@@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
          * can gracefully fail.  Runtime, there's not much we can do other
          * than throw a hardware error.
          */
-        if (!container->initialized) {
-            if (!container->error) {
-                error_setg_errno(&container->error, -ret,
+        if (!container->obj.initialized) {
+            if (!container->obj.error) {
+                error_setg_errno(&container->obj.error, -ret,
                                  "Memory registering failed");
             }
         } else {
@@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
 static void vfio_prereg_listener_region_del(MemoryListener *listener,
                                             MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            prereg_listener);
+    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
+                                                  prereg_listener);
     const hwaddr gpa = section->offset_within_address_space;
     hwaddr end;
     int ret;
@@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
     .region_del = vfio_prereg_listener_region_del,
 };
 
-int vfio_spapr_create_window(VFIOContainer *container,
+int vfio_spapr_create_window(VFIOLegacyContainer *container,
                              MemoryRegionSection *section,
                              hwaddr *pgsize)
 {
@@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container,
     if (pagesize > rampagesize) {
         pagesize = rampagesize;
     }
-    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
+    pgmask = container->obj.pgsizes & (pagesize | (pagesize - 1));
     pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
     if (!pagesize) {
         error_report("Host doesn't support page size 0x%"PRIx64
                      ", the supported mask is 0x%lx",
                      memory_region_iommu_get_min_page_size(iommu_mr),
-                     container->pgsizes);
+                     container->obj.pgsizes);
         return -EINVAL;
     }
 
@@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
     return 0;
 }
 
-int vfio_spapr_remove_window(VFIOContainer *container,
+int vfio_spapr_remove_window(VFIOLegacyContainer *container,
                              hwaddr offset_within_address_space)
 {
     struct vfio_iommu_spapr_tce_remove remove = {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 03ff7944cb..02a6f36a9e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -30,6 +30,7 @@
 #include <linux/vfio.h>
 #endif
 #include "sysemu/sysemu.h"
+#include "hw/vfio/vfio-container-obj.h"
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
@@ -70,58 +71,15 @@ typedef struct VFIOMigration {
     uint64_t pending_bytes;
 } VFIOMigration;
 
-typedef struct VFIOAddressSpace {
-    AddressSpace *as;
-    QLIST_HEAD(, VFIOContainer) containers;
-    QLIST_ENTRY(VFIOAddressSpace) list;
-} VFIOAddressSpace;
-
 struct VFIOGroup;
 
-typedef struct VFIOContainer {
-    VFIOAddressSpace *space;
+typedef struct VFIOLegacyContainer {
+    VFIOContainer obj;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
-    MemoryListener listener;
     MemoryListener prereg_listener;
     unsigned iommu_type;
-    Error *error;
-    bool initialized;
-    bool dirty_pages_supported;
-    uint64_t dirty_pgsizes;
-    uint64_t max_dirty_bitmap_size;
-    unsigned long pgsizes;
-    unsigned int dma_max_mappings;
-    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
-    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
-    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
-    QLIST_ENTRY(VFIOContainer) next;
-} VFIOContainer;
-
-typedef struct VFIOGuestIOMMU {
-    VFIOContainer *container;
-    IOMMUMemoryRegion *iommu_mr;
-    hwaddr iommu_offset;
-    IOMMUNotifier n;
-    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
-} VFIOGuestIOMMU;
-
-typedef struct VFIORamDiscardListener {
-    VFIOContainer *container;
-    MemoryRegion *mr;
-    hwaddr offset_within_address_space;
-    hwaddr size;
-    uint64_t granularity;
-    RamDiscardListener listener;
-    QLIST_ENTRY(VFIORamDiscardListener) next;
-} VFIORamDiscardListener;
-
-typedef struct VFIOHostDMAWindow {
-    hwaddr min_iova;
-    hwaddr max_iova;
-    uint64_t iova_pgsizes;
-    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
-} VFIOHostDMAWindow;
+} VFIOLegacyContainer;
 
 typedef struct VFIODeviceOps VFIODeviceOps;
 
@@ -159,7 +117,7 @@ struct VFIODeviceOps {
 typedef struct VFIOGroup {
     int fd;
     int groupid;
-    VFIOContainer *container;
+    VFIOLegacyContainer *container;
     QLIST_HEAD(, VFIODevice) device_list;
     QLIST_ENTRY(VFIOGroup) next;
     QLIST_ENTRY(VFIOGroup) container_next;
@@ -192,31 +150,13 @@ typedef struct VFIODisplay {
     } dmabuf;
 } VFIODisplay;
 
-void vfio_host_win_add(VFIOContainer *container,
+void vfio_host_win_add(VFIOContainer *bcontainer,
                        hwaddr min_iova, hwaddr max_iova,
                        uint64_t iova_pgsizes);
-int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
+int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
                       hwaddr max_iova);
 VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
 void vfio_put_address_space(VFIOAddressSpace *space);
-bool vfio_devices_all_running_and_saving(VFIOContainer *container);
-bool vfio_devices_all_dirty_tracking(VFIOContainer *container);
-
-/* container->fd */
-int vfio_dma_unmap(VFIOContainer *container,
-                   hwaddr iova, ram_addr_t size,
-                   IOMMUTLBEntry *iotlb);
-int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                 ram_addr_t size, void *vaddr, bool readonly);
-void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
-int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
-                          uint64_t size, ram_addr_t ram_addr);
-
-int vfio_container_add_section_window(VFIOContainer *container,
-                                      MemoryRegionSection *section,
-                                      Error **errp);
-void vfio_container_del_section_window(VFIOContainer *container,
-                                       MemoryRegionSection *section);
 
 void vfio_put_base_device(VFIODevice *vbasedev);
 void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
@@ -263,10 +203,10 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
-int vfio_spapr_create_window(VFIOContainer *container,
+int vfio_spapr_create_window(VFIOLegacyContainer *container,
                              MemoryRegionSection *section,
                              hwaddr *pgsize);
-int vfio_spapr_remove_window(VFIOContainer *container,
+int vfio_spapr_remove_window(VFIOLegacyContainer *container,
                              hwaddr offset_within_address_space);
 
 int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
new file mode 100644
index 0000000000..7ffbbb299f
--- /dev/null
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -0,0 +1,154 @@
+/*
+ * VFIO CONTAINER BASE OBJECT
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_VFIO_VFIO_CONTAINER_OBJ_H
+#define HW_VFIO_VFIO_CONTAINER_OBJ_H
+
+#include "qom/object.h"
+#include "exec/memory.h"
+#include "qemu/queue.h"
+#include "qemu/thread.h"
+#ifndef CONFIG_USER_ONLY
+#include "exec/hwaddr.h"
+#endif
+
+#define TYPE_VFIO_CONTAINER_OBJ "qemu:vfio-base-container-obj"
+#define VFIO_CONTAINER_OBJ(obj) \
+        OBJECT_CHECK(VFIOContainer, (obj), TYPE_VFIO_CONTAINER_OBJ)
+#define VFIO_CONTAINER_OBJ_CLASS(klass) \
+        OBJECT_CLASS_CHECK(VFIOContainerClass, (klass), \
+                         TYPE_VFIO_CONTAINER_OBJ)
+#define VFIO_CONTAINER_OBJ_GET_CLASS(obj) \
+        OBJECT_GET_CLASS(VFIOContainerClass, (obj), \
+                         TYPE_VFIO_CONTAINER_OBJ)
+
+typedef enum VFIOContainerFeature {
+    VFIO_FEAT_LIVE_MIGRATION,
+} VFIOContainerFeature;
+
+typedef struct VFIOContainer VFIOContainer;
+
+typedef struct VFIOAddressSpace {
+    AddressSpace *as;
+    QLIST_HEAD(, VFIOContainer) containers;
+    QLIST_ENTRY(VFIOAddressSpace) list;
+} VFIOAddressSpace;
+
+typedef struct VFIOGuestIOMMU {
+    VFIOContainer *container;
+    IOMMUMemoryRegion *iommu_mr;
+    hwaddr iommu_offset;
+    IOMMUNotifier n;
+    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
+} VFIOGuestIOMMU;
+
+typedef struct VFIORamDiscardListener {
+    VFIOContainer *container;
+    MemoryRegion *mr;
+    hwaddr offset_within_address_space;
+    hwaddr size;
+    uint64_t granularity;
+    RamDiscardListener listener;
+    QLIST_ENTRY(VFIORamDiscardListener) next;
+} VFIORamDiscardListener;
+
+typedef struct VFIOHostDMAWindow {
+    hwaddr min_iova;
+    hwaddr max_iova;
+    uint64_t iova_pgsizes;
+    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
+} VFIOHostDMAWindow;
+
+/*
+ * This is the base object for vfio container backends
+ */
+struct VFIOContainer {
+    /* private */
+    Object parent_obj;
+
+    VFIOAddressSpace *space;
+    MemoryListener listener;
+    Error *error;
+    bool initialized;
+    bool dirty_pages_supported;
+    uint64_t dirty_pgsizes;
+    uint64_t max_dirty_bitmap_size;
+    unsigned long pgsizes;
+    unsigned int dma_max_mappings;
+    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
+    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
+    QLIST_ENTRY(VFIOContainer) next;
+};
+
+typedef struct VFIOContainerClass {
+    /* private */
+    ObjectClass parent_class;
+
+    /* required */
+    bool (*check_extension)(VFIOContainer *container,
+                            VFIOContainerFeature feat);
+    int (*dma_map)(VFIOContainer *container,
+                   hwaddr iova, ram_addr_t size,
+                   void *vaddr, bool readonly);
+    int (*dma_unmap)(VFIOContainer *container,
+                     hwaddr iova, ram_addr_t size,
+                     IOMMUTLBEntry *iotlb);
+    /* migration feature */
+    bool (*devices_all_dirty_tracking)(VFIOContainer *container);
+    void (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
+    int (*get_dirty_bitmap)(VFIOContainer *container, uint64_t iova,
+                            uint64_t size, ram_addr_t ram_addr);
+
+    /* SPAPR specific */
+    int (*add_window)(VFIOContainer *container,
+                      MemoryRegionSection *section,
+                      Error **errp);
+    void (*del_window)(VFIOContainer *container,
+                       MemoryRegionSection *section);
+} VFIOContainerClass;
+
+bool vfio_container_check_extension(VFIOContainer *container,
+                                    VFIOContainerFeature feat);
+int vfio_container_dma_map(VFIOContainer *container,
+                           hwaddr iova, ram_addr_t size,
+                           void *vaddr, bool readonly);
+int vfio_container_dma_unmap(VFIOContainer *container,
+                             hwaddr iova, ram_addr_t size,
+                             IOMMUTLBEntry *iotlb);
+bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
+void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
+                                            bool start);
+int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
+                                    uint64_t size, ram_addr_t ram_addr);
+int vfio_container_add_section_window(VFIOContainer *container,
+                                      MemoryRegionSection *section,
+                                      Error **errp);
+void vfio_container_del_section_window(VFIOContainer *container,
+                                       MemoryRegionSection *section);
+
+void vfio_container_init(void *_container, size_t instance_size,
+                         const char *mrtypename,
+                         VFIOAddressSpace *space);
+void vfio_container_destroy(VFIOContainer *container);
+#endif /* HW_VFIO_VFIO_CONTAINER_OBJ_H */
-- 
2.27.0




* [RFC 08/18] vfio/container: Introduce vfio_[attach/detach]_device
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

We want the VFIO devices to be able to use two different
IOMMU callbacks, the legacy VFIO one and the new iommufd one.

Introduce vfio_[attach/detach]_device, which aim to hide the
underlying IOMMU backend (IOCTLs, datatypes, ...).

Once vfio_attach_device completes, the device is attached
to a security context and its fd can be used. Conversely,
once vfio_detach_device completes, the device has been
detached from the security context.

In this patch, only the vfio-pci device gets converted to use
the new API. Subsequent patches will handle other devices.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/container.c           | 65 +++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 50 +++------------------------
 include/hw/vfio/vfio-common.h |  2 ++
 3 files changed, 72 insertions(+), 45 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 79972064d3..c74a3cd4ae 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -1214,6 +1214,71 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
     return vfio_eeh_container_op(container, op);
 }
 
+static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
+{
+    char *tmp, group_path[PATH_MAX], *group_name;
+    int ret, groupid;
+    ssize_t len;
+
+    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
+    len = readlink(tmp, group_path, sizeof(group_path));
+    g_free(tmp);
+
+    if (len <= 0 || len >= sizeof(group_path)) {
+        ret = len < 0 ? -errno : -ENAMETOOLONG;
+        error_setg_errno(errp, -ret, "no iommu_group found");
+        return ret;
+    }
+
+    group_path[len] = 0;
+
+    group_name = basename(group_path);
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_setg_errno(errp, errno, "failed to read %s", group_path);
+        return -errno;
+    }
+    return groupid;
+}
+
+int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
+{
+    int groupid = vfio_device_groupid(vbasedev, errp);
+    VFIODevice *vbasedev_iter;
+    VFIOGroup *group;
+    int ret;
+
+    if (groupid < 0) {
+        return groupid;
+    }
+
+    trace_vfio_realize(vbasedev->name, groupid);
+    group = vfio_get_group(groupid, as, errp);
+    if (!group) {
+        return -1;
+    }
+
+    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
+            error_setg(errp, "device is already attached");
+            vfio_put_group(group);
+            return -1;
+        }
+    }
+    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+    if (ret) {
+        vfio_put_group(group);
+        return -1;
+    }
+
+    return 0;
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+    vfio_put_base_device(vbasedev);
+    vfio_put_group(vbasedev->group);
+}
+
 static void vfio_legacy_container_class_init(ObjectClass *klass,
                                              void *data)
 {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a00a485e46..0363f81017 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2654,10 +2654,9 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 
 static void vfio_put_device(VFIOPCIDevice *vdev)
 {
-    g_free(vdev->vbasedev.name);
     g_free(vdev->msix);
 
-    vfio_put_base_device(&vdev->vbasedev);
+    vfio_detach_device(&vdev->vbasedev);
 }
 
 static void vfio_err_notifier_handler(void *opaque)
@@ -2804,13 +2803,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
-    VFIODevice *vbasedev_iter;
-    VFIOGroup *group;
-    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
+    char *tmp, *subsys;
     Error *err = NULL;
-    ssize_t len;
     struct stat st;
-    int groupid;
     int i, ret;
     bool is_mdev;
 
@@ -2839,39 +2834,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vbasedev->type = VFIO_DEVICE_TYPE_PCI;
     vbasedev->dev = DEVICE(vdev);
 
-    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len <= 0 || len >= sizeof(group_path)) {
-        error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG,
-                         "no iommu_group found");
-        goto error;
-    }
-
-    group_path[len] = 0;
-
-    group_name = basename(group_path);
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_setg_errno(errp, errno, "failed to read %s", group_path);
-        goto error;
-    }
-
-    trace_vfio_realize(vbasedev->name, groupid);
-
-    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
-    if (!group) {
-        goto error;
-    }
-
-    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-            error_setg(errp, "device is already attached");
-            vfio_put_group(group);
-            goto error;
-        }
-    }
-
     /*
      * Mediated devices *might* operate compatibly with discarding of RAM, but
      * we cannot know for certain, it depends on whether the mdev vendor driver
@@ -2889,13 +2851,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (vbasedev->ram_block_discard_allowed && !is_mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
-        vfio_put_group(group);
         goto error;
     }
 
-    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+    ret = vfio_attach_device(vbasedev,
+                             pci_device_iommu_address_space(pdev), errp);
     if (ret) {
-        vfio_put_group(group);
         goto error;
     }
 
@@ -3124,12 +3085,12 @@ out_teardown:
     vfio_bars_exit(vdev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
+    vfio_detach_device(vbasedev);
 }
 
 static void vfio_instance_finalize(Object *obj)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(obj);
-    VFIOGroup *group = vdev->vbasedev.group;
 
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
@@ -3143,7 +3104,6 @@ static void vfio_instance_finalize(Object *obj)
      * g_free(vdev->igd_opregion);
      */
     vfio_put_device(vdev);
-    vfio_put_group(group);
 }
 
 static void vfio_exitfn(PCIDevice *pdev)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 02a6f36a9e..978b2c2f6e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -180,6 +180,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp);
+void vfio_detach_device(VFIODevice *vbasedev);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 08/18] vfio/container: Introduce vfio_[attach/detach]_device
@ 2022-04-14 10:47   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

From: Eric Auger <eric.auger@redhat.com>

We want the VFIO devices to be able to use two different
IOMMU backends: the legacy VFIO one and the new iommufd one.

Introduce vfio_attach_device() and vfio_detach_device(),
which aim at hiding the underlying IOMMU backend
(ioctls, data types, ...).

Once vfio_attach_device() completes, the device is attached
to a security context and its fd can be used. Conversely,
when vfio_detach_device() completes, the device has been
detached from the security context.

In this patch, only the vfio-pci device gets converted to use
the new API. Subsequent patches will handle other devices.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/container.c           | 65 +++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 50 +++------------------------
 include/hw/vfio/vfio-common.h |  2 ++
 3 files changed, 72 insertions(+), 45 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 79972064d3..c74a3cd4ae 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -1214,6 +1214,71 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
     return vfio_eeh_container_op(container, op);
 }
 
+static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
+{
+    char *tmp, group_path[PATH_MAX], *group_name;
+    int ret, groupid;
+    ssize_t len;
+
+    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
+    len = readlink(tmp, group_path, sizeof(group_path));
+    g_free(tmp);
+
+    if (len <= 0 || len >= sizeof(group_path)) {
+        ret = len < 0 ? -errno : -ENAMETOOLONG;
+        error_setg_errno(errp, -ret, "no iommu_group found");
+        return ret;
+    }
+
+    group_path[len] = 0;
+
+    group_name = basename(group_path);
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_setg_errno(errp, errno, "failed to read %s", group_path);
+        return -errno;
+    }
+    return groupid;
+}
+
+int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
+{
+    int groupid = vfio_device_groupid(vbasedev, errp);
+    VFIODevice *vbasedev_iter;
+    VFIOGroup *group;
+    int ret;
+
+    if (groupid < 0) {
+        return groupid;
+    }
+
+    trace_vfio_realize(vbasedev->name, groupid);
+    group = vfio_get_group(groupid, as, errp);
+    if (!group) {
+        return -1;
+    }
+
+    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
+        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
+            error_setg(errp, "device is already attached");
+            vfio_put_group(group);
+            return -1;
+        }
+    }
+    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+    if (ret) {
+        vfio_put_group(group);
+        return -1;
+    }
+
+    return 0;
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+    vfio_put_base_device(vbasedev);
+    vfio_put_group(vbasedev->group);
+}
+
 static void vfio_legacy_container_class_init(ObjectClass *klass,
                                              void *data)
 {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a00a485e46..0363f81017 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2654,10 +2654,9 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 
 static void vfio_put_device(VFIOPCIDevice *vdev)
 {
-    g_free(vdev->vbasedev.name);
     g_free(vdev->msix);
 
-    vfio_put_base_device(&vdev->vbasedev);
+    vfio_detach_device(&vdev->vbasedev);
 }
 
 static void vfio_err_notifier_handler(void *opaque)
@@ -2804,13 +2803,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
     VFIODevice *vbasedev = &vdev->vbasedev;
-    VFIODevice *vbasedev_iter;
-    VFIOGroup *group;
-    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
+    char *tmp, *subsys;
     Error *err = NULL;
-    ssize_t len;
     struct stat st;
-    int groupid;
     int i, ret;
     bool is_mdev;
 
@@ -2839,39 +2834,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vbasedev->type = VFIO_DEVICE_TYPE_PCI;
     vbasedev->dev = DEVICE(vdev);
 
-    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len <= 0 || len >= sizeof(group_path)) {
-        error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG,
-                         "no iommu_group found");
-        goto error;
-    }
-
-    group_path[len] = 0;
-
-    group_name = basename(group_path);
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_setg_errno(errp, errno, "failed to read %s", group_path);
-        goto error;
-    }
-
-    trace_vfio_realize(vbasedev->name, groupid);
-
-    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
-    if (!group) {
-        goto error;
-    }
-
-    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-            error_setg(errp, "device is already attached");
-            vfio_put_group(group);
-            goto error;
-        }
-    }
-
     /*
      * Mediated devices *might* operate compatibly with discarding of RAM, but
      * we cannot know for certain, it depends on whether the mdev vendor driver
@@ -2889,13 +2851,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (vbasedev->ram_block_discard_allowed && !is_mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
-        vfio_put_group(group);
         goto error;
     }
 
-    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+    ret = vfio_attach_device(vbasedev,
+                             pci_device_iommu_address_space(pdev), errp);
     if (ret) {
-        vfio_put_group(group);
         goto error;
     }
 
@@ -3124,12 +3085,12 @@ out_teardown:
     vfio_bars_exit(vdev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
+    vfio_detach_device(vbasedev);
 }
 
 static void vfio_instance_finalize(Object *obj)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(obj);
-    VFIOGroup *group = vdev->vbasedev.group;
 
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
@@ -3143,7 +3104,6 @@ static void vfio_instance_finalize(Object *obj)
      * g_free(vdev->igd_opregion);
      */
     vfio_put_device(vdev);
-    vfio_put_group(group);
 }
 
 static void vfio_exitfn(PCIDevice *pdev)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 02a6f36a9e..978b2c2f6e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -180,6 +180,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp);
+void vfio_detach_device(VFIODevice *vbasedev);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 09/18] vfio/platform: Use vfio_[attach/detach]_device
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Let the vfio-platform device use vfio_attach_device() and
vfio_detach_device(), thus hiding the details of the
underlying IOMMU backend.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/platform.c | 42 ++----------------------------------------
 1 file changed, 2 insertions(+), 40 deletions(-)

diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 5af73f9287..3bcdc20667 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -529,12 +529,7 @@ static VFIODeviceOps vfio_platform_ops = {
  */
 static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
 {
-    VFIOGroup *group;
-    VFIODevice *vbasedev_iter;
-    char *tmp, group_path[PATH_MAX], *group_name;
-    ssize_t len;
     struct stat st;
-    int groupid;
     int ret;
 
     /* @sysfsdev takes precedence over @host */
@@ -557,47 +552,14 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
         return -errno;
     }
 
-    tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
-
-    if (len < 0 || len >= sizeof(group_path)) {
-        ret = len < 0 ? -errno : -ENAMETOOLONG;
-        error_setg_errno(errp, -ret, "no iommu_group found");
-        return ret;
-    }
-
-    group_path[len] = 0;
-
-    group_name = basename(group_path);
-    if (sscanf(group_name, "%d", &groupid) != 1) {
-        error_setg_errno(errp, errno, "failed to read %s", group_path);
-        return -errno;
-    }
-
-    trace_vfio_platform_base_device_init(vbasedev->name, groupid);
-
-    group = vfio_get_group(groupid, &address_space_memory, errp);
-    if (!group) {
-        return -ENOENT;
-    }
-
-    QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
-        if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-            error_setg(errp, "device is already attached");
-            vfio_put_group(group);
-            return -EBUSY;
-        }
-    }
-    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+    ret = vfio_attach_device(vbasedev, &address_space_memory, errp);
     if (ret) {
-        vfio_put_group(group);
         return ret;
     }
 
     ret = vfio_populate_device(vbasedev, errp);
     if (ret) {
-        vfio_put_group(group);
+        vfio_detach_device(vbasedev);
     }
 
     return ret;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread


* [RFC 10/18] vfio/ap: Use vfio_[attach/detach]_device
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Let the vfio-ap device use vfio_attach_device() and
vfio_detach_device(), thus hiding the details of the
underlying IOMMU backend.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/ap.c | 62 ++++++++--------------------------------------------
 1 file changed, 9 insertions(+), 53 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index e0dd561e85..286ac638e5 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -50,58 +50,17 @@ struct VFIODeviceOps vfio_ap_ops = {
     .vfio_compute_needs_reset = vfio_ap_compute_needs_reset,
 };
 
-static void vfio_ap_put_device(VFIOAPDevice *vapdev)
-{
-    g_free(vapdev->vdev.name);
-    vfio_put_base_device(&vapdev->vdev);
-}
-
-static VFIOGroup *vfio_ap_get_group(VFIOAPDevice *vapdev, Error **errp)
-{
-    GError *gerror = NULL;
-    char *symlink, *group_path;
-    int groupid;
-
-    symlink = g_strdup_printf("%s/iommu_group", vapdev->vdev.sysfsdev);
-    group_path = g_file_read_link(symlink, &gerror);
-    g_free(symlink);
-
-    if (!group_path) {
-        error_setg(errp, "%s: no iommu_group found for %s: %s",
-                   TYPE_VFIO_AP_DEVICE, vapdev->vdev.sysfsdev, gerror->message);
-        g_error_free(gerror);
-        return NULL;
-    }
-
-    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
-        error_setg(errp, "vfio: failed to read %s", group_path);
-        g_free(group_path);
-        return NULL;
-    }
-
-    g_free(group_path);
-
-    return vfio_get_group(groupid, &address_space_memory, errp);
-}
-
 static void vfio_ap_realize(DeviceState *dev, Error **errp)
 {
-    int ret;
-    char *mdevid;
-    VFIOGroup *vfio_group;
     APDevice *apdev = AP_DEVICE(dev);
     VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev);
+    VFIODevice *vbasedev = &vapdev->vdev;
+    int ret;
 
-    vfio_group = vfio_ap_get_group(vapdev, errp);
-    if (!vfio_group) {
-        return;
-    }
-
-    vapdev->vdev.ops = &vfio_ap_ops;
-    vapdev->vdev.type = VFIO_DEVICE_TYPE_AP;
-    mdevid = basename(vapdev->vdev.sysfsdev);
-    vapdev->vdev.name = g_strdup_printf("%s", mdevid);
-    vapdev->vdev.dev = dev;
+    vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+    vbasedev->ops = &vfio_ap_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_AP;
+    vbasedev->dev = dev;
 
     /*
      * vfio-ap devices operate in a way compatible with discarding of
@@ -111,7 +70,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
      */
     vapdev->vdev.ram_block_discard_allowed = true;
 
-    ret = vfio_get_device(vfio_group, mdevid, &vapdev->vdev, errp);
+    ret = vfio_attach_device(vbasedev, &address_space_memory, errp);
     if (ret) {
         goto out_get_dev_err;
     }
@@ -119,18 +78,15 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
     return;
 
 out_get_dev_err:
-    vfio_ap_put_device(vapdev);
-    vfio_put_group(vfio_group);
+    vfio_detach_device(vbasedev);
 }
 
 static void vfio_ap_unrealize(DeviceState *dev)
 {
     APDevice *apdev = AP_DEVICE(dev);
     VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev);
-    VFIOGroup *group = vapdev->vdev.group;
 
-    vfio_ap_put_device(vapdev);
-    vfio_put_group(group);
+    vfio_detach_device(&vapdev->vdev);
 }
 
 static Property vfio_ap_properties[] = {
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread


* [RFC 11/18] vfio/ccw: Use vfio_[attach/detach]_device
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Let the vfio-ccw device use vfio_attach_device() and
vfio_detach_device(), thus hiding the details of the
underlying IOMMU backend.

Now that all devices have been migrated to the new
vfio_attach_device()/vfio_detach_device() API, turn the
legacy functions into static functions, local to container.c.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/ccw.c                 | 118 ++++++++--------------------------
 hw/vfio/container.c           |   8 +--
 include/hw/vfio/vfio-common.h |   4 --
 3 files changed, 32 insertions(+), 98 deletions(-)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 0354737666..6fde7849cc 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -579,27 +579,32 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev)
     g_free(vcdev->io_region);
 }
 
-static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
-{
-    g_free(vcdev->vdev.name);
-    vfio_put_base_device(&vcdev->vdev);
-}
-
-static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
-                                Error **errp)
+static void vfio_ccw_realize(DeviceState *dev, Error **errp)
 {
+    CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev);
+    S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
+    VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
+    S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
+    VFIODevice *vbasedev = &vcdev->vdev;
+    Error *err = NULL;
     char *name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
                                  vcdev->cdev.hostid.ssid,
                                  vcdev->cdev.hostid.devid);
-    VFIODevice *vbasedev;
+    int ret;
 
-    QLIST_FOREACH(vbasedev, &group->device_list, next) {
-        if (strcmp(vbasedev->name, name) == 0) {
-            error_setg(errp, "vfio: subchannel %s has already been attached",
-                       name);
-            goto out_err;
+    /* Call the class init function for subchannel. */
+    if (cdc->realize) {
+        cdc->realize(cdev, vcdev->vdev.sysfsdev, &err);
+        if (err) {
+            goto out_err_propagate;
         }
     }
+    vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
+                                         name, cdev->mdevid);
+    vbasedev->ops = &vfio_ccw_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
+    vbasedev->name = name;
+    vbasedev->dev = &vcdev->cdev.parent_obj.parent_obj;
 
     /*
      * All vfio-ccw devices are believed to operate in a way compatible with
@@ -609,80 +614,18 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
      * needs to be set before vfio_get_device() for vfio common to handle
      * ram_block_discard_disable().
      */
-    vcdev->vdev.ram_block_discard_allowed = true;
-
-    if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) {
-        goto out_err;
-    }
-
-    vcdev->vdev.ops = &vfio_ccw_ops;
-    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
-    vcdev->vdev.name = name;
-    vcdev->vdev.dev = &vcdev->cdev.parent_obj.parent_obj;
-
-    return;
-
-out_err:
-    g_free(name);
-}
-
-static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
-{
-    char *tmp, group_path[PATH_MAX];
-    ssize_t len;
-    int groupid;
 
-    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
-                          cdev->hostid.cssid, cdev->hostid.ssid,
-                          cdev->hostid.devid, cdev->mdevid);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
+    vbasedev->ram_block_discard_allowed = true;
 
-    if (len <= 0 || len >= sizeof(group_path)) {
-        error_setg(errp, "vfio: no iommu_group found");
-        return NULL;
-    }
-
-    group_path[len] = 0;
-
-    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
-        error_setg(errp, "vfio: failed to read %s", group_path);
-        return NULL;
-    }
-
-    return vfio_get_group(groupid, &address_space_memory, errp);
-}
-
-static void vfio_ccw_realize(DeviceState *dev, Error **errp)
-{
-    VFIOGroup *group;
-    CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev);
-    S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
-    VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
-    S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
-    Error *err = NULL;
-
-    /* Call the class init function for subchannel. */
-    if (cdc->realize) {
-        cdc->realize(cdev, vcdev->vdev.sysfsdev, &err);
-        if (err) {
-            goto out_err_propagate;
-        }
-    }
-
-    group = vfio_ccw_get_group(cdev, &err);
-    if (!group) {
-        goto out_group_err;
-    }
-
-    vfio_ccw_get_device(group, vcdev, &err);
-    if (err) {
-        goto out_device_err;
+    ret = vfio_attach_device(vbasedev, &address_space_memory, errp);
+    if (ret) {
+        g_free(vbasedev->name);
+        g_free(vbasedev->sysfsdev);
     }
 
     vfio_ccw_get_region(vcdev, &err);
     if (err) {
-        goto out_region_err;
+        goto out_get_dev_err;
     }
 
     vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX, &err);
@@ -714,11 +657,8 @@ out_irq_notifier_err:
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX);
 out_io_notifier_err:
     vfio_ccw_put_region(vcdev);
-out_region_err:
-    vfio_ccw_put_device(vcdev);
-out_device_err:
-    vfio_put_group(group);
-out_group_err:
+out_get_dev_err:
+    vfio_detach_device(vbasedev);
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
     }
@@ -732,14 +672,12 @@ static void vfio_ccw_unrealize(DeviceState *dev)
     S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
     VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
-    VFIOGroup *group = vcdev->vdev.group;
 
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX);
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX);
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX);
     vfio_ccw_put_region(vcdev);
-    vfio_ccw_put_device(vcdev);
-    vfio_put_group(group);
+    vfio_detach_device(&vcdev->vdev);
 
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index c74a3cd4ae..5d73f8285e 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -954,7 +954,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
     VFIOContainer *bcontainer;
@@ -1023,7 +1023,7 @@ free_group_exit:
     return NULL;
 }
 
-void vfio_put_group(VFIOGroup *group)
+static void vfio_put_group(VFIOGroup *group)
 {
     if (!group || !QLIST_EMPTY(&group->device_list)) {
         return;
@@ -1044,8 +1044,8 @@ void vfio_put_group(VFIOGroup *group)
     }
 }
 
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp)
+static int vfio_get_device(VFIOGroup *group, const char *name,
+                           VFIODevice *vbasedev, Error **errp)
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, fd;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 978b2c2f6e..7d7898717e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -176,10 +176,6 @@ void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
-void vfio_put_group(VFIOGroup *group);
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp);
 int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp);
 void vfio_detach_device(VFIODevice *vbasedev);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 11/18] vfio/ccw: Use vfio_[attach/detach]_device
@ 2022-04-14 10:47   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

From: Eric Auger <eric.auger@redhat.com>

Let the vfio-ccw device use vfio_attach_device() and
vfio_detach_device(), thus hiding the details of the
underlying IOMMU backend.

Now that all devices have been migrated to the new
vfio_attach_device()/vfio_detach_device() API, turn the
legacy functions into static functions, local to container.c.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/ccw.c                 | 118 ++++++++--------------------------
 hw/vfio/container.c           |   8 +--
 include/hw/vfio/vfio-common.h |   4 --
 3 files changed, 32 insertions(+), 98 deletions(-)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 0354737666..6fde7849cc 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -579,27 +579,32 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev)
     g_free(vcdev->io_region);
 }
 
-static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
-{
-    g_free(vcdev->vdev.name);
-    vfio_put_base_device(&vcdev->vdev);
-}
-
-static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
-                                Error **errp)
+static void vfio_ccw_realize(DeviceState *dev, Error **errp)
 {
+    CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev);
+    S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
+    VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
+    S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
+    VFIODevice *vbasedev = &vcdev->vdev;
+    Error *err = NULL;
     char *name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
                                  vcdev->cdev.hostid.ssid,
                                  vcdev->cdev.hostid.devid);
-    VFIODevice *vbasedev;
+    int ret;
 
-    QLIST_FOREACH(vbasedev, &group->device_list, next) {
-        if (strcmp(vbasedev->name, name) == 0) {
-            error_setg(errp, "vfio: subchannel %s has already been attached",
-                       name);
-            goto out_err;
+    /* Call the class init function for subchannel. */
+    if (cdc->realize) {
+        cdc->realize(cdev, vcdev->vdev.sysfsdev, &err);
+        if (err) {
+            goto out_err_propagate;
         }
     }
+    vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
+                                         name, cdev->mdevid);
+    vbasedev->ops = &vfio_ccw_ops;
+    vbasedev->type = VFIO_DEVICE_TYPE_CCW;
+    vbasedev->name = name;
+    vbasedev->dev = &vcdev->cdev.parent_obj.parent_obj;
 
     /*
      * All vfio-ccw devices are believed to operate in a way compatible with
@@ -609,80 +614,18 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
      * needs to be set before vfio_get_device() for vfio common to handle
      * ram_block_discard_disable().
      */
-    vcdev->vdev.ram_block_discard_allowed = true;
-
-    if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) {
-        goto out_err;
-    }
-
-    vcdev->vdev.ops = &vfio_ccw_ops;
-    vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
-    vcdev->vdev.name = name;
-    vcdev->vdev.dev = &vcdev->cdev.parent_obj.parent_obj;
-
-    return;
-
-out_err:
-    g_free(name);
-}
-
-static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
-{
-    char *tmp, group_path[PATH_MAX];
-    ssize_t len;
-    int groupid;
 
-    tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
-                          cdev->hostid.cssid, cdev->hostid.ssid,
-                          cdev->hostid.devid, cdev->mdevid);
-    len = readlink(tmp, group_path, sizeof(group_path));
-    g_free(tmp);
+    vbasedev->ram_block_discard_allowed = true;
 
-    if (len <= 0 || len >= sizeof(group_path)) {
-        error_setg(errp, "vfio: no iommu_group found");
-        return NULL;
-    }
-
-    group_path[len] = 0;
-
-    if (sscanf(basename(group_path), "%d", &groupid) != 1) {
-        error_setg(errp, "vfio: failed to read %s", group_path);
-        return NULL;
-    }
-
-    return vfio_get_group(groupid, &address_space_memory, errp);
-}
-
-static void vfio_ccw_realize(DeviceState *dev, Error **errp)
-{
-    VFIOGroup *group;
-    CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev);
-    S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
-    VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
-    S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
-    Error *err = NULL;
-
-    /* Call the class init function for subchannel. */
-    if (cdc->realize) {
-        cdc->realize(cdev, vcdev->vdev.sysfsdev, &err);
-        if (err) {
-            goto out_err_propagate;
-        }
-    }
-
-    group = vfio_ccw_get_group(cdev, &err);
-    if (!group) {
-        goto out_group_err;
-    }
-
-    vfio_ccw_get_device(group, vcdev, &err);
-    if (err) {
-        goto out_device_err;
+    ret = vfio_attach_device(vbasedev, &address_space_memory, errp);
+    if (ret) {
+        g_free(vbasedev->name);
+        g_free(vbasedev->sysfsdev);
+        goto out_attach_dev_err;
     }
 
     vfio_ccw_get_region(vcdev, &err);
     if (err) {
-        goto out_region_err;
+        goto out_get_dev_err;
     }
 
     vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX, &err);
@@ -714,11 +657,8 @@ out_irq_notifier_err:
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX);
 out_io_notifier_err:
     vfio_ccw_put_region(vcdev);
-out_region_err:
-    vfio_ccw_put_device(vcdev);
-out_device_err:
-    vfio_put_group(group);
-out_group_err:
+out_get_dev_err:
+    vfio_detach_device(vbasedev);
+out_attach_dev_err:
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
     }
@@ -732,14 +672,12 @@ static void vfio_ccw_unrealize(DeviceState *dev)
     S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
     VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
     S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
-    VFIOGroup *group = vcdev->vdev.group;
 
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_REQ_IRQ_INDEX);
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX);
     vfio_ccw_unregister_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX);
     vfio_ccw_put_region(vcdev);
-    vfio_ccw_put_device(vcdev);
-    vfio_put_group(group);
+    vfio_detach_device(&vcdev->vdev);
 
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index c74a3cd4ae..5d73f8285e 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -954,7 +954,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 {
     VFIOGroup *group;
     VFIOContainer *bcontainer;
@@ -1023,7 +1023,7 @@ free_group_exit:
     return NULL;
 }
 
-void vfio_put_group(VFIOGroup *group)
+static void vfio_put_group(VFIOGroup *group)
 {
     if (!group || !QLIST_EMPTY(&group->device_list)) {
         return;
@@ -1044,8 +1044,8 @@ void vfio_put_group(VFIOGroup *group)
     }
 }
 
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp)
+static int vfio_get_device(VFIOGroup *group, const char *name,
+                           VFIODevice *vbasedev, Error **errp)
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, fd;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 978b2c2f6e..7d7898717e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -176,10 +176,6 @@ void vfio_region_unmap(VFIORegion *region);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
-void vfio_put_group(VFIOGroup *group);
-int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp);
 int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp);
 void vfio_detach_device(VFIODevice *vbasedev);
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 125+ messages in thread
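The effect of this patch is an API narrowing: the group-centric helpers (vfio_get_group(), vfio_put_group(), vfio_get_device()) become private to container.c, and device models only ever see vfio_attach_device()/vfio_detach_device(). A minimal sketch of that facade pattern, with all type and function names as illustrative stand-ins rather than QEMU's actual code:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct Group  { int id; int refs; } Group;
typedef struct Device { char name[32]; Group *group; } Device;

/* Formerly public helpers, now implementation details of one file. */
static Group *get_group(int groupid)
{
    Group *g = calloc(1, sizeof(*g));
    if (g) {
        g->id = groupid;
        g->refs = 1;
    }
    return g;
}

static void put_group(Group *g)
{
    if (g && --g->refs == 0) {
        free(g);
    }
}

static int get_device(Group *g, const char *name, Device *dev)
{
    snprintf(dev->name, sizeof(dev->name), "%s", name);
    dev->group = g;
    return 0;
}

/* The only entry points callers see: attach composes the private
 * steps internally, so callers never learn about groups at all. */
int attach_device(Device *dev, const char *name, int groupid)
{
    Group *g = get_group(groupid);

    if (!g) {
        return -1;
    }
    if (get_device(g, name, dev)) {
        put_group(g);           /* undo on failure */
        return -1;
    }
    return 0;
}

void detach_device(Device *dev)
{
    put_group(dev->group);
    dev->group = NULL;
}
```

Hiding the group lookup this way is what later lets an iommufd backend, which has no group concept in its uapi, slot in behind the same two calls.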

* [RFC 12/18] vfio/container-obj: Introduce [attach/detach]_device container callbacks
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Let's turn attach/detach_device into container callbacks. That way,
their implementation can easily be customized for a given backend.

For the time being, only the legacy container is supported.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                         | 36 ++++++++++++++++++++++++++++
 hw/vfio/container.c                  | 11 +++++----
 hw/vfio/pci.c                        |  2 +-
 include/hw/vfio/vfio-common.h        |  7 ++++++
 include/hw/vfio/vfio-container-obj.h |  6 +++++
 5 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 37423d2c89..30e86f6833 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -874,3 +874,39 @@ void vfio_put_address_space(VFIOAddressSpace *space)
         g_free(space);
     }
 }
+
+static VFIOContainerClass *
+vfio_get_container_class(VFIOIOMMUBackendType be)
+{
+    ObjectClass *klass;
+
+    switch (be) {
+    case VFIO_IOMMU_BACKEND_TYPE_LEGACY:
+        klass = object_class_by_name(TYPE_VFIO_LEGACY_CONTAINER);
+        return VFIO_CONTAINER_OBJ_CLASS(klass);
+    default:
+        return NULL;
+    }
+}
+
+int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
+{
+    VFIOContainerClass *vccs;
+
+    vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY);
+    if (!vccs) {
+        return -ENOENT;
+    }
+    return vccs->attach_device(vbasedev, as, errp);
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+    VFIOContainerClass *vccs;
+
+    if (!vbasedev->container) {
+        return;
+    }
+    vccs = VFIO_CONTAINER_OBJ_GET_CLASS(vbasedev->container);
+    vccs->detach_device(vbasedev);
+}
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 5d73f8285e..74febc1567 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -50,8 +50,6 @@
 static int vfio_kvm_device_fd = -1;
 #endif
 
-#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
-
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
 
@@ -1240,7 +1238,8 @@ static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
     return groupid;
 }
 
-int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
+static int
+legacy_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
 {
     int groupid = vfio_device_groupid(vbasedev, errp);
     VFIODevice *vbasedev_iter;
@@ -1269,14 +1268,16 @@ int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
         vfio_put_group(group);
         return -1;
     }
+    vbasedev->container = &group->container->obj;
 
     return 0;
 }
 
-void vfio_detach_device(VFIODevice *vbasedev)
+static void legacy_detach_device(VFIODevice *vbasedev)
 {
     vfio_put_base_device(vbasedev);
     vfio_put_group(vbasedev->group);
+    vbasedev->container = NULL;
 }
 
 static void vfio_legacy_container_class_init(ObjectClass *klass,
@@ -1292,6 +1293,8 @@ static void vfio_legacy_container_class_init(ObjectClass *klass,
     vccs->add_window = vfio_legacy_container_add_section_window;
     vccs->del_window = vfio_legacy_container_del_section_window;
     vccs->check_extension = vfio_legacy_container_check_extension;
+    vccs->attach_device = legacy_attach_device;
+    vccs->detach_device = legacy_detach_device;
 }
 
 static const TypeInfo vfio_legacy_container_info = {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0363f81017..e1ab6d339d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3063,7 +3063,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     if (!pdev->failover_pair_id &&
-        vfio_container_check_extension(&vbasedev->group->container->obj,
+        vfio_container_check_extension(vbasedev->container,
                                        VFIO_FEAT_LIVE_MIGRATION)) {
         ret = vfio_migration_probe(vbasedev, errp);
         if (ret) {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 7d7898717e..2040c27cda 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -83,9 +83,15 @@ typedef struct VFIOLegacyContainer {
 
 typedef struct VFIODeviceOps VFIODeviceOps;
 
+typedef enum VFIOIOMMUBackendType {
+    VFIO_IOMMU_BACKEND_TYPE_LEGACY = 0,
+    VFIO_IOMMU_BACKEND_TYPE_IOMMUFD = 1,
+} VFIOIOMMUBackendType;
+
 typedef struct VFIODevice {
     QLIST_ENTRY(VFIODevice) next;
     struct VFIOGroup *group;
+    VFIOContainer *container;
     char *sysfsdev;
     char *name;
     DeviceState *dev;
@@ -97,6 +103,7 @@ typedef struct VFIODevice {
     bool ram_block_discard_allowed;
     bool enable_migration;
     VFIODeviceOps *ops;
+    VFIOIOMMUBackendType be;
     unsigned int num_irqs;
     unsigned int num_regions;
     unsigned int flags;
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
index 7ffbbb299f..ebc1340530 100644
--- a/include/hw/vfio/vfio-container-obj.h
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -42,6 +42,8 @@
         OBJECT_GET_CLASS(VFIOContainerClass, (obj), \
                          TYPE_VFIO_CONTAINER_OBJ)
 
+#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
+
 typedef enum VFIOContainerFeature {
     VFIO_FEAT_LIVE_MIGRATION,
 } VFIOContainerFeature;
@@ -101,6 +103,8 @@ struct VFIOContainer {
     QLIST_ENTRY(VFIOContainer) next;
 };
 
+typedef struct VFIODevice VFIODevice;
+
 typedef struct VFIOContainerClass {
     /* private */
     ObjectClass parent_class;
@@ -126,6 +130,8 @@ typedef struct VFIOContainerClass {
                       Error **errp);
     void (*del_window)(VFIOContainer *container,
                        MemoryRegionSection *section);
+    int (*attach_device)(VFIODevice *vbasedev, AddressSpace *as, Error **errp);
+    void (*detach_device)(VFIODevice *vbasedev);
 } VFIOContainerClass;
 
 bool vfio_container_check_extension(VFIOContainer *container,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread
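The dispatch introduced here resolves a container class for the requested backend and forwards attach through it, while detach is routed through the class of the container the device is already bound to (vbasedev->container). A condensed sketch of that mechanism in plain C; the names below are simplified stand-ins for the QOM class lookup and VFIO types, not QEMU's real API:

```c
#include <errno.h>
#include <stddef.h>

typedef struct Device Device;

/* Per-backend virtual table, mirroring VFIOContainerClass. */
typedef struct ContainerClass {
    int  (*attach_device)(Device *dev);
    void (*detach_device)(Device *dev);
} ContainerClass;

struct Device {
    const char *name;
    const ContainerClass *container;   /* set on successful attach */
    int attached;
};

typedef enum { BACKEND_LEGACY, BACKEND_IOMMUFD } BackendType;

static int legacy_attach(Device *dev)  { dev->attached = 1; return 0; }
static void legacy_detach(Device *dev) { dev->attached = 0; }

static const ContainerClass legacy_class = {
    .attach_device = legacy_attach,
    .detach_device = legacy_detach,
};

/* Stand-in for object_class_by_name(): only the legacy backend
 * is registered at this point in the series. */
static const ContainerClass *get_container_class(BackendType be)
{
    return be == BACKEND_LEGACY ? &legacy_class : NULL;
}

int attach_device(Device *dev, BackendType be)
{
    const ContainerClass *vccs = get_container_class(be);
    int ret;

    if (!vccs) {
        return -ENOENT;                /* no class for this backend */
    }
    ret = vccs->attach_device(dev);
    if (!ret) {
        dev->container = vccs;         /* remember who attached us */
    }
    return ret;
}

void detach_device(Device *dev)
{
    if (!dev->container) {
        return;                        /* never attached */
    }
    dev->container->detach_device(dev);
    dev->container = NULL;
}
```

Recording the class on the device at attach time is what makes detach backend-agnostic: the caller does not need to know (or pass) which backend was used.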


* [RFC 13/18] vfio/container-obj: Introduce VFIOContainer reset callback
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

The reset implementation depends on the container backend. Let's
introduce a VFIOContainer class function and register a generic
reset handler that calls the right reset function depending on
the container type. Also, let's move the registration and
unregistration of the reset handler to a place that is not
backend-specific (the creation of the first vfio address space,
instead of the first group).

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                         | 18 ++++++++++++++++++
 hw/vfio/container-obj.c              | 13 +++++++++++++
 hw/vfio/container.c                  | 26 ++++++++++++++------------
 include/hw/vfio/vfio-container-obj.h |  2 ++
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 30e86f6833..4abaa4068f 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -847,6 +847,18 @@ const MemoryListener vfio_memory_listener = {
     .log_sync = vfio_listener_log_sync,
 };
 
+void vfio_reset_handler(void *opaque)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *bcontainer;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(bcontainer, &space->containers, next) {
+            vfio_container_reset(bcontainer);
+        }
+    }
+}
+
 VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
 {
     VFIOAddressSpace *space;
@@ -862,6 +874,9 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     space->as = as;
     QLIST_INIT(&space->containers);
 
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
+        qemu_register_reset(vfio_reset_handler, NULL);
+    }
     QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
 
     return space;
@@ -873,6 +888,9 @@ void vfio_put_address_space(VFIOAddressSpace *space)
         QLIST_REMOVE(space, list);
         g_free(space);
     }
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
+        qemu_unregister_reset(vfio_reset_handler, NULL);
+    }
 }
 
 static VFIOContainerClass *
diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
index 40c1e2a2b5..c4220336af 100644
--- a/hw/vfio/container-obj.c
+++ b/hw/vfio/container-obj.c
@@ -68,6 +68,19 @@ int vfio_container_dma_unmap(VFIOContainer *container,
     return vccs->dma_unmap(container, iova, size, iotlb);
 }
 
+int vfio_container_reset(VFIOContainer *container)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->reset) {
+        return -ENOENT;
+    }
+
+    return vccs->reset(container);
+}
+
 void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
                                             bool start)
 {
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 74febc1567..2f59422048 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -458,12 +458,15 @@ vfio_legacy_container_del_section_window(VFIOContainer *bcontainer,
     }
 }
 
-void vfio_reset_handler(void *opaque)
+static int vfio_legacy_container_reset(VFIOContainer *bcontainer)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     VFIOGroup *group;
     VFIODevice *vbasedev;
+    int ret, final_ret = 0;
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
+    QLIST_FOREACH(group, &container->group_list, container_next) {
         QLIST_FOREACH(vbasedev, &group->device_list, next) {
             if (vbasedev->dev->realized) {
                 vbasedev->ops->vfio_compute_needs_reset(vbasedev);
@@ -471,13 +474,19 @@ void vfio_reset_handler(void *opaque)
         }
     }
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
+    QLIST_FOREACH(group, &container->group_list, container_next) {
         QLIST_FOREACH(vbasedev, &group->device_list, next) {
             if (vbasedev->dev->realized && vbasedev->needs_reset) {
-                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+                ret = vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+                if (ret) {
+                    error_report("failed to reset %s (%d)",
+                                 vbasedev->name, ret);
+                    final_ret = ret;
+                }
             }
         }
     }
+    return final_ret;
 }
 
 static void vfio_kvm_device_add_group(VFIOGroup *group)
@@ -1004,10 +1013,6 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
         goto close_fd_exit;
     }
 
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_register_reset(vfio_reset_handler, NULL);
-    }
-
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
     return group;
@@ -1036,10 +1041,6 @@ static void vfio_put_group(VFIOGroup *group)
     trace_vfio_put_group(group->fd);
     close(group->fd);
     g_free(group);
-
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_unregister_reset(vfio_reset_handler, NULL);
-    }
 }
 
 static int vfio_get_device(VFIOGroup *group, const char *name,
@@ -1293,6 +1294,7 @@ static void vfio_legacy_container_class_init(ObjectClass *klass,
     vccs->add_window = vfio_legacy_container_add_section_window;
     vccs->del_window = vfio_legacy_container_del_section_window;
     vccs->check_extension = vfio_legacy_container_check_extension;
+    vccs->reset = vfio_legacy_container_reset;
     vccs->attach_device = legacy_attach_device;
     vccs->detach_device = legacy_detach_device;
 }
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
index ebc1340530..ffd8590ff8 100644
--- a/include/hw/vfio/vfio-container-obj.h
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -118,6 +118,7 @@ typedef struct VFIOContainerClass {
     int (*dma_unmap)(VFIOContainer *container,
                      hwaddr iova, ram_addr_t size,
                      IOMMUTLBEntry *iotlb);
+    int (*reset)(VFIOContainer *container);
     /* migration feature */
     bool (*devices_all_dirty_tracking)(VFIOContainer *container);
     void (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
@@ -142,6 +143,7 @@ int vfio_container_dma_map(VFIOContainer *container,
 int vfio_container_dma_unmap(VFIOContainer *container,
                              hwaddr iova, ram_addr_t size,
                              IOMMUTLBEntry *iotlb);
+int vfio_container_reset(VFIOContainer *container);
 bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
 void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
                                             bool start);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread
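The generic reset path above walks every VFIO address space, then every container within it, and invokes that container's class reset callback, remembering the last failure instead of stopping at the first. A small sketch of that flow; the structures are simplified stand-ins for VFIOAddressSpace/VFIOContainer and the class vtable, not QEMU's definitions:

```c
#include <errno.h>
#include <stddef.h>

#define MAX_CONTAINERS 4

typedef struct Container {
    int (*reset)(struct Container *c);  /* per-backend callback, may be NULL */
    int was_reset;
} Container;

typedef struct Space {
    Container *containers[MAX_CONTAINERS];
    int n;
} Space;

/* Mirrors vfio_container_reset(): backends without a reset
 * callback report -ENOENT rather than crashing. */
static int container_reset(Container *c)
{
    if (!c->reset) {
        return -ENOENT;
    }
    return c->reset(c);
}

/* Generic handler: keep resetting everything even when one
 * container fails, and return the last failure seen. */
int reset_all(Space *spaces, int nspaces)
{
    int ret, final_ret = 0;

    for (int i = 0; i < nspaces; i++) {
        for (int j = 0; j < spaces[i].n; j++) {
            ret = container_reset(spaces[i].containers[j]);
            if (ret) {
                final_ret = ret;
            }
        }
    }
    return final_ret;
}

/* Example backend callback, standing in for the legacy reset. */
int legacy_reset(Container *c)
{
    c->was_reset = 1;
    return 0;
}
```

This is also why the patch moves qemu_register_reset() to address-space creation: the handler's iteration is over address spaces and containers, so its lifetime should follow the first/last address space rather than any backend-specific group list.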

* [RFC 13/18] vfio/container-obj: Introduce VFIOContainer reset callback
@ 2022-04-14 10:47   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

From: Eric Auger <eric.auger@redhat.com>

Reset implementation depends on the container backend. Let's
introduce a VFIOContainer class function and register a generic
reset handler that will be able to call the right reset function
depending on the container type. Also, let's move the
registration/unregistration to a place that is not backend-specific
(first vfio address space created instead of the first group).

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                         | 18 ++++++++++++++++++
 hw/vfio/container-obj.c              | 13 +++++++++++++
 hw/vfio/container.c                  | 26 ++++++++++++++------------
 include/hw/vfio/vfio-container-obj.h |  2 ++
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 30e86f6833..4abaa4068f 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -847,6 +847,18 @@ const MemoryListener vfio_memory_listener = {
     .log_sync = vfio_listener_log_sync,
 };
 
+void vfio_reset_handler(void *opaque)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *bcontainer;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+         QLIST_FOREACH(bcontainer, &space->containers, next) {
+             vfio_container_reset(bcontainer);
+         }
+    }
+}
+
 VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
 {
     VFIOAddressSpace *space;
@@ -862,6 +874,9 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     space->as = as;
     QLIST_INIT(&space->containers);
 
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
+        qemu_register_reset(vfio_reset_handler, NULL);
+    }
     QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
 
     return space;
@@ -873,6 +888,9 @@ void vfio_put_address_space(VFIOAddressSpace *space)
         QLIST_REMOVE(space, list);
         g_free(space);
     }
+    if (QLIST_EMPTY(&vfio_address_spaces)) {
+        qemu_unregister_reset(vfio_reset_handler, NULL);
+    }
 }
 
 static VFIOContainerClass *
diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
index 40c1e2a2b5..c4220336af 100644
--- a/hw/vfio/container-obj.c
+++ b/hw/vfio/container-obj.c
@@ -68,6 +68,19 @@ int vfio_container_dma_unmap(VFIOContainer *container,
     return vccs->dma_unmap(container, iova, size, iotlb);
 }
 
+int vfio_container_reset(VFIOContainer *container)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+    if (!vccs->reset) {
+        return -ENOENT;
+    }
+
+    return vccs->reset(container);
+}
+
 void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
                                             bool start)
 {
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 74febc1567..2f59422048 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -458,12 +458,15 @@ vfio_legacy_container_del_section_window(VFIOContainer *bcontainer,
     }
 }
 
-void vfio_reset_handler(void *opaque)
+static int vfio_legacy_container_reset(VFIOContainer *bcontainer)
 {
+    VFIOLegacyContainer *container = container_of(bcontainer,
+                                                  VFIOLegacyContainer, obj);
     VFIOGroup *group;
     VFIODevice *vbasedev;
+    int ret, final_ret = 0;
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
+    QLIST_FOREACH(group, &container->group_list, container_next) {
         QLIST_FOREACH(vbasedev, &group->device_list, next) {
             if (vbasedev->dev->realized) {
                 vbasedev->ops->vfio_compute_needs_reset(vbasedev);
@@ -471,13 +474,19 @@ void vfio_reset_handler(void *opaque)
         }
     }
 
-    QLIST_FOREACH(group, &vfio_group_list, next) {
+    QLIST_FOREACH(group, &container->group_list, container_next) {
         QLIST_FOREACH(vbasedev, &group->device_list, next) {
             if (vbasedev->dev->realized && vbasedev->needs_reset) {
-                vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+                ret = vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+                if (ret) {
+                    error_report("failed to reset %s (%d)",
+                                 vbasedev->name, ret);
+                    final_ret = ret;
+                }
             }
         }
     }
+    return final_ret;
 }
 
 static void vfio_kvm_device_add_group(VFIOGroup *group)
@@ -1004,10 +1013,6 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
         goto close_fd_exit;
     }
 
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_register_reset(vfio_reset_handler, NULL);
-    }
-
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
     return group;
@@ -1036,10 +1041,6 @@ static void vfio_put_group(VFIOGroup *group)
     trace_vfio_put_group(group->fd);
     close(group->fd);
     g_free(group);
-
-    if (QLIST_EMPTY(&vfio_group_list)) {
-        qemu_unregister_reset(vfio_reset_handler, NULL);
-    }
 }
 
 static int vfio_get_device(VFIOGroup *group, const char *name,
@@ -1293,6 +1294,7 @@ static void vfio_legacy_container_class_init(ObjectClass *klass,
     vccs->add_window = vfio_legacy_container_add_section_window;
     vccs->del_window = vfio_legacy_container_del_section_window;
     vccs->check_extension = vfio_legacy_container_check_extension;
+    vccs->reset = vfio_legacy_container_reset;
     vccs->attach_device = legacy_attach_device;
     vccs->detach_device = legacy_detach_device;
 }
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
index ebc1340530..ffd8590ff8 100644
--- a/include/hw/vfio/vfio-container-obj.h
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -118,6 +118,7 @@ typedef struct VFIOContainerClass {
     int (*dma_unmap)(VFIOContainer *container,
                      hwaddr iova, ram_addr_t size,
                      IOMMUTLBEntry *iotlb);
+    int (*reset)(VFIOContainer *container);
     /* migration feature */
     bool (*devices_all_dirty_tracking)(VFIOContainer *container);
     void (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
@@ -142,6 +143,7 @@ int vfio_container_dma_map(VFIOContainer *container,
 int vfio_container_dma_unmap(VFIOContainer *container,
                              hwaddr iova, ram_addr_t size,
                              IOMMUTLBEntry *iotlb);
+int vfio_container_reset(VFIOContainer *container);
 bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
 void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
                                             bool start);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 14/18] hw/iommufd: Creation
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

Introduce an iommufd utility library which can be compiled out via the
CONFIG_IOMMUFD configuration option. This code is expected to be called
by several subsystems: vdpa and vfio.

Co-authored-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
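
For reviewers unfamiliar with the fd sharing scheme: iommufd_get()/iommufd_put()
below implement a refcounted singleton fd, so that every consumer in the process
shares one open /dev/iommu. The pattern can be sketched standalone as follows
(fake_open/fake_close and all other names here are illustrative stand-ins, not
the patch's API; the real helpers additionally serialize with a QemuMutex):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for opening/closing /dev/iommu. */
static int fake_close_count;
static int fake_open(void) { return 42; }
static void fake_close(int fd) { (void)fd; fake_close_count++; }

/* Refcounted singleton fd, mirroring iommufd_get()/iommufd_put(). */
static int the_fd = -1;
static uint32_t users;

static int fd_get(void)
{
    if (the_fd == -1) {
        the_fd = fake_open();
        if (the_fd >= 0) {
            users = 1;     /* first user opens the fd */
        }
    } else {
        users++;           /* later users just take a reference */
    }
    return the_fd;
}

static void fd_put(int fd)
{
    if (--users) {
        return;            /* other users remain, keep the fd open */
    }
    the_fd = -1;           /* last user closes and resets the singleton */
    fake_close(fd);
}
```

The consequence, as in the patch, is that the fd is only closed when the last
user drops its reference, and a fresh get() after that reopens it.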
 MAINTAINERS                  |   7 ++
 hw/Kconfig                   |   1 +
 hw/iommufd/Kconfig           |   4 +
 hw/iommufd/iommufd.c         | 209 +++++++++++++++++++++++++++++++++++
 hw/iommufd/meson.build       |   1 +
 hw/iommufd/trace-events      |  11 ++
 hw/iommufd/trace.h           |   1 +
 hw/meson.build               |   1 +
 include/hw/iommufd/iommufd.h |  37 +++++++
 meson.build                  |   1 +
 10 files changed, 273 insertions(+)
 create mode 100644 hw/iommufd/Kconfig
 create mode 100644 hw/iommufd/iommufd.c
 create mode 100644 hw/iommufd/meson.build
 create mode 100644 hw/iommufd/trace-events
 create mode 100644 hw/iommufd/trace.h
 create mode 100644 include/hw/iommufd/iommufd.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 4ad2451e03..f6bcb25f7f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1954,6 +1954,13 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s390x@nongnu.org
 
+iommufd
+M: Yi Liu <yi.l.liu@intel.com>
+M: Eric Auger <eric.auger@redhat.com>
+S: Supported
+F: hw/iommufd/*
+F: include/hw/iommufd/*
+
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
 S: Supported
diff --git a/hw/Kconfig b/hw/Kconfig
index ad20cce0a9..d270d44760 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -63,6 +63,7 @@ source sparc/Kconfig
 source sparc64/Kconfig
 source tricore/Kconfig
 source xtensa/Kconfig
+source iommufd/Kconfig
 
 # Symbols used by multiple targets
 config TEST_DEVICES
diff --git a/hw/iommufd/Kconfig b/hw/iommufd/Kconfig
new file mode 100644
index 0000000000..4b1b00e36b
--- /dev/null
+++ b/hw/iommufd/Kconfig
@@ -0,0 +1,4 @@
+config IOMMUFD
+    bool
+    default y
+    depends on LINUX
diff --git a/hw/iommufd/iommufd.c b/hw/iommufd/iommufd.c
new file mode 100644
index 0000000000..4e8179d612
--- /dev/null
+++ b/hw/iommufd/iommufd.c
@@ -0,0 +1,209 @@
+/*
+ * QEMU IOMMUFD
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/thread.h"
+#include "qemu/module.h"
+#include <sys/ioctl.h>
+#include <linux/iommufd.h>
+#include "hw/iommufd/iommufd.h"
+#include "trace.h"
+
+static QemuMutex iommufd_lock;
+static uint32_t iommufd_users;
+static int iommufd = -1;
+
+static int iommufd_get(void)
+{
+    qemu_mutex_lock(&iommufd_lock);
+    if (iommufd == -1) {
+        iommufd = qemu_open_old("/dev/iommu", O_RDWR);
+        if (iommufd < 0) {
+            error_report("Failed to open /dev/iommu!");
+        } else {
+            iommufd_users = 1;
+        }
+        trace_iommufd_get(iommufd);
+    } else if (++iommufd_users == UINT32_MAX) {
+        error_report("Failed to get iommufd: %d, count overflow", iommufd);
+        iommufd_users--;
+        qemu_mutex_unlock(&iommufd_lock);
+        return -E2BIG;
+    }
+    qemu_mutex_unlock(&iommufd_lock);
+    return iommufd;
+}
+
+static void iommufd_put(int fd)
+{
+    qemu_mutex_lock(&iommufd_lock);
+    if (--iommufd_users) {
+        qemu_mutex_unlock(&iommufd_lock);
+        return;
+    }
+    iommufd = -1;
+    trace_iommufd_put(fd);
+    close(fd);
+    qemu_mutex_unlock(&iommufd_lock);
+}
+
+static int iommufd_alloc_ioas(int iommufd, uint32_t *ioas)
+{
+    int ret;
+    struct iommu_ioas_alloc alloc_data  = {
+        .size = sizeof(alloc_data),
+        .flags = 0,
+    };
+
+    ret = ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
+    if (ret) {
+        error_report("Failed to allocate ioas %m");
+        return -errno;
+    }
+
+    *ioas = alloc_data.out_ioas_id;
+    trace_iommufd_alloc_ioas(iommufd, *ioas, ret);
+    return ret;
+}
+
+static void iommufd_free_ioas(int iommufd, uint32_t ioas)
+{
+    int ret;
+    struct iommu_destroy des = {
+        .size = sizeof(des),
+        .id = ioas,
+    };
+
+    ret = ioctl(iommufd, IOMMU_DESTROY, &des);
+    trace_iommufd_free_ioas(iommufd, ioas, ret);
+    if (ret) {
+        error_report("Failed to free ioas: %u %m", ioas);
+    }
+}
+
+int iommufd_get_ioas(int *fd, uint32_t *ioas_id)
+{
+    int ret;
+
+    *fd = iommufd_get();
+    if (*fd < 0) {
+        return *fd;
+    }
+
+    ret = iommufd_alloc_ioas(*fd, ioas_id);
+    trace_iommufd_get_ioas(*fd, *ioas_id, ret);
+    if (ret) {
+        iommufd_put(*fd);
+    }
+    return ret;
+}
+
+void iommufd_put_ioas(int iommufd, uint32_t ioas)
+{
+    trace_iommufd_put_ioas(iommufd, ioas);
+    iommufd_free_ioas(iommufd, ioas);
+    iommufd_put(iommufd);
+}
+
+int iommufd_unmap_dma(int iommufd, uint32_t ioas,
+                      hwaddr iova, ram_addr_t size)
+{
+    int ret;
+    struct iommu_ioas_unmap unmap = {
+        .size = sizeof(unmap),
+        .ioas_id = ioas,
+        .iova = iova,
+        .length = size,
+    };
+
+    ret = ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
+    trace_iommufd_unmap_dma(iommufd, ioas, iova, size, ret);
+    if (ret) {
+        error_report("IOMMU_IOAS_UNMAP failed: %s", strerror(errno));
+    }
+    return !ret ? 0 : -errno;
+}
+
+int iommufd_map_dma(int iommufd, uint32_t ioas, hwaddr iova,
+                    ram_addr_t size, void *vaddr, bool readonly)
+{
+    int ret;
+    struct iommu_ioas_map map = {
+        .size = sizeof(map),
+        .flags = IOMMU_IOAS_MAP_READABLE |
+                 IOMMU_IOAS_MAP_FIXED_IOVA,
+        .ioas_id = ioas,
+        .__reserved = 0,
+        .user_va = (uintptr_t)vaddr,
+        .iova = iova,
+        .length = size,
+    };
+
+    if (!readonly) {
+        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+    }
+
+    ret = ioctl(iommufd, IOMMU_IOAS_MAP, &map);
+    trace_iommufd_map_dma(iommufd, ioas, iova, size, vaddr, readonly, ret);
+    if (ret) {
+        error_report("IOMMU_IOAS_MAP failed: %s", strerror(errno));
+    }
+    return !ret ? 0 : -errno;
+}
+
+int iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas,
+                     hwaddr iova, ram_addr_t size, bool readonly)
+{
+    int ret;
+    struct iommu_ioas_copy copy = {
+        .size = sizeof(copy),
+        .flags = IOMMU_IOAS_MAP_READABLE |
+                 IOMMU_IOAS_MAP_FIXED_IOVA,
+        .dst_ioas_id = dst_ioas,
+        .src_ioas_id = src_ioas,
+        .length = size,
+        .dst_iova = iova,
+        .src_iova = iova,
+    };
+
+    if (!readonly) {
+        copy.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+    }
+
+    ret = ioctl(iommufd, IOMMU_IOAS_COPY, &copy);
+    trace_iommufd_copy_dma(iommufd, src_ioas, dst_ioas,
+                           iova, size, readonly, ret);
+    if (ret) {
+        error_report("IOMMU_IOAS_COPY failed: %s", strerror(errno));
+    }
+    return !ret ? 0 : -errno;
+}
+
+static void iommufd_register_types(void)
+{
+    qemu_mutex_init(&iommufd_lock);
+}
+
+type_init(iommufd_register_types)
diff --git a/hw/iommufd/meson.build b/hw/iommufd/meson.build
new file mode 100644
index 0000000000..515bc40cbe
--- /dev/null
+++ b/hw/iommufd/meson.build
@@ -0,0 +1 @@
+specific_ss.add(when: 'CONFIG_IOMMUFD', if_true: files('iommufd.c'))
diff --git a/hw/iommufd/trace-events b/hw/iommufd/trace-events
new file mode 100644
index 0000000000..615d80cdf4
--- /dev/null
+++ b/hw/iommufd/trace-events
@@ -0,0 +1,11 @@
+# See docs/devel/tracing.rst for syntax documentation.
+
+iommufd_get(int iommufd) " iommufd=%d"
+iommufd_put(int iommufd) " iommufd=%d"
+iommufd_alloc_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)"
+iommufd_free_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)"
+iommufd_get_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)"
+iommufd_put_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
+iommufd_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
+iommufd_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
+iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas, uint64_t iova, uint64_t size, bool readonly, int ret) " iommufd=%d src_ioas=%d dst_ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" readonly=%d (%d)"
diff --git a/hw/iommufd/trace.h b/hw/iommufd/trace.h
new file mode 100644
index 0000000000..3fb40b0932
--- /dev/null
+++ b/hw/iommufd/trace.h
@@ -0,0 +1 @@
+#include "trace/trace-hw_iommufd.h"
diff --git a/hw/meson.build b/hw/meson.build
index b3366c888e..ffb5203265 100644
--- a/hw/meson.build
+++ b/hw/meson.build
@@ -38,6 +38,7 @@ subdir('timer')
 subdir('tpm')
 subdir('usb')
 subdir('vfio')
+subdir('iommufd')
 subdir('virtio')
 subdir('watchdog')
 subdir('xen')
diff --git a/include/hw/iommufd/iommufd.h b/include/hw/iommufd/iommufd.h
new file mode 100644
index 0000000000..59835cddca
--- /dev/null
+++ b/include/hw/iommufd/iommufd.h
@@ -0,0 +1,37 @@
+/*
+ * QEMU IOMMUFD
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_IOMMUFD_IOMMUFD_H
+#define HW_IOMMUFD_IOMMUFD_H
+#include "exec/hwaddr.h"
+#include "exec/cpu-common.h"
+
+int iommufd_get_ioas(int *fd, uint32_t *ioas_id);
+void iommufd_put_ioas(int fd, uint32_t ioas_id);
+int iommufd_unmap_dma(int iommufd, uint32_t ioas, hwaddr iova, ram_addr_t size);
+int iommufd_map_dma(int iommufd, uint32_t ioas, hwaddr iova,
+                    ram_addr_t size, void *vaddr, bool readonly);
+int iommufd_copy_dma(int iommufd, uint32_t src_ioas, uint32_t dst_ioas,
+                     hwaddr iova, ram_addr_t size, bool readonly);
+bool iommufd_supported(void);
+#endif /* HW_IOMMUFD_IOMMUFD_H */
diff --git a/meson.build b/meson.build
index 861de93c4f..45caa53db6 100644
--- a/meson.build
+++ b/meson.build
@@ -2755,6 +2755,7 @@ if have_system
     'hw/tpm',
     'hw/usb',
     'hw/vfio',
+    'hw/iommufd',
     'hw/virtio',
     'hw/watchdog',
     'hw/xen',
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread


* [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

Add the iommufd backend. The IOMMUFD container class is implemented on
top of the new /dev/iommu user API. This backend depends on
CONFIG_IOMMUFD.

The iommufd backend does not yet support live migration or cache
coherency due to missing support in the host kernel, so only a subset
of the container class callbacks is implemented.

Co-authored-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
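
A note on the dispatch scheme: iommufd_map()/iommufd_unmap() below recover the
derived VFIOIOMMUFDContainer from the base VFIOContainer with container_of().
A stripped-down, standalone sketch of this embed-the-base-object pattern
follows (all type and function names here are illustrative, not the patch's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Linux-style container_of(): recover the enclosing struct from a
 * pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)(void *)((char *)(ptr) - offsetof(type, member)))

struct base_container;

/* "Class" holding per-backend callbacks, like VFIOContainerClass. */
struct container_ops {
    int (*map)(struct base_container *bc, unsigned long iova);
};

/* Base object embedded in every backend container. */
struct base_container {
    const struct container_ops *ops;
};

/* iommufd-style derived container embedding the base object. */
struct iommufd_container {
    struct base_container obj;
    int iommufd;
    uint32_t ioas_id;
};

static int iommufd_style_map(struct base_container *bc, unsigned long iova)
{
    /* Downcast from the embedded base member to the derived type. */
    struct iommufd_container *c =
        container_of(bc, struct iommufd_container, obj);
    (void)iova;
    return c->iommufd;    /* the real code issues IOMMU_IOAS_MAP here */
}

static const struct container_ops iommufd_ops = {
    .map = iommufd_style_map,
};
```

Callers only hold a base pointer and go through the ops table; each backend's
callback downcasts to reach its private state (fd, ioas id, ...).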
 hw/vfio/as.c                         |   2 +-
 hw/vfio/iommufd.c                    | 545 +++++++++++++++++++++++++++
 hw/vfio/meson.build                  |   3 +
 hw/vfio/pci.c                        |  10 +
 hw/vfio/trace-events                 |  11 +
 include/hw/vfio/vfio-common.h        |  18 +
 include/hw/vfio/vfio-container-obj.h |   1 +
 7 files changed, 589 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/iommufd.c

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 4abaa4068f..94618efd1f 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -41,7 +41,7 @@
 #include "qapi/error.h"
 #include "migration/migration.h"
 
-static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
+VFIOAddressSpaceList vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
 void vfio_host_win_add(VFIOContainer *container,
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
new file mode 100644
index 0000000000..f8375f1672
--- /dev/null
+++ b/hw/vfio/iommufd.c
@@ -0,0 +1,545 @@
+/*
+ * iommufd container backend
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "hw/iommufd/iommufd.h"
+#include "hw/qdev-core.h"
+#include "sysemu/reset.h"
+#include "qemu/cutils.h"
+
+static bool iommufd_check_extension(VFIOContainer *bcontainer,
+                                    VFIOContainerFeature feat)
+{
+    switch (feat) {
+    default:
+        return false;
+    };
+}
+
+static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova,
+                       ram_addr_t size, void *vaddr, bool readonly)
+{
+    VFIOIOMMUFDContainer *container = container_of(bcontainer,
+                                                   VFIOIOMMUFDContainer, obj);
+
+    return iommufd_map_dma(container->iommufd, container->ioas_id,
+                           iova, size, vaddr, readonly);
+}
+
+static int iommufd_unmap(VFIOContainer *bcontainer,
+                         hwaddr iova, ram_addr_t size,
+                         IOMMUTLBEntry *iotlb)
+{
+    VFIOIOMMUFDContainer *container = container_of(bcontainer,
+                                                   VFIOIOMMUFDContainer, obj);
+
+    /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */
+    return iommufd_unmap_dma(container->iommufd,
+                             container->ioas_id, iova, size);
+}
+
+static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
+{
+    long int vfio_id = -1, ret = -ENOTTY;
+    char *path, *tmp = NULL;
+    DIR *dir;
+    struct dirent *dent;
+    struct stat st;
+    gchar *contents;
+    gsize length;
+    int major, minor;
+    dev_t vfio_devt;
+
+    path = g_strdup_printf("%s/vfio-device", sysfs_path);
+    if (stat(path, &st) < 0) {
+        error_setg_errno(errp, errno, "no such host device");
+        goto out;
+    }
+
+    dir = opendir(path);
+    if (!dir) {
+        error_setg_errno(errp, errno, "couldn't open directory %s", path);
+        goto out;
+    }
+
+    while ((dent = readdir(dir))) {
+        const char *end_name;
+
+        if (!strncmp(dent->d_name, "vfio", 4)) {
+            ret = qemu_strtol(dent->d_name + 4, &end_name, 10, &vfio_id);
+            if (ret) {
+                error_setg(errp, "suspicious vfio* file in %s", path);
+                closedir(dir);
+                goto out;
+            }
+            break;
+        }
+    }
+    closedir(dir);
+
+    if (vfio_id < 0) {
+        error_setg(errp, "failed to find vfio* file in %s", path);
+        goto out;
+    }
+
+    /* check if the major:minor matches */
+    tmp = g_strdup_printf("%s/vfio%ld/dev", path, vfio_id);
+    if (!g_file_get_contents(tmp, &contents, &length, NULL)) {
+        error_setg(errp, "failed to load \"%s\"", tmp);
+        goto out;
+    }
+
+    if (sscanf(contents, "%d:%d", &major, &minor) != 2) {
+        error_setg(errp, "failed to parse major:minor in \"%s\"", tmp);
+        g_free(contents);
+        goto out;
+    }
+    g_free(contents);
+    g_free(tmp);
+
+    tmp = g_strdup_printf("/dev/vfio/devices/vfio%ld", vfio_id);
+    if (stat(tmp, &st) < 0) {
+        error_setg_errno(errp, errno, "no such vfio device");
+        goto out;
+    }
+    vfio_devt = makedev(major, minor);
+    if (st.st_rdev != vfio_devt) {
+        error_setg(errp, "device numbers do not match: %lu != %lu",
+                   (unsigned long)vfio_devt, (unsigned long)st.st_rdev);
+        goto out;
+    }
+
+    ret = qemu_open_old(tmp, O_RDWR);
+    if (ret < 0) {
+        error_setg(errp, "Failed to open %s", tmp);
+    }
+    trace_vfio_iommufd_get_devicefd(tmp, ret);
+out:
+    if (errp && *errp) {
+        error_prepend(errp, VFIO_MSG_PREFIX, path);
+    }
+    g_free(tmp);
+    g_free(path);
+    return ret;
+}
+
+static VFIOIOASHwpt *vfio_container_get_hwpt(VFIOIOMMUFDContainer *container,
+                                             uint32_t hwpt_id)
+{
+    VFIOIOASHwpt *hwpt;
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        if (hwpt->hwpt_id == hwpt_id) {
+            return hwpt;
+        }
+    }
+
+    hwpt = g_malloc0(sizeof(*hwpt));
+
+    hwpt->hwpt_id = hwpt_id;
+    QLIST_INIT(&hwpt->device_list);
+    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
+
+    return hwpt;
+}
+
+static void vfio_container_put_hwpt(VFIOIOASHwpt *hwpt)
+{
+    if (!QLIST_EMPTY(&hwpt->device_list)) {
+        g_assert_not_reached();
+    }
+    QLIST_REMOVE(hwpt, next);
+    g_free(hwpt);
+}
+
+static VFIOIOASHwpt *vfio_find_hwpt_for_dev(VFIOIOMMUFDContainer *container,
+                                            VFIODevice *vbasedev)
+{
+    VFIOIOASHwpt *hwpt;
+    VFIODevice *vbasedev_iter;
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, hwpt_next) {
+            if (vbasedev_iter == vbasedev) {
+                return hwpt;
+            }
+        }
+    }
+    return NULL;
+}
+
+static void
+__vfio_device_detach_container(VFIODevice *vbasedev,
+                               VFIOIOMMUFDContainer *container, Error **errp)
+{
+    struct vfio_device_detach_ioas detach_data = {
+        .argsz = sizeof(detach_data),
+        .flags = 0,
+        .iommufd = container->iommufd,
+        .ioas_id = container->ioas_id,
+    };
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOAS, &detach_data)) {
+        error_setg_errno(errp, errno, "detach %s from ioas id=%d failed",
+                         vbasedev->name, container->ioas_id);
+    }
+    trace_vfio_iommufd_detach_device(container->iommufd, vbasedev->name,
+                                     container->ioas_id);
+
+    /* iommufd unbind is done per device fd close */
+}
+
+static void vfio_device_detach_container(VFIODevice *vbasedev,
+                                         VFIOIOMMUFDContainer *container,
+                                         Error **errp)
+{
+    VFIOIOASHwpt *hwpt;
+
+    hwpt = vfio_find_hwpt_for_dev(container, vbasedev);
+    if (hwpt) {
+        QLIST_REMOVE(vbasedev, hwpt_next);
+        if (QLIST_EMPTY(&hwpt->device_list)) {
+            vfio_container_put_hwpt(hwpt);
+        }
+    }
+
+    __vfio_device_detach_container(vbasedev, container, errp);
+}
+
+static int vfio_device_attach_container(VFIODevice *vbasedev,
+                                        VFIOIOMMUFDContainer *container,
+                                        Error **errp)
+{
+    struct vfio_device_bind_iommufd bind = {
+        .argsz = sizeof(bind),
+        .flags = 0,
+        .iommufd = container->iommufd,
+        .dev_cookie = (uintptr_t)vbasedev,
+    };
+    struct vfio_device_attach_ioas attach_data = {
+        .argsz = sizeof(attach_data),
+        .flags = 0,
+        .iommufd = container->iommufd,
+        .ioas_id = container->ioas_id,
+    };
+    VFIOIOASHwpt *hwpt;
+    int ret;
+
+    /* Bind device to iommufd */
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
+    if (ret) {
+        error_setg_errno(errp, errno, "failed to bind device fd=%d to iommufd=%d",
+                         vbasedev->fd, bind.iommufd);
+        return ret;
+    }
+
+    vbasedev->devid = bind.out_devid;
+    trace_vfio_iommufd_bind_device(bind.iommufd, vbasedev->name,
+                                   vbasedev->fd, vbasedev->devid);
+
+    /* Attach device to an ioas within iommufd */
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOAS, &attach_data);
+    if (ret) {
+        error_setg_errno(errp, errno,
+                         "[iommufd=%d] failed to attach %s (%d) to ioasid=%d",
+                         container->iommufd, vbasedev->name, vbasedev->fd,
+                         attach_data.ioas_id);
+        return ret;
+    }
+    trace_vfio_iommufd_attach_device(bind.iommufd, vbasedev->name,
+                                     vbasedev->fd, container->ioas_id,
+                                     attach_data.out_hwpt_id);
+
+    hwpt = vfio_container_get_hwpt(container, attach_data.out_hwpt_id);
+
+    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
+    return 0;
+}
+
+static int vfio_device_reset(VFIODevice *vbasedev)
+{
+    if (vbasedev->dev->realized) {
+        vbasedev->ops->vfio_compute_needs_reset(vbasedev);
+        if (vbasedev->needs_reset) {
+            return vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+        }
+    }
+    return 0;
+}
+
+static int vfio_iommufd_container_reset(VFIOContainer *bcontainer)
+{
+    VFIOIOMMUFDContainer *container;
+    int ret, final_ret = 0;
+    VFIODevice *vbasedev;
+    VFIOIOASHwpt *hwpt;
+
+    container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        QLIST_FOREACH(vbasedev, &hwpt->device_list, hwpt_next) {
+            ret = vfio_device_reset(vbasedev);
+            if (ret) {
+                error_report("failed to reset %s (%d)", vbasedev->name, ret);
+                final_ret = ret;
+            } else {
+                trace_vfio_iommufd_container_reset(vbasedev->name);
+            }
+        }
+    }
+    return final_ret;
+}
+
+static void vfio_iommufd_container_destroy(VFIOIOMMUFDContainer *container)
+{
+    vfio_container_destroy(&container->obj);
+    g_free(container);
+}
+
+static int vfio_ram_block_discard_disable(bool state)
+{
+    /*
+     * We support coordinated discarding of RAM via the RamDiscardManager.
+     */
+    return ram_block_uncoordinated_discard_disable(state);
+}
+
+static void iommufd_detach_device(VFIODevice *vbasedev);
+
+static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as,
+                                 Error **errp)
+{
+    VFIOContainer *bcontainer;
+    VFIOIOMMUFDContainer *container;
+    VFIOAddressSpace *space;
+    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
+    int ret, devfd, iommufd;
+    uint32_t ioas_id;
+    Error *err = NULL;
+
+    devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp);
+    if (devfd < 0) {
+        return devfd;
+    }
+    vbasedev->fd = devfd;
+
+    space = vfio_get_address_space(as);
+
+    /* try to attach to an existing container in this space */
+    QLIST_FOREACH(bcontainer, &space->containers, next) {
+        if (!object_dynamic_cast(OBJECT(bcontainer),
+                                 TYPE_VFIO_IOMMUFD_CONTAINER)) {
+            continue;
+        }
+        container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
+        if (vfio_device_attach_container(vbasedev, container, &err)) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vfio_iommufd_fail_attach_existing_container(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            ret = vfio_ram_block_discard_disable(true);
+            if (ret) {
+                vfio_device_detach_container(vbasedev, container, &err);
+                error_propagate(errp, err);
+                vfio_put_address_space(space);
+                close(vbasedev->fd);
+                error_prepend(errp,
+                              "Cannot set discarding of RAM broken (%d)", -ret);
+                return ret;
+            }
+            goto out;
+        }
+    }
+
+    /* Need to allocate a new dedicated container */
+    ret = iommufd_get_ioas(&iommufd, &ioas_id);
+    if (ret < 0) {
+        vfio_put_address_space(space);
+        close(vbasedev->fd);
+        error_report("Failed to alloc ioas (%s)", strerror(errno));
+        return ret;
+    }
+
+    trace_vfio_iommufd_alloc_ioas(iommufd, ioas_id);
+
+    container = g_malloc0(sizeof(*container));
+    container->iommufd = iommufd;
+    container->ioas_id = ioas_id;
+    QLIST_INIT(&container->hwpt_list);
+
+    bcontainer = &container->obj;
+    vfio_container_init(bcontainer, sizeof(*bcontainer),
+                        TYPE_VFIO_IOMMUFD_CONTAINER, space);
+
+    ret = vfio_device_attach_container(vbasedev, container, &err);
+    if (ret) {
+        /* TODO: check whether any additional cleanup is needed */
+        error_propagate(errp, err);
+        vfio_iommufd_container_destroy(container);
+        iommufd_put_ioas(iommufd, ioas_id);
+        vfio_put_address_space(space);
+        close(vbasedev->fd);
+        return ret;
+    }
+
+    ret = vfio_ram_block_discard_disable(true);
+    if (ret) {
+        vfio_device_detach_container(vbasedev, container, &err);
+        error_propagate(errp, err);
+        error_prepend(errp, "Cannot set discarding of RAM broken (%d)", -ret);
+        vfio_iommufd_container_destroy(container);
+        iommufd_put_ioas(iommufd, ioas_id);
+        vfio_put_address_space(space);
+        close(vbasedev->fd);
+        return ret;
+    }
+
+    /*
+     * TODO: for now the iommufd BE is on par with vfio iommu type1, so it is
+     * fine to add the whole range as a single window. For SPAPR, the code
+     * below should be updated.
+     */
+    vfio_host_win_add(bcontainer, 0, (hwaddr)-1, 4096);
+
+    /*
+     * TODO: kvmgroup, unable to do it before the protocol done
+     * between iommufd and kvm.
+     */
+
+    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
+
+    bcontainer->listener = vfio_memory_listener;
+
+    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+
+    bcontainer->initialized = true;
+
+out:
+    vbasedev->container = bcontainer;
+
+    /*
+     * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
+     * for discarding incompatibility check as well?
+     */
+    if (vbasedev->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(false);
+    }
+
+    ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info);
+    if (ret) {
+        error_setg_errno(errp, errno, "error getting device info");
+        /*
+         * Use iommufd_detach_device() here, as this may fail after a new
+         * device has been attached to an existing container.
+         */
+        iommufd_detach_device(vbasedev);
+        close(vbasedev->fd);
+        return ret;
+    }
+
+    vbasedev->group = NULL;
+    vbasedev->num_irqs = dev_info.num_irqs;
+    vbasedev->num_regions = dev_info.num_regions;
+    vbasedev->flags = dev_info.flags;
+    vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
+
+    trace_vfio_iommufd_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
+                                   vbasedev->num_regions, vbasedev->flags);
+    return 0;
+}
+
+static void iommufd_detach_device(VFIODevice *vbasedev)
+{
+    VFIOContainer *bcontainer = vbasedev->container;
+    VFIOIOMMUFDContainer *container;
+    VFIODevice *vbasedev_iter;
+    VFIOIOASHwpt *hwpt;
+    Error *err = NULL;
+
+    if (!bcontainer) {
+        goto out;
+    }
+
+    if (!vbasedev->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(false);
+    }
+
+    container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, hwpt_next) {
+            if (vbasedev_iter == vbasedev) {
+                goto found;
+            }
+        }
+    }
+    g_assert_not_reached();
+found:
+    QLIST_REMOVE(vbasedev, hwpt_next);
+    if (QLIST_EMPTY(&hwpt->device_list)) {
+        vfio_container_put_hwpt(hwpt);
+    }
+
+    __vfio_device_detach_container(vbasedev, container, &err);
+    if (err) {
+        error_report_err(err);
+    }
+    if (QLIST_EMPTY(&container->hwpt_list)) {
+        VFIOAddressSpace *space = bcontainer->space;
+
+        iommufd_put_ioas(container->iommufd, container->ioas_id);
+        vfio_iommufd_container_destroy(container);
+        vfio_put_address_space(space);
+    }
+    vbasedev->container = NULL;
+out:
+    close(vbasedev->fd);
+    g_free(vbasedev->name);
+}
+
+static void vfio_iommufd_class_init(ObjectClass *klass,
+                                    void *data)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass);
+
+    vccs->check_extension = iommufd_check_extension;
+    vccs->dma_map = iommufd_map;
+    vccs->dma_unmap = iommufd_unmap;
+    vccs->attach_device = iommufd_attach_device;
+    vccs->detach_device = iommufd_detach_device;
+    vccs->reset = vfio_iommufd_container_reset;
+}
+
+static const TypeInfo vfio_iommufd_info = {
+    .parent = TYPE_VFIO_CONTAINER_OBJ,
+    .name = TYPE_VFIO_IOMMUFD_CONTAINER,
+    .class_init = vfio_iommufd_class_init,
+};
+
+static void vfio_iommufd_register_types(void)
+{
+    type_register_static(&vfio_iommufd_info);
+}
+
+type_init(vfio_iommufd_register_types)
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index df4fa2b695..3c53c87200 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -7,6 +7,9 @@ vfio_ss.add(files(
   'spapr.c',
   'migration.c',
 ))
+vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
+  'iommufd.c',
+))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
   'pci-quirks.c',
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ab6d339d..cf5703f94b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3148,6 +3148,16 @@ static void vfio_pci_reset(DeviceState *dev)
         goto post_reset;
     }
 
+    /*
+     * This is a temporary check, long term iommufd should
+     * support hot reset as well
+     */
+    if (vdev->vbasedev.be == VFIO_IOMMU_BACKEND_TYPE_IOMMUFD) {
+        error_report("Dangerous: iommufd BE doesn't support hot "
+                     "reset, please stop the VM");
+        goto post_reset;
+    }
+
     /* See if we can do our own bus reset */
     if (!vfio_pci_hot_reset_one(vdev)) {
         goto post_reset;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ef1b5f4a6..51f04b0b80 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -165,3 +165,14 @@ vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t dat
 vfio_load_cleanup(const char *name) " (%s)"
 vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
+
+# iommufd.c
+
+vfio_iommufd_get_devicefd(const char *dev, int devfd) " %s (fd=%d)"
+vfio_iommufd_bind_device(int iommufd, const char *name, int devfd, int devid) " [iommufd=%d] Successfully bound device %s (fd=%d): output devid=%d"
+vfio_iommufd_attach_device(int iommufd, const char *name, int devfd, int ioasid, int hwptid) " [iommufd=%d] Successfully attached device %s (%d) to ioasid=%d: output hwpt_id=%d"
+vfio_iommufd_detach_device(int iommufd, const char *name, int ioasid) " [iommufd=%d] Detached %s from ioasid=%d"
+vfio_iommufd_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD container with ioasid=%d"
+vfio_iommufd_device_info(char *name, int devfd, int num_irqs, int num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d"
+vfio_iommufd_fail_attach_existing_container(const char *msg) " %s"
+vfio_iommufd_container_reset(char *name) " Successfully reset %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2040c27cda..19731ea685 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -81,6 +81,22 @@ typedef struct VFIOLegacyContainer {
     QLIST_HEAD(, VFIOGroup) group_list;
 } VFIOLegacyContainer;
 
+typedef struct VFIOIOASHwpt {
+    uint32_t hwpt_id;
+    QLIST_HEAD(, VFIODevice) device_list;
+    QLIST_ENTRY(VFIOIOASHwpt) next;
+} VFIOIOASHwpt;
+
+typedef struct VFIOIOMMUFDContainer {
+    VFIOContainer obj;
+    int iommufd; /* /dev/iommu fd, shared by the devices in this container */
+    uint32_t ioas_id;
+    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
+} VFIOIOMMUFDContainer;
+
+typedef QLIST_HEAD(VFIOAddressSpaceList, VFIOAddressSpace) VFIOAddressSpaceList;
+extern VFIOAddressSpaceList vfio_address_spaces;
+
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef enum VFIOIOMMUBackendType {
@@ -90,6 +106,7 @@ typedef enum VFIOIOMMUBackendType {
 
 typedef struct VFIODevice {
     QLIST_ENTRY(VFIODevice) next;
+    QLIST_ENTRY(VFIODevice) hwpt_next;
     struct VFIOGroup *group;
     VFIOContainer *container;
     char *sysfsdev;
@@ -97,6 +114,7 @@ typedef struct VFIODevice {
     DeviceState *dev;
     int fd;
     int type;
+    int devid;
     bool reset_works;
     bool needs_reset;
     bool no_mmap;
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
index ffd8590ff8..b5ef2160d8 100644
--- a/include/hw/vfio/vfio-container-obj.h
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -43,6 +43,7 @@
                          TYPE_VFIO_CONTAINER_OBJ)
 
 #define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
+#define TYPE_VFIO_IOMMUFD_CONTAINER "qemu:vfio-iommufd-container"
 
 typedef enum VFIOContainerFeature {
     VFIO_FEAT_LIVE_MIGRATION,
-- 
2.27.0



* [RFC 15/18] vfio/iommufd: Implement iommufd backend
@ 2022-04-14 10:47   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

Add the iommufd backend. The IOMMUFD container class is implemented
based on the new /dev/iommu user API. This backend obviously depends
on CONFIG_IOMMUFD.

So far the iommufd backend does not support live migration or cache
coherency, due to missing support in the host kernel; as a result,
only a subset of the container class callbacks is implemented.

Co-authored-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                         |   2 +-
 hw/vfio/iommufd.c                    | 545 +++++++++++++++++++++++++++
 hw/vfio/meson.build                  |   3 +
 hw/vfio/pci.c                        |  10 +
 hw/vfio/trace-events                 |  11 +
 include/hw/vfio/vfio-common.h        |  18 +
 include/hw/vfio/vfio-container-obj.h |   1 +
 7 files changed, 589 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/iommufd.c

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 4abaa4068f..94618efd1f 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -41,7 +41,7 @@
 #include "qapi/error.h"
 #include "migration/migration.h"
 
-static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
+VFIOAddressSpaceList vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
 void vfio_host_win_add(VFIOContainer *container,
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
new file mode 100644
index 0000000000..f8375f1672
--- /dev/null
+++ b/hw/vfio/iommufd.c
@@ -0,0 +1,545 @@
+/*
+ * iommufd container backend
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu <yi.l.liu@intel.com>
+ *          Eric Auger <eric.auger@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+
+#include "hw/vfio/vfio-common.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "hw/iommufd/iommufd.h"
+#include "hw/qdev-core.h"
+#include "sysemu/reset.h"
+#include "qemu/cutils.h"
+
+static bool iommufd_check_extension(VFIOContainer *bcontainer,
+                                    VFIOContainerFeature feat)
+{
+    switch (feat) {
+    default:
+        return false;
+    };
+}
+
+static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova,
+                       ram_addr_t size, void *vaddr, bool readonly)
+{
+    VFIOIOMMUFDContainer *container = container_of(bcontainer,
+                                                   VFIOIOMMUFDContainer, obj);
+
+    return iommufd_map_dma(container->iommufd, container->ioas_id,
+                           iova, size, vaddr, readonly);
+}
+
+static int iommufd_unmap(VFIOContainer *bcontainer,
+                         hwaddr iova, ram_addr_t size,
+                         IOMMUTLBEntry *iotlb)
+{
+    VFIOIOMMUFDContainer *container = container_of(bcontainer,
+                                                   VFIOIOMMUFDContainer, obj);
+
+    /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */
+    return iommufd_unmap_dma(container->iommufd,
+                             container->ioas_id, iova, size);
+}
+
+static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
+{
+    long int vfio_id = -1, ret = -ENOTTY;
+    char *path, *tmp = NULL;
+    DIR *dir;
+    struct dirent *dent;
+    struct stat st;
+    gchar *contents;
+    gsize length;
+    int major, minor;
+    dev_t vfio_devt;
+
+    path = g_strdup_printf("%s/vfio-device", sysfs_path);
+    if (stat(path, &st) < 0) {
+        error_setg_errno(errp, errno, "no such host device");
+        goto out;
+    }
+
+    dir = opendir(path);
+    if (!dir) {
+        error_setg_errno(errp, errno, "couldn't open dirrectory %s", path);
+        goto out;
+    }
+
+    while ((dent = readdir(dir))) {
+        const char *end_name;
+
+        if (!strncmp(dent->d_name, "vfio", 4)) {
+            ret = qemu_strtol(dent->d_name + 4, &end_name, 10, &vfio_id);
+            if (ret) {
+                error_setg(errp, "suspicious vfio* file in %s", path);
+                goto out;
+            }
+            break;
+        }
+    }
+
+    /* check if the major:minor matches */
+    tmp = g_strdup_printf("%s/%s/dev", path, dent->d_name);
+    if (!g_file_get_contents(tmp, &contents, &length, NULL)) {
+        error_setg(errp, "failed to load \"%s\"", tmp);
+        goto out;
+    }
+
+    if (sscanf(contents, "%d:%d", &major, &minor) != 2) {
+        error_setg(errp, "failed to load \"%s\"", tmp);
+        goto out;
+    }
+    g_free(contents);
+    g_free(tmp);
+
+    tmp = g_strdup_printf("/dev/vfio/devices/vfio%ld", vfio_id);
+    if (stat(tmp, &st) < 0) {
+        error_setg_errno(errp, errno, "no such vfio device");
+        goto out;
+    }
+    vfio_devt = makedev(major, minor);
+    if (st.st_rdev != vfio_devt) {
+        error_setg(errp, "minor do not match: %lu, %lu", vfio_devt, st.st_rdev);
+        goto out;
+    }
+
+    ret = qemu_open_old(tmp, O_RDWR);
+    if (ret < 0) {
+        error_setg(errp, "Failed to open %s", tmp);
+    }
+    trace_vfio_iommufd_get_devicefd(tmp, ret);
+out:
+    g_free(tmp);
+    g_free(path);
+
+    if (*errp) {
+        error_prepend(errp, VFIO_MSG_PREFIX, path);
+    }
+    return ret;
+}
+
+static VFIOIOASHwpt *vfio_container_get_hwpt(VFIOIOMMUFDContainer *container,
+                                             uint32_t hwpt_id)
+{
+    VFIOIOASHwpt *hwpt;
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        if (hwpt->hwpt_id == hwpt_id) {
+            return hwpt;
+        }
+    }
+
+    hwpt = g_malloc0(sizeof(*hwpt));
+
+    hwpt->hwpt_id = hwpt_id;
+    QLIST_INIT(&hwpt->device_list);
+    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
+
+    return hwpt;
+}
+
+static void vfio_container_put_hwpt(VFIOIOASHwpt *hwpt)
+{
+    if (!QLIST_EMPTY(&hwpt->device_list)) {
+        g_assert_not_reached();
+    }
+    QLIST_REMOVE(hwpt, next);
+    g_free(hwpt);
+}
+
+static VFIOIOASHwpt *vfio_find_hwpt_for_dev(VFIOIOMMUFDContainer *container,
+                                            VFIODevice *vbasedev)
+{
+    VFIOIOASHwpt *hwpt;
+    VFIODevice *vbasedev_iter;
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, hwpt_next) {
+            if (vbasedev_iter == vbasedev) {
+                return hwpt;
+            }
+        }
+    }
+    return NULL;
+}
+
+static void
+__vfio_device_detach_container(VFIODevice *vbasedev,
+                               VFIOIOMMUFDContainer *container, Error **errp)
+{
+    struct vfio_device_detach_ioas detach_data = {
+        .argsz = sizeof(detach_data),
+        .flags = 0,
+        .iommufd = container->iommufd,
+        .ioas_id = container->ioas_id,
+    };
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOAS, &detach_data)) {
+        error_setg_errno(errp, errno, "detach %s from ioas id=%d failed",
+                         vbasedev->name, container->ioas_id);
+    }
+    trace_vfio_iommufd_detach_device(container->iommufd, vbasedev->name,
+                                     container->ioas_id);
+
+    /* iommufd unbind is done per device fd close */
+}
+
+static void vfio_device_detach_container(VFIODevice *vbasedev,
+                                         VFIOIOMMUFDContainer *container,
+                                         Error **errp)
+{
+    VFIOIOASHwpt *hwpt;
+
+    hwpt = vfio_find_hwpt_for_dev(container, vbasedev);
+    if (hwpt) {
+        QLIST_REMOVE(vbasedev, hwpt_next);
+        if (QLIST_EMPTY(&hwpt->device_list)) {
+            vfio_container_put_hwpt(hwpt);
+        }
+    }
+
+    __vfio_device_detach_container(vbasedev, container, errp);
+}
+
+static int vfio_device_attach_container(VFIODevice *vbasedev,
+                                        VFIOIOMMUFDContainer *container,
+                                        Error **errp)
+{
+    struct vfio_device_bind_iommufd bind = {
+        .argsz = sizeof(bind),
+        .flags = 0,
+        .iommufd = container->iommufd,
+        .dev_cookie = (uint64_t)vbasedev,
+    };
+    struct vfio_device_attach_ioas attach_data = {
+        .argsz = sizeof(attach_data),
+        .flags = 0,
+        .iommufd = container->iommufd,
+        .ioas_id = container->ioas_id,
+    };
+    VFIOIOASHwpt *hwpt;
+    int ret;
+
+    /* Bind device to iommufd */
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
+    if (ret) {
+        error_setg_errno(errp, errno, "error bind device fd=%d to iommufd=%d",
+                         vbasedev->fd, bind.iommufd);
+        return ret;
+    }
+
+    vbasedev->devid = bind.out_devid;
+    trace_vfio_iommufd_bind_device(bind.iommufd, vbasedev->name,
+                                   vbasedev->fd, vbasedev->devid);
+
+    /* Attach device to an ioas within iommufd */
+    ret = ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOAS, &attach_data);
+    if (ret) {
+        error_setg_errno(errp, errno,
+                         "[iommufd=%d] error attach %s (%d) to ioasid=%d",
+                         container->iommufd, vbasedev->name, vbasedev->fd,
+                         attach_data.ioas_id);
+        return ret;
+
+    }
+    trace_vfio_iommufd_attach_device(bind.iommufd, vbasedev->name,
+                                     vbasedev->fd, container->ioas_id,
+                                     attach_data.out_hwpt_id);
+
+    hwpt = vfio_container_get_hwpt(container, attach_data.out_hwpt_id);
+
+    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
+    return 0;
+}
+
+static int vfio_device_reset(VFIODevice *vbasedev)
+{
+    if (vbasedev->dev->realized) {
+        vbasedev->ops->vfio_compute_needs_reset(vbasedev);
+        if (vbasedev->needs_reset) {
+            return vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+        }
+    }
+    return 0;
+}
+
+static int vfio_iommufd_container_reset(VFIOContainer *bcontainer)
+{
+    VFIOIOMMUFDContainer *container;
+    int ret, final_ret = 0;
+    VFIODevice *vbasedev;
+    VFIOIOASHwpt *hwpt;
+
+    container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        QLIST_FOREACH(vbasedev, &hwpt->device_list, hwpt_next) {
+            ret = vfio_device_reset(vbasedev);
+            if (ret) {
+                error_report("failed to reset %s (%d)", vbasedev->name, ret);
+                final_ret = ret;
+            } else {
+                trace_vfio_iommufd_container_reset(vbasedev->name);
+            }
+        }
+    }
+    return final_ret;
+}
+
+static void vfio_iommufd_container_destroy(VFIOIOMMUFDContainer *container)
+{
+    vfio_container_destroy(&container->obj);
+    g_free(container);
+}
+
+static int vfio_ram_block_discard_disable(bool state)
+{
+    /*
+     * We support coordinated discarding of RAM via the RamDiscardManager.
+     */
+    return ram_block_uncoordinated_discard_disable(state);
+}
+
+static void iommufd_detach_device(VFIODevice *vbasedev);
+
+static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as,
+                                 Error **errp)
+{
+    VFIOContainer *bcontainer;
+    VFIOIOMMUFDContainer *container;
+    VFIOAddressSpace *space;
+    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
+    int ret, devfd, iommufd;
+    uint32_t ioas_id;
+    Error *err = NULL;
+
+    devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp);
+    if (devfd < 0) {
+        return devfd;
+    }
+    vbasedev->fd = devfd;
+
+    space = vfio_get_address_space(as);
+
+    /* try to attach to an existing container in this space */
+    QLIST_FOREACH(bcontainer, &space->containers, next) {
+        if (!object_dynamic_cast(OBJECT(bcontainer),
+                                 TYPE_VFIO_IOMMUFD_CONTAINER)) {
+            continue;
+        }
+        container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
+        if (vfio_device_attach_container(vbasedev, container, &err)) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vfio_iommufd_fail_attach_existing_container(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            ret = vfio_ram_block_discard_disable(true);
+            if (ret) {
+                vfio_device_detach_container(vbasedev, container, &err);
+                error_propagate(errp, err);
+                vfio_put_address_space(space);
+                close(vbasedev->fd);
+                error_prepend(errp,
+                              "Cannot set discarding of RAM broken (%d)", ret);
+                return ret;
+            }
+            goto out;
+        }
+    }
+
+    /* Need to allocate a new dedicated container */
+    ret = iommufd_get_ioas(&iommufd, &ioas_id);
+    if (ret < 0) {
+        vfio_put_address_space(space);
+        close(vbasedev->fd);
+        error_report("Failed to allocate IOAS (%s)", strerror(errno));
+        return ret;
+    }
+
+    trace_vfio_iommufd_alloc_ioas(iommufd, ioas_id);
+
+    container = g_malloc0(sizeof(*container));
+    container->iommufd = iommufd;
+    container->ioas_id = ioas_id;
+    QLIST_INIT(&container->hwpt_list);
+
+    bcontainer = &container->obj;
+    vfio_container_init(bcontainer, sizeof(*bcontainer),
+                        TYPE_VFIO_IOMMUFD_CONTAINER, space);
+
+    ret = vfio_device_attach_container(vbasedev, container, &err);
+    if (ret) {
+        /* TODO: check whether any additional cleanup is needed */
+        error_propagate(errp, err);
+        vfio_iommufd_container_destroy(container);
+        iommufd_put_ioas(iommufd, ioas_id);
+        vfio_put_address_space(space);
+        close(vbasedev->fd);
+        return ret;
+    }
+
+    ret = vfio_ram_block_discard_disable(true);
+    if (ret) {
+        vfio_device_detach_container(vbasedev, container, &err);
+        error_propagate(errp, err);
+        error_prepend(errp, "Cannot set discarding of RAM broken (%d)", -ret);
+        vfio_iommufd_container_destroy(container);
+        iommufd_put_ioas(iommufd, ioas_id);
+        vfio_put_address_space(space);
+        close(vbasedev->fd);
+        return ret;
+    }
+
+    /*
+     * TODO: for now the iommufd BE is on par with vfio iommu type1, so
+     * it is fine to add the whole range as a window. For SPAPR, the
+     * code below should be updated.
+     */
+    vfio_host_win_add(bcontainer, 0, (hwaddr)-1, 4096);
+
+    /*
+     * TODO: kvmgroup, unable to do it before the protocol done
+     * between iommufd and kvm.
+     */
+
+    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
+
+    bcontainer->listener = vfio_memory_listener;
+
+    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+
+    bcontainer->initialized = true;
+
+out:
+    vbasedev->container = bcontainer;
+
+    /*
+     * TODO: examine RAM_BLOCK_DISCARD handling; should the discard
+     * incompatibility check be done at group level as well?
+     */
+    if (vbasedev->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(false);
+    }
+
+    ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info);
+    if (ret) {
+        error_setg_errno(errp, errno, "error getting device info");
+        /*
+         * iommufd_detach_device() must be used here, as this failure may
+         * occur after attaching a new device to an existing container.
+         */
+        iommufd_detach_device(vbasedev);
+        close(vbasedev->fd);
+        return ret;
+    }
+
+    vbasedev->group = NULL;
+    vbasedev->num_irqs = dev_info.num_irqs;
+    vbasedev->num_regions = dev_info.num_regions;
+    vbasedev->flags = dev_info.flags;
+    vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
+
+    trace_vfio_iommufd_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
+                                   vbasedev->num_regions, vbasedev->flags);
+    return 0;
+}
+
+static void iommufd_detach_device(VFIODevice *vbasedev)
+{
+    VFIOContainer *bcontainer = vbasedev->container;
+    VFIOIOMMUFDContainer *container;
+    VFIODevice *vbasedev_iter;
+    VFIOIOASHwpt *hwpt;
+    Error *err = NULL;
+
+    if (!bcontainer) {
+        goto out;
+    }
+
+    if (!vbasedev->ram_block_discard_allowed) {
+        vfio_ram_block_discard_disable(false);
+    }
+
+    container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        QLIST_FOREACH(vbasedev_iter, &hwpt->device_list, hwpt_next) {
+            if (vbasedev_iter == vbasedev) {
+                goto found;
+            }
+        }
+    }
+    g_assert_not_reached();
+found:
+    QLIST_REMOVE(vbasedev, hwpt_next);
+    if (QLIST_EMPTY(&hwpt->device_list)) {
+        vfio_container_put_hwpt(hwpt);
+    }
+
+    __vfio_device_detach_container(vbasedev, container, &err);
+    if (err) {
+        error_report_err(err);
+    }
+    if (QLIST_EMPTY(&container->hwpt_list)) {
+        VFIOAddressSpace *space = bcontainer->space;
+
+        iommufd_put_ioas(container->iommufd, container->ioas_id);
+        vfio_iommufd_container_destroy(container);
+        vfio_put_address_space(space);
+    }
+    vbasedev->container = NULL;
+out:
+    close(vbasedev->fd);
+    g_free(vbasedev->name);
+}
+
+static void vfio_iommufd_class_init(ObjectClass *klass,
+                                    void *data)
+{
+    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass);
+
+    vccs->check_extension = iommufd_check_extension;
+    vccs->dma_map = iommufd_map;
+    vccs->dma_unmap = iommufd_unmap;
+    vccs->attach_device = iommufd_attach_device;
+    vccs->detach_device = iommufd_detach_device;
+    vccs->reset = vfio_iommufd_container_reset;
+}
+
+static const TypeInfo vfio_iommufd_info = {
+    .parent = TYPE_VFIO_CONTAINER_OBJ,
+    .name = TYPE_VFIO_IOMMUFD_CONTAINER,
+    .class_init = vfio_iommufd_class_init,
+};
+
+static void vfio_iommufd_register_types(void)
+{
+    type_register_static(&vfio_iommufd_info);
+}
+
+type_init(vfio_iommufd_register_types)
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index df4fa2b695..3c53c87200 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -7,6 +7,9 @@ vfio_ss.add(files(
   'spapr.c',
   'migration.c',
 ))
+vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
+  'iommufd.c',
+))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
   'display.c',
   'pci-quirks.c',
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ab6d339d..cf5703f94b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3148,6 +3148,16 @@ static void vfio_pci_reset(DeviceState *dev)
         goto post_reset;
     }
 
+    /*
+     * This is a temporary check; long term, iommufd should
+     * support hot reset as well.
+     */
+    if (vdev->vbasedev.be == VFIO_IOMMU_BACKEND_TYPE_IOMMUFD) {
+        error_report("Dangerous: iommufd BE doesn't support hot "
+                     "reset, please stop the VM");
+        goto post_reset;
+    }
+
     /* See if we can do our own bus reset */
     if (!vfio_pci_hot_reset_one(vdev)) {
         goto post_reset;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ef1b5f4a6..51f04b0b80 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -165,3 +165,14 @@ vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t dat
 vfio_load_cleanup(const char *name) " (%s)"
 vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
+
+# iommufd.c
+
+vfio_iommufd_get_devicefd(const char *dev, int devfd) " %s (fd=%d)"
+vfio_iommufd_bind_device(int iommufd, const char *name, int devfd, int devid) " [iommufd=%d] Successfully bound device %s (fd=%d): output devid=%d"
+vfio_iommufd_attach_device(int iommufd, const char *name, int devfd, int ioasid, int hwptid) " [iommufd=%d] Successfully attached device %s (%d) to ioasid=%d: output hwpt=%d"
+vfio_iommufd_detach_device(int iommufd, const char *name, int ioasid) " [iommufd=%d] Detached %s from ioasid=%d"
+vfio_iommufd_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD container with ioasid=%d"
+vfio_iommufd_device_info(char *name, int devfd, int num_irqs, int num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d"
+vfio_iommufd_fail_attach_existing_container(const char *msg) " %s"
+vfio_iommufd_container_reset(char *name) " Successfully reset %s"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2040c27cda..19731ea685 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -81,6 +81,22 @@ typedef struct VFIOLegacyContainer {
     QLIST_HEAD(, VFIOGroup) group_list;
 } VFIOLegacyContainer;
 
+typedef struct VFIOIOASHwpt {
+    uint32_t hwpt_id;
+    QLIST_HEAD(, VFIODevice) device_list;
+    QLIST_ENTRY(VFIOIOASHwpt) next;
+} VFIOIOASHwpt;
+
+typedef struct VFIOIOMMUFDContainer {
+    VFIOContainer obj;
+    int iommufd; /* /dev/iommu fd, shared by all devices in this container */
+    uint32_t ioas_id;
+    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
+} VFIOIOMMUFDContainer;
+
+typedef QLIST_HEAD(VFIOAddressSpaceList, VFIOAddressSpace) VFIOAddressSpaceList;
+extern VFIOAddressSpaceList vfio_address_spaces;
+
 typedef struct VFIODeviceOps VFIODeviceOps;
 
 typedef enum VFIOIOMMUBackendType {
@@ -90,6 +106,7 @@ typedef enum VFIOIOMMUBackendType {
 
 typedef struct VFIODevice {
     QLIST_ENTRY(VFIODevice) next;
+    QLIST_ENTRY(VFIODevice) hwpt_next;
     struct VFIOGroup *group;
     VFIOContainer *container;
     char *sysfsdev;
@@ -97,6 +114,7 @@ typedef struct VFIODevice {
     DeviceState *dev;
     int fd;
     int type;
+    int devid;
     bool reset_works;
     bool needs_reset;
     bool no_mmap;
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
index ffd8590ff8..b5ef2160d8 100644
--- a/include/hw/vfio/vfio-container-obj.h
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -43,6 +43,7 @@
                          TYPE_VFIO_CONTAINER_OBJ)
 
 #define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
+#define TYPE_VFIO_IOMMUFD_CONTAINER "qemu:vfio-iommufd-container"
 
 typedef enum VFIOContainerFeature {
     VFIO_FEAT_LIVE_MIGRATION,
-- 
2.27.0
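The container-reuse logic in iommufd_attach_device() above boils down to: walk the existing iommufd containers in the address space, attach to the first one that accepts the device, and fall back to a freshly allocated container otherwise. A minimal sketch of that control flow follows; the struct and helper names here are illustrative stand-ins, not the QEMU or kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for VFIOIOMMUFDContainer / VFIODevice. */
struct container {
    int id;
    bool accepts;               /* would vfio_device_attach_container() succeed? */
    struct container *next;
};

struct device {
    struct container *container;
};

/*
 * Mirrors the fallback in iommufd_attach_device(): reuse an existing
 * container when one accepts the device, otherwise take the newly
 * allocated dedicated container.
 */
static struct container *attach(struct device *dev, struct container *head,
                                struct container *fresh)
{
    for (struct container *c = head; c; c = c->next) {
        if (c->accepts) {
            dev->container = c;    /* attached to an existing container */
            return c;
        }
        /* attach failed: trace the error and try the next container */
    }
    dev->container = fresh;        /* dedicated container path */
    return fresh;
}
```

In the real series the "accepts" check is the attach ioctl succeeding against the container's IOAS; the sketch only captures the reuse-then-allocate control flow.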



^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 16/18] vfio/iommufd: Add IOAS_COPY_DMA support
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

Compared with the legacy vfio container BE, one of the benefits provided
by iommufd is reduced redundant page pinning on the kernel side through
the use of IOAS_COPY_DMA. For iommufd containers within the same address
space, IOVA mappings can be copied from a source container to a
destination container.

To achieve this, move the vfio_memory_listener to be per address space.
In the memory listener callbacks, all the containers within the address
space are iterated. For the iommufd containers, QEMU uses IOAS_MAP_DMA
on the first one, and then uses IOAS_COPY_DMA to copy the IOVA mappings
from the first iommufd container to the other iommufd containers within
the address space. For legacy containers, IOVA mapping is done with
VFIO_IOMMU_MAP_DMA.
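The map-once-then-copy policy can be sketched independently of the kernel uAPI. In this model (illustrative names only, not the iommufd API), the first container that supports copying gets a real map and becomes the copy source; later copy-capable containers receive a copy, while the rest fall back to a plain map:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum op { OP_NONE, OP_MAP, OP_COPY };

/* Illustrative stand-in for one container in a VFIOAddressSpace. */
struct ctr {
    bool copy_ok;    /* supports (and would succeed at) IOAS_COPY_DMA */
    enum op last;    /* last operation applied to this container */
};

/*
 * Model of vfio_container_region_add() iterated over an address space:
 * map into the first capable container, then copy that mapping into the
 * rest so the pages are pinned only once on the kernel side.
 */
static void region_add(struct ctr *list, size_t n)
{
    struct ctr *src = NULL;

    for (size_t i = 0; i < n; i++) {
        struct ctr *c = &list[i];

        if (src && c->copy_ok) {
            c->last = OP_COPY;     /* vfio_container_dma_copy() */
            continue;
        }
        c->last = OP_MAP;          /* vfio_container_dma_map() */
        if (c->copy_ok) {
            src = c;               /* later containers copy from here */
        }
    }
}
```

Legacy containers never report copy support, so they always take the VFIO_IOMMU_MAP_DMA path in this model, matching the behavior described above.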

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                         | 117 +++++++++++++++++++++++----
 hw/vfio/container-obj.c              |  17 +++-
 hw/vfio/container.c                  |  19 ++---
 hw/vfio/iommufd.c                    |  43 +++++++---
 include/hw/vfio/vfio-common.h        |   6 +-
 include/hw/vfio/vfio-container-obj.h |   8 +-
 6 files changed, 167 insertions(+), 43 deletions(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 94618efd1f..13a6653a0d 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -388,16 +388,16 @@ static void vfio_unregister_ram_discard_listener(VFIOContainer *container,
     g_free(vrdl);
 }
 
-static void vfio_listener_region_add(MemoryListener *listener,
-                                     MemoryRegionSection *section)
+static void vfio_container_region_add(VFIOContainer *container,
+                                      VFIOContainer **src_container,
+                                      MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
     int ret;
     VFIOHostDMAWindow *hostwin;
-    bool hostwin_found;
+    bool hostwin_found, copy_dma_supported = false;
     Error *err = NULL;
 
     if (vfio_listener_skipped_section(section)) {
@@ -533,12 +533,25 @@ static void vfio_listener_region_add(MemoryListener *listener,
         }
     }
 
+    copy_dma_supported = vfio_container_check_extension(container,
+                                                        VFIO_FEAT_DMA_COPY);
+
+    if (copy_dma_supported && *src_container) {
+        if (!vfio_container_dma_copy(*src_container, container,
+                                     iova, int128_get64(llsize),
+                                     section->readonly)) {
+            return;
+        } else {
+            info_report("IOAS copy failed, falling back to map for container: %p", container);
+        }
+    }
+
     ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
                                  vaddr, section->readonly);
     if (ret) {
-        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                   container, iova, int128_get64(llsize), vaddr, ret);
+        error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
+                   "0x%"HWADDR_PRIx", %p) = %d (%m)", container, iova,
+                   int128_get64(llsize), vaddr, ret);
         if (memory_region_is_ram_device(section->mr)) {
             /* Allow unexpected mappings not to be fatal for RAM devices */
             error_report_err(err);
@@ -547,6 +560,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
         goto fail;
     }
 
+    if (copy_dma_supported) {
+        *src_container = container;
+    }
     return;
 
 fail:
@@ -573,10 +589,22 @@ fail:
     }
 }
 
-static void vfio_listener_region_del(MemoryListener *listener,
+static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container, *src_container;
+
+    src_container = NULL;
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_region_add(container, &src_container, section);
+    }
+}
+
+static void vfio_container_region_del(VFIOContainer *container,
+                                      MemoryRegionSection *section)
+{
     hwaddr iova, end;
     Int128 llend, llsize;
     int ret;
@@ -682,18 +710,38 @@ static void vfio_listener_region_del(MemoryListener *listener,
     vfio_container_del_section_window(container, section);
 }
 
+static void vfio_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container;
+
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_region_del(container, section);
+    }
+}
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container;
 
-    vfio_container_set_dirty_page_tracking(container, true);
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_set_dirty_page_tracking(container, true);
+    }
 }
 
 static void vfio_listener_log_global_stop(MemoryListener *listener)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container;
 
-    vfio_container_set_dirty_page_tracking(container, false);
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_set_dirty_page_tracking(container, false);
+    }
 }
 
 typedef struct {
@@ -823,11 +871,9 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
                    int128_get64(section->size), ram_addr);
 }
 
-static void vfio_listener_log_sync(MemoryListener *listener,
-        MemoryRegionSection *section)
+static void vfio_container_log_sync(VFIOContainer *container,
+                                    MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-
     if (vfio_listener_skipped_section(section) ||
         !container->dirty_pages_supported) {
         return;
@@ -838,6 +884,18 @@ static void vfio_listener_log_sync(MemoryListener *listener,
     }
 }
 
+static void vfio_listener_log_sync(MemoryListener *listener,
+                                   MemoryRegionSection *section)
+{
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container;
+
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_log_sync(container, section);
+    }
+}
+
 const MemoryListener vfio_memory_listener = {
     .name = "vfio",
     .region_add = vfio_listener_region_add,
@@ -882,6 +940,31 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     return space;
 }
 
+void vfio_as_add_container(VFIOAddressSpace *space,
+                           VFIOContainer *container)
+{
+    if (space->listener_initialized) {
+        memory_listener_unregister(&space->listener);
+    }
+
+    QLIST_INSERT_HEAD(&space->containers, container, next);
+
+    /* Unregistration happens in vfio_as_del_container() */
+    space->listener = vfio_memory_listener;
+    memory_listener_register(&space->listener, space->as);
+    space->listener_initialized = true;
+}
+
+void vfio_as_del_container(VFIOAddressSpace *space,
+                           VFIOContainer *container)
+{
+    QLIST_SAFE_REMOVE(container, next);
+
+    if (QLIST_EMPTY(&space->containers)) {
+        memory_listener_unregister(&space->listener);
+    }
+}
+
 void vfio_put_address_space(VFIOAddressSpace *space)
 {
     if (QLIST_EMPTY(&space->containers)) {
diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
index c4220336af..2c79089364 100644
--- a/hw/vfio/container-obj.c
+++ b/hw/vfio/container-obj.c
@@ -27,6 +27,7 @@
 #include "qom/object.h"
 #include "qapi/visitor.h"
 #include "hw/vfio/vfio-container-obj.h"
+#include "exec/memory.h"
 
 bool vfio_container_check_extension(VFIOContainer *container,
                                     VFIOContainerFeature feat)
@@ -53,6 +54,20 @@ int vfio_container_dma_map(VFIOContainer *container,
     return vccs->dma_map(container, iova, size, vaddr, readonly);
 }
 
+int vfio_container_dma_copy(VFIOContainer *src, VFIOContainer *dst,
+                            hwaddr iova, ram_addr_t size, bool readonly)
+{
+    VFIOContainerClass *vccs1 = VFIO_CONTAINER_OBJ_GET_CLASS(src);
+    VFIOContainerClass *vccs2 = VFIO_CONTAINER_OBJ_GET_CLASS(dst);
+
+    if (!vccs1->dma_copy || vccs1->dma_copy != vccs2->dma_copy) {
+        error_report("Incompatible containers: unable to copy DMA mappings");
+        return -EINVAL;
+    }
+
+    return vccs1->dma_copy(src, dst, iova, size, readonly);
+}
+
 int vfio_container_dma_unmap(VFIOContainer *container,
                              hwaddr iova, ram_addr_t size,
                              IOMMUTLBEntry *iotlb)
@@ -165,8 +180,6 @@ void vfio_container_destroy(VFIOContainer *container)
     VFIOGuestIOMMU *giommu, *tmp;
     VFIOHostDMAWindow *hostwin, *next;
 
-    QLIST_SAFE_REMOVE(container, next);
-
     QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
         RamDiscardManager *rdm;
 
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 2f59422048..6bc1b8763f 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -357,9 +357,6 @@ err_out:
 
 static void vfio_listener_release(VFIOLegacyContainer *container)
 {
-    VFIOContainer *bcontainer = &container->obj;
-
-    memory_listener_unregister(&bcontainer->listener);
     if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
         memory_listener_unregister(&container->prereg_listener);
     }
@@ -887,14 +884,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     vfio_kvm_device_add_group(group);
 
     QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
 
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
-    bcontainer->listener = vfio_memory_listener;
-
-    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+    vfio_as_add_container(space, bcontainer);
 
     if (bcontainer->error) {
         ret = -1;
@@ -907,8 +901,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
     return 0;
 listener_release_exit:
+    vfio_as_del_container(space, bcontainer);
     QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(bcontainer, next);
     vfio_kvm_device_del_group(group);
     vfio_listener_release(container);
 
@@ -931,6 +925,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOLegacyContainer *container = group->container;
     VFIOContainer *bcontainer = &container->obj;
+    VFIOAddressSpace *space = bcontainer->space;
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
@@ -938,10 +933,12 @@ static void vfio_disconnect_container(VFIOGroup *group)
     /*
      * Explicitly release the listener first before unset container,
      * since unset may destroy the backend container if it's the last
-     * group.
+     * group. Removing the container from the list disconnects it from
+     * the address space memory listener.
      */
     if (QLIST_EMPTY(&container->group_list)) {
         vfio_listener_release(container);
+        vfio_as_del_container(space, bcontainer);
     }
 
     if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
@@ -950,10 +947,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 
     if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = bcontainer->space;
-
-        vfio_container_destroy(bcontainer);
         trace_vfio_disconnect_container(container->fd);
+        vfio_container_destroy(bcontainer);
         close(container->fd);
         g_free(container);
 
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index f8375f1672..8ff5988b07 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -38,6 +38,8 @@ static bool iommufd_check_extension(VFIOContainer *bcontainer,
                                     VFIOContainerFeature feat)
 {
     switch (feat) {
+    case VFIO_FEAT_DMA_COPY:
+        return true;
     default:
         return false;
     };
@@ -49,10 +51,25 @@ static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova,
     VFIOIOMMUFDContainer *container = container_of(bcontainer,
                                                    VFIOIOMMUFDContainer, obj);
 
-    return iommufd_map_dma(container->iommufd, container->ioas_id,
+    return iommufd_map_dma(container->iommufd,
+                           container->ioas_id,
                            iova, size, vaddr, readonly);
 }
 
+static int iommufd_copy(VFIOContainer *src, VFIOContainer *dst,
+                        hwaddr iova, ram_addr_t size, bool readonly)
+{
+    VFIOIOMMUFDContainer *container_src = container_of(src,
+                                                   VFIOIOMMUFDContainer, obj);
+    VFIOIOMMUFDContainer *container_dst = container_of(dst,
+                                                   VFIOIOMMUFDContainer, obj);
+
+    assert(container_src->iommufd == container_dst->iommufd);
+
+    return iommufd_copy_dma(container_src->iommufd, container_src->ioas_id,
+                            container_dst->ioas_id, iova, size, readonly);
+}
+
 static int iommufd_unmap(VFIOContainer *bcontainer,
                          hwaddr iova, ram_addr_t size,
                          IOMMUTLBEntry *iotlb)
@@ -428,12 +445,7 @@ static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as,
      * between iommufd and kvm.
      */
 
-    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
-
-    bcontainer->listener = vfio_memory_listener;
-
-    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
-
+    vfio_as_add_container(space, bcontainer);
     bcontainer->initialized = true;
 
 out:
@@ -476,6 +488,7 @@ static void iommufd_detach_device(VFIODevice *vbasedev)
     VFIOIOMMUFDContainer *container;
     VFIODevice *vbasedev_iter;
     VFIOIOASHwpt *hwpt;
+    VFIOAddressSpace *space;
     Error *err;
 
     if (!bcontainer) {
@@ -501,15 +514,26 @@ found:
         vfio_container_put_hwpt(hwpt);
     }
 
+    space = bcontainer->space;
+    /*
+     * The bcontainer must be removed from the space->containers list
+     * before detaching the container; otherwise, detaching may destroy
+     * the container if this is the last device. Removing bcontainer from
+     * the list disconnects it from the address space memory listener.
+     */
+    if (QLIST_EMPTY(&container->hwpt_list)) {
+        vfio_as_del_container(space, bcontainer);
+    }
     __vfio_device_detach_container(vbasedev, container, &err);
     if (err) {
         error_report_err(err);
     }
     if (QLIST_EMPTY(&container->hwpt_list)) {
-        VFIOAddressSpace *space = bcontainer->space;
+        int iommufd = container->iommufd;
+        uint32_t ioas_id = container->ioas_id;
 
-        iommufd_put_ioas(container->iommufd, container->ioas_id);
         vfio_iommufd_container_destroy(container);
+        iommufd_put_ioas(iommufd, ioas_id);
         vfio_put_address_space(space);
     }
     vbasedev->container = NULL;
@@ -525,6 +549,7 @@ static void vfio_iommufd_class_init(ObjectClass *klass,
 
     vccs->check_extension = iommufd_check_extension;
     vccs->dma_map = iommufd_map;
+    vccs->dma_copy = iommufd_copy;
     vccs->dma_unmap = iommufd_unmap;
     vccs->attach_device = iommufd_attach_device;
     vccs->detach_device = iommufd_detach_device;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 19731ea685..bef48ddfaf 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -34,8 +34,6 @@
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
-extern const MemoryListener vfio_memory_listener;
-
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
     VFIO_DEVICE_TYPE_PLATFORM = 1,
@@ -181,6 +179,10 @@ void vfio_host_win_add(VFIOContainer *bcontainer,
 int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
                       hwaddr max_iova);
 VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
+void vfio_as_add_container(VFIOAddressSpace *space,
+                           VFIOContainer *bcontainer);
+void vfio_as_del_container(VFIOAddressSpace *space,
+                           VFIOContainer *container);
 void vfio_put_address_space(VFIOAddressSpace *space);
 
 void vfio_put_base_device(VFIODevice *vbasedev);
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
index b5ef2160d8..b65f827bc1 100644
--- a/include/hw/vfio/vfio-container-obj.h
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -47,12 +47,15 @@
 
 typedef enum VFIOContainerFeature {
     VFIO_FEAT_LIVE_MIGRATION,
+    VFIO_FEAT_DMA_COPY,
 } VFIOContainerFeature;
 
 typedef struct VFIOContainer VFIOContainer;
 
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
+    MemoryListener listener;
+    bool listener_initialized;
     QLIST_HEAD(, VFIOContainer) containers;
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
@@ -90,7 +93,6 @@ struct VFIOContainer {
     Object parent_obj;
 
     VFIOAddressSpace *space;
-    MemoryListener listener;
     Error *error;
     bool initialized;
     bool dirty_pages_supported;
@@ -116,6 +118,8 @@ typedef struct VFIOContainerClass {
     int (*dma_map)(VFIOContainer *container,
                    hwaddr iova, ram_addr_t size,
                    void *vaddr, bool readonly);
+    int (*dma_copy)(VFIOContainer *src, VFIOContainer *dst,
+                    hwaddr iova, ram_addr_t size, bool readonly);
     int (*dma_unmap)(VFIOContainer *container,
                      hwaddr iova, ram_addr_t size,
                      IOMMUTLBEntry *iotlb);
@@ -141,6 +145,8 @@ bool vfio_container_check_extension(VFIOContainer *container,
 int vfio_container_dma_map(VFIOContainer *container,
                            hwaddr iova, ram_addr_t size,
                            void *vaddr, bool readonly);
+int vfio_container_dma_copy(VFIOContainer *src, VFIOContainer *dst,
+                            hwaddr iova, ram_addr_t size, bool readonly);
 int vfio_container_dma_unmap(VFIOContainer *container,
                              hwaddr iova, ram_addr_t size,
                              IOMMUTLBEntry *iotlb);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 16/18] vfio/iommufd: Add IOAS_COPY_DMA support
@ 2022-04-14 10:47   ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, yi.l.liu, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, yi.y.sun, chao.p.peng,
	nicolinc, kevin.tian, jgg, eric.auger.pro, david

Compared with legacy vfio container BE, one of the benefits provided by
iommufd is to reduce the redundant page pinning on kernel side through
the usage of IOAS_COPY_DMA. For iommufd containers within the same address
space, IOVA mappings can be copied from a source container to destination
container.

To achieve this, move the vfio_memory_listener to be per address space.
In the memory listener callbacks, all the containers within the address
space will be looped. For the iommufd containers, QEMU uses IOAS_MAP_DMA
on the first one, and then uses IOAS_COPY_DMA to copy the IOVA mappings
from the first iommufd container to other iommufd containers within the
address space. For legacy containers, IOVA mapping is done by
VFIO_IOMMU_MAP_DMA.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                         | 117 +++++++++++++++++++++++----
 hw/vfio/container-obj.c              |  17 +++-
 hw/vfio/container.c                  |  19 ++---
 hw/vfio/iommufd.c                    |  43 +++++++---
 include/hw/vfio/vfio-common.h        |   6 +-
 include/hw/vfio/vfio-container-obj.h |   8 +-
 6 files changed, 167 insertions(+), 43 deletions(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 94618efd1f..13a6653a0d 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -388,16 +388,16 @@ static void vfio_unregister_ram_discard_listener(VFIOContainer *container,
     g_free(vrdl);
 }
 
-static void vfio_listener_region_add(MemoryListener *listener,
-                                     MemoryRegionSection *section)
+static void vfio_container_region_add(VFIOContainer *container,
+                                      VFIOContainer **src_container,
+                                      MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
     int ret;
     VFIOHostDMAWindow *hostwin;
-    bool hostwin_found;
+    bool hostwin_found, copy_dma_supported = false;
     Error *err = NULL;
 
     if (vfio_listener_skipped_section(section)) {
@@ -533,12 +533,25 @@ static void vfio_listener_region_add(MemoryListener *listener,
         }
     }
 
+    copy_dma_supported = vfio_container_check_extension(container,
+                                                        VFIO_FEAT_DMA_COPY);
+
+    if (copy_dma_supported && *src_container) {
+        if (!vfio_container_dma_copy(*src_container, container,
+                                     iova, int128_get64(llsize),
+                                     section->readonly)) {
+            return;
+        } else {
+            info_report("IOAS copy failed, trying map for container: %p",
+                        container);
+        }
+    }
+
     ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
                                  vaddr, section->readonly);
     if (ret) {
-        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                   container, iova, int128_get64(llsize), vaddr, ret);
+        error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
+                   "0x%"HWADDR_PRIx", %p) = %d (%m)", container, iova,
+                   int128_get64(llsize), vaddr, ret);
         if (memory_region_is_ram_device(section->mr)) {
             /* Allow unexpected mappings not to be fatal for RAM devices */
             error_report_err(err);
@@ -547,6 +560,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
         goto fail;
     }
 
+    if (copy_dma_supported) {
+        *src_container = container;
+    }
     return;
 
 fail:
@@ -573,10 +589,22 @@ fail:
     }
 }
 
-static void vfio_listener_region_del(MemoryListener *listener,
+static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container, *src_container;
+
+    src_container = NULL;
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_region_add(container, &src_container, section);
+    }
+}
+
+static void vfio_container_region_del(VFIOContainer *container,
+                                      MemoryRegionSection *section)
+{
     hwaddr iova, end;
     Int128 llend, llsize;
     int ret;
@@ -682,18 +710,38 @@ static void vfio_listener_region_del(MemoryListener *listener,
     vfio_container_del_section_window(container, section);
 }
 
+static void vfio_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container;
+
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_region_del(container, section);
+    }
+}
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container;
 
-    vfio_container_set_dirty_page_tracking(container, true);
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_set_dirty_page_tracking(container, true);
+    }
 }
 
 static void vfio_listener_log_global_stop(MemoryListener *listener)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container;
 
-    vfio_container_set_dirty_page_tracking(container, false);
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_set_dirty_page_tracking(container, false);
+    }
 }
 
 typedef struct {
@@ -823,11 +871,9 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
                    int128_get64(section->size), ram_addr);
 }
 
-static void vfio_listener_log_sync(MemoryListener *listener,
-        MemoryRegionSection *section)
+static void vfio_container_log_sync(VFIOContainer *container,
+                                    MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-
     if (vfio_listener_skipped_section(section) ||
         !container->dirty_pages_supported) {
         return;
@@ -838,6 +884,18 @@ static void vfio_listener_log_sync(MemoryListener *listener,
     }
 }
 
+static void vfio_listener_log_sync(MemoryListener *listener,
+                                   MemoryRegionSection *section)
+{
+    VFIOAddressSpace *space = container_of(listener,
+                                           VFIOAddressSpace, listener);
+    VFIOContainer *container;
+
+    QLIST_FOREACH(container, &space->containers, next) {
+        vfio_container_log_sync(container, section);
+    }
+}
+
 const MemoryListener vfio_memory_listener = {
     .name = "vfio",
     .region_add = vfio_listener_region_add,
@@ -882,6 +940,31 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
     return space;
 }
 
+void vfio_as_add_container(VFIOAddressSpace *space,
+                           VFIOContainer *container)
+{
+    if (space->listener_initialized) {
+        memory_listener_unregister(&space->listener);
+    }
+
+    QLIST_INSERT_HEAD(&space->containers, container, next);
+
+    /* Unregistration happens in vfio_as_del_container() */
+    space->listener = vfio_memory_listener;
+    memory_listener_register(&space->listener, space->as);
+    space->listener_initialized = true;
+}
+
+void vfio_as_del_container(VFIOAddressSpace *space,
+                           VFIOContainer *container)
+{
+    QLIST_SAFE_REMOVE(container, next);
+
+    if (QLIST_EMPTY(&space->containers)) {
+        memory_listener_unregister(&space->listener);
+    }
+}
+
 void vfio_put_address_space(VFIOAddressSpace *space)
 {
     if (QLIST_EMPTY(&space->containers)) {
diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
index c4220336af..2c79089364 100644
--- a/hw/vfio/container-obj.c
+++ b/hw/vfio/container-obj.c
@@ -27,6 +27,7 @@
 #include "qom/object.h"
 #include "qapi/visitor.h"
 #include "hw/vfio/vfio-container-obj.h"
+#include "exec/memory.h"
 
 bool vfio_container_check_extension(VFIOContainer *container,
                                     VFIOContainerFeature feat)
@@ -53,6 +54,20 @@ int vfio_container_dma_map(VFIOContainer *container,
     return vccs->dma_map(container, iova, size, vaddr, readonly);
 }
 
+int vfio_container_dma_copy(VFIOContainer *src, VFIOContainer *dst,
+                            hwaddr iova, ram_addr_t size, bool readonly)
+{
+    VFIOContainerClass *vccs1 = VFIO_CONTAINER_OBJ_GET_CLASS(src);
+    VFIOContainerClass *vccs2 = VFIO_CONTAINER_OBJ_GET_CLASS(dst);
+
+    if (!vccs1->dma_copy || vccs1->dma_copy != vccs2->dma_copy) {
+        error_report("Incompatible container: unable to copy dma");
+        return -EINVAL;
+    }
+
+    return vccs1->dma_copy(src, dst, iova, size, readonly);
+}
+
 int vfio_container_dma_unmap(VFIOContainer *container,
                              hwaddr iova, ram_addr_t size,
                              IOMMUTLBEntry *iotlb)
@@ -165,8 +180,6 @@ void vfio_container_destroy(VFIOContainer *container)
     VFIOGuestIOMMU *giommu, *tmp;
     VFIOHostDMAWindow *hostwin, *next;
 
-    QLIST_SAFE_REMOVE(container, next);
-
     QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
         RamDiscardManager *rdm;
 
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 2f59422048..6bc1b8763f 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -357,9 +357,6 @@ err_out:
 
 static void vfio_listener_release(VFIOLegacyContainer *container)
 {
-    VFIOContainer *bcontainer = &container->obj;
-
-    memory_listener_unregister(&bcontainer->listener);
     if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
         memory_listener_unregister(&container->prereg_listener);
     }
@@ -887,14 +884,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     vfio_kvm_device_add_group(group);
 
     QLIST_INIT(&container->group_list);
-    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
 
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
-    bcontainer->listener = vfio_memory_listener;
-
-    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+    vfio_as_add_container(space, bcontainer);
 
     if (bcontainer->error) {
         ret = -1;
@@ -907,8 +901,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
     return 0;
 listener_release_exit:
+    vfio_as_del_container(space, bcontainer);
     QLIST_REMOVE(group, container_next);
-    QLIST_REMOVE(bcontainer, next);
     vfio_kvm_device_del_group(group);
     vfio_listener_release(container);
 
@@ -931,6 +925,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOLegacyContainer *container = group->container;
     VFIOContainer *bcontainer = &container->obj;
+    VFIOAddressSpace *space = bcontainer->space;
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
@@ -938,10 +933,12 @@ static void vfio_disconnect_container(VFIOGroup *group)
     /*
      * Explicitly release the listener first before unset container,
      * since unset may destroy the backend container if it's the last
-     * group.
+     * group. By removing the container from the list, it is disconnected
+     * from the address space memory listener.
      */
     if (QLIST_EMPTY(&container->group_list)) {
         vfio_listener_release(container);
+        vfio_as_del_container(space, bcontainer);
     }
 
     if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
@@ -950,10 +947,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 
     if (QLIST_EMPTY(&container->group_list)) {
-        VFIOAddressSpace *space = bcontainer->space;
-
-        vfio_container_destroy(bcontainer);
         trace_vfio_disconnect_container(container->fd);
+        vfio_container_destroy(bcontainer);
         close(container->fd);
         g_free(container);
 
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index f8375f1672..8ff5988b07 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -38,6 +38,8 @@ static bool iommufd_check_extension(VFIOContainer *bcontainer,
                                     VFIOContainerFeature feat)
 {
     switch (feat) {
+    case VFIO_FEAT_DMA_COPY:
+        return true;
     default:
         return false;
     };
@@ -49,10 +51,25 @@ static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova,
     VFIOIOMMUFDContainer *container = container_of(bcontainer,
                                                    VFIOIOMMUFDContainer, obj);
 
-    return iommufd_map_dma(container->iommufd, container->ioas_id,
+    return iommufd_map_dma(container->iommufd,
+                           container->ioas_id,
                            iova, size, vaddr, readonly);
 }
 
+static int iommufd_copy(VFIOContainer *src, VFIOContainer *dst,
+                        hwaddr iova, ram_addr_t size, bool readonly)
+{
+    VFIOIOMMUFDContainer *container_src = container_of(src,
+                                                   VFIOIOMMUFDContainer, obj);
+    VFIOIOMMUFDContainer *container_dst = container_of(dst,
+                                                   VFIOIOMMUFDContainer, obj);
+
+    assert(container_src->iommufd == container_dst->iommufd);
+
+    return iommufd_copy_dma(container_src->iommufd, container_src->ioas_id,
+                            container_dst->ioas_id, iova, size, readonly);
+}
+
 static int iommufd_unmap(VFIOContainer *bcontainer,
                          hwaddr iova, ram_addr_t size,
                          IOMMUTLBEntry *iotlb)
@@ -428,12 +445,7 @@ static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as,
      * between iommufd and kvm.
      */
 
-    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
-
-    bcontainer->listener = vfio_memory_listener;
-
-    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
-
+    vfio_as_add_container(space, bcontainer);
     bcontainer->initialized = true;
 
 out:
@@ -476,6 +488,7 @@ static void iommufd_detach_device(VFIODevice *vbasedev)
     VFIOIOMMUFDContainer *container;
     VFIODevice *vbasedev_iter;
     VFIOIOASHwpt *hwpt;
+    VFIOAddressSpace *space;
     Error *err;
 
     if (!bcontainer) {
@@ -501,15 +514,26 @@ found:
         vfio_container_put_hwpt(hwpt);
     }
 
+    space = bcontainer->space;
+    /*
+     * The bcontainer needs to be removed from the space->containers list
+     * before detaching the container; otherwise, the detach may destroy
+     * the container if this is the last device. Removing bcontainer from
+     * the list disconnects it from the address space memory listener.
+     */
+    if (QLIST_EMPTY(&container->hwpt_list)) {
+        vfio_as_del_container(space, bcontainer);
+    }
     __vfio_device_detach_container(vbasedev, container, &err);
     if (err) {
         error_report_err(err);
     }
     if (QLIST_EMPTY(&container->hwpt_list)) {
-        VFIOAddressSpace *space = bcontainer->space;
+        int iommufd = container->iommufd;
+        uint32_t ioas_id = container->ioas_id;
 
-        iommufd_put_ioas(container->iommufd, container->ioas_id);
         vfio_iommufd_container_destroy(container);
+        iommufd_put_ioas(iommufd, ioas_id);
         vfio_put_address_space(space);
     }
     vbasedev->container = NULL;
@@ -525,6 +549,7 @@ static void vfio_iommufd_class_init(ObjectClass *klass,
 
     vccs->check_extension = iommufd_check_extension;
     vccs->dma_map = iommufd_map;
+    vccs->dma_copy = iommufd_copy;
     vccs->dma_unmap = iommufd_unmap;
     vccs->attach_device = iommufd_attach_device;
     vccs->detach_device = iommufd_detach_device;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 19731ea685..bef48ddfaf 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -34,8 +34,6 @@
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
-extern const MemoryListener vfio_memory_listener;
-
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
     VFIO_DEVICE_TYPE_PLATFORM = 1,
@@ -181,6 +179,10 @@ void vfio_host_win_add(VFIOContainer *bcontainer,
 int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
                       hwaddr max_iova);
 VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
+void vfio_as_add_container(VFIOAddressSpace *space,
+                           VFIOContainer *bcontainer);
+void vfio_as_del_container(VFIOAddressSpace *space,
+                           VFIOContainer *container);
 void vfio_put_address_space(VFIOAddressSpace *space);
 
 void vfio_put_base_device(VFIODevice *vbasedev);
diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
index b5ef2160d8..b65f827bc1 100644
--- a/include/hw/vfio/vfio-container-obj.h
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -47,12 +47,15 @@
 
 typedef enum VFIOContainerFeature {
     VFIO_FEAT_LIVE_MIGRATION,
+    VFIO_FEAT_DMA_COPY,
 } VFIOContainerFeature;
 
 typedef struct VFIOContainer VFIOContainer;
 
 typedef struct VFIOAddressSpace {
     AddressSpace *as;
+    MemoryListener listener;
+    bool listener_initialized;
     QLIST_HEAD(, VFIOContainer) containers;
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
@@ -90,7 +93,6 @@ struct VFIOContainer {
     Object parent_obj;
 
     VFIOAddressSpace *space;
-    MemoryListener listener;
     Error *error;
     bool initialized;
     bool dirty_pages_supported;
@@ -116,6 +118,8 @@ typedef struct VFIOContainerClass {
     int (*dma_map)(VFIOContainer *container,
                    hwaddr iova, ram_addr_t size,
                    void *vaddr, bool readonly);
+    int (*dma_copy)(VFIOContainer *src, VFIOContainer *dst,
+                    hwaddr iova, ram_addr_t size, bool readonly);
     int (*dma_unmap)(VFIOContainer *container,
                      hwaddr iova, ram_addr_t size,
                      IOMMUTLBEntry *iotlb);
@@ -141,6 +145,8 @@ bool vfio_container_check_extension(VFIOContainer *container,
 int vfio_container_dma_map(VFIOContainer *container,
                            hwaddr iova, ram_addr_t size,
                            void *vaddr, bool readonly);
+int vfio_container_dma_copy(VFIOContainer *src, VFIOContainer *dst,
+                            hwaddr iova, ram_addr_t size, bool readonly);
 int vfio_container_dma_unmap(VFIOContainer *container,
                              hwaddr iova, ram_addr_t size,
                              IOMMUTLBEntry *iotlb);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 17/18] vfio/as: Allow the selection of a given iommu backend
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

Now that we support two types of iommu backends, let's add the capability
to select one of them. The selection is based on a VFIODevice auto/on/off
iommufd_be field, which may be forced to a given value or set by a device
option.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/as.c                  | 31 ++++++++++++++++++++++++++++++-
 include/hw/vfio/vfio-common.h |  1 +
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 13a6653a0d..fce7a088e9 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -985,16 +985,45 @@ vfio_get_container_class(VFIOIOMMUBackendType be)
     case VFIO_IOMMU_BACKEND_TYPE_LEGACY:
         klass = object_class_by_name(TYPE_VFIO_LEGACY_CONTAINER);
         return VFIO_CONTAINER_OBJ_CLASS(klass);
+    case VFIO_IOMMU_BACKEND_TYPE_IOMMUFD:
+        klass = object_class_by_name(TYPE_VFIO_IOMMUFD_CONTAINER);
+        return VFIO_CONTAINER_OBJ_CLASS(klass);
     default:
         return NULL;
     }
 }
 
+static VFIOContainerClass *
+select_iommu_backend(OnOffAuto value, Error **errp)
+{
+    VFIOContainerClass *vccs = NULL;
+
+    if (value == ON_OFF_AUTO_OFF) {
+        return vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY);
+    } else {
+        int iommufd = qemu_open_old("/dev/iommu", O_RDWR);
+
+        vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_IOMMUFD);
+        if (iommufd < 0 || !vccs) {
+            if (value == ON_OFF_AUTO_AUTO) {
+                vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY);
+            } else { /* ON */
+                error_setg(errp, "iommufd backend is not supported by %s",
+                           iommufd < 0 ? "the host" : "QEMU");
+                error_append_hint(errp, "set iommufd=off\n");
+                vccs = NULL;
+            }
+        }
+        close(iommufd);
+    }
+    return vccs;
+}
+
 int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
 {
     VFIOContainerClass *vccs;
 
-    vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY);
+    vccs = select_iommu_backend(vbasedev->iommufd_be, errp);
     if (!vccs) {
         return -ENOENT;
     }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bef48ddfaf..2d941aae70 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -126,6 +126,7 @@ typedef struct VFIODevice {
     VFIOMigration *migration;
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
+    OnOffAuto iommufd_be;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC 18/18] vfio/pci: Add an iommufd option
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-14 10:47   ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-14 10:47 UTC (permalink / raw)
  To: alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, yi.l.liu, chao.p.peng, yi.y.sun, peterx

From: Eric Auger <eric.auger@redhat.com>

This auto/on/off option allows the user to force the selection of the
iommu backend (iommufd or legacy).
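
As a usage illustration (the host BDF below is a placeholder, and the
rest of the command line is elided), the new per-device property would
be set like:

```
qemu-system-x86_64 ... \
    -device vfio-pci,host=0000:01:00.0,iommufd=on
```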

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/vfio/pci.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index cf5703f94b..70a4c2b0a8 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,6 +42,8 @@
 #include "qapi/error.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
+#include "qapi/visitor.h"
+#include "qapi/qapi-visit-common.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -3246,6 +3248,26 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void get_iommu_be(Object *obj, Visitor *v, const char *name,
+                         void *opaque, Error **errp)
+{
+    VFIOPCIDevice *vdev = VFIO_PCI(obj);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    OnOffAuto iommufd_be = vbasedev->iommufd_be;
+
+    visit_type_OnOffAuto(v, name, &iommufd_be, errp);
+}
+
+static void set_iommu_be(Object *obj, Visitor *v, const char *name,
+                         void *opaque, Error **errp)
+{
+    VFIOPCIDevice *vdev = VFIO_PCI(obj);
+    VFIODevice *vbasedev = &vdev->vbasedev;
+
+    visit_type_OnOffAuto(v, name, &vbasedev->iommufd_be, errp);
+}
+
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3253,6 +3275,10 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     device_class_set_props(dc, vfio_pci_dev_properties);
+    object_class_property_add(klass, "iommufd", "OnOffAuto",
+                              get_iommu_be, set_iommu_be, NULL, NULL);
+    object_class_property_set_description(klass, "iommufd",
+                                          "Enable iommufd backend");
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 125+ messages in thread


* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-14 10:46 ` Yi Liu
                   ` (18 preceding siblings ...)
  (?)
@ 2022-04-15  8:37 ` Nicolin Chen
  2022-04-17 10:30     ` Eric Auger
  -1 siblings, 1 reply; 125+ messages in thread
From: Nicolin Chen @ 2022-04-15  8:37 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, cohuck, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, jgg,
	eric.auger, eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun,
	peterx

Hi,

Thanks for the work!

On Thu, Apr 14, 2022 at 03:46:52AM -0700, Yi Liu wrote:
 
> - More tests

I did a quick test on my ARM64 platform, using "iommu=smmuv3"
string. The behaviors are different between using default and
using legacy "iommufd=off".

The legacy pathway exits the VM with:
    vfio 0002:01:00.0:
    failed to setup container for group 1:
    memory listener initialization failed:
    Region smmuv3-iommu-memory-region-16-0:
    device 00.02.0 requires iommu MAP notifier which is not currently supported

while the iommufd pathway started the VM but reported errors
from host kernel about address translation failures, probably
because of accessing unmapped addresses.

I found iommufd pathway also calls error_propagate_prepend()
to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
doesn't get a chance to print errp out. Perhaps there should
be a final error check somewhere to exit?

Nic

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-15  8:37 ` [RFC 00/18] vfio: Adopt iommufd Nicolin Chen
@ 2022-04-17 10:30     ` Eric Auger
  0 siblings, 0 replies; 125+ messages in thread
From: Eric Auger @ 2022-04-17 10:30 UTC (permalink / raw)
  To: Nicolin Chen, Yi Liu
  Cc: alex.williamson, cohuck, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, jgg,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx

Hi Nicolin,

On 4/15/22 10:37 AM, Nicolin Chen wrote:
> Hi,
>
> Thanks for the work!
>
> On Thu, Apr 14, 2022 at 03:46:52AM -0700, Yi Liu wrote:
>  
>> - More tests
> I did a quick test on my ARM64 platform, using "iommu=smmuv3"
> string. The behaviors are different between using default and
> using legacy "iommufd=off".
>
> The legacy pathway exits the VM with:
>     vfio 0002:01:00.0:
>     failed to setup container for group 1:
>     memory listener initialization failed:
>     Region smmuv3-iommu-memory-region-16-0:
>     device 00.02.0 requires iommu MAP notifier which is not currently supported
>
> while the iommufd pathway started the VM but reported errors
> from host kernel about address translation failures, probably
> because of accessing unmapped addresses.
>
> I found iommufd pathway also calls error_propagate_prepend()
> to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
> doesn't get a chance to print errp out. Perhaps there should
> be a final error check somewhere to exit?

Thank you for giving it a try.

vsmmuv3 + vfio is not supported, as we lack the HW nested-stage support
and the SMMU does not support cache mode. If you want to test a vIOMMU
on ARM you should test virtio-iommu + vfio. That should work, but it
has not been tested yet.
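For reference, a virtio-iommu + vfio invocation on ARM might look roughly like this. This is an untested sketch; the host BDF is taken from the report above, and the iommufd=on option is an assumption that only exists with this series applied:

```shell
# Untested sketch: virtio-iommu + assigned device on a virt machine.
qemu-system-aarch64 -M virt -cpu host -enable-kvm -m 4G \
    -device virtio-iommu-pci \
    -device vfio-pci,host=0002:01:00.0,iommufd=on
```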

I pushed a fix for the error notification issue:
qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git

Thanks

Eric
>
> Nic
>


^ permalink raw reply	[flat|nested] 125+ messages in thread

* RE: [RFC 00/18] vfio: Adopt iommufd
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-18  8:49   ` Tian, Kevin
  -1 siblings, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2022-04-18  8:49 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro, Peng,
	Chao P, Sun, Yi Y, peterx

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, April 14, 2022 6:47 PM
> 
> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> Other devices like vpda, vfio mdev and etc. are not considered yet.

vfio mdev needs no special support in QEMU. It is just not supported by
iommufd yet, so at this point it can only be operated through the legacy
container interface. Later, once the kernel supports it, presumably no
additional enabling work will be required for mdev in QEMU.

> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be
> mentioned
> as iommufd backend in this series). The two backend types have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks like with both BEs.
> 
>                     VFIO                           AddressSpace/Memory
>     +-------+  +----------+  +-----+  +-----+
>     |  pci  |  | platform |  |  ap |  | ccw |
>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>         |           |           |        |        |   AddressSpace       |
>         |           |           |        |        +------------+---------+
>     +---V-----------V-----------V--------V----+               /
>     |           VFIOAddressSpace              | <------------+
>     |                  |                      |  MemoryListener
>     |          VFIOContainer list             |
>     +-------+----------------------------+----+
>             |                            |
>             |                            |
>     +-------V------+            +--------V----------+
>     |   iommufd    |            |    vfio legacy    |
>     |  container   |            |     container     |
>     +-------+------+            +--------+----------+
>             |                            |
>             | /dev/iommu                 | /dev/vfio/vfio
>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>  Userspace  |                            |
> 
> ===========+============================+=======================
> =========
>  Kernel     |  device fd                 |
>             +---------------+            | group/container fd
>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>             |  ATTACH_IOAS) |            | device fd
>             |               |            |
>             |       +-------V------------V-----------------+
>     iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
>  ioas_copy) |                 |                    | map/unmap
>             |                 |                    |
>      +------V------+    +-----V------+      +------V--------+
>      | iommfd core |    |  device    |      |  vfio iommu   |
>      +-------------+    +------------+      +---------------+

last row: s/iommfd/iommufd/

Overall this sounds like a reasonable abstraction. Later, when vdpa
starts supporting iommufd, the iommufd BE will probably become even
smaller, with more logic shareable between vfio and vdpa.

> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>               (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object which acts as a base class

what does 'qomify' mean? I didn't find this word in any dictionary...

> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd based container. The base class implements generic
> code
> such as code related to memory_listener and address space management
> whereas
> the derived class implements callbacks that depend on the kernel user space

'the kernel user space'?

> being used.
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.
> 
> Test done:
> - PCI and Platform device were tested

In this case PCI uses iommufd while platform device uses legacy?

> - ccw and ap were only compile-tested
> - limited device hotplug test
> - vIOMMU test run for both legacy and iommufd backends (limited tests)
> 
> This series was co-developed by Eric Auger and me based on the exploration
> iommufd kernel[2], complete code of this series is available in[3]. As
> iommufd kernel is in the early step (only iommufd generic interface is in
> mailing list), so this series hasn't made the iommufd backend fully on par
> with legacy backend w.r.t. features like p2p mappings, coherency tracking,

what does 'coherency tracking' mean here? If it is related to IOMMU
enforced snooping, that is fully handled by the kernel so far. I didn't
find any use of VFIO_DMA_CC_IOMMU in current QEMU.

> live migration, etc. This series hasn't supported PCI devices without FLR
> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
> userspace
> is using iommufd. The kernel needs to be updated to accept device fd list for
> reset when userspace is using iommufd. Related work is in progress by
> Jason[4].
> 
> TODOs:
> - Add DMA alias check for iommufd BE (group level)
> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>   VFIO_DEVICE_PCI_HOT_RESET gap
> - Cleanup the VFIODevice fields as it's used in both BEs
> - Add locks
> - Replace list with g_tree
> - More tests
> 
> Patch Overview:
> 
> - Preparation:
>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>   0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-
> iommu_m.patch

Patches 3-5 are pure cleanups which could be sent out separately.

>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
> 
> - Introduce container object and covert existing vfio to use it:
>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
> 
> - Introduce iommufd based container:
>   0014-hw-iommufd-Creation.patch
>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
> 
> - Add backend selection for vfio-pci:
>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>   0018-vfio-pci-Add-an-iommufd-option.patch
> 
> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
> iommufd_jgg@nvidia.com/
> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-
> vfio_mdev_no_group_jgg@nvidia.com/

The following is probably more relevant to [4]:

https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/

Thanks
Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-18  8:49   ` Tian, Kevin
@ 2022-04-18 12:09     ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-18 12:09 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro, Peng,
	Chao P, Sun, Yi Y, peterx

Hi Kevin,

On 2022/4/18 16:49, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, April 14, 2022 6:47 PM
>>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vpda, vfio mdev and etc. are not considered yet.
> 
> vfio mdev has no special support in Qemu. Just that it's not supported
> by iommufd yet thus can only be operated in legacy container interface at
> this point. Later once it's supported by the kernel suppose no additional
> enabling work is required for mdev in Qemu.

yes. will make it more precise in next version.

>>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent with legacy iommu backend in this series), while the new
>> iommufd based container is named as iommufd container (may also be
>> mentioned
>> as iommufd backend in this series). The two backend types have their own
>> way to setup secure context and dma management interface. Below diagram
>> shows how it looks like with both BEs.
>>
>>                      VFIO                           AddressSpace/Memory
>>      +-------+  +----------+  +-----+  +-----+
>>      |  pci  |  | platform |  |  ap |  | ccw |
>>      +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>          |           |           |        |        |   AddressSpace       |
>>          |           |           |        |        +------------+---------+
>>      +---V-----------V-----------V--------V----+               /
>>      |           VFIOAddressSpace              | <------------+
>>      |                  |                      |  MemoryListener
>>      |          VFIOContainer list             |
>>      +-------+----------------------------+----+
>>              |                            |
>>              |                            |
>>      +-------V------+            +--------V----------+
>>      |   iommufd    |            |    vfio legacy    |
>>      |  container   |            |     container     |
>>      +-------+------+            +--------+----------+
>>              |                            |
>>              | /dev/iommu                 | /dev/vfio/vfio
>>              | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>   Userspace  |                            |
>>
>> ===========+============================+=======================
>> =========
>>   Kernel     |  device fd                 |
>>              +---------------+            | group/container fd
>>              | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>              |  ATTACH_IOAS) |            | device fd
>>              |               |            |
>>              |       +-------V------------V-----------------+
>>      iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>   ioas_copy) |                 |                    | map/unmap
>>              |                 |                    |
>>       +------V------+    +-----V------+      +------V--------+
>>       | iommfd core |    |  device    |      |  vfio iommu   |
>>       +-------------+    +------------+      +---------------+
> 
> last row: s/iommfd/iommufd/

thanks. a typo.

> overall this sounds a reasonable abstraction. Later when vdpa starts
> supporting iommufd probably the iommufd BE will become even
> smaller with more logic shareable between vfio and vdpa.

let's see if Jason Wang will give some idea. :-)

>>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>                (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                    (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>    *) iommufd BE: uses iommufd
>>    *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
> 
> what does 'qomify' mean? I didn't find this word from dictionary...
> 
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic
>> code
>> such as code related to memory_listener and address space management
>> whereas
>> the derived class implements callbacks that depend on the kernel user space
> 
> 'the kernel user space'?

aha, I just wanted to express that different BE callbacks will use
different user interfaces exposed by the kernel. Will refine the wording.

> 
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
>>
>> Test done:
>> - PCI and Platform device were tested
> 
> In this case PCI uses iommufd while platform device uses legacy?

For PCI, both legacy and iommufd were tested. The exploration kernel
branch doesn't have the new device uapi for platform devices, so I
didn't test them. But I believe Eric has tested it with iommufd. Eric?

>> - ccw and ap were only compile-tested
>> - limited device hotplug test
>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>
>> This series was co-developed by Eric Auger and me based on the exploration
>> iommufd kernel[2], complete code of this series is available in[3]. As
>> iommufd kernel is in the early step (only iommufd generic interface is in
>> mailing list), so this series hasn't made the iommufd backend fully on par
>> with legacy backend w.r.t. features like p2p mappings, coherency tracking,
> 
> what does 'coherency tracking' mean here? if related to iommu enforce
> snoop it is fully handled by the kernel so far. I didn't find any use of
> VFIO_DMA_CC_IOMMU in current Qemu.

It's the kvm_group add/del stuff. Perhaps saying "kvm_group add/del
equivalence" would be better?

>> live migration, etc. This series hasn't supported PCI devices without FLR
>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
>> userspace
>> is using iommufd. The kernel needs to be updated to accept device fd list for
>> reset when userspace is using iommufd. Related work is in progress by
>> Jason[4].
>>
>> TODOs:
>> - Add DMA alias check for iommufd BE (group level)
>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>    VFIO_DEVICE_PCI_HOT_RESET gap
>> - Cleanup the VFIODevice fields as it's used in both BEs
>> - Add locks
>> - Replace list with g_tree
>> - More tests
>>
>> Patch Overview:
>>
>> - Preparation:
>>    0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>    0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>    0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>    0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>    0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-
>> iommu_m.patch
> 
> 3-5 are pure cleanups which could be sent out separately

yes. may send later after checking with Eric. :-)

>>    0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>
>> - Introduce container object and covert existing vfio to use it:
>>    0007-vfio-Add-base-object-for-VFIOContainer.patch
>>    0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>    0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>    0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>    0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>    0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>    0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>
>> - Introduce iommufd based container:
>>    0014-hw-iommufd-Creation.patch
>>    0015-vfio-iommufd-Implement-iommufd-backend.patch
>>    0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>
>> - Add backend selection for vfio-pci:
>>    0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>    0018-vfio-pci-Add-an-iommufd-option.patch
>>
>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
>> iommufd_jgg@nvidia.com/
>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-
>> vfio_mdev_no_group_jgg@nvidia.com/
> 
> Following is probably more relevant to [4]:
> 
> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/

absolutely.:-) thanks.

> Thanks
> Kevin

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
@ 2022-04-18 12:09     ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-18 12:09 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, Peng, Chao P, kvm, mjrosato, jasowang,
	farman, peterx, pasic, eric.auger, Sun, Yi Y, nicolinc, jgg,
	eric.auger.pro, david

Hi Kevin,

On 2022/4/18 16:49, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, April 14, 2022 6:47 PM
>>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vpda, vfio mdev and etc. are not considered yet.
> 
> vfio mdev has no special support in Qemu. Just that it's not supported
> by iommufd yet thus can only be operated in legacy container interface at
> this point. Later once it's supported by the kernel suppose no additional
> enabling work is required for mdev in Qemu.

yes. will make it more precise in next version.

>>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent with legacy iommu backend in this series), while the new
>> iommufd based container is named as iommufd container (may also be mentioned
>> as iommufd backend in this series). The two backend types have their own
>> way to setup secure context and dma management interface. Below diagram
>> shows how it looks with both BEs.
>>
>>                      VFIO                           AddressSpace/Memory
>>      +-------+  +----------+  +-----+  +-----+
>>      |  pci  |  | platform |  |  ap |  | ccw |
>>      +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>          |           |           |        |        |   AddressSpace       |
>>          |           |           |        |        +------------+---------+
>>      +---V-----------V-----------V--------V----+               /
>>      |           VFIOAddressSpace              | <------------+
>>      |                  |                      |  MemoryListener
>>      |          VFIOContainer list             |
>>      +-------+----------------------------+----+
>>              |                            |
>>              |                            |
>>      +-------V------+            +--------V----------+
>>      |   iommufd    |            |    vfio legacy    |
>>      |  container   |            |     container     |
>>      +-------+------+            +--------+----------+
>>              |                            |
>>              | /dev/iommu                 | /dev/vfio/vfio
>>              | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>   Userspace  |                            |
>>
>> ===========+============================+================================
>>   Kernel     |  device fd                 |
>>              +---------------+            | group/container fd
>>              | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>              |  ATTACH_IOAS) |            | device fd
>>              |               |            |
>>              |       +-------V------------V-----------------+
>>      iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>   ioas_copy) |                 |                    | map/unmap
>>              |                 |                    |
>>       +------V------+    +-----V------+      +------V--------+
>>       | iommfd core |    |  device    |      |  vfio iommu   |
>>       +-------------+    +------------+      +---------------+
> 
> last row: s/iommfd/iommufd/

thanks. a typo.

> overall this sounds a reasonable abstraction. Later when vdpa starts
> supporting iommufd probably the iommufd BE will become even
> smaller with more logic shareable between vfio and vdpa.

let's see if Jason Wang will give some ideas. :-)

>>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>                (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                    (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>    *) iommufd BE: uses iommufd
>>    *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
> 
> what does 'qomify' mean? I didn't find this word from dictionary...
> 
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic code
>> such as code related to memory_listener and address space management whereas
>> the derived class implements callbacks that depend on the kernel user space
> 
> 'the kernel user space'?

aha, just wanted to express that different BE callbacks will use different
user interfaces exposed by the kernel. will refine the wording.
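The intended split could be sketched roughly like this. All names below are invented for illustration (the series itself expresses the split via QOM base/derived container classes, not a plain ops table):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: generic code (memory listener, address
 * space management) dispatches DMA map/unmap through a per-backend
 * ops table without knowing which kernel interface backs it. */
typedef struct VFIOContainerOps {
    const char *name;
    int (*dma_map)(uint64_t iova, uint64_t size);
    int (*dma_unmap)(uint64_t iova, uint64_t size);
} VFIOContainerOps;

static int legacy_maps_done;
static int iommufd_maps_done;

/* legacy BE: would issue VFIO_IOMMU_MAP_DMA on the container fd */
static int legacy_dma_map(uint64_t iova, uint64_t size)
{
    (void)iova; (void)size;
    legacy_maps_done++;
    return 0;
}
static int legacy_dma_unmap(uint64_t iova, uint64_t size)
{
    (void)iova; (void)size;
    return 0;
}

/* iommufd BE: would issue IOMMU_IOAS_MAP on /dev/iommu */
static int iommufd_dma_map(uint64_t iova, uint64_t size)
{
    (void)iova; (void)size;
    iommufd_maps_done++;
    return 0;
}
static int iommufd_dma_unmap(uint64_t iova, uint64_t size)
{
    (void)iova; (void)size;
    return 0;
}

static const VFIOContainerOps legacy_ops = {
    "legacy", legacy_dma_map, legacy_dma_unmap,
};
static const VFIOContainerOps iommufd_ops = {
    "iommufd", iommufd_dma_map, iommufd_dma_unmap,
};

/* generic layer: identical call site for both backends */
static int container_dma_map(const VFIOContainerOps *ops,
                             uint64_t iova, uint64_t size)
{
    return ops->dma_map(iova, size);
}
```

The real series keeps the same shape but expresses it as a QOM base class with the two derived container classes overriding the callbacks.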

> 
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
>>
>> Test done:
>> - PCI and Platform device were tested
> 
> In this case PCI uses iommufd while platform device uses legacy?

For PCI, both legacy and iommufd were tested. The exploration kernel branch 
doesn't have the new device uapi for platform device, so I didn't test it.
But I recall Eric has tested it with iommufd. Eric?

>> - ccw and ap were only compile-tested
>> - limited device hotplug test
>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>
>> This series was co-developed by Eric Auger and me based on the exploration
>> iommufd kernel[2], complete code of this series is available in[3]. As
>> iommufd kernel is in the early step (only iommufd generic interface is in
>> mailing list), so this series hasn't made the iommufd backend fully on par
>> with legacy backend w.r.t. features like p2p mappings, coherency tracking,
> 
> what does 'coherency tracking' mean here? if related to iommu enforce
> snoop it is fully handled by the kernel so far. I didn't find any use of
> VFIO_DMA_CC_IOMMU in current Qemu.

It's the kvm_group add/del stuff. Perhaps saying "kvm_group add/del
equivalence" would be better?

>> live migration, etc. This series hasn't supported PCI devices without FLR
>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when userspace
>> is using iommufd. The kernel needs to be updated to accept device fd list for
>> reset when userspace is using iommufd. Related work is in progress by
>> Jason[4].
>>
>> TODOs:
>> - Add DMA alias check for iommufd BE (group level)
>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>    VFIO_DEVICE_PCI_HOT_RESET gap
>> - Cleanup the VFIODevice fields as it's used in both BEs
>> - Add locks
>> - Replace list with g_tree
>> - More tests
>>
>> Patch Overview:
>>
>> - Preparation:
>>    0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>    0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>    0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>    0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>    0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch
> 
> 3-5 are pure cleanups which could be sent out separately

yes. may send later after checking with Eric. :-)

>>    0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>
>> - Introduce container object and covert existing vfio to use it:
>>    0007-vfio-Add-base-object-for-VFIOContainer.patch
>>    0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>    0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>    0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>    0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>    0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>    0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>
>> - Introduce iommufd based container:
>>    0014-hw-iommufd-Creation.patch
>>    0015-vfio-iommufd-Implement-iommufd-backend.patch
>>    0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>
>> - Add backend selection for vfio-pci:
>>    0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>    0018-vfio-pci-Add-an-iommufd-option.patch
>>
>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-vfio_mdev_no_group_jgg@nvidia.com/
> 
> Following is probably more relevant to [4]:
> 
> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/

absolutely. :-) thanks.

> Thanks
> Kevin

-- 
Regards,
Yi Liu



* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-17 10:30     ` Eric Auger
  (?)
@ 2022-04-19  3:26     ` Nicolin Chen
  2022-04-25 19:40         ` Eric Auger
  -1 siblings, 1 reply; 125+ messages in thread
From: Nicolin Chen @ 2022-04-19  3:26 UTC (permalink / raw)
  To: Eric Auger
  Cc: Yi Liu, alex.williamson, cohuck, qemu-devel, david, thuth,
	farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, jgg,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx

On Sun, Apr 17, 2022 at 12:30:40PM +0200, Eric Auger wrote:

> >> - More tests
> > I did a quick test on my ARM64 platform, using "iommu=smmuv3"
> > string. The behaviors are different between using default and
> > using legacy "iommufd=off".
> >
> > The legacy pathway exits the VM with:
> >     vfio 0002:01:00.0:
> >     failed to setup container for group 1:
> >     memory listener initialization failed:
> >     Region smmuv3-iommu-memory-region-16-0:
> >     device 00.02.0 requires iommu MAP notifier which is not currently supported
> >
> > while the iommufd pathway started the VM but reported errors
> > from host kernel about address translation failures, probably
> > because of accessing unmapped addresses.
> >
> > I found iommufd pathway also calls error_propagate_prepend()
> > to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
> > doesn't get a chance to print errp out. Perhaps there should
> > be a final error check somewhere to exit?
> 
> thank you for giving it a try.
> 
> vsmmuv3 + vfio is not supported as we miss the HW nested stage support
> and SMMU does not support cache mode. If you want to test viommu on ARM
> you shall test virtio-iommu+vfio. This should work but this is not yet
> tested.

I tried "-device virtio-iommu" and "-device virtio-iommu-pci"
separately with vfio-pci, but neither seems to work. The host
SMMU driver reports Translation Faults.

Do you know what commands I should use to run QEMU for that
combination?

> I pushed a fix for the error notification issue:
> qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git

Yes. This fixes the problem. Thanks!


* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-14 10:47   ` Yi Liu
  (?)
@ 2022-04-22 14:58   ` Jason Gunthorpe
  2022-04-22 21:33       ` Alex Williamson
  2022-04-26  9:55       ` Yi Liu
  -1 siblings, 2 replies; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-22 14:58 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, cohuck, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, nicolinc,
	eric.auger, eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun,
	peterx

On Thu, Apr 14, 2022 at 03:47:07AM -0700, Yi Liu wrote:

> +static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
> +{
> +    long int vfio_id = -1, ret = -ENOTTY;
> +    char *path, *tmp = NULL;
> +    DIR *dir;
> +    struct dirent *dent;
> +    struct stat st;
> +    gchar *contents;
> +    gsize length;
> +    int major, minor;
> +    dev_t vfio_devt;
> +
> +    path = g_strdup_printf("%s/vfio-device", sysfs_path);
> +    if (stat(path, &st) < 0) {
> +        error_setg_errno(errp, errno, "no such host device");
> +        goto out;
> +    }
> +
> +    dir = opendir(path);
> +    if (!dir) {
> +        error_setg_errno(errp, errno, "couldn't open directory %s", path);
> +        goto out;
> +    }
> +
> +    while ((dent = readdir(dir))) {
> +        const char *end_name;
> +
> +        if (!strncmp(dent->d_name, "vfio", 4)) {
> +            ret = qemu_strtol(dent->d_name + 4, &end_name, 10, &vfio_id);
> +            if (ret) {
> +                error_setg(errp, "suspicious vfio* file in %s", path);
> +                goto out;
> +            }

Userspace shouldn't explode if there are different files here down the
road. Just search for the first match of vfio\d+ and there is no need
to parse out the vfio_id from the string. Only fail if no match is
found.

> +    tmp = g_strdup_printf("/dev/vfio/devices/vfio%ld", vfio_id);
> +    if (stat(tmp, &st) < 0) {
> +        error_setg_errno(errp, errno, "no such vfio device");
> +        goto out;
> +    }

And simply pass the string directly here, no need to parse out
vfio_id.

I also suggest falling back to using "/dev/char/%u:%u" if the above
does not exist which prevents "vfio/devices/vfio" from turning into
ABI.

It would be a good idea to make a general open_cdev function that does
all this work once the sysfs is found and cdev read out of it, all the
other vfio places can use it too.
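A rough sketch of that direction, with hypothetical helpers (`is_vfio_cdev_name()` and `vfio_cdev_fallback_path()` are invented names, not code from this series): accept any entry matching `vfio\d+` without parsing the id out, and fall back to addressing the cdev by major:minor so the directory layout never becomes ABI.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/sysmacros.h>   /* major(), minor(), makedev() on glibc */
#include <sys/types.h>

/* Match vfio\d+ -- there is no need to extract the numeric id, and
 * unrecognized files are simply skipped rather than treated as fatal. */
static bool is_vfio_cdev_name(const char *name)
{
    if (strncmp(name, "vfio", 4) != 0 || name[4] == '\0') {
        return false;
    }
    for (const char *p = name + 4; *p; p++) {
        if (!isdigit((unsigned char)*p)) {
            return false;
        }
    }
    return true;
}

/* Fallback path when /dev/vfio/devices/vfioX does not exist: address
 * the character device by its major:minor under /dev/char instead. */
static void vfio_cdev_fallback_path(char *buf, size_t len, dev_t devt)
{
    snprintf(buf, len, "/dev/char/%u:%u", major(devt), minor(devt));
}
```

In a general open_cdev() helper, the caller would stat() the matched entry, verify its st_rdev against the cdev read from sysfs, and only then open it.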

> +static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as,
> +                                 Error **errp)
> +{
> +    VFIOContainer *bcontainer;
> +    VFIOIOMMUFDContainer *container;
> +    VFIOAddressSpace *space;
> +    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
> +    int ret, devfd, iommufd;
> +    uint32_t ioas_id;
> +    Error *err = NULL;
> +
> +    devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp);
> +    if (devfd < 0) {
> +        return devfd;
> +    }
> +    vbasedev->fd = devfd;
> +
> +    space = vfio_get_address_space(as);
> +
> +    /* try to attach to an existing container in this space */
> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
> +        if (!object_dynamic_cast(OBJECT(bcontainer),
> +                                 TYPE_VFIO_IOMMUFD_CONTAINER)) {
> +            continue;
> +        }
> +        container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
> +        if (vfio_device_attach_container(vbasedev, container, &err)) {
> +            const char *msg = error_get_pretty(err);
> +
> +            trace_vfio_iommufd_fail_attach_existing_container(msg);
> +            error_free(err);
> +            err = NULL;
> +        } else {
> +            ret = vfio_ram_block_discard_disable(true);
> +            if (ret) {
> +                vfio_device_detach_container(vbasedev, container, &err);
> +                error_propagate(errp, err);
> +                vfio_put_address_space(space);
> +                close(vbasedev->fd);
> +                error_prepend(errp,
> +                              "Cannot set discarding of RAM broken (%d)", ret);
> +                return ret;
> +            }
> +            goto out;
> +        }
> +    }

?? this logic shouldn't be necessary, a single ioas always supports
all devices, userspace should never need to juggle multiple ioas's
unless it wants to have different address maps.

Something I would like to see confirmed here in qemu is that qemu can
track the hw pagetable id for each device it binds because we will
need that later to do dirty tracking and other things.

> +    /*
> +     * TODO: for now iommufd BE is on par with vfio iommu type1, so it's
> +     * fine to add the whole range as window. For SPAPR, below code
> +     * should be updated.
> +     */
> +    vfio_host_win_add(bcontainer, 0, (hwaddr)-1, 4096);

? Not sure what this is, but I don't expect any changes for SPAPR
someday IOMMU_IOAS_IOVA_RANGES should be able to accurately report its
configuration.

I don't see IOMMU_IOAS_IOVA_RANGES called at all, that seems like a
problem..

(and note that IOVA_RANGES changes with every device attached to the IOAS)

Jason


* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-22 14:58   ` Jason Gunthorpe
@ 2022-04-22 21:33       ` Alex Williamson
  2022-04-26  9:55       ` Yi Liu
  1 sibling, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-22 21:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: akrowiak, jjherne, farman, Yi Liu, kvm, mjrosato, jasowang,
	cohuck, thuth, peterx, qemu-devel, pasic, eric.auger, yi.y.sun,
	chao.p.peng, nicolinc, kevin.tian, eric.auger.pro, david

On Fri, 22 Apr 2022 11:58:15 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> I don't see IOMMU_IOAS_IOVA_RANGES called at all, that seems like a
> problem..

Not as much as you might think.  Note that you also won't find QEMU
testing VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE in the QEMU vfio-pci
driver either.  The vfio-nvme driver does because it has control of the
address space it chooses to use, but for vfio-pci the address space is
dictated by the VM and there's not a lot of difference between knowing
in advance that a mapping conflicts with a reserved range and just
trying to add the mapping and taking appropriate action if it fails.
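For completeness, probing that capability looks roughly like this. The structures are redeclared from linux/vfio.h so the sketch stands alone; in real code the buffer would be filled by ioctl(VFIO_IOMMU_GET_INFO) on the container fd:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Redeclared from linux/vfio.h for a self-contained sketch. */
struct vfio_info_cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;      /* byte offset of next capability, 0 if last */
};
#define VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE 1

struct vfio_iova_range {
    uint64_t start;
    uint64_t end;
};

struct vfio_iommu_type1_info_cap_iova_range {
    struct vfio_info_cap_header header;
    uint32_t nr_iovas;
    uint32_t reserved;
    struct vfio_iova_range iova_ranges[];
};

/* Walk the capability chain starting cap_offset bytes into the info
 * buffer and return the first capability with a matching id. */
static const struct vfio_info_cap_header *
vfio_get_cap(const void *buf, uint32_t cap_offset, uint16_t id)
{
    while (cap_offset) {
        const struct vfio_info_cap_header *hdr =
            (const struct vfio_info_cap_header *)
                ((const char *)buf + cap_offset);
        if (hdr->id == id) {
            return hdr;
        }
        cap_offset = hdr->next;
    }
    return NULL;
}
```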
Thanks,

Alex




* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-22 22:09   ` Alex Williamson
  -1 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-22 22:09 UTC (permalink / raw)
  To: Yi Liu
  Cc: akrowiak, jjherne, farman, chao.p.peng, kvm, mjrosato,
	Laine Stump, libvir-list, jasowang, cohuck, thuth, peterx,
	qemu-devel, pasic, eric.auger, yi.y.sun, nicolinc, kevin.tian,
	jgg, eric.auger.pro, david

[Cc +libvirt folks]

On Thu, 14 Apr 2022 03:46:52 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and lets it coexist with the legacy implementation.
> Other devices like vdpa, vfio mdev, etc. are not considered yet.
> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be mentioned
> as iommufd backend in this series). The two backend types have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks with both BEs.
> 
>                     VFIO                           AddressSpace/Memory
>     +-------+  +----------+  +-----+  +-----+
>     |  pci  |  | platform |  |  ap |  | ccw |
>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>         |           |           |        |        |   AddressSpace       |
>         |           |           |        |        +------------+---------+
>     +---V-----------V-----------V--------V----+               /
>     |           VFIOAddressSpace              | <------------+
>     |                  |                      |  MemoryListener
>     |          VFIOContainer list             |
>     +-------+----------------------------+----+
>             |                            |
>             |                            |
>     +-------V------+            +--------V----------+
>     |   iommufd    |            |    vfio legacy    |
>     |  container   |            |     container     |
>     +-------+------+            +--------+----------+
>             |                            |
>             | /dev/iommu                 | /dev/vfio/vfio
>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>  Userspace  |                            |
>  ===========+============================+================================
>  Kernel     |  device fd                 |
>             +---------------+            | group/container fd
>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>             |  ATTACH_IOAS) |            | device fd
>             |               |            |
>             |       +-------V------------V-----------------+
>     iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
>  ioas_copy) |                 |                    | map/unmap
>             |                 |                    |
>      +------V------+    +-----V------+      +------V--------+
>      | iommfd core |    |  device    |      |  vfio iommu   |
>      +-------------+    +------------+      +---------------+
> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>               (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object which acts as a base class
> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd based container. The base class implements generic code
> such as code related to memory_listener and address space management whereas
> the derived class implements callbacks that depend on the kernel user space
> being used.
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.

I've discussed this a bit with Eric, but let me propose a different
command line interface.  Libvirt generally likes to pass file
descriptors to QEMU rather than grant it access to those files
directly.  This was problematic with vfio-pci because libvirt can't
easily know when QEMU will want to grab another /dev/vfio/vfio
container.  Therefore we abandoned this approach and instead libvirt
grants file permissions.

However, with iommufd there's no reason that QEMU ever needs more than
a single instance of /dev/iommufd and we're using per device vfio file
descriptors, so it seems like a good time to revisit this.

The interface I was considering would be to add an iommufd object to
QEMU, so we might have a:

-device iommufd[,fd=#][,id=foo]

For non-libvirt usage this would have the ability to open /dev/iommufd
itself if an fd is not provided.  This object could be shared with
other iommufd users in the VM and maybe we'd allow multiple instances
for more esoteric use cases.  [NB, maybe this should be a -object rather than
-device since the iommufd is not a guest visible device?]

The vfio-pci device might then become:

-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]

So essentially we can specify the device via host, sysfsdev, or passing
an fd to the vfio device file.  When an iommufd object is specified,
"foo" in the example above, each of those options would use the
vfio-device access mechanism, essentially the same as iommufd=on in
your example.  With the fd passing option, an iommufd object would be
required and necessarily use device level access.

In your example, the iommufd=auto seems especially troublesome for
libvirt because QEMU is going to have different locked memory
requirements based on whether we're using type1 or iommufd, where the
latter resolves the duplicate accounting issues.  libvirt needs to know
deterministically which backend is being used, which this proposal seems
to provide, while at the same time bringing us more in line with fd
passing.  Thoughts?  Thanks,

Alex




* Re: [RFC 00/18] vfio: Adopt iommufd
@ 2022-04-22 22:09   ` Alex Williamson
  0 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-22 22:09 UTC (permalink / raw)
  To: Yi Liu
  Cc: cohuck, qemu-devel, david, thuth, farman, mjrosato, akrowiak,
	pasic, jjherne, jasowang, kvm, jgg, nicolinc, eric.auger,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx,
	libvir-list, Laine Stump

[Cc +libvirt folks]

On Thu, 14 Apr 2022 03:46:52 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> Other devices like vpda, vfio mdev and etc. are not considered yet.
> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be mentioned
> as iommufd backend in this series). The two backend types have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks like with both BEs.
> 
>                     VFIO                           AddressSpace/Memory
>     +-------+  +----------+  +-----+  +-----+
>     |  pci  |  | platform |  |  ap |  | ccw |
>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>         |           |           |        |        |   AddressSpace       |
>         |           |           |        |        +------------+---------+
>     +---V-----------V-----------V--------V----+               /
>     |           VFIOAddressSpace              | <------------+
>     |                  |                      |  MemoryListener
>     |          VFIOContainer list             |
>     +-------+----------------------------+----+
>             |                            |
>             |                            |
>     +-------V------+            +--------V----------+
>     |   iommufd    |            |    vfio legacy    |
>     |  container   |            |     container     |
>     +-------+------+            +--------+----------+
>             |                            |
>             | /dev/iommu                 | /dev/vfio/vfio
>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>  Userspace  |                            |
>  ===========+============================+================================
>  Kernel     |  device fd                 |
>             +---------------+            | group/container fd
>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>             |  ATTACH_IOAS) |            | device fd
>             |               |            |
>             |       +-------V------------V-----------------+
>     iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
>  ioas_copy) |                 |                    | map/unmap
>             |                 |                    |
>      +------V------+    +-----V------+      +------V--------+
>      | iommfd core |    |  device    |      |  vfio iommu   |
>      +-------------+    +------------+      +---------------+
> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>               (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object which acts as a base class
> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd based container. The base class implements generic code
> such as code related to memory_listener and address space management whereas
> the derived class implements callbacks that depend on the kernel user space
> being used.
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.

I've discussed this a bit with Eric, but let me propose a different
command line interface.  Libvirt generally likes to pass file
descriptors to QEMU rather than grant it access to those files
directly.  This was problematic with vfio-pci because libvirt can't
easily know when QEMU will want to grab another /dev/vfio/vfio
container.  Therefore we abandoned this approach and instead libvirt
grants file permissions.

However, with iommufd there's no reason that QEMU ever needs more than
a single instance of /dev/iommufd and we're using per device vfio file
descriptors, so it seems like a good time to revisit this.

The interface I was considering would be to add an iommufd object to
QEMU, so we might have a:

-device iommufd[,fd=#][,id=foo]

For non-libivrt usage this would have the ability to open /dev/iommufd
itself if an fd is not provided.  This object could be shared with
other iommufd users in the VM and maybe we'd allow multiple instances
for more esoteric use cases.  [NB, maybe this should be a -object rather than
-device since the iommufd is not a guest visible device?]

The vfio-pci device might then become:

-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]

So essentially we can specify the device via host, sysfsdev, or passing
an fd to the vfio device file.  When an iommufd object is specified,
"foo" in the example above, each of those options would use the
vfio-device access mechanism, essentially the same as iommufd=on in
your example.  With the fd passing option, an iommufd object would be
required and necessarily use device level access.

In your example, the iommufd=auto seems especially troublesome for
libvirt because QEMU is going to have different locked memory
requirements based on whether we're using type1 or iommufd, where the
latter resolves the duplicate accounting issues.  libvirt needs to know
deterministically which backend is being used, which this proposal seems
to provide, while at the same time bringing us more in line with fd
passing.  Thoughts?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-22 22:09   ` Alex Williamson
@ 2022-04-25 10:10     ` Daniel P. Berrangé
  -1 siblings, 0 replies; 125+ messages in thread
From: Daniel P. Berrangé @ 2022-04-25 10:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, akrowiak, jjherne, chao.p.peng, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, yi.y.sun, nicolinc, kevin.tian, jgg, eric.auger.pro,
	david

On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> [Cc +libvirt folks]
> 
> On Thu, 14 Apr 2022 03:46:52 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > With the introduction of iommufd[1], the linux kernel provides a generic
> > interface for userspace drivers to propagate their DMA mappings to kernel
> > for assigned devices. This series does the porting of the VFIO devices
> > onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> > Other devices like vdpa, vfio mdev, etc. are not considered yet.

snip

> > The selection of the backend is made on a device basis using the new
> > iommufd option (on/off/auto). By default the iommufd backend is selected
> > if supported by the host and by QEMU (iommufd KConfig). This option is
> > currently available only for the vfio-pci device. For other types of
> > devices, it does not yet exist and the legacy BE is chosen by default.
> 
> I've discussed this a bit with Eric, but let me propose a different
> command line interface.  Libvirt generally likes to pass file
> descriptors to QEMU rather than grant it access to those files
> directly.  This was problematic with vfio-pci because libvirt can't
> easily know when QEMU will want to grab another /dev/vfio/vfio
> container.  Therefore we abandoned this approach and instead libvirt
> grants file permissions.
> 
> However, with iommufd there's no reason that QEMU ever needs more than
> a single instance of /dev/iommufd and we're using per device vfio file
> descriptors, so it seems like a good time to revisit this.

I assume access to '/dev/iommufd' gives the process somewhat elevated
privileges, such that you don't want to unconditionally give QEMU
access to this device ?

> The interface I was considering would be to add an iommufd object to
> QEMU, so we might have a:
> 
> -device iommufd[,fd=#][,id=foo]
> 
> For non-libvirt usage this would have the ability to open /dev/iommufd
> itself if an fd is not provided.  This object could be shared with
> other iommufd users in the VM and maybe we'd allow multiple instances
> for more esoteric use cases.  [NB, maybe this should be a -object rather than
> -device since the iommufd is not a guest visible device?]

Yes,  -object would be the right answer for something that's purely
a host side backend impl selector.

> The vfio-pci device might then become:
> 
> -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> 
> So essentially we can specify the device via host, sysfsdev, or passing
> an fd to the vfio device file.  When an iommufd object is specified,
> "foo" in the example above, each of those options would use the
> vfio-device access mechanism, essentially the same as iommufd=on in
> your example.  With the fd passing option, an iommufd object would be
> required and necessarily use device level access.
> 
> In your example, the iommufd=auto seems especially troublesome for
> libvirt because QEMU is going to have different locked memory
> requirements based on whether we're using type1 or iommufd, where the
> latter resolves the duplicate accounting issues.  libvirt needs to know
> deterministically which backend is being used, which this proposal seems
> to provide, while at the same time bringing us more in line with fd
> passing.  Thoughts?  Thanks,

Yep, I agree that libvirt needs to have more direct control over this.
This is also even more important if there are notable feature differences
in the 2 backends.

I wonder if anyone has considered an even more distinct impl, whereby
we have a completely different device type on the backend, eg

  -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]

If a vendor wants to fully remove the legacy impl, they can then use the
Kconfig mechanism to disable the build of the legacy impl device, while
keeping the iommu impl (or vice versa if the new iommu impl isn't considered
reliable enough for them to support yet).

Libvirt would use

   -object iommu,id=iommu0,fd=NNN
   -device vfio-iommu-pci,fd=MMM,iommu=iommu0

Non-libvirt would use a simpler

   -device vfio-iommu-pci,host=0000:03:22.1

with QEMU auto-creating an 'iommu' object in the background.

This would fit into libvirt's existing modelling better. We currently have
a concept of a PCI assignment backend, which previously supported the
legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
feels like a 3rd PCI assignment approach, and so fits with how we modelled
it as a different device type in the past.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 125+ messages in thread


* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-25 10:10     ` Daniel P. Berrangé
  (?)
@ 2022-04-25 13:36     ` Jason Gunthorpe
  -1 siblings, 0 replies; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-25 13:36 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Alex Williamson, Yi Liu, akrowiak, jjherne, chao.p.peng, kvm,
	Laine Stump, libvir-list, jasowang, cohuck, thuth, peterx,
	qemu-devel, pasic, eric.auger, yi.y.sun, nicolinc, kevin.tian,
	eric.auger.pro, david

On Mon, Apr 25, 2022 at 11:10:14AM +0100, Daniel P. Berrangé wrote:

> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.
> 
> I assume access to '/dev/iommufd' gives the process somewhat elevated
> privileges, such that you don't want to unconditionally give QEMU
> access to this device ?

It doesn't give much; at worst it allows userspace to allocate kernel
memory and pin pages, which can already be done through all sorts of
other interfaces qemu already has access to.

Jason

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-25 10:10     ` Daniel P. Berrangé
@ 2022-04-25 14:37       ` Alex Williamson
  -1 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-25 14:37 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Yi Liu, akrowiak, jjherne, chao.p.peng, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, yi.y.sun, nicolinc, kevin.tian, jgg, eric.auger.pro,
	david

On Mon, 25 Apr 2022 11:10:14 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> > [Cc +libvirt folks]
> > 
> > On Thu, 14 Apr 2022 03:46:52 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > With the introduction of iommufd[1], the linux kernel provides a generic
> > > interface for userspace drivers to propagate their DMA mappings to kernel
> > > for assigned devices. This series does the porting of the VFIO devices
> > > onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> > > Other devices like vdpa, vfio mdev, etc. are not considered yet.  
> 
> snip
> 
> > > The selection of the backend is made on a device basis using the new
> > > iommufd option (on/off/auto). By default the iommufd backend is selected
> > > if supported by the host and by QEMU (iommufd KConfig). This option is
> > > currently available only for the vfio-pci device. For other types of
> > > devices, it does not yet exist and the legacy BE is chosen by default.  
> > 
> > I've discussed this a bit with Eric, but let me propose a different
> > command line interface.  Libvirt generally likes to pass file
> > descriptors to QEMU rather than grant it access to those files
> > directly.  This was problematic with vfio-pci because libvirt can't
> > easily know when QEMU will want to grab another /dev/vfio/vfio
> > container.  Therefore we abandoned this approach and instead libvirt
> > grants file permissions.
> > 
> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.  
> 
> I assume access to '/dev/iommufd' gives the process somewhat elevated
> privileges, such that you don't want to unconditionally give QEMU
> access to this device ?

It's not that dissimilar to /dev/vfio/vfio; it's an unprivileged
interface which should have limited scope for abuse, but more so here
the goal would be to de-privilege QEMU one step further so that it
cannot open the device file itself.

> > The interface I was considering would be to add an iommufd object to
> > QEMU, so we might have a:
> > 
> > -device iommufd[,fd=#][,id=foo]
> > 
> > For non-libvirt usage this would have the ability to open /dev/iommufd
> > itself if an fd is not provided.  This object could be shared with
> > other iommufd users in the VM and maybe we'd allow multiple instances
> > for more esoteric use cases.  [NB, maybe this should be a -object rather than
> > -device since the iommufd is not a guest visible device?]  
> 
> Yes,  -object would be the right answer for something that's purely
> a host side backend impl selector.
> 
> > The vfio-pci device might then become:
> > 
> > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> > 
> > So essentially we can specify the device via host, sysfsdev, or passing
> > an fd to the vfio device file.  When an iommufd object is specified,
> > "foo" in the example above, each of those options would use the
> > vfio-device access mechanism, essentially the same as iommufd=on in
> > your example.  With the fd passing option, an iommufd object would be
> > required and necessarily use device level access.
> > 
> > In your example, the iommufd=auto seems especially troublesome for
> > libvirt because QEMU is going to have different locked memory
> > requirements based on whether we're using type1 or iommufd, where the
> > latter resolves the duplicate accounting issues.  libvirt needs to know
> > deterministically which backend is being used, which this proposal seems
> > to provide, while at the same time bringing us more in line with fd
> > passing.  Thoughts?  Thanks,  
> 
> Yep, I agree that libvirt needs to have more direct control over this.
> This is also even more important if there are notable feature differences
> in the 2 backends.
> 
> I wonder if anyone has considered an even more distinct impl, whereby
> we have a completely different device type on the backend, eg
> 
>   -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> 
> If a vendor wants to fully remove the legacy impl, they can then use the
> Kconfig mechanism to disable the build of the legacy impl device, while
> keeping the iommu impl (or vice versa if the new iommu impl isn't considered
> reliable enough for them to support yet).
> 
> Libvirt would use
> 
>    -object iommu,id=iommu0,fd=NNN
>    -device vfio-iommu-pci,fd=MMM,iommu=iommu0
> 
> Non-libvirt would use a simpler
> 
>    -device vfio-iommu-pci,host=0000:03:22.1
> 
> with QEMU auto-creating an 'iommu' object in the background.
> 
> This would fit into libvirt's existing modelling better. We currently have
> a concept of a PCI assignment backend, which previously supported the
> legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
> feels like a 3rd PCI assignment approach, and so fits with how we modelled
> it as a different device type in the past.

I don't think we want to conflate "iommu" and "iommufd", we're creating
an object that interfaces into the iommufd uAPI, not an iommu itself.
Likewise "vfio-iommu-pci" is just confusing, there was an iommu
interface previously, it's just a different implementation now and as
far as the VM interface to the device, it's identical.  Note that a
"vfio-iommufd-pci" device multiplies the matrix of every vfio device
for a rather subtle implementation detail.

My expectation would be that libvirt uses:

 -object iommufd,id=iommufd0,fd=NNN
 -device vfio-pci,fd=MMM,iommufd=iommufd0

Whereas simple QEMU command line would be:

 -object iommufd,id=iommufd0
 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0

The iommufd object would open /dev/iommufd itself.  Creating an
implicit iommufd object is somewhat problematic because one of the
things I forgot to highlight in my previous description is that the
iommufd object is meant to be shared across not only various vfio
devices (platform, ccw, ap, nvme, etc.), but also across subsystems,
e.g. vdpa.
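
For instance, a single iommufd object shared by several assigned devices
might look like the sketch below (a sketch only: the vfio-platform option
spelling is assumed for illustration, and nothing is launched, the snippet
just assembles the arguments):

```shell
# One iommufd object shared by multiple assigned devices (hypothetical).
shared="-object iommufd,id=iommufd0 \
-device vfio-pci,iommufd=iommufd0,host=0000:02:00.0 \
-device vfio-platform,iommufd=iommufd0,host=fff51000.ethernet"
echo "$shared"
```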

If the old style were used:

 -device vfio-pci,host=0000:02:00.0

Then QEMU would use vfio for the IOMMU backend.

If libvirt/userspace wants to query whether "legacy" vfio is still
supported by the host kernel, I think it'd only need to look for
whether the /dev/vfio/vfio container interface still exists.

If we need some means for QEMU to remove legacy support, I'd rather
find a way to do it via probing device options.  It's easy enough to
see if iommufd support exists by looking for the presence of the
iommufd option for the vfio-pci device and Kconfig within QEMU could be
used regardless of whether we define a new device name.  Thanks,
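
A hypothetical detection helper combining the two probes above: the
legacy backend shows up as the /dev/vfio/vfio container node, while
iommufd support would in practice be probed via the device's option list.
The function is parameterized on a dev root purely so it can be exercised
against a fake tree; the /dev/iommu name follows this series.

```shell
# Report which backends a dev tree advertises (illustrative only).
detect_vfio_backends() {
    devroot="$1"
    backends=""
    # Legacy type1: the container node still exists.
    [ -e "$devroot/vfio/vfio" ] && backends="$backends legacy"
    # iommufd: the char dev exists; a QEMU-side probe would instead be
    # something like: qemu-system-x86_64 -device vfio-pci,help | grep iommufd
    [ -e "$devroot/iommu" ] && backends="$backends iommufd"
    echo "${backends# }"
}
```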

Alex


^ permalink raw reply	[flat|nested] 125+ messages in thread


* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-19  3:26     ` Nicolin Chen
@ 2022-04-25 19:40         ` Eric Auger
  0 siblings, 0 replies; 125+ messages in thread
From: Eric Auger @ 2022-04-25 19:40 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Yi Liu, alex.williamson, cohuck, qemu-devel, david, thuth,
	farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, jgg,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx

Hi Nicolin,

On 4/19/22 5:26 AM, Nicolin Chen wrote:
> On Sun, Apr 17, 2022 at 12:30:40PM +0200, Eric Auger wrote:
>
>>>> - More tests
>>> I did a quick test on my ARM64 platform, using "iommu=smmuv3"
>>> string. The behaviors are different between using default and
>>> using legacy "iommufd=off".
>>>
>>> The legacy pathway exits the VM with:
>>>     vfio 0002:01:00.0:
>>>     failed to setup container for group 1:
>>>     memory listener initialization failed:
>>>     Region smmuv3-iommu-memory-region-16-0:
>>>     device 00.02.0 requires iommu MAP notifier which is not currently supported
>>>
>>> while the iommufd pathway started the VM but reported errors
>>> from host kernel about address translation failures, probably
>>> because of accessing unmapped addresses.
>>>
>>> I found iommufd pathway also calls error_propagate_prepend()
>>> to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
>>> doesn't get a chance to print errp out. Perhaps there should
>>> be a final error check somewhere to exit?
>> thank you for giving it a try.
>>
>> vsmmuv3 + vfio is not supported as we miss the HW nested stage support
>> and SMMU does not support cache mode. If you want to test viommu on ARM
>> you shall test virtio-iommu+vfio. This should work but this is not yet
>> tested.
> I tried "-device virtio-iommu" and "-device virtio-iommu-pci"
> separately with vfio-pci, but neither seems to work. The host
> SMMU driver reports Translation Faults.
>
> Do you know what commands I should use to run QEMU for that
> combination?
you should use:

 -device virtio-iommu-pci -device vfio-pci,host=<BDF>

Please make sure the "-device virtio-iommu-pci" is set *before* the
"-device vfio-pci,"

Otherwise the IOMMU MR notifiers are not set properly and this may be
the cause of your physical SMMU translation faults.
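
A fuller invocation along those lines might look like the sketch below
(machine options and the BDF are placeholders; the point is only the
relative ordering of the two -device options, and nothing is launched,
the arguments are just assembled):

```shell
# virtio-iommu-pci must precede vfio-pci so the IOMMU MR notifiers are
# wired up before the VFIO device registers (hypothetical command line).
QEMU_ARGS="-M virt -enable-kvm -cpu host \
-device virtio-iommu-pci \
-device vfio-pci,host=0000:03:22.1"
echo "$QEMU_ARGS"
```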

Eric
>
>> I pushed a fix for the error notification issue:
>> qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git
> Yes. This fixes the problem. Thanks!
>


^ permalink raw reply	[flat|nested] 125+ messages in thread


* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-18 12:09     ` Yi Liu
@ 2022-04-25 19:51       ` Eric Auger
  -1 siblings, 0 replies; 125+ messages in thread
From: Eric Auger @ 2022-04-25 19:51 UTC (permalink / raw)
  To: Yi Liu, Tian, Kevin, alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger.pro, Peng, Chao P, Sun,
	Yi Y, peterx

Hi,

On 4/18/22 2:09 PM, Yi Liu wrote:
> Hi Kevin,
>
> On 2022/4/18 16:49, Tian, Kevin wrote:
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Sent: Thursday, April 14, 2022 6:47 PM
>>>
>>> With the introduction of iommufd[1], the linux kernel provides a
>>> generic
>>> interface for userspace drivers to propagate their DMA mappings to
>>> kernel
>>> for assigned devices. This series does the porting of the VFIO devices
>>> onto the /dev/iommu uapi and let it coexist with the legacy
>>> implementation.
>>> Other devices like vdpa, vfio mdev, etc. are not considered yet.
>>
>> vfio mdev has no special support in QEMU. It's just not supported by
>> iommufd yet, and thus can only be operated through the legacy container
>> interface at this point. Later, once it's supported by the kernel,
>> presumably no additional enabling work will be required for mdev in
>> QEMU.
>
> yes. will make it more precise in next version.
>
>>>
>>> For vfio devices, the new interface is tied with device fd and iommufd
>>> as the iommufd solution is device-centric. This is different from
>>> legacy
>>> vfio which is group-centric. To support both interfaces in QEMU, this
>>> series introduces the iommu backend concept in the form of different
>>> container classes. The existing vfio container is named legacy
>>> container
>>> (equivalent with legacy iommu backend in this series), while the new
>>> iommufd based container is named as iommufd container (may also be
>>> mentioned
>>> as iommufd backend in this series). The two backend types have their
>>> own
>>> way to setup secure context and dma management interface. Below diagram
>>> shows how it looks like with both BEs.
>>>
>>>                      VFIO                           AddressSpace/Memory
>>>      +-------+  +----------+  +-----+  +-----+
>>>      |  pci  |  | platform |  |  ap |  | ccw |
>>>      +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>>          |           |           |        |        |   AddressSpace       |
>>>          |           |           |        |        +------------+---------+
>>>      +---V-----------V-----------V--------V----+               /
>>>      |           VFIOAddressSpace              | <------------+
>>>      |                  |                      |  MemoryListener
>>>      |          VFIOContainer list             |
>>>      +-------+----------------------------+----+
>>>              |                            |
>>>              |                            |
>>>      +-------V------+            +--------V----------+
>>>      |   iommufd    |            |    vfio legacy    |
>>>      |  container   |            |     container     |
>>>      +-------+------+            +--------+----------+
>>>              |                            |
>>>              | /dev/iommu                 | /dev/vfio/vfio
>>>              | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>>   Userspace  |                            |
>>>
>>> ===========+============================+================================
>>>   Kernel     |  device fd                 |
>>>              +---------------+            | group/container fd
>>>              | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>>              |  ATTACH_IOAS) |            | device fd
>>>              |               |            |
>>>              |       +-------V------------V-----------------+
>>>      iommufd |       |                vfio                  |
>>> (map/unmap  |       +---------+--------------------+-------+
>>>   ioas_copy) |                 |                    | map/unmap
>>>              |                 |                    |
>>>       +------V------+    +-----V------+      +------V--------+
>>>       | iommfd core |    |  device    |      |  vfio iommu   |
>>>       +-------------+    +------------+      +---------------+
>>
>> last row: s/iommfd/iommufd/
>
> thanks. a typo.
>
>> Overall this sounds like a reasonable abstraction. Later when vdpa starts
>> supporting iommufd probably the iommufd BE will become even
>> smaller with more logic shareable between vfio and vdpa.
>
> let's see if Jason Wang will give some idea. :-)
>
>>>
>>> [Secure Context setup]
>>> - iommufd BE: uses device fd and iommufd to setup secure context
>>>                (bind_iommufd, attach_ioas)
>>> - vfio legacy BE: uses group fd and container fd to setup secure
>>> context
>>>                    (set_container, set_iommu)
>>> [Device access]
>>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>>> [DMA Mapping flow]
>>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>>> - VFIO populates DMA map/unmap via the container BEs
>>>    *) iommufd BE: uses iommufd
>>>    *) vfio legacy BE: uses container fd
>>>
>>> This series qomifies the VFIOContainer object which acts as a base
>>> class
>>
>> what does 'qomify' mean? I didn't find this word from dictionary...
>>
>>> for a container. This base class is derived into the legacy VFIO
>>> container
>>> and the new iommufd based container. The base class implements generic
>>> code
>>> such as code related to memory_listener and address space management
>>> whereas
>>> the derived class implements callbacks that depend on the kernel
>>> user space
>>
>> 'the kernel user space'?
>
> Aha, I just wanted to express that different BE callbacks will use
> different user interfaces exposed by the kernel. Will refine the wording.
>
>>
>>> being used.
>>>
>>> The selection of the backend is made on a device basis using the new
>>> iommufd option (on/off/auto). By default the iommufd backend is
>>> selected
>>> if supported by the host and by QEMU (iommufd KConfig). This option is
>>> currently available only for the vfio-pci device. For other types of
>>> devices, it does not yet exist and the legacy BE is chosen by default.
>>>
>>> Test done:
>>> - PCI and Platform device were tested
>>
>> In this case PCI uses iommufd while platform device uses legacy?
>
> For PCI, both legacy and iommufd were tested. The exploration kernel
> branch doesn't have the new device uapi for platform device, so I
> didn't test it.
> But I remember Eric should have tested it with iommufd. Eric?
No, I just ran non-regression tests for vfio-platform, in legacy mode. I
did not integrate with the new device uapi for platform devices.
>
>>> - ccw and ap were only compile-tested
>>> - limited device hotplug test
>>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>>
>>> This series was co-developed by Eric Auger and me based on the
>>> exploration
>>> iommufd kernel[2], complete code of this series is available in[3]. As
>>> iommufd kernel is in the early step (only iommufd generic interface
>>> is in
>>> mailing list), so this series hasn't made the iommufd backend fully
>>> on par
>>> with legacy backend w.r.t. features like p2p mappings, coherency
>>> tracking,
>>
>> what does 'coherency tracking' mean here? if related to iommu enforce
>> snoop it is fully handled by the kernel so far. I didn't find any use of
>> VFIO_DMA_CC_IOMMU in current Qemu.
>
> It's the kvm_group add/del stuff. Perhaps saying "kvm_group add/del
> equivalence" would be better?
>
>>> live migration, etc. This series hasn't supported PCI devices
>>> without FLR
>>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
>>> userspace
>>> is using iommufd. The kernel needs to be updated to accept device fd
>>> list for
>>> reset when userspace is using iommufd. Related work is in progress by
>>> Jason[4].
>>>
>>> TODOs:
>>> - Add DMA alias check for iommufd BE (group level)
>>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>>    VFIO_DEVICE_PCI_HOT_RESET gap
>>> - Cleanup the VFIODevice fields as it's used in both BEs
>>> - Add locks
>>> - Replace list with g_tree
>>> - More tests
>>>
>>> Patch Overview:
>>>
>>> - Preparation:
>>>    0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>>    0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>>    0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>>    0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>>    0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-
>>> iommu_m.patch
>>
>> 3-5 are pure cleanups which could be sent out separately
>
> yes. may send later after checking with Eric. :-)
yes makes sense to send them separately.

Thanks

Eric
>
>>>    0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>>
>>> - Introduce container object and convert existing vfio to use it:
>>>    0007-vfio-Add-base-object-for-VFIOContainer.patch
>>>    0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>>    0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>>    0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>>    0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>>    0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>>    0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>>
>>> - Introduce iommufd based container:
>>>    0014-hw-iommufd-Creation.patch
>>>    0015-vfio-iommufd-Implement-iommufd-backend.patch
>>>    0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>>
>>> - Add backend selection for vfio-pci:
>>>    0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>>    0018-vfio-pci-Add-an-iommufd-option.patch
>>>
>>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
>>> iommufd_jgg@nvidia.com/
>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-
>>> vfio_mdev_no_group_jgg@nvidia.com/
>>
>> Following is probably more relevant to [4]:
>>
>> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/
>>
>
> Absolutely. :-) Thanks.
>
>> Thanks
>> Kevin
>


^ permalink raw reply	[flat|nested] 125+ messages in thread


* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-18  8:49   ` Tian, Kevin
@ 2022-04-25 19:55     ` Eric Auger
  -1 siblings, 0 replies; 125+ messages in thread
From: Eric Auger @ 2022-04-25 19:55 UTC (permalink / raw)
  To: Tian, Kevin, Liu, Yi L, alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger.pro, Peng, Chao P, Sun,
	Yi Y, peterx

Hi Kevin,

On 4/18/22 10:49 AM, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, April 14, 2022 6:47 PM
>>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vdpa, vfio mdev, etc. are not considered yet.
> vfio mdev has no special support in QEMU. It's just not supported by
> iommufd yet, and thus can only be operated through the legacy container
> interface at this point. Later, once it's supported by the kernel,
> presumably no additional enabling work will be required for mdev in QEMU.
>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent with legacy iommu backend in this series), while the new
>> iommufd based container is named as iommufd container (may also be
>> mentioned
>> as iommufd backend in this series). The two backend types have their own
>> way to setup secure context and dma management interface. Below diagram
>> shows how it looks like with both BEs.
>>
>>                     VFIO                           AddressSpace/Memory
>>     +-------+  +----------+  +-----+  +-----+
>>     |  pci  |  | platform |  |  ap |  | ccw |
>>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>         |           |           |        |        |   AddressSpace       |
>>         |           |           |        |        +------------+---------+
>>     +---V-----------V-----------V--------V----+               /
>>     |           VFIOAddressSpace              | <------------+
>>     |                  |                      |  MemoryListener
>>     |          VFIOContainer list             |
>>     +-------+----------------------------+----+
>>             |                            |
>>             |                            |
>>     +-------V------+            +--------V----------+
>>     |   iommufd    |            |    vfio legacy    |
>>     |  container   |            |     container     |
>>     +-------+------+            +--------+----------+
>>             |                            |
>>             | /dev/iommu                 | /dev/vfio/vfio
>>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>  Userspace  |                            |
>>
>> ===========+============================+================================
>>  Kernel     |  device fd                 |
>>             +---------------+            | group/container fd
>>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>             |  ATTACH_IOAS) |            | device fd
>>             |               |            |
>>             |       +-------V------------V-----------------+
>>     iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>  ioas_copy) |                 |                    | map/unmap
>>             |                 |                    |
>>      +------V------+    +-----V------+      +------V--------+
>>      | iommfd core |    |  device    |      |  vfio iommu   |
>>      +-------------+    +------------+      +---------------+
> last row: s/iommfd/iommufd/
>
> Overall this sounds like a reasonable abstraction. Later when vdpa starts
> supporting iommufd probably the iommufd BE will become even
> smaller with more logic shareable between vfio and vdpa.
>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>               (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                   (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>   *) iommufd BE: uses iommufd
>>   *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
> what does 'qomify' mean? I didn't find this word from dictionary...
Sorry, this is pure QEMU terminology: it stands for "QEMU Object Model".
Additional info at:
https://qemu.readthedocs.io/en/latest/devel/qom.html
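To illustrate the idea, the base-class/derived-class dispatch that
"qomifying" VFIOContainer enables can be sketched in plain C below. All
names here are invented for illustration; the actual series uses the QOM
TypeInfo/ObjectClass machinery from qom/object.h rather than raw structs:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for a QOM "class": a table of per-backend callbacks
 * that generic code (e.g. the MemoryListener path) calls without knowing
 * which backend is behind the container. */
typedef struct VFIOContainerSketch VFIOContainerSketch;

typedef struct {
    const char *name;
    int (*dma_map)(VFIOContainerSketch *c, unsigned long iova, size_t size);
} ContainerOpsSketch;

struct VFIOContainerSketch {
    const ContainerOpsSketch *ops; /* "class" pointer */
    int fd;                        /* /dev/iommu fd or legacy container fd */
};

/* Legacy backend: a real implementation would issue VFIO_IOMMU_MAP_DMA
 * on the container fd. */
static int legacy_dma_map(VFIOContainerSketch *c, unsigned long iova,
                          size_t size)
{
    (void)c;
    printf("legacy map iova=0x%lx size=%zu\n", iova, size);
    return 0;
}

/* iommufd backend: a real implementation would issue IOMMU_IOAS_MAP
 * on the iommufd. */
static int iommufd_dma_map(VFIOContainerSketch *c, unsigned long iova,
                           size_t size)
{
    (void)c;
    printf("iommufd map iova=0x%lx size=%zu\n", iova, size);
    return 0;
}

static const ContainerOpsSketch legacy_ops  = { "legacy",  legacy_dma_map };
static const ContainerOpsSketch iommufd_ops = { "iommufd", iommufd_dma_map };

/* Generic code only sees the base "class" and dispatches via ops. */
static int container_dma_map(VFIOContainerSketch *c, unsigned long iova,
                             size_t size)
{
    return c->ops->dma_map(c, iova, size);
}
```

This is just the dispatch pattern; QOM additionally gives type registration,
casting macros, and instance/class initializers on top of it.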

Eric
>
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic
>> code
>> such as code related to memory_listener and address space management
>> whereas
>> the derived class implements callbacks that depend on the kernel user space
> 'the kernel user space'?
>
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
>>
>> Test done:
>> - PCI and Platform device were tested
> In this case PCI uses iommufd while platform device uses legacy?
>
>> - ccw and ap were only compile-tested
>> - limited device hotplug test
>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>
>> This series was co-developed by Eric Auger and me based on the exploration
>> iommufd kernel[2], complete code of this series is available in[3]. As
>> iommufd kernel is in the early step (only iommufd generic interface is in
>> mailing list), so this series hasn't made the iommufd backend fully on par
>> with legacy backend w.r.t. features like p2p mappings, coherency tracking,
> what does 'coherency tracking' mean here? if related to iommu enforce
> snoop it is fully handled by the kernel so far. I didn't find any use of
> VFIO_DMA_CC_IOMMU in current Qemu.
>
>> live migration, etc. This series hasn't supported PCI devices without FLR
>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
>> userspace
>> is using iommufd. The kernel needs to be updated to accept device fd list for
>> reset when userspace is using iommufd. Related work is in progress by
>> Jason[4].
>>
>> TODOs:
>> - Add DMA alias check for iommufd BE (group level)
>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>   VFIO_DEVICE_PCI_HOT_RESET gap
>> - Cleanup the VFIODevice fields as it's used in both BEs
>> - Add locks
>> - Replace list with g_tree
>> - More tests
>>
>> Patch Overview:
>>
>> - Preparation:
>>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>   0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-
>> iommu_m.patch
> 3-5 are pure cleanups which could be sent out separately 
>
>>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>
>> - Introduce container object and convert existing vfio to use it:
>>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>
>> - Introduce iommufd based container:
>>   0014-hw-iommufd-Creation.patch
>>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>
>> - Add backend selection for vfio-pci:
>>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>   0018-vfio-pci-Add-an-iommufd-option.patch
>>
>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
>> iommufd_jgg@nvidia.com/
>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-
>> vfio_mdev_no_group_jgg@nvidia.com/
> Following is probably more relevant to [4]:
>
> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/
>
> Thanks
> Kevin
>


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
@ 2022-04-25 19:55     ` Eric Auger
  0 siblings, 0 replies; 125+ messages in thread
From: Eric Auger @ 2022-04-25 19:55 UTC (permalink / raw)
  To: Tian, Kevin, Liu, Yi L, alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, Peng, Chao P, kvm, mjrosato, jasowang,
	farman, peterx, pasic, Sun, Yi Y, nicolinc, jgg, eric.auger.pro,
	david

Hi Kevin,

On 4/18/22 10:49 AM, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, April 14, 2022 6:47 PM
>>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vpda, vfio mdev and etc. are not considered yet.
> vfio mdev has no special support in Qemu. Just that it's not supported
> by iommufd yet thus can only be operated in legacy container interface at
> this point. Later once it's supported by the kernel suppose no additional
> enabling work is required for mdev in Qemu.
>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent with legacy iommu backend in this series), while the new
>> iommufd based container is named as iommufd container (may also be
>> mentioned
>> as iommufd backend in this series). The two backend types have their own
>> way to setup secure context and dma management interface. Below diagram
>> shows how it looks like with both BEs.
>>
>>                     VFIO                           AddressSpace/Memory
>>     +-------+  +----------+  +-----+  +-----+
>>     |  pci  |  | platform |  |  ap |  | ccw |
>>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>         |           |           |        |        |   AddressSpace       |
>>         |           |           |        |        +------------+---------+
>>     +---V-----------V-----------V--------V----+               /
>>     |           VFIOAddressSpace              | <------------+
>>     |                  |                      |  MemoryListener
>>     |          VFIOContainer list             |
>>     +-------+----------------------------+----+
>>             |                            |
>>             |                            |
>>     +-------V------+            +--------V----------+
>>     |   iommufd    |            |    vfio legacy    |
>>     |  container   |            |     container     |
>>     +-------+------+            +--------+----------+
>>             |                            |
>>             | /dev/iommu                 | /dev/vfio/vfio
>>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>  Userspace  |                            |
>>  ===========+============================+================================
>>  Kernel     |  device fd                 |
>>             +---------------+            | group/container fd
>>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>             |  ATTACH_IOAS) |            | device fd
>>             |               |            |
>>             |       +-------V------------V-----------------+
>>     iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>  ioas_copy) |                 |                    | map/unmap
>>             |                 |                    |
>>      +------V------+    +-----V------+      +------V--------+
>>      | iommfd core |    |  device    |      |  vfio iommu   |
>>      +-------------+    +------------+      +---------------+
> last row: s/iommfd/iommufd/
>
> overall this sounds like a reasonable abstraction. Later when vdpa starts
> supporting iommufd probably the iommufd BE will become even
> smaller with more logic shareable between vfio and vdpa.
>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>               (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                   (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>   *) iommufd BE: uses iommufd
>>   *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
> what does 'qomify' mean? I didn't find this word in the dictionary...
Sorry, this is pure QEMU terminology: "qomify" means converting to the QEMU
Object Model (QOM). Additional info at:
https://qemu.readthedocs.io/en/latest/devel/qom.html

Eric
>
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic code
>> such as code related to memory_listener and address space management whereas
>> the derived class implements callbacks that depend on the kernel user space
> 'the kernel user space'?
>
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
>>
>> Test done:
>> - PCI and Platform device were tested
> In this case PCI uses iommufd while platform device uses legacy?
>
>> - ccw and ap were only compile-tested
>> - limited device hotplug test
>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>
>> This series was co-developed by Eric Auger and me based on the exploration
>> iommufd kernel[2]; complete code of this series is available in[3]. As the
>> iommufd kernel work is at an early stage (only the generic iommufd interface
>> is on the mailing list), this series hasn't made the iommufd backend fully
>> on par with the legacy backend w.r.t. features like p2p mappings, coherency tracking,
> what does 'coherency tracking' mean here? if related to iommu enforce
> snoop it is fully handled by the kernel so far. I didn't find any use of
> VFIO_DMA_CC_IOMMU in current Qemu.
>
>> live migration, etc. This series doesn't support PCI devices without FLR
>> either, as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
>> userspace is using iommufd. The kernel needs to be updated to accept a
>> device fd list for reset when userspace is using iommufd. Related work is
>> in progress by Jason[4].
>>
>> TODOs:
>> - Add DMA alias check for iommufd BE (group level)
>> - Make pci.c BE agnostic. Needs kernel change as well to fix the
>>   VFIO_DEVICE_PCI_HOT_RESET gap
>> - Cleanup the VFIODevice fields as it's used in both BEs
>> - Add locks
>> - Replace list with g_tree
>> - More tests
>>
>> Patch Overview:
>>
>> - Preparation:
>>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>   0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch
> 3-5 are pure cleanups which could be sent out separately 
>
>>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>
>> - Introduce container object and covert existing vfio to use it:
>>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>
>> - Introduce iommufd based container:
>>   0014-hw-iommufd-Creation.patch
>>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>
>> - Add backend selection for vfio-pci:
>>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>   0018-vfio-pci-Add-an-iommufd-option.patch
>>
>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-vfio_mdev_no_group_jgg@nvidia.com/
> Following is probably more relevant to [4]:
>
> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/
>
> Thanks
> Kevin
>




* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-22 22:09   ` Alex Williamson
@ 2022-04-25 20:23     ` Eric Auger
  -1 siblings, 0 replies; 125+ messages in thread
From: Eric Auger @ 2022-04-25 20:23 UTC (permalink / raw)
  To: Alex Williamson, Yi Liu
  Cc: akrowiak, jjherne, farman, chao.p.peng, kvm, mjrosato,
	Laine Stump, libvir-list, jasowang, cohuck, thuth, peterx,
	qemu-devel, pasic, yi.y.sun, nicolinc, kevin.tian, jgg,
	eric.auger.pro, david

Hi Alex,

On 4/23/22 12:09 AM, Alex Williamson wrote:
> [Cc +libvirt folks]
>
> On Thu, 14 Apr 2022 03:46:52 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series ports the VFIO devices onto the /dev/iommu
>> uapi and lets it coexist with the legacy implementation. Other devices like
>> vdpa and vfio mdev are not considered yet.
>>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named the legacy container
>> (equivalent to the legacy iommu backend in this series), while the new
>> iommufd based container is named the iommufd container (may also be referred
>> to as the iommufd backend in this series). The two backend types have their
>> own way to set up the secure context and DMA management interface. The
>> diagram below shows what it looks like with both BEs.
>>
>>                     VFIO                           AddressSpace/Memory
>>     +-------+  +----------+  +-----+  +-----+
>>     |  pci  |  | platform |  |  ap |  | ccw |
>>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>         |           |           |        |        |   AddressSpace       |
>>         |           |           |        |        +------------+---------+
>>     +---V-----------V-----------V--------V----+               /
>>     |           VFIOAddressSpace              | <------------+
>>     |                  |                      |  MemoryListener
>>     |          VFIOContainer list             |
>>     +-------+----------------------------+----+
>>             |                            |
>>             |                            |
>>     +-------V------+            +--------V----------+
>>     |   iommufd    |            |    vfio legacy    |
>>     |  container   |            |     container     |
>>     +-------+------+            +--------+----------+
>>             |                            |
>>             | /dev/iommu                 | /dev/vfio/vfio
>>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>  Userspace  |                            |
>>  ===========+============================+================================
>>  Kernel     |  device fd                 |
>>             +---------------+            | group/container fd
>>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>             |  ATTACH_IOAS) |            | device fd
>>             |               |            |
>>             |       +-------V------------V-----------------+
>>     iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>  ioas_copy) |                 |                    | map/unmap
>>             |                 |                    |
>>      +------V------+    +-----V------+      +------V--------+
>>      | iommfd core |    |  device    |      |  vfio iommu   |
>>      +-------------+    +------------+      +---------------+
>>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>               (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                   (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>   *) iommufd BE: uses iommufd
>>   *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic code
>> such as code related to memory_listener and address space management whereas
>> the derived class implements callbacks that depend on the kernel user space
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
> I've discussed this a bit with Eric, but let me propose a different
> command line interface.  Libvirt generally likes to pass file
> descriptors to QEMU rather than grant it access to those files
> directly.  This was problematic with vfio-pci because libvirt can't
> easily know when QEMU will want to grab another /dev/vfio/vfio
> container.  Therefore we abandoned this approach and instead libvirt
> grants file permissions.
>
> However, with iommufd there's no reason that QEMU ever needs more than
> a single instance of /dev/iommufd and we're using per device vfio file
> descriptors, so it seems like a good time to revisit this.
>
> The interface I was considering would be to add an iommufd object to
> QEMU, so we might have a:
>
> -device iommufd[,fd=#][,id=foo]
>
> For non-libvirt usage this would have the ability to open /dev/iommufd
> itself if an fd is not provided.  This object could be shared with
> other iommufd users in the VM and maybe we'd allow multiple instances
> for more esoteric use cases.  [NB, maybe this should be a -object rather than
> -device since the iommufd is not a guest visible device?]
>
> The vfio-pci device might then become:
>
> -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
>
> So essentially we can specify the device via host, sysfsdev, or passing
> an fd to the vfio device file.  When an iommufd object is specified,
> "foo" in the example above, each of those options would use the
> vfio-device access mechanism, essentially the same as iommufd=on in
> your example.  With the fd passing option, an iommufd object would be
> required and necessarily use device level access.
What is the use case you foresee for the "fd=#" option?
>
> In your example, the iommufd=auto seems especially troublesome for
> libvirt because QEMU is going to have different locked memory
> requirements based on whether we're using type1 or iommufd, where the
> latter resolves the duplicate accounting issues.  libvirt needs to know
> deterministically which backend is being used, which this proposal seems
> to provide, while at the same time bringing us more in line with fd
> passing.  Thoughts?  Thanks,
I like your proposal (based on the -object iommufd). The only thing that
may be missing, I think, is that for a qemu end-user who does not care
about the iommu backend being used but just wishes to use the most recent
one available, it adds some extra complexity. But this is not the most
important use case ;)

Thanks

Eric
>
> Alex
>





* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-25 20:23     ` Eric Auger
@ 2022-04-25 22:53       ` Alex Williamson
  -1 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-25 22:53 UTC (permalink / raw)
  To: Eric Auger
  Cc: Yi Liu, cohuck, qemu-devel, david, thuth, farman, mjrosato,
	akrowiak, pasic, jjherne, jasowang, kvm, jgg, nicolinc,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx,
	libvir-list, Laine Stump

On Mon, 25 Apr 2022 22:23:05 +0200
Eric Auger <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 4/23/22 12:09 AM, Alex Williamson wrote:
> > [Cc +libvirt folks]
> >
> > On Thu, 14 Apr 2022 03:46:52 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >  
> >> With the introduction of iommufd[1], the linux kernel provides a generic
> >> interface for userspace drivers to propagate their DMA mappings to kernel
> >> for assigned devices. This series ports the VFIO devices onto the /dev/iommu
> >> uapi and lets it coexist with the legacy implementation. Other devices like
> >> vdpa and vfio mdev are not considered yet.
> >>
> >> For vfio devices, the new interface is tied with device fd and iommufd
> >> as the iommufd solution is device-centric. This is different from legacy
> >> vfio which is group-centric. To support both interfaces in QEMU, this
> >> series introduces the iommu backend concept in the form of different
> >> container classes. The existing vfio container is named the legacy container
> >> (equivalent to the legacy iommu backend in this series), while the new
> >> iommufd based container is named the iommufd container (may also be referred
> >> to as the iommufd backend in this series). The two backend types have their
> >> own way to set up the secure context and DMA management interface. The
> >> diagram below shows what it looks like with both BEs.
> >>
> >>                     VFIO                           AddressSpace/Memory
> >>     +-------+  +----------+  +-----+  +-----+
> >>     |  pci  |  | platform |  |  ap |  | ccw |
> >>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
> >>         |           |           |        |        |   AddressSpace       |
> >>         |           |           |        |        +------------+---------+
> >>     +---V-----------V-----------V--------V----+               /
> >>     |           VFIOAddressSpace              | <------------+
> >>     |                  |                      |  MemoryListener
> >>     |          VFIOContainer list             |
> >>     +-------+----------------------------+----+
> >>             |                            |
> >>             |                            |
> >>     +-------V------+            +--------V----------+
> >>     |   iommufd    |            |    vfio legacy    |
> >>     |  container   |            |     container     |
> >>     +-------+------+            +--------+----------+
> >>             |                            |
> >>             | /dev/iommu                 | /dev/vfio/vfio
> >>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
> >>  Userspace  |                            |
> >>  ===========+============================+================================
> >>  Kernel     |  device fd                 |
> >>             +---------------+            | group/container fd
> >>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
> >>             |  ATTACH_IOAS) |            | device fd
> >>             |               |            |
> >>             |       +-------V------------V-----------------+
> >>     iommufd |       |                vfio                  |
> >> (map/unmap  |       +---------+--------------------+-------+
> >>  ioas_copy) |                 |                    | map/unmap
> >>             |                 |                    |
> >>      +------V------+    +-----V------+      +------V--------+
> >>      | iommfd core |    |  device    |      |  vfio iommu   |
> >>      +-------------+    +------------+      +---------------+
> >>
> >> [Secure Context setup]
> >> - iommufd BE: uses device fd and iommufd to setup secure context
> >>               (bind_iommufd, attach_ioas)
> >> - vfio legacy BE: uses group fd and container fd to setup secure context
> >>                   (set_container, set_iommu)
> >> [Device access]
> >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> >> - vfio legacy BE: device fd is retrieved from group fd ioctl
> >> [DMA Mapping flow]
> >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> >> - VFIO populates DMA map/unmap via the container BEs
> >>   *) iommufd BE: uses iommufd
> >>   *) vfio legacy BE: uses container fd
> >>
> >> This series qomifies the VFIOContainer object which acts as a base class
> >> for a container. This base class is derived into the legacy VFIO container
> >> and the new iommufd based container. The base class implements generic code
> >> such as code related to memory_listener and address space management whereas
> >> the derived class implements callbacks that depend on the kernel user space
> >> being used.
> >>
> >> The selection of the backend is made on a device basis using the new
> >> iommufd option (on/off/auto). By default the iommufd backend is selected
> >> if supported by the host and by QEMU (iommufd KConfig). This option is
> >> currently available only for the vfio-pci device. For other types of
> >> devices, it does not yet exist and the legacy BE is chosen by default.  
> > I've discussed this a bit with Eric, but let me propose a different
> > command line interface.  Libvirt generally likes to pass file
> > descriptors to QEMU rather than grant it access to those files
> > directly.  This was problematic with vfio-pci because libvirt can't
> > easily know when QEMU will want to grab another /dev/vfio/vfio
> > container.  Therefore we abandoned this approach and instead libvirt
> > grants file permissions.
> >
> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.
> >
> > The interface I was considering would be to add an iommufd object to
> > QEMU, so we might have a:
> >
> > -device iommufd[,fd=#][,id=foo]
> >
> > For non-libvirt usage this would have the ability to open /dev/iommufd
> > itself if an fd is not provided.  This object could be shared with
> > other iommufd users in the VM and maybe we'd allow multiple instances
> > for more esoteric use cases.  [NB, maybe this should be a -object rather than
> > -device since the iommufd is not a guest visible device?]
> >
> > The vfio-pci device might then become:
> >
> > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> >
> > So essentially we can specify the device via host, sysfsdev, or passing
> > an fd to the vfio device file.  When an iommufd object is specified,
> > "foo" in the example above, each of those options would use the
> > vfio-device access mechanism, essentially the same as iommufd=on in
> > your example.  With the fd passing option, an iommufd object would be
> > required and necessarily use device level access.  
> What is the use case you foresee for the "fd=#" option?

On the vfio-pci device this was intended to be the actual vfio device
file descriptor.  Once we have a file per device, QEMU doesn't really
have any need to navigate through sysfs to determine which fd to use
other than for user convenience on the command line.  For libvirt usage,
I assume QEMU could accept the device fd, without ever really knowing
anything about the host address or sysfs path of the device.

> >
> > In your example, the iommufd=auto seems especially troublesome for
> > libvirt because QEMU is going to have different locked memory
> > requirements based on whether we're using type1 or iommufd, where the
> > latter resolves the duplicate accounting issues.  libvirt needs to know
> > deterministically which backend is being used, which this proposal seems
> > to provide, while at the same time bringing us more in line with fd
> > passing.  Thoughts?  Thanks,  
> I like your proposal (based on the -object iommufd). The only thing that
> may be missing I think is for a qemu end-user who actually does not care
> about the iommu backend being used but just wishes to use the most
> recent available one it adds some extra complexity. But this is not the
> most important use case ;)

Yeah, I can sympathize with that, but isn't that also why we're pursuing
a vfio compatibility interface at the kernel level?  Eventually, once
the native vfio IOMMU backends go away, the vfio "container" device
file will be provided by iommufd and that transition to the new
interface can be both seamless to the user and apparent to tools like
libvirt.

An end-user with a fixed command line should continue to work and will
eventually get iommufd via compatibility, but taking care of an
end-user that "does not care" and "wishes to use the most recent" is a
non-goal for me.  That would be more troublesome for tools and use cases
that we do care about imo.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 125+ messages in thread


* RE: [RFC 00/18] vfio: Adopt iommufd
  2022-04-25 14:37       ` Alex Williamson
@ 2022-04-26  8:37         ` Tian, Kevin
  -1 siblings, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2022-04-26  8:37 UTC (permalink / raw)
  To: Alex Williamson, Daniel P. Berrangé
  Cc: akrowiak, jjherne, thuth, Peng, Chao P, jgg, kvm, libvir-list,
	jasowang, cohuck, qemu-devel, peterx, pasic, eric.auger, Sun,
	Yi Y, Liu, Yi L, nicolinc, Laine Stump, david, eric.auger.pro

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 25, 2022 10:38 PM
> 
> On Mon, 25 Apr 2022 11:10:14 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> > > [Cc +libvirt folks]
> > >
> > > On Thu, 14 Apr 2022 03:46:52 -0700
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > With the introduction of iommufd[1], the linux kernel provides a
> > > > generic interface for userspace drivers to propagate their DMA
> > > > mappings to kernel for assigned devices. This series does the porting
> > > > of the VFIO devices onto the /dev/iommu uapi and let it coexist with
> > > > the legacy implementation. Other devices like vdpa, vfio mdev and
> > > > etc. are not considered yet.
> >
> > snip
> >
> > > > The selection of the backend is made on a device basis using the new
> > > > iommufd option (on/off/auto). By default the iommufd backend is
> > > > selected if supported by the host and by QEMU (iommufd KConfig).
> > > > This option is currently available only for the vfio-pci device.
> > > > For other types of devices, it does not yet exist and the legacy BE
> > > > is chosen by default.
> > >
> > > I've discussed this a bit with Eric, but let me propose a different
> > > command line interface.  Libvirt generally likes to pass file
> > > descriptors to QEMU rather than grant it access to those files
> > > directly.  This was problematic with vfio-pci because libvirt can't
> > > easily know when QEMU will want to grab another /dev/vfio/vfio
> > > container.  Therefore we abandoned this approach and instead libvirt
> > > grants file permissions.
> > >
> > > However, with iommufd there's no reason that QEMU ever needs more
> > > than a single instance of /dev/iommufd and we're using per device
> > > vfio file descriptors, so it seems like a good time to revisit this.
> >
> > I assume access to '/dev/iommufd' gives the process somewhat elevated
> > privileges, such that you don't want to unconditionally give QEMU
> > access to this device ?
> 
> It's not that dissimilar to /dev/vfio/vfio, it's an unprivileged
> interface which should have limited scope for abuse, but more so here
> the goal would be to de-privilege QEMU that one step further, such that
> it cannot open the device file itself.
> 
> > > The interface I was considering would be to add an iommufd object to
> > > QEMU, so we might have a:
> > >
> > > -device iommufd[,fd=#][,id=foo]
> > >
> > > For non-libvirt usage this would have the ability to open /dev/iommufd
> > > itself if an fd is not provided.  This object could be shared with
> > > other iommufd users in the VM and maybe we'd allow multiple instances
> > > for more esoteric use cases.  [NB, maybe this should be a -object
> > > rather than -device since the iommufd is not a guest visible device?]
> >
> > Yes,  -object would be the right answer for something that's purely
> > a host side backend impl selector.
> >
> > > The vfio-pci device might then become:
> > >
> > > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> > >
> > > So essentially we can specify the device via host, sysfsdev, or passing
> > > an fd to the vfio device file.  When an iommufd object is specified,
> > > "foo" in the example above, each of those options would use the
> > > vfio-device access mechanism, essentially the same as iommufd=on in
> > > your example.  With the fd passing option, an iommufd object would be
> > > required and necessarily use device level access.
> > >
> > > In your example, the iommufd=auto seems especially troublesome for
> > > libvirt because QEMU is going to have different locked memory
> > > requirements based on whether we're using type1 or iommufd, where
> > > the latter resolves the duplicate accounting issues.  libvirt needs
> > > to know

Based on the current plan there will probably be a transition window
between the point where the first vfio device type (vfio-pci) gains
iommufd support and the point where all vfio types support it. Libvirt
can figure out whether to use iommufd by checking for the presence of
/dev/vfio/devices/vfioX. But what would the resource limit policy in
Libvirt be during such a transition window, when both type1 and iommufd
might be used? Or do we just expect Libvirt to support iommufd only
after the transition window ends, to avoid handling such a mess?
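One way a management layer could handle that transition window is a per-device probe: prefer iommufd when the per-device node exists, otherwise fall back to type1 with its legacy per-container locked-memory accounting. The /dev paths follow this series; the decision policy itself is only an illustration, not anything libvirt has committed to.

```python
import os

def pick_backend(has_device_node, has_legacy_container):
    """Per-device backend decision during the transition window."""
    if has_device_node:        # /dev/vfio/devices/vfioX exists
        return "iommufd"       # centralized accounting, single memlock charge
    if has_legacy_container:   # /dev/vfio/vfio still provided by the kernel
        return "type1"         # legacy: locked memory charged per container
    raise RuntimeError("no usable vfio uAPI on this host")

def probe(devname):
    """Probe the actual host for one device, e.g. probe('vfio0')."""
    return pick_backend(
        os.path.exists(f"/dev/vfio/devices/{devname}"),
        os.path.exists("/dev/vfio/vfio"),
    )
```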

> > > deterministically which backend is being used, which this proposal seems
> > > to provide, while at the same time bringing us more in line with fd
> > > passing.  Thoughts?  Thanks,
> >
> > Yep, I agree that libvirt needs to have more direct control over this.
> > This is also even more important if there are notable feature differences
> > in the 2 backends.
> >
> > I wonder if anyone has considered an even more distinct impl, whereby
> > we have a completely different device type on the backend, eg
> >
> >   -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> >
> > If a vendor wants to fully remove the legacy impl, they can then use the
> > Kconfig mechanism to disable the build of the legacy impl device, while
> > keeping the iommu impl (or vice-versa if the new iommu impl isn't
> > considered reliable enough for them to support yet).
> >
> > Libvirt would use
> >
> >    -object iommu,id=iommu0,fd=NNN
> >    -device vfio-iommu-pci,fd=MMM,iommu=iommu0
> >
> > Non-libvirt would use a simpler
> >
> >    -device vfio-iommu-pci,host=0000:03:22.1
> >
> > with QEMU auto-creating a 'iommu' object in the background.
> >
> > This would fit into libvirt's existing modelling better. We currently have
> > a concept of a PCI assignment backend, which previously supported the
> > legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
> > feels like a 3rd PCI assignment approach, and so fits with how we modelled
> > it as a different device type in the past.
> 
> I don't think we want to conflate "iommu" and "iommufd", we're creating
> an object that interfaces into the iommufd uAPI, not an iommu itself.
> Likewise "vfio-iommu-pci" is just confusing, there was an iommu
> interface previously, it's just a different implementation now and as
> far as the VM interface to the device, it's identical.  Note that a
> "vfio-iommufd-pci" device multiplies the matrix of every vfio device
> for a rather subtle implementation detail.
> 
> My expectation would be that libvirt uses:
> 
>  -object iommufd,id=iommufd0,fd=NNN
>  -device vfio-pci,fd=MMM,iommufd=iommufd0
> 
> Whereas simple QEMU command line would be:
> 
>  -object iommufd,id=iommufd0
>  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> 
> The iommufd object would open /dev/iommufd itself.  Creating an
> implicit iommufd object is somewhat problematic because one of the
> things I forgot to highlight in my previous description is that the
> iommufd object is meant to be shared across not only various vfio
> devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> vdpa.

Out of curiosity - in concept one iommufd is sufficient to support all
ioas requirements across subsystems, while having multiple iommufds
instead loses the benefit of centralized accounting. The latter will
also cause some trouble when we start virtualizing ENQCMD, which
requires VM-wide PASID virtualization and thus further needs to share
that information across iommufds. Not unsolvable, but there is really
no gain in adding such complexity. So I'm curious whether QEMU provides
a way to restrict a certain object type to a single instance, to
discourage such multi-iommufd attempts?

> 
> If the old style were used:
> 
>  -device vfio-pci,host=0000:02:00.0
> 
> Then QEMU would use vfio for the IOMMU backend.
> 
> If libvirt/userspace wants to query whether "legacy" vfio is still
> supported by the host kernel, I think it'd only need to look for
> whether the /dev/vfio/vfio container interface still exists.
> 
> If we need some means for QEMU to remove legacy support, I'd rather
> find a way to do it via probing device options.  It's easy enough to
> see if iommufd support exists by looking for the presence of the
> iommufd option for the vfio-pci device and Kconfig within QEMU could be
> used regardless of whether we define a new device name.  Thanks,
> 
> Alex
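Probing for the new device option, as suggested above, can use the standard QMP `device-list-properties` command against a running QEMU. The sketch below is a deliberately minimal QMP client (it ignores asynchronous events); the socket path is whatever `-qmp unix:PATH,server` was given, and the property name "iommufd" is the option proposed in this thread, not a shipped interface.

```python
import json
import socket

def props_include(props, name):
    """Pure check over a device-list-properties result list."""
    return any(p.get("name") == name for p in props)

def qemu_supports_iommufd(qmp_path):
    """Ask QEMU over QMP whether vfio-pci grew an 'iommufd' property."""
    s = socket.socket(socket.AF_UNIX)
    s.connect(qmp_path)
    f = s.makefile("rw")
    json.loads(f.readline())  # consume the QMP greeting banner

    def cmd(name, **args):
        f.write(json.dumps({"execute": name, "arguments": args}) + "\n")
        f.flush()
        return json.loads(f.readline())

    cmd("qmp_capabilities")
    resp = cmd("device-list-properties", typename="vfio-pci")
    return props_include(resp.get("return", []), "iommufd")
```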


^ permalink raw reply	[flat|nested] 125+ messages in thread


* RE: [RFC 00/18] vfio: Adopt iommufd
  2022-04-25 19:55     ` Eric Auger
@ 2022-04-26  8:39       ` Tian, Kevin
  -1 siblings, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2022-04-26  8:39 UTC (permalink / raw)
  To: eric.auger, Liu, Yi L, alex.williamson, cohuck, qemu-devel
  Cc: akrowiak, jjherne, thuth, Peng, Chao P, kvm, mjrosato, jasowang,
	farman, peterx, pasic, Sun, Yi Y, nicolinc, jgg, eric.auger.pro,
	david

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Tuesday, April 26, 2022 3:55 AM
> 
> Hi Kevin,
> 
> On 4/18/22 10:49 AM, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Thursday, April 14, 2022 6:47 PM
> >>
> >> This series qomifies the VFIOContainer object which acts as a base class
> > what does 'qomify' mean? I didn't find this word in the dictionary...
> sorry this is pure QEMU terminology. This stands for "QEMU Object Model"
> additional info at:
> https://qemu.readthedocs.io/en/latest/devel/qom.html
> 

Nice to know!

^ permalink raw reply	[flat|nested] 125+ messages in thread


* RE: [RFC 00/18] vfio: Adopt iommufd
  2022-04-14 10:46 ` Yi Liu
@ 2022-04-26  9:47   ` Shameerali Kolothum Thodi
  -1 siblings, 0 replies; 125+ messages in thread
From: Shameerali Kolothum Thodi via @ 2022-04-26  9:47 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger, eric.auger.pro,
	kevin.tian, chao.p.peng, yi.y.sun, peterx, Zhangfei Gao



> -----Original Message-----
> From: Yi Liu [mailto:yi.l.liu@intel.com]
> Sent: 14 April 2022 11:47
> To: alex.williamson@redhat.com; cohuck@redhat.com;
> qemu-devel@nongnu.org
> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger@redhat.com;
> eric.auger.pro@gmail.com; kevin.tian@intel.com; yi.l.liu@intel.com;
> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com
> Subject: [RFC 00/18] vfio: Adopt iommufd
> 
> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> Other devices like vdpa, vfio mdev and etc. are not considered yet.
> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be
> mentioned as iommufd backend in this series). The two backend types
> have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks like with both BEs.
> 
>                     VFIO
> AddressSpace/Memory
>     +-------+  +----------+  +-----+  +-----+
>     |  pci  |  | platform |  |  ap |  | ccw |
>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>         |           |           |        |        |   AddressSpace
> |
>         |           |           |        |        +------------+---------+
>     +---V-----------V-----------V--------V----+               /
>     |           VFIOAddressSpace              | <------------+
>     |                  |                      |  MemoryListener
>     |          VFIOContainer list             |
>     +-------+----------------------------+----+
>             |                            |
>             |                            |
>     +-------V------+            +--------V----------+
>     |   iommufd    |            |    vfio legacy    |
>     |  container   |            |     container     |
>     +-------+------+            +--------+----------+
>             |                            |
>             | /dev/iommu                 | /dev/vfio/vfio
>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>  Userspace  |                            |
> ============+============================+================================
>  Kernel     |  device fd                 |
>             +---------------+            | group/container fd
>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>             |  ATTACH_IOAS) |            | device fd
>             |               |            |
>             |       +-------V------------V-----------------+
>     iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
>  ioas_copy) |                 |                    | map/unmap
>             |                 |                    |
>      +------V------+    +-----V------+      +------V--------+
>      | iommfd core |    |  device    |      |  vfio iommu   |
>      +-------------+    +------------+      +---------------+
> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>               (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object, which acts as a base class
> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd-based container. The base class implements generic
> code, such as the memory-listener and address-space management, whereas
> the derived classes implement the callbacks that depend on the kernel
> uAPI being used.
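The base-class/derived-class split described above can be reduced to a small C sketch. This is illustrative only: names such as ContainerOps, legacy_map and iommufd_map are made up here, not the series' actual code.

```c
/* Hedged sketch: the "container base class with per-backend callbacks"
 * idea, reduced to a plain C vtable. Not the actual QEMU code. */
#include <assert.h>
#include <stddef.h>

typedef struct ContainerOps {
    /* backend-specific DMA map callback, selected per device */
    int (*dma_map)(void *container, unsigned long iova, size_t size);
} ContainerOps;

typedef struct Container {
    const ContainerOps *ops;   /* filled in by the derived "class" */
    int map_calls;             /* generic state shared by both BEs */
} Container;

static int legacy_map(void *c, unsigned long iova, size_t size)
{
    ((Container *)c)->map_calls++;   /* would issue VFIO_IOMMU_MAP_DMA */
    return 0;
}

static int iommufd_map(void *c, unsigned long iova, size_t size)
{
    ((Container *)c)->map_calls++;   /* would issue IOMMU_IOAS_MAP */
    return 0;
}

static const ContainerOps legacy_ops  = { .dma_map = legacy_map };
static const ContainerOps iommufd_ops = { .dma_map = iommufd_map };

/* generic code (e.g. the MemoryListener path) only sees the base type */
static int container_dma_map(Container *c, unsigned long iova, size_t size)
{
    return c->ops->dma_map(c, iova, size);
}
```

Generic code would only ever call container_dma_map(); which kernel interface is ultimately used depends on which ops table the derived container installed.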
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.
> 
> Tests done:
> - PCI and platform devices were tested
> - ccw and ap were only compile-tested
> - limited device hotplug test
> - vIOMMU test run for both legacy and iommufd backends (limited tests)
> 
> This series was co-developed by Eric Auger and me based on the exploration
> iommufd kernel[2]; the complete code of this series is available in [3]. As
> the iommufd kernel work is at an early stage (only the iommufd generic
> interface is on the mailing list), this series hasn't yet brought the
> iommufd backend fully on par with the legacy backend w.r.t. features like
> p2p mappings, coherency tracking, live migration, etc. Nor does this
> series support PCI devices without FLR, as the kernel doesn't support
> VFIO_DEVICE_PCI_HOT_RESET when userspace is using iommufd. The kernel
> needs to be updated to accept a device fd list for reset when userspace
> is using iommufd. Related work is in progress by Jason[4].
> 
> TODOs:
> - Add DMA alias check for iommufd BE (group level)
> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>   VFIO_DEVICE_PCI_HOT_RESET gap
> - Cleanup the VFIODevice fields as it's used in both BEs
> - Add locks
> - Replace list with g_tree
> - More tests
> 
> Patch Overview:
> 
> - Preparation:
>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
> 
> 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch
>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
> 
> - Introduce container object and convert existing vfio to use it:
>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
> 
> - Introduce iommufd based container:
>   0014-hw-iommufd-Creation.patch
>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
> 
> - Add backend selection for vfio-pci:
>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>   0018-vfio-pci-Add-an-iommufd-option.patch
> 
> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1

Hi,

I had a go with the above branches on our ARM64 platform, trying to pass through
a VF dev, but QEMU reports an error as below:

[    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)

I think this happens for the dev BAR addr range. I haven't debugged the kernel
yet to see where it actually reports that. 

Maybe I am missing something. Please let me know.

Thanks,
Shameer



^ permalink raw reply	[flat|nested] 125+ messages in thread


* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-22 14:58   ` Jason Gunthorpe
@ 2022-04-26  9:55       ` Yi Liu
  2022-04-26  9:55       ` Yi Liu
  1 sibling, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-26  9:55 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Peng, Chao P, Sun, Yi Y, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, nicolinc,
	eric.auger, eric.auger.pro, Tian, Kevin, Peng, Chao P, Sun, Yi Y,
	peterx

Hi Jason,

On 2022/4/22 22:58, Jason Gunthorpe wrote:
> On Thu, Apr 14, 2022 at 03:47:07AM -0700, Yi Liu wrote:
> 
>> +static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
>> +{
>> +    long int vfio_id = -1, ret = -ENOTTY;
>> +    char *path, *tmp = NULL;
>> +    DIR *dir;
>> +    struct dirent *dent;
>> +    struct stat st;
>> +    gchar *contents;
>> +    gsize length;
>> +    int major, minor;
>> +    dev_t vfio_devt;
>> +
>> +    path = g_strdup_printf("%s/vfio-device", sysfs_path);
>> +    if (stat(path, &st) < 0) {
>> +        error_setg_errno(errp, errno, "no such host device");
>> +        goto out;
>> +    }
>> +
>> +    dir = opendir(path);
>> +    if (!dir) {
>> +        error_setg_errno(errp, errno, "couldn't open directory %s", path);
>> +        goto out;
>> +    }
>> +
>> +    while ((dent = readdir(dir))) {
>> +        const char *end_name;
>> +
>> +        if (!strncmp(dent->d_name, "vfio", 4)) {
>> +            ret = qemu_strtol(dent->d_name + 4, &end_name, 10, &vfio_id);
>> +            if (ret) {
>> +                error_setg(errp, "suspicious vfio* file in %s", path);
>> +                goto out;
>> +            }
> 
> Userspace shouldn't explode if there are different files here down the
> road. Just search for the first match of vfio\d+ and there is no need
> to parse out the vfio_id from the string. Only fail if no match is
> found.
> 
>> +    tmp = g_strdup_printf("/dev/vfio/devices/vfio%ld", vfio_id);
>> +    if (stat(tmp, &st) < 0) {
>> +        error_setg_errno(errp, errno, "no such vfio device");
>> +        goto out;
>> +    }
> 
> And simply pass the string directly here, no need to parse out
> vfio_id.

got above suggestion.
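
A hedged sketch of the suggested matching (accept the first entry of the form vfio\d+ rather than parsing an id out of the name; matches_vfio_cdev() is an illustrative helper, not QEMU code):

```c
/* Sketch: return true only for names of the form "vfio" followed by
 * one or more digits; other files in the directory are simply skipped
 * instead of being treated as fatal. */
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

static bool matches_vfio_cdev(const char *name)
{
    const char *p = name;

    if (strncmp(p, "vfio", 4) != 0) {
        return false;
    }
    p += 4;
    if (*p == '\0') {
        return false;           /* need at least one digit */
    }
    for (; *p; p++) {
        if (!isdigit((unsigned char)*p)) {
            return false;       /* e.g. "vfio-foo" is skipped, not fatal */
        }
    }
    return true;
}
```

The readdir() loop would then take the first entry for which this returns true and fail only if no entry matches.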

> I also suggest falling back to using "/dev/char/%u:%u" if the above
> does not exist which prevents "vfio/devices/vfio" from turning into
> ABI.

do you mean the case where there is no matching file under /dev/vfio/devices/?
Is this possible?
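
If I understood the fallback correctly, it would build the path from the cdev's major:minor numbers rather than relying on the /dev/vfio/devices naming, roughly like this sketch (illustrative, not the series' code):

```c
/* Sketch of the "/dev/char/<major>:<minor>" fallback: derive the path
 * from the device node's dev_t (e.g. st_rdev from stat(), or the sysfs
 * "dev" attribute) so userspace does not depend on the
 * /dev/vfio/devices/vfioX naming becoming ABI. */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

static void vfio_cdev_fallback_path(dev_t devt, char *buf, size_t len)
{
    snprintf(buf, len, "/dev/char/%u:%u", major(devt), minor(devt));
}
```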

> 
> It would be a good idea to make a general open_cdev function that does
> all this work once the sysfs is found and cdev read out of it, all the
> other vfio places can use it too.

Hmmm, it's good to have a general open_cdev() function. But I guess this
is the only place in VFIO that opens the device cdev. Do you mean the vdpa
stuff?

>> +static int iommufd_attach_device(VFIODevice *vbasedev, AddressSpace *as,
>> +                                 Error **errp)
>> +{
>> +    VFIOContainer *bcontainer;
>> +    VFIOIOMMUFDContainer *container;
>> +    VFIOAddressSpace *space;
>> +    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>> +    int ret, devfd, iommufd;
>> +    uint32_t ioas_id;
>> +    Error *err = NULL;
>> +
>> +    devfd = vfio_get_devicefd(vbasedev->sysfsdev, errp);
>> +    if (devfd < 0) {
>> +        return devfd;
>> +    }
>> +    vbasedev->fd = devfd;
>> +
>> +    space = vfio_get_address_space(as);
>> +
>> +    /* try to attach to an existing container in this space */
>> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
>> +        if (!object_dynamic_cast(OBJECT(bcontainer),
>> +                                 TYPE_VFIO_IOMMUFD_CONTAINER)) {
>> +            continue;
>> +        }
>> +        container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
>> +        if (vfio_device_attach_container(vbasedev, container, &err)) {
>> +            const char *msg = error_get_pretty(err);
>> +
>> +            trace_vfio_iommufd_fail_attach_existing_container(msg);
>> +            error_free(err);
>> +            err = NULL;
>> +        } else {
>> +            ret = vfio_ram_block_discard_disable(true);
>> +            if (ret) {
>> +                vfio_device_detach_container(vbasedev, container, &err);
>> +                error_propagate(errp, err);
>> +                vfio_put_address_space(space);
>> +                close(vbasedev->fd);
>> +                error_prepend(errp,
>> +                              "Cannot set discarding of RAM broken (%d)", ret);
>> +                return ret;
>> +            }
>> +            goto out;
>> +        }
>> +    }
> 
> ?? this logic shouldn't be necessary, a single ioas always supports
> all devices, userspace should never need to juggle multiple ioas's
> unless it wants to have different address maps.

legacy vfio needs to allocate multiple containers in some cases. Say a
device is attached to a container and some iova ranges were mapped on this
container. Trying to attach another device to this container can then fail
due to conflicts between the mapped DMA ranges and the reserved iovas of
the other device. In such a case, legacy vfio chooses to create a new
container and attach the group to this new container. Hotplug is a typical
scenario.

I think the current iommufd also needs such a choice. The reserved iovas
and mapped iova areas are tracked in io_pagetable, and this structure is
per-IOAS. So if there is a conflict between the mapped iova areas of an
IOAS and the reserved iovas of a device that is going to be attached to
it, the attachment will fail. To keep working, QEMU needs to create
another IOAS and attach the device to the new IOAS instead.

struct io_pagetable {
          struct rw_semaphore domains_rwsem;
          struct xarray domains;
          unsigned int next_domain_id;

          struct rw_semaphore iova_rwsem;
          struct rb_root_cached area_itree;
          struct rb_root_cached reserved_iova_itree;
          unsigned long iova_alignment;
};

struct iommufd_ioas {
          struct iommufd_object obj;
          struct io_pagetable iopt;
          struct mutex mutex;
          struct list_head auto_domains;
};
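
The conflict described above amounts to an interval-overlap test; a hedged, self-contained illustration (the types here are simplified stand-ins, not the kernel's io_pagetable):

```c
/* Sketch: attaching a device to an IOAS must fail if any of the
 * device's reserved iova ranges intersects an already-mapped area,
 * which is what forces QEMU to fall back to creating a new IOAS
 * (mirroring the legacy multi-container behaviour). */
#include <assert.h>
#include <stdbool.h>

struct iova_range {
    unsigned long long start, last;   /* inclusive bounds */
};

static bool ranges_overlap(struct iova_range a, struct iova_range b)
{
    return a.start <= b.last && b.start <= a.last;
}

/* returns false if any reserved range conflicts with a mapped area */
static bool attach_would_succeed(const struct iova_range *mapped, int n_mapped,
                                 const struct iova_range *reserved, int n_resv)
{
    for (int i = 0; i < n_mapped; i++) {
        for (int j = 0; j < n_resv; j++) {
            if (ranges_overlap(mapped[i], reserved[j])) {
                return false;
            }
        }
    }
    return true;
}
```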

> Something I would like to see confirmed here in qemu is that qemu can
> track the hw pagetable id for each device it binds because we will
> need that later to do dirty tracking and other things.

we have tracked the hwpt_id. :-)

>> +    /*
>> +     * TODO: for now iommufd BE is on par with vfio iommu type1, so it's
>> +     * fine to add the whole range as window. For SPAPR, below code
>> +     * should be updated.
>> +     */
>> +    vfio_host_win_add(bcontainer, 0, (hwaddr)-1, 4096);
> 
> ? Not sure what this is, but I don't expect any changes for SPAPR
> someday IOMMU_IOAS_IOVA_RANGES should be able to accurately report its
> configuration.
> 
> I don't see IOMMU_IOAS_IOVA_RANGES called at all, that seems like a
> problem..
> 
> (and note that IOVA_RANGES changes with every device attached to the IOAS)
> 
> Jason

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread


* RE: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26  9:55       ` Yi Liu
@ 2022-04-26 10:41         ` Tian, Kevin
  -1 siblings, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2022-04-26 10:41 UTC (permalink / raw)
  To: Liu, Yi L, Jason Gunthorpe
  Cc: Peng, Chao P, Sun, Yi Y, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, nicolinc,
	eric.auger, eric.auger.pro, Peng, Chao P, Sun, Yi Y, peterx,
	Alex Williamson (alex.williamson@redhat.com)

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, April 26, 2022 5:55 PM
> On 2022/4/22 22:58, Jason Gunthorpe wrote:
> > On Thu, Apr 14, 2022 at 03:47:07AM -0700, Yi Liu wrote:
> >
> >> +
> >> +    /* try to attach to an existing container in this space */
> >> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
> >> +        if (!object_dynamic_cast(OBJECT(bcontainer),
> >> +                                 TYPE_VFIO_IOMMUFD_CONTAINER)) {
> >> +            continue;
> >> +        }
> >> +        container = container_of(bcontainer, VFIOIOMMUFDContainer, obj);
> >> +        if (vfio_device_attach_container(vbasedev, container, &err)) {
> >> +            const char *msg = error_get_pretty(err);
> >> +
> >> +            trace_vfio_iommufd_fail_attach_existing_container(msg);
> >> +            error_free(err);
> >> +            err = NULL;
> >> +        } else {
> >> +            ret = vfio_ram_block_discard_disable(true);
> >> +            if (ret) {
> >> +                vfio_device_detach_container(vbasedev, container, &err);
> >> +                error_propagate(errp, err);
> >> +                vfio_put_address_space(space);
> >> +                close(vbasedev->fd);
> >> +                error_prepend(errp,
> >> +                              "Cannot set discarding of RAM broken (%d)", ret);
> >> +                return ret;
> >> +            }
> >> +            goto out;
> >> +        }
> >> +    }
> >
> > ?? this logic shouldn't be necessary, a single ioas always supports
> > all devices, userspace should never need to juggle multiple ioas's
> > unless it wants to have different address maps.
> 
> legacy vfio container needs to allocate multiple containers in some cases.
> Say a device is attached to a container and some iova were mapped on this
> container. When trying to attach another device to this container, it will
> be failed in case of conflicts between the mapped DMA mappings and the
> reserved iovas of the another device. For such case, legacy vfio chooses to
> create a new container and attach the group to this new container. Hotlplug
> is a typical case of such scenario.
> 

Alex provided a clear rationale when we chatted with him on the
same topic. I simply copied it here instead of trying to further
translate: (Alex, please chime in if you want to add more words. 😊)

Q:
Why existing VFIOAddressSpace has a VFIOContainer list? is it because
one device with type1 and another with no_iommu?

A:
That's one case of incompatibility, but the IOMMU attach group callback
can fail in a variety of ways.  One that we've seen that is not
uncommon is that we might have an mdev container with various  mappings  
to other devices.  None of those mappings are validated until the mdev
driver tries to pin something, where it's generally unlikely that
they'd pin those particular mappings.  Then QEMU hot-adds a regular
IOMMU backed device, we allocate a domain for the device and replay the
mappings from the container, but now they get validated and potentially
fail.  The kernel returns a failure for the SET_IOMMU ioctl, QEMU
creates a new container and fills it from the same AddressSpace, where
now QEMU can determine which mappings can be safely skipped.  

Q:
I didn't get why some mappings are valid for one device while they can
be skipped for another device under the same address space. Can you
elaborate a bit? If the skipped mappings are redundant and won't
be used for dma, why does userspace request them in the first place? I'm
a bit lost here...

A: 
QEMU sets up a MemoryListener for the device AddressSpace and attempts
to map anything that triggers that listener, which includes not only VM
RAM which is our primary mapping goal, but also miscellaneous devices,
unaligned regions, and other device regions, ex. BARs.  Some of these
we filter out in QEMU with broad generalizations that unaligned ranges
aren't anything we can deal with, but other device regions covers
anything that's mmap'd in QEMU, ie. it has an associated KVM memory
slot.  IIRC, in the case I'm thinking of, the mapping that triggered
the replay failure was the BAR for an mdev device.  No attempt was made
to use gup or PFNMAP to resolve the mapping when only the mdev device
was present and the mdev host driver didn't attempt to pin pages within
its own BAR, but neither of these methods worked for the replay (I
don't recall further specifics). 

QEMU always attempts to create p2p mappings for devices, but this is a
case where we don't halt the VM if such a mapping cannot be created, so
a new container would replay the AddressSpace, see the fault, and skip
the region.

Q:
If there is conflict between reserved regions of a newly-plugged device
and existing mappings of VFIOAddressSpace, the device should simply
be rejected from attaching to the address space instead of creating 
another container under that address space.

A:
From a kernel perspective, yes, and that's what we do.  That doesn't
preclude the user from instantiating a new container and determining
for themselves whether the reserved region conflict is critical.  Note
that just because containers are in the same AddressSpace doesn't mean
that there aren't rules to allow certain mappings failures to be
non-fatal.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26  9:47   ` Shameerali Kolothum Thodi
@ 2022-04-26 11:44     ` Eric Auger
  -1 siblings, 0 replies; 125+ messages in thread
From: Eric Auger @ 2022-04-26 11:44 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Yi Liu, alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx, Zhangfei Gao

Hi Shameer,

On 4/26/22 11:47 AM, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Yi Liu [mailto:yi.l.liu@intel.com]
>> Sent: 14 April 2022 11:47
>> To: alex.williamson@redhat.com; cohuck@redhat.com;
>> qemu-devel@nongnu.org
>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger@redhat.com;
>> eric.auger.pro@gmail.com; kevin.tian@intel.com; yi.l.liu@intel.com;
>> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com
>> Subject: [RFC 00/18] vfio: Adopt iommufd
>>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vdpa, vfio mdev, etc. are not considered yet.
>>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent to the legacy iommu backend in this series), while the new
>> iommufd based container is named iommufd container (also referred to
>> as the iommufd backend in this series). The two backend types have their
>> own way to set up a secure context and a dma management interface. The
>> diagram below shows how it looks with both BEs.
>>
>>                     VFIO
>> AddressSpace/Memory
>>     +-------+  +----------+  +-----+  +-----+
>>     |  pci  |  | platform |  |  ap |  | ccw |
>>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>         |           |           |        |        |   AddressSpace
>> |
>>         |           |           |        |        +------------+---------+
>>     +---V-----------V-----------V--------V----+               /
>>     |           VFIOAddressSpace              | <------------+
>>     |                  |                      |  MemoryListener
>>     |          VFIOContainer list             |
>>     +-------+----------------------------+----+
>>             |                            |
>>             |                            |
>>     +-------V------+            +--------V----------+
>>     |   iommufd    |            |    vfio legacy    |
>>     |  container   |            |     container     |
>>     +-------+------+            +--------+----------+
>>             |                            |
>>             | /dev/iommu                 | /dev/vfio/vfio
>>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>  Userspace  |                            |
>>
>> ===========+============================+==========================
>> ======
>>  Kernel     |  device fd                 |
>>             +---------------+            | group/container fd
>>             | (BIND_IOMMUFD |            |
>> (SET_CONTAINER/SET_IOMMU)
>>             |  ATTACH_IOAS) |            | device fd
>>             |               |            |
>>             |       +-------V------------V-----------------+
>>     iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>  ioas_copy) |                 |                    | map/unmap
>>             |                 |                    |
>>      +------V------+    +-----V------+      +------V--------+
>>      | iommfd core |    |  device    |      |  vfio iommu   |
>>      +-------------+    +------------+      +---------------+
>>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>               (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                   (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>   *) iommufd BE: uses iommufd
>>   *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic
>> code
>> such as code related to memory_listener and address space management
>> whereas
>> the derived class implements callbacks that depend on the kernel user space
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
>>
>> Test done:
>> - PCI and Platform device were tested
>> - ccw and ap were only compile-tested
>> - limited device hotplug test
>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>
>> This series was co-developed by Eric Auger and me based on the exploration
>> iommufd kernel[2]; the complete code of this series is available in [3]. As
>> the iommufd kernel work is at an early stage (only the iommufd generic
>> interface is on the mailing list), this series hasn't brought the iommufd
>> backend fully on par with the legacy backend w.r.t. features like p2p
>> mappings, coherency tracking, live migration, etc. This series also doesn't
>> support PCI devices without FLR, as the kernel doesn't support
>> VFIO_DEVICE_PCI_HOT_RESET when userspace is using iommufd. The kernel needs
>> to be updated to accept a device fd list for reset when userspace is using
>> iommufd. Related work is in progress by Jason[4].
>>
>> TODOs:
>> - Add DMA alias check for iommufd BE (group level)
>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>   VFIO_DEVICE_PCI_HOT_RESET gap
>> - Cleanup the VFIODevice fields as it's used in both BEs
>> - Add locks
>> - Replace list with g_tree
>> - More tests
>>
>> Patch Overview:
>>
>> - Preparation:
>>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>
>> 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch
>>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>
>> - Introduce container object and convert existing vfio to use it:
>>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>
>> - Introduce iommufd based container:
>>   0014-hw-iommufd-Creation.patch
>>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>
>> - Add backend selection for vfio-pci:
>>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>   0018-vfio-pci-Add-an-iommufd-option.patch
>>
>> [1]
>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>> /
>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
> Hi,
>
> I had a go with the above branches on our ARM64 platform trying to pass-through
> a VF dev, but Qemu reports an error as below,
>
> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>
> I think this happens for the dev BAR addr range. I haven't debugged the kernel
> yet to see where it actually reports that. 
Does it prevent your assigned device from working? I have such errors
too but this is a known issue. It is due to the fact that P2P DMA is not
supported yet.

Thanks

Eric

>
> Maybe I am missing something. Please let me know.
>
> Thanks,
> Shameer
>


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26  8:37         ` Tian, Kevin
  (?)
@ 2022-04-26 12:33         ` Jason Gunthorpe
  -1 siblings, 0 replies; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 12:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Daniel P. Berrangé,
	Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, Sun, Yi Y, nicolinc, eric.auger.pro, david

On Tue, Apr 26, 2022 at 08:37:41AM +0000, Tian, Kevin wrote:

> Based on current plan there is probably a transition window between the
> point where the first vfio device type (vfio-pci) gaining iommufd support
> and the point where all vfio types supporting iommufd. 

I am still hoping to do all in one shot, lets see :) 

Jason

^ permalink raw reply	[flat|nested] 125+ messages in thread

* RE: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26 11:44     ` Eric Auger
@ 2022-04-26 12:43       ` Shameerali Kolothum Thodi via
  -1 siblings, 0 replies; 125+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-04-26 12:43 UTC (permalink / raw)
  To: eric.auger, Yi Liu, alex.williamson, cohuck, qemu-devel
  Cc: david, thuth, farman, mjrosato, akrowiak, pasic, jjherne,
	jasowang, kvm, jgg, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx, Zhangfei Gao



> -----Original Message-----
> From: Eric Auger [mailto:eric.auger@redhat.com]
> Sent: 26 April 2022 12:45
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com;
> qemu-devel@nongnu.org
> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
> Subject: Re: [RFC 00/18] vfio: Adopt iommufd

[...]
 
> >>
> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
> >> /
> >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
> > Hi,
> >
> > I had a go with the above branches on our ARM64 platform trying to
> pass-through
> > a VF dev, but Qemu reports an error as below,
> >
> > [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
> > qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
> > qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
> >
> > I think this happens for the dev BAR addr range. I haven't debugged the
> kernel
> > yet to see where it actually reports that.
> Does it prevent your assigned device from working? I have such errors
> too but this is a known issue. This is due to the fact P2P DMA is not
> supported yet.
> 

Yes, the basic tests are all good so far. I am still not very clear how it
works if the map() fails though. It looks like it fails in,

iommufd_ioas_map()
  iopt_map_user_pages()
   iopt_map_pages()
   ..
     pfn_reader_pin_pages()

So does it mean it just works because the page is resident()?

Thanks,
Shameer




^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26 10:41         ` Tian, Kevin
  (?)
@ 2022-04-26 13:41         ` Jason Gunthorpe
  2022-04-26 14:08             ` Yi Liu
  -1 siblings, 1 reply; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 13:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Peng, Chao P, Sun, Yi Y, qemu-devel, david, thuth,
	farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm,
	nicolinc, eric.auger, eric.auger.pro, peterx,
	Alex Williamson (alex.williamson@redhat.com)

On Tue, Apr 26, 2022 at 10:41:01AM +0000, Tian, Kevin wrote:

> That's one case of incompatibility, but the IOMMU attach group callback
> can fail in a variety of ways.  One that we've seen that is not
> uncommon is that we might have an mdev container with various  mappings  
> to other devices.  None of those mappings are validated until the mdev
> driver tries to pin something, where it's generally unlikely that
> they'd pin those particular mappings.  Then QEMU hot-adds a regular
> IOMMU backed device, we allocate a domain for the device and replay the
> mappings from the container, but now they get validated and potentially
> fail.  The kernel returns a failure for the SET_IOMMU ioctl, QEMU
> creates a new container and fills it from the same AddressSpace, where
> now QEMU can determine which mappings can be safely skipped.  

I think it is strange that the allowed DMA a guest can do depends on
the order in which devices are plugged into the guest, and varies from
device to device?

IMHO it would be nicer if qemu would be able to read the new reserved
regions and unmap the conflicts before hot plugging the new device. We
don't have a kernel API to do this, maybe we should have one?
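One piece of this already exists: the kernel publishes each group's reserved
regions via /sys/kernel/iommu_groups/<N>/reserved_regions, one
"start end type" line per region. A userspace-side conflict check along the
lines suggested above might look like this sketch (the mapping list and region
values below are made up for illustration):

```python
def parse_reserved_regions(text):
    """Parse the sysfs reserved_regions format: '0x<start> 0x<end> <type>' per line."""
    regions = []
    for line in text.strip().splitlines():
        start, end, rtype = line.split()
        regions.append((int(start, 16), int(end, 16), rtype))
    return regions

def conflicting_mappings(mappings, reserved):
    """Return the existing (iova, length) mappings that overlap any reserved region."""
    conflicts = []
    for iova, length in mappings:
        for start, end, _ in reserved:
            if iova <= end and iova + length - 1 >= start:
                conflicts.append((iova, length))
                break
    return conflicts

# Example: the x86 MSI window, as reported for many groups.
reserved = parse_reserved_regions("0x00000000fee00000 0x00000000feefffff msi\n")
mappings = [(0x00000000, 0x80000000), (0xfee00000, 0x1000)]
print(conflicting_mappings(mappings, reserved))  # only the MSI-overlapping mapping
```

With an API like the one proposed, userspace would feed the new device's
reserved list through such a check and unmap the conflicts before attach.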

> A: 
> QEMU sets up a MemoryListener for the device AddressSpace and attempts
> to map anything that triggers that listener, which includes not only VM
> RAM which is our primary mapping goal, but also miscellaneous devices,
> unaligned regions, and other device regions, ex. BARs.  Some of these
> we filter out in QEMU with broad generalizations that unaligned ranges
> aren't anything we can deal with, but other device regions covers
> anything that's mmap'd in QEMU, ie. it has an associated KVM memory
> slot.  IIRC, in the case I'm thinking of, the mapping that triggered
> the replay failure was the BAR for an mdev device.  No attempt was made
> to use gup or PFNMAP to resolve the mapping when only the mdev device
> was present and the mdev host driver didn't attempt to pin pages within
> its own BAR, but neither of these methods worked for the replay (I
> don't recall further specifics). 

This feels sort of like a bug in iommufd, or perhaps qemu..

With iommufd only normal GUP'able memory should be passed to
map. Special memory will have to go through some other API. This is
different from vfio containers.

We could possibly check the VMAs in iommufd during map to enforce
normal memory.. However I'm also a bit surprised that qemu can't ID
the underlying memory source and avoid this?

eg currently I see the log messages that it is passing P2P BAR memory
into iommufd map, this should be prevented inside qemu because it is
not reliable right now if iommufd will correctly reject it.
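QEMU's memory API can already distinguish these cases
(memory_region_is_ram() vs. memory_region_is_ram_device() for mmap'd device
BARs), so the qemu-side filter could look roughly like this; the dict-based
region descriptors below are stand-ins for QEMU's real MemoryRegion, used only
to illustrate the predicate:

```python
# Sketch: only map sections whose backing is normal guest RAM; skip mmap'd
# device BARs (the P2P case) and non-RAM MMIO, which are not GUP-able.
def should_map(region):
    return region["is_ram"] and not region["is_ram_device"]

def regions_to_map(regions):
    return [r["name"] for r in regions if should_map(r)]

regions = [
    {"name": "vm-ram",   "is_ram": True,  "is_ram_device": False},
    {"name": "vf-bar2",  "is_ram": True,  "is_ram_device": True},   # mmap'd BAR
    {"name": "pci-mmio", "is_ram": False, "is_ram_device": False},
]
print(regions_to_map(regions))  # ['vm-ram']
```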

IMHO multi-container should be avoided because it does force creating
multiple iommu_domains which does have a memory/performance cost.

Though, it is not so important that it is urgent (and copy makes it
work better anyhow), qemu can stay as it is.

Jason

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26  9:55       ` Yi Liu
  (?)
  (?)
@ 2022-04-26 13:53       ` Jason Gunthorpe
  -1 siblings, 0 replies; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 13:53 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Peng, Chao P, Sun, Yi Y, qemu-devel, david, thuth,
	farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm,
	nicolinc, eric.auger, eric.auger.pro, peterx

On Tue, Apr 26, 2022 at 05:55:29PM +0800, Yi Liu wrote:
> > I also suggest falling back to using "/dev/char/%u:%u" if the above
> > does not exist which prevents "vfio/devices/vfio" from turning into
> > ABI.
> 
> do you mean there is no matched file under /dev/vfio/devices/? Is this
> possible?

The layout of /dev/ depends on udev rules, so it is possible. I only
suggested it to avoid creating ABI here.
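A minimal sketch of such an open_cdev() helper, assuming the standard sysfs
'dev' attribute format ("major:minor"); the vfio device-node name and the
major/minor numbers are illustrative:

```python
import os

def cdev_candidates(sysfs_dev_attr_text, preferred_path):
    """Given a sysfs 'dev' attribute ("major:minor"), return the device-node
    paths to try: the named path first, then the udev-rule-independent
    /dev/char/<major>:<minor> fallback, so the named path never becomes ABI."""
    major, minor = (int(x) for x in sysfs_dev_attr_text.strip().split(":"))
    return [preferred_path, "/dev/char/%d:%d" % (major, minor)]

def open_cdev(sysfs_dev_attr_text, preferred_path):
    for path in cdev_candidates(sysfs_dev_attr_text, preferred_path):
        if os.path.exists(path):
            return os.open(path, os.O_RDWR)
    raise FileNotFoundError(preferred_path)

print(cdev_candidates("511:3\n", "/dev/vfio/devices/vfio0"))
# ['/dev/vfio/devices/vfio0', '/dev/char/511:3']
```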

> > It would be a good idea to make a general open_cdev function that does
> > all this work once the sysfs is found and cdev read out of it, all the
> > other vfio places can use it too.
> 
> hmmm, it's good to have a general open_cdev() function. But I guess this
> is the only place in VFIO to open the device cdev. Do you mean the vdpa
> stuff?

Any place that starts from a sysfs name would be interested - I don't
know what vdpa does

Jason

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26 13:41         ` Jason Gunthorpe
@ 2022-04-26 14:08             ` Yi Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-04-26 14:08 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Peng, Chao P, Sun, Yi Y, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, nicolinc,
	eric.auger, eric.auger.pro, peterx,
	Alex Williamson (alex.williamson@redhat.com)


On 2022/4/26 21:41, Jason Gunthorpe wrote:
> On Tue, Apr 26, 2022 at 10:41:01AM +0000, Tian, Kevin wrote:
> 
>> That's one case of incompatibility, but the IOMMU attach group callback
>> can fail in a variety of ways.  One that we've seen that is not
>> uncommon is that we might have an mdev container with various  mappings
>> to other devices.  None of those mappings are validated until the mdev
>> driver tries to pin something, where it's generally unlikely that
>> they'd pin those particular mappings.  Then QEMU hot-adds a regular
>> IOMMU backed device, we allocate a domain for the device and replay the
>> mappings from the container, but now they get validated and potentially
>> fail.  The kernel returns a failure for the SET_IOMMU ioctl, QEMU
>> creates a new container and fills it from the same AddressSpace, where
>> now QEMU can determine which mappings can be safely skipped.
> 
> I think it is strange that the allowed DMA a guest can do depends on
> the order how devices are plugged into the guest, and varys from
> device to device?
> 
> IMHO it would be nicer if qemu would be able to read the new reserved
> regions and unmap the conflicts before hot plugging the new device. We
> don't have a kernel API to do this, maybe we should have one?

For userspace drivers, it is fine to do so. For QEMU, it's not that easy
since the IOVA is the GPA, which is determined by the e820 table.

>> A:
>> QEMU sets up a MemoryListener for the device AddressSpace and attempts
>> to map anything that triggers that listener, which includes not only VM
>> RAM which is our primary mapping goal, but also miscellaneous devices,
>> unaligned regions, and other device regions, ex. BARs.  Some of these
>> we filter out in QEMU with broad generalizations that unaligned ranges
>> aren't anything we can deal with, but other device regions covers
>> anything that's mmap'd in QEMU, ie. it has an associated KVM memory
>> slot.  IIRC, in the case I'm thinking of, the mapping that triggered
>> the replay failure was the BAR for an mdev device.  No attempt was made
>> to use gup or PFNMAP to resolve the mapping when only the mdev device
>> was present and the mdev host driver didn't attempt to pin pages within
>> its own BAR, but neither of these methods worked for the replay (I
>> don't recall further specifics).
> 
> This feels sort of like a bug in iommufd, or perhaps qemu..
> 
> With iommufd only normal GUP'able memory should be passed to
> map. Special memory will have to go through some other API. This is
> different from vfio containers.
> 
> We could possibly check the VMAs in iommufd during map to enforce
> normal memory.. However I'm also a bit surprised that qemu can't ID
> the underlying memory source and avoid this?
> 
> eg currently I see the log messages that it is passing P2P BAR memory
> into iommufd map, this should be prevented inside qemu because it is
> not reliable right now if iommufd will correctly reject it.

Yeah, QEMU can filter the P2P BAR mapping and just stop it in QEMU. We
haven't added that since it is something you will add in the future, so we
didn't include it in this RFC. :-) Please let me know if you'd prefer to
filter it starting today.

> IMHO multi-container should be avoided because it does force creating
> multiple iommu_domains which does have a memory/performance cost.

Yes. For multi-hw_pgtable there is no choice, as it is mostly due to
compatibility. But for multi-container, it seems solvable if the kernel
and QEMU get some extra support like you mentioned. But I'd like to
echo the point below: it seems there may be other possible reasons for
the attach to fail.

>> That's one case of incompatibility, but the IOMMU attach group callback
>> can fail in a variety of ways.

> Though, it is not so important that it is urgent (and copy makes it
> work better anyhow), qemu can stay as it is.

Yes. As a start, keeping it would be simpler.

> Jason

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread


* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26 14:08             ` Yi Liu
  (?)
@ 2022-04-26 14:11             ` Jason Gunthorpe
  2022-04-26 18:45                 ` Alex Williamson
  -1 siblings, 1 reply; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 14:11 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Peng, Chao P, Sun, Yi Y, qemu-devel, david, thuth,
	farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm,
	nicolinc, eric.auger, eric.auger.pro, peterx,
	Alex Williamson (alex.williamson@redhat.com)

On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:

> > I think it is strange that the allowed DMA a guest can do depends on
> > the order in which devices are plugged into the guest, and varies from
> > device to device?
> > 
> > IMHO it would be nicer if qemu would be able to read the new reserved
> > regions and unmap the conflicts before hot plugging the new device. We
> > don't have a kernel API to do this, maybe we should have one?
> 
> For userspace drivers, it is fine to do so. For QEMU, it's not that easy
> since the IOVA is the GPA, which is determined by the e820 table.

Sure, that is why I said we may need a new API to get this data back
so userspace can fix the address map before attempting to attach the
new device. Currently that is not possible at all, the device attach
fails and userspace has no way to learn what addresses are causing
problems.

> > eg currently I see the log messages that it is passing P2P BAR memory
> > into iommufd map, this should be prevented inside qemu because it is
> > not reliable right now if iommufd will correctly reject it.
> 
> Yeah, QEMU can filter the P2P BAR mapping and just stop it in QEMU. We
> haven't added that since it is something you will add in the future, so we
> didn't include it in this RFC. :-) Please let me know if you'd prefer to
> filter it starting today.

I currently hope it will use a different map API entirely and not rely
on discovering the P2P via the VMA. eg using a DMABUF FD or something.

So blocking it in qemu feels like the right thing to do.

Jason

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26  8:37         ` Tian, Kevin
@ 2022-04-26 16:21           ` Alex Williamson
  -1 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-26 16:21 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Daniel P. Berrangé,
	Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, Sun, Yi Y, nicolinc, jgg, eric.auger.pro, david

On Tue, 26 Apr 2022 08:37:41 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, April 25, 2022 10:38 PM
> > 
> > On Mon, 25 Apr 2022 11:10:14 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >   
> > > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:  
> > > > [Cc +libvirt folks]
> > > >
> > > > On Thu, 14 Apr 2022 03:46:52 -0700
> > > > Yi Liu <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > With the introduction of iommufd[1], the linux kernel provides a  
> > generic  
> > > > > interface for userspace drivers to propagate their DMA mappings to  
> > kernel  
> > > > > for assigned devices. This series does the porting of the VFIO devices
> > > > > onto the /dev/iommu uapi and let it coexist with the legacy  
> > implementation.  
> > > > > Other devices like vpda, vfio mdev and etc. are not considered yet.  
> > >
> > > snip
> > >  
> > > > > The selection of the backend is made on a device basis using the new
> > > > > iommufd option (on/off/auto). By default the iommufd backend is  
> > selected  
> > > > > if supported by the host and by QEMU (iommufd KConfig). This option  
> > is  
> > > > > currently available only for the vfio-pci device. For other types of
> > > > > devices, it does not yet exist and the legacy BE is chosen by default.  
> > > >
> > > > I've discussed this a bit with Eric, but let me propose a different
> > > > command line interface.  Libvirt generally likes to pass file
> > > > descriptors to QEMU rather than grant it access to those files
> > > > directly.  This was problematic with vfio-pci because libvirt can't
> > > > easily know when QEMU will want to grab another /dev/vfio/vfio
> > > > container.  Therefore we abandoned this approach and instead libvirt
> > > > grants file permissions.
> > > >
> > > > However, with iommufd there's no reason that QEMU ever needs more  
> > than  
> > > > a single instance of /dev/iommufd and we're using per device vfio file
> > > > descriptors, so it seems like a good time to revisit this.  
> > >
> > > I assume access to '/dev/iommufd' gives the process somewhat elevated
> > > privileges, such that you don't want to unconditionally give QEMU
> > > access to this device ?  
> > 
> > It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
> > interface which should have limited scope for abuse, but more so here
> > the goal would be to de-privilege QEMU one step further so that it
> > cannot open the device file itself.
> >   
> > > > The interface I was considering would be to add an iommufd object to
> > > > QEMU, so we might have a:
> > > >
> > > > -device iommufd[,fd=#][,id=foo]
> > > >
> > > > For non-libivrt usage this would have the ability to open /dev/iommufd
> > > > itself if an fd is not provided.  This object could be shared with
> > > > other iommufd users in the VM and maybe we'd allow multiple instances
> > > > for more esoteric use cases.  [NB, maybe this should be a -object rather  
> > than  
> > > > -device since the iommufd is not a guest visible device?]  
> > >
> > > Yes,  -object would be the right answer for something that's purely
> > > a host side backend impl selector.
> > >  
> > > > The vfio-pci device might then become:
> > > >
> > > > -device vfio-  
> > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> > oo]  
> > > >
> > > > So essentially we can specify the device via host, sysfsdev, or passing
> > > > an fd to the vfio device file.  When an iommufd object is specified,
> > > > "foo" in the example above, each of those options would use the
> > > > vfio-device access mechanism, essentially the same as iommufd=on in
> > > > your example.  With the fd passing option, an iommufd object would be
> > > > required and necessarily use device level access.
> > > >
> > > > In your example, the iommufd=auto seems especially troublesome for
> > > > libvirt because QEMU is going to have different locked memory
> > > > requirements based on whether we're using type1 or iommufd, where  
> > the  
> > > > latter resolves the duplicate accounting issues.  libvirt needs to know  
> 
> Based on the current plan there will probably be a transition window between
> the point where the first vfio device type (vfio-pci) gains iommufd support
> and the point where all vfio types support iommufd. Libvirt can figure
> out whether to use iommufd by checking the presence of
> /dev/vfio/devices/vfioX. But what would be the resource limit policy
> in Libvirt in such a transition window, when both type1 and iommufd might
> be used? Or do we just expect Libvirt to support iommufd only after the
> transition window ends, to avoid handling such a mess?

Good point regarding libvirt testing for the vfio device files for use
with iommufd, so libvirt would test if /dev/iommufd exists and if the
device they want to assign maps to a /dev/vfio/devices/vfioX file.  This
was essentially implicit in the fd=# option to the vfio-pci device.

In mixed combinations, I'd expect libvirt to continue to add the full
VM memory to the locked memory limit for each non-iommufd device added.
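That accounting rule can be sketched as a toy model: each legacy (type1)
device pins the whole VM RAM through its own container, while all devices
sharing one iommufd object are accounted once per object. The function and its
parameters are illustrative, not a libvirt API, and it deliberately ignores
the extra per-device buffer mappings discussed further down the thread:

```python
def locked_memory_limit(vm_ram_bytes, n_legacy_devices, n_iommufd_objects):
    """Heuristic locked-memory limit: full VM RAM once per legacy vfio
    container, plus full VM RAM once per shared iommufd object."""
    return vm_ram_bytes * (n_legacy_devices + n_iommufd_objects)

GiB = 1 << 30
# 4 GiB guest, two legacy vfio-pci devices plus any number of devices
# sharing a single iommufd object:
print(locked_memory_limit(4 * GiB, n_legacy_devices=2, n_iommufd_objects=1) // GiB)  # 12
```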

> > > > deterministically which backed is being used, which this proposal seems
> > > > to provide, while at the same time bringing us more in line with fd
> > > > passing.  Thoughts?  Thanks,  
> > >
> > > Yep, I agree that libvirt needs to have more direct control over this.
> > > This is also even more important if there are notable feature differences
> > > in the 2 backends.
> > >
> > > I wonder if anyone has considered an even more distinct impl, whereby
> > > we have a completely different device type on the backend, eg
> > >
> > >   -device vfio-iommu-  
> > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> > oo]  
> > >
> > > If a vendor wants to fully remove the legacy impl, they can then use the
> > > Kconfig mechanism to disable the build of the legacy impl device, while
> > > keeping the iommu impl (or vica-verca if the new iommu impl isn't  
> > considered  
> > > reliable enough for them to support yet).
> > >
> > > Libvirt would use
> > >
> > >    -object iommu,id=iommu0,fd=NNN
> > >    -device vfio-iommu-pci,fd=MMM,iommu=iommu0
> > >
> > > Non-libvirt would use a simpler
> > >
> > >    -device vfio-iommu-pci,host=0000:03:22.1
> > >
> > > with QEMU auto-creating a 'iommu' object in the background.
> > >
> > > This would fit into libvirt's existing modelling better. We currently have
> > > a concept of a PCI assignment backend, which previously supported the
> > > legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
> > > feels like a 3rd PCI assignment approach, and so fits with how we modelled
> > > it as a different device type in the past.  
> > 
> > I don't think we want to conflate "iommu" and "iommufd", we're creating
> > an object that interfaces into the iommufd uAPI, not an iommu itself.
> > Likewise "vfio-iommu-pci" is just confusing, there was an iommu
> > interface previously, it's just a different implementation now and as
> > far as the VM interface to the device, it's identical.  Note that a
> > "vfio-iommufd-pci" device multiplies the matrix of every vfio device
> > for a rather subtle implementation detail.
> > 
> > My expectation would be that libvirt uses:
> > 
> >  -object iommufd,id=iommufd0,fd=NNN
> >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > 
> > Whereas simple QEMU command line would be:
> > 
> >  -object iommufd,id=iommufd0
> >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > 
> > The iommufd object would open /dev/iommufd itself.  Creating an
> > implicit iommufd object is somewhat problematic because one of the
> > things I forgot to highlight in my previous description is that the
> > iommufd object is meant to be shared across not only various vfio
> > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > vdpa.  
> 
> Out of curiosity - in concept one iommufd is sufficient to support all
> ioas requirements across subsystems, while having multiple iommufd's
> instead loses the benefit of centralized accounting. The latter will also
> cause some trouble when we start virtualizing ENQCMD, which requires
> VM-wide PASID virtualization and thus further needs to share that
> information across iommufd's. Not unsolvable, but there is really no gain
> in adding such complexity. So I'm curious whether Qemu provides a way to
> restrict certain object types to a single instance, to discourage such
> multi-iommufd attempts?

I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
philosophy seems to be to let users create whatever configuration they
want.  For libvirt though, the assumption would be that a single
iommufd object can be used across subsystems, so libvirt would never
automatically create multiple objects.

We also need to be able to advise libvirt as to how each iommufd object
or user of that object factors into the VM locked memory requirement.
When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
to set the locked memory limit to the size of VM RAM per iommufd,
regardless of the number of devices using a given iommufd.  However, I
don't know if all users of iommufd will be exclusively mapping VM RAM.
Combinations of devices where some map VM RAM and others map QEMU
buffer space could still require some incremental increase per device
(I'm not sure if vfio-nvme is such a device).  It seems like heuristics
will still be involved even after iommufd solves the per-device
vfio-pci locked memory limit issue.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
@ 2022-04-26 16:21           ` Alex Williamson
  0 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-26 16:21 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: akrowiak, jjherne, thuth, Peng, Chao P, Daniel P. Berrangé,
	jgg, kvm, libvir-list, jasowang, cohuck, qemu-devel, peterx,
	pasic, eric.auger, Sun, Yi Y, Liu, Yi L, nicolinc, Laine Stump,
	david, eric.auger.pro

On Tue, 26 Apr 2022 08:37:41 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, April 25, 2022 10:38 PM
> > 
> > On Mon, 25 Apr 2022 11:10:14 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >   
> > > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:  
> > > > [Cc +libvirt folks]
> > > >
> > > > On Thu, 14 Apr 2022 03:46:52 -0700
> > > > Yi Liu <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > With the introduction of iommufd[1], the linux kernel provides a  
> > generic  
> > > > > interface for userspace drivers to propagate their DMA mappings to  
> > kernel  
> > > > > for assigned devices. This series does the porting of the VFIO devices
> > > > > onto the /dev/iommu uapi and let it coexist with the legacy  
> > implementation.  
> > > > > Other devices like vpda, vfio mdev and etc. are not considered yet.  
> > >
> > > snip
> > >  
> > > > > The selection of the backend is made on a device basis using the new
> > > > > iommufd option (on/off/auto). By default the iommufd backend is  
> > selected  
> > > > > if supported by the host and by QEMU (iommufd KConfig). This option  
> > is  
> > > > > currently available only for the vfio-pci device. For other types of
> > > > > devices, it does not yet exist and the legacy BE is chosen by default.  
> > > >
> > > > I've discussed this a bit with Eric, but let me propose a different
> > > > command line interface.  Libvirt generally likes to pass file
> > > > descriptors to QEMU rather than grant it access to those files
> > > > directly.  This was problematic with vfio-pci because libvirt can't
> > > > easily know when QEMU will want to grab another /dev/vfio/vfio
> > > > container.  Therefore we abandoned this approach and instead libvirt
> > > > grants file permissions.
> > > >
> > > > However, with iommufd there's no reason that QEMU ever needs more  
> > than  
> > > > a single instance of /dev/iommufd and we're using per device vfio file
> > > > descriptors, so it seems like a good time to revisit this.  
> > >
> > > I assume access to '/dev/iommufd' gives the process somewhat elevated
> > > privileges, such that you don't want to unconditionally give QEMU
> > > access to this device ?  
> > 
> > It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
> > interface which should have limited scope for abuse, but more so here
> > the goal would be to de-privilege QEMU that one step further that it
> > cannot open the device file itself.
> >   
> > > > The interface I was considering would be to add an iommufd object to
> > > > QEMU, so we might have a:
> > > >
> > > > -device iommufd[,fd=#][,id=foo]
> > > >
> > > > For non-libivrt usage this would have the ability to open /dev/iommufd
> > > > itself if an fd is not provided.  This object could be shared with
> > > > other iommufd users in the VM and maybe we'd allow multiple instances
> > > > for more esoteric use cases.  [NB, maybe this should be a -object rather  
> > than  
> > > > -device since the iommufd is not a guest visible device?]  
> > >
> > > Yes,  -object would be the right answer for something that's purely
> > > a host side backend impl selector.
> > >  
> > > > The vfio-pci device might then become:
> > > >
> > > > -device vfio-  
> > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> > oo]  
> > > >
> > > > So essentially we can specify the device via host, sysfsdev, or passing
> > > > an fd to the vfio device file.  When an iommufd object is specified,
> > > > "foo" in the example above, each of those options would use the
> > > > vfio-device access mechanism, essentially the same as iommufd=on in
> > > > your example.  With the fd passing option, an iommufd object would be
> > > > required and necessarily use device level access.
> > > >
> > > > In your example, the iommufd=auto seems especially troublesome for
> > > > libvirt because QEMU is going to have different locked memory
> > > > requirements based on whether we're using type1 or iommufd, where
> > > > the latter resolves the duplicate accounting issues.  libvirt needs to know
> 
> Based on the current plan there is probably a transition window between
> the point where the first vfio device type (vfio-pci) gains iommufd
> support and the point where all vfio types support iommufd. Libvirt can
> figure out whether to use iommufd by checking for the presence of
> /dev/vfio/devices/vfioX. But what would be the resource limit policy
> in Libvirt in such a transition window, when both type1 and iommufd
> might be used? Or do we just expect Libvirt to support iommufd only
> after the transition window ends, to avoid handling such a mess?

Good point regarding libvirt testing for the vfio device files for use
with iommufd, so libvirt would test if /dev/iommufd exists and if the
device they want to assign maps to a /dev/vfio/devices/vfioX file.  This
was essentially implicit in the fd=# option to the vfio-pci device.

In mixed combinations, I'd expect libvirt to continue to add the full
VM memory to the locked memory limit for each non-iommufd device added.

> > > > deterministically which backend is being used, which this proposal seems
> > > > to provide, while at the same time bringing us more in line with fd
> > > > passing.  Thoughts?  Thanks,  
> > >
> > > Yep, I agree that libvirt needs to have more direct control over this.
> > > This is also even more important if there are notable feature differences
> > > in the 2 backends.
> > >
> > > I wonder if anyone has considered an even more distinct impl, whereby
> > > we have a completely different device type on the backend, eg
> > >
> > >   -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> > >
> > > If a vendor wants to fully remove the legacy impl, they can then use the
> > > Kconfig mechanism to disable the build of the legacy impl device, while
> > > keeping the iommu impl (or vice versa if the new iommu impl isn't
> > > considered reliable enough for them to support yet).
> > >
> > > Libvirt would use
> > >
> > >    -object iommu,id=iommu0,fd=NNN
> > >    -device vfio-iommu-pci,fd=MMM,iommu=iommu0
> > >
> > > Non-libvirt would use a simpler
> > >
> > >    -device vfio-iommu-pci,host=0000:03:22.1
> > >
> > > with QEMU auto-creating a 'iommu' object in the background.
> > >
> > > This would fit into libvirt's existing modelling better. We currently have
> > > a concept of a PCI assignment backend, which previously supported the
> > > legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
> > > feels like a 3rd PCI assignment approach, and so fits with how we modelled
> > > it as a different device type in the past.  
> > 
> > I don't think we want to conflate "iommu" and "iommufd", we're creating
> > an object that interfaces into the iommufd uAPI, not an iommu itself.
> > Likewise "vfio-iommu-pci" is just confusing, there was an iommu
> > interface previously, it's just a different implementation now and as
> > far as the VM interface to the device, it's identical.  Note that a
> > "vfio-iommufd-pci" device multiplies the matrix of every vfio device
> > for a rather subtle implementation detail.
> > 
> > My expectation would be that libvirt uses:
> > 
> >  -object iommufd,id=iommufd0,fd=NNN
> >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > 
> > Whereas simple QEMU command line would be:
> > 
> >  -object iommufd,id=iommufd0
> >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > 
> > The iommufd object would open /dev/iommufd itself.  Creating an
> > implicit iommufd object is somewhat problematic because one of the
> > things I forgot to highlight in my previous description is that the
> > iommufd object is meant to be shared across not only various vfio
> > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > vdpa.  
> 
> Out of curiosity - in concept one iommufd is sufficient to support all
> ioas requirements across subsystems, while having multiple iommufd's
> instead loses the benefit of centralized accounting. The latter will
> also cause some trouble when we start virtualizing ENQCMD, which
> requires VM-wide PASID virtualization and thus further needs to share
> that information across iommufd's. Not unsolvable, but really no gain
> from adding such complexity. So I'm curious whether Qemu provides a way
> to restrict certain object types to a single instance, to discourage
> such a multi-iommufd attempt?

I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
philosophy seems to be to let users create whatever configuration they
want.  For libvirt though, the assumption would be that a single
iommufd object can be used across subsystems, so libvirt would never
automatically create multiple objects.

We also need to be able to advise libvirt as to how each iommufd object
or user of that object factors into the VM locked memory requirement.
When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
to set the locked memory limit to the size of VM RAM per iommufd,
regardless of the number of devices using a given iommufd.  However, I
don't know if all users of iommufd will be exclusively mapping VM RAM.
Combinations of devices where some map VM RAM and others map QEMU
buffer space could still require some incremental increase per device
(I'm not sure if vfio-nvme is such a device).  It seems like heuristics
will still be involved even after iommufd solves the per-device
vfio-pci locked memory limit issue.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26 12:43       ` Shameerali Kolothum Thodi via
@ 2022-04-26 16:35         ` Alex Williamson
  -1 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-26 16:35 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: eric.auger, Yi Liu, cohuck, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, jgg, nicolinc,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx,
	Zhangfei Gao

On Tue, 26 Apr 2022 12:43:35 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Eric Auger [mailto:eric.auger@redhat.com]
> > Sent: 26 April 2022 12:45
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
> > Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com;
> > qemu-devel@nongnu.org
> > Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
> > mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
> > jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
> > jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
> > kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
> > peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
> > Subject: Re: [RFC 00/18] vfio: Adopt iommufd  
> 
> [...]
>  
> > >>  
> > https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com  
> > >> /
> > >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> > >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1  
> > > Hi,
> > >
> > > I had a go with the above branches on our ARM64 platform trying to  
> > pass-through  
> > > a VF dev, but Qemu reports an error as below,
> > >
> > > [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
> > > qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
> > > qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,  
> > 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)  
> > >
> > > I think this happens for the dev BAR addr range. I haven't debugged the  
> > kernel  
> > > yet to see where it actually reports that.  
> > Does it prevent your assigned device from working? I have such errors
> > too but this is a known issue. This is due to the fact P2P DMA is not
> > supported yet.
> >   
> 
> Yes, the basic tests are all good so far. I am still not very clear how it works if
> the map() fails though. It looks like it fails in,
> 
> iommufd_ioas_map()
>   iopt_map_user_pages()
>    iopt_map_pages()
>    ..
>      pfn_reader_pin_pages()
> 
> So does it mean it just works because the page is resident()?

No, it just means that you're not triggering any accesses that require
peer-to-peer DMA support.  Any sort of test where the device is only
performing DMA to guest RAM, which is by far the standard use case,
will work fine.  This also doesn't affect vCPU access to BAR space.
It's only a failure of the mappings of the BAR space into the IOAS,
which is only used when a device tries to directly target another
device's BAR space via DMA.  Thanks,

Alex



* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26 16:21           ` Alex Williamson
  (?)
@ 2022-04-26 16:42           ` Jason Gunthorpe
  2022-04-26 19:24               ` Alex Williamson
  -1 siblings, 1 reply; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 16:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Daniel P. Berrangé,
	Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, Sun, Yi Y, nicolinc, eric.auger.pro, david

On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> We also need to be able to advise libvirt as to how each iommufd object
> or user of that object factors into the VM locked memory requirement.
> When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> to set the locked memory limit to the size of VM RAM per iommufd,
> regardless of the number of devices using a given iommufd.  However, I
> don't know if all users of iommufd will be exclusively mapping VM RAM.
> Combinations of devices where some map VM RAM and others map QEMU
> buffer space could still require some incremental increase per device
> (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> will still be involved even after iommufd solves the per-device
> vfio-pci locked memory limit issue.  Thanks,

If the model is to pass the FD, how about we put a limit on the FD
itself instead of abusing the locked memory limit?

We could have a no-way-out ioctl that directly limits the # of PFNs
covered by iopt_pages inside an iommufd.

Jason


* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26 14:11             ` Jason Gunthorpe
@ 2022-04-26 18:45                 ` Alex Williamson
  0 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-26 18:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Tian, Kevin, Peng, Chao P, Sun, Yi Y, qemu-devel, david,
	thuth, farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm,
	nicolinc, eric.auger, eric.auger.pro, peterx

On Tue, 26 Apr 2022 11:11:56 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:
> 
> > > I think it is strange that the allowed DMA a guest can do depends on
> > > the order in which devices are plugged into the guest, and varies
> > > from device to device?
> > > 
> > > IMHO it would be nicer if qemu would be able to read the new reserved
> > > regions and unmap the conflicts before hot plugging the new device. We
> > > don't have a kernel API to do this, maybe we should have one?  
> > 
> > For userspace drivers, it is fine to do it. For QEMU, it's not quite easy
> > since the IOVA is the GPA, which is determined by the e820 table.
> 
> Sure, that is why I said we may need a new API to get this data back
> so userspace can fix the address map before attempting to attach the
> new device. Currently that is not possible at all, the device attach
> fails and userspace has no way to learn what addresses are causing
> problems.

We have APIs to get the IOVA ranges, both with legacy vfio and the
iommufd RFC, QEMU could compare these, but deciding to remove an
existing mapping is not something to be done lightly.  We must be
absolutely certain that there is no DMA to that range before doing so.
 
> > > eg currently I see the log messages that it is passing P2P BAR memory
> > > into iommufd map, this should be prevented inside qemu because it is
> > > not reliable right now if iommufd will correctly reject it.  
> > 
> > Yeah, qemu can filter the P2P BAR mapping and just stop it in qemu. We
> > haven't added it, as it is something you will add in the future, so we
> > didn't add it in this RFC. :-) Please let me know if it feels better to
> > filter it starting today.
> 
> I currently hope it will use a different map API entirely and not rely
> on discovering the P2P via the VMA. eg using a DMABUF FD or something.
> 
> So blocking it in qemu feels like the right thing to do.

Wait a sec, so legacy vfio supports p2p between devices, which has at
least a couple of known use cases, primarily involving GPUs for at least
one of the peers, and we're not going to make equivalent support a
feature requirement for iommufd?  This would entirely fracture the
notion that iommufd is a direct replacement and upgrade from legacy
vfio and make a transparent transition for libvirt managed VMs
impossible.  Let's reconsider.  Thanks,

Alex



* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26 16:42           ` Jason Gunthorpe
@ 2022-04-26 19:24               ` Alex Williamson
  0 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-26 19:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Daniel P. Berrangé,
	Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, Sun, Yi Y, nicolinc, eric.auger.pro, david

On Tue, 26 Apr 2022 13:42:17 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> > We also need to be able to advise libvirt as to how each iommufd object
> > or user of that object factors into the VM locked memory requirement.
> > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> > to set the locked memory limit to the size of VM RAM per iommufd,
> > regardless of the number of devices using a given iommufd.  However, I
> > don't know if all users of iommufd will be exclusively mapping VM RAM.
> > Combinations of devices where some map VM RAM and others map QEMU
> > buffer space could still require some incremental increase per device
> > (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> > will still be involved even after iommufd solves the per-device
> > vfio-pci locked memory limit issue.  Thanks,  
> 
> If the model is to pass the FD, how about we put a limit on the FD
> itself instead of abusing the locked memory limit?
> 
> We could have a no-way-out ioctl that directly limits the # of PFNs
> covered by iopt_pages inside an iommufd.

FD passing would likely only be the standard for libvirt-invoked VMs.
The QEMU vfio-pci device would still parse a host= or sysfsdev= option
when invoked by mortals, and would choose between the legacy vfio group
interface and the new vfio device interface based on whether an iommufd
is specified.

Does that rule out your suggestion?  I don't know, please reveal more
about the mechanics of putting a limit on the FD itself and this
no-way-out ioctl.  The latter name suggests to me that I should also
note that we need to support memory hotplug with these devices.  Thanks,

Alex



* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26 18:45                 ` Alex Williamson
  (?)
@ 2022-04-26 19:27                 ` Jason Gunthorpe
  2022-04-26 20:59                     ` Alex Williamson
  -1 siblings, 1 reply; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 19:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, Tian, Kevin, Peng, Chao P, Sun, Yi Y, qemu-devel, david,
	thuth, farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm,
	nicolinc, eric.auger, eric.auger.pro, peterx

On Tue, Apr 26, 2022 at 12:45:41PM -0600, Alex Williamson wrote:
> On Tue, 26 Apr 2022 11:11:56 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:
> > 
> > > > I think it is strange that the allowed DMA a guest can do depends on
> > > > the order in which devices are plugged into the guest, and varies
> > > > from device to device?
> > > > 
> > > > IMHO it would be nicer if qemu would be able to read the new reserved
> > > > regions and unmap the conflicts before hot plugging the new device. We
> > > > don't have a kernel API to do this, maybe we should have one?  
> > > 
> > > For userspace drivers, it is fine to do it. For QEMU, it's not quite easy
> > > since the IOVA is the GPA, which is determined by the e820 table.
> > 
> > Sure, that is why I said we may need a new API to get this data back
> > so userspace can fix the address map before attempting to attach the
> > new device. Currently that is not possible at all, the device attach
> > fails and userspace has no way to learn what addresses are causing
> > problems.
> 
> We have APIs to get the IOVA ranges, both with legacy vfio and the
> iommufd RFC, QEMU could compare these, but deciding to remove an
> existing mapping is not something to be done lightly. 

Not quite, you can get the IOVA ranges after you attach the device,
but device attach will fail if the new range restrictions intersect
with the existing mappings. So we don't have an easy way to learn the
new range restriction in a way that lets userspace ensure an attach
will not fail due to reserved ranges overlapping with mappings.

The best you could do is make a dummy IOAS then attach the device,
read the mappings, detach, and then do your unmaps.

I'm imagining something like IOMMUFD_DEVICE_GET_RANGES that can be
called prior to attaching on the device ID.
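
In rough C-style pseudocode, the dummy-IOAS workaround versus the proposed
pre-attach query might look like this (every name below is illustrative or
hypothetical, not settled uAPI):

```
/* Workaround available today: probe reserved ranges with a throwaway IOAS.
 * All ioctl names here are illustrative, not final uAPI. */
dummy_ioas = ioctl(iommufd, IOMMU_IOAS_ALLOC, ...);
ioctl(iommufd, /* attach */, { device_id, dummy_ioas });   /* succeeds: no mappings yet */
ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, { dummy_ioas, &ranges });
ioctl(iommufd, /* detach */, { device_id });
/* compare &ranges against the real IOAS's mappings, unmap any conflicts,
 * then attach the device to the real IOAS */

/* Proposed alternative: query reserved ranges by device ID before any
 * attach, avoiding the throwaway IOAS entirely. */
ioctl(iommufd, IOMMUFD_DEVICE_GET_RANGES, { device_id, &ranges });
```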

> We must be absolutely certain that there is no DMA to that range
> before doing so.

Yes, but at the same time if the VM thinks it can DMA to that memory
then it is quite likely to DMA to it with the new device that doesn't
have it mapped in the first place.

It is also a bit odd that the behavior depends on the order in which
the devices are installed: if you plug the narrower device first then
the next device will happily use the narrower ranges, but the reverse
order will get a different result.

This is why I find it a bit strange that qemu doesn't check the
ranges. eg I would expect that anything declared as memory in the E820
map has to be mappable to the iommu_domain or the device should not
attach at all.

The P2P is a bit trickier, and I know we don't have a good story
because we lack ACPI description, but I would have expected the same
kind of thing. Anything P2Pable should be in the iommu_domain or the
device should not attach. As with system memory there are only certain
parts of the E820 map that an OS would use for P2P.

(ideally ACPI would indicate exactly what combinations of devices are
P2Pable and then qemu would use that drive the mandatory address
ranges in the IOAS)

> > > Yeah, qemu can filter the P2P BAR mapping and just stop it in qemu. We
> > > haven't added it, as it is something you will add in the future, so we
> > > didn't add it in this RFC. :-) Please let me know if it feels better to
> > > filter it starting today.
> > 
> > I currently hope it will use a different map API entirely and not rely
> > on discovering the P2P via the VMA. eg using a DMABUF FD or something.
> > 
> > So blocking it in qemu feels like the right thing to do.
> 
> Wait a sec, so legacy vfio supports p2p between devices, which has at
> least a couple of known use cases, primarily involving GPUs for at least
> one of the peers, and we're not going to make equivalent support a
> feature requirement for iommufd?  

I said "different map API" - something like IOMMU_FD_MAP_DMABUF
perhaps.

The trouble with taking in a user pointer to MMIO memory is that it
becomes quite annoying to go from a VMA back to the actual owner
object so we can establish proper refcounting and lifetime of struct-page-less
memory. Requiring userspace to make that connection via a FD
simplifies and generalizes this.

So, qemu would say 'oh this memory is exported by VFIO, I will do
VFIO_EXPORT_DMA_BUF, then do IOMMU_FD_MAP_DMABUF, then close the FD'

For vfio_compat we'd have to build some hacky compat approach to
discover the dmabuf for vfio-pci from the VMA.

But if qemu is going this way with a new implementation I would prefer
the new implementation use the new way, when we decide what it should
be.

As I mentioned before I would like to use DMABUF since I already have
a use-case to expose DMABUF from vfio-pci to connect to RDMA. I will
post the vfio DMABUF patch I have already.

Jason


* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26 19:24               ` Alex Williamson
  (?)
@ 2022-04-26 19:36               ` Jason Gunthorpe
  -1 siblings, 0 replies; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 19:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Daniel P. Berrangé,
	Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, Sun, Yi Y, nicolinc, eric.auger.pro, david

On Tue, Apr 26, 2022 at 01:24:35PM -0600, Alex Williamson wrote:
> On Tue, 26 Apr 2022 13:42:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> > > We also need to be able to advise libvirt as to how each iommufd object
> > > or user of that object factors into the VM locked memory requirement.
> > > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> > > to set the locked memory limit to the size of VM RAM per iommufd,
> > > regardless of the number of devices using a given iommufd.  However, I
> > > don't know if all users of iommufd will be exclusively mapping VM RAM.
> > > Combinations of devices where some map VM RAM and others map QEMU
> > > buffer space could still require some incremental increase per device
> > > (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> > > will still be involved even after iommufd solves the per-device
> > > vfio-pci locked memory limit issue.  Thanks,  
> > 
> > If the model is to pass the FD, how about we put a limit on the FD
> > itself instead of abusing the locked memory limit?
> > 
> > We could have a no-way-out ioctl that directly limits the # of PFNs
> > covered by iopt_pages inside an iommufd.
> 
> FD passing would likely only be the standard for libvirt-invoked VMs.
> The QEMU vfio-pci device would still parse a host= or sysfsdev= option
> when invoked by mortals, and would choose between the legacy vfio group
> interface and the new vfio device interface based on whether an iommufd
> is specified.

Yes, but perhaps we don't need resource limits in the mortals case.

> Does that rule out your suggestion?  I don't know, please reveal more
> about the mechanics of putting a limit on the FD itself and this
> no-way-out ioctl.  The latter name suggests to me that I should also
> note that we need to support memory hotplug with these devices.  Thanks,

So libvirt uses CAP_SYS_RESOURCE and prlimit to adjust things in
realtime today?

It could still work; instead of being strictly no-way-out, iommufd would
have to check for CAP_SYS_RESOURCE to allow raising the limit.

It is a pretty simple idea, we just attach a resource limit to the FD
and every PFN that gets mapped into the iommufd counts against that
limit, regardless if it is pinned or not. An ioctl on the FD would set
the limit, defaulting to unlimited.
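
The accounting implied by the paragraph above can be sketched as
follows. All names (struct iommufd_limit and the helpers) are invented
for illustration; this is not a real kernel interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Hypothetical per-FD state: a limit set once by an ioctl (defaulting
 * to unlimited) and a running count of mapped PFNs, charged whether or
 * not the pages are pinned. */
struct iommufd_limit {
    uint64_t limit_pfns;
    uint64_t mapped_pfns;
};

static uint64_t length_to_pfns(uint64_t length)
{
    return (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* Called on every map: fail if the FD's address-space budget would be
 * exceeded, otherwise charge the PFNs against it. */
static bool charge_map(struct iommufd_limit *l, uint64_t length)
{
    uint64_t npfns = length_to_pfns(length);

    if (l->mapped_pfns + npfns > l->limit_pfns)
        return false;
    l->mapped_pfns += npfns;
    return true;
}

static void uncharge_unmap(struct iommufd_limit *l, uint64_t length)
{
    l->mapped_pfns -= length_to_pfns(length);
}
```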

To me this has the appeal that what is being resource-controlled is
strictly defined - address space mapped into an iommufd - which has a
bunch of nice additional consequences like partially bounding the
amount of kernel memory an iommufd can consume and so forth.

It doesn't interact with io_uring or RDMA, however.

Though we could certainly consider allowing RDMA to consume an iommufd
to access pinned pages much like a vfio-mdev would - I'm not sure what
is ideal for the qemu usage of RDMA for migration.

Jason

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26 19:27                 ` Jason Gunthorpe
@ 2022-04-26 20:59                     ` Alex Williamson
  0 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-26 20:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yi Liu, Tian, Kevin, Peng, Chao P, Sun, Yi Y, qemu-devel, david,
	thuth, farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm,
	nicolinc, eric.auger, eric.auger.pro, peterx

On Tue, 26 Apr 2022 16:27:03 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 26, 2022 at 12:45:41PM -0600, Alex Williamson wrote:
> > On Tue, 26 Apr 2022 11:11:56 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:
> > >   
> > > > > I think it is strange that the allowed DMA a guest can do depends on
> > > > > the order in which devices are plugged into the guest, and varies from
> > > > > device to device?
> > > > > 
> > > > > IMHO it would be nicer if qemu would be able to read the new reserved
> > > > > regions and unmap the conflicts before hot plugging the new device. We
> > > > > don't have a kernel API to do this, maybe we should have one?    
> > > > 
> > > > For userspace drivers, it is fine to do it. For QEMU, it's not quite easy
> > > > since the IOVA is the GPA, which is determined by the e820 table.    
> > > 
> > > Sure, that is why I said we may need a new API to get this data back
> > > so userspace can fix the address map before attempting to attach the
> > > new device. Currently that is not possible at all, the device attach
> > > fails and userspace has no way to learn what addresses are causing
> > > problems.  
> > 
> > We have APIs to get the IOVA ranges, both with legacy vfio and the
> > iommufd RFC, QEMU could compare these, but deciding to remove an
> > existing mapping is not something to be done lightly.   
> 
> Not quite, you can get the IOVA ranges after you attach the device,
> but device attach will fail if the new range restrictions intersect
> with the existing mappings. So we don't have an easy way to learn the
> new range restriction in a way that lets userspace ensure an attach
> will not fail due to reserved ranges overlapping with mappings.
> 
> The best you could do is make a dummy IOAS then attach the device,
> read the mappings, detach, and then do your unmaps.

Right, the same thing the kernel does currently.

> I'm imagining something like IOMMUFD_DEVICE_GET_RANGES that can be
> called prior to attaching on the device ID.

Something like /sys/kernel/iommu_groups/$GROUP/reserved_regions?

> > We must be absolutely certain that there is no DMA to that range
> > before doing so.  
> 
> Yes, but at the same time if the VM thinks it can DMA to that memory
> then it is quite likely to DMA to it with the new device that doesn't
> have it mapped in the first place.

Sorry, this assertion doesn't make sense to me.  We can't assume a
vIOMMU on x86, so QEMU typically maps the entire VM address space (ie.
device address space == system memory).  Some of those mappings are
likely DMA targets (RAM), but only a tiny fraction of the address space
may actually be used for DMA.  Some of those mappings are exceedingly
unlikely P2P DMA targets (device memory), so we don't consider mapping
failures to be fatal to attaching the device.

If we have a case where a range failed for one device but worked for a
previous, we're in the latter scenario, because we should have failed
the device attach otherwise.  Your assertion would require that there
are existing devices (plural) making use of this mapping and that the
new device is also likely to make use of this mapping.  I have a hard
time believing that evidence exists to support that statement.
 
> It is also a bit odd that the behavior depends on the order the
> devices are installed as if you plug the narrower device first then
> the next device will happily use the narrower ranges, but vice versa
> will get a different result.

P2P use cases are sufficiently rare that this hasn't been an issue.  I
think there's also still a sufficiently healthy dose of FUD about
whether a system supports P2P that drivers do some validation before
relying on it.
 
> This is why I find it a bit strange that qemu doesn't check the
> ranges. eg I would expect that anything declared as memory in the E820
> map has to be mappable to the iommu_domain or the device should not
> attach at all.

You have some interesting assumptions around associating
MemoryRegionSegments from the device AddressSpace to something like an
x86-specific E820 table.  The currently used rule of thumb is that if
we think it's memory, mapping failure is fatal to the device, otherwise
it's not.  If we want each device to have the most complete mapping
possible, then we'd use a container per device, but that implies a lot
of extra overhead.  Instead we try to attach the device to an existing
container within the address space and assume if it was good enough
there, it's good enough here.

> The P2P is a bit trickier, and I know we don't have a good story
> because we lack ACPI description, but I would have expected the same
> kind of thing. Anything P2Pable should be in the iommu_domain or the
> device should not attach. As with system memory there are only certain
> parts of the E820 map that an OS would use for P2P.
> 
> (ideally ACPI would indicate exactly what combinations of devices are
> P2Pable and then qemu would use that to drive the mandatory address
> ranges in the IOAS)

How exactly does ACPI indicate that devices can do P2P?  How can we
rely on ACPI for a problem that's not unique to platforms that
implement ACPI?

> > > > yeah. qemu can filter the P2P BAR mapping and just stop it in qemu. We
> > > > haven't added it, as it is something you will add in the future, so we
> > > > didn't add it in this RFC. :-) Please let me know if it feels better to filter
> > > > it from today.    
> > > 
> > > I currently hope it will use a different map API entirely and not rely
> > > on discovering the P2P via the VMA. eg using a DMABUF FD or something.
> > > 
> > > So blocking it in qemu feels like the right thing to do.  
> > 
> > Wait a sec, so legacy vfio supports p2p between devices, which has at
> > least a couple known use cases, primarily involving GPUs for at least
> > one of the peers, and we're not going to make equivalent support a
> > feature requirement for iommufd?    
> 
> I said "different map API" - something like IOMMU_FD_MAP_DMABUF
> perhaps.

For future support, yes, but your last sentence above states to
outright block it for now, which would be a visible feature regression
vs legacy vfio.

> The trouble with taking in a user pointer to MMIO memory is that it
> becomes quite annoying to go from a VMA back to the actual owner
> object so we can establish proper refcounting and lifetime of struct-page-less
> memory. Requiring userspace to make that connection via a FD
> simplifies and generalizes this.
> 
> So, qemu would say 'oh this memory is exported by VFIO, I will do
> VFIO_EXPORT_DMA_BUF, then do IOMMU_FD_MAP_DMABUF, then close the FD'
> 
> For vfio_compat we'd have to build some hacky compat approach to
> discover the dmabuf for vfio-pci from the VMA.
> 
> But if qemu is going this way with a new implementation I would prefer
> the new implementation use the new way, when we decide what it should
> be.
> 
> As I mentioned before I would like to use DMABUF since I already have
> a use-case to expose DMABUF from vfio-pci to connect to RDMA. I will
> post the vfio DMABUF patch I have already.

I'm not suggesting there aren't issues with P2P mappings, we all know
that legacy vfio has various issues currently.  I'm only stating that
there are use cases for it and if we cannot support those use cases
then we can't do a transparent switch to iommufd when it's available.
Switching would depend not only on kernel/QEMU support, but the
necessary features for the VM, where we have no means to
programmatically determine the latter.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
  2022-04-26 20:59                     ` Alex Williamson
  (?)
@ 2022-04-26 23:08                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 125+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 23:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, Tian, Kevin, Peng, Chao P, Sun, Yi Y, qemu-devel, david,
	thuth, farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm,
	nicolinc, eric.auger, eric.auger.pro, peterx

On Tue, Apr 26, 2022 at 02:59:31PM -0600, Alex Williamson wrote:

> > The best you could do is make a dummy IOAS then attach the device,
> > read the mappings, detach, and then do your unmaps.
> 
> Right, the same thing the kernel does currently.
> 
> > I'm imagining something like IOMMUFD_DEVICE_GET_RANGES that can be
> > called prior to attaching on the device ID.
> 
> Something like /sys/kernel/iommu_groups/$GROUP/reserved_regions?

If we do the above ioctl with iommufd I would want to include the domain
aperture too, but yes.
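
For reference, each line of that sysfs file has the form
"0x<start> 0x<end> <type>", with an inclusive end address. Whatever
interface ultimately exposes the ranges, userspace would need roughly
this overlap check against its existing IOVA mappings before attempting
an attach (a sketch, not QEMU code):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One line of .../reserved_regions: "0x<start> 0x<end> <type>",
 * end address inclusive. */
struct resv_region {
    uint64_t start;
    uint64_t end;
};

static bool parse_resv_line(const char *line, struct resv_region *r)
{
    /* %x conversions accept the leading "0x" prefix. */
    return sscanf(line, "%" SCNx64 " %" SCNx64, &r->start, &r->end) == 2;
}

/* Does an existing mapping [iova, iova + len) intersect the reserved
 * region?  If yes, the attach would fail (or the mapping would have to
 * be removed first, which is not to be done lightly). */
static bool mapping_conflicts(const struct resv_region *r,
                              uint64_t iova, uint64_t len)
{
    return iova <= r->end && iova + len - 1 >= r->start;
}
```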

> > > We must be absolutely certain that there is no DMA to that range
> > > before doing so.  
> > 
> > Yes, but at the same time if the VM thinks it can DMA to that memory
> > then it is quite likely to DMA to it with the new device that doesn't
> > have it mapped in the first place.
> 
> Sorry, this assertion doesn't make sense to me.  We can't assume a
> vIOMMU on x86, so QEMU typically maps the entire VM address space (ie.
> device address space == system memory).  Some of those mappings are
> likely DMA targets (RAM), but only a tiny fraction of the address space
> may actually be used for DMA.  Some of those mappings are exceedingly
> unlikely P2P DMA targets (device memory), so we don't consider mapping
> failures to be fatal to attaching the device.

> If we have a case where a range failed for one device but worked for a
> previous, we're in the latter scenario, because we should have failed
> the device attach otherwise.  Your assertion would require that there
> are existing devices (plural) making use of this mapping and that the
> new device is also likely to make use of this mapping.  I have a hard
> time believing that evidence exists to support that statement.

This is quite normal; we often have multiple NICs and GPUs in the same
system/VM and the expectation is that P2P between the MMIO regions of
all the NICs and all the GPUs will work. Hotplugging in a NIC or GPU
and having it be excluded from P2P maps would be fatal to the VM.

So, while I think it is vanishingly unlikely that a reserved region
conflict would cause a problem, my preference is that this stuff is
deterministic. Either hotplug fails, or hotplug configures the device to
the same state it would have if the VM had been started with this
configuration.

Perhaps this just suggests that qemu should be told by the operator
what kind of P2P to export from a device, 'never/auto/always', with
auto being today's behavior.
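
The 'never/auto/always' knob suggested here reduces to a small,
deterministic policy. A sketch of the decision logic, with invented
names rather than a real QEMU option:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device operator knob for exporting P2P. */
enum p2p_policy {
    P2P_NEVER,   /* do not map MMIO BARs into the IOAS at all */
    P2P_AUTO,    /* today's behavior: try to map, failure non-fatal */
    P2P_ALWAYS,  /* a failed MMIO mapping fails the device attach */
};

static bool should_map_mmio(enum p2p_policy p)
{
    return p != P2P_NEVER;
}

static bool mmio_map_failure_fatal(enum p2p_policy p)
{
    return p == P2P_ALWAYS;
}
```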

> P2P use cases are sufficiently rare that this hasn't been an issue.  I
> think there's also still a sufficient healthy dose of FUD whether a
> system supports P2P that drivers do some validation before relying on
> it.

I'm not sure what you mean here; the P2P capability discovery is a
complete mess and never did get standardized. Linux has the
expectation that drivers will use pci_p2pdma_distance() before doing
P2P which weeds out only some of the worst non-working cases.

> > This is why I find it a bit strange that qemu doesn't check the
> > ranges. eg I would expect that anything declared as memory in the E820
> > map has to be mappable to the iommu_domain or the device should not
> > attach at all.
> 
> You have some interesting assumptions around associating
> MemoryRegionSegments from the device AddressSpace to something like an
> x86-specific E820 table.  

I'm thinking about it from an OS perspective in the VM, not from qemu
internals. OSes do not randomly DMA everywhere; the firmware tables etc.
do make it predictable where DMA will happen.

> > The P2P is a bit trickier, and I know we don't have a good story
> > because we lack ACPI description, but I would have expected the same
> > kind of thing. Anything P2Pable should be in the iommu_domain or the
> > device should not attach. As with system memory there are only certain
> > parts of the E820 map that an OS would use for P2P.
> > 
> > (ideally ACPI would indicate exactly what combinations of devices are
> > P2Pable and then qemu would use that to drive the mandatory address
> > ranges in the IOAS)
> 
> How exactly does ACPI indicate that devices can do P2P?  How can we
> rely on ACPI for a problem that's not unique to platforms that
> implement ACPI?

I am trying to say this never did get standardized. It was talked about
when pci_p2pdma_distance() was merged, and I thought some folks
were going to go off and take care of an ACPI query for it to use. It
would be useful here at least.
 
> > > > > yeah. qemu can filter the P2P BAR mapping and just stop it in qemu. We
> > > > > haven't added it, as it is something you will add in the future, so we
> > > > > didn't add it in this RFC. :-) Please let me know if it feels better to filter
> > > > > it from today.    
> > > > 
> > > > I currently hope it will use a different map API entirely and not rely
> > > > on discovering the P2P via the VMA. eg using a DMABUF FD or something.
> > > > 
> > > > So blocking it in qemu feels like the right thing to do.  
> > > 
> > > Wait a sec, so legacy vfio supports p2p between devices, which has at
> > > least a couple known use cases, primarily involving GPUs for at least
> > > one of the peers, and we're not going to make equivalent support a
> > > feature requirement for iommufd?    
> > 
> > I said "different map API" - something like IOMMU_FD_MAP_DMABUF
> > perhaps.
> 
> For future support, yes, but your last sentence above states to
> outright block it for now, which would be a visible feature regression
> vs legacy vfio.

I'm not sure I understand. Today iommufd does not support MMIO vmas in
IOMMUFD_MAP, and if we do the DMABUF stuff, it never will. So the
correct thing is to block it in qemu and when we decide exactly the
correct interface we will update qemu to use it. Surely this would be
completed before we declare iommufd "ready". Hopefully this happens
not long after we merge the basic iommufd kernel stuff.

> that legacy vfio has various issues currently.  I'm only stating that
> there are use cases for it and if we cannot support those use cases
> then we can't do a transparent switch to iommufd when it's
> available.

P2P is very important to me, I will get it supported, but I can't
tackle every problem at once.

If we can't agree on a secure implementation after a lot of trying
then we can implement follow_pfn like VFIO did.

> Switching would depend not only on kernel/QEMU support, but the
> necessary features for the VM, where we have no means to
> programmatically determine the latter.  Thanks,

I'm not sure what "features for the VM" means?

Jason

^ permalink raw reply	[flat|nested] 125+ messages in thread

* RE: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26 16:21           ` Alex Williamson
@ 2022-04-28  3:21             ` Tian, Kevin
  -1 siblings, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2022-04-28  3:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrangé,
	Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, Sun, Yi Y, nicolinc, jgg, eric.auger.pro, david

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 27, 2022 12:22 AM
> > >
> > > My expectation would be that libvirt uses:
> > >
> > >  -object iommufd,id=iommufd0,fd=NNN
> > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > >
> > > Whereas simple QEMU command line would be:
> > >
> > >  -object iommufd,id=iommufd0
> > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > >
> > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > implicit iommufd object is somewhat problematic because one of the
> > > things I forgot to highlight in my previous description is that the
> > > iommufd object is meant to be shared across not only various vfio
> > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > vdpa.
> >
> > Out of curiosity - in concept one iommufd is sufficient to support all
> > ioas requirements across subsystems while having multiple iommufd's
> > instead loses the benefit of centralized accounting. The latter will also
> > cause some trouble when we start virtualizing ENQCMD, which requires
> > VM-wide PASID virtualization and thus further needs to share that
> > information across iommufd's. Not unsolvable, but there is really no gain
> > from adding such complexity. So I'm curious whether Qemu provides
> > a way to restrict a certain object type to a single instance
> > to discourage such a multi-iommufd attempt?
> 
> I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> philosophy seems to be to let users create whatever configuration they
> want.  For libvirt though, the assumption would be that a single
> iommufd object can be used across subsystems, so libvirt would never
> automatically create multiple objects.

I like the flexibility that the object approach gives in your proposal.
But with the aforementioned complexity in mind (and no foreseen benefit), I wonder
whether an alternative approach which treats iommufd as a global
property instead of an object is acceptable in Qemu, i.e.:

-iommufd on/off
-device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]

All devices with iommufd specified then implicitly share a single iommufd
object within Qemu.

This still allows vfio devices to be specified via fd, but just requires libvirt
to grant file permission on /dev/iommu. Is that a worthwhile tradeoff to be
considered, or does it simply not fit the Qemu philosophy, e.g. that any object
associated with a device must be explicitly specified?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-28  3:21             ` Tian, Kevin
@ 2022-04-28 14:24               ` Alex Williamson
  -1 siblings, 0 replies; 125+ messages in thread
From: Alex Williamson @ 2022-04-28 14:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Daniel P. Berrangé,
	Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, Sun, Yi Y, nicolinc, jgg, eric.auger.pro, david

On Thu, 28 Apr 2022 03:21:45 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, April 27, 2022 12:22 AM  
> > > >
> > > > My expectation would be that libvirt uses:
> > > >
> > > >  -object iommufd,id=iommufd0,fd=NNN
> > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > >
> > > > Whereas simple QEMU command line would be:
> > > >
> > > >  -object iommufd,id=iommufd0
> > > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > > >
> > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > implicit iommufd object is somewhat problematic because one of the
> > > > things I forgot to highlight in my previous description is that the
> > > > iommufd object is meant to be shared across not only various vfio
> > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > > vdpa.  
> > >
> > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > ioas requirements across subsystems while having multiple iommufd's
> > > instead loses the benefit of centralized accounting. The latter will also
> > > cause some trouble when we start virtualizing ENQCMD, which requires
> > > VM-wide PASID virtualization and thus further needs to share that
> > > information across iommufd's. Not unsolvable, but there is really no gain
> > > from adding such complexity. So I'm curious whether Qemu provides
> > > a way to restrict a certain object type to a single instance
> > > to discourage such a multi-iommufd attempt?  
> > 
> > I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> > philosophy seems to be to let users create whatever configuration they
> > want.  For libvirt though, the assumption would be that a single
> > iommufd object can be used across subsystems, so libvirt would never
> > automatically create multiple objects.  
> 
> I like the flexibility that the object approach gives in your proposal.
> But with the aforementioned complexity in mind (and no foreseen benefit), I wonder

What's the actual complexity?  Front-end/backend splits are very common
in QEMU.  We're making the object connection via name; why is it
significantly more complicated to allow multiple iommufd objects?  On
the contrary, it seems to me that we'd need to go out of our way to add
code to block multiple iommufd objects.

> whether an alternative approach which treats iommufd as a global
> property instead of an object is acceptable in Qemu, i.e.:
> 
> -iommufd on/off
> -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
> 
> All devices with iommufd specified then implicitly share a single iommufd
> object within Qemu.

QEMU requires key-value pairs AFAIK, so the above doesn't work; then
we're just back to iommufd=on/off.
 
> This still allows vfio devices to be specified via fd but just requires Libvirt
> to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
> considered, or just not a typical way in Qemu philosophy, e.g. that any object
> associated with a device must be explicitly specified?

Avoiding QEMU opening files was a significant focus of my alternate
proposal.  Also note that we must be able to support hotplug, so we
need to be able to dynamically add and remove the iommufd object; I
don't see how a global property allows for that.  Implicit
associations of devices to shared resources don't seem particularly
desirable to me.  Thanks,

Alex
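
[Editorial note: for illustration, the hotplug flow Alex describes could look
roughly like this over QMP. This is a sketch only: `object-add`, `object-del`,
`device_add` and `device_del` are existing QMP commands, but the `iommufd`
QOM type and the `iommufd=` device property are the proposal under
discussion, not a committed interface.]

```
-> { "execute": "object-add",
     "arguments": { "qom-type": "iommufd", "id": "iommufd0" } }
-> { "execute": "device_add",
     "arguments": { "driver": "vfio-pci", "id": "vfio0",
                    "host": "0000:02:00.0", "iommufd": "iommufd0" } }
   ... later, tear down in reverse order ...
-> { "execute": "device_del", "arguments": { "id": "vfio0" } }
-> { "execute": "object-del", "arguments": { "id": "iommufd0" } }
```

A global property has no lifecycle of its own, so there is nothing analogous
to the `object-add`/`object-del` pair to create or destroy the backend at
hotplug time.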



* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-28 14:24               ` Alex Williamson
@ 2022-04-28 16:20                 ` Daniel P. Berrangé
  -1 siblings, 0 replies; 125+ messages in thread
From: Daniel P. Berrangé @ 2022-04-28 16:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm,
	Laine Stump, libvir-list, jasowang, cohuck, thuth, peterx,
	qemu-devel, pasic, eric.auger, Sun, Yi Y, nicolinc, jgg,
	eric.auger.pro, david

On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote:
> On Thu, 28 Apr 2022 03:21:45 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, April 27, 2022 12:22 AM  
> > > > >
> > > > > My expectation would be that libvirt uses:
> > > > >
> > > > >  -object iommufd,id=iommufd0,fd=NNN
> > > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > > >
> > > > > Whereas simple QEMU command line would be:
> > > > >
> > > > >  -object iommufd,id=iommufd0
> > > > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > > > >
> > > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > > implicit iommufd object is somewhat problematic because one of the
> > > > > things I forgot to highlight in my previous description is that the
> > > > > iommufd object is meant to be shared across not only various vfio
> > > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > > > vdpa.  
> > > >
> > > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > > ioas requirements across subsystems while having multiple iommufd's
> > > > instead lose the benefit of centralized accounting. The latter will also
> > > > cause some trouble when we start virtualizing ENQCMD which requires
> > > > VM-wide PASID virtualization thus further needs to share that
> > > > information across iommufd's. Not unsolvable but really no gain by
> > > > adding such complexity. So I'm curious whether Qemu provides
> > > > a way to restrict a certain object type to a single instance,
> > > > to discourage such multi-iommufd attempts?  
> > > 
> > > I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> > > philosophy seems to be to let users create whatever configuration they
> > > want.  For libvirt though, the assumption would be that a single
> > > iommufd object can be used across subsystems, so libvirt would never
> > > automatically create multiple objects.  
> > 
> > I like the flexibility that the object approach gives in your proposal.
> > But with the said complexity in mind (with no foreseen benefit), I wonder
> 
> What's the actual complexity?  Front-end/backend splits are very common
> in QEMU.  We're making the object connection via name, why is it
> significantly more complicated to allow multiple iommufd objects?  On
> the contrary, it seems to me that we'd need to go out of our way to add
> code to block multiple iommufd objects.
> 
> > whether an alternative approach which treats iommufd as a global
> > property instead of an object is acceptable in Qemu, i.e.:
> > 
> > -iommufd on/off
> > -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
> > 
> > All devices with iommufd specified then implicitly share a single iommufd
> > object within Qemu.
> 
> QEMU requires key-value pairs AFAIK, so the above doesn't work, then
> we're just back to the iommufd=on/off.
>  
> > This still allows vfio devices to be specified via fd but just requires Libvirt
> > to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
> > considered or just not a typical way in Qemu philosophy e.g. any object
> > associated with a device must be explicitly specified?
> 
> Avoiding QEMU opening files was a significant focus of my alternate
> proposal.  Also note that we must be able to support hotplug, so we
> need to be able to dynamically add and remove the iommufd object, I
> don't see that a global property allows for that.  Implicit
> associations of devices to shared resources doesn't seem particularly
> desirable to me.  Thanks,

Adding new global properties/options is rather an anti-pattern for QEMU
these days. Using -object is the right approach. If you only want to
allow for one of them, just document this requirement. We've got other
objects which are singletons, like the confidential guest classes
for each arch.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* RE: [RFC 00/18] vfio: Adopt iommufd
  2022-04-28 16:20                 ` Daniel P. Berrangé
@ 2022-04-29  0:45                   ` Tian, Kevin
  -1 siblings, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2022-04-29  0:45 UTC (permalink / raw)
  To: Daniel P. Berrangé, Alex Williamson
  Cc: Liu, Yi L, akrowiak, jjherne, Peng, Chao P, kvm, Laine Stump,
	libvir-list, jasowang, cohuck, thuth, peterx, qemu-devel, pasic,
	eric.auger, Sun, Yi Y, nicolinc, jgg, eric.auger.pro, david

> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Friday, April 29, 2022 12:20 AM
> 
> On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote:
> > On Thu, 28 Apr 2022 03:21:45 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, April 27, 2022 12:22 AM
> > > > > >
> > > > > > My expectation would be that libvirt uses:
> > > > > >
> > > > > >  -object iommufd,id=iommufd0,fd=NNN
> > > > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > > > >
> > > > > > Whereas simple QEMU command line would be:
> > > > > >
> > > > > >  -object iommufd,id=iommufd0
> > > > > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > > > > >
> > > > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > > > implicit iommufd object is somewhat problematic because one of the
> > > > > > things I forgot to highlight in my previous description is that the
> > > > > > iommufd object is meant to be shared across not only various vfio
> > > > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems,
> ex.
> > > > > > vdpa.
> > > > >
> > > > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > > > ioas requirements across subsystems while having multiple
> iommufd's
> > > > > instead lose the benefit of centralized accounting. The latter will also
> > > > > cause some trouble when we start virtualizing ENQCMD which
> requires
> > > > > VM-wide PASID virtualization thus further needs to share that
> > > > > information across iommufd's. Not unsolvable but really no gain by
> > > > > adding such complexity. So I'm curious whether Qemu provides
> > > > > a way to restrict a certain object type to a single instance,
> > > > > to discourage such multi-iommufd attempts?
> > > >
> > > > I don't see any reason for QEMU to restrict iommufd objects.  The
> QEMU
> > > > philosophy seems to be to let users create whatever configuration they
> > > > want.  For libvirt though, the assumption would be that a single
> > > > iommufd object can be used across subsystems, so libvirt would never
> > > > automatically create multiple objects.
> > >
> > > I like the flexibility that the object approach gives in your proposal.
> > > But with the said complexity in mind (with no foreseen benefit), I wonder
> >
> > What's the actual complexity?  Front-end/backend splits are very common
> > in QEMU.  We're making the object connection via name, why is it
> > significantly more complicated to allow multiple iommufd objects?  On
> > the contrary, it seems to me that we'd need to go out of our way to add
> > code to block multiple iommufd objects.

Probably it's just a hypothetical concern I had when thinking about the need
to manage certain global information (e.g. PASID virtualization) across
iommufd's down the road. With your and Daniel's replies I think we'll
first try to follow the common practice in Qemu, given there are
more positive reasons to do so than the hypothetical concern itself.

> >
> > > whether an alternative approach which treats iommufd as a global
> > > property instead of an object is acceptable in Qemu, i.e.:
> > >
> > > -iommufd on/off
> > > -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
> > >
> > > All devices with iommufd specified then implicitly share a single iommufd
> > > object within Qemu.
> >
> > QEMU requires key-value pairs AFAIK, so the above doesn't work, then
> > we're just back to the iommufd=on/off.
> >
> > > This still allows vfio devices to be specified via fd but just requires Libvirt
> > > to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
> > > considered or just not a typical way in Qemu philosophy e.g. any object
> > > associated with a device must be explicitly specified?
> >
> > Avoiding QEMU opening files was a significant focus of my alternate
> > proposal.  Also note that we must be able to support hotplug, so we
> > need to be able to dynamically add and remove the iommufd object, I
> > don't see that a global property allows for that.  Implicit
> > associations of devices to shared resources doesn't seem particularly
> > desirable to me.  Thanks,
> 
> Adding new global properties/options is rather an anti-pattern for QEMU
> these days. Using -object is the right approach. If you only want to
> allow for one of them, just document this requirement. We've got other
> objects which are singletons like all the confidential guest classes
> for each arch.
> 

Good to know about such a last resort. As said, we'll try to avoid this
restriction and follow Alex's proposal unless unexpectedly unreasonable
complexities arise later.

Thanks
Kevin


* Re: [RFC 07/18] vfio: Add base object for VFIOContainer
  2022-04-14 10:46   ` Yi Liu
@ 2022-04-29  6:29     ` David Gibson
  -1 siblings, 0 replies; 125+ messages in thread
From: David Gibson @ 2022-04-29  6:29 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, cohuck, qemu-devel, thuth, farman, mjrosato,
	akrowiak, pasic, jjherne, jasowang, kvm, jgg, nicolinc,
	eric.auger, eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun,
	peterx

On Thu, Apr 14, 2022 at 03:46:59AM -0700, Yi Liu wrote:
> Qomify the VFIOContainer object which acts as a base class for a
> container. This base class is derived into the legacy VFIO container
> and later on, into the new iommufd based container.

You certainly need the abstraction, but I'm not sure QOM is the right
way to accomplish it in this case.  The QOM class of things is visible
to the user/config layer via QMP (and sometimes command line).  It
doesn't necessarily correspond to guest visible differences, but it
often does.

AIUI, the idea here is that the back end in use should be an
implementation detail which doesn't affect the interfaces outside the
vfio subsystem itself.  If that's the case, QOM may not be a great
fit, even though you can probably make it work.

> The base class implements generic code such as code related to
> memory_listener and address space management whereas the derived
> class implements callbacks that depend on the kernel user space
> being used.
> 
> 'as.c' only manipulates the base class object with wrapper functions
> that call the right class functions. Existing 'container.c' code is
> converted to implement the legacy container class functions.
> 
> Existing migration code only works with the legacy container.
> Also 'spapr.c' isn't BE agnostic.
> 
> Below is the object. It's named VFIOContainer; the old VFIOContainer
> is replaced with VFIOLegacyContainer.
> 
> struct VFIOContainer {
>     /* private */
>     Object parent_obj;
> 
>     VFIOAddressSpace *space;
>     MemoryListener listener;
>     Error *error;
>     bool initialized;
>     bool dirty_pages_supported;
>     uint64_t dirty_pgsizes;
>     uint64_t max_dirty_bitmap_size;
>     unsigned long pgsizes;
>     unsigned int dma_max_mappings;
>     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>     QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>     QLIST_ENTRY(VFIOContainer) next;
> };
> 
> struct VFIOLegacyContainer {
>     VFIOContainer obj;
>     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>     MemoryListener prereg_listener;
>     unsigned iommu_type;
>     QLIST_HEAD(, VFIOGroup) group_list;
> };
> 
> Co-authored-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  hw/vfio/as.c                         |  48 +++---
>  hw/vfio/container-obj.c              | 195 +++++++++++++++++++++++
>  hw/vfio/container.c                  | 224 ++++++++++++++++-----------
>  hw/vfio/meson.build                  |   1 +
>  hw/vfio/migration.c                  |   4 +-
>  hw/vfio/pci.c                        |   4 +-
>  hw/vfio/spapr.c                      |  22 +--
>  include/hw/vfio/vfio-common.h        |  78 ++--------
>  include/hw/vfio/vfio-container-obj.h | 154 ++++++++++++++++++
>  9 files changed, 540 insertions(+), 190 deletions(-)
>  create mode 100644 hw/vfio/container-obj.c
>  create mode 100644 include/hw/vfio/vfio-container-obj.h
> 
> diff --git a/hw/vfio/as.c b/hw/vfio/as.c
> index 4181182808..37423d2c89 100644
> --- a/hw/vfio/as.c
> +++ b/hw/vfio/as.c
> @@ -215,9 +215,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>           * of vaddr will always be there, even if the memory object is
>           * destroyed and its backing memory munmap-ed.
>           */
> -        ret = vfio_dma_map(container, iova,
> -                           iotlb->addr_mask + 1, vaddr,
> -                           read_only);
> +        ret = vfio_container_dma_map(container, iova,
> +                                     iotlb->addr_mask + 1, vaddr,
> +                                     read_only);
>          if (ret) {
>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> @@ -225,7 +225,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                           iotlb->addr_mask + 1, vaddr, ret);
>          }
>      } else {
> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
> +        ret = vfio_container_dma_unmap(container, iova,
> +                                       iotlb->addr_mask + 1, iotlb);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -242,12 +243,13 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>  {
>      VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>                                                  listener);
> +    VFIOContainer *container = vrdl->container;
>      const hwaddr size = int128_get64(section->size);
>      const hwaddr iova = section->offset_within_address_space;
>      int ret;
>  
>      /* Unmap with a single call. */
> -    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
> +    ret = vfio_container_dma_unmap(container, iova, size , NULL);
>      if (ret) {
>          error_report("%s: vfio_dma_unmap() failed: %s", __func__,
>                       strerror(-ret));
> @@ -259,6 +261,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>  {
>      VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>                                                  listener);
> +    VFIOContainer *container = vrdl->container;
>      const hwaddr end = section->offset_within_region +
>                         int128_get64(section->size);
>      hwaddr start, next, iova;
> @@ -277,8 +280,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>                 section->offset_within_address_space;
>          vaddr = memory_region_get_ram_ptr(section->mr) + start;
>  
> -        ret = vfio_dma_map(vrdl->container, iova, next - start,
> -                           vaddr, section->readonly);
> +        ret = vfio_container_dma_map(container, iova, next - start,
> +                                     vaddr, section->readonly);
>          if (ret) {
>              /* Rollback */
>              vfio_ram_discard_notify_discard(rdl, section);
> @@ -530,8 +533,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          }
>      }
>  
> -    ret = vfio_dma_map(container, iova, int128_get64(llsize),
> -                       vaddr, section->readonly);
> +    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
> +                                 vaddr, section->readonly);
>      if (ret) {
>          error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
> @@ -656,7 +659,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          if (int128_eq(llsize, int128_2_64())) {
>              /* The unmap ioctl doesn't accept a full 64-bit span. */
>              llsize = int128_rshift(llsize, 1);
> -            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +            ret = vfio_container_dma_unmap(container, iova,
> +                                           int128_get64(llsize), NULL);
>              if (ret) {
>                  error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                               "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -664,7 +668,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>              }
>              iova += int128_get64(llsize);
>          }
> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +        ret = vfio_container_dma_unmap(container, iova,
> +                                       int128_get64(llsize), NULL);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -681,14 +686,14 @@ static void vfio_listener_log_global_start(MemoryListener *listener)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>  
> -    vfio_set_dirty_page_tracking(container, true);
> +    vfio_container_set_dirty_page_tracking(container, true);
>  }
>  
>  static void vfio_listener_log_global_stop(MemoryListener *listener)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>  
> -    vfio_set_dirty_page_tracking(container, false);
> +    vfio_container_set_dirty_page_tracking(container, false);
>  }
>  
>  typedef struct {
> @@ -717,8 +722,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
>          int ret;
>  
> -        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
> -                                    translated_addr);
> +        ret = vfio_container_get_dirty_bitmap(container, iova,
> +                                              iotlb->addr_mask + 1,
> +                                              translated_addr);
>          if (ret) {
>              error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -742,11 +748,13 @@ static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
>       * Sync the whole mapped region (spanning multiple individual mappings)
>       * in one go.
>       */
> -    return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr);
> +    return vfio_container_get_dirty_bitmap(vrdl->container, iova,
> +                                           size, ram_addr);
>  }
>  
> -static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
> -                                                   MemoryRegionSection *section)
> +static int
> +vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
> +                                            MemoryRegionSection *section)
>  {
>      RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
>      VFIORamDiscardListener *vrdl = NULL;
> @@ -810,7 +818,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
>      ram_addr = memory_region_get_ram_addr(section->mr) +
>                 section->offset_within_region;
>  
> -    return vfio_get_dirty_bitmap(container,
> +    return vfio_container_get_dirty_bitmap(container,
>                     REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
>                     int128_get64(section->size), ram_addr);
>  }
> @@ -825,7 +833,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>          return;
>      }
>  
> -    if (vfio_devices_all_dirty_tracking(container)) {
> +    if (vfio_container_devices_all_dirty_tracking(container)) {
>          vfio_sync_dirty_bitmap(container, section);
>      }
>  }
> diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
> new file mode 100644
> index 0000000000..40c1e2a2b5
> --- /dev/null
> +++ b/hw/vfio/container-obj.c
> @@ -0,0 +1,195 @@
> +/*
> + * VFIO CONTAINER BASE OBJECT
> + *
> + * Copyright (C) 2022 Intel Corporation.
> + * Copyright Red Hat, Inc. 2022
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Eric Auger <eric.auger@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
> +#include "qom/object.h"
> +#include "qapi/visitor.h"
> +#include "hw/vfio/vfio-container-obj.h"
> +
> +bool vfio_container_check_extension(VFIOContainer *container,
> +                                    VFIOContainerFeature feat)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->check_extension) {
> +        return false;
> +    }
> +
> +    return vccs->check_extension(container, feat);
> +}
> +
> +int vfio_container_dma_map(VFIOContainer *container,
> +                           hwaddr iova, ram_addr_t size,
> +                           void *vaddr, bool readonly)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->dma_map) {
> +        return -EINVAL;
> +    }
> +
> +    return vccs->dma_map(container, iova, size, vaddr, readonly);
> +}
> +
> +int vfio_container_dma_unmap(VFIOContainer *container,
> +                             hwaddr iova, ram_addr_t size,
> +                             IOMMUTLBEntry *iotlb)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->dma_unmap) {
> +        return -EINVAL;
> +    }
> +
> +    return vccs->dma_unmap(container, iova, size, iotlb);
> +}
> +
> +void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
> +                                            bool start)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->set_dirty_page_tracking) {
> +        return;
> +    }
> +
> +    vccs->set_dirty_page_tracking(container, start);
> +}
> +
> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->devices_all_dirty_tracking) {
> +        return false;
> +    }
> +
> +    return vccs->devices_all_dirty_tracking(container);
> +}
> +
> +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> +                                    uint64_t size, ram_addr_t ram_addr)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->get_dirty_bitmap) {
> +        return -EINVAL;
> +    }
> +
> +    return vccs->get_dirty_bitmap(container, iova, size, ram_addr);
> +}
> +
> +int vfio_container_add_section_window(VFIOContainer *container,
> +                                      MemoryRegionSection *section,
> +                                      Error **errp)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->add_window) {
> +        return 0;
> +    }
> +
> +    return vccs->add_window(container, section, errp);
> +}
> +
> +void vfio_container_del_section_window(VFIOContainer *container,
> +                                       MemoryRegionSection *section)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->del_window) {
> +        return;
> +    }
> +
> +    vccs->del_window(container, section);
> +}
> +
> +void vfio_container_init(void *_container, size_t instance_size,
> +                         const char *mrtypename,
> +                         VFIOAddressSpace *space)
> +{
> +    VFIOContainer *container;
> +
> +    object_initialize(_container, instance_size, mrtypename);
> +    container = VFIO_CONTAINER_OBJ(_container);
> +
> +    container->space = space;
> +    container->error = NULL;
> +    container->dirty_pages_supported = false;
> +    container->dma_max_mappings = 0;
> +    QLIST_INIT(&container->giommu_list);
> +    QLIST_INIT(&container->hostwin_list);
> +    QLIST_INIT(&container->vrdl_list);
> +}
> +
> +void vfio_container_destroy(VFIOContainer *container)
> +{
> +    VFIORamDiscardListener *vrdl, *vrdl_tmp;
> +    VFIOGuestIOMMU *giommu, *tmp;
> +    VFIOHostDMAWindow *hostwin, *next;
> +
> +    QLIST_SAFE_REMOVE(container, next);
> +
> +    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
> +        RamDiscardManager *rdm;
> +
> +        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
> +        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
> +        QLIST_REMOVE(vrdl, next);
> +        g_free(vrdl);
> +    }
> +
> +    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> +        memory_region_unregister_iommu_notifier(
> +                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
> +        QLIST_REMOVE(giommu, giommu_next);
> +        g_free(giommu);
> +    }
> +
> +    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
> +                       next) {
> +        QLIST_REMOVE(hostwin, hostwin_next);
> +        g_free(hostwin);
> +    }
> +
> +    object_unref(&container->parent_obj);
> +}
> +
> +static const TypeInfo vfio_container_info = {
> +    .parent             = TYPE_OBJECT,
> +    .name               = TYPE_VFIO_CONTAINER_OBJ,
> +    .class_size         = sizeof(VFIOContainerClass),
> +    .instance_size      = sizeof(VFIOContainer),
> +    .abstract           = true,
> +};
> +
> +static void vfio_container_register_types(void)
> +{
> +    type_register_static(&vfio_container_info);
> +}
> +
> +type_init(vfio_container_register_types)
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 9c665c1720..79972064d3 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -50,6 +50,8 @@
>  static int vfio_kvm_device_fd = -1;
>  #endif
>  
> +#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
> +
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
>  
> @@ -76,8 +78,10 @@ bool vfio_mig_active(void)
>      return true;
>  }
>  
> -bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> +static bool vfio_devices_all_dirty_tracking(VFIOContainer *bcontainer)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      VFIOGroup *group;
>      VFIODevice *vbasedev;
>      MigrationState *ms = migrate_get_current();
> @@ -103,7 +107,7 @@ bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>      return true;
>  }
>  
> -bool vfio_devices_all_running_and_saving(VFIOContainer *container)
> +static bool vfio_devices_all_running_and_saving(VFIOLegacyContainer *container)
>  {
>      VFIOGroup *group;
>      VFIODevice *vbasedev;
> @@ -132,10 +136,11 @@ bool vfio_devices_all_running_and_saving(VFIOContainer *container)
>      return true;
>  }
>  
> -static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> +static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
>                                   hwaddr iova, ram_addr_t size,
>                                   IOMMUTLBEntry *iotlb)
>  {
> +    VFIOContainer *bcontainer = &container->obj;
>      struct vfio_iommu_type1_dma_unmap *unmap;
>      struct vfio_bitmap *bitmap;
>      uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
> @@ -159,7 +164,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>      bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
>                     BITS_PER_BYTE;
>  
> -    if (bitmap->size > container->max_dirty_bitmap_size) {
> +    if (bitmap->size > bcontainer->max_dirty_bitmap_size) {
>          error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
>                       (uint64_t)bitmap->size);
>          ret = -E2BIG;
> @@ -189,10 +194,12 @@ unmap_exit:
>  /*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
> -int vfio_dma_unmap(VFIOContainer *container,
> -                   hwaddr iova, ram_addr_t size,
> -                   IOMMUTLBEntry *iotlb)
> +static int vfio_dma_unmap(VFIOContainer *bcontainer,
> +                          hwaddr iova, ram_addr_t size,
> +                          IOMMUTLBEntry *iotlb)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      struct vfio_iommu_type1_dma_unmap unmap = {
>          .argsz = sizeof(unmap),
>          .flags = 0,
> @@ -200,7 +207,7 @@ int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> -    if (iotlb && container->dirty_pages_supported &&
> +    if (iotlb && bcontainer->dirty_pages_supported &&
>          vfio_devices_all_running_and_saving(container)) {
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>      }
> @@ -221,7 +228,7 @@ int vfio_dma_unmap(VFIOContainer *container,
>          if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
>              container->iommu_type == VFIO_TYPE1v2_IOMMU) {
>              trace_vfio_dma_unmap_overflow_workaround();
> -            unmap.size -= 1ULL << ctz64(container->pgsizes);
> +            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
>              continue;
>          }
>          error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
> @@ -231,9 +238,22 @@ int vfio_dma_unmap(VFIOContainer *container,
>      return 0;
>  }
>  
> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> -                 ram_addr_t size, void *vaddr, bool readonly)
> +static bool vfio_legacy_container_check_extension(VFIOContainer *bcontainer,
> +                                                  VFIOContainerFeature feat)
>  {
> +    switch (feat) {
> +    case VFIO_FEAT_LIVE_MIGRATION:
> +        return true;
> +    default:
> +        return false;
> +    }
> +}
> +
> +static int vfio_dma_map(VFIOContainer *bcontainer, hwaddr iova,
> +                        ram_addr_t size, void *vaddr, bool readonly)
> +{
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      struct vfio_iommu_type1_dma_map map = {
>          .argsz = sizeof(map),
>          .flags = VFIO_DMA_MAP_FLAG_READ,
> @@ -252,7 +272,7 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>       * the VGA ROM space.
>       */
>      if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
> +        (errno == EBUSY && vfio_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
>           ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>          return 0;
>      }
> @@ -261,8 +281,10 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
> +static void vfio_set_dirty_page_tracking(VFIOContainer *bcontainer, bool start)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      int ret;
>      struct vfio_iommu_type1_dirty_bitmap dirty = {
>          .argsz = sizeof(dirty),
> @@ -281,9 +303,11 @@ void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>      }
>  }
>  
> -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> -                          uint64_t size, ram_addr_t ram_addr)
> +static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
> +                                 uint64_t size, ram_addr_t ram_addr)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>      struct vfio_iommu_type1_dirty_bitmap_get *range;
>      uint64_t pages;
> @@ -333,18 +357,23 @@ err_out:
>      return ret;
>  }
>  
> -static void vfio_listener_release(VFIOContainer *container)
> +static void vfio_listener_release(VFIOLegacyContainer *container)
>  {
> -    memory_listener_unregister(&container->listener);
> +    VFIOContainer *bcontainer = &container->obj;
> +
> +    memory_listener_unregister(&bcontainer->listener);
>      if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>          memory_listener_unregister(&container->prereg_listener);
>      }
>  }
>  
> -int vfio_container_add_section_window(VFIOContainer *container,
> -                                      MemoryRegionSection *section,
> -                                      Error **errp)
> +static int
> +vfio_legacy_container_add_section_window(VFIOContainer *bcontainer,
> +                                         MemoryRegionSection *section,
> +                                         Error **errp)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      VFIOHostDMAWindow *hostwin;
>      hwaddr pgsize = 0;
>      int ret;
> @@ -354,7 +383,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
>      }
>  
>      /* For now intersections are not allowed, we may relax this later */
> -    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
>          if (ranges_overlap(hostwin->min_iova,
>                             hostwin->max_iova - hostwin->min_iova + 1,
>                             section->offset_within_address_space,
> @@ -376,7 +405,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
>          return ret;
>      }
>  
> -    vfio_host_win_add(container, section->offset_within_address_space,
> +    vfio_host_win_add(bcontainer, section->offset_within_address_space,
>                        section->offset_within_address_space +
>                        int128_get64(section->size) - 1, pgsize);
>  #ifdef CONFIG_KVM
> @@ -409,16 +438,20 @@ int vfio_container_add_section_window(VFIOContainer *container,
>      return 0;
>  }
>  
> -void vfio_container_del_section_window(VFIOContainer *container,
> -                                       MemoryRegionSection *section)
> +static void
> +vfio_legacy_container_del_section_window(VFIOContainer *bcontainer,
> +                                         MemoryRegionSection *section)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
> +
>      if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>          return;
>      }
>  
>      vfio_spapr_remove_window(container,
>                               section->offset_within_address_space);
> -    if (vfio_host_win_del(container,
> +    if (vfio_host_win_del(bcontainer,
>                            section->offset_within_address_space,
>                            section->offset_within_address_space +
>                            int128_get64(section->size) - 1) < 0) {
> @@ -505,7 +538,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
>  /*
>   * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>   */
> -static int vfio_get_iommu_type(VFIOContainer *container,
> +static int vfio_get_iommu_type(VFIOLegacyContainer *container,
>                                 Error **errp)
>  {
>      int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> @@ -521,7 +554,7 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>      return -EINVAL;
>  }
>  
> -static int vfio_init_container(VFIOContainer *container, int group_fd,
> +static int vfio_init_container(VFIOLegacyContainer *container, int group_fd,
>                                 Error **errp)
>  {
>      int iommu_type, ret;
> @@ -556,7 +589,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>      return 0;
>  }
>  
> -static int vfio_get_iommu_info(VFIOContainer *container,
> +static int vfio_get_iommu_info(VFIOLegacyContainer *container,
>                                 struct vfio_iommu_type1_info **info)
>  {
>  
> @@ -600,11 +633,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
>      return NULL;
>  }
>  
> -static void vfio_get_iommu_info_migration(VFIOContainer *container,
> -                                         struct vfio_iommu_type1_info *info)
> +static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container,
> +                                          struct vfio_iommu_type1_info *info)
>  {
>      struct vfio_info_cap_header *hdr;
>      struct vfio_iommu_type1_info_cap_migration *cap_mig;
> +    VFIOContainer *bcontainer = &container->obj;
>  
>      hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
>      if (!hdr) {
> @@ -619,13 +653,14 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container,
>       * qemu_real_host_page_size to mark those dirty.
>       */
>      if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) {
> -        container->dirty_pages_supported = true;
> -        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> -        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
> +        bcontainer->dirty_pages_supported = true;
> +        bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> +        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
>      }
>  }
>  
> -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
> +static int
> +vfio_ram_block_discard_disable(VFIOLegacyContainer *container, bool state)
>  {
>      switch (container->iommu_type) {
>      case VFIO_TYPE1v2_IOMMU:
> @@ -651,7 +686,8 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>                                    Error **errp)
>  {
> -    VFIOContainer *container;
> +    VFIOContainer *bcontainer;
> +    VFIOLegacyContainer *container;
>      int ret, fd;
>      VFIOAddressSpace *space;
>  
> @@ -688,7 +724,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> -    QLIST_FOREACH(container, &space->containers, next) {
> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
> +        container = container_of(bcontainer, VFIOLegacyContainer, obj);
>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>              ret = vfio_ram_block_discard_disable(container, true);
>              if (ret) {
> @@ -724,14 +761,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container = g_malloc0(sizeof(*container));
> -    container->space = space;
>      container->fd = fd;
> -    container->error = NULL;
> -    container->dirty_pages_supported = false;
> -    container->dma_max_mappings = 0;
> -    QLIST_INIT(&container->giommu_list);
> -    QLIST_INIT(&container->hostwin_list);
> -    QLIST_INIT(&container->vrdl_list);
> +    bcontainer = &container->obj;
> +    vfio_container_init(bcontainer, sizeof(*bcontainer),
> +                        TYPE_VFIO_LEGACY_CONTAINER, space);
>  
>      ret = vfio_init_container(container, group->fd, errp);
>      if (ret) {
> @@ -763,13 +796,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              /* Assume 4k IOVA page size */
>              info->iova_pgsizes = 4096;
>          }
> -        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
> -        container->pgsizes = info->iova_pgsizes;
> +        vfio_host_win_add(bcontainer, 0, (hwaddr)-1, info->iova_pgsizes);
> +        bcontainer->pgsizes = info->iova_pgsizes;
>  
>          /* The default in the kernel ("dma_entry_limit") is 65535. */
> -        container->dma_max_mappings = 65535;
> +        bcontainer->dma_max_mappings = 65535;
>          if (!ret) {
> -            vfio_get_info_dma_avail(info, &container->dma_max_mappings);
> +            vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings);
>              vfio_get_iommu_info_migration(container, info);
>          }
>          g_free(info);
> @@ -798,10 +831,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  
>              memory_listener_register(&container->prereg_listener,
>                                       &address_space_memory);
> -            if (container->error) {
> +            if (bcontainer->error) {
>                  memory_listener_unregister(&container->prereg_listener);
>                  ret = -1;
> -                error_propagate_prepend(errp, container->error,
> +                error_propagate_prepend(errp, bcontainer->error,
>                      "RAM memory listener initialization failed: ");
>                  goto enable_discards_exit;
>              }
> @@ -820,7 +853,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>          }
>  
>          if (v2) {
> -            container->pgsizes = info.ddw.pgsizes;
> +            bcontainer->pgsizes = info.ddw.pgsizes;
>              /*
>               * There is a default window in just created container.
>               * To make region_add/del simpler, we better remove this
> @@ -835,8 +868,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              }
>          } else {
>              /* The default table uses 4K pages */
> -            container->pgsizes = 0x1000;
> -            vfio_host_win_add(container, info.dma32_window_start,
> +            bcontainer->pgsizes = 0x1000;
> +            vfio_host_win_add(bcontainer, info.dma32_window_start,
>                                info.dma32_window_start +
>                                info.dma32_window_size - 1,
>                                0x1000);
> @@ -847,28 +880,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      vfio_kvm_device_add_group(group);
>  
>      QLIST_INIT(&container->group_list);
> -    QLIST_INSERT_HEAD(&space->containers, container, next);
> +    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
>  
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>  
> -    container->listener = vfio_memory_listener;
> +    bcontainer->listener = vfio_memory_listener;
>  
> -    memory_listener_register(&container->listener, container->space->as);
> +    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
>  
> -    if (container->error) {
> +    if (bcontainer->error) {
>          ret = -1;
> -        error_propagate_prepend(errp, container->error,
> +        error_propagate_prepend(errp, bcontainer->error,
>              "memory listener initialization failed: ");
>          goto listener_release_exit;
>      }
>  
> -    container->initialized = true;
> +    bcontainer->initialized = true;
>  
>      return 0;
>  listener_release_exit:
>      QLIST_REMOVE(group, container_next);
> -    QLIST_REMOVE(container, next);
> +    QLIST_REMOVE(bcontainer, next);
>      vfio_kvm_device_del_group(group);
>      vfio_listener_release(container);
>  
> @@ -889,7 +922,8 @@ put_space_exit:
>  
>  static void vfio_disconnect_container(VFIOGroup *group)
>  {
> -    VFIOContainer *container = group->container;
> +    VFIOLegacyContainer *container = group->container;
> +    VFIOContainer *bcontainer = &container->obj;
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> @@ -909,25 +943,9 @@ static void vfio_disconnect_container(VFIOGroup *group)
>      }
>  
>      if (QLIST_EMPTY(&container->group_list)) {
> -        VFIOAddressSpace *space = container->space;
> -        VFIOGuestIOMMU *giommu, *tmp;
> -        VFIOHostDMAWindow *hostwin, *next;
> -
> -        QLIST_REMOVE(container, next);
> -
> -        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(
> -                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
> -            QLIST_REMOVE(giommu, giommu_next);
> -            g_free(giommu);
> -        }
> -
> -        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
> -                           next) {
> -            QLIST_REMOVE(hostwin, hostwin_next);
> -            g_free(hostwin);
> -        }
> +        VFIOAddressSpace *space = bcontainer->space;
>  
> +        vfio_container_destroy(bcontainer);
>          trace_vfio_disconnect_container(container->fd);
>          close(container->fd);
>          g_free(container);
> @@ -939,13 +957,15 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  {
>      VFIOGroup *group;
> +    VFIOContainer *bcontainer;
>      char path[32];
>      struct vfio_group_status status = { .argsz = sizeof(status) };
>  
>      QLIST_FOREACH(group, &vfio_group_list, next) {
>          if (group->groupid == groupid) {
>              /* Found it.  Now is it already in the right context? */
> -            if (group->container->space->as == as) {
> +            bcontainer = &group->container->obj;
> +            if (bcontainer->space->as == as) {
>                  return group;
>              } else {
>                  error_setg(errp, "group %d used in multiple address spaces",
> @@ -1098,7 +1118,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>  /*
>   * Interfaces for IBM EEH (Enhanced Error Handling)
>   */
> -static bool vfio_eeh_container_ok(VFIOContainer *container)
> +static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
>  {
>      /*
>       * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
> @@ -1126,7 +1146,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container)
>      return true;
>  }
>  
> -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
> +static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op)
>  {
>      struct vfio_eeh_pe_op pe_op = {
>          .argsz = sizeof(pe_op),
> @@ -1149,19 +1169,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
>      return ret;
>  }
>  
> -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
> +static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
>  {
>      VFIOAddressSpace *space = vfio_get_address_space(as);
> -    VFIOContainer *container = NULL;
> +    VFIOLegacyContainer *container = NULL;
> +    VFIOContainer *bcontainer = NULL;
>  
>      if (QLIST_EMPTY(&space->containers)) {
>          /* No containers to act on */
>          goto out;
>      }
>  
> -    container = QLIST_FIRST(&space->containers);
> +    bcontainer = QLIST_FIRST(&space->containers);
> +    container = container_of(bcontainer, VFIOLegacyContainer, obj);
>  
> -    if (QLIST_NEXT(container, next)) {
> +    if (QLIST_NEXT(bcontainer, next)) {
>          /*
>           * We don't yet have logic to synchronize EEH state across
>           * multiple containers.
> @@ -1177,17 +1199,45 @@ out:
>  
>  bool vfio_eeh_as_ok(AddressSpace *as)
>  {
> -    VFIOContainer *container = vfio_eeh_as_container(as);
> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>  
>      return (container != NULL) && vfio_eeh_container_ok(container);
>  }
>  
>  int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>  {
> -    VFIOContainer *container = vfio_eeh_as_container(as);
> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>  
>      if (!container) {
>          return -ENODEV;
>      }
>      return vfio_eeh_container_op(container, op);
>  }
> +
> +static void vfio_legacy_container_class_init(ObjectClass *klass,
> +                                             void *data)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass);
> +
> +    vccs->dma_map = vfio_dma_map;
> +    vccs->dma_unmap = vfio_dma_unmap;
> +    vccs->devices_all_dirty_tracking = vfio_devices_all_dirty_tracking;
> +    vccs->set_dirty_page_tracking = vfio_set_dirty_page_tracking;
> +    vccs->get_dirty_bitmap = vfio_get_dirty_bitmap;
> +    vccs->add_window = vfio_legacy_container_add_section_window;
> +    vccs->del_window = vfio_legacy_container_del_section_window;
> +    vccs->check_extension = vfio_legacy_container_check_extension;
> +}
> +
> +static const TypeInfo vfio_legacy_container_info = {
> +    .parent = TYPE_VFIO_CONTAINER_OBJ,
> +    .name = TYPE_VFIO_LEGACY_CONTAINER,
> +    .class_init = vfio_legacy_container_class_init,
> +};
> +
> +static void vfio_register_types(void)
> +{
> +    type_register_static(&vfio_legacy_container_info);
> +}
> +
> +type_init(vfio_register_types)
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index e3b6d6e2cb..df4fa2b695 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>  vfio_ss.add(files(
>    'common.c',
>    'as.c',
> +  'container-obj.c',
>    'container.c',
>    'spapr.c',
>    'migration.c',
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ff6b45de6b..cbbde177c3 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -856,11 +856,11 @@ int64_t vfio_mig_bytes_transferred(void)
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
> -    VFIOContainer *container = vbasedev->group->container;
> +    VFIOLegacyContainer *container = vbasedev->group->container;
>      struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>  
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration || !container->obj.dirty_pages_supported) {
>          goto add_blocker;
>      }
>  
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index e707329394..a00a485e46 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3101,7 +3101,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          }
>      }
>  
> -    if (!pdev->failover_pair_id) {
> +    if (!pdev->failover_pair_id &&
> +        vfio_container_check_extension(&vbasedev->group->container->obj,
> +                                       VFIO_FEAT_LIVE_MIGRATION)) {
>          ret = vfio_migration_probe(vbasedev, errp);
>          if (ret) {
>              error_report("%s: Migration disabled", vbasedev->name);
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index 04c6e67f8f..cdcd9e05ba 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
>  static void vfio_prereg_listener_region_add(MemoryListener *listener,
>                                              MemoryRegionSection *section)
>  {
> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> -                                            prereg_listener);
> +    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
> +                                                  prereg_listener);
>      const hwaddr gpa = section->offset_within_address_space;
>      hwaddr end;
>      int ret;
> @@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
>           * can gracefully fail.  Runtime, there's not much we can do other
>           * than throw a hardware error.
>           */
> -        if (!container->initialized) {
> -            if (!container->error) {
> -                error_setg_errno(&container->error, -ret,
> +        if (!container->obj.initialized) {
> +            if (!container->obj.error) {
> +                error_setg_errno(&container->obj.error, -ret,
>                                   "Memory registering failed");
>              }
>          } else {
> @@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
>  static void vfio_prereg_listener_region_del(MemoryListener *listener,
>                                              MemoryRegionSection *section)
>  {
> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> -                                            prereg_listener);
> +    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
> +                                                  prereg_listener);
>      const hwaddr gpa = section->offset_within_address_space;
>      hwaddr end;
>      int ret;
> @@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
>      .region_del = vfio_prereg_listener_region_del,
>  };
>  
> -int vfio_spapr_create_window(VFIOContainer *container,
> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>                               MemoryRegionSection *section,
>                               hwaddr *pgsize)
>  {
> @@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      if (pagesize > rampagesize) {
>          pagesize = rampagesize;
>      }
> -    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
> +    pgmask = container->obj.pgsizes & (pagesize | (pagesize - 1));
>      pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
>      if (!pagesize) {
>          error_report("Host doesn't support page size 0x%"PRIx64
>                       ", the supported mask is 0x%lx",
>                       memory_region_iommu_get_min_page_size(iommu_mr),
> -                     container->pgsizes);
> +                     container->obj.pgsizes);
>          return -EINVAL;
>      }
>  
> @@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      return 0;
>  }
>  
> -int vfio_spapr_remove_window(VFIOContainer *container,
> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>                               hwaddr offset_within_address_space)
>  {
>      struct vfio_iommu_spapr_tce_remove remove = {
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 03ff7944cb..02a6f36a9e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -30,6 +30,7 @@
>  #include <linux/vfio.h>
>  #endif
>  #include "sysemu/sysemu.h"
> +#include "hw/vfio/vfio-container-obj.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -70,58 +71,15 @@ typedef struct VFIOMigration {
>      uint64_t pending_bytes;
>  } VFIOMigration;
>  
> -typedef struct VFIOAddressSpace {
> -    AddressSpace *as;
> -    QLIST_HEAD(, VFIOContainer) containers;
> -    QLIST_ENTRY(VFIOAddressSpace) list;
> -} VFIOAddressSpace;
> -
>  struct VFIOGroup;
>  
> -typedef struct VFIOContainer {
> -    VFIOAddressSpace *space;
> +typedef struct VFIOLegacyContainer {
> +    VFIOContainer obj;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> -    MemoryListener listener;
>      MemoryListener prereg_listener;
>      unsigned iommu_type;
> -    Error *error;
> -    bool initialized;
> -    bool dirty_pages_supported;
> -    uint64_t dirty_pgsizes;
> -    uint64_t max_dirty_bitmap_size;
> -    unsigned long pgsizes;
> -    unsigned int dma_max_mappings;
> -    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> -    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> -    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
> -    QLIST_ENTRY(VFIOContainer) next;
> -} VFIOContainer;
> -
> -typedef struct VFIOGuestIOMMU {
> -    VFIOContainer *container;
> -    IOMMUMemoryRegion *iommu_mr;
> -    hwaddr iommu_offset;
> -    IOMMUNotifier n;
> -    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> -} VFIOGuestIOMMU;
> -
> -typedef struct VFIORamDiscardListener {
> -    VFIOContainer *container;
> -    MemoryRegion *mr;
> -    hwaddr offset_within_address_space;
> -    hwaddr size;
> -    uint64_t granularity;
> -    RamDiscardListener listener;
> -    QLIST_ENTRY(VFIORamDiscardListener) next;
> -} VFIORamDiscardListener;
> -
> -typedef struct VFIOHostDMAWindow {
> -    hwaddr min_iova;
> -    hwaddr max_iova;
> -    uint64_t iova_pgsizes;
> -    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> -} VFIOHostDMAWindow;
> +} VFIOLegacyContainer;
>  
>  typedef struct VFIODeviceOps VFIODeviceOps;
>  
> @@ -159,7 +117,7 @@ struct VFIODeviceOps {
>  typedef struct VFIOGroup {
>      int fd;
>      int groupid;
> -    VFIOContainer *container;
> +    VFIOLegacyContainer *container;
>      QLIST_HEAD(, VFIODevice) device_list;
>      QLIST_ENTRY(VFIOGroup) next;
>      QLIST_ENTRY(VFIOGroup) container_next;
> @@ -192,31 +150,13 @@ typedef struct VFIODisplay {
>      } dmabuf;
>  } VFIODisplay;
>  
> -void vfio_host_win_add(VFIOContainer *container,
> +void vfio_host_win_add(VFIOContainer *bcontainer,
>                         hwaddr min_iova, hwaddr max_iova,
>                         uint64_t iova_pgsizes);
> -int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
> +int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
>                        hwaddr max_iova);
>  VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
>  void vfio_put_address_space(VFIOAddressSpace *space);
> -bool vfio_devices_all_running_and_saving(VFIOContainer *container);
> -bool vfio_devices_all_dirty_tracking(VFIOContainer *container);
> -
> -/* container->fd */
> -int vfio_dma_unmap(VFIOContainer *container,
> -                   hwaddr iova, ram_addr_t size,
> -                   IOMMUTLBEntry *iotlb);
> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> -                 ram_addr_t size, void *vaddr, bool readonly);
> -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
> -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> -                          uint64_t size, ram_addr_t ram_addr);
> -
> -int vfio_container_add_section_window(VFIOContainer *container,
> -                                      MemoryRegionSection *section,
> -                                      Error **errp);
> -void vfio_container_del_section_window(VFIOContainer *container,
> -                                       MemoryRegionSection *section);
>  
>  void vfio_put_base_device(VFIODevice *vbasedev);
>  void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
> @@ -263,10 +203,10 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>  
> -int vfio_spapr_create_window(VFIOContainer *container,
> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>                               MemoryRegionSection *section,
>                               hwaddr *pgsize);
> -int vfio_spapr_remove_window(VFIOContainer *container,
> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>                               hwaddr offset_within_address_space);
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
> new file mode 100644
> index 0000000000..7ffbbb299f
> --- /dev/null
> +++ b/include/hw/vfio/vfio-container-obj.h
> @@ -0,0 +1,154 @@
> +/*
> + * VFIO CONTAINER BASE OBJECT
> + *
> + * Copyright (C) 2022 Intel Corporation.
> + * Copyright Red Hat, Inc. 2022
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Eric Auger <eric.auger@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_VFIO_VFIO_CONTAINER_OBJ_H
> +#define HW_VFIO_VFIO_CONTAINER_OBJ_H
> +
> +#include "qom/object.h"
> +#include "exec/memory.h"
> +#include "qemu/queue.h"
> +#include "qemu/thread.h"
> +#ifndef CONFIG_USER_ONLY
> +#include "exec/hwaddr.h"
> +#endif
> +
> +#define TYPE_VFIO_CONTAINER_OBJ "qemu:vfio-base-container-obj"
> +#define VFIO_CONTAINER_OBJ(obj) \
> +        OBJECT_CHECK(VFIOContainer, (obj), TYPE_VFIO_CONTAINER_OBJ)
> +#define VFIO_CONTAINER_OBJ_CLASS(klass) \
> +        OBJECT_CLASS_CHECK(VFIOContainerClass, (klass), \
> +                         TYPE_VFIO_CONTAINER_OBJ)
> +#define VFIO_CONTAINER_OBJ_GET_CLASS(obj) \
> +        OBJECT_GET_CLASS(VFIOContainerClass, (obj), \
> +                         TYPE_VFIO_CONTAINER_OBJ)
> +
> +typedef enum VFIOContainerFeature {
> +    VFIO_FEAT_LIVE_MIGRATION,
> +} VFIOContainerFeature;
> +
> +typedef struct VFIOContainer VFIOContainer;
> +
> +typedef struct VFIOAddressSpace {
> +    AddressSpace *as;
> +    QLIST_HEAD(, VFIOContainer) containers;
> +    QLIST_ENTRY(VFIOAddressSpace) list;
> +} VFIOAddressSpace;
> +
> +typedef struct VFIOGuestIOMMU {
> +    VFIOContainer *container;
> +    IOMMUMemoryRegion *iommu_mr;
> +    hwaddr iommu_offset;
> +    IOMMUNotifier n;
> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> +} VFIOGuestIOMMU;
> +
> +typedef struct VFIORamDiscardListener {
> +    VFIOContainer *container;
> +    MemoryRegion *mr;
> +    hwaddr offset_within_address_space;
> +    hwaddr size;
> +    uint64_t granularity;
> +    RamDiscardListener listener;
> +    QLIST_ENTRY(VFIORamDiscardListener) next;
> +} VFIORamDiscardListener;
> +
> +typedef struct VFIOHostDMAWindow {
> +    hwaddr min_iova;
> +    hwaddr max_iova;
> +    uint64_t iova_pgsizes;
> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> +} VFIOHostDMAWindow;
> +
> +/*
> + * This is the base object for vfio container backends
> + */
> +struct VFIOContainer {
> +    /* private */
> +    Object parent_obj;
> +
> +    VFIOAddressSpace *space;
> +    MemoryListener listener;
> +    Error *error;
> +    bool initialized;
> +    bool dirty_pages_supported;
> +    uint64_t dirty_pgsizes;
> +    uint64_t max_dirty_bitmap_size;
> +    unsigned long pgsizes;
> +    unsigned int dma_max_mappings;
> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
> +    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
> +    QLIST_ENTRY(VFIOContainer) next;
> +};
> +
> +typedef struct VFIOContainerClass {
> +    /* private */
> +    ObjectClass parent_class;
> +
> +    /* required */
> +    bool (*check_extension)(VFIOContainer *container,
> +                            VFIOContainerFeature feat);
> +    int (*dma_map)(VFIOContainer *container,
> +                   hwaddr iova, ram_addr_t size,
> +                   void *vaddr, bool readonly);
> +    int (*dma_unmap)(VFIOContainer *container,
> +                     hwaddr iova, ram_addr_t size,
> +                     IOMMUTLBEntry *iotlb);
> +    /* migration feature */
> +    bool (*devices_all_dirty_tracking)(VFIOContainer *container);
> +    void (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
> +    int (*get_dirty_bitmap)(VFIOContainer *container, uint64_t iova,
> +                            uint64_t size, ram_addr_t ram_addr);
> +
> +    /* SPAPR specific */
> +    int (*add_window)(VFIOContainer *container,
> +                      MemoryRegionSection *section,
> +                      Error **errp);
> +    void (*del_window)(VFIOContainer *container,
> +                       MemoryRegionSection *section);
> +} VFIOContainerClass;
> +
> +bool vfio_container_check_extension(VFIOContainer *container,
> +                                    VFIOContainerFeature feat);
> +int vfio_container_dma_map(VFIOContainer *container,
> +                           hwaddr iova, ram_addr_t size,
> +                           void *vaddr, bool readonly);
> +int vfio_container_dma_unmap(VFIOContainer *container,
> +                             hwaddr iova, ram_addr_t size,
> +                             IOMMUTLBEntry *iotlb);
> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
> +void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
> +                                            bool start);
> +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> +                                    uint64_t size, ram_addr_t ram_addr);
> +int vfio_container_add_section_window(VFIOContainer *container,
> +                                      MemoryRegionSection *section,
> +                                      Error **errp);
> +void vfio_container_del_section_window(VFIOContainer *container,
> +                                       MemoryRegionSection *section);
> +
> +void vfio_container_init(void *_container, size_t instance_size,
> +                         const char *mrtypename,
> +                         VFIOAddressSpace *space);
> +void vfio_container_destroy(VFIOContainer *container);
> +#endif /* HW_VFIO_VFIO_CONTAINER_OBJ_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [RFC 07/18] vfio: Add base object for VFIOContainer
@ 2022-04-29  6:29     ` David Gibson
  0 siblings, 0 replies; 125+ messages in thread
From: David Gibson @ 2022-04-29  6:29 UTC (permalink / raw)
  To: Yi Liu
  Cc: akrowiak, jjherne, thuth, chao.p.peng, kvm, mjrosato, farman,
	jasowang, cohuck, qemu-devel, peterx, pasic, eric.auger,
	alex.williamson, nicolinc, kevin.tian, jgg, yi.y.sun,
	eric.auger.pro


On Thu, Apr 14, 2022 at 03:46:59AM -0700, Yi Liu wrote:
> Qomify the VFIOContainer object which acts as a base class for a
> container. This base class is derived into the legacy VFIO container
> and later on, into the new iommufd based container.

You certainly need the abstraction, but I'm not sure QOM is the right
way to accomplish it in this case.  The QOM class of an object is
visible to the user/config layer via QMP (and sometimes the command
line).  It doesn't necessarily correspond to guest-visible
differences, but it often does.

AIUI, the idea here is that the back end in use should be an
implementation detail which doesn't affect the interfaces outside the
vfio subsystem itself.  If that's the case, QOM may not be a great
fit, even though you can probably make it work.

> The base class implements generic code such as code related to
> memory_listener and address space management whereas the derived
> class implements callbacks that depend on the kernel user space
> being used.
> 
> 'as.c' only manipulates the base class object with wrapper functions
> that call the right class functions. Existing 'container.c' code is
> converted to implement the legacy container class functions.
> 
> Existing migration code only works with the legacy container.
> Also 'spapr.c' isn't BE agnostic.
> 
> Below is the object. It's named as VFIOContainer, old VFIOContainer
> is replaced with VFIOLegacyContainer.
> 
> struct VFIOContainer {
>     /* private */
>     Object parent_obj;
> 
>     VFIOAddressSpace *space;
>     MemoryListener listener;
>     Error *error;
>     bool initialized;
>     bool dirty_pages_supported;
>     uint64_t dirty_pgsizes;
>     uint64_t max_dirty_bitmap_size;
>     unsigned long pgsizes;
>     unsigned int dma_max_mappings;
>     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>     QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>     QLIST_ENTRY(VFIOContainer) next;
> };
> 
> struct VFIOLegacyContainer {
>     VFIOContainer obj;
>     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>     MemoryListener prereg_listener;
>     unsigned iommu_type;
>     QLIST_HEAD(, VFIOGroup) group_list;
> };
> 
> Co-authored-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  hw/vfio/as.c                         |  48 +++---
>  hw/vfio/container-obj.c              | 195 +++++++++++++++++++++++
>  hw/vfio/container.c                  | 224 ++++++++++++++++-----------
>  hw/vfio/meson.build                  |   1 +
>  hw/vfio/migration.c                  |   4 +-
>  hw/vfio/pci.c                        |   4 +-
>  hw/vfio/spapr.c                      |  22 +--
>  include/hw/vfio/vfio-common.h        |  78 ++--------
>  include/hw/vfio/vfio-container-obj.h | 154 ++++++++++++++++++
>  9 files changed, 540 insertions(+), 190 deletions(-)
>  create mode 100644 hw/vfio/container-obj.c
>  create mode 100644 include/hw/vfio/vfio-container-obj.h
> 
> diff --git a/hw/vfio/as.c b/hw/vfio/as.c
> index 4181182808..37423d2c89 100644
> --- a/hw/vfio/as.c
> +++ b/hw/vfio/as.c
> @@ -215,9 +215,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>           * of vaddr will always be there, even if the memory object is
>           * destroyed and its backing memory munmap-ed.
>           */
> -        ret = vfio_dma_map(container, iova,
> -                           iotlb->addr_mask + 1, vaddr,
> -                           read_only);
> +        ret = vfio_container_dma_map(container, iova,
> +                                     iotlb->addr_mask + 1, vaddr,
> +                                     read_only);
>          if (ret) {
>              error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx", %p) = %d (%m)",
> @@ -225,7 +225,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>                           iotlb->addr_mask + 1, vaddr, ret);
>          }
>      } else {
> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
> +        ret = vfio_container_dma_unmap(container, iova,
> +                                       iotlb->addr_mask + 1, iotlb);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -242,12 +243,13 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>  {
>      VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>                                                  listener);
> +    VFIOContainer *container = vrdl->container;
>      const hwaddr size = int128_get64(section->size);
>      const hwaddr iova = section->offset_within_address_space;
>      int ret;
>  
>      /* Unmap with a single call. */
> -    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
> +    ret = vfio_container_dma_unmap(container, iova, size , NULL);
>      if (ret) {
>          error_report("%s: vfio_dma_unmap() failed: %s", __func__,
>                       strerror(-ret));
> @@ -259,6 +261,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>  {
>      VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>                                                  listener);
> +    VFIOContainer *container = vrdl->container;
>      const hwaddr end = section->offset_within_region +
>                         int128_get64(section->size);
>      hwaddr start, next, iova;
> @@ -277,8 +280,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>                 section->offset_within_address_space;
>          vaddr = memory_region_get_ram_ptr(section->mr) + start;
>  
> -        ret = vfio_dma_map(vrdl->container, iova, next - start,
> -                           vaddr, section->readonly);
> +        ret = vfio_container_dma_map(container, iova, next - start,
> +                                     vaddr, section->readonly);
>          if (ret) {
>              /* Rollback */
>              vfio_ram_discard_notify_discard(rdl, section);
> @@ -530,8 +533,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          }
>      }
>  
> -    ret = vfio_dma_map(container, iova, int128_get64(llsize),
> -                       vaddr, section->readonly);
> +    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
> +                                 vaddr, section->readonly);
>      if (ret) {
>          error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
> @@ -656,7 +659,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          if (int128_eq(llsize, int128_2_64())) {
>              /* The unmap ioctl doesn't accept a full 64-bit span. */
>              llsize = int128_rshift(llsize, 1);
> -            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +            ret = vfio_container_dma_unmap(container, iova,
> +                                           int128_get64(llsize), NULL);
>              if (ret) {
>                  error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                               "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -664,7 +668,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>              }
>              iova += int128_get64(llsize);
>          }
> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +        ret = vfio_container_dma_unmap(container, iova,
> +                                       int128_get64(llsize), NULL);
>          if (ret) {
>              error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -681,14 +686,14 @@ static void vfio_listener_log_global_start(MemoryListener *listener)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>  
> -    vfio_set_dirty_page_tracking(container, true);
> +    vfio_container_set_dirty_page_tracking(container, true);
>  }
>  
>  static void vfio_listener_log_global_stop(MemoryListener *listener)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>  
> -    vfio_set_dirty_page_tracking(container, false);
> +    vfio_container_set_dirty_page_tracking(container, false);
>  }
>  
>  typedef struct {
> @@ -717,8 +722,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
>          int ret;
>  
> -        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
> -                                    translated_addr);
> +        ret = vfio_container_get_dirty_bitmap(container, iova,
> +                                              iotlb->addr_mask + 1,
> +                                              translated_addr);
>          if (ret) {
>              error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
>                           "0x%"HWADDR_PRIx") = %d (%m)",
> @@ -742,11 +748,13 @@ static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
>       * Sync the whole mapped region (spanning multiple individual mappings)
>       * in one go.
>       */
> -    return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr);
> +    return vfio_container_get_dirty_bitmap(vrdl->container, iova,
> +                                           size, ram_addr);
>  }
>  
> -static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
> -                                                   MemoryRegionSection *section)
> +static int
> +vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
> +                                            MemoryRegionSection *section)
>  {
>      RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
>      VFIORamDiscardListener *vrdl = NULL;
> @@ -810,7 +818,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
>      ram_addr = memory_region_get_ram_addr(section->mr) +
>                 section->offset_within_region;
>  
> -    return vfio_get_dirty_bitmap(container,
> +    return vfio_container_get_dirty_bitmap(container,
>                     REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
>                     int128_get64(section->size), ram_addr);
>  }
> @@ -825,7 +833,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>          return;
>      }
>  
> -    if (vfio_devices_all_dirty_tracking(container)) {
> +    if (vfio_container_devices_all_dirty_tracking(container)) {
>          vfio_sync_dirty_bitmap(container, section);
>      }
>  }
> diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
> new file mode 100644
> index 0000000000..40c1e2a2b5
> --- /dev/null
> +++ b/hw/vfio/container-obj.c
> @@ -0,0 +1,195 @@
> +/*
> + * VFIO CONTAINER BASE OBJECT
> + *
> + * Copyright (C) 2022 Intel Corporation.
> + * Copyright Red Hat, Inc. 2022
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Eric Auger <eric.auger@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
> +#include "qom/object.h"
> +#include "qapi/visitor.h"
> +#include "hw/vfio/vfio-container-obj.h"
> +
> +bool vfio_container_check_extension(VFIOContainer *container,
> +                                    VFIOContainerFeature feat)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->check_extension) {
> +        return false;
> +    }
> +
> +    return vccs->check_extension(container, feat);
> +}
> +
> +int vfio_container_dma_map(VFIOContainer *container,
> +                           hwaddr iova, ram_addr_t size,
> +                           void *vaddr, bool readonly)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->dma_map) {
> +        return -EINVAL;
> +    }
> +
> +    return vccs->dma_map(container, iova, size, vaddr, readonly);
> +}
> +
> +int vfio_container_dma_unmap(VFIOContainer *container,
> +                             hwaddr iova, ram_addr_t size,
> +                             IOMMUTLBEntry *iotlb)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->dma_unmap) {
> +        return -EINVAL;
> +    }
> +
> +    return vccs->dma_unmap(container, iova, size, iotlb);
> +}
> +
> +void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
> +                                            bool start)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->set_dirty_page_tracking) {
> +        return;
> +    }
> +
> +    vccs->set_dirty_page_tracking(container, start);
> +}
> +
> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->devices_all_dirty_tracking) {
> +        return false;
> +    }
> +
> +    return vccs->devices_all_dirty_tracking(container);
> +}
> +
> +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> +                                    uint64_t size, ram_addr_t ram_addr)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->get_dirty_bitmap) {
> +        return -EINVAL;
> +    }
> +
> +    return vccs->get_dirty_bitmap(container, iova, size, ram_addr);
> +}
> +
> +int vfio_container_add_section_window(VFIOContainer *container,
> +                                      MemoryRegionSection *section,
> +                                      Error **errp)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->add_window) {
> +        return 0;
> +    }
> +
> +    return vccs->add_window(container, section, errp);
> +}
> +
> +void vfio_container_del_section_window(VFIOContainer *container,
> +                                       MemoryRegionSection *section)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
> +
> +    if (!vccs->del_window) {
> +        return;
> +    }
> +
> +    vccs->del_window(container, section);
> +}
> +
> +void vfio_container_init(void *_container, size_t instance_size,
> +                         const char *mrtypename,
> +                         VFIOAddressSpace *space)
> +{
> +    VFIOContainer *container;
> +
> +    object_initialize(_container, instance_size, mrtypename);
> +    container = VFIO_CONTAINER_OBJ(_container);
> +
> +    container->space = space;
> +    container->error = NULL;
> +    container->dirty_pages_supported = false;
> +    container->dma_max_mappings = 0;
> +    QLIST_INIT(&container->giommu_list);
> +    QLIST_INIT(&container->hostwin_list);
> +    QLIST_INIT(&container->vrdl_list);
> +}
> +
> +void vfio_container_destroy(VFIOContainer *container)
> +{
> +    VFIORamDiscardListener *vrdl, *vrdl_tmp;
> +    VFIOGuestIOMMU *giommu, *tmp;
> +    VFIOHostDMAWindow *hostwin, *next;
> +
> +    QLIST_SAFE_REMOVE(container, next);
> +
> +    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
> +        RamDiscardManager *rdm;
> +
> +        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
> +        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
> +        QLIST_REMOVE(vrdl, next);
> +        g_free(vrdl);
> +    }
> +
> +    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> +        memory_region_unregister_iommu_notifier(
> +                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
> +        QLIST_REMOVE(giommu, giommu_next);
> +        g_free(giommu);
> +    }
> +
> +    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
> +                       next) {
> +        QLIST_REMOVE(hostwin, hostwin_next);
> +        g_free(hostwin);
> +    }
> +
> +    object_unref(&container->parent_obj);
> +}
> +
> +static const TypeInfo vfio_container_info = {
> +    .parent             = TYPE_OBJECT,
> +    .name               = TYPE_VFIO_CONTAINER_OBJ,
> +    .class_size         = sizeof(VFIOContainerClass),
> +    .instance_size      = sizeof(VFIOContainer),
> +    .abstract           = true,
> +};
> +
> +static void vfio_container_register_types(void)
> +{
> +    type_register_static(&vfio_container_info);
> +}
> +
> +type_init(vfio_container_register_types)
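The wrappers in container-obj.c all follow the same QOM class-dispatch shape: fetch the class struct from the object, check whether the backend implemented the op, and forward the call. As a minimal stand-alone sketch of that pattern (not code from the patch; `Container`, `ContainerClass`, and `container_dma_map` are illustrative names, not QEMU APIs):

```c
#include <stddef.h>
#include <errno.h>

typedef struct Container Container;

/* class struct of function pointers, analogous to VFIOContainerClass */
typedef struct ContainerClass {
    int (*dma_map)(Container *c, unsigned long iova, size_t size);
} ContainerClass;

/* base object carries a pointer to its class, analogous to
 * VFIO_CONTAINER_OBJ_GET_CLASS() resolving the class from the object */
struct Container {
    const ContainerClass *klass;
};

/* generic wrapper: NULL-check the op so a backend may leave it
 * unimplemented, then dispatch to the backend's implementation */
static int container_dma_map(Container *c, unsigned long iova, size_t size)
{
    if (!c->klass->dma_map) {
        return -EINVAL;
    }
    return c->klass->dma_map(c, iova, size);
}

/* one concrete backend that implements the op ... */
static int legacy_dma_map(Container *c, unsigned long iova, size_t size)
{
    (void)c; (void)iova; (void)size;
    return 0; /* stand-in for the real ioctl path */
}

static const ContainerClass legacy_class = { .dma_map = legacy_dma_map };

/* ... and one that does not, exercising the NULL-check path */
static const ContainerClass stub_class = { .dma_map = NULL };
```

The NULL-checks are what let the add_window/del_window ops stay optional for backends that have no sPAPR-style DMA windows.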
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 9c665c1720..79972064d3 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -50,6 +50,8 @@
>  static int vfio_kvm_device_fd = -1;
>  #endif
>  
> +#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
> +
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
>  
> @@ -76,8 +78,10 @@ bool vfio_mig_active(void)
>      return true;
>  }
>  
> -bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> +static bool vfio_devices_all_dirty_tracking(VFIOContainer *bcontainer)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      VFIOGroup *group;
>      VFIODevice *vbasedev;
>      MigrationState *ms = migrate_get_current();
> @@ -103,7 +107,7 @@ bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>      return true;
>  }
>  
> -bool vfio_devices_all_running_and_saving(VFIOContainer *container)
> +static bool vfio_devices_all_running_and_saving(VFIOLegacyContainer *container)
>  {
>      VFIOGroup *group;
>      VFIODevice *vbasedev;
> @@ -132,10 +136,11 @@ bool vfio_devices_all_running_and_saving(VFIOContainer *container)
>      return true;
>  }
>  
> -static int vfio_dma_unmap_bitmap(VFIOContainer *container,
> +static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
>                                   hwaddr iova, ram_addr_t size,
>                                   IOMMUTLBEntry *iotlb)
>  {
> +    VFIOContainer *bcontainer = &container->obj;
>      struct vfio_iommu_type1_dma_unmap *unmap;
>      struct vfio_bitmap *bitmap;
>      uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
> @@ -159,7 +164,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>      bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
>                     BITS_PER_BYTE;
>  
> -    if (bitmap->size > container->max_dirty_bitmap_size) {
> +    if (bitmap->size > bcontainer->max_dirty_bitmap_size) {
>          error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
>                       (uint64_t)bitmap->size);
>          ret = -E2BIG;
> @@ -189,10 +194,12 @@ unmap_exit:
>  /*
>   * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>   */
> -int vfio_dma_unmap(VFIOContainer *container,
> -                   hwaddr iova, ram_addr_t size,
> -                   IOMMUTLBEntry *iotlb)
> +static int vfio_dma_unmap(VFIOContainer *bcontainer,
> +                          hwaddr iova, ram_addr_t size,
> +                          IOMMUTLBEntry *iotlb)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      struct vfio_iommu_type1_dma_unmap unmap = {
>          .argsz = sizeof(unmap),
>          .flags = 0,
> @@ -200,7 +207,7 @@ int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> -    if (iotlb && container->dirty_pages_supported &&
> +    if (iotlb && bcontainer->dirty_pages_supported &&
>          vfio_devices_all_running_and_saving(container)) {
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>      }
> @@ -221,7 +228,7 @@ int vfio_dma_unmap(VFIOContainer *container,
>          if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
>              container->iommu_type == VFIO_TYPE1v2_IOMMU) {
>              trace_vfio_dma_unmap_overflow_workaround();
> -            unmap.size -= 1ULL << ctz64(container->pgsizes);
> +            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
>              continue;
>          }
>          error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
> @@ -231,9 +238,22 @@ int vfio_dma_unmap(VFIOContainer *container,
>      return 0;
>  }
>  
> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> -                 ram_addr_t size, void *vaddr, bool readonly)
> +static bool vfio_legacy_container_check_extension(VFIOContainer *bcontainer,
> +                                                  VFIOContainerFeature feat)
>  {
> +    switch (feat) {
> +    case VFIO_FEAT_LIVE_MIGRATION:
> +        return true;
> +    default:
> +        return false;
> +    }
> +}
> +
> +static int vfio_dma_map(VFIOContainer *bcontainer, hwaddr iova,
> +                       ram_addr_t size, void *vaddr, bool readonly)
> +{
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      struct vfio_iommu_type1_dma_map map = {
>          .argsz = sizeof(map),
>          .flags = VFIO_DMA_MAP_FLAG_READ,
> @@ -252,7 +272,7 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>       * the VGA ROM space.
>       */
>      if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
> +        (errno == EBUSY && vfio_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
>           ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>          return 0;
>      }
> @@ -261,8 +281,10 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
> +static void vfio_set_dirty_page_tracking(VFIOContainer *bcontainer, bool start)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      int ret;
>      struct vfio_iommu_type1_dirty_bitmap dirty = {
>          .argsz = sizeof(dirty),
> @@ -281,9 +303,11 @@ void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>      }
>  }
>  
> -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> -                          uint64_t size, ram_addr_t ram_addr)
> +static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
> +                                 uint64_t size, ram_addr_t ram_addr)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>      struct vfio_iommu_type1_dirty_bitmap_get *range;
>      uint64_t pages;
> @@ -333,18 +357,23 @@ err_out:
>      return ret;
>  }
>  
> -static void vfio_listener_release(VFIOContainer *container)
> +static void vfio_listener_release(VFIOLegacyContainer *container)
>  {
> -    memory_listener_unregister(&container->listener);
> +    VFIOContainer *bcontainer = &container->obj;
> +
> +    memory_listener_unregister(&bcontainer->listener);
>      if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>          memory_listener_unregister(&container->prereg_listener);
>      }
>  }
>  
> -int vfio_container_add_section_window(VFIOContainer *container,
> -                                      MemoryRegionSection *section,
> -                                      Error **errp)
> +static int
> +vfio_legacy_container_add_section_window(VFIOContainer *bcontainer,
> +                                         MemoryRegionSection *section,
> +                                         Error **errp)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
>      VFIOHostDMAWindow *hostwin;
>      hwaddr pgsize = 0;
>      int ret;
> @@ -354,7 +383,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
>      }
>  
>      /* For now intersections are not allowed, we may relax this later */
> -    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
> +    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
>          if (ranges_overlap(hostwin->min_iova,
>                             hostwin->max_iova - hostwin->min_iova + 1,
>                             section->offset_within_address_space,
> @@ -376,7 +405,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
>          return ret;
>      }
>  
> -    vfio_host_win_add(container, section->offset_within_address_space,
> +    vfio_host_win_add(bcontainer, section->offset_within_address_space,
>                        section->offset_within_address_space +
>                        int128_get64(section->size) - 1, pgsize);
>  #ifdef CONFIG_KVM
> @@ -409,16 +438,20 @@ int vfio_container_add_section_window(VFIOContainer *container,
>      return 0;
>  }
>  
> -void vfio_container_del_section_window(VFIOContainer *container,
> -                                       MemoryRegionSection *section)
> +static void
> +vfio_legacy_container_del_section_window(VFIOContainer *bcontainer,
> +                                         MemoryRegionSection *section)
>  {
> +    VFIOLegacyContainer *container = container_of(bcontainer,
> +                                                  VFIOLegacyContainer, obj);
> +
>      if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>          return;
>      }
>  
>      vfio_spapr_remove_window(container,
>                               section->offset_within_address_space);
> -    if (vfio_host_win_del(container,
> +    if (vfio_host_win_del(bcontainer,
>                            section->offset_within_address_space,
>                            section->offset_within_address_space +
>                            int128_get64(section->size) - 1) < 0) {
> @@ -505,7 +538,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
>  /*
>   * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>   */
> -static int vfio_get_iommu_type(VFIOContainer *container,
> +static int vfio_get_iommu_type(VFIOLegacyContainer *container,
>                                 Error **errp)
>  {
>      int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> @@ -521,7 +554,7 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>      return -EINVAL;
>  }
>  
> -static int vfio_init_container(VFIOContainer *container, int group_fd,
> +static int vfio_init_container(VFIOLegacyContainer *container, int group_fd,
>                                 Error **errp)
>  {
>      int iommu_type, ret;
> @@ -556,7 +589,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>      return 0;
>  }
>  
> -static int vfio_get_iommu_info(VFIOContainer *container,
> +static int vfio_get_iommu_info(VFIOLegacyContainer *container,
>                                 struct vfio_iommu_type1_info **info)
>  {
>  
> @@ -600,11 +633,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
>      return NULL;
>  }
>  
> -static void vfio_get_iommu_info_migration(VFIOContainer *container,
> -                                         struct vfio_iommu_type1_info *info)
> +static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container,
> +                                          struct vfio_iommu_type1_info *info)
>  {
>      struct vfio_info_cap_header *hdr;
>      struct vfio_iommu_type1_info_cap_migration *cap_mig;
> +    VFIOContainer *bcontainer = &container->obj;
>  
>      hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
>      if (!hdr) {
> @@ -619,13 +653,14 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container,
>       * qemu_real_host_page_size to mark those dirty.
>       */
>      if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) {
> -        container->dirty_pages_supported = true;
> -        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> -        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
> +        bcontainer->dirty_pages_supported = true;
> +        bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
> +        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
>      }
>  }
>  
> -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
> +static int
> +vfio_ram_block_discard_disable(VFIOLegacyContainer *container, bool state)
>  {
>      switch (container->iommu_type) {
>      case VFIO_TYPE1v2_IOMMU:
> @@ -651,7 +686,8 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>                                    Error **errp)
>  {
> -    VFIOContainer *container;
> +    VFIOContainer *bcontainer;
> +    VFIOLegacyContainer *container;
>      int ret, fd;
>      VFIOAddressSpace *space;
>  
> @@ -688,7 +724,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> -    QLIST_FOREACH(container, &space->containers, next) {
> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
> +        container = container_of(bcontainer, VFIOLegacyContainer, obj);
>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>              ret = vfio_ram_block_discard_disable(container, true);
>              if (ret) {
> @@ -724,14 +761,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container = g_malloc0(sizeof(*container));
> -    container->space = space;
>      container->fd = fd;
> -    container->error = NULL;
> -    container->dirty_pages_supported = false;
> -    container->dma_max_mappings = 0;
> -    QLIST_INIT(&container->giommu_list);
> -    QLIST_INIT(&container->hostwin_list);
> -    QLIST_INIT(&container->vrdl_list);
> +    bcontainer = &container->obj;
> +    vfio_container_init(bcontainer, sizeof(*bcontainer),
> +                        TYPE_VFIO_LEGACY_CONTAINER, space);
>  
>      ret = vfio_init_container(container, group->fd, errp);
>      if (ret) {
> @@ -763,13 +796,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              /* Assume 4k IOVA page size */
>              info->iova_pgsizes = 4096;
>          }
> -        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
> -        container->pgsizes = info->iova_pgsizes;
> +        vfio_host_win_add(bcontainer, 0, (hwaddr)-1, info->iova_pgsizes);
> +        bcontainer->pgsizes = info->iova_pgsizes;
>  
>          /* The default in the kernel ("dma_entry_limit") is 65535. */
> -        container->dma_max_mappings = 65535;
> +        bcontainer->dma_max_mappings = 65535;
>          if (!ret) {
> -            vfio_get_info_dma_avail(info, &container->dma_max_mappings);
> +            vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings);
>              vfio_get_iommu_info_migration(container, info);
>          }
>          g_free(info);
> @@ -798,10 +831,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  
>              memory_listener_register(&container->prereg_listener,
>                                       &address_space_memory);
> -            if (container->error) {
> +            if (bcontainer->error) {
>                  memory_listener_unregister(&container->prereg_listener);
>                  ret = -1;
> -                error_propagate_prepend(errp, container->error,
> +                error_propagate_prepend(errp, bcontainer->error,
>                      "RAM memory listener initialization failed: ");
>                  goto enable_discards_exit;
>              }
> @@ -820,7 +853,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>          }
>  
>          if (v2) {
> -            container->pgsizes = info.ddw.pgsizes;
> +            bcontainer->pgsizes = info.ddw.pgsizes;
>              /*
>               * There is a default window in just created container.
>               * To make region_add/del simpler, we better remove this
> @@ -835,8 +868,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              }
>          } else {
>              /* The default table uses 4K pages */
> -            container->pgsizes = 0x1000;
> -            vfio_host_win_add(container, info.dma32_window_start,
> +            bcontainer->pgsizes = 0x1000;
> +            vfio_host_win_add(bcontainer, info.dma32_window_start,
>                                info.dma32_window_start +
>                                info.dma32_window_size - 1,
>                                0x1000);
> @@ -847,28 +880,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      vfio_kvm_device_add_group(group);
>  
>      QLIST_INIT(&container->group_list);
> -    QLIST_INSERT_HEAD(&space->containers, container, next);
> +    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
>  
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>  
> -    container->listener = vfio_memory_listener;
> +    bcontainer->listener = vfio_memory_listener;
>  
> -    memory_listener_register(&container->listener, container->space->as);
> +    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
>  
> -    if (container->error) {
> +    if (bcontainer->error) {
>          ret = -1;
> -        error_propagate_prepend(errp, container->error,
> +        error_propagate_prepend(errp, bcontainer->error,
>              "memory listener initialization failed: ");
>          goto listener_release_exit;
>      }
>  
> -    container->initialized = true;
> +    bcontainer->initialized = true;
>  
>      return 0;
>  listener_release_exit:
>      QLIST_REMOVE(group, container_next);
> -    QLIST_REMOVE(container, next);
> +    QLIST_REMOVE(bcontainer, next);
>      vfio_kvm_device_del_group(group);
>      vfio_listener_release(container);
>  
> @@ -889,7 +922,8 @@ put_space_exit:
>  
>  static void vfio_disconnect_container(VFIOGroup *group)
>  {
> -    VFIOContainer *container = group->container;
> +    VFIOLegacyContainer *container = group->container;
> +    VFIOContainer *bcontainer = &container->obj;
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> @@ -909,25 +943,9 @@ static void vfio_disconnect_container(VFIOGroup *group)
>      }
>  
>      if (QLIST_EMPTY(&container->group_list)) {
> -        VFIOAddressSpace *space = container->space;
> -        VFIOGuestIOMMU *giommu, *tmp;
> -        VFIOHostDMAWindow *hostwin, *next;
> -
> -        QLIST_REMOVE(container, next);
> -
> -        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(
> -                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
> -            QLIST_REMOVE(giommu, giommu_next);
> -            g_free(giommu);
> -        }
> -
> -        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
> -                           next) {
> -            QLIST_REMOVE(hostwin, hostwin_next);
> -            g_free(hostwin);
> -        }
> +        VFIOAddressSpace *space = bcontainer->space;
>  
> +        vfio_container_destroy(bcontainer);
>          trace_vfio_disconnect_container(container->fd);
>          close(container->fd);
>          g_free(container);
> @@ -939,13 +957,15 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  {
>      VFIOGroup *group;
> +    VFIOContainer *bcontainer;
>      char path[32];
>      struct vfio_group_status status = { .argsz = sizeof(status) };
>  
>      QLIST_FOREACH(group, &vfio_group_list, next) {
>          if (group->groupid == groupid) {
>              /* Found it.  Now is it already in the right context? */
> -            if (group->container->space->as == as) {
> +            bcontainer = &group->container->obj;
> +            if (bcontainer->space->as == as) {
>                  return group;
>              } else {
>                  error_setg(errp, "group %d used in multiple address spaces",
> @@ -1098,7 +1118,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>  /*
>   * Interfaces for IBM EEH (Enhanced Error Handling)
>   */
> -static bool vfio_eeh_container_ok(VFIOContainer *container)
> +static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
>  {
>      /*
>       * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
> @@ -1126,7 +1146,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container)
>      return true;
>  }
>  
> -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
> +static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op)
>  {
>      struct vfio_eeh_pe_op pe_op = {
>          .argsz = sizeof(pe_op),
> @@ -1149,19 +1169,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
>      return ret;
>  }
>  
> -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
> +static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
>  {
>      VFIOAddressSpace *space = vfio_get_address_space(as);
> -    VFIOContainer *container = NULL;
> +    VFIOLegacyContainer *container = NULL;
> +    VFIOContainer *bcontainer = NULL;
>  
>      if (QLIST_EMPTY(&space->containers)) {
>          /* No containers to act on */
>          goto out;
>      }
>  
> -    container = QLIST_FIRST(&space->containers);
> +    bcontainer = QLIST_FIRST(&space->containers);
> +    container = container_of(bcontainer, VFIOLegacyContainer, obj);
>  
> -    if (QLIST_NEXT(container, next)) {
> +    if (QLIST_NEXT(bcontainer, next)) {
>          /*
>           * We don't yet have logic to synchronize EEH state across
>           * multiple containers.
> @@ -1177,17 +1199,45 @@ out:
>  
>  bool vfio_eeh_as_ok(AddressSpace *as)
>  {
> -    VFIOContainer *container = vfio_eeh_as_container(as);
> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>  
>      return (container != NULL) && vfio_eeh_container_ok(container);
>  }
>  
>  int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>  {
> -    VFIOContainer *container = vfio_eeh_as_container(as);
> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>  
>      if (!container) {
>          return -ENODEV;
>      }
>      return vfio_eeh_container_op(container, op);
>  }
> +
> +static void vfio_legacy_container_class_init(ObjectClass *klass,
> +                                             void *data)
> +{
> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass);
> +
> +    vccs->dma_map = vfio_dma_map;
> +    vccs->dma_unmap = vfio_dma_unmap;
> +    vccs->devices_all_dirty_tracking = vfio_devices_all_dirty_tracking;
> +    vccs->set_dirty_page_tracking = vfio_set_dirty_page_tracking;
> +    vccs->get_dirty_bitmap = vfio_get_dirty_bitmap;
> +    vccs->add_window = vfio_legacy_container_add_section_window;
> +    vccs->del_window = vfio_legacy_container_del_section_window;
> +    vccs->check_extension = vfio_legacy_container_check_extension;
> +}
> +
> +static const TypeInfo vfio_legacy_container_info = {
> +    .parent = TYPE_VFIO_CONTAINER_OBJ,
> +    .name = TYPE_VFIO_LEGACY_CONTAINER,
> +    .class_init = vfio_legacy_container_class_init,
> +};
> +
> +static void vfio_register_types(void)
> +{
> +    type_register_static(&vfio_legacy_container_info);
> +}
> +
> +type_init(vfio_register_types)
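Throughout container.c the legacy backend recovers its VFIOLegacyContainer from the base object with container_of() on the embedded `obj` member. A self-contained sketch of how that derivation works (illustrative types only; the real macro lives in QEMU's headers):

```c
#include <stddef.h>

/* same pointer arithmetic as the kernel/QEMU container_of(): subtract
 * the member's offset from the member pointer to get the outer struct */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* stand-ins for VFIOContainer and VFIOLegacyContainer */
typedef struct Base {
    int initialized;
} Base;

typedef struct Legacy {
    Base obj;   /* base object embedded as a member, like obj above */
    int fd;
} Legacy;
```

Note that container_of() does not require the base to be the first member; offsetof() handles any placement, so the conversion stays correct even if fields are later added before `obj`.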
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index e3b6d6e2cb..df4fa2b695 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>  vfio_ss.add(files(
>    'common.c',
>    'as.c',
> +  'container-obj.c',
>    'container.c',
>    'spapr.c',
>    'migration.c',
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index ff6b45de6b..cbbde177c3 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -856,11 +856,11 @@ int64_t vfio_mig_bytes_transferred(void)
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
> -    VFIOContainer *container = vbasedev->group->container;
> +    VFIOLegacyContainer *container = vbasedev->group->container;
>      struct vfio_region_info *info = NULL;
>      int ret = -ENOTSUP;
>  
> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration || !container->obj.dirty_pages_supported) {
>          goto add_blocker;
>      }
>  
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index e707329394..a00a485e46 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3101,7 +3101,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          }
>      }
>  
> -    if (!pdev->failover_pair_id) {
> +    if (!pdev->failover_pair_id &&
> +        vfio_container_check_extension(&vbasedev->group->container->obj,
> +                                       VFIO_FEAT_LIVE_MIGRATION)) {
>          ret = vfio_migration_probe(vbasedev, errp);
>          if (ret) {
>              error_report("%s: Migration disabled", vbasedev->name);
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index 04c6e67f8f..cdcd9e05ba 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
>  static void vfio_prereg_listener_region_add(MemoryListener *listener,
>                                              MemoryRegionSection *section)
>  {
> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> -                                            prereg_listener);
> +    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
> +                                                  prereg_listener);
>      const hwaddr gpa = section->offset_within_address_space;
>      hwaddr end;
>      int ret;
> @@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
>           * can gracefully fail.  Runtime, there's not much we can do other
>           * than throw a hardware error.
>           */
> -        if (!container->initialized) {
> -            if (!container->error) {
> -                error_setg_errno(&container->error, -ret,
> +        if (!container->obj.initialized) {
> +            if (!container->obj.error) {
> +                error_setg_errno(&container->obj.error, -ret,
>                                   "Memory registering failed");
>              }
>          } else {
> @@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
>  static void vfio_prereg_listener_region_del(MemoryListener *listener,
>                                              MemoryRegionSection *section)
>  {
> -    VFIOContainer *container = container_of(listener, VFIOContainer,
> -                                            prereg_listener);
> +    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
> +                                                  prereg_listener);
>      const hwaddr gpa = section->offset_within_address_space;
>      hwaddr end;
>      int ret;
> @@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
>      .region_del = vfio_prereg_listener_region_del,
>  };
>  
> -int vfio_spapr_create_window(VFIOContainer *container,
> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>                               MemoryRegionSection *section,
>                               hwaddr *pgsize)
>  {
> @@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      if (pagesize > rampagesize) {
>          pagesize = rampagesize;
>      }
> -    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
> +    pgmask = container->obj.pgsizes & (pagesize | (pagesize - 1));
>      pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
>      if (!pagesize) {
>          error_report("Host doesn't support page size 0x%"PRIx64
>                       ", the supported mask is 0x%lx",
>                       memory_region_iommu_get_min_page_size(iommu_mr),
> -                     container->pgsizes);
> +                     container->obj.pgsizes);
>          return -EINVAL;
>      }
>  
> @@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      return 0;
>  }
>  
> -int vfio_spapr_remove_window(VFIOContainer *container,
> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>                               hwaddr offset_within_address_space)
>  {
>      struct vfio_iommu_spapr_tce_remove remove = {
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 03ff7944cb..02a6f36a9e 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -30,6 +30,7 @@
>  #include <linux/vfio.h>
>  #endif
>  #include "sysemu/sysemu.h"
> +#include "hw/vfio/vfio-container-obj.h"
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> @@ -70,58 +71,15 @@ typedef struct VFIOMigration {
>      uint64_t pending_bytes;
>  } VFIOMigration;
>  
> -typedef struct VFIOAddressSpace {
> -    AddressSpace *as;
> -    QLIST_HEAD(, VFIOContainer) containers;
> -    QLIST_ENTRY(VFIOAddressSpace) list;
> -} VFIOAddressSpace;
> -
>  struct VFIOGroup;
>  
> -typedef struct VFIOContainer {
> -    VFIOAddressSpace *space;
> +typedef struct VFIOLegacyContainer {
> +    VFIOContainer obj;
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> -    MemoryListener listener;
>      MemoryListener prereg_listener;
>      unsigned iommu_type;
> -    Error *error;
> -    bool initialized;
> -    bool dirty_pages_supported;
> -    uint64_t dirty_pgsizes;
> -    uint64_t max_dirty_bitmap_size;
> -    unsigned long pgsizes;
> -    unsigned int dma_max_mappings;
> -    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> -    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> -    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
> -    QLIST_ENTRY(VFIOContainer) next;
> -} VFIOContainer;
> -
> -typedef struct VFIOGuestIOMMU {
> -    VFIOContainer *container;
> -    IOMMUMemoryRegion *iommu_mr;
> -    hwaddr iommu_offset;
> -    IOMMUNotifier n;
> -    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> -} VFIOGuestIOMMU;
> -
> -typedef struct VFIORamDiscardListener {
> -    VFIOContainer *container;
> -    MemoryRegion *mr;
> -    hwaddr offset_within_address_space;
> -    hwaddr size;
> -    uint64_t granularity;
> -    RamDiscardListener listener;
> -    QLIST_ENTRY(VFIORamDiscardListener) next;
> -} VFIORamDiscardListener;
> -
> -typedef struct VFIOHostDMAWindow {
> -    hwaddr min_iova;
> -    hwaddr max_iova;
> -    uint64_t iova_pgsizes;
> -    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> -} VFIOHostDMAWindow;
> +} VFIOLegacyContainer;
>  
>  typedef struct VFIODeviceOps VFIODeviceOps;
>  
> @@ -159,7 +117,7 @@ struct VFIODeviceOps {
>  typedef struct VFIOGroup {
>      int fd;
>      int groupid;
> -    VFIOContainer *container;
> +    VFIOLegacyContainer *container;
>      QLIST_HEAD(, VFIODevice) device_list;
>      QLIST_ENTRY(VFIOGroup) next;
>      QLIST_ENTRY(VFIOGroup) container_next;
> @@ -192,31 +150,13 @@ typedef struct VFIODisplay {
>      } dmabuf;
>  } VFIODisplay;
>  
> -void vfio_host_win_add(VFIOContainer *container,
> +void vfio_host_win_add(VFIOContainer *bcontainer,
>                         hwaddr min_iova, hwaddr max_iova,
>                         uint64_t iova_pgsizes);
> -int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
> +int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
>                        hwaddr max_iova);
>  VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
>  void vfio_put_address_space(VFIOAddressSpace *space);
> -bool vfio_devices_all_running_and_saving(VFIOContainer *container);
> -bool vfio_devices_all_dirty_tracking(VFIOContainer *container);
> -
> -/* container->fd */
> -int vfio_dma_unmap(VFIOContainer *container,
> -                   hwaddr iova, ram_addr_t size,
> -                   IOMMUTLBEntry *iotlb);
> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> -                 ram_addr_t size, void *vaddr, bool readonly);
> -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
> -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> -                          uint64_t size, ram_addr_t ram_addr);
> -
> -int vfio_container_add_section_window(VFIOContainer *container,
> -                                      MemoryRegionSection *section,
> -                                      Error **errp);
> -void vfio_container_del_section_window(VFIOContainer *container,
> -                                       MemoryRegionSection *section);
>  
>  void vfio_put_base_device(VFIODevice *vbasedev);
>  void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
> @@ -263,10 +203,10 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>  
> -int vfio_spapr_create_window(VFIOContainer *container,
> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>                               MemoryRegionSection *section,
>                               hwaddr *pgsize);
> -int vfio_spapr_remove_window(VFIOContainer *container,
> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>                               hwaddr offset_within_address_space);
>  
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
> diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
> new file mode 100644
> index 0000000000..7ffbbb299f
> --- /dev/null
> +++ b/include/hw/vfio/vfio-container-obj.h
> @@ -0,0 +1,154 @@
> +/*
> + * VFIO CONTAINER BASE OBJECT
> + *
> + * Copyright (C) 2022 Intel Corporation.
> + * Copyright Red Hat, Inc. 2022
> + *
> + * Authors: Yi Liu <yi.l.liu@intel.com>
> + *          Eric Auger <eric.auger@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_VFIO_VFIO_CONTAINER_OBJ_H
> +#define HW_VFIO_VFIO_CONTAINER_OBJ_H
> +
> +#include "qom/object.h"
> +#include "exec/memory.h"
> +#include "qemu/queue.h"
> +#include "qemu/thread.h"
> +#ifndef CONFIG_USER_ONLY
> +#include "exec/hwaddr.h"
> +#endif
> +
> +#define TYPE_VFIO_CONTAINER_OBJ "qemu:vfio-base-container-obj"
> +#define VFIO_CONTAINER_OBJ(obj) \
> +        OBJECT_CHECK(VFIOContainer, (obj), TYPE_VFIO_CONTAINER_OBJ)
> +#define VFIO_CONTAINER_OBJ_CLASS(klass) \
> +        OBJECT_CLASS_CHECK(VFIOContainerClass, (klass), \
> +                         TYPE_VFIO_CONTAINER_OBJ)
> +#define VFIO_CONTAINER_OBJ_GET_CLASS(obj) \
> +        OBJECT_GET_CLASS(VFIOContainerClass, (obj), \
> +                         TYPE_VFIO_CONTAINER_OBJ)
> +
> +typedef enum VFIOContainerFeature {
> +    VFIO_FEAT_LIVE_MIGRATION,
> +} VFIOContainerFeature;
> +
> +typedef struct VFIOContainer VFIOContainer;
> +
> +typedef struct VFIOAddressSpace {
> +    AddressSpace *as;
> +    QLIST_HEAD(, VFIOContainer) containers;
> +    QLIST_ENTRY(VFIOAddressSpace) list;
> +} VFIOAddressSpace;
> +
> +typedef struct VFIOGuestIOMMU {
> +    VFIOContainer *container;
> +    IOMMUMemoryRegion *iommu_mr;
> +    hwaddr iommu_offset;
> +    IOMMUNotifier n;
> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> +} VFIOGuestIOMMU;
> +
> +typedef struct VFIORamDiscardListener {
> +    VFIOContainer *container;
> +    MemoryRegion *mr;
> +    hwaddr offset_within_address_space;
> +    hwaddr size;
> +    uint64_t granularity;
> +    RamDiscardListener listener;
> +    QLIST_ENTRY(VFIORamDiscardListener) next;
> +} VFIORamDiscardListener;
> +
> +typedef struct VFIOHostDMAWindow {
> +    hwaddr min_iova;
> +    hwaddr max_iova;
> +    uint64_t iova_pgsizes;
> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
> +} VFIOHostDMAWindow;
> +
> +/*
> + * This is the base object for vfio container backends
> + */
> +struct VFIOContainer {
> +    /* private */
> +    Object parent_obj;
> +
> +    VFIOAddressSpace *space;
> +    MemoryListener listener;
> +    Error *error;
> +    bool initialized;
> +    bool dirty_pages_supported;
> +    uint64_t dirty_pgsizes;
> +    uint64_t max_dirty_bitmap_size;
> +    unsigned long pgsizes;
> +    unsigned int dma_max_mappings;
> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
> +    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
> +    QLIST_ENTRY(VFIOContainer) next;
> +};
> +
> +typedef struct VFIOContainerClass {
> +    /* private */
> +    ObjectClass parent_class;
> +
> +    /* required */
> +    bool (*check_extension)(VFIOContainer *container,
> +                            VFIOContainerFeature feat);
> +    int (*dma_map)(VFIOContainer *container,
> +                   hwaddr iova, ram_addr_t size,
> +                   void *vaddr, bool readonly);
> +    int (*dma_unmap)(VFIOContainer *container,
> +                     hwaddr iova, ram_addr_t size,
> +                     IOMMUTLBEntry *iotlb);
> +    /* migration feature */
> +    bool (*devices_all_dirty_tracking)(VFIOContainer *container);
> +    void (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
> +    int (*get_dirty_bitmap)(VFIOContainer *container, uint64_t iova,
> +                            uint64_t size, ram_addr_t ram_addr);
> +
> +    /* SPAPR specific */
> +    int (*add_window)(VFIOContainer *container,
> +                      MemoryRegionSection *section,
> +                      Error **errp);
> +    void (*del_window)(VFIOContainer *container,
> +                       MemoryRegionSection *section);
> +} VFIOContainerClass;
> +
> +bool vfio_container_check_extension(VFIOContainer *container,
> +                                    VFIOContainerFeature feat);
> +int vfio_container_dma_map(VFIOContainer *container,
> +                           hwaddr iova, ram_addr_t size,
> +                           void *vaddr, bool readonly);
> +int vfio_container_dma_unmap(VFIOContainer *container,
> +                             hwaddr iova, ram_addr_t size,
> +                             IOMMUTLBEntry *iotlb);
> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
> +void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
> +                                            bool start);
> +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
> +                                    uint64_t size, ram_addr_t ram_addr);
> +int vfio_container_add_section_window(VFIOContainer *container,
> +                                      MemoryRegionSection *section,
> +                                      Error **errp);
> +void vfio_container_del_section_window(VFIOContainer *container,
> +                                       MemoryRegionSection *section);
> +
> +void vfio_container_init(void *_container, size_t instance_size,
> +                         const char *mrtypename,
> +                         VFIOAddressSpace *space);
> +void vfio_container_destroy(VFIOContainer *container);
> +#endif /* HW_VFIO_VFIO_CONTAINER_OBJ_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [RFC 07/18] vfio: Add base object for VFIOContainer
  2022-04-29  6:29     ` David Gibson
  (?)
@ 2022-05-03 13:05     ` Yi Liu
  -1 siblings, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-05-03 13:05 UTC (permalink / raw)
  To: David Gibson
  Cc: alex.williamson, cohuck, qemu-devel, thuth, farman, mjrosato,
	akrowiak, pasic, jjherne, jasowang, kvm, jgg, nicolinc,
	eric.auger, eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun,
	peterx

On 2022/4/29 14:29, David Gibson wrote:
> On Thu, Apr 14, 2022 at 03:46:59AM -0700, Yi Liu wrote:
>> Qomify the VFIOContainer object which acts as a base class for a
>> container. This base class is derived into the legacy VFIO container
>> and later on, into the new iommufd based container.
> 
> You certainly need the abstraction, but I'm not sure QOM is the right
> way to accomplish it in this case.  The QOM class of things is visible
> to the user/config layer via QMP (and sometimes command line).  It
> doesn't necessarily correspond to guest visible differences, but it
> often does.
Got it. BTW, this series adds an iommufd option (linked below). Do you think
that suits the notion that a QOM class should mostly be visible to the
user/config layer?

https://lore.kernel.org/kvm/20220414104710.28534-19-yi.l.liu@intel.com/

> AIUI, the idea here is that the back end in use should be an
> implementation detail which doesn't affect the interfaces outside the
> vfio subsystem itself.  If that's the case QOM may not be a great
> fit, even though you can probably make it work.

Yes, currently the implementation detail is internal to the vfio subsystem. So
if the iommufd option isn't a strong enough reason to use QOM for the
abstraction, I may just add a plain abstraction within vfio as you suggested.
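For discussion's sake, a plain (non-QOM) abstraction could be a simple ops
table dispatched through the base container, along the lines of the sketch
below. This is only a hypothetical illustration with toy names
(`VFIOContainerOps`, `Container`, the `mapped` counter); it is not code from
this series, just the function-pointer pattern that would replace the
`VFIOContainerClass` QOM class while keeping the same generic wrappers:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Container Container;

/* Hypothetical per-backend ops table, analogous to VFIOContainerClass
 * but without any QOM machinery. */
typedef struct VFIOContainerOps {
    int (*dma_map)(Container *c, unsigned long iova, size_t size);
    int (*dma_unmap)(Container *c, unsigned long iova, size_t size);
} VFIOContainerOps;

/* Base container embeds a pointer to its backend's ops. */
struct Container {
    const VFIOContainerOps *ops;
    int mapped; /* toy state, just so the example is observable */
};

/* A "legacy" backend implementation of the ops. */
static int legacy_dma_map(Container *c, unsigned long iova, size_t size)
{
    c->mapped++;
    return 0;
}

static int legacy_dma_unmap(Container *c, unsigned long iova, size_t size)
{
    c->mapped--;
    return 0;
}

static const VFIOContainerOps legacy_ops = {
    .dma_map   = legacy_dma_map,
    .dma_unmap = legacy_dma_unmap,
};

/* Generic wrappers, mirroring vfio_container_dma_map()/_unmap():
 * missing callbacks fail with -EINVAL rather than crashing. */
static int container_dma_map(Container *c, unsigned long iova, size_t size)
{
    if (!c->ops->dma_map) {
        return -22; /* -EINVAL */
    }
    return c->ops->dma_map(c, iova, size);
}

static int container_dma_unmap(Container *c, unsigned long iova, size_t size)
{
    if (!c->ops->dma_unmap) {
        return -22; /* -EINVAL */
    }
    return c->ops->dma_unmap(c, iova, size);
}
```

Callers would go through `container_dma_map()` exactly as `as.c` goes through
`vfio_container_dma_map()` in this series; only the dispatch mechanism
(ops pointer vs. `VFIO_CONTAINER_OBJ_GET_CLASS()`) differs.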

>> The base class implements generic code such as code related to
>> memory_listener and address space management whereas the derived
>> class implements callbacks that depend on the kernel user space
>> being used.
>>
>> 'as.c' only manipulates the base class object with wrapper functions
>> that call the right class functions. Existing 'container.c' code is
>> converted to implement the legacy container class functions.
>>
>> Existing migration code only works with the legacy container.
>> Also 'spapr.c' isn't BE agnostic.
>>
>> Below is the object. It's named as VFIOContainer, old VFIOContainer
>> is replaced with VFIOLegacyContainer.
>>
>> struct VFIOContainer {
>>      /* private */
>>      Object parent_obj;
>>
>>      VFIOAddressSpace *space;
>>      MemoryListener listener;
>>      Error *error;
>>      bool initialized;
>>      bool dirty_pages_supported;
>>      uint64_t dirty_pgsizes;
>>      uint64_t max_dirty_bitmap_size;
>>      unsigned long pgsizes;
>>      unsigned int dma_max_mappings;
>>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>      QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>>      QLIST_ENTRY(VFIOContainer) next;
>> };
>>
>> struct VFIOLegacyContainer {
>>      VFIOContainer obj;
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      MemoryListener prereg_listener;
>>      unsigned iommu_type;
>>      QLIST_HEAD(, VFIOGroup) group_list;
>> };
>>
>> Co-authored-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   hw/vfio/as.c                         |  48 +++---
>>   hw/vfio/container-obj.c              | 195 +++++++++++++++++++++++
>>   hw/vfio/container.c                  | 224 ++++++++++++++++-----------
>>   hw/vfio/meson.build                  |   1 +
>>   hw/vfio/migration.c                  |   4 +-
>>   hw/vfio/pci.c                        |   4 +-
>>   hw/vfio/spapr.c                      |  22 +--
>>   include/hw/vfio/vfio-common.h        |  78 ++--------
>>   include/hw/vfio/vfio-container-obj.h | 154 ++++++++++++++++++
>>   9 files changed, 540 insertions(+), 190 deletions(-)
>>   create mode 100644 hw/vfio/container-obj.c
>>   create mode 100644 include/hw/vfio/vfio-container-obj.h
>>
>> diff --git a/hw/vfio/as.c b/hw/vfio/as.c
>> index 4181182808..37423d2c89 100644
>> --- a/hw/vfio/as.c
>> +++ b/hw/vfio/as.c
>> @@ -215,9 +215,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>            * of vaddr will always be there, even if the memory object is
>>            * destroyed and its backing memory munmap-ed.
>>            */
>> -        ret = vfio_dma_map(container, iova,
>> -                           iotlb->addr_mask + 1, vaddr,
>> -                           read_only);
>> +        ret = vfio_container_dma_map(container, iova,
>> +                                     iotlb->addr_mask + 1, vaddr,
>> +                                     read_only);
>>           if (ret) {
>>               error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx", %p) = %d (%m)",
>> @@ -225,7 +225,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>                            iotlb->addr_mask + 1, vaddr, ret);
>>           }
>>       } else {
>> -        ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
>> +        ret = vfio_container_dma_unmap(container, iova,
>> +                                       iotlb->addr_mask + 1, iotlb);
>>           if (ret) {
>>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%m)",
>> @@ -242,12 +243,13 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>>   {
>>       VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>>                                                   listener);
>> +    VFIOContainer *container = vrdl->container;
>>       const hwaddr size = int128_get64(section->size);
>>       const hwaddr iova = section->offset_within_address_space;
>>       int ret;
>>   
>>       /* Unmap with a single call. */
>> -    ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
>> +    ret = vfio_container_dma_unmap(container, iova, size , NULL);
>>       if (ret) {
>>           error_report("%s: vfio_dma_unmap() failed: %s", __func__,
>>                        strerror(-ret));
>> @@ -259,6 +261,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>>   {
>>       VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>>                                                   listener);
>> +    VFIOContainer *container = vrdl->container;
>>       const hwaddr end = section->offset_within_region +
>>                          int128_get64(section->size);
>>       hwaddr start, next, iova;
>> @@ -277,8 +280,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>>                  section->offset_within_address_space;
>>           vaddr = memory_region_get_ram_ptr(section->mr) + start;
>>   
>> -        ret = vfio_dma_map(vrdl->container, iova, next - start,
>> -                           vaddr, section->readonly);
>> +        ret = vfio_container_dma_map(container, iova, next - start,
>> +                                     vaddr, section->readonly);
>>           if (ret) {
>>               /* Rollback */
>>               vfio_ram_discard_notify_discard(rdl, section);
>> @@ -530,8 +533,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>           }
>>       }
>>   
>> -    ret = vfio_dma_map(container, iova, int128_get64(llsize),
>> -                       vaddr, section->readonly);
>> +    ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
>> +                                 vaddr, section->readonly);
>>       if (ret) {
>>           error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>>                      "0x%"HWADDR_PRIx", %p) = %d (%m)",
>> @@ -656,7 +659,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>           if (int128_eq(llsize, int128_2_64())) {
>>               /* The unmap ioctl doesn't accept a full 64-bit span. */
>>               llsize = int128_rshift(llsize, 1);
>> -            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
>> +            ret = vfio_container_dma_unmap(container, iova,
>> +                                           int128_get64(llsize), NULL);
>>               if (ret) {
>>                   error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>                                "0x%"HWADDR_PRIx") = %d (%m)",
>> @@ -664,7 +668,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>               }
>>               iova += int128_get64(llsize);
>>           }
>> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
>> +        ret = vfio_container_dma_unmap(container, iova,
>> +                                       int128_get64(llsize), NULL);
>>           if (ret) {
>>               error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%m)",
>> @@ -681,14 +686,14 @@ static void vfio_listener_log_global_start(MemoryListener *listener)
>>   {
>>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>   
>> -    vfio_set_dirty_page_tracking(container, true);
>> +    vfio_container_set_dirty_page_tracking(container, true);
>>   }
>>   
>>   static void vfio_listener_log_global_stop(MemoryListener *listener)
>>   {
>>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>   
>> -    vfio_set_dirty_page_tracking(container, false);
>> +    vfio_container_set_dirty_page_tracking(container, false);
>>   }
>>   
>>   typedef struct {
>> @@ -717,8 +722,9 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
>>           int ret;
>>   
>> -        ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
>> -                                    translated_addr);
>> +        ret = vfio_container_get_dirty_bitmap(container, iova,
>> +                                              iotlb->addr_mask + 1,
>> +                                              translated_addr);
>>           if (ret) {
>>               error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
>>                            "0x%"HWADDR_PRIx") = %d (%m)",
>> @@ -742,11 +748,13 @@ static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
>>        * Sync the whole mapped region (spanning multiple individual mappings)
>>        * in one go.
>>        */
>> -    return vfio_get_dirty_bitmap(vrdl->container, iova, size, ram_addr);
>> +    return vfio_container_get_dirty_bitmap(vrdl->container, iova,
>> +                                           size, ram_addr);
>>   }
>>   
>> -static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
>> -                                                   MemoryRegionSection *section)
>> +static int
>> +vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
>> +                                            MemoryRegionSection *section)
>>   {
>>       RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
>>       VFIORamDiscardListener *vrdl = NULL;
>> @@ -810,7 +818,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
>>       ram_addr = memory_region_get_ram_addr(section->mr) +
>>                  section->offset_within_region;
>>   
>> -    return vfio_get_dirty_bitmap(container,
>> +    return vfio_container_get_dirty_bitmap(container,
>>                      REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
>>                      int128_get64(section->size), ram_addr);
>>   }
>> @@ -825,7 +833,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>>           return;
>>       }
>>   
>> -    if (vfio_devices_all_dirty_tracking(container)) {
>> +    if (vfio_container_devices_all_dirty_tracking(container)) {
>>           vfio_sync_dirty_bitmap(container, section);
>>       }
>>   }
>> diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
>> new file mode 100644
>> index 0000000000..40c1e2a2b5
>> --- /dev/null
>> +++ b/hw/vfio/container-obj.c
>> @@ -0,0 +1,195 @@
>> +/*
>> + * VFIO CONTAINER BASE OBJECT
>> + *
>> + * Copyright (C) 2022 Intel Corporation.
>> + * Copyright Red Hat, Inc. 2022
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *          Eric Auger <eric.auger@redhat.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "qemu/error-report.h"
>> +#include "qom/object.h"
>> +#include "qapi/visitor.h"
>> +#include "hw/vfio/vfio-container-obj.h"
>> +
>> +bool vfio_container_check_extension(VFIOContainer *container,
>> +                                    VFIOContainerFeature feat)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
>> +
>> +    if (!vccs->check_extension) {
>> +        return false;
>> +    }
>> +
>> +    return vccs->check_extension(container, feat);
>> +}
>> +
>> +int vfio_container_dma_map(VFIOContainer *container,
>> +                           hwaddr iova, ram_addr_t size,
>> +                           void *vaddr, bool readonly)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
>> +
>> +    if (!vccs->dma_map) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return vccs->dma_map(container, iova, size, vaddr, readonly);
>> +}
>> +
>> +int vfio_container_dma_unmap(VFIOContainer *container,
>> +                             hwaddr iova, ram_addr_t size,
>> +                             IOMMUTLBEntry *iotlb)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
>> +
>> +    if (!vccs->dma_unmap) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return vccs->dma_unmap(container, iova, size, iotlb);
>> +}
>> +
>> +void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
>> +                                            bool start)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
>> +
>> +    if (!vccs->set_dirty_page_tracking) {
>> +        return;
>> +    }
>> +
>> +    vccs->set_dirty_page_tracking(container, start);
>> +}
>> +
>> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
>> +
>> +    if (!vccs->devices_all_dirty_tracking) {
>> +        return false;
>> +    }
>> +
>> +    return vccs->devices_all_dirty_tracking(container);
>> +}
>> +
>> +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>> +                                    uint64_t size, ram_addr_t ram_addr)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
>> +
>> +    if (!vccs->get_dirty_bitmap) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    return vccs->get_dirty_bitmap(container, iova, size, ram_addr);
>> +}
>> +
>> +int vfio_container_add_section_window(VFIOContainer *container,
>> +                                      MemoryRegionSection *section,
>> +                                      Error **errp)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
>> +
>> +    if (!vccs->add_window) {
>> +        return 0;
>> +    }
>> +
>> +    return vccs->add_window(container, section, errp);
>> +}
>> +
>> +void vfio_container_del_section_window(VFIOContainer *container,
>> +                                       MemoryRegionSection *section)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
>> +
>> +    if (!vccs->del_window) {
>> +        return;
>> +    }
>> +
>> +    return vccs->del_window(container, section);
>> +}
>> +
>> +void vfio_container_init(void *_container, size_t instance_size,
>> +                         const char *mrtypename,
>> +                         VFIOAddressSpace *space)
>> +{
>> +    VFIOContainer *container;
>> +
>> +    object_initialize(_container, instance_size, mrtypename);
>> +    container = VFIO_CONTAINER_OBJ(_container);
>> +
>> +    container->space = space;
>> +    container->error = NULL;
>> +    container->dirty_pages_supported = false;
>> +    container->dma_max_mappings = 0;
>> +    QLIST_INIT(&container->giommu_list);
>> +    QLIST_INIT(&container->hostwin_list);
>> +    QLIST_INIT(&container->vrdl_list);
>> +}
>> +
>> +void vfio_container_destroy(VFIOContainer *container)
>> +{
>> +    VFIORamDiscardListener *vrdl, *vrdl_tmp;
>> +    VFIOGuestIOMMU *giommu, *tmp;
>> +    VFIOHostDMAWindow *hostwin, *next;
>> +
>> +    QLIST_SAFE_REMOVE(container, next);
>> +
>> +    QLIST_FOREACH_SAFE(vrdl, &container->vrdl_list, next, vrdl_tmp) {
>> +        RamDiscardManager *rdm;
>> +
>> +        rdm = memory_region_get_ram_discard_manager(vrdl->mr);
>> +        ram_discard_manager_unregister_listener(rdm, &vrdl->listener);
>> +        QLIST_REMOVE(vrdl, next);
>> +        g_free(vrdl);
>> +    }
>> +
>> +    QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
>> +        memory_region_unregister_iommu_notifier(
>> +                MEMORY_REGION(giommu->iommu_mr), &giommu->n);
>> +        QLIST_REMOVE(giommu, giommu_next);
>> +        g_free(giommu);
>> +    }
>> +
>> +    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
>> +                       next) {
>> +        QLIST_REMOVE(hostwin, hostwin_next);
>> +        g_free(hostwin);
>> +    }
>> +
>> +    object_unref(&container->parent_obj);
>> +}
>> +
>> +static const TypeInfo vfio_container_info = {
>> +    .parent             = TYPE_OBJECT,
>> +    .name               = TYPE_VFIO_CONTAINER_OBJ,
>> +    .class_size         = sizeof(VFIOContainerClass),
>> +    .instance_size      = sizeof(VFIOContainer),
>> +    .abstract           = true,
>> +};
>> +
>> +static void vfio_container_register_types(void)
>> +{
>> +    type_register_static(&vfio_container_info);
>> +}
>> +
>> +type_init(vfio_container_register_types)
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 9c665c1720..79972064d3 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -50,6 +50,8 @@
>>   static int vfio_kvm_device_fd = -1;
>>   #endif
>>   
>> +#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
>> +
>>   VFIOGroupList vfio_group_list =
>>       QLIST_HEAD_INITIALIZER(vfio_group_list);
>>   
>> @@ -76,8 +78,10 @@ bool vfio_mig_active(void)
>>       return true;
>>   }
>>   
>> -bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>> +static bool vfio_devices_all_dirty_tracking(VFIOContainer *bcontainer)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer, obj);
>>       VFIOGroup *group;
>>       VFIODevice *vbasedev;
>>       MigrationState *ms = migrate_get_current();
>> @@ -103,7 +107,7 @@ bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>       return true;
>>   }
>>   
>> -bool vfio_devices_all_running_and_saving(VFIOContainer *container)
>> +static bool vfio_devices_all_running_and_saving(VFIOLegacyContainer *container)
>>   {
>>       VFIOGroup *group;
>>       VFIODevice *vbasedev;
>> @@ -132,10 +136,11 @@ bool vfio_devices_all_running_and_saving(VFIOContainer *container)
>>       return true;
>>   }
>>   
>> -static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>> +static int vfio_dma_unmap_bitmap(VFIOLegacyContainer *container,
>>                                    hwaddr iova, ram_addr_t size,
>>                                    IOMMUTLBEntry *iotlb)
>>   {
>> +    VFIOContainer *bcontainer = &container->obj;
>>       struct vfio_iommu_type1_dma_unmap *unmap;
>>       struct vfio_bitmap *bitmap;
>>       uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
>> @@ -159,7 +164,7 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
>>       bitmap->size = ROUND_UP(pages, sizeof(__u64) * BITS_PER_BYTE) /
>>                      BITS_PER_BYTE;
>>   
>> -    if (bitmap->size > container->max_dirty_bitmap_size) {
>> +    if (bitmap->size > bcontainer->max_dirty_bitmap_size) {
>>           error_report("UNMAP: Size of bitmap too big 0x%"PRIx64,
>>                        (uint64_t)bitmap->size);
>>           ret = -E2BIG;
>> @@ -189,10 +194,12 @@ unmap_exit:
>>   /*
>>    * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
>>    */
>> -int vfio_dma_unmap(VFIOContainer *container,
>> -                   hwaddr iova, ram_addr_t size,
>> -                   IOMMUTLBEntry *iotlb)
>> +static int vfio_dma_unmap(VFIOContainer *bcontainer,
>> +                          hwaddr iova, ram_addr_t size,
>> +                          IOMMUTLBEntry *iotlb)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer, obj);
>>       struct vfio_iommu_type1_dma_unmap unmap = {
>>           .argsz = sizeof(unmap),
>>           .flags = 0,
>> @@ -200,7 +207,7 @@ int vfio_dma_unmap(VFIOContainer *container,
>>           .size = size,
>>       };
>>   
>> -    if (iotlb && container->dirty_pages_supported &&
>> +    if (iotlb && bcontainer->dirty_pages_supported &&
>>           vfio_devices_all_running_and_saving(container)) {
>>           return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>>       }
>> @@ -221,7 +228,7 @@ int vfio_dma_unmap(VFIOContainer *container,
>>           if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
>>               container->iommu_type == VFIO_TYPE1v2_IOMMU) {
>>               trace_vfio_dma_unmap_overflow_workaround();
>> -            unmap.size -= 1ULL << ctz64(container->pgsizes);
>> +            unmap.size -= 1ULL << ctz64(bcontainer->pgsizes);
>>               continue;
>>           }
>>           error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
>> @@ -231,9 +238,22 @@ int vfio_dma_unmap(VFIOContainer *container,
>>       return 0;
>>   }
>>   
>> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> -                 ram_addr_t size, void *vaddr, bool readonly)
>> +static bool vfio_legacy_container_check_extension(VFIOContainer *bcontainer,
>> +                                                  VFIOContainerFeature feat)
>>   {
>> +    switch (feat) {
>> +    case VFIO_FEAT_LIVE_MIGRATION:
>> +        return true;
>> +    default:
>> +        return false;
>> +    };
>> +}
>> +
>> +static int vfio_dma_map(VFIOContainer *bcontainer, hwaddr iova,
>> +                       ram_addr_t size, void *vaddr, bool readonly)
>> +{
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer, obj);
>>       struct vfio_iommu_type1_dma_map map = {
>>           .argsz = sizeof(map),
>>           .flags = VFIO_DMA_MAP_FLAG_READ,
>> @@ -252,7 +272,7 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>        * the VGA ROM space.
>>        */
>>       if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
>> -        (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
>> +        (errno == EBUSY && vfio_dma_unmap(bcontainer, iova, size, NULL) == 0 &&
>>            ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
>>           return 0;
>>       }
>> @@ -261,8 +281,10 @@ int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>       return -errno;
>>   }
>>   
>> -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>> +static void vfio_set_dirty_page_tracking(VFIOContainer *bcontainer, bool start)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer, obj);
>>       int ret;
>>       struct vfio_iommu_type1_dirty_bitmap dirty = {
>>           .argsz = sizeof(dirty),
>> @@ -281,9 +303,11 @@ void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>>       }
>>   }
>>   
>> -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>> -                          uint64_t size, ram_addr_t ram_addr)
>> +static int vfio_get_dirty_bitmap(VFIOContainer *bcontainer, uint64_t iova,
>> +                                 uint64_t size, ram_addr_t ram_addr)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer, obj);
>>       struct vfio_iommu_type1_dirty_bitmap *dbitmap;
>>       struct vfio_iommu_type1_dirty_bitmap_get *range;
>>       uint64_t pages;
>> @@ -333,18 +357,23 @@ err_out:
>>       return ret;
>>   }
>>   
>> -static void vfio_listener_release(VFIOContainer *container)
>> +static void vfio_listener_release(VFIOLegacyContainer *container)
>>   {
>> -    memory_listener_unregister(&container->listener);
>> +    VFIOContainer *bcontainer = &container->obj;
>> +
>> +    memory_listener_unregister(&bcontainer->listener);
>>       if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>           memory_listener_unregister(&container->prereg_listener);
>>       }
>>   }
>>   
>> -int vfio_container_add_section_window(VFIOContainer *container,
>> -                                      MemoryRegionSection *section,
>> -                                      Error **errp)
>> +static int
>> +vfio_legacy_container_add_section_window(VFIOContainer *bcontainer,
>> +                                         MemoryRegionSection *section,
>> +                                         Error **errp)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer, obj);
>>       VFIOHostDMAWindow *hostwin;
>>       hwaddr pgsize = 0;
>>       int ret;
>> @@ -354,7 +383,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
>>       }
>>   
>>       /* For now intersections are not allowed, we may relax this later */
>> -    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
>> +    QLIST_FOREACH(hostwin, &bcontainer->hostwin_list, hostwin_next) {
>>           if (ranges_overlap(hostwin->min_iova,
>>                              hostwin->max_iova - hostwin->min_iova + 1,
>>                              section->offset_within_address_space,
>> @@ -376,7 +405,7 @@ int vfio_container_add_section_window(VFIOContainer *container,
>>           return ret;
>>       }
>>   
>> -    vfio_host_win_add(container, section->offset_within_address_space,
>> +    vfio_host_win_add(bcontainer, section->offset_within_address_space,
>>                         section->offset_within_address_space +
>>                         int128_get64(section->size) - 1, pgsize);
>>   #ifdef CONFIG_KVM
>> @@ -409,16 +438,20 @@ int vfio_container_add_section_window(VFIOContainer *container,
>>       return 0;
>>   }
>>   
>> -void vfio_container_del_section_window(VFIOContainer *container,
>> -                                       MemoryRegionSection *section)
>> +static void
>> +vfio_legacy_container_del_section_window(VFIOContainer *bcontainer,
>> +                                         MemoryRegionSection *section)
>>   {
>> +    VFIOLegacyContainer *container = container_of(bcontainer,
>> +                                                  VFIOLegacyContainer, obj);
>> +
>>       if (container->iommu_type != VFIO_SPAPR_TCE_v2_IOMMU) {
>>           return;
>>       }
>>   
>>       vfio_spapr_remove_window(container,
>>                                section->offset_within_address_space);
>> -    if (vfio_host_win_del(container,
>> +    if (vfio_host_win_del(bcontainer,
>>                             section->offset_within_address_space,
>>                             section->offset_within_address_space +
>>                             int128_get64(section->size) - 1) < 0) {
>> @@ -505,7 +538,7 @@ static void vfio_kvm_device_del_group(VFIOGroup *group)
>>   /*
>>    * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>>    */
>> -static int vfio_get_iommu_type(VFIOContainer *container,
>> +static int vfio_get_iommu_type(VFIOLegacyContainer *container,
>>                                  Error **errp)
>>   {
>>       int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>> @@ -521,7 +554,7 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>>       return -EINVAL;
>>   }
>>   
>> -static int vfio_init_container(VFIOContainer *container, int group_fd,
>> +static int vfio_init_container(VFIOLegacyContainer *container, int group_fd,
>>                                  Error **errp)
>>   {
>>       int iommu_type, ret;
>> @@ -556,7 +589,7 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>       return 0;
>>   }
>>   
>> -static int vfio_get_iommu_info(VFIOContainer *container,
>> +static int vfio_get_iommu_info(VFIOLegacyContainer *container,
>>                                  struct vfio_iommu_type1_info **info)
>>   {
>>   
>> @@ -600,11 +633,12 @@ vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
>>       return NULL;
>>   }
>>   
>> -static void vfio_get_iommu_info_migration(VFIOContainer *container,
>> -                                         struct vfio_iommu_type1_info *info)
>> +static void vfio_get_iommu_info_migration(VFIOLegacyContainer *container,
>> +                                          struct vfio_iommu_type1_info *info)
>>   {
>>       struct vfio_info_cap_header *hdr;
>>       struct vfio_iommu_type1_info_cap_migration *cap_mig;
>> +    VFIOContainer *bcontainer = &container->obj;
>>   
>>       hdr = vfio_get_iommu_info_cap(info, VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION);
>>       if (!hdr) {
>> @@ -619,13 +653,14 @@ static void vfio_get_iommu_info_migration(VFIOContainer *container,
>>        * qemu_real_host_page_size to mark those dirty.
>>        */
>>       if (cap_mig->pgsize_bitmap & qemu_real_host_page_size) {
>> -        container->dirty_pages_supported = true;
>> -        container->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
>> -        container->dirty_pgsizes = cap_mig->pgsize_bitmap;
>> +        bcontainer->dirty_pages_supported = true;
>> +        bcontainer->max_dirty_bitmap_size = cap_mig->max_dirty_bitmap_size;
>> +        bcontainer->dirty_pgsizes = cap_mig->pgsize_bitmap;
>>       }
>>   }
>>   
>> -static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>> +static int
>> +vfio_ram_block_discard_disable(VFIOLegacyContainer *container, bool state)
>>   {
>>       switch (container->iommu_type) {
>>       case VFIO_TYPE1v2_IOMMU:
>> @@ -651,7 +686,8 @@ static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state)
>>   static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>                                     Error **errp)
>>   {
>> -    VFIOContainer *container;
>> +    VFIOContainer *bcontainer;
>> +    VFIOLegacyContainer *container;
>>       int ret, fd;
>>       VFIOAddressSpace *space;
>>   
>> @@ -688,7 +724,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>        * details once we know which type of IOMMU we are using.
>>        */
>>   
>> -    QLIST_FOREACH(container, &space->containers, next) {
>> +    QLIST_FOREACH(bcontainer, &space->containers, next) {
>> +        container = container_of(bcontainer, VFIOLegacyContainer, obj);
>>           if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>               ret = vfio_ram_block_discard_disable(container, true);
>>               if (ret) {
>> @@ -724,14 +761,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       }
>>   
>>       container = g_malloc0(sizeof(*container));
>> -    container->space = space;
>>       container->fd = fd;
>> -    container->error = NULL;
>> -    container->dirty_pages_supported = false;
>> -    container->dma_max_mappings = 0;
>> -    QLIST_INIT(&container->giommu_list);
>> -    QLIST_INIT(&container->hostwin_list);
>> -    QLIST_INIT(&container->vrdl_list);
>> +    bcontainer = &container->obj;
>> +    vfio_container_init(bcontainer, sizeof(*bcontainer),
>> +                        TYPE_VFIO_LEGACY_CONTAINER, space);
>>   
>>       ret = vfio_init_container(container, group->fd, errp);
>>       if (ret) {
>> @@ -763,13 +796,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>               /* Assume 4k IOVA page size */
>>               info->iova_pgsizes = 4096;
>>           }
>> -        vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes);
>> -        container->pgsizes = info->iova_pgsizes;
>> +        vfio_host_win_add(bcontainer, 0, (hwaddr)-1, info->iova_pgsizes);
>> +        bcontainer->pgsizes = info->iova_pgsizes;
>>   
>>           /* The default in the kernel ("dma_entry_limit") is 65535. */
>> -        container->dma_max_mappings = 65535;
>> +        bcontainer->dma_max_mappings = 65535;
>>           if (!ret) {
>> -            vfio_get_info_dma_avail(info, &container->dma_max_mappings);
>> +            vfio_get_info_dma_avail(info, &bcontainer->dma_max_mappings);
>>               vfio_get_iommu_info_migration(container, info);
>>           }
>>           g_free(info);
>> @@ -798,10 +831,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>   
>>               memory_listener_register(&container->prereg_listener,
>>                                        &address_space_memory);
>> -            if (container->error) {
>> +            if (bcontainer->error) {
>>                   memory_listener_unregister(&container->prereg_listener);
>>                   ret = -1;
>> -                error_propagate_prepend(errp, container->error,
>> +                error_propagate_prepend(errp, bcontainer->error,
>>                       "RAM memory listener initialization failed: ");
>>                   goto enable_discards_exit;
>>               }
>> @@ -820,7 +853,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>           }
>>   
>>           if (v2) {
>> -            container->pgsizes = info.ddw.pgsizes;
>> +            bcontainer->pgsizes = info.ddw.pgsizes;
>>               /*
>>                * There is a default window in just created container.
>>                * To make region_add/del simpler, we better remove this
>> @@ -835,8 +868,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>               }
>>           } else {
>>               /* The default table uses 4K pages */
>> -            container->pgsizes = 0x1000;
>> -            vfio_host_win_add(container, info.dma32_window_start,
>> +            bcontainer->pgsizes = 0x1000;
>> +            vfio_host_win_add(bcontainer, info.dma32_window_start,
>>                                 info.dma32_window_start +
>>                                 info.dma32_window_size - 1,
>>                                 0x1000);
>> @@ -847,28 +880,28 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       vfio_kvm_device_add_group(group);
>>   
>>       QLIST_INIT(&container->group_list);
>> -    QLIST_INSERT_HEAD(&space->containers, container, next);
>> +    QLIST_INSERT_HEAD(&space->containers, bcontainer, next);
>>   
>>       group->container = container;
>>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>   
>> -    container->listener = vfio_memory_listener;
>> +    bcontainer->listener = vfio_memory_listener;
>>   
>> -    memory_listener_register(&container->listener, container->space->as);
>> +    memory_listener_register(&bcontainer->listener, bcontainer->space->as);
>>   
>> -    if (container->error) {
>> +    if (bcontainer->error) {
>>           ret = -1;
>> -        error_propagate_prepend(errp, container->error,
>> +        error_propagate_prepend(errp, bcontainer->error,
>>               "memory listener initialization failed: ");
>>           goto listener_release_exit;
>>       }
>>   
>> -    container->initialized = true;
>> +    bcontainer->initialized = true;
>>   
>>       return 0;
>>   listener_release_exit:
>>       QLIST_REMOVE(group, container_next);
>> -    QLIST_REMOVE(container, next);
>> +    QLIST_REMOVE(bcontainer, next);
>>       vfio_kvm_device_del_group(group);
>>       vfio_listener_release(container);
>>   
>> @@ -889,7 +922,8 @@ put_space_exit:
>>   
>>   static void vfio_disconnect_container(VFIOGroup *group)
>>   {
>> -    VFIOContainer *container = group->container;
>> +    VFIOLegacyContainer *container = group->container;
>> +    VFIOContainer *bcontainer = &container->obj;
>>   
>>       QLIST_REMOVE(group, container_next);
>>       group->container = NULL;
>> @@ -909,25 +943,9 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>       }
>>   
>>       if (QLIST_EMPTY(&container->group_list)) {
>> -        VFIOAddressSpace *space = container->space;
>> -        VFIOGuestIOMMU *giommu, *tmp;
>> -        VFIOHostDMAWindow *hostwin, *next;
>> -
>> -        QLIST_REMOVE(container, next);
>> -
>> -        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
>> -            memory_region_unregister_iommu_notifier(
>> -                    MEMORY_REGION(giommu->iommu_mr), &giommu->n);
>> -            QLIST_REMOVE(giommu, giommu_next);
>> -            g_free(giommu);
>> -        }
>> -
>> -        QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
>> -                           next) {
>> -            QLIST_REMOVE(hostwin, hostwin_next);
>> -            g_free(hostwin);
>> -        }
>> +        VFIOAddressSpace *space = bcontainer->space;
>>   
>> +        vfio_container_destroy(bcontainer);
>>           trace_vfio_disconnect_container(container->fd);
>>           close(container->fd);
>>           g_free(container);
>> @@ -939,13 +957,15 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>   VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>   {
>>       VFIOGroup *group;
>> +    VFIOContainer *bcontainer;
>>       char path[32];
>>       struct vfio_group_status status = { .argsz = sizeof(status) };
>>   
>>       QLIST_FOREACH(group, &vfio_group_list, next) {
>>           if (group->groupid == groupid) {
>>               /* Found it.  Now is it already in the right context? */
>> -            if (group->container->space->as == as) {
>> +            bcontainer = &group->container->obj;
>> +            if (bcontainer->space->as == as) {
>>                   return group;
>>               } else {
>>                   error_setg(errp, "group %d used in multiple address spaces",
>> @@ -1098,7 +1118,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>>   /*
>>    * Interfaces for IBM EEH (Enhanced Error Handling)
>>    */
>> -static bool vfio_eeh_container_ok(VFIOContainer *container)
>> +static bool vfio_eeh_container_ok(VFIOLegacyContainer *container)
>>   {
>>       /*
>>        * As of 2016-03-04 (linux-4.5) the host kernel EEH/VFIO
>> @@ -1126,7 +1146,7 @@ static bool vfio_eeh_container_ok(VFIOContainer *container)
>>       return true;
>>   }
>>   
>> -static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
>> +static int vfio_eeh_container_op(VFIOLegacyContainer *container, uint32_t op)
>>   {
>>       struct vfio_eeh_pe_op pe_op = {
>>           .argsz = sizeof(pe_op),
>> @@ -1149,19 +1169,21 @@ static int vfio_eeh_container_op(VFIOContainer *container, uint32_t op)
>>       return ret;
>>   }
>>   
>> -static VFIOContainer *vfio_eeh_as_container(AddressSpace *as)
>> +static VFIOLegacyContainer *vfio_eeh_as_container(AddressSpace *as)
>>   {
>>       VFIOAddressSpace *space = vfio_get_address_space(as);
>> -    VFIOContainer *container = NULL;
>> +    VFIOLegacyContainer *container = NULL;
>> +    VFIOContainer *bcontainer = NULL;
>>   
>>       if (QLIST_EMPTY(&space->containers)) {
>>           /* No containers to act on */
>>           goto out;
>>       }
>>   
>> -    container = QLIST_FIRST(&space->containers);
>> +    bcontainer = QLIST_FIRST(&space->containers);
>> +    container = container_of(bcontainer, VFIOLegacyContainer, obj);
>>   
>> -    if (QLIST_NEXT(container, next)) {
>> +    if (QLIST_NEXT(bcontainer, next)) {
>>           /*
>>            * We don't yet have logic to synchronize EEH state across
>>            * multiple containers.
>> @@ -1177,17 +1199,45 @@ out:
>>   
>>   bool vfio_eeh_as_ok(AddressSpace *as)
>>   {
>> -    VFIOContainer *container = vfio_eeh_as_container(as);
>> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>>   
>>       return (container != NULL) && vfio_eeh_container_ok(container);
>>   }
>>   
>>   int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>>   {
>> -    VFIOContainer *container = vfio_eeh_as_container(as);
>> +    VFIOLegacyContainer *container = vfio_eeh_as_container(as);
>>   
>>       if (!container) {
>>           return -ENODEV;
>>       }
>>       return vfio_eeh_container_op(container, op);
>>   }
>> +
>> +static void vfio_legacy_container_class_init(ObjectClass *klass,
>> +                                             void *data)
>> +{
>> +    VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_CLASS(klass);
>> +
>> +    vccs->dma_map = vfio_dma_map;
>> +    vccs->dma_unmap = vfio_dma_unmap;
>> +    vccs->devices_all_dirty_tracking = vfio_devices_all_dirty_tracking;
>> +    vccs->set_dirty_page_tracking = vfio_set_dirty_page_tracking;
>> +    vccs->get_dirty_bitmap = vfio_get_dirty_bitmap;
>> +    vccs->add_window = vfio_legacy_container_add_section_window;
>> +    vccs->del_window = vfio_legacy_container_del_section_window;
>> +    vccs->check_extension = vfio_legacy_container_check_extension;
>> +}
>> +
>> +static const TypeInfo vfio_legacy_container_info = {
>> +    .parent = TYPE_VFIO_CONTAINER_OBJ,
>> +    .name = TYPE_VFIO_LEGACY_CONTAINER,
>> +    .class_init = vfio_legacy_container_class_init,
>> +};
>> +
>> +static void vfio_register_types(void)
>> +{
>> +    type_register_static(&vfio_legacy_container_info);
>> +}
>> +
>> +type_init(vfio_register_types)
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index e3b6d6e2cb..df4fa2b695 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -2,6 +2,7 @@ vfio_ss = ss.source_set()
>>   vfio_ss.add(files(
>>     'common.c',
>>     'as.c',
>> +  'container-obj.c',
>>     'container.c',
>>     'spapr.c',
>>     'migration.c',
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index ff6b45de6b..cbbde177c3 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -856,11 +856,11 @@ int64_t vfio_mig_bytes_transferred(void)
>>   
>>   int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>>   {
>> -    VFIOContainer *container = vbasedev->group->container;
>> +    VFIOLegacyContainer *container = vbasedev->group->container;
>>       struct vfio_region_info *info = NULL;
>>       int ret = -ENOTSUP;
>>   
>> -    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
>> +    if (!vbasedev->enable_migration || !container->obj.dirty_pages_supported) {
>>           goto add_blocker;
>>       }
>>   
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index e707329394..a00a485e46 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3101,7 +3101,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>           }
>>       }
>>   
>> -    if (!pdev->failover_pair_id) {
>> +    if (!pdev->failover_pair_id &&
>> +        vfio_container_check_extension(&vbasedev->group->container->obj,
>> +                                       VFIO_FEAT_LIVE_MIGRATION)) {
>>           ret = vfio_migration_probe(vbasedev, errp);
>>           if (ret) {
>>               error_report("%s: Migration disabled", vbasedev->name);
>> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
>> index 04c6e67f8f..cdcd9e05ba 100644
>> --- a/hw/vfio/spapr.c
>> +++ b/hw/vfio/spapr.c
>> @@ -39,8 +39,8 @@ static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
>>   static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>                                               MemoryRegionSection *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            prereg_listener);
>> +    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
>> +                                                  prereg_listener);
>>       const hwaddr gpa = section->offset_within_address_space;
>>       hwaddr end;
>>       int ret;
>> @@ -83,9 +83,9 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>            * can gracefully fail.  Runtime, there's not much we can do other
>>            * than throw a hardware error.
>>            */
>> -        if (!container->initialized) {
>> -            if (!container->error) {
>> -                error_setg_errno(&container->error, -ret,
>> +        if (!container->obj.initialized) {
>> +            if (!container->obj.error) {
>> +                error_setg_errno(&container->obj.error, -ret,
>>                                    "Memory registering failed");
>>               }
>>           } else {
>> @@ -97,8 +97,8 @@ static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>   static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>                                               MemoryRegionSection *section)
>>   {
>> -    VFIOContainer *container = container_of(listener, VFIOContainer,
>> -                                            prereg_listener);
>> +    VFIOLegacyContainer *container = container_of(listener, VFIOLegacyContainer,
>> +                                                  prereg_listener);
>>       const hwaddr gpa = section->offset_within_address_space;
>>       hwaddr end;
>>       int ret;
>> @@ -141,7 +141,7 @@ const MemoryListener vfio_prereg_listener = {
>>       .region_del = vfio_prereg_listener_region_del,
>>   };
>>   
>> -int vfio_spapr_create_window(VFIOContainer *container,
>> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>>                                MemoryRegionSection *section,
>>                                hwaddr *pgsize)
>>   {
>> @@ -159,13 +159,13 @@ int vfio_spapr_create_window(VFIOContainer *container,
>>       if (pagesize > rampagesize) {
>>           pagesize = rampagesize;
>>       }
>> -    pgmask = container->pgsizes & (pagesize | (pagesize - 1));
>> +    pgmask = container->obj.pgsizes & (pagesize | (pagesize - 1));
>>       pagesize = pgmask ? (1ULL << (63 - clz64(pgmask))) : 0;
>>       if (!pagesize) {
>>           error_report("Host doesn't support page size 0x%"PRIx64
>>                        ", the supported mask is 0x%lx",
>>                        memory_region_iommu_get_min_page_size(iommu_mr),
>> -                     container->pgsizes);
>> +                     container->obj.pgsizes);
>>           return -EINVAL;
>>       }
>>   
>> @@ -233,7 +233,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>>       return 0;
>>   }
>>   
>> -int vfio_spapr_remove_window(VFIOContainer *container,
>> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>>                                hwaddr offset_within_address_space)
>>   {
>>       struct vfio_iommu_spapr_tce_remove remove = {
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 03ff7944cb..02a6f36a9e 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -30,6 +30,7 @@
>>   #include <linux/vfio.h>
>>   #endif
>>   #include "sysemu/sysemu.h"
>> +#include "hw/vfio/vfio-container-obj.h"
>>   
>>   #define VFIO_MSG_PREFIX "vfio %s: "
>>   
>> @@ -70,58 +71,15 @@ typedef struct VFIOMigration {
>>       uint64_t pending_bytes;
>>   } VFIOMigration;
>>   
>> -typedef struct VFIOAddressSpace {
>> -    AddressSpace *as;
>> -    QLIST_HEAD(, VFIOContainer) containers;
>> -    QLIST_ENTRY(VFIOAddressSpace) list;
>> -} VFIOAddressSpace;
>> -
>>   struct VFIOGroup;
>>   
>> -typedef struct VFIOContainer {
>> -    VFIOAddressSpace *space;
>> +typedef struct VFIOLegacyContainer {
>> +    VFIOContainer obj;
>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> -    MemoryListener listener;
>>       MemoryListener prereg_listener;
>>       unsigned iommu_type;
>> -    Error *error;
>> -    bool initialized;
>> -    bool dirty_pages_supported;
>> -    uint64_t dirty_pgsizes;
>> -    uint64_t max_dirty_bitmap_size;
>> -    unsigned long pgsizes;
>> -    unsigned int dma_max_mappings;
>> -    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>> -    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>       QLIST_HEAD(, VFIOGroup) group_list;
>> -    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>> -    QLIST_ENTRY(VFIOContainer) next;
>> -} VFIOContainer;
>> -
>> -typedef struct VFIOGuestIOMMU {
>> -    VFIOContainer *container;
>> -    IOMMUMemoryRegion *iommu_mr;
>> -    hwaddr iommu_offset;
>> -    IOMMUNotifier n;
>> -    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> -} VFIOGuestIOMMU;
>> -
>> -typedef struct VFIORamDiscardListener {
>> -    VFIOContainer *container;
>> -    MemoryRegion *mr;
>> -    hwaddr offset_within_address_space;
>> -    hwaddr size;
>> -    uint64_t granularity;
>> -    RamDiscardListener listener;
>> -    QLIST_ENTRY(VFIORamDiscardListener) next;
>> -} VFIORamDiscardListener;
>> -
>> -typedef struct VFIOHostDMAWindow {
>> -    hwaddr min_iova;
>> -    hwaddr max_iova;
>> -    uint64_t iova_pgsizes;
>> -    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
>> -} VFIOHostDMAWindow;
>> +} VFIOLegacyContainer;
>>   
>>   typedef struct VFIODeviceOps VFIODeviceOps;
>>   
>> @@ -159,7 +117,7 @@ struct VFIODeviceOps {
>>   typedef struct VFIOGroup {
>>       int fd;
>>       int groupid;
>> -    VFIOContainer *container;
>> +    VFIOLegacyContainer *container;
>>       QLIST_HEAD(, VFIODevice) device_list;
>>       QLIST_ENTRY(VFIOGroup) next;
>>       QLIST_ENTRY(VFIOGroup) container_next;
>> @@ -192,31 +150,13 @@ typedef struct VFIODisplay {
>>       } dmabuf;
>>   } VFIODisplay;
>>   
>> -void vfio_host_win_add(VFIOContainer *container,
>> +void vfio_host_win_add(VFIOContainer *bcontainer,
>>                          hwaddr min_iova, hwaddr max_iova,
>>                          uint64_t iova_pgsizes);
>> -int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
>> +int vfio_host_win_del(VFIOContainer *bcontainer, hwaddr min_iova,
>>                         hwaddr max_iova);
>>   VFIOAddressSpace *vfio_get_address_space(AddressSpace *as);
>>   void vfio_put_address_space(VFIOAddressSpace *space);
>> -bool vfio_devices_all_running_and_saving(VFIOContainer *container);
>> -bool vfio_devices_all_dirty_tracking(VFIOContainer *container);
>> -
>> -/* container->fd */
>> -int vfio_dma_unmap(VFIOContainer *container,
>> -                   hwaddr iova, ram_addr_t size,
>> -                   IOMMUTLBEntry *iotlb);
>> -int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>> -                 ram_addr_t size, void *vaddr, bool readonly);
>> -void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start);
>> -int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>> -                          uint64_t size, ram_addr_t ram_addr);
>> -
>> -int vfio_container_add_section_window(VFIOContainer *container,
>> -                                      MemoryRegionSection *section,
>> -                                      Error **errp);
>> -void vfio_container_del_section_window(VFIOContainer *container,
>> -                                       MemoryRegionSection *section);
>>   
>>   void vfio_put_base_device(VFIODevice *vbasedev);
>>   void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
>> @@ -263,10 +203,10 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>>   #endif
>>   extern const MemoryListener vfio_prereg_listener;
>>   
>> -int vfio_spapr_create_window(VFIOContainer *container,
>> +int vfio_spapr_create_window(VFIOLegacyContainer *container,
>>                                MemoryRegionSection *section,
>>                                hwaddr *pgsize);
>> -int vfio_spapr_remove_window(VFIOContainer *container,
>> +int vfio_spapr_remove_window(VFIOLegacyContainer *container,
>>                                hwaddr offset_within_address_space);
>>   
>>   int vfio_migration_probe(VFIODevice *vbasedev, Error **errp);
>> diff --git a/include/hw/vfio/vfio-container-obj.h b/include/hw/vfio/vfio-container-obj.h
>> new file mode 100644
>> index 0000000000..7ffbbb299f
>> --- /dev/null
>> +++ b/include/hw/vfio/vfio-container-obj.h
>> @@ -0,0 +1,154 @@
>> +/*
>> + * VFIO CONTAINER BASE OBJECT
>> + *
>> + * Copyright (C) 2022 Intel Corporation.
>> + * Copyright Red Hat, Inc. 2022
>> + *
>> + * Authors: Yi Liu <yi.l.liu@intel.com>
>> + *          Eric Auger <eric.auger@redhat.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> +
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> +
>> + * You should have received a copy of the GNU General Public License along
>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef HW_VFIO_VFIO_CONTAINER_OBJ_H
>> +#define HW_VFIO_VFIO_CONTAINER_OBJ_H
>> +
>> +#include "qom/object.h"
>> +#include "exec/memory.h"
>> +#include "qemu/queue.h"
>> +#include "qemu/thread.h"
>> +#ifndef CONFIG_USER_ONLY
>> +#include "exec/hwaddr.h"
>> +#endif
>> +
>> +#define TYPE_VFIO_CONTAINER_OBJ "qemu:vfio-base-container-obj"
>> +#define VFIO_CONTAINER_OBJ(obj) \
>> +        OBJECT_CHECK(VFIOContainer, (obj), TYPE_VFIO_CONTAINER_OBJ)
>> +#define VFIO_CONTAINER_OBJ_CLASS(klass) \
>> +        OBJECT_CLASS_CHECK(VFIOContainerClass, (klass), \
>> +                         TYPE_VFIO_CONTAINER_OBJ)
>> +#define VFIO_CONTAINER_OBJ_GET_CLASS(obj) \
>> +        OBJECT_GET_CLASS(VFIOContainerClass, (obj), \
>> +                         TYPE_VFIO_CONTAINER_OBJ)
>> +
>> +typedef enum VFIOContainerFeature {
>> +    VFIO_FEAT_LIVE_MIGRATION,
>> +} VFIOContainerFeature;
>> +
>> +typedef struct VFIOContainer VFIOContainer;
>> +
>> +typedef struct VFIOAddressSpace {
>> +    AddressSpace *as;
>> +    QLIST_HEAD(, VFIOContainer) containers;
>> +    QLIST_ENTRY(VFIOAddressSpace) list;
>> +} VFIOAddressSpace;
>> +
>> +typedef struct VFIOGuestIOMMU {
>> +    VFIOContainer *container;
>> +    IOMMUMemoryRegion *iommu_mr;
>> +    hwaddr iommu_offset;
>> +    IOMMUNotifier n;
>> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> +} VFIOGuestIOMMU;
>> +
>> +typedef struct VFIORamDiscardListener {
>> +    VFIOContainer *container;
>> +    MemoryRegion *mr;
>> +    hwaddr offset_within_address_space;
>> +    hwaddr size;
>> +    uint64_t granularity;
>> +    RamDiscardListener listener;
>> +    QLIST_ENTRY(VFIORamDiscardListener) next;
>> +} VFIORamDiscardListener;
>> +
>> +typedef struct VFIOHostDMAWindow {
>> +    hwaddr min_iova;
>> +    hwaddr max_iova;
>> +    uint64_t iova_pgsizes;
>> +    QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
>> +} VFIOHostDMAWindow;
>> +
>> +/*
>> + * This is the base object for vfio container backends
>> + */
>> +struct VFIOContainer {
>> +    /* private */
>> +    Object parent_obj;
>> +
>> +    VFIOAddressSpace *space;
>> +    MemoryListener listener;
>> +    Error *error;
>> +    bool initialized;
>> +    bool dirty_pages_supported;
>> +    uint64_t dirty_pgsizes;
>> +    uint64_t max_dirty_bitmap_size;
>> +    unsigned long pgsizes;
>> +    unsigned int dma_max_mappings;
>> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>> +    QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>> +    QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
>> +    QLIST_ENTRY(VFIOContainer) next;
>> +};
>> +
>> +typedef struct VFIOContainerClass {
>> +    /* private */
>> +    ObjectClass parent_class;
>> +
>> +    /* required */
>> +    bool (*check_extension)(VFIOContainer *container,
>> +                            VFIOContainerFeature feat);
>> +    int (*dma_map)(VFIOContainer *container,
>> +                   hwaddr iova, ram_addr_t size,
>> +                   void *vaddr, bool readonly);
>> +    int (*dma_unmap)(VFIOContainer *container,
>> +                     hwaddr iova, ram_addr_t size,
>> +                     IOMMUTLBEntry *iotlb);
>> +    /* migration feature */
>> +    bool (*devices_all_dirty_tracking)(VFIOContainer *container);
>> +    void (*set_dirty_page_tracking)(VFIOContainer *container, bool start);
>> +    int (*get_dirty_bitmap)(VFIOContainer *container, uint64_t iova,
>> +                            uint64_t size, ram_addr_t ram_addr);
>> +
>> +    /* SPAPR specific */
>> +    int (*add_window)(VFIOContainer *container,
>> +                      MemoryRegionSection *section,
>> +                      Error **errp);
>> +    void (*del_window)(VFIOContainer *container,
>> +                       MemoryRegionSection *section);
>> +} VFIOContainerClass;
>> +
>> +bool vfio_container_check_extension(VFIOContainer *container,
>> +                                    VFIOContainerFeature feat);
>> +int vfio_container_dma_map(VFIOContainer *container,
>> +                           hwaddr iova, ram_addr_t size,
>> +                           void *vaddr, bool readonly);
>> +int vfio_container_dma_unmap(VFIOContainer *container,
>> +                             hwaddr iova, ram_addr_t size,
>> +                             IOMMUTLBEntry *iotlb);
>> +bool vfio_container_devices_all_dirty_tracking(VFIOContainer *container);
>> +void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
>> +                                            bool start);
>> +int vfio_container_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>> +                                    uint64_t size, ram_addr_t ram_addr);
>> +int vfio_container_add_section_window(VFIOContainer *container,
>> +                                      MemoryRegionSection *section,
>> +                                      Error **errp);
>> +void vfio_container_del_section_window(VFIOContainer *container,
>> +                                       MemoryRegionSection *section);
>> +
>> +void vfio_container_init(void *_container, size_t instance_size,
>> +                         const char *mrtypename,
>> +                         VFIOAddressSpace *space);
>> +void vfio_container_destroy(VFIOContainer *container);
>> +#endif /* HW_VFIO_VFIO_CONTAINER_OBJ_H */
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-04-26 16:35         ` Alex Williamson
  (?)
@ 2022-05-09 14:24         ` Zhangfei Gao
  2022-05-10  3:17           ` Yi Liu
  -1 siblings, 1 reply; 125+ messages in thread
From: Zhangfei Gao @ 2022-05-09 14:24 UTC (permalink / raw)
  To: Alex Williamson, Shameerali Kolothum Thodi
  Cc: eric.auger, Yi Liu, cohuck, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, jgg, nicolinc,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx

Hi, Alex

On 2022/4/27 12:35 AM, Alex Williamson wrote:
> On Tue, 26 Apr 2022 12:43:35 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>
>>> -----Original Message-----
>>> From: Eric Auger [mailto:eric.auger@redhat.com]
>>> Sent: 26 April 2022 12:45
>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com;
>>> qemu-devel@nongnu.org
>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>> [...]
>>   
>>>>>   
>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>>>>> /
>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>>>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>>> Hi,
>>>>
>>>> I had a go with the above branches on our ARM64 platform trying to
>>> pass-through
>>>> a VF dev, but Qemu reports an error as below,
>>>>
>>>> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>>>> I think this happens for the dev BAR addr range. I haven't debugged the
>>> kernel
>>>> yet to see where it actually reports that.
>>> Does it prevent your assigned device from working? I have such errors
>>> too but this is a known issue. This is due to the fact P2P DMA is not
>>> supported yet.
>>>    
>> Yes, the basic tests all good so far. I am still not very clear how it works if
>> the map() fails though. It looks like it fails in,
>>
>> iommufd_ioas_map()
>>    iopt_map_user_pages()
>>     iopt_map_pages()
>>     ..
>>       pfn_reader_pin_pages()
>>
>> So does it mean it just works because the page is resident()?
> No, it just means that you're not triggering any accesses that require
> peer-to-peer DMA support.  Any sort of test where the device is only
> performing DMA to guest RAM, which is by far the standard use case,
> will work fine.  This also doesn't affect vCPU access to BAR space.
> It's only a failure of the mappings of the BAR space into the IOAS,
> which is only used when a device tries to directly target another
> device's BAR space via DMA.  Thanks,

I also get this issue when trying to add the prereg listener:

+    container->prereg_listener = vfio_memory_prereg_listener;
+    memory_listener_register(&container->prereg_listener,
+                            &address_space_memory);

host kernel log:
iommufd_ioas_map 1 iova=8000000000, iova1=8000000000, 
cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000
iopt_alloc_area input area=859a2d00 iova=8000000000
iopt_alloc_area area=859a2d00 iova=8000000000
pin_user_pages_remote rc=-14

qemu log:
vfio_prereg_listener_region_add
iommufd_map iova=0x8000000000
qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address
qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000, 0x10000, 
0xffff9c495000) = -14 (Bad address)
qemu-system-aarch64: (null)
double free or corruption (fasttop)
Aborted (core dumped)

With a hack that ignores address 0x8000000000 in map and unmap, the kernel
can boot.

Thanks


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-09 14:24         ` Zhangfei Gao
@ 2022-05-10  3:17           ` Yi Liu
  2022-05-10  6:51             ` Eric Auger
  0 siblings, 1 reply; 125+ messages in thread
From: Yi Liu @ 2022-05-10  3:17 UTC (permalink / raw)
  To: Zhangfei Gao, Alex Williamson, Shameerali Kolothum Thodi
  Cc: eric.auger, cohuck, qemu-devel, david, thuth, farman, mjrosato,
	akrowiak, pasic, jjherne, jasowang, kvm, jgg, nicolinc,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx

Hi Zhangfei,

On 2022/5/9 22:24, Zhangfei Gao wrote:
> Hi, Alex
> 
> On 2022/4/27 12:35 AM, Alex Williamson wrote:
>> On Tue, 26 Apr 2022 12:43:35 +0000
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>>
>>>> -----Original Message-----
>>>> From: Eric Auger [mailto:eric.auger@redhat.com]
>>>> Sent: 26 April 2022 12:45
>>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
>>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com;
>>>> qemu-devel@nongnu.org
>>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
>>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
>>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
>>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
>>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>>> [...]
>>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>>>>>> /
>>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>>>>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>>>> Hi,
>>>>>
>>>>> I had a go with the above branches on our ARM64 platform trying to
>>>> pass-through
>>>>> a VF dev, but Qemu reports an error as below,
>>>>>
>>>>> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
>>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
>>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
>>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>>>>> I think this happens for the dev BAR addr range. I haven't debugged the
>>>> kernel
>>>>> yet to see where it actually reports that.
>>>> Does it prevent your assigned device from working? I have such errors
>>>> too but this is a known issue. This is due to the fact P2P DMA is not
>>>> supported yet.
>>> Yes, the basic tests all good so far. I am still not very clear how it 
>>> works if
>>> the map() fails though. It looks like it fails in,
>>>
>>> iommufd_ioas_map()
>>>    iopt_map_user_pages()
>>>     iopt_map_pages()
>>>     ..
>>>       pfn_reader_pin_pages()
>>>
>>> So does it mean it just works because the page is resident()?
>> No, it just means that you're not triggering any accesses that require
>> peer-to-peer DMA support.  Any sort of test where the device is only
>> performing DMA to guest RAM, which is by far the standard use case,
>> will work fine.  This also doesn't affect vCPU access to BAR space.
>> It's only a failure of the mappings of the BAR space into the IOAS,
>> which is only used when a device tries to directly target another
>> device's BAR space via DMA.  Thanks,
> 
> I also get this issue when trying adding prereg listenner
> 
> +    container->prereg_listener = vfio_memory_prereg_listener;
> +    memory_listener_register(&container->prereg_listener,
> +                            &address_space_memory);
> 
> host kernel log:
> iommufd_ioas_map 1 iova=8000000000, iova1=8000000000, cmd->iova=8000000000, 
> cmd->user_va=9c495000, cmd->length=10000
> iopt_alloc_area input area=859a2d00 iova=8000000000
> iopt_alloc_area area=859a2d00 iova=8000000000
> pin_user_pages_remote rc=-14
> 
> qemu log:
> vfio_prereg_listener_region_add
> iommufd_map iova=0x8000000000
> qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address
> qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000, 0x10000, 
> 0xffff9c495000) = -14 (Bad address)
> qemu-system-aarch64: (null)
> double free or corruption (fasttop)
> Aborted (core dumped)
> 
> With hack of ignoring address 0x8000000000 in map and unmap, kernel can boot.

Do you know if the iova 0x8000000000 is guest RAM or MMIO? Currently, the
iommufd kernel part doesn't support mapping device BAR MMIO. This is a known gap.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-10  3:17           ` Yi Liu
@ 2022-05-10  6:51             ` Eric Auger
  2022-05-10 12:35               ` Zhangfei Gao
  0 siblings, 1 reply; 125+ messages in thread
From: Eric Auger @ 2022-05-10  6:51 UTC (permalink / raw)
  To: Yi Liu, Zhangfei Gao, Alex Williamson, Shameerali Kolothum Thodi
  Cc: cohuck, qemu-devel, david, thuth, farman, mjrosato, akrowiak,
	pasic, jjherne, jasowang, kvm, jgg, nicolinc, eric.auger.pro,
	kevin.tian, chao.p.peng, yi.y.sun, peterx

Hi Zhangfei,

On 5/10/22 05:17, Yi Liu wrote:
> Hi Zhangfei,
>
> On 2022/5/9 22:24, Zhangfei Gao wrote:
>> Hi, Alex
>>
>> On 2022/4/27 12:35 AM, Alex Williamson wrote:
>>> On Tue, 26 Apr 2022 12:43:35 +0000
>>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>>>
>>>>> -----Original Message-----
>>>>> From: Eric Auger [mailto:eric.auger@redhat.com]
>>>>> Sent: 26 April 2022 12:45
>>>>> To: Shameerali Kolothum Thodi
>>>>> <shameerali.kolothum.thodi@huawei.com>; Yi
>>>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com;
>>>>> cohuck@redhat.com;
>>>>> qemu-devel@nongnu.org
>>>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com;
>>>>> farman@linux.ibm.com;
>>>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>>>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>>>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
>>>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
>>>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
>>>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>>>> [...]
>>>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>>>>>
>>>>>>> /
>>>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>>>>>> [3]
>>>>>>> https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>>>>> Hi,
>>>>>>
>>>>>> I had a go with the above branches on our ARM64 platform trying to
>>>>> pass-through
>>>>>> a VF dev, but Qemu reports an error as below,
>>>>>>
>>>>>> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 ->
>>>>>> 0002)
>>>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
>>>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
>>>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>>>>>> I think this happens for the dev BAR addr range. I haven't
>>>>>> debugged the
>>>>> kernel
>>>>>> yet to see where it actually reports that.
>>>>> Does it prevent your assigned device from working? I have such errors
>>>>> too but this is a known issue. This is due to the fact P2P DMA is not
>>>>> supported yet.
>>>> Yes, the basic tests all good so far. I am still not very clear how
>>>> it works if
>>>> the map() fails though. It looks like it fails in,
>>>>
>>>> iommufd_ioas_map()
>>>>    iopt_map_user_pages()
>>>>     iopt_map_pages()
>>>>     ..
>>>>       pfn_reader_pin_pages()
>>>>
>>>> So does it mean it just works because the page is resident()?
>>> No, it just means that you're not triggering any accesses that require
>>> peer-to-peer DMA support.  Any sort of test where the device is only
>>> performing DMA to guest RAM, which is by far the standard use case,
>>> will work fine.  This also doesn't affect vCPU access to BAR space.
>>> It's only a failure of the mappings of the BAR space into the IOAS,
>>> which is only used when a device tries to directly target another
>>> device's BAR space via DMA.  Thanks,
>>
>> I also get this issue when trying adding prereg listenner
>>
>> +    container->prereg_listener = vfio_memory_prereg_listener;
>> +    memory_listener_register(&container->prereg_listener,
>> +                            &address_space_memory);
>>
>> host kernel log:
>> iommufd_ioas_map 1 iova=8000000000, iova1=8000000000,
>> cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000
>> iopt_alloc_area input area=859a2d00 iova=8000000000
>> iopt_alloc_area area=859a2d00 iova=8000000000
>> pin_user_pages_remote rc=-14
>>
>> qemu log:
>> vfio_prereg_listener_region_add
>> iommufd_map iova=0x8000000000
>> qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address
>> qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000,
>> 0x10000, 0xffff9c495000) = -14 (Bad address)
>> qemu-system-aarch64: (null)
>> double free or corruption (fasttop)
>> Aborted (core dumped)
>>
>> With hack of ignoring address 0x8000000000 in map and unmap, kernel
>> can boot.
>
> do you know if the iova 0x8000000000 guest RAM or MMIO? Currently,
> iommufd kernel part doesn't support mapping device BAR MMIO. This is a
> known gap.
In the qemu arm virt machine this indeed matches the PCI MMIO region.

Thanks

Eric


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-10  6:51             ` Eric Auger
@ 2022-05-10 12:35               ` Zhangfei Gao
  2022-05-10 12:45                 ` Jason Gunthorpe
  0 siblings, 1 reply; 125+ messages in thread
From: Zhangfei Gao @ 2022-05-10 12:35 UTC (permalink / raw)
  To: eric.auger, Yi Liu, Alex Williamson, Shameerali Kolothum Thodi
  Cc: cohuck, qemu-devel, david, thuth, farman, mjrosato, akrowiak,
	pasic, jjherne, jasowang, kvm, jgg, nicolinc, eric.auger.pro,
	kevin.tian, chao.p.peng, yi.y.sun, peterx



On 2022/5/10 2:51 PM, Eric Auger wrote:
> Hi Hi, Zhangfei,
>
> On 5/10/22 05:17, Yi Liu wrote:
>> Hi Zhangfei,
>>
>> On 2022/5/9 22:24, Zhangfei Gao wrote:
>>> Hi, Alex
>>>
>>> On 2022/4/27 12:35 AM, Alex Williamson wrote:
>>>> On Tue, 26 Apr 2022 12:43:35 +0000
>>>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>>>>
>>>>>> -----Original Message-----
>>>>>> From: Eric Auger [mailto:eric.auger@redhat.com]
>>>>>> Sent: 26 April 2022 12:45
>>>>>> To: Shameerali Kolothum Thodi
>>>>>> <shameerali.kolothum.thodi@huawei.com>; Yi
>>>>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com;
>>>>>> cohuck@redhat.com;
>>>>>> qemu-devel@nongnu.org
>>>>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com;
>>>>>> farman@linux.ibm.com;
>>>>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>>>>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>>>>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
>>>>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
>>>>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
>>>>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>>>>> [...]
>>>>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>>>>>>
>>>>>>>> /
>>>>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>>>>>>> [3]
>>>>>>>> https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>>>>>> Hi,
>>>>>>>
>>>>>>> I had a go with the above branches on our ARM64 platform trying to
>>>>>> pass-through
>>>>>>> a VF dev, but Qemu reports an error as below,
>>>>>>>
>>>>>>> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 ->
>>>>>>> 0002)
>>>>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
>>>>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
>>>>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>>>>>>> I think this happens for the dev BAR addr range. I haven't
>>>>>>> debugged the
>>>>>> kernel
>>>>>>> yet to see where it actually reports that.
>>>>>> Does it prevent your assigned device from working? I have such errors
>>>>>> too but this is a known issue. This is due to the fact P2P DMA is not
>>>>>> supported yet.
>>>>> Yes, the basic tests all good so far. I am still not very clear how
>>>>> it works if
>>>>> the map() fails though. It looks like it fails in,
>>>>>
>>>>> iommufd_ioas_map()
>>>>>     iopt_map_user_pages()
>>>>>      iopt_map_pages()
>>>>>      ..
>>>>>        pfn_reader_pin_pages()
>>>>>
>>>>> So does it mean it just works because the page is resident()?
>>>> No, it just means that you're not triggering any accesses that require
>>>> peer-to-peer DMA support.  Any sort of test where the device is only
>>>> performing DMA to guest RAM, which is by far the standard use case,
>>>> will work fine.  This also doesn't affect vCPU access to BAR space.
>>>> It's only a failure of the mappings of the BAR space into the IOAS,
>>>> which is only used when a device tries to directly target another
>>>> device's BAR space via DMA.  Thanks,
>>> I also get this issue when trying adding prereg listenner
>>>
>>> +    container->prereg_listener = vfio_memory_prereg_listener;
>>> +    memory_listener_register(&container->prereg_listener,
>>> +                            &address_space_memory);
>>>
>>> host kernel log:
>>> iommufd_ioas_map 1 iova=8000000000, iova1=8000000000,
>>> cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000
>>> iopt_alloc_area input area=859a2d00 iova=8000000000
>>> iopt_alloc_area area=859a2d00 iova=8000000000
>>> pin_user_pages_remote rc=-14
>>>
>>> qemu log:
>>> vfio_prereg_listener_region_add
>>> iommufd_map iova=0x8000000000
>>> qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address
>>> qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000,
>>> 0x10000, 0xffff9c495000) = -14 (Bad address)
>>> qemu-system-aarch64: (null)
>>> double free or corruption (fasttop)
>>> Aborted (core dumped)
>>>
>>> With hack of ignoring address 0x8000000000 in map and unmap, kernel
>>> can boot.
>> do you know if the iova 0x8000000000 guest RAM or MMIO? Currently,
>> iommufd kernel part doesn't support mapping device BAR MMIO. This is a
>> known gap.
> In qemu arm virt machine this indeed matches the PCI MMIO region.

Thanks Yi and Eric,
Then I will wait for the updated iommufd kernel for the PCI MMIO region.

Another question: how do we get the iommu_domain in the ioctl?

QEMU can get container->ioas_id.

The kernel can look up the ioas via that ioas_id.
But how does it get the domain?
Currently I am hacking around this with ioas->iopt.next_domain_id, which
only ever increases:
domain = xa_load(&ioas->iopt.domains, ioas->iopt.next_domain_id - 1);

Any ideas?

Thanks

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-10 12:35               ` Zhangfei Gao
@ 2022-05-10 12:45                 ` Jason Gunthorpe
  2022-05-10 14:08                   ` Yi Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Jason Gunthorpe @ 2022-05-10 12:45 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: eric.auger, Yi Liu, Alex Williamson, Shameerali Kolothum Thodi,
	cohuck, qemu-devel, david, thuth, farman, mjrosato, akrowiak,
	pasic, jjherne, jasowang, kvm, nicolinc, eric.auger.pro,
	kevin.tian, chao.p.peng, yi.y.sun, peterx

On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
> Thanks Yi and Eric,
> Then will wait for the updated iommufd kernel for the PCI MMIO region.
> 
> Another question,
> How to get the iommu_domain in the ioctl.

The ID of the iommu_domain (called the hwpt) should be returned by
the vfio attach ioctl.

Jason

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-10 12:45                 ` Jason Gunthorpe
@ 2022-05-10 14:08                   ` Yi Liu
  2022-05-11 14:17                     ` zhangfei.gao
  0 siblings, 1 reply; 125+ messages in thread
From: Yi Liu @ 2022-05-10 14:08 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhangfei Gao
  Cc: eric.auger, Alex Williamson, Shameerali Kolothum Thodi, cohuck,
	qemu-devel, david, thuth, farman, mjrosato, akrowiak, pasic,
	jjherne, jasowang, kvm, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx

On 2022/5/10 20:45, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>> Thanks Yi and Eric,
>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>
>> Another question,
>> How to get the iommu_domain in the ioctl.
> 
> The ID of the iommu_domain (called the hwpt) it should be returned by
> the vfio attach ioctl.

Yes, hwpt_id is returned by the vfio attach ioctl and recorded in
QEMU. You can query page-table-related capabilities with this id.

https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-10 14:08                   ` Yi Liu
@ 2022-05-11 14:17                     ` zhangfei.gao
  2022-05-12  9:01                       ` zhangfei.gao
  2022-05-17  8:52                       ` Yi Liu
  0 siblings, 2 replies; 125+ messages in thread
From: zhangfei.gao @ 2022-05-11 14:17 UTC (permalink / raw)
  To: Yi Liu, Jason Gunthorpe, Zhangfei Gao
  Cc: eric.auger, Alex Williamson, Shameerali Kolothum Thodi, cohuck,
	qemu-devel, david, thuth, farman, mjrosato, akrowiak, pasic,
	jjherne, jasowang, kvm, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx



On 2022/5/10 10:08 PM, Yi Liu wrote:
> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>> Thanks Yi and Eric,
>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>
>>> Another question,
>>> How to get the iommu_domain in the ioctl.
>>
>> The ID of the iommu_domain (called the hwpt) it should be returned by
>> the vfio attach ioctl.
>
> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
> qemu. You can query page table related capabilities with this id.
>
> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>
Thanks Yi,

Do we use iommufd_hw_pagetable_from_id() in the kernel?

QEMU sends the hwpt_id via ioctl.
Currently VFIOIOMMUFDContainer has an hwpt_list;
which member is the right place to save the hwpt_id, IOMMUTLBEntry?

In the kernel ioctl path (iommufd_vfio_ioctl):
@dev: Device to get an iommu_domain for
iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct device *dev)
But iommufd_vfio_ioctl seems to have no dev parameter?

Thanks





^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-11 14:17                     ` zhangfei.gao
@ 2022-05-12  9:01                       ` zhangfei.gao
  2022-05-17  8:55                         ` Yi Liu
  2022-05-17  8:52                       ` Yi Liu
  1 sibling, 1 reply; 125+ messages in thread
From: zhangfei.gao @ 2022-05-12  9:01 UTC (permalink / raw)
  To: Yi Liu, Jason Gunthorpe, Zhangfei Gao
  Cc: eric.auger, Alex Williamson, Shameerali Kolothum Thodi, cohuck,
	qemu-devel, david, thuth, farman, mjrosato, akrowiak, pasic,
	jjherne, jasowang, kvm, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx


Hi, Yi

On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote:
>
>
> On 2022/5/10 22:08, Yi Liu wrote:
>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>> Thanks Yi and Eric,
>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>>
>>>> Another question,
>>>> How to get the iommu_domain in the ioctl.
>>>
>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>> the vfio attach ioctl.
>>
>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>> qemu. You can query page table related capabilities with this id.
>>
>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>
> Thanks Yi,
>
> Do we use iommufd_hw_pagetable_from_id in kernel?
>
> The qemu send hwpt_id via ioctl.
> Currently VFIOIOMMUFDContainer has hwpt_list,
> Which member is good to save hwpt_id, IOMMUTLBEntry?

Can VFIOIOMMUFDContainer have multiple hwpts, given that it now has
hwpt_list? If so, how do we get the specific hwpt from map/unmap_notify
in hw/vfio/as.c, where there is no vbasedev to compare against?

I am testing with a workaround: adding a VFIOIOASHwpt *hwpt member to
VFIOIOMMUFDContainer and saving the hwpt in vfio_device_attach_container().


>
> In kernel ioctl: iommufd_vfio_ioctl
> @dev: Device to get an iommu_domain for
> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, 
> struct device *dev)
> But iommufd_vfio_ioctl seems no para dev?

We can set dev=NULL, since the IOMMUFD_OBJ_HW_PAGETABLE case does not need dev:
iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
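As a toy model of this proposal, the two lookup paths might be sketched like this (all types and names are illustrative stand-ins for the kernel objects under discussion, not the real implementation; note Yi's reply below that the real function also uses dev for sw_msi checks):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel objects under discussion. */
enum mock_obj_type { MOCK_OBJ_IOAS, MOCK_OBJ_HW_PAGETABLE };

struct mock_hwpt { unsigned int id; };

struct mock_object {
    enum mock_obj_type type;
    struct mock_hwpt hwpt;      /* valid once a domain exists */
};

/* Model of the lookup: an existing hw_pagetable id resolves directly,
 * while the IOAS path would have to allocate a domain and so needs a
 * device; return NULL when that requirement is not met. */
static struct mock_hwpt *
mock_hw_pagetable_from_id(struct mock_object *obj, void *dev)
{
    switch (obj->type) {
    case MOCK_OBJ_HW_PAGETABLE:
        return &obj->hwpt;      /* already allocated, dev not consulted */
    case MOCK_OBJ_IOAS:
        if (!dev)
            return NULL;        /* auto-allocation needs a device */
        return &obj->hwpt;
    }
    return NULL;
}
```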

Thanks
>


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-11 14:17                     ` zhangfei.gao
  2022-05-12  9:01                       ` zhangfei.gao
@ 2022-05-17  8:52                       ` Yi Liu
  1 sibling, 0 replies; 125+ messages in thread
From: Yi Liu @ 2022-05-17  8:52 UTC (permalink / raw)
  To: zhangfei.gao, Jason Gunthorpe, Zhangfei Gao
  Cc: eric.auger, Alex Williamson, Shameerali Kolothum Thodi, cohuck,
	qemu-devel, david, thuth, farman, mjrosato, akrowiak, pasic,
	jjherne, jasowang, kvm, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx

Hi Zhangfei,

On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote:
> 
> 
> On 2022/5/10 下午10:08, Yi Liu wrote:
>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>> Thanks Yi and Eric,
>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>>
>>>> Another question,
>>>> How to get the iommu_domain in the ioctl.
>>>
>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>> the vfio attach ioctl.
>>
>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>> qemu. You can query page table related capabilities with this id.
>>
>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>
> Thanks Yi,
> 
> Do we use iommufd_hw_pagetable_from_id in kernel?
> 
> The qemu send hwpt_id via ioctl.
> Currently VFIOIOMMUFDContainer has hwpt_list,
> Which member is good to save hwpt_id, IOMMUTLBEntry?

Currently we don't make use of the hwpt yet in the version we have in
the QEMU branch; I have a change that makes use of it. It would also be
used in the future for nested translation setup and for querying
dirty-page tracking support for a given domain.

> 
> In kernel ioctl: iommufd_vfio_ioctl
> @dev: Device to get an iommu_domain for
> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct 
> device *dev)
> But iommufd_vfio_ioctl seems no para dev?

There is. You can look at vfio_group_set_iommufd(): it loops over the
device_list provided by vfio, and the device info is passed to iommufd.
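The pattern described here can be sketched as a toy model (all names are illustrative; this is not the real vfio_group_set_iommufd()):

```c
#include <assert.h>
#include <stddef.h>

/* vfio hands over a list of devices, and each one is passed to iommufd
 * in turn, so the kernel side always has a device to work with. */
struct mock_device {
    const char *name;
    int bound;                  /* 1 once "bound" to iommufd */
};

static int mock_iommufd_bind(struct mock_device *dev)
{
    dev->bound = 1;             /* stands in for the real bind path */
    return 0;
}

static int mock_group_set_iommufd(struct mock_device *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ret = mock_iommufd_bind(&devs[i]);
        if (ret)
            return ret;         /* real code would unwind earlier binds */
    }
    return 0;
}
```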

> Thanks
> 
> 
> 
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-12  9:01                       ` zhangfei.gao
@ 2022-05-17  8:55                         ` Yi Liu
  2022-05-18  7:22                           ` zhangfei.gao
  0 siblings, 1 reply; 125+ messages in thread
From: Yi Liu @ 2022-05-17  8:55 UTC (permalink / raw)
  To: zhangfei.gao, Jason Gunthorpe, Zhangfei Gao
  Cc: eric.auger, Alex Williamson, Shameerali Kolothum Thodi, cohuck,
	qemu-devel, david, thuth, farman, mjrosato, akrowiak, pasic,
	jjherne, jasowang, kvm, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx

Hi Zhangfei,

On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
> 
> Hi, Yi
> 
> On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote:
>>
>>
>> On 2022/5/10 22:08, Yi Liu wrote:
>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>>> Thanks Yi and Eric,
>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>>>
>>>>> Another question,
>>>>> How to get the iommu_domain in the ioctl.
>>>>
>>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>>> the vfio attach ioctl.
>>>
>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>>> qemu. You can query page table related capabilities with this id.
>>>
>>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>>
>> Thanks Yi,
>>
>> Do we use iommufd_hw_pagetable_from_id in kernel?
>>
>> The qemu send hwpt_id via ioctl.
>> Currently VFIOIOMMUFDContainer has hwpt_list,
>> Which member is good to save hwpt_id, IOMMUTLBEntry?
> 
> Can VFIOIOMMUFDContainer  have multi hwpt?

Yes, it is possible.

> Since VFIOIOMMUFDContainer has hwpt_list now.
> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, 
> where no vbasedev can be used for compare.
> 
> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in 
> VFIOIOMMUFDContainer.
> And save hwpt when vfio_device_attach_container.
> 
>>
>> In kernel ioctl: iommufd_vfio_ioctl
>> @dev: Device to get an iommu_domain for
>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct 
>> device *dev)
>> But iommufd_vfio_ioctl seems no para dev?
> 
> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev.
> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)

This is not good. dev is passed into this function to allocate a domain
and also to check the sw_msi things. If you pass in NULL, it may not
even be able to get a domain for the hwpt. It won't work, I guess.

> Thanks
>>
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-17  8:55                         ` Yi Liu
@ 2022-05-18  7:22                           ` zhangfei.gao
  2022-05-18 14:00                             ` Yi Liu
  0 siblings, 1 reply; 125+ messages in thread
From: zhangfei.gao @ 2022-05-18  7:22 UTC (permalink / raw)
  To: Yi Liu, Jason Gunthorpe, Zhangfei Gao
  Cc: eric.auger, Alex Williamson, Shameerali Kolothum Thodi, cohuck,
	qemu-devel, david, thuth, farman, mjrosato, akrowiak, pasic,
	jjherne, jasowang, kvm, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx



On 2022/5/17 16:55, Yi Liu wrote:
> Hi Zhangfei,
>
> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
>>
>> Hi, Yi
>>
> On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote:
>>>
>>>
>>> On 2022/5/10 22:08, Yi Liu wrote:
>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>>>> Thanks Yi and Eric,
>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO 
>>>>>> region.
>>>>>>
>>>>>> Another question,
>>>>>> How to get the iommu_domain in the ioctl.
>>>>>
>>>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>>>> the vfio attach ioctl.
>>>>
>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>>>> qemu. You can query page table related capabilities with this id.
>>>>
>>>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ 
>>>>
>>>>
>>> Thanks Yi,
>>>
>>> Do we use iommufd_hw_pagetable_from_id in kernel?
>>>
>>> The qemu send hwpt_id via ioctl.
>>> Currently VFIOIOMMUFDContainer has hwpt_list,
>>> Which member is good to save hwpt_id, IOMMUTLBEntry?
>>
>> Can VFIOIOMMUFDContainer  have multi hwpt?
>
> yes, it is possible
Then how do we get the hwpt_id in map/unmap_notify(IOMMUNotifier *n,
IOMMUTLBEntry *iotlb)?

>
>> Since VFIOIOMMUFDContainer has hwpt_list now.
>> If so, how to get specific hwpt from map/unmap_notify in 
>> hw/vfio/as.c, where no vbasedev can be used for compare.
>>
>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in 
>> VFIOIOMMUFDContainer.
>> And save hwpt when vfio_device_attach_container.
>>
>>>
>>> In kernel ioctl: iommufd_vfio_ioctl
>>> @dev: Device to get an iommu_domain for
>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, 
>>> struct device *dev)
>>> But iommufd_vfio_ioctl seems no para dev?
>>
>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev.
>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
>
> this is not good. dev is passed in to this function to allocate domain
> and also check sw_msi things. If you pass in a NULL, it may even unable
> to get a domain for the hwpt. It won't work I guess.

iommufd_hw_pagetable_from_id() can be used for two cases:
1. Allocating a domain, which needs the dev parameter:
case IOMMUFD_OBJ_IOAS:
hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);

2. Just returning an already-allocated domain via hwpt_id, which does not need dev:
case IOMMUFD_OBJ_HW_PAGETABLE:
return container_of(obj, struct iommufd_hw_pagetable, obj);

By the way, any plan of the nested mode?

Thanks

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-05-18  7:22                           ` zhangfei.gao
@ 2022-05-18 14:00                             ` Yi Liu
  2022-06-28  8:14                                 ` Shameerali Kolothum Thodi via
  0 siblings, 1 reply; 125+ messages in thread
From: Yi Liu @ 2022-05-18 14:00 UTC (permalink / raw)
  To: zhangfei.gao, Jason Gunthorpe, Zhangfei Gao
  Cc: eric.auger, Alex Williamson, Shameerali Kolothum Thodi, cohuck,
	qemu-devel, david, thuth, farman, mjrosato, akrowiak, pasic,
	jjherne, jasowang, kvm, nicolinc, eric.auger.pro, kevin.tian,
	chao.p.peng, yi.y.sun, peterx

On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote:
> 
> 
> On 2022/5/17 16:55, Yi Liu wrote:
>> Hi Zhangfei,
>>
>> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
>>>
>>> Hi, Yi
>>>
>>> On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote:
>>>>
>>>>
>>>> On 2022/5/10 22:08, Yi Liu wrote:
>>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>>>>> Thanks Yi and Eric,
>>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>>>>>
>>>>>>> Another question,
>>>>>>> How to get the iommu_domain in the ioctl.
>>>>>>
>>>>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>>>>> the vfio attach ioctl.
>>>>>
>>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>>>>> qemu. You can query page table related capabilities with this id.
>>>>>
>>>>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>>>>
>>>> Thanks Yi,
>>>>
>>>> Do we use iommufd_hw_pagetable_from_id in kernel?
>>>>
>>>> The qemu send hwpt_id via ioctl.
>>>> Currently VFIOIOMMUFDContainer has hwpt_list,
>>>> Which member is good to save hwpt_id, IOMMUTLBEntry?
>>>
>>> Can VFIOIOMMUFDContainer  have multi hwpt?
>>
>> yes, it is possible
> Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry 
> *iotlb)

In map/unmap, you should use the ioas_id instead of the hwpt_id.
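A minimal userspace-style sketch of that distinction: the DMA map request names the IOAS, and the kernel resolves that to whatever domains/hwpts are attached. The struct below only approximates the iommufd IOAS-map uapi; treat the exact field names and layout as assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximation of an iommufd IOAS-map request (field names and
 * layout are assumptions, not the authoritative uapi). */
struct mock_ioas_map {
    uint32_t size;
    uint32_t flags;
    uint32_t ioas_id;           /* target is the IOAS, not a hwpt */
    uint32_t reserved;
    uint64_t user_va;
    uint64_t length;
    uint64_t iova;
};

static void mock_fill_map(struct mock_ioas_map *req, uint32_t ioas_id,
                          uint64_t user_va, uint64_t iova, uint64_t length)
{
    memset(req, 0, sizeof(*req));
    req->size = sizeof(*req);
    req->ioas_id = ioas_id;     /* the hwpt_id is not used on this path */
    req->user_va = user_va;
    req->iova = iova;
    req->length = length;
}
```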

> 
>>
>>> Since VFIOIOMMUFDContainer has hwpt_list now.
>>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, 
>>> where no vbasedev can be used for compare.
>>>
>>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in 
>>> VFIOIOMMUFDContainer.
>>> And save hwpt when vfio_device_attach_container.
>>>
>>>>
>>>> In kernel ioctl: iommufd_vfio_ioctl
>>>> @dev: Device to get an iommu_domain for
>>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, 
>>>> struct device *dev)
>>>> But iommufd_vfio_ioctl seems no para dev?
>>>
>>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev.
>>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
>>
>> this is not good. dev is passed in to this function to allocate domain
>> and also check sw_msi things. If you pass in a NULL, it may even unable
>> to get a domain for the hwpt. It won't work I guess.
> 
> The iommufd_hw_pagetable_from_id can be used for
> 1, allocate domain, which need para dev
> case IOMMUFD_OBJ_IOAS
> hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);

This is used when attaching an IOAS.

> 2. Just return allocated domain via hwpt_id, which does not need dev.
> case IOMMUFD_OBJ_HW_PAGETABLE:
> return container_of(obj, struct iommufd_hw_pagetable, obj);

Yes, this would be the usage in nesting. You may check my branch below;
it's for the nesting integration.

https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting

> By the way, any plan of the nested mode?
I'm working with Eric and Nic on it. Currently I've got the above kernel
branch; the QEMU side is also WIP.

> Thanks

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 125+ messages in thread

* RE: [RFC 00/18] vfio: Adopt iommufd
  2022-05-18 14:00                             ` Yi Liu
@ 2022-06-28  8:14                                 ` Shameerali Kolothum Thodi via
  0 siblings, 0 replies; 125+ messages in thread
From: Shameerali Kolothum Thodi @ 2022-06-28  8:14 UTC (permalink / raw)
  To: Yi Liu, zhangfei.gao, Jason Gunthorpe, Zhangfei Gao
  Cc: eric.auger, Alex Williamson, cohuck, qemu-devel, david, thuth,
	farman, mjrosato, akrowiak, pasic, jjherne, jasowang, kvm,
	nicolinc, eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun,
	peterx



> -----Original Message-----
> From: Yi Liu [mailto:yi.l.liu@intel.com]
> Sent: 18 May 2022 15:01
> To: zhangfei.gao@foxmail.com; Jason Gunthorpe <jgg@nvidia.com>;
> Zhangfei Gao <zhangfei.gao@linaro.org>
> Cc: eric.auger@redhat.com; Alex Williamson <alex.williamson@redhat.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> cohuck@redhat.com; qemu-devel@nongnu.org;
> david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
> nicolinc@nvidia.com; eric.auger.pro@gmail.com; kevin.tian@intel.com;
> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com
> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
> 
> On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote:
> >
> >
> > On 2022/5/17 16:55, Yi Liu wrote:
> >> Hi Zhangfei,
> >>
> >> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
> >>>
> >>> Hi, Yi
> >>>
> >>>> On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote:
> >>>>
> >>>>
> >>>>> On 2022/5/10 22:08, Yi Liu wrote:
> >>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
> >>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
> >>>>>>> Thanks Yi and Eric,
> >>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO
> region.
> >>>>>>>
> >>>>>>> Another question,
> >>>>>>> How to get the iommu_domain in the ioctl.
> >>>>>>
> >>>>>> The ID of the iommu_domain (called the hwpt) it should be returned
> by
> >>>>>> the vfio attach ioctl.
> >>>>>
> >>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
> >>>>> qemu. You can query page table related capabilities with this id.
> >>>>>
> >>>>>
> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
> >>>>>
> >>>> Thanks Yi,
> >>>>
> >>>> Do we use iommufd_hw_pagetable_from_id in kernel?
> >>>>
> >>>> The qemu send hwpt_id via ioctl.
> >>>> Currently VFIOIOMMUFDContainer has hwpt_list,
> >>>> Which member is good to save hwpt_id, IOMMUTLBEntry?
> >>>
> >>> Can VFIOIOMMUFDContainer  have multi hwpt?
> >>
> >> yes, it is possible
> > Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n,
> IOMMUTLBEntry
> > *iotlb)
> 
> in map/unmap, should use ioas_id instead of hwpt_id
> 
> >
> >>
> >>> Since VFIOIOMMUFDContainer has hwpt_list now.
> >>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c,
> >>> where no vbasedev can be used for compare.
> >>>
> >>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in
> >>> VFIOIOMMUFDContainer.
> >>> And save hwpt when vfio_device_attach_container.
> >>>
> >>>>
> >>>> In kernel ioctl: iommufd_vfio_ioctl
> >>>> @dev: Device to get an iommu_domain for
> >>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
> >>>> struct device *dev)
> >>>> But iommufd_vfio_ioctl seems no para dev?
> >>>
> >>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not
> need dev.
> >>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
> >>
> >> this is not good. dev is passed in to this function to allocate domain
> >> and also check sw_msi things. If you pass in a NULL, it may even unable
> >> to get a domain for the hwpt. It won't work I guess.
> >
> > The iommufd_hw_pagetable_from_id can be used for
> > 1, allocate domain, which need para dev
> > case IOMMUFD_OBJ_IOAS
> > hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);
> 
> this is used when attaching ioas.
> 
> > 2. Just return allocated domain via hwpt_id, which does not need dev.
> > case IOMMUFD_OBJ_HW_PAGETABLE:
> > return container_of(obj, struct iommufd_hw_pagetable, obj);
> 
> yes, this would be the usage in nesting. you may check my below
> branch. It's for nesting integration.
> 
> https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting
> 
> > By the way, any plan of the nested mode?
> I'm working with Eric, Nic on it. Currently, I've got the above kernel
> branch, QEMU side is also WIP.

Hi Yi/Eric,

I had a look at the above nesting kernel and QEMU branches, and as mentioned
in the cover letter it is not working on ARM yet.

IIUC, to get it working via iommufd the main thing is that we need a way to
configure the physical SMMU in nested mode and set up the mappings for
stage 2. The cache/PASID-related changes look more straightforward.

I have quite a few hacks to get it working on ARM, but it is still a WIP. So
just wondering, do you guys have something that can be shared yet?

Please let me know.

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC 00/18] vfio: Adopt iommufd
  2022-06-28  8:14                                 ` Shameerali Kolothum Thodi via
@ 2022-06-28  8:58                                 ` Eric Auger
  -1 siblings, 0 replies; 125+ messages in thread
From: Eric Auger @ 2022-06-28  8:58 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, Yi Liu, zhangfei.gao, Jason Gunthorpe,
	Zhangfei Gao
  Cc: Alex Williamson, cohuck, qemu-devel, david, thuth, farman,
	mjrosato, akrowiak, pasic, jjherne, jasowang, kvm, nicolinc,
	eric.auger.pro, kevin.tian, chao.p.peng, yi.y.sun, peterx

Hi Shameer,

On 6/28/22 10:14, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Yi Liu [mailto:yi.l.liu@intel.com]
>> Sent: 18 May 2022 15:01
>> To: zhangfei.gao@foxmail.com; Jason Gunthorpe <jgg@nvidia.com>;
>> Zhangfei Gao <zhangfei.gao@linaro.org>
>> Cc: eric.auger@redhat.com; Alex Williamson <alex.williamson@redhat.com>;
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> cohuck@redhat.com; qemu-devel@nongnu.org;
>> david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>> nicolinc@nvidia.com; eric.auger.pro@gmail.com; kevin.tian@intel.com;
>> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com
>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>>
>> On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote:
>>>
>>> On 2022/5/17 16:55, Yi Liu wrote:
>>>> Hi Zhangfei,
>>>>
>>>> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
>>>>> Hi, Yi
>>>>>
>>>>> On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote:
>>>>>>
>>>>>> On 2022/5/10 22:08, Yi Liu wrote:
>>>>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>>>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>>>>>>> Thanks Yi and Eric,
>>>>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO
>> region.
>>>>>>>>> Another question,
>>>>>>>>> How to get the iommu_domain in the ioctl.
>>>>>>>> The ID of the iommu_domain (called the hwpt) it should be returned
>> by
>>>>>>>> the vfio attach ioctl.
>>>>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>>>>>>> qemu. You can query page table related capabilities with this id.
>>>>>>>
>>>>>>>
>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>>>>> Thanks Yi,
>>>>>>
>>>>>> Do we use iommufd_hw_pagetable_from_id in kernel?
>>>>>>
>>>>>> The qemu send hwpt_id via ioctl.
>>>>>> Currently VFIOIOMMUFDContainer has hwpt_list,
>>>>>> Which member is good to save hwpt_id, IOMMUTLBEntry?
>>>>> Can VFIOIOMMUFDContainer  have multi hwpt?
>>>> yes, it is possible
>>> Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n,
>> IOMMUTLBEntry
>>> *iotlb)
>> in map/unmap, should use ioas_id instead of hwpt_id
>>
>>>>> Since VFIOIOMMUFDContainer has hwpt_list now.
>>>>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c,
>>>>> where no vbasedev can be used for compare.
>>>>>
>>>>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in
>>>>> VFIOIOMMUFDContainer.
>>>>> And save hwpt when vfio_device_attach_container.
>>>>>
>>>>>> In kernel ioctl: iommufd_vfio_ioctl
>>>>>> @dev: Device to get an iommu_domain for
>>>>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
>>>>>> struct device *dev)
>>>>>> But iommufd_vfio_ioctl seems no para dev?
>>>>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not
>> need dev.
>>>>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
>>>> this is not good. dev is passed in to this function to allocate domain
>>>> and also check sw_msi things. If you pass in a NULL, it may even unable
>>>> to get a domain for the hwpt. It won't work I guess.
>>> The iommufd_hw_pagetable_from_id can be used for
>>> 1, allocate domain, which need para dev
>>> case IOMMUFD_OBJ_IOAS
>>> hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);
>> this is used when attaching ioas.
>>
>>> 2. Just return allocated domain via hwpt_id, which does not need dev.
>>> case IOMMUFD_OBJ_HW_PAGETABLE:
>>> return container_of(obj, struct iommufd_hw_pagetable, obj);
>> yes, this would be the usage in nesting. you may check my below
>> branch. It's for nesting integration.
>>
>> https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting
>>
>>> By the way, any plan of the nested mode?
>> I'm working with Eric, Nic on it. Currently, I've got the above kernel
>> branch, QEMU side is also WIP.
> Hi Yi/Eric,
>
> I had a look at the above nesting kernel and Qemu branches and as mentioned
> in the cover letter it is not working on ARM yet.
>
> IIUC, to get it working via the iommufd the main thing is we need a way to configure
> the phys SMMU in nested mode and setup the mappings for the stage 2. The
> Cache/PASID related changes looks more straight forward. 
>
> I had quite a few hacks to get it working on ARM, but still a WIP. So just wondering
> do you guys have something that can be shared yet?

I am working on the respin based on the latest iommufd kernel branches and
QEMU RFC v2, but it is still WIP.

I will share as soon as possible.

Eric
>
> Please let me know.
>
> Thanks,
> Shameer


^ permalink raw reply	[flat|nested] 125+ messages in thread

end of thread, other threads:[~2022-06-28  9:28 UTC | newest]

Thread overview: 125+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-14 10:46 [RFC 00/18] vfio: Adopt iommufd Yi Liu
2022-04-14 10:46 ` Yi Liu
2022-04-14 10:46 ` [RFC 01/18] scripts/update-linux-headers: Add iommufd.h Yi Liu
2022-04-14 10:46   ` Yi Liu
2022-04-14 10:46 ` [RFC 02/18] linux-headers: Import latest vfio.h and iommufd.h Yi Liu
2022-04-14 10:46   ` Yi Liu
2022-04-14 10:46 ` [RFC 03/18] hw/vfio/pci: fix vfio_pci_hot_reset_result trace point Yi Liu
2022-04-14 10:46 ` [RFC 04/18] vfio/pci: Use vbasedev local variable in vfio_realize() Yi Liu
2022-04-14 10:46 ` [RFC 05/18] vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr Yi Liu
2022-04-14 10:46 ` [RFC 06/18] vfio/common: Split common.c into common.c, container.c and as.c Yi Liu
2022-04-14 10:46 ` [RFC 07/18] vfio: Add base object for VFIOContainer Yi Liu
2022-04-29  6:29   ` David Gibson
2022-05-03 13:05     ` Yi Liu
2022-04-14 10:47 ` [RFC 08/18] vfio/container: Introduce vfio_[attach/detach]_device Yi Liu
2022-04-14 10:47 ` [RFC 09/18] vfio/platform: Use vfio_[attach/detach]_device Yi Liu
2022-04-14 10:47 ` [RFC 10/18] vfio/ap: " Yi Liu
2022-04-14 10:47 ` [RFC 11/18] vfio/ccw: " Yi Liu
2022-04-14 10:47 ` [RFC 12/18] vfio/container-obj: Introduce [attach/detach]_device container callbacks Yi Liu
2022-04-14 10:47 ` [RFC 13/18] vfio/container-obj: Introduce VFIOContainer reset callback Yi Liu
2022-04-14 10:47 ` [RFC 14/18] hw/iommufd: Creation Yi Liu
2022-04-14 10:47 ` [RFC 15/18] vfio/iommufd: Implement iommufd backend Yi Liu
2022-04-22 14:58   ` Jason Gunthorpe
2022-04-22 21:33     ` Alex Williamson
2022-04-26  9:55     ` Yi Liu
2022-04-26 10:41       ` Tian, Kevin
2022-04-26 13:41         ` Jason Gunthorpe
2022-04-26 14:08           ` Yi Liu
2022-04-26 14:11             ` Jason Gunthorpe
2022-04-26 18:45               ` Alex Williamson
2022-04-26 19:27                 ` Jason Gunthorpe
2022-04-26 20:59                   ` Alex Williamson
2022-04-26 23:08                     ` Jason Gunthorpe
2022-04-26 13:53       ` Jason Gunthorpe
2022-04-14 10:47 ` [RFC 16/18] vfio/iommufd: Add IOAS_COPY_DMA support Yi Liu
2022-04-14 10:47 ` [RFC 17/18] vfio/as: Allow the selection of a given iommu backend Yi Liu
2022-04-14 10:47 ` [RFC 18/18] vfio/pci: Add an iommufd option Yi Liu
2022-04-15  8:37 ` [RFC 00/18] vfio: Adopt iommufd Nicolin Chen
2022-04-17 10:30   ` Eric Auger
2022-04-19  3:26     ` Nicolin Chen
2022-04-25 19:40       ` Eric Auger
2022-04-18  8:49 ` Tian, Kevin
2022-04-18 12:09   ` Yi Liu
2022-04-25 19:51     ` Eric Auger
2022-04-25 19:55   ` Eric Auger
2022-04-26  8:39     ` Tian, Kevin
2022-04-22 22:09 ` Alex Williamson
2022-04-25 10:10   ` Daniel P. Berrangé
2022-04-25 13:36     ` Jason Gunthorpe
2022-04-25 14:37     ` Alex Williamson
2022-04-26  8:37       ` Tian, Kevin
2022-04-26 12:33         ` Jason Gunthorpe
2022-04-26 16:21         ` Alex Williamson
2022-04-26 16:42           ` Jason Gunthorpe
2022-04-26 19:24             ` Alex Williamson
2022-04-26 19:36               ` Jason Gunthorpe
2022-04-28  3:21           ` Tian, Kevin
2022-04-28 14:24             ` Alex Williamson
2022-04-28 16:20               ` Daniel P. Berrangé
2022-04-29  0:45                 ` Tian, Kevin
2022-04-25 20:23   ` Eric Auger
2022-04-25 22:53     ` Alex Williamson
2022-04-26  9:47 ` Shameerali Kolothum Thodi
2022-04-26 11:44   ` Eric Auger
2022-04-26 12:43     ` Shameerali Kolothum Thodi
2022-04-26 16:35       ` Alex Williamson
2022-05-09 14:24         ` Zhangfei Gao
2022-05-10  3:17           ` Yi Liu
2022-05-10  6:51             ` Eric Auger
2022-05-10 12:35               ` Zhangfei Gao
2022-05-10 12:45                 ` Jason Gunthorpe
2022-05-10 14:08                   ` Yi Liu
2022-05-11 14:17                     ` zhangfei.gao
2022-05-12  9:01                       ` zhangfei.gao
2022-05-17  8:55                         ` Yi Liu
2022-05-18  7:22                           ` zhangfei.gao
2022-05-18 14:00                             ` Yi Liu
2022-06-28  8:14                               ` Shameerali Kolothum Thodi
2022-06-28  8:58                                 ` Eric Auger
2022-05-17  8:52                       ` Yi Liu
