* [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management
@ 2021-09-19  6:38 ` Liu Yi L
  0 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
vDPA) to manage secure device access from userspace. One critical task
of those frameworks is to put the assigned device in a secure, IOMMU-
protected context so that user-initiated DMAs cannot harm the rest of
the system.

Currently those frameworks implement their own logic for managing I/O page
tables to isolate user-initiated DMAs. This doesn't scale to support many
new IOMMU features, such as PASID-granular DMA remapping, nested translation,
I/O page fault, IOMMU dirty bit, etc.

/dev/iommu is introduced as a unified interface for managing I/O address
spaces and DMA isolation for passthrough devices. It originated from the
upstream discussion of the vSVA enabling work [1].

This RFC aims to provide a basic skeleton for the above proposal, without
adding any new feature beyond what vfio type1 provides today. For an
overview of future extensions, please refer to the full design proposal [2].

The core concepts in /dev/iommu are the iommufd and the ioasid. An iommufd
(obtained by opening /dev/iommu) is the container holding multiple I/O
address spaces, while an ioasid is the fd-local software handle representing
an I/O address space and associated with a single I/O page table. The user
manages those address spaces through fd operations, e.g. by using vfio
type1v2 mapping semantics to manage the respective I/O page tables.

An I/O address space takes effect in the IOMMU only after a device is
attached to it. One I/O address space can be attached by multiple devices,
but one device can be attached to only a single I/O address space in this
RFC, to match vfio type1 behavior as the starting point.

A device must be bound to an iommufd before any attach operation can be
conducted. The binding operation builds the connection between the devicefd
(opened via the device-passthrough framework) and the iommufd. Most
importantly, the entire /dev/iommu framework adopts a device-centric model
without carrying the container/group legacy of current vfio. This requires
that the binding operation also establish a security context which prevents
the bound device from accessing the rest of the system, as the contract for
vfio to grant user access to the assigned device. A detailed explanation of
this aspect can be found in patch 06.

Last, the format of an I/O page table must be compatible with the attached
devices (or more specifically with the IOMMU which serves the DMA from the
attached devices). The user is responsible for specifying the format when
allocating an IOASID, according to the one or multiple devices which will be
attached right after. The device's IOMMU format can be queried via the
iommufd once the device is successfully bound to it. Attaching a device to
an IOASID with an incompatible format is simply rejected.

The skeleton is mostly implemented in iommufd, except that the bind_iommufd/
ioasid_attach operations are initiated via uAPIs specific to each device-
passthrough framework. This RFC only changes vfio to work with iommufd;
vDPA support can be added at a later stage.

Basically iommufd provides the following uAPIs and helper functions (a rough
usage sketch follows the vfio extension list below):

- IOMMU_DEVICE_GET_INFO, for querying per-device iommu capability/format;
- IOMMU_IOASID_ALLOC/FREE, as the names suggest;
- IOMMU_[UN]MAP_DMA, providing vfio type1v2 semantics for managing a
  specific I/O page table;
- helper functions for vfio to bind_iommufd/attach_ioasid with devices;

vfio extensions include:
- A new interface for user to open a device w/o using container/group uAPI;
- VFIO_DEVICE_BIND_IOMMUFD, for binding a vfio device to an iommufd;
  * unbind is automatically done when devicefd is closed;
- VFIO_DEVICE_[DE]ATTACH_IOASID, for attaching/detaching a vfio device
  to/from an ioasid in the specified iommufd;
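
To make the intended flow concrete, here is a rough userspace sketch, not
taken from the series. The ioctl names match the lists above, but their
argument layouts are defined in the uAPI patches and are deliberately left
as comments rather than guessed; the device path follows the example given
in patch 03.

/*
 * Rough usage sketch only. The ioctl names match this cover letter; their
 * argument structures are defined in the uAPI patches and are therefore
 * only indicated in comments here instead of being guessed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* 1. Open /dev/iommu; each open returns a new iommufd. */
	int iommu_fd = open("/dev/iommu", O_RDWR);

	/* 2. Open the device via the new device-centric vfio interface. */
	int dev_fd = open("/dev/vfio/devices/0000:00:14.2", O_RDWR);

	if (iommu_fd < 0 || dev_fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * 3. VFIO_DEVICE_BIND_IOMMUFD on dev_fd (passing iommu_fd): bind the
	 *    device and establish its security context.
	 * 4. IOMMU_DEVICE_GET_INFO on iommu_fd: query the bound device's
	 *    IOMMU capability/format.
	 * 5. IOMMU_IOASID_ALLOC on iommu_fd: allocate an IOASID with a
	 *    format compatible with the device(s) to be attached.
	 * 6. VFIO_DEVICE_ATTACH_IOASID on dev_fd: attach the device to the
	 *    allocated IOASID.
	 * 7. IOMMU_[UN]MAP_DMA on iommu_fd: manage the IOASID's I/O page
	 *    table with vfio type1v2 semantics.
	 */

	close(dev_fd);		/* unbind is automatic on devicefd close */
	close(iommu_fd);
	return 0;
}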

[TODO in RFC v2]

We made one temporary hack in v1 by reusing vfio_iommu_type1.c to implement
IOMMU_[UN]MAP_DMA, which leads to some dirty code in patches 16-18. We
estimate that almost 80% of the current type1 code is related to map/unmap,
so either duplicating it in iommufd or sharing it between vfio and iommufd
requires non-trivial effort. We hope this hack doesn't affect the review of
the overall skeleton, since the role of this part is very clear. Based on
the feedback received, we will make a clean implementation in v2.

On the userspace side, our time didn't permit a clean implementation in
QEMU. Instead, we wrote a simple test application (similar to the example
in iommufd.rst) and verified the basic workflow (bind/unbind, ioasid
alloc/free, attach/detach, map/unmap, multi-device group, etc.). We did
verify that the I/O page table mappings are established as expected, though
no actual DMA was conducted. We plan to have a clean QEMU implementation
and will provide a public link for reference when v2 is sent out.

[TODO out of this RFC]

The entire /dev/iommu project involves lots of tasks and has to grow in a
staged approach. Below is a rough list of TODO features. Most of them can
be developed in parallel after this skeleton is accepted. For more details
please refer to the design proposal [2]:

1. Move more vfio device types to iommufd:
    * device which does no-snoop DMA
    * software mdev
    * PPC device
    * platform device

2. New vfio device type
    * hardware mdev/subdev (with PASID)

3. vDPA adoption

4. User-managed I/O page table
    * ioasid nesting (hardware)
    * ioasid nesting (software)
    * pasid virtualization
        o pdev (arm/amd)
        o pdev/mdev which doesn't support enqcmd (intel)
        o pdev/mdev which supports enqcmd (intel)
    * I/O page fault (stage-1)

5. Miscellaneous
    * I/O page fault (stage-2), for on-demand paging
    * IOMMU dirty bit, for hardware-assisted dirty page tracking
    * shared I/O page table (mm, ept, etc.)
    * vfio/vdpa shim to avoid code duplication for legacy uAPI
    * hardware-assisted vIOMMU

[1] https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/
[2] https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/

[Series Overview]
* Basic skeleton:
  0001-iommu-iommufd-Add-dev-iommu-core.patch

* VFIO PCI creates device-centric interface:
  0002-vfio-Add-device-class-for-dev-vfio-devices.patch
  0003-vfio-Add-vfio_-un-register_device.patch
  0004-iommu-Add-iommu_device_get_info-interface.patch
  0005-vfio-pci-Register-device-to-dev-vfio-devices.patch

* Bind device fd with iommufd:
  0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
  0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
  0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch

* IOASID allocation:
  0009-iommu-Add-page-size-and-address-width-attributes.patch
  0010-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
  0011-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch
  0012-iommu-iommufd-Add-IOMMU_CHECK_EXTENSION.patch

* IOASID [de]attach:
  0013-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
  0014-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
  0015-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch

* DMA (un)map:
  0016-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
  0017-iommu-iommufd-Report-iova-range-to-userspace.patch
  0018-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch

* Report the device info in vt-d driver to enable whole series:
  0019-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch

* Add doc:
  0020-Doc-Add-documentation-for-dev-iommu.patch

Complete code can be found in:
https://github.com/luxis1999/dev-iommu/commits/dev-iommu-5.14-rfcv1

Thanks for your time!

Regards,
Yi Liu
---

Liu Yi L (15):
  iommu/iommufd: Add /dev/iommu core
  vfio: Add device class for /dev/vfio/devices
  vfio: Add vfio_[un]register_device()
  vfio/pci: Register device to /dev/vfio/devices
  iommu/iommufd: Add iommufd_[un]bind_device()
  vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  vfio/type1: Export symbols for dma [un]map code sharing
  iommu/iommufd: Report iova range to userspace
  iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID
  Doc: Add documentation for /dev/iommu

Lu Baolu (5):
  iommu: Add iommu_device_get_info interface
  iommu: Add iommu_device_init[exit]_user_dma interfaces
  iommu: Add page size and address width attributes
  iommu: Extend iommu_at[de]tach_device() for multiple devices group
  iommu/vt-d: Implement device_info iommu_ops callback

 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 183 ++++++
 drivers/iommu/Kconfig                   |   1 +
 drivers/iommu/Makefile                  |   1 +
 drivers/iommu/intel/iommu.c             |  35 +
 drivers/iommu/iommu.c                   | 188 +++++-
 drivers/iommu/iommufd/Kconfig           |  11 +
 drivers/iommu/iommufd/Makefile          |   2 +
 drivers/iommu/iommufd/iommufd.c         | 832 ++++++++++++++++++++++++
 drivers/vfio/pci/Kconfig                |   1 +
 drivers/vfio/pci/vfio_pci.c             | 179 ++++-
 drivers/vfio/pci/vfio_pci_private.h     |  10 +
 drivers/vfio/vfio.c                     | 366 ++++++++++-
 drivers/vfio/vfio_iommu_type1.c         | 246 ++++++-
 include/linux/iommu.h                   |  35 +
 include/linux/iommufd.h                 |  71 ++
 include/linux/vfio.h                    |  27 +
 include/uapi/linux/iommu.h              | 162 +++++
 include/uapi/linux/vfio.h               |  56 ++
 19 files changed, 2358 insertions(+), 49 deletions(-)
 create mode 100644 Documentation/userspace-api/iommufd.rst
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd.c
 create mode 100644 include/linux/iommufd.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread


* [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

/dev/iommu aims to provide a unified interface for managing I/O address
spaces for devices assigned to userspace. This patch adds the initial
framework to create a /dev/iommu node. Each open of this node returns an
iommufd, which is the handle for userspace to initiate its I/O address
space management.
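
As a minimal illustration (not part of the patch), two opens of the node
are expected to yield two independent iommufds, each backed by its own
iommufd_ctx allocated in iommufd_fops_open():

/*
 * Minimal sketch, not part of the patch: each open() allocates its own
 * iommufd_ctx in iommufd_fops_open(), so the two fds below are fully
 * independent iommufds.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd1 = open("/dev/iommu", O_RDWR);
	int fd2 = open("/dev/iommu", O_RDWR);

	if (fd1 < 0 || fd2 < 0) {
		perror("open /dev/iommu");
		return 1;
	}
	printf("iommufd #1 = %d, iommufd #2 = %d\n", fd1, fd2);
	close(fd2);	/* drops the last reference on its iommufd_ctx */
	close(fd1);
	return 0;
}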

One open:
- We call this feature IOMMUFD in Kconfig in this RFC. However, this name
  is not clear enough to indicate its purpose to the user. Back in 2010,
  vfio even introduced /dev/uiommu [1] as the predecessor of its container
  concept. Is that a better name? Opinions are appreciated.

[1] https://lore.kernel.org/kvm/4c0eb470.1HMjondO00NIvFM6%25pugs@cisco.com/

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/Kconfig           |   1 +
 drivers/iommu/Makefile          |   1 +
 drivers/iommu/iommufd/Kconfig   |  11 ++++
 drivers/iommu/iommufd/Makefile  |   2 +
 drivers/iommu/iommufd/iommufd.c | 112 ++++++++++++++++++++++++++++++++
 5 files changed, 127 insertions(+)
 create mode 100644 drivers/iommu/iommufd/Kconfig
 create mode 100644 drivers/iommu/iommufd/Makefile
 create mode 100644 drivers/iommu/iommufd/iommufd.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 07b7c25cbed8..a83ce0acd09d 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -136,6 +136,7 @@ config MSM_IOMMU
 
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/iommufd/Kconfig"
 
 config IRQ_REMAP
 	bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c0fb0ba88143..719c799f23ad 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
 obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o io-pgfault.o
 obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
+obj-$(CONFIG_IOMMUFD) += iommufd/
diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
new file mode 100644
index 000000000000..9fb7769a815d
--- /dev/null
+++ b/drivers/iommu/iommufd/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config IOMMUFD
+	tristate "I/O Address Space management framework for passthrough devices"
+	select IOMMU_API
+	default n
+	help
+	  Provides a unified I/O address space management framework for
+	  isolating untrusted DMAs issued via devices which are passed
+	  through to userspace drivers.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
new file mode 100644
index 000000000000..54381a01d003
--- /dev/null
+++ b/drivers/iommu/iommufd/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
new file mode 100644
index 000000000000..710b7e62988b
--- /dev/null
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * I/O Address Space Management for passthrough devices
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Liu Yi L <yi.l.liu@intel.com>
+ */
+
+#define pr_fmt(fmt)    "iommufd: " fmt
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/iommu.h>
+
+/* Per iommufd */
+struct iommufd_ctx {
+	refcount_t refs;
+};
+
+static int iommufd_fops_open(struct inode *inode, struct file *filep)
+{
+	struct iommufd_ctx *ictx;
+	int ret = 0;
+
+	ictx = kzalloc(sizeof(*ictx), GFP_KERNEL);
+	if (!ictx)
+		return -ENOMEM;
+
+	refcount_set(&ictx->refs, 1);
+	filep->private_data = ictx;
+
+	return ret;
+}
+
+static void iommufd_ctx_put(struct iommufd_ctx *ictx)
+{
+	if (refcount_dec_and_test(&ictx->refs))
+		kfree(ictx);
+}
+
+static int iommufd_fops_release(struct inode *inode, struct file *filep)
+{
+	struct iommufd_ctx *ictx = filep->private_data;
+
+	filep->private_data = NULL;
+
+	iommufd_ctx_put(ictx);
+
+	return 0;
+}
+
+static long iommufd_fops_unl_ioctl(struct file *filep,
+				   unsigned int cmd, unsigned long arg)
+{
+	struct iommufd_ctx *ictx = filep->private_data;
+	long ret = -EINVAL;
+
+	if (!ictx)
+		return ret;
+
+	switch (cmd) {
+	default:
+		pr_err_ratelimited("unsupported cmd %u\n", cmd);
+		break;
+	}
+	return ret;
+}
+
+static const struct file_operations iommufd_fops = {
+	.owner		= THIS_MODULE,
+	.open		= iommufd_fops_open,
+	.release	= iommufd_fops_release,
+	.unlocked_ioctl	= iommufd_fops_unl_ioctl,
+};
+
+static struct miscdevice iommu_misc_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "iommu",
+	.fops = &iommufd_fops,
+	.nodename = "iommu",
+	.mode = 0666,
+};
+
+static int __init iommufd_init(void)
+{
+	int ret;
+
+	ret = misc_register(&iommu_misc_dev);
+	if (ret) {
+		pr_err("failed to register misc device\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit iommufd_exit(void)
+{
+	misc_deregister(&iommu_misc_dev);
+}
+
+module_init(iommufd_init);
+module_exit(iommufd_exit);
+
+MODULE_AUTHOR("Liu Yi L <yi.l.liu@intel.com>");
+MODULE_DESCRIPTION("I/O Address Space Management for passthrough devices");
+MODULE_LICENSE("GPL v2");
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread


* [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
userspace to directly open a vfio device without relying on the
container/group interface (/dev/vfio/$GROUP). Anything related to the
group is now hidden behind iommufd (more specifically, in the iommu core
by this RFC) in a device-centric manner.

In case a device is exposed via both the legacy and the new interface (see
the next patch for how this is decided), this patch also ensures that once
the device is opened via one interface the other one is blocked, as
sketched below.
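
A hedged userspace sketch of that mutual exclusion follows; the device and
group names are only examples of a device exposed via both interfaces, and
the EBUSY comes from vfio_group_try_open() introduced below.

/*
 * Sketch only, not part of the patch: with the group already opened via
 * the legacy node, opening one of its devices via /dev/vfio/devices is
 * expected to fail with EBUSY (and vice versa). Paths are examples.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int group_fd = open("/dev/vfio/26", O_RDWR);	/* legacy interface */
	int dev_fd = open("/dev/vfio/devices/0000:00:14.2", O_RDWR); /* new */

	if (group_fd >= 0 && dev_fd < 0 && errno == EBUSY)
		printf("device open blocked while group is open, as expected\n");

	if (dev_fd >= 0)
		close(dev_fd);
	if (group_fd >= 0)
		close(group_fd);
	return 0;
}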

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
 include/linux/vfio.h |   2 +
 2 files changed, 213 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 02cc51ce6891..84436d7abedd 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -46,6 +46,12 @@ static struct vfio {
 	struct mutex			group_lock;
 	struct cdev			group_cdev;
 	dev_t				group_devt;
+	/* Fields for /dev/vfio/devices interface */
+	struct class			*device_class;
+	struct cdev			device_cdev;
+	dev_t				device_devt;
+	struct mutex			device_lock;
+	struct idr			device_idr;
 } vfio;
 
 struct vfio_iommu_driver {
@@ -81,9 +87,11 @@ struct vfio_group {
 	struct list_head		container_next;
 	struct list_head		unbound_list;
 	struct mutex			unbound_lock;
-	atomic_t			opened;
-	wait_queue_head_t		container_q;
+	struct mutex			opened_lock;
+	u32				opened;
+	bool				opened_by_nongroup_dev;
 	bool				noiommu;
+	wait_queue_head_t		container_q;
 	unsigned int			dev_counter;
 	struct kvm			*kvm;
 	struct blocking_notifier_head	notifier;
@@ -327,7 +335,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
 	INIT_LIST_HEAD(&group->unbound_list);
 	mutex_init(&group->unbound_lock);
 	atomic_set(&group->container_users, 0);
-	atomic_set(&group->opened, 0);
+	mutex_init(&group->opened_lock);
 	init_waitqueue_head(&group->container_q);
 	group->iommu_group = iommu_group;
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -1489,10 +1497,53 @@ static long vfio_group_fops_unl_ioctl(struct file *filep,
 	return ret;
 }
 
+/*
+ * group->opened is used to ensure that the group can be opened only via
+ * one of the two interfaces (/dev/vfio/$GROUP and /dev/vfio/devices/
+ * $DEVICE) instead of both.
+ *
+ * We also introduce a new group flag to indicate whether this group is
+ * opened via /dev/vfio/devices/$DEVICE. For a multi-device group,
+ * group->opened also tracks how many devices have been opened in the
+ * group if the new flag is true.
+ *
+ * Also add a new lock since two fields are updated here.
+ */
+static int vfio_group_try_open(struct vfio_group *group, bool nongroup_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&group->opened_lock);
+	if (group->opened) {
+		if (nongroup_dev && group->opened_by_nongroup_dev)
+			group->opened++;
+		else
+			ret = -EBUSY;
+		goto out;
+	}
+
+	/*
+	 * Is something still in use from a previous open? If so, do
+	 * not allow a new open.
+	 */
+	if (group->container) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	group->opened = 1;
+	group->opened_by_nongroup_dev = nongroup_dev;
+
+out:
+	mutex_unlock(&group->opened_lock);
+
+	return ret;
+}
+
 static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 {
 	struct vfio_group *group;
-	int opened;
+	int ret;
 
 	group = vfio_group_get_from_minor(iminor(inode));
 	if (!group)
@@ -1503,18 +1554,10 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 		return -EPERM;
 	}
 
-	/* Do we need multiple instances of the group open?  Seems not. */
-	opened = atomic_cmpxchg(&group->opened, 0, 1);
-	if (opened) {
-		vfio_group_put(group);
-		return -EBUSY;
-	}
-
-	/* Is something still in use from a previous open? */
-	if (group->container) {
-		atomic_dec(&group->opened);
+	ret = vfio_group_try_open(group, false);
+	if (ret) {
 		vfio_group_put(group);
-		return -EBUSY;
+		return ret;
 	}
 
 	/* Warn if previous user didn't cleanup and re-init to drop them */
@@ -1534,7 +1577,9 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 
 	vfio_group_try_dissolve_container(group);
 
-	atomic_dec(&group->opened);
+	mutex_lock(&group->opened_lock);
+	group->opened--;
+	mutex_unlock(&group->opened_lock);
 
 	vfio_group_put(group);
 
@@ -1552,6 +1597,92 @@ static const struct file_operations vfio_group_fops = {
 /**
  * VFIO Device fd
  */
+static struct vfio_device *vfio_device_get_from_minor(int minor)
+{
+	struct vfio_device *device;
+
+	mutex_lock(&vfio.device_lock);
+	device = idr_find(&vfio.device_idr, minor);
+	if (!device || !vfio_device_try_get(device)) {
+		mutex_unlock(&vfio.device_lock);
+		return NULL;
+	}
+	mutex_unlock(&vfio.device_lock);
+
+	return device;
+}
+
+static int vfio_device_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_device *device;
+	struct vfio_group *group;
+	int ret, opened;
+
+	device = vfio_device_get_from_minor(iminor(inode));
+	if (!device)
+		return -ENODEV;
+
+	/*
+	 * Check whether the user has opened this device via the legacy
+	 * container/group interface. If yes, then prevent the user from
+	 * opening it via the device node in /dev/vfio/devices. Otherwise,
+	 * mark the group as opened to block the group interface. Either
+	 * way, we must ensure only one interface is used to open the
+	 * device when it supports both legacy and new interfaces.
+	 */
+	group = vfio_group_try_get(device->group);
+	if (group) {
+		ret = vfio_group_try_open(group, true);
+		if (ret)
+			goto err_group_try_open;
+	}
+
+	/*
+	 * No support of multiple instances of the device open, similar to
+	 * the policy on the group open.
+	 */
+	opened = atomic_cmpxchg(&device->opened, 0, 1);
+	if (opened) {
+		ret = -EBUSY;
+		goto err_device_try_open;
+	}
+
+	if (!try_module_get(device->dev->driver->owner)) {
+		ret = -ENODEV;
+		goto err_module_get;
+	}
+
+	ret = device->ops->open(device);
+	if (ret)
+		goto err_device_open;
+
+	filep->private_data = device;
+
+	if (group)
+		vfio_group_put(group);
+	return 0;
+err_device_open:
+	module_put(device->dev->driver->owner);
+err_module_get:
+	atomic_dec(&device->opened);
+err_device_try_open:
+	if (group) {
+		mutex_lock(&group->opened_lock);
+		group->opened--;
+		mutex_unlock(&group->opened_lock);
+	}
+err_group_try_open:
+	if (group)
+		vfio_group_put(group);
+	vfio_device_put(device);
+	return ret;
+}
+
+static bool vfio_device_in_container(struct vfio_device *device)
+{
+	return !!(device->group && device->group->container);
+}
+
 static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 {
 	struct vfio_device *device = filep->private_data;
@@ -1560,7 +1691,16 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 
 	module_put(device->dev->driver->owner);
 
-	vfio_group_try_dissolve_container(device->group);
+	if (vfio_device_in_container(device)) {
+		vfio_group_try_dissolve_container(device->group);
+	} else {
+		atomic_dec(&device->opened);
+		if (device->group) {
+			mutex_lock(&device->group->opened_lock);
+			device->group->opened--;
+			mutex_unlock(&device->group->opened_lock);
+		}
+	}
 
 	vfio_device_put(device);
 
@@ -1613,6 +1753,7 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 
 static const struct file_operations vfio_device_fops = {
 	.owner		= THIS_MODULE,
+	.open		= vfio_device_fops_open,
 	.release	= vfio_device_fops_release,
 	.read		= vfio_device_fops_read,
 	.write		= vfio_device_fops_write,
@@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
 	.mode = S_IRUGO | S_IWUGO,
 };
 
+static char *vfio_device_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
+}
+
+static int vfio_init_device_class(void)
+{
+	int ret;
+
+	mutex_init(&vfio.device_lock);
+	idr_init(&vfio.device_idr);
+
+	/* /dev/vfio/devices/$DEVICE */
+	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
+	if (IS_ERR(vfio.device_class))
+		return PTR_ERR(vfio.device_class);
+
+	vfio.device_class->devnode = vfio_device_devnode;
+
+	ret = alloc_chrdev_region(&vfio.device_devt, 0, MINORMASK + 1, "vfio-device");
+	if (ret)
+		goto err_alloc_chrdev;
+
+	cdev_init(&vfio.device_cdev, &vfio_device_fops);
+	ret = cdev_add(&vfio.device_cdev, vfio.device_devt, MINORMASK + 1);
+	if (ret)
+		goto err_cdev_add;
+	return 0;
+
+err_cdev_add:
+	unregister_chrdev_region(vfio.device_devt, MINORMASK + 1);
+err_alloc_chrdev:
+	class_destroy(vfio.device_class);
+	vfio.device_class = NULL;
+	return ret;
+}
+
+static void vfio_destroy_device_class(void)
+{
+	cdev_del(&vfio.device_cdev);
+	unregister_chrdev_region(vfio.device_devt, MINORMASK + 1);
+	class_destroy(vfio.device_class);
+	vfio.device_class = NULL;
+	idr_destroy(&vfio.device_idr);
+}
+
 static int __init vfio_init(void)
 {
 	int ret;
@@ -2329,6 +2516,10 @@ static int __init vfio_init(void)
 	if (ret)
 		goto err_cdev_add;
 
+	ret = vfio_init_device_class();
+	if (ret)
+		goto err_init_device_class;
+
 	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -2336,6 +2527,8 @@ static int __init vfio_init(void)
 #endif
 	return 0;
 
+err_init_device_class:
+	cdev_del(&vfio.group_cdev);
 err_cdev_add:
 	unregister_chrdev_region(vfio.group_devt, MINORMASK + 1);
 err_alloc_chrdev:
@@ -2358,6 +2551,7 @@ static void __exit vfio_cleanup(void)
 	unregister_chrdev_region(vfio.group_devt, MINORMASK + 1);
 	class_destroy(vfio.class);
 	vfio.class = NULL;
+	vfio_destroy_device_class();
 	misc_deregister(&vfio_dev);
 }
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index a2c5b30e1763..4a5f3f99eab2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -24,6 +24,8 @@ struct vfio_device {
 	refcount_t refcount;
 	struct completion comp;
 	struct list_head group_next;
+	int minor;
+	atomic_t opened;
 };
 
 /**
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread


* [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

With /dev/vfio/devices introduced, now a vfio device driver has three
options to expose its device to userspace:

a)  only legacy group interface, for devices which haven't been moved to
    iommufd (e.g. platform devices, sw mdev, etc.);

b)  both legacy group interface and new device-centric interface, for
    devices which support iommufd but also want to keep backward
    compatibility (e.g. pci devices in this RFC);

c)  only new device-centric interface, for new devices which don't carry
    backward compatibility burden (e.g. hw mdev/subdev with pasid);

This patch introduces vfio_[un]register_device() helpers for device
drivers to specify the device exposure policy to the vfio core. The
existing vfio_[un]register_group_dev() hence become wrappers around the
new helpers. The new device-centric interface is described as
'nongroup' to differentiate it from the existing 'group' interface.
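
As an illustration only (a hedged sketch, not part of this patch; the
actual vfio-pci conversion comes in a later patch of this series), a
driver exposing both interfaces would register roughly like:

	/* Sketch: expose both the group and the nongroup interface. */
	ret = vfio_register_device(&vdev->vdev,
				   VFIO_DEVNODE_GROUP | VFIO_DEVNODE_NONGROUP);
	if (ret)
		return ret;
	...
	/* On remove, one call tears down whichever interfaces were created. */
	vfio_unregister_device(&vdev->vdev);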

TBD: this patch needs to be rebased on top of the below series from
Christoph in the next version.

	"cleanup vfio iommu_group creation"

Legacy userspace continues to follow the legacy group interface.

Newer userspace can first try the new device-centric interface if the
device is present under /dev/vfio/devices, and otherwise fall back to
the group interface.
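
A hedged userspace sketch (the device path and the open_via_legacy_group()
helper below are illustrative assumptions, not a settled uAPI):

	/* Try the device-centric node first, fall back to groups. */
	int fd = open("/dev/vfio/devices/0000:00:14.2", O_RDWR);
	if (fd < 0 && errno == ENOENT)
		/* no nongroup node: use the container/group flow */
		fd = open_via_legacy_group("0000:00:14.2");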

One open question is how to organize the device nodes under /dev/vfio/devices/.
This RFC adopts a simple policy of keeping a flat layout with mixed devnames
from all kinds of devices. The prerequisite of this model is that devnames
from different bus types have unique formats:

	/dev/vfio/devices/0000:00:14.2 (pci)
	/dev/vfio/devices/PNP0103:00 (platform)
	/dev/vfio/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 (mdev)

One alternative option is to arrange device nodes in sub-directories based
on the device type. But doing so also adds a burden to userspace. The
current vfio uAPI is designed to have the user query the device type via
VFIO_DEVICE_GET_INFO after opening the device. With this option the user
instead needs to figure out the device type before opening the device, to
identify the sub-directory. Another tricky thing is that "pdev vs. mdev"
and "pci vs. platform vs. ccw, ..." are orthogonal categorizations. More
thought is needed on whether both or just one category should be used to
define the sub-directories.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio.c  | 137 +++++++++++++++++++++++++++++++++++++++----
 include/linux/vfio.h |   9 +++
 2 files changed, 134 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 84436d7abedd..1e87b25962f1 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -51,6 +51,7 @@ static struct vfio {
 	struct cdev			device_cdev;
 	dev_t				device_devt;
 	struct mutex			device_lock;
+	struct list_head		device_list;
 	struct idr			device_idr;
 } vfio;
 
@@ -757,7 +758,7 @@ void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(vfio_init_group_dev);
 
-int vfio_register_group_dev(struct vfio_device *device)
+static int __vfio_register_group_dev(struct vfio_device *device)
 {
 	struct vfio_device *existing_device;
 	struct iommu_group *iommu_group;
@@ -794,8 +795,13 @@ int vfio_register_group_dev(struct vfio_device *device)
 	/* Our reference on group is moved to the device */
 	device->group = group;
 
-	/* Refcounting can't start until the driver calls register */
-	refcount_set(&device->refcount, 1);
+	/*
+	 * Refcounting can't start until the driver calls register. Don't
+	 * start twice when the device is exposed in both group and nongroup
+	 * interfaces.
+	 */
+	if (!refcount_read(&device->refcount))
+		refcount_set(&device->refcount, 1);
 
 	mutex_lock(&group->device_lock);
 	list_add(&device->group_next, &group->device_list);
@@ -804,7 +810,78 @@ int vfio_register_group_dev(struct vfio_device *device)
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(vfio_register_group_dev);
+
+static int __vfio_register_nongroup_dev(struct vfio_device *device)
+{
+	struct vfio_device *existing_device;
+	struct device *dev;
+	int ret = 0, minor;
+
+	mutex_lock(&vfio.device_lock);
+	list_for_each_entry(existing_device, &vfio.device_list, vfio_next) {
+		if (existing_device == device) {
+			ret = -EBUSY;
+			goto out_unlock;
+		}
+	}
+
+	minor = idr_alloc(&vfio.device_idr, device, 0, MINORMASK + 1, GFP_KERNEL);
+	pr_debug("%s - mnior: %d\n", __func__, minor);
+	if (minor < 0) {
+		ret = minor;
+		goto out_unlock;
+	}
+
+	dev = device_create(vfio.device_class, NULL,
+			    MKDEV(MAJOR(vfio.device_devt), minor),
+			    device, "%s", dev_name(device->dev));
+	if (IS_ERR(dev)) {
+		idr_remove(&vfio.device_idr, minor);
+		ret = PTR_ERR(dev);
+		goto out_unlock;
+	}
+
+	/*
+	 * Refcounting can't start until the driver calls register. Don't
+	 * start twice when the device is exposed in both group and nongroup
+	 * interfaces.
+	 */
+	if (!refcount_read(&device->refcount))
+		refcount_set(&device->refcount, 1);
+
+	device->minor = minor;
+	list_add(&device->vfio_next, &vfio.device_list);
+	dev_info(device->dev, "Created device interface successfully\n");
+out_unlock:
+	mutex_unlock(&vfio.device_lock);
+	return ret;
+}
+
+int vfio_register_device(struct vfio_device *device, u32 flags)
+{
+	int ret = -EINVAL;
+
+	device->minor = -1;
+	device->group = NULL;
+	atomic_set(&device->opened, 0);
+
+	if (flags & ~(VFIO_DEVNODE_GROUP | VFIO_DEVNODE_NONGROUP))
+		return ret;
+
+	if (flags & VFIO_DEVNODE_GROUP) {
+		ret = __vfio_register_group_dev(device);
+		if (ret)
+			return ret;
+	}
+
+	if (flags & VFIO_DEVNODE_NONGROUP) {
+		ret = __vfio_register_nongroup_dev(device);
+		if (ret && device->group)
+			vfio_unregister_device(device);
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_register_device);
 
 /**
  * Get a reference to the vfio_device for a device.  Even if the
@@ -861,13 +938,14 @@ static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
 /*
  * Decrement the device reference count and wait for the device to be
  * removed.  Open file descriptors for the device... */
-void vfio_unregister_group_dev(struct vfio_device *device)
+void vfio_unregister_device(struct vfio_device *device)
 {
 	struct vfio_group *group = device->group;
 	struct vfio_unbound_dev *unbound;
 	unsigned int i = 0;
 	bool interrupted = false;
 	long rc;
+	int minor = device->minor;
 
 	/*
 	 * When the device is removed from the group, the group suddenly
@@ -878,14 +956,20 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 	 * solve this, we track such devices on the unbound_list to bridge
 	 * the gap until they're fully unbound.
 	 */
-	unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
-	if (unbound) {
-		unbound->dev = device->dev;
-		mutex_lock(&group->unbound_lock);
-		list_add(&unbound->unbound_next, &group->unbound_list);
-		mutex_unlock(&group->unbound_lock);
+	if (group) {
+		/*
+		 * If caller hasn't called vfio_register_group_dev(), this
+		 * branch is not necessary.
+		 */
+		unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
+		if (unbound) {
+			unbound->dev = device->dev;
+			mutex_lock(&group->unbound_lock);
+			list_add(&unbound->unbound_next, &group->unbound_list);
+			mutex_unlock(&group->unbound_lock);
+		}
+		WARN_ON(!unbound);
 	}
-	WARN_ON(!unbound);
 
 	vfio_device_put(device);
 	rc = try_wait_for_completion(&device->comp);
@@ -910,6 +994,21 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 		}
 	}
 
+	/* nongroup interface related cleanup */
+	if (minor >= 0) {
+		mutex_lock(&vfio.device_lock);
+		list_del(&device->vfio_next);
+		device->minor = -1;
+		device_destroy(vfio.device_class,
+			       MKDEV(MAJOR(vfio.device_devt), minor));
+		idr_remove(&vfio.device_idr, minor);
+		mutex_unlock(&vfio.device_lock);
+	}
+
+	/* No need to go further if no group. */
+	if (!group)
+		return;
+
 	mutex_lock(&group->device_lock);
 	list_del(&device->group_next);
 	group->dev_counter--;
@@ -935,6 +1034,18 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 	/* Matches the get in vfio_register_group_dev() */
 	vfio_group_put(group);
 }
+EXPORT_SYMBOL_GPL(vfio_unregister_device);
+
+int vfio_register_group_dev(struct vfio_device *device)
+{
+	return vfio_register_device(device, VFIO_DEVNODE_GROUP);
+}
+EXPORT_SYMBOL_GPL(vfio_register_group_dev);
+
+void vfio_unregister_group_dev(struct vfio_device *device)
+{
+	vfio_unregister_device(device);
+}
 EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
 
 /**
@@ -2447,6 +2558,7 @@ static int vfio_init_device_class(void)
 
 	mutex_init(&vfio.device_lock);
 	idr_init(&vfio.device_idr);
+	INIT_LIST_HEAD(&vfio.device_list);
 
 	/* /dev/vfio/devices/$DEVICE */
 	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
@@ -2542,6 +2654,7 @@ static int __init vfio_init(void)
 static void __exit vfio_cleanup(void)
 {
 	WARN_ON(!list_empty(&vfio.group_list));
+	WARN_ON(!list_empty(&vfio.device_list));
 
 #ifdef CONFIG_VFIO_NOIOMMU
 	vfio_unregister_iommu_driver(&vfio_noiommu_ops);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 4a5f3f99eab2..9448b751b663 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -26,6 +26,7 @@ struct vfio_device {
 	struct list_head group_next;
 	int minor;
 	atomic_t opened;
+	struct list_head vfio_next;
 };
 
 /**
@@ -73,6 +74,14 @@ enum vfio_iommu_notify_type {
 	VFIO_IOMMU_CONTAINER_CLOSE = 0,
 };
 
+/* The device can be opened via VFIO_GROUP_GET_DEVICE_FD */
+#define VFIO_DEVNODE_GROUP	BIT(0)
+/* The device can be opened via /dev/vfio/devices/${DEVICE} */
+#define VFIO_DEVNODE_NONGROUP	BIT(1)
+
+extern int vfio_register_device(struct vfio_device *device, u32 flags);
+extern void vfio_unregister_device(struct vfio_device *device);
+
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
  */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

This provides an interface for upper layers to get the per-device iommu
attributes.

    int iommu_device_get_info(struct device *dev,
                              enum iommu_devattr attr, void *data);

The first attribute (IOMMU_DEV_INFO_FORCE_SNOOP) is added. It tells
whether the iommu can force DMA to snoop the cache. At this stage, only
PCI devices which have this attribute set can use the iommufd, because
supporting no-snoop DMA requires additional refactoring work on the
current kvm-vfio contract. The following patch will have vfio check this
attribute to decide whether a pci device can be exposed through
/dev/vfio/devices.
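
For illustration, a hedged sketch of how a caller would query the new
attribute (mirroring what the vfio-pci patch later in this series does):

	bool snoop = false;
	u32 flags = VFIO_DEVNODE_GROUP;

	/* Ask the iommu layer whether DMA snooping can be forced. */
	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop) &&
	    snoop)
		flags |= VFIO_DEVNODE_NONGROUP;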

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommu.c | 16 ++++++++++++++++
 include/linux/iommu.h | 19 +++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 63f0af10c403..5ea3a007fd7c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3260,3 +3260,19 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 
 	return ret;
 }
+
+/* Expose per-device iommu attributes. */
+int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data)
+{
+	const struct iommu_ops *ops;
+
+	if (!dev->bus || !dev->bus->iommu_ops)
+		return -EINVAL;
+
+	ops = dev->bus->iommu_ops;
+	if (unlikely(!ops->device_info))
+		return -ENODEV;
+
+	return ops->device_info(dev, attr, data);
+}
+EXPORT_SYMBOL_GPL(iommu_device_get_info);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..52a6d33c82dc 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -150,6 +150,14 @@ enum iommu_dev_features {
 	IOMMU_DEV_FEAT_IOPF,
 };
 
+/**
+ * enum iommu_devattr - Per device IOMMU attributes
+ * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
+ */
+enum iommu_devattr {
+	IOMMU_DEV_INFO_FORCE_SNOOP,
+};
+
 #define IOMMU_PASID_INVALID	(-1U)
 
 #ifdef CONFIG_IOMMU_API
@@ -215,6 +223,7 @@ struct iommu_iotlb_gather {
  *		- IOMMU_DOMAIN_IDENTITY: must use an identity domain
  *		- IOMMU_DOMAIN_DMA: must use a dma domain
  *		- 0: use the default setting
+ * @device_info: query per-device iommu attributes
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @owner: Driver module providing these ops
  */
@@ -283,6 +292,8 @@ struct iommu_ops {
 
 	int (*def_domain_type)(struct device *dev);
 
+	int (*device_info)(struct device *dev, enum iommu_devattr attr, void *data);
+
 	unsigned long pgsize_bitmap;
 	struct module *owner;
 };
@@ -604,6 +615,8 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 void iommu_sva_unbind_device(struct iommu_sva *handle);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
 
+int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data);
+
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -999,6 +1012,12 @@ static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
 {
 	return NULL;
 }
+
+static inline int iommu_device_get_info(struct device *dev,
+					enum iommu_devattr type, void *data)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_IOMMU_API */
 
 /**
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

This patch exposes the device-centric interface for vfio-pci devices. To
be compatible with existing users, vfio-pci exposes both the legacy group
interface and the device-centric interface.

As explained in the last patch, this change doesn't apply to devices which
cannot be forced to snoop the cache by their upstream iommu. Such devices
are still expected to be opened via the legacy group interface.

When the device is opened via /dev/vfio/devices, vfio-pci should prevent
the user from accessing the assigned device because the device is still
attached to the default domain which may allow user-initiated DMAs to
touch arbitrary places. User access must be blocked until the device
is later bound to an iommufd (see patch 08). The binding acts as the
contract for putting the device in a security context which ensures user-
initiated DMAs via this device cannot harm the rest of the system.

This patch introduces a vdev->block_access flag for this purpose. It's set
when the device is opened via /dev/vfio/devices and cleared after binding
to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
user access should be blocked or not.

An alternative option is to use a dummy fops when the device is opened and
then switch to the real fops (replace_fops()) after binding. Inputs on
which option is better are appreciated.
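
If the replace_fops() option were chosen, a hedged sketch of the bind
path might look like the below (vfio_device_bound_fops is a hypothetical
name, not something defined in this series):

	/* Once binding has established the security context: */
	replace_fops(filep, &vfio_device_bound_fops);
	filep->private_data = device;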

The legacy group interface doesn't have this problem. Its uAPI requires the
user to first put the device into a security context via container/group
attaching process, before opening the device through the groupfd.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 25 +++++++++++++++++++++++--
 drivers/vfio/pci/vfio_pci_private.h |  1 +
 drivers/vfio/vfio.c                 |  3 ++-
 include/linux/vfio.h                |  1 +
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 318864d52837..145addde983b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -572,6 +572,10 @@ static int vfio_pci_open(struct vfio_device *core_vdev)
 
 		vfio_spapr_pci_eeh_open(vdev->pdev);
 		vfio_pci_vf_token_user_add(vdev, 1);
+		if (!vfio_device_in_container(core_vdev))
+			atomic_set(&vdev->block_access, 1);
+		else
+			atomic_set(&vdev->block_access, 0);
 	}
 	vdev->refcnt++;
 error:
@@ -1374,6 +1378,9 @@ static ssize_t vfio_pci_rw(struct vfio_pci_device *vdev, char __user *buf,
 {
 	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
 
+	if (atomic_read(&vdev->block_access))
+		return -ENODEV;
+
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 		return -EINVAL;
 
@@ -1640,6 +1647,9 @@ static int vfio_pci_mmap(struct vfio_device *core_vdev, struct vm_area_struct *v
 	u64 phys_len, req_len, pgoff, req_start;
 	int ret;
 
+	if (atomic_read(&vdev->block_access))
+		return -ENODEV;
+
 	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
@@ -1978,6 +1988,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct vfio_pci_device *vdev;
 	struct iommu_group *group;
 	int ret;
+	u32 flags;
+	bool snoop = false;
 
 	if (vfio_pci_is_denylisted(pdev))
 		return -EINVAL;
@@ -2046,9 +2058,18 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		vfio_pci_set_power_state(vdev, PCI_D3hot);
 	}
 
-	ret = vfio_register_group_dev(&vdev->vdev);
-	if (ret)
+	flags = VFIO_DEVNODE_GROUP;
+	ret = iommu_device_get_info(&pdev->dev,
+				    IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
+	if (!ret && snoop)
+		flags |= VFIO_DEVNODE_NONGROUP;
+
+	ret = vfio_register_device(&vdev->vdev, flags);
+	if (ret) {
+		pr_debug("Failed to register device interface\n");
 		goto out_power;
+	}
+
 	dev_set_drvdata(&pdev->dev, vdev);
 	return 0;
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5a36272cecbf..f12012e30b53 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -143,6 +143,7 @@ struct vfio_pci_device {
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
+	atomic_t		block_access;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 1e87b25962f1..22851747e92c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1789,10 +1789,11 @@ static int vfio_device_fops_open(struct inode *inode, struct file *filep)
 	return ret;
 }
 
-static bool vfio_device_in_container(struct vfio_device *device)
+bool vfio_device_in_container(struct vfio_device *device)
 {
 	return !!(device->group && device->group->container);
 }
+EXPORT_SYMBOL_GPL(vfio_device_in_container);
 
 static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 9448b751b663..fd0629acb948 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -81,6 +81,7 @@ enum vfio_iommu_notify_type {
 
 extern int vfio_register_device(struct vfio_device *device, u32 flags);
 extern void vfio_unregister_device(struct vfio_device *device);
+extern bool vfio_device_in_container(struct vfio_device *device);
 
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

This extends the iommu core to manage the security context for passthrough
devices. Please bear with a long explanation of how we reached this design
instead of managing it solely in iommufd like what vfio does today.

Devices which cannot be isolated from each other are organized into an
iommu group. When a device is assigned to the user space, the entire
group must be put in a security context so that user-initiated DMAs via
the assigned device cannot harm the rest of the system. No user access
should be granted on a device before the security context is established
for the group which the device belongs to.

Managing the security context must meet below criteria:

1)  The group is viable for user-initiated DMAs. This implies that the
    devices in the group must be either bound to a device-passthrough
    framework, or driver-less, or bound to a driver which is known safe
    (does not do DMA).

2)  The security context should only allow DMA to the user's memory and
    devices in this group;

3)  After the security context is established for the group, the group
    viability must be continuously monitored before the user relinquishes
    all devices belonging to the group. The viability might be broken e.g.
    when a driver-less device is later bound to a driver which does DMA.

4)  The security context should not be destroyed before user access
    permission is withdrawn.

Existing vfio introduces explicit container/group semantics in its uAPI
to meet the above requirements. A single security context (iommu domain)
is created per container. Attaching a group to a container moves the entire
group into the associated security context, and vice versa. The user can
open the device only after group attach. A group can be detached only
after all devices in the group are closed. Group viability is monitored
by listening to iommu group events.

Unlike vfio, iommufd adopts a device-centric design with all group
logistics hidden behind the fd. Binding a device to iommufd serves
as the contract to get security context established (and vice versa
for unbinding). One additional requirement in iommufd is to manage the
switch between multiple security contexts due to decoupled bind/attach:

1)  Open a device in "/dev/vfio/devices" with user access blocked;

2)  Bind the device to an iommufd with an initial security context
    (an empty iommu domain which blocks dma) established for its
    group, with user access unblocked;

3)  Attach the device to a user-specified ioasid (shared by all devices
    attached to this ioasid). Before attaching, the device should be first
    detached from the initial context;

4)  Detach the device from the ioasid and switch it back to the initial
    security context;

5)  Unbind the device from the iommufd, back to access blocked state and
    move its group out of the initial security context if it's the last
    unbound device in the group;

(multiple attach/detach could happen between 2 and 5).

However, the existing iommu core has a problem with the above transition.
The detach in step 3/4 re-attaches the device/group to the default domain
automatically, which opens the door for user-initiated DMAs to attack
the rest of the system. The existing vfio doesn't have this problem as
it combines 2/3 in one step (so does 4/5).

Fixing this problem requires the iommu core to also participate in the
security context management. Following this direction we also move the
group viability check into the iommu core, which allows iommufd to stay
fully device-centric w/o keeping any group knowledge (combined with the
extension to iommu_at[de]tach_device() in a later patch).

Basically two new interfaces are provided:

        int iommu_device_init_user_dma(struct device *dev,
                        unsigned long owner);
        void iommu_device_exit_user_dma(struct device *dev);

iommufd calls them respectively when handling device binding/unbinding
requests.
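
For illustration, a hedged sketch of the intended call sites in iommufd
(the ictx pointer below stands for the iommufd context and is an
assumption about a later patch, not something defined here):

	/* Bind: establish the blocking security context first. */
	ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
	if (ret)
		return ret;
	/* ... record the device, then unblock user access ... */

	/* Unbind: after user access is blocked again. */
	iommu_device_exit_user_dma(dev);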

The init_user_dma() for the 1st device in a group marks the entire group
for user-dma and establishes the initial security context (dma blocked)
according to the aforementioned criteria. As long as the group is marked for
user-dma, auto-reattaching to the default domain is disabled. Instead, upon
detach the group is moved back to the initial security context.

The caller also provides an owner id to mark the ownership so that an
inadvertent attempt from another caller on the same device can be caught.
In this RFC iommufd will use the fd context pointer as the owner id.

The exit_user_dma() for the last device in the group clears the user-dma
mark and moves the group back to the default domain.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommu.c | 145 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h |  12 ++++
 2 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5ea3a007fd7c..bffd84e978fb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -45,6 +45,8 @@ struct iommu_group {
 	struct iommu_domain *default_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
+	unsigned long user_dma_owner_id;
+	refcount_t owner_cnt;
 };
 
 struct group_device {
@@ -86,6 +88,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static bool iommu_group_user_dma_viable(struct iommu_group *group);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -275,7 +278,11 @@ int iommu_probe_device(struct device *dev)
 	 */
 	iommu_alloc_default_domain(group, dev);
 
-	if (group->default_domain) {
+	/*
+	 * If any device in the group has been initialized for user dma,
+	 * avoid attaching the default domain.
+	 */
+	if (group->default_domain && !group->user_dma_owner_id) {
 		ret = __iommu_attach_device(group->default_domain, dev);
 		if (ret) {
 			iommu_group_put(group);
@@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
 		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
 		break;
 	case BUS_NOTIFY_BOUND_DRIVER:
+		/*
+		 * FIXME: Alternatively the attached drivers could generically
+		 * indicate to the iommu layer that they are safe for keeping
+		 * the iommu group user viable by calling some function around
+		 * probe(). We could eliminate this gross BUG_ON() by denying
+		 * probe to non-iommu-safe drivers.
+		 */
+		mutex_lock(&group->mutex);
+		if (group->user_dma_owner_id)
+			BUG_ON(!iommu_group_user_dma_viable(group));
+		mutex_unlock(&group->mutex);
 		group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
 		break;
 	case BUS_NOTIFY_UNBIND_DRIVER:
@@ -2304,7 +2322,11 @@ static int __iommu_attach_group(struct iommu_domain *domain,
 {
 	int ret;
 
-	if (group->default_domain && group->domain != group->default_domain)
+	/*
+	 * group->domain could be NULL when a domain is detached from the
+	 * group but the default_domain is not re-attached.
+	 */
+	if (group->domain && group->domain != group->default_domain)
 		return -EBUSY;
 
 	ret = __iommu_group_for_each_dev(group, domain,
@@ -2341,7 +2363,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
 {
 	int ret;
 
-	if (!group->default_domain) {
+	/*
+	 * If any device in the group has been initialized for user dma,
+	 * avoid re-attaching the default domain.
+	 */
+	if (!group->default_domain || group->user_dma_owner_id) {
 		__iommu_group_for_each_dev(group, domain,
 					   iommu_group_do_detach_device);
 		group->domain = NULL;
@@ -3276,3 +3302,116 @@ int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *dat
 	return ops->device_info(dev, attr, data);
 }
 EXPORT_SYMBOL_GPL(iommu_device_get_info);
+
+/*
+ * IOMMU core interfaces for iommufd.
+ */
+
+/*
+ * FIXME: We currently simply follow the vfio policy to maintain the group's
+ * viability to the user. Eventually, we should avoid the below hard-coded
+ * list by letting drivers indicate to the iommu layer that they are safe
+ * for keeping the iommu group viable for user access.
+ */
+static const char * const iommu_driver_allowed[] = {
+	"vfio-pci",
+	"pci-stub"
+};
+
+/*
+ * An iommu group is viable for use by userspace if all devices are in
+ * one of the following states:
+ *  - driver-less
+ *  - bound to an allowed driver
+ *  - a PCI interconnect device
+ */
+static int device_user_dma_viable(struct device *dev, void *data)
+{
+	struct device_driver *drv = READ_ONCE(dev->driver);
+
+	if (!drv)
+		return 0;
+
+	if (dev_is_pci(dev)) {
+		struct pci_dev *pdev = to_pci_dev(dev);
+
+		if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+			return 0;
+	}
+
+	return match_string(iommu_driver_allowed,
+			    ARRAY_SIZE(iommu_driver_allowed),
+			    drv->name) < 0;
+}
+
+static bool iommu_group_user_dma_viable(struct iommu_group *group)
+{
+	return !__iommu_group_for_each_dev(group, NULL, device_user_dma_viable);
+}
+
+static int iommu_group_init_user_dma(struct iommu_group *group,
+				     unsigned long owner)
+{
+	if (group->user_dma_owner_id) {
+		if (group->user_dma_owner_id != owner)
+			return -EBUSY;
+
+		refcount_inc(&group->owner_cnt);
+		return 0;
+	}
+
+	if (group->domain && group->domain != group->default_domain)
+		return -EBUSY;
+
+	if (!iommu_group_user_dma_viable(group))
+		return -EINVAL;
+
+	group->user_dma_owner_id = owner;
+	refcount_set(&group->owner_cnt, 1);
+
+	/* default domain is unsafe for user-initiated dma */
+	if (group->domain == group->default_domain)
+		__iommu_detach_group(group->default_domain, group);
+
+	return 0;
+}
+
+int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+	int ret;
+
+	if (!group || !owner)
+		return -ENODEV;
+
+	mutex_lock(&group->mutex);
+	ret = iommu_group_init_user_dma(group, owner);
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_device_init_user_dma);
+
+static void iommu_group_exit_user_dma(struct iommu_group *group)
+{
+	if (refcount_dec_and_test(&group->owner_cnt)) {
+		group->user_dma_owner_id = 0;
+		if (group->default_domain)
+			__iommu_attach_group(group->default_domain, group);
+	}
+}
+
+void iommu_device_exit_user_dma(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+
+	if (WARN_ON(!group || !group->user_dma_owner_id))
+		return;
+
+	mutex_lock(&group->mutex);
+	iommu_group_exit_user_dma(group);
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_device_exit_user_dma);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 52a6d33c82dc..943de6897f56 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -617,6 +617,9 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle);
 
 int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data);
 
+int iommu_device_init_user_dma(struct device *dev, unsigned long owner);
+void iommu_device_exit_user_dma(struct device *dev);
+
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -1018,6 +1021,15 @@ static inline int iommu_device_get_info(struct device *dev,
 {
 	return -ENODEV;
 }
+
+static inline int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_device_exit_user_dma(struct device *dev)
+{
+}
 #endif /* CONFIG_IOMMU_API */
 
 /**
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
@ 2021-09-19  6:38   ` Liu Yi L
  0 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: kvm, kwankhede, jean-philippe, dave.jiang, ashok.raj, corbet,
	kevin.tian, parav, lkml, david, robin.murphy, jun.j.tian,
	linux-kernel, lushenming, iommu, pbonzini, dwmw2

From: Lu Baolu <baolu.lu@linux.intel.com>

This extends iommu core to manage security context for passthrough
devices. Please bear a long explanation for how we reach this design
instead of managing it solely in iommufd like what vfio does today.

Devices which cannot be isolated from each other are organized into an
iommu group. When a device is assigned to the user space, the entire
group must be put in a security context so that user-initiated DMAs via
the assigned device cannot harm the rest of the system. No user access
should be granted on a device before the security context is established
for the group which the device belongs to.

Managing the security context must meet below criteria:

1)  The group is viable for user-initiated DMAs. This implies that the
    devices in the group must be either bound to a device-passthrough
    framework, or driver-less, or bound to a driver which is known safe
    (not do DMA).

2)  The security context should only allow DMA to the user's memory and
    devices in this group;

3)  After the security context is established for the group, the group
    viability must be continuously monitored before the user relinquishes
    all devices belonging to the group. The viability might be broken e.g.
    when a driver-less device is later bound to a driver which does DMA.

4)  The security context should not be destroyed before user access
    permission is withdrawn.

Existing vfio introduces explicit container/group semantics in its uAPI
to meet above requirements. A single security context (iommu domain)
is created per container. Attaching group to container moves the entire
group into the associated security context, and vice versa. The user can
open the device only after group attach. A group can be detached only
after all devices in the group are closed. Group viability is monitored
by listening to iommu group events.

Unlike vfio, iommufd adopts a device-centric design with all group
logistics hidden behind the fd. Binding a device to iommufd serves
as the contract to get security context established (and vice versa
for unbinding). One additional requirement in iommufd is to manage the
switch between multiple security contexts due to decoupled bind/attach:

1)  Open a device in "/dev/vfio/devices" with user access blocked;

2)  Bind the device to an iommufd with an initial security context
    (an empty iommu domain which blocks dma) established for its
    group, with user access unblocked;

3)  Attach the device to a user-specified ioasid (shared by all devices
    attached to this ioasid). Before attaching, the device should be first
    detached from the initial context;

4)  Detach the device from the ioasid and switch it back to the initial
    security context;

5)  Unbind the device from the iommufd, back to access blocked state and
    move its group out of the initial security context if it's the last
    unbound device in the group;

(multiple attach/detach could happen between 2 and 5).

However existing iommu core has problem with above transition. Detach
in step 3/4 makes the device/group re-attached to the default domain
automatically, which opens the door for user-initiated DMAs to attack
the rest of the system. The existing vfio doesn't have this problem as
it combines 2/3 in one step (so does 4/5).

Fixing this problem requires the iommu core to also participate in the
security context management. Following this direction we also move group
viability check into the iommu core, which allows iommufd to stay fully
device-centric w/o keeping any group knowledge (combining with the
extension to iommu_at[de]tach_device() in a latter patch).

Basically two new interfaces are provided:

        int iommu_device_init_user_dma(struct device *dev,
                        unsigned long owner);
        void iommu_device_exit_user_dma(struct device *dev);

iommufd calls them respectively when handling device binding/unbinding
requests.

The init_user_dma() for the 1st device in a group marks the entire group
for user-dma and establishes the initial security context (dma blocked)
according to aforementioned criteria. As long as the group is marked for
user-dma, auto-reattaching to default domain is disabled. Instead, upon
detaching the group is moved back to the initial security context.

The caller also provides an owner id to mark the ownership so inadvertent
attempt from another caller on the same device can be captured. In this
RFC iommufd will use the fd context pointer as the owner id.

The exit_user_dma() for the last device in the group clears the user-dma
mark and moves the group back to the default domain.
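
For illustration, a hedged sketch of the expected semantics for two devices
(devA and devB) in the same group, where ctx1 and ctx2 stand for two
different iommufd contexts used as owner ids (the helper below is not part
of the patch):

        #include <linux/iommu.h>

        static void example_group_ownership(struct device *devA, struct device *devB,
                                            void *ctx1, void *ctx2)
        {
                /* 1st device: marks the group, detaches the default domain */
                iommu_device_init_user_dma(devA, (unsigned long)ctx1);
                /* same owner on a 2nd device in the group: owner_cnt++ */
                iommu_device_init_user_dma(devB, (unsigned long)ctx1);
                /* different owner on the same group: returns -EBUSY */
                iommu_device_init_user_dma(devB, (unsigned long)ctx2);

                iommu_device_exit_user_dma(devB);       /* owner_cnt-- */
                /* last device: clears the mark, re-attaches the default domain */
                iommu_device_exit_user_dma(devA);
        }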

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommu.c | 145 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h |  12 ++++
 2 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5ea3a007fd7c..bffd84e978fb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -45,6 +45,8 @@ struct iommu_group {
 	struct iommu_domain *default_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
+	unsigned long user_dma_owner_id;
+	refcount_t owner_cnt;
 };
 
 struct group_device {
@@ -86,6 +88,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static bool iommu_group_user_dma_viable(struct iommu_group *group);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -275,7 +278,11 @@ int iommu_probe_device(struct device *dev)
 	 */
 	iommu_alloc_default_domain(group, dev);
 
-	if (group->default_domain) {
+	/*
+	 * If any device in the group has been initialized for user dma,
+	 * avoid attaching the default domain.
+	 */
+	if (group->default_domain && !group->user_dma_owner_id) {
 		ret = __iommu_attach_device(group->default_domain, dev);
 		if (ret) {
 			iommu_group_put(group);
@@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
 		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
 		break;
 	case BUS_NOTIFY_BOUND_DRIVER:
+		/*
+		 * FIXME: Alternatively the attached drivers could generically
+		 * indicate to the iommu layer that they are safe for keeping
+		 * the iommu group user viable by calling some function around
+		 * probe(). We could eliminate this gross BUG_ON() by denying
+		 * probe to non-iommu-safe driver.
+		 */
+		mutex_lock(&group->mutex);
+		if (group->user_dma_owner_id)
+			BUG_ON(!iommu_group_user_dma_viable(group));
+		mutex_unlock(&group->mutex);
 		group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
 		break;
 	case BUS_NOTIFY_UNBIND_DRIVER:
@@ -2304,7 +2322,11 @@ static int __iommu_attach_group(struct iommu_domain *domain,
 {
 	int ret;
 
-	if (group->default_domain && group->domain != group->default_domain)
+	/*
+	 * group->domain could be NULL when a domain is detached from the
+	 * group but the default_domain is not re-attached.
+	 */
+	if (group->domain && group->domain != group->default_domain)
 		return -EBUSY;
 
 	ret = __iommu_group_for_each_dev(group, domain,
@@ -2341,7 +2363,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
 {
 	int ret;
 
-	if (!group->default_domain) {
+	/*
+	 * If any device in the group has been initialized for user dma,
+	 * avoid re-attaching the default domain.
+	 */
+	if (!group->default_domain || group->user_dma_owner_id) {
 		__iommu_group_for_each_dev(group, domain,
 					   iommu_group_do_detach_device);
 		group->domain = NULL;
@@ -3276,3 +3302,116 @@ int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *dat
 	return ops->device_info(dev, attr, data);
 }
 EXPORT_SYMBOL_GPL(iommu_device_get_info);
+
+/*
+ * IOMMU core interfaces for iommufd.
+ */
+
+/*
+ * FIXME: We currently simply follow the vfio policy to maintain the
+ * group's viability to the user. Eventually, we should avoid the below
+ * hard-coded list by letting drivers indicate to the iommu layer that
+ * they are safe for keeping the iommu group viable for user access.
+ */
+static const char * const iommu_driver_allowed[] = {
+	"vfio-pci",
+	"pci-stub"
+};
+
+/*
+ * An iommu group is viable for use by userspace if all devices are in
+ * one of the following states:
+ *  - driver-less
+ *  - bound to an allowed driver
+ *  - a PCI interconnect device
+ */
+static int device_user_dma_viable(struct device *dev, void *data)
+{
+	struct device_driver *drv = READ_ONCE(dev->driver);
+
+	if (!drv)
+		return 0;
+
+	if (dev_is_pci(dev)) {
+		struct pci_dev *pdev = to_pci_dev(dev);
+
+		if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+			return 0;
+	}
+
+	return match_string(iommu_driver_allowed,
+			    ARRAY_SIZE(iommu_driver_allowed),
+			    drv->name) < 0;
+}
+
+static bool iommu_group_user_dma_viable(struct iommu_group *group)
+{
+	return !__iommu_group_for_each_dev(group, NULL, device_user_dma_viable);
+}
+
+static int iommu_group_init_user_dma(struct iommu_group *group,
+				     unsigned long owner)
+{
+	if (group->user_dma_owner_id) {
+		if (group->user_dma_owner_id != owner)
+			return -EBUSY;
+
+		refcount_inc(&group->owner_cnt);
+		return 0;
+	}
+
+	if (group->domain && group->domain != group->default_domain)
+		return -EBUSY;
+
+	if (!iommu_group_user_dma_viable(group))
+		return -EINVAL;
+
+	group->user_dma_owner_id = owner;
+	refcount_set(&group->owner_cnt, 1);
+
+	/* default domain is unsafe for user-initiated dma */
+	if (group->domain == group->default_domain)
+		__iommu_detach_group(group->default_domain, group);
+
+	return 0;
+}
+
+int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+	int ret;
+
+	if (!group || !owner)
+		return -ENODEV;
+
+	mutex_lock(&group->mutex);
+	ret = iommu_group_init_user_dma(group, owner);
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_device_init_user_dma);
+
+static void iommu_group_exit_user_dma(struct iommu_group *group)
+{
+	if (refcount_dec_and_test(&group->owner_cnt)) {
+		group->user_dma_owner_id = 0;
+		if (group->default_domain)
+			__iommu_attach_group(group->default_domain, group);
+	}
+}
+
+void iommu_device_exit_user_dma(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+
+	if (WARN_ON(!group || !group->user_dma_owner_id))
+		return;
+
+	mutex_lock(&group->mutex);
+	iommu_group_exit_user_dma(group);
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_device_exit_user_dma);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 52a6d33c82dc..943de6897f56 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -617,6 +617,9 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle);
 
 int iommu_device_get_info(struct device *dev, enum iommu_devattr attr, void *data);
 
+int iommu_device_init_user_dma(struct device *dev, unsigned long owner);
+void iommu_device_exit_user_dma(struct device *dev);
+
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -1018,6 +1021,15 @@ static inline int iommu_device_get_info(struct device *dev,
 {
 	return -ENODEV;
 }
+
+static inline int iommu_device_init_user_dma(struct device *dev, unsigned long owner)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_device_exit_user_dma(struct device *dev)
+{
+}
 #endif /* CONFIG_IOMMU_API */
 
 /**
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

Under the /dev/iommu model, iommufd provides the interface for I/O page
table management such as dma map/unmap. However, it cannot work
independently since the device is still owned by the device-passthrough
frameworks (VFIO, vDPA, etc.), and vice versa. A device-passthrough
framework should build a connection between its device and the iommufd
to delegate the I/O page table management to iommufd.

This patch introduces iommufd_[un]bind_device() helpers for the device-
passthrough framework to build such a connection. The helper functions then
invoke the iommu core (iommu_device_init/exit_user_dma()) to establish/exit
the security context for the bound device. Each successfully bound device is
internally tracked by an iommufd_device object. This object is returned
to the caller for subsequent attaching operations on the device as well.

The caller should pass a user-provided cookie to mark the device in the
iommufd. Later this cookie will be used to represent the device in the
iommufd uAPI, e.g. when querying device capabilities or handling per-device
I/O page faults. One alternative is to have iommufd allocate a device label
and return it to the user. Either way works, but the cookie is slightly
preferred per earlier discussion as it may allow the user to inject faults
slightly faster without an ID->vRID lookup.
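
For illustration, a hedged sketch of how a passthrough framework might use
these helpers (struct my_dev_state and its fields are made-up names for the
example, not part of this series):

        #include <linux/err.h>
        #include <linux/iommufd.h>

        struct my_dev_state {                   /* illustrative only */
                struct device *dev;
                struct iommufd_device *idev;
        };

        static int my_bind(struct my_dev_state *s, int iommu_fd, u64 cookie)
        {
                struct iommufd_device *idev;

                idev = iommufd_bind_device(iommu_fd, s->dev, cookie);
                if (IS_ERR(idev))
                        return PTR_ERR(idev);   /* e.g. -EBUSY on duplicate dev/cookie */

                s->idev = idev;                 /* kept for later attach/unbind */
                return 0;
        }

        static void my_unbind(struct my_dev_state *s)
        {
                iommufd_unbind_device(s->idev); /* exits the security context */
                s->idev = NULL;
        }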

The iommufd_[un]bind_device() functions are only used for physical devices.
Other variants will be introduced in the future, e.g.:

-  iommufd_[un]bind_device_pasid() for mdev/subdev which requires pasid-granular
   DMA isolation;
-  iommufd_[un]bind_sw_mdev() for sw mdev which relies on software measures
   instead of the iommu to isolate DMA;

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 160 +++++++++++++++++++++++++++++++-
 include/linux/iommufd.h         |  38 ++++++++
 2 files changed, 196 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/iommufd.h

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 710b7e62988b..e16ca21e4534 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -16,10 +16,30 @@
 #include <linux/miscdevice.h>
 #include <linux/mutex.h>
 #include <linux/iommu.h>
+#include <linux/iommufd.h>
+#include <linux/xarray.h>
+#include <asm-generic/bug.h>
 
 /* Per iommufd */
 struct iommufd_ctx {
 	refcount_t refs;
+	struct mutex lock;
+	struct xarray device_xa; /* xarray of bound devices */
+};
+
+/*
+ * An iommufd_device object represents the binding relationship
+ * between an iommufd and a device. It is created upon a successful
+ * binding request from the device driver. The bound device must be
+ * a physical device so far; subdevices will be supported later
+ * (with additional PASID information). A user-assigned cookie
+ * is also recorded to mark the device in the /dev/iommu uAPI.
+ */
+struct iommufd_device {
+	unsigned int id;
+	struct iommufd_ctx *ictx;
+	struct device *dev; /* always be the physical device */
+	u64 dev_cookie;
 };
 
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
@@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
 		return -ENOMEM;
 
 	refcount_set(&ictx->refs, 1);
+	mutex_init(&ictx->lock);
+	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
 	filep->private_data = ictx;
 
 	return ret;
 }
 
+static void iommufd_ctx_get(struct iommufd_ctx *ictx)
+{
+	refcount_inc(&ictx->refs);
+}
+
+static const struct file_operations iommufd_fops;
+
+/**
+ * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
+ * @fd: [in] iommufd file descriptor.
+ *
+ * Return: a pointer to the iommufd context on success, otherwise NULL.
+ *
+ */
+static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
+{
+	struct fd f = fdget(fd);
+	struct file *file = f.file;
+	struct iommufd_ctx *ictx;
+
+	if (!file)
+		return NULL;
+
+	if (file->f_op != &iommufd_fops)
+		return NULL;
+
+	ictx = file->private_data;
+	if (ictx)
+		iommufd_ctx_get(ictx);
+	fdput(f);
+	return ictx;
+}
+
+/**
+ * iommufd_ctx_put - Releases a reference to the internal iommufd context.
+ * @ictx: [in] Pointer to iommufd context.
+ *
+ */
 static void iommufd_ctx_put(struct iommufd_ctx *ictx)
 {
-	if (refcount_dec_and_test(&ictx->refs))
-		kfree(ictx);
+	if (!refcount_dec_and_test(&ictx->refs))
+		return;
+
+	WARN_ON(!xa_empty(&ictx->device_xa));
+	kfree(ictx);
 }
 
 static int iommufd_fops_release(struct inode *inode, struct file *filep)
@@ -86,6 +149,99 @@ static struct miscdevice iommu_misc_dev = {
 	.mode = 0666,
 };
 
+/**
+ * iommufd_bind_device - Bind a physical device marked by a device
+ *			 cookie to an iommu fd.
+ * @fd:		[in] iommufd file descriptor.
+ * @dev:	[in] Pointer to a physical device struct.
+ * @dev_cookie:	[in] A cookie to mark the device in /dev/iommu uAPI.
+ *
+ * A successful bind establishes a security context for the device
+ * and returns struct iommufd_device pointer. Otherwise returns
+ * error pointer.
+ *
+ */
+struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
+					   u64 dev_cookie)
+{
+	struct iommufd_ctx *ictx;
+	struct iommufd_device *idev;
+	unsigned long index;
+	unsigned int id;
+	int ret;
+
+	ictx = iommufd_ctx_fdget(fd);
+	if (!ictx)
+		return ERR_PTR(-EINVAL);
+
+	mutex_lock(&ictx->lock);
+
+	/* check duplicate registration */
+	xa_for_each(&ictx->device_xa, index, idev) {
+		if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
+			ret = -EBUSY;
+			goto out_unlock;
+		}
+	}
+
+	idev = kzalloc(sizeof(*idev), GFP_KERNEL);
+	if (!idev) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	/* Establish the security context */
+	ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
+	if (ret)
+		goto out_free;
+
+	ret = xa_alloc(&ictx->device_xa, &id, idev,
+		       XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
+		       GFP_KERNEL);
+	if (ret) {
+		/* ret is propagated via ERR_PTR() at the exit label */
+		goto out_user_dma;
+	}
+
+	idev->ictx = ictx;
+	idev->dev = dev;
+	idev->dev_cookie = dev_cookie;
+	idev->id = id;
+	mutex_unlock(&ictx->lock);
+
+	return idev;
+out_user_dma:
+	iommu_device_exit_user_dma(dev);
+out_free:
+	kfree(idev);
+out_unlock:
+	mutex_unlock(&ictx->lock);
+	iommufd_ctx_put(ictx);
+
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(iommufd_bind_device);
+
+/**
+ * iommufd_unbind_device - Unbind a physical device from iommufd
+ *
+ * @idev: [in] Pointer to the internal iommufd_device struct.
+ *
+ */
+void iommufd_unbind_device(struct iommufd_device *idev)
+{
+	struct iommufd_ctx *ictx = idev->ictx;
+
+	mutex_lock(&ictx->lock);
+	xa_erase(&ictx->device_xa, idev->id);
+	mutex_unlock(&ictx->lock);
+	/* Exit the security context */
+	iommu_device_exit_user_dma(idev->dev);
+	kfree(idev);
+	iommufd_ctx_put(ictx);
+}
+EXPORT_SYMBOL_GPL(iommufd_unbind_device);
+
 static int __init iommufd_init(void)
 {
 	int ret;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 000000000000..1603a13937e9
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * IOMMUFD API definition
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Liu Yi L <yi.l.liu@intel.com>
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/device.h>
+
+#define IOMMUFD_DEVID_MAX	((unsigned int)(0x7FFFFFFF))
+#define IOMMUFD_DEVID_MIN	0
+
+struct iommufd_device;
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_device *
+iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
+void iommufd_unbind_device(struct iommufd_device *idev);
+
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_device *
+iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+static inline void iommufd_unbind_device(struct iommufd_device *idev)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif /* __LINUX_IOMMUFD_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: kvm, kwankhede, jean-philippe, dave.jiang, ashok.raj, corbet,
	kevin.tian, parav, lkml, david, robin.murphy, jun.j.tian,
	linux-kernel, lushenming, iommu, pbonzini, dwmw2

This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the vfio
device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is provided
because it's implicitly done when the device fd is closed.

In concept a vfio device can be bound to multiple iommufds, each hosting
a subset of the I/O address spaces attached through this device. However,
as a starting point (matching current vfio), only one I/O address space is
supported per vfio device, which implies that one device can only be
attached to one iommufd at this point.
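
For illustration, a hedged userspace sketch of the new ioctl (error handling
is left to the caller; the helper name is made up for the example):

        #include <sys/ioctl.h>
        #include <linux/vfio.h>

        static int bind_to_iommufd(int dev_fd, int iommu_fd, __u64 cookie)
        {
                struct vfio_device_iommu_bind_data bind = {
                        .argsz = sizeof(bind),
                        .flags = 0,
                        .iommu_fd = iommu_fd,
                        .dev_cookie = cookie,
                };

                /* on success the device's security context exists and
                 * user access to dev_fd is unblocked
                 */
                return ioctl(dev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
        }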

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/Kconfig            |  1 +
 drivers/vfio/pci/vfio_pci.c         | 72 ++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_private.h |  8 ++++
 include/uapi/linux/vfio.h           | 30 ++++++++++++
 4 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 5e2e1b9a9fd3..3abfb098b4dc 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -5,6 +5,7 @@ config VFIO_PCI
 	depends on MMU
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
+	select IOMMUFD
 	help
 	  Support for the PCI VFIO bus driver.  This is required to make
 	  use of PCI drivers using the VFIO framework.
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 145addde983b..20006bb66430 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -552,6 +552,16 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
 			vdev->req_trigger = NULL;
 		}
 		mutex_unlock(&vdev->igate);
+
+		mutex_lock(&vdev->videv_lock);
+		if (vdev->videv) {
+			struct vfio_iommufd_device *videv = vdev->videv;
+
+			vdev->videv = NULL;
+			iommufd_unbind_device(videv->idev);
+			kfree(videv);
+		}
+		mutex_unlock(&vdev->videv_lock);
 	}
 
 	mutex_unlock(&vdev->reflck->lock);
@@ -780,7 +790,66 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
 		container_of(core_vdev, struct vfio_pci_device, vdev);
 	unsigned long minsz;
 
-	if (cmd == VFIO_DEVICE_GET_INFO) {
+	if (cmd == VFIO_DEVICE_BIND_IOMMUFD) {
+		struct vfio_device_iommu_bind_data bind_data;
+		unsigned long minsz;
+		struct iommufd_device *idev;
+		struct vfio_iommufd_device *videv;
+
+		/*
+		 * Reject the request if the device is already opened and
+		 * attached to a container.
+		 */
+		if (vfio_device_in_container(core_vdev))
+			return -ENOTTY;
+
+		minsz = offsetofend(struct vfio_device_iommu_bind_data, dev_cookie);
+
+		if (copy_from_user(&bind_data, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (bind_data.argsz < minsz ||
+		    bind_data.flags || bind_data.iommu_fd < 0)
+			return -EINVAL;
+
+		mutex_lock(&vdev->videv_lock);
+		/*
+		 * Allow only one iommufd per device until multiple
+		 * address spaces (e.g. vSVA) support is introduced
+		 * in the future.
+		 */
+		if (vdev->videv) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EBUSY;
+		}
+
+		idev = iommufd_bind_device(bind_data.iommu_fd,
+					   &vdev->pdev->dev,
+					   bind_data.dev_cookie);
+		if (IS_ERR(idev)) {
+			mutex_unlock(&vdev->videv_lock);
+			return PTR_ERR(idev);
+		}
+
+		videv = kzalloc(sizeof(*videv), GFP_KERNEL);
+		if (!videv) {
+			iommufd_unbind_device(idev);
+			mutex_unlock(&vdev->videv_lock);
+			return -ENOMEM;
+		}
+		videv->idev = idev;
+		videv->iommu_fd = bind_data.iommu_fd;
+		/*
+		 * A security context has been established. Unblock
+		 * user access.
+		 */
+		if (atomic_read(&vdev->block_access))
+			atomic_set(&vdev->block_access, 0);
+		vdev->videv = videv;
+		mutex_unlock(&vdev->videv_lock);
+
+		return 0;
+	} else if (cmd == VFIO_DEVICE_GET_INFO) {
 		struct vfio_device_info info;
 		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
 		unsigned long capsz;
@@ -2031,6 +2100,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	mutex_init(&vdev->vma_lock);
 	INIT_LIST_HEAD(&vdev->vma_list);
 	init_rwsem(&vdev->memory_lock);
+	mutex_init(&vdev->videv_lock);
 
 	ret = vfio_pci_reflck_attach(vdev);
 	if (ret)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index f12012e30b53..bd784accac35 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -14,6 +14,7 @@
 #include <linux/types.h>
 #include <linux/uuid.h>
 #include <linux/notifier.h>
+#include <linux/iommufd.h>
 
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
@@ -99,6 +100,11 @@ struct vfio_pci_mmap_vma {
 	struct list_head	vma_next;
 };
 
+struct vfio_iommufd_device {
+	struct iommufd_device *idev;
+	int iommu_fd;
+};
+
 struct vfio_pci_device {
 	struct vfio_device	vdev;
 	struct pci_dev		*pdev;
@@ -144,6 +150,8 @@ struct vfio_pci_device {
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
 	atomic_t		block_access;
+	struct mutex		videv_lock;
+	struct vfio_iommufd_device *videv;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..c902abd60339 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -190,6 +190,36 @@ struct vfio_group_status {
 
 /* --------------- IOCTLs for DEVICE file descriptors --------------- */
 
+/*
+ * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19,
+ *				struct vfio_device_iommu_bind_data)
+ *
+ * Bind a vfio_device to the specified iommufd
+ *
+ * The user should provide a device cookie when calling this ioctl. The
+ * cookie is later used in iommufd for capability query, iotlb invalidation
+ * and I/O fault handling.
+ *
+ * User is not allowed to access the device before the binding operation
+ * is completed.
+ *
+ * Unbind is automatically conducted when device fd is closed.
+ *
+ * Input parameters:
+ *	- iommu_fd;
+ *	- dev_cookie;
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_iommu_bind_data {
+	__u32	argsz;
+	__u32	flags;
+	__s32	iommu_fd;
+	__u64	dev_cookie;
+};
+
+#define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
 /**
  * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
  *						struct vfio_device_info)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 09/20] iommu: Add page size and address width attributes
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: kvm, kwankhede, jean-philippe, dave.jiang, ashok.raj, corbet,
	kevin.tian, parav, lkml, david, robin.murphy, jun.j.tian,
	linux-kernel, lushenming, iommu, pbonzini, dwmw2

From: Lu Baolu <baolu.lu@linux.intel.com>

This exposes the PAGE_SIZE and ADDR_WIDTH attributes so that iommufd can
use them when defining an IOAS.
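
For illustration, a hedged in-kernel sketch of querying the two new
attributes through iommu_device_get_info() (the next patch adds the real
iommufd user):

        #include <linux/iommu.h>
        #include <linux/printk.h>

        static void show_iommu_limits(struct device *dev)
        {
                u64 pgsizes;
                u32 addr_width;

                if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes))
                        pr_info("supported page sizes: 0x%llx\n", pgsizes);
                if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width))
                        pr_info("address width: %u bits\n", addr_width);
        }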

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/iommu.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 943de6897f56..86d34e4ce05e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -153,9 +153,13 @@ enum iommu_dev_features {
 /**
  * enum iommu_devattr - Per device IOMMU attributes
  * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
+ * @IOMMU_DEV_INFO_PAGE_SIZE [u64]: Page sizes that iommu supports.
+ * @IOMMU_DEV_INFO_ADDR_WIDTH [u32]: Address width supported.
  */
 enum iommu_devattr {
 	IOMMU_DEV_INFO_FORCE_SNOOP,
+	IOMMU_DEV_INFO_PAGE_SIZE,
+	IOMMU_DEV_INFO_ADDR_WIDTH,
 };
 
 #define IOMMU_PASID_INVALID	(-1U)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: kvm, kwankhede, jean-philippe, dave.jiang, ashok.raj, corbet,
	kevin.tian, parav, lkml, david, robin.murphy, jun.j.tian,
	linux-kernel, lushenming, iommu, pbonzini, dwmw2

After a device is bound to the iommufd, userspace can use this interface
to query the underlying iommu capability and format info for this device.
Based on this information the user then creates an I/O address space in a
format compatible with the to-be-attached devices.

The device cookie registered at binding time is used to identify the
device being queried here.
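
For illustration, a hedged userspace sketch of the query (field names follow
the uAPI added below; the helper name is made up):

        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/iommu.h>

        static int query_device(int iommu_fd, __u64 cookie)
        {
                struct iommu_device_info info = {
                        .argsz = sizeof(info),
                        .dev_cookie = cookie,
                };

                if (ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info))
                        return -1;

                if (info.flags & IOMMU_DEVICE_INFO_PGSIZES)
                        printf("pgsize bitmap: 0x%llx\n",
                               (unsigned long long)info.pgsize_bitmap);
                if (info.flags & IOMMU_DEVICE_INFO_ADDR_WIDTH)
                        printf("address width: %u\n", info.addr_width);
                return 0;
        }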

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
 include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index e16ca21e4534..641f199f2d41 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+static struct device *
+iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
+{
+	struct iommufd_device *idev;
+	struct device *dev = NULL;
+	unsigned long index;
+
+	mutex_lock(&ictx->lock);
+	xa_for_each(&ictx->device_xa, index, idev) {
+		if (idev->dev_cookie == dev_cookie) {
+			dev = idev->dev;
+			break;
+		}
+	}
+	mutex_unlock(&ictx->lock);
+
+	return dev;
+}
+
+static void iommu_device_build_info(struct device *dev,
+				    struct iommu_device_info *info)
+{
+	bool snoop;
+	u64 awidth, pgsizes;
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
+		info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
+		info->pgsize_bitmap = pgsizes;
+		info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
+	}
+
+	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
+		info->addr_width = awidth;
+		info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
+	}
+}
+
+static int iommufd_get_device_info(struct iommufd_ctx *ictx,
+				   unsigned long arg)
+{
+	struct iommu_device_info info;
+	unsigned long minsz;
+	struct device *dev;
+
+	minsz = offsetofend(struct iommu_device_info, addr_width);
+
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	info.flags = 0;
+
+	dev = iommu_find_device_from_cookie(ictx, info.dev_cookie);
+	if (!dev)
+		return -EINVAL;
+
+	iommu_device_build_info(dev, &info);
+
+	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+}
+
 static long iommufd_fops_unl_ioctl(struct file *filep,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -127,6 +192,9 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 		return ret;
 
 	switch (cmd) {
+	case IOMMU_DEVICE_GET_INFO:
+		ret = iommufd_get_device_info(ictx, arg);
+		break;
 	default:
 		pr_err_ratelimited("unsupported cmd %u\n", cmd);
 		break;
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 59178fc229ca..76b71f9d6b34 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -7,6 +7,55 @@
 #define _UAPI_IOMMU_H
 
 #include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* -------- IOCTLs for IOMMU file descriptor (/dev/iommu) -------- */
+
+#define IOMMU_TYPE	(';')
+#define IOMMU_BASE	100
+
+/*
+ * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
+ *				struct iommu_device_info)
+ *
+ * Check IOMMU capabilities and format information on a bound device.
+ *
+ * The device is identified by device cookie (registered when binding
+ * this device).
+ *
+ * @argsz:	   user filled size of this data.
+ * @flags:	   tells userspace which capability info is available
+ * @dev_cookie:	   user-assigned cookie.
+ * @pgsize_bitmap: Bitmap of supported page sizes. 1-setting of the
+ *		   bit in pgsize_bitmap[63:12] indicates a supported
+ *		   page size. Details as below table:
+ *
+ *		   +===============+============+
+ *		   |  Bit[index]   |  Page Size |
+ *		   +---------------+------------+
+ *		   |  12           |  4 KB      |
+ *		   +---------------+------------+
+ *		   |  13           |  8 KB      |
+ *		   +---------------+------------+
+ *		   |  14           |  16 KB     |
+ *		   +---------------+------------+
+ *		   ...
+ * @addr_width:    the address width of supported I/O address spaces.
+ *
+ * Availability: after device is bound to iommufd
+ */
+struct iommu_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
+#define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
+#define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_width field valid */
+	__u64	dev_cookie;
+	__u64   pgsize_bitmap;
+	__u32	addr_width;
+};
+
+#define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
 
 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

This patch adds IOASID allocation/free interface per iommufd. When
allocating an IOASID, userspace is expected to specify the type and
format information for the target I/O page table.

This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
implying a kernel-managed I/O page table with vfio type1v2 mapping
semantics. For this type the user should specify the addr_width of
the I/O address space and whether the I/O page table is created in
an iommu enforce_snoop format. enforce_snoop must be true at this point,
as the false setting requires an additional contract with KVM on handling
WBINVD emulation, which can be added later.

Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
for what formats can be specified when allocating an IOASID.
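
For illustration, a hedged userspace sketch of allocation and free (the
struct field names follow what the kernel side of this patch consumes; the
full uAPI definition appears below):

        #include <sys/ioctl.h>
        #include <linux/iommu.h>

        static int alloc_type1v2_ioasid(int iommu_fd, __u32 addr_width)
        {
                struct iommu_ioasid_alloc req = {
                        .argsz = sizeof(req),
                        .flags = IOMMU_IOASID_ENFORCE_SNOOP,
                        .type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2,
                        .addr_width = addr_width,
                };

                /* returns the fd-local ioasid (>= 0) on success */
                return ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &req);
        }

        static int free_ioasid(int iommu_fd, int ioasid)
        {
                /* fails with -EBUSY while devices are still attached */
                return ioctl(iommu_fd, IOMMU_IOASID_FREE, &ioasid);
        }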

Open:
- Devices on PPC platform currently use a different iommu driver in vfio.
  Per previous discussion they can also use vfio type1v2 as long as there
  is a way to claim a specific iova range from a system-wide address space.
  This requirement doesn't sound PPC specific, as addr_width for pci devices
  can also be represented by the range [0, 2^addr_width-1]. This RFC hasn't
  adopted this design yet. We hope to have formal alignment in v1 discussion
  and then decide how to incorporate it in v2.

- Currently ioasid term has already been used in the kernel (drivers/iommu/
  ioasid.c) to represent the hardware I/O address space ID in the wire. It
  covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
  ID). We need to find a way to resolve the naming conflict between the hardware
  ID and the software handle. One option is to rename the existing ioasid to be
  pasid or ssid, given their full names still sound generic. Appreciate more
  thoughts on this open!

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 120 ++++++++++++++++++++++++++++++++
 include/linux/iommufd.h         |   3 +
 include/uapi/linux/iommu.h      |  54 ++++++++++++++
 3 files changed, 177 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 641f199f2d41..4839f128b24a 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -24,6 +24,7 @@
 struct iommufd_ctx {
 	refcount_t refs;
 	struct mutex lock;
+	struct xarray ioasid_xa; /* xarray of ioasids */
 	struct xarray device_xa; /* xarray of bound devices */
 };
 
@@ -42,6 +43,16 @@ struct iommufd_device {
 	u64 dev_cookie;
 };
 
+/* Represent an I/O address space */
+struct iommufd_ioas {
+	int ioasid;
+	u32 type;
+	u32 addr_width;
+	bool enforce_snoop;
+	struct iommufd_ctx *ictx;
+	refcount_t refs;
+};
+
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
 {
 	struct iommufd_ctx *ictx;
@@ -53,6 +64,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
 
 	refcount_set(&ictx->refs, 1);
 	mutex_init(&ictx->lock);
+	xa_init_flags(&ictx->ioasid_xa, XA_FLAGS_ALLOC);
 	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
 	filep->private_data = ictx;
 
@@ -102,16 +114,118 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
 	if (!refcount_dec_and_test(&ictx->refs))
 		return;
 
+	WARN_ON(!xa_empty(&ictx->ioasid_xa));
 	WARN_ON(!xa_empty(&ictx->device_xa));
 	kfree(ictx);
 }
 
+/* Caller should hold ictx->lock */
+static void ioas_put_locked(struct iommufd_ioas *ioas)
+{
+	struct iommufd_ctx *ictx = ioas->ictx;
+	int ioasid = ioas->ioasid;
+
+	if (!refcount_dec_and_test(&ioas->refs))
+		return;
+
+	xa_erase(&ictx->ioasid_xa, ioasid);
+	iommufd_ctx_put(ictx);
+	kfree(ioas);
+}
+
+static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
+{
+	struct iommu_ioasid_alloc req;
+	struct iommufd_ioas *ioas;
+	unsigned long minsz;
+	int ioasid, ret;
+
+	minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
+
+	if (copy_from_user(&req, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (req.argsz < minsz || !req.addr_width ||
+	    req.flags != IOMMU_IOASID_ENFORCE_SNOOP ||
+	    req.type != IOMMU_IOASID_TYPE_KERNEL_TYPE1V2)
+		return -EINVAL;
+
+	ioas = kzalloc(sizeof(*ioas), GFP_KERNEL);
+	if (!ioas)
+		return -ENOMEM;
+
+	mutex_lock(&ictx->lock);
+	ret = xa_alloc(&ictx->ioasid_xa, &ioasid, ioas,
+		       XA_LIMIT(IOMMUFD_IOASID_MIN, IOMMUFD_IOASID_MAX),
+		       GFP_KERNEL);
+	mutex_unlock(&ictx->lock);
+	if (ret) {
+		pr_err_ratelimited("Failed to alloc ioasid\n");
+		kfree(ioas);
+		return ret;
+	}
+
+	ioas->ioasid = ioasid;
+
+	/* only supports kernel managed I/O page table so far */
+	ioas->type = IOMMU_IOASID_TYPE_KERNEL_TYPE1V2;
+
+	ioas->addr_width = req.addr_width;
+
+	/* only supports enforce snoop today */
+	ioas->enforce_snoop = true;
+
+	iommufd_ctx_get(ictx);
+	ioas->ictx = ictx;
+
+	refcount_set(&ioas->refs, 1);
+
+	return ioasid;
+}
+
+static int iommufd_ioasid_free(struct iommufd_ctx *ictx, unsigned long arg)
+{
+	struct iommufd_ioas *ioas = NULL;
+	int ioasid, ret = 0;
+
+	if (copy_from_user(&ioasid, (void __user *)arg, sizeof(ioasid)))
+		return -EFAULT;
+
+	if (ioasid < 0)
+		return -EINVAL;
+
+	mutex_lock(&ictx->lock);
+	ioas = xa_load(&ictx->ioasid_xa, ioasid);
+	/* xa_load() returns NULL, not an ERR_PTR, when no entry exists */
+	if (!ioas) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Disallow free if refcount is not 1 */
+	if (refcount_read(&ioas->refs) > 1) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	ioas_put_locked(ioas);
+out_unlock:
+	mutex_unlock(&ictx->lock);
+	return ret;
+}
+
 static int iommufd_fops_release(struct inode *inode, struct file *filep)
 {
 	struct iommufd_ctx *ictx = filep->private_data;
+	struct iommufd_ioas *ioas;
+	unsigned long index;
 
 	filep->private_data = NULL;
 
+	mutex_lock(&ictx->lock);
+	xa_for_each(&ictx->ioasid_xa, index, ioas)
+		ioas_put_locked(ioas);
+	mutex_unlock(&ictx->lock);
+
 	iommufd_ctx_put(ictx);
 
 	return 0;
@@ -195,6 +309,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 	case IOMMU_DEVICE_GET_INFO:
 		ret = iommufd_get_device_info(ictx, arg);
 		break;
+	case IOMMU_IOASID_ALLOC:
+		ret = iommufd_ioasid_alloc(ictx, arg);
+		break;
+	case IOMMU_IOASID_FREE:
+		ret = iommufd_ioasid_free(ictx, arg);
+		break;
 	default:
 		pr_err_ratelimited("unsupported cmd %u\n", cmd);
 		break;
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 1603a13937e9..1dd6515e7816 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -14,6 +14,9 @@
 #include <linux/err.h>
 #include <linux/device.h>
 
+#define IOMMUFD_IOASID_MAX	((unsigned int)(0x7FFFFFFF))
+#define IOMMUFD_IOASID_MIN	0
+
 #define IOMMUFD_DEVID_MAX	((unsigned int)(0x7FFFFFFF))
 #define IOMMUFD_DEVID_MIN	0
 
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 76b71f9d6b34..5cbd300eb0ee 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -57,6 +57,60 @@ struct iommu_device_info {
 
 #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
 
+/*
+ * IOMMU_IOASID_ALLOC	- _IOWR(IOMMU_TYPE, IOMMU_BASE + 2,
+ *				struct iommu_ioasid_alloc)
+ *
+ * Allocate an IOASID.
+ *
+ * IOASID is the FD-local software handle representing an I/O address
+ * space. Each IOASID is associated with a single I/O page table. User
+ * must call this ioctl to get an IOASID for every I/O address space
+ * that is intended to be tracked by the kernel.
+ *
+ * User needs to specify the attributes of the IOASID and associated
+ * I/O page table format information according to one or multiple devices
+ * which will be attached to this IOASID right after. The I/O page table
+ * is activated in the IOMMU when it's attached by a device. An incompatible
+ * format between device and IOASID will lead to an attach failure on the
+ * device side.
+ *
+ * Currently only one flag (IOMMU_IOASID_ENFORCE_SNOOP) is supported and
+ * must always be set.
+ *
+ * Only one I/O page table type (kernel-managed) is supported, with vfio
+ * type1v2 mapping semantics.
+ *
+ * User should call IOMMU_CHECK_EXTENSION for future extensions.
+ *
+ * @argsz:	    user filled size of this data.
+ * @flags:	    additional information for IOASID allocation.
+ * @type:	    I/O address space page table type.
+ * @addr_width:    address width of the I/O address space.
+ *
+ * Return: allocated ioasid on success, -errno on failure.
+ */
+struct iommu_ioasid_alloc {
+	__u32	argsz;
+	__u32	flags;
+#define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
+	__u32	type;
+#define IOMMU_IOASID_TYPE_KERNEL_TYPE1V2	1
+	__u32	addr_width;
+};
+
+#define IOMMU_IOASID_ALLOC		_IO(IOMMU_TYPE, IOMMU_BASE + 2)
+
+/*
+ * IOMMU_IOASID_FREE - _IOWR(IOMMU_TYPE, IOMMU_BASE + 3, int)
+ *
+ * Free an IOASID.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+
+#define IOMMU_IOASID_FREE		_IO(IOMMU_TYPE, IOMMU_BASE + 3)
+
 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
 #define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

As mentioned earlier, userspace should check extensions to learn which
formats can be specified when allocating an IOASID. This patch adds such
an interface for userspace. In this RFC, iommufd reports EXT_MAP_TYPE1V2
as supported and does not report no-snoop (EXT_DMA_NO_SNOOP) support yet.

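For illustration, userspace is expected to probe the extensions before
allocating an IOASID, roughly as below (sketch only; iommu_fd is an open
/dev/iommu file descriptor):

	int has_type1v2, has_nosnoop;

	/* IOMMU_CHECK_EXTENSION returns 1 if supported, 0 otherwise */
	has_type1v2 = ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_MAP_TYPE1V2);
	has_nosnoop = ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_DMA_NO_SNOOP);

	if (!has_type1v2)
		return -1;	/* the only allocatable format in this RFC */

	/* has_nosnoop is expected to be 0 with this RFC */
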
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c |  7 +++++++
 include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 4839f128b24a..e45d76359e34 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 		return ret;
 
 	switch (cmd) {
+	case IOMMU_CHECK_EXTENSION:
+		switch (arg) {
+		case EXT_MAP_TYPE1V2:
+			return 1;
+		default:
+			return 0;
+		}
 	case IOMMU_DEVICE_GET_INFO:
 		ret = iommufd_get_device_info(ictx, arg);
 		break;
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 5cbd300eb0ee..49731be71213 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -14,6 +14,33 @@
 #define IOMMU_TYPE	(';')
 #define IOMMU_BASE	100
 
+/*
+ * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
+ *
+ * Check whether an uAPI extension is supported.
+ *
+ * It's unlikely that all planned capabilities in IOMMU fd will be ready
+ * in one breath. User should check which uAPI extension is supported
+ * according to its intended usage.
+ *
+ * A rough list of possible extensions may include:
+ *
+ *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
+ *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
+ *	- EXT_MAP_NEWTYPE for enhanced map semantics;
+ *	- EXT_MULTIDEV_GROUP for 1:N iommu groups;
+ *	- EXT_IOASID_NESTING for IOASID nesting, as the name suggests;
+ *	- EXT_USER_PAGE_TABLE for user managed page table;
+ *	- EXT_USER_PASID_TABLE for user managed PASID table;
+ *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
+ *	- ...
+ *
+ * Return: 0 if not supported, 1 if supported.
+ */
+#define EXT_MAP_TYPE1V2		1
+#define EXT_DMA_NO_SNOOP	2
+#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE, IOMMU_BASE + 0)
+
 /*
  * IOMMU_DEVICE_GET_INFO - _IOR(IOMMU_TYPE, IOMMU_BASE + 1,
  *				struct iommu_device_info)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

These two helpers could be used when 1) the iommu group is a singleton,
or 2) the upper layer has put the iommu group into the secure state by
calling iommu_device_init_user_dma().

As we want the iommufd design to be a device-centric model, we want to
remove any group knowledge from iommufd. Given that we already have the
iommu_at[de]tach_device() interface, we could extend it for iommufd
simply as below:

 - first device in a group does group attach;
 - last device in a group does group detach.

as long as the group has been put into the secure context.

The commit <426a273834eae> ("iommu: Limit iommu_attach/detach_device to
device with their own group") deliberately restricts the two interfaces
to single-device groups. To avoid conflicts with existing usages, we
keep this policy and apply the new extension only when the group has
been marked for user_dma.

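To illustrate the first-attach/last-detach semantics, a rough sketch of how
a caller like iommufd is expected to use the extended helpers (illustration
only; devA/devB denote two devices of the same group, which has already
been marked for user_dma, and domain is the iommu domain to be installed):

	ret = iommu_attach_device(domain, devA); /* 1st device attaches the group */
	ret = iommu_attach_device(domain, devB); /* same domain, bumps attach_cnt */

	iommu_detach_device(domain, devB);	 /* drops attach_cnt */
	iommu_detach_device(domain, devA);	 /* last device detaches the group */
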
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/iommu.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index bffd84e978fb..b6178997aef1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -47,6 +47,7 @@ struct iommu_group {
 	struct list_head entry;
 	unsigned long user_dma_owner_id;
 	refcount_t owner_cnt;
+	refcount_t attach_cnt;
 };
 
 struct group_device {
@@ -1994,7 +1995,7 @@ static int __iommu_attach_device(struct iommu_domain *domain,
 int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 {
 	struct iommu_group *group;
-	int ret;
+	int ret = 0;
 
 	group = iommu_group_get(dev);
 	if (!group)
@@ -2005,11 +2006,23 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 	 * change while we are attaching
 	 */
 	mutex_lock(&group->mutex);
-	ret = -EINVAL;
-	if (iommu_group_device_count(group) != 1)
+	if (group->user_dma_owner_id) {
+		if (group->domain) {
+			if (group->domain != domain)
+				ret = -EBUSY;
+			else
+				refcount_inc(&group->attach_cnt);
+
+			goto out_unlock;
+		}
+	} else if (iommu_group_device_count(group) != 1) {
+		ret = -EINVAL;
 		goto out_unlock;
+	}
 
 	ret = __iommu_attach_group(domain, group);
+	if (!ret && group->user_dma_owner_id)
+		refcount_set(&group->attach_cnt, 1);
 
 out_unlock:
 	mutex_unlock(&group->mutex);
@@ -2261,7 +2274,10 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
 		return;
 
 	mutex_lock(&group->mutex);
-	if (iommu_group_device_count(group) != 1) {
+	if (group->user_dma_owner_id) {
+		if (!refcount_dec_and_test(&group->attach_cnt))
+			goto out_unlock;
+	} else if (iommu_group_device_count(group) != 1) {
 		WARN_ON(1);
 		goto out_unlock;
 	}
@@ -3368,6 +3384,7 @@ static int iommu_group_init_user_dma(struct iommu_group *group,
 
 	group->user_dma_owner_id = owner;
 	refcount_set(&group->owner_cnt, 1);
+	refcount_set(&group->attach_cnt, 0);
 
 	/* default domain is unsafe for user-initiated dma */
 	if (group->domain == group->default_domain)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

An I/O address space takes effect in the iommu only after it's attached
by a device. This patch provides iommufd_device_[de/at]tach_ioasid()
helpers for this purpose. One device can only be attached to one ioasid
at this point, but one ioasid can be attached by multiple devices.

The caller specifies the iommufd_device (returned at binding time) and
the target ioasid when calling the helper function. Upon request, iommufd
installs the specified I/O page table to the correct place in the IOMMU,
according to the routing information (struct device* which represents
RID) recorded in iommufd_device. Future variants could allow the caller
to specify additional routing information (e.g. pasid/ssid) when multiple
I/O address spaces are supported per device.

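For illustration, a device-passthrough framework is expected to use these
helpers roughly as below (sketch only, error handling omitted; iommu_fd,
dev_cookie and ioasid come from the corresponding uAPI operations, and
iommufd_bind_device()/iommufd_unbind_device() are from the earlier binding
patch):

	struct iommufd_device *idev;
	int ret;

	/* binding first establishes the security context */
	idev = iommufd_bind_device(iommu_fd, &pdev->dev, dev_cookie);

	/* install the I/O page table selected by @ioasid for this device */
	ret = iommufd_device_attach_ioasid(idev, ioasid);

	/* ... user-initiated DMA now goes through the ioasid ... */

	iommufd_device_detach_ioasid(idev, ioasid);
	iommufd_unbind_device(idev);
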
Open:
Per Jason's comment in the link below, bus-specific wrappers are recommended.
This RFC implements one wrapper for pci devices. But it looks like struct
pci_dev is not used at all since iommufd_device already carries all the
necessary info. So we want to have another discussion on its necessity, e.g.
whether it makes more sense to have bus-specific wrappers for binding, while
leaving a common attach helper per iommufd_device.
https://lore.kernel.org/linux-iommu/20210528233649.GB3816344@nvidia.com/

TODO:
When multiple devices are attached to the same ioasid, the permitted iova
ranges and supported pgsize bitmap on this ioasid should be the common
subset across all attached devices. iommufd needs to track such info per
ioasid and update it every time a new device is attached to the ioasid.
This has not been done in this version yet, due to the temporary hack
adopted in patches 16-18. The hack reuses the vfio type1 driver which
already includes the necessary logic for iova ranges and pgsize bitmap.
Once we get a clear direction for those patches, that logic will be moved
to this patch.

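Just to sketch the intended logic (an assumption for illustration, not
implemented in this version), every attach would shrink the per-ioasid
capabilities to the common subset, e.g.:

	/* hypothetical fields and helpers, for illustration only */
	ioas->pgsize_bitmap &= device_iommu_pgsize_bitmap(idev->dev);
	/*
	 * likewise the permitted iova ranges become the intersection of
	 * the current ranges and the new device's usable (non-reserved)
	 * iova ranges
	 */
	ioas_narrow_iova_ranges(ioas, idev->dev);
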
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 226 ++++++++++++++++++++++++++++++++
 include/linux/iommufd.h         |  29 ++++
 2 files changed, 255 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index e45d76359e34..25373a0e037a 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -51,6 +51,19 @@ struct iommufd_ioas {
 	bool enforce_snoop;
 	struct iommufd_ctx *ictx;
 	refcount_t refs;
+	struct mutex lock;
+	struct list_head device_list;
+	struct iommu_domain *domain;
+};
+
+/*
+ * An ioas_device_info object is created for each successful attach
+ * request. A list of objects is maintained per ioas when the address
+ * space is shared by multiple devices.
+ */
+struct ioas_device_info {
+	struct iommufd_device *idev;
+	struct list_head next;
 };
 
 static int iommufd_fops_open(struct inode *inode, struct file *filep)
@@ -119,6 +132,21 @@ static void iommufd_ctx_put(struct iommufd_ctx *ictx)
 	kfree(ictx);
 }
 
+static struct iommufd_ioas *ioasid_get_ioas(struct iommufd_ctx *ictx, int ioasid)
+{
+	struct iommufd_ioas *ioas;
+
+	if (ioasid < 0)
+		return NULL;
+
+	mutex_lock(&ictx->lock);
+	ioas = xa_load(&ictx->ioasid_xa, ioasid);
+	if (ioas)
+		refcount_inc(&ioas->refs);
+	mutex_unlock(&ictx->lock);
+	return ioas;
+}
+
 /* Caller should hold ictx->lock */
 static void ioas_put_locked(struct iommufd_ioas *ioas)
 {
@@ -128,11 +156,28 @@ static void ioas_put_locked(struct iommufd_ioas *ioas)
 	if (!refcount_dec_and_test(&ioas->refs))
 		return;
 
+	WARN_ON(!list_empty(&ioas->device_list));
 	xa_erase(&ictx->ioasid_xa, ioasid);
 	iommufd_ctx_put(ictx);
 	kfree(ioas);
 }
 
+/*
+ * Caller should hold an ictx reference when calling this function,
+ * otherwise ictx might be freed in ioas_put_locked() and then the last
+ * unlock becomes problematic. Alternatively we could have a fresh
+ * implementation of ioas_put instead of calling the locked function.
+ * In this case it can ensure ictx is freed after mutex_unlock().
+ */
+static void ioas_put(struct iommufd_ioas *ioas)
+{
+	struct iommufd_ctx *ictx = ioas->ictx;
+
+	mutex_lock(&ictx->lock);
+	ioas_put_locked(ioas);
+	mutex_unlock(&ictx->lock);
+}
+
 static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 {
 	struct iommu_ioasid_alloc req;
@@ -178,6 +223,9 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 	iommufd_ctx_get(ictx);
 	ioas->ictx = ictx;
 
+	mutex_init(&ioas->lock);
+	INIT_LIST_HEAD(&ioas->device_list);
+
 	refcount_set(&ioas->refs, 1);
 
 	return ioasid;
@@ -344,6 +392,166 @@ static struct miscdevice iommu_misc_dev = {
 	.mode = 0666,
 };
 
+/* Caller should hold ioas->lock */
+static struct ioas_device_info *ioas_find_device(struct iommufd_ioas *ioas,
+						 struct iommufd_device *idev)
+{
+	struct ioas_device_info *ioas_dev;
+
+	list_for_each_entry(ioas_dev, &ioas->device_list, next) {
+		if (ioas_dev->idev == idev)
+			return ioas_dev;
+	}
+
+	return NULL;
+}
+
+static void ioas_free_domain_if_empty(struct iommufd_ioas *ioas)
+{
+	if (list_empty(&ioas->device_list)) {
+		iommu_domain_free(ioas->domain);
+		ioas->domain = NULL;
+	}
+}
+
+static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
+					   struct device *dev)
+{
+	bool snoop = false;
+	u32 addr_width;
+	int ret;
+
+	/*
+	 * Currently we only support I/O page tables in iommu enforce-snoop
+	 * format. Attaching a device whose upstream iommu doesn't support
+	 * this format is rejected.
+	 */
+	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
+	if (ret || !snoop)
+		return -EINVAL;
+
+	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width);
+	if (ret || addr_width < ioas->addr_width)
+		return -EINVAL;
+
+	/* TODO: also need to check permitted iova ranges and pgsize bitmap */
+
+	return 0;
+}
+
+/**
+ * iommufd_device_attach_ioasid - attach device to an ioasid
+ * @idev: [in] Pointer to struct iommufd_device.
+ * @ioasid: [in] ioasid points to an I/O address space.
+ *
+ * Returns 0 for successful attach, otherwise returns error.
+ *
+ */
+int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)
+{
+	struct iommufd_ioas *ioas;
+	struct ioas_device_info *ioas_dev;
+	struct iommu_domain *domain;
+	int ret;
+
+	ioas = ioasid_get_ioas(idev->ictx, ioasid);
+	if (!ioas) {
+		pr_err_ratelimited("Trying to attach illegal or unknown IOASID %d\n", ioasid);
+		return -EINVAL;
+	}
+
+	mutex_lock(&ioas->lock);
+
+	/* Check for duplicates */
+	if (ioas_find_device(ioas, idev)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = ioas_check_device_compatibility(ioas, idev->dev);
+	if (ret)
+		goto out_unlock;
+
+	ioas_dev = kzalloc(sizeof(*ioas_dev), GFP_KERNEL);
+	if (!ioas_dev) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	/*
+	 * Each ioas is backed by an iommu domain, which is allocated
+	 * when the ioas is attached for the first time and then shared
+	 * by subsequently attached devices.
+	 */
+	if (list_empty(&ioas->device_list)) {
+		struct iommu_domain *d;
+
+		d = iommu_domain_alloc(idev->dev->bus);
+		if (!d) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		ioas->domain = d;
+	}
+	domain = ioas->domain;
+
+	/* Install the I/O page table to the iommu for this device */
+	ret = iommu_attach_device(domain, idev->dev);
+	if (ret)
+		goto out_domain;
+
+	ioas_dev->idev = idev;
+	list_add(&ioas_dev->next, &ioas->device_list);
+	mutex_unlock(&ioas->lock);
+
+	return 0;
+out_domain:
+	ioas_free_domain_if_empty(ioas);
+out_free:
+	kfree(ioas_dev);
+out_unlock:
+	mutex_unlock(&ioas->lock);
+	ioas_put(ioas);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommufd_device_attach_ioasid);
+
+/**
+ * iommufd_device_detach_ioasid - Detach an ioasid from a device.
+ * @idev: [in] Pointer to struct iommufd_device.
+ * @ioasid: [in] ioasid points to an I/O address space.
+ *
+ */
+void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid)
+{
+	struct iommufd_ioas *ioas;
+	struct ioas_device_info *ioas_dev;
+
+	ioas = ioasid_get_ioas(idev->ictx, ioasid);
+	if (!ioas)
+		return;
+
+	mutex_lock(&ioas->lock);
+	ioas_dev = ioas_find_device(ioas, idev);
+	if (!ioas_dev) {
+		mutex_unlock(&ioas->lock);
+		goto out;
+	}
+
+	list_del(&ioas_dev->next);
+	iommu_detach_device(ioas->domain, idev->dev);
+	ioas_free_domain_if_empty(ioas);
+	kfree(ioas_dev);
+	mutex_unlock(&ioas->lock);
+
+	/* drop the reference held since attach time */
+	ioas_put(ioas);
+out:
+	/* pairs with the ioasid_get_ioas() at the start of this function */
+	ioas_put(ioas);
+}
+EXPORT_SYMBOL_GPL(iommufd_device_detach_ioasid);
+
 /**
  * iommufd_bind_device - Bind a physical device marked by a device
  *			 cookie to an iommu fd.
@@ -426,8 +634,26 @@ EXPORT_SYMBOL_GPL(iommufd_bind_device);
 void iommufd_unbind_device(struct iommufd_device *idev)
 {
 	struct iommufd_ctx *ictx = idev->ictx;
+	struct iommufd_ioas *ioas;
+	unsigned long index;
 
 	mutex_lock(&ictx->lock);
+	xa_for_each(&ictx->ioasid_xa, index, ioas) {
+		struct ioas_device_info *ioas_dev;
+
+		mutex_lock(&ioas->lock);
+		ioas_dev = ioas_find_device(ioas, idev);
+		if (!ioas_dev) {
+			mutex_unlock(&ioas->lock);
+			continue;
+		}
+		list_del(&ioas_dev->next);
+		iommu_detach_device(ioas->domain, idev->dev);
+		ioas_free_domain_if_empty(ioas);
+		kfree(ioas_dev);
+		mutex_unlock(&ioas->lock);
+		ioas_put_locked(ioas);
+	}
 	xa_erase(&ictx->device_xa, idev->id);
 	mutex_unlock(&ictx->lock);
 	/* Exit the security context */
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 1dd6515e7816..01a4fe934143 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/device.h>
+#include <linux/pci.h>
 
 #define IOMMUFD_IOASID_MAX	((unsigned int)(0x7FFFFFFF))
 #define IOMMUFD_IOASID_MIN	0
@@ -27,6 +28,16 @@ struct iommufd_device *
 iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
 void iommufd_unbind_device(struct iommufd_device *idev);
 
+int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid);
+void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid);
+
+static inline int
+__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
+				   struct iommufd_device *idev, int ioasid)
+{
+	return iommufd_device_attach_ioasid(idev, ioasid);
+}
+
 #else /* !CONFIG_IOMMUFD */
 static inline struct iommufd_device *
 iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
@@ -37,5 +48,23 @@ iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie)
 static inline void iommufd_unbind_device(struct iommufd_device *idev)
 {
 }
+
+static inline int iommufd_device_attach_ioasid(struct iommufd_device *idev,
+					       int ioasid)
+{
+	return -ENODEV;
+}
+
+static inline void iommufd_device_detach_ioasid(struct iommufd_device *idev,
+						int ioasid)
+{
+}
+
+static inline int
+__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
+				   struct iommufd_device *idev, int ioasid)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_IOMMUFD */
 #endif /* __LINUX_IOMMUFD_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: kvm, kwankhede, jean-philippe, dave.jiang, ashok.raj, corbet,
	kevin.tian, parav, lkml, david, robin.murphy, jun.j.tian,
	linux-kernel, lushenming, iommu, pbonzini, dwmw2

This patch adds an interface for userspace to attach a device to a
specified IOASID.

Note:
One device can only be attached to one IOASID in this version. This is
on par with what vfio provides today. In the future this restriction can
be relaxed when multiple I/O address spaces are supported per device.

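A minimal userspace sketch of the expected flow (illustration only, error
handling omitted; device_fd has already been bound to the iommufd via
VFIO_DEVICE_BIND_IOMMUFD from the earlier patch, and ioasid comes from
IOMMU_IOASID_ALLOC):

	struct vfio_device_attach_ioasid attach = {
		.argsz    = sizeof(attach),
		.flags    = 0,
		.iommu_fd = iommu_fd,	/* fd of the opened /dev/iommu */
		.ioasid   = ioasid,
	};

	ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &attach);

	/* ... device DMA is now translated by the I/O page table of @ioasid ... */

	ioctl(device_fd, VFIO_DEVICE_DETACH_IOASID, &attach);
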
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci.c         | 82 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |  1 +
 include/linux/iommufd.h             |  1 +
 include/uapi/linux/vfio.h           | 26 +++++++++
 4 files changed, 110 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 20006bb66430..5b1fda333122 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -557,6 +557,11 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
 		if (vdev->videv) {
 			struct vfio_iommufd_device *videv = vdev->videv;
 
+			if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+				iommufd_device_detach_ioasid(videv->idev,
+							     videv->ioasid);
+				videv->ioasid = IOMMUFD_INVALID_IOASID;
+			}
 			vdev->videv = NULL;
 			iommufd_unbind_device(videv->idev);
 			kfree(videv);
@@ -839,6 +844,7 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
 		}
 		videv->idev = idev;
 		videv->iommu_fd = bind_data.iommu_fd;
+		videv->ioasid = IOMMUFD_INVALID_IOASID;
 		/*
 		 * A security context has been established. Unblock
 		 * user access.
@@ -848,6 +854,82 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
 		vdev->videv = videv;
 		mutex_unlock(&vdev->videv_lock);
 
+		return 0;
+	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
+		struct vfio_device_attach_ioasid attach;
+		unsigned long minsz;
+		struct vfio_iommufd_device *videv;
+		int ret = 0;
+
+		/* not allowed if the device is opened in legacy interface */
+		if (vfio_device_in_container(core_vdev))
+			return -ENOTTY;
+
+		minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+		if (copy_from_user(&attach, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (attach.argsz < minsz || attach.flags ||
+		    attach.iommu_fd < 0 || attach.ioasid < 0)
+			return -EINVAL;
+
+		mutex_lock(&vdev->videv_lock);
+
+		videv = vdev->videv;
+		if (!videv || videv->iommu_fd != attach.iommu_fd) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EINVAL;
+		}
+
+		/* Currently only allows one IOASID attach */
+		if (videv->ioasid != IOMMUFD_INVALID_IOASID) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EBUSY;
+		}
+
+		ret = __pci_iommufd_device_attach_ioasid(vdev->pdev,
+							 videv->idev,
+							 attach.ioasid);
+		if (!ret)
+			videv->ioasid = attach.ioasid;
+		mutex_unlock(&vdev->videv_lock);
+
+		return ret;
+	} else if (cmd == VFIO_DEVICE_DETACH_IOASID) {
+		struct vfio_device_attach_ioasid attach;
+		unsigned long minsz;
+		struct vfio_iommufd_device *videv;
+
+		/* not allowed if the device is opened in legacy interface */
+		if (vfio_device_in_container(core_vdev))
+			return -ENOTTY;
+
+		minsz = offsetofend(struct vfio_device_attach_ioasid, ioasid);
+		if (copy_from_user(&attach, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (attach.argsz < minsz || attach.flags ||
+		    attach.iommu_fd < 0 || attach.ioasid < 0)
+			return -EINVAL;
+
+		mutex_lock(&vdev->videv_lock);
+
+		videv = vdev->videv;
+		if (!videv || videv->iommu_fd != attach.iommu_fd) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EINVAL;
+		}
+
+		if (videv->ioasid == IOMMUFD_INVALID_IOASID ||
+		    videv->ioasid != attach.ioasid) {
+			mutex_unlock(&vdev->videv_lock);
+			return -EINVAL;
+		}
+
+		videv->ioasid = IOMMUFD_INVALID_IOASID;
+		iommufd_device_detach_ioasid(videv->idev, attach.ioasid);
+		mutex_unlock(&vdev->videv_lock);
+
 		return 0;
 	} else if (cmd == VFIO_DEVICE_GET_INFO) {
 		struct vfio_device_info info;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index bd784accac35..daa0f08ac835 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -103,6 +103,7 @@ struct vfio_pci_mmap_vma {
 struct vfio_iommufd_device {
 	struct iommufd_device *idev;
 	int iommu_fd;
+	int ioasid;
 };
 
 struct vfio_pci_device {
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 01a4fe934143..36d8d2fd22bb 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -17,6 +17,7 @@
 
 #define IOMMUFD_IOASID_MAX	((unsigned int)(0x7FFFFFFF))
 #define IOMMUFD_IOASID_MIN	0
+#define IOMMUFD_INVALID_IOASID	-1
 
 #define IOMMUFD_DEVID_MAX	((unsigned int)(0x7FFFFFFF))
 #define IOMMUFD_DEVID_MIN	0
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c902abd60339..61493ab03038 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -220,6 +220,32 @@ struct vfio_device_iommu_bind_data {
 
 #define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
 
+/*
+ * VFIO_DEVICE_ATTACH_IOASID - _IOW(VFIO_TYPE, VFIO_BASE + 20,
+ *				struct vfio_device_attach_ioasid)
+ *
+ * Attach a vfio device to the specified IOASID
+ *
+ * Multiple vfio devices can be attached to the same IOASID. One device can
+ * be attached to only one ioasid at this point.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @iommu_fd:	iommufd where the ioasid comes from.
+ * @ioasid:	target I/O address space.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_attach_ioasid {
+	__u32	argsz;
+	__u32	flags;
+	__s32	iommu_fd;
+	__s32	ioasid;
+};
+
+#define VFIO_DEVICE_ATTACH_IOASID	_IO(VFIO_TYPE, VFIO_BASE + 20)
+#define VFIO_DEVICE_DETACH_IOASID	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /**
  * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
  *						struct vfio_device_info)
-- 
2.25.1

^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: kvm, kwankhede, jean-philippe, dave.jiang, ashok.raj, corbet,
	kevin.tian, parav, lkml, david, robin.murphy, jun.j.tian,
	linux-kernel, lushenming, iommu, pbonzini, dwmw2

[HACK. will fix in v2]

There are two options to implement vfio type1v2 mapping semantics in
/dev/iommu.

One is to duplicate the related code from vfio as the starting point,
and then merge with vfio type1 at a later time. However, vfio_iommu_type1.c
has over 3000 lines of code, with ~80% of it related to DMA management
logic, including:

- DMA map/unmap metadata management
- page pinning and related accounting
- IOVA range reporting
- dirty bitmap retrieval
- dynamic vaddr update, etc.

It is unclear whether duplicating that amount of code in the transition
phase is acceptable.

The alternative is to consolidate the type1v2 logic in /dev/iommu
immediately, which requires converting vfio_iommu_type1 into a shim driver.
The upside is no code duplication, and it is the long-term goal anyway even
with the first approach. The downside is that more effort is required for
the 'initial' skeleton, so all new iommu features will be blocked for a
longer time. The main task is to figure out how to handle the remaining 20%
of the code (tied to groups) in vfio_iommu_type1 with the device-centric
model in iommufd (with groups managed by the iommu core). It also implies
that no-snoop DMA must be handled now, with extra work on a reworked
kvm-vfio contract, and that external page pinning must be supported as
required by sw mdev.

Due to limited time, we chose a hacky approach in this RFC: iommufd
directly calls vfio_iommu_type1 functions, and we raise this as an open
point for discussion. This should not impact the review of the other key
aspects of the new framework. Once we reach consensus, we'll follow it to
do a clean implementation in the next version.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 199 +++++++++++++++++++++++++++++++-
 include/linux/vfio.h            |  13 +++
 2 files changed, 206 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0b4f7c174c7a..c1c6bc803d94 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -115,6 +115,7 @@ struct vfio_iommu_group {
 	struct list_head	next;
 	bool			mdev_group;	/* An mdev group */
 	bool			pinned_page_dirty_scope;
+	int			attach_cnt;
 };
 
 struct vfio_iova {
@@ -2240,6 +2241,135 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
 	list_splice_tail(iova_copy, iova);
 }
 
+/* HACK: called by /dev/iommu core to add a group to vfio_iommu_type1 */
+int vfio_iommu_add_group(struct vfio_iommu *iommu,
+			 struct iommu_group *iommu_group,
+			 struct iommu_domain *iommu_domain)
+{
+	struct vfio_iommu_group *group;
+	struct vfio_domain *domain = NULL;
+	struct bus_type *bus = NULL;
+	int ret = 0;
+	bool resv_msi, msi_remap;
+	phys_addr_t resv_msi_base = 0;
+	struct iommu_domain_geometry *geo;
+	LIST_HEAD(iova_copy);
+	LIST_HEAD(group_resv_regions);
+
+	/* Determine bus_type */
+	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
+	if (ret)
+		return ret;
+
+	mutex_lock(&iommu->lock);
+
+	/* Check for duplicates */
+	group = vfio_iommu_find_iommu_group(iommu, iommu_group);
+	if (group) {
+		group->attach_cnt++;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+
+	/* Get aperture info */
+	geo = &iommu_domain->geometry;
+	if (vfio_iommu_aper_conflict(iommu, geo->aperture_start,
+				     geo->aperture_end)) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	ret = iommu_get_group_resv_regions(iommu_group, &group_resv_regions);
+	if (ret)
+		goto out_free;
+
+	if (vfio_iommu_resv_conflict(iommu, &group_resv_regions)) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	/*
+	 * We don't want to work on the original iova list as the list
+	 * gets modified and in case of failure we have to retain the
+	 * original list. Get a copy here.
+	 */
+	ret = vfio_iommu_iova_get_copy(iommu, &iova_copy);
+	if (ret)
+		goto out_free;
+
+	ret = vfio_iommu_aper_resize(&iova_copy, geo->aperture_start,
+				     geo->aperture_end);
+	if (ret)
+		goto out_free;
+
+	ret = vfio_iommu_resv_exclude(&iova_copy, &group_resv_regions);
+	if (ret)
+		goto out_free;
+
+	resv_msi = vfio_iommu_has_sw_msi(&group_resv_regions, &resv_msi_base);
+
+	msi_remap = irq_domain_check_msi_remap() ||
+		    iommu_capable(bus, IOMMU_CAP_INTR_REMAP);
+
+	if (!allow_unsafe_interrupts && !msi_remap) {
+		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+		       __func__);
+		ret = -EPERM;
+		goto out_free;
+	}
+
+	if (resv_msi) {
+		ret = iommu_get_msi_cookie(iommu_domain, resv_msi_base);
+		if (ret && ret != -ENODEV)
+			goto out_free;
+	}
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	group->iommu_group = iommu_group;
+
+	if (!list_empty(&iommu->domain_list)) {
+		domain = list_first_entry(&iommu->domain_list,
+					  struct vfio_domain, next);
+	} else {
+		domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+		if (!domain) {
+			kfree(group);
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		domain->domain = iommu_domain;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&domain->next, &iommu->domain_list);
+	}
+
+	list_add(&group->next, &domain->group_list);
+
+	vfio_test_domain_fgsp(domain);
+
+	vfio_update_pgsize_bitmap(iommu);
+
+	/* Delete the old one and insert new iova list */
+	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
+
+	group->attach_cnt = 1;
+	mutex_unlock(&iommu->lock);
+	vfio_iommu_resv_free(&group_resv_regions);
+
+	return 0;
+
+out_free:
+	vfio_iommu_iova_free(&iova_copy);
+	vfio_iommu_resv_free(&group_resv_regions);
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_add_group);
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
@@ -2557,6 +2687,59 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu,
 	return ret;
 }
 
+/* HACK: called by /dev/iommu core to remove a group from vfio_iommu_type1 */
+void vfio_iommu_remove_group(struct vfio_iommu *iommu,
+			     struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_group *group;
+	struct vfio_domain *domain = NULL;
+	LIST_HEAD(iova_copy);
+
+	mutex_lock(&iommu->lock);
+	domain = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+	group = find_iommu_group(domain, iommu_group);
+	if (!group) {
+		mutex_unlock(&iommu->lock);
+		return;
+	}
+
+	if (--group->attach_cnt) {
+		mutex_unlock(&iommu->lock);
+		return;
+	}
+
+	/*
+	 * Get a copy of iova list. This will be used to update
+	 * and to replace the current one later. Please note that
+	 * we will leave the original list as it is if update fails.
+	 */
+	vfio_iommu_iova_get_copy(iommu, &iova_copy);
+
+	list_del(&group->next);
+	kfree(group);
+	/*
+	 * Group ownership provides privilege, if the device list is
+	 * empty, the domain goes away.
+	 */
+	if (list_empty(&domain->group_list)) {
+		WARN_ON(iommu->notifier.head);
+		vfio_iommu_unmap_unpin_all(iommu);
+		list_del(&domain->next);
+		kfree(domain);
+		vfio_iommu_aper_expand(iommu, &iova_copy);
+		vfio_update_pgsize_bitmap(iommu);
+	}
+
+	if (!vfio_iommu_resv_refresh(iommu, &iova_copy))
+		vfio_iommu_iova_insert_copy(iommu, &iova_copy);
+	else
+		vfio_iommu_iova_free(&iova_copy);
+
+	mutex_unlock(&iommu->lock);
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_remove_group);
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -2647,7 +2830,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 	mutex_unlock(&iommu->lock);
 }
 
-static void *vfio_iommu_type1_open(unsigned long arg)
+void *vfio_iommu_type1_open(unsigned long arg)
 {
 	struct vfio_iommu *iommu;
 
@@ -2680,6 +2863,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 
 	return iommu;
 }
+EXPORT_SYMBOL_GPL(vfio_iommu_type1_open);
 
 static void vfio_release_domain(struct vfio_domain *domain, bool external)
 {
@@ -2697,7 +2881,7 @@ static void vfio_release_domain(struct vfio_domain *domain, bool external)
 		iommu_domain_free(domain->domain);
 }
 
-static void vfio_iommu_type1_release(void *iommu_data)
+void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain, *domain_tmp;
@@ -2720,6 +2904,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
 
 	kfree(iommu);
 }
+EXPORT_SYMBOL_GPL(vfio_iommu_type1_release);
 
 static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
 {
@@ -2913,8 +3098,8 @@ static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
 			-EFAULT : 0;
 }
 
-static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
-				    unsigned long arg)
+int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
+			     unsigned long arg)
 {
 	struct vfio_iommu_type1_dma_map map;
 	unsigned long minsz;
@@ -2931,9 +3116,10 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
 
 	return vfio_dma_do_map(iommu, &map);
 }
+EXPORT_SYMBOL_GPL(vfio_iommu_type1_map_dma);
 
-static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
-				      unsigned long arg)
+int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
+			       unsigned long arg)
 {
 	struct vfio_iommu_type1_dma_unmap unmap;
 	struct vfio_bitmap bitmap = { 0 };
@@ -2984,6 +3170,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 	return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
 }
+EXPORT_SYMBOL_GPL(vfio_iommu_type1_unmap_dma);
 
 static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 					unsigned long arg)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index fd0629acb948..d904ee5a68cc 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -158,6 +158,19 @@ extern int vfio_dma_rw(struct vfio_group *group, dma_addr_t user_iova,
 
 extern struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group);
 
+struct vfio_iommu;
+extern void *vfio_iommu_type1_open(unsigned long arg);
+extern void vfio_iommu_type1_release(void *iommu_data);
+extern int vfio_iommu_add_group(struct vfio_iommu *iommu,
+				struct iommu_group *iommu_group,
+				struct iommu_domain *iommu_domain);
+extern void vfio_iommu_remove_group(struct vfio_iommu *iommu,
+				    struct iommu_group *iommu_group);
+extern int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
+				      unsigned long arg);
+extern int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
+				    unsigned long arg);
+
 /* each type has independent events */
 enum vfio_notify_type {
 	VFIO_IOMMU_NOTIFY = 0,
-- 
2.25.1
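
As a rough sketch of how a /dev/iommu consumer could drive the symbols
exported here (this loosely mirrors what the later iommufd patches in this
series do; the function name is hypothetical and error paths are trimmed):

#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

/* Hypothetical shim: one type1v2 context, one group, one map request. */
static int iommufd_shim_example(struct iommu_group *grp,
				struct iommu_domain *dom,
				unsigned long user_arg)
{
	struct vfio_iommu *vi;
	int ret;

	/* one vfio_iommu object per I/O address space, type1v2 semantics */
	vi = vfio_iommu_type1_open(VFIO_TYPE1v2_IOMMU);
	if (IS_ERR(vi))
		return PTR_ERR(vi);

	/* on attach: register the device's iommu_group and domain */
	ret = vfio_iommu_add_group(vi, grp, dom);
	if (ret)
		goto out_release;

	/* forward a userspace map payload in vfio type1 format */
	ret = vfio_iommu_type1_map_dma(vi, user_arg);

	/* on detach / teardown */
	vfio_iommu_remove_group(vi, grp);
out_release:
	vfio_iommu_type1_release(vi);
	return ret;
}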


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 17/20] iommu/iommufd: Report iova range to userspace
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: kvm, kwankhede, jean-philippe, dave.jiang, ashok.raj, corbet,
	kevin.tian, parav, lkml, david, robin.murphy, jun.j.tian,
	linux-kernel, lushenming, iommu, pbonzini, dwmw2

[HACK. will fix in v2]

The IOVA range is critical information for userspace to manage DMA for an
I/O address space. This patch reports the valid IOVA range info of a given
device.

Due to the aforementioned hack, this info comes from the hacked vfio type1
driver. To follow the same format as vfio, we also introduce a cap chain
format in IOMMU_DEVICE_GET_INFO to carry the IOVA range info.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommu.c           |  2 ++
 drivers/iommu/iommufd/iommufd.c | 41 +++++++++++++++++++++++++++-
 drivers/vfio/vfio_iommu_type1.c | 47 ++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |  2 ++
 include/uapi/linux/iommu.h      |  3 +++
 5 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b6178997aef1..44bba346ab52 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2755,6 +2755,7 @@ void iommu_get_resv_regions(struct device *dev, struct list_head *list)
 	if (ops && ops->get_resv_regions)
 		ops->get_resv_regions(dev, list);
 }
+EXPORT_SYMBOL_GPL(iommu_get_resv_regions);
 
 void iommu_put_resv_regions(struct device *dev, struct list_head *list)
 {
@@ -2763,6 +2764,7 @@ void iommu_put_resv_regions(struct device *dev, struct list_head *list)
 	if (ops && ops->put_resv_regions)
 		ops->put_resv_regions(dev, list);
 }
+EXPORT_SYMBOL_GPL(iommu_put_resv_regions);
 
 /**
  * generic_iommu_put_resv_regions - Reserved region driver helper
diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index 25373a0e037a..cbf5e30062a6 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -19,6 +19,7 @@
 #include <linux/iommufd.h>
 #include <linux/xarray.h>
 #include <asm-generic/bug.h>
+#include <linux/vfio.h>
 
 /* Per iommufd */
 struct iommufd_ctx {
@@ -298,6 +299,38 @@ iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
 	return dev;
 }
 
+static int iommu_device_add_cap_chain(struct device *dev, unsigned long arg,
+				      struct iommu_device_info *info)
+{
+	struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+	int ret;
+
+	ret = vfio_device_add_iova_cap(dev, &caps);
+	if (ret)
+		return ret;
+
+	if (caps.size) {
+		info->flags |= IOMMU_DEVICE_INFO_CAPS;
+
+		if (info->argsz < sizeof(*info) + caps.size) {
+			info->argsz = sizeof(*info) + caps.size;
+		} else {
+			vfio_info_cap_shift(&caps, sizeof(*info));
+			if (copy_to_user((void __user *)arg +
+					sizeof(*info), caps.buf,
+					caps.size)) {
+				kfree(caps.buf);
+				info->flags &= ~IOMMU_DEVICE_INFO_CAPS;
+				return -EFAULT;
+			}
+			info->cap_offset = sizeof(*info);
+		}
+
+		kfree(caps.buf);
+	}
+	return 0;
+}
+
 static void iommu_device_build_info(struct device *dev,
 				    struct iommu_device_info *info)
 {
@@ -324,8 +357,9 @@ static int iommufd_get_device_info(struct iommufd_ctx *ictx,
 	struct iommu_device_info info;
 	unsigned long minsz;
 	struct device *dev;
+	int ret;
 
-	minsz = offsetofend(struct iommu_device_info, addr_width);
+	minsz = offsetofend(struct iommu_device_info, cap_offset);
 
 	if (copy_from_user(&info, (void __user *)arg, minsz))
 		return -EFAULT;
@@ -341,6 +375,11 @@ static int iommufd_get_device_info(struct iommufd_ctx *ictx,
 
 	iommu_device_build_info(dev, &info);
 
+	info.cap_offset = 0;
+	ret = iommu_device_add_cap_chain(dev, arg, &info);
+	if (ret)
+		pr_info_ratelimited("No cap chain added, error %d\n", ret);
+
 	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
 }
 
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c1c6bc803d94..28c1699aed6b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2963,15 +2963,15 @@ static int vfio_iommu_iova_add_cap(struct vfio_info_cap *caps,
 	return 0;
 }
 
-static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
-				      struct vfio_info_cap *caps)
+static int vfio_iova_list_build_caps(struct list_head *iova_list,
+				     struct vfio_info_cap *caps)
 {
 	struct vfio_iommu_type1_info_cap_iova_range *cap_iovas;
 	struct vfio_iova *iova;
 	size_t size;
 	int iovas = 0, i = 0, ret;
 
-	list_for_each_entry(iova, &iommu->iova_list, list)
+	list_for_each_entry(iova, iova_list, list)
 		iovas++;
 
 	if (!iovas) {
@@ -2990,7 +2990,7 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
 
 	cap_iovas->nr_iovas = iovas;
 
-	list_for_each_entry(iova, &iommu->iova_list, list) {
+	list_for_each_entry(iova, iova_list, list) {
 		cap_iovas->iova_ranges[i].start = iova->start;
 		cap_iovas->iova_ranges[i].end = iova->end;
 		i++;
@@ -3002,6 +3002,45 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
+				      struct vfio_info_cap *caps)
+{
+	return vfio_iova_list_build_caps(&iommu->iova_list, caps);
+}
+
+/* HACK: called by /dev/iommu core to build iova range cap for a device */
+int vfio_device_add_iova_cap(struct device *dev, struct vfio_info_cap *caps)
+{
+	u64 awidth;
+	dma_addr_t aperture_end;
+	LIST_HEAD(iova);
+	LIST_HEAD(dev_resv_regions);
+	int ret;
+
+	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth);
+	if (ret)
+		return ret;
+
+	/* FIXME: needs to use geometry info reported by iommu core. */
+	aperture_end = ((dma_addr_t)1) << awidth;
+
+	ret = vfio_iommu_iova_insert(&iova, 0, aperture_end);
+	if (ret)
+		return ret;
+
+	iommu_get_resv_regions(dev, &dev_resv_regions);
+	ret = vfio_iommu_resv_exclude(&iova, &dev_resv_regions);
+	if (ret)
+		goto out;
+
+	ret = vfio_iova_list_build_caps(&iova, caps);
+out:
+	vfio_iommu_iova_free(&iova);
+	iommu_put_resv_regions(dev, &dev_resv_regions);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_device_add_iova_cap);
+
 static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
 					   struct vfio_info_cap *caps)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index d904ee5a68cc..605b8e828be4 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -212,6 +212,8 @@ extern int vfio_info_add_capability(struct vfio_info_cap *caps,
 extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
 					      int num_irqs, int max_irq_type,
 					      size_t *data_size);
+extern int vfio_device_add_iova_cap(struct device *dev,
+				    struct vfio_info_cap *caps);
 
 struct pci_dev;
 #if IS_ENABLED(CONFIG_VFIO_SPAPR_EEH)
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 49731be71213..f408ad3c8ade 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -68,6 +68,7 @@
  *		   +---------------+------------+
  *		   ...
  * @addr_width:    the address width of supported I/O address spaces.
+ * @cap_offset:	   Offset within info struct of first cap
  *
  * Availability: after device is bound to iommufd
  */
@@ -77,9 +78,11 @@ struct iommu_device_info {
 #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
 #define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
 #define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_width field valid */
+#define IOMMU_DEVICE_INFO_CAPS		(1 << 3) /* info supports cap chain */
 	__u64	dev_cookie;
 	__u64   pgsize_bitmap;
 	__u32	addr_width;
+	__u32   cap_offset;
 };
 
 #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
-- 
2.25.1
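
A hedged userspace sketch of consuming the new cap chain; the helper name
is hypothetical, the RFC's uapi headers are assumed to be installed, and
the cap is parsed with the existing vfio cap-chain structures
(vfio_info_cap_header, VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE), since the
info is built by the hacked type1 code above; error handling is omitted:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

/* Hypothetical helper: dump the usable IOVA ranges of a bound device. */
static void dump_iova_ranges(int iommu_fd, __u64 dev_cookie)
{
	struct iommu_device_info *info;
	struct vfio_info_cap_header *hdr;
	size_t sz = sizeof(*info);

	info = calloc(1, sz);
	info->argsz = sz;
	info->dev_cookie = dev_cookie;	/* cookie chosen at bind time */

	/* first call: the kernel bumps argsz if the cap chain does not fit */
	ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, info);

	if (info->argsz > sz) {
		sz = info->argsz;
		info = realloc(info, sz);
		ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, info);
	}

	if (!(info->flags & IOMMU_DEVICE_INFO_CAPS) || !info->cap_offset)
		goto out;

	/* walk the vfio-style cap chain; offsets are relative to info */
	hdr = (void *)((char *)info + info->cap_offset);
	for (;;) {
		if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
			struct vfio_iommu_type1_info_cap_iova_range *cap =
				(void *)hdr;
			__u32 i;

			for (i = 0; i < cap->nr_iovas; i++)
				printf("iova [0x%llx - 0x%llx]\n",
				       (unsigned long long)cap->iova_ranges[i].start,
				       (unsigned long long)cap->iova_ranges[i].end);
		}
		if (!hdr->next)
			break;
		hdr = (void *)((char *)info + hdr->next);
	}
out:
	free(info);
}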


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 18/20] iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: kvm, kwankhede, jean-philippe, dave.jiang, ashok.raj, corbet,
	kevin.tian, parav, lkml, david, robin.murphy, jun.j.tian,
	linux-kernel, lushenming, iommu, pbonzini, dwmw2

[HACK. will fix in v2]

This patch introduces a vfio type1v2-equivalent interface to userspace.
Due to the aforementioned hack, iommufd currently calls exported vfio
symbols to handle map/unmap requests from the user.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 104 ++++++++++++++++++++++++++++++++
 include/uapi/linux/iommu.h      |  29 +++++++++
 2 files changed, 133 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index cbf5e30062a6..f5f2274d658c 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -55,6 +55,7 @@ struct iommufd_ioas {
 	struct mutex lock;
 	struct list_head device_list;
 	struct iommu_domain *domain;
+	struct vfio_iommu *vfio_iommu; /* FIXME: added for reusing vfio_iommu_type1 code */
 };
 
 /*
@@ -158,6 +159,7 @@ static void ioas_put_locked(struct iommufd_ioas *ioas)
 		return;
 
 	WARN_ON(!list_empty(&ioas->device_list));
+	vfio_iommu_type1_release(ioas->vfio_iommu); /* FIXME: reused vfio code */
 	xa_erase(&ictx->ioasid_xa, ioasid);
 	iommufd_ctx_put(ictx);
 	kfree(ioas);
@@ -185,6 +187,7 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 	struct iommufd_ioas *ioas;
 	unsigned long minsz;
 	int ioasid, ret;
+	struct vfio_iommu *vfio_iommu;
 
 	minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
 
@@ -211,6 +214,18 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 		return ret;
 	}
 
+	/* FIXME: get a vfio_iommu object for dma map/unmap management */
+	vfio_iommu = vfio_iommu_type1_open(VFIO_TYPE1v2_IOMMU);
+	if (IS_ERR(vfio_iommu)) {
+		pr_err_ratelimited("Failed to get vfio_iommu object\n");
+		mutex_lock(&ictx->lock);
+		xa_erase(&ictx->ioasid_xa, ioasid);
+		mutex_unlock(&ictx->lock);
+		kfree(ioas);
+		return PTR_ERR(vfio_iommu);
+	}
+	ioas->vfio_iommu = vfio_iommu;
+
 	ioas->ioasid = ioasid;
 
 	/* only supports kernel managed I/O page table so far */
@@ -383,6 +398,49 @@ static int iommufd_get_device_info(struct iommufd_ctx *ictx,
 	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
 }
 
+static int iommufd_process_dma_op(struct iommufd_ctx *ictx,
+				  unsigned long arg, bool map)
+{
+	struct iommu_ioasid_dma_op dma;
+	unsigned long minsz;
+	struct iommufd_ioas *ioas = NULL;
+	int ret;
+
+	minsz = offsetofend(struct iommu_ioasid_dma_op, padding);
+
+	if (copy_from_user(&dma, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (dma.argsz < minsz || dma.flags || dma.ioasid < 0)
+		return -EINVAL;
+
+	ioas = ioasid_get_ioas(ictx, dma.ioasid);
+	if (!ioas) {
+		pr_err_ratelimited("unknown IOASID %d\n", dma.ioasid);
+		return -EINVAL;
+	}
+
+	mutex_lock(&ioas->lock);
+
+	/*
+	 * Map/unmap requests from userspace must be blocked until the
+	 * IOASID has been attached to at least one device.
+	 */
+	if (list_empty(&ioas->device_list)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (map)
+		ret = vfio_iommu_type1_map_dma(ioas->vfio_iommu, arg + minsz);
+	else
+		ret = vfio_iommu_type1_unmap_dma(ioas->vfio_iommu, arg + minsz);
+out:
+	mutex_unlock(&ioas->lock);
+	ioas_put(ioas);
+	return ret;
+};
+
 static long iommufd_fops_unl_ioctl(struct file *filep,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -409,6 +467,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 	case IOMMU_IOASID_FREE:
 		ret = iommufd_ioasid_free(ictx, arg);
 		break;
+	case IOMMU_MAP_DMA:
+		ret = iommufd_process_dma_op(ictx, arg, true);
+		break;
+	case IOMMU_UNMAP_DMA:
+		ret = iommufd_process_dma_op(ictx, arg, false);
+		break;
 	default:
 		pr_err_ratelimited("unsupported cmd %u\n", cmd);
 		break;
@@ -478,6 +542,39 @@ static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
 	return 0;
 }
 
+/* HACK:
+ * vfio_iommu_add/remove_device() are hacky helpers for this version
+ * that add/remove the device's group to/from vfio iommu type1.
+ */
+static int vfio_iommu_add_device(struct vfio_iommu *vfio_iommu,
+				 struct device *dev,
+				 struct iommu_domain *domain)
+{
+	struct iommu_group *group;
+	int ret;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return -EINVAL;
+
+	ret = vfio_iommu_add_group(vfio_iommu, group, domain);
+	iommu_group_put(group);
+	return ret;
+}
+
+static void vfio_iommu_remove_device(struct vfio_iommu *vfio_iommu,
+				     struct device *dev)
+{
+	struct iommu_group *group;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return;
+
+	vfio_iommu_remove_group(vfio_iommu, group);
+	iommu_group_put(group);
+}
+
 /**
  * iommufd_device_attach_ioasid - attach device to an ioasid
  * @idev: [in] Pointer to struct iommufd_device.
@@ -539,11 +636,17 @@ int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)
 	if (ret)
 		goto out_domain;
 
+	ret = vfio_iommu_add_device(ioas->vfio_iommu, idev->dev, domain);
+	if (ret)
+		goto out_detach;
+
 	ioas_dev->idev = idev;
 	list_add(&ioas_dev->next, &ioas->device_list);
 	mutex_unlock(&ioas->lock);
 
 	return 0;
+out_detach:
+	iommu_detach_device(domain, idev->dev);
 out_domain:
 	ioas_free_domain_if_empty(ioas);
 out_free:
@@ -579,6 +682,7 @@ void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid)
 	}
 
 	list_del(&ioas_dev->next);
+	vfio_iommu_remove_device(ioas->vfio_iommu, idev->dev);
 	iommu_detach_device(ioas->domain, idev->dev);
 	ioas_free_domain_if_empty(ioas);
 	kfree(ioas_dev);
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index f408ad3c8ade..fe815cc1f665 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -141,6 +141,35 @@ struct iommu_ioasid_alloc {
 
 #define IOMMU_IOASID_FREE		_IO(IOMMU_TYPE, IOMMU_BASE + 3)
 
+/*
+ * Map/unmap process virtual addresses to I/O virtual addresses.
+ *
+ * Provides VFIO type1-equivalent semantics, starting with the same
+ * restriction, e.g. the unmap size must match the size used in the
+ * original mapping call.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @ioasid:	the handle of target I/O address space.
+ * @data:	the operation payload, refer to vfio_iommu_type1_dma_{un}map.
+ *
+ * FIXME:
+ *	userspace needs to include uapi/vfio.h as well, since the interface
+ *	reuses the map/unmap logic from vfio iommu type1.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct iommu_ioasid_dma_op {
+	__u32	argsz;
+	__u32	flags;
+	__s32	ioasid;
+	__u32	padding;
+	__u8	data[];
+};
+
+#define IOMMU_MAP_DMA	_IO(IOMMU_TYPE, IOMMU_BASE + 4)
+#define IOMMU_UNMAP_DMA	_IO(IOMMU_TYPE, IOMMU_BASE + 5)
+
 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
 #define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
-- 
2.25.1
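
A hedged userspace sketch of the combined payload this ioctl expects; the
helper name is hypothetical, the RFC's uapi headers are assumed to be
installed, and the layout (the vfio type1 map struct placed right after the
fixed part of iommu_ioasid_dma_op) follows what iommufd_process_dma_op()
above reads at arg and arg + minsz; error handling is minimal:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

/* Hypothetical helper: map a user buffer into an IOASID. */
static int ioasid_map(int iommu_fd, int ioasid, void *vaddr,
		      __u64 iova, __u64 size)
{
	struct iommu_ioasid_dma_op *op;
	struct vfio_iommu_type1_dma_map *map;
	size_t sz = sizeof(*op) + sizeof(*map);
	int ret;

	op = calloc(1, sz);
	if (!op)
		return -1;
	map = (void *)op->data;	/* vfio payload right after the fixed part */

	op->argsz  = sz;
	op->ioasid = ioasid;

	map->argsz = sizeof(*map);
	map->flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map->vaddr = (__u64)(uintptr_t)vaddr;
	map->iova  = iova;
	map->size  = size;

	ret = ioctl(iommu_fd, IOMMU_MAP_DMA, op);
	free(op);
	return ret;
}

An unmap would be built the same way, using IOMMU_UNMAP_DMA with a
vfio_iommu_type1_dma_unmap payload in place of the map struct.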


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 18/20] iommu/iommufd: Add IOMMU_[UN]MAP_DMA on IOASID
@ 2021-09-19  6:38   ` Liu Yi L
  0 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

[HACK. will fix in v2]

This patch introduces vfio type1v2-equivalent interface to userspace. Due
to aforementioned hack, iommufd currently calls exported vfio symbols to
handle map/unmap requests from the user.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/iommufd.c | 104 ++++++++++++++++++++++++++++++++
 include/uapi/linux/iommu.h      |  29 +++++++++
 2 files changed, 133 insertions(+)

diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
index cbf5e30062a6..f5f2274d658c 100644
--- a/drivers/iommu/iommufd/iommufd.c
+++ b/drivers/iommu/iommufd/iommufd.c
@@ -55,6 +55,7 @@ struct iommufd_ioas {
 	struct mutex lock;
 	struct list_head device_list;
 	struct iommu_domain *domain;
+	struct vfio_iommu *vfio_iommu; /* FIXME: added for reusing vfio_iommu_type1 code */
 };
 
 /*
@@ -158,6 +159,7 @@ static void ioas_put_locked(struct iommufd_ioas *ioas)
 		return;
 
 	WARN_ON(!list_empty(&ioas->device_list));
+	vfio_iommu_type1_release(ioas->vfio_iommu); /* FIXME: reused vfio code */
 	xa_erase(&ictx->ioasid_xa, ioasid);
 	iommufd_ctx_put(ictx);
 	kfree(ioas);
@@ -185,6 +187,7 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 	struct iommufd_ioas *ioas;
 	unsigned long minsz;
 	int ioasid, ret;
+	struct vfio_iommu *vfio_iommu;
 
 	minsz = offsetofend(struct iommu_ioasid_alloc, addr_width);
 
@@ -211,6 +214,18 @@ static int iommufd_ioasid_alloc(struct iommufd_ctx *ictx, unsigned long arg)
 		return ret;
 	}
 
+	/* FIXME: get a vfio_iommu object for dma map/unmap management */
+	vfio_iommu = vfio_iommu_type1_open(VFIO_TYPE1v2_IOMMU);
+	if (IS_ERR(vfio_iommu)) {
+		pr_err_ratelimited("Failed to get vfio_iommu object\n");
+		mutex_lock(&ictx->lock);
+		xa_erase(&ictx->ioasid_xa, ioasid);
+		mutex_unlock(&ictx->lock);
+		kfree(ioas);
+		return PTR_ERR(vfio_iommu);
+	}
+	ioas->vfio_iommu = vfio_iommu;
+
 	ioas->ioasid = ioasid;
 
 	/* only supports kernel managed I/O page table so far */
@@ -383,6 +398,49 @@ static int iommufd_get_device_info(struct iommufd_ctx *ictx,
 	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
 }
 
+static int iommufd_process_dma_op(struct iommufd_ctx *ictx,
+				  unsigned long arg, bool map)
+{
+	struct iommu_ioasid_dma_op dma;
+	unsigned long minsz;
+	struct iommufd_ioas *ioas = NULL;
+	int ret;
+
+	minsz = offsetofend(struct iommu_ioasid_dma_op, padding);
+
+	if (copy_from_user(&dma, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (dma.argsz < minsz || dma.flags || dma.ioasid < 0)
+		return -EINVAL;
+
+	ioas = ioasid_get_ioas(ictx, dma.ioasid);
+	if (!ioas) {
+		pr_err_ratelimited("unkonwn IOASID %u\n", dma.ioasid);
+		return -EINVAL;
+	}
+
+	mutex_lock(&ioas->lock);
+
+	/*
+	 * Needs to block map/unmap request from userspace before IOASID
+	 * is attached to any device.
+	 */
+	if (list_empty(&ioas->device_list)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (map)
+		ret = vfio_iommu_type1_map_dma(ioas->vfio_iommu, arg + minsz);
+	else
+		ret = vfio_iommu_type1_unmap_dma(ioas->vfio_iommu, arg + minsz);
+out:
+	mutex_unlock(&ioas->lock);
+	ioas_put(ioas);
+	return ret;
+};
+
 static long iommufd_fops_unl_ioctl(struct file *filep,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -409,6 +467,12 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
 	case IOMMU_IOASID_FREE:
 		ret = iommufd_ioasid_free(ictx, arg);
 		break;
+	case IOMMU_MAP_DMA:
+		ret = iommufd_process_dma_op(ictx, arg, true);
+		break;
+	case IOMMU_UNMAP_DMA:
+		ret = iommufd_process_dma_op(ictx, arg, false);
+		break;
 	default:
 		pr_err_ratelimited("unsupported cmd %u\n", cmd);
 		break;
@@ -478,6 +542,39 @@ static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
 	return 0;
 }
 
+/* HACK:
+ * vfio_iommu_add/remove_device() is hacky implementation for
+ * this version to add the device/group to vfio iommu type1.
+ */
+static int vfio_iommu_add_device(struct vfio_iommu *vfio_iommu,
+				 struct device *dev,
+				 struct iommu_domain *domain)
+{
+	struct iommu_group *group;
+	int ret;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return -EINVAL;
+
+	ret = vfio_iommu_add_group(vfio_iommu, group, domain);
+	iommu_group_put(group);
+	return ret;
+}
+
+static void vfio_iommu_remove_device(struct vfio_iommu *vfio_iommu,
+				     struct device *dev)
+{
+	struct iommu_group *group;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return;
+
+	vfio_iommu_remove_group(vfio_iommu, group);
+	iommu_group_put(group);
+}
+
 /**
  * iommufd_device_attach_ioasid - attach device to an ioasid
  * @idev: [in] Pointer to struct iommufd_device.
@@ -539,11 +636,17 @@ int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)
 	if (ret)
 		goto out_domain;
 
+	ret = vfio_iommu_add_device(ioas->vfio_iommu, idev->dev, domain);
+	if (ret)
+		goto out_detach;
+
 	ioas_dev->idev = idev;
 	list_add(&ioas_dev->next, &ioas->device_list);
 	mutex_unlock(&ioas->lock);
 
 	return 0;
+out_detach:
+	iommu_detach_device(domain, idev->dev);
 out_domain:
 	ioas_free_domain_if_empty(ioas);
 out_free:
@@ -579,6 +682,7 @@ void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid)
 	}
 
 	list_del(&ioas_dev->next);
+	vfio_iommu_remove_device(ioas->vfio_iommu, idev->dev);
 	iommu_detach_device(ioas->domain, idev->dev);
 	ioas_free_domain_if_empty(ioas);
 	kfree(ioas_dev);
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index f408ad3c8ade..fe815cc1f665 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -141,6 +141,35 @@ struct iommu_ioasid_alloc {
 
 #define IOMMU_IOASID_FREE		_IO(IOMMU_TYPE, IOMMU_BASE + 3)
 
+/*
+ * Map/unmap process virtual addresses to I/O virtual addresses.
+ *
+ * Provides VFIO type1 equivalent semantics, starting with the same
+ * restrictions, e.g. the unmap size must match the size used in the
+ * original mapping call.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	reserved for future extension.
+ * @ioasid:	the handle of target I/O address space.
+ * @data:	the operation payload, refer to vfio_iommu_type1_dma_{un}map.
+ *
+ * FIXME:
+ *	userspace needs to include uapi/vfio.h as well, since this
+ *	interface reuses the map/unmap logic from vfio iommu type1.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct iommu_ioasid_dma_op {
+	__u32	argsz;
+	__u32	flags;
+	__s32	ioasid;
+	__u32	padding;
+	__u8	data[];
+};
+
+#define IOMMU_MAP_DMA	_IO(IOMMU_TYPE, IOMMU_BASE + 4)
+#define IOMMU_UNMAP_DMA	_IO(IOMMU_TYPE, IOMMU_BASE + 5)
+
 #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
 #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
 #define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
-- 
2.25.1
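
For reference, a minimal userspace sketch of driving the IOMMU_MAP_DMA path
added above. Per the FIXME in the uAPI comment, it assumes the payload
following struct iommu_ioasid_dma_op is a struct vfio_iommu_type1_dma_map
from uapi/vfio.h, and that <sys/mman.h>, <sys/ioctl.h>, <linux/iommu.h> and
<linux/vfio.h> are included; iommu_fd and gpa_ioasid are placeholders for
values obtained as in the usage example of patch 20 below:

	/* sketch only, not part of the patch: map 4KB at IOVA 0 */
	void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(unsigned long)buf,	/* HVA */
		.iova  = 0,				/* IOVA/GPA */
		.size  = 4096,
	};
	__u8 cmd[sizeof(struct iommu_ioasid_dma_op) + sizeof(map)]
		__attribute__((aligned(8))) = {};
	struct iommu_ioasid_dma_op *op = (struct iommu_ioasid_dma_op *)cmd;

	op->argsz  = sizeof(cmd);
	op->ioasid = gpa_ioasid;	/* from IOMMU_IOASID_ALLOC */
	memcpy(op->data, &map, sizeof(map));

	if (ioctl(iommu_fd, IOMMU_MAP_DMA, cmd))
		perror("IOMMU_MAP_DMA");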


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 19/20] iommu/vt-d: Implement device_info iommu_ops callback
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

From: Lu Baolu <baolu.lu@linux.intel.com>

Expose per-device IOMMU attributes to the upper layers.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index dd22fc7d5176..d531ea44f418 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5583,6 +5583,40 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 	}
 }
 
+static int
+intel_iommu_device_info(struct device *dev, enum iommu_devattr type, void *data)
+{
+	struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
+	int ret = 0;
+
+	if (!iommu)
+		return -ENODEV;
+
+	switch (type) {
+	case IOMMU_DEV_INFO_PAGE_SIZE:
+		*(u64 *)data = SZ_4K |
+			(cap_super_page_val(iommu->cap) & BIT(0) ? SZ_2M : 0) |
+			(cap_super_page_val(iommu->cap) & BIT(1) ? SZ_1G : 0);
+		break;
+	case IOMMU_DEV_INFO_FORCE_SNOOP:
+		/*
+		 * Force snoop is always supported in the scalable mode. For the legacy
+		 * mode, check the capability register.
+		 */
+		*(bool *)data = sm_supported(iommu) || ecap_sc_support(iommu->ecap);
+		break;
+	case IOMMU_DEV_INFO_ADDR_WIDTH:
+		*(u32 *)data = min_t(u32, agaw_to_width(iommu->agaw),
+				     cap_mgaw(iommu->cap));
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
 const struct iommu_ops intel_iommu_ops = {
 	.capable		= intel_iommu_capable,
 	.domain_alloc		= intel_iommu_domain_alloc,
@@ -5621,6 +5655,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.sva_get_pasid		= intel_svm_get_pasid,
 	.page_response		= intel_svm_page_response,
 #endif
+	.device_info		= intel_iommu_device_info,
 };
 
 static void quirk_iommu_igfx(struct pci_dev *dev)
-- 
2.25.1
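
For illustration, a sketch of how a caller might consume the attributes
reported by this callback through the iommu_device_get_info() interface
added earlier in the series (patch 04); the helper's exact signature is
assumed here rather than taken from the patches:

	/* sketch only: query the attributes reported by the callback above */
	u64 pgsizes;
	u32 addr_width;
	bool force_snoop;

	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes) &&
	    !iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width) &&
	    !iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &force_snoop))
		pr_debug("pgsizes 0x%llx aw %u snoop %d\n",
			 (unsigned long long)pgsizes, addr_width, force_snoop);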


^ permalink raw reply	[flat|nested] 532+ messages in thread

* [RFC 20/20] Doc: Add documentation for /dev/iommu
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:38   ` Liu Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu Yi L @ 2021-09-19  6:38 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, yi.l.liu, jun.j.tian,
	hao.wu, dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm,
	iommu, dwmw2, linux-kernel, baolu.lu, david, nicolinc

Document the /dev/iommu framework for users.

Open:
Do we want to document /dev/iommu in Documentation/userspace-api/iommu.rst?
The existing iommu.rst covers the vSVA interfaces; honestly, it may need to
be rewritten entirely.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 Documentation/userspace-api/index.rst   |   1 +
 Documentation/userspace-api/iommufd.rst | 183 ++++++++++++++++++++++++
 2 files changed, 184 insertions(+)
 create mode 100644 Documentation/userspace-api/iommufd.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..54df5a278023 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -25,6 +25,7 @@ place where this information is gathered.
    ebpf/index
    ioctl/index
    iommu
+   iommufd
    media/index
    sysfs-platform_profile
 
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
new file mode 100644
index 000000000000..abffbb47dc02
--- /dev/null
+++ b/Documentation/userspace-api/iommufd.rst
@@ -0,0 +1,183 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. iommu:
+
+===================
+IOMMU Userspace API
+===================
+
+Direct device access from userspace has been a critical feature in
+high performance computing and virtualization usages. Linux now
+includes multiple device-passthrough frameworks (e.g. VFIO and vDPA)
+to manage secure device access from the userspace. One critical
+task of those frameworks is to put the assigned device in a secure,
+IOMMU-protected context so the device is prevented from doing harm
+to the rest of the system.
+
+Currently those frameworks implement their own logic for managing
+I/O page tables to isolate user-initiated DMAs. This doesn't scale
+to support many new IOMMU features, such as PASID-granular DMA
+remapping, nested translation, I/O page fault, IOMMU dirty bit, etc.
+
+The /dev/iommu framework provides a unified interface for managing
+I/O page tables for passthrough devices. Existing passthrough
+frameworks are expected to use this interface instead of continuing
+their ad-hoc implementations.
+
+IOMMUFDs, IOASIDs, Devices and Groups
+-------------------------------------
+
+The core concepts in /dev/iommu are IOMMUFDs and IOASIDs. IOMMUFD (by
+opening /dev/iommu) is the container holding multiple I/O address
+spaces for a user, while IOASID is the fd-local software handle
+representing an I/O address space and associated with a single I/O
+page table. User manages those address spaces through fd operations,
+e.g. by using vfio type1v2 mapping semantics to manage respective
+I/O page tables.
+
+An IOASID is comparable to the container concept in VFIO. The latter
+is also associated with a single I/O address space. A main difference
+between them is that multiple IOASIDs in the same IOMMUFD can be
+nested together (not supported yet) to allow centralized accounting
+of locked pages, while multiple containers are disconnected and thus
+incur duplicated accounting. Typically one IOMMUFD is sufficient
+for all intended IOMMU usages for a user.
+
+An I/O address space takes effect in the IOMMU only after it is
+attached by a device. One I/O address space can be attached by
+multiple devices. One device can only be attached to a single I/O
+address space at this point (on par with current vfio behavior).
+
+A device must be bound to an iommufd before the attach operation can
+be conducted. The binding operation builds the connection between
+the devicefd (opened via the device-passthrough framework) and IOMMUFD.
+An IOMMU-protected security context is established when the binding
+operation is completed. The passthrough framework must block user
+access to the assigned device until bind() returns success.
+
+The entire /dev/iommu framework adopts a device-centric model w/o
+carrying any container/group legacy as current vfio does. However
+the group is the minimum granularity that must be used to ensure
+secure user access (refer to vfio.rst). This framework relies on
+the IOMMU core layer to map device-centric model into group-granular
+isolation.
+
+Managing I/O Address Spaces
+---------------------------
+
+When creating an I/O address space (by allocating IOASID), the user
+must specify the type of underlying I/O page table. Currently only
+one type (kernel-managed) is supported. In the future other types
+will be introduced, e.g. to support user-managed I/O page table or
+a shared I/O page table which is managed by another kernel sub-
+system (mm, ept, etc.). Kernel-managed I/O page table is currently
+managed via vfio type1v2 equivalent mapping semantics.
+
+The user also needs to specify the format of the I/O page table
+when allocating an IOASID. The format must be compatible to the
+attached devices (or more specifically to the IOMMU which serves
+the DMA from the attached devices). User can query the device IOMMU
+format via IOMMUFD once a device is successfully bound. Attaching a
+device to an IOASID with incompatible format is simply rejected.
+
+No-snoop DMA is not supported yet. This implies that an IOASID
+must be created in an enforce-snoop format and only devices which
+the IOMMU can force to snoop the cache are allowed to be attached
+to it. The user should check the uAPI extension and get the device
+info via IOMMUFD to handle this restriction.
+
+Usage Example
+-------------
+
+Assume user wants to access PCI device 0000:06:0d.0, which is
+exposed under the new /dev/vfio/devices directory by VFIO:
+
+	/* Open device-centric interface and /dev/iommu interface */
+	device_fd = open("/dev/vfio/devices/0000:06:0d.0", O_RDWR);
+	iommu_fd = open("/dev/iommu", O_RDWR);
+
+	/* Bind device to IOMMUFD */
+	bind_data = { .iommu_fd = iommu_fd, .dev_cookie = cookie };
+	ioctl(device_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind_data);
+
+	/* Query per-device IOMMU capability/format */
+	info = { .dev_cookie = cookie, };
+	ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info);
+
+	if (!(info.flags & IOMMU_DEVICE_INFO_ENFORCE_SNOOP)) {
+		if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION,
+				EXT_DMA_NO_SNOOP))
+			/* No support of no-snoop DMA */
+	}
+
+	if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_MAP_TYPE1V2))
+		/* No support of vfio type1v2 mapping semantics */
+
+	/* Decides IOASID alloc fields based on info */
+	alloc_data = { .type = IOMMU_IOASID_TYPE_KERNEL,
+		       .flags = IOMMU_IOASID_ENFORCE_SNOOP,
+		       .addr_width = info.addr_width, };
+
+	/* Allocate IOASID */
+	gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
+
+	/* Attach device to an IOASID */
+	at_data = { .iommu_fd = iommu_fd, .ioasid = gpa_ioasid };
+	ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &at_data);
+
+	/* Setup GPA mapping [0 - 1GB] */
+	dma_map = {
+		.ioasid	= gpa_ioasid,
+		.data {
+			.flags  = R/W		/* permission */
+			.iova	= 0,		/* GPA */
+			.vaddr	= 0x40000000,	/* HVA */
+			.size	= 1GB,
+		},
+	};
+	ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
+
+	/* DMA */
+
+	/* Unmap GPA mapping [0 - 1GB] */
+	dma_unmap = {
+		.ioasid	= gpa_ioasid,
+		.data {
+			.iova	= 0,		/* GPA */
+			.size	= 1GB,
+		},
+	};
+	ioctl(iommu_fd, IOMMU_UNMAP_DMA, &dma_unmap);
+
+	/* Detach device from an IOASID */
+	dt_data = { .iommu_fd = iommu_fd, .ioasid = gpa_ioasid };
+	ioctl(device_fd, VFIO_DEVICE_DETACH_IOASID, &dt_data);
+
+	/* Free IOASID */
+	ioctl(iommu_fd, IOMMU_IOASID_FREE, gpa_ioasid);
+
+	close(device_fd);
+	close(iommu_fd);
+
+API for device-passthrough frameworks
+-------------------------------------
+
+iommufd binding and IOASID attach/detach are initiated via the device-
+passthrough framework uAPI.
+
+When a binding operation is requested by the user, the passthrough
+framework should call iommufd_bind_device(). When the device fd is
+closed by the user, iommufd_unbind_device() should be called
+automatically::
+
+	struct iommufd_device *
+	iommufd_bind_device(int fd, struct device *dev,
+			   u64 dev_cookie);
+	void iommufd_unbind_device(struct iommufd_device *idev);
+
+IOASID attach/detach operations are per iommufd_device which is
+returned by iommufd_bind_device()::
+
+	int iommufd_device_attach_ioasid(struct iommufd_device *idev,
+					int ioasid);
+	void iommufd_device_detach_ioasid(struct iommufd_device *idev,
+					int ioasid);
-- 
2.25.1
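
To illustrate the driver-facing API documented above, a sketch of how a
passthrough framework could wire its bind/attach uAPI to iommufd. Only
iommufd_bind_device() and iommufd_device_attach_ioasid() come from this
series; the my_device container and the function names are invented for
the example:

	/* sketch only, not part of the series */
	struct my_device {
		struct device *dev;
		struct iommufd_device *idev;	/* NULL until bound */
	};

	static int my_bind_iommufd(struct my_device *mdev, int fd, u64 cookie)
	{
		struct iommufd_device *idev;

		idev = iommufd_bind_device(fd, mdev->dev, cookie);
		if (IS_ERR(idev))
			return PTR_ERR(idev);
		mdev->idev = idev;
		return 0;
	}

	static int my_attach_ioasid(struct my_device *mdev, int ioasid)
	{
		if (!mdev->idev)
			return -EINVAL;		/* must bind first */
		return iommufd_device_attach_ioasid(mdev->idev, ioasid);
	}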


^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-19  6:45   ` Liu, Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu, Yi L @ 2021-09-19  6:45 UTC (permalink / raw)
  To: alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, Tian, Kevin, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, Raj, Ashok, yi.l.liu, Tian, Jun J, Wu, Hao,
	Jiang, Dave, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, baolu.lu, david, nicolinc

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, September 19, 2021 2:38 PM
[...]
> [Series Overview]
>
> * Basic skeleton:
>   0001-iommu-iommufd-Add-dev-iommu-core.patch
> 
> * VFIO PCI creates device-centric interface:
>   0002-vfio-Add-device-class-for-dev-vfio-devices.patch
>   0003-vfio-Add-vfio_-un-register_device.patch
>   0004-iommu-Add-iommu_device_get_info-interface.patch
>   0005-vfio-pci-Register-device-to-dev-vfio-devices.patch
> 
> * Bind device fd with iommufd:
>   0006-iommu-Add-iommu_device_init-exit-_user_dma-interface.patch
>   0007-iommu-iommufd-Add-iommufd_-un-bind_device.patch
>   0008-vfio-pci-Add-VFIO_DEVICE_BIND_IOMMUFD.patch
> 
> * IOASID allocation:
>   0009-iommu-Add-page-size-and-address-width-attributes.patch
>   0010-iommu-iommufd-Add-IOMMU_DEVICE_GET_INFO.patch
>   0011-iommu-iommufd-Add-IOMMU_IOASID_ALLOC-FREE.patch
>   0012-iommu-iommufd-Add-IOMMU_CHECK_EXTENSION.patch
> 
> * IOASID [de]attach:
>   0013-iommu-Extend-iommu_at-de-tach_device-for-multiple-de.patch
>   0014-iommu-iommufd-Add-iommufd_device_-de-attach_ioasid.patch
>   0015-vfio-pci-Add-VFIO_DEVICE_-DE-ATTACH_IOASID.patch
> 
> * DMA (un)map:
>   0016-vfio-type1-Export-symbols-for-dma-un-map-code-sharin.patch
>   0017-iommu-iommufd-Report-iova-range-to-userspace.patch
>   0018-iommu-iommufd-Add-IOMMU_-UN-MAP_DMA-on-IOASID.patch
> 
> * Report the device info in vt-d driver to enable whole series:
>   0019-iommu-vt-d-Implement-device_info-iommu_ops-callback.patch
> 
> * Add doc:
>   0020-Doc-Add-documentation-for-dev-iommu.patch

Please refer to the above patch overview. Sorry for the duplicated content.

thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-19  6:38   ` Liu Yi L
  (?)
@ 2021-09-19 10:08   ` kernel test robot
  -1 siblings, 0 replies; 532+ messages in thread
From: kernel test robot @ 2021-09-19 10:08 UTC (permalink / raw)
  To: Liu Yi L; +Cc: llvm, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 8127 bytes --]

Hi Liu,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on hch-configfs/for-next]
[cannot apply to joro-iommu/next awilliam-vfio/next linus/master v5.15-rc1 next-20210917]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Liu-Yi-L/Introduce-dev-iommu-for-userspace-I-O-address-space-management/20210919-144631
base:   git://git.infradead.org/users/hch/configfs.git for-next
config: i386-randconfig-a016-20210919 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c8b3d7d6d6de37af68b2f379d0e37304f78e115f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/597d2cdca2b63c05ae9cb0c8bb8e3b9f53b82685
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Liu-Yi-L/Introduce-dev-iommu-for-userspace-I-O-address-space-management/20210919-144631
        git checkout 597d2cdca2b63c05ae9cb0c8bb8e3b9f53b82685
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/iommu/iommufd/iommufd.c:181:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
                   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:221:17: note: uninitialized use occurs here
           return ERR_PTR(ret);
                          ^~~
   drivers/iommu/iommufd/iommufd.c:181:3: note: remove the 'if' if its condition is always false
                   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/iommu/iommufd/iommufd.c:181:7: warning: variable 'ret' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
                   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                       ^~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:221:17: note: uninitialized use occurs here
           return ERR_PTR(ret);
                          ^~~
   drivers/iommu/iommufd/iommufd.c:181:7: note: remove the '||' if its condition is always false
                   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                       ^~~~~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:171:9: note: initialize the variable 'ret' to silence this warning
           int ret;
                  ^
                   = 0
   2 warnings generated.

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for IOMMUFD
   Depends on IOMMU_SUPPORT
   Selected by
   - VFIO_PCI && VFIO && PCI && EVENTFD && MMU


vim +181 drivers/iommu/iommufd/iommufd.c

6a3d96f3199688 Liu Yi L 2021-09-19  151  
09ba2a1b554c58 Liu Yi L 2021-09-19  152  /**
09ba2a1b554c58 Liu Yi L 2021-09-19  153   * iommufd_bind_device - Bind a physical device marked by a device
09ba2a1b554c58 Liu Yi L 2021-09-19  154   *			 cookie to an iommu fd.
09ba2a1b554c58 Liu Yi L 2021-09-19  155   * @fd:		[in] iommufd file descriptor.
09ba2a1b554c58 Liu Yi L 2021-09-19  156   * @dev:	[in] Pointer to a physical device struct.
09ba2a1b554c58 Liu Yi L 2021-09-19  157   * @dev_cookie:	[in] A cookie to mark the device in /dev/iommu uAPI.
09ba2a1b554c58 Liu Yi L 2021-09-19  158   *
09ba2a1b554c58 Liu Yi L 2021-09-19  159   * A successful bind establishes a security context for the device
09ba2a1b554c58 Liu Yi L 2021-09-19  160   * and returns struct iommufd_device pointer. Otherwise returns
09ba2a1b554c58 Liu Yi L 2021-09-19  161   * error pointer.
09ba2a1b554c58 Liu Yi L 2021-09-19  162   *
09ba2a1b554c58 Liu Yi L 2021-09-19  163   */
09ba2a1b554c58 Liu Yi L 2021-09-19  164  struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
09ba2a1b554c58 Liu Yi L 2021-09-19  165  					   u64 dev_cookie)
09ba2a1b554c58 Liu Yi L 2021-09-19  166  {
09ba2a1b554c58 Liu Yi L 2021-09-19  167  	struct iommufd_ctx *ictx;
09ba2a1b554c58 Liu Yi L 2021-09-19  168  	struct iommufd_device *idev;
09ba2a1b554c58 Liu Yi L 2021-09-19  169  	unsigned long index;
09ba2a1b554c58 Liu Yi L 2021-09-19  170  	unsigned int id;
09ba2a1b554c58 Liu Yi L 2021-09-19  171  	int ret;
09ba2a1b554c58 Liu Yi L 2021-09-19  172  
09ba2a1b554c58 Liu Yi L 2021-09-19  173  	ictx = iommufd_ctx_fdget(fd);
09ba2a1b554c58 Liu Yi L 2021-09-19  174  	if (!ictx)
09ba2a1b554c58 Liu Yi L 2021-09-19  175  		return ERR_PTR(-EINVAL);
09ba2a1b554c58 Liu Yi L 2021-09-19  176  
09ba2a1b554c58 Liu Yi L 2021-09-19  177  	mutex_lock(&ictx->lock);
09ba2a1b554c58 Liu Yi L 2021-09-19  178  
09ba2a1b554c58 Liu Yi L 2021-09-19  179  	/* check duplicate registration */
09ba2a1b554c58 Liu Yi L 2021-09-19  180  	xa_for_each(&ictx->device_xa, index, idev) {
09ba2a1b554c58 Liu Yi L 2021-09-19 @181  		if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
09ba2a1b554c58 Liu Yi L 2021-09-19  182  			idev = ERR_PTR(-EBUSY);
09ba2a1b554c58 Liu Yi L 2021-09-19  183  			goto out_unlock;
09ba2a1b554c58 Liu Yi L 2021-09-19  184  		}
09ba2a1b554c58 Liu Yi L 2021-09-19  185  	}
09ba2a1b554c58 Liu Yi L 2021-09-19  186  
09ba2a1b554c58 Liu Yi L 2021-09-19  187  	idev = kzalloc(sizeof(*idev), GFP_KERNEL);
09ba2a1b554c58 Liu Yi L 2021-09-19  188  	if (!idev) {
09ba2a1b554c58 Liu Yi L 2021-09-19  189  		ret = -ENOMEM;
09ba2a1b554c58 Liu Yi L 2021-09-19  190  		goto out_unlock;
09ba2a1b554c58 Liu Yi L 2021-09-19  191  	}
09ba2a1b554c58 Liu Yi L 2021-09-19  192  
09ba2a1b554c58 Liu Yi L 2021-09-19  193  	/* Establish the security context */
09ba2a1b554c58 Liu Yi L 2021-09-19  194  	ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
09ba2a1b554c58 Liu Yi L 2021-09-19  195  	if (ret)
09ba2a1b554c58 Liu Yi L 2021-09-19  196  		goto out_free;
09ba2a1b554c58 Liu Yi L 2021-09-19  197  
09ba2a1b554c58 Liu Yi L 2021-09-19  198  	ret = xa_alloc(&ictx->device_xa, &id, idev,
09ba2a1b554c58 Liu Yi L 2021-09-19  199  		       XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
09ba2a1b554c58 Liu Yi L 2021-09-19  200  		       GFP_KERNEL);
09ba2a1b554c58 Liu Yi L 2021-09-19  201  	if (ret) {
09ba2a1b554c58 Liu Yi L 2021-09-19  202  		idev = ERR_PTR(ret);
09ba2a1b554c58 Liu Yi L 2021-09-19  203  		goto out_user_dma;
09ba2a1b554c58 Liu Yi L 2021-09-19  204  	}
09ba2a1b554c58 Liu Yi L 2021-09-19  205  
09ba2a1b554c58 Liu Yi L 2021-09-19  206  	idev->ictx = ictx;
09ba2a1b554c58 Liu Yi L 2021-09-19  207  	idev->dev = dev;
09ba2a1b554c58 Liu Yi L 2021-09-19  208  	idev->dev_cookie = dev_cookie;
09ba2a1b554c58 Liu Yi L 2021-09-19  209  	idev->id = id;
09ba2a1b554c58 Liu Yi L 2021-09-19  210  	mutex_unlock(&ictx->lock);
09ba2a1b554c58 Liu Yi L 2021-09-19  211  
09ba2a1b554c58 Liu Yi L 2021-09-19  212  	return idev;
09ba2a1b554c58 Liu Yi L 2021-09-19  213  out_user_dma:
09ba2a1b554c58 Liu Yi L 2021-09-19  214  	iommu_device_exit_user_dma(idev->dev);
09ba2a1b554c58 Liu Yi L 2021-09-19  215  out_free:
09ba2a1b554c58 Liu Yi L 2021-09-19  216  	kfree(idev);
09ba2a1b554c58 Liu Yi L 2021-09-19  217  out_unlock:
09ba2a1b554c58 Liu Yi L 2021-09-19  218  	mutex_unlock(&ictx->lock);
09ba2a1b554c58 Liu Yi L 2021-09-19  219  	iommufd_ctx_put(ictx);
09ba2a1b554c58 Liu Yi L 2021-09-19  220  
09ba2a1b554c58 Liu Yi L 2021-09-19  221  	return ERR_PTR(ret);
09ba2a1b554c58 Liu Yi L 2021-09-19  222  }
09ba2a1b554c58 Liu Yi L 2021-09-19  223  EXPORT_SYMBOL_GPL(iommufd_bind_device);
09ba2a1b554c58 Liu Yi L 2021-09-19  224  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37648 bytes --]
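
The warning comes from the duplicate-registration branch in
iommufd_bind_device(), which jumps to the common exit without assigning
'ret'. One possible fix, sketched here rather than taken from a follow-up
patch, is to set 'ret' instead of overwriting 'idev':

	/* check duplicate registration */
	xa_for_each(&ictx->device_xa, index, idev) {
		if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
			ret = -EBUSY;
			goto out_unlock;	/* returns ERR_PTR(-EBUSY) */
		}
	}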

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-19  6:38   ` Liu Yi L
  (?)
@ 2021-09-19 11:03   ` kernel test robot
  -1 siblings, 0 replies; 532+ messages in thread
From: kernel test robot @ 2021-09-19 11:03 UTC (permalink / raw)
  To: Liu Yi L; +Cc: llvm, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4980 bytes --]

Hi Liu,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on hch-configfs/for-next]
[cannot apply to joro-iommu/next awilliam-vfio/next linus/master v5.15-rc1 next-20210917]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Liu-Yi-L/Introduce-dev-iommu-for-userspace-I-O-address-space-management/20210919-144631
base:   git://git.infradead.org/users/hch/configfs.git for-next
config: i386-randconfig-a016-20210919 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project c8b3d7d6d6de37af68b2f379d0e37304f78e115f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/262b8a31d0b9bbcb24f7fc2eed5e7ac849265047
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Liu-Yi-L/Introduce-dev-iommu-for-userspace-I-O-address-space-management/20210919-144631
        git checkout 262b8a31d0b9bbcb24f7fc2eed5e7ac849265047
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/iommu/iommufd/iommufd.c:205:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           if (refcount_read(&ioas->refs) > 1) {
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:213:9: note: uninitialized use occurs here
           return ret;
                  ^~~
   drivers/iommu/iommufd/iommufd.c:205:2: note: remove the 'if' if its condition is always true
           if (refcount_read(&ioas->refs) > 1) {
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:189:17: note: initialize the variable 'ret' to silence this warning
           int ioasid, ret;
                          ^
                           = 0
   drivers/iommu/iommufd/iommufd.c:369:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
                   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:409:17: note: uninitialized use occurs here
           return ERR_PTR(ret);
                          ^~~
   drivers/iommu/iommufd/iommufd.c:369:3: note: remove the 'if' if its condition is always false
                   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:369:7: warning: variable 'ret' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
                   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                       ^~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:409:17: note: uninitialized use occurs here
           return ERR_PTR(ret);
                          ^~~
   drivers/iommu/iommufd/iommufd.c:369:7: note: remove the '||' if its condition is always false
                   if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                       ^~~~~~~~~~~~~~~~~~~
   drivers/iommu/iommufd/iommufd.c:359:9: note: initialize the variable 'ret' to silence this warning
           int ret;
                  ^
                   = 0
   3 warnings generated.

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for IOMMUFD
   Depends on IOMMU_SUPPORT
   Selected by
   - VFIO_PCI && VFIO && PCI && EVENTFD && MMU


vim +205 drivers/iommu/iommufd/iommufd.c

   185	
   186	static int iommufd_ioasid_free(struct iommufd_ctx *ictx, unsigned long arg)
   187	{
   188		struct iommufd_ioas *ioas = NULL;
   189		int ioasid, ret;
   190	
   191		if (copy_from_user(&ioasid, (void __user *)arg, sizeof(ioasid)))
   192			return -EFAULT;
   193	
   194		if (ioasid < 0)
   195			return -EINVAL;
   196	
   197		mutex_lock(&ictx->lock);
   198		ioas = xa_load(&ictx->ioasid_xa, ioasid);
   199		if (IS_ERR(ioas)) {
   200			ret = -EINVAL;
   201			goto out_unlock;
   202		}
   203	
   204		/* Disallow free if refcount is not 1 */
 > 205		if (refcount_read(&ioas->refs) > 1) {
   206			ret = -EBUSY;
   207			goto out_unlock;
   208		}
   209	
   210		ioas_put_locked(ioas);
   211	out_unlock:
   212		mutex_unlock(&ictx->lock);
   213		return ret;
   214	};
   215	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37648 bytes --]
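
Here the warning points at the success path of iommufd_ioasid_free(): when
the refcount check passes, the function falls through to 'return ret'
without ever assigning it. One possible fix, sketched here rather than
taken from a follow-up patch, is to set 'ret' explicitly once the free
succeeds:

	/* Disallow free if refcount is not 1 */
	if (refcount_read(&ioas->refs) > 1) {
		ret = -EBUSY;
		goto out_unlock;
	}

	ioas_put_locked(ioas);
	ret = 0;
out_unlock:
	mutex_unlock(&ictx->lock);
	return ret;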

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management
  2021-09-19  6:38 ` Liu Yi L
@ 2021-09-21 13:45   ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 13:45 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:28PM +0800, Liu Yi L wrote:
> Linux now includes multiple device-passthrough frameworks (e.g. VFIO and
> vDPA) to manage secure device access from the userspace. One critical task
> of those frameworks is to put the assigned device in a secure, IOMMU-
> protected context so user-initiated DMAs are prevented from doing harm to
> the rest of the system.

Some bot will probably send this too, but it has compile warnings and
needs to be rebased to 5.15-rc1

drivers/iommu/iommufd/iommufd.c:269:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
        if (refcount_read(&ioas->refs) > 1) {
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:277:9: note: uninitialized use occurs here
        return ret;
               ^~~
drivers/iommu/iommufd/iommufd.c:269:2: note: remove the 'if' if its condition is always true
        if (refcount_read(&ioas->refs) > 1) {
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:253:17: note: initialize the variable 'ret' to silence this warning
        int ioasid, ret;
                       ^
                        = 0
drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
                if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
        return ERR_PTR(ret);
                       ^~~
drivers/iommu/iommufd/iommufd.c:727:3: note: remove the 'if' if its condition is always false
                if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
                if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                    ^~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs here
        return ERR_PTR(ret);
                       ^~~
drivers/iommu/iommufd/iommufd.c:727:7: note: remove the '||' if its condition is always false
                if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
                    ^~~~~~~~~~~~~~~~~~~
drivers/iommu/iommufd/iommufd.c:717:9: note: initialize the variable 'ret' to silence this warning
        int ret;
               ^
                = 0

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 15:41     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 15:41 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:29PM +0800, Liu Yi L wrote:
> /dev/iommu aims to provide a unified interface for managing I/O address
> spaces for devices assigned to userspace. This patch adds the initial
> framework to create a /dev/iommu node. Each open of this node returns an
> iommufd. And this fd is the handle for userspace to initiate its I/O
> address space management.
> 
> One open:
> - We call this feature as IOMMUFD in Kconfig in this RFC. However this
>   name is not clear enough to indicate its purpose to user. Back to 2010
>   vfio even introduced a /dev/uiommu [1] as the predecessor of its
>   container concept. Is that a better name? Appreciate opinions here.
> 
> [1] https://lore.kernel.org/kvm/4c0eb470.1HMjondO00NIvFM6%25pugs@cisco.com/
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/Kconfig           |   1 +
>  drivers/iommu/Makefile          |   1 +
>  drivers/iommu/iommufd/Kconfig   |  11 ++++
>  drivers/iommu/iommufd/Makefile  |   2 +
>  drivers/iommu/iommufd/iommufd.c | 112 ++++++++++++++++++++++++++++++++
>  5 files changed, 127 insertions(+)
>  create mode 100644 drivers/iommu/iommufd/Kconfig
>  create mode 100644 drivers/iommu/iommufd/Makefile
>  create mode 100644 drivers/iommu/iommufd/iommufd.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 07b7c25cbed8..a83ce0acd09d 100644
> +++ b/drivers/iommu/Kconfig
> @@ -136,6 +136,7 @@ config MSM_IOMMU
>  
>  source "drivers/iommu/amd/Kconfig"
>  source "drivers/iommu/intel/Kconfig"
> +source "drivers/iommu/iommufd/Kconfig"
>  
>  config IRQ_REMAP
>  	bool "Support for Interrupt Remapping"
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index c0fb0ba88143..719c799f23ad 100644
> +++ b/drivers/iommu/Makefile
> @@ -29,3 +29,4 @@ obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o io-pgfault.o
>  obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
> +obj-$(CONFIG_IOMMUFD) += iommufd/
> diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
> new file mode 100644
> index 000000000000..9fb7769a815d
> +++ b/drivers/iommu/iommufd/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config IOMMUFD
> +	tristate "I/O Address Space management framework for passthrough devices"
> +	select IOMMU_API
> +	default n
> +	help
> +	  provides unified I/O address space management framework for
> +	  isolating untrusted DMAs via devices which are passed through
> +	  to userspace drivers.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> new file mode 100644
> index 000000000000..54381a01d003
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_IOMMUFD) += iommufd.o
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> new file mode 100644
> index 000000000000..710b7e62988b
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -0,0 +1,112 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * I/O Address Space Management for passthrough devices
> + *
> + * Copyright (C) 2021 Intel Corporation
> + *
> + * Author: Liu Yi L <yi.l.liu@intel.com>
> + */
> +
> +#define pr_fmt(fmt)    "iommufd: " fmt
> +
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mutex.h>
> +#include <linux/iommu.h>
> +
> +/* Per iommufd */
> +struct iommufd_ctx {
> +	refcount_t refs;
> +};

A private_data of a struct file should avoid having a refcount (and
this should have been a kref anyhow)

Use the refcount on the struct file instead.

In general the lifetime models look overly convoluted to me with
refcounts being used as locks and going in all manner of directions.

- No refcount on iommufd_ctx, this should use the fget on the fd.
  The driver facing version of the API has the driver holds a fget
  inside the iommufd_device.

- Put a rwlock inside the iommufd_ioas that is a
  'destroying_lock'. The rwlock starts out unlocked.
  
  Acquire from the xarray is
   rcu_lock()
   ioas = xa_load()
   if (ioas)
      if (down_read_trylock(&ioas->destroying_lock))
           // success
  Unacquire is just up_read()

  Do down_write when the ioas is to be destroyed, do not return ebusy.

 - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc does
   not need locking (order it properly too, it is in the wrong order), and
   don't check for duplicate devices or dev_cookie duplication, that
   is user error and is harmless to the kernel.
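
Spelling the suggested acquire/release pattern out as C may help. This is a
sketch only: the down_read_trylock()/up_read() naming is read as a struct
rw_semaphore embedded in iommufd_ioas, and the helper names are invented:

	/* assumes: struct rw_semaphore destroying_lock in iommufd_ioas */
	static struct iommufd_ioas *ioas_acquire(struct iommufd_ctx *ictx,
						 int ioasid)
	{
		struct iommufd_ioas *ioas;

		rcu_read_lock();
		ioas = xa_load(&ictx->ioasid_xa, ioasid);
		if (ioas && !down_read_trylock(&ioas->destroying_lock))
			ioas = NULL;		/* being destroyed */
		rcu_read_unlock();
		return ioas;
	}

	static void ioas_release(struct iommufd_ioas *ioas)
	{
		up_read(&ioas->destroying_lock);
	}

	/* destruction path: take the write side instead of returning -EBUSY */
	down_write(&ioas->destroying_lock);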
  
> +static int iommufd_fops_release(struct inode *inode, struct file *filep)
> +{
> +	struct iommufd_ctx *ictx = filep->private_data;
> +
> +	filep->private_data = NULL;

unnecessary

> +	iommufd_ctx_put(ictx);
> +
> +	return 0;
> +}
> +
> +static long iommufd_fops_unl_ioctl(struct file *filep,
> +				   unsigned int cmd, unsigned long arg)
> +{
> +	struct iommufd_ctx *ictx = filep->private_data;
> +	long ret = -EINVAL;
> +
> +	if (!ictx)
> +		return ret;

impossible

> +
> +	switch (cmd) {
> +	default:
> +		pr_err_ratelimited("unsupported cmd %u\n", cmd);

don't log user triggerable events

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 15:57     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 15:57 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> userspace to directly open a vfio device w/o relying on container/group
> (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> iommufd (more specifically in iommu core by this RFC) in a device-centric
> manner.
> 
> In case a device is exposed in both legacy and new interfaces (see next
> patch for how to decide it), this patch also ensures that when the device
> is already opened via one interface then the other one must be blocked.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
>  include/linux/vfio.h |   2 +
>  2 files changed, 213 insertions(+), 17 deletions(-)

> +static int vfio_init_device_class(void)
> +{
> +	int ret;
> +
> +	mutex_init(&vfio.device_lock);
> +	idr_init(&vfio.device_idr);
> +
> +	/* /dev/vfio/devices/$DEVICE */
> +	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
> +	if (IS_ERR(vfio.device_class))
> +		return PTR_ERR(vfio.device_class);
> +
> +	vfio.device_class->devnode = vfio_device_devnode;
> +
> +	ret = alloc_chrdev_region(&vfio.device_devt, 0, MINORMASK + 1, "vfio-device");
> +	if (ret)
> +		goto err_alloc_chrdev;
> +
> +	cdev_init(&vfio.device_cdev, &vfio_device_fops);
> +	ret = cdev_add(&vfio.device_cdev, vfio.device_devt, MINORMASK + 1);
> +	if (ret)
> +		goto err_cdev_add;

Huh? This is not how cdevs are used. This patch needs rewriting.

The struct vfio_device should gain a 'struct device' and 'struct cdev'
as non-pointer members

vfio register path should end up doing cdev_device_add() for each
vfio_device

vfio_unregister path should do cdev_device_del()

No idr should be needed, an ida is used to allocate minor numbers

The struct device release function should trigger a kfree which
requires some reworking of the callers

vfio_init_group_dev() should do a device_initialize()
vfio_uninit_group_dev() should do a device_put()
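
For illustration, a sketch of that arrangement, assuming struct vfio_device
gains embedded 'struct device device' and 'struct cdev cdev' members, and
that vfio_devt, vfio_device_class, vfio_device_release and vfio_device_fops
are set up elsewhere at module init (all names here are assumptions, not the
RFC's code):

static DEFINE_IDA(vfio_device_ida);

void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
			 const struct vfio_device_ops *ops)
{
	device_initialize(&device->device);		/* refcounts the vfio_device */
	device->device.release = vfio_device_release;	/* kfree()s the containing struct */
	device->dev = dev;
	device->ops = ops;
}

void vfio_uninit_group_dev(struct vfio_device *device)
{
	put_device(&device->device);	/* drops the device_initialize() reference */
}

int vfio_register_device(struct vfio_device *device)
{
	int minor, ret;

	minor = ida_alloc_max(&vfio_device_ida, MINORMASK, GFP_KERNEL);
	if (minor < 0)
		return minor;

	device->device.devt = MKDEV(MAJOR(vfio_devt), minor);
	device->device.class = vfio_device_class;
	dev_set_name(&device->device, "vfio%d", minor);

	cdev_init(&device->cdev, &vfio_device_fops);
	/* One call pairs the cdev with the struct device */
	ret = cdev_device_add(&device->cdev, &device->device);
	if (ret)
		ida_free(&vfio_device_ida, minor);
	return ret;
}

void vfio_unregister_device(struct vfio_device *device)
{
	cdev_device_del(&device->cdev, &device->device);
	ida_free(&vfio_device_ida, MINOR(device->device.devt));
}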

The 'opened' atomic is awful. A newly created fd should start in a
state where it has a disabled fops.

The only thing the disabled fops can do is register the device to the
iommufd. Once successfully registered, the device fd gets the normal fops.

The registration steps should be done under a normal lock inside the
vfio_device. If a vfio_device is already registered then further
registration should fail.

Getting the device fd via the group fd triggers the same sequence as
above.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 16:01     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 16:01 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> With /dev/vfio/devices introduced, now a vfio device driver has three
> options to expose its device to userspace:
> 
> a)  only legacy group interface, for devices which haven't been moved to
>     iommufd (e.g. platform devices, sw mdev, etc.);
> 
> b)  both legacy group interface and new device-centric interface, for
>     devices which supports iommufd but also wants to keep backward
>     compatibility (e.g. pci devices in this RFC);
> 
> c)  only new device-centric interface, for new devices which don't carry
>     backward compatibility burden (e.g. hw mdev/subdev with pasid);

We shouldn't have 'b'? Where does it come from?

> This patch introduces vfio_[un]register_device() helpers for the device
> drivers to specify the device exposure policy to vfio core. Hence the
> existing vfio_[un]register_group_dev() become the wrapper of the new
> helper functions. The new device-centric interface is described as
> 'nongroup' to differentiate from existing 'group' stuff.

Detect what the driver supports based on the ops it declares. There
should be a function provided through the ops for the driver to bind
to the iommufd.

>  One open about how to organize the device nodes under /dev/vfio/devices/.
> This RFC adopts a simple policy by keeping a flat layout with mixed devname
> from all kinds of devices. The prerequisite of this model is that devnames
> from different bus types are unique formats:

This isn't reliable; the devname should just be vfio0, vfio1, etc.

Userspace can learn the correct major/minor by inspecting
sysfs.
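
For example, a userspace-side lookup of the char device numbers via sysfs;
the "/sys/class/vfio-device/vfio0/dev" path is an assumption (the class name
comes from the RFC, the vfioN naming from the suggestion above):

#include <stdio.h>

int main(void)
{
	unsigned int major, minor;
	FILE *f = fopen("/sys/class/vfio-device/vfio0/dev", "r");

	if (!f) {
		perror("open sysfs dev attribute");
		return 1;
	}
	if (fscanf(f, "%u:%u", &major, &minor) != 2) {
		fclose(f);
		fprintf(stderr, "unexpected dev attribute format\n");
		return 1;
	}
	fclose(f);
	printf("vfio0 is char device %u:%u\n", major, minor);
	return 0;
}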

This whole concept should disappear into the prior patch that adds the
struct device in the first place, and I think most of the code here
can be deleted once the struct device is used properly.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 16:19     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 16:19 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> This provides an interface for upper layers to get the per-device iommu
> attributes.
> 
>     int iommu_device_get_info(struct device *dev,
>                               enum iommu_devattr attr, void *data);

Can't we use properly typed ops and functions here instead of a void
*data?

get_snoop()
get_page_size()
get_addr_width()

?
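
For illustration, what the typed alternative could look like; these callbacks
and helpers are hypothetical, not an existing kernel API:

/* Hypothetical additions to struct iommu_ops */
struct iommu_ops {
	/* ... existing callbacks unchanged ... */
	bool	(*get_force_snoop)(struct device *dev);
	u64	(*get_pgsize_bitmap)(struct device *dev);
	u32	(*get_addr_width)(struct device *dev);
};

/* Typed helper iommufd would call instead of iommu_device_get_info() */
u64 iommu_device_pgsize_bitmap(struct device *dev)
{
	const struct iommu_ops *ops = dev->bus->iommu_ops;

	if (!ops || !ops->get_pgsize_bitmap)
		return 0;
	return ops->get_pgsize_bitmap(dev);
}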

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 16:40     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 16:40 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> This patch exposes the device-centric interface for vfio-pci devices. To
> be compatible with existing users, vfio-pci exposes both legacy group
> interface and device-centric interface.
> 
> As explained in last patch, this change doesn't apply to devices which
> cannot be forced to snoop cache by their upstream iommu. Such devices
> are still expected to be opened via the legacy group interface.
> 
> When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> the user from accessing the assigned device because the device is still
> attached to the default domain which may allow user-initiated DMAs to
> touch arbitrary place. The user access must be blocked until the device
> is later bound to an iommufd (see patch 08). The binding acts as the
> contract for putting the device in a security context which ensures user-
> initiated DMAs via this device cannot harm the rest of the system.
> 
> This patch introduces a vdev->block_access flag for this purpose. It's set
> when the device is opened via /dev/vfio/devices and cleared after binding
> to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> user access should be blocked or not.

This should not be in vfio_pci.

AFAIK there is no condition where a vfio driver can work without being
connected to some kind of iommu back end, so the core code should
handle this interlock globally. A vfio driver's ops should not be
callable until the iommu is connected.

The only vfio_pci patch in this series should be adding a new callback
op to take in an iommufd and register the pci_device as an iommufd
device.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 17:09     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:09 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> This extends iommu core to manage security context for passthrough
> devices. Please bear a long explanation for how we reach this design
> instead of managing it solely in iommufd like what vfio does today.
> 
> Devices which cannot be isolated from each other are organized into an
> iommu group. When a device is assigned to the user space, the entire
> group must be put in a security context so that user-initiated DMAs via
> the assigned device cannot harm the rest of the system. No user access
> should be granted on a device before the security context is established
> for the group which the device belongs to.

> Managing the security context must meet below criteria:
> 
> 1)  The group is viable for user-initiated DMAs. This implies that the
>     devices in the group must be either bound to a device-passthrough

s/a/the same/

>     framework, or driver-less, or bound to a driver which is known safe
>     (not do DMA).
> 
> 2)  The security context should only allow DMA to the user's memory and
>     devices in this group;
> 
> 3)  After the security context is established for the group, the group
>     viability must be continuously monitored before the user relinquishes
>     all devices belonging to the group. The viability might be broken e.g.
>     when a driver-less device is later bound to a driver which does DMA.
> 
> 4)  The security context should not be destroyed before user access
>     permission is withdrawn.
> 
> Existing vfio introduces explicit container/group semantics in its uAPI
> to meet above requirements. A single security context (iommu domain)
> is created per container. Attaching group to container moves the entire
> group into the associated security context, and vice versa. The user can
> open the device only after group attach. A group can be detached only
> after all devices in the group are closed. Group viability is monitored
> by listening to iommu group events.
> 
> Unlike vfio, iommufd adopts a device-centric design with all group
> logistics hidden behind the fd. Binding a device to iommufd serves
> as the contract to get security context established (and vice versa
> for unbinding). One additional requirement in iommufd is to manage the
> switch between multiple security contexts due to decoupled bind/attach:

This should be a precursor series that actually does clean things up
properly. There is no reason for vfio and iommufd to differ here, if
we are implementing this logic into the iommu layer then it should be
deleted from the VFIO layer, not left duplicated like this.

IIRC in VFIO the container is the IOAS and when the group goes to
create the device fd it should simply do the
iommu_device_init_user_dma() followed immediately by a call to bind
the container IOAS as your #3.

Then delete all the group viability stuff from vfio, relying on the
iommu to do it.

It should have full symmetry with the iommufd.

> @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
>  		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
>  		break;
>  	case BUS_NOTIFY_BOUND_DRIVER:
> +		/*
> +		 * FIXME: Alternatively the attached drivers could generically
> +		 * indicate to the iommu layer that they are safe for keeping
> +		 * the iommu group user viable by calling some function around
> +		 * probe(). We could eliminate this gross BUG_ON() by denying
> +		 * probe to non-iommu-safe driver.
> +		 */
> +		mutex_lock(&group->mutex);
> +		if (group->user_dma_owner_id)
> +			BUG_ON(!iommu_group_user_dma_viable(group));
> +		mutex_unlock(&group->mutex);

And the mini-series should fix this BUG_ON properly by interlocking
with the driver core to simply refuse to bind a driver under these
conditions instead of allowing userspace to crash the kernel.

That alone would be justification enough to merge this work.

> +
> +/*
> + * IOMMU core interfaces for iommufd.
> + */
> +
> +/*
> + * FIXME: We currently simply follow vfio policy to maintain the group's
> + * viability to user. Eventually, we should avoid below hard-coded list
> + * by letting drivers indicate to the iommu layer that they are safe for
> + * keeping the iommu group's user viability.
> + */
> +static const char * const iommu_driver_allowed[] = {
> +	"vfio-pci",
> +	"pci-stub"
> +};

Yuk. This should be done with some callback in those drivers,
'iommu_allow_user_dma()'.

I.e. the basic flow would see the driver core doing something like:

 ret = iommu_doing_kernel_dma()
 if (ret) do not bind
 driver_bind
  pci_stub_probe()
     iommu_allow_user_dma()

And the various functions are manipulating some atomic.
 0 = nothing happening
 1 = kernel DMA
 2 = user DMA

No BUG_ON.
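
A sketch of that interlock, assuming iommu_group gains an atomic_t dma_owner
holding the three states above; reference counting of bound devices and the
unbind/teardown paths are omitted, and the function names simply follow the
flow sketched above (this is not an existing kernel API):

enum group_dma_owner {
	DMA_OWNER_NONE = 0,	/* nothing happening */
	DMA_OWNER_KERNEL,	/* a kernel driver may issue DMA */
	DMA_OWNER_USER,		/* userspace owns the group's DMA */
};

/* Driver core, before any driver bind: refuse while userspace owns the group */
static int iommu_doing_kernel_dma(struct iommu_group *group)
{
	if (atomic_cmpxchg(&group->dma_owner, DMA_OWNER_NONE,
			   DMA_OWNER_KERNEL) == DMA_OWNER_USER)
		return -EBUSY;
	return 0;
}

/* probe() of user-DMA-safe drivers (vfio-pci, pci-stub) */
static void iommu_allow_user_dma(struct iommu_group *group)
{
	/* convert the claim the driver core just took into user ownership */
	atomic_set(&group->dma_owner, DMA_OWNER_USER);
}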

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 17:14     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:14 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:35PM +0800, Liu Yi L wrote:

> +/*
> + * An iommufd_device object represents the binding relationship
> + * between iommufd and device. It is created upon a successful
> + * binding request from the device driver. The bound device must be
> + * a physical device so far. Subdevice will be supported later
> + * (with additional PASID information). A user-assigned cookie
> + * is also recorded to mark the device in the /dev/iommu uAPI.
> + */
> +struct iommufd_device {
> +	unsigned int id;
> +	struct iommufd_ctx *ictx;
> +	struct device *dev; /* always be the physical device */
> +	u64 dev_cookie;
>  };
>  
>  static int iommufd_fops_open(struct inode *inode, struct file *filep)
> @@ -32,15 +52,58 @@ static int iommufd_fops_open(struct inode *inode, struct file *filep)
>  		return -ENOMEM;
>  
>  	refcount_set(&ictx->refs, 1);
> +	mutex_init(&ictx->lock);
> +	xa_init_flags(&ictx->device_xa, XA_FLAGS_ALLOC);
>  	filep->private_data = ictx;
>  
>  	return ret;
>  }
>  
> +static void iommufd_ctx_get(struct iommufd_ctx *ictx)
> +{
> +	refcount_inc(&ictx->refs);
> +}

See my earlier remarks about how to structure the lifetime logic, this
ref isn't necessary.

> +static const struct file_operations iommufd_fops;
> +
> +/**
> + * iommufd_ctx_fdget - Acquires a reference to the internal iommufd context.
> + * @fd: [in] iommufd file descriptor.
> + *
> + * Returns a pointer to the iommufd context, otherwise NULL;
> + *
> + */
> +static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
> +{
> +	struct fd f = fdget(fd);
> +	struct file *file = f.file;
> +	struct iommufd_ctx *ictx;
> +
> +	if (!file)
> +		return NULL;
> +
> +	if (file->f_op != &iommufd_fops)
> +		return NULL;

Leaks the fdget

> +
> +	ictx = file->private_data;
> +	if (ictx)
> +		iommufd_ctx_get(ictx);

Use success oriented flow

> +	fdput(f);
> +	return ictx;
> +}
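
For illustration, the quoted helper with the two points above addressed: the
fdget is dropped on the mismatch path and the flow reads success-first. This
keeps the RFC's iommufd_ctx_get() for brevity, even though the earlier
comments suggest dropping the context refcount entirely:

static struct iommufd_ctx *iommufd_ctx_fdget(int fd)
{
	struct fd f = fdget(fd);
	struct iommufd_ctx *ictx = NULL;

	if (!f.file)
		return NULL;

	/* Only accept an fd that was opened from /dev/iommu */
	if (f.file->f_op == &iommufd_fops) {
		ictx = f.file->private_data;
		iommufd_ctx_get(ictx);
	}
	fdput(f);		/* no longer leaked when f_op does not match */
	return ictx;
}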

> + */
> +struct iommufd_device *iommufd_bind_device(int fd, struct device *dev,
> +					   u64 dev_cookie)
> +{
> +	struct iommufd_ctx *ictx;
> +	struct iommufd_device *idev;
> +	unsigned long index;
> +	unsigned int id;
> +	int ret;
> +
> +	ictx = iommufd_ctx_fdget(fd);
> +	if (!ictx)
> +		return ERR_PTR(-EINVAL);
> +
> +	mutex_lock(&ictx->lock);
> +
> +	/* check duplicate registration */
> +	xa_for_each(&ictx->device_xa, index, idev) {
> +		if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> +			idev = ERR_PTR(-EBUSY);
> +			goto out_unlock;
> +		}

I can't think of a reason why this expensive check is needed.

> +	}
> +
> +	idev = kzalloc(sizeof(*idev), GFP_KERNEL);
> +	if (!idev) {
> +		ret = -ENOMEM;
> +		goto out_unlock;
> +	}
> +
> +	/* Establish the security context */
> +	ret = iommu_device_init_user_dma(dev, (unsigned long)ictx);
> +	if (ret)
> +		goto out_free;
> +
> +	ret = xa_alloc(&ictx->device_xa, &id, idev,
> +		       XA_LIMIT(IOMMUFD_DEVID_MIN, IOMMUFD_DEVID_MAX),
> +		       GFP_KERNEL);

idev should be fully initialized before being placed in the xarray, so
this should be the last thing done.

Why not just use the standard xa_limit_32b instead of special single
use constants?
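
For reference, the allocation with the generic limit and with publication
moved to the end of the setup, per the two comments above (the unwind label
is hypothetical):

	/* Publish the fully-initialized idev only as the final step */
	idev->ictx = ictx;
	idev->dev = dev;
	idev->dev_cookie = dev_cookie;
	ret = xa_alloc(&ictx->device_xa, &idev->id, idev, xa_limit_32b,
		       GFP_KERNEL);
	if (ret)
		goto out_user_dma;	/* undo iommu_device_init_user_dma() */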

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 17:29     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:29 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> This patch adds VFIO_DEVICE_BIND_IOMMUFD for userspace to bind the vfio
> device to an iommufd. No VFIO_DEVICE_UNBIND_IOMMUFD interface is provided
> because it's implicitly done when the device fd is closed.
> 
> In concept a vfio device can be bound to multiple iommufds, each hosting
> a subset of I/O address spaces attached by this device. However as a
> starting point (matching current vfio), only one I/O address space is
> supported per vfio device. It implies one device can only be attached
> to one iommufd at this point.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/vfio/pci/Kconfig            |  1 +
>  drivers/vfio/pci/vfio_pci.c         | 72 ++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_private.h |  8 ++++
>  include/uapi/linux/vfio.h           | 30 ++++++++++++
>  4 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 5e2e1b9a9fd3..3abfb098b4dc 100644
> +++ b/drivers/vfio/pci/Kconfig
> @@ -5,6 +5,7 @@ config VFIO_PCI
>  	depends on MMU
>  	select VFIO_VIRQFD
>  	select IRQ_BYPASS_MANAGER
> +	select IOMMUFD
>  	help
>  	  Support for the PCI VFIO bus driver.  This is required to make
>  	  use of PCI drivers using the VFIO framework.
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 145addde983b..20006bb66430 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -552,6 +552,16 @@ static void vfio_pci_release(struct vfio_device *core_vdev)
>  			vdev->req_trigger = NULL;
>  		}
>  		mutex_unlock(&vdev->igate);
> +
> +		mutex_lock(&vdev->videv_lock);
> +		if (vdev->videv) {
> +			struct vfio_iommufd_device *videv = vdev->videv;
> +
> +			vdev->videv = NULL;
> +			iommufd_unbind_device(videv->idev);
> +			kfree(videv);
> +		}
> +		mutex_unlock(&vdev->videv_lock);
>  	}
>  
>  	mutex_unlock(&vdev->reflck->lock);
> @@ -780,7 +790,66 @@ static long vfio_pci_ioctl(struct vfio_device *core_vdev,
>  		container_of(core_vdev, struct vfio_pci_device, vdev);
>  	unsigned long minsz;
>  
> -	if (cmd == VFIO_DEVICE_GET_INFO) {
> +	if (cmd == VFIO_DEVICE_BIND_IOMMUFD) {

Choosing to implement this through the ioctl multiplexer is what is
causing so much ugly gyration in the previous patches.

This should be a straightforward new function and ops:

static struct iommufd_device *
vfio_pci_bind_iommufd(struct vfio_device *core_vdev, int iommu_fd, u64 dev_cookie)
{
	struct vfio_pci_device *vdev =
		container_of(core_vdev, struct vfio_pci_device, vdev);
	struct iommufd_device *iommu_dev;

	iommu_dev = iommufd_bind_device(iommu_fd, &vdev->pdev->dev, dev_cookie);
	if (IS_ERR(iommu_dev))
		return iommu_dev;
	vdev->iommu_dev = iommu_dev;
	return iommu_dev;
}

static const struct vfio_device_ops vfio_pci_ops = {
	/* ... */
	.bind_iommufd = vfio_pci_bind_iommufd,
};

If you do the other stuff I said then you'll notice that
iommufd_bind_device() will provide automatic exclusivity.

The thread that sees ops->bind_iommufd succeed will know it is the only
thread that can see that (by definition, enabling user DMA in the iommu
layer has to be exclusive and race free), so it can go ahead and store
the iommu pointer.

The other half of the problem, '&vdev->block_access', is solved by
manipulating the filp->f_op. Start with a fops that can ONLY call the
above op. When the above op succeeds, switch the fops to the normal
full ops (see the sketch below).

The same flow happens when the group fd spawns the device fd; just
parts of iommufd_bind_device are open coded into the vfio code, but the
whole flow and sequence should be the same.
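
For illustration, a sketch of that f_op switch, assuming a dedicated
"disabled" fops whose only ioctl is the bind; vfio_device_bind_iommufd(),
vfio_device_full_fops and vfio_device_fops_release are illustrative names.
replace_fops() is the existing helper for swapping a file's operations, and
since the bind path itself is exclusive only one thread can reach it:

static long vfio_device_disabled_ioctl(struct file *filep, unsigned int cmd,
				       unsigned long arg)
{
	struct vfio_device *device = filep->private_data;
	long ret;

	if (cmd != VFIO_DEVICE_BIND_IOMMUFD)
		return -EINVAL;

	ret = vfio_device_bind_iommufd(device, arg);	/* calls ops->bind_iommufd */
	if (ret)
		return ret;

	/* Bind succeeded: give the fd the full operations from now on */
	replace_fops(filep, &vfio_device_full_fops);
	return 0;
}

static const struct file_operations vfio_device_disabled_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= vfio_device_disabled_ioctl,
	.release	= vfio_device_fops_release,
};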

> +		/*
> +		 * Reject the request if the device is already opened and
> +		 * attached to a container.
> +		 */
> +		if (vfio_device_in_container(core_vdev))
> +			return -ENOTTY;

This is wrongly locked

> +
> +		minsz = offsetofend(struct vfio_device_iommu_bind_data, dev_cookie);
> +
> +		if (copy_from_user(&bind_data, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (bind_data.argsz < minsz ||
> +		    bind_data.flags || bind_data.iommu_fd < 0)
> +			return -EINVAL;
> +
> +		mutex_lock(&vdev->videv_lock);
> +		/*
> +		 * Allow only one iommufd per device until multiple
> +		 * address spaces (e.g. vSVA) support is introduced
> +		 * in the future.
> +		 */
> +		if (vdev->videv) {
> +			mutex_unlock(&vdev->videv_lock);
> +			return -EBUSY;
> +		}
> +
> +		idev = iommufd_bind_device(bind_data.iommu_fd,
> +					   &vdev->pdev->dev,
> +					   bind_data.dev_cookie);
> +		if (IS_ERR(idev)) {
> +			mutex_unlock(&vdev->videv_lock);
> +			return PTR_ERR(idev);
> +		}
> +
> +		videv = kzalloc(sizeof(*videv), GFP_KERNEL);
> +		if (!videv) {
> +			iommufd_unbind_device(idev);
> +			mutex_unlock(&vdev->videv_lock);
> +			return -ENOMEM;
> +		}
> +		videv->idev = idev;
> +		videv->iommu_fd = bind_data.iommu_fd;

No need for more memory; a struct vfio_device can be attached to a
single iommu context. If idev is set then the context and all the other
information are valid.

> +		if (atomic_read(&vdev->block_access))
> +			atomic_set(&vdev->block_access, 0);

I'm sure I'll tell you this is all wrongly locked too if I look
closely.

> +/*
> + * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19,
> + *				struct vfio_device_iommu_bind_data)
> + *
> + * Bind a vfio_device to the specified iommufd
> + *
> + * The user should provide a device cookie when calling this ioctl. The
> + * cookie is later used in iommufd for capability query, iotlb invalidation
> + * and I/O fault handling.
> + *
> + * User is not allowed to access the device before the binding operation
> + * is completed.
> + *
> + * Unbind is automatically conducted when device fd is closed.
> + *
> + * Input parameters:
> + *	- iommu_fd;
> + *	- dev_cookie;
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_device_iommu_bind_data {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	iommu_fd;
> +	__u64	dev_cookie;

Missing explicit padding

Always use __aligned_u64 in uapi headers, fix all the patches.
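
For example, the quoted struct with explicit padding and the 64-bit field
aligned (a sketch of the requested layout, not the final uAPI):

struct vfio_device_iommu_bind_data {
	__u32		argsz;
	__u32		flags;
	__s32		iommu_fd;
	__u32		__reserved;	/* explicit padding */
	__aligned_u64	dev_cookie;
};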

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 17:40     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:40 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> After a device is bound to the iommufd, userspace can use this interface
> to query the underlying iommu capability and format info for this device.
> Based on this information the user then creates I/O address space in a
> compatible format with the to-be-attached devices.
> 
> Device cookie which is registered at binding time is used to mark the
> device which is being queried here.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
>  include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
>  2 files changed, 117 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e16ca21e4534..641f199f2d41 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
>  	return 0;
>  }
>  
> +static struct device *
> +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> +{

We have an xarray ID for the device, why are we allowing userspace to
use the dev_cookie as input?

Userspace should always pass in the ID. The only place dev_cookie
should appear is if the kernel generates an event back to
userspace. Then the kernel should return both the ID and the
dev_cookie in the event to allow userspace to correlate it.

> +static void iommu_device_build_info(struct device *dev,
> +				    struct iommu_device_info *info)
> +{
> +	bool snoop;
> +	u64 awidth, pgsizes;
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop))
> +		info->flags |= snoop ? IOMMU_DEVICE_INFO_ENFORCE_SNOOP : 0;
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_PAGE_SIZE, &pgsizes)) {
> +		info->pgsize_bitmap = pgsizes;
> +		info->flags |= IOMMU_DEVICE_INFO_PGSIZES;
> +	}
> +
> +	if (!iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &awidth)) {
> +		info->addr_width = awidth;
> +		info->flags |= IOMMU_DEVICE_INFO_ADDR_WIDTH;
> +	}

Another good option is to push the iommu_device_info uAPI struct down
through to the iommu driver to fill it in and forget about the crazy
enum.

A big part of thinking of this iommu interface is a way to bind the HW
IOMMU driver to a uAPI and allow the HW driver to expose its unique
functionalities.
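
e.g. an op along these lines (a sketch; the callback name is made up):

struct iommu_ops {
	/* ... existing ops ... */

	/*
	 * Let the HW IOMMU driver fill in the uAPI struct it
	 * understands; the core only zeroes the struct, calls the
	 * driver and copies the result back to userspace.
	 */
	int (*device_info)(struct device *dev,
			   struct iommu_device_info *info);
};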

> +static int iommufd_get_device_info(struct iommufd_ctx *ictx,
> +				   unsigned long arg)
> +{
> +	struct iommu_device_info info;
> +	unsigned long minsz;
> +	struct device *dev;
> +
> +	minsz = offsetofend(struct iommu_device_info, addr_width);
> +
> +	if (copy_from_user(&info, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (info.argsz < minsz)
> +		return -EINVAL;

All of these patterns everywhere are wrongly coded for forward/back
compatibility.

static int iommufd_get_device_info(struct iommufd_ctx *ictx,
				   struct iommu_device_info __user *arg,
				   size_t usize)
{
	struct iommu_device_info info;
	int ret;

	if (usize < offsetofend(struct iommu_device_info, addr_width))
		return -EINVAL;

	ret = copy_struct_from_user(&info, sizeof(info), arg, usize);
	if (ret)
		return ret;

'usize' should be in a 'common' header extracted by the main ioctl handler.

> +struct iommu_device_info {
> +	__u32	argsz;
> +	__u32	flags;
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
> +#define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
> +#define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_width field valid */
> +	__u64	dev_cookie;
> +	__u64   pgsize_bitmap;
> +	__u32	addr_width;
> +};

Be explicit with padding here too.
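
i.e. (a sketch only):

struct iommu_device_info {
	__u32		argsz;
	__u32		flags;
	__aligned_u64	dev_cookie;
	__aligned_u64	pgsize_bitmap;
	__u32		addr_width;
	__u32		__reserved;	/* explicit tail padding, must be zero */
};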

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 17:44     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:44 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enforce_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.
>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.

I think the request was to include a start/end IO address hint when
creating the ioas. When the kernel creates it, it can then return the
actual geometry including any holes via a query.
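
Roughly (a sketch; the struct and field names are illustrative):

struct iommu_ioasid_alloc {
	__u32		argsz;
	__u32		flags;
	/* hint only; the kernel may shrink or relocate the range */
	__aligned_u64	iova_hint_start;
	__aligned_u64	iova_hint_end;
};

/*
 * A follow-up query (e.g. an IOMMU_IOASID_GET_INFO ioctl) would then
 * return the actual usable IOVA ranges, with any reserved-region holes
 * carved out.
 */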

> - Currently ioasid term has already been used in the kernel (drivers/iommu/
>   ioasid.c) to represent the hardware I/O address space ID in the wire. It
>   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
>   ID). We need to find a way to resolve the naming conflict between the hardware
>   ID and software handle. One option is to rename the existing ioasid to be
>   pasid or ssid, given their full names still sound generic. Appreciate more
>   thoughts on this open!

ioas works well here I think. Use ioas_id to refer to the xarray
index.

> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/iommufd/iommufd.c | 120 ++++++++++++++++++++++++++++++++
>  include/linux/iommufd.h         |   3 +
>  include/uapi/linux/iommu.h      |  54 ++++++++++++++
>  3 files changed, 177 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 641f199f2d41..4839f128b24a 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -24,6 +24,7 @@
>  struct iommufd_ctx {
>  	refcount_t refs;
>  	struct mutex lock;
> +	struct xarray ioasid_xa; /* xarray of ioasids */
>  	struct xarray device_xa; /* xarray of bound devices */
>  };
>  
> @@ -42,6 +43,16 @@ struct iommufd_device {
>  	u64 dev_cookie;
>  };
>  
> +/* Represent an I/O address space */
> +struct iommufd_ioas {
> +	int ioasid;

xarray id's should consistently be u32s everywhere.

Many of the same prior comments repeated here

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 17:47     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 17:47 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:40PM +0800, Liu Yi L wrote:
> As aforementioned, userspace should check extension for what formats
> can be specified when allocating an IOASID. This patch adds such an
> interface for userspace. In this RFC, iommufd reports EXT_MAP_TYPE1V2
> support but does not report no-snoop support yet.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/iommufd/iommufd.c |  7 +++++++
>  include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index 4839f128b24a..e45d76359e34 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
>  		return ret;
>  
>  	switch (cmd) {
> +	case IOMMU_CHECK_EXTENSION:
> +		switch (arg) {
> +		case EXT_MAP_TYPE1V2:
> +			return 1;
> +		default:
> +			return 0;
> +		}
>  	case IOMMU_DEVICE_GET_INFO:
>  		ret = iommufd_get_device_info(ictx, arg);
>  		break;
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 5cbd300eb0ee..49731be71213 100644
> +++ b/include/uapi/linux/iommu.h
> @@ -14,6 +14,33 @@
>  #define IOMMU_TYPE	(';')
>  #define IOMMU_BASE	100
>  
> +/*
> + * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
> + *
> + * Check whether an uAPI extension is supported.
> + *
> + * It's unlikely that all planned capabilities in IOMMU fd will be ready
> + * in one breath. User should check which uAPI extension is supported
> + * according to its intended usage.
> + *
> + * A rough list of possible extensions may include:
> + *
> + *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> + *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
> + *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
> + *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
> + *	- EXT_IOASID_NESTING for what the name stands;
> + *	- EXT_USER_PAGE_TABLE for user managed page table;
> + *	- EXT_USER_PASID_TABLE for user managed PASID table;
> + *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> + *	- ...
> + *
> + * Return: 0 if not supported, 1 if supported.
> + */
> +#define EXT_MAP_TYPE1V2		1
> +#define EXT_DMA_NO_SNOOP	2
> +#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE, IOMMU_BASE + 0)

I generally advocate for a 'try and fail' approach to discovering
compatibility.

If that doesn't work for the userspace then a query to return a
generic capability flag is the next best idea. Each flag should
clearly define what 'try and fail' it is talking about

Eg dma_no_snoop is about creating an IOS with flag NO SNOOP set

TYPE1V2 seems like nonsense

Not sure about the others.

IOW, this should be recast as a generic 'query capabilities' IOCTL
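
i.e. something like (a sketch; the flag names are made up):

struct iommu_capability_info {
	__u32		argsz;
	__u32		flags;
	/* each bit maps to a specific create/attach attempt that may fail */
	__aligned_u64	capabilities;
#define IOMMU_CAP_DMA_NO_SNOOP		(1ULL << 0) /* IOAS can be created w/o enforced snoop */
#define IOMMU_CAP_DIRTY_TRACKING	(1ULL << 1)
};
#define IOMMU_GET_CAPABILITIES	_IO(IOMMU_TYPE, IOMMU_BASE + 0)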

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 18:02     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 18:02 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:42PM +0800, Liu Yi L wrote:
> An I/O address space takes effect in the iommu only after it's attached
> by a device. This patch provides iommufd_device_[de/at]tach_ioasid()
> helpers for this purpose. One device can only be attached to one ioasid
> at this point, but one ioasid can be attached by multiple devices.
> 
> The caller specifies the iommufd_device (returned at binding time) and
> the target ioasid when calling the helper function. Upon request, iommufd
> installs the specified I/O page table to the correct place in the IOMMU,
> according to the routing information (struct device* which represents
> RID) recorded in iommufd_device. Future variants could allow the caller
> to specify additional routing information (e.g. pasid/ssid) when multiple
> I/O address spaces are supported per device.
> 
> Open:
> Per Jason's comment in below link, bus-specific wrappers are recommended.
> This RFC implements one wrapper for pci devices. But it looks like struct
> pci_device is not used at all since iommufd_device already carries all
> necessary info. So we want to have another discussion on its necessity, e.g.
> whether it makes more sense to have bus-specific wrappers for binding, while
> leaving a common attach helper per iommufd_device.
> https://lore.kernel.org/linux-iommu/20210528233649.GB3816344@nvidia.com/
> 
> TODO:
> When multiple devices are attached to a same ioasid, the permitted iova
> ranges and supported pgsize bitmap on this ioasid should be a common
> subset of all attached devices. iommufd needs to track such info per
> ioasid and update it every time when a new device is attached to the
> ioasid. This has not been done in this version yet, due to the temporary
> hack adopted in patch 16-18. The hack reuses vfio type1 driver which
> already includes the necessary logic for iova ranges and pgsize bitmap.
> Once we get a clear direction for those patches, that logic will be moved
> to this patch.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>  drivers/iommu/iommufd/iommufd.c | 226 ++++++++++++++++++++++++++++++++
>  include/linux/iommufd.h         |  29 ++++
>  2 files changed, 255 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e45d76359e34..25373a0e037a 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -51,6 +51,19 @@ struct iommufd_ioas {
>  	bool enforce_snoop;
>  	struct iommufd_ctx *ictx;
>  	refcount_t refs;
> +	struct mutex lock;
> +	struct list_head device_list;
> +	struct iommu_domain *domain;

This should just be another xarray indexed by the device id

> +/* Caller should hold ioas->lock */
> +static struct ioas_device_info *ioas_find_device(struct iommufd_ioas *ioas,
> +						 struct iommufd_device *idev)
> +{
> +	struct ioas_device_info *ioas_dev;
> +
> +	list_for_each_entry(ioas_dev, &ioas->device_list, next) {
> +		if (ioas_dev->idev == idev)
> +			return ioas_dev;
> +	}

Which eliminates this search. xarray with tightly packed indexes is
generally more efficient than linked lists..
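
For instance (a sketch; idev->id and ioas->attached_devices are assumed
names, following the xarray-ID suggestion above):

static int ioas_attach_dev(struct iommufd_ioas *ioas,
			   struct iommufd_device *idev,
			   struct ioas_device_info *ioas_dev)
{
	void *old;

	/* keyed by the tightly packed device id; duplicate detection and
	 * insertion collapse into one operation, no list walk needed */
	old = xa_cmpxchg(&ioas->attached_devices, idev->id, NULL, ioas_dev,
			 GFP_KERNEL);
	if (xa_is_err(old))
		return xa_err(old);
	return old ? -EEXIST : 0;
}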

> +static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
> +					   struct device *dev)
> +{
> +	bool snoop = false;
> +	u32 addr_width;
> +	int ret;
> +
> +	/*
> +	 * currently we only support I/O page table with iommu enforce-snoop
> +	 * format. Attaching a device which doesn't support this format in its
> +	 * upstream iommu is rejected.
> +	 */
> +	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
> +	if (ret || !snoop)
> +		return -EINVAL;
> +
> +	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width);
> +	if (ret || addr_width < ioas->addr_width)
> +		return -EINVAL;
> +
> +	/* TODO: also need to check permitted iova ranges and pgsize bitmap */
> +
> +	return 0;
> +}

This seems kind of weird..

I expect the iommufd to hold a SW copy of the IO page table and each
time a new domain is to be created it should push the SW copy into the
domain. If the domain cannot support it then the domain driver should
naturally fail a request.

When the user changes the IO page table the SW copy is updated then
all of the domains are updated too - again if any domain cannot
support the change then it fails and the change is rolled back.

It seems like this is a side effect of roughly hacking in the vfio
code?
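
Conceptually, something like (a sketch; 'struct ioas_mapping' and the
ioas->mappings list are invented here for illustration):

/* Populate a newly created domain from the ioas' SW copy of the I/O
 * page table; if the domain can't hold a mapping the attach fails. */
static int ioas_populate_domain(struct iommufd_ioas *ioas,
				struct iommu_domain *domain)
{
	struct ioas_mapping *map;
	int rc;

	list_for_each_entry(map, &ioas->mappings, next) {
		rc = iommu_map(domain, map->iova, map->paddr, map->size,
			       map->prot);
		if (rc)
			return rc;	/* caller unwinds and rejects the attach */
	}
	return 0;
}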

> +
> +/**
> + * iommufd_device_attach_ioasid - attach device to an ioasid
> + * @idev: [in] Pointer to struct iommufd_device.
> + * @ioasid: [in] ioasid points to an I/O address space.
> + *
> + * Returns 0 for successful attach, otherwise returns error.
> + *
> + */
> +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)

Types for the ioas_id again..

> +{
> +	struct iommufd_ioas *ioas;
> +	struct ioas_device_info *ioas_dev;
> +	struct iommu_domain *domain;
> +	int ret;
> +
> +	ioas = ioasid_get_ioas(idev->ictx, ioasid);
> +	if (!ioas) {
> +		pr_err_ratelimited("Trying to attach illegal or unknown IOASID %u\n", ioasid);
> +		return -EINVAL;

No prints triggered by bad userspace

> +	}
> +
> +	mutex_lock(&ioas->lock);
> +
> +	/* Check for duplicates */
> +	if (ioas_find_device(ioas, idev)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}

just xa_cmpxchg NULL, XA_ZERO_ENTRY

> +	/*
> +	 * Each ioas is backed by an iommu domain, which is allocated
> +	 * when the ioas is attached for the first time and then shared
> +	 * by following devices.
> +	 */
> +	if (list_empty(&ioas->device_list)) {

Seems strange, what if the devices are forced to have different
domains? We don't want to model that in the SW layer..

> +	/* Install the I/O page table to the iommu for this device */
> +	ret = iommu_attach_device(domain, idev->dev);
> +	if (ret)
> +		goto out_domain;

This is where things start to get confusing when you talk about PASID
as the above call needs to be some PASID centric API.

> @@ -27,6 +28,16 @@ struct iommufd_device *
>  iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
>  void iommufd_unbind_device(struct iommufd_device *idev);
>  
> +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid);
> +void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid);
> +
> +static inline int
> +__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
> +				   struct iommufd_device *idev, int ioasid)
> +{
> +	return iommufd_device_attach_ioasid(idev, ioasid);
> +}

If this is taking in the iommufd_device then there isn't a logical
place to signal the PCIness.

But, I think the API should at least have a kdoc that this is
capturing the entire device and specify that for PCI this means all
TLPs with the RID.
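
Something like (the wording is only a suggestion):

/**
 * iommufd_device_attach_ioasid - attach a physical device to an IOAS
 * @idev: device handle returned by iommufd_bind_device()
 * @ioas_id: ID of the target I/O address space
 *
 * The attachment captures the entire device. For PCI this means every
 * DMA TLP tagged with the device's RID is translated by this IOAS; a
 * future PASID-aware variant would take additional routing information.
 */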

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 18:04     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 18:04 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> This patch adds interface for userspace to attach device to specified
> IOASID.
> 
> Note:
> One device can only be attached to one IOASID in this version. This is
> on par with what vfio provides today. In the future this restriction can
> be relaxed when multiple I/O address spaces are supported per device

?? In VFIO the container is the IOS and the container can be shared
with multiple devices. This needs to start at about the same
functionality.

> +	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {

This should be in the core code, right? There is nothing PCI specific
here.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 18:14     ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 18:14 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, kevin.tian,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:44PM +0800, Liu Yi L wrote:
> [HACK. will fix in v2]
> 
> There are two options to implement vfio type1v2 mapping semantics in
> /dev/iommu.
> 
> One is to duplicate the related code from vfio as the starting point,
> and then merge with vfio type1 at a later time. However vfio_iommu_type1.c
> has over 3000LOC with ~80% related to dma management logic, including:

I can't really see a way forward like this. I think some scheme to
move the vfio datastructure is going to be necessary.

> - the dma map/unmap metadata management
> - page pinning, and related accounting
> - iova range reporting
> - dirty bitmap retrieving
> - dynamic vaddr update, etc.

All of this needs to be part of the iommufd anyhow..

> The alternative is to consolidate type1v2 logic in /dev/iommu immediately,
> which requires converting vfio_iommu_type1 to be a shim driver. 

Another choice is that the data structure could move and the two
drivers could share its code and continue to exist more independently

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-21 19:56     ` Alex Williamson
  -1 siblings, 0 replies; 532+ messages in thread
From: Alex Williamson @ 2021-09-21 19:56 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jgg, hch, jasowang, joro, jean-philippe, kevin.tian, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, ashok.raj, yi.l.liu,
	jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

On Sun, 19 Sep 2021 14:38:30 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> userspace to directly open a vfio device w/o relying on container/group
> (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> iommufd (more specifically in iommu core by this RFC) in a device-centric
> manner.
> 
> In case a device is exposed in both legacy and new interfaces (see next
> patch for how to decide it), this patch also ensures that when the device
> is already opened via one interface then the other one must be blocked.
> 
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
>  include/linux/vfio.h |   2 +
>  2 files changed, 213 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 02cc51ce6891..84436d7abedd 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
...
> @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
>  	.mode = S_IRUGO | S_IWUGO,
>  };
>  
> +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
> +}

dev_name() doesn't provide us with any uniqueness guarantees, so this
could potentially generate naming conflicts.  The similar scheme for
devices within an iommu group appends an instance number if a conflict
occurs, but that solution doesn't work here where the name isn't just a
link to the actual device.  Devices within an iommu group are also
likely associated within a bus_type, so the potential for conflict is
pretty negligible, but that's not the case as vfio is adopted for new
device types.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-21 16:40     ` Jason Gunthorpe via iommu
@ 2021-09-21 21:09       ` Alex Williamson
  -1 siblings, 0 replies; 532+ messages in thread
From: Alex Williamson @ 2021-09-21 21:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Tue, 21 Sep 2021 13:40:01 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > This patch exposes the device-centric interface for vfio-pci devices. To
> > be compatiable with existing users, vfio-pci exposes both legacy group
> > interface and device-centric interface.
> > 
> > As explained in last patch, this change doesn't apply to devices which
> > cannot be forced to snoop cache by their upstream iommu. Such devices
> > are still expected to be opened via the legacy group interface.

This doesn't make much sense to me.  The previous patch indicates
there's work to be done in updating the kvm-vfio contract to understand
DMA coherency, so you're trying to limit use cases to those where the
IOMMU enforces coherency, but there's QEMU work to be done to support
the iommufd uAPI at all.  Isn't part of that work to understand how KVM
will be told about non-coherent devices rather than "meh, skip it in the
kernel"?  Also let's not forget that vfio is not only for KVM.
 
> > When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> > the user from accessing the assigned device because the device is still
> > attached to the default domain which may allow user-initiated DMAs to
> > touch arbitrary place. The user access must be blocked until the device
> > is later bound to an iommufd (see patch 08). The binding acts as the
> > contract for putting the device in a security context which ensures user-
> > initiated DMAs via this device cannot harm the rest of the system.
> > 
> > This patch introduces a vdev->block_access flag for this purpose. It's set
> > when the device is opened via /dev/vfio/devices and cleared after binding
> > to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> > user access should be blocked or not.  
> 
> This should not be in vfio_pci.
> 
> AFAIK there is no condition where a vfio driver can work without being
> connected to some kind of iommu back end, so the core code should
> handle this interlock globally. A vfio driver's ops should not be
> callable until the iommu is connected.
> 
> The only vfio_pci patch in this series should be adding a new callback
> op to take in an iommufd and register the pci_device as a iommufd
> device.

Couldn't the same argument be made that registering a $bus device as an
iommufd device is a common interface that shouldn't be the
responsibility of the vfio device driver?  Is userspace opening the
non-group device anything more than a reservation of that device if
access is withheld until iommu isolation is in place?  I also don't really want to
predict how ioctls might evolve to guess whether only blocking .read,
.write, and .mmap callbacks are sufficient.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-21 21:09       ` Alex Williamson
@ 2021-09-21 21:58         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-21 21:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Tue, Sep 21, 2021 at 03:09:29PM -0600, Alex Williamson wrote:

> the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> will be told about non-coherent devices rather than "meh, skip it in the
> kernel"?  Also let's not forget that vfio is not only for KVM.

vfio is not only for KVM, but AFAICT the wbinvd stuff is only for
KVM... But yes, I agree this should be sorted out at this stage

> > > When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> > > the user from accessing the assigned device because the device is still
> > > attached to the default domain which may allow user-initiated DMAs to
> > > touch arbitrary place. The user access must be blocked until the device
> > > is later bound to an iommufd (see patch 08). The binding acts as the
> > > contract for putting the device in a security context which ensures user-
> > > initiated DMAs via this device cannot harm the rest of the system.
> > > 
> > > This patch introduces a vdev->block_access flag for this purpose. It's set
> > > when the device is opened via /dev/vfio/devices and cleared after binding
> > > to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> > > user access should be blocked or not.  
> > 
> > This should not be in vfio_pci.
> > 
> > AFAIK there is no condition where a vfio driver can work without being
> > connected to some kind of iommu back end, so the core code should
> > handle this interlock globally. A vfio driver's ops should not be
> > callable until the iommu is connected.
> > 
> > The only vfio_pci patch in this series should be adding a new callback
> > op to take in an iommufd and register the pci_device as a iommufd
> > device.
> 
> Couldn't the same argument be made that registering a $bus device as an
> iommufd device is a common interface that shouldn't be the
> responsibility of the vfio device driver? 

The driver needs enough involvement to signal what kind of IOMMU
connection it wants, e.g. attaching to a physical device will use the
iofd_attach_device() path, but attaching to a SW page table should use
a different API call. PASID should also be different.

Possibly a good arrangement is to have the core provide some generic
ioctl ops functions 'vfio_all_device_iommufd_bind' that everything
except mdev drivers can use so the code isn't all duplicated.
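
e.g. (a rough sketch; the vdev->dev and vdev->idev fields, and an ERR_PTR
return from iommufd_bind_device(), are assumptions):

/* provided by the vfio core; usable by any driver exposing the whole
 * physical device, i.e. everything except mdev-style drivers */
static long vfio_all_device_iommufd_bind(struct vfio_device *vdev,
					 struct vfio_device_iommu_bind_data *bind)
{
	struct iommufd_device *idev;

	idev = iommufd_bind_device(bind->iommu_fd, vdev->dev,
				   bind->dev_cookie);
	if (IS_ERR(idev))
		return PTR_ERR(idev);
	vdev->idev = idev;
	return 0;
}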

> non-group device anything more than a reservation of that device if
> access is withheld until iommu isolation?  I also don't really want to
> predict how ioctls might evolve to guess whether only blocking .read,
> .write, and .mmap callbacks are sufficient.  Thanks,

This is why I said the entire fops should be blocked by a dummy fops,
so the core code keeps the vfio_device FD parked and userspace is
unable to access the ops until device attachment, and thus IOMMU
isolation, is completed.

Simple and easy to reason about, a parked FD is very similar to a
closed FD.
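
i.e. something like (a sketch; the helper names are illustrative):

static long vfio_device_parked_ioctl(struct file *filep, unsigned int cmd,
				     unsigned long arg)
{
	/* Only the bind ioctl is serviced while parked; everything else
	 * behaves like a closed FD. After a successful bind + IOMMU
	 * attach the core switches the FD over to the real device ops. */
	if (cmd == VFIO_DEVICE_BIND_IOMMUFD)
		return vfio_device_bind_iommufd(filep, arg);
	return -EINVAL;
}

static const struct file_operations vfio_device_parked_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= vfio_device_parked_ioctl,
	.release	= vfio_device_fops_release,
};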

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-21 16:01     ` Jason Gunthorpe via iommu
@ 2021-09-21 23:10       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-21 23:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 12:01 AM
> 
> On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > With /dev/vfio/devices introduced, now a vfio device driver has three
> > options to expose its device to userspace:
> >
> > a)  only legacy group interface, for devices which haven't been moved to
> >     iommufd (e.g. platform devices, sw mdev, etc.);
> >
> > b)  both legacy group interface and new device-centric interface, for
> >     devices which supports iommufd but also wants to keep backward
> >     compatibility (e.g. pci devices in this RFC);
> >
> > c)  only new device-centric interface, for new devices which don't carry
> >     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> 
> We shouldn't have 'b'? Where does it come from?

A vfio-pci device can be opened via the existing group interface. Without b),
legacy vfio userspace can never use a vfio-pci device any more
once the latter is moved to iommufd.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-21 15:57     ` Jason Gunthorpe via iommu
@ 2021-09-21 23:56       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-21 23:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 21, 2021 11:57 PM
> 
> On Sun, Sep 19, 2021 at 02:38:30PM +0800, Liu Yi L wrote:
> > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > userspace to directly open a vfio device w/o relying on container/group
> > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > manner.
> >
> > In case a device is exposed in both legacy and new interfaces (see next
> > patch for how to decide it), this patch also ensures that when the device
> > is already opened via one interface then the other one must be blocked.
> >
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
> >  include/linux/vfio.h |   2 +
> >  2 files changed, 213 insertions(+), 17 deletions(-)
> 
> > +static int vfio_init_device_class(void)
> > +{
> > +	int ret;
> > +
> > +	mutex_init(&vfio.device_lock);
> > +	idr_init(&vfio.device_idr);
> > +
> > +	/* /dev/vfio/devices/$DEVICE */
> > +	vfio.device_class = class_create(THIS_MODULE, "vfio-device");
> > +	if (IS_ERR(vfio.device_class))
> > +		return PTR_ERR(vfio.device_class);
> > +
> > +	vfio.device_class->devnode = vfio_device_devnode;
> > +
> > +	ret = alloc_chrdev_region(&vfio.device_devt, 0, MINORMASK + 1,
> "vfio-device");
> > +	if (ret)
> > +		goto err_alloc_chrdev;
> > +
> > +	cdev_init(&vfio.device_cdev, &vfio_device_fops);
> > +	ret = cdev_add(&vfio.device_cdev, vfio.device_devt, MINORMASK +
> 1);
> > +	if (ret)
> > +		goto err_cdev_add;
> 
> Huh? This is not how cdevs are used. This patch needs rewriting.
> 
> The struct vfio_device should gain a 'struct device' and 'struct cdev'
> as non-pointer members
> 
> vfio register path should end up doing cdev_device_add() for each
> vfio_device
> 
> vfio_unregister path should do cdev_device_del()
> 
> No idr should be needed, an ida is used to allocate minor numbers
> 
> The struct device release function should trigger a kfree which
> requires some reworking of the callers
> 
> vfio_init_group_dev() should do a device_initialize()
> vfio_uninit_group_dev() should do a device_put()

All above are good suggestions!
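
A minimal sketch of that registration pattern, assuming struct vfio_device
grows embedded 'struct device device' and 'struct cdev cdev' members as
suggested, and reusing the vfio_device_fops from this patch; helper names
are illustrative and error paths are trimmed:

	#include <linux/cdev.h>
	#include <linux/device.h>
	#include <linux/idr.h>
	#include <linux/kdev_t.h>
	#include <linux/vfio.h>

	static DEFINE_IDA(vfio_device_ida);	/* minor number allocator */
	static dev_t vfio_device_devt;	/* from alloc_chrdev_region() at init */

	int vfio_device_cdev_register(struct vfio_device *vdev)
	{
		int minor, ret;

		minor = ida_alloc_max(&vfio_device_ida, MINORMASK, GFP_KERNEL);
		if (minor < 0)
			return minor;

		/* vdev->device was device_initialize()d in
		 * vfio_init_group_dev(); its release kfree()s the vfio_device. */
		vdev->device.devt = MKDEV(MAJOR(vfio_device_devt), minor);
		ret = dev_set_name(&vdev->device, "vfio%d", minor);
		if (ret)
			goto err_ida;

		cdev_init(&vdev->cdev, &vfio_device_fops);
		ret = cdev_device_add(&vdev->cdev, &vdev->device);
		if (ret)
			goto err_ida;
		return 0;

	err_ida:
		ida_free(&vfio_device_ida, minor);
		return ret;
	}

	void vfio_device_cdev_unregister(struct vfio_device *vdev)
	{
		cdev_device_del(&vdev->cdev, &vdev->device);
		ida_free(&vfio_device_ida, MINOR(vdev->device.devt));
		/* the final device_put() drops the last reference */
	}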

> 
> The opened atomic is awful. A newly created fd should start in a
> state where it has a disabled fops
> 
> The only thing the disabled fops can do is register the device to the
> iommu fd. When successfully registered the device gets the normal fops.
> 
> The registration steps should be done under a normal lock inside the
> vfio_device. If a vfio_device is already registered then further
> registration should fail.
> 
> Getting the device fd via the group fd triggers the same sequence as
> above.
> 

The above works if the group interface is also connected to iommufd, i.e.
making vfio type1 a shim. In this case we can use the registration
status as the exclusive switch. But if we keep vfio type1 separate as
today, then a new atomic is still necessary. This all depends on how
we want to deal with vfio type1 and iommufd, and possibly what's
discussed here just adds more weight to the shim option...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-21 23:10       ` Tian, Kevin
@ 2021-09-22  0:53         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22  0:53 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 12:01 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > options to expose its device to userspace:
> > >
> > > a)  only legacy group interface, for devices which haven't been moved to
> > >     iommufd (e.g. platform devices, sw mdev, etc.);
> > >
> > > b)  both legacy group interface and new device-centric interface, for
> > >     devices which supports iommufd but also wants to keep backward
> > >     compatibility (e.g. pci devices in this RFC);
> > >
> > > c)  only new device-centric interface, for new devices which don't carry
> > >     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> > 
> > We shouldn't have 'b'? Where does it come from?
> 
> a vfio-pci device can be opened via the existing group interface. if no b) it 
> means legacy vfio userspace can never use vfio-pci device any more
> once the latter is moved to iommufd.

Sorry, I think I meant a), which I guess you will say is SW mdev devices.

But even so, I think the way forward here is to still always expose
the device as /dev/vfio/devices/X, and some devices may not allow
iommufd usage initially.

Providing an ioctl to bind to a normal VFIO container or group might
allow a reasonable fallback in userspace..

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-21 16:01     ` Jason Gunthorpe via iommu
@ 2021-09-22  0:54       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  0:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 12:01 AM
> 
> >  One open about how to organize the device nodes under
> /dev/vfio/devices/.
> > This RFC adopts a simple policy by keeping a flat layout with mixed
> devname
> > from all kinds of devices. The prerequisite of this model is that devnames
> > from different bus types are unique formats:
> 
> This isn't reliable, the devname should just be vfio0, vfio1, etc
> 
> The userspace can learn the correct major/minor by inspecting the
> sysfs.
> 
> This whole concept should disappear into the prior patch that adds the
> struct device in the first place, and I think most of the code here
> can be deleted once the struct device is used properly.
> 

Can you help elaborate on the above flow? This is one area where we need
more guidance.

When QEMU accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
how does QEMU identify which vfio0/1/... is associated with the specified
DDDD:BB:DD.F?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-21 23:56       ` Tian, Kevin
@ 2021-09-22  0:55         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22  0:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > The opened atomic is awful. A newly created fd should start in a
> > state where it has a disabled fops
> > 
> > The only thing the disabled fops can do is register the device to the
> > iommu fd. When successfully registered the device gets the normal fops.
> > 
> > The registration steps should be done under a normal lock inside the
> > vfio_device. If a vfio_device is already registered then further
> > registration should fail.
> > 
> > Getting the device fd via the group fd triggers the same sequence as
> > above.
> > 
> 
> Above works if the group interface is also connected to iommufd, i.e.
> making vfio type1 as a shim. In this case we can use the registration
> status as the exclusive switch. But if we keep vfio type1 separate as
> today, then a new atomic is still necessary. This all depends on how
> we want to deal with vfio type1 and iommufd, and possibly what's
> discussed here just adds another pound to the shim option...

No, it works the same either way, the group FD path is identical to
the normal FD path, it just triggers some of the state transitions
automatically internally instead of requiring external ioctls.

The device FD starts disabled; an internal API binds it to the iommu
via open coding with the group API, and then the rest of the APIs can
be enabled. Same as today.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-21 19:56     ` Alex Williamson
@ 2021-09-22  0:56       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  0:56 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: jgg, hch, jasowang, joro, jean-philippe, parav, lkml, pbonzini,
	lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu, Tian,
	Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 22, 2021 3:56 AM
> 
> On Sun, 19 Sep 2021 14:38:30 +0800
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch introduces a new interface (/dev/vfio/devices/$DEVICE) for
> > userspace to directly open a vfio device w/o relying on container/group
> > (/dev/vfio/$GROUP). Anything related to group is now hidden behind
> > iommufd (more specifically in iommu core by this RFC) in a device-centric
> > manner.
> >
> > In case a device is exposed in both legacy and new interfaces (see next
> > patch for how to decide it), this patch also ensures that when the device
> > is already opened via one interface then the other one must be blocked.
> >
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio.c  | 228 +++++++++++++++++++++++++++++++++++++++----
> >  include/linux/vfio.h |   2 +
> >  2 files changed, 213 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 02cc51ce6891..84436d7abedd 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> ...
> > @@ -2295,6 +2436,52 @@ static struct miscdevice vfio_dev = {
> >  	.mode = S_IRUGO | S_IWUGO,
> >  };
> >
> > +static char *vfio_device_devnode(struct device *dev, umode_t *mode)
> > +{
> > +	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
> > +}
> 
> dev_name() doesn't provide us with any uniqueness guarantees, so this
> could potentially generate naming conflicts.  The similar scheme for
> devices within an iommu group appends an instance number if a conflict
> occurs, but that solution doesn't work here where the name isn't just a
> link to the actual device.  Devices within an iommu group are also
> likely associated within a bus_type, so the potential for conflict is
> pretty negligible, that's not the case as vfio is adopted for new
> device types.  Thanks,
> 

This is also our concern. Thanks for confirming it. We would appreciate it
if you can help think of a better alternative to deal with it.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  0:53         ` Jason Gunthorpe via iommu
@ 2021-09-22  0:59           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  0:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:54 AM
> 
> On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > > options to expose its device to userspace:
> > > >
> > > > a)  only legacy group interface, for devices which haven't been moved
> to
> > > >     iommufd (e.g. platform devices, sw mdev, etc.);
> > > >
> > > > b)  both legacy group interface and new device-centric interface, for
> > > >     devices which supports iommufd but also wants to keep backward
> > > >     compatibility (e.g. pci devices in this RFC);
> > > >
> > > > c)  only new device-centric interface, for new devices which don't carry
> > > >     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> > >
> > > We shouldn't have 'b'? Where does it come from?
> >
> > a vfio-pci device can be opened via the existing group interface. if no b) it
> > means legacy vfio userspace can never use vfio-pci device any more
> > once the latter is moved to iommufd.
> 
> Sorry, I think I meant a), which I guess you will say is SW mdev devices

We listed a) here in case we don't want to move all vfio device types to
use iommufd in one go. It's supposed to be an option valid only in this
transition phase. In the end only b) and c) are valid.

> 
> But even so, I think the way forward here is to still always expose
> the device /dev/vfio/devices/X and some devices may not allow iommufd
> usage initially.
> 
> Providing an ioctl to bind to a normal VFIO container or group might
> allow a reasonable fallback in userspace..
> 

But doesn't a new ioctl still imply breaking existing vfio userspace?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  0:54       ` Tian, Kevin
@ 2021-09-22  1:00         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22  1:00 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 12:01 AM
> > 
> > >  One open about how to organize the device nodes under
> > /dev/vfio/devices/.
> > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > devname
> > > from all kinds of devices. The prerequisite of this model is that devnames
> > > from different bus types are unique formats:
> > 
> > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > 
> > The userspace can learn the correct major/minor by inspecting the
> > sysfs.
> > 
> > This whole concept should disappear into the prior patch that adds the
> > struct device in the first place, and I think most of the code here
> > can be deleted once the struct device is used properly.
> > 
> 
> Can you help elaborate above flow? This is one area where we need
> more guidance.
> 
> When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> > how does Qemu identify which vfio0/1/... is associated with the specified
> DDDD:BB:DD.F? 

When done properly in the kernel the file:

/sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev

will contain the major:minor of the VFIO device.

Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
that the major:minor matches.

In the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
to QEMU.

You can look at this for some general, over-engineered code to handle
opening from a sysfs handle like the above:

https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c

Jason
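
A minimal userspace sketch of that lookup, under the sysfs layout proposed
above; discovering the vfioX name (e.g. by listing the device's vfio/
directory in sysfs) is left out, the function name is illustrative, and
error handling is trimmed:

	#include <fcntl.h>
	#include <limits.h>
	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/sysmacros.h>
	#include <unistd.h>

	/* Map a PCI BDF plus a vfioX name to /dev/vfio/devices/vfioX,
	 * verifying major:minor from sysfs against the opened node. */
	static int vfio_open_device_cdev(const char *bdf, const char *vfio_name)
	{
		char path[PATH_MAX];
		unsigned int maj, min;
		struct stat st;
		FILE *f;
		int fd;

		snprintf(path, sizeof(path),
			 "/sys/bus/pci/devices/%s/vfio/%s/dev", bdf, vfio_name);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%u:%u", &maj, &min) != 2) {
			fclose(f);
			return -1;
		}
		fclose(f);

		snprintf(path, sizeof(path), "/dev/vfio/devices/%s", vfio_name);
		fd = open(path, O_RDWR);
		if (fd < 0)
			return -1;

		/* Reject stale or renamed nodes that don't match sysfs. */
		if (fstat(fd, &st) < 0 || !S_ISCHR(st.st_mode) ||
		    major(st.st_rdev) != maj || minor(st.st_rdev) != min) {
			close(fd);
			return -1;
		}
		return fd;
	}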

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  1:00         ` Jason Gunthorpe via iommu
@ 2021-09-22  1:02           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 9:00 AM
> 
> On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > >  One open about how to organize the device nodes under
> > > /dev/vfio/devices/.
> > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > devname
> > > > from all kinds of devices. The prerequisite of this model is that
> devnames
> > > > from different bus types are unique formats:
> > >
> > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > >
> > > The userspace can learn the correct major/minor by inspecting the
> > > sysfs.
> > >
> > > This whole concept should disappear into the prior patch that adds the
> > > struct device in the first place, and I think most of the code here
> > > can be deleted once the struct device is used properly.
> > >
> >
> > Can you help elaborate above flow? This is one area where we need
> > more guidance.
> >
> > When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> > how does Qemu identify which vfio0/1/... is associated with the specified
> > DDDD:BB:DD.F?
> 
> When done properly in the kernel the file:
> 
> /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> 
> Will contain the major:minor of the VFIO device.
> 
> Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> that the major:minor matches.

Ah, that's the trick.

> 
> in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> to qemu.
> 
> You can look at this for some general over engineered code to handle
> opening from a sysfs handle like above:
> 
> https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
> 

Will check. Thanks for the suggestion.

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22  0:55         ` Jason Gunthorpe via iommu
@ 2021-09-22  1:07           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:55 AM
> 
> On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > The opened atomic is awful. A newly created fd should start in a
> > > state where it has a disabled fops
> > >
> > > The only thing the disabled fops can do is register the device to the
> > > iommu fd. When successfully registered the device gets the normal fops.
> > >
> > > The registration steps should be done under a normal lock inside the
> > > vfio_device. If a vfio_device is already registered then further
> > > registration should fail.
> > >
> > > Getting the device fd via the group fd triggers the same sequence as
> > > above.
> > >
> >
> > Above works if the group interface is also connected to iommufd, i.e.
> > making vfio type1 as a shim. In this case we can use the registration
> > status as the exclusive switch. But if we keep vfio type1 separate as
> > today, then a new atomic is still necessary. This all depends on how
> > we want to deal with vfio type1 and iommufd, and possibly what's
> > discussed here just adds another pound to the shim option...
> 
> No, it works the same either way, the group FD path is identical to
> the normal FD path, it just triggers some of the state transitions
> automatically internally instead of requiring external ioctls.
> 
> The device FD starts disabled, an internal API binds it to the iommu
> via open coding with the group API, and then the rest of the APIs can
> be enabled. Same as today.
> 

Still a bit confused. If vfio type1 also connects to iommufd, whether
the device is registered can be centrally checked based on whether
an iommu_ctx is recorded. But if type1 doesn't talk to iommufd at
all, don't we still need to introduce a new state (call it 'opened' or
'registered') to protect the two interfaces? In this case what is the
point of keeping the device FD disabled even for the group path?

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-21 21:09       ` Alex Williamson
@ 2021-09-22  1:19         ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:19 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Liu, Yi L, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 22, 2021 5:09 AM
> 
> On Tue, 21 Sep 2021 13:40:01 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:
> > > This patch exposes the device-centric interface for vfio-pci devices. To
> > > be compatiable with existing users, vfio-pci exposes both legacy group
> > > interface and device-centric interface.
> > >
> > > As explained in last patch, this change doesn't apply to devices which
> > > cannot be forced to snoop cache by their upstream iommu. Such devices
> > > are still expected to be opened via the legacy group interface.
> 
> This doesn't make much sense to me.  The previous patch indicates
> there's work to be done in updating the kvm-vfio contract to understand
> DMA coherency, so you're trying to limit use cases to those where the
> IOMMU enforces coherency, but there's QEMU work to be done to support
> the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> will be told about non-coherent devices rather than "meh, skip it in the
> kernel"?  Also let's not forget that vfio is not only for KVM.

The policy here is that VFIO will not expose such devices (no enforce-snoop)
in the new device hierarchy at all. In this case QEMU will fall back to the
group interface automatically and then rely on the existing contract to connect
vfio and QEMU. It doesn't need to care about whatever new contract there is
until such devices are exposed in the new interface.

Yes, vfio is not only for KVM. But here it's more a task split based on staging
considerations. IMO it's not necessary to further split the task into supporting
non-snoop devices for userspace drivers and then for KVM.


^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-21 21:58         ` Jason Gunthorpe via iommu
@ 2021-09-22  1:24           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Liu, Yi L, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 5:59 AM
> 
> On Tue, Sep 21, 2021 at 03:09:29PM -0600, Alex Williamson wrote:
> 
> > the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> > will be told about non-coherent devices rather than "meh, skip it in the
> > kernel"?  Also let's not forget that vfio is not only for KVM.
> 
> vfio is not only for KVM, but AFAICT the wbinv stuff is only for
> KVM... But yes, I agree this should be sorted out at this stage.

If such devices are not even exposed in the new hierarchy at this stage,
I suppose sorting it out later should be fine?

> 
> > > > When the device is opened via /dev/vfio/devices, vfio-pci should prevent
> > > > the user from accessing the assigned device because the device is still
> > > > attached to the default domain which may allow user-initiated DMAs to
> > > > touch arbitrary place. The user access must be blocked until the device
> > > > is later bound to an iommufd (see patch 08). The binding acts as the
> > > > contract for putting the device in a security context which ensures user-
> > > > initiated DMAs via this device cannot harm the rest of the system.
> > > >
> > > > This patch introduces a vdev->block_access flag for this purpose. It's set
> > > > when the device is opened via /dev/vfio/devices and cleared after binding
> > > > to iommufd succeeds. mmap and r/w handlers check this flag to decide whether
> > > > user access should be blocked or not.
> > >
> > > This should not be in vfio_pci.
> > >
> > > AFAIK there is no condition where a vfio driver can work without being
> > > connected to some kind of iommu back end, so the core code should
> > > handle this interlock globally. A vfio driver's ops should not be
> > > callable until the iommu is connected.
> > >
> > > The only vfio_pci patch in this series should be adding a new callback
> > > op to take in an iommufd and register the pci_device as a iommufd
> > > device.
> >
> > Couldn't the same argument be made that registering a $bus device as an
> > iommufd device is a common interface that shouldn't be the
> > responsibility of the vfio device driver?
> 
> The driver needs enough involvement to signal what kind of IOMMU
> connection it wants, eg attaching to a physical device will use the
> iofd_attach_device() path, but attaching to a SW page table should use
> a different API call. PASID should also be different.

Exactly

> 
> Possibly a good arrangement is to have the core provide some generic
> ioctl ops functions 'vfio_all_device_iommufd_bind' that everything
> except mdev drivers can use so the code isn't all duplicated.

Could this be a future enhancement for when we have multiple device
types supporting iommufd?

> 
> > non-group device anything more than a reservation of that device if
> > access is withheld until iommu isolation?  I also don't really want to
> > predict how ioctls might evolve to guess whether only blocking .read,
> > .write, and .mmap callbacks are sufficient.  Thanks,
> 
> This is why I said the entire fops should be blocked in a dummy fops
> so the core code keeps the vfio_device FD parked and userspace is unable to
> access the ops until device attachment, and thus IOMMU isolation, is
> completed.
> 
> Simple and easy to reason about, a parked FD is very similar to a
> closed FD.
> 

This rationale makes sense. The open question of how to handle exclusive
open between the group and nongroup interfaces still needs some more
clarification though, especially what a parked FD means for the group
interface (where parking is unnecessary since the security context is
already established before the device is opened).
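
(As a rough sketch of the parked-fops idea under discussion -- the names
and the bind ioctl below are illustrative, not the actual series:)

/* Illustrative sketch: the device fd starts with "parked" fops that reject
 * everything except the bind request; after a successful bind (i.e. IOMMU
 * isolation is in place) the core swaps in the real fops. */
static ssize_t vfio_device_parked_rw(struct file *filep, char __user *buf,
                                     size_t count, loff_t *ppos)
{
        return -ENODEV;                 /* no device access before binding */
}

static long vfio_device_parked_ioctl(struct file *filep, unsigned int cmd,
                                     unsigned long arg)
{
        if (cmd == VFIO_DEVICE_BIND_IOMMUFD)    /* hypothetical ioctl nr */
                return vfio_device_do_bind(filep, arg); /* enables real fops */

        return -ENODEV;
}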

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-21 17:09     ` Jason Gunthorpe via iommu
@ 2021-09-22  1:47       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:10 AM
> 
> On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> >
> > This extends iommu core to manage security context for passthrough
> > devices. Please bear a long explanation for how we reach this design
> > instead of managing it solely in iommufd like what vfio does today.
> >
> > Devices which cannot be isolated from each other are organized into an
> > iommu group. When a device is assigned to the user space, the entire
> > group must be put in a security context so that user-initiated DMAs via
> > the assigned device cannot harm the rest of the system. No user access
> > should be granted on a device before the security context is established
> > for the group which the device belongs to.
> 
> > Managing the security context must meet below criteria:
> >
> > 1)  The group is viable for user-initiated DMAs. This implies that the
> >     devices in the group must be either bound to a device-passthrough
> 
> s/a/the same/
> 
> >     framework, or driver-less, or bound to a driver which is known safe
> >     (not do DMA).
> >
> > 2)  The security context should only allow DMA to the user's memory and
> >     devices in this group;
> >
> > 3)  After the security context is established for the group, the group
> >     viability must be continuously monitored before the user relinquishes
> >     all devices belonging to the group. The viability might be broken e.g.
> >     when a driver-less device is later bound to a driver which does DMA.
> >
> > 4)  The security context should not be destroyed before user access
> >     permission is withdrawn.
> >
> > Existing vfio introduces explicit container/group semantics in its uAPI
> > to meet above requirements. A single security context (iommu domain)
> > is created per container. Attaching group to container moves the entire
> > group into the associated security context, and vice versa. The user can
> > open the device only after group attach. A group can be detached only
> > after all devices in the group are closed. Group viability is monitored
> > by listening to iommu group events.
> >
> > Unlike vfio, iommufd adopts a device-centric design with all group
> > logistics hidden behind the fd. Binding a device to iommufd serves
> > as the contract to get security context established (and vice versa
> > for unbinding). One additional requirement in iommufd is to manage the
> > switch between multiple security contexts due to decoupled bind/attach:
> 
> This should be a precursor series that actually does clean things up
> properly. There is no reason for vfio and iommufd to differ here, if
> we are implementing this logic into the iommu layer then it should be
> deleted from the VFIO layer, not left duplicated like this.

make sense

> 
> IIRC in VFIO the container is the IOAS and when the group goes to
> create the device fd it should simply do the
> iommu_device_init_user_dma() followed immediately by a call to bind
> the container IOAS as your #3.

A slight correction:

to meet vfio semantics we could do init_user_dma() at group attach
time and then bind to the container IOAS when the device fd is
created. This is because vfio requires the group to be in a security
context before the device is opened.
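
(Schematically, something like the below -- the helper names are made up
for illustration, only the ordering matters:)

/* Sketch of the ordering for the vfio group path (helpers hypothetical) */
static int vfio_group_attach_container(struct vfio_group *group,
                                       struct vfio_container *container)
{
        /* 1) security context established at group attach time */
        int ret = vfio_group_init_user_dma(group);

        if (ret)
                return ret;
        return vfio_container_attach_group(container, group);
}

static int vfio_group_create_device_fd(struct vfio_group *group,
                                       struct vfio_device *vdev)
{
        /* 2) bind to the container's IOAS only when the device fd is created */
        return vfio_device_bind_container_ioas(vdev, group->container);
}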

> 
> Then delete all the group viability stuff from vfio, relying on the
> iommu to do it.
> 
> It should have full symmetry with the iommufd.

agree

> 
> > @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
> >  		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
> >  		break;
> >  	case BUS_NOTIFY_BOUND_DRIVER:
> > +		/*
> > +		 * FIXME: Alternatively the attached drivers could generically
> > +		 * indicate to the iommu layer that they are safe for keeping
> > +		 * the iommu group user viable by calling some function around
> > +		 * probe(). We could eliminate this gross BUG_ON() by denying
> > +		 * probe to non-iommu-safe driver.
> > +		 */
> > +		mutex_lock(&group->mutex);
> > +		if (group->user_dma_owner_id)
> > +			BUG_ON(!iommu_group_user_dma_viable(group));
> > +		mutex_unlock(&group->mutex);
> 
> And the mini-series should fix this BUG_ON properly by interlocking
> with the driver core to simply refuse to bind a driver under these
> conditions instead of allowing userspace to crash the kernel.
> 
> That alone would be justification enough to merge this work.

yes

> 
> > +
> > +/*
> > + * IOMMU core interfaces for iommufd.
> > + */
> > +
> > +/*
> > + * FIXME: We currently simply follow vfio policy to maintain the group's
> > + * viability to user. Eventually, we should avoid below hard-coded list
> > + * by letting drivers indicate to the iommu layer that they are safe for
> > + * keeping the iommu group's user viability.
> > + */
> > +static const char * const iommu_driver_allowed[] = {
> > +	"vfio-pci",
> > +	"pci-stub"
> > +};
> 
> Yuk. This should be done with some callback in those drivers
> 'iommu_allow_user_dma()'
> 
> Ie the basic flow would see the driver core doing some:

Just to double confirm: is there a concern with having the driver core
call iommu functions?
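
(For reference, a minimal sketch of the atomic-based ownership outlined
below; the dma_owner field and helper bodies are illustrative only, and no
refcounting is shown:)

/* Illustrative: per-group DMA ownership tracked in an atomic_t.
 *   0 = nothing happening, 1 = kernel DMA, 2 = user DMA */
enum { DMA_OWNER_NONE = 0, DMA_OWNER_KERNEL = 1, DMA_OWNER_USER = 2 };

static int iommu_doing_kernel_dma(struct iommu_group *group)
{
        int old = atomic_cmpxchg(&group->dma_owner, DMA_OWNER_NONE,
                                 DMA_OWNER_KERNEL);

        /* refuse to bind a kernel-DMA driver while the group is user-owned */
        return old == DMA_OWNER_USER ? -EBUSY : 0;
}

static void iommu_allow_user_dma(struct iommu_group *group)
{
        /* called around probe() by user-safe drivers such as pci-stub */
        atomic_set(&group->dma_owner, DMA_OWNER_USER);
}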

> 
>  ret = iommu_doing_kernel_dma()
>  if (ret) do not bind
>  driver_bind
>   pci_stub_probe()
>      iommu_allow_user_dma()
> 
> And the various functions are manipulating some atomic.
>  0 = nothing happening
>  1 = kernel DMA
>  2 = user DMA
> 
> No BUG_ON.
> 
> Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-21 15:41     ` Jason Gunthorpe via iommu
@ 2021-09-22  1:51       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  1:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 21, 2021 11:42 PM
> 
>  - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc does
>    not need locking (order it properly too, it is in the wrong order), and
>    don't check for duplicate devices or dev_cookie duplication, that
>    is user error and is harmless to the kernel.
> 

I'm confused here. Yes, it's a user error, but we check many other user
errors and return -EINVAL, -EBUSY, etc. Why is this one special?

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-21 16:19     ` Jason Gunthorpe via iommu
@ 2021-09-22  2:31       ` Lu Baolu
  -1 siblings, 0 replies; 532+ messages in thread
From: Lu Baolu @ 2021-09-22  2:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu Yi L
  Cc: baolu.lu, alex.williamson, hch, jasowang, joro, jean-philippe,
	kevin.tian, parav, lkml, pbonzini, lushenming, eric.auger,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, david, nicolinc

Hi Jason,

On 9/22/21 12:19 AM, Jason Gunthorpe wrote:
> On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>
>> This provides an interface for upper layers to get the per-device iommu
>> attributes.
>>
>>      int iommu_device_get_info(struct device *dev,
>>                                enum iommu_devattr attr, void *data);
> 
> Can't we use properly typed ops and functions here instead of a void
> *data?
> 
> get_snoop()
> get_page_size()
> get_addr_width()

Yeah! The above are friendlier to the upper-layer callers.
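
Something along these lines, I suppose (prototypes are just a sketch of
the idea, to be refined):

/* Typed per-device queries instead of a void *data multiplexer (sketch) */
bool iommu_device_get_enforce_snoop(struct device *dev);
unsigned long iommu_device_get_pgsize_bitmap(struct device *dev);
unsigned int iommu_device_get_addr_width(struct device *dev);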

> 
> ?
> 
> Jason
> 

Best regards,
baolu

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22  0:55         ` Jason Gunthorpe via iommu
@ 2021-09-22  3:22           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Tian, Kevin
> Sent: Wednesday, September 22, 2021 9:07 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 8:55 AM
> >
> > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > The opened atomic is awful. A newly created fd should start in a
> > > > state where it has a disabled fops
> > > >
> > > > The only thing the disabled fops can do is register the device to the
> > > > iommu fd. When successfully registered the device gets the normal fops.
> > > >
> > > > The registration steps should be done under a normal lock inside the
> > > > vfio_device. If a vfio_device is already registered then further
> > > > registration should fail.
> > > >
> > > > Getting the device fd via the group fd triggers the same sequence as
> > > > above.
> > > >
> > >
> > > Above works if the group interface is also connected to iommufd, i.e.
> > > making vfio type1 as a shim. In this case we can use the registration
> > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > today, then a new atomic is still necessary. This all depends on how
> > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > discussed here just adds another pound to the shim option...
> >
> > No, it works the same either way, the group FD path is identical to
> > the normal FD path, it just triggers some of the state transitions
> > automatically internally instead of requiring external ioctls.
> >
> > The device FDs starts disabled, an internal API binds it to the iommu
> > via open coding with the group API, and then the rest of the APIs can
> > be enabled. Same as today.
> >

After reading your comments on patch 08, I have a clearer picture of
your suggestion. The key is to handle exclusive access at binding
time (based on vdev->iommu_dev). Please see whether the below makes
sense:

Shared sequence:

1)  initialize the device with a parked fops;
2)  need binding (explicit or implicit) to move away from parked fops;
3)  switch to normal fops after successful binding;

1) happens at device probe.

For the nongroup path, 2) and 3) are done together in VFIO_DEVICE_GET_IOMMUFD:

  - 2) is done by calling .bind_iommufd() callback;
  - 3) could be done within .bind_iommufd(), or via a new callback e.g.
    .finalize_device(). The latter may be preferred for the group interface;
  - Two threads may open the same device simultaneously, with exclusive 
    access guaranteed by iommufd_bind_device();
  - open() after successful binding is rejected, since the normal fops have
    been activated. This is checked against vdev->iommu_dev;

For the group path, 2) and 3) are done together in VFIO_GROUP_GET_DEVICE_FD:

  - 2) is done by open coding bind_iommufd + attach_ioas. Create an 
    iommufd_device object and record it to vdev->iommu_dev
  - 3) is done by calling .finalize_device();
  - open() after vdev->iommu_dev becomes valid is rejected. This also
    ensures exclusive ownership with respect to the nongroup path.

If Alex also agrees with it, this might be another mini-series to be merged
(just for the group path) before this one. Doing so essentially supersedes the
existing group/container attaching process: attach_ioas will be skipped and
the security context is instead established when the device is opened.
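
(A minimal sketch of the exclusion point described above -- the lock and
helper are illustrative; the key is that both paths race on
vdev->iommu_dev at binding time:)

/* Illustrative: exclusive ownership decided at binding time, shared by
 * the group and nongroup paths. */
static int vfio_device_claim(struct vfio_device *vdev,
                             struct iommufd_device *idev)
{
        int ret = 0;

        mutex_lock(&vdev->lock);
        if (vdev->iommu_dev)            /* already claimed via either path */
                ret = -EBUSY;
        else
                vdev->iommu_dev = idev; /* 3) normal fops enabled after this */
        mutex_unlock(&vdev->lock);

        return ret;
}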

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 00/20] Introduce /dev/iommu for userspace I/O address space management
  2021-09-21 13:45   ` Jason Gunthorpe via iommu
@ 2021-09-22  3:25     ` Liu, Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu, Yi L @ 2021-09-22  3:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, Tian, Kevin,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 21, 2021 9:45 PM
> 
> On Sun, Sep 19, 2021 at 02:38:28PM +0800, Liu Yi L wrote:
> > Linux now includes multiple device-passthrough frameworks (e.g. VFIO
> and
> > vDPA) to manage secure device access from the userspace. One critical
> task
> > of those frameworks is to put the assigned device in a secure, IOMMU-
> > protected context so user-initiated DMAs are prevented from doing harm
> to
> > the rest of the system.
> 
> Some bot will probably send this too, but it has compile warnings and
> needs to be rebased to 5.15-rc1

Thanks Jason, will fix the warnings. Yeah, I was using 5.14 in the test; will
rebase to 5.15-rc# in the next version.
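
(For reference, the warnings quoted below usually call for changes of this
shape -- schematic fragments only, not the actual v2 diff:)

        /* schematic: give 'ret' a defined value on every path */
        int ioasid, ret = 0;

        /* schematic: treat a duplicate device/cookie as an explicit error */
        if (idev->dev == dev || idev->dev_cookie == dev_cookie)
                ret = -EBUSY;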

Regards,
Yi Liu

> drivers/iommu/iommufd/iommufd.c:269:6: warning: variable 'ret' is used
> uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
>         if (refcount_read(&ioas->refs) > 1) {
>             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:277:9: note: uninitialized use occurs
> here
>         return ret;
>                ^~~
> drivers/iommu/iommufd/iommufd.c:269:2: note: remove the 'if' if its
> condition is always true
>         if (refcount_read(&ioas->refs) > 1) {
>         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:253:17: note: initialize the variable 'ret'
> to silence this warning
>         int ioasid, ret;
>                        ^
>                         = 0
> drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used
> uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
>                 if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
>                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs
> here
>         return ERR_PTR(ret);
>                        ^~~
> drivers/iommu/iommufd/iommufd.c:727:3: note: remove the 'if' if its
> condition is always false
>                 if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
> 
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:727:7: warning: variable 'ret' is used
> uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
>                 if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
>                     ^~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:767:17: note: uninitialized use occurs
> here
>         return ERR_PTR(ret);
>                        ^~~
> drivers/iommu/iommufd/iommufd.c:727:7: note: remove the '||' if its
> condition is always false
>                 if (idev->dev == dev || idev->dev_cookie == dev_cookie) {
>                     ^~~~~~~~~~~~~~~~~~~
> drivers/iommu/iommufd/iommufd.c:717:9: note: initialize the variable 'ret'
> to silence this warning
>         int ret;
>                ^
>                 = 0
> 
> Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-21 17:40     ` Jason Gunthorpe via iommu
@ 2021-09-22  3:30       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:41 AM
> 
> On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > After a device is bound to the iommufd, userspace can use this interface
> > to query the underlying iommu capability and format info for this device.
> > Based on this information the user then creates I/O address space in a
> > compatible format with the to-be-attached devices.
> >
> > Device cookie which is registered at binding time is used to mark the
> > device which is being queried here.
> >
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >  drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
> >  include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
> >  2 files changed, 117 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > index e16ca21e4534..641f199f2d41 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
> >  	return 0;
> >  }
> >
> > +static struct device *
> > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> > +{
> 
> We have an xarray ID for the device, why are we allowing userspace to
> use the dev_cookie as input?
> 
> Userspace should always pass in the ID. The only place dev_cookie
> should appear is if the kernel generates an event back to
> userspace. Then the kernel should return both the ID and the
> dev_cookie in the event to allow userspace to correlate it.
> 

A little background.

In the earlier design proposal we discussed two options. One was to return
a kernel-allocated ID (label) to userspace. The other was to have the user
register a cookie and use it in the iommufd uAPI. At that time the two
options were treated as mutually exclusive and the cookie one was preferred.

Now you instead recommend a mixed option. We can certainly follow it
if nobody objects.
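
(To make the mixed option concrete, a sketch of what it could look like at
the uAPI level; the struct and field names below are purely illustrative:)

/* Illustrative only: bind returns a kernel-allocated id (the xarray index),
 * while kernel-generated events carry both the id and the user-registered
 * cookie so userspace can correlate them. */
struct iommu_device_bind_resp {
	__u32	dev_id;		/* allocated by the kernel */
};

struct iommu_device_event {
	__u32	dev_id;		/* kernel's id for the device */
	__u32	pad;
	__u64	dev_cookie;	/* user's cookie, echoed back */
};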

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-21 17:44     ` Jason Gunthorpe via iommu
@ 2021-09-22  3:40       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:45 AM
> 
> On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > This patch adds IOASID allocation/free interface per iommufd. When
> > allocating an IOASID, userspace is expected to specify the type and
> > format information for the target I/O page table.
> >
> > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > semantics. For this type the user should specify the addr_width of
> > the I/O address space and whether the I/O page table is created in
> > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > as the false setting requires additional contract with KVM on handling
> > WBINVD emulation, which can be added later.
> >
> > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > for what formats can be specified when allocating an IOASID.
> >
> > Open:
> > - Devices on PPC platform currently use a different iommu driver in vfio.
> >   Per previous discussion they can also use vfio type1v2 as long as there
> >   is a way to claim a specific iova range from a system-wide address space.
> >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> >   adopted this design yet. We hope to have formal alignment in v1 discussion
> >   and then decide how to incorporate it in v2.
> 
> I think the request was to include a start/end IO address hint when
> creating the ios. When the kernel creates it then it can return the

Is the hint a single range, or could it be multiple ranges?
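
(Purely to illustrate the multi-range variant -- a hypothetical layout, not
something already in this RFC:)

/* Hypothetical alloc payload carrying zero or more IOVA range hints; the
 * kernel would report the actually usable geometry (incl. holes) via a
 * later query. */
struct iommu_iova_range {
	__u64	start;
	__u64	last;			/* inclusive */
};

struct iommu_ioasid_alloc_hints {
	__u32	nr_ranges;
	__u32	pad;
	struct iommu_iova_range	ranges[];
};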

> actual geometry including any holes via a query.

I'd like to see a detailed flow from David on how the uAPI works today with
the existing spapr driver and what exact changes he'd like to make to this
proposed interface. The above info is still insufficient for us to think about
the right solution.

> 
> > - Currently ioasid term has already been used in the kernel (drivers/iommu/
> >   ioasid.c) to represent the hardware I/O address space ID in the wire. It
> >   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
> >   ID). We need find a way to resolve the naming conflict between the hardware
> >   ID and software handle. One option is to rename the existing ioasid to be
> >   pasid or ssid, given their full names still sound generic. Appreciate more
> >   thoughts on this open!
> 
> ioas works well here I think. Use ioas_id to refer to the xarray
> index.

What about when pasid is introduced to this uAPI? Then use ioas_id
for the xarray index and ioasid to represent pasid/ssid? At that point
the software handle and the hardware ID are mixed together and thus
need clear terminology to differentiate them.


Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-21 17:47     ` Jason Gunthorpe via iommu
@ 2021-09-22  3:41       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:47 AM
> 
> On Sun, Sep 19, 2021 at 02:38:40PM +0800, Liu Yi L wrote:
> > As aforementioned, userspace should check extension for what formats
> > can be specified when allocating an IOASID. This patch adds such
> > interface for userspace. In this RFC, iommufd reports EXT_MAP_TYPE1V2
> > support and no no-snoop support yet.
> >
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >  drivers/iommu/iommufd/iommufd.c |  7 +++++++
> >  include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
> >  2 files changed, 34 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > index 4839f128b24a..e45d76359e34 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file *filep,
> >  		return ret;
> >
> >  	switch (cmd) {
> > +	case IOMMU_CHECK_EXTENSION:
> > +		switch (arg) {
> > +		case EXT_MAP_TYPE1V2:
> > +			return 1;
> > +		default:
> > +			return 0;
> > +		}
> >  	case IOMMU_DEVICE_GET_INFO:
> >  		ret = iommufd_get_device_info(ictx, arg);
> >  		break;
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 5cbd300eb0ee..49731be71213 100644
> > +++ b/include/uapi/linux/iommu.h
> > @@ -14,6 +14,33 @@
> >  #define IOMMU_TYPE	(';')
> >  #define IOMMU_BASE	100
> >
> > +/*
> > + * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
> > + *
> > + * Check whether an uAPI extension is supported.
> > + *
> > + * It's unlikely that all planned capabilities in IOMMU fd will be ready
> > + * in one breath. User should check which uAPI extension is supported
> > + * according to its intended usage.
> > + *
> > + * A rough list of possible extensions may include:
> > + *
> > + *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> > + *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
> > + *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
> > + *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
> > + *	- EXT_IOASID_NESTING for what the name stands;
> > + *	- EXT_USER_PAGE_TABLE for user managed page table;
> > + *	- EXT_USER_PASID_TABLE for user managed PASID table;
> > + *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> > + *	- ...
> > + *
> > + * Return: 0 if not supported, 1 if supported.
> > + */
> > +#define EXT_MAP_TYPE1V2		1
> > +#define EXT_DMA_NO_SNOOP	2
> > +#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE, IOMMU_BASE + 0)
> 
> I generally advocate for a 'try and fail' approach to discovering
> compatibility.
> 
> If that doesn't work for the userspace then a query to return a
> generic capability flag is the next best idea. Each flag should
> clearly define what 'try and fail' it is talking about

We don't have a strong preference here; this just follows what vfio does
today. So Alex's opinion is appreciated here. 😊

> 
> Eg dma_no_snoop is about creating an IOS with flag NO SNOOP set
> 
> TYPE1V2 seems like nonsense

It's there just in case other mapping protocols are introduced in the future.
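
(A sketch of the generic 'query capabilities' style suggested further
below -- the flag names are illustrative only:)

/* Illustrative only: a single query returning capability flags, where each
 * flag is tied to a specific 'try and fail' behaviour of another ioctl. */
struct iommu_capability_info {
	__u32	argsz;
	__u32	flags;
#define IOMMU_CAP_DMA_NO_SNOOP		(1 << 0)	/* IOAS w/o enforce-snoop */
#define IOMMU_CAP_IOASID_NESTING	(1 << 1)
#define IOMMU_CAP_DIRTY_TRACKING	(1 << 2)
};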

> 
> Not sure about the others.
> 
> IOW, this should recast to a generic 'query capabilities' IOCTL
> 
> Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-21 18:02     ` Jason Gunthorpe via iommu
@ 2021-09-22  3:53       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 2:02 AM
> 
> > +static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
> > +					   struct device *dev)
> > +{
> > +	bool snoop = false;
> > +	u32 addr_width;
> > +	int ret;
> > +
> > +	/*
> > +	 * currently we only support I/O page table with iommu enforce-snoop
> > +	 * format. Attaching a device which doesn't support this format in its
> > +	 * upstreaming iommu is rejected.
> > +	 */
> > +	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
> > +	if (ret || !snoop)
> > +		return -EINVAL;
> > +
> > +	ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width);
> > +	if (ret || addr_width < ioas->addr_width)
> > +		return -EINVAL;
> > +
> > +	/* TODO: also need to check permitted iova ranges and pgsize bitmap */
> > +
> > +	return 0;
> > +}
> 
> This seems kind of weird..
> 
> I expect the iommufd to hold a SW copy of the IO page table and each
> time a new domain is to be created it should push the SW copy into the
> domain. If the domain cannot support it then the domain driver should
> naturally fail a request.
> 
> When the user changes the IO page table the SW copy is updated then
> all of the domains are updated too - again if any domain cannot
> support the change then it fails and the change is rolled back.
> 
> It seems like this is a side effect of roughly hacking in the vfio
> code?

Actually this was one open we closed in the previous design proposal, but
it looks like you have a different thought now.

vfio maintains one ioas per container. Devices in the container can be
attached to different domains (e.g. due to snoop format). Every time the
ioas is updated, every attached domain is updated accordingly.

You recommended a one-ioas-one-domain model instead, i.e. any device with
a format incompatible with the one currently used in the ioas has to be
attached to a new ioas, even if the two ioas's have the same mappings.
This leads to a compatibility check at attach time.

Do you now want to return to the vfio model?
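
For reference, a hedged sketch of the attach-time flow under that
one-ioas-one-domain model. It reuses names visible in this patch
(ioas->device_list, ioas_check_device_compatibility(), iommu_attach_device());
the device-side list member and the error handling are illustrative only,
and all locking is omitted:

/*
 * Hedged sketch: the first attach allocates the backing domain; later
 * attaches only succeed if the device's IOMMU is compatible with the
 * IOASID's format, otherwise the user must allocate a separate IOASID.
 */
static int ioas_attach_device(struct iommufd_ioas *ioas,
			      struct iommufd_device *idev)
{
	int ret;

	ret = ioas_check_device_compatibility(ioas, idev->dev);
	if (ret)
		return ret;

	/* first device: allocate the iommu domain backing this IOASID */
	if (list_empty(&ioas->device_list)) {
		ioas->domain = iommu_domain_alloc(idev->dev->bus);
		if (!ioas->domain)
			return -ENOMEM;
	}

	ret = iommu_attach_device(ioas->domain, idev->dev);
	if (ret)
		goto out_free;

	list_add(&idev->ioas_item, &ioas->device_list);	/* ioas_item is illustrative */
	return 0;

out_free:
	if (list_empty(&ioas->device_list)) {
		iommu_domain_free(ioas->domain);
		ioas->domain = NULL;
	}
	return ret;
}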

> 
> > +	/*
> > +	 * Each ioas is backed by an iommu domain, which is allocated
> > +	 * when the ioas is attached for the first time and then shared
> > +	 * by following devices.
> > +	 */
> > +	if (list_empty(&ioas->device_list)) {
> 
> Seems strange, what if the devices are forced to have different
> domains? We don't want to model that in the SW layer..

this is due to the background above

> 
> > +	/* Install the I/O page table to the iommu for this device */
> > +	ret = iommu_attach_device(domain, idev->dev);
> > +	if (ret)
> > +		goto out_domain;
> 
> This is where things start to get confusing when you talk about PASID
> as the above call needs to be some PASID centric API.

yes, for PASID a new API (e.g. iommu_attach_device_pasid()) will be added.

but here we only talk about the physical device, and iommu_attach_device()
is only for physical devices.
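
Purely as a hypothetical sketch of that direction (this RFC defines no such
signature), the PASID-granular variant would sit next to the RID-level
attach used in this patch:

/* hypothetical: attach a domain to one PASID of the device */
int iommu_attach_device_pasid(struct iommu_domain *domain,
			      struct device *dev, u32 pasid);

/* existing RID-level attach, as used in this patch */
int iommu_attach_device(struct iommu_domain *domain, struct device *dev);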

> 
> > @@ -27,6 +28,16 @@ struct iommufd_device *
> >  iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
> >  void iommufd_unbind_device(struct iommufd_device *idev);
> >
> > +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid);
> > +void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid);
> > +
> > +static inline int
> > +__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
> > +				   struct iommufd_device *idev, int ioasid)
> > +{
> > +	return iommufd_device_attach_ioasid(idev, ioasid);
> > +}
> 
> If think sis taking in the iommfd_device then there isn't a logical
> place to signal the PCIness

can you elaborate?

> 
> But, I think the API should at least have a kdoc that this is
> capturing the entire device and specify that for PCI this means all
> TLPs with the RID.
> 

yes, this should be documented.

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-21 18:04     ` Jason Gunthorpe via iommu
@ 2021-09-22  3:56       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 2:04 AM
> 
> On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > This patch adds interface for userspace to attach device to specified
> > IOASID.
> >
> > Note:
> > One device can only be attached to one IOASID in this version. This is
> > on par with what vfio provides today. In the future this restriction can
> > be relaxed when multiple I/O address spaces are supported per device
> 
> ?? In VFIO the container is the IOS and the container can be shared
> with multiple devices. This needs to start at about the same
> functionality.

a device can only be attached to one container. One container can be
shared by multiple devices.

a device can only be attached to one IOASID. One IOASID can be shared
by multiple devices.

so it does start with the same functionality.

> 
> > +	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
> 
> This should be in the core code, right? There is nothing PCI specific
> here.
> 

but if you insist on a pci-wrapper attach function, don't we still need
something here (e.g. an .attach_ioasid() callback)?
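
A hedged sketch of what handling the ioctl once in the vfio core with an
optional driver callback might look like; the .attach_ioasid() member and
the argument struct are illustrative placeholders, not code from this
series:

/*
 * Hedged sketch: the core parses the ioctl and forwards it to the driver,
 * so vfio-pci only has to provide (or wrap) the callback.
 */
struct vfio_device_attach_data {	/* illustrative payload */
	__u32	argsz;
	__s32	iommufd;
	__s32	ioasid;
};

static long vfio_device_attach_ioasid(struct vfio_device *vdev,
				      unsigned long arg)
{
	struct vfio_device_attach_data attach;

	if (copy_from_user(&attach, (void __user *)arg, sizeof(attach)))
		return -EFAULT;

	if (!vdev->ops->attach_ioasid)	/* illustrative new callback */
		return -ENOTTY;

	return vdev->ops->attach_ioasid(vdev, attach.iommufd, attach.ioasid);
}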

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 16/20] vfio/type1: Export symbols for dma [un]map code sharing
  2021-09-21 18:14     ` Jason Gunthorpe via iommu
@ 2021-09-22  3:57       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  3:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 2:15 AM
> 
> On Sun, Sep 19, 2021 at 02:38:44PM +0800, Liu Yi L wrote:
> > [HACK. will fix in v2]
> >
> > There are two options to impelement vfio type1v2 mapping semantics in
> > /dev/iommu.
> >
> > One is to duplicate the related code from vfio as the starting point,
> > and then merge with vfio type1 at a later time. However
> vfio_iommu_type1.c
> > has over 3000LOC with ~80% related to dma management logic, including:
> 
> I can't really see a way forward like this. I think some scheme to
> move the vfio datastructure is going to be necessary.
> 
> > - the dma map/unmap metadata management
> > - page pinning, and related accounting
> > - iova range reporting
> > - dirty bitmap retrieving
> > - dynamic vaddr update, etc.
> 
> All of this needs to be part of the iommufd anyhow..

yes

> 
> > The alternative is to consolidate type1v2 logic in /dev/iommu immediately,
> > which requires converting vfio_iommu_type1 to be a shim driver.
> 
> Another choice is the the datastructure coulde move and the two
> drivers could share its code and continue to exist more independently
> 

where to put the shared code?

btw this is one major open question that I plan to discuss at LPC. 😊

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 04/20] iommu: Add iommu_device_get_info interface
  2021-09-22  2:31       ` Lu Baolu
@ 2021-09-22  5:07         ` Christoph Hellwig
  -1 siblings, 0 replies; 532+ messages in thread
From: Christoph Hellwig @ 2021-09-22  5:07 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Jason Gunthorpe, Liu Yi L, alex.williamson, hch, jasowang, joro,
	jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	eric.auger, corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu,
	dave.jiang, jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu,
	dwmw2, linux-kernel, david, nicolinc

On Wed, Sep 22, 2021 at 10:31:47AM +0800, Lu Baolu wrote:
> Hi Jason,
>
> On 9/22/21 12:19 AM, Jason Gunthorpe wrote:
>> On Sun, Sep 19, 2021 at 02:38:32PM +0800, Liu Yi L wrote:
>>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>>
>>> This provides an interface for upper layers to get the per-device iommu
>>> attributes.
>>>
>>>      int iommu_device_get_info(struct device *dev,
>>>                                enum iommu_devattr attr, void *data);
>>
>> Can't we use properly typed ops and functions here instead of a void
>> *data?
>>
>> get_snoop()
>> get_page_size()
>> get_addr_width()
>
> Yeah! Above are more friendly to the upper layer callers.

The other option would be a struct with all the attributes.  Still
type safe, but not as many methods.  It'll require a little boilerplate
in the callers, though.
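
For comparison, a rough sketch of that struct option; the struct, field and
function names here are made up, mirroring the attributes discussed in this
thread (snoop, address width, page sizes):

/* hedged sketch: one call fills a struct instead of per-attribute getters */
struct iommu_device_caps {
	bool	force_snoop;	/* IOMMU enforces cache coherency */
	u32	addr_width;	/* supported IOVA address width */
	u64	pgsize_bitmap;	/* supported I/O page sizes */
};

int iommu_device_get_caps(struct device *dev, struct iommu_device_caps *caps);

/* the caller-side boilerplate mentioned above */
static int check_dev(struct device *dev)
{
	struct iommu_device_caps caps;
	int ret;

	ret = iommu_device_get_caps(dev, &caps);
	if (ret)
		return ret;
	return caps.force_snoop ? 0 : -EINVAL;
}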

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  0:53         ` Jason Gunthorpe via iommu
@ 2021-09-22  9:23           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22  9:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:54 AM
> 
> On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > > options to expose its device to userspace:
> > > >
> > > > a)  only legacy group interface, for devices which haven't been moved to
> > > >     iommufd (e.g. platform devices, sw mdev, etc.);
> > > >
> > > > b)  both legacy group interface and new device-centric interface, for
> > > >     devices which supports iommufd but also wants to keep backward
> > > >     compatibility (e.g. pci devices in this RFC);
> > > >
> > > > c)  only new device-centric interface, for new devices which don't carry
> > > >     backward compatibility burden (e.g. hw mdev/subdev with pasid);
> > >
> > > We shouldn't have 'b'? Where does it come from?
> >
> > a vfio-pci device can be opened via the existing group interface. if no b) it
> > means legacy vfio userspace can never use vfio-pci device any more
> > once the latter is moved to iommufd.
> 
> Sorry, I think I ment a, which I guess you will say is SW mdev devices
> 
> But even so, I think the way forward here is to still always expose
> the device /dev/vfio/devices/X and some devices may not allow iommufd
> usage initially.

After some more thought this should work. Following your comments in
other places, we'll move the handling of BIND_IOMMUFD to the vfio core,
which then invokes .bind_iommufd() from the driver. For devices which
don't allow iommufd yet, the callback is NULL and an error is returned.

This leaves the userspace in a try-and-fail mode. It first opens the
device fd and the iommufd, and then tries to connect the two together.
If that fails, it falls back to the legacy group interface.
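
A hedged userspace sketch of that try-and-fail sequence; the bind ioctl
name and payload follow this RFC's direction but are not final uAPI, and
setup_via_group() is a hypothetical helper for the legacy flow:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/types.h>

struct vfio_device_bind_data {		/* illustrative payload */
	__u32	argsz;
	__s32	iommufd;
	__u64	dev_cookie;
};

int setup_via_group(const char *devpath);	/* hypothetical legacy fallback */

int setup_device(const char *devpath)
{
	int devfd = open(devpath, O_RDWR);	/* /dev/vfio/devices/X */
	int iommufd = open("/dev/iommu", O_RDWR);
	struct vfio_device_bind_data bind = {
		.argsz = sizeof(bind),
		.iommufd = iommufd,
		.dev_cookie = 1,
	};

	/* VFIO_DEVICE_BIND_IOMMUFD: name per this thread, not final uAPI */
	if (devfd >= 0 && iommufd >= 0 &&
	    ioctl(devfd, VFIO_DEVICE_BIND_IOMMUFD, &bind) == 0)
		return devfd;			/* driver supports iommufd */

	/* driver has no .bind_iommufd(); fall back to the group path */
	if (iommufd >= 0)
		close(iommufd);
	if (devfd >= 0)
		close(devfd);
	return setup_via_group(devpath);
}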

Then we don't need a) at all, and we can even avoid introducing the new
vfio_[un]register_device() at this point. Just leverage the existing
vfio_[un]register_group_dev() to cover b); new helpers can be introduced
later when c) is supported.

> 
> Providing an ioctl to bind to a normal VFIO container or group might
> allow a reasonable fallback in userspace..
> 

I didn't get this point though. An error in binding already allows the
user to fall back to the group path. Why do we need to introduce another
ioctl to explicitly bind to a container via the non-group interface?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22  9:23           ` Tian, Kevin
@ 2021-09-22 12:22             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:

> > Providing an ioctl to bind to a normal VFIO container or group might
> > allow a reasonable fallback in userspace..
> 
> I didn't get this point though. An error in binding already allows the
> user to fall back to the group path. Why do we need introduce another
> ioctl to explicitly bind to container via the nongroup interface? 

New userspace still needs a fallback path if it hits the 'try and
fail'. Keeping the device FD open and just using a different ioctl to
bind to a container/group FD, which new userspace can then obtain as a
fallback, might be OK.

Hard to see without going through the qemu parts, so maybe just keep
it in mind

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22  1:07           ` Tian, Kevin
@ 2021-09-22 12:31             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 01:07:11AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 8:55 AM
> > 
> > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > The opened atomic is aweful. A newly created fd should start in a
> > > > state where it has a disabled fops
> > > >
> > > > The only thing the disabled fops can do is register the device to the
> > > > iommu fd. When successfully registered the device gets the normal fops.
> > > >
> > > > The registration steps should be done under a normal lock inside the
> > > > vfio_device. If a vfio_device is already registered then further
> > > > registration should fail.
> > > >
> > > > Getting the device fd via the group fd triggers the same sequence as
> > > > above.
> > > >
> > >
> > > Above works if the group interface is also connected to iommufd, i.e.
> > > making vfio type1 as a shim. In this case we can use the registration
> > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > today, then a new atomic is still necessary. This all depends on how
> > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > discussed here just adds another pound to the shim option...
> > 
> > No, it works the same either way, the group FD path is identical to
> > the normal FD path, it just triggers some of the state transitions
> > automatically internally instead of requiring external ioctls.
> > 
> > The device FDs starts disabled, an internal API binds it to the iommu
> > via open coding with the group API, and then the rest of the APIs can
> > be enabled. Same as today.
> > 
> 
> Still a bit confused. if vfio type1 also connects to iommufd, whether 
> the device is registered can be centrally checked based on whether
> an iommu_ctx is recorded. But if type1 doesn't talk to iommufd at
> all, don't we still need introduce a new state (calling it 'opened' or
> 'registered') to protect the two interfaces? 

The "new state" is if the fops are pointing at the real fops or the
pre-fops, which in turn protects everything. You could imagine this as
some state in front of every fop call if you want.
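
A hedged sketch of that framing, with the state check sitting in front of
every fop; the ops_enabled field, the bind helper and the ioctl name are
illustrative, not code from this series:

/*
 * Hedged sketch: everything except the bind ioctl bounces until the FD
 * has been successfully bound, which is equivalent to starting with a
 * parked fops and switching to the real one afterwards.
 */
static long vfio_device_fops_unl_ioctl(struct file *filep, unsigned int cmd,
				       unsigned long arg)
{
	struct vfio_device *vdev = filep->private_data;

	if (cmd == VFIO_DEVICE_BIND_IOMMUFD)	/* illustrative name */
		return vfio_device_bind_iommufd(vdev, arg);

	/* "pre-fops" state: nothing else is reachable until bound */
	if (!smp_load_acquire(&vdev->ops_enabled))
		return -EINVAL;

	return vdev->ops->ioctl(vdev, cmd, arg);
}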

> In this case what is the point of keeping device FD disabled even
> for the group path?

I have a feeling that when you go through the APIs it will make sense to
have some symmetry here.

e.g. creating a device FD should have basically the same flow no matter
what triggers it, without confusing special cases where the group code
skips steps.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-22  1:47       ` Tian, Kevin
@ 2021-09-22 12:39         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 01:47:05AM +0000, Tian, Kevin wrote:

> > IIRC in VFIO the container is the IOAS and when the group goes to
> > create the device fd it should simply do the
> > iommu_device_init_user_dma() followed immediately by a call to bind
> > the container IOAS as your #3.
> 
> a slight correction.
> 
> to meet vfio semantics we could do init_user_dma() at group attach
> time and then call binding to container IOAS when the device fd
> is created. This is because vfio requires the group in a security context
> before the device is opened. 

Is it? Until a device FD is opened the group fd is kind of idle, right?

> > Ie the basic flow would see the driver core doing some:
> 
> Just double confirm. Is there concern on having the driver core to
> call iommu functions? 

It is always an interesting question, but I'd say the iommu is
foundational to Linux, and if it needs driver core help it shouldn't
be any different from PM, pinctrl, or other subsystems that have
inserted themselves into the driver core.

Something kind of like the below.

If I recall, once it is done like this then the entire iommu notifier
infrastructure can be ripped out which is a lot of code.


diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 68ea1f949daa90..e39612c99c6123 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct device_driver *drv)
                goto done;
        }
 
+       ret = iommu_set_kernel_ownership(dev);
+       if (ret)
+               return ret;
+
 re_probe:
        dev->driver = drv;
 
@@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct device_driver *drv)
                dev->pm_domain->dismiss(dev);
        pm_runtime_reinit(dev);
        dev_pm_set_driver_flags(dev, 0);
+       iommu_release_kernel_ownership(dev);
 done:
        return ret;
 }
@@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device *dev, struct device *parent)
                        dev->pm_domain->dismiss(dev);
                pm_runtime_reinit(dev);
                dev_pm_set_driver_flags(dev, 0);
+               iommu_release_kernel_ownership(dev);
 
                klist_remove(&dev->p->knode_driver);
                device_pm_check_callbacks(dev);

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-22  1:51       ` Tian, Kevin
@ 2021-09-22 12:40         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:40 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 01:51:03AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, September 21, 2021 11:42 PM
> > 
> >  - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc does
> >    not need locking (order it properly too, it is in the wrong order), and
> >    don't check for duplicate devices or dev_cookie duplication, that
> >    is user error and is harmless to the kernel.
> > 
> 
> I'm confused here. yes it's user error, but we check so many user errors
> and then return -EINVAL, -EBUSY, etc. Why is this one special?

Because it is expensive to calculate and forces a complicated locking
scheme into the kernel. Without this check you don't need the locking
that spans so much code, and simple RCU becomes acceptable.
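
A hedged sketch of what the lock-free path could look like once the xarray
does the work; names are illustrative and refcounting/lifetime handling is
omitted:

/*
 * Hedged sketch: xa_alloc() serializes on the xarray's internal lock
 * and hands back a small integer ID; lookups are plain xa_load(),
 * which is RCU safe, so no iommufd-wide mutex is needed and no
 * duplicate-cookie scan happens at all.
 */
static int iommufd_add_device(struct iommufd_ctx *ictx,
			      struct iommufd_device *idev)
{
	return xa_alloc(&ictx->device_xa, &idev->id, idev,
			xa_limit_32b, GFP_KERNEL);
}

static struct iommufd_device *iommufd_get_device(struct iommufd_ctx *ictx,
						 u32 id)
{
	return xa_load(&ictx->device_xa, id);
}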

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-22  3:30       ` Tian, Kevin
@ 2021-09-22 12:41         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:30:09AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 1:41 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:38PM +0800, Liu Yi L wrote:
> > > After a device is bound to the iommufd, userspace can use this interface
> > > to query the underlying iommu capability and format info for this device.
> > > Based on this information the user then creates I/O address space in a
> > > compatible format with the to-be-attached devices.
> > >
> > > Device cookie which is registered at binding time is used to mark the
> > > device which is being queried here.
> > >
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > >  drivers/iommu/iommufd/iommufd.c | 68 +++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/iommu.h      | 49 ++++++++++++++++++++++++
> > >  2 files changed, 117 insertions(+)
> > >
> > > diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> > > index e16ca21e4534..641f199f2d41 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -117,6 +117,71 @@ static int iommufd_fops_release(struct inode *inode, struct file *filep)
> > >  	return 0;
> > >  }
> > >
> > > +static struct device *
> > > +iommu_find_device_from_cookie(struct iommufd_ctx *ictx, u64 dev_cookie)
> > > +{
> > 
> > We have an xarray ID for the device, why are we allowing userspace to
> > use the dev_cookie as input?
> > 
> > Userspace should always pass in the ID. The only place dev_cookie
> > should appear is if the kernel generates an event back to
> > userspace. Then the kernel should return both the ID and the
> > dev_cookie in the event to allow userspace to correlate it.
> > 
> 
> A little background.
> 
> In earlier design proposal we discussed two options. One is to return
> an kernel-allocated ID (label) to userspace. The other is to have user
> register a cookie and use it in iommufd uAPI. At that time the two
> options were discussed exclusively and the cookie one is preferred.
> 
> Now you instead recommended a mixed option. We can follow it for
> sure if nobody objects.

Either is fine for the return; I'd return both just because it is
more flexible.

But the cookie should never be an input from userspace, and the kernel
should never search for it. Locating the kernel object is what the ID
and the xarray are for.
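
As a hedged illustration of that mixed option (the struct here is made up):
ioctls take the kernel-allocated ID as input, while any kernel-generated
event reports both identifiers so userspace can correlate it with its own
object:

/* illustrative event layout, not part of this series */
struct iommufd_event_user {
	__u32	dev_id;		/* kernel-allocated ID, also used in ioctls */
	__u32	pad;
	__u64	dev_cookie;	/* value userspace registered at bind time */
	/* ... event-specific details ... */
};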

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22  3:22           ` Tian, Kevin
@ 2021-09-22 12:50             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:22:42AM +0000, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Wednesday, September 22, 2021 9:07 AM
> > 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 8:55 AM
> > >
> > > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > > The opened atomic is aweful. A newly created fd should start in a
> > > > > state where it has a disabled fops
> > > > >
> > > > > The only thing the disabled fops can do is register the device to the
> > > > > iommu fd. When successfully registered the device gets the normal fops.
> > > > >
> > > > > The registration steps should be done under a normal lock inside the
> > > > > vfio_device. If a vfio_device is already registered then further
> > > > > registration should fail.
> > > > >
> > > > > Getting the device fd via the group fd triggers the same sequence as
> > > > > above.
> > > > >
> > > >
> > > > Above works if the group interface is also connected to iommufd, i.e.
> > > > making vfio type1 as a shim. In this case we can use the registration
> > > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > > today, then a new atomic is still necessary. This all depends on how
> > > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > > discussed here just adds another pound to the shim option...
> > >
> > > No, it works the same either way, the group FD path is identical to
> > > the normal FD path, it just triggers some of the state transitions
> > > automatically internally instead of requiring external ioctls.
> > >
> > > The device FDs starts disabled, an internal API binds it to the iommu
> > > via open coding with the group API, and then the rest of the APIs can
> > > be enabled. Same as today.
> > >
> 
> After reading your comments on patch08, I may have a clearer picture
> on your suggestion. The key is to handle exclusive access at the binding
> time (based on vdev->iommu_dev). Please see whether below makes 
> sense:
> 
> Shared sequence:
> 
> 1)  initialize the device with a parked fops;
> 2)  need binding (explicit or implicit) to move away from parked fops;
> 3)  switch to normal fops after successful binding;
> 
> 1) happens at device probe.

1) happens when the cdev is set up with the parked fops, yes. I'd say it
happens at fd open time.

> for nongroup 2) and 3) are done together in VFIO_DEVICE_GET_IOMMUFD:
> 
>   - 2) is done by calling .bind_iommufd() callback;
>   - 3) could be done within .bind_iommufd(), or via a new callback e.g.
>     .finalize_device(). The latter may be preferred for the group interface;
>   - Two threads may open the same device simultaneously, with exclusive 
>     access guaranteed by iommufd_bind_device();
>   - Open() after successful binding is rejected, since normal fops has been
>     activated. This is checked upon vdev->iommu_dev;

Almost: open() is always successful; what fails is
VFIO_DEVICE_GET_IOMMUFD (or the group equivalent). The user ends up
with an FD that is useless; it cannot reach the ops and thus cannot
impact the device it doesn't own in any way.

It is similar to opening a group FD

> for group 2/3) are done together in VFIO_GROUP_GET_DEVICE_FD:
> 
>   - 2) is done by open coding bind_iommufd + attach_ioas. Create an 
>     iommufd_device object and record it to vdev->iommu_dev
>   - 3) is done by calling .finalize_device();
>   - open() after a valid vdev->iommu_dev is rejected. this also ensures
>     exclusive ownership with the nongroup path.

Same comment as above: groups should go through the same sequence of
steps (create an FD, attempt to bind, and if successful make the FD
operational).

The only difference is that failure in these steps does not call
fd_install(). For this reason alone the FD could start out with
operational fops, but it feels like a needless optimization.

> If Alex also agrees with it, this might be another mini-series to be merged
> (just for group path) before this one. Doing so sort of nullifies the existing
> group/container attaching process, where attach_ioas will be skipped and
> now the security context is established when the device is opened.

I think it is really important to unify the DMA exclusion model and
lower it into the core iommu code. If there is a reason the exclusion
must be triggered on group fd open then the iommu core code should
provide an API to do that which interworks with the device API that
iommufd will use.

But I would start here because it is much simpler to understand..

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-21 17:44     ` Jason Gunthorpe via iommu
@ 2021-09-22 12:51       ` Liu, Yi L
  -1 siblings, 0 replies; 532+ messages in thread
From: Liu, Yi L @ 2021-09-22 12:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, Tian, Kevin,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 1:45 AM
> 
[...]
> > diff --git a/drivers/iommu/iommufd/iommufd.c
> b/drivers/iommu/iommufd/iommufd.c
> > index 641f199f2d41..4839f128b24a 100644
> > +++ b/drivers/iommu/iommufd/iommufd.c
> > @@ -24,6 +24,7 @@
> >  struct iommufd_ctx {
> >  	refcount_t refs;
> >  	struct mutex lock;
> > +	struct xarray ioasid_xa; /* xarray of ioasids */
> >  	struct xarray device_xa; /* xarray of bound devices */
> >  };
> >
> > @@ -42,6 +43,16 @@ struct iommufd_device {
> >  	u64 dev_cookie;
> >  };
> >
> > +/* Represent an I/O address space */
> > +struct iommufd_ioas {
> > +	int ioasid;
> 
> xarray id's should consistently be u32s everywhere.

sure. just one more check, this id is supposed to be returned to
userspace as the return value of ioctl(IOASID_ALLOC). That's why
I chose to use "int" as its prototype to make it aligned with the
return type of ioctl(). Based on this, do you think it's still better
to use "u32" here?

Regards,
Yi Liu

> Many of the same prior comments repeated here
>
> Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-22  3:41       ` Tian, Kevin
@ 2021-09-22 12:55         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:41:50AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 1:47 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:40PM +0800, Liu Yi L wrote:
> > > As aforementioned, userspace should check extension for what formats
> > > can be specified when allocating an IOASID. This patch adds such
> > > interface for userspace. In this RFC, iommufd reports EXT_MAP_TYPE1V2
> > > support and no no-snoop support yet.
> > >
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > >  drivers/iommu/iommufd/iommufd.c |  7 +++++++
> > >  include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
> > >  2 files changed, 34 insertions(+)
> > >
> > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > b/drivers/iommu/iommufd/iommufd.c
> > > index 4839f128b24a..e45d76359e34 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file
> > *filep,
> > >  		return ret;
> > >
> > >  	switch (cmd) {
> > > +	case IOMMU_CHECK_EXTENSION:
> > > +		switch (arg) {
> > > +		case EXT_MAP_TYPE1V2:
> > > +			return 1;
> > > +		default:
> > > +			return 0;
> > > +		}
> > >  	case IOMMU_DEVICE_GET_INFO:
> > >  		ret = iommufd_get_device_info(ictx, arg);
> > >  		break;
> > > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > > index 5cbd300eb0ee..49731be71213 100644
> > > +++ b/include/uapi/linux/iommu.h
> > > @@ -14,6 +14,33 @@
> > >  #define IOMMU_TYPE	(';')
> > >  #define IOMMU_BASE	100
> > >
> > > +/*
> > > + * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
> > > + *
> > > + * Check whether an uAPI extension is supported.
> > > + *
> > > + * It's unlikely that all planned capabilities in IOMMU fd will be ready
> > > + * in one breath. User should check which uAPI extension is supported
> > > + * according to its intended usage.
> > > + *
> > > + * A rough list of possible extensions may include:
> > > + *
> > > + *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> > > + *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
> > > + *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
> > > + *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
> > > + *	- EXT_IOASID_NESTING for what the name stands;
> > > + *	- EXT_USER_PAGE_TABLE for user managed page table;
> > > + *	- EXT_USER_PASID_TABLE for user managed PASID table;
> > > + *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> > > + *	- ...
> > > + *
> > > + * Return: 0 if not supported, 1 if supported.
> > > + */
> > > +#define EXT_MAP_TYPE1V2		1
> > > +#define EXT_DMA_NO_SNOOP	2
> > > +#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE,
> > IOMMU_BASE + 0)
> > 
> > I generally advocate for a 'try and fail' approach to discovering
> > compatibility.
> > 
> > If that doesn't work for the userspace then a query to return a
> > generic capability flag is the next best idea. Each flag should
> > clearly define what 'try and fail' it is talking about
> 
> We don't have strong preference here. Just follow what vfio does
> today. So Alex's opinion is appreciated here. 😊

This is a uAPI design; it should follow the current mainstream
thinking on how to build these things. There is a lot of old stuff in
vfio that doesn't match the modern thinking, IMHO.
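
As an illustration only (struct and ioctl names follow the RFC loosely
and are not a settled uAPI), the 'try and fail' style from userspace
would look like the following, with no CHECK_EXTENSION round trip:

static int ioasid_alloc_try_and_fail(int iommufd)
{
	struct iommu_ioasid_alloc alloc = {
		.argsz		= sizeof(alloc),
		.type		= IOMMU_IOASID_TYPE_KERNEL_TYPE1V2,
		.addr_width	= 48,
		.flags		= IOMMU_IOASID_ENFORCE_SNOOP,
	};
	int ioasid;

	/* just ask for what we want; the RFC returns the ID from ioctl() */
	ioasid = ioctl(iommufd, IOMMU_IOASID_ALLOC, &alloc);
	if (ioasid < 0) {
		/* not supported: retry with a smaller request or bail out */
		alloc.addr_width = 39;
		ioasid = ioctl(iommufd, IOMMU_IOASID_ALLOC, &alloc);
	}
	return ioasid;
}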

> > TYPE1V2 seems like nonsense
> 
> just in case other mapping protocols are introduced in the future

Well, we should never, ever do that. Allowing PPC and everything else
to split in VFIO has created a complete disaster in userspace. HW
specific extensions should be modeled as extensions, not a wholesale
replacement of everything.

I'd say this is part of the modern thinking on uAPI design.

What I want to strive for is a basic API that is usable with all HW,
and is what something like DPDK can exclusively use.

An extended API with HW specific facets exists for qemu to use to
build a HW backed, accelerated and featureful vIOMMU emulation.

The needs of qemu should not trump the requirement for a universal
basic API.

Eg if we can't figure out a basic API version of the PPC range issue
then that should be punted to a PPC specific API.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-22  3:53       ` Tian, Kevin
@ 2021-09-22 12:57         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:53:52AM +0000, Tian, Kevin wrote:

> Actually this was one open we closed in the previous design proposal, but
> it looks like you have a different thought now.
> 
> vfio maintains one ioas per container. Devices in the container
> can be attached to different domains (e.g. due to snoop format). Every
> time the ioas is updated, every attached domain is updated
> accordingly. 
> 
> You recommended one-ioas-one-domain model instead, i.e. any device 
> with a format incompatible with the one currently used in ioas has to 
> be attached to a new ioas, even if the two ioas's have the same mapping.
> This leads to compatibility check at attaching time.
> 
> Now you want to return to the vfio model?

Oh, I thought we circled back again.. If we are all OK with one ioas
one domain then great.
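
For illustration only (not actual code; names like ioas->domain and
idev->dev are assumptions), the attach-time check implied by the
one-ioas-one-domain model is roughly:

static int iommufd_ioas_attach_device(struct iommufd_ioas *ioas,
				      struct iommufd_device *idev)
{
	if (!ioas->domain) {
		/* the first attached device decides the domain/format */
		ioas->domain = iommu_domain_alloc(idev->dev->bus);
		if (!ioas->domain)
			return -ENOMEM;
	}

	/*
	 * If the IOMMU behind this device cannot use the existing domain
	 * the attach simply fails, and userspace has to put the device
	 * into a separate IOAS even if the mappings would be identical.
	 */
	return iommu_attach_device(ioas->domain, idev->dev);
}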

> > I think if this is taking in the iommufd_device then there isn't a
> > logical place to signal the PCIness
> 
> can you elaborate?

I mean just drop it and document it.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-22  3:56       ` Tian, Kevin
@ 2021-09-22 12:58         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 12:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:56:18AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 2:04 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > > This patch adds interface for userspace to attach device to specified
> > > IOASID.
> > >
> > > Note:
> > > One device can only be attached to one IOASID in this version. This is
> > > on par with what vfio provides today. In the future this restriction can
> > > be relaxed when multiple I/O address spaces are supported per device
> > 
> > ?? In VFIO the container is the IOS and the container can be shared
> > with multiple devices. This needs to start at about the same
> > functionality.
> 
> a device can be only attached to one container. One container can be
> shared by multiple devices.
> 
> a device can be only attached to one IOASID. One IOASID can be shared
> by multiple devices.
> 
> it does start at the same functionality.
> 
> > 
> > > +	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
> > 
> > This should be in the core code, right? There is nothing PCI specific
> > here.
> > 
> 
> but if you insist on a pci-wrapper attach function, we still need something
> here (e.g. with .attach_ioasid() callback)?

I would like to stop adding ioctls to this switch; the core code
should decode the ioctl and call a per-ioctl op like every other
subsystem does.

If you do that then you could have an op

 .attach_ioasid = vfio_full_device_attach,

And that is it for driver changes.

Every driver that uses type1 today should be updated to have the above
line and will then work with iommufd. mdevs will not be updated and
won't work.
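
A rough sketch of that direction, with illustrative names only (the
dispatch and op signatures are assumptions, not the final uAPI):

/* core code decodes the ioctl and calls a per-op callback */
static long vfio_device_fops_unl_ioctl(struct file *filep, unsigned int cmd,
				       unsigned long arg)
{
	struct vfio_device *device = filep->private_data;

	switch (cmd) {
	case VFIO_DEVICE_ATTACH_IOASID:
		if (!device->ops->attach_ioasid)
			return -ENOTTY;	/* e.g. unconverted mdev drivers */
		return device->ops->attach_ioasid(device, arg);
	default:
		return device->ops->ioctl(device, cmd, arg);
	}
}

/* driver side: the only change needed for an existing type1 user */
static const struct vfio_device_ops vfio_pci_ops = {
	/* ... existing callbacks unchanged ... */
	.attach_ioasid	= vfio_full_device_attach,
};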

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-22 12:51       ` Liu, Yi L
@ 2021-09-22 13:32         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 13:32 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, hch, jasowang, joro, jean-philippe, Tian, Kevin,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 12:51:38PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 1:45 AM
> > 
> [...]
> > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > b/drivers/iommu/iommufd/iommufd.c
> > > index 641f199f2d41..4839f128b24a 100644
> > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > @@ -24,6 +24,7 @@
> > >  struct iommufd_ctx {
> > >  	refcount_t refs;
> > >  	struct mutex lock;
> > > +	struct xarray ioasid_xa; /* xarray of ioasids */
> > >  	struct xarray device_xa; /* xarray of bound devices */
> > >  };
> > >
> > > @@ -42,6 +43,16 @@ struct iommufd_device {
> > >  	u64 dev_cookie;
> > >  };
> > >
> > > +/* Represent an I/O address space */
> > > +struct iommufd_ioas {
> > > +	int ioasid;
> > 
> > xarray id's should consistently be u32s everywhere.
> 
> sure. just one more check, this id is supposed to be returned to
> userspace as the return value of ioctl(IOASID_ALLOC). That's why
> I chose to use "int" as its prototype to make it aligned with the
> return type of ioctl(). Based on this, do you think it's still better
> to use "u32" here?

I suggest not using the return code from ioctl to exchange data. The
rest of the uAPI uses an in/out struct; everything should do that
consistently.
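
A sketch of what that could look like (field names are assumptions,
not the RFC's final layout): the new ID is written back through the
struct, the ioctl return value only carries 0 / -errno, and the
kernel-internal type stays a plain u32.

struct iommu_ioasid_alloc {
	__u32 argsz;
	__u32 flags;
	__aligned_u64 addr_width;
	__u32 out_ioasid;	/* written back by the kernel */
	__u32 __reserved;
};

static long iommufd_ioasid_alloc(struct iommufd_ctx *ictx,
				 struct iommu_ioasid_alloc __user *uptr)
{
	struct iommu_ioasid_alloc cmd;
	struct iommufd_ioas *ioas;
	u32 id;
	int ret;

	if (copy_from_user(&cmd, uptr, sizeof(cmd)))
		return -EFAULT;

	ioas = kzalloc(sizeof(*ioas), GFP_KERNEL);
	if (!ioas)
		return -ENOMEM;

	/* the xarray index is a u32 and never collides with an errno */
	ret = xa_alloc(&ictx->ioasid_xa, &id, ioas, xa_limit_32b, GFP_KERNEL);
	if (ret) {
		kfree(ioas);
		return ret;
	}

	cmd.out_ioasid = id;
	if (copy_to_user(uptr, &cmd, sizeof(cmd)))
		return -EFAULT;	/* real code would undo the allocation here */
	return 0;
}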

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 09/20] iommu: Add page size and address width attributes
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-22 13:42     ` Eric Auger
  -1 siblings, 0 replies; 532+ messages in thread
From: Eric Auger @ 2021-09-22 13:42 UTC (permalink / raw)
  To: Liu Yi L, alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, kevin.tian, parav, lkml, pbonzini, lushenming,
	corbet, ashok.raj, yi.l.liu, jun.j.tian, hao.wu, dave.jiang,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

Hi,

On 9/19/21 8:38 AM, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
>
> This exposes PAGE_SIZE and ADDR_WIDTH attributes. The iommufd could use
> them to define the IOAS.
>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  include/linux/iommu.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 943de6897f56..86d34e4ce05e 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -153,9 +153,13 @@ enum iommu_dev_features {
>  /**
>   * enum iommu_devattr - Per device IOMMU attributes
>   * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to be snooped.
> + * @IOMMU_DEV_INFO_PAGE_SIZE [u64]: Page sizes that iommu supports.
> + * @IOMMU_DEV_INFO_ADDR_WIDTH [u32]: Address width supported.
I think this deserves additional info: which address width are we
talking about (input or output), and for which stage if the IOMMU
supports multiple stages?

Thanks

Eric
>   */
>  enum iommu_devattr {
>  	IOMMU_DEV_INFO_FORCE_SNOOP,
> +	IOMMU_DEV_INFO_PAGE_SIZE,
> +	IOMMU_DEV_INFO_ADDR_WIDTH,
>  };
>  
>  #define IOMMU_PASID_INVALID	(-1U)


^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 12:22             ` Jason Gunthorpe via iommu
@ 2021-09-22 13:44               ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 13:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:23 PM
> 
> On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> 
> > > Providing an ioctl to bind to a normal VFIO container or group might
> > > allow a reasonable fallback in userspace..
> >
> > I didn't get this point though. An error in binding already allows the
> > user to fall back to the group path. Why do we need to introduce another
> > ioctl to explicitly bind to the container via the nongroup interface?
> 
> New userspace still needs a fallback path if it hits the 'try and
> fail'. Keeping the device FD open and just using a different ioctl to
> bind to a container/group FD, which new userspace can then obtain as a
> fallback, might be OK.
> 
> Hard to see without going through the qemu parts, so maybe just keep
> it in mind
> 

sure. will figure it out when working on the qemu part.
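
For reference, a hypothetical userspace flow for the fallback being
discussed (error handling omitted; the device path, group number,
device name and the bind ioctl argument are placeholders; the legacy
half is today's VFIO uAPI):

static int open_device_with_fallback(void)
{
	int devfd = open("/dev/vfio/devices/vfio0", O_RDWR);
	int iommufd = open("/dev/iommu", O_RDWR);

	/* try the device-centric path first */
	if (ioctl(devfd, VFIO_DEVICE_GET_IOMMUFD, &iommufd) == 0)
		return devfd;

	/* refused: fall back to today's group/container path */
	close(devfd);
	close(iommufd);

	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/42", O_RDWR);

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
	return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:02.0");
}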

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-22 13:45     ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 532+ messages in thread
From: Jean-Philippe Brucker @ 2021-09-22 13:45 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> This patch adds IOASID allocation/free interface per iommufd. When
> allocating an IOASID, userspace is expected to specify the type and
> format information for the target I/O page table.
> 
> This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> implying a kernel-managed I/O page table with vfio type1v2 mapping
> semantics. For this type the user should specify the addr_width of
> the I/O address space and whether the I/O page table is created in
> an iommu enfore_snoop format. enforce_snoop must be true at this point,
> as the false setting requires additional contract with KVM on handling
> WBINVD emulation, which can be added later.
> 
> Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> for what formats can be specified when allocating an IOASID.
> 
> Open:
> - Devices on PPC platform currently use a different iommu driver in vfio.
>   Per previous discussion they can also use vfio type1v2 as long as there
>   is a way to claim a specific iova range from a system-wide address space.

Is this the reason for passing addr_width to IOASID_ALLOC?  I didn't get
what it's used for or why it's mandatory. But for PPC it sounds like it
should be an address range instead of an upper limit?

Thanks,
Jean

>   This requirement doesn't sound PPC specific, as addr_width for pci devices
>   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
>   adopted this design yet. We hope to have formal alignment in v1 discussion
>   and then decide how to incorporate it in v2.
> 
> - Currently ioasid term has already been used in the kernel (drivers/iommu/
>   ioasid.c) to represent the hardware I/O address space ID in the wire. It
>   covers both PCI PASID (Process Address Space ID) and ARM SSID (Sub-Stream
>   ID). We need find a way to resolve the naming conflict between the hardware
>   ID and software handle. One option is to rename the existing ioasid to be
>   pasid or ssid, given their full names still sound generic. Appreciate more
>   thoughts on this open!

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
  2021-09-22 12:39         ` Jason Gunthorpe via iommu
@ 2021-09-22 13:56           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 13:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:40 PM
> 
> On Wed, Sep 22, 2021 at 01:47:05AM +0000, Tian, Kevin wrote:
> 
> > > IIRC in VFIO the container is the IOAS and when the group goes to
> > > create the device fd it should simply do the
> > > iommu_device_init_user_dma() followed immediately by a call to bind
> > > the container IOAS as your #3.
> >
> > a slight correction.
> >
> > to meet vfio semantics we could do init_user_dma() at group attach
> > time and then call binding to container IOAS when the device fd
> > is created. This is because vfio requires the group in a security context
> > before the device is opened.
> 
> Is it? Until a device FD is opened the group fd is kind of idle, right?

Yes, then there is no user-tangible difference between init_user_dma()
at group attach time vs. doing it when the device fd is opened. But the
latter does require more change than the former, as it also needs the
vfio iommu driver to provide a .device_attach callback.

What's in my mind now is to keep the existing group attach sequence,
which further calls a group-version init_user_dma(). Then when the
device fd is created, just create an iommu_dev object and switch to
the normal fops.

> 
> > > Ie the basic flow would see the driver core doing some:
> >
> > Just double confirm. Is there concern on having the driver core to
> > call iommu functions?
> 
It is always an interesting question, but I'd say iommu is
foundational to Linux and if it needs driver core help it shouldn't
be any different from PM, pinctrl, or other subsystems that have
inserted themselves into the driver core.
> 
> Something kind of like the below.
> 
> If I recall, once it is done like this then the entire iommu notifier
> infrastructure can be ripped out which is a lot of code.

thanks for the guidance. will think more along this direction...

> 
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 68ea1f949daa90..e39612c99c6123 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -566,6 +566,10 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
>                 goto done;
>         }
> 
> +       ret = iommu_set_kernel_ownership(dev);
> +       if (ret)
> +               return ret;
> +
>  re_probe:
>         dev->driver = drv;
> 
> @@ -673,6 +677,7 @@ static int really_probe(struct device *dev, struct
> device_driver *drv)
>                 dev->pm_domain->dismiss(dev);
>         pm_runtime_reinit(dev);
>         dev_pm_set_driver_flags(dev, 0);
> +       iommu_release_kernel_ownership(dev);
>  done:
>         return ret;
>  }
> @@ -1214,6 +1219,7 @@ static void __device_release_driver(struct device
> *dev, struct device *parent)
>                         dev->pm_domain->dismiss(dev);
>                 pm_runtime_reinit(dev);
>                 dev_pm_set_driver_flags(dev, 0);
> +               iommu_release_kernel_ownership(dev);
> 
>                 klist_remove(&dev->p->knode_driver);
>                 device_pm_check_callbacks(dev);

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-22 12:40         ` Jason Gunthorpe via iommu
@ 2021-09-22 13:59           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 13:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, parav, alex.williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, iommu, pbonzini, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:41 PM
> 
> On Wed, Sep 22, 2021 at 01:51:03AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, September 21, 2021 11:42 PM
> > >
> > >  - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc
> does
> > >    not need locking (order it properly too, it is in the wrong order), and
> > >    don't check for duplicate devices or dev_cookie duplication, that
> > >    is user error and is harmless to the kernel.
> > >
> >
> > I'm confused here. yes it's user error, but we check so many user errors
> > and then return -EINVAL, -EBUSY, etc. Why is this one special?
> 
> Because it is expensive to calculate and forces a complicated locking
> scheme into the kernel. Without this check you don't need the locking
> that spans so much code, and simple RCU becomes acceptable.
> 

In case of duplication the kernel just uses the first entry which matches
the device when sending an event to userspace?

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
  2021-09-22  3:40       ` Tian, Kevin
@ 2021-09-22 14:09         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 14:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 1:45 AM
> > 
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enfore_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > >   is a way to claim a specific iova range from a system-wide address space.
> > >   This requirement doesn't sound PPC specific, as addr_width for pci
> > devices
> > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > >   adopted this design yet. We hope to have formal alignment in v1
> > discussion
> > >   and then decide how to incorporate it in v2.
> > 
> > I think the request was to include a start/end IO address hint when
> > creating the ios. When the kernel creates it then it can return the
> 
> is the hint single-range or could be multiple-ranges?

David explained it here:

https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/

qemu needs to be able to choose if it gets the 32-bit range or the
64-bit range.

So a 'range hint' will do the job

David also suggested this:

https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/

So I like this better:

struct iommu_ioasid_alloc {
	__u32	argsz;

	__u32	flags;
#define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
#define IOMMU_IOASID_HINT_BASE_IOVA	(1 << 1)

	__aligned_u64 max_iova_hint;
	__aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA

	// For creating nested page tables
	__u32 parent_ios_id;
	__u32 format;
#define IOMMU_FORMAT_KERNEL 0
#define IOMMU_FORMAT_PPC_XXX 2
#define IOMMU_FORMAT_[..]
	__u32 format_flags; // Layout depends on format above

	__aligned_u64 user_page_directory;  // Used if parent_ios_id != 0
};

Again 'type' as an overall API indicator should not exist, feature
flags need to have clear narrow meanings.

This does both of David's suggestions at once. If qemu wants the
1G-limited region it can specify max_iova_hint = 1G; if it wants the
extended 64-bit region with the hole it can give either the high base or
a large max_iova_hint. format/format_flags allows a further
device-specific escape if more customization is needed, and is needed to
specify userspace page tables anyhow.
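
As a purely illustrative userspace sketch against the struct above (the
ioctl name follows the patch title; the struct layout is this mail's
proposal, not the posted RFC, and the return convention is an assumption):

/* needs <sys/ioctl.h>, <stdbool.h> and the not-yet-existing uapi header */
static int alloc_ioas(int iommu_fd, bool want_high_64bit_region)
{
	struct iommu_ioasid_alloc alloc = {
		.argsz  = sizeof(alloc),
		.flags  = IOMMU_IOASID_ENFORCE_SNOOP,
		.format = IOMMU_FORMAT_KERNEL,
	};

	if (!want_high_64bit_region) {
		/* the 1G-limited region */
		alloc.max_iova_hint = 1ULL << 30;
	} else {
		/* the extended 64-bit region above a high base */
		alloc.flags |= IOMMU_IOASID_HINT_BASE_IOVA;
		alloc.base_iova_hint = 1ULL << 40;	/* illustrative value */
		alloc.max_iova_hint  = ~0ULL;
	}

	/* assumed to return the new ioas_id, or -1 with errno set */
	return ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc);
}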

> > ioas works well here I think. Use ioas_id to refer to the xarray
> > index.
> 
> What about when introducing pasid to this uAPI? Then use ioas_id
> for the xarray index

Yes, ioas_id should always be the xarray index.

PASID needs to be called out as PASID or as a generic "hw description"
blob.

kvm's API to program the vPASID translation table should probably take
in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side
information using an in-kernel API. Userspace shouldn't have to
shuttle it around.
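
A minimal sketch of what that in-kernel API could look like (all names
here are invented for illustration; nothing like this exists in the RFC):

/* hypothetical: what KVM would call instead of trusting userspace data */
struct iommufd_pasid_info {
	u32 pasid;	/* the physical PASID backing the vPASID */
};

int iommufd_get_pasid_info(struct file *iommufd, u32 ioas_id, u64 device_id,
			   struct iommufd_pasid_info *info);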

I'm starting to feel like the struct approach for describing this uAPI
might not scale well, but let's see.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices
  2021-09-22 12:50             ` Jason Gunthorpe via iommu
@ 2021-09-22 14:09               ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:51 PM
> 
> On Wed, Sep 22, 2021 at 03:22:42AM +0000, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Wednesday, September 22, 2021 9:07 AM
> > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, September 22, 2021 8:55 AM
> > > >
> > > > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > > > The opened atomic is awful. A newly created fd should start in a
> > > > > > state where it has a disabled fops
> > > > > >
> > > > > > The only thing the disabled fops can do is register the device to the
> > > > > > iommu fd. When successfully registered the device gets the normal
> fops.
> > > > > >
> > > > > > The registration steps should be done under a normal lock inside
> the
> > > > > > vfio_device. If a vfio_device is already registered then further
> > > > > > registration should fail.
> > > > > >
> > > > > > Getting the device fd via the group fd triggers the same sequence as
> > > > > > above.
> > > > > >
> > > > >
> > > > > Above works if the group interface is also connected to iommufd, i.e.
> > > > > making vfio type1 as a shim. In this case we can use the registration
> > > > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > > > today, then a new atomic is still necessary. This all depends on how
> > > > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > > > discussed here just adds another pound to the shim option...
> > > >
> > > > No, it works the same either way, the group FD path is identical to
> > > > the normal FD path, it just triggers some of the state transitions
> > > > automatically internally instead of requiring external ioctls.
> > > >
> > > > The device FDs starts disabled, an internal API binds it to the iommu
> > > > via open coding with the group API, and then the rest of the APIs can
> > > > be enabled. Same as today.
> > > >
> >
> > After reading your comments on patch08, I may have a clearer picture
> > on your suggestion. The key is to handle exclusive access at the binding
> > time (based on vdev->iommu_dev). Please see whether below makes
> > sense:
> >
> > Shared sequence:
> >
> > 1)  initialize the device with a parked fops;
> > 2)  need binding (explicit or implicit) to move away from parked fops;
> > 3)  switch to normal fops after successful binding;
> >
> > 1) happens at device probe.
> 
> 1 happens when the cdev is setup with the parked fops, yes. I'd say it
> happens at fd open time.
> 
> > for nongroup 2) and 3) are done together in VFIO_DEVICE_GET_IOMMUFD:
> >
> >   - 2) is done by calling .bind_iommufd() callback;
> >   - 3) could be done within .bind_iommufd(), or via a new callback e.g.
> >     .finalize_device(). The latter may be preferred for the group interface;
> >   - Two threads may open the same device simultaneously, with exclusive
> >     access guaranteed by iommufd_bind_device();
> >   - Open() after successful binding is rejected, since normal fops has been
> >     activated. This is checked upon vdev->iommu_dev;
> 
> Almost, open is always successful, what fails is
> VFIO_DEVICE_GET_IOMMUFD (or the group equivalent). The user ends up
> with a FD that is useless, cannot reach the ops and thus cannot impact
> the device it doesn't own in any way.

Makes sense. I had a wrong impression that once the normal fops is
activated it is also visible to other threads. But in concept this fops
replacement is local to each open file, thus another thread
opening the device always gets a parked fops.

> 
> It is similar to opening a group FD
> 
> > for group 2/3) are done together in VFIO_GROUP_GET_DEVICE_FD:
> >
> >   - 2) is done by open coding bind_iommufd + attach_ioas. Create an
> >     iommufd_device object and record it to vdev->iommu_dev
> >   - 3) is done by calling .finalize_device();
> >   - open() after a valid vdev->iommu_dev is rejected. this also ensures
> >     exclusive ownership with the nongroup path.
> 
> Same comment as above, groups should go through the same sequence of
> steps, create a FD, attempt to bind, if successful make the FD
> operational.
> 
> The only difference is that failure in these steps does not call
> fd_install(). For this reason alone the FD could start out with
> operational fops, but it feels like a needless optimization.
> 
> > If Alex also agrees with it, this might be another mini-series to be merged
> > (just for group path) before this one. Doing so sort of nullifies the existing
> > group/container attaching process, where attach_ioas will be skipped and
> > now the security context is established when the device is opened.
> 
> I think it is really important to unify DMA exclusion model and lower
> to the core iommu code. If there is a reason the exclusion must be
> triggered on group fd open then the iommu core code should provide an
> API to do that which interworks with the device API iommufd will work.
> 
> But I would start here because it is much simpler to understand..
> 

Let's work on this task first and figure out the cleaner way to unify
it. My current impression is that having an iommu API for group fd open
might be simpler. Currently the vfio iommu drivers are coupled with the
container through group-granular operations. Adapting them to device fd
open will require more changes to handle the device<->group relationship.
Anyway, we'll see...
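
For reference, a minimal sketch of the parked-fops sequence discussed
above (the helper and fops names are made up; only VFIO_DEVICE_GET_IOMMUFD
comes from this thread):

static long vfio_device_parked_ioctl(struct file *filep, unsigned int cmd,
				     unsigned long arg)
{
	struct vfio_device *vdev = filep->private_data;

	/* everything except binding is disabled until the bind succeeds */
	if (cmd != VFIO_DEVICE_GET_IOMMUFD)
		return -ENODEV;

	/*
	 * Hypothetical core helper: establishes the security context via
	 * iommufd and, on success, installs the full fops on this struct
	 * file only (e.g. with replace_fops()), so other open files on
	 * the same cdev stay parked.
	 */
	return vfio_device_bind_iommufd(vdev, filep, arg);
}

static const struct file_operations vfio_device_parked_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= vfio_device_parked_ioctl,
};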

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 01/20] iommu/iommufd: Add /dev/iommu core
  2021-09-22 13:59           ` Tian, Kevin
@ 2021-09-22 14:10             ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 14:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 01:59:39PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 8:41 PM
> > 
> > On Wed, Sep 22, 2021 at 01:51:03AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, September 21, 2021 11:42 PM
> > > >
> > > >  - Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc
> > does
> > > >    not need locking (order it properly too, it is in the wrong order), and
> > > >    don't check for duplicate devices or dev_cookie duplication, that
> > > >    is user error and is harmless to the kernel.
> > > >
> > >
> > > I'm confused here. yes it's user error, but we check so many user errors
> > > and then return -EINVAL, -EBUSY, etc. Why is this one special?
> > 
> > Because it is expensive to calculate and forces a complicated locking
> > scheme into the kernel. Without this check you don't need the locking
> > that spans so much code, and simple RCU becomes acceptable.
> 
> In case of duplication the kernel just uses the first entry which matches
> the device when sending an event to userspace?

Sure

Jason
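
To make the 'simple RCU' point concrete, a sketch of the lookup side,
assuming the bound devices hang off the iommufd context on an
RCU-protected list (the member names are illustrative, not from the RFC):

static struct iommufd_device *
iommufd_find_device(struct iommufd_ctx *ictx, u64 dev_cookie)
{
	struct iommufd_device *idev, *found = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(idev, &ictx->device_list, next) {
		if (idev->dev_cookie == dev_cookie) {
			/* on duplication this is simply the first match */
			found = idev;
			break;
		}
	}
	rcu_read_unlock();
	/* real code would take a reference before dropping the RCU lock */
	return found;
}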

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 12/20] iommu/iommufd: Add IOMMU_CHECK_EXTENSION
  2021-09-22 12:55         ` Jason Gunthorpe via iommu
  (?)
@ 2021-09-22 14:13         ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, Jiang, Dave, Raj,
	Ashok, corbet, parav, alex.williamson, lkml, david, dwmw2, Tian,
	Jun J, linux-kernel, lushenming, pbonzini, robin.murphy

> From: Jason Gunthorpe
> Sent: Wednesday, September 22, 2021 8:55 PM
> 
> On Wed, Sep 22, 2021 at 03:41:50AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 1:47 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:40PM +0800, Liu Yi L wrote:
> > > > As aforementioned, userspace should check extension for what formats
> > > > can be specified when allocating an IOASID. This patch adds such
> > > > interface for userspace. In this RFC, iommufd reports
> EXT_MAP_TYPE1V2
> > > > support and no no-snoop support yet.
> > > >
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > >  drivers/iommu/iommufd/iommufd.c |  7 +++++++
> > > >  include/uapi/linux/iommu.h      | 27 +++++++++++++++++++++++++++
> > > >  2 files changed, 34 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/iommufd/iommufd.c
> > > b/drivers/iommu/iommufd/iommufd.c
> > > > index 4839f128b24a..e45d76359e34 100644
> > > > +++ b/drivers/iommu/iommufd/iommufd.c
> > > > @@ -306,6 +306,13 @@ static long iommufd_fops_unl_ioctl(struct file
> > > *filep,
> > > >  		return ret;
> > > >
> > > >  	switch (cmd) {
> > > > +	case IOMMU_CHECK_EXTENSION:
> > > > +		switch (arg) {
> > > > +		case EXT_MAP_TYPE1V2:
> > > > +			return 1;
> > > > +		default:
> > > > +			return 0;
> > > > +		}
> > > >  	case IOMMU_DEVICE_GET_INFO:
> > > >  		ret = iommufd_get_device_info(ictx, arg);
> > > >  		break;
> > > > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > > > index 5cbd300eb0ee..49731be71213 100644
> > > > +++ b/include/uapi/linux/iommu.h
> > > > @@ -14,6 +14,33 @@
> > > >  #define IOMMU_TYPE	(';')
> > > >  #define IOMMU_BASE	100
> > > >
> > > > +/*
> > > > + * IOMMU_CHECK_EXTENSION - _IO(IOMMU_TYPE, IOMMU_BASE + 0)
> > > > + *
> > > > + * Check whether an uAPI extension is supported.
> > > > + *
> > > > + * It's unlikely that all planned capabilities in IOMMU fd will be ready
> > > > + * in one breath. User should check which uAPI extension is supported
> > > > + * according to its intended usage.
> > > > + *
> > > > + * A rough list of possible extensions may include:
> > > > + *
> > > > + *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> > > > + *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
> > > > + *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
> > > > + *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
> > > > + *	- EXT_IOASID_NESTING for what the name stands;
> > > > + *	- EXT_USER_PAGE_TABLE for user managed page table;
> > > > + *	- EXT_USER_PASID_TABLE for user managed PASID table;
> > > > + *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> > > > + *	- ...
> > > > + *
> > > > + * Return: 0 if not supported, 1 if supported.
> > > > + */
> > > > +#define EXT_MAP_TYPE1V2		1
> > > > +#define EXT_DMA_NO_SNOOP	2
> > > > +#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE,
> > > IOMMU_BASE + 0)
> > >
> > > I generally advocate for a 'try and fail' approach to discovering
> > > compatibility.
> > >
> > > If that doesn't work for the userspace then a query to return a
> > > generic capability flag is the next best idea. Each flag should
> > > clearly define what 'try and fail' it is talking about
> >
> > We don't have strong preference here. Just follow what vfio does
> > today. So Alex's opinion is appreciated here. 😊
> 
> This is a uAPI design, it should follow the current mainstream
> thinking on how to build these things. There is a lot of old stuff in
> vfio that doesn't match the modern thinking. IMHO.
> 
> > > TYPE1V2 seems like nonsense
> >
> > just in case other mapping protocols are introduced in the future
> 
> Well, we should never, ever do that. Allowing PPC and everything else
> to split in VFIO has created a complete disaster in userspace.
> HW-specific extensions should be modeled as extensions, not a wholesale
> replacement of everything.
> 
> I'd say this is part of the modern thinking on uAPI design.
> 
> What I want to strive for is that the basic API is usable with all HW -
> and is what something like DPDK can exclusively use.
> 
> An extended API with HW-specific facets exists for qemu to use to
> build a HW-backed, accelerated and featureful vIOMMU emulation.
> 
> The needs of qemu should not trump the requirement for a universal
> basic API.
> 
> Eg if we can't figure out a basic API version of the PPC range issue
> then that should be punted to a PPC specific API.
> 

Sounds good. I may be misremembering the multiple mapping protocols
thing. 😊
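
As a userspace-side illustration of the two discovery styles (both
sketches reuse the ioctl names quoted above and the alloc struct sketched
earlier in the thread; error handling is simplified):

/* style used by this patch: pre-query the extension, then allocate */
static int alloc_checked(int iommu_fd, struct iommu_ioasid_alloc *alloc)
{
	if (ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_MAP_TYPE1V2) != 1)
		return -1;
	return ioctl(iommu_fd, IOMMU_IOASID_ALLOC, alloc);
}

/* 'try and fail': the allocation itself is the capability check */
static int alloc_try(int iommu_fd, struct iommu_ioasid_alloc *alloc)
{
	return ioctl(iommu_fd, IOMMU_IOASID_ALLOC, alloc);
}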

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()
  2021-09-22 12:57         ` Jason Gunthorpe via iommu
@ 2021-09-22 14:16           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:57 PM
> 
> On Wed, Sep 22, 2021 at 03:53:52AM +0000, Tian, Kevin wrote:
> 
> > Actually this was one open we closed in previous design proposal, but
> > looks you have a different thought now.
> >
> > vfio maintains one ioas per container. Devices in the container
> > can be attached to different domains (e.g. due to snoop format). Every
> > time when the ioas is updated, every attached domain is updated
> > in accordance.
> >
> > You recommended one-ioas-one-domain model instead, i.e. any device
> > with a format incompatible with the one currently used in ioas has to
> > be attached to a new ioas, even if the two ioas's have the same mapping.
> > This leads to compatibility check at attaching time.
> >
> > Now you want returning back to the vfio model?
> 
> Oh, I thought we circled back again.. If we are all OK with one ioas
> one domain then great.

Yes, at least I haven't seen a blocking issue with this assumption. Later,
when converting vfio type1 into a shim, it could create multiple ioas's
if the container had a list of domains before the shim.

> 
> > > I think if this is taking in the iommufd_device then there isn't a logical
> > > place to signal the PCIness
> >
> > can you elaborate?
> 
> I mean just drop it and document it.
> 

got you

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 15/20] vfio/pci: Add VFIO_DEVICE_[DE]ATTACH_IOASID
  2021-09-22 12:58         ` Jason Gunthorpe via iommu
@ 2021-09-22 14:17           ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:59 PM
> 
> On Wed, Sep 22, 2021 at 03:56:18AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 2:04 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:43PM +0800, Liu Yi L wrote:
> > > > This patch adds interface for userspace to attach device to specified
> > > > IOASID.
> > > >
> > > > Note:
> > > > One device can only be attached to one IOASID in this version. This is
> > > > on par with what vfio provides today. In the future this restriction can
> > > > be relaxed when multiple I/O address spaces are supported per device
> > >
> > > ?? In VFIO the container is the IOS and the container can be shared
> > > with multiple devices. This needs to start at about the same
> > > functionality.
> >
> > a device can be only attached to one container. One container can be
> > shared by multiple devices.
> >
> > a device can be only attached to one IOASID. One IOASID can be shared
> > by multiple devices.
> >
> > it does start at the same functionality.
> >
> > >
> > > > +	} else if (cmd == VFIO_DEVICE_ATTACH_IOASID) {
> > >
> > > This should be in the core code, right? There is nothing PCI specific
> > > here.
> > >
> >
> > but if you insist on a pci-wrapper attach function, we still need something
> > here (e.g. with .attach_ioasid() callback)?
> 
> I would like to stop adding ioctls to this switch, the core code
> should decode the ioctl and call a per-ioctl op like every other
> subsystem does.
> 
> If you do that then you could have an op
> 
>  .attach_ioasid = vfio_full_device_attach,
> 
> And that is it for driver changes.
> 
> Every driver that use type1 today should be updated to have the above
> line and will work with iommufd. mdevs will not be updated and won't
> work.
> 

will do. 
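
A rough sketch of what that per-ioctl op could look like (the
.attach_ioasid member and the core helper are assumptions about a v2, not
code from this RFC; vfio_full_device_attach is the name used above):

static long vfio_device_fops_unl_ioctl(struct file *filep, unsigned int cmd,
				       unsigned long arg)
{
	struct vfio_device *device = filep->private_data;

	switch (cmd) {
	case VFIO_DEVICE_ATTACH_IOASID:
		/* core code copies in and validates the uapi struct, then
		 * calls device->ops->attach_ioasid() */
		return vfio_device_attach_ioasid(device, arg);
	default:
		return device->ops->ioctl(device, cmd, arg);
	}
}

/* ...and the only per-driver change is one line in its ops: */
static const struct vfio_device_ops vfio_pci_ops = {
	/* existing callbacks unchanged */
	.attach_ioasid	= vfio_full_device_attach,
};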

^ permalink raw reply	[flat|nested] 532+ messages in thread

* RE: [RFC 09/20] iommu: Add page size and address width attributes
  2021-09-22 13:42     ` Eric Auger
@ 2021-09-22 14:19       ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 14:19 UTC (permalink / raw)
  To: eric.auger, Liu, Yi L, alex.williamson, jgg, hch, jasowang, joro
  Cc: jean-philippe, parav, lkml, pbonzini, lushenming, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, September 22, 2021 9:43 PM
> 
> Hi,
> 
> On 9/19/21 8:38 AM, Liu Yi L wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> >
> > This exposes PAGE_SIZE and ADDR_WIDTH attributes. The iommufd could
> use
> > them to define the IOAS.
> >
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > ---
> >  include/linux/iommu.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 943de6897f56..86d34e4ce05e 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -153,9 +153,13 @@ enum iommu_dev_features {
> >  /**
> >   * enum iommu_devattr - Per device IOMMU attributes
> >   * @IOMMU_DEV_INFO_FORCE_SNOOP [bool]: IOMMU can force DMA to
> be snooped.
> > + * @IOMMU_DEV_INFO_PAGE_SIZE [u64]: Page sizes that iommu
> supports.
> > + * @IOMMU_DEV_INFO_ADDR_WIDTH [u32]: Address width supported.
> I think this deserves additional info. What address width do we talk
> about: input or output, and which stage if the IOMMU supports multiple stages?
> 

It describes the I/O address space width, so it is about the input side.

When multiple stages are supported, each stage is represented by a separate
ioasid, each with its own addr_width.

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 17/20] iommu/iommufd: Report iova range to userspace
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-22 14:49     ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 532+ messages in thread
From: Jean-Philippe Brucker @ 2021-09-22 14:49 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, jgg, hch, jasowang, joro, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Sun, Sep 19, 2021 at 02:38:45PM +0800, Liu Yi L wrote:
> [HACK. will fix in v2]
> 
> IOVA range is critical info for userspace to manage DMA for an I/O address
> space. This patch reports the valid iova range info of a given device.
> 
> Due to aforementioned hack, this info comes from the hacked vfio type1
> driver. To follow the same format in vfio, we also introduce a cap chain
> format in IOMMU_DEVICE_GET_INFO to carry the iova range info.
[...]
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 49731be71213..f408ad3c8ade 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -68,6 +68,7 @@
>   *		   +---------------+------------+
>   *		   ...
>   * @addr_width:    the address width of supported I/O address spaces.
> + * @cap_offset:	   Offset within info struct of first cap
>   *
>   * Availability: after device is bound to iommufd
>   */
> @@ -77,9 +78,11 @@ struct iommu_device_info {
>  #define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */
>  #define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
>  #define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_width field valid */
> +#define IOMMU_DEVICE_INFO_CAPS		(1 << 3) /* info supports cap chain */
>  	__u64	dev_cookie;
>  	__u64   pgsize_bitmap;
>  	__u32	addr_width;
> +	__u32   cap_offset;

We can also add vendor-specific page table and PASID table properties as
capabilities, otherwise we'll need giant unions in the iommu_device_info
struct. That made me wonder whether pgsize and addr_width should also be
separate capabilities for consistency, but this way might be good enough.
There won't be many more generic capabilities. I have "output address
width" and "PASID width", the rest is specific to Arm and SMMU table
formats.

Thanks,
Jean

>  };
>  
>  #define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
> -- 
> 2.25.1
> 
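
For reference, a sketch of how userspace might walk such a cap chain,
assuming it mirrors the VFIO cap chain layout (the header struct below is
an assumption about a future revision, not something defined in this RFC;
the info fields come from the quoted patch):

struct iommu_info_cap_header {
	__u16	id;
	__u16	version;
	__u32	next;	/* offset of next cap from start of info struct, 0 ends the chain */
};

static struct iommu_info_cap_header *
iommu_info_cap_find(struct iommu_device_info *info, __u16 id)
{
	__u32 off = info->cap_offset;

	if (!(info->flags & IOMMU_DEVICE_INFO_CAPS))
		return NULL;

	while (off) {
		struct iommu_info_cap_header *hdr =
			(struct iommu_info_cap_header *)((char *)info + off);

		if (hdr->id == id)
			return hdr;
		off = hdr->next;
	}
	return NULL;
}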

^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 12:22             ` Jason Gunthorpe via iommu
@ 2021-09-22 20:10               ` Alex Williamson
  -1 siblings, 0 replies; 532+ messages in thread
From: Alex Williamson @ 2021-09-22 20:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, 22 Sep 2021 09:22:52 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> 
> > > Providing an ioctl to bind to a normal VFIO container or group might
> > > allow a reasonable fallback in userspace..  
> > 
> > I didn't get this point though. An error in binding already allows the
> > user to fall back to the group path. Why do we need introduce another
> > ioctl to explicitly bind to container via the nongroup interface?   
> 
> New userspace still needs a fallback path if it hits the 'try and
> fail'. Keeping the device FD open and just using a different ioctl to
> bind to a container/group FD, which new userspace can then obtain as a
> fallback, might be OK.
> 
> Hard to see without going through the qemu parts, so maybe just keep
> it in mind

If we assume that the container/group/device interface is essentially
deprecated once we have iommufd, it doesn't make a lot of sense to me
to tack on a container/device interface just so userspace can avoid
reverting to the fully legacy interface.

But why would we create vfio device interface files at all if they
can't work?  I'm not really on board with creating a try-and-fail
interface for a mechanism that cannot work for a given device.  The
existence of the device interface should indicate that it's supported.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-21 17:29     ` Jason Gunthorpe via iommu
@ 2021-09-22 21:01       ` Alex Williamson
  -1 siblings, 0 replies; 532+ messages in thread
From: Alex Williamson @ 2021-09-22 21:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Tue, 21 Sep 2021 14:29:39 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > +struct vfio_device_iommu_bind_data {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +	__s32	iommu_fd;
> > +	__u64	dev_cookie;  
> 
> Missing explicit padding
> 
> Always use __aligned_u64 in uapi headers, fix all the patches.

We don't need padding or explicit alignment if we just swap the order
of iommu_fd and dev_cookie.  Thanks,

Alex
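
For reference, the reordering being suggested is simply (same fields as
the quoted struct; dev_cookie now starts at an 8-byte-aligned offset on
both 32-bit and 64-bit ABIs):

struct vfio_device_iommu_bind_data {
	__u32	argsz;
	__u32	flags;
	__u64	dev_cookie;
	__s32	iommu_fd;
};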


^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 05/20] vfio/pci: Register device to /dev/vfio/devices
  2021-09-22  1:19         ` Tian, Kevin
@ 2021-09-22 21:17           ` Alex Williamson
  -1 siblings, 0 replies; 532+ messages in thread
From: Alex Williamson @ 2021-09-22 21:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, 22 Sep 2021 01:19:08 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 22, 2021 5:09 AM
> > 
> > On Tue, 21 Sep 2021 13:40:01 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Sun, Sep 19, 2021 at 02:38:33PM +0800, Liu Yi L wrote:  
> > > > This patch exposes the device-centric interface for vfio-pci devices. To
> > > > be compatible with existing users, vfio-pci exposes both legacy group
> > > > interface and device-centric interface.
> > > >
> > > > As explained in last patch, this change doesn't apply to devices which
> > > > cannot be forced to snoop cache by their upstream iommu. Such devices
> > > > are still expected to be opened via the legacy group interface.  
> > 
> > This doesn't make much sense to me.  The previous patch indicates
> > there's work to be done in updating the kvm-vfio contract to understand
> > DMA coherency, so you're trying to limit use cases to those where the
> > IOMMU enforces coherency, but there's QEMU work to be done to support
> > the iommufd uAPI at all.  Isn't part of that work to understand how KVM
> > will be told about non-coherent devices rather than "meh, skip it in the
> > kernel"?  Also let's not forget that vfio is not only for KVM.  
> 
> The policy here is that VFIO will not expose such devices (no enforce-snoop)
> in the new device hierarchy at all. In this case QEMU will fall back to the
> group interface automatically and then rely on the existing contract to connect 
> vfio and QEMU. It doesn't need to care about the whatever new contract
> until such devices are exposed in the new interface.
> 
> Yes, vfio is not only for KVM. But here it's more a task split based on staging
> considerations. IMO it's not necessary to further split the task into supporting
> non-snoop devices for userspace drivers and then for KVM.

Patch 10 introduces an iommufd interface for QEMU to learn whether the
IOMMU enforces DMA coherency; at that point QEMU could revert to the
legacy interface, or register the iommufd with KVM, or otherwise
establish non-coherent DMA with KVM as necessary.  We're adding cruft
to the kernel here to enforce an unnecessary limitation.

If there are reasons the kernel can't support the device interface,
that's a valid reason not to present the interface, but this seems like
picking a specific gap that userspace is already able to detect from
this series at the expense of other use cases.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 532+ messages in thread

* Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
  2021-09-19  6:38   ` Liu Yi L
@ 2021-09-22 21:24     ` Alex Williamson
  -1 siblings, 0 replies; 532+ messages in thread
From: Alex Williamson @ 2021-09-22 21:24 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jgg, hch, jasowang, joro, jean-philippe, kevin.tian, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, ashok.raj, yi.l.liu,
	jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

On Sun, 19 Sep 2021 14:38:38 +0800
Liu Yi L <yi.l.liu@intel.com> wrote:

> +struct iommu_device_info {
> +	__u32	argsz;
> +	__u32	flags;
> +#define IOMMU_DEVICE_INFO_ENFORCE_SNOOP	(1 << 0) /* IOMMU enforced snoop */

Is this too PCI-specific, or perhaps too much of the mechanism rather
than the result?  i.e., should we just indicate whether the IOMMU
guarantees coherent DMA?  Thanks,

Alex

> +#define IOMMU_DEVICE_INFO_PGSIZES	(1 << 1) /* supported page sizes */
> +#define IOMMU_DEVICE_INFO_ADDR_WIDTH	(1 << 2) /* addr_width field valid */
> +	__u64	dev_cookie;
> +	__u64   pgsize_bitmap;
> +	__u32	addr_width;
> +};
> +
> +#define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)
>  
>  #define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
>  #define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */


^ permalink raw reply	[flat|nested] 532+ messages in thread
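
For illustration only, here is a minimal userspace sketch of how a caller
might consume this ioctl and key off the enforce-snoop flag. The ioctl name
and struct layout come from the quoted patch; the header path, the
dev_cookie semantics (cookie supplied at bind time) and the error handling
are assumptions, not something the patch defines.

/* Sketch: query per-device IOMMU capabilities after binding and report
 * whether DMA is guaranteed to be coherent (IOMMU-enforced snoop). */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/iommu.h>	/* assumed location of the definitions quoted above */

static int device_dma_coherent(int iommu_fd, __u64 dev_cookie)
{
	struct iommu_device_info info;

	memset(&info, 0, sizeof(info));
	info.argsz = sizeof(info);
	info.dev_cookie = dev_cookie;	/* assumed: cookie supplied at bind time */

	if (ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info))
		return -1;

	/* Per the discussion, userspace can detect no-snoop here and either
	 * fall back to the legacy group interface or tell KVM about
	 * non-coherent DMA. */
	return !!(info.flags & IOMMU_DEVICE_INFO_ENFORCE_SNOOP);
}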

* RE: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 20:10               ` Alex Williamson
@ 2021-09-22 22:34                 ` Tian, Kevin
  -1 siblings, 0 replies; 532+ messages in thread
From: Tian, Kevin @ 2021-09-22 22:34 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Liu, Yi L, hch, jasowang, joro, jean-philippe, parav, lkml,
	pbonzini, lushenming, eric.auger, corbet, Raj, Ashok, yi.l.liu,
	Tian, Jun J, Wu, Hao, Jiang, Dave, jacob.jun.pan, kwankhede,
	robin.murphy, kvm, iommu, dwmw2, linux-kernel, baolu.lu, david,
	nicolinc

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, September 23, 2021 4:11 AM
> 
> On Wed, 22 Sep 2021 09:22:52 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> >
> > > > Providing an ioctl to bind to a normal VFIO container or group might
> > > > allow a reasonable fallback in userspace..
> > >
> > > I didn't get this point though. An error in binding already allows the
> > > user to fall back to the group path. Why do we need to introduce another
> > > ioctl to explicitly bind to a container via the non-group interface?
> >
> > New userspace still needs a fallback path if it hits the 'try and
> > fail'. Keeping the device FD open and just using a different ioctl to
> > bind to a container/group FD, which new userspace can then obtain as a
> > fallback, might be OK.
> >
> > Hard to see without going through the qemu parts, so maybe just keep
> > it in mind
> 
> If we assume that the container/group/device interface is essentially
> deprecated once we have iommufd, it doesn't make a lot of sense to me
> to tack on a container/device interface just so userspace can avoid
> reverting to the fully legacy interface.
> 
> But why would we create vfio device interface files at all if they
> can't work?  I'm not really on board with creating a try-and-fail
> interface for a mechanism that cannot work for a given device.  The
> existence of the device interface should indicate that it's supported.
> Thanks,
> 

Now it's a try-and-fail model even for devices which support iommufd.
Per Jason's suggestion, a device is always opened with a parked fops
which supports only bind. Binding serves as the contract for claiming
exclusive ownership of a device and for switching to the normal fops if
it succeeds. So the user has to try-and-fail anyway, in case multiple
threads attempt to open the same device. A device which doesn't support
iommufd is no different, except that the binding request always fails
(due to the missing .bind_iommufd in the kernel driver).

Thanks
Kevin

^ permalink raw reply	[flat|nested] 532+ messages in thread
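
To make the parked-fops idea above concrete, a rough kernel-side sketch
follows. The 'bound' state, the name of the full fops, the errno choices
and the simplified .bind_iommufd callback signature are illustrative
assumptions, not what this series actually implements.

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/atomic.h>
#include <linux/vfio.h>

/* Until bind succeeds the fd only accepts VFIO_DEVICE_BIND_IOMMUFD, so
 * concurrent or unsupported openers simply try and fail. */
static long vfio_device_parked_ioctl(struct file *filep, unsigned int cmd,
				     unsigned long arg)
{
	struct vfio_device *device = filep->private_data;
	long ret;

	if (cmd != VFIO_DEVICE_BIND_IOMMUFD)
		return -EINVAL;

	if (!device->ops->bind_iommufd)
		return -ENOTTY;		/* driver has no iommufd support */

	/* Only one opener may claim exclusive ownership (assumed field). */
	if (atomic_cmpxchg(&device->bound, 0, 1) != 0)
		return -EBUSY;

	/* Signature simplified; the real callback would copy the bind data
	 * from userspace and validate it. */
	ret = device->ops->bind_iommufd(device, arg);
	if (ret) {
		atomic_set(&device->bound, 0);
		return ret;
	}

	/* Contract fulfilled: switch this fd to the normal fops. */
	replace_fops(filep, &vfio_device_full_fops);	/* hypothetical fops */
	return 0;
}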

* Re: [RFC 03/20] vfio: Add vfio_[un]register_device()
  2021-09-22 22:34                 ` Tian, Kevin
@ 2021-09-22 22:45                   ` Alex Williamson
  -1 siblings, 0 replies; 532+ messages in thread
From: Alex Williamson @ 2021-09-22 22:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Liu, Yi L, hch, jasowang, joro, jean-philippe,
	parav, lkml, pbonzini, lushenming, eric.auger, corbet, Raj,
	Ashok, yi.l.liu, Tian, Jun J, Wu, Hao, Jiang, Dave,
	jacob.jun.pan, kwankhede, robin.murphy, kvm, iommu, dwmw2,
	linux-kernel, baolu.lu, david, nicolinc

On Wed, 22 Sep 2021 22:34:42 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, September 23, 2021 4:11 AM
> > 
> > On Wed, 22 Sep 2021 09:22:52 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> > >  
> > > > > Providing an ioctl to bind to a normal VFIO container or group might
> > > > > allow a reasonable fallback in userspace..  
> > > >
> > > > I didn't get this point though. An error in binding already allows the
> > > > user to fall back to the group path. Why do we need to introduce another
> > > > ioctl to explicitly bind to a container via the non-group interface?  
> > >
> > > New userspace still needs a fallback path if it hits the 'try and
> > > fail'. Keeping the device FD open and just using a different ioctl to
> > > bind to a container/group FD, which new userspace can then obtain as a
> > > fallback, might be OK.
> > >
> > > Hard to see without going through the qemu parts, so maybe just keep
> > > it in mind  
> > 
> > If we assume that the container/group/device interface is essentially
> > deprecated once we have iommufd, it doesn't make a lot of sense to me
> > to tack on a container/device interface just so userspace can avoid
> > reverting to the fully legacy interface.
> > 
> > But why would we create vfio device interface files at all if they
> > can't work?  I'm not really on board with creating a try-and-fail
> > interface for a mechanism that cannot work for a given device.  The
> > existence of the device interface should indicate that it's supported.
> > Thanks,
> >   
> 
> Now it's a try-and-fail model even for devices which support iommufd.
> Per Jason's suggestion, a device is always opened with a parked fops
> which supports only bind. Binding serves as the contract for claiming
> exclusive ownership of a device and for switching to the normal fops if
> it succeeds. So the user has to try-and-fail anyway, in case multiple
> threads attempt to open the same device. A device which doesn't support
> iommufd is no different, except that the binding request always fails
> (due to the missing .bind_iommufd in the kernel driver).

That's a rather important difference.  I don't really see how that's
comparable to the mutually exclusive nature of the legacy vs device
interface.  We're not going to present a vfio device interface for SW
mdevs that can't participate in iommufd, right?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 532+ messages in thread
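
For reference, a sketch of the try-and-fail flow from userspace as
described in this sub-thread. Only the struct layout, the ioctl name and
the /dev/vfio/devices base path come from the posted patches; the exact
device node name, the header location and the errno handling are
assumptions.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>		/* assumed to carry the new bind uapi */

/* Try the device-centric interface first; on any bind failure the caller
 * reverts to the legacy container/group interface. */
static int open_and_bind(const char *devnode, int iommu_fd, __u64 cookie)
{
	struct vfio_device_iommu_bind_data bind = {
		.argsz = sizeof(bind),
		.flags = 0,
		.iommu_fd = iommu_fd,
		.dev_cookie = cookie,
	};
	int fd = open(devnode, O_RDWR);	/* e.g. under /dev/vfio/devices */

	if (fd < 0)
		return -errno;

	/* Parked fops: only bind is accepted.  It fails if another thread
	 * won the race or the driver lacks .bind_iommufd. */
	if (ioctl(fd, VFIO_DEVICE_BIND_IOMMUFD, &bind) < 0) {
		int err = errno;

		close(fd);
		return -err;	/* caller falls back to the group path */
	}

	return fd;		/* normal fops now active */
}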

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
  2021-09-22 21:01       ` Alex Williamson
@ 2021-09-22 23:01         ` Jason Gunthorpe via iommu
  -1 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe @ 2021-09-22 23:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu Yi L, hch, jasowang, joro, jean-philippe, kevin.tian, parav,
	lkml, pbonzini, lushenming, eric.auger, corbet, ashok.raj,
	yi.l.liu, jun.j.tian, hao.wu, dave.jiang, jacob.jun.pan,
	kwankhede, robin.murphy, kvm, iommu, dwmw2, linux-kernel,
	baolu.lu, david, nicolinc

On Wed, Sep 22, 2021 at 03:01:01PM -0600, Alex Williamson wrote:
> On Tue, 21 Sep 2021 14:29:39 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > +struct vfio_device_iommu_bind_data {
> > > +	__u32	argsz;
> > > +	__u32	flags;
> > > +	__s32	iommu_fd;
> > > +	__u64	dev_cookie;  
> > 
> > Missing explicit padding
> > 
> > Always use __aligned_u64 in uapi headers, fix all the patches.
> 
> We don't need padding or explicit alignment if we just swap the order
> of iommu_fd and dev_cookie.  Thanks,

Yes, the padding should all be checked and minimized.

But it is good practice to always use __aligned_u64 in uapi headers,
just in case someone messes it up someday - it prevents small mistakes
from becoming an ABI mess.

Jason

^ permalink raw reply	[flat|nested] 532+ messages in thread
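
For concreteness, two illustrative layouts along the lines being
discussed. Neither is what the series posted; the field names follow the
quoted struct and the reserved/pad members are added here only to show
the idea.

#include <linux/types.h>	/* __u32, __u64, __aligned_u64 */

/* Alex's suggestion: swap iommu_fd and dev_cookie so the __u64 sits at a
 * natural 8-byte offset; an explicit trailing pad keeps sizeof identical
 * for 32-bit and 64-bit userspace. */
struct vfio_device_iommu_bind_data {
	__u32	argsz;
	__u32	flags;
	__u64	dev_cookie;
	__s32	iommu_fd;
	__u32	__reserved;
};

/* Jason's point: __aligned_u64 makes the alignment explicit, so a later
 * reordering cannot silently reintroduce a layout difference between
 * 32-bit and 64-bit builds. */
struct vfio_device_iommu_bind_data_alt {
	__u32		argsz;
	__u32		flags;
	__s32		iommu_fd;
	__u32		__pad;
	__aligned_u64	dev_cookie;
};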

* Re: [RFC 08/20] vfio/pci: Add VFIO_DEVICE_BIND_IOMMUFD
@ 2021-09-22 23:01         ` Jason Gunthorpe via iommu
  0 siblings, 0 replies; 532+ messages in thread
From: Jason Gunthorpe via iommu @ 2021-09-22 23:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, jasowang, kwankhede, hch, jean-philippe, dave.jiang,
	ashok.raj, corbet, kevin.tian, parav, lkml, david, dwmw2,
	jun.j.tian, linux-kernel, lushenming, iommu, pbonzini,
	robin.murphy

On Wed, Sep 22, 2021 at 03:01:01PM -0600, Alex Williamson wrote:
> On Tue, 21 Sep 2021 14:29:39 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Sun, Sep 19, 2021 at 02:38:36PM +0800, Liu Yi L wrote:
> > > +struct vfio_device_iommu_bind_data {
> > > +	__u32	argsz;
> > > +	__u32	flags;
> > > +	__s32	iommu_fd;
> > > +	__u64	dev_cookie;  
&g