* [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe

Shared Virtual Addressing (SVA), a.k.a. Shared Virtual Memory (SVM) on Intel
platforms, allows address space sharing between device DMA and applications.
SVA can reduce programming complexity and enhance security.

This QEMU series is intended to expose SVA usage to VMs, i.e. to share a guest
application's address space with pass-through devices. This is called vSVA in
this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU changes.

The high-level architecture for SVA virtualization is shown below; the key
design point of vSVA support is to utilize the dual-stage IOMMU translation
(also known as IOMMU nesting translation) capability of the host IOMMU. A
small conceptual sketch follows the diagram.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables
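
As a purely conceptual sketch of the nesting translation above (plain C for
illustration only, not QEMU code; the helper names and offsets are made up):

    #include <stdio.h>
    #include <inttypes.h>

    /* Stage 1: guest-owned first-level tables, gVA -> gPA (guest CR3). */
    static uint64_t fl_walk(uint64_t gva) { return gva + 0x1000; }

    /* Stage 2: host-owned second-level tables, gPA -> hPA (default domain). */
    static uint64_t sl_walk(uint64_t gpa) { return gpa + 0x100000; }

    /* With nesting enabled, a PASID-tagged DMA address goes through both. */
    static uint64_t nested_translate(uint64_t gva)
    {
        return sl_walk(fl_walk(gva));
    }

    int main(void)
    {
        printf("hPA = 0x%" PRIx64 "\n", nested_translate(0x4000));
        return 0;
    }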

The complete vSVA kernel upstream patches are divided into three phases:
    1. Common APIs and PCI device direct assignment
    2. IOMMU-backed Mediated Device assignment
    3. Page Request Services (PRS) support

This QEMU patchset targets phase 1 and phase 2. It is based on the two kernel
series below.
[1] [PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support:
https://lkml.org/lkml/2020/3/20/1172
[2] [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs
https://lkml.org/lkml/2020/3/22/116

There are roughly two parts:
 1. Introduce HostIOMMUContext as an abstraction of the host IOMMU. It provides
    explicit methods for vIOMMU emulators to communicate with the host IOMMU,
    e.g. propagate guest page table bindings to the host IOMMU to set up
    dual-stage DMA translation in the host IOMMU and flush the IOMMU IOTLB
    (a hypothetical sketch of such an interface follows this list).
 2. Set up dual-stage IOMMU translation for the Intel vIOMMU. This includes:
    - Checking IOMMU uAPI version compatibility and VFIO nesting capabilities,
      which include hardware compatibility (stage-1 format) and VFIO_PASID_REQ
      availability. This is preparation for setting up dual-stage DMA
      translation in the host IOMMU.
    - Propagating guest PASID allocation and free requests to the host.
    - Propagating guest page table bindings to the host to set up dual-stage
      IOMMU DMA translation in the host IOMMU.
    - Propagating guest IOMMU cache invalidations to the host to ensure IOTLB
      correctness.
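
A hypothetical sketch of the shape such an abstraction could take is below. The
real interface lives in include/hw/iommu/host_iommu_context.h added by this
series; the names and signatures here are illustrative only:

    #include <stdint.h>

    typedef struct HostIOMMUContext HostIOMMUContext;

    /* Callbacks VFIO could provide, one set per container (illustrative). */
    typedef struct HostIOMMUOps {
        /* Ask the host to allocate/free a system-wide PASID. */
        int (*pasid_alloc)(HostIOMMUContext *ctx, uint32_t min, uint32_t max,
                           uint32_t *pasid);
        int (*pasid_free)(HostIOMMUContext *ctx, uint32_t pasid);
        /* Bind/unbind a guest first-level page table for nested translation. */
        int (*bind_stage1_pgtbl)(HostIOMMUContext *ctx, void *bind_data);
        int (*unbind_stage1_pgtbl)(HostIOMMUContext *ctx, void *bind_data);
        /* Propagate a guest IOMMU cache invalidation down to the host. */
        int (*flush_stage1_cache)(HostIOMMUContext *ctx, void *inval_data);
    } HostIOMMUOps;

    struct HostIOMMUContext {
        const HostIOMMUOps *ops;   /* filled in by VFIO, per container */
    };

A vIOMMU emulator would then invoke, e.g., ctx->ops->pasid_alloc(...) rather
than talking to VFIO directly.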

The complete QEMU series can be found at:
https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2

The complete kernel tree can be found at:
https://github.com/luxis1999/linux-vsva.git: vsva-linux-5.6-rc6

Tests: basic vSVA functionality test, VM reboot/shutdown/crash, kernel build in
the guest, booting a VM with vSVA disabled, and full compilation for all archs.

Regards,
Yi Liu

Changelog:
	- Patch v1 -> Patch v2:
	  a) Refactor the vfio HostIOMMUContext init code (patch 0008 - 0009 of v1 series)
	  b) Refactor the pasid binding handling (patch 0011 - 0016 of v1 series)
	  Patch v1: https://patchwork.ozlabs.org/cover/1259648/

	- RFC v3.1 -> Patch v1:
	  a) Implement HostIOMMUContext in a QOM manner.
	  b) Add pci_set/unset_iommu_context() to register the HostIOMMUContext with
	     the vIOMMU, so that the lifecycle of the HostIOMMUContext is known on
	     the vIOMMU side. This way, the vIOMMU can use the methods provided by
	     the HostIOMMUContext safely.
	  c) Add back patch "[RFC v3 01/25] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps"
	  RFCv3.1: https://patchwork.kernel.org/cover/11397879/

	- RFC v3 -> v3.1:
	  a) Drop IOMMUContext and rename DualStageIOMMUObject to HostIOMMUContext.
	     HostIOMMUContext is per-vfio-container; it is exposed to the vIOMMU via
	     the PCI layer. VFIO registers a PCIHostIOMMUFunc callback with the PCI
	     layer, through which the vIOMMU can get the HostIOMMUContext instance.
	  b) Check the IOMMU uAPI version via VFIO_CHECK_EXTENSION.
	  c) Add a check on VFIO_PASID_REQ availability via VFIO_IOMMU_GET_INFO.
	  d) Reorder the series: put the vSVA Linux header file update at the
	     beginning and the x-scalable-mode option modification at the end of
	     the series.
	  e) Drop patch "[RFC v3 01/25] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps"
	  RFCv3: https://patchwork.kernel.org/cover/11356033/

	- RFC v2 -> v3:
	  a) Introduce DualStageIOMMUObject to abstract the host IOMMU programming
	  capability, e.g. requesting PASIDs from the host and setting up IOMMU
	  nesting translation on the host IOMMU. The pasid_alloc,
	  bind_guest_page_table and iommu_cache_flush operations are moved into
	  DualStageIOMMUOps. Thus, DualStageIOMMUObject is an abstraction layer
	  which provides QEMU vIOMMU emulators with an explicit method to program
	  the host IOMMU.
	  b) Compared with RFC v2, the IOMMUContext has also been updated. It is
	  modified to provide an abstraction for vIOMMU emulators. It provides the
	  method for pass-through modules (like VFIO) to communicate with the host
	  IOMMU, e.g. to tell vIOMMU emulators about the IOMMU nesting capability
	  on the host side and to report host IOMMU DMA translation faults to
	  vIOMMU emulators.
	  RFC v2: https://www.spinics.net/lists/kvm/msg198556.html

	- RFC v1 -> v2:
	  Introduce IOMMUContext to abstract the connection between VFIO
	  and vIOMMU emulators, as a replacement for the PCIPASIDOps
	  in RFC v1. Modify x-scalable-mode to be a string option instead of
	  adding a new option as RFC v1 did. Refine the PASID cache management
	  and address the TODOs mentioned in RFC v1.
	  RFC v1: https://patchwork.kernel.org/cover/11033657/

Eric Auger (1):
  scripts/update-linux-headers: Import iommu.h

Liu Yi L (21):
  header file update VFIO/IOMMU vSVA APIs
  vfio: check VFIO_TYPE1_NESTING_IOMMU support
  hw/iommu: introduce HostIOMMUContext
  hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  hw/pci: introduce pci_device_set/unset_iommu_context()
  intel_iommu: add set/unset_iommu_context callback
  vfio/common: provide PASID alloc/free hooks
  vfio/common: init HostIOMMUContext per-container
  vfio/pci: set host iommu context to vIOMMU
  intel_iommu: add virtual command capability support
  intel_iommu: process PASID cache invalidation
  intel_iommu: add PASID cache management infrastructure
  vfio: add bind stage-1 page table support
  intel_iommu: bind/unbind guest page table to host
  intel_iommu: replay pasid binds after context cache invalidation
  intel_iommu: do not pass down pasid bind for PASID #0
  vfio: add support for flush iommu stage-1 cache
  intel_iommu: process PASID-based iotlb invalidation
  intel_iommu: propagate PASID-based iotlb invalidation to host
  intel_iommu: process PASID-based Device-TLB invalidation
  intel_iommu: modify x-scalable-mode to be string option

 hw/Makefile.objs                      |    1 +
 hw/alpha/typhoon.c                    |    6 +-
 hw/arm/smmu-common.c                  |    6 +-
 hw/hppa/dino.c                        |    6 +-
 hw/i386/amd_iommu.c                   |    6 +-
 hw/i386/intel_iommu.c                 | 1109 ++++++++++++++++++++++++++++++++-
 hw/i386/intel_iommu_internal.h        |  114 ++++
 hw/i386/trace-events                  |    6 +
 hw/iommu/Makefile.objs                |    1 +
 hw/iommu/host_iommu_context.c         |  161 +++++
 hw/pci-host/designware.c              |    6 +-
 hw/pci-host/pnv_phb3.c                |    6 +-
 hw/pci-host/pnv_phb4.c                |    6 +-
 hw/pci-host/ppce500.c                 |    6 +-
 hw/pci-host/prep.c                    |    6 +-
 hw/pci-host/sabre.c                   |    6 +-
 hw/pci/pci.c                          |   53 +-
 hw/ppc/ppc440_pcix.c                  |    6 +-
 hw/ppc/spapr_pci.c                    |    6 +-
 hw/s390x/s390-pci-bus.c               |    8 +-
 hw/vfio/common.c                      |  260 +++++++-
 hw/vfio/pci.c                         |   13 +
 hw/virtio/virtio-iommu.c              |    6 +-
 include/hw/i386/intel_iommu.h         |   57 +-
 include/hw/iommu/host_iommu_context.h |  116 ++++
 include/hw/pci/pci.h                  |   18 +-
 include/hw/pci/pci_bus.h              |    2 +-
 include/hw/vfio/vfio-common.h         |    4 +
 linux-headers/linux/iommu.h           |  378 +++++++++++
 linux-headers/linux/vfio.h            |  127 ++++
 scripts/update-linux-headers.sh       |    2 +-
 31 files changed, 2463 insertions(+), 45 deletions(-)
 create mode 100644 hw/iommu/Makefile.objs
 create mode 100644 hw/iommu/host_iommu_context.c
 create mode 100644 include/hw/iommu/host_iommu_context.h
 create mode 100644 linux-headers/linux/iommu.h

-- 
2.7.4


* [PATCH v2 01/22] scripts/update-linux-headers: Import iommu.h
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Cornelia Huck

From: Eric Auger <eric.auger@redhat.com>

Update the script to import the new iommu.h uapi header.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 scripts/update-linux-headers.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 29c27f4..5b64ee3 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -141,7 +141,7 @@ done
 
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
-for header in kvm.h vfio.h vfio_ccw.h vhost.h \
+for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
               psci.h psp-sev.h userfaultfd.h mman.h; do
     cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
-- 
2.7.4



* [PATCH v2 02/22] header file update VFIO/IOMMU vSVA APIs
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Cornelia Huck

The kernel uapi/linux/iommu.h header file includes the extensions for vSVA
support, e.g. the bind_gpasid and IOMMU fault reporting user structures.

Note: this should be replaced with a full header file update once the vSVA
uAPI is stable.
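
As an aside for readers new to these ioctls, below is a minimal userspace
sketch of a PASID allocation using the vfio.h additions in this patch. It
assumes a container fd that has already gone through VFIO_SET_IOMMU with the
updated headers from this series; error handling is elided and the
result-return convention simply follows the structure layout:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* container: a VFIO container fd after VFIO_SET_IOMMU (assumed set up). */
    static int alloc_guest_pasid(int container, unsigned int *pasid)
    {
        struct vfio_iommu_type1_pasid_request req;

        memset(&req, 0, sizeof(req));
        req.argsz = sizeof(req);
        req.flags = VFIO_IOMMU_PASID_ALLOC;
        req.alloc_pasid.min = 1;        /* illustrative range */
        req.alloc_pasid.max = 0xfffff;

        if (ioctl(container, VFIO_IOMMU_PASID_REQUEST, &req) < 0) {
            return -1;
        }
        *pasid = req.alloc_pasid.result;
        return 0;
    }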

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 linux-headers/linux/iommu.h | 378 ++++++++++++++++++++++++++++++++++++++++++++
 linux-headers/linux/vfio.h  | 127 +++++++++++++++
 2 files changed, 505 insertions(+)
 create mode 100644 linux-headers/linux/iommu.h

diff --git a/linux-headers/linux/iommu.h b/linux-headers/linux/iommu.h
new file mode 100644
index 0000000..9025496
--- /dev/null
+++ b/linux-headers/linux/iommu.h
@@ -0,0 +1,378 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ */
+
+#ifndef _IOMMU_H
+#define _IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * Current version of the IOMMU user API. This is intended for query
+ * between user and kernel to determine compatible data structures.
+ *
+ * UAPI version can be bumped up with the following rules:
+ * 1. All data structures passed between user and kernel space share
+ *    the same version number. i.e. any extension to any structure
+ *    results in version number increment.
+ *
+ * 2. Data structures are open to extension but closed to modification.
+ *    Extension should leverage the padding bytes first where a new
+ *    flag bit is required to indicate the validity of each new member.
+ *    The above rule for padding bytes also applies to adding new union
+ *    members.
+ *    After padding bytes are exhausted, new fields must be added at the
+ *    end of each data structure with 64bit alignment. Flag bits can be
+ *    added without size change but existing ones cannot be altered.
+ *
+ * 3. Versions are backward compatible.
+ *
+ * 4. Version to size lookup is supported by kernel internal API for each
+ *    API function type. @version is mandatory for new data structures
+ *    and must be at the beginning with type of __u32.
+ */
+#define IOMMU_UAPI_VERSION	1
+static __inline__ int iommu_get_uapi_version(void)
+{
+	return IOMMU_UAPI_VERSION;
+}
+
+/*
+ * Supported UAPI features that can be reported to user space.
+ * These types represent the capability available in the kernel.
+ *
+ * REVISIT: UAPI version also implies the capabilities. Should we
+ * report them explicitly?
+ */
+enum IOMMU_UAPI_DATA_TYPES {
+	IOMMU_UAPI_BIND_GPASID,
+	IOMMU_UAPI_CACHE_INVAL,
+	IOMMU_UAPI_PAGE_RESP,
+	NR_IOMMU_UAPI_TYPE,
+};
+
+#define IOMMU_UAPI_CAP_MASK ((1 << IOMMU_UAPI_BIND_GPASID) |	\
+				(1 << IOMMU_UAPI_CACHE_INVAL) |	\
+				(1 << IOMMU_UAPI_PAGE_RESP))
+
+#define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
+#define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
+#define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
+#define IOMMU_FAULT_PERM_PRIV	(1 << 3) /* privileged */
+
+/* Generic fault types, can be expanded IRQ remapping fault */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* Could not access the PASID table (fetch caused external abort) */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/* PASID entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/*
+	 * An external abort occurred fetching (or updating) a translation
+	 * table descriptor
+	 */
+	IOMMU_FAULT_REASON_WALK_EABT,
+
+	/*
+	 * Could not access the page table entry (Bad address),
+	 * actual translation fault
+	 */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+
+	/* access flag check failed */
+	IOMMU_FAULT_REASON_ACCESS,
+
+	/* Output address of a translation stage caused Address Size fault */
+	IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * struct iommu_fault_unrecoverable - Unrecoverable fault data
+ * @reason: reason of the fault, from &enum iommu_fault_reason
+ * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
+ * @pasid: Process Address Space ID
+ * @perm: requested permission access using by the incoming transaction
+ *        (IOMMU_FAULT_PERM_* values)
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ */
+struct iommu_fault_unrecoverable {
+	__u32	reason;
+#define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
+#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	perm;
+	__u64	addr;
+	__u64	fetch_addr;
+};
+
+/**
+ * struct iommu_fault_page_request - Page Request data
+ * @flags: encodes whether the corresponding fields are valid and whether this
+ *         is the last page in group (IOMMU_FAULT_PAGE_REQUEST_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_fault_page_request {
+#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
+#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
+#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	perm;
+	__u64	addr;
+	__u64	private_data[2];
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ * @type: fault type from &enum iommu_fault_type
+ * @padding: reserved for future use (should be zero)
+ * @event: fault event, when @type is %IOMMU_FAULT_DMA_UNRECOV
+ * @prm: Page Request message, when @type is %IOMMU_FAULT_PAGE_REQ
+ * @padding2: sets the fault size to allow for future extensions
+ */
+struct iommu_fault {
+	__u32	type;
+	__u32	padding;
+	union {
+		struct iommu_fault_unrecoverable event;
+		struct iommu_fault_page_request prm;
+		__u8 padding2[56];
+	};
+};
+
+/**
+ * enum iommu_page_response_code - Return status of fault handlers
+ * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
+ *	populated, retry the access. This is "Success" in PCI PRI.
+ * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
+ *	this device if possible. This is "Response Failure" in PCI PRI.
+ * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
+ *	access. This is "Invalid Request" in PCI PRI.
+ */
+enum iommu_page_response_code {
+	IOMMU_PAGE_RESP_SUCCESS = 0,
+	IOMMU_PAGE_RESP_INVALID,
+	IOMMU_PAGE_RESP_FAILURE,
+};
+
+/**
+ * struct iommu_page_response - Generic page response information
+ * @version: IOMMU_UAPI_VERSION
+ * @flags: encodes whether the corresponding fields are valid
+ *         (IOMMU_FAULT_PAGE_RESPONSE_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @code: response code from &enum iommu_page_response_code
+ */
+struct iommu_page_response {
+	__u32	version;
+#define IOMMU_PAGE_RESP_PASID_VALID	(1 << 0)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	code;
+};
+
+/* defines the granularity of the invalidation */
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
+	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
+	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
+	IOMMU_INV_GRANU_NR,	/* number of invalidation granularities */
+};
+
+/**
+ * struct iommu_inv_addr_info - Address Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the address-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If ARCHID bit is set, @archid is populated and the invalidation relates
+ *   to cache entries tagged with this architecture specific ID and matching
+ *   the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - If neither PASID or ARCHID is set, global addr invalidation applies.
+ * - The LEAF flag indicates whether only the leaf PTE caching needs to be
+ *   invalidated and other paging structure caches can be preserved.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ * @addr: first stage/level input address
+ * @granule_size: page/block size of the mapping in bytes
+ * @nb_granules: number of contiguous granules to be invalidated
+ */
+struct iommu_inv_addr_info {
+#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
+#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+	__u64	addr;
+	__u64	granule_size;
+	__u64	nb_granules;
+};
+
+/**
+ * struct iommu_inv_pasid_info - PASID Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the PASID-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If the ARCHID bit is set, the @archid is populated and the invalidation
+ *   relates to cache entries tagged with this architecture specific ID and
+ *   matching the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - At least one of PASID or ARCHID must be set.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ */
+struct iommu_inv_pasid_info {
+#define IOMMU_INV_PASID_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_PASID_FLAGS_ARCHID	(1 << 1)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+};
+
+/**
+ * struct iommu_cache_invalidate_info - First level/stage invalidation
+ *     information
+ * @version: IOMMU_UAPI_VERSION
+ * @cache: bitfield that allows to select which caches to invalidate
+ * @granularity: defines the lowest granularity used for the invalidation:
+ *     domain > PASID > addr
+ * @padding: reserved for future use (should be zero)
+ * @pasid_info: invalidation data when @granularity is %IOMMU_INV_GRANU_PASID
+ * @addr_info: invalidation data when @granularity is %IOMMU_INV_GRANU_ADDR
+ *
+ * Not all the combinations of cache/granularity are valid:
+ *
+ * +--------------+---------------+---------------+---------------+
+ * | type /       |   DEV_IOTLB   |     IOTLB     |      PASID    |
+ * | granularity  |               |               |      cache    |
+ * +==============+===============+===============+===============+
+ * | DOMAIN       |       N/A     |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | PASID        |       Y       |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | ADDR         |       Y       |       Y       |       N/A     |
+ * +--------------+---------------+---------------+---------------+
+ *
+ * Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
+ * @version and @cache.
+ *
+ * If multiple cache types are invalidated simultaneously, they all
+ * must support the used granularity.
+ */
+struct iommu_cache_invalidate_info {
+	__u32	version;
+/* IOMMU paging structure cache */
+#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
+#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
+#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
+#define IOMMU_CACHE_INV_TYPE_NR		(3)
+	__u8	cache;
+	__u8	granularity;
+	__u8	padding[2];
+	union {
+		struct iommu_inv_pasid_info pasid_info;
+		struct iommu_inv_addr_info addr_info;
+	};
+};
+
+/**
+ * struct iommu_gpasid_bind_data_vtd - Intel VT-d specific data on device and guest
+ * SVA binding.
+ *
+ * @flags:	VT-d PASID table entry attributes
+ * @pat:	Page attribute table data to compute effective memory type
+ * @emt:	Extended memory type
+ *
+ * Only guest vIOMMU selectable and effective options are passed down to
+ * the host IOMMU.
+ */
+struct iommu_gpasid_bind_data_vtd {
+#define IOMMU_SVA_VTD_GPASID_SRE	(1 << 0) /* supervisor request */
+#define IOMMU_SVA_VTD_GPASID_EAFE	(1 << 1) /* extended access enable */
+#define IOMMU_SVA_VTD_GPASID_PCD	(1 << 2) /* page-level cache disable */
+#define IOMMU_SVA_VTD_GPASID_PWT	(1 << 3) /* page-level write through */
+#define IOMMU_SVA_VTD_GPASID_EMTE	(1 << 4) /* extended mem type enable */
+#define IOMMU_SVA_VTD_GPASID_CD		(1 << 5) /* PASID-level cache disable */
+	__u64 flags;
+	__u32 pat;
+	__u32 emt;
+};
+#define IOMMU_SVA_VTD_GPASID_EMT_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
+					 IOMMU_SVA_VTD_GPASID_EMTE | \
+					 IOMMU_SVA_VTD_GPASID_PCD |  \
+					 IOMMU_SVA_VTD_GPASID_PWT)
+/**
+ * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
+ * @version:	IOMMU_UAPI_VERSION
+ * @format:	PASID table entry format
+ * @flags:	Additional information on guest bind request
+ * @gpgd:	Guest page directory base of the guest mm to bind
+ * @hpasid:	Process address space ID used for the guest mm in host IOMMU
+ * @gpasid:	Process address space ID used for the guest mm in guest IOMMU
+ * @addr_width:	Guest virtual address width
+ * @padding:	Reserved for future use (should be zero)
+ * @dummy	Reserve space for vendor specific data in the union. New
+ *		members added to the union cannot exceed the size of dummy.
+ *		The fixed size union is needed to allow further expansion
+ *		after the end of the union while still maintain backward
+ *		compatibility.
+ * @vtd:	Intel VT-d specific data
+ *
+ * Guest to host PASID mapping can be an identity or non-identity, where guest
+ * has its own PASID space. For non-identify mapping, guest to host PASID lookup
+ * is needed when VM programs guest PASID into an assigned device. VMM may
+ * trap such PASID programming then request host IOMMU driver to convert guest
+ * PASID to host PASID based on this bind data.
+ */
+struct iommu_gpasid_bind_data {
+	__u32 version;
+#define IOMMU_PASID_FORMAT_INTEL_VTD	1
+	__u32 format;
+#define IOMMU_SVA_GPASID_VAL	(1 << 0) /* guest PASID valid */
+	__u64 flags;
+	__u64 gpgd;
+	__u64 hpasid;
+	__u64 gpasid;
+	__u32 addr_width;
+	__u8  padding[12];
+	/* Vendor specific data */
+	union {
+		__u8 dummy[128];
+		struct iommu_gpasid_bind_data_vtd vtd;
+	};
+};
+
+#endif /* _IOMMU_H */
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index fb10370..29d0071 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -47,6 +48,15 @@
 #define VFIO_NOIOMMU_IOMMU		8
 
 /*
+ * Hardware IOMMUs with two-stage translation capability give userspace
+ * the ownership of stage-1 translation structures (e.g. page tables).
+ * VFIO exposes the two-stage IOMMU programming capability to userspace
+ * based on the IOMMU UAPIs. Therefore user of VFIO_TYPE1_NESTING should
+ * check the IOMMU UAPI version compatibility.
+ */
+#define VFIO_NESTING_IOMMU_UAPI		9
+
+/*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
  * kernel and userspace.  We therefore use the _IO() macro for these
@@ -748,6 +758,15 @@ struct vfio_iommu_type1_info_cap_iova_range {
 	struct	vfio_iova_range iova_ranges[];
 };
 
+#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
+
+struct vfio_iommu_type1_info_cap_nesting {
+	struct	vfio_info_cap_header header;
+#define VFIO_IOMMU_PASID_REQS	(1 << 0)
+	__u32	nesting_capabilities;
+	__u32	stage1_formats;
+};
+
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
 
 /**
@@ -794,6 +813,114 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/*
+ * PASID (Process Address Space ID) is a PCIe concept which
+ * has been extended to support DMA isolation in fine-grain.
+ * With device assigned to user space (e.g. VMs), PASID alloc
+ * and free need to be system wide. This structure defines
+ * the info for pasid alloc/free between user space and kernel
+ * space.
+ *
+ * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
+ * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
+ */
+struct vfio_iommu_type1_pasid_request {
+	__u32	argsz;
+#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
+#define VFIO_IOMMU_PASID_FREE	(1 << 1)
+	__u32	flags;
+	union {
+		struct {
+			__u32 min;
+			__u32 max;
+			__u32 result;
+		} alloc_pasid;
+		__u32 free_pasid;
+	};
+};
+
+#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
+					 VFIO_IOMMU_PASID_FREE)
+
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *				struct vfio_iommu_type1_pasid_request)
+ *
+ * Availability of this feature depends on PASID support in the device,
+ * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
+ * is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
+/**
+ * Supported flags:
+ *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
+ *			nesting type IOMMUs. In @data field It takes struct
+ *			iommu_gpasid_bind_data.
+ *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
+ *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
+ *
+ */
+struct vfio_iommu_type1_bind {
+	__u32		argsz;
+	__u32		flags;
+#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
+#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
+	__u8		data[];
+};
+
+#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
+					VFIO_IOMMU_UNBIND_GUEST_PGTBL)
+
+/**
+ * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
+ *				struct vfio_iommu_type1_bind)
+ *
+ * Manage address spaces of devices in this container. Initially a TYPE1
+ * container can only have one address space, managed with
+ * VFIO_IOMMU_MAP/UNMAP_DMA.
+ *
+ * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
+ * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
+ * tables, and BIND manages the stage-1 (guest) page tables. Other types of
+ * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
+ * the traffics only require single stage translation while BIND controls the
+ * traffics require nesting translation. But this depends on the underlying
+ * IOMMU architecture and isn't guaranteed. Example of this is the guest SVA
+ * traffics, such traffics need nesting translation to gain gVA->gPA and then
+ * gPA->hPA translation.
+ *
+ * Availability of this feature depends on the device, its bus, the underlying
+ * IOMMU and the CPU architecture.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
+
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
+ *			struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host. The cache
+ * invalidation information is conveyed by @cache_info, the content
+ * format would be structures defined in uapi/linux/iommu.h. User
+ * should be aware of that the struct  iommu_cache_invalidate_info
+ * has a @version field, vfio needs to parse this field before getting
+ * data from userspace.
+ *
+ * Availability of this IOCTL is after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct	iommu_cache_invalidate_info cache_info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 02/22] header file update VFIO/IOMMU vSVA APIs
@ 2020-03-30  4:24   ` Liu Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, yi.l.liu, Yi Sun, kvm, mst,
	jun.j.tian, Cornelia Huck, eric.auger, yi.y.sun, Jacob Pan,
	pbonzini, hao.wu, david

The kernel uapi/linux/iommu.h header file includes the
extensions for vSVA support. e.g. bind gpasid, iommu
fault report related user structures and etc.

Note: this should be replaced with a full header files update when
the vSVA uPAPI is stable.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 linux-headers/linux/iommu.h | 378 ++++++++++++++++++++++++++++++++++++++++++++
 linux-headers/linux/vfio.h  | 127 +++++++++++++++
 2 files changed, 505 insertions(+)
 create mode 100644 linux-headers/linux/iommu.h

diff --git a/linux-headers/linux/iommu.h b/linux-headers/linux/iommu.h
new file mode 100644
index 0000000..9025496
--- /dev/null
+++ b/linux-headers/linux/iommu.h
@@ -0,0 +1,378 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ */
+
+#ifndef _IOMMU_H
+#define _IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * Current version of the IOMMU user API. This is intended for query
+ * between user and kernel to determine compatible data structures.
+ *
+ * UAPI version can be bumped up with the following rules:
+ * 1. All data structures passed between user and kernel space share
+ *    the same version number. i.e. any extension to any structure
+ *    results in version number increment.
+ *
+ * 2. Data structures are open to extension but closed to modification.
+ *    Extension should leverage the padding bytes first where a new
+ *    flag bit is required to indicate the validity of each new member.
+ *    The above rule for padding bytes also applies to adding new union
+ *    members.
+ *    After padding bytes are exhausted, new fields must be added at the
+ *    end of each data structure with 64bit alignment. Flag bits can be
+ *    added without size change but existing ones cannot be altered.
+ *
+ * 3. Versions are backward compatible.
+ *
+ * 4. Version to size lookup is supported by kernel internal API for each
+ *    API function type. @version is mandatory for new data structures
+ *    and must be at the beginning with type of __u32.
+ */
+#define IOMMU_UAPI_VERSION	1
+static __inline__ int iommu_get_uapi_version(void)
+{
+	return IOMMU_UAPI_VERSION;
+}
+
+/*
+ * Supported UAPI features that can be reported to user space.
+ * These types represent the capability available in the kernel.
+ *
+ * REVISIT: UAPI version also implies the capabilities. Should we
+ * report them explicitly?
+ */
+enum IOMMU_UAPI_DATA_TYPES {
+	IOMMU_UAPI_BIND_GPASID,
+	IOMMU_UAPI_CACHE_INVAL,
+	IOMMU_UAPI_PAGE_RESP,
+	NR_IOMMU_UAPI_TYPE,
+};
+
+#define IOMMU_UAPI_CAP_MASK ((1 << IOMMU_UAPI_BIND_GPASID) |	\
+				(1 << IOMMU_UAPI_CACHE_INVAL) |	\
+				(1 << IOMMU_UAPI_PAGE_RESP))
+
+#define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
+#define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
+#define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
+#define IOMMU_FAULT_PERM_PRIV	(1 << 3) /* privileged */
+
+/* Generic fault types, can be expanded IRQ remapping fault */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* Could not access the PASID table (fetch caused external abort) */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/* PASID entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/*
+	 * An external abort occurred fetching (or updating) a translation
+	 * table descriptor
+	 */
+	IOMMU_FAULT_REASON_WALK_EABT,
+
+	/*
+	 * Could not access the page table entry (Bad address),
+	 * actual translation fault
+	 */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+
+	/* access flag check failed */
+	IOMMU_FAULT_REASON_ACCESS,
+
+	/* Output address of a translation stage caused Address Size fault */
+	IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * struct iommu_fault_unrecoverable - Unrecoverable fault data
+ * @reason: reason of the fault, from &enum iommu_fault_reason
+ * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
+ * @pasid: Process Address Space ID
+ * @perm: requested permission access using by the incoming transaction
+ *        (IOMMU_FAULT_PERM_* values)
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ */
+struct iommu_fault_unrecoverable {
+	__u32	reason;
+#define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
+#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	perm;
+	__u64	addr;
+	__u64	fetch_addr;
+};
+
+/**
+ * struct iommu_fault_page_request - Page Request data
+ * @flags: encodes whether the corresponding fields are valid and whether this
+ *         is the last page in group (IOMMU_FAULT_PAGE_REQUEST_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_fault_page_request {
+#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
+#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
+#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	perm;
+	__u64	addr;
+	__u64	private_data[2];
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ * @type: fault type from &enum iommu_fault_type
+ * @padding: reserved for future use (should be zero)
+ * @event: fault event, when @type is %IOMMU_FAULT_DMA_UNRECOV
+ * @prm: Page Request message, when @type is %IOMMU_FAULT_PAGE_REQ
+ * @padding2: sets the fault size to allow for future extensions
+ */
+struct iommu_fault {
+	__u32	type;
+	__u32	padding;
+	union {
+		struct iommu_fault_unrecoverable event;
+		struct iommu_fault_page_request prm;
+		__u8 padding2[56];
+	};
+};
+
+/**
+ * enum iommu_page_response_code - Return status of fault handlers
+ * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
+ *	populated, retry the access. This is "Success" in PCI PRI.
+ * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
+ *	this device if possible. This is "Response Failure" in PCI PRI.
+ * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
+ *	access. This is "Invalid Request" in PCI PRI.
+ */
+enum iommu_page_response_code {
+	IOMMU_PAGE_RESP_SUCCESS = 0,
+	IOMMU_PAGE_RESP_INVALID,
+	IOMMU_PAGE_RESP_FAILURE,
+};
+
+/**
+ * struct iommu_page_response - Generic page response information
+ * @version: IOMMU_UAPI_VERSION
+ * @flags: encodes whether the corresponding fields are valid
+ *         (IOMMU_FAULT_PAGE_RESPONSE_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @code: response code from &enum iommu_page_response_code
+ */
+struct iommu_page_response {
+	__u32	version;
+#define IOMMU_PAGE_RESP_PASID_VALID	(1 << 0)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	code;
+};
+
+/* defines the granularity of the invalidation */
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
+	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
+	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
+	IOMMU_INV_GRANU_NR,	/* number of invalidation granularities */
+};
+
+/**
+ * struct iommu_inv_addr_info - Address Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the address-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If ARCHID bit is set, @archid is populated and the invalidation relates
+ *   to cache entries tagged with this architecture specific ID and matching
+ *   the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - If neither PASID or ARCHID is set, global addr invalidation applies.
+ * - The LEAF flag indicates whether only the leaf PTE caching needs to be
+ *   invalidated and other paging structure caches can be preserved.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ * @addr: first stage/level input address
+ * @granule_size: page/block size of the mapping in bytes
+ * @nb_granules: number of contiguous granules to be invalidated
+ */
+struct iommu_inv_addr_info {
+#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
+#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+	__u64	addr;
+	__u64	granule_size;
+	__u64	nb_granules;
+};
+
+/**
+ * struct iommu_inv_pasid_info - PASID Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the PASID-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If the ARCHID bit is set, the @archid is populated and the invalidation
+ *   relates to cache entries tagged with this architecture specific ID and
+ *   matching the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - At least one of PASID or ARCHID must be set.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ */
+struct iommu_inv_pasid_info {
+#define IOMMU_INV_PASID_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_PASID_FLAGS_ARCHID	(1 << 1)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+};
+
+/**
+ * struct iommu_cache_invalidate_info - First level/stage invalidation
+ *     information
+ * @version: IOMMU_UAPI_VERSION
+ * @cache: bitfield that allows to select which caches to invalidate
+ * @granularity: defines the lowest granularity used for the invalidation:
+ *     domain > PASID > addr
+ * @padding: reserved for future use (should be zero)
+ * @pasid_info: invalidation data when @granularity is %IOMMU_INV_GRANU_PASID
+ * @addr_info: invalidation data when @granularity is %IOMMU_INV_GRANU_ADDR
+ *
+ * Not all the combinations of cache/granularity are valid:
+ *
+ * +--------------+---------------+---------------+---------------+
+ * | type /       |   DEV_IOTLB   |     IOTLB     |      PASID    |
+ * | granularity  |               |               |      cache    |
+ * +==============+===============+===============+===============+
+ * | DOMAIN       |       N/A     |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | PASID        |       Y       |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | ADDR         |       Y       |       Y       |       N/A     |
+ * +--------------+---------------+---------------+---------------+
+ *
+ * Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
+ * @version and @cache.
+ *
+ * If multiple cache types are invalidated simultaneously, they all
+ * must support the used granularity.
+ */
+struct iommu_cache_invalidate_info {
+	__u32	version;
+/* IOMMU paging structure cache */
+#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
+#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
+#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
+#define IOMMU_CACHE_INV_TYPE_NR		(3)
+	__u8	cache;
+	__u8	granularity;
+	__u8	padding[2];
+	union {
+		struct iommu_inv_pasid_info pasid_info;
+		struct iommu_inv_addr_info addr_info;
+	};
+};
+
+/**
+ * struct iommu_gpasid_bind_data_vtd - Intel VT-d specific data on device and guest
+ * SVA binding.
+ *
+ * @flags:	VT-d PASID table entry attributes
+ * @pat:	Page attribute table data to compute effective memory type
+ * @emt:	Extended memory type
+ *
+ * Only guest vIOMMU selectable and effective options are passed down to
+ * the host IOMMU.
+ */
+struct iommu_gpasid_bind_data_vtd {
+#define IOMMU_SVA_VTD_GPASID_SRE	(1 << 0) /* supervisor request */
+#define IOMMU_SVA_VTD_GPASID_EAFE	(1 << 1) /* extended access enable */
+#define IOMMU_SVA_VTD_GPASID_PCD	(1 << 2) /* page-level cache disable */
+#define IOMMU_SVA_VTD_GPASID_PWT	(1 << 3) /* page-level write through */
+#define IOMMU_SVA_VTD_GPASID_EMTE	(1 << 4) /* extended mem type enable */
+#define IOMMU_SVA_VTD_GPASID_CD		(1 << 5) /* PASID-level cache disable */
+	__u64 flags;
+	__u32 pat;
+	__u32 emt;
+};
+#define IOMMU_SVA_VTD_GPASID_EMT_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
+					 IOMMU_SVA_VTD_GPASID_EMTE | \
+					 IOMMU_SVA_VTD_GPASID_PCD |  \
+					 IOMMU_SVA_VTD_GPASID_PWT)
+/**
+ * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
+ * @version:	IOMMU_UAPI_VERSION
+ * @format:	PASID table entry format
+ * @flags:	Additional information on guest bind request
+ * @gpgd:	Guest page directory base of the guest mm to bind
+ * @hpasid:	Process address space ID used for the guest mm in host IOMMU
+ * @gpasid:	Process address space ID used for the guest mm in guest IOMMU
+ * @addr_width:	Guest virtual address width
+ * @padding:	Reserved for future use (should be zero)
+ * @dummy:	Reserved space for vendor specific data in the union. New
+ *		members added to the union cannot exceed the size of dummy.
+ *		The fixed size union is needed to allow further expansion
+ *		after the end of the union while still maintaining backward
+ *		compatibility.
+ * @vtd:	Intel VT-d specific data
+ *
+ * Guest to host PASID mapping can be identity or non-identity, where the
+ * guest has its own PASID space. For a non-identity mapping, a guest to host
+ * PASID lookup is needed when the VM programs a guest PASID into an assigned
+ * device. The VMM may trap such PASID programming, then request the host
+ * IOMMU driver to convert the guest PASID to a host PASID based on this
+ * bind data.
+ */
+struct iommu_gpasid_bind_data {
+	__u32 version;
+#define IOMMU_PASID_FORMAT_INTEL_VTD	1
+	__u32 format;
+#define IOMMU_SVA_GPASID_VAL	(1 << 0) /* guest PASID valid */
+	__u64 flags;
+	__u64 gpgd;
+	__u64 hpasid;
+	__u64 gpasid;
+	__u32 addr_width;
+	__u8  padding[12];
+	/* Vendor specific data */
+	union {
+		__u8 dummy[128];
+		struct iommu_gpasid_bind_data_vtd vtd;
+	};
+};
+
+#endif /* _IOMMU_H */
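
As a rough illustration of how the bind data above is meant to be filled
(this sketch is editorial and not part of the patch; the guest page-table
address, PASID values and address width are invented), a VMM shadowing a
guest PASID table entry for the VT-d format could do something like:

#include <stdint.h>
#include <string.h>
#include <linux/iommu.h>

/* Hypothetical helper: describe one guest PASID binding in VT-d format. */
static void example_fill_gpasid_bind(struct iommu_gpasid_bind_data *bind,
                                     uint64_t guest_flpt_gpa,
                                     uint32_t gpasid, uint32_t hpasid)
{
    memset(bind, 0, sizeof(*bind));
    bind->version    = IOMMU_UAPI_VERSION;
    bind->format     = IOMMU_PASID_FORMAT_INTEL_VTD;
    bind->flags      = IOMMU_SVA_GPASID_VAL;   /* @gpasid below is valid */
    bind->gpgd       = guest_flpt_gpa; /* guest 1st-level table base (GPA) */
    bind->hpasid     = hpasid;         /* PASID used by the host IOMMU */
    bind->gpasid     = gpasid;         /* PASID programmed by the guest */
    bind->addr_width = 48;             /* 4-level guest page tables */
    /* VT-d specific attributes taken from the guest PASID entry, e.g.: */
    bind->vtd.flags  = IOMMU_SVA_VTD_GPASID_SRE;  /* allow supervisor requests */
}
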
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index fb10370..29d0071 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -47,6 +48,15 @@
 #define VFIO_NOIOMMU_IOMMU		8
 
 /*
+ * Hardware IOMMUs with two-stage translation capability give userspace
+ * the ownership of stage-1 translation structures (e.g. page tables).
+ * VFIO exposes the two-stage IOMMU programming capability to userspace
+ * based on the IOMMU UAPIs. Therefore users of VFIO_TYPE1_NESTING_IOMMU
+ * should check IOMMU UAPI version compatibility.
+ */
+#define VFIO_NESTING_IOMMU_UAPI		9
+
+/*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
  * kernel and userspace.  We therefore use the _IO() macro for these
@@ -748,6 +758,15 @@ struct vfio_iommu_type1_info_cap_iova_range {
 	struct	vfio_iova_range iova_ranges[];
 };
 
+#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
+
+struct vfio_iommu_type1_info_cap_nesting {
+	struct	vfio_info_cap_header header;
+#define VFIO_IOMMU_PASID_REQS	(1 << 0)
+	__u32	nesting_capabilities;
+	__u32	stage1_formats;
+};
+
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
 
 /**
@@ -794,6 +813,114 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/*
+ * PASID (Process Address Space ID) is a PCIe concept which
+ * has been extended to support fine-grained DMA isolation.
+ * With devices assigned to user space (e.g. VMs), PASID
+ * allocation and free need to be system wide. This structure
+ * defines the info for PASID alloc/free between user space
+ * and kernel space.
+ *
+ * @flags=VFIO_IOMMU_PASID_ALLOC, refer to @alloc_pasid
+ * @flags=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
+ */
+struct vfio_iommu_type1_pasid_request {
+	__u32	argsz;
+#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
+#define VFIO_IOMMU_PASID_FREE	(1 << 1)
+	__u32	flags;
+	union {
+		struct {
+			__u32 min;
+			__u32 max;
+			__u32 result;
+		} alloc_pasid;
+		__u32 free_pasid;
+	};
+};
+
+#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
+					 VFIO_IOMMU_PASID_FREE)
+
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *				struct vfio_iommu_type1_pasid_request)
+ *
+ * Availability of this feature depends on PASID support in the device,
+ * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
+ * is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
+/**
+ * Supported flags:
+ *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
+ *			nesting type IOMMUs. The @data field takes a struct
+ *			iommu_gpasid_bind_data.
+ *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a guest page table binding
+ *			previously set up by VFIO_IOMMU_BIND_GUEST_PGTBL.
+ *
+ */
+struct vfio_iommu_type1_bind {
+	__u32		argsz;
+	__u32		flags;
+#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
+#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
+	__u8		data[];
+};
+
+#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
+					VFIO_IOMMU_UNBIND_GUEST_PGTBL)
+
+/**
+ * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
+ *				struct vfio_iommu_type1_bind)
+ *
+ * Manage address spaces of devices in this container. Initially a TYPE1
+ * container can only have one address space, managed with
+ * VFIO_IOMMU_MAP/UNMAP_DMA.
+ *
+ * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
+ * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
+ * tables, and BIND manages the stage-1 (guest) page tables. Other types of
+ * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
+ * traffic that requires only single-stage translation while BIND controls
+ * traffic that requires nested translation. But this depends on the
+ * underlying IOMMU architecture and isn't guaranteed. An example is guest
+ * SVA traffic, which needs nested translation to perform gVA->gPA and then
+ * gPA->hPA translation.
+ *
+ * Availability of this feature depends on the device, its bus, the underlying
+ * IOMMU and the CPU architecture.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
+
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
+ *			struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host. The cache
+ * invalidation information is conveyed by @cache_info; its content
+ * follows the structures defined in uapi/linux/iommu.h. Note that
+ * struct iommu_cache_invalidate_info has a @version field, which VFIO
+ * needs to parse before consuming the rest of the data from userspace.
+ *
+ * This ioctl is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct	iommu_cache_invalidate_info cache_info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4
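
Taken together, the uAPI additions above are meant to be driven from
userspace roughly as sketched below. This is an editorial sketch, not code
from this series: error handling is minimal, the container fd is assumed to
be already set up with VFIO_SET_IOMMU, and the helper names are invented.
The PASID returned by the allocation helper would typically be used as the
@hpasid in a struct iommu_gpasid_bind_data, e.g. filled as in the earlier
sketch.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

/* Ask the host to pick a PASID for the guest (hypothetical helper). */
static int example_pasid_alloc(int container, uint32_t min, uint32_t max)
{
    struct vfio_iommu_type1_pasid_request req = {
        .argsz = sizeof(req),
        .flags = VFIO_IOMMU_PASID_ALLOC,
        .alloc_pasid = { .min = min, .max = max },
    };

    if (ioctl(container, VFIO_IOMMU_PASID_REQUEST, &req) < 0) {
        return -1;
    }
    return req.alloc_pasid.result;    /* PASID chosen by the host */
}

/* Bind guest stage-1 page tables for a PASID (hypothetical helper). */
static int example_bind_gpasid(int container,
                               struct iommu_gpasid_bind_data *bind)
{
    size_t argsz = sizeof(struct vfio_iommu_type1_bind) + sizeof(*bind);
    struct vfio_iommu_type1_bind *vbind = calloc(1, argsz);
    int ret;

    if (!vbind) {
        return -1;
    }
    vbind->argsz = argsz;
    vbind->flags = VFIO_IOMMU_BIND_GUEST_PGTBL;
    memcpy(vbind->data, bind, sizeof(*bind));  /* @data carries the bind data */
    ret = ioctl(container, VFIO_IOMMU_BIND, vbind);
    free(vbind);
    return ret;
}

/* Flush host caches for the whole domain (hypothetical helper). */
static int example_invalidate_domain(int container)
{
    struct vfio_iommu_type1_cache_invalidate inv = {
        .argsz = sizeof(inv),
        .cache_info = {
            .version     = IOMMU_UAPI_VERSION,
            .cache       = IOMMU_CACHE_INV_TYPE_IOTLB,
            .granularity = IOMMU_INV_GRANU_DOMAIN,
        },
    };

    return ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv);
}
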




* [PATCH v2 03/22] vfio: check VFIO_TYPE1_NESTING_IOMMU support
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

VFIO needs to check VFIO_TYPE1_NESTING_IOMMU support with the kernel
before using it, e.g. it needs to check IOMMU UAPI version compatibility.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/vfio/common.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b..c276732 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1157,12 +1157,21 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
 static int vfio_get_iommu_type(VFIOContainer *container,
                                Error **errp)
 {
-    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
+                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
                           VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
-    int i;
+    int i, version;
 
     for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
         if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU) {
+                version = ioctl(container->fd, VFIO_CHECK_EXTENSION,
+                                VFIO_NESTING_IOMMU_UAPI);
+                if (version < IOMMU_UAPI_VERSION) {
+                    info_report("IOMMU UAPI incompatible for nesting");
+                    continue;
+                }
+            }
             return iommu_types[i];
         }
     }
@@ -1278,6 +1287,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     switch (container->iommu_type) {
+    case VFIO_TYPE1_NESTING_IOMMU:
     case VFIO_TYPE1v2_IOMMU:
     case VFIO_TYPE1_IOMMU:
     {
-- 
2.7.4
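
For reference, the same probe can be expressed as a standalone sketch from
the point of view of a VFIO user. This is an editorial illustration only,
assuming the VFIO_NESTING_IOMMU_UAPI and IOMMU_UAPI_VERSION definitions from
the header updates earlier in this series; the helper name is invented.

#include <sys/ioctl.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

/* Returns 1 if the container supports nesting with a compatible IOMMU uAPI. */
static int example_probe_nesting(int container)
{
    if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_NESTING_IOMMU)) {
        return 0;    /* nesting IOMMU type not supported at all */
    }
    /* For nesting, the extension check also reports the IOMMU uAPI version. */
    if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_NESTING_IOMMU_UAPI) <
        IOMMU_UAPI_VERSION) {
        return 0;    /* kernel uAPI is older than what this userspace expects */
    }
    return 1;
}
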



* [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

Currently, many platform vendors provide dual-stage DMA address
translation in hardware, e.g. nested translation on Intel VT-d scalable
mode and nested stage translation on ARM SMMUv3. In dual-stage DMA
address translation there are two stages of translation structures:
stage-1 (a.k.a. first-level) and stage-2 (a.k.a. second-level). Stage-1
translation results are also subject to the stage-2 translation
structures. Take vSVA (Virtual Shared Virtual Addressing) as an
example: the guest IOMMU driver owns the stage-1 translation structures
(covering GVA->GPA translation), and the host IOMMU driver owns the
stage-2 translation structures (covering GPA->HPA translation). The VMM
is responsible for binding the stage-1 translation structures to the
host, so that hardware can perform GVA->GPA and then GPA->HPA
translation. For more background on SVA, refer to the links below.
 - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
 - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf

In QEMU, vIOMMU emulators expose IOMMUs to the VM per their own specs
(e.g. the Intel VT-d spec). Devices are passed through to the guest via
device pass-through components like VFIO. VFIO is a userspace driver
framework which exposes host IOMMU programming capability to userspace
in a secure manner, e.g. IOVA MAP/UNMAP requests. Thus the major
connection between VFIO and vIOMMU is MAP/UNMAP. However, with
dual-stage DMA translation support, there are more interactions between
vIOMMU and VFIO, as below:
 1) PASID allocation (allow host to intercept in PASID allocation)
 2) bind stage-1 translation structures to host
 3) propagate stage-1 cache invalidation to host
 4) DMA address translation fault (I/O page fault) servicing etc.

These new interactions require an abstraction layer in QEMU to
facilitate the above operations and to give vIOMMU emulators an
explicit way to call into VFIO. This patch introduces HostIOMMUContext
to represent a host IOMMU with dual-stage DMA address translation
capability, and HostIOMMUContextClass to provide methods for vIOMMU
emulators to propagate dual-stage translation related requests to the
host. As a start, PASID allocation/free callbacks are defined to
propagate PASID allocation/free requests to the host, which is helpful
for vendors who manage PASIDs system-wide. In the future there will be
more operations, such as bind_stage1_pgtbl and flush_stage1_cache.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/Makefile.objs                      |  1 +
 hw/iommu/Makefile.objs                |  1 +
 hw/iommu/host_iommu_context.c         | 97 +++++++++++++++++++++++++++++++++++
 include/hw/iommu/host_iommu_context.h | 75 +++++++++++++++++++++++++++
 4 files changed, 174 insertions(+)
 create mode 100644 hw/iommu/Makefile.objs
 create mode 100644 hw/iommu/host_iommu_context.c
 create mode 100644 include/hw/iommu/host_iommu_context.h

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 660e2b4..cab83fe 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
 devices-dirs-$(CONFIG_NUBUS) += nubus/
 devices-dirs-y += semihosting/
 devices-dirs-y += smbios/
+devices-dirs-y += iommu/
 endif
 
 common-obj-y += $(devices-dirs-y)
diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
new file mode 100644
index 0000000..e6eed4e
--- /dev/null
+++ b/hw/iommu/Makefile.objs
@@ -0,0 +1 @@
+obj-y += host_iommu_context.o
diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
new file mode 100644
index 0000000..5fb2223
--- /dev/null
+++ b/hw/iommu/host_iommu_context.c
@@ -0,0 +1,97 @@
+/*
+ * QEMU abstract of Host IOMMU
+ *
+ * Copyright (C) 2020 Intel Corporation.
+ *
+ * Authors: Liu Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qom/object.h"
+#include "qapi/visitor.h"
+#include "hw/iommu/host_iommu_context.h"
+
+int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
+                               uint32_t max, uint32_t *pasid)
+{
+    HostIOMMUContextClass *hicxc;
+
+    if (!iommu_ctx) {
+        return -EINVAL;
+    }
+
+    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+
+    if (!hicxc) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
+        !hicxc->pasid_alloc) {
+        return -EINVAL;
+    }
+
+    return hicxc->pasid_alloc(iommu_ctx, min, max, pasid);
+}
+
+int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid)
+{
+    HostIOMMUContextClass *hicxc;
+
+    if (!iommu_ctx) {
+        return -EINVAL;
+    }
+
+    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+    if (!hicxc) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
+        !hicxc->pasid_free) {
+        return -EINVAL;
+    }
+
+    return hicxc->pasid_free(iommu_ctx, pasid);
+}
+
+void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
+                         const char *mrtypename,
+                         uint64_t flags)
+{
+    HostIOMMUContext *iommu_ctx;
+
+    object_initialize(_iommu_ctx, instance_size, mrtypename);
+    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
+    iommu_ctx->flags = flags;
+    iommu_ctx->initialized = true;
+}
+
+static const TypeInfo host_iommu_context_info = {
+    .parent             = TYPE_OBJECT,
+    .name               = TYPE_HOST_IOMMU_CONTEXT,
+    .class_size         = sizeof(HostIOMMUContextClass),
+    .instance_size      = sizeof(HostIOMMUContext),
+    .abstract           = true,
+};
+
+static void host_iommu_ctx_register_types(void)
+{
+    type_register_static(&host_iommu_context_info);
+}
+
+type_init(host_iommu_ctx_register_types)
diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
new file mode 100644
index 0000000..35c4861
--- /dev/null
+++ b/include/hw/iommu/host_iommu_context.h
@@ -0,0 +1,75 @@
+/*
+ * QEMU abstraction of Host IOMMU
+ *
+ * Copyright (C) 2020 Intel Corporation.
+ *
+ * Authors: Liu Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_IOMMU_CONTEXT_H
+#define HW_IOMMU_CONTEXT_H
+
+#include "qemu/queue.h"
+#include "qemu/thread.h"
+#include "qom/object.h"
+#include <linux/iommu.h>
+#ifndef CONFIG_USER_ONLY
+#include "exec/hwaddr.h"
+#endif
+
+#define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
+#define HOST_IOMMU_CONTEXT(obj) \
+        OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
+#define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
+        OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
+                         TYPE_HOST_IOMMU_CONTEXT)
+
+typedef struct HostIOMMUContext HostIOMMUContext;
+
+typedef struct HostIOMMUContextClass {
+    /* private */
+    ObjectClass parent_class;
+
+    /* Allocate pasid from HostIOMMUContext (a.k.a. host software) */
+    int (*pasid_alloc)(HostIOMMUContext *iommu_ctx,
+                       uint32_t min,
+                       uint32_t max,
+                       uint32_t *pasid);
+    /* Reclaim pasid from HostIOMMUContext (a.k.a. host software) */
+    int (*pasid_free)(HostIOMMUContext *iommu_ctx,
+                      uint32_t pasid);
+} HostIOMMUContextClass;
+
+/*
+ * This is an abstraction of host IOMMU with dual-stage capability
+ */
+struct HostIOMMUContext {
+    Object parent_obj;
+#define HOST_IOMMU_PASID_REQUEST (1ULL << 0)
+    uint64_t flags;
+    bool initialized;
+};
+
+int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
+                               uint32_t max, uint32_t *pasid);
+int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid);
+
+void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
+                         const char *mrtypename,
+                         uint64_t flags);
+void host_iommu_ctx_destroy(HostIOMMUContext *iommu_ctx);
+
+#endif
-- 
2.7.4
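
To show how the new class is intended to be consumed, here is an editorial
sketch, not code from this series: the function name and PASID range are
invented. A vIOMMU emulator that has obtained a HostIOMMUContext (per the
commit message, e.g. provided by VFIO) would forward a guest PASID
allocation request roughly like this:

#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "hw/iommu/host_iommu_context.h"

/* Hypothetical vIOMMU-side handler for a guest PASID allocation request. */
static int example_viommu_alloc_pasid(HostIOMMUContext *iommu_ctx,
                                      uint32_t *pasid)
{
    /*
     * The range would normally be capped by what the vIOMMU reports to
     * the guest; 1..0xfffff is only a placeholder here.
     */
    int ret = host_iommu_ctx_pasid_alloc(iommu_ctx, 1, 0xfffff, pasid);

    if (ret) {
        error_report("PASID allocation via host IOMMU failed: %d", ret);
    }
    return ret;
}
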



* [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of
PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space for a PCI
device in a vendor-specific way. PCIIOMMUOps still offers this
functionality, but using an ops structure leaves room to add more
IOMMU-related vendor-specific operations.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/alpha/typhoon.c       |  6 +++++-
 hw/arm/smmu-common.c     |  6 +++++-
 hw/hppa/dino.c           |  6 +++++-
 hw/i386/amd_iommu.c      |  6 +++++-
 hw/i386/intel_iommu.c    |  6 +++++-
 hw/pci-host/designware.c |  6 +++++-
 hw/pci-host/pnv_phb3.c   |  6 +++++-
 hw/pci-host/pnv_phb4.c   |  6 +++++-
 hw/pci-host/ppce500.c    |  6 +++++-
 hw/pci-host/prep.c       |  6 +++++-
 hw/pci-host/sabre.c      |  6 +++++-
 hw/pci/pci.c             | 12 +++++++-----
 hw/ppc/ppc440_pcix.c     |  6 +++++-
 hw/ppc/spapr_pci.c       |  6 +++++-
 hw/s390x/s390-pci-bus.c  |  8 ++++++--
 hw/virtio/virtio-iommu.c |  6 +++++-
 include/hw/pci/pci.h     |  8 ++++++--
 include/hw/pci/pci_bus.h |  2 +-
 18 files changed, 90 insertions(+), 24 deletions(-)

diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index 1795e2f..f271de1 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -740,6 +740,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &s->pchip.iommu_as;
 }
 
+static const PCIIOMMUOps typhoon_iommu_ops = {
+    .get_address_space = typhoon_pci_dma_iommu,
+};
+
 static void typhoon_set_irq(void *opaque, int irq, int level)
 {
     TyphoonState *s = opaque;
@@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus **isa_bus, qemu_irq *p_rtc_irq,
                              "iommu-typhoon", UINT64_MAX);
     address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s->pchip.iommu),
                        "pchip0-pci");
-    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
+    pci_setup_iommu(b, &typhoon_iommu_ops, s);
 
     /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
     memory_region_init_io(&s->pchip.reg_iack, OBJECT(s), &alpha_pci_iack_ops,
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index e13a5f4..447146e 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
     return &sdev->as;
 }
 
+static const PCIIOMMUOps smmu_ops = {
+    .get_address_space = smmu_find_add_as,
+};
+
 IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)
 {
     uint8_t bus_n, devfn;
@@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
     s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
 
     if (s->primary_bus) {
-        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
+        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
     } else {
         error_setg(errp, "SMMU is not attached to any PCI bus!");
     }
diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c
index 2b1b38c..3da4f84 100644
--- a/hw/hppa/dino.c
+++ b/hw/hppa/dino.c
@@ -459,6 +459,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps dino_iommu_ops = {
+    .get_address_space = dino_pcihost_set_iommu,
+};
+
 /*
  * Dino interrupts are connected as shown on Page 78, Table 23
  * (Little-endian bit numbers)
@@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
     memory_region_add_subregion(&s->bm, 0xfff00000,
                                 &s->bm_cpu_alias);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
+    pci_setup_iommu(b, &dino_iommu_ops, s);
 
     *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
     *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0);
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index b1175e5..5fec30e 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1451,6 +1451,10 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &iommu_as[devfn]->as;
 }
 
+static const PCIIOMMUOps amdvi_iommu_ops = {
+    .get_address_space = amdvi_host_dma_iommu,
+};
+
 static const MemoryRegionOps mmio_mem_ops = {
     .read = amdvi_mmio_read,
     .write = amdvi_mmio_write,
@@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev, Error **errp)
 
     sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
-    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
+    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
     s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
     msi_init(&s->pci.dev, 0, 1, true, false, errp);
     amdvi_init(s);
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index df7ad25..4b22910 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3729,6 +3729,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &vtd_as->as;
 }
 
+static PCIIOMMUOps vtd_iommu_ops = {
+    .get_address_space = vtd_host_dma_iommu,
+};
+
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
@@ -3840,7 +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                               g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
-    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
+    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
     /* Pseudo address space under root PCI bus. */
     x86ms->ioapic_as = vtd_host_dma_iommu(bus, s, Q35_PSEUDO_DEVFN_IOAPIC);
     qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
index dd24551..4c6338a 100644
--- a/hw/pci-host/designware.c
+++ b/hw/pci-host/designware.c
@@ -645,6 +645,10 @@ static AddressSpace *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
     return &s->pci.address_space;
 }
 
+static const PCIIOMMUOps designware_iommu_ops = {
+    .get_address_space = designware_pcie_host_set_iommu,
+};
+
 static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
 {
     PCIHostState *pci = PCI_HOST_BRIDGE(dev);
@@ -686,7 +690,7 @@ static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
     address_space_init(&s->pci.address_space,
                        &s->pci.address_space_root,
                        "pcie-bus-address-space");
-    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
+    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
 
     qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
     qdev_init_nofail(DEVICE(&s->root));
diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c
index 74618fa..ecfe627 100644
--- a/hw/pci-host/pnv_phb3.c
+++ b/hw/pci-host/pnv_phb3.c
@@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &ds->dma_as;
 }
 
+static PCIIOMMUOps pnv_phb3_iommu_ops = {
+    .get_address_space = pnv_phb3_dma_iommu,
+};
+
 static void pnv_phb3_instance_init(Object *obj)
 {
     PnvPHB3 *phb = PNV_PHB3(obj);
@@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev, Error **errp)
                                      &phb->pci_mmio, &phb->pci_io,
                                      0, 4, TYPE_PNV_PHB3_ROOT_BUS);
 
-    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
+    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
 
     /* Add a single Root port */
     qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c
index 23cf093..04e95e3 100644
--- a/hw/pci-host/pnv_phb4.c
+++ b/hw/pci-host/pnv_phb4.c
@@ -1148,6 +1148,10 @@ static AddressSpace *pnv_phb4_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &ds->dma_as;
 }
 
+static PCIIOMMUOps pnv_phb4_iommu_ops = {
+    .get_address_space = pnv_phb4_dma_iommu,
+};
+
 static void pnv_phb4_instance_init(Object *obj)
 {
     PnvPHB4 *phb = PNV_PHB4(obj);
@@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev, Error **errp)
                                      pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
                                      &phb->pci_mmio, &phb->pci_io,
                                      0, 4, TYPE_PNV_PHB4_ROOT_BUS);
-    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
+    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
 
     /* Add a single Root port */
     qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
diff --git a/hw/pci-host/ppce500.c b/hw/pci-host/ppce500.c
index d710727..5baf5db 100644
--- a/hw/pci-host/ppce500.c
+++ b/hw/pci-host/ppce500.c
@@ -439,6 +439,10 @@ static AddressSpace *e500_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps ppce500_iommu_ops = {
+    .get_address_space = e500_pcihost_set_iommu,
+};
+
 static void e500_pcihost_realize(DeviceState *dev, Error **errp)
 {
     SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
@@ -473,7 +477,7 @@ static void e500_pcihost_realize(DeviceState *dev, Error **errp)
     memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
     memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
+    pci_setup_iommu(b, &ppce500_iommu_ops, s);
 
     pci_create_simple(b, 0, "e500-host-bridge");
 
diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c
index 1a02e9a..7c57311 100644
--- a/hw/pci-host/prep.c
+++ b/hw/pci-host/prep.c
@@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps raven_iommu_ops = {
+    .get_address_space = raven_pcihost_set_iommu,
+};
+
 static void raven_change_gpio(void *opaque, int n, int level)
 {
     PREPPCIState *s = opaque;
@@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
     memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
     memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
     address_space_init(&s->bm_as, &s->bm, "raven-bm");
-    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
+    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
 
     h->bus = &s->pci_bus;
 
diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c
index 2b8503b..251549b 100644
--- a/hw/pci-host/sabre.c
+++ b/hw/pci-host/sabre.c
@@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &is->iommu_as;
 }
 
+static const PCIIOMMUOps sabre_iommu_ops = {
+    .get_address_space = sabre_pci_dma_iommu,
+};
+
 static void sabre_config_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
 {
@@ -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
     /* IOMMU */
     memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
                     sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
-    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
+    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
 
     /* APB secondary busses */
     pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0), true,
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e1ed667..aa9025c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2644,7 +2644,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     PCIBus *iommu_bus = bus;
     uint8_t devfn = dev->devfn;
 
-    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
+    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
 
         /*
@@ -2683,15 +2683,17 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
-    if (iommu_bus && iommu_bus->iommu_fn) {
-        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+    if (iommu_bus && iommu_bus->iommu_ops &&
+                     iommu_bus->iommu_ops->get_address_space) {
+        return iommu_bus->iommu_ops->get_address_space(bus,
+                                 iommu_bus->iommu_opaque, devfn);
     }
     return &address_space_memory;
 }
 
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
+void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
 {
-    bus->iommu_fn = fn;
+    bus->iommu_ops = ops;
     bus->iommu_opaque = opaque;
 }
 
diff --git a/hw/ppc/ppc440_pcix.c b/hw/ppc/ppc440_pcix.c
index 2ee2d4f..7b17ee5 100644
--- a/hw/ppc/ppc440_pcix.c
+++ b/hw/ppc/ppc440_pcix.c
@@ -442,6 +442,10 @@ static AddressSpace *ppc440_pcix_set_iommu(PCIBus *b, void *opaque, int devfn)
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps ppc440_iommu_ops = {
+    .get_address_space = ppc440_pcix_set_iommu,
+};
+
 /* The default pci_host_data_{read,write} functions in pci/pci_host.c
  * deny access to registers without bit 31 set but our clients want
  * this to work so we have to override these here */
@@ -487,7 +491,7 @@ static void ppc440_pcix_realize(DeviceState *dev, Error **errp)
     memory_region_init(&s->bm, OBJECT(s), "bm-ppc440-pcix", UINT64_MAX);
     memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(h->bus, ppc440_pcix_set_iommu, s);
+    pci_setup_iommu(h->bus, &ppc440_iommu_ops, s);
 
     memory_region_init(&s->container, OBJECT(s), "pci-container", PCI_ALL_SIZE);
     memory_region_init_io(&h->conf_mem, OBJECT(s), &pci_host_conf_le_ops,
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 709a527..729a1cb 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -771,6 +771,10 @@ static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &phb->iommu_as;
 }
 
+static const PCIIOMMUOps spapr_iommu_ops = {
+    .get_address_space = spapr_pci_dma_iommu,
+};
+
 static char *spapr_phb_vfio_get_loc_code(SpaprPhbState *sphb,  PCIDevice *pdev)
 {
     char *path = NULL, *buf = NULL, *host = NULL;
@@ -1950,7 +1954,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion(&sphb->iommu_root, SPAPR_PCI_MSI_WINDOW,
                                 &sphb->msiwindow);
 
-    pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
+    pci_setup_iommu(bus, &spapr_iommu_ops, sphb);
 
     pci_bus_set_route_irq_fn(bus, spapr_route_intx_pin_to_irq);
 
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index ed8be12..c1c3aa4 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -635,6 +635,10 @@ static AddressSpace *s390_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &iommu->as;
 }
 
+static const PCIIOMMUOps s390_iommu_ops = {
+    .get_address_space = s390_pci_dma_iommu,
+};
+
 static uint8_t set_ind_atomic(uint64_t ind_loc, uint8_t to_be_set)
 {
     uint8_t ind_old, ind_new;
@@ -748,7 +752,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
     b = pci_register_root_bus(dev, NULL, s390_pci_set_irq, s390_pci_map_irq,
                               NULL, get_system_memory(), get_system_io(), 0,
                               64, TYPE_PCI_BUS);
-    pci_setup_iommu(b, s390_pci_dma_iommu, s);
+    pci_setup_iommu(b, &s390_iommu_ops, s);
 
     bus = BUS(b);
     qbus_set_hotplug_handler(bus, OBJECT(dev), &local_err);
@@ -919,7 +923,7 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 
         pdev = PCI_DEVICE(dev);
         pci_bridge_map_irq(pb, dev->id, s390_pci_map_irq);
-        pci_setup_iommu(&pb->sec_bus, s390_pci_dma_iommu, s);
+        pci_setup_iommu(&pb->sec_bus, &s390_iommu_ops, s);
 
         qbus_set_hotplug_handler(BUS(&pb->sec_bus), OBJECT(s), errp);
 
diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 4cee808..fefc24e 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -235,6 +235,10 @@ static AddressSpace *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
     return &sdev->as;
 }
 
+static const PCIIOMMUOps virtio_iommu_ops = {
+    .get_address_space = virtio_iommu_find_add_as,
+};
+
 static int virtio_iommu_attach(VirtIOIOMMU *s,
                                struct virtio_iommu_req_attach *req)
 {
@@ -682,7 +686,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
     s->as_by_busptr = g_hash_table_new_full(NULL, NULL, NULL, g_free);
 
     if (s->primary_bus) {
-        pci_setup_iommu(s->primary_bus, virtio_iommu_find_add_as, s);
+        pci_setup_iommu(s->primary_bus, &virtio_iommu_ops, s);
     } else {
         error_setg(errp, "VIRTIO-IOMMU is not attached to any PCI bus!");
     }
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index cfedf5a..ffe192d 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -485,10 +485,14 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
 
 void pci_device_deassert_intx(PCIDevice *dev);
 
-typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+typedef struct PCIIOMMUOps PCIIOMMUOps;
+struct PCIIOMMUOps {
+    AddressSpace * (*get_address_space)(PCIBus *bus,
+                                void *opaque, int32_t devfn);
+};
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
+void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
 
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 0714f57..c281057 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -29,7 +29,7 @@ enum PCIBusFlags {
 struct PCIBus {
     BusState qbus;
     enum PCIBusFlags flags;
-    PCIIOMMUFunc iommu_fn;
+    const PCIIOMMUOps *iommu_ops;
     void *iommu_opaque;
     uint8_t devfn_min;
     uint32_t slot_reserved_mask;
-- 
2.7.4
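
The value of the ops structure is that it can grow without touching every
host bridge again. As a purely hypothetical sketch (the set_iommu_context
callback name below is invented for illustration and not taken from this
series; the PCIBus, AddressSpace and HostIOMMUContext types are assumed from
the surrounding headers), include/hw/pci/pci.h could later be extended along
these lines:

/* Sketch of a possible future PCIIOMMUOps layout in include/hw/pci/pci.h. */
#include "hw/iommu/host_iommu_context.h"

struct PCIIOMMUOps {
    AddressSpace * (*get_address_space)(PCIBus *bus,
                                        void *opaque, int32_t devfn);
    /* Invented hook: hand a host IOMMU context to the vIOMMU for this devfn. */
    int (*set_iommu_context)(PCIBus *bus, void *opaque, int32_t devfn,
                             HostIOMMUContext *iommu_ctx);
};
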


     memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
+    pci_setup_iommu(b, &ppce500_iommu_ops, s);
 
     pci_create_simple(b, 0, "e500-host-bridge");
 
diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c
index 1a02e9a..7c57311 100644
--- a/hw/pci-host/prep.c
+++ b/hw/pci-host/prep.c
@@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps raven_iommu_ops = {
+    .get_address_space = raven_pcihost_set_iommu,
+};
+
 static void raven_change_gpio(void *opaque, int n, int level)
 {
     PREPPCIState *s = opaque;
@@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
     memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
     memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
     address_space_init(&s->bm_as, &s->bm, "raven-bm");
-    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
+    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
 
     h->bus = &s->pci_bus;
 
diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c
index 2b8503b..251549b 100644
--- a/hw/pci-host/sabre.c
+++ b/hw/pci-host/sabre.c
@@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &is->iommu_as;
 }
 
+static const PCIIOMMUOps sabre_iommu_ops = {
+    .get_address_space = sabre_pci_dma_iommu,
+};
+
 static void sabre_config_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
 {
@@ -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
     /* IOMMU */
     memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
                     sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
-    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
+    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
 
     /* APB secondary busses */
     pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0), true,
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e1ed667..aa9025c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2644,7 +2644,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     PCIBus *iommu_bus = bus;
     uint8_t devfn = dev->devfn;
 
-    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
+    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
 
         /*
@@ -2683,15 +2683,17 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
-    if (iommu_bus && iommu_bus->iommu_fn) {
-        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+    if (iommu_bus && iommu_bus->iommu_ops &&
+                     iommu_bus->iommu_ops->get_address_space) {
+        return iommu_bus->iommu_ops->get_address_space(bus,
+                                 iommu_bus->iommu_opaque, devfn);
     }
     return &address_space_memory;
 }
 
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
+void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
 {
-    bus->iommu_fn = fn;
+    bus->iommu_ops = ops;
     bus->iommu_opaque = opaque;
 }
 
diff --git a/hw/ppc/ppc440_pcix.c b/hw/ppc/ppc440_pcix.c
index 2ee2d4f..7b17ee5 100644
--- a/hw/ppc/ppc440_pcix.c
+++ b/hw/ppc/ppc440_pcix.c
@@ -442,6 +442,10 @@ static AddressSpace *ppc440_pcix_set_iommu(PCIBus *b, void *opaque, int devfn)
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps ppc440_iommu_ops = {
+    .get_address_space = ppc440_pcix_set_iommu,
+};
+
 /* The default pci_host_data_{read,write} functions in pci/pci_host.c
  * deny access to registers without bit 31 set but our clients want
  * this to work so we have to override these here */
@@ -487,7 +491,7 @@ static void ppc440_pcix_realize(DeviceState *dev, Error **errp)
     memory_region_init(&s->bm, OBJECT(s), "bm-ppc440-pcix", UINT64_MAX);
     memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(h->bus, ppc440_pcix_set_iommu, s);
+    pci_setup_iommu(h->bus, &ppc440_iommu_ops, s);
 
     memory_region_init(&s->container, OBJECT(s), "pci-container", PCI_ALL_SIZE);
     memory_region_init_io(&h->conf_mem, OBJECT(s), &pci_host_conf_le_ops,
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 709a527..729a1cb 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -771,6 +771,10 @@ static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &phb->iommu_as;
 }
 
+static const PCIIOMMUOps spapr_iommu_ops = {
+    .get_address_space = spapr_pci_dma_iommu,
+};
+
 static char *spapr_phb_vfio_get_loc_code(SpaprPhbState *sphb,  PCIDevice *pdev)
 {
     char *path = NULL, *buf = NULL, *host = NULL;
@@ -1950,7 +1954,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion(&sphb->iommu_root, SPAPR_PCI_MSI_WINDOW,
                                 &sphb->msiwindow);
 
-    pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
+    pci_setup_iommu(bus, &spapr_iommu_ops, sphb);
 
     pci_bus_set_route_irq_fn(bus, spapr_route_intx_pin_to_irq);
 
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index ed8be12..c1c3aa4 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -635,6 +635,10 @@ static AddressSpace *s390_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &iommu->as;
 }
 
+static const PCIIOMMUOps s390_iommu_ops = {
+    .get_address_space = s390_pci_dma_iommu,
+};
+
 static uint8_t set_ind_atomic(uint64_t ind_loc, uint8_t to_be_set)
 {
     uint8_t ind_old, ind_new;
@@ -748,7 +752,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
     b = pci_register_root_bus(dev, NULL, s390_pci_set_irq, s390_pci_map_irq,
                               NULL, get_system_memory(), get_system_io(), 0,
                               64, TYPE_PCI_BUS);
-    pci_setup_iommu(b, s390_pci_dma_iommu, s);
+    pci_setup_iommu(b, &s390_iommu_ops, s);
 
     bus = BUS(b);
     qbus_set_hotplug_handler(bus, OBJECT(dev), &local_err);
@@ -919,7 +923,7 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 
         pdev = PCI_DEVICE(dev);
         pci_bridge_map_irq(pb, dev->id, s390_pci_map_irq);
-        pci_setup_iommu(&pb->sec_bus, s390_pci_dma_iommu, s);
+        pci_setup_iommu(&pb->sec_bus, &s390_iommu_ops, s);
 
         qbus_set_hotplug_handler(BUS(&pb->sec_bus), OBJECT(s), errp);
 
diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 4cee808..fefc24e 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -235,6 +235,10 @@ static AddressSpace *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
     return &sdev->as;
 }
 
+static const PCIIOMMUOps virtio_iommu_ops = {
+    .get_address_space = virtio_iommu_find_add_as,
+};
+
 static int virtio_iommu_attach(VirtIOIOMMU *s,
                                struct virtio_iommu_req_attach *req)
 {
@@ -682,7 +686,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
     s->as_by_busptr = g_hash_table_new_full(NULL, NULL, NULL, g_free);
 
     if (s->primary_bus) {
-        pci_setup_iommu(s->primary_bus, virtio_iommu_find_add_as, s);
+        pci_setup_iommu(s->primary_bus, &virtio_iommu_ops, s);
     } else {
         error_setg(errp, "VIRTIO-IOMMU is not attached to any PCI bus!");
     }
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index cfedf5a..ffe192d 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -485,10 +485,14 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
 
 void pci_device_deassert_intx(PCIDevice *dev);
 
-typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+typedef struct PCIIOMMUOps PCIIOMMUOps;
+struct PCIIOMMUOps {
+    AddressSpace * (*get_address_space)(PCIBus *bus,
+                                void *opaque, int32_t devfn);
+};
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
+void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
 
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 0714f57..c281057 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -29,7 +29,7 @@ enum PCIBusFlags {
 struct PCIBus {
     BusState qbus;
     enum PCIBusFlags flags;
-    PCIIOMMUFunc iommu_fn;
+    const PCIIOMMUOps *iommu_ops;
     void *iommu_opaque;
     uint8_t devfn_min;
     uint32_t slot_reserved_mask;
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 06/22] hw/pci: introduce pci_device_set/unset_iommu_context()
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

This patch adds pci_device_set/unset_iommu_context() to set/unset
a HostIOMMUContext for a given device. New callbacks are added to
PCIIOMMUOps. With them, a vIOMMU can make use of host IOMMU
capabilities, e.g. to set up nested translation.
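
Below is a minimal usage sketch of the new API. The foo_iommu_* names
and the foo_state opaque pointer are hypothetical and only illustrate
the intended wiring; the real users appear in later patches of this
series.

    /* vIOMMU side: advertise the callbacks on its PCI bus */
    static const PCIIOMMUOps foo_iommu_ops = {
        .get_address_space   = foo_iommu_find_add_as,
        .set_iommu_context   = foo_iommu_set_iommu_context,
        .unset_iommu_context = foo_iommu_unset_iommu_context,
    };
    pci_setup_iommu(bus, &foo_iommu_ops, foo_state);

    /* pass-through backend side: hand over / withdraw its context */
    if (pci_device_set_iommu_context(pdev, iommu_ctx)) {
        /* -ENOENT: the vIOMMU does not implement the callback */
    }
    /* and on teardown: */
    pci_device_unset_iommu_context(pdev);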

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/pci/pci.c         | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
 include/hw/pci/pci.h | 10 ++++++++++
 2 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index aa9025c..af3c1a1 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2638,7 +2638,8 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
     }
 }
 
-AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
+                              PCIBus **pbus, uint8_t *pdevfn)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
@@ -2683,14 +2684,52 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
-    if (iommu_bus && iommu_bus->iommu_ops &&
-                     iommu_bus->iommu_ops->get_address_space) {
-        return iommu_bus->iommu_ops->get_address_space(bus,
-                                 iommu_bus->iommu_opaque, devfn);
+    *pbus = iommu_bus;
+    *pdevfn = devfn;
+}
+
+AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+{
+    PCIBus *bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
+    if (bus && bus->iommu_ops &&
+        bus->iommu_ops->get_address_space) {
+        return bus->iommu_ops->get_address_space(bus,
+                                bus->iommu_opaque, devfn);
     }
     return &address_space_memory;
 }
 
+int pci_device_set_iommu_context(PCIDevice *dev,
+                                 HostIOMMUContext *iommu_ctx)
+{
+    PCIBus *bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
+    if (bus && bus->iommu_ops &&
+        bus->iommu_ops->set_iommu_context) {
+        return bus->iommu_ops->set_iommu_context(bus,
+                              bus->iommu_opaque, devfn, iommu_ctx);
+    }
+    return -ENOENT;
+}
+
+void pci_device_unset_iommu_context(PCIDevice *dev)
+{
+    PCIBus *bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
+    if (bus && bus->iommu_ops &&
+        bus->iommu_ops->unset_iommu_context) {
+        bus->iommu_ops->unset_iommu_context(bus,
+                                 bus->iommu_opaque, devfn);
+    }
+}
+
 void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
 {
     bus->iommu_ops = ops;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index ffe192d..0ec5680 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -9,6 +9,8 @@
 
 #include "hw/pci/pcie.h"
 
+#include "hw/iommu/host_iommu_context.h"
+
 extern bool pci_available;
 
 /* PCI bus */
@@ -489,9 +491,17 @@ typedef struct PCIIOMMUOps PCIIOMMUOps;
 struct PCIIOMMUOps {
     AddressSpace * (*get_address_space)(PCIBus *bus,
                                 void *opaque, int32_t devfn);
+    int (*set_iommu_context)(PCIBus *bus, void *opaque,
+                             int32_t devfn,
+                             HostIOMMUContext *iommu_ctx);
+    void (*unset_iommu_context)(PCIBus *bus, void *opaque,
+                                int32_t devfn);
 };
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
+int pci_device_set_iommu_context(PCIDevice *dev,
+                                 HostIOMMUContext *iommu_ctx);
+void pci_device_unset_iommu_context(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
 
 static inline void
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 07/22] intel_iommu: add set/unset_iommu_context callback
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

This patch adds the set/unset_iommu_context() implementation in the
Intel vIOMMU. On Intel platforms, pass-through modules (e.g. VFIO)
can use it to hand a HostIOMMUContext to the Intel vIOMMU emulator.
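
As a sketch of how the cached context is meant to be consumed later
(using only the structures added below; locking via iommu_lock and
error handling are omitted for brevity):

    VTDBus *vtd_bus = vtd_find_add_bus(s, bus);
    VTDHostIOMMUContext *vtd_dev_icx = vtd_bus->dev_icx[devfn];

    if (vtd_dev_icx) {
        HostIOMMUContext *iommu_ctx = vtd_dev_icx->iommu_ctx;
        /* talk to the host IOMMU through iommu_ctx, e.g. PASID ops */
    }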

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c         | 71 ++++++++++++++++++++++++++++++++++++++++---
 include/hw/i386/intel_iommu.h | 21 ++++++++++---
 2 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4b22910..fd349c6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3354,23 +3354,33 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
     },
 };
 
-VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+/**
+ * Fetch a VTDBus instance for given PCIBus. If no existing instance,
+ * allocate one.
+ */
+static VTDBus *vtd_find_add_bus(IntelIOMMUState *s, PCIBus *bus)
 {
     uintptr_t key = (uintptr_t)bus;
     VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
-    VTDAddressSpace *vtd_dev_as;
-    char name[128];
 
     if (!vtd_bus) {
         uintptr_t *new_key = g_malloc(sizeof(*new_key));
         *new_key = (uintptr_t)bus;
         /* No corresponding free() */
-        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
-                            PCI_DEVFN_MAX);
+        vtd_bus = g_malloc0(sizeof(VTDBus));
         vtd_bus->bus = bus;
         g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
     }
+    return vtd_bus;
+}
 
+VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+{
+    VTDBus *vtd_bus;
+    VTDAddressSpace *vtd_dev_as;
+    char name[128];
+
+    vtd_bus = vtd_find_add_bus(s, bus);
     vtd_dev_as = vtd_bus->dev_as[devfn];
 
     if (!vtd_dev_as) {
@@ -3436,6 +3446,55 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+static int vtd_dev_set_iommu_context(PCIBus *bus, void *opaque,
+                                     int devfn,
+                                     HostIOMMUContext *iommu_ctx)
+{
+    IntelIOMMUState *s = opaque;
+    VTDBus *vtd_bus;
+    VTDHostIOMMUContext *vtd_dev_icx;
+
+    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+    vtd_bus = vtd_find_add_bus(s, bus);
+
+    vtd_iommu_lock(s);
+
+    vtd_dev_icx = vtd_bus->dev_icx[devfn];
+
+    assert(!vtd_dev_icx);
+
+    vtd_bus->dev_icx[devfn] = vtd_dev_icx =
+                    g_malloc0(sizeof(VTDHostIOMMUContext));
+    vtd_dev_icx->vtd_bus = vtd_bus;
+    vtd_dev_icx->devfn = (uint8_t)devfn;
+    vtd_dev_icx->iommu_state = s;
+    vtd_dev_icx->iommu_ctx = iommu_ctx;
+
+    vtd_iommu_unlock(s);
+
+    return 0;
+}
+
+static void vtd_dev_unset_iommu_context(PCIBus *bus, void *opaque, int devfn)
+{
+    IntelIOMMUState *s = opaque;
+    VTDBus *vtd_bus;
+    VTDHostIOMMUContext *vtd_dev_icx;
+
+    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+    vtd_bus = vtd_find_add_bus(s, bus);
+
+    vtd_iommu_lock(s);
+
+    vtd_dev_icx = vtd_bus->dev_icx[devfn];
+    g_free(vtd_dev_icx);
+    vtd_bus->dev_icx[devfn] = NULL;
+
+    vtd_iommu_unlock(s);
+}
+
 static uint64_t get_naturally_aligned_size(uint64_t start,
                                            uint64_t size, int gaw)
 {
@@ -3731,6 +3790,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 
 static PCIIOMMUOps vtd_iommu_ops = {
     .get_address_space = vtd_host_dma_iommu,
+    .set_iommu_context = vtd_dev_set_iommu_context,
+    .unset_iommu_context = vtd_dev_unset_iommu_context,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 3870052..b5fefb9 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -64,6 +64,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
+typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -112,10 +113,20 @@ struct VTDAddressSpace {
     IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
 };
 
+struct VTDHostIOMMUContext {
+    VTDBus *vtd_bus;
+    uint8_t devfn;
+    HostIOMMUContext *iommu_ctx;
+    IntelIOMMUState *iommu_state;
+};
+
 struct VTDBus {
-    PCIBus* bus;		/* A reference to the bus to provide translation for */
+    /* A reference to the bus to provide translation for */
+    PCIBus *bus;
     /* A table of VTDAddressSpace objects indexed by devfn */
-    VTDAddressSpace *dev_as[];
+    VTDAddressSpace *dev_as[PCI_DEVFN_MAX];
+    /* A table of VTDHostIOMMUContext objects indexed by devfn */
+    VTDHostIOMMUContext *dev_icx[PCI_DEVFN_MAX];
 };
 
 struct VTDIOTLBEntry {
@@ -269,8 +280,10 @@ struct IntelIOMMUState {
     bool dma_drain;                 /* Whether DMA r/w draining enabled */
 
     /*
-     * Protects IOMMU states in general.  Currently it protects the
-     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
+     * iommu_lock protects below:
+     * - per-IOMMU IOTLB caches
+     * - context entry cache in VTDAddressSpace
+     * - HostIOMMUContext pointer cached in vIOMMU
      */
     QemuMutex iommu_lock;
 };
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

This patch defines vfio_host_iommu_context_info and implements the
PASID alloc/free hooks declared in HostIOMMUContextClass.
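
These hooks are reached through the HostIOMMUContextClass vtable.
Roughly, the generic wrapper introduced earlier in the series
dispatches as in the sketch below (the wrapper name and its exact
signature are illustrative; only pasid_alloc/pasid_free and
HOST_IOMMU_CONTEXT_GET_CLASS() come from this series):

    int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
                                   uint32_t min, uint32_t max,
                                   uint32_t *pasid)
    {
        HostIOMMUContextClass *hicxc =
                        HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);

        if (!hicxc || !hicxc->pasid_alloc) {
            return -EINVAL;
        }
        return hicxc->pasid_alloc(iommu_ctx, min, max, pasid);
    }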

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/vfio/common.c                      | 69 +++++++++++++++++++++++++++++++++++
 include/hw/iommu/host_iommu_context.h |  3 ++
 include/hw/vfio/vfio-common.h         |  4 ++
 3 files changed, 76 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c276732..5f3534d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1179,6 +1179,53 @@ static int vfio_get_iommu_type(VFIOContainer *container,
     return -EINVAL;
 }
 
+static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
+                                           uint32_t min, uint32_t max,
+                                           uint32_t *pasid)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
+    struct vfio_iommu_type1_pasid_request req;
+    unsigned long argsz;
+    int ret;
+
+    argsz = sizeof(req);
+    req.argsz = argsz;
+    req.flags = VFIO_IOMMU_PASID_ALLOC;
+    req.alloc_pasid.min = min;
+    req.alloc_pasid.max = max;
+
+    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
+        ret = -errno;
+        error_report("%s: %d, alloc failed", __func__, ret);
+        return ret;
+    }
+    *pasid = req.alloc_pasid.result;
+    return 0;
+}
+
+static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
+                                          uint32_t pasid)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
+    struct vfio_iommu_type1_pasid_request req;
+    unsigned long argsz;
+    int ret;
+
+    argsz = sizeof(req);
+    req.argsz = argsz;
+    req.flags = VFIO_IOMMU_PASID_FREE;
+    req.free_pasid = pasid;
+
+    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
+        ret = -errno;
+        error_report("%s: %d, free failed", __func__, ret);
+        return ret;
+    }
+    return 0;
+}
+
 static int vfio_init_container(VFIOContainer *container, int group_fd,
                                Error **errp)
 {
@@ -1791,3 +1838,25 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
     }
     return vfio_eeh_container_op(container, op);
 }
+
+static void vfio_host_iommu_context_class_init(ObjectClass *klass,
+                                                       void *data)
+{
+    HostIOMMUContextClass *hicxc = HOST_IOMMU_CONTEXT_CLASS(klass);
+
+    hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
+    hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free;
+}
+
+static const TypeInfo vfio_host_iommu_context_info = {
+    .parent = TYPE_HOST_IOMMU_CONTEXT,
+    .name = TYPE_VFIO_HOST_IOMMU_CONTEXT,
+    .class_init = vfio_host_iommu_context_class_init,
+};
+
+static void vfio_register_types(void)
+{
+    type_register_static(&vfio_host_iommu_context_info);
+}
+
+type_init(vfio_register_types)
diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
index 35c4861..227c433 100644
--- a/include/hw/iommu/host_iommu_context.h
+++ b/include/hw/iommu/host_iommu_context.h
@@ -33,6 +33,9 @@
 #define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
 #define HOST_IOMMU_CONTEXT(obj) \
         OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
+#define HOST_IOMMU_CONTEXT_CLASS(klass) \
+        OBJECT_CLASS_CHECK(HostIOMMUContextClass, (klass), \
+                         TYPE_HOST_IOMMU_CONTEXT)
 #define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
         OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
                          TYPE_HOST_IOMMU_CONTEXT)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd56420..0b07303 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -26,12 +26,15 @@
 #include "qemu/notify.h"
 #include "ui/console.h"
 #include "hw/display/ramfb.h"
+#include "hw/iommu/host_iommu_context.h"
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
+#define TYPE_VFIO_HOST_IOMMU_CONTEXT "qemu:vfio-host-iommu-context"
+
 enum {
     VFIO_DEVICE_TYPE_PCI = 0,
     VFIO_DEVICE_TYPE_PLATFORM = 1,
@@ -71,6 +74,7 @@ typedef struct VFIOContainer {
     MemoryListener listener;
     MemoryListener prereg_listener;
     unsigned iommu_type;
+    HostIOMMUContext iommu_ctx;
     Error *error;
     bool initialized;
     unsigned long pgsizes;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

In this patch, QEMU first gets the IOMMU info from the kernel to
check the capabilities supported by a VFIO_IOMMU_TYPE1_NESTING iommu,
and then initializes a HostIOMMUContext instance.
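
A minimal usage sketch of the vfio_get_iommu_info() helper added
below (the caller owns the returned buffer):

    struct vfio_iommu_type1_info *info = NULL;

    if (!vfio_get_iommu_info(container, &info)) {
        /* walk the capability chain, e.g. look up the nesting cap */
        g_free(info);
    }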

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/vfio/common.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5f3534d..44b142c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1226,10 +1226,89 @@ static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
     return 0;
 }
 
+/**
+ * Get iommu info from host. On success, the caller is responsible
+ * for freeing the memory pointed to by the returned pointer stored
+ * in @info once it is no longer needed.
+ */
+static int vfio_get_iommu_info(VFIOContainer *container,
+                         struct vfio_iommu_type1_info **info)
+{
+
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+    *info = g_malloc0(argsz);
+
+retry:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if (((*info)->argsz > argsz)) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto retry;
+    }
+
+    return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+static int vfio_get_nesting_iommu_cap(VFIOContainer *container,
+                   struct vfio_iommu_type1_info_cap_nesting *cap_nesting)
+{
+    struct vfio_iommu_type1_info *info;
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_nesting *cap;
+    int ret;
+
+    ret = vfio_get_iommu_info(container, &info);
+    if (ret) {
+        return ret;
+    }
+
+    hdr = vfio_get_iommu_info_cap(info,
+                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
+    if (!hdr) {
+        g_free(info);
+        return -errno;
+    }
+
+    cap = container_of(hdr,
+                struct vfio_iommu_type1_info_cap_nesting, header);
+    *cap_nesting = *cap;
+
+    g_free(info);
+    return 0;
+}
+
 static int vfio_init_container(VFIOContainer *container, int group_fd,
                                Error **errp)
 {
     int iommu_type, ret;
+    uint64_t flags = 0;
 
     iommu_type = vfio_get_iommu_type(container, errp);
     if (iommu_type < 0) {
@@ -1257,6 +1336,26 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
         return -errno;
     }
 
+    if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+        struct vfio_iommu_type1_info_cap_nesting nesting = {
+                                         .nesting_capabilities = 0x0,
+                                         .stage1_formats = 0, };
+
+        ret = vfio_get_nesting_iommu_cap(container, &nesting);
+        if (ret) {
+            error_setg_errno(errp, -ret,
+                             "Failed to get nesting iommu cap");
+            return ret;
+        }
+
+        flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
+                 HOST_IOMMU_PASID_REQUEST : 0;
+        host_iommu_ctx_init(&container->iommu_ctx,
+                            sizeof(container->iommu_ctx),
+                            TYPE_VFIO_HOST_IOMMU_CONTEXT,
+                            flags);
+    }
+
     container->iommu_type = iommu_type;
     return 0;
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 10/22] vfio/pci: set host iommu context to vIOMMU
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

For vfio-pci devices, pci_device_set/unset_iommu_context() can be
used to expose the host IOMMU context to vIOMMU emulators. vIOMMU
emulators can then make use of the methods provided by the host
IOMMU context, e.g. to propagate requests to the host IOMMU.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/vfio/pci.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95..c140c88 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2717,6 +2717,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     VFIOPCIDevice *vdev = PCI_VFIO(pdev);
     VFIODevice *vbasedev_iter;
     VFIOGroup *group;
+    VFIOContainer *container;
     char *tmp, *subsys, group_path[PATH_MAX], *group_name;
     Error *err = NULL;
     ssize_t len;
@@ -3028,6 +3029,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
 
+    container = vdev->vbasedev.group->container;
+    if (container->iommu_ctx.initialized) {
+        pci_device_set_iommu_context(pdev, &container->iommu_ctx);
+    }
+
     return;
 
 out_deregister:
@@ -3072,9 +3078,16 @@ static void vfio_instance_finalize(Object *obj)
 static void vfio_exitfn(PCIDevice *pdev)
 {
     VFIOPCIDevice *vdev = PCI_VFIO(pdev);
+    VFIOContainer *container;
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
+
+    container = vdev->vbasedev.group->container;
+    if (container->iommu_ctx.initialized) {
+        pci_device_unset_iommu_context(pdev);
+    }
+
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     if (vdev->irqchip_change_notifier.notify) {
         kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 11/22] intel_iommu: add virtual command capability support
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

This patch adds virtual command support to the Intel vIOMMU per the
Intel VT-d 3.1 spec, and adds two virtual commands: allocate PASID
and free PASID.

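As an informal illustration of the guest-side protocol this emulation
implements (not part of this patch): the guest writes a command to the
VCMD register, waits for the IP bit in the VCRSP register to clear, then
reads the status code and result. The register offsets and field layout
follow the definitions added below; mmio_read64()/mmio_write64() are
assumed MMIO helpers in a hypothetical guest driver.

    /* Sketch: guest driver allocating a PASID via the virtual command
     * interface. 0xE10/0xE20 are DMAR_VCMD_REG/DMAR_VCRSP_REG. */
    static int vcmd_alloc_pasid(uint64_t vtd_base, uint32_t *pasid)
    {
        uint64_t rsp;

        mmio_write64(vtd_base + 0xE10, 1);        /* VTD_VCMD_ALLOC_PASID */

        do {
            rsp = mmio_read64(vtd_base + 0xE20);  /* poll VCRSP */
        } while (rsp & 0x1);                      /* IP bit still set */

        if ((rsp >> 1) & 0x3) {                   /* non-zero status code */
            return -1;                            /* e.g. no PASID available */
        }
        *pasid = (rsp >> 8) & 0xfffff;            /* allocated PASID */
        return 0;
    }
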
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/i386/intel_iommu.c          | 154 ++++++++++++++++++++++++++++++++++++++++-
 hw/i386/intel_iommu_internal.h |  37 ++++++++++
 hw/i386/trace-events           |   1 +
 include/hw/i386/intel_iommu.h  |  10 ++-
 4 files changed, 200 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fd349c6..6c3159f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2651,6 +2651,129 @@ static void vtd_handle_iectl_write(IntelIOMMUState *s)
     }
 }
 
+static int vtd_request_pasid_alloc(IntelIOMMUState *s, uint32_t *pasid)
+{
+    VTDHostIOMMUContext *vtd_dev_icx;
+    int ret = -1;
+
+    vtd_iommu_lock(s);
+    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
+        HostIOMMUContext *iommu_ctx = vtd_dev_icx->iommu_ctx;
+
+        /*
+         * We'll return the first valid result we got. It's
+         * a bit hackish in that we don't have a good global
+         * interface yet to talk to modules like vfio to deliver
+         * this allocation request, so we're leveraging this
+         * per-device iommu context to do the same thing just
+         * to make sure the allocation happens only once.
+         */
+        ret = host_iommu_ctx_pasid_alloc(iommu_ctx, VTD_HPASID_MIN,
+                                         VTD_HPASID_MAX, pasid);
+        if (!ret) {
+            break;
+        }
+    }
+    vtd_iommu_unlock(s);
+
+    return ret;
+}
+
+static int vtd_request_pasid_free(IntelIOMMUState *s, uint32_t pasid)
+{
+    VTDHostIOMMUContext *vtd_dev_icx;
+    int ret = -1;
+
+    vtd_iommu_lock(s);
+    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
+        HostIOMMUContext *iommu_ctx = vtd_dev_icx->iommu_ctx;
+
+        /*
+         * Similar with pasid allocation. We'll free the pasid
+         * on the first successful free operation. It's a bit
+         * hackish in that we don't have a good global interface
+         * yet to talk to modules like vfio to deliver this pasid
+         * free request, so we're leveraging this per-device iommu
+         * context to do the same thing just to make sure the free
+         * happens only once.
+         */
+        ret = host_iommu_ctx_pasid_free(iommu_ctx, pasid);
+        if (!ret) {
+            break;
+        }
+    }
+    vtd_iommu_unlock(s);
+
+    return ret;
+}
+
+/*
+ * Set the In Progress (IP) field of the Virtual Command Response
+ * register. The caller only invokes this when IP is not yet set.
+ */
+static void vtd_vcmd_set_ip(IntelIOMMUState *s)
+{
+    s->vcrsp = 1;
+    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
+                     ((uint64_t) s->vcrsp));
+}
+
+static void vtd_vcmd_clear_ip(IntelIOMMUState *s)
+{
+    s->vcrsp &= (~((uint64_t)(0x1)));
+    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
+                     ((uint64_t) s->vcrsp));
+}
+
+/* Handle write to Virtual Command Register */
+static int vtd_handle_vcmd_write(IntelIOMMUState *s, uint64_t val)
+{
+    uint32_t pasid;
+    int ret = -1;
+
+    trace_vtd_reg_write_vcmd(s->vcrsp, val);
+
+    if (!(s->vccap & VTD_VCCAP_PAS) ||
+         (s->vcrsp & 1)) {
+        return -1;
+    }
+
+    /*
+     * The vCPU is blocked while the guest VCMD write is trapped
+     * here, so no other vCPU should be accessing VCMD if guest
+     * software is well written. However, we still emulate the IP
+     * bit here in case of bad guest software, and to stay aligned
+     * with the spec.
+     */
+    vtd_vcmd_set_ip(s);
+
+    switch (val & VTD_VCMD_CMD_MASK) {
+    case VTD_VCMD_ALLOC_PASID:
+        ret = vtd_request_pasid_alloc(s, &pasid);
+        if (ret) {
+            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_NO_AVAILABLE_PASID);
+        } else {
+            s->vcrsp |= VTD_VCRSP_RSLT(pasid);
+        }
+        break;
+
+    case VTD_VCMD_FREE_PASID:
+        pasid = VTD_VCMD_PASID_VALUE(val);
+        ret = vtd_request_pasid_free(s, pasid);
+        if (ret < 0) {
+            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_FREE_INVALID_PASID);
+        }
+        break;
+
+    default:
+        s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_UNDEFINED_CMD);
+        error_report_once("Virtual Command: unsupported command!!!");
+        break;
+    }
+    vtd_vcmd_clear_ip(s);
+    return 0;
+}
+
 static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
 {
     IntelIOMMUState *s = opaque;
@@ -2939,6 +3062,23 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
         vtd_set_long(s, addr, val);
         break;
 
+    case DMAR_VCMD_REG:
+        if (!vtd_handle_vcmd_write(s, val)) {
+            if (size == 4) {
+                vtd_set_long(s, addr, val);
+            } else {
+                vtd_set_quad(s, addr, val);
+            }
+        }
+        break;
+
+    case DMAR_VCMD_REG_HI:
+        assert(size == 4);
+        if (!vtd_handle_vcmd_write(s, val)) {
+            vtd_set_long(s, addr, val);
+        }
+        break;
+
     default:
         if (size == 4) {
             vtd_set_long(s, addr, val);
@@ -3470,6 +3610,7 @@ static int vtd_dev_set_iommu_context(PCIBus *bus, void *opaque,
     vtd_dev_icx->devfn = (uint8_t)devfn;
     vtd_dev_icx->iommu_state = s;
     vtd_dev_icx->iommu_ctx = iommu_ctx;
+    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
 
     vtd_iommu_unlock(s);
 
@@ -3489,7 +3630,10 @@ static void vtd_dev_unset_iommu_context(PCIBus *bus, void *opaque, int devfn)
     vtd_iommu_lock(s);
 
     vtd_dev_icx = vtd_bus->dev_icx[devfn];
-    g_free(vtd_dev_icx);
+    if (vtd_dev_icx) {
+        QLIST_REMOVE(vtd_dev_icx, next);
+        g_free(vtd_dev_icx);
+    }
     vtd_bus->dev_icx[devfn] = NULL;
 
     vtd_iommu_unlock(s);
@@ -3764,6 +3908,13 @@ static void vtd_init(IntelIOMMUState *s)
      * Interrupt remapping registers.
      */
     vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xfffffffffffff80fULL, 0);
+
+    /*
+     * Virtual Command Definitions
+     */
+    vtd_define_quad(s, DMAR_VCCAP_REG, s->vccap, 0, 0);
+    vtd_define_quad(s, DMAR_VCMD_REG, 0, 0xffffffffffffffffULL, 0);
+    vtd_define_quad(s, DMAR_VCRSP_REG, 0, 0, 0);
 }
 
 /* Should not reset address_spaces when reset because devices will still use
@@ -3878,6 +4029,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     }
 
     QLIST_INIT(&s->vtd_as_with_notifiers);
+    QLIST_INIT(&s->vtd_dev_icx_list);
     qemu_mutex_init(&s->iommu_lock);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 862033e..3fc83f1 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -85,6 +85,12 @@
 #define DMAR_MTRRCAP_REG_HI     0x104
 #define DMAR_MTRRDEF_REG        0x108 /* MTRR default type */
 #define DMAR_MTRRDEF_REG_HI     0x10c
+#define DMAR_VCCAP_REG          0xE00 /* Virtual Command Capability Register */
+#define DMAR_VCCAP_REG_HI       0xE04
+#define DMAR_VCMD_REG           0xE10 /* Virtual Command Register */
+#define DMAR_VCMD_REG_HI        0xE14
+#define DMAR_VCRSP_REG          0xE20 /* Virtual Command Response Register */
+#define DMAR_VCRSP_REG_HI       0xE24
 
 /* IOTLB registers */
 #define DMAR_IOTLB_REG_OFFSET   0xf0 /* Offset to the IOTLB registers */
@@ -312,6 +318,37 @@ typedef enum VTDFaultReason {
 
 #define VTD_CONTEXT_CACHE_GEN_MAX       0xffffffffUL
 
+/* VCCAP_REG */
+#define VTD_VCCAP_PAS               (1UL << 0)
+
+/*
+ * The basic idea is to let the hypervisor set a range of PASIDs that
+ * are available to VMs. One reason is that PASID #0 is reserved for
+ * RID_PASID usage. We do not know how many PASIDs may be reserved in
+ * the future, so this is just an estimated value. Honestly, setting
+ * the minimum to 1 is enough at the current stage.
+ */
+#define VTD_HPASID_MIN              1
+#define VTD_HPASID_MAX              0xFFFFF
+
+/* Virtual Command Register */
+enum {
+     VTD_VCMD_NULL_CMD = 0,
+     VTD_VCMD_ALLOC_PASID = 1,
+     VTD_VCMD_FREE_PASID = 2,
+     VTD_VCMD_CMD_NUM,
+};
+
+#define VTD_VCMD_CMD_MASK           0xffUL
+#define VTD_VCMD_PASID_VALUE(val)   (((val) >> 8) & 0xfffff)
+
+#define VTD_VCRSP_RSLT(val)         ((val) << 8)
+#define VTD_VCRSP_SC(val)           (((val) & 0x3) << 1)
+
+#define VTD_VCMD_UNDEFINED_CMD         1ULL
+#define VTD_VCMD_NO_AVAILABLE_PASID    2ULL
+#define VTD_VCMD_FREE_INVALID_PASID    2ULL
+
 /* Interrupt Entry Cache Invalidation Descriptor: VT-d 6.5.2.7. */
 struct VTDInvDescIEC {
     uint32_t type:4;            /* Should always be 0x4 */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index e48bef2..71536a7 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -51,6 +51,7 @@ vtd_reg_write_gcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"
 vtd_reg_write_fectl(uint32_t value) "value 0x%"PRIx32
 vtd_reg_write_iectl(uint32_t value) "value 0x%"PRIx32
 vtd_reg_ics_clear_ip(void) ""
+vtd_reg_write_vcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"PRIx32
 vtd_dmar_translate(uint8_t bus, uint8_t slot, uint8_t func, uint64_t iova, uint64_t gpa, uint64_t mask) "dev %02x:%02x.%02x iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64
 vtd_dmar_enable(bool en) "enable %d"
 vtd_dmar_fault(uint16_t sid, int fault, uint64_t addr, bool is_write) "sid 0x%"PRIx16" fault %d addr 0x%"PRIx64" write %d"
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index b5fefb9..42a58d6 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -42,7 +42,7 @@
 #define VTD_SID_TO_BUS(sid)         (((sid) >> 8) & 0xff)
 #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
 
-#define DMAR_REG_SIZE               0x230
+#define DMAR_REG_SIZE               0xF00
 #define VTD_HOST_AW_39BIT           39
 #define VTD_HOST_AW_48BIT           48
 #define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
@@ -118,6 +118,7 @@ struct VTDHostIOMMUContext {
     uint8_t devfn;
     HostIOMMUContext *iommu_ctx;
     IntelIOMMUState *iommu_state;
+    QLIST_ENTRY(VTDHostIOMMUContext) next;
 };
 
 struct VTDBus {
@@ -269,6 +270,9 @@ struct IntelIOMMUState {
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
+    /* list of VTDHostIOMMUContexts */
+    QLIST_HEAD(, VTDHostIOMMUContext) vtd_dev_icx_list;
+
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
     dma_addr_t intr_root;           /* Interrupt remapping table pointer */
@@ -279,6 +283,10 @@ struct IntelIOMMUState {
     uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
     bool dma_drain;                 /* Whether DMA r/w draining enabled */
 
+    /* Virtual Command Register */
+    uint64_t vccap;                 /* The value of vcmd capability reg */
+    uint64_t vcrsp;                 /* Current value of VCMD RSP REG */
+
     /*
      * iommu_lock protects below:
      * - per-IOMMU IOTLB caches
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread


* [PATCH v2 12/22] intel_iommu: process PASID cache invalidation
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

This patch adds PASID cache invalidation handling. When the guest
enables PASID usage (e.g. SVA), guest software should issue a proper
PASID cache invalidation when caching-mode is exposed. This patch only
adds the draft handling of PASID cache invalidation; detailed handling
will be added in subsequent patches.

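For reference (not part of this patch), a rough sketch of the descriptor a
guest would queue for a PASID-selective PASID cache invalidation, using the
field layout checked below. It assumes the 256-bit val[4] descriptor layout
used in this series; VTD_INV_DESC_PC is the descriptor type already
dispatched in vtd_process_inv_desc().

    /* Sketch: building a PASID-selective pc_inv_desc on the guest side. */
    static VTDInvDesc build_pc_inv_desc(uint16_t domain_id, uint32_t pasid)
    {
        VTDInvDesc desc = { .val = { 0 } };

        desc.val[0] = VTD_INV_DESC_PC                /* descriptor type      */
                    | VTD_INV_DESC_PASIDC_PASID_SI   /* PASID-selective      */
                    | ((uint64_t)domain_id << 16)    /* DID, bits 31:16      */
                    | ((uint64_t)pasid << 32);       /* PASID, bits 51:32    */
        /* val[1..3] stay zero: any set bit trips the RSVD_VAL checks below. */
        return desc;
    }
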
v1 -> v2: remove vtd_pasid_cache_gsi(), vtd_pasid_cache_psi() and
          vtd_pasid_cache_dsi()

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 40 +++++++++++++++++++++++++++++++++++-----
 hw/i386/intel_iommu_internal.h | 12 ++++++++++++
 hw/i386/trace-events           |  3 +++
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6c3159f..2eb60c3 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2395,6 +2395,37 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+                                   VTDInvDesc *inv_desc)
+{
+    if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
+        (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
+        (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
+        (inv_desc->val[3] & VTD_INV_DESC_PASIDC_RSVD_VAL3)) {
+        error_report_once("non-zero-field-in-pc_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
+    case VTD_INV_DESC_PASIDC_DSI:
+        break;
+
+    case VTD_INV_DESC_PASIDC_PASID_SI:
+        break;
+
+    case VTD_INV_DESC_PASIDC_GLOBAL:
+        break;
+
+    default:
+        error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -2501,12 +2532,11 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
-    /*
-     * TODO: the entity of below two cases will be implemented in future series.
-     * To make guest (which integrates scalable mode support patch set in
-     * iommu driver) work, just return true is enough so far.
-     */
     case VTD_INV_DESC_PC:
+        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_pasid_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_PIOTLB:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 3fc83f1..9a76f20 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -444,6 +444,18 @@ typedef union VTDInvDesc VTDInvDesc;
         (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
         (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 
+#define VTD_INV_DESC_PASIDC_G          (3ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0  0xfff000000000ffc0ULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL1  0xffffffffffffffffULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL2  0xffffffffffffffffULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL3  0xffffffffffffffffULL
+
+#define VTD_INV_DESC_PASIDC_DSI        (0ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
+#define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 71536a7..f7cd4e5 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -22,6 +22,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread


* [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

This patch adds a PASID cache management infrastructure based on the
newly added structure VTDPASIDAddressSpace, which is used to track
PASID usage and to support future PASID-tagged DMA address translation
in the vIOMMU.

    struct VTDPASIDAddressSpace {
        VTDBus *vtd_bus;
        uint8_t devfn;
        AddressSpace as;
        uint32_t pasid;
        IntelIOMMUState *iommu_state;
        VTDContextCacheEntry context_cache_entry;
        QLIST_ENTRY(VTDPASIDAddressSpace) next;
        VTDPASIDCacheEntry pasid_cache_entry;
    };

Ideally, a VTDPASIDAddressSpace instance is created when a PASID
is bound to a DMA AddressSpace. The Intel VT-d spec requires guest
software to issue a pasid cache invalidation when binding or
unbinding a pasid to/from an address space under caching-mode.
However, as VTDPASIDAddressSpace instances also act as the pasid
cache in this implementation, their creation can also happen during
vIOMMU PASID-tagged DMA translation. The creation in that path is
not added in this patch since there are no PASID-capable emulated
devices for now.

The implementation in this patch manages VTDPASIDAddressSpace
instances per PASID+BDF (lookup and insert use PASID and BDF) since
the Intel VT-d spec allows a per-BDF PASID table. When a guest binds
a PASID to an AddressSpace, QEMU captures the guest PASID-selective
pasid cache invalidation and allocates or removes a
VTDPASIDAddressSpace instance according to the invalidation reason:

    *) a present pasid entry moved to non-present
    *) a present pasid entry updated to another present entry
    *) a non-present pasid entry moved to present

The vIOMMU emulator figures out which case applies by fetching the
latest guest pasid entry.

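As a condensed illustration (not part of this patch) of the per-entry
decision described above, the core of vtd_flush_pasid() added below boils
down to the following; the host bind/unbind notifications are still TODOs
at this point and are wired up later in the series.

    /* Sketch: reconcile one cached entry with the guest PASID table on a
     * cache invalidation. Returning true from the foreach_remove callback
     * frees the vtd_pasid_as instance. */
    static gboolean reconcile_one(IntelIOMMUState *s,
                                  VTDPASIDAddressSpace *vtd_pasid_as)
    {
        VTDPASIDEntry pe;

        if (vtd_dev_get_pe_from_pasid(s,
                                      pci_bus_num(vtd_pasid_as->vtd_bus->bus),
                                      vtd_pasid_as->devfn, vtd_pasid_as->pasid,
                                      &pe)) {
            /* present -> non-present: guest entry is gone, drop the cache */
            return true;
        }

        /* present -> present (modified) or non-present -> present:
         * refresh the cached copy of the guest pasid entry */
        vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
        return false;
    }
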
v1 -> v2: - merged this patch with the former replay binding patch, making
            PSI/DSI/GSI use a unified function to do cache invalidation
            and pasid binding replay.
          - dropped pasid_cache_gen in both iommu_state and vtd_pasid_as
            as it is not necessary so far; we may want it when we one day
            introduce an emulated SVA-capable device.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 473 +++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  18 ++
 hw/i386/trace-events           |   1 +
 include/hw/i386/intel_iommu.h  |  24 +++
 4 files changed, 516 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2eb60c3..a7e9973 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "qemu/jhash.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -65,6 +66,8 @@
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
     error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset(s);
     vtd_iommu_unlock(s);
 }
 
@@ -686,6 +690,16 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
     return true;
 }
 
+static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
+{
+    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -2395,9 +2409,452 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static inline void vtd_init_pasid_key(uint32_t pasid,
+                                     uint16_t sid,
+                                     struct pasid_key *key)
+{
+    key->pasid = pasid;
+    key->sid = sid;
+}
+
+static guint vtd_pasid_as_key_hash(gconstpointer v)
+{
+    struct pasid_key *key = (struct pasid_key *)v;
+    uint32_t a, b, c;
+
+    /* Jenkins hash */
+    a = b = c = JHASH_INITVAL + sizeof(*key);
+    a += key->sid;
+    b += extract32(key->pasid, 0, 16);
+    c += extract32(key->pasid, 16, 16);
+
+    __jhash_mix(a, b, c);
+    __jhash_final(a, b, c);
+
+    return c;
+}
+
+static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
+{
+    const struct pasid_key *k1 = v1;
+    const struct pasid_key *k2 = v2;
+
+    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
+}
+
+static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
+                                            uint8_t bus_num,
+                                            uint8_t devfn,
+                                            uint32_t pasid,
+                                            VTDPASIDEntry *pe)
+{
+    VTDContextEntry ce;
+    int ret;
+    dma_addr_t pasid_dir_base;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
+    ret = vtd_get_pe_from_pasid_table(s,
+                                  pasid_dir_base, pasid, pe);
+
+    return ret;
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/**
+ * This function fills in the pasid entry in &vtd_pasid_as. Caller
+ * of this function should hold iommu_lock.
+ */
+static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
+                                 VTDPASIDAddressSpace *vtd_pasid_as,
+                                 VTDPASIDEntry *pe)
+{
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+
+    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+        /* No need to go further as cached pasid entry is latest */
+        return;
+    }
+
+    pc_entry->pasid_entry = *pe;
+    /*
+     * TODO:
+     * - send pasid bind to host for passthru devices
+     */
+}
+
+/**
+ * This function is used to clear cached pasid entry in vtd_pasid_as
+ * instances. Caller of this function should hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+                                gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDPASIDAddressSpace *vtd_pasid_as = value;
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
+    VTDPASIDEntry pe;
+    uint16_t did;
+    uint32_t pasid;
+    uint16_t devfn;
+    int ret;
+
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+    pasid = vtd_pasid_as->pasid;
+    devfn = vtd_pasid_as->devfn;
+
+    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
+    case VTD_PASID_CACHE_FORCE_RESET:
+        goto remove;
+    case VTD_PASID_CACHE_PASIDSI:
+        if (pc_info->pasid != pasid) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_DOMSI:
+        if (pc_info->domain_id != did) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_GLOBAL:
+        break;
+    default:
+        error_report("invalid pc_info->flags");
+        abort();
+    }
+
+    /*
+     * A pasid cache invalidation may indicate a present-to-present
+     * pasid entry modification. To cover such a case, the vIOMMU
+     * emulator needs to fetch the latest guest pasid entry, compare
+     * it with the cached pasid entry, then update the pasid cache
+     * and send a pasid bind/unbind to the host as appropriate.
+     */
+    ret = vtd_dev_get_pe_from_pasid(s, pci_bus_num(vtd_bus->bus),
+                                    devfn, pasid, &pe);
+    if (ret) {
+        /*
+         * No valid pasid entry in guest memory. e.g. pasid entry
+         * was modified to be either all-zero or non-present. Either
+         * case means existing pasid cache should be removed.
+         */
+        goto remove;
+    }
+
+    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
+    /*
+     * TODO:
+     * - when the pasid-based-iotlb (piotlb) infrastructure is ready,
+     *   we should invalidate the QEMU piotlb together with this change.
+     */
+    return false;
+remove:
+    /*
+     * TODO:
+     * - send pasid unbind to host for passthru devices
+     * - when the pasid-based-iotlb (piotlb) infrastructure is ready,
+     *   we should invalidate the QEMU piotlb together with this change.
+     */
+    return true;
+}
+
+/**
+ * This function finds or adds a VTDPASIDAddressSpace for a device
+ * when it is bound to a pasid. Caller of this function should hold
+ * iommu_lock.
+ */
+static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
+                                                   VTDBus *vtd_bus,
+                                                   int devfn,
+                                                   uint32_t pasid)
+{
+    struct pasid_key key;
+    struct pasid_key *new_key;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+    uint16_t sid;
+
+    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
+    vtd_init_pasid_key(pasid, sid, &key);
+    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
+
+    if (!vtd_pasid_as) {
+        new_key = g_malloc0(sizeof(*new_key));
+        vtd_init_pasid_key(pasid, sid, new_key);
+        /*
+         * Initiate the vtd_pasid_as structure.
+         *
+         * This structure here is used to track the guest pasid
+         * binding and also serves as a pasid-cache management entry.
+         *
+         * TODO: in the future, if we want to support SVA-aware DMA
+         *       emulation, the vtd_pasid_as should include an
+         *       AddressSpace to support DMA emulation.
+         */
+        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
+        vtd_pasid_as->iommu_state = s;
+        vtd_pasid_as->vtd_bus = vtd_bus;
+        vtd_pasid_as->devfn = devfn;
+        vtd_pasid_as->pasid = pasid;
+        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
+    }
+    return vtd_pasid_as;
+}
+
+/**
+ * Constant information used during pasid table walk
+ * @vtd_bus, @devfn: device info
+ * @flags: indicates if it is domain selective walk
+ * @did: domain ID of the pasid table walk
+ */
+typedef struct {
+    VTDBus *vtd_bus;
+    uint16_t devfn;
+#define VTD_PASID_TABLE_DID_SEL_WALK   (1ULL << 0)
+    uint32_t flags;
+    uint16_t did;
+} vtd_pasid_table_walk_info;
+
+/**
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+                                        dma_addr_t pt_base,
+                                        int start,
+                                        int end,
+                                        vtd_pasid_table_walk_info *info)
+{
+    VTDPASIDEntry pe;
+    int pasid = start;
+    int pasid_next;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+
+    while (pasid < end) {
+        pasid_next = pasid + 1;
+
+        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+            && vtd_pe_present(&pe)) {
+            vtd_pasid_as = vtd_add_find_pasid_as(s,
+                                       info->vtd_bus, info->devfn, pasid);
+            if ((info->flags & VTD_PASID_TABLE_DID_SEL_WALK) &&
+                !(info->did == vtd_pe_get_domain_id(&pe))) {
+                pasid = pasid_next;
+                continue;
+            }
+            vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
+        }
+        pasid = pasid_next;
+    }
+}
+
+/*
+ * Currently, the VT-d scalable mode pasid table is a two-level table;
+ * this function loops over a range of PASIDs in a given pasid
+ * table to identify the pasid configuration in the guest.
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+                                    dma_addr_t pdt_base,
+                                    int start,
+                                    int end,
+                                    vtd_pasid_table_walk_info *info)
+{
+    VTDPASIDDirEntry pdire;
+    int pasid = start;
+    int pasid_next;
+    dma_addr_t pt_base;
+
+    while (pasid < end) {
+        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
+                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
+        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+            && vtd_pdire_present(&pdire)) {
+            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+        }
+        pasid = pasid_next;
+    }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+                                          int start, int end,
+                                          vtd_pasid_table_walk_info *info)
+{
+    VTDContextEntry ce;
+    int bus_n, devfn;
+
+    bus_n = pci_bus_num(info->vtd_bus->bus);
+    devfn = info->devfn;
+
+    if (!vtd_dev_to_context_entry(s, bus_n, devfn, &ce)) {
+        uint32_t max_pasid;
+
+        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+        if (end > max_pasid) {
+            end = max_pasid;
+        }
+        vtd_sm_pasid_table_walk(s,
+                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                start,
+                                end,
+                                info);
+    }
+}
+
+/**
+ * This function replays the guest pasid bindings to the host by
+ * walking the guest PASID table. This ensures the host has the
+ * latest guest pasid bindings. Caller should hold iommu_lock.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                            VTDPASIDCacheInfo *pc_info)
+{
+    VTDHostIOMMUContext *vtd_dev_icx;
+    int start = 0, end = VTD_HPASID_MAX;
+    vtd_pasid_table_walk_info walk_info = {.flags = 0};
+
+    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
+    case VTD_PASID_CACHE_PASIDSI:
+        start = pc_info->pasid;
+        end = pc_info->pasid + 1;
+        /*
+         * PASID selective invalidation is within domain,
+         * thus fall through.
+         */
+    case VTD_PASID_CACHE_DOMSI:
+        walk_info.did = pc_info->domain_id;
+        walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
+        /* loop all assigned devices */
+        break;
+    case VTD_PASID_CACHE_FORCE_RESET:
+        /* For force reset, no need to go further replay */
+        return;
+    case VTD_PASID_CACHE_GLOBAL:
+        break;
+    default:
+        error_report("%s, invalid pc_info->flags", __func__);
+        abort();
+    }
+
+    /*
+     * In this replay, we only need to care about the devices which
+     * are backed by a host IOMMU. For such devices, their vtd_dev_icx
+     * instances are in s->vtd_dev_icx_list. For devices which are
+     * not backed by a host IOMMU, it is not necessary to replay the
+     * bindings since their cache can be re-created during future
+     * DMA address translation.
+     */
+    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
+        walk_info.vtd_bus = vtd_dev_icx->vtd_bus;
+        walk_info.devfn = vtd_dev_icx->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+    }
+}
+
+/**
+ * This function syncs the pasid bindings between guest and host.
+ * It includes updating the pasid cache in vIOMMU and updating the
+ * pasid bindings per guest's latest pasid entry presence.
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info)
+{
+    /*
+     * With regard to a pasid cache invalidation, e.g. a PSI,
+     * it could be any of the cases below:
+     * a) a present pasid entry moved to non-present
+     * b) a present pasid entry updated to another present entry
+     * c) a non-present pasid entry moved to present
+     *
+     * Different invalidation granularity may affect different device
+     * scope and pasid scope. But for each invalidation granularity,
+     * it needs to do two steps to sync host and guest pasid binding.
+     *
+     * Here is the handling of a PSI:
+     * 1) loop all the existing vtd_pasid_as instances to update them
+     *    according to the latest guest pasid entry in pasid table.
+     *    this will make sure affected existing vtd_pasid_as instances
+     *    cached the latest pasid entries. Also, during the loop, the
+     *    host should be notified if needed. e.g. pasid unbind or pasid
+     *    update. Should be able to cover case a) and case b).
+     *
+     * 2) loop all devices to cover case c)
+     *    - For devices which have HostIOMMUContext instances,
+     *      we loop them and check if guest pasid entry exists. If yes,
+     *      it is case c), we update the pasid cache and also notify
+     *      host.
+     *    - For devices which have no HostIOMMUContext, it is not
+     *      necessary to create pasid cache at this phase since it
+     *      could be created when vIOMMU does DMA address translation.
+     *      This is not yet implemented since there are no emulated
+     *      pasid-capable devices today. If we have such devices in the
+     *      future, the pasid cache shall be created there.
+     * Other granularities follow the same steps, just with a different scope.
+     *
+     */
+
+    vtd_iommu_lock(s);
+    /* Step 1: loop over all the existing vtd_pasid_as instances */
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                vtd_flush_pasid, pc_info);
+
+    /*
+     * Step 2: loop over all the existing vtd_dev_icx instances.
+     * Ideally, we would need to loop over all devices to find any new
+     * PASID binding related to the PASID cache invalidation request.
+     * But it is enough to loop over the devices which are backed by a
+     * host IOMMU. For devices backed by the vIOMMU (a.k.a. emulated
+     * devices), if a new PASID binding happens on them, their vtd_pasid_as
+     * instance can be created during future vIOMMU DMA translation.
+     */
+    vtd_replay_guest_pasid_bindings(s, pc_info);
+    vtd_iommu_unlock(s);
+}
+
+/**
+ * Caller of this function should hold iommu_lock
+ */
+static void vtd_pasid_cache_reset(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_reset();
+
+    pc_info.flags = VTD_PASID_CACHE_FORCE_RESET;
+
+    /*
+     * Resetting the pasid cache is a big hammer, so use
+     * g_hash_table_foreach_remove, which will free the
+     * vtd_pasid_as instances. Also, as a big hammer,
+     * use VTD_PASID_CACHE_FORCE_RESET to ensure all the
+     * vtd_pasid_as instances are dropped; meanwhile the
+     * change will be passed to the host if a
+     * HostIOMMUContext is available.
+     */
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                vtd_flush_pasid, &pc_info);
+}
+
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
+    uint16_t domain_id;
+    uint32_t pasid;
+    VTDPASIDCacheInfo pc_info;
+
     if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
         (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
         (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
@@ -2407,14 +2864,26 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
         return false;
     }
 
+    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
+
     switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
     case VTD_INV_DESC_PASIDC_DSI:
+        trace_vtd_pasid_cache_dsi(domain_id);
+        pc_info.flags = VTD_PASID_CACHE_DOMSI;
+        pc_info.domain_id = domain_id;
         break;
 
     case VTD_INV_DESC_PASIDC_PASID_SI:
+        /* PASID selective implies a DID selective */
+        pc_info.flags = VTD_PASID_CACHE_PASIDSI;
+        pc_info.domain_id = domain_id;
+        pc_info.pasid = pasid;
         break;
 
     case VTD_INV_DESC_PASIDC_GLOBAL:
+        trace_vtd_pasid_cache_gsi();
+        pc_info.flags = VTD_PASID_CACHE_GLOBAL;
         break;
 
     default:
@@ -2423,6 +2892,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
         return false;
     }
 
+    vtd_pasid_cache_sync(s, &pc_info);
     return true;
 }
 
@@ -4085,6 +4555,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                      g_free, g_free);
     s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
                                               g_free, g_free);
+    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
+                                            vtd_pasid_as_key_equal,
+                                            g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
     pci_setup_iommu(bus, &vtd_iommu_ops, dev);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 9a76f20..451ef4c 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -307,6 +307,7 @@ typedef enum VTDFaultReason {
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
     VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
+    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
 
     /* This is not a normal fault reason. We use this to indicate some faults
      * that are not referenced by the VT-d specification.
@@ -511,10 +512,26 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CTX_ENTRY_LEGACY_SIZE     16
 #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
 
+#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x3)
 #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+struct VTDPASIDCacheInfo {
+#define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
+#define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
+#define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
+#define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
+    uint32_t flags;
+    uint16_t domain_id;
+    uint32_t pasid;
+};
+#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET | \
+                                      VTD_PASID_CACHE_GLOBAL  | \
+                                      VTD_PASID_CACHE_DOMSI  | \
+                                      VTD_PASID_CACHE_PASIDSI)
+typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -526,6 +543,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
 #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
 #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
 
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index f7cd4e5..60d20c1 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -23,6 +23,7 @@ vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 42a58d6..626c1cd 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -65,6 +65,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
 typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
+typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
+typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -97,6 +99,26 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+struct pasid_key {
+    uint32_t pasid;
+    uint16_t sid;
+};
+
+struct VTDPASIDCacheEntry {
+    struct VTDPASIDEntry pasid_entry;
+};
+
+struct VTDPASIDAddressSpace {
+    VTDBus *vtd_bus;
+    uint8_t devfn;
+    AddressSpace as;
+    uint32_t pasid;
+    IntelIOMMUState *iommu_state;
+    VTDContextCacheEntry context_cache_entry;
+    QLIST_ENTRY(VTDPASIDAddressSpace) next;
+    VTDPASIDCacheEntry pasid_cache_entry;
+};
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -267,6 +289,7 @@ struct IntelIOMMUState {
 
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    GHashTable *vtd_pasid_as;       /* VTDPASIDAddressSpace instances */
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
@@ -292,6 +315,7 @@ struct IntelIOMMUState {
      * - per-IOMMU IOTLB caches
      * - context entry cache in VTDAddressSpace
      * - HostIOMMUContext pointer cached in vIOMMU
+     * - PASID cache in VTDPASIDAddressSpace
      */
     QemuMutex iommu_lock;
 };
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
@ 2020-03-30  4:24   ` Liu Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, yi.l.liu, Yi Sun, Eduardo Habkost,
	kvm, mst, jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, pbonzini,
	hao.wu, Richard Henderson, david

This patch adds a PASID cache management infrastructure based on
new added structure VTDPASIDAddressSpace, which is used to track
the PASID usage and future PASID tagged DMA address translation
support in vIOMMU.

    struct VTDPASIDAddressSpace {
        VTDBus *vtd_bus;
        uint8_t devfn;
        AddressSpace as;
        uint32_t pasid;
        IntelIOMMUState *iommu_state;
        VTDContextCacheEntry context_cache_entry;
        QLIST_ENTRY(VTDPASIDAddressSpace) next;
        VTDPASIDCacheEntry pasid_cache_entry;
    };

Ideally, a VTDPASIDAddressSpace instance is created when a PASID
is bound with a DMA AddressSpace. Intel VT-d spec requires guest
software to issue pasid cache invalidation when bind or unbind a
pasid with an address space under caching-mode. However, as
VTDPASIDAddressSpace instances also act as pasid cache in this
implementation, its creation also happens during vIOMMU PASID
tagged DMA translation. The creation in this path will not be
added in this patch since no PASID-capable emulated devices for
now.

The implementation in this patch manages VTDPASIDAddressSpace
instances per PASID+BDF (lookup and insert will use PASID and
BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
guest bind a PASID with an AddressSpace, QEMU will capture the
guest pasid selective pasid cache invalidation, and allocate
remove a VTDPASIDAddressSpace instance per the invalidation
reasons:

    *) a present pasid entry moved to non-present
    *) a present pasid entry to be a present entry
    *) a non-present pasid entry moved to present

vIOMMU emulator could figure out the reason by fetching latest
guest pasid entry.

v1 -> v2: - merged this patch with former replay binding patch, makes
            PSI/DSI/GSI use the unified function to do cache invalidation
            and pasid binding replay.
          - dropped pasid_cache_gen in both iommu_state and vtd_pasid_as
            as it is not necessary so far, we may want it when one day
            initroduce emulated SVA-capable device.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 473 +++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  18 ++
 hw/i386/trace-events           |   1 +
 include/hw/i386/intel_iommu.h  |  24 +++
 4 files changed, 516 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2eb60c3..a7e9973 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "qemu/jhash.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -65,6 +66,8 @@
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
     error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset(s);
     vtd_iommu_unlock(s);
 }
 
@@ -686,6 +690,16 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
     return true;
 }
 
+static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
+{
+    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -2395,9 +2409,452 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static inline void vtd_init_pasid_key(uint32_t pasid,
+                                     uint16_t sid,
+                                     struct pasid_key *key)
+{
+    key->pasid = pasid;
+    key->sid = sid;
+}
+
+static guint vtd_pasid_as_key_hash(gconstpointer v)
+{
+    struct pasid_key *key = (struct pasid_key *)v;
+    uint32_t a, b, c;
+
+    /* Jenkins hash */
+    a = b = c = JHASH_INITVAL + sizeof(*key);
+    a += key->sid;
+    b += extract32(key->pasid, 0, 16);
+    c += extract32(key->pasid, 16, 16);
+
+    __jhash_mix(a, b, c);
+    __jhash_final(a, b, c);
+
+    return c;
+}
+
+static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
+{
+    const struct pasid_key *k1 = v1;
+    const struct pasid_key *k2 = v2;
+
+    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
+}
+
+static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
+                                            uint8_t bus_num,
+                                            uint8_t devfn,
+                                            uint32_t pasid,
+                                            VTDPASIDEntry *pe)
+{
+    VTDContextEntry ce;
+    int ret;
+    dma_addr_t pasid_dir_base;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
+    ret = vtd_get_pe_from_pasid_table(s,
+                                  pasid_dir_base, pasid, pe);
+
+    return ret;
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/**
+ * This function fills in the pasid entry in &vtd_pasid_as. Caller
+ * of this function should hold iommu_lock.
+ */
+static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
+                                 VTDPASIDAddressSpace *vtd_pasid_as,
+                                 VTDPASIDEntry *pe)
+{
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+
+    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+        /* No need to go further as cached pasid entry is latest */
+        return;
+    }
+
+    pc_entry->pasid_entry = *pe;
+    /*
+     * TODO:
+     * - send pasid bind to host for passthru devices
+     */
+}
+
+/**
+ * This function is used to clear cached pasid entry in vtd_pasid_as
+ * instances. Caller of this function should hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+                                gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDPASIDAddressSpace *vtd_pasid_as = value;
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
+    VTDPASIDEntry pe;
+    uint16_t did;
+    uint32_t pasid;
+    uint16_t devfn;
+    int ret;
+
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+    pasid = vtd_pasid_as->pasid;
+    devfn = vtd_pasid_as->devfn;
+
+    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
+    case VTD_PASID_CACHE_FORCE_RESET:
+        goto remove;
+    case VTD_PASID_CACHE_PASIDSI:
+        if (pc_info->pasid != pasid) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_DOMSI:
+        if (pc_info->domain_id != did) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_GLOBAL:
+        break;
+    default:
+        error_report("invalid pc_info->flags");
+        abort();
+    }
+
+    /*
+     * pasid cache invalidation may indicate a present pasid
+     * entry to present pasid entry modification. To cover such
+     * case, vIOMMU emulator needs to fetch latest guest pasid
+     * entry and check cached pasid entry, then update pasid
+     * cache and send pasid bind/unbind to host properly.
+     */
+    ret = vtd_dev_get_pe_from_pasid(s, pci_bus_num(vtd_bus->bus),
+                                    devfn, pasid, &pe);
+    if (ret) {
+        /*
+         * No valid pasid entry in guest memory. e.g. pasid entry
+         * was modified to be either all-zero or non-present. Either
+         * case means existing pasid cache should be removed.
+         */
+        goto remove;
+    }
+
+    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
+    /*
+     * TODO:
+     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
+     *   should invalidate QEMU piotlb togehter with this change.
+     */
+    return false;
+remove:
+    /*
+     * TODO:
+     * - send pasid bind to host for passthru devices
+     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
+     *   should invalidate QEMU piotlb togehter with this change.
+     */
+    return true;
+}
+
+/**
+ * This function finds or adds a VTDPASIDAddressSpace for a device
+ * when it is bound to a pasid. Caller of this function should hold
+ * iommu_lock.
+ */
+static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
+                                                   VTDBus *vtd_bus,
+                                                   int devfn,
+                                                   uint32_t pasid)
+{
+    struct pasid_key key;
+    struct pasid_key *new_key;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+    uint16_t sid;
+
+    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
+    vtd_init_pasid_key(pasid, sid, &key);
+    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
+
+    if (!vtd_pasid_as) {
+        new_key = g_malloc0(sizeof(*new_key));
+        vtd_init_pasid_key(pasid, sid, new_key);
+        /*
+         * Initiate the vtd_pasid_as structure.
+         *
+         * This structure here is used to track the guest pasid
+         * binding and also serves as pasid-cache mangement entry.
+         *
+         * TODO: in future, if wants to support the SVA-aware DMA
+         *       emulation, the vtd_pasid_as should have include
+         *       AddressSpace to support DMA emulation.
+         */
+        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
+        vtd_pasid_as->iommu_state = s;
+        vtd_pasid_as->vtd_bus = vtd_bus;
+        vtd_pasid_as->devfn = devfn;
+        vtd_pasid_as->pasid = pasid;
+        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
+    }
+    return vtd_pasid_as;
+}
+
+/**
+ * Constant information used during pasid table walk
+   @vtd_bus, @devfn: device info
+ * @flags: indicates if it is domain selective walk
+ * @did: domain ID of the pasid table walk
+ */
+typedef struct {
+    VTDBus *vtd_bus;
+    uint16_t devfn;
+#define VTD_PASID_TABLE_DID_SEL_WALK   (1ULL << 0)
+    uint32_t flags;
+    uint16_t did;
+} vtd_pasid_table_walk_info;
+
+/**
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+                                        dma_addr_t pt_base,
+                                        int start,
+                                        int end,
+                                        vtd_pasid_table_walk_info *info)
+{
+    VTDPASIDEntry pe;
+    int pasid = start;
+    int pasid_next;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+
+    while (pasid < end) {
+        pasid_next = pasid + 1;
+
+        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+            && vtd_pe_present(&pe)) {
+            vtd_pasid_as = vtd_add_find_pasid_as(s,
+                                       info->vtd_bus, info->devfn, pasid);
+            if ((info->flags & VTD_PASID_TABLE_DID_SEL_WALK) &&
+                !(info->did == vtd_pe_get_domain_id(&pe))) {
+                pasid = pasid_next;
+                continue;
+            }
+            vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
+        }
+        pasid = pasid_next;
+    }
+}
+
+/*
+ * Currently, VT-d scalable mode pasid table is a two level table,
+ * this function aims to loop a range of PASIDs in a given pasid
+ * table to identify the pasid config in guest.
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+                                    dma_addr_t pdt_base,
+                                    int start,
+                                    int end,
+                                    vtd_pasid_table_walk_info *info)
+{
+    VTDPASIDDirEntry pdire;
+    int pasid = start;
+    int pasid_next;
+    dma_addr_t pt_base;
+
+    while (pasid < end) {
+        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
+                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
+        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+            && vtd_pdire_present(&pdire)) {
+            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+        }
+        pasid = pasid_next;
+    }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+                                          int start, int end,
+                                          vtd_pasid_table_walk_info *info)
+{
+    VTDContextEntry ce;
+    int bus_n, devfn;
+
+    bus_n = pci_bus_num(info->vtd_bus->bus);
+    devfn = info->devfn;
+
+    if (!vtd_dev_to_context_entry(s, bus_n, devfn, &ce)) {
+        uint32_t max_pasid;
+
+        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+        if (end > max_pasid) {
+            end = max_pasid;
+        }
+        vtd_sm_pasid_table_walk(s,
+                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                start,
+                                end,
+                                info);
+    }
+}
+
+/**
+ * This function replay the guest pasid bindings to hots by
+ * walking the guest PASID table. This ensures host will have
+ * latest guest pasid bindings. Caller should hold iommu_lock.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                            VTDPASIDCacheInfo *pc_info)
+{
+    VTDHostIOMMUContext *vtd_dev_icx;
+    int start = 0, end = VTD_HPASID_MAX;
+    vtd_pasid_table_walk_info walk_info = {.flags = 0};
+
+    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
+    case VTD_PASID_CACHE_PASIDSI:
+        start = pc_info->pasid;
+        end = pc_info->pasid + 1;
+        /*
+         * PASID selective invalidation is within domain,
+         * thus fall through.
+         */
+    case VTD_PASID_CACHE_DOMSI:
+        walk_info.did = pc_info->domain_id;
+        walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
+        /* loop all assigned devices */
+        break;
+    case VTD_PASID_CACHE_FORCE_RESET:
+        /* For force reset, no need to go further replay */
+        return;
+    case VTD_PASID_CACHE_GLOBAL:
+        break;
+    default:
+        error_report("%s, invalid pc_info->flags", __func__);
+        abort();
+    }
+
+    /*
+     * In this replay, only needs to care about the devices which
+     * are backed by host IOMMU. For such devices, their vtd_dev_icx
+     * instances are in the s->vtd_dev_icx_list. For devices which
+     * are not backed byhost IOMMU, it is not necessary to replay
+     * the bindings since their cache could be re-created in the future
+     * DMA address transaltion.
+     */
+    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
+        walk_info.vtd_bus = vtd_dev_icx->vtd_bus;
+        walk_info.devfn = vtd_dev_icx->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+    }
+}
+
+/**
+ * This function syncs the pasid bindings between guest and host.
+ * It includes updating the pasid cache in vIOMMU and updating the
+ * pasid bindings per guest's latest pasid entry presence.
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info)
+{
+    /*
+     * Regards to a pasid cache invalidation, e.g. a PSI.
+     * it could be either cases of below:
+     * a) a present pasid entry moved to non-present
+     * b) a present pasid entry to be a present entry
+     * c) a non-present pasid entry moved to present
+     *
+     * Different invalidation granularity may affect different device
+     * scope and pasid scope. But for each invalidation granularity,
+     * it needs to do two steps to sync host and guest pasid binding.
+     *
+     * Here is the handling of a PSI:
+     * 1) loop all the existing vtd_pasid_as instances to update them
+     *    according to the latest guest pasid entry in pasid table.
+     *    this will make sure affected existing vtd_pasid_as instances
+     *    cached the latest pasid entries. Also, during the loop, the
+     *    host should be notified if needed. e.g. pasid unbind or pasid
+     *    update. Should be able to cover case a) and case b).
+     *
+     * 2) loop all devices to cover case c)
+     *    - For devices which have HostIOMMUContext instances,
+     *      we loop them and check if guest pasid entry exists. If yes,
+     *      it is case c), we update the pasid cache and also notify
+     *      host.
+     *    - For devices which have no HostIOMMUContext, it is not
+     *      necessary to create pasid cache at this phase since it
+     *      could be created when vIOMMU does DMA address translation.
+     *      This is not yet implemented since there is no emulated
+     *      pasid-capable devices today. If we have such devices in
+     *      future, the pasid cache shall be created there.
+     * Other granularity follow the same steps, just with different scope
+     *
+     */
+
+    vtd_iommu_lock(s);
+    /* Step 1: loop all the exisitng vtd_pasid_as instances */
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                vtd_flush_pasid, pc_info);
+
+    /*
+     * Step 2: loop all the exisitng vtd_dev_icx instances.
+     * Ideally, needs to loop all devices to find if there is any new
+     * PASID binding regards to the PASID cache invalidation request.
+     * But it is enough to loop the devices which are backed by host
+     * IOMMU. For devices backed by vIOMMU (a.k.a emulated devices),
+     * if new PASID happened on them, their vtd_pasid_as instance could
+     * be created during future vIOMMU DMA translation.
+     */
+    vtd_replay_guest_pasid_bindings(s, pc_info);
+    vtd_iommu_unlock(s);
+}
+
+/**
+ * Caller of this function should hold iommu_lock
+ */
+static void vtd_pasid_cache_reset(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_reset();
+
+    pc_info.flags = VTD_PASID_CACHE_FORCE_RESET;
+
+    /*
+     * Reset pasid cache is a big hammer, so use
+     * g_hash_table_foreach_remove which will free
+     * the vtd_pasid_as instances. Also, as a big
+     * hammer, use VTD_PASID_CACHE_FORCE_RESET to
+     * ensure all the vtd_pasid_as instances are
+     * dropped, meanwhile the change will be pass
+     * to host if HostIOMMUContext is available.
+     */
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                vtd_flush_pasid, &pc_info);
+}
+
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
+    uint16_t domain_id;
+    uint32_t pasid;
+    VTDPASIDCacheInfo pc_info;
+
     if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
         (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
         (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
@@ -2407,14 +2864,26 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
         return false;
     }
 
+    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
+
     switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
     case VTD_INV_DESC_PASIDC_DSI:
+        trace_vtd_pasid_cache_dsi(domain_id);
+        pc_info.flags = VTD_PASID_CACHE_DOMSI;
+        pc_info.domain_id = domain_id;
         break;
 
     case VTD_INV_DESC_PASIDC_PASID_SI:
+        /* PASID selective implies a DID selective */
+        pc_info.flags = VTD_PASID_CACHE_PASIDSI;
+        pc_info.domain_id = domain_id;
+        pc_info.pasid = pasid;
         break;
 
     case VTD_INV_DESC_PASIDC_GLOBAL:
+        trace_vtd_pasid_cache_gsi();
+        pc_info.flags = VTD_PASID_CACHE_GLOBAL;
         break;
 
     default:
@@ -2423,6 +2892,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
         return false;
     }
 
+    vtd_pasid_cache_sync(s, &pc_info);
     return true;
 }
 
@@ -4085,6 +4555,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                      g_free, g_free);
     s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
                                               g_free, g_free);
+    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
+                                            vtd_pasid_as_key_equal,
+                                            g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
     pci_setup_iommu(bus, &vtd_iommu_ops, dev);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 9a76f20..451ef4c 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -307,6 +307,7 @@ typedef enum VTDFaultReason {
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
     VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
+    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
 
     /* This is not a normal fault reason. We use this to indicate some faults
      * that are not referenced by the VT-d specification.
@@ -511,10 +512,26 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CTX_ENTRY_LEGACY_SIZE     16
 #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
 
+#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x3)
 #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+struct VTDPASIDCacheInfo {
+#define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
+#define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
+#define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
+#define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
+    uint32_t flags;
+    uint16_t domain_id;
+    uint32_t pasid;
+};
+#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET | \
+                                      VTD_PASID_CACHE_GLOBAL  | \
+                                      VTD_PASID_CACHE_DOMSI  | \
+                                      VTD_PASID_CACHE_PASIDSI)
+typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -526,6 +543,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
 #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
 #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
 
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index f7cd4e5..60d20c1 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -23,6 +23,7 @@ vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 42a58d6..626c1cd 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -65,6 +65,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
 typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
+typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
+typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -97,6 +99,26 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+struct pasid_key {
+    uint32_t pasid;
+    uint16_t sid;
+};
+
+struct VTDPASIDCacheEntry {
+    struct VTDPASIDEntry pasid_entry;
+};
+
+struct VTDPASIDAddressSpace {
+    VTDBus *vtd_bus;
+    uint8_t devfn;
+    AddressSpace as;
+    uint32_t pasid;
+    IntelIOMMUState *iommu_state;
+    VTDContextCacheEntry context_cache_entry;
+    QLIST_ENTRY(VTDPASIDAddressSpace) next;
+    VTDPASIDCacheEntry pasid_cache_entry;
+};
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -267,6 +289,7 @@ struct IntelIOMMUState {
 
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    GHashTable *vtd_pasid_as;       /* VTDPASIDAddressSpace instances */
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
@@ -292,6 +315,7 @@ struct IntelIOMMUState {
      * - per-IOMMU IOTLB caches
      * - context entry cache in VTDAddressSpace
      * - HostIOMMUContext pointer cached in vIOMMU
+     * - PASID cache in VTDPASIDAddressSpace
      */
     QemuMutex iommu_lock;
 };
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 14/22] vfio: add bind stage-1 page table support
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

This patch adds bind_stage1_pgtbl() definition in HostIOMMUContextClass,
also adds corresponding implementation in VFIO. This is to expose a way
for vIOMMU to setup dual stage DMA translation for passthru devices on
hardware.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/iommu/host_iommu_context.c         | 47 +++++++++++++++++++++++++++++-
 hw/vfio/common.c                      | 55 ++++++++++++++++++++++++++++++++++-
 include/hw/iommu/host_iommu_context.h | 26 ++++++++++++++++-
 3 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
index 5fb2223..8ae20fe 100644
--- a/hw/iommu/host_iommu_context.c
+++ b/hw/iommu/host_iommu_context.c
@@ -69,15 +69,60 @@ int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid)
     return hicxc->pasid_free(iommu_ctx, pasid);
 }
 
+int host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                     DualIOMMUStage1BindData *data)
+{
+    HostIOMMUContextClass *hicxc;
+
+    if (!iommu_ctx) {
+        return -EINVAL;
+    }
+
+    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+    if (!hicxc) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+        !hicxc->bind_stage1_pgtbl) {
+        return -EINVAL;
+    }
+
+    return hicxc->bind_stage1_pgtbl(iommu_ctx, data);
+}
+
+int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                       DualIOMMUStage1BindData *data)
+{
+    HostIOMMUContextClass *hicxc;
+
+    if (!iommu_ctx) {
+        return -EINVAL;
+    }
+
+    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+    if (!hicxc) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+        !hicxc->unbind_stage1_pgtbl) {
+        return -EINVAL;
+    }
+
+    return hicxc->unbind_stage1_pgtbl(iommu_ctx, data);
+}
+
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
                          const char *mrtypename,
-                         uint64_t flags)
+                         uint64_t flags, uint32_t formats)
 {
     HostIOMMUContext *iommu_ctx;
 
     object_initialize(_iommu_ctx, instance_size, mrtypename);
     iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
     iommu_ctx->flags = flags;
+    iommu_ctx->stage1_formats = formats;
     iommu_ctx->initialized = true;
 }
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 44b142c..465e4d8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1226,6 +1226,54 @@ static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
     return 0;
 }
 
+static int vfio_host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                          DualIOMMUStage1BindData *bind_data)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
+    struct vfio_iommu_type1_bind *bind;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+    bind = g_malloc0(argsz);
+    bind->argsz = argsz;
+    bind->flags = VFIO_IOMMU_BIND_GUEST_PGTBL;
+    memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+    if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+        ret = -errno;
+        error_report("%s: pasid (%u) bind failed: %d",
+                      __func__, bind_data->pasid, ret);
+    }
+    g_free(bind);
+    return ret;
+}
+
+static int vfio_host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                            DualIOMMUStage1BindData *bind_data)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
+    struct vfio_iommu_type1_bind *bind;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+    bind = g_malloc0(argsz);
+    bind->argsz = argsz;
+    bind->flags = VFIO_IOMMU_UNBIND_GUEST_PGTBL;
+    memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+    if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+        ret = -errno;
+        error_report("%s: pasid (%u) unbind failed: %d",
+                      __func__, bind_data->pasid, ret);
+    }
+    g_free(bind);
+    return ret;
+}
+
 /**
  * Get iommu info from host. Caller of this funcion should free
  * the memory pointed by the returned pointer stored in @info
@@ -1350,10 +1398,13 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
 
         flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
                  HOST_IOMMU_PASID_REQUEST : 0;
+        flags |= HOST_IOMMU_NESTING;
+
         host_iommu_ctx_init(&container->iommu_ctx,
                             sizeof(container->iommu_ctx),
                             TYPE_VFIO_HOST_IOMMU_CONTEXT,
-                            flags);
+                            flags,
+                            nesting.stage1_formats);
     }
 
     container->iommu_type = iommu_type;
@@ -1945,6 +1996,8 @@ static void vfio_host_iommu_context_class_init(ObjectClass *klass,
 
     hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
     hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free;
+    hicxc->bind_stage1_pgtbl = vfio_host_iommu_ctx_bind_stage1_pgtbl;
+    hicxc->unbind_stage1_pgtbl = vfio_host_iommu_ctx_unbind_stage1_pgtbl;
 }
 
 static const TypeInfo vfio_host_iommu_context_info = {
diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
index 227c433..44daca9 100644
--- a/include/hw/iommu/host_iommu_context.h
+++ b/include/hw/iommu/host_iommu_context.h
@@ -41,6 +41,7 @@
                          TYPE_HOST_IOMMU_CONTEXT)
 
 typedef struct HostIOMMUContext HostIOMMUContext;
+typedef struct DualIOMMUStage1BindData DualIOMMUStage1BindData;
 
 typedef struct HostIOMMUContextClass {
     /* private */
@@ -54,6 +55,16 @@ typedef struct HostIOMMUContextClass {
     /* Reclaim pasid from HostIOMMUContext (a.k.a. host software) */
     int (*pasid_free)(HostIOMMUContext *iommu_ctx,
                       uint32_t pasid);
+    /*
+     * Bind stage-1 page table to a hostIOMMU w/ dual stage
+     * DMA translation capability.
+     * @bind_data specifies the bind configurations.
+     */
+    int (*bind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
+                             DualIOMMUStage1BindData *bind_data);
+    /* Undo a previous bind. @bind_data specifies the unbind info. */
+    int (*unbind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
+                               DualIOMMUStage1BindData *bind_data);
 } HostIOMMUContextClass;
 
 /*
@@ -62,17 +73,30 @@ typedef struct HostIOMMUContextClass {
 struct HostIOMMUContext {
     Object parent_obj;
 #define HOST_IOMMU_PASID_REQUEST (1ULL << 0)
+#define HOST_IOMMU_NESTING       (1ULL << 1)
     uint64_t flags;
+    uint32_t stage1_formats;
     bool initialized;
 };
 
+struct DualIOMMUStage1BindData {
+    uint32_t pasid;
+    union {
+        struct iommu_gpasid_bind_data gpasid_bind;
+    } bind_data;
+};
+
 int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
                                uint32_t max, uint32_t *pasid);
 int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid);
+int host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                     DualIOMMUStage1BindData *data);
+int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                       DualIOMMUStage1BindData *data);
 
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
                          const char *mrtypename,
-                         uint64_t flags);
+                         uint64_t flags, uint32_t formats);
 void host_iommu_ctx_destroy(HostIOMMUContext *iommu_ctx);
 
 #endif
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 14/22] vfio: add bind stage-1 page table support
@ 2020-03-30  4:24   ` Liu Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, yi.l.liu, Yi Sun, kvm, mst,
	jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, pbonzini, hao.wu,
	david

This patch adds bind_stage1_pgtbl() definition in HostIOMMUContextClass,
also adds corresponding implementation in VFIO. This is to expose a way
for vIOMMU to setup dual stage DMA translation for passthru devices on
hardware.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/iommu/host_iommu_context.c         | 47 +++++++++++++++++++++++++++++-
 hw/vfio/common.c                      | 55 ++++++++++++++++++++++++++++++++++-
 include/hw/iommu/host_iommu_context.h | 26 ++++++++++++++++-
 3 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
index 5fb2223..8ae20fe 100644
--- a/hw/iommu/host_iommu_context.c
+++ b/hw/iommu/host_iommu_context.c
@@ -69,15 +69,60 @@ int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid)
     return hicxc->pasid_free(iommu_ctx, pasid);
 }
 
+int host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                     DualIOMMUStage1BindData *data)
+{
+    HostIOMMUContextClass *hicxc;
+
+    if (!iommu_ctx) {
+        return -EINVAL;
+    }
+
+    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+    if (!hicxc) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+        !hicxc->bind_stage1_pgtbl) {
+        return -EINVAL;
+    }
+
+    return hicxc->bind_stage1_pgtbl(iommu_ctx, data);
+}
+
+int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                       DualIOMMUStage1BindData *data)
+{
+    HostIOMMUContextClass *hicxc;
+
+    if (!iommu_ctx) {
+        return -EINVAL;
+    }
+
+    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+    if (!hicxc) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+        !hicxc->unbind_stage1_pgtbl) {
+        return -EINVAL;
+    }
+
+    return hicxc->unbind_stage1_pgtbl(iommu_ctx, data);
+}
+
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
                          const char *mrtypename,
-                         uint64_t flags)
+                         uint64_t flags, uint32_t formats)
 {
     HostIOMMUContext *iommu_ctx;
 
     object_initialize(_iommu_ctx, instance_size, mrtypename);
     iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
     iommu_ctx->flags = flags;
+    iommu_ctx->stage1_formats = formats;
     iommu_ctx->initialized = true;
 }
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 44b142c..465e4d8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1226,6 +1226,54 @@ static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
     return 0;
 }
 
+static int vfio_host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                          DualIOMMUStage1BindData *bind_data)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
+    struct vfio_iommu_type1_bind *bind;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+    bind = g_malloc0(argsz);
+    bind->argsz = argsz;
+    bind->flags = VFIO_IOMMU_BIND_GUEST_PGTBL;
+    memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+    if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+        ret = -errno;
+        error_report("%s: pasid (%u) bind failed: %d",
+                      __func__, bind_data->pasid, ret);
+    }
+    g_free(bind);
+    return ret;
+}
+
+static int vfio_host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                            DualIOMMUStage1BindData *bind_data)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
+    struct vfio_iommu_type1_bind *bind;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+    bind = g_malloc0(argsz);
+    bind->argsz = argsz;
+    bind->flags = VFIO_IOMMU_UNBIND_GUEST_PGTBL;
+    memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+    if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+        ret = -errno;
+        error_report("%s: pasid (%u) unbind failed: %d",
+                      __func__, bind_data->pasid, ret);
+    }
+    g_free(bind);
+    return ret;
+}
+
 /**
  * Get iommu info from host. Caller of this funcion should free
  * the memory pointed by the returned pointer stored in @info
@@ -1350,10 +1398,13 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
 
         flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
                  HOST_IOMMU_PASID_REQUEST : 0;
+        flags |= HOST_IOMMU_NESTING;
+
         host_iommu_ctx_init(&container->iommu_ctx,
                             sizeof(container->iommu_ctx),
                             TYPE_VFIO_HOST_IOMMU_CONTEXT,
-                            flags);
+                            flags,
+                            nesting.stage1_formats);
     }
 
     container->iommu_type = iommu_type;
@@ -1945,6 +1996,8 @@ static void vfio_host_iommu_context_class_init(ObjectClass *klass,
 
     hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
     hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free;
+    hicxc->bind_stage1_pgtbl = vfio_host_iommu_ctx_bind_stage1_pgtbl;
+    hicxc->unbind_stage1_pgtbl = vfio_host_iommu_ctx_unbind_stage1_pgtbl;
 }
 
 static const TypeInfo vfio_host_iommu_context_info = {
diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
index 227c433..44daca9 100644
--- a/include/hw/iommu/host_iommu_context.h
+++ b/include/hw/iommu/host_iommu_context.h
@@ -41,6 +41,7 @@
                          TYPE_HOST_IOMMU_CONTEXT)
 
 typedef struct HostIOMMUContext HostIOMMUContext;
+typedef struct DualIOMMUStage1BindData DualIOMMUStage1BindData;
 
 typedef struct HostIOMMUContextClass {
     /* private */
@@ -54,6 +55,16 @@ typedef struct HostIOMMUContextClass {
     /* Reclaim pasid from HostIOMMUContext (a.k.a. host software) */
     int (*pasid_free)(HostIOMMUContext *iommu_ctx,
                       uint32_t pasid);
+    /*
+     * Bind stage-1 page table to a hostIOMMU w/ dual stage
+     * DMA translation capability.
+     * @bind_data specifies the bind configurations.
+     */
+    int (*bind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
+                             DualIOMMUStage1BindData *bind_data);
+    /* Undo a previous bind. @bind_data specifies the unbind info. */
+    int (*unbind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
+                               DualIOMMUStage1BindData *bind_data);
 } HostIOMMUContextClass;
 
 /*
@@ -62,17 +73,30 @@ typedef struct HostIOMMUContextClass {
 struct HostIOMMUContext {
     Object parent_obj;
 #define HOST_IOMMU_PASID_REQUEST (1ULL << 0)
+#define HOST_IOMMU_NESTING       (1ULL << 1)
     uint64_t flags;
+    uint32_t stage1_formats;
     bool initialized;
 };
 
+struct DualIOMMUStage1BindData {
+    uint32_t pasid;
+    union {
+        struct iommu_gpasid_bind_data gpasid_bind;
+    } bind_data;
+};
+
 int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
                                uint32_t max, uint32_t *pasid);
 int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid);
+int host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                     DualIOMMUStage1BindData *data);
+int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+                                       DualIOMMUStage1BindData *data);
 
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
                          const char *mrtypename,
-                         uint64_t flags);
+                         uint64_t flags, uint32_t formats);
 void host_iommu_ctx_destroy(HostIOMMUContext *iommu_ctx);
 
 #endif
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson

This patch captures the guest PASID table entry modifications and
propagates the changes to host to setup dual stage DMA translation.
The guest page table is configured as 1st level page table (GVA->GPA)
whose translation result would further go through host VT-d 2nd
level page table(GPA->HPA) under nested translation mode. This is the
key part of vSVA support, and also a key to support IOVA over 1st-
level page table for Intel VT-d in virtualization environment.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 98 +++++++++++++++++++++++++++++++++++++++---
 hw/i386/intel_iommu_internal.h | 18 ++++++++
 2 files changed, 111 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a7e9973..d87f608 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -41,6 +41,7 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "qemu/jhash.h"
+#include <linux/iommu.h>
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -700,6 +701,16 @@ static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
     return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
 }
 
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+    return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1861,6 +1872,82 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+/**
+ * Caller should hold iommu_lock.
+ */
+static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
+                                int devfn, int pasid, VTDPASIDEntry *pe,
+                                VTDPASIDOp op)
+{
+    VTDHostIOMMUContext *vtd_dev_icx;
+    HostIOMMUContext *iommu_ctx;
+    DualIOMMUStage1BindData *bind_data;
+    struct iommu_gpasid_bind_data *g_bind_data;
+    int ret = -1;
+
+    vtd_dev_icx = vtd_bus->dev_icx[devfn];
+    if (!vtd_dev_icx) {
+        /* means no need to go further, e.g. for emulated devices */
+        return 0;
+    }
+
+    iommu_ctx = vtd_dev_icx->iommu_ctx;
+    if (!iommu_ctx) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->stage1_formats
+             & IOMMU_PASID_FORMAT_INTEL_VTD)) {
+        error_report_once("IOMMU Stage 1 format is not compatible!\n");
+        return -EINVAL;
+    }
+
+    bind_data = g_malloc0(sizeof(*bind_data));
+    bind_data->pasid = pasid;
+    g_bind_data = &bind_data->bind_data.gpasid_bind;
+
+    g_bind_data->flags = 0;
+    g_bind_data->vtd.flags = 0;
+    switch (op) {
+    case VTD_PASID_BIND:
+        g_bind_data->version = IOMMU_UAPI_VERSION;
+        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+        g_bind_data->gpgd = vtd_pe_get_flpt_base(pe);
+        g_bind_data->addr_width = vtd_pe_get_fl_aw(pe);
+        g_bind_data->hpasid = pasid;
+        g_bind_data->gpasid = pasid;
+        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
+        g_bind_data->vtd.flags =
+                             (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_PCD_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_PWT_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_EMTE_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_CD_BIT(pe->val[1]) ? 1 : 0);
+        g_bind_data->vtd.pat = VTD_SM_PASID_ENTRY_PAT(pe->val[1]);
+        g_bind_data->vtd.emt = VTD_SM_PASID_ENTRY_EMT(pe->val[1]);
+        ret = host_iommu_ctx_bind_stage1_pgtbl(iommu_ctx, bind_data);
+        break;
+    case VTD_PASID_UNBIND:
+        g_bind_data->version = IOMMU_UAPI_VERSION;
+        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+        g_bind_data->gpgd = 0;
+        g_bind_data->addr_width = 0;
+        g_bind_data->hpasid = pasid;
+        g_bind_data->gpasid = pasid;
+        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
+        ret = host_iommu_ctx_unbind_stage1_pgtbl(iommu_ctx, bind_data);
+        break;
+    default:
+        error_report_once("Unknown VTDPASIDOp!!!\n");
+        break;
+    }
+
+    g_free(bind_data);
+
+    return ret;
+}
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -2489,10 +2576,10 @@ static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
     }
 
     pc_entry->pasid_entry = *pe;
-    /*
-     * TODO:
-     * - send pasid bind to host for passthru devices
-     */
+    vtd_bind_guest_pasid(s, vtd_pasid_as->vtd_bus,
+                         vtd_pasid_as->devfn,
+                         vtd_pasid_as->pasid,
+                         pe, VTD_PASID_BIND);
 }
 
 /**
@@ -2565,10 +2652,11 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
 remove:
     /*
      * TODO:
-     * - send pasid bind to host for passthru devices
      * - when pasid-base-iotlb(piotlb) infrastructure is ready,
      *   should invalidate QEMU piotlb togehter with this change.
      */
+    vtd_bind_guest_pasid(s, vtd_bus, devfn,
+                         pasid, NULL, VTD_PASID_UNBIND);
     return true;
 }
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 451ef4c..b9e48ab 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -517,6 +517,13 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+enum VTDPASIDOp {
+    VTD_PASID_BIND,
+    VTD_PASID_UNBIND,
+    VTD_OP_NUM
+};
+typedef enum VTDPASIDOp VTDPASIDOp;
+
 struct VTDPASIDCacheInfo {
 #define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
 #define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
@@ -556,6 +563,17 @@ typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
 
+#define VTD_SM_PASID_ENTRY_FLPM          3ULL
+#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(val)  (!!((val) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PCD_BIT(val)  (!!(((val) >> 31) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PWT_BIT(val)  (!!(((val) >> 30) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EMTE_BIT(val) (!!(((val) >> 26) & 1ULL))
+#define VTD_SM_PASID_ENTRY_CD_BIT(val)   (!!(((val) >> 25) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PAT(val)      (((val) >> 32) & 0xFFFFFFFFULL)
+#define VTD_SM_PASID_ENTRY_EMT(val)      (((val) >> 27) & 0x7ULL)
+
 /* Second Level Page Translation Pointer*/
 #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host
@ 2020-03-30  4:24   ` Liu Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, yi.l.liu, Yi Sun, kvm, mst,
	jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, pbonzini, hao.wu,
	Richard Henderson, david

This patch captures guest PASID table entry modifications and
propagates the changes to the host to set up dual-stage DMA
translation. The guest page table is configured as the 1st level page
table (GVA->GPA), whose translation result further goes through the
host VT-d 2nd level page table (GPA->HPA) under nested translation
mode. This is the key part of vSVA support, and also a key to
supporting IOVA over the 1st level page table for Intel VT-d in a
virtualization environment.
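
For reference, here is a small standalone sketch (illustration only,
not part of the patch; the entry value is made up) of how the gpgd
and addr_width fields of the bind data are derived from a guest PASID
table entry, following the FLPTPTR/FLPM layout used by the
VTD_SM_PASID_ENTRY_* macros added below:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* made-up PASID entry val[2]: FLPTPTR plus FLPM = 1 (5-level) */
        uint64_t pe_val2 = 0x123456789000ULL | (1ULL << 2);
        uint64_t gpgd = pe_val2 & ~0xfffULL;         /* vtd_pe_get_flpt_base() */
        uint32_t aw = 48 + ((pe_val2 >> 2) & 3) * 9; /* vtd_pe_get_fl_aw() */

        printf("gpgd=0x%llx addr_width=%u\n", (unsigned long long)gpgd, aw);
        return 0;
    }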

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 98 +++++++++++++++++++++++++++++++++++++++---
 hw/i386/intel_iommu_internal.h | 18 ++++++++
 2 files changed, 111 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a7e9973..d87f608 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -41,6 +41,7 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "qemu/jhash.h"
+#include <linux/iommu.h>
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -700,6 +701,16 @@ static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
     return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
 }
 
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+    return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1861,6 +1872,82 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+/**
+ * Caller should hold iommu_lock.
+ */
+static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
+                                int devfn, int pasid, VTDPASIDEntry *pe,
+                                VTDPASIDOp op)
+{
+    VTDHostIOMMUContext *vtd_dev_icx;
+    HostIOMMUContext *iommu_ctx;
+    DualIOMMUStage1BindData *bind_data;
+    struct iommu_gpasid_bind_data *g_bind_data;
+    int ret = -1;
+
+    vtd_dev_icx = vtd_bus->dev_icx[devfn];
+    if (!vtd_dev_icx) {
+        /* means no need to go further, e.g. for emulated devices */
+        return 0;
+    }
+
+    iommu_ctx = vtd_dev_icx->iommu_ctx;
+    if (!iommu_ctx) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->stage1_formats
+             & IOMMU_PASID_FORMAT_INTEL_VTD)) {
+        error_report_once("IOMMU Stage 1 format is not compatible!");
+        return -EINVAL;
+    }
+
+    bind_data = g_malloc0(sizeof(*bind_data));
+    bind_data->pasid = pasid;
+    g_bind_data = &bind_data->bind_data.gpasid_bind;
+
+    g_bind_data->flags = 0;
+    g_bind_data->vtd.flags = 0;
+    switch (op) {
+    case VTD_PASID_BIND:
+        g_bind_data->version = IOMMU_UAPI_VERSION;
+        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+        g_bind_data->gpgd = vtd_pe_get_flpt_base(pe);
+        g_bind_data->addr_width = vtd_pe_get_fl_aw(pe);
+        g_bind_data->hpasid = pasid;
+        g_bind_data->gpasid = pasid;
+        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
+        g_bind_data->vtd.flags =
+                             (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_PCD_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_PWT_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_EMTE_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_CD_BIT(pe->val[1]) ? 1 : 0);
+        g_bind_data->vtd.pat = VTD_SM_PASID_ENTRY_PAT(pe->val[1]);
+        g_bind_data->vtd.emt = VTD_SM_PASID_ENTRY_EMT(pe->val[1]);
+        ret = host_iommu_ctx_bind_stage1_pgtbl(iommu_ctx, bind_data);
+        break;
+    case VTD_PASID_UNBIND:
+        g_bind_data->version = IOMMU_UAPI_VERSION;
+        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+        g_bind_data->gpgd = 0;
+        g_bind_data->addr_width = 0;
+        g_bind_data->hpasid = pasid;
+        g_bind_data->gpasid = pasid;
+        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
+        ret = host_iommu_ctx_unbind_stage1_pgtbl(iommu_ctx, bind_data);
+        break;
+    default:
+        error_report_once("Unknown VTDPASIDOp");
+        break;
+    }
+
+    g_free(bind_data);
+
+    return ret;
+}
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -2489,10 +2576,10 @@ static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
     }
 
     pc_entry->pasid_entry = *pe;
-    /*
-     * TODO:
-     * - send pasid bind to host for passthru devices
-     */
+    vtd_bind_guest_pasid(s, vtd_pasid_as->vtd_bus,
+                         vtd_pasid_as->devfn,
+                         vtd_pasid_as->pasid,
+                         pe, VTD_PASID_BIND);
 }
 
 /**
@@ -2565,10 +2652,11 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
 remove:
     /*
      * TODO:
-     * - send pasid bind to host for passthru devices
      * - when pasid-based-iotlb(piotlb) infrastructure is ready,
      *   should invalidate QEMU piotlb together with this change.
      */
+    vtd_bind_guest_pasid(s, vtd_bus, devfn,
+                         pasid, NULL, VTD_PASID_UNBIND);
     return true;
 }
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 451ef4c..b9e48ab 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -517,6 +517,13 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+enum VTDPASIDOp {
+    VTD_PASID_BIND,
+    VTD_PASID_UNBIND,
+    VTD_OP_NUM
+};
+typedef enum VTDPASIDOp VTDPASIDOp;
+
 struct VTDPASIDCacheInfo {
 #define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
 #define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
@@ -556,6 +563,17 @@ typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
 
+#define VTD_SM_PASID_ENTRY_FLPM          3ULL
+#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(val)  (!!((val) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PCD_BIT(val)  (!!(((val) >> 31) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PWT_BIT(val)  (!!(((val) >> 30) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EMTE_BIT(val) (!!(((val) >> 26) & 1ULL))
+#define VTD_SM_PASID_ENTRY_CD_BIT(val)   (!!(((val) >> 25) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PAT(val)      (((val) >> 32) & 0xFFFFFFFFULL)
+#define VTD_SM_PASID_ENTRY_EMT(val)      (((val) >> 27) & 0x7ULL)
+
 /* Second Level Page Translation Pointer*/
 #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
 
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

This patch replays guest pasid bindings after a context cache
invalidation. This is done to ensure safety; strictly speaking, the
programmer should issue a pasid cache invalidation with the proper
granularity after issuing a context cache invalidation.
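
As a rough illustration (not QEMU code; the names and values below
are made up), the intended policy can be modeled as:

    #include <stdio.h>

    enum cc_scope { CC_GLOBAL, CC_DEVICE };

    static void replay_guest_pasid_bindings(enum cc_scope scope, int devfn)
    {
        if (scope == CC_GLOBAL) {
            printf("re-walk guest PASID tables for all assigned devices\n");
        } else {
            printf("unbind/re-bind PASIDs for devfn 0x%x only\n", devfn);
        }
    }

    int main(void)
    {
        replay_guest_pasid_bindings(CC_GLOBAL, 0);    /* global context inv. */
        replay_guest_pasid_bindings(CC_DEVICE, 0x10); /* device-selective inv. */
        return 0;
    }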

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 51 ++++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  6 ++++-
 hw/i386/trace-events           |  1 +
 3 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d87f608..883aeac 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -68,6 +68,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  VTDBus *vtd_bus, uint16_t devfn);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -1853,7 +1857,10 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    VTDPASIDCacheInfo pc_info;
+
     trace_vtd_inv_desc_cc_global();
+
     /* Protects context cache */
     vtd_iommu_lock(s);
     s->context_cache_gen++;
@@ -1870,6 +1877,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
      * VT-d emulation codes.
      */
     vtd_iommu_replay_all(s);
+
+    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
+    vtd_pasid_cache_sync(s, &pc_info);
 }
 
 /**
@@ -2005,6 +2015,22 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                  * happened.
                  */
                 vtd_sync_shadow_page_table(vtd_as);
+                /*
+                 * Per spec, a context flush should also be followed by a
+                 * PASID cache and iotlb flush. Regarding a device selective
+                 * context cache invalidation:
+                 * if (emulated_device)
+                 *    modify the pasid cache gen and pasid-based iotlb gen
+                 *    value (will be added in following patches)
+                 * else if (assigned_device)
+                 *    check if the device has been bound to any pasid
+                 *    invoke pasid_unbind for each bound pasid
+                 * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
+                 * caches, while for piotlb in QEMU, we don't have it yet, so
+                 * no handling. For an assigned device, the host iommu driver
+                 * would flush the piotlb when a pasid unbind is passed down.
+                 */
+                vtd_pasid_cache_devsi(s, vtd_bus, devfn_it);
             }
         }
     }
@@ -2619,6 +2645,12 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         /* Fall through */
     case VTD_PASID_CACHE_GLOBAL:
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        if (pc_info->vtd_bus != vtd_bus ||
+            pc_info->devfn != devfn) {
+            return false;
+        }
+        break;
     default:
         error_report("invalid pc_info->flags");
         abort();
@@ -2827,6 +2859,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
         walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
         /* loop all assigned devices */
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        walk_info.vtd_bus = pc_info->vtd_bus;
+        walk_info.devfn = pc_info->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+        return;
     case VTD_PASID_CACHE_FORCE_RESET:
         /* For force reset, no need to go further replay */
         return;
@@ -2912,6 +2949,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
     vtd_iommu_unlock(s);
 }
 
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  VTDBus *vtd_bus, uint16_t devfn)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_devsi(devfn);
+
+    pc_info.flags = VTD_PASID_CACHE_DEVSI;
+    pc_info.vtd_bus = vtd_bus;
+    pc_info.devfn = devfn;
+
+    vtd_pasid_cache_sync(s, &pc_info);
+}
+
 /**
  * Caller of this function should hold iommu_lock
  */
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index b9e48ab..9122601 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -529,14 +529,18 @@ struct VTDPASIDCacheInfo {
 #define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
 #define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
 #define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
+#define VTD_PASID_CACHE_DEVSI          (1ULL << 4)
     uint32_t flags;
     uint16_t domain_id;
     uint32_t pasid;
+    VTDBus *vtd_bus;
+    uint16_t devfn;
 };
 #define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET | \
                                       VTD_PASID_CACHE_GLOBAL  | \
                                       VTD_PASID_CACHE_DOMSI  | \
-                                      VTD_PASID_CACHE_PASIDSI)
+                                      VTD_PASID_CACHE_PASIDSI | \
+                                      VTD_PASID_CACHE_DEVSI)
 typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 
 /* PASID Table Related Definitions */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 60d20c1..3853fa8 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -26,6 +26,7 @@ vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
@ 2020-03-30  4:24   ` Liu Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, yi.l.liu, Yi Sun, Eduardo Habkost,
	kvm, mst, jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, pbonzini,
	hao.wu, Richard Henderson, david

This patch replays guest pasid bindings after a context cache
invalidation. This is done to ensure safety; strictly speaking, the
programmer should issue a pasid cache invalidation with the proper
granularity after issuing a context cache invalidation.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 51 ++++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  6 ++++-
 hw/i386/trace-events           |  1 +
 3 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d87f608..883aeac 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -68,6 +68,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  VTDBus *vtd_bus, uint16_t devfn);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -1853,7 +1857,10 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    VTDPASIDCacheInfo pc_info;
+
     trace_vtd_inv_desc_cc_global();
+
     /* Protects context cache */
     vtd_iommu_lock(s);
     s->context_cache_gen++;
@@ -1870,6 +1877,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
      * VT-d emulation codes.
      */
     vtd_iommu_replay_all(s);
+
+    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
+    vtd_pasid_cache_sync(s, &pc_info);
 }
 
 /**
@@ -2005,6 +2015,22 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                  * happened.
                  */
                 vtd_sync_shadow_page_table(vtd_as);
+                /*
+                 * Per spec, a context flush should also be followed by a
+                 * PASID cache and iotlb flush. Regarding a device selective
+                 * context cache invalidation:
+                 * if (emulated_device)
+                 *    modify the pasid cache gen and pasid-based iotlb gen
+                 *    value (will be added in following patches)
+                 * else if (assigned_device)
+                 *    check if the device has been bound to any pasid
+                 *    invoke pasid_unbind for each bound pasid
+                 * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
+                 * caches, while for piotlb in QEMU, we don't have it yet, so
+                 * no handling. For an assigned device, the host iommu driver
+                 * would flush the piotlb when a pasid unbind is passed down.
+                 */
+                vtd_pasid_cache_devsi(s, vtd_bus, devfn_it);
             }
         }
     }
@@ -2619,6 +2645,12 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         /* Fall through */
     case VTD_PASID_CACHE_GLOBAL:
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        if (pc_info->vtd_bus != vtd_bus ||
+            pc_info->devfn != devfn) {
+            return false;
+        }
+        break;
     default:
         error_report("invalid pc_info->flags");
         abort();
@@ -2827,6 +2859,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
         walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
         /* loop all assigned devices */
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        walk_info.vtd_bus = pc_info->vtd_bus;
+        walk_info.devfn = pc_info->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+        return;
     case VTD_PASID_CACHE_FORCE_RESET:
         /* For force reset, no need to go further replay */
         return;
@@ -2912,6 +2949,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
     vtd_iommu_unlock(s);
 }
 
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  VTDBus *vtd_bus, uint16_t devfn)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_devsi(devfn);
+
+    pc_info.flags = VTD_PASID_CACHE_DEVSI;
+    pc_info.vtd_bus = vtd_bus;
+    pc_info.devfn = devfn;
+
+    vtd_pasid_cache_sync(s, &pc_info);
+}
+
 /**
  * Caller of this function should hold iommu_lock
  */
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index b9e48ab..9122601 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -529,14 +529,18 @@ struct VTDPASIDCacheInfo {
 #define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
 #define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
 #define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
+#define VTD_PASID_CACHE_DEVSI          (1ULL << 4)
     uint32_t flags;
     uint16_t domain_id;
     uint32_t pasid;
+    VTDBus *vtd_bus;
+    uint16_t devfn;
 };
 #define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET | \
                                       VTD_PASID_CACHE_GLOBAL  | \
                                       VTD_PASID_CACHE_DOMSI  | \
-                                      VTD_PASID_CACHE_PASIDSI)
+                                      VTD_PASID_CACHE_PASIDSI | \
+                                      VTD_PASID_CACHE_DEVSI)
 typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 
 /* PASID Table Related Definitions */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 60d20c1..3853fa8 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -26,6 +26,7 @@ vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 17/22] intel_iommu: do not pass down pasid bind for PASID #0
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

The RID_PASID field was introduced in the VT-d 3.0 spec. It is used
for DMA requests w/o PASID in scalable mode VT-d, i.e. for IOVA
requests. The VT-d 3.1 spec gives the following definition:

"Implementations not supporting RID_PASID capability
(ECAP_REG.RPS is 0b), use a PASID value of 0 to perform
address translation for requests without PASID."

This patch adds a check against the PASIDs which are going to be
bound to a device. For PASID #0, it is not necessary to pass down a
pasid bind request since PASID #0 is used as RID_PASID for DMA
requests without a pasid. A further reason is that the current Intel
vIOMMU supports gIOVA by shadowing the guest 2nd level page table.
However, if the guest IOMMU driver uses the 1st level page table to
store IOVA mappings in the future, then guest IOVA support will also
be done via nested translation. Once gIOVA is over FLPT, the vIOMMU
should pass down the pasid bind request for PASID #0 to the host, and
the host needs to bind the guest IOVA page table to a proper PASID,
e.g. the PASID value in the RID_PASID field for a PF/VF if
ECAP_REG.RPS is clear, or the default PASID for an ADI (Assignable
Device Interface in the Scalable IOV solution).

IOVA over FLPT support on Intel VT-d:
https://lkml.org/lkml/2019/9/23/297
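
The check itself is small; a standalone sketch (illustration only;
EXAMPLE_HPASID_MIN is a stand-in for the VTD_HPASID_MIN value defined
elsewhere in this series):

    #include <stdbool.h>
    #include <stdio.h>

    #define EXAMPLE_HPASID_MIN 1U   /* stand-in for VTD_HPASID_MIN */

    static bool pasid_needs_host_bind(unsigned int pasid)
    {
        /* PASID #0 backs RID_PASID (no-PASID DMA) and is still shadowed */
        return pasid >= EXAMPLE_HPASID_MIN;
    }

    int main(void)
    {
        printf("pasid 0 -> %d, pasid 5 -> %d\n",
               pasid_needs_host_bind(0), pasid_needs_host_bind(5));
        return 0;
    }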

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 883aeac..074d966 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1895,6 +1895,16 @@ static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
     struct iommu_gpasid_bind_data *g_bind_data;
     int ret = -1;
 
+    if (pasid < VTD_HPASID_MIN) {
+        /*
+         * If pasid < VTD_HPASID_MIN, this pasid is not allocated
+         * from host. No need to pass down the changes on it to host.
+         * TODO: when IOVA over FLPT is ready, this switch should be
+         * refined.
+         */
+        return 0;
+    }
+
     vtd_dev_icx = vtd_bus->dev_icx[devfn];
     if (!vtd_dev_icx) {
         /* means no need to go further, e.g. for emulated devices */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 17/22] intel_iommu: do not pass down pasid bind for PASID #0
@ 2020-03-30  4:24   ` Liu Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, yi.l.liu, Yi Sun, Eduardo Habkost,
	kvm, mst, jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, pbonzini,
	hao.wu, Richard Henderson, david

The RID_PASID field was introduced in the VT-d 3.0 spec. It is used
for DMA requests w/o PASID in scalable mode VT-d, i.e. for IOVA
requests. The VT-d 3.1 spec gives the following definition:

"Implementations not supporting RID_PASID capability
(ECAP_REG.RPS is 0b), use a PASID value of 0 to perform
address translation for requests without PASID."

This patch adds a check against the PASIDs which are going to be
bound to a device. For PASID #0, it is not necessary to pass down a
pasid bind request since PASID #0 is used as RID_PASID for DMA
requests without a pasid. A further reason is that the current Intel
vIOMMU supports gIOVA by shadowing the guest 2nd level page table.
However, if the guest IOMMU driver uses the 1st level page table to
store IOVA mappings in the future, then guest IOVA support will also
be done via nested translation. Once gIOVA is over FLPT, the vIOMMU
should pass down the pasid bind request for PASID #0 to the host, and
the host needs to bind the guest IOVA page table to a proper PASID,
e.g. the PASID value in the RID_PASID field for a PF/VF if
ECAP_REG.RPS is clear, or the default PASID for an ADI (Assignable
Device Interface in the Scalable IOV solution).

IOVA over FLPT support on Intel VT-d:
https://lkml.org/lkml/2019/9/23/297

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 883aeac..074d966 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1895,6 +1895,16 @@ static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
     struct iommu_gpasid_bind_data *g_bind_data;
     int ret = -1;
 
+    if (pasid < VTD_HPASID_MIN) {
+        /*
+         * If pasid < VTD_HPASID_MIN, this pasid is not allocated
+         * from host. No need to pass down the changes on it to host.
+         * TODO: when IOVA over FLPT is ready, this switch should be
+         * refined.
+         */
+        return 0;
+    }
+
     vtd_dev_icx = vtd_bus->dev_icx[devfn];
     if (!vtd_dev_icx) {
         /* means no need to go further, e.g. for emulated devices */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 18/22] vfio: add support for flush iommu stage-1 cache
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun

This patch adds a flush_stage1_cache() definition in HostIOMMUContextClass
and a corresponding implementation in VFIO. This exposes a way for the
vIOMMU to flush the stage-1 cache on the host side, since the guest owns
the stage-1 translation structures in the dual-stage DMA translation
configuration.
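
The VFIO side below uses the usual variable-sized ioctl argument
pattern: argsz tells the kernel how many bytes userspace allocated for
the header plus the trailing cache_info payload. A standalone sketch
(illustration only; the struct names are made-up stand-ins, not the
real uAPI):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct fake_cache_info { uint32_t version; uint32_t granularity; };
    struct fake_cache_invalidate {
        uint32_t argsz;
        uint32_t flags;
        struct fake_cache_info cache_info;
    };

    int main(void)
    {
        struct fake_cache_info info = { .version = 1, .granularity = 2 };
        size_t argsz = sizeof(struct fake_cache_invalidate);
        struct fake_cache_invalidate *req = calloc(1, argsz);

        req->argsz = argsz;   /* kernel uses this to know how much to copy */
        memcpy(&req->cache_info, &info, sizeof(info));
        printf("argsz=%zu flags=%u\n", argsz, req->flags);
        free(req);
        return 0;
    }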

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/iommu/host_iommu_context.c         | 19 +++++++++++++++++++
 hw/vfio/common.c                      | 25 +++++++++++++++++++++++++
 include/hw/iommu/host_iommu_context.h | 14 ++++++++++++++
 3 files changed, 58 insertions(+)

diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
index 8ae20fe..e884752 100644
--- a/hw/iommu/host_iommu_context.c
+++ b/hw/iommu/host_iommu_context.c
@@ -113,6 +113,25 @@ int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
     return hicxc->unbind_stage1_pgtbl(iommu_ctx, data);
 }
 
+int host_iommu_ctx_flush_stage1_cache(HostIOMMUContext *iommu_ctx,
+                                      DualIOMMUStage1Cache *cache)
+{
+    HostIOMMUContextClass *hicxc;
+
+    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+
+    if (!hicxc) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+        !hicxc->flush_stage1_cache) {
+        return -EINVAL;
+    }
+
+    return hicxc->flush_stage1_cache(iommu_ctx, cache);
+}
+
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
                          const char *mrtypename,
                          uint64_t flags, uint32_t formats)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 465e4d8..6b730b6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1274,6 +1274,30 @@ static int vfio_host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
     return ret;
 }
 
+static int vfio_host_iommu_ctx_flush_stage1_cache(HostIOMMUContext *iommu_ctx,
+                                            DualIOMMUStage1Cache *cache)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
+    struct vfio_iommu_type1_cache_invalidate *cache_inv;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*cache_inv) + sizeof(cache->cache_info);
+    cache_inv = g_malloc0(argsz);
+    cache_inv->argsz = argsz;
+    cache_inv->flags = 0;
+    memcpy(&cache_inv->cache_info, &cache->cache_info,
+           sizeof(cache->cache_info));
+
+    if (ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, cache_inv)) {
+        error_report("%s: iommu cache flush failed: %d", __func__, -errno);
+        ret = -errno;
+    }
+    g_free(cache_inv);
+    return ret;
+}
+
 /**
  * Get iommu info from host. Caller of this function should free
  * the memory pointed by the returned pointer stored in @info
@@ -1998,6 +2022,7 @@ static void vfio_host_iommu_context_class_init(ObjectClass *klass,
     hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free;
     hicxc->bind_stage1_pgtbl = vfio_host_iommu_ctx_bind_stage1_pgtbl;
     hicxc->unbind_stage1_pgtbl = vfio_host_iommu_ctx_unbind_stage1_pgtbl;
+    hicxc->flush_stage1_cache = vfio_host_iommu_ctx_flush_stage1_cache;
 }
 
 static const TypeInfo vfio_host_iommu_context_info = {
diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
index 44daca9..69b1b7b 100644
--- a/include/hw/iommu/host_iommu_context.h
+++ b/include/hw/iommu/host_iommu_context.h
@@ -42,6 +42,7 @@
 
 typedef struct HostIOMMUContext HostIOMMUContext;
 typedef struct DualIOMMUStage1BindData DualIOMMUStage1BindData;
+typedef struct DualIOMMUStage1Cache DualIOMMUStage1Cache;
 
 typedef struct HostIOMMUContextClass {
     /* private */
@@ -65,6 +66,12 @@ typedef struct HostIOMMUContextClass {
     /* Undo a previous bind. @bind_data specifies the unbind info. */
     int (*unbind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
                                DualIOMMUStage1BindData *bind_data);
+    /*
+     * Propagate a stage-1 cache flush to the host IOMMU; cache
+     * info is specified in @cache
+     */
+    int (*flush_stage1_cache)(HostIOMMUContext *iommu_ctx,
+                              DualIOMMUStage1Cache *cache);
 } HostIOMMUContextClass;
 
 /*
@@ -86,6 +93,11 @@ struct DualIOMMUStage1BindData {
     } bind_data;
 };
 
+struct DualIOMMUStage1Cache {
+    uint32_t pasid;
+    struct iommu_cache_invalidate_info cache_info;
+};
+
 int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
                                uint32_t max, uint32_t *pasid);
 int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid);
@@ -93,6 +105,8 @@ int host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
                                      DualIOMMUStage1BindData *data);
 int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
                                        DualIOMMUStage1BindData *data);
+int host_iommu_ctx_flush_stage1_cache(HostIOMMUContext *iommu_ctx,
+                                      DualIOMMUStage1Cache *cache);
 
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
                          const char *mrtypename,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 18/22] vfio: add support for flush iommu stage-1 cache
@ 2020-03-30  4:24   ` Liu Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, yi.l.liu, Yi Sun, kvm, mst,
	jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, pbonzini, hao.wu,
	david

This patch adds a flush_stage1_cache() definition in HostIOMMUContextClass
and a corresponding implementation in VFIO. This exposes a way for the
vIOMMU to flush the stage-1 cache on the host side, since the guest owns
the stage-1 translation structures in the dual-stage DMA translation
configuration.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/iommu/host_iommu_context.c         | 19 +++++++++++++++++++
 hw/vfio/common.c                      | 25 +++++++++++++++++++++++++
 include/hw/iommu/host_iommu_context.h | 14 ++++++++++++++
 3 files changed, 58 insertions(+)

diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
index 8ae20fe..e884752 100644
--- a/hw/iommu/host_iommu_context.c
+++ b/hw/iommu/host_iommu_context.c
@@ -113,6 +113,25 @@ int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
     return hicxc->unbind_stage1_pgtbl(iommu_ctx, data);
 }
 
+int host_iommu_ctx_flush_stage1_cache(HostIOMMUContext *iommu_ctx,
+                                      DualIOMMUStage1Cache *cache)
+{
+    HostIOMMUContextClass *hicxc;
+
+    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+
+    if (!hicxc) {
+        return -EINVAL;
+    }
+
+    if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+        !hicxc->flush_stage1_cache) {
+        return -EINVAL;
+    }
+
+    return hicxc->flush_stage1_cache(iommu_ctx, cache);
+}
+
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
                          const char *mrtypename,
                          uint64_t flags, uint32_t formats)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 465e4d8..6b730b6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1274,6 +1274,30 @@ static int vfio_host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
     return ret;
 }
 
+static int vfio_host_iommu_ctx_flush_stage1_cache(HostIOMMUContext *iommu_ctx,
+                                            DualIOMMUStage1Cache *cache)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
+    struct vfio_iommu_type1_cache_invalidate *cache_inv;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*cache_inv) + sizeof(cache->cache_info);
+    cache_inv = g_malloc0(argsz);
+    cache_inv->argsz = argsz;
+    cache_inv->flags = 0;
+    memcpy(&cache_inv->cache_info, &cache->cache_info,
+           sizeof(cache->cache_info));
+
+    if (ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, cache_inv)) {
+        error_report("%s: iommu cache flush failed: %d", __func__, -errno);
+        ret = -errno;
+    }
+    g_free(cache_inv);
+    return ret;
+}
+
 /**
  * Get iommu info from host. Caller of this function should free
  * the memory pointed by the returned pointer stored in @info
@@ -1998,6 +2022,7 @@ static void vfio_host_iommu_context_class_init(ObjectClass *klass,
     hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free;
     hicxc->bind_stage1_pgtbl = vfio_host_iommu_ctx_bind_stage1_pgtbl;
     hicxc->unbind_stage1_pgtbl = vfio_host_iommu_ctx_unbind_stage1_pgtbl;
+    hicxc->flush_stage1_cache = vfio_host_iommu_ctx_flush_stage1_cache;
 }
 
 static const TypeInfo vfio_host_iommu_context_info = {
diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
index 44daca9..69b1b7b 100644
--- a/include/hw/iommu/host_iommu_context.h
+++ b/include/hw/iommu/host_iommu_context.h
@@ -42,6 +42,7 @@
 
 typedef struct HostIOMMUContext HostIOMMUContext;
 typedef struct DualIOMMUStage1BindData DualIOMMUStage1BindData;
+typedef struct DualIOMMUStage1Cache DualIOMMUStage1Cache;
 
 typedef struct HostIOMMUContextClass {
     /* private */
@@ -65,6 +66,12 @@ typedef struct HostIOMMUContextClass {
     /* Undo a previous bind. @bind_data specifies the unbind info. */
     int (*unbind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
                                DualIOMMUStage1BindData *bind_data);
+    /*
+     * Propagate a stage-1 cache flush to the host IOMMU; cache
+     * info is specified in @cache
+     */
+    int (*flush_stage1_cache)(HostIOMMUContext *iommu_ctx,
+                              DualIOMMUStage1Cache *cache);
 } HostIOMMUContextClass;
 
 /*
@@ -86,6 +93,11 @@ struct DualIOMMUStage1BindData {
     } bind_data;
 };
 
+struct DualIOMMUStage1Cache {
+    uint32_t pasid;
+    struct iommu_cache_invalidate_info cache_info;
+};
+
 int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
                                uint32_t max, uint32_t *pasid);
 int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid);
@@ -93,6 +105,8 @@ int host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
                                      DualIOMMUStage1BindData *data);
 int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
                                        DualIOMMUStage1BindData *data);
+int host_iommu_ctx_flush_stage1_cache(HostIOMMUContext *iommu_ctx,
+                                      DualIOMMUStage1Cache *cache);
 
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
                          const char *mrtypename,
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 19/22] intel_iommu: process PASID-based iotlb invalidation
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

This patch adds the basic PASID-based iotlb (piotlb) invalidation
support. The piotlb is used when walking the Intel VT-d 1st level
page table. This patch only adds the basic processing; detailed
handling will be added in the next patch.
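
For reference, a standalone sketch (illustration only; the descriptor
value and the 16-bit domain-id width are assumptions for the example)
of how the fields are pulled out of the two descriptor qwords with the
VTD_INV_DESC_PIOTLB_* layout added below:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* made-up PASID-based IOTLB invalidation descriptor */
        uint64_t lo = (0x5ULL << 32) | (0x2ULL << 16) | (3ULL << 4);
        uint64_t hi = 0xabcdef1000ULL | 0x2ULL | (1ULL << 6);

        uint32_t pasid = (lo >> 32) & 0xfffff;  /* ..._PIOTLB_PASID */
        uint32_t did   = (lo >> 16) & 0xffff;   /* ..._PIOTLB_DID */
        uint64_t addr  = hi & ~0xfffULL;        /* ..._PIOTLB_ADDR */
        uint32_t am    = hi & 0x3f;             /* ..._PIOTLB_AM */
        uint32_t ih    = (hi >> 6) & 0x1;       /* ..._PIOTLB_IH */

        printf("did=0x%x pasid=0x%x addr=0x%llx am=%u ih=%u\n",
               did, pasid, (unsigned long long)addr, am, ih);
        return 0;
    }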

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h | 13 +++++++++++
 2 files changed, 66 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 074d966..6114dd8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3045,6 +3045,55 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     return true;
 }
 
+static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
+                                        uint16_t domain_id,
+                                        uint32_t pasid)
+{
+}
+
+static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
+                                       uint32_t pasid, hwaddr addr, uint8_t am,
+                                       bool ih)
+{
+}
+
+static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
+                                    VTDInvDesc *inv_desc)
+{
+    uint16_t domain_id;
+    uint32_t pasid;
+    uint8_t am;
+    hwaddr addr;
+
+    if ((inv_desc->val[0] & VTD_INV_DESC_PIOTLB_RSVD_VAL0) ||
+        (inv_desc->val[1] & VTD_INV_DESC_PIOTLB_RSVD_VAL1)) {
+        error_report_once("non-zero-field-in-piotlb_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    domain_id = VTD_INV_DESC_PIOTLB_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PIOTLB_PASID(inv_desc->val[0]);
+    switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
+    case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
+        vtd_piotlb_pasid_invalidate(s, domain_id, pasid);
+        break;
+
+    case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
+        am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
+        addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
+        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
+        break;
+
+    default:
+        error_report_once("Invalid granularity in P-IOTLB desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -3159,6 +3208,10 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         break;
 
     case VTD_INV_DESC_PIOTLB:
+        trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_piotlb_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_WAIT:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 9122601..5a49d5b 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -457,6 +457,19 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
 #define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
 
+#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
+#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
+
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000ffc0ULL
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
+
+#define VTD_INV_DESC_PIOTLB_PASID(val)    (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PIOTLB_DID(val)      (((val) >> 16) & \
+                                             VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PIOTLB_ADDR(val)     ((val) & ~0xfffULL)
+#define VTD_INV_DESC_PIOTLB_AM(val)       ((val) & 0x3fULL)
+#define VTD_INV_DESC_PIOTLB_IH(val)       (((val) >> 6) & 0x1)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 19/22] intel_iommu: process PASID-based iotlb invalidation
@ 2020-03-30  4:24   ` Liu Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, yi.l.liu, Yi Sun, Eduardo Habkost,
	kvm, mst, jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, pbonzini,
	hao.wu, Richard Henderson, david

This patch adds the basic PASID-based iotlb (piotlb) invalidation
support. The piotlb is used when walking the Intel VT-d 1st level
page table. This patch only adds the basic processing; detailed
handling will be added in the next patch.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 53 ++++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h | 13 +++++++++++
 2 files changed, 66 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 074d966..6114dd8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3045,6 +3045,55 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     return true;
 }
 
+static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
+                                        uint16_t domain_id,
+                                        uint32_t pasid)
+{
+}
+
+static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
+                                       uint32_t pasid, hwaddr addr, uint8_t am,
+                                       bool ih)
+{
+}
+
+static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
+                                    VTDInvDesc *inv_desc)
+{
+    uint16_t domain_id;
+    uint32_t pasid;
+    uint8_t am;
+    hwaddr addr;
+
+    if ((inv_desc->val[0] & VTD_INV_DESC_PIOTLB_RSVD_VAL0) ||
+        (inv_desc->val[1] & VTD_INV_DESC_PIOTLB_RSVD_VAL1)) {
+        error_report_once("non-zero-field-in-piotlb_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    domain_id = VTD_INV_DESC_PIOTLB_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PIOTLB_PASID(inv_desc->val[0]);
+    switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
+    case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
+        vtd_piotlb_pasid_invalidate(s, domain_id, pasid);
+        break;
+
+    case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
+        am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
+        addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
+        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
+        break;
+
+    default:
+        error_report_once("Invalid granularity in P-IOTLB desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -3159,6 +3208,10 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         break;
 
     case VTD_INV_DESC_PIOTLB:
+        trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_piotlb_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_WAIT:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 9122601..5a49d5b 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -457,6 +457,19 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
 #define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
 
+#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
+#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
+
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000ffc0ULL
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
+
+#define VTD_INV_DESC_PIOTLB_PASID(val)    (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PIOTLB_DID(val)      (((val) >> 16) & \
+                                             VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PIOTLB_ADDR(val)     ((val) & ~0xfffULL)
+#define VTD_INV_DESC_PIOTLB_AM(val)       ((val) & 0x3fULL)
+#define VTD_INV_DESC_PIOTLB_IH(val)       (((val) >> 6) & 0x1)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 20/22] intel_iommu: propagate PASID-based iotlb invalidation to host
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:24   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:24 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

This patch propagates PASID-based iotlb invalidation to host.

Intel VT-d 3.0 supports nested translation at PASID granularity.
Guest SVA support could be implemented by configuring nested
translation on a specific PASID. This is also known as dual-stage
DMA translation.

Under such a configuration, the guest owns the GVA->GPA translation,
which is configured as the first level page table on the host side
for a specific pasid, while the host owns the GPA->HPA translation.
As the guest owns the first level translation table, piotlb
invalidations should be propagated to the host, since the host IOMMU
caches first level page table mappings during DMA address translation.

This patch traps the guest PASID-based iotlb flush and propagates
it to the host.
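
For reference, a standalone sketch (illustration only; the address and
address mask are made up) of how the address-selective case below turns
(addr, am) into the granule description handed to the host:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t addr = 0x7f1234560000ULL;  /* page-aligned start address */
        unsigned int am = 3;                /* address mask: 2^am 4KB pages */

        uint64_t granule_size = 1ULL << (12 + am);  /* 32KB in this example */
        uint64_t nb_granules = 1;

        printf("invalidate [0x%llx, 0x%llx) in %llu granule(s)\n",
               (unsigned long long)addr,
               (unsigned long long)(addr + granule_size * nb_granules),
               (unsigned long long)nb_granules);
        return 0;
    }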

v1 -> v2: removed the valid check on the vtd_pasid_as instance as
          v2 ensures all vtd_pasid_as instances are valid
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 117 +++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |   7 +++
 2 files changed, 124 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6114dd8..02ad90a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3045,16 +3045,133 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     return true;
 }
 
+/**
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_invalidate_piotlb(IntelIOMMUState *s,
+                                  VTDBus *vtd_bus,
+                                  int devfn,
+                                  DualIOMMUStage1Cache *stage1_cache)
+{
+    VTDHostIOMMUContext *vtd_dev_icx;
+    HostIOMMUContext *iommu_ctx;
+
+    vtd_dev_icx = vtd_bus->dev_icx[devfn];
+    if (!vtd_dev_icx) {
+        goto out;
+    }
+    iommu_ctx = vtd_dev_icx->iommu_ctx;
+    if (!iommu_ctx) {
+        goto out;
+    }
+    if (host_iommu_ctx_flush_stage1_cache(iommu_ctx, stage1_cache)) {
+        error_report("Cache flush failed");
+    }
+out:
+    return;
+}
+
+/**
+ * This function is the per-entry callback for iterating over the
+ * s->vtd_pasid_as list, with VTDPIOTLBInvInfo as the filter. It
+ * propagates the piotlb invalidation to the host. The caller of this
+ * function should hold iommu_lock.
+ */
+static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
+                                  gpointer user_data)
+{
+    VTDPIOTLBInvInfo *piotlb_info = user_data;
+    VTDPASIDAddressSpace *vtd_pasid_as = value;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+    uint16_t did;
+
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+
+    if ((piotlb_info->domain_id == did) &&
+        (piotlb_info->pasid == vtd_pasid_as->pasid)) {
+        vtd_invalidate_piotlb(vtd_pasid_as->iommu_state,
+                              vtd_pasid_as->vtd_bus,
+                              vtd_pasid_as->devfn,
+                              piotlb_info->stage1_cache);
+    }
+
+    /*
+     * TODO: needs to add QEMU piotlb flush when QEMU piotlb
+     * infrastructure is ready. For now, it is enough for passthru
+     * devices.
+     */
+}
+
 static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
                                         uint16_t domain_id,
                                         uint32_t pasid)
 {
+    VTDPIOTLBInvInfo piotlb_info;
+    DualIOMMUStage1Cache *stage1_cache;
+    struct iommu_cache_invalidate_info *cache_info;
+
+    stage1_cache = g_malloc0(sizeof(*stage1_cache));
+    stage1_cache->pasid = pasid;
+
+    cache_info = &stage1_cache->cache_info;
+    cache_info->version = IOMMU_UAPI_VERSION;
+    cache_info->cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+    cache_info->granularity = IOMMU_INV_GRANU_PASID;
+    cache_info->pasid_info.pasid = pasid;
+    cache_info->pasid_info.flags = IOMMU_INV_PASID_FLAGS_PASID;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.stage1_cache = stage1_cache;
+
+    vtd_iommu_lock(s);
+    /*
+     * Loop over all the vtd_pasid_as instances in s->vtd_pasid_as
+     * to find the affected devices, since a piotlb invalidation
+     * should check the pasid cache from the architecture point of view.
+     */
+    g_hash_table_foreach(s->vtd_pasid_as,
+                         vtd_flush_pasid_iotlb, &piotlb_info);
+    vtd_iommu_unlock(s);
+    g_free(stage1_cache);
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
                                        uint32_t pasid, hwaddr addr, uint8_t am,
                                        bool ih)
 {
+    VTDPIOTLBInvInfo piotlb_info;
+    DualIOMMUStage1Cache *stage1_cache;
+    struct iommu_cache_invalidate_info *cache_info;
+
+    stage1_cache = g_malloc0(sizeof(*stage1_cache));
+    stage1_cache->pasid = pasid;
+
+    cache_info = &stage1_cache->cache_info;
+    cache_info->version = IOMMU_UAPI_VERSION;
+    cache_info->cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+    cache_info->granularity = IOMMU_INV_GRANU_ADDR;
+    cache_info->addr_info.flags = IOMMU_INV_ADDR_FLAGS_PASID;
+    cache_info->addr_info.flags |= ih ? IOMMU_INV_ADDR_FLAGS_LEAF : 0;
+    cache_info->addr_info.pasid = pasid;
+    cache_info->addr_info.addr = addr;
+    cache_info->addr_info.granule_size = 1 << (12 + am);
+    cache_info->addr_info.nb_granules = 1;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.stage1_cache = stage1_cache;
+
+    vtd_iommu_lock(s);
+    /*
+     * Loop over all the vtd_pasid_as instances in s->vtd_pasid_as
+     * to find the affected devices, since a piotlb invalidation
+     * should check the pasid cache from the architecture point of view.
+     */
+    g_hash_table_foreach(s->vtd_pasid_as,
+                         vtd_flush_pasid_iotlb, &piotlb_info);
+    vtd_iommu_unlock(s);
+    g_free(stage1_cache);
 }
 
 static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 5a49d5b..85ebaa5 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -556,6 +556,13 @@ struct VTDPASIDCacheInfo {
                                       VTD_PASID_CACHE_DEVSI)
 typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 
+struct VTDPIOTLBInvInfo {
+    uint16_t domain_id;
+    uint32_t pasid;
+    DualIOMMUStage1Cache *stage1_cache;
+};
+typedef struct VTDPIOTLBInvInfo VTDPIOTLBInvInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 21/22] intel_iommu: process PASID-based Device-TLB invalidation
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:25   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:25 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

This patch adds empty handling for PASID-based Device-TLB
invalidation. For now this is sufficient: the invalidation does
not need to be propagated to the host for passthru devices, and
no emulated device implements a device TLB yet.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 18 ++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  1 +
 2 files changed, 19 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 02ad90a..e8877d4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3224,6 +3224,17 @@ static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
     return true;
 }
 
+static bool vtd_process_device_piotlb_desc(IntelIOMMUState *s,
+                                           VTDInvDesc *inv_desc)
+{
+    /*
+     * No need to handle it for passthru devices. Emulated devices
+     * with a device TLB may require real handling, but for now
+     * simply returning is enough.
+     */
+    return true;
+}
+
 static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
                                           VTDInvDesc *inv_desc)
 {
@@ -3345,6 +3356,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    case VTD_INV_DESC_DEV_PIOTLB:
+        trace_vtd_inv_desc("device-piotlb", inv_desc.hi, inv_desc.lo);
+        if (!vtd_process_device_piotlb_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
     case VTD_INV_DESC_DEVICE:
         trace_vtd_inv_desc("device", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_device_iotlb_desc(s, &inv_desc)) {
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 85ebaa5..4910e63 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -386,6 +386,7 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_WAIT               0x5 /* Invalidation Wait Descriptor */
 #define VTD_INV_DESC_PIOTLB             0x6 /* PASID-IOTLB Invalidate Desc */
 #define VTD_INV_DESC_PC                 0x7 /* PASID-cache Invalidate Desc */
+#define VTD_INV_DESC_DEV_PIOTLB         0x8 /* PASID-based-DIOTLB inv_desc*/
 #define VTD_INV_DESC_NONE               0   /* Not an Invalidate Descriptor */
 
 /* Masks for Invalidation Wait Descriptor*/
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v2 22/22] intel_iommu: modify x-scalable-mode to be string option
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  4:25   ` Liu Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu Yi L @ 2020-03-30  4:25 UTC (permalink / raw)
  To: qemu-devel, alex.williamson, peterx
  Cc: eric.auger, pbonzini, mst, david, kevin.tian, yi.l.liu,
	jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

Intel VT-d 3.0 introduces scalable mode, which comes with a number of
capabilities related to scalable-mode translation and hence many possible
combinations. This vIOMMU implementation simplifies things for the user by
offering a few typical combinations, selected via the "x-scalable-mode"
option. The usage is as below:

"-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"

 - "legacy": gives support for SL page table
 - "modern": gives support for FL page table, pasid, virtual command
 - "off": no scalable mode support
 - if not configured: no scalable mode support; an improperly configured
   value throws an error
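
For illustration only, a complete invocation enabling the "modern" mode for
a passthru device could look roughly like below (the vfio-pci BDF is just a
placeholder):

  qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
      -device intel-iommu,caching-mode=on,x-scalable-mode=modern \
      -device vfio-pci,host=<bdf> ...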

Note: this patch is supposed to be merged only once the whole vSVA patch
series is merged.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/i386/intel_iommu.c          | 30 ++++++++++++++++++++++++++++--
 hw/i386/intel_iommu_internal.h |  4 ++++
 include/hw/i386/intel_iommu.h  |  2 ++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e8877d4..2e745e8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4056,7 +4056,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
                       VTD_HOST_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
-    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
+    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -4688,8 +4688,12 @@ static void vtd_init(IntelIOMMUState *s)
     }
 
     /* TODO: read cap/ecap from host to decide which cap to be exposed. */
-    if (s->scalable_mode) {
+    if (s->scalable_mode && !s->scalable_modern) {
         s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+    } else if (s->scalable_mode && s->scalable_modern) {
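+        /* modern SM: advertise FL translation, PASID and virtual command */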
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
+                   | VTD_ECAP_FLTS | VTD_ECAP_PSS | VTD_ECAP_VCS;
+        s->vccap |= VTD_VCCAP_PAS;
     }
 
     vtd_reset_caches(s);
@@ -4821,6 +4825,28 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         return false;
     }
 
+    if (s->scalable_mode_str &&
+        (strcmp(s->scalable_mode_str, "off") &&
+         strcmp(s->scalable_mode_str, "modern") &&
+         strcmp(s->scalable_mode_str, "legacy"))) {
+        error_setg(errp, "Invalid x-scalable-mode config,"
+                         "Please use \"modern\", \"legacy\" or \"off\"");
+        return false;
+    }
+
+    if (s->scalable_mode_str &&
+        !strcmp(s->scalable_mode_str, "legacy")) {
+        s->scalable_mode = true;
+        s->scalable_modern = false;
+    } else if (s->scalable_mode_str &&
+        !strcmp(s->scalable_mode_str, "modern")) {
+        s->scalable_mode = true;
+        s->scalable_modern = true;
+    } else {
+        s->scalable_mode = false;
+        s->scalable_modern = false;
+    }
+
     return true;
 }
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 4910e63..e0719bc 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -196,8 +196,12 @@
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_MHMV               (15ULL << 20)
 #define VTD_ECAP_SRS                (1ULL << 31)
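+/* PSS = 19 advertises 20-bit PASIDs (the field encodes supported size - 1) */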
+#define VTD_ECAP_PSS                (19ULL << 35)
+#define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
+#define VTD_ECAP_VCS                (1ULL << 44)
 #define VTD_ECAP_SLTS               (1ULL << 46)
+#define VTD_ECAP_FLTS               (1ULL << 47)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 626c1cd..3831ba7 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -263,6 +263,8 @@ struct IntelIOMMUState {
 
     bool caching_mode;              /* RO - is cap CM enabled? */
     bool scalable_mode;             /* RO - is Scalable Mode supported? */
+    char *scalable_mode_str;        /* RO - admin's Scalable Mode config */
+    bool scalable_modern;           /* RO - is modern SM supported? */
 
     dma_addr_t root;                /* Current root table pointer */
     bool root_scalable;             /* Type of root table (scalable or not) */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30  5:40   ` no-reply
  -1 siblings, 0 replies; 160+ messages in thread
From: no-reply @ 2020-03-30  5:40 UTC (permalink / raw)
  To: yi.l.liu
  Cc: qemu-devel, alex.williamson, peterx, jean-philippe, kevin.tian,
	yi.l.liu, kvm, mst, jun.j.tian, eric.auger, yi.y.sun, pbonzini,
	hao.wu, david

Patchew URL: https://patchew.org/QEMU/1585542301-84087-1-git-send-email-yi.l.liu@intel.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

                 from /tmp/qemu-test/src/include/hw/pci/pci_bus.h:4,
                 from /tmp/qemu-test/src/include/hw/pci-host/i440fx.h:15,
                 from /tmp/qemu-test/src/stubs/pci-host-piix.c:2:
/tmp/qemu-test/src/include/hw/iommu/host_iommu_context.h:28:10: fatal error: linux/iommu.h: No such file or directory
 #include <linux/iommu.h>
          ^~~~~~~~~~~~~~~
compilation terminated.
  CC      scsi/pr-manager-stub.o
make: *** [/tmp/qemu-test/src/rules.mak:69: stubs/pci-host-piix.o] Error 1
make: *** Waiting for unfinished jobs....
  CC      block/curl.o
Traceback (most recent call last):
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=a71cba547b0b47ef91f874b42e00f828', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-enp9m7rr/src/docker-src.2020-03-30-01.38.53.2480:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=a71cba547b0b47ef91f874b42e00f828
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-enp9m7rr/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    2m1.872s
user    0m8.422s


The full log is available at
http://patchew.org/logs/1585542301-84087-1-git-send-email-yi.l.liu@intel.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 03/22] vfio: check VFIO_TYPE1_NESTING_IOMMU support
  2020-03-30  4:24   ` Liu Yi L
@ 2020-03-30  9:36     ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-30  9:36 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe, Jacob Pan, Yi Sun

Yi,

On 3/30/20 6:24 AM, Liu Yi L wrote:
> VFIO needs to check VFIO_TYPE1_NESTING_IOMMU support with Kernel before
> further using it. e.g. requires to check IOMMU UAPI version.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
>  hw/vfio/common.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0b3593b..c276732 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1157,12 +1157,21 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
>  static int vfio_get_iommu_type(VFIOContainer *container,
>                                 Error **errp)
>  {
> -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> +    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
> +                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>                            VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
> -    int i;
> +    int i, version;
>  
>      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
>          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> +            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU) {
> +                version = ioctl(container->fd, VFIO_CHECK_EXTENSION,
> +                                VFIO_NESTING_IOMMU_UAPI);
> +                if (version < IOMMU_UAPI_VERSION) {
> +                    info_report("IOMMU UAPI incompatible for nesting");
> +                    continue;
> +                }
> +            }
This means that by default VFIO_TYPE1_NESTING_IOMMU would be chosen. I
don't think this is what we want. On ARM this would mean that for a
standard VFIO assignment without vIOMMU, SL will be used instead of FL.
This may not be harmless.

For instance, in "[RFC v6 09/24] vfio: Force nested if iommu requires
it", I use nested only if I detect we have a vSMMU. Otherwise I keep the
legacy VFIO_TYPE1v2_IOMMU.
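
Something along those lines (just an untested sketch; "want_nested" is a
new parameter the caller would derive from whether a vIOMMU that actually
needs nesting is present):

    static int vfio_get_iommu_type(VFIOContainer *container,
                                   bool want_nested, Error **errp)
    {
        int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
                              VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
                              VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
        int i;

        for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU && !want_nested) {
                continue; /* fall back to the legacy types */
            }
            if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
                return iommu_types[i];
            }
        }
        error_setg(errp, "No available IOMMU models");
        return -EINVAL;
    }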

Thanks

Eric
>              return iommu_types[i];
>          }
>      }
> @@ -1278,6 +1287,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      switch (container->iommu_type) {
> +    case VFIO_TYPE1_NESTING_IOMMU:
>      case VFIO_TYPE1v2_IOMMU:
>      case VFIO_TYPE1_IOMMU:
>      {
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-03-30  4:24 ` Liu Yi L
@ 2020-03-30 10:36   ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-30 10:36 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe

Hi Yi,

On 3/30/20 6:24 AM, Liu Yi L wrote:
> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> Intel platforms allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
> 
> This QEMU series is intended to expose SVA usage to VMs. i.e. Sharing
> guest application address space with passthru devices. This is called
> vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> changes.
> 
> The high-level architecture for SVA virtualization is as below, the key
> design of vSVA support is to utilize the dual-stage IOMMU translation (
> also known as IOMMU nesting translation) capability in host IOMMU.
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process CR3, FL only|
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |                CR3 in GPA
>     '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.------------------------------.
>     |             |   |SL for GPA-HPA, default domain|
>     |             |   '------------------------------'
>     '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> The complete vSVA kernel upstream patches are divided into three phases:
>     1. Common APIs and PCI device direct assignment
>     2. IOMMU-backed Mediated Device assignment
>     3. Page Request Services (PRS) support
> 
> This QEMU patchset is aiming for the phase 1 and phase 2. It is based
> on the two kernel series below.
> [1] [PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support:
> https://lkml.org/lkml/2020/3/20/1172
> [2] [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs
> https://lkml.org/lkml/2020/3/22/116
+ [PATCH v2 0/3] IOMMU user API enhancement, right?

I think in general, as long as the kernel dependencies are not resolved,
the QEMU series is supposed to stay in RFC state.

Thanks

Eric
> 
> There are roughly two parts:
>  1. Introduce HostIOMMUContext as abstract of host IOMMU. It provides explicit
>     method for vIOMMU emulators to communicate with host IOMMU. e.g. propagate
>     guest page table binding to host IOMMU to setup dual-stage DMA translation
>     in host IOMMU and flush iommu iotlb.
>  2. Setup dual-stage IOMMU translation for Intel vIOMMU. Includes 
>     - Check IOMMU uAPI version compatibility and VFIO Nesting capabilities which
>       includes hardware compatibility (stage 1 format) and VFIO_PASID_REQ
>       availability. This is preparation for setting up dual-stage DMA translation
>       in host IOMMU.
>     - Propagate guest PASID allocation and free request to host.
>     - Propagate guest page table binding to host to setup dual-stage IOMMU DMA
>       translation in host IOMMU.
>     - Propagate guest IOMMU cache invalidation to host to ensure iotlb
>       correctness.
> 
> The complete QEMU set can be found in below link:
> https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2
> 
> Complete kernel can be found in:
> https://github.com/luxis1999/linux-vsva.git: vsva-linux-5.6-rc6
> 
> Tests: basic vSVA functionality test, VM reboot/shutdown/crash, kernel build in
> guest, boot VM with vSVA disabled, full compilation with all archs.
> 
> Regards,
> Yi Liu
> 
> Changelog:
> 	- Patch v1 -> Patch v2:
> 	  a) Refactor the vfio HostIOMMUContext init code (patch 0008 - 0009 of v1 series)
> 	  b) Refactor the pasid binding handling (patch 0011 - 0016 of v1 series)
> 	  Patch v1: https://patchwork.ozlabs.org/cover/1259648/
> 
> 	- RFC v3.1 -> Patch v1:
> 	  a) Implement HostIOMMUContext in QOM manner.
> 	  b) Add pci_set/unset_iommu_context() to register HostIOMMUContext to
> 	     vIOMMU, thus the lifecircle of HostIOMMUContext is awared in vIOMMU
> 	     side. In such way, vIOMMU could use the methods provided by the
> 	     HostIOMMUContext safely.
> 	  c) Add back patch "[RFC v3 01/25] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps"
> 	  RFCv3.1: https://patchwork.kernel.org/cover/11397879/
> 
> 	- RFC v3 -> v3.1:
> 	  a) Drop IOMMUContext, and rename DualStageIOMMUObject to HostIOMMUContext.
> 	     HostIOMMUContext is per-vfio-container, it is exposed to  vIOMMU via PCI
> 	     layer. VFIO registers a PCIHostIOMMUFunc callback to PCI layer, vIOMMU
> 	     could get HostIOMMUContext instance via it.
> 	  b) Check IOMMU uAPI version by VFIO_CHECK_EXTENSION
> 	  c) Add a check on VFIO_PASID_REQ availability via VFIO_GET_IOMMU_IHNFO
> 	  d) Reorder the series, put vSVA linux header file update in the beginning
> 	     put the x-scalable-mode option mofification in the end of the series.
> 	  e) Dropped patch "[RFC v3 01/25] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps"
> 	  RFCv3: https://patchwork.kernel.org/cover/11356033/
> 
> 	- RFC v2 -> v3:
> 	  a) Introduce DualStageIOMMUObject to abstract the host IOMMU programming
> 	  capability. e.g. request PASID from host, setup IOMMU nesting translation
> 	  on host IOMMU. The pasid_alloc/bind_guest_page_table/iommu_cache_flush
> 	  operations are moved to be DualStageIOMMUOps. Thus, DualStageIOMMUObject
> 	  is an abstract layer which provides QEMU vIOMMU emulators with an explicit
> 	  method to program host IOMMU.
> 	  b) Compared with RFC v2, the IOMMUContext has also been updated. It is
> 	  modified to provide an abstract for vIOMMU emulators. It provides the
> 	  method for pass-through modules (like VFIO) to communicate with host IOMMU.
> 	  e.g. tell vIOMMU emulators about the IOMMU nesting capability on host side
> 	  and report the host IOMMU DMA translation faults to vIOMMU emulators.
> 	  RFC v2: https://www.spinics.net/lists/kvm/msg198556.html
> 
> 	- RFC v1 -> v2:
> 	  Introduce IOMMUContext to abstract the connection between VFIO
> 	  and vIOMMU emulators, which is a replacement of the PCIPASIDOps
> 	  in RFC v1. Modify x-scalable-mode to be string option instead of
> 	  adding a new option as RFC v1 did. Refined the pasid cache management
> 	  and addressed the TODOs mentioned in RFC v1. 
> 	  RFC v1: https://patchwork.kernel.org/cover/11033657/
> 
> Eric Auger (1):
>   scripts/update-linux-headers: Import iommu.h
> 
> Liu Yi L (21):
>   header file update VFIO/IOMMU vSVA APIs
>   vfio: check VFIO_TYPE1_NESTING_IOMMU support
>   hw/iommu: introduce HostIOMMUContext
>   hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
>   hw/pci: introduce pci_device_set/unset_iommu_context()
>   intel_iommu: add set/unset_iommu_context callback
>   vfio/common: provide PASID alloc/free hooks
>   vfio/common: init HostIOMMUContext per-container
>   vfio/pci: set host iommu context to vIOMMU
>   intel_iommu: add virtual command capability support
>   intel_iommu: process PASID cache invalidation
>   intel_iommu: add PASID cache management infrastructure
>   vfio: add bind stage-1 page table support
>   intel_iommu: bind/unbind guest page table to host
>   intel_iommu: replay pasid binds after context cache invalidation
>   intel_iommu: do not pass down pasid bind for PASID #0
>   vfio: add support for flush iommu stage-1 cache
>   intel_iommu: process PASID-based iotlb invalidation
>   intel_iommu: propagate PASID-based iotlb invalidation to host
>   intel_iommu: process PASID-based Device-TLB invalidation
>   intel_iommu: modify x-scalable-mode to be string option
> 
>  hw/Makefile.objs                      |    1 +
>  hw/alpha/typhoon.c                    |    6 +-
>  hw/arm/smmu-common.c                  |    6 +-
>  hw/hppa/dino.c                        |    6 +-
>  hw/i386/amd_iommu.c                   |    6 +-
>  hw/i386/intel_iommu.c                 | 1109 ++++++++++++++++++++++++++++++++-
>  hw/i386/intel_iommu_internal.h        |  114 ++++
>  hw/i386/trace-events                  |    6 +
>  hw/iommu/Makefile.objs                |    1 +
>  hw/iommu/host_iommu_context.c         |  161 +++++
>  hw/pci-host/designware.c              |    6 +-
>  hw/pci-host/pnv_phb3.c                |    6 +-
>  hw/pci-host/pnv_phb4.c                |    6 +-
>  hw/pci-host/ppce500.c                 |    6 +-
>  hw/pci-host/prep.c                    |    6 +-
>  hw/pci-host/sabre.c                   |    6 +-
>  hw/pci/pci.c                          |   53 +-
>  hw/ppc/ppc440_pcix.c                  |    6 +-
>  hw/ppc/spapr_pci.c                    |    6 +-
>  hw/s390x/s390-pci-bus.c               |    8 +-
>  hw/vfio/common.c                      |  260 +++++++-
>  hw/vfio/pci.c                         |   13 +
>  hw/virtio/virtio-iommu.c              |    6 +-
>  include/hw/i386/intel_iommu.h         |   57 +-
>  include/hw/iommu/host_iommu_context.h |  116 ++++
>  include/hw/pci/pci.h                  |   18 +-
>  include/hw/pci/pci_bus.h              |    2 +-
>  include/hw/vfio/vfio-common.h         |    4 +
>  linux-headers/linux/iommu.h           |  378 +++++++++++
>  linux-headers/linux/vfio.h            |  127 ++++
>  scripts/update-linux-headers.sh       |    2 +-
>  31 files changed, 2463 insertions(+), 45 deletions(-)
>  create mode 100644 hw/iommu/Makefile.objs
>  create mode 100644 hw/iommu/host_iommu_context.c
>  create mode 100644 include/hw/iommu/host_iommu_context.h
>  create mode 100644 linux-headers/linux/iommu.h
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  2020-03-30  4:24   ` Liu Yi L
@ 2020-03-30 11:02     ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-30 11:02 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe, Jacob Pan, Yi Sun



On 3/30/20 6:24 AM, Liu Yi L wrote:
> This patch modifies pci_setup_iommu() to set PCIIOMMUOps
> instead of setting PCIIOMMUFunc. PCIIOMMUFunc is used to
> get an address space for a PCI device in vendor specific
> way. The PCIIOMMUOps still offers this functionality. But
> using PCIIOMMUOps leaves space to add more iommu related
> vendor specific operations.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/alpha/typhoon.c       |  6 +++++-
>  hw/arm/smmu-common.c     |  6 +++++-
>  hw/hppa/dino.c           |  6 +++++-
>  hw/i386/amd_iommu.c      |  6 +++++-
>  hw/i386/intel_iommu.c    |  6 +++++-
>  hw/pci-host/designware.c |  6 +++++-
>  hw/pci-host/pnv_phb3.c   |  6 +++++-
>  hw/pci-host/pnv_phb4.c   |  6 +++++-
>  hw/pci-host/ppce500.c    |  6 +++++-
>  hw/pci-host/prep.c       |  6 +++++-
>  hw/pci-host/sabre.c      |  6 +++++-
>  hw/pci/pci.c             | 12 +++++++-----
>  hw/ppc/ppc440_pcix.c     |  6 +++++-
>  hw/ppc/spapr_pci.c       |  6 +++++-
>  hw/s390x/s390-pci-bus.c  |  8 ++++++--
>  hw/virtio/virtio-iommu.c |  6 +++++-
>  include/hw/pci/pci.h     |  8 ++++++--
>  include/hw/pci/pci_bus.h |  2 +-
>  18 files changed, 90 insertions(+), 24 deletions(-)
> 
> diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
> index 1795e2f..f271de1 100644
> --- a/hw/alpha/typhoon.c
> +++ b/hw/alpha/typhoon.c
> @@ -740,6 +740,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &s->pchip.iommu_as;
>  }
>  
> +static const PCIIOMMUOps typhoon_iommu_ops = {
> +    .get_address_space = typhoon_pci_dma_iommu,
> +};
> +
>  static void typhoon_set_irq(void *opaque, int irq, int level)
>  {
>      TyphoonState *s = opaque;
> @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus **isa_bus, qemu_irq *p_rtc_irq,
>                               "iommu-typhoon", UINT64_MAX);
>      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s->pchip.iommu),
>                         "pchip0-pci");
> -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
> +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
>  
>      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
>      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s), &alpha_pci_iack_ops,
> diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
> index e13a5f4..447146e 100644
> --- a/hw/arm/smmu-common.c
> +++ b/hw/arm/smmu-common.c
> @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
>      return &sdev->as;
>  }
>  
> +static const PCIIOMMUOps smmu_ops = {
> +    .get_address_space = smmu_find_add_as,
> +};
> +
>  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)
>  {
>      uint8_t bus_n, devfn;
> @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
>      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
>  
>      if (s->primary_bus) {
> -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
> +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
>      } else {
>          error_setg(errp, "SMMU is not attached to any PCI bus!");
>      }
> diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c
> index 2b1b38c..3da4f84 100644
> --- a/hw/hppa/dino.c
> +++ b/hw/hppa/dino.c
> @@ -459,6 +459,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus *bus, void *opaque,
>      return &s->bm_as;
>  }
>  
> +static const PCIIOMMUOps dino_iommu_ops = {
> +    .get_address_space = dino_pcihost_set_iommu,
> +};
> +
>  /*
>   * Dino interrupts are connected as shown on Page 78, Table 23
>   * (Little-endian bit numbers)
> @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
>      memory_region_add_subregion(&s->bm, 0xfff00000,
>                                  &s->bm_cpu_alias);
>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
> +    pci_setup_iommu(b, &dino_iommu_ops, s);
>  
>      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
>      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0);
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index b1175e5..5fec30e 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -1451,6 +1451,10 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &iommu_as[devfn]->as;
>  }
>  
> +static const PCIIOMMUOps amdvi_iommu_ops = {
> +    .get_address_space = amdvi_host_dma_iommu,
> +};
> +
>  static const MemoryRegionOps mmio_mem_ops = {
>      .read = amdvi_mmio_read,
>      .write = amdvi_mmio_write,
> @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev, Error **errp)
>  
>      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
> -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
> +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
>      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
>      msi_init(&s->pci.dev, 0, 1, true, false, errp);
>      amdvi_init(s);
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index df7ad25..4b22910 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3729,6 +3729,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &vtd_as->as;
>  }
>  
> +static PCIIOMMUOps vtd_iommu_ops = {
static const
> +    .get_address_space = vtd_host_dma_iommu,
> +};
> +
>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>  {
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
> @@ -3840,7 +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>                                                g_free, g_free);
>      vtd_init(s);
>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
> -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
> +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>      /* Pseudo address space under root PCI bus. */
>      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s, Q35_PSEUDO_DEVFN_IOAPIC);
>      qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
> index dd24551..4c6338a 100644
> --- a/hw/pci-host/designware.c
> +++ b/hw/pci-host/designware.c
> @@ -645,6 +645,10 @@ static AddressSpace *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
>      return &s->pci.address_space;
>  }
>  
> +static const PCIIOMMUOps designware_iommu_ops = {
> +    .get_address_space = designware_pcie_host_set_iommu,
> +};
> +
>  static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
>  {
>      PCIHostState *pci = PCI_HOST_BRIDGE(dev);
> @@ -686,7 +690,7 @@ static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
>      address_space_init(&s->pci.address_space,
>                         &s->pci.address_space_root,
>                         "pcie-bus-address-space");
> -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
> +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
>  
>      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
>      qdev_init_nofail(DEVICE(&s->root));
> diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c
> index 74618fa..ecfe627 100644
> --- a/hw/pci-host/pnv_phb3.c
> +++ b/hw/pci-host/pnv_phb3.c
> @@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &ds->dma_as;
>  }
>  
> +static PCIIOMMUOps pnv_phb3_iommu_ops = {
static const
> +    .get_address_space = pnv_phb3_dma_iommu,
> +};
> +
>  static void pnv_phb3_instance_init(Object *obj)
>  {
>      PnvPHB3 *phb = PNV_PHB3(obj);
> @@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev, Error **errp)
>                                       &phb->pci_mmio, &phb->pci_io,
>                                       0, 4, TYPE_PNV_PHB3_ROOT_BUS);
>  
> -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
> +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
>  
>      /* Add a single Root port */
>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
> diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c
> index 23cf093..04e95e3 100644
> --- a/hw/pci-host/pnv_phb4.c
> +++ b/hw/pci-host/pnv_phb4.c
> @@ -1148,6 +1148,10 @@ static AddressSpace *pnv_phb4_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &ds->dma_as;
>  }
>  
> +static PCIIOMMUOps pnv_phb4_iommu_ops = {
idem
> +    .get_address_space = pnv_phb4_dma_iommu,
> +};
> +
>  static void pnv_phb4_instance_init(Object *obj)
>  {
>      PnvPHB4 *phb = PNV_PHB4(obj);
> @@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev, Error **errp)
>                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
>                                       &phb->pci_mmio, &phb->pci_io,
>                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
> -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
> +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
>  
>      /* Add a single Root port */
>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
> diff --git a/hw/pci-host/ppce500.c b/hw/pci-host/ppce500.c
> index d710727..5baf5db 100644
> --- a/hw/pci-host/ppce500.c
> +++ b/hw/pci-host/ppce500.c
> @@ -439,6 +439,10 @@ static AddressSpace *e500_pcihost_set_iommu(PCIBus *bus, void *opaque,
>      return &s->bm_as;
>  }
>  
> +static const PCIIOMMUOps ppce500_iommu_ops = {
> +    .get_address_space = e500_pcihost_set_iommu,
> +};
> +
>  static void e500_pcihost_realize(DeviceState *dev, Error **errp)
>  {
>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
> @@ -473,7 +477,7 @@ static void e500_pcihost_realize(DeviceState *dev, Error **errp)
>      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
> +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
>  
>      pci_create_simple(b, 0, "e500-host-bridge");
>  
> diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c
> index 1a02e9a..7c57311 100644
> --- a/hw/pci-host/prep.c
> +++ b/hw/pci-host/prep.c
> @@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus *bus, void *opaque,
>      return &s->bm_as;
>  }
>  
> +static const PCIIOMMUOps raven_iommu_ops = {
> +    .get_address_space = raven_pcihost_set_iommu,
> +};
> +
>  static void raven_change_gpio(void *opaque, int n, int level)
>  {
>      PREPPCIState *s = opaque;
> @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
>      memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
>      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
>      address_space_init(&s->bm_as, &s->bm, "raven-bm");
> -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
> +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
>  
>      h->bus = &s->pci_bus;
>  
> diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c
> index 2b8503b..251549b 100644
> --- a/hw/pci-host/sabre.c
> +++ b/hw/pci-host/sabre.c
> @@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &is->iommu_as;
>  }
>  
> +static const PCIIOMMUOps sabre_iommu_ops = {
> +    .get_address_space = sabre_pci_dma_iommu,
> +};
> +
>  static void sabre_config_write(void *opaque, hwaddr addr,
>                                 uint64_t val, unsigned size)
>  {
> @@ -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
>      /* IOMMU */
>      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
>                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
> -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
> +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
>  
>      /* APB secondary busses */
>      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0), true,
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index e1ed667..aa9025c 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2644,7 +2644,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>      PCIBus *iommu_bus = bus;
>      uint8_t devfn = dev->devfn;
>  
> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> +    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
Depending on future usage, this is not strictly identical to the
original code: you exit the loop as soon as iommu_bus->iommu_ops is
set, regardless of whether get_address_space() is implemented.
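A stricter check that would keep the original semantics could look
like this (untested sketch, for illustration only):

    while (iommu_bus &&
           !(iommu_bus->iommu_ops &&
             iommu_bus->iommu_ops->get_address_space) &&
           iommu_bus->parent_dev) {
        ...
    }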

>          PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>  
>          /*
> @@ -2683,15 +2683,17 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  
>          iommu_bus = parent_bus;
>      }
> -    if (iommu_bus && iommu_bus->iommu_fn) {
> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> +    if (iommu_bus && iommu_bus->iommu_ops &&
> +                     iommu_bus->iommu_ops->get_address_space) {
> +        return iommu_bus->iommu_ops->get_address_space(bus,
> +                                 iommu_bus->iommu_opaque, devfn);
>      }
>      return &address_space_memory;
>  }
>  
> -void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
> +void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
>  {
> -    bus->iommu_fn = fn;
> +    bus->iommu_ops = ops;
>      bus->iommu_opaque = opaque;
>  }
>  
> diff --git a/hw/ppc/ppc440_pcix.c b/hw/ppc/ppc440_pcix.c
> index 2ee2d4f..7b17ee5 100644
> --- a/hw/ppc/ppc440_pcix.c
> +++ b/hw/ppc/ppc440_pcix.c
> @@ -442,6 +442,10 @@ static AddressSpace *ppc440_pcix_set_iommu(PCIBus *b, void *opaque, int devfn)
>      return &s->bm_as;
>  }
>  
> +static const PCIIOMMUOps ppc440_iommu_ops = {
> +    .get_address_space = ppc440_pcix_set_iommu,
> +};
> +
>  /* The default pci_host_data_{read,write} functions in pci/pci_host.c
>   * deny access to registers without bit 31 set but our clients want
>   * this to work so we have to override these here */
> @@ -487,7 +491,7 @@ static void ppc440_pcix_realize(DeviceState *dev, Error **errp)
>      memory_region_init(&s->bm, OBJECT(s), "bm-ppc440-pcix", UINT64_MAX);
>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> -    pci_setup_iommu(h->bus, ppc440_pcix_set_iommu, s);
> +    pci_setup_iommu(h->bus, &ppc440_iommu_ops, s);
>  
>      memory_region_init(&s->container, OBJECT(s), "pci-container", PCI_ALL_SIZE);
>      memory_region_init_io(&h->conf_mem, OBJECT(s), &pci_host_conf_le_ops,
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 709a527..729a1cb 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -771,6 +771,10 @@ static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &phb->iommu_as;
>  }
>  
> +static const PCIIOMMUOps spapr_iommu_ops = {
> +    .get_address_space = spapr_pci_dma_iommu,
> +};
> +
>  static char *spapr_phb_vfio_get_loc_code(SpaprPhbState *sphb,  PCIDevice *pdev)
>  {
>      char *path = NULL, *buf = NULL, *host = NULL;
> @@ -1950,7 +1954,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      memory_region_add_subregion(&sphb->iommu_root, SPAPR_PCI_MSI_WINDOW,
>                                  &sphb->msiwindow);
>  
> -    pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
> +    pci_setup_iommu(bus, &spapr_iommu_ops, sphb);
>  
>      pci_bus_set_route_irq_fn(bus, spapr_route_intx_pin_to_irq);
>  
> diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
> index ed8be12..c1c3aa4 100644
> --- a/hw/s390x/s390-pci-bus.c
> +++ b/hw/s390x/s390-pci-bus.c
> @@ -635,6 +635,10 @@ static AddressSpace *s390_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &iommu->as;
>  }
>  
> +static const PCIIOMMUOps s390_iommu_ops = {
> +    .get_address_space = s390_pci_dma_iommu,
> +};
> +
>  static uint8_t set_ind_atomic(uint64_t ind_loc, uint8_t to_be_set)
>  {
>      uint8_t ind_old, ind_new;
> @@ -748,7 +752,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
>      b = pci_register_root_bus(dev, NULL, s390_pci_set_irq, s390_pci_map_irq,
>                                NULL, get_system_memory(), get_system_io(), 0,
>                                64, TYPE_PCI_BUS);
> -    pci_setup_iommu(b, s390_pci_dma_iommu, s);
> +    pci_setup_iommu(b, &s390_iommu_ops, s);
>  
>      bus = BUS(b);
>      qbus_set_hotplug_handler(bus, OBJECT(dev), &local_err);
> @@ -919,7 +923,7 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>  
>          pdev = PCI_DEVICE(dev);
>          pci_bridge_map_irq(pb, dev->id, s390_pci_map_irq);
> -        pci_setup_iommu(&pb->sec_bus, s390_pci_dma_iommu, s);
> +        pci_setup_iommu(&pb->sec_bus, &s390_iommu_ops, s);
>  
>          qbus_set_hotplug_handler(BUS(&pb->sec_bus), OBJECT(s), errp);
>  
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 4cee808..fefc24e 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -235,6 +235,10 @@ static AddressSpace *virtio_iommu_find_add_as(PCIBus *bus, void *opaque,
>      return &sdev->as;
>  }
>  
> +static const PCIIOMMUOps virtio_iommu_ops = {
> +    .get_address_space = virtio_iommu_find_add_as,
> +};
> +
>  static int virtio_iommu_attach(VirtIOIOMMU *s,
>                                 struct virtio_iommu_req_attach *req)
>  {
> @@ -682,7 +686,7 @@ static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
>      s->as_by_busptr = g_hash_table_new_full(NULL, NULL, NULL, g_free);
>  
>      if (s->primary_bus) {
> -        pci_setup_iommu(s->primary_bus, virtio_iommu_find_add_as, s);
> +        pci_setup_iommu(s->primary_bus, &virtio_iommu_ops, s);
>      } else {
>          error_setg(errp, "VIRTIO-IOMMU is not attached to any PCI bus!");
>      }
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cfedf5a..ffe192d 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -485,10 +485,14 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
>  
>  void pci_device_deassert_intx(PCIDevice *dev);
>  
> -typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> +typedef struct PCIIOMMUOps PCIIOMMUOps;
> +struct PCIIOMMUOps {
> +    AddressSpace * (*get_address_space)(PCIBus *bus,
> +                                void *opaque, int32_t devfn);
> +};
>  
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> -void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
> +void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
>  
>  static inline void
>  pci_set_byte(uint8_t *config, uint8_t val)
> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> index 0714f57..c281057 100644
> --- a/include/hw/pci/pci_bus.h
> +++ b/include/hw/pci/pci_bus.h
> @@ -29,7 +29,7 @@ enum PCIBusFlags {
>  struct PCIBus {
>      BusState qbus;
>      enum PCIBusFlags flags;
> -    PCIIOMMUFunc iommu_fn;
> +    const PCIIOMMUOps *iommu_ops;
>      void *iommu_opaque;
>      uint8_t devfn_min;
>      uint32_t slot_reserved_mask;
> 
Thanks

Eric


^ permalink raw reply	[flat|nested] 160+ messages in thread


* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-03-30 10:36   ` Auger Eric
@ 2020-03-30 14:46     ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-03-30 14:46 UTC (permalink / raw)
  To: Auger Eric
  Cc: Liu Yi L, qemu-devel, alex.williamson, pbonzini, mst, david,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe

On Mon, Mar 30, 2020 at 12:36:23PM +0200, Auger Eric wrote:
> I think in general, as long as the kernel dependencies are not resolved,
> the QEMU series is supposed to stay in RFC state.

Yeah I agree. I think the subject is not extremely important, but we
definitely should wait for the kernel part to be ready before merging
the series.

Side note: I offered quite a few r-bs for the series (and I still plan
to keep reading it this week since there's a new version, and to offer
more r-bs while I still have some context in my brain-cache); however,
they're mostly for my own benefit, to avoid re-reading the whole series
again in the future, especially because it's huge... :)

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread


* Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
  2020-03-30  4:24   ` Liu Yi L
@ 2020-03-30 17:22     ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-30 17:22 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe, Jacob Pan, Yi Sun

Yi,

On 3/30/20 6:24 AM, Liu Yi L wrote:
> Currently, many platform vendors provide the capability of dual stage
> DMA address translation in hardware. For example, nested translation
> on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
> and etc. In dual stage DMA address translation, there are two stages
> address translation, stage-1 (a.k.a first-level) and stage-2 (a.k.a
> second-level) translation structures. Stage-1 translation results are
> also subjected to stage-2 translation structures. Take vSVA (Virtual
> Shared Virtual Addressing) as an example, guest IOMMU driver owns
> stage-1 translation structures (covers GVA->GPA translation), and host
> IOMMU driver owns stage-2 translation structures (covers GPA->HPA
> translation). VMM is responsible to bind stage-1 translation structures
> to host, thus hardware could achieve GVA->GPA and then GPA->HPA
> translation. For more background on SVA, refer the below links.
>  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
>  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> 
> In QEMU, vIOMMU emulators expose IOMMUs to VM per their own spec (e.g.
> Intel VT-d spec). Devices are pass-through to guest via device pass-
> through components like VFIO. VFIO is a userspace driver framework
> which exposes host IOMMU programming capability to userspace in a
> secure manner. e.g. IOVA MAP/UNMAP requests. Thus the major connection
> between VFIO and vIOMMU are MAP/UNMAP. However, with the dual stage
> DMA translation support, there are more interactions between vIOMMU and
> VFIO as below:

I think it is key to justify at some point why the IOMMU MR notifiers
are not usable for that purpose. If I remember correctly, this is due
to the fact that MR notifiers are not active on x86 in that use case,
which is not the case for the ARM dual-stage enablement.

maybe: "Information, different from map/unmap notifications need to be
passed from QEMU vIOMMU device to/from the host IOMMU driver through the
VFIO/IOMMU layer: ..."

>  1) PASID allocation (allow host to intercept in PASID allocation)
>  2) bind stage-1 translation structures to host
>  3) propagate stage-1 cache invalidation to host
>  4) DMA address translation fault (I/O page fault) servicing etc.

> 
> With the above new interactions in QEMU, it requires an abstract layer
> to facilitate the above operations and expose to vIOMMU emulators as an
> explicit way for vIOMMU emulators call into VFIO. This patch introduces
> HostIOMMUContext to stand for hardware IOMMU w/ dual stage DMA address
> translation capability. And introduces HostIOMMUContextClass to provide
> methods for vIOMMU emulators to propagate dual-stage translation related
> requests to host. As a beginning, PASID allocation/free are defined to
> propagate PASID allocation/free requests to host which is helpful for the
> vendors who manage PASID in system-wide. In future, there will be more
> operations like bind_stage1_pgtbl, flush_stage1_cache and etc.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/Makefile.objs                      |  1 +
>  hw/iommu/Makefile.objs                |  1 +
>  hw/iommu/host_iommu_context.c         | 97 +++++++++++++++++++++++++++++++++++
>  include/hw/iommu/host_iommu_context.h | 75 +++++++++++++++++++++++++++
>  4 files changed, 174 insertions(+)
>  create mode 100644 hw/iommu/Makefile.objs
>  create mode 100644 hw/iommu/host_iommu_context.c
>  create mode 100644 include/hw/iommu/host_iommu_context.h
> 
> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> index 660e2b4..cab83fe 100644
> --- a/hw/Makefile.objs
> +++ b/hw/Makefile.objs
> @@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
>  devices-dirs-$(CONFIG_NUBUS) += nubus/
>  devices-dirs-y += semihosting/
>  devices-dirs-y += smbios/
> +devices-dirs-y += iommu/
>  endif
>  
>  common-obj-y += $(devices-dirs-y)
> diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
> new file mode 100644
> index 0000000..e6eed4e
> --- /dev/null
> +++ b/hw/iommu/Makefile.objs
> @@ -0,0 +1 @@
> +obj-y += host_iommu_context.o
> diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
> new file mode 100644
> index 0000000..5fb2223
> --- /dev/null
> +++ b/hw/iommu/host_iommu_context.c
> @@ -0,0 +1,97 @@
> +/*
> + * QEMU abstract of Host IOMMU
> + *
> + * Copyright (C) 2020 Intel Corporation.
> + *
> + * Authors: Liu Yi L <yi.l.liu@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qom/object.h"
> +#include "qapi/visitor.h"
> +#include "hw/iommu/host_iommu_context.h"
> +
> +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
> +                               uint32_t max, uint32_t *pasid)
> +{
> +    HostIOMMUContextClass *hicxc;
> +
> +    if (!iommu_ctx) {
> +        return -EINVAL;
> +    }
> +
> +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> +
> +    if (!hicxc) {
> +        return -EINVAL;
> +    }
> +
> +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> +        !hicxc->pasid_alloc) {
At this point of the reading, I fail to understand why we need the flag.
Why isn't it sufficient to test whether the ops is set?
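e.g. something along these lines would seem enough (just a sketch,
assuming the flag can simply be dropped):

    if (!hicxc->pasid_alloc) {
        return -EINVAL;
    }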
> +        return -EINVAL;
> +    }
> +
> +    return hicxc->pasid_alloc(iommu_ctx, min, max, pasid);
> +}
> +
> +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid)
> +{
> +    HostIOMMUContextClass *hicxc;
> +
> +    if (!iommu_ctx) {
> +        return -EINVAL;
> +    }
> +
> +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> +    if (!hicxc) {
> +        return -EINVAL;
> +    }
> +
> +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> +        !hicxc->pasid_free) {
> +        return -EINVAL;
> +    }
> +
> +    return hicxc->pasid_free(iommu_ctx, pasid);
> +}
> +
> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> +                         const char *mrtypename,
> +                         uint64_t flags)
> +{
> +    HostIOMMUContext *iommu_ctx;
> +
> +    object_initialize(_iommu_ctx, instance_size, mrtypename);
> +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
> +    iommu_ctx->flags = flags;
> +    iommu_ctx->initialized = true;
> +}
> +
> +static const TypeInfo host_iommu_context_info = {
> +    .parent             = TYPE_OBJECT,
> +    .name               = TYPE_HOST_IOMMU_CONTEXT,
> +    .class_size         = sizeof(HostIOMMUContextClass),
> +    .instance_size      = sizeof(HostIOMMUContext),
> +    .abstract           = true,
Can't we use the usual .instance_init and .instance_finalize?
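i.e. something like (sketch only, the handler names are made up):

    .instance_init     = host_iommu_ctx_instance_init,
    .instance_finalize = host_iommu_ctx_instance_finalize,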
> +};
> +
> +static void host_iommu_ctx_register_types(void)
> +{
> +    type_register_static(&host_iommu_context_info);
> +}
> +
> +type_init(host_iommu_ctx_register_types)
> diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
> new file mode 100644
> index 0000000..35c4861
> --- /dev/null
> +++ b/include/hw/iommu/host_iommu_context.h
> @@ -0,0 +1,75 @@
> +/*
> + * QEMU abstraction of Host IOMMU
> + *
> + * Copyright (C) 2020 Intel Corporation.
> + *
> + * Authors: Liu Yi L <yi.l.liu@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_IOMMU_CONTEXT_H
> +#define HW_IOMMU_CONTEXT_H
> +
> +#include "qemu/queue.h"
> +#include "qemu/thread.h"
> +#include "qom/object.h"
> +#include <linux/iommu.h>
> +#ifndef CONFIG_USER_ONLY
> +#include "exec/hwaddr.h"
> +#endif
> +
> +#define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
> +#define HOST_IOMMU_CONTEXT(obj) \
> +        OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
> +#define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
> +        OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
> +                         TYPE_HOST_IOMMU_CONTEXT)
> +
> +typedef struct HostIOMMUContext HostIOMMUContext;
> +
> +typedef struct HostIOMMUContextClass {
> +    /* private */
> +    ObjectClass parent_class;
> +
> +    /* Allocate pasid from HostIOMMUContext (a.k.a. host software) */
Request the host to allocate a PASID?
"from HostIOMMUContext (a.k.a. host software)" is a bit cryptic to me.

Actually, at this stage I do not understand what this HostIOMMUContext
abstracts. Is it an object associated with one guest FL context entry
(attached to one PASID)? Meaning that for plain vIOMMU/VFIO using nested
paging (single PASID) I would use a single such context per IOMMU MR?

I think David also found it difficult to understand the abstraction
behind this object.

> +    int (*pasid_alloc)(HostIOMMUContext *iommu_ctx,
> +                       uint32_t min,
> +                       uint32_t max,
> +                       uint32_t *pasid);
> +    /* Reclaim pasid from HostIOMMUContext (a.k.a. host software) */
> +    int (*pasid_free)(HostIOMMUContext *iommu_ctx,
> +                      uint32_t pasid);
> +} HostIOMMUContextClass;
> +
> +/*
> + * This is an abstraction of host IOMMU with dual-stage capability
> + */
> +struct HostIOMMUContext {
> +    Object parent_obj;
> +#define HOST_IOMMU_PASID_REQUEST (1ULL << 0)
> +    uint64_t flags;
> +    bool initialized;
what's the purpose of the initialized flag?
> +};
> +
> +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
> +                               uint32_t max, uint32_t *pasid);
> +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid);
> +
> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> +                         const char *mrtypename,
> +                         uint64_t flags);
> +void host_iommu_ctx_destroy(HostIOMMUContext *iommu_ctx);
leftover from V1?
> +
> +#endif
> 
Thanks

Eric


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
@ 2020-03-30 17:22     ` Auger Eric
  0 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-30 17:22 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, Jacob Pan, Yi Sun, kvm, mst,
	jun.j.tian, yi.y.sun, pbonzini, hao.wu, david

Yi,

On 3/30/20 6:24 AM, Liu Yi L wrote:
> Currently, many platform vendors provide the capability of dual stage
> DMA address translation in hardware. For example, nested translation
> on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
> and etc. In dual stage DMA address translation, there are two stages
> address translation, stage-1 (a.k.a first-level) and stage-2 (a.k.a
> second-level) translation structures. Stage-1 translation results are
> also subjected to stage-2 translation structures. Take vSVA (Virtual
> Shared Virtual Addressing) as an example, guest IOMMU driver owns
> stage-1 translation structures (covers GVA->GPA translation), and host
> IOMMU driver owns stage-2 translation structures (covers GPA->HPA
> translation). VMM is responsible to bind stage-1 translation structures
> to host, thus hardware could achieve GVA->GPA and then GPA->HPA
> translation. For more background on SVA, refer the below links.
>  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
>  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> 
> In QEMU, vIOMMU emulators expose IOMMUs to VM per their own spec (e.g.
> Intel VT-d spec). Devices are pass-through to guest via device pass-
> through components like VFIO. VFIO is a userspace driver framework
> which exposes host IOMMU programming capability to userspace in a
> secure manner. e.g. IOVA MAP/UNMAP requests. Thus the major connection
> between VFIO and vIOMMU are MAP/UNMAP. However, with the dual stage
> DMA translation support, there are more interactions between vIOMMU and
> VFIO as below:

I think it is key to justify at some point why the IOMMU MR notifiers
are not usable for that purpose. If I remember correctly, this is due
to the fact that MR notifiers are not active on x86 in that use case,
which is not the case for the ARM dual-stage enablement.

maybe: "Information, different from map/unmap notifications need to be
passed from QEMU vIOMMU device to/from the host IOMMU driver through the
VFIO/IOMMU layer: ..."

>  1) PASID allocation (allow host to intercept in PASID allocation)
>  2) bind stage-1 translation structures to host
>  3) propagate stage-1 cache invalidation to host
>  4) DMA address translation fault (I/O page fault) servicing etc.

> 
> With the above new interactions in QEMU, it requires an abstract layer
> to facilitate the above operations and expose to vIOMMU emulators as an
> explicit way for vIOMMU emulators call into VFIO. This patch introduces
> HostIOMMUContext to stand for hardware IOMMU w/ dual stage DMA address
> translation capability. And introduces HostIOMMUContextClass to provide
> methods for vIOMMU emulators to propagate dual-stage translation related
> requests to host. As a beginning, PASID allocation/free are defined to
> propagate PASID allocation/free requests to host which is helpful for the
> vendors who manage PASID in system-wide. In future, there will be more
> operations like bind_stage1_pgtbl, flush_stage1_cache and etc.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/Makefile.objs                      |  1 +
>  hw/iommu/Makefile.objs                |  1 +
>  hw/iommu/host_iommu_context.c         | 97 +++++++++++++++++++++++++++++++++++
>  include/hw/iommu/host_iommu_context.h | 75 +++++++++++++++++++++++++++
>  4 files changed, 174 insertions(+)
>  create mode 100644 hw/iommu/Makefile.objs
>  create mode 100644 hw/iommu/host_iommu_context.c
>  create mode 100644 include/hw/iommu/host_iommu_context.h
> 
> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> index 660e2b4..cab83fe 100644
> --- a/hw/Makefile.objs
> +++ b/hw/Makefile.objs
> @@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
>  devices-dirs-$(CONFIG_NUBUS) += nubus/
>  devices-dirs-y += semihosting/
>  devices-dirs-y += smbios/
> +devices-dirs-y += iommu/
>  endif
>  
>  common-obj-y += $(devices-dirs-y)
> diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
> new file mode 100644
> index 0000000..e6eed4e
> --- /dev/null
> +++ b/hw/iommu/Makefile.objs
> @@ -0,0 +1 @@
> +obj-y += host_iommu_context.o
> diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
> new file mode 100644
> index 0000000..5fb2223
> --- /dev/null
> +++ b/hw/iommu/host_iommu_context.c
> @@ -0,0 +1,97 @@
> +/*
> + * QEMU abstract of Host IOMMU
> + *
> + * Copyright (C) 2020 Intel Corporation.
> + *
> + * Authors: Liu Yi L <yi.l.liu@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qom/object.h"
> +#include "qapi/visitor.h"
> +#include "hw/iommu/host_iommu_context.h"
> +
> +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
> +                               uint32_t max, uint32_t *pasid)
> +{
> +    HostIOMMUContextClass *hicxc;
> +
> +    if (!iommu_ctx) {
> +        return -EINVAL;
> +    }
> +
> +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> +
> +    if (!hicxc) {
> +        return -EINVAL;
> +    }
> +
> +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> +        !hicxc->pasid_alloc) {
At this point of the reading, I fail to understand why we need the flag.
Why isn't it sufficient to test whether the ops is set?
> +        return -EINVAL;
> +    }
> +
> +    return hicxc->pasid_alloc(iommu_ctx, min, max, pasid);
> +}
> +
> +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid)
> +{
> +    HostIOMMUContextClass *hicxc;
> +
> +    if (!iommu_ctx) {
> +        return -EINVAL;
> +    }
> +
> +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> +    if (!hicxc) {
> +        return -EINVAL;
> +    }
> +
> +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> +        !hicxc->pasid_free) {
> +        return -EINVAL;
> +    }
> +
> +    return hicxc->pasid_free(iommu_ctx, pasid);
> +}
> +
> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> +                         const char *mrtypename,
> +                         uint64_t flags)
> +{
> +    HostIOMMUContext *iommu_ctx;
> +
> +    object_initialize(_iommu_ctx, instance_size, mrtypename);
> +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
> +    iommu_ctx->flags = flags;
> +    iommu_ctx->initialized = true;
> +}
> +
> +static const TypeInfo host_iommu_context_info = {
> +    .parent             = TYPE_OBJECT,
> +    .name               = TYPE_HOST_IOMMU_CONTEXT,
> +    .class_size         = sizeof(HostIOMMUContextClass),
> +    .instance_size      = sizeof(HostIOMMUContext),
> +    .abstract           = true,
Can't we use the usual .instance_init and .instance_finalize?
> +};
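i.e. roughly (a minimal sketch of that alternative; the instance_init body
here is only illustrative, not code from this series):

    static void host_iommu_ctx_instance_init(Object *obj)
    {
        HostIOMMUContext *iommu_ctx = HOST_IOMMU_CONTEXT(obj);

        /* per-instance setup done by QOM instead of an explicit
         * host_iommu_ctx_init() helper */
        iommu_ctx->initialized = true;
    }

    static const TypeInfo host_iommu_context_info = {
        .parent         = TYPE_OBJECT,
        .name           = TYPE_HOST_IOMMU_CONTEXT,
        .class_size     = sizeof(HostIOMMUContextClass),
        .instance_size  = sizeof(HostIOMMUContext),
        .instance_init  = host_iommu_ctx_instance_init,
        .abstract       = true,
    };

The per-instance flags would then still have to come from the backend,
e.g. as a QOM property or a separate call after object_initialize().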
> +
> +static void host_iommu_ctx_register_types(void)
> +{
> +    type_register_static(&host_iommu_context_info);
> +}
> +
> +type_init(host_iommu_ctx_register_types)
> diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
> new file mode 100644
> index 0000000..35c4861
> --- /dev/null
> +++ b/include/hw/iommu/host_iommu_context.h
> @@ -0,0 +1,75 @@
> +/*
> + * QEMU abstraction of Host IOMMU
> + *
> + * Copyright (C) 2020 Intel Corporation.
> + *
> + * Authors: Liu Yi L <yi.l.liu@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_IOMMU_CONTEXT_H
> +#define HW_IOMMU_CONTEXT_H
> +
> +#include "qemu/queue.h"
> +#include "qemu/thread.h"
> +#include "qom/object.h"
> +#include <linux/iommu.h>
> +#ifndef CONFIG_USER_ONLY
> +#include "exec/hwaddr.h"
> +#endif
> +
> +#define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
> +#define HOST_IOMMU_CONTEXT(obj) \
> +        OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
> +#define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
> +        OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
> +                         TYPE_HOST_IOMMU_CONTEXT)
> +
> +typedef struct HostIOMMUContext HostIOMMUContext;
> +
> +typedef struct HostIOMMUContextClass {
> +    /* private */
> +    ObjectClass parent_class;
> +
> +    /* Allocate pasid from HostIOMMUContext (a.k.a. host software) */
Request the host to allocate a PASID?
"from HostIOMMUContext (a.k.a. host software)" is a bit cryptic to me.

Actually at this stage I do not understand what this HostIOMMUContext
abstracts. Is it an object associated with one guest FL context entry
(attached to one PASID)? Meaning, for just vIOMMU/VFIO using nested
paging (single PASID), would I use a single such context per IOMMU MR?

I think David also found it difficult to understand the abstraction
behind this object.

> +    int (*pasid_alloc)(HostIOMMUContext *iommu_ctx,
> +                       uint32_t min,
> +                       uint32_t max,
> +                       uint32_t *pasid);
> +    /* Reclaim pasid from HostIOMMUContext (a.k.a. host software) */
> +    int (*pasid_free)(HostIOMMUContext *iommu_ctx,
> +                      uint32_t pasid);
> +} HostIOMMUContextClass;
> +
> +/*
> + * This is an abstraction of host IOMMU with dual-stage capability
> + */
> +struct HostIOMMUContext {
> +    Object parent_obj;
> +#define HOST_IOMMU_PASID_REQUEST (1ULL << 0)
> +    uint64_t flags;
> +    bool initialized;
what's the purpose of the initialized flag?
> +};
> +
> +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
> +                               uint32_t max, uint32_t *pasid);
> +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid);
> +
> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> +                         const char *mrtypename,
> +                         uint64_t flags);
> +void host_iommu_ctx_destroy(HostIOMMUContext *iommu_ctx);
leftover from V1?
> +
> +#endif
> 
Thanks

Eric



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 06/22] hw/pci: introduce pci_device_set/unset_iommu_context()
  2020-03-30  4:24   ` Liu Yi L
@ 2020-03-30 17:30     ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-30 17:30 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe, Jacob Pan, Yi Sun

Yi,
On 3/30/20 6:24 AM, Liu Yi L wrote:
> This patch adds pci_device_set/unset_iommu_context() to set/unset
> host_iommu_context for a given device. New callback is added in
> PCIIOMMUOps. As such, vIOMMU could make use of host IOMMU capability.
> e.g setup nested translation.

I think you need to explain what this is practically supposed to do,
such as: by attaching such a context to a PCI device (for example a VFIO
assigned one?), you tell the host that this PCIe device is protected by a
FL stage controlled by the guest, or something like that - if this is the
correct understanding (?) -
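In other words, I would expect the caller to look roughly like this
(sketch only; the function and variable names are made up, not code from
this series):

    /*
     * Hypothetical caller: a VFIO-assigned PCI device whose container
     * supports nesting hands the host-side context to the vIOMMU that
     * covers it.
     */
    static int assigned_dev_setup_nesting(PCIDevice *pdev,
                                          HostIOMMUContext *iommu_ctx)
    {
        /* returns -ENOENT if the vIOMMU does not implement the callback */
        return pci_device_set_iommu_context(pdev, iommu_ctx);
    }

    /* and on unrealize/hot-unplug:
     *     pci_device_unset_iommu_context(pdev);
     */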
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/pci/pci.c         | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
>  include/hw/pci/pci.h | 10 ++++++++++
>  2 files changed, 54 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index aa9025c..af3c1a1 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2638,7 +2638,8 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>      }
>  }
>  
> -AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> +                              PCIBus **pbus, uint8_t *pdevfn)
>  {
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
> @@ -2683,14 +2684,52 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  
>          iommu_bus = parent_bus;
>      }
> -    if (iommu_bus && iommu_bus->iommu_ops &&
> -                     iommu_bus->iommu_ops->get_address_space) {
> -        return iommu_bus->iommu_ops->get_address_space(bus,
> -                                 iommu_bus->iommu_opaque, devfn);
> +    *pbus = iommu_bus;
> +    *pdevfn = devfn;
> +}
> +
> +AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> +{
> +    PCIBus *bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
> +    if (bus && bus->iommu_ops &&
> +        bus->iommu_ops->get_address_space) {
> +        return bus->iommu_ops->get_address_space(bus,
> +                                bus->iommu_opaque, devfn);
>      }
>      return &address_space_memory;
>  }
>  
> +int pci_device_set_iommu_context(PCIDevice *dev,
> +                                 HostIOMMUContext *iommu_ctx)
> +{
> +    PCIBus *bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
> +    if (bus && bus->iommu_ops &&
> +        bus->iommu_ops->set_iommu_context) {
> +        return bus->iommu_ops->set_iommu_context(bus,
> +                              bus->iommu_opaque, devfn, iommu_ctx);
> +    }
> +    return -ENOENT;
> +}
> +
> +void pci_device_unset_iommu_context(PCIDevice *dev)
> +{
> +    PCIBus *bus;
> +    uint8_t devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
> +    if (bus && bus->iommu_ops &&
> +        bus->iommu_ops->unset_iommu_context) {
> +        bus->iommu_ops->unset_iommu_context(bus,
> +                                 bus->iommu_opaque, devfn);
> +    }
> +}
> +
>  void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
>  {
>      bus->iommu_ops = ops;
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index ffe192d..0ec5680 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -9,6 +9,8 @@
>  
>  #include "hw/pci/pcie.h"
>  
> +#include "hw/iommu/host_iommu_context.h"
> +
>  extern bool pci_available;
>  
>  /* PCI bus */
> @@ -489,9 +491,17 @@ typedef struct PCIIOMMUOps PCIIOMMUOps;
>  struct PCIIOMMUOps {
>      AddressSpace * (*get_address_space)(PCIBus *bus,
>                                  void *opaque, int32_t devfn);
> +    int (*set_iommu_context)(PCIBus *bus, void *opaque,
> +                             int32_t devfn,
> +                             HostIOMMUContext *iommu_ctx);
> +    void (*unset_iommu_context)(PCIBus *bus, void *opaque,
> +                                int32_t devfn);
>  };
>  
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> +int pci_device_set_iommu_context(PCIDevice *dev,
> +                                 HostIOMMUContext *iommu_ctx);
> +void pci_device_unset_iommu_context(PCIDevice *dev);
>  void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
>  
>  static inline void
> 
Thanks

Eric


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 07/22] intel_iommu: add set/unset_iommu_context callback
  2020-03-30  4:24   ` Liu Yi L
@ 2020-03-30 20:23     ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-30 20:23 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe, Jacob Pan, Yi Sun, Richard Henderson,
	Eduardo Habkost

Yi,

On 3/30/20 6:24 AM, Liu Yi L wrote:
> This patch adds set/unset_iommu_context() impelementation in Intel
This patch implements the set/unset_iommu_context() ops for Intel vIOMMU.
> vIOMMU. For Intel platform, pass-through modules (e.g. VFIO) could
> set HostIOMMUContext to Intel vIOMMU emulator.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/i386/intel_iommu.c         | 71 ++++++++++++++++++++++++++++++++++++++++---
>  include/hw/i386/intel_iommu.h | 21 ++++++++++---
>  2 files changed, 83 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 4b22910..fd349c6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3354,23 +3354,33 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
>      },
>  };
>  
> -VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> +/**
> + * Fetch a VTDBus instance for given PCIBus. If no existing instance,
> + * allocate one.
> + */
> +static VTDBus *vtd_find_add_bus(IntelIOMMUState *s, PCIBus *bus)
>  {
>      uintptr_t key = (uintptr_t)bus;
>      VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
> -    VTDAddressSpace *vtd_dev_as;
> -    char name[128];
>  
>      if (!vtd_bus) {
>          uintptr_t *new_key = g_malloc(sizeof(*new_key));
>          *new_key = (uintptr_t)bus;
>          /* No corresponding free() */
> -        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
> -                            PCI_DEVFN_MAX);
> +        vtd_bus = g_malloc0(sizeof(VTDBus));
>          vtd_bus->bus = bus;
>          g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
>      }
> +    return vtd_bus;
> +}
>  
> +VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> +{
> +    VTDBus *vtd_bus;
> +    VTDAddressSpace *vtd_dev_as;
> +    char name[128];
> +
> +    vtd_bus = vtd_find_add_bus(s, bus);
>      vtd_dev_as = vtd_bus->dev_as[devfn];
>  
>      if (!vtd_dev_as) {
> @@ -3436,6 +3446,55 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>      return vtd_dev_as;
>  }
>  
> +static int vtd_dev_set_iommu_context(PCIBus *bus, void *opaque,
> +                                     int devfn,
> +                                     HostIOMMUContext *iommu_ctx)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDBus *vtd_bus;
> +    VTDHostIOMMUContext *vtd_dev_icx;
> +
> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> +
> +    vtd_bus = vtd_find_add_bus(s, bus);
> +
> +    vtd_iommu_lock(s);
> +
> +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
> +
> +    assert(!vtd_dev_icx);
> +
> +    vtd_bus->dev_icx[devfn] = vtd_dev_icx =
> +                    g_malloc0(sizeof(VTDHostIOMMUContext));
> +    vtd_dev_icx->vtd_bus = vtd_bus;
> +    vtd_dev_icx->devfn = (uint8_t)devfn;
> +    vtd_dev_icx->iommu_state = s;
> +    vtd_dev_icx->iommu_ctx = iommu_ctx;
> +
> +    vtd_iommu_unlock(s);
> +
> +    return 0;
> +}
> +
> +static void vtd_dev_unset_iommu_context(PCIBus *bus, void *opaque, int devfn)
> +{
> +    IntelIOMMUState *s = opaque;
> +    VTDBus *vtd_bus;
> +    VTDHostIOMMUContext *vtd_dev_icx;
> +
> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> +
> +    vtd_bus = vtd_find_add_bus(s, bus);
> +
> +    vtd_iommu_lock(s);
> +
> +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
> +    g_free(vtd_dev_icx);
> +    vtd_bus->dev_icx[devfn] = NULL;
> +
> +    vtd_iommu_unlock(s);
> +}
> +
>  static uint64_t get_naturally_aligned_size(uint64_t start,
>                                             uint64_t size, int gaw)
>  {
> @@ -3731,6 +3790,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>  
>  static PCIIOMMUOps vtd_iommu_ops = {
>      .get_address_space = vtd_host_dma_iommu,
> +    .set_iommu_context = vtd_dev_set_iommu_context,
> +    .unset_iommu_context = vtd_dev_unset_iommu_context,
>  };
>  
>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 3870052..b5fefb9 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -64,6 +64,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>  typedef struct VTDPASIDEntry VTDPASIDEntry;
> +typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> @@ -112,10 +113,20 @@ struct VTDAddressSpace {
>      IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
>  };
>  
> +struct VTDHostIOMMUContext {


> +    VTDBus *vtd_bus;
> +    uint8_t devfn;
> +    HostIOMMUContext *iommu_ctx;
I don't get why we don't have standard QOM inheritance instead of this
handle?
VTDHostContext parent_obj;

like IOMMUMemoryRegion <- MemoryRegion <- Object
> +    IntelIOMMUState *iommu_state;
> +};
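i.e. something along these lines (just a sketch to illustrate the
suggestion above; registration details omitted):

    /* derive from HostIOMMUContext instead of holding a pointer to it */
    struct VTDHostIOMMUContext {
        HostIOMMUContext parent_obj;
        VTDBus *vtd_bus;
        uint8_t devfn;
        IntelIOMMUState *iommu_state;
    };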
> +
>  struct VTDBus {
> -    PCIBus* bus;		/* A reference to the bus to provide translation for */
> +    /* A reference to the bus to provide translation for */
> +    PCIBus *bus;
>      /* A table of VTDAddressSpace objects indexed by devfn */
> -    VTDAddressSpace *dev_as[];
> +    VTDAddressSpace *dev_as[PCI_DEVFN_MAX];
> +    /* A table of VTDHostIOMMUContext objects indexed by devfn */
> +    VTDHostIOMMUContext *dev_icx[PCI_DEVFN_MAX];
At this point of the review, it is unclear to me why the context is
associated with a device. Up to now you have not explained why it should
be. If so, why isn't it part of VTDAddressSpace?

Thanks

Eric
>  };
>  
>  struct VTDIOTLBEntry {
> @@ -269,8 +280,10 @@ struct IntelIOMMUState {
>      bool dma_drain;                 /* Whether DMA r/w draining enabled */
>  
>      /*
> -     * Protects IOMMU states in general.  Currently it protects the
> -     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> +     * iommu_lock protects below:
> +     * - per-IOMMU IOTLB caches
> +     * - context entry cache in VTDAddressSpace
> +     * - HostIOMMUContext pointer cached in vIOMMU
>       */
>      QemuMutex iommu_lock;
>  };
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
  2020-03-30 17:22     ` Auger Eric
@ 2020-03-31  4:10       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31  4:10 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Eric,

> From: Auger Eric < eric.auger@redhat.com >
> Sent: Tuesday, March 31, 2020 1:23 AM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
> 
> Yi,
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > Currently, many platform vendors provide the capability of dual stage
> > DMA address translation in hardware. For example, nested translation
> > on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
> > and etc. In dual stage DMA address translation, there are two stages
> > address translation, stage-1 (a.k.a first-level) and stage-2 (a.k.a
> > second-level) translation structures. Stage-1 translation results are
> > also subjected to stage-2 translation structures. Take vSVA (Virtual
> > Shared Virtual Addressing) as an example, guest IOMMU driver owns
> > stage-1 translation structures (covers GVA->GPA translation), and host
> > IOMMU driver owns stage-2 translation structures (covers GPA->HPA
> > translation). VMM is responsible to bind stage-1 translation structures
> > to host, thus hardware could achieve GVA->GPA and then GPA->HPA
> > translation. For more background on SVA, refer the below links.
> >  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
> >  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> > Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> >
> > In QEMU, vIOMMU emulators expose IOMMUs to VM per their own spec (e.g.
> > Intel VT-d spec). Devices are pass-through to guest via device pass-
> > through components like VFIO. VFIO is a userspace driver framework
> > which exposes host IOMMU programming capability to userspace in a
> > secure manner. e.g. IOVA MAP/UNMAP requests. Thus the major connection
> > between VFIO and vIOMMU are MAP/UNMAP. However, with the dual stage
> > DMA translation support, there are more interactions between vIOMMU and
> > VFIO as below:
> 
> I think it is key to justify at some point why the IOMMU MR notifiers
> are not usable for that purpose. If I remember correctly this is due to
> the fact MR notifiers are not active on x86 in that use case, which is
> not the case on ARM dual stage enablement.

yes, it's the major reason. I also listed the former description here.
BTW, I don't think a notifier is suitable, as it is unable to return a
value, right? The PASID allocation in this series actually requires
getting the allocation result back from VFIO, so that's another reason
why a notifier is not appropriate.

  "Qemu has an existing notifier framework based on MemoryRegion, which
  are used for MAP/UNMAP. However, it is not well suited for virt-SVA.
  Reasons are as below:
  - virt-SVA works along with PT = 1
  - if PT = 1 IOMMU MR are disabled so MR notifier are not registered
  - new notifiers do not fit nicely in this framework as they need to be
    registered even if PT = 1
  - need a new framework to attach the new notifiers
  - Additional background can be got from:
    https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html"
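For example, on a guest PASID allocation request the vIOMMU has to hand
the allocated value back to the guest synchronously, roughly (sketch
only; the handler name is made up and error reporting is omitted):

    static void viommu_handle_pasid_alloc(HostIOMMUContext *iommu_ctx,
                                          uint32_t min, uint32_t max)
    {
        uint32_t pasid;

        /* ask the host (via VFIO) for a PASID in [min, max] */
        if (host_iommu_ctx_pasid_alloc(iommu_ctx, min, max, &pasid)) {
            /* report failure in the guest-visible response register */
        } else {
            /* program 'pasid' into the guest-visible response register */
        }
    }

A MemoryRegion notifier has no way to hand 'pasid' back to the caller.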

There is some history behind it. I think the earliest idea to introduce a
new mechanism instead of using the MR notifier for vSVA is in the link below.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05295.html

After that, I posted several versions of a patch series which tried to add
a notifier framework for vSVA based on IOMMUSVAContext.
https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html

After the vSVA notifier framework patchset, we somehow agreed to use
PCIPASIDOps, which sits in PCIDevice. This was proposed in the link below.
https://patchwork.kernel.org/cover/11033657/
However, providing the PASID allocation interface in a per-device manner
was questioned.
  "On Fri, Jul 05, 2019 at 07:01:38PM +0800, Liu Yi L wrote:
  > This patch adds vfio implementation PCIPASIDOps.alloc_pasid/free_pasid().
  > These two functions are used to propagate guest pasid allocation and
  > free requests to host via vfio container ioctl.

  As I said in an earlier comment, I think doing this on the device is
  conceptually incorrect.  I think we need an explcit notion of an SVM
  context (i.e. the namespace in which all the PASIDs live) - which will
  IIUC usually be shared amongst multiple devices.  The create and free
  PASID requests should be on that object."
https://patchwork.kernel.org/patch/11033659/

David's explicit notion of an SVM context inspired me to provide an
explicit way to facilitate the interaction between VFIO and vIOMMU. So I
came up with the SVMContext direction, and finally renamed it to
HostIOMMUContext and placed it in VFIOContainer, as it is supposed to be
per-container.

> maybe: "Information, different from map/unmap notifications need to be
> passed from QEMU vIOMMU device to/from the host IOMMU driver through the
> VFIO/IOMMU layer: ..."

I see. I'll adopt your description. thanks.

> >  1) PASID allocation (allow host to intercept in PASID allocation)
> >  2) bind stage-1 translation structures to host
> >  3) propagate stage-1 cache invalidation to host
> >  4) DMA address translation fault (I/O page fault) servicing etc.
> 
> >
> > With the above new interactions in QEMU, it requires an abstract layer
> > to facilitate the above operations and expose to vIOMMU emulators as an
> > explicit way for vIOMMU emulators call into VFIO. This patch introduces
> > HostIOMMUContext to stand for hardware IOMMU w/ dual stage DMA address
> > translation capability. And introduces HostIOMMUContextClass to provide
> > methods for vIOMMU emulators to propagate dual-stage translation related
> > requests to host. As a beginning, PASID allocation/free are defined to
> > propagate PASID allocation/free requests to host which is helpful for the
> > vendors who manage PASID in system-wide. In future, there will be more
> > operations like bind_stage1_pgtbl, flush_stage1_cache and etc.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/Makefile.objs                      |  1 +
> >  hw/iommu/Makefile.objs                |  1 +
> >  hw/iommu/host_iommu_context.c         | 97
> +++++++++++++++++++++++++++++++++++
> >  include/hw/iommu/host_iommu_context.h | 75 +++++++++++++++++++++++++++
> >  4 files changed, 174 insertions(+)
> >  create mode 100644 hw/iommu/Makefile.objs
> >  create mode 100644 hw/iommu/host_iommu_context.c
> >  create mode 100644 include/hw/iommu/host_iommu_context.h
> >
> > diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> > index 660e2b4..cab83fe 100644
> > --- a/hw/Makefile.objs
> > +++ b/hw/Makefile.objs
> > @@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
> >  devices-dirs-$(CONFIG_NUBUS) += nubus/
> >  devices-dirs-y += semihosting/
> >  devices-dirs-y += smbios/
> > +devices-dirs-y += iommu/
> >  endif
> >
> >  common-obj-y += $(devices-dirs-y)
> > diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
> > new file mode 100644
> > index 0000000..e6eed4e
> > --- /dev/null
> > +++ b/hw/iommu/Makefile.objs
> > @@ -0,0 +1 @@
> > +obj-y += host_iommu_context.o
> > diff --git a/hw/iommu/host_iommu_context.c
> b/hw/iommu/host_iommu_context.c
> > new file mode 100644
> > index 0000000..5fb2223
> > --- /dev/null
> > +++ b/hw/iommu/host_iommu_context.c
> > @@ -0,0 +1,97 @@
> > +/*
> > + * QEMU abstract of Host IOMMU
> > + *
> > + * Copyright (C) 2020 Intel Corporation.
> > + *
> > + * Authors: Liu Yi L <yi.l.liu@intel.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > +
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > +
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qapi/error.h"
> > +#include "qom/object.h"
> > +#include "qapi/visitor.h"
> > +#include "hw/iommu/host_iommu_context.h"
> > +
> > +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
> > +                               uint32_t max, uint32_t *pasid)
> > +{
> > +    HostIOMMUContextClass *hicxc;
> > +
> > +    if (!iommu_ctx) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> > +
> > +    if (!hicxc) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> > +        !hicxc->pasid_alloc) {
> At this point of the reading, I fail to understand why we need the flag.
> Why isn't it sufficient to test whether the ops is set?

I added it for architectures which have no requirement for PASID
alloc/free and only need the other callbacks in the class. I'm not sure
if I'm correct; it looks to be unnecessary for vSMMU, right?
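In code, the difference would look roughly like this (sketch only; the
vfio_* callback names and the capability probing are assumptions, not
code from this series):

    /* class level: the backend implements the hooks unconditionally */
    static void vfio_host_iommu_ctx_class_init(ObjectClass *klass, void *data)
    {
        HostIOMMUContextClass *hicxc = (HostIOMMUContextClass *)klass;

        hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
        hicxc->pasid_free  = vfio_host_iommu_ctx_pasid_free;
    }

    /* instance level: only advertise what this host actually supports
     * (instance_size/typename/host_has_pasid_req are placeholders) */
    host_iommu_ctx_init(iommu_ctx, instance_size, typename,
                        host_has_pasid_req ? HOST_IOMMU_PASID_REQUEST : 0);

So the callback can be present in the class while an individual context
does not expose the capability.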

> > +        return -EINVAL;
> > +    }
> > +
> > +    return hicxc->pasid_alloc(iommu_ctx, min, max, pasid);
> > +}
> > +
> > +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t
> pasid)
> > +{
> > +    HostIOMMUContextClass *hicxc;
> > +
> > +    if (!iommu_ctx) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> > +    if (!hicxc) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> > +        !hicxc->pasid_free) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    return hicxc->pasid_free(iommu_ctx, pasid);
> > +}
> > +
> > +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> > +                         const char *mrtypename,
> > +                         uint64_t flags)
> > +{
> > +    HostIOMMUContext *iommu_ctx;
> > +
> > +    object_initialize(_iommu_ctx, instance_size, mrtypename);
> > +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
> > +    iommu_ctx->flags = flags;
> > +    iommu_ctx->initialized = true;
> > +}
> > +
> > +static const TypeInfo host_iommu_context_info = {
> > +    .parent             = TYPE_OBJECT,
> > +    .name               = TYPE_HOST_IOMMU_CONTEXT,
> > +    .class_size         = sizeof(HostIOMMUContextClass),
> > +    .instance_size      = sizeof(HostIOMMUContext),
> > +    .abstract           = true,
> Can't we use the usual .instance_init and .instance_finalize?
> > +};
> > +
> > +static void host_iommu_ctx_register_types(void)
> > +{
> > +    type_register_static(&host_iommu_context_info);
> > +}
> > +
> > +type_init(host_iommu_ctx_register_types)
> > diff --git a/include/hw/iommu/host_iommu_context.h
> b/include/hw/iommu/host_iommu_context.h
> > new file mode 100644
> > index 0000000..35c4861
> > --- /dev/null
> > +++ b/include/hw/iommu/host_iommu_context.h
> > @@ -0,0 +1,75 @@
> > +/*
> > + * QEMU abstraction of Host IOMMU
> > + *
> > + * Copyright (C) 2020 Intel Corporation.
> > + *
> > + * Authors: Liu Yi L <yi.l.liu@intel.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > +
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > +
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef HW_IOMMU_CONTEXT_H
> > +#define HW_IOMMU_CONTEXT_H
> > +
> > +#include "qemu/queue.h"
> > +#include "qemu/thread.h"
> > +#include "qom/object.h"
> > +#include <linux/iommu.h>
> > +#ifndef CONFIG_USER_ONLY
> > +#include "exec/hwaddr.h"
> > +#endif
> > +
> > +#define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
> > +#define HOST_IOMMU_CONTEXT(obj) \
> > +        OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
> > +#define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
> > +        OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
> > +                         TYPE_HOST_IOMMU_CONTEXT)
> > +
> > +typedef struct HostIOMMUContext HostIOMMUContext;
> > +
> > +typedef struct HostIOMMUContextClass {
> > +    /* private */
> > +    ObjectClass parent_class;
> > +
> > +    /* Allocate pasid from HostIOMMUContext (a.k.a. host software) */
> Request the host to allocate a PASID?
> "from HostIOMMUContext (a.k.a. host software)" is a bit cryptic to me.

Oh, I meant requesting PASID allocation from the host. Sorry for the confusion.

> Actually at this stage I do not understand what this HostIOMMUContext
> abstracts. Is it an object associated to one guest FL context entry
> (attached to one PASID). Meaning for just vIOMMU/VFIO using nested
> paging (single PASID) I would use a single of such context per IOMMU MR?

No, it's not for a single guest FL context. It's an abstraction of the
capability provided by a nested-translation capable host backend. In
VFIO, that is VFIO_IOMMU_TYPE1_NESTING.

Here is the notion behind introducing HostIOMMUContext. The existing VFIO
is a secure framework which provides userspace with the capability to
program mappings into a single isolation domain on the host side. Compared
with a legacy host IOMMU, a nested-translation capable IOMMU provides more:
it gives userspace the capability to program a FL/stage-1 page table to the
host side. This is also called bind_gpasid in this series. VFIO exposes the
nesting capability to userspace with the VFIO_IOMMU_TYPE1_NESTING type, and
along with that type, PASID alloc/free and iommu_cache_inv are exposed as
the capabilities provided by VFIO_IOMMU_TYPE1_NESTING. Also, if we wanted,
we could actually migrate the MAP/UNMAP notifiers to be hooks in
HostIOMMUContext. Then we would have a unified abstraction for the
capabilities provided by the host.
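To make it more concrete, the idea is that the class grows into the full
capability set of such a backend, roughly along these lines (sketch only;
the future hook signatures and uapi argument types are my assumptions, and
only the pasid_alloc/pasid_free hooks exist in this patch):

    typedef struct HostIOMMUContextClass {
        ObjectClass parent_class;

        int (*pasid_alloc)(HostIOMMUContext *iommu_ctx, uint32_t min,
                           uint32_t max, uint32_t *pasid);
        int (*pasid_free)(HostIOMMUContext *iommu_ctx, uint32_t pasid);
        /* possible future hooks, all backed by one nesting-capable host: */
        int (*bind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
                                 struct iommu_gpasid_bind_data *bind);
        int (*flush_stage1_cache)(HostIOMMUContext *iommu_ctx,
                                  struct iommu_cache_invalidate_info *inv);
    } HostIOMMUContextClass;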

> I think David also felt difficult to understand the abstraction behind
> this object.
> 
> > +    int (*pasid_alloc)(HostIOMMUContext *iommu_ctx,
> > +                       uint32_t min,
> > +                       uint32_t max,
> > +                       uint32_t *pasid);
> > +    /* Reclaim pasid from HostIOMMUContext (a.k.a. host software) */
> > +    int (*pasid_free)(HostIOMMUContext *iommu_ctx,
> > +                      uint32_t pasid);
> > +} HostIOMMUContextClass;
> > +
> > +/*
> > + * This is an abstraction of host IOMMU with dual-stage capability
> > + */
> > +struct HostIOMMUContext {
> > +    Object parent_obj;
> > +#define HOST_IOMMU_PASID_REQUEST (1ULL << 0)
> > +    uint64_t flags;
> > +    bool initialized;
> what's the purpose of the initialized flag?

It's for checking the availability of the host's nesting capability in
vfio/pci. In this series, HostIOMMUContext is initialized in vfio/common,
and we need a way to tell vfio/pci that it is available.
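e.g. roughly (sketch only; the container/vdev field names are assumptions):

    /* in vfio/pci, before handing the context to the vIOMMU */
    if (container->iommu_ctx.initialized) {
        pci_device_set_iommu_context(&vdev->pdev, &container->iommu_ctx);
    }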

> > +};
> > +
> > +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
> > +                               uint32_t max, uint32_t *pasid);
> > +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t
> pasid);
> > +
> > +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> > +                         const char *mrtypename,
> > +                         uint64_t flags);
> > +void host_iommu_ctx_destroy(HostIOMMUContext *iommu_ctx);
> leftover from V1?

right, thanks for catching it.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
@ 2020-03-31  4:10       ` Liu, Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31  4:10 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	 Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

Hi Eric,

> From: Auger Eric < eric.auger@redhat.com >
> Sent: Tuesday, March 31, 2020 1:23 AM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
> 
> Yi,
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > Currently, many platform vendors provide the capability of dual stage
> > DMA address translation in hardware. For example, nested translation
> > on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
> > and etc. In dual stage DMA address translation, there are two stages
> > address translation, stage-1 (a.k.a first-level) and stage-2 (a.k.a
> > second-level) translation structures. Stage-1 translation results are
> > also subjected to stage-2 translation structures. Take vSVA (Virtual
> > Shared Virtual Addressing) as an example, guest IOMMU driver owns
> > stage-1 translation structures (covers GVA->GPA translation), and host
> > IOMMU driver owns stage-2 translation structures (covers GPA->HPA
> > translation). VMM is responsible to bind stage-1 translation structures
> > to host, thus hardware could achieve GVA->GPA and then GPA->HPA
> > translation. For more background on SVA, refer the below links.
> >  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
> >  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> > Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> >
> > In QEMU, vIOMMU emulators expose IOMMUs to VM per their own spec (e.g.
> > Intel VT-d spec). Devices are pass-through to guest via device pass-
> > through components like VFIO. VFIO is a userspace driver framework
> > which exposes host IOMMU programming capability to userspace in a
> > secure manner. e.g. IOVA MAP/UNMAP requests. Thus the major connection
> > between VFIO and vIOMMU are MAP/UNMAP. However, with the dual stage
> > DMA translation support, there are more interactions between vIOMMU and
> > VFIO as below:
> 
> I think it is key to justify at some point why the IOMMU MR notifiers
> are not usable for that purpose. If I remember correctly this is due to
> the fact MR notifiers are not active on x86 in that use case, which is
> not the case on ARM dual stage enablement.

yes, it's the major reason. I also listed the former description here.
BTW, I don't think a notifier is suitable, as it is unable to return a
value, right? The PASID allocation in this series actually requires
getting the allocation result back from VFIO, so that's another reason
why a notifier is not appropriate.

  "Qemu has an existing notifier framework based on MemoryRegion, which
  are used for MAP/UNMAP. However, it is not well suited for virt-SVA.
  Reasons are as below:
  - virt-SVA works along with PT = 1
  - if PT = 1 IOMMU MR are disabled so MR notifier are not registered
  - new notifiers do not fit nicely in this framework as they need to be
    registered even if PT = 1
  - need a new framework to attach the new notifiers
  - Additional background can be got from:
    https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html"

There is some history behind it. I think the earliest idea to introduce a
new mechanism instead of using the MR notifier for vSVA is in the link below.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05295.html

After that, I posted several versions of a patch series which tried to add
a notifier framework for vSVA based on IOMMUSVAContext.
https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html

After the vSVA notifier framework patchset, we somehow agreed to use
PCIPASIDOps, which sits in PCIDevice. This was proposed in the link below.
https://patchwork.kernel.org/cover/11033657/
However, providing the PASID allocation interface in a per-device manner
was questioned.
  "On Fri, Jul 05, 2019 at 07:01:38PM +0800, Liu Yi L wrote:
  > This patch adds vfio implementation PCIPASIDOps.alloc_pasid/free_pasid().
  > These two functions are used to propagate guest pasid allocation and
  > free requests to host via vfio container ioctl.

  As I said in an earlier comment, I think doing this on the device is
  conceptually incorrect.  I think we need an explcit notion of an SVM
  context (i.e. the namespace in which all the PASIDs live) - which will
  IIUC usually be shared amongst multiple devices.  The create and free
  PASID requests should be on that object."
https://patchwork.kernel.org/patch/11033659/

David's explicit notion of an SVM context inspired me to provide an
explicit way to facilitate the interaction between VFIO and vIOMMU. So I
came up with the SVMContext direction, and finally renamed it to
HostIOMMUContext and placed it in VFIOContainer, as it is supposed to be
per-container.

> maybe: "Information, different from map/unmap notifications need to be
> passed from QEMU vIOMMU device to/from the host IOMMU driver through the
> VFIO/IOMMU layer: ..."

I see. I'll adopt your description. thanks.

> >  1) PASID allocation (allow host to intercept in PASID allocation)
> >  2) bind stage-1 translation structures to host
> >  3) propagate stage-1 cache invalidation to host
> >  4) DMA address translation fault (I/O page fault) servicing etc.
> 
> >
> > With the above new interactions in QEMU, it requires an abstract layer
> > to facilitate the above operations and expose to vIOMMU emulators as an
> > explicit way for vIOMMU emulators call into VFIO. This patch introduces
> > HostIOMMUContext to stand for hardware IOMMU w/ dual stage DMA address
> > translation capability. And introduces HostIOMMUContextClass to provide
> > methods for vIOMMU emulators to propagate dual-stage translation related
> > requests to host. As a beginning, PASID allocation/free are defined to
> > propagate PASID allocation/free requests to host which is helpful for the
> > vendors who manage PASID in system-wide. In future, there will be more
> > operations like bind_stage1_pgtbl, flush_stage1_cache and etc.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/Makefile.objs                      |  1 +
> >  hw/iommu/Makefile.objs                |  1 +
> >  hw/iommu/host_iommu_context.c         | 97
> +++++++++++++++++++++++++++++++++++
> >  include/hw/iommu/host_iommu_context.h | 75 +++++++++++++++++++++++++++
> >  4 files changed, 174 insertions(+)
> >  create mode 100644 hw/iommu/Makefile.objs
> >  create mode 100644 hw/iommu/host_iommu_context.c
> >  create mode 100644 include/hw/iommu/host_iommu_context.h
> >
> > diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> > index 660e2b4..cab83fe 100644
> > --- a/hw/Makefile.objs
> > +++ b/hw/Makefile.objs
> > @@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
> >  devices-dirs-$(CONFIG_NUBUS) += nubus/
> >  devices-dirs-y += semihosting/
> >  devices-dirs-y += smbios/
> > +devices-dirs-y += iommu/
> >  endif
> >
> >  common-obj-y += $(devices-dirs-y)
> > diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
> > new file mode 100644
> > index 0000000..e6eed4e
> > --- /dev/null
> > +++ b/hw/iommu/Makefile.objs
> > @@ -0,0 +1 @@
> > +obj-y += host_iommu_context.o
> > diff --git a/hw/iommu/host_iommu_context.c
> b/hw/iommu/host_iommu_context.c
> > new file mode 100644
> > index 0000000..5fb2223
> > --- /dev/null
> > +++ b/hw/iommu/host_iommu_context.c
> > @@ -0,0 +1,97 @@
> > +/*
> > + * QEMU abstract of Host IOMMU
> > + *
> > + * Copyright (C) 2020 Intel Corporation.
> > + *
> > + * Authors: Liu Yi L <yi.l.liu@intel.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > +
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > +
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qapi/error.h"
> > +#include "qom/object.h"
> > +#include "qapi/visitor.h"
> > +#include "hw/iommu/host_iommu_context.h"
> > +
> > +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
> > +                               uint32_t max, uint32_t *pasid)
> > +{
> > +    HostIOMMUContextClass *hicxc;
> > +
> > +    if (!iommu_ctx) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> > +
> > +    if (!hicxc) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> > +        !hicxc->pasid_alloc) {
> At this point of the reading, I fail to understand why we need the flag.
> Why isn't it sufficient to test whether the ops is set?

I added it for architectures which have no requirement for pasid
alloc/free and only need the other callbacks in the class. I'm not sure
if I'm correct, but it looks unnecessary for vSMMU, right?

> > +        return -EINVAL;
> > +    }
> > +
> > +    return hicxc->pasid_alloc(iommu_ctx, min, max, pasid);
> > +}
> > +
> > +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t
> pasid)
> > +{
> > +    HostIOMMUContextClass *hicxc;
> > +
> > +    if (!iommu_ctx) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> > +    if (!hicxc) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> > +        !hicxc->pasid_free) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    return hicxc->pasid_free(iommu_ctx, pasid);
> > +}
> > +
> > +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> > +                         const char *mrtypename,
> > +                         uint64_t flags)
> > +{
> > +    HostIOMMUContext *iommu_ctx;
> > +
> > +    object_initialize(_iommu_ctx, instance_size, mrtypename);
> > +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
> > +    iommu_ctx->flags = flags;
> > +    iommu_ctx->initialized = true;
> > +}
> > +
> > +static const TypeInfo host_iommu_context_info = {
> > +    .parent             = TYPE_OBJECT,
> > +    .name               = TYPE_HOST_IOMMU_CONTEXT,
> > +    .class_size         = sizeof(HostIOMMUContextClass),
> > +    .instance_size      = sizeof(HostIOMMUContext),
> > +    .abstract           = true,
> Can't we use the usual .instance_init and .instance_finalize?
> > +};
> > +
> > +static void host_iommu_ctx_register_types(void)
> > +{
> > +    type_register_static(&host_iommu_context_info);
> > +}
> > +
> > +type_init(host_iommu_ctx_register_types)
> > diff --git a/include/hw/iommu/host_iommu_context.h
> b/include/hw/iommu/host_iommu_context.h
> > new file mode 100644
> > index 0000000..35c4861
> > --- /dev/null
> > +++ b/include/hw/iommu/host_iommu_context.h
> > @@ -0,0 +1,75 @@
> > +/*
> > + * QEMU abstraction of Host IOMMU
> > + *
> > + * Copyright (C) 2020 Intel Corporation.
> > + *
> > + * Authors: Liu Yi L <yi.l.liu@intel.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > +
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > +
> > + * You should have received a copy of the GNU General Public License along
> > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef HW_IOMMU_CONTEXT_H
> > +#define HW_IOMMU_CONTEXT_H
> > +
> > +#include "qemu/queue.h"
> > +#include "qemu/thread.h"
> > +#include "qom/object.h"
> > +#include <linux/iommu.h>
> > +#ifndef CONFIG_USER_ONLY
> > +#include "exec/hwaddr.h"
> > +#endif
> > +
> > +#define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
> > +#define HOST_IOMMU_CONTEXT(obj) \
> > +        OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
> > +#define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
> > +        OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
> > +                         TYPE_HOST_IOMMU_CONTEXT)
> > +
> > +typedef struct HostIOMMUContext HostIOMMUContext;
> > +
> > +typedef struct HostIOMMUContextClass {
> > +    /* private */
> > +    ObjectClass parent_class;
> > +
> > +    /* Allocate pasid from HostIOMMUContext (a.k.a. host software) */
> Request the host to allocate a PASID?
> "from HostIOMMUContext (a.k.a. host software)" is a bit cryptic to me.

oh, I mean to request pasid allocation from host.. sorry for the confusion.

> Actually at this stage I do not understand what this HostIOMMUContext
> abstracts. Is it an object associated to one guest FL context entry
> (attached to one PASID). Meaning for just vIOMMU/VFIO using nested
> paging (single PASID) I would use a single of such context per IOMMU MR?

No, it's not for a single guest FL context. It's an abstraction of the
capability provided by a nested-translation capable host backend. In
vfio, that is VFIO_TYPE1_NESTING_IOMMU.

Here is the notion behind introducing the HostIOMMUContext. Existing
vfio is a secure framework which provides userspace the capability to
program mappings into a single isolation domain on the host side.
Compared with the legacy host IOMMU, a nested-translation capable IOMMU
provides more: it gives userspace the capability to program a FL/stage-1
page table to the host side. This is also called bind_gpasid in this
series. VFIO exposes the nesting capability to userspace with the
VFIO_TYPE1_NESTING_IOMMU type. And along with that type, pasid alloc/
free and iommu_cache_inv are exposed as capabilities provided by
VFIO_TYPE1_NESTING_IOMMU. Also, if we want, we could actually migrate
the MAP/UNMAP notifiers to be hooks in HostIOMMUContext. Then we can
have a unified abstraction for the capabilities provided by the host.
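
Just to illustrate where this is going (a sketch only, not part of this
patch; the names below come from the commit message and the exact
signatures are placeholders), the class would later grow hooks along the
lines of:

    typedef struct HostIOMMUContextClass {
        ObjectClass parent_class;

        /* request the host to allocate a PASID within [min, max] */
        int (*pasid_alloc)(HostIOMMUContext *iommu_ctx,
                           uint32_t min, uint32_t max, uint32_t *pasid);
        /* return a previously allocated PASID to the host */
        int (*pasid_free)(HostIOMMUContext *iommu_ctx, uint32_t pasid);

        /* future: bind a guest stage-1 page table to host (placeholder) */
        int (*bind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
                                 void *bind_data);
        /* future: propagate stage-1 cache invalidation (placeholder) */
        int (*flush_stage1_cache)(HostIOMMUContext *iommu_ctx,
                                  void *inv_data);
    } HostIOMMUContextClass;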

> I think David also felt difficult to understand the abstraction behind
> this object.
> 
> > +    int (*pasid_alloc)(HostIOMMUContext *iommu_ctx,
> > +                       uint32_t min,
> > +                       uint32_t max,
> > +                       uint32_t *pasid);
> > +    /* Reclaim pasid from HostIOMMUContext (a.k.a. host software) */
> > +    int (*pasid_free)(HostIOMMUContext *iommu_ctx,
> > +                      uint32_t pasid);
> > +} HostIOMMUContextClass;
> > +
> > +/*
> > + * This is an abstraction of host IOMMU with dual-stage capability
> > + */
> > +struct HostIOMMUContext {
> > +    Object parent_obj;
> > +#define HOST_IOMMU_PASID_REQUEST (1ULL << 0)
> > +    uint64_t flags;
> > +    bool initialized;
> what's the purpose of the initialized flag?

It's for checking the availability of the host's nesting capability in
vfio/pci. In this series, HostIOMMUContext is initialized in vfio/common
and needs a way to tell vfio/pci that it is available.
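
For example, something along these lines in vfio/pci (a sketch only, the
exact call site is not in this patch):

    /* sketch: vfio/pci checking whether the container offers nesting */
    VFIOContainer *container = vdev->vbasedev.group->container;

    if (!container->iommu_ctx.initialized) {
        /* host IOMMU has no nesting capability; skip vSVA setup */
        return;
    }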

> > +};
> > +
> > +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
> > +                               uint32_t max, uint32_t *pasid);
> > +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t
> pasid);
> > +
> > +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> > +                         const char *mrtypename,
> > +                         uint64_t flags);
> > +void host_iommu_ctx_destroy(HostIOMMUContext *iommu_ctx);
> leftover from V1?

right, thanks for catching it.

Regards,
Yi Liu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 03/22] vfio: check VFIO_TYPE1_NESTING_IOMMU support
  2020-03-30  9:36     ` Auger Eric
@ 2020-03-31  6:08       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31  6:08 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Monday, March 30, 2020 5:36 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 03/22] vfio: check VFIO_TYPE1_NESTING_IOMMU support
> 
> Yi,
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > VFIO needs to check VFIO_TYPE1_NESTING_IOMMU support with Kernel before
> > further using it. e.g. requires to check IOMMU UAPI version.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > ---
> >  hw/vfio/common.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 0b3593b..c276732 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1157,12 +1157,21 @@ static void
> vfio_put_address_space(VFIOAddressSpace *space)
> >  static int vfio_get_iommu_type(VFIOContainer *container,
> >                                 Error **errp)
> >  {
> > -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> > +    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
> > +                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> >                            VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
> > -    int i;
> > +    int i, version;
> >
> >      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
> >          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> > +            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU) {
> > +                version = ioctl(container->fd, VFIO_CHECK_EXTENSION,
> > +                                VFIO_NESTING_IOMMU_UAPI);
> > +                if (version < IOMMU_UAPI_VERSION) {
> > +                    info_report("IOMMU UAPI incompatible for nesting");
> > +                    continue;
> > +                }
> > +            }
> This means that by default VFIO_TYPE1_NESTING_IOMMU would be chosen. I
> don't think this is what we want. On ARM this would mean that for a
> standard VFIO assignment without vIOMMU, SL will be used instead of FL.
> This may not be harmless.
> 
> For instance, in "[RFC v6 09/24] vfio: Force nested if iommu requires
> it", I use nested only if I detect we have a vSMMU. Otherwise I keep the
> legacy VFIO_TYPE1v2_IOMMU.
> 
Good point. I also replied in your patch.
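
Just thinking out loud, one option could be to keep VFIO_TYPE1v2_IOMMU as
the default and only try the nesting type when the vIOMMU asks for it,
roughly like below (a sketch only; "want_nested" is a placeholder for
however we end up detecting that):

    if (want_nested &&
        ioctl(container->fd, VFIO_CHECK_EXTENSION,
              VFIO_TYPE1_NESTING_IOMMU)) {
        return VFIO_TYPE1_NESTING_IOMMU;
    }
    /* otherwise fall through to the existing loop over iommu_types[] */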

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-03-30 14:46     ` Peter Xu
@ 2020-03-31  6:53       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31  6:53 UTC (permalink / raw)
  To: Peter Xu, Auger Eric
  Cc: qemu-devel, alex.williamson, pbonzini, mst, david, Tian, Kevin,
	Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe

Hi Eric,

> From: Peter Xu <peterx@redhat.com>
> Sent: Monday, March 30, 2020 10:47 PM
> To: Auger Eric <eric.auger@redhat.com>
> Subject: Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to
> VMs
> 
> On Mon, Mar 30, 2020 at 12:36:23PM +0200, Auger Eric wrote:
> > I think in general, as long as the kernel dependencies are not
> > resolved, the QEMU series is supposed to stay in RFC state.
> 
> Yeah I agree. I think the subject is not extremely important, but we definitely should
> wait for the kernel part to be ready before merging the series.
> 
> Side note: I offered quite a few r-bs for the series (and I still plan to move on
> reading it this week since there's a new version, and try to offer more r-bs when I
> still have some context in my brain-cache), however they're mostly only for myself
> to avoid re-reading the whole series again in the future especially because it's
> huge... :)

Agreed. I'll rename the next version to RFCv6 then. BTW, although there
is a dependency on the kernel side, I think we could still reach
agreement on the interaction mechanism between vfio and vIOMMU within
QEMU. Also, for the VT-d specific changes (e.g. the pasid cache
invalidation patches and the pasid-based-iotlb invalidations), we can
actually get them ready as they have no dependency on kernel-side
changes. Please help. :-)

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
  2020-03-31  4:10       ` Liu, Yi L
@ 2020-03-31  7:47         ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-31  7:47 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Yi,

On 3/31/20 6:10 AM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric < eric.auger@redhat.com >
>> Sent: Tuesday, March 31, 2020 1:23 AM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
>>
>> Yi,
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> Currently, many platform vendors provide the capability of dual stage
>>> DMA address translation in hardware. For example, nested translation
>>> on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
>>> and etc. In dual stage DMA address translation, there are two stages
>>> address translation, stage-1 (a.k.a first-level) and stage-2 (a.k.a
>>> second-level) translation structures. Stage-1 translation results are
>>> also subjected to stage-2 translation structures. Take vSVA (Virtual
>>> Shared Virtual Addressing) as an example, guest IOMMU driver owns
>>> stage-1 translation structures (covers GVA->GPA translation), and host
>>> IOMMU driver owns stage-2 translation structures (covers GPA->HPA
>>> translation). VMM is responsible to bind stage-1 translation structures
>>> to host, thus hardware could achieve GVA->GPA and then GPA->HPA
>>> translation. For more background on SVA, refer the below links.
>>>  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
>>>  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
>>> Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
>>>
>>> In QEMU, vIOMMU emulators expose IOMMUs to VM per their own spec (e.g.
>>> Intel VT-d spec). Devices are pass-through to guest via device pass-
>>> through components like VFIO. VFIO is a userspace driver framework
>>> which exposes host IOMMU programming capability to userspace in a
>>> secure manner. e.g. IOVA MAP/UNMAP requests. Thus the major connection
>>> between VFIO and vIOMMU are MAP/UNMAP. However, with the dual stage
>>> DMA translation support, there are more interactions between vIOMMU and
>>> VFIO as below:
>>
>> I think it is key to justify at some point why the IOMMU MR notifiers
>> are not usable for that purpose. If I remember correctly this is due to
>> the fact MR notifiers are not active on x86 in that use xase, which is
>> not the case on ARM dual stage enablement.
> 
> yes, it's the major reason. Also I listed the former description here.
> BTW. I don't think notifier is suitable as it is unable to return value.
> right? The pasid alloc in this series actually requires to get the alloc
> result from vfio. So it's also a reason why notifier is not proper.
> 
>   "Qemu has an existing notifier framework based on MemoryRegion, which
>   are used for MAP/UNMAP. However, it is not well suited for virt-SVA.
>   Reasons are as below:
>   - virt-SVA works along with PT = 1
>   - if PT = 1 IOMMU MR are disabled so MR notifier are not registered
>   - new notifiers do not fit nicely in this framework as they need to be
>     registered even if PT = 1
>   - need a new framework to attach the new notifiers
>   - Additional background can be got from:
>     https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html"
> 
> And there is a history on it. I think the earliest idea to introduce a
> new mechanism instead of using MR notifier for vSVA is from below link.
> https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05295.html
> 
> And then, I have several versions patch series which try to add a notifier
> framework for vSVA based on IOMMUSVAContext.
> https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html
> 
> After the vSVA notifier framework patchset, then we somehow agreed to
> use PCIPASIDOps which sits in PCIDevice. This is proposed in below link.
> https://patchwork.kernel.org/cover/11033657/ 
> However, providing the PASID allocation interface in a per-device manner
> was questioned.
>   "On Fri, Jul 05, 2019 at 07:01:38PM +0800, Liu Yi L wrote:
>   > This patch adds vfio implementation PCIPASIDOps.alloc_pasid/free_pasid().
>   > These two functions are used to propagate guest pasid allocation and
>   > free requests to host via vfio container ioctl.
> 
>   As I said in an earlier comment, I think doing this on the device is
>   conceptually incorrect.  I think we need an explicit notion of an SVM
>   context (i.e. the namespace in which all the PASIDs live) - which will
>   IIUC usually be shared amongst multiple devices.  The create and free
>   PASID requests should be on that object."
> https://patchwork.kernel.org/patch/11033659/
> 
> David's suggestion of an explicit notion of an SVM context inspired me to
> add an explicit way to facilitate the interaction between VFIO and the
> vIOMMU. So I came up with the SVMContext direction, finally renamed it
> HostIOMMUContext, and placed it in VFIOContainer since it is supposed to
> be per-container.

Thank you for summarizing the whole history. To make things clear, I am
not putting this last approach into question; I just meant the commit
message should justify why this is needed and why the existing
IOMMU MR notifier approach cannot be used.
> 
>> maybe: "Information, different from map/unmap notifications need to be
>> passed from QEMU vIOMMU device to/from the host IOMMU driver through the
>> VFIO/IOMMU layer: ..."
> 
> I see. I'll adopt your description. thanks.
> 
>>>  1) PASID allocation (allow host to intercept in PASID allocation)
>>>  2) bind stage-1 translation structures to host
>>>  3) propagate stage-1 cache invalidation to host
>>>  4) DMA address translation fault (I/O page fault) servicing etc.
>>
>>>
>>> With the above new interactions in QEMU, it requires an abstract layer
>>> to facilitate the above operations and expose to vIOMMU emulators as an
>>> explicit way for vIOMMU emulators call into VFIO. This patch introduces
>>> HostIOMMUContext to stand for hardware IOMMU w/ dual stage DMA address
>>> translation capability. And introduces HostIOMMUContextClass to provide
>>> methods for vIOMMU emulators to propagate dual-stage translation related
>>> requests to host. As a beginning, PASID allocation/free are defined to
>>> propagate PASID allocation/free requests to host which is helpful for the
>>> vendors who manage PASID in system-wide. In future, there will be more
>>> operations like bind_stage1_pgtbl, flush_stage1_cache and etc.
>>>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Eric Auger <eric.auger@redhat.com>
>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  hw/Makefile.objs                      |  1 +
>>>  hw/iommu/Makefile.objs                |  1 +
>>>  hw/iommu/host_iommu_context.c         | 97
>> +++++++++++++++++++++++++++++++++++
>>>  include/hw/iommu/host_iommu_context.h | 75 +++++++++++++++++++++++++++
>>>  4 files changed, 174 insertions(+)
>>>  create mode 100644 hw/iommu/Makefile.objs
>>>  create mode 100644 hw/iommu/host_iommu_context.c
>>>  create mode 100644 include/hw/iommu/host_iommu_context.h
>>>
>>> diff --git a/hw/Makefile.objs b/hw/Makefile.objs
>>> index 660e2b4..cab83fe 100644
>>> --- a/hw/Makefile.objs
>>> +++ b/hw/Makefile.objs
>>> @@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
>>>  devices-dirs-$(CONFIG_NUBUS) += nubus/
>>>  devices-dirs-y += semihosting/
>>>  devices-dirs-y += smbios/
>>> +devices-dirs-y += iommu/
>>>  endif
>>>
>>>  common-obj-y += $(devices-dirs-y)
>>> diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
>>> new file mode 100644
>>> index 0000000..e6eed4e
>>> --- /dev/null
>>> +++ b/hw/iommu/Makefile.objs
>>> @@ -0,0 +1 @@
>>> +obj-y += host_iommu_context.o
>>> diff --git a/hw/iommu/host_iommu_context.c
>> b/hw/iommu/host_iommu_context.c
>>> new file mode 100644
>>> index 0000000..5fb2223
>>> --- /dev/null
>>> +++ b/hw/iommu/host_iommu_context.c
>>> @@ -0,0 +1,97 @@
>>> +/*
>>> + * QEMU abstract of Host IOMMU
>>> + *
>>> + * Copyright (C) 2020 Intel Corporation.
>>> + *
>>> + * Authors: Liu Yi L <yi.l.liu@intel.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> +
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> +
>>> + * You should have received a copy of the GNU General Public License along
>>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qapi/error.h"
>>> +#include "qom/object.h"
>>> +#include "qapi/visitor.h"
>>> +#include "hw/iommu/host_iommu_context.h"
>>> +
>>> +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
>>> +                               uint32_t max, uint32_t *pasid)
>>> +{
>>> +    HostIOMMUContextClass *hicxc;
>>> +
>>> +    if (!iommu_ctx) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
>>> +
>>> +    if (!hicxc) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
>>> +        !hicxc->pasid_alloc) {
>> At this point of the reading, I fail to understand why we need the flag.
>> Why isn't it sufficient to test whether the ops is set?
> 
> I added it for architectures which have no requirement for pasid
> alloc/free and only need the other callbacks in the class. I'm not sure
> if I'm correct, but it looks unnecessary for vSMMU, right?
vSMMU does not require it at the moment. But in that case, it shall not
provide any implementation for it and that should be sufficient,
shouldn't it?
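
i.e. something as simple as the below should do (sketch):

    if (!hicxc->pasid_alloc) {
        return -EINVAL;
    }

    return hicxc->pasid_alloc(iommu_ctx, min, max, pasid);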
> 
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    return hicxc->pasid_alloc(iommu_ctx, min, max, pasid);
>>> +}
>>> +
>>> +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t
>> pasid)
>>> +{
>>> +    HostIOMMUContextClass *hicxc;
>>> +
>>> +    if (!iommu_ctx) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
>>> +    if (!hicxc) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
>>> +        !hicxc->pasid_free) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    return hicxc->pasid_free(iommu_ctx, pasid);
>>> +}
>>> +
>>> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
>>> +                         const char *mrtypename,
>>> +                         uint64_t flags)
>>> +{
>>> +    HostIOMMUContext *iommu_ctx;
>>> +
>>> +    object_initialize(_iommu_ctx, instance_size, mrtypename);
>>> +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
>>> +    iommu_ctx->flags = flags;
>>> +    iommu_ctx->initialized = true;
>>> +}
>>> +
>>> +static const TypeInfo host_iommu_context_info = {
>>> +    .parent             = TYPE_OBJECT,
>>> +    .name               = TYPE_HOST_IOMMU_CONTEXT,
>>> +    .class_size         = sizeof(HostIOMMUContextClass),
>>> +    .instance_size      = sizeof(HostIOMMUContext),
>>> +    .abstract           = true,
>> Can't we use the usual .instance_init and .instance_finalize?
>>> +};
>>> +
>>> +static void host_iommu_ctx_register_types(void)
>>> +{
>>> +    type_register_static(&host_iommu_context_info);
>>> +}
>>> +
>>> +type_init(host_iommu_ctx_register_types)
>>> diff --git a/include/hw/iommu/host_iommu_context.h
>> b/include/hw/iommu/host_iommu_context.h
>>> new file mode 100644
>>> index 0000000..35c4861
>>> --- /dev/null
>>> +++ b/include/hw/iommu/host_iommu_context.h
>>> @@ -0,0 +1,75 @@
>>> +/*
>>> + * QEMU abstraction of Host IOMMU
>>> + *
>>> + * Copyright (C) 2020 Intel Corporation.
>>> + *
>>> + * Authors: Liu Yi L <yi.l.liu@intel.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> +
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>> + * GNU General Public License for more details.
>>> +
>>> + * You should have received a copy of the GNU General Public License along
>>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
>>> + */
>>> +
>>> +#ifndef HW_IOMMU_CONTEXT_H
>>> +#define HW_IOMMU_CONTEXT_H
>>> +
>>> +#include "qemu/queue.h"
>>> +#include "qemu/thread.h"
>>> +#include "qom/object.h"
>>> +#include <linux/iommu.h>
>>> +#ifndef CONFIG_USER_ONLY
>>> +#include "exec/hwaddr.h"
>>> +#endif
>>> +
>>> +#define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
>>> +#define HOST_IOMMU_CONTEXT(obj) \
>>> +        OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
>>> +#define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
>>> +        OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
>>> +                         TYPE_HOST_IOMMU_CONTEXT)
>>> +
>>> +typedef struct HostIOMMUContext HostIOMMUContext;
>>> +
>>> +typedef struct HostIOMMUContextClass {
>>> +    /* private */
>>> +    ObjectClass parent_class;
>>> +
>>> +    /* Allocate pasid from HostIOMMUContext (a.k.a. host software) */
>> Request the host to allocate a PASID?
>> "from HostIOMMUContext (a.k.a. host software)" is a bit cryptic to me.
> 
> oh, I mean to request pasid allocation from host.. sorry for the confusion.
> 
>> Actually at this stage I do not understand what this HostIOMMUContext
>> abstracts. Is it an object associated to one guest FL context entry
>> (attached to one PASID). Meaning for just vIOMMU/VFIO using nested
>> paging (single PASID) I would use a single of such context per IOMMU MR?
> 
> No, it's not for a single guest FL context. It's an abstraction of the
> capability provided by a nested-translation capable host backend. In
> vfio, that is VFIO_TYPE1_NESTING_IOMMU.
> 
> Here is the notion behind introducing the HostIOMMUContext. Existing
> vfio is a secure framework which provides userspace the capability to
> program mappings into a single isolation domain on the host side.
> Compared with the legacy host IOMMU, a nested-translation capable IOMMU
> provides more: it gives userspace the capability to program a FL/stage-1
> page table to the host side. This is also called bind_gpasid in this
> series. VFIO exposes the nesting capability to userspace with the
> VFIO_TYPE1_NESTING_IOMMU type. And along with that type, pasid alloc/
> free and iommu_cache_inv are exposed as capabilities provided by
> VFIO_TYPE1_NESTING_IOMMU.

OK so let me try to rephrase:

"the HostIOMMUContext is an object which allows to manage the stage-1
translation when a vIOMMU is implemented upon physical IOMMU nested
paging (VFIO case).

It is an abstract object which needs to be derived for each vIOMMU
immplementation based on physical nested paging.

An HostIOMMUContext derived object will be passed to each VFIO device
protected by a vIOMMU using physical nested paging.
"

Is that correct?

> Also, if we want, we could actually migrate the MAP/UNMAP notifiers to
> be hooks in HostIOMMUContext. Then we can have a unified abstraction
> for the capabilities provided by the host.
So then it becomes contradictory to what we said before, because
MAP/UNMAP are used with single-stage HW implementations.
> 
>> I think David also felt difficult to understand the abstraction behind
>> this object.
>>
>>> +    int (*pasid_alloc)(HostIOMMUContext *iommu_ctx,
>>> +                       uint32_t min,
>>> +                       uint32_t max,
>>> +                       uint32_t *pasid);
>>> +    /* Reclaim pasid from HostIOMMUContext (a.k.a. host software) */
>>> +    int (*pasid_free)(HostIOMMUContext *iommu_ctx,
>>> +                      uint32_t pasid);
>>> +} HostIOMMUContextClass;
>>> +
>>> +/*
>>> + * This is an abstraction of host IOMMU with dual-stage capability
>>> + */
>>> +struct HostIOMMUContext {
>>> +    Object parent_obj;
>>> +#define HOST_IOMMU_PASID_REQUEST (1ULL << 0)
>>> +    uint64_t flags;
>>> +    bool initialized;
>> what's the purpose of the initialized flag?
> 
> It's for checking the availability of the host's nesting capability in
> vfio/pci. In this series, HostIOMMUContext is initialized in vfio/common
> and needs a way to tell vfio/pci that it is available.
> 
>>> +};
>>> +
>>> +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
>>> +                               uint32_t max, uint32_t *pasid);
>>> +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t
>> pasid);
>>> +
>>> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
>>> +                         const char *mrtypename,
>>> +                         uint64_t flags);
>>> +void host_iommu_ctx_destroy(HostIOMMUContext *iommu_ctx);
>> leftover from V1?
> 
> right, thanks for catching it.
> 
> Regards,
> Yi Liu
> 
Thanks

Eric


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
  2020-03-30  4:24   ` Liu Yi L
@ 2020-03-31 10:47     ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-31 10:47 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe, Jacob Pan, Yi Sun

Yi,

On 3/30/20 6:24 AM, Liu Yi L wrote:
> This patch defines vfio_host_iommu_context_info, implements the PASID
> alloc/free hooks defined in HostIOMMUContextClass.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/vfio/common.c                      | 69 +++++++++++++++++++++++++++++++++++
>  include/hw/iommu/host_iommu_context.h |  3 ++
>  include/hw/vfio/vfio-common.h         |  4 ++
>  3 files changed, 76 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index c276732..5f3534d 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1179,6 +1179,53 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>      return -EINVAL;
>  }
>  
> +static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
> +                                           uint32_t min, uint32_t max,
> +                                           uint32_t *pasid)
> +{
> +    VFIOContainer *container = container_of(iommu_ctx,
> +                                            VFIOContainer, iommu_ctx);
> +    struct vfio_iommu_type1_pasid_request req;
> +    unsigned long argsz;
you can easily avoid using the argsz variable
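e.g. with a designated initializer (just a sketch):

    struct vfio_iommu_type1_pasid_request req = {
        .argsz = sizeof(req),
        .flags = VFIO_IOMMU_PASID_ALLOC,
        .alloc_pasid.min = min,
        .alloc_pasid.max = max,
    };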
> +    int ret;
> +
> +    argsz = sizeof(req);
> +    req.argsz = argsz;
> +    req.flags = VFIO_IOMMU_PASID_ALLOC;
> +    req.alloc_pasid.min = min;
> +    req.alloc_pasid.max = max;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> +        ret = -errno;
> +        error_report("%s: %d, alloc failed", __func__, ret);
better use %m directly or strerror(errno)
also include vbasedev->name?
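e.g. (sketch; assuming no vbasedev is at hand in this container-level hook):

    error_report("%s: pasid alloc failed: %m", __func__);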
> +        return ret;
> +    }
> +    *pasid = req.alloc_pasid.result;
> +    return 0;
> +}
> +
> +static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
> +                                          uint32_t pasid)
> +{
> +    VFIOContainer *container = container_of(iommu_ctx,
> +                                            VFIOContainer, iommu_ctx);
> +    struct vfio_iommu_type1_pasid_request req;
> +    unsigned long argsz;
same
> +    int ret;
> +
> +    argsz = sizeof(req);
> +    req.argsz = argsz;
> +    req.flags = VFIO_IOMMU_PASID_FREE;
> +    req.free_pasid = pasid;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> +        ret = -errno;
> +        error_report("%s: %d, free failed", __func__, ret);
same
> +        return ret;
> +    }
> +    return 0;
> +}
> +
>  static int vfio_init_container(VFIOContainer *container, int group_fd,
>                                 Error **errp)
>  {
> @@ -1791,3 +1838,25 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>      }
>      return vfio_eeh_container_op(container, op);
>  }
> +
> +static void vfio_host_iommu_context_class_init(ObjectClass *klass,
> +                                                       void *data)
> +{
> +    HostIOMMUContextClass *hicxc = HOST_IOMMU_CONTEXT_CLASS(klass);
> +
> +    hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
> +    hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free;
> +}
> +
> +static const TypeInfo vfio_host_iommu_context_info = {
> +    .parent = TYPE_HOST_IOMMU_CONTEXT,
> +    .name = TYPE_VFIO_HOST_IOMMU_CONTEXT,
> +    .class_init = vfio_host_iommu_context_class_init,
Ah OK

This is the object inheriting from the abstract TYPE_HOST_IOMMU_CONTEXT.
I initially thought VTDHostIOMMUContext was, sorry for the misunderstanding.

Do you expect other HostIOMMUContext backends? Given the name and ops,
it looks really related to VFIO?

Thanks

Eric


> +};
> +
> +static void vfio_register_types(void)
> +{
> +    type_register_static(&vfio_host_iommu_context_info);
> +}
> +
> +type_init(vfio_register_types)
> diff --git a/include/hw/iommu/host_iommu_context.h b/include/hw/iommu/host_iommu_context.h
> index 35c4861..227c433 100644
> --- a/include/hw/iommu/host_iommu_context.h
> +++ b/include/hw/iommu/host_iommu_context.h
> @@ -33,6 +33,9 @@
>  #define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
>  #define HOST_IOMMU_CONTEXT(obj) \
>          OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
> +#define HOST_IOMMU_CONTEXT_CLASS(klass) \
> +        OBJECT_CLASS_CHECK(HostIOMMUContextClass, (klass), \
> +                         TYPE_HOST_IOMMU_CONTEXT)
>  #define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
>          OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
>                           TYPE_HOST_IOMMU_CONTEXT)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index fd56420..0b07303 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -26,12 +26,15 @@
>  #include "qemu/notify.h"
>  #include "ui/console.h"
>  #include "hw/display/ramfb.h"
> +#include "hw/iommu/host_iommu_context.h"
>  #ifdef CONFIG_LINUX
>  #include <linux/vfio.h>
>  #endif
>  
>  #define VFIO_MSG_PREFIX "vfio %s: "
>  
> +#define TYPE_VFIO_HOST_IOMMU_CONTEXT "qemu:vfio-host-iommu-context"
> +
>  enum {
>      VFIO_DEVICE_TYPE_PCI = 0,
>      VFIO_DEVICE_TYPE_PLATFORM = 1,
> @@ -71,6 +74,7 @@ typedef struct VFIOContainer {
>      MemoryListener listener;
>      MemoryListener prereg_listener;
>      unsigned iommu_type;
> +    HostIOMMUContext iommu_ctx;
>      Error *error;
>      bool initialized;
>      unsigned long pgsizes;
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
  2020-03-31 10:47     ` Auger Eric
@ 2020-03-31 10:59       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31 10:59 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Eric,

> From: Auger Eric
> Sent: Tuesday, March 31, 2020 6:48 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> alex.williamson@redhat.com; peterx@redhat.com
> Cc: pbonzini@redhat.com; mst@redhat.com; david@gibson.dropbear.id.au; Tian,
> Kevin <kevin.tian@intel.com>; Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y
> <yi.y.sun@intel.com>; kvm@vger.kernel.org; Wu, Hao <hao.wu@intel.com>; jean-
> philippe@linaro.org; Jacob Pan <jacob.jun.pan@linux.intel.com>; Yi Sun
> <yi.y.sun@linux.intel.com>
> Subject: Re: [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
> 
> Yi,
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > This patch defines vfio_host_iommu_context_info, implements the PASID
> > alloc/free hooks defined in HostIOMMUContextClass.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/vfio/common.c                      | 69 +++++++++++++++++++++++++++++++++++
> >  include/hw/iommu/host_iommu_context.h |  3 ++
> >  include/hw/vfio/vfio-common.h         |  4 ++
> >  3 files changed, 76 insertions(+)
> >
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c index
> > c276732..5f3534d 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1179,6 +1179,53 @@ static int vfio_get_iommu_type(VFIOContainer
> *container,
> >      return -EINVAL;
> >  }
> >
> > +static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
> > +                                           uint32_t min, uint32_t max,
> > +                                           uint32_t *pasid) {
> > +    VFIOContainer *container = container_of(iommu_ctx,
> > +                                            VFIOContainer, iommu_ctx);
> > +    struct vfio_iommu_type1_pasid_request req;
> > +    unsigned long argsz;
> you can easily avoid using argsz variable

oh, right. :-)

> > +    int ret;
> > +
> > +    argsz = sizeof(req);
> > +    req.argsz = argsz;
> > +    req.flags = VFIO_IOMMU_PASID_ALLOC;
> > +    req.alloc_pasid.min = min;
> > +    req.alloc_pasid.max = max;
> > +
> > +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> > +        ret = -errno;
> > +        error_report("%s: %d, alloc failed", __func__, ret);
> better use %m directly or strerror(errno) also include vbasedev->name?

oh yes, vbasedev->name is also nice to have.

> > +        return ret;
> > +    }
> > +    *pasid = req.alloc_pasid.result;
> > +    return 0;
> > +}
> > +
> > +static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
> > +                                          uint32_t pasid) {
> > +    VFIOContainer *container = container_of(iommu_ctx,
> > +                                            VFIOContainer, iommu_ctx);
> > +    struct vfio_iommu_type1_pasid_request req;
> > +    unsigned long argsz;
> same

got it.

> > +    int ret;
> > +
> > +    argsz = sizeof(req);
> > +    req.argsz = argsz;
> > +    req.flags = VFIO_IOMMU_PASID_FREE;
> > +    req.free_pasid = pasid;
> > +
> > +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> > +        ret = -errno;
> > +        error_report("%s: %d, free failed", __func__, ret);
> same

yep.
> > +        return ret;
> > +    }
> > +    return 0;
> > +}
> > +
> >  static int vfio_init_container(VFIOContainer *container, int group_fd,
> >                                 Error **errp)  { @@ -1791,3 +1838,25
> > @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
> >      }
> >      return vfio_eeh_container_op(container, op);  }
> > +
> > +static void vfio_host_iommu_context_class_init(ObjectClass *klass,
> > +                                                       void *data) {
> > +    HostIOMMUContextClass *hicxc = HOST_IOMMU_CONTEXT_CLASS(klass);
> > +
> > +    hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
> > +    hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free; }
> > +
> > +static const TypeInfo vfio_host_iommu_context_info = {
> > +    .parent = TYPE_HOST_IOMMU_CONTEXT,
> > +    .name = TYPE_VFIO_HOST_IOMMU_CONTEXT,
> > +    .class_init = vfio_host_iommu_context_class_init,
> Ah OK
> 
> This is the object inheriting from the abstract TYPE_HOST_IOMMU_CONTEXT.

yes. it is. :-)

> I initially thought VTDHostIOMMUContext was, sorry for the misunderstanding.

Ah, my fault, should have got it earlier. so we may have just aligned
in last Oct.

> Do you expect other HostIOMMUContext backends? Given the name and ops, it
> looks really related to VFIO?

For other backends, I guess you mean other passthru modules? If yes, I
think they should have their own type names, just like the vIOMMUs below,
which each define their own type name and inherit the same parent.

static const TypeInfo vtd_iommu_memory_region_info = {
    .parent = TYPE_IOMMU_MEMORY_REGION,
    .name = TYPE_INTEL_IOMMU_MEMORY_REGION,
    .class_init = vtd_iommu_memory_region_class_init,
};

static const TypeInfo smmuv3_iommu_memory_region_info = {
    .parent = TYPE_IOMMU_MEMORY_REGION,
    .name = TYPE_SMMUV3_IOMMU_MEMORY_REGION,
    .class_init = smmuv3_iommu_memory_region_class_init,
};

static const TypeInfo amdvi_iommu_memory_region_info = {
    .parent = TYPE_IOMMU_MEMORY_REGION,
    .name = TYPE_AMD_IOMMU_MEMORY_REGION,
    .class_init = amdvi_iommu_memory_region_class_init,
};
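
So if another passthru module ever needed it, I would expect it to define
its own subtype in the same way, e.g. (purely hypothetical, not part of
this series):

static const TypeInfo vhost_host_iommu_context_info = {
    .parent = TYPE_HOST_IOMMU_CONTEXT,
    .name = "qemu:vhost-host-iommu-context",    /* hypothetical type name */
    .class_init = vhost_host_iommu_context_class_init,  /* hypothetical */
};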

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
  2020-03-31 10:59       ` Liu, Yi L
@ 2020-03-31 11:15         ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-31 11:15 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Yi,
On 3/31/20 12:59 PM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric
>> Sent: Tuesday, March 31, 2020 6:48 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> alex.williamson@redhat.com; peterx@redhat.com
>> Cc: pbonzini@redhat.com; mst@redhat.com; david@gibson.dropbear.id.au; Tian,
>> Kevin <kevin.tian@intel.com>; Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y
>> <yi.y.sun@intel.com>; kvm@vger.kernel.org; Wu, Hao <hao.wu@intel.com>; jean-
>> philippe@linaro.org; Jacob Pan <jacob.jun.pan@linux.intel.com>; Yi Sun
>> <yi.y.sun@linux.intel.com>
>> Subject: Re: [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
>>
>> Yi,
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> This patch defines vfio_host_iommu_context_info, implements the PASID
>>> alloc/free hooks defined in HostIOMMUContextClass.
>>>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Eric Auger <eric.auger@redhat.com>
>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Cc: Alex Williamson <alex.williamson@redhat.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  hw/vfio/common.c                      | 69 +++++++++++++++++++++++++++++++++++
>>>  include/hw/iommu/host_iommu_context.h |  3 ++
>>>  include/hw/vfio/vfio-common.h         |  4 ++
>>>  3 files changed, 76 insertions(+)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c index
>>> c276732..5f3534d 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -1179,6 +1179,53 @@ static int vfio_get_iommu_type(VFIOContainer
>> *container,
>>>      return -EINVAL;
>>>  }
>>>
>>> +static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
>>> +                                           uint32_t min, uint32_t max,
>>> +                                           uint32_t *pasid) {
>>> +    VFIOContainer *container = container_of(iommu_ctx,
>>> +                                            VFIOContainer, iommu_ctx);
>>> +    struct vfio_iommu_type1_pasid_request req;
>>> +    unsigned long argsz;
>> you can easily avoid using argsz variable
> 
> oh, right. :-)
> 
>>> +    int ret;
>>> +
>>> +    argsz = sizeof(req);
>>> +    req.argsz = argsz;
>>> +    req.flags = VFIO_IOMMU_PASID_ALLOC;
>>> +    req.alloc_pasid.min = min;
>>> +    req.alloc_pasid.max = max;
>>> +
>>> +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
>>> +        ret = -errno;
>>> +        error_report("%s: %d, alloc failed", __func__, ret);
>> better use %m directly or strerror(errno) also include vbasedev->name?
> 
> oh yes, vbasedev->name is also nice to have.
> 
>>> +        return ret;
>>> +    }
>>> +    *pasid = req.alloc_pasid.result;
>>> +    return 0;
>>> +}
>>> +
>>> +static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
>>> +                                          uint32_t pasid) {
>>> +    VFIOContainer *container = container_of(iommu_ctx,
>>> +                                            VFIOContainer, iommu_ctx);
>>> +    struct vfio_iommu_type1_pasid_request req;
>>> +    unsigned long argsz;
>> same
> 
> got it.
> 
>>> +    int ret;
>>> +
>>> +    argsz = sizeof(req);
>>> +    req.argsz = argsz;
>>> +    req.flags = VFIO_IOMMU_PASID_FREE;
>>> +    req.free_pasid = pasid;
>>> +
>>> +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
>>> +        ret = -errno;
>>> +        error_report("%s: %d, free failed", __func__, ret);
>> same
> 
> yep.
>>> +        return ret;
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>>  static int vfio_init_container(VFIOContainer *container, int group_fd,
>>>                                 Error **errp)  { @@ -1791,3 +1838,25
>>> @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
>>>      }
>>>      return vfio_eeh_container_op(container, op);  }
>>> +
>>> +static void vfio_host_iommu_context_class_init(ObjectClass *klass,
>>> +                                                       void *data) {
>>> +    HostIOMMUContextClass *hicxc = HOST_IOMMU_CONTEXT_CLASS(klass);
>>> +
>>> +    hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
>>> +    hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free; }
>>> +
>>> +static const TypeInfo vfio_host_iommu_context_info = {
>>> +    .parent = TYPE_HOST_IOMMU_CONTEXT,
>>> +    .name = TYPE_VFIO_HOST_IOMMU_CONTEXT,
>>> +    .class_init = vfio_host_iommu_context_class_init,
>> Ah OK
>>
>> This is the object inheriting from the abstract TYPE_HOST_IOMMU_CONTEXT.
> 
> yes. it is. :-)
> 
>> I initially thought VTDHostIOMMUContext was, sorry for the misunderstanding.
> 
> Ah, my fault, should have got it earlier. so we may have just aligned
> in last Oct.
> 
>> Do you expect other HostIOMMUContext backends? Given the name and ops, it
>> looks really related to VFIO?
> 
> For other backends, I guess you mean other passthru modules? If yes, I
> think they should have their own type names, just like the vIOMMUs below,
> which each define their own type name and inherit the same parent.
> 
> static const TypeInfo vtd_iommu_memory_region_info = {
>     .parent = TYPE_IOMMU_MEMORY_REGION,
>     .name = TYPE_INTEL_IOMMU_MEMORY_REGION,
>     .class_init = vtd_iommu_memory_region_class_init,
> };
> 
> static const TypeInfo smmuv3_iommu_memory_region_info = {
>     .parent = TYPE_IOMMU_MEMORY_REGION,
>     .name = TYPE_SMMUV3_IOMMU_MEMORY_REGION,
>     .class_init = smmuv3_iommu_memory_region_class_init,
> };
> 
> static const TypeInfo amdvi_iommu_memory_region_info = {
>     .parent = TYPE_IOMMU_MEMORY_REGION,
>     .name = TYPE_AMD_IOMMU_MEMORY_REGION,
>     .class_init = amdvi_iommu_memory_region_class_init,
> };
Sorry I am confused now.

You don't have such kind of inheritance at the moment in your series.

You have an abstract object (TYPE_HOST_IOMMU_CONTEXT, HostIOMMUContext)
which is derived into TYPE_VFIO_HOST_IOMMU_CONTEXT. Only the class ops
are specialized for VFIO. But I do not foresee any user other than VFIO
(i.e. other implementers of the class ops), hence my question. For
instance, would virtio/vhost ever implement its own TYPE_HOST_IOMMU_CONTEXT?

On the other hand you have VTDHostIOMMUContext which is not a QOM
derived object.
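
To sum up the relationships as I read them in this version (just a sketch):

    Object
      +-- HostIOMMUContext           (TYPE_HOST_IOMMU_CONTEXT, abstract)
            +-- VFIO implementation  (TYPE_VFIO_HOST_IOMMU_CONTEXT,
                                      class ops = PASID alloc/free)

    VTDHostIOMMUContext: plain struct in intel_iommu, caching a
    HostIOMMUContext pointer; not a QOM subtype.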

Thanks

Eric
> 
> Regards,
> Yi Liu
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 06/22] hw/pci: introduce pci_device_set/unset_iommu_context()
  2020-03-30 17:30     ` Auger Eric
@ 2020-03-31 12:14       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31 12:14 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Eric,

> From: Auger Eric < eric.auger@redhat.com>
> Sent: Tuesday, March 31, 2020 1:30 AM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 06/22] hw/pci: introduce
> pci_device_set/unset_iommu_context()
> 
> Yi,
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > This patch adds pci_device_set/unset_iommu_context() to set/unset
> > host_iommu_context for a given device. New callback is added in
> > PCIIOMMUOps. As such, vIOMMU could make use of host IOMMU capability.
> > e.g setup nested translation.
> 
> I think you need to explain what this practically is supposed to do.
> such as: by attaching such context to a PCI device (for example VFIO
> assigned?), you tell the host that this PCIe device is protected by a FL
> stage controlled by the guest or something like that - if this is
> correct understanding (?) -

I'd like to put it this way: by attaching such a context to a PCI device
(for example a VFIO-assigned one), this PCIe device is protected by a
host IOMMU with nested-translation capability. Its DMA is then protected
either through the FL stage controlled by the guest together with an SL
stage page table owned by the host, or through a single-stage page table
owned by the host (e.g. the shadow solution). Which of the two applies
depends on the vIOMMU: pci_device_set/unset_iommu_context() finally
passes the context to the vIOMMU. If the vIOMMU binds the guest FL stage
page table to the host, it is the former case; if it does not bind, it is
the latter case.
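
In code terms, the intended flow is roughly the below (a sketch only; the
real call sites are in the vfio/pci patches of this series, and pdev,
container and the nesting check are placeholders here):

    /* at VFIO PCI realize time, once the container reports nesting */
    if (host_nesting_supported) {                        /* placeholder */
        pci_device_set_iommu_context(pdev, &container->iommu_ctx);
    }

    /* and symmetrically at device teardown */
    pci_device_unset_iommu_context(pdev);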

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 07/22] intel_iommu: add set/unset_iommu_context callback
  2020-03-30 20:23     ` Auger Eric
@ 2020-03-31 12:25       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31 12:25 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun, Richard Henderson,
	Eduardo Habkost

Hi Eric,

> From: Auger Eric < eric.auger@redhat.com>
> Sent: Tuesday, March 31, 2020 4:24 AM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 07/22] intel_iommu: add set/unset_iommu_context callback
> 
> Yi,
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > This patch adds set/unset_iommu_context() impelementation in Intel
> This patch implements the set/unset_iommu_context() ops for Intel vIOMMU.
> > vIOMMU. For Intel platform, pass-through modules (e.g. VFIO) could
> > set HostIOMMUContext to Intel vIOMMU emulator.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/i386/intel_iommu.c         | 71
> ++++++++++++++++++++++++++++++++++++++++---
> >  include/hw/i386/intel_iommu.h | 21 ++++++++++---
> >  2 files changed, 83 insertions(+), 9 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 4b22910..fd349c6 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -3354,23 +3354,33 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
> >      },
> >  };
> >
> > -VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> > +/**
> > + * Fetch a VTDBus instance for given PCIBus. If no existing instance,
> > + * allocate one.
> > + */
> > +static VTDBus *vtd_find_add_bus(IntelIOMMUState *s, PCIBus *bus)
> >  {
> >      uintptr_t key = (uintptr_t)bus;
> >      VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
> > -    VTDAddressSpace *vtd_dev_as;
> > -    char name[128];
> >
> >      if (!vtd_bus) {
> >          uintptr_t *new_key = g_malloc(sizeof(*new_key));
> >          *new_key = (uintptr_t)bus;
> >          /* No corresponding free() */
> > -        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
> > -                            PCI_DEVFN_MAX);
> > +        vtd_bus = g_malloc0(sizeof(VTDBus));
> >          vtd_bus->bus = bus;
> >          g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
> >      }
> > +    return vtd_bus;
> > +}
> >
> > +VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> > +{
> > +    VTDBus *vtd_bus;
> > +    VTDAddressSpace *vtd_dev_as;
> > +    char name[128];
> > +
> > +    vtd_bus = vtd_find_add_bus(s, bus);
> >      vtd_dev_as = vtd_bus->dev_as[devfn];
> >
> >      if (!vtd_dev_as) {
> > @@ -3436,6 +3446,55 @@ VTDAddressSpace
> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> >      return vtd_dev_as;
> >  }
> >
> > +static int vtd_dev_set_iommu_context(PCIBus *bus, void *opaque,
> > +                                     int devfn,
> > +                                     HostIOMMUContext *iommu_ctx)
> > +{
> > +    IntelIOMMUState *s = opaque;
> > +    VTDBus *vtd_bus;
> > +    VTDHostIOMMUContext *vtd_dev_icx;
> > +
> > +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> > +
> > +    vtd_bus = vtd_find_add_bus(s, bus);
> > +
> > +    vtd_iommu_lock(s);
> > +
> > +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
> > +
> > +    assert(!vtd_dev_icx);
> > +
> > +    vtd_bus->dev_icx[devfn] = vtd_dev_icx =
> > +                    g_malloc0(sizeof(VTDHostIOMMUContext));
> > +    vtd_dev_icx->vtd_bus = vtd_bus;
> > +    vtd_dev_icx->devfn = (uint8_t)devfn;
> > +    vtd_dev_icx->iommu_state = s;
> > +    vtd_dev_icx->iommu_ctx = iommu_ctx;
> > +
> > +    vtd_iommu_unlock(s);
> > +
> > +    return 0;
> > +}
> > +
> > +static void vtd_dev_unset_iommu_context(PCIBus *bus, void *opaque, int devfn)
> > +{
> > +    IntelIOMMUState *s = opaque;
> > +    VTDBus *vtd_bus;
> > +    VTDHostIOMMUContext *vtd_dev_icx;
> > +
> > +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
> > +
> > +    vtd_bus = vtd_find_add_bus(s, bus);
> > +
> > +    vtd_iommu_lock(s);
> > +
> > +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
> > +    g_free(vtd_dev_icx);
> > +    vtd_bus->dev_icx[devfn] = NULL;
> > +
> > +    vtd_iommu_unlock(s);
> > +}
> > +
> >  static uint64_t get_naturally_aligned_size(uint64_t start,
> >                                             uint64_t size, int gaw)
> >  {
> > @@ -3731,6 +3790,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >
> >  static PCIIOMMUOps vtd_iommu_ops = {
> >      .get_address_space = vtd_host_dma_iommu,
> > +    .set_iommu_context = vtd_dev_set_iommu_context,
> > +    .unset_iommu_context = vtd_dev_unset_iommu_context,
> >  };
> >
> >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 3870052..b5fefb9 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -64,6 +64,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
> >  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> >  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> >  typedef struct VTDPASIDEntry VTDPASIDEntry;
> > +typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
> >
> >  /* Context-Entry */
> >  struct VTDContextEntry {
> > @@ -112,10 +113,20 @@ struct VTDAddressSpace {
> >      IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
> >  };
> >
> > +struct VTDHostIOMMUContext {
> 
> 
> > +    VTDBus *vtd_bus;
> > +    uint8_t devfn;
> > +    HostIOMMUContext *iommu_ctx;
> I don't get why we don't have standard QOM inheritance instead of this
> handle?
> VTDHostContext parent_obj;
> 
> like IOMMUMemoryRegion <- MemoryRegion <- Object

Here it does not inherit from the object; it just caches the
HostIOMMUContext pointer in the vIOMMU, just like an AddressSpace holds a
MemoryRegion pointer. It is the same here: VTDHostIOMMUContext is just a
wrapper to better manage it in vVT-d, it is not inheriting.
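
For reference, the existing parallel I have in mind (sketch from memory,
roughly as in include/exec/memory.h):

    struct AddressSpace {
        ...
        MemoryRegion *root;    /* plain cached pointer, no QOM parent_obj */
        ...
    };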

> > +    IntelIOMMUState *iommu_state;
> > +};
> > +
> >  struct VTDBus {
> > -    PCIBus* bus;		/* A reference to the bus to provide translation for
> */
> > +    /* A reference to the bus to provide translation for */
> > +    PCIBus *bus;
> >      /* A table of VTDAddressSpace objects indexed by devfn */
> > -    VTDAddressSpace *dev_as[];
> > +    VTDAddressSpace *dev_as[PCI_DEVFN_MAX];
> > +    /* A table of VTDHostIOMMUContext objects indexed by devfn */
> > +    VTDHostIOMMUContext *dev_icx[PCI_DEVFN_MAX];
> At this point of the review, it is unclear to me why the context is
> associated to a device.

HostIOMMUContext can be per-device or not; it depends on how the vIOMMU
manages it. For vVT-d it is per-device, as the container is per-device.

> Up to now you have not explained why it should be. If
> so, why isn't it part of VTDAddressSpace?

Ah, I did consider it, but I chose to use a separate one, as the context is
not really tied to an address space. It's better to manage it with a
separate structure.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
  2020-03-31  7:47         ` Auger Eric
@ 2020-03-31 12:43           ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31 12:43 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Tuesday, March 31, 2020 3:48 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
> 
> Yi,
> 
> On 3/31/20 6:10 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric < eric.auger@redhat.com >
> >> Sent: Tuesday, March 31, 2020 1:23 AM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> >> Subject: Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
> >>
> >> Yi,
> >>
> >> On 3/30/20 6:24 AM, Liu Yi L wrote:
> >>> Currently, many platform vendors provide the capability of dual
> >>> stage DMA address translation in hardware. For example, nested
> >>> translation on Intel VT-d scalable mode, nested stage translation on
> >>> ARM SMMUv3, and etc. In dual stage DMA address translation, there
> >>> are two stages address translation, stage-1 (a.k.a first-level) and
> >>> stage-2 (a.k.a
> >>> second-level) translation structures. Stage-1 translation results
> >>> are also subjected to stage-2 translation structures. Take vSVA
> >>> (Virtual Shared Virtual Addressing) as an example, guest IOMMU
> >>> driver owns
> >>> stage-1 translation structures (covers GVA->GPA translation), and
> >>> host IOMMU driver owns stage-2 translation structures (covers
> >>> GPA->HPA translation). VMM is responsible to bind stage-1
> >>> translation structures to host, thus hardware could achieve GVA->GPA
> >>> and then GPA->HPA translation. For more background on SVA, refer the below
> links.
> >>>  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
> >>>  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> >>> Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> >>>
> >>> In QEMU, vIOMMU emulators expose IOMMUs to VM per their own spec (e.g.
> >>> Intel VT-d spec). Devices are pass-through to guest via device pass-
> >>> through components like VFIO. VFIO is a userspace driver framework
> >>> which exposes host IOMMU programming capability to userspace in a
> >>> secure manner. e.g. IOVA MAP/UNMAP requests. Thus the major
> >>> connection between VFIO and vIOMMU are MAP/UNMAP. However, with the
> >>> dual stage DMA translation support, there are more interactions
> >>> between vIOMMU and VFIO as below:
> >>
> >> I think it is key to justify at some point why the IOMMU MR notifiers
> >> are not usable for that purpose. If I remember correctly this is due
> >> to the fact MR notifiers are not active on x86 in that use case,
> >> which is not the case on ARM dual stage enablement.
> >
> > yes, it's the major reason. Also I listed the former description here.
> > BTW. I don't think notifier is suitable as it is unable to return value.
> > right? The pasid alloc in this series actually requires to get the
> > alloc result from vfio. So it's also a reason why notifier is not proper.
> >
> >   "Qemu has an existing notifier framework based on MemoryRegion, which
> >   are used for MAP/UNMAP. However, it is not well suited for virt-SVA.
> >   Reasons are as below:
> >   - virt-SVA works along with PT = 1
> >   - if PT = 1 IOMMU MR are disabled so MR notifier are not registered
> >   - new notifiers do not fit nicely in this framework as they need to be
> >     registered even if PT = 1
> >   - need a new framework to attach the new notifiers
> >   - Additional background can be got from:
> >     https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html"
> >
> > And there is a history on it. I think the earliest idea to introduce a
> > new mechanism instead of using MR notifier for vSVA is from below link.
> > https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05295.html
> >
> > And then, I have several versions patch series which try to add a
> > notifier framework for vSVA based on IOMMUSVAContext.
> > https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html
> >
> > After the vSVA notifier framework patchset, then we somehow agreed to
> > use PCIPASIDOps which sits in PCIDevice. This is proposed in below link.
> > https://patchwork.kernel.org/cover/11033657/
> > However, it was questioned to provide pasid allocation interface in a
> > per-device manner.
> >   "On Fri, Jul 05, 2019 at 07:01:38PM +0800, Liu Yi L wrote:
> >   > This patch adds vfio implementation PCIPASIDOps.alloc_pasid/free_pasid().
> >   > These two functions are used to propagate guest pasid allocation and
> >   > free requests to host via vfio container ioctl.
> >
> >   As I said in an earlier comment, I think doing this on the device is
> >   conceptually incorrect.  I think we need an explcit notion of an SVM
> >   context (i.e. the namespace in which all the PASIDs live) - which will
> >   IIUC usually be shared amongst multiple devices.  The create and free
> >   PASID requests should be on that object."
> > https://patchwork.kernel.org/patch/11033659/
> >
> > And the explicit notion of an SVM context from David inspired me to
> > make an explicit way to facilitate the interaction between vfio and
> > vIOMMU. So I came up with the SVMContext direction, and finally
> > renamed it as HostIOMMUContext and place it in VFIOContainer as it is
> > supposed to be per -container.
> 
> Thank you for summarizing the whole history. To make things clear I do not put into
> question this last approach, I just meant the commit message should justify why this
> is needed and why the existing IOMMUMRNotifier approach cannot be used.

Ah, it's a good recap for me too. :-)

> >> maybe: "Information, different from map/unmap notifications need to
> >> be passed from QEMU vIOMMU device to/from the host IOMMU driver
> >> through the VFIO/IOMMU layer: ..."
> >
> > I see. I'll adopt your description. thanks.
> >
> >>>  1) PASID allocation (allow host to intercept in PASID allocation)
> >>>  2) bind stage-1 translation structures to host
> >>>  3) propagate stage-1 cache invalidation to host
> >>>  4) DMA address translation fault (I/O page fault) servicing etc.
> >>
> >>>
> >>> With the above new interactions in QEMU, it requires an abstract
> >>> layer to facilitate the above operations and expose to vIOMMU
> >>> emulators as an explicit way for vIOMMU emulators call into VFIO.
> >>> This patch introduces HostIOMMUContext to stand for hardware IOMMU
> >>> w/ dual stage DMA address translation capability. And introduces
> >>> HostIOMMUContextClass to provide methods for vIOMMU emulators to
> >>> propagate dual-stage translation related requests to host. As a
> >>> beginning, PASID allocation/free are defined to propagate PASID
> >>> allocation/free requests to host which is helpful for the vendors
> >>> who manage PASID in system-wide. In future, there will be more operations
> like bind_stage1_pgtbl, flush_stage1_cache and etc.
> >>>
> >>> Cc: Kevin Tian <kevin.tian@intel.com>
> >>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Cc: Peter Xu <peterx@redhat.com>
> >>> Cc: Eric Auger <eric.auger@redhat.com>
> >>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> >>> Cc: David Gibson <david@gibson.dropbear.id.au>
> >>> Cc: Michael S. Tsirkin <mst@redhat.com>
> >>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >>> ---
> >>>  hw/Makefile.objs                      |  1 +
> >>>  hw/iommu/Makefile.objs                |  1 +
> >>>  hw/iommu/host_iommu_context.c         | 97
> >> +++++++++++++++++++++++++++++++++++
> >>>  include/hw/iommu/host_iommu_context.h | 75
> >>> +++++++++++++++++++++++++++
> >>>  4 files changed, 174 insertions(+)
> >>>  create mode 100644 hw/iommu/Makefile.objs  create mode 100644
> >>> hw/iommu/host_iommu_context.c  create mode 100644
> >>> include/hw/iommu/host_iommu_context.h
> >>>
> >>> diff --git a/hw/Makefile.objs b/hw/Makefile.objs index
> >>> 660e2b4..cab83fe 100644
> >>> --- a/hw/Makefile.objs
> >>> +++ b/hw/Makefile.objs
> >>> @@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
> >>>  devices-dirs-$(CONFIG_NUBUS) += nubus/  devices-dirs-y +=
> >>> semihosting/  devices-dirs-y += smbios/
> >>> +devices-dirs-y += iommu/
> >>>  endif
> >>>
> >>>  common-obj-y += $(devices-dirs-y)
> >>> diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs new
> >>> file mode 100644 index 0000000..e6eed4e
> >>> --- /dev/null
> >>> +++ b/hw/iommu/Makefile.objs
> >>> @@ -0,0 +1 @@
> >>> +obj-y += host_iommu_context.o
> >>> diff --git a/hw/iommu/host_iommu_context.c
> >> b/hw/iommu/host_iommu_context.c
> >>> new file mode 100644
> >>> index 0000000..5fb2223
> >>> --- /dev/null
> >>> +++ b/hw/iommu/host_iommu_context.c
> >>> @@ -0,0 +1,97 @@
> >>> +/*
> >>> + * QEMU abstract of Host IOMMU
> >>> + *
> >>> + * Copyright (C) 2020 Intel Corporation.
> >>> + *
> >>> + * Authors: Liu Yi L <yi.l.liu@intel.com>
> >>> + *
> >>> + * This program is free software; you can redistribute it and/or
> >>> +modify
> >>> + * it under the terms of the GNU General Public License as
> >>> +published by
> >>> + * the Free Software Foundation; either version 2 of the License,
> >>> +or
> >>> + * (at your option) any later version.
> >>> +
> >>> + * This program is distributed in the hope that it will be useful,
> >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>> + * GNU General Public License for more details.
> >>> +
> >>> + * You should have received a copy of the GNU General Public
> >>> + License along
> >>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> >>> + */
> >>> +
> >>> +#include "qemu/osdep.h"
> >>> +#include "qapi/error.h"
> >>> +#include "qom/object.h"
> >>> +#include "qapi/visitor.h"
> >>> +#include "hw/iommu/host_iommu_context.h"
> >>> +
> >>> +int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t
> min,
> >>> +                               uint32_t max, uint32_t *pasid) {
> >>> +    HostIOMMUContextClass *hicxc;
> >>> +
> >>> +    if (!iommu_ctx) {
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> >>> +
> >>> +    if (!hicxc) {
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> >>> +        !hicxc->pasid_alloc) {
> >> At this point of the reading, I fail to understand why we need the flag.
> >> Why isn't it sufficient to test whether the ops is set?
> >
> > I added it in case of the architecture which has no requirement for
> > pasid alloc/free and only needs the other callbacks in the class. I'm
> > not sure if I'm correct, it looks to be unnecessary for vSMMU. right?
> vSMMU does not require it at the moment. But in that case, it shall not provide any
> implementation for it and that should be sufficient, shouldn't it?

Hmm, but the hook is implemented by VFIO. Forget it, that was a bad
argument.

A better way to put it: the flag covers the case where a backend does
not want to provide pasid alloc/free. Also, the flags can be used by
the vIOMMU to enumerate the host side's capabilities (e.g. pasid
alloc/free, pasid bind, cache_inv, and pasid_table_bind). This series
does not make use of that in the vIOMMU yet, but I do plan to.
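
For illustration only (this is not code from the series; the min/max
constants and the call site are assumptions), the vIOMMU side could
check the advertised capability before issuing a request:

    /* Sketch inside a hypothetical vVT-d helper: gate the PASID request
     * on the capability the host backend advertised via flags. */
    if (vtd_dev_icx && vtd_dev_icx->iommu_ctx &&
        (vtd_dev_icx->iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST)) {
        ret = host_iommu_ctx_pasid_alloc(vtd_dev_icx->iommu_ctx,
                                         VTD_HPASID_MIN, /* assumed constant */
                                         VTD_HPASID_MAX, /* assumed constant */
                                         &pasid);
    } else {
        ret = -ENODEV;  /* host side cannot allocate PASIDs for us */
    }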

> >
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    return hicxc->pasid_alloc(iommu_ctx, min, max, pasid); }
> >>> +
> >>> +int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t
> >> pasid)
> >>> +{
> >>> +    HostIOMMUContextClass *hicxc;
> >>> +
> >>> +    if (!iommu_ctx) {
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
> >>> +    if (!hicxc) {
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    if (!(iommu_ctx->flags & HOST_IOMMU_PASID_REQUEST) ||
> >>> +        !hicxc->pasid_free) {
> >>> +        return -EINVAL;
> >>> +    }
> >>> +
> >>> +    return hicxc->pasid_free(iommu_ctx, pasid); }
> >>> +
> >>> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> >>> +                         const char *mrtypename,
> >>> +                         uint64_t flags) {
> >>> +    HostIOMMUContext *iommu_ctx;
> >>> +
> >>> +    object_initialize(_iommu_ctx, instance_size, mrtypename);
> >>> +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
> >>> +    iommu_ctx->flags = flags;
> >>> +    iommu_ctx->initialized = true;
> >>> +}
> >>> +
> >>> +static const TypeInfo host_iommu_context_info = {
> >>> +    .parent             = TYPE_OBJECT,
> >>> +    .name               = TYPE_HOST_IOMMU_CONTEXT,
> >>> +    .class_size         = sizeof(HostIOMMUContextClass),
> >>> +    .instance_size      = sizeof(HostIOMMUContext),
> >>> +    .abstract           = true,
> >> Can't we use the usual .instance_init and .instance_finalize?
> >>> +};
> >>> +
> >>> +static void host_iommu_ctx_register_types(void)
> >>> +{
> >>> +    type_register_static(&host_iommu_context_info);
> >>> +}
> >>> +
> >>> +type_init(host_iommu_ctx_register_types)
> >>> diff --git a/include/hw/iommu/host_iommu_context.h
> >> b/include/hw/iommu/host_iommu_context.h
> >>> new file mode 100644
> >>> index 0000000..35c4861
> >>> --- /dev/null
> >>> +++ b/include/hw/iommu/host_iommu_context.h
> >>> @@ -0,0 +1,75 @@
> >>> +/*
> >>> + * QEMU abstraction of Host IOMMU
> >>> + *
> >>> + * Copyright (C) 2020 Intel Corporation.
> >>> + *
> >>> + * Authors: Liu Yi L <yi.l.liu@intel.com>
> >>> + *
> >>> + * This program is free software; you can redistribute it and/or
> >>> +modify
> >>> + * it under the terms of the GNU General Public License as
> >>> +published by
> >>> + * the Free Software Foundation; either version 2 of the License,
> >>> +or
> >>> + * (at your option) any later version.
> >>> +
> >>> + * This program is distributed in the hope that it will be useful,
> >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>> + * GNU General Public License for more details.
> >>> +
> >>> + * You should have received a copy of the GNU General Public
> >>> + License along
> >>> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> >>> + */
> >>> +
> >>> +#ifndef HW_IOMMU_CONTEXT_H
> >>> +#define HW_IOMMU_CONTEXT_H
> >>> +
> >>> +#include "qemu/queue.h"
> >>> +#include "qemu/thread.h"
> >>> +#include "qom/object.h"
> >>> +#include <linux/iommu.h>
> >>> +#ifndef CONFIG_USER_ONLY
> >>> +#include "exec/hwaddr.h"
> >>> +#endif
> >>> +
> >>> +#define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
> >>> +#define HOST_IOMMU_CONTEXT(obj) \
> >>> +        OBJECT_CHECK(HostIOMMUContext, (obj),
> >>> +TYPE_HOST_IOMMU_CONTEXT) #define
> HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
> >>> +        OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
> >>> +                         TYPE_HOST_IOMMU_CONTEXT)
> >>> +
> >>> +typedef struct HostIOMMUContext HostIOMMUContext;
> >>> +
> >>> +typedef struct HostIOMMUContextClass {
> >>> +    /* private */
> >>> +    ObjectClass parent_class;
> >>> +
> >>> +    /* Allocate pasid from HostIOMMUContext (a.k.a. host software)
> >>> + */
> >> Request the host to allocate a PASID?
> >> "from HostIOMMUContext (a.k.a. host software)" is a bit cryptic to me.
> >
> > oh, I mean to request pasid allocation from host.. sorry for the confusion.
> >
> >> Actually at this stage I do not understand what this HostIOMMUContext
> >> abstracts. Is it an object associated to one guest FL context entry
> >> (attached to one PASID). Meaning for just vIOMMU/VFIO using nested
> >> paging (single PASID) I would use a single of such context per IOMMU MR?
> >
> > No, it's not for a single guest FL context. It's for the abstraction
> > of the capability provided by a nested-translation capable host backend.
> > In vfio, it's VFIO_IOMMU_TYPE1_NESTING.
> >
> > Here is the notion behind introducing the HostIOMMUContext. Existing
> > vfio is a secure framework which provides userspace the capability to
> > program mappings into a single isolation domain in host side. Compared
> > with the legacy host IOMMU, nested-translation capable IOMMU provides
> > more. It gives the user-space with the capability to program a
> > FL/Stage
> > -1 page table to host side. This is also called as bind_gpasid in this
> > series. VFIO exposes nesting capability to userspace with the
> > VFIO_IOMMU_TYPE1_NESTING type. And along with the type, the pasid
> > alloc/ free and iommu_cache_inv are exposed as the capabilities
> > provided by VFIO_IOMMU_TYPE1_NESTING.
> 
> OK so let me try to rephrase:
> 
> "the HostIOMMUContext is an object which allows to manage the stage-1
> translation when a vIOMMU is implemented upon physical IOMMU nested paging
> (VFIO case).
> 
> It is an abstract object which needs to be derived for each vIOMMU
> immplementation based on physical nested paging.
> 
> An HostIOMMUContext derived object will be passed to each VFIO device protected
> by a vIOMMU using physical nested paging.
> "
> 
> Is that correct?

You're a better writer than me. Yes, I think so.

>  Also, if we want, actually we could migrate
> > the MAP/UNMAP notifier to be hooks in HostIOMMUContext. Then we can
> > have an unified abstraction for the capabilities provided by host.
> So then it becomes contradictory to what we said before because MAP/UNMAP are
> used with single stage HW implementation.

If we wanted to migrate MAP/UNMAP into the host context, then the object
description would need to be updated as below. I don't think we'll do
that for now, so let's keep the above description for the nesting case.

"the HostIOMMUContext is an object which allows managing the stage-1
translation when a vIOMMU is implemented upon physical IOMMU nested paging,
or programming single-stage page mappings to the host (VFIO case).

It is an abstract object which needs to be derived for each vIOMMU
implementation based on physical IOMMU paging.

A HostIOMMUContext-derived object will be passed to each VFIO device protected
by a vIOMMU using physical IOMMU paging."
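
As a rough sketch of how such a derived object gets wired up (not taken
verbatim from the series; pci_device_set_iommu_context() and the exact
call site are assumptions here), VFIO would do something like:

    /* VFIO creates the derived context and hands it to the vIOMMU via
     * the PCI layer; flags advertise what the host backend supports. */
    host_iommu_ctx_init(&container->iommu_ctx, sizeof(container->iommu_ctx),
                        TYPE_VFIO_HOST_IOMMU_CONTEXT,
                        HOST_IOMMU_PASID_REQUEST);
    pci_device_set_iommu_context(pdev, &container->iommu_ctx);

and the vIOMMU's .set_iommu_context callback then caches the pointer per
device, as patch 07/22 does for vVT-d.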

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
  2020-03-31 11:15         ` Auger Eric
@ 2020-03-31 12:54           ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-03-31 12:54 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Tuesday, March 31, 2020 7:16 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
> 
> Hi Yi,
> On 3/31/20 12:59 PM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric
> >> Sent: Tuesday, March 31, 2020 6:48 PM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> >> alex.williamson@redhat.com; peterx@redhat.com
> >> Cc: pbonzini@redhat.com; mst@redhat.com; david@gibson.dropbear.id.au; Tian,
> >> Kevin <kevin.tian@intel.com>; Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y
> >> <yi.y.sun@intel.com>; kvm@vger.kernel.org; Wu, Hao <hao.wu@intel.com>;
> jean-
> >> philippe@linaro.org; Jacob Pan <jacob.jun.pan@linux.intel.com>; Yi Sun
> >> <yi.y.sun@linux.intel.com>
> >> Subject: Re: [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks
> >>
> >> Yi,
> >>
> >> On 3/30/20 6:24 AM, Liu Yi L wrote:
> >>> This patch defines vfio_host_iommu_context_info, implements the PASID
> >>> alloc/free hooks defined in HostIOMMUContextClass.
> >>>
> >>> Cc: Kevin Tian <kevin.tian@intel.com>
> >>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Cc: Peter Xu <peterx@redhat.com>
> >>> Cc: Eric Auger <eric.auger@redhat.com>
> >>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> >>> Cc: David Gibson <david@gibson.dropbear.id.au>
> >>> Cc: Alex Williamson <alex.williamson@redhat.com>
> >>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >>> ---
> >>>  hw/vfio/common.c                      | 69
> +++++++++++++++++++++++++++++++++++
> >>>  include/hw/iommu/host_iommu_context.h |  3 ++
> >>>  include/hw/vfio/vfio-common.h         |  4 ++
> >>>  3 files changed, 76 insertions(+)
> >>>
> >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c index
> >>> c276732..5f3534d 100644
> >>> --- a/hw/vfio/common.c
> >>> +++ b/hw/vfio/common.c
> >>> @@ -1179,6 +1179,53 @@ static int vfio_get_iommu_type(VFIOContainer
> >> *container,
> >>>      return -EINVAL;
> >>>  }
> >>>
> >>> +static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
> >>> +                                           uint32_t min, uint32_t max,
> >>> +                                           uint32_t *pasid) {
> >>> +    VFIOContainer *container = container_of(iommu_ctx,
> >>> +                                            VFIOContainer, iommu_ctx);
> >>> +    struct vfio_iommu_type1_pasid_request req;
> >>> +    unsigned long argsz;
> >> you can easily avoid using argsz variable
> >
> > oh, right. :-)
> >
> >>> +    int ret;
> >>> +
> >>> +    argsz = sizeof(req);
> >>> +    req.argsz = argsz;
> >>> +    req.flags = VFIO_IOMMU_PASID_ALLOC;
> >>> +    req.alloc_pasid.min = min;
> >>> +    req.alloc_pasid.max = max;
> >>> +
> >>> +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> >>> +        ret = -errno;
> >>> +        error_report("%s: %d, alloc failed", __func__, ret);
> >> better use %m directly or strerror(errno) also include vbasedev->name?
> >
> > or yes, vbasedev->name is also nice to have.
> >
> >>> +        return ret;
> >>> +    }
> >>> +    *pasid = req.alloc_pasid.result;
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
> >>> +                                          uint32_t pasid) {
> >>> +    VFIOContainer *container = container_of(iommu_ctx,
> >>> +                                            VFIOContainer, iommu_ctx);
> >>> +    struct vfio_iommu_type1_pasid_request req;
> >>> +    unsigned long argsz;
> >> same
> >
> > got it.
> >
> >>> +    int ret;
> >>> +
> >>> +    argsz = sizeof(req);
> >>> +    req.argsz = argsz;
> >>> +    req.flags = VFIO_IOMMU_PASID_FREE;
> >>> +    req.free_pasid = pasid;
> >>> +
> >>> +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> >>> +        ret = -errno;
> >>> +        error_report("%s: %d, free failed", __func__, ret);
> >> same
> >
> > yep.
> >>> +        return ret;
> >>> +    }
> >>> +    return 0;
> >>> +}
> >>> +
> >>>  static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>>                                 Error **errp)  { @@ -1791,3 +1838,25
> >>> @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
> >>>      }
> >>>      return vfio_eeh_container_op(container, op);  }
> >>> +
> >>> +static void vfio_host_iommu_context_class_init(ObjectClass *klass,
> >>> +                                                       void *data) {
> >>> +    HostIOMMUContextClass *hicxc = HOST_IOMMU_CONTEXT_CLASS(klass);
> >>> +
> >>> +    hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
> >>> +    hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free; }
> >>> +
> >>> +static const TypeInfo vfio_host_iommu_context_info = {
> >>> +    .parent = TYPE_HOST_IOMMU_CONTEXT,
> >>> +    .name = TYPE_VFIO_HOST_IOMMU_CONTEXT,
> >>> +    .class_init = vfio_host_iommu_context_class_init,
> >> Ah OK
> >>
> >> This is the object inheriting from the abstract TYPE_HOST_IOMMU_CONTEXT.
> >
> > yes. it is. :-)
> >
> >> I initially thought VTDHostIOMMUContext was, sorry for the misunderstanding.
> >
> > Ah, my fault, should have got it earlier. so we may have just aligned
> > in last Oct.
> >
> >> Do you expect other HostIOMMUContext backends? Given the name and ops, it
> >> looks really related to VFIO?
> >
> > For other backends, I guess you mean other passthru modules? If yes, I
> > think they should have their own type name. Just like vIOMMUs, the below
> > vIOMMUs defines their own type name and inherits the same parent.
> >
> > static const TypeInfo vtd_iommu_memory_region_info = {
> >     .parent = TYPE_IOMMU_MEMORY_REGION,
> >     .name = TYPE_INTEL_IOMMU_MEMORY_REGION,
> >     .class_init = vtd_iommu_memory_region_class_init,
> > };
> >
> > static const TypeInfo smmuv3_iommu_memory_region_info = {
> >     .parent = TYPE_IOMMU_MEMORY_REGION,
> >     .name = TYPE_SMMUV3_IOMMU_MEMORY_REGION,
> >     .class_init = smmuv3_iommu_memory_region_class_init,
> > };
> >
> > static const TypeInfo amdvi_iommu_memory_region_info = {
> >     .parent = TYPE_IOMMU_MEMORY_REGION,
> >     .name = TYPE_AMD_IOMMU_MEMORY_REGION,
> >     .class_init = amdvi_iommu_memory_region_class_init,
> > };
> Sorry I am confused now.

The three definitions above are just an example; I just wanted to explain
which model I'm referencing. :-)

> You don't have such kind of inheritance at the moment in your series.

Yes, only VFIO inherits HostIOMMUContext; no other module inherits it.
But I wanted to show a case in which multiple modules inherit a single
parent, and we lack a VFIO-equivalent module to demonstrate that, so
I used the iommu_memory_region example. Sorry for the confusion.

> 
> You have an abstract object (TYPE_HOST_IOMMU_CONTEXT, HostIOMMUContext)
> which is derived into TYPE_VFIO_HOST_IOMMU_CONTEXT. Only the class ops
> are specialized for VFIO. But I do not foresee any other user than VFIO
> (ie. other implementers of the class ops), hence my question. For
> instance would virtio/vhost ever implement its TYPE_HOST_IOMMU_CONTEXT.

I don't know either, but I think it's possible. They could do it per
their needs in the future.
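
Purely as an illustration of what such a backend would look like (the
type name and class_init function below are made up, not part of any
series), it would just be another QOM child of the abstract type:

    static const TypeInfo vhost_host_iommu_context_info = {
        .parent     = TYPE_HOST_IOMMU_CONTEXT,
        .name       = "qemu:vhost-host-iommu-context",          /* hypothetical */
        .class_init = vhost_host_iommu_context_class_init,      /* hypothetical */
    };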

> On the other hand you have VTDHostIOMMUContext which is not a QOM
> derived object.

OK, I guess I made you believe that both VFIO and the vIOMMU inherit
HostIOMMUContext. Is that it?

Actually, that's not the case. Only VFIO inherits HostIOMMUContext in
the QOM manner. VTDHostIOMMUContext just references the HostIOMMUContext
that VFIO initialized.
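
To illustrate the difference (the field lists follow the structures in
this series, but take this as a sketch rather than the exact code):

    /* VFIO: the container embeds the context as a QOM object
     * (TYPE_VFIO_HOST_IOMMU_CONTEXT derives TYPE_HOST_IOMMU_CONTEXT). */
    struct VFIOContainer {
        /* ... */
        HostIOMMUContext iommu_ctx;
        /* ... */
    };

    /* vVT-d: only a plain pointer to the object VFIO created. */
    struct VTDHostIOMMUContext {
        VTDBus *vtd_bus;
        uint8_t devfn;
        HostIOMMUContext *iommu_ctx;   /* reference, not a QOM child */
        IntelIOMMUState *iommu_state;
    };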

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 07/22] intel_iommu: add set/unset_iommu_context callback
  2020-03-31 12:25       ` Liu, Yi L
@ 2020-03-31 12:57         ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-31 12:57 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun, Richard Henderson,
	Eduardo Habkost

Hi Yi,

On 3/31/20 2:25 PM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric < eric.auger@redhat.com>
>> Sent: Tuesday, March 31, 2020 4:24 AM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 07/22] intel_iommu: add set/unset_iommu_context callback
>>
>> Yi,
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> This patch adds set/unset_iommu_context() impelementation in Intel
>> This patch implements the set/unset_iommu_context() ops for Intel vIOMMU.
>>> vIOMMU. For Intel platform, pass-through modules (e.g. VFIO) could
>>> set HostIOMMUContext to Intel vIOMMU emulator.
>>>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Richard Henderson <rth@twiddle.net>
>>> Cc: Eduardo Habkost <ehabkost@redhat.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  hw/i386/intel_iommu.c         | 71
>> ++++++++++++++++++++++++++++++++++++++++---
>>>  include/hw/i386/intel_iommu.h | 21 ++++++++++---
>>>  2 files changed, 83 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 4b22910..fd349c6 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -3354,23 +3354,33 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
>>>      },
>>>  };
>>>
>>> -VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>>> +/**
>>> + * Fetch a VTDBus instance for given PCIBus. If no existing instance,
>>> + * allocate one.
>>> + */
>>> +static VTDBus *vtd_find_add_bus(IntelIOMMUState *s, PCIBus *bus)
>>>  {
>>>      uintptr_t key = (uintptr_t)bus;
>>>      VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
>>> -    VTDAddressSpace *vtd_dev_as;
>>> -    char name[128];
>>>
>>>      if (!vtd_bus) {
>>>          uintptr_t *new_key = g_malloc(sizeof(*new_key));
>>>          *new_key = (uintptr_t)bus;
>>>          /* No corresponding free() */
>>> -        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
>>> -                            PCI_DEVFN_MAX);
>>> +        vtd_bus = g_malloc0(sizeof(VTDBus));
>>>          vtd_bus->bus = bus;
>>>          g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
>>>      }
>>> +    return vtd_bus;
>>> +}
>>>
>>> +VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>>> +{
>>> +    VTDBus *vtd_bus;
>>> +    VTDAddressSpace *vtd_dev_as;
>>> +    char name[128];
>>> +
>>> +    vtd_bus = vtd_find_add_bus(s, bus);
>>>      vtd_dev_as = vtd_bus->dev_as[devfn];
>>>
>>>      if (!vtd_dev_as) {
>>> @@ -3436,6 +3446,55 @@ VTDAddressSpace
>> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>>>      return vtd_dev_as;
>>>  }
>>>
>>> +static int vtd_dev_set_iommu_context(PCIBus *bus, void *opaque,
>>> +                                     int devfn,
>>> +                                     HostIOMMUContext *iommu_ctx)
>>> +{
>>> +    IntelIOMMUState *s = opaque;
>>> +    VTDBus *vtd_bus;
>>> +    VTDHostIOMMUContext *vtd_dev_icx;
>>> +
>>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>>> +
>>> +    vtd_bus = vtd_find_add_bus(s, bus);
>>> +
>>> +    vtd_iommu_lock(s);
>>> +
>>> +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
>>> +
>>> +    assert(!vtd_dev_icx);
>>> +
>>> +    vtd_bus->dev_icx[devfn] = vtd_dev_icx =
>>> +                    g_malloc0(sizeof(VTDHostIOMMUContext));
>>> +    vtd_dev_icx->vtd_bus = vtd_bus;
>>> +    vtd_dev_icx->devfn = (uint8_t)devfn;
>>> +    vtd_dev_icx->iommu_state = s;
>>> +    vtd_dev_icx->iommu_ctx = iommu_ctx;
>>> +
>>> +    vtd_iommu_unlock(s);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void vtd_dev_unset_iommu_context(PCIBus *bus, void *opaque, int devfn)
>>> +{
>>> +    IntelIOMMUState *s = opaque;
>>> +    VTDBus *vtd_bus;
>>> +    VTDHostIOMMUContext *vtd_dev_icx;
>>> +
>>> +    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
>>> +
>>> +    vtd_bus = vtd_find_add_bus(s, bus);
>>> +
>>> +    vtd_iommu_lock(s);
>>> +
>>> +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
>>> +    g_free(vtd_dev_icx);
>>> +    vtd_bus->dev_icx[devfn] = NULL;
>>> +
>>> +    vtd_iommu_unlock(s);
>>> +}
>>> +
>>>  static uint64_t get_naturally_aligned_size(uint64_t start,
>>>                                             uint64_t size, int gaw)
>>>  {
>>> @@ -3731,6 +3790,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>
>>>  static PCIIOMMUOps vtd_iommu_ops = {
>>>      .get_address_space = vtd_host_dma_iommu,
>>> +    .set_iommu_context = vtd_dev_set_iommu_context,
>>> +    .unset_iommu_context = vtd_dev_unset_iommu_context,
>>>  };
>>>
>>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>>> index 3870052..b5fefb9 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -64,6 +64,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
>>>  typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>>>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>>>  typedef struct VTDPASIDEntry VTDPASIDEntry;
>>> +typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
>>>
>>>  /* Context-Entry */
>>>  struct VTDContextEntry {
>>> @@ -112,10 +113,20 @@ struct VTDAddressSpace {
>>>      IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
>>>  };
>>>
>>> +struct VTDHostIOMMUContext {
>>
>>
>>> +    VTDBus *vtd_bus;
>>> +    uint8_t devfn;
>>> +    HostIOMMUContext *iommu_ctx;
>> I don't get why we don't have standard QOM inheritance instead of this
>> handle?
>> VTDHostContext parent_obj;
>>
>> like IOMMUMemoryRegion <- MemoryRegion <- Object
> 
> Here it does not inherit the object; it just caches the HostIOMMUContext
> pointer in the vIOMMU. Just like AddressSpace has a MemoryRegion pointer,
> VTDHostIOMMUContext is just a wrapper to better manage the pointer in
> vVT-d. It's not inheriting.

Yep I've got it now ;-)
> 
>>> +    IntelIOMMUState *iommu_state;
>>> +};
>>> +
>>>  struct VTDBus {
>>> -    PCIBus* bus;		/* A reference to the bus to provide translation for
>> */
>>> +    /* A reference to the bus to provide translation for */
>>> +    PCIBus *bus;
>>>      /* A table of VTDAddressSpace objects indexed by devfn */
>>> -    VTDAddressSpace *dev_as[];
>>> +    VTDAddressSpace *dev_as[PCI_DEVFN_MAX];
>>> +    /* A table of VTDHostIOMMUContext objects indexed by devfn */
>>> +    VTDHostIOMMUContext *dev_icx[PCI_DEVFN_MAX];
>> At this point of the review, it is unclear to me why the context is
>> associated to a device.
> 
> HostIOMMUContext can be per-device or not. It depends on how the vIOMMU
> manages it. For vVT-d, it's per-device as the container is per-device.
> 
>> Up to now you have not explained why it should be. If
>> so, why isn't it part of VTDAddressSpace?
> 
> Ah, I did consider it. But I chose to use a separate one as the
> context is not really tied to an address space. It's better to manage
> it with a separate structure.

OK

Thanks

Eric
> 
> Regards,
> Yi Liu
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 10/22] vfio/pci: set host iommu context to vIOMMU
  2020-03-30  4:24   ` Liu Yi L
@ 2020-03-31 14:30     ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-03-31 14:30 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe, Jacob Pan, Yi Sun

Yi,

On 3/30/20 6:24 AM, Liu Yi L wrote:
> For vfio-pci devices, pci_device_set/unset_iommu() can be used to
> expose the host IOMMU context to vIOMMU emulators. vIOMMU emulators
> can then make use of the methods provided by the host IOMMU context,
> e.g. propagate requests to the host IOMMU.
I think I would squash this patch into the previous one.

> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/vfio/pci.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5e75a95..c140c88 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2717,6 +2717,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      VFIOPCIDevice *vdev = PCI_VFIO(pdev);
>      VFIODevice *vbasedev_iter;
>      VFIOGroup *group;
> +    VFIOContainer *container;
>      char *tmp, *subsys, group_path[PATH_MAX], *group_name;
>      Error *err = NULL;
>      ssize_t len;
> @@ -3028,6 +3029,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
>  
> +    container = vdev->vbasedev.group->container;
> +    if (container->iommu_ctx.initialized) {
Isn't it possible to dynamically allocate the iommu_ctx so that you can
simply check container->iommu_ctx and discard the initialized field?
> +        pci_device_set_iommu_context(pdev, &container->iommu_ctx);
> +    }
> +
>      return;
>  
>  out_deregister:
> @@ -3072,9 +3078,16 @@ static void vfio_instance_finalize(Object *obj)
>  static void vfio_exitfn(PCIDevice *pdev)
>  {
>      VFIOPCIDevice *vdev = PCI_VFIO(pdev);
> +    VFIOContainer *container;
>  
>      vfio_unregister_req_notifier(vdev);
>      vfio_unregister_err_notifier(vdev);
> +
> +    container = vdev->vbasedev.group->container;
> +    if (container->iommu_ctx.initialized) {
> +        pci_device_unset_iommu_context(pdev);
> +    }
> +
>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>      if (vdev->irqchip_change_notifier.notify) {
>          kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
> 
Thanks

Eric


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 10/22] vfio/pci: set host iommu context to vIOMMU
  2020-03-31 14:30     ` Auger Eric
@ 2020-04-01  3:20       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-01  3:20 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Tuesday, March 31, 2020 10:30 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 10/22] vfio/pci: set host iommu context to vIOMMU
> 
> Yi,
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > For vfio-pci devices, pci_device_set/unset_iommu() can be used to
> > expose the host IOMMU context to vIOMMU emulators. vIOMMU emulators can
> > then make use of the methods provided by the host IOMMU context, e.g.
> > propagate requests to the host IOMMU.
> I think I would squash this patch into the previous one.

Sure, I can do that. :-)

> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/vfio/pci.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 5e75a95..c140c88
> > 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -2717,6 +2717,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >      VFIOPCIDevice *vdev = PCI_VFIO(pdev);
> >      VFIODevice *vbasedev_iter;
> >      VFIOGroup *group;
> > +    VFIOContainer *container;
> >      char *tmp, *subsys, group_path[PATH_MAX], *group_name;
> >      Error *err = NULL;
> >      ssize_t len;
> > @@ -3028,6 +3029,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >      vfio_register_req_notifier(vdev);
> >      vfio_setup_resetfn_quirk(vdev);
> >
> > +    container = vdev->vbasedev.group->container;
> > +    if (container->iommu_ctx.initialized) {
> Isn't it possible to dynamically allocate the iommu_ctx so that you can simply check
> container->iommu_ctx and discard the initialized field?

iommu_ctx is allocated along with the container since it is not a pointer in
VFIOContainer. The only way to check it is to have a flag. :-)
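
As a rough sketch of the alternative being discussed (hypothetical, field and
function names taken from this thread):

    /* Today (assumed): the context is embedded in the container, so a flag
     * is needed to tell whether it was set up:
     *
     *     struct VFIOContainer {
     *         ...
     *         HostIOMMUContext iommu_ctx;   // iommu_ctx.initialized marks validity
     *     };
     *
     * Eric's suggestion: make it a pointer, allocated only for nesting-capable
     * containers, so a plain NULL check replaces the flag: */
    struct VFIOContainer {
        /* ... */
        HostIOMMUContext *iommu_ctx;          /* NULL when not available */
    };

    /* vfio_realize() could then simply do: */
    if (container->iommu_ctx) {
        pci_device_set_iommu_context(pdev, container->iommu_ctx);
    }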

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
  2020-03-30  4:24   ` Liu Yi L
@ 2020-04-01  7:50     ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-01  7:50 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, kevin.tian, jun.j.tian, yi.y.sun, kvm,
	hao.wu, jean-philippe, Jacob Pan, Yi Sun

Hi Yi,

On 3/30/20 6:24 AM, Liu Yi L wrote:
> In this patch, QEMU first gets the iommu info from the kernel to check
> the capabilities supported by a VFIO_IOMMU_TYPE1_NESTING iommu, and then
> initializes a HostIOMMUContext instance.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/vfio/common.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 99 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5f3534d..44b142c 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1226,10 +1226,89 @@ static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
>      return 0;
>  }
>  
> +/**
> + * Get iommu info from host. On a successful call, the caller of this
> + * function should free the memory pointed to by the returned pointer
> + * stored in @info once it is no longer needed.
> + */
> +static int vfio_get_iommu_info(VFIOContainer *container,
> +                         struct vfio_iommu_type1_info **info)
> +{
> +
> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> +
> +    *info = g_malloc0(argsz);
> +
> +retry:
> +    (*info)->argsz = argsz;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> +        g_free(*info);
> +        *info = NULL;
> +        return -errno;
> +    }
> +
> +    if (((*info)->argsz > argsz)) {
> +        argsz = (*info)->argsz;
> +        *info = g_realloc(*info, argsz);
> +        goto retry;
> +    }
> +
> +    return 0;
> +}
> +
> +static struct vfio_info_cap_header *
> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    void *ptr = info;
> +
> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> +        return NULL;
> +    }
> +
> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> +        if (hdr->id == id) {
> +            return hdr;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static int vfio_get_nesting_iommu_cap(VFIOContainer *container,
> +                   struct vfio_iommu_type1_info_cap_nesting *cap_nesting)
> +{
> +    struct vfio_iommu_type1_info *info;
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_iommu_type1_info_cap_nesting *cap;
> +    int ret;
> +
> +    ret = vfio_get_iommu_info(container, &info);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    hdr = vfio_get_iommu_info_cap(info,
> +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
> +    if (!hdr) {
> +        g_free(info);
> +        return -errno;
> +    }
> +
> +    cap = container_of(hdr,
> +                struct vfio_iommu_type1_info_cap_nesting, header);
> +    *cap_nesting = *cap;
> +
> +    g_free(info);
> +    return 0;
> +}
> +
>  static int vfio_init_container(VFIOContainer *container, int group_fd,
>                                 Error **errp)
>  {
>      int iommu_type, ret;
> +    uint64_t flags = 0;
>  
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
> @@ -1257,6 +1336,26 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>          return -errno;
>      }
>  
> +    if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> +        struct vfio_iommu_type1_info_cap_nesting nesting = {
> +                                         .nesting_capabilities = 0x0,
> +                                         .stage1_formats = 0, };
> +
> +        ret = vfio_get_nesting_iommu_cap(container, &nesting);
> +        if (ret) {
> +            error_setg_errno(errp, -ret,
> +                             "Failed to get nesting iommu cap");
> +            return ret;
> +        }
> +
> +        flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
> +                 HOST_IOMMU_PASID_REQUEST : 0;
I still don't get why you can't transform your iommu_ctx into a  pointer
and do
container->iommu_ctx = g_new0(HostIOMMUContext, 1);
then
host_iommu_ctx_init(container->iommu_ctx, flags);

That would look similar to what is already done in hw/vfio/common.c. You may
not even need to use a derived VFIOHostIOMMUContext object (as only VFIO
uses that object)? Only the ops change, no new field?
        region->mem = g_new0(MemoryRegion, 1);
        memory_region_init_io(region->mem, obj, &vfio_region_ops,
                              region, name, region->size);

Thanks

Eric

> +        host_iommu_ctx_init(&container->iommu_ctx,
> +                            sizeof(container->iommu_ctx),
> +                            TYPE_VFIO_HOST_IOMMU_CONTEXT,
> +                            flags);
> +    }
> +
>      container->iommu_type = iommu_type;
>      return 0;
>  }
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
  2020-03-30  4:24   ` Liu Yi L
@ 2020-04-02  0:02     ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-02  0:02 UTC (permalink / raw)
  To: Liu Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

On Sun, Mar 29, 2020 at 09:24:52PM -0700, Liu Yi L wrote:
> This patch adds a PASID cache management infrastructure based on
> new added structure VTDPASIDAddressSpace, which is used to track
> the PASID usage and future PASID tagged DMA address translation
> support in vIOMMU.
> 
>     struct VTDPASIDAddressSpace {
>         VTDBus *vtd_bus;
>         uint8_t devfn;
>         AddressSpace as;
>         uint32_t pasid;
>         IntelIOMMUState *iommu_state;
>         VTDContextCacheEntry context_cache_entry;
>         QLIST_ENTRY(VTDPASIDAddressSpace) next;
>         VTDPASIDCacheEntry pasid_cache_entry;
>     };
> 
> Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> is bound to a DMA AddressSpace. The Intel VT-d spec requires guest
> software to issue a pasid cache invalidation when binding or unbinding
> a pasid to/from an address space under caching-mode. However, as
> VTDPASIDAddressSpace instances also act as the pasid cache in this
> implementation, their creation also happens during vIOMMU PASID
> tagged DMA translation. The creation in this path is not added in
> this patch since there are no PASID-capable emulated devices for
> now.
> 
> The implementation in this patch manages VTDPASIDAddressSpace
> instances per PASID+BDF (lookup and insert use PASID and
> BDF) since the Intel VT-d spec allows a per-BDF PASID Table. When a
> guest binds a PASID to an AddressSpace, QEMU will capture the
> guest's pasid-selective pasid cache invalidation, and allocate or
> remove a VTDPASIDAddressSpace instance according to the invalidation
> reason:
> 
>     *) a present pasid entry moved to non-present
>     *) a present pasid entry modified into another present entry
>     *) a non-present pasid entry moved to present
> 
> The vIOMMU emulator can figure out the reason by fetching the latest
> guest pasid entry.
> 
> v1 -> v2: - merged this patch with the former replay binding patch; makes
>             PSI/DSI/GSI use the unified function to do cache invalidation
>             and pasid binding replay.
>           - dropped pasid_cache_gen in both iommu_state and vtd_pasid_as
>             as it is not necessary so far; we may want it when we one day
>             introduce an emulated SVA-capable device.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/i386/intel_iommu.c          | 473 +++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  18 ++
>  hw/i386/trace-events           |   1 +
>  include/hw/i386/intel_iommu.h  |  24 +++
>  4 files changed, 516 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 2eb60c3..a7e9973 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
>  #include "kvm_i386.h"
>  #include "migration/vmstate.h"
>  #include "trace.h"
> +#include "qemu/jhash.h"
>  
>  /* context entry operations */
>  #define VTD_CE_GET_RID2PASID(ce) \
> @@ -65,6 +66,8 @@
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> +
>  static void vtd_panic_require_caching_mode(void)
>  {
>      error_report("We need to set caching-mode=on for intel-iommu to enable "
> @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>      vtd_iommu_lock(s);
>      vtd_reset_iotlb_locked(s);
>      vtd_reset_context_cache_locked(s);
> +    vtd_pasid_cache_reset(s);
>      vtd_iommu_unlock(s);
>  }
>  
> @@ -686,6 +690,16 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
>      return true;
>  }
>  
> +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
> +{
> +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> +}
> +
> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
> +{
> +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
> +}
> +
>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>  {
>      return pdire->val & 1;
> @@ -2395,9 +2409,452 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>      return true;
>  }
>  
> +static inline void vtd_init_pasid_key(uint32_t pasid,
> +                                     uint16_t sid,
> +                                     struct pasid_key *key)
> +{
> +    key->pasid = pasid;
> +    key->sid = sid;
> +}
> +
> +static guint vtd_pasid_as_key_hash(gconstpointer v)
> +{
> +    struct pasid_key *key = (struct pasid_key *)v;
> +    uint32_t a, b, c;
> +
> +    /* Jenkins hash */
> +    a = b = c = JHASH_INITVAL + sizeof(*key);
> +    a += key->sid;
> +    b += extract32(key->pasid, 0, 16);
> +    c += extract32(key->pasid, 16, 16);
> +
> +    __jhash_mix(a, b, c);
> +    __jhash_final(a, b, c);
> +
> +    return c;
> +}
> +
> +static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    const struct pasid_key *k1 = v1;
> +    const struct pasid_key *k2 = v2;
> +
> +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
> +}
> +
> +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> +                                            uint8_t bus_num,
> +                                            uint8_t devfn,
> +                                            uint32_t pasid,
> +                                            VTDPASIDEntry *pe)
> +{
> +    VTDContextEntry ce;
> +    int ret;
> +    dma_addr_t pasid_dir_base;
> +
> +    if (!s->root_scalable) {
> +        return -VTD_FR_PASID_TABLE_INV;
> +    }
> +
> +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> +    ret = vtd_get_pe_from_pasid_table(s,
> +                                  pasid_dir_base, pasid, pe);
> +
> +    return ret;
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> +    return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/**
> + * This function fills in the pasid entry in &vtd_pasid_as. Caller
> + * of this function should hold iommu_lock.
> + */
> +static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
> +                                 VTDPASIDAddressSpace *vtd_pasid_as,
> +                                 VTDPASIDEntry *pe)
> +{
> +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> +
> +    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> +        /* No need to go further as cached pasid entry is latest */
> +        return;
> +    }
> +
> +    pc_entry->pasid_entry = *pe;
> +    /*
> +     * TODO:
> +     * - send pasid bind to host for passthru devices
> +     */
> +}
> +
> +/**
> + * This function is used to clear cached pasid entry in vtd_pasid_as
> + * instances. Caller of this function should hold iommu_lock.
> + */
> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> +                                gpointer user_data)
> +{
> +    VTDPASIDCacheInfo *pc_info = user_data;
> +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> +    VTDPASIDEntry pe;
> +    uint16_t did;
> +    uint32_t pasid;
> +    uint16_t devfn;
> +    int ret;
> +
> +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> +    pasid = vtd_pasid_as->pasid;
> +    devfn = vtd_pasid_as->devfn;
> +
> +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> +    case VTD_PASID_CACHE_FORCE_RESET:
> +        goto remove;
> +    case VTD_PASID_CACHE_PASIDSI:
> +        if (pc_info->pasid != pasid) {
> +            return false;
> +        }
> +        /* Fall through */
> +    case VTD_PASID_CACHE_DOMSI:
> +        if (pc_info->domain_id != did) {
> +            return false;
> +        }
> +        /* Fall through */
> +    case VTD_PASID_CACHE_GLOBAL:
> +        break;
> +    default:
> +        error_report("invalid pc_info->flags");
> +        abort();
> +    }
> +
> +    /*
> +     * A pasid cache invalidation may indicate a present pasid
> +     * entry being modified into another present pasid entry. To cover
> +     * such a case, the vIOMMU emulator needs to fetch the latest guest
> +     * pasid entry, check it against the cached pasid entry, then update
> +     * the pasid cache and send a pasid bind/unbind to the host properly.
> +     */
> +    ret = vtd_dev_get_pe_from_pasid(s, pci_bus_num(vtd_bus->bus),
> +                                    devfn, pasid, &pe);
> +    if (ret) {
> +        /*
> +         * No valid pasid entry in guest memory. e.g. pasid entry
> +         * was modified to be either all-zero or non-present. Either
> +         * case means existing pasid cache should be removed.
> +         */
> +        goto remove;
> +    }
> +
> +    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
> +    /*
> +     * TODO:
> +     * - when the pasid-based-iotlb (piotlb) infrastructure is ready,
> +     *   should invalidate the QEMU piotlb together with this change.
> +     */
> +    return false;
> +remove:
> +    /*
> +     * TODO:
> +     * - send pasid bind to host for passthru devices
> +     * - when the pasid-based-iotlb (piotlb) infrastructure is ready,
> +     *   should invalidate the QEMU piotlb together with this change.
> +     */
> +    return true;
> +}
> +
> +/**
> + * This function finds or adds a VTDPASIDAddressSpace for a device
> + * when it is bound to a pasid. Caller of this function should hold
> + * iommu_lock.
> + */
> +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> +                                                   VTDBus *vtd_bus,
> +                                                   int devfn,
> +                                                   uint32_t pasid)
> +{
> +    struct pasid_key key;
> +    struct pasid_key *new_key;
> +    VTDPASIDAddressSpace *vtd_pasid_as;
> +    uint16_t sid;
> +
> +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> +    vtd_init_pasid_key(pasid, sid, &key);
> +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> +
> +    if (!vtd_pasid_as) {
> +        new_key = g_malloc0(sizeof(*new_key));
> +        vtd_init_pasid_key(pasid, sid, new_key);
> +        /*
> +         * Initiate the vtd_pasid_as structure.
> +         *
> +         * This structure is used to track the guest pasid
> +         * binding and also serves as the pasid-cache management entry.
> +         *
> +         * TODO: in the future, if we want to support SVA-aware DMA
> +         *       emulation, the vtd_pasid_as should include an
> +         *       AddressSpace to support DMA emulation.
> +         */
> +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> +        vtd_pasid_as->iommu_state = s;
> +        vtd_pasid_as->vtd_bus = vtd_bus;
> +        vtd_pasid_as->devfn = devfn;
> +        vtd_pasid_as->pasid = pasid;
> +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> +    }
> +    return vtd_pasid_as;
> +}
> +
> +/**
> + * Constant information used during pasid table walk
> + * @vtd_bus, @devfn: device info
> + * @flags: indicates if it is domain selective walk
> + * @did: domain ID of the pasid table walk
> + */
> +typedef struct {
> +    VTDBus *vtd_bus;
> +    uint16_t devfn;
> +#define VTD_PASID_TABLE_DID_SEL_WALK   (1ULL << 0)
> +    uint32_t flags;
> +    uint16_t did;
> +} vtd_pasid_table_walk_info;
> +
> +/**
> + * Caller of this function should hold iommu_lock.
> + */
> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> +                                        dma_addr_t pt_base,
> +                                        int start,
> +                                        int end,
> +                                        vtd_pasid_table_walk_info *info)
> +{
> +    VTDPASIDEntry pe;
> +    int pasid = start;
> +    int pasid_next;
> +    VTDPASIDAddressSpace *vtd_pasid_as;
> +
> +    while (pasid < end) {
> +        pasid_next = pasid + 1;
> +
> +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> +            && vtd_pe_present(&pe)) {
> +            vtd_pasid_as = vtd_add_find_pasid_as(s,
> +                                       info->vtd_bus, info->devfn, pasid);
> +            if ((info->flags & VTD_PASID_TABLE_DID_SEL_WALK) &&
> +                !(info->did == vtd_pe_get_domain_id(&pe))) {
> +                pasid = pasid_next;
> +                continue;
> +            }
> +            vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
> +        }
> +        pasid = pasid_next;
> +    }
> +}
> +
> +/*
> + * Currently, the VT-d scalable mode pasid table is a two-level table;
> + * this function aims to loop over a range of PASIDs in a given pasid
> + * table to identify the pasid config in the guest.
> + * Caller of this function should hold iommu_lock.
> + */
> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> +                                    dma_addr_t pdt_base,
> +                                    int start,
> +                                    int end,
> +                                    vtd_pasid_table_walk_info *info)
> +{
> +    VTDPASIDDirEntry pdire;
> +    int pasid = start;
> +    int pasid_next;
> +    dma_addr_t pt_base;
> +
> +    while (pasid < end) {
> +        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
> +                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
> +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> +            && vtd_pdire_present(&pdire)) {
> +            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> +        }
> +        pasid = pasid_next;
> +    }
> +}
> +
> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> +                                          int start, int end,
> +                                          vtd_pasid_table_walk_info *info)
> +{
> +    VTDContextEntry ce;
> +    int bus_n, devfn;
> +
> +    bus_n = pci_bus_num(info->vtd_bus->bus);
> +    devfn = info->devfn;
> +
> +    if (!vtd_dev_to_context_entry(s, bus_n, devfn, &ce)) {
> +        uint32_t max_pasid;
> +
> +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
> +        if (end > max_pasid) {
> +            end = max_pasid;
> +        }
> +        vtd_sm_pasid_table_walk(s,
> +                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
> +                                start,
> +                                end,
> +                                info);
> +    }
> +}
> +
> +/**
> + * This function replays the guest pasid bindings to the host by
> + * walking the guest PASID table. This ensures the host will have
> + * the latest guest pasid bindings. Caller should hold iommu_lock.
> + */
> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> +                                            VTDPASIDCacheInfo *pc_info)
> +{
> +    VTDHostIOMMUContext *vtd_dev_icx;
> +    int start = 0, end = VTD_HPASID_MAX;
> +    vtd_pasid_table_walk_info walk_info = {.flags = 0};

So vtd_pasid_table_walk_info is still used.  I thought we had reached
a consensus that this can be dropped?

> +
> +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        start = pc_info->pasid;
> +        end = pc_info->pasid + 1;
> +        /*
> +         * PASID selective invalidation is within domain,
> +         * thus fall through.
> +         */
> +    case VTD_PASID_CACHE_DOMSI:
> +        walk_info.did = pc_info->domain_id;
> +        walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
> +        /* loop all assigned devices */
> +        break;
> +    case VTD_PASID_CACHE_FORCE_RESET:
> +        /* For force reset, no need to go further replay */
> +        return;
> +    case VTD_PASID_CACHE_GLOBAL:
> +        break;
> +    default:
> +        error_report("%s, invalid pc_info->flags", __func__);
> +        abort();
> +    }
> +
> +    /*
> +     * In this replay, we only need to care about the devices which
> +     * are backed by a host IOMMU. For such devices, their vtd_dev_icx
> +     * instances are in the s->vtd_dev_icx_list. For devices which
> +     * are not backed by a host IOMMU, it is not necessary to replay
> +     * the bindings since their cache can be re-created in future
> +     * DMA address translation.
> +     */
> +    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
> +        walk_info.vtd_bus = vtd_dev_icx->vtd_bus;
> +        walk_info.devfn = vtd_dev_icx->devfn;
> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> +    }
> +}
> +
> +/**
> + * This function syncs the pasid bindings between guest and host.
> + * It includes updating the pasid cache in vIOMMU and updating the
> + * pasid bindings per guest's latest pasid entry presence.
> + */
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> +                                 VTDPASIDCacheInfo *pc_info)
> +{
> +    /*
> +     * Regarding a pasid cache invalidation, e.g. a PSI,
> +     * it could be any of the cases below:
> +     * a) a present pasid entry moved to non-present
> +     * b) a present pasid entry to be a present entry
> +     * c) a non-present pasid entry moved to present
> +     *
> +     * Different invalidation granularity may affect different device
> +     * scope and pasid scope. But for each invalidation granularity,
> +     * it needs to do two steps to sync host and guest pasid binding.
> +     *
> +     * Here is the handling of a PSI:
> +     * 1) loop all the existing vtd_pasid_as instances to update them
> +     *    according to the latest guest pasid entry in the pasid table.
> +     *    This will make sure affected existing vtd_pasid_as instances
> +     *    cache the latest pasid entries. Also, during the loop, the
> +     *    host should be notified if needed, e.g. pasid unbind or pasid
> +     *    update. This should cover case a) and case b).
> +     *
> +     * 2) loop all devices to cover case c)
> +     *    - For devices which have HostIOMMUContext instances,
> +     *      we loop them and check if guest pasid entry exists. If yes,
> +     *      it is case c), we update the pasid cache and also notify
> +     *      host.
> +     *    - For devices which have no HostIOMMUContext, it is not
> +     *      necessary to create pasid cache at this phase since it
> +     *      could be created when vIOMMU does DMA address translation.
> +     *      This is not yet implemented since there is no emulated
> +     *      pasid-capable devices today. If we have such devices in
> +     *      future, the pasid cache shall be created there.
> +     * Other granularities follow the same steps, just with a different scope
> +     *
> +     */
> +
> +    vtd_iommu_lock(s);
> +    /* Step 1: loop all the existing vtd_pasid_as instances */
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                                vtd_flush_pasid, pc_info);

OK the series is evolving along with our discussions, and /me too on
understanding your series... Now I'm not very sure whether this
operation is still useful...

The major point is you'll need to do pasid table walk for all the
registered devices below.  So IIUC vtd_replay_guest_pasid_bindings()
will be able to also detect addition, removal or modification of pasid
address spaces.  Am I right?

If this can be dropped, then vtd_flush_pasid() will only be used below
for device reset, and it can be greatly simplified - just UNBIND every
address space we have.
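
A rough sketch of that simplified reset path, assuming the flush step above
can indeed be dropped (hypothetical, not part of the posted patch):

    /* Unconditionally unbind and drop one cached PASID address space. */
    static gboolean vtd_pasid_as_destroy(gpointer key, gpointer value,
                                         gpointer user_data)
    {
        VTDPASIDAddressSpace *vtd_pasid_as = value;

        /* An UNBIND would be sent to the host here for passthru devices. */
        (void)vtd_pasid_as;
        return true;    /* returning true removes and frees the entry */
    }

    /* Caller of this function should hold iommu_lock. */
    static void vtd_pasid_cache_reset(IntelIOMMUState *s)
    {
        g_hash_table_foreach_remove(s->vtd_pasid_as, vtd_pasid_as_destroy, NULL);
    }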

> +
> +    /*
> +     * Step 2: loop all the existing vtd_dev_icx instances.
> +     * Ideally, we need to loop all devices to find if there is any new
> +     * PASID binding with regard to the PASID cache invalidation request.
> +     * But it is enough to loop the devices which are backed by a host
> +     * IOMMU. For devices backed by the vIOMMU (a.k.a. emulated devices),
> +     * if a new PASID shows up on them, their vtd_pasid_as instance can
> +     * be created during future vIOMMU DMA translation.
> +     */
> +    vtd_replay_guest_pasid_bindings(s, pc_info);
> +    vtd_iommu_unlock(s);
> +}
> +
> +/**
> + * Caller of this function should hold iommu_lock
> + */
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_reset();
> +
> +    pc_info.flags = VTD_PASID_CACHE_FORCE_RESET;
> +
> +    /*
> +     * Resetting the pasid cache is a big hammer, so use
> +     * g_hash_table_foreach_remove which will free
> +     * the vtd_pasid_as instances. Also, as a big
> +     * hammer, use VTD_PASID_CACHE_FORCE_RESET to
> +     * ensure all the vtd_pasid_as instances are
> +     * dropped, meanwhile the change will be passed
> +     * to the host if a HostIOMMUContext is available.
> +     */
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                                vtd_flush_pasid, &pc_info);
> +}
> +
>  static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>                                     VTDInvDesc *inv_desc)
>  {
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +    VTDPASIDCacheInfo pc_info;
> +
>      if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
>          (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
>          (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
> @@ -2407,14 +2864,26 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>          return false;
>      }
>  
> +    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
> +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
> +
>      switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
>      case VTD_INV_DESC_PASIDC_DSI:
> +        trace_vtd_pasid_cache_dsi(domain_id);
> +        pc_info.flags = VTD_PASID_CACHE_DOMSI;
> +        pc_info.domain_id = domain_id;
>          break;
>  
>      case VTD_INV_DESC_PASIDC_PASID_SI:
> +        /* PASID selective implies a DID selective */
> +        pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> +        pc_info.domain_id = domain_id;
> +        pc_info.pasid = pasid;
>          break;
>  
>      case VTD_INV_DESC_PASIDC_GLOBAL:
> +        trace_vtd_pasid_cache_gsi();
> +        pc_info.flags = VTD_PASID_CACHE_GLOBAL;
>          break;
>  
>      default:
> @@ -2423,6 +2892,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>          return false;
>      }
>  
> +    vtd_pasid_cache_sync(s, &pc_info);
>      return true;
>  }
>  
> @@ -4085,6 +4555,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>                                       g_free, g_free);
>      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
>                                                g_free, g_free);
> +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> +                                            vtd_pasid_as_key_equal,
> +                                            g_free, g_free);
>      vtd_init(s);
>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
>      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 9a76f20..451ef4c 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -307,6 +307,7 @@ typedef enum VTDFaultReason {
>      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>  
>      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
>  
>      /* This is not a normal fault reason. We use this to indicate some faults
>       * that are not referenced by the VT-d specification.
> @@ -511,10 +512,26 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>  
> +#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x3)
>  #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>  
> +struct VTDPASIDCacheInfo {
> +#define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
> +#define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
> +#define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
> +#define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
> +    uint32_t flags;
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +};
> +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET | \
> +                                      VTD_PASID_CACHE_GLOBAL  | \
> +                                      VTD_PASID_CACHE_DOMSI  | \
> +                                      VTD_PASID_CACHE_PASIDSI)

I think this is not needed at all?  The naming "flags" is confusing
too because it's not really a bitmap but an enum.  How about dropping
this and renaming "flags" to "type"?

> +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> +
>  /* PASID Table Related Definitions */
>  #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>  #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -526,6 +543,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
>  #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
>  #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
>  
>  /* PASID Granular Translation Type Mask */
>  #define VTD_PASID_ENTRY_P              1ULL
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index f7cd4e5..60d20c1 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -23,6 +23,7 @@ vtd_inv_qi_tail(uint16_t head) "write tail %d"
>  vtd_inv_qi_fetch(void) ""
>  vtd_context_cache_reset(void) ""
>  vtd_pasid_cache_gsi(void) ""
> +vtd_pasid_cache_reset(void) ""
>  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
>  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
>  vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 42a58d6..626c1cd 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -65,6 +65,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>  typedef struct VTDPASIDEntry VTDPASIDEntry;
>  typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
> +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> @@ -97,6 +99,26 @@ struct VTDPASIDEntry {
>      uint64_t val[8];
>  };
>  
> +struct pasid_key {
> +    uint32_t pasid;
> +    uint16_t sid;
> +};
> +
> +struct VTDPASIDCacheEntry {
> +    struct VTDPASIDEntry pasid_entry;
> +};
> +
> +struct VTDPASIDAddressSpace {
> +    VTDBus *vtd_bus;
> +    uint8_t devfn;
> +    AddressSpace as;

Can this be dropped?

> +    uint32_t pasid;
> +    IntelIOMMUState *iommu_state;
> +    VTDContextCacheEntry context_cache_entry;

Can this be dropped too?

> +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> +    VTDPASIDCacheEntry pasid_cache_entry;
> +};
> +
>  struct VTDAddressSpace {
>      PCIBus *bus;
>      uint8_t devfn;
> @@ -267,6 +289,7 @@ struct IntelIOMMUState {
>  
>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> +    GHashTable *vtd_pasid_as;       /* VTDPASIDAddressSpace instances */
>      /* list of registered notifiers */
>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>  
> @@ -292,6 +315,7 @@ struct IntelIOMMUState {
>       * - per-IOMMU IOTLB caches
>       * - context entry cache in VTDAddressSpace
>       * - HostIOMMUContext pointer cached in vIOMMU
> +     * - PASID cache in VTDPASIDAddressSpace
>       */
>      QemuMutex iommu_lock;
>  };
> -- 
> 2.7.4
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
@ 2020-04-02  0:02     ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-02  0:02 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jean-philippe, kevin.tian, Jacob Pan, Yi Sun, Eduardo Habkost,
	kvm, mst, jun.j.tian, qemu-devel, eric.auger, alex.williamson,
	pbonzini, hao.wu, yi.y.sun, Richard Henderson, david

On Sun, Mar 29, 2020 at 09:24:52PM -0700, Liu Yi L wrote:
> This patch adds a PASID cache management infrastructure based on
> newly added structure VTDPASIDAddressSpace, which is used to track
> the PASID usage and future PASID tagged DMA address translation
> support in vIOMMU.
> 
>     struct VTDPASIDAddressSpace {
>         VTDBus *vtd_bus;
>         uint8_t devfn;
>         AddressSpace as;
>         uint32_t pasid;
>         IntelIOMMUState *iommu_state;
>         VTDContextCacheEntry context_cache_entry;
>         QLIST_ENTRY(VTDPASIDAddressSpace) next;
>         VTDPASIDCacheEntry pasid_cache_entry;
>     };
> 
> Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> is bound with a DMA AddressSpace. Intel VT-d spec requires guest
> software to issue pasid cache invalidation when binding or unbinding a
> pasid with an address space under caching-mode. However, as
> VTDPASIDAddressSpace instances also act as pasid cache in this
> implementation, its creation also happens during vIOMMU PASID
> tagged DMA translation. The creation in this path will not be
> added in this patch since there are no PASID-capable emulated
> devices for now.
> 
> The implementation in this patch manages VTDPASIDAddressSpace
> instances per PASID+BDF (lookup and insert will use PASID and
> BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
> guest binds a PASID with an AddressSpace, QEMU will capture the
> guest pasid selective pasid cache invalidation, and allocate or
> remove a VTDPASIDAddressSpace instance per the invalidation
> reasons:
> 
>     *) a present pasid entry moved to non-present
>     *) a present pasid entry to be a present entry
>     *) a non-present pasid entry moved to present
> 
> vIOMMU emulator could figure out the reason by fetching latest
> guest pasid entry.
> 
> v1 -> v2: - merged this patch with former replay binding patch, makes
>             PSI/DSI/GSI use the unified function to do cache invalidation
>             and pasid binding replay.
>           - dropped pasid_cache_gen in both iommu_state and vtd_pasid_as
>             as it is not necessary so far, we may want it when one day
>             introduce an emulated SVA-capable device.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/i386/intel_iommu.c          | 473 +++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  18 ++
>  hw/i386/trace-events           |   1 +
>  include/hw/i386/intel_iommu.h  |  24 +++
>  4 files changed, 516 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 2eb60c3..a7e9973 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
>  #include "kvm_i386.h"
>  #include "migration/vmstate.h"
>  #include "trace.h"
> +#include "qemu/jhash.h"
>  
>  /* context entry operations */
>  #define VTD_CE_GET_RID2PASID(ce) \
> @@ -65,6 +66,8 @@
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> +
>  static void vtd_panic_require_caching_mode(void)
>  {
>      error_report("We need to set caching-mode=on for intel-iommu to enable "
> @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>      vtd_iommu_lock(s);
>      vtd_reset_iotlb_locked(s);
>      vtd_reset_context_cache_locked(s);
> +    vtd_pasid_cache_reset(s);
>      vtd_iommu_unlock(s);
>  }
>  
> @@ -686,6 +690,16 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
>      return true;
>  }
>  
> +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
> +{
> +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> +}
> +
> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
> +{
> +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
> +}
> +
>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>  {
>      return pdire->val & 1;
> @@ -2395,9 +2409,452 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>      return true;
>  }
>  
> +static inline void vtd_init_pasid_key(uint32_t pasid,
> +                                     uint16_t sid,
> +                                     struct pasid_key *key)
> +{
> +    key->pasid = pasid;
> +    key->sid = sid;
> +}
> +
> +static guint vtd_pasid_as_key_hash(gconstpointer v)
> +{
> +    struct pasid_key *key = (struct pasid_key *)v;
> +    uint32_t a, b, c;
> +
> +    /* Jenkins hash */
> +    a = b = c = JHASH_INITVAL + sizeof(*key);
> +    a += key->sid;
> +    b += extract32(key->pasid, 0, 16);
> +    c += extract32(key->pasid, 16, 16);
> +
> +    __jhash_mix(a, b, c);
> +    __jhash_final(a, b, c);
> +
> +    return c;
> +}
> +
> +static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    const struct pasid_key *k1 = v1;
> +    const struct pasid_key *k2 = v2;
> +
> +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
> +}
> +
> +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> +                                            uint8_t bus_num,
> +                                            uint8_t devfn,
> +                                            uint32_t pasid,
> +                                            VTDPASIDEntry *pe)
> +{
> +    VTDContextEntry ce;
> +    int ret;
> +    dma_addr_t pasid_dir_base;
> +
> +    if (!s->root_scalable) {
> +        return -VTD_FR_PASID_TABLE_INV;
> +    }
> +
> +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> +    ret = vtd_get_pe_from_pasid_table(s,
> +                                  pasid_dir_base, pasid, pe);
> +
> +    return ret;
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> +    return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/**
> + * This function fills in the pasid entry in &vtd_pasid_as. Caller
> + * of this function should hold iommu_lock.
> + */
> +static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
> +                                 VTDPASIDAddressSpace *vtd_pasid_as,
> +                                 VTDPASIDEntry *pe)
> +{
> +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> +
> +    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> +        /* No need to go further as cached pasid entry is latest */
> +        return;
> +    }
> +
> +    pc_entry->pasid_entry = *pe;
> +    /*
> +     * TODO:
> +     * - send pasid bind to host for passthru devices
> +     */
> +}
> +
> +/**
> + * This function is used to clear cached pasid entry in vtd_pasid_as
> + * instances. Caller of this function should hold iommu_lock.
> + */
> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> +                                gpointer user_data)
> +{
> +    VTDPASIDCacheInfo *pc_info = user_data;
> +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> +    VTDPASIDEntry pe;
> +    uint16_t did;
> +    uint32_t pasid;
> +    uint16_t devfn;
> +    int ret;
> +
> +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> +    pasid = vtd_pasid_as->pasid;
> +    devfn = vtd_pasid_as->devfn;
> +
> +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> +    case VTD_PASID_CACHE_FORCE_RESET:
> +        goto remove;
> +    case VTD_PASID_CACHE_PASIDSI:
> +        if (pc_info->pasid != pasid) {
> +            return false;
> +        }
> +        /* Fall through */
> +    case VTD_PASID_CACHE_DOMSI:
> +        if (pc_info->domain_id != did) {
> +            return false;
> +        }
> +        /* Fall through */
> +    case VTD_PASID_CACHE_GLOBAL:
> +        break;
> +    default:
> +        error_report("invalid pc_info->flags");
> +        abort();
> +    }
> +
> +    /*
> +     * pasid cache invalidation may indicate a present pasid
> +     * entry being modified to another present pasid entry. To cover
> +     * such a case, the vIOMMU emulator needs to fetch the latest
> +     * guest pasid entry and check the cached pasid entry, then update
> +     * the pasid cache and send pasid bind/unbind to the host properly.
> +     */
> +    ret = vtd_dev_get_pe_from_pasid(s, pci_bus_num(vtd_bus->bus),
> +                                    devfn, pasid, &pe);
> +    if (ret) {
> +        /*
> +         * No valid pasid entry in guest memory. e.g. pasid entry
> +         * was modified to be either all-zero or non-present. Either
> +         * case means existing pasid cache should be removed.
> +         */
> +        goto remove;
> +    }
> +
> +    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
> +    /*
> +     * TODO:
> +     * - when pasid-based-iotlb (piotlb) infrastructure is ready,
> +     *   should invalidate QEMU piotlb together with this change.
> +     */
> +    return false;
> +remove:
> +    /*
> +     * TODO:
> +     * - send pasid bind to host for passthru devices
> +     * - when pasid-based-iotlb (piotlb) infrastructure is ready,
> +     *   should invalidate QEMU piotlb together with this change.
> +     */
> +    return true;
> +}
> +
> +/**
> + * This function finds or adds a VTDPASIDAddressSpace for a device
> + * when it is bound to a pasid. Caller of this function should hold
> + * iommu_lock.
> + */
> +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> +                                                   VTDBus *vtd_bus,
> +                                                   int devfn,
> +                                                   uint32_t pasid)
> +{
> +    struct pasid_key key;
> +    struct pasid_key *new_key;
> +    VTDPASIDAddressSpace *vtd_pasid_as;
> +    uint16_t sid;
> +
> +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> +    vtd_init_pasid_key(pasid, sid, &key);
> +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> +
> +    if (!vtd_pasid_as) {
> +        new_key = g_malloc0(sizeof(*new_key));
> +        vtd_init_pasid_key(pasid, sid, new_key);
> +        /*
> +         * Initialize the vtd_pasid_as structure.
> +         *
> +         * This structure here is used to track the guest pasid
> +         * binding and also serves as a pasid-cache management entry.
> +         *
> +         * TODO: in future, if we want to support SVA-aware DMA
> +         *       emulation, the vtd_pasid_as should include an
> +         *       AddressSpace to support DMA emulation.
> +         */
> +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> +        vtd_pasid_as->iommu_state = s;
> +        vtd_pasid_as->vtd_bus = vtd_bus;
> +        vtd_pasid_as->devfn = devfn;
> +        vtd_pasid_as->pasid = pasid;
> +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> +    }
> +    return vtd_pasid_as;
> +}
> +
> +/**
> + * Constant information used during pasid table walk
> + * @vtd_bus, @devfn: device info
> + * @flags: indicates if it is domain selective walk
> + * @did: domain ID of the pasid table walk
> + */
> +typedef struct {
> +    VTDBus *vtd_bus;
> +    uint16_t devfn;
> +#define VTD_PASID_TABLE_DID_SEL_WALK   (1ULL << 0)
> +    uint32_t flags;
> +    uint16_t did;
> +} vtd_pasid_table_walk_info;
> +
> +/**
> + * Caller of this function should hold iommu_lock.
> + */
> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> +                                        dma_addr_t pt_base,
> +                                        int start,
> +                                        int end,
> +                                        vtd_pasid_table_walk_info *info)
> +{
> +    VTDPASIDEntry pe;
> +    int pasid = start;
> +    int pasid_next;
> +    VTDPASIDAddressSpace *vtd_pasid_as;
> +
> +    while (pasid < end) {
> +        pasid_next = pasid + 1;
> +
> +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> +            && vtd_pe_present(&pe)) {
> +            vtd_pasid_as = vtd_add_find_pasid_as(s,
> +                                       info->vtd_bus, info->devfn, pasid);
> +            if ((info->flags & VTD_PASID_TABLE_DID_SEL_WALK) &&
> +                !(info->did == vtd_pe_get_domain_id(&pe))) {
> +                pasid = pasid_next;
> +                continue;
> +            }
> +            vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
> +        }
> +        pasid = pasid_next;
> +    }
> +}
> +
> +/*
> + * Currently, VT-d scalable mode pasid table is a two level table,
> + * this function aims to loop a range of PASIDs in a given pasid
> + * table to identify the pasid config in guest.
> + * Caller of this function should hold iommu_lock.
> + */
> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> +                                    dma_addr_t pdt_base,
> +                                    int start,
> +                                    int end,
> +                                    vtd_pasid_table_walk_info *info)
> +{
> +    VTDPASIDDirEntry pdire;
> +    int pasid = start;
> +    int pasid_next;
> +    dma_addr_t pt_base;
> +
> +    while (pasid < end) {
> +        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
> +                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
> +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> +            && vtd_pdire_present(&pdire)) {
> +            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> +        }
> +        pasid = pasid_next;
> +    }
> +}
> +
> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> +                                          int start, int end,
> +                                          vtd_pasid_table_walk_info *info)
> +{
> +    VTDContextEntry ce;
> +    int bus_n, devfn;
> +
> +    bus_n = pci_bus_num(info->vtd_bus->bus);
> +    devfn = info->devfn;
> +
> +    if (!vtd_dev_to_context_entry(s, bus_n, devfn, &ce)) {
> +        uint32_t max_pasid;
> +
> +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
> +        if (end > max_pasid) {
> +            end = max_pasid;
> +        }
> +        vtd_sm_pasid_table_walk(s,
> +                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
> +                                start,
> +                                end,
> +                                info);
> +    }
> +}
> +
> +/**
> + * This function replays the guest pasid bindings to the host by
> + * walking the guest PASID table. This ensures the host will have
> + * the latest guest pasid bindings. Caller should hold iommu_lock.
> + */
> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> +                                            VTDPASIDCacheInfo *pc_info)
> +{
> +    VTDHostIOMMUContext *vtd_dev_icx;
> +    int start = 0, end = VTD_HPASID_MAX;
> +    vtd_pasid_table_walk_info walk_info = {.flags = 0};

So vtd_pasid_table_walk_info is still used.  I thought we had reached
a consensus that this can be dropped?

> +
> +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        start = pc_info->pasid;
> +        end = pc_info->pasid + 1;
> +        /*
> +         * PASID selective invalidation is within domain,
> +         * thus fall through.
> +         */
> +    case VTD_PASID_CACHE_DOMSI:
> +        walk_info.did = pc_info->domain_id;
> +        walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
> +        /* loop all assigned devices */
> +        break;
> +    case VTD_PASID_CACHE_FORCE_RESET:
> +        /* For force reset, no need to go further replay */
> +        return;
> +    case VTD_PASID_CACHE_GLOBAL:
> +        break;
> +    default:
> +        error_report("%s, invalid pc_info->flags", __func__);
> +        abort();
> +    }
> +
> +    /*
> +     * In this replay, we only need to care about the devices which
> +     * are backed by host IOMMU. For such devices, their vtd_dev_icx
> +     * instances are in the s->vtd_dev_icx_list. For devices which
> +     * are not backed by host IOMMU, it is not necessary to replay
> +     * the bindings since their cache could be re-created in the future
> +     * DMA address translation.
> +     */
> +    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
> +        walk_info.vtd_bus = vtd_dev_icx->vtd_bus;
> +        walk_info.devfn = vtd_dev_icx->devfn;
> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> +    }
> +}
> +
> +/**
> + * This function syncs the pasid bindings between guest and host.
> + * It includes updating the pasid cache in vIOMMU and updating the
> + * pasid bindings per guest's latest pasid entry presence.
> + */
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> +                                 VTDPASIDCacheInfo *pc_info)
> +{
> +    /*
> +     * Regards to a pasid cache invalidation, e.g. a PSI.
> +     * it could be either cases of below:
> +     * a) a present pasid entry moved to non-present
> +     * b) a present pasid entry to be a present entry
> +     * c) a non-present pasid entry moved to present
> +     *
> +     * Different invalidation granularity may affect different device
> +     * scope and pasid scope. But for each invalidation granularity,
> +     * it needs to do two steps to sync host and guest pasid binding.
> +     *
> +     * Here is the handling of a PSI:
> +     * 1) loop all the existing vtd_pasid_as instances to update them
> +     *    according to the latest guest pasid entry in pasid table.
> +     *    this will make sure affected existing vtd_pasid_as instances
> +     *    cached the latest pasid entries. Also, during the loop, the
> +     *    host should be notified if needed. e.g. pasid unbind or pasid
> +     *    update. Should be able to cover case a) and case b).
> +     *
> +     * 2) loop all devices to cover case c)
> +     *    - For devices which have HostIOMMUContext instances,
> +     *      we loop them and check if guest pasid entry exists. If yes,
> +     *      it is case c), we update the pasid cache and also notify
> +     *      host.
> +     *    - For devices which have no HostIOMMUContext, it is not
> +     *      necessary to create pasid cache at this phase since it
> +     *      could be created when vIOMMU does DMA address translation.
> +     *      This is not yet implemented since there is no emulated
> +     *      pasid-capable devices today. If we have such devices in
> +     *      future, the pasid cache shall be created there.
> +     * Other granularity follow the same steps, just with different scope
> +     *
> +     */
> +
> +    vtd_iommu_lock(s);
> +    /* Step 1: loop all the existing vtd_pasid_as instances */
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                                vtd_flush_pasid, pc_info);

OK the series is evolving along with our discussions, and /me too on
understanding your series... Now I'm not very sure whether this
operation is still useful...

The major point is you'll need to do pasid table walk for all the
registered devices below.  So IIUC vtd_replay_guest_pasid_bindings()
will be able to also detect addition, removal or modification of pasid
address spaces.  Am I right?

If this can be dropped, then vtd_flush_pasid() will be only used below
for device reset, and it can be greatly simplified - just UNBIND every
address space we have.
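
To show what I mean (only a sketch, untested; vtd_pasid_as_reset_one() is
a name I just made up here, and the host unbind is left as a TODO because
that part of the series is not settled yet):

    static gboolean vtd_pasid_as_reset_one(gpointer key, gpointer value,
                                           gpointer user_data)
    {
        VTDPASIDAddressSpace *vtd_pasid_as = value;

        /* TODO: UNBIND the pasid from the host for passthru devices */
        (void)vtd_pasid_as;

        /* Always remove; g_hash_table_foreach_remove() frees key/value */
        return true;
    }

    static void vtd_pasid_cache_reset(IntelIOMMUState *s)
    {
        trace_vtd_pasid_cache_reset();
        g_hash_table_foreach_remove(s->vtd_pasid_as,
                                    vtd_pasid_as_reset_one, NULL);
    }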

> +
> +    /*
> +     * Step 2: loop all the existing vtd_dev_icx instances.
> +     * Ideally, needs to loop all devices to find if there is any new
> +     * PASID binding regards to the PASID cache invalidation request.
> +     * But it is enough to loop the devices which are backed by host
> +     * IOMMU. For devices backed by vIOMMU (a.k.a emulated devices),
> +     * if new PASID happened on them, their vtd_pasid_as instance could
> +     * be created during future vIOMMU DMA translation.
> +     */
> +    vtd_replay_guest_pasid_bindings(s, pc_info);
> +    vtd_iommu_unlock(s);
> +}
> +
> +/**
> + * Caller of this function should hold iommu_lock
> + */
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_reset();
> +
> +    pc_info.flags = VTD_PASID_CACHE_FORCE_RESET;
> +
> +    /*
> +     * Resetting the pasid cache is a big hammer, so use
> +     * g_hash_table_foreach_remove which will free
> +     * the vtd_pasid_as instances. Also, as a big
> +     * hammer, use VTD_PASID_CACHE_FORCE_RESET to
> +     * ensure all the vtd_pasid_as instances are
> +     * dropped, meanwhile the change will be passed
> +     * to the host if HostIOMMUContext is available.
> +     */
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                                vtd_flush_pasid, &pc_info);
> +}
> +
>  static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>                                     VTDInvDesc *inv_desc)
>  {
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +    VTDPASIDCacheInfo pc_info;
> +
>      if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
>          (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
>          (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
> @@ -2407,14 +2864,26 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>          return false;
>      }
>  
> +    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
> +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
> +
>      switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
>      case VTD_INV_DESC_PASIDC_DSI:
> +        trace_vtd_pasid_cache_dsi(domain_id);
> +        pc_info.flags = VTD_PASID_CACHE_DOMSI;
> +        pc_info.domain_id = domain_id;
>          break;
>  
>      case VTD_INV_DESC_PASIDC_PASID_SI:
> +        /* PASID selective implies a DID selective */
> +        pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> +        pc_info.domain_id = domain_id;
> +        pc_info.pasid = pasid;
>          break;
>  
>      case VTD_INV_DESC_PASIDC_GLOBAL:
> +        trace_vtd_pasid_cache_gsi();
> +        pc_info.flags = VTD_PASID_CACHE_GLOBAL;
>          break;
>  
>      default:
> @@ -2423,6 +2892,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>          return false;
>      }
>  
> +    vtd_pasid_cache_sync(s, &pc_info);
>      return true;
>  }
>  
> @@ -4085,6 +4555,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>                                       g_free, g_free);
>      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
>                                                g_free, g_free);
> +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> +                                            vtd_pasid_as_key_equal,
> +                                            g_free, g_free);
>      vtd_init(s);
>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
>      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 9a76f20..451ef4c 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -307,6 +307,7 @@ typedef enum VTDFaultReason {
>      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>  
>      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
>  
>      /* This is not a normal fault reason. We use this to indicate some faults
>       * that are not referenced by the VT-d specification.
> @@ -511,10 +512,26 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>  
> +#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x3)
>  #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>  
> +struct VTDPASIDCacheInfo {
> +#define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
> +#define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
> +#define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
> +#define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
> +    uint32_t flags;
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +};
> +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET | \
> +                                      VTD_PASID_CACHE_GLOBAL  | \
> +                                      VTD_PASID_CACHE_DOMSI  | \
> +                                      VTD_PASID_CACHE_PASIDSI)

I think this is not needed at all?  The naming "flags" is confusing
too because it's not really a bitmap but an enum.  How about drop this
and rename "flags" to "type"?

> +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> +
>  /* PASID Table Related Definitions */
>  #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>  #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -526,6 +543,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
>  #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
>  #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
>  
>  /* PASID Granular Translation Type Mask */
>  #define VTD_PASID_ENTRY_P              1ULL
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index f7cd4e5..60d20c1 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -23,6 +23,7 @@ vtd_inv_qi_tail(uint16_t head) "write tail %d"
>  vtd_inv_qi_fetch(void) ""
>  vtd_context_cache_reset(void) ""
>  vtd_pasid_cache_gsi(void) ""
> +vtd_pasid_cache_reset(void) ""
>  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
>  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
>  vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 42a58d6..626c1cd 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -65,6 +65,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>  typedef struct VTDPASIDEntry VTDPASIDEntry;
>  typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
> +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> @@ -97,6 +99,26 @@ struct VTDPASIDEntry {
>      uint64_t val[8];
>  };
>  
> +struct pasid_key {
> +    uint32_t pasid;
> +    uint16_t sid;
> +};
> +
> +struct VTDPASIDCacheEntry {
> +    struct VTDPASIDEntry pasid_entry;
> +};
> +
> +struct VTDPASIDAddressSpace {
> +    VTDBus *vtd_bus;
> +    uint8_t devfn;
> +    AddressSpace as;

Can this be dropped?

> +    uint32_t pasid;
> +    IntelIOMMUState *iommu_state;
> +    VTDContextCacheEntry context_cache_entry;

Can this be dropped too?
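
If both can indeed go away, the struct would shrink to something like the
below (just to illustrate what I'm asking, not a requirement):

    struct VTDPASIDAddressSpace {
        VTDBus *vtd_bus;
        uint8_t devfn;
        uint32_t pasid;
        IntelIOMMUState *iommu_state;
        QLIST_ENTRY(VTDPASIDAddressSpace) next;
        VTDPASIDCacheEntry pasid_cache_entry;
    };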

> +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> +    VTDPASIDCacheEntry pasid_cache_entry;
> +};
> +
>  struct VTDAddressSpace {
>      PCIBus *bus;
>      uint8_t devfn;
> @@ -267,6 +289,7 @@ struct IntelIOMMUState {
>  
>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> +    GHashTable *vtd_pasid_as;       /* VTDPASIDAddressSpace instances */
>      /* list of registered notifiers */
>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>  
> @@ -292,6 +315,7 @@ struct IntelIOMMUState {
>       * - per-IOMMU IOTLB caches
>       * - context entry cache in VTDAddressSpace
>       * - HostIOMMUContext pointer cached in vIOMMU
> +     * - PASID cache in VTDPASIDAddressSpace
>       */
>      QemuMutex iommu_lock;
>  };
> -- 
> 2.7.4
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
  2020-04-02  0:02     ` Peter Xu
@ 2020-04-02  6:46       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-02  6:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

Hi Peter,

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 2, 2020 8:02 AM
> Subject: Re: [PATCH v2 13/22] intel_iommu: add PASID cache management
> infrastructure
> 
> On Sun, Mar 29, 2020 at 09:24:52PM -0700, Liu Yi L wrote:
> > This patch adds a PASID cache management infrastructure based on newly
> > added structure VTDPASIDAddressSpace, which is used to track the PASID
> > usage and future PASID tagged DMA address translation support in
> > vIOMMU.
> >
> >     struct VTDPASIDAddressSpace {
> >         VTDBus *vtd_bus;
> >         uint8_t devfn;
> >         AddressSpace as;
> >         uint32_t pasid;
> >         IntelIOMMUState *iommu_state;
> >         VTDContextCacheEntry context_cache_entry;
> >         QLIST_ENTRY(VTDPASIDAddressSpace) next;
> >         VTDPASIDCacheEntry pasid_cache_entry;
> >     };
> >
> > Ideally, a VTDPASIDAddressSpace instance is created when a PASID is
> > bound with a DMA AddressSpace. Intel VT-d spec requires guest software
> > to issue pasid cache invalidation when binding or unbinding a pasid with an
> > address space under caching-mode. However, as VTDPASIDAddressSpace
> > instances also act as pasid cache in this implementation, its creation
> > also happens during vIOMMU PASID tagged DMA translation. The creation
> > in this path will not be added in this patch since there are no
> > PASID-capable emulated devices for now.
> >
> > The implementation in this patch manages VTDPASIDAddressSpace
> > instances per PASID+BDF (lookup and insert will use PASID and
> > BDF) since Intel VT-d spec allows per-BDF PASID Table. When a guest
> > binds a PASID with an AddressSpace, QEMU will capture the guest pasid
> > selective pasid cache invalidation, and allocate or remove a
> > VTDPASIDAddressSpace instance per the invalidation
> > reasons:
> >
> >     *) a present pasid entry moved to non-present
> >     *) a present pasid entry to be a present entry
> >     *) a non-present pasid entry moved to present
> >
> > vIOMMU emulator could figure out the reason by fetching latest guest
> > pasid entry.
> >
> > v1 -> v2: - merged this patch with former replay binding patch, makes
> >             PSI/DSI/GSI use the unified function to do cache invalidation
> >             and pasid binding replay.
> >           - dropped pasid_cache_gen in both iommu_state and vtd_pasid_as
> >             as it is not necessary so far, we may want it when one day
> >             introduce an emulated SVA-capable device.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/i386/intel_iommu.c          | 473
> +++++++++++++++++++++++++++++++++++++++++
> >  hw/i386/intel_iommu_internal.h |  18 ++
> >  hw/i386/trace-events           |   1 +
> >  include/hw/i386/intel_iommu.h  |  24 +++
> >  4 files changed, 516 insertions(+)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> > 2eb60c3..a7e9973 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -40,6 +40,7 @@
> >  #include "kvm_i386.h"
> >  #include "migration/vmstate.h"
> >  #include "trace.h"
> > +#include "qemu/jhash.h"
> >
> >  /* context entry operations */
> >  #define VTD_CE_GET_RID2PASID(ce) \
> > @@ -65,6 +66,8 @@
> >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> > static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> > *n);
> >
> > +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > +
> >  static void vtd_panic_require_caching_mode(void)
> >  {
> >      error_report("We need to set caching-mode=on for intel-iommu to enable "
> > @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
> >      vtd_iommu_lock(s);
> >      vtd_reset_iotlb_locked(s);
> >      vtd_reset_context_cache_locked(s);
> > +    vtd_pasid_cache_reset(s);
> >      vtd_iommu_unlock(s);
> >  }
> >
> > @@ -686,6 +690,16 @@ static inline bool vtd_pe_type_check(X86IOMMUState
> *x86_iommu,
> >      return true;
> >  }
> >
> > +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe) {
> > +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> > +}
> > +
> > +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry
> > +*ce) {
> > +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7); }
> > +
> >  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)  {
> >      return pdire->val & 1;
> > @@ -2395,9 +2409,452 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState
> *s, VTDInvDesc *inv_desc)
> >      return true;
> >  }
> >
> > +static inline void vtd_init_pasid_key(uint32_t pasid,
> > +                                     uint16_t sid,
> > +                                     struct pasid_key *key) {
> > +    key->pasid = pasid;
> > +    key->sid = sid;
> > +}
> > +
> > +static guint vtd_pasid_as_key_hash(gconstpointer v) {
> > +    struct pasid_key *key = (struct pasid_key *)v;
> > +    uint32_t a, b, c;
> > +
> > +    /* Jenkins hash */
> > +    a = b = c = JHASH_INITVAL + sizeof(*key);
> > +    a += key->sid;
> > +    b += extract32(key->pasid, 0, 16);
> > +    c += extract32(key->pasid, 16, 16);
> > +
> > +    __jhash_mix(a, b, c);
> > +    __jhash_final(a, b, c);
> > +
> > +    return c;
> > +}
> > +
> > +static gboolean vtd_pasid_as_key_equal(gconstpointer v1,
> > +gconstpointer v2) {
> > +    const struct pasid_key *k1 = v1;
> > +    const struct pasid_key *k2 = v2;
> > +
> > +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid); }
> > +
> > +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> > +                                            uint8_t bus_num,
> > +                                            uint8_t devfn,
> > +                                            uint32_t pasid,
> > +                                            VTDPASIDEntry *pe) {
> > +    VTDContextEntry ce;
> > +    int ret;
> > +    dma_addr_t pasid_dir_base;
> > +
> > +    if (!s->root_scalable) {
> > +        return -VTD_FR_PASID_TABLE_INV;
> > +    }
> > +
> > +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> > +    ret = vtd_get_pe_from_pasid_table(s,
> > +                                  pasid_dir_base, pasid, pe);
> > +
> > +    return ret;
> > +}
> > +
> > +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
> > +*p2) {
> > +    return !memcmp(p1, p2, sizeof(*p1)); }
> > +
> > +/**
> > + * This function fills in the pasid entry in &vtd_pasid_as. Caller
> > + * of this function should hold iommu_lock.
> > + */
> > +static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
> > +                                 VTDPASIDAddressSpace *vtd_pasid_as,
> > +                                 VTDPASIDEntry *pe) {
> > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > +
> > +    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> > +        /* No need to go further as cached pasid entry is latest */
> > +        return;
> > +    }
> > +
> > +    pc_entry->pasid_entry = *pe;
> > +    /*
> > +     * TODO:
> > +     * - send pasid bind to host for passthru devices
> > +     */
> > +}
> > +
> > +/**
> > + * This function is used to clear cached pasid entry in vtd_pasid_as
> > + * instances. Caller of this function should hold iommu_lock.
> > + */
> > +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> > +                                gpointer user_data) {
> > +    VTDPASIDCacheInfo *pc_info = user_data;
> > +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> > +    VTDPASIDEntry pe;
> > +    uint16_t did;
> > +    uint32_t pasid;
> > +    uint16_t devfn;
> > +    int ret;
> > +
> > +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> > +    pasid = vtd_pasid_as->pasid;
> > +    devfn = vtd_pasid_as->devfn;
> > +
> > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > +    case VTD_PASID_CACHE_FORCE_RESET:
> > +        goto remove;
> > +    case VTD_PASID_CACHE_PASIDSI:
> > +        if (pc_info->pasid != pasid) {
> > +            return false;
> > +        }
> > +        /* Fall through */
> > +    case VTD_PASID_CACHE_DOMSI:
> > +        if (pc_info->domain_id != did) {
> > +            return false;
> > +        }
> > +        /* Fall through */
> > +    case VTD_PASID_CACHE_GLOBAL:
> > +        break;
> > +    default:
> > +        error_report("invalid pc_info->flags");
> > +        abort();
> > +    }
> > +
> > +    /*
> > +     * pasid cache invalidation may indicate a present pasid
> > +     * entry being modified to another present pasid entry. To cover
> > +     * such a case, the vIOMMU emulator needs to fetch the latest
> > +     * guest pasid entry and check the cached pasid entry, then update
> > +     * the pasid cache and send pasid bind/unbind to the host properly.
> > +     */
> > +    ret = vtd_dev_get_pe_from_pasid(s, pci_bus_num(vtd_bus->bus),
> > +                                    devfn, pasid, &pe);
> > +    if (ret) {
> > +        /*
> > +         * No valid pasid entry in guest memory. e.g. pasid entry
> > +         * was modified to be either all-zero or non-present. Either
> > +         * case means existing pasid cache should be removed.
> > +         */
> > +        goto remove;
> > +    }
> > +
> > +    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
> > +    /*
> > +     * TODO:
> > +     * - when pasid-based-iotlb (piotlb) infrastructure is ready,
> > +     *   should invalidate QEMU piotlb together with this change.
> > +     */
> > +    return false;
> > +remove:
> > +    /*
> > +     * TODO:
> > +     * - send pasid bind to host for passthru devices
> > +     * - when pasid-based-iotlb (piotlb) infrastructure is ready,
> > +     *   should invalidate QEMU piotlb together with this change.
> > +     */
> > +    return true;
> > +}
> > +
> > +/**
> > + * This function finds or adds a VTDPASIDAddressSpace for a device
> > + * when it is bound to a pasid. Caller of this function should hold
> > + * iommu_lock.
> > + */
> > +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> > +                                                   VTDBus *vtd_bus,
> > +                                                   int devfn,
> > +                                                   uint32_t pasid) {
> > +    struct pasid_key key;
> > +    struct pasid_key *new_key;
> > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > +    uint16_t sid;
> > +
> > +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> > +    vtd_init_pasid_key(pasid, sid, &key);
> > +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> > +
> > +    if (!vtd_pasid_as) {
> > +        new_key = g_malloc0(sizeof(*new_key));
> > +        vtd_init_pasid_key(pasid, sid, new_key);
> > +        /*
> > +         * Initialize the vtd_pasid_as structure.
> > +         *
> > +         * This structure here is used to track the guest pasid
> > +         * binding and also serves as a pasid-cache management entry.
> > +         *
> > +         * TODO: in future, if we want to support SVA-aware DMA
> > +         *       emulation, the vtd_pasid_as should include an
> > +         *       AddressSpace to support DMA emulation.
> > +         */
> > +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> > +        vtd_pasid_as->iommu_state = s;
> > +        vtd_pasid_as->vtd_bus = vtd_bus;
> > +        vtd_pasid_as->devfn = devfn;
> > +        vtd_pasid_as->pasid = pasid;
> > +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> > +    }
> > +    return vtd_pasid_as;
> > +}
> > +
> > +/**
> > + * Constant information used during pasid table walk
> > + * @vtd_bus, @devfn: device info
> > + * @flags: indicates if it is domain selective walk
> > + * @did: domain ID of the pasid table walk  */ typedef struct {
> > +    VTDBus *vtd_bus;
> > +    uint16_t devfn;
> > +#define VTD_PASID_TABLE_DID_SEL_WALK   (1ULL << 0)
> > +    uint32_t flags;
> > +    uint16_t did;
> > +} vtd_pasid_table_walk_info;
> > +
> > +/**
> > + * Caller of this function should hold iommu_lock.
> > + */
> > +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> > +                                        dma_addr_t pt_base,
> > +                                        int start,
> > +                                        int end,
> > +                                        vtd_pasid_table_walk_info
> > +*info) {
> > +    VTDPASIDEntry pe;
> > +    int pasid = start;
> > +    int pasid_next;
> > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > +
> > +    while (pasid < end) {
> > +        pasid_next = pasid + 1;
> > +
> > +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> > +            && vtd_pe_present(&pe)) {
> > +            vtd_pasid_as = vtd_add_find_pasid_as(s,
> > +                                       info->vtd_bus, info->devfn, pasid);
> > +            if ((info->flags & VTD_PASID_TABLE_DID_SEL_WALK) &&
> > +                !(info->did == vtd_pe_get_domain_id(&pe))) {
> > +                pasid = pasid_next;
> > +                continue;
> > +            }
> > +            vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
> > +        }
> > +        pasid = pasid_next;
> > +    }
> > +}
> > +
> > +/*
> > + * Currently, VT-d scalable mode pasid table is a two level table,
> > + * this function aims to loop a range of PASIDs in a given pasid
> > + * table to identify the pasid config in guest.
> > + * Caller of this function should hold iommu_lock.
> > + */
> > +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> > +                                    dma_addr_t pdt_base,
> > +                                    int start,
> > +                                    int end,
> > +                                    vtd_pasid_table_walk_info *info)
> > +{
> > +    VTDPASIDDirEntry pdire;
> > +    int pasid = start;
> > +    int pasid_next;
> > +    dma_addr_t pt_base;
> > +
> > +    while (pasid < end) {
> > +        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
> > +                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
> > +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> > +            && vtd_pdire_present(&pdire)) {
> > +            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> > +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> > +        }
> > +        pasid = pasid_next;
> > +    }
> > +}
> > +
> > +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> > +                                          int start, int end,
> > +                                          vtd_pasid_table_walk_info
> > +*info) {
> > +    VTDContextEntry ce;
> > +    int bus_n, devfn;
> > +
> > +    bus_n = pci_bus_num(info->vtd_bus->bus);
> > +    devfn = info->devfn;
> > +
> > +    if (!vtd_dev_to_context_entry(s, bus_n, devfn, &ce)) {
> > +        uint32_t max_pasid;
> > +
> > +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) *
> VTD_PASID_TBL_ENTRY_NUM;
> > +        if (end > max_pasid) {
> > +            end = max_pasid;
> > +        }
> > +        vtd_sm_pasid_table_walk(s,
> > +                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
> > +                                start,
> > +                                end,
> > +                                info);
> > +    }
> > +}
> > +
> > +/**
> > + * This function replays the guest pasid bindings to the host by
> > + * walking the guest PASID table. This ensures the host will have
> > + * the latest guest pasid bindings. Caller should hold iommu_lock.
> > + */
> > +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> > +                                            VTDPASIDCacheInfo
> > +*pc_info) {
> > +    VTDHostIOMMUContext *vtd_dev_icx;
> > +    int start = 0, end = VTD_HPASID_MAX;
> > +    vtd_pasid_table_walk_info walk_info = {.flags = 0};
> 
> So vtd_pasid_table_walk_info is still used.  I thought we had reached a consensus
> that this can be dropped?

yeah, I did consider your suggestion and planned to do it. But when
I started coding, it looked a little bit weird to me:
For one, there is an input VTDPASIDCacheInfo in this function. It may seem
natural to pass that parameter down to the further call
(vtd_replay_pasid_bind_for_dev()). But we can't do that: the vtd_bus/devfn
fields should be filled while looping over the assigned devices, not by the
caller of vtd_replay_guest_pasid_bindings().
For two, reusing VTDPASIDCacheInfo for passing the walk info may require
the final user to do the same thing vtd_replay_guest_pasid_bindings()
has done here.

So I kept vtd_pasid_table_walk_info.
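
Just to illustrate the point (this is not code from the series, only a
sketch of what reusing the invalidation info would look like),
VTDPASIDCacheInfo would have to grow fields that are only meaningful
while walking one device's pasid table:

    struct VTDPASIDCacheInfo {
        uint32_t flags;      /* or "type", per your other comment       */
        uint16_t domain_id;  /* scope, from the invalidation descriptor */
        uint32_t pasid;
        /* only valid while looping the assigned devices */
        VTDBus *vtd_bus;
        uint16_t devfn;
    };

and each user of the walk would have to refill vtd_bus/devfn per device
anyway, which felt more awkward than keeping a dedicated walk_info struct.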

> > +
> > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > +    case VTD_PASID_CACHE_PASIDSI:
> > +        start = pc_info->pasid;
> > +        end = pc_info->pasid + 1;
> > +        /*
> > +         * PASID selective invalidation is within domain,
> > +         * thus fall through.
> > +         */
> > +    case VTD_PASID_CACHE_DOMSI:
> > +        walk_info.did = pc_info->domain_id;
> > +        walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
> > +        /* loop all assigned devices */
> > +        break;
> > +    case VTD_PASID_CACHE_FORCE_RESET:
> > +        /* For force reset, no need to go further replay */
> > +        return;
> > +    case VTD_PASID_CACHE_GLOBAL:
> > +        break;
> > +    default:
> > +        error_report("%s, invalid pc_info->flags", __func__);
> > +        abort();
> > +    }
> > +
> > +    /*
> > +     * In this replay, we only need to care about the devices which
> > +     * are backed by host IOMMU. For such devices, their vtd_dev_icx
> > +     * instances are in the s->vtd_dev_icx_list. For devices which
> > +     * are not backed by host IOMMU, it is not necessary to replay
> > +     * the bindings since their cache could be re-created in the future
> > +     * DMA address translation.
> > +     */
> > +    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
> > +        walk_info.vtd_bus = vtd_dev_icx->vtd_bus;
> > +        walk_info.devfn = vtd_dev_icx->devfn;
> > +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> > +    }
> > +}
> > +
> > +/**
> > + * This function syncs the pasid bindings between guest and host.
> > + * It includes updating the pasid cache in vIOMMU and updating the
> > + * pasid bindings per guest's latest pasid entry presence.
> > + */
> > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > +                                 VTDPASIDCacheInfo *pc_info) {
> > +    /*
> > +     * Regards to a pasid cache invalidation, e.g. a PSI.
> > +     * it could be either cases of below:
> > +     * a) a present pasid entry moved to non-present
> > +     * b) a present pasid entry to be a present entry
> > +     * c) a non-present pasid entry moved to present
> > +     *
> > +     * Different invalidation granularity may affect different device
> > +     * scope and pasid scope. But for each invalidation granularity,
> > +     * it needs to do two steps to sync host and guest pasid binding.
> > +     *
> > +     * Here is the handling of a PSI:
> > +     * 1) loop all the existing vtd_pasid_as instances to update them
> > +     *    according to the latest guest pasid entry in pasid table.
> > +     *    this will make sure affected existing vtd_pasid_as instances
> > +     *    cached the latest pasid entries. Also, during the loop, the
> > +     *    host should be notified if needed. e.g. pasid unbind or pasid
> > +     *    update. Should be able to cover case a) and case b).
> > +     *
> > +     * 2) loop all devices to cover case c)
> > +     *    - For devices which have HostIOMMUContext instances,
> > +     *      we loop them and check if guest pasid entry exists. If yes,
> > +     *      it is case c), we update the pasid cache and also notify
> > +     *      host.
> > +     *    - For devices which have no HostIOMMUContext, it is not
> > +     *      necessary to create pasid cache at this phase since it
> > +     *      could be created when vIOMMU does DMA address translation.
> > +     *      This is not yet implemented since there is no emulated
> > +     *      pasid-capable devices today. If we have such devices in
> > +     *      future, the pasid cache shall be created there.
> > +     * Other granularity follow the same steps, just with different scope
> > +     *
> > +     */
> > +
> > +    vtd_iommu_lock(s);
> > +    /* Step 1: loop all the existing vtd_pasid_as instances */
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                                vtd_flush_pasid, pc_info);
> 
> OK the series is evolving along with our discussions, and /me too on understanding
> your series... Now I'm not very sure whether this operation is still useful...
> 
> The major point is you'll need to do pasid table walk for all the registered
> devices
> below.  So IIUC vtd_replay_guest_pasid_bindings() will be able to also detect
> addition, removal or modification of pasid address spaces.  Am I right?

It's true if there are only assigned pasid-capable devices. If there is an
emulated pasid-capable device, it would be a problem, as emulated devices
won't register a HostIOMMUContext, so the pasid cache invalidation for an
emulated device would be missed. So I chose to make step 1 cover the
"real" cache invalidation (a.k.a. removal), while step 2 covers addition
and modification.
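
To make it concrete, what I want the removal path of vtd_flush_pasid()
to do is roughly the below (a sketch only; vtd_bus_to_dev_icx() is a
made-up lookup helper here, and the actual unbind call is still a TODO
in this patch):

    /* Called with iommu_lock held, right before a vtd_pasid_as is freed */
    static void vtd_pasid_as_drop_one(IntelIOMMUState *s,
                                      VTDPASIDAddressSpace *vtd_pasid_as)
    {
        VTDHostIOMMUContext *vtd_dev_icx;

        /*
         * Emulated devices only lose the cached entry; host-backed
         * devices additionally need an unbind sent to the host.
         */
        vtd_dev_icx = vtd_bus_to_dev_icx(s, vtd_pasid_as->vtd_bus,
                                         vtd_pasid_as->devfn);
        if (vtd_dev_icx) {
            /* TODO: send pasid unbind to host for passthru devices */
        }
    }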

> 
> If this can be dropped, then vtd_flush_pasid() will be only used below for device
> reset, and it can be greatly simplified - just UNBIND every address space we have.
> 
> > +
> > +    /*
> > +     * Step 2: loop all the existing vtd_dev_icx instances.
> > +     * Ideally, needs to loop all devices to find if there is any new
> > +     * PASID binding regards to the PASID cache invalidation request.
> > +     * But it is enough to loop the devices which are backed by host
> > +     * IOMMU. For devices backed by vIOMMU (a.k.a emulated devices),
> > +     * if new PASID happened on them, their vtd_pasid_as instance could
> > +     * be created during future vIOMMU DMA translation.
> > +     */
> > +    vtd_replay_guest_pasid_bindings(s, pc_info);
> > +    vtd_iommu_unlock(s);
> > +}
> > +
> > +/**
> > + * Caller of this function should hold iommu_lock  */ static void
> > +vtd_pasid_cache_reset(IntelIOMMUState *s) {
> > +    VTDPASIDCacheInfo pc_info;
> > +
> > +    trace_vtd_pasid_cache_reset();
> > +
> > +    pc_info.flags = VTD_PASID_CACHE_FORCE_RESET;
> > +
> > +    /*
> > +     * Resetting the pasid cache is a big hammer, so use
> > +     * g_hash_table_foreach_remove which will free
> > +     * the vtd_pasid_as instances. Also, as a big
> > +     * hammer, use VTD_PASID_CACHE_FORCE_RESET to
> > +     * ensure all the vtd_pasid_as instances are
> > +     * dropped, meanwhile the change will be passed
> > +     * to the host if HostIOMMUContext is available.
> > +     */
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                                vtd_flush_pasid, &pc_info); }
> > +
> >  static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> >                                     VTDInvDesc *inv_desc)  {
> > +    uint16_t domain_id;
> > +    uint32_t pasid;
> > +    VTDPASIDCacheInfo pc_info;
> > +
> >      if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
> >          (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
> >          (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) || @@
> > -2407,14 +2864,26 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> >          return false;
> >      }
> >
> > +    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
> > +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
> > +
> >      switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
> >      case VTD_INV_DESC_PASIDC_DSI:
> > +        trace_vtd_pasid_cache_dsi(domain_id);
> > +        pc_info.flags = VTD_PASID_CACHE_DOMSI;
> > +        pc_info.domain_id = domain_id;
> >          break;
> >
> >      case VTD_INV_DESC_PASIDC_PASID_SI:
> > +        /* PASID selective implies a DID selective */
> > +        pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> > +        pc_info.domain_id = domain_id;
> > +        pc_info.pasid = pasid;
> >          break;
> >
> >      case VTD_INV_DESC_PASIDC_GLOBAL:
> > +        trace_vtd_pasid_cache_gsi();
> > +        pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> >          break;
> >
> >      default:
> > @@ -2423,6 +2892,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState
> *s,
> >          return false;
> >      }
> >
> > +    vtd_pasid_cache_sync(s, &pc_info);
> >      return true;
> >  }
> >
> > @@ -4085,6 +4555,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >                                       g_free, g_free);
> >      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash,
> vtd_uint64_equal,
> >                                                g_free, g_free);
> > +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> > +                                            vtd_pasid_as_key_equal,
> > +                                            g_free, g_free);
> >      vtd_init(s);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> Q35_HOST_BRIDGE_IOMMU_ADDR);
> >      pci_setup_iommu(bus, &vtd_iommu_ops, dev); diff --git
> > a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index 9a76f20..451ef4c 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -307,6 +307,7 @@ typedef enum VTDFaultReason {
> >      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
> >
> >      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> > +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of
> > + pasidt-entry is 0 */
> >
> >      /* This is not a normal fault reason. We use this to indicate some faults
> >       * that are not referenced by the VT-d specification.
> > @@ -511,10 +512,26 @@ typedef struct VTDRootEntry VTDRootEntry;
> >  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
> >  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
> >
> > +#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x3)
> >  #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff  #define
> > VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
> >  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
> >
> > +struct VTDPASIDCacheInfo {
> > +#define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
> > +#define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
> > +#define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
> > +#define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
> > +    uint32_t flags;
> > +    uint16_t domain_id;
> > +    uint32_t pasid;
> > +};
> > +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET |
> \
> > +                                      VTD_PASID_CACHE_GLOBAL  | \
> > +                                      VTD_PASID_CACHE_DOMSI  | \
> > +                                      VTD_PASID_CACHE_PASIDSI)
> 
> I think this is not needed at all?  The naming "flags" is confusing too because it's not
> really a bitmap but an enum.  How about drop this and rename "flags" to "type"?

Got it, I could make it an enum.

> > +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> > +
> >  /* PASID Table Related Definitions */  #define
> > VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)  #define
> > VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL) @@ -526,6 +543,7 @@
> typedef
> > struct VTDRootEntry VTDRootEntry;
> >  #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
> >  #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) &
> VTD_PASID_TABLE_BITS_MASK)
> >  #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
> > +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
> >
> >  /* PASID Granular Translation Type Mask */
> >  #define VTD_PASID_ENTRY_P              1ULL
> > diff --git a/hw/i386/trace-events b/hw/i386/trace-events index
> > f7cd4e5..60d20c1 100644
> > --- a/hw/i386/trace-events
> > +++ b/hw/i386/trace-events
> > @@ -23,6 +23,7 @@ vtd_inv_qi_tail(uint16_t head) "write tail %d"
> >  vtd_inv_qi_fetch(void) ""
> >  vtd_context_cache_reset(void) ""
> >  vtd_pasid_cache_gsi(void) ""
> > +vtd_pasid_cache_reset(void) ""
> >  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation
> > domain 0x%"PRIx16  vtd_pasid_cache_psi(uint16_t domain, uint32_t
> > pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> > diff --git a/include/hw/i386/intel_iommu.h
> > b/include/hw/i386/intel_iommu.h index 42a58d6..626c1cd 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -65,6 +65,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;  typedef struct
> > VTDPASIDEntry VTDPASIDEntry;  typedef struct VTDHostIOMMUContext
> > VTDHostIOMMUContext;
> > +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry; typedef struct
> > +VTDPASIDAddressSpace VTDPASIDAddressSpace;
> >
> >  /* Context-Entry */
> >  struct VTDContextEntry {
> > @@ -97,6 +99,26 @@ struct VTDPASIDEntry {
> >      uint64_t val[8];
> >  };
> >
> > +struct pasid_key {
> > +    uint32_t pasid;
> > +    uint16_t sid;
> > +};
> > +
> > +struct VTDPASIDCacheEntry {
> > +    struct VTDPASIDEntry pasid_entry; };
> > +
> > +struct VTDPASIDAddressSpace {
> > +    VTDBus *vtd_bus;
> > +    uint8_t devfn;
> > +    AddressSpace as;
> 
> Can this be dropped?

oh, yes.

> > +    uint32_t pasid;
> > +    IntelIOMMUState *iommu_state;
> > +    VTDContextCacheEntry context_cache_entry;
> 
> Can this be dropped too?

yep.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
@ 2020-04-02  6:46       ` Liu, Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-02  6:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, Eduardo Habkost,
	kvm, mst, Tian, Jun J, qemu-devel, eric.auger, alex.williamson,
	pbonzini, Wu, Hao, Sun, Yi Y, Richard Henderson, david

Hi Peter,

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 2, 2020 8:02 AM
> Subject: Re: [PATCH v2 13/22] intel_iommu: add PASID cache management
> infrastructure
> 
> On Sun, Mar 29, 2020 at 09:24:52PM -0700, Liu Yi L wrote:
> > This patch adds a PASID cache management infrastructure based on new
> > added structure VTDPASIDAddressSpace, which is used to track the PASID
> > usage and future PASID tagged DMA address translation support in
> > vIOMMU.
> >
> >     struct VTDPASIDAddressSpace {
> >         VTDBus *vtd_bus;
> >         uint8_t devfn;
> >         AddressSpace as;
> >         uint32_t pasid;
> >         IntelIOMMUState *iommu_state;
> >         VTDContextCacheEntry context_cache_entry;
> >         QLIST_ENTRY(VTDPASIDAddressSpace) next;
> >         VTDPASIDCacheEntry pasid_cache_entry;
> >     };
> >
> > Ideally, a VTDPASIDAddressSpace instance is created when a PASID is
> > bound with a DMA AddressSpace. Intel VT-d spec requires guest software
> > to issue a pasid cache invalidation when binding or unbinding a pasid with
> > an address space under caching-mode. However, as VTDPASIDAddressSpace
> > instances also act as pasid cache in this implementation, their creation
> > also happens during vIOMMU PASID tagged DMA translation. The creation
> > in this path will not be added in this patch since there are no
> > PASID-capable emulated devices for now.
> >
> > The implementation in this patch manages VTDPASIDAddressSpace
> > instances per PASID+BDF (lookup and insert will use PASID and
> > BDF) since Intel VT-d spec allows per-BDF PASID Table. When a guest
> > binds a PASID with an AddressSpace, QEMU will capture the guest pasid
> > selective pasid cache invalidation, and allocate or remove a
> > VTDPASIDAddressSpace instance per the invalidation
> > reasons:
> >
> >     *) a present pasid entry moved to non-present
> >     *) a present pasid entry modified to be another present entry
> >     *) a non-present pasid entry moved to present
> >
> > vIOMMU emulator could figure out the reason by fetching latest guest
> > pasid entry.
> >
> > v1 -> v2: - merged this patch with the former replay binding patch, making
> >             PSI/DSI/GSI use the unified function to do cache invalidation
> >             and pasid binding replay.
> >           - dropped pasid_cache_gen in both iommu_state and vtd_pasid_as
> >             as it is not necessary so far; we may want it when we one day
> >             introduce an emulated SVA-capable device.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/i386/intel_iommu.c          | 473
> +++++++++++++++++++++++++++++++++++++++++
> >  hw/i386/intel_iommu_internal.h |  18 ++
> >  hw/i386/trace-events           |   1 +
> >  include/hw/i386/intel_iommu.h  |  24 +++
> >  4 files changed, 516 insertions(+)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> > 2eb60c3..a7e9973 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -40,6 +40,7 @@
> >  #include "kvm_i386.h"
> >  #include "migration/vmstate.h"
> >  #include "trace.h"
> > +#include "qemu/jhash.h"
> >
> >  /* context entry operations */
> >  #define VTD_CE_GET_RID2PASID(ce) \
> > @@ -65,6 +66,8 @@
> >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> > static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> > *n);
> >
> > +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > +
> >  static void vtd_panic_require_caching_mode(void)
> >  {
> >      error_report("We need to set caching-mode=on for intel-iommu to enable "
> > @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
> >      vtd_iommu_lock(s);
> >      vtd_reset_iotlb_locked(s);
> >      vtd_reset_context_cache_locked(s);
> > +    vtd_pasid_cache_reset(s);
> >      vtd_iommu_unlock(s);
> >  }
> >
> > @@ -686,6 +690,16 @@ static inline bool vtd_pe_type_check(X86IOMMUState
> *x86_iommu,
> >      return true;
> >  }
> >
> > +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe) {
> > +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> > +}
> > +
> > +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry
> > +*ce) {
> > +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7); }
> > +
> >  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)  {
> >      return pdire->val & 1;
> > @@ -2395,9 +2409,452 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState
> *s, VTDInvDesc *inv_desc)
> >      return true;
> >  }
> >
> > +static inline void vtd_init_pasid_key(uint32_t pasid,
> > +                                     uint16_t sid,
> > +                                     struct pasid_key *key) {
> > +    key->pasid = pasid;
> > +    key->sid = sid;
> > +}
> > +
> > +static guint vtd_pasid_as_key_hash(gconstpointer v) {
> > +    struct pasid_key *key = (struct pasid_key *)v;
> > +    uint32_t a, b, c;
> > +
> > +    /* Jenkins hash */
> > +    a = b = c = JHASH_INITVAL + sizeof(*key);
> > +    a += key->sid;
> > +    b += extract32(key->pasid, 0, 16);
> > +    c += extract32(key->pasid, 16, 16);
> > +
> > +    __jhash_mix(a, b, c);
> > +    __jhash_final(a, b, c);
> > +
> > +    return c;
> > +}
> > +
> > +static gboolean vtd_pasid_as_key_equal(gconstpointer v1,
> > +gconstpointer v2) {
> > +    const struct pasid_key *k1 = v1;
> > +    const struct pasid_key *k2 = v2;
> > +
> > +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid); }
> > +
> > +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> > +                                            uint8_t bus_num,
> > +                                            uint8_t devfn,
> > +                                            uint32_t pasid,
> > +                                            VTDPASIDEntry *pe) {
> > +    VTDContextEntry ce;
> > +    int ret;
> > +    dma_addr_t pasid_dir_base;
> > +
> > +    if (!s->root_scalable) {
> > +        return -VTD_FR_PASID_TABLE_INV;
> > +    }
> > +
> > +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> > +    ret = vtd_get_pe_from_pasid_table(s,
> > +                                  pasid_dir_base, pasid, pe);
> > +
> > +    return ret;
> > +}
> > +
> > +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
> > +*p2) {
> > +    return !memcmp(p1, p2, sizeof(*p1)); }
> > +
> > +/**
> > + * This function fills in the pasid entry in &vtd_pasid_as. Caller
> > + * of this function should hold iommu_lock.
> > + */
> > +static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
> > +                                 VTDPASIDAddressSpace *vtd_pasid_as,
> > +                                 VTDPASIDEntry *pe) {
> > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > +
> > +    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> > +        /* No need to go further as cached pasid entry is latest */
> > +        return;
> > +    }
> > +
> > +    pc_entry->pasid_entry = *pe;
> > +    /*
> > +     * TODO:
> > +     * - send pasid bind to host for passthru devices
> > +     */
> > +}
> > +
> > +/**
> > + * This function is used to clear cached pasid entry in vtd_pasid_as
> > + * instances. Caller of this function should hold iommu_lock.
> > + */
> > +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> > +                                gpointer user_data) {
> > +    VTDPASIDCacheInfo *pc_info = user_data;
> > +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> > +    VTDPASIDEntry pe;
> > +    uint16_t did;
> > +    uint32_t pasid;
> > +    uint16_t devfn;
> > +    int ret;
> > +
> > +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> > +    pasid = vtd_pasid_as->pasid;
> > +    devfn = vtd_pasid_as->devfn;
> > +
> > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > +    case VTD_PASID_CACHE_FORCE_RESET:
> > +        goto remove;
> > +    case VTD_PASID_CACHE_PASIDSI:
> > +        if (pc_info->pasid != pasid) {
> > +            return false;
> > +        }
> > +        /* Fall through */
> > +    case VTD_PASID_CACHE_DOMSI:
> > +        if (pc_info->domain_id != did) {
> > +            return false;
> > +        }
> > +        /* Fall through */
> > +    case VTD_PASID_CACHE_GLOBAL:
> > +        break;
> > +    default:
> > +        error_report("invalid pc_info->flags");
> > +        abort();
> > +    }
> > +
> > +    /*
> > +     * A pasid cache invalidation may indicate a present-to-present
> > +     * pasid entry modification. To cover such a case, the vIOMMU
> > +     * emulator needs to fetch the latest guest pasid entry and check
> > +     * the cached pasid entry, then update the pasid cache and send
> > +     * pasid bind/unbind to the host properly.
> > +     */
> > +    ret = vtd_dev_get_pe_from_pasid(s, pci_bus_num(vtd_bus->bus),
> > +                                    devfn, pasid, &pe);
> > +    if (ret) {
> > +        /*
> > +         * No valid pasid entry in guest memory. e.g. pasid entry
> > +         * was modified to be either all-zero or non-present. Either
> > +         * case means existing pasid cache should be removed.
> > +         */
> > +        goto remove;
> > +    }
> > +
> > +    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
> > +    /*
> > +     * TODO:
> > +     * - when pasid-based-iotlb (piotlb) infrastructure is ready,
> > +     *   should invalidate QEMU piotlb together with this change.
> > +     */
> > +    return false;
> > +remove:
> > +    /*
> > +     * TODO:
> > +     * - send pasid bind to host for passthru devices
> > +     * - when pasid-based-iotlb (piotlb) infrastructure is ready,
> > +     *   should invalidate QEMU piotlb together with this change.
> > +     */
> > +    return true;
> > +}
> > +
> > +/**
> > + * This function finds or adds a VTDPASIDAddressSpace for a device
> > + * when it is bound to a pasid. Caller of this function should hold
> > + * iommu_lock.
> > + */
> > +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> > +                                                   VTDBus *vtd_bus,
> > +                                                   int devfn,
> > +                                                   uint32_t pasid) {
> > +    struct pasid_key key;
> > +    struct pasid_key *new_key;
> > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > +    uint16_t sid;
> > +
> > +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> > +    vtd_init_pasid_key(pasid, sid, &key);
> > +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> > +
> > +    if (!vtd_pasid_as) {
> > +        new_key = g_malloc0(sizeof(*new_key));
> > +        vtd_init_pasid_key(pasid, sid, new_key);
> > +        /*
> > +         * Initialize the vtd_pasid_as structure.
> > +         *
> > +         * This structure is used to track the guest pasid
> > +         * binding and also serves as a pasid-cache management entry.
> > +         *
> > +         * TODO: in the future, if we want to support SVA-aware DMA
> > +         *       emulation, the vtd_pasid_as should include an
> > +         *       AddressSpace to support DMA emulation.
> > +         */
> > +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> > +        vtd_pasid_as->iommu_state = s;
> > +        vtd_pasid_as->vtd_bus = vtd_bus;
> > +        vtd_pasid_as->devfn = devfn;
> > +        vtd_pasid_as->pasid = pasid;
> > +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> > +    }
> > +    return vtd_pasid_as;
> > +}
> > +
> > +/**
> > + * Constant information used during pasid table walk
> > + * @vtd_bus, @devfn: device info
> > + * @flags: indicates if it is a domain selective walk
> > + * @did: domain ID of the pasid table walk  */ typedef struct {
> > +    VTDBus *vtd_bus;
> > +    uint16_t devfn;
> > +#define VTD_PASID_TABLE_DID_SEL_WALK   (1ULL << 0)
> > +    uint32_t flags;
> > +    uint16_t did;
> > +} vtd_pasid_table_walk_info;
> > +
> > +/**
> > + * Caller of this function should hold iommu_lock.
> > + */
> > +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> > +                                        dma_addr_t pt_base,
> > +                                        int start,
> > +                                        int end,
> > +                                        vtd_pasid_table_walk_info
> > +*info) {
> > +    VTDPASIDEntry pe;
> > +    int pasid = start;
> > +    int pasid_next;
> > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > +
> > +    while (pasid < end) {
> > +        pasid_next = pasid + 1;
> > +
> > +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> > +            && vtd_pe_present(&pe)) {
> > +            vtd_pasid_as = vtd_add_find_pasid_as(s,
> > +                                       info->vtd_bus, info->devfn, pasid);
> > +            if ((info->flags & VTD_PASID_TABLE_DID_SEL_WALK) &&
> > +                !(info->did == vtd_pe_get_domain_id(&pe))) {
> > +                pasid = pasid_next;
> > +                continue;
> > +            }
> > +            vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
> > +        }
> > +        pasid = pasid_next;
> > +    }
> > +}
> > +
> > +/*
> > + * Currently, the VT-d scalable mode pasid table is a two-level table;
> > + * this function aims to loop over a range of PASIDs in a given pasid
> > + * table to identify the pasid config in the guest.
> > + * Caller of this function should hold iommu_lock.
> > + */
> > +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> > +                                    dma_addr_t pdt_base,
> > +                                    int start,
> > +                                    int end,
> > +                                    vtd_pasid_table_walk_info *info)
> > +{
> > +    VTDPASIDDirEntry pdire;
> > +    int pasid = start;
> > +    int pasid_next;
> > +    dma_addr_t pt_base;
> > +
> > +    while (pasid < end) {
> > +        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
> > +                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
> > +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> > +            && vtd_pdire_present(&pdire)) {
> > +            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> > +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> > +        }
> > +        pasid = pasid_next;
> > +    }
> > +}
> > +
> > +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> > +                                          int start, int end,
> > +                                          vtd_pasid_table_walk_info
> > +*info) {
> > +    VTDContextEntry ce;
> > +    int bus_n, devfn;
> > +
> > +    bus_n = pci_bus_num(info->vtd_bus->bus);
> > +    devfn = info->devfn;
> > +
> > +    if (!vtd_dev_to_context_entry(s, bus_n, devfn, &ce)) {
> > +        uint32_t max_pasid;
> > +
> > +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) *
> VTD_PASID_TBL_ENTRY_NUM;
> > +        if (end > max_pasid) {
> > +            end = max_pasid;
> > +        }
> > +        vtd_sm_pasid_table_walk(s,
> > +                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
> > +                                start,
> > +                                end,
> > +                                info);
> > +    }
> > +}
> > +
> > +/**
> > + * This function replays the guest pasid bindings to the host by
> > + * walking the guest PASID table. This ensures the host will have
> > + * the latest guest pasid bindings. Caller should hold iommu_lock.
> > + */
> > +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> > +                                            VTDPASIDCacheInfo
> > +*pc_info) {
> > +    VTDHostIOMMUContext *vtd_dev_icx;
> > +    int start = 0, end = VTD_HPASID_MAX;
> > +    vtd_pasid_table_walk_info walk_info = {.flags = 0};
> 
> So vtd_pasid_table_walk_info is still used.  I thought we had reached a consensus
> that this can be dropped?

Yeah, I did consider your suggestion and planned to do it. But when
I started coding, it looked a little bit weird to me:
For one, there is an input VTDPASIDCacheInfo in this function. It may be
natural to think about passing that parameter further down (to
vtd_replay_pasid_bind_for_dev()). But we can't do that. The vtd_bus/devfn
fields should be filled while looping over the assigned devices, not with
the one passed in by the vtd_replay_guest_pasid_bindings() caller.
For two, reusing VTDPASIDCacheInfo for passing the walk info would require
the final user to do the same thing as what vtd_replay_guest_pasid_bindings()
has done here.

So I kept the vtd_pasid_table_walk_info.
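
Just to illustrate the concern: if VTDPASIDCacheInfo were reused as the walk
info, it would roughly have to grow into the below (hypothetical sketch, not
part of the patch):

    struct VTDPASIDCacheInfo {
        uint32_t flags;
        uint16_t domain_id;
        uint32_t pasid;
        /*
         * These two fields would have to be rewritten for each assigned
         * device inside the replay loop, although the invalidation
         * handler which builds this structure never fills them.
         */
        VTDBus *vtd_bus;
        uint16_t devfn;
    };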

> > +
> > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > +    case VTD_PASID_CACHE_PASIDSI:
> > +        start = pc_info->pasid;
> > +        end = pc_info->pasid + 1;
> > +        /*
> > +         * PASID selective invalidation is within domain,
> > +         * thus fall through.
> > +         */
> > +    case VTD_PASID_CACHE_DOMSI:
> > +        walk_info.did = pc_info->domain_id;
> > +        walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
> > +        /* loop all assigned devices */
> > +        break;
> > +    case VTD_PASID_CACHE_FORCE_RESET:
> > +        /* For force reset, no need to go further replay */
> > +        return;
> > +    case VTD_PASID_CACHE_GLOBAL:
> > +        break;
> > +    default:
> > +        error_report("%s, invalid pc_info->flags", __func__);
> > +        abort();
> > +    }
> > +
> > +    /*
> > +     * In this replay, we only need to care about the devices which
> > +     * are backed by host IOMMU. For such devices, their vtd_dev_icx
> > +     * instances are in the s->vtd_dev_icx_list. For devices which
> > +     * are not backed by host IOMMU, it is not necessary to replay
> > +     * the bindings since their cache could be re-created in the future
> > +     * DMA address translation.
> > +     */
> > +    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
> > +        walk_info.vtd_bus = vtd_dev_icx->vtd_bus;
> > +        walk_info.devfn = vtd_dev_icx->devfn;
> > +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> > +    }
> > +}
> > +
> > +/**
> > + * This function syncs the pasid bindings between guest and host.
> > + * It includes updating the pasid cache in vIOMMU and updating the
> > + * pasid bindings per guest's latest pasid entry presence.
> > + */
> > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > +                                 VTDPASIDCacheInfo *pc_info) {
> > +    /*
> > +     * Regarding a pasid cache invalidation, e.g. a PSI,
> > +     * it could be any of the cases below:
> > +     * a) a present pasid entry moved to non-present
> > +     * b) a present pasid entry modified to be another present entry
> > +     * c) a non-present pasid entry moved to present
> > +     *
> > +     * Different invalidation granularity may affect different device
> > +     * scope and pasid scope. But for each invalidation granularity,
> > +     * it needs to do two steps to sync host and guest pasid binding.
> > +     *
> > +     * Here is the handling of a PSI:
> > +     * 1) loop all the existing vtd_pasid_as instances to update them
> > +     *    according to the latest guest pasid entry in the pasid table.
> > +     *    This will make sure affected existing vtd_pasid_as instances
> > +     *    cache the latest pasid entries. Also, during the loop, the
> > +     *    host should be notified if needed, e.g. pasid unbind or pasid
> > +     *    update. This should cover case a) and case b).
> > +     *
> > +     * 2) loop all devices to cover case c)
> > +     *    - For devices which have HostIOMMUContext instances,
> > +     *      we loop them and check if a guest pasid entry exists. If yes,
> > +     *      it is case c); we update the pasid cache and also notify
> > +     *      the host.
> > +     *    - For devices which have no HostIOMMUContext, it is not
> > +     *      necessary to create a pasid cache at this phase since it
> > +     *      could be created when the vIOMMU does DMA address translation.
> > +     *      This is not yet implemented since there are no emulated
> > +     *      pasid-capable devices today. If we have such devices in
> > +     *      the future, the pasid cache shall be created there.
> > +     * Other granularities follow the same steps, just with a different scope.
> > +     *
> > +     */
> > +
> > +    vtd_iommu_lock(s);
> > +    /* Step 1: loop all the existing vtd_pasid_as instances */
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                                vtd_flush_pasid, pc_info);
> 
> OK the series is evolving along with our discussions, and /me too on understanding
> your series... Now I'm not very sure whether this operation is still useful...
> 
> The major point is you'll need to do pasid table walk for all the registered
> devices
> below.  So IIUC vtd_replay_guest_pasid_bindings() will be able to also detect
> addition, removal or modification of pasid address spaces.  Am I right?

It's true if there are only assigned pasid-capable devices. If there is an
emulated pasid-capable device, it would be a problem as emulated devices
won't register a HostIOMMUContext. The pasid cache invalidation for
emulated devices would then be missed. So I chose to make step 1 cover
the "real" cache invalidation (a.k.a. removal), while step 2 covers
addition and modification.
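
To make the split concrete, the per-entry decision in step 1 looks roughly
like the below (simplified from vtd_flush_pasid() in the patch, only to show
which case lands where):

    /* for each vtd_pasid_as matching the invalidation scope */
    ret = vtd_dev_get_pe_from_pasid(s, bus_num, devfn, pasid, &pe);
    if (ret) {
        /* case a): entry gone in guest memory -> drop the cached entry
         * (and, once wired up, unbind on the host) */
        return true;    /* freed by g_hash_table_foreach_remove() */
    }
    /* case b): entry still present -> refresh the cached entry
     * (and, once wired up, re-bind on the host) */
    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
    return false;

Case c), a brand new binding, is not visible in this loop at all; it is only
picked up by step 2 for host-backed devices, and deferred to future DMA
translation for emulated ones.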

> 
> If this can be dropped, then vtd_flush_pasid() will be only used below for device
> reset, and it can be greatly simplified - just UNBIND every address space we have.
> 
> > +
> > +    /*
> > +     * Step 2: loop all the existing vtd_dev_icx instances.
> > +     * Ideally, we would loop all devices to find any new PASID
> > +     * binding related to the PASID cache invalidation request.
> > +     * But it is enough to loop the devices which are backed by host
> > +     * IOMMU. For devices backed by vIOMMU (a.k.a. emulated devices),
> > +     * if a new PASID binding happened on them, their vtd_pasid_as
> > +     * instance could be created during future vIOMMU DMA translation.
> > +     */
> > +    vtd_replay_guest_pasid_bindings(s, pc_info);
> > +    vtd_iommu_unlock(s);
> > +}
> > +
> > +/**
> > + * Caller of this function should hold iommu_lock  */ static void
> > +vtd_pasid_cache_reset(IntelIOMMUState *s) {
> > +    VTDPASIDCacheInfo pc_info;
> > +
> > +    trace_vtd_pasid_cache_reset();
> > +
> > +    pc_info.flags = VTD_PASID_CACHE_FORCE_RESET;
> > +
> > +    /*
> > +     * Resetting the pasid cache is a big hammer, so use
> > +     * g_hash_table_foreach_remove which will free
> > +     * the vtd_pasid_as instances. Also, as a big
> > +     * hammer, use VTD_PASID_CACHE_FORCE_RESET to
> > +     * ensure all the vtd_pasid_as instances are
> > +     * dropped, meanwhile the change will be passed
> > +     * to the host if HostIOMMUContext is available.
> > +     */
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                                vtd_flush_pasid, &pc_info); }
> > +
> >  static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> >                                     VTDInvDesc *inv_desc)  {
> > +    uint16_t domain_id;
> > +    uint32_t pasid;
> > +    VTDPASIDCacheInfo pc_info;
> > +
> >      if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
> >          (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
> >          (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) || @@
> > -2407,14 +2864,26 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> >          return false;
> >      }
> >
> > +    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
> > +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
> > +
> >      switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
> >      case VTD_INV_DESC_PASIDC_DSI:
> > +        trace_vtd_pasid_cache_dsi(domain_id);
> > +        pc_info.flags = VTD_PASID_CACHE_DOMSI;
> > +        pc_info.domain_id = domain_id;
> >          break;
> >
> >      case VTD_INV_DESC_PASIDC_PASID_SI:
> > +        /* PASID selective implies a DID selective */
> > +        pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> > +        pc_info.domain_id = domain_id;
> > +        pc_info.pasid = pasid;
> >          break;
> >
> >      case VTD_INV_DESC_PASIDC_GLOBAL:
> > +        trace_vtd_pasid_cache_gsi();
> > +        pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> >          break;
> >
> >      default:
> > @@ -2423,6 +2892,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState
> *s,
> >          return false;
> >      }
> >
> > +    vtd_pasid_cache_sync(s, &pc_info);
> >      return true;
> >  }
> >
> > @@ -4085,6 +4555,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >                                       g_free, g_free);
> >      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash,
> vtd_uint64_equal,
> >                                                g_free, g_free);
> > +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> > +                                            vtd_pasid_as_key_equal,
> > +                                            g_free, g_free);
> >      vtd_init(s);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> Q35_HOST_BRIDGE_IOMMU_ADDR);
> >      pci_setup_iommu(bus, &vtd_iommu_ops, dev); diff --git
> > a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index 9a76f20..451ef4c 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -307,6 +307,7 @@ typedef enum VTDFaultReason {
> >      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
> >
> >      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> > +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of
> > + pasidt-entry is 0 */
> >
> >      /* This is not a normal fault reason. We use this to indicate some faults
> >       * that are not referenced by the VT-d specification.
> > @@ -511,10 +512,26 @@ typedef struct VTDRootEntry VTDRootEntry;
> >  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
> >  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
> >
> > +#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x3)
> >  #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff  #define
> > VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
> >  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
> >
> > +struct VTDPASIDCacheInfo {
> > +#define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
> > +#define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
> > +#define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
> > +#define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
> > +    uint32_t flags;
> > +    uint16_t domain_id;
> > +    uint32_t pasid;
> > +};
> > +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET |
> \
> > +                                      VTD_PASID_CACHE_GLOBAL  | \
> > +                                      VTD_PASID_CACHE_DOMSI  | \
> > +                                      VTD_PASID_CACHE_PASIDSI)
> 
> I think this is not needed at all?  The naming "flags" is confusing too because it's not
> really a bitmap but an enum.  How about drop this and rename "flags" to "type"?

Got it, I could make it an enum.
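
Something along the lines of the below (naming is tentative):

    typedef enum VTDPASIDOp {
        VTD_PASID_CACHE_FORCE_RESET = 0,
        VTD_PASID_CACHE_GLOBAL,
        VTD_PASID_CACHE_DOMSI,
        VTD_PASID_CACHE_PASIDSI,
    } VTDPASIDOp;

    struct VTDPASIDCacheInfo {
        VTDPASIDOp type;
        uint16_t domain_id;
        uint32_t pasid;
    };

With that, VTD_PASID_CACHE_INFO_MASK goes away and the switch statements can
match on pc_info->type directly.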

> > +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> > +
> >  /* PASID Table Related Definitions */  #define
> > VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)  #define
> > VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL) @@ -526,6 +543,7 @@
> typedef
> > struct VTDRootEntry VTDRootEntry;
> >  #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
> >  #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) &
> VTD_PASID_TABLE_BITS_MASK)
> >  #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
> > +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
> >
> >  /* PASID Granular Translation Type Mask */
> >  #define VTD_PASID_ENTRY_P              1ULL
> > diff --git a/hw/i386/trace-events b/hw/i386/trace-events index
> > f7cd4e5..60d20c1 100644
> > --- a/hw/i386/trace-events
> > +++ b/hw/i386/trace-events
> > @@ -23,6 +23,7 @@ vtd_inv_qi_tail(uint16_t head) "write tail %d"
> >  vtd_inv_qi_fetch(void) ""
> >  vtd_context_cache_reset(void) ""
> >  vtd_pasid_cache_gsi(void) ""
> > +vtd_pasid_cache_reset(void) ""
> >  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation
> > domain 0x%"PRIx16  vtd_pasid_cache_psi(uint16_t domain, uint32_t
> > pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> > diff --git a/include/hw/i386/intel_iommu.h
> > b/include/hw/i386/intel_iommu.h index 42a58d6..626c1cd 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -65,6 +65,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;  typedef struct
> > VTDPASIDEntry VTDPASIDEntry;  typedef struct VTDHostIOMMUContext
> > VTDHostIOMMUContext;
> > +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry; typedef struct
> > +VTDPASIDAddressSpace VTDPASIDAddressSpace;
> >
> >  /* Context-Entry */
> >  struct VTDContextEntry {
> > @@ -97,6 +99,26 @@ struct VTDPASIDEntry {
> >      uint64_t val[8];
> >  };
> >
> > +struct pasid_key {
> > +    uint32_t pasid;
> > +    uint16_t sid;
> > +};
> > +
> > +struct VTDPASIDCacheEntry {
> > +    struct VTDPASIDEntry pasid_entry; };
> > +
> > +struct VTDPASIDAddressSpace {
> > +    VTDBus *vtd_bus;
> > +    uint8_t devfn;
> > +    AddressSpace as;
> 
> Can this be dropped?

oh, yes.

> > +    uint32_t pasid;
> > +    IntelIOMMUState *iommu_state;
> > +    VTDContextCacheEntry context_cache_entry;
> 
> Can this be dropped too?

yep.
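
So the structure would shrink to roughly the below (following the layout in
the commit message, with both fields dropped):

    struct VTDPASIDAddressSpace {
        VTDBus *vtd_bus;
        uint8_t devfn;
        uint32_t pasid;
        IntelIOMMUState *iommu_state;
        QLIST_ENTRY(VTDPASIDAddressSpace) next;
        VTDPASIDCacheEntry pasid_cache_entry;
    };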

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-03-30  4:24 ` Liu Yi L
@ 2020-04-02  8:33   ` Jason Wang
  -1 siblings, 0 replies; 160+ messages in thread
From: Jason Wang @ 2020-04-02  8:33 UTC (permalink / raw)
  To: Liu Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, kevin.tian, kvm, mst, jun.j.tian, eric.auger,
	yi.y.sun, pbonzini, hao.wu, david


On 2020/3/30 12:24 PM, Liu Yi L wrote:
> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> Intel platforms allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
>
> This QEMU series is intended to expose SVA usage to VMs. i.e. Sharing
> guest application address space with passthru devices. This is called
> vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> changes.
>
> The high-level architecture for SVA virtualization is as below, the key
> design of vSVA support is to utilize the dual-stage IOMMU translation (
> also known as IOMMU nesting translation) capability in host IOMMU.
>
>      .-------------.  .---------------------------.
>      |   vIOMMU    |  | Guest process CR3, FL only|
>      |             |  '---------------------------'
>      .----------------/
>      | PASID Entry |--- PASID cache flush -
>      '-------------'                       |
>      |             |                       V
>      |             |                CR3 in GPA
>      '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>        v        v                          v
> Host
>      .-------------.  .----------------------.
>      |   pIOMMU    |  | Bind FL for GVA-GPA  |
>      |             |  '----------------------'
>      .----------------/  |
>      | PASID Entry |     V (Nested xlate)
>      '----------------\.------------------------------.
>      |             ||SL for GPA-HPA, default domain|
>      |             |   '------------------------------'
>      '-------------'
> Where:
>   - FL = First level/stage one page tables
>   - SL = Second level/stage two page tables
>
> The complete vSVA kernel upstream patches are divided into three phases:
>      1. Common APIs and PCI device direct assignment
>      2. IOMMU-backed Mediated Device assignment
>      3. Page Request Services (PRS) support
>
> This QEMU patchset is aiming for the phase 1 and phase 2. It is based
> on the two kernel series below.
> [1] [PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support:
> https://lkml.org/lkml/2020/3/20/1172
> [2] [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs
> https://lkml.org/lkml/2020/3/22/116
>
> There are roughly two parts:
>   1. Introduce HostIOMMUContext as abstract of host IOMMU. It provides explicit
>      method for vIOMMU emulators to communicate with host IOMMU. e.g. propagate
>      guest page table binding to host IOMMU to setup dual-stage DMA translation
>      in host IOMMU and flush iommu iotlb.
>   2. Setup dual-stage IOMMU translation for Intel vIOMMU. Includes
>      - Check IOMMU uAPI version compatibility and VFIO Nesting capabilities which
>        includes hardware compatibility (stage 1 format) and VFIO_PASID_REQ
>        availability. This is preparation for setting up dual-stage DMA translation
>        in host IOMMU.
>      - Propagate guest PASID allocation and free request to host.
>      - Propagate guest page table binding to host to setup dual-stage IOMMU DMA
>        translation in host IOMMU.
>      - Propagate guest IOMMU cache invalidation to host to ensure iotlb
>        correctness.
>
> The complete QEMU set can be found in below link:
> https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2


Hi Yi:

I could not find the branch there.

Thanks


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  2020-03-30 11:02     ` Auger Eric
@ 2020-04-02  8:52       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-02  8:52 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

> From: Auger Eric < eric.auger@redhat.com>
> Sent: Monday, March 30, 2020 7:02 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
> PCIIOMMUOps
> 
> 
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of
> > setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space for
> > a PCI device in vendor specific way. The PCIIOMMUOps still offers this
> > functionality. But using PCIIOMMUOps leaves space to add more iommu
> > related vendor specific operations.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/alpha/typhoon.c       |  6 +++++-
> >  hw/arm/smmu-common.c     |  6 +++++-
> >  hw/hppa/dino.c           |  6 +++++-
> >  hw/i386/amd_iommu.c      |  6 +++++-
> >  hw/i386/intel_iommu.c    |  6 +++++-
> >  hw/pci-host/designware.c |  6 +++++-
> >  hw/pci-host/pnv_phb3.c   |  6 +++++-
> >  hw/pci-host/pnv_phb4.c   |  6 +++++-
> >  hw/pci-host/ppce500.c    |  6 +++++-
> >  hw/pci-host/prep.c       |  6 +++++-
> >  hw/pci-host/sabre.c      |  6 +++++-
> >  hw/pci/pci.c             | 12 +++++++-----
> >  hw/ppc/ppc440_pcix.c     |  6 +++++-
> >  hw/ppc/spapr_pci.c       |  6 +++++-
> >  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c |  6
> > +++++-
> >  include/hw/pci/pci.h     |  8 ++++++--
> >  include/hw/pci/pci_bus.h |  2 +-
> >  18 files changed, 90 insertions(+), 24 deletions(-)
> >
> > diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
> > 1795e2f..f271de1 100644
> > --- a/hw/alpha/typhoon.c
> > +++ b/hw/alpha/typhoon.c
> > @@ -740,6 +740,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &s->pchip.iommu_as;
> >  }
> >
> > +static const PCIIOMMUOps typhoon_iommu_ops = {
> > +    .get_address_space = typhoon_pci_dma_iommu, };
> > +
> >  static void typhoon_set_irq(void *opaque, int irq, int level)  {
> >      TyphoonState *s = opaque;
> > @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
> **isa_bus, qemu_irq *p_rtc_irq,
> >                               "iommu-typhoon", UINT64_MAX);
> >      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
> >pchip.iommu),
> >                         "pchip0-pci");
> > -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
> > +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
> >
> >      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
> >      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
> > &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
> > b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
> > --- a/hw/arm/smmu-common.c
> > +++ b/hw/arm/smmu-common.c
> > @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus,
> void *opaque, int devfn)
> >      return &sdev->as;
> >  }
> >
> > +static const PCIIOMMUOps smmu_ops = {
> > +    .get_address_space = smmu_find_add_as, };
> > +
> >  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
> >      uint8_t bus_n, devfn;
> > @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev, Error
> **errp)
> >      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
> >
> >      if (s->primary_bus) {
> > -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
> > +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
> >      } else {
> >          error_setg(errp, "SMMU is not attached to any PCI bus!");
> >      }
> > diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index 2b1b38c..3da4f84
> > 100644
> > --- a/hw/hppa/dino.c
> > +++ b/hw/hppa/dino.c
> > @@ -459,6 +459,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus
> *bus, void *opaque,
> >      return &s->bm_as;
> >  }
> >
> > +static const PCIIOMMUOps dino_iommu_ops = {
> > +    .get_address_space = dino_pcihost_set_iommu, };
> > +
> >  /*
> >   * Dino interrupts are connected as shown on Page 78, Table 23
> >   * (Little-endian bit numbers)
> > @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
> >      memory_region_add_subregion(&s->bm, 0xfff00000,
> >                                  &s->bm_cpu_alias);
> >      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> > -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
> > +    pci_setup_iommu(b, &dino_iommu_ops, s);
> >
> >      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
> >      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0); diff
> > --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
> > b1175e5..5fec30e 100644
> > --- a/hw/i386/amd_iommu.c
> > +++ b/hw/i386/amd_iommu.c
> > @@ -1451,6 +1451,10 @@ static AddressSpace
> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> >      return &iommu_as[devfn]->as;
> >  }
> >
> > +static const PCIIOMMUOps amdvi_iommu_ops = {
> > +    .get_address_space = amdvi_host_dma_iommu, };
> > +
> >  static const MemoryRegionOps mmio_mem_ops = {
> >      .read = amdvi_mmio_read,
> >      .write = amdvi_mmio_write,
> > @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
> > Error **errp)
> >
> >      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
> > -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
> > +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
> >      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
> >      msi_init(&s->pci.dev, 0, 1, true, false, errp);
> >      amdvi_init(s);
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> > df7ad25..4b22910 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -3729,6 +3729,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &vtd_as->as;
> >  }
> >
> > +static PCIIOMMUOps vtd_iommu_ops = {
> static const

got it.
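
i.e. the declaration would become:

    static const PCIIOMMUOps vtd_iommu_ops = {
        .get_address_space = vtd_host_dma_iommu,
    };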

> > +    .get_address_space = vtd_host_dma_iommu, };
> > +
> >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
> >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
> > +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >                                                g_free, g_free);
> >      vtd_init(s);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> Q35_HOST_BRIDGE_IOMMU_ADDR);
> > -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
> > +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> >      /* Pseudo address space under root PCI bus. */
> >      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
> Q35_PSEUDO_DEVFN_IOAPIC);
> >      qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
> > diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c index
> > dd24551..4c6338a 100644
> > --- a/hw/pci-host/designware.c
> > +++ b/hw/pci-host/designware.c
> > @@ -645,6 +645,10 @@ static AddressSpace
> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
> >      return &s->pci.address_space;
> >  }
> >
> > +static const PCIIOMMUOps designware_iommu_ops = {
> > +    .get_address_space = designware_pcie_host_set_iommu, };
> > +
> >  static void designware_pcie_host_realize(DeviceState *dev, Error
> > **errp)  {
> >      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
> > static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
> >      address_space_init(&s->pci.address_space,
> >                         &s->pci.address_space_root,
> >                         "pcie-bus-address-space");
> > -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
> > +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
> >
> >      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
> >      qdev_init_nofail(DEVICE(&s->root));
> > diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
> > 74618fa..ecfe627 100644
> > --- a/hw/pci-host/pnv_phb3.c
> > +++ b/hw/pci-host/pnv_phb3.c
> > @@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &ds->dma_as;
> >  }
> >
> > +static PCIIOMMUOps pnv_phb3_iommu_ops = {
> static const
got it. :-)

> > +    .get_address_space = pnv_phb3_dma_iommu, };
> > +
> >  static void pnv_phb3_instance_init(Object *obj)  {
> >      PnvPHB3 *phb = PNV_PHB3(obj);
> > @@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev, Error
> **errp)
> >                                       &phb->pci_mmio, &phb->pci_io,
> >                                       0, 4, TYPE_PNV_PHB3_ROOT_BUS);
> >
> > -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
> > +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
> >
> >      /* Add a single Root port */
> >      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
> > diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c index
> > 23cf093..04e95e3 100644
> > --- a/hw/pci-host/pnv_phb4.c
> > +++ b/hw/pci-host/pnv_phb4.c
> > @@ -1148,6 +1148,10 @@ static AddressSpace *pnv_phb4_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &ds->dma_as;
> >  }
> >
> > +static PCIIOMMUOps pnv_phb4_iommu_ops = {
> idem
will add const.

> > +    .get_address_space = pnv_phb4_dma_iommu, };
> > +
> >  static void pnv_phb4_instance_init(Object *obj)  {
> >      PnvPHB4 *phb = PNV_PHB4(obj);
> > @@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev, Error
> **errp)
> >                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
> >                                       &phb->pci_mmio, &phb->pci_io,
> >                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
> > -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
> > +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
> >
> >      /* Add a single Root port */
> >      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
> > diff --git a/hw/pci-host/ppce500.c b/hw/pci-host/ppce500.c index
> > d710727..5baf5db 100644
> > --- a/hw/pci-host/ppce500.c
> > +++ b/hw/pci-host/ppce500.c
> > @@ -439,6 +439,10 @@ static AddressSpace *e500_pcihost_set_iommu(PCIBus
> *bus, void *opaque,
> >      return &s->bm_as;
> >  }
> >
> > +static const PCIIOMMUOps ppce500_iommu_ops = {
> > +    .get_address_space = e500_pcihost_set_iommu, };
> > +
> >  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
> >      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
> > static void e500_pcihost_realize(DeviceState *dev, Error **errp)
> >      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
> >      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
> >      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> > -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
> > +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
> >
> >      pci_create_simple(b, 0, "e500-host-bridge");
> >
> > diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
> > 1a02e9a..7c57311 100644
> > --- a/hw/pci-host/prep.c
> > +++ b/hw/pci-host/prep.c
> > @@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus
> *bus, void *opaque,
> >      return &s->bm_as;
> >  }
> >
> > +static const PCIIOMMUOps raven_iommu_ops = {
> > +    .get_address_space = raven_pcihost_set_iommu, };
> > +
> >  static void raven_change_gpio(void *opaque, int n, int level)  {
> >      PREPPCIState *s = opaque;
> > @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
> >      memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
> >      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
> >      address_space_init(&s->bm_as, &s->bm, "raven-bm");
> > -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
> > +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
> >
> >      h->bus = &s->pci_bus;
> >
> > diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
> > 2b8503b..251549b 100644
> > --- a/hw/pci-host/sabre.c
> > +++ b/hw/pci-host/sabre.c
> > @@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &is->iommu_as;
> >  }
> >
> > +static const PCIIOMMUOps sabre_iommu_ops = {
> > +    .get_address_space = sabre_pci_dma_iommu, };
> > +
> >  static void sabre_config_write(void *opaque, hwaddr addr,
> >                                 uint64_t val, unsigned size)  { @@
> > -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
> >      /* IOMMU */
> >      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
> >                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
> > -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
> > +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
> >
> >      /* APB secondary busses */
> >      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
> > true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index e1ed667..aa9025c
> > 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2644,7 +2644,7 @@ AddressSpace
> *pci_device_iommu_address_space(PCIDevice *dev)
> >      PCIBus *iommu_bus = bus;
> >      uint8_t devfn = dev->devfn;
> >
> > -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev)
> {
> > +    while (iommu_bus && !iommu_bus->iommu_ops &&
> > + iommu_bus->parent_dev) {
> Depending on future usage, this is not strictly identical to the original
> code. You exit the loop as soon as iommu_bus->iommu_ops is set, regardless
> of whether get_address_space() is present.

To be identical with the original code, I could add a get_address_space()
presence check, so that the loop only exits when iommu_bus->iommu_ops is
set and iommu_bus->iommu_ops->get_address_space() is set as well. But is
it possible for an intermediate iommu_bus to have iommu_ops set while its
get_address_space() is NULL? I guess not, since iommu_ops is set by the
vIOMMU and the vIOMMU does not differentiate between buses.

Also, the presence of get_address_space() will be checked when we
actually try to use it, right?
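
For illustration, the variant with the presence check might look roughly
like below (just a sketch, not what the posted patch does; the loop body
that walks up towards the root bus is elided):

    while (iommu_bus &&
           !(iommu_bus->iommu_ops &&
             iommu_bus->iommu_ops->get_address_space) &&
           iommu_bus->parent_dev) {
        /* unchanged: move iommu_bus up to the parent bridge's bus */
    }
    /* use-time check: fall back to the global address space otherwise */
    if (iommu_bus && iommu_bus->iommu_ops &&
        iommu_bus->iommu_ops->get_address_space) {
        return iommu_bus->iommu_ops->get_address_space(bus,
                                                       iommu_bus->iommu_opaque,
                                                       devfn);
    }
    return &address_space_memory;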

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
@ 2020-04-02  8:52       ` Liu, Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-02  8:52 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, david, Wu, Hao

> From: Auger Eric < eric.auger@redhat.com>
> Sent: Monday, March 30, 2020 7:02 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
> PCIIOMMUOps
> 
> 
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of
> > setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space for
> > a PCI device in vendor specific way. The PCIIOMMUOps still offers this
> > functionality. But using PCIIOMMUOps leaves space to add more iommu
> > related vendor specific operations.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/alpha/typhoon.c       |  6 +++++-
> >  hw/arm/smmu-common.c     |  6 +++++-
> >  hw/hppa/dino.c           |  6 +++++-
> >  hw/i386/amd_iommu.c      |  6 +++++-
> >  hw/i386/intel_iommu.c    |  6 +++++-
> >  hw/pci-host/designware.c |  6 +++++-
> >  hw/pci-host/pnv_phb3.c   |  6 +++++-
> >  hw/pci-host/pnv_phb4.c   |  6 +++++-
> >  hw/pci-host/ppce500.c    |  6 +++++-
> >  hw/pci-host/prep.c       |  6 +++++-
> >  hw/pci-host/sabre.c      |  6 +++++-
> >  hw/pci/pci.c             | 12 +++++++-----
> >  hw/ppc/ppc440_pcix.c     |  6 +++++-
> >  hw/ppc/spapr_pci.c       |  6 +++++-
> >  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c |  6
> > +++++-
> >  include/hw/pci/pci.h     |  8 ++++++--
> >  include/hw/pci/pci_bus.h |  2 +-
> >  18 files changed, 90 insertions(+), 24 deletions(-)
> >
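
The include/hw/pci/pci.h hunk is not quoted in this mail; roughly, the
shape of the change there is along the lines below (exact naming may
differ from the actual patch):

    /* before: a single callback */
    typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);

    /* after: a callback table, leaving room for further IOMMU hooks */
    typedef struct PCIIOMMUOps {
        AddressSpace *(*get_address_space)(PCIBus *bus, void *opaque,
                                           int devfn);
    } PCIIOMMUOps;

    void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque);
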
> > diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
> > 1795e2f..f271de1 100644
> > --- a/hw/alpha/typhoon.c
> > +++ b/hw/alpha/typhoon.c
> > @@ -740,6 +740,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &s->pchip.iommu_as;
> >  }
> >
> > +static const PCIIOMMUOps typhoon_iommu_ops = {
> > +    .get_address_space = typhoon_pci_dma_iommu, };
> > +
> >  static void typhoon_set_irq(void *opaque, int irq, int level)  {
> >      TyphoonState *s = opaque;
> > @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
> **isa_bus, qemu_irq *p_rtc_irq,
> >                               "iommu-typhoon", UINT64_MAX);
> >      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
> >pchip.iommu),
> >                         "pchip0-pci");
> > -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
> > +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
> >
> >      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
> >      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
> > &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
> > b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
> > --- a/hw/arm/smmu-common.c
> > +++ b/hw/arm/smmu-common.c
> > @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus,
> void *opaque, int devfn)
> >      return &sdev->as;
> >  }
> >
> > +static const PCIIOMMUOps smmu_ops = {
> > +    .get_address_space = smmu_find_add_as, };
> > +
> >  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
> >      uint8_t bus_n, devfn;
> > @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev, Error
> **errp)
> >      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
> >
> >      if (s->primary_bus) {
> > -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
> > +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
> >      } else {
> >          error_setg(errp, "SMMU is not attached to any PCI bus!");
> >      }
> > diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index 2b1b38c..3da4f84
> > 100644
> > --- a/hw/hppa/dino.c
> > +++ b/hw/hppa/dino.c
> > @@ -459,6 +459,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus
> *bus, void *opaque,
> >      return &s->bm_as;
> >  }
> >
> > +static const PCIIOMMUOps dino_iommu_ops = {
> > +    .get_address_space = dino_pcihost_set_iommu, };
> > +
> >  /*
> >   * Dino interrupts are connected as shown on Page 78, Table 23
> >   * (Little-endian bit numbers)
> > @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
> >      memory_region_add_subregion(&s->bm, 0xfff00000,
> >                                  &s->bm_cpu_alias);
> >      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> > -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
> > +    pci_setup_iommu(b, &dino_iommu_ops, s);
> >
> >      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
> >      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0); diff
> > --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
> > b1175e5..5fec30e 100644
> > --- a/hw/i386/amd_iommu.c
> > +++ b/hw/i386/amd_iommu.c
> > @@ -1451,6 +1451,10 @@ static AddressSpace
> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> >      return &iommu_as[devfn]->as;
> >  }
> >
> > +static const PCIIOMMUOps amdvi_iommu_ops = {
> > +    .get_address_space = amdvi_host_dma_iommu, };
> > +
> >  static const MemoryRegionOps mmio_mem_ops = {
> >      .read = amdvi_mmio_read,
> >      .write = amdvi_mmio_write,
> > @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
> > Error **errp)
> >
> >      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
> > -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
> > +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
> >      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
> >      msi_init(&s->pci.dev, 0, 1, true, false, errp);
> >      amdvi_init(s);
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> > df7ad25..4b22910 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -3729,6 +3729,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &vtd_as->as;
> >  }
> >
> > +static PCIIOMMUOps vtd_iommu_ops = {
> static const

got it.

> > +    .get_address_space = vtd_host_dma_iommu, };
> > +
> >  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
> >      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
> > +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >                                                g_free, g_free);
> >      vtd_init(s);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> Q35_HOST_BRIDGE_IOMMU_ADDR);
> > -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
> > +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> >      /* Pseudo address space under root PCI bus. */
> >      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
> Q35_PSEUDO_DEVFN_IOAPIC);
> >      qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
> > diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c index
> > dd24551..4c6338a 100644
> > --- a/hw/pci-host/designware.c
> > +++ b/hw/pci-host/designware.c
> > @@ -645,6 +645,10 @@ static AddressSpace
> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
> >      return &s->pci.address_space;
> >  }
> >
> > +static const PCIIOMMUOps designware_iommu_ops = {
> > +    .get_address_space = designware_pcie_host_set_iommu, };
> > +
> >  static void designware_pcie_host_realize(DeviceState *dev, Error
> > **errp)  {
> >      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
> > static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
> >      address_space_init(&s->pci.address_space,
> >                         &s->pci.address_space_root,
> >                         "pcie-bus-address-space");
> > -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
> > +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
> >
> >      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
> >      qdev_init_nofail(DEVICE(&s->root));
> > diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
> > 74618fa..ecfe627 100644
> > --- a/hw/pci-host/pnv_phb3.c
> > +++ b/hw/pci-host/pnv_phb3.c
> > @@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &ds->dma_as;
> >  }
> >
> > +static PCIIOMMUOps pnv_phb3_iommu_ops = {
> static const
got it. :-)

> > +    .get_address_space = pnv_phb3_dma_iommu, };
> > +
> >  static void pnv_phb3_instance_init(Object *obj)  {
> >      PnvPHB3 *phb = PNV_PHB3(obj);
> > @@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev, Error
> **errp)
> >                                       &phb->pci_mmio, &phb->pci_io,
> >                                       0, 4, TYPE_PNV_PHB3_ROOT_BUS);
> >
> > -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
> > +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
> >
> >      /* Add a single Root port */
> >      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
> > diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c index
> > 23cf093..04e95e3 100644
> > --- a/hw/pci-host/pnv_phb4.c
> > +++ b/hw/pci-host/pnv_phb4.c
> > @@ -1148,6 +1148,10 @@ static AddressSpace *pnv_phb4_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &ds->dma_as;
> >  }
> >
> > +static PCIIOMMUOps pnv_phb4_iommu_ops = {
> idem
will add const.

> > +    .get_address_space = pnv_phb4_dma_iommu, };
> > +
> >  static void pnv_phb4_instance_init(Object *obj)  {
> >      PnvPHB4 *phb = PNV_PHB4(obj);
> > @@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev, Error
> **errp)
> >                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
> >                                       &phb->pci_mmio, &phb->pci_io,
> >                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
> > -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
> > +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
> >
> >      /* Add a single Root port */
> >      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
> > diff --git a/hw/pci-host/ppce500.c b/hw/pci-host/ppce500.c index
> > d710727..5baf5db 100644
> > --- a/hw/pci-host/ppce500.c
> > +++ b/hw/pci-host/ppce500.c
> > @@ -439,6 +439,10 @@ static AddressSpace *e500_pcihost_set_iommu(PCIBus
> *bus, void *opaque,
> >      return &s->bm_as;
> >  }
> >
> > +static const PCIIOMMUOps ppce500_iommu_ops = {
> > +    .get_address_space = e500_pcihost_set_iommu, };
> > +
> >  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
> >      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
> > static void e500_pcihost_realize(DeviceState *dev, Error **errp)
> >      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
> >      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
> >      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> > -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
> > +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
> >
> >      pci_create_simple(b, 0, "e500-host-bridge");
> >
> > diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
> > 1a02e9a..7c57311 100644
> > --- a/hw/pci-host/prep.c
> > +++ b/hw/pci-host/prep.c
> > @@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus
> *bus, void *opaque,
> >      return &s->bm_as;
> >  }
> >
> > +static const PCIIOMMUOps raven_iommu_ops = {
> > +    .get_address_space = raven_pcihost_set_iommu, };
> > +
> >  static void raven_change_gpio(void *opaque, int n, int level)  {
> >      PREPPCIState *s = opaque;
> > @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
> >      memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
> >      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
> >      address_space_init(&s->bm_as, &s->bm, "raven-bm");
> > -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
> > +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
> >
> >      h->bus = &s->pci_bus;
> >
> > diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
> > 2b8503b..251549b 100644
> > --- a/hw/pci-host/sabre.c
> > +++ b/hw/pci-host/sabre.c
> > @@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus
> *bus, void *opaque, int devfn)
> >      return &is->iommu_as;
> >  }
> >
> > +static const PCIIOMMUOps sabre_iommu_ops = {
> > +    .get_address_space = sabre_pci_dma_iommu, };
> > +
> >  static void sabre_config_write(void *opaque, hwaddr addr,
> >                                 uint64_t val, unsigned size)  { @@
> > -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
> >      /* IOMMU */
> >      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
> >                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
> > -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
> > +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
> >
> >      /* APB secondary busses */
> >      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
> > true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index e1ed667..aa9025c
> > 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2644,7 +2644,7 @@ AddressSpace
> *pci_device_iommu_address_space(PCIDevice *dev)
> >      PCIBus *iommu_bus = bus;
> >      uint8_t devfn = dev->devfn;
> >
> > -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev)
> {
> > +    while (iommu_bus && !iommu_bus->iommu_ops &&
> > + iommu_bus->parent_dev) {
> Depending on future usage, this is not strictly identical to the original
> code. You exit the loop as soon as iommu_bus->iommu_ops is set, regardless
> of whether get_address_space() is present.

To be identical with the original code, I could add a get_address_space()
presence check, so that the loop only exits when iommu_bus->iommu_ops is
set and iommu_bus->iommu_ops->get_address_space() is set as well. But is
it possible for an intermediate iommu_bus to have iommu_ops set while its
get_address_space() is NULL? I guess not, since iommu_ops is set by the
vIOMMU and the vIOMMU does not differentiate between buses.

Also, the presence of get_address_space() will be checked when we
actually try to use it, right?

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  2020-04-02  8:52       ` Liu, Yi L
@ 2020-04-02 12:41         ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-02 12:41 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

Hi Yi,

On 4/2/20 10:52 AM, Liu, Yi L wrote:
>> From: Auger Eric < eric.auger@redhat.com>
>> Sent: Monday, March 30, 2020 7:02 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
>> PCIIOMMUOps
>>
>>
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of
>>> setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space for
>>> a PCI device in vendor specific way. The PCIIOMMUOps still offers this
>>> functionality. But using PCIIOMMUOps leaves space to add more iommu
>>> related vendor specific operations.
>>>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Eric Auger <eric.auger@redhat.com>
>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  hw/alpha/typhoon.c       |  6 +++++-
>>>  hw/arm/smmu-common.c     |  6 +++++-
>>>  hw/hppa/dino.c           |  6 +++++-
>>>  hw/i386/amd_iommu.c      |  6 +++++-
>>>  hw/i386/intel_iommu.c    |  6 +++++-
>>>  hw/pci-host/designware.c |  6 +++++-
>>>  hw/pci-host/pnv_phb3.c   |  6 +++++-
>>>  hw/pci-host/pnv_phb4.c   |  6 +++++-
>>>  hw/pci-host/ppce500.c    |  6 +++++-
>>>  hw/pci-host/prep.c       |  6 +++++-
>>>  hw/pci-host/sabre.c      |  6 +++++-
>>>  hw/pci/pci.c             | 12 +++++++-----
>>>  hw/ppc/ppc440_pcix.c     |  6 +++++-
>>>  hw/ppc/spapr_pci.c       |  6 +++++-
>>>  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c |  6
>>> +++++-
>>>  include/hw/pci/pci.h     |  8 ++++++--
>>>  include/hw/pci/pci_bus.h |  2 +-
>>>  18 files changed, 90 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
>>> 1795e2f..f271de1 100644
>>> --- a/hw/alpha/typhoon.c
>>> +++ b/hw/alpha/typhoon.c
>>> @@ -740,6 +740,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &s->pchip.iommu_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps typhoon_iommu_ops = {
>>> +    .get_address_space = typhoon_pci_dma_iommu, };
>>> +
>>>  static void typhoon_set_irq(void *opaque, int irq, int level)  {
>>>      TyphoonState *s = opaque;
>>> @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
>> **isa_bus, qemu_irq *p_rtc_irq,
>>>                               "iommu-typhoon", UINT64_MAX);
>>>      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
>>> pchip.iommu),
>>>                         "pchip0-pci");
>>> -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
>>> +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
>>>
>>>      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
>>>      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
>>> &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
>>> b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
>>> --- a/hw/arm/smmu-common.c
>>> +++ b/hw/arm/smmu-common.c
>>> @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus,
>> void *opaque, int devfn)
>>>      return &sdev->as;
>>>  }
>>>
>>> +static const PCIIOMMUOps smmu_ops = {
>>> +    .get_address_space = smmu_find_add_as, };
>>> +
>>>  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
>>>      uint8_t bus_n, devfn;
>>> @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev, Error
>> **errp)
>>>      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
>>>
>>>      if (s->primary_bus) {
>>> -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
>>> +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
>>>      } else {
>>>          error_setg(errp, "SMMU is not attached to any PCI bus!");
>>>      }
>>> diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index 2b1b38c..3da4f84
>>> 100644
>>> --- a/hw/hppa/dino.c
>>> +++ b/hw/hppa/dino.c
>>> @@ -459,6 +459,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus
>> *bus, void *opaque,
>>>      return &s->bm_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps dino_iommu_ops = {
>>> +    .get_address_space = dino_pcihost_set_iommu, };
>>> +
>>>  /*
>>>   * Dino interrupts are connected as shown on Page 78, Table 23
>>>   * (Little-endian bit numbers)
>>> @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
>>>      memory_region_add_subregion(&s->bm, 0xfff00000,
>>>                                  &s->bm_cpu_alias);
>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>> -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
>>> +    pci_setup_iommu(b, &dino_iommu_ops, s);
>>>
>>>      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
>>>      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0); diff
>>> --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
>>> b1175e5..5fec30e 100644
>>> --- a/hw/i386/amd_iommu.c
>>> +++ b/hw/i386/amd_iommu.c
>>> @@ -1451,6 +1451,10 @@ static AddressSpace
>> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>>      return &iommu_as[devfn]->as;
>>>  }
>>>
>>> +static const PCIIOMMUOps amdvi_iommu_ops = {
>>> +    .get_address_space = amdvi_host_dma_iommu, };
>>> +
>>>  static const MemoryRegionOps mmio_mem_ops = {
>>>      .read = amdvi_mmio_read,
>>>      .write = amdvi_mmio_write,
>>> @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
>>> Error **errp)
>>>
>>>      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
>>> -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
>>> +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
>>>      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
>>>      msi_init(&s->pci.dev, 0, 1, true, false, errp);
>>>      amdvi_init(s);
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
>>> df7ad25..4b22910 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -3729,6 +3729,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &vtd_as->as;
>>>  }
>>>
>>> +static PCIIOMMUOps vtd_iommu_ops = {
>> static const
> 
> got it.
> 
>>> +    .get_address_space = vtd_host_dma_iommu, };
>>> +
>>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
>>> +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>>>                                                g_free, g_free);
>>>      vtd_init(s);
>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
>> Q35_HOST_BRIDGE_IOMMU_ADDR);
>>> -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
>>> +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>>>      /* Pseudo address space under root PCI bus. */
>>>      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
>> Q35_PSEUDO_DEVFN_IOAPIC);
>>>      qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
>>> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c index
>>> dd24551..4c6338a 100644
>>> --- a/hw/pci-host/designware.c
>>> +++ b/hw/pci-host/designware.c
>>> @@ -645,6 +645,10 @@ static AddressSpace
>> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
>>>      return &s->pci.address_space;
>>>  }
>>>
>>> +static const PCIIOMMUOps designware_iommu_ops = {
>>> +    .get_address_space = designware_pcie_host_set_iommu, };
>>> +
>>>  static void designware_pcie_host_realize(DeviceState *dev, Error
>>> **errp)  {
>>>      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
>>> static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
>>>      address_space_init(&s->pci.address_space,
>>>                         &s->pci.address_space_root,
>>>                         "pcie-bus-address-space");
>>> -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
>>> +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
>>>
>>>      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
>>>      qdev_init_nofail(DEVICE(&s->root));
>>> diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
>>> 74618fa..ecfe627 100644
>>> --- a/hw/pci-host/pnv_phb3.c
>>> +++ b/hw/pci-host/pnv_phb3.c
>>> @@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &ds->dma_as;
>>>  }
>>>
>>> +static PCIIOMMUOps pnv_phb3_iommu_ops = {
>> static const
> got it. :-)
> 
>>> +    .get_address_space = pnv_phb3_dma_iommu, };
>>> +
>>>  static void pnv_phb3_instance_init(Object *obj)  {
>>>      PnvPHB3 *phb = PNV_PHB3(obj);
>>> @@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev, Error
>> **errp)
>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>                                       0, 4, TYPE_PNV_PHB3_ROOT_BUS);
>>>
>>> -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
>>> +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
>>>
>>>      /* Add a single Root port */
>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
>>> diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c index
>>> 23cf093..04e95e3 100644
>>> --- a/hw/pci-host/pnv_phb4.c
>>> +++ b/hw/pci-host/pnv_phb4.c
>>> @@ -1148,6 +1148,10 @@ static AddressSpace *pnv_phb4_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &ds->dma_as;
>>>  }
>>>
>>> +static PCIIOMMUOps pnv_phb4_iommu_ops = {
>> idem
> will add const.
> 
>>> +    .get_address_space = pnv_phb4_dma_iommu, };
>>> +
>>>  static void pnv_phb4_instance_init(Object *obj)  {
>>>      PnvPHB4 *phb = PNV_PHB4(obj);
>>> @@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev, Error
>> **errp)
>>>                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
>>> -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
>>> +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
>>>
>>>      /* Add a single Root port */
>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
>>> diff --git a/hw/pci-host/ppce500.c b/hw/pci-host/ppce500.c index
>>> d710727..5baf5db 100644
>>> --- a/hw/pci-host/ppce500.c
>>> +++ b/hw/pci-host/ppce500.c
>>> @@ -439,6 +439,10 @@ static AddressSpace *e500_pcihost_set_iommu(PCIBus
>> *bus, void *opaque,
>>>      return &s->bm_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps ppce500_iommu_ops = {
>>> +    .get_address_space = e500_pcihost_set_iommu, };
>>> +
>>>  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
>>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
>>> static void e500_pcihost_realize(DeviceState *dev, Error **errp)
>>>      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
>>>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>> -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
>>> +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
>>>
>>>      pci_create_simple(b, 0, "e500-host-bridge");
>>>
>>> diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
>>> 1a02e9a..7c57311 100644
>>> --- a/hw/pci-host/prep.c
>>> +++ b/hw/pci-host/prep.c
>>> @@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus
>> *bus, void *opaque,
>>>      return &s->bm_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps raven_iommu_ops = {
>>> +    .get_address_space = raven_pcihost_set_iommu, };
>>> +
>>>  static void raven_change_gpio(void *opaque, int n, int level)  {
>>>      PREPPCIState *s = opaque;
>>> @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
>>>      memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
>>>      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
>>>      address_space_init(&s->bm_as, &s->bm, "raven-bm");
>>> -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
>>> +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
>>>
>>>      h->bus = &s->pci_bus;
>>>
>>> diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
>>> 2b8503b..251549b 100644
>>> --- a/hw/pci-host/sabre.c
>>> +++ b/hw/pci-host/sabre.c
>>> @@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &is->iommu_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps sabre_iommu_ops = {
>>> +    .get_address_space = sabre_pci_dma_iommu, };
>>> +
>>>  static void sabre_config_write(void *opaque, hwaddr addr,
>>>                                 uint64_t val, unsigned size)  { @@
>>> -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
>>>      /* IOMMU */
>>>      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
>>>                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
>>> -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
>>> +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
>>>
>>>      /* APB secondary busses */
>>>      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
>>> true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index e1ed667..aa9025c
>>> 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -2644,7 +2644,7 @@ AddressSpace
>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>      PCIBus *iommu_bus = bus;
>>>      uint8_t devfn = dev->devfn;
>>>
>>> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev)
>> {
>>> +    while (iommu_bus && !iommu_bus->iommu_ops &&
>>> + iommu_bus->parent_dev) {
>> Depending on future usage, this is not strictly identical to the original
>> code. You exit the loop as soon as iommu_bus->iommu_ops is set, regardless
>> of whether get_address_space() is present.
> 
> To be identical with the original code, I could add a get_address_space()
> presence check, so that the loop only exits when iommu_bus->iommu_ops is
> set and iommu_bus->iommu_ops->get_address_space() is set as well. But is
> it possible for an intermediate iommu_bus to have iommu_ops set while its
> get_address_space() is NULL? I guess not, since iommu_ops is set by the
> vIOMMU and the vIOMMU does not differentiate between buses.

I don't know. That depends on how the ops are going to be used in the
future. Can't you enforce that get_address_space() is a mandatory op?
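
For instance, something along these lines (just a sketch of the idea, not
the posted patch; it assumes pci_setup_iommu() simply stores the ops and
opaque pointer on the bus, as the current code does with iommu_fn):

    void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
    {
        /* get_address_space() is treated as a mandatory callback, so the
         * walk in pci_device_iommu_address_space() can rely on it once it
         * finds a bus with iommu_ops set. */
        assert(ops && ops->get_address_space);
        bus->iommu_ops = ops;
        bus->iommu_opaque = opaque;
    }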

Thanks

Eric
> 
> Also, the presence of get_address_space() will be checked when we
> actually try to use it, right?
> 
> Regards,
> Yi Liu
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
@ 2020-04-02 12:41         ` Auger Eric
  0 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-02 12:41 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, david, Wu, Hao

Hi Yi,

On 4/2/20 10:52 AM, Liu, Yi L wrote:
>> From: Auger Eric < eric.auger@redhat.com>
>> Sent: Monday, March 30, 2020 7:02 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
>> PCIIOMMUOps
>>
>>
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of
>>> setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space for
>>> a PCI device in vendor specific way. The PCIIOMMUOps still offers this
>>> functionality. But using PCIIOMMUOps leaves space to add more iommu
>>> related vendor specific operations.
>>>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Eric Auger <eric.auger@redhat.com>
>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  hw/alpha/typhoon.c       |  6 +++++-
>>>  hw/arm/smmu-common.c     |  6 +++++-
>>>  hw/hppa/dino.c           |  6 +++++-
>>>  hw/i386/amd_iommu.c      |  6 +++++-
>>>  hw/i386/intel_iommu.c    |  6 +++++-
>>>  hw/pci-host/designware.c |  6 +++++-
>>>  hw/pci-host/pnv_phb3.c   |  6 +++++-
>>>  hw/pci-host/pnv_phb4.c   |  6 +++++-
>>>  hw/pci-host/ppce500.c    |  6 +++++-
>>>  hw/pci-host/prep.c       |  6 +++++-
>>>  hw/pci-host/sabre.c      |  6 +++++-
>>>  hw/pci/pci.c             | 12 +++++++-----
>>>  hw/ppc/ppc440_pcix.c     |  6 +++++-
>>>  hw/ppc/spapr_pci.c       |  6 +++++-
>>>  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c |  6
>>> +++++-
>>>  include/hw/pci/pci.h     |  8 ++++++--
>>>  include/hw/pci/pci_bus.h |  2 +-
>>>  18 files changed, 90 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
>>> 1795e2f..f271de1 100644
>>> --- a/hw/alpha/typhoon.c
>>> +++ b/hw/alpha/typhoon.c
>>> @@ -740,6 +740,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &s->pchip.iommu_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps typhoon_iommu_ops = {
>>> +    .get_address_space = typhoon_pci_dma_iommu, };
>>> +
>>>  static void typhoon_set_irq(void *opaque, int irq, int level)  {
>>>      TyphoonState *s = opaque;
>>> @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
>> **isa_bus, qemu_irq *p_rtc_irq,
>>>                               "iommu-typhoon", UINT64_MAX);
>>>      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
>>> pchip.iommu),
>>>                         "pchip0-pci");
>>> -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
>>> +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
>>>
>>>      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
>>>      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
>>> &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
>>> b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
>>> --- a/hw/arm/smmu-common.c
>>> +++ b/hw/arm/smmu-common.c
>>> @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus,
>> void *opaque, int devfn)
>>>      return &sdev->as;
>>>  }
>>>
>>> +static const PCIIOMMUOps smmu_ops = {
>>> +    .get_address_space = smmu_find_add_as, };
>>> +
>>>  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
>>>      uint8_t bus_n, devfn;
>>> @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev, Error
>> **errp)
>>>      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
>>>
>>>      if (s->primary_bus) {
>>> -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
>>> +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
>>>      } else {
>>>          error_setg(errp, "SMMU is not attached to any PCI bus!");
>>>      }
>>> diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index 2b1b38c..3da4f84
>>> 100644
>>> --- a/hw/hppa/dino.c
>>> +++ b/hw/hppa/dino.c
>>> @@ -459,6 +459,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus
>> *bus, void *opaque,
>>>      return &s->bm_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps dino_iommu_ops = {
>>> +    .get_address_space = dino_pcihost_set_iommu, };
>>> +
>>>  /*
>>>   * Dino interrupts are connected as shown on Page 78, Table 23
>>>   * (Little-endian bit numbers)
>>> @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
>>>      memory_region_add_subregion(&s->bm, 0xfff00000,
>>>                                  &s->bm_cpu_alias);
>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>> -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
>>> +    pci_setup_iommu(b, &dino_iommu_ops, s);
>>>
>>>      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
>>>      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0); diff
>>> --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
>>> b1175e5..5fec30e 100644
>>> --- a/hw/i386/amd_iommu.c
>>> +++ b/hw/i386/amd_iommu.c
>>> @@ -1451,6 +1451,10 @@ static AddressSpace
>> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>>      return &iommu_as[devfn]->as;
>>>  }
>>>
>>> +static const PCIIOMMUOps amdvi_iommu_ops = {
>>> +    .get_address_space = amdvi_host_dma_iommu, };
>>> +
>>>  static const MemoryRegionOps mmio_mem_ops = {
>>>      .read = amdvi_mmio_read,
>>>      .write = amdvi_mmio_write,
>>> @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
>>> Error **errp)
>>>
>>>      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
>>> -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
>>> +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
>>>      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
>>>      msi_init(&s->pci.dev, 0, 1, true, false, errp);
>>>      amdvi_init(s);
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
>>> df7ad25..4b22910 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -3729,6 +3729,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &vtd_as->as;
>>>  }
>>>
>>> +static PCIIOMMUOps vtd_iommu_ops = {
>> static const
> 
> got it.
> 
>>> +    .get_address_space = vtd_host_dma_iommu, };
>>> +
>>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
>>> +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>>>                                                g_free, g_free);
>>>      vtd_init(s);
>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
>> Q35_HOST_BRIDGE_IOMMU_ADDR);
>>> -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
>>> +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>>>      /* Pseudo address space under root PCI bus. */
>>>      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
>> Q35_PSEUDO_DEVFN_IOAPIC);
>>>      qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
>>> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c index
>>> dd24551..4c6338a 100644
>>> --- a/hw/pci-host/designware.c
>>> +++ b/hw/pci-host/designware.c
>>> @@ -645,6 +645,10 @@ static AddressSpace
>> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
>>>      return &s->pci.address_space;
>>>  }
>>>
>>> +static const PCIIOMMUOps designware_iommu_ops = {
>>> +    .get_address_space = designware_pcie_host_set_iommu, };
>>> +
>>>  static void designware_pcie_host_realize(DeviceState *dev, Error
>>> **errp)  {
>>>      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
>>> static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
>>>      address_space_init(&s->pci.address_space,
>>>                         &s->pci.address_space_root,
>>>                         "pcie-bus-address-space");
>>> -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
>>> +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
>>>
>>>      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
>>>      qdev_init_nofail(DEVICE(&s->root));
>>> diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
>>> 74618fa..ecfe627 100644
>>> --- a/hw/pci-host/pnv_phb3.c
>>> +++ b/hw/pci-host/pnv_phb3.c
>>> @@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &ds->dma_as;
>>>  }
>>>
>>> +static PCIIOMMUOps pnv_phb3_iommu_ops = {
>> static const
> got it. :-)
> 
>>> +    .get_address_space = pnv_phb3_dma_iommu, };
>>> +
>>>  static void pnv_phb3_instance_init(Object *obj)  {
>>>      PnvPHB3 *phb = PNV_PHB3(obj);
>>> @@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev, Error
>> **errp)
>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>                                       0, 4, TYPE_PNV_PHB3_ROOT_BUS);
>>>
>>> -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
>>> +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
>>>
>>>      /* Add a single Root port */
>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
>>> diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c index
>>> 23cf093..04e95e3 100644
>>> --- a/hw/pci-host/pnv_phb4.c
>>> +++ b/hw/pci-host/pnv_phb4.c
>>> @@ -1148,6 +1148,10 @@ static AddressSpace *pnv_phb4_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &ds->dma_as;
>>>  }
>>>
>>> +static PCIIOMMUOps pnv_phb4_iommu_ops = {
>> idem
> will add const.
> 
>>> +    .get_address_space = pnv_phb4_dma_iommu, };
>>> +
>>>  static void pnv_phb4_instance_init(Object *obj)  {
>>>      PnvPHB4 *phb = PNV_PHB4(obj);
>>> @@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev, Error
>> **errp)
>>>                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
>>> -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
>>> +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
>>>
>>>      /* Add a single Root port */
>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis", phb->chip_id);
>>> diff --git a/hw/pci-host/ppce500.c b/hw/pci-host/ppce500.c index
>>> d710727..5baf5db 100644
>>> --- a/hw/pci-host/ppce500.c
>>> +++ b/hw/pci-host/ppce500.c
>>> @@ -439,6 +439,10 @@ static AddressSpace *e500_pcihost_set_iommu(PCIBus
>> *bus, void *opaque,
>>>      return &s->bm_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps ppce500_iommu_ops = {
>>> +    .get_address_space = e500_pcihost_set_iommu, };
>>> +
>>>  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
>>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
>>> static void e500_pcihost_realize(DeviceState *dev, Error **errp)
>>>      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
>>>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>> -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
>>> +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
>>>
>>>      pci_create_simple(b, 0, "e500-host-bridge");
>>>
>>> diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
>>> 1a02e9a..7c57311 100644
>>> --- a/hw/pci-host/prep.c
>>> +++ b/hw/pci-host/prep.c
>>> @@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus
>> *bus, void *opaque,
>>>      return &s->bm_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps raven_iommu_ops = {
>>> +    .get_address_space = raven_pcihost_set_iommu, };
>>> +
>>>  static void raven_change_gpio(void *opaque, int n, int level)  {
>>>      PREPPCIState *s = opaque;
>>> @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
>>>      memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
>>>      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
>>>      address_space_init(&s->bm_as, &s->bm, "raven-bm");
>>> -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
>>> +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
>>>
>>>      h->bus = &s->pci_bus;
>>>
>>> diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
>>> 2b8503b..251549b 100644
>>> --- a/hw/pci-host/sabre.c
>>> +++ b/hw/pci-host/sabre.c
>>> @@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus
>> *bus, void *opaque, int devfn)
>>>      return &is->iommu_as;
>>>  }
>>>
>>> +static const PCIIOMMUOps sabre_iommu_ops = {
>>> +    .get_address_space = sabre_pci_dma_iommu, };
>>> +
>>>  static void sabre_config_write(void *opaque, hwaddr addr,
>>>                                 uint64_t val, unsigned size)  { @@
>>> -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
>>>      /* IOMMU */
>>>      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
>>>                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
>>> -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
>>> +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
>>>
>>>      /* APB secondary busses */
>>>      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
>>> true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index e1ed667..aa9025c
>>> 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -2644,7 +2644,7 @@ AddressSpace
>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>      PCIBus *iommu_bus = bus;
>>>      uint8_t devfn = dev->devfn;
>>>
>>> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev)
>> {
>>> +    while (iommu_bus && !iommu_bus->iommu_ops &&
>>> + iommu_bus->parent_dev) {
>> Depending on future usage, this is not strictly identical to the original
>> code. You exit the loop as soon as iommu_bus->iommu_ops is set, regardless
>> of whether get_address_space() is present.
> 
> To be identical with the original code, I could add a get_address_space()
> presence check, so that the loop only exits when iommu_bus->iommu_ops is
> set and iommu_bus->iommu_ops->get_address_space() is set as well. But is
> it possible for an intermediate iommu_bus to have iommu_ops set while its
> get_address_space() is NULL? I guess not, since iommu_ops is set by the
> vIOMMU and the vIOMMU does not differentiate between buses.

I don't know. That depends on how the ops are going to be used in the
future. Can't you enforce that get_address_space() is a mandatory op?

Thanks

Eric
> 
> Also, the presence of get_address_space() will be checked when we
> actually try to use it, right?
> 
> Regards,
> Yi Liu
> 



^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  2020-04-02 12:41         ` Auger Eric
@ 2020-04-02 13:37           ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-02 13:37 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

Hi Eric,

> From: Auger Eric < eric.auger@redhat.com >
> Sent: Thursday, April 2, 2020 8:41 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
> PCIIOMMUOps
> 
> Hi Yi,
> 
> On 4/2/20 10:52 AM, Liu, Yi L wrote:
> >> From: Auger Eric < eric.auger@redhat.com>
> >> Sent: Monday, March 30, 2020 7:02 PM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> >> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
> >> PCIIOMMUOps
> >>
> >>
> >>
> >> On 3/30/20 6:24 AM, Liu Yi L wrote:
> >>> This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of
> >>> setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space
> >>> for a PCI device in vendor specific way. The PCIIOMMUOps still
> >>> offers this functionality. But using PCIIOMMUOps leaves space to add
> >>> more iommu related vendor specific operations.
> >>>
> >>> Cc: Kevin Tian <kevin.tian@intel.com>
> >>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Cc: Peter Xu <peterx@redhat.com>
> >>> Cc: Eric Auger <eric.auger@redhat.com>
> >>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> >>> Cc: David Gibson <david@gibson.dropbear.id.au>
> >>> Cc: Michael S. Tsirkin <mst@redhat.com>
> >>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>> Reviewed-by: Peter Xu <peterx@redhat.com>
> >>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >>> ---
> >>>  hw/alpha/typhoon.c       |  6 +++++-
> >>>  hw/arm/smmu-common.c     |  6 +++++-
> >>>  hw/hppa/dino.c           |  6 +++++-
> >>>  hw/i386/amd_iommu.c      |  6 +++++-
> >>>  hw/i386/intel_iommu.c    |  6 +++++-
> >>>  hw/pci-host/designware.c |  6 +++++-
> >>>  hw/pci-host/pnv_phb3.c   |  6 +++++-
> >>>  hw/pci-host/pnv_phb4.c   |  6 +++++-
> >>>  hw/pci-host/ppce500.c    |  6 +++++-
> >>>  hw/pci-host/prep.c       |  6 +++++-
> >>>  hw/pci-host/sabre.c      |  6 +++++-
> >>>  hw/pci/pci.c             | 12 +++++++-----
> >>>  hw/ppc/ppc440_pcix.c     |  6 +++++-
> >>>  hw/ppc/spapr_pci.c       |  6 +++++-
> >>>  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c |
> >>> 6
> >>> +++++-
> >>>  include/hw/pci/pci.h     |  8 ++++++--
> >>>  include/hw/pci/pci_bus.h |  2 +-
> >>>  18 files changed, 90 insertions(+), 24 deletions(-)
> >>>
> >>> diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
> >>> 1795e2f..f271de1 100644
> >>> --- a/hw/alpha/typhoon.c
> >>> +++ b/hw/alpha/typhoon.c
> >>> @@ -740,6 +740,10 @@ static AddressSpace
> >>> *typhoon_pci_dma_iommu(PCIBus
> >> *bus, void *opaque, int devfn)
> >>>      return &s->pchip.iommu_as;
> >>>  }
> >>>
> >>> +static const PCIIOMMUOps typhoon_iommu_ops = {
> >>> +    .get_address_space = typhoon_pci_dma_iommu, };
> >>> +
> >>>  static void typhoon_set_irq(void *opaque, int irq, int level)  {
> >>>      TyphoonState *s = opaque;
> >>> @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
> >> **isa_bus, qemu_irq *p_rtc_irq,
> >>>                               "iommu-typhoon", UINT64_MAX);
> >>>      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
> >>> pchip.iommu),
> >>>                         "pchip0-pci");
> >>> -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
> >>> +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
> >>>
> >>>      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
> >>>      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
> >>> &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
> >>> b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
> >>> --- a/hw/arm/smmu-common.c
> >>> +++ b/hw/arm/smmu-common.c
> >>> @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus
> >>> *bus,
> >> void *opaque, int devfn)
> >>>      return &sdev->as;
> >>>  }
> >>>
> >>> +static const PCIIOMMUOps smmu_ops = {
> >>> +    .get_address_space = smmu_find_add_as, };
> >>> +
> >>>  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
> >>>      uint8_t bus_n, devfn;
> >>> @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev,
> >>> Error
> >> **errp)
> >>>      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
> >>>
> >>>      if (s->primary_bus) {
> >>> -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
> >>> +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
> >>>      } else {
> >>>          error_setg(errp, "SMMU is not attached to any PCI bus!");
> >>>      }
> >>> diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index 2b1b38c..3da4f84
> >>> 100644
> >>> --- a/hw/hppa/dino.c
> >>> +++ b/hw/hppa/dino.c
> >>> @@ -459,6 +459,10 @@ static AddressSpace
> >>> *dino_pcihost_set_iommu(PCIBus
> >> *bus, void *opaque,
> >>>      return &s->bm_as;
> >>>  }
> >>>
> >>> +static const PCIIOMMUOps dino_iommu_ops = {
> >>> +    .get_address_space = dino_pcihost_set_iommu, };
> >>> +
> >>>  /*
> >>>   * Dino interrupts are connected as shown on Page 78, Table 23
> >>>   * (Little-endian bit numbers)
> >>> @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
> >>>      memory_region_add_subregion(&s->bm, 0xfff00000,
> >>>                                  &s->bm_cpu_alias);
> >>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> >>> -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
> >>> +    pci_setup_iommu(b, &dino_iommu_ops, s);
> >>>
> >>>      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
> >>>      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0); diff
> >>> --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
> >>> b1175e5..5fec30e 100644
> >>> --- a/hw/i386/amd_iommu.c
> >>> +++ b/hw/i386/amd_iommu.c
> >>> @@ -1451,6 +1451,10 @@ static AddressSpace
> >> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> >>>      return &iommu_as[devfn]->as;
> >>>  }
> >>>
> >>> +static const PCIIOMMUOps amdvi_iommu_ops = {
> >>> +    .get_address_space = amdvi_host_dma_iommu, };
> >>> +
> >>>  static const MemoryRegionOps mmio_mem_ops = {
> >>>      .read = amdvi_mmio_read,
> >>>      .write = amdvi_mmio_write,
> >>> @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
> >>> Error **errp)
> >>>
> >>>      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
> >>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
> >>> -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
> >>> +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
> >>>      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
> >>>      msi_init(&s->pci.dev, 0, 1, true, false, errp);
> >>>      amdvi_init(s);
> >>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> >>> df7ad25..4b22910 100644
> >>> --- a/hw/i386/intel_iommu.c
> >>> +++ b/hw/i386/intel_iommu.c
> >>> @@ -3729,6 +3729,10 @@ static AddressSpace
> >>> *vtd_host_dma_iommu(PCIBus
> >> *bus, void *opaque, int devfn)
> >>>      return &vtd_as->as;
> >>>  }
> >>>
> >>> +static PCIIOMMUOps vtd_iommu_ops = {
> >> static const
> >
> > got it.
> >
> >>> +    .get_address_space = vtd_host_dma_iommu, };
> >>> +
> >>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
> >>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
> >>> +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >>>                                                g_free, g_free);
> >>>      vtd_init(s);
> >>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> >> Q35_HOST_BRIDGE_IOMMU_ADDR);
> >>> -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
> >>> +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> >>>      /* Pseudo address space under root PCI bus. */
> >>>      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
> >> Q35_PSEUDO_DEVFN_IOAPIC);
> >>>      qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
> >>> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
> >>> index dd24551..4c6338a 100644
> >>> --- a/hw/pci-host/designware.c
> >>> +++ b/hw/pci-host/designware.c
> >>> @@ -645,6 +645,10 @@ static AddressSpace
> >> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
> >>>      return &s->pci.address_space;
> >>>  }
> >>>
> >>> +static const PCIIOMMUOps designware_iommu_ops = {
> >>> +    .get_address_space = designware_pcie_host_set_iommu, };
> >>> +
> >>>  static void designware_pcie_host_realize(DeviceState *dev, Error
> >>> **errp)  {
> >>>      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
> >>> static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
> >>>      address_space_init(&s->pci.address_space,
> >>>                         &s->pci.address_space_root,
> >>>                         "pcie-bus-address-space");
> >>> -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
> >>> +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
> >>>
> >>>      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
> >>>      qdev_init_nofail(DEVICE(&s->root));
> >>> diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
> >>> 74618fa..ecfe627 100644
> >>> --- a/hw/pci-host/pnv_phb3.c
> >>> +++ b/hw/pci-host/pnv_phb3.c
> >>> @@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus
> >> *bus, void *opaque, int devfn)
> >>>      return &ds->dma_as;
> >>>  }
> >>>
> >>> +static PCIIOMMUOps pnv_phb3_iommu_ops = {
> >> static const
> > got it. :-)
> >
> >>> +    .get_address_space = pnv_phb3_dma_iommu, };
> >>> +
> >>>  static void pnv_phb3_instance_init(Object *obj)  {
> >>>      PnvPHB3 *phb = PNV_PHB3(obj);
> >>> @@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev,
> >>> Error
> >> **errp)
> >>>                                       &phb->pci_mmio, &phb->pci_io,
> >>>                                       0, 4, TYPE_PNV_PHB3_ROOT_BUS);
> >>>
> >>> -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
> >>> +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
> >>>
> >>>      /* Add a single Root port */
> >>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
> >>> phb->chip_id); diff --git a/hw/pci-host/pnv_phb4.c
> >>> b/hw/pci-host/pnv_phb4.c index
> >>> 23cf093..04e95e3 100644
> >>> --- a/hw/pci-host/pnv_phb4.c
> >>> +++ b/hw/pci-host/pnv_phb4.c
> >>> @@ -1148,6 +1148,10 @@ static AddressSpace
> >>> *pnv_phb4_dma_iommu(PCIBus
> >> *bus, void *opaque, int devfn)
> >>>      return &ds->dma_as;
> >>>  }
> >>>
> >>> +static PCIIOMMUOps pnv_phb4_iommu_ops = {
> >> idem
> > will add const.
> >
> >>> +    .get_address_space = pnv_phb4_dma_iommu, };
> >>> +
> >>>  static void pnv_phb4_instance_init(Object *obj)  {
> >>>      PnvPHB4 *phb = PNV_PHB4(obj);
> >>> @@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev,
> >>> Error
> >> **errp)
> >>>                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
> >>>                                       &phb->pci_mmio, &phb->pci_io,
> >>>                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
> >>> -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
> >>> +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
> >>>
> >>>      /* Add a single Root port */
> >>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
> >>> phb->chip_id); diff --git a/hw/pci-host/ppce500.c
> >>> b/hw/pci-host/ppce500.c index d710727..5baf5db 100644
> >>> --- a/hw/pci-host/ppce500.c
> >>> +++ b/hw/pci-host/ppce500.c
> >>> @@ -439,6 +439,10 @@ static AddressSpace
> >>> *e500_pcihost_set_iommu(PCIBus
> >> *bus, void *opaque,
> >>>      return &s->bm_as;
> >>>  }
> >>>
> >>> +static const PCIIOMMUOps ppce500_iommu_ops = {
> >>> +    .get_address_space = e500_pcihost_set_iommu, };
> >>> +
> >>>  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
> >>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
> >>> static void e500_pcihost_realize(DeviceState *dev, Error **errp)
> >>>      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
> >>>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
> >>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> >>> -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
> >>> +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
> >>>
> >>>      pci_create_simple(b, 0, "e500-host-bridge");
> >>>
> >>> diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
> >>> 1a02e9a..7c57311 100644
> >>> --- a/hw/pci-host/prep.c
> >>> +++ b/hw/pci-host/prep.c
> >>> @@ -213,6 +213,10 @@ static AddressSpace
> >>> *raven_pcihost_set_iommu(PCIBus
> >> *bus, void *opaque,
> >>>      return &s->bm_as;
> >>>  }
> >>>
> >>> +static const PCIIOMMUOps raven_iommu_ops = {
> >>> +    .get_address_space = raven_pcihost_set_iommu, };
> >>> +
> >>>  static void raven_change_gpio(void *opaque, int n, int level)  {
> >>>      PREPPCIState *s = opaque;
> >>> @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
> >>>      memory_region_add_subregion(&s->bm, 0         , &s-
> >bm_pci_memory_alias);
> >>>      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
> >>>      address_space_init(&s->bm_as, &s->bm, "raven-bm");
> >>> -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
> >>> +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
> >>>
> >>>      h->bus = &s->pci_bus;
> >>>
> >>> diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
> >>> 2b8503b..251549b 100644
> >>> --- a/hw/pci-host/sabre.c
> >>> +++ b/hw/pci-host/sabre.c
> >>> @@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus
> >> *bus, void *opaque, int devfn)
> >>>      return &is->iommu_as;
> >>>  }
> >>>
> >>> +static const PCIIOMMUOps sabre_iommu_ops = {
> >>> +    .get_address_space = sabre_pci_dma_iommu, };
> >>> +
> >>>  static void sabre_config_write(void *opaque, hwaddr addr,
> >>>                                 uint64_t val, unsigned size)  { @@
> >>> -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
> >>>      /* IOMMU */
> >>>      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
> >>>                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
> >>> -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
> >>> +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
> >>>
> >>>      /* APB secondary busses */
> >>>      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
> >>> true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index
> >>> e1ed667..aa9025c
> >>> 100644
> >>> --- a/hw/pci/pci.c
> >>> +++ b/hw/pci/pci.c
> >>> @@ -2644,7 +2644,7 @@ AddressSpace
> >> *pci_device_iommu_address_space(PCIDevice *dev)
> >>>      PCIBus *iommu_bus = bus;
> >>>      uint8_t devfn = dev->devfn;
> >>>
> >>> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus-
> >parent_dev)
> >> {
> >>> +    while (iommu_bus && !iommu_bus->iommu_ops &&
> >>> + iommu_bus->parent_dev) {
> >> Depending on future usage, this is not strictly identical to the
> >> original code. You exit the loop as soon as a iommu_bus->iommu_ops is
> >> set whatever the presence of get_address_space().
> >
> > To be identical with the original code, we may add the get_address_space()
> > presence check. Then the loop exits only when iommu_bus->iommu_ops is
> > set and iommu_bus->iommu_ops->get_address_space() is set as well.
> > But is it possible that there is an intermediate iommu_bus which has
> > iommu_ops set but get_address_space() clear? I guess not, as
> > iommu_ops is set by the vIOMMU and the vIOMMU won't differentiate buses?
> 
> I don't know. That depends on how the ops are going to be used in the future. Can't
> you enforce the fact that get_address_space() is a mandatory ops?

No, I didn't mean that. Actually, in the patch, the get_address_space() presence is checked.
I'm not sure whether your point is to add a get_address_space() presence check instead of
just checking the iommu_ops presence.
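
For illustration, a minimal sketch of the stricter walk being discussed, where
the loop only stops at a bus whose iommu_ops actually provide
get_address_space(). This is not the code from the patch; the bridge devfn
remapping done by the real function is omitted, and the field names are taken
from the quoted diff:

/*
 * Illustrative sketch only, not the code from the series: keep walking up
 * the bus hierarchy until we find iommu_ops that actually provide
 * get_address_space(), instead of stopping at any non-NULL iommu_ops.
 */
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
{
    PCIBus *bus = pci_get_bus(dev);
    PCIBus *iommu_bus = bus;
    uint8_t devfn = dev->devfn;

    while (iommu_bus && iommu_bus->parent_dev &&
           !(iommu_bus->iommu_ops &&
             iommu_bus->iommu_ops->get_address_space)) {
        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
    }

    if (iommu_bus && iommu_bus->iommu_ops &&
        iommu_bus->iommu_ops->get_address_space) {
        return iommu_bus->iommu_ops->get_address_space(bus,
                                                       iommu_bus->iommu_opaque,
                                                       devfn);
    }

    return &address_space_memory;
}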

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
  2020-04-02  6:46       ` Liu, Yi L
@ 2020-04-02 13:44         ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-02 13:44 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

On Thu, Apr 02, 2020 at 06:46:11AM +0000, Liu, Yi L wrote:

[...]

> > > +/**
> > > + * This function replays the guest pasid bindings to the host by
> > > + * walking the guest PASID table. This ensures the host will have
> > > + * the latest guest pasid bindings. The caller should hold iommu_lock.
> > > + */
> > > +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> > > +                                            VTDPASIDCacheInfo
> > > +*pc_info) {
> > > +    VTDHostIOMMUContext *vtd_dev_icx;
> > > +    int start = 0, end = VTD_HPASID_MAX;
> > > +    vtd_pasid_table_walk_info walk_info = {.flags = 0};
> > 
> > So vtd_pasid_table_walk_info is still used.  I thought we had reached a consensus
> > that this can be dropped?
> 
> yeah, I had considered your suggestion and planned to do it. But when
> I started coding, it looked a little bit weird to me:
> For one, there is an input VTDPASIDCacheInfo in this function. It may be
> natural to think about passing that parameter on to the further call
> (vtd_replay_pasid_bind_for_dev()). But we can't do that. The vtd_bus/devfn
> fields should be filled when looping over the assigned devices, not taken from
> the one passed in by the vtd_replay_guest_pasid_bindings() caller.

A hacky way would be to directly modify the VTDPASIDCacheInfo* with the
bus/devfn for the loop.  Otherwise we can duplicate the object when looping,
so that we avoid introducing a new struct which seems to contain mostly the
same information.
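
A rough sketch of the second option, duplicating the caller's object per
device. The list, field and helper names (vtd_dev_icx_list, vtd_bus, devfn,
vtd_replay_pasid_bind_for_dev) are assumptions taken from the discussion,
not the actual definitions in the series:

/*
 * Illustrative sketch only: reuse VTDPASIDCacheInfo for the walk by copying
 * it per device, instead of introducing vtd_pasid_table_walk_info.
 */
static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
                                            VTDPASIDCacheInfo *pc_info)
{
    VTDHostIOMMUContext *vtd_dev_icx;

    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
        /* Per-device copy: keep the caller's scope, override bus/devfn */
        VTDPASIDCacheInfo dev_info = *pc_info;

        dev_info.vtd_bus = vtd_dev_icx->vtd_bus;
        dev_info.devfn   = vtd_dev_icx->devfn;

        vtd_replay_pasid_bind_for_dev(s, &dev_info);
    }
}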

> For two, reusing the VTDPASIDCacheInfo for passing the walk info may require
> the final user to do the same thing as what vtd_replay_guest_pasid_bindings()
> has done here.

I don't see that happening, could you explain?

> 
> So kept the vtd_pasid_table_walk_info.

[...]

> > > +/**
> > > + * This function syncs the pasid bindings between guest and host.
> > > + * It includes updating the pasid cache in vIOMMU and updating the
> > > + * pasid bindings per guest's latest pasid entry presence.
> > > + */
> > > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > > +                                 VTDPASIDCacheInfo *pc_info) {
> > > +    /*
> > > +     * Regarding a pasid cache invalidation, e.g. a PSI,
> > > +     * it could be any of the cases below:
> > > +     * a) a present pasid entry moved to non-present
> > > +     * b) a present pasid entry to be a present entry
> > > +     * c) a non-present pasid entry moved to present
> > > +     *
> > > +     * Different invalidation granularity may affect different device
> > > +     * scope and pasid scope. But for each invalidation granularity,
> > > +     * it needs to do two steps to sync host and guest pasid binding.
> > > +     *
> > > +     * Here is the handling of a PSI:
> > > +     * 1) loop all the existing vtd_pasid_as instances to update them
> > > +     *    according to the latest guest pasid entry in pasid table.
> > > +     *    This will make sure the affected existing vtd_pasid_as instances
> > > +     *    cache the latest pasid entries. Also, during the loop, the
> > > +     *    host should be notified if needed. e.g. pasid unbind or pasid
> > > +     *    update. Should be able to cover case a) and case b).
> > > +     *
> > > +     * 2) loop all devices to cover case c)
> > > +     *    - For devices which have HostIOMMUContext instances,
> > > +     *      we loop them and check if guest pasid entry exists. If yes,
> > > +     *      it is case c), we update the pasid cache and also notify
> > > +     *      host.
> > > +     *    - For devices which have no HostIOMMUContext, it is not
> > > +     *      necessary to create pasid cache at this phase since it
> > > +     *      could be created when vIOMMU does DMA address translation.
> > > +     *      This is not yet implemented since there are no emulated
> > > +     *      pasid-capable devices today. If we have such devices in
> > > +     *      future, the pasid cache shall be created there.
> > > +     * Other granularities follow the same steps, just with a different scope.
> > > +     *
> > > +     */
> > > +
> > > +    vtd_iommu_lock(s);
> > > +    /* Step 1: loop over all the existing vtd_pasid_as instances */
> > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > +                                vtd_flush_pasid, pc_info);
> > 
> > OK the series is evolving along with our discussions, and /me too on understanding
> > your series... Now I'm not very sure whether this operation is still useful...
> > 
> > The major point is you'll need to do pasid table walk for all the registered
> > devices
> > below.  So IIUC vtd_replay_guest_pasid_bindings() will be able to also detect
> > addition, removal or modification of pasid address spaces.  Am I right?
> 
> It's true if there are only assigned pasid-capable devices. If there is an
> emulated pasid-capable device, it would be a problem, as emulated devices
> won't register a HostIOMMUContext. The pasid cache invalidation
> for an emulated device would then be missed. So I chose to make step 1 cover
> the "real" cache invalidation (a.k.a. removal), while step 2 covers
> addition and modification.

OK.  Btw, I think modification should still belong to step 1 then (I
think you're doing that, though).
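
To make the split concrete, a simplified sketch of the step 1 callback under
the three cases described above; the lookup/unbind/update helper names are
placeholders, not the functions in the series:

/*
 * Illustrative sketch only: a g_hash_table_foreach_remove() callback for
 * step 1, covering case a) (removal) and case b) (modification).
 */
static gboolean vtd_flush_pasid(gpointer key, gpointer value,
                                gpointer user_data)
{
    VTDPASIDAddressSpace *vtd_pasid_as = value;
    VTDPASIDEntry pe;

    if (vtd_lookup_guest_pasid_entry(vtd_pasid_as, &pe)) {
        /* case a): the guest entry is gone -> unbind from host, drop cache */
        vtd_unbind_pasid_from_host(vtd_pasid_as);
        return true;    /* returning true removes it from the hash table */
    }

    /* case b): still present -> refresh the cached entry, re-bind if changed */
    vtd_update_pasid_binding(vtd_pasid_as, &pe);
    return false;
}

/*
 * Step 2 then loops the devices that registered a HostIOMMUContext to pick
 * up case c): a pasid entry that became present but has no vtd_pasid_as yet.
 */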

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-04-02  8:33   ` Jason Wang
@ 2020-04-02 13:46     ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-02 13:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: Liu Yi L, qemu-devel, alex.williamson, jean-philippe, kevin.tian,
	kvm, mst, jun.j.tian, eric.auger, yi.y.sun, pbonzini, hao.wu,
	david

On Thu, Apr 02, 2020 at 04:33:02PM +0800, Jason Wang wrote:
> > The complete QEMU set can be found in below link:
> > https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2
> 
> 
> Hi Yi:
> 
> I could not find the branch there.

Jason,

He typed it wrong... It's actually (I found it myself):

https://github.com/luxis1999/qemu/tree/sva_vtd_v10_qemu_v2

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  2020-04-02 13:37           ` Liu, Yi L
@ 2020-04-02 13:49             ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-02 13:49 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

Hi Yi,

On 4/2/20 3:37 PM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric < eric.auger@redhat.com >
>> Sent: Thursday, April 2, 2020 8:41 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
>> PCIIOMMUOps
>>
>> Hi Yi,
>>
>> On 4/2/20 10:52 AM, Liu, Yi L wrote:
>>>> From: Auger Eric < eric.auger@redhat.com>
>>>> Sent: Monday, March 30, 2020 7:02 PM
>>>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>>>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
>>>> PCIIOMMUOps
>>>>
>>>>
>>>>
>>>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>>>> This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of
>>>>> setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space
>>>>> for a PCI device in vendor specific way. The PCIIOMMUOps still
>>>>> offers this functionality. But using PCIIOMMUOps leaves space to add
>>>>> more iommu related vendor specific operations.
>>>>>
>>>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>> Cc: Eric Auger <eric.auger@redhat.com>
>>>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>>>> ---
>>>>>  hw/alpha/typhoon.c       |  6 +++++-
>>>>>  hw/arm/smmu-common.c     |  6 +++++-
>>>>>  hw/hppa/dino.c           |  6 +++++-
>>>>>  hw/i386/amd_iommu.c      |  6 +++++-
>>>>>  hw/i386/intel_iommu.c    |  6 +++++-
>>>>>  hw/pci-host/designware.c |  6 +++++-
>>>>>  hw/pci-host/pnv_phb3.c   |  6 +++++-
>>>>>  hw/pci-host/pnv_phb4.c   |  6 +++++-
>>>>>  hw/pci-host/ppce500.c    |  6 +++++-
>>>>>  hw/pci-host/prep.c       |  6 +++++-
>>>>>  hw/pci-host/sabre.c      |  6 +++++-
>>>>>  hw/pci/pci.c             | 12 +++++++-----
>>>>>  hw/ppc/ppc440_pcix.c     |  6 +++++-
>>>>>  hw/ppc/spapr_pci.c       |  6 +++++-
>>>>>  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c |
>>>>> 6
>>>>> +++++-
>>>>>  include/hw/pci/pci.h     |  8 ++++++--
>>>>>  include/hw/pci/pci_bus.h |  2 +-
>>>>>  18 files changed, 90 insertions(+), 24 deletions(-)
>>>>>
>>>>> diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
>>>>> 1795e2f..f271de1 100644
>>>>> --- a/hw/alpha/typhoon.c
>>>>> +++ b/hw/alpha/typhoon.c
>>>>> @@ -740,6 +740,10 @@ static AddressSpace
>>>>> *typhoon_pci_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &s->pchip.iommu_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps typhoon_iommu_ops = {
>>>>> +    .get_address_space = typhoon_pci_dma_iommu, };
>>>>> +
>>>>>  static void typhoon_set_irq(void *opaque, int irq, int level)  {
>>>>>      TyphoonState *s = opaque;
>>>>> @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
>>>> **isa_bus, qemu_irq *p_rtc_irq,
>>>>>                               "iommu-typhoon", UINT64_MAX);
>>>>>      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
>>>>> pchip.iommu),
>>>>>                         "pchip0-pci");
>>>>> -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
>>>>> +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
>>>>>
>>>>>      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
>>>>>      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
>>>>> &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
>>>>> b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
>>>>> --- a/hw/arm/smmu-common.c
>>>>> +++ b/hw/arm/smmu-common.c
>>>>> @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus
>>>>> *bus,
>>>> void *opaque, int devfn)
>>>>>      return &sdev->as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps smmu_ops = {
>>>>> +    .get_address_space = smmu_find_add_as, };
>>>>> +
>>>>>  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
>>>>>      uint8_t bus_n, devfn;
>>>>> @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev,
>>>>> Error
>>>> **errp)
>>>>>      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
>>>>>
>>>>>      if (s->primary_bus) {
>>>>> -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
>>>>> +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
>>>>>      } else {
>>>>>          error_setg(errp, "SMMU is not attached to any PCI bus!");
>>>>>      }
>>>>> diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index 2b1b38c..3da4f84
>>>>> 100644
>>>>> --- a/hw/hppa/dino.c
>>>>> +++ b/hw/hppa/dino.c
>>>>> @@ -459,6 +459,10 @@ static AddressSpace
>>>>> *dino_pcihost_set_iommu(PCIBus
>>>> *bus, void *opaque,
>>>>>      return &s->bm_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps dino_iommu_ops = {
>>>>> +    .get_address_space = dino_pcihost_set_iommu, };
>>>>> +
>>>>>  /*
>>>>>   * Dino interrupts are connected as shown on Page 78, Table 23
>>>>>   * (Little-endian bit numbers)
>>>>> @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
>>>>>      memory_region_add_subregion(&s->bm, 0xfff00000,
>>>>>                                  &s->bm_cpu_alias);
>>>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>>>> -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
>>>>> +    pci_setup_iommu(b, &dino_iommu_ops, s);
>>>>>
>>>>>      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
>>>>>      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0); diff
>>>>> --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
>>>>> b1175e5..5fec30e 100644
>>>>> --- a/hw/i386/amd_iommu.c
>>>>> +++ b/hw/i386/amd_iommu.c
>>>>> @@ -1451,6 +1451,10 @@ static AddressSpace
>>>> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>>>>      return &iommu_as[devfn]->as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps amdvi_iommu_ops = {
>>>>> +    .get_address_space = amdvi_host_dma_iommu, };
>>>>> +
>>>>>  static const MemoryRegionOps mmio_mem_ops = {
>>>>>      .read = amdvi_mmio_read,
>>>>>      .write = amdvi_mmio_write,
>>>>> @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
>>>>> Error **errp)
>>>>>
>>>>>      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
>>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
>>>>> -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
>>>>> +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
>>>>>      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
>>>>>      msi_init(&s->pci.dev, 0, 1, true, false, errp);
>>>>>      amdvi_init(s);
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
>>>>> df7ad25..4b22910 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -3729,6 +3729,10 @@ static AddressSpace
>>>>> *vtd_host_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &vtd_as->as;
>>>>>  }
>>>>>
>>>>> +static PCIIOMMUOps vtd_iommu_ops = {
>>>> static const
>>>
>>> got it.
>>>
>>>>> +    .get_address_space = vtd_host_dma_iommu, };
>>>>> +
>>>>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
>>>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
>>>>> +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>>>>>                                                g_free, g_free);
>>>>>      vtd_init(s);
>>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
>>>> Q35_HOST_BRIDGE_IOMMU_ADDR);
>>>>> -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
>>>>> +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>>>>>      /* Pseudo address space under root PCI bus. */
>>>>>      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
>>>> Q35_PSEUDO_DEVFN_IOAPIC);
>>>>>      qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
>>>>> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
>>>>> index dd24551..4c6338a 100644
>>>>> --- a/hw/pci-host/designware.c
>>>>> +++ b/hw/pci-host/designware.c
>>>>> @@ -645,6 +645,10 @@ static AddressSpace
>>>> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
>>>>>      return &s->pci.address_space;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps designware_iommu_ops = {
>>>>> +    .get_address_space = designware_pcie_host_set_iommu, };
>>>>> +
>>>>>  static void designware_pcie_host_realize(DeviceState *dev, Error
>>>>> **errp)  {
>>>>>      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
>>>>> static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
>>>>>      address_space_init(&s->pci.address_space,
>>>>>                         &s->pci.address_space_root,
>>>>>                         "pcie-bus-address-space");
>>>>> -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
>>>>> +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
>>>>>
>>>>>      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
>>>>>      qdev_init_nofail(DEVICE(&s->root));
>>>>> diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
>>>>> 74618fa..ecfe627 100644
>>>>> --- a/hw/pci-host/pnv_phb3.c
>>>>> +++ b/hw/pci-host/pnv_phb3.c
>>>>> @@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &ds->dma_as;
>>>>>  }
>>>>>
>>>>> +static PCIIOMMUOps pnv_phb3_iommu_ops = {
>>>> static const
>>> got it. :-)
>>>
>>>>> +    .get_address_space = pnv_phb3_dma_iommu, };
>>>>> +
>>>>>  static void pnv_phb3_instance_init(Object *obj)  {
>>>>>      PnvPHB3 *phb = PNV_PHB3(obj);
>>>>> @@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev,
>>>>> Error
>>>> **errp)
>>>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>>>                                       0, 4, TYPE_PNV_PHB3_ROOT_BUS);
>>>>>
>>>>> -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
>>>>> +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
>>>>>
>>>>>      /* Add a single Root port */
>>>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
>>>>> phb->chip_id); diff --git a/hw/pci-host/pnv_phb4.c
>>>>> b/hw/pci-host/pnv_phb4.c index
>>>>> 23cf093..04e95e3 100644
>>>>> --- a/hw/pci-host/pnv_phb4.c
>>>>> +++ b/hw/pci-host/pnv_phb4.c
>>>>> @@ -1148,6 +1148,10 @@ static AddressSpace
>>>>> *pnv_phb4_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &ds->dma_as;
>>>>>  }
>>>>>
>>>>> +static PCIIOMMUOps pnv_phb4_iommu_ops = {
>>>> idem
>>> will add const.
>>>
>>>>> +    .get_address_space = pnv_phb4_dma_iommu, };
>>>>> +
>>>>>  static void pnv_phb4_instance_init(Object *obj)  {
>>>>>      PnvPHB4 *phb = PNV_PHB4(obj);
>>>>> @@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev,
>>>>> Error
>>>> **errp)
>>>>>                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
>>>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>>>                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
>>>>> -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
>>>>> +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
>>>>>
>>>>>      /* Add a single Root port */
>>>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
>>>>> phb->chip_id); diff --git a/hw/pci-host/ppce500.c
>>>>> b/hw/pci-host/ppce500.c index d710727..5baf5db 100644
>>>>> --- a/hw/pci-host/ppce500.c
>>>>> +++ b/hw/pci-host/ppce500.c
>>>>> @@ -439,6 +439,10 @@ static AddressSpace
>>>>> *e500_pcihost_set_iommu(PCIBus
>>>> *bus, void *opaque,
>>>>>      return &s->bm_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps ppce500_iommu_ops = {
>>>>> +    .get_address_space = e500_pcihost_set_iommu, };
>>>>> +
>>>>>  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
>>>>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
>>>>> static void e500_pcihost_realize(DeviceState *dev, Error **errp)
>>>>>      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
>>>>>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
>>>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>>>> -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
>>>>> +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
>>>>>
>>>>>      pci_create_simple(b, 0, "e500-host-bridge");
>>>>>
>>>>> diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
>>>>> 1a02e9a..7c57311 100644
>>>>> --- a/hw/pci-host/prep.c
>>>>> +++ b/hw/pci-host/prep.c
>>>>> @@ -213,6 +213,10 @@ static AddressSpace
>>>>> *raven_pcihost_set_iommu(PCIBus
>>>> *bus, void *opaque,
>>>>>      return &s->bm_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps raven_iommu_ops = {
>>>>> +    .get_address_space = raven_pcihost_set_iommu, };
>>>>> +
>>>>>  static void raven_change_gpio(void *opaque, int n, int level)  {
>>>>>      PREPPCIState *s = opaque;
>>>>> @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
>>>>>      memory_region_add_subregion(&s->bm, 0         , &s-
>>> bm_pci_memory_alias);
>>>>>      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
>>>>>      address_space_init(&s->bm_as, &s->bm, "raven-bm");
>>>>> -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
>>>>> +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
>>>>>
>>>>>      h->bus = &s->pci_bus;
>>>>>
>>>>> diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
>>>>> 2b8503b..251549b 100644
>>>>> --- a/hw/pci-host/sabre.c
>>>>> +++ b/hw/pci-host/sabre.c
>>>>> @@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &is->iommu_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps sabre_iommu_ops = {
>>>>> +    .get_address_space = sabre_pci_dma_iommu, };
>>>>> +
>>>>>  static void sabre_config_write(void *opaque, hwaddr addr,
>>>>>                                 uint64_t val, unsigned size)  { @@
>>>>> -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
>>>>>      /* IOMMU */
>>>>>      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
>>>>>                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
>>>>> -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
>>>>> +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
>>>>>
>>>>>      /* APB secondary busses */
>>>>>      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
>>>>> true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index
>>>>> e1ed667..aa9025c
>>>>> 100644
>>>>> --- a/hw/pci/pci.c
>>>>> +++ b/hw/pci/pci.c
>>>>> @@ -2644,7 +2644,7 @@ AddressSpace
>>>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>      PCIBus *iommu_bus = bus;
>>>>>      uint8_t devfn = dev->devfn;
>>>>>
>>>>> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus-
>>> parent_dev)
>>>> {
>>>>> +    while (iommu_bus && !iommu_bus->iommu_ops &&
>>>>> + iommu_bus->parent_dev) {
>>>> Depending on future usage, this is not strictly identical to the
>>>> original code. You exit the loop as soon as a iommu_bus->iommu_ops is
>>>> set whatever the presence of get_address_space().
>>>
>>> To be identical with the original code, we may add the get_address_space()
>>> presence check. Then the loop exits only when iommu_bus->iommu_ops is
>>> set and iommu_bus->iommu_ops->get_address_space() is set as well.
>>> But is it possible that there is an intermediate iommu_bus which has
>>> iommu_ops set but get_address_space() clear? I guess not, as
>>> iommu_ops is set by the vIOMMU and the vIOMMU won't differentiate buses?
>>
>> I don't know. That depends on how the ops are going to be used in the future. Can't
>> you enforce the fact that get_address_space() is a mandatory ops?
> 
> No, I didn't mean that. Actually, in the patch, the get_address_space() presence is checked.
> I'm not sure if your point is to add get_address_space() presence check instead of
> just checking the iommu_ops presence.
Yes, that was my point. I wanted to underline that the checks are not strictly
identical and that during enumeration you may find a device with ops
set but no get_address_space(). What I meant is that you should enforce,
somewhere in the code or in the documentation, that get_address_space()
is a mandatory operation in the ops struct and must be set as soon as
the struct is passed. Maybe in pci_setup_iommu() you could check that
get_address_space is set?
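
A minimal sketch of that registration-time check, assuming pci_setup_iommu()
takes the signature implied by the call sites in the quoted diff; whether to
assert() or report an error is left open:

/*
 * Illustrative sketch only: enforce get_address_space() when the ops are
 * registered.  The assert()-based policy is just one possible choice.
 */
void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
{
    /* get_address_space() is mandatory for any registered PCIIOMMUOps */
    assert(ops && ops->get_address_space);

    bus->iommu_ops = ops;
    bus->iommu_opaque = opaque;
}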

Thanks

Eric
> 
> Regards,
> Yi Liu
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
@ 2020-04-02 13:49             ` Auger Eric
  0 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-02 13:49 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, david, Wu, Hao

Hi Yi,

On 4/2/20 3:37 PM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric < eric.auger@redhat.com >
>> Sent: Thursday, April 2, 2020 8:41 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
>> PCIIOMMUOps
>>
>> Hi Yi,
>>
>> On 4/2/20 10:52 AM, Liu, Yi L wrote:
>>>> From: Auger Eric < eric.auger@redhat.com>
>>>> Sent: Monday, March 30, 2020 7:02 PM
>>>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>>>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
>>>> PCIIOMMUOps
>>>>
>>>>
>>>>
>>>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>>>> This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of
>>>>> setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space
>>>>> for a PCI device in vendor specific way. The PCIIOMMUOps still
>>>>> offers this functionality. But using PCIIOMMUOps leaves space to add
>>>>> more iommu related vendor specific operations.
>>>>>
>>>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>> Cc: Eric Auger <eric.auger@redhat.com>
>>>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>>>> ---
>>>>>  hw/alpha/typhoon.c       |  6 +++++-
>>>>>  hw/arm/smmu-common.c     |  6 +++++-
>>>>>  hw/hppa/dino.c           |  6 +++++-
>>>>>  hw/i386/amd_iommu.c      |  6 +++++-
>>>>>  hw/i386/intel_iommu.c    |  6 +++++-
>>>>>  hw/pci-host/designware.c |  6 +++++-
>>>>>  hw/pci-host/pnv_phb3.c   |  6 +++++-
>>>>>  hw/pci-host/pnv_phb4.c   |  6 +++++-
>>>>>  hw/pci-host/ppce500.c    |  6 +++++-
>>>>>  hw/pci-host/prep.c       |  6 +++++-
>>>>>  hw/pci-host/sabre.c      |  6 +++++-
>>>>>  hw/pci/pci.c             | 12 +++++++-----
>>>>>  hw/ppc/ppc440_pcix.c     |  6 +++++-
>>>>>  hw/ppc/spapr_pci.c       |  6 +++++-
>>>>>  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c |
>>>>> 6
>>>>> +++++-
>>>>>  include/hw/pci/pci.h     |  8 ++++++--
>>>>>  include/hw/pci/pci_bus.h |  2 +-
>>>>>  18 files changed, 90 insertions(+), 24 deletions(-)
>>>>>
>>>>> diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
>>>>> 1795e2f..f271de1 100644
>>>>> --- a/hw/alpha/typhoon.c
>>>>> +++ b/hw/alpha/typhoon.c
>>>>> @@ -740,6 +740,10 @@ static AddressSpace
>>>>> *typhoon_pci_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &s->pchip.iommu_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps typhoon_iommu_ops = {
>>>>> +    .get_address_space = typhoon_pci_dma_iommu, };
>>>>> +
>>>>>  static void typhoon_set_irq(void *opaque, int irq, int level)  {
>>>>>      TyphoonState *s = opaque;
>>>>> @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
>>>> **isa_bus, qemu_irq *p_rtc_irq,
>>>>>                               "iommu-typhoon", UINT64_MAX);
>>>>>      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
>>>>> pchip.iommu),
>>>>>                         "pchip0-pci");
>>>>> -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
>>>>> +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
>>>>>
>>>>>      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
>>>>>      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
>>>>> &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
>>>>> b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
>>>>> --- a/hw/arm/smmu-common.c
>>>>> +++ b/hw/arm/smmu-common.c
>>>>> @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus
>>>>> *bus,
>>>> void *opaque, int devfn)
>>>>>      return &sdev->as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps smmu_ops = {
>>>>> +    .get_address_space = smmu_find_add_as, };
>>>>> +
>>>>>  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
>>>>>      uint8_t bus_n, devfn;
>>>>> @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev,
>>>>> Error
>>>> **errp)
>>>>>      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
>>>>>
>>>>>      if (s->primary_bus) {
>>>>> -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
>>>>> +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
>>>>>      } else {
>>>>>          error_setg(errp, "SMMU is not attached to any PCI bus!");
>>>>>      }
>>>>> diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index 2b1b38c..3da4f84
>>>>> 100644
>>>>> --- a/hw/hppa/dino.c
>>>>> +++ b/hw/hppa/dino.c
>>>>> @@ -459,6 +459,10 @@ static AddressSpace
>>>>> *dino_pcihost_set_iommu(PCIBus
>>>> *bus, void *opaque,
>>>>>      return &s->bm_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps dino_iommu_ops = {
>>>>> +    .get_address_space = dino_pcihost_set_iommu, };
>>>>> +
>>>>>  /*
>>>>>   * Dino interrupts are connected as shown on Page 78, Table 23
>>>>>   * (Little-endian bit numbers)
>>>>> @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
>>>>>      memory_region_add_subregion(&s->bm, 0xfff00000,
>>>>>                                  &s->bm_cpu_alias);
>>>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>>>> -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
>>>>> +    pci_setup_iommu(b, &dino_iommu_ops, s);
>>>>>
>>>>>      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
>>>>>      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0); diff
>>>>> --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
>>>>> b1175e5..5fec30e 100644
>>>>> --- a/hw/i386/amd_iommu.c
>>>>> +++ b/hw/i386/amd_iommu.c
>>>>> @@ -1451,6 +1451,10 @@ static AddressSpace
>>>> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>>>>      return &iommu_as[devfn]->as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps amdvi_iommu_ops = {
>>>>> +    .get_address_space = amdvi_host_dma_iommu, };
>>>>> +
>>>>>  static const MemoryRegionOps mmio_mem_ops = {
>>>>>      .read = amdvi_mmio_read,
>>>>>      .write = amdvi_mmio_write,
>>>>> @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
>>>>> Error **errp)
>>>>>
>>>>>      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
>>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
>>>>> -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
>>>>> +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
>>>>>      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
>>>>>      msi_init(&s->pci.dev, 0, 1, true, false, errp);
>>>>>      amdvi_init(s);
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
>>>>> df7ad25..4b22910 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -3729,6 +3729,10 @@ static AddressSpace
>>>>> *vtd_host_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &vtd_as->as;
>>>>>  }
>>>>>
>>>>> +static PCIIOMMUOps vtd_iommu_ops = {
>>>> static const
>>>
>>> got it.
>>>
>>>>> +    .get_address_space = vtd_host_dma_iommu, };
>>>>> +
>>>>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
>>>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
>>>>> +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>>>>>                                                g_free, g_free);
>>>>>      vtd_init(s);
>>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
>>>> Q35_HOST_BRIDGE_IOMMU_ADDR);
>>>>> -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
>>>>> +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>>>>>      /* Pseudo address space under root PCI bus. */
>>>>>      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
>>>> Q35_PSEUDO_DEVFN_IOAPIC);
>>>>>      qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
>>>>> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
>>>>> index dd24551..4c6338a 100644
>>>>> --- a/hw/pci-host/designware.c
>>>>> +++ b/hw/pci-host/designware.c
>>>>> @@ -645,6 +645,10 @@ static AddressSpace
>>>> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
>>>>>      return &s->pci.address_space;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps designware_iommu_ops = {
>>>>> +    .get_address_space = designware_pcie_host_set_iommu, };
>>>>> +
>>>>>  static void designware_pcie_host_realize(DeviceState *dev, Error
>>>>> **errp)  {
>>>>>      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
>>>>> static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
>>>>>      address_space_init(&s->pci.address_space,
>>>>>                         &s->pci.address_space_root,
>>>>>                         "pcie-bus-address-space");
>>>>> -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
>>>>> +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
>>>>>
>>>>>      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
>>>>>      qdev_init_nofail(DEVICE(&s->root));
>>>>> diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
>>>>> 74618fa..ecfe627 100644
>>>>> --- a/hw/pci-host/pnv_phb3.c
>>>>> +++ b/hw/pci-host/pnv_phb3.c
>>>>> @@ -961,6 +961,10 @@ static AddressSpace *pnv_phb3_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &ds->dma_as;
>>>>>  }
>>>>>
>>>>> +static PCIIOMMUOps pnv_phb3_iommu_ops = {
>>>> static const
>>> got it. :-)
>>>
>>>>> +    .get_address_space = pnv_phb3_dma_iommu, };
>>>>> +
>>>>>  static void pnv_phb3_instance_init(Object *obj)  {
>>>>>      PnvPHB3 *phb = PNV_PHB3(obj);
>>>>> @@ -1059,7 +1063,7 @@ static void pnv_phb3_realize(DeviceState *dev,
>>>>> Error
>>>> **errp)
>>>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>>>                                       0, 4, TYPE_PNV_PHB3_ROOT_BUS);
>>>>>
>>>>> -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
>>>>> +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
>>>>>
>>>>>      /* Add a single Root port */
>>>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
>>>>> phb->chip_id); diff --git a/hw/pci-host/pnv_phb4.c
>>>>> b/hw/pci-host/pnv_phb4.c index
>>>>> 23cf093..04e95e3 100644
>>>>> --- a/hw/pci-host/pnv_phb4.c
>>>>> +++ b/hw/pci-host/pnv_phb4.c
>>>>> @@ -1148,6 +1148,10 @@ static AddressSpace
>>>>> *pnv_phb4_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &ds->dma_as;
>>>>>  }
>>>>>
>>>>> +static PCIIOMMUOps pnv_phb4_iommu_ops = {
>>>> idem
>>> will add const.
>>>
>>>>> +    .get_address_space = pnv_phb4_dma_iommu, };
>>>>> +
>>>>>  static void pnv_phb4_instance_init(Object *obj)  {
>>>>>      PnvPHB4 *phb = PNV_PHB4(obj);
>>>>> @@ -1205,7 +1209,7 @@ static void pnv_phb4_realize(DeviceState *dev,
>>>>> Error
>>>> **errp)
>>>>>                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
>>>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>>>                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
>>>>> -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
>>>>> +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
>>>>>
>>>>>      /* Add a single Root port */
>>>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
>>>>> phb->chip_id); diff --git a/hw/pci-host/ppce500.c
>>>>> b/hw/pci-host/ppce500.c index d710727..5baf5db 100644
>>>>> --- a/hw/pci-host/ppce500.c
>>>>> +++ b/hw/pci-host/ppce500.c
>>>>> @@ -439,6 +439,10 @@ static AddressSpace
>>>>> *e500_pcihost_set_iommu(PCIBus
>>>> *bus, void *opaque,
>>>>>      return &s->bm_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps ppce500_iommu_ops = {
>>>>> +    .get_address_space = e500_pcihost_set_iommu, };
>>>>> +
>>>>>  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
>>>>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
>>>>> static void e500_pcihost_realize(DeviceState *dev, Error **errp)
>>>>>      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
>>>>>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
>>>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>>>> -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
>>>>> +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
>>>>>
>>>>>      pci_create_simple(b, 0, "e500-host-bridge");
>>>>>
>>>>> diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
>>>>> 1a02e9a..7c57311 100644
>>>>> --- a/hw/pci-host/prep.c
>>>>> +++ b/hw/pci-host/prep.c
>>>>> @@ -213,6 +213,10 @@ static AddressSpace
>>>>> *raven_pcihost_set_iommu(PCIBus
>>>> *bus, void *opaque,
>>>>>      return &s->bm_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps raven_iommu_ops = {
>>>>> +    .get_address_space = raven_pcihost_set_iommu, };
>>>>> +
>>>>>  static void raven_change_gpio(void *opaque, int n, int level)  {
>>>>>      PREPPCIState *s = opaque;
>>>>> @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
>>>>>      memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
>>>>>      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
>>>>>      address_space_init(&s->bm_as, &s->bm, "raven-bm");
>>>>> -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
>>>>> +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
>>>>>
>>>>>      h->bus = &s->pci_bus;
>>>>>
>>>>> diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
>>>>> 2b8503b..251549b 100644
>>>>> --- a/hw/pci-host/sabre.c
>>>>> +++ b/hw/pci-host/sabre.c
>>>>> @@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus
>>>> *bus, void *opaque, int devfn)
>>>>>      return &is->iommu_as;
>>>>>  }
>>>>>
>>>>> +static const PCIIOMMUOps sabre_iommu_ops = {
>>>>> +    .get_address_space = sabre_pci_dma_iommu, };
>>>>> +
>>>>>  static void sabre_config_write(void *opaque, hwaddr addr,
>>>>>                                 uint64_t val, unsigned size)  { @@
>>>>> -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
>>>>>      /* IOMMU */
>>>>>      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
>>>>>                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
>>>>> -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
>>>>> +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
>>>>>
>>>>>      /* APB secondary busses */
>>>>>      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
>>>>> true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index
>>>>> e1ed667..aa9025c
>>>>> 100644
>>>>> --- a/hw/pci/pci.c
>>>>> +++ b/hw/pci/pci.c
>>>>> @@ -2644,7 +2644,7 @@ AddressSpace
>>>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>      PCIBus *iommu_bus = bus;
>>>>>      uint8_t devfn = dev->devfn;
>>>>>
>>>>> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>>>>> +    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
>>>> Depending on future usage, this is not strictly identical to the
>>>> original code. You exit the loop as soon as an iommu_bus->iommu_ops is
>>>> set, regardless of the presence of get_address_space().
>>>
>>> To be identical with the original code, we could add a get_address_space()
>>> presence check, so the loop exits only when iommu_bus->iommu_ops is set
>>> and iommu_bus->iommu_ops->get_address_space() is set as well.
>>> But is it possible that there is an intermediate iommu_bus which has
>>> iommu_ops set while get_address_space() is clear? I guess not, as
>>> iommu_ops is set by the vIOMMU and the vIOMMU won't differentiate buses.
>>
>> I don't know. That depends on how the ops are going to be used in the future. Can't
>> you enforce the fact that get_address_space() is a mandatory op?
> 
> No, I didn't mean that. Actually, in the patch, the get_address_space() presence is checked.
> I'm not sure if your point is to add a get_address_space() presence check instead of
> just checking the iommu_ops presence.
Yes, that was my point. I wanted to underline that the checks are not strictly
identical: during enumeration you may find a device with ops set but no
get_address_space(). So I meant you should enforce, somewhere in the code or
in the documentation, that get_address_space() is a mandatory operation in the
ops struct and must be set as soon as the struct is passed. Maybe in
pci_setup_iommu() you could check that get_address_space is set, as sketched
below?
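
A minimal sketch of such a check, assuming pci_setup_iommu() keeps the
signature used in this series (the body below is illustrative, not the
series' actual implementation):

    void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
    {
        /* get_address_space is mandatory for any IOMMU ops provider */
        assert(ops && ops->get_address_space);

        bus->iommu_ops = ops;
        bus->iommu_opaque = opaque;
    }

With that in place, pci_device_iommu_address_space() could keep checking only
the iommu_ops presence, since a non-NULL ops would imply a usable
get_address_space().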

Thanks

Eric
> 
> Regards,
> Yi Liu
> 



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host
  2020-03-30  4:24   ` Liu Yi L
@ 2020-04-02 18:09     ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-02 18:09 UTC (permalink / raw)
  To: Liu Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson

On Sun, Mar 29, 2020 at 09:24:54PM -0700, Liu Yi L wrote:
> +static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
> +                                int devfn, int pasid, VTDPASIDEntry *pe,
> +                                VTDPASIDOp op)
> +{
> +    VTDHostIOMMUContext *vtd_dev_icx;
> +    HostIOMMUContext *iommu_ctx;
> +    DualIOMMUStage1BindData *bind_data;
> +    struct iommu_gpasid_bind_data *g_bind_data;
> +    int ret = -1;
> +
> +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
> +    if (!vtd_dev_icx) {
> +        /* means no need to go further, e.g. for emulated devices */
> +        return 0;
> +    }
> +
> +    iommu_ctx = vtd_dev_icx->iommu_ctx;
> +    if (!iommu_ctx) {
> +        return -EINVAL;
> +    }
> +
> +    if (!(iommu_ctx->stage1_formats
> +             & IOMMU_PASID_FORMAT_INTEL_VTD)) {
> +        error_report_once("IOMMU Stage 1 format is not compatible!\n");
> +        return -EINVAL;
> +    }
> +
> +    bind_data = g_malloc0(sizeof(*bind_data));
> +    bind_data->pasid = pasid;
> +    g_bind_data = &bind_data->bind_data.gpasid_bind;
> +
> +    g_bind_data->flags = 0;
> +    g_bind_data->vtd.flags = 0;
> +    switch (op) {
> +    case VTD_PASID_BIND:
> +        g_bind_data->version = IOMMU_UAPI_VERSION;
> +        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
> +        g_bind_data->gpgd = vtd_pe_get_flpt_base(pe);
> +        g_bind_data->addr_width = vtd_pe_get_fl_aw(pe);
> +        g_bind_data->hpasid = pasid;
> +        g_bind_data->gpasid = pasid;
> +        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
> +        g_bind_data->vtd.flags =
> +                             (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ? 1 : 0)

This evaluates to 1 if VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]), or 0.
Do you want to use IOMMU_SVA_VTD_GPASID_SRE instead of 1?  Same
question to all the rest.
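
For instance (illustrative only, using the IOMMU_SVA_VTD_GPASID_* flag names
from the uapi header this series builds on), the vtd.flags assignment could
look like:

    g_bind_data->vtd.flags =
            (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ? IOMMU_SVA_VTD_GPASID_SRE : 0)
          | (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ? IOMMU_SVA_VTD_GPASID_EAFE : 0)
          | (VTD_SM_PASID_ENTRY_PCD_BIT(pe->val[1]) ? IOMMU_SVA_VTD_GPASID_PCD : 0)
          | (VTD_SM_PASID_ENTRY_PWT_BIT(pe->val[1]) ? IOMMU_SVA_VTD_GPASID_PWT : 0)
          | (VTD_SM_PASID_ENTRY_EMTE_BIT(pe->val[1]) ? IOMMU_SVA_VTD_GPASID_EMTE : 0)
          | (VTD_SM_PASID_ENTRY_CD_BIT(pe->val[1]) ? IOMMU_SVA_VTD_GPASID_CD : 0);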

> +                           | (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ? 1 : 0)
> +                           | (VTD_SM_PASID_ENTRY_PCD_BIT(pe->val[1]) ? 1 : 0)
> +                           | (VTD_SM_PASID_ENTRY_PWT_BIT(pe->val[1]) ? 1 : 0)
> +                           | (VTD_SM_PASID_ENTRY_EMTE_BIT(pe->val[1]) ? 1 : 0)
> +                           | (VTD_SM_PASID_ENTRY_CD_BIT(pe->val[1]) ? 1 : 0);
> +        g_bind_data->vtd.pat = VTD_SM_PASID_ENTRY_PAT(pe->val[1]);
> +        g_bind_data->vtd.emt = VTD_SM_PASID_ENTRY_EMT(pe->val[1]);
> +        ret = host_iommu_ctx_bind_stage1_pgtbl(iommu_ctx, bind_data);
> +        break;
> +    case VTD_PASID_UNBIND:
> +        g_bind_data->version = IOMMU_UAPI_VERSION;
> +        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
> +        g_bind_data->gpgd = 0;
> +        g_bind_data->addr_width = 0;
> +        g_bind_data->hpasid = pasid;
> +        g_bind_data->gpasid = pasid;
> +        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
> +        ret = host_iommu_ctx_unbind_stage1_pgtbl(iommu_ctx, bind_data);
> +        break;
> +    default:
> +        error_report_once("Unknown VTDPASIDOp!!!\n");
> +        break;
> +    }
> +
> +    g_free(bind_data);
> +
> +    return ret;
> +}

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host
@ 2020-04-02 18:09     ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-02 18:09 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jean-philippe, kevin.tian, Jacob Pan, Yi Sun, kvm, mst,
	jun.j.tian, qemu-devel, eric.auger, alex.williamson, pbonzini,
	hao.wu, yi.y.sun, Richard Henderson, david

On Sun, Mar 29, 2020 at 09:24:54PM -0700, Liu Yi L wrote:
> +static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
> +                                int devfn, int pasid, VTDPASIDEntry *pe,
> +                                VTDPASIDOp op)
> +{
> +    VTDHostIOMMUContext *vtd_dev_icx;
> +    HostIOMMUContext *iommu_ctx;
> +    DualIOMMUStage1BindData *bind_data;
> +    struct iommu_gpasid_bind_data *g_bind_data;
> +    int ret = -1;
> +
> +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
> +    if (!vtd_dev_icx) {
> +        /* means no need to go further, e.g. for emulated devices */
> +        return 0;
> +    }
> +
> +    iommu_ctx = vtd_dev_icx->iommu_ctx;
> +    if (!iommu_ctx) {
> +        return -EINVAL;
> +    }
> +
> +    if (!(iommu_ctx->stage1_formats
> +             & IOMMU_PASID_FORMAT_INTEL_VTD)) {
> +        error_report_once("IOMMU Stage 1 format is not compatible!\n");
> +        return -EINVAL;
> +    }
> +
> +    bind_data = g_malloc0(sizeof(*bind_data));
> +    bind_data->pasid = pasid;
> +    g_bind_data = &bind_data->bind_data.gpasid_bind;
> +
> +    g_bind_data->flags = 0;
> +    g_bind_data->vtd.flags = 0;
> +    switch (op) {
> +    case VTD_PASID_BIND:
> +        g_bind_data->version = IOMMU_UAPI_VERSION;
> +        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
> +        g_bind_data->gpgd = vtd_pe_get_flpt_base(pe);
> +        g_bind_data->addr_width = vtd_pe_get_fl_aw(pe);
> +        g_bind_data->hpasid = pasid;
> +        g_bind_data->gpasid = pasid;
> +        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
> +        g_bind_data->vtd.flags =
> +                             (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ? 1 : 0)

This evaluates to 1 if VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]), or 0.
Do you want to use IOMMU_SVA_VTD_GPASID_SRE instead of 1?  Same
question to all the rest.

> +                           | (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ? 1 : 0)
> +                           | (VTD_SM_PASID_ENTRY_PCD_BIT(pe->val[1]) ? 1 : 0)
> +                           | (VTD_SM_PASID_ENTRY_PWT_BIT(pe->val[1]) ? 1 : 0)
> +                           | (VTD_SM_PASID_ENTRY_EMTE_BIT(pe->val[1]) ? 1 : 0)
> +                           | (VTD_SM_PASID_ENTRY_CD_BIT(pe->val[1]) ? 1 : 0);
> +        g_bind_data->vtd.pat = VTD_SM_PASID_ENTRY_PAT(pe->val[1]);
> +        g_bind_data->vtd.emt = VTD_SM_PASID_ENTRY_EMT(pe->val[1]);
> +        ret = host_iommu_ctx_bind_stage1_pgtbl(iommu_ctx, bind_data);
> +        break;
> +    case VTD_PASID_UNBIND:
> +        g_bind_data->version = IOMMU_UAPI_VERSION;
> +        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
> +        g_bind_data->gpgd = 0;
> +        g_bind_data->addr_width = 0;
> +        g_bind_data->hpasid = pasid;
> +        g_bind_data->gpasid = pasid;
> +        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
> +        ret = host_iommu_ctx_unbind_stage1_pgtbl(iommu_ctx, bind_data);
> +        break;
> +    default:
> +        error_report_once("Unknown VTDPASIDOp!!!\n");
> +        break;
> +    }
> +
> +    g_free(bind_data);
> +
> +    return ret;
> +}

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-03-30  4:24 ` Liu Yi L
@ 2020-04-02 18:12   ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-02 18:12 UTC (permalink / raw)
  To: Liu Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe

On Sun, Mar 29, 2020 at 09:24:39PM -0700, Liu Yi L wrote:
> Tests: basci vSVA functionality test,

Could you elaborate on what the functionality test is?  Does it contain
at least some IOs that go through the SVA-capable device so the nested page
table is used?  I thought it was a yes, but after noticing that the
BIND message flags seem to be wrong, I really think I should ask this
out loud..

> VM reboot/shutdown/crash,

What's the VM crash test?

> kernel build in
> guest, boot VM with vSVA disabled, full comapilation with all archs.

I believe I've said similar things, but...  I'd appreciate if you can
also smoke on 2nd-level only with the series applied.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
@ 2020-04-02 18:12   ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-02 18:12 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jean-philippe, kevin.tian, kvm, mst, jun.j.tian, qemu-devel,
	eric.auger, alex.williamson, pbonzini, hao.wu, yi.y.sun, david

On Sun, Mar 29, 2020 at 09:24:39PM -0700, Liu Yi L wrote:
> Tests: basci vSVA functionality test,

Could you elaborate on what the functionality test is?  Does it contain
at least some IOs that go through the SVA-capable device so the nested page
table is used?  I thought it was a yes, but after noticing that the
BIND message flags seem to be wrong, I really think I should ask this
out loud..

> VM reboot/shutdown/crash,

What's the VM crash test?

> kernel build in
> guest, boot VM with vSVA disabled, full comapilation with all archs.

I believe I've said similar things, but...  I'd appreciate if you can
also smoke on 2nd-level only with the series applied.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-04-02 13:46     ` Peter Xu
@ 2020-04-03  1:38       ` Jason Wang
  -1 siblings, 0 replies; 160+ messages in thread
From: Jason Wang @ 2020-04-03  1:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: jean-philippe, kevin.tian, Liu Yi L, kvm, mst, jun.j.tian,
	qemu-devel, eric.auger, alex.williamson, pbonzini, david,
	yi.y.sun, hao.wu


On 2020/4/2 下午9:46, Peter Xu wrote:
> On Thu, Apr 02, 2020 at 04:33:02PM +0800, Jason Wang wrote:
>>> The complete QEMU set can be found in below link:
>>> https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2
>>
>> Hi Yi:
>>
>> I could not find the branch there.
> Jason,
>
> He typed wrong... It's actually (I found it myself):
>
> https://github.com/luxis1999/qemu/tree/sva_vtd_v10_qemu_v2


Aha, I see.

Thanks


>


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
@ 2020-04-03  1:38       ` Jason Wang
  0 siblings, 0 replies; 160+ messages in thread
From: Jason Wang @ 2020-04-03  1:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: jean-philippe, kevin.tian, Liu Yi L, kvm, mst, jun.j.tian,
	qemu-devel, eric.auger, alex.williamson, pbonzini, hao.wu,
	yi.y.sun, david


On 2020/4/2 下午9:46, Peter Xu wrote:
> On Thu, Apr 02, 2020 at 04:33:02PM +0800, Jason Wang wrote:
>>> The complete QEMU set can be found in below link:
>>> https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2
>>
>> Hi Yi:
>>
>> I could not find the branch there.
> Jason,
>
> He typed wrong... It's actually (I found it myself):
>
> https://github.com/luxis1999/qemu/tree/sva_vtd_v10_qemu_v2


Aha, I see.

Thanks


>



^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-04-02 13:46     ` Peter Xu
@ 2020-04-03 14:20       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 14:20 UTC (permalink / raw)
  To: Peter Xu, Jason Wang
  Cc: qemu-devel, alex.williamson, jean-philippe, Tian, Kevin, kvm,
	mst, Tian, Jun J, eric.auger, Sun, Yi Y, pbonzini, Wu, Hao,
	david

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 2, 2020 9:46 PM
> To: Jason Wang <jasowang@redhat.com>
> Subject: Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to
> VMs
> 
> On Thu, Apr 02, 2020 at 04:33:02PM +0800, Jason Wang wrote:
> > > The complete QEMU set can be found in below link:
> > > https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2
> >
> >
> > Hi Yi:
> >
> > I could not find the branch there.
> 
> Jason,
> 
> He typed wrong... It's actually (I found it myself):
> 
> https://github.com/luxis1999/qemu/tree/sva_vtd_v10_qemu_v2
Thanks, really a silly typo.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
@ 2020-04-03 14:20       ` Liu, Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 14:20 UTC (permalink / raw)
  To: Peter Xu, Jason Wang
  Cc: jean-philippe, Tian, Kevin, kvm, mst, Tian, Jun J, qemu-devel,
	eric.auger, alex.williamson, pbonzini, david, Sun, Yi Y, Wu, Hao

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 2, 2020 9:46 PM
> To: Jason Wang <jasowang@redhat.com>
> Subject: Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to
> VMs
> 
> On Thu, Apr 02, 2020 at 04:33:02PM +0800, Jason Wang wrote:
> > > The complete QEMU set can be found in below link:
> > > https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2
> >
> >
> > Hi Yi:
> >
> > I could not find the branch there.
> 
> Jason,
> 
> He typed wrong... It's actually (I found it myself):
> 
> https://github.com/luxis1999/qemu/tree/sva_vtd_v10_qemu_v2
Thanks, really a silly typo.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host
  2020-04-02 18:09     ` Peter Xu
@ 2020-04-03 14:29       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 14:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson

> From: Peter Xu <peterx@redhat.com>
> Sent: Friday, April 3, 2020 2:09 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host
> 
> On Sun, Mar 29, 2020 at 09:24:54PM -0700, Liu Yi L wrote:
> > +static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
> > +                                int devfn, int pasid, VTDPASIDEntry *pe,
> > +                                VTDPASIDOp op) {
> > +    VTDHostIOMMUContext *vtd_dev_icx;
> > +    HostIOMMUContext *iommu_ctx;
> > +    DualIOMMUStage1BindData *bind_data;
> > +    struct iommu_gpasid_bind_data *g_bind_data;
> > +    int ret = -1;
> > +
> > +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
> > +    if (!vtd_dev_icx) {
> > +        /* means no need to go further, e.g. for emulated devices */
> > +        return 0;
> > +    }
> > +
> > +    iommu_ctx = vtd_dev_icx->iommu_ctx;
> > +    if (!iommu_ctx) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (!(iommu_ctx->stage1_formats
> > +             & IOMMU_PASID_FORMAT_INTEL_VTD)) {
> > +        error_report_once("IOMMU Stage 1 format is not compatible!\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    bind_data = g_malloc0(sizeof(*bind_data));
> > +    bind_data->pasid = pasid;
> > +    g_bind_data = &bind_data->bind_data.gpasid_bind;
> > +
> > +    g_bind_data->flags = 0;
> > +    g_bind_data->vtd.flags = 0;
> > +    switch (op) {
> > +    case VTD_PASID_BIND:
> > +        g_bind_data->version = IOMMU_UAPI_VERSION;
> > +        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
> > +        g_bind_data->gpgd = vtd_pe_get_flpt_base(pe);
> > +        g_bind_data->addr_width = vtd_pe_get_fl_aw(pe);
> > +        g_bind_data->hpasid = pasid;
> > +        g_bind_data->gpasid = pasid;
> > +        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
> > +        g_bind_data->vtd.flags =
> > +                             (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2])
> > + ? 1 : 0)
> 
> This evaluates to 1 if VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]), or 0.
> Do you want to use IOMMU_SVA_VTD_GPASID_SRE instead of 1?  Same question to
> all the rest.

Oops, yes it is. You are right, thanks for catching it. During verification, only
the SRE bit is used, so this wasn't spotted in testing.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host
@ 2020-04-03 14:29       ` Liu, Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 14:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, qemu-devel, eric.auger, alex.williamson, pbonzini, Wu,
	Hao, Sun, Yi Y, Richard Henderson, david

> From: Peter Xu <peterx@redhat.com>
> Sent: Friday, April 3, 2020 2:09 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host
> 
> On Sun, Mar 29, 2020 at 09:24:54PM -0700, Liu Yi L wrote:
> > +static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
> > +                                int devfn, int pasid, VTDPASIDEntry *pe,
> > +                                VTDPASIDOp op) {
> > +    VTDHostIOMMUContext *vtd_dev_icx;
> > +    HostIOMMUContext *iommu_ctx;
> > +    DualIOMMUStage1BindData *bind_data;
> > +    struct iommu_gpasid_bind_data *g_bind_data;
> > +    int ret = -1;
> > +
> > +    vtd_dev_icx = vtd_bus->dev_icx[devfn];
> > +    if (!vtd_dev_icx) {
> > +        /* means no need to go further, e.g. for emulated devices */
> > +        return 0;
> > +    }
> > +
> > +    iommu_ctx = vtd_dev_icx->iommu_ctx;
> > +    if (!iommu_ctx) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (!(iommu_ctx->stage1_formats
> > +             & IOMMU_PASID_FORMAT_INTEL_VTD)) {
> > +        error_report_once("IOMMU Stage 1 format is not compatible!\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    bind_data = g_malloc0(sizeof(*bind_data));
> > +    bind_data->pasid = pasid;
> > +    g_bind_data = &bind_data->bind_data.gpasid_bind;
> > +
> > +    g_bind_data->flags = 0;
> > +    g_bind_data->vtd.flags = 0;
> > +    switch (op) {
> > +    case VTD_PASID_BIND:
> > +        g_bind_data->version = IOMMU_UAPI_VERSION;
> > +        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
> > +        g_bind_data->gpgd = vtd_pe_get_flpt_base(pe);
> > +        g_bind_data->addr_width = vtd_pe_get_fl_aw(pe);
> > +        g_bind_data->hpasid = pasid;
> > +        g_bind_data->gpasid = pasid;
> > +        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
> > +        g_bind_data->vtd.flags =
> > +                             (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2])
> > + ? 1 : 0)
> 
> This evaluates to 1 if VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]), or 0.
> Do you want to use IOMMU_SVA_VTD_GPASID_SRE instead of 1?  Same question to
> all the rest.

Oops, yes it is. You are right, thanks for catching it. During verification, only
the SRE bit is used, so this wasn't spotted in testing.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-04-02 18:12   ` Peter Xu
@ 2020-04-03 14:32     ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 14:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe

> From: Peter Xu <peterx@redhat.com>
> Sent: Friday, April 3, 2020 2:13 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to
> VMs
> 
> On Sun, Mar 29, 2020 at 09:24:39PM -0700, Liu Yi L wrote:
> > Tests: basci vSVA functionality test,
> 
> Could you elaborate what's the functionality test?  Does that contains
> at least some IOs go through the SVA-capable device so the nested page
> table is used?  I thought it was a yes, but after I notice that the
> BIND message flags seems to be wrong, I really think I should ask this
> loud..

As just replied, in the verification only the SRE bit is used, so it wasn't
spotted. In my functionality test, I passed through an SVA-capable device
and issued SVA transactions.

> > VM reboot/shutdown/crash,
> 
> What's the VM crash test?

it's ctrl+c to kill the VM.

> > kernel build in
> > guest, boot VM with vSVA disabled, full comapilation with all archs.
> 
> I believe I've said similar things, but...  I'd appreciate if you can
> also smoke on 2nd-level only with the series applied.

Yeah, you mean the legacy case; I booted with such a config.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
@ 2020-04-03 14:32     ` Liu, Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 14:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: jean-philippe, Tian, Kevin, kvm, mst, Tian, Jun J, qemu-devel,
	eric.auger, alex.williamson, pbonzini, Wu, Hao, Sun, Yi Y, david

> From: Peter Xu <peterx@redhat.com>
> Sent: Friday, April 3, 2020 2:13 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to
> VMs
> 
> On Sun, Mar 29, 2020 at 09:24:39PM -0700, Liu Yi L wrote:
> > Tests: basci vSVA functionality test,
> 
> Could you elaborate what's the functionality test?  Does that contains
> at least some IOs go through the SVA-capable device so the nested page
> table is used?  I thought it was a yes, but after I notice that the
> BIND message flags seems to be wrong, I really think I should ask this
> loud..

As just replied, in the verification only the SRE bit is used, so it wasn't
spotted. In my functionality test, I passed through an SVA-capable device
and issued SVA transactions.

> > VM reboot/shutdown/crash,
> 
> What's the VM crash test?

it's ctrl+c to kill the VM.

> > kernel build in
> > guest, boot VM with vSVA disabled, full comapilation with all archs.
> 
> I believe I've said similar things, but...  I'd appreciate if you can
> also smoke on 2nd-level only with the series applied.

Yeah, you mean the legacy case; I booted with such a config.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
  2020-03-30  4:24   ` Liu Yi L
@ 2020-04-03 14:45     ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-03 14:45 UTC (permalink / raw)
  To: Liu Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

On Sun, Mar 29, 2020 at 09:24:55PM -0700, Liu Yi L wrote:
> This patch replays guest pasid bindings after context cache
> invalidation. This is a behavior to ensure safety. Actually,
> programmer should issue pasid cache invalidation with proper
> granularity after issuing a context cache invalidation.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/i386/intel_iommu.c          | 51 ++++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  6 ++++-
>  hw/i386/trace-events           |  1 +
>  3 files changed, 57 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index d87f608..883aeac 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -68,6 +68,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
>  static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> +                                 VTDPASIDCacheInfo *pc_info);
> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> +                                  VTDBus *vtd_bus, uint16_t devfn);
>  
>  static void vtd_panic_require_caching_mode(void)
>  {
> @@ -1853,7 +1857,10 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
>  
>  static void vtd_context_global_invalidate(IntelIOMMUState *s)
>  {
> +    VTDPASIDCacheInfo pc_info;
> +
>      trace_vtd_inv_desc_cc_global();
> +
>      /* Protects context cache */
>      vtd_iommu_lock(s);
>      s->context_cache_gen++;
> @@ -1870,6 +1877,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>       * VT-d emulation codes.
>       */
>      vtd_iommu_replay_all(s);
> +
> +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> +    vtd_pasid_cache_sync(s, &pc_info);
>  }
>  
>  /**
> @@ -2005,6 +2015,22 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>                   * happened.
>                   */
>                  vtd_sync_shadow_page_table(vtd_as);
> +                /*
> +                 * Per spec, context flush should also followed with PASID
> +                 * cache and iotlb flush. Regards to a device selective
> +                 * context cache invalidation:

If context entry flush should also follow another pasid cache flush,
then this is still needed?  Shouldn't the pasid flush do the same
thing again?

> +                 * if (emaulted_device)
> +                 *    modify the pasid cache gen and pasid-based iotlb gen
> +                 *    value (will be added in following patches)

Let's avoid using "following patches" because it won't be helpful after
it's merged.  Also, the pasid cache gen is gone.

> +                 * else if (assigned_device)
> +                 *    check if the device has been bound to any pasid
> +                 *    invoke pasid_unbind regards to each bound pasid
> +                 * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
> +                 * caches, while for piotlb in QEMU, we don't have it yet, so
> +                 * no handling. For assigned device, host iommu driver would
> +                 * flush piotlb when a pasid unbind is pass down to it.
> +                 */
> +                 vtd_pasid_cache_devsi(s, vtd_bus, devfn_it);
>              }
>          }
>      }
> @@ -2619,6 +2645,12 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
>          /* Fall through */
>      case VTD_PASID_CACHE_GLOBAL:
>          break;
> +    case VTD_PASID_CACHE_DEVSI:
> +        if (pc_info->vtd_bus != vtd_bus ||
> +            pc_info->devfn == devfn) {

Do you mean "!="?
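I.e., presumably something like (illustrative):

    case VTD_PASID_CACHE_DEVSI:
        if (pc_info->vtd_bus != vtd_bus ||
            pc_info->devfn != devfn) {
            return false;
        }
        break;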

> +            return false;
> +        }
> +        break;
>      default:
>          error_report("invalid pc_info->flags");
>          abort();
> @@ -2827,6 +2859,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>          walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
>          /* loop all assigned devices */
>          break;
> +    case VTD_PASID_CACHE_DEVSI:
> +        walk_info.vtd_bus = pc_info->vtd_bus;
> +        walk_info.devfn = pc_info->devfn;
> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> +        return;
>      case VTD_PASID_CACHE_FORCE_RESET:
>          /* For force reset, no need to go further replay */
>          return;
> @@ -2912,6 +2949,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>      vtd_iommu_unlock(s);
>  }
>  
> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> +                                  VTDBus *vtd_bus, uint16_t devfn)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_devsi(devfn);
> +
> +    pc_info.flags = VTD_PASID_CACHE_DEVSI;
> +    pc_info.vtd_bus = vtd_bus;
> +    pc_info.devfn = devfn;
> +
> +    vtd_pasid_cache_sync(s, &pc_info);
> +}
> +
>  /**
>   * Caller of this function should hold iommu_lock
>   */
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index b9e48ab..9122601 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -529,14 +529,18 @@ struct VTDPASIDCacheInfo {
>  #define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
>  #define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
>  #define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
> +#define VTD_PASID_CACHE_DEVSI          (1ULL << 4)
>      uint32_t flags;
>      uint16_t domain_id;
>      uint32_t pasid;
> +    VTDBus *vtd_bus;
> +    uint16_t devfn;
>  };
>  #define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET | \
>                                        VTD_PASID_CACHE_GLOBAL  | \
>                                        VTD_PASID_CACHE_DOMSI  | \
> -                                      VTD_PASID_CACHE_PASIDSI)
> +                                      VTD_PASID_CACHE_PASIDSI | \
> +                                      VTD_PASID_CACHE_DEVSI)
>  typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
>  
>  /* PASID Table Related Definitions */
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 60d20c1..3853fa8 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -26,6 +26,7 @@ vtd_pasid_cache_gsi(void) ""
>  vtd_pasid_cache_reset(void) ""
>  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
>  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> +vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
>  vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
>  vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
>  vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
> -- 
> 2.7.4
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
@ 2020-04-03 14:45     ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-03 14:45 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jean-philippe, kevin.tian, Jacob Pan, Yi Sun, Eduardo Habkost,
	kvm, mst, jun.j.tian, qemu-devel, eric.auger, alex.williamson,
	pbonzini, hao.wu, yi.y.sun, Richard Henderson, david

On Sun, Mar 29, 2020 at 09:24:55PM -0700, Liu Yi L wrote:
> This patch replays guest pasid bindings after context cache
> invalidation. This is a behavior to ensure safety. Actually,
> programmer should issue pasid cache invalidation with proper
> granularity after issuing a context cache invalidation.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/i386/intel_iommu.c          | 51 ++++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  6 ++++-
>  hw/i386/trace-events           |  1 +
>  3 files changed, 57 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index d87f608..883aeac 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -68,6 +68,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
>  static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> +                                 VTDPASIDCacheInfo *pc_info);
> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> +                                  VTDBus *vtd_bus, uint16_t devfn);
>  
>  static void vtd_panic_require_caching_mode(void)
>  {
> @@ -1853,7 +1857,10 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
>  
>  static void vtd_context_global_invalidate(IntelIOMMUState *s)
>  {
> +    VTDPASIDCacheInfo pc_info;
> +
>      trace_vtd_inv_desc_cc_global();
> +
>      /* Protects context cache */
>      vtd_iommu_lock(s);
>      s->context_cache_gen++;
> @@ -1870,6 +1877,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>       * VT-d emulation codes.
>       */
>      vtd_iommu_replay_all(s);
> +
> +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> +    vtd_pasid_cache_sync(s, &pc_info);
>  }
>  
>  /**
> @@ -2005,6 +2015,22 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>                   * happened.
>                   */
>                  vtd_sync_shadow_page_table(vtd_as);
> +                /*
> +                 * Per spec, context flush should also followed with PASID
> +                 * cache and iotlb flush. Regards to a device selective
> +                 * context cache invalidation:

If context entry flush should also follow another pasid cache flush,
then this is still needed?  Shouldn't the pasid flush do the same
thing again?

> +                 * if (emaulted_device)
> +                 *    modify the pasid cache gen and pasid-based iotlb gen
> +                 *    value (will be added in following patches)

Let's avoid using "following patches" because it won't be helpful after
it's merged.  Also, the pasid cache gen is gone.

> +                 * else if (assigned_device)
> +                 *    check if the device has been bound to any pasid
> +                 *    invoke pasid_unbind regards to each bound pasid
> +                 * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
> +                 * caches, while for piotlb in QEMU, we don't have it yet, so
> +                 * no handling. For assigned device, host iommu driver would
> +                 * flush piotlb when a pasid unbind is pass down to it.
> +                 */
> +                 vtd_pasid_cache_devsi(s, vtd_bus, devfn_it);
>              }
>          }
>      }
> @@ -2619,6 +2645,12 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
>          /* Fall through */
>      case VTD_PASID_CACHE_GLOBAL:
>          break;
> +    case VTD_PASID_CACHE_DEVSI:
> +        if (pc_info->vtd_bus != vtd_bus ||
> +            pc_info->devfn == devfn) {

Do you mean "!="?

> +            return false;
> +        }
> +        break;
>      default:
>          error_report("invalid pc_info->flags");
>          abort();
> @@ -2827,6 +2859,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>          walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
>          /* loop all assigned devices */
>          break;
> +    case VTD_PASID_CACHE_DEVSI:
> +        walk_info.vtd_bus = pc_info->vtd_bus;
> +        walk_info.devfn = pc_info->devfn;
> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> +        return;
>      case VTD_PASID_CACHE_FORCE_RESET:
>          /* For force reset, no need to go further replay */
>          return;
> @@ -2912,6 +2949,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>      vtd_iommu_unlock(s);
>  }
>  
> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> +                                  VTDBus *vtd_bus, uint16_t devfn)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_devsi(devfn);
> +
> +    pc_info.flags = VTD_PASID_CACHE_DEVSI;
> +    pc_info.vtd_bus = vtd_bus;
> +    pc_info.devfn = devfn;
> +
> +    vtd_pasid_cache_sync(s, &pc_info);
> +}
> +
>  /**
>   * Caller of this function should hold iommu_lock
>   */
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index b9e48ab..9122601 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -529,14 +529,18 @@ struct VTDPASIDCacheInfo {
>  #define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
>  #define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
>  #define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
> +#define VTD_PASID_CACHE_DEVSI          (1ULL << 4)
>      uint32_t flags;
>      uint16_t domain_id;
>      uint32_t pasid;
> +    VTDBus *vtd_bus;
> +    uint16_t devfn;
>  };
>  #define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET | \
>                                        VTD_PASID_CACHE_GLOBAL  | \
>                                        VTD_PASID_CACHE_DOMSI  | \
> -                                      VTD_PASID_CACHE_PASIDSI)
> +                                      VTD_PASID_CACHE_PASIDSI | \
> +                                      VTD_PASID_CACHE_DEVSI)
>  typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
>  
>  /* PASID Table Related Definitions */
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 60d20c1..3853fa8 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -26,6 +26,7 @@ vtd_pasid_cache_gsi(void) ""
>  vtd_pasid_cache_reset(void) ""
>  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
>  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> +vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
>  vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
>  vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
>  vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
> -- 
> 2.7.4
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 19/22] intel_iommu: process PASID-based iotlb invalidation
  2020-03-30  4:24   ` Liu Yi L
@ 2020-04-03 14:47     ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-03 14:47 UTC (permalink / raw)
  To: Liu Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

On Sun, Mar 29, 2020 at 09:24:58PM -0700, Liu Yi L wrote:
> This patch adds the basic PASID-based iotlb (piotlb) invalidation
> support. piotlb is used during walking Intel VT-d 1st level page
> table. This patch only adds the basic processing. Detailed handling
> will be added in next patch.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 19/22] intel_iommu: process PASID-based iotlb invalidation
@ 2020-04-03 14:47     ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-03 14:47 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jean-philippe, kevin.tian, Jacob Pan, Yi Sun, Eduardo Habkost,
	kvm, mst, jun.j.tian, qemu-devel, eric.auger, alex.williamson,
	pbonzini, hao.wu, yi.y.sun, Richard Henderson, david

On Sun, Mar 29, 2020 at 09:24:58PM -0700, Liu Yi L wrote:
> This patch adds the basic PASID-based iotlb (piotlb) invalidation
> support. piotlb is used during walking Intel VT-d 1st level page
> table. This patch only adds the basic processing. Detailed handling
> will be added in next patch.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 22/22] intel_iommu: modify x-scalable-mode to be string option
  2020-03-30  4:25   ` Liu Yi L
@ 2020-04-03 14:49     ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-03 14:49 UTC (permalink / raw)
  To: Liu Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

On Sun, Mar 29, 2020 at 09:25:01PM -0700, Liu Yi L wrote:
> Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
> related to scalable mode translation, thus there are multiple combinations.
> While this vIOMMU implementation wants simplify it for user by providing
> typical combinations. User could config it by "x-scalable-mode" option. The
> usage is as below:
> 
> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
> 
>  - "legacy": gives support for SL page table
>  - "modern": gives support for FL page table, pasid, virtual command
>  - "off": no scalable mode support
>  -  if not configured, means no scalable mode support, if not proper
>     configured, will throw error
> 
> Note: this patch is supposed to be merged when  the whole vSVA patch series
> were merged.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
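
As a usage illustration (the command line below is illustrative only; the
companion options such as caching-mode depend on the rest of the setup):

    qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split \
        -device intel-iommu,x-scalable-mode=modern,caching-mode=on \
        -device vfio-pci,host=0000:01:00.0 \
        ...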

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 22/22] intel_iommu: modify x-scalable-mode to be string option
@ 2020-04-03 14:49     ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-03 14:49 UTC (permalink / raw)
  To: Liu Yi L
  Cc: jean-philippe, kevin.tian, Jacob Pan, Yi Sun, Eduardo Habkost,
	kvm, mst, jun.j.tian, qemu-devel, eric.auger, alex.williamson,
	pbonzini, hao.wu, yi.y.sun, Richard Henderson, david

On Sun, Mar 29, 2020 at 09:25:01PM -0700, Liu Yi L wrote:
> Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
> related to scalable mode translation, thus there are multiple combinations.
> While this vIOMMU implementation wants simplify it for user by providing
> typical combinations. User could config it by "x-scalable-mode" option. The
> usage is as below:
> 
> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
> 
>  - "legacy": gives support for SL page table
>  - "modern": gives support for FL page table, pasid, virtual command
>  - "off": no scalable mode support
>  -  if not configured, means no scalable mode support, if not proper
>     configured, will throw error
> 
> Note: this patch is supposed to be merged when  the whole vSVA patch series
> were merged.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
  2020-04-02 13:44         ` Peter Xu
@ 2020-04-03 15:05           ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 15:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, April 2, 2020 9:45 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 13/22] intel_iommu: add PASID cache management
> infrastructure
> 
> On Thu, Apr 02, 2020 at 06:46:11AM +0000, Liu, Yi L wrote:
> 
> [...]
> 
> > > > +/**
> > > > + * This function replay the guest pasid bindings to hots by
> > > > + * walking the guest PASID table. This ensures host will have
> > > > + * latest guest pasid bindings. Caller should hold iommu_lock.
> > > > + */
> > > > +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> > > > +                                            VTDPASIDCacheInfo
> > > > +*pc_info) {
> > > > +    VTDHostIOMMUContext *vtd_dev_icx;
> > > > +    int start = 0, end = VTD_HPASID_MAX;
> > > > +    vtd_pasid_table_walk_info walk_info = {.flags = 0};
> > >
> > > So vtd_pasid_table_walk_info is still used.  I thought we had
> > > reached a consensus that this can be dropped?
> >
> > yeah, I did consider your suggestion and planned to do it. But
> > when I started coding, it looked a little bit weird to me:
> > For one, there is an input VTDPASIDCacheInfo in this function. It may
> > be natural to think about passing that parameter on to the further call
> > (vtd_replay_pasid_bind_for_dev()). But we can't do that: the
> > vtd_bus/devfn fields should be filled when looping the assigned
> > devices, not taken from the one passed in by the vtd_replay_guest_pasid_bindings() caller.
> 
> Hacky way is we can directly modify VTDPASIDCacheInfo* with bus/devfn for the
> loop.  Otherwise we can duplicate the object when looping, so that we can avoid
> introducing a new struct which seems to contain mostly the same information.

I see. Please see below reply.

> > For two, reusing the VTDPASIDCacheInfo for passing walk info may
> > require the final user do the same thing as what the
> > vtd_replay_guest_pasid_bindings() has done here.
> 
> I don't see it happen, could you explain?

My concern is around the flags field in VTDPASIDCacheInfo. The flags not
only indicate the invalidation granularity, but also indicate field
presence, e.g. VTD_PASID_CACHE_DEVSI indicates the vtd_bus/devfn fields
are valid. If we reuse it to pass walk info to vtd_sm_pasid_table_walk_one,
that flag becomes meaningless as the vtd_bus/devfn fields would then always
be valid. But I'm fine to reuse it if that is preferred. Instead of
modifying the vtd_bus/devfn in the VTDPASIDCacheInfo*, I'd rather define
another VTDPASIDCacheInfo variable and pass it to
vtd_sm_pasid_table_walk_one. This should not affect future callers of
vtd_replay_guest_pasid_bindings() as the vtd_bus/devfn fields are not
designed to bring anything back to the caller.

struct VTDPASIDCacheInfo {
#define VTD_PASID_CACHE_FORCE_RESET    (1ULL << 0)
#define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
#define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
#define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
#define VTD_PASID_CACHE_DEVSI          (1ULL << 4)
    uint32_t flags;
    uint16_t domain_id;
    uint32_t pasid;
    VTDBus *vtd_bus;
    uint16_t devfn;
}; 
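
Just to illustrate the idea, a minimal sketch of the per-device duplication
I have in mind (not code from the posted series; the device list name
vtd_dev_icx_list, its link field, and the vtd_bus/devfn members of
VTDHostIOMMUContext are assumptions for illustration):

static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
                                            VTDPASIDCacheInfo *pc_info)
{
    VTDHostIOMMUContext *vtd_dev_icx;

    /* loop all devices which registered a HostIOMMUContext */
    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
        /*
         * Duplicate the caller's info per device: flags/domain_id/pasid
         * are inherited, while vtd_bus/devfn only describe the device
         * currently being walked and never go back to the caller.
         */
        VTDPASIDCacheInfo walk_info = *pc_info;

        walk_info.vtd_bus = vtd_dev_icx->vtd_bus;
        walk_info.devfn = vtd_dev_icx->devfn;
        vtd_replay_pasid_bind_for_dev(s, 0, VTD_HPASID_MAX, &walk_info);
    }
}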

> >
> > So kept the vtd_pasid_table_walk_info.
> 
> [...]
> 
> > > > +/**
> > > > + * This function syncs the pasid bindings between guest and host.
> > > > + * It includes updating the pasid cache in vIOMMU and updating
> > > > +the
> > > > + * pasid bindings per guest's latest pasid entry presence.
> > > > + */
> > > > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > > > +                                 VTDPASIDCacheInfo *pc_info) {
> > > > +    /*
> > > > +     * Regards to a pasid cache invalidation, e.g. a PSI.
> > > > +     * it could be either cases of below:
> > > > +     * a) a present pasid entry moved to non-present
> > > > +     * b) a present pasid entry to be a present entry
> > > > +     * c) a non-present pasid entry moved to present
> > > > +     *
> > > > +     * Different invalidation granularity may affect different device
> > > > +     * scope and pasid scope. But for each invalidation granularity,
> > > > +     * it needs to do two steps to sync host and guest pasid binding.
> > > > +     *
> > > > +     * Here is the handling of a PSI:
> > > > +     * 1) loop all the existing vtd_pasid_as instances to update them
> > > > +     *    according to the latest guest pasid entry in pasid table.
> > > > +     *    this will make sure affected existing vtd_pasid_as instances
> > > > +     *    cached the latest pasid entries. Also, during the loop, the
> > > > +     *    host should be notified if needed. e.g. pasid unbind or pasid
> > > > +     *    update. Should be able to cover case a) and case b).
> > > > +     *
> > > > +     * 2) loop all devices to cover case c)
> > > > +     *    - For devices which have HostIOMMUContext instances,
> > > > +     *      we loop them and check if guest pasid entry exists. If yes,
> > > > +     *      it is case c), we update the pasid cache and also notify
> > > > +     *      host.
> > > > +     *    - For devices which have no HostIOMMUContext, it is not
> > > > +     *      necessary to create pasid cache at this phase since it
> > > > +     *      could be created when vIOMMU does DMA address translation.
> > > > +     *      This is not yet implemented since there is no emulated
> > > > +     *      pasid-capable devices today. If we have such devices in
> > > > +     *      future, the pasid cache shall be created there.
> > > > +     * Other granularity follow the same steps, just with different scope
> > > > +     *
> > > > +     */
> > > > +
> > > > +    vtd_iommu_lock(s);
> > > > +    /* Step 1: loop all the exisitng vtd_pasid_as instances */
> > > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > > +                                vtd_flush_pasid, pc_info);
> > >
> > > OK the series is evolving along with our discussions, and /me too on
> > > understanding your series... Now I'm not very sure whether this operation is still
> useful...
> > >
> > > The major point is you'll need to do pasid table walk for all the
> > > registered devices below.  So IIUC vtd_replay_guest_pasid_bindings()
> > > will be able to also detect addition, removal or modification of
> > > pasid address spaces.  Am I right?
> >
> > It's true if there is only assigned pasid-capable devices. If there is
> > emualted pasid-capable device, it would be a problem as emualted
> > devices won't register HostIOMMUContext. Somehow, the pasid cahce
> > invalidation for emualted device would be missed. So I chose to make
> > the step 1 cover the "real" cache invalidation(a.k.a. removal), while
> > step 2 to cover addition and modification.
> 
> OK.  Btw, I think modification should still belongs to step 1 then (I think you're doing
> that, though).

Oh, yes, modification is done in step 1... step 2 is only for addition.
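
To put the two steps side by side, a condensed sketch of the sync flow as
described above (helper names follow the series as quoted; locking is kept,
all other details are omitted):

static void vtd_pasid_cache_sync(IntelIOMMUState *s,
                                 VTDPASIDCacheInfo *pc_info)
{
    vtd_iommu_lock(s);
    /*
     * Step 1: walk the existing vtd_pasid_as instances; entries whose
     * guest pasid entry went away are removed (and unbound from host),
     * modified entries are updated -- covers cases a) and b).
     */
    g_hash_table_foreach_remove(s->vtd_pasid_as, vtd_flush_pasid, pc_info);
    /*
     * Step 2: walk the guest PASID tables of the assigned devices to
     * pick up newly present entries -- covers case c) (addition only).
     */
    vtd_replay_guest_pasid_bindings(s, pc_info);
    vtd_iommu_unlock(s);
}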

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
  2020-04-03 14:45     ` Peter Xu
@ 2020-04-03 15:21       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 15:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Friday, April 3, 2020 10:46 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache
> invalidation
> 
> On Sun, Mar 29, 2020 at 09:24:55PM -0700, Liu Yi L wrote:
> > This patch replays guest pasid bindings after context cache
> > invalidation. This is a behavior to ensure safety. Actually,
> > programmer should issue pasid cache invalidation with proper
> > granularity after issuing a context cache invalidation.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/i386/intel_iommu.c          | 51
> ++++++++++++++++++++++++++++++++++++++++++
> >  hw/i386/intel_iommu_internal.h |  6 ++++-
> >  hw/i386/trace-events           |  1 +
> >  3 files changed, 57 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index d87f608..883aeac 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -68,6 +68,10 @@ static void
> vtd_address_space_refresh_all(IntelIOMMUState *s);
> >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> >
> >  static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > +                                 VTDPASIDCacheInfo *pc_info);
> > +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> > +                                  VTDBus *vtd_bus, uint16_t devfn);
> >
> >  static void vtd_panic_require_caching_mode(void)
> >  {
> > @@ -1853,7 +1857,10 @@ static void vtd_iommu_replay_all(IntelIOMMUState
> *s)
> >
> >  static void vtd_context_global_invalidate(IntelIOMMUState *s)
> >  {
> > +    VTDPASIDCacheInfo pc_info;
> > +
> >      trace_vtd_inv_desc_cc_global();
> > +
> >      /* Protects context cache */
> >      vtd_iommu_lock(s);
> >      s->context_cache_gen++;
> > @@ -1870,6 +1877,9 @@ static void
> vtd_context_global_invalidate(IntelIOMMUState *s)
> >       * VT-d emulation codes.
> >       */
> >      vtd_iommu_replay_all(s);
> > +
> > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > +    vtd_pasid_cache_sync(s, &pc_info);
> >  }
> >
> >  /**
> > @@ -2005,6 +2015,22 @@ static void
> vtd_context_device_invalidate(IntelIOMMUState *s,
> >                   * happened.
> >                   */
> >                  vtd_sync_shadow_page_table(vtd_as);
> > +                /*
> > +                 * Per spec, context flush should also followed with PASID
> > +                 * cache and iotlb flush. Regards to a device selective
> > +                 * context cache invalidation:
> 
> If context entry flush should also follow another pasid cache flush,
> then this is still needed?  Shouldn't the pasid flush do the same
> thing again?

yes, but what if the guest software fails to follow it? It will do
the same thing when the pasid cache flush comes. But this only happens
for the rid2pasid case (the IOVA page table).

> > +                 * if (emaulted_device)
> > +                 *    modify the pasid cache gen and pasid-based iotlb gen
> > +                 *    value (will be added in following patches)
> 
> Let's avoid using "following patches" because it'll be helpless after
> merged.  Also, the pasid cache gen is gone.

got it. will modify the description here.

> > +                 * else if (assigned_device)
> > +                 *    check if the device has been bound to any pasid
> > +                 *    invoke pasid_unbind regards to each bound pasid
> > +                 * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
> > +                 * caches, while for piotlb in QEMU, we don't have it yet, so
> > +                 * no handling. For assigned device, host iommu driver would
> > +                 * flush piotlb when a pasid unbind is pass down to it.
> > +                 */
> > +                 vtd_pasid_cache_devsi(s, vtd_bus, devfn_it);
> >              }
> >          }
> >      }
> > @@ -2619,6 +2645,12 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer
> value,
> >          /* Fall through */
> >      case VTD_PASID_CACHE_GLOBAL:
> >          break;
> > +    case VTD_PASID_CACHE_DEVSI:
> > +        if (pc_info->vtd_bus != vtd_bus ||
> > +            pc_info->devfn == devfn) {
> 
> Do you mean "!="?

exactly. thanks for catching it.
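
Just to spell it out, the corrected hunk would read (same as the quoted
code, only the second comparison flipped to "!="):

    case VTD_PASID_CACHE_DEVSI:
        /* only flush pasid cache entries that belong to this device */
        if (pc_info->vtd_bus != vtd_bus ||
            pc_info->devfn != devfn) {
            return false;
        }
        break;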

Regards,
Yi Liu

> > +            return false;
> > +        }
> > +        break;
> >      default:
> >          error_report("invalid pc_info->flags");
> >          abort();
> > @@ -2827,6 +2859,11 @@ static void
> vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> >          walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
> >          /* loop all assigned devices */
> >          break;
> > +    case VTD_PASID_CACHE_DEVSI:
> > +        walk_info.vtd_bus = pc_info->vtd_bus;
> > +        walk_info.devfn = pc_info->devfn;
> > +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> > +        return;
> >      case VTD_PASID_CACHE_FORCE_RESET:
> >          /* For force reset, no need to go further replay */
> >          return;
> > @@ -2912,6 +2949,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> >      vtd_iommu_unlock(s);
> >  }
> >
> > +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> > +                                  VTDBus *vtd_bus, uint16_t devfn)
> > +{
> > +    VTDPASIDCacheInfo pc_info;
> > +
> > +    trace_vtd_pasid_cache_devsi(devfn);
> > +
> > +    pc_info.flags = VTD_PASID_CACHE_DEVSI;
> > +    pc_info.vtd_bus = vtd_bus;
> > +    pc_info.devfn = devfn;
> > +
> > +    vtd_pasid_cache_sync(s, &pc_info);
> > +}
> > +
> >  /**
> >   * Caller of this function should hold iommu_lock
> >   */
> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index b9e48ab..9122601 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -529,14 +529,18 @@ struct VTDPASIDCacheInfo {
> >  #define VTD_PASID_CACHE_GLOBAL         (1ULL << 1)
> >  #define VTD_PASID_CACHE_DOMSI          (1ULL << 2)
> >  #define VTD_PASID_CACHE_PASIDSI        (1ULL << 3)
> > +#define VTD_PASID_CACHE_DEVSI          (1ULL << 4)
> >      uint32_t flags;
> >      uint16_t domain_id;
> >      uint32_t pasid;
> > +    VTDBus *vtd_bus;
> > +    uint16_t devfn;
> >  };
> >  #define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_FORCE_RESET |
> \
> >                                        VTD_PASID_CACHE_GLOBAL  | \
> >                                        VTD_PASID_CACHE_DOMSI  | \
> > -                                      VTD_PASID_CACHE_PASIDSI)
> > +                                      VTD_PASID_CACHE_PASIDSI | \
> > +                                      VTD_PASID_CACHE_DEVSI)
> >  typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> >
> >  /* PASID Table Related Definitions */
> > diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> > index 60d20c1..3853fa8 100644
> > --- a/hw/i386/trace-events
> > +++ b/hw/i386/trace-events
> > @@ -26,6 +26,7 @@ vtd_pasid_cache_gsi(void) ""
> >  vtd_pasid_cache_reset(void) ""
> >  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain
> 0x%"PRIx16
> >  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC
> invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> > +vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev:
> 0x%"PRIx16
> >  vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> >  vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8"
> devfn %"PRIu8" not present"
> >  vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain)
> "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain
> 0x%"PRIx16
> > --
> > 2.7.4
> >
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 19/22] intel_iommu: process PASID-based iotlb invalidation
  2020-04-03 14:47     ` Peter Xu
@ 2020-04-03 15:21       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 15:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Friday, April 3, 2020 10:47 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 19/22] intel_iommu: process PASID-based iotlb invalidation
> 
> On Sun, Mar 29, 2020 at 09:24:58PM -0700, Liu Yi L wrote:
> > This patch adds the basic PASID-based iotlb (piotlb) invalidation
> > support. piotlb is used during walking Intel VT-d 1st level page
> > table. This patch only adds the basic processing. Detailed handling
> > will be added in next patch.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
thanks for your help. :-)

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 22/22] intel_iommu: modify x-scalable-mode to be string option
  2020-04-03 14:49     ` Peter Xu
@ 2020-04-03 15:22       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-03 15:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Friday, April 3, 2020 10:49 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 22/22] intel_iommu: modify x-scalable-mode to be string
> option
> 
> On Sun, Mar 29, 2020 at 09:25:01PM -0700, Liu Yi L wrote:
> > Intel VT-d 3.0 introduces scalable mode, and it has a bunch of
> > capabilities related to scalable mode translation, thus there are multiple
> combinations.
> > While this vIOMMU implementation wants simplify it for user by
> > providing typical combinations. User could config it by
> > "x-scalable-mode" option. The usage is as below:
> >
> > "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
> >
> >  - "legacy": gives support for SL page table
> >  - "modern": gives support for FL page table, pasid, virtual command
> >  - "off": no scalable mode support
> >  -  if not configured, means no scalable mode support, if not proper
> >     configured, will throw error
> >
> > Note: this patch is supposed to be merged when  the whole vSVA patch
> > series were merged.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
thanks for your help. :-)

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
  2020-04-03 15:21       ` Liu, Yi L
@ 2020-04-03 16:11         ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-03 16:11 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

On Fri, Apr 03, 2020 at 03:21:10PM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Friday, April 3, 2020 10:46 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache
> > invalidation
> > 
> > On Sun, Mar 29, 2020 at 09:24:55PM -0700, Liu Yi L wrote:
> > > This patch replays guest pasid bindings after context cache
> > > invalidation. This is a behavior to ensure safety. Actually,
> > > programmer should issue pasid cache invalidation with proper
> > > granularity after issuing a context cache invalidation.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Richard Henderson <rth@twiddle.net>
> > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 51
> > ++++++++++++++++++++++++++++++++++++++++++
> > >  hw/i386/intel_iommu_internal.h |  6 ++++-
> > >  hw/i386/trace-events           |  1 +
> > >  3 files changed, 57 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index d87f608..883aeac 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -68,6 +68,10 @@ static void
> > vtd_address_space_refresh_all(IntelIOMMUState *s);
> > >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> > >
> > >  static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > > +                                 VTDPASIDCacheInfo *pc_info);
> > > +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> > > +                                  VTDBus *vtd_bus, uint16_t devfn);
> > >
> > >  static void vtd_panic_require_caching_mode(void)
> > >  {
> > > @@ -1853,7 +1857,10 @@ static void vtd_iommu_replay_all(IntelIOMMUState
> > *s)
> > >
> > >  static void vtd_context_global_invalidate(IntelIOMMUState *s)
> > >  {
> > > +    VTDPASIDCacheInfo pc_info;
> > > +
> > >      trace_vtd_inv_desc_cc_global();
> > > +
> > >      /* Protects context cache */
> > >      vtd_iommu_lock(s);
> > >      s->context_cache_gen++;
> > > @@ -1870,6 +1877,9 @@ static void
> > vtd_context_global_invalidate(IntelIOMMUState *s)
> > >       * VT-d emulation codes.
> > >       */
> > >      vtd_iommu_replay_all(s);
> > > +
> > > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > > +    vtd_pasid_cache_sync(s, &pc_info);
> > >  }
> > >
> > >  /**
> > > @@ -2005,6 +2015,22 @@ static void
> > vtd_context_device_invalidate(IntelIOMMUState *s,
> > >                   * happened.
> > >                   */
> > >                  vtd_sync_shadow_page_table(vtd_as);
> > > +                /*
> > > +                 * Per spec, context flush should also followed with PASID
> > > +                 * cache and iotlb flush. Regards to a device selective
> > > +                 * context cache invalidation:
> > 
> > If context entry flush should also follow another pasid cache flush,
> > then this is still needed?  Shouldn't the pasid flush do the same
> > thing again?
> 
> yes, but how about guest software failed to follow it? It will do
> the same thing when pasid cache flush comes. But this only happens
> for the rid2pasid case (the IOVA page table).

Do you mean it will not happen when a nested page table is used (so it's
required for nested tables)?

Yeah, we can keep them to be safe no matter what; at least I'm fine with
it (I believe most of the code we're discussing is not a fast path).
I just want to be sure, since if it's definitely duplicated then we can
drop it instead.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
  2020-04-03 15:05           ` Liu, Yi L
@ 2020-04-03 16:19             ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-03 16:19 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

On Fri, Apr 03, 2020 at 03:05:57PM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Thursday, April 2, 2020 9:45 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v2 13/22] intel_iommu: add PASID cache management
> > infrastructure
> > 
> > On Thu, Apr 02, 2020 at 06:46:11AM +0000, Liu, Yi L wrote:
> > 
> > [...]
> > 
> > > > > +/**
> > > > > + * This function replay the guest pasid bindings to hots by
> > > > > + * walking the guest PASID table. This ensures host will have
> > > > > + * latest guest pasid bindings. Caller should hold iommu_lock.
> > > > > + */
> > > > > +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> > > > > +                                            VTDPASIDCacheInfo
> > > > > +*pc_info) {
> > > > > +    VTDHostIOMMUContext *vtd_dev_icx;
> > > > > +    int start = 0, end = VTD_HPASID_MAX;
> > > > > +    vtd_pasid_table_walk_info walk_info = {.flags = 0};
> > > >
> > > > So vtd_pasid_table_walk_info is still used.  I thought we had
> > > > reached a consensus that this can be dropped?
> > >
> > > yeah, I did have considered your suggestion and plan to do it. But
> > > when I started coding, it looks a little bit weird to me:
> > > For one, there is an input VTDPASIDCacheInfo in this function. It may
> > > be nature to think about passing the parameter to further calling
> > > (vtd_replay_pasid_bind_for_dev()). But, we can't do that. The
> > > vtd_bus/devfn fields should be filled when looping the assigned
> > > devices, not the one passed by vtd_replay_guest_pasid_bindings() caller.
> > 
> > Hacky way is we can directly modify VTDPASIDCacheInfo* with bus/devfn for the
> > loop.  Otherwise we can duplicate the object when looping, so that we can avoid
> > introducing a new struct which seems to contain mostly the same information.
> 
> I see. Please see below reply.
> 
> > > For two, reusing the VTDPASIDCacheInfo for passing walk info may
> > > require the final user do the same thing as what the
> > > vtd_replay_guest_pasid_bindings() has done here.
> > 
> > I don't see it happen, could you explain?
> 
> my concern is around flags field in VTDPASIDCacheInfo. The flags not
> only indicates the invalidation granularity, but also indicates the
> field presence. e.g. VTD_PASID_CACHE_DEVSI indicates the vtd_bus/devfn
> fields are valid. If reuse it to pass walk info to vtd_sm_pasid_table_walk_one,
> it would be meaningless as vtd_bus/devfn fields are always valid. But
> I'm fine to reuse it's more prefered. Instead of modifying the vtd_bus/devn
> in VTDPASIDCacheInfo*, I'd rather to define another VTDPASIDCacheInfo variable
> and pass it to vtd_sm_pasid_table_walk_one. This may not affect the future
> caller of vtd_replay_guest_pasid_bindings() as vtd_bus/devfn field are not
> designed to bring something back to caller.

Yeah, let's give it a shot.  I know it's not ideal, but IMHO it's
still better than defining the page_walk struct, which might confuse
readers about the difference between the two.  When duplicating the
object, we can add a comment explaining this.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure
  2020-04-03 16:19             ` Peter Xu
@ 2020-04-04 11:39               ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-04 11:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

Hi Peter,

> From: Peter Xu <peterx@redhat.com>
> Sent: Saturday, April 4, 2020 12:20 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> 
> On Fri, Apr 03, 2020 at 03:05:57PM +0000, Liu, Yi L wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, April 2, 2020 9:45 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v2 13/22] intel_iommu: add PASID cache management
> > > infrastructure
> > >
> > > On Thu, Apr 02, 2020 at 06:46:11AM +0000, Liu, Yi L wrote:
> > >
> > > [...]
> > >
> > > > > > +/**
> > > > > > + * This function replay the guest pasid bindings to hots by
> > > > > > + * walking the guest PASID table. This ensures host will have
> > > > > > + * latest guest pasid bindings. Caller should hold iommu_lock.
> > > > > > + */
> > > > > > +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> > > > > > +                                            VTDPASIDCacheInfo
> > > > > > +*pc_info) {
> > > > > > +    VTDHostIOMMUContext *vtd_dev_icx;
> > > > > > +    int start = 0, end = VTD_HPASID_MAX;
> > > > > > +    vtd_pasid_table_walk_info walk_info = {.flags = 0};
> > > > >
> > > > > So vtd_pasid_table_walk_info is still used.  I thought we had
> > > > > reached a consensus that this can be dropped?
> > > >
> > > > yeah, I did have considered your suggestion and plan to do it. But
> > > > when I started coding, it looks a little bit weird to me:
> > > > For one, there is an input VTDPASIDCacheInfo in this function. It may
> > > > be nature to think about passing the parameter to further calling
> > > > (vtd_replay_pasid_bind_for_dev()). But, we can't do that. The
> > > > vtd_bus/devfn fields should be filled when looping the assigned
> > > > devices, not the one passed by vtd_replay_guest_pasid_bindings() caller.
> > >
> > > Hacky way is we can directly modify VTDPASIDCacheInfo* with bus/devfn for
> the
> > > loop.  Otherwise we can duplicate the object when looping, so that we can avoid
> > > introducing a new struct which seems to contain mostly the same information.
> >
> > I see. Please see below reply.
> >
> > > > For two, reusing the VTDPASIDCacheInfo for passing walk info may
> > > > require the final user do the same thing as what the
> > > > vtd_replay_guest_pasid_bindings() has done here.
> > >
> > > I don't see it happen, could you explain?
> >
> > my concern is around flags field in VTDPASIDCacheInfo. The flags not
> > only indicates the invalidation granularity, but also indicates the
> > field presence. e.g. VTD_PASID_CACHE_DEVSI indicates the vtd_bus/devfn
> > fields are valid. If reuse it to pass walk info to vtd_sm_pasid_table_walk_one,
> > it would be meaningless as vtd_bus/devfn fields are always valid. But
> > I'm fine to reuse it's more prefered. Instead of modifying the vtd_bus/devn
> > in VTDPASIDCacheInfo*, I'd rather to define another VTDPASIDCacheInfo variable
> > and pass it to vtd_sm_pasid_table_walk_one. This may not affect the future
> > caller of vtd_replay_guest_pasid_bindings() as vtd_bus/devfn field are not
> > designed to bring something back to caller.
> 
> Yeah, let's give it a shot.  I know it's not ideal, but IMHO it's
> still better than defining the page_walk struct and that might confuse
> readers on what's the difference between the two.  When duplicating
> the object, we can add some comment explaining this.

got it. I'll drop the page_walk struct and add additional comments. :-)
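
Roughly what I have in mind, as a sketch only (the device list name and
the walk helper call are placeholders of mine, not the final code):

static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
                                            VTDPASIDCacheInfo *pc_info)
{
    VTDHostIOMMUContext *vtd_dev_icx;
    int start = 0, end = VTD_HPASID_MAX;

    /* ... derive [start, end) from pc_info->flags as before ... */

    QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
        /*
         * Duplicate pc_info instead of reusing the caller's copy:
         * vtd_bus/devfn must describe the device being looped, and
         * they never carry anything back to the caller.
         */
        VTDPASIDCacheInfo walk_info = *pc_info;

        walk_info.vtd_bus = vtd_dev_icx->vtd_bus;
        walk_info.devfn = vtd_dev_icx->devfn;
        /* ... walk this device's PASID table using &walk_info ... */
    }
}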

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
  2020-04-03 16:11         ` Peter Xu
@ 2020-04-04 12:00           ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-04 12:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

Hi Peter,

> From: Peter Xu <peterx@redhat.com>
> Sent: Saturday, April 4, 2020 12:11 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache
> invalidation
> 
> On Fri, Apr 03, 2020 at 03:21:10PM +0000, Liu, Yi L wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Friday, April 3, 2020 10:46 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context
> cache
> > > invalidation
> > >
> > > On Sun, Mar 29, 2020 at 09:24:55PM -0700, Liu Yi L wrote:
> > > > This patch replays guest pasid bindings after context cache
> > > > invalidation. This is a behavior to ensure safety. Actually,
> > > > programmer should issue pasid cache invalidation with proper
> > > > granularity after issuing a context cache invalidation.
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Peter Xu <peterx@redhat.com>
> > > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > ---
> > > >  hw/i386/intel_iommu.c          | 51
> > > ++++++++++++++++++++++++++++++++++++++++++
> > > >  hw/i386/intel_iommu_internal.h |  6 ++++-
> > > >  hw/i386/trace-events           |  1 +
> > > >  3 files changed, 57 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index d87f608..883aeac 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -68,6 +68,10 @@ static void
> > > vtd_address_space_refresh_all(IntelIOMMUState *s);
> > > >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> *n);
> > > >
> > > >  static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > > > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > > > +                                 VTDPASIDCacheInfo *pc_info);
> > > > +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> > > > +                                  VTDBus *vtd_bus, uint16_t devfn);
> > > >
> > > >  static void vtd_panic_require_caching_mode(void)
> > > >  {
> > > > @@ -1853,7 +1857,10 @@ static void
> vtd_iommu_replay_all(IntelIOMMUState
> > > *s)
> > > >
> > > >  static void vtd_context_global_invalidate(IntelIOMMUState *s)
> > > >  {
> > > > +    VTDPASIDCacheInfo pc_info;
> > > > +
> > > >      trace_vtd_inv_desc_cc_global();
> > > > +
> > > >      /* Protects context cache */
> > > >      vtd_iommu_lock(s);
> > > >      s->context_cache_gen++;
> > > > @@ -1870,6 +1877,9 @@ static void
> > > vtd_context_global_invalidate(IntelIOMMUState *s)
> > > >       * VT-d emulation codes.
> > > >       */
> > > >      vtd_iommu_replay_all(s);
> > > > +
> > > > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > > > +    vtd_pasid_cache_sync(s, &pc_info);
> > > >  }
> > > >
> > > >  /**
> > > > @@ -2005,6 +2015,22 @@ static void
> > > vtd_context_device_invalidate(IntelIOMMUState *s,
> > > >                   * happened.
> > > >                   */
> > > >                  vtd_sync_shadow_page_table(vtd_as);
> > > > +                /*
> > > > +                 * Per spec, context flush should also
> > > > followed with PASID
> > > > +                 * cache and iotlb flush. Regards to
> > > > a device selective
> > > > +                 * context cache invalidation:
> > >
> > > If context entry flush should also follow another pasid cache flush,
> > > then this is still needed?  Shouldn't the pasid flush do the same
> > > thing again?
> >
> > yes, but how about guest software failed to follow it? It will do
> > the same thing when pasid cache flush comes. But this only happens
> > for the rid2pasid case (the IOVA page table).
> 
> Do you mean it will not happen when nested page table is used (so it's
> required for nested tables)?

No. By mentioning the IOVA page table case I just wanted to confirm that
the duplicate replay is real, but it is not the "only" case. :-) My bad.
Any scalable-mode context entry modification will result in a duplicate
replay, since this patch enforces a PASID replay after context cache
invalidation. For normal guest SVM usage there is no such duplicate work,
as the guest only modifies the PASID entry.

> Yeah we can keep them for safe no matter what; at least I'm fine with
> it (I believe most of the code we're discussing is not fast path).
> Just want to be sure of it since if it's definitely duplicated then we
> can instead drop it.

Yes, it is not a fast path. BTW, I guess the IOVA shadow sync follows
the same notion, right?
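
For completeness, the device-selective counterpart that the truncated hunk
above refers to would be roughly the sketch below; the exact placement in
vtd_context_device_invalidate() and the local variable names are my
assumptions, not the literal patch:

                vtd_sync_shadow_page_table(vtd_as);
                /*
                 * Replay the PASID bindings for this device right away.
                 * A well-behaved guest will follow up with a PASID cache
                 * invalidation and replay again; duplicated but safe,
                 * and this is not a fast path.
                 */
                vtd_pasid_cache_devsi(s, vtd_bus, devfn);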

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  2020-04-02 13:49             ` Auger Eric
@ 2020-04-06  6:27               ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-06  6:27 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Thursday, April 2, 2020 9:49 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
> PCIIOMMUOps
> 
> Hi Yi,
> 
> On 4/2/20 3:37 PM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric < eric.auger@redhat.com >
> >> Sent: Thursday, April 2, 2020 8:41 PM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> >> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
> >> PCIIOMMUOps
> >>
> >> Hi Yi,
> >>
> >> On 4/2/20 10:52 AM, Liu, Yi L wrote:
> >>>> From: Auger Eric < eric.auger@redhat.com>
> >>>> Sent: Monday, March 30, 2020 7:02 PM
> >>>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> >>>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to
> >>>> set PCIIOMMUOps
> >>>>
> >>>>
> >>>>
> >>>> On 3/30/20 6:24 AM, Liu Yi L wrote:
> >>>>> This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead
> >>>>> of setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address
> >>>>> space for a PCI device in vendor specific way. The PCIIOMMUOps
> >>>>> still offers this functionality. But using PCIIOMMUOps leaves
> >>>>> space to add more iommu related vendor specific operations.
> >>>>>
> >>>>> Cc: Kevin Tian <kevin.tian@intel.com>
> >>>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>>>> Cc: Peter Xu <peterx@redhat.com>
> >>>>> Cc: Eric Auger <eric.auger@redhat.com>
> >>>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> >>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
> >>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
> >>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
> >>>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >>>>> ---
> >>>>>  hw/alpha/typhoon.c       |  6 +++++-
> >>>>>  hw/arm/smmu-common.c     |  6 +++++-
> >>>>>  hw/hppa/dino.c           |  6 +++++-
> >>>>>  hw/i386/amd_iommu.c      |  6 +++++-
> >>>>>  hw/i386/intel_iommu.c    |  6 +++++-
> >>>>>  hw/pci-host/designware.c |  6 +++++-
> >>>>>  hw/pci-host/pnv_phb3.c   |  6 +++++-
> >>>>>  hw/pci-host/pnv_phb4.c   |  6 +++++-
> >>>>>  hw/pci-host/ppce500.c    |  6 +++++-
> >>>>>  hw/pci-host/prep.c       |  6 +++++-
> >>>>>  hw/pci-host/sabre.c      |  6 +++++-
> >>>>>  hw/pci/pci.c             | 12 +++++++-----
> >>>>>  hw/ppc/ppc440_pcix.c     |  6 +++++-
> >>>>>  hw/ppc/spapr_pci.c       |  6 +++++-
> >>>>>  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c
> >>>>> |
> >>>>> 6
> >>>>> +++++-
> >>>>>  include/hw/pci/pci.h     |  8 ++++++--
> >>>>>  include/hw/pci/pci_bus.h |  2 +-
> >>>>>  18 files changed, 90 insertions(+), 24 deletions(-)
> >>>>>
> >>>>> diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
> >>>>> 1795e2f..f271de1 100644
> >>>>> --- a/hw/alpha/typhoon.c
> >>>>> +++ b/hw/alpha/typhoon.c
> >>>>> @@ -740,6 +740,10 @@ static AddressSpace
> >>>>> *typhoon_pci_dma_iommu(PCIBus
> >>>> *bus, void *opaque, int devfn)
> >>>>>      return &s->pchip.iommu_as;
> >>>>>  }
> >>>>>
> >>>>> +static const PCIIOMMUOps typhoon_iommu_ops = {
> >>>>> +    .get_address_space = typhoon_pci_dma_iommu, };
> >>>>> +
> >>>>>  static void typhoon_set_irq(void *opaque, int irq, int level)  {
> >>>>>      TyphoonState *s = opaque;
> >>>>> @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
> >>>> **isa_bus, qemu_irq *p_rtc_irq,
> >>>>>                               "iommu-typhoon", UINT64_MAX);
> >>>>>      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
> >>>>> pchip.iommu),
> >>>>>                         "pchip0-pci");
> >>>>> -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
> >>>>> +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
> >>>>>
> >>>>>      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
> >>>>>      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
> >>>>> &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
> >>>>> b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
> >>>>> --- a/hw/arm/smmu-common.c
> >>>>> +++ b/hw/arm/smmu-common.c
> >>>>> @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus
> >>>>> *bus,
> >>>> void *opaque, int devfn)
> >>>>>      return &sdev->as;
> >>>>>  }
> >>>>>
> >>>>> +static const PCIIOMMUOps smmu_ops = {
> >>>>> +    .get_address_space = smmu_find_add_as, };
> >>>>> +
> >>>>>  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
> >>>>>      uint8_t bus_n, devfn;
> >>>>> @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState
> >>>>> *dev, Error
> >>>> **errp)
> >>>>>      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
> >>>>>
> >>>>>      if (s->primary_bus) {
> >>>>> -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
> >>>>> +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
> >>>>>      } else {
> >>>>>          error_setg(errp, "SMMU is not attached to any PCI bus!");
> >>>>>      }
> >>>>> diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index
> >>>>> 2b1b38c..3da4f84
> >>>>> 100644
> >>>>> --- a/hw/hppa/dino.c
> >>>>> +++ b/hw/hppa/dino.c
> >>>>> @@ -459,6 +459,10 @@ static AddressSpace
> >>>>> *dino_pcihost_set_iommu(PCIBus
> >>>> *bus, void *opaque,
> >>>>>      return &s->bm_as;
> >>>>>  }
> >>>>>
> >>>>> +static const PCIIOMMUOps dino_iommu_ops = {
> >>>>> +    .get_address_space = dino_pcihost_set_iommu, };
> >>>>> +
> >>>>>  /*
> >>>>>   * Dino interrupts are connected as shown on Page 78, Table 23
> >>>>>   * (Little-endian bit numbers)
> >>>>> @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
> >>>>>      memory_region_add_subregion(&s->bm, 0xfff00000,
> >>>>>                                  &s->bm_cpu_alias);
> >>>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> >>>>> -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
> >>>>> +    pci_setup_iommu(b, &dino_iommu_ops, s);
> >>>>>
> >>>>>      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
> >>>>>      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0);
> >>>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
> >>>>> b1175e5..5fec30e 100644
> >>>>> --- a/hw/i386/amd_iommu.c
> >>>>> +++ b/hw/i386/amd_iommu.c
> >>>>> @@ -1451,6 +1451,10 @@ static AddressSpace
> >>>> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
> >>>>>      return &iommu_as[devfn]->as;
> >>>>>  }
> >>>>>
> >>>>> +static const PCIIOMMUOps amdvi_iommu_ops = {
> >>>>> +    .get_address_space = amdvi_host_dma_iommu, };
> >>>>> +
> >>>>>  static const MemoryRegionOps mmio_mem_ops = {
> >>>>>      .read = amdvi_mmio_read,
> >>>>>      .write = amdvi_mmio_write,
> >>>>> @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
> >>>>> Error **errp)
> >>>>>
> >>>>>      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
> >>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
> >>>>> -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
> >>>>> +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
> >>>>>      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
> >>>>>      msi_init(&s->pci.dev, 0, 1, true, false, errp);
> >>>>>      amdvi_init(s);
> >>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> >>>>> df7ad25..4b22910 100644
> >>>>> --- a/hw/i386/intel_iommu.c
> >>>>> +++ b/hw/i386/intel_iommu.c
> >>>>> @@ -3729,6 +3729,10 @@ static AddressSpace
> >>>>> *vtd_host_dma_iommu(PCIBus
> >>>> *bus, void *opaque, int devfn)
> >>>>>      return &vtd_as->as;
> >>>>>  }
> >>>>>
> >>>>> +static PCIIOMMUOps vtd_iommu_ops = {
> >>>> static const
> >>>
> >>> got it.
> >>>
> >>>>> +    .get_address_space = vtd_host_dma_iommu, };
> >>>>> +
> >>>>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
> >>>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
> >>>>> +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >>>>>                                                g_free, g_free);
> >>>>>      vtd_init(s);
> >>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> >>>> Q35_HOST_BRIDGE_IOMMU_ADDR);
> >>>>> -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
> >>>>> +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> >>>>>      /* Pseudo address space under root PCI bus. */
> >>>>>      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
> >>>> Q35_PSEUDO_DEVFN_IOAPIC);
> >>>>>
> >>>>> qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
> >>>>> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
> >>>>> index dd24551..4c6338a 100644
> >>>>> --- a/hw/pci-host/designware.c
> >>>>> +++ b/hw/pci-host/designware.c
> >>>>> @@ -645,6 +645,10 @@ static AddressSpace
> >>>> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
> >>>>>      return &s->pci.address_space;  }
> >>>>>
> >>>>> +static const PCIIOMMUOps designware_iommu_ops = {
> >>>>> +    .get_address_space = designware_pcie_host_set_iommu, };
> >>>>> +
> >>>>>  static void designware_pcie_host_realize(DeviceState *dev, Error
> >>>>> **errp)  {
> >>>>>      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
> >>>>> static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
> >>>>>      address_space_init(&s->pci.address_space,
> >>>>>                         &s->pci.address_space_root,
> >>>>>                         "pcie-bus-address-space");
> >>>>> -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
> >>>>> +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
> >>>>>
> >>>>>      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
> >>>>>      qdev_init_nofail(DEVICE(&s->root));
> >>>>> diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
> >>>>> 74618fa..ecfe627 100644
> >>>>> --- a/hw/pci-host/pnv_phb3.c
> >>>>> +++ b/hw/pci-host/pnv_phb3.c
> >>>>> @@ -961,6 +961,10 @@ static AddressSpace
> >>>>> *pnv_phb3_dma_iommu(PCIBus
> >>>> *bus, void *opaque, int devfn)
> >>>>>      return &ds->dma_as;
> >>>>>  }
> >>>>>
> >>>>> +static PCIIOMMUOps pnv_phb3_iommu_ops = {
> >>>> static const
> >>> got it. :-)
> >>>
> >>>>> +    .get_address_space = pnv_phb3_dma_iommu, };
> >>>>> +
> >>>>>  static void pnv_phb3_instance_init(Object *obj)  {
> >>>>>      PnvPHB3 *phb = PNV_PHB3(obj); @@ -1059,7 +1063,7 @@ static
> >>>>> void pnv_phb3_realize(DeviceState *dev, Error
> >>>> **errp)
> >>>>>                                       &phb->pci_mmio, &phb->pci_io,
> >>>>>                                       0, 4,
> >>>>> TYPE_PNV_PHB3_ROOT_BUS);
> >>>>>
> >>>>> -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
> >>>>> +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
> >>>>>
> >>>>>      /* Add a single Root port */
> >>>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
> >>>>> phb->chip_id); diff --git a/hw/pci-host/pnv_phb4.c
> >>>>> b/hw/pci-host/pnv_phb4.c index
> >>>>> 23cf093..04e95e3 100644
> >>>>> --- a/hw/pci-host/pnv_phb4.c
> >>>>> +++ b/hw/pci-host/pnv_phb4.c
> >>>>> @@ -1148,6 +1148,10 @@ static AddressSpace
> >>>>> *pnv_phb4_dma_iommu(PCIBus
> >>>> *bus, void *opaque, int devfn)
> >>>>>      return &ds->dma_as;
> >>>>>  }
> >>>>>
> >>>>> +static PCIIOMMUOps pnv_phb4_iommu_ops = {
> >>>> idem
> >>> will add const.
> >>>
> >>>>> +    .get_address_space = pnv_phb4_dma_iommu, };
> >>>>> +
> >>>>>  static void pnv_phb4_instance_init(Object *obj)  {
> >>>>>      PnvPHB4 *phb = PNV_PHB4(obj); @@ -1205,7 +1209,7 @@ static
> >>>>> void pnv_phb4_realize(DeviceState *dev, Error
> >>>> **errp)
> >>>>>                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
> >>>>>                                       &phb->pci_mmio, &phb->pci_io,
> >>>>>                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
> >>>>> -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
> >>>>> +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
> >>>>>
> >>>>>      /* Add a single Root port */
> >>>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
> >>>>> phb->chip_id); diff --git a/hw/pci-host/ppce500.c
> >>>>> b/hw/pci-host/ppce500.c index d710727..5baf5db 100644
> >>>>> --- a/hw/pci-host/ppce500.c
> >>>>> +++ b/hw/pci-host/ppce500.c
> >>>>> @@ -439,6 +439,10 @@ static AddressSpace
> >>>>> *e500_pcihost_set_iommu(PCIBus
> >>>> *bus, void *opaque,
> >>>>>      return &s->bm_as;
> >>>>>  }
> >>>>>
> >>>>> +static const PCIIOMMUOps ppce500_iommu_ops = {
> >>>>> +    .get_address_space = e500_pcihost_set_iommu, };
> >>>>> +
> >>>>>  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
> >>>>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
> >>>>> static void e500_pcihost_realize(DeviceState *dev, Error **errp)
> >>>>>      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
> >>>>>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
> >>>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
> >>>>> -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
> >>>>> +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
> >>>>>
> >>>>>      pci_create_simple(b, 0, "e500-host-bridge");
> >>>>>
> >>>>> diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
> >>>>> 1a02e9a..7c57311 100644
> >>>>> --- a/hw/pci-host/prep.c
> >>>>> +++ b/hw/pci-host/prep.c
> >>>>> @@ -213,6 +213,10 @@ static AddressSpace
> >>>>> *raven_pcihost_set_iommu(PCIBus
> >>>> *bus, void *opaque,
> >>>>>      return &s->bm_as;
> >>>>>  }
> >>>>>
> >>>>> +static const PCIIOMMUOps raven_iommu_ops = {
> >>>>> +    .get_address_space = raven_pcihost_set_iommu, };
> >>>>> +
> >>>>>  static void raven_change_gpio(void *opaque, int n, int level)  {
> >>>>>      PREPPCIState *s = opaque;
> >>>>> @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
> >>>>>      memory_region_add_subregion(&s->bm, 0         , &s-
> >>> bm_pci_memory_alias);
> >>>>>      memory_region_add_subregion(&s->bm, 0x80000000, &s-
> >bm_ram_alias);
> >>>>>      address_space_init(&s->bm_as, &s->bm, "raven-bm");
> >>>>> -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
> >>>>> +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
> >>>>>
> >>>>>      h->bus = &s->pci_bus;
> >>>>>
> >>>>> diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
> >>>>> 2b8503b..251549b 100644
> >>>>> --- a/hw/pci-host/sabre.c
> >>>>> +++ b/hw/pci-host/sabre.c
> >>>>> @@ -112,6 +112,10 @@ static AddressSpace
> >>>>> *sabre_pci_dma_iommu(PCIBus
> >>>> *bus, void *opaque, int devfn)
> >>>>>      return &is->iommu_as;
> >>>>>  }
> >>>>>
> >>>>> +static const PCIIOMMUOps sabre_iommu_ops = {
> >>>>> +    .get_address_space = sabre_pci_dma_iommu, };
> >>>>> +
> >>>>>  static void sabre_config_write(void *opaque, hwaddr addr,
> >>>>>                                 uint64_t val, unsigned size)  { @@
> >>>>> -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
> >>>>>      /* IOMMU */
> >>>>>      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
> >>>>>                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
> >>>>> -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
> >>>>> +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
> >>>>>
> >>>>>      /* APB secondary busses */
> >>>>>      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
> >>>>> true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index
> >>>>> e1ed667..aa9025c
> >>>>> 100644
> >>>>> --- a/hw/pci/pci.c
> >>>>> +++ b/hw/pci/pci.c
> >>>>> @@ -2644,7 +2644,7 @@ AddressSpace
> >>>> *pci_device_iommu_address_space(PCIDevice *dev)
> >>>>>      PCIBus *iommu_bus = bus;
> >>>>>      uint8_t devfn = dev->devfn;
> >>>>>
> >>>>> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus-
> >>> parent_dev)
> >>>> {
> >>>>> +    while (iommu_bus && !iommu_bus->iommu_ops &&
> >>>>> + iommu_bus->parent_dev) {
> >>>> Depending on future usage, this is not strictly identical to the
> >>>> original code. You exit the loop as soon as a iommu_bus->iommu_ops
> >>>> is set whatever the presence of get_address_space().
> >>>
> >>> To be identical with original code, may adding the
> >>> get_address_space() presence check. Then the loop exits when the
> >>> iommu_bus->iommu_ops is set and meanwhile iommu_bus->iommu_ops-
> >get_address_space() is set.
> >>> But is it possible that there is an intermediate iommu_bus which has
> >>> iommu_ops set but the get_address_space() is clear. I guess not as
> >>> iommu_ops is set by vIOMMU and vIOMMU won't differentiate buses?
> >>
> >> I don't know. That depends on how the ops are going to be used in the
> >> future. Can't you enforce the fact that get_address_space() is a mandatory ops?
> >
> > No, I didn't mean that. Actually, in the patch, the get_address_space() presence is
> checked.
> > I'm not sure if your point is to add get_address_space() presence
> > check instead of just checking the iommu_ops presence.
> Yes that was my point. I wanted to underline the checks are not strictly identical
> and maybe during enumeration you may find a device with ops set and no
> get_address_space().

I see. But this can only happen when the vIOMMU provides multiple iommu_ops
instances, right? That is, for some buses the vIOMMU would set an iommu_ops
with get_address_space() populated, while for others it would set an
iommu_ops without get_address_space(). Is that possible?

> Then I meant maybe you should enforce somewhere in the
> code or in the documentation that get_address_space() is a mandatory operation in
> the ops struct and must be set as soon as the struct is passed. maybe in
> pci_setup_iommu() you could check that get_address_space is set?

What about the other callbacks in iommu_ops? It would be strange to check
only get_address_space() and ignore the rest.
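
If get_address_space() does become mandatory, the check could simply live
in pci_setup_iommu(), e.g. (a sketch only; I'm assuming the opaque keeps
being stored in bus->iommu_opaque as it is today):

void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
{
    /*
     * get_address_space() is the one callback every vIOMMU must
     * provide, so refuse a half-populated ops structure up front and
     * let pci_device_iommu_address_space() keep its check simple.
     */
    assert(ops && ops->get_address_space);
    bus->iommu_ops = ops;
    bus->iommu_opaque = opaque;
}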

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
  2020-04-01  7:50     ` Auger Eric
@ 2020-04-06  7:12       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-06  7:12 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Wednesday, April 1, 2020 3:51 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
> 
> Hi Yi,
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > In this patch, QEMU firstly gets iommu info from kernel to check the
> > supported capabilities by a VFIO_IOMMU_TYPE1_NESTING iommu. And inits
> > HostIOMMUContet instance.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/vfio/common.c | 99
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 99 insertions(+)
> >
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c index
> > 5f3534d..44b142c 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1226,10 +1226,89 @@ static int
> vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
> >      return 0;
> >  }
> >
> > +/**
> > + * Get iommu info from host. Caller of this funcion should free
> > + * the memory pointed by the returned pointer stored in @info
> > + * after a successful calling when finished its usage.
> > + */
> > +static int vfio_get_iommu_info(VFIOContainer *container,
> > +                         struct vfio_iommu_type1_info **info) {
> > +
> > +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> > +
> > +    *info = g_malloc0(argsz);
> > +
> > +retry:
> > +    (*info)->argsz = argsz;
> > +
> > +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> > +        g_free(*info);
> > +        *info = NULL;
> > +        return -errno;
> > +    }
> > +
> > +    if (((*info)->argsz > argsz)) {
> > +        argsz = (*info)->argsz;
> > +        *info = g_realloc(*info, argsz);
> > +        goto retry;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static struct vfio_info_cap_header *
> > +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t
> > +id) {
> > +    struct vfio_info_cap_header *hdr;
> > +    void *ptr = info;
> > +
> > +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> > +        return NULL;
> > +    }
> > +
> > +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> > +        if (hdr->id == id) {
> > +            return hdr;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static int vfio_get_nesting_iommu_cap(VFIOContainer *container,
> > +                   struct vfio_iommu_type1_info_cap_nesting
> > +*cap_nesting) {
> > +    struct vfio_iommu_type1_info *info;
> > +    struct vfio_info_cap_header *hdr;
> > +    struct vfio_iommu_type1_info_cap_nesting *cap;
> > +    int ret;
> > +
> > +    ret = vfio_get_iommu_info(container, &info);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    hdr = vfio_get_iommu_info_cap(info,
> > +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
> > +    if (!hdr) {
> > +        g_free(info);
> > +        return -errno;
> > +    }
> > +
> > +    cap = container_of(hdr,
> > +                struct vfio_iommu_type1_info_cap_nesting, header);
> > +    *cap_nesting = *cap;
> > +
> > +    g_free(info);
> > +    return 0;
> > +}
> > +
> >  static int vfio_init_container(VFIOContainer *container, int group_fd,
> >                                 Error **errp)  {
> >      int iommu_type, ret;
> > +    uint64_t flags = 0;
> >
> >      iommu_type = vfio_get_iommu_type(container, errp);
> >      if (iommu_type < 0) {
> > @@ -1257,6 +1336,26 @@ static int vfio_init_container(VFIOContainer
> *container, int group_fd,
> >          return -errno;
> >      }
> >
> > +    if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> > +        struct vfio_iommu_type1_info_cap_nesting nesting = {
> > +                                         .nesting_capabilities = 0x0,
> > +                                         .stage1_formats = 0, };
> > +
> > +        ret = vfio_get_nesting_iommu_cap(container, &nesting);
> > +        if (ret) {
> > +            error_setg_errno(errp, -ret,
> > +                             "Failed to get nesting iommu cap");
> > +            return ret;
> > +        }
> > +
> > +        flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
> > +                 HOST_IOMMU_PASID_REQUEST : 0;
> I still don't get why you can't transform your iommu_ctx into a  pointer and do
> container->iommu_ctx = g_new0(HostIOMMUContext, 1);
> then
> host_iommu_ctx_init(container->iommu_ctx, flags);
> 
> looks something similar to (hw/vfio/common.c). You may not even need to use a
> derived VFIOHostIOMMUContext object (As only VFIO does use that object)? Only
> the ops do change, no new field?
>         region->mem = g_new0(MemoryRegion, 1);
>         memory_region_init_io(region->mem, obj, &vfio_region_ops,
>                               region, name, region->size);

In this way (keeping iommu_ctx embedded in the VFIOContainer), a VFIO
callback can easily get the VFIOContainer back from the HostIOMMUContext
when it is invoked, e.g. the one below.

+static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
+                                           uint32_t min, uint32_t max,
+                                           uint32_t *pasid)
+{
+    VFIOContainer *container = container_of(iommu_ctx,
+                                            VFIOContainer, iommu_ctx);
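To make the contrast explicit, here is a rough sketch of why the embedded
member matters (type layout simplified, field names as in this series):

typedef struct VFIOContainer {
    int fd;
    HostIOMMUContext iommu_ctx;   /* embedded member, not a pointer */
    /* ... other fields elided ... */
} VFIOContainer;

/* Any hook that receives the HostIOMMUContext pointer can recover its
 * owning container purely from the struct layout: */
static inline VFIOContainer *to_vfio_container(HostIOMMUContext *iommu_ctx)
{
    return container_of(iommu_ctx, VFIOContainer, iommu_ctx);
}

With a heap-allocated pointer member instead (container->iommu_ctx =
g_new0(HostIOMMUContext, 1)), container_of() can no longer be used, so the
context would need to carry an explicit owner/opaque back-pointer for the
hooks to find their container.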
 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
  2020-03-30 17:22     ` Auger Eric
@ 2020-04-06  8:04       ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-06  8:04 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Eric,

> From: Auger Eric < eric.auger@redhat.com>
> Sent: Tuesday, March 31, 2020 1:23 AM
> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> Subject: Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
> 
> Yi,
> 
> On 3/30/20 6:24 AM, Liu Yi L wrote:
> > Currently, many platform vendors provide the capability of dual stage
> > DMA address translation in hardware. For example, nested translation
> > on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
> > and etc. In dual stage DMA address translation, there are two stages
> > address translation, stage-1 (a.k.a first-level) and stage-2 (a.k.a
> > second-level) translation structures. Stage-1 translation results are
> > also subjected to stage-2 translation structures. Take vSVA (Virtual
> > Shared Virtual Addressing) as an example, guest IOMMU driver owns
> > stage-1 translation structures (covers GVA->GPA translation), and host
> > IOMMU driver owns stage-2 translation structures (covers GPA->HPA
> > translation). VMM is responsible to bind stage-1 translation structures
> > to host, thus hardware could achieve GVA->GPA and then GPA->HPA
> > translation. For more background on SVA, refer the below links.
> >  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
> >  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> > Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> >
[...]
> > +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
> > +                         const char *mrtypename,
> > +                         uint64_t flags)
> > +{
> > +    HostIOMMUContext *iommu_ctx;
> > +
> > +    object_initialize(_iommu_ctx, instance_size, mrtypename);
> > +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
> > +    iommu_ctx->flags = flags;
> > +    iommu_ctx->initialized = true;
> > +}
> > +
> > +static const TypeInfo host_iommu_context_info = {
> > +    .parent             = TYPE_OBJECT,
> > +    .name               = TYPE_HOST_IOMMU_CONTEXT,
> > +    .class_size         = sizeof(HostIOMMUContextClass),
> > +    .instance_size      = sizeof(HostIOMMUContext),
> > +    .abstract           = true,
> Can't we use the usual .instance_init and .instance_finalize?
Sorry, I somehow missed this comment. In a prior version of the patch,
.instance_init was used, but the main init path is now host_iommu_ctx_init(),
so .instance_init is not really necessary; see:
https://www.spinics.net/lists/kvm/msg210878.html
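For reference, the intended call on the VFIO side looks roughly like the
following (the derived type name here is only a guess, not necessarily what
the series uses):

host_iommu_ctx_init(&container->iommu_ctx,
                    sizeof(container->iommu_ctx),
                    TYPE_VFIO_HOST_IOMMU_CONTEXT,
                    flags);

All per-instance setup happens in this explicit call, so a QOM
.instance_init hook would have nothing left to do.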

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  2020-04-06  6:27               ` Liu, Yi L
@ 2020-04-06 10:04                 ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-06 10:04 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, david, Wu, Hao

Yi,

On 4/6/20 8:27 AM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric <eric.auger@redhat.com>
>> Sent: Thursday, April 2, 2020 9:49 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
>> PCIIOMMUOps
>>
>> Hi Yi,
>>
>> On 4/2/20 3:37 PM, Liu, Yi L wrote:
>>> Hi Eric,
>>>
>>>> From: Auger Eric < eric.auger@redhat.com >
>>>> Sent: Thursday, April 2, 2020 8:41 PM
>>>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>>>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set
>>>> PCIIOMMUOps
>>>>
>>>> Hi Yi,
>>>>
>>>> On 4/2/20 10:52 AM, Liu, Yi L wrote:
>>>>>> From: Auger Eric < eric.auger@redhat.com>
>>>>>> Sent: Monday, March 30, 2020 7:02 PM
>>>>>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>>>>>> Subject: Re: [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to
>>>>>> set PCIIOMMUOps
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>>>>>> This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead
>>>>>>> of setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address
>>>>>>> space for a PCI device in vendor specific way. The PCIIOMMUOps
>>>>>>> still offers this functionality. But using PCIIOMMUOps leaves
>>>>>>> space to add more iommu related vendor specific operations.
>>>>>>>
>>>>>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>>>>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>> Cc: Eric Auger <eric.auger@redhat.com>
>>>>>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>>>>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>>>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>>>>>> ---
>>>>>>>  hw/alpha/typhoon.c       |  6 +++++-
>>>>>>>  hw/arm/smmu-common.c     |  6 +++++-
>>>>>>>  hw/hppa/dino.c           |  6 +++++-
>>>>>>>  hw/i386/amd_iommu.c      |  6 +++++-
>>>>>>>  hw/i386/intel_iommu.c    |  6 +++++-
>>>>>>>  hw/pci-host/designware.c |  6 +++++-
>>>>>>>  hw/pci-host/pnv_phb3.c   |  6 +++++-
>>>>>>>  hw/pci-host/pnv_phb4.c   |  6 +++++-
>>>>>>>  hw/pci-host/ppce500.c    |  6 +++++-
>>>>>>>  hw/pci-host/prep.c       |  6 +++++-
>>>>>>>  hw/pci-host/sabre.c      |  6 +++++-
>>>>>>>  hw/pci/pci.c             | 12 +++++++-----
>>>>>>>  hw/ppc/ppc440_pcix.c     |  6 +++++-
>>>>>>>  hw/ppc/spapr_pci.c       |  6 +++++-
>>>>>>>  hw/s390x/s390-pci-bus.c  |  8 ++++++--  hw/virtio/virtio-iommu.c
>>>>>>> |
>>>>>>> 6
>>>>>>> +++++-
>>>>>>>  include/hw/pci/pci.h     |  8 ++++++--
>>>>>>>  include/hw/pci/pci_bus.h |  2 +-
>>>>>>>  18 files changed, 90 insertions(+), 24 deletions(-)
>>>>>>>
>>>>>>> diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c index
>>>>>>> 1795e2f..f271de1 100644
>>>>>>> --- a/hw/alpha/typhoon.c
>>>>>>> +++ b/hw/alpha/typhoon.c
>>>>>>> @@ -740,6 +740,10 @@ static AddressSpace
>>>>>>> *typhoon_pci_dma_iommu(PCIBus
>>>>>> *bus, void *opaque, int devfn)
>>>>>>>      return &s->pchip.iommu_as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static const PCIIOMMUOps typhoon_iommu_ops = {
>>>>>>> +    .get_address_space = typhoon_pci_dma_iommu, };
>>>>>>> +
>>>>>>>  static void typhoon_set_irq(void *opaque, int irq, int level)  {
>>>>>>>      TyphoonState *s = opaque;
>>>>>>> @@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus
>>>>>> **isa_bus, qemu_irq *p_rtc_irq,
>>>>>>>                               "iommu-typhoon", UINT64_MAX);
>>>>>>>      address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s-
>>>>>>> pchip.iommu),
>>>>>>>                         "pchip0-pci");
>>>>>>> -    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
>>>>>>> +    pci_setup_iommu(b, &typhoon_iommu_ops, s);
>>>>>>>
>>>>>>>      /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
>>>>>>>      memory_region_init_io(&s->pchip.reg_iack, OBJECT(s),
>>>>>>> &alpha_pci_iack_ops, diff --git a/hw/arm/smmu-common.c
>>>>>>> b/hw/arm/smmu-common.c index e13a5f4..447146e 100644
>>>>>>> --- a/hw/arm/smmu-common.c
>>>>>>> +++ b/hw/arm/smmu-common.c
>>>>>>> @@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus
>>>>>>> *bus,
>>>>>> void *opaque, int devfn)
>>>>>>>      return &sdev->as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static const PCIIOMMUOps smmu_ops = {
>>>>>>> +    .get_address_space = smmu_find_add_as, };
>>>>>>> +
>>>>>>>  IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)  {
>>>>>>>      uint8_t bus_n, devfn;
>>>>>>> @@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState
>>>>>>> *dev, Error
>>>>>> **errp)
>>>>>>>      s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
>>>>>>>
>>>>>>>      if (s->primary_bus) {
>>>>>>> -        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
>>>>>>> +        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
>>>>>>>      } else {
>>>>>>>          error_setg(errp, "SMMU is not attached to any PCI bus!");
>>>>>>>      }
>>>>>>> diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c index
>>>>>>> 2b1b38c..3da4f84
>>>>>>> 100644
>>>>>>> --- a/hw/hppa/dino.c
>>>>>>> +++ b/hw/hppa/dino.c
>>>>>>> @@ -459,6 +459,10 @@ static AddressSpace
>>>>>>> *dino_pcihost_set_iommu(PCIBus
>>>>>> *bus, void *opaque,
>>>>>>>      return &s->bm_as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static const PCIIOMMUOps dino_iommu_ops = {
>>>>>>> +    .get_address_space = dino_pcihost_set_iommu, };
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * Dino interrupts are connected as shown on Page 78, Table 23
>>>>>>>   * (Little-endian bit numbers)
>>>>>>> @@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
>>>>>>>      memory_region_add_subregion(&s->bm, 0xfff00000,
>>>>>>>                                  &s->bm_cpu_alias);
>>>>>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>>>>>> -    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
>>>>>>> +    pci_setup_iommu(b, &dino_iommu_ops, s);
>>>>>>>
>>>>>>>      *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
>>>>>>>      *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0);
>>>>>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index
>>>>>>> b1175e5..5fec30e 100644
>>>>>>> --- a/hw/i386/amd_iommu.c
>>>>>>> +++ b/hw/i386/amd_iommu.c
>>>>>>> @@ -1451,6 +1451,10 @@ static AddressSpace
>>>>>> *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>>>>>>      return &iommu_as[devfn]->as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static const PCIIOMMUOps amdvi_iommu_ops = {
>>>>>>> +    .get_address_space = amdvi_host_dma_iommu, };
>>>>>>> +
>>>>>>>  static const MemoryRegionOps mmio_mem_ops = {
>>>>>>>      .read = amdvi_mmio_read,
>>>>>>>      .write = amdvi_mmio_write,
>>>>>>> @@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev,
>>>>>>> Error **errp)
>>>>>>>
>>>>>>>      sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
>>>>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
>>>>>>> -    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
>>>>>>> +    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
>>>>>>>      s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
>>>>>>>      msi_init(&s->pci.dev, 0, 1, true, false, errp);
>>>>>>>      amdvi_init(s);
>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
>>>>>>> df7ad25..4b22910 100644
>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>> @@ -3729,6 +3729,10 @@ static AddressSpace
>>>>>>> *vtd_host_dma_iommu(PCIBus
>>>>>> *bus, void *opaque, int devfn)
>>>>>>>      return &vtd_as->as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static PCIIOMMUOps vtd_iommu_ops = {
>>>>>> static const
>>>>>
>>>>> got it.
>>>>>
>>>>>>> +    .get_address_space = vtd_host_dma_iommu, };
>>>>>>> +
>>>>>>>  static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)  {
>>>>>>>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s); @@ -3840,7
>>>>>>> +3844,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>>>>>>>                                                g_free, g_free);
>>>>>>>      vtd_init(s);
>>>>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
>>>>>> Q35_HOST_BRIDGE_IOMMU_ADDR);
>>>>>>> -    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
>>>>>>> +    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
>>>>>>>      /* Pseudo address space under root PCI bus. */
>>>>>>>      x86ms->ioapic_as = vtd_host_dma_iommu(bus, s,
>>>>>> Q35_PSEUDO_DEVFN_IOAPIC);
>>>>>>>
>>>>>>> qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
>>>>>>> diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
>>>>>>> index dd24551..4c6338a 100644
>>>>>>> --- a/hw/pci-host/designware.c
>>>>>>> +++ b/hw/pci-host/designware.c
>>>>>>> @@ -645,6 +645,10 @@ static AddressSpace
>>>>>> *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
>>>>>>>      return &s->pci.address_space;  }
>>>>>>>
>>>>>>> +static const PCIIOMMUOps designware_iommu_ops = {
>>>>>>> +    .get_address_space = designware_pcie_host_set_iommu, };
>>>>>>> +
>>>>>>>  static void designware_pcie_host_realize(DeviceState *dev, Error
>>>>>>> **errp)  {
>>>>>>>      PCIHostState *pci = PCI_HOST_BRIDGE(dev); @@ -686,7 +690,7 @@
>>>>>>> static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
>>>>>>>      address_space_init(&s->pci.address_space,
>>>>>>>                         &s->pci.address_space_root,
>>>>>>>                         "pcie-bus-address-space");
>>>>>>> -    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
>>>>>>> +    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
>>>>>>>
>>>>>>>      qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
>>>>>>>      qdev_init_nofail(DEVICE(&s->root));
>>>>>>> diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c index
>>>>>>> 74618fa..ecfe627 100644
>>>>>>> --- a/hw/pci-host/pnv_phb3.c
>>>>>>> +++ b/hw/pci-host/pnv_phb3.c
>>>>>>> @@ -961,6 +961,10 @@ static AddressSpace
>>>>>>> *pnv_phb3_dma_iommu(PCIBus
>>>>>> *bus, void *opaque, int devfn)
>>>>>>>      return &ds->dma_as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static PCIIOMMUOps pnv_phb3_iommu_ops = {
>>>>>> static const
>>>>> got it. :-)
>>>>>
>>>>>>> +    .get_address_space = pnv_phb3_dma_iommu, };
>>>>>>> +
>>>>>>>  static void pnv_phb3_instance_init(Object *obj)  {
>>>>>>>      PnvPHB3 *phb = PNV_PHB3(obj); @@ -1059,7 +1063,7 @@ static
>>>>>>> void pnv_phb3_realize(DeviceState *dev, Error
>>>>>> **errp)
>>>>>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>>>>>                                       0, 4,
>>>>>>> TYPE_PNV_PHB3_ROOT_BUS);
>>>>>>>
>>>>>>> -    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
>>>>>>> +    pci_setup_iommu(pci->bus, &pnv_phb3_iommu_ops, phb);
>>>>>>>
>>>>>>>      /* Add a single Root port */
>>>>>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
>>>>>>> phb->chip_id); diff --git a/hw/pci-host/pnv_phb4.c
>>>>>>> b/hw/pci-host/pnv_phb4.c index
>>>>>>> 23cf093..04e95e3 100644
>>>>>>> --- a/hw/pci-host/pnv_phb4.c
>>>>>>> +++ b/hw/pci-host/pnv_phb4.c
>>>>>>> @@ -1148,6 +1148,10 @@ static AddressSpace
>>>>>>> *pnv_phb4_dma_iommu(PCIBus
>>>>>> *bus, void *opaque, int devfn)
>>>>>>>      return &ds->dma_as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static PCIIOMMUOps pnv_phb4_iommu_ops = {
>>>>>> idem
>>>>> will add const.
>>>>>
>>>>>>> +    .get_address_space = pnv_phb4_dma_iommu, };
>>>>>>> +
>>>>>>>  static void pnv_phb4_instance_init(Object *obj)  {
>>>>>>>      PnvPHB4 *phb = PNV_PHB4(obj); @@ -1205,7 +1209,7 @@ static
>>>>>>> void pnv_phb4_realize(DeviceState *dev, Error
>>>>>> **errp)
>>>>>>>                                       pnv_phb4_set_irq, pnv_phb4_map_irq, phb,
>>>>>>>                                       &phb->pci_mmio, &phb->pci_io,
>>>>>>>                                       0, 4, TYPE_PNV_PHB4_ROOT_BUS);
>>>>>>> -    pci_setup_iommu(pci->bus, pnv_phb4_dma_iommu, phb);
>>>>>>> +    pci_setup_iommu(pci->bus, &pnv_phb4_iommu_ops, phb);
>>>>>>>
>>>>>>>      /* Add a single Root port */
>>>>>>>      qdev_prop_set_uint8(DEVICE(&phb->root), "chassis",
>>>>>>> phb->chip_id); diff --git a/hw/pci-host/ppce500.c
>>>>>>> b/hw/pci-host/ppce500.c index d710727..5baf5db 100644
>>>>>>> --- a/hw/pci-host/ppce500.c
>>>>>>> +++ b/hw/pci-host/ppce500.c
>>>>>>> @@ -439,6 +439,10 @@ static AddressSpace
>>>>>>> *e500_pcihost_set_iommu(PCIBus
>>>>>> *bus, void *opaque,
>>>>>>>      return &s->bm_as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static const PCIIOMMUOps ppce500_iommu_ops = {
>>>>>>> +    .get_address_space = e500_pcihost_set_iommu, };
>>>>>>> +
>>>>>>>  static void e500_pcihost_realize(DeviceState *dev, Error **errp)  {
>>>>>>>      SysBusDevice *sbd = SYS_BUS_DEVICE(dev); @@ -473,7 +477,7 @@
>>>>>>> static void e500_pcihost_realize(DeviceState *dev, Error **errp)
>>>>>>>      memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
>>>>>>>      memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
>>>>>>>      address_space_init(&s->bm_as, &s->bm, "pci-bm");
>>>>>>> -    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
>>>>>>> +    pci_setup_iommu(b, &ppce500_iommu_ops, s);
>>>>>>>
>>>>>>>      pci_create_simple(b, 0, "e500-host-bridge");
>>>>>>>
>>>>>>> diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c index
>>>>>>> 1a02e9a..7c57311 100644
>>>>>>> --- a/hw/pci-host/prep.c
>>>>>>> +++ b/hw/pci-host/prep.c
>>>>>>> @@ -213,6 +213,10 @@ static AddressSpace
>>>>>>> *raven_pcihost_set_iommu(PCIBus
>>>>>> *bus, void *opaque,
>>>>>>>      return &s->bm_as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static const PCIIOMMUOps raven_iommu_ops = {
>>>>>>> +    .get_address_space = raven_pcihost_set_iommu, };
>>>>>>> +
>>>>>>>  static void raven_change_gpio(void *opaque, int n, int level)  {
>>>>>>>      PREPPCIState *s = opaque;
>>>>>>> @@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
>>>>>>>      memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
>>>>>>>      memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
>>>>>>>      address_space_init(&s->bm_as, &s->bm, "raven-bm");
>>>>>>> -    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
>>>>>>> +    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
>>>>>>>
>>>>>>>      h->bus = &s->pci_bus;
>>>>>>>
>>>>>>> diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c index
>>>>>>> 2b8503b..251549b 100644
>>>>>>> --- a/hw/pci-host/sabre.c
>>>>>>> +++ b/hw/pci-host/sabre.c
>>>>>>> @@ -112,6 +112,10 @@ static AddressSpace
>>>>>>> *sabre_pci_dma_iommu(PCIBus
>>>>>> *bus, void *opaque, int devfn)
>>>>>>>      return &is->iommu_as;
>>>>>>>  }
>>>>>>>
>>>>>>> +static const PCIIOMMUOps sabre_iommu_ops = {
>>>>>>> +    .get_address_space = sabre_pci_dma_iommu, };
>>>>>>> +
>>>>>>>  static void sabre_config_write(void *opaque, hwaddr addr,
>>>>>>>                                 uint64_t val, unsigned size)  { @@
>>>>>>> -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
>>>>>>>      /* IOMMU */
>>>>>>>      memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
>>>>>>>                      sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
>>>>>>> -    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
>>>>>>> +    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
>>>>>>>
>>>>>>>      /* APB secondary busses */
>>>>>>>      pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0),
>>>>>>> true, diff --git a/hw/pci/pci.c b/hw/pci/pci.c index
>>>>>>> e1ed667..aa9025c
>>>>>>> 100644
>>>>>>> --- a/hw/pci/pci.c
>>>>>>> +++ b/hw/pci/pci.c
>>>>>>> @@ -2644,7 +2644,7 @@ AddressSpace
>>>>>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>>>      PCIBus *iommu_bus = bus;
>>>>>>>      uint8_t devfn = dev->devfn;
>>>>>>>
>>>>>>> -    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>>>>>>> +    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
>>>>>> Depending on future usage, this is not strictly identical to the
>>>>>> original code. You exit the loop as soon as a iommu_bus->iommu_ops
>>>>>> is set whatever the presence of get_address_space().
>>>>>
>>>>> To be identical with the original code, we may add a
>>>>> get_address_space() presence check, so that the loop only exits when
>>>>> iommu_bus->iommu_ops is set and iommu_bus->iommu_ops->get_address_space()
>>>>> is set as well.
>>>>> But is it possible that there is an intermediate iommu_bus which has
>>>>> iommu_ops set but get_address_space() clear? I guess not, as
>>>>> iommu_ops is set by the vIOMMU and the vIOMMU won't differentiate buses?
>>>>
>>>> I don't know. That depends on how the ops are going to be used in the
>>>> future. Can't you enforce the fact that get_address_space() is a mandatory ops?
>>>
>>> No, I didn't mean that. Actually, in the patch, the get_address_space() presence is
>> checked.
>>> I'm not sure if your point is to add get_address_space() presence
>>> check instead of just checking the iommu_ops presence.
>> Yes that was my point. I wanted to underline the checks are not strictly identical
>> and maybe during enumeration you may find a device with ops set and no
>> get_address_space().
> 
> I see. But this happens only when there are multiple iommu_ops instances
> provided by the vIOMMU, right? For some buses the vIOMMU would set an iommu_ops
> with get_address_space() set, but for others an iommu_ops without
> get_address_space(). Is that possible?
I don't think it is possible at the moment.
> 
>> Then I meant maybe you should enforce somewhere in the
>> code or in the documentation that get_address_space() is a mandatory operation in
>> the ops struct and must be set as soon as the struct is passed. maybe in
>> pci_setup_iommu() you could check that get_address_space is set?
> 
> How about the other callbacks in iommu_ops? It would be strange to only
> check get_address_space() and ignore the other callbacks.
Well, in iommu_ops, get_address_space() is supposed to always be
implemented, right? If that is the case, then you may check for it when
the ops are passed. Is that the case for the other ops, and especially
set_iommu_ctx? To me, if an operation is mandatory this should be
documented and maybe enforced.
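Something as simple as this in pci_setup_iommu() would do (just a sketch
against the signature this patch introduces; the assert is only an
illustration):

void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
{
    /* get_address_space() is the one callback every caller must provide */
    assert(ops && ops->get_address_space);
    bus->iommu_ops = ops;
    bus->iommu_opaque = opaque;
}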

Thanks

Eric
> 
> Regards,
> Yi Liu
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
  2020-04-06  7:12       ` Liu, Yi L
@ 2020-04-06 10:20         ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-06 10:20 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

Hi Yi,

On 4/6/20 9:12 AM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric <eric.auger@redhat.com>
>> Sent: Wednesday, April 1, 2020 3:51 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
>>
>> Hi Yi,
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> In this patch, QEMU first gets the iommu info from the kernel to check the
>>> capabilities supported by a VFIO_IOMMU_TYPE1_NESTING iommu, and then inits
>>> a HostIOMMUContext instance.
>>>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Eric Auger <eric.auger@redhat.com>
>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Cc: Alex Williamson <alex.williamson@redhat.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  hw/vfio/common.c | 99
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 99 insertions(+)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c index
>>> 5f3534d..44b142c 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -1226,10 +1226,89 @@ static int
>> vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
>>>      return 0;
>>>  }
>>>
>>> +/**
>>> + * Get iommu info from host. On success, the caller should free
>>> + * the memory pointed to by the pointer returned in @info
>>> + * when it has finished using it.
>>> + */
>>> +static int vfio_get_iommu_info(VFIOContainer *container,
>>> +                         struct vfio_iommu_type1_info **info) {
>>> +
>>> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
>>> +
>>> +    *info = g_malloc0(argsz);
>>> +
>>> +retry:
>>> +    (*info)->argsz = argsz;
>>> +
>>> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
>>> +        g_free(*info);
>>> +        *info = NULL;
>>> +        return -errno;
>>> +    }
>>> +
>>> +    if (((*info)->argsz > argsz)) {
>>> +        argsz = (*info)->argsz;
>>> +        *info = g_realloc(*info, argsz);
>>> +        goto retry;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static struct vfio_info_cap_header *
>>> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t
>>> +id) {
>>> +    struct vfio_info_cap_header *hdr;
>>> +    void *ptr = info;
>>> +
>>> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
>>> +        if (hdr->id == id) {
>>> +            return hdr;
>>> +        }
>>> +    }
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>> +static int vfio_get_nesting_iommu_cap(VFIOContainer *container,
>>> +                   struct vfio_iommu_type1_info_cap_nesting
>>> +*cap_nesting) {
>>> +    struct vfio_iommu_type1_info *info;
>>> +    struct vfio_info_cap_header *hdr;
>>> +    struct vfio_iommu_type1_info_cap_nesting *cap;
>>> +    int ret;
>>> +
>>> +    ret = vfio_get_iommu_info(container, &info);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    hdr = vfio_get_iommu_info_cap(info,
>>> +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
>>> +    if (!hdr) {
>>> +        g_free(info);
>>> +        return -errno;
>>> +    }
>>> +
>>> +    cap = container_of(hdr,
>>> +                struct vfio_iommu_type1_info_cap_nesting, header);
>>> +    *cap_nesting = *cap;
>>> +
>>> +    g_free(info);
>>> +    return 0;
>>> +}
>>> +
>>>  static int vfio_init_container(VFIOContainer *container, int group_fd,
>>>                                 Error **errp)  {
>>>      int iommu_type, ret;
>>> +    uint64_t flags = 0;
>>>
>>>      iommu_type = vfio_get_iommu_type(container, errp);
>>>      if (iommu_type < 0) {
>>> @@ -1257,6 +1336,26 @@ static int vfio_init_container(VFIOContainer
>> *container, int group_fd,
>>>          return -errno;
>>>      }
>>>
>>> +    if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>> +        struct vfio_iommu_type1_info_cap_nesting nesting = {
>>> +                                         .nesting_capabilities = 0x0,
>>> +                                         .stage1_formats = 0, };
>>> +
>>> +        ret = vfio_get_nesting_iommu_cap(container, &nesting);
>>> +        if (ret) {
>>> +            error_setg_errno(errp, -ret,
>>> +                             "Failed to get nesting iommu cap");
>>> +            return ret;
>>> +        }
>>> +
>>> +        flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
>>> +                 HOST_IOMMU_PASID_REQUEST : 0;
>> I still don't get why you can't transform your iommu_ctx into a  pointer and do
>> container->iommu_ctx = g_new0(HostIOMMUContext, 1);
>> then
>> host_iommu_ctx_init(container->iommu_ctx, flags);
>>
>> This would look similar to what is done elsewhere in hw/vfio/common.c (see the
>> snippet below). You may not even need a derived VFIOHostIOMMUContext object
>> (as only VFIO uses that object)? Only the ops would change, no new field?
>>         region->mem = g_new0(MemoryRegion, 1);
>>         memory_region_init_io(region->mem, obj, &vfio_region_ops,
>>                               region, name, region->size);
> 
> In this way, a vfio hook can easily get the VFIOContainer back from the
> HostIOMMUContext it is called with, e.g. the one below.
OK, I get it. However, in memory_region_init_io() you also pass the owner and
an opaque, e.g. the region, so I think you could do the same, no?
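i.e. roughly the parallel below; the extra owner argument here is
hypothetical, it is not what the current patch does:

/* MemoryRegion: heap allocated, and the opaque handed over at init time
 * comes back in every vfio_region_ops callback */
region->mem = g_new0(MemoryRegion, 1);
memory_region_init_io(region->mem, obj, &vfio_region_ops,
                      region, name, region->size);

/* the same pattern here would avoid relying on container_of(): */
container->iommu_ctx = g_new0(HostIOMMUContext, 1);
host_iommu_ctx_init(container->iommu_ctx, sizeof(*container->iommu_ctx),
                    TYPE_HOST_IOMMU_CONTEXT,
                    container /* hypothetical owner/opaque */, flags);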

Thanks

Eric
> 
> +static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
> +                                           uint32_t min, uint32_t max,
> +                                           uint32_t *pasid)
> +{
> +    VFIOContainer *container = container_of(iommu_ctx,
> +                                            VFIOContainer, iommu_ctx);
>  
> Regards,
> Yi Liu
> 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
@ 2020-04-06 10:20         ` Auger Eric
  0 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-06 10:20 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, david, Wu, Hao

Hi Yi,

On 4/6/20 9:12 AM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric <eric.auger@redhat.com>
>> Sent: Wednesday, April 1, 2020 3:51 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
>>
>> Hi Yi,
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> In this patch, QEMU first gets the iommu info from the kernel to check the
>>> capabilities supported by a VFIO_IOMMU_TYPE1_NESTING iommu, and then inits
>>> a HostIOMMUContext instance.
>>>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Eric Auger <eric.auger@redhat.com>
>>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
>>> Cc: David Gibson <david@gibson.dropbear.id.au>
>>> Cc: Alex Williamson <alex.williamson@redhat.com>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  hw/vfio/common.c | 99
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 99 insertions(+)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c index
>>> 5f3534d..44b142c 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -1226,10 +1226,89 @@ static int
>> vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
>>>      return 0;
>>>  }
>>>
>>> +/**
>>> + * Get iommu info from host. On success, the caller is responsible for
>>> + * freeing the memory returned through @info once it is no longer
>>> + * needed.
>>> + */
>>> +static int vfio_get_iommu_info(VFIOContainer *container,
>>> +                         struct vfio_iommu_type1_info **info) {
>>> +
>>> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
>>> +
>>> +    *info = g_malloc0(argsz);
>>> +
>>> +retry:
>>> +    (*info)->argsz = argsz;
>>> +
>>> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
>>> +        g_free(*info);
>>> +        *info = NULL;
>>> +        return -errno;
>>> +    }
>>> +
>>> +    if (((*info)->argsz > argsz)) {
>>> +        argsz = (*info)->argsz;
>>> +        *info = g_realloc(*info, argsz);
>>> +        goto retry;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static struct vfio_info_cap_header *
>>> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t
>>> +id) {
>>> +    struct vfio_info_cap_header *hdr;
>>> +    void *ptr = info;
>>> +
>>> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
>>> +        if (hdr->id == id) {
>>> +            return hdr;
>>> +        }
>>> +    }
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>> +static int vfio_get_nesting_iommu_cap(VFIOContainer *container,
>>> +                   struct vfio_iommu_type1_info_cap_nesting
>>> +*cap_nesting) {
>>> +    struct vfio_iommu_type1_info *info;
>>> +    struct vfio_info_cap_header *hdr;
>>> +    struct vfio_iommu_type1_info_cap_nesting *cap;
>>> +    int ret;
>>> +
>>> +    ret = vfio_get_iommu_info(container, &info);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    hdr = vfio_get_iommu_info_cap(info,
>>> +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
>>> +    if (!hdr) {
>>> +        g_free(info);
>>> +        return -errno;
>>> +    }
>>> +
>>> +    cap = container_of(hdr,
>>> +                struct vfio_iommu_type1_info_cap_nesting, header);
>>> +    *cap_nesting = *cap;
>>> +
>>> +    g_free(info);
>>> +    return 0;
>>> +}
>>> +
>>>  static int vfio_init_container(VFIOContainer *container, int group_fd,
>>>                                 Error **errp)  {
>>>      int iommu_type, ret;
>>> +    uint64_t flags = 0;
>>>
>>>      iommu_type = vfio_get_iommu_type(container, errp);
>>>      if (iommu_type < 0) {
>>> @@ -1257,6 +1336,26 @@ static int vfio_init_container(VFIOContainer
>> *container, int group_fd,
>>>          return -errno;
>>>      }
>>>
>>> +    if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>> +        struct vfio_iommu_type1_info_cap_nesting nesting = {
>>> +                                         .nesting_capabilities = 0x0,
>>> +                                         .stage1_formats = 0, };
>>> +
>>> +        ret = vfio_get_nesting_iommu_cap(container, &nesting);
>>> +        if (ret) {
>>> +            error_setg_errno(errp, -ret,
>>> +                             "Failed to get nesting iommu cap");
>>> +            return ret;
>>> +        }
>>> +
>>> +        flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
>>> +                 HOST_IOMMU_PASID_REQUEST : 0;
>> I still don't get why you can't transform your iommu_ctx into a  pointer and do
>> container->iommu_ctx = g_new0(HostIOMMUContext, 1);
>> then
>> host_iommu_ctx_init(container->iommu_ctx, flags);
>>
>> That would look similar to what is already done in hw/vfio/common.c. You may not
>> even need a derived VFIOHostIOMMUContext object (as only VFIO uses that object)?
>> Only the ops change, no new field?
>>         region->mem = g_new0(MemoryRegion, 1);
>>         memory_region_init_io(region->mem, obj, &vfio_region_ops,
>>                               region, name, region->size);
> 
> In this way, the vfio hook can easily get the VFIOContainer from
> HostIOMMUContext when call in the hook provided by vfio. e.g. the
> one below.
OK, I get it. However, in memory_region_init_io() you also pass the
owner, e.g. the region, so I think you could do the same, no?

Thanks

Eric
> 
> +static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
> +                                           uint32_t min, uint32_t max,
> +                                           uint32_t *pasid)
> +{
> +    VFIOContainer *container = container_of(iommu_ctx,
> +                                            VFIOContainer, iommu_ctx);
>  
> Regards,
> Yi Liu
> 
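
A rough sketch, not the posted patch, of the pointer-based shape discussed
above; the owner back-pointer and the plain (non-QOM) struct are assumptions
made purely for illustration:

    /*
     * Hypothetical sketch only: HostIOMMUContext as a pointer-allocated
     * struct with an explicit owner back-pointer, so the VFIO callbacks
     * need no derived type and no container_of().
     */
    typedef struct HostIOMMUContext {
        uint64_t flags;
        void *owner;                        /* here: the VFIOContainer */
    } HostIOMMUContext;

    static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
                                               uint32_t min, uint32_t max,
                                               uint32_t *pasid)
    {
        VFIOContainer *container = iommu_ctx->owner;

        /* issue the PASID allocation request through container->fd */
        return 0;
    }

    /* and in vfio_init_container(): */
    container->iommu_ctx = g_new0(HostIOMMUContext, 1);
    container->iommu_ctx->owner = container;
    container->iommu_ctx->flags = flags;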



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
  2020-04-06  8:04       ` Liu, Yi L
@ 2020-04-06 10:30         ` Auger Eric
  -1 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-06 10:30 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: pbonzini, mst, david, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm,
	Wu, Hao, jean-philippe, Jacob Pan, Yi Sun

Hi Yi,

On 4/6/20 10:04 AM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric < eric.auger@redhat.com>
>> Sent: Tuesday, March 31, 2020 1:23 AM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
>>
>> Yi,
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> Currently, many platform vendors provide the capability of dual stage
>>> DMA address translation in hardware. For example, nested translation
>>> on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
>>> etc. In dual stage DMA address translation there are two stages of
>>> address translation, stage-1 (a.k.a. first-level) and stage-2 (a.k.a.
>>> second-level) translation structures. Stage-1 translation results are
>>> also subjected to stage-2 translation structures. Take vSVA (Virtual
>>> Shared Virtual Addressing) as an example, guest IOMMU driver owns
>>> stage-1 translation structures (covers GVA->GPA translation), and host
>>> IOMMU driver owns stage-2 translation structures (covers GPA->HPA
>>> translation). VMM is responsible to bind stage-1 translation structures
>>> to host, thus hardware could achieve GVA->GPA and then GPA->HPA
>>> translation. For more background on SVA, refer to the links below.
>>>  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
>>>  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
>>> Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
>>>
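
A purely conceptual sketch of the dual-stage walk described above (not QEMU
code; stage1_walk/stage2_walk stand in for the real page-table walkers):

    /* Conceptual only: resolve a device DMA address under nested
     * translation.  Stage-1 tables are guest-owned, stage-2 tables
     * are host-owned. */
    typedef uint64_t (*walk_fn)(uint64_t addr);

    static uint64_t nested_translate(walk_fn stage1_walk, walk_fn stage2_walk,
                                     uint64_t gva)
    {
        uint64_t gpa = stage1_walk(gva);    /* GVA -> GPA, guest tables */
        uint64_t hpa = stage2_walk(gpa);    /* GPA -> HPA, host tables  */

        return hpa;
    }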
> [...]
>>> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
>>> +                         const char *mrtypename,
>>> +                         uint64_t flags)
>>> +{
>>> +    HostIOMMUContext *iommu_ctx;
>>> +
>>> +    object_initialize(_iommu_ctx, instance_size, mrtypename);
>>> +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
>>> +    iommu_ctx->flags = flags;
>>> +    iommu_ctx->initialized = true;
>>> +}
>>> +
>>> +static const TypeInfo host_iommu_context_info = {
>>> +    .parent             = TYPE_OBJECT,
>>> +    .name               = TYPE_HOST_IOMMU_CONTEXT,
>>> +    .class_size         = sizeof(HostIOMMUContextClass),
>>> +    .instance_size      = sizeof(HostIOMMUContext),
>>> +    .abstract           = true,
>> Can't we use the usual .instance_init and .instance_finalize?
> sorry, I somehow missed this comment. In a prior version of the patch,
> .instance_init was used, but the main init path is now host_iommu_ctx_init(),
> so .instance_init is not really necessary.
> https://www.spinics.net/lists/kvm/msg210878.html

OK, what disturbs me overall is that you introduce a QOM object but the
inheritance scheme is not totally clear to me (only a VFIO-derived type is
created and I do not see what other backend would be able to use it), and
it does not really have the look & feel of standard QOM objects. I tried
to compare its usage/implementation against MemoryRegion, for instance.

Thanks

Eric
> Regards,
> Yi Liu
> 
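
For comparison, a minimal sketch (not the posted patch) of what relying on
QOM's own instance hook would look like; note the caller-provided flags
would still have to be set after object_initialize(), which is part of why
the series keeps an explicit host_iommu_ctx_init():

    static void host_iommu_ctx_instance_init(Object *obj)
    {
        HostIOMMUContext *iommu_ctx = HOST_IOMMU_CONTEXT(obj);

        /* per-instance defaults; flags from the caller still need to
         * be applied separately after object_initialize() */
        iommu_ctx->flags = 0;
        iommu_ctx->initialized = true;
    }

    static const TypeInfo host_iommu_context_info = {
        .parent         = TYPE_OBJECT,
        .name           = TYPE_HOST_IOMMU_CONTEXT,
        .class_size     = sizeof(HostIOMMUContextClass),
        .instance_size  = sizeof(HostIOMMUContext),
        .instance_init  = host_iommu_ctx_instance_init,
        .abstract       = true,
    };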


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
@ 2020-04-06 10:30         ` Auger Eric
  0 siblings, 0 replies; 160+ messages in thread
From: Auger Eric @ 2020-04-06 10:30 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

Hi Yi,

On 4/6/20 10:04 AM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric < eric.auger@redhat.com>
>> Sent: Tuesday, March 31, 2020 1:23 AM
>> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
>> Subject: Re: [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext
>>
>> Yi,
>>
>> On 3/30/20 6:24 AM, Liu Yi L wrote:
>>> Currently, many platform vendors provide the capability of dual stage
>>> DMA address translation in hardware. For example, nested translation
>>> on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
>>> etc. In dual stage DMA address translation there are two stages of
>>> address translation, stage-1 (a.k.a. first-level) and stage-2 (a.k.a.
>>> second-level) translation structures. Stage-1 translation results are
>>> also subjected to stage-2 translation structures. Take vSVA (Virtual
>>> Shared Virtual Addressing) as an example, guest IOMMU driver owns
>>> stage-1 translation structures (covers GVA->GPA translation), and host
>>> IOMMU driver owns stage-2 translation structures (covers GPA->HPA
>>> translation). VMM is responsible to bind stage-1 translation structures
>>> to host, thus hardware could achieve GVA->GPA and then GPA->HPA
>>> translation. For more background on SVA, refer to the links below.
>>>  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
>>>  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
>>> Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
>>>
> [...]
>>> +void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
>>> +                         const char *mrtypename,
>>> +                         uint64_t flags)
>>> +{
>>> +    HostIOMMUContext *iommu_ctx;
>>> +
>>> +    object_initialize(_iommu_ctx, instance_size, mrtypename);
>>> +    iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
>>> +    iommu_ctx->flags = flags;
>>> +    iommu_ctx->initialized = true;
>>> +}
>>> +
>>> +static const TypeInfo host_iommu_context_info = {
>>> +    .parent             = TYPE_OBJECT,
>>> +    .name               = TYPE_HOST_IOMMU_CONTEXT,
>>> +    .class_size         = sizeof(HostIOMMUContextClass),
>>> +    .instance_size      = sizeof(HostIOMMUContext),
>>> +    .abstract           = true,
>> Can't we use the usual .instance_init and .instance_finalize?
> sorry, I somehow missed this comment. In a prior version of the patch,
> .instance_init was used, but the main init path is now host_iommu_ctx_init(),
> so .instance_init is not really necessary.
> https://www.spinics.net/lists/kvm/msg210878.html

OK, what disturbs me overall is that you introduce a QOM object but the
inheritance scheme is not totally clear to me (only a VFIO-derived type is
created and I do not see what other backend would be able to use it), and
it does not really have the look & feel of standard QOM objects. I tried
to compare its usage/implementation against MemoryRegion, for instance.

Thanks

Eric
> Regards,
> Yi Liu
> 



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
  2020-04-04 12:00           ` Liu, Yi L
@ 2020-04-06 19:48             ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-06 19:48 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, alex.williamson, eric.auger, pbonzini, mst, david,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, jean-philippe,
	Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

On Sat, Apr 04, 2020 at 12:00:12PM +0000, Liu, Yi L wrote:
> Hi Peter,
> 
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Saturday, April 4, 2020 12:11 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache
> > invalidation
> > 
> > On Fri, Apr 03, 2020 at 03:21:10PM +0000, Liu, Yi L wrote:
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Friday, April 3, 2020 10:46 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context
> > cache
> > > > invalidation
> > > >
> > > > On Sun, Mar 29, 2020 at 09:24:55PM -0700, Liu Yi L wrote:
> > > > > This patch replays guest pasid bindings after context cache
> > > > > invalidation. This is done for safety: strictly speaking, the
> > > > > programmer should issue a pasid cache invalidation of proper
> > > > > granularity after issuing a context cache invalidation.
> > > > >
> > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > ---
> > > > >  hw/i386/intel_iommu.c          | 51
> > > > ++++++++++++++++++++++++++++++++++++++++++
> > > > >  hw/i386/intel_iommu_internal.h |  6 ++++-
> > > > >  hw/i386/trace-events           |  1 +
> > > > >  3 files changed, 57 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > index d87f608..883aeac 100644
> > > > > --- a/hw/i386/intel_iommu.c
> > > > > +++ b/hw/i386/intel_iommu.c
> > > > > @@ -68,6 +68,10 @@ static void
> > > > vtd_address_space_refresh_all(IntelIOMMUState *s);
> > > > >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> > *n);
> > > > >
> > > > >  static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > > > > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > > > > +                                 VTDPASIDCacheInfo *pc_info);
> > > > > +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> > > > > +                                  VTDBus *vtd_bus, uint16_t devfn);
> > > > >
> > > > >  static void vtd_panic_require_caching_mode(void)
> > > > >  {
> > > > > @@ -1853,7 +1857,10 @@ static void
> > vtd_iommu_replay_all(IntelIOMMUState
> > > > *s)
> > > > >
> > > > >  static void vtd_context_global_invalidate(IntelIOMMUState *s)
> > > > >  {
> > > > > +    VTDPASIDCacheInfo pc_info;
> > > > > +
> > > > >      trace_vtd_inv_desc_cc_global();
> > > > > +
> > > > >      /* Protects context cache */
> > > > >      vtd_iommu_lock(s);
> > > > >      s->context_cache_gen++;
> > > > > @@ -1870,6 +1877,9 @@ static void
> > > > vtd_context_global_invalidate(IntelIOMMUState *s)
> > > > >       * VT-d emulation codes.
> > > > >       */
> > > > >      vtd_iommu_replay_all(s);
> > > > > +
> > > > > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > > > > +    vtd_pasid_cache_sync(s, &pc_info);
> > > > >  }
> > > > >
> > > > >  /**
> > > > > @@ -2005,6 +2015,22 @@ static void
> > > > vtd_context_device_invalidate(IntelIOMMUState *s,
> > > > >                   * happened.
> > > > >                   */
> > > > >                  vtd_sync_shadow_page_table(vtd_as);
> > > > > +                /*
> > > > > +                 * Per spec, context flush should also
> > > > > followed with PASID
> > > > > +                 * cache and iotlb flush. Regards to
> > > > > a device selective
> > > > > +                 * context cache invalidation:
> > > >
> > > > If context entry flush should also follow another pasid cache flush,
> > > > then this is still needed?  Shouldn't the pasid flush do the same
> > > > thing again?
> > >
> > > yes, but what if the guest software fails to follow it? The same thing
> > > will be done again when the pasid cache flush comes. But this only
> > > happens for the rid2pasid case (the IOVA page table).
> > 
> > Do you mean it will not happen when nested page table is used (so it's
> > required for nested tables)?
> 
> no, with the IOVA page table case I just wanted to confirm that the duplicate
> replay is real. But it is not the "only" case. :-) my bad. Any scalable mode
> context entry modification will result in a duplicate replay, as this patch
> enforces a pasid replay after context cache invalidation. For normal guest
> SVM usage, however, there is no such duplicate work since it only modifies
> the pasid entry.
> 
> > Yeah we can keep them for safe no matter what; at least I'm fine with
> > it (I believe most of the code we're discussing is not fast path).
> > Just want to be sure of it since if it's definitely duplicated then we
> > can instead drop it.
> 
> yes, it is not a fast path. BTW, I guess the iova shadow sync follows
> the same notion, right?

Yes, I remember we have similar things, and the same applies there: if we
can confirm that it'll be duplicated then I think we should remove that
too.  But feel free to ignore this question for now and keep it.  The
comment explaining that would be helpful, as you already did.  Thanks,

-- 
Peter Xu
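
For what it's worth, the kind of call-site comment being asked for might
read like this (the wording is illustrative, not from the posted patch),
placed above the replay in vtd_context_global_invalidate():

    /*
     * Per spec, a context cache invalidation should be followed by a
     * PASID cache invalidation of proper granularity.  Replay the guest
     * pasid bindings here anyway so that a guest which skips that
     * follow-up flush still leaves the host with a consistent view; when
     * the guest does flush, this work is simply repeated (not a fast path).
     */
    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
    vtd_pasid_cache_sync(s, &pc_info);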


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation
@ 2020-04-06 19:48             ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2020-04-06 19:48 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, Eduardo Habkost,
	kvm, mst, Tian, Jun J, qemu-devel, eric.auger, alex.williamson,
	pbonzini, Wu, Hao, Sun, Yi Y, Richard Henderson, david

On Sat, Apr 04, 2020 at 12:00:12PM +0000, Liu, Yi L wrote:
> Hi Peter,
> 
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Saturday, April 4, 2020 12:11 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache
> > invalidation
> > 
> > On Fri, Apr 03, 2020 at 03:21:10PM +0000, Liu, Yi L wrote:
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Sent: Friday, April 3, 2020 10:46 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [PATCH v2 16/22] intel_iommu: replay pasid binds after context
> > cache
> > > > invalidation
> > > >
> > > > On Sun, Mar 29, 2020 at 09:24:55PM -0700, Liu Yi L wrote:
> > > > > This patch replays guest pasid bindings after context cache
> > > > > invalidation. This is done for safety: strictly speaking, the
> > > > > programmer should issue a pasid cache invalidation of proper
> > > > > granularity after issuing a context cache invalidation.
> > > > >
> > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > ---
> > > > >  hw/i386/intel_iommu.c          | 51
> > > > ++++++++++++++++++++++++++++++++++++++++++
> > > > >  hw/i386/intel_iommu_internal.h |  6 ++++-
> > > > >  hw/i386/trace-events           |  1 +
> > > > >  3 files changed, 57 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > index d87f608..883aeac 100644
> > > > > --- a/hw/i386/intel_iommu.c
> > > > > +++ b/hw/i386/intel_iommu.c
> > > > > @@ -68,6 +68,10 @@ static void
> > > > vtd_address_space_refresh_all(IntelIOMMUState *s);
> > > > >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> > *n);
> > > > >
> > > > >  static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > > > > +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> > > > > +                                 VTDPASIDCacheInfo *pc_info);
> > > > > +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> > > > > +                                  VTDBus *vtd_bus, uint16_t devfn);
> > > > >
> > > > >  static void vtd_panic_require_caching_mode(void)
> > > > >  {
> > > > > @@ -1853,7 +1857,10 @@ static void
> > vtd_iommu_replay_all(IntelIOMMUState
> > > > *s)
> > > > >
> > > > >  static void vtd_context_global_invalidate(IntelIOMMUState *s)
> > > > >  {
> > > > > +    VTDPASIDCacheInfo pc_info;
> > > > > +
> > > > >      trace_vtd_inv_desc_cc_global();
> > > > > +
> > > > >      /* Protects context cache */
> > > > >      vtd_iommu_lock(s);
> > > > >      s->context_cache_gen++;
> > > > > @@ -1870,6 +1877,9 @@ static void
> > > > vtd_context_global_invalidate(IntelIOMMUState *s)
> > > > >       * VT-d emulation codes.
> > > > >       */
> > > > >      vtd_iommu_replay_all(s);
> > > > > +
> > > > > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > > > > +    vtd_pasid_cache_sync(s, &pc_info);
> > > > >  }
> > > > >
> > > > >  /**
> > > > > @@ -2005,6 +2015,22 @@ static void
> > > > vtd_context_device_invalidate(IntelIOMMUState *s,
> > > > >                   * happened.
> > > > >                   */
> > > > >                  vtd_sync_shadow_page_table(vtd_as);
> > > > > +                /*
> > > > > +                 * Per spec, context flush should also
> > > > > followed with PASID
> > > > > +                 * cache and iotlb flush. Regards to
> > > > > a device selective
> > > > > +                 * context cache invalidation:
> > > >
> > > > If context entry flush should also follow another pasid cache flush,
> > > > then this is still needed?  Shouldn't the pasid flush do the same
> > > > thing again?
> > >
> > > yes, but what if the guest software fails to follow it? The same thing
> > > will be done again when the pasid cache flush comes. But this only
> > > happens for the rid2pasid case (the IOVA page table).
> > 
> > Do you mean it will not happen when nested page table is used (so it's
> > required for nested tables)?
> 
> no, with the IOVA page table case I just wanted to confirm that the duplicate
> replay is real. But it is not the "only" case. :-) my bad. Any scalable mode
> context entry modification will result in a duplicate replay, as this patch
> enforces a pasid replay after context cache invalidation. For normal guest
> SVM usage, however, there is no such duplicate work since it only modifies
> the pasid entry.
> 
> > Yeah we can keep them for safe no matter what; at least I'm fine with
> > it (I believe most of the code we're discussing is not fast path).
> > Just want to be sure of it since if it's definitely duplicated then we
> > can instead drop it.
> 
> yes, it is not a fast path. BTW, I guess the iova shadow sync follows
> the same notion, right?

Yes, I remember we have similar things, and the same applies there: if we
can confirm that it'll be duplicated then I think we should remove that
too.  But feel free to ignore this question for now and keep it.  The
comment explaining that would be helpful, as you already did.  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
  2020-04-06 10:20         ` Auger Eric
@ 2020-04-07 11:59           ` Liu, Yi L
  -1 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-07 11:59 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, Wu, Hao, david

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Monday, April 6, 2020 6:20 PM
> Subject: Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
> 
> Hi Yi,
> 
> On 4/6/20 9:12 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric <eric.auger@redhat.com>
> >> Sent: Wednesday, April 1, 2020 3:51 PM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> >> Subject: Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext
> >> per-container
> >>
> >> Hi Yi,
> >>
> >> On 3/30/20 6:24 AM, Liu Yi L wrote:
> >>> In this patch, QEMU first gets the iommu info from the kernel to check the
> >>> capabilities supported by a VFIO_IOMMU_TYPE1_NESTING iommu, and then
> >>> initializes a HostIOMMUContext instance.
> >>>
> >>> Cc: Kevin Tian <kevin.tian@intel.com>
> >>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Cc: Peter Xu <peterx@redhat.com>
> >>> Cc: Eric Auger <eric.auger@redhat.com>
> >>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> >>> Cc: David Gibson <david@gibson.dropbear.id.au>
> >>> Cc: Alex Williamson <alex.williamson@redhat.com>
> >>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >>> ---
> >>>  hw/vfio/common.c | 99
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 99 insertions(+)
> >>>
> >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c index
> >>> 5f3534d..44b142c 100644
> >>> --- a/hw/vfio/common.c
> >>> +++ b/hw/vfio/common.c
> >>> @@ -1226,10 +1226,89 @@ static int
> >> vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
> >>>      return 0;
> >>>  }
> >>>
> >>> +/**
> >>> + * Get iommu info from host. On success, the caller is responsible for
> >>> + * freeing the memory returned through @info once it is no longer
> >>> + * needed.
> >>> + */
> >>> +static int vfio_get_iommu_info(VFIOContainer *container,
> >>> +                         struct vfio_iommu_type1_info **info) {
> >>> +
> >>> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> >>> +
> >>> +    *info = g_malloc0(argsz);
> >>> +
> >>> +retry:
> >>> +    (*info)->argsz = argsz;
> >>> +
> >>> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> >>> +        g_free(*info);
> >>> +        *info = NULL;
> >>> +        return -errno;
> >>> +    }
> >>> +
> >>> +    if (((*info)->argsz > argsz)) {
> >>> +        argsz = (*info)->argsz;
> >>> +        *info = g_realloc(*info, argsz);
> >>> +        goto retry;
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static struct vfio_info_cap_header * vfio_get_iommu_info_cap(struct
> >>> +vfio_iommu_type1_info *info, uint16_t
> >>> +id) {
> >>> +    struct vfio_info_cap_header *hdr;
> >>> +    void *ptr = info;
> >>> +
> >>> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> >>> +        if (hdr->id == id) {
> >>> +            return hdr;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    return NULL;
> >>> +}
> >>> +
> >>> +static int vfio_get_nesting_iommu_cap(VFIOContainer *container,
> >>> +                   struct vfio_iommu_type1_info_cap_nesting
> >>> +*cap_nesting) {
> >>> +    struct vfio_iommu_type1_info *info;
> >>> +    struct vfio_info_cap_header *hdr;
> >>> +    struct vfio_iommu_type1_info_cap_nesting *cap;
> >>> +    int ret;
> >>> +
> >>> +    ret = vfio_get_iommu_info(container, &info);
> >>> +    if (ret) {
> >>> +        return ret;
> >>> +    }
> >>> +
> >>> +    hdr = vfio_get_iommu_info_cap(info,
> >>> +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
> >>> +    if (!hdr) {
> >>> +        g_free(info);
> >>> +        return -errno;
> >>> +    }
> >>> +
> >>> +    cap = container_of(hdr,
> >>> +                struct vfio_iommu_type1_info_cap_nesting, header);
> >>> +    *cap_nesting = *cap;
> >>> +
> >>> +    g_free(info);
> >>> +    return 0;
> >>> +}
> >>> +
> >>>  static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>>                                 Error **errp)  {
> >>>      int iommu_type, ret;
> >>> +    uint64_t flags = 0;
> >>>
> >>>      iommu_type = vfio_get_iommu_type(container, errp);
> >>>      if (iommu_type < 0) {
> >>> @@ -1257,6 +1336,26 @@ static int vfio_init_container(VFIOContainer
> >> *container, int group_fd,
> >>>          return -errno;
> >>>      }
> >>>
> >>> +    if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> >>> +        struct vfio_iommu_type1_info_cap_nesting nesting = {
> >>> +                                         .nesting_capabilities = 0x0,
> >>> +                                         .stage1_formats = 0, };
> >>> +
> >>> +        ret = vfio_get_nesting_iommu_cap(container, &nesting);
> >>> +        if (ret) {
> >>> +            error_setg_errno(errp, -ret,
> >>> +                             "Failed to get nesting iommu cap");
> >>> +            return ret;
> >>> +        }
> >>> +
> >>> +        flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
> >>> +                 HOST_IOMMU_PASID_REQUEST : 0;
> >> I still don't get why you can't transform your iommu_ctx into a
> >> pointer and do
> >> container->iommu_ctx = g_new0(HostIOMMUContext, 1);
> >> then
> >> host_iommu_ctx_init(container->iommu_ctx, flags);
> >>
> >> looks something similar to (hw/vfio/common.c). You may not even need
> >> to use a derived VFIOHostIOMMUContext object (As only VFIO does use
> >> that object)? Only the ops do change, no new field?
> >>         region->mem = g_new0(MemoryRegion, 1);
> >>         memory_region_init_io(region->mem, obj, &vfio_region_ops,
> >>                               region, name, region->size);
> >
> > In this way, the vfio hook can easily get the VFIOContainer from
> > HostIOMMUContext when call in the hook provided by vfio. e.g. the one
> > below.
> OK I get it. However in memory_region_init_io(), you also pass the owner, eg.
> region so I think you could do the same. no?
Hmm, I can add it. But I've no idea about the proper owner for it so far.
any suggestion?

Regards,
Yi Liu
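
Just as a data point on the open owner question (not a recommendation from
the thread): memory_region_init_io() itself accepts a NULL owner, and many
callers use exactly that, e.g. something like the sketch below, where the
device state, ops table and region name are hypothetical. Whether the same
would be acceptable for a HostIOMMUContext is the question being discussed.

    /* Common QEMU pattern: a MemoryRegion with no explicit QOM owner;
     * the opaque pointer still lets the ops find their device state. */
    memory_region_init_io(&s->iomem, NULL, &some_device_ops,
                          s, "some-device-mmio", 0x1000);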

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
@ 2020-04-07 11:59           ` Liu, Yi L
  0 siblings, 0 replies; 160+ messages in thread
From: Liu, Yi L @ 2020-04-07 11:59 UTC (permalink / raw)
  To: Auger Eric, qemu-devel, alex.williamson, peterx
  Cc: jean-philippe, Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian,
	Jun J, Sun, Yi Y, pbonzini, david, Wu, Hao

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Monday, April 6, 2020 6:20 PM
> Subject: Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container
> 
> Hi Yi,
> 
> On 4/6/20 9:12 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Auger Eric <eric.auger@redhat.com>
> >> Sent: Wednesday, April 1, 2020 3:51 PM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; qemu-devel@nongnu.org;
> >> Subject: Re: [PATCH v2 09/22] vfio/common: init HostIOMMUContext
> >> per-container
> >>
> >> Hi Yi,
> >>
> >> On 3/30/20 6:24 AM, Liu Yi L wrote:
> >>> In this patch, QEMU first gets the iommu info from the kernel to check the
> >>> capabilities supported by a VFIO_IOMMU_TYPE1_NESTING iommu, and then
> >>> initializes a HostIOMMUContext instance.
> >>>
> >>> Cc: Kevin Tian <kevin.tian@intel.com>
> >>> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>> Cc: Peter Xu <peterx@redhat.com>
> >>> Cc: Eric Auger <eric.auger@redhat.com>
> >>> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> >>> Cc: David Gibson <david@gibson.dropbear.id.au>
> >>> Cc: Alex Williamson <alex.williamson@redhat.com>
> >>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> >>> ---
> >>>  hw/vfio/common.c | 99
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 99 insertions(+)
> >>>
> >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c index
> >>> 5f3534d..44b142c 100644
> >>> --- a/hw/vfio/common.c
> >>> +++ b/hw/vfio/common.c
> >>> @@ -1226,10 +1226,89 @@ static int
> >> vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
> >>>      return 0;
> >>>  }
> >>>
> >>> +/**
> >>> + * Get iommu info from host. On success, the caller is responsible for
> >>> + * freeing the memory returned through @info once it is no longer
> >>> + * needed.
> >>> + */
> >>> +static int vfio_get_iommu_info(VFIOContainer *container,
> >>> +                         struct vfio_iommu_type1_info **info) {
> >>> +
> >>> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> >>> +
> >>> +    *info = g_malloc0(argsz);
> >>> +
> >>> +retry:
> >>> +    (*info)->argsz = argsz;
> >>> +
> >>> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> >>> +        g_free(*info);
> >>> +        *info = NULL;
> >>> +        return -errno;
> >>> +    }
> >>> +
> >>> +    if (((*info)->argsz > argsz)) {
> >>> +        argsz = (*info)->argsz;
> >>> +        *info = g_realloc(*info, argsz);
> >>> +        goto retry;
> >>> +    }
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static struct vfio_info_cap_header * vfio_get_iommu_info_cap(struct
> >>> +vfio_iommu_type1_info *info, uint16_t
> >>> +id) {
> >>> +    struct vfio_info_cap_header *hdr;
> >>> +    void *ptr = info;
> >>> +
> >>> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> >>> +        if (hdr->id == id) {
> >>> +            return hdr;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    return NULL;
> >>> +}
> >>> +
> >>> +static int vfio_get_nesting_iommu_cap(VFIOContainer *container,
> >>> +                   struct vfio_iommu_type1_info_cap_nesting
> >>> +*cap_nesting) {
> >>> +    struct vfio_iommu_type1_info *info;
> >>> +    struct vfio_info_cap_header *hdr;
> >>> +    struct vfio_iommu_type1_info_cap_nesting *cap;
> >>> +    int ret;
> >>> +
> >>> +    ret = vfio_get_iommu_info(container, &info);
> >>> +    if (ret) {
> >>> +        return ret;
> >>> +    }
> >>> +
> >>> +    hdr = vfio_get_iommu_info_cap(info,
> >>> +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
> >>> +    if (!hdr) {
> >>> +        g_free(info);
> >>> +        return -errno;
> >>> +    }
> >>> +
> >>> +    cap = container_of(hdr,
> >>> +                struct vfio_iommu_type1_info_cap_nesting, header);
> >>> +    *cap_nesting = *cap;
> >>> +
> >>> +    g_free(info);
> >>> +    return 0;
> >>> +}
> >>> +
> >>>  static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>>                                 Error **errp)  {
> >>>      int iommu_type, ret;
> >>> +    uint64_t flags = 0;
> >>>
> >>>      iommu_type = vfio_get_iommu_type(container, errp);
> >>>      if (iommu_type < 0) {
> >>> @@ -1257,6 +1336,26 @@ static int vfio_init_container(VFIOContainer
> >> *container, int group_fd,
> >>>          return -errno;
> >>>      }
> >>>
> >>> +    if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> >>> +        struct vfio_iommu_type1_info_cap_nesting nesting = {
> >>> +                                         .nesting_capabilities = 0x0,
> >>> +                                         .stage1_formats = 0, };
> >>> +
> >>> +        ret = vfio_get_nesting_iommu_cap(container, &nesting);
> >>> +        if (ret) {
> >>> +            error_setg_errno(errp, -ret,
> >>> +                             "Failed to get nesting iommu cap");
> >>> +            return ret;
> >>> +        }
> >>> +
> >>> +        flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
> >>> +                 HOST_IOMMU_PASID_REQUEST : 0;
> >> I still don't get why you can't transform your iommu_ctx into a
> >> pointer and do
> >> container->iommu_ctx = g_new0(HostIOMMUContext, 1);
> >> then
> >> host_iommu_ctx_init(container->iommu_ctx, flags);
> >>
> >> looks something similar to (hw/vfio/common.c). You may not even need
> >> to use a derived VFIOHostIOMMUContext object (As only VFIO does use
> >> that object)? Only the ops do change, no new field?
> >>         region->mem = g_new0(MemoryRegion, 1);
> >>         memory_region_init_io(region->mem, obj, &vfio_region_ops,
> >>                               region, name, region->size);
> >
> > In this way, the vfio hook can easily get the VFIOContainer from
> > HostIOMMUContext when call in the hook provided by vfio. e.g. the one
> > below.
> OK I get it. However in memory_region_init_io(), you also pass the owner, eg.
> region so I think you could do the same. no?
Hmm, I can add it. But I've no idea about the proper owner for it so far.
any suggestion?

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 160+ messages in thread

end of thread, other threads:[~2020-04-07 12:00 UTC | newest]

Thread overview: 160+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-30  4:24 [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs Liu Yi L
2020-03-30  4:24 ` Liu Yi L
2020-03-30  4:24 ` [PATCH v2 01/22] scripts/update-linux-headers: Import iommu.h Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  4:24 ` [PATCH v2 02/22] header file update VFIO/IOMMU vSVA APIs Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  4:24 ` [PATCH v2 03/22] vfio: check VFIO_TYPE1_NESTING_IOMMU support Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  9:36   ` Auger Eric
2020-03-30  9:36     ` Auger Eric
2020-03-31  6:08     ` Liu, Yi L
2020-03-31  6:08       ` Liu, Yi L
2020-03-30  4:24 ` [PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30 17:22   ` Auger Eric
2020-03-30 17:22     ` Auger Eric
2020-03-31  4:10     ` Liu, Yi L
2020-03-31  4:10       ` Liu, Yi L
2020-03-31  7:47       ` Auger Eric
2020-03-31  7:47         ` Auger Eric
2020-03-31 12:43         ` Liu, Yi L
2020-03-31 12:43           ` Liu, Yi L
2020-04-06  8:04     ` Liu, Yi L
2020-04-06  8:04       ` Liu, Yi L
2020-04-06 10:30       ` Auger Eric
2020-04-06 10:30         ` Auger Eric
2020-03-30  4:24 ` [PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30 11:02   ` Auger Eric
2020-03-30 11:02     ` Auger Eric
2020-04-02  8:52     ` Liu, Yi L
2020-04-02  8:52       ` Liu, Yi L
2020-04-02 12:41       ` Auger Eric
2020-04-02 12:41         ` Auger Eric
2020-04-02 13:37         ` Liu, Yi L
2020-04-02 13:37           ` Liu, Yi L
2020-04-02 13:49           ` Auger Eric
2020-04-02 13:49             ` Auger Eric
2020-04-06  6:27             ` Liu, Yi L
2020-04-06  6:27               ` Liu, Yi L
2020-04-06 10:04               ` Auger Eric
2020-04-06 10:04                 ` Auger Eric
2020-03-30  4:24 ` [PATCH v2 06/22] hw/pci: introduce pci_device_set/unset_iommu_context() Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30 17:30   ` Auger Eric
2020-03-30 17:30     ` Auger Eric
2020-03-31 12:14     ` Liu, Yi L
2020-03-31 12:14       ` Liu, Yi L
2020-03-30  4:24 ` [PATCH v2 07/22] intel_iommu: add set/unset_iommu_context callback Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30 20:23   ` Auger Eric
2020-03-30 20:23     ` Auger Eric
2020-03-31 12:25     ` Liu, Yi L
2020-03-31 12:25       ` Liu, Yi L
2020-03-31 12:57       ` Auger Eric
2020-03-31 12:57         ` Auger Eric
2020-03-30  4:24 ` [PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-31 10:47   ` Auger Eric
2020-03-31 10:47     ` Auger Eric
2020-03-31 10:59     ` Liu, Yi L
2020-03-31 10:59       ` Liu, Yi L
2020-03-31 11:15       ` Auger Eric
2020-03-31 11:15         ` Auger Eric
2020-03-31 12:54         ` Liu, Yi L
2020-03-31 12:54           ` Liu, Yi L
2020-03-30  4:24 ` [PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-04-01  7:50   ` Auger Eric
2020-04-01  7:50     ` Auger Eric
2020-04-06  7:12     ` Liu, Yi L
2020-04-06  7:12       ` Liu, Yi L
2020-04-06 10:20       ` Auger Eric
2020-04-06 10:20         ` Auger Eric
2020-04-07 11:59         ` Liu, Yi L
2020-04-07 11:59           ` Liu, Yi L
2020-03-30  4:24 ` [PATCH v2 10/22] vfio/pci: set host iommu context to vIOMMU Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-31 14:30   ` Auger Eric
2020-03-31 14:30     ` Auger Eric
2020-04-01  3:20     ` Liu, Yi L
2020-04-01  3:20       ` Liu, Yi L
2020-03-30  4:24 ` [PATCH v2 11/22] intel_iommu: add virtual command capability support Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  4:24 ` [PATCH v2 12/22] intel_iommu: process PASID cache invalidation Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  4:24 ` [PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-04-02  0:02   ` Peter Xu
2020-04-02  0:02     ` Peter Xu
2020-04-02  6:46     ` Liu, Yi L
2020-04-02  6:46       ` Liu, Yi L
2020-04-02 13:44       ` Peter Xu
2020-04-02 13:44         ` Peter Xu
2020-04-03 15:05         ` Liu, Yi L
2020-04-03 15:05           ` Liu, Yi L
2020-04-03 16:19           ` Peter Xu
2020-04-03 16:19             ` Peter Xu
2020-04-04 11:39             ` Liu, Yi L
2020-04-04 11:39               ` Liu, Yi L
2020-03-30  4:24 ` [PATCH v2 14/22] vfio: add bind stage-1 page table support Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  4:24 ` [PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-04-02 18:09   ` Peter Xu
2020-04-02 18:09     ` Peter Xu
2020-04-03 14:29     ` Liu, Yi L
2020-04-03 14:29       ` Liu, Yi L
2020-03-30  4:24 ` [PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-04-03 14:45   ` Peter Xu
2020-04-03 14:45     ` Peter Xu
2020-04-03 15:21     ` Liu, Yi L
2020-04-03 15:21       ` Liu, Yi L
2020-04-03 16:11       ` Peter Xu
2020-04-03 16:11         ` Peter Xu
2020-04-04 12:00         ` Liu, Yi L
2020-04-04 12:00           ` Liu, Yi L
2020-04-06 19:48           ` Peter Xu
2020-04-06 19:48             ` Peter Xu
2020-03-30  4:24 ` [PATCH v2 17/22] intel_iommu: do not pass down pasid bind for PASID #0 Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  4:24 ` [PATCH v2 18/22] vfio: add support for flush iommu stage-1 cache Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  4:24 ` [PATCH v2 19/22] intel_iommu: process PASID-based iotlb invalidation Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-04-03 14:47   ` Peter Xu
2020-04-03 14:47     ` Peter Xu
2020-04-03 15:21     ` Liu, Yi L
2020-04-03 15:21       ` Liu, Yi L
2020-03-30  4:24 ` [PATCH v2 20/22] intel_iommu: propagate PASID-based iotlb invalidation to host Liu Yi L
2020-03-30  4:24   ` Liu Yi L
2020-03-30  4:25 ` [PATCH v2 21/22] intel_iommu: process PASID-based Device-TLB invalidation Liu Yi L
2020-03-30  4:25   ` Liu Yi L
2020-03-30  4:25 ` [PATCH v2 22/22] intel_iommu: modify x-scalable-mode to be string option Liu Yi L
2020-03-30  4:25   ` Liu Yi L
2020-04-03 14:49   ` Peter Xu
2020-04-03 14:49     ` Peter Xu
2020-04-03 15:22     ` Liu, Yi L
2020-04-03 15:22       ` Liu, Yi L
2020-03-30  5:40 ` [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs no-reply
2020-03-30  5:40   ` no-reply
2020-03-30 10:36 ` Auger Eric
2020-03-30 10:36   ` Auger Eric
2020-03-30 14:46   ` Peter Xu
2020-03-30 14:46     ` Peter Xu
2020-03-31  6:53     ` Liu, Yi L
2020-03-31  6:53       ` Liu, Yi L
2020-04-02  8:33 ` Jason Wang
2020-04-02  8:33   ` Jason Wang
2020-04-02 13:46   ` Peter Xu
2020-04-02 13:46     ` Peter Xu
2020-04-03  1:38     ` Jason Wang
2020-04-03  1:38       ` Jason Wang
2020-04-03 14:20     ` Liu, Yi L
2020-04-03 14:20       ` Liu, Yi L
2020-04-02 18:12 ` Peter Xu
2020-04-02 18:12   ` Peter Xu
2020-04-03 14:32   ` Liu, Yi L
2020-04-03 14:32     ` Liu, Yi L
