* [RFC v3 00/25] intel_iommu: expose Shared Virtual Addressing to VMs
From: Liu, Yi L @ 2020-01-29 12:16 UTC
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm, hao.wu

Shared Virtual Addressing (SVA), a.k.a. Shared Virtual Memory (SVM) on
Intel platforms, allows address space sharing between device DMA and
applications. SVA can reduce programming complexity and enhance security.

This QEMU series is intended to expose SVA usage to VMs, i.e. sharing
guest application address space with passthrough devices. This is called
vSVA in this series. Enabling vSVA as a whole requires QEMU, VFIO and IOMMU
changes. The IOMMU and VFIO changes are in separate series (listed under
"Related series" below).

The high-level architecture for SVA virtualization is shown below. The key
design of vSVA support is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability of the host IOMMU.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables
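
As a rough illustration of the nested translation in the diagram, the toy
model below chains two lookups: the FL table maps GVA to GPA and the SL table
maps GPA to HPA. Everything in it (ToyTable, toy_walk(), the flat one-level
tables and the 4 KiB page size) is an assumption made for the sketch and not
code from this series. Real VT-d page tables are multi-level, and the FL table
pointers themselves are GPAs that are also translated through SL, which the
sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_NPAGES     4
#define TOY_INVALID    UINT64_MAX

/* Hypothetical flat table: index is the input page number, value the output one. */
typedef struct ToyTable {
    uint64_t pfn[TOY_NPAGES];
} ToyTable;

/* Translate an address through a single stage; TOY_INVALID means "fault". */
static uint64_t toy_walk(const ToyTable *t, uint64_t addr)
{
    uint64_t vpn = addr >> TOY_PAGE_SHIFT;

    assert(t != NULL);
    if (vpn >= TOY_NPAGES || t->pfn[vpn] == TOY_INVALID) {
        return TOY_INVALID;
    }
    return (t->pfn[vpn] << TOY_PAGE_SHIFT) | (addr & TOY_PAGE_MASK);
}

/* Nested translation: FL gives GVA -> GPA, then SL gives GPA -> HPA. */
static uint64_t toy_nested_translate(const ToyTable *fl, const ToyTable *sl,
                                     uint64_t gva)
{
    uint64_t gpa = toy_walk(fl, gva);

    if (gpa == TOY_INVALID) {
        return TOY_INVALID;
    }
    return toy_walk(sl, gpa);
}

/* Example tables: the guest owns fl, the host owns sl. */
static uint64_t toy_demo(uint64_t gva)
{
    const ToyTable fl = { .pfn = { 2, TOY_INVALID, 0, TOY_INVALID } };
    const ToyTable sl = { .pfn = { 3, TOY_INVALID, 1, TOY_INVALID } };

    return toy_nested_translate(&fl, &sl, gva);
}
```

With these tables, toy_demo(0x0abc) goes through GPA 0x2abc and ends at
HPA 0x1abc, while a GVA that has no FL entry faults before SL is consulted.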

The complete vSVA kernel upstream patches are divided into three phases:
    1. Common APIs and PCI device direct assignment
    2. IOMMU-backed Mediated Device assignment
    3. Page Request Services (PRS) support

This QEMU RFC patchset targets phase 1 and phase 2, and works
together with the VT-d driver changes[1] and VFIO changes[2].

Related series:
[1] [PATCH V9 00/10] Nested Shared Virtual Address (SVA) VT-d support:
    https://lkml.org/lkml/2020/1/29/37
    [PATCH 0/3] IOMMU user API enhancement:
    https://lkml.org/lkml/2020/1/29/45

[2] [RFC v3 0/8] vfio: expose virtual Shared Virtual Addressing to VMs
    https://lkml.org/lkml/2020/1/29/255

There are roughly four parts:
 1. Modify pci_setup_iommu() to set PCIIOMMUOps instead of PCIIOMMUFunc.
 2. Introduce DualStageIOMMUObject as an abstraction of the host IOMMU. It
    provides methods for vIOMMU emulators to communicate with the host IOMMU,
    e.g. propagating guest page table bindings to the host IOMMU to set up
    dual-stage DMA translation there, and flushing the IOMMU IOTLB.
 3. Introduce IOMMUContext as an abstraction of the vIOMMU. It provides
    operations for VFIO to communicate with vIOMMU emulators, e.g. making
    vIOMMU emulators aware of the host IOMMU's dual-stage translation
    capability by registering DualStageIOMMUObject instances with them.
 4. Set up dual-stage IOMMU translation for the Intel vIOMMU. This includes:
    - Checking IOMMU uAPI version compatibility and hardware compatibility,
      in preparation for setting up dual-stage DMA translation in the host
      IOMMU.
    - Propagating guest PASID allocation and free requests to the host.
    - Propagating guest page table bindings to the host to set up dual-stage
      IOMMU DMA translation in the host IOMMU.
    - Propagating guest IOMMU cache invalidations to the host to ensure IOTLB
      correctness.
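
To make the split between the vIOMMU emulator and the host IOMMU backend more
concrete, here is a minimal sketch of an ops table in the spirit of
DualStageIOMMUObject. Only the three operation names (pasid_alloc,
bind_guest_page_table, iommu_cache_flush) come from this cover letter. The
Sketch* types, the callback signatures and the mock backend are hypothetical
and will differ from the real dual_stage_iommu.h in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct SketchHostIOMMU SketchHostIOMMU;

/* Hypothetical callback table; signatures are assumptions for illustration. */
typedef struct SketchHostIOMMUOps {
    int (*pasid_alloc)(SketchHostIOMMU *h, uint32_t min, uint32_t max,
                       uint32_t *pasid);
    int (*bind_guest_page_table)(SketchHostIOMMU *h, uint32_t pasid,
                                 uint64_t fl_root_gpa);
    int (*iommu_cache_flush)(SketchHostIOMMU *h, uint32_t pasid);
} SketchHostIOMMUOps;

struct SketchHostIOMMU {
    const SketchHostIOMMUOps *ops;
    uint32_t next_pasid;    /* mock allocator state */
    uint64_t bound_root;    /* last FL root (a GPA) bound by the guest */
    unsigned flushes;       /* number of flush requests seen */
};

/* Wrappers a vIOMMU emulator would call; they fail if the op is missing. */
static int sketch_pasid_alloc(SketchHostIOMMU *h, uint32_t min, uint32_t max,
                              uint32_t *pasid)
{
    assert(pasid != NULL);
    if (!h || !h->ops || !h->ops->pasid_alloc) {
        return -1;
    }
    return h->ops->pasid_alloc(h, min, max, pasid);
}

static int sketch_bind(SketchHostIOMMU *h, uint32_t pasid, uint64_t root)
{
    if (!h || !h->ops || !h->ops->bind_guest_page_table) {
        return -1;
    }
    return h->ops->bind_guest_page_table(h, pasid, root);
}

static int sketch_flush(SketchHostIOMMU *h, uint32_t pasid)
{
    if (!h || !h->ops || !h->ops->iommu_cache_flush) {
        return -1;
    }
    return h->ops->iommu_cache_flush(h, pasid);
}

/* Mock backend standing in for the pass-through module (VFIO in the series). */
static int mock_pasid_alloc(SketchHostIOMMU *h, uint32_t min, uint32_t max,
                            uint32_t *pasid)
{
    if (h->next_pasid < min) {
        h->next_pasid = min;
    }
    if (h->next_pasid > max) {
        return -1;
    }
    *pasid = h->next_pasid++;
    return 0;
}

static int mock_bind(SketchHostIOMMU *h, uint32_t pasid, uint64_t root)
{
    (void)pasid;
    h->bound_root = root;
    return 0;
}

static int mock_flush(SketchHostIOMMU *h, uint32_t pasid)
{
    (void)pasid;
    h->flushes++;
    return 0;
}

static const SketchHostIOMMUOps mock_ops = {
    .pasid_alloc = mock_pasid_alloc,
    .bind_guest_page_table = mock_bind,
    .iommu_cache_flush = mock_flush,
};

/* Guest flow: allocate a PASID, bind its FL root, then invalidate. */
static int sketch_demo(void)
{
    SketchHostIOMMU host = { .ops = &mock_ops };
    uint32_t pasid = 0;

    if (sketch_pasid_alloc(&host, 1, 15, &pasid) ||
        sketch_bind(&host, pasid, 0x1000) ||
        sketch_flush(&host, pasid)) {
        return -1;
    }
    return (host.bound_root == 0x1000 && host.flushes == 1) ? (int)pasid : -1;
}
```

The point of the indirection is that the vIOMMU emulator only sees the
wrappers, so a host without nesting support can simply leave the object
unregistered and the emulator gets a clean error.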

The complete QEMU set can be found at:
https://github.com/luxis1999/qemu.git: sva_vtd_v9_rfcv3

The complete kernel tree can be found at:
https://github.com/luxis1999/linux-vsva: vsva-linux-5.5-rc3

Changelog:
	- RFC v2 -> v3:
	  a) Introduce DualStageIOMMUObject to abstract the host IOMMU programming
	  capability, e.g. requesting PASIDs from the host and setting up IOMMU
	  nested translation on the host IOMMU. The
	  pasid_alloc/bind_guest_page_table/iommu_cache_flush operations are moved
	  into DualStageIOMMUOps. Thus, DualStageIOMMUObject is an abstraction
	  layer which provides QEMU vIOMMU emulators with an explicit method to
	  program the host IOMMU.
	  b) Compared with RFC v2, IOMMUContext has also been updated. It now
	  provides an abstraction of vIOMMU emulators, giving pass-through modules
	  (like VFIO) a way to communicate with vIOMMU emulators, e.g. to tell
	  vIOMMU emulators about the IOMMU nesting capability on the host side
	  and to report host IOMMU DMA translation faults to vIOMMU emulators.
	  RFC v2: https://www.spinics.net/lists/kvm/msg198556.html

	- RFC v1 -> v2:
	  Introduce IOMMUContext to abstract the connection between VFIO
	  and vIOMMU emulators, replacing the PCIPASIDOps of RFC v1. Modify
	  x-scalable-mode to be a string option instead of adding a new option
	  as RFC v1 did. Refine the PASID cache management and address the
	  TODOs mentioned in RFC v1.
	  RFC v1: https://patchwork.kernel.org/cover/11033657/

Eric Auger (1):
  scripts/update-linux-headers: Import iommu.h

Liu Yi L (23):
  hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
  hw/iommu: introduce DualStageIOMMUObject
  hw/pci: introduce pci_device_iommu_context()
  intel_iommu: provide get_iommu_context() callback
  header file update VFIO/IOMMU vSVA APIs
  vfio: pass IOMMUContext into vfio_get_group()
  vfio: check VFIO_TYPE1_NESTING_IOMMU support
  vfio: register DualStageIOMMUObject to vIOMMU
  vfio: get stage-1 pasid formats from Kernel
  vfio/common: add pasid_alloc/free support
  intel_iommu: modify x-scalable-mode to be string option
  intel_iommu: add virtual command capability support
  intel_iommu: process pasid cache invalidation
  intel_iommu: add PASID cache management infrastructure
  vfio: add bind stage-1 page table support
  intel_iommu: bind/unbind guest page table to host
  intel_iommu: replay guest pasid bindings to host
  intel_iommu: replay pasid binds after context cache invalidation
  intel_iommu: do not pass down pasid bind for PASID #0
  vfio: add support for flush iommu stage-1 cache
  intel_iommu: process PASID-based iotlb invalidation
  intel_iommu: propagate PASID-based iotlb invalidation to host
  intel_iommu: process PASID-based Device-TLB invalidation

Peter Xu (1):
  hw/iommu: introduce IOMMUContext

 hw/Makefile.objs                    |    1 +
 hw/alpha/typhoon.c                  |    6 +-
 hw/arm/smmu-common.c                |    6 +-
 hw/hppa/dino.c                      |    6 +-
 hw/i386/amd_iommu.c                 |    6 +-
 hw/i386/intel_iommu.c               | 1208 ++++++++++++++++++++++++++++++++++-
 hw/i386/intel_iommu_internal.h      |  119 ++++
 hw/i386/trace-events                |    6 +
 hw/iommu/Makefile.objs              |    2 +
 hw/iommu/dual_stage_iommu.c         |  101 +++
 hw/iommu/iommu_context.c            |   54 ++
 hw/pci-host/designware.c            |    6 +-
 hw/pci-host/ppce500.c               |    6 +-
 hw/pci-host/prep.c                  |    6 +-
 hw/pci-host/sabre.c                 |    6 +-
 hw/pci/pci.c                        |   39 +-
 hw/ppc/ppc440_pcix.c                |    6 +-
 hw/ppc/spapr_pci.c                  |    6 +-
 hw/s390x/s390-pci-bus.c             |    8 +-
 hw/vfio/ap.c                        |    2 +-
 hw/vfio/ccw.c                       |    2 +-
 hw/vfio/common.c                    |  244 ++++++-
 hw/vfio/pci.c                       |    3 +-
 hw/vfio/platform.c                  |    2 +-
 include/hw/i386/intel_iommu.h       |   59 +-
 include/hw/iommu/dual_stage_iommu.h |  103 +++
 include/hw/iommu/iommu_context.h    |   61 ++
 include/hw/pci/pci.h                |   13 +-
 include/hw/pci/pci_bus.h            |    2 +-
 include/hw/vfio/vfio-common.h       |    6 +-
 linux-headers/linux/iommu.h         |  372 +++++++++++
 linux-headers/linux/vfio.h          |  148 +++++
 scripts/update-linux-headers.sh     |    2 +-
 33 files changed, 2568 insertions(+), 49 deletions(-)
 create mode 100644 hw/iommu/Makefile.objs
 create mode 100644 hw/iommu/dual_stage_iommu.c
 create mode 100644 hw/iommu/iommu_context.c
 create mode 100644 include/hw/iommu/dual_stage_iommu.h
 create mode 100644 include/hw/iommu/iommu_context.h
 create mode 100644 linux-headers/linux/iommu.h

-- 
2.7.4



* [RFC v3 01/25] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
From: Liu, Yi L @ 2020-01-29 12:16 UTC
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

This patch modifies pci_setup_iommu() to set PCIIOMMUOps
instead of PCIIOMMUFunc. PCIIOMMUFunc is used to get an
address space for a PCI device in a vendor-specific way.
PCIIOMMUOps still offers this functionality, but it also
leaves room to add more IOMMU-related, vendor-specific
operations.
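
For readers who want the shape of the change without reading every hunk, the
following is a stripped-down sketch of the pattern: a bus stores a pointer to
a const ops table instead of a bare callback, and lookup walks up to the
nearest bus that has ops before falling back to system memory. ToyBus,
ToyAddrSpace and the toy_* helpers are stand-ins invented for this sketch. The
real lookup in hw/pci/pci.c also handles bridge-specific cases that are
omitted here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyAddrSpace { const char *name; } ToyAddrSpace;
typedef struct ToyBus ToyBus;

/* Ops table: new vendor-specific callbacks can be appended later. */
typedef struct ToyIOMMUOps {
    ToyAddrSpace *(*get_address_space)(ToyBus *bus, void *opaque, int devfn);
} ToyIOMMUOps;

struct ToyBus {
    ToyBus *parent;
    const ToyIOMMUOps *iommu_ops;
    void *iommu_opaque;
};

/* Fallback used when no IOMMU is registered on the path to the root. */
static ToyAddrSpace toy_system_memory = { "system-memory" };

static void toy_setup_iommu(ToyBus *bus, const ToyIOMMUOps *ops, void *opaque)
{
    bus->iommu_ops = ops;
    bus->iommu_opaque = opaque;
}

static ToyAddrSpace *toy_iommu_address_space(ToyBus *bus, int devfn)
{
    ToyBus *iommu_bus = bus;

    /* Walk towards the root until a bus with registered ops is found. */
    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent) {
        iommu_bus = iommu_bus->parent;
    }
    if (iommu_bus && iommu_bus->iommu_ops &&
        iommu_bus->iommu_ops->get_address_space) {
        return iommu_bus->iommu_ops->get_address_space(bus,
                                     iommu_bus->iommu_opaque, devfn);
    }
    return &toy_system_memory;
}

/* Example vendor callback: the opaque pointer is the address space itself. */
static ToyAddrSpace *toy_vendor_get_as(ToyBus *bus, void *opaque, int devfn)
{
    (void)bus;
    (void)devfn;
    return opaque;
}

static const ToyIOMMUOps toy_vendor_ops = {
    .get_address_space = toy_vendor_get_as,
};
```

A caller registers with toy_setup_iommu(&root, &toy_vendor_ops, &as), after
which devices on child buses resolve to that address space. Note that the
lookup checks both the table pointer and the individual callback, so a later
ops table that only fills in other callbacks still falls back safely.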

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/alpha/typhoon.c       |  6 +++++-
 hw/arm/smmu-common.c     |  6 +++++-
 hw/hppa/dino.c           |  6 +++++-
 hw/i386/amd_iommu.c      |  6 +++++-
 hw/i386/intel_iommu.c    |  6 +++++-
 hw/pci-host/designware.c |  6 +++++-
 hw/pci-host/ppce500.c    |  6 +++++-
 hw/pci-host/prep.c       |  6 +++++-
 hw/pci-host/sabre.c      |  6 +++++-
 hw/pci/pci.c             | 12 +++++++-----
 hw/ppc/ppc440_pcix.c     |  6 +++++-
 hw/ppc/spapr_pci.c       |  6 +++++-
 hw/s390x/s390-pci-bus.c  |  8 ++++++--
 include/hw/pci/pci.h     |  8 ++++++--
 include/hw/pci/pci_bus.h |  2 +-
 15 files changed, 75 insertions(+), 21 deletions(-)

diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index 179e1f7..b890771 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -741,6 +741,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &s->pchip.iommu_as;
 }
 
+static const PCIIOMMUOps typhoon_iommu_ops = {
+    .get_address_space = typhoon_pci_dma_iommu,
+};
+
 static void typhoon_set_irq(void *opaque, int irq, int level)
 {
     TyphoonState *s = opaque;
@@ -901,7 +905,7 @@ PCIBus *typhoon_init(ram_addr_t ram_size, ISABus **isa_bus,
                              "iommu-typhoon", UINT64_MAX);
     address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s->pchip.iommu),
                        "pchip0-pci");
-    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
+    pci_setup_iommu(b, &typhoon_iommu_ops, s);
 
     /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
     memory_region_init_io(&s->pchip.reg_iack, OBJECT(s), &alpha_pci_iack_ops,
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 245817d..d668514 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -342,6 +342,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
     return &sdev->as;
 }
 
+static const PCIIOMMUOps smmu_ops = {
+    .get_address_space = smmu_find_add_as,
+};
+
 IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)
 {
     uint8_t bus_n, devfn;
@@ -436,7 +440,7 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
     s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
 
     if (s->primary_bus) {
-        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
+        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
     } else {
         error_setg(errp, "SMMU is not attached to any PCI bus!");
     }
diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c
index ab6969b..dbcff03 100644
--- a/hw/hppa/dino.c
+++ b/hw/hppa/dino.c
@@ -389,6 +389,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps dino_iommu_ops = {
+    .get_address_space = dino_pcihost_set_iommu,
+};
+
 /*
  * Dino interrupts are connected as shown on Page 78, Table 23
  * (Little-endian bit numbers)
@@ -508,7 +512,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
     memory_region_add_subregion(&s->bm, 0xfff00000,
                                 &s->bm_cpu_alias);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
+    pci_setup_iommu(b, &dino_iommu_ops, s);
 
     *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
     *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0);
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index b1175e5..5fec30e 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1451,6 +1451,10 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &iommu_as[devfn]->as;
 }
 
+static const PCIIOMMUOps amdvi_iommu_ops = {
+    .get_address_space = amdvi_host_dma_iommu,
+};
+
 static const MemoryRegionOps mmio_mem_ops = {
     .read = amdvi_mmio_read,
     .write = amdvi_mmio_write,
@@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev, Error **errp)
 
     sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
-    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
+    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
     s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
     msi_init(&s->pci.dev, 0, 1, true, false, errp);
     amdvi_init(s);
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index bfe8edb..1a37e97 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3722,6 +3722,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &vtd_as->as;
 }
 
+static const PCIIOMMUOps vtd_iommu_ops = {
+    .get_address_space = vtd_host_dma_iommu,
+};
+
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
@@ -3833,7 +3837,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                               g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
-    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
+    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
     /* Pseudo address space under root PCI bus. */
     x86ms->ioapic_as = vtd_host_dma_iommu(bus, s, Q35_PSEUDO_DEVFN_IOAPIC);
     qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
index 71e9b0d..235d6af 100644
--- a/hw/pci-host/designware.c
+++ b/hw/pci-host/designware.c
@@ -645,6 +645,10 @@ static AddressSpace *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
     return &s->pci.address_space;
 }
 
+static const PCIIOMMUOps designware_iommu_ops = {
+    .get_address_space = designware_pcie_host_set_iommu,
+};
+
 static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
 {
     PCIHostState *pci = PCI_HOST_BRIDGE(dev);
@@ -686,7 +690,7 @@ static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
     address_space_init(&s->pci.address_space,
                        &s->pci.address_space_root,
                        "pcie-bus-address-space");
-    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
+    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
 
     qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
     qdev_init_nofail(DEVICE(&s->root));
diff --git a/hw/pci-host/ppce500.c b/hw/pci-host/ppce500.c
index 8bed8e8..0f907b0 100644
--- a/hw/pci-host/ppce500.c
+++ b/hw/pci-host/ppce500.c
@@ -439,6 +439,10 @@ static AddressSpace *e500_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps ppce500_iommu_ops = {
+    .get_address_space = e500_pcihost_set_iommu,
+};
+
 static void e500_pcihost_realize(DeviceState *dev, Error **errp)
 {
     SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
@@ -473,7 +477,7 @@ static void e500_pcihost_realize(DeviceState *dev, Error **errp)
     memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
     memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
+    pci_setup_iommu(b, &ppce500_iommu_ops, s);
 
     pci_create_simple(b, 0, "e500-host-bridge");
 
diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c
index afa136d..8002fc9 100644
--- a/hw/pci-host/prep.c
+++ b/hw/pci-host/prep.c
@@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps raven_iommu_ops = {
+    .get_address_space = raven_pcihost_set_iommu,
+};
+
 static void raven_change_gpio(void *opaque, int n, int level)
 {
     PREPPCIState *s = opaque;
@@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
     memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
     memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
     address_space_init(&s->bm_as, &s->bm, "raven-bm");
-    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
+    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
 
     h->bus = &s->pci_bus;
 
diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c
index fae20ee..79b7565 100644
--- a/hw/pci-host/sabre.c
+++ b/hw/pci-host/sabre.c
@@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &is->iommu_as;
 }
 
+static const PCIIOMMUOps sabre_iommu_ops = {
+    .get_address_space = sabre_pci_dma_iommu,
+};
+
 static void sabre_config_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
 {
@@ -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
     /* IOMMU */
     memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
                     sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
-    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
+    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
 
     /* APB secondary busses */
     pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0), true,
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e3d31036..e331a5c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2644,7 +2644,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     PCIBus *iommu_bus = bus;
     uint8_t devfn = dev->devfn;
 
-    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
+    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
 
         /*
@@ -2683,15 +2683,17 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
-    if (iommu_bus && iommu_bus->iommu_fn) {
-        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+    if (iommu_bus && iommu_bus->iommu_ops &&
+                     iommu_bus->iommu_ops->get_address_space) {
+        return iommu_bus->iommu_ops->get_address_space(bus,
+                                 iommu_bus->iommu_opaque, devfn);
     }
     return &address_space_memory;
 }
 
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
+void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
 {
-    bus->iommu_fn = fn;
+    bus->iommu_ops = ops;
     bus->iommu_opaque = opaque;
 }
 
diff --git a/hw/ppc/ppc440_pcix.c b/hw/ppc/ppc440_pcix.c
index 2ee2d4f..2c8579c 100644
--- a/hw/ppc/ppc440_pcix.c
+++ b/hw/ppc/ppc440_pcix.c
@@ -442,6 +442,10 @@ static AddressSpace *ppc440_pcix_set_iommu(PCIBus *b, void *opaque, int devfn)
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps ppc440_iommu_ops = {
+    .get_address_space = ppc440_pcix_set_iommu,
+};
+
 /* The default pci_host_data_{read,write} functions in pci/pci_host.c
  * deny access to registers without bit 31 set but our clients want
  * this to work so we have to override these here */
@@ -487,7 +491,7 @@ static void ppc440_pcix_realize(DeviceState *dev, Error **errp)
     memory_region_init(&s->bm, OBJECT(s), "bm-ppc440-pcix", UINT64_MAX);
     memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(h->bus, ppc440_pcix_set_iommu, s);
+    pci_setup_iommu(h->bus, &ppc440_iommu_ops, s);
 
     memory_region_init(&s->container, OBJECT(s), "pci-container", PCI_ALL_SIZE);
     memory_region_init_io(&h->conf_mem, OBJECT(s), &pci_host_conf_le_ops,
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 723373d..c39e8d4 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -771,6 +771,10 @@ static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &phb->iommu_as;
 }
 
+static const PCIIOMMUOps spapr_iommu_ops = {
+    .get_address_space = spapr_pci_dma_iommu,
+};
+
 static char *spapr_phb_vfio_get_loc_code(SpaprPhbState *sphb,  PCIDevice *pdev)
 {
     char *path = NULL, *buf = NULL, *host = NULL;
@@ -1950,7 +1954,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion(&sphb->iommu_root, SPAPR_PCI_MSI_WINDOW,
                                 &sphb->msiwindow);
 
-    pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
+    pci_setup_iommu(bus, &spapr_iommu_ops, sphb);
 
     pci_bus_set_route_irq_fn(bus, spapr_route_intx_pin_to_irq);
 
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 2d2f4a7..14684a0 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -635,6 +635,10 @@ static AddressSpace *s390_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &iommu->as;
 }
 
+static const PCIIOMMUOps s390_iommu_ops = {
+    .get_address_space = s390_pci_dma_iommu,
+};
+
 static uint8_t set_ind_atomic(uint64_t ind_loc, uint8_t to_be_set)
 {
     uint8_t ind_old, ind_new;
@@ -748,7 +752,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
     b = pci_register_root_bus(dev, NULL, s390_pci_set_irq, s390_pci_map_irq,
                               NULL, get_system_memory(), get_system_io(), 0,
                               64, TYPE_PCI_BUS);
-    pci_setup_iommu(b, s390_pci_dma_iommu, s);
+    pci_setup_iommu(b, &s390_iommu_ops, s);
 
     bus = BUS(b);
     qbus_set_hotplug_handler(bus, OBJECT(dev), &local_err);
@@ -919,7 +923,7 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 
         pdev = PCI_DEVICE(dev);
         pci_bridge_map_irq(pb, dev->id, s390_pci_map_irq);
-        pci_setup_iommu(&pb->sec_bus, s390_pci_dma_iommu, s);
+        pci_setup_iommu(&pb->sec_bus, &s390_iommu_ops, s);
 
         qbus_set_hotplug_handler(BUS(&pb->sec_bus), OBJECT(s), errp);
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 2acd832..dc89aa1 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -484,10 +484,14 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
 
 void pci_device_deassert_intx(PCIDevice *dev);
 
-typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+typedef struct PCIIOMMUOps PCIIOMMUOps;
+struct PCIIOMMUOps {
+    AddressSpace * (*get_address_space)(PCIBus *bus,
+                                void *opaque, int32_t devfn);
+};
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
+void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
 
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 0714f57..c281057 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -29,7 +29,7 @@ enum PCIBusFlags {
 struct PCIBus {
     BusState qbus;
     enum PCIBusFlags flags;
-    PCIIOMMUFunc iommu_fn;
+    const PCIIOMMUOps *iommu_ops;
     void *iommu_opaque;
     uint8_t devfn_min;
     uint32_t slot_reserved_mask;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 01/25] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
@ 2020-01-29 12:16   ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: kevin.tian, yi.l.liu, Yi Sun, kvm, mst, jun.j.tian, eric.auger,
	yi.y.sun, Jacob Pan, hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

This patch modifies pci_setup_iommu() to set a PCIIOMMUOps structure
instead of a PCIIOMMUFunc callback. PCIIOMMUFunc is used to get the
address space for a PCI device in a vendor-specific way. PCIIOMMUOps
still offers this functionality through its get_address_space
callback, and it leaves room to add more IOMMU-related
vendor-specific operations.
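
The pattern described above can be sketched outside of QEMU as follows.
This is only an illustration under simplifying assumptions: Bus,
IOMMUOps, setup_iommu() and lookup_address_space() are hypothetical
stand-ins for PCIBus, PCIIOMMUOps, pci_setup_iommu() and
pci_device_iommu_address_space(), and the devfn aliasing done by the
real lookup is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for QEMU's AddressSpace and PCIBus (hypothetical). */
typedef struct AddressSpace { const char *name; } AddressSpace;
typedef struct Bus Bus;

/* Table of callbacks; new vendor-specific operations become new members. */
typedef struct IOMMUOps {
    AddressSpace *(*get_address_space)(Bus *bus, void *opaque, int devfn);
} IOMMUOps;

struct Bus {
    Bus *parent;                /* NULL for a root bus */
    const IOMMUOps *iommu_ops;  /* NULL if no IOMMU was registered here */
    void *iommu_opaque;
};

static AddressSpace system_memory = { "system" };
static AddressSpace vendor_as = { "vendor-iommu" };

/* Toy vendor callback: the opaque pointer is the address space itself. */
static AddressSpace *vendor_get_as(Bus *bus, void *opaque, int devfn)
{
    (void)bus;
    (void)devfn;
    return opaque;
}

static const IOMMUOps vendor_ops = {
    .get_address_space = vendor_get_as,
};

/* Same shape as pci_setup_iommu(): store the ops table and its opaque. */
static void setup_iommu(Bus *bus, const IOMMUOps *ops, void *opaque)
{
    assert(bus);
    bus->iommu_ops = ops;
    bus->iommu_opaque = opaque;
}

/* Walk up to the first bus with ops, else fall back to system memory. */
static AddressSpace *lookup_address_space(Bus *bus, int devfn)
{
    Bus *iommu_bus = bus;

    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent) {
        iommu_bus = iommu_bus->parent;
    }
    if (iommu_bus && iommu_bus->iommu_ops &&
        iommu_bus->iommu_ops->get_address_space) {
        return iommu_bus->iommu_ops->get_address_space(bus,
                                 iommu_bus->iommu_opaque, devfn);
    }
    return &system_memory;
}
```

Checking both the ops pointer and the individual callback before the
call means a vendor can register a table that fills in only some of
the operations.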

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/alpha/typhoon.c       |  6 +++++-
 hw/arm/smmu-common.c     |  6 +++++-
 hw/hppa/dino.c           |  6 +++++-
 hw/i386/amd_iommu.c      |  6 +++++-
 hw/i386/intel_iommu.c    |  6 +++++-
 hw/pci-host/designware.c |  6 +++++-
 hw/pci-host/ppce500.c    |  6 +++++-
 hw/pci-host/prep.c       |  6 +++++-
 hw/pci-host/sabre.c      |  6 +++++-
 hw/pci/pci.c             | 12 +++++++-----
 hw/ppc/ppc440_pcix.c     |  6 +++++-
 hw/ppc/spapr_pci.c       |  6 +++++-
 hw/s390x/s390-pci-bus.c  |  8 ++++++--
 include/hw/pci/pci.h     |  8 ++++++--
 include/hw/pci/pci_bus.h |  2 +-
 15 files changed, 75 insertions(+), 21 deletions(-)

diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index 179e1f7..b890771 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -741,6 +741,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &s->pchip.iommu_as;
 }
 
+static const PCIIOMMUOps typhoon_iommu_ops = {
+    .get_address_space = typhoon_pci_dma_iommu,
+};
+
 static void typhoon_set_irq(void *opaque, int irq, int level)
 {
     TyphoonState *s = opaque;
@@ -901,7 +905,7 @@ PCIBus *typhoon_init(ram_addr_t ram_size, ISABus **isa_bus,
                              "iommu-typhoon", UINT64_MAX);
     address_space_init(&s->pchip.iommu_as, MEMORY_REGION(&s->pchip.iommu),
                        "pchip0-pci");
-    pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
+    pci_setup_iommu(b, &typhoon_iommu_ops, s);
 
     /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
     memory_region_init_io(&s->pchip.reg_iack, OBJECT(s), &alpha_pci_iack_ops,
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 245817d..d668514 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -342,6 +342,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
     return &sdev->as;
 }
 
+static const PCIIOMMUOps smmu_ops = {
+    .get_address_space = smmu_find_add_as,
+};
+
 IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)
 {
     uint8_t bus_n, devfn;
@@ -436,7 +440,7 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
     s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
 
     if (s->primary_bus) {
-        pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
+        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
     } else {
         error_setg(errp, "SMMU is not attached to any PCI bus!");
     }
diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c
index ab6969b..dbcff03 100644
--- a/hw/hppa/dino.c
+++ b/hw/hppa/dino.c
@@ -389,6 +389,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps dino_iommu_ops = {
+    .get_address_space = dino_pcihost_set_iommu,
+};
+
 /*
  * Dino interrupts are connected as shown on Page 78, Table 23
  * (Little-endian bit numbers)
@@ -508,7 +512,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
     memory_region_add_subregion(&s->bm, 0xfff00000,
                                 &s->bm_cpu_alias);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(b, dino_pcihost_set_iommu, s);
+    pci_setup_iommu(b, &dino_iommu_ops, s);
 
     *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
     *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0);
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index b1175e5..5fec30e 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1451,6 +1451,10 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &iommu_as[devfn]->as;
 }
 
+static const PCIIOMMUOps amdvi_iommu_ops = {
+    .get_address_space = amdvi_host_dma_iommu,
+};
+
 static const MemoryRegionOps mmio_mem_ops = {
     .read = amdvi_mmio_read,
     .write = amdvi_mmio_write,
@@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev, Error **errp)
 
     sysbus_init_mmio(SYS_BUS_DEVICE(s), &s->mmio);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
-    pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
+    pci_setup_iommu(bus, &amdvi_iommu_ops, s);
     s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
     msi_init(&s->pci.dev, 0, 1, true, false, errp);
     amdvi_init(s);
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index bfe8edb..1a37e97 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3722,6 +3722,10 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &vtd_as->as;
 }
 
+static const PCIIOMMUOps vtd_iommu_ops = {
+    .get_address_space = vtd_host_dma_iommu,
+};
+
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
 {
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
@@ -3833,7 +3837,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                               g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
-    pci_setup_iommu(bus, vtd_host_dma_iommu, dev);
+    pci_setup_iommu(bus, &vtd_iommu_ops, dev);
     /* Pseudo address space under root PCI bus. */
     x86ms->ioapic_as = vtd_host_dma_iommu(bus, s, Q35_PSEUDO_DEVFN_IOAPIC);
     qemu_add_machine_init_done_notifier(&vtd_machine_done_notify);
diff --git a/hw/pci-host/designware.c b/hw/pci-host/designware.c
index 71e9b0d..235d6af 100644
--- a/hw/pci-host/designware.c
+++ b/hw/pci-host/designware.c
@@ -645,6 +645,10 @@ static AddressSpace *designware_pcie_host_set_iommu(PCIBus *bus, void *opaque,
     return &s->pci.address_space;
 }
 
+static const PCIIOMMUOps designware_iommu_ops = {
+    .get_address_space = designware_pcie_host_set_iommu,
+};
+
 static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
 {
     PCIHostState *pci = PCI_HOST_BRIDGE(dev);
@@ -686,7 +690,7 @@ static void designware_pcie_host_realize(DeviceState *dev, Error **errp)
     address_space_init(&s->pci.address_space,
                        &s->pci.address_space_root,
                        "pcie-bus-address-space");
-    pci_setup_iommu(pci->bus, designware_pcie_host_set_iommu, s);
+    pci_setup_iommu(pci->bus, &designware_iommu_ops, s);
 
     qdev_set_parent_bus(DEVICE(&s->root), BUS(pci->bus));
     qdev_init_nofail(DEVICE(&s->root));
diff --git a/hw/pci-host/ppce500.c b/hw/pci-host/ppce500.c
index 8bed8e8..0f907b0 100644
--- a/hw/pci-host/ppce500.c
+++ b/hw/pci-host/ppce500.c
@@ -439,6 +439,10 @@ static AddressSpace *e500_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps ppce500_iommu_ops = {
+    .get_address_space = e500_pcihost_set_iommu,
+};
+
 static void e500_pcihost_realize(DeviceState *dev, Error **errp)
 {
     SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
@@ -473,7 +477,7 @@ static void e500_pcihost_realize(DeviceState *dev, Error **errp)
     memory_region_init(&s->bm, OBJECT(s), "bm-e500", UINT64_MAX);
     memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(b, e500_pcihost_set_iommu, s);
+    pci_setup_iommu(b, &ppce500_iommu_ops, s);
 
     pci_create_simple(b, 0, "e500-host-bridge");
 
diff --git a/hw/pci-host/prep.c b/hw/pci-host/prep.c
index afa136d..8002fc9 100644
--- a/hw/pci-host/prep.c
+++ b/hw/pci-host/prep.c
@@ -213,6 +213,10 @@ static AddressSpace *raven_pcihost_set_iommu(PCIBus *bus, void *opaque,
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps raven_iommu_ops = {
+    .get_address_space = raven_pcihost_set_iommu,
+};
+
 static void raven_change_gpio(void *opaque, int n, int level)
 {
     PREPPCIState *s = opaque;
@@ -303,7 +307,7 @@ static void raven_pcihost_initfn(Object *obj)
     memory_region_add_subregion(&s->bm, 0         , &s->bm_pci_memory_alias);
     memory_region_add_subregion(&s->bm, 0x80000000, &s->bm_ram_alias);
     address_space_init(&s->bm_as, &s->bm, "raven-bm");
-    pci_setup_iommu(&s->pci_bus, raven_pcihost_set_iommu, s);
+    pci_setup_iommu(&s->pci_bus, &raven_iommu_ops, s);
 
     h->bus = &s->pci_bus;
 
diff --git a/hw/pci-host/sabre.c b/hw/pci-host/sabre.c
index fae20ee..79b7565 100644
--- a/hw/pci-host/sabre.c
+++ b/hw/pci-host/sabre.c
@@ -112,6 +112,10 @@ static AddressSpace *sabre_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &is->iommu_as;
 }
 
+static const PCIIOMMUOps sabre_iommu_ops = {
+    .get_address_space = sabre_pci_dma_iommu,
+};
+
 static void sabre_config_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
 {
@@ -402,7 +406,7 @@ static void sabre_realize(DeviceState *dev, Error **errp)
     /* IOMMU */
     memory_region_add_subregion_overlap(&s->sabre_config, 0x200,
                     sysbus_mmio_get_region(SYS_BUS_DEVICE(s->iommu), 0), 1);
-    pci_setup_iommu(phb->bus, sabre_pci_dma_iommu, s->iommu);
+    pci_setup_iommu(phb->bus, &sabre_iommu_ops, s->iommu);
 
     /* APB secondary busses */
     pci_dev = pci_create_multifunction(phb->bus, PCI_DEVFN(1, 0), true,
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e3d31036..e331a5c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2644,7 +2644,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     PCIBus *iommu_bus = bus;
     uint8_t devfn = dev->devfn;
 
-    while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
+    while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
 
         /*
@@ -2683,15 +2683,17 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
-    if (iommu_bus && iommu_bus->iommu_fn) {
-        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
+    if (iommu_bus && iommu_bus->iommu_ops &&
+                     iommu_bus->iommu_ops->get_address_space) {
+        return iommu_bus->iommu_ops->get_address_space(bus,
+                                 iommu_bus->iommu_opaque, devfn);
     }
     return &address_space_memory;
 }
 
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
+void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
 {
-    bus->iommu_fn = fn;
+    bus->iommu_ops = ops;
     bus->iommu_opaque = opaque;
 }
 
diff --git a/hw/ppc/ppc440_pcix.c b/hw/ppc/ppc440_pcix.c
index 2ee2d4f..2c8579c 100644
--- a/hw/ppc/ppc440_pcix.c
+++ b/hw/ppc/ppc440_pcix.c
@@ -442,6 +442,10 @@ static AddressSpace *ppc440_pcix_set_iommu(PCIBus *b, void *opaque, int devfn)
     return &s->bm_as;
 }
 
+static const PCIIOMMUOps ppc440_iommu_ops = {
+    .get_address_space = ppc440_pcix_set_iommu,
+};
+
 /* The default pci_host_data_{read,write} functions in pci/pci_host.c
  * deny access to registers without bit 31 set but our clients want
  * this to work so we have to override these here */
@@ -487,7 +491,7 @@ static void ppc440_pcix_realize(DeviceState *dev, Error **errp)
     memory_region_init(&s->bm, OBJECT(s), "bm-ppc440-pcix", UINT64_MAX);
     memory_region_add_subregion(&s->bm, 0x0, &s->busmem);
     address_space_init(&s->bm_as, &s->bm, "pci-bm");
-    pci_setup_iommu(h->bus, ppc440_pcix_set_iommu, s);
+    pci_setup_iommu(h->bus, &ppc440_iommu_ops, s);
 
     memory_region_init(&s->container, OBJECT(s), "pci-container", PCI_ALL_SIZE);
     memory_region_init_io(&h->conf_mem, OBJECT(s), &pci_host_conf_le_ops,
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 723373d..c39e8d4 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -771,6 +771,10 @@ static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &phb->iommu_as;
 }
 
+static const PCIIOMMUOps spapr_iommu_ops = {
+    .get_address_space = spapr_pci_dma_iommu,
+};
+
 static char *spapr_phb_vfio_get_loc_code(SpaprPhbState *sphb,  PCIDevice *pdev)
 {
     char *path = NULL, *buf = NULL, *host = NULL;
@@ -1950,7 +1954,7 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     memory_region_add_subregion(&sphb->iommu_root, SPAPR_PCI_MSI_WINDOW,
                                 &sphb->msiwindow);
 
-    pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
+    pci_setup_iommu(bus, &spapr_iommu_ops, sphb);
 
     pci_bus_set_route_irq_fn(bus, spapr_route_intx_pin_to_irq);
 
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 2d2f4a7..14684a0 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -635,6 +635,10 @@ static AddressSpace *s390_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &iommu->as;
 }
 
+static const PCIIOMMUOps s390_iommu_ops = {
+    .get_address_space = s390_pci_dma_iommu,
+};
+
 static uint8_t set_ind_atomic(uint64_t ind_loc, uint8_t to_be_set)
 {
     uint8_t ind_old, ind_new;
@@ -748,7 +752,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
     b = pci_register_root_bus(dev, NULL, s390_pci_set_irq, s390_pci_map_irq,
                               NULL, get_system_memory(), get_system_io(), 0,
                               64, TYPE_PCI_BUS);
-    pci_setup_iommu(b, s390_pci_dma_iommu, s);
+    pci_setup_iommu(b, &s390_iommu_ops, s);
 
     bus = BUS(b);
     qbus_set_hotplug_handler(bus, OBJECT(dev), &local_err);
@@ -919,7 +923,7 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 
         pdev = PCI_DEVICE(dev);
         pci_bridge_map_irq(pb, dev->id, s390_pci_map_irq);
-        pci_setup_iommu(&pb->sec_bus, s390_pci_dma_iommu, s);
+        pci_setup_iommu(&pb->sec_bus, &s390_iommu_ops, s);
 
         qbus_set_hotplug_handler(BUS(&pb->sec_bus), OBJECT(s), errp);
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 2acd832..dc89aa1 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -484,10 +484,14 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range);
 
 void pci_device_deassert_intx(PCIDevice *dev);
 
-typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+typedef struct PCIIOMMUOps PCIIOMMUOps;
+struct PCIIOMMUOps {
+    AddressSpace * (*get_address_space)(PCIBus *bus,
+                                void *opaque, int32_t devfn);
+};
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
-void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
+void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
 
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 0714f57..c281057 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -29,7 +29,7 @@ enum PCIBusFlags {
 struct PCIBus {
     BusState qbus;
     enum PCIBusFlags flags;
-    PCIIOMMUFunc iommu_fn;
+    const PCIIOMMUOps *iommu_ops;
     void *iommu_opaque;
     uint8_t devfn_min;
     uint32_t slot_reserved_mask;
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

Currently, many platform vendors provide the capability of dual stage
DMA address translation in hardware, for example nested translation
in Intel VT-d scalable mode and nested stage translation in ARM
SMMUv3. Dual stage DMA address translation uses two sets of
translation structures: stage-1 (a.k.a. first-level) and stage-2
(a.k.a. second-level). The results of stage-1 translation are in turn
subject to stage-2 translation. Take vSVA (Virtual Shared Virtual
Addressing) as an example: the guest IOMMU driver owns the stage-1
translation structures (covering GVA->GPA translation), and the host
IOMMU driver owns the stage-2 translation structures (covering GPA->HPA
translation). The VMM is responsible for binding the stage-1 translation
structures to the host, so that hardware can perform GVA->GPA and then
GPA->HPA translation. For more background on SVA, refer to the links
below.
 - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
 - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
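
As a rough illustration of the two-step lookup described above, the
sketch below composes two toy page maps. It assumes 4 KiB pages and
flat arrays in place of real page tables; PageMap, translate() and
nested_translate() are hypothetical names. Real hardware also runs the
stage-1 table pointers themselves through stage-2, which is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)

/* One page-granular mapping: input frame number -> output frame number. */
typedef struct PageMap {
    uint64_t in_pfn;
    uint64_t out_pfn;
} PageMap;

/* Single-stage lookup; returns false on a translation fault. */
static bool translate(const PageMap *tbl, size_t n, uint64_t in,
                      uint64_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_pfn == (in >> PAGE_SHIFT)) {
            *out = (tbl[i].out_pfn << PAGE_SHIFT) | (in & PAGE_MASK);
            return true;
        }
    }
    return false;
}

/*
 * Nested lookup: stage-1 (guest-owned, GVA->GPA) runs first and its
 * result is fed to stage-2 (host-owned, GPA->HPA).
 */
static bool nested_translate(const PageMap *s1, size_t n1,
                             const PageMap *s2, size_t n2,
                             uint64_t gva, uint64_t *hpa)
{
    uint64_t gpa;

    return translate(s1, n1, gva, &gpa) && translate(s2, n2, gpa, hpa);
}
```

A miss in either stage fails the whole lookup, which mirrors how a
fault in either stage stops the DMA access.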

As above, dual stage DMA translation offers two stages of address
mapping, which can provide better DMA address translation support for
passthru devices. This is also what vIOMMU developers are working on.
Efforts include vSVA enabling from Yi Liu and SMMUv3 Nested Stage
Setup from Eric Auger.
https://www.spinics.net/lists/kvm/msg198556.html
https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg02842.html

Both efforts aim to expose a vIOMMU backed by dual stage hardware.
QEMU therefore needs an explicit object to stand for the dual stage
capability of the hardware. Such an object offers an abstraction for
the operations related to dual stage DMA translation, like:

 1) PASID allocation (allows the host to intercept PASID allocation)
 2) bind stage-1 translation structures to host
 3) propagate stage-1 cache invalidation to host
 4) DMA address translation fault (I/O page fault) servicing etc.

This patch introduces DualStageIOMMUObject to stand for the hardware
dual stage DMA translation capability. PASID allocation/free are the
first operations included in it. In the future, there will be more
operations, such as bind_stage1_pgtbl and invalidate_stage1_cache.
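
To show how a backend might plug into this object, here is a minimal
sketch. The two structures and the dispatch wrappers follow the shape
of the ones added by this patch; ToyHostIOMMU, toy_pasid_alloc(),
toy_pasid_free() and TOY_MAX_PASID are hypothetical and only stand in
for a real host IOMMU backend, which is introduced elsewhere in the
series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DualStageIOMMUObject DualStageIOMMUObject;
typedef struct DualStageIOMMUOps DualStageIOMMUOps;

struct DualStageIOMMUOps {
    int (*pasid_alloc)(DualStageIOMMUObject *dsi_obj,
                       uint32_t min, uint32_t max, uint32_t *pasid);
    int (*pasid_free)(DualStageIOMMUObject *dsi_obj, uint32_t pasid);
};

struct DualStageIOMMUObject {
    DualStageIOMMUOps *ops;
};

/* Dispatch wrappers: -ENOENT when no object or no callback is present. */
static int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
                                uint32_t max, uint32_t *pasid)
{
    if (dsi_obj && dsi_obj->ops && dsi_obj->ops->pasid_alloc) {
        return dsi_obj->ops->pasid_alloc(dsi_obj, min, max, pasid);
    }
    return -ENOENT;
}

static int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
{
    if (dsi_obj && dsi_obj->ops && dsi_obj->ops->pasid_free) {
        return dsi_obj->ops->pasid_free(dsi_obj, pasid);
    }
    return -ENOENT;
}

/* Hypothetical backend: a tiny in-memory PASID pool. */
#define TOY_MAX_PASID 8

typedef struct ToyHostIOMMU {
    DualStageIOMMUObject dsi_obj;  /* first member, so the casts are valid */
    bool used[TOY_MAX_PASID];
} ToyHostIOMMU;

static int toy_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
                           uint32_t max, uint32_t *pasid)
{
    ToyHostIOMMU *toy = (ToyHostIOMMU *)dsi_obj;

    /* First-fit search inside [min, max], bounded by the pool size. */
    for (uint32_t p = min; p <= max && p < TOY_MAX_PASID; p++) {
        if (!toy->used[p]) {
            toy->used[p] = true;
            *pasid = p;
            return 0;
        }
    }
    return -ENOSPC;
}

static int toy_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
{
    ToyHostIOMMU *toy = (ToyHostIOMMU *)dsi_obj;

    if (pasid >= TOY_MAX_PASID || !toy->used[pasid]) {
        return -EINVAL;
    }
    toy->used[pasid] = false;
    return 0;
}

static DualStageIOMMUOps toy_ops = {
    .pasid_alloc = toy_pasid_alloc,
    .pasid_free = toy_pasid_free,
};
```

Embedding the object as the first member of the backend state lets the
callbacks recover their private data from the dsi_obj pointer without
an extra opaque argument.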

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/Makefile.objs                    |  1 +
 hw/iommu/Makefile.objs              |  1 +
 hw/iommu/dual_stage_iommu.c         | 59 +++++++++++++++++++++++++++++++++++++
 include/hw/iommu/dual_stage_iommu.h | 59 +++++++++++++++++++++++++++++++++++++
 4 files changed, 120 insertions(+)
 create mode 100644 hw/iommu/Makefile.objs
 create mode 100644 hw/iommu/dual_stage_iommu.c
 create mode 100644 include/hw/iommu/dual_stage_iommu.h

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 660e2b4..cab83fe 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
 devices-dirs-$(CONFIG_NUBUS) += nubus/
 devices-dirs-y += semihosting/
 devices-dirs-y += smbios/
+devices-dirs-y += iommu/
 endif
 
 common-obj-y += $(devices-dirs-y)
diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
new file mode 100644
index 0000000..d4f3b39
--- /dev/null
+++ b/hw/iommu/Makefile.objs
@@ -0,0 +1 @@
+obj-y += dual_stage_iommu.o
diff --git a/hw/iommu/dual_stage_iommu.c b/hw/iommu/dual_stage_iommu.c
new file mode 100644
index 0000000..be4179d
--- /dev/null
+++ b/hw/iommu/dual_stage_iommu.c
@@ -0,0 +1,59 @@
+/*
+ * QEMU abstraction of hardware dual stage DMA translation capability
+ *
+ * Copyright (C) 2020 Intel Corporation.
+ *
+ * Authors: Liu Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/iommu/dual_stage_iommu.h"
+
+int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
+                         uint32_t max, uint32_t *pasid)
+{
+    if (!dsi_obj) {
+        return -ENOENT;
+    }
+
+    if (dsi_obj->ops && dsi_obj->ops->pasid_alloc) {
+        return dsi_obj->ops->pasid_alloc(dsi_obj, min, max, pasid);
+    }
+    return -ENOENT;
+}
+
+int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
+{
+    if (!dsi_obj) {
+        return -ENOENT;
+    }
+
+    if (dsi_obj->ops && dsi_obj->ops->pasid_free) {
+        return dsi_obj->ops->pasid_free(dsi_obj, pasid);
+    }
+    return -ENOENT;
+}
+
+void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
+                          DualStageIOMMUOps *ops)
+{
+    dsi_obj->ops = ops;
+}
+
+void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj)
+{
+    dsi_obj->ops = NULL;
+}
diff --git a/include/hw/iommu/dual_stage_iommu.h b/include/hw/iommu/dual_stage_iommu.h
new file mode 100644
index 0000000..e9891e3
--- /dev/null
+++ b/include/hw/iommu/dual_stage_iommu.h
@@ -0,0 +1,59 @@
+/*
+ * QEMU abstraction of hardware dual stage DMA translation capability
+ *
+ * Copyright (C) 2020 Red Hat Inc.
+ *
+ * Authors: Liu, Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_DS_IOMMU_H
+#define HW_DS_IOMMU_H
+
+#include "qemu/queue.h"
+#ifndef CONFIG_USER_ONLY
+#include "exec/hwaddr.h"
+#endif
+
+typedef struct DualStageIOMMUObject DualStageIOMMUObject;
+typedef struct DualStageIOMMUOps DualStageIOMMUOps;
+
+struct DualStageIOMMUOps {
+    /* Allocate pasid from DualStageIOMMU (a.k.a. host IOMMU) */
+    int (*pasid_alloc)(DualStageIOMMUObject *dsi_obj,
+                       uint32_t min,
+                       uint32_t max,
+                       uint32_t *pasid);
+    /* Reclaim a pasid from DualStageIOMMU (a.k.a. host IOMMU) */
+    int (*pasid_free)(DualStageIOMMUObject *dsi_obj,
+                      uint32_t pasid);
+};
+
+/*
+ * This is an abstraction of Dual-stage IOMMU.
+ */
+struct DualStageIOMMUObject {
+    DualStageIOMMUOps *ops;
+};
+
+int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
+                         uint32_t max, uint32_t *pasid);
+int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid);
+
+void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
+                          DualStageIOMMUOps *ops);
+void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj);
+
+#endif
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Peter Xu <peterx@redhat.com>

Many platform vendors now provide dual-stage DMA address translation
in hardware, for example nested translation in Intel VT-d scalable mode
and nested stage translation on ARM SMMUv3. There are also efforts to
back the QEMU vIOMMU with this hardware capability to get better
address translation support for passthru devices.

Backing the vIOMMU with dual-stage translation requires the QEMU vIOMMU
to have a way to learn about this hardware capability. It also requires
a way to receive DMA address translation faults (e.g. I/O page requests)
from the host, because the guest owns the stage-1 translation structures
in dual-stage DMA address translation.

This patch adds IOMMUContext as an abstraction of vIOMMU-related
operations. For example, it provides a way for passthru modules (e.g.
VFIO) to register DualStageIOMMUObject instances. In the future, it is
expected to support receiving host DMA translation faults that occur in
stage-1 translation.

For more background, refer to the discussion linked below; note that
the current implementation differs from the original proposal. This
patch introduces IOMMUContext as an abstraction layer through which
passthru modules (e.g. VFIO) call into the vIOMMU. The first interface
introduced makes the QEMU vIOMMU aware of the dual-stage translation
capability.

https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg05022.html
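
To illustrate the registration path described above, here is a minimal
standalone sketch. It is not QEMU code. The type names mirror the ones
in this patch but are reduced stand-ins, and DemoVIOMMUDev and all
demo_* functions are hypothetical. It assumes a vIOMMU that embeds an
IOMMUContext in its per-device state and simply remembers the
registered DualStageIOMMUObject.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the types added by this series. */
typedef struct DualStageIOMMUObject {
    int placeholder;              /* the real object carries an ops table */
} DualStageIOMMUObject;

typedef struct IOMMUContext IOMMUContext;

typedef struct IOMMUContextOps {
    int (*register_ds_iommu)(IOMMUContext *ctx, DualStageIOMMUObject *obj);
    void (*unregister_ds_iommu)(IOMMUContext *ctx, DualStageIOMMUObject *obj);
} IOMMUContextOps;

struct IOMMUContext {
    IOMMUContextOps *ops;
};

/* Hypothetical per-device vIOMMU state that embeds the generic context. */
typedef struct DemoVIOMMUDev {
    IOMMUContext ctx;
    DualStageIOMMUObject *dsi_obj;
} DemoVIOMMUDev;

/* Recover the containing DemoVIOMMUDev from its embedded context. */
static DemoVIOMMUDev *demo_dev_from_ctx(IOMMUContext *ctx)
{
    return (DemoVIOMMUDev *)((char *)ctx - offsetof(DemoVIOMMUDev, ctx));
}

static int demo_register(IOMMUContext *ctx, DualStageIOMMUObject *obj)
{
    demo_dev_from_ctx(ctx)->dsi_obj = obj;
    return 0;
}

static void demo_unregister(IOMMUContext *ctx, DualStageIOMMUObject *obj)
{
    (void)obj;
    demo_dev_from_ctx(ctx)->dsi_obj = NULL;
}

static IOMMUContextOps demo_ops = {
    .register_ds_iommu = demo_register,
    .unregister_ds_iommu = demo_unregister,
};

/* Same dispatch shape as iommu_context_register_ds_iommu() in the patch. */
static int demo_ctx_register(IOMMUContext *ctx, DualStageIOMMUObject *obj)
{
    if (!ctx || !obj) {
        return -ENOENT;
    }
    if (ctx->ops && ctx->ops->register_ds_iommu) {
        return ctx->ops->register_ds_iommu(ctx, obj);
    }
    return -ENOENT;
}

static void demo_ctx_unregister(IOMMUContext *ctx, DualStageIOMMUObject *obj)
{
    if (ctx && obj && ctx->ops && ctx->ops->unregister_ds_iommu) {
        ctx->ops->unregister_ds_iommu(ctx, obj);
    }
}
```

A passthru-side caller would set ctx.ops to &demo_ops, call
demo_ctx_register(&dev.ctx, &obj), and treat -ENOENT as "this vIOMMU
has no dual-stage support".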

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/iommu/Makefile.objs           |  1 +
 hw/iommu/iommu_context.c         | 54 +++++++++++++++++++++++++++++++++++
 include/hw/iommu/iommu_context.h | 61 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 116 insertions(+)
 create mode 100644 hw/iommu/iommu_context.c
 create mode 100644 include/hw/iommu/iommu_context.h

diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
index d4f3b39..1e45072 100644
--- a/hw/iommu/Makefile.objs
+++ b/hw/iommu/Makefile.objs
@@ -1 +1,2 @@
 obj-y += dual_stage_iommu.o
+obj-y += iommu_context.o
diff --git a/hw/iommu/iommu_context.c b/hw/iommu/iommu_context.c
new file mode 100644
index 0000000..6340ca3
--- /dev/null
+++ b/hw/iommu/iommu_context.c
@@ -0,0 +1,54 @@
+/*
+ * QEMU abstraction of vIOMMU context
+ *
+ * Copyright (C) 2020 Red Hat Inc.
+ *
+ * Authors: Peter Xu <peterx@redhat.com>,
+ *          Liu Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/iommu/iommu_context.h"
+
+int iommu_context_register_ds_iommu(IOMMUContext *iommu_ctx,
+                                    DualStageIOMMUObject *dsi_obj)
+{
+    if (!iommu_ctx || !dsi_obj) {
+        return -ENOENT;
+    }
+
+    if (iommu_ctx->ops && iommu_ctx->ops->register_ds_iommu) {
+        return iommu_ctx->ops->register_ds_iommu(iommu_ctx, dsi_obj);
+    }
+    return -ENOENT;
+}
+
+void iommu_context_unregister_ds_iommu(IOMMUContext *iommu_ctx,
+                                      DualStageIOMMUObject *dsi_obj)
+{
+    if (!iommu_ctx || !dsi_obj) {
+        return;
+    }
+
+    if (iommu_ctx->ops && iommu_ctx->ops->unregister_ds_iommu) {
+        iommu_ctx->ops->unregister_ds_iommu(iommu_ctx, dsi_obj);
+    }
+}
+
+void iommu_context_init(IOMMUContext *iommu_ctx, IOMMUContextOps *ops)
+{
+    iommu_ctx->ops = ops;
+}
diff --git a/include/hw/iommu/iommu_context.h b/include/hw/iommu/iommu_context.h
new file mode 100644
index 0000000..6f2ccb5
--- /dev/null
+++ b/include/hw/iommu/iommu_context.h
@@ -0,0 +1,61 @@
+/*
+ * QEMU abstraction of IOMMU Context
+ *
+ * Copyright (C) 2020 Red Hat Inc.
+ *
+ * Authors: Peter Xu <peterx@redhat.com>,
+ *          Liu, Yi L <yi.l.liu@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_IOMMU_CONTEXT_H
+#define HW_IOMMU_CONTEXT_H
+
+#include "qemu/queue.h"
+#ifndef CONFIG_USER_ONLY
+#include "exec/hwaddr.h"
+#endif
+#include "hw/iommu/dual_stage_iommu.h"
+
+typedef struct IOMMUContext IOMMUContext;
+typedef struct IOMMUContextOps IOMMUContextOps;
+
+struct IOMMUContextOps {
+    /*
+     * Register a DualStageIOMMUObject with the vIOMMU so that the
+     * vIOMMU is aware of the dual-stage translation capability and
+     * can set up dual-stage translation via the interfaces exposed
+     * by the DualStageIOMMUObject.
+     */
+    int (*register_ds_iommu)(IOMMUContext *iommu_ctx,
+                             DualStageIOMMUObject *dsi_obj);
+    void (*unregister_ds_iommu)(IOMMUContext *iommu_ctx,
+                                DualStageIOMMUObject *dsi_obj);
+};
+
+/*
+ * This is an abstraction of IOMMU context.
+ */
+struct IOMMUContext {
+    IOMMUContextOps *ops;
+};
+
+int iommu_context_register_ds_iommu(IOMMUContext *iommu_ctx,
+                                    DualStageIOMMUObject *dsi_obj);
+void iommu_context_unregister_ds_iommu(IOMMUContext *iommu_ctx,
+                                       DualStageIOMMUObject *dsi_obj);
+void iommu_context_init(IOMMUContext *iommu_ctx, IOMMUContextOps *ops);
+
+#endif
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 04/25] hw/pci: introduce pci_device_iommu_context()
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds pci_device_iommu_context() to get the IOMMUContext
for a given device, backed by a new callback in PCIIOMMUOps. Callers
that want to call into the vIOMMU can leverage it, and further vIOMMU
operations can then be done via the IOMMUContext.
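
As a rough sketch of how a caller might consume such a lookup, the
standalone example below models the "return the context or NULL"
contract with reduced stand-in types. It is not QEMU code. DemoBus,
DemoContext, DemoIOMMUOps and the demo_* functions are all hypothetical,
and the bus walk that pci_device_get_iommu_bus_devfn() performs is
deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for IOMMUContext; "registered" only exists for this demo. */
typedef struct DemoContext {
    int registered;
} DemoContext;

typedef struct DemoBus DemoBus;

/* Stand-in for PCIIOMMUOps with only the new callback. */
typedef struct DemoIOMMUOps {
    DemoContext *(*get_iommu_context)(DemoBus *bus, void *opaque, int devfn);
} DemoIOMMUOps;

struct DemoBus {
    const DemoIOMMUOps *iommu_ops;
    void *iommu_opaque;
};

/* Same shape as pci_device_iommu_context(): NULL when no callback is set. */
static DemoContext *demo_iommu_context(DemoBus *bus, int devfn)
{
    if (bus && bus->iommu_ops && bus->iommu_ops->get_iommu_context) {
        return bus->iommu_ops->get_iommu_context(bus, bus->iommu_opaque,
                                                 devfn);
    }
    return NULL;
}

/* Hypothetical passthru-side caller: must cope with a missing context. */
static int demo_passthru_attach(DemoBus *bus, int devfn)
{
    DemoContext *ctx = demo_iommu_context(bus, devfn);

    if (!ctx) {
        return -ENOENT;   /* no vIOMMU context, skip dual-stage setup */
    }
    ctx->registered = 1;  /* further vIOMMU operations would go here */
    return 0;
}

/* Hypothetical vIOMMU callback: hands back one context kept in opaque. */
static DemoContext *demo_get_ctx(DemoBus *bus, void *opaque, int devfn)
{
    (void)bus;
    (void)devfn;
    return opaque;
}

static const DemoIOMMUOps demo_ops = {
    .get_iommu_context = demo_get_ctx,
};
```

The point of the NULL return is that a device behind no vIOMMU, or
behind a vIOMMU that does not implement the callback, is not an error
for the PCI layer; the caller decides what a missing context means.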

Note:
The previous version got Reviewed-by tags from David Gibson and Peter
Xu, but this version changed slightly due to the patch rebase, so I did
not carry the tags over in case you two want to review it again. If you
have no objection, I will add your Reviewed-by in the next version.
Thanks.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/pci/pci.c         | 35 ++++++++++++++++++++++++++++++-----
 include/hw/pci/pci.h |  5 +++++
 2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e331a5c..d2cd810 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2638,7 +2638,8 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
     }
 }
 
-AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
+                              PCIBus **pbus, uint8_t *pdevfn)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
@@ -2683,14 +2684,38 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 
         iommu_bus = parent_bus;
     }
-    if (iommu_bus && iommu_bus->iommu_ops &&
-                     iommu_bus->iommu_ops->get_address_space) {
-        return iommu_bus->iommu_ops->get_address_space(bus,
-                                 iommu_bus->iommu_opaque, devfn);
+    *pbus = iommu_bus;
+    *pdevfn = devfn;
+}
+
+AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+{
+    PCIBus *bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
+    if (bus && bus->iommu_ops &&
+                     bus->iommu_ops->get_address_space) {
+        return bus->iommu_ops->get_address_space(bus,
+                                bus->iommu_opaque, devfn);
     }
     return &address_space_memory;
 }
 
+IOMMUContext *pci_device_iommu_context(PCIDevice *dev)
+{
+    PCIBus *bus;
+    uint8_t devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
+    if (bus && bus->iommu_ops &&
+                    bus->iommu_ops->get_iommu_context) {
+        return bus->iommu_ops->get_iommu_context(bus,
+                              bus->iommu_opaque, devfn);
+    }
+    return NULL;
+}
+
 void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
 {
     bus->iommu_ops = ops;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index dc89aa1..43e2023 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -9,6 +9,8 @@
 
 #include "hw/pci/pcie.h"
 
+#include "hw/iommu/iommu_context.h"
+
 extern bool pci_available;
 
 /* PCI bus */
@@ -488,9 +490,12 @@ typedef struct PCIIOMMUOps PCIIOMMUOps;
 struct PCIIOMMUOps {
     AddressSpace * (*get_address_space)(PCIBus *bus,
                                 void *opaque, int32_t devfn);
+    IOMMUContext * (*get_iommu_context)(PCIBus *bus,
+                                void *opaque, int32_t devfn);
 };
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
+IOMMUContext *pci_device_iommu_context(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
 
 static inline void
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 05/25] intel_iommu: provide get_iommu_context() callback
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds the get_iommu_context() callback to return an
IOMMUContext for the Intel VT-d platform.
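
The callback relies on a find-or-create lookup keyed by devfn. The
standalone sketch below shows that pattern in isolation. It is not the
intel_iommu.c code. The Demo* types, DEMO_DEVFN_MAX and the demo_*
functions are hypothetical, locking is omitted, and plain calloc()
stands in for g_malloc0().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* An 8-bit devfn gives 256 slots per bus. */
#define DEMO_DEVFN_MAX 256

/* Stand-in for the generic IOMMUContext. */
typedef struct DemoContext {
    int placeholder;
} DemoContext;

/* Stand-in for VTDIOMMUContext: per-device state embedding the context. */
typedef struct DemoDevContext {
    uint8_t devfn;
    DemoContext ctx;
} DemoDevContext;

/* Stand-in for VTDBus: one lazily filled slot per devfn. */
typedef struct DemoBus {
    DemoDevContext *dev_icx[DEMO_DEVFN_MAX];
} DemoBus;

/* Return the context for devfn, allocating it on first use. */
static DemoContext *demo_find_add_icx(DemoBus *bus, int devfn)
{
    DemoDevContext *icx;

    if (devfn < 0 || devfn >= DEMO_DEVFN_MAX) {
        return NULL;
    }
    icx = bus->dev_icx[devfn];
    if (!icx) {
        icx = calloc(1, sizeof(*icx));
        if (!icx) {
            return NULL;
        }
        icx->devfn = (uint8_t)devfn;
        bus->dev_icx[devfn] = icx;
    }
    return &icx->ctx;
}

/* Map the generic context back to its per-device wrapper. */
static DemoDevContext *demo_from_ctx(DemoContext *ctx)
{
    return (DemoDevContext *)((char *)ctx - offsetof(DemoDevContext, ctx));
}
```

Repeated lookups for the same devfn return the same pointer, which is
what lets a registration made through the generic context be found
again later from the per-device state.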

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c         | 92 ++++++++++++++++++++++++++++++++++++++++---
 include/hw/i386/intel_iommu.h | 15 ++++++-
 2 files changed, 100 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1a37e97..1c1eb7f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3347,22 +3347,35 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
     },
 };
 
-VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+/**
+ * Caller should hold iommu_lock.
+ */
+static VTDBus *vtd_find_add_bus(IntelIOMMUState *s, PCIBus *bus)
 {
     uintptr_t key = (uintptr_t)bus;
-    VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
-    VTDAddressSpace *vtd_dev_as;
-    char name[128];
+    VTDBus *vtd_bus;
 
+    vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
     if (!vtd_bus) {
         uintptr_t *new_key = g_malloc(sizeof(*new_key));
         *new_key = (uintptr_t)bus;
         /* No corresponding free() */
-        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
-                            PCI_DEVFN_MAX);
+        vtd_bus = g_malloc0(sizeof(VTDBus) + PCI_DEVFN_MAX * \
+                    (sizeof(VTDAddressSpace *) + sizeof(VTDIOMMUContext *)));
         vtd_bus->bus = bus;
         g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
     }
+    return vtd_bus;
+}
+
+VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+{
+    VTDBus *vtd_bus;
+    VTDAddressSpace *vtd_dev_as;
+    char name[128];
+
+    vtd_iommu_lock(s);
+    vtd_bus = vtd_find_add_bus(s, bus);
 
     vtd_dev_as = vtd_bus->dev_as[devfn];
 
@@ -3426,9 +3439,63 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
 
         vtd_switch_address_space(vtd_dev_as);
     }
+    vtd_iommu_unlock(s);
+
     return vtd_dev_as;
 }
 
+static int vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
+                                     DualStageIOMMUObject *dsi_obj)
+{
+    VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
+                                               VTDIOMMUContext,
+                                               iommu_context);
+
+    vtd_dev_icx->dsi_obj = dsi_obj;
+    return 0;
+}
+
+static void vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
+                                        DualStageIOMMUObject *dsi_obj)
+{
+    VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
+                                               VTDIOMMUContext,
+                                               iommu_context);
+
+    vtd_dev_icx->dsi_obj = NULL;
+}
+
+IOMMUContextOps vtd_iommu_context_ops = {
+    .register_ds_iommu = vtd_icx_register_ds_iommu,
+    .unregister_ds_iommu = vtd_icx_unregister_ds_iommu,
+};
+
+VTDIOMMUContext *vtd_find_add_icx(IntelIOMMUState *s,
+                                  PCIBus *bus, int devfn)
+{
+    VTDBus *vtd_bus;
+    VTDIOMMUContext *vtd_dev_icx;
+
+    vtd_iommu_lock(s);
+    vtd_bus = vtd_find_add_bus(s, bus);
+
+    vtd_dev_icx = vtd_bus->dev_icx[devfn];
+
+    if (!vtd_dev_icx) {
+        vtd_bus->dev_icx[devfn] = vtd_dev_icx =
+                    g_malloc0(sizeof(VTDIOMMUContext));
+        vtd_dev_icx->vtd_bus = vtd_bus;
+        vtd_dev_icx->devfn = (uint8_t)devfn;
+        vtd_dev_icx->iommu_state = s;
+        vtd_dev_icx->dsi_obj = NULL;
+        iommu_context_init(&vtd_dev_icx->iommu_context,
+                           &vtd_iommu_context_ops);
+    }
+    vtd_iommu_unlock(s);
+
+    return vtd_dev_icx;
+}
+
 static uint64_t get_naturally_aligned_size(uint64_t start,
                                            uint64_t size, int gaw)
 {
@@ -3722,8 +3789,21 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &vtd_as->as;
 }
 
+static IOMMUContext *vtd_dev_iommu_context(PCIBus *bus,
+                                           void *opaque, int devfn)
+{
+    IntelIOMMUState *s = opaque;
+    VTDIOMMUContext *vtd_icx;
+
+    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+    vtd_icx = vtd_find_add_icx(s, bus, devfn);
+    return &vtd_icx->iommu_context;
+}
+
 static PCIIOMMUOps vtd_iommu_ops = {
     .get_address_space = vtd_host_dma_iommu,
+    .get_iommu_context = vtd_dev_iommu_context,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 66b931e..8571a85 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -68,6 +68,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
+typedef struct VTDIOMMUContext VTDIOMMUContext;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -116,9 +117,20 @@ struct VTDAddressSpace {
     IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
 };
 
+struct VTDIOMMUContext {
+    VTDBus *vtd_bus;
+    uint8_t devfn;
+    IOMMUContext iommu_context;
+    DualStageIOMMUObject *dsi_obj;
+    IntelIOMMUState *iommu_state;
+};
+
 struct VTDBus {
     PCIBus* bus;		/* A reference to the bus to provide translation for */
-    VTDAddressSpace *dev_as[0];	/* A table of VTDAddressSpace objects indexed by devfn */
+    /* A table of VTDAddressSpace objects indexed by devfn */
+    VTDAddressSpace *dev_as[PCI_DEVFN_MAX];
+    /* A table of VTDIOMMUContext objects indexed by devfn */
+    VTDIOMMUContext *dev_icx[PCI_DEVFN_MAX];
 };
 
 struct VTDIOTLBEntry {
@@ -282,5 +294,6 @@ struct IntelIOMMUState {
  * create a new one if none exists
  */
 VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn);
+VTDIOMMUContext *vtd_find_add_icx(IntelIOMMUState *s, PCIBus *bus, int devfn);
 
 #endif
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 05/25] intel_iommu: provide get_iommu_context() callback
@ 2020-01-29 12:16   ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: kevin.tian, yi.l.liu, Yi Sun, Eduardo Habkost, kvm, mst,
	jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, Richard Henderson,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds the get_iommu_context() callback to return an
IOMMUContext for the Intel VT-d platform.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c         | 92 ++++++++++++++++++++++++++++++++++++++++---
 include/hw/i386/intel_iommu.h | 15 ++++++-
 2 files changed, 100 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1a37e97..1c1eb7f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3347,22 +3347,35 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
     },
 };
 
-VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+/**
+ * Caller should hold iommu_lock.
+ */
+static VTDBus *vtd_find_add_bus(IntelIOMMUState *s, PCIBus *bus)
 {
     uintptr_t key = (uintptr_t)bus;
-    VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
-    VTDAddressSpace *vtd_dev_as;
-    char name[128];
+    VTDBus *vtd_bus;
 
+    vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
     if (!vtd_bus) {
         uintptr_t *new_key = g_malloc(sizeof(*new_key));
         *new_key = (uintptr_t)bus;
         /* No corresponding free() */
-        vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
-                            PCI_DEVFN_MAX);
+        /* dev_as[] and dev_icx[] are now sized inside VTDBus */
+        vtd_bus = g_malloc0(sizeof(VTDBus));
         vtd_bus->bus = bus;
         g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
     }
+    return vtd_bus;
+}
+
+VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+{
+    VTDBus *vtd_bus;
+    VTDAddressSpace *vtd_dev_as;
+    char name[128];
+
+    vtd_iommu_lock(s);
+    vtd_bus = vtd_find_add_bus(s, bus);
 
     vtd_dev_as = vtd_bus->dev_as[devfn];
 
@@ -3426,9 +3439,63 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
 
         vtd_switch_address_space(vtd_dev_as);
     }
+    vtd_iommu_unlock(s);
+
     return vtd_dev_as;
 }
 
+static int vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
+                                     DualStageIOMMUObject *dsi_obj)
+{
+    VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
+                                               VTDIOMMUContext,
+                                               iommu_context);
+
+    vtd_dev_icx->dsi_obj = dsi_obj;
+    return 0;
+}
+
+static void vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
+                                        DualStageIOMMUObject *dsi_obj)
+{
+    VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
+                                               VTDIOMMUContext,
+                                               iommu_context);
+
+    vtd_dev_icx->dsi_obj = NULL;
+}
+
+IOMMUContextOps vtd_iommu_context_ops = {
+    .register_ds_iommu = vtd_icx_register_ds_iommu,
+    .unregister_ds_iommu = vtd_icx_unregister_ds_iommu,
+};
+
+VTDIOMMUContext *vtd_find_add_icx(IntelIOMMUState *s,
+                                  PCIBus *bus, int devfn)
+{
+    VTDBus *vtd_bus;
+    VTDIOMMUContext *vtd_dev_icx;
+
+    vtd_iommu_lock(s);
+    vtd_bus = vtd_find_add_bus(s, bus);
+
+    vtd_dev_icx = vtd_bus->dev_icx[devfn];
+
+    if (!vtd_dev_icx) {
+        vtd_bus->dev_icx[devfn] = vtd_dev_icx =
+                    g_malloc0(sizeof(VTDIOMMUContext));
+        vtd_dev_icx->vtd_bus = vtd_bus;
+        vtd_dev_icx->devfn = (uint8_t)devfn;
+        vtd_dev_icx->iommu_state = s;
+        vtd_dev_icx->dsi_obj = NULL;
+        iommu_context_init(&vtd_dev_icx->iommu_context,
+                           &vtd_iommu_context_ops);
+    }
+    vtd_iommu_unlock(s);
+
+    return vtd_dev_icx;
+}
+
 static uint64_t get_naturally_aligned_size(uint64_t start,
                                            uint64_t size, int gaw)
 {
@@ -3722,8 +3789,21 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
     return &vtd_as->as;
 }
 
+static IOMMUContext *vtd_dev_iommu_context(PCIBus *bus,
+                                           void *opaque, int devfn)
+{
+    IntelIOMMUState *s = opaque;
+    VTDIOMMUContext *vtd_icx;
+
+    assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+    vtd_icx = vtd_find_add_icx(s, bus, devfn);
+    return &vtd_icx->iommu_context;
+}
+
 static PCIIOMMUOps vtd_iommu_ops = {
     .get_address_space = vtd_host_dma_iommu,
+    .get_iommu_context = vtd_dev_iommu_context,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 66b931e..8571a85 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -68,6 +68,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
+typedef struct VTDIOMMUContext VTDIOMMUContext;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -116,9 +117,20 @@ struct VTDAddressSpace {
     IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
 };
 
+struct VTDIOMMUContext {
+    VTDBus *vtd_bus;
+    uint8_t devfn;
+    IOMMUContext iommu_context;
+    DualStageIOMMUObject *dsi_obj;
+    IntelIOMMUState *iommu_state;
+};
+
 struct VTDBus {
     PCIBus* bus;		/* A reference to the bus to provide translation for */
-    VTDAddressSpace *dev_as[0];	/* A table of VTDAddressSpace objects indexed by devfn */
+    /* A table of VTDAddressSpace objects indexed by devfn */
+    VTDAddressSpace *dev_as[PCI_DEVFN_MAX];
+    /* A table of VTDIOMMUContext objects indexed by devfn */
+    VTDIOMMUContext *dev_icx[PCI_DEVFN_MAX];
 };
 
 struct VTDIOTLBEntry {
@@ -282,5 +294,6 @@ struct IntelIOMMUState {
  * create a new one if none exists
  */
 VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn);
+VTDIOMMUContext *vtd_find_add_icx(IntelIOMMUState *s, PCIBus *bus, int devfn);
 
 #endif
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 06/25] scripts/update-linux-headers: Import iommu.h
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Cornelia Huck

From: Eric Auger <eric.auger@redhat.com>

Update the script to import the new iommu.h uapi header.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 scripts/update-linux-headers.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index f76d773..dfdfdfd 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -141,7 +141,7 @@ done
 
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
-for header in kvm.h vfio.h vfio_ccw.h vhost.h \
+for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
               psci.h psp-sev.h userfaultfd.h mman.h; do
     cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 07/25] header file update VFIO/IOMMU vSVA APIs
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Cornelia Huck

From: Liu Yi L <yi.l.liu@intel.com>

The kernel uapi/linux/iommu.h header file includes the
extensions for vSVA support, e.g. the user structures for
bind gpasid and IOMMU fault reporting.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 linux-headers/linux/iommu.h | 372 ++++++++++++++++++++++++++++++++++++++++++++
 linux-headers/linux/vfio.h  | 148 ++++++++++++++++++
 2 files changed, 520 insertions(+)
 create mode 100644 linux-headers/linux/iommu.h
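
Note on the cache/granularity table documented for struct
iommu_cache_invalidate_info in the header below: a caller could check a
request against that table before issuing it. The sketch below is one
way to encode the check, under stated assumptions. The constants are
local copies of the values in the new header, the names INV_TYPE_*,
GRANU_* and inv_request_valid() are hypothetical, and rejecting an
empty cache mask is an assumption of this sketch rather than something
the header states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the IOMMU_CACHE_INV_TYPE_* bit values. */
#define INV_TYPE_IOTLB      (1 << 0)
#define INV_TYPE_DEV_IOTLB  (1 << 1)
#define INV_TYPE_PASID      (1 << 2)
#define INV_TYPE_ALL        (INV_TYPE_IOTLB | INV_TYPE_DEV_IOTLB | \
                             INV_TYPE_PASID)

/* Same order as enum iommu_inv_granularity. */
enum { GRANU_DOMAIN, GRANU_PASID, GRANU_ADDR, GRANU_NR };

/* Cache types that accept each granularity, row by row from the table. */
static const uint8_t granu_allowed[GRANU_NR] = {
    [GRANU_DOMAIN] = INV_TYPE_IOTLB | INV_TYPE_PASID,
    [GRANU_PASID]  = INV_TYPE_ALL,
    [GRANU_ADDR]   = INV_TYPE_IOTLB | INV_TYPE_DEV_IOTLB,
};

static bool inv_request_valid(uint8_t cache, uint8_t granularity)
{
    /* Unknown granularity, unknown cache bits, or nothing selected. */
    if (granularity >= GRANU_NR || cache == 0 || (cache & ~INV_TYPE_ALL)) {
        return false;
    }
    /* Every selected cache type must support the granularity. */
    return (cache & ~granu_allowed[granularity]) == 0;
}
```

Keeping the table as a per-granularity bitmask means the "all selected
caches must support the used granularity" rule is a single mask test.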

diff --git a/linux-headers/linux/iommu.h b/linux-headers/linux/iommu.h
new file mode 100644
index 0000000..04cc4b0
--- /dev/null
+++ b/linux-headers/linux/iommu.h
@@ -0,0 +1,372 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ */
+
+#ifndef _IOMMU_H
+#define _IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * Current version of the IOMMU user API. This is intended for query
+ * between user and kernel to determine compatible data structures.
+ *
+ * Having a single UAPI version to govern the user-kernel data structures
+ * makes compatibility check straightforward. On the contrary, supporting
+ * combinations of multiple versions of the data can be a nightmare.
+ *
+ * UAPI version can be bumped up with the following rules:
+ * 1. All data structures passed between user and kernel space share
+ *    the same version number. i.e. any extension to any structure
+ *    results in version bump up.
+ *
+ * 2. Data structures are open to extension but closed to modification.
+ *    New fields must be added at the end of each data structure with
+ *    64bit alignment. Flag bits can be added without size change but
+ *    existing ones cannot be altered.
+ *
+ * 3. Versions are backward compatible.
+ *
+ * 4. Version to size lookup is supported by kernel internal API for each
+ *    API function type. @version is mandatory for new data structures
+ *    and must be at the beginning with type of __u32.
+ */
+#define IOMMU_UAPI_VERSION	1
+static __inline__ int iommu_get_uapi_version(void)
+{
+	return IOMMU_UAPI_VERSION;
+}
+
+/*
+ * Supported UAPI features that can be reported to user space.
+ * These types represent the capability available in the kernel.
+ *
+ * REVISIT: UAPI version also implies the capabilities. Should we
+ * report them explicitly?
+ */
+enum IOMMU_UAPI_DATA_TYPES {
+	IOMMU_UAPI_BIND_GPASID,
+	IOMMU_UAPI_CACHE_INVAL,
+	IOMMU_UAPI_PAGE_RESP,
+	NR_IOMMU_UAPI_TYPE,
+};
+
+#define IOMMU_UAPI_CAP_MASK ((1 << IOMMU_UAPI_BIND_GPASID) |	\
+				(1 << IOMMU_UAPI_CACHE_INVAL) |	\
+				(1 << IOMMU_UAPI_PAGE_RESP))
+
+#define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
+#define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
+#define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
+#define IOMMU_FAULT_PERM_PRIV	(1 << 3) /* privileged */
+
+/* Generic fault types, can be expanded to cover IRQ remapping faults */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* Could not access the PASID table (fetch caused external abort) */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/* PASID entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/*
+	 * An external abort occurred fetching (or updating) a translation
+	 * table descriptor
+	 */
+	IOMMU_FAULT_REASON_WALK_EABT,
+
+	/*
+	 * Could not access the page table entry (Bad address),
+	 * actual translation fault
+	 */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+
+	/* access flag check failed */
+	IOMMU_FAULT_REASON_ACCESS,
+
+	/* Output address of a translation stage caused Address Size fault */
+	IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * struct iommu_fault_unrecoverable - Unrecoverable fault data
+ * @reason: reason of the fault, from &enum iommu_fault_reason
+ * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
+ * @pasid: Process Address Space ID
+ * @perm: permissions requested by the incoming transaction
+ *        (IOMMU_FAULT_PERM_* values)
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ */
+struct iommu_fault_unrecoverable {
+	__u32	reason;
+#define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
+#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	perm;
+	__u64	addr;
+	__u64	fetch_addr;
+};
+
+/**
+ * struct iommu_fault_page_request - Page Request data
+ * @flags: encodes whether the corresponding fields are valid and whether this
+ *         is the last page in group (IOMMU_FAULT_PAGE_REQUEST_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_fault_page_request {
+#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
+#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
+#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	perm;
+	__u64	addr;
+	__u64	private_data[2];
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ * @type: fault type from &enum iommu_fault_type
+ * @padding: reserved for future use (should be zero)
+ * @event: fault event, when @type is %IOMMU_FAULT_DMA_UNRECOV
+ * @prm: Page Request message, when @type is %IOMMU_FAULT_PAGE_REQ
+ * @padding2: sets the fault size to allow for future extensions
+ */
+struct iommu_fault {
+	__u32	type;
+	__u32	padding;
+	union {
+		struct iommu_fault_unrecoverable event;
+		struct iommu_fault_page_request prm;
+		__u8 padding2[56];
+	};
+};
+
+/**
+ * enum iommu_page_response_code - Return status of fault handlers
+ * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
+ *	populated, retry the access. This is "Success" in PCI PRI.
+ * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
+ *	this device if possible. This is "Response Failure" in PCI PRI.
+ * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
+ *	access. This is "Invalid Request" in PCI PRI.
+ */
+enum iommu_page_response_code {
+	IOMMU_PAGE_RESP_SUCCESS = 0,
+	IOMMU_PAGE_RESP_INVALID,
+	IOMMU_PAGE_RESP_FAILURE,
+};
+
+/**
+ * struct iommu_page_response - Generic page response information
+ * @version: IOMMU_UAPI_VERSION
+ * @flags: encodes whether the corresponding fields are valid
+ *         (IOMMU_FAULT_PAGE_RESPONSE_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @code: response code from &enum iommu_page_response_code
+ */
+struct iommu_page_response {
+	__u32	version;
+#define IOMMU_PAGE_RESP_PASID_VALID	(1 << 0)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	code;
+};
+
+/* defines the granularity of the invalidation */
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
+	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
+	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
+	IOMMU_INV_GRANU_NR,	/* number of invalidation granularities */
+};
+
+/**
+ * struct iommu_inv_addr_info - Address Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the address-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If ARCHID bit is set, @archid is populated and the invalidation relates
+ *   to cache entries tagged with this architecture specific ID and matching
+ *   the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - If neither PASID nor ARCHID is set, global addr invalidation applies.
+ * - The LEAF flag indicates whether only the leaf PTE caching needs to be
+ *   invalidated and other paging structure caches can be preserved.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ * @addr: first stage/level input address
+ * @granule_size: page/block size of the mapping in bytes
+ * @nb_granules: number of contiguous granules to be invalidated
+ */
+struct iommu_inv_addr_info {
+#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
+#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+	__u64	addr;
+	__u64	granule_size;
+	__u64	nb_granules;
+};
+
+/**
+ * struct iommu_inv_pasid_info - PASID Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the PASID-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If the ARCHID bit is set, the @archid is populated and the invalidation
+ *   relates to cache entries tagged with this architecture specific ID and
+ *   matching the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - At least one of PASID or ARCHID must be set.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ */
+struct iommu_inv_pasid_info {
+#define IOMMU_INV_PASID_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_PASID_FLAGS_ARCHID	(1 << 1)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+};
+
+/**
+ * struct iommu_cache_invalidate_info - First level/stage invalidation
+ *     information
+ * @version: IOMMU_UAPI_VERSION
+ * @cache: bitfield that allows to select which caches to invalidate
+ * @granularity: defines the lowest granularity used for the invalidation:
+ *     domain > PASID > addr
+ * @padding: reserved for future use (should be zero)
+ * @pasid_info: invalidation data when @granularity is %IOMMU_INV_GRANU_PASID
+ * @addr_info: invalidation data when @granularity is %IOMMU_INV_GRANU_ADDR
+ *
+ * Not all the combinations of cache/granularity are valid:
+ *
+ * +--------------+---------------+---------------+---------------+
+ * | type /       |   DEV_IOTLB   |     IOTLB     |      PASID    |
+ * | granularity  |               |               |      cache    |
+ * +==============+===============+===============+===============+
+ * | DOMAIN       |       N/A     |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | PASID        |       Y       |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | ADDR         |       Y       |       Y       |       N/A     |
+ * +--------------+---------------+---------------+---------------+
+ *
+ * Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
+ * @version and @cache.
+ *
+ * If multiple cache types are invalidated simultaneously, they all
+ * must support the used granularity.
+ */
+struct iommu_cache_invalidate_info {
+	__u32	version;
+/* IOMMU paging structure cache */
+#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
+#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
+#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
+#define IOMMU_CACHE_INV_TYPE_NR		(3)
+	__u8	cache;
+	__u8	granularity;
+	__u8	padding[2];
+	union {
+		struct iommu_inv_pasid_info pasid_info;
+		struct iommu_inv_addr_info addr_info;
+	};
+};
+
+/**
+ * struct iommu_gpasid_bind_data_vtd - Intel VT-d specific data on device and guest
+ * SVA binding.
+ *
+ * @flags:	VT-d PASID table entry attributes
+ * @pat:	Page attribute table data to compute effective memory type
+ * @emt:	Extended memory type
+ *
+ * Only guest vIOMMU selectable and effective options are passed down to
+ * the host IOMMU.
+ */
+struct iommu_gpasid_bind_data_vtd {
+#define IOMMU_SVA_VTD_GPASID_SRE	(1 << 0) /* supervisor request */
+#define IOMMU_SVA_VTD_GPASID_EAFE	(1 << 1) /* extended access enable */
+#define IOMMU_SVA_VTD_GPASID_PCD	(1 << 2) /* page-level cache disable */
+#define IOMMU_SVA_VTD_GPASID_PWT	(1 << 3) /* page-level write through */
+#define IOMMU_SVA_VTD_GPASID_EMTE	(1 << 4) /* extended mem type enable */
+#define IOMMU_SVA_VTD_GPASID_CD		(1 << 5) /* PASID-level cache disable */
+	__u64 flags;
+	__u32 pat;
+	__u32 emt;
+};
+#define IOMMU_SVA_VTD_GPASID_EMT_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
+					 IOMMU_SVA_VTD_GPASID_EMTE | \
+					 IOMMU_SVA_VTD_GPASID_PCD |  \
+					 IOMMU_SVA_VTD_GPASID_PWT)
+/**
+ * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
+ * @version:	IOMMU_UAPI_VERSION
+ * @format:	PASID table entry format
+ * @flags:	Additional information on guest bind request
+ * @gpgd:	Guest page directory base of the guest mm to bind
+ * @hpasid:	Process address space ID used for the guest mm in host IOMMU
+ * @gpasid:	Process address space ID used for the guest mm in guest IOMMU
+ * @addr_width:	Guest virtual address width
+ * @padding:	Reserved for future use (should be zero)
+ * @vtd:	Intel VT-d specific data
+ *
+ * Guest to host PASID mapping can be an identity or non-identity, where guest
+ * has its own PASID space. For non-identity mapping, guest to host PASID lookup
+ * is needed when VM programs guest PASID into an assigned device. VMM may
+ * trap such PASID programming then request host IOMMU driver to convert guest
+ * PASID to host PASID based on this bind data.
+ */
+struct iommu_gpasid_bind_data {
+	__u32 version;
+#define IOMMU_PASID_FORMAT_INTEL_VTD	1
+	__u32 format;
+#define IOMMU_SVA_GPASID_VAL	(1 << 0) /* guest PASID valid */
+	__u64 flags;
+	__u64 gpgd;
+	__u64 hpasid;
+	__u64 gpasid;
+	__u32 addr_width;
+	__u8  padding[12];
+	/* Vendor specific data */
+	union {
+		struct iommu_gpasid_bind_data_vtd vtd;
+	};
+};
+
+#endif /* _IOMMU_H */
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index fb10370..9670002 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -748,6 +749,13 @@ struct vfio_iommu_type1_info_cap_iova_range {
 	struct	vfio_iova_range iova_ranges[];
 };
 
+#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
+
+struct vfio_iommu_type1_info_cap_nesting {
+	struct	vfio_info_cap_header header;
+	__u32	pasid_format;
+};
+
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
 
 /**
@@ -794,6 +802,146 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/*
+ * PASID (Process Address Space ID) is a PCIe concept which
+ * has been extended to support fine-grained DMA isolation.
+ * With device assigned to user space (e.g. VMs), PASID alloc
+ * and free need to be system wide. This structure defines
+ * the info for pasid alloc/free between user space and kernel
+ * space.
+ *
+ * @flags=VFIO_IOMMU_PASID_ALLOC, refer to @alloc_pasid
+ * @flags=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
+ */
+struct vfio_iommu_type1_pasid_request {
+	__u32	argsz;
+#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
+#define VFIO_IOMMU_PASID_FREE	(1 << 1)
+	__u32	flags;
+	union {
+		struct {
+			__u32 min;
+			__u32 max;
+			__u32 result;
+		} alloc_pasid;
+		__u32 free_pasid;
+	};
+};
+
+#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
+					 VFIO_IOMMU_PASID_FREE)
+
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *				struct vfio_iommu_type1_pasid_request)
+ *
+ * Availability of this feature depends on PASID support in the device,
+ * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
+ * is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
+/**
+ * @quota: the new PASID quota configured for a userspace application
+ * (e.g. a VM).
+ */
+struct vfio_iommu_type1_pasid_quota {
+	__u32	argsz;
+	__u32	flags;
+	__u32	quota;
+};
+
+/**
+ * VFIO_IOMMU_SET_PASID_QUOTA - _IOW(VFIO_TYPE, VFIO_BASE + 23,
+ *				struct vfio_iommu_type1_pasid_quota)
+ *
+ * Availability of this feature depends on PASID support in the device,
+ * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
+ * is available after VFIO_SET_IOMMU.
+ *
+ * returns: latest quota on success, -errno on failure.
+ */
+#define VFIO_IOMMU_SET_PASID_QUOTA	_IO(VFIO_TYPE, VFIO_BASE + 23)
+
+/**
+ * VFIO_NESTING_GET_IOMMU_UAPI_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 24)
+ *
+ * Report the version of the IOMMU UAPI when dual stage IOMMU is supported.
+ * In VFIO, it is needed for VFIO_TYPE1_NESTING_IOMMU.
+ * Availability: Always.
+ * Return: IOMMU UAPI version
+ */
+#define VFIO_NESTING_GET_IOMMU_UAPI_VERSION	_IO(VFIO_TYPE, VFIO_BASE + 24)
+
+/**
+ * Supported flags:
+ *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
+ *			nesting type IOMMUs. The @data field takes struct
+ *			iommu_gpasid_bind_data.
+ *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
+ *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
+ *
+ */
+struct vfio_iommu_type1_bind {
+	__u32		argsz;
+	__u32		flags;
+#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
+#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
+	__u8		data[];
+};
+
+#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
+					VFIO_IOMMU_UNBIND_GUEST_PGTBL)
+
+/**
+ * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 25,
+ *				struct vfio_iommu_type1_bind)
+ *
+ * Manage address spaces of devices in this container. Initially a TYPE1
+ * container can only have one address space, managed with
+ * VFIO_IOMMU_MAP/UNMAP_DMA.
+ *
+ * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
+ * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
+ * tables, and BIND manages the stage-1 (guest) page tables. Other types of
+ * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
+ * traffic that only requires single stage translation while BIND controls
+ * traffic that requires nesting translation. But this depends on the
+ * underlying IOMMU architecture and isn't guaranteed. An example is guest
+ * SVA traffic, which needs nesting translation to perform gVA->gPA and then
+ * gPA->hPA translation.
+ *
+ * Availability of this feature depends on the device, its bus, the underlying
+ * IOMMU and the CPU architecture.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 25)
+
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 26,
+ *			struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host. The cache
+ * invalidation information is conveyed by @cache_info, whose content
+ * format is defined by structures in uapi/linux/iommu.h. Users should
+ * be aware that struct iommu_cache_invalidate_info has a @version
+ * field, which vfio needs to parse before getting the rest of the
+ * data from userspace.
+ *
+ * Availability of this IOCTL is after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct	iommu_cache_invalidate_info cache_info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 26)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 07/25] header file update VFIO/IOMMU vSVA APIs
@ 2020-01-29 12:16   ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: kevin.tian, yi.l.liu, Yi Sun, kvm, mst, jun.j.tian,
	Cornelia Huck, eric.auger, yi.y.sun, Jacob Pan, hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

The kernel uapi/linux/iommu.h header file includes the
extensions for vSVA support, e.g. the user structures for
bind gpasid and IOMMU fault reporting.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 linux-headers/linux/iommu.h | 372 ++++++++++++++++++++++++++++++++++++++++++++
 linux-headers/linux/vfio.h  | 148 ++++++++++++++++++
 2 files changed, 520 insertions(+)
 create mode 100644 linux-headers/linux/iommu.h
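
Note on struct iommu_fault in the header below: the padding2[56] member
fixes the size of the union so that either payload can grow later
without changing the overall record size. The sketch below checks that
arithmetic at compile time. It uses local mirror structs built from
<stdint.h> types instead of the real header, so the struct names and
the fault_spare_bytes() helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fault_unrecoverable {    /* mirrors struct iommu_fault_unrecoverable */
    uint32_t reason, flags, pasid, perm;
    uint64_t addr, fetch_addr;
};

struct fault_page_request {     /* mirrors struct iommu_fault_page_request */
    uint32_t flags, pasid, grpid, perm;
    uint64_t addr;
    uint64_t private_data[2];
};

struct fault {                  /* mirrors struct iommu_fault */
    uint32_t type;
    uint32_t padding;
    union {
        struct fault_unrecoverable event;
        struct fault_page_request prm;
        uint8_t padding2[56];
    };
};

/* Both payloads fit inside the 56 reserved bytes. */
static_assert(sizeof(struct fault_unrecoverable) == 32, "event size");
static_assert(sizeof(struct fault_page_request) == 40, "prm size");
static_assert(sizeof(struct fault) == 64, "fixed fault record size");
static_assert(offsetof(struct fault, event) == 8, "payload offset");

/* Bytes still free in the union for future extension of the payloads. */
static size_t fault_spare_bytes(void)
{
    return sizeof(((struct fault *)0)->padding2) -
           sizeof(struct fault_page_request);
}
```

With the current layouts the larger payload is the page request, which
leaves 16 bytes of headroom inside the union.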

diff --git a/linux-headers/linux/iommu.h b/linux-headers/linux/iommu.h
new file mode 100644
index 0000000..04cc4b0
--- /dev/null
+++ b/linux-headers/linux/iommu.h
@@ -0,0 +1,372 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ */
+
+#ifndef _IOMMU_H
+#define _IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * Current version of the IOMMU user API. This is intended for query
+ * between user and kernel to determine compatible data structures.
+ *
+ * Having a single UAPI version to govern the user-kernel data structures
+ * makes compatibility check straightforward. On the contrary, supporting
+ * combinations of multiple versions of the data can be a nightmare.
+ *
+ * UAPI version can be bumped up with the following rules:
+ * 1. All data structures passed between user and kernel space share
+ *    the same version number. i.e. any extension to any structure
+ *    results in version bump up.
+ *
+ * 2. Data structures are open to extension but closed to modification.
+ *    New fields must be added at the end of each data structure with
+ *    64bit alignment. Flag bits can be added without size change but
+ *    existing ones cannot be altered.
+ *
+ * 3. Versions are backward compatible.
+ *
+ * 4. Version to size lookup is supported by kernel internal API for each
+ *    API function type. @version is mandatory for new data structures
+ *    and must be at the beginning with type of __u32.
+ */
+#define IOMMU_UAPI_VERSION	1
+static __inline__ int iommu_get_uapi_version(void)
+{
+	return IOMMU_UAPI_VERSION;
+}
+
+/*
+ * Supported UAPI features that can be reported to user space.
+ * These types represent the capability available in the kernel.
+ *
+ * REVISIT: UAPI version also implies the capabilities. Should we
+ * report them explicitly?
+ */
+enum IOMMU_UAPI_DATA_TYPES {
+	IOMMU_UAPI_BIND_GPASID,
+	IOMMU_UAPI_CACHE_INVAL,
+	IOMMU_UAPI_PAGE_RESP,
+	NR_IOMMU_UAPI_TYPE,
+};
+
+#define IOMMU_UAPI_CAP_MASK ((1 << IOMMU_UAPI_BIND_GPASID) |	\
+				(1 << IOMMU_UAPI_CACHE_INVAL) |	\
+				(1 << IOMMU_UAPI_PAGE_RESP))
+
+#define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
+#define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
+#define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
+#define IOMMU_FAULT_PERM_PRIV	(1 << 3) /* privileged */
+
+/* Generic fault types; can be expanded to cover IRQ remapping faults */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* Could not access the PASID table (fetch caused external abort) */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/* PASID entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/*
+	 * An external abort occurred fetching (or updating) a translation
+	 * table descriptor
+	 */
+	IOMMU_FAULT_REASON_WALK_EABT,
+
+	/*
+	 * Could not access the page table entry (Bad address),
+	 * actual translation fault
+	 */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+
+	/* access flag check failed */
+	IOMMU_FAULT_REASON_ACCESS,
+
+	/* Output address of a translation stage caused Address Size fault */
+	IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * struct iommu_fault_unrecoverable - Unrecoverable fault data
+ * @reason: reason of the fault, from &enum iommu_fault_reason
+ * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
+ * @pasid: Process Address Space ID
+ * @perm: access permissions requested by the incoming transaction
+ *        (IOMMU_FAULT_PERM_* values)
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ */
+struct iommu_fault_unrecoverable {
+	__u32	reason;
+#define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
+#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	perm;
+	__u64	addr;
+	__u64	fetch_addr;
+};
+
+/**
+ * struct iommu_fault_page_request - Page Request data
+ * @flags: encodes whether the corresponding fields are valid and whether this
+ *         is the last page in group (IOMMU_FAULT_PAGE_REQUEST_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_fault_page_request {
+#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
+#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
+#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	perm;
+	__u64	addr;
+	__u64	private_data[2];
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ * @type: fault type from &enum iommu_fault_type
+ * @padding: reserved for future use (should be zero)
+ * @event: fault event, when @type is %IOMMU_FAULT_DMA_UNRECOV
+ * @prm: Page Request message, when @type is %IOMMU_FAULT_PAGE_REQ
+ * @padding2: sets the fault size to allow for future extensions
+ */
+struct iommu_fault {
+	__u32	type;
+	__u32	padding;
+	union {
+		struct iommu_fault_unrecoverable event;
+		struct iommu_fault_page_request prm;
+		__u8 padding2[56];
+	};
+};
+
+/**
+ * enum iommu_page_response_code - Return status of fault handlers
+ * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
+ *	populated, retry the access. This is "Success" in PCI PRI.
+ * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
+ *	this device if possible. This is "Response Failure" in PCI PRI.
+ * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
+ *	access. This is "Invalid Request" in PCI PRI.
+ */
+enum iommu_page_response_code {
+	IOMMU_PAGE_RESP_SUCCESS = 0,
+	IOMMU_PAGE_RESP_INVALID,
+	IOMMU_PAGE_RESP_FAILURE,
+};
+
+/**
+ * struct iommu_page_response - Generic page response information
+ * @version: IOMMU_UAPI_VERSION
+ * @flags: encodes whether the corresponding fields are valid
+ *         (IOMMU_PAGE_RESP_*_VALID values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @code: response code from &enum iommu_page_response_code
+ */
+struct iommu_page_response {
+	__u32	version;
+#define IOMMU_PAGE_RESP_PASID_VALID	(1 << 0)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	code;
+};
+
+/* defines the granularity of the invalidation */
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
+	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
+	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
+	IOMMU_INV_GRANU_NR,	/* number of invalidation granularities */
+};
+
+/**
+ * struct iommu_inv_addr_info - Address Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the address-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If ARCHID bit is set, @archid is populated and the invalidation relates
+ *   to cache entries tagged with this architecture specific ID and matching
+ *   the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - If neither PASID nor ARCHID is set, global addr invalidation applies.
+ * - The LEAF flag indicates whether only the leaf PTE caching needs to be
+ *   invalidated and other paging structure caches can be preserved.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ * @addr: first stage/level input address
+ * @granule_size: page/block size of the mapping in bytes
+ * @nb_granules: number of contiguous granules to be invalidated
+ */
+struct iommu_inv_addr_info {
+#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
+#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+	__u64	addr;
+	__u64	granule_size;
+	__u64	nb_granules;
+};
+
+/**
+ * struct iommu_inv_pasid_info - PASID Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the PASID-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the
+ *   invalidation relates to cache entries tagged with this PASID,
+ *   regardless of address.
+ * - If the ARCHID bit is set, the @archid field is populated and the
+ *   invalidation relates to cache entries tagged with this
+ *   architecture-specific ID, regardless of address.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - At least one of PASID or ARCHID must be set.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ */
+struct iommu_inv_pasid_info {
+#define IOMMU_INV_PASID_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_PASID_FLAGS_ARCHID	(1 << 1)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+};
+
+/**
+ * struct iommu_cache_invalidate_info - First level/stage invalidation
+ *     information
+ * @version: IOMMU_UAPI_VERSION
+ * @cache: bitfield selecting which caches to invalidate
+ * @granularity: defines the lowest granularity used for the invalidation:
+ *     domain > PASID > addr
+ * @padding: reserved for future use (should be zero)
+ * @pasid_info: invalidation data when @granularity is %IOMMU_INV_GRANU_PASID
+ * @addr_info: invalidation data when @granularity is %IOMMU_INV_GRANU_ADDR
+ *
+ * Not all the combinations of cache/granularity are valid:
+ *
+ * +--------------+---------------+---------------+---------------+
+ * | type /       |   DEV_IOTLB   |     IOTLB     |      PASID    |
+ * | granularity  |               |               |      cache    |
+ * +==============+===============+===============+===============+
+ * | DOMAIN       |       N/A     |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | PASID        |       Y       |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | ADDR         |       Y       |       Y       |       N/A     |
+ * +--------------+---------------+---------------+---------------+
+ *
+ * Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
+ * @version and @cache.
+ *
+ * If multiple cache types are invalidated simultaneously, they all
+ * must support the used granularity.
+ */
+struct iommu_cache_invalidate_info {
+	__u32	version;
+/* IOMMU paging structure cache */
+#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
+#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
+#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
+#define IOMMU_CACHE_INV_TYPE_NR		(3)
+	__u8	cache;
+	__u8	granularity;
+	__u8	padding[2];
+	union {
+		struct iommu_inv_pasid_info pasid_info;
+		struct iommu_inv_addr_info addr_info;
+	};
+};
+
+/**
+ * struct iommu_gpasid_bind_data_vtd - Intel VT-d specific data on device and guest
+ * SVA binding.
+ *
+ * @flags:	VT-d PASID table entry attributes
+ * @pat:	Page attribute table data to compute effective memory type
+ * @emt:	Extended memory type
+ *
+ * Only guest vIOMMU selectable and effective options are passed down to
+ * the host IOMMU.
+ */
+struct iommu_gpasid_bind_data_vtd {
+#define IOMMU_SVA_VTD_GPASID_SRE	(1 << 0) /* supervisor request */
+#define IOMMU_SVA_VTD_GPASID_EAFE	(1 << 1) /* extended access enable */
+#define IOMMU_SVA_VTD_GPASID_PCD	(1 << 2) /* page-level cache disable */
+#define IOMMU_SVA_VTD_GPASID_PWT	(1 << 3) /* page-level write through */
+#define IOMMU_SVA_VTD_GPASID_EMTE	(1 << 4) /* extended mem type enable */
+#define IOMMU_SVA_VTD_GPASID_CD		(1 << 5) /* PASID-level cache disable */
+	__u64 flags;
+	__u32 pat;
+	__u32 emt;
+};
+#define IOMMU_SVA_VTD_GPASID_EMT_MASK	(IOMMU_SVA_VTD_GPASID_CD | \
+					 IOMMU_SVA_VTD_GPASID_EMTE | \
+					 IOMMU_SVA_VTD_GPASID_PCD |  \
+					 IOMMU_SVA_VTD_GPASID_PWT)
+/**
+ * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
+ * @version:	IOMMU_UAPI_VERSION
+ * @format:	PASID table entry format
+ * @flags:	Additional information on guest bind request
+ * @gpgd:	Guest page directory base of the guest mm to bind
+ * @hpasid:	Process address space ID used for the guest mm in host IOMMU
+ * @gpasid:	Process address space ID used for the guest mm in guest IOMMU
+ * @addr_width:	Guest virtual address width
+ * @padding:	Reserved for future use (should be zero)
+ * @vtd:	Intel VT-d specific data
+ *
+ * The guest-to-host PASID mapping can be identity or non-identity; in the
+ * latter case the guest has its own PASID space, and a guest-to-host PASID
+ * lookup is needed when the VM programs a guest PASID into an assigned device.
+ * The VMM may trap such PASID programming and then ask the host IOMMU driver
+ * to convert the guest PASID to a host PASID based on this bind data.
+ */
+struct iommu_gpasid_bind_data {
+	__u32 version;
+#define IOMMU_PASID_FORMAT_INTEL_VTD	1
+	__u32 format;
+#define IOMMU_SVA_GPASID_VAL	(1 << 0) /* guest PASID valid */
+	__u64 flags;
+	__u64 gpgd;
+	__u64 hpasid;
+	__u64 gpasid;
+	__u32 addr_width;
+	__u8  padding[12];
+	/* Vendor specific data */
+	union {
+		struct iommu_gpasid_bind_data_vtd vtd;
+	};
+};
+
+#endif /* _IOMMU_H */
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index fb10370..9670002 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -748,6 +749,13 @@ struct vfio_iommu_type1_info_cap_iova_range {
 	struct	vfio_iova_range iova_ranges[];
 };
 
+#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
+
+struct vfio_iommu_type1_info_cap_nesting {
+	struct	vfio_info_cap_header header;
+	__u32	pasid_format;
+};
+
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
 
 /**
@@ -794,6 +802,146 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/*
+ * PASID (Process Address Space ID) is a PCIe concept that
+ * has been extended to support fine-grained DMA isolation.
+ * With devices assigned to user space (e.g. VMs), PASID alloc
+ * and free need to be system wide. This structure carries
+ * the PASID alloc/free request between user space and kernel
+ * space.
+ *
+ * @flags=VFIO_IOMMU_PASID_ALLOC, refer to @alloc_pasid
+ * @flags=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
+ */
+struct vfio_iommu_type1_pasid_request {
+	__u32	argsz;
+#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
+#define VFIO_IOMMU_PASID_FREE	(1 << 1)
+	__u32	flags;
+	union {
+		struct {
+			__u32 min;
+			__u32 max;
+			__u32 result;
+		} alloc_pasid;
+		__u32 free_pasid;
+	};
+};
+
+#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
+					 VFIO_IOMMU_PASID_FREE)
+
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *				struct vfio_iommu_type1_pasid_request)
+ *
+ * Availability of this feature depends on PASID support in the device,
+ * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
+ * is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
+/**
+ * @quota: the new PASID quota configured for a userspace application
+ * (e.g. a VM).
+ */
+struct vfio_iommu_type1_pasid_quota {
+	__u32	argsz;
+	__u32	flags;
+	__u32	quota;
+};
+
+/**
+ * VFIO_IOMMU_SET_PASID_QUOTA - _IOW(VFIO_TYPE, VFIO_BASE + 23,
+ *				struct vfio_iommu_type1_pasid_quota)
+ *
+ * Availability of this feature depends on PASID support in the device,
+ * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
+ * is available after VFIO_SET_IOMMU.
+ *
+ * returns: latest quota on success, -errno on failure.
+ */
+#define VFIO_IOMMU_SET_PASID_QUOTA	_IO(VFIO_TYPE, VFIO_BASE + 23)
+
+/**
+ * VFIO_NESTING_GET_IOMMU_UAPI_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 24)
+ *
+ * Report the version of the IOMMU UAPI when dual stage IOMMU is supported.
+ * In VFIO, it is needed for VFIO_TYPE1_NESTING_IOMMU.
+ * Availability: Always.
+ * Return: IOMMU UAPI version
+ */
+#define VFIO_NESTING_GET_IOMMU_UAPI_VERSION	_IO(VFIO_TYPE, VFIO_BASE + 24)
+
+/**
+ * Supported flags:
+ *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
+ *			nesting type IOMMUs. The @data field carries a struct
+ *			iommu_gpasid_bind_data.
+ *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
+ *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
+ *
+ */
+struct vfio_iommu_type1_bind {
+	__u32		argsz;
+	__u32		flags;
+#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
+#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
+	__u8		data[];
+};
+
+#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
+					VFIO_IOMMU_UNBIND_GUEST_PGTBL)
+
+/**
+ * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 25,
+ *				struct vfio_iommu_type1_bind)
+ *
+ * Manage address spaces of devices in this container. Initially a TYPE1
+ * container can only have one address space, managed with
+ * VFIO_IOMMU_MAP/UNMAP_DMA.
+ *
+ * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
+ * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
+ * tables, and BIND manages the stage-1 (guest) page tables. Other types of
+ * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
+ * traffic that requires only single-stage translation while BIND controls
+ * traffic that requires nested translation. This depends on the underlying
+ * IOMMU architecture and isn't guaranteed. Guest SVA traffic is one example:
+ * it needs nested translation to go from gVA to gPA and then from gPA to
+ * hPA.
+ *
+ * Availability of this feature depends on the device, its bus, the underlying
+ * IOMMU and the CPU architecture.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 25)
+
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 26,
+ *			struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host. The cache
+ * invalidation information is conveyed by @cache_info, whose format
+ * is defined by the structures in uapi/linux/iommu.h. Users should
+ * be aware that struct iommu_cache_invalidate_info has a @version
+ * field, which vfio needs to parse before reading the rest of the
+ * data from userspace.
+ *
+ * This ioctl is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct	iommu_cache_invalidate_info cache_info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 26)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 136+ messages in thread
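
The cache/granularity table documented for struct iommu_cache_invalidate_info
in the patch above can be read as a per-granularity mask of allowed cache
types. The following is only a minimal sketch under stated assumptions: the
constants are mirrored locally from the header proposed in this RFC instead of
being included, inv_combination_valid() is a hypothetical helper name that does
not appear in the series, and rejecting an empty cache selection is an
assumption the header comment does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of the values proposed in linux-headers/linux/iommu.h. */
#define IOMMU_CACHE_INV_TYPE_IOTLB     (1 << 0) /* IOMMU IOTLB */
#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB (1 << 1) /* Device IOTLB */
#define IOMMU_CACHE_INV_TYPE_PASID     (1 << 2) /* PASID cache */
#define IOMMU_CACHE_INV_TYPE_NR        (3)

enum iommu_inv_granularity {
    IOMMU_INV_GRANU_DOMAIN, /* domain-selective invalidation */
    IOMMU_INV_GRANU_PASID,  /* PASID-selective invalidation */
    IOMMU_INV_GRANU_ADDR,   /* page-selective invalidation */
    IOMMU_INV_GRANU_NR,     /* number of invalidation granularities */
};

/* For each granularity, the cache types marked "Y" in the documented table. */
static const uint8_t valid_caches[IOMMU_INV_GRANU_NR] = {
    [IOMMU_INV_GRANU_DOMAIN] = IOMMU_CACHE_INV_TYPE_IOTLB |
                               IOMMU_CACHE_INV_TYPE_PASID,
    [IOMMU_INV_GRANU_PASID]  = IOMMU_CACHE_INV_TYPE_IOTLB |
                               IOMMU_CACHE_INV_TYPE_DEV_IOTLB |
                               IOMMU_CACHE_INV_TYPE_PASID,
    [IOMMU_INV_GRANU_ADDR]   = IOMMU_CACHE_INV_TYPE_IOTLB |
                               IOMMU_CACHE_INV_TYPE_DEV_IOTLB,
};

/*
 * Hypothetical helper: returns true when every selected cache type
 * supports the requested granularity. An empty selection and unknown
 * cache bits are rejected (an assumption of this sketch).
 */
static bool inv_combination_valid(uint8_t cache, uint8_t granularity)
{
    if (granularity >= IOMMU_INV_GRANU_NR) {
        return false;
    }
    if (cache == 0 || (cache >> IOMMU_CACHE_INV_TYPE_NR) != 0) {
        return false;
    }
    return (cache & ~valid_caches[granularity]) == 0;
}
```

A caller could run such a check on @cache and @granularity before issuing
VFIO_IOMMU_CACHE_INVALIDATE, so that a request selecting several caches is
refused when any one of them does not support the chosen granularity.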

* [RFC v3 08/25] vfio: pass IOMMUContext into vfio_get_group()
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

IOMMUContext provides an explicit way for VFIO to talk with vIOMMU.

This patch passes an IOMMUContext instance into vfio_get_group()
because a potential VFIO_TYPE1_NESTING_IOMMU configuration
requires interaction with the vIOMMU.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/vfio/ap.c                  | 2 +-
 hw/vfio/ccw.c                 | 2 +-
 hw/vfio/common.c              | 3 ++-
 hw/vfio/pci.c                 | 3 ++-
 hw/vfio/platform.c            | 2 +-
 include/hw/vfio/vfio-common.h | 4 +++-
 6 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 8fbaa72..7b03c12 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -82,7 +82,7 @@ static VFIOGroup *vfio_ap_get_group(VFIOAPDevice *vapdev, Error **errp)
 
     g_free(group_path);
 
-    return vfio_get_group(groupid, &address_space_memory, errp);
+    return vfio_get_group(groupid, &address_space_memory, NULL, errp);
 }
 
 static void vfio_ap_realize(DeviceState *dev, Error **errp)
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 3b5520a..0764af5 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -471,7 +471,7 @@ static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
         return NULL;
     }
 
-    return vfio_get_group(groupid, &address_space_memory, errp);
+    return vfio_get_group(groupid, &address_space_memory, NULL, errp);
 }
 
 static void vfio_ccw_realize(DeviceState *dev, Error **errp)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5ca1148..0cc7ff5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1453,7 +1453,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
+VFIOGroup *vfio_get_group(int groupid, AddressSpace *as,
+                          IOMMUContext *iommu_ctx, Error **errp)
 {
     VFIOGroup *group;
     char path[32];
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2d40b39..77b941f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2782,7 +2782,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     trace_vfio_realize(vdev->vbasedev.name, groupid);
 
-    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
+    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev),
+                           pci_device_iommu_context(pdev), errp);
     if (!group) {
         goto error;
     }
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index d7598c6..2029a1c 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -581,7 +581,7 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
 
     trace_vfio_platform_base_device_init(vbasedev->name, groupid);
 
-    group = vfio_get_group(groupid, &address_space_memory, errp);
+    group = vfio_get_group(groupid, &address_space_memory, NULL, errp);
     if (!group) {
         return -ENOENT;
     }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd56420..8ab68fa 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -26,6 +26,7 @@
 #include "qemu/notify.h"
 #include "ui/console.h"
 #include "hw/display/ramfb.h"
+#include "hw/iommu/iommu_context.h"
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
@@ -174,7 +175,8 @@ void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
 void vfio_region_exit(VFIORegion *region);
 void vfio_region_finalize(VFIORegion *region);
 void vfio_reset_handler(void *opaque);
-VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
+VFIOGroup *vfio_get_group(int groupid, AddressSpace *as,
+                          IOMMUContext *iommu_ctx, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 09/25] vfio: check VFIO_TYPE1_NESTING_IOMMU support
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

VFIO needs to check VFIO_TYPE1_NESTING_IOMMU
support with the kernel before using it,
e.g. by checking the IOMMU UAPI version.
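
The selection logic this patch adds to vfio_get_iommu_type() can be modelled
without a real container fd. What follows is a minimal sketch, not the QEMU
code: struct demo_kernel, the DEMO_* constants and demo_pick_iommu_type() are
hypothetical stand-ins, the type IDs are illustrative rather than the values
in <linux/vfio.h>, and the sketch returns -1 where the patch returns -EINVAL.

```c
#include <stddef.h>

/* Stand-in IDs for illustration only; real values live in <linux/vfio.h>. */
enum demo_iommu_type {
    DEMO_TYPE1 = 1,
    DEMO_TYPE1V2,
    DEMO_TYPE1_NESTING,
};

/* UAPI version this (hypothetical) userspace was built against. */
#define DEMO_UAPI_VERSION 1

/* Models what the kernel would answer to the two queries. */
struct demo_kernel {
    unsigned int supported; /* bit (1u << type) set if the type is supported */
    int uapi_version;       /* reported version; negative models a failed ioctl */
};

/*
 * Probe types from most to least capable. Nesting is accepted only if
 * the reported UAPI version is at least the one userspace was built
 * with; otherwise the loop falls through to the next type.
 */
static int demo_pick_iommu_type(const struct demo_kernel *k)
{
    static const int order[] = {
        DEMO_TYPE1_NESTING, DEMO_TYPE1V2, DEMO_TYPE1,
    };
    size_t i;

    for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
        if (!(k->supported & (1u << order[i]))) {
            continue;
        }
        if (order[i] == DEMO_TYPE1_NESTING &&
            k->uapi_version < DEMO_UAPI_VERSION) {
            continue; /* incompatible UAPI: try a non-nesting type */
        }
        return order[i];
    }
    return -1;
}
```

Because a failed version query is modelled as a negative value, the same
comparison covers both an old kernel and an ioctl error, which matches the
single `version < IOMMU_UAPI_VERSION` test in the hunk below.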

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/vfio/common.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cc7ff5..a5e70b1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1157,12 +1157,21 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
 static int vfio_get_iommu_type(VFIOContainer *container,
                                Error **errp)
 {
-    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
+                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
                           VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
-    int i;
+    int i, version;
 
     for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
         if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU) {
+                version = ioctl(container->fd,
+                                VFIO_NESTING_GET_IOMMU_UAPI_VERSION);
+                if (version < IOMMU_UAPI_VERSION) {
+                    printf("IOMMU UAPI incompatible for nesting\n");
+                    continue;
+                }
+            }
             return iommu_types[i];
         }
     }
@@ -1278,6 +1287,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     switch (container->iommu_type) {
+    case VFIO_TYPE1_NESTING_IOMMU:
     case VFIO_TYPE1v2_IOMMU:
     case VFIO_TYPE1_IOMMU:
     {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 10/25] vfio: register DualStageIOMMUObject to vIOMMU
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

After confirming with the kernel that dual-stage DMA translation is
supported (by checking VFIO_TYPE1_NESTING_IOMMU), VFIO can register a
DualStageIOMMUObject instance with the vIOMMU, so that the vIOMMU is
aware of this hardware capability. The vIOMMU may then use the
capability through the ops provided by DualStageIOMMUObject.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/vfio/common.c              | 30 ++++++++++++++++++++++++++++--
 include/hw/vfio/vfio-common.h |  2 ++
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a5e70b1..fc1723d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1179,6 +1179,9 @@ static int vfio_get_iommu_type(VFIOContainer *container,
     return -EINVAL;
 }
 
+static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
+};
+
 static int vfio_init_container(VFIOContainer *container, int group_fd,
                                Error **errp)
 {
@@ -1210,12 +1213,29 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
         return -errno;
     }
 
+    if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+        ds_iommu_object_init(&container->dsi_obj, &vfio_ds_iommu_ops);
+        if (iommu_context_register_ds_iommu(container->iommu_ctx,
+                                            &container->dsi_obj)) {
+            /*
+             * Here just need an info to indicate that there is no
+             * DualStageIOMMUObject instance registered to vIOMMU
+             * due to either no IOMMUContext support in vIOMMU or
+             * vIOMMU internal failure. Neither is fatal error to
+             * VFIO as it is not mandatory requirement to use such
+             * capability in vIOMMU.
+             */
+            warn_report("No Dual Stage IOMMU for container(%p)", container);
+            ds_iommu_object_destroy(&container->dsi_obj);
+        }
+    }
+
     container->iommu_type = iommu_type;
     return 0;
 }
 
 static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
-                                  Error **errp)
+                                  IOMMUContext *iommu_ctx, Error **errp)
 {
     VFIOContainer *container;
     int ret, fd;
@@ -1277,6 +1297,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container = g_malloc0(sizeof(*container));
     container->space = space;
     container->fd = fd;
+    container->iommu_ctx = iommu_ctx;
     container->error = NULL;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
@@ -1457,6 +1478,11 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
         trace_vfio_disconnect_container(container->fd);
         close(container->fd);
+        if (container->iommu_ctx) {
+            iommu_context_unregister_ds_iommu(container->iommu_ctx,
+                                              &container->dsi_obj);
+        }
+        ds_iommu_object_destroy(&container->dsi_obj);
         g_free(container);
 
         vfio_put_address_space(space);
@@ -1508,7 +1534,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as,
     group->groupid = groupid;
     QLIST_INIT(&group->device_list);
 
-    if (vfio_connect_container(group, as, errp)) {
+    if (vfio_connect_container(group, as, iommu_ctx, errp)) {
         error_prepend(errp, "failed to setup container for group %d: ",
                       groupid);
         goto close_fd_exit;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8ab68fa..dc68552 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -72,6 +72,8 @@ typedef struct VFIOContainer {
     MemoryListener listener;
     MemoryListener prereg_listener;
     unsigned iommu_type;
+    IOMMUContext *iommu_ctx;
+    DualStageIOMMUObject dsi_obj;
     Error *error;
     bool initialized;
     unsigned long pgsizes;
-- 
2.7.4
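
The patch above treats a failed registration as non-fatal and tears the
object down again. A minimal standalone sketch of that pattern follows. All
names here (struct ctx, struct container, ctx_register() and so on) are
hypothetical miniatures, not QEMU's IOMMUContext API, and the
obj_registered flag is an assumption added so that teardown only undoes
what initialization actually did.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the objects in the patch; not QEMU's types. */
struct ds_obj { const void *ops; };
struct ctx    { struct ds_obj *registered; };
struct container {
    struct ctx *ctx;        /* may be NULL: vIOMMU without a context */
    struct ds_obj obj;
    bool obj_registered;    /* assumption: remember whether register worked */
};

/* Returns 0 on success, -1 if there is no context or the slot is taken. */
static int ctx_register(struct ctx *c, struct ds_obj *o)
{
    if (!c || c->registered) {
        return -1;
    }
    c->registered = o;
    return 0;
}

static void ctx_unregister(struct ctx *c, struct ds_obj *o)
{
    if (c && c->registered == o) {
        c->registered = NULL;
    }
}

/* Registration failure is reported to the caller but is not fatal. */
static bool container_init_nesting(struct container *k, const void *ops)
{
    assert(k);
    k->obj.ops = ops;
    k->obj_registered = (ctx_register(k->ctx, &k->obj) == 0);
    if (!k->obj_registered) {
        k->obj.ops = NULL;  /* undo the init, as the patch destroys dsi_obj */
    }
    return k->obj_registered;
}

/* Teardown only unregisters what init actually registered. */
static void container_fini(struct container *k)
{
    assert(k);
    if (k->obj_registered) {
        ctx_unregister(k->ctx, &k->obj);
        k->obj_registered = false;
    }
    k->obj.ops = NULL;
}
```

A container whose ctx is NULL simply ends up without a registered object,
and container_fini() is safe to call on it.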


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 11/25] vfio: get stage-1 pasid formats from Kernel
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

VFIO checks the IOMMU UAPI version when it finds that the kernel supports
VFIO_TYPE1_NESTING_IOMMU. That is enough for a UAPI compatibility check.
However, a given UAPI version may support multiple stage-1 PASID formats,
which is quite likely since the IOMMU UAPI covers stage-1 formats across
all IOMMU vendors. VFIO therefore needs to get the supported formats from
the kernel and pass them to the vIOMMU, which then selects the proper
format when setting up dual-stage DMA translation.

This patch gets the stage-1 PASID format from the kernel with the
VFIO_IOMMU_GET_INFO ioctl and passes the supported format to the vIOMMU
through the DualStageIOMMUObject instance that has been registered with
the vIOMMU.

This patch borrows some code from Shameer Kolothum:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/iommu/dual_stage_iommu.c         |  5 ++-
 hw/vfio/common.c                    | 85 ++++++++++++++++++++++++++++++++++++-
 include/hw/iommu/dual_stage_iommu.h | 10 ++++-
 3 files changed, 97 insertions(+), 3 deletions(-)

diff --git a/hw/iommu/dual_stage_iommu.c b/hw/iommu/dual_stage_iommu.c
index be4179d..d5a7168 100644
--- a/hw/iommu/dual_stage_iommu.c
+++ b/hw/iommu/dual_stage_iommu.c
@@ -48,9 +48,12 @@ int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
 }
 
 void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
-                          DualStageIOMMUOps *ops)
+                          DualStageIOMMUOps *ops,
+                          DualStageIOMMUInfo *uinfo)
 {
     dsi_obj->ops = ops;
+
+    dsi_obj->uinfo.pasid_format = uinfo->pasid_format;
 }
 
 void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fc1723d..a07824b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1182,10 +1182,84 @@ static int vfio_get_iommu_type(VFIOContainer *container,
 static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
 };
 
+static int vfio_get_iommu_info(VFIOContainer *container,
+                         struct vfio_iommu_type1_info **info)
+{
+
+    size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+
+    *info = g_malloc0(argsz);
+
+retry:
+    (*info)->argsz = argsz;
+
+    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if (((*info)->argsz > argsz)) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+        goto retry;
+    }
+
+    return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
+static int vfio_get_nesting_iommu_format(VFIOContainer *container,
+                                         uint32_t *pasid_format)
+{
+    struct vfio_iommu_type1_info *info;
+    struct vfio_info_cap_header *hdr;
+    struct vfio_iommu_type1_info_cap_nesting *cap;
+
+    if (vfio_get_iommu_info(container, &info)) {
+        return -errno;
+    }
+
+    hdr = vfio_get_iommu_info_cap(info,
+                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
+    if (!hdr) {
+        g_free(info);
+        return -ENOENT;
+    }
+
+    cap = container_of(hdr,
+                struct vfio_iommu_type1_info_cap_nesting, header);
+    *pasid_format = cap->pasid_format;
+
+    g_free(info);
+    return 0;
+}
+
 static int vfio_init_container(VFIOContainer *container, int group_fd,
                                Error **errp)
 {
     int iommu_type, ret;
+    uint32_t format;
+    DualStageIOMMUInfo uinfo;
 
     iommu_type = vfio_get_iommu_type(container, errp);
     if (iommu_type < 0) {
@@ -1214,7 +1288,16 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     }
 
     if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
-        ds_iommu_object_init(&container->dsi_obj, &vfio_ds_iommu_ops);
+        ret = vfio_get_nesting_iommu_format(container, &format);
+        if (ret) {
+            error_setg_errno(errp, -ret, "Failed to get nesting iommu format");
+            return ret;
+        }
+
+        uinfo.pasid_format = format;
+        ds_iommu_object_init(&container->dsi_obj,
+                             &vfio_ds_iommu_ops, &uinfo);
+
         if (iommu_context_register_ds_iommu(container->iommu_ctx,
                                             &container->dsi_obj)) {
             /*
diff --git a/include/hw/iommu/dual_stage_iommu.h b/include/hw/iommu/dual_stage_iommu.h
index e9891e3..c6100b4 100644
--- a/include/hw/iommu/dual_stage_iommu.h
+++ b/include/hw/iommu/dual_stage_iommu.h
@@ -23,12 +23,14 @@
 #define HW_DS_IOMMU_H
 
 #include "qemu/queue.h"
+#include <linux/iommu.h>
 #ifndef CONFIG_USER_ONLY
 #include "exec/hwaddr.h"
 #endif
 
 typedef struct DualStageIOMMUObject DualStageIOMMUObject;
 typedef struct DualStageIOMMUOps DualStageIOMMUOps;
+typedef struct DualStageIOMMUInfo DualStageIOMMUInfo;
 
 struct DualStageIOMMUOps {
     /* Allocate pasid from DualStageIOMMU (a.k.a. host IOMMU) */
@@ -41,11 +43,16 @@ struct DualStageIOMMUOps {
                       uint32_t pasid);
 };
 
+struct DualStageIOMMUInfo {
+    uint32_t pasid_format;
+};
+
 /*
  * This is an abstraction of Dual-stage IOMMU.
  */
 struct DualStageIOMMUObject {
     DualStageIOMMUOps *ops;
+    DualStageIOMMUInfo uinfo;
 };
 
 int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
@@ -53,7 +60,8 @@ int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
 int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid);
 
 void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
-                          DualStageIOMMUOps *ops);
+                          DualStageIOMMUOps *ops,
+                          DualStageIOMMUInfo *uinfo);
 void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj);
 
 #endif
-- 
2.7.4
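
The capability lookup in the patch above walks a chain of headers that are
linked by byte offsets from the start of the info buffer, with a next offset
of 0 ending the chain. The standalone sketch below shows that walk on a plain
byte buffer. It is an illustration, not the patch's code: find_cap() and
put_cap() are hypothetical helpers, and the bounds check and the
forward-only check are extra assumptions about what a defensive reader might
add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the field layout of the kernel's struct vfio_info_cap_header. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;      /* offset of the next capability, 0 ends the chain */
};

/*
 * Walk a capability chain stored in buf[0..len). first is the offset of
 * the first capability. Returns the offset of the capability with the
 * given id, or 0 if it is absent or the chain is malformed.
 */
static uint32_t find_cap(const uint8_t *buf, size_t len, uint32_t first,
                         uint16_t id)
{
    uint32_t off = first;

    assert(buf);
    while (off != 0) {
        struct cap_header hdr;

        if (off > len || len - off < sizeof(hdr)) {
            return 0;                       /* header would overrun buf */
        }
        memcpy(&hdr, buf + off, sizeof(hdr));   /* no unaligned access */
        if (hdr.id == id) {
            return off;
        }
        if (hdr.next != 0 && hdr.next <= off) {
            return 0;                       /* refuse backward links */
        }
        off = hdr.next;
    }
    return 0;
}

/* Demo helper: write one header at offset off. */
static void put_cap(uint8_t *buf, uint32_t off, uint16_t id, uint32_t next)
{
    struct cap_header hdr = { id, 1, next };

    memcpy(buf + off, &hdr, sizeof(hdr));
}
```

In the patch, the header that is found is then converted to the enclosing
nesting capability structure with container_of() to read pasid_format.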


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 12/25] vfio/common: add pasid_alloc/free support
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds VFIO PASID alloc/free support so that the host can
intercept PASID allocation for the VM. It does so by adding a VFIO
implementation of the DualStageIOMMUOps.pasid_alloc/free callbacks.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/vfio/common.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a07824b..014f4e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1179,7 +1179,49 @@ static int vfio_get_iommu_type(VFIOContainer *container,
     return -EINVAL;
 }
 
+static int vfio_ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj,
+                         uint32_t min, uint32_t max, uint32_t *pasid)
+{
+    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
+    struct vfio_iommu_type1_pasid_request req;
+    unsigned long argsz;
+
+    argsz = sizeof(req);
+    req.argsz = argsz;
+    req.flags = VFIO_IOMMU_PASID_ALLOC;
+    req.alloc_pasid.min = min;
+    req.alloc_pasid.max = max;
+
+    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
+        error_report("%s: %d, alloc failed", __func__, -errno);
+        return -errno;
+    }
+    *pasid = req.alloc_pasid.result;
+    return 0;
+}
+
+static int vfio_ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj,
+                                     uint32_t pasid)
+{
+    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
+    struct vfio_iommu_type1_pasid_request req;
+    unsigned long argsz;
+
+    argsz = sizeof(req);
+    req.argsz = argsz;
+    req.flags = VFIO_IOMMU_PASID_FREE;
+    req.free_pasid = pasid;
+
+    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
+        error_report("%s: %d, free failed", __func__, -errno);
+        return -errno;
+    }
+    return 0;
+}
+
 static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
+    .pasid_alloc = vfio_ds_iommu_pasid_alloc,
+    .pasid_free = vfio_ds_iommu_pasid_free,
 };
 
 static int vfio_get_iommu_info(VFIOContainer *container,
-- 
2.7.4
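
The callbacks in the patch above recover the VFIOContainer from the embedded
dsi_obj with container_of(). The standalone sketch below shows that dispatch
pattern. It is a sketch under assumptions: struct backend, the toy counter
that replaces the VFIO_IOMMU_PASID_REQUEST ioctl, and the CONTAINER_OF macro
are all hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable container_of; QEMU and the kernel provide their own macro. */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct ds_obj;
struct ds_ops {
    int (*pasid_alloc)(struct ds_obj *obj, uint32_t min, uint32_t max,
                       uint32_t *pasid);
};
struct ds_obj { const struct ds_ops *ops; };

/* Hypothetical backend: the embedded obj plays the role of dsi_obj. */
struct backend {
    int fd;                 /* unused here; stands in for container->fd */
    uint32_t next_pasid;    /* toy allocator state instead of an ioctl */
    struct ds_obj obj;
};

static int backend_pasid_alloc(struct ds_obj *obj, uint32_t min,
                               uint32_t max, uint32_t *pasid)
{
    /* Step back from the embedded member to the enclosing structure. */
    struct backend *b = CONTAINER_OF(obj, struct backend, obj);

    assert(pasid);
    if (b->next_pasid < min) {
        b->next_pasid = min;
    }
    if (b->next_pasid > max) {
        return -1;          /* requested range is exhausted */
    }
    *pasid = b->next_pasid++;
    return 0;
}

static const struct ds_ops backend_ops = {
    .pasid_alloc = backend_pasid_alloc,
};

/* Generic entry point: callers only ever see the ds_obj. */
static int ds_pasid_alloc(struct ds_obj *obj, uint32_t min, uint32_t max,
                          uint32_t *pasid)
{
    if (!obj || !obj->ops || !obj->ops->pasid_alloc) {
        return -1;
    }
    return obj->ops->pasid_alloc(obj, min, max, pasid);
}
```

Because the generic entry point takes only the ds_obj, the vIOMMU side never
needs to know the layout of the backend that embeds it.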


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string option
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

Intel VT-d 3.0 introduces scalable mode, along with a number of capabilities
related to scalable mode translation, so many combinations are possible.
This vIOMMU implementation simplifies things for the user by offering a few
typical combinations, selected with the "x-scalable-mode" option. The usage
is as below:

"-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"

 - "legacy": gives support for SL page table
 - "modern": gives support for FL page table, pasid, virtual command
 - if the option is not given, scalable mode is not supported; any other
   value is rejected with an error

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/i386/intel_iommu.c          | 27 +++++++++++++++++++++++++--
 hw/i386/intel_iommu_internal.h |  3 +++
 include/hw/i386/intel_iommu.h  |  2 ++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1c1eb7f..33be40c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3078,7 +3078,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
                       VTD_HOST_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
-    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
+    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -3708,8 +3708,11 @@ static void vtd_init(IntelIOMMUState *s)
     }
 
     /* TODO: read cap/ecap from host to decide which cap to be exposed. */
-    if (s->scalable_mode) {
+    if (s->scalable_mode && !s->scalable_modern) {
         s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+    } else if (s->scalable_mode && s->scalable_modern) {
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
+                   | VTD_ECAP_FLTS | VTD_ECAP_PSS;
     }
 
     vtd_reset_caches(s);
@@ -3845,6 +3848,26 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         return false;
     }
 
+    if (s->scalable_mode_str &&
+        (strcmp(s->scalable_mode_str, "modern") &&
+         strcmp(s->scalable_mode_str, "legacy"))) {
+        error_setg(errp, "Invalid x-scalable-mode config");
+        return false;
+    }
+
+    if (s->scalable_mode_str &&
+        !strcmp(s->scalable_mode_str, "legacy")) {
+        s->scalable_mode = true;
+        s->scalable_modern = false;
+    } else if (s->scalable_mode_str &&
+        !strcmp(s->scalable_mode_str, "modern")) {
+        s->scalable_mode = true;
+        s->scalable_modern = true;
+    } else {
+        s->scalable_mode = false;
+        s->scalable_modern = false;
+    }
+
     return true;
 }
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 862033e..c4dbb2c 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -190,8 +190,11 @@
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_MHMV               (15ULL << 20)
 #define VTD_ECAP_SRS                (1ULL << 31)
+#define VTD_ECAP_PSS                (19ULL << 35)
+#define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
 #define VTD_ECAP_SLTS               (1ULL << 46)
+#define VTD_ECAP_FLTS               (1ULL << 47)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 8571a85..1ef2917 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -244,6 +244,8 @@ struct IntelIOMMUState {
 
     bool caching_mode;              /* RO - is cap CM enabled? */
     bool scalable_mode;             /* RO - is Scalable Mode supported? */
+    char *scalable_mode_str;        /* RO - admin's Scalable Mode config */
+    bool scalable_modern;           /* RO - is modern SM supported? */
 
     dma_addr_t root;                /* Current root table pointer */
     bool root_scalable;             /* Type of root table (scalable or not) */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string option
@ 2020-01-29 12:16   ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: kevin.tian, yi.l.liu, Yi Sun, Eduardo Habkost, kvm, mst,
	jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, Richard Henderson,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

Intel VT-d 3.0 introduces scalable mode along with a number of
capabilities related to scalable mode translation, so many capability
combinations are possible. This vIOMMU implementation simplifies things
for the user by providing typical combinations, selected with the
"x-scalable-mode" option. The usage is as below:

"-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"

 - "legacy": gives support for SL page table
 - "modern": gives support for FL page table, pasid, virtual command
 - if not configured, scalable mode is not supported; if configured
   with any other value, an error is reported
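
As a rough illustration of the mapping above (not part of this patch), the
option string can be thought of as selecting one of two fixed ECAP bit
sets. The helper name sm_ecap_from_option() is hypothetical; the bit
definitions mirror the ones this patch adds to intel_iommu_internal.h.
The PSS value 19 advertises a 20-bit PASID, since the field holds the
supported width minus one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* ECAP bits, same positions as in the patch (names shortened here). */
#define ECAP_SRS    (1ULL << 31)
#define ECAP_PSS    (19ULL << 35)   /* PASID width - 1, i.e. 20 bits */
#define ECAP_PASID  (1ULL << 40)
#define ECAP_SMTS   (1ULL << 43)
#define ECAP_SLTS   (1ULL << 46)
#define ECAP_FLTS   (1ULL << 47)

/*
 * Hypothetical helper: translate the option string into the extra
 * ECAP bits. Returns false for an unrecognised value; a NULL string
 * (option not given) is valid and selects no scalable mode bits.
 */
static bool sm_ecap_from_option(const char *opt, uint64_t *ecap)
{
    assert(ecap != NULL);
    *ecap = 0;

    if (!opt) {
        return true;
    }
    if (!strcmp(opt, "legacy")) {
        *ecap = ECAP_SMTS | ECAP_SRS | ECAP_SLTS;
        return true;
    }
    if (!strcmp(opt, "modern")) {
        *ecap = ECAP_SMTS | ECAP_SRS | ECAP_PASID | ECAP_FLTS | ECAP_PSS;
        return true;
    }
    return false;
}
```

Note that a value such as "on" or "true", which the old boolean property
accepted, is rejected by this scheme.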

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/i386/intel_iommu.c          | 27 +++++++++++++++++++++++++--
 hw/i386/intel_iommu_internal.h |  3 +++
 include/hw/i386/intel_iommu.h  |  2 ++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1c1eb7f..33be40c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3078,7 +3078,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
                       VTD_HOST_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
-    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
+    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -3708,8 +3708,11 @@ static void vtd_init(IntelIOMMUState *s)
     }
 
     /* TODO: read cap/ecap from host to decide which cap to be exposed. */
-    if (s->scalable_mode) {
+    if (s->scalable_mode && !s->scalable_modern) {
         s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+    } else if (s->scalable_mode && s->scalable_modern) {
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
+                   | VTD_ECAP_FLTS | VTD_ECAP_PSS;
     }
 
     vtd_reset_caches(s);
@@ -3845,6 +3848,26 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         return false;
     }
 
+    if (s->scalable_mode_str &&
+        (strcmp(s->scalable_mode_str, "modern") &&
+         strcmp(s->scalable_mode_str, "legacy"))) {
+        error_setg(errp, "Invalid x-scalable-mode config");
+        return false;
+    }
+
+    if (s->scalable_mode_str &&
+        !strcmp(s->scalable_mode_str, "legacy")) {
+        s->scalable_mode = true;
+        s->scalable_modern = false;
+    } else if (s->scalable_mode_str &&
+        !strcmp(s->scalable_mode_str, "modern")) {
+        s->scalable_mode = true;
+        s->scalable_modern = true;
+    } else {
+        s->scalable_mode = false;
+        s->scalable_modern = false;
+    }
+
     return true;
 }
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 862033e..c4dbb2c 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -190,8 +190,11 @@
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_MHMV               (15ULL << 20)
 #define VTD_ECAP_SRS                (1ULL << 31)
+#define VTD_ECAP_PSS                (19ULL << 35)
+#define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
 #define VTD_ECAP_SLTS               (1ULL << 46)
+#define VTD_ECAP_FLTS               (1ULL << 47)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 8571a85..1ef2917 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -244,6 +244,8 @@ struct IntelIOMMUState {
 
     bool caching_mode;              /* RO - is cap CM enabled? */
     bool scalable_mode;             /* RO - is Scalable Mode supported? */
+    char *scalable_mode_str;        /* RO - admin's Scalable Mode config */
+    bool scalable_modern;           /* RO - is modern SM supported? */
 
     dma_addr_t root;                /* Current root table pointer */
     bool root_scalable;             /* Type of root table (scalable or not) */
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 14/25] intel_iommu: add virtual command capability support
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds virtual command support to the Intel vIOMMU, following
the Intel VT-d 3.1 spec. It implements two virtual commands: allocate
PASID and free PASID.
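
For reviewers, a small sketch of the register encoding this patch assumes
may help. The field positions are taken from the macros added to
intel_iommu_internal.h below; the helper names (vcmd_encode_free(),
vcrsp_status(), vcrsp_result()) are hypothetical and only show how a guest
would build a command and read back a response. Unlike the patch, the
result macro here widens its argument to 64 bits first.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout as defined by the macros in this patch. */
#define VCMD_CMD_MASK          0xffULL
#define VCMD_PASID_VALUE(val)  (((val) >> 8) & 0xfffff)
#define VCRSP_IP               1ULL                      /* bit 0 */
#define VCRSP_SC(val)          (((val) & 0x3) << 1)      /* bits 2:1 */
#define VCRSP_RSLT(val)        ((uint64_t)(val) << 8)    /* from bit 8 */

enum { VCMD_ALLOC_PASID = 1, VCMD_FREE_PASID = 2 };

/* Hypothetical guest-side helper: build a "free PASID" command. */
static uint64_t vcmd_encode_free(uint32_t pasid)
{
    assert(pasid <= 0xfffff);
    return ((uint64_t)pasid << 8) | VCMD_FREE_PASID;
}

/* Hypothetical guest-side helpers: split a response value. */
static unsigned vcrsp_status(uint64_t rsp)
{
    return (unsigned)((rsp >> 1) & 0x3);
}

static uint32_t vcrsp_result(uint64_t rsp)
{
    return (uint32_t)((rsp >> 8) & 0xfffff);
}
```

A guest would write the command to VCMD, poll VCRSP until the IP bit
clears, and then check the status code before using the result.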

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/i386/intel_iommu.c          | 163 ++++++++++++++++++++++++++++++++++++++++-
 hw/i386/intel_iommu_internal.h |  38 ++++++++++
 hw/i386/trace-events           |   1 +
 include/hw/i386/intel_iommu.h  |   6 +-
 4 files changed, 206 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 33be40c..43a728f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2649,6 +2649,142 @@ static void vtd_handle_iectl_write(IntelIOMMUState *s)
     }
 }
 
+static int vtd_request_pasid_alloc(IntelIOMMUState *s, uint32_t *pasid)
+{
+    VTDBus *vtd_bus;
+    int bus_n, devfn, ret = -ENODEV;
+    VTDIOMMUContext *vtd_icx;
+
+    for (bus_n = 0; bus_n < PCI_BUS_MAX; bus_n++) {
+        vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
+        if (!vtd_bus) {
+            continue;
+        }
+        for (devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
+            vtd_icx = vtd_bus->dev_icx[devfn];
+            if (!vtd_icx) {
+                continue;
+            }
+
+            /*
+             * We'll return the first valid result we got. It's
+             * a bit hackish in that we don't have a good global
+             * interface yet to talk to modules like vfio to deliver
+             * this allocation request, so we're leveraging this
+             * per-device iommu object to do the same thing just
+             * to make sure the allocation happens only once.
+             */
+            ret = ds_iommu_pasid_alloc(vtd_icx->dsi_obj,
+                         VTD_MIN_HPASID, VTD_MAX_HPASID, pasid);
+            if (!ret) {
+                return ret;
+            }
+        }
+    }
+    return ret;
+}
+
+static int vtd_request_pasid_free(IntelIOMMUState *s, uint32_t pasid)
+{
+    VTDBus *vtd_bus;
+    int bus_n, devfn, ret = -ENODEV;
+    VTDIOMMUContext *vtd_icx;
+
+    for (bus_n = 0; bus_n < PCI_BUS_MAX; bus_n++) {
+        vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
+        if (!vtd_bus) {
+            continue;
+        }
+        for (devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
+            vtd_icx = vtd_bus->dev_icx[devfn];
+            if (!vtd_icx) {
+                continue;
+            }
+            /*
+             * Similar to PASID allocation: the PASID is freed by
+             * the first successful free operation. It's a bit
+             * hackish in that we don't have a good global interface
+             * yet to talk to modules like vfio to deliver this pasid
+             * free request, so we're leveraging this per-device iommu
+             * object to do the same thing just to make sure the
+             * free happens only once.
+             */
+            ret = ds_iommu_pasid_free(vtd_icx->dsi_obj, pasid);
+            if (!ret) {
+                return ret;
+            }
+        }
+    }
+    return ret;
+}
+
+/*
+ * Set the In Progress (IP) bit. This also clears the status code
+ * and result fields left over from the previous command.
+ */
+static void vtd_vcmd_set_ip(IntelIOMMUState *s)
+{
+    s->vcrsp = 1;
+    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
+                     ((uint64_t) s->vcrsp));
+}
+
+static void vtd_vcmd_clear_ip(IntelIOMMUState *s)
+{
+    s->vcrsp &= (~((uint64_t)(0x1)));
+    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
+                     ((uint64_t) s->vcrsp));
+}
+
+/* Handle write to Virtual Command Register */
+static int vtd_handle_vcmd_write(IntelIOMMUState *s, uint64_t val)
+{
+    uint32_t pasid;
+    int ret = -1;
+
+    trace_vtd_reg_write_vcmd(s->vcrsp, val);
+
+    if (!(s->vccap & VTD_VCCAP_PAS) ||
+         (s->vcrsp & 1)) {
+        return -1;
+    }
+
+    /*
+     * The vCPU that issued the VCMD write is blocked while the
+     * write is trapped and handled here, so no other vCPU should
+     * access VCMD if guest software is well written. We still
+     * emulate the IP bit to cope with badly behaved guest
+     * software and to stay aligned with the spec.
+     */
+    vtd_vcmd_set_ip(s);
+
+    switch (val & VTD_VCMD_CMD_MASK) {
+    case VTD_VCMD_ALLOC_PASID:
+        ret = vtd_request_pasid_alloc(s, &pasid);
+        if (ret) {
+            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_NO_AVAILABLE_PASID);
+        } else {
+            s->vcrsp |= VTD_VCRSP_RSLT(pasid);
+        }
+        break;
+
+    case VTD_VCMD_FREE_PASID:
+        pasid = VTD_VCMD_PASID_VALUE(val);
+        ret = vtd_request_pasid_free(s, pasid);
+        if (ret < 0) {
+            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_FREE_INVALID_PASID);
+        }
+        break;
+
+    default:
+        s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_UNDEFINED_CMD);
+        error_report_once("Virtual Command: unsupported command!!!");
+        break;
+    }
+    vtd_vcmd_clear_ip(s);
+    return 0;
+}
+
 static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
 {
     IntelIOMMUState *s = opaque;
@@ -2938,6 +3074,23 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
         vtd_set_long(s, addr, val);
         break;
 
+    case DMAR_VCMD_REG:
+        if (!vtd_handle_vcmd_write(s, val)) {
+            if (size == 4) {
+                vtd_set_long(s, addr, val);
+            } else {
+                vtd_set_quad(s, addr, val);
+            }
+        }
+        break;
+
+    case DMAR_VCMD_REG_HI:
+        assert(size == 4);
+        if (!vtd_handle_vcmd_write(s, val)) {
+            vtd_set_long(s, addr, val);
+        }
+        break;
+
     default:
         if (size == 4) {
             vtd_set_long(s, addr, val);
@@ -3712,7 +3865,8 @@ static void vtd_init(IntelIOMMUState *s)
         s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
     } else if (s->scalable_mode && s->scalable_modern) {
         s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
-                   | VTD_ECAP_FLTS | VTD_ECAP_PSS;
+                   | VTD_ECAP_FLTS | VTD_ECAP_PSS | VTD_ECAP_VCS;
+        s->vccap |= VTD_VCCAP_PAS;
     }
 
     vtd_reset_caches(s);
@@ -3768,6 +3922,13 @@ static void vtd_init(IntelIOMMUState *s)
      * Interrupt remapping registers.
      */
     vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xfffffffffffff80fULL, 0);
+
+    /*
+     * Virtual Command Definitions
+     */
+    vtd_define_quad(s, DMAR_VCCAP_REG, s->vccap, 0, 0);
+    vtd_define_quad(s, DMAR_VCMD_REG, 0, 0xffffffffffffffffULL, 0);
+    vtd_define_quad(s, DMAR_VCRSP_REG, 0, 0, 0);
 }
 
 /* Should not reset address_spaces when reset because devices will still use
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index c4dbb2c..fb5fdc2 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -85,6 +85,12 @@
 #define DMAR_MTRRCAP_REG_HI     0x104
 #define DMAR_MTRRDEF_REG        0x108 /* MTRR default type */
 #define DMAR_MTRRDEF_REG_HI     0x10c
+#define DMAR_VCCAP_REG          0xE00 /* Virtual Command Capability Register */
+#define DMAR_VCCAP_REG_HI       0xE04
+#define DMAR_VCMD_REG           0xE10 /* Virtual Command Register */
+#define DMAR_VCMD_REG_HI        0xE14
+#define DMAR_VCRSP_REG          0xE20 /* Virtual Command Response Register */
+#define DMAR_VCRSP_REG_HI       0xE24
 
 /* IOTLB registers */
 #define DMAR_IOTLB_REG_OFFSET   0xf0 /* Offset to the IOTLB registers */
@@ -193,6 +199,7 @@
 #define VTD_ECAP_PSS                (19ULL << 35)
 #define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
+#define VTD_ECAP_VCS                (1ULL << 44)
 #define VTD_ECAP_SLTS               (1ULL << 46)
 #define VTD_ECAP_FLTS               (1ULL << 47)
 
@@ -315,6 +322,37 @@ typedef enum VTDFaultReason {
 
 #define VTD_CONTEXT_CACHE_GEN_MAX       0xffffffffUL
 
+/* VCCAP_REG */
+#define VTD_VCCAP_PAS               (1UL << 0)
+
+/*
+ * The basic idea is to let the hypervisor set a range of PASIDs
+ * available to VMs. One reason is that PASID #0 is reserved for
+ * RID_PASID usage. We do not know how many PASIDs will be reserved
+ * in the future, so the minimum here is only an estimate; "1" is
+ * enough at the current stage.
+ */
+#define VTD_MIN_HPASID              1
+#define VTD_MAX_HPASID              0xFFFFF
+
+/* Virtual Command Register */
+enum {
+     VTD_VCMD_NULL_CMD = 0,
+     VTD_VCMD_ALLOC_PASID = 1,
+     VTD_VCMD_FREE_PASID = 2,
+     VTD_VCMD_CMD_NUM,
+};
+
+#define VTD_VCMD_CMD_MASK           0xffUL
+#define VTD_VCMD_PASID_VALUE(val)   (((val) >> 8) & 0xfffff)
+
+#define VTD_VCRSP_RSLT(val)         ((val) << 8)
+#define VTD_VCRSP_SC(val)           (((val) & 0x3) << 1)
+
+#define VTD_VCMD_UNDEFINED_CMD         1ULL
+#define VTD_VCMD_NO_AVAILABLE_PASID    2ULL
+#define VTD_VCMD_FREE_INVALID_PASID    2ULL
+
 /* Interrupt Entry Cache Invalidation Descriptor: VT-d 6.5.2.7. */
 struct VTDInvDescIEC {
     uint32_t type:4;            /* Should always be 0x4 */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index e48bef2..71536a7 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -51,6 +51,7 @@ vtd_reg_write_gcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"
 vtd_reg_write_fectl(uint32_t value) "value 0x%"PRIx32
 vtd_reg_write_iectl(uint32_t value) "value 0x%"PRIx32
 vtd_reg_ics_clear_ip(void) ""
+vtd_reg_write_vcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"PRIx32
 vtd_dmar_translate(uint8_t bus, uint8_t slot, uint8_t func, uint64_t iova, uint64_t gpa, uint64_t mask) "dev %02x:%02x.%02x iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64
 vtd_dmar_enable(bool en) "enable %d"
 vtd_dmar_fault(uint16_t sid, int fault, uint64_t addr, bool is_write) "sid 0x%"PRIx16" fault %d addr 0x%"PRIx64" write %d"
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 1ef2917..4158116 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -46,7 +46,7 @@
 #define VTD_SID_TO_BUS(sid)         (((sid) >> 8) & 0xff)
 #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
 
-#define DMAR_REG_SIZE               0x230
+#define DMAR_REG_SIZE               0xF00
 #define VTD_HOST_AW_39BIT           39
 #define VTD_HOST_AW_48BIT           48
 #define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
@@ -285,6 +285,10 @@ struct IntelIOMMUState {
     uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
     bool dma_drain;                 /* Whether DMA r/w draining enabled */
 
+    /* Virtual Command Register */
+    uint64_t vccap;                 /* The value of vcmd capability reg */
+    uint64_t vcrsp;                 /* Current value of VCMD RSP REG */
+
     /*
      * Protects IOMMU states in general.  Currently it protects the
      * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 15/25] intel_iommu: process pasid cache invalidation
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds PASID cache invalidation handling. When the guest enables
PASID usage (e.g. SVA) and caching-mode is exposed, guest software should
issue a proper PASID cache invalidation. This patch only adds a skeleton
for handling PASID cache invalidation; detailed handling will be added
in subsequent patches.
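
To make the dispatch easier to follow, here is a minimal standalone sketch
of how the invalidation granularity could be decoded from the low quadword
of the descriptor. The field positions below (G in bits 5:4, DID in bits
31:16, PASID in bits 51:32) are assumptions based on the VT-d PASID-cache
invalidate descriptor layout, since the defining hunk is not quoted in this
description; the names PC_* and pc_scope() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed descriptor layout; see the note above. */
#define PC_G_MASK       (3ULL << 4)
#define PC_G_DSI        (0ULL << 4)   /* domain-selective */
#define PC_G_PASID_SI   (1ULL << 4)   /* PASID-selective */
#define PC_G_GLOBAL     (3ULL << 4)   /* global */
#define PC_DID(val)     (((val) >> 16) & 0xffffULL)
#define PC_PASID(val)   (((val) >> 32) & 0xfffffULL)

typedef enum {
    PC_SCOPE_DOMAIN,
    PC_SCOPE_PASID,
    PC_SCOPE_GLOBAL,
    PC_SCOPE_INVALID,
} PcScope;

/* Hypothetical helper: classify a descriptor by its granularity. */
static PcScope pc_scope(uint64_t lo)
{
    switch (lo & PC_G_MASK) {
    case PC_G_DSI:
        return PC_SCOPE_DOMAIN;
    case PC_G_PASID_SI:
        return PC_SCOPE_PASID;
    case PC_G_GLOBAL:
        return PC_SCOPE_GLOBAL;
    default:
        return PC_SCOPE_INVALID;    /* encoding 2 is not defined */
    }
}
```

The three valid scopes correspond to the vtd_pasid_cache_dsi(),
vtd_pasid_cache_psi() and vtd_pasid_cache_gsi() stubs added below.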

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 66 ++++++++++++++++++++++++++++++++++++++----
 hw/i386/intel_iommu_internal.h | 12 ++++++++
 hw/i386/trace-events           |  3 ++
 3 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 43a728f..58e7213 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2393,6 +2393,63 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
+{
+    return 0;
+}
+
+static int vtd_pasid_cache_psi(IntelIOMMUState *s,
+                               uint16_t domain_id, uint32_t pasid)
+{
+    return 0;
+}
+
+static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
+{
+    return 0;
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+                                   VTDInvDesc *inv_desc)
+{
+    uint16_t domain_id;
+    uint32_t pasid;
+    int ret = 0;
+
+    if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
+        (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
+        (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
+        (inv_desc->val[3] & VTD_INV_DESC_PASIDC_RSVD_VAL3)) {
+        error_report_once("non-zero-field-in-pc_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
+
+    switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
+    case VTD_INV_DESC_PASIDC_DSI:
+        ret = vtd_pasid_cache_dsi(s, domain_id);
+        break;
+
+    case VTD_INV_DESC_PASIDC_PASID_SI:
+        ret = vtd_pasid_cache_psi(s, domain_id, pasid);
+        break;
+
+    case VTD_INV_DESC_PASIDC_GLOBAL:
+        ret = vtd_pasid_cache_gsi(s);
+        break;
+
+    default:
+        error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    return ret == 0;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -2499,12 +2556,11 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
-    /*
-     * TODO: the entity of below two cases will be implemented in future series.
-     * To make guest (which integrates scalable mode support patch set in
-     * iommu driver) work, just return true is enough so far.
-     */
     case VTD_INV_DESC_PC:
+        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_pasid_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_PIOTLB:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index fb5fdc2..6c03560 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -448,6 +448,18 @@ typedef union VTDInvDesc VTDInvDesc;
         (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
         (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 
+#define VTD_INV_DESC_PASIDC_G          (3ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0  0xfff000000000ffc0ULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL1  0xffffffffffffffffULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL2  0xffffffffffffffffULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL3  0xffffffffffffffffULL
+
+#define VTD_INV_DESC_PASIDC_DSI        (0ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
+#define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 71536a7..f7cd4e5 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -22,6 +22,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds a PASID cache management infrastructure based on
the newly added structure VTDPASIDAddressSpace, which is used to
track PASID usage and to support future PASID-tagged DMA address
translation in the vIOMMU.

    struct VTDPASIDAddressSpace {
        VTDBus *vtd_bus;
        uint8_t devfn;
        AddressSpace as;
        uint32_t pasid;
        IntelIOMMUState *iommu_state;
        VTDContextCacheEntry context_cache_entry;
        QLIST_ENTRY(VTDPASIDAddressSpace) next;
        VTDPASIDCacheEntry pasid_cache_entry;
    };

Ideally, a VTDPASIDAddressSpace instance is created when a PASID
is bound to a DMA AddressSpace. The Intel VT-d spec requires guest
software to issue a pasid cache invalidation when binding or
unbinding a pasid to or from an address space under caching-mode.
However, as VTDPASIDAddressSpace instances also act as the pasid
cache in this implementation, they may also be created during
vIOMMU PASID-tagged DMA translation. Creation in that path is not
added in this patch since there are no PASID-capable emulated
devices for now.

The implementation in this patch manages VTDPASIDAddressSpace
instances per PASID+BDF (lookup and insert use PASID and BDF)
since the Intel VT-d spec allows a per-BDF PASID table. When a
guest binds a PASID to an AddressSpace, QEMU captures the guest
pasid-selective pasid cache invalidation, and allocates or
removes a VTDPASIDAddressSpace instance according to the reason
for the invalidation:

    *) a present pasid entry moved to non-present
    *) a present pasid entry modified but still present
    *) a non-present pasid entry moved to present

The vIOMMU emulator can figure out the reason by fetching the
latest guest pasid entry.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
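Reviewer note: the three invalidation reasons listed in the commit
message can be told apart by comparing the cached entry with the entry
just fetched from guest memory. The sketch below is a simplified,
standalone illustration and not the code in this patch. The pc_action
enum and pc_classify() are hypothetical names, and treating bit 0 of
val[0] as the Present bit is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Same shape as VTDPASIDEntry: eight 64-bit words. */
typedef struct {
    uint64_t val[8];
} pasid_entry;

/* Hypothetical action codes, one per invalidation reason. */
typedef enum {
    PC_NOOP,    /* nothing changed */
    PC_REMOVE,  /* present -> non-present: drop the cache, unbind */
    PC_UPDATE,  /* present -> present but modified: refresh the cache */
    PC_ADD      /* non-present -> present: create the cache, bind */
} pc_action;

/* Assumption: bit 0 of val[0] is the Present bit. */
static bool pe_present(const pasid_entry *pe)
{
    return (pe->val[0] & 1ULL) != 0;
}

/* cached may be NULL when nothing is cached for this PASID+BDF. */
static pc_action pc_classify(const pasid_entry *cached,
                             const pasid_entry *latest)
{
    bool was = cached != NULL && pe_present(cached);
    bool now = latest != NULL && pe_present(latest);

    if (was && !now) {
        return PC_REMOVE;
    }
    if (!was && now) {
        return PC_ADD;
    }
    if (was && now && memcmp(cached, latest, sizeof(*cached)) != 0) {
        return PC_UPDATE;
    }
    return PC_NOOP;
}
```

In the patch itself, PC_REMOVE and PC_UPDATE roughly correspond to the
two outcomes of vtd_flush_pasid(), and PC_ADD to the device loop in
vtd_pasid_cache_psi().
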
 hw/i386/intel_iommu.c          | 367 +++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  14 ++
 hw/i386/trace-events           |   1 +
 include/hw/i386/intel_iommu.h  |  36 +++-
 4 files changed, 417 insertions(+), 1 deletion(-)
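
Reviewer note: instances are keyed by PASID plus source-id (BDF). The
standalone sketch below shows how such a key could be built and
compared. The helper names are hypothetical, and the source-id layout
(bus number in bits 15:8, devfn in bits 7:0) is an assumption about
what vtd_make_source_id() produces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same fields as struct pasid_key in this patch. */
struct pasid_key {
    uint32_t pasid;
    uint16_t sid;
};

/* Assumption: source-id is bus in bits 15:8 and devfn in bits 7:0. */
static uint16_t make_source_id(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)(((unsigned)bus << 8) | devfn);
}

/* Hypothetical helper: build a lookup key for one PASID on one device. */
static struct pasid_key make_pasid_key(uint8_t bus, uint8_t devfn,
                                       uint32_t pasid)
{
    struct pasid_key k = { .pasid = pasid, .sid = make_source_id(bus, devfn) };
    return k;
}

/*
 * Compare field by field rather than with memcmp(): the struct has
 * trailing padding whose contents are not guaranteed to match.
 */
static bool pasid_key_equal(const struct pasid_key *a,
                            const struct pasid_key *b)
{
    return a->pasid == b->pasid && a->sid == b->sid;
}
```

The same PASID on two different devices yields two distinct keys, which
is what allows a per-BDF PASID table to be cached correctly.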

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 58e7213..c75cb7b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "qemu/jhash.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -65,6 +66,8 @@
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
     error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset(s);
     vtd_iommu_unlock(s);
 }
 
@@ -686,6 +690,11 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
     return true;
 }
 
+static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
+{
+    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -2393,19 +2402,370 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static inline void vtd_init_pasid_key(uint32_t pasid,
+                                     uint16_t sid,
+                                     struct pasid_key *key)
+{
+    key->pasid = pasid;
+    key->sid = sid;
+}
+
+static guint vtd_pasid_as_key_hash(gconstpointer v)
+{
+    struct pasid_key *key = (struct pasid_key *)v;
+    uint32_t a, b, c;
+
+    /* Jenkins hash */
+    a = b = c = JHASH_INITVAL + sizeof(*key);
+    a += key->sid;
+    b += extract32(key->pasid, 0, 16);
+    c += extract32(key->pasid, 16, 16);
+
+    __jhash_mix(a, b, c);
+    __jhash_final(a, b, c);
+
+    return c;
+}
+
+static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
+{
+    const struct pasid_key *k1 = v1;
+    const struct pasid_key *k2 = v2;
+
+    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
+}
+
+static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
+                                            uint8_t bus_num,
+                                            uint8_t devfn,
+                                            uint32_t pasid,
+                                            VTDPASIDEntry *pe)
+{
+    VTDContextEntry ce;
+    int ret;
+    dma_addr_t pasid_dir_base;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
+    ret = vtd_get_pe_from_pasid_table(s,
+                                  pasid_dir_base, pasid, pe);
+
+    return ret;
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/**
+ * This function is used to clear pasid_cache_gen of cached pasid
+ * entry in vtd_pasid_as instances. Caller of this function should
+ * hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+                                gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDPASIDAddressSpace *vtd_pasid_as = value;
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
+    VTDPASIDEntry pe;
+    uint16_t did;
+    uint32_t pasid;
+    uint16_t devfn;
+    int ret;
+
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+    pasid = vtd_pasid_as->pasid;
+    devfn = vtd_pasid_as->devfn;
+
+    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
+        return false;
+    }
+
+    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
+    case VTD_PASID_CACHE_PASIDSI:
+        if (pc_info->pasid != pasid) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_DOMSI:
+        if (pc_info->domain_id != did) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_GLOBAL:
+        break;
+    default:
+        return false;
+    }
+
+    /*
+     * pasid cache invalidation may indicate a present pasid
+     * entry to present pasid entry modification. To cover such
+     * case, vIOMMU emulator needs to fetch latest guest pasid
+     * entry and check cached pasid entry, then update pasid
+     * cache and send pasid bind/unbind to host properly.
+     */
+    ret = vtd_dev_get_pe_from_pasid(s,
+                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
+    if (ret) {
+        /*
+         * No valid pasid entry in guest memory. e.g. pasid entry
+         * was modified to be either all-zero or non-present. Either
+         * case means existing pasid cache should be removed.
+         */
+        goto remove;
+    }
+    /* Compare cached pasid entry and latest pasid entry */
+    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
+        /* pasid entry was updated, thus update the pasid cache */
+        pc_entry->pasid_entry = pe;
+        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
+        /*
+         * TODO:
+         * - send pasid bind to host for passthru devices
+         * - when pasid-based-iotlb (piotlb) infrastructure is ready,
+         *   should invalidate QEMU piotlb together with this change.
+         */
+    }
+    return false;
+remove:
+    /*
+     * TODO:
+     * - send pasid unbind to host for passthru devices
+     * - when pasid-based-iotlb (piotlb) infrastructure is ready,
+     *   should invalidate QEMU piotlb together with this change.
+     */
+    return true;
+}
+
 static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
 {
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_dsi(domain_id);
+
+    pc_info.flags = VTD_PASID_CACHE_DOMSI;
+    pc_info.domain_id = domain_id;
+
+    /*
+     * Loop all existing pasid caches and update them.
+     */
+    vtd_iommu_lock(s);
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                 vtd_flush_pasid, &pc_info);
+
+    /*
+     * TODO: Domain selective PASID cache invalidation
+     * flushes all the pasid caches within a domain. To
+     * be safe, after invalidating the pasid caches, emulator
+     * needs to replay the pasid bindings by walking guest
+     * pasid dir and pasid table.
+     */
+    vtd_iommu_unlock(s);
     return 0;
 }
 
+/**
+ * This function finds or adds a VTDPASIDAddressSpace for a device
+ * when it is bound to a pasid. Caller of this function should hold
+ * iommu_lock.
+ */
+static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
+                                                   VTDBus *vtd_bus,
+                                                   int devfn,
+                                                   uint32_t pasid,
+                                                   bool allocate)
+{
+    struct pasid_key key;
+    struct pasid_key *new_key;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+    uint16_t sid;
+
+    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
+    vtd_init_pasid_key(pasid, sid, &key);
+    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
+
+    if (!vtd_pasid_as && allocate) {
+        new_key = g_malloc0(sizeof(*new_key));
+        vtd_init_pasid_key(pasid, sid, new_key);
+        /*
+         * Initiate the vtd_pasid_as structure.
+         *
+         * This structure here is used to track the guest pasid
+         * binding and also serves as pasid-cache management entry.
+         *
+         * TODO: in future, to support SVA-aware DMA emulation,
+         *       the vtd_pasid_as should include an AddressSpace
+         *       to support DMA emulation.
+         */
+        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
+        vtd_pasid_as->iommu_state = s;
+        vtd_pasid_as->vtd_bus = vtd_bus;
+        vtd_pasid_as->devfn = devfn;
+        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
+        vtd_pasid_as->pasid = pasid;
+        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
+        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
+    }
+    return vtd_pasid_as;
+}
+
+/**
+ * This function updates the pasid entry cached in &vtd_pasid_as.
+ * Caller of this function should hold iommu_lock.
+ */
+static inline void vtd_fill_in_pe_cache(
+              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
+{
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+
+    pc_entry->pasid_entry = *pe;
+    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
+}
+
 static int vtd_pasid_cache_psi(IntelIOMMUState *s,
                                uint16_t domain_id, uint32_t pasid)
 {
+    VTDPASIDCacheInfo pc_info;
+    VTDPASIDEntry pe;
+    VTDBus *vtd_bus;
+    int bus_n, devfn;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+    VTDIOMMUContext *vtd_icx;
+
+    /* PASID selective implies a DID selective */
+    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
+    pc_info.domain_id = domain_id;
+    pc_info.pasid = pasid;
+
+    /*
+     * A pasid selective pasid cache invalidation (PSI) can be
+     * caused by any of the cases below:
+     * a) a present pasid entry moved to non-present
+     * b) a present pasid entry modified but still present
+     * c) a non-present pasid entry moved to present
+     *
+     * Here the handling of a PSI is:
+     * 1) loop over all the existing vtd_pasid_as instances and update
+     *    them according to the latest guest pasid entry in the pasid
+     *    table. This makes sure the affected existing vtd_pasid_as
+     *    instances cache the latest pasid entries. Also, during the
+     *    loop, the host should be notified if needed, e.g. pasid unbind
+     *    or pasid update. This covers case a) and case b).
+     *
+     * 2) loop over all devices to cover case c)
+     *    However, it is not good to always loop over all devices. In
+     *    this implementation, it is done this way:
+     *    - For devices which have VTDIOMMUContext instances,
+     *      we loop over them and check if a guest pasid entry exists.
+     *      If yes, it is case c); we update the pasid cache and also
+     *      notify the host.
+     *    - For devices which have no VTDIOMMUContext
+     *      instances, it is not necessary to create the pasid cache at
+     *      this phase since it could be created when the vIOMMU does
+     *      DMA address translation. This is not implemented yet since
+     *      there are no PASID-capable emulated devices today. If there
+     *      are in future, the pasid cache shall be created there.
+     */
+
+    vtd_iommu_lock(s);
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                vtd_flush_pasid, &pc_info);
+
+    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
+        vtd_bus = vtd_icx->vtd_bus;
+        devfn = vtd_icx->devfn;
+        bus_n = pci_bus_num(vtd_bus->bus);
+
+        /* Step 1: fetch vtd_pasid_as and check if it is valid */
+        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
+                                        devfn, pasid, true);
+        if (vtd_pasid_as &&
+            (s->pasid_cache_gen ==
+             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
+            /*
+             * pasid_cache_gen equals to s->pasid_cache_gen means
+             * vtd_pasid_as is valid after the above s->vtd_pasid_as
+             * updates. Thus no need for the below steps.
+             */
+            continue;
+        }
+
+        /*
+         * Step 2: vtd_pasid_as is not valid, it's potentially a
+         * new pasid bind. Fetch guest pasid entry.
+         */
+        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
+            continue;
+        }
+
+        /*
+         * Step 3: pasid entry exists, update pasid cache
+         *
+         * The domain ID needs to be checked since a guest pasid
+         * entry exists. What needs to be done:
+         *   - update the pc_entry in the vtd_pasid_as
+         *   - set proper pc_entry.pasid_cache_gen
+         *   - pass down the latest guest pasid entry config to host
+         *     (will be added in later patch)
+         */
+        if (domain_id == vtd_pe_get_domain_id(&pe)) {
+            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
+        }
+    }
+    vtd_iommu_unlock(s);
     return 0;
 }
 
+/**
+ * Caller of this function should hold iommu_lock
+ */
+static void vtd_pasid_cache_reset(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_reset();
+
+    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
+
+    /*
+     * Reset pasid cache is a big hammer, so use
+     * g_hash_table_foreach_remove which will free
+     * the vtd_pasid_as instances.
+     */
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                           vtd_flush_pasid, &pc_info);
+    s->pasid_cache_gen = 1;
+}
+
 static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
 {
+    trace_vtd_pasid_cache_gsi();
+
+    vtd_iommu_lock(s);
+    vtd_pasid_cache_reset(s);
+
+    /*
+     * TODO: Global PASID cache invalidation flushes
+     * all the pasid caches. To be safe, after
+     * invalidating the pasid caches, emulator needs
+     * to replay the pasid bindings by walking guest
+     * pasid dir and pasid table.
+     */
+    vtd_iommu_unlock(s);
     return 0;
 }
 
@@ -3659,8 +4019,11 @@ static int vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
     VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
                                                VTDIOMMUContext,
                                                iommu_context);
+    IntelIOMMUState *s = vtd_dev_icx->iommu_state;
 
     vtd_dev_icx->dsi_obj = dsi_obj;
+    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
+
     return 0;
 }
 
@@ -3672,6 +4035,7 @@ static void vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
                                                iommu_context);
 
     vtd_dev_icx->dsi_obj = NULL;
+    QLIST_REMOVE(vtd_dev_icx, next);
 }
 
 IOMMUContextOps vtd_iommu_context_ops = {
@@ -4130,6 +4494,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     }
 
     QLIST_INIT(&s->vtd_as_with_notifiers);
+    QLIST_INIT(&s->vtd_dev_icx_list);
     qemu_mutex_init(&s->iommu_lock);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
@@ -4155,6 +4520,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                      g_free, g_free);
     s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
                                               g_free, g_free);
+    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
+                                   vtd_pasid_as_key_equal, g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
     pci_setup_iommu(bus, &vtd_iommu_ops, dev);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 6c03560..18a9e50 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -311,6 +311,7 @@ typedef enum VTDFaultReason {
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
     VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
+    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
 
     /* This is not a normal fault reason. We use this to indicate some faults
      * that are not referenced by the VT-d specification.
@@ -485,6 +486,19 @@ struct VTDRootEntry {
 };
 typedef struct VTDRootEntry VTDRootEntry;
 
+struct VTDPASIDCacheInfo {
+#define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
+#define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
+#define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
+    uint32_t flags;
+    uint16_t domain_id;
+    uint32_t pasid;
+};
+#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
+                                      VTD_PASID_CACHE_DOMSI  | \
+                                      VTD_PASID_CACHE_PASIDSI)
+typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
+
 /* Masks for struct VTDRootEntry */
 #define VTD_ROOT_ENTRY_P            1ULL
 #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index f7cd4e5..87364a3 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -22,6 +22,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 4158116..3cc4b74 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
 typedef struct VTDIOMMUContext VTDIOMMUContext;
+typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
+typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -101,6 +103,31 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+struct pasid_key {
+    uint32_t pasid;
+    uint16_t sid;
+};
+
+struct VTDPASIDCacheEntry {
+    /*
+     * The cache entry is obsolete if
+     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
+     */
+    uint32_t pasid_cache_gen;
+    struct VTDPASIDEntry pasid_entry;
+};
+
+struct VTDPASIDAddressSpace {
+    VTDBus *vtd_bus;
+    uint8_t devfn;
+    AddressSpace as;
+    uint32_t pasid;
+    IntelIOMMUState *iommu_state;
+    VTDContextCacheEntry context_cache_entry;
+    QLIST_ENTRY(VTDPASIDAddressSpace) next;
+    VTDPASIDCacheEntry pasid_cache_entry;
+};
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -122,6 +149,7 @@ struct VTDIOMMUContext {
     uint8_t devfn;
     IOMMUContext iommu_context;
     DualStageIOMMUObject *dsi_obj;
+    QLIST_ENTRY(VTDIOMMUContext) next;
     IntelIOMMUState *iommu_state;
 };
 
@@ -272,9 +300,14 @@ struct IntelIOMMUState {
 
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
+    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
+    /* list of VTDIOMMUContexts with DualStageIOMMUObject registered */
+    QLIST_HEAD(, VTDIOMMUContext) vtd_dev_icx_list;
+
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
     dma_addr_t intr_root;           /* Interrupt remapping table pointer */
@@ -291,7 +324,8 @@ struct IntelIOMMUState {
 
     /*
      * Protects IOMMU states in general.  Currently it protects the
-     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
+     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace,
+     * and pasid cache in VTDPASIDAddressSpace.
      */
     QemuMutex iommu_lock;
 };
-- 
2.7.4



* [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
@ 2020-01-29 12:16   ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: kevin.tian, yi.l.liu, Yi Sun, Eduardo Habkost, kvm, mst,
	jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, Richard Henderson,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds PASID cache management infrastructure based on the
newly added structure VTDPASIDAddressSpace, which is used to track
PASID usage and to support future PASID-tagged DMA address
translation in the vIOMMU.

    struct VTDPASIDAddressSpace {
        VTDBus *vtd_bus;
        uint8_t devfn;
        AddressSpace as;
        uint32_t pasid;
        IntelIOMMUState *iommu_state;
        VTDContextCacheEntry context_cache_entry;
        QLIST_ENTRY(VTDPASIDAddressSpace) next;
        VTDPASIDCacheEntry pasid_cache_entry;
    };

Ideally, a VTDPASIDAddressSpace instance is created when a PASID
is bound to a DMA AddressSpace. The Intel VT-d spec requires guest
software to issue a pasid cache invalidation when it binds or
unbinds a pasid with an address space under caching-mode. However,
as VTDPASIDAddressSpace instances also act as the pasid cache in
this implementation, they can also be created during vIOMMU
PASID-tagged DMA translation. Creation in that path is not added
in this patch since there are no PASID-capable emulated devices
for now.
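
To illustrate how a generation field can mark a cached pasid entry as
valid or stale, here is a minimal sketch in plain C. The demo_* names
are hypothetical stand-ins for IntelIOMMUState and VTDPASIDCacheEntry
and are not part of this patch; only the rule itself (an entry is
obsolete if its pasid_cache_gen differs from the IOMMU-wide value, and
a fresh entry starts at generation 0) is taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the IOMMU-wide state. */
struct demo_iommu {
    uint32_t pasid_cache_gen;   /* current generation, kept >= 1 */
};

/* Hypothetical stand-in for a cached pasid entry. */
struct demo_pc_entry {
    uint32_t pasid_cache_gen;   /* 0 means "never filled" */
    uint64_t val[8];            /* cached copy of the guest pasid entry */
};

/* An entry is usable only if it was filled under the current generation. */
static bool demo_pc_entry_valid(const struct demo_iommu *s,
                                const struct demo_pc_entry *e)
{
    return e->pasid_cache_gen == s->pasid_cache_gen;
}

/* Copy the guest entry and stamp it with the current generation. */
static void demo_pc_entry_fill(const struct demo_iommu *s,
                               struct demo_pc_entry *e,
                               const uint64_t val[8])
{
    for (int i = 0; i < 8; i++) {
        e->val[i] = val[i];
    }
    e->pasid_cache_gen = s->pasid_cache_gen;
}
```

Because the IOMMU-wide generation is kept at 1 or above, a zero-filled
entry never compares as valid until it has been filled at least once.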

The implementation in this patch manages VTDPASIDAddressSpace
instances per PASID+BDF (lookup and insert use PASID and BDF)
since the Intel VT-d spec allows a per-BDF PASID Table. When a
guest binds a PASID to an AddressSpace, QEMU captures the guest's
PASID-selective pasid cache invalidation, and allocates or removes
a VTDPASIDAddressSpace instance according to the reason for the
invalidation:

    *) a present pasid entry moved to non-present
    *) a present pasid entry modified and still present
    *) a non-present pasid entry moved to present

The vIOMMU emulator can figure out the reason by fetching the latest
guest pasid entry.
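
The three invalidation reasons listed above can be told apart by
comparing what is cached with what the guest pasid table currently
holds. The following is only a sketch of that decision under assumed
names (demo_classify, enum demo_pc_action and struct demo_pe are
hypothetical); the patch itself makes the equivalent decisions inside
vtd_flush_pasid() and vtd_pasid_cache_psi().

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical action codes, one per case in the commit message. */
enum demo_pc_action {
    DEMO_PC_NONE,     /* nothing cached and nothing present, or unchanged */
    DEMO_PC_REMOVE,   /* present -> non-present */
    DEMO_PC_UPDATE,   /* present -> present, contents changed */
    DEMO_PC_ADD,      /* non-present -> present */
};

/* Hypothetical stand-in for VTDPASIDEntry (8 x 64-bit words). */
struct demo_pe {
    uint64_t val[8];
};

/*
 * cached/old describe the emulator's cache, present/latest describe
 * the entry just fetched from guest memory. old is only read when
 * cached is true, latest only when present is true.
 */
static enum demo_pc_action demo_classify(bool cached,
                                         const struct demo_pe *old,
                                         bool present,
                                         const struct demo_pe *latest)
{
    if (cached && !present) {
        return DEMO_PC_REMOVE;
    }
    if (cached && present) {
        return memcmp(old, latest, sizeof(*old)) ? DEMO_PC_UPDATE
                                                 : DEMO_PC_NONE;
    }
    if (present) {
        return DEMO_PC_ADD;
    }
    return DEMO_PC_NONE;
}
```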

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
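A note on the PASID+BDF keying described in the commit message: the
sketch below shows one way to build such a composite key with a
matching hash and equality function, without GLib. All demo_* names
are hypothetical. The patch uses the Jenkins hash helpers from
qemu/jhash.h; FNV-1a is swapped in here only to keep the example
self-contained. Equality compares fields rather than raw bytes, since
a struct of uint32_t + uint16_t normally carries padding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of struct pasid_key. */
struct demo_pasid_key {
    uint32_t pasid;
    uint16_t sid;       /* PCI source id: bus in bits 15:8, devfn in 7:0 */
};

static uint16_t demo_make_sid(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)(((uint16_t)bus << 8) | devfn);
}

static void demo_key_init(struct demo_pasid_key *k,
                          uint8_t bus, uint8_t devfn, uint32_t pasid)
{
    k->pasid = pasid;
    k->sid = demo_make_sid(bus, devfn);
}

/* Field-wise comparison, so padding bytes never matter. */
static bool demo_key_equal(const struct demo_pasid_key *a,
                           const struct demo_pasid_key *b)
{
    return a->pasid == b->pasid && a->sid == b->sid;
}

/* 32-bit FNV-1a over the two fields (a stand-in for the Jenkins hash). */
static uint32_t demo_key_hash(const struct demo_pasid_key *k)
{
    uint32_t words[2] = { k->pasid, k->sid };
    uint32_t h = 2166136261u;              /* FNV offset basis */

    for (int i = 0; i < 2; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xff;
            h *= 16777619u;                /* FNV prime */
        }
    }
    return h;
}
```
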
 hw/i386/intel_iommu.c          | 367 +++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  14 ++
 hw/i386/trace-events           |   1 +
 include/hw/i386/intel_iommu.h  |  36 +++-
 4 files changed, 417 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 58e7213..c75cb7b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "qemu/jhash.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -65,6 +66,8 @@
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
     error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset(s);
     vtd_iommu_unlock(s);
 }
 
@@ -686,6 +690,11 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
     return true;
 }
 
+static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
+{
+    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -2393,19 +2402,370 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static inline void vtd_init_pasid_key(uint32_t pasid,
+                                     uint16_t sid,
+                                     struct pasid_key *key)
+{
+    key->pasid = pasid;
+    key->sid = sid;
+}
+
+static guint vtd_pasid_as_key_hash(gconstpointer v)
+{
+    struct pasid_key *key = (struct pasid_key *)v;
+    uint32_t a, b, c;
+
+    /* Jenkins hash */
+    a = b = c = JHASH_INITVAL + sizeof(*key);
+    a += key->sid;
+    b += extract32(key->pasid, 0, 16);
+    c += extract32(key->pasid, 16, 16);
+
+    __jhash_mix(a, b, c);
+    __jhash_final(a, b, c);
+
+    return c;
+}
+
+static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
+{
+    const struct pasid_key *k1 = v1;
+    const struct pasid_key *k2 = v2;
+
+    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
+}
+
+static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
+                                            uint8_t bus_num,
+                                            uint8_t devfn,
+                                            uint32_t pasid,
+                                            VTDPASIDEntry *pe)
+{
+    VTDContextEntry ce;
+    int ret;
+    dma_addr_t pasid_dir_base;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
+    ret = vtd_get_pe_from_pasid_table(s,
+                                  pasid_dir_base, pasid, pe);
+
+    return ret;
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/**
+ * This function is used to clear pasid_cache_gen of cached pasid
+ * entry in vtd_pasid_as instances. Caller of this function should
+ * hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+                                gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDPASIDAddressSpace *vtd_pasid_as = value;
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
+    VTDPASIDEntry pe;
+    uint16_t did;
+    uint32_t pasid;
+    uint16_t devfn;
+    int ret;
+
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+    pasid = vtd_pasid_as->pasid;
+    devfn = vtd_pasid_as->devfn;
+
+    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
+        return false;
+    }
+
+    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
+    case VTD_PASID_CACHE_PASIDSI:
+        if (pc_info->pasid != pasid) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_DOMSI:
+        if (pc_info->domain_id != did) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_GLOBAL:
+        break;
+    default:
+        return false;
+    }
+
+    /*
+     * pasid cache invalidation may indicate a present pasid
+     * entry to present pasid entry modification. To cover such
+     * case, vIOMMU emulator needs to fetch latest guest pasid
+     * entry and check cached pasid entry, then update pasid
+     * cache and send pasid bind/unbind to host properly.
+     */
+    ret = vtd_dev_get_pe_from_pasid(s,
+                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
+    if (ret) {
+        /*
+         * No valid pasid entry in guest memory. e.g. pasid entry
+         * was modified to be either all-zero or non-present. Either
+         * case means existing pasid cache should be removed.
+         */
+        goto remove;
+    }
+    /* Compare cached pasid entry and latest pasid entry */
+    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
+        /* pasid entry was updated, thus update the pasid cache */
+        pc_entry->pasid_entry = pe;
+        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
+        /*
+         * TODO:
+         * - send pasid bind to host for passthru devices
+         * - when pasid-base-iotlb(piotlb) infrastructure is ready,
+         *   should invalidate QEMU piotlb together with this change.
+         */
+    }
+    return false;
+remove:
+    /*
+     * TODO:
+     * - send pasid unbind to host for passthru devices
+     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
+     *   should invalidate QEMU piotlb together with this change.
+     */
+    return true;
+}
+
 static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
 {
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_dsi(domain_id);
+
+    pc_info.flags = VTD_PASID_CACHE_DOMSI;
+    pc_info.domain_id = domain_id;
+
+    /*
+     * Loop all existing pasid caches and update them.
+     */
+    vtd_iommu_lock(s);
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                 vtd_flush_pasid, &pc_info);
+
+    /*
+     * TODO: Domain selective PASID cache invalidation
+     * flushes all the pasid caches within a domain. To
+     * be safe, after invalidating the pasid caches, emulator
+     * needs to replay the pasid bindings by walking guest
+     * pasid dir and pasid table.
+     */
+    vtd_iommu_unlock(s);
     return 0;
 }
 
+/**
+ * This function finds or adds a VTDPASIDAddressSpace for a device
+ * when it is bound to a pasid. Caller of this function should hold
+ * iommu_lock.
+ */
+static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
+                                                   VTDBus *vtd_bus,
+                                                   int devfn,
+                                                   uint32_t pasid,
+                                                   bool allocate)
+{
+    struct pasid_key key;
+    struct pasid_key *new_key;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+    uint16_t sid;
+
+    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
+    vtd_init_pasid_key(pasid, sid, &key);
+    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
+
+    if (!vtd_pasid_as && allocate) {
+        new_key = g_malloc0(sizeof(*new_key));
+        vtd_init_pasid_key(pasid, sid, new_key);
+        /*
+         * Initialize the vtd_pasid_as structure.
+         *
+         * This structure here is used to track the guest pasid
+         * binding and also serves as pasid-cache management entry.
+         *
+         * TODO: in future, to support SVA-aware DMA emulation,
+         *       the vtd_pasid_as should include an
+         *       AddressSpace to support DMA emulation.
+         */
+        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
+        vtd_pasid_as->iommu_state = s;
+        vtd_pasid_as->vtd_bus = vtd_bus;
+        vtd_pasid_as->devfn = devfn;
+        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
+        vtd_pasid_as->pasid = pasid;
+        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
+        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
+    }
+    return vtd_pasid_as;
+}
+
+/**
+ * This function updates the pasid entry cached in &vtd_pasid_as.
+ * Caller of this function should hold iommu_lock.
+ */
+static inline void vtd_fill_in_pe_cache(
+              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
+{
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+
+    pc_entry->pasid_entry = *pe;
+    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
+}
+
 static int vtd_pasid_cache_psi(IntelIOMMUState *s,
                                uint16_t domain_id, uint32_t pasid)
 {
+    VTDPASIDCacheInfo pc_info;
+    VTDPASIDEntry pe;
+    VTDBus *vtd_bus;
+    int bus_n, devfn;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+    VTDIOMMUContext *vtd_icx;
+
+    /* PASID selective implies a DID selective */
+    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
+    pc_info.domain_id = domain_id;
+    pc_info.pasid = pasid;
+
+    /*
+     * A pasid selective pasid cache invalidation (PSI) can be
+     * caused by any of the cases below:
+     * a) a present pasid entry moved to non-present
+     * b) a present pasid entry modified and still present
+     * c) a non-present pasid entry moved to present
+     *
+     * Here the handling of a PSI is:
+     * 1) loop all the existing vtd_pasid_as instances to update them
+     *    according to the latest guest pasid entry in pasid table.
+     *    This makes sure the affected existing vtd_pasid_as instances
+     *    cache the latest pasid entries. Also, during the loop, the
+     *    host should be notified if needed, e.g. pasid unbind or pasid
+     *    update. This covers case a) and case b).
+     *
+     * 2) loop all devices to cover case c)
+     *    However, it is not good to always loop all devices. In this
+     *    implementation, we do it this way:
+     *    - For devices which have VTDIOMMUContext instances,
+     *      we loop them and check if guest pasid entry exists. If yes,
+     *      it is case c), we update the pasid cache and also notify
+     *      host.
+     *    - For devices which have no VTDIOMMUContext
+     *      instances, it is not necessary to create pasid cache at
+     *      this phase since it could be created when vIOMMU does DMA
+     *      address translation. This is not implemented yet since
+     *      there are no PASID-capable emulated devices today. If we
+     *      have them in future, the pasid cache shall be created there.
+     */
+
+    vtd_iommu_lock(s);
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                vtd_flush_pasid, &pc_info);
+
+    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
+        vtd_bus = vtd_icx->vtd_bus;
+        devfn = vtd_icx->devfn;
+        bus_n = pci_bus_num(vtd_bus->bus);
+
+        /* Step 1: fetch vtd_pasid_as and check if it is valid */
+        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
+                                        devfn, pasid, true);
+        if (vtd_pasid_as &&
+            (s->pasid_cache_gen ==
+             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
+            /*
+             * pasid_cache_gen equals to s->pasid_cache_gen means
+             * vtd_pasid_as is valid after the above s->vtd_pasid_as
+             * updates. Thus no need for the below steps.
+             */
+            continue;
+        }
+
+        /*
+         * Step 2: vtd_pasid_as is not valid, it's potentially a
+         * new pasid bind. Fetch guest pasid entry.
+         */
+        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
+            continue;
+        }
+
+        /*
+         * Step 3: pasid entry exists, update pasid cache
+         *
+         * The domain ID needs to be checked since the guest pasid
+         * entry exists. What needs to be done:
+         *   - update the pc_entry in the vtd_pasid_as
+         *   - set proper pc_entry.pasid_cache_gen
+         *   - pass down the latest guest pasid entry config to host
+         *     (will be added in later patch)
+         */
+        if (domain_id == vtd_pe_get_domain_id(&pe)) {
+            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
+        }
+    }
+    vtd_iommu_unlock(s);
     return 0;
 }
 
+/**
+ * Caller of this function should hold iommu_lock
+ */
+static void vtd_pasid_cache_reset(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_reset();
+
+    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
+
+    /*
+     * Reset pasid cache is a big hammer, so use
+     * g_hash_table_foreach_remove which will free
+     * the vtd_pasid_as instances.
+     */
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                           vtd_flush_pasid, &pc_info);
+    s->pasid_cache_gen = 1;
+}
+
 static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
 {
+    trace_vtd_pasid_cache_gsi();
+
+    vtd_iommu_lock(s);
+    vtd_pasid_cache_reset(s);
+
+    /*
+     * TODO: Global PASID cache invalidation
+     * flushes all the pasid caches. To be safe, after
+     * invalidating the pasid caches, emulator needs
+     * to replay the pasid bindings by walking guest
+     * pasid dir and pasid table.
+     */
+    vtd_iommu_unlock(s);
     return 0;
 }
 
@@ -3659,8 +4019,11 @@ static int vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
     VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
                                                VTDIOMMUContext,
                                                iommu_context);
+    IntelIOMMUState *s = vtd_dev_icx->iommu_state;
 
     vtd_dev_icx->dsi_obj = dsi_obj;
+    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
+
     return 0;
 }
 
@@ -3672,6 +4035,7 @@ static void vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
                                                iommu_context);
 
     vtd_dev_icx->dsi_obj = NULL;
+    QLIST_REMOVE(vtd_dev_icx, next);
 }
 
 IOMMUContextOps vtd_iommu_context_ops = {
@@ -4130,6 +4494,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     }
 
     QLIST_INIT(&s->vtd_as_with_notifiers);
+    QLIST_INIT(&s->vtd_dev_icx_list);
     qemu_mutex_init(&s->iommu_lock);
     memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
@@ -4155,6 +4520,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                      g_free, g_free);
     s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
                                               g_free, g_free);
+    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
+                                   vtd_pasid_as_key_equal, g_free, g_free);
     vtd_init(s);
     sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
     pci_setup_iommu(bus, &vtd_iommu_ops, dev);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 6c03560..18a9e50 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -311,6 +311,7 @@ typedef enum VTDFaultReason {
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
     VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
+    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
 
     /* This is not a normal fault reason. We use this to indicate some faults
      * that are not referenced by the VT-d specification.
@@ -485,6 +486,19 @@ struct VTDRootEntry {
 };
 typedef struct VTDRootEntry VTDRootEntry;
 
+struct VTDPASIDCacheInfo {
+#define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
+#define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
+#define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
+    uint32_t flags;
+    uint16_t domain_id;
+    uint32_t pasid;
+};
+#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
+                                      VTD_PASID_CACHE_DOMSI  | \
+                                      VTD_PASID_CACHE_PASIDSI)
+typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
+
 /* Masks for struct VTDRootEntry */
 #define VTD_ROOT_ENTRY_P            1ULL
 #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index f7cd4e5..87364a3 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -22,6 +22,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 4158116..3cc4b74 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
 typedef struct VTDIOMMUContext VTDIOMMUContext;
+typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
+typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -101,6 +103,31 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+struct pasid_key {
+    uint32_t pasid;
+    uint16_t sid;
+};
+
+struct VTDPASIDCacheEntry {
+    /*
+     * The cache entry is obsolete if
+     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
+     */
+    uint32_t pasid_cache_gen;
+    struct VTDPASIDEntry pasid_entry;
+};
+
+struct VTDPASIDAddressSpace {
+    VTDBus *vtd_bus;
+    uint8_t devfn;
+    AddressSpace as;
+    uint32_t pasid;
+    IntelIOMMUState *iommu_state;
+    VTDContextCacheEntry context_cache_entry;
+    QLIST_ENTRY(VTDPASIDAddressSpace) next;
+    VTDPASIDCacheEntry pasid_cache_entry;
+};
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -122,6 +149,7 @@ struct VTDIOMMUContext {
     uint8_t devfn;
     IOMMUContext iommu_context;
     DualStageIOMMUObject *dsi_obj;
+    QLIST_ENTRY(VTDIOMMUContext) next;
     IntelIOMMUState *iommu_state;
 };
 
@@ -272,9 +300,14 @@ struct IntelIOMMUState {
 
     GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
     VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
+    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
+    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
+    /* list of VTDIOMMUContexts with DualStageIOMMUObject registered */
+    QLIST_HEAD(, VTDIOMMUContext) vtd_dev_icx_list;
+
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
     dma_addr_t intr_root;           /* Interrupt remapping table pointer */
@@ -291,7 +324,8 @@ struct IntelIOMMUState {
 
     /*
      * Protects IOMMU states in general.  Currently it protects the
-     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
+     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace,
+     * and pasid cache in VTDPASIDAddressSpace.
      */
     QemuMutex iommu_lock;
 };
-- 
2.7.4




* [RFC v3 17/25] vfio: add bind stage-1 page table support
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds the bind_stage1_pgtbl() and unbind_stage1_pgtbl()
definitions to DualStageIOMMUOps, and adds the corresponding
implementation in VFIO. This exposes a way for the vIOMMU to set up
dual stage DMA translation for passthru devices on hardware.
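
The calling convention described above (a generic wrapper that forwards
to a backend callback, and reports -ENOENT when no object or no
callback is registered) can be sketched as follows. The demo_* types
and the toy backend are hypothetical and only mirror the shape of
DualStageIOMMUObject and DualStageIOMMUOps; a real backend such as VFIO
would issue an ioctl instead of recording the pasid.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_obj;

/* Hypothetical stand-in for DualIOMMUStage1BindData. */
struct demo_bind_data {
    uint32_t pasid;
};

/* Hypothetical stand-in for DualStageIOMMUOps. */
struct demo_ops {
    int (*bind_stage1_pgtbl)(struct demo_obj *obj,
                             struct demo_bind_data *data);
};

/* Hypothetical stand-in for DualStageIOMMUObject. */
struct demo_obj {
    const struct demo_ops *ops;
    uint32_t bound_pasid;       /* toy backend state */
};

/* Generic entry point: forward to the backend, or fail with -ENOENT. */
static int demo_bind_stage1_pgtbl(struct demo_obj *obj,
                                  struct demo_bind_data *data)
{
    if (!obj) {
        return -ENOENT;
    }
    if (obj->ops && obj->ops->bind_stage1_pgtbl) {
        return obj->ops->bind_stage1_pgtbl(obj, data);
    }
    return -ENOENT;
}

/* Toy backend: remember which pasid was bound. */
static int demo_backend_bind(struct demo_obj *obj,
                             struct demo_bind_data *data)
{
    obj->bound_pasid = data->pasid;
    return 0;
}
```

With this shape the vIOMMU code only depends on the wrapper, so a
device without a registered backend simply sees -ENOENT.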

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/iommu/dual_stage_iommu.c         | 26 ++++++++++++++++++++
 hw/vfio/common.c                    | 48 +++++++++++++++++++++++++++++++++++++
 include/hw/iommu/dual_stage_iommu.h | 22 +++++++++++++++++
 3 files changed, 96 insertions(+)

diff --git a/hw/iommu/dual_stage_iommu.c b/hw/iommu/dual_stage_iommu.c
index d5a7168..9d99e9e 100644
--- a/hw/iommu/dual_stage_iommu.c
+++ b/hw/iommu/dual_stage_iommu.c
@@ -47,6 +47,32 @@ int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
     return -ENOENT;
 }
 
+int ds_iommu_bind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                               DualIOMMUStage1BindData *data)
+{
+    if (!dsi_obj) {
+        return -ENOENT;
+    }
+
+    if (dsi_obj->ops && dsi_obj->ops->bind_stage1_pgtbl) {
+        return dsi_obj->ops->bind_stage1_pgtbl(dsi_obj, data);
+    }
+    return -ENOENT;
+}
+
+int ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                                 DualIOMMUStage1BindData *data)
+{
+    if (!dsi_obj) {
+        return -ENOENT;
+    }
+
+    if (dsi_obj->ops && dsi_obj->ops->unbind_stage1_pgtbl) {
+        return dsi_obj->ops->unbind_stage1_pgtbl(dsi_obj, data);
+    }
+    return -ENOENT;
+}
+
 void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
                           DualStageIOMMUOps *ops,
                           DualStageIOMMUInfo *uinfo)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 014f4e7..d84bdc9 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1219,9 +1219,57 @@ static int vfio_ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj,
     return 0;
 }
 
+static int vfio_ds_iommu_bind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                                           DualIOMMUStage1BindData *bind_data)
+{
+    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
+    struct vfio_iommu_type1_bind *bind;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+    bind = g_malloc0(argsz);
+    bind->argsz = argsz;
+    bind->flags = VFIO_IOMMU_BIND_GUEST_PGTBL;
+    memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+    if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+        ret = -errno;
+        error_report("%s: pasid (%u) bind failed: %d",
+                      __func__, bind_data->pasid, ret);
+    }
+    g_free(bind);
+    return ret;
+}
+
+static int vfio_ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                                        DualIOMMUStage1BindData *bind_data)
+{
+    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
+    struct vfio_iommu_type1_bind *bind;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+    bind = g_malloc0(argsz);
+    bind->argsz = argsz;
+    bind->flags = VFIO_IOMMU_UNBIND_GUEST_PGTBL;
+    memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+    if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+        ret = -errno;
+        error_report("%s: pasid (%u) unbind failed: %d",
+                      __func__, bind_data->pasid, ret);
+    }
+    g_free(bind);
+    return ret;
+}
+
 static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
     .pasid_alloc = vfio_ds_iommu_pasid_alloc,
     .pasid_free = vfio_ds_iommu_pasid_free,
+    .bind_stage1_pgtbl = vfio_ds_iommu_bind_stage1_pgtbl,
+    .unbind_stage1_pgtbl = vfio_ds_iommu_unbind_stage1_pgtbl,
 };
 
 static int vfio_get_iommu_info(VFIOContainer *container,
diff --git a/include/hw/iommu/dual_stage_iommu.h b/include/hw/iommu/dual_stage_iommu.h
index c6100b4..0eb983c 100644
--- a/include/hw/iommu/dual_stage_iommu.h
+++ b/include/hw/iommu/dual_stage_iommu.h
@@ -31,6 +31,7 @@
 typedef struct DualStageIOMMUObject DualStageIOMMUObject;
 typedef struct DualStageIOMMUOps DualStageIOMMUOps;
 typedef struct DualStageIOMMUInfo DualStageIOMMUInfo;
+typedef struct DualIOMMUStage1BindData DualIOMMUStage1BindData;
 
 struct DualStageIOMMUOps {
     /* Allocate pasid from DualStageIOMMU (a.k.a. host IOMMU) */
@@ -41,6 +42,16 @@ struct DualStageIOMMUOps {
     /* Reclaim a pasid from DualStageIOMMU (a.k.a. host IOMMU) */
     int (*pasid_free)(DualStageIOMMUObject *dsi_obj,
                       uint32_t pasid);
+    /*
+     * Bind stage-1 page table to a DualStageIOMMU (a.k.a. host
+     * IOMMU) which has dual stage DMA translation capability.
+     * @bind_data specifies the bind configurations.
+     */
+    int (*bind_stage1_pgtbl)(DualStageIOMMUObject *dsi_obj,
+                            DualIOMMUStage1BindData *bind_data);
+    /* Undo a previous bind. @bind_data specifies the unbind info. */
+    int (*unbind_stage1_pgtbl)(DualStageIOMMUObject *dsi_obj,
+                              DualIOMMUStage1BindData *bind_data);
 };
 
 struct DualStageIOMMUInfo {
@@ -55,9 +66,20 @@ struct DualStageIOMMUObject {
     DualStageIOMMUInfo uinfo;
 };
 
+struct DualIOMMUStage1BindData {
+    uint32_t pasid;
+    union {
+        struct iommu_gpasid_bind_data gpasid_bind;
+    } bind_data;
+};
+
 int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
                          uint32_t max, uint32_t *pasid);
 int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid);
+int ds_iommu_bind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                               DualIOMMUStage1BindData *bind_data);
+int ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                                 DualIOMMUStage1BindData *bind_data);
 
 void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
                           DualStageIOMMUOps *ops,
-- 
2.7.4



* [RFC v3 17/25] vfio: add bind stage-1 page table support
@ 2020-01-29 12:16   ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: kevin.tian, yi.l.liu, Yi Sun, kvm, mst, jun.j.tian, eric.auger,
	yi.y.sun, Jacob Pan, hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds the bind_stage1_pgtbl() and unbind_stage1_pgtbl()
definitions to DualStageIOMMUOps, and adds the corresponding
implementation in VFIO. This exposes a way for the vIOMMU to set up
dual stage DMA translation for passthru devices on hardware.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/iommu/dual_stage_iommu.c         | 26 ++++++++++++++++++++
 hw/vfio/common.c                    | 48 +++++++++++++++++++++++++++++++++++++
 include/hw/iommu/dual_stage_iommu.h | 22 +++++++++++++++++
 3 files changed, 96 insertions(+)

diff --git a/hw/iommu/dual_stage_iommu.c b/hw/iommu/dual_stage_iommu.c
index d5a7168..9d99e9e 100644
--- a/hw/iommu/dual_stage_iommu.c
+++ b/hw/iommu/dual_stage_iommu.c
@@ -47,6 +47,32 @@ int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
     return -ENOENT;
 }
 
+int ds_iommu_bind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                               DualIOMMUStage1BindData *data)
+{
+    if (!dsi_obj) {
+        return -ENOENT;
+    }
+
+    if (dsi_obj->ops && dsi_obj->ops->bind_stage1_pgtbl) {
+        return dsi_obj->ops->bind_stage1_pgtbl(dsi_obj, data);
+    }
+    return -ENOENT;
+}
+
+int ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                                 DualIOMMUStage1BindData *data)
+{
+    if (!dsi_obj) {
+        return -ENOENT;
+    }
+
+    if (dsi_obj->ops && dsi_obj->ops->unbind_stage1_pgtbl) {
+        return dsi_obj->ops->unbind_stage1_pgtbl(dsi_obj, data);
+    }
+    return -ENOENT;
+}
+
 void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
                           DualStageIOMMUOps *ops,
                           DualStageIOMMUInfo *uinfo)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 014f4e7..d84bdc9 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1219,9 +1219,57 @@ static int vfio_ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj,
     return 0;
 }
 
+static int vfio_ds_iommu_bind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                                           DualIOMMUStage1BindData *bind_data)
+{
+    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
+    struct vfio_iommu_type1_bind *bind;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+    bind = g_malloc0(argsz);
+    bind->argsz = argsz;
+    bind->flags = VFIO_IOMMU_BIND_GUEST_PGTBL;
+    memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+    if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+        ret = -errno; /* save before error_report() can clobber errno */
+        error_report("%s: pasid (%u) bind failed: %d",
+                      __func__, bind_data->pasid, ret);
+    }
+    g_free(bind);
+    return ret;
+}
+
+static int vfio_ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                                        DualIOMMUStage1BindData *bind_data)
+{
+    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
+    struct vfio_iommu_type1_bind *bind;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+    bind = g_malloc0(argsz);
+    bind->argsz = argsz;
+    bind->flags = VFIO_IOMMU_UNBIND_GUEST_PGTBL;
+    memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+    if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+        ret = -errno; /* save before error_report() can clobber errno */
+        error_report("%s: pasid (%u) unbind failed: %d",
+                      __func__, bind_data->pasid, ret);
+    }
+    g_free(bind);
+    return ret;
+}
+
 static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
     .pasid_alloc = vfio_ds_iommu_pasid_alloc,
     .pasid_free = vfio_ds_iommu_pasid_free,
+    .bind_stage1_pgtbl = vfio_ds_iommu_bind_stage1_pgtbl,
+    .unbind_stage1_pgtbl = vfio_ds_iommu_unbind_stage1_pgtbl,
 };
 
 static int vfio_get_iommu_info(VFIOContainer *container,
diff --git a/include/hw/iommu/dual_stage_iommu.h b/include/hw/iommu/dual_stage_iommu.h
index c6100b4..0eb983c 100644
--- a/include/hw/iommu/dual_stage_iommu.h
+++ b/include/hw/iommu/dual_stage_iommu.h
@@ -31,6 +31,7 @@
 typedef struct DualStageIOMMUObject DualStageIOMMUObject;
 typedef struct DualStageIOMMUOps DualStageIOMMUOps;
 typedef struct DualStageIOMMUInfo DualStageIOMMUInfo;
+typedef struct DualIOMMUStage1BindData DualIOMMUStage1BindData;
 
 struct DualStageIOMMUOps {
     /* Allocate pasid from DualStageIOMMU (a.k.a. host IOMMU) */
@@ -41,6 +42,16 @@ struct DualStageIOMMUOps {
     /* Reclaim a pasid from DualStageIOMMU (a.k.a. host IOMMU) */
     int (*pasid_free)(DualStageIOMMUObject *dsi_obj,
                       uint32_t pasid);
+    /*
+     * Bind stage-1 page table to a DualStageIOMMU (a.k.a. host
+     * IOMMU) which has dual stage DMA translation capability.
+     * @bind_data specifies the bind configurations.
+     */
+    int (*bind_stage1_pgtbl)(DualStageIOMMUObject *dsi_obj,
+                            DualIOMMUStage1BindData *bind_data);
+    /* Undo a previous bind. @bind_data specifies the unbind info. */
+    int (*unbind_stage1_pgtbl)(DualStageIOMMUObject *dsi_obj,
+                              DualIOMMUStage1BindData *bind_data);
 };
 
 struct DualStageIOMMUInfo {
@@ -55,9 +66,20 @@ struct DualStageIOMMUObject {
     DualStageIOMMUInfo uinfo;
 };
 
+struct DualIOMMUStage1BindData {
+    uint32_t pasid;
+    union {
+        struct iommu_gpasid_bind_data gpasid_bind;
+    } bind_data;
+};
+
 int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
                          uint32_t max, uint32_t *pasid);
 int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid);
+int ds_iommu_bind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                               DualIOMMUStage1BindData *bind_data);
+int ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
+                                 DualIOMMUStage1BindData *bind_data);
 
 void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
                           DualStageIOMMUOps *ops,
-- 
2.7.4




* [RFC v3 18/25] intel_iommu: bind/unbind guest page table to host
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch captures guest PASID table entry modifications and
propagates the changes to the host to set up dual stage DMA translation.
The guest page table is configured as the 1st level page table
(GVA->GPA), whose translation result then goes through the host VT-d
2nd level page table (GPA->HPA) under nested translation mode. This is
a key part of vSVA support, and is also key to supporting IOVA over the
1st level page table for Intel VT-d in a virtualization environment.
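
As a rough illustration of what gets extracted from a guest PASID entry
before the bind, here is a standalone sketch. The field positions follow
the macros added by this patch (FLPM in val[2] bits 3:2, FLPTPTR in
val[2] bits 63:12, SRE in bit 0, EAFE in bit 7). The Demo* names, the
8 x 64-bit entry layout, and the flag bit values are assumptions made
for the example and are not taken from any kernel header. The sketch
gives each attribute its own flag bit, which is an assumption about what
the host interface expects.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits; values are made up for this sketch. */
#define DEMO_FLAG_SRE   (1u << 0)
#define DEMO_FLAG_EAFE  (1u << 1)

/* Field positions as used by the macros in this patch. */
#define DEMO_PE_FLPM_MASK   3ULL          /* val[2] bits 3:2 */
#define DEMO_PE_FLPTPTR     (~0xfffULL)   /* val[2] bits 63:12 */

/* Assumed layout: a 512-bit entry held as eight 64-bit words. */
typedef struct DemoPasidEntry { uint64_t val[8]; } DemoPasidEntry;

/* FLPM 0 means 4-level paging (48 bits); each extra level adds 9 bits. */
static uint32_t demo_pe_fl_addr_width(const DemoPasidEntry *pe)
{
    return 48 + (uint32_t)((pe->val[2] >> 2) & DEMO_PE_FLPM_MASK) * 9;
}

/* Base of the guest first-level page table (a GPA), 4K aligned. */
static uint64_t demo_pe_flpt_base(const DemoPasidEntry *pe)
{
    return pe->val[2] & DEMO_PE_FLPTPTR;
}

/* Collect per-attribute bits into one flags word, one bit each. */
static uint32_t demo_pe_flags(const DemoPasidEntry *pe)
{
    uint32_t flags = 0;

    if (pe->val[2] & 1ULL) {
        flags |= DEMO_FLAG_SRE;
    }
    if ((pe->val[2] >> 7) & 1ULL) {
        flags |= DEMO_FLAG_EAFE;
    }
    return flags;
}
```

With FLPM set to 1 the computed width is 57, i.e. 5-level paging.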

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 83 +++++++++++++++++++++++++++++++++++++++++-
 hw/i386/intel_iommu_internal.h | 26 +++++++++++++
 2 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c75cb7b..319b3df 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -41,6 +41,7 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "qemu/jhash.h"
+#include <linux/iommu.h>
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -695,6 +696,16 @@ static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
     return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
 }
 
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+    return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1854,6 +1865,68 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
+              int devfn, int pasid, VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+    VTDIOMMUContext *vtd_icx;
+    DualIOMMUStage1BindData bind_data;
+    struct iommu_gpasid_bind_data *g_bind_data;
+    int ret = -1;
+
+    vtd_icx = vtd_bus->dev_icx[devfn];
+    if (!vtd_icx) {
+        return ret;
+    }
+
+    if (vtd_icx->dsi_obj->uinfo.pasid_format
+             != IOMMU_PASID_FORMAT_INTEL_VTD) {
+        error_report_once("Dual Stage IOMMU is not compatible!!");
+        return ret;
+    }
+
+    bind_data.pasid = pasid;
+    g_bind_data = &bind_data.bind_data.gpasid_bind;
+
+    g_bind_data->flags = 0;
+    g_bind_data->vtd.flags = 0;
+    switch (op) {
+    case VTD_PASID_BIND:
+    case VTD_PASID_UPDATE:
+        g_bind_data->version = IOMMU_UAPI_VERSION;
+        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+        g_bind_data->gpgd = vtd_pe_get_flpt_base(pe);
+        g_bind_data->addr_width = vtd_pe_get_fl_aw(pe);
+        g_bind_data->hpasid = pasid;
+        g_bind_data->gpasid = pasid;
+        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
+        g_bind_data->vtd.flags =
+                             (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_PCD_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_PWT_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_EMTE_BIT(pe->val[1]) ? 1 : 0)
+                           | (VTD_SM_PASID_ENTRY_CD_BIT(pe->val[1]) ? 1 : 0);
+        g_bind_data->vtd.pat = VTD_SM_PASID_ENTRY_PAT(pe->val[1]);
+        g_bind_data->vtd.emt = VTD_SM_PASID_ENTRY_EMT(pe->val[1]);
+        ret = ds_iommu_bind_stage1_pgtbl(vtd_icx->dsi_obj, &bind_data);
+        break;
+    case VTD_PASID_UNBIND:
+        g_bind_data->version = IOMMU_UAPI_VERSION;
+        g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+        g_bind_data->gpgd = 0;
+        g_bind_data->addr_width = 0;
+        g_bind_data->hpasid = pasid;
+        g_bind_data->gpasid = pasid;
+        g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
+        ret = ds_iommu_unbind_stage1_pgtbl(vtd_icx->dsi_obj, &bind_data);
+        break;
+    default:
+        error_report_once("Unknown VTDPASIDOp!!");
+        break;
+    }
+    return ret;
+}
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -2532,18 +2605,20 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         /* pasid entry was updated, thus update the pasid cache */
         pc_entry->pasid_entry = pe;
         pc_entry->pasid_cache_gen = s->pasid_cache_gen;
+        vtd_bind_guest_pasid(s, vtd_bus, devfn,
+                             pasid, &pe, VTD_PASID_UPDATE);
         /*
          * TODO:
-         * - send pasid bind to host for passthru devices
          * - when pasid-base-iotlb(piotlb) infrastructure is ready,
          *   should invalidate QEMU piotlb togehter with this change.
          */
     }
     return false;
 remove:
+    vtd_bind_guest_pasid(s, vtd_bus, devfn,
+                         pasid, NULL, VTD_PASID_UNBIND);
     /*
      * TODO:
-     * - send pasid unbind to host for passthru devices
      * - when pasid-base-iotlb(piotlb) infrastructure is ready,
      *   should invalidate QEMU piotlb togehter with this change.
      */
@@ -2634,6 +2709,10 @@ static inline void vtd_fill_in_pe_cache(
 
     pc_entry->pasid_entry = *pe;
     pc_entry->pasid_cache_gen = s->pasid_cache_gen;
+    vtd_bind_guest_pasid(s, vtd_pasid_as->vtd_bus,
+                         vtd_pasid_as->devfn,
+                         vtd_pasid_as->pasid,
+                         pe, VTD_PASID_BIND);
 }
 
 static int vtd_pasid_cache_psi(IntelIOMMUState *s,
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 18a9e50..833c442 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -486,6 +486,20 @@ struct VTDRootEntry {
 };
 typedef struct VTDRootEntry VTDRootEntry;
 
+enum VTD_DUAL_STAGE_UAPI {
+    UAPI_BIND_GPASID,
+    UAPI_NUM
+};
+typedef enum VTD_DUAL_STAGE_UAPI VTD_DUAL_STAGE_UAPI;
+
+enum VTDPASIDOp {
+    VTD_PASID_BIND,
+    VTD_PASID_UNBIND,
+    VTD_PASID_UPDATE,
+    VTD_OP_NUM
+};
+typedef enum VTDPASIDOp VTDPASIDOp;
+
 struct VTDPASIDCacheInfo {
 #define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
 #define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
@@ -556,6 +570,18 @@ typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
 
+/* First-level paging mode, pointer and attribute fields */
+#define VTD_SM_PASID_ENTRY_FLPM          3ULL
+#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(val)  (!!((val) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PCD_BIT(val)  (!!(((val) >> 31) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PWT_BIT(val)  (!!(((val) >> 30) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EMTE_BIT(val) (!!(((val) >> 26) & 1ULL))
+#define VTD_SM_PASID_ENTRY_CD_BIT(val)   (!!(((val) >> 25) & 1ULL))
+#define VTD_SM_PASID_ENTRY_PAT(val)      (((val) >> 32) & 0xFFFFFFFFULL)
+#define VTD_SM_PASID_ENTRY_EMT(val)      (((val) >> 27) & 0x7ULL)
+
 /* Second Level Page Translation Pointer*/
 #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
 
-- 
2.7.4



* [RFC v3 19/25] intel_iommu: replay guest pasid bindings to host
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds guest pasid bindings replay for domain
selective (dsi) pasid cache invalidation and global pasid
cache invalidation by walking the guest pasid table.

Reason:
The guest OS may flush the pasid cache with a larger granularity.
e.g. the guest does a svm_bind() but flushes the pasid cache with
a global or domain selective pasid cache invalidation instead
of a pasid selective (psi) pasid cache invalidation. This works
natively on a host: per spec, a global or domain selective pasid
cache invalidation should cover what a pasid selective
invalidation does. The only concern is reduced performance,
since dsi and global cache invalidation flush more than psi.
To align with native behavior, the vIOMMU emulator needs to do
a replay for these two invalidation granularities to reflect
the latest pasid bindings in the guest pasid table.
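
The replay boils down to a two-level table walk. The standalone sketch
below shows that shape on plain in-memory arrays: a directory whose
entries each point to a 64-entry leaf table (matching the
VTD_PASID_TBL_ENTRY_NUM value added by this patch), with absent leaf
tables skipped in one step. The Demo* names, the tiny directory size,
and the visitor callback are hypothetical; the real code reads guest
memory and updates the pasid cache instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LEAF_ENTRIES 64   /* same value as VTD_PASID_TBL_ENTRY_NUM */
#define DEMO_DIR_ENTRIES  4    /* tiny directory, for illustration only */

typedef struct DemoLeaf { bool present[DEMO_LEAF_ENTRIES]; } DemoLeaf;
typedef struct DemoDir  { DemoLeaf *leaf[DEMO_DIR_ENTRIES]; } DemoDir;

typedef void (*demo_visit_fn)(uint32_t pasid, void *opaque);

/* Example visitor: remember the last present PASID that was seen. */
static void demo_record_last(uint32_t pasid, void *opaque)
{
    *(uint32_t *)opaque = pasid;
}

/*
 * Visit every present PASID in [start, end) and return how many were
 * found. The caller keeps end <= DEMO_DIR_ENTRIES * DEMO_LEAF_ENTRIES.
 */
static size_t demo_pasid_table_walk(const DemoDir *dir, uint32_t start,
                                    uint32_t end, demo_visit_fn visit,
                                    void *opaque)
{
    size_t found = 0;

    for (uint32_t pasid = start; pasid < end; pasid++) {
        const DemoLeaf *leaf = dir->leaf[pasid / DEMO_LEAF_ENTRIES];

        if (!leaf) {
            /* No leaf table: jump to the last slot it would cover. */
            pasid |= DEMO_LEAF_ENTRIES - 1;
            continue;
        }
        if (leaf->present[pasid % DEMO_LEAF_ENTRIES]) {
            visit(pasid, opaque);
            found++;
        }
    }
    return found;
}
```

Skipping a whole leaf table when its directory entry is absent is what
keeps a full-range walk cheap when the guest uses only a few PASIDs.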

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 156 ++++++++++++++++++++++++++++++++++++++---
 hw/i386/intel_iommu_internal.h |   1 +
 2 files changed, 147 insertions(+), 10 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 319b3df..1665843 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -68,6 +68,8 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+static int vtd_update_pe_cache_for_dev(IntelIOMMUState *s,
+              VTDBus *vtd_bus, int devfn, int pasid, VTDPASIDEntry *pe);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -2625,6 +2627,113 @@ remove:
     return true;
 }
 
+/**
+ * Constant information used during pasid table walk
+ * @vtd_icx: VTDIOMMUContext
+ * @flags: indicates if it is domain selective walk
+ * @did: domain ID of the pasid table walk
+ */
+typedef struct {
+    VTDIOMMUContext *vtd_icx;
+#define VTD_PASID_TABLE_DID_SEL_WALK   (1ULL << 0)
+    uint32_t flags;
+    uint16_t did;
+} vtd_pasid_table_walk_info;
+
+static bool vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+                                       dma_addr_t pt_base,
+                                       int start,
+                                       int end,
+                                       vtd_pasid_table_walk_info *info)
+{
+    VTDPASIDEntry pe;
+    int pasid = start;
+    int pasid_next;
+    VTDIOMMUContext *vtd_icx = info->vtd_icx;
+
+    while (pasid < end) {
+        pasid_next = pasid + 1;
+
+        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+            && vtd_pe_present(&pe)) {
+            if (vtd_update_pe_cache_for_dev(s, vtd_icx->vtd_bus,
+                                       vtd_icx->devfn, pasid, &pe)) {
+                error_report_once("%s, bus: %d, devfn: %d, pasid: %d",
+                                  __func__,
+                                  pci_bus_num(vtd_icx->vtd_bus->bus),
+                                  vtd_icx->devfn, pasid);
+                return false;
+            }
+        }
+        pasid = pasid_next;
+    }
+    return true;
+}
+
+/*
+ * Currently, the VT-d scalable mode pasid table is a two level table.
+ * This function loops over a range of PASIDs in a given pasid table
+ * to identify the pasid config in guest.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s, dma_addr_t pdt_base,
+                        int start, int end, vtd_pasid_table_walk_info *info)
+{
+    VTDPASIDDirEntry pdire;
+    int pasid = start;
+    int pasid_next;
+    dma_addr_t pt_base;
+
+    while (pasid < end) {
+        pasid_next = pasid + VTD_PASID_TBL_ENTRY_NUM;
+        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+            && vtd_pdire_present(&pdire)) {
+            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+            if (!vtd_sm_pasid_table_walk_one(s,
+                              pt_base, pasid, pasid_next, info)) {
+                break;
+            }
+        }
+        pasid = pasid_next;
+    }
+}
+
+/**
+ * This function replays the guest pasid bindings to the host by
+ * walking the guest PASID table. This ensures the host has the
+ * latest guest pasid bindings.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                           uint16_t *did, bool is_dsi)
+{
+    VTDContextEntry ce;
+    vtd_pasid_table_walk_info info = { 0 };
+    VTDIOMMUContext *vtd_icx;
+
+    if (is_dsi) {
+        info.flags = VTD_PASID_TABLE_DID_SEL_WALK;
+        info.did = *did;
+    }
+
+    /*
+     * This replay only needs to care about the devices which have
+     * an iommu_context created. For those without an iommu_context,
+     * it is not necessary to replay the bindings since their cache
+     * could be re-created in the next DMA address translation.
+     */
+    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
+        if (!vtd_dev_to_context_entry(s,
+                                      pci_bus_num(vtd_icx->vtd_bus->bus),
+                                      vtd_icx->devfn, &ce)) {
+            info.vtd_icx = vtd_icx;
+            vtd_sm_pasid_table_walk(s,
+                                    VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                    0,
+                                    VTD_MAX_HPASID,
+                                    &info);
+        }
+    }
+}
+
 static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
 {
     VTDPASIDCacheInfo pc_info;
@@ -2642,12 +2751,13 @@ static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
                                  vtd_flush_pasid, &pc_info);
 
     /*
-     * TODO: Domain selective PASID cache invalidation
-     * flushes all the pasid caches within a domain. To
-     * be safe, after invalidating the pasid caches, emulator
-     * needs to replay the pasid bindings by walking guest
-     * pasid dir and pasid table.
+     * Domain selective PASID cache invalidation flushes
+     * all the pasid caches within a domain. To be safe,
+     * after invalidating the pasid caches, emulator needs
+     * to replay the pasid bindings by walking guest pasid
+     * dir and pasid table.
      */
+    vtd_replay_guest_pasid_bindings(s, &domain_id, true);
     vtd_iommu_unlock(s);
     return 0;
 }
@@ -2715,6 +2825,31 @@ static inline void vtd_fill_in_pe_cache(
                          pe, VTD_PASID_BIND);
 }
 
+/**
+ * This function updates the pasid entry cached in &vtd_pasid_as.
+ * Caller of this function should hold iommu_lock.
+ */
+static int vtd_update_pe_cache_for_dev(IntelIOMMUState *s, VTDBus *vtd_bus,
+                               int devfn, int pasid, VTDPASIDEntry *pe)
+{
+    VTDPASIDAddressSpace *vtd_pasid_as;
+
+    vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
+                                        devfn, pasid, true);
+    if (!vtd_pasid_as) {
+        error_report_once("%s, fatal error happened!", __func__);
+        return -1;
+    }
+
+    if (vtd_pasid_as->pasid_cache_entry.pasid_cache_gen ==
+                                               s->pasid_cache_gen) {
+        return 0;
+    }
+
+    vtd_fill_in_pe_cache(vtd_pasid_as, pe);
+    return 0;
+}
+
 static int vtd_pasid_cache_psi(IntelIOMMUState *s,
                                uint16_t domain_id, uint32_t pasid)
 {
@@ -2838,12 +2973,13 @@ static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
     vtd_pasid_cache_reset(s);
 
     /*
-     * TODO: Global PASID cache invalidation may be
-     * flushes all the pasid caches. To be safe, after
-     * invalidating the pasid caches, emulator needs
-     * to replay the pasid bindings by walking guest
-     * pasid dir and pasid table.
+     * Global PASID cache invalidation flushes all
+     * the pasid caches. To be safe, after invalidating
+     * the pasid caches, emulator needs to replay the
+     * pasid bindings by walking guest pasid dir and
+     * pasid table.
      */
+    vtd_replay_guest_pasid_bindings(s, NULL, false);
     vtd_iommu_unlock(s);
     return 0;
 }
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 833c442..cd96b6e 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -558,6 +558,7 @@ typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
 #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
 #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
 
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
-- 
2.7.4



* [RFC v3 19/25] intel_iommu: replay guest pasid bindings to host
@ 2020-01-29 12:16   ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: kevin.tian, yi.l.liu, Yi Sun, Eduardo Habkost, kvm, mst,
	jun.j.tian, eric.auger, yi.y.sun, Jacob Pan, Richard Henderson,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds guest pasid bindings replay for domain
selective (dsi) pasid cache invalidation and global pasid
cache invalidation by walking the guest pasid table.

Reason:
The guest OS may flush the pasid cache with a larger granularity.
e.g. the guest does a svm_bind() but flushes the pasid cache with
a global or domain selective pasid cache invalidation instead
of a pasid selective (psi) pasid cache invalidation. This works
natively on a host: per spec, a global or domain selective pasid
cache invalidation should cover what a pasid selective
invalidation does. The only concern is reduced performance,
since dsi and global cache invalidation flush more than psi.
To align with native behavior, the vIOMMU emulator needs to do
a replay for these two invalidation granularities to reflect
the latest pasid bindings in the guest pasid table.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 156 ++++++++++++++++++++++++++++++++++++++---
 hw/i386/intel_iommu_internal.h |   1 +
 2 files changed, 147 insertions(+), 10 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 319b3df..1665843 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -68,6 +68,8 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+static int vtd_update_pe_cache_for_dev(IntelIOMMUState *s,
+              VTDBus *vtd_bus, int devfn, int pasid, VTDPASIDEntry *pe);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -2625,6 +2627,113 @@ remove:
     return true;
 }
 
+/**
+ * Constant information used during pasid table walk
+ * @vtd_icx: VTDIOMMUContext
+ * @flags: indicates if it is domain selective walk
+ * @did: domain ID of the pasid table walk
+ */
+typedef struct {
+    VTDIOMMUContext *vtd_icx;
+#define VTD_PASID_TABLE_DID_SEL_WALK   (1ULL << 0)
+    uint32_t flags;
+    uint16_t did;
+} vtd_pasid_table_walk_info;
+
+static bool vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+                                       dma_addr_t pt_base,
+                                       int start,
+                                       int end,
+                                       vtd_pasid_table_walk_info *info)
+{
+    VTDPASIDEntry pe;
+    int pasid = start;
+    int pasid_next;
+    VTDIOMMUContext *vtd_icx = info->vtd_icx;
+
+    while (pasid < end) {
+        pasid_next = pasid + 1;
+
+        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+            && vtd_pe_present(&pe)) {
+            if (vtd_update_pe_cache_for_dev(s, vtd_icx->vtd_bus,
+                                       vtd_icx->devfn, pasid, &pe)) {
+                error_report_once("%s, bus: %d, devfn: %d, pasid: %d",
+                                  __func__,
+                                  pci_bus_num(vtd_icx->vtd_bus->bus),
+                                  vtd_icx->devfn, pasid);
+                return false;
+            }
+        }
+        pasid = pasid_next;
+    }
+    return true;
+}
+
+/*
+ * Currently, the VT-d scalable mode pasid table is a two-level table.
+ * This function loops over a range of PASIDs in a given pasid table to
+ * identify the pasid config in the guest.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s, dma_addr_t pdt_base,
+                        int start, int end, vtd_pasid_table_walk_info *info)
+{
+    VTDPASIDDirEntry pdire;
+    int pasid = start;
+    int pasid_next;
+    dma_addr_t pt_base;
+
+    while (pasid < end) {
+        pasid_next = pasid + VTD_PASID_TBL_ENTRY_NUM;
+        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+            && vtd_pdire_present(&pdire)) {
+            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+            if (!vtd_sm_pasid_table_walk_one(s,
+                              pt_base, pasid, pasid_next, info)) {
+                break;
+            }
+        }
+        pasid = pasid_next;
+    }
+}
+
+/**
+ * This function replays the guest pasid bindings to the host by
+ * walking the guest PASID table. This ensures the host has the
+ * latest guest pasid bindings.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                           uint16_t *did, bool is_dsi)
+{
+    VTDContextEntry ce;
+    vtd_pasid_table_walk_info info = { 0 };
+    VTDIOMMUContext *vtd_icx;
+
+    if (is_dsi) {
+        info.flags = VTD_PASID_TABLE_DID_SEL_WALK;
+        info.did = *did;
+    }
+
+    /*
+     * This replay only needs to care about the devices which have
+     * an iommu_context created. For devices without an iommu_context,
+     * it is not necessary to replay the bindings since their cache
+     * can be re-created in the next DMA address translation.
+     */
+    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
+        if (!vtd_dev_to_context_entry(s,
+                                      pci_bus_num(vtd_icx->vtd_bus->bus),
+                                      vtd_icx->devfn, &ce)) {
+            info.vtd_icx = vtd_icx;
+            vtd_sm_pasid_table_walk(s,
+                                    VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                    0,
+                                    VTD_MAX_HPASID,
+                                    &info);
+        }
+    }
+}
+
 static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
 {
     VTDPASIDCacheInfo pc_info;
@@ -2642,12 +2751,13 @@ static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
                                  vtd_flush_pasid, &pc_info);
 
     /*
-     * TODO: Domain selective PASID cache invalidation
-     * flushes all the pasid caches within a domain. To
-     * be safe, after invalidating the pasid caches, emulator
-     * needs to replay the pasid bindings by walking guest
-     * pasid dir and pasid table.
+     * Domain selective PASID cache invalidation flushes
+     * all the pasid caches within a domain. To be safe,
+     * after invalidating the pasid caches, emulator needs
+     * to replay the pasid bindings by walking guest pasid
+     * dir and pasid table.
      */
+    vtd_replay_guest_pasid_bindings(s, &domain_id, true);
     vtd_iommu_unlock(s);
     return 0;
 }
@@ -2715,6 +2825,31 @@ static inline void vtd_fill_in_pe_cache(
                          pe, VTD_PASID_BIND);
 }
 
+/**
+ * This function updates the pasid entry cached in &vtd_pasid_as.
+ * Caller of this function should hold iommu_lock.
+ */
+static int vtd_update_pe_cache_for_dev(IntelIOMMUState *s, VTDBus *vtd_bus,
+                               int devfn, int pasid, VTDPASIDEntry *pe)
+{
+    VTDPASIDAddressSpace *vtd_pasid_as;
+
+    vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
+                                        devfn, pasid, true);
+    if (!vtd_pasid_as) {
+        error_report_once("%s, fatal error happened!", __func__);
+        return -1;
+    }
+
+    if (vtd_pasid_as->pasid_cache_entry.pasid_cache_gen ==
+                                               s->pasid_cache_gen) {
+        return 0;
+    }
+
+    vtd_fill_in_pe_cache(vtd_pasid_as, pe);
+    return 0;
+}
+
 static int vtd_pasid_cache_psi(IntelIOMMUState *s,
                                uint16_t domain_id, uint32_t pasid)
 {
@@ -2838,12 +2973,13 @@ static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
     vtd_pasid_cache_reset(s);
 
     /*
-     * TODO: Global PASID cache invalidation may be
-     * flushes all the pasid caches. To be safe, after
-     * invalidating the pasid caches, emulator needs
-     * to replay the pasid bindings by walking guest
-     * pasid dir and pasid table.
+     * Global PASID cache invalidation flushes all
+     * the pasid caches. To be safe, after invalidating
+     * the pasid caches, emulator needs to replay the
+     * pasid bindings by walking guest pasid dir and
+     * pasid table.
      */
+    vtd_replay_guest_pasid_bindings(s, NULL, false);
     vtd_iommu_unlock(s);
     return 0;
 }
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 833c442..cd96b6e 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -558,6 +558,7 @@ typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
 #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
 #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
 
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
-- 
2.7.4
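
Note for readers (illustrative only, not part of the patch):
vtd_update_pe_cache_for_dev() above skips the refill when the cached
entry's pasid_cache_gen equals s->pasid_cache_gen. A minimal sketch of
that generation-counter pattern follows. It assumes that a cache reset
bumps the global generation and that a refill stamps the entry with the
current generation; neither step is shown in this patch, and all
sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t cache_gen;          /* global generation, bumped on reset */
} sketch_iommu;

typedef struct {
    uint32_t gen;                /* generation at the time of the last fill */
    uint64_t pe;                 /* stand-in for the cached PASID entry */
} sketch_pe_cache;

/* Assumption: a reset invalidates every entry by moving the generation. */
static void sketch_cache_reset(sketch_iommu *s)
{
    s->cache_gen++;
}

/* Returns true when the cached entry was stale and has been refilled. */
static bool sketch_update_pe_cache(const sketch_iommu *s, sketch_pe_cache *c,
                                   uint64_t guest_pe)
{
    if (c->gen == s->cache_gen) {
        return false;            /* still valid, nothing to do */
    }
    c->pe = guest_pe;
    c->gen = s->cache_gen;
    return true;
}

/* Demo: fill once, hit once, reset, then refill. Returns the refill count. */
static int sketch_demo_gen(uint64_t *final_pe)
{
    sketch_iommu s = { 1 };
    sketch_pe_cache c = { 0, 0 };
    int refills = 0;

    refills += sketch_update_pe_cache(&s, &c, 0x1000);  /* stale: refill */
    refills += sketch_update_pe_cache(&s, &c, 0x2000);  /* valid: skipped */
    sketch_cache_reset(&s);
    refills += sketch_update_pe_cache(&s, &c, 0x3000);  /* stale again */

    *final_pe = c.pe;
    return refills;
}
```

The sketch ignores counter wrap-around, which a real implementation
would have to handle.
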



^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 20/25] intel_iommu: replay pasid binds after context cache invalidation
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch replays guest PASID bindings after a context cache
invalidation. The replay is a safety measure: strictly speaking,
the programmer should issue a PASID cache invalidation with the
proper granularity after issuing a context cache invalidation.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
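Note for readers (illustrative only, not part of the patch): a
device-selective flush has to drop exactly the cache entries whose bus
and devfn both match the invalidated device, and keep everything else.
A minimal standalone sketch of that filter follows. It uses a plain
array instead of the GLib hash table used in QEMU, an integer as a
stand-in for the VTDBus pointer, and hypothetical sketch_* names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int bus;                     /* stand-in for the VTDBus pointer */
    uint16_t devfn;
    uint32_t pasid;
} sketch_cache_entry;

/* True when the entry belongs to the device being invalidated. */
static bool sketch_flush_match(const sketch_cache_entry *e,
                               int bus, uint16_t devfn)
{
    return e->bus == bus && e->devfn == devfn;
}

/*
 * Remove all entries of one device by compacting the array in place.
 * Returns the number of entries that remain.
 */
static size_t sketch_flush_device(sketch_cache_entry *cache, size_t len,
                                  int bus, uint16_t devfn)
{
    size_t kept = 0;

    assert(cache != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (!sketch_flush_match(&cache[i], bus, devfn)) {
            cache[kept++] = cache[i];
        }
    }
    return kept;
}

/* Demo: two entries for bus 0 / devfn 0x08, two for other devices. */
static size_t sketch_demo_flush(void)
{
    sketch_cache_entry cache[] = {
        { 0, 0x08, 1 },
        { 0, 0x08, 2 },
        { 0, 0x10, 1 },          /* same bus, different devfn: kept */
        { 1, 0x08, 1 },          /* different bus, same devfn: kept */
    };

    return sketch_flush_device(cache, sizeof(cache) / sizeof(cache[0]),
                               0, 0x08);
}
```

In a foreach-remove style callback the same predicate is expressed the
other way round: return false (keep) as soon as either the bus or the
devfn differs.
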
 hw/i386/intel_iommu.c          | 67 ++++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  6 +++-
 hw/i386/trace-events           |  1 +
 3 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1665843..6422add 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -70,6 +70,10 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
 static int vtd_update_pe_cache_for_dev(IntelIOMMUState *s,
               VTDBus *vtd_bus, int devfn, int pasid, VTDPASIDEntry *pe);
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                           uint16_t *did, bool is_dsi);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  VTDBus *vtd_bus, uint16_t devfn);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -1865,6 +1869,10 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
      * VT-d emulation codes.
      */
     vtd_iommu_replay_all(s);
+
+    vtd_iommu_lock(s);
+    vtd_replay_guest_pasid_bindings(s, NULL, false);
+    vtd_iommu_unlock(s);
 }
 
 static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
@@ -1986,6 +1994,22 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                  * happened.
                  */
                 vtd_sync_shadow_page_table(vtd_as);
+                /*
+                 * Per spec, a context flush should also be followed by a
+                 * PASID cache and iotlb flush. For a device selective
+                 * context cache invalidation:
+                 * if (emulated_device)
+                 *    modify the pasid cache gen and pasid-based iotlb gen
+                 *    value (will be added in following patches)
+                 * else if (assigned_device)
+                 *    check if the device has been bound to any pasid
+                 *    invoke pasid_unbind for each bound pasid
+                 * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
+                 * caches, while for piotlb in QEMU, we don't have it yet, so
+                 * no handling. For assigned device, host iommu driver would
+                 * flush piotlb when a pasid unbind is passed down to it.
+                 */
+                vtd_pasid_cache_devsi(s, vtd_bus, devfn_it);
             }
         }
     }
@@ -2581,6 +2605,12 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         /* Fall through */
     case VTD_PASID_CACHE_GLOBAL:
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        if (pc_info->vtd_bus != vtd_bus ||
+            pc_info->devfn != devfn) {
+            return false;
+        }
+        break;
     default:
         return false;
     }
@@ -2944,6 +2974,43 @@ static int vtd_pasid_cache_psi(IntelIOMMUState *s,
     return 0;
 }
 
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  VTDBus *vtd_bus, uint16_t devfn)
+{
+    VTDPASIDCacheInfo pc_info;
+    VTDContextEntry ce;
+    vtd_pasid_table_walk_info info;
+
+    trace_vtd_pasid_cache_devsi(devfn);
+
+    pc_info.flags = VTD_PASID_CACHE_DEVSI;
+    pc_info.vtd_bus = vtd_bus;
+    pc_info.devfn = devfn;
+
+    vtd_iommu_lock(s);
+    g_hash_table_foreach_remove(s->vtd_pasid_as, vtd_flush_pasid, &pc_info);
+
+    /*
+     * To be safe, after invalidating the pasid caches,
+     * emulator needs to replay the pasid bindings by
+     * walking guest pasid dir and pasid table.
+     */
+    if (vtd_bus->dev_icx[devfn] &&
+        !vtd_dev_to_context_entry(s,
+                                  pci_bus_num(vtd_bus->bus),
+                                  devfn, &ce)) {
+        info.flags = 0x0;
+        info.did = 0;
+        info.vtd_icx = vtd_bus->dev_icx[devfn];
+        vtd_sm_pasid_table_walk(s,
+                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                0,
+                                VTD_MAX_HPASID,
+                                &info);
+    }
+    vtd_iommu_unlock(s);
+}
+
 /**
  * Caller of this function should hold iommu_lock
  */
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index cd96b6e..a487b30 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -504,13 +504,17 @@ struct VTDPASIDCacheInfo {
 #define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
 #define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
 #define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
+#define VTD_PASID_CACHE_DEVSI    (1ULL << 3)
     uint32_t flags;
     uint16_t domain_id;
     uint32_t pasid;
+    VTDBus *vtd_bus;
+    uint16_t devfn;
 };
 #define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
                                       VTD_PASID_CACHE_DOMSI  | \
-                                      VTD_PASID_CACHE_PASIDSI)
+                                      VTD_PASID_CACHE_PASIDSI | \
+                                      VTD_PASID_CACHE_DEVSI)
 typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 
 /* Masks for struct VTDRootEntry */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 87364a3..75f5002 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -26,6 +26,7 @@ vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 21/25] intel_iommu: do not pass down pasid bind for PASID #0
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

The RID_PASID field was introduced in the VT-d 3.0 spec. It is used
for DMA requests without PASID in scalable mode VT-d; such requests
are also known as IOVA requests. The VT-d 3.1 spec defines the
fallback as follows:

"Implementations not supporting RID_PASID capability
(ECAP_REG.RPS is 0b), use a PASID value of 0 to perform
address translation for requests without PASID."

This patch adds a check on the PASIDs which are going to be bound
to a device. For PASID #0, it is not necessary to pass down a pasid
bind request, since PASID #0 is used as RID_PASID for DMA requests
without PASID. A further reason is that the current Intel vIOMMU
supports gIOVA by shadowing the guest 2nd level page table. However,
in future, if the guest IOMMU driver uses the 1st level page table
to store IOVA mappings, guest IOVA support will also be done via
nested translation. When gIOVA is over FLPT, the vIOMMU should pass
down the pasid bind request for PASID #0 to the host, and the host
needs to bind the guest IOVA page table to a proper PASID, e.g. the
PASID value in the RID_PASID field for a PF/VF if ECAP_REG.RPS is
clear, or the default PASID for an ADI (Assignable Device Interface
in the Scalable IOV solution).

IOVA over FLPT support on Intel VT-d:
https://lkml.org/lkml/2019/9/23/297

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6422add..a511289 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1883,6 +1883,16 @@ static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
     struct iommu_gpasid_bind_data *g_bind_data;
     int ret = -1;
 
+    if (pasid < VTD_MIN_HPASID) {
+        /*
+         * If pasid < VTD_MIN_HPASID, this pasid is not allocated
+         * from host. No need to pass down the changes on it to host.
+         * TODO: when IOVA over FLPT is ready, this check should be
+         * refined.
+         */
+        return 0;
+    }
+
     vtd_icx = vtd_bus->dev_icx[devfn];
     if (!vtd_icx) {
         return ret;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 22/25] vfio: add support for flush iommu stage-1 cache
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds a flush_stage1_cache() callback to DualStageIOMMUOps
and the corresponding implementation in VFIO. This gives the vIOMMU
a way to flush the stage-1 cache on the host side, since the guest
owns the stage-1 translation structures in dual stage DMA translation.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
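Note for readers (illustrative only, not part of the patch): the
wrapper added to dual_stage_iommu.c follows an optional-callback
pattern. A missing object, a missing ops table and a missing callback
all yield -ENOENT, so callers can tell "not supported" apart from a
backend failure. A minimal standalone sketch follows, with
hypothetical sketch_* types standing in for DualStageIOMMUObject and
DualStageIOMMUOps.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef struct sketch_obj sketch_obj;

typedef struct {
    /* Optional: backends without stage-1 flush support leave this NULL. */
    int (*flush_stage1_cache)(sketch_obj *obj, const void *cache);
} sketch_ops;

struct sketch_obj {
    const sketch_ops *ops;
    int flush_count;             /* demo bookkeeping only */
};

/* Dispatch to the backend if it provides the callback, else -ENOENT. */
static int sketch_flush_stage1_cache(sketch_obj *obj, const void *cache)
{
    if (!obj) {
        return -ENOENT;
    }
    if (obj->ops && obj->ops->flush_stage1_cache) {
        return obj->ops->flush_stage1_cache(obj, cache);
    }
    return -ENOENT;
}

/* Fake backend that records each flush request. */
static int sketch_backend_flush(sketch_obj *obj, const void *cache)
{
    (void)cache;
    obj->flush_count++;
    return 0;
}

static const sketch_ops sketch_backend_ops = {
    .flush_stage1_cache = sketch_backend_flush,
};
```

A caller would typically treat -ENOENT as "no host backend to notify"
and any other negative value as a failed flush.
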
 hw/iommu/dual_stage_iommu.c         | 13 +++++++++++++
 hw/vfio/common.c                    | 24 ++++++++++++++++++++++++
 include/hw/iommu/dual_stage_iommu.h | 14 ++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/hw/iommu/dual_stage_iommu.c b/hw/iommu/dual_stage_iommu.c
index 9d99e9e..abbd2f7 100644
--- a/hw/iommu/dual_stage_iommu.c
+++ b/hw/iommu/dual_stage_iommu.c
@@ -73,6 +73,19 @@ int ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
     return -ENOENT;
 }
 
+int ds_iommu_flush_stage1_cache(DualStageIOMMUObject *dsi_obj,
+                                DualIOMMUStage1Cache *cache)
+{
+    if (!dsi_obj) {
+        return -ENOENT;
+    }
+
+    if (dsi_obj->ops && dsi_obj->ops->flush_stage1_cache) {
+        return dsi_obj->ops->flush_stage1_cache(dsi_obj, cache);
+    }
+    return -ENOENT;
+}
+
 void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
                           DualStageIOMMUOps *ops,
                           DualStageIOMMUInfo *uinfo)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d84bdc9..6f0933c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1265,11 +1265,35 @@ static int vfio_ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
     return ret;
 }
 
+static int vfio_ds_iommu_flush_stage1_cache(DualStageIOMMUObject *dsi_obj,
+                                            DualIOMMUStage1Cache *cache)
+{
+    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
+    struct vfio_iommu_type1_cache_invalidate *cache_inv;
+    unsigned long argsz;
+    int ret = 0;
+
+    argsz = sizeof(*cache_inv) + sizeof(cache->cache_info);
+    cache_inv = g_malloc0(argsz);
+    cache_inv->argsz = argsz;
+    cache_inv->flags = 0;
+    memcpy(&cache_inv->cache_info, &cache->cache_info,
+           sizeof(cache->cache_info));
+
+    if (ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, cache_inv)) {
+        ret = -errno;
+        error_report("%s: iommu cache flush failed: %d", __func__, ret);
+    }
+    g_free(cache_inv);
+    return ret;
+}
+
 static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
     .pasid_alloc = vfio_ds_iommu_pasid_alloc,
     .pasid_free = vfio_ds_iommu_pasid_free,
     .bind_stage1_pgtbl = vfio_ds_iommu_bind_stage1_pgtbl,
     .unbind_stage1_pgtbl = vfio_ds_iommu_unbind_stage1_pgtbl,
+    .flush_stage1_cache = vfio_ds_iommu_flush_stage1_cache,
 };
 
 static int vfio_get_iommu_info(VFIOContainer *container,
diff --git a/include/hw/iommu/dual_stage_iommu.h b/include/hw/iommu/dual_stage_iommu.h
index 0eb983c..7daeb72 100644
--- a/include/hw/iommu/dual_stage_iommu.h
+++ b/include/hw/iommu/dual_stage_iommu.h
@@ -32,6 +32,7 @@ typedef struct DualStageIOMMUObject DualStageIOMMUObject;
 typedef struct DualStageIOMMUOps DualStageIOMMUOps;
 typedef struct DualStageIOMMUInfo DualStageIOMMUInfo;
 typedef struct DualIOMMUStage1BindData DualIOMMUStage1BindData;
+typedef struct DualIOMMUStage1Cache DualIOMMUStage1Cache;
 
 struct DualStageIOMMUOps {
     /* Allocate pasid from DualStageIOMMU (a.k.a. host IOMMU) */
@@ -52,6 +53,12 @@ struct DualStageIOMMUOps {
     /* Undo a previous bind. @bind_data specifies the unbind info. */
     int (*unbind_stage1_pgtbl)(DualStageIOMMUObject *dsi_obj,
                               DualIOMMUStage1BindData *bind_data);
+    /*
+     * Propagate stage-1 cache flush to DualStageIOMMU (a.k.a.
+     * host IOMMU), cache info specified in @cache
+     */
+    int (*flush_stage1_cache)(DualStageIOMMUObject *dsi_obj,
+                              DualIOMMUStage1Cache *cache);
 };
 
 struct DualStageIOMMUInfo {
@@ -73,6 +80,11 @@ struct DualIOMMUStage1BindData {
     } bind_data;
 };
 
+struct DualIOMMUStage1Cache {
+    uint32_t pasid;
+    struct iommu_cache_invalidate_info cache_info;
+};
+
 int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
                          uint32_t max, uint32_t *pasid);
 int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid);
@@ -80,6 +92,8 @@ int ds_iommu_bind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
                                DualIOMMUStage1BindData *bind_data);
 int ds_iommu_unbind_stage1_pgtbl(DualStageIOMMUObject *dsi_obj,
                                  DualIOMMUStage1BindData *bind_data);
+int ds_iommu_flush_stage1_cache(DualStageIOMMUObject *dsi_obj,
+                                DualIOMMUStage1Cache *cache);
 
 void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
                           DualStageIOMMUOps *ops,
-- 
2.7.4
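
The flush_stage1_cache callback above is reached through a thin wrapper that
returns -ENOENT when no backend object or callback is registered. A minimal
sketch of that dispatch pattern follows. The Demo* types and demo_* functions
are hypothetical stand-ins for the QEMU structures and are not part of the
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QEMU types; only the shape matters. */
typedef struct DemoCache {
    unsigned int pasid;
} DemoCache;

typedef struct DemoObject DemoObject;

typedef struct DemoOps {
    int (*flush_stage1_cache)(DemoObject *obj, DemoCache *cache);
} DemoOps;

struct DemoObject {
    DemoOps *ops;
    unsigned int last_flushed_pasid; /* records what the backend saw */
};

/* Same shape as the wrapper: -ENOENT when there is no object or callback. */
int demo_flush_stage1_cache(DemoObject *obj, DemoCache *cache)
{
    if (!obj) {
        return -ENOENT;
    }
    if (obj->ops && obj->ops->flush_stage1_cache) {
        return obj->ops->flush_stage1_cache(obj, cache);
    }
    return -ENOENT;
}

/* Hypothetical backend standing in for the VFIO implementation. */
int demo_backend_flush(DemoObject *obj, DemoCache *cache)
{
    obj->last_flushed_pasid = cache->pasid;
    return 0;
}

/* Exercises the three paths: no object, no callback, registered callback. */
void demo_dispatch_check(void)
{
    DemoOps empty_ops = { NULL };
    DemoOps vfio_like_ops = { demo_backend_flush };
    DemoObject unregistered = { &empty_ops, 0 };
    DemoObject registered = { &vfio_like_ops, 0 };
    DemoCache cache = { 42 };

    assert(demo_flush_stage1_cache(NULL, &cache) == -ENOENT);
    assert(demo_flush_stage1_cache(&unregistered, &cache) == -ENOENT);
    assert(demo_flush_stage1_cache(&registered, &cache) == 0);
    assert(registered.last_flushed_pasid == 42);
}
```

Calling demo_dispatch_check() from a main() runs all three paths. With this
shape the vIOMMU code does not need to know whether a VFIO container backs the
device; it only checks the return value.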


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 23/25] intel_iommu: process PASID-based iotlb invalidation
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds basic PASID-based iotlb (piotlb) invalidation
support. The piotlb is used when walking the Intel VT-d first-level
page table. This patch only decodes and validates the descriptor;
the actual invalidation handling is added in the next patch.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 57 ++++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h | 13 ++++++++++
 2 files changed, 70 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a511289..1fe8257 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3102,6 +3102,59 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     return (ret == 0) ? true : false;
 }
 
+static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
+                                        uint16_t domain_id,
+                                        uint32_t pasid)
+{
+}
+
+static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
+                             uint32_t pasid, hwaddr addr, uint8_t am, bool ih)
+{
+}
+
+static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
+                                    VTDInvDesc *inv_desc)
+{
+    uint16_t domain_id;
+    uint32_t pasid;
+    uint8_t am;
+    hwaddr addr;
+
+    if ((inv_desc->val[0] & VTD_INV_DESC_PIOTLB_RSVD_VAL0) ||
+        (inv_desc->val[1] & VTD_INV_DESC_PIOTLB_RSVD_VAL1)) {
+        error_report_once("non-zero-field-in-piotlb_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    domain_id = VTD_INV_DESC_PIOTLB_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PIOTLB_PASID(inv_desc->val[0]);
+    switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
+    case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
+        vtd_piotlb_pasid_invalidate(s, domain_id, pasid);
+        break;
+
+    case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
+        am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
+        addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
+        if (am > VTD_MAMV) {
+            error_report_once("Invalid am, > max am value, hi: 0x%" PRIx64
+                    " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+            return false;
+        }
+        vtd_piotlb_page_invalidate(s, domain_id, pasid,
+             addr, am, VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
+        break;
+
+    default:
+        error_report_once("Invalid granularity in P-IOTLB desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -3216,6 +3269,10 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         break;
 
     case VTD_INV_DESC_PIOTLB:
+        trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_piotlb_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_WAIT:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index a487b30..7f4db04 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -461,6 +461,19 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
 #define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
 
+#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
+#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
+
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000ffc0ULL
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
+
+#define VTD_INV_DESC_PIOTLB_PASID(val)    (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PIOTLB_DID(val)      (((val) >> 16) & \
+                                             VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PIOTLB_ADDR(val)     ((val) & ~0xfffULL)
+#define VTD_INV_DESC_PIOTLB_AM(val)       ((val) & 0x3fULL)
+#define VTD_INV_DESC_PIOTLB_IH(val)       (((val) >> 6) & 0x1)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
-- 
2.7.4
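
To illustrate the descriptor layout this patch decodes, here is a small
standalone sketch. The field macros are copied from the header hunk above with
shortened names. The 16-bit domain-id mask, the granularity mask value
(3ULL << 4) and the maximum address-mask value of 18 are assumptions about
definitions that live outside this patch, and PiotlbInv and piotlb_decode()
are hypothetical names. The size computation relies on the address mask
covering 2^am contiguous 4 KiB pages.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout copied from the patch; names shortened for the sketch. */
#define PIOTLB_G_MASK        (3ULL << 4)  /* assumed VTD_INV_DESC_IOTLB_G */
#define PIOTLB_ALL_IN_PASID  (2ULL << 4)
#define PIOTLB_PSI_IN_PASID  (3ULL << 4)
#define PIOTLB_RSVD_VAL0     0xfff000000000ffc0ULL
#define PIOTLB_RSVD_VAL1     0xf80ULL
#define PIOTLB_PASID(val)    (((val) >> 32) & 0xfffffULL)
#define PIOTLB_DID(val)      (((val) >> 16) & 0xffffULL) /* assumed 16-bit */
#define PIOTLB_ADDR(val)     ((val) & ~0xfffULL)
#define PIOTLB_AM(val)       ((val) & 0x3fULL)
#define PIOTLB_IH(val)       (((val) >> 6) & 0x1)
#define DEMO_MAX_AM          18           /* assumed value of VTD_MAMV */

/* Hypothetical decoded form of one piotlb invalidation descriptor. */
typedef struct PiotlbInv {
    uint16_t domain_id;
    uint32_t pasid;
    int page_selective;  /* 0: all entries in the PASID, 1: address range */
    uint64_t addr;       /* valid only when page_selective is set */
    uint64_t size;       /* bytes covered: 4 KiB << am */
    int ih;              /* invalidation hint bit */
} PiotlbInv;

/* Returns 0 on success, -1 on reserved bits, bad am or bad granularity. */
int piotlb_decode(uint64_t lo, uint64_t hi, PiotlbInv *out)
{
    uint64_t am;

    if ((lo & PIOTLB_RSVD_VAL0) || (hi & PIOTLB_RSVD_VAL1)) {
        return -1;
    }
    out->domain_id = (uint16_t)PIOTLB_DID(lo);
    out->pasid = (uint32_t)PIOTLB_PASID(lo);
    out->page_selective = 0;
    out->addr = 0;
    out->size = 0;
    out->ih = 0;

    switch (lo & PIOTLB_G_MASK) {
    case PIOTLB_ALL_IN_PASID:
        return 0;
    case PIOTLB_PSI_IN_PASID:
        am = PIOTLB_AM(hi);
        if (am > DEMO_MAX_AM) {
            return -1;
        }
        out->page_selective = 1;
        out->addr = PIOTLB_ADDR(hi);
        out->size = 1ULL << (12 + am);
        out->ih = (int)PIOTLB_IH(hi);
        return 0;
    default:
        return -1;
    }
}
```

For example, a descriptor with am = 9 at address 0x200000 describes one
2 MiB range, which is what a guest would typically send after changing a
2 MiB mapping.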


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 24/25] intel_iommu: propagate PASID-based iotlb invalidation to host
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch propagates PASID-based iotlb invalidation to the host.

Intel VT-d 3.0 supports nested translation at PASID granularity.
Guest SVA support can be implemented by configuring nested
translation on a specific PASID. This is also known as dual-stage
DMA translation.

Under such a configuration, the guest owns the GVA->GPA translation,
which is configured as the first-level page table on the host side
for a specific PASID, and the host owns the GPA->HPA translation.
Because the guest owns the first-level translation table, piotlb
invalidation must be propagated to the host: the host IOMMU caches
first-level page table mappings during DMA address translation.

This patch traps the guest PASID-based iotlb flush and propagates
it to the host.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 122 +++++++++++++++++++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |   7 +++
 2 files changed, 129 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1fe8257..93de7e4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3102,15 +3102,137 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     return (ret == 0) ? true : false;
 }
 
+static void vtd_invalidate_piotlb(IntelIOMMUState *s, VTDBus *vtd_bus,
+                                  int devfn, DualIOMMUStage1Cache *stage1_cache)
+{
+    VTDIOMMUContext *vtd_icx;
+    vtd_icx = vtd_bus->dev_icx[devfn];
+    if (!vtd_icx) {
+        return;
+    }
+    if (ds_iommu_flush_stage1_cache(vtd_icx->dsi_obj, stage1_cache)) {
+        error_report("Cache flush failed");
+    }
+}
+
+static inline bool vtd_pasid_cache_valid(
+                          VTDPASIDAddressSpace *vtd_pasid_as)
+{
+    return (vtd_pasid_as->iommu_state->pasid_cache_gen &&
+            (vtd_pasid_as->iommu_state->pasid_cache_gen
+             == vtd_pasid_as->pasid_cache_entry.pasid_cache_gen));
+}
+
+/**
+ * Callback for g_hash_table_foreach() over s->vtd_pasid_as, with
+ * VTDPIOTLBInvInfo as the filter. It propagates the piotlb
+ * invalidation to the host for each matching entry. The caller
+ * must hold iommu_lock.
+ */
+static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
+                                  gpointer user_data)
+{
+    VTDPIOTLBInvInfo *piotlb_info = user_data;
+    VTDPASIDAddressSpace *vtd_pasid_as = value;
+    uint16_t did;
+
+    /*
+     * Check whether the pasid entry cache stored in vtd_pasid_as
+     * is valid. "Invalid" means the pasid cache has been flushed,
+     * in which case the host should already have done a piotlb
+     * invalidation together with the pasid cache invalidation, so
+     * the piotlb invalidation is not passed down, for better
+     * performance. Only when the pasid entry cache is "valid" is
+     * the piotlb invalidation propagated to the host, since it
+     * means the guest just modified a mapping in its page table.
+     */
+    if (!vtd_pasid_cache_valid(vtd_pasid_as)) {
+        return;
+    }
+
+    did = vtd_pe_get_domain_id(
+                &(vtd_pasid_as->pasid_cache_entry.pasid_entry));
+
+    if ((piotlb_info->domain_id == did) &&
+        (piotlb_info->pasid == vtd_pasid_as->pasid)) {
+        vtd_invalidate_piotlb(vtd_pasid_as->iommu_state,
+                              vtd_pasid_as->vtd_bus,
+                              vtd_pasid_as->devfn,
+                              piotlb_info->stage1_cache);
+    }
+
+    /*
+     * TODO: needs to add QEMU piotlb flush when QEMU piotlb
+     * infrastructure is ready. For now, it is enough for passthru
+     * devices.
+     */
+}
+
 static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
                                         uint16_t domain_id,
                                         uint32_t pasid)
 {
+    VTDPIOTLBInvInfo piotlb_info;
+    struct iommu_cache_invalidate_info *cache_info;
+    DualIOMMUStage1Cache stage1_cache;
+
+    stage1_cache.pasid = pasid;
+
+    cache_info = &stage1_cache.cache_info;
+    cache_info->version = IOMMU_UAPI_VERSION;
+    cache_info->cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+    cache_info->granularity = IOMMU_INV_GRANU_PASID;
+    cache_info->pasid_info.pasid = pasid;
+    cache_info->pasid_info.flags = IOMMU_INV_PASID_FLAGS_PASID;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.stage1_cache = &stage1_cache;
+
+    vtd_iommu_lock(s);
+    /*
+     * Loop over all vtd_pasid_as instances in s->vtd_pasid_as to
+     * find the affected devices, since architecturally a piotlb
+     * invalidation should check the pasid cache.
+     */
+    g_hash_table_foreach(s->vtd_pasid_as,
+                         vtd_flush_pasid_iotlb, &piotlb_info);
+    vtd_iommu_unlock(s);
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
                              uint32_t pasid, hwaddr addr, uint8_t am, bool ih)
 {
+    VTDPIOTLBInvInfo piotlb_info;
+    struct iommu_cache_invalidate_info *cache_info;
+    DualIOMMUStage1Cache stage1_cache;
+
+    stage1_cache.pasid = pasid;
+
+    cache_info = &stage1_cache.cache_info;
+    cache_info->version = IOMMU_UAPI_VERSION;
+    cache_info->cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+    cache_info->granularity = IOMMU_INV_GRANU_ADDR;
+    cache_info->addr_info.flags = IOMMU_INV_ADDR_FLAGS_PASID;
+    cache_info->addr_info.flags |= ih ? IOMMU_INV_ADDR_FLAGS_LEAF : 0;
+    cache_info->addr_info.pasid = pasid;
+    cache_info->addr_info.addr = addr;
+    cache_info->addr_info.granule_size = 1 << (12 + am);
+    cache_info->addr_info.nb_granules = 1;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.stage1_cache = &stage1_cache;
+
+    vtd_iommu_lock(s);
+    /*
+     * Loop over all vtd_pasid_as instances in s->vtd_pasid_as to
+     * find the affected devices, since architecturally a piotlb
+     * invalidation should check the pasid cache.
+     */
+    g_hash_table_foreach(s->vtd_pasid_as,
+                         vtd_flush_pasid_iotlb, &piotlb_info);
+    vtd_iommu_unlock(s);
 }
 
 static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 7f4db04..f144bd3 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -530,6 +530,13 @@ struct VTDPASIDCacheInfo {
                                       VTD_PASID_CACHE_DEVSI)
 typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 
+struct VTDPIOTLBInvInfo {
+    uint16_t domain_id;
+    uint32_t pasid;
+    DualIOMMUStage1Cache *stage1_cache;
+};
+typedef struct VTDPIOTLBInvInfo VTDPIOTLBInvInfo;
+
 /* Masks for struct VTDRootEntry */
 #define VTD_ROOT_ENTRY_P            1ULL
 #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
-- 
2.7.4
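
The filter applied in vtd_flush_pasid_iotlb() can be sketched in isolation as
below. This is a simplified illustration and not the patch's code: it walks a
plain array instead of a GHashTable, and DemoPasidAS and the demo_* functions
are hypothetical names. An entry is a flush target only if its cached
generation matches the current non-zero global generation and both the domain
id and the PASID match the invalidation request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for VTDPASIDAddressSpace. */
typedef struct DemoPasidAS {
    uint32_t pasid;
    uint16_t domain_id;  /* domain id taken from the cached PASID entry */
    uint32_t cache_gen;  /* generation at which the entry was cached */
} DemoPasidAS;

/* Valid only if the global generation is non-zero and matches the entry. */
bool demo_pasid_cache_valid(uint32_t global_gen, const DemoPasidAS *as)
{
    return global_gen && global_gen == as->cache_gen;
}

/*
 * Counts the entries a (domain_id, pasid) invalidation would be
 * forwarded for. The patch calls the host flush at the point where
 * this sketch increments the counter.
 */
size_t demo_count_flush_targets(uint32_t global_gen,
                                const DemoPasidAS *list, size_t n,
                                uint16_t domain_id, uint32_t pasid)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        if (!demo_pasid_cache_valid(global_gen, &list[i])) {
            continue; /* stale entry: skipped, as in the patch */
        }
        if (list[i].domain_id == domain_id && list[i].pasid == pasid) {
            hits++;
        }
    }
    return hits;
}
```

Skipping stale entries keeps the number of host ioctls proportional to the
number of live bindings for the requested PASID rather than to the size of
the whole table.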


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [RFC v3 25/25] intel_iommu: process PASID-based Device-TLB invalidation
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 12:16   ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-29 12:16 UTC (permalink / raw)
  To: qemu-devel, david, pbonzini, alex.williamson, peterx
  Cc: mst, eric.auger, kevin.tian, yi.l.liu, jun.j.tian, yi.y.sun, kvm,
	hao.wu, Jacob Pan, Yi Sun, Richard Henderson, Eduardo Habkost

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds an empty handler for PASID-based Device-TLB
invalidation. For now this is enough: the invalidation does not
need to be propagated to the host for passthru devices, and no
emulated device has a device TLB.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c          | 18 ++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  1 +
 2 files changed, 19 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 93de7e4..c577e96 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3290,6 +3290,17 @@ static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
     return true;
 }
 
+static bool vtd_process_device_piotlb_desc(IntelIOMMUState *s,
+                                           VTDInvDesc *inv_desc)
+{
+    /*
+     * No need to handle this for passthru devices. It may be
+     * required for emulated devices with a device TLB, but for
+     * now simply returning is enough.
+     */
+    return true;
+}
+
 static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
                                           VTDInvDesc *inv_desc)
 {
@@ -3411,6 +3422,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    case VTD_INV_DESC_DEV_PIOTLB:
+        trace_vtd_inv_desc("device-piotlb", inv_desc.hi, inv_desc.lo);
+        if (!vtd_process_device_piotlb_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
     case VTD_INV_DESC_DEVICE:
         trace_vtd_inv_desc("device", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_device_iotlb_desc(s, &inv_desc)) {
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index f144bd3..f7de046 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -390,6 +390,7 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_WAIT               0x5 /* Invalidation Wait Descriptor */
 #define VTD_INV_DESC_PIOTLB             0x6 /* PASID-IOTLB Invalidate Desc */
 #define VTD_INV_DESC_PC                 0x7 /* PASID-cache Invalidate Desc */
+#define VTD_INV_DESC_DEV_PIOTLB         0x8 /* PASID-based-DIOTLB inv_desc */
 #define VTD_INV_DESC_NONE               0   /* Not an Invalidate Descriptor */
 
 /* Masks for Invalidation Wait Descriptor*/
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [RFC v3 06/25] scripts/update-linux-headers: Import iommu.h
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-01-29 12:25     ` Cornelia Huck
  -1 siblings, 0 replies; 136+ messages in thread
From: Cornelia Huck @ 2020-01-29 12:25 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, peterx, mst,
	eric.auger, kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu,
	Jacob Pan, Yi Sun

On Wed, 29 Jan 2020 04:16:37 -0800
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Eric Auger <eric.auger@redhat.com>
> 
> Update the script to import the new iommu.h uapi header.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>  scripts/update-linux-headers.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
> index f76d773..dfdfdfd 100755
> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh
> @@ -141,7 +141,7 @@ done
>  
>  rm -rf "$output/linux-headers/linux"
>  mkdir -p "$output/linux-headers/linux"
> -for header in kvm.h vfio.h vfio_ccw.h vhost.h \
> +for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
>                psci.h psp-sev.h userfaultfd.h mman.h; do
>      cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
>  done

Acked-by: Cornelia Huck <cohuck@redhat.com>


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 07/25] header file update VFIO/IOMMU vSVA APIs
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-01-29 12:28     ` Cornelia Huck
  -1 siblings, 0 replies; 136+ messages in thread
From: Cornelia Huck @ 2020-01-29 12:28 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, peterx, mst,
	eric.auger, kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu,
	Jacob Pan, Yi Sun

On Wed, 29 Jan 2020 04:16:38 -0800
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Liu Yi L <yi.l.liu@intel.com>
> 
> The kernel uapi/linux/iommu.h header file includes the
> extensions for vSVA support, e.g. bind gpasid and the user
> structures related to IOMMU fault reporting.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  linux-headers/linux/iommu.h | 372 ++++++++++++++++++++++++++++++++++++++++++++
>  linux-headers/linux/vfio.h  | 148 ++++++++++++++++++
>  2 files changed, 520 insertions(+)
>  create mode 100644 linux-headers/linux/iommu.h

Please add a note that this is to be replaced with a full headers
update, so that it doesn't get missed :)


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 00/25] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 13:44   ` no-reply
  -1 siblings, 0 replies; 136+ messages in thread
From: no-reply @ 2020-01-29 13:44 UTC (permalink / raw)
  To: yi.l.liu
  Cc: qemu-devel, david, pbonzini, alex.williamson, peterx, kevin.tian,
	yi.l.liu, kvm, mst, jun.j.tian, eric.auger, yi.y.sun, hao.wu

Patchew URL: https://patchew.org/QEMU/1580300216-86172-1-git-send-email-yi.l.liu@intel.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

                 from /tmp/qemu-test/src/include/hw/pci/pci_bus.h:4,
                 from /tmp/qemu-test/src/include/hw/pci-host/i440fx.h:15,
                 from /tmp/qemu-test/src/stubs/pci-host-piix.c:2:
/tmp/qemu-test/src/include/hw/iommu/dual_stage_iommu.h:26:10: fatal error: linux/iommu.h: No such file or directory
 #include <linux/iommu.h>
          ^~~~~~~~~~~~~~~
compilation terminated.
make: *** [/tmp/qemu-test/src/rules.mak:69: stubs/pci-host-piix.o] Error 1
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=8a4150439dbc4c62abdee4366960ac9a', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-qdowmtdx/src/docker-src.2020-01-29-08.42.04.21064:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=8a4150439dbc4c62abdee4366960ac9a
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-qdowmtdx/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    2m5.234s
user    0m7.662s


The full log is available at
http://patchew.org/logs/1580300216-86172-1-git-send-email-yi.l.liu@intel.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 00/25] intel_iommu: expose Shared Virtual Addressing to VMs
  2020-01-29 12:16 ` Liu, Yi L
@ 2020-01-29 13:48   ` no-reply
  -1 siblings, 0 replies; 136+ messages in thread
From: no-reply @ 2020-01-29 13:48 UTC (permalink / raw)
  To: yi.l.liu
  Cc: qemu-devel, david, pbonzini, alex.williamson, peterx, kevin.tian,
	yi.l.liu, kvm, mst, jun.j.tian, eric.auger, yi.y.sun, hao.wu

Patchew URL: https://patchew.org/QEMU/1580300216-86172-1-git-send-email-yi.l.liu@intel.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      hw/pci/pci_host.o
  CC      hw/pci/pcie.o
/tmp/qemu-test/src/hw/pci-host/designware.c: In function 'designware_pcie_host_realize':
/tmp/qemu-test/src/hw/pci-host/designware.c:693:5: error: incompatible type for argument 2 of 'pci_setup_iommu'
     pci_setup_iommu(pci->bus, designware_iommu_ops, s);
     ^
In file included from /tmp/qemu-test/src/include/hw/pci/msi.h:24:0,
---
/tmp/qemu-test/src/include/hw/pci/pci.h:499:6: note: expected 'const struct PCIIOMMUOps *' but argument is of type 'PCIIOMMUOps'
 void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
      ^
make: *** [hw/pci-host/designware.o] Error 1
make: *** Waiting for unfinished jobs....
rm tests/qemu-iotests/socket_scm_helper.o
Traceback (most recent call last):
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=08fccc258f5241b886ff89ccf43e8926', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-xrnho7rm/src/docker-src.2020-01-29-08.46.10.29742:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=08fccc258f5241b886ff89ccf43e8926
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-xrnho7rm/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    2m13.263s
user    0m8.269s
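
The diagnostic above says that pci_setup_iommu() takes a
'const PCIIOMMUOps *' while designware.c passes the PCIIOMMUOps object
itself, so the likely fix is to pass its address. The reduced model
below reproduces the shape of the call. Only the names that appear in
the log are real; the struct layouts, the get_address_space member, and
the helper functions are stand-ins assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types; the real definitions live in QEMU's hw/pci/pci.h. */
typedef struct PCIBus PCIBus;

typedef struct PCIIOMMUOps {
    /* Assumed member; returns the address space for a device. */
    void *(*get_address_space)(PCIBus *bus, void *opaque, int devfn);
} PCIIOMMUOps;

struct PCIBus {
    const PCIIOMMUOps *iommu_ops;
    void *iommu_opaque;
};

/* Same prototype as in the compiler note above. */
void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque)
{
    bus->iommu_ops = iommu_ops;
    bus->iommu_opaque = opaque;
}

/* Toy callback: hands back the opaque pointer it was registered with. */
static void *designware_get_as(PCIBus *bus, void *opaque, int devfn)
{
    (void)bus;
    (void)devfn;
    return opaque;
}

static const PCIIOMMUOps designware_iommu_ops = {
    .get_address_space = designware_get_as,
};

static void designware_realize_model(PCIBus *bus, void *s)
{
    /* pci_setup_iommu(bus, designware_iommu_ops, s);  <- type error */
    pci_setup_iommu(bus, &designware_iommu_ops, s);  /* pass the address */
}
```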


The full log is available at
http://patchew.org/logs/1580300216-86172-1-git-send-email-yi.l.liu@intel.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-01-31  3:59     ` David Gibson
  -1 siblings, 0 replies; 136+ messages in thread
From: David Gibson @ 2020-01-31  3:59 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, pbonzini, alex.williamson, peterx, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun

[-- Attachment #1: Type: text/plain, Size: 3410 bytes --]

On Wed, Jan 29, 2020 at 04:16:33AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> Currently, many platform vendors provide dual stage DMA address
> translation in hardware, for example nested translation in Intel VT-d
> scalable mode and nested stage translation on ARM SMMUv3. Dual stage
> DMA address translation uses two sets of translation structures:
> stage-1 (a.k.a. first-level) and stage-2 (a.k.a. second-level). The
> results of stage-1 translation are in turn subject to stage-2
> translation. Take vSVA (Virtual Shared Virtual Addressing) as an
> example: the guest IOMMU driver owns the stage-1 translation
> structures (covering GVA->GPA translation), and the host IOMMU driver
> owns the stage-2 translation structures (covering GPA->HPA
> translation). The VMM is responsible for binding the stage-1
> translation structures to the host, so that hardware can perform
> GVA->GPA and then GPA->HPA translation. For more background on SVA,
> refer to the links below.
>  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
>  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
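
The GVA->GPA->HPA flow described in the quoted text can be modelled with
a toy two-table lookup. This is a simplified illustration only: the flat
PageMap tables, the function names, and the INVALID_ADDR marker are
assumptions made for the example, and real hardware also sends the
stage-1 page-table walk itself through stage-2, which is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define PAGE_MASK    ((UINT64_C(1) << PAGE_SHIFT) - 1)
#define INVALID_ADDR UINT64_MAX   /* stands in for a translation fault */

/* One page-granular mapping: input frame number -> output frame number. */
typedef struct PageMap {
    uint64_t in_pfn;
    uint64_t out_pfn;
} PageMap;

/* Single-stage lookup in a flat table; keeps the page offset. */
static uint64_t translate(const PageMap *tbl, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_pfn == (addr >> PAGE_SHIFT)) {
            return (tbl[i].out_pfn << PAGE_SHIFT) | (addr & PAGE_MASK);
        }
    }
    return INVALID_ADDR;
}

/*
 * Nested translation: stage-1 (guest-owned, GVA->GPA) first, then
 * stage-2 (host-owned, GPA->HPA) applied to the stage-1 result.
 */
static uint64_t nested_translate(const PageMap *s1, size_t n1,
                                 const PageMap *s2, size_t n2,
                                 uint64_t gva)
{
    uint64_t gpa = translate(s1, n1, gva);
    if (gpa == INVALID_ADDR) {
        return INVALID_ADDR;          /* stage-1 fault */
    }
    return translate(s2, n2, gpa);    /* may still fault in stage-2 */
}
```

For example, with a stage-1 entry mapping frame 0x10 to 0x20 and a
stage-2 entry mapping frame 0x20 to 0x300, GVA 0x10abc resolves to
HPA 0x300abc.
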
> 
> As described above, dual stage DMA translation offers two stages of
> address mapping, which can provide better DMA address translation
> support for passthru devices. This is also what vIOMMU developers have
> been working on so far. Efforts include vSVA enabling from Yi Liu and
> SMMUv3 Nested Stage Setup from Eric Auger.
> https://www.spinics.net/lists/kvm/msg198556.html
> https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg02842.html
> 
> Both efforts aim to expose a vIOMMU backed by dual stage hardware.
> QEMU therefore needs an explicit object that represents the dual
> stage capability of the hardware. Such an object offers an abstraction
> for the operations related to dual stage DMA translation, such as:
> 
>  1) PASID allocation (allows the host to intercept PASID allocation)
>  2) binding stage-1 translation structures to the host
>  3) propagating stage-1 cache invalidation to the host
>  4) servicing DMA address translation faults (I/O page faults), etc.
> 
> This patch introduces DualStageIOMMUObject to represent the hardware
> dual stage DMA translation capability. PASID allocation/free are the
> first operations included in it; in future there will be more
> operations, such as bind_stage1_pgtbl and invalidate_stage1_cache.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>

Several overall queries about this:

1) Since it's explicitly handling PASIDs, this seems a lot more
   specific to SVM than the name suggests.  I'd suggest a rename.

2) Why are you hand rolling structures of pointers, rather than making
   this a QOM class or interface and putting those things into methods?

3) It's not really clear to me if this is for the case where both
   stages of translation are visible to the guest, or only one of
   them.
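
For readers without the patch at hand, the "structures of pointers" in
query 2 refers to the pattern sketched below: an object carrying a
hand-written table of function pointers. This is a plain-C illustration,
not the patch's code. The callback signatures, the wrapper names, and
the toy bump allocator are assumptions; only the names
DualStageIOMMUObject and PASID allocation/free come from the commit
message.

```c
#include <assert.h>
#include <stdint.h>

typedef struct DualStageIOMMUObject DualStageIOMMUObject;

/* Hand-rolled operations table (assumed shape). */
typedef struct DualStageIOMMUOps {
    int (*pasid_alloc)(DualStageIOMMUObject *obj,
                       uint32_t min, uint32_t max, uint32_t *pasid);
    int (*pasid_free)(DualStageIOMMUObject *obj, uint32_t pasid);
} DualStageIOMMUOps;

struct DualStageIOMMUObject {
    const DualStageIOMMUOps *ops;
    uint64_t next_pasid;          /* state for the toy backend below */
};

/* Wrappers dispatch through the table and fail if an op is missing. */
static int ds_iommu_pasid_alloc(DualStageIOMMUObject *obj,
                                uint32_t min, uint32_t max, uint32_t *pasid)
{
    if (!obj || !obj->ops || !obj->ops->pasid_alloc) {
        return -1;
    }
    return obj->ops->pasid_alloc(obj, min, max, pasid);
}

static int ds_iommu_pasid_free(DualStageIOMMUObject *obj, uint32_t pasid)
{
    if (!obj || !obj->ops || !obj->ops->pasid_free) {
        return -1;
    }
    return obj->ops->pasid_free(obj, pasid);
}

/* Toy backend: hands out increasing PASIDs within [min, max]. */
static int toy_pasid_alloc(DualStageIOMMUObject *obj,
                           uint32_t min, uint32_t max, uint32_t *pasid)
{
    uint64_t candidate = obj->next_pasid < min ? min : obj->next_pasid;

    if (candidate > max) {
        return -1;                /* range exhausted */
    }
    *pasid = (uint32_t)candidate;
    obj->next_pasid = candidate + 1;
    return 0;
}

/* Toy backend: freeing is a no-op because PASIDs are never reused. */
static int toy_pasid_free(DualStageIOMMUObject *obj, uint32_t pasid)
{
    (void)obj;
    (void)pasid;
    return 0;
}

static const DualStageIOMMUOps toy_ops = {
    .pasid_alloc = toy_pasid_alloc,
    .pasid_free = toy_pasid_free,
};
```

A QOM class or interface would replace the explicit ops pointer with
methods resolved through the type system, which is the alternative the
query raises.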

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-01-31  4:06     ` David Gibson
  -1 siblings, 0 replies; 136+ messages in thread
From: David Gibson @ 2020-01-31  4:06 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, pbonzini, alex.williamson, peterx, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun

[-- Attachment #1: Type: text/plain, Size: 7763 bytes --]

On Wed, Jan 29, 2020 at 04:16:34AM -0800, Liu, Yi L wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Currently, many platform vendors provide the capability of dual stage
> DMA address translation in hardware, for example nested translation
> on Intel VT-d scalable mode and nested stage translation on ARM SMMUv3.
> There are also efforts to back the QEMU vIOMMU with the dual stage DMA
> address translation capability provided by hardware, to get better
> address translation support for passthru devices.
> 
> Backing a vIOMMU with dual stage translation therefore requires QEMU
> vIOMMU to have a way to learn about this hardware capability, and also
> a way to receive DMA address translation faults (e.g. I/O page
> requests) from the host, since the guest owns the stage-1 translation
> structures in dual stage DMA address translation.
> 
> This patch adds IOMMUContext as an abstraction of vIOMMU related
> operations. For example, it provides a way for passthru modules
> (e.g. VFIO) to register DualStageIOMMUObject instances. In future, it
> is also expected to support receiving host DMA translation faults
> that happen on stage-1 translation.
> 
> For more background, refer to the discussion below, though the current
> implementation differs from the original proposal. This patch
> introduces IOMMUContext as an abstraction layer for calls from passthru
> modules (e.g. VFIO) into the vIOMMU. The first interface introduced
> makes the QEMU vIOMMU aware of the dual stage translation capability.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg05022.html

Again, is there a reason for not making this a QOM class or interface?


I'm not very clear on the relationship between an IOMMUContext and a
DualStageIOMMUObject.  Can there be many IOMMUContexts to a
DualStageIOMMUObject?  The other way around?  Or is it just
zero-or-one DualStageIOMMUObjects to an IOMMUContext?

> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/iommu/Makefile.objs           |  1 +
>  hw/iommu/iommu_context.c         | 54 +++++++++++++++++++++++++++++++++++
>  include/hw/iommu/iommu_context.h | 61 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 116 insertions(+)
>  create mode 100644 hw/iommu/iommu_context.c
>  create mode 100644 include/hw/iommu/iommu_context.h
> 
> diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
> index d4f3b39..1e45072 100644
> --- a/hw/iommu/Makefile.objs
> +++ b/hw/iommu/Makefile.objs
> @@ -1 +1,2 @@
>  obj-y += dual_stage_iommu.o
> +obj-y += iommu_context.o
> diff --git a/hw/iommu/iommu_context.c b/hw/iommu/iommu_context.c
> new file mode 100644
> index 0000000..6340ca3
> --- /dev/null
> +++ b/hw/iommu/iommu_context.c
> @@ -0,0 +1,54 @@
> +/*
> + * QEMU abstract of vIOMMU context
> + *
> + * Copyright (C) 2020 Red Hat Inc.
> + *
> + * Authors: Peter Xu <peterx@redhat.com>,
> + *          Liu Yi L <yi.l.liu@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/iommu/iommu_context.h"
> +
> +int iommu_context_register_ds_iommu(IOMMUContext *iommu_ctx,
> +                                    DualStageIOMMUObject *dsi_obj)
> +{
> +    if (!iommu_ctx || !dsi_obj) {

Would this ever happen apart from a bug in the caller?  If not it
should be an assert().

> +        return -ENOENT;
> +    }
> +
> +    if (iommu_ctx->ops && iommu_ctx->ops->register_ds_iommu) {
> +        return iommu_ctx->ops->register_ds_iommu(iommu_ctx, dsi_obj);
> +    }
> +    return -ENOENT;
> +}
> +
> +void iommu_context_unregister_ds_iommu(IOMMUContext *iommu_ctx,
> +                                      DualStageIOMMUObject *dsi_obj)
> +{
> +    if (!iommu_ctx || !dsi_obj) {
> +        return;
> +    }
> +
> +    if (iommu_ctx->ops && iommu_ctx->ops->unregister_ds_iommu) {
> +        iommu_ctx->ops->unregister_ds_iommu(iommu_ctx, dsi_obj);
> +    }
> +}
> +
> +void iommu_context_init(IOMMUContext *iommu_ctx, IOMMUContextOps *ops)
> +{
> +    iommu_ctx->ops = ops;
> +}
> diff --git a/include/hw/iommu/iommu_context.h b/include/hw/iommu/iommu_context.h
> new file mode 100644
> index 0000000..6f2ccb5
> --- /dev/null
> +++ b/include/hw/iommu/iommu_context.h
> @@ -0,0 +1,61 @@
> +/*
> + * QEMU abstraction of IOMMU Context
> + *
> + * Copyright (C) 2020 Red Hat Inc.
> + *
> + * Authors: Peter Xu <peterx@redhat.com>,
> + *          Liu, Yi L <yi.l.liu@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> +
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef HW_IOMMU_CONTEXT_H
> +#define HW_IOMMU_CONTEXT_H
> +
> +#include "qemu/queue.h"
> +#ifndef CONFIG_USER_ONLY
> +#include "exec/hwaddr.h"
> +#endif
> +#include "hw/iommu/dual_stage_iommu.h"
> +
> +typedef struct IOMMUContext IOMMUContext;
> +typedef struct IOMMUContextOps IOMMUContextOps;
> +
> +struct IOMMUContextOps {
> +    /*
> +     * Register DualStageIOMMUObject to vIOMMU thus vIOMMU
> +     * is aware of dual stage translation capability, and
> +     * also be able to setup dual stage translation via
> +     * interfaces exposed by DualStageIOMMUObject.
> +     */
> +    int (*register_ds_iommu)(IOMMUContext *iommu_ctx,
> +                             DualStageIOMMUObject *dsi_obj);
> +    void (*unregister_ds_iommu)(IOMMUContext *iommu_ctx,
> +                                DualStageIOMMUObject *dsi_obj);
> +};
> +
> +/*
> + * This is an abstraction of IOMMU context.
> + */
> +struct IOMMUContext {
> +    IOMMUContextOps *ops;
> +};
> +
> +int iommu_context_register_ds_iommu(IOMMUContext *iommu_ctx,
> +                                    DualStageIOMMUObject *dsi_obj);
> +void iommu_context_unregister_ds_iommu(IOMMUContext *iommu_ctx,
> +                                       DualStageIOMMUObject *dsi_obj);
> +void iommu_context_init(IOMMUContext *iommu_ctx, IOMMUContextOps *ops);
> +
> +#endif
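
The assert() suggestion made against the NULL check in
iommu_context_register_ds_iommu() can be sketched as below. This is only
an illustration, not the patch author's code: the struct bodies are
stand-ins for the definitions in the quoted headers, and demo_register()
is a hypothetical callback added to exercise the dispatch. NULL arguments
are treated as a caller bug, while a missing callback stays a runtime
condition that returns -ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in types: the real definitions live in the quoted headers. */
typedef struct DualStageIOMMUObject { int id; } DualStageIOMMUObject;
typedef struct IOMMUContext IOMMUContext;

typedef struct IOMMUContextOps {
    int (*register_ds_iommu)(IOMMUContext *iommu_ctx,
                             DualStageIOMMUObject *dsi_obj);
} IOMMUContextOps;

struct IOMMUContext {
    IOMMUContextOps *ops;
};

/*
 * NULL arguments can only come from a buggy caller, so they trip an
 * assert(); a vIOMMU that does not implement the hook returns -ENOENT.
 */
int iommu_context_register_ds_iommu(IOMMUContext *iommu_ctx,
                                    DualStageIOMMUObject *dsi_obj)
{
    assert(iommu_ctx);
    assert(dsi_obj);

    if (iommu_ctx->ops && iommu_ctx->ops->register_ds_iommu) {
        return iommu_ctx->ops->register_ds_iommu(iommu_ctx, dsi_obj);
    }
    return -ENOENT;
}

/* Hypothetical vIOMMU callback: echoes the object id back to the caller. */
static int demo_register(IOMMUContext *iommu_ctx,
                         DualStageIOMMUObject *dsi_obj)
{
    (void)iommu_ctx;
    return dsi_obj->id;
}
```

Whether -ENOENT is still the right return value once the NULL case is
gone is left open here; the sketch simply keeps what the patch returns.
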

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 06/25] scripts/update-linux-headers: Import iommu.h
  2020-01-29 12:25     ` Cornelia Huck
@ 2020-01-31 11:40       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-31 11:40 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: qemu-devel, david, pbonzini, alex.williamson, peterx, mst,
	eric.auger, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao,
	Jacob Pan, Yi Sun

> From: Cornelia Huck [mailto:cohuck@redhat.com]
> Sent: Wednesday, January 29, 2020 8:25 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 06/25] scripts/update-linux-headers: Import iommu.h
> 
> On Wed, 29 Jan 2020 04:16:37 -0800
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Eric Auger <eric.auger@redhat.com>
> >
> > Update the script to import the new iommu.h uapi header.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Cornelia Huck <cohuck@redhat.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > ---
> >  scripts/update-linux-headers.sh | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
> > index f76d773..dfdfdfd 100755
> > --- a/scripts/update-linux-headers.sh
> > +++ b/scripts/update-linux-headers.sh
> > @@ -141,7 +141,7 @@ done
> >
> >  rm -rf "$output/linux-headers/linux"
> >  mkdir -p "$output/linux-headers/linux"
> > -for header in kvm.h vfio.h vfio_ccw.h vhost.h \
> > +for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
> >                psci.h psp-sev.h userfaultfd.h mman.h; do
> >      cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
> >  done
> 
> Acked-by: Cornelia Huck <cohuck@redhat.com>

Thanks, Cornelia.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 07/25] header file update VFIO/IOMMU vSVA APIs
  2020-01-29 12:28     ` Cornelia Huck
@ 2020-01-31 11:41       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-31 11:41 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: qemu-devel, david, pbonzini, alex.williamson, peterx, mst,
	eric.auger, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao,
	Jacob Pan, Yi Sun

> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf
> Of Cornelia Huck
> Sent: Wednesday, January 29, 2020 8:29 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 07/25] header file update VFIO/IOMMU vSVA APIs
> 
> On Wed, 29 Jan 2020 04:16:38 -0800
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > The kernel uapi/linux/iommu.h header file includes the extensions for
> > vSVA support, e.g. bind gpasid and the user structures related to
> > IOMMU fault reporting.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Cornelia Huck <cohuck@redhat.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  linux-headers/linux/iommu.h | 372
> > ++++++++++++++++++++++++++++++++++++++++++++
> >  linux-headers/linux/vfio.h  | 148 ++++++++++++++++++
> >  2 files changed, 520 insertions(+)
> >  create mode 100644 linux-headers/linux/iommu.h
> 
> Please add a note that this is to be replaced with a full headers update, so that it
> doesn't get missed :)

Exactly, thanks for the reminder. I expect to have a full headers update when
the whole vSVA series is accepted. :-)

Thanks,
Yi Liu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject
  2020-01-31  3:59     ` David Gibson
@ 2020-01-31 11:42       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-31 11:42 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, pbonzini, alex.williamson, peterx, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun

Hi David,

> From: David Gibson [mailto:david@gibson.dropbear.id.au]
> Sent: Friday, January 31, 2020 11:59 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject
> 
> On Wed, Jan 29, 2020 at 04:16:33AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > Currently, many platform vendors provide the capability of dual stage
> > DMA address translation in hardware, for example nested translation
> > on Intel VT-d scalable mode and nested stage translation on ARM
> > SMMUv3. Dual stage DMA address translation has two stages of
> > translation structures: stage-1 (a.k.a. first-level) and stage-2
> > (a.k.a. second-level). Stage-1 translation results are also subject
> > to stage-2 translation. Taking vSVA (Virtual Shared Virtual
> > Addressing) as an example, the guest IOMMU driver owns the stage-1
> > translation structures (covering GVA->GPA translation), and the host
> > IOMMU driver owns the stage-2 translation structures (covering
> > GPA->HPA translation). The VMM is responsible for binding the stage-1
> > translation structures to the host, so that hardware can perform
> > GVA->GPA and then GPA->HPA translation. For more background on SVA,
> > refer to the links below.
> >  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
> >  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> > Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> >
> > As above, dual stage DMA translation offers two stages of address
> > mappings, which could give better DMA address translation support for
> > passthru devices. This is also what vIOMMU developers are working on.
> > Efforts include vSVA enabling from Yi Liu and SMMUv3 Nested Stage
> > Setup from Eric Auger.
> > https://www.spinics.net/lists/kvm/msg198556.html
> > https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg02842.html
> >
> > Both efforts aim to expose a vIOMMU backed by dual stage hardware.
> > QEMU therefore needs an explicit object to stand for the dual stage
> > capability of the hardware. Such an object offers an abstraction for
> > the dual stage DMA translation related operations, like:
> >
> >  1) PASID allocation (allow host to intercept in PASID allocation)
> >  2) bind stage-1 translation structures to host
> >  3) propagate stage-1 cache invalidation to host
> >  4) DMA address translation fault (I/O page fault) servicing etc.
> >
> > This patch introduces DualStageIOMMUObject to stand for the hardware
> > dual stage DMA translation capability. PASID allocation/free are the
> > first operations included in it. In future, there will be more
> > operations such as bind_stage1_pgtbl and invalidate_stage1_cache.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> 
> Several overall queries about this:
> 
> 1) Since it's explicitly handling PASIDs, this seems a lot more
>    specific to SVM than the name suggests.  I'd suggest a rename.

It will not be specific to SVM in the future. We have efforts to base
guest IOVA support on the host IOMMU's dual-stage DMA translation
capability. Guest IOVA support will then also reuse the methods
provided by this abstraction layer, e.g. bind_guest_pgtbl() and
flush_iommu_iotlb().

For the naming, how about HostIOMMUContext? This layer provides
explicit methods for setting up dual-stage DMA translation in the host.
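
To make the HostIOMMUContext idea concrete, here is a minimal sketch of
such a layer as a plain ops table. Only the names HostIOMMUContext,
bind_guest_pgtbl() and flush_iommu_iotlb() come from this thread; the
argument lists, the wrapper names, the demo state fields and the demo
backend are all assumptions made for illustration, not the series' API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HostIOMMUContext HostIOMMUContext;

/* Hypothetical signatures: the thread names the methods, not their arguments. */
typedef struct HostIOMMUContextOps {
    int (*bind_guest_pgtbl)(HostIOMMUContext *ctx, uint32_t pasid,
                            uint64_t pgtbl_gpa);
    int (*flush_iommu_iotlb)(HostIOMMUContext *ctx, uint32_t pasid);
} HostIOMMUContextOps;

struct HostIOMMUContext {
    const HostIOMMUContextOps *ops;
    uint64_t bound_pgtbl;   /* demo state: last stage-1 table bound */
    unsigned flushes;       /* demo state: number of flushes seen */
};

/* Wrappers a vIOMMU emulator would call for both vSVA and guest IOVA. */
int host_iommu_ctx_bind_guest_pgtbl(HostIOMMUContext *ctx, uint32_t pasid,
                                    uint64_t pgtbl_gpa)
{
    assert(ctx);
    if (!ctx->ops || !ctx->ops->bind_guest_pgtbl) {
        return -ENOENT;
    }
    return ctx->ops->bind_guest_pgtbl(ctx, pasid, pgtbl_gpa);
}

int host_iommu_ctx_flush_iotlb(HostIOMMUContext *ctx, uint32_t pasid)
{
    assert(ctx);
    if (!ctx->ops || !ctx->ops->flush_iommu_iotlb) {
        return -ENOENT;
    }
    return ctx->ops->flush_iommu_iotlb(ctx, pasid);
}

/* Demo backend standing in for a passthru module such as VFIO. */
static int demo_bind(HostIOMMUContext *ctx, uint32_t pasid, uint64_t pgtbl_gpa)
{
    (void)pasid;
    ctx->bound_pgtbl = pgtbl_gpa;
    return 0;
}

static int demo_flush(HostIOMMUContext *ctx, uint32_t pasid)
{
    (void)pasid;
    ctx->flushes++;
    return 0;
}

static const HostIOMMUContextOps demo_ops = {
    .bind_guest_pgtbl = demo_bind,
    .flush_iommu_iotlb = demo_flush,
};
```

The point of the sketch is that nothing in the wrappers is SVM-specific:
the same two calls would serve a guest IOVA page table, with the PASID
value chosen by the caller.
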

> 
> 2) Why are you hand rolling structures of pointers, rather than making
>    this a QOM class or interface and putting those things into methods?

Maybe the name is not proper. Although I named it DualStageIOMMUObject,
it is actually the kind of abstraction layer we discussed in a previous
email. I think this is similar to VFIO_MAP/UNMAP. The difference is that
VFIO_MAP/UNMAP programs mappings into the host iommu domain, while the
newly added explicit method links the guest page table to the host iommu
domain. VFIO_MAP/UNMAP is exposed to vIOMMU emulators via the
MemoryRegion layer, right? Maybe adding a similar abstraction layer is
enough. Is adding QOM really necessary for this case?

> 3) It's not really clear to me if this is for the case where both
>    stages of translation are visible to the guest, or only one of
>    them.

For this case, the vIOMMU will only expose a single stage of translation
to the VM. E.g. on Intel VT-d, the vIOMMU exposes first-level translation
to the guest. Hardware IOMMUs with the dual-stage translation capability
let the guest own the stage-1 translation structures while the host owns
the stage-2 translation structures. The VMM is responsible for binding
the guest's translation structures to the host and enabling dual-stage
translation, e.g. on Intel VT-d, by configuring the translation type to
be NESTED.

Taking guest SVM as an example, the guest iommu driver owns the gVA->gPA
mappings, which are treated as stage-1 translation from the host's point
of view. The host itself owns the gPA->hPA translation, which is called
stage-2 translation when dual-stage translation is configured.

Guest IOVA is similar to guest SVM. The guest iommu driver owns the
gIOVA->gPA mappings, which are treated as stage-1 translation. The host
owns the gPA->hPA translation.
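
As a toy model of the two stages described above: stage-1 maps gVA (or
gIOVA) to gPA and stage-2 maps gPA to hPA, and a DMA address is valid
only if both lookups succeed. Everything in the sketch (the flat
single-level tables, the 4 KiB page size, all type and function names)
is an assumption for illustration. Real hardware walks multi-level
tables and also translates the stage-1 table pointers themselves through
stage-2, which is ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)
#define NPAGES     4

/* Toy single-level table: index is the input page number. */
typedef struct StageTable {
    uint64_t out_pfn[NPAGES];
    bool valid[NPAGES];
} StageTable;

/* Translate one address through one stage; false means a fault. */
static bool stage_xlate(const StageTable *t, uint64_t in, uint64_t *out)
{
    uint64_t pfn = in >> PAGE_SHIFT;

    assert(t && out);
    if (pfn >= NPAGES || !t->valid[pfn]) {
        return false;
    }
    *out = (t->out_pfn[pfn] << PAGE_SHIFT) | (in & PAGE_MASK);
    return true;
}

/*
 * Nested translation: stage-1 (guest-owned, gVA->gPA) followed by
 * stage-2 (host-owned, gPA->hPA). A miss in either stage is a fault.
 */
bool nested_xlate(const StageTable *stage1, const StageTable *stage2,
                  uint64_t gva, uint64_t *hpa)
{
    uint64_t gpa;

    return stage_xlate(stage1, gva, &gpa) && stage_xlate(stage2, gpa, hpa);
}
```

With stage-1 mapping page 1 to page 2 and stage-2 mapping page 2 to
page 3, gVA 0x1abc resolves to hPA 0x3abc, while an address in an
unmapped stage-1 page faults before stage-2 is consulted.
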

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject
@ 2020-01-31 11:42       ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-31 11:42 UTC (permalink / raw)
  To: David Gibson
  Cc: Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian, Jun J,
	qemu-devel, peterx, eric.auger, alex.williamson, pbonzini, Sun,
	Yi Y, Wu, Hao

Hi David,

> From: David Gibson [mailto:david@gibson.dropbear.id.au]
> Sent: Friday, January 31, 2020 11:59 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject
> 
> On Wed, Jan 29, 2020 at 04:16:33AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > Currently, many platform vendors provide the capability of dual stage
> > DMA address translation in hardware, for example nested translation
> > on Intel VT-d scalable mode and nested stage translation on ARM
> > SMMUv3. Dual stage DMA address translation has two stages of
> > translation structures: stage-1 (a.k.a. first-level) and stage-2
> > (a.k.a. second-level). Stage-1 translation results are also subject
> > to stage-2 translation. Taking vSVA (Virtual Shared Virtual
> > Addressing) as an example, the guest IOMMU driver owns the stage-1
> > translation structures (covering GVA->GPA translation), and the host
> > IOMMU driver owns the stage-2 translation structures (covering
> > GPA->HPA translation). The VMM is responsible for binding the stage-1
> > translation structures to the host, so that hardware can perform
> > GVA->GPA and then GPA->HPA translation. For more background on SVA,
> > refer to the links below.
> >  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
> >  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> > Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> >
> > As above, dual stage DMA translation offers two stages of address
> > mappings, which could give better DMA address translation support for
> > passthru devices. This is also what vIOMMU developers are working on.
> > Efforts include vSVA enabling from Yi Liu and SMMUv3 Nested Stage
> > Setup from Eric Auger.
> > https://www.spinics.net/lists/kvm/msg198556.html
> > https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg02842.html
> >
> > Both efforts aim to expose a vIOMMU backed by dual stage hardware.
> > QEMU therefore needs an explicit object to stand for the dual stage
> > capability of the hardware. Such an object offers an abstraction for
> > the dual stage DMA translation related operations, like:
> >
> >  1) PASID allocation (allow host to intercept in PASID allocation)
> >  2) bind stage-1 translation structures to host
> >  3) propagate stage-1 cache invalidation to host
> >  4) DMA address translation fault (I/O page fault) servicing etc.
> >
> > This patch introduces DualStageIOMMUObject to stand for the hardware
> > dual stage DMA translation capability. PASID allocation/free are the
> > first operations included in it. In future, there will be more
> > operations such as bind_stage1_pgtbl and invalidate_stage1_cache.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> 
> Several overall queries about this:
> 
> 1) Since it's explicitly handling PASIDs, this seems a lot more
>    specific to SVM than the name suggests.  I'd suggest a rename.

It will not stay specific to SVM. We are working on moving guest IOVA
support onto the host IOMMU's dual-stage DMA translation capability.
Guest IOVA support will then reuse the methods provided by this
abstraction layer, e.g. bind_guest_pgtbl() and flush_iommu_iotlb().

As for the name, how about HostIOMMUContext? This layer provides
explicit methods for setting up dual-stage DMA translation in the host.

> 
> 2) Why are you hand rolling structures of pointers, rather than making
>    this a QOM class or interface and putting those things into methods?

Maybe the name is not appropriate. Although I named it DualStageIOMMUObject,
it is actually the kind of abstraction layer we discussed in a previous
email. I think it is similar to VFIO_MAP/UNMAP. The difference is that
VFIO_MAP/UNMAP programs mappings into the host iommu domain, while the newly
added explicit methods link the guest page table to the host iommu domain.
VFIO_MAP/UNMAP is exposed to vIOMMU emulators via the MemoryRegion layer,
right? Maybe adding a similar abstraction layer is enough. Is QOM really
necessary for this case?
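
For reference, the plain "structure of pointers" layer under discussion could
look roughly like the sketch below. It is only an illustration: the type and
function names (HostIOMMUContext, HostIOMMUOps, host_iommu_pasid_alloc, the
demo backend) are hypothetical and not taken from the series, and only the
idea of an ops table with PASID alloc/free comes from the patch description.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the idea of an ops table comes from the thread. */
typedef struct HostIOMMUContext HostIOMMUContext;

typedef struct HostIOMMUOps {
    /* Allocate a PASID in [min, max]; returns 0 and fills *pasid on success. */
    int (*pasid_alloc)(HostIOMMUContext *ctx, uint32_t min, uint32_t max,
                       uint32_t *pasid);
    /* Release a PASID previously returned by pasid_alloc. */
    int (*pasid_free)(HostIOMMUContext *ctx, uint32_t pasid);
} HostIOMMUOps;

struct HostIOMMUContext {
    const HostIOMMUOps *ops;
    uint32_t next_pasid;    /* state used by the demo backend only */
};

/* Wrapper called by a vIOMMU emulator; the backend (e.g. VFIO) fills ops. */
int host_iommu_pasid_alloc(HostIOMMUContext *ctx, uint32_t min, uint32_t max,
                           uint32_t *pasid)
{
    assert(ctx && pasid);               /* NULL here is a caller bug */
    if (!ctx->ops || !ctx->ops->pasid_alloc) {
        return -ENOENT;                 /* backend has no PASID support */
    }
    return ctx->ops->pasid_alloc(ctx, min, max, pasid);
}

/* Demo backend: hands out PASIDs sequentially and never reuses them. */
static int demo_pasid_alloc(HostIOMMUContext *ctx, uint32_t min, uint32_t max,
                            uint32_t *pasid)
{
    uint32_t candidate = ctx->next_pasid < min ? min : ctx->next_pasid;

    if (candidate > max) {
        return -ENOSPC;
    }
    *pasid = candidate;
    ctx->next_pasid = candidate + 1;
    return 0;
}

static const HostIOMMUOps demo_ops = {
    .pasid_alloc = demo_pasid_alloc,
};
```

A QOM interface would express the same callbacks as class methods; the sketch
only shows what the hand-rolled variant amounts to.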

> 3) It's not really clear to me if this is for the case where both
>    stages of translation are visible to the guest, or only one of
>    them.

In this case, the vIOMMU only exposes a single stage of translation to the
VM. e.g. with Intel VT-d, the vIOMMU exposes first-level translation to the
guest. A hardware IOMMU with dual-stage translation capability lets the guest
own the stage-1 translation structures while the host owns the stage-2
translation structures. The VMM is responsible for binding the guest's
translation structures to the host and enabling dual-stage translation, e.g.
on Intel VT-d, by configuring the translation type as NESTED.

Take guest SVM as an example: the guest iommu driver owns the gVA->gPA
mappings, which are treated as stage-1 translation from the host's point of
view. The host itself owns the gPA->hPA translation, which is called stage-2
translation when dual-stage translation is configured.

Guest IOVA is similar to guest SVM. The guest iommu driver owns the
gIOVA->gPA mappings, which are treated as stage-1 translation. The host owns
the gPA->hPA translation.
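
The composition of the two stages can be modelled with a toy lookup, shown
below as a minimal sketch. The flat Mapping tables and the nested_translate()
name are assumptions made for illustration; real hardware walks multi-level
page tables and also translates the stage-1 table pointers through stage-2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)

/* One page-granular mapping; a toy stand-in for a real page table. */
typedef struct Mapping {
    uint64_t in_pfn;
    uint64_t out_pfn;
} Mapping;

static bool lookup(const Mapping *tbl, size_t n, uint64_t in, uint64_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_pfn == (in >> PAGE_SHIFT)) {
            /* Keep the page offset, swap the frame number. */
            *out = (tbl[i].out_pfn << PAGE_SHIFT) | (in & PAGE_MASK);
            return true;
        }
    }
    return false;   /* would be a translation fault in hardware */
}

/* Nested walk: stage-1 (guest-owned) first, then stage-2 (host-owned). */
bool nested_translate(const Mapping *stage1, size_t n1,
                      const Mapping *stage2, size_t n2,
                      uint64_t gva, uint64_t *hpa)
{
    uint64_t gpa;

    assert(hpa);
    return lookup(stage1, n1, gva, &gpa) && lookup(stage2, n2, gpa, hpa);
}
```

For guest IOVA the same function applies with gIOVA in place of gVA; only the
owner of the stage-1 table contents differs in meaning, not the walk.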

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-01-31  4:06     ` David Gibson
@ 2020-01-31 11:42       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-01-31 11:42 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, pbonzini, alex.williamson, peterx, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun

Hi David,

> From: David Gibson [mailto:david@gibson.dropbear.id.au]
> Sent: Friday, January 31, 2020 12:07 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
> 
> On Wed, Jan 29, 2020 at 04:16:34AM -0800, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> >
> > Currently, many platform vendors provide the capability of dual stage
> > DMA address translation in hardware. For example, nested translation
> > on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
> > and etc. Also there are efforts to make QEMU vIOMMU be backed by dual
> > stage DMA address translation capability provided by hardware to have
> > better address translation support for passthru devices.
> >
> > As so, making vIOMMU be backed by dual stage translation capability
> > requires QEMU vIOMMU to have a way to get aware of such hardware
> > capability and also require a way to receive DMA address translation
> > faults (e.g. I/O page request) from host as guest owns stage-1 translation
> > structures in dual stage DMA address translation.
> >
> > This patch adds IOMMUContext as an abstract of vIOMMU related operations.
> > Like provide a way for passthru modules (e.g. VFIO) to register
> > DualStageIOMMUObject instances. And in future, it is expected to offer
> > support for receiving host DMA translation faults happened on stage-1
> > translation.
> >
> > For more backgrounds, may refer to the discussion below, while there
> > is also difference between the current implementation and original
> > proposal. This patch introduces the IOMMUContext as an abstract layer
> > for passthru module (e.g. VFIO) calls into vIOMMU. The first introduced
> > interface is to make QEMU vIOMMU be aware of dual stage translation
> > capability.
> >
> > https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg05022.html
> 
> Again, is there a reason for not making this a QOM class or interface?

I think a simple abstraction layer is enough, as explained in my previous
email. IOMMUContext provides explicit methods for VFIO to call into
vIOMMU emulators.

> 
> I'm not very clear on the relationship between an IOMMUContext and a
> DualStageIOMMUObject.  Can there be many IOMMUContexts to a
> DualStageIOMMUObject?  The other way around?  Or is it just
> zero-or-one DualStageIOMMUObjects to an IOMMUContext?

It is possible. As the patch below shows, DualStageIOMMUObject is per vfio
container. IOMMUContext can be either per-device or shared across devices,
depending on the vendor-specific vIOMMU emulator.
[RFC v3 10/25] vfio: register DualStageIOMMUObject to vIOMMU
https://www.spinics.net/lists/kvm/msg205198.html

Take the Intel vIOMMU as an example: there is a per-device structure which
includes an IOMMUContext instance and a DualStageIOMMUObject pointer.

+struct VTDIOMMUContext {
+    VTDBus *vtd_bus;
+    uint8_t devfn;
+    IOMMUContext iommu_context;
+    DualStageIOMMUObject *dsi_obj;
+    IntelIOMMUState *iommu_state;
+};
https://www.spinics.net/lists/kvm/msg205196.html

I think this leaves room for vendor-specific vIOMMU emulators to
design their own relationship between an IOMMUContext and a
DualStageIOMMUObject.
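
As a rough illustration of the many-to-one case, the sketch below embeds an
IOMMUContext in a per-device structure and recovers that structure in a
registration callback. The struct bodies, the CONTAINER_OF macro and the
vtd_register_ds_iommu() name are assumptions; only the VTDIOMMUContext field
names come from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the QEMU types named in the thread; bodies are assumed. */
typedef struct IOMMUContext { int unused; } IOMMUContext;
typedef struct DualStageIOMMUObject { uint32_t pasid_format; } DualStageIOMMUObject;

typedef struct VTDIOMMUContext {
    uint8_t devfn;
    IOMMUContext iommu_context;     /* embedded, one per device */
    DualStageIOMMUObject *dsi_obj;  /* points at the per-container object */
} VTDIOMMUContext;

/* Local equivalent of the usual container_of() helper. */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Registration callback as a vIOMMU emulator might implement it. */
void vtd_register_ds_iommu(IOMMUContext *iommu_ctx,
                           DualStageIOMMUObject *dsi_obj)
{
    VTDIOMMUContext *vtd_ctx;

    assert(iommu_ctx && dsi_obj);   /* NULL would be a caller bug */
    vtd_ctx = CONTAINER_OF(iommu_ctx, VTDIOMMUContext, iommu_context);
    vtd_ctx->dsi_obj = dsi_obj;
}
```

Two devices behind the same container would each register with their own
embedded IOMMUContext and end up pointing at the same DualStageIOMMUObject.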

> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>

[...]

> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "hw/iommu/iommu_context.h"
> > +
> > +int iommu_context_register_ds_iommu(IOMMUContext *iommu_ctx,
> > +                                    DualStageIOMMUObject *dsi_obj)
> > +{
> > +    if (!iommu_ctx || !dsi_obj) {
> 
> Would this ever happen apart from a bug in the caller?  If not it
> should be an assert().

Got it, thanks. I'll check for similar cases in this series and fix them in
the next version.

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-01-31 11:42       ` Liu, Yi L
@ 2020-02-11 16:58         ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 16:58 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: David Gibson, qemu-devel, pbonzini, alex.williamson, mst,
	eric.auger, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao,
	Jacob Pan, Yi Sun

On Fri, Jan 31, 2020 at 11:42:13AM +0000, Liu, Yi L wrote:
> > I'm not very clear on the relationship between an IOMMUContext and a
> > DualStageIOMMUObject.  Can there be many IOMMUContexts to a
> > DualStageIOMMUObject?  The other way around?  Or is it just
> > zero-or-one DualStageIOMMUObjects to an IOMMUContext?
> 
> It is possible. As the below patch shows, DualStageIOMMUObject is per vfio
> container. IOMMUContext can be either per-device or shared across devices,
> it depends on vendor specific vIOMMU emulators.

Is there an example where an IOMMUContext is not per-device?

It makes sense to me to have an object that is per-container (in your
case the DualStageIOMMUObject, IIUC), which we can then connect to a
device.  However, I'm a bit confused about why we have two abstraction
layers (the other one being IOMMUContext).  IOMMUContext was previously
meant for all of the new SVA APIs; now those have all moved over to the
other new object, and IOMMUContext only does register/unregister...
Can we put the reg/unreg procedures into DualStageIOMMUObject as well
and drop IOMMUContext?  Or, alternatively, keep IOMMUContext, drop
DualStageIOMMUObject, and make IOMMUContext per-vfio-container.  The
major difference is the naming; for example, PASID allocation does not
seem to be related to dual-stage at all.

Besides that, I'm not sure I read it right... but even with your current
series, container->iommu_ctx will only ever be bound to the first
device created within that container, since you've got:

    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev),
                           pci_device_iommu_context(pdev), errp);

And:

    if (vfio_connect_container(group, as, iommu_ctx, errp)) {
        error_prepend(errp, "failed to setup container for group %d: ",
                      groupid);
        goto close_fd_exit;
    }

The iommu_ctx is only stored in container->iommu_ctx when there is no
existing container.
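
The effect can be reproduced with a simplified model, sketched below. It is
not the QEMU code: the Container type, the lookup keyed by fd and the
connect_container() name are assumptions that only mimic "reuse an existing
container if there is one, otherwise create it and record iommu_ctx".

```c
#include <assert.h>
#include <stddef.h>

typedef struct IOMMUContext { int devfn; } IOMMUContext;

typedef struct Container {
    int in_use;
    int fd;                   /* identifies the container in this model */
    IOMMUContext *iommu_ctx;  /* recorded once, at creation time */
} Container;

#define MAX_CONTAINERS 4
static Container containers[MAX_CONTAINERS];

Container *connect_container(int fd, IOMMUContext *iommu_ctx)
{
    size_t i;

    assert(iommu_ctx);
    for (i = 0; i < MAX_CONTAINERS; i++) {
        if (containers[i].in_use && containers[i].fd == fd) {
            /* Existing container reused: the iommu_ctx argument is dropped. */
            return &containers[i];
        }
    }
    for (i = 0; i < MAX_CONTAINERS; i++) {
        if (!containers[i].in_use) {
            containers[i].in_use = 1;
            containers[i].fd = fd;
            containers[i].iommu_ctx = iommu_ctx;
            return &containers[i];
        }
    }
    return NULL;    /* table full */
}
```

In this model a second device joining the same container never gets its own
context recorded, which is the behaviour being questioned above.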

> [RFC v3 10/25] vfio: register DualStageIOMMUObject to vIOMMU
> https://www.spinics.net/lists/kvm/msg205198.html
> 
> Take Intel vIOMMU as an example, there is a per device structure which
> includes IOMMUContext instance and a DualStageIOMMUObject pointer.
> 
> +struct VTDIOMMUContext {
> +    VTDBus *vtd_bus;
> +    uint8_t devfn;
> +    IOMMUContext iommu_context;
> +    DualStageIOMMUObject *dsi_obj;
> +    IntelIOMMUState *iommu_state;
> +};
> https://www.spinics.net/lists/kvm/msg205196.html
> 
> I think this would leave space for vendor specific vIOMMU emulators to
> design their own relationship between an IOMMUContext and a
> DualStageIOMMUObject.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 09/25] vfio: check VFIO_TYPE1_NESTING_IOMMU support
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-02-11 19:08     ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 19:08 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun

On Wed, Jan 29, 2020 at 04:16:40AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO needs to check VFIO_TYPE1_NESTING_IOMMU
> support with Kernel before further using it.
> e.g. requires to check IOMMU UAPI version.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
>  hw/vfio/common.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0cc7ff5..a5e70b1 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1157,12 +1157,21 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
>  static int vfio_get_iommu_type(VFIOContainer *container,
>                                 Error **errp)
>  {
> -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> +    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
> +                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>                            VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
> -    int i;
> +    int i, version;
>  
>      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
>          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> +            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU) {
> +                version = ioctl(container->fd,
> +                                VFIO_NESTING_GET_IOMMU_UAPI_VERSION);
> +                if (version < IOMMU_UAPI_VERSION) {
> +                    printf("IOMMU UAPI incompatible for nesting\n");

There should be better alternatives than printf()... Maybe
warn_report()?

> +                    continue;
> +                }
> +            }
>              return iommu_types[i];
>          }
>      }
> @@ -1278,6 +1287,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      switch (container->iommu_type) {
> +    case VFIO_TYPE1_NESTING_IOMMU:
>      case VFIO_TYPE1v2_IOMMU:
>      case VFIO_TYPE1_IOMMU:
>      {
> -- 
> 2.7.4
> 
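
The probing order with the diagnostic routed through a warning helper could
be sketched as follows. This is a standalone model, not the patch: TypeProbe,
pick_iommu_type() and warn_report_stub() are hypothetical, the stub stands in
for QEMU's warn_report() from "qemu/error-report.h", and the type numbers
used with it are arbitrary.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Local stub; it prefixes the message and appends the newline itself. */
static void warn_report_stub(const char *fmt, ...)
{
    va_list ap;

    fputs("warning: ", stderr);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
}

/* Hypothetical description of what the kernel reports for one IOMMU type. */
typedef struct TypeProbe {
    int type;
    bool supported;         /* result of the extension check */
    bool needs_uapi_check;  /* true only for the nesting type */
    int uapi_version;       /* version reported by the kernel */
} TypeProbe;

/* Return the first usable type in preference order, or -1 if none. */
int pick_iommu_type(const TypeProbe *probes, size_t n, int min_uapi_version)
{
    assert(probes || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!probes[i].supported) {
            continue;
        }
        if (probes[i].needs_uapi_check &&
            probes[i].uapi_version < min_uapi_version) {
            warn_report_stub("IOMMU UAPI version %d too old for nesting, "
                             "falling back", probes[i].uapi_version);
            continue;   /* try the next, non-nesting type */
        }
        return probes[i].type;
    }
    return -1;
}
```

Note that the message carries no trailing "\n" because the reporting helper
adds it.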

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 11/25] vfio: get stage-1 pasid formats from Kernel
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-02-11 19:30     ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 19:30 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun

On Wed, Jan 29, 2020 at 04:16:42AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO checks IOMMU UAPI version when it finds Kernel supports
> VFIO_TYPE1_NESTING_IOMMU. It is enough for UAPI compatibility
> check. However, IOMMU UAPI may support multiple stage-1 pasid
> formats in a specific UAPI version, which is highly possible
> since IOMMU UAPI supports stage-1 formats across all IOMMU vendors.
> So VFIO needs to get the supported formats from Kernel and tell
> vIOMMU. Let vIOMMU select proper format when setup dual stage DMA
> translation.
> 
> This patch gets the stage-1 pasid format from kernel by using IOCTL
> VFIO_IOMMU_GET_INFO and pass the supported format to vIOMMU by the
> DualStageIOMMUObject instance which has been registered to vIOMMU.
> 
> This patch referred some code from Shameer Kolothum.
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/iommu/dual_stage_iommu.c         |  5 ++-
>  hw/vfio/common.c                    | 85 ++++++++++++++++++++++++++++++++++++-
>  include/hw/iommu/dual_stage_iommu.h | 10 ++++-
>  3 files changed, 97 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/iommu/dual_stage_iommu.c b/hw/iommu/dual_stage_iommu.c
> index be4179d..d5a7168 100644
> --- a/hw/iommu/dual_stage_iommu.c
> +++ b/hw/iommu/dual_stage_iommu.c
> @@ -48,9 +48,12 @@ int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
>  }
>  
>  void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
> -                          DualStageIOMMUOps *ops)
> +                          DualStageIOMMUOps *ops,
> +                          DualStageIOMMUInfo *uinfo)
>  {
>      dsi_obj->ops = ops;
> +
> +    dsi_obj->uinfo.pasid_format = uinfo->pasid_format;
>  }
>  
>  void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fc1723d..a07824b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1182,10 +1182,84 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>  static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
>  };
>  
> +static int vfio_get_iommu_info(VFIOContainer *container,
> +                         struct vfio_iommu_type1_info **info)

Better to add a comment on the function reminding callers to free
*info (with g_free()) after use.

> +{
> +
> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> +

Nit: extra newline.

> +
> +    *info = g_malloc0(argsz);
> +
> +retry:
> +    (*info)->argsz = argsz;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> +        g_free(*info);
> +        *info = NULL;
> +        return -errno;
> +    }
> +
> +    if (((*info)->argsz > argsz)) {
> +        argsz = (*info)->argsz;
> +        *info = g_realloc(*info, argsz);
> +        goto retry;
> +    }
> +
> +    return 0;
> +}
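
A minimal sketch of the same grow-and-retry pattern with the ownership rule
written down is shown below. It uses plain libc allocation and a query
callback instead of the real ioctl so that it is self-contained; InfoHeader,
InfoQueryFn, get_info() and demo_query() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for struct vfio_iommu_type1_info; only argsz matters. */
typedef struct InfoHeader {
    uint32_t argsz;
    uint32_t flags;
} InfoHeader;

/* Hypothetical query callback standing in for ioctl(VFIO_IOMMU_GET_INFO). */
typedef int (*InfoQueryFn)(InfoHeader *info);

/*
 * Fetch a variable-sized info structure.
 * On success returns 0 and stores a heap buffer in *info that the caller
 * must release with free(). On failure returns a negative errno value and
 * leaves *info set to NULL.
 */
int get_info(InfoQueryFn query, InfoHeader **info)
{
    uint32_t argsz = sizeof(InfoHeader);
    InfoHeader *buf;

    assert(query && info);
    *info = NULL;
    buf = calloc(1, argsz);
    if (!buf) {
        return -ENOMEM;
    }
    for (;;) {
        InfoHeader *bigger;
        int ret;

        buf->argsz = argsz;
        ret = query(buf);
        if (ret < 0) {
            free(buf);
            return ret;
        }
        if (buf->argsz <= argsz) {
            *info = buf;        /* ownership passes to the caller */
            return 0;
        }
        argsz = buf->argsz;     /* the callee asked for a larger buffer */
        bigger = realloc(buf, argsz);
        if (!bigger) {
            free(buf);
            return -ENOMEM;
        }
        buf = bigger;
    }
}

/* Demo query: pretends 64 bytes are needed in total. */
static int demo_query(InfoHeader *info)
{
    if (info->argsz >= 64) {
        memset((char *)info + sizeof(*info), 0xab, 64 - sizeof(*info));
        info->flags = 1;
    }
    info->argsz = 64;
    return 0;
}
```
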
> +
> +static struct vfio_info_cap_header *
> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    void *ptr = info;
> +
> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> +        return NULL;
> +    }
> +
> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> +        if (hdr->id == id) {
> +            return hdr;
> +        }
> +    }
> +
> +    return NULL;
> +}
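
The capability walk above relies on each header's next field being an offset
from the start of the info buffer, with 0 ending the chain. A bounds-checked
variant might look like the sketch below; CapHeader mirrors the three fields
of struct vfio_info_cap_header, while find_cap() and its offset-based return
value are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the fields of struct vfio_info_cap_header. */
typedef struct CapHeader {
    uint16_t id;
    uint16_t version;
    uint32_t next;  /* offset of the next capability from buffer start, 0 = end */
} CapHeader;

/*
 * Walk a capability chain stored in buf[0..size). first is the offset of
 * the first capability (0 means "no capabilities"). Returns the offset of
 * the capability with the given id, or 0 if it is absent or malformed.
 */
size_t find_cap(const uint8_t *buf, size_t size, uint32_t first, uint16_t id)
{
    size_t off = first;
    size_t steps = 0;

    assert(buf);
    while (off != 0) {
        CapHeader hdr;

        if (off > size || size - off < sizeof(hdr) || steps++ > size) {
            return 0;   /* out of bounds, or a chain that loops */
        }
        memcpy(&hdr, buf + off, sizeof(hdr));   /* avoids unaligned access */
        if (hdr.id == id) {
            return off;
        }
        off = hdr.next;
    }
    return 0;
}
```
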
> +
> +static int vfio_get_nesting_iommu_format(VFIOContainer *container,
> +                                         uint32_t *pasid_format)
> +{
> +    struct vfio_iommu_type1_info *info;
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_iommu_type1_info_cap_nesting *cap;
> +
> +    if (vfio_get_iommu_info(container, &info)) {
> +        return -errno;

This should return the retcode from vfio_get_iommu_info().

> +    }
> +
> +    hdr = vfio_get_iommu_info_cap(info,
> +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
> +    if (!hdr) {
> +        g_free(info);
> +        return -errno;
> +    }
> +
> +    cap = container_of(hdr,
> +                struct vfio_iommu_type1_info_cap_nesting, header);
> +    *pasid_format = cap->pasid_format;
> +
> +    g_free(info);
> +    return 0;
> +}
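
Propagating the callee's return code instead of re-reading errno could be
sketched as below. The helper types and get_nesting_format() are hypothetical
stand-ins for the two lookups in the patch, and -ENOENT for a missing
capability is an assumed choice; the point is that no system call has failed
on that path, so errno is not guaranteed to hold a meaningful value there.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the two lookups in the patch. */
typedef int (*GetInfoFn)(void);           /* returns 0 or a negative errno */
typedef int (*FindCapFn)(uint32_t *fmt);  /* returns 1 if the capability exists */

int get_nesting_format(GetInfoFn get_info, FindCapFn find_cap,
                       uint32_t *pasid_format)
{
    int ret;

    assert(get_info && find_cap && pasid_format);
    ret = get_info();
    if (ret < 0) {
        return ret;         /* propagate; do not re-read errno */
    }
    if (!find_cap(pasid_format)) {
        return -ENOENT;     /* explicit code: errno was not set on this path */
    }
    return 0;
}

/* Demo callbacks used to exercise each path. */
static int info_ok(void) { return 0; }
static int info_fail(void) { return -EPERM; }
static int cap_present(uint32_t *fmt) { *fmt = 1; return 1; }
static int cap_missing(uint32_t *fmt) { (void)fmt; return 0; }
```
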
> +
>  static int vfio_init_container(VFIOContainer *container, int group_fd,
>                                 Error **errp)
>  {
>      int iommu_type, ret;
> +    uint32_t format;
> +    DualStageIOMMUInfo uinfo;
>  
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
> @@ -1214,7 +1288,16 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>      }
>  
>      if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> -        ds_iommu_object_init(&container->dsi_obj, &vfio_ds_iommu_ops);
> +        if (vfio_get_nesting_iommu_format(container, &format)) {
> +            error_setg_errno(errp, errno,
> +                             "Failed to get nesting iommu format");
> +            return -errno;

Same here, you might want to return the retcode from
vfio_get_nesting_iommu_format()?

> +        }
> +
> +        uinfo.pasid_format = format;
> +        ds_iommu_object_init(&container->dsi_obj,
> +                             &vfio_ds_iommu_ops, &uinfo);
> +
>          if (iommu_context_register_ds_iommu(container->iommu_ctx,
>                                              &container->dsi_obj)) {
>              /*
> diff --git a/include/hw/iommu/dual_stage_iommu.h b/include/hw/iommu/dual_stage_iommu.h
> index e9891e3..c6100b4 100644
> --- a/include/hw/iommu/dual_stage_iommu.h
> +++ b/include/hw/iommu/dual_stage_iommu.h
> @@ -23,12 +23,14 @@
>  #define HW_DS_IOMMU_H
>  
>  #include "qemu/queue.h"
> +#include <linux/iommu.h>
>  #ifndef CONFIG_USER_ONLY
>  #include "exec/hwaddr.h"
>  #endif
>  
>  typedef struct DualStageIOMMUObject DualStageIOMMUObject;
>  typedef struct DualStageIOMMUOps DualStageIOMMUOps;
> +typedef struct DualStageIOMMUInfo DualStageIOMMUInfo;
>  
>  struct DualStageIOMMUOps {
>      /* Allocate pasid from DualStageIOMMU (a.k.a. host IOMMU) */
> @@ -41,11 +43,16 @@ struct DualStageIOMMUOps {
>                        uint32_t pasid);
>  };
>  
> +struct DualStageIOMMUInfo {
> +    uint32_t pasid_format;
> +};
> +
>  /*
>   * This is an abstraction of Dual-stage IOMMU.
>   */
>  struct DualStageIOMMUObject {
>      DualStageIOMMUOps *ops;
> +    DualStageIOMMUInfo uinfo;
>  };
>  
>  int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
> @@ -53,7 +60,8 @@ int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
>  int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid);
>  
>  void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
> -                          DualStageIOMMUOps *ops);
> +                          DualStageIOMMUOps *ops,
> +                          DualStageIOMMUInfo *uinfo);
>  void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj);
>  
>  #endif
> -- 
> 2.7.4
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 11/25] vfio: get stage-1 pasid formats from Kernel
@ 2020-02-11 19:30     ` Peter Xu
  0 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 19:30 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kevin.tian, Jacob Pan, Yi Sun, kvm, mst, jun.j.tian, qemu-devel,
	eric.auger, alex.williamson, pbonzini, hao.wu, yi.y.sun, david

On Wed, Jan 29, 2020 at 04:16:42AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO checks IOMMU UAPI version when it finds Kernel supports
> VFIO_TYPE1_NESTING_IOMMU. It is enough for UAPI compatibility
> check. However, IOMMU UAPI may support multiple stage-1 pasid
> formats in a specific UAPI version, which is highly possible
> since IOMMU UAPI supports stage-1 formats across all IOMMU vendors.
> So VFIO needs to get the supported formats from Kernel and tell
> vIOMMU. Let vIOMMU select proper format when setup dual stage DMA
> translation.
> 
> This patch gets the stage-1 pasid format from kernel by using IOCTL
> VFIO_IOMMU_GET_INFO and pass the supported format to vIOMMU by the
> DualStageIOMMUObject instance which has been registered to vIOMMU.
> 
> This patch referred some code from Shameer Kolothum.
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/iommu/dual_stage_iommu.c         |  5 ++-
>  hw/vfio/common.c                    | 85 ++++++++++++++++++++++++++++++++++++-
>  include/hw/iommu/dual_stage_iommu.h | 10 ++++-
>  3 files changed, 97 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/iommu/dual_stage_iommu.c b/hw/iommu/dual_stage_iommu.c
> index be4179d..d5a7168 100644
> --- a/hw/iommu/dual_stage_iommu.c
> +++ b/hw/iommu/dual_stage_iommu.c
> @@ -48,9 +48,12 @@ int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
>  }
>  
>  void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
> -                          DualStageIOMMUOps *ops)
> +                          DualStageIOMMUOps *ops,
> +                          DualStageIOMMUInfo *uinfo)
>  {
>      dsi_obj->ops = ops;
> +
> +    dsi_obj->uinfo.pasid_format = uinfo->pasid_format;
>  }
>  
>  void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fc1723d..a07824b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1182,10 +1182,84 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>  static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
>  };
>  
> +static int vfio_get_iommu_info(VFIOContainer *container,
> +                         struct vfio_iommu_type1_info **info)

Better to add a comment on the function reminding callers to
g_free(*info) after use.
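
A minimal sketch of what such a comment could look like, together with the
argsz retry pattern from the quoted function. Everything here is a
stand-in: struct info_hdr and fake_get_info() are hypothetical
replacements for struct vfio_iommu_type1_info and the VFIO_IOMMU_GET_INFO
ioctl, and plain malloc/free are used instead of GLib so the example
builds on its own.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vfio_iommu_type1_info. */
struct info_hdr {
    uint32_t argsz;
    uint32_t flags;
};

/*
 * Hypothetical stand-in for ioctl(fd, VFIO_IOMMU_GET_INFO, info):
 * always reports that the full reply needs 64 bytes.
 */
static int fake_get_info(struct info_hdr *info)
{
    info->flags = 0;
    info->argsz = 64;
    return 0;
}

/*
 * get_info_sketch: query the info buffer, growing it until the reply fits.
 *
 * On success returns 0 and stores a heap-allocated buffer in *info.
 * The caller owns that buffer and must free(*info) after use.
 * On failure returns a negative errno value and sets *info to NULL.
 */
static int get_info_sketch(struct info_hdr **info)
{
    size_t argsz = sizeof(struct info_hdr);
    struct info_hdr *buf = calloc(1, argsz);
    struct info_hdr *tmp;

    *info = NULL;
    if (!buf) {
        return -ENOMEM;
    }

    for (;;) {
        buf->argsz = (uint32_t)argsz;
        if (fake_get_info(buf)) {
            int err = errno;    /* save before free() can touch it */
            free(buf);
            return -err;
        }
        if (buf->argsz <= argsz) {
            break;              /* the reply fit into the buffer */
        }
        argsz = buf->argsz;     /* grow to the reported size and retry */
        tmp = realloc(buf, argsz);
        if (!tmp) {
            free(buf);
            return -ENOMEM;
        }
        buf = tmp;
    }

    *info = buf;
    return 0;
}
```

A caller would pair every successful get_info_sketch(&info) with exactly
one free(info).
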

> +{
> +
> +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> +

Nit: extra newline.

> +
> +    *info = g_malloc0(argsz);
> +
> +retry:
> +    (*info)->argsz = argsz;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> +        g_free(*info);
> +        *info = NULL;
> +        return -errno;
> +    }
> +
> +    if (((*info)->argsz > argsz)) {
> +        argsz = (*info)->argsz;
> +        *info = g_realloc(*info, argsz);
> +        goto retry;
> +    }
> +
> +    return 0;
> +}
> +
> +static struct vfio_info_cap_header *
> +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> +{
> +    struct vfio_info_cap_header *hdr;
> +    void *ptr = info;
> +
> +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> +        return NULL;
> +    }
> +
> +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> +        if (hdr->id == id) {
> +            return hdr;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +static int vfio_get_nesting_iommu_format(VFIOContainer *container,
> +                                         uint32_t *pasid_format)
> +{
> +    struct vfio_iommu_type1_info *info;
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_iommu_type1_info_cap_nesting *cap;
> +
> +    if (vfio_get_iommu_info(container, &info)) {
> +        return -errno;

This should return the retcode from vfio_get_iommu_info() instead.

> +    }
> +
> +    hdr = vfio_get_iommu_info_cap(info,
> +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
> +    if (!hdr) {
> +        g_free(info);
> +        return -errno;
> +    }
> +
> +    cap = container_of(hdr,
> +                struct vfio_iommu_type1_info_cap_nesting, header);
> +    *pasid_format = cap->pasid_format;
> +
> +    g_free(info);
> +    return 0;
> +}
> +
>  static int vfio_init_container(VFIOContainer *container, int group_fd,
>                                 Error **errp)
>  {
>      int iommu_type, ret;
> +    uint32_t format;
> +    DualStageIOMMUInfo uinfo;
>  
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
> @@ -1214,7 +1288,16 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>      }
>  
>      if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> -        ds_iommu_object_init(&container->dsi_obj, &vfio_ds_iommu_ops);
> +        if (vfio_get_nesting_iommu_format(container, &format)) {
> +            error_setg_errno(errp, errno,
> +                             "Failed to get nesting iommu format");
> +            return -errno;

Same here, you might want to return the retcode from
vfio_get_nesting_iommu_format()?
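
A rough sketch of propagating the callee's return code through both
levels, under stated assumptions: all three function names are
hypothetical stand-ins for vfio_get_iommu_info(),
vfio_get_nesting_iommu_format() and vfio_init_container(), and -ENOENT for
the missing-capability path is only one possible choice. No call has
failed on that path, so errno carries no useful value there.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for vfio_get_iommu_info(): fails with the given errno value. */
static int get_info_stub(int simulated_errno)
{
    return simulated_errno ? -simulated_errno : 0;
}

/* Stand-in for vfio_get_nesting_iommu_format(). */
static int get_format_sketch(int simulated_errno, bool has_cap,
                             uint32_t *format)
{
    int ret = get_info_stub(simulated_errno);

    if (ret) {
        return ret;         /* pass the callee's code up unchanged */
    }
    if (!has_cap) {
        return -ENOENT;     /* explicit code; errno is not meaningful here */
    }
    *format = 1;            /* placeholder value for the sketch */
    return 0;
}

/* Stand-in for the nesting branch of vfio_init_container(). */
static int init_container_sketch(int simulated_errno, bool has_cap,
                                 uint32_t *format)
{
    int ret = get_format_sketch(simulated_errno, has_cap, format);

    if (ret) {
        /* QEMU code would report -ret here, e.g. via error_setg_errno() */
        return ret;
    }
    return 0;
}
```

With this shape, the value the caller sees no longer depends on whatever
errno happens to hold when the error is reported.
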

> +        }
> +
> +        uinfo.pasid_format = format;
> +        ds_iommu_object_init(&container->dsi_obj,
> +                             &vfio_ds_iommu_ops, &uinfo);
> +
>          if (iommu_context_register_ds_iommu(container->iommu_ctx,
>                                              &container->dsi_obj)) {
>              /*
> diff --git a/include/hw/iommu/dual_stage_iommu.h b/include/hw/iommu/dual_stage_iommu.h
> index e9891e3..c6100b4 100644
> --- a/include/hw/iommu/dual_stage_iommu.h
> +++ b/include/hw/iommu/dual_stage_iommu.h
> @@ -23,12 +23,14 @@
>  #define HW_DS_IOMMU_H
>  
>  #include "qemu/queue.h"
> +#include <linux/iommu.h>
>  #ifndef CONFIG_USER_ONLY
>  #include "exec/hwaddr.h"
>  #endif
>  
>  typedef struct DualStageIOMMUObject DualStageIOMMUObject;
>  typedef struct DualStageIOMMUOps DualStageIOMMUOps;
> +typedef struct DualStageIOMMUInfo DualStageIOMMUInfo;
>  
>  struct DualStageIOMMUOps {
>      /* Allocate pasid from DualStageIOMMU (a.k.a. host IOMMU) */
> @@ -41,11 +43,16 @@ struct DualStageIOMMUOps {
>                        uint32_t pasid);
>  };
>  
> +struct DualStageIOMMUInfo {
> +    uint32_t pasid_format;
> +};
> +
>  /*
>   * This is an abstraction of Dual-stage IOMMU.
>   */
>  struct DualStageIOMMUObject {
>      DualStageIOMMUOps *ops;
> +    DualStageIOMMUInfo uinfo;
>  };
>  
>  int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
> @@ -53,7 +60,8 @@ int ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj, uint32_t min,
>  int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid);
>  
>  void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
> -                          DualStageIOMMUOps *ops);
> +                          DualStageIOMMUOps *ops,
> +                          DualStageIOMMUInfo *uinfo);
>  void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj);
>  
>  #endif
> -- 
> 2.7.4
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 12/25] vfio/common: add pasid_alloc/free support
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-02-11 19:31     ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 19:31 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun

On Wed, Jan 29, 2020 at 04:16:43AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch adds VFIO pasid alloc/free support to allow host intercept
> in PASID allocation for VM by adding VFIO implementation of
> DualStageIOMMUOps.pasid_alloc/free callbacks.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/vfio/common.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index a07824b..014f4e7 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1179,7 +1179,49 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>      return -EINVAL;
>  }
>  
> +static int vfio_ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj,
> +                         uint32_t min, uint32_t max, uint32_t *pasid)
> +{
> +    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
> +    struct vfio_iommu_type1_pasid_request req;
> +    unsigned long argsz;
> +
> +    argsz = sizeof(req);
> +    req.argsz = argsz;
> +    req.flags = VFIO_IOMMU_PASID_ALLOC;
> +    req.alloc_pasid.min = min;
> +    req.alloc_pasid.max = max;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> +        error_report("%s: %d, alloc failed", __func__, -errno);
> +        return -errno;

Note that errno can be clobbered by later calls.  Better to cache it
right after the ioctl.
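
A minimal sketch of the caching pattern, assuming a hypothetical
log_error() in place of error_report(). The logger resets errno on
purpose to show why the value has to be saved first.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>

/*
 * Hypothetical logger standing in for error_report().  It deliberately
 * overwrites errno, as a real logging path may do.
 */
static void log_error(const char *func, int err)
{
    fprintf(stderr, "%s: %d, request failed\n", func, err);
    errno = 0;
}

/* Issue an ioctl and return 0 or a negative errno value. */
static int request_sketch(int fd, unsigned long request, void *arg)
{
    if (ioctl(fd, request, arg)) {
        int err = errno;            /* cache right after the failing call */

        log_error(__func__, -err);
        return -err;                /* still correct after logging */
    }
    return 0;
}
```

Calling request_sketch() on an invalid file descriptor returns -EBADF even
though the logger has already reset errno by then.
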

> +    }
> +    *pasid = req.alloc_pasid.result;
> +    return 0;
> +}
> +
> +static int vfio_ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj,
> +                                     uint32_t pasid)
> +{
> +    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
> +    struct vfio_iommu_type1_pasid_request req;
> +    unsigned long argsz;
> +
> +    argsz = sizeof(req);
> +    req.argsz = argsz;
> +    req.flags = VFIO_IOMMU_PASID_FREE;
> +    req.free_pasid = pasid;
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> +        error_report("%s: %d, free failed", __func__, -errno);
> +        return -errno;

Same here.

> +    }
> +    return 0;
> +}
> +
>  static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
> +    .pasid_alloc = vfio_ds_iommu_pasid_alloc,
> +    .pasid_free = vfio_ds_iommu_pasid_free,
>  };
>  
>  static int vfio_get_iommu_info(VFIOContainer *container,
> -- 
> 2.7.4
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string option
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-02-11 19:43     ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 19:43 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun,
	Richard Henderson, Eduardo Habkost

On Wed, Jan 29, 2020 at 04:16:44AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
> related to scalable mode translation, thus there are multiple combinations.
> While this vIOMMU implementation wants simplify it for user by providing
> typical combinations. User could config it by "x-scalable-mode" option. The
> usage is as below:
> 
> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"

Maybe also "off" when someone wants to explicitly disable it?

> 
>  - "legacy": gives support for SL page table
>  - "modern": gives support for FL page table, pasid, virtual command
>  -  if not configured, means no scalable mode support, if not proper
>     configured, will throw error
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
>  hw/i386/intel_iommu.c          | 27 +++++++++++++++++++++++++--
>  hw/i386/intel_iommu_internal.h |  3 +++
>  include/hw/i386/intel_iommu.h  |  2 ++
>  3 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1c1eb7f..33be40c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3078,7 +3078,7 @@ static Property vtd_properties[] = {
>      DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>                        VTD_HOST_ADDRESS_WIDTH),
>      DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
> -    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
> +    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
>      DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> @@ -3708,8 +3708,11 @@ static void vtd_init(IntelIOMMUState *s)
>      }
>  
>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> -    if (s->scalable_mode) {
> +    if (s->scalable_mode && !s->scalable_modern) {
>          s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> +    } else if (s->scalable_mode && s->scalable_modern) {
> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
> +                   | VTD_ECAP_FLTS | VTD_ECAP_PSS;

It might be better to move this patch to the end of the series, after
all the implementations are ready.

>      }
>  
>      vtd_reset_caches(s);
> @@ -3845,6 +3848,26 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>          return false;
>      }
>  
> +    if (s->scalable_mode_str &&
> +        (strcmp(s->scalable_mode_str, "modern") &&
> +         strcmp(s->scalable_mode_str, "legacy"))) {
> +        error_setg(errp, "Invalid x-scalable-mode config");

Maybe append "... Please use 'modern', 'legacy', or 'off'." to show the
valid options.
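
One way the parsing could look with an explicit "off" value and the longer
error message, as a sketch only. parse_scalable_mode() and its errmsg
out-parameter are hypothetical; the quoted patch does this inline in
vtd_decide_config() and reports through error_setg().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Map the x-scalable-mode string to the two flags used by the patch.
 * An unset option (NULL) and "off" both disable scalable mode.
 * Returns false and sets *errmsg for any other unknown value.
 */
static bool parse_scalable_mode(const char *str, bool *scalable_mode,
                                bool *scalable_modern, const char **errmsg)
{
    *errmsg = NULL;
    *scalable_mode = false;
    *scalable_modern = false;

    if (!str || !strcmp(str, "off")) {
        return true;
    }
    if (!strcmp(str, "legacy")) {
        *scalable_mode = true;
        return true;
    }
    if (!strcmp(str, "modern")) {
        *scalable_mode = true;
        *scalable_modern = true;
        return true;
    }

    *errmsg = "Invalid x-scalable-mode config. "
              "Please use 'modern', 'legacy', or 'off'.";
    return false;
}
```

Folding the validity check and the flag assignment into one pass also
avoids comparing the string twice.
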

> +        return false;
> +    }
> +
> +    if (s->scalable_mode_str &&
> +        !strcmp(s->scalable_mode_str, "legacy")) {
> +        s->scalable_mode = true;
> +        s->scalable_modern = false;
> +    } else if (s->scalable_mode_str &&
> +        !strcmp(s->scalable_mode_str, "modern")) {
> +        s->scalable_mode = true;
> +        s->scalable_modern = true;
> +    } else {
> +        s->scalable_mode = false;
> +        s->scalable_modern = false;
> +    }
> +
>      return true;
>  }
>  
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 862033e..c4dbb2c 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -190,8 +190,11 @@
>  #define VTD_ECAP_PT                 (1ULL << 6)
>  #define VTD_ECAP_MHMV               (15ULL << 20)
>  #define VTD_ECAP_SRS                (1ULL << 31)
> +#define VTD_ECAP_PSS                (19ULL << 35)
> +#define VTD_ECAP_PASID              (1ULL << 40)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
>  #define VTD_ECAP_SLTS               (1ULL << 46)
> +#define VTD_ECAP_FLTS               (1ULL << 47)
>  
>  /* CAP_REG */
>  /* (offset >> 4) << 24 */
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 8571a85..1ef2917 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -244,6 +244,8 @@ struct IntelIOMMUState {
>  
>      bool caching_mode;              /* RO - is cap CM enabled? */
>      bool scalable_mode;             /* RO - is Scalable Mode supported? */
> +    char *scalable_mode_str;        /* RO - admin's Scalable Mode config */
> +    bool scalable_modern;           /* RO - is modern SM supported? */
>  
>      dma_addr_t root;                /* Current root table pointer */
>      bool root_scalable;             /* Type of root table (scalable or not) */
> -- 
> 2.7.4
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-02-11 20:16     ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 20:16 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun,
	Richard Henderson, Eduardo Habkost

On Wed, Jan 29, 2020 at 04:16:45AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch adds virtual command support to Intel vIOMMU per
> Intel VT-d 3.1 spec. And adds two virtual commands: allocate
> pasid and free pasid.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
>  hw/i386/intel_iommu.c          | 163 ++++++++++++++++++++++++++++++++++++++++-
>  hw/i386/intel_iommu_internal.h |  38 ++++++++++
>  hw/i386/trace-events           |   1 +
>  include/hw/i386/intel_iommu.h  |   6 +-
>  4 files changed, 206 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 33be40c..43a728f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2649,6 +2649,142 @@ static void vtd_handle_iectl_write(IntelIOMMUState *s)
>      }
>  }
>  
> +static int vtd_request_pasid_alloc(IntelIOMMUState *s, uint32_t *pasid)
> +{
> +    VTDBus *vtd_bus;
> +    int bus_n, devfn, ret = -errno;
> +    VTDIOMMUContext *vtd_icx;
> +
> +    for (bus_n = 0; bus_n < PCI_BUS_MAX; bus_n++) {
> +        vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> +        if (!vtd_bus) {
> +            continue;
> +        }
> +        for (devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
> +            vtd_icx = vtd_bus->dev_icx[devfn];
> +            if (!vtd_icx) {
> +                continue;
> +            }
> +
> +            /*
> +             * We'll return the first valid result we got. It's
> +             * a bit hackish in that we don't have a good global
> +             * interface yet to talk to modules like vfio to deliver
> +             * this allocation request, so we're leveraging this
> +             * per-device iommu object to do the same thing just
> +             * to make sure the allocation happens only once.
> +             */
> +            ret = ds_iommu_pasid_alloc(vtd_icx->dsi_obj,
> +                         VTD_MIN_HPASID, VTD_MAX_HPASID, pasid);

Your indents always look strange to me for long function calls...  Not
a complaint though, as long as no one else complains. :)

> +            if (!ret) {
> +                break;
> +            }
> +        }
> +    }
> +    return ret;
> +}
> +
> +static int vtd_request_pasid_free(IntelIOMMUState *s, uint32_t pasid)
> +{
> +    VTDBus *vtd_bus;
> +    int bus_n, devfn, ret = -errno;
> +    VTDIOMMUContext *vtd_icx;
> +
> +    for (bus_n = 0; bus_n < PCI_BUS_MAX; bus_n++) {
> +        vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> +        if (!vtd_bus) {
> +            continue;
> +        }
> +        for (devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
> +            vtd_icx = vtd_bus->dev_icx[devfn];
> +            if (!vtd_icx) {
> +                continue;
> +            }
> +            /*
> +             * Similar with pasid allocation. We'll free the pasid
> +             * on the first successful free operation. It's a bit
> +             * hackish in that we don't have a good global interface
> +             * yet to talk to modules like vfio to deliver this pasid
> +             * free request, so we're leveraging this per-device iommu
> +             * object to do the same thing just to make sure the
> +             * free happens only once.
> +             */
> +            ret = ds_iommu_pasid_free(vtd_icx->dsi_obj, pasid);
> +            if (!ret) {
> +                break;
> +            }
> +        }
> +    }
> +    return ret;
> +}
> +
> +/*
> + * If IP is not set, set it and return 0
> + * If IP is already set, return -1

Is this comment out of date?  It could instead mention that this also
implicitly resets the response status code to zero, so by default the
command reports success.

Other than that:

Reviewed-by: Peter Xu <peterx@redhat.com>
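
To illustrate the point about the implicit reset, here is a small sketch
of the response register handling. The bit positions follow the
VTD_VCRSP_* macros in the quoted patch (IP in bit 0, status code in bits
2:1, result from bit 8 upwards); the function names and the pure-value
style are hypothetical and leave out the register writes.

```c
#include <stdbool.h>
#include <stdint.h>

#define VCRSP_IP            1ULL
#define VCRSP_SC(val)       (((uint64_t)(val) & 0x3) << 1)
#define VCRSP_RSLT(val)     ((uint64_t)(val) << 8)

/*
 * Setting IP overwrites the whole register, so the status code and the
 * result of any previous command are cleared at the same time.
 */
static uint64_t vcrsp_set_ip(void)
{
    return VCRSP_IP;
}

/* Clearing IP keeps the status code and result bits. */
static uint64_t vcrsp_clear_ip(uint64_t vcrsp)
{
    return vcrsp & ~VCRSP_IP;
}

/* Compute the final response value for one command. */
static uint64_t vcrsp_complete(bool failed, unsigned status, uint32_t pasid)
{
    uint64_t vcrsp = vcrsp_set_ip();    /* status == 0: success by default */

    vcrsp |= failed ? VCRSP_SC(status) : VCRSP_RSLT(pasid);
    return vcrsp_clear_ip(vcrsp);
}
```

A successful allocation of PASID 5 therefore leaves status code 0 and the
PASID in the result field, while a failure leaves only the status code.
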

> + */
> +static void vtd_vcmd_set_ip(IntelIOMMUState *s)
> +{
> +    s->vcrsp = 1;
> +    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
> +                     ((uint64_t) s->vcrsp));
> +}
> +
> +static void vtd_vcmd_clear_ip(IntelIOMMUState *s)
> +{
> +    s->vcrsp &= (~((uint64_t)(0x1)));
> +    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
> +                     ((uint64_t) s->vcrsp));
> +}
> +
> +/* Handle write to Virtual Command Register */
> +static int vtd_handle_vcmd_write(IntelIOMMUState *s, uint64_t val)
> +{
> +    uint32_t pasid;
> +    int ret = -1;
> +
> +    trace_vtd_reg_write_vcmd(s->vcrsp, val);
> +
> +    if (!(s->vccap & VTD_VCCAP_PAS) ||
> +         (s->vcrsp & 1)) {
> +        return -1;
> +    }
> +
> +    /*
> +     * Since vCPU should be blocked when the guest VMCD
> +     * write was trapped to here. Should be no other vCPUs
> +     * try to access VCMD if guest software is well written.
> +     * However, we still emulate the IP bit here in case of
> +     * bad guest software. Also align with the spec.
> +     */
> +    vtd_vcmd_set_ip(s);
> +
> +    switch (val & VTD_VCMD_CMD_MASK) {
> +    case VTD_VCMD_ALLOC_PASID:
> +        ret = vtd_request_pasid_alloc(s, &pasid);
> +        if (ret) {
> +            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_NO_AVAILABLE_PASID);
> +        } else {
> +            s->vcrsp |= VTD_VCRSP_RSLT(pasid);
> +        }
> +        break;
> +
> +    case VTD_VCMD_FREE_PASID:
> +        pasid = VTD_VCMD_PASID_VALUE(val);
> +        ret = vtd_request_pasid_free(s, pasid);
> +        if (ret < 0) {
> +            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_FREE_INVALID_PASID);
> +        }
> +        break;
> +
> +    default:
> +        s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_UNDEFINED_CMD);
> +        error_report_once("Virtual Command: unsupported command!!!");
> +        break;
> +    }
> +    vtd_vcmd_clear_ip(s);
> +    return 0;
> +}
> +
>  static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
>  {
>      IntelIOMMUState *s = opaque;
> @@ -2938,6 +3074,23 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
>          vtd_set_long(s, addr, val);
>          break;
>  
> +    case DMAR_VCMD_REG:
> +        if (!vtd_handle_vcmd_write(s, val)) {
> +            if (size == 4) {
> +                vtd_set_long(s, addr, val);
> +            } else {
> +                vtd_set_quad(s, addr, val);
> +            }
> +        }
> +        break;
> +
> +    case DMAR_VCMD_REG_HI:
> +        assert(size == 4);
> +        if (!vtd_handle_vcmd_write(s, val)) {
> +            vtd_set_long(s, addr, val);
> +        }
> +        break;
> +
>      default:
>          if (size == 4) {
>              vtd_set_long(s, addr, val);
> @@ -3712,7 +3865,8 @@ static void vtd_init(IntelIOMMUState *s)
>          s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
>      } else if (s->scalable_mode && s->scalable_modern) {
>          s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
> -                   | VTD_ECAP_FLTS | VTD_ECAP_PSS;
> +                   | VTD_ECAP_FLTS | VTD_ECAP_PSS | VTD_ECAP_VCS;
> +        s->vccap |= VTD_VCCAP_PAS;
>      }
>  
>      vtd_reset_caches(s);
> @@ -3768,6 +3922,13 @@ static void vtd_init(IntelIOMMUState *s)
>       * Interrupt remapping registers.
>       */
>      vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xfffffffffffff80fULL, 0);
> +
> +    /*
> +     * Virtual Command Definitions
> +     */
> +    vtd_define_quad(s, DMAR_VCCAP_REG, s->vccap, 0, 0);
> +    vtd_define_quad(s, DMAR_VCMD_REG, 0, 0xffffffffffffffffULL, 0);
> +    vtd_define_quad(s, DMAR_VCRSP_REG, 0, 0, 0);
>  }
>  
>  /* Should not reset address_spaces when reset because devices will still use
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index c4dbb2c..fb5fdc2 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -85,6 +85,12 @@
>  #define DMAR_MTRRCAP_REG_HI     0x104
>  #define DMAR_MTRRDEF_REG        0x108 /* MTRR default type */
>  #define DMAR_MTRRDEF_REG_HI     0x10c
> +#define DMAR_VCCAP_REG          0xE00 /* Virtual Command Capability Register */
> +#define DMAR_VCCAP_REG_HI       0xE04
> +#define DMAR_VCMD_REG           0xE10 /* Virtual Command Register */
> +#define DMAR_VCMD_REG_HI        0xE14
> +#define DMAR_VCRSP_REG          0xE20 /* Virtual Command Reponse Register */
> +#define DMAR_VCRSP_REG_HI       0xE24
>  
>  /* IOTLB registers */
>  #define DMAR_IOTLB_REG_OFFSET   0xf0 /* Offset to the IOTLB registers */
> @@ -193,6 +199,7 @@
>  #define VTD_ECAP_PSS                (19ULL << 35)
>  #define VTD_ECAP_PASID              (1ULL << 40)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
> +#define VTD_ECAP_VCS                (1ULL << 44)
>  #define VTD_ECAP_SLTS               (1ULL << 46)
>  #define VTD_ECAP_FLTS               (1ULL << 47)
>  
> @@ -315,6 +322,37 @@ typedef enum VTDFaultReason {
>  
>  #define VTD_CONTEXT_CACHE_GEN_MAX       0xffffffffUL
>  
> +/* VCCAP_REG */
> +#define VTD_VCCAP_PAS               (1UL << 0)
> +
> +/*
> + * The basic idea is to let hypervisor to set a range for available
> + * PASIDs for VMs. One of the reasons is PASID #0 is reserved by
> + * RID_PASID usage. We have no idea how many reserved PASIDs in future,
> + * so here just an evaluated value. Honestly, set it as "1" is enough
> + * at current stage.
> + */
> +#define VTD_MIN_HPASID              1
> +#define VTD_MAX_HPASID              0xFFFFF
> +
> +/* Virtual Command Register */
> +enum {
> +     VTD_VCMD_NULL_CMD = 0,
> +     VTD_VCMD_ALLOC_PASID = 1,
> +     VTD_VCMD_FREE_PASID = 2,
> +     VTD_VCMD_CMD_NUM,
> +};
> +
> +#define VTD_VCMD_CMD_MASK           0xffUL
> +#define VTD_VCMD_PASID_VALUE(val)   (((val) >> 8) & 0xfffff)
> +
> +#define VTD_VCRSP_RSLT(val)         ((val) << 8)
> +#define VTD_VCRSP_SC(val)           (((val) & 0x3) << 1)
> +
> +#define VTD_VCMD_UNDEFINED_CMD         1ULL
> +#define VTD_VCMD_NO_AVAILABLE_PASID    2ULL
> +#define VTD_VCMD_FREE_INVALID_PASID    2ULL
> +
>  /* Interrupt Entry Cache Invalidation Descriptor: VT-d 6.5.2.7. */
>  struct VTDInvDescIEC {
>      uint32_t type:4;            /* Should always be 0x4 */
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index e48bef2..71536a7 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -51,6 +51,7 @@ vtd_reg_write_gcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"
>  vtd_reg_write_fectl(uint32_t value) "value 0x%"PRIx32
>  vtd_reg_write_iectl(uint32_t value) "value 0x%"PRIx32
>  vtd_reg_ics_clear_ip(void) ""
> +vtd_reg_write_vcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"PRIx32
>  vtd_dmar_translate(uint8_t bus, uint8_t slot, uint8_t func, uint64_t iova, uint64_t gpa, uint64_t mask) "dev %02x:%02x.%02x iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64
>  vtd_dmar_enable(bool en) "enable %d"
>  vtd_dmar_fault(uint16_t sid, int fault, uint64_t addr, bool is_write) "sid 0x%"PRIx16" fault %d addr 0x%"PRIx64" write %d"
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 1ef2917..4158116 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -46,7 +46,7 @@
>  #define VTD_SID_TO_BUS(sid)         (((sid) >> 8) & 0xff)
>  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
>  
> -#define DMAR_REG_SIZE               0x230
> +#define DMAR_REG_SIZE               0xF00
>  #define VTD_HOST_AW_39BIT           39
>  #define VTD_HOST_AW_48BIT           48
>  #define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> @@ -285,6 +285,10 @@ struct IntelIOMMUState {
>      uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
>      bool dma_drain;                 /* Whether DMA r/w draining enabled */
>  
> +    /* Virtual Command Register */
> +    uint64_t vccap;                 /* The value of vcmd capability reg */
> +    uint64_t vcrsp;                 /* Current value of VCMD RSP REG */
> +
>      /*
>       * Protects IOMMU states in general.  Currently it protects the
>       * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> -- 
> 2.7.4
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
@ 2020-02-11 20:16     ` Peter Xu
  0 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 20:16 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kevin.tian, Jacob Pan, Yi Sun, Eduardo Habkost, kvm, mst,
	jun.j.tian, qemu-devel, eric.auger, alex.williamson, pbonzini,
	hao.wu, yi.y.sun, Richard Henderson, david

On Wed, Jan 29, 2020 at 04:16:45AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch adds virtual command support to Intel vIOMMU per
> Intel VT-d 3.1 spec. And adds two virtual commands: allocate
> pasid and free pasid.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
>  hw/i386/intel_iommu.c          | 163 ++++++++++++++++++++++++++++++++++++++++-
>  hw/i386/intel_iommu_internal.h |  38 ++++++++++
>  hw/i386/trace-events           |   1 +
>  include/hw/i386/intel_iommu.h  |   6 +-
>  4 files changed, 206 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 33be40c..43a728f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2649,6 +2649,142 @@ static void vtd_handle_iectl_write(IntelIOMMUState *s)
>      }
>  }
>  
> +static int vtd_request_pasid_alloc(IntelIOMMUState *s, uint32_t *pasid)
> +{
> +    VTDBus *vtd_bus;
> +    int bus_n, devfn, ret = -errno;
> +    VTDIOMMUContext *vtd_icx;
> +
> +    for (bus_n = 0; bus_n < PCI_BUS_MAX; bus_n++) {
> +        vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> +        if (!vtd_bus) {
> +            continue;
> +        }
> +        for (devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
> +            vtd_icx = vtd_bus->dev_icx[devfn];
> +            if (!vtd_icx) {
> +                continue;
> +            }
> +
> +            /*
> +             * We'll return the first valid result we got. It's
> +             * a bit hackish in that we don't have a good global
> +             * interface yet to talk to modules like vfio to deliver
> +             * this allocation request, so we're leveraging this
> +             * per-device iommu object to do the same thing just
> +             * to make sure the allocation happens only once.
> +             */
> +            ret = ds_iommu_pasid_alloc(vtd_icx->dsi_obj,
> +                         VTD_MIN_HPASID, VTD_MAX_HPASID, pasid);

Your indents are always strange to me for long funcalls...  Not a
complaint though, as long as no one else complains. :)

> +            if (!ret) {
> +                break;
> +            }
> +        }
> +    }
> +    return ret;
> +}
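
A note on the loop above: as quoted, `break` leaves only the inner devfn loop, so the outer bus loop keeps running and a device on a later bus could be asked to allocate again. A minimal sketch of a scan that really stops at the first success is below. Everything prefixed `sketch_`/`Sketch` is hypothetical and only stands in for the VTDBus/VTDIOMMUContext walk and for ds_iommu_pasid_alloc(); it is not the QEMU code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BUS_MAX   4   /* hypothetical, stands in for PCI_BUS_MAX */
#define SKETCH_DEVFN_MAX 8   /* hypothetical, stands in for PCI_DEVFN_MAX */

/* Hypothetical per-device object: counts calls and can be told to fail. */
typedef struct SketchDev {
    int calls;
    int fail;
} SketchDev;

/* Hypothetical allocator standing in for ds_iommu_pasid_alloc(). */
static inline int sketch_dev_alloc(SketchDev *dev, unsigned *pasid)
{
    dev->calls++;
    if (dev->fail) {
        return -1;
    }
    *pasid = 42;
    return 0;
}

/*
 * Walk every (bus, devfn) slot and return as soon as one device
 * succeeds, so that no second allocation request is issued.
 */
static inline int sketch_alloc_once(
    SketchDev *devs[SKETCH_BUS_MAX][SKETCH_DEVFN_MAX], unsigned *pasid)
{
    int ret = -1; /* default when no device is present at all */

    for (int bus = 0; bus < SKETCH_BUS_MAX; bus++) {
        for (int devfn = 0; devfn < SKETCH_DEVFN_MAX; devfn++) {
            if (!devs[bus][devfn]) {
                continue;
            }
            ret = sketch_dev_alloc(devs[bus][devfn], pasid);
            if (!ret) {
                return 0; /* leaves both loops, unlike a bare break */
            }
        }
    }
    return ret;
}
```

The same shape would apply to the free path. A `goto out` after the successful call would work equally well if a single exit point is preferred.
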
> +
> +static int vtd_request_pasid_free(IntelIOMMUState *s, uint32_t pasid)
> +{
> +    VTDBus *vtd_bus;
> +    int bus_n, devfn, ret = -errno;
> +    VTDIOMMUContext *vtd_icx;
> +
> +    for (bus_n = 0; bus_n < PCI_BUS_MAX; bus_n++) {
> +        vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> +        if (!vtd_bus) {
> +            continue;
> +        }
> +        for (devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
> +            vtd_icx = vtd_bus->dev_icx[devfn];
> +            if (!vtd_icx) {
> +                continue;
> +            }
> +            /*
> +             * Similar with pasid allocation. We'll free the pasid
> +             * on the first successful free operation. It's a bit
> +             * hackish in that we don't have a good global interface
> +             * yet to talk to modules like vfio to deliver this pasid
> +             * free request, so we're leveraging this per-device iommu
> +             * object to do the same thing just to make sure the
> +             * free happens only once.
> +             */
> +            ret = ds_iommu_pasid_free(vtd_icx->dsi_obj, pasid);
> +            if (!ret) {
> +                break;
> +            }
> +        }
> +    }
> +    return ret;
> +}
> +
> +/*
> + * If IP is not set, set it and return 0
> + * If IP is already set, return -1

Is this comment out of date?  Instead, it could mention that this also
implicitly resets the reply status code to zero, so by default the
command returns success.

Other than that:

Reviewed-by: Peter Xu <peterx@redhat.com>

> + */
> +static void vtd_vcmd_set_ip(IntelIOMMUState *s)
> +{
> +    s->vcrsp = 1;
> +    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
> +                     ((uint64_t) s->vcrsp));
> +}
> +
> +static void vtd_vcmd_clear_ip(IntelIOMMUState *s)
> +{
> +    s->vcrsp &= (~((uint64_t)(0x1)));
> +    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
> +                     ((uint64_t) s->vcrsp));
> +}
> +
> +/* Handle write to Virtual Command Register */
> +static int vtd_handle_vcmd_write(IntelIOMMUState *s, uint64_t val)
> +{
> +    uint32_t pasid;
> +    int ret = -1;
> +
> +    trace_vtd_reg_write_vcmd(s->vcrsp, val);
> +
> +    if (!(s->vccap & VTD_VCCAP_PAS) ||
> +         (s->vcrsp & 1)) {
> +        return -1;
> +    }
> +
> +    /*
> +     * Since vCPU should be blocked when the guest VMCD
> +     * write was trapped to here. Should be no other vCPUs
> +     * try to access VCMD if guest software is well written.
> +     * However, we still emulate the IP bit here in case of
> +     * bad guest software. Also align with the spec.
> +     */
> +    vtd_vcmd_set_ip(s);
> +
> +    switch (val & VTD_VCMD_CMD_MASK) {
> +    case VTD_VCMD_ALLOC_PASID:
> +        ret = vtd_request_pasid_alloc(s, &pasid);
> +        if (ret) {
> +            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_NO_AVAILABLE_PASID);
> +        } else {
> +            s->vcrsp |= VTD_VCRSP_RSLT(pasid);
> +        }
> +        break;
> +
> +    case VTD_VCMD_FREE_PASID:
> +        pasid = VTD_VCMD_PASID_VALUE(val);
> +        ret = vtd_request_pasid_free(s, pasid);
> +        if (ret < 0) {
> +            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_FREE_INVALID_PASID);
> +        }
> +        break;
> +
> +    default:
> +        s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_UNDEFINED_CMD);
> +        error_report_once("Virtual Command: unsupported command!!!");
> +        break;
> +    }
> +    vtd_vcmd_clear_ip(s);
> +    return 0;
> +}
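
To make the response encoding above concrete, here is a small standalone sketch. The field macros are copied from the quoted patch; the `SketchVcmdState` type and the `sketch_*` helpers are hypothetical and model only the s->vcrsp field. It shows that setting IP with a plain assignment of 1 also wipes any stale status code and result, which is why a command defaults to success.

```c
#include <assert.h>
#include <stdint.h>

/* Field macros copied from the quoted patch. */
#define VTD_VCMD_CMD_MASK             0xffUL
#define VTD_VCMD_PASID_VALUE(val)     (((val) >> 8) & 0xfffff)
#define VTD_VCRSP_RSLT(val)           ((val) << 8)
#define VTD_VCRSP_SC(val)             (((val) & 0x3) << 1)
#define VTD_VCMD_NO_AVAILABLE_PASID   2ULL

/* Hypothetical stand-in for the vcrsp field of IntelIOMMUState. */
typedef struct SketchVcmdState {
    uint64_t vcrsp;
} SketchVcmdState;

/* Mirrors vtd_vcmd_set_ip(): IP=1, and SC/RSLT are reset to zero. */
static inline void sketch_set_ip(SketchVcmdState *s)
{
    s->vcrsp = 1;
}

/* Mirrors vtd_vcmd_clear_ip(): only bit 0 is cleared. */
static inline void sketch_clear_ip(SketchVcmdState *s)
{
    s->vcrsp &= ~(uint64_t)1;
}

/* Builds the response for an allocation that returned ret/pasid. */
static inline uint64_t sketch_alloc_response(SketchVcmdState *s,
                                             int ret, uint32_t pasid)
{
    sketch_set_ip(s);
    if (ret) {
        s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_NO_AVAILABLE_PASID);
    } else {
        s->vcrsp |= VTD_VCRSP_RSLT((uint64_t)pasid);
    }
    sketch_clear_ip(s);
    return s->vcrsp;
}
```

With these macros a successful allocation of PASID 0x1234 yields a response of 0x123400 (IP clear, SC zero), and a failed one yields 0x4 (SC = 2).
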
> +
>  static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
>  {
>      IntelIOMMUState *s = opaque;
> @@ -2938,6 +3074,23 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
>          vtd_set_long(s, addr, val);
>          break;
>  
> +    case DMAR_VCMD_REG:
> +        if (!vtd_handle_vcmd_write(s, val)) {
> +            if (size == 4) {
> +                vtd_set_long(s, addr, val);
> +            } else {
> +                vtd_set_quad(s, addr, val);
> +            }
> +        }
> +        break;
> +
> +    case DMAR_VCMD_REG_HI:
> +        assert(size == 4);
> +        if (!vtd_handle_vcmd_write(s, val)) {
> +            vtd_set_long(s, addr, val);
> +        }
> +        break;
> +
>      default:
>          if (size == 4) {
>              vtd_set_long(s, addr, val);
> @@ -3712,7 +3865,8 @@ static void vtd_init(IntelIOMMUState *s)
>          s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
>      } else if (s->scalable_mode && s->scalable_modern) {
>          s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
> -                   | VTD_ECAP_FLTS | VTD_ECAP_PSS;
> +                   | VTD_ECAP_FLTS | VTD_ECAP_PSS | VTD_ECAP_VCS;
> +        s->vccap |= VTD_VCCAP_PAS;
>      }
>  
>      vtd_reset_caches(s);
> @@ -3768,6 +3922,13 @@ static void vtd_init(IntelIOMMUState *s)
>       * Interrupt remapping registers.
>       */
>      vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xfffffffffffff80fULL, 0);
> +
> +    /*
> +     * Virtual Command Definitions
> +     */
> +    vtd_define_quad(s, DMAR_VCCAP_REG, s->vccap, 0, 0);
> +    vtd_define_quad(s, DMAR_VCMD_REG, 0, 0xffffffffffffffffULL, 0);
> +    vtd_define_quad(s, DMAR_VCRSP_REG, 0, 0, 0);
>  }
>  
>  /* Should not reset address_spaces when reset because devices will still use
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index c4dbb2c..fb5fdc2 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -85,6 +85,12 @@
>  #define DMAR_MTRRCAP_REG_HI     0x104
>  #define DMAR_MTRRDEF_REG        0x108 /* MTRR default type */
>  #define DMAR_MTRRDEF_REG_HI     0x10c
> +#define DMAR_VCCAP_REG          0xE00 /* Virtual Command Capability Register */
> +#define DMAR_VCCAP_REG_HI       0xE04
> +#define DMAR_VCMD_REG           0xE10 /* Virtual Command Register */
> +#define DMAR_VCMD_REG_HI        0xE14
> +#define DMAR_VCRSP_REG          0xE20 /* Virtual Command Reponse Register */
> +#define DMAR_VCRSP_REG_HI       0xE24
>  
>  /* IOTLB registers */
>  #define DMAR_IOTLB_REG_OFFSET   0xf0 /* Offset to the IOTLB registers */
> @@ -193,6 +199,7 @@
>  #define VTD_ECAP_PSS                (19ULL << 35)
>  #define VTD_ECAP_PASID              (1ULL << 40)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
> +#define VTD_ECAP_VCS                (1ULL << 44)
>  #define VTD_ECAP_SLTS               (1ULL << 46)
>  #define VTD_ECAP_FLTS               (1ULL << 47)
>  
> @@ -315,6 +322,37 @@ typedef enum VTDFaultReason {
>  
>  #define VTD_CONTEXT_CACHE_GEN_MAX       0xffffffffUL
>  
> +/* VCCAP_REG */
> +#define VTD_VCCAP_PAS               (1UL << 0)
> +
> +/*
> + * The basic idea is to let hypervisor to set a range for available
> + * PASIDs for VMs. One of the reasons is PASID #0 is reserved by
> + * RID_PASID usage. We have no idea how many reserved PASIDs in future,
> + * so here just an evaluated value. Honestly, set it as "1" is enough
> + * at current stage.
> + */
> +#define VTD_MIN_HPASID              1
> +#define VTD_MAX_HPASID              0xFFFFF
> +
> +/* Virtual Command Register */
> +enum {
> +     VTD_VCMD_NULL_CMD = 0,
> +     VTD_VCMD_ALLOC_PASID = 1,
> +     VTD_VCMD_FREE_PASID = 2,
> +     VTD_VCMD_CMD_NUM,
> +};
> +
> +#define VTD_VCMD_CMD_MASK           0xffUL
> +#define VTD_VCMD_PASID_VALUE(val)   (((val) >> 8) & 0xfffff)
> +
> +#define VTD_VCRSP_RSLT(val)         ((val) << 8)
> +#define VTD_VCRSP_SC(val)           (((val) & 0x3) << 1)
> +
> +#define VTD_VCMD_UNDEFINED_CMD         1ULL
> +#define VTD_VCMD_NO_AVAILABLE_PASID    2ULL
> +#define VTD_VCMD_FREE_INVALID_PASID    2ULL
> +
>  /* Interrupt Entry Cache Invalidation Descriptor: VT-d 6.5.2.7. */
>  struct VTDInvDescIEC {
>      uint32_t type:4;            /* Should always be 0x4 */
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index e48bef2..71536a7 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -51,6 +51,7 @@ vtd_reg_write_gcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"
>  vtd_reg_write_fectl(uint32_t value) "value 0x%"PRIx32
>  vtd_reg_write_iectl(uint32_t value) "value 0x%"PRIx32
>  vtd_reg_ics_clear_ip(void) ""
> +vtd_reg_write_vcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"PRIx32
>  vtd_dmar_translate(uint8_t bus, uint8_t slot, uint8_t func, uint64_t iova, uint64_t gpa, uint64_t mask) "dev %02x:%02x.%02x iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64
>  vtd_dmar_enable(bool en) "enable %d"
>  vtd_dmar_fault(uint16_t sid, int fault, uint64_t addr, bool is_write) "sid 0x%"PRIx16" fault %d addr 0x%"PRIx64" write %d"
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 1ef2917..4158116 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -46,7 +46,7 @@
>  #define VTD_SID_TO_BUS(sid)         (((sid) >> 8) & 0xff)
>  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
>  
> -#define DMAR_REG_SIZE               0x230
> +#define DMAR_REG_SIZE               0xF00
>  #define VTD_HOST_AW_39BIT           39
>  #define VTD_HOST_AW_48BIT           48
>  #define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> @@ -285,6 +285,10 @@ struct IntelIOMMUState {
>      uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
>      bool dma_drain;                 /* Whether DMA r/w draining enabled */
>  
> +    /* Virtual Command Register */
> +    uint64_t vccap;                 /* The value of vcmd capability reg */
> +    uint64_t vcrsp;                 /* Current value of VCMD RSP REG */
> +
>      /*
>       * Protects IOMMU states in general.  Currently it protects the
>       * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> -- 
> 2.7.4
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 15/25] intel_iommu: process pasid cache invalidation
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-02-11 20:17     ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 20:17 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun,
	Richard Henderson, Eduardo Habkost

On Wed, Jan 29, 2020 at 04:16:46AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch adds PASID cache invalidation handling. When guest enabled
> PASID usages (e.g. SVA), guest software should issue a proper PASID
> cache invalidation when caching-mode is exposed. This patch only adds
> the draft handling of pasid cache invalidation. Detailed handling will
> be added in subsequent patches.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-02-11 21:56     ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 21:56 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun,
	Richard Henderson, Eduardo Habkost

On Wed, Jan 29, 2020 at 04:16:45AM -0800, Liu, Yi L wrote:
> +/*
> + * The basic idea is to let hypervisor to set a range for available
> + * PASIDs for VMs. One of the reasons is PASID #0 is reserved by
> + * RID_PASID usage. We have no idea how many reserved PASIDs in future,
> + * so here just an evaluated value. Honestly, set it as "1" is enough
> + * at current stage.
> + */
> +#define VTD_MIN_HPASID              1
> +#define VTD_MAX_HPASID              0xFFFFF

One more question: I see that PASID is defined as 20 bits long.  That's
fine.  However, I'm getting confused about how the Scalable Mode PASID
Directory can serve that many PASID entries.

I'm looking at spec 3.4.3, Figure 3-8.

Firstly, we only have two levels for a PASID table.  The context entry
of a device stores a pointer to the "Scalable Mode PASID Directory"
page. I see that there are 2^14 entries in the "Scalable Mode PASID
Directory" page, each pointing to a "Scalable Mode PASID Table".
However... how do they fit in a 4K page if each entry is an x86_64
pointer (8 bytes) and there are 2^14 entries?  Simple math gives me
4K/8 = 512, which means the "Scalable Mode PASID Directory" page can
only hold 512 entries, so where does the 2^14 come from?  Hmm??

Apart from this: I also just noticed (when reading the later part of
the series) that the time a pasid table walk can consume will
depend on this value too.  I'd suggest making this as small as we
can, as long as it satisfies the usage.  We can always bump it in the
future.
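
For reference, the arithmetic in question can be sketched as below. This is one reading of the scalable-mode layout and should be checked against Figure 3-8, not taken as an authoritative answer. It assumes PASID[19:6] indexes the directory, PASID[5:0] indexes a PASID table, directory entries are 8 bytes and PASID table entries are 64 bytes. Under those assumptions the full directory is larger than one 4K page, while each PASID table is exactly one page. All `SKETCH_`/`sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    SKETCH_PASID_BITS        = 20,   /* PASID width */
    SKETCH_TABLE_INDEX_BITS  = 6,    /* assumed: PASID[5:0] */
    SKETCH_DIR_INDEX_BITS    = SKETCH_PASID_BITS - SKETCH_TABLE_INDEX_BITS,
    SKETCH_DIR_ENTRY_BYTES   = 8,    /* assumed: 64-bit directory entry */
    SKETCH_TABLE_ENTRY_BYTES = 64,   /* assumed: 512-bit PASID entry */
    SKETCH_PAGE_BYTES        = 4096,
};

/* Directory slot selected by a PASID (upper 14 bits). */
static inline uint32_t sketch_pasid_dir_index(uint32_t pasid)
{
    return (pasid >> SKETCH_TABLE_INDEX_BITS) &
           ((1u << SKETCH_DIR_INDEX_BITS) - 1);
}

/* PASID table slot selected by a PASID (lower 6 bits). */
static inline uint32_t sketch_pasid_table_index(uint32_t pasid)
{
    return pasid & ((1u << SKETCH_TABLE_INDEX_BITS) - 1);
}

/* Size of a directory covering all 2^14 slots. */
static inline uint64_t sketch_full_dir_bytes(void)
{
    return ((uint64_t)1 << SKETCH_DIR_INDEX_BITS) * SKETCH_DIR_ENTRY_BYTES;
}

/* Size of one PASID table with 2^6 entries. */
static inline uint64_t sketch_table_bytes(void)
{
    return ((uint64_t)1 << SKETCH_TABLE_INDEX_BITS) * SKETCH_TABLE_ENTRY_BYTES;
}
```

Under these assumptions a full directory takes 2^14 * 8 = 128 KiB (32 contiguous 4K pages), a single 4K page holds the 512 directory entries computed above, and each PASID table is 64 * 64 = 4 KiB.
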

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
  2020-01-29 12:16   ` Liu, Yi L
@ 2020-02-11 23:35     ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-11 23:35 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	kevin.tian, jun.j.tian, yi.y.sun, kvm, hao.wu, Jacob Pan, Yi Sun,
	Richard Henderson, Eduardo Habkost

On Wed, Jan 29, 2020 at 04:16:47AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch adds a PASID cache management infrastructure based on
> new added structure VTDPASIDAddressSpace, which is used to track
> the PASID usage and future PASID tagged DMA address translation
> support in vIOMMU.
> 
>     struct VTDPASIDAddressSpace {
>         VTDBus *vtd_bus;
>         uint8_t devfn;
>         AddressSpace as;
>         uint32_t pasid;
>         IntelIOMMUState *iommu_state;
>         VTDContextCacheEntry context_cache_entry;
>         QLIST_ENTRY(VTDPASIDAddressSpace) next;
>         VTDPASIDCacheEntry pasid_cache_entry;
>     };
> 
> Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> is bound with a DMA AddressSpace. Intel VT-d spec requires guest
> software to issue pasid cache invalidation when bind or unbind a
> pasid with an address space under caching-mode. However, as
> VTDPASIDAddressSpace instances also act as pasid cache in this
> implementation, its creation also happens during vIOMMU PASID
> tagged DMA translation. The creation in this path will not be
> added in this patch since no PASID-capable emulated devices for
> now.
> 
> The implementation in this patch manages VTDPASIDAddressSpace
> instances per PASID+BDF (lookup and insert will use PASID and
> BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
> guest bind a PASID with an AddressSpace, QEMU will capture the
> guest pasid selective pasid cache invalidation, and allocate
> remove a VTDPASIDAddressSpace instance per the invalidation
> reasons:
> 
>     *) a present pasid entry moved to non-present
>     *) a present pasid entry to be a present entry
>     *) a non-present pasid entry moved to present
> 
> vIOMMU emulator could figure out the reason by fetching latest
> guest pasid entry.
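
The three cases above can be sketched as a small decision function: given whether a cache entry exists, whether the latest guest pasid entry is present, and whether the two entries are identical, pick the cache action. This is only an illustration of the reasoning in the commit message; the enum and function names are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cache actions for one PASID+BDF pair. */
typedef enum {
    SKETCH_PC_NOOP,    /* nothing to do */
    SKETCH_PC_ADD,     /* non-present -> present: create the instance */
    SKETCH_PC_UPDATE,  /* present -> present with changes: refresh it */
    SKETCH_PC_REMOVE,  /* present -> non-present: drop the instance */
} SketchPcAction;

/*
 * cached:        a VTDPASIDAddressSpace-like instance already exists
 * guest_present: the freshly fetched guest pasid entry is valid
 * same_entry:    cached and fetched entries compare equal
 */
static inline SketchPcAction sketch_pc_action(bool cached,
                                              bool guest_present,
                                              bool same_entry)
{
    if (!guest_present) {
        return cached ? SKETCH_PC_REMOVE : SKETCH_PC_NOOP;
    }
    if (!cached) {
        return SKETCH_PC_ADD;
    }
    return same_entry ? SKETCH_PC_NOOP : SKETCH_PC_UPDATE;
}
```
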
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/i386/intel_iommu.c          | 367 +++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  14 ++
>  hw/i386/trace-events           |   1 +
>  include/hw/i386/intel_iommu.h  |  36 +++-
>  4 files changed, 417 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 58e7213..c75cb7b 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
>  #include "kvm_i386.h"
>  #include "migration/vmstate.h"
>  #include "trace.h"
> +#include "qemu/jhash.h"
>  
>  /* context entry operations */
>  #define VTD_CE_GET_RID2PASID(ce) \
> @@ -65,6 +66,8 @@
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> +
>  static void vtd_panic_require_caching_mode(void)
>  {
>      error_report("We need to set caching-mode=on for intel-iommu to enable "
> @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>      vtd_iommu_lock(s);
>      vtd_reset_iotlb_locked(s);
>      vtd_reset_context_cache_locked(s);
> +    vtd_pasid_cache_reset(s);
>      vtd_iommu_unlock(s);
>  }
>  
> @@ -686,6 +690,11 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
>      return true;
>  }
>  
> +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
> +{
> +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> +}
> +
>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>  {
>      return pdire->val & 1;
> @@ -2393,19 +2402,370 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>      return true;
>  }
>  
> +static inline void vtd_init_pasid_key(uint32_t pasid,
> +                                     uint16_t sid,
> +                                     struct pasid_key *key)
> +{
> +    key->pasid = pasid;
> +    key->sid = sid;
> +}
> +
> +static guint vtd_pasid_as_key_hash(gconstpointer v)
> +{
> +    struct pasid_key *key = (struct pasid_key *)v;
> +    uint32_t a, b, c;
> +
> +    /* Jenkins hash */
> +    a = b = c = JHASH_INITVAL + sizeof(*key);
> +    a += key->sid;
> +    b += extract32(key->pasid, 0, 16);
> +    c += extract32(key->pasid, 16, 16);
> +
> +    __jhash_mix(a, b, c);
> +    __jhash_final(a, b, c);
> +
> +    return c;
> +}
> +
> +static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    const struct pasid_key *k1 = v1;
> +    const struct pasid_key *k2 = v2;
> +
> +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
> +}
> +
> +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> +                                            uint8_t bus_num,
> +                                            uint8_t devfn,
> +                                            uint32_t pasid,
> +                                            VTDPASIDEntry *pe)
> +{
> +    VTDContextEntry ce;
> +    int ret;
> +    dma_addr_t pasid_dir_base;
> +
> +    if (!s->root_scalable) {
> +        return -VTD_FR_PASID_TABLE_INV;
> +    }
> +
> +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> +    ret = vtd_get_pe_from_pasid_table(s,
> +                                  pasid_dir_base, pasid, pe);
> +
> +    return ret;
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> +    return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/**
> + * This function is used to clear pasid_cache_gen of cached pasid
> + * entry in vtd_pasid_as instances. Caller of this function should
> + * hold iommu_lock.
> + */
> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> +                                gpointer user_data)
> +{
> +    VTDPASIDCacheInfo *pc_info = user_data;
> +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> +    VTDPASIDEntry pe;
> +    uint16_t did;
> +    uint32_t pasid;
> +    uint16_t devfn;
> +    int ret;
> +
> +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> +    pasid = vtd_pasid_as->pasid;
> +    devfn = vtd_pasid_as->devfn;
> +
> +    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
> +        return false;
> +    }
> +
> +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        if (pc_info->pasid != pasid) {
> +            return false;
> +        }
> +        /* Fall through */

Why fall through?

> +    case VTD_PASID_CACHE_DOMSI:
> +        if (pc_info->domain_id != did) {
> +            return false;
> +        }
> +        /* Fall through */

Same here.

> +    case VTD_PASID_CACHE_GLOBAL:
> +        break;
> +    default:

We never reach here, right?  If so we can abort.

> +        return false;
> +    }
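
For comparison, the matching that the fall-through switch above performs can be written without fall-through, as sketched below: a PASID-selective request must match both the PASID and the domain, a domain-selective request must match the domain, and a global request matches everything. The `Sketch*` types are hypothetical simplifications of VTDPASIDCacheInfo, and the abort() for an unknown type follows the suggestion above; this is not the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical invalidation granularities. */
typedef enum {
    SKETCH_PC_GLOBAL,
    SKETCH_PC_DOMSI,
    SKETCH_PC_PASIDSI,
} SketchPcType;

/* Hypothetical, simplified invalidation request. */
typedef struct SketchPcInfo {
    SketchPcType type;
    uint16_t domain_id;
    uint32_t pasid;
} SketchPcInfo;

/* Returns true when the cached (did, pasid) pair is covered by info. */
static inline bool sketch_pc_matches(const SketchPcInfo *info,
                                     uint16_t did, uint32_t pasid)
{
    switch (info->type) {
    case SKETCH_PC_PASIDSI:
        return info->pasid == pasid && info->domain_id == did;
    case SKETCH_PC_DOMSI:
        return info->domain_id == did;
    case SKETCH_PC_GLOBAL:
        return true;
    }
    abort(); /* unknown type: treated as a programming error */
}
```

Writing each case out makes the nesting of the three granularities explicit, at the cost of repeating the domain check.
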
> +
> +    /*
> +     * pasid cache invalidation may indicate a present pasid
> +     * entry to present pasid entry modification. To cover such
> +     * case, vIOMMU emulator needs to fetch latest guest pasid
> +     * entry and check cached pasid entry, then update pasid
> +     * cache and send pasid bind/unbind to host properly.
> +     */
> +    ret = vtd_dev_get_pe_from_pasid(s,
> +                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
> +    if (ret) {
> +        /*
> +         * No valid pasid entry in guest memory. e.g. pasid entry
> +         * was modified to be either all-zero or non-present. Either
> +         * case means existing pasid cache should be removed.
> +         */
> +        goto remove;
> +    }
> +    /* Compare cached pasid entry and latest pasid entry */
> +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> +        /* pasid entry was updated, thus update the pasid cache */
> +        pc_entry->pasid_entry = pe;
> +        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> +        /*
> +         * TODO:
> +         * - send pasid bind to host for passthru devices
> +         * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> +         *   should invalidate QEMU piotlb together with this change.
> +         */
> +    }
> +    return false;
> +remove:
> +    /*
> +     * TODO:
> +     * - send pasid unbind to host for passthru devices
> +     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> +     *   should invalidate QEMU piotlb together with this change.
> +     */
> +    return true;
> +}
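
For illustration, the selector check questioned above can be written without
fall-through, with each case stating its full condition and the default case
aborting. This is only a minimal sketch: the PC_* flags, struct pc_info and
pc_info_matches() are hypothetical stand-ins for the VTD_PASID_CACHE_* flags
and VTDPASIDCacheInfo, and abort() stands in for what would probably be
g_assert_not_reached() in QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the VTD_PASID_CACHE_* flags in the patch. */
#define PC_GLOBAL   (1u << 0)
#define PC_DOMSI    (1u << 1)
#define PC_PASIDSI  (1u << 2)
#define PC_MASK     (PC_GLOBAL | PC_DOMSI | PC_PASIDSI)

struct pc_info {
    uint32_t flags;
    uint16_t domain_id;
    uint32_t pasid;
};

/* Return true if a cached entry (did, pasid) is covered by the request. */
static bool pc_info_matches(const struct pc_info *info,
                            uint16_t did, uint32_t pasid)
{
    assert(info != NULL);

    switch (info->flags & PC_MASK) {
    case PC_PASIDSI:
        /* PASID-selective also implies domain-selective. */
        return info->pasid == pasid && info->domain_id == did;
    case PC_DOMSI:
        return info->domain_id == did;
    case PC_GLOBAL:
        return true;
    default:
        /* Callers set exactly one flag; anything else is a bug. */
        abort();
    }
}
```

Spelling out both comparisons in the PASID-selective case keeps the "PSI implies
DSI" rule visible in one place instead of relying on case ordering.
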
> +
>  static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
>  {
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_dsi(domain_id);
> +
> +    pc_info.flags = VTD_PASID_CACHE_DOMSI;
> +    pc_info.domain_id = domain_id;
> +
> +    /*
> +     * Loop all existing pasid caches and update them.
> +     */
> +    vtd_iommu_lock(s);
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                                 vtd_flush_pasid, &pc_info);
> +
> +    /*
> +     * TODO: Domain selective PASID cache invalidation
> +     * flushes all the pasid caches within a domain. To
> +     * be safe, after invalidating the pasid caches, emulator
> +     * needs to replay the pasid bindings by walking guest
> +     * pasid dir and pasid table.

Better to spell out which special case we're handling here: when the
guest sets up a new PASID entry and then sends a PASID DSI.

> +     */
> +    vtd_iommu_unlock(s);
>      return 0;
>  }
>  
> +/**
> + * This function finds or adds a VTDPASIDAddressSpace for a device
> + * when it is bound to a pasid. Caller of this function should hold
> + * iommu_lock.
> + */
> +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> +                                                   VTDBus *vtd_bus,
> +                                                   int devfn,
> +                                                   uint32_t pasid,
> +                                                   bool allocate)
> +{
> +    struct pasid_key key;
> +    struct pasid_key *new_key;
> +    VTDPASIDAddressSpace *vtd_pasid_as;
> +    uint16_t sid;
> +
> +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> +    vtd_init_pasid_key(pasid, sid, &key);
> +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> +
> +    if (!vtd_pasid_as && allocate) {
> +        new_key = g_malloc0(sizeof(*new_key));
> +        vtd_init_pasid_key(pasid, sid, new_key);
> +        /*
> +         * Initiate the vtd_pasid_as structure.
> +         *
> +         * This structure here is used to track the guest pasid
> +         * binding and also serves as pasid-cache management entry.
> +         *
> +         * TODO: in future, if we want to support SVA-aware DMA
> +         *       emulation, the vtd_pasid_as should include an
> +         *       AddressSpace to support DMA emulation.
> +         */
> +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> +        vtd_pasid_as->iommu_state = s;
> +        vtd_pasid_as->vtd_bus = vtd_bus;
> +        vtd_pasid_as->devfn = devfn;
> +        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
> +        vtd_pasid_as->pasid = pasid;
> +        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
> +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> +    }
> +    return vtd_pasid_as;
> +}
> +
> + /**
> +  * This function updates the pasid entry cached in &vtd_pasid_as.
> +  * Caller of this function should hold iommu_lock.
> +  */
> +static inline void vtd_fill_in_pe_cache(
> +              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
> +{
> +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> +
> +    pc_entry->pasid_entry = *pe;
> +    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> +}
> +
>  static int vtd_pasid_cache_psi(IntelIOMMUState *s,
>                                 uint16_t domain_id, uint32_t pasid)
>  {
> +    VTDPASIDCacheInfo pc_info;
> +    VTDPASIDEntry pe;
> +    VTDBus *vtd_bus;
> +    int bus_n, devfn;
> +    VTDPASIDAddressSpace *vtd_pasid_as;
> +    VTDIOMMUContext *vtd_icx;
> +
> +    /* PASID selective implies a DID selective */
> +    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> +    pc_info.domain_id = domain_id;
> +    pc_info.pasid = pasid;
> +
> +    /*
> +     * A pasid selective pasid cache invalidation (PSI) can
> +     * indicate any of the following cases:
> +     * a) a present pasid entry moved to non-present
> +     * b) a present pasid entry modified to another present entry
> +     * c) a non-present pasid entry moved to present
> +     *
> +     * Here the handling of a PSI is:
> +     * 1) loop all the existing vtd_pasid_as instances to update them
> +     *    according to the latest guest pasid entry in pasid table.
> +     *    This makes sure the affected existing vtd_pasid_as instances
> +     *    cache the latest pasid entries. Also, during the loop, the
> +     *    host should be notified if needed, e.g. pasid unbind or pasid
> +     *    update. This should cover case a) and case b).
> +     *
> +     * 2) loop all devices to cover case c)
> +     *    However, it is not good to always loop all devices. In this
> +     *    implementation, we do it this way:
> +     *    - For devices which have VTDIOMMUContext instances,
> +     *      we loop them and check if guest pasid entry exists. If yes,
> +     *      it is case c), we update the pasid cache and also notify
> +     *      host.
> +     *    - For devices which have no VTDIOMMUContext
> +     *      instances, it is not necessary to create pasid cache at
> +     *      this phase since it could be created when vIOMMU does DMA
> +     *      address translation. This is not implemented yet since
> +     *      there are no PASID-capable emulated devices today. If we
> +     *      have them in future, the pasid cache shall be created there.
> +     */
> +
> +    vtd_iommu_lock(s);
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                                vtd_flush_pasid, &pc_info);
> +
> +    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
> +        vtd_bus = vtd_icx->vtd_bus;
> +        devfn = vtd_icx->devfn;
> +        bus_n = pci_bus_num(vtd_bus->bus);
> +
> +        /* Step 1: fetch vtd_pasid_as and check if it is valid */
> +        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
> +                                        devfn, pasid, true);

I feel like you wanted to pass "false" here for "allocate".

> +        if (vtd_pasid_as &&
> +            (s->pasid_cache_gen ==
> +             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
> +            /*
> +             * pasid_cache_gen equals to s->pasid_cache_gen means
> +             * vtd_pasid_as is valid after the above s->vtd_pasid_as
> +             * updates. Thus no need for the below steps.
> +             */
> +            continue;
> +        }
> +
> +        /*
> +         * Step 2: vtd_pasid_as is not valid, it's potentially a
> +         * new pasid bind. Fetch guest pasid entry.
> +         */
> +        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
> +            continue;
> +        }
> +
> +        /*
> +         * Step 3: pasid entry exists, update pasid cache
> +         *
> +         * Here need to check domain ID since guest pasid entry
> +         * exists. What needs to do are:
> +         *   - update the pc_entry in the vtd_pasid_as
> +         *   - set proper pc_entry.pasid_cache_gen
> +         *   - pass down the latest guest pasid entry config to host
> +         *     (will be added in later patch)
> +         */
> +        if (domain_id == vtd_pe_get_domain_id(&pe)) {
> +            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
> +        }
> +    }
> +    vtd_iommu_unlock(s);
>      return 0;
>  }
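
On the "allocate" remark in the loop above: as far as I can tell, passing
true inserts an entry for every registered device even when the guest has no
matching PASID entry. A minimal sketch of the look-up-first alternative
follows. It uses a fixed array instead of a GHashTable, and struct pasid_as,
find_or_add(), psi_one_device() and entries_in_use() are hypothetical names
for illustration only, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 8

/* Hypothetical, simplified stand-in for VTDPASIDAddressSpace. */
struct pasid_as {
    bool     in_use;
    uint16_t sid;
    uint32_t pasid;
    uint16_t domain_id;
};

static struct pasid_as table[MAX_ENTRIES];

/* Find an entry; create one only when 'allocate' is true. */
static struct pasid_as *find_or_add(uint16_t sid, uint32_t pasid,
                                    bool allocate)
{
    struct pasid_as *free_slot = NULL;

    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (table[i].in_use && table[i].sid == sid &&
            table[i].pasid == pasid) {
            return &table[i];
        }
        if (!table[i].in_use && !free_slot) {
            free_slot = &table[i];
        }
    }
    if (!allocate || !free_slot) {
        return NULL;
    }
    free_slot->in_use = true;
    free_slot->sid = sid;
    free_slot->pasid = pasid;
    free_slot->domain_id = 0;
    return free_slot;
}

/*
 * Handle one device for a PASID-selective invalidation. 'guest_present'
 * and 'guest_did' model the result of reading the guest PASID entry.
 * Returns true if a cache entry exists for the device afterwards.
 */
static bool psi_one_device(uint16_t sid, uint32_t pasid, uint16_t inv_did,
                           bool guest_present, uint16_t guest_did)
{
    struct pasid_as *as = find_or_add(sid, pasid, false);

    if (as) {
        return true;    /* already cached, nothing to create */
    }
    if (!guest_present || guest_did != inv_did) {
        return false;   /* do not allocate for a non-matching entry */
    }
    as = find_or_add(sid, pasid, true);
    if (!as) {
        return false;   /* table full in this toy model */
    }
    as->domain_id = guest_did;
    return true;
}

/* Count live entries, to show that failed lookups leave no residue. */
static size_t entries_in_use(void)
{
    size_t n = 0;

    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        n += table[i].in_use ? 1 : 0;
    }
    return n;
}
```

With this ordering the table only ever holds entries that were backed by a
guest PASID entry at the time they were created.
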
>  
> +/**
> + * Caller of this function should hold iommu_lock
> + */
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_reset();
> +
> +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> +
> +    /*
> +     * Reset pasid cache is a big hammer, so use
> +     * g_hash_table_foreach_remove which will free
> +     * the vtd_pasid_as instances.
> +     */
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                           vtd_flush_pasid, &pc_info);
> +    s->pasid_cache_gen = 1;
> +}
> +
>  static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
>  {
> +    trace_vtd_pasid_cache_gsi();
> +
> +    vtd_iommu_lock(s);
> +    vtd_pasid_cache_reset(s);

[1]

> +
> +    /*
> +     * TODO: Global PASID cache invalidation
> +     * flushes all the pasid caches. To be safe, after
> +     * invalidating the pasid caches, emulator needs
> +     * to replay the pasid bindings by walking guest
> +     * pasid dir and pasid table.
> +     */
> +    vtd_iommu_unlock(s);
>      return 0;
>  }
>  
> @@ -3659,8 +4019,11 @@ static int vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
>      VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
>                                                 VTDIOMMUContext,
>                                                 iommu_context);
> +    IntelIOMMUState *s = vtd_dev_icx->iommu_state;
>  
>      vtd_dev_icx->dsi_obj = dsi_obj;
> +    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
> +
>      return 0;
>  }
>  
> @@ -3672,6 +4035,7 @@ static void vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
>                                                 iommu_context);
>  
>      vtd_dev_icx->dsi_obj = NULL;
> +    QLIST_REMOVE(vtd_dev_icx, next);
>  }
>  
>  IOMMUContextOps vtd_iommu_context_ops = {
> @@ -4130,6 +4494,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>      }
>  
>      QLIST_INIT(&s->vtd_as_with_notifiers);
> +    QLIST_INIT(&s->vtd_dev_icx_list);
>      qemu_mutex_init(&s->iommu_lock);
>      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> @@ -4155,6 +4520,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>                                       g_free, g_free);
>      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
>                                                g_free, g_free);
> +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> +                                   vtd_pasid_as_key_equal, g_free, g_free);
>      vtd_init(s);
>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
>      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 6c03560..18a9e50 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -311,6 +311,7 @@ typedef enum VTDFaultReason {
>      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>  
>      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
>  
>      /* This is not a normal fault reason. We use this to indicate some faults
>       * that are not referenced by the VT-d specification.
> @@ -485,6 +486,19 @@ struct VTDRootEntry {
>  };
>  typedef struct VTDRootEntry VTDRootEntry;
>  
> +struct VTDPASIDCacheInfo {
> +#define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
> +#define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
> +#define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
> +    uint32_t flags;
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +};
> +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
> +                                      VTD_PASID_CACHE_DOMSI  | \
> +                                      VTD_PASID_CACHE_PASIDSI)
> +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> +
>  /* Masks for struct VTDRootEntry */
>  #define VTD_ROOT_ENTRY_P            1ULL
>  #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index f7cd4e5..87364a3 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -22,6 +22,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
>  vtd_inv_qi_tail(uint16_t head) "write tail %d"
>  vtd_inv_qi_fetch(void) ""
>  vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_reset(void) ""
>  vtd_pasid_cache_gsi(void) ""
>  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
>  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 4158116..3cc4b74 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>  typedef struct VTDPASIDEntry VTDPASIDEntry;
>  typedef struct VTDIOMMUContext VTDIOMMUContext;
> +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> @@ -101,6 +103,31 @@ struct VTDPASIDEntry {
>      uint64_t val[8];
>  };
>  
> +struct pasid_key {
> +    uint32_t pasid;
> +    uint16_t sid;
> +};
> +
> +struct VTDPASIDCacheEntry {
> +    /*
> +     * The cache entry is obsolete if
> +     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
> +     */
> +    uint32_t pasid_cache_gen;
> +    struct VTDPASIDEntry pasid_entry;
> +};
> +
> +struct VTDPASIDAddressSpace {
> +    VTDBus *vtd_bus;
> +    uint8_t devfn;
> +    AddressSpace as;
> +    uint32_t pasid;
> +    IntelIOMMUState *iommu_state;
> +    VTDContextCacheEntry context_cache_entry;
> +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> +    VTDPASIDCacheEntry pasid_cache_entry;

In vtd_pasid_cache_gsi() [1], you directly reset pasid_cache_gen for
each pasid address space.  You never increase
pasid_cache_entry.pasid_cache_gen.  Then IIUC the gen will always be
either 0 or 1.  And...

> +};
> +
>  struct VTDAddressSpace {
>      PCIBus *bus;
>      uint8_t devfn;
> @@ -122,6 +149,7 @@ struct VTDIOMMUContext {
>      uint8_t devfn;
>      IOMMUContext iommu_context;
>      DualStageIOMMUObject *dsi_obj;
> +    QLIST_ENTRY(VTDIOMMUContext) next;
>      IntelIOMMUState *iommu_state;
>  };
>  
> @@ -272,9 +300,14 @@ struct IntelIOMMUState {
>  
>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> +    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
> +    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */

... This should always be 1.

IIUC you can drop both of the pasid_cache_gen, because in this whole
patchset you'll remove the pasid hash entry when it is invalidated,
right?  Then if the hash entry is there, it must be valid.  When it's
out-dated, it'll be removed from the hash.
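
For comparison, a generation counter is only useful if a global invalidation
bumps it, so that every cached entry becomes stale in O(1) without touching
the entries. The sketch below shows that scheme in isolation, assuming
hypothetical struct cache_state and struct cache_entry types rather than the
structures from the patch. If entries are instead removed from the hash on
invalidation, presence in the hash already implies validity and the counter
carries no extra information.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified cache state; not the structures in the patch. */
struct cache_state {
    uint32_t gen;   /* current generation, kept in [1, UINT32_MAX] */
};

struct cache_entry {
    uint32_t gen;   /* 0 means "never filled" */
};

/* An entry is valid only if it was filled in the current generation. */
static bool entry_valid(const struct cache_state *s,
                        const struct cache_entry *e)
{
    return e->gen == s->gen;
}

/* Mark the entry as filled in the current generation. */
static void entry_fill(const struct cache_state *s, struct cache_entry *e)
{
    e->gen = s->gen;
}

/*
 * Global invalidation in O(1): bump the generation. On wrap-around the
 * counter skips 0; a real implementation would also have to reset every
 * entry at that point, which this sketch leaves out.
 */
static void invalidate_all(struct cache_state *s)
{
    s->gen++;
    if (s->gen == 0) {
        s->gen = 1;
    }
}
```
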

>      /* list of registered notifiers */
>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>  
> +    /* list of VTDIOMMUContexts with DualStageIOMMUObject registered */
> +    QLIST_HEAD(, VTDIOMMUContext) vtd_dev_icx_list;
> +
>      /* interrupt remapping */
>      bool intr_enabled;              /* Whether guest enabled IR */
>      dma_addr_t intr_root;           /* Interrupt remapping table pointer */
> @@ -291,7 +324,8 @@ struct IntelIOMMUState {
>  
>      /*
>       * Protects IOMMU states in general.  Currently it protects the
> -     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> +     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace,
> +     * and pasid cache in VTDPASIDAddressSpace.
>       */
>      QemuMutex iommu_lock;
>  };
> -- 
> 2.7.4
> 

-- 
Peter Xu



* Re: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
@ 2020-02-11 23:35     ` Peter Xu
From: Peter Xu @ 2020-02-11 23:35 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kevin.tian, Jacob Pan, Yi Sun, Eduardo Habkost, kvm, mst,
	jun.j.tian, qemu-devel, eric.auger, alex.williamson, pbonzini,
	hao.wu, yi.y.sun, Richard Henderson, david

On Wed, Jan 29, 2020 at 04:16:47AM -0800, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch adds a PASID cache management infrastructure based on
> the newly added structure VTDPASIDAddressSpace, which is used to
> track the PASID usage and future PASID tagged DMA address
> translation support in vIOMMU.
> 
>     struct VTDPASIDAddressSpace {
>         VTDBus *vtd_bus;
>         uint8_t devfn;
>         AddressSpace as;
>         uint32_t pasid;
>         IntelIOMMUState *iommu_state;
>         VTDContextCacheEntry context_cache_entry;
>         QLIST_ENTRY(VTDPASIDAddressSpace) next;
>         VTDPASIDCacheEntry pasid_cache_entry;
>     };
> 
> Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> is bound with a DMA AddressSpace. Intel VT-d spec requires guest
> software to issue pasid cache invalidation when binding or unbinding
> a pasid with an address space under caching-mode. However, as
> VTDPASIDAddressSpace instances also act as pasid cache in this
> implementation, their creation also happens during vIOMMU PASID
> tagged DMA translation. The creation in this path will not be
> added in this patch since there are no PASID-capable emulated
> devices for now.
> 
> The implementation in this patch manages VTDPASIDAddressSpace
> instances per PASID+BDF (lookup and insert will use PASID and
> BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
> guest binds a PASID with an AddressSpace, QEMU will capture the
> guest pasid selective pasid cache invalidation, and allocate or
> remove a VTDPASIDAddressSpace instance per the invalidation
> reasons:
> 
>     *) a present pasid entry moved to non-present
>     *) a present pasid entry to be a present entry
>     *) a non-present pasid entry moved to present
> 
> vIOMMU emulator could figure out the reason by fetching latest
> guest pasid entry.
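
The three reasons listed in the commit message can be told apart from two
inputs: whether a cached entry exists, and what the latest guest entry looks
like. A minimal sketch of that decision follows, assuming a hypothetical
classify() helper and enum pc_action. A NULL pointer models "no cached entry"
or "no valid guest entry", and the entry layout only mirrors the eight 64-bit
words of VTDPASIDEntry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same shape as VTDPASIDEntry in the patch: eight 64-bit words. */
struct pasid_entry {
    uint64_t val[8];
};

/* Hypothetical action codes, for illustration only. */
enum pc_action {
    PC_NONE,    /* nothing to do */
    PC_ADD,     /* non-present -> present: create cache entry, bind */
    PC_UPDATE,  /* present -> present with changes: refresh cache */
    PC_REMOVE   /* present -> non-present: drop cache entry, unbind */
};

/*
 * cached == NULL: no cache entry exists for this PASID+BDF.
 * guest  == NULL: the guest table has no valid entry for it.
 */
static enum pc_action classify(const struct pasid_entry *cached,
                               const struct pasid_entry *guest)
{
    if (cached && !guest) {
        return PC_REMOVE;
    }
    if (!cached && guest) {
        return PC_ADD;
    }
    if (cached && guest && memcmp(cached, guest, sizeof(*cached)) != 0) {
        return PC_UPDATE;
    }
    return PC_NONE;
}
```

Comparing with memcmp() is safe here because the entry is a plain array of
uint64_t with no padding between elements.
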
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  hw/i386/intel_iommu.c          | 367 +++++++++++++++++++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  14 ++
>  hw/i386/trace-events           |   1 +
>  include/hw/i386/intel_iommu.h  |  36 +++-
>  4 files changed, 417 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 58e7213..c75cb7b 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
>  #include "kvm_i386.h"
>  #include "migration/vmstate.h"
>  #include "trace.h"
> +#include "qemu/jhash.h"
>  
>  /* context entry operations */
>  #define VTD_CE_GET_RID2PASID(ce) \
> @@ -65,6 +66,8 @@
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> +
>  static void vtd_panic_require_caching_mode(void)
>  {
>      error_report("We need to set caching-mode=on for intel-iommu to enable "
> @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>      vtd_iommu_lock(s);
>      vtd_reset_iotlb_locked(s);
>      vtd_reset_context_cache_locked(s);
> +    vtd_pasid_cache_reset(s);
>      vtd_iommu_unlock(s);
>  }
>  
> @@ -686,6 +690,11 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
>      return true;
>  }
>  
> +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
> +{
> +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> +}
> +
>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>  {
>      return pdire->val & 1;
> @@ -2393,19 +2402,370 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
>      return true;
>  }
>  
> +static inline void vtd_init_pasid_key(uint32_t pasid,
> +                                     uint16_t sid,
> +                                     struct pasid_key *key)
> +{
> +    key->pasid = pasid;
> +    key->sid = sid;
> +}
> +
> +static guint vtd_pasid_as_key_hash(gconstpointer v)
> +{
> +    struct pasid_key *key = (struct pasid_key *)v;
> +    uint32_t a, b, c;
> +
> +    /* Jenkins hash */
> +    a = b = c = JHASH_INITVAL + sizeof(*key);
> +    a += key->sid;
> +    b += extract32(key->pasid, 0, 16);
> +    c += extract32(key->pasid, 16, 16);
> +
> +    __jhash_mix(a, b, c);
> +    __jhash_final(a, b, c);
> +
> +    return c;
> +}
> +
> +static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
> +{
> +    const struct pasid_key *k1 = v1;
> +    const struct pasid_key *k2 = v2;
> +
> +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
> +}
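
A note on why the field-by-field comparison above is the safe form: struct
pasid_key holds a uint32_t followed by a uint16_t, so most ABIs add tail
padding, and a lookup key built on the stack does not have that padding
initialized. The sketch below, with a hypothetical make_key() helper that
fills the key's bytes with different values before setting the fields, shows
that field-wise equality does not depend on the padding bytes, whereas a
memcmp() over the whole struct could.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct pasid_key {
    uint32_t pasid;
    uint16_t sid;
    /* most ABIs add two bytes of tail padding here */
};

/* Compare only the named fields, never the padding. */
static bool key_equal(const struct pasid_key *k1, const struct pasid_key *k2)
{
    return k1->pasid == k2->pasid && k1->sid == k2->sid;
}

/*
 * Hypothetical helper: fill the key's storage with 'fill' first, so any
 * padding bytes may differ between keys, then set the fields.
 */
static void make_key(struct pasid_key *k, uint8_t fill,
                     uint32_t pasid, uint16_t sid)
{
    memset(k, fill, sizeof(*k));
    k->pasid = pasid;
    k->sid = sid;
}
```
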
> +
> +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> +                                            uint8_t bus_num,
> +                                            uint8_t devfn,
> +                                            uint32_t pasid,
> +                                            VTDPASIDEntry *pe)
> +{
> +    VTDContextEntry ce;
> +    int ret;
> +    dma_addr_t pasid_dir_base;
> +
> +    if (!s->root_scalable) {
> +        return -VTD_FR_PASID_TABLE_INV;
> +    }
> +
> +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> +    ret = vtd_get_pe_from_pasid_table(s,
> +                                  pasid_dir_base, pasid, pe);
> +
> +    return ret;
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> +    return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/**
> + * This function is used to clear pasid_cache_gen of cached pasid
> + * entry in vtd_pasid_as instances. Caller of this function should
> + * hold iommu_lock.
> + */
> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> +                                gpointer user_data)
> +{
> +    VTDPASIDCacheInfo *pc_info = user_data;
> +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> +    VTDPASIDEntry pe;
> +    uint16_t did;
> +    uint32_t pasid;
> +    uint16_t devfn;
> +    int ret;
> +
> +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> +    pasid = vtd_pasid_as->pasid;
> +    devfn = vtd_pasid_as->devfn;
> +
> +    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
> +        return false;
> +    }
> +
> +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        if (pc_info->pasid != pasid) {
> +            return false;
> +        }
> +        /* Fall through */

Why fall through?

> +    case VTD_PASID_CACHE_DOMSI:
> +        if (pc_info->domain_id != did) {
> +            return false;
> +        }
> +        /* Fall through */

Same here.

> +    case VTD_PASID_CACHE_GLOBAL:
> +        break;
> +    default:

Never reach here, right?  If so we can abort.

> +        return false;
> +    }
> +
> +    /*
> +     * pasid cache invalidation may indicate a present pasid
> +     * entry to present pasid entry modification. To cover such
> +     * case, vIOMMU emulator needs to fetch latest guest pasid
> +     * entry and check cached pasid entry, then update pasid
> +     * cache and send pasid bind/unbind to host properly.
> +     */
> +    ret = vtd_dev_get_pe_from_pasid(s,
> +                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
> +    if (ret) {
> +        /*
> +         * No valid pasid entry in guest memory. e.g. pasid entry
> +         * was modified to be either all-zero or non-present. Either
> +         * case means existing pasid cache should be removed.
> +         */
> +        goto remove;
> +    }
> +    /* Compare cached pasid entry and latest pasid entry */
> +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> +        /* pasid entry was updated, thus update the pasid cache */
> +        pc_entry->pasid_entry = pe;
> +        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> +        /*
> +         * TODO:
> +         * - send pasid bind to host for passthru devices
> +         * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> +         *   should invalidate QEMU piotlb together with this change.
> +         */
> +    }
> +    return false;
> +remove:
> +    /*
> +     * TODO:
> +     * - send pasid unbind to host for passthru devices
> +     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> +     *   should invalidate QEMU piotlb together with this change.
> +     */
> +    return true;
> +}
> +
>  static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
>  {
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_dsi(domain_id);
> +
> +    pc_info.flags = VTD_PASID_CACHE_DOMSI;
> +    pc_info.domain_id = domain_id;
> +
> +    /*
> +     * Loop all existing pasid caches and update them.
> +     */
> +    vtd_iommu_lock(s);
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                                 vtd_flush_pasid, &pc_info);
> +
> +    /*
> +     * TODO: Domain selective PASID cache invalidation
> +     * flushes all the pasid caches within a domain. To
> +     * be safe, after invalidating the pasid caches, emulator
> +     * needs to replay the pasid bindings by walking guest
> +     * pasid dir and pasid table.

Better to spell out which special case we're handling here: when the
guest sets up a new PASID entry and then sends a PASID DSI.

> +     */
> +    vtd_iommu_unlock(s);
>      return 0;
>  }
>  
> +/**
> + * This function finds or adds a VTDPASIDAddressSpace for a device
> + * when it is bound to a pasid. Caller of this function should hold
> + * iommu_lock.
> + */
> +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> +                                                   VTDBus *vtd_bus,
> +                                                   int devfn,
> +                                                   uint32_t pasid,
> +                                                   bool allocate)
> +{
> +    struct pasid_key key;
> +    struct pasid_key *new_key;
> +    VTDPASIDAddressSpace *vtd_pasid_as;
> +    uint16_t sid;
> +
> +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> +    vtd_init_pasid_key(pasid, sid, &key);
> +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> +
> +    if (!vtd_pasid_as && allocate) {
> +        new_key = g_malloc0(sizeof(*new_key));
> +        vtd_init_pasid_key(pasid, sid, new_key);
> +        /*
> +         * Initiate the vtd_pasid_as structure.
> +         *
> +         * This structure here is used to track the guest pasid
> +         * binding and also serves as pasid-cache management entry.
> +         *
> +         * TODO: in future, if we want to support SVA-aware DMA
> +         *       emulation, the vtd_pasid_as should include an
> +         *       AddressSpace to support DMA emulation.
> +         */
> +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> +        vtd_pasid_as->iommu_state = s;
> +        vtd_pasid_as->vtd_bus = vtd_bus;
> +        vtd_pasid_as->devfn = devfn;
> +        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
> +        vtd_pasid_as->pasid = pasid;
> +        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
> +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> +    }
> +    return vtd_pasid_as;
> +}
> +
> + /**
> +  * This function updates the pasid entry cached in &vtd_pasid_as.
> +  * Caller of this function should hold iommu_lock.
> +  */
> +static inline void vtd_fill_in_pe_cache(
> +              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
> +{
> +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> +
> +    pc_entry->pasid_entry = *pe;
> +    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> +}
> +
>  static int vtd_pasid_cache_psi(IntelIOMMUState *s,
>                                 uint16_t domain_id, uint32_t pasid)
>  {
> +    VTDPASIDCacheInfo pc_info;
> +    VTDPASIDEntry pe;
> +    VTDBus *vtd_bus;
> +    int bus_n, devfn;
> +    VTDPASIDAddressSpace *vtd_pasid_as;
> +    VTDIOMMUContext *vtd_icx;
> +
> +    /* PASID selective implies a DID selective */
> +    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> +    pc_info.domain_id = domain_id;
> +    pc_info.pasid = pasid;
> +
> +    /*
> +     * A pasid selective pasid cache invalidation (PSI) can
> +     * indicate any of the following cases:
> +     * a) a present pasid entry moved to non-present
> +     * b) a present pasid entry modified to another present entry
> +     * c) a non-present pasid entry moved to present
> +     *
> +     * Here the handling of a PSI is:
> +     * 1) loop all the existing vtd_pasid_as instances to update them
> +     *    according to the latest guest pasid entry in pasid table.
> +     *    This makes sure the affected existing vtd_pasid_as instances
> +     *    cache the latest pasid entries. Also, during the loop, the
> +     *    host should be notified if needed, e.g. pasid unbind or pasid
> +     *    update. This should cover case a) and case b).
> +     *
> +     * 2) loop all devices to cover case c)
> +     *    However, it is not good to always loop all devices. In this
> +     *    implementation, we do it this way:
> +     *    - For devices which have VTDIOMMUContext instances,
> +     *      we loop them and check if guest pasid entry exists. If yes,
> +     *      it is case c), we update the pasid cache and also notify
> +     *      host.
> +     *    - For devices which have no VTDIOMMUContext
> +     *      instances, it is not necessary to create pasid cache at
> +     *      this phase since it could be created when vIOMMU does DMA
> +     *      address translation. This is not implemented yet since
> +     *      there are no PASID-capable emulated devices today. If we
> +     *      have them in future, the pasid cache shall be created there.
> +     */
> +
> +    vtd_iommu_lock(s);
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                                vtd_flush_pasid, &pc_info);
> +
> +    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
> +        vtd_bus = vtd_icx->vtd_bus;
> +        devfn = vtd_icx->devfn;
> +        bus_n = pci_bus_num(vtd_bus->bus);
> +
> +        /* Step 1: fetch vtd_pasid_as and check if it is valid */
> +        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
> +                                        devfn, pasid, true);

I feel like you wanted to pass "false" here for "allocate".

> +        if (vtd_pasid_as &&
> +            (s->pasid_cache_gen ==
> +             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
> +            /*
> +             * pasid_cache_gen equals to s->pasid_cache_gen means
> +             * vtd_pasid_as is valid after the above s->vtd_pasid_as
> +             * updates. Thus no need for the below steps.
> +             */
> +            continue;
> +        }
> +
> +        /*
> +         * Step 2: vtd_pasid_as is not valid, it's potentially a
> +         * new pasid bind. Fetch guest pasid entry.
> +         */
> +        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
> +            continue;
> +        }
> +
> +        /*
> +         * Step 3: pasid entry exists, update pasid cache
> +         *
> +         * Here need to check domain ID since guest pasid entry
> +         * exists. What needs to do are:
> +         *   - update the pc_entry in the vtd_pasid_as
> +         *   - set proper pc_entry.pasid_cache_gen
> +         *   - pass down the latest guest pasid entry config to host
> +         *     (will be added in later patch)
> +         */
> +        if (domain_id == vtd_pe_get_domain_id(&pe)) {
> +            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
> +        }
> +    }
> +    vtd_iommu_unlock(s);
>      return 0;
>  }
>  
> +/**
> + * Caller of this function should hold iommu_lock
> + */
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_reset();
> +
> +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> +
> +    /*
> +     * Reset pasid cache is a big hammer, so use
> +     * g_hash_table_foreach_remove which will free
> +     * the vtd_pasid_as instances.
> +     */
> +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> +                           vtd_flush_pasid, &pc_info);
> +    s->pasid_cache_gen = 1;
> +}
> +
>  static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
>  {
> +    trace_vtd_pasid_cache_gsi();
> +
> +    vtd_iommu_lock(s);
> +    vtd_pasid_cache_reset(s);

[1]

> +
> +    /*
> +     * TODO: Global PASID cache invalidation may
> +     * flush all the pasid caches. To be safe, after
> +     * invalidating the pasid caches, emulator needs
> +     * to replay the pasid bindings by walking guest
> +     * pasid dir and pasid table.
> +     */
> +    vtd_iommu_unlock(s);
>      return 0;
>  }
>  
> @@ -3659,8 +4019,11 @@ static int vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
>      VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
>                                                 VTDIOMMUContext,
>                                                 iommu_context);
> +    IntelIOMMUState *s = vtd_dev_icx->iommu_state;
>  
>      vtd_dev_icx->dsi_obj = dsi_obj;
> +    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
> +
>      return 0;
>  }
>  
> @@ -3672,6 +4035,7 @@ static void vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
>                                                 iommu_context);
>  
>      vtd_dev_icx->dsi_obj = NULL;
> +    QLIST_REMOVE(vtd_dev_icx, next);
>  }
>  
>  IOMMUContextOps vtd_iommu_context_ops = {
> @@ -4130,6 +4494,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>      }
>  
>      QLIST_INIT(&s->vtd_as_with_notifiers);
> +    QLIST_INIT(&s->vtd_dev_icx_list);
>      qemu_mutex_init(&s->iommu_lock);
>      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
>      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> @@ -4155,6 +4520,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
>                                       g_free, g_free);
>      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash, vtd_uint64_equal,
>                                                g_free, g_free);
> +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> +                                   vtd_pasid_as_key_equal, g_free, g_free);
>      vtd_init(s);
>      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
>      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 6c03560..18a9e50 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -311,6 +311,7 @@ typedef enum VTDFaultReason {
>      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>  
>      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
>  
>      /* This is not a normal fault reason. We use this to indicate some faults
>       * that are not referenced by the VT-d specification.
> @@ -485,6 +486,19 @@ struct VTDRootEntry {
>  };
>  typedef struct VTDRootEntry VTDRootEntry;
>  
> +struct VTDPASIDCacheInfo {
> +#define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
> +#define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
> +#define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
> +    uint32_t flags;
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +};
> +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
> +                                      VTD_PASID_CACHE_DOMSI  | \
> +                                      VTD_PASID_CACHE_PASIDSI)
> +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> +
>  /* Masks for struct VTDRootEntry */
>  #define VTD_ROOT_ENTRY_P            1ULL
>  #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index f7cd4e5..87364a3 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -22,6 +22,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
>  vtd_inv_qi_tail(uint16_t head) "write tail %d"
>  vtd_inv_qi_fetch(void) ""
>  vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_reset(void) ""
>  vtd_pasid_cache_gsi(void) ""
>  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation domain 0x%"PRIx16
>  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 4158116..3cc4b74 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
>  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
>  typedef struct VTDPASIDEntry VTDPASIDEntry;
>  typedef struct VTDIOMMUContext VTDIOMMUContext;
> +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
>  
>  /* Context-Entry */
>  struct VTDContextEntry {
> @@ -101,6 +103,31 @@ struct VTDPASIDEntry {
>      uint64_t val[8];
>  };
>  
> +struct pasid_key {
> +    uint32_t pasid;
> +    uint16_t sid;
> +};
> +
> +struct VTDPASIDCacheEntry {
> +    /*
> +     * The cache entry is obsolete if
> +     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
> +     */
> +    uint32_t pasid_cache_gen;
> +    struct VTDPASIDEntry pasid_entry;
> +};
> +
> +struct VTDPASIDAddressSpace {
> +    VTDBus *vtd_bus;
> +    uint8_t devfn;
> +    AddressSpace as;
> +    uint32_t pasid;
> +    IntelIOMMUState *iommu_state;
> +    VTDContextCacheEntry context_cache_entry;
> +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> +    VTDPASIDCacheEntry pasid_cache_entry;

In vtd_pasid_cache_gsi() [1], you directly reset pasid_cache_gen for
each pasid address space.  You never increase
pasid_cache_entry.pasid_cache_gen.  Then IIUC the gen will always be
either 0 or 1.  And...

> +};
> +
>  struct VTDAddressSpace {
>      PCIBus *bus;
>      uint8_t devfn;
> @@ -122,6 +149,7 @@ struct VTDIOMMUContext {
>      uint8_t devfn;
>      IOMMUContext iommu_context;
>      DualStageIOMMUObject *dsi_obj;
> +    QLIST_ENTRY(VTDIOMMUContext) next;
>      IntelIOMMUState *iommu_state;
>  };
>  
> @@ -272,9 +300,14 @@ struct IntelIOMMUState {
>  
>      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
>      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> +    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
> +    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */

... This should always be 1.

IIUC you can drop both pasid_cache_gen fields, because throughout this
patchset you remove the pasid hash entry when it is invalidated,
right?  Then if the hash entry is there, it must be valid.  When it
becomes outdated, it is removed from the hash.
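
A minimal sketch of that idea, using a toy fixed-size table instead of a
GHashTable and hypothetical names (PasidCacheEntry, pc_lookup and so on):
because invalidation removes entries, a successful lookup already implies
validity and no generation counter is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 16

/* Hypothetical, simplified PASID cache entry: no generation field. */
typedef struct {
    bool present;
    uint16_t sid;
    uint16_t domain_id;
    uint32_t pasid;
} PasidCacheEntry;

static PasidCacheEntry cache[CACHE_SLOTS];

/* A hit means the entry is valid; stale entries are never left behind. */
static PasidCacheEntry *pc_lookup(uint16_t sid, uint32_t pasid)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].present && cache[i].sid == sid &&
            cache[i].pasid == pasid) {
            return &cache[i];
        }
    }
    return NULL;
}

/* Insert or update the entry for (sid, pasid); false if the table is full. */
static bool pc_insert(uint16_t sid, uint32_t pasid, uint16_t domain_id)
{
    PasidCacheEntry *e = pc_lookup(sid, pasid);

    for (size_t i = 0; !e && i < CACHE_SLOTS; i++) {
        if (!cache[i].present) {
            e = &cache[i];
        }
    }
    if (!e) {
        return false;
    }
    *e = (PasidCacheEntry){ .present = true, .sid = sid,
                            .domain_id = domain_id, .pasid = pasid };
    return true;
}

/* Domain-selective invalidation: drop matching entries, return the count. */
static size_t pc_invalidate_domain(uint16_t domain_id)
{
    size_t removed = 0;

    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].present && cache[i].domain_id == domain_id) {
            cache[i].present = false;
            removed++;
        }
    }
    return removed;
}

/* Global invalidation: drop everything. */
static void pc_invalidate_all(void)
{
    memset(cache, 0, sizeof(cache));
}
```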

>      /* list of registered notifiers */
>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>  
> +    /* list of VTDIOMMUContexts with DualStageIOMMUObject registered */
> +    QLIST_HEAD(, VTDIOMMUContext) vtd_dev_icx_list;
> +
>      /* interrupt remapping */
>      bool intr_enabled;              /* Whether guest enabled IR */
>      dma_addr_t intr_root;           /* Interrupt remapping table pointer */
> @@ -291,7 +324,8 @@ struct IntelIOMMUState {
>  
>      /*
>       * Protects IOMMU states in general.  Currently it protects the
> -     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> +     * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace,
> +     * and pasid cache in VTDPASIDAddressSpace.
>       */
>      QemuMutex iommu_lock;
>  };
> -- 
> 2.7.4
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject
  2020-01-31 11:42       ` Liu, Yi L
@ 2020-02-12  6:32         ` David Gibson
  -1 siblings, 0 replies; 136+ messages in thread
From: David Gibson @ 2020-02-12  6:32 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, pbonzini, alex.williamson, peterx, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun

[-- Attachment #1: Type: text/plain, Size: 6315 bytes --]

On Fri, Jan 31, 2020 at 11:42:06AM +0000, Liu, Yi L wrote:
> Hi David,
> 
> > From: David Gibson [mailto:david@gibson.dropbear.id.au]
> > Sent: Friday, January 31, 2020 11:59 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject
> > 
> > On Wed, Jan 29, 2020 at 04:16:33AM -0800, Liu, Yi L wrote:
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > Currently, many platform vendors provide the capability of dual stage
> > > DMA address translation in hardware. For example, nested translation
> > > on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
> > > and etc. In dual stage DMA address translation, there are two stages
> > > address translation, stage-1 (a.k.a first-level) and stage-2 (a.k.a
> > > second-level) translation structures. Stage-1 translation results are
> > > also subjected to stage-2 translation structures. Take vSVA (Virtual
> > > Shared Virtual Addressing) as an example, guest IOMMU driver owns
> > > stage-1 translation structures (covers GVA->GPA translation), and host
> > > IOMMU driver owns stage-2 translation structures (covers GPA->HPA
> > > translation). VMM is responsible to bind stage-1 translation structures
> > > to host, thus hardware could achieve GVA->GPA and then GPA->HPA
> > > translation. For more background on SVA, refer the below links.
> > >  - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
> > >  - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
> > > Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf
> > >
> > > As above, dual stage DMA translation offers two stage address mappings,
> > > which could have better DMA address translation support for passthru
> > > devices. This is also what vIOMMU developers are doing so far. Efforts
> > > includes vSVA enabling from Yi Liu and SMMUv3 Nested Stage Setup from
> > > Eric Auger.
> > > https://www.spinics.net/lists/kvm/msg198556.html
> > > https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg02842.html
> > >
> > > Both efforts are aiming to expose a vIOMMU with dual stage hardware
> > > backed. As so, QEMU needs to have an explicit object to stand for
> > > the dual stage capability from hardware. Such object offers abstract
> > > for the dual stage DMA translation related operations, like:
> > >
> > >  1) PASID allocation (allow host to intercept in PASID allocation)
> > >  2) bind stage-1 translation structures to host
> > >  3) propagate stage-1 cache invalidation to host
> > >  4) DMA address translation fault (I/O page fault) servicing etc.
> > >
> > > This patch introduces DualStageIOMMUObject to stand for the hardware
> > > dual stage DMA translation capability. PASID allocation/free are the
> > > first operation included in it, in future, there will be more operations
> > > like bind_stage1_pgtbl and invalidate_stage1_cache and etc.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > Cc: David Gibson <david@gibson.dropbear.id.au>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > 
> > Several overall queries about this:
> > 
> > 1) Since it's explicitly handling PASIDs, this seems a lot more
> >    specific to SVM than the name suggests.  I'd suggest a rename.
> 
> It is not specific to SVM in future. We have efforts to move guest
> IOVA support based on host IOMMU's dual-stage DMA translation
> capability.

It's assuming the existence of pasids though, which is a rather more
specific model than simply having two translation stages.

> Then, guest IOVA support will also re-use the methods
> provided by this abstract layer. e.g. the bind_guest_pgtbl() and
> flush_iommu_iotlb().
> 
> For the naming, how about HostIOMMUContext? This layer is to provide
> explicit methods for setting up dual-stage DMA translation in host.

Uh.. maybe?  I'm still having trouble figuring out what this object
really represents.

> > 2) Why are you hand rolling structures of pointers, rather than making
> >    this a QOM class or interface and putting those things into methods?
> 
> Maybe the name is not proper. Although I named it as DualStageIOMMUObject,
> it is actually a kind of abstract layer we discussed in previous email. I
> think this is similar with VFIO_MAP/UNMAP. The difference is that VFIO_MAP/
> UNMAP programs mappings to host iommu domain. While the newly added explicit
> method is to link guest page table to host iommu domain. VFIO_MAP/UNMAP
> is exposed to vIOMMU emulators via MemoryRegion layer. right? Maybe adding a
> similar abstract layer is enough. Is adding QOM really necessary for this
> case?

Um... sorry, I'm having a lot of trouble making any sense of that.

> > 3) It's not really clear to me if this is for the case where both
> >    stages of translation are visible to the guest, or only one of
> >    them.
> 
> For this case, vIOMMU will only expose a single stage translation to VM.
> e.g. Intel VT-d, vIOMMU exposes first-level translation to guest. Hardware
> IOMMUs with the dual-stage translation capability lets guest own stage-1
> translation structures and host owns the stage-2 translation structures.
> VMM is responsible to bind guest's translation structures to host and
> enable dual-stage translation. e.g. on Intel VT-d, config translation type
> to be NESTED.

Ok, understood.

> Take guest SVM as an example, guest iommu driver owns the gVA->gPA mappings,
> which is treated as stage-1 translation from host point of view. Host itself
> owns the gPA->hPA translation, which is called stage-2 translation when dual-stage
> translation is configured.
> 
> For guest IOVA, it is similar with guest SVM. Guest iommu driver owns the
> gIOVA->gPA mappings, which is treated as stage-1 translation. Host owns the
> gPA->hPA translation.

Ok, that makes sense.  It's still not really clear to me which part of
this setup this object represents.
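
For reference, the two-stage composition described in the quoted text can
be sketched as below.  This is a toy model with single-level tables and
hypothetical names (ToyPageTable, nested_translate); real hardware walks
multi-level tables and also translates the stage-1 table pointers through
stage-2, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define OFFSET_MASK  ((1ULL << PAGE_SHIFT) - 1)
#define NUM_PAGES    4
#define INVALID_PFN  UINT64_MAX

/* Toy single-level table: index is the input frame, value the output frame. */
typedef struct {
    uint64_t pfn[NUM_PAGES];
} ToyPageTable;

/* Translate one address through one stage; false models a fault. */
static bool toy_translate(const ToyPageTable *pt, uint64_t in, uint64_t *out)
{
    uint64_t idx = in >> PAGE_SHIFT;

    if (idx >= NUM_PAGES || pt->pfn[idx] == INVALID_PFN) {
        return false;
    }
    *out = (pt->pfn[idx] << PAGE_SHIFT) | (in & OFFSET_MASK);
    return true;
}

/*
 * Nested translation: stage-1 (guest-owned, gVA->gPA) followed by
 * stage-2 (host-owned, gPA->hPA).
 */
static bool nested_translate(const ToyPageTable *stage1,
                             const ToyPageTable *stage2,
                             uint64_t gva, uint64_t *hpa)
{
    uint64_t gpa;

    return toy_translate(stage1, gva, &gpa) &&
           toy_translate(stage2, gpa, hpa);
}
```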

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-02-11 16:58         ` Peter Xu
@ 2020-02-12  7:15           ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  7:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Gibson, qemu-devel, pbonzini, alex.williamson, mst,
	eric.auger, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao,
	Jacob Pan, Yi Sun

Hi Peter,

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 12:59 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
> 
> On Fri, Jan 31, 2020 at 11:42:13AM +0000, Liu, Yi L wrote:
> > > I'm not very clear on the relationship between an IOMMUContext and a
> > > DualStageIOMMUObject.  Can there be many IOMMUContexts to a
> > > DualStageIOMMUOBject?  The other way around?  Or is it just
> > > zero-or-one DualStageIOMMUObjects to an IOMMUContext?
> >
> > It is possible. As the below patch shows, DualStageIOMMUObject is per vfio
> > container. IOMMUContext can be either per-device or shared across devices,
> > it depends on vendor specific vIOMMU emulators.
> 
> Is there an example when an IOMMUContext can be not per-device?

No, I don't have such an example so far. But since the IOMMUContext is
obtained from pci_device_iommu_context(), in concept it need not be
per-device. It is left to the vIOMMU to decide whether different devices
can share a single IOMMUContext.

> It makes sense to me to have an object that is per-container (in your
> case, the DualStageIOMMUObject, IIUC), then we can connect that object
> to a device.  However I'm a bit confused on why we've got two abstract
> layers (the other one is IOMMUContext)?  That was previously for the
> whole SVA new APIs, now it's all moved over to the other new object,
> then IOMMUContext only register/unregister...

Your understanding is correct. Actually, I also struggled with adding two
abstraction layers. But there are two calling requirements around vSVA
enabling. The first is an explicit method for the vIOMMU to call into
VFIO (PASID allocation, binding the guest page table, cache invalidation).
The second is an explicit method for VFIO to call into the vIOMMU (DMA
fault/PRQ injection, which is not included in this series yet but will be
upstreamed later). So I added DualStageIOMMUObject to cover the
vIOMMU-to-VFIO calls, and IOMMUContext to cover the VFIO-to-vIOMMU calls.
Since IOMMUContext covers the VFIO-to-vIOMMU calls, I made it include
register/unregister.
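
The two call directions can be sketched roughly as two separate ops
tables. All names below (HostOps, GuestOps, ToyHost, ToyVIOMMU) are
hypothetical and only model the split; they are not the structures in
this series.

```c
#include <assert.h>
#include <stdint.h>

/* Direction 1: vIOMMU -> VFIO/host, implemented on the VFIO side. */
typedef struct HostOps {
    int (*pasid_alloc)(void *host, uint32_t min, uint32_t max,
                       uint32_t *pasid);
} HostOps;

/* Direction 2: VFIO -> vIOMMU, implemented by the vIOMMU emulator. */
typedef struct GuestOps {
    int (*inject_fault)(void *viommu, uint32_t pasid, uint64_t addr);
} GuestOps;

/* Toy host side: hands out PASIDs from a simple counter. */
typedef struct ToyHost {
    uint32_t next;
} ToyHost;

static int toy_pasid_alloc(void *opaque, uint32_t min, uint32_t max,
                           uint32_t *pasid)
{
    ToyHost *h = opaque;

    if (h->next < min) {
        h->next = min;
    }
    if (h->next > max) {
        return -1;      /* range exhausted */
    }
    *pasid = h->next++;
    return 0;
}

/* Toy vIOMMU side: records the faults reported by the host side. */
typedef struct ToyVIOMMU {
    unsigned faults;
    uint64_t last_addr;
} ToyVIOMMU;

static int toy_inject_fault(void *opaque, uint32_t pasid, uint64_t addr)
{
    ToyVIOMMU *v = opaque;

    (void)pasid;        /* unused in this toy model */
    v->faults++;
    v->last_addr = addr;
    return 0;
}

static const HostOps toy_host_ops = { .pasid_alloc = toy_pasid_alloc };
static const GuestOps toy_guest_ops = { .inject_fault = toy_inject_fault };
```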

> Can we put the reg/unreg
> procedures into DualStageIOMMUObject as well?  Then we drop the
> IOMMUContext (or say, keep IOMMUContext and drop DualStageIOMMUObject
> but let IOMMUContext to be per-vfio-container, the major difference is
> the naming here, say, PASID allocation does not seem to be related to
> dual-stage at all).
>
> Besides that, not sure I read it right... but even with your current
> series, the container->iommu_ctx will always only be bound to the
> first device created within that container, since you've got:
> 
>     group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev),
>                            pci_device_iommu_context(pdev), errp);
> 
> And:
> 
>     if (vfio_connect_container(group, as, iommu_ctx, errp)) {
>         error_prepend(errp, "failed to setup container for group %d: ",
>                       groupid);
>         goto close_fd_exit;
>     }
> 
> The iommu_ctx will be set to container->iommu_ctx if there's no
> existing container.

Yes, that's true. We may need to add an iommu_ctx list to the VFIO
container, or add a check on the iommu_ctx passed to vfio_get_group(), if
we stick with this direction.
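
A rough sketch of the list option, with hypothetical names (ToyContainer,
toy_connect) and a fixed-size array standing in for a QLIST: the single
first_ctx pointer models the behaviour you describe, while ctx_list keeps
every context that connects to the container.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CTX 4

/* Hypothetical stand-in for a per-device IOMMUContext. */
typedef struct ToyCtx {
    int devfn;
} ToyCtx;

typedef struct ToyContainer {
    ToyCtx *first_ctx;          /* only ever set by the first device */
    ToyCtx *ctx_list[MAX_CTX];  /* alternative: remember every context */
    size_t nr_ctx;
} ToyContainer;

/* Connect a device's context to an existing container. */
static int toy_connect(ToyContainer *c, ToyCtx *ctx)
{
    if (!c->first_ctx) {
        c->first_ctx = ctx;     /* later devices would be ignored here */
    }
    for (size_t i = 0; i < c->nr_ctx; i++) {
        if (c->ctx_list[i] == ctx) {
            return 0;           /* already tracked */
        }
    }
    if (c->nr_ctx == MAX_CTX) {
        return -1;              /* toy limit reached */
    }
    c->ctx_list[c->nr_ctx++] = ctx;
    return 0;
}
```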

While considering your suggestion to drop one of the two abstraction
layers, I came up with a new proposal:

We may drop IOMMUContext in this series and rename DualStageIOMMUObject
to HostIOMMUContext, which is per-vfio-container. Add an interface in the
PCI layer (e.g. a callback in PCIDevice) to let the vIOMMU get the
HostIOMMUContext. I think this could cover the requirement of providing
an explicit method for the vIOMMU to call into VFIO and then program the
host IOMMU.

As for the requirement of VFIO-to-vIOMMU calls (e.g. PRQ), I think it
could be done via the PCI layer by adding an operation in PCIIOMMUOps.
Thoughts?

Thanks,
Yi Liu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 09/25] vfio: check VFIO_TYPE1_NESTING_IOMMU support
  2020-02-11 19:08     ` Peter Xu
@ 2020-02-12  7:16       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  7:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 3:08 AM
> Subject: Re: [RFC v3 09/25] vfio: check VFIO_TYPE1_NESTING_IOMMU support
> 
> On Wed, Jan 29, 2020 at 04:16:40AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO needs to check VFIO_TYPE1_NESTING_IOMMU support with Kernel
> > before further using it.
> > e.g. requires to check IOMMU UAPI version.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > ---
> >  hw/vfio/common.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 0cc7ff5..a5e70b1 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1157,12 +1157,21 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
> >  static int vfio_get_iommu_type(VFIOContainer *container,
> >                                 Error **errp)
> >  {
> > -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> > +    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
> > +                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> >                            VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
> > -    int i;
> > +    int i, version;
> >
> >      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
> >          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> > +            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU) {
> > +                version = ioctl(container->fd,
> > +                                VFIO_NESTING_GET_IOMMU_UAPI_VERSION);
> > +                if (version < IOMMU_UAPI_VERSION) {
> > +                    printf("IOMMU UAPI incompatible for nesting\n");
> 
> There should be better alternatives than printf()... Maybe warn_report()?

Got it. thanks. 😊

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 11/25] vfio: get stage-1 pasid formats from Kernel
  2020-02-11 19:30     ` Peter Xu
@ 2020-02-12  7:19       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  7:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 3:30 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 11/25] vfio: get stage-1 pasid formats from Kernel
> 
> On Wed, Jan 29, 2020 at 04:16:42AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO checks the IOMMU UAPI version when it finds that the kernel supports
> > VFIO_TYPE1_NESTING_IOMMU. That is enough for a UAPI compatibility check.
> > However, a given IOMMU UAPI version may support multiple stage-1 PASID
> > formats, which is quite likely since the IOMMU UAPI covers stage-1
> > formats across all IOMMU vendors.
> > So VFIO needs to get the supported formats from the kernel and pass them
> > to the vIOMMU, so that the vIOMMU can select the proper format when
> > setting up dual-stage DMA translation.
> >
> > This patch gets the stage-1 PASID format from the kernel via the
> > VFIO_IOMMU_GET_INFO ioctl and passes the supported format to the vIOMMU
> > through the DualStageIOMMUObject instance registered with the vIOMMU.
> >
> > This patch borrows some code from Shameer Kolothum:
> > https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg03759.html
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/iommu/dual_stage_iommu.c         |  5 ++-
> >  hw/vfio/common.c                    | 85 ++++++++++++++++++++++++++++++++++++-
> >  include/hw/iommu/dual_stage_iommu.h | 10 ++++-
> >  3 files changed, 97 insertions(+), 3 deletions(-)
> >
> > diff --git a/hw/iommu/dual_stage_iommu.c b/hw/iommu/dual_stage_iommu.c
> > index be4179d..d5a7168 100644
> > --- a/hw/iommu/dual_stage_iommu.c
> > +++ b/hw/iommu/dual_stage_iommu.c
> > @@ -48,9 +48,12 @@ int ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj, uint32_t pasid)
> >  }
> >
> >  void ds_iommu_object_init(DualStageIOMMUObject *dsi_obj,
> > -                          DualStageIOMMUOps *ops)
> > +                          DualStageIOMMUOps *ops,
> > +                          DualStageIOMMUInfo *uinfo)
> >  {
> >      dsi_obj->ops = ops;
> > +
> > +    dsi_obj->uinfo.pasid_format = uinfo->pasid_format;
> >  }
> >
> >  void ds_iommu_object_destroy(DualStageIOMMUObject *dsi_obj)
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index fc1723d..a07824b 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1182,10 +1182,84 @@ static int vfio_get_iommu_type(VFIOContainer *container,
> >  static struct DualStageIOMMUOps vfio_ds_iommu_ops = {
> >  };
> >
> > +static int vfio_get_iommu_info(VFIOContainer *container,
> > +                         struct vfio_iommu_type1_info **info)
> 
> Better to comment on the function to remind callers to free(*info) after
> use.

Will do. 😊
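
For reference, a minimal sketch of the argsz retry pattern with the
ownership rule spelled out in the function comment. It is not the patch
code: fake_get_info() is a hypothetical stand-in for
ioctl(fd, VFIO_IOMMU_GET_INFO, info), struct info is a cut-down stand-in
for struct vfio_iommu_type1_info, and plain calloc()/realloc()/free()
replace the glib allocators so that the sketch builds on its own.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Cut-down stand-in for struct vfio_iommu_type1_info. */
struct info {
    uint32_t argsz;
    uint32_t flags;
};

#define FAKE_FULL_SIZE 64u  /* size the fake "kernel" wants to report */

/*
 * Hypothetical stand-in for the ioctl: it always writes the size it needs
 * into argsz and fills the tail only when the caller's buffer is big enough.
 */
static int fake_get_info(struct info *info)
{
    if (info->argsz < sizeof(*info)) {
        errno = EINVAL;
        return -1;
    }
    if (info->argsz >= FAKE_FULL_SIZE) {
        memset((uint8_t *)info + sizeof(*info), 0xab,
               FAKE_FULL_SIZE - sizeof(*info));
    }
    info->flags = 1;
    info->argsz = FAKE_FULL_SIZE;
    return 0;
}

/*
 * get_info: fetch the variable-sized info blob.
 *
 * On success, returns 0 and stores a heap buffer in *out. The caller owns
 * that buffer and must free() it after use. On failure, returns a negative
 * errno value and sets *out to NULL.
 */
static int get_info(struct info **out)
{
    size_t argsz = sizeof(struct info);
    struct info *info;

    assert(out != NULL);
    *out = NULL;

    info = calloc(1, argsz);
    if (!info) {
        return -ENOMEM;
    }

    for (;;) {
        struct info *bigger;

        info->argsz = (uint32_t)argsz;
        if (fake_get_info(info)) {
            int ret = -errno;   /* save before free() can touch errno */
            free(info);
            return ret;
        }
        if (info->argsz <= argsz) {
            break;              /* the buffer was large enough */
        }
        argsz = info->argsz;    /* grow to the reported size and retry */
        bigger = realloc(info, argsz);
        if (!bigger) {
            free(info);
            return -ENOMEM;
        }
        info = bigger;
    }

    *out = info;
    return 0;
}
```

The first call is made with only the fixed header, the reported argsz tells
the caller how much to allocate, and the second call returns the full blob.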

> 
> > +{
> > +
> > +    size_t argsz = sizeof(struct vfio_iommu_type1_info);
> > +
> 
> Nit: extra newline.

accepted. 😊
 
> > +
> > +    *info = g_malloc0(argsz);
> > +
> > +retry:
> > +    (*info)->argsz = argsz;
> > +
> > +    if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
> > +        g_free(*info);
> > +        *info = NULL;
> > +        return -errno;
> > +    }
> > +
> > +    if (((*info)->argsz > argsz)) {
> > +        argsz = (*info)->argsz;
> > +        *info = g_realloc(*info, argsz);
> > +        goto retry;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static struct vfio_info_cap_header *
> > +vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
> > +{
> > +    struct vfio_info_cap_header *hdr;
> > +    void *ptr = info;
> > +
> > +    if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
> > +        return NULL;
> > +    }
> > +
> > +    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
> > +        if (hdr->id == id) {
> > +            return hdr;
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static int vfio_get_nesting_iommu_format(VFIOContainer *container,
> > +                                         uint32_t *pasid_format)
> > +{
> > +    struct vfio_iommu_type1_info *info;
> > +    struct vfio_info_cap_header *hdr;
> > +    struct vfio_iommu_type1_info_cap_nesting *cap;
> > +
> > +    if (vfio_get_iommu_info(container, &info)) {
> > +        return -errno;
> 
> Should return the retcode from vfio_get_iommu_info.

Yes, it should. Thanks for catching it.

> > +    }
> > +
> > +    hdr = vfio_get_iommu_info_cap(info,
> > +                        VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
> > +    if (!hdr) {
> > +        g_free(info);
> > +        return -errno;
> > +    }
> > +
> > +    cap = container_of(hdr,
> > +                struct vfio_iommu_type1_info_cap_nesting, header);
> > +    *pasid_format = cap->pasid_format;
> > +
> > +    g_free(info);
> > +    return 0;
> > +}
> > +
> >  static int vfio_init_container(VFIOContainer *container, int group_fd,
> >                                 Error **errp)
> >  {
> >      int iommu_type, ret;
> > +    uint32_t format;
> > +    DualStageIOMMUInfo uinfo;
> >
> >      iommu_type = vfio_get_iommu_type(container, errp);
> >      if (iommu_type < 0) {
> > @@ -1214,7 +1288,16 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
> >      }
> >
> >      if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> > -        ds_iommu_object_init(&container->dsi_obj, &vfio_ds_iommu_ops);
> > +        if (vfio_get_nesting_iommu_format(container, &format)) {
> > +            error_setg_errno(errp, errno,
> > +                             "Failed to get nesting iommu format");
> > +            return -errno;
> 
> Same here, you might want to return the retcode from
> vfio_get_nesting_iommu_format()?

will do it. 😊
 
Thanks for your comments, I'll address them in the next version.
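
As a side note on the capability lookup quoted above: the chain is a list of
headers linked by byte offsets from the start of the info buffer, and a
"next" of 0 ends the chain (that is what the hdr != ptr test in the patch
relies on). A self-contained sketch of that walk follows. The structures are
simplified stand-ins for the <linux/vfio.h> ones, INFO_FLAG_CAPS is a
hypothetical flag bit, and the bounds check and hop limit are additions of
this sketch rather than something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the <linux/vfio.h> structures. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;          /* offset of the next capability, 0 = end */
};

struct info_header {
    uint32_t argsz;
    uint32_t flags;
    uint32_t cap_offset;    /* offset of the first capability, 0 = none */
};

#define INFO_FLAG_CAPS (1u << 0)    /* hypothetical flag bit */
#define MAX_CAPS       64           /* hop limit against looping chains */

/*
 * Return the byte offset of capability @id inside @buf, or -1 if it is
 * absent or the chain is malformed.
 */
static long find_cap(const uint8_t *buf, size_t len, uint16_t id)
{
    struct info_header info;
    struct cap_header hdr;
    uint32_t off;
    int hops;

    assert(buf != NULL);

    if (len < sizeof(info)) {
        return -1;
    }
    memcpy(&info, buf, sizeof(info));
    if (!(info.flags & INFO_FLAG_CAPS)) {
        return -1;
    }

    off = info.cap_offset;
    for (hops = 0; off != 0 && hops < MAX_CAPS; hops++) {
        if (off > len - sizeof(hdr)) {
            return -1;      /* header would run past the buffer */
        }
        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.id == id) {
            return (long)off;
        }
        off = hdr.next;
    }
    return -1;
}
```

Copying each header out with memcpy() avoids alignment assumptions about the
buffer, at the cost of returning an offset instead of a pointer.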

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 12/25] vfio/common: add pasid_alloc/free support
  2020-02-11 19:31     ` Peter Xu
@ 2020-02-12  7:20       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  7:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 3:32 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 12/25] vfio/common: add pasid_alloc/free support
> 
> On Wed, Jan 29, 2020 at 04:16:43AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch adds VFIO PASID alloc/free support, which lets the host
> > intercept PASID allocation for the VM, by adding a VFIO implementation of
> > the DualStageIOMMUOps.pasid_alloc/free callbacks.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: David Gibson <david@gibson.dropbear.id.au>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/vfio/common.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> >
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index a07824b..014f4e7 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1179,7 +1179,49 @@ static int vfio_get_iommu_type(VFIOContainer *container,
> >      return -EINVAL;
> >  }
> >
> > +static int vfio_ds_iommu_pasid_alloc(DualStageIOMMUObject *dsi_obj,
> > +                         uint32_t min, uint32_t max, uint32_t *pasid)
> > +{
> > +    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
> > +    struct vfio_iommu_type1_pasid_request req;
> > +    unsigned long argsz;
> > +
> > +    argsz = sizeof(req);
> > +    req.argsz = argsz;
> > +    req.flags = VFIO_IOMMU_PASID_ALLOC;
> > +    req.alloc_pasid.min = min;
> > +    req.alloc_pasid.max = max;
> > +
> > +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> > +        error_report("%s: %d, alloc failed", __func__, -errno);
> > +        return -errno;
> 
> Note that errno is prone to being changed by other syscalls.  Better to
> cache it right after the ioctl.
> 
> > +    }
> > +    *pasid = req.alloc_pasid.result;
> > +    return 0;
> > +}
> > +
> > +static int vfio_ds_iommu_pasid_free(DualStageIOMMUObject *dsi_obj,
> > +                                     uint32_t pasid)
> > +{
> > +    VFIOContainer *container = container_of(dsi_obj, VFIOContainer, dsi_obj);
> > +    struct vfio_iommu_type1_pasid_request req;
> > +    unsigned long argsz;
> > +
> > +    argsz = sizeof(req);
> > +    req.argsz = argsz;
> > +    req.flags = VFIO_IOMMU_PASID_FREE;
> > +    req.free_pasid = pasid;
> > +
> > +    if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
> > +        error_report("%s: %d, free failed", __func__, -errno);
> > +        return -errno;
> 
> Same here.

Got the two comments. Thanks,
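
For the record, a minimal sketch of the errno caching pattern. It uses
close() on an invalid descriptor as a stand-in for the failing
VFIO_IOMMU_PASID_REQUEST ioctl, since the real call needs a VFIO container
fd, and fprintf() as a stand-in for error_report(). The function name is
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Close fd and report failures; returns 0 or a negative errno value. */
static int close_reporting(int fd)
{
    assert(fd >= -1);

    if (close(fd) < 0) {
        int err = errno;    /* cache before any other call can change it */

        /* The reporting call itself may modify errno. */
        fprintf(stderr, "%s: close(%d) failed: %d\n", __func__, fd, -err);
        return -err;        /* use the cached value, not errno */
    }
    return 0;
}
```

The value reported and the value returned are then guaranteed to be the same
one, whatever the reporting function does internally.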

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string option
  2020-02-11 19:43     ` Peter Xu
@ 2020-02-12  7:28       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  7:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 3:44 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string
> option
> 
> On Wed, Jan 29, 2020 at 04:16:44AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > Intel VT-d 3.0 introduces scalable mode, which comes with a number of
> > capabilities related to scalable mode translation, so there are many
> > possible combinations. This vIOMMU implementation simplifies things for
> > the user by providing typical combinations, which can be selected with
> > the "x-scalable-mode" option. The usage is as below:
> >
> > "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
> 
> Maybe also "off" when someone wants to explicitly disable it?

Hmm, I think x-scalable-mode should be disabled by default, and enabled only
when "legacy" or "modern" is configured. I'm fine with adding "off" as an
explicit way to turn it off if you think it is necessary. :-)

> >
> >  - "legacy": gives support for SL page table
> >  - "modern": gives support for FL page table, pasid, virtual command
> >  - if not configured, scalable mode is not supported; if configured
> >    improperly, an error is thrown
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > ---
> >  hw/i386/intel_iommu.c          | 27 +++++++++++++++++++++++++--
> >  hw/i386/intel_iommu_internal.h |  3 +++
> >  include/hw/i386/intel_iommu.h  |  2 ++
> >  3 files changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 1c1eb7f..33be40c 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -3078,7 +3078,7 @@ static Property vtd_properties[] = {
> >      DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
> >                        VTD_HOST_ADDRESS_WIDTH),
> >      DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
> > -    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
> > +    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
> >      DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> > @@ -3708,8 +3708,11 @@ static void vtd_init(IntelIOMMUState *s)
> >      }
> >
> >      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> > -    if (s->scalable_mode) {
> > +    if (s->scalable_mode && !s->scalable_modern) {
> >          s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> > +    } else if (s->scalable_mode && s->scalable_modern) {
> > +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
> > +                   | VTD_ECAP_FLTS | VTD_ECAP_PSS;
> 
> This patch might be good to be the last one after all the impls are ready.

Oh, yes. Let me reorder it in the next version.

> >      }
> >
> >      vtd_reset_caches(s);
> > @@ -3845,6 +3848,26 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> >          return false;
> >      }
> >
> > +    if (s->scalable_mode_str &&
> > +        (strcmp(s->scalable_mode_str, "modern") &&
> > +         strcmp(s->scalable_mode_str, "legacy"))) {
> > +        error_setg(errp, "Invalid x-scalable-mode config");
> 
> Maybe "..., Please use 'modern', 'legacy', or 'off'." to show options.

Got it.
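
Putting the two suggestions together ("off" as an explicit value and an
error message that lists the options), the parsing could look roughly like
the sketch below. ScalableCfg and parse_scalable_mode() are hypothetical
names, the two flags mirror s->scalable_mode and s->scalable_modern from the
patch, and a plain string out-parameter stands in for error_setg().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the two flags used by the patch. */
typedef struct ScalableCfg {
    bool scalable_mode;     /* scalable mode enabled at all */
    bool scalable_modern;   /* "modern" flavour: FL page table, PASID, ... */
} ScalableCfg;

/*
 * Parse the x-scalable-mode string. A NULL string (option not given) and
 * "off" both leave scalable mode disabled. Returns false and sets *errmsg
 * for any other unknown value.
 */
static bool parse_scalable_mode(const char *str, ScalableCfg *cfg,
                                const char **errmsg)
{
    assert(cfg != NULL);

    cfg->scalable_mode = false;
    cfg->scalable_modern = false;
    if (errmsg) {
        *errmsg = NULL;
    }

    if (str == NULL || strcmp(str, "off") == 0) {
        return true;
    }
    if (strcmp(str, "legacy") == 0) {
        cfg->scalable_mode = true;
        return true;
    }
    if (strcmp(str, "modern") == 0) {
        cfg->scalable_mode = true;
        cfg->scalable_modern = true;
        return true;
    }

    if (errmsg) {
        *errmsg = "Invalid x-scalable-mode config, "
                  "please use 'modern', 'legacy' or 'off'";
    }
    return false;
}
```

Treating the unset option and "off" identically keeps the default behaviour
unchanged while still giving users an explicit way to disable the feature.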

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 14/25] intel_iommu: add virtual command capability support
  2020-02-11 20:16     ` Peter Xu
@ 2020-02-12  7:32       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  7:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 4:16 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
> 
> On Wed, Jan 29, 2020 at 04:16:45AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch adds virtual command support to the Intel vIOMMU per the
> > Intel VT-d 3.1 spec, and adds two virtual commands: allocate pasid
> > and free pasid.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > ---
> >  hw/i386/intel_iommu.c          | 163 ++++++++++++++++++++++++++++++++++++++++-
> >  hw/i386/intel_iommu_internal.h |  38 ++++++++++
> >  hw/i386/trace-events           |   1 +
> >  include/hw/i386/intel_iommu.h  |   6 +-
> >  4 files changed, 206 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 33be40c..43a728f 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -2649,6 +2649,142 @@ static void vtd_handle_iectl_write(IntelIOMMUState *s)
> >      }
> >  }
> >
> > +static int vtd_request_pasid_alloc(IntelIOMMUState *s, uint32_t *pasid)
> > +{
> > +    VTDBus *vtd_bus;
> > +    int bus_n, devfn, ret = -errno;
> > +    VTDIOMMUContext *vtd_icx;
> > +
> > +    for (bus_n = 0; bus_n < PCI_BUS_MAX; bus_n++) {
> > +        vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> > +        if (!vtd_bus) {
> > +            continue;
> > +        }
> > +        for (devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
> > +            vtd_icx = vtd_bus->dev_icx[devfn];
> > +            if (!vtd_icx) {
> > +                continue;
> > +            }
> > +
> > +            /*
> > +             * We'll return the first valid result we got. It's
> > +             * a bit hackish in that we don't have a good global
> > +             * interface yet to talk to modules like vfio to deliver
> > +             * this allocation request, so we're leveraging this
> > +             * per-device iommu object to do the same thing just
> > +             * to make sure the allocation happens only once.
> > +             */
> > +            ret = ds_iommu_pasid_alloc(vtd_icx->dsi_obj,
> > +                         VTD_MIN_HPASID, VTD_MAX_HPASID, pasid);
> 
> Your indents are always strange to me for long funcalls...  Not a complaint though,
> as long as no one else complains. :)

Yeah, I'm not happy with them either... I'll try to improve the indentation
for long function calls. 😊

> 
> > +            if (!ret) {
> > +                break;
> > +            }
> > +        }
> > +    }
> > +    return ret;
> > +}
> > +
> > +static int vtd_request_pasid_free(IntelIOMMUState *s, uint32_t pasid)
> > +{
> > +    VTDBus *vtd_bus;
> > +    int bus_n, devfn, ret = -errno;
> > +    VTDIOMMUContext *vtd_icx;
> > +
> > +    for (bus_n = 0; bus_n < PCI_BUS_MAX; bus_n++) {
> > +        vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
> > +        if (!vtd_bus) {
> > +            continue;
> > +        }
> > +        for (devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
> > +            vtd_icx = vtd_bus->dev_icx[devfn];
> > +            if (!vtd_icx) {
> > +                continue;
> > +            }
> > +            /*
> > +             * Similar with pasid allocation. We'll free the pasid
> > +             * on the first successful free operation. It's a bit
> > +             * hackish in that we don't have a good global interface
> > +             * yet to talk to modules like vfio to deliver this pasid
> > +             * free request, so we're leveraging this per-device iommu
> > +             * object to do the same thing just to make sure the
> > +             * free happens only once.
> > +             */
> > +            ret = ds_iommu_pasid_free(vtd_icx->dsi_obj, pasid);
> > +            if (!ret) {
> > +                break;
> > +            }
> > +        }
> > +    }
> > +    return ret;
> > +}
> > +
> > +/*
> > + * If IP is not set, set it and return 0
> > + * If IP is already set, return -1
> 
> Out of date?  Instead can mention that this also resets the reply status code to
> zero implicitly so by default it will return a success.

Ooops, yeah, it's out of date. Will fix it.

> 
> Other than that:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks a lot for the patient review.

Regards,
Yi Liu
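
As a side note on the "return the first valid result" loops quoted above: in C, a break only leaves the innermost loop, so as quoted the scan would move on to the next bus after the first successful call. A minimal standalone sketch of the stop-at-first-success pattern is below. All types and names in it (Backend, Bus, backend_pasid_alloc, the small BUS_MAX/DEVFN_MAX bounds) are hypothetical stand-ins, not QEMU definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUS_MAX   4   /* hypothetical small stand-in for PCI_BUS_MAX */
#define DEVFN_MAX 8   /* hypothetical small stand-in for PCI_DEVFN_MAX */

/* Hypothetical per-device backend; calls counts allocation requests. */
typedef struct Backend {
    uint32_t next_pasid;
    int calls;
} Backend;

typedef struct Bus {
    Backend *dev[DEVFN_MAX];   /* NULL where no device context exists */
} Bus;

/* Stand-in for the per-device allocation call; returns 0 on success. */
static int backend_pasid_alloc(Backend *b, uint32_t *pasid)
{
    assert(b && pasid);
    b->calls++;
    *pasid = b->next_pasid++;
    return 0;
}

/*
 * Walk all buses and devices and stop at the first successful
 * allocation. Returning from inside the inner loop ends both loops,
 * so the request is delivered to exactly one backend.
 */
static int request_pasid_alloc(Bus *buses[BUS_MAX], uint32_t *pasid)
{
    int ret = -1;   /* stays -1 if no device context is found at all */

    for (int bus_n = 0; bus_n < BUS_MAX; bus_n++) {
        Bus *bus = buses[bus_n];
        if (!bus) {
            continue;
        }
        for (int devfn = 0; devfn < DEVFN_MAX; devfn++) {
            Backend *b = bus->dev[devfn];
            if (!b) {
                continue;
            }
            ret = backend_pasid_alloc(b, pasid);
            if (!ret) {
                return 0;
            }
        }
    }
    return ret;
}
```

A goto to a common exit label would work equally well and may fit the surrounding code style better; the sketch also starts from a fixed error value instead of -errno, since errno is not guaranteed to be meaningful at that point.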

> > + */
> > +static void vtd_vcmd_set_ip(IntelIOMMUState *s)
> > +{
> > +    s->vcrsp = 1;
> > +    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
> > +                     ((uint64_t) s->vcrsp));
> > +}
> > +
> > +static void vtd_vcmd_clear_ip(IntelIOMMUState *s)
> > +{
> > +    s->vcrsp &= (~((uint64_t)(0x1)));
> > +    vtd_set_quad_raw(s, DMAR_VCRSP_REG,
> > +                     ((uint64_t) s->vcrsp));
> > +}
> > +
> > +/* Handle write to Virtual Command Register */
> > +static int vtd_handle_vcmd_write(IntelIOMMUState *s, uint64_t val)
> > +{
> > +    uint32_t pasid;
> > +    int ret = -1;
> > +
> > +    trace_vtd_reg_write_vcmd(s->vcrsp, val);
> > +
> > +    if (!(s->vccap & VTD_VCCAP_PAS) ||
> > +         (s->vcrsp & 1)) {
> > +        return -1;
> > +    }
> > +
> > +    /*
> > +     * The vCPU should be blocked while the guest VCMD write is
> > +     * trapped here, so no other vCPU should access VCMD if the
> > +     * guest software is well written. However, we still emulate
> > +     * the IP bit here in case of bad guest software, and to
> > +     * align with the spec.
> > +     */
> > +    vtd_vcmd_set_ip(s);
> > +
> > +    switch (val & VTD_VCMD_CMD_MASK) {
> > +    case VTD_VCMD_ALLOC_PASID:
> > +        ret = vtd_request_pasid_alloc(s, &pasid);
> > +        if (ret) {
> > +            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_NO_AVAILABLE_PASID);
> > +        } else {
> > +            s->vcrsp |= VTD_VCRSP_RSLT(pasid);
> > +        }
> > +        break;
> > +
> > +    case VTD_VCMD_FREE_PASID:
> > +        pasid = VTD_VCMD_PASID_VALUE(val);
> > +        ret = vtd_request_pasid_free(s, pasid);
> > +        if (ret < 0) {
> > +            s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_FREE_INVALID_PASID);
> > +        }
> > +        break;
> > +
> > +    default:
> > +        s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_UNDEFINED_CMD);
> > +        error_report_once("Virtual Command: unsupported command!!!");
> > +        break;
> > +    }
> > +    vtd_vcmd_clear_ip(s);
> > +    return 0;
> > +}
> > +
> >  static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
> >  {
> >      IntelIOMMUState *s = opaque;
> > @@ -2938,6 +3074,23 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
> >          vtd_set_long(s, addr, val);
> >          break;
> >
> > +    case DMAR_VCMD_REG:
> > +        if (!vtd_handle_vcmd_write(s, val)) {
> > +            if (size == 4) {
> > +                vtd_set_long(s, addr, val);
> > +            } else {
> > +                vtd_set_quad(s, addr, val);
> > +            }
> > +        }
> > +        break;
> > +
> > +    case DMAR_VCMD_REG_HI:
> > +        assert(size == 4);
> > +        if (!vtd_handle_vcmd_write(s, val)) {
> > +            vtd_set_long(s, addr, val);
> > +        }
> > +        break;
> > +
> >      default:
> >          if (size == 4) {
> >              vtd_set_long(s, addr, val);
> > @@ -3712,7 +3865,8 @@ static void vtd_init(IntelIOMMUState *s)
> >          s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> >      } else if (s->scalable_mode && s->scalable_modern) {
> >          s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
> > -                   | VTD_ECAP_FLTS | VTD_ECAP_PSS;
> > +                   | VTD_ECAP_FLTS | VTD_ECAP_PSS | VTD_ECAP_VCS;
> > +        s->vccap |= VTD_VCCAP_PAS;
> >      }
> >
> >      vtd_reset_caches(s);
> > @@ -3768,6 +3922,13 @@ static void vtd_init(IntelIOMMUState *s)
> >       * Interrupt remapping registers.
> >       */
> >      vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xfffffffffffff80fULL, 0);
> > +
> > +    /*
> > +     * Virtual Command Definitions
> > +     */
> > +    vtd_define_quad(s, DMAR_VCCAP_REG, s->vccap, 0, 0);
> > +    vtd_define_quad(s, DMAR_VCMD_REG, 0, 0xffffffffffffffffULL, 0);
> > +    vtd_define_quad(s, DMAR_VCRSP_REG, 0, 0, 0);
> >  }
> >
> >  /* Should not reset address_spaces when reset because devices will still use
> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index c4dbb2c..fb5fdc2 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -85,6 +85,12 @@
> >  #define DMAR_MTRRCAP_REG_HI     0x104
> >  #define DMAR_MTRRDEF_REG        0x108 /* MTRR default type */
> >  #define DMAR_MTRRDEF_REG_HI     0x10c
> > +#define DMAR_VCCAP_REG          0xE00 /* Virtual Command Capability Register */
> > +#define DMAR_VCCAP_REG_HI       0xE04
> > +#define DMAR_VCMD_REG           0xE10 /* Virtual Command Register */
> > +#define DMAR_VCMD_REG_HI        0xE14
> > +#define DMAR_VCRSP_REG          0xE20 /* Virtual Command Response Register */
> > +#define DMAR_VCRSP_REG_HI       0xE24
> >
> >  /* IOTLB registers */
> >  #define DMAR_IOTLB_REG_OFFSET   0xf0 /* Offset to the IOTLB registers */
> > @@ -193,6 +199,7 @@
> >  #define VTD_ECAP_PSS                (19ULL << 35)
> >  #define VTD_ECAP_PASID              (1ULL << 40)
> >  #define VTD_ECAP_SMTS               (1ULL << 43)
> > +#define VTD_ECAP_VCS                (1ULL << 44)
> >  #define VTD_ECAP_SLTS               (1ULL << 46)
> >  #define VTD_ECAP_FLTS               (1ULL << 47)
> >
> > @@ -315,6 +322,37 @@ typedef enum VTDFaultReason {
> >
> >  #define VTD_CONTEXT_CACHE_GEN_MAX       0xffffffffUL
> >
> > +/* VCCAP_REG */
> > +#define VTD_VCCAP_PAS               (1UL << 0)
> > +
> > +/*
> > + * The basic idea is to let the hypervisor set a range of available
> > + * PASIDs for VMs. One of the reasons is that PASID #0 is reserved
> > + * for RID_PASID usage. We have no idea how many PASIDs will be
> > + * reserved in future, so this is just an estimated value. Honestly,
> > + * setting it to "1" is enough at the current stage.
> > + */
> > +#define VTD_MIN_HPASID              1
> > +#define VTD_MAX_HPASID              0xFFFFF
> > +
> > +/* Virtual Command Register */
> > +enum {
> > +     VTD_VCMD_NULL_CMD = 0,
> > +     VTD_VCMD_ALLOC_PASID = 1,
> > +     VTD_VCMD_FREE_PASID = 2,
> > +     VTD_VCMD_CMD_NUM,
> > +};
> > +
> > +#define VTD_VCMD_CMD_MASK           0xffUL
> > +#define VTD_VCMD_PASID_VALUE(val)   (((val) >> 8) & 0xfffff)
> > +
> > +#define VTD_VCRSP_RSLT(val)         ((val) << 8)
> > +#define VTD_VCRSP_SC(val)           (((val) & 0x3) << 1)
> > +
> > +#define VTD_VCMD_UNDEFINED_CMD         1ULL
> > +#define VTD_VCMD_NO_AVAILABLE_PASID    2ULL
> > +#define VTD_VCMD_FREE_INVALID_PASID    2ULL
> > +
> >  /* Interrupt Entry Cache Invalidation Descriptor: VT-d 6.5.2.7. */
> >  struct VTDInvDescIEC {
> >      uint32_t type:4;            /* Should always be 0x4 */
> > diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> > index e48bef2..71536a7 100644
> > --- a/hw/i386/trace-events
> > +++ b/hw/i386/trace-events
> > @@ -51,6 +51,7 @@ vtd_reg_write_gcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"
> >  vtd_reg_write_fectl(uint32_t value) "value 0x%"PRIx32
> >  vtd_reg_write_iectl(uint32_t value) "value 0x%"PRIx32
> >  vtd_reg_ics_clear_ip(void) ""
> > +vtd_reg_write_vcmd(uint32_t status, uint32_t val) "status 0x%"PRIx32" value 0x%"PRIx32
> >  vtd_dmar_translate(uint8_t bus, uint8_t slot, uint8_t func, uint64_t iova, uint64_t gpa, uint64_t mask) "dev %02x:%02x.%02x iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64
> >  vtd_dmar_enable(bool en) "enable %d"
> >  vtd_dmar_fault(uint16_t sid, int fault, uint64_t addr, bool is_write) "sid 0x%"PRIx16" fault %d addr 0x%"PRIx64" write %d"
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 1ef2917..4158116 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -46,7 +46,7 @@
> >  #define VTD_SID_TO_BUS(sid)         (((sid) >> 8) & 0xff)
> >  #define VTD_SID_TO_DEVFN(sid)       ((sid) & 0xff)
> >
> > -#define DMAR_REG_SIZE               0x230
> > +#define DMAR_REG_SIZE               0xF00
> >  #define VTD_HOST_AW_39BIT           39
> >  #define VTD_HOST_AW_48BIT           48
> >  #define VTD_HOST_ADDRESS_WIDTH      VTD_HOST_AW_39BIT
> > @@ -285,6 +285,10 @@ struct IntelIOMMUState {
> >      uint8_t aw_bits;                /* Host/IOVA address width (in bits) */
> >      bool dma_drain;                 /* Whether DMA r/w draining enabled */
> >
> > +    /* Virtual Command Register */
> > +    uint64_t vccap;                 /* The value of vcmd capability reg */
> > +    uint64_t vcrsp;                 /* Current value of VCMD RSP REG */
> > +
> >      /*
> >       * Protects IOMMU states in general.  Currently it protects the
> >       * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
> > --
> > 2.7.4
> >
> 
> --
> Peter Xu
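
To make the register layout in the quoted hunks easier to follow, here is a small standalone sketch of how a command value and a response value are packed. The macros are copied from the quoted intel_iommu_internal.h hunk; the helper functions are hypothetical guest-side illustrations, and the 20-bit mask applied when decoding the result is an assumption based on VTD_MAX_HPASID being 0xFFFFF.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the quoted intel_iommu_internal.h hunk. */
#define VTD_VCMD_ALLOC_PASID        1
#define VTD_VCMD_FREE_PASID         2
#define VTD_VCMD_CMD_MASK           0xffUL
#define VTD_VCMD_PASID_VALUE(val)   (((val) >> 8) & 0xfffff)
#define VTD_VCRSP_RSLT(val)         ((val) << 8)
#define VTD_VCRSP_SC(val)           (((val) & 0x3) << 1)

/* Hypothetical helper: build a "free PASID" command value. */
static uint64_t vcmd_encode_free(uint32_t pasid)
{
    assert(pasid <= 0xfffff);
    /* Command code in bits 7:0, PASID operand starting at bit 8. */
    return ((uint64_t)pasid << 8) | VTD_VCMD_FREE_PASID;
}

/* Bit 0 of the response register is the in-progress (IP) flag. */
static int vcrsp_in_progress(uint64_t vcrsp)
{
    return (int)(vcrsp & 1);
}

/* Bits 2:1 carry the status code; 0 means success. */
static unsigned vcrsp_status(uint64_t vcrsp)
{
    return (unsigned)((vcrsp >> 1) & 0x3);
}

/* Result starts at bit 8; the 20-bit width is an assumption. */
static uint32_t vcrsp_result(uint64_t vcrsp)
{
    return (uint32_t)((vcrsp >> 8) & 0xfffff);
}
```

A guest would write the encoded command to DMAR_VCMD_REG, poll DMAR_VCRSP_REG until the IP bit reads as clear, and then check the status code before using the result.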


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 15/25] intel_iommu: process pasid cache invalidation
  2020-02-11 20:17     ` Peter Xu
@ 2020-02-12  7:33       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  7:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 4:17 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 15/25] intel_iommu: process pasid cache invalidation
> 
> On Wed, Jan 29, 2020 at 04:16:46AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch adds PASID cache invalidation handling. When the guest
> > enables PASID usage (e.g. SVA), guest software should issue proper
> > PASID cache invalidations when caching-mode is exposed. This patch
> > only adds the draft handling of pasid cache invalidation; detailed
> > handling will be added in subsequent patches.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks😊

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
  2020-02-11 23:35     ` Peter Xu
@ 2020-02-12  8:37       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  8:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 7:36 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> infrastructure
> 
> On Wed, Jan 29, 2020 at 04:16:47AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch adds a PASID cache management infrastructure based on
> > the newly added structure VTDPASIDAddressSpace, which is used to
> > track PASID usage and to support future PASID-tagged DMA address
> > translation in the vIOMMU.
> >
> >     struct VTDPASIDAddressSpace {
> >         VTDBus *vtd_bus;
> >         uint8_t devfn;
> >         AddressSpace as;
> >         uint32_t pasid;
> >         IntelIOMMUState *iommu_state;
> >         VTDContextCacheEntry context_cache_entry;
> >         QLIST_ENTRY(VTDPASIDAddressSpace) next;
> >         VTDPASIDCacheEntry pasid_cache_entry;
> >     };
> >
> > Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> > is bound to a DMA AddressSpace. The Intel VT-d spec requires guest
> > software to issue a pasid cache invalidation when binding or
> > unbinding a pasid with an address space under caching-mode.
> > However, as VTDPASIDAddressSpace instances also act as the pasid
> > cache in this implementation, they are also created during vIOMMU
> > PASID-tagged DMA translation. Creation in that path is not added
> > in this patch since there are no PASID-capable emulated devices
> > for now.
> > The implementation in this patch manages VTDPASIDAddressSpace
> > instances per PASID+BDF (lookup and insert will use PASID and
> > BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
> > guest bind a PASID with an AddressSpace, QEMU will capture the
> > guest pasid selective pasid cache invalidation, and allocate
> > remove a VTDPASIDAddressSpace instance per the invalidation
> > reasons:
> >
> >     *) a present pasid entry moved to non-present
> >     *) a present pasid entry to be a present entry
> >     *) a non-present pasid entry moved to present
> >
> > vIOMMU emulator could figure out the reason by fetching latest
> > guest pasid entry.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/i386/intel_iommu.c          | 367
> +++++++++++++++++++++++++++++++++++++++++
> >  hw/i386/intel_iommu_internal.h |  14 ++
> >  hw/i386/trace-events           |   1 +
> >  include/hw/i386/intel_iommu.h  |  36 +++-
> >  4 files changed, 417 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 58e7213..c75cb7b 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -40,6 +40,7 @@
> >  #include "kvm_i386.h"
> >  #include "migration/vmstate.h"
> >  #include "trace.h"
> > +#include "qemu/jhash.h"
> >
> >  /* context entry operations */
> >  #define VTD_CE_GET_RID2PASID(ce) \
> > @@ -65,6 +66,8 @@
> >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> *n);
> >
> > +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > +
> >  static void vtd_panic_require_caching_mode(void)
> >  {
> >      error_report("We need to set caching-mode=on for intel-iommu to enable "
> > @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
> >      vtd_iommu_lock(s);
> >      vtd_reset_iotlb_locked(s);
> >      vtd_reset_context_cache_locked(s);
> > +    vtd_pasid_cache_reset(s);
> >      vtd_iommu_unlock(s);
> >  }
> >
> > @@ -686,6 +690,11 @@ static inline bool vtd_pe_type_check(X86IOMMUState
> *x86_iommu,
> >      return true;
> >  }
> >
> > +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
> > +{
> > +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> > +}
> > +
> >  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
> >  {
> >      return pdire->val & 1;
> > @@ -2393,19 +2402,370 @@ static bool
> vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
> >      return true;
> >  }
> >
> > +static inline void vtd_init_pasid_key(uint32_t pasid,
> > +                                     uint16_t sid,
> > +                                     struct pasid_key *key)
> > +{
> > +    key->pasid = pasid;
> > +    key->sid = sid;
> > +}
> > +
> > +static guint vtd_pasid_as_key_hash(gconstpointer v)
> > +{
> > +    struct pasid_key *key = (struct pasid_key *)v;
> > +    uint32_t a, b, c;
> > +
> > +    /* Jenkins hash */
> > +    a = b = c = JHASH_INITVAL + sizeof(*key);
> > +    a += key->sid;
> > +    b += extract32(key->pasid, 0, 16);
> > +    c += extract32(key->pasid, 16, 16);
> > +
> > +    __jhash_mix(a, b, c);
> > +    __jhash_final(a, b, c);
> > +
> > +    return c;
> > +}
> > +
> > +static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
> > +{
> > +    const struct pasid_key *k1 = v1;
> > +    const struct pasid_key *k2 = v2;
> > +
> > +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
> > +}
> > +
> > +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> > +                                            uint8_t bus_num,
> > +                                            uint8_t devfn,
> > +                                            uint32_t pasid,
> > +                                            VTDPASIDEntry *pe)
> > +{
> > +    VTDContextEntry ce;
> > +    int ret;
> > +    dma_addr_t pasid_dir_base;
> > +
> > +    if (!s->root_scalable) {
> > +        return -VTD_FR_PASID_TABLE_INV;
> > +    }
> > +
> > +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> > +    ret = vtd_get_pe_from_pasid_table(s,
> > +                                  pasid_dir_base, pasid, pe);
> > +
> > +    return ret;
> > +}
> > +
> > +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
> *p2)
> > +{
> > +    return !memcmp(p1, p2, sizeof(*p1));
> > +}
> > +
> > +/**
> > + * This function is used to clear pasid_cache_gen of cached pasid
> > + * entry in vtd_pasid_as instances. Caller of this function should
> > + * hold iommu_lock.
> > + */
> > +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> > +                                gpointer user_data)
> > +{
> > +    VTDPASIDCacheInfo *pc_info = user_data;
> > +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> > +    VTDPASIDEntry pe;
> > +    uint16_t did;
> > +    uint32_t pasid;
> > +    uint16_t devfn;
> > +    int ret;
> > +
> > +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> > +    pasid = vtd_pasid_as->pasid;
> > +    devfn = vtd_pasid_as->devfn;
> > +
> > +    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
> > +        return false;
> > +    }
> > +
> > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > +    case VTD_PASID_CACHE_PASIDSI:
> > +        if (pc_info->pasid != pasid) {
> > +            return false;
> > +        }
> > +        /* Fall through */
> 
> Why fall through?

A PASID-selective invalidation (VTD_PASID_CACHE_PASIDSI) is also domain
selective, so the did has to be checked as well, just as for
VTD_PASID_CACHE_DOMSI.

> 
> > +    case VTD_PASID_CACHE_DOMSI:
> > +        if (pc_info->domain_id != did) {
> > +            return false;
> > +        }
> > +        /* Fall through */
> 
> Same here.

When the code gets here, all the necessary checks have passed. Strictly
speaking there should be a break here, but since the next case does
nothing except break, I let the code fall through.

> 
> > +    case VTD_PASID_CACHE_GLOBAL:
> > +        break;
> > +    default:
> 
> Never reach here, right?  If so we can abort.

Yes, this should never be reached.
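
To spell out the matching logic in one place, here is a rough standalone
sketch of the three granularities. It is not the patch code: the PC_* flags,
struct pc_info and pc_info_matches() are made-up names for illustration, and
abort() only stands in for whatever assertion helper the real code would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the VTD_PASID_CACHE_* flags in the patch. */
enum {
    PC_GLOBAL  = 1u << 0,
    PC_DOMSI   = 1u << 1,
    PC_PASIDSI = 1u << 2,
    PC_MASK    = PC_GLOBAL | PC_DOMSI | PC_PASIDSI,
};

struct pc_info {
    uint32_t flags;
    uint16_t domain_id;
    uint32_t pasid;
};

/*
 * Return true when a cached entry (pasid, did) is covered by the
 * invalidation described by @info.
 */
static bool pc_info_matches(const struct pc_info *info,
                            uint32_t pasid, uint16_t did)
{
    assert(info != NULL);

    switch (info->flags & PC_MASK) {
    case PC_PASIDSI:
        if (info->pasid != pasid) {
            return false;
        }
        /* PASID-selective implies domain-selective, so check did too. */
        /* fall through */
    case PC_DOMSI:
        return info->domain_id == did;
    case PC_GLOBAL:
        return true;
    default:
        /* Exactly one granularity flag is expected; anything else is a bug. */
        abort();
    }
}
```

With a helper like this, the flush callback would only continue to the
guest pasid entry fetch when the helper returns true.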

> > +        return false;
> > +    }
> > +
> > +    /*
> > +     * pasid cache invalidation may indicate a present pasid
> > +     * entry to present pasid entry modification. To cover such
> > +     * case, vIOMMU emulator needs to fetch latest guest pasid
> > +     * entry and check cached pasid entry, then update pasid
> > +     * cache and send pasid bind/unbind to host properly.
> > +     */
> > +    ret = vtd_dev_get_pe_from_pasid(s,
> > +                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
> > +    if (ret) {
> > +        /*
> > +         * No valid pasid entry in guest memory. e.g. pasid entry
> > +         * was modified to be either all-zero or non-present. Either
> > +         * case means existing pasid cache should be removed.
> > +         */
> > +        goto remove;
> > +    }
> > +    /* Compare cached pasid entry and latest pasid entry */
> > +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> > +        /* pasid entry was updated, thus update the pasid cache */
> > +        pc_entry->pasid_entry = pe;
> > +        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > +        /*
> > +         * TODO:
> > +         * - send pasid bind to host for passthru devices
> > +         * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > +         *   should invalidate QEMU piotlb togehter with this change.
> > +         */
> > +    }
> > +    return false;
> > +remove:
> > +    /*
> > +     * TODO:
> > +     * - send pasid unbind to host for passthru devices
> > +     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > +     *   should invalidate QEMU piotlb togehter with this change.
> > +     */
> > +    return true;
> > +}
> > +
> >  static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
> >  {
> > +    VTDPASIDCacheInfo pc_info;
> > +
> > +    trace_vtd_pasid_cache_dsi(domain_id);
> > +
> > +    pc_info.flags = VTD_PASID_CACHE_DOMSI;
> > +    pc_info.domain_id = domain_id;
> > +
> > +    /*
> > +     * Loop all existing pasid caches and update them.
> > +     */
> > +    vtd_iommu_lock(s);
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                                 vtd_flush_pasid, &pc_info);
> > +
> > +    /*
> > +     * TODO: Domain selective PASID cache invalidation
> > +     * flushes all the pasid caches within a domain. To
> > +     * be safe, after invalidating the pasid caches, emulator
> > +     * needs to replay the pasid bindings by walking guest
> > +     * pasid dir and pasid table.
> 
> Better spell out on what special case we're handling here: When the
> guest setup a new PASID entry then send a PASID DSI.

Oh, yes. I will add it in the next version. :-)

> 
> > +     */
> > +    vtd_iommu_unlock(s);
> >      return 0;
> >  }
> >
> > +/**
> > + * This function finds or adds a VTDPASIDAddressSpace for a device
> > + * when it is bound to a pasid. Caller of this function should hold
> > + * iommu_lock.
> > + */
> > +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> > +                                                   VTDBus *vtd_bus,
> > +                                                   int devfn,
> > +                                                   uint32_t pasid,
> > +                                                   bool allocate)
> > +{
> > +    struct pasid_key key;
> > +    struct pasid_key *new_key;
> > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > +    uint16_t sid;
> > +
> > +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> > +    vtd_init_pasid_key(pasid, sid, &key);
> > +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> > +
> > +    if (!vtd_pasid_as && allocate) {
> > +        new_key = g_malloc0(sizeof(*new_key));
> > +        vtd_init_pasid_key(pasid, sid, new_key);
> > +        /*
> > +         * Initiate the vtd_pasid_as structure.
> > +         *
> > +         * This structure here is used to track the guest pasid
> > +         * binding and also serves as pasid-cache mangement entry.
> > +         *
> > +         * TODO: in future, if wants to support the SVA-aware DMA
> > +         *       emulation, the vtd_pasid_as should have include
> > +         *       AddressSpace to support DMA emulation.
> > +         */
> > +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> > +        vtd_pasid_as->iommu_state = s;
> > +        vtd_pasid_as->vtd_bus = vtd_bus;
> > +        vtd_pasid_as->devfn = devfn;
> > +        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
> > +        vtd_pasid_as->pasid = pasid;
> > +        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
> > +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> > +    }
> > +    return vtd_pasid_as;
> > +}
> > +
> > + /**
> > +  * This function updates the pasid entry cached in &vtd_pasid_as.
> > +  * Caller of this function should hold iommu_lock.
> > +  */
> > +static inline void vtd_fill_in_pe_cache(
> > +              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
> > +{
> > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > +
> > +    pc_entry->pasid_entry = *pe;
> > +    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > +}
> > +
> >  static int vtd_pasid_cache_psi(IntelIOMMUState *s,
> >                                 uint16_t domain_id, uint32_t pasid)
> >  {
> > +    VTDPASIDCacheInfo pc_info;
> > +    VTDPASIDEntry pe;
> > +    VTDBus *vtd_bus;
> > +    int bus_n, devfn;
> > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > +    VTDIOMMUContext *vtd_icx;
> > +
> > +    /* PASID selective implies a DID selective */
> > +    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> > +    pc_info.domain_id = domain_id;
> > +    pc_info.pasid = pasid;
> > +
> > +    /*
> > +     * Regards to a pasid selective pasid cache invalidation (PSI),
> > +     * it could be either cases of below:
> > +     * a) a present pasid entry moved to non-present
> > +     * b) a present pasid entry to be a present entry
> > +     * c) a non-present pasid entry moved to present
> > +     *
> > +     * Here the handling of a PSI is:
> > +     * 1) loop all the exisitng vtd_pasid_as instances to update them
> > +     *    according to the latest guest pasid entry in pasid table.
> > +     *    this will make sure affected existing vtd_pasid_as instances
> > +     *    cached the latest pasid entries. Also, during the loop, the
> > +     *    host should be notified if needed. e.g. pasid unbind or pasid
> > +     *    update. Should be able to cover case a) and case b).
> > +     *
> > +     * 2) loop all devices to cover case c)
> > +     *    However, it is not good to always loop all devices. In this
> > +     *    implementation. We do it in this ways:
> > +     *    - For devices which have VTDIOMMUContext instances,
> > +     *      we loop them and check if guest pasid entry exists. If yes,
> > +     *      it is case c), we update the pasid cache and also notify
> > +     *      host.
> > +     *    - For devices which have no VTDIOMMUContext
> > +     *      instances, it is not necessary to create pasid cache at
> > +     *      this phase since it could be created when vIOMMU do DMA
> > +     *      address translation. This is not implemented yet since
> > +     *      no PASID-capable emulated devices today. If we have it
> > +     *      in future, the pasid cache shall be created there.
> > +     */
> > +
> > +    vtd_iommu_lock(s);
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                                vtd_flush_pasid, &pc_info);
> > +
> > +    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
> > +        vtd_bus = vtd_icx->vtd_bus;
> > +        devfn = vtd_icx->devfn;
> > +        bus_n = pci_bus_num(vtd_bus->bus);
> > +
> > +        /* Step 1: fetch vtd_pasid_as and check if it is valid */
> > +        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
> > +                                        devfn, pasid, true);
> 
> I feel like you wanted to pass "false" here for "allocate".

Emmm, yeah. It was "false" in the draft code, as step 1 only checks whether
a valid vtd_pasid_as exists, and step 3 then needs to call
vtd_add_find_pasid_as() with "allocate" set to "true". Since
vtd_add_find_pasid_as() searches for the vtd_pasid_as first and only then
allocates a new one, that logic means two vtd_add_find_pasid_as() calls and
therefore two hash table lookups.

So I changed it to "true" to save one vtd_add_find_pasid_as() call. If a
vtd_pasid_as is valid, its pasid_cache_gen equals s->pasid_cache_gen.
If not, the vtd_pasid_as is newly allocated and needs to go through step 2
and step 3 to be filled in. It looks like I forgot to free the vtd_pasid_as
when step 2 fails. I will add that if you are fine with the current logic.
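
Roughly, the flow with the missing cleanup added would look like the sketch
below. It is self-contained and only illustrative: a small fixed array stands
in for the GHashTable, the domain ID check of step 3 is left out, and
find_or_add(), drop(), bind_one() and the fetch_guest_entry hook are
hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 8

struct pasid_as {
    bool in_use;
    uint16_t sid;
    uint32_t pasid;
    uint32_t gen;               /* 0 means allocated but not yet filled */
};

struct iommu {
    uint32_t gen;               /* current generation, never 0 */
    struct pasid_as table[MAX_ENTRIES];
    /* Hypothetical hook: returns 0 if the guest has a present pasid entry. */
    int (*fetch_guest_entry)(uint16_t sid, uint32_t pasid);
};

/* One lookup that also allocates on a miss (the "allocate = true" case). */
static struct pasid_as *find_or_add(struct iommu *s, uint16_t sid,
                                    uint32_t pasid)
{
    struct pasid_as *free_slot = NULL;

    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        struct pasid_as *e = &s->table[i];

        if (e->in_use && e->sid == sid && e->pasid == pasid) {
            return e;
        }
        if (!e->in_use && !free_slot) {
            free_slot = e;
        }
    }
    if (free_slot) {
        *free_slot = (struct pasid_as){ .in_use = true, .sid = sid,
                                        .pasid = pasid, .gen = 0 };
    }
    return free_slot;           /* NULL when the table is full */
}

static void drop(struct pasid_as *e)
{
    e->in_use = false;
}

/* Returns true if a valid cache entry exists for (sid, pasid) afterwards. */
static bool bind_one(struct iommu *s, uint16_t sid, uint32_t pasid)
{
    assert(s->fetch_guest_entry != NULL);

    struct pasid_as *e = find_or_add(s, sid, pasid);
    if (!e) {
        return false;
    }
    if (e->gen == s->gen) {
        return true;            /* step 1: already valid */
    }
    if (s->fetch_guest_entry(sid, pasid) != 0) {
        drop(e);                /* step 2 failed: do not leak the entry */
        return false;
    }
    e->gen = s->gen;            /* step 3: fill in and mark valid */
    return true;
}

/* Two trivial hooks for trying the sketch out. */
static int guest_entry_present(uint16_t sid, uint32_t pasid)
{
    (void)sid; (void)pasid;
    return 0;
}

static int guest_entry_absent(uint16_t sid, uint32_t pasid)
{
    (void)sid; (void)pasid;
    return -1;
}
```

The point of the single find_or_add() call is that the common case costs one
table lookup, and the only extra work is the drop() on the failure path.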


> > +        if (vtd_pasid_as &&
> > +            (s->pasid_cache_gen ==
> > +             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
> > +            /*
> > +             * pasid_cache_gen equals to s->pasid_cache_gen means
> > +             * vtd_pasid_as is valid after the above s->vtd_pasid_as
> > +             * updates. Thus no need for the below steps.
> > +             */
> > +            continue;
> > +        }
> > +
> > +        /*
> > +         * Step 2: vtd_pasid_as is not valid, it's potentailly a
> > +         * new pasid bind. Fetch guest pasid entry.
> > +         */
> > +        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
> > +            continue;
> > +        }
> > +
> > +        /*
> > +         * Step 3: pasid entry exists, update pasid cache
> > +         *
> > +         * Here need to check domain ID since guest pasid entry
> > +         * exists. What needs to do are:
> > +         *   - update the pc_entry in the vtd_pasid_as
> > +         *   - set proper pc_entry.pasid_cache_gen
> > +         *   - pass down the latest guest pasid entry config to host
> > +         *     (will be added in later patch)
> > +         */
> > +        if (domain_id == vtd_pe_get_domain_id(&pe)) {
> > +            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
> > +        }
> > +    }
> > +    vtd_iommu_unlock(s);
> >      return 0;
> >  }
> >
> > +/**
> > + * Caller of this function should hold iommu_lock
> > + */
> > +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> > +{
> > +    VTDPASIDCacheInfo pc_info;
> > +
> > +    trace_vtd_pasid_cache_reset();
> > +
> > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > +
> > +    /*
> > +     * Reset pasid cache is a big hammer, so use
> > +     * g_hash_table_foreach_remove which will free
> > +     * the vtd_pasid_as instances.
> > +     */
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                           vtd_flush_pasid, &pc_info);
> > +    s->pasid_cache_gen = 1;
> > +}
> > +
> >  static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
> >  {
> > +    trace_vtd_pasid_cache_gsi();
> > +
> > +    vtd_iommu_lock(s);
> > +    vtd_pasid_cache_reset(s);
> 
> [1]
> 
> > +
> > +    /*
> > +     * TODO: Global PASID cache invalidation may be
> > +     * flushes all the pasid caches. To be safe, after
> > +     * invalidating the pasid caches, emulator needs
> > +     * to replay the pasid bindings by walking guest
> > +     * pasid dir and pasid table.
> > +     */
> > +    vtd_iommu_unlock(s);
> >      return 0;
> >  }
> >
> > @@ -3659,8 +4019,11 @@ static int
> vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
> >      VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
> >                                                 VTDIOMMUContext,
> >                                                 iommu_context);
> > +    IntelIOMMUState *s = vtd_dev_icx->iommu_state;
> >
> >      vtd_dev_icx->dsi_obj = dsi_obj;
> > +    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
> > +
> >      return 0;
> >  }
> >
> > @@ -3672,6 +4035,7 @@ static void
> vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
> >                                                 iommu_context);
> >
> >      vtd_dev_icx->dsi_obj = NULL;
> > +    QLIST_REMOVE(vtd_dev_icx, next);
> >  }
> >
> >  IOMMUContextOps vtd_iommu_context_ops = {
> > @@ -4130,6 +4494,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >      }
> >
> >      QLIST_INIT(&s->vtd_as_with_notifiers);
> > +    QLIST_INIT(&s->vtd_dev_icx_list);
> >      qemu_mutex_init(&s->iommu_lock);
> >      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
> >      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> > @@ -4155,6 +4520,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >                                       g_free, g_free);
> >      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash,
> vtd_uint64_equal,
> >                                                g_free, g_free);
> > +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> > +                                   vtd_pasid_as_key_equal, g_free, g_free);
> >      vtd_init(s);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> Q35_HOST_BRIDGE_IOMMU_ADDR);
> >      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index 6c03560..18a9e50 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -311,6 +311,7 @@ typedef enum VTDFaultReason {
> >      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
> >
> >      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> > +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0
> */
> >
> >      /* This is not a normal fault reason. We use this to indicate some faults
> >       * that are not referenced by the VT-d specification.
> > @@ -485,6 +486,19 @@ struct VTDRootEntry {
> >  };
> >  typedef struct VTDRootEntry VTDRootEntry;
> >
> > +struct VTDPASIDCacheInfo {
> > +#define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
> > +#define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
> > +#define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
> > +    uint32_t flags;
> > +    uint16_t domain_id;
> > +    uint32_t pasid;
> > +};
> > +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
> > +                                      VTD_PASID_CACHE_DOMSI  | \
> > +                                      VTD_PASID_CACHE_PASIDSI)
> > +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> > +
> >  /* Masks for struct VTDRootEntry */
> >  #define VTD_ROOT_ENTRY_P            1ULL
> >  #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
> > diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> > index f7cd4e5..87364a3 100644
> > --- a/hw/i386/trace-events
> > +++ b/hw/i386/trace-events
> > @@ -22,6 +22,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
> >  vtd_inv_qi_tail(uint16_t head) "write tail %d"
> >  vtd_inv_qi_fetch(void) ""
> >  vtd_context_cache_reset(void) ""
> > +vtd_pasid_cache_reset(void) ""
> >  vtd_pasid_cache_gsi(void) ""
> >  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation
> domain 0x%"PRIx16
> >  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC
> invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 4158116..3cc4b74 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> >  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> >  typedef struct VTDPASIDEntry VTDPASIDEntry;
> >  typedef struct VTDIOMMUContext VTDIOMMUContext;
> > +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> > +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
> >
> >  /* Context-Entry */
> >  struct VTDContextEntry {
> > @@ -101,6 +103,31 @@ struct VTDPASIDEntry {
> >      uint64_t val[8];
> >  };
> >
> > +struct pasid_key {
> > +    uint32_t pasid;
> > +    uint16_t sid;
> > +};
> > +
> > +struct VTDPASIDCacheEntry {
> > +    /*
> > +     * The cache entry is obsolete if
> > +     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
> > +     */
> > +    uint32_t pasid_cache_gen;
> > +    struct VTDPASIDEntry pasid_entry;
> > +};
> > +
> > +struct VTDPASIDAddressSpace {
> > +    VTDBus *vtd_bus;
> > +    uint8_t devfn;
> > +    AddressSpace as;
> > +    uint32_t pasid;
> > +    IntelIOMMUState *iommu_state;
> > +    VTDContextCacheEntry context_cache_entry;
> > +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > +    VTDPASIDCacheEntry pasid_cache_entry;
> 
> In vtd_pasid_cache_gsi() [1], you directly reset pasid_cache_gen for
> each pasid address space.  You never increase
> pasid_cache_entry.pasid_cache_gen.  Then IIUC the gen will always be
> either 0 or 1.  And...
> 
> > +};
> > +
> >  struct VTDAddressSpace {
> >      PCIBus *bus;
> >      uint8_t devfn;
> > @@ -122,6 +149,7 @@ struct VTDIOMMUContext {
> >      uint8_t devfn;
> >      IOMMUContext iommu_context;
> >      DualStageIOMMUObject *dsi_obj;
> > +    QLIST_ENTRY(VTDIOMMUContext) next;
> >      IntelIOMMUState *iommu_state;
> >  };
> >
> > @@ -272,9 +300,14 @@ struct IntelIOMMUState {
> >
> >      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus*
> reference */
> >      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects
> indexed by bus number */
> > +    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
> > +    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */
> 
> ... This should always be 1.
> IIUC you can drop both of the pasid_cache_gen, because in this whole
> patchset you'll remove the pasid hash entry when it is invalidated,
> right?  Then if the hash entry is there, it must be valid.  When it's
> out-dated, it'll be removed from the hash.

Oh, yes it is. However, that is not my intention. I'd like to let [1]
increase s->pasid_cache_gen instead of just resetting it. I think that
will save some time, as looping over the hash table takes time. Thanks for
catching it. :-)
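
What I have in mind for [1] is the usual generation counter approach, roughly
as in the standalone sketch below. The names (struct cache, entry_valid(),
cache_invalidate_all(), GEN_MAX) are made up for illustration, and the
wrap-around handling is my assumption of what a real version would need, not
something that is in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NENTRIES 4
#define GEN_MAX  UINT32_MAX

struct entry {
    uint32_t gen;               /* 0 is never a valid generation */
};

struct cache {
    uint32_t gen;               /* kept in [1, GEN_MAX) */
    struct entry entries[NENTRIES];
};

static void cache_init(struct cache *c)
{
    c->gen = 1;
    for (size_t i = 0; i < NENTRIES; i++) {
        c->entries[i].gen = 0;
    }
}

/* An entry is valid only if it was filled in the current generation. */
static bool entry_valid(const struct cache *c, const struct entry *e)
{
    return e->gen == c->gen;
}

static void entry_fill(const struct cache *c, struct entry *e)
{
    e->gen = c->gen;
}

/* Bumping the generation makes every cached entry stale without a walk. */
static void cache_invalidate_all(struct cache *c)
{
    assert(c->gen >= 1);

    c->gen++;
    if (c->gen == GEN_MAX) {
        /* Rare wrap-around: walk the entries once and restart at 1. */
        for (size_t i = 0; i < NENTRIES; i++) {
            c->entries[i].gen = 0;
        }
        c->gen = 1;
    }
}
```

With this, a global invalidation is a single increment in the common case,
and the per-entry pasid_cache_gen is what tells a stale entry from a valid
one.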

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
@ 2020-02-12  8:37       ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-12  8:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tian, Kevin, Jacob Pan, Yi Sun, Eduardo Habkost, kvm, mst, Tian,
	Jun J, qemu-devel, eric.auger, alex.williamson, pbonzini, Wu,
	Hao, Sun, Yi Y, Richard Henderson, david

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 7:36 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> infrastructure
> 
> On Wed, Jan 29, 2020 at 04:16:47AM -0800, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch adds a PASID cache management infrastructure based on
> > new added structure VTDPASIDAddressSpace, which is used to track
> > the PASID usage and future PASID tagged DMA address translation
> > support in vIOMMU.
> >
> >     struct VTDPASIDAddressSpace {
> >         VTDBus *vtd_bus;
> >         uint8_t devfn;
> >         AddressSpace as;
> >         uint32_t pasid;
> >         IntelIOMMUState *iommu_state;
> >         VTDContextCacheEntry context_cache_entry;
> >         QLIST_ENTRY(VTDPASIDAddressSpace) next;
> >         VTDPASIDCacheEntry pasid_cache_entry;
> >     };
> >
> > Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> > is bound with a DMA AddressSpace. Intel VT-d spec requires guest
> > software to issue pasid cache invalidation when bind or unbind a
> > pasid with an address space under caching-mode. However, as
> > VTDPASIDAddressSpace instances also act as pasid cache in this
> > implementation, its creation also happens during vIOMMU PASID
> > tagged DMA translation. The creation in this path will not be
> > added in this patch since no PASID-capable emulated devices for
> > now.
> >
> > The implementation in this patch manages VTDPASIDAddressSpace
> > instances per PASID+BDF (lookup and insert will use PASID and
> > BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
> > guest bind a PASID with an AddressSpace, QEMU will capture the
> > guest pasid selective pasid cache invalidation, and allocate
> > remove a VTDPASIDAddressSpace instance per the invalidation
> > reasons:
> >
> >     *) a present pasid entry moved to non-present
> >     *) a present pasid entry to be a present entry
> >     *) a non-present pasid entry moved to present
> >
> > vIOMMU emulator could figure out the reason by fetching latest
> > guest pasid entry.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  hw/i386/intel_iommu.c          | 367
> +++++++++++++++++++++++++++++++++++++++++
> >  hw/i386/intel_iommu_internal.h |  14 ++
> >  hw/i386/trace-events           |   1 +
> >  include/hw/i386/intel_iommu.h  |  36 +++-
> >  4 files changed, 417 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 58e7213..c75cb7b 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -40,6 +40,7 @@
> >  #include "kvm_i386.h"
> >  #include "migration/vmstate.h"
> >  #include "trace.h"
> > +#include "qemu/jhash.h"
> >
> >  /* context entry operations */
> >  #define VTD_CE_GET_RID2PASID(ce) \
> > @@ -65,6 +66,8 @@
> >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> *n);
> >
> > +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > +
> >  static void vtd_panic_require_caching_mode(void)
> >  {
> >      error_report("We need to set caching-mode=on for intel-iommu to enable "
> > @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
> >      vtd_iommu_lock(s);
> >      vtd_reset_iotlb_locked(s);
> >      vtd_reset_context_cache_locked(s);
> > +    vtd_pasid_cache_reset(s);
> >      vtd_iommu_unlock(s);
> >  }
> >
> > @@ -686,6 +690,11 @@ static inline bool vtd_pe_type_check(X86IOMMUState
> *x86_iommu,
> >      return true;
> >  }
> >
> > +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
> > +{
> > +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> > +}
> > +
> >  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
> >  {
> >      return pdire->val & 1;
> > @@ -2393,19 +2402,370 @@ static bool
> vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
> >      return true;
> >  }
> >
> > +static inline void vtd_init_pasid_key(uint32_t pasid,
> > +                                     uint16_t sid,
> > +                                     struct pasid_key *key)
> > +{
> > +    key->pasid = pasid;
> > +    key->sid = sid;
> > +}
> > +
> > +static guint vtd_pasid_as_key_hash(gconstpointer v)
> > +{
> > +    struct pasid_key *key = (struct pasid_key *)v;
> > +    uint32_t a, b, c;
> > +
> > +    /* Jenkins hash */
> > +    a = b = c = JHASH_INITVAL + sizeof(*key);
> > +    a += key->sid;
> > +    b += extract32(key->pasid, 0, 16);
> > +    c += extract32(key->pasid, 16, 16);
> > +
> > +    __jhash_mix(a, b, c);
> > +    __jhash_final(a, b, c);
> > +
> > +    return c;
> > +}
> > +
> > +static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
> > +{
> > +    const struct pasid_key *k1 = v1;
> > +    const struct pasid_key *k2 = v2;
> > +
> > +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
> > +}
> > +
> > +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> > +                                            uint8_t bus_num,
> > +                                            uint8_t devfn,
> > +                                            uint32_t pasid,
> > +                                            VTDPASIDEntry *pe)
> > +{
> > +    VTDContextEntry ce;
> > +    int ret;
> > +    dma_addr_t pasid_dir_base;
> > +
> > +    if (!s->root_scalable) {
> > +        return -VTD_FR_PASID_TABLE_INV;
> > +    }
> > +
> > +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> > +    ret = vtd_get_pe_from_pasid_table(s,
> > +                                  pasid_dir_base, pasid, pe);
> > +
> > +    return ret;
> > +}
> > +
> > +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
> *p2)
> > +{
> > +    return !memcmp(p1, p2, sizeof(*p1));
> > +}
> > +
> > +/**
> > + * This function is used to clear pasid_cache_gen of cached pasid
> > + * entry in vtd_pasid_as instances. Caller of this function should
> > + * hold iommu_lock.
> > + */
> > +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> > +                                gpointer user_data)
> > +{
> > +    VTDPASIDCacheInfo *pc_info = user_data;
> > +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> > +    VTDPASIDEntry pe;
> > +    uint16_t did;
> > +    uint32_t pasid;
> > +    uint16_t devfn;
> > +    int ret;
> > +
> > +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> > +    pasid = vtd_pasid_as->pasid;
> > +    devfn = vtd_pasid_as->devfn;
> > +
> > +    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
> > +        return false;
> > +    }
> > +
> > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > +    case VTD_PASID_CACHE_PASIDSI:
> > +        if (pc_info->pasid != pasid) {
> > +            return false;
> > +        }
> > +        /* Fall through */
> 
> Why fall through?

VTD_PASID_CACHE_PASIDSI implies domain-selective, so it needs to
check the did just as VTD_PASID_CACHE_DOMSI does.

> 
> > +    case VTD_PASID_CACHE_DOMSI:
> > +        if (pc_info->domain_id != did) {
> > +            return false;
> > +        }
> > +        /* Fall through */
> 
> Same here.

If the code gets here, the necessary checks have passed. Strictly, there
should be a break here. However, since the case below does nothing except
break, I let the code fall through.

> 
> > +    case VTD_PASID_CACHE_GLOBAL:
> > +        break;
> > +    default:
> 
> Never reach here right?  If so we can abort.

Yes, it should never reach here.
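
As an illustration only, the matching logic discussed above (two deliberate
fall-throughs and a default that should be unreachable) can be sketched in
isolation. This is not the patch code: the names inv_type, inv_info and
inv_matches are hypothetical stand-ins for the VTD_PASID_CACHE_* flags,
VTDPASIDCacheInfo and the switch in vtd_flush_pasid(), and the default case
asserts, along the lines of the suggestion to abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the VTD_PASID_CACHE_* flags in the patch. */
enum inv_type {
    INV_GLOBAL  = 1 << 0,
    INV_DOMSI   = 1 << 1,
    INV_PASIDSI = 1 << 2,
};

/* Hypothetical stand-in for VTDPASIDCacheInfo. */
struct inv_info {
    uint32_t type;
    uint16_t domain_id;
    uint32_t pasid;
};

/* Return true if a cached entry (did, pasid) is covered by the invalidation. */
static bool inv_matches(const struct inv_info *info, uint16_t did,
                        uint32_t pasid)
{
    switch (info->type) {
    case INV_PASIDSI:
        if (info->pasid != pasid) {
            return false;
        }
        /* PASID-selective implies domain-selective: check the domain too. */
        /* fall through */
    case INV_DOMSI:
        if (info->domain_id != did) {
            return false;
        }
        /* All required checks passed; the global case just accepts. */
        /* fall through */
    case INV_GLOBAL:
        return true;
    default:
        /* Unknown type: treated as a programming error in this sketch. */
        assert(0);
        return false;
    }
}
```

With this shape, a PASID-selective request is checked against both the pasid
and the domain ID, a domain-selective request against the domain ID only, and
a global request matches every entry.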

> > +        return false;
> > +    }
> > +
> > +    /*
> > +     * pasid cache invalidation may indicate a present pasid
> > +     * entry to present pasid entry modification. To cover such
> > +     * case, vIOMMU emulator needs to fetch latest guest pasid
> > +     * entry and check cached pasid entry, then update pasid
> > +     * cache and send pasid bind/unbind to host properly.
> > +     */
> > +    ret = vtd_dev_get_pe_from_pasid(s,
> > +                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
> > +    if (ret) {
> > +        /*
> > +         * No valid pasid entry in guest memory. e.g. pasid entry
> > +         * was modified to be either all-zero or non-present. Either
> > +         * case means existing pasid cache should be removed.
> > +         */
> > +        goto remove;
> > +    }
> > +    /* Compare cached pasid entry and latest pasid entry */
> > +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> > +        /* pasid entry was updated, thus update the pasid cache */
> > +        pc_entry->pasid_entry = pe;
> > +        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > +        /*
> > +         * TODO:
> > +         * - send pasid bind to host for passthru devices
> > +         * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > +         *   should invalidate QEMU piotlb together with this change.
> > +         */
> > +    }
> > +    return false;
> > +remove:
> > +    /*
> > +     * TODO:
> > +     * - send pasid unbind to host for passthru devices
> > +     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > +     *   should invalidate QEMU piotlb together with this change.
> > +     */
> > +    return true;
> > +}
> > +
> >  static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
> >  {
> > +    VTDPASIDCacheInfo pc_info;
> > +
> > +    trace_vtd_pasid_cache_dsi(domain_id);
> > +
> > +    pc_info.flags = VTD_PASID_CACHE_DOMSI;
> > +    pc_info.domain_id = domain_id;
> > +
> > +    /*
> > +     * Loop all existing pasid caches and update them.
> > +     */
> > +    vtd_iommu_lock(s);
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                                 vtd_flush_pasid, &pc_info);
> > +
> > +    /*
> > +     * TODO: Domain selective PASID cache invalidation
> > +     * flushes all the pasid caches within a domain. To
> > +     * be safe, after invalidating the pasid caches, emulator
> > +     * needs to replay the pasid bindings by walking guest
> > +     * pasid dir and pasid table.
> 
> Better spell out on what special case we're handling here: When the
> guest setup a new PASID entry then send a PASID DSI.

Oh, yes. I will add it in the new version. :-)

> 
> > +     */
> > +    vtd_iommu_unlock(s);
> >      return 0;
> >  }
> >
> > +/**
> > + * This function finds or adds a VTDPASIDAddressSpace for a device
> > + * when it is bound to a pasid. Caller of this function should hold
> > + * iommu_lock.
> > + */
> > +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> > +                                                   VTDBus *vtd_bus,
> > +                                                   int devfn,
> > +                                                   uint32_t pasid,
> > +                                                   bool allocate)
> > +{
> > +    struct pasid_key key;
> > +    struct pasid_key *new_key;
> > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > +    uint16_t sid;
> > +
> > +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> > +    vtd_init_pasid_key(pasid, sid, &key);
> > +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> > +
> > +    if (!vtd_pasid_as && allocate) {
> > +        new_key = g_malloc0(sizeof(*new_key));
> > +        vtd_init_pasid_key(pasid, sid, new_key);
> > +        /*
> > +         * Initiate the vtd_pasid_as structure.
> > +         *
> > +         * This structure here is used to track the guest pasid
> > +         * binding and also serves as pasid-cache management entry.
> > +         *
> > +         * TODO: in future, if wants to support the SVA-aware DMA
> > +         *       emulation, the vtd_pasid_as should have include
> > +         *       AddressSpace to support DMA emulation.
> > +         */
> > +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> > +        vtd_pasid_as->iommu_state = s;
> > +        vtd_pasid_as->vtd_bus = vtd_bus;
> > +        vtd_pasid_as->devfn = devfn;
> > +        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
> > +        vtd_pasid_as->pasid = pasid;
> > +        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
> > +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> > +    }
> > +    return vtd_pasid_as;
> > +}
> > +
> > + /**
> > +  * This function updates the pasid entry cached in &vtd_pasid_as.
> > +  * Caller of this function should hold iommu_lock.
> > +  */
> > +static inline void vtd_fill_in_pe_cache(
> > +              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
> > +{
> > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > +
> > +    pc_entry->pasid_entry = *pe;
> > +    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > +}
> > +
> >  static int vtd_pasid_cache_psi(IntelIOMMUState *s,
> >                                 uint16_t domain_id, uint32_t pasid)
> >  {
> > +    VTDPASIDCacheInfo pc_info;
> > +    VTDPASIDEntry pe;
> > +    VTDBus *vtd_bus;
> > +    int bus_n, devfn;
> > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > +    VTDIOMMUContext *vtd_icx;
> > +
> > +    /* PASID selective implies a DID selective */
> > +    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> > +    pc_info.domain_id = domain_id;
> > +    pc_info.pasid = pasid;
> > +
> > +    /*
> > +     * Regards to a pasid selective pasid cache invalidation (PSI),
> > +     * it could be either cases of below:
> > +     * a) a present pasid entry moved to non-present
> > +     * b) a present pasid entry to be a present entry
> > +     * c) a non-present pasid entry moved to present
> > +     *
> > +     * Here the handling of a PSI is:
> > +     * 1) loop all the existing vtd_pasid_as instances to update them
> > +     *    according to the latest guest pasid entry in pasid table.
> > +     *    this will make sure affected existing vtd_pasid_as instances
> > +     *    cached the latest pasid entries. Also, during the loop, the
> > +     *    host should be notified if needed. e.g. pasid unbind or pasid
> > +     *    update. Should be able to cover case a) and case b).
> > +     *
> > +     * 2) loop all devices to cover case c)
> > +     *    However, it is not good to always loop all devices. In this
> > +     *    implementation. We do it in this ways:
> > +     *    - For devices which have VTDIOMMUContext instances,
> > +     *      we loop them and check if guest pasid entry exists. If yes,
> > +     *      it is case c), we update the pasid cache and also notify
> > +     *      host.
> > +     *    - For devices which have no VTDIOMMUContext
> > +     *      instances, it is not necessary to create pasid cache at
> > +     *      this phase since it could be created when vIOMMU do DMA
> > +     *      address translation. This is not implemented yet since
> > +     *      no PASID-capable emulated devices today. If we have it
> > +     *      in future, the pasid cache shall be created there.
> > +     */
> > +
> > +    vtd_iommu_lock(s);
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                                vtd_flush_pasid, &pc_info);
> > +
> > +    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
> > +        vtd_bus = vtd_icx->vtd_bus;
> > +        devfn = vtd_icx->devfn;
> > +        bus_n = pci_bus_num(vtd_bus->bus);
> > +
> > +        /* Step 1: fetch vtd_pasid_as and check if it is valid */
> > +        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
> > +                                        devfn, pasid, true);
> 
> I feel like you wanted to pass "false" here for "allocate".

Emmm, yeah. It was "false" in the draft code, as step 1 only checks whether
a valid vtd_pasid_as exists, and step 3 then needs to call
vtd_add_find_pasid_as() with "allocate" set to "true". Since
vtd_add_find_pasid_as() searches for the vtd_pasid_as first and only then
allocates a new one, that logic means two vtd_add_find_pasid_as() calls and
therefore two hash table lookups.

So I modified it to be "true" to save one vtd_add_find_pasid_as() call. If
a vtd_pasid_as is valid, its pasid_cache_gen is equal to s->pasid_cache_gen.
If not, the vtd_pasid_as is newly allocated and needs to go through step 2
and step 3 to be filled in. It looks like I forgot to free the vtd_pasid_as
when step 2 fails. I will add that if you are fine with the current logic.
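
To make the single-lookup flow and the missing cleanup concrete, here is a
minimal sketch under stated assumptions. It is not the QEMU code: a small
fixed-size array stands in for the GHashTable, and struct slot, struct cache,
find_or_add(), bind_if_present() and used_slots() are hypothetical names.
The guest_entry_present flag stands in for the result of fetching the guest
pasid entry in step 2, and the domain ID check of step 3 is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 8

/* Hypothetical, simplified stand-in for VTDPASIDAddressSpace. */
struct slot {
    bool used;
    uint16_t sid;
    uint32_t pasid;
    uint32_t gen;               /* 0 means "allocated but never filled in" */
};

/* Hypothetical stand-in for the hash table plus s->pasid_cache_gen. */
struct cache {
    struct slot slots[MAX_SLOTS];
    uint32_t gen;               /* current generation, never 0 */
};

/* Find the slot for (sid, pasid); on a miss, optionally claim a free one. */
static struct slot *find_or_add(struct cache *c, uint16_t sid,
                                uint32_t pasid, bool allocate)
{
    struct slot *free_slot = NULL;

    assert(c != NULL);
    for (size_t i = 0; i < MAX_SLOTS; i++) {
        struct slot *s = &c->slots[i];

        if (s->used && s->sid == sid && s->pasid == pasid) {
            return s;
        }
        if (!s->used && free_slot == NULL) {
            free_slot = s;
        }
    }
    if (!allocate || free_slot == NULL) {
        return NULL;
    }
    *free_slot = (struct slot){ .used = true, .sid = sid, .pasid = pasid };
    return free_slot;
}

/*
 * One lookup per device. guest_entry_present stands in for the outcome of
 * fetching the guest pasid entry (step 2). Returns true if the slot is
 * valid on return.
 */
static bool bind_if_present(struct cache *c, uint16_t sid, uint32_t pasid,
                            bool guest_entry_present)
{
    struct slot *s = find_or_add(c, sid, pasid, true);

    if (s == NULL) {
        return false;           /* table full in this toy model */
    }
    if (s->gen == c->gen) {
        return true;            /* step 1: already valid, nothing to do */
    }
    if (!guest_entry_present) {
        s->used = false;        /* step 2 failed: release the slot again */
        return false;
    }
    s->gen = c->gen;            /* step 3: fill in and mark valid */
    return true;
}

/* Count occupied slots, to observe that a failed bind leaves nothing behind. */
static size_t used_slots(const struct cache *c)
{
    size_t n = 0;

    for (size_t i = 0; i < MAX_SLOTS; i++) {
        n += c->slots[i].used ? 1 : 0;
    }
    return n;
}
```

The point of the sketch is the rollback branch: with "allocate" always true,
a failed step 2 must release the entry, otherwise each PSI for a pasid with
no guest entry would leave an empty instance in the table.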


> > +        if (vtd_pasid_as &&
> > +            (s->pasid_cache_gen ==
> > +             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
> > +            /*
> > +             * pasid_cache_gen equals to s->pasid_cache_gen means
> > +             * vtd_pasid_as is valid after the above s->vtd_pasid_as
> > +             * updates. Thus no need for the below steps.
> > +             */
> > +            continue;
> > +        }
> > +
> > +        /*
> > +         * Step 2: vtd_pasid_as is not valid, it's potentially a
> > +         * new pasid bind. Fetch guest pasid entry.
> > +         */
> > +        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
> > +            continue;
> > +        }
> > +
> > +        /*
> > +         * Step 3: pasid entry exists, update pasid cache
> > +         *
> > +         * Here need to check domain ID since guest pasid entry
> > +         * exists. What needs to do are:
> > +         *   - update the pc_entry in the vtd_pasid_as
> > +         *   - set proper pc_entry.pasid_cache_gen
> > +         *   - pass down the latest guest pasid entry config to host
> > +         *     (will be added in later patch)
> > +         */
> > +        if (domain_id == vtd_pe_get_domain_id(&pe)) {
> > +            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
> > +        }
> > +    }
> > +    vtd_iommu_unlock(s);
> >      return 0;
> >  }
> >
> > +/**
> > + * Caller of this function should hold iommu_lock
> > + */
> > +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> > +{
> > +    VTDPASIDCacheInfo pc_info;
> > +
> > +    trace_vtd_pasid_cache_reset();
> > +
> > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > +
> > +    /*
> > +     * Reset pasid cache is a big hammer, so use
> > +     * g_hash_table_foreach_remove which will free
> > +     * the vtd_pasid_as instances.
> > +     */
> > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > +                           vtd_flush_pasid, &pc_info);
> > +    s->pasid_cache_gen = 1;
> > +}
> > +
> >  static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
> >  {
> > +    trace_vtd_pasid_cache_gsi();
> > +
> > +    vtd_iommu_lock(s);
> > +    vtd_pasid_cache_reset(s);
> 
> [1]
> 
> > +
> > +    /*
> > +     * TODO: Global PASID cache invalidation
> > +     * flushes all the pasid caches. To be safe, after
> > +     * invalidating the pasid caches, emulator needs
> > +     * to replay the pasid bindings by walking guest
> > +     * pasid dir and pasid table.
> > +     */
> > +    vtd_iommu_unlock(s);
> >      return 0;
> >  }
> >
> > @@ -3659,8 +4019,11 @@ static int
> vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
> >      VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
> >                                                 VTDIOMMUContext,
> >                                                 iommu_context);
> > +    IntelIOMMUState *s = vtd_dev_icx->iommu_state;
> >
> >      vtd_dev_icx->dsi_obj = dsi_obj;
> > +    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
> > +
> >      return 0;
> >  }
> >
> > @@ -3672,6 +4035,7 @@ static void
> vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
> >                                                 iommu_context);
> >
> >      vtd_dev_icx->dsi_obj = NULL;
> > +    QLIST_REMOVE(vtd_dev_icx, next);
> >  }
> >
> >  IOMMUContextOps vtd_iommu_context_ops = {
> > @@ -4130,6 +4494,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >      }
> >
> >      QLIST_INIT(&s->vtd_as_with_notifiers);
> > +    QLIST_INIT(&s->vtd_dev_icx_list);
> >      qemu_mutex_init(&s->iommu_lock);
> >      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
> >      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> > @@ -4155,6 +4520,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> >                                       g_free, g_free);
> >      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash,
> vtd_uint64_equal,
> >                                                g_free, g_free);
> > +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> > +                                   vtd_pasid_as_key_equal, g_free, g_free);
> >      vtd_init(s);
> >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> Q35_HOST_BRIDGE_IOMMU_ADDR);
> >      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index 6c03560..18a9e50 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -311,6 +311,7 @@ typedef enum VTDFaultReason {
> >      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
> >
> >      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> > +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0
> */
> >
> >      /* This is not a normal fault reason. We use this to indicate some faults
> >       * that are not referenced by the VT-d specification.
> > @@ -485,6 +486,19 @@ struct VTDRootEntry {
> >  };
> >  typedef struct VTDRootEntry VTDRootEntry;
> >
> > +struct VTDPASIDCacheInfo {
> > +#define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
> > +#define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
> > +#define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
> > +    uint32_t flags;
> > +    uint16_t domain_id;
> > +    uint32_t pasid;
> > +};
> > +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
> > +                                      VTD_PASID_CACHE_DOMSI  | \
> > +                                      VTD_PASID_CACHE_PASIDSI)
> > +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> > +
> >  /* Masks for struct VTDRootEntry */
> >  #define VTD_ROOT_ENTRY_P            1ULL
> >  #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
> > diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> > index f7cd4e5..87364a3 100644
> > --- a/hw/i386/trace-events
> > +++ b/hw/i386/trace-events
> > @@ -22,6 +22,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
> >  vtd_inv_qi_tail(uint16_t head) "write tail %d"
> >  vtd_inv_qi_fetch(void) ""
> >  vtd_context_cache_reset(void) ""
> > +vtd_pasid_cache_reset(void) ""
> >  vtd_pasid_cache_gsi(void) ""
> >  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation
> domain 0x%"PRIx16
> >  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC
> invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > index 4158116..3cc4b74 100644
> > --- a/include/hw/i386/intel_iommu.h
> > +++ b/include/hw/i386/intel_iommu.h
> > @@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> >  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> >  typedef struct VTDPASIDEntry VTDPASIDEntry;
> >  typedef struct VTDIOMMUContext VTDIOMMUContext;
> > +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> > +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
> >
> >  /* Context-Entry */
> >  struct VTDContextEntry {
> > @@ -101,6 +103,31 @@ struct VTDPASIDEntry {
> >      uint64_t val[8];
> >  };
> >
> > +struct pasid_key {
> > +    uint32_t pasid;
> > +    uint16_t sid;
> > +};
> > +
> > +struct VTDPASIDCacheEntry {
> > +    /*
> > +     * The cache entry is obsolete if
> > +     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
> > +     */
> > +    uint32_t pasid_cache_gen;
> > +    struct VTDPASIDEntry pasid_entry;
> > +};
> > +
> > +struct VTDPASIDAddressSpace {
> > +    VTDBus *vtd_bus;
> > +    uint8_t devfn;
> > +    AddressSpace as;
> > +    uint32_t pasid;
> > +    IntelIOMMUState *iommu_state;
> > +    VTDContextCacheEntry context_cache_entry;
> > +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > +    VTDPASIDCacheEntry pasid_cache_entry;
> 
> In vtd_pasid_cache_gsi() [1], you directly reset pasid_cache_gen for
> each pasid address space.  You never increase
> pasid_cache_entry.pasid_cache_gen.  Then IIUC the gen will always be
> either 0 or 1.  And...
> 
> > +};
> > +
> >  struct VTDAddressSpace {
> >      PCIBus *bus;
> >      uint8_t devfn;
> > @@ -122,6 +149,7 @@ struct VTDIOMMUContext {
> >      uint8_t devfn;
> >      IOMMUContext iommu_context;
> >      DualStageIOMMUObject *dsi_obj;
> > +    QLIST_ENTRY(VTDIOMMUContext) next;
> >      IntelIOMMUState *iommu_state;
> >  };
> >
> > @@ -272,9 +300,14 @@ struct IntelIOMMUState {
> >
> >      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus*
> reference */
> >      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects
> indexed by bus number */
> > +    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
> > +    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */
> 
> ... This should always be 1.
> IIUC you can drop both of the pasid_cache_gen, because in this whole
> patchset you'll remove the pasid hash entry when it is invalidated,
> right?  Then if the hash entry is there, it must be valid.  When it's
> out-dated, it'll be removed from the hash.

Oh, yes it is. However, that was not my intention. I'd like [1] to
increase s->pasid_cache_gen instead of just resetting it. I think that
will save some time, since looping over the hash table takes time. Thanks
for catching it. :-)
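
A minimal sketch of that generation-counter idea, under assumptions: bumping
a global generation makes every cached entry stale in one step, and a full
walk is only needed when the counter wraps. The names struct state,
struct entry, entry_valid(), entry_fill() and invalidate_all() are
hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-entry pasid_cache_gen. */
struct entry {
    uint32_t gen;               /* 0 means "never valid" */
};

/* Hypothetical stand-in for s->pasid_cache_gen, kept in [1, UINT32_MAX]. */
struct state {
    uint32_t gen;
};

/* An entry is valid only if it was filled in during the current generation. */
static bool entry_valid(const struct state *s, const struct entry *e)
{
    return e->gen == s->gen;
}

/* Mark an entry as filled in during the current generation. */
static void entry_fill(const struct state *s, struct entry *e)
{
    e->gen = s->gen;
}

/*
 * Invalidate every entry without touching any of them. Returns true when
 * the counter wrapped, in which case the caller has to walk the table and
 * clear each entry's gen so that old values cannot match again.
 */
static bool invalidate_all(struct state *s)
{
    assert(s->gen >= 1);
    if (s->gen == UINT32_MAX) {
        s->gen = 1;
        return true;
    }
    s->gen++;
    return false;
}
```

In this model a global invalidation costs one increment, and stale entries
are detected lazily by entry_valid() the next time they are looked up.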

Regards,
Yi Liu


* Re: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
  2020-02-12  8:37       ` Liu, Yi L
@ 2020-02-12 15:26         ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-12 15:26 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

On Wed, Feb 12, 2020 at 08:37:30AM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Wednesday, February 12, 2020 7:36 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> > infrastructure
> > 
> > On Wed, Jan 29, 2020 at 04:16:47AM -0800, Liu, Yi L wrote:
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > This patch adds a PASID cache management infrastructure based on
> > > new added structure VTDPASIDAddressSpace, which is used to track
> > > the PASID usage and future PASID tagged DMA address translation
> > > support in vIOMMU.
> > >
> > >     struct VTDPASIDAddressSpace {
> > >         VTDBus *vtd_bus;
> > >         uint8_t devfn;
> > >         AddressSpace as;
> > >         uint32_t pasid;
> > >         IntelIOMMUState *iommu_state;
> > >         VTDContextCacheEntry context_cache_entry;
> > >         QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > >         VTDPASIDCacheEntry pasid_cache_entry;
> > >     };
> > >
> > > Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> > > is bound with a DMA AddressSpace. Intel VT-d spec requires guest
> > > software to issue pasid cache invalidation when bind or unbind a
> > > pasid with an address space under caching-mode. However, as
> > > VTDPASIDAddressSpace instances also act as pasid cache in this
> > > implementation, its creation also happens during vIOMMU PASID
> > > tagged DMA translation. The creation in this path will not be
> > > added in this patch since no PASID-capable emulated devices for
> > > now.
> > >
> > > The implementation in this patch manages VTDPASIDAddressSpace
> > > instances per PASID+BDF (lookup and insert will use PASID and
> > > BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
> > > guest bind a PASID with an AddressSpace, QEMU will capture the
> > > guest pasid selective pasid cache invalidation, and allocate
> > > remove a VTDPASIDAddressSpace instance per the invalidation
> > > reasons:
> > >
> > >     *) a present pasid entry moved to non-present
> > >     *) a present pasid entry to be a present entry
> > >     *) a non-present pasid entry moved to present
> > >
> > > vIOMMU emulator could figure out the reason by fetching latest
> > > guest pasid entry.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Richard Henderson <rth@twiddle.net>
> > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 367
> > +++++++++++++++++++++++++++++++++++++++++
> > >  hw/i386/intel_iommu_internal.h |  14 ++
> > >  hw/i386/trace-events           |   1 +
> > >  include/hw/i386/intel_iommu.h  |  36 +++-
> > >  4 files changed, 417 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 58e7213..c75cb7b 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -40,6 +40,7 @@
> > >  #include "kvm_i386.h"
> > >  #include "migration/vmstate.h"
> > >  #include "trace.h"
> > > +#include "qemu/jhash.h"
> > >
> > >  /* context entry operations */
> > >  #define VTD_CE_GET_RID2PASID(ce) \
> > > @@ -65,6 +66,8 @@
> > >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> > >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> > *n);
> > >
> > > +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > > +
> > >  static void vtd_panic_require_caching_mode(void)
> > >  {
> > >      error_report("We need to set caching-mode=on for intel-iommu to enable "
> > > @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
> > >      vtd_iommu_lock(s);
> > >      vtd_reset_iotlb_locked(s);
> > >      vtd_reset_context_cache_locked(s);
> > > +    vtd_pasid_cache_reset(s);
> > >      vtd_iommu_unlock(s);
> > >  }
> > >
> > > @@ -686,6 +690,11 @@ static inline bool vtd_pe_type_check(X86IOMMUState
> > *x86_iommu,
> > >      return true;
> > >  }
> > >
> > > +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
> > > +{
> > > +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> > > +}
> > > +
> > >  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
> > >  {
> > >      return pdire->val & 1;
> > > @@ -2393,19 +2402,370 @@ static bool
> > vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
> > >      return true;
> > >  }
> > >
> > > +static inline void vtd_init_pasid_key(uint32_t pasid,
> > > +                                     uint16_t sid,
> > > +                                     struct pasid_key *key)
> > > +{
> > > +    key->pasid = pasid;
> > > +    key->sid = sid;
> > > +}
> > > +
> > > +static guint vtd_pasid_as_key_hash(gconstpointer v)
> > > +{
> > > +    struct pasid_key *key = (struct pasid_key *)v;
> > > +    uint32_t a, b, c;
> > > +
> > > +    /* Jenkins hash */
> > > +    a = b = c = JHASH_INITVAL + sizeof(*key);
> > > +    a += key->sid;
> > > +    b += extract32(key->pasid, 0, 16);
> > > +    c += extract32(key->pasid, 16, 16);
> > > +
> > > +    __jhash_mix(a, b, c);
> > > +    __jhash_final(a, b, c);
> > > +
> > > +    return c;
> > > +}
> > > +
> > > +static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
> > > +{
> > > +    const struct pasid_key *k1 = v1;
> > > +    const struct pasid_key *k2 = v2;
> > > +
> > > +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
> > > +}
> > > +
> > > +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> > > +                                            uint8_t bus_num,
> > > +                                            uint8_t devfn,
> > > +                                            uint32_t pasid,
> > > +                                            VTDPASIDEntry *pe)
> > > +{
> > > +    VTDContextEntry ce;
> > > +    int ret;
> > > +    dma_addr_t pasid_dir_base;
> > > +
> > > +    if (!s->root_scalable) {
> > > +        return -VTD_FR_PASID_TABLE_INV;
> > > +    }
> > > +
> > > +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> > > +    ret = vtd_get_pe_from_pasid_table(s,
> > > +                                  pasid_dir_base, pasid, pe);
> > > +
> > > +    return ret;
> > > +}
> > > +
> > > +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
> > *p2)
> > > +{
> > > +    return !memcmp(p1, p2, sizeof(*p1));
> > > +}
> > > +
> > > +/**
> > > + * This function is used to clear pasid_cache_gen of cached pasid
> > > + * entry in vtd_pasid_as instances. Caller of this function should
> > > + * hold iommu_lock.
> > > + */
> > > +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> > > +                                gpointer user_data)
> > > +{
> > > +    VTDPASIDCacheInfo *pc_info = user_data;
> > > +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> > > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > > +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> > > +    VTDPASIDEntry pe;
> > > +    uint16_t did;
> > > +    uint32_t pasid;
> > > +    uint16_t devfn;
> > > +    int ret;
> > > +
> > > +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> > > +    pasid = vtd_pasid_as->pasid;
> > > +    devfn = vtd_pasid_as->devfn;
> > > +
> > > +    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
> > > +        return false;
> > > +    }
> > > +
> > > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > > +    case VTD_PASID_CACHE_PASIDSI:
> > > +        if (pc_info->pasid != pasid) {
> > > +            return false;
> > > +        }
> > > +        /* Fall through */
> > 
> > Why fall through?
> 
> For VTD_PASID_CACHE_PASIDSI, it implies domain selective, so it
> requires to check did just as VTD_PASID_CACHE_DOMSI.

Ah right. :)

> 
> > 
> > > +    case VTD_PASID_CACHE_DOMSI:
> > > +        if (pc_info->domain_id != did) {
> > > +            return false;
> > > +        }
> > > +        /* Fall through */
> > 
> > Same here.
> 
> If code comes to here, it means the necessary checks are passed. Should
> add a break here. However, as the below case does nothing and just calls
> break. So I let the code fall through.

Yes this is fine too.

> 
> > 
> > > +    case VTD_PASID_CACHE_GLOBAL:
> > > +        break;
> > > +    default:
> > 
> > > Never reach here right?  If so we can abort.
> 
> yes, should never reach here.
> 
> > > +        return false;
> > > +    }
> > > +
> > > +    /*
> > > +     * pasid cache invalidation may indicate a present pasid
> > > +     * entry to present pasid entry modification. To cover such
> > > +     * case, vIOMMU emulator needs to fetch latest guest pasid
> > > +     * entry and check cached pasid entry, then update pasid
> > > +     * cache and send pasid bind/unbind to host properly.
> > > +     */
> > > +    ret = vtd_dev_get_pe_from_pasid(s,
> > > +                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
> > > +    if (ret) {
> > > +        /*
> > > +         * No valid pasid entry in guest memory. e.g. pasid entry
> > > +         * was modified to be either all-zero or non-present. Either
> > > +         * case means existing pasid cache should be removed.
> > > +         */
> > > +        goto remove;
> > > +    }
> > > +    /* Compare cached pasid entry and latest pasid entry */
> > > +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> > > +        /* pasid entry was updated, thus update the pasid cache */
> > > +        pc_entry->pasid_entry = pe;
> > > +        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > > +        /*
> > > +         * TODO:
> > > +         * - send pasid bind to host for passthru devices
> > > +         * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > > +         *   should invalidate QEMU piotlb together with this change.
> > > +         */
> > > +    }
> > > +    return false;
> > > +remove:
> > > +    /*
> > > +     * TODO:
> > > +     * - send pasid unbind to host for passthru devices
> > > +     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > > +     *   should invalidate QEMU piotlb together with this change.
> > > +     */
> > > +    return true;
> > > +}
> > > +
> > >  static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
> > >  {
> > > +    VTDPASIDCacheInfo pc_info;
> > > +
> > > +    trace_vtd_pasid_cache_dsi(domain_id);
> > > +
> > > +    pc_info.flags = VTD_PASID_CACHE_DOMSI;
> > > +    pc_info.domain_id = domain_id;
> > > +
> > > +    /*
> > > +     * Loop all existing pasid caches and update them.
> > > +     */
> > > +    vtd_iommu_lock(s);
> > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > +                                 vtd_flush_pasid, &pc_info);
> > > +
> > > +    /*
> > > +     * TODO: Domain selective PASID cache invalidation
> > > +     * flushes all the pasid caches within a domain. To
> > > +     * be safe, after invalidating the pasid caches, emulator
> > > +     * needs to replay the pasid bindings by walking guest
> > > +     * pasid dir and pasid table.
> > 
> > Better spell out on what special case we're handling here: When the
> > guest setup a new PASID entry then send a PASID DSI.
> 
> oh, yes.  will add it in new version. :-)
> 
> > 
> > > +     */
> > > +    vtd_iommu_unlock(s);
> > >      return 0;
> > >  }
> > >
> > > +/**
> > > + * This function finds or adds a VTDPASIDAddressSpace for a device
> > > + * when it is bound to a pasid. Caller of this function should hold
> > > + * iommu_lock.
> > > + */
> > > +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> > > +                                                   VTDBus *vtd_bus,
> > > +                                                   int devfn,
> > > +                                                   uint32_t pasid,
> > > +                                                   bool allocate)
> > > +{
> > > +    struct pasid_key key;
> > > +    struct pasid_key *new_key;
> > > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > > +    uint16_t sid;
> > > +
> > > +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> > > +    vtd_init_pasid_key(pasid, sid, &key);
> > > +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> > > +
> > > +    if (!vtd_pasid_as && allocate) {
> > > +        new_key = g_malloc0(sizeof(*new_key));
> > > +        vtd_init_pasid_key(pasid, sid, new_key);
> > > +        /*
> > > +         * Initiate the vtd_pasid_as structure.
> > > +         *
> > > +         * This structure here is used to track the guest pasid
> > > +         * binding and also serves as pasid-cache management entry.
> > > +         *
> > > +         * TODO: in future, if wants to support the SVA-aware DMA
> > > +         *       emulation, the vtd_pasid_as should have include
> > > +         *       AddressSpace to support DMA emulation.
> > > +         */
> > > +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> > > +        vtd_pasid_as->iommu_state = s;
> > > +        vtd_pasid_as->vtd_bus = vtd_bus;
> > > +        vtd_pasid_as->devfn = devfn;
> > > +        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
> > > +        vtd_pasid_as->pasid = pasid;
> > > +        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
> > > +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> > > +    }
> > > +    return vtd_pasid_as;
> > > +}
> > > +
> > > + /**
> > > +  * This function updates the pasid entry cached in &vtd_pasid_as.
> > > +  * Caller of this function should hold iommu_lock.
> > > +  */
> > > +static inline void vtd_fill_in_pe_cache(
> > > +              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
> > > +{
> > > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > > +
> > > +    pc_entry->pasid_entry = *pe;
> > > +    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > > +}
> > > +
> > >  static int vtd_pasid_cache_psi(IntelIOMMUState *s,
> > >                                 uint16_t domain_id, uint32_t pasid)
> > >  {
> > > +    VTDPASIDCacheInfo pc_info;
> > > +    VTDPASIDEntry pe;
> > > +    VTDBus *vtd_bus;
> > > +    int bus_n, devfn;
> > > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > > +    VTDIOMMUContext *vtd_icx;
> > > +
> > > +    /* PASID selective implies a DID selective */
> > > +    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> > > +    pc_info.domain_id = domain_id;
> > > +    pc_info.pasid = pasid;
> > > +
> > > +    /*
> > > +     * Regards to a pasid selective pasid cache invalidation (PSI),
> > > +     * it could be either cases of below:
> > > +     * a) a present pasid entry moved to non-present
> > > +     * b) a present pasid entry to be a present entry
> > > +     * c) a non-present pasid entry moved to present
> > > +     *
> > > +     * Here the handling of a PSI is:
> > > +     * 1) loop all the existing vtd_pasid_as instances to update them
> > > +     *    according to the latest guest pasid entry in pasid table.
> > > +     *    this will make sure affected existing vtd_pasid_as instances
> > > +     *    cached the latest pasid entries. Also, during the loop, the
> > > +     *    host should be notified if needed. e.g. pasid unbind or pasid
> > > +     *    update. Should be able to cover case a) and case b).
> > > +     *
> > > +     * 2) loop all devices to cover case c)
> > > +     *    However, it is not good to always loop all devices. In this
> > > +     *    implementation. We do it in this ways:
> > > +     *    - For devices which have VTDIOMMUContext instances,
> > > +     *      we loop them and check if guest pasid entry exists. If yes,
> > > +     *      it is case c), we update the pasid cache and also notify
> > > +     *      host.
> > > +     *    - For devices which have no VTDIOMMUContext
> > > +     *      instances, it is not necessary to create pasid cache at
> > > +     *      this phase since it could be created when vIOMMU do DMA
> > > +     *      address translation. This is not implemented yet since
> > > +     *      no PASID-capable emulated devices today. If we have it
> > > +     *      in future, the pasid cache shall be created there.
> > > +     */
> > > +
> > > +    vtd_iommu_lock(s);
> > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > +                                vtd_flush_pasid, &pc_info);
> > > +
> > > +    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
> > > +        vtd_bus = vtd_icx->vtd_bus;
> > > +        devfn = vtd_icx->devfn;
> > > +        bus_n = pci_bus_num(vtd_bus->bus);
> > > +
> > > +        /* Step 1: fetch vtd_pasid_as and check if it is valid */
> > > +        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
> > > +                                        devfn, pasid, true);
> > 
> > I feel like you wanted to pass "false" here for "allocate".
> 
> emmm, yeah. It was "false" in the draft code, as step 1 only checks whether
> a valid vtd_pasid_as exists, and step 3 then needs to call
> vtd_add_find_pasid_as() with "allocate" set to "true". vtd_add_find_pasid_as()
> searches for the vtd_pasid_as first and allocates a new one only if none is
> found. With that logic there would be two vtd_add_find_pasid_as() calls, which
> means two hash table lookups.
> 
> So I modified it to be "true" to save one vtd_add_find_pasid_as() call. If
> a vtd_pasid_as is valid, its pasid_cache_gen will be equal to s->pasid_cache_gen.
> If not, the vtd_pasid_as is newly allocated and needs to go through step 2
> and step 3 to fill it in. Looks like I missed freeing the vtd_pasid_as when step
> 2 fails. Will add it if you are fine with the current logic.

I see.  Note that vtd_add_find_pasid_as() is fast when it does not
allocate, because a hash lookup is O(1).  However, I think the current
approach is OK.  If we keep it, we can also:

- Remove the allocate parameter of vtd_add_find_pasid_as(), since it is
  always true, even in future patches, and so is useless,

- Remove the vtd_pasid_as check right below, because it is not needed.
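
To illustrate the two simplifications, here is a minimal standalone
sketch.  It uses a toy fixed-size table in place of the GHashTable so
that it compiles on its own; toy_table, toy_pasid_as and
toy_find_or_add_pasid_as() are hypothetical names for illustration only,
not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the QEMU types; not the real definitions. */
struct toy_pasid_key {
    uint32_t pasid;
    uint16_t sid;
};

struct toy_pasid_as {
    struct toy_pasid_key key;
    uint32_t pasid_cache_gen;   /* 0 means "never filled in" */
    bool in_use;
};

#define TOY_TABLE_SIZE 16

struct toy_table {
    struct toy_pasid_as slots[TOY_TABLE_SIZE];
};

/*
 * Find the entry for (pasid, sid), creating it if it is missing.
 * Without an "allocate" flag the result is never NULL, so callers
 * need no NULL check; they only compare pasid_cache_gen.
 */
static struct toy_pasid_as *toy_find_or_add_pasid_as(struct toy_table *t,
                                                     uint32_t pasid,
                                                     uint16_t sid)
{
    struct toy_pasid_as *free_slot = NULL;
    size_t i;

    for (i = 0; i < TOY_TABLE_SIZE; i++) {
        struct toy_pasid_as *e = &t->slots[i];

        if (e->in_use && e->key.pasid == pasid && e->key.sid == sid) {
            return e;
        }
        if (!e->in_use && !free_slot) {
            free_slot = e;
        }
    }

    /* The toy table cannot grow; a real hash table would. */
    assert(free_slot);
    free_slot->in_use = true;
    free_slot->key.pasid = pasid;
    free_slot->key.sid = sid;
    free_slot->pasid_cache_gen = 0;
    return free_slot;
}
```

A caller then only needs the generation comparison to tell a valid entry
from a fresh one, since a new entry starts at generation 0.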

> 
> 
> > > +        if (vtd_pasid_as &&
                   ^^^^^^^^^^^^

> > > +            (s->pasid_cache_gen ==
> > > +             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
> > > +            /*
> > > +             * pasid_cache_gen equals to s->pasid_cache_gen means
> > > +             * vtd_pasid_as is valid after the above s->vtd_pasid_as
> > > +             * updates. Thus no need for the below steps.
> > > +             */
> > > +            continue;
> > > +        }
> > > +
> > > +        /*
> > > +         * Step 2: vtd_pasid_as is not valid, it's potentially a
> > > +         * new pasid bind. Fetch guest pasid entry.
> > > +         */
> > > +        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
> > > +            continue;
> > > +        }
> > > +
> > > +        /*
> > > +         * Step 3: pasid entry exists, update pasid cache
> > > +         *
> > > +         * Here need to check domain ID since guest pasid entry
> > > +         * exists. What needs to do are:
> > > +         *   - update the pc_entry in the vtd_pasid_as
> > > +         *   - set proper pc_entry.pasid_cache_gen
> > > +         *   - pass down the latest guest pasid entry config to host
> > > +         *     (will be added in later patch)
> > > +         */
> > > +        if (domain_id == vtd_pe_get_domain_id(&pe)) {
> > > +            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
> > > +        }
> > > +    }
> > > +    vtd_iommu_unlock(s);
> > >      return 0;
> > >  }
> > >
> > > +/**
> > > + * Caller of this function should hold iommu_lock
> > > + */
> > > +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> > > +{
> > > +    VTDPASIDCacheInfo pc_info;
> > > +
> > > +    trace_vtd_pasid_cache_reset();
> > > +
> > > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > > +
> > > +    /*
> > > +     * Reset pasid cache is a big hammer, so use
> > > +     * g_hash_table_foreach_remove which will free
> > > +     * the vtd_pasid_as instances.
> > > +     */
> > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > +                           vtd_flush_pasid, &pc_info);
> > > +    s->pasid_cache_gen = 1;
> > > +}
> > > +
> > >  static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
> > >  {
> > > +    trace_vtd_pasid_cache_gsi();
> > > +
> > > +    vtd_iommu_lock(s);
> > > +    vtd_pasid_cache_reset(s);
> > 
> > [1]
> > 
> > > +
> > > +    /*
> > > +     * TODO: Global PASID cache invalidation
> > > +     * flushes all the pasid caches. To be safe, after
> > > +     * invalidating the pasid caches, emulator needs
> > > +     * to replay the pasid bindings by walking guest
> > > +     * pasid dir and pasid table.
> > > +     */
> > > +    vtd_iommu_unlock(s);
> > >      return 0;
> > >  }
> > >
> > > @@ -3659,8 +4019,11 @@ static int
> > vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
> > >      VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
> > >                                                 VTDIOMMUContext,
> > >                                                 iommu_context);
> > > +    IntelIOMMUState *s = vtd_dev_icx->iommu_state;
> > >
> > >      vtd_dev_icx->dsi_obj = dsi_obj;
> > > +    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
> > > +
> > >      return 0;
> > >  }
> > >
> > > @@ -3672,6 +4035,7 @@ static void
> > vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
> > >                                                 iommu_context);
> > >
> > >      vtd_dev_icx->dsi_obj = NULL;
> > > +    QLIST_REMOVE(vtd_dev_icx, next);
> > >  }
> > >
> > >  IOMMUContextOps vtd_iommu_context_ops = {
> > > @@ -4130,6 +4494,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> > >      }
> > >
> > >      QLIST_INIT(&s->vtd_as_with_notifiers);
> > > +    QLIST_INIT(&s->vtd_dev_icx_list);
> > >      qemu_mutex_init(&s->iommu_lock);
> > >      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
> > >      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> > > @@ -4155,6 +4520,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> > >                                       g_free, g_free);
> > >      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash,
> > vtd_uint64_equal,
> > >                                                g_free, g_free);
> > > +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> > > +                                   vtd_pasid_as_key_equal, g_free, g_free);
> > >      vtd_init(s);
> > >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> > Q35_HOST_BRIDGE_IOMMU_ADDR);
> > >      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > index 6c03560..18a9e50 100644
> > > --- a/hw/i386/intel_iommu_internal.h
> > > +++ b/hw/i386/intel_iommu_internal.h
> > > @@ -311,6 +311,7 @@ typedef enum VTDFaultReason {
> > >      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
> > >
> > >      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> > > +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0
> > */
> > >
> > >      /* This is not a normal fault reason. We use this to indicate some faults
> > >       * that are not referenced by the VT-d specification.
> > > @@ -485,6 +486,19 @@ struct VTDRootEntry {
> > >  };
> > >  typedef struct VTDRootEntry VTDRootEntry;
> > >
> > > +struct VTDPASIDCacheInfo {
> > > +#define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
> > > +#define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
> > > +#define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
> > > +    uint32_t flags;
> > > +    uint16_t domain_id;
> > > +    uint32_t pasid;
> > > +};
> > > +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
> > > +                                      VTD_PASID_CACHE_DOMSI  | \
> > > +                                      VTD_PASID_CACHE_PASIDSI)
> > > +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> > > +
> > >  /* Masks for struct VTDRootEntry */
> > >  #define VTD_ROOT_ENTRY_P            1ULL
> > >  #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
> > > diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> > > index f7cd4e5..87364a3 100644
> > > --- a/hw/i386/trace-events
> > > +++ b/hw/i386/trace-events
> > > @@ -22,6 +22,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
> > >  vtd_inv_qi_tail(uint16_t head) "write tail %d"
> > >  vtd_inv_qi_fetch(void) ""
> > >  vtd_context_cache_reset(void) ""
> > > +vtd_pasid_cache_reset(void) ""
> > >  vtd_pasid_cache_gsi(void) ""
> > >  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation
> > domain 0x%"PRIx16
> > >  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC
> > invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > index 4158116..3cc4b74 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > >  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> > >  typedef struct VTDPASIDEntry VTDPASIDEntry;
> > >  typedef struct VTDIOMMUContext VTDIOMMUContext;
> > > +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> > > +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
> > >
> > >  /* Context-Entry */
> > >  struct VTDContextEntry {
> > > @@ -101,6 +103,31 @@ struct VTDPASIDEntry {
> > >      uint64_t val[8];
> > >  };
> > >
> > > +struct pasid_key {
> > > +    uint32_t pasid;
> > > +    uint16_t sid;
> > > +};
> > > +
> > > +struct VTDPASIDCacheEntry {
> > > +    /*
> > > +     * The cache entry is obsolete if
> > > +     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
> > > +     */
> > > +    uint32_t pasid_cache_gen;
> > > +    struct VTDPASIDEntry pasid_entry;
> > > +};
> > > +
> > > +struct VTDPASIDAddressSpace {
> > > +    VTDBus *vtd_bus;
> > > +    uint8_t devfn;
> > > +    AddressSpace as;
> > > +    uint32_t pasid;
> > > +    IntelIOMMUState *iommu_state;
> > > +    VTDContextCacheEntry context_cache_entry;
> > > +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > > +    VTDPASIDCacheEntry pasid_cache_entry;
> > 
> > In vtd_pasid_cache_gsi() [1], you directly reset pasid_cache_gen for
> > each pasid address space.  You never increase
> > pasid_cache_entry.pasid_cache_gen.  Then IIUC the gen will always be
> > either 0 or 1.  And...
> > 
> > > +};
> > > +
> > >  struct VTDAddressSpace {
> > >      PCIBus *bus;
> > >      uint8_t devfn;
> > > @@ -122,6 +149,7 @@ struct VTDIOMMUContext {
> > >      uint8_t devfn;
> > >      IOMMUContext iommu_context;
> > >      DualStageIOMMUObject *dsi_obj;
> > > +    QLIST_ENTRY(VTDIOMMUContext) next;
> > >      IntelIOMMUState *iommu_state;
> > >  };
> > >
> > > @@ -272,9 +300,14 @@ struct IntelIOMMUState {
> > >
> > >      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus*
> > reference */
> > >      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects
> > indexed by bus number */
> > > +    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
> > > +    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */
> > 
> > ... This should always be 1.
> > IIUC you can drop both of the pasid_cache_gen, because in this whole
> > patchset you'll remove the pasid hash entry when it is invalidated,
> > right?  Then if the hash entry is there, it must be valid.  When it's
> > out-dated, it'll be removed from the hash.
> 
> Oh, yes it is. However, it's not my intention. I'd like [1] to
> increase s->pasid_cache_gen instead of just zeroing it. I think it
> will save some time, as looping over the hash table takes time. Thanks
> for catching it. :-)

OK that's fine too.  Then remember to conditionally reset it:

static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
{
    trace_vtd_pasid_cache_gsi();

    vtd_iommu_lock(s);
    s->pasid_cache_gen++;
    if (s->pasid_cache_gen >= THRESHOLD) {
        vtd_pasid_cache_reset(s);
    }
    vtd_iommu_unlock(s);

    return 0;
}
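
As a standalone illustration of the generation-counter idea, here is a
minimal sketch.  Everything named toy_* is hypothetical, and
TOY_GEN_THRESHOLD is an arbitrary small value chosen so that the
wrap-around path is easy to exercise; the THRESHOLD above is left open.
The toy reset zeroes every entry's generation, which is an assumption of
this sketch: without it, an old entry tagged with generation 1 would
look valid again once the counter restarts at 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ENTRIES       4
#define TOY_GEN_THRESHOLD 3u   /* hypothetical value, for illustration */

struct toy_iommu {
    uint32_t pasid_cache_gen;          /* kept in [1, TOY_GEN_THRESHOLD) */
    uint32_t entry_gen[TOY_ENTRIES];   /* 0 is never a valid generation */
};

/* The "big hammer": drop every entry and restart the counter. */
static void toy_reset(struct toy_iommu *s)
{
    size_t i;

    for (i = 0; i < TOY_ENTRIES; i++) {
        s->entry_gen[i] = 0;
    }
    s->pasid_cache_gen = 1;
}

/* An entry is valid only if it carries the current generation. */
static bool toy_entry_valid(const struct toy_iommu *s, size_t i)
{
    assert(i < TOY_ENTRIES);
    return s->entry_gen[i] == s->pasid_cache_gen;
}

/* (Re)fill an entry: tag it with the current generation. */
static void toy_fill(struct toy_iommu *s, size_t i)
{
    assert(i < TOY_ENTRIES);
    s->entry_gen[i] = s->pasid_cache_gen;
}

/* Global invalidation: O(1) normally, full reset only at the threshold. */
static void toy_global_invalidate(struct toy_iommu *s)
{
    s->pasid_cache_gen++;
    if (s->pasid_cache_gen >= TOY_GEN_THRESHOLD) {
        toy_reset(s);
    }
}
```

The point of the design is that bumping the counter makes every cached
entry stale at once without walking the table; the walk is paid only
when the counter is about to run out of range.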

Thanks,

-- 
Peter Xu



* Re: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
@ 2020-02-12 15:26         ` Peter Xu
  0 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-12 15:26 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jacob Pan, Yi Sun, Eduardo Habkost, kvm, mst, Tian,
	Jun J, qemu-devel, eric.auger, alex.williamson, pbonzini, Wu,
	Hao, Sun, Yi Y, Richard Henderson, david

On Wed, Feb 12, 2020 at 08:37:30AM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Wednesday, February 12, 2020 7:36 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> > infrastructure
> > 
> > On Wed, Jan 29, 2020 at 04:16:47AM -0800, Liu, Yi L wrote:
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > This patch adds a PASID cache management infrastructure based on
> > > new added structure VTDPASIDAddressSpace, which is used to track
> > > the PASID usage and future PASID tagged DMA address translation
> > > support in vIOMMU.
> > >
> > >     struct VTDPASIDAddressSpace {
> > >         VTDBus *vtd_bus;
> > >         uint8_t devfn;
> > >         AddressSpace as;
> > >         uint32_t pasid;
> > >         IntelIOMMUState *iommu_state;
> > >         VTDContextCacheEntry context_cache_entry;
> > >         QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > >         VTDPASIDCacheEntry pasid_cache_entry;
> > >     };
> > >
> > > Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> > > is bound with a DMA AddressSpace. Intel VT-d spec requires guest
> > > software to issue pasid cache invalidation when bind or unbind a
> > > pasid with an address space under caching-mode. However, as
> > > VTDPASIDAddressSpace instances also act as pasid cache in this
> > > implementation, its creation also happens during vIOMMU PASID
> > > tagged DMA translation. The creation in this path will not be
> > > added in this patch since no PASID-capable emulated devices for
> > > now.
> > >
> > > The implementation in this patch manages VTDPASIDAddressSpace
> > > instances per PASID+BDF (lookup and insert will use PASID and
> > > BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
> > > guest bind a PASID with an AddressSpace, QEMU will capture the
> > > guest pasid selective pasid cache invalidation, and allocate
> > > remove a VTDPASIDAddressSpace instance per the invalidation
> > > reasons:
> > >
> > >     *) a present pasid entry moved to non-present
> > >     *) a present pasid entry to be a present entry
> > >     *) a non-present pasid entry moved to present
> > >
> > > vIOMMU emulator could figure out the reason by fetching latest
> > > guest pasid entry.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Richard Henderson <rth@twiddle.net>
> > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 367
> > +++++++++++++++++++++++++++++++++++++++++
> > >  hw/i386/intel_iommu_internal.h |  14 ++
> > >  hw/i386/trace-events           |   1 +
> > >  include/hw/i386/intel_iommu.h  |  36 +++-
> > >  4 files changed, 417 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 58e7213..c75cb7b 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -40,6 +40,7 @@
> > >  #include "kvm_i386.h"
> > >  #include "migration/vmstate.h"
> > >  #include "trace.h"
> > > +#include "qemu/jhash.h"
> > >
> > >  /* context entry operations */
> > >  #define VTD_CE_GET_RID2PASID(ce) \
> > > @@ -65,6 +66,8 @@
> > >  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> > >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> > *n);
> > >
> > > +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
> > > +
> > >  static void vtd_panic_require_caching_mode(void)
> > >  {
> > >      error_report("We need to set caching-mode=on for intel-iommu to enable "
> > > @@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
> > >      vtd_iommu_lock(s);
> > >      vtd_reset_iotlb_locked(s);
> > >      vtd_reset_context_cache_locked(s);
> > > +    vtd_pasid_cache_reset(s);
> > >      vtd_iommu_unlock(s);
> > >  }
> > >
> > > @@ -686,6 +690,11 @@ static inline bool vtd_pe_type_check(X86IOMMUState
> > *x86_iommu,
> > >      return true;
> > >  }
> > >
> > > +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
> > > +{
> > > +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> > > +}
> > > +
> > >  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
> > >  {
> > >      return pdire->val & 1;
> > > @@ -2393,19 +2402,370 @@ static bool
> > vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
> > >      return true;
> > >  }
> > >
> > > +static inline void vtd_init_pasid_key(uint32_t pasid,
> > > +                                     uint16_t sid,
> > > +                                     struct pasid_key *key)
> > > +{
> > > +    key->pasid = pasid;
> > > +    key->sid = sid;
> > > +}
> > > +
> > > +static guint vtd_pasid_as_key_hash(gconstpointer v)
> > > +{
> > > +    struct pasid_key *key = (struct pasid_key *)v;
> > > +    uint32_t a, b, c;
> > > +
> > > +    /* Jenkins hash */
> > > +    a = b = c = JHASH_INITVAL + sizeof(*key);
> > > +    a += key->sid;
> > > +    b += extract32(key->pasid, 0, 16);
> > > +    c += extract32(key->pasid, 16, 16);
> > > +
> > > +    __jhash_mix(a, b, c);
> > > +    __jhash_final(a, b, c);
> > > +
> > > +    return c;
> > > +}
> > > +
> > > +static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
> > > +{
> > > +    const struct pasid_key *k1 = v1;
> > > +    const struct pasid_key *k2 = v2;
> > > +
> > > +    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
> > > +}
> > > +
> > > +static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
> > > +                                            uint8_t bus_num,
> > > +                                            uint8_t devfn,
> > > +                                            uint32_t pasid,
> > > +                                            VTDPASIDEntry *pe)
> > > +{
> > > +    VTDContextEntry ce;
> > > +    int ret;
> > > +    dma_addr_t pasid_dir_base;
> > > +
> > > +    if (!s->root_scalable) {
> > > +        return -VTD_FR_PASID_TABLE_INV;
> > > +    }
> > > +
> > > +    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> > > +    if (ret) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
> > > +    ret = vtd_get_pe_from_pasid_table(s,
> > > +                                  pasid_dir_base, pasid, pe);
> > > +
> > > +    return ret;
> > > +}
> > > +
> > > +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
> > *p2)
> > > +{
> > > +    return !memcmp(p1, p2, sizeof(*p1));
> > > +}
> > > +
> > > +/**
> > > + * This function is used to clear pasid_cache_gen of cached pasid
> > > + * entry in vtd_pasid_as instances. Caller of this function should
> > > + * hold iommu_lock.
> > > + */
> > > +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> > > +                                gpointer user_data)
> > > +{
> > > +    VTDPASIDCacheInfo *pc_info = user_data;
> > > +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> > > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > > +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> > > +    VTDPASIDEntry pe;
> > > +    uint16_t did;
> > > +    uint32_t pasid;
> > > +    uint16_t devfn;
> > > +    int ret;
> > > +
> > > +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> > > +    pasid = vtd_pasid_as->pasid;
> > > +    devfn = vtd_pasid_as->devfn;
> > > +
> > > +    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
> > > +        return false;
> > > +    }
> > > +
> > > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > > +    case VTD_PASID_CACHE_PASIDSI:
> > > +        if (pc_info->pasid != pasid) {
> > > +            return false;
> > > +        }
> > > +        /* Fall through */
> > 
> > Why fall through?
> 
> VTD_PASID_CACHE_PASIDSI implies domain selective, so it needs to
> check the did just as VTD_PASID_CACHE_DOMSI does.

Ah right. :)

> 
> > 
> > > +    case VTD_PASID_CACHE_DOMSI:
> > > +        if (pc_info->domain_id != did) {
> > > +            return false;
> > > +        }
> > > +        /* Fall through */
> > 
> > Same here.
> 
> If the code gets here, the necessary checks have passed. I should
> add a break here. However, the case below does nothing but
> break, so I let the code fall through.

Yes this is fine too.
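
For reference, the cascading check discussed above can be sketched
standalone as below.  The toy_* names are hypothetical and the enum
replaces the flag bits of the patch; the sketch also aborts in the
default case, which is one way to handle the unreachable branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the VTD_PASID_CACHE_* flags. */
enum toy_inv_type {
    TOY_INV_GLOBAL,
    TOY_INV_DOMAIN,
    TOY_INV_PASID,
};

struct toy_inv_info {
    enum toy_inv_type type;
    uint16_t domain_id;
    uint32_t pasid;
};

/*
 * Return true if a cached entry tagged with (did, pasid) is covered by
 * the invalidation request.  Each narrower type runs its own check and
 * then falls through to the checks of the wider type.
 */
static bool toy_inv_matches(const struct toy_inv_info *info,
                            uint16_t did, uint32_t pasid)
{
    switch (info->type) {
    case TOY_INV_PASID:
        if (info->pasid != pasid) {
            return false;
        }
        /* PASID selective implies domain selective. */
        /* Fall through */
    case TOY_INV_DOMAIN:
        if (info->domain_id != did) {
            return false;
        }
        /* All necessary checks have passed. */
        /* Fall through */
    case TOY_INV_GLOBAL:
        return true;
    default:
        /* Unreachable for a valid request type. */
        abort();
    }
}
```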

> 
> > 
> > > +    case VTD_PASID_CACHE_GLOBAL:
> > > +        break;
> > > +    default:
> > 
> > Never reach here, right?  If so we can abort.
> 
> yes, should never reach here.
> 
> > > +        return false;
> > > +    }
> > > +
> > > +    /*
> > > +     * pasid cache invalidation may indicate a present pasid
> > > +     * entry to present pasid entry modification. To cover such
> > > +     * case, vIOMMU emulator needs to fetch latest guest pasid
> > > +     * entry and check cached pasid entry, then update pasid
> > > +     * cache and send pasid bind/unbind to host properly.
> > > +     */
> > > +    ret = vtd_dev_get_pe_from_pasid(s,
> > > +                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
> > > +    if (ret) {
> > > +        /*
> > > +         * No valid pasid entry in guest memory. e.g. pasid entry
> > > +         * was modified to be either all-zero or non-present. Either
> > > +         * case means existing pasid cache should be removed.
> > > +         */
> > > +        goto remove;
> > > +    }
> > > +    /* Compare cached pasid entry and latest pasid entry */
> > > +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> > > +        /* pasid entry was updated, thus update the pasid cache */
> > > +        pc_entry->pasid_entry = pe;
> > > +        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > > +        /*
> > > +         * TODO:
> > > +         * - send pasid bind to host for passthru devices
> > > +         * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > > +         *   should invalidate QEMU piotlb together with this change.
> > > +         */
> > > +    }
> > > +    return false;
> > > +remove:
> > > +    /*
> > > +     * TODO:
> > > +     * - send pasid unbind to host for passthru devices
> > > +     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > > +     *   should invalidate QEMU piotlb together with this change.
> > > +     */
> > > +    return true;
> > > +}
> > > +
> > >  static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
> > >  {
> > > +    VTDPASIDCacheInfo pc_info;
> > > +
> > > +    trace_vtd_pasid_cache_dsi(domain_id);
> > > +
> > > +    pc_info.flags = VTD_PASID_CACHE_DOMSI;
> > > +    pc_info.domain_id = domain_id;
> > > +
> > > +    /*
> > > +     * Loop all existing pasid caches and update them.
> > > +     */
> > > +    vtd_iommu_lock(s);
> > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > +                                 vtd_flush_pasid, &pc_info);
> > > +
> > > +    /*
> > > +     * TODO: Domain selective PASID cache invalidation
> > > +     * flushes all the pasid caches within a domain. To
> > > +     * be safe, after invalidating the pasid caches, emulator
> > > +     * needs to replay the pasid bindings by walking guest
> > > +     * pasid dir and pasid table.
> > 
> > Better to spell out what special case we're handling here: when the
> > guest sets up a new PASID entry and then sends a PASID DSI.
> 
> oh, yes.  will add it in new version. :-)
> 
> > 
> > > +     */
> > > +    vtd_iommu_unlock(s);
> > >      return 0;
> > >  }
> > >
> > > +/**
> > > + * This function finds or adds a VTDPASIDAddressSpace for a device
> > > + * when it is bound to a pasid. Caller of this function should hold
> > > + * iommu_lock.
> > > + */
> > > +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> > > +                                                   VTDBus *vtd_bus,
> > > +                                                   int devfn,
> > > +                                                   uint32_t pasid,
> > > +                                                   bool allocate)
> > > +{
> > > +    struct pasid_key key;
> > > +    struct pasid_key *new_key;
> > > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > > +    uint16_t sid;
> > > +
> > > +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> > > +    vtd_init_pasid_key(pasid, sid, &key);
> > > +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> > > +
> > > +    if (!vtd_pasid_as && allocate) {
> > > +        new_key = g_malloc0(sizeof(*new_key));
> > > +        vtd_init_pasid_key(pasid, sid, new_key);
> > > +        /*
> > > +         * Initiate the vtd_pasid_as structure.
> > > +         *
> > > +         * This structure here is used to track the guest pasid
> > > +         * binding and also serves as pasid-cache management entry.
> > > +         *
> > > +         * TODO: in future, if wants to support the SVA-aware DMA
> > > +         *       emulation, the vtd_pasid_as should have include
> > > +         *       AddressSpace to support DMA emulation.
> > > +         */
> > > +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> > > +        vtd_pasid_as->iommu_state = s;
> > > +        vtd_pasid_as->vtd_bus = vtd_bus;
> > > +        vtd_pasid_as->devfn = devfn;
> > > +        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
> > > +        vtd_pasid_as->pasid = pasid;
> > > +        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
> > > +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> > > +    }
> > > +    return vtd_pasid_as;
> > > +}
> > > +
> > > + /**
> > > +  * This function updates the pasid entry cached in &vtd_pasid_as.
> > > +  * Caller of this function should hold iommu_lock.
> > > +  */
> > > +static inline void vtd_fill_in_pe_cache(
> > > +              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
> > > +{
> > > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > > +
> > > +    pc_entry->pasid_entry = *pe;
> > > +    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > > +}
> > > +
> > >  static int vtd_pasid_cache_psi(IntelIOMMUState *s,
> > >                                 uint16_t domain_id, uint32_t pasid)
> > >  {
> > > +    VTDPASIDCacheInfo pc_info;
> > > +    VTDPASIDEntry pe;
> > > +    VTDBus *vtd_bus;
> > > +    int bus_n, devfn;
> > > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > > +    VTDIOMMUContext *vtd_icx;
> > > +
> > > +    /* PASID selective implies a DID selective */
> > > +    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> > > +    pc_info.domain_id = domain_id;
> > > +    pc_info.pasid = pasid;
> > > +
> > > +    /*
> > > +     * A pasid-selective pasid cache invalidation (PSI) can be
> > > +     * triggered by any of the cases below:
> > > +     * a) a present pasid entry moved to non-present
> > > +     * b) a present pasid entry modified but still present
> > > +     * c) a non-present pasid entry moved to present
> > > +     *
> > > +     * Here the handling of a PSI is:
> > > +     * 1) loop over all the existing vtd_pasid_as instances to update them
> > > +     *    according to the latest guest pasid entry in pasid table.
> > > +     *    this will make sure affected existing vtd_pasid_as instances
> > > +     *    cached the latest pasid entries. Also, during the loop, the
> > > +     *    host should be notified if needed. e.g. pasid unbind or pasid
> > > +     *    update. Should be able to cover case a) and case b).
> > > +     *
> > > +     * 2) loop all devices to cover case c)
> > > +     *    However, it is not good to always loop over all devices. In
> > > +     *    this implementation, we do it as follows:
> > > +     *    - For devices which have VTDIOMMUContext instances,
> > > +     *      we loop them and check if guest pasid entry exists. If yes,
> > > +     *      it is case c), we update the pasid cache and also notify
> > > +     *      host.
> > > +     *    - For devices which have no VTDIOMMUContext
> > > +     *      instances, it is not necessary to create pasid cache at
> > > +     *      this phase since it could be created when vIOMMU does DMA
> > > +     *      address translation. This is not implemented yet since
> > > +     *      there are no PASID-capable emulated devices today. If we
> > > +     *      have them in future, the pasid cache shall be created there.
> > > +     */
> > > +
> > > +    vtd_iommu_lock(s);
> > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > +                                vtd_flush_pasid, &pc_info);
> > > +
> > > +    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
> > > +        vtd_bus = vtd_icx->vtd_bus;
> > > +        devfn = vtd_icx->devfn;
> > > +        bus_n = pci_bus_num(vtd_bus->bus);
> > > +
> > > +        /* Step 1: fetch vtd_pasid_as and check if it is valid */
> > > +        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
> > > +                                        devfn, pasid, true);
> > 
> > I feel like you wanted to pass "false" here for "allocate".
> 
> emmm, yeah. It was "false" in the draft code, as step 1 only checks whether
> a valid vtd_pasid_as exists. Step 3 then needs to call
> vtd_add_find_pasid_as() with "allocate" set to "true". vtd_add_find_pasid_as()
> searches for an existing vtd_pasid_as first and only then allocates a new one.
> With that logic there would be two vtd_add_find_pasid_as() calls, which means
> two hash table lookups.
> 
> So I modified it to be "true" to save one vtd_add_find_pasid_as() call. If
> a vtd_pasid_as is valid, its pasid_cache_gen will be equal to s->pasid_cache_gen.
> If not, the vtd_pasid_as is newly allocated and needs to go through step 2
> and step 3 to be filled in. Looks like I forgot to free the vtd_pasid_as when
> step 2 fails. Will add it if you are fine with the current logic.

I see.  Note that vtd_add_find_pasid_as() is fast when it does not
allocate, because a hash lookup is O(1) on average.  However, I think the
current approach is OK, and with it we can also:

- Remove the allocate parameter of vtd_add_find_pasid_as(), since it is
  always true, even in future patches, and is therefore useless,

- Remove the vtd_pasid_as check right below because it's not needed.
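
A minimal sketch of what a find-or-create helper without the "allocate"
parameter could look like. It uses a toy chained hash table instead of
GHashTable so that it stays self-contained; PasidAs, PasidTable and
pasid_as_get() are hypothetical names for illustration, not code from the
series.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for VTDPASIDAddressSpace; only the fields used here. */
typedef struct PasidAs {
    uint16_t sid;
    uint32_t pasid;
    uint32_t cache_gen;     /* 0 means "never filled in" */
    struct PasidAs *next;   /* bucket chain */
} PasidAs;

#define NBUCKETS 64u

typedef struct {
    PasidAs *buckets[NBUCKETS];
} PasidTable;

static unsigned pasid_hash(uint16_t sid, uint32_t pasid)
{
    return (pasid * 31u + sid) % NBUCKETS;
}

/*
 * Find the entry for (sid, pasid), creating a zeroed one if it is absent.
 * Returns NULL only if the allocation fails (g_malloc0() would abort
 * instead, which is why the caller-side NULL check becomes redundant
 * in the QEMU version).
 */
static PasidAs *pasid_as_get(PasidTable *t, uint16_t sid, uint32_t pasid)
{
    unsigned b = pasid_hash(sid, pasid);
    PasidAs *e;

    for (e = t->buckets[b]; e != NULL; e = e->next) {
        if (e->sid == sid && e->pasid == pasid) {
            return e;       /* existing entry, no allocation */
        }
    }

    e = calloc(1, sizeof(*e));
    if (e == NULL) {
        return NULL;
    }
    e->sid = sid;
    e->pasid = pasid;
    e->next = t->buckets[b];
    t->buckets[b] = e;
    return e;
}
```

A freshly created entry has cache_gen == 0, so a caller can tell "new" from
"valid" by comparing the entry's generation with the current one, which is
the check the patch already performs.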

> 
> 
> > > +        if (vtd_pasid_as &&
                   ^^^^^^^^^^^^

> > > +            (s->pasid_cache_gen ==
> > > +             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
> > > +            /*
> > > +             * pasid_cache_gen being equal to s->pasid_cache_gen means
> > > +             * vtd_pasid_as is valid after the above s->vtd_pasid_as
> > > +             * updates. Thus no need for the below steps.
> > > +             */
> > > +            continue;
> > > +        }
> > > +
> > > +        /*
> > > +         * Step 2: vtd_pasid_as is not valid, it's potentially a
> > > +         * new pasid bind. Fetch guest pasid entry.
> > > +         */
> > > +        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
> > > +            continue;
> > > +        }
> > > +
> > > +        /*
> > > +         * Step 3: pasid entry exists, update pasid cache
> > > +         *
> > > +         * The domain ID needs to be checked here since the guest
> > > +         * pasid entry exists. What needs to be done:
> > > +         *   - update the pc_entry in the vtd_pasid_as
> > > +         *   - set proper pc_entry.pasid_cache_gen
> > > +         *   - pass down the latest guest pasid entry config to host
> > > +         *     (will be added in later patch)
> > > +         */
> > > +        if (domain_id == vtd_pe_get_domain_id(&pe)) {
> > > +            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
> > > +        }
> > > +    }
> > > +    vtd_iommu_unlock(s);
> > >      return 0;
> > >  }
> > >
> > > +/**
> > > + * Caller of this function should hold iommu_lock
> > > + */
> > > +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> > > +{
> > > +    VTDPASIDCacheInfo pc_info;
> > > +
> > > +    trace_vtd_pasid_cache_reset();
> > > +
> > > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > > +
> > > +    /*
> > > +     * Reset pasid cache is a big hammer, so use
> > > +     * g_hash_table_foreach_remove which will free
> > > +     * the vtd_pasid_as instances.
> > > +     */
> > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > +                           vtd_flush_pasid, &pc_info);
> > > +    s->pasid_cache_gen = 1;
> > > +}
> > > +
> > >  static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
> > >  {
> > > +    trace_vtd_pasid_cache_gsi();
> > > +
> > > +    vtd_iommu_lock(s);
> > > +    vtd_pasid_cache_reset(s);
> > 
> > [1]
> > 
> > > +
> > > +    /*
> > > +     * TODO: Global PASID cache invalidation may flush
> > > +     * all the pasid caches. To be safe, after
> > > +     * invalidating the pasid caches, emulator needs
> > > +     * to replay the pasid bindings by walking guest
> > > +     * pasid dir and pasid table.
> > > +     */
> > > +    vtd_iommu_unlock(s);
> > >      return 0;
> > >  }
> > >
> > > @@ -3659,8 +4019,11 @@ static int
> > vtd_icx_register_ds_iommu(IOMMUContext *iommu_ctx,
> > >      VTDIOMMUContext *vtd_dev_icx = container_of(iommu_ctx,
> > >                                                 VTDIOMMUContext,
> > >                                                 iommu_context);
> > > +    IntelIOMMUState *s = vtd_dev_icx->iommu_state;
> > >
> > >      vtd_dev_icx->dsi_obj = dsi_obj;
> > > +    QLIST_INSERT_HEAD(&s->vtd_dev_icx_list, vtd_dev_icx, next);
> > > +
> > >      return 0;
> > >  }
> > >
> > > @@ -3672,6 +4035,7 @@ static void
> > vtd_icx_unregister_ds_iommu(IOMMUContext *iommu_ctx,
> > >                                                 iommu_context);
> > >
> > >      vtd_dev_icx->dsi_obj = NULL;
> > > +    QLIST_REMOVE(vtd_dev_icx, next);
> > >  }
> > >
> > >  IOMMUContextOps vtd_iommu_context_ops = {
> > > @@ -4130,6 +4494,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> > >      }
> > >
> > >      QLIST_INIT(&s->vtd_as_with_notifiers);
> > > +    QLIST_INIT(&s->vtd_dev_icx_list);
> > >      qemu_mutex_init(&s->iommu_lock);
> > >      memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
> > >      memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
> > > @@ -4155,6 +4520,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
> > >                                       g_free, g_free);
> > >      s->vtd_as_by_busptr = g_hash_table_new_full(vtd_uint64_hash,
> > vtd_uint64_equal,
> > >                                                g_free, g_free);
> > > +    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
> > > +                                   vtd_pasid_as_key_equal, g_free, g_free);
> > >      vtd_init(s);
> > >      sysbus_mmio_map(SYS_BUS_DEVICE(s), 0,
> > Q35_HOST_BRIDGE_IOMMU_ADDR);
> > >      pci_setup_iommu(bus, &vtd_iommu_ops, dev);
> > > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > > index 6c03560..18a9e50 100644
> > > --- a/hw/i386/intel_iommu_internal.h
> > > +++ b/hw/i386/intel_iommu_internal.h
> > > @@ -311,6 +311,7 @@ typedef enum VTDFaultReason {
> > >      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
> > >
> > >      VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
> > > +    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0
> > */
> > >
> > >      /* This is not a normal fault reason. We use this to indicate some faults
> > >       * that are not referenced by the VT-d specification.
> > > @@ -485,6 +486,19 @@ struct VTDRootEntry {
> > >  };
> > >  typedef struct VTDRootEntry VTDRootEntry;
> > >
> > > +struct VTDPASIDCacheInfo {
> > > +#define VTD_PASID_CACHE_GLOBAL   (1ULL << 0)
> > > +#define VTD_PASID_CACHE_DOMSI    (1ULL << 1)
> > > +#define VTD_PASID_CACHE_PASIDSI  (1ULL << 2)
> > > +    uint32_t flags;
> > > +    uint16_t domain_id;
> > > +    uint32_t pasid;
> > > +};
> > > +#define VTD_PASID_CACHE_INFO_MASK    (VTD_PASID_CACHE_GLOBAL | \
> > > +                                      VTD_PASID_CACHE_DOMSI  | \
> > > +                                      VTD_PASID_CACHE_PASIDSI)
> > > +typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
> > > +
> > >  /* Masks for struct VTDRootEntry */
> > >  #define VTD_ROOT_ENTRY_P            1ULL
> > >  #define VTD_ROOT_ENTRY_CTP          (~0xfffULL)
> > > diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> > > index f7cd4e5..87364a3 100644
> > > --- a/hw/i386/trace-events
> > > +++ b/hw/i386/trace-events
> > > @@ -22,6 +22,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
> > >  vtd_inv_qi_tail(uint16_t head) "write tail %d"
> > >  vtd_inv_qi_fetch(void) ""
> > >  vtd_context_cache_reset(void) ""
> > > +vtd_pasid_cache_reset(void) ""
> > >  vtd_pasid_cache_gsi(void) ""
> > >  vtd_pasid_cache_dsi(uint16_t domain) "Domian slective PC invalidation
> > domain 0x%"PRIx16
> > >  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC
> > invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > index 4158116..3cc4b74 100644
> > > --- a/include/hw/i386/intel_iommu.h
> > > +++ b/include/hw/i386/intel_iommu.h
> > > @@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > >  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> > >  typedef struct VTDPASIDEntry VTDPASIDEntry;
> > >  typedef struct VTDIOMMUContext VTDIOMMUContext;
> > > +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> > > +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
> > >
> > >  /* Context-Entry */
> > >  struct VTDContextEntry {
> > > @@ -101,6 +103,31 @@ struct VTDPASIDEntry {
> > >      uint64_t val[8];
> > >  };
> > >
> > > +struct pasid_key {
> > > +    uint32_t pasid;
> > > +    uint16_t sid;
> > > +};
> > > +
> > > +struct VTDPASIDCacheEntry {
> > > +    /*
> > > +     * The cache entry is obsolete if
> > > +     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
> > > +     */
> > > +    uint32_t pasid_cache_gen;
> > > +    struct VTDPASIDEntry pasid_entry;
> > > +};
> > > +
> > > +struct VTDPASIDAddressSpace {
> > > +    VTDBus *vtd_bus;
> > > +    uint8_t devfn;
> > > +    AddressSpace as;
> > > +    uint32_t pasid;
> > > +    IntelIOMMUState *iommu_state;
> > > +    VTDContextCacheEntry context_cache_entry;
> > > +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > > +    VTDPASIDCacheEntry pasid_cache_entry;
> > 
> > In vtd_pasid_cache_gsi() [1], you directly reset pasid_cache_gen for
> > each pasid address space.  You never increase
> > pasid_cache_entry.pasid_cache_gen.  Then IIUC the gen will always be
> > either 0 or 1.  And...
> > 
> > > +};
> > > +
> > >  struct VTDAddressSpace {
> > >      PCIBus *bus;
> > >      uint8_t devfn;
> > > @@ -122,6 +149,7 @@ struct VTDIOMMUContext {
> > >      uint8_t devfn;
> > >      IOMMUContext iommu_context;
> > >      DualStageIOMMUObject *dsi_obj;
> > > +    QLIST_ENTRY(VTDIOMMUContext) next;
> > >      IntelIOMMUState *iommu_state;
> > >  };
> > >
> > > @@ -272,9 +300,14 @@ struct IntelIOMMUState {
> > >
> > >      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus*
> > reference */
> > >      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects
> > indexed by bus number */
> > > +    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
> > > +    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */
> > 
> > ... This should always be 1.
> > IIUC you can drop both of the pasid_cache_gen, because in this whole
> > patchset you'll remove the pasid hash entry when it is invalidated,
> > right?  Then if the hash entry is there, it must be valid.  When it's
> > out-dated, it'll be removed from the hash.
> 
> Oh, yes it is. However, that's not my intention. I'd like [1] to
> increase s->pasid_cache_gen instead of just resetting it. I think it
> will save some time, as looping over the hash table takes time. Thanks
> for catching it. :-)

OK that's fine too.  Then remember to conditionally reset it:

static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
{
    trace_vtd_pasid_cache_gsi();

    vtd_iommu_lock(s);
    s->pasid_cache_gen++;
    if (s->pasid_cache_gen >= THRESHOLD) {
        vtd_pasid_cache_reset(s);
    }
    vtd_iommu_unlock(s);

    return 0;
}
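
The generation scheme above can be sketched in isolation as follows. This is
only an illustration under stated assumptions: GenCache, its fixed-size
entry array and the configurable threshold are hypothetical, and the
threshold stands in for the THRESHOLD placeholder in the snippet above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NENTRIES 4

/* Hypothetical cache: one generation per entry plus a global generation. */
typedef struct {
    uint32_t gen;                  /* current generation, kept in [1, threshold) */
    uint32_t threshold;            /* assumed wrap point */
    uint32_t entry_gen[NENTRIES];  /* 0 means "never valid" */
} GenCache;

/* Big hammer: forget every entry and restart the generation at 1. */
static void gen_cache_reset(GenCache *c)
{
    memset(c->entry_gen, 0, sizeof(c->entry_gen));
    c->gen = 1;
}

static void gen_cache_init(GenCache *c, uint32_t threshold)
{
    c->threshold = threshold;
    gen_cache_reset(c);
}

/* Mark entry i as filled in under the current generation. */
static void gen_cache_fill(GenCache *c, int i)
{
    c->entry_gen[i] = c->gen;
}

/* An entry is valid only if it was filled in the current generation. */
static bool gen_cache_valid(const GenCache *c, int i)
{
    return c->entry_gen[i] == c->gen;
}

/* Global invalidation: O(1) bump, full reset only near the wrap point. */
static void gen_cache_invalidate_all(GenCache *c)
{
    c->gen++;
    if (c->gen >= c->threshold) {
        gen_cache_reset(c);
    }
}
```

The conditional reset matters because, without it, the counter could
eventually wrap around to a value that a stale entry still carries, and that
entry would then look valid again.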

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-02-12  7:15           ` Liu, Yi L
@ 2020-02-12 15:59             ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-12 15:59 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: David Gibson, qemu-devel, pbonzini, alex.williamson, mst,
	eric.auger, Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao,
	Jacob Pan, Yi Sun

On Wed, Feb 12, 2020 at 07:15:13AM +0000, Liu, Yi L wrote:

[...]

> While considering your suggestion on dropping one of the two abstraction
> layers, I came up with a new proposal as below:
> 
> We may drop the IOMMUContext in this series, and rename DualStageIOMMUObject
> to HostIOMMUContext, which is per-vfio-container. Add an interface in the PCI
> layer (e.g. a callback in PCIDevice) to let the vIOMMU get the HostIOMMUContext.
> I think this could cover the requirement of providing an explicit method for
> the vIOMMU to call into VFIO and then program the host IOMMU.
> 
> As for the requirement of VFIO-to-vIOMMU calls (e.g. PRQ), I think it
> could be done via the PCI layer by adding an operation in PCIIOMMUOps. Thoughts?

Hmm sounds good. :)

The thing is that the calls in the other direction (e.g. VFIO injecting
faults into the vIOMMU) are neither per-container nor per-device, but
per-vIOMMU.  PCIIOMMUOps suits that job I'd say, since it is per-vIOMMU.

Let's see how it goes.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string option
  2020-02-12  7:28       ` Liu, Yi L
@ 2020-02-12 16:05         ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-12 16:05 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

On Wed, Feb 12, 2020 at 07:28:24AM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Wednesday, February 12, 2020 3:44 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string
> > option
> > 
> > On Wed, Jan 29, 2020 at 04:16:44AM -0800, Liu, Yi L wrote:
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > Intel VT-d 3.0 introduces scalable mode, and it has a bunch of
> > > capabilities related to scalable mode translation, thus there are
> > > multiple combinations. This vIOMMU implementation wants to simplify
> > > it for the user by providing typical combinations. The user can
> > > configure it with the "x-scalable-mode" option. The usage is as below:
> > >
> > > "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
> > 
> > Maybe also "off" when someone wants to explicitly disable it?
> 
> emmm, I think x-scalable-mode should be disabled by default. It is enabled
> only when "legacy" or "modern" is configured. I'm fine with adding "off" as
> an explicit way to turn it off if you think it is necessary. :-)

It's not necessary.  It'll be necessary when we remove "x-" and change
the default value.  However it'll always be good to provide all
options explicitly in the parameter starting from when we design it,
imho.  It's still experimental, so... Your call. :)
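
As an illustration of the option discussed here, a string parser that accepts
"off" next to "legacy" and "modern" could look roughly like this. It is a
sketch only: ScalableMode and parse_scalable_mode() are hypothetical names,
and treating an unset option (NULL) as "off" is an assumption based on the
"disabled by default" statement above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical internal representation of the x-scalable-mode option. */
typedef enum {
    SCALABLE_MODE_OFF,
    SCALABLE_MODE_LEGACY,
    SCALABLE_MODE_MODERN,
} ScalableMode;

/*
 * Translate the option string into a mode.
 * Returns false for an unrecognized value so the caller can report an error.
 */
static bool parse_scalable_mode(const char *str, ScalableMode *mode)
{
    if (str == NULL || strcmp(str, "off") == 0) {
        *mode = SCALABLE_MODE_OFF;      /* unset is assumed to mean off */
        return true;
    }
    if (strcmp(str, "legacy") == 0) {
        *mode = SCALABLE_MODE_LEGACY;
        return true;
    }
    if (strcmp(str, "modern") == 0) {
        *mode = SCALABLE_MODE_MODERN;
        return true;
    }
    return false;
}
```

Listing every value explicitly means the command line keeps working
unchanged if the default is later switched away from "off".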

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 14/25] intel_iommu: add virtual command capability support
  2020-02-11 21:56     ` Peter Xu
@ 2020-02-13  2:40       ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-13  2:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 5:57 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
> 
> On Wed, Jan 29, 2020 at 04:16:45AM -0800, Liu, Yi L wrote:
> > +/*
> > + * The basic idea is to let the hypervisor set a range of available
> > + * PASIDs for VMs. One of the reasons is that PASID #0 is reserved for
> > + * RID_PASID usage. We have no idea how many PASIDs will be reserved in
> > + * future, so this is just an estimated value. Honestly, setting it to
> > + * "1" is enough at the current stage.
> > + */
> > +#define VTD_MIN_HPASID              1
> > +#define VTD_MAX_HPASID              0xFFFFF
> 
> One more question: I see that PASID is defined as 20 bits long.  That's
> fine.  However I'm getting confused about how the Scalable Mode PASID
> Directory could serve that many PASID entries.
> 
> I'm looking at spec 3.4.3, Figure 3-8.
> 
> Firstly, we only have two levels for a PASID table.  The context entry
> of a device stores a pointer to the "Scalable Mode PASID Directory"
> page. I see that there're 2^14 entries in the "Scalable Mode PASID
> Directory" page, each pointing to a "Scalable Mode PASID Table".
> However... how do they fit in a 4K page if each entry is an x86_64
> pointer (8 bytes) and there're 2^14 entries?  Simple math gives me
> 4K/8 = 512, which means the "Scalable Mode PASID Directory" page can
> only have 512 entries, so where does the 2^14 come from?  Hmm??

I checked with Kevin. The spec doesn't say the dir table is 4K. It says 4K
only for the pasid table. Also, if you look at 9.4, the scalable-mode context
entry includes a PDTS field to specify the actual size of the directory table.
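
The arithmetic behind this can be sketched as below. The 64-byte PASID entry
follows from struct VTDPASIDEntry (uint64_t val[8]) and the 4K PASID table
from the statement above. The PDTS encoding used here, a directory of
2^(X + 7) entries for a field value X, is an assumption from one reading of
the VT-d spec and should be checked against section 9.4; all helper names
are hypothetical.

```c
#include <stdint.h>

#define PASID_BITS             20u    /* PASID width discussed in the thread */
#define PASID_ENTRY_BYTES      64u    /* struct VTDPASIDEntry: uint64_t val[8] */
#define PASID_TABLE_BYTES      4096u  /* 4K applies to the PASID table only */
#define PASID_DIR_ENTRY_BYTES  8u     /* one 64-bit pointer per directory entry */

/* 4096 / 64 = 64 entries per PASID table, i.e. the low 6 bits of a PASID. */
#define PASID_TABLE_ENTRIES    (PASID_TABLE_BYTES / PASID_ENTRY_BYTES)
#define PASID_TABLE_SHIFT      6u

/* Index into the second-level PASID table. */
static uint32_t pasid_table_index(uint32_t pasid)
{
    return pasid & (PASID_TABLE_ENTRIES - 1u);
}

/* Index into the PASID directory: the remaining 20 - 6 = 14 bits. */
static uint32_t pasid_dir_index(uint32_t pasid)
{
    uint32_t dir_bits = PASID_BITS - PASID_TABLE_SHIFT;

    return (pasid >> PASID_TABLE_SHIFT) & ((1u << dir_bits) - 1u);
}

/* ASSUMPTION: a PDTS value X encodes a directory of 2^(X + 7) entries. */
static uint32_t pdts_to_dir_entries(uint32_t pdts)
{
    return 1u << (pdts + 7u);
}

/* Size of the directory in bytes for a given PDTS value. */
static uint64_t pasid_dir_bytes(uint32_t pdts)
{
    return (uint64_t)pdts_to_dir_entries(pdts) * PASID_DIR_ENTRY_BYTES;
}
```

Under that assumption, PDTS = 2 gives the 512-entry, 4K directory from the
question, while PDTS = 7 gives the full 2^14 entries, which occupy 128K
rather than a single page.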

> Apart from this: I also just noticed (when reading the latter part of
> the series) that the time that a pasid table walk can consume will
> depend on this value too.  I'd suggest to make this as small as we
> can, as long as it satisfies the usage.  We can even bump it in the
> future.

I see. This looks to be an optimization, right? Instead of modifying the
value of this macro, I think we can do this optimization by tracking
the allocated PASIDs in QEMU. Thus, the pasid table walk would be more
efficient and would no longer depend on VTD_MAX_HPASID. Does it make
sense to you? :-)
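
One way to picture the tracking idea is a small set of allocated PASIDs that
the walk iterates over instead of scanning the whole range up to
VTD_MAX_HPASID. The sketch below is hypothetical throughout (PasidTracker,
its fixed capacity and the visitor callback are illustrative names, not part
of the series); a real implementation would more likely use a hash table or
a bitmap.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 1024  /* arbitrary capacity for this sketch */

/* Hypothetical set of PASIDs that have been allocated to the guest. */
typedef struct {
    uint32_t pasids[MAX_TRACKED];
    size_t count;
} PasidTracker;

/* Record an allocated PASID; returns false only if the set is full. */
static bool tracker_add(PasidTracker *t, uint32_t pasid)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->pasids[i] == pasid) {
            return true;            /* already tracked */
        }
    }
    if (t->count == MAX_TRACKED) {
        return false;
    }
    t->pasids[t->count++] = pasid;
    return true;
}

/* Forget a freed PASID; returns false if it was not tracked. */
static bool tracker_remove(PasidTracker *t, uint32_t pasid)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->pasids[i] == pasid) {
            t->pasids[i] = t->pasids[--t->count];  /* swap in the last one */
            return true;
        }
    }
    return false;
}

typedef void (*pasid_visit_fn)(uint32_t pasid, void *opaque);

/* Visit only the allocated PASIDs; the cost scales with count, not range. */
static void tracker_walk(const PasidTracker *t, pasid_visit_fn fn, void *opaque)
{
    for (size_t i = 0; i < t->count; i++) {
        fn(t->pasids[i], opaque);
    }
}

/* Example visitor: counts how many PASIDs were visited. */
static void count_visit(uint32_t pasid, void *opaque)
{
    (void)pasid;
    (*(unsigned *)opaque)++;
}
```

With such a set, the cost of a replay depends on how many PASIDs the guest
actually uses, so the upper bound of the PASID range no longer affects it.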

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string option
  2020-02-12 16:05         ` Peter Xu
@ 2020-02-13  2:44           ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-13  2:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, February 13, 2020 12:06 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string
> option
> 
> On Wed, Feb 12, 2020 at 07:28:24AM +0000, Liu, Yi L wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, February 12, 2020 3:44 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v3 13/25] intel_iommu: modify x-scalable-mode to
> > > be string option
> > >
> > > On Wed, Jan 29, 2020 at 04:16:44AM -0800, Liu, Yi L wrote:
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > Intel VT-d 3.0 introduces scalable mode, and it has a bunch of
> > > > capabilities related to scalable mode translation, thus there are
> > > > multiple combinations. This vIOMMU implementation wants to simplify
> > > > it for the user by providing typical combinations. The user can
> > > > configure it with the "x-scalable-mode" option. The usage is as below:
> > > >
> > > > "-device intel-iommu,x-scalable-mode=["legacy"|"modern"]"
> > >
> > > Maybe also "off" when someone wants to explicitly disable it?
> >
> > emmm, I think x-scalable-mode should be disabled by default. It is
> > enabled only when "legacy" or "modern" is configured. I'm fine with
> > adding "off" as an explicit way to turn it off if you think it is
> > necessary. :-)
> 
> It's not necessary.  It'll be necessary when we remove "x-" and change the
> default value.  However it'll always be good to provide all options explicitly in the
> parameter starting from when we design it, imho.  It's still experimental, so...
> Your call. :)

Got it. Let me add it in the next version. 😊

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-02-12 15:59             ` Peter Xu
@ 2020-02-13  2:46               ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-13  2:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tian, Kevin, Jacob Pan, Yi Sun, kvm, mst, Tian, Jun J,
	qemu-devel, eric.auger, alex.williamson, pbonzini, Wu, Hao, Sun,
	Yi Y, David Gibson

> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, February 13, 2020 12:00 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
> 
> On Wed, Feb 12, 2020 at 07:15:13AM +0000, Liu, Yi L wrote:
> 
> [...]
> 
> > While considering your suggestion on dropping one of the two abstraction
> > layers, I came up with a new proposal as below:
> >
> > We may drop the IOMMUContext in this series, and rename
> > DualStageIOMMUObject to HostIOMMUContext, which is per-vfio-container.
> > Add an interface in the PCI layer (e.g. a callback in PCIDevice) to let
> > the vIOMMU get the HostIOMMUContext. I think this could cover the
> > requirement of providing an explicit method for the vIOMMU to call into
> > VFIO and then program the host IOMMU.
> >
> > As for the requirement of VFIO-to-vIOMMU calls (e.g. PRQ), I think it
> > could be done via the PCI layer by adding an operation in PCIIOMMUOps.
> > Thoughts?
> 
> Hmm sounds good. :)
> 
> The thing is that the calls in the other direction (e.g. VFIO injecting
> faults into the vIOMMU) are neither per-container nor per-device, but
> per-vIOMMU.  PCIIOMMUOps suits that job I'd say, since it is per-vIOMMU.
> 
> Let's see how it goes.

Thanks, let me get a new version out by the end of this week.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
  2020-02-12 15:26         ` Peter Xu
@ 2020-02-13  2:59           ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-13  2:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 11:26 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> infrastructure
> 
> On Wed, Feb 12, 2020 at 08:37:30AM +0000, Liu, Yi L wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, February 12, 2020 7:36 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> > > infrastructure
> > >
> > > On Wed, Jan 29, 2020 at 04:16:47AM -0800, Liu, Yi L wrote:
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > > This patch adds a PASID cache management infrastructure based on
> > > > > the newly added structure VTDPASIDAddressSpace, which is used to
> > > > > track PASID usage and to support future PASID-tagged DMA address
> > > > > translation in vIOMMU.
> > > >
> > > >     struct VTDPASIDAddressSpace {
> > > >         VTDBus *vtd_bus;
> > > >         uint8_t devfn;
> > > >         AddressSpace as;
> > > >         uint32_t pasid;
> > > >         IntelIOMMUState *iommu_state;
> > > >         VTDContextCacheEntry context_cache_entry;
> > > >         QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > > >         VTDPASIDCacheEntry pasid_cache_entry;
> > > >     };
> > > >
> > > > > Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> > > > > is bound to a DMA AddressSpace. The Intel VT-d spec requires guest
> > > > > software to issue a pasid cache invalidation when binding or
> > > > > unbinding a pasid with an address space under caching-mode. However,
> > > > > as VTDPASIDAddressSpace instances also act as the pasid cache in
> > > > > this implementation, creation also happens during vIOMMU
> > > > > PASID-tagged DMA translation. The creation in this path is not
> > > > > added in this patch since there are no PASID-capable emulated
> > > > > devices for now.
> > > > >
> > > > > The implementation in this patch manages VTDPASIDAddressSpace
> > > > > instances per PASID+BDF (lookup and insert use PASID and
> > > > > BDF) since the Intel VT-d spec allows a per-BDF PASID table. When a
> > > > > guest binds a PASID to an AddressSpace, QEMU captures the
> > > > > guest's pasid-selective pasid cache invalidation, and allocates or
> > > > > removes a VTDPASIDAddressSpace instance according to the
> > > > > invalidation reason:
> > > > >
> > > > >     *) a present pasid entry moved to non-present
> > > > >     *) a present pasid entry modified to another present entry
> > > > >     *) a non-present pasid entry moved to present
> > > > >
> > > > > The vIOMMU emulator can figure out the reason by fetching the
> > > > > latest guest pasid entry.
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Peter Xu <peterx@redhat.com>
> > > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > ---
> > > > >  hw/i386/intel_iommu.c          | 367 +++++++++++++++++++++++++++++++++++++++++
> > > >  hw/i386/intel_iommu_internal.h |  14 ++
> > > >  hw/i386/trace-events           |   1 +
> > > >  include/hw/i386/intel_iommu.h  |  36 +++-
> > > >  4 files changed, 417 insertions(+), 1 deletion(-)
> > > >
> > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > index 58e7213..c75cb7b 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -40,6 +40,7 @@
> > > >  #include "kvm_i386.h"
> > > >  #include "migration/vmstate.h"
> > > >  #include "trace.h"
> > > > +#include "qemu/jhash.h"
> > > >
> > > >  /* context entry operations */
> > > > >  #define VTD_CE_GET_RID2PASID(ce) \
> > > > > @@ -65,6 +66,8 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> > > > >  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> > > >
> > > > +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
[...]
> > > > +
> > > > +/**
> > > > + * This function is used to clear pasid_cache_gen of cached pasid
> > > > + * entry in vtd_pasid_as instances. Caller of this function
> > > > +should
> > > > + * hold iommu_lock.
> > > > + */
> > > > +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> > > > +                                gpointer user_data) {
> > > > +    VTDPASIDCacheInfo *pc_info = user_data;
> > > > +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> > > > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > > > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > > > +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> > > > +    VTDPASIDEntry pe;
> > > > +    uint16_t did;
> > > > +    uint32_t pasid;
> > > > +    uint16_t devfn;
> > > > +    int ret;
> > > > +
> > > > +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> > > > +    pasid = vtd_pasid_as->pasid;
> > > > +    devfn = vtd_pasid_as->devfn;
> > > > +
> > > > +    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > > > +    case VTD_PASID_CACHE_PASIDSI:
> > > > +        if (pc_info->pasid != pasid) {
> > > > +            return false;
> > > > +        }
> > > > +        /* Fall through */
> > >
> > > Why fall through?
> >
> > > VTD_PASID_CACHE_PASIDSI implies domain selective, so the did must
> > > be checked just as for VTD_PASID_CACHE_DOMSI.
> 
> Ah right. :)
> 
> >
> > >
> > > > +    case VTD_PASID_CACHE_DOMSI:
> > > > +        if (pc_info->domain_id != did) {
> > > > +            return false;
> > > > +        }
> > > > +        /* Fall through */
> > >
> > > Same here.
> >
> > > If the code gets here, the necessary checks have passed. A break
> > > could be added here; however, as the case below does nothing but
> > > break, I let the code fall through.
> 
> Yes this is fine too.
> 
> >
> > >
> > > > +    case VTD_PASID_CACHE_GLOBAL:
> > > > +        break;
> > > > +    default:
> > >
> > > > Never reach here, right?  If so we can abort.
> >
> > yes, should never reach here.
> >
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    /*
> > > > +     * pasid cache invalidation may indicate a present pasid
> > > > +     * entry to present pasid entry modification. To cover such
> > > > +     * case, vIOMMU emulator needs to fetch latest guest pasid
> > > > +     * entry and check cached pasid entry, then update pasid
> > > > +     * cache and send pasid bind/unbind to host properly.
> > > > +     */
> > > > +    ret = vtd_dev_get_pe_from_pasid(s,
> > > > +                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
> > > > +    if (ret) {
> > > > +        /*
> > > > +         * No valid pasid entry in guest memory. e.g. pasid entry
> > > > +         * was modified to be either all-zero or non-present. Either
> > > > +         * case means existing pasid cache should be removed.
> > > > +         */
> > > > +        goto remove;
> > > > +    }
> > > > +    /* Compare cached pasid entry and latest pasid entry */
> > > > +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> > > > +        /* pasid entry was updated, thus update the pasid cache */
> > > > +        pc_entry->pasid_entry = pe;
> > > > +        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > > > +        /*
> > > > +         * TODO:
> > > > +         * - send pasid bind to host for passthru devices
> > > > > +         * - when pasid-based-iotlb (piotlb) infrastructure is ready,
> > > > > +         *   should invalidate QEMU piotlb together with this change.
> > > > +         */
> > > > +    }
> > > > +    return false;
> > > > +remove:
> > > > +    /*
> > > > +     * TODO:
> > > > +     * - send pasid unbind to host for passthru devices
> > > > > +     * - when pasid-based-iotlb (piotlb) infrastructure is ready,
> > > > > +     *   should invalidate QEMU piotlb together with this change.
> > > > +     */
> > > > +    return true;
> > > > +}
> > > > +
> > > > >  static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
> > > > >  {
> > > > +    VTDPASIDCacheInfo pc_info;
> > > > +
> > > > +    trace_vtd_pasid_cache_dsi(domain_id);
> > > > +
> > > > +    pc_info.flags = VTD_PASID_CACHE_DOMSI;
> > > > +    pc_info.domain_id = domain_id;
> > > > +
> > > > +    /*
> > > > +     * Loop all existing pasid caches and update them.
> > > > +     */
> > > > +    vtd_iommu_lock(s);
> > > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > > +                                 vtd_flush_pasid, &pc_info);
> > > > +
> > > > +    /*
> > > > +     * TODO: Domain selective PASID cache invalidation
> > > > +     * flushes all the pasid caches within a domain. To
> > > > +     * be safe, after invalidating the pasid caches, emulator
> > > > +     * needs to replay the pasid bindings by walking guest
> > > > +     * pasid dir and pasid table.
> > >
> > > > Better spell out what special case we're handling here: when the
> > > > guest sets up a new PASID entry and then sends a PASID DSI.
> >
> > oh, yes.  will add it in new version. :-)
> >
> > >
> > > > +     */
> > > > +    vtd_iommu_unlock(s);
> > > >      return 0;
> > > >  }
> > > >
> > > > +/**
> > > > + * This function finds or adds a VTDPASIDAddressSpace for a
> > > > +device
> > > > + * when it is bound to a pasid. Caller of this function should
> > > > +hold
> > > > + * iommu_lock.
> > > > + */
> > > > +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState
> *s,
> > > > +                                                   VTDBus *vtd_bus,
> > > > +                                                   int devfn,
> > > > +                                                   uint32_t pasid,
> > > > +                                                   bool allocate)
> > > > +{
> > > > +    struct pasid_key key;
> > > > +    struct pasid_key *new_key;
> > > > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > > > +    uint16_t sid;
> > > > +
> > > > +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> > > > +    vtd_init_pasid_key(pasid, sid, &key);
> > > > +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> > > > +
> > > > +    if (!vtd_pasid_as && allocate) {
> > > > +        new_key = g_malloc0(sizeof(*new_key));
> > > > +        vtd_init_pasid_key(pasid, sid, new_key);
> > > > +        /*
> > > > +         * Initiate the vtd_pasid_as structure.
> > > > +         *
> > > > +         * This structure here is used to track the guest pasid
> > > > > +         * binding and also serves as pasid-cache management entry.
> > > > > +         *
> > > > > +         * TODO: in future, to support SVA-aware DMA emulation,
> > > > > +         *       the vtd_pasid_as should include an
> > > > > +         *       AddressSpace to support DMA emulation.
> > > > +         */
> > > > +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> > > > +        vtd_pasid_as->iommu_state = s;
> > > > +        vtd_pasid_as->vtd_bus = vtd_bus;
> > > > +        vtd_pasid_as->devfn = devfn;
> > > > +        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
> > > > +        vtd_pasid_as->pasid = pasid;
> > > > +        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
> > > > +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> > > > +    }
> > > > +    return vtd_pasid_as;
> > > > +}
> > > > +
> > > > + /**
> > > > +  * This function updates the pasid entry cached in &vtd_pasid_as.
> > > > +  * Caller of this function should hold iommu_lock.
> > > > +  */
> > > > > +static inline void vtd_fill_in_pe_cache(
> > > > > +              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry *pe)
> > > > > +{
> > > > > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > > > > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > > > > +
> > > > > +    pc_entry->pasid_entry = *pe;
> > > > > +    pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > > > > +}
> > > > > +
> > > > >  static int vtd_pasid_cache_psi(IntelIOMMUState *s,
> > > > >                                 uint16_t domain_id, uint32_t pasid)
> > > > >  {
> > > > +    VTDPASIDCacheInfo pc_info;
> > > > +    VTDPASIDEntry pe;
> > > > +    VTDBus *vtd_bus;
> > > > +    int bus_n, devfn;
> > > > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > > > +    VTDIOMMUContext *vtd_icx;
> > > > +
> > > > +    /* PASID selective implies a DID selective */
> > > > +    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> > > > +    pc_info.domain_id = domain_id;
> > > > +    pc_info.pasid = pasid;
> > > > +
> > > > +    /*
> > > > > +     * A pasid selective pasid cache invalidation (PSI) can be
> > > > > +     * any of the cases below:
> > > > > +     * a) a present pasid entry moved to non-present
> > > > > +     * b) a present pasid entry modified to another present entry
> > > > > +     * c) a non-present pasid entry moved to present
> > > > > +     *
> > > > > +     * Here the handling of a PSI is:
> > > > > +     * 1) loop all the existing vtd_pasid_as instances to update them
> > > > > +     *    according to the latest guest pasid entry in pasid table.
> > > > > +     *    This makes sure the affected existing vtd_pasid_as instances
> > > > > +     *    cache the latest pasid entries. Also, during the loop, the
> > > > > +     *    host should be notified if needed, e.g. pasid unbind or pasid
> > > > > +     *    update. This covers case a) and case b).
> > > > > +     *
> > > > > +     * 2) loop all devices to cover case c)
> > > > > +     *    However, it is not good to always loop all devices. In this
> > > > > +     *    implementation, we do it this way:
> > > > > +     *    - For devices which have VTDIOMMUContext instances,
> > > > > +     *      we loop them and check if guest pasid entry exists. If yes,
> > > > > +     *      it is case c), we update the pasid cache and also notify
> > > > > +     *      host.
> > > > > +     *    - For devices which have no VTDIOMMUContext
> > > > > +     *      instances, it is not necessary to create pasid cache at
> > > > > +     *      this phase since it could be created when vIOMMU does DMA
> > > > > +     *      address translation. This is not implemented yet since
> > > > > +     *      there are no PASID-capable emulated devices today. If we
> > > > > +     *      have them in future, the pasid cache shall be created there.
> > > > +     */
> > > > +
> > > > +    vtd_iommu_lock(s);
> > > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > > +                                vtd_flush_pasid, &pc_info);
> > > > +
> > > > +    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
> > > > +        vtd_bus = vtd_icx->vtd_bus;
> > > > +        devfn = vtd_icx->devfn;
> > > > +        bus_n = pci_bus_num(vtd_bus->bus);
> > > > +
> > > > +        /* Step 1: fetch vtd_pasid_as and check if it is valid */
> > > > +        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
> > > > +                                        devfn, pasid, true);
> > >
> > > I feel like you wanted to pass "false" here for "allocate".
> >
> > > Hmm, yeah. It was "false" in the draft code, as step 1 only checks whether
> > > a valid vtd_pasid_as exists, and step 3 needs to call
> > > vtd_add_find_pasid_as() with "allocate" set to "true". Since
> > > vtd_add_find_pasid_as() first searches for a vtd_pasid_as and
> > > then allocates a new one, that logic results in two
> > > vtd_add_find_pasid_as() calls, which means two hash table lookups.
> > >
> > > So I modified it to "true" to save one vtd_add_find_pasid_as()
> > > call. If a vtd_pasid_as is valid, its pasid_cache_gen will be equal to
> > > s->pasid_cache_gen.
> > > If not, the vtd_pasid_as is newly allocated and needs to go through
> > > step 2 and step 3 to fill it in. It looks like I forgot to free the
> > > vtd_pasid_as when step 2 fails. I will add that if you are fine with
> > > the current logic.
> 
> > I see.  Note that vtd_add_find_pasid_as() is fast when nothing is allocated,
> > because a hash lookup is O(1).  However, I think the current approach is OK;
> > but if we go with that, we can also:
> > 
> > - Remove the allocate parameter of vtd_add_find_pasid_as(), since it's
> >   always true, even in future patches, and therefore useless,

right. :-)

> - Remove the vtd_pasid_as check right below because it's not needed.
> 
> >
> >
> > > > +        if (vtd_pasid_as &&
>                    ^^^^^^^^^^^^

Yes, it is. In the current series, vtd_add_find_pasid_as() doesn't check the
result of the vtd_pasid_as memory allocation, so there is no need to check
vtd_pasid_as here either. However, it might be better to check the allocation
result, otherwise a failed allocation will cause problems. What's your
preference here?
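
For reference, GLib documents that g_malloc0() terminates the application when
the allocation fails, so it does not return NULL for a non-zero size. The
sketch below illustrates the resulting find-or-add pattern with no NULL check
in the caller. It is a simplification under stated assumptions: it uses a plain
linked list and a hypothetical xmalloc0() helper instead of GHashTable and
g_malloc0(), and the pasid_node type is invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for VTDPASIDAddressSpace, keyed by PASID + source id. */
struct pasid_node {
    uint32_t pasid;
    uint16_t sid;
    uint32_t gen;               /* stays 0 until the entry is filled in */
    struct pasid_node *next;
};

/* Zeroing allocator that aborts instead of returning NULL (hypothetical). */
static void *xmalloc0(size_t size)
{
    void *p = calloc(1, size);

    if (!p) {
        fprintf(stderr, "out of memory\n");
        abort();
    }
    return p;
}

/* Return the node for (pasid, sid), creating it if absent; never NULL. */
static struct pasid_node *find_or_add(struct pasid_node **head,
                                      uint32_t pasid, uint16_t sid)
{
    struct pasid_node *n;

    assert(head != NULL);
    for (n = *head; n; n = n->next) {
        if (n->pasid == pasid && n->sid == sid) {
            return n;           /* existing entry: a single lookup suffices */
        }
    }
    n = xmalloc0(sizeof(*n));   /* zeroed, so gen == 0 marks "not yet valid" */
    n->pasid = pasid;
    n->sid = sid;
    n->next = *head;
    *head = n;
    return n;
}
```

A caller can then tell a freshly created node from a valid cached one purely
by comparing its gen field with the current generation, which is the check
that remains once the NULL test is dropped.
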

> > > > +            (s->pasid_cache_gen ==
> > > > +             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
> > > > +            /*
> > > > +             * pasid_cache_gen equals to s->pasid_cache_gen means
> > > > +             * vtd_pasid_as is valid after the above s->vtd_pasid_as
> > > > +             * updates. Thus no need for the below steps.
> > > > +             */
> > > > +            continue;
> > > > +        }
> > > > +
> > > > +        /*
> > > > > +         * Step 2: vtd_pasid_as is not valid, it's potentially a
> > > > +         * new pasid bind. Fetch guest pasid entry.
> > > > +         */
> > > > +        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
> > > > +            continue;
> > > > +        }
> > > > +
> > > > +        /*
> > > > +         * Step 3: pasid entry exists, update pasid cache
> > > > +         *
> > > > > +         * The domain ID needs to be checked here since the guest
> > > > > +         * pasid entry exists. What needs to be done:
> > > > +         *   - update the pc_entry in the vtd_pasid_as
> > > > +         *   - set proper pc_entry.pasid_cache_gen
> > > > +         *   - pass down the latest guest pasid entry config to host
> > > > +         *     (will be added in later patch)
> > > > +         */
> > > > +        if (domain_id == vtd_pe_get_domain_id(&pe)) {
> > > > +            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
> > > > +        }
> > > > +    }
> > > > +    vtd_iommu_unlock(s);
> > > >      return 0;
> > > >  }
> > > >
> > > > +/**
> > > > + * Caller of this function should hold iommu_lock  */ static void
> > > > +vtd_pasid_cache_reset(IntelIOMMUState *s) {
> > > > +    VTDPASIDCacheInfo pc_info;
> > > > +
> > > > +    trace_vtd_pasid_cache_reset();
> > > > +
> > > > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > > > +
> > > > +    /*
> > > > +     * Reset pasid cache is a big hammer, so use
> > > > +     * g_hash_table_foreach_remove which will free
> > > > +     * the vtd_pasid_as instances.
> > > > +     */
> > > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > > +                           vtd_flush_pasid, &pc_info);
> > > > +    s->pasid_cache_gen = 1;
> > > > +}
> > > > +
> > > > >  static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
> > > > >  {
> > > > +    trace_vtd_pasid_cache_gsi();
> > > > +
> > > > +    vtd_iommu_lock(s);
> > > > +    vtd_pasid_cache_reset(s);
> > >
> > > [1]
> > >
> > > > +
> > > > +    /*
> > > > > +     * TODO: Global PASID cache invalidation
> > > > > +     * flushes all the pasid caches. To be safe, after
> > > > +     * invalidating the pasid caches, emulator needs
> > > > +     * to replay the pasid bindings by walking guest
> > > > +     * pasid dir and pasid table.
> > > > +     */
> > > > +    vtd_iommu_unlock(s);
> > > >      return 0;
> > > >  }
> > > >
[...]
> > > > diff --git a/include/hw/i386/intel_iommu.h
> > > > b/include/hw/i386/intel_iommu.h index 4158116..3cc4b74 100644
> > > > --- a/include/hw/i386/intel_iommu.h
> > > > +++ b/include/hw/i386/intel_iommu.h
> > > > @@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress
> > > > VTD_IR_MSIAddress;  typedef struct VTDPASIDDirEntry
> > > > VTDPASIDDirEntry;  typedef struct VTDPASIDEntry VTDPASIDEntry;
> > > > typedef struct VTDIOMMUContext VTDIOMMUContext;
> > > > +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry; typedef
> > > > +struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
> > > >
> > > >  /* Context-Entry */
> > > >  struct VTDContextEntry {
> > > > @@ -101,6 +103,31 @@ struct VTDPASIDEntry {
> > > >      uint64_t val[8];
> > > >  };
> > > >
> > > > +struct pasid_key {
> > > > +    uint32_t pasid;
> > > > +    uint16_t sid;
> > > > +};
> > > > +
> > > > +struct VTDPASIDCacheEntry {
> > > > +    /*
> > > > +     * The cache entry is obsolete if
> > > > +     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
> > > > +     */
> > > > +    uint32_t pasid_cache_gen;
> > > > > +    struct VTDPASIDEntry pasid_entry;
> > > > > +};
> > > > +
> > > > +struct VTDPASIDAddressSpace {
> > > > +    VTDBus *vtd_bus;
> > > > +    uint8_t devfn;
> > > > +    AddressSpace as;
> > > > +    uint32_t pasid;
> > > > +    IntelIOMMUState *iommu_state;
> > > > +    VTDContextCacheEntry context_cache_entry;
> > > > +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > > > +    VTDPASIDCacheEntry pasid_cache_entry;
> > >
> > > In vtd_pasid_cache_gsi() [1], you directly reset pasid_cache_gen for
> > > each pasid address space.  You never increase
> > > pasid_cache_entry.pasid_cache_gen.  Then IIUC the gen will always be
> > > either 0 or 1.  And...
> > >
> > > > +};
> > > > +
> > > >  struct VTDAddressSpace {
> > > >      PCIBus *bus;
> > > >      uint8_t devfn;
> > > > @@ -122,6 +149,7 @@ struct VTDIOMMUContext {
> > > >      uint8_t devfn;
> > > >      IOMMUContext iommu_context;
> > > >      DualStageIOMMUObject *dsi_obj;
> > > > +    QLIST_ENTRY(VTDIOMMUContext) next;
> > > > >      IntelIOMMUState *iommu_state;
> > > > >  };
> > > >
> > > > @@ -272,9 +300,14 @@ struct IntelIOMMUState {
> > > >
> > > > >      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
> > > > >      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> > > > +    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
> > > > +    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */
> > >
> > > ... This should always be 1.
> > > IIUC you can drop both of the pasid_cache_gen, because in this whole
> > > patchset you'll remove the pasid hash entry when it is invalidated,
> > > right?  Then if the hash entry is there, it must be valid.  When
> > > it's out-dated, it'll be removed from the hash.
> >
> > > Oh, yes it is. However, that's not my intention. I'd like [1] to
> > > increase s->pasid_cache_gen instead of just resetting it. I think that
> > > will save some time, as looping over the hash table takes time. Thanks
> > > for catching it. :-)
> 
> OK that's fine too.  Then remember to conditionally reset it:
> 
> static int vtd_pasid_cache_gsi(IntelIOMMUState *s) {
>     trace_vtd_pasid_cache_gsi();
> 
>     vtd_iommu_lock(s);
>     s->pasid_cache_gen++;
>     if (s->pasid_cache_gen >= THRESHOLD) {
>         vtd_pasid_cache_reset(s);
>     }
>     vtd_iommu_unlock(s);
> 
>     return 0;
> }

thanks for the pseudo code. :-)
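
To make the generation-counter idea above concrete, here is a minimal
standalone sketch. It is an illustration under assumptions, not the planned
QEMU code: the struct and function names are invented, GEN_THRESHOLD stands in
for the THRESHOLD placeholder in the pseudo code and is deliberately tiny, and
a fixed array replaces the hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GEN_THRESHOLD 4u   /* hypothetical; tiny so the wrap is easy to see */
#define NR_ENTRIES    3

struct entry {
    uint32_t gen;          /* 0 means "never filled" */
    int payload;
};

struct cache {
    uint32_t gen;          /* current generation, kept in [1, GEN_THRESHOLD) */
    struct entry e[NR_ENTRIES];
};

/* Big hammer: walk every entry, mark it stale, restart the generation. */
static void cache_reset(struct cache *c)
{
    for (size_t i = 0; i < NR_ENTRIES; i++) {
        c->e[i].gen = 0;
    }
    c->gen = 1;
}

static void cache_init(struct cache *c)
{
    for (size_t i = 0; i < NR_ENTRIES; i++) {
        c->e[i].payload = 0;
    }
    cache_reset(c);
}

/* An entry is valid only if it was filled in the current generation. */
static bool entry_valid(const struct cache *c, const struct entry *e)
{
    assert(c->gen >= 1);
    return e->gen == c->gen;
}

static void entry_fill(const struct cache *c, struct entry *e, int payload)
{
    e->payload = payload;
    e->gen = c->gen;
}

/* Global invalidation: O(1) bump, with a full walk only near the limit. */
static void cache_invalidate_all(struct cache *c)
{
    c->gen++;
    if (c->gen >= GEN_THRESHOLD) {
        cache_reset(c);
    }
}
```

The walk in cache_reset() matters at the wrap point: without zeroing the
per-entry generations, an entry filled during an earlier generation 1 would
look valid again once the counter restarts at 1.
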

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
@ 2020-02-13  2:59           ` Liu, Yi L
  0 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-13  2:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tian, Kevin, Jacob Pan, Yi Sun, Eduardo Habkost, kvm, mst, Tian,
	Jun J, qemu-devel, eric.auger, alex.williamson, pbonzini, Wu,
	Hao, Sun, Yi Y, Richard Henderson, david

> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, February 12, 2020 11:26 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> infrastructure
> 
> On Wed, Feb 12, 2020 at 08:37:30AM +0000, Liu, Yi L wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, February 12, 2020 7:36 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> > > infrastructure
> > >
> > > On Wed, Jan 29, 2020 at 04:16:47AM -0800, Liu, Yi L wrote:
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > This patch adds a PASID cache management infrastructure based on
> > > > new added structure VTDPASIDAddressSpace, which is used to track
> > > > the PASID usage and future PASID tagged DMA address translation
> > > > support in vIOMMU.
> > > >
> > > >     struct VTDPASIDAddressSpace {
> > > >         VTDBus *vtd_bus;
> > > >         uint8_t devfn;
> > > >         AddressSpace as;
> > > >         uint32_t pasid;
> > > >         IntelIOMMUState *iommu_state;
> > > >         VTDContextCacheEntry context_cache_entry;
> > > >         QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > > >         VTDPASIDCacheEntry pasid_cache_entry;
> > > >     };
> > > >
> > > > Ideally, a VTDPASIDAddressSpace instance is created when a PASID
> > > > is bound with a DMA AddressSpace. Intel VT-d spec requires guest
> > > > software to issue pasid cache invalidation when bind or unbind a
> > > > pasid with an address space under caching-mode. However, as
> > > > VTDPASIDAddressSpace instances also act as pasid cache in this
> > > > implementation, its creation also happens during vIOMMU PASID
> > > > tagged DMA translation. The creation in this path will not be
> > > > added in this patch since no PASID-capable emulated devices for
> > > > now.
> > > >
> > > > The implementation in this patch manages VTDPASIDAddressSpace
> > > > instances per PASID+BDF (lookup and insert will use PASID and
> > > > BDF) since Intel VT-d spec allows per-BDF PASID Table. When a
> > > > guest bind a PASID with an AddressSpace, QEMU will capture the
> > > > guest pasid selective pasid cache invalidation, and allocate
> > > > remove a VTDPASIDAddressSpace instance per the invalidation
> > > > reasons:
> > > >
> > > >     *) a present pasid entry moved to non-present
> > > >     *) a present pasid entry to be a present entry
> > > >     *) a non-present pasid entry moved to present
> > > >
> > > > vIOMMU emulator could figure out the reason by fetching latest
> > > > guest pasid entry.
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Peter Xu <peterx@redhat.com>
> > > > Cc: Yi Sun <yi.y.sun@linux.intel.com>
> > > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > > Cc: Richard Henderson <rth@twiddle.net>
> > > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > ---
> > > >  hw/i386/intel_iommu.c          | 367
> > > +++++++++++++++++++++++++++++++++++++++++
> > > >  hw/i386/intel_iommu_internal.h |  14 ++
> > > >  hw/i386/trace-events           |   1 +
> > > >  include/hw/i386/intel_iommu.h  |  36 +++-
> > > >  4 files changed, 417 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> > > > 58e7213..c75cb7b 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -40,6 +40,7 @@
> > > >  #include "kvm_i386.h"
> > > >  #include "migration/vmstate.h"
> > > >  #include "trace.h"
> > > > +#include "qemu/jhash.h"
> > > >
> > > >  /* context entry operations */
> > > >  #define VTD_CE_GET_RID2PASID(ce) \ @@ -65,6 +66,8 @@  static void
> > > > vtd_address_space_refresh_all(IntelIOMMUState *s);  static void
> > > > vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
> > > *n);
> > > >
> > > > +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
[...]
> > > > +
> > > > +/**
> > > > + * This function is used to clear pasid_cache_gen of cached pasid
> > > > + * entry in vtd_pasid_as instances. Caller of this function
> > > > +should
> > > > + * hold iommu_lock.
> > > > + */
> > > > +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> > > > +                                gpointer user_data) {
> > > > +    VTDPASIDCacheInfo *pc_info = user_data;
> > > > +    VTDPASIDAddressSpace *vtd_pasid_as = value;
> > > > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > > > +    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
> > > > +    VTDBus *vtd_bus = vtd_pasid_as->vtd_bus;
> > > > +    VTDPASIDEntry pe;
> > > > +    uint16_t did;
> > > > +    uint32_t pasid;
> > > > +    uint16_t devfn;
> > > > +    int ret;
> > > > +
> > > > +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> > > > +    pasid = vtd_pasid_as->pasid;
> > > > +    devfn = vtd_pasid_as->devfn;
> > > > +
> > > > +    if (!(pc_entry->pasid_cache_gen == s->pasid_cache_gen)) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    switch (pc_info->flags & VTD_PASID_CACHE_INFO_MASK) {
> > > > +    case VTD_PASID_CACHE_PASIDSI:
> > > > +        if (pc_info->pasid != pasid) {
> > > > +            return false;
> > > > +        }
> > > > +        /* Fall through */
> > >
> > > Why fall through?
> >
> > For VTD_PASID_CACHE_PASIDSI, it implies domain selective, so it
> > requires to check did just as VTD_PASID_CACHE_DOMSI.
> 
> Ah right. :)
> 
> >
> > >
> > > > +    case VTD_PASID_CACHE_DOMSI:
> > > > +        if (pc_info->domain_id != did) {
> > > > +            return false;
> > > > +        }
> > > > +        /* Fall through */
> > >
> > > Same here.
> >
> > If code comes to here, it means the necessary checks are passed.
> > Should add a break here. However, as the below case does nothing and
> > just calls break. So I let the code fall through.
> 
> Yes this is fine too.
> 
> >
> > >
> > > > +    case VTD_PASID_CACHE_GLOBAL:
> > > > +        break;
> > > > +    default:
> > >
> > > Nevee reach here right?  If so we can abort.
> >
> > yes, should never reach here.
> >
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    /*
> > > > +     * pasid cache invalidation may indicate a present pasid
> > > > +     * entry to present pasid entry modification. To cover such
> > > > +     * case, vIOMMU emulator needs to fetch latest guest pasid
> > > > +     * entry and check cached pasid entry, then update pasid
> > > > +     * cache and send pasid bind/unbind to host properly.
> > > > +     */
> > > > +    ret = vtd_dev_get_pe_from_pasid(s,
> > > > +                  pci_bus_num(vtd_bus->bus), devfn, pasid, &pe);
> > > > +    if (ret) {
> > > > +        /*
> > > > +         * No valid pasid entry in guest memory. e.g. pasid entry
> > > > +         * was modified to be either all-zero or non-present. Either
> > > > +         * case means existing pasid cache should be removed.
> > > > +         */
> > > > +        goto remove;
> > > > +    }
> > > > +    /* Compare cached pasid entry and latest pasid entry */
> > > > +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> > > > +        /* pasid entry was updated, thus update the pasid cache */
> > > > +        pc_entry->pasid_entry = pe;
> > > > +        pc_entry->pasid_cache_gen = s->pasid_cache_gen;
> > > > +        /*
> > > > +         * TODO:
> > > > +         * - send pasid bind to host for passthru devices
> > > > +         * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > > > > +         *   should invalidate QEMU piotlb together with this change.
> > > > +         */
> > > > +    }
> > > > +    return false;
> > > > +remove:
> > > > +    /*
> > > > +     * TODO:
> > > > +     * - send pasid unbind to host for passthru devices
> > > > +     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
> > > > > +     *   should invalidate QEMU piotlb together with this change.
> > > > +     */
> > > > +    return true;
> > > > +}
> > > > +
> > > > >  static int vtd_pasid_cache_dsi(IntelIOMMUState *s, uint16_t domain_id)
> > > > >  {
> > > > +    VTDPASIDCacheInfo pc_info;
> > > > +
> > > > +    trace_vtd_pasid_cache_dsi(domain_id);
> > > > +
> > > > +    pc_info.flags = VTD_PASID_CACHE_DOMSI;
> > > > +    pc_info.domain_id = domain_id;
> > > > +
> > > > +    /*
> > > > +     * Loop all existing pasid caches and update them.
> > > > +     */
> > > > +    vtd_iommu_lock(s);
> > > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > > +                                 vtd_flush_pasid, &pc_info);
> > > > +
> > > > +    /*
> > > > +     * TODO: Domain selective PASID cache invalidation
> > > > +     * flushes all the pasid caches within a domain. To
> > > > +     * be safe, after invalidating the pasid caches, emulator
> > > > +     * needs to replay the pasid bindings by walking guest
> > > > +     * pasid dir and pasid table.
> > >
> > > Better spell out what special case we're handling here: when the
> > > guest sets up a new PASID entry and then sends a PASID DSI.
> >
> > oh, yes.  will add it in new version. :-)
> >
> > >
> > > > +     */
> > > > +    vtd_iommu_unlock(s);
> > > >      return 0;
> > > >  }
> > > >
> > > > +/**
> > > > > + * This function finds or adds a VTDPASIDAddressSpace for a device
> > > > > + * when it is bound to a pasid. Caller of this function should hold
> > > > > + * iommu_lock.
> > > > > + */
> > > > > +static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
> > > > +                                                   VTDBus *vtd_bus,
> > > > +                                                   int devfn,
> > > > +                                                   uint32_t pasid,
> > > > +                                                   bool allocate)
> > > > +{
> > > > +    struct pasid_key key;
> > > > +    struct pasid_key *new_key;
> > > > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > > > +    uint16_t sid;
> > > > +
> > > > +    sid = vtd_make_source_id(pci_bus_num(vtd_bus->bus), devfn);
> > > > +    vtd_init_pasid_key(pasid, sid, &key);
> > > > +    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
> > > > +
> > > > +    if (!vtd_pasid_as && allocate) {
> > > > +        new_key = g_malloc0(sizeof(*new_key));
> > > > +        vtd_init_pasid_key(pasid, sid, new_key);
> > > > +        /*
> > > > +         * Initiate the vtd_pasid_as structure.
> > > > +         *
> > > > +         * This structure here is used to track the guest pasid
> > > > > +         * binding and also serves as pasid-cache management entry.
> > > > > +         *
> > > > > +         * TODO: in future, if we want to support SVA-aware DMA
> > > > > +         *       emulation, the vtd_pasid_as should include an
> > > > > +         *       AddressSpace to support DMA emulation.
> > > > +         */
> > > > +        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
> > > > +        vtd_pasid_as->iommu_state = s;
> > > > +        vtd_pasid_as->vtd_bus = vtd_bus;
> > > > +        vtd_pasid_as->devfn = devfn;
> > > > +        vtd_pasid_as->context_cache_entry.context_cache_gen = 0;
> > > > +        vtd_pasid_as->pasid = pasid;
> > > > +        vtd_pasid_as->pasid_cache_entry.pasid_cache_gen = 0;
> > > > +        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
> > > > +    }
> > > > +    return vtd_pasid_as;
> > > > +}
> > > > +
> > > > + /**
> > > > +  * This function updates the pasid entry cached in &vtd_pasid_as.
> > > > +  * Caller of this function should hold iommu_lock.
> > > > +  */
> > > > +static inline void vtd_fill_in_pe_cache(
> > > > +              VTDPASIDAddressSpace *vtd_pasid_as, VTDPASIDEntry
> > > > +*pe) {
> > > > +    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
> > > > +    VTDPASIDCacheEntry *pc_entry =
> > > > +&vtd_pasid_as->pasid_cache_entry;
> > > > +
> > > > +    pc_entry->pasid_entry = *pe;
> > > > +    pc_entry->pasid_cache_gen = s->pasid_cache_gen; }
> > > > +
> > > > >  static int vtd_pasid_cache_psi(IntelIOMMUState *s,
> > > > >                                 uint16_t domain_id, uint32_t pasid)
> > > > >  {
> > > > +    VTDPASIDCacheInfo pc_info;
> > > > +    VTDPASIDEntry pe;
> > > > +    VTDBus *vtd_bus;
> > > > +    int bus_n, devfn;
> > > > +    VTDPASIDAddressSpace *vtd_pasid_as;
> > > > +    VTDIOMMUContext *vtd_icx;
> > > > +
> > > > +    /* PASID selective implies a DID selective */
> > > > +    pc_info.flags = VTD_PASID_CACHE_PASIDSI;
> > > > +    pc_info.domain_id = domain_id;
> > > > +    pc_info.pasid = pasid;
> > > > +
> > > > +    /*
> > > > > +     * A pasid selective pasid cache invalidation (PSI) can be
> > > > > +     * any of the cases below:
> > > > +     * a) a present pasid entry moved to non-present
> > > > +     * b) a present pasid entry to be a present entry
> > > > +     * c) a non-present pasid entry moved to present
> > > > +     *
> > > > +     * Here the handling of a PSI is:
> > > > > +     * 1) loop all the existing vtd_pasid_as instances to update them
> > > > +     *    according to the latest guest pasid entry in pasid table.
> > > > +     *    this will make sure affected existing vtd_pasid_as instances
> > > > +     *    cached the latest pasid entries. Also, during the loop, the
> > > > +     *    host should be notified if needed. e.g. pasid unbind or pasid
> > > > +     *    update. Should be able to cover case a) and case b).
> > > > +     *
> > > > +     * 2) loop all devices to cover case c)
> > > > > +     *    However, it is not good to always loop all devices. In this
> > > > > +     *    implementation, we do it this way:
> > > > +     *    - For devices which have VTDIOMMUContext instances,
> > > > +     *      we loop them and check if guest pasid entry exists. If yes,
> > > > +     *      it is case c), we update the pasid cache and also notify
> > > > +     *      host.
> > > > +     *    - For devices which have no VTDIOMMUContext
> > > > +     *      instances, it is not necessary to create pasid cache at
> > > > > +     *      this phase since it could be created when vIOMMU does DMA
> > > > > +     *      address translation. This is not implemented yet since
> > > > > +     *      there are no PASID-capable emulated devices today. If we
> > > > > +     *      have them in future, the pasid cache shall be created there.
> > > > +     */
> > > > +
> > > > +    vtd_iommu_lock(s);
> > > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > > +                                vtd_flush_pasid, &pc_info);
> > > > +
> > > > +    QLIST_FOREACH(vtd_icx, &s->vtd_dev_icx_list, next) {
> > > > +        vtd_bus = vtd_icx->vtd_bus;
> > > > +        devfn = vtd_icx->devfn;
> > > > +        bus_n = pci_bus_num(vtd_bus->bus);
> > > > +
> > > > +        /* Step 1: fetch vtd_pasid_as and check if it is valid */
> > > > +        vtd_pasid_as = vtd_add_find_pasid_as(s, vtd_bus,
> > > > +                                        devfn, pasid, true);
> > >
> > > I feel like you wanted to pass "false" here for "allocate".
> >
> > emmm, yeah. It was "false" in draft code, as step 1 is only to check
> > whether a valid vtd_pasid_as exists, and step 3 then needs to call
> > vtd_add_find_pasid_as() with "allocate" set to "true". Since
> > vtd_add_find_pasid_as() searches for the vtd_pasid_as first and only
> > then allocates a new one, that logic means two vtd_add_find_pasid_as()
> > calls and therefore two hash table lookups.
> >
> > So I modified it to be "true" to save a vtd_add_find_pasid_as()
> > call. If a vtd_pasid_as is valid, its pasid_cache_gen will be equal to
> > s->pasid_cache_gen. If not, the vtd_pasid_as is newly allocated and
> > needs to go through step 2 and step 3 to be filled in. Looks like I
> > forgot to free the vtd_pasid_as when step 2 fails. Will add that if you
> > are fine with the current logic.
> 
> I see.  Note that vtd_add_find_pasid_as() is fast without allocation, because
> hash lookup is O(1).  However I think the current approach is ok, but with
> that, we can also:
> 
> - Remove the allocate parameter for vtd_add_find_pasid_as(), since it's
>   always true, even in future patches, and so useless,

right. :-)

> - Remove the vtd_pasid_as check right below because it's not needed.
> 
> >
> >
> > > > +        if (vtd_pasid_as &&
>                    ^^^^^^^^^^^^

yes, it is. In the current series vtd_add_find_pasid_as() doesn’t check the
result of the vtd_pasid_as memory allocation, so there is no need to check
vtd_pasid_as here either. However, it might be better to check the allocation
result, or there will be an issue if the allocation fails. What's your
preference here?

> > > > +            (s->pasid_cache_gen ==
> > > > +             vtd_pasid_as->pasid_cache_entry.pasid_cache_gen)) {
> > > > +            /*
> > > > +             * pasid_cache_gen equals to s->pasid_cache_gen means
> > > > +             * vtd_pasid_as is valid after the above s->vtd_pasid_as
> > > > +             * updates. Thus no need for the below steps.
> > > > +             */
> > > > +            continue;
> > > > +        }
> > > > +
> > > > +        /*
> > > > > +         * Step 2: vtd_pasid_as is not valid, it's potentially a
> > > > +         * new pasid bind. Fetch guest pasid entry.
> > > > +         */
> > > > +        if (vtd_dev_get_pe_from_pasid(s, bus_n, devfn, pasid, &pe)) {
> > > > +            continue;
> > > > +        }
> > > > +
> > > > +        /*
> > > > +         * Step 3: pasid entry exists, update pasid cache
> > > > +         *
> > > > > +         * The domain ID needs to be checked here since a guest pasid
> > > > > +         * entry exists. What needs to be done:
> > > > +         *   - update the pc_entry in the vtd_pasid_as
> > > > +         *   - set proper pc_entry.pasid_cache_gen
> > > > +         *   - pass down the latest guest pasid entry config to host
> > > > +         *     (will be added in later patch)
> > > > +         */
> > > > +        if (domain_id == vtd_pe_get_domain_id(&pe)) {
> > > > +            vtd_fill_in_pe_cache(vtd_pasid_as, &pe);
> > > > +        }
> > > > +    }
> > > > +    vtd_iommu_unlock(s);
> > > >      return 0;
> > > >  }
> > > >
> > > > +/**
> > > > + * Caller of this function should hold iommu_lock  */ static void
> > > > +vtd_pasid_cache_reset(IntelIOMMUState *s) {
> > > > +    VTDPASIDCacheInfo pc_info;
> > > > +
> > > > +    trace_vtd_pasid_cache_reset();
> > > > +
> > > > +    pc_info.flags = VTD_PASID_CACHE_GLOBAL;
> > > > +
> > > > +    /*
> > > > +     * Reset pasid cache is a big hammer, so use
> > > > +     * g_hash_table_foreach_remove which will free
> > > > +     * the vtd_pasid_as instances.
> > > > +     */
> > > > +    g_hash_table_foreach_remove(s->vtd_pasid_as,
> > > > +                           vtd_flush_pasid, &pc_info);
> > > > +    s->pasid_cache_gen = 1;
> > > > +}
> > > > +
> > > > >  static int vtd_pasid_cache_gsi(IntelIOMMUState *s)
> > > > >  {
> > > > +    trace_vtd_pasid_cache_gsi();
> > > > +
> > > > +    vtd_iommu_lock(s);
> > > > +    vtd_pasid_cache_reset(s);
> > >
> > > [1]
> > >
> > > > +
> > > > +    /*
> > > > > +     * TODO: Global PASID cache invalidation
> > > > > +     * flushes all the pasid caches. To be safe, after
> > > > +     * invalidating the pasid caches, emulator needs
> > > > +     * to replay the pasid bindings by walking guest
> > > > +     * pasid dir and pasid table.
> > > > +     */
> > > > +    vtd_iommu_unlock(s);
> > > >      return 0;
> > > >  }
> > > >
[...]
> > > > > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> > > > > index 4158116..3cc4b74 100644
> > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > @@ -69,6 +69,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
> > > > >  typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
> > > > >  typedef struct VTDPASIDEntry VTDPASIDEntry;
> > > > >  typedef struct VTDIOMMUContext VTDIOMMUContext;
> > > > > +typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
> > > > > +typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
> > > >
> > > >  /* Context-Entry */
> > > >  struct VTDContextEntry {
> > > > @@ -101,6 +103,31 @@ struct VTDPASIDEntry {
> > > >      uint64_t val[8];
> > > >  };
> > > >
> > > > +struct pasid_key {
> > > > +    uint32_t pasid;
> > > > +    uint16_t sid;
> > > > +};
> > > > +
> > > > +struct VTDPASIDCacheEntry {
> > > > +    /*
> > > > +     * The cache entry is obsolete if
> > > > +     * pasid_cache_gen!=IntelIOMMUState.pasid_cache_gen
> > > > +     */
> > > > +    uint32_t pasid_cache_gen;
> > > > > +    struct VTDPASIDEntry pasid_entry;
> > > > > +};
> > > > +
> > > > +struct VTDPASIDAddressSpace {
> > > > +    VTDBus *vtd_bus;
> > > > +    uint8_t devfn;
> > > > +    AddressSpace as;
> > > > +    uint32_t pasid;
> > > > +    IntelIOMMUState *iommu_state;
> > > > +    VTDContextCacheEntry context_cache_entry;
> > > > +    QLIST_ENTRY(VTDPASIDAddressSpace) next;
> > > > +    VTDPASIDCacheEntry pasid_cache_entry;
> > >
> > > In vtd_pasid_cache_gsi() [1], you directly reset pasid_cache_gen for
> > > each pasid address space.  You never increase
> > > pasid_cache_entry.pasid_cache_gen.  Then IIUC the gen will always be
> > > either 0 or 1.  And...
> > >
> > > > +};
> > > > +
> > > >  struct VTDAddressSpace {
> > > >      PCIBus *bus;
> > > >      uint8_t devfn;
> > > > @@ -122,6 +149,7 @@ struct VTDIOMMUContext {
> > > >      uint8_t devfn;
> > > >      IOMMUContext iommu_context;
> > > >      DualStageIOMMUObject *dsi_obj;
> > > > +    QLIST_ENTRY(VTDIOMMUContext) next;
> > > > >      IntelIOMMUState *iommu_state;
> > > > >  };
> > > >
> > > > @@ -272,9 +300,14 @@ struct IntelIOMMUState {
> > > >
> > > > >      GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* reference */
> > > > >      VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
> > > > +    GHashTable *vtd_pasid_as;   /* VTDPASIDAddressSpace instances */
> > > > +    uint32_t pasid_cache_gen;   /* Should be in [1,MAX] */
> > >
> > > ... This should always be 1.
> > > IIUC you can drop both of the pasid_cache_gen, because in this whole
> > > patchset you'll remove the pasid hash entry when it is invalidated,
> > > right?  Then if the hash entry is there, it must be valid.  When
> > > it's out-dated, it'll be removed from the hash.
> >
> > Oh, yes it is. However, it's not my intention. I'd like to let [1]
> > increase s->pasid_cache_gen instead of just resetting it. I think it
> > will save some time, as looping over the hash table takes time. Thanks
> > for catching it. :-)
> 
> OK that's fine too.  Then remember to conditionally reset it:
> 
> static int vtd_pasid_cache_gsi(IntelIOMMUState *s) {
>     trace_vtd_pasid_cache_gsi();
> 
>     vtd_iommu_lock(s);
>     s->pasid_cache_gen++;
>     if (s->pasid_cache_gen >= THRESHOLD) {
>         vtd_pasid_cache_reset(s);
>     }
>     vtd_iommu_unlock(s);
> 
>     return 0;
> }

thanks for the pseudo code. :-)
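
To make the idea concrete, here is a minimal self-contained sketch of generation-based invalidation with a conditional reset. It is not the series' code: `GEN_MAX`, `struct gen_cache` and the `gen_cache_*()` helpers are hypothetical, and the threshold is deliberately tiny so the wrap case is easy to exercise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GEN_MAX   4u   /* hypothetical threshold, tiny on purpose */
#define N_ENTRIES 4

struct gen_cache {
    uint32_t gen;                   /* stays in [1, GEN_MAX) */
    uint32_t entry_gen[N_ENTRIES];  /* 0 means "never valid" */
};

static void gen_cache_init(struct gen_cache *c)
{
    c->gen = 1;
    for (size_t i = 0; i < N_ENTRIES; i++) {
        c->entry_gen[i] = 0;
    }
}

/* Mark entry i as holding data fetched in the current generation. */
static void gen_cache_fill(struct gen_cache *c, size_t i)
{
    assert(i < N_ENTRIES);
    c->entry_gen[i] = c->gen;
}

static bool gen_cache_valid(const struct gen_cache *c, size_t i)
{
    assert(i < N_ENTRIES);
    return c->entry_gen[i] == c->gen;
}

/*
 * Global invalidation: normally an O(1) counter bump. Only when the
 * counter reaches the threshold are all entries swept and the counter
 * restarted at 1, so an old entry can never match a reused value.
 */
static void gen_cache_invalidate_all(struct gen_cache *c)
{
    c->gen++;
    if (c->gen >= GEN_MAX) {
        for (size_t i = 0; i < N_ENTRIES; i++) {
            c->entry_gen[i] = 0;
        }
        c->gen = 1;
    }
}
```

The sweep on wrap is what keeps the scheme correct: without it, an entry filled during generation 1 would look valid again once the counter restarts at 1.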

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
  2020-02-13  2:40       ` Liu, Yi L
@ 2020-02-13 14:31         ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-13 14:31 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

On Thu, Feb 13, 2020 at 02:40:45AM +0000, Liu, Yi L wrote:
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Wednesday, February 12, 2020 5:57 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
> > 
> > On Wed, Jan 29, 2020 at 04:16:45AM -0800, Liu, Yi L wrote:
> > > +/*
> > > + * The basic idea is to let the hypervisor set a range of available
> > > + * PASIDs for VMs. One of the reasons is that PASID #0 is reserved for
> > > + * RID_PASID usage. We have no idea how many PASIDs will be reserved in
> > > + * future, so this is just an estimated value. Honestly, setting it to
> > > + * "1" is enough at the current stage.
> > > + */
> > > +#define VTD_MIN_HPASID              1
> > > +#define VTD_MAX_HPASID              0xFFFFF
> > 
> > One more question: I see that PASID is defined as 20 bits long.  It's
> > fine.  However I start to get confused about how the Scalable Mode PASID
> > Directory could serve that many PASID entries.
> > 
> > I'm looking at spec 3.4.3, Figure 3-8.
> > 
> > Firstly, we only have two levels for a PASID table.  The context entry
> > of a device stores a pointer to the "Scalable Mode PASID Directory"
> > page. I see that there're 2^14 entries in "Scalable Mode PASID
> > Directory" page, each is a "Scalable Mode PASID Table".
> > However... how do we fit in the 4K page if each entry is an x86_64
> > pointer (8 bytes) while there are 2^14 entries?  Simple math gives me
> > 4K/8 = 512, which means the "Scalable Mode PASID Directory" page can
> > only have 512 entries, so where does the 2^14 come from?  Hmm??
> 
> I checked with Kevin. The spec doesn't say the dir table is 4K. It says 4K
> only for pasid table. Also, if you look at 9.4, scalable-mode context entry
> includes a PDTS field to specify the actual size of the directory table.

Ah I see.  Then it seems that is missing in this series.  Say, I think
vtd_sm_pasid_table_walk() should also stop walking once it reaches the
size given there, and you need to fetch that size info from the context
entry before the walk starts.
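
For reference, the size arithmetic behind this can be written down as a small sketch. The assumptions are mine and should be checked against the spec: PDTS is taken to be a 3-bit field where a value X means a directory of 2^(X+7) entries, directory entries are 8 bytes, and a 4K PASID table holds 64 entries of 64 bytes (matching the `uint64_t val[8]` layout of VTDPASIDEntry). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PASID_DIR_ENTRY_SIZE 8u   /* bytes per directory entry */
#define PASID_TABLE_ENTRIES  64u  /* 4 KiB table / 64-byte PASID entry */

/* Directory entries for a PDTS value X, assuming the 2^(X+7) encoding. */
static uint32_t pasid_dir_entries(unsigned pdts)
{
    assert(pdts <= 7);   /* assumed 3-bit field */
    return 1u << (pdts + 7);
}

/* Size of the directory itself, in bytes. */
static uint32_t pasid_dir_bytes(unsigned pdts)
{
    return pasid_dir_entries(pdts) * PASID_DIR_ENTRY_SIZE;
}

/* Number of PASIDs reachable through a directory of that size. */
static uint32_t pasid_dir_coverage(unsigned pdts)
{
    return pasid_dir_entries(pdts) * PASID_TABLE_ENTRIES;
}
```

Under these assumptions a 4K directory (512 entries) corresponds to PDTS = 2, while the full 20-bit PASID space needs PDTS = 7, i.e. 2^14 entries in a 128K directory.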

> 
> > Apart from this: also I just noticed (when reading the latter part of
> > the series) that the time that a pasid table walk can consume will
> > depend on this value too.  I'd suggest to make this as small as we
> > can, as long as it satisfies the usage.  We can even bump it in the
> > future.
> 
> I see. This looks to be an optimization, right? Instead of modifying the
> value of this macro, I think we can do this optimization by tracking
> the allocated PASIDs in QEMU. Thus, the pasid table walk would be more
> efficient and would not depend on VTD_MAX_HPASID. Does it make
> sense to you? :-)

Yeah sounds good. :)

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
  2020-02-13 14:31         ` Peter Xu
@ 2020-02-13 15:08           ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-13 15:08 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

On Thu, Feb 13, 2020 at 09:31:10AM -0500, Peter Xu wrote:

[...]

> > > Apart from this: also I just noticed (when reading the latter part of
> > > the series) that the time that a pasid table walk can consume will
> > > depend on this value too.  I'd suggest to make this as small as we
> > > can, as long as it satisfies the usage.  We can even bump it in the
> > > future.
> > 
> > I see. This looks to be an optimization, right? Instead of modifying the
> > value of this macro, I think we can do this optimization by tracking
> > the allocated PASIDs in QEMU. Thus, the pasid table walk would be more
> > efficient and would not depend on VTD_MAX_HPASID. Does it make
> > sense to you? :-)
> 
> Yeah sounds good. :)

Just make sure it's safe even when the global allocation is not
happening (fully emulated devices?  Do they need the PASID table walk
too?).  Anyway, be careful not to miss any valid PASID entries, or we
can still use MIN(PASID_MAX, CONTEXT_ENTRY_SIZE) to be safe as a
first version.  Thanks,
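
A rough sketch of that conservative bound, under assumptions: `pasid_walk_end()` is a hypothetical helper, `dir_entries` stands for the directory size read from the context entry, each directory entry is assumed to cover 64 PASIDs, and VTD_MAX_HPASID is the value from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define VTD_MAX_HPASID       0xFFFFFu  /* value from the patch */
#define PASIDS_PER_DIR_ENTRY 64u       /* assumed: one 4 KiB PASID table */

/*
 * Exclusive upper bound on the PASIDs a table walk has to visit:
 * the smaller of the architectural PASID limit and what the guest's
 * PASID directory can actually describe.
 */
static uint32_t pasid_walk_end(uint32_t dir_entries)
{
    uint64_t covered = (uint64_t)dir_entries * PASIDS_PER_DIR_ENTRY;
    uint64_t limit = (uint64_t)VTD_MAX_HPASID + 1;

    assert(dir_entries > 0);
    return (uint32_t)(covered < limit ? covered : limit);
}
```

A walk bounded this way visits every PASID the guest could have programmed, so it cannot miss a valid entry, at the cost of being slower than a walk driven by tracked allocations.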

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
  2020-02-13  2:59           ` Liu, Yi L
@ 2020-02-13 15:14             ` Peter Xu
  -1 siblings, 0 replies; 136+ messages in thread
From: Peter Xu @ 2020-02-13 15:14 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

On Thu, Feb 13, 2020 at 02:59:37AM +0000, Liu, Yi L wrote:
> > - Remove the vtd_pasid_as check right below because it's not needed.
> > 
> > >
> > >
> > > > > +        if (vtd_pasid_as &&
> >                    ^^^^^^^^^^^^
> 
> yes, it is. In the current series vtd_add_find_pasid_as() doesn’t check the
> result of the vtd_pasid_as memory allocation, so there is no need to check
> vtd_pasid_as here either. However, it might be better to check the allocation
> result, or there will be an issue if the allocation fails. What's your
> preference here?

That should not be needed, because IIRC g_malloc0() will directly
coredump if allocation fails.  Even if not, it'll coredump in
vtd_add_find_pasid_as() soon when accessing the NULL pointer.
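
The abort-on-failure convention can be illustrated without GLib. In the sketch below `xmalloc0()` is a hypothetical stand-in written with plain libc, meant only to mirror the behaviour described above (zeroed memory or program termination); it is not GLib's implementation, and `struct node` and `node_new()` are invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for an abort-on-failure allocator:
 * returns zeroed memory or terminates the program, never NULL.
 */
static void *xmalloc0(size_t size)
{
    void *p;

    assert(size > 0);
    p = calloc(1, size);
    if (!p) {
        fprintf(stderr, "out of memory allocating %zu bytes\n", size);
        abort();
    }
    return p;
}

struct node {
    int id;
    struct node *next;
};

/* Callers can use the result directly; a NULL check would be dead code. */
static struct node *node_new(int id)
{
    struct node *n = xmalloc0(sizeof(*n));

    n->id = id;
    return n;
}
```

With such an allocator the failure policy lives in one place, which is why neither the allocating function nor its callers need a NULL check.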

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-02-12  7:15           ` Liu, Yi L
@ 2020-02-14  5:36             ` David Gibson
  -1 siblings, 0 replies; 136+ messages in thread
From: David Gibson @ 2020-02-14  5:36 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Peter Xu, qemu-devel, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun

[-- Attachment #1: Type: text/plain, Size: 1816 bytes --]

On Wed, Feb 12, 2020 at 07:15:13AM +0000, Liu, Yi L wrote:
> Hi Peter,
> 
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Wednesday, February 12, 2020 12:59 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
> > 
> > On Fri, Jan 31, 2020 at 11:42:13AM +0000, Liu, Yi L wrote:
> > > > I'm not very clear on the relationship between an IOMMUContext and a
> > > > DualStageIOMMUObject.  Can there be many IOMMUContexts to a
> > > > DualStageIOMMUObject?  The other way around?  Or is it just
> > > > zero-or-one DualStageIOMMUObjects to an IOMMUContext?
> > >
> > > It is possible. As the below patch shows, DualStageIOMMUObject is per vfio
> > > container. IOMMUContext can be either per-device or shared across devices,
> > > it depends on vendor specific vIOMMU emulators.
> > 
> > Is there an example when an IOMMUContext can be not per-device?
> 
> No, I don’t have such an example so far. But as the IOMMUContext is obtained
> from pci_device_iommu_context(), in concept it is possible for it not to be
> per-device. It is left to the vIOMMU to decide whether different devices can
> share a single IOMMUContext.

On the "pseries" machine the vIOMMU only has one set of translations
for a whole virtual PCI Host Bridge (vPHB).  So if you attach multiple
devices to a single vPHB, I believe you'd get multiple devices in an
IOMMUContext.  Well.. if we did the PASID stuff, which we don't at the
moment.

Note that on pseries on the other hand it's routine to create multiple
vPHBs, rather than multiple PCI roots being an oddity as it is on x86.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
  2020-02-14  5:36             ` David Gibson
@ 2020-02-15  6:25               ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-15  6:25 UTC (permalink / raw)
  To: David Gibson
  Cc: Peter Xu, qemu-devel, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun

> From: David Gibson < david@gibson.dropbear.id.au >
> Sent: Friday, February 14, 2020 1:36 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
> 
> On Wed, Feb 12, 2020 at 07:15:13AM +0000, Liu, Yi L wrote:
> > Hi Peter,
> >
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, February 12, 2020 12:59 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v3 03/25] hw/iommu: introduce IOMMUContext
> > >
> > > On Fri, Jan 31, 2020 at 11:42:13AM +0000, Liu, Yi L wrote:
> > > > > I'm not very clear on the relationship betwen an IOMMUContext and a
> > > > > DualStageIOMMUObject.  Can there be many IOMMUContexts to a
> > > > > DualStageIOMMUOBject?  The other way around?  Or is it just
> > > > > zero-or-one DualStageIOMMUObjects to an IOMMUContext?
> > > >
> > > > It is possible. As the below patch shows, DualStageIOMMUObject is per vfio
> > > > container. IOMMUContext can be either per-device or shared across devices,
> > > > it depends on vendor specific vIOMMU emulators.
> > >
> > > Is there an example when an IOMMUContext can be not per-device?
> >
> > No, I don’t have such example so far. But as IOMMUContext is got from
> > pci_device_iommu_context(),  in concept it possible to be not per-device.
> > It is kind of leave to vIOMMU to decide if different devices could share a
> > single IOMMUContext.
> 
> On the "pseries" machine the vIOMMU only has one set of translations
> for a whole virtual PCI Host Bridge (vPHB).  So if you attach multiple
> devices to a single vPHB, I believe you'd get multiple devices in an
> IOMMUContext.  Well.. if we did the PASID stuff, which we don't at the
> moment.
> 
> Note that on pseries on the other hand it's routine to create multiple
> vPHBs, rather than multiple PCI roots being an oddity as it is on x86.

Thanks for the example, David. :-) BTW, I'll drop IOMMUContext in the next
version, as mentioned in the email below.  Please feel free to let me know
your opinion.

https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg02874.html

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 14/25] intel_iommu: add virtual command capability support
  2020-02-13 15:08           ` Peter Xu
@ 2020-02-15  8:49             ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-15  8:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost

> From: Peter Xu < peterx@redhat.com >
> Sent: Thursday, February 13, 2020 11:09 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 14/25] intel_iommu: add virtual command capability support
> 
> On Thu, Feb 13, 2020 at 09:31:10AM -0500, Peter Xu wrote:
> 
> [...]
> 
> > > > Apart of this: also I just noticed (when reading the latter part of
> > > > the series) that the time that a pasid table walk can consume will
> > > > depend on this value too.  I'd suggest to make this as small as we
> > > > can, as long as it satisfies the usage.  We can even bump it in the
> > > > future.
> > >
> > > I see. This looks to be an optimization. right? Instead of modify the
> > > value of this macro,  I think we can do this optimization by tracking
> > > the allocated PASIDs in QEMU. Thus, the pasid table walk  would be more
> > > efficient and also no dependency on the VTD_MAX_HPASID. Does it make
> > > sense to you? :-)
> >
> > Yeah sounds good. :)
> 
> Just to make sure it's safe even for when the global allocation is not
> happening (full emulation devices?  Do they need the PASID table walk
> too?). 

I'd say no. For fully emulated devices, we just need to ensure the PASID cache
is up to date (i.e. do what the guest told us). Even if an invalidation flushes
too much of the cache, it only affects performance and is not a correctness
issue.  This is different for passthru devices: if we unbind too much, some
passthru devices may encounter DMA faults later.

> Anyway, be careful to not miss some valid PASID entries, or we
> can still use the MIN(PASID_MAX, CONTEXT_ENTRY_SIZE) to be safe as a
> first version.  Thanks,

Agreed. The first version will aim to be 100% safe.
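
As a rough illustration of the "track the allocated PASIDs" idea, the sketch
below keeps a bitmap of allocated PASIDs and visits only those, rather than
probing every index up to the maximum.  It is hypothetical: TOY_MAX_PASID,
ToyPasidTracker and the toy_tracker_* functions are made-up names, and the
bound is an assumed value that does not reflect VTD_MAX_HPASID or any real
QEMU structure.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PASID  1024u    /* assumed bound, for illustration only */
#define TOY_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS      ((TOY_MAX_PASID + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

typedef struct ToyPasidTracker {
    unsigned long bits[TOY_WORDS];  /* bit N set => PASID N is allocated */
} ToyPasidTracker;

void toy_tracker_init(ToyPasidTracker *t)
{
    memset(t, 0, sizeof(*t));
}

/* Record an allocation; returns 0 on success, -1 if pasid is out of range. */
int toy_tracker_set(ToyPasidTracker *t, unsigned int pasid)
{
    if (pasid >= TOY_MAX_PASID) {
        return -1;
    }
    t->bits[pasid / TOY_WORD_BITS] |= 1UL << (pasid % TOY_WORD_BITS);
    return 0;
}

/* Record a free; returns 0 on success, -1 if pasid is out of range. */
int toy_tracker_clear(ToyPasidTracker *t, unsigned int pasid)
{
    if (pasid >= TOY_MAX_PASID) {
        return -1;
    }
    t->bits[pasid / TOY_WORD_BITS] &= ~(1UL << (pasid % TOY_WORD_BITS));
    return 0;
}

/*
 * Call fn (if non-NULL) once per allocated PASID and return how many were
 * visited.  Words with no bits set are skipped without per-PASID work.
 */
unsigned int toy_tracker_walk(const ToyPasidTracker *t,
                              void (*fn)(unsigned int pasid, void *opaque),
                              void *opaque)
{
    unsigned int visited = 0;

    for (size_t w = 0; w < TOY_WORDS; w++) {
        unsigned long word = t->bits[w];

        if (!word) {
            continue;
        }
        for (size_t b = 0; b < TOY_WORD_BITS; b++) {
            if (word & (1UL << b)) {
                if (fn) {
                    fn((unsigned int)(w * TOY_WORD_BITS + b), opaque);
                }
                visited++;
            }
        }
    }
    return visited;
}
```

The trade-off is the one raised in the thread: a tracker like this must never
miss a valid entry, so a conservative full walk remains the safer choice for
a first version.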

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure
  2020-02-13 15:14             ` Peter Xu
@ 2020-02-15  8:50               ` Liu, Yi L
  -1 siblings, 0 replies; 136+ messages in thread
From: Liu, Yi L @ 2020-02-15  8:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, david, pbonzini, alex.williamson, mst, eric.auger,
	Tian, Kevin, Tian, Jun J, Sun, Yi Y, kvm, Wu, Hao, Jacob Pan,
	Yi Sun, Richard Henderson, Eduardo Habkost



> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, February 13, 2020 11:14 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v3 16/25] intel_iommu: add PASID cache management
> infrastructure
> 
> On Thu, Feb 13, 2020 at 02:59:37AM +0000, Liu, Yi L wrote:
> > > - Remove the vtd_pasid_as check right below because it's not needed.
> > >
> > > >
> > > >
> > > > > > +        if (vtd_pasid_as &&
> > >                    ^^^^^^^^^^^^
> >
> > yes, it is. In current series vtd_add_find_pasid_as() doesn’t check the
> > result of vtd_pasid_as mem allocation, so no need to check vtd_pasid_as
> > here either. However, it might be better to check the allocation result
> > or it will result in issue if allocation failed. What's your preference
> > here?
> 
> That should not be needed, because IIRC g_malloc0() will directly
> coredump if allocation fails.  Even if not, it'll coredump in
> vtd_add_find_pasid_as() soon when accessing the NULL pointer.

Cool, thanks for the explanation. I'll follow your suggestion and remove
the vtd_pasid_as check.
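
For reference, the reasoning can be sketched in plain C.  GLib's g_malloc0()
terminates the program when an allocation fails, so callers never see NULL
for a non-zero size.  The stand-in below, toy_malloc0(), mimics that
behaviour with calloc() and abort(); it and the ToyPasidAS type and
toy_add_find_pasid_as() function are hypothetical names, not the code from
the series.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-PASID address space structure. */
typedef struct ToyPasidAS {
    unsigned int pasid;
} ToyPasidAS;

/*
 * Abort-on-failure allocator: for size > 0 it either returns zeroed
 * memory or terminates the process, so it never returns NULL to callers.
 */
void *toy_malloc0(size_t size)
{
    void *p = calloc(1, size);

    if (!p) {
        fprintf(stderr, "toy_malloc0: out of memory\n");
        abort();
    }
    return p;
}

/* Allocate and initialise an entry; no NULL check is needed here. */
ToyPasidAS *toy_add_find_pasid_as(unsigned int pasid)
{
    ToyPasidAS *as = toy_malloc0(sizeof(*as));

    as->pasid = pasid;  /* safe: the allocator would have aborted on failure */
    return as;
}
```

Because the allocator cannot hand back NULL, a later "if (as && ...)" test
on the returned pointer adds nothing, which is why the check can go.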

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 136+ messages in thread

end of thread, other threads:[~2020-02-15  8:51 UTC | newest]

Thread overview: 136+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-29 12:16 [RFC v3 00/25] intel_iommu: expose Shared Virtual Addressing to VMs Liu, Yi L
2020-01-29 12:16 ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 01/25] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 02/25] hw/iommu: introduce DualStageIOMMUObject Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-31  3:59   ` David Gibson
2020-01-31  3:59     ` David Gibson
2020-01-31 11:42     ` Liu, Yi L
2020-01-31 11:42       ` Liu, Yi L
2020-02-12  6:32       ` David Gibson
2020-02-12  6:32         ` David Gibson
2020-01-29 12:16 ` [RFC v3 03/25] hw/iommu: introduce IOMMUContext Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-31  4:06   ` David Gibson
2020-01-31  4:06     ` David Gibson
2020-01-31 11:42     ` Liu, Yi L
2020-01-31 11:42       ` Liu, Yi L
2020-02-11 16:58       ` Peter Xu
2020-02-11 16:58         ` Peter Xu
2020-02-12  7:15         ` Liu, Yi L
2020-02-12  7:15           ` Liu, Yi L
2020-02-12 15:59           ` Peter Xu
2020-02-12 15:59             ` Peter Xu
2020-02-13  2:46             ` Liu, Yi L
2020-02-13  2:46               ` Liu, Yi L
2020-02-14  5:36           ` David Gibson
2020-02-14  5:36             ` David Gibson
2020-02-15  6:25             ` Liu, Yi L
2020-02-15  6:25               ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 04/25] hw/pci: introduce pci_device_iommu_context() Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 05/25] intel_iommu: provide get_iommu_context() callback Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 06/25] scripts/update-linux-headers: Import iommu.h Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:25   ` Cornelia Huck
2020-01-29 12:25     ` Cornelia Huck
2020-01-31 11:40     ` Liu, Yi L
2020-01-31 11:40       ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 07/25] header file update VFIO/IOMMU vSVA APIs Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:28   ` Cornelia Huck
2020-01-29 12:28     ` Cornelia Huck
2020-01-31 11:41     ` Liu, Yi L
2020-01-31 11:41       ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 08/25] vfio: pass IOMMUContext into vfio_get_group() Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 09/25] vfio: check VFIO_TYPE1_NESTING_IOMMU support Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-02-11 19:08   ` Peter Xu
2020-02-11 19:08     ` Peter Xu
2020-02-12  7:16     ` Liu, Yi L
2020-02-12  7:16       ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 10/25] vfio: register DualStageIOMMUObject to vIOMMU Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 11/25] vfio: get stage-1 pasid formats from Kernel Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-02-11 19:30   ` Peter Xu
2020-02-11 19:30     ` Peter Xu
2020-02-12  7:19     ` Liu, Yi L
2020-02-12  7:19       ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 12/25] vfio/common: add pasid_alloc/free support Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-02-11 19:31   ` Peter Xu
2020-02-11 19:31     ` Peter Xu
2020-02-12  7:20     ` Liu, Yi L
2020-02-12  7:20       ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 13/25] intel_iommu: modify x-scalable-mode to be string option Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-02-11 19:43   ` Peter Xu
2020-02-11 19:43     ` Peter Xu
2020-02-12  7:28     ` Liu, Yi L
2020-02-12  7:28       ` Liu, Yi L
2020-02-12 16:05       ` Peter Xu
2020-02-12 16:05         ` Peter Xu
2020-02-13  2:44         ` Liu, Yi L
2020-02-13  2:44           ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 14/25] intel_iommu: add virtual command capability support Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-02-11 20:16   ` Peter Xu
2020-02-11 20:16     ` Peter Xu
2020-02-12  7:32     ` Liu, Yi L
2020-02-12  7:32       ` Liu, Yi L
2020-02-11 21:56   ` Peter Xu
2020-02-11 21:56     ` Peter Xu
2020-02-13  2:40     ` Liu, Yi L
2020-02-13  2:40       ` Liu, Yi L
2020-02-13 14:31       ` Peter Xu
2020-02-13 14:31         ` Peter Xu
2020-02-13 15:08         ` Peter Xu
2020-02-13 15:08           ` Peter Xu
2020-02-15  8:49           ` Liu, Yi L
2020-02-15  8:49             ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 15/25] intel_iommu: process pasid cache invalidation Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-02-11 20:17   ` Peter Xu
2020-02-11 20:17     ` Peter Xu
2020-02-12  7:33     ` Liu, Yi L
2020-02-12  7:33       ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 16/25] intel_iommu: add PASID cache management infrastructure Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-02-11 23:35   ` Peter Xu
2020-02-11 23:35     ` Peter Xu
2020-02-12  8:37     ` Liu, Yi L
2020-02-12  8:37       ` Liu, Yi L
2020-02-12 15:26       ` Peter Xu
2020-02-12 15:26         ` Peter Xu
2020-02-13  2:59         ` Liu, Yi L
2020-02-13  2:59           ` Liu, Yi L
2020-02-13 15:14           ` Peter Xu
2020-02-13 15:14             ` Peter Xu
2020-02-15  8:50             ` Liu, Yi L
2020-02-15  8:50               ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 17/25] vfio: add bind stage-1 page table support Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 18/25] intel_iommu: bind/unbind guest page table to host Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 19/25] intel_iommu: replay guest pasid bindings " Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 20/25] intel_iommu: replay pasid binds after context cache invalidation Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 21/25] intel_iommu: do not pass down pasid bind for PASID #0 Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 22/25] vfio: add support for flush iommu stage-1 cache Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 23/25] intel_iommu: process PASID-based iotlb invalidation Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 24/25] intel_iommu: propagate PASID-based iotlb invalidation to host Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 12:16 ` [RFC v3 25/25] intel_iommu: process PASID-based Device-TLB invalidation Liu, Yi L
2020-01-29 12:16   ` Liu, Yi L
2020-01-29 13:44 ` [RFC v3 00/25] intel_iommu: expose Shared Virtual Addressing to VMs no-reply
2020-01-29 13:44   ` no-reply
2020-01-29 13:48 ` no-reply
2020-01-29 13:48   ` no-reply
