* [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
@ 2024-01-15 10:37 Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 01/23] Update linux header to support nested hwpt alloc Zhenzhong Duan
                   ` (23 more replies)
  0 siblings, 24 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan

Hi,

This series enables stage-1 translation support in the Intel IOMMU,
which we call "modern" mode. In this mode, we don't shadow the guest
page table for passthrough devices; instead, the stage-1 page table is
passed to the host side to construct a nested domain. Emulated devices
are also supported, by translating through the stage-1 page table.
There was an earlier effort to enable this feature, see [1] for details.

The key design is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability of the host IOMMU.
As the diagram below shows, the guest I/O page table pointer in GPA
(guest physical address) is passed to the host and used to perform
the stage-1 address translation. Correspondingly, modifications to
present mappings in the guest I/O page table must be followed by an
IOTLB invalidation (a minimal sketch of that flush follows the diagram).

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  |  FS for GIOVA->GPA     |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.----------------------------------.
        |             |   | SS for GPA->HPA, unmanaged domain|
        |             |   '----------------------------------'
        '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables
<Intel VT-d Nested translation>
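
For reference, a minimal userspace-style sketch of that invalidation
step, using the IOMMU_HWPT_INVALIDATE uAPI added in PATCH1 (the iommufd
fd and nested hwpt id are assumed, error handling elided):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Flush all stage-1 caches of a nested HWPT after the guest has
     * modified its I/O page table; addr=0/npages=UINT64_MAX means
     * "invalidate everything" per the uAPI documentation. */
    static int flush_stage1_caches(int iommufd, uint32_t nested_hwpt_id)
    {
        struct iommu_hwpt_vtd_s1_invalidate inv = {
            .addr = 0,
            .npages = UINT64_MAX,
        };
        struct iommu_hwpt_invalidate cmd = {
            .size = sizeof(cmd),
            .hwpt_id = nested_hwpt_id,
            .data_uptr = (uintptr_t)&inv,
            .data_type = IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
            .entry_len = sizeof(inv),
            .entry_num = 1,
        };

        return ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd);
    }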

There are several interactions between VFIO and the vIOMMU:
* The vIOMMU registers PCIIOMMUOps with the PCI subsystem, which VFIO
  uses to register/unregister an IOMMUFDDevice object.
* VFIO registers an IOMMUFDDevice object with the vIOMMU at vfio device
  realize time; this is implemented in a prerequisite series [2].
* The vIOMMU calls the IOMMUFDDevice interface callbacks
  (IOMMUFDDeviceOps) to bind/unbind a device to IOMMUFD-backed domains,
  nested or not.

See the diagram below (a rough sketch of these interfaces follows it):

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || IOMMUFD     |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|         IOMMUFDDeviceOps|  .---------.      |
    |       | IOMMUFD |            (attach_hwpt)|  | IOMMUFD |      |
    |       | link    |<------------------------|  | Device  |      |
    |       .---------|            (detach_hwpt)|  .---------.      |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.
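
The two interfaces above roughly take the following shape; this is a
sketch inferred from the callback names in the diagram, the exact
signatures live in the prerequisite series [2] and may differ:

    typedef struct IOMMUFDDevice IOMMUFDDevice;

    /* Registered by the vIOMMU with the PCI subsystem */
    typedef struct PCIIOMMUOps {
        void (*set_iommu_device)(PCIBus *bus, void *opaque, int devfn,
                                 IOMMUFDDevice *idev);
        void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
    } PCIIOMMUOps;

    /* Implemented by VFIO, called by the vIOMMU to switch the device
     * between IOMMUFD-backed domains */
    typedef struct IOMMUFDDeviceOps {
        int (*attach_hwpt)(IOMMUFDDevice *idev, uint32_t hwpt_id);
        int (*detach_hwpt)(IOMMUFDDevice *idev);
    } IOMMUFDDeviceOps;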

Based on Yi's suggestion, we reworked the design for managing ioas and
hwpt objects: it now supports multiple iommufd objects and the
ERRATA_772415 case, while still trying to share ioas and hwpt whenever
possible.

A stage-2 page table can be shared by different devices if there is
no conflict and the devices link to the same iommufd object, i.e.
devices under the same host IOMMU can share the same stage-2 page
table. If there is a conflict, e.g. one device runs in non cache
coherency (non-CC) mode while the others don't, that device requires
a separate stage-2 page table in non-CC mode.
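
Illustratively, the sharing rule boils down to a lookup like the
following (type and field names here are hypothetical, modeled on the
diagram further below):

    /* Reuse a stage-2 HWPT only when the cache-coherency mode matches
     * within a container (same iommufd, same RO policy); otherwise the
     * caller allocates a new one. */
    static VTDS2Hwpt *vtd_find_s2hwpt(VTDIOASContainer *container, bool cc)
    {
        VTDS2Hwpt *s2hwpt;

        QLIST_FOREACH(s2hwpt, &container->s2hwpt_list, next) {
            if (s2hwpt->cc == cc) {
                return s2hwpt;
            }
        }
        return NULL;
    }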

The SPR platform has ERRATA_772415, which disallows read-only mappings
in the stage-2 page table. This series supports creating a
VTDIOASContainer with no read-only mappings. It is not clear whether
there is a rare case where only some IOMMUs on a multi-IOMMU host have
ERRATA_772415, but this design can survive even in that case (a
detection sketch follows the diagram).

See the example diagram below for a full view:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.
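
For reference, the erratum can be detected from the host roughly as
follows, using the IOMMU_GET_HW_INFO uAPI extended in PATCH1 (the
iommufd fd and dev_id are assumed from context, error handling elided):

    struct iommu_hw_info_vtd vtd = {};
    struct iommu_hw_info info = {
        .size = sizeof(info),
        .dev_id = dev_id,
        .data_len = sizeof(vtd),
        .data_uptr = (uintptr_t)&vtd,
    };

    ioctl(iommufd, IOMMU_GET_HW_INFO, &info);
    if (info.out_data_type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
        (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17)) {
        /* place the device in a VTDIOASContainer without RO mappings */
    }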

This series is also prerequisite work for vSVA, i.e. sharing a guest
application's address space with passthrough devices.

To enable "modern" mode, just add "x-scalable-mode=modern", e.g.:
-device intel-iommu,x-scalable-mode=modern,...

A passthrough device must use the iommufd backend to work in "modern"
mode, e.g.:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

If the host doesn't support nested translation, QEMU will fail with an
error reporting it as unsupported.

Tests done:
- devices hotplug/unplug
- different devices linked to different iommufds

PATCH1-2:  Preparatory work to update headers and the IOMMUFD uAPI
PATCH3-4:  Initialize the vfio IOMMUFDDevice interface and pass it to vIOMMU
PATCH5:    Introduce a placeholder variable for scalable modern mode
PATCH6:    Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
PATCH7-22: Implement first stage page table for passthrough and emulated devices
PATCH23:   Introduce "modern" mode to distinguish from legacy mode

QEMU code can be found at [3]
Matching kernel code can be found at [4]

TODO:
- RAM discard
- dirty tracking on stage-2 page table

THOUGHTS:
This design is optimal in sharing ioas/hwpt whenever possible, but it
also brings some overhead: the vIOMMU has to implement a memory
listener similar to vfio_memory_listener, i.e. one that also supports
RAM discard and dirty tracking.
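
A minimal sketch of what that registration implies is below; the
callback names are hypothetical and would mirror
vfio_listener_region_add/del plus dirty-tracking hooks:

    static MemoryListener vtd_s2_listener = {
        .name = "intel-iommu-s2",
        .region_add = vtd_s2_listener_region_add, /* map into s2hwpt   */
        .region_del = vtd_s2_listener_region_del, /* unmap from s2hwpt */
        .log_sync = vtd_s2_listener_log_sync,     /* dirty tracking    */
    };

    memory_listener_register(&vtd_s2_listener, &address_space_memory);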

We have also implemented another design internally, reusing the ioas
from vfio to create the s2hwpt; this way each device has its own s2hwpt
while sharing vfio's ioas, so vfio_memory_listener is reused and there
is no code redundancy. But this has three flaws:
 1. the address space switch has to be bypassed for vfio devices, which
    means a vfio device and an emulated device can't share the same
    address space.
 2. a separate ioas/hwpt still needs to be created for ERRATA_772415.
 3. no ioas/hwpt sharing.

It is not clear which design the community prefers; internally we like
the current design a bit more. Comments and suggestions are welcome.

[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02730.html
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv1
[4] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting

Thanks
Zhenzhong


Yi Liu (11):
  intel_iommu: process PASID cache invalidation
  intel_iommu: add PASID cache management infrastructure
  intel_iommu: replay pasid binds after context cache invalidation
  intel_iommu: process PASID-based iotlb invalidation
  intel_iommu: propagate PASID-based iotlb invalidation to host
  intel_iommu: process PASID-based Device-TLB invalidation
  intel_iommu: rename slpte in iotlb_entry to pte
  intel_iommu: implement first level translation
  intel_iommu: introduce pasid iotlb cache
  intel_iommu: refresh pasid bind after pasid cache force reset
  intel_iommu: modify x-scalable-mode to be string option

Yi Sun (2):
  intel_iommu: piotlb invalidation should notify unmap
  intel_iommu: invalidate piotlb when flush pasid

Yu Zhang (1):
  intel_iommu: fix the fault reason report

Zhenzhong Duan (9):
  Update linux header to support nested hwpt alloc
  backends/iommufd: add helpers for allocating user-managed HWPT
  backends/iommufd_device: introduce IOMMUFDDevice targeted interface
  vfio: implement IOMMUFDDevice interface callbacks
  intel_iommu: add a placeholder variable for scalable modern mode
  intel_iommu: check and sync host IOMMU cap/ecap in scalable modern
    mode
  vfio/iommufd_device: Add ioas_id in IOMMUFDDevice and pass to vIOMMU
  intel_iommu: bind/unbind guest page table to host
  intel_iommu: ERRATA_772415 workaround

 hw/i386/intel_iommu_internal.h                |  109 +-
 include/hw/i386/intel_iommu.h                 |   63 +-
 include/standard-headers/drm/drm_fourcc.h     |    2 +
 include/standard-headers/linux/fuse.h         |   10 +-
 include/standard-headers/linux/pci_regs.h     |   24 +-
 include/standard-headers/linux/vhost_types.h  |    7 +
 .../standard-headers/linux/virtio_config.h    |    5 +
 include/standard-headers/linux/virtio_pci.h   |   11 +
 include/sysemu/iommufd.h                      |    7 +
 include/sysemu/iommufd_device.h               |   12 +-
 linux-headers/asm-arm64/kvm.h                 |   32 +
 linux-headers/asm-generic/unistd.h            |   14 +-
 linux-headers/asm-loongarch/bitsperlong.h     |    1 +
 linux-headers/asm-loongarch/kvm.h             |  108 +
 linux-headers/asm-loongarch/mman.h            |    1 +
 linux-headers/asm-loongarch/unistd.h          |    5 +
 linux-headers/asm-mips/unistd_n32.h           |    4 +
 linux-headers/asm-mips/unistd_n64.h           |    4 +
 linux-headers/asm-mips/unistd_o32.h           |    4 +
 linux-headers/asm-powerpc/unistd_32.h         |    4 +
 linux-headers/asm-powerpc/unistd_64.h         |    4 +
 linux-headers/asm-riscv/kvm.h                 |   12 +
 linux-headers/asm-s390/unistd_32.h            |    4 +
 linux-headers/asm-s390/unistd_64.h            |    4 +
 linux-headers/asm-x86/unistd_32.h             |    4 +
 linux-headers/asm-x86/unistd_64.h             |    3 +
 linux-headers/asm-x86/unistd_x32.h            |    3 +
 linux-headers/linux/iommufd.h                 |  259 +-
 linux-headers/linux/kvm.h                     |   11 +
 linux-headers/linux/psp-sev.h                 |    1 +
 linux-headers/linux/stddef.h                  |    9 +-
 linux-headers/linux/userfaultfd.h             |    9 +-
 linux-headers/linux/vfio.h                    |   47 +-
 linux-headers/linux/vhost.h                   |    8 +
 backends/iommufd.c                            |   61 +
 backends/iommufd_device.c                     |   17 +-
 hw/i386/intel_iommu.c                         | 2822 ++++++++++++++---
 hw/vfio/iommufd.c                             |   37 +-
 backends/trace-events                         |    2 +
 hw/i386/trace-events                          |   16 +
 40 files changed, 3256 insertions(+), 504 deletions(-)
 create mode 100644 linux-headers/asm-loongarch/bitsperlong.h
 create mode 100644 linux-headers/asm-loongarch/kvm.h
 create mode 100644 linux-headers/asm-loongarch/mman.h
 create mode 100644 linux-headers/asm-loongarch/unistd.h

-- 
2.34.1




* [PATCH rfcv1 01/23] Update linux header to support nested hwpt alloc
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 02/23] backends/iommufd: add helpers for allocating user-managed HWPT Zhenzhong Duan
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Cornelia Huck, Paolo Bonzini,
	open list:Overall KVM CPUs

Repo: https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
commit id: 7c22f835c4c9b

Placeholder, not for upstream.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/standard-headers/drm/drm_fourcc.h     |   2 +
 include/standard-headers/linux/fuse.h         |  10 +-
 include/standard-headers/linux/pci_regs.h     |  24 +-
 include/standard-headers/linux/vhost_types.h  |   7 +
 .../standard-headers/linux/virtio_config.h    |   5 +
 include/standard-headers/linux/virtio_pci.h   |  11 +
 linux-headers/asm-arm64/kvm.h                 |  32 +++
 linux-headers/asm-generic/unistd.h            |  14 +-
 linux-headers/asm-loongarch/bitsperlong.h     |   1 +
 linux-headers/asm-loongarch/kvm.h             | 108 ++++++++
 linux-headers/asm-loongarch/mman.h            |   1 +
 linux-headers/asm-loongarch/unistd.h          |   5 +
 linux-headers/asm-mips/unistd_n32.h           |   4 +
 linux-headers/asm-mips/unistd_n64.h           |   4 +
 linux-headers/asm-mips/unistd_o32.h           |   4 +
 linux-headers/asm-powerpc/unistd_32.h         |   4 +
 linux-headers/asm-powerpc/unistd_64.h         |   4 +
 linux-headers/asm-riscv/kvm.h                 |  12 +
 linux-headers/asm-s390/unistd_32.h            |   4 +
 linux-headers/asm-s390/unistd_64.h            |   4 +
 linux-headers/asm-x86/unistd_32.h             |   4 +
 linux-headers/asm-x86/unistd_64.h             |   3 +
 linux-headers/asm-x86/unistd_x32.h            |   3 +
 linux-headers/linux/iommufd.h                 | 259 +++++++++++++++++-
 linux-headers/linux/kvm.h                     |  11 +
 linux-headers/linux/psp-sev.h                 |   1 +
 linux-headers/linux/stddef.h                  |   9 +-
 linux-headers/linux/userfaultfd.h             |   9 +-
 linux-headers/linux/vfio.h                    |  47 +++-
 linux-headers/linux/vhost.h                   |   8 +
 30 files changed, 583 insertions(+), 31 deletions(-)
 create mode 100644 linux-headers/asm-loongarch/bitsperlong.h
 create mode 100644 linux-headers/asm-loongarch/kvm.h
 create mode 100644 linux-headers/asm-loongarch/mman.h
 create mode 100644 linux-headers/asm-loongarch/unistd.h

diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index 72279f4d25..3afb70160f 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -322,6 +322,8 @@ extern "C" {
  * index 1 = Cr:Cb plane, [39:0] Cr1:Cb1:Cr0:Cb0 little endian
  */
 #define DRM_FORMAT_NV15		fourcc_code('N', 'V', '1', '5') /* 2x2 subsampled Cr:Cb plane */
+#define DRM_FORMAT_NV20		fourcc_code('N', 'V', '2', '0') /* 2x1 subsampled Cr:Cb plane */
+#define DRM_FORMAT_NV30		fourcc_code('N', 'V', '3', '0') /* non-subsampled Cr:Cb plane */
 
 /*
  * 2 plane YCbCr MSB aligned
diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h
index 6b9793842c..fc0dcd10ae 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -209,7 +209,7 @@
  *  - add FUSE_HAS_EXPIRE_ONLY
  *
  *  7.39
- *  - add FUSE_DIRECT_IO_RELAX
+ *  - add FUSE_DIRECT_IO_ALLOW_MMAP
  *  - add FUSE_STATX and related structures
  */
 
@@ -405,8 +405,7 @@ struct fuse_file_lock {
  * FUSE_CREATE_SUPP_GROUP: add supplementary group info to create, mkdir,
  *			symlink and mknod (single group that matches parent)
  * FUSE_HAS_EXPIRE_ONLY: kernel supports expiry-only entry invalidation
- * FUSE_DIRECT_IO_RELAX: relax restrictions in FOPEN_DIRECT_IO mode, for now
- *                       allow shared mmap
+ * FUSE_DIRECT_IO_ALLOW_MMAP: allow shared mmap in FOPEN_DIRECT_IO mode.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -445,7 +444,10 @@ struct fuse_file_lock {
 #define FUSE_HAS_INODE_DAX	(1ULL << 33)
 #define FUSE_CREATE_SUPP_GROUP	(1ULL << 34)
 #define FUSE_HAS_EXPIRE_ONLY	(1ULL << 35)
-#define FUSE_DIRECT_IO_RELAX	(1ULL << 36)
+#define FUSE_DIRECT_IO_ALLOW_MMAP (1ULL << 36)
+
+/* Obsolete alias for FUSE_DIRECT_IO_ALLOW_MMAP */
+#define FUSE_DIRECT_IO_RELAX	FUSE_DIRECT_IO_ALLOW_MMAP
 
 /**
  * CUSE INIT request/reply flags
diff --git a/include/standard-headers/linux/pci_regs.h b/include/standard-headers/linux/pci_regs.h
index e5f558d964..a39193213f 100644
--- a/include/standard-headers/linux/pci_regs.h
+++ b/include/standard-headers/linux/pci_regs.h
@@ -80,6 +80,7 @@
 #define  PCI_HEADER_TYPE_NORMAL		0
 #define  PCI_HEADER_TYPE_BRIDGE		1
 #define  PCI_HEADER_TYPE_CARDBUS	2
+#define  PCI_HEADER_TYPE_MFD		0x80	/* Multi-Function Device (possible) */
 
 #define PCI_BIST		0x0f	/* 8 bits */
 #define  PCI_BIST_CODE_MASK	0x0f	/* Return result */
@@ -637,6 +638,7 @@
 #define PCI_EXP_RTCAP		0x1e	/* Root Capabilities */
 #define  PCI_EXP_RTCAP_CRSVIS	0x0001	/* CRS Software Visibility capability */
 #define PCI_EXP_RTSTA		0x20	/* Root Status */
+#define  PCI_EXP_RTSTA_PME_RQ_ID 0x0000ffff /* PME Requester ID */
 #define  PCI_EXP_RTSTA_PME	0x00010000 /* PME status */
 #define  PCI_EXP_RTSTA_PENDING	0x00020000 /* PME pending */
 /*
@@ -930,12 +932,13 @@
 
 /* Process Address Space ID */
 #define PCI_PASID_CAP		0x04    /* PASID feature register */
-#define  PCI_PASID_CAP_EXEC	0x02	/* Exec permissions Supported */
-#define  PCI_PASID_CAP_PRIV	0x04	/* Privilege Mode Supported */
+#define  PCI_PASID_CAP_EXEC	0x0002	/* Exec permissions Supported */
+#define  PCI_PASID_CAP_PRIV	0x0004	/* Privilege Mode Supported */
+#define  PCI_PASID_CAP_WIDTH	0x1f00
 #define PCI_PASID_CTRL		0x06    /* PASID control register */
-#define  PCI_PASID_CTRL_ENABLE	0x01	/* Enable bit */
-#define  PCI_PASID_CTRL_EXEC	0x02	/* Exec permissions Enable */
-#define  PCI_PASID_CTRL_PRIV	0x04	/* Privilege Mode Enable */
+#define  PCI_PASID_CTRL_ENABLE	0x0001	/* Enable bit */
+#define  PCI_PASID_CTRL_EXEC	0x0002	/* Exec permissions Enable */
+#define  PCI_PASID_CTRL_PRIV	0x0004	/* Privilege Mode Enable */
 #define PCI_EXT_CAP_PASID_SIZEOF	8
 
 /* Single Root I/O Virtualization */
@@ -975,6 +978,8 @@
 #define  PCI_LTR_VALUE_MASK	0x000003ff
 #define  PCI_LTR_SCALE_MASK	0x00001c00
 #define  PCI_LTR_SCALE_SHIFT	10
+#define  PCI_LTR_NOSNOOP_VALUE	0x03ff0000 /* Max No-Snoop Latency Value */
+#define  PCI_LTR_NOSNOOP_SCALE	0x1c000000 /* Scale for Max Value */
 #define PCI_EXT_CAP_LTR_SIZEOF	8
 
 /* Access Control Service */
@@ -1042,9 +1047,16 @@
 #define PCI_EXP_DPC_STATUS		0x08	/* DPC Status */
 #define  PCI_EXP_DPC_STATUS_TRIGGER	    0x0001 /* Trigger Status */
 #define  PCI_EXP_DPC_STATUS_TRIGGER_RSN	    0x0006 /* Trigger Reason */
+#define  PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR  0x0000 /* Uncorrectable error */
+#define  PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE    0x0002 /* Rcvd ERR_NONFATAL */
+#define  PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE     0x0004 /* Rcvd ERR_FATAL */
+#define  PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT 0x0006 /* Reason in Trig Reason Extension field */
 #define  PCI_EXP_DPC_STATUS_INTERRUPT	    0x0008 /* Interrupt Status */
 #define  PCI_EXP_DPC_RP_BUSY		    0x0010 /* Root Port Busy */
 #define  PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT 0x0060 /* Trig Reason Extension */
+#define  PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO		0x0000	/* RP PIO error */
+#define  PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER	0x0020	/* DPC SW Trigger bit */
+#define  PCI_EXP_DPC_RP_PIO_FEP		    0x1f00 /* RP PIO First Err Ptr */
 
 #define PCI_EXP_DPC_SOURCE_ID		 0x0A	/* DPC Source Identifier */
 
@@ -1088,6 +1100,8 @@
 #define  PCI_L1SS_CTL1_LTR_L12_TH_VALUE	0x03ff0000  /* LTR_L1.2_THRESHOLD_Value */
 #define  PCI_L1SS_CTL1_LTR_L12_TH_SCALE	0xe0000000  /* LTR_L1.2_THRESHOLD_Scale */
 #define PCI_L1SS_CTL2		0x0c	/* Control 2 Register */
+#define  PCI_L1SS_CTL2_T_PWR_ON_SCALE	0x00000003  /* T_POWER_ON Scale */
+#define  PCI_L1SS_CTL2_T_PWR_ON_VALUE	0x000000f8  /* T_POWER_ON Value */
 
 /* Designated Vendor-Specific (DVSEC, PCI_EXT_CAP_ID_DVSEC) */
 #define PCI_DVSEC_HEADER1		0x4 /* Designated Vendor-Specific Header1 */
diff --git a/include/standard-headers/linux/vhost_types.h b/include/standard-headers/linux/vhost_types.h
index 5ad07e134a..fd54044936 100644
--- a/include/standard-headers/linux/vhost_types.h
+++ b/include/standard-headers/linux/vhost_types.h
@@ -185,5 +185,12 @@ struct vhost_vdpa_iova_range {
  * DRIVER_OK
  */
 #define VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK  0x6
+/* Device may expose the virtqueue's descriptor area, driver area and
+ * device area to a different group for ASID binding than where its
+ * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID.
+ */
+#define VHOST_BACKEND_F_DESC_ASID    0x7
+/* IOTLB don't flush memory mapping across device reset */
+#define VHOST_BACKEND_F_IOTLB_PERSIST  0x8
 
 #endif
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index 8a7d0dc8b0..bfd1ca643e 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -103,6 +103,11 @@
  */
 #define VIRTIO_F_NOTIFICATION_DATA	38
 
+/* This feature indicates that the driver uses the data provided by the device
+ * as a virtqueue identifier in available buffer notifications.
+ */
+#define VIRTIO_F_NOTIF_CONFIG_DATA	39
+
 /*
  * This feature indicates that the driver can reset a queue individually.
  */
diff --git a/include/standard-headers/linux/virtio_pci.h b/include/standard-headers/linux/virtio_pci.h
index be912cfc95..b7fdfd0668 100644
--- a/include/standard-headers/linux/virtio_pci.h
+++ b/include/standard-headers/linux/virtio_pci.h
@@ -166,6 +166,17 @@ struct virtio_pci_common_cfg {
 	uint32_t queue_used_hi;		/* read-write */
 };
 
+/*
+ * Warning: do not use sizeof on this: use offsetofend for
+ * specific fields you need.
+ */
+struct virtio_pci_modern_common_cfg {
+	struct virtio_pci_common_cfg cfg;
+
+	uint16_t queue_notify_data;	/* read-write */
+	uint16_t queue_reset;		/* read-write */
+};
+
 /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
 struct virtio_pci_cfg_cap {
 	struct virtio_pci_cap cap;
diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index 38e5957526..c59ea55cd8 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -491,6 +491,38 @@ struct kvm_smccc_filter {
 #define KVM_HYPERCALL_EXIT_SMC		(1U << 0)
 #define KVM_HYPERCALL_EXIT_16BIT	(1U << 1)
 
+/*
+ * Get feature ID registers userspace writable mask.
+ *
+ * From DDI0487J.a, D19.2.66 ("ID_AA64MMFR2_EL1, AArch64 Memory Model
+ * Feature Register 2"):
+ *
+ * "The Feature ID space is defined as the System register space in
+ * AArch64 with op0==3, op1=={0, 1, 3}, CRn==0, CRm=={0-7},
+ * op2=={0-7}."
+ *
+ * This covers all currently known R/O registers that indicate
+ * anything useful feature wise, including the ID registers.
+ *
+ * If we ever need to introduce a new range, it will be described as
+ * such in the range field.
+ */
+#define KVM_ARM_FEATURE_ID_RANGE_IDX(op0, op1, crn, crm, op2)		\
+	({								\
+		__u64 __op1 = (op1) & 3;				\
+		__op1 -= (__op1 == 3);					\
+		(__op1 << 6 | ((crm) & 7) << 3 | (op2));		\
+	})
+
+#define KVM_ARM_FEATURE_ID_RANGE	0
+#define KVM_ARM_FEATURE_ID_RANGE_SIZE	(3 * 8 * 8)
+
+struct reg_mask_range {
+	__u64 addr;		/* Pointer to mask array */
+	__u32 range;		/* Requested range */
+	__u32 reserved[13];
+};
+
 #endif
 
 #endif /* __ARM_KVM_H__ */
diff --git a/linux-headers/asm-generic/unistd.h b/linux-headers/asm-generic/unistd.h
index abe087c53b..756b013fb8 100644
--- a/linux-headers/asm-generic/unistd.h
+++ b/linux-headers/asm-generic/unistd.h
@@ -71,7 +71,7 @@ __SYSCALL(__NR_fremovexattr, sys_fremovexattr)
 #define __NR_getcwd 17
 __SYSCALL(__NR_getcwd, sys_getcwd)
 #define __NR_lookup_dcookie 18
-__SC_COMP(__NR_lookup_dcookie, sys_lookup_dcookie, compat_sys_lookup_dcookie)
+__SYSCALL(__NR_lookup_dcookie, sys_ni_syscall)
 #define __NR_eventfd2 19
 __SYSCALL(__NR_eventfd2, sys_eventfd2)
 #define __NR_epoll_create1 20
@@ -816,15 +816,21 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
-
 #define __NR_cachestat 451
 __SYSCALL(__NR_cachestat, sys_cachestat)
-
 #define __NR_fchmodat2 452
 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
+#define __NR_map_shadow_stack 453
+__SYSCALL(__NR_map_shadow_stack, sys_map_shadow_stack)
+#define __NR_futex_wake 454
+__SYSCALL(__NR_futex_wake, sys_futex_wake)
+#define __NR_futex_wait 455
+__SYSCALL(__NR_futex_wait, sys_futex_wait)
+#define __NR_futex_requeue 456
+__SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 
 #undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 457
 
 /*
  * 32 bit systems traditionally used different
diff --git a/linux-headers/asm-loongarch/bitsperlong.h b/linux-headers/asm-loongarch/bitsperlong.h
new file mode 100644
index 0000000000..6dc0bb0c13
--- /dev/null
+++ b/linux-headers/asm-loongarch/bitsperlong.h
@@ -0,0 +1 @@
+#include <asm-generic/bitsperlong.h>
diff --git a/linux-headers/asm-loongarch/kvm.h b/linux-headers/asm-loongarch/kvm.h
new file mode 100644
index 0000000000..c6ad2ee610
--- /dev/null
+++ b/linux-headers/asm-loongarch/kvm.h
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (C) 2020-2023 Loongson Technology Corporation Limited
+ */
+
+#ifndef __UAPI_ASM_LOONGARCH_KVM_H
+#define __UAPI_ASM_LOONGARCH_KVM_H
+
+#include <linux/types.h>
+
+/*
+ * KVM LoongArch specific structures and definitions.
+ *
+ * Some parts derived from the x86 version of this file.
+ */
+
+#define __KVM_HAVE_READONLY_MEM
+
+#define KVM_COALESCED_MMIO_PAGE_OFFSET	1
+#define KVM_DIRTY_LOG_PAGE_OFFSET	64
+
+/*
+ * for KVM_GET_REGS and KVM_SET_REGS
+ */
+struct kvm_regs {
+	/* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+	__u64 gpr[32];
+	__u64 pc;
+};
+
+/*
+ * for KVM_GET_FPU and KVM_SET_FPU
+ */
+struct kvm_fpu {
+	__u32 fcsr;
+	__u64 fcc;    /* 8x8 */
+	struct kvm_fpureg {
+		__u64 val64[4];
+	} fpr[32];
+};
+
+/*
+ * For LoongArch, we use KVM_SET_ONE_REG and KVM_GET_ONE_REG to access various
+ * registers.  The id field is broken down as follows:
+ *
+ *  bits[63..52] - As per linux/kvm.h
+ *  bits[51..32] - Must be zero.
+ *  bits[31..16] - Register set.
+ *
+ * Register set = 0: GP registers from kvm_regs (see definitions below).
+ *
+ * Register set = 1: CSR registers.
+ *
+ * Register set = 2: KVM specific registers (see definitions below).
+ *
+ * Register set = 3: FPU / SIMD registers (see definitions below).
+ *
+ * Other sets registers may be added in the future.  Each set would
+ * have its own identifier in bits[31..16].
+ */
+
+#define KVM_REG_LOONGARCH_GPR		(KVM_REG_LOONGARCH | 0x00000ULL)
+#define KVM_REG_LOONGARCH_CSR		(KVM_REG_LOONGARCH | 0x10000ULL)
+#define KVM_REG_LOONGARCH_KVM		(KVM_REG_LOONGARCH | 0x20000ULL)
+#define KVM_REG_LOONGARCH_FPSIMD	(KVM_REG_LOONGARCH | 0x30000ULL)
+#define KVM_REG_LOONGARCH_CPUCFG	(KVM_REG_LOONGARCH | 0x40000ULL)
+#define KVM_REG_LOONGARCH_MASK		(KVM_REG_LOONGARCH | 0x70000ULL)
+#define KVM_CSR_IDX_MASK		0x7fff
+#define KVM_CPUCFG_IDX_MASK		0x7fff
+
+/*
+ * KVM_REG_LOONGARCH_KVM - KVM specific control registers.
+ */
+
+#define KVM_REG_LOONGARCH_COUNTER	(KVM_REG_LOONGARCH_KVM | KVM_REG_SIZE_U64 | 1)
+#define KVM_REG_LOONGARCH_VCPU_RESET	(KVM_REG_LOONGARCH_KVM | KVM_REG_SIZE_U64 | 2)
+
+#define LOONGARCH_REG_SHIFT		3
+#define LOONGARCH_REG_64(TYPE, REG)	(TYPE | KVM_REG_SIZE_U64 | (REG << LOONGARCH_REG_SHIFT))
+#define KVM_IOC_CSRID(REG)		LOONGARCH_REG_64(KVM_REG_LOONGARCH_CSR, REG)
+#define KVM_IOC_CPUCFG(REG)		LOONGARCH_REG_64(KVM_REG_LOONGARCH_CPUCFG, REG)
+
+struct kvm_debug_exit_arch {
+};
+
+/* for KVM_SET_GUEST_DEBUG */
+struct kvm_guest_debug_arch {
+};
+
+/* definition of registers in kvm_run */
+struct kvm_sync_regs {
+};
+
+/* dummy definition */
+struct kvm_sregs {
+};
+
+struct kvm_iocsr_entry {
+	__u32 addr;
+	__u32 pad;
+	__u64 data;
+};
+
+#define KVM_NR_IRQCHIPS		1
+#define KVM_IRQCHIP_NUM_PINS	64
+#define KVM_MAX_CORES		256
+
+#endif /* __UAPI_ASM_LOONGARCH_KVM_H */
diff --git a/linux-headers/asm-loongarch/mman.h b/linux-headers/asm-loongarch/mman.h
new file mode 100644
index 0000000000..8eebf89f5a
--- /dev/null
+++ b/linux-headers/asm-loongarch/mman.h
@@ -0,0 +1 @@
+#include <asm-generic/mman.h>
diff --git a/linux-headers/asm-loongarch/unistd.h b/linux-headers/asm-loongarch/unistd.h
new file mode 100644
index 0000000000..fcb668984f
--- /dev/null
+++ b/linux-headers/asm-loongarch/unistd.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#define __ARCH_WANT_SYS_CLONE
+#define __ARCH_WANT_SYS_CLONE3
+
+#include <asm-generic/unistd.h>
diff --git a/linux-headers/asm-mips/unistd_n32.h b/linux-headers/asm-mips/unistd_n32.h
index 46d8500654..994b6f008f 100644
--- a/linux-headers/asm-mips/unistd_n32.h
+++ b/linux-headers/asm-mips/unistd_n32.h
@@ -381,5 +381,9 @@
 #define __NR_set_mempolicy_home_node (__NR_Linux + 450)
 #define __NR_cachestat (__NR_Linux + 451)
 #define __NR_fchmodat2 (__NR_Linux + 452)
+#define __NR_map_shadow_stack (__NR_Linux + 453)
+#define __NR_futex_wake (__NR_Linux + 454)
+#define __NR_futex_wait (__NR_Linux + 455)
+#define __NR_futex_requeue (__NR_Linux + 456)
 
 #endif /* _ASM_UNISTD_N32_H */
diff --git a/linux-headers/asm-mips/unistd_n64.h b/linux-headers/asm-mips/unistd_n64.h
index c2f7ac673b..41dcf5877a 100644
--- a/linux-headers/asm-mips/unistd_n64.h
+++ b/linux-headers/asm-mips/unistd_n64.h
@@ -357,5 +357,9 @@
 #define __NR_set_mempolicy_home_node (__NR_Linux + 450)
 #define __NR_cachestat (__NR_Linux + 451)
 #define __NR_fchmodat2 (__NR_Linux + 452)
+#define __NR_map_shadow_stack (__NR_Linux + 453)
+#define __NR_futex_wake (__NR_Linux + 454)
+#define __NR_futex_wait (__NR_Linux + 455)
+#define __NR_futex_requeue (__NR_Linux + 456)
 
 #endif /* _ASM_UNISTD_N64_H */
diff --git a/linux-headers/asm-mips/unistd_o32.h b/linux-headers/asm-mips/unistd_o32.h
index 757c68f2ad..ae9d334d96 100644
--- a/linux-headers/asm-mips/unistd_o32.h
+++ b/linux-headers/asm-mips/unistd_o32.h
@@ -427,5 +427,9 @@
 #define __NR_set_mempolicy_home_node (__NR_Linux + 450)
 #define __NR_cachestat (__NR_Linux + 451)
 #define __NR_fchmodat2 (__NR_Linux + 452)
+#define __NR_map_shadow_stack (__NR_Linux + 453)
+#define __NR_futex_wake (__NR_Linux + 454)
+#define __NR_futex_wait (__NR_Linux + 455)
+#define __NR_futex_requeue (__NR_Linux + 456)
 
 #endif /* _ASM_UNISTD_O32_H */
diff --git a/linux-headers/asm-powerpc/unistd_32.h b/linux-headers/asm-powerpc/unistd_32.h
index 8ef94bbac1..b9b23d66d7 100644
--- a/linux-headers/asm-powerpc/unistd_32.h
+++ b/linux-headers/asm-powerpc/unistd_32.h
@@ -434,6 +434,10 @@
 #define __NR_set_mempolicy_home_node 450
 #define __NR_cachestat 451
 #define __NR_fchmodat2 452
+#define __NR_map_shadow_stack 453
+#define __NR_futex_wake 454
+#define __NR_futex_wait 455
+#define __NR_futex_requeue 456
 
 
 #endif /* _ASM_UNISTD_32_H */
diff --git a/linux-headers/asm-powerpc/unistd_64.h b/linux-headers/asm-powerpc/unistd_64.h
index 0e7ee43e88..cbb4b3e8f7 100644
--- a/linux-headers/asm-powerpc/unistd_64.h
+++ b/linux-headers/asm-powerpc/unistd_64.h
@@ -406,6 +406,10 @@
 #define __NR_set_mempolicy_home_node 450
 #define __NR_cachestat 451
 #define __NR_fchmodat2 452
+#define __NR_map_shadow_stack 453
+#define __NR_futex_wake 454
+#define __NR_futex_wait 455
+#define __NR_futex_requeue 456
 
 
 #endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-riscv/kvm.h b/linux-headers/asm-riscv/kvm.h
index 992c5e4071..60d3b21dea 100644
--- a/linux-headers/asm-riscv/kvm.h
+++ b/linux-headers/asm-riscv/kvm.h
@@ -80,6 +80,7 @@ struct kvm_riscv_csr {
 	unsigned long sip;
 	unsigned long satp;
 	unsigned long scounteren;
+	unsigned long senvcfg;
 };
 
 /* AIA CSR registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
@@ -93,6 +94,11 @@ struct kvm_riscv_aia_csr {
 	unsigned long iprio2h;
 };
 
+/* Smstateen CSR for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
+struct kvm_riscv_smstateen_csr {
+	unsigned long sstateen0;
+};
+
 /* TIMER registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
 struct kvm_riscv_timer {
 	__u64 frequency;
@@ -131,6 +137,8 @@ enum KVM_RISCV_ISA_EXT_ID {
 	KVM_RISCV_ISA_EXT_ZICSR,
 	KVM_RISCV_ISA_EXT_ZIFENCEI,
 	KVM_RISCV_ISA_EXT_ZIHPM,
+	KVM_RISCV_ISA_EXT_SMSTATEEN,
+	KVM_RISCV_ISA_EXT_ZICOND,
 	KVM_RISCV_ISA_EXT_MAX,
 };
 
@@ -148,6 +156,7 @@ enum KVM_RISCV_SBI_EXT_ID {
 	KVM_RISCV_SBI_EXT_PMU,
 	KVM_RISCV_SBI_EXT_EXPERIMENTAL,
 	KVM_RISCV_SBI_EXT_VENDOR,
+	KVM_RISCV_SBI_EXT_DBCN,
 	KVM_RISCV_SBI_EXT_MAX,
 };
 
@@ -178,10 +187,13 @@ enum KVM_RISCV_SBI_EXT_ID {
 #define KVM_REG_RISCV_CSR		(0x03 << KVM_REG_RISCV_TYPE_SHIFT)
 #define KVM_REG_RISCV_CSR_GENERAL	(0x0 << KVM_REG_RISCV_SUBTYPE_SHIFT)
 #define KVM_REG_RISCV_CSR_AIA		(0x1 << KVM_REG_RISCV_SUBTYPE_SHIFT)
+#define KVM_REG_RISCV_CSR_SMSTATEEN	(0x2 << KVM_REG_RISCV_SUBTYPE_SHIFT)
 #define KVM_REG_RISCV_CSR_REG(name)	\
 		(offsetof(struct kvm_riscv_csr, name) / sizeof(unsigned long))
 #define KVM_REG_RISCV_CSR_AIA_REG(name)	\
 	(offsetof(struct kvm_riscv_aia_csr, name) / sizeof(unsigned long))
+#define KVM_REG_RISCV_CSR_SMSTATEEN_REG(name)  \
+	(offsetof(struct kvm_riscv_smstateen_csr, name) / sizeof(unsigned long))
 
 /* Timer registers are mapped as type 4 */
 #define KVM_REG_RISCV_TIMER		(0x04 << KVM_REG_RISCV_TYPE_SHIFT)
diff --git a/linux-headers/asm-s390/unistd_32.h b/linux-headers/asm-s390/unistd_32.h
index 716fa368ca..c093e6d5f9 100644
--- a/linux-headers/asm-s390/unistd_32.h
+++ b/linux-headers/asm-s390/unistd_32.h
@@ -425,5 +425,9 @@
 #define __NR_set_mempolicy_home_node 450
 #define __NR_cachestat 451
 #define __NR_fchmodat2 452
+#define __NR_map_shadow_stack 453
+#define __NR_futex_wake 454
+#define __NR_futex_wait 455
+#define __NR_futex_requeue 456
 
 #endif /* _ASM_S390_UNISTD_32_H */
diff --git a/linux-headers/asm-s390/unistd_64.h b/linux-headers/asm-s390/unistd_64.h
index b2a11b1d13..114c0569a4 100644
--- a/linux-headers/asm-s390/unistd_64.h
+++ b/linux-headers/asm-s390/unistd_64.h
@@ -373,5 +373,9 @@
 #define __NR_set_mempolicy_home_node 450
 #define __NR_cachestat 451
 #define __NR_fchmodat2 452
+#define __NR_map_shadow_stack 453
+#define __NR_futex_wake 454
+#define __NR_futex_wait 455
+#define __NR_futex_requeue 456
 
 #endif /* _ASM_S390_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/unistd_32.h b/linux-headers/asm-x86/unistd_32.h
index d749ad1c24..329649c377 100644
--- a/linux-headers/asm-x86/unistd_32.h
+++ b/linux-headers/asm-x86/unistd_32.h
@@ -443,6 +443,10 @@
 #define __NR_set_mempolicy_home_node 450
 #define __NR_cachestat 451
 #define __NR_fchmodat2 452
+#define __NR_map_shadow_stack 453
+#define __NR_futex_wake 454
+#define __NR_futex_wait 455
+#define __NR_futex_requeue 456
 
 
 #endif /* _ASM_UNISTD_32_H */
diff --git a/linux-headers/asm-x86/unistd_64.h b/linux-headers/asm-x86/unistd_64.h
index cea67282eb..4583606ce6 100644
--- a/linux-headers/asm-x86/unistd_64.h
+++ b/linux-headers/asm-x86/unistd_64.h
@@ -366,6 +366,9 @@
 #define __NR_cachestat 451
 #define __NR_fchmodat2 452
 #define __NR_map_shadow_stack 453
+#define __NR_futex_wake 454
+#define __NR_futex_wait 455
+#define __NR_futex_requeue 456
 
 
 #endif /* _ASM_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/unistd_x32.h b/linux-headers/asm-x86/unistd_x32.h
index 5b2e79bf4c..146d74d8e4 100644
--- a/linux-headers/asm-x86/unistd_x32.h
+++ b/linux-headers/asm-x86/unistd_x32.h
@@ -318,6 +318,9 @@
 #define __NR_set_mempolicy_home_node (__X32_SYSCALL_BIT + 450)
 #define __NR_cachestat (__X32_SYSCALL_BIT + 451)
 #define __NR_fchmodat2 (__X32_SYSCALL_BIT + 452)
+#define __NR_futex_wake (__X32_SYSCALL_BIT + 454)
+#define __NR_futex_wait (__X32_SYSCALL_BIT + 455)
+#define __NR_futex_requeue (__X32_SYSCALL_BIT + 456)
 #define __NR_rt_sigaction (__X32_SYSCALL_BIT + 512)
 #define __NR_rt_sigreturn (__X32_SYSCALL_BIT + 513)
 #define __NR_ioctl (__X32_SYSCALL_BIT + 514)
diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
index 218bf7ac98..72e8f4b9dd 100644
--- a/linux-headers/linux/iommufd.h
+++ b/linux-headers/linux/iommufd.h
@@ -47,6 +47,9 @@ enum {
 	IOMMUFD_CMD_VFIO_IOAS,
 	IOMMUFD_CMD_HWPT_ALLOC,
 	IOMMUFD_CMD_GET_HW_INFO,
+	IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING,
+	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
+	IOMMUFD_CMD_HWPT_INVALIDATE,
 };
 
 /**
@@ -347,20 +350,86 @@ struct iommu_vfio_ioas {
 };
 #define IOMMU_VFIO_IOAS _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VFIO_IOAS)
 
+/**
+ * enum iommufd_hwpt_alloc_flags - Flags for HWPT allocation
+ * @IOMMU_HWPT_ALLOC_NEST_PARENT: If set, allocate a HWPT that can serve as
+ *                                the parent HWPT in a nesting configuration.
+ * @IOMMU_HWPT_ALLOC_DIRTY_TRACKING: Dirty tracking support for device IOMMU is
+ *                                   enforced on device attachment
+ */
+enum iommufd_hwpt_alloc_flags {
+	IOMMU_HWPT_ALLOC_NEST_PARENT = 1 << 0,
+	IOMMU_HWPT_ALLOC_DIRTY_TRACKING = 1 << 1,
+};
+
+/**
+ * enum iommu_hwpt_vtd_s1_flags - Intel VT-d stage-1 page table
+ *                                entry attributes
+ * @IOMMU_VTD_S1_SRE: Supervisor request
+ * @IOMMU_VTD_S1_EAFE: Extended access enable
+ * @IOMMU_VTD_S1_WPE: Write protect enable
+ */
+enum iommu_hwpt_vtd_s1_flags {
+	IOMMU_VTD_S1_SRE = 1 << 0,
+	IOMMU_VTD_S1_EAFE = 1 << 1,
+	IOMMU_VTD_S1_WPE = 1 << 2,
+};
+
+/**
+ * struct iommu_hwpt_vtd_s1 - Intel VT-d stage-1 page table
+ *                            info (IOMMU_HWPT_DATA_VTD_S1)
+ * @flags: Combination of enum iommu_hwpt_vtd_s1_flags
+ * @pgtbl_addr: The base address of the stage-1 page table.
+ * @addr_width: The address width of the stage-1 page table
+ * @__reserved: Must be 0
+ */
+struct iommu_hwpt_vtd_s1 {
+	__aligned_u64 flags;
+	__aligned_u64 pgtbl_addr;
+	__u32 addr_width;
+	__u32 __reserved;
+};
+
+/**
+ * enum iommu_hwpt_data_type - IOMMU HWPT Data Type
+ * @IOMMU_HWPT_DATA_NONE: no data
+ * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
+ */
+enum iommu_hwpt_data_type {
+	IOMMU_HWPT_DATA_NONE,
+	IOMMU_HWPT_DATA_VTD_S1,
+};
+
 /**
  * struct iommu_hwpt_alloc - ioctl(IOMMU_HWPT_ALLOC)
  * @size: sizeof(struct iommu_hwpt_alloc)
- * @flags: Must be 0
+ * @flags: Combination of enum iommufd_hwpt_alloc_flags
  * @dev_id: The device to allocate this HWPT for
- * @pt_id: The IOAS to connect this HWPT to
+ * @pt_id: The IOAS or HWPT to connect this HWPT to
  * @out_hwpt_id: The ID of the new HWPT
  * @__reserved: Must be 0
+ * @data_type: One of enum iommu_hwpt_data_type
+ * @data_len: Length of the type specific data
+ * @data_uptr: User pointer to the type specific data
  *
  * Explicitly allocate a hardware page table object. This is the same object
  * type that is returned by iommufd_device_attach() and represents the
  * underlying iommu driver's iommu_domain kernel object.
  *
- * A HWPT will be created with the IOVA mappings from the given IOAS.
+ * A kernel-managed HWPT will be created with the mappings from the given
+ * IOAS via the @pt_id. The @data_type for this allocation must be set to
+ * IOMMU_HWPT_DATA_NONE. The HWPT can be allocated as a parent HWPT for a
+ * nesting configuration by passing IOMMU_HWPT_ALLOC_NEST_PARENT via @flags.
+ *
+ * A user-managed nested HWPT will be created from a given parent HWPT via
+ * @pt_id, in which the parent HWPT must be allocated previously via the
+ * same ioctl from a given IOAS (@pt_id). In this case, the @data_type
+ * must be set to a pre-defined type corresponding to an I/O page table
+ * type supported by the underlying IOMMU hardware.
+ *
+ * If the @data_type is set to IOMMU_HWPT_DATA_NONE, @data_len and
+ * @data_uptr should be zero. Otherwise, both @data_len and @data_uptr
+ * must be given.
  */
 struct iommu_hwpt_alloc {
 	__u32 size;
@@ -369,13 +438,26 @@ struct iommu_hwpt_alloc {
 	__u32 pt_id;
 	__u32 out_hwpt_id;
 	__u32 __reserved;
+	__u32 data_type;
+	__u32 data_len;
+	__aligned_u64 data_uptr;
 };
 #define IOMMU_HWPT_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_ALLOC)
 
+/**
+ * enum iommu_hw_info_vtd_flags - Flags for VT-d hw_info
+ * @IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17: If set, disallow read-only mappings
+ *                                         on a nested_parent domain.
+ *                                         https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
+ */
+enum iommu_hw_info_vtd_flags {
+	IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 = 1 << 0,
+};
+
 /**
  * struct iommu_hw_info_vtd - Intel VT-d hardware information
  *
- * @flags: Must be 0
+ * @flags: Combination of enum iommu_hw_info_vtd_flags
  * @__reserved: Must be 0
  *
  * @cap_reg: Value of Intel VT-d capability register defined in VT-d spec
@@ -404,6 +486,20 @@ enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_INTEL_VTD,
 };
 
+/**
+ * enum iommufd_hw_capabilities
+ * @IOMMU_HW_CAP_DIRTY_TRACKING: IOMMU hardware support for dirty tracking
+ *                               If available, it means the following APIs
+ *                               are supported:
+ *
+ *                                   IOMMU_HWPT_GET_DIRTY_BITMAP
+ *                                   IOMMU_HWPT_SET_DIRTY_TRACKING
+ *
+ */
+enum iommufd_hw_capabilities {
+	IOMMU_HW_CAP_DIRTY_TRACKING = 1 << 0,
+};
+
 /**
  * struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
  * @size: sizeof(struct iommu_hw_info)
@@ -415,6 +511,8 @@ enum iommu_hw_info_type {
  *             the iommu type specific hardware information data
  * @out_data_type: Output the iommu hardware info type as defined in the enum
  *                 iommu_hw_info_type.
+ * @out_capabilities: Output the generic iommu capability info type as defined
+ *                    in the enum iommu_hw_capabilities.
  * @__reserved: Must be 0
  *
  * Query an iommu type specific hardware information data from an iommu behind
@@ -439,6 +537,159 @@ struct iommu_hw_info {
 	__aligned_u64 data_uptr;
 	__u32 out_data_type;
 	__u32 __reserved;
+	__aligned_u64 out_capabilities;
 };
 #define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
+
+/*
+ * enum iommufd_hwpt_set_dirty_tracking_flags - Flags for steering dirty
+ *                                              tracking
+ * @IOMMU_HWPT_DIRTY_TRACKING_ENABLE: Enable dirty tracking
+ */
+enum iommufd_hwpt_set_dirty_tracking_flags {
+	IOMMU_HWPT_DIRTY_TRACKING_ENABLE = 1,
+};
+
+/**
+ * struct iommu_hwpt_set_dirty_tracking - ioctl(IOMMU_HWPT_SET_DIRTY_TRACKING)
+ * @size: sizeof(struct iommu_hwpt_set_dirty_tracking)
+ * @flags: Combination of enum iommufd_hwpt_set_dirty_tracking_flags
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain
+ * @__reserved: Must be 0
+ *
+ * Toggle dirty tracking on an HW pagetable.
+ */
+struct iommu_hwpt_set_dirty_tracking {
+	__u32 size;
+	__u32 flags;
+	__u32 hwpt_id;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_SET_DIRTY_TRACKING _IO(IOMMUFD_TYPE, \
+					  IOMMUFD_CMD_HWPT_SET_DIRTY_TRACKING)
+
+/**
+ * enum iommufd_hwpt_get_dirty_bitmap_flags - Flags for getting dirty bits
+ * @IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR: Just read the PTEs without clearing
+ *                                        any dirty bits metadata. This flag
+ *                                        can be passed in the expectation
+ *                                        where the next operation is an unmap
+ *                                        of the same IOVA range.
+ *
+ */
+enum iommufd_hwpt_get_dirty_bitmap_flags {
+	IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR = 1,
+};
+
+/**
+ * struct iommu_hwpt_get_dirty_bitmap - ioctl(IOMMU_HWPT_GET_DIRTY_BITMAP)
+ * @size: sizeof(struct iommu_hwpt_get_dirty_bitmap)
+ * @hwpt_id: HW pagetable ID that represents the IOMMU domain
+ * @flags: Combination of enum iommufd_hwpt_get_dirty_bitmap_flags
+ * @__reserved: Must be 0
+ * @iova: base IOVA of the bitmap first bit
+ * @length: IOVA range size
+ * @page_size: page size granularity of each bit in the bitmap
+ * @data: bitmap where to set the dirty bits. The bitmap bits each
+ *        represent a page_size which you deviate from an arbitrary iova.
+ *
+ * Checking a given IOVA is dirty:
+ *
+ *  data[(iova / page_size) / 64] & (1ULL << ((iova / page_size) % 64))
+ *
+ * Walk the IOMMU pagetables for a given IOVA range to return a bitmap
+ * with the dirty IOVAs. In doing so it will also by default clear any
+ * dirty bit metadata set in the IOPTE.
+ */
+struct iommu_hwpt_get_dirty_bitmap {
+	__u32 size;
+	__u32 hwpt_id;
+	__u32 flags;
+	__u32 __reserved;
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 data;
+};
+#define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
+					IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
+
+/**
+ * enum iommu_hwpt_invalidate_data_type - IOMMU HWPT Cache Invalidation
+ *                                        Data Type
+ * @IOMMU_HWPT_INVALIDATE_DATA_VTD_S1: Invalidation data for VTD_S1
+ */
+enum iommu_hwpt_invalidate_data_type {
+	IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+};
+
+/**
+ * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
+ *                                           stage-1 cache invalidation
+ * @IOMMU_VTD_INV_FLAGS_LEAF: Indicates whether the invalidation applies
+ *                            to all-levels page structure cache or just
+ *                            the leaf PTE cache.
+ */
+enum iommu_hwpt_vtd_s1_invalidate_flags {
+	IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0,
+};
+
+/**
+ * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
+ *                                       (IOMMU_HWPT_INVALIDATE_DATA_VTD_S1)
+ * @addr: The start address of the range to be invalidated. It needs to
+ *        be 4KB aligned.
+ * @npages: Number of contiguous 4K pages to be invalidated.
+ * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
+ * @__reserved: Must be 0
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation in nested translation. Userspace uses this structure to
+ * tell the impacted cache scope after modifying the stage-1 page table.
+ *
+ * Invalidating all the caches related to the page table by setting @addr
+ * to be 0 and @npages to be U64_MAX.
+ *
+ * The device TLB will be invalidated automatically if ATS is enabled.
+ */
+struct iommu_hwpt_vtd_s1_invalidate {
+	__aligned_u64 addr;
+	__aligned_u64 npages;
+	__u32 flags;
+	__u32 __reserved;
+};
+
+/**
+ * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
+ * @size: sizeof(struct iommu_hwpt_invalidate)
+ * @hwpt_id: ID of a nested HWPT for cache invalidation
+ * @data_uptr: User pointer to an array of driver-specific cache invalidation
+ *             data.
+ * @data_type: One of enum iommu_hwpt_invalidate_data_type, defining the data
+ *             type of all the entries in the invalidation request array. It
+ *             should be a type supported by the hwpt pointed by @hwpt_id.
+ * @entry_len: Length (in bytes) of a request entry in the request array
+ * @entry_num: Input the number of cache invalidation requests in the array.
+ *             Output the number of requests successfully handled by kernel.
+ * @__reserved: Must be 0.
+ *
+ * Invalidate the iommu cache for user-managed page table. Modifications on a
+ * user-managed page table should be followed by this operation to sync cache.
+ * Each ioctl can support one or more cache invalidation requests in the array
+ * that has a total size of @entry_len * @entry_num.
+ *
+ * An empty invalidation request array by setting @entry_num==0 is allowed, and
+ * @entry_len and @data_uptr would be ignored in this case. This can be used to
+ * check if the given @data_type is supported or not by kernel.
+ */
+struct iommu_hwpt_invalidate {
+	__u32 size;
+	__u32 hwpt_id;
+	__aligned_u64 data_uptr;
+	__u32 data_type;
+	__u32 entry_len;
+	__u32 entry_num;
+	__u32 __reserved;
+};
+#define IOMMU_HWPT_INVALIDATE _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HWPT_INVALIDATE)
 #endif
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 0d74ee999a..549fea3a97 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_LOONGARCH_IOCSR  38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -336,6 +337,13 @@ struct kvm_run {
 			__u32 len;
 			__u8  is_write;
 		} mmio;
+		/* KVM_EXIT_LOONGARCH_IOCSR */
+		struct {
+			__u64 phys_addr;
+			__u8  data[8];
+			__u32 len;
+			__u8  is_write;
+		} iocsr_io;
 		/* KVM_EXIT_HYPERCALL */
 		struct {
 			__u64 nr;
@@ -1188,6 +1196,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1358,6 +1367,7 @@ struct kvm_dirty_tlb {
 #define KVM_REG_ARM64		0x6000000000000000ULL
 #define KVM_REG_MIPS		0x7000000000000000ULL
 #define KVM_REG_RISCV		0x8000000000000000ULL
+#define KVM_REG_LOONGARCH	0x9000000000000000ULL
 
 #define KVM_REG_SIZE_SHIFT	52
 #define KVM_REG_SIZE_MASK	0x00f0000000000000ULL
@@ -1558,6 +1568,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ARM_MTE_COPY_TAGS	  _IOR(KVMIO,  0xb4, struct kvm_arm_copy_mte_tags)
 /* Available with KVM_CAP_COUNTER_OFFSET */
 #define KVM_ARM_SET_COUNTER_OFFSET _IOW(KVMIO,  0xb5, struct kvm_arm_counter_offset)
+#define KVM_ARM_GET_REG_WRITABLE_MASKS _IOR(KVMIO,  0xb6, struct reg_mask_range)
 
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
diff --git a/linux-headers/linux/psp-sev.h b/linux-headers/linux/psp-sev.h
index 12ccb70099..bcb21339ee 100644
--- a/linux-headers/linux/psp-sev.h
+++ b/linux-headers/linux/psp-sev.h
@@ -68,6 +68,7 @@ typedef enum {
 	SEV_RET_INVALID_PARAM,
 	SEV_RET_RESOURCE_LIMIT,
 	SEV_RET_SECURE_DATA_INVALID,
+	SEV_RET_INVALID_KEY = 0x27,
 	SEV_RET_MAX,
 } sev_ret_code;
 
diff --git a/linux-headers/linux/stddef.h b/linux-headers/linux/stddef.h
index 9bb07083ac..bf9749dd14 100644
--- a/linux-headers/linux/stddef.h
+++ b/linux-headers/linux/stddef.h
@@ -27,8 +27,13 @@
 	union { \
 		struct { MEMBERS } ATTRS; \
 		struct TAG { MEMBERS } ATTRS NAME; \
-	}
+	} ATTRS
 
+#ifdef __cplusplus
+/* sizeof(struct{}) is 1 in C++, not 0, can't use C version of the macro. */
+#define __DECLARE_FLEX_ARRAY(T, member)	\
+	T member[0]
+#else
 /**
  * __DECLARE_FLEX_ARRAY() - Declare a flexible array usable in a union
  *
@@ -49,3 +54,5 @@
 #ifndef __counted_by
 #define __counted_by(m)
 #endif
+
+#endif /* _LINUX_STDDEF_H */
diff --git a/linux-headers/linux/userfaultfd.h b/linux-headers/linux/userfaultfd.h
index 59978fbaae..953c75feda 100644
--- a/linux-headers/linux/userfaultfd.h
+++ b/linux-headers/linux/userfaultfd.h
@@ -40,7 +40,8 @@
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
 			   UFFD_FEATURE_WP_UNPOPULATED |	\
-			   UFFD_FEATURE_POISON)
+			   UFFD_FEATURE_POISON |		\
+			   UFFD_FEATURE_WP_ASYNC)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -216,6 +217,11 @@ struct uffdio_api {
 	 * (i.e. empty ptes).  This will be the default behavior for shmem
 	 * & hugetlbfs, so this flag only affects anonymous memory behavior
 	 * when userfault write-protection mode is registered.
+	 *
+	 * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
+	 * asynchronous mode is supported in which the write fault is
+	 * automatically resolved and write-protection is un-set.
+	 * It implies UFFD_FEATURE_WP_UNPOPULATED.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -232,6 +238,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
 #define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
 #define UFFD_FEATURE_POISON			(1<<14)
+#define UFFD_FEATURE_WP_ASYNC			(1<<15)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index acf72b4999..8e175ece31 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -277,8 +277,8 @@ struct vfio_region_info {
 #define VFIO_REGION_INFO_FLAG_CAPS	(1 << 3) /* Info supports caps */
 	__u32	index;		/* Region index */
 	__u32	cap_offset;	/* Offset within info struct of first cap */
-	__u64	size;		/* Region size (bytes) */
-	__u64	offset;		/* Region offset from start of device fd */
+	__aligned_u64	size;	/* Region size (bytes) */
+	__aligned_u64	offset;	/* Region offset from start of device fd */
 };
 #define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 8)
 
@@ -294,8 +294,8 @@ struct vfio_region_info {
 #define VFIO_REGION_INFO_CAP_SPARSE_MMAP	1
 
 struct vfio_region_sparse_mmap_area {
-	__u64	offset;	/* Offset of mmap'able area within region */
-	__u64	size;	/* Size of mmap'able area */
+	__aligned_u64	offset;	/* Offset of mmap'able area within region */
+	__aligned_u64	size;	/* Size of mmap'able area */
 };
 
 struct vfio_region_info_cap_sparse_mmap {
@@ -450,9 +450,9 @@ struct vfio_device_migration_info {
 					     VFIO_DEVICE_STATE_V1_RESUMING)
 
 	__u32 reserved;
-	__u64 pending_bytes;
-	__u64 data_offset;
-	__u64 data_size;
+	__aligned_u64 pending_bytes;
+	__aligned_u64 data_offset;
+	__aligned_u64 data_size;
 };
 
 /*
@@ -476,7 +476,7 @@ struct vfio_device_migration_info {
 
 struct vfio_region_info_cap_nvlink2_ssatgt {
 	struct vfio_info_cap_header header;
-	__u64 tgt;
+	__aligned_u64 tgt;
 };
 
 /*
@@ -816,7 +816,7 @@ struct vfio_device_gfx_plane_info {
 	__u32 drm_plane_type;	/* type of plane: DRM_PLANE_TYPE_* */
 	/* out */
 	__u32 drm_format;	/* drm format of plane */
-	__u64 drm_format_mod;   /* tiled mode */
+	__aligned_u64 drm_format_mod;   /* tiled mode */
 	__u32 width;	/* width of plane */
 	__u32 height;	/* height of plane */
 	__u32 stride;	/* stride of plane */
@@ -829,6 +829,7 @@ struct vfio_device_gfx_plane_info {
 		__u32 region_index;	/* region index */
 		__u32 dmabuf_id;	/* dma-buf id */
 	};
+	__u32 reserved;
 };
 
 #define VFIO_DEVICE_QUERY_GFX_PLANE _IO(VFIO_TYPE, VFIO_BASE + 14)
@@ -863,9 +864,10 @@ struct vfio_device_ioeventfd {
 #define VFIO_DEVICE_IOEVENTFD_32	(1 << 2) /* 4-byte write */
 #define VFIO_DEVICE_IOEVENTFD_64	(1 << 3) /* 8-byte write */
 #define VFIO_DEVICE_IOEVENTFD_SIZE_MASK	(0xf)
-	__u64	offset;			/* device fd offset of write */
-	__u64	data;			/* data to be written */
+	__aligned_u64	offset;		/* device fd offset of write */
+	__aligned_u64	data;		/* data to be written */
 	__s32	fd;			/* -1 for de-assignment */
+	__u32	reserved;
 };
 
 #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
@@ -1434,6 +1436,27 @@ struct vfio_device_feature_mig_data_size {
 
 #define VFIO_DEVICE_FEATURE_MIG_DATA_SIZE 9
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_SET, set or clear the BUS mastering for the device
+ * based on the operation specified in op flag.
+ *
+ * The functionality is incorporated for devices that needs bus master control,
+ * but the in-band device interface lacks the support. Consequently, it is not
+ * applicable to PCI devices, as bus master control for PCI devices is managed
+ * in-band through the configuration space. At present, this feature is supported
+ * only for CDX devices.
+ * When the device's BUS MASTER setting is configured as CLEAR, it will result in
+ * blocking all incoming DMA requests from the device. On the other hand, configuring
+ * the device's BUS MASTER setting as SET (enable) will grant the device the
+ * capability to perform DMA to the host memory.
+ */
+struct vfio_device_feature_bus_master {
+	__u32 op;
+#define		VFIO_DEVICE_FEATURE_CLEAR_MASTER	0	/* Clear Bus Master */
+#define		VFIO_DEVICE_FEATURE_SET_MASTER		1	/* Set Bus Master */
+};
+#define VFIO_DEVICE_FEATURE_BUS_MASTER 10
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
@@ -1449,7 +1472,7 @@ struct vfio_iommu_type1_info {
 	__u32	flags;
 #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
 #define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
-	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
+	__aligned_u64	iova_pgsizes;		/* Bitmap of supported page sizes */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
 	__u32   pad;
 };
diff --git a/linux-headers/linux/vhost.h b/linux-headers/linux/vhost.h
index f5c48b61ab..649560c685 100644
--- a/linux-headers/linux/vhost.h
+++ b/linux-headers/linux/vhost.h
@@ -219,4 +219,12 @@
  */
 #define VHOST_VDPA_RESUME		_IO(VHOST_VIRTIO, 0x7E)
 
+/* Get the group for the descriptor table including driver & device areas
+ * of a virtqueue: read index, write group in num.
+ * The virtqueue index is stored in the index field of vhost_vring_state.
+ * The group ID of the descriptor table for this specific virtqueue
+ * is returned via num field of vhost_vring_state.
+ */
+#define VHOST_VDPA_GET_VRING_DESC_GROUP	_IOWR(VHOST_VIRTIO, 0x7F,	\
+					      struct vhost_vring_state)
 #endif
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread
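
As a side note on the hunks above: the __u64 -> __aligned_u64 conversion
matters because plain __u64 only has 4-byte alignment in the i386 ABI, so a
struct mixing __u32 and __u64 fields can lay out differently for 32-bit and
64-bit userspace sharing the same ioctl numbers. A minimal standalone C
sketch (not part of the series; the demo_* names are made up) of the effect:

    #include <stddef.h>
    #include <stdint.h>

    struct demo_plain {                  /* mirrors the old __u64 layout */
        uint32_t index;
        uint64_t size;                   /* offset 4 on i386, 8 on x86-64 */
    };

    struct demo_aligned {                /* mirrors the __aligned_u64 layout */
        uint32_t index;
        uint32_t reserved;               /* explicit pad, as the hunks add */
        uint64_t size __attribute__((aligned(8)));
    };

    /* Holds on both 32-bit and 64-bit builds, so the ioctl ABI is stable. */
    _Static_assert(offsetof(struct demo_aligned, size) == 8,
                   "size lands at a fixed offset");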

* [PATCH rfcv1 02/23] backends/iommufd: add helpers for allocating user-managed HWPT
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 01/23] Update linux header to support nested hwpt alloc Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 03/23] backends/iommufd_device: introduce IOMMUFDDevice targeted interface Zhenzhong Duan
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan

Add a helper to allocate a user-managed HWPT and a helper for cache
invalidation, as a user-managed HWPT needs to sync its cache with guest
modifications.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/sysemu/iommufd.h |  7 +++++
 backends/iommufd.c       | 61 ++++++++++++++++++++++++++++++++++++++++
 backends/trace-events    |  2 ++
 3 files changed, 70 insertions(+)

diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
index 9af27ebd6c..ab6c382081 100644
--- a/include/sysemu/iommufd.h
+++ b/include/sysemu/iommufd.h
@@ -33,4 +33,11 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
                             ram_addr_t size, void *vaddr, bool readonly);
 int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
                               hwaddr iova, ram_addr_t size);
+int iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
+                               uint32_t pt_id, uint32_t flags,
+                               uint32_t data_type, uint32_t data_len,
+                               void *data_ptr, uint32_t *out_hwpt);
+int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
+                                     uint32_t data_type, uint32_t entry_len,
+                                     uint32_t *entry_num, void *data_ptr);
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 1ef683c7b0..9f920e08d3 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -211,6 +211,67 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
     return ret;
 }
 
+int iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
+                               uint32_t pt_id, uint32_t flags,
+                               uint32_t data_type, uint32_t data_len,
+                               void *data_ptr, uint32_t *out_hwpt)
+{
+    int ret, fd = be->fd;
+    struct iommu_hwpt_alloc alloc_hwpt = {
+        .size = sizeof(struct iommu_hwpt_alloc),
+        .flags = flags,
+        .dev_id = dev_id,
+        .pt_id = pt_id,
+        .data_type = data_type,
+        .data_len = data_len,
+        .data_uptr = (uintptr_t)data_ptr,
+        .__reserved = 0,
+    };
+
+    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
+    if (ret) {
+        ret = -errno;
+        error_report("IOMMU_HWPT_ALLOC failed: %m");
+    } else {
+        *out_hwpt = alloc_hwpt.out_hwpt_id;
+    }
+
+    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
+                                     data_len, (uintptr_t)data_ptr,
+                                     alloc_hwpt.out_hwpt_id, ret);
+    return ret;
+}
+
+int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
+                                     uint32_t data_type, uint32_t entry_len,
+                                     uint32_t *entry_num, void *data_ptr)
+{
+    int ret, fd = be->fd;
+    struct iommu_hwpt_invalidate cache = {
+        .size = sizeof(cache),
+        .hwpt_id = hwpt_id,
+        .data_type = data_type,
+        .entry_len = entry_len,
+        .entry_num = *entry_num,
+        .data_uptr = (uintptr_t)data_ptr,
+    };
+
+    ret = ioctl(fd, IOMMU_HWPT_INVALIDATE, &cache);
+
+    trace_iommufd_backend_invalidate_cache(fd, hwpt_id, data_type, entry_len,
+                                           *entry_num, cache.entry_num,
+                                           (uintptr_t)data_ptr, ret);
+    if (ret) {
+        ret = -errno;
+        *entry_num = cache.entry_num;
+        error_report("IOMMU_HWPT_INVALIDATE failed: %s", strerror(-ret));
+    } else {
+        g_assert(*entry_num == cache.entry_num);
+    }
+
+    return ret;
+}
+
 static const TypeInfo iommufd_backend_info = {
     .name = TYPE_IOMMUFD_BACKEND,
     .parent = TYPE_OBJECT,
diff --git a/backends/trace-events b/backends/trace-events
index d45c6e31a6..3df48bfb08 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -15,3 +15,5 @@ iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, u
 iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
 iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas, int ret) " iommufd=%d ioas=%d (%d)"
 iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
+iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
+iommufd_backend_invalidate_cache(int iommufd, uint32_t hwpt_id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d hwpt_id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread
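
To see how a vIOMMU might consume these helpers, here is a hypothetical
caller sketch. It assumes the VT-d stage-1 data type from the iommufd uAPI
(struct iommu_hwpt_vtd_s1 / IOMMU_HWPT_DATA_VTD_S1) imported by the header
update in patch 01; the demo_* wrapper and its parameters are made up:

    #include "sysemu/iommufd.h"          /* helpers added by this patch */

    static int demo_alloc_nested_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
                                      uint32_t s2_hwpt_id, uint64_t fspt_gpa,
                                      uint32_t *out_hwpt)
    {
        /* Guest stage-1 page table description handed to the kernel */
        struct iommu_hwpt_vtd_s1 vtd = {
            .flags = 0,
            .pgtbl_addr = fspt_gpa,      /* guest I/O page table pointer (GPA) */
            .addr_width = 48,
        };

        /* Nest the stage-1 table on top of the stage-2 (GPA->HPA) hwpt */
        return iommufd_backend_alloc_hwpt(be, dev_id, s2_hwpt_id, 0,
                                          IOMMU_HWPT_DATA_VTD_S1,
                                          sizeof(vtd), &vtd, out_hwpt);
    }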

* [PATCH rfcv1 03/23] backends/iommufd_device: introduce IOMMUFDDevice targeted interface
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 01/23] Update linux header to support nested hwpt alloc Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 02/23] backends/iommufd: add helpers for allocating user-managed HWPT Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 04/23] vfio: implement IOMMUFDDevice interface callbacks Zhenzhong Duan
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan

With the IOMMUFDDevice passed to the vIOMMU, we can query host IOMMU
information and allocate a hwpt for a device, but we still need an
extensible interface for vIOMMU usage.

This introduces an IOMMUFDDevice-targeted interface for the vIOMMU.
Currently the interface includes two callbacks, attach_hwpt/detach_hwpt,
for the vIOMMU to attach to or detach from a hwpt on the host side.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/sysemu/iommufd_device.h | 11 ++++++++++-
 backends/iommufd_device.c       | 16 +++++++++++++++-
 hw/vfio/iommufd.c               |  3 ++-
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/sysemu/iommufd_device.h b/include/sysemu/iommufd_device.h
index 795630324b..799c1345fd 100644
--- a/include/sysemu/iommufd_device.h
+++ b/include/sysemu/iommufd_device.h
@@ -17,15 +17,24 @@
 
 typedef struct IOMMUFDDevice IOMMUFDDevice;
 
+typedef struct IOMMUFDDeviceOps {
+    int (*attach_hwpt)(IOMMUFDDevice *idev, uint32_t hwpt_id);
+    int (*detach_hwpt)(IOMMUFDDevice *idev);
+} IOMMUFDDeviceOps;
+
 /* This is an abstraction of host IOMMUFD device */
 struct IOMMUFDDevice {
     IOMMUFDBackend *iommufd;
     uint32_t dev_id;
+    IOMMUFDDeviceOps *ops;
 };
 
+int iommufd_device_attach_hwpt(IOMMUFDDevice *idev, uint32_t hwpt_id);
+int iommufd_device_detach_hwpt(IOMMUFDDevice *idev);
 int iommufd_device_get_info(IOMMUFDDevice *idev,
                             enum iommu_hw_info_type *type,
                             uint32_t len, void *data);
 void iommufd_device_init(void *_idev, size_t instance_size,
-                         IOMMUFDBackend *iommufd, uint32_t dev_id);
+                         IOMMUFDBackend *iommufd, uint32_t dev_id,
+                         IOMMUFDDeviceOps *ops);
 #endif
diff --git a/backends/iommufd_device.c b/backends/iommufd_device.c
index f6e7ca1dbf..26f69252d2 100644
--- a/backends/iommufd_device.c
+++ b/backends/iommufd_device.c
@@ -14,6 +14,18 @@
 #include "qemu/error-report.h"
 #include "sysemu/iommufd_device.h"
 
+int iommufd_device_attach_hwpt(IOMMUFDDevice *idev, uint32_t hwpt_id)
+{
+    g_assert(idev->ops->attach_hwpt);
+    return idev->ops->attach_hwpt(idev, hwpt_id);
+}
+
+int iommufd_device_detach_hwpt(IOMMUFDDevice *idev)
+{
+    g_assert(idev->ops->detach_hwpt);
+    return idev->ops->detach_hwpt(idev);
+}
+
 int iommufd_device_get_info(IOMMUFDDevice *idev,
                             enum iommu_hw_info_type *type,
                             uint32_t len, void *data)
@@ -39,7 +51,8 @@ int iommufd_device_get_info(IOMMUFDDevice *idev,
 }
 
 void iommufd_device_init(void *_idev, size_t instance_size,
-                         IOMMUFDBackend *iommufd, uint32_t dev_id)
+                         IOMMUFDBackend *iommufd, uint32_t dev_id,
+                         IOMMUFDDeviceOps *ops)
 {
     IOMMUFDDevice *idev = (IOMMUFDDevice *)_idev;
 
@@ -47,4 +60,5 @@ void iommufd_device_init(void *_idev, size_t instance_size,
 
     idev->iommufd = iommufd;
     idev->dev_id = dev_id;
+    idev->ops = ops;
 }
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index cbd035f148..1b174b71ee 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -429,7 +429,8 @@ found_container:
     QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
     QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
 
-    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid);
+    iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid,
+                        NULL);
     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
                                    vbasedev->num_regions, vbasedev->flags);
     return 0;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread
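
A short sketch of how the indirection is meant to be used from the vIOMMU
side (the demo_* functions are hypothetical; only the two wrappers added
above are assumed). The vIOMMU never calls into VFIO directly, it only goes
through IOMMUFDDeviceOps:

    static int demo_bind_nested(IOMMUFDDevice *idev, uint32_t nested_hwpt_id)
    {
        /* e.g. when the guest makes a pasid entry present */
        return iommufd_device_attach_hwpt(idev, nested_hwpt_id);
    }

    static void demo_unbind(IOMMUFDDevice *idev)
    {
        /* e.g. when the guest clears the pasid entry or on vIOMMU reset */
        if (iommufd_device_detach_hwpt(idev)) {
            /* teardown failure: nothing sensible to do beyond logging */
        }
    }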

* [PATCH rfcv1 04/23] vfio: implement IOMMUFDDevice interface callbacks
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 03/23] backends/iommufd_device: introduce IOMMUFDDevice targeted interface Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 05/23] intel_iommu: add a placeholder variable for scalable modern mode Zhenzhong Duan
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan

Implement the IOMMUFDDevice interface callbacks attach_hwpt/detach_hwpt
for vIOMMU usage. The vIOMMU uses them to attach to or detach from a
hwpt on the host side.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 1b174b71ee..c8c669c59a 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -26,6 +26,8 @@
 #include "qemu/chardev_open.h"
 #include "pci.h"
 
+static IOMMUFDDeviceOps vfio_iommufd_device_ops;
+
 static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
                             ram_addr_t size, void *vaddr, bool readonly)
 {
@@ -430,7 +432,7 @@ found_container:
     QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
 
     iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid,
-                        NULL);
+                        &vfio_iommufd_device_ops);
     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
                                    vbasedev->num_regions, vbasedev->flags);
     return 0;
@@ -642,3 +644,35 @@ static const TypeInfo types[] = {
 };
 
 DEFINE_TYPES(types)
+
+static int vfio_iommufd_device_attach_hwpt(IOMMUFDDevice *idev,
+                                           uint32_t hwpt_id)
+{
+    VFIODevice *vbasedev = container_of(idev, VFIODevice, idev);
+    Error *err = NULL;
+    int ret;
+
+    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, &err);
+    if (err) {
+        error_report_err(err);
+    }
+    return ret;
+}
+
+static int vfio_iommufd_device_detach_hwpt(IOMMUFDDevice *idev)
+{
+    VFIODevice *vbasedev = container_of(idev, VFIODevice, idev);
+    Error *err = NULL;
+    int ret;
+
+    ret = iommufd_cdev_detach_ioas_hwpt(vbasedev, &err);
+    if (err) {
+        error_report_err(err);
+    }
+    return ret;
+}
+
+static IOMMUFDDeviceOps vfio_iommufd_device_ops = {
+    .attach_hwpt = vfio_iommufd_device_attach_hwpt,
+    .detach_hwpt = vfio_iommufd_device_detach_hwpt,
+};
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread
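
The callbacks above rely on the IOMMUFDDevice being embedded inside
VFIODevice so that container_of() can recover the owner. A self-contained
sketch of that pattern (the demo_* types and macro are made up):

    #include <stddef.h>

    #define demo_container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct demo_idev { unsigned int dev_id; };
    struct demo_vdev { int fd; struct demo_idev idev; };

    static struct demo_vdev *demo_vdev_from_idev(struct demo_idev *i)
    {
        /* Subtract the member offset to get back to the embedding struct */
        return demo_container_of(i, struct demo_vdev, idev);
    }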

* [PATCH rfcv1 05/23] intel_iommu: add a placeholder variable for scalable modern mode
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 04/23] vfio: implement IOMMUFDDevice interface callbacks Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 06/23] intel_iommu: check and sync host IOMMU cap/ecap in " Zhenzhong Duan
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

Add a new element scalable_modern in IntelIOMMUState to mark scalable
modern mode; this element will eventually be exposed as an intel_iommu
property.

For now, it's only a placeholder used for cap/ecap initialization,
parameter compatibility checks, etc.

There is no need to zero this element separately as IntelIOMMUState is
zeroed at creation.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  3 +++
 include/hw/i386/intel_iommu.h  |  1 +
 hw/i386/intel_iommu.c          | 13 +++++++++++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index f8cf99bddf..ee4a784a35 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -192,9 +192,11 @@
 #define VTD_ECAP_SC                 (1ULL << 7)
 #define VTD_ECAP_MHMV               (15ULL << 20)
 #define VTD_ECAP_SRS                (1ULL << 31)
+#define VTD_ECAP_EAFS               (1ULL << 34)
 #define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
 #define VTD_ECAP_SLTS               (1ULL << 46)
+#define VTD_ECAP_FLTS               (1ULL << 47)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
@@ -211,6 +213,7 @@
 #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
 #define VTD_CAP_DRAIN_WRITE         (1ULL << 54)
 #define VTD_CAP_DRAIN_READ          (1ULL << 55)
+#define VTD_CAP_FL1GP               (1ULL << 56)
 #define VTD_CAP_DRAIN               (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
 #define VTD_CAP_CM                  (1ULL << 7)
 #define VTD_PASID_ID_SHIFT          20
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index b8abbcce12..006cec116b 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -270,6 +270,7 @@ struct IntelIOMMUState {
 
     bool caching_mode;              /* RO - is cap CM enabled? */
     bool scalable_mode;             /* RO - is Scalable Mode supported? */
+    bool scalable_modern;           /* RO - is modern SM supported? */
     bool snoop_control;             /* RO - is SNP filed supported? */
 
     dma_addr_t root;                /* Current root table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index be03fcbf52..1d007c33a8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4095,8 +4095,11 @@ static void vtd_cap_init(IntelIOMMUState *s)
     }
 
     /* TODO: read cap/ecap from host to decide which cap to be exposed. */
-    if (s->scalable_mode) {
+    if (s->scalable_mode && !s->scalable_modern) {
         s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+    } else if (s->scalable_mode && s->scalable_modern) {
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_EAFS | VTD_ECAP_FLTS;
+        s->cap |= VTD_CAP_FL1GP;
     }
 
     if (s->snoop_control) {
@@ -4271,12 +4274,18 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
 
     /* Currently only address widths supported are 39 and 48 bits */
     if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
-        (s->aw_bits != VTD_HOST_AW_48BIT)) {
+        (s->aw_bits != VTD_HOST_AW_48BIT) &&
+        !s->scalable_modern) {
         error_setg(errp, "Supported values for aw-bits are: %d, %d",
                    VTD_HOST_AW_39BIT, VTD_HOST_AW_48BIT);
         return false;
     }
 
+    if ((s->aw_bits != VTD_HOST_AW_48BIT) && s->scalable_modern) {
+        error_setg(errp, "Supported values for aw-bits are: %d",
+                   VTD_HOST_AW_48BIT);
+        return false;
+    }
     if (s->scalable_mode && !s->dma_drain) {
         error_setg(errp, "Need to set dma_drain for scalable mode");
         return false;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread
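
For reference, the modern-mode branch above amounts to the following bit
arithmetic (a standalone sketch; the bit positions are copied from the hunk,
the decoded names follow the VT-d spec naming used by the macros):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t ecap = 0, cap = 0;

        ecap |= 1ULL << 43;   /* SMTS: scalable mode translation */
        ecap |= 1ULL << 31;   /* SRS:  supervisor request support */
        ecap |= 1ULL << 34;   /* EAFS: extended accessed flag support */
        ecap |= 1ULL << 47;   /* FLTS: first-stage translation support */
        cap  |= 1ULL << 56;   /* FL1GP: first-stage 1-GByte pages */

        assert(ecap == 0x0000880480000000ULL);
        assert(cap  == 0x0100000000000000ULL);
        return 0;
    }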

* [PATCH rfcv1 06/23] intel_iommu: check and sync host IOMMU cap/ecap in scalable modern mode
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 05/23] intel_iommu: add a placeholder variable for scalable modern mode Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 07/23] intel_iommu: process PASID cache invalidation Zhenzhong Duan
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

When the vIOMMU is configured in scalable modern mode, a stage-1 page
table is supported. We need to check the host-side cap/ecap against the
vIOMMU cap/ecap and sync them.

This happens when a PCIe device (i.e., the VFIO case) sets an
IOMMUFDDevice on the vIOMMU. Some of the bits in cap/ecap are user
controllable; the user setting is compared with the host cap/ecap for
compatibility, e.g., if intel_iommu is configured in scalable modern
mode but VTD_ECAP_NEST isn't set in the host ecap, that device will
fail to attach. For the other bits not controlled by the user, i.e.,
the VTD_CAP/ECAP_MASK bits, the host cap/ecap value is picked.

Below is the sequence to initialize and finalize the vIOMMU cap/ecap:

iommu->cap/ecap is initialized.                           ---- vtd_cap_init()
iommu->host_cap/ecap is initialized from iommu->cap/ecap. ---- vtd_init()
Bits in VTD_CAP/ECAP_MASK of iommu->host_cap/ecap are updated with the host setting. ---- vtd_sync_hw_info()
iommu->cap/ecap is finalized from iommu->host_cap/ecap.   ---- vtd_machine_done_hook()

iommu->host_cap/ecap is temporary storage holding the intermediate value
while synthesizing the host cap/ecap with the vIOMMU's initially
configured cap/ecap.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h | 10 ++++
 hw/i386/intel_iommu.c          | 83 ++++++++++++++++++++++++++++++----
 2 files changed, 85 insertions(+), 8 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index ee4a784a35..6d881adf9b 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -191,13 +191,19 @@
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_SC                 (1ULL << 7)
 #define VTD_ECAP_MHMV               (15ULL << 20)
+#define VTD_ECAP_NEST               (1ULL << 26)
 #define VTD_ECAP_SRS                (1ULL << 31)
 #define VTD_ECAP_EAFS               (1ULL << 34)
+#define VTD_ECAP_PSS(val)           (((val) & 0x1fULL) << 35)
 #define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
 #define VTD_ECAP_SLTS               (1ULL << 46)
 #define VTD_ECAP_FLTS               (1ULL << 47)
 
+#define VTD_ECAP_MASK               (VTD_ECAP_SRS | VTD_ECAP_EAFS)
+#define VTD_GET_PSS(val)            (((val) >> 35) & 0x1f)
+#define VTD_ECAP_PSS_MASK           (0x1fULL << 35)
+
 /* CAP_REG */
 /* (offset >> 4) << 24 */
 #define VTD_CAP_FRO                 (DMAR_FRCD_REG_OFFSET << 20)
@@ -214,11 +220,15 @@
 #define VTD_CAP_DRAIN_WRITE         (1ULL << 54)
 #define VTD_CAP_DRAIN_READ          (1ULL << 55)
 #define VTD_CAP_FL1GP               (1ULL << 56)
+#define VTD_CAP_FL5LP               (1ULL << 60)
 #define VTD_CAP_DRAIN               (VTD_CAP_DRAIN_READ | VTD_CAP_DRAIN_WRITE)
 #define VTD_CAP_CM                  (1ULL << 7)
 #define VTD_PASID_ID_SHIFT          20
 #define VTD_PASID_ID_MASK           ((1ULL << VTD_PASID_ID_SHIFT) - 1)
 
+
+#define VTD_CAP_MASK                (VTD_CAP_FL1GP | VTD_CAP_FL5LP)
+
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
 #define VTD_CAP_SAGAW_MASK          (0x1fULL << VTD_CAP_SAGAW_SHIFT)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1d007c33a8..c0973aaccb 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3819,19 +3819,82 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
+static bool vtd_check_hw_info(IntelIOMMUState *s, struct iommu_hw_info_vtd *vtd,
+                              Error **errp)
+{
+    if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
+        error_setg(errp, "Need nested translation on host in modern mode");
+        return false;
+    }
+
+    return true;
+}
+
+/* cap/ecap are read-only after the vIOMMU is finalized */
+static bool vtd_check_hw_info_finalized(IntelIOMMUState *s,
+                                        struct iommu_hw_info_vtd *vtd,
+                                        Error **errp)
+{
+    if (s->cap & ~vtd->cap_reg & VTD_CAP_MASK) {
+        error_setg(errp, "vIOMMU cap %lx isn't compatible with host %llx",
+                   s->cap, vtd->cap_reg);
+        return false;
+    }
+
+    if (s->ecap & ~vtd->ecap_reg & VTD_ECAP_MASK) {
+        error_setg(errp, "vIOMMU ecap %lx isn't compatible with host %llx",
+                   s->ecap, vtd->ecap_reg);
+        return false;
+    }
+
+    if (s->ecap & vtd->ecap_reg & VTD_ECAP_PASID &&
+        VTD_GET_PSS(s->ecap) > VTD_GET_PSS(vtd->ecap_reg)) {
+        error_setg(errp, "vIOMMU pasid bits %lu > host pasid bits %llu",
+                   VTD_GET_PSS(s->ecap), VTD_GET_PSS(vtd->ecap_reg));
+        return false;
+    }
+
+    return true;
+}
+
 static bool vtd_sync_hw_info(IntelIOMMUState *s, struct iommu_hw_info_vtd *vtd,
                              Error **errp)
 {
-    uint64_t addr_width;
+    uint64_t cap, ecap, addr_width, pasid_bits;
 
-    addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
-    if (s->aw_bits > addr_width) {
-        error_setg(errp, "User aw-bits: %u > host address width: %lu",
-                   s->aw_bits, addr_width);
+    if (!s->scalable_modern) {
+        addr_width = (vtd->cap_reg >> 16) & 0x3fULL;
+        if (s->aw_bits > addr_width) {
+            error_setg(errp, "User aw-bits: %u > host address width: %lu",
+                       s->aw_bits, addr_width);
+            return false;
+        }
+        return true;
+    }
+
+    if (!vtd_check_hw_info(s, vtd, errp)) {
         return false;
     }
 
-    /* TODO: check and sync host cap/ecap into vIOMMU cap/ecap */
+    if (s->cap_finalized) {
+        return vtd_check_hw_info_finalized(s, vtd, errp);
+    }
+
+    /* sync host cap/ecap to vIOMMU */
+
+    cap = s->host_cap & vtd->cap_reg & VTD_CAP_MASK;
+    s->host_cap &= ~VTD_CAP_MASK;
+    s->host_cap |= cap;
+    ecap = s->host_ecap & vtd->ecap_reg & VTD_ECAP_MASK;
+    s->host_ecap &= ~VTD_ECAP_MASK;
+    s->host_ecap |= ecap;
+
+    pasid_bits = VTD_GET_PSS(vtd->ecap_reg);
+    if (s->host_ecap & VTD_ECAP_PASID &&
+        VTD_GET_PSS(s->host_ecap) > pasid_bits) {
+        s->host_ecap &= ~VTD_ECAP_PSS_MASK;
+        s->host_ecap |= VTD_ECAP_PSS(pasid_bits);
+    }
 
     return true;
 }
@@ -3873,9 +3936,13 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
 
     assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
 
-    /* None IOMMUFD case */
-    if (!idev) {
+    if (!s->scalable_modern && !idev) {
+        /* Legacy vIOMMU and non-IOMMUFD backend */
         return 0;
+    } else if (!idev) {
+        /* Modern vIOMMU and non-IOMMUFD backend */
+        error_setg(errp, "Need IOMMUFD backend to setup nested page table");
+        return -1;
     }
 
     if (!vtd_check_idev(s, idev, errp)) {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread
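
The core of vtd_sync_hw_info() is a mask-merge: only bits inside
VTD_CAP_MASK/VTD_ECAP_MASK are negotiated with the host, everything else
keeps the vIOMMU's initial value. A tiny standalone sketch of that rule,
with made-up bit values:

    #include <assert.h>
    #include <stdint.h>

    static uint64_t sync_masked(uint64_t vval, uint64_t host, uint64_t mask)
    {
        uint64_t synced = vval & ~mask;    /* non-negotiable bits unchanged */
        synced |= vval & host & mask;      /* negotiable bits need both sides */
        return synced;
    }

    int main(void)
    {
        /* bit0 user-fixed; bit1/bit2 negotiable; host lacks bit2 */
        assert(sync_masked(0x7, 0x3, 0x6) == 0x3);
        return 0;
    }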

* [PATCH rfcv1 07/23] intel_iommu: process PASID cache invalidation
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (5 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 06/23] intel_iommu: check and sync host IOMMU cap/ecap in " Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 08/23] intel_iommu: add PASID cache management infrastructure Zhenzhong Duan
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

This adds PASID cache invalidation handling. When the guest updates a
pasid entry in scalable mode, guest software should issue a proper PASID
cache invalidation when caching-mode is exposed. This can happen even
when pasid is disabled, as rid_pasid will still be used.

This only adds a basic framework for handling pasid cache invalidation;
detailed handling will be added in subsequent patches.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h | 12 ++++++++++
 hw/i386/intel_iommu.c          | 40 +++++++++++++++++++++++++++++-----
 hw/i386/trace-events           |  3 +++
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 6d881adf9b..10117e2f25 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -444,6 +444,18 @@ typedef union VTDInvDesc VTDInvDesc;
         (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
         (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 
+#define VTD_INV_DESC_PASIDC_G          (3ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0  0xfff000000000ffc0ULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL1  0xffffffffffffffffULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL2  0xffffffffffffffffULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL3  0xffffffffffffffffULL
+
+#define VTD_INV_DESC_PASIDC_DSI        (0ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
+#define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c0973aaccb..effbeed8a3 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2635,6 +2635,37 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+                                   VTDInvDesc *inv_desc)
+{
+    if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
+        (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
+        (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
+        (inv_desc->val[3] & VTD_INV_DESC_PASIDC_RSVD_VAL3)) {
+        error_report_once("non-zero-field-in-pc_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
+    case VTD_INV_DESC_PASIDC_DSI:
+        break;
+
+    case VTD_INV_DESC_PASIDC_PASID_SI:
+        break;
+
+    case VTD_INV_DESC_PASIDC_GLOBAL:
+        break;
+
+    default:
+        error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -2736,12 +2767,11 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
-    /*
-     * TODO: the entity of below two cases will be implemented in future series.
-     * To make guest (which integrates scalable mode support patch set in
-     * iommu driver) work, just return true is enough so far.
-     */
     case VTD_INV_DESC_PC:
+        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_pasid_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_PIOTLB:
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 53c02d7ac8..e54799ee82 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread
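
For clarity, here is a standalone sketch that decodes the same fields the
handler above extracts (shift/mask values copied from the new macros; the
16-bit domain-id mask is an assumption matching a 16-bit DID):

    #include <stdint.h>
    #include <stdio.h>

    #define PASIDC_G(val)      ((val) & (3ULL << 4))   /* granularity field */
    #define PASIDC_DSI         (0ULL << 4)
    #define PASIDC_PASID_SI    (1ULL << 4)
    #define PASIDC_GLOBAL      (3ULL << 4)
    #define PASIDC_DID(val)    (((val) >> 16) & 0xffffULL)
    #define PASIDC_PASID(val)  (((val) >> 32) & 0xfffffULL)

    int main(void)
    {
        /* PASID-selective: granularity 01b, DID 0x10, PASID 0x42 */
        uint64_t val0 = PASIDC_PASID_SI | (0x10ULL << 16) | (0x42ULL << 32);

        if (PASIDC_G(val0) == PASIDC_PASID_SI) {
            printf("PSI did=0x%llx pasid=0x%llx\n",
                   (unsigned long long)PASIDC_DID(val0),
                   (unsigned long long)PASIDC_PASID(val0));
        }
        return 0;
    }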

* [PATCH rfcv1 08/23] intel_iommu: add PASID cache management infrastructure
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (6 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 07/23] intel_iommu: process PASID cache invalidation Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 09/23] vfio/iommufd_device: Add ioas_id in IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

This adds a PASID cache management infrastructure based on the newly
added structure VTDPASIDAddressSpace, which is used to track PASID usage
and to support future PASID-tagged DMA address translation in the
vIOMMU.

    struct VTDPASIDAddressSpace {
        PCIBus *bus;
        uint8_t devfn;
        uint32_t pasid;
        IntelIOMMUState *iommu_state;
        VTDContextCacheEntry context_cache_entry;
        QLIST_ENTRY(VTDPASIDAddressSpace) next;
        VTDPASIDCacheEntry pasid_cache_entry;
    };

The implementation manages VTDPASIDAddressSpace instances per
PASID+BDF (lookup and insert use PASID and BDF) since the Intel VT-d
spec allows a per-BDF PASID table.

For passthrough devices, a VTDPASIDAddressSpace instance is
created/destroyed when the guest sets up/destroys the corresponding
pasid entry. For emulated devices, a VTDPASIDAddressSpace instance is
created during PASID-tagged DMA translation and destroyed on guest
PASID cache invalidation. This patch focuses on PASID cache management
for passthrough devices, as there are no PASID-capable emulated devices
yet.

When the guest modifies a PASID entry, QEMU captures the guest
pasid-selective pasid cache invalidation and allocates or removes a
VTDPASIDAddressSpace instance per the invalidation reason:

    *) a present pasid entry moved to non-present
    *) a present pasid entry modified while staying present
    *) a non-present pasid entry moved to present

The vIOMMU emulator can figure out the reason by fetching the latest
guest pasid entry and comparing it with the PASID cache.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  21 ++
 include/hw/i386/intel_iommu.h  |  26 ++
 hw/i386/intel_iommu.c          | 468 +++++++++++++++++++++++++++++++++
 hw/i386/trace-events           |   1 +
 4 files changed, 516 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 10117e2f25..16dc712e94 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -325,6 +325,7 @@ typedef enum VTDFaultReason {
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
     VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
+    VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
 
     /* Output address in the interrupt address range for scalable mode */
     VTD_FR_SM_INTERRUPT_ADDR = 0x87,
@@ -512,10 +513,29 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CTX_ENTRY_LEGACY_SIZE     16
 #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
 
+#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x7)
 #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+typedef enum VTDPCInvType {
+    /* force reset all */
+    VTD_PASID_CACHE_FORCE_RESET = 0,
+    /* pasid cache invalidation relies on guest PASID entry */
+    VTD_PASID_CACHE_GLOBAL_INV,
+    VTD_PASID_CACHE_DOMSI,
+    VTD_PASID_CACHE_PASIDSI,
+} VTDPCInvType;
+
+struct VTDPASIDCacheInfo {
+    VTDPCInvType type;
+    uint16_t domain_id;
+    uint32_t pasid;
+    PCIBus *bus;
+    uint16_t devfn;
+};
+typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -527,6 +547,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
 #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
 #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
 
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 006cec116b..c7b707a3d5 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -63,6 +63,8 @@ typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
 typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
+typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
+typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -95,6 +97,25 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+struct pasid_key {
+    uint32_t pasid;
+    uint16_t sid;
+};
+
+struct VTDPASIDCacheEntry {
+    struct VTDPASIDEntry pasid_entry;
+};
+
+struct VTDPASIDAddressSpace {
+    PCIBus *bus;
+    uint8_t devfn;
+    uint32_t pasid;
+    IntelIOMMUState *iommu_state;
+    VTDContextCacheEntry context_cache_entry;
+    QLIST_ENTRY(VTDPASIDAddressSpace) next;
+    VTDPASIDCacheEntry pasid_cache_entry;
+};
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -154,6 +175,7 @@ struct VTDIOMMUFDDevice {
     uint8_t devfn;
     IOMMUFDDevice *idev;
     IntelIOMMUState *iommu_state;
+    QLIST_ENTRY(VTDIOMMUFDDevice) next;
 };
 
 struct VTDIOTLBEntry {
@@ -301,9 +323,13 @@ struct IntelIOMMUState {
 
     GHashTable *vtd_address_spaces;             /* VTD address spaces */
     VTDAddressSpace *vtd_as_cache[VTD_PCI_BUS_MAX]; /* VTD address space cache */
+    GHashTable *vtd_pasid_as;       /* VTDPASIDAddressSpace instances */
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
+    /* list of VTDIOMMUFDDevices */
+    QLIST_HEAD(, VTDIOMMUFDDevice) vtd_idev_list;
+
     GHashTable *vtd_iommufd_dev;             /* VTDIOMMUFDDevice */
 
     /* interrupt remapping */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index effbeed8a3..a1a1f23246 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -39,6 +39,7 @@
 #include "kvm/kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "qemu/jhash.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -71,6 +72,8 @@ struct vtd_iotlb_key {
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
     error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -326,6 +329,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset(s);
     vtd_iommu_unlock(s);
 }
 
@@ -757,6 +761,16 @@ static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
     return true;
 }
 
+static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
+{
+    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -2635,9 +2649,443 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
     return true;
 }
 
+static inline void vtd_init_pasid_key(uint32_t pasid,
+                                     uint16_t sid,
+                                     struct pasid_key *key)
+{
+    key->pasid = pasid;
+    key->sid = sid;
+}
+
+static guint vtd_pasid_as_key_hash(gconstpointer v)
+{
+    struct pasid_key *key = (struct pasid_key *)v;
+    uint32_t a, b, c;
+
+    /* Jenkins hash */
+    a = b = c = JHASH_INITVAL + sizeof(*key);
+    a += key->sid;
+    b += extract32(key->pasid, 0, 16);
+    c += extract32(key->pasid, 16, 16);
+
+    __jhash_mix(a, b, c);
+    __jhash_final(a, b, c);
+
+    return c;
+}
+
+static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
+{
+    const struct pasid_key *k1 = v1;
+    const struct pasid_key *k2 = v2;
+
+    return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
+}
+
+static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
+                                            uint8_t bus_num,
+                                            uint8_t devfn,
+                                            uint32_t pasid,
+                                            VTDPASIDEntry *pe)
+{
+    VTDContextEntry ce;
+    int ret;
+    dma_addr_t pasid_dir_base;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(&ce);
+    ret = vtd_get_pe_from_pasid_table(s,
+                                  pasid_dir_base, pasid, pe);
+
+    return ret;
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/*
+ * This function fills in the pasid entry in &vtd_pasid_as. Caller
+ * of this function should hold iommu_lock.
+ */
+static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
+                                 VTDPASIDAddressSpace *vtd_pasid_as,
+                                 VTDPASIDEntry *pe)
+{
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+
+    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+        /* No need to go further as cached pasid entry is latest */
+        return;
+    }
+
+    pc_entry->pasid_entry = *pe;
+    /*
+     * TODO:
+     * - send pasid bind to host for passthru devices
+     */
+}
+
+/*
+ * This function is used to clear cached pasid entry in vtd_pasid_as
+ * instances. Caller of this function should hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+                                gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDPASIDAddressSpace *vtd_pasid_as = value;
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+    PCIBus *bus = vtd_pasid_as->bus;
+    VTDPASIDEntry pe;
+    uint16_t did;
+    uint32_t pasid;
+    uint16_t devfn;
+    int ret;
+
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+    pasid = vtd_pasid_as->pasid;
+    devfn = vtd_pasid_as->devfn;
+
+    switch (pc_info->type) {
+    case VTD_PASID_CACHE_FORCE_RESET:
+        goto remove;
+    case VTD_PASID_CACHE_PASIDSI:
+        if (pc_info->pasid != pasid) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_DOMSI:
+        if (pc_info->domain_id != did) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_GLOBAL_INV:
+        break;
+    default:
+        error_report("invalid pc_info->type");
+        abort();
+    }
+
+    /*
+     * A pasid cache invalidation may indicate a present-to-present
+     * pasid entry modification. To cover such a case, the vIOMMU
+     * emulator needs to fetch the latest guest pasid entry, check
+     * it against the cached pasid entry, then update the pasid
+     * cache and send pasid bind/unbind to the host properly.
+     */
+    ret = vtd_dev_get_pe_from_pasid(s, pci_bus_num(bus),
+                                    devfn, pasid, &pe);
+    if (ret) {
+        /*
+         * No valid pasid entry in guest memory, e.g. the pasid entry
+         * was modified to be either all-zero or non-present. Either
+         * case means the existing pasid cache should be removed.
+         */
+        goto remove;
+    }
+
+    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
+    /*
+     * TODO:
+     * - when the pasid-based-iotlb (piotlb) infrastructure is ready,
+     *   we should invalidate the QEMU piotlb together with this change.
+     */
+    return false;
+remove:
+    /*
+     * TODO:
+     * - send pasid unbind to host for passthru devices
+     * - when the pasid-based-iotlb (piotlb) infrastructure is ready,
+     *   we should invalidate the QEMU piotlb together with this change.
+     */
+    return true;
+}
+
+/*
+ * This function finds or adds a VTDPASIDAddressSpace for a device
+ * when it is bound to a pasid. Caller of this function should hold
+ * iommu_lock.
+ */
+static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
+                                                   PCIBus *bus,
+                                                   int devfn,
+                                                   uint32_t pasid)
+{
+    struct pasid_key key;
+    struct pasid_key *new_key;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+    uint16_t sid;
+
+    sid = PCI_BUILD_BDF(pci_bus_num(bus), devfn);
+    vtd_init_pasid_key(pasid, sid, &key);
+    vtd_pasid_as = g_hash_table_lookup(s->vtd_pasid_as, &key);
+
+    if (!vtd_pasid_as) {
+        new_key = g_malloc0(sizeof(*new_key));
+        vtd_init_pasid_key(pasid, sid, new_key);
+        /*
+         * Initialize the vtd_pasid_as structure.
+         *
+         * This structure is used to track the guest pasid
+         * binding and also serves as a pasid-cache management entry.
+         *
+         * TODO: in the future, to support SVA-aware DMA
+         *       emulation, the vtd_pasid_as should include an
+         *       AddressSpace to support DMA emulation.
+         */
+        vtd_pasid_as = g_malloc0(sizeof(VTDPASIDAddressSpace));
+        vtd_pasid_as->iommu_state = s;
+        vtd_pasid_as->bus = bus;
+        vtd_pasid_as->devfn = devfn;
+        vtd_pasid_as->pasid = pasid;
+        g_hash_table_insert(s->vtd_pasid_as, new_key, vtd_pasid_as);
+    }
+    return vtd_pasid_as;
+}
+
+/* Caller of this function should hold iommu_lock. */
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+                                        dma_addr_t pt_base,
+                                        int start,
+                                        int end,
+                                        VTDPASIDCacheInfo *info)
+{
+    VTDPASIDEntry pe;
+    int pasid = start;
+    int pasid_next;
+    VTDPASIDAddressSpace *vtd_pasid_as;
+
+    while (pasid < end) {
+        pasid_next = pasid + 1;
+
+        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+            && vtd_pe_present(&pe)) {
+            vtd_pasid_as = vtd_add_find_pasid_as(s,
+                                       info->bus, info->devfn, pasid);
+            if ((info->type == VTD_PASID_CACHE_DOMSI ||
+                 info->type == VTD_PASID_CACHE_PASIDSI) &&
+                !(info->domain_id == vtd_pe_get_domain_id(&pe))) {
+                /*
+                 * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
+                 * require a domain ID check. If the domain ID check fails,
+                 * go to the next pasid.
+                 */
+                pasid = pasid_next;
+                continue;
+            }
+            vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
+        }
+        pasid = pasid_next;
+    }
+}
+
+/*
+ * Currently, the VT-d scalable mode pasid table is a two-level table;
+ * this function loops over a range of PASIDs in a given pasid
+ * table to identify the pasid config in the guest.
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+                                    dma_addr_t pdt_base,
+                                    int start,
+                                    int end,
+                                    VTDPASIDCacheInfo *info)
+{
+    VTDPASIDDirEntry pdire;
+    int pasid = start;
+    int pasid_next;
+    dma_addr_t pt_base;
+
+    while (pasid < end) {
+        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
+                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
+        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+            && vtd_pdire_present(&pdire)) {
+            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+        }
+        pasid = pasid_next;
+    }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+                                          int start, int end,
+                                          VTDPASIDCacheInfo *info)
+{
+    VTDContextEntry ce;
+    int bus_n, devfn;
+
+    bus_n = pci_bus_num(info->bus);
+    devfn = info->devfn;
+
+    if (!vtd_dev_to_context_entry(s, bus_n, devfn, &ce)) {
+        uint32_t max_pasid;
+
+        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+        if (end > max_pasid) {
+            end = max_pasid;
+        }
+        vtd_sm_pasid_table_walk(s,
+                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                start,
+                                end,
+                                info);
+    }
+}
+
+/*
+ * This function replays the guest pasid bindings to the host by
+ * walking the guest PASID table. This ensures the host has the
+ * latest guest pasid bindings. Caller should hold iommu_lock.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                            VTDPASIDCacheInfo *pc_info)
+{
+    VTDIOMMUFDDevice *vtd_idev;
+    int start = 0, end = 1; /* only rid2pasid is supported */
+    VTDPASIDCacheInfo walk_info;
+
+    switch (pc_info->type) {
+    case VTD_PASID_CACHE_PASIDSI:
+        start = pc_info->pasid;
+        end = pc_info->pasid + 1;
+        /*
+         * PASID selective invalidation is within domain,
+         * thus fall through.
+         */
+    case VTD_PASID_CACHE_DOMSI:
+    case VTD_PASID_CACHE_GLOBAL_INV:
+        /* loop all assigned devices */
+        break;
+    case VTD_PASID_CACHE_FORCE_RESET:
+        /* For force reset, no need to go further replay */
+        return;
+    default:
+        error_report("invalid pc_info->type for replay");
+        abort();
+    }
+
+    /*
+     * In this replay, we only need to care about the devices which
+     * are backed by a host IOMMU. For such devices, their vtd_idev
+     * instances are in the s->vtd_idev_list. For devices which
+     * are not backed by a host IOMMU, it is not necessary to replay
+     * the bindings since their cache can be re-created during future
+     * DMA address translation.
+     */
+    walk_info = *pc_info;
+    QLIST_FOREACH(vtd_idev, &s->vtd_idev_list, next) {
+        /* bus|devfn fields are not identical with pc_info */
+        walk_info.bus = vtd_idev->bus;
+        walk_info.devfn = vtd_idev->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+    }
+}
+
+/*
+ * This function syncs the pasid bindings between guest and host.
+ * It includes updating the pasid cache in vIOMMU and updating the
+ * pasid bindings per guest's latest pasid entry presence.
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info)
+{
+    if (!s->scalable_modern || !s->root_scalable || !s->dmar_enabled) {
+        return;
+    }
+
+    /*
+     * Regarding a pasid cache invalidation, e.g. a PSI,
+     * it could be any of the cases below:
+     * a) a present pasid entry moved to non-present
+     * b) a present pasid entry to be a present entry
+     * c) a non-present pasid entry moved to present
+     *
+     * Different invalidation granularity may affect different device
+     * scope and pasid scope. But for each invalidation granularity,
+     * it needs to do two steps to sync host and guest pasid binding.
+     *
+     * Here is the handling of a PSI:
+     * 1) loop all the existing vtd_pasid_as instances to update them
+     *    according to the latest guest pasid entry in the pasid table.
+     *    This will make sure affected existing vtd_pasid_as instances
+     *    cache the latest pasid entries. Also, during the loop, the
+     *    host should be notified if needed. e.g. pasid unbind or pasid
+     *    update. Should be able to cover case a) and case b).
+     *
+     * 2) loop all devices to cover case c)
+     *    - For devices which have IOMMUFDDevice instances,
+     *      we loop them and check if guest pasid entry exists. If yes,
+     *      it is case c), we update the pasid cache and also notify
+     *      host.
+     *    - For devices which have no IOMMUFDDevice, it is not
+     *      necessary to create pasid cache at this phase since it
+     *      could be created when vIOMMU does DMA address translation.
+     *      This is not yet implemented since there is no emulated
+     *      pasid-capable devices today. If we have such devices in
+     *      future, the pasid cache shall be created there.
+     * Other granularities follow the same steps, just with a different scope.
+     *
+     */
+
+    vtd_iommu_lock(s);
+    /* Step 1: loop all the existing vtd_pasid_as instances */
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                vtd_flush_pasid, pc_info);
+
+    /*
+     * Step 2: loop all the existing vtd_idev instances.
+     * Ideally, we need to loop all devices to find if there is any new
+     * PASID binding regarding the PASID cache invalidation request.
+     * But it is enough to loop the devices which are backed by a host
+     * IOMMU. For devices backed by the vIOMMU (a.k.a. emulated devices),
+     * if a new PASID appears on them, their vtd_pasid_as instance can
+     * be created during future vIOMMU DMA translation.
+     */
+    vtd_replay_guest_pasid_bindings(s, pc_info);
+    vtd_iommu_unlock(s);
+}
+
+/* Caller of this function should hold iommu_lock */
+static void vtd_pasid_cache_reset(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_reset();
+
+    pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
+
+    /*
+     * Resetting the pasid cache is a big hammer, so use
+     * g_hash_table_foreach_remove which will free
+     * the vtd_pasid_as instances. Also, as a big
+     * hammer, use VTD_PASID_CACHE_FORCE_RESET to
+     * ensure all the vtd_pasid_as instances are
+     * dropped, meanwhile the change will be passed
+     * to the host if an IOMMUFDDevice is available.
+     */
+    g_hash_table_foreach_remove(s->vtd_pasid_as,
+                                vtd_flush_pasid, &pc_info);
+}
+
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
+    uint16_t domain_id;
+    uint32_t pasid;
+    VTDPASIDCacheInfo pc_info;
+
     if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
         (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
         (inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
@@ -2647,14 +3095,27 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
         return false;
     }
 
+    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
+
     switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
     case VTD_INV_DESC_PASIDC_DSI:
+        trace_vtd_pasid_cache_dsi(domain_id);
+        pc_info.type = VTD_PASID_CACHE_DOMSI;
+        pc_info.domain_id = domain_id;
         break;
 
     case VTD_INV_DESC_PASIDC_PASID_SI:
+        /* a PASID selective invalidation implies a DID selective one */
+        trace_vtd_pasid_cache_psi(domain_id, pasid);
+        pc_info.type = VTD_PASID_CACHE_PASIDSI;
+        pc_info.domain_id = domain_id;
+        pc_info.pasid = pasid;
         break;
 
     case VTD_INV_DESC_PASIDC_GLOBAL:
+        trace_vtd_pasid_cache_gsi();
+        pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
         break;
 
     default:
@@ -2663,6 +3124,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
         return false;
     }
 
+    vtd_pasid_cache_sync(s, &pc_info);
     return true;
 }
 
@@ -3997,6 +4459,7 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
     vtd_idev->devfn = (uint8_t)devfn;
     vtd_idev->iommu_state = s;
     vtd_idev->idev = idev;
+    QLIST_INSERT_HEAD(&s->vtd_idev_list, vtd_idev, next);
 
     g_hash_table_insert(s->vtd_iommufd_dev, new_key, vtd_idev);
 
@@ -4024,6 +4487,7 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int32_t devfn)
         return;
     }
 
+    QLIST_REMOVE(vtd_idev, next);
     g_hash_table_remove(s->vtd_iommufd_dev, &key);
 
     vtd_iommu_unlock(s);
@@ -4460,6 +4924,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     }
 
     QLIST_INIT(&s->vtd_as_with_notifiers);
+    QLIST_INIT(&s->vtd_idev_list);
     qemu_mutex_init(&s->iommu_lock);
     s->cap_finalized = false;
     memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
@@ -4487,6 +4952,9 @@ static void vtd_realize(DeviceState *dev, Error **errp)
                                       g_free, g_free);
     s->vtd_iommufd_dev = g_hash_table_new_full(vtd_as_hash, vtd_as_idev_equal,
                                                g_free, g_free);
+    s->vtd_pasid_as = g_hash_table_new_full(vtd_pasid_as_key_hash,
+                                            vtd_pasid_as_key_equal,
+                                            g_free, g_free);
     vtd_init(s);
     pci_setup_iommu(bus, &vtd_iommu_ops, dev);
     /* Pseudo address space under root PCI bus. */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index e54799ee82..91d6c400b4 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -25,6 +25,7 @@ vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 09/23] vfio/iommufd_device: Add ioas_id in IOMMUFDDevice and pass to vIOMMU
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (7 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 08/23] intel_iommu: add PASID cache management infrastructure Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 10/23] intel_iommu: bind/unbind guest page table to host Zhenzhong Duan
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan

Sometimes the vIOMMU needs to re-attach a device to VFIO's default
ioas id, e.g., when the vIOMMU is disabled by the guest.
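
A minimal sketch of the intended use (mirroring the detach path of the
next patch in this series; "s" stands for the IntelIOMMUState):

    /*
     * Sketch only: when the guest has disabled DMAR, re-attach the
     * device to the default VFIO IOAS recorded in idev->ioas_id
     * instead of leaving it detached.
     */
    if (!s->dmar_enabled) {
        ret = iommufd_device_attach_hwpt(idev, idev->ioas_id);
    } else {
        ret = iommufd_device_detach_hwpt(idev);
    }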

This is a prerequisite patch for the following one.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/sysemu/iommufd_device.h | 3 ++-
 backends/iommufd_device.c       | 3 ++-
 hw/vfio/iommufd.c               | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/sysemu/iommufd_device.h b/include/sysemu/iommufd_device.h
index 799c1345fd..7aeec9b980 100644
--- a/include/sysemu/iommufd_device.h
+++ b/include/sysemu/iommufd_device.h
@@ -26,6 +26,7 @@ typedef struct IOMMUFDDeviceOps {
 struct IOMMUFDDevice {
     IOMMUFDBackend *iommufd;
     uint32_t dev_id;
+    uint32_t ioas_id;
     IOMMUFDDeviceOps *ops;
 };
 
@@ -36,5 +37,5 @@ int iommufd_device_get_info(IOMMUFDDevice *idev,
                             uint32_t len, void *data);
 void iommufd_device_init(void *_idev, size_t instance_size,
                          IOMMUFDBackend *iommufd, uint32_t dev_id,
-                         IOMMUFDDeviceOps *ops);
+                         uint32_t ioas_id, IOMMUFDDeviceOps *ops);
 #endif
diff --git a/backends/iommufd_device.c b/backends/iommufd_device.c
index 26f69252d2..f93a201453 100644
--- a/backends/iommufd_device.c
+++ b/backends/iommufd_device.c
@@ -52,7 +52,7 @@ int iommufd_device_get_info(IOMMUFDDevice *idev,
 
 void iommufd_device_init(void *_idev, size_t instance_size,
                          IOMMUFDBackend *iommufd, uint32_t dev_id,
-                         IOMMUFDDeviceOps *ops)
+                         uint32_t ioas_id, IOMMUFDDeviceOps *ops)
 {
     IOMMUFDDevice *idev = (IOMMUFDDevice *)_idev;
 
@@ -60,5 +60,6 @@ void iommufd_device_init(void *_idev, size_t instance_size,
 
     idev->iommufd = iommufd;
     idev->dev_id = dev_id;
+    idev->ioas_id = ioas_id;
     idev->ops = ops;
 }
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index c8c669c59a..3aabe41043 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -432,7 +432,7 @@ found_container:
     QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
 
     iommufd_device_init(idev, sizeof(*idev), container->be, vbasedev->devid,
-                        &vfio_iommufd_device_ops);
+                        container->ioas_id, &vfio_iommufd_device_ops);
     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
                                    vbasedev->num_regions, vbasedev->flags);
     return 0;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 10/23] intel_iommu: bind/unbind guest page table to host
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (8 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 09/23] vfio/iommufd_device: Add ioas_id in IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 11/23] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

This captures guest PASID table entry modifications and propagates
the changes to the host to attach a hwpt whose type is determined
by the guest PGTT configuration.

When PGTT is Pass-through (100b), the hwpt on the host side is a
stage-2 page table (GPA->HPA). When PGTT is First-stage Translation
only (001b), the hwpt on the host side is a nested page table.
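
In code terms, this choice can be sketched as below (a simplified view
of vtd_device_attach_hwpt() added in this patch, not the exact
implementation):

    if (vtd_pe_pgtt_is_flt(pe)) {
        /* PGTT == 001b: allocate a nested hwpt holding the guest
         * stage-1 table on top of the stage-2 hwpt */
        ret = vtd_create_s1_hwpt(idev, s2_hwpt, hwpt, pe, errp);
    } else {
        /* PGTT == 100b: pass-through, reuse the stage-2 hwpt
         * (GPA->HPA) directly */
        hwpt->hwpt_id = s2_hwpt->hwpt_id;
    }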

The guest page table is configured as a stage-1 page table (gIOVA->GPA)
whose translation result further goes through the host VT-d stage-2
page table (GPA->HPA) under nested translation mode. This is the key
to supporting gIOVA over a stage-1 page table for Intel VT-d in a
virtualization environment.

A stage-2 page table can be shared by different devices if there is
no conflict and the devices link to the same iommufd object, i.e.
devices under the same host IOMMU can share the same stage-2 page
table. If there is a conflict, e.g. one device is in non cache
coherency (non-CC) mode while the others are not, it requires a
separate stage-2 page table in non-CC mode. A condensed sketch of
the reuse logic follows the diagram.

See below example diagram:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->...
    |    (iommufd0)    |    |    (iommufd1)    |
    .------------------.    .------------------.
             |                       |
             |                       .-->...
             V
      .-------------------.    .-------------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...
      .-------------------.    .-------------------.
          |            |               |
          |            |               |
    .-----------.  .-----------.  .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |
    |           |  |           |  | (iommufd0) |
    .-----------.  .-----------.  .------------.
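
A condensed sketch of the reuse logic (see vtd_device_attach_iommufd()
and vtd_device_attach_container() in the diff below for the real flow):

    /* one container per iommufd object */
    QLIST_FOREACH(container, &s->containers, next) {
        if (container->iommufd != iommufd) {
            continue;
        }
        /*
         * Try each existing stage-2 hwpt in the container; the attach
         * fails for a device whose cache coherency mode conflicts, in
         * which case a new stage-2 hwpt is allocated in the same
         * container.
         */
        QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
            if (!vtd_device_attach_hwpt(vtd_idev, rid_pasid, pe,
                                        s2_hwpt, hwpt, &err)) {
                return 0;
            }
        }
    }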

Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  16 +
 include/hw/i386/intel_iommu.h  |  30 ++
 hw/i386/intel_iommu.c          | 641 ++++++++++++++++++++++++++++++++-
 hw/i386/trace-events           |   8 +
 4 files changed, 677 insertions(+), 18 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 16dc712e94..e33c9f54b5 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -199,6 +199,7 @@
 #define VTD_ECAP_SMTS               (1ULL << 43)
 #define VTD_ECAP_SLTS               (1ULL << 46)
 #define VTD_ECAP_FLTS               (1ULL << 47)
+#define VTD_ECAP_RPS                (1ULL << 49)
 
 #define VTD_ECAP_MASK               (VTD_ECAP_SRS | VTD_ECAP_EAFS)
 #define VTD_GET_PSS(val)            (((val) >> 35) & 0x1f)
@@ -518,6 +519,14 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+enum VTDPASIDOp {
+    VTD_PASID_BIND,
+    VTD_PASID_UPDATE,
+    VTD_PASID_UNBIND,
+    VTD_OP_NUM
+};
+typedef enum VTDPASIDOp VTDPASIDOp;
+
 typedef enum VTDPCInvType {
     /* force reset all */
     VTD_PASID_CACHE_FORCE_RESET = 0,
@@ -533,6 +542,7 @@ struct VTDPASIDCacheInfo {
     uint32_t pasid;
     PCIBus *bus;
     uint16_t devfn;
+    bool error_happened;
 };
 typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 
@@ -560,6 +570,12 @@ typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
 
+#define VTD_SM_PASID_ENTRY_FLPM          3ULL
+#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(val)  (!!((val) & 1ULL))
+#define VTD_SM_PASID_ENTRY_WPE_BIT(val)  (!!(((val) >> 4) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
+
 /* Second Level Page Translation Pointer*/
 #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
 
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index c7b707a3d5..d3122cf699 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -65,6 +65,9 @@ typedef struct VTDPASIDEntry VTDPASIDEntry;
 typedef struct VTDIOMMUFDDevice VTDIOMMUFDDevice;
 typedef struct VTDPASIDCacheEntry VTDPASIDCacheEntry;
 typedef struct VTDPASIDAddressSpace VTDPASIDAddressSpace;
+typedef struct VTDHwpt VTDHwpt;
+typedef struct VTDIOASContainer VTDIOASContainer;
+typedef struct VTDS2Hwpt VTDS2Hwpt;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -102,14 +105,37 @@ struct pasid_key {
     uint16_t sid;
 };
 
+struct VTDIOASContainer {
+    IOMMUFDBackend *iommufd;
+    uint32_t ioas_id;
+    MemoryListener listener;
+    QLIST_HEAD(, VTDS2Hwpt) s2_hwpt_list;
+    QLIST_ENTRY(VTDIOASContainer) next;
+    Error *error;
+};
+
+struct VTDS2Hwpt {
+    uint32_t users;
+    uint32_t hwpt_id;
+    VTDIOASContainer *container;
+    QLIST_ENTRY(VTDS2Hwpt) next;
+};
+
+struct VTDHwpt {
+    uint32_t hwpt_id;
+    VTDS2Hwpt *s2_hwpt;
+};
+
 struct VTDPASIDCacheEntry {
     struct VTDPASIDEntry pasid_entry;
+    bool cache_filled;
 };
 
 struct VTDPASIDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
     uint32_t pasid;
+    VTDHwpt hwpt;
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
     QLIST_ENTRY(VTDPASIDAddressSpace) next;
@@ -330,8 +356,12 @@ struct IntelIOMMUState {
     /* list of VTDIOMMUFDDevices */
     QLIST_HEAD(, VTDIOMMUFDDevice) vtd_idev_list;
 
+    QLIST_HEAD(, VTDIOASContainer) containers;
+
     GHashTable *vtd_iommufd_dev;             /* VTDIOMMUFDDevice */
 
+    VTDHwpt *s2_hwpt;
+
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
     dma_addr_t intr_root;           /* Interrupt remapping table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a1a1f23246..df93fcacd8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "qemu/jhash.h"
+#include "sysemu/iommufd.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -771,6 +772,24 @@ static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
     return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
 }
 
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+    return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
+static inline void pasid_cache_info_set_error(VTDPASIDCacheInfo *pc_info)
+{
+    if (pc_info->error_happened) {
+        return;
+    }
+    pc_info->error_happened = true;
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1631,6 +1650,17 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
     return vtd_sync_shadow_page_table_range(vtd_as, &ce, 0, UINT64_MAX);
 }
 
+static bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
+}
+
 /*
  * Check if specific device is configured to bypass address
  * translation for DMA requests. In Scalable Mode, bypass
@@ -1652,7 +1682,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
              */
             return false;
         }
-        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+        return vtd_pe_pgtt_is_pt(&pe);
     }
 
     return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
@@ -2091,6 +2121,543 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
+{
+    return !memory_region_is_ram(section->mr) ||
+           memory_region_is_protected(section->mr) ||
+           /*
+            * Sizing an enabled 64-bit BAR can cause spurious mappings to
+            * addresses in the upper part of the 64-bit address space.  These
+            * are never accessed by the CPU and beyond the address width of
+            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
+            */
+           section->offset_within_address_space & (1ULL << 63);
+}
+
+static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
+                                                 MemoryRegionSection *section)
+{
+    VTDIOASContainer *container = container_of(listener,
+                                               VTDIOASContainer, listener);
+    IOMMUFDBackend *iommufd = container->iommufd;
+    uint32_t ioas_id = container->ioas_id;
+    hwaddr iova;
+    Int128 llend, llsize;
+    void *vaddr;
+    Error *err = NULL;
+    int ret;
+
+    if (iommufd_listener_skipped_section(section)) {
+        return;
+    }
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+    llsize = int128_sub(llend, int128_make64(iova));
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    memory_region_ref(section->mr);
+
+    ret = iommufd_backend_map_dma(iommufd, ioas_id, iova, int128_get64(llsize),
+                                  vaddr, section->readonly);
+    if (!ret) {
+        return;
+    }
+
+    error_setg(&err,
+               "iommufd_listener_region_add_s2domain(%p, 0x%"HWADDR_PRIx", "
+               "0x%"HWADDR_PRIx", %p) = %d (%s)",
+               container, iova, int128_get64(llsize), vaddr, ret,
+               strerror(-ret));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        /* Allow unexpected mappings not to be fatal for RAM devices */
+        error_report_err(err);
+        return;
+    }
+
+    if (!container->error) {
+        error_propagate_prepend(&container->error, err, "Region %s: ",
+                                memory_region_name(section->mr));
+    } else {
+        error_free(err);
+    }
+}
+
+static void iommufd_listener_region_del_s2domain(MemoryListener *listener,
+                                                 MemoryRegionSection *section)
+{
+    VTDIOASContainer *container = container_of(listener,
+                                               VTDIOASContainer, listener);
+    IOMMUFDBackend *iommufd = container->iommufd;
+    uint32_t ioas_id = container->ioas_id;
+    hwaddr iova;
+    Int128 llend, llsize;
+    int ret;
+
+    if (iommufd_listener_skipped_section(section)) {
+        return;
+    }
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    ret = iommufd_backend_unmap_dma(iommufd, ioas_id,
+                                    iova, int128_get64(llsize));
+    if (ret) {
+        error_report("iommufd_listener_region_del_s2domain(%p, "
+                     "0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx") = %d (%s)",
+                     container, iova, int128_get64(llsize), ret,
+                     strerror(-ret));
+    }
+
+    memory_region_unref(section->mr);
+}
+
+static const MemoryListener iommufd_s2domain_memory_listener = {
+    .name = "iommufd_s2domain",
+    .priority = 1000,
+    .region_add = iommufd_listener_region_add_s2domain,
+    .region_del = iommufd_listener_region_del_s2domain,
+};
+
+static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
+                                  VTDPASIDEntry *pe)
+{
+    memset(vtd, 0, sizeof(*vtd));
+
+    vtd->flags =  (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_SRE : 0) |
+                  (VTD_SM_PASID_ENTRY_WPE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_WPE : 0) |
+                  (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_EAFE : 0);
+    vtd->addr_width = vtd_pe_get_fl_aw(pe);
+    vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
+}
+
+static int vtd_create_s1_hwpt(IOMMUFDDevice *idev,
+                              VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
+                              VTDPASIDEntry *pe, Error **errp)
+{
+    struct iommu_hwpt_vtd_s1 vtd;
+    uint32_t hwpt_id, s2_hwpt_id = s2_hwpt->hwpt_id;
+    int ret;
+
+    vtd_init_s1_hwpt_data(&vtd, pe);
+
+    ret = iommufd_backend_alloc_hwpt(idev->iommufd, idev->dev_id,
+                                     s2_hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+                                     sizeof(vtd), &vtd, &hwpt_id);
+    if (ret) {
+        error_setg(errp, "Failed to allocate stage-1 page table, dev_id %d",
+                   idev->dev_id);
+        return ret;
+    }
+
+    hwpt->hwpt_id = hwpt_id;
+
+    return 0;
+}
+
+static void vtd_destroy_s1_hwpt(IOMMUFDDevice *idev, VTDHwpt *hwpt)
+{
+    iommufd_backend_free_id(idev->iommufd, hwpt->hwpt_id);
+}
+
+static VTDS2Hwpt *vtd_ioas_container_get_s2_hwpt(VTDIOASContainer *container,
+                                                 uint32_t hwpt_id)
+{
+    VTDS2Hwpt *s2_hwpt;
+
+    QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
+        if (s2_hwpt->hwpt_id == hwpt_id) {
+            return s2_hwpt;
+        }
+    }
+
+    s2_hwpt = g_malloc0(sizeof(*s2_hwpt));
+
+    s2_hwpt->hwpt_id = hwpt_id;
+    s2_hwpt->container = container;
+    QLIST_INSERT_HEAD(&container->s2_hwpt_list, s2_hwpt, next);
+
+    return s2_hwpt;
+}
+
+static void vtd_ioas_container_put_s2_hwpt(VTDS2Hwpt *s2_hwpt)
+{
+    VTDIOASContainer *container = s2_hwpt->container;
+
+    if (s2_hwpt->users) {
+        return;
+    }
+
+    QLIST_REMOVE(s2_hwpt, next);
+    iommufd_backend_free_id(container->iommufd, s2_hwpt->hwpt_id);
+    g_free(s2_hwpt);
+}
+
+static void vtd_ioas_container_destroy(VTDIOASContainer *container)
+{
+    if (!QLIST_EMPTY(&container->s2_hwpt_list)) {
+        return;
+    }
+
+    QLIST_REMOVE(container, next);
+    memory_listener_unregister(&container->listener);
+    iommufd_backend_free_id(container->iommufd, container->ioas_id);
+    g_free(container);
+}
+
+static int vtd_device_attach_hwpt(VTDIOMMUFDDevice *vtd_idev,
+                                  uint32_t rid_pasid, VTDPASIDEntry *pe,
+                                  VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
+                                  Error **errp)
+{
+    IOMMUFDDevice *idev = vtd_idev->idev;
+    int ret;
+
+    if (vtd_pe_pgtt_is_flt(pe)) {
+        ret = vtd_create_s1_hwpt(vtd_idev->idev, s2_hwpt,
+                                 hwpt, pe, errp);
+        if (ret) {
+            return ret;
+        }
+    } else {
+        hwpt->hwpt_id = s2_hwpt->hwpt_id;
+    }
+
+    ret = iommufd_device_attach_hwpt(idev, hwpt->hwpt_id);
+    trace_vtd_device_attach_hwpt(idev->dev_id, rid_pasid, hwpt->hwpt_id, ret);
+    if (ret) {
+        error_setg(errp, "dev_id %d pasid %d failed to attach hwpt %d",
+                   idev->dev_id, rid_pasid, hwpt->hwpt_id);
+        if (vtd_pe_pgtt_is_flt(pe)) {
+            vtd_destroy_s1_hwpt(idev, hwpt);
+        }
+        hwpt->hwpt_id = 0;
+        return ret;
+    }
+
+    s2_hwpt->users++;
+    hwpt->s2_hwpt = s2_hwpt;
+
+    return 0;
+}
+
+static void vtd_device_detach_hwpt(VTDIOMMUFDDevice *vtd_idev,
+                                   uint32_t rid_pasid, VTDPASIDEntry *pe,
+                                   VTDHwpt *hwpt, Error **errp)
+{
+    IOMMUFDDevice *idev = vtd_idev->idev;
+    int ret;
+
+    if (vtd_idev->iommu_state->dmar_enabled) {
+        ret = iommufd_device_detach_hwpt(idev);
+        trace_vtd_device_detach_hwpt(idev->dev_id, rid_pasid, ret);
+    } else {
+        ret = iommufd_device_attach_hwpt(idev, idev->ioas_id);
+        trace_vtd_device_reattach_def_ioas(idev->dev_id, rid_pasid,
+                                           idev->ioas_id, ret);
+    }
+
+    if (ret) {
+        error_setg(errp, "dev_id %d pasid %d failed to attach hwpt %d",
+                   idev->dev_id, rid_pasid, hwpt->hwpt_id);
+    }
+
+    if (vtd_pe_pgtt_is_flt(pe)) {
+        vtd_destroy_s1_hwpt(idev, hwpt);
+    }
+
+    hwpt->s2_hwpt->users--;
+    hwpt->s2_hwpt = NULL;
+    hwpt->hwpt_id = 0;
+}
+
+static int vtd_device_attach_container(VTDIOMMUFDDevice *vtd_idev,
+                                       VTDIOASContainer *container,
+                                       uint32_t rid_pasid,
+                                       VTDPASIDEntry *pe,
+                                       VTDHwpt *hwpt,
+                                       Error **errp)
+{
+    IOMMUFDDevice *idev = vtd_idev->idev;
+    IOMMUFDBackend *iommufd = idev->iommufd;
+    VTDS2Hwpt *s2_hwpt;
+    uint32_t s2_hwpt_id;
+    Error *err = NULL;
+    int ret;
+
+    /* try to attach to an existing hwpt in this container */
+    QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
+        ret = vtd_device_attach_hwpt(vtd_idev, rid_pasid, pe,
+                                     s2_hwpt, hwpt, &err);
+        if (ret) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vtd_device_fail_attach_existing_hwpt(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            goto found_hwpt;
+        }
+    }
+
+    ret = iommufd_backend_alloc_hwpt(iommufd, idev->dev_id,
+                                     container->ioas_id,
+                                     IOMMU_HWPT_ALLOC_NEST_PARENT,
+                                     IOMMU_HWPT_DATA_NONE,
+                                     0, NULL, &s2_hwpt_id);
+    if (ret) {
+        error_setg_errno(errp, errno, "error alloc parent hwpt");
+        return ret;
+    }
+
+    s2_hwpt = vtd_ioas_container_get_s2_hwpt(container, s2_hwpt_id);
+
+    /* Attach vtd device to a new allocated hwpt within iommufd */
+    ret = vtd_device_attach_hwpt(vtd_idev, rid_pasid, pe, s2_hwpt, hwpt, &err);
+    if (ret) {
+        goto err_attach_hwpt;
+    }
+
+found_hwpt:
+    trace_vtd_device_attach_container(iommufd->fd, idev->dev_id, rid_pasid,
+                                      container->ioas_id, hwpt->hwpt_id);
+    return 0;
+
+err_attach_hwpt:
+    vtd_ioas_container_put_s2_hwpt(s2_hwpt);
+    return ret;
+}
+
+static void vtd_device_detach_container(VTDIOMMUFDDevice *vtd_idev,
+                                        uint32_t rid_pasid,
+                                        VTDPASIDEntry *pe,
+                                        VTDHwpt *hwpt,
+                                        Error **errp)
+{
+    IOMMUFDDevice *idev = vtd_idev->idev;
+    IOMMUFDBackend *iommufd = idev->iommufd;
+    VTDS2Hwpt *s2_hwpt = hwpt->s2_hwpt;
+
+    trace_vtd_device_detach_container(iommufd->fd, idev->dev_id, rid_pasid);
+    vtd_device_detach_hwpt(vtd_idev, rid_pasid, pe, hwpt, errp);
+    vtd_ioas_container_put_s2_hwpt(s2_hwpt);
+}
+
+static int vtd_device_attach_iommufd(VTDIOMMUFDDevice *vtd_idev,
+                                     uint32_t rid_pasid,
+                                     VTDPASIDEntry *pe,
+                                     VTDHwpt *hwpt,
+                                     Error **errp)
+{
+    IntelIOMMUState *s = vtd_idev->iommu_state;
+    VTDIOASContainer *container;
+    IOMMUFDBackend *iommufd = vtd_idev->idev->iommufd;
+    Error *err = NULL;
+    uint32_t ioas_id;
+    int ret;
+
+    /* try to attach to an existing container in this space */
+    QLIST_FOREACH(container, &s->containers, next) {
+        if (container->iommufd != iommufd) {
+            continue;
+        }
+
+        if (vtd_device_attach_container(vtd_idev, container,
+                                        rid_pasid, pe, hwpt, &err)) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vtd_device_fail_attach_existing_container(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            return 0;
+        }
+    }
+
+    /* Need to allocate a new dedicated container */
+    ret = iommufd_backend_alloc_ioas(iommufd, &ioas_id, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    trace_vtd_device_alloc_ioas(iommufd->fd, ioas_id);
+
+    container = g_malloc0(sizeof(*container));
+    container->iommufd = iommufd;
+    container->ioas_id = ioas_id;
+    QLIST_INIT(&container->s2_hwpt_list);
+
+    ret = vtd_device_attach_container(vtd_idev, container,
+                                      rid_pasid, pe, hwpt, errp);
+    if (ret) {
+        goto err_attach_container;
+    }
+
+    container->listener = iommufd_s2domain_memory_listener;
+    memory_listener_register(&container->listener, &address_space_memory);
+
+    if (container->error) {
+        ret = -1;
+        error_propagate_prepend(errp, container->error,
+                                "memory listener initialization failed: ");
+        goto err_listener_register;
+    }
+
+    QLIST_INSERT_HEAD(&s->containers, container, next);
+
+    return 0;
+
+err_listener_register:
+    vtd_device_detach_container(vtd_idev, rid_pasid, pe, hwpt, errp);
+err_attach_container:
+    iommufd_backend_free_id(iommufd, container->ioas_id);
+    g_free(container);
+    return ret;
+}
+
+static void vtd_device_detach_iommufd(VTDIOMMUFDDevice *vtd_idev,
+                                      uint32_t rid_pasid,
+                                      VTDPASIDEntry *pe,
+                                      VTDHwpt *hwpt,
+                                      Error **errp)
+{
+    VTDIOASContainer *container = hwpt->s2_hwpt->container;
+
+    vtd_device_detach_container(vtd_idev, rid_pasid, pe, hwpt, errp);
+    vtd_ioas_container_destroy(container);
+}
+
+static int vtd_device_attach_pgtbl(VTDIOMMUFDDevice *vtd_idev,
+                                   VTDPASIDEntry *pe,
+                                   VTDPASIDAddressSpace *vtd_pasid_as,
+                                   uint32_t rid_pasid)
+{
+    /*
+     * The host only accepts a guest FLT under nesting, so bind only
+     * when pe->pgtt == FLT. If pe->pgtt == PT, set up the pasid with
+     * the GPA page table. Otherwise return failure.
+     */
+    if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+        return -EINVAL;
+    }
+
+    /* Should fail if the FLPT base is 0 */
+    if (vtd_pe_pgtt_is_flt(pe) && !vtd_pe_get_flpt_base(pe)) {
+        return -EINVAL;
+    }
+
+    return vtd_device_attach_iommufd(vtd_idev, rid_pasid, pe,
+                                     &vtd_pasid_as->hwpt, &error_abort);
+}
+
+static int vtd_device_detach_pgtbl(VTDIOMMUFDDevice *vtd_idev,
+                                  VTDPASIDAddressSpace *vtd_pasid_as,
+                                  uint32_t rid_pasid)
+{
+    VTDPASIDEntry *cached_pe = vtd_pasid_as->pasid_cache_entry.cache_filled ?
+                       &vtd_pasid_as->pasid_cache_entry.pasid_entry : NULL;
+
+    if (!cached_pe ||
+        (!vtd_pe_pgtt_is_flt(cached_pe) && !vtd_pe_pgtt_is_pt(cached_pe))) {
+        return 0;
+    }
+
+    vtd_device_detach_iommufd(vtd_idev, rid_pasid, cached_pe,
+                              &vtd_pasid_as->hwpt, &error_abort);
+
+    return 0;
+}
+
+static int vtd_dev_get_rid2pasid(IntelIOMMUState *s, uint8_t bus_num,
+                                 uint8_t devfn, uint32_t *rid_pasid)
+{
+    VTDContextEntry ce;
+    int ret;
+
+    /*
+     * Currently, the ECAP.RPS bit is likely to be reported as "Clear",
+     * and per the VT-d 3.1 spec, PASID #0 is used as RID2PASID when
+     * the RPS bit is reported as "Clear".
+     */
+    if (likely(!(s->ecap & VTD_ECAP_RPS))) {
+        *rid_pasid = 0;
+        return 0;
+    }
+
+    /*
+     * In the future, to improve performance, we could try to fetch
+     * the context entry from the cache first.
+     */
+    ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+    if (!ret) {
+        *rid_pasid = VTD_CE_GET_RID2PASID(&ce);
+    }
+
+    return ret;
+}
+
+/**
+ * Caller should hold iommu_lock.
+ */
+static int vtd_bind_guest_pasid(VTDPASIDAddressSpace *vtd_pasid_as,
+                                VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    VTDIOMMUFDDevice *vtd_idev;
+    uint32_t rid_pasid;
+    int devfn = vtd_pasid_as->devfn;
+    int ret = -EINVAL;
+    struct vtd_as_key key = {
+        .bus = vtd_pasid_as->bus,
+        .devfn = devfn,
+    };
+
+    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
+    if (!vtd_idev || !vtd_idev->idev) {
+        /* no need to go further, e.g. for emulated devices */
+        return 0;
+    }
+
+    if (vtd_dev_get_rid2pasid(s, pci_bus_num(vtd_pasid_as->bus),
+                              devfn, &rid_pasid)) {
+        error_report("Unable to get rid_pasid for devfn: %d!", devfn);
+        return ret;
+    }
+
+    if (vtd_pasid_as->pasid != rid_pasid) {
+        error_report("Non-rid_pasid %d not supported yet", vtd_pasid_as->pasid);
+        return ret;
+    }
+
+    switch (op) {
+    case VTD_PASID_UPDATE:
+    case VTD_PASID_BIND:
+    {
+        ret = vtd_device_attach_pgtbl(vtd_idev, pe, vtd_pasid_as, rid_pasid);
+        break;
+    }
+    case VTD_PASID_UNBIND:
+    {
+        ret = vtd_device_detach_pgtbl(vtd_idev, vtd_pasid_as, rid_pasid);
+        break;
+    }
+    default:
+        error_report_once("Unknown VTDPASIDOp!!!\n");
+        break;
+    }
+
+    return ret;
+}
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -2717,22 +3284,30 @@ static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
  * This function fills in the pasid entry in &vtd_pasid_as. Caller
  * of this function should hold iommu_lock.
  */
-static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
-                                 VTDPASIDAddressSpace *vtd_pasid_as,
-                                 VTDPASIDEntry *pe)
+static int vtd_fill_pe_in_cache(IntelIOMMUState *s,
+                                VTDPASIDAddressSpace *vtd_pasid_as,
+                                VTDPASIDEntry *pe)
 {
     VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+    int ret;
 
-    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
-        /* No need to go further as cached pasid entry is latest */
-        return;
+    if (pc_entry->cache_filled) {
+        if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+            /* No need to go further as cached pasid entry is latest */
+            return 0;
+        }
+        ret = vtd_bind_guest_pasid(vtd_pasid_as,
+                                   pe, VTD_PASID_UPDATE);
+    } else {
+        ret = vtd_bind_guest_pasid(vtd_pasid_as,
+                                   pe, VTD_PASID_BIND);
     }
 
-    pc_entry->pasid_entry = *pe;
-    /*
-     * TODO:
-     * - send pasid bind to host for passthru devices
-     */
+    if (!ret) {
+        pc_entry->pasid_entry = *pe;
+        pc_entry->cache_filled = true;
+    }
+    return ret;
 }
 
 /*
@@ -2795,7 +3370,11 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         goto remove;
     }
 
-    vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
+    if (vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe)) {
+        pasid_cache_info_set_error(pc_info);
+        return true;
+    }
+
     /*
      * TODO:
+     * - when the pasid-based-iotlb (piotlb) infrastructure is ready,
@@ -2805,10 +3384,14 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
 remove:
     /*
      * TODO:
-     * - send pasid bind to host for passthru devices
      * - when the pasid-based-iotlb (piotlb) infrastructure is ready,
      *   we should invalidate the QEMU piotlb together with this change.
      */
+    if (vtd_bind_guest_pasid(vtd_pasid_as,
+                             NULL, VTD_PASID_UNBIND)) {
+        pasid_cache_info_set_error(pc_info);
+    }
+
     return true;
 }
 
@@ -2854,6 +3437,22 @@ static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
     return vtd_pasid_as;
 }
 
+/* Caller of this function should hold iommu_lock. */
+static void vtd_remove_pasid_as(VTDPASIDAddressSpace *vtd_pasid_as)
+{
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    PCIBus *bus = vtd_pasid_as->bus;
+    struct pasid_key key;
+    int devfn = vtd_pasid_as->devfn;
+    uint32_t pasid = vtd_pasid_as->pasid;
+    uint16_t sid;
+
+    sid = PCI_BUILD_BDF(pci_bus_num(bus), devfn);
+    vtd_init_pasid_key(pasid, sid, &key);
+
+    g_hash_table_remove(s->vtd_pasid_as, &key);
+}
+
 /* Caller of this function should hold iommu_lock. */
 static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
                                         dma_addr_t pt_base,
@@ -2884,7 +3483,10 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
                 pasid = pasid_next;
                 continue;
             }
-            vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe);
+            if (vtd_fill_pe_in_cache(s, vtd_pasid_as, &pe)) {
+                vtd_remove_pasid_as(vtd_pasid_as);
+                pasid_cache_info_set_error(info);
+            }
         }
         pasid = pasid_next;
     }
@@ -2991,6 +3593,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
         walk_info.devfn = vtd_idev->devfn;
         vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
     }
+    if (walk_info.error_happened) {
+        pasid_cache_info_set_error(pc_info);
+    }
 }
 
 /*
@@ -3060,7 +3665,7 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
 /* Caller of this function should hold iommu_lock */
 static void vtd_pasid_cache_reset(IntelIOMMUState *s)
 {
-    VTDPASIDCacheInfo pc_info;
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
 
     trace_vtd_pasid_cache_reset();
 
@@ -3082,9 +3687,9 @@ static void vtd_pasid_cache_reset(IntelIOMMUState *s)
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
     uint16_t domain_id;
     uint32_t pasid;
-    VTDPASIDCacheInfo pc_info;
 
     if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
         (inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
@@ -3125,7 +3730,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     }
 
     vtd_pasid_cache_sync(s, &pc_info);
-    return true;
+    return !pc_info.error_happened;
 }
 
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 91d6c400b4..17e7191696 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -72,6 +72,14 @@ vtd_frr_new(int index, uint64_t hi, uint64_t lo) "index %d high 0x%"PRIx64" low
 vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
 vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
 vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_ioas(uint32_t dev_id, uint32_t pasid, uint32_t ioas_id, int ret) "dev_id %d pasid %d ioas_id %d, ret: %d"
+vtd_device_fail_attach_existing_hwpt(const char *msg) " %s"
+vtd_device_attach_container(int fd, uint32_t dev_id, uint32_t pasid, uint32_t ioas_id, uint32_t hwpt_id) "iommufd %d dev_id %d pasid %d ioas_id %d hwpt_id %d"
+vtd_device_detach_container(int fd, uint32_t dev_id, uint32_t pasid) "iommufd %d dev_id %d pasid %d"
+vtd_device_fail_attach_existing_container(const char *msg) " %s"
+vtd_device_alloc_ioas(int fd, uint32_t ioas_id) "iommufd %d ioas_id %d"
 
 # amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 11/23] intel_iommu: ERRATA_772415 workaround
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (9 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 10/23] intel_iommu: bind/unbind guest page table to host Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 12/23] intel_iommu: replay pasid binds after context cache invalidation Zhenzhong Duan
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a readonly
range mapped in the stage-2 page table could still be written.

Reference: 4th Gen Intel Xeon Processor Scalable Family Specification
Update [0], Errata Details, SPR17.

[0] https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update

We utilize the newly added IOMMUFD container/ioas/hwpt management framework
in VT-d. Add a check to create a new VTDIOASContainer that holds the RW-only
mappings; this VTDIOASContainer can then be used as the backend for devices
affected by ERRATA_772415. See the diagram below for details, and the
condensed sketch after it:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.
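
A condensed sketch of how the erratum propagates (all names from the
diff below):

    /* hw info reported by the host records the erratum per device */
    vtd_idev->errata = flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;

    /* container matching keys on it, so affected devices land in a
     * dedicated VTDIOASContainer ... */
    if (container->iommufd != iommufd ||
        container->errata != vtd_idev->errata) {
        continue;
    }

    /* ... whose memory listener skips readonly sections, keeping RO
     * ranges out of the stage-2 page table entirely */
    if (container->errata && section->readonly) {
        return true; /* in iommufd_listener_skipped_section() */
    }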

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/i386/intel_iommu.h |  2 ++
 hw/i386/intel_iommu.c         | 27 +++++++++++++++++++--------
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index d3122cf699..72702e10a2 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -108,6 +108,7 @@ struct pasid_key {
 struct VTDIOASContainer {
     IOMMUFDBackend *iommufd;
     uint32_t ioas_id;
+    uint32_t errata;
     MemoryListener listener;
     QLIST_HEAD(, VTDS2Hwpt) s2_hwpt_list;
     QLIST_ENTRY(VTDIOASContainer) next;
@@ -200,6 +201,7 @@ struct VTDIOMMUFDDevice {
     PCIBus *bus;
     uint8_t devfn;
     IOMMUFDDevice *idev;
+    uint32_t errata;
     IntelIOMMUState *iommu_state;
     QLIST_ENTRY(VTDIOMMUFDDevice) next;
 };
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index df93fcacd8..8f9a59ae6f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2121,7 +2121,8 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
-static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
+static bool iommufd_listener_skipped_section(VTDIOASContainer *container,
+                                             MemoryRegionSection *section)
 {
     return !memory_region_is_ram(section->mr) ||
            memory_region_is_protected(section->mr) ||
@@ -2131,7 +2132,8 @@ static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
             * are never accessed by the CPU and beyond the address width of
             * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
             */
-           section->offset_within_address_space & (1ULL << 63);
+           section->offset_within_address_space & (1ULL << 63) ||
+           (container->errata && section->readonly);
 }
 
 static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
@@ -2147,7 +2149,7 @@ static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
     Error *err = NULL;
     int ret;
 
-    if (iommufd_listener_skipped_section(section)) {
+    if (iommufd_listener_skipped_section(container, section)) {
         return;
     }
     iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
@@ -2198,7 +2200,7 @@ static void iommufd_listener_region_del_s2domain(MemoryListener *listener,
     Int128 llend, llsize;
     int ret;
 
-    if (iommufd_listener_skipped_section(section)) {
+    if (iommufd_listener_skipped_section(container, section)) {
         return;
     }
     iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
@@ -2468,7 +2470,8 @@ static int vtd_device_attach_iommufd(VTDIOMMUFDDevice *vtd_idev,
 
     /* try to attach to an existing container in this space */
     QLIST_FOREACH(container, &s->containers, next) {
-        if (container->iommufd != iommufd) {
+        if (container->iommufd != iommufd ||
+            container->errata != vtd_idev->errata) {
             continue;
         }
 
@@ -2495,6 +2498,7 @@ static int vtd_device_attach_iommufd(VTDIOMMUFDDevice *vtd_idev,
     container = g_malloc0(sizeof(*container));
     container->iommufd = iommufd;
     container->ioas_id = ioas_id;
+    container->errata = vtd_idev->errata;
     QLIST_INIT(&container->s2_hwpt_list);
 
     if (vtd_device_attach_container(vtd_idev, container,
@@ -5002,10 +5006,11 @@ static bool vtd_sync_hw_info(IntelIOMMUState *s, struct iommu_hw_info_vtd *vtd,
  * could bind guest page table to host.
  */
 static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
-                           Error **errp)
+                           uint32_t *flags, Error **errp)
 {
     struct iommu_hw_info_vtd vtd;
     enum iommu_hw_info_type type = IOMMU_HW_INFO_TYPE_INTEL_VTD;
+    bool passed;
 
     if (iommufd_device_get_info(idev, &type, sizeof(vtd), &vtd)) {
         error_setg(errp, "Failed to get IOMMU capability!!!");
@@ -5017,7 +5022,11 @@ static bool vtd_check_idev(IntelIOMMUState *s, IOMMUFDDevice *idev,
         return false;
     }
 
-    return vtd_sync_hw_info(s, &vtd, errp);
+    passed = vtd_sync_hw_info(s, &vtd, errp);
+    if (passed) {
+        *flags = vtd.flags;
+    }
+    return passed;
 }
 
 static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
@@ -5030,6 +5039,7 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
         .devfn = devfn,
     };
     struct vtd_as_key *new_key;
+    uint32_t flags;
 
     assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
 
@@ -5042,7 +5052,7 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
         return -1;
     }
 
-    if (!vtd_check_idev(s, idev, errp)) {
+    if (!vtd_check_idev(s, idev, &flags, errp)) {
         return -1;
     }
 
@@ -5064,6 +5074,7 @@ static int vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int32_t devfn,
     vtd_idev->devfn = (uint8_t)devfn;
     vtd_idev->iommu_state = s;
     vtd_idev->idev = idev;
+    vtd_idev->errata = flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
     QLIST_INSERT_HEAD(&s->vtd_idev_list, vtd_idev, next);
 
     g_hash_table_insert(s->vtd_iommufd_dev, new_key, vtd_idev);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 12/23] intel_iommu: replay pasid binds after context cache invalidation
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (10 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 11/23] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 13/23] intel_iommu: process PASID-based iotlb invalidation Zhenzhong Duan
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

This replays guest pasid attachments after a context cache
invalidation. This is done to ensure safety: strictly speaking, the
programmer should issue a pasid cache invalidation with the proper
granularity after issuing a context cache invalidation.
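
Condensed from the diff below: both the global and the new
device-selective invalidation paths funnel into vtd_pasid_cache_sync()
with a matching scope:

    /* in vtd_context_global_invalidate() */
    pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
    vtd_pasid_cache_sync(s, &pc_info);

    /* in vtd_context_device_invalidate(), via vtd_pasid_cache_devsi() */
    pc_info.type = VTD_PASID_CACHE_DEVSI;
    pc_info.bus = bus;
    pc_info.devfn = devfn;
    vtd_pasid_cache_sync(s, &pc_info);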

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 hw/i386/intel_iommu.c          | 49 ++++++++++++++++++++++++++++++++++
 hw/i386/trace-events           |  1 +
 3 files changed, 51 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e33c9f54b5..65fe07c13b 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -532,6 +532,7 @@ typedef enum VTDPCInvType {
     VTD_PASID_CACHE_FORCE_RESET = 0,
     /* pasid cache invalidation rely on guest PASID entry */
     VTD_PASID_CACHE_GLOBAL_INV,
+    VTD_PASID_CACHE_DEVSI,
     VTD_PASID_CACHE_DOMSI,
     VTD_PASID_CACHE_PASIDSI,
 } VTDPCInvType;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8f9a59ae6f..9058be9efd 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -74,6 +74,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  PCIBus *bus, uint16_t devfn);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -2102,6 +2106,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
     trace_vtd_inv_desc_cc_global();
     /* Protects context cache */
     vtd_iommu_lock(s);
@@ -2119,6 +2125,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
      * VT-d emulation codes.
      */
     vtd_iommu_replay_all(s);
+
+    pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+    vtd_pasid_cache_sync(s, &pc_info);
 }
 
 static bool iommufd_listener_skipped_section(VTDIOASContainer *container,
@@ -2720,6 +2729,21 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
              * happened.
              */
             vtd_address_space_sync(vtd_as);
+            /*
+             * Per spec, a context flush should also be followed by a
+             * PASID cache and iotlb flush. For a device selective
+             * context cache invalidation:
+             * if (emulated_device)
+             *    invalidate pasid cache and pasid-based iotlb
+             * else if (assigned_device)
+             *    check if the device has been bound to any pasid
+             *    invoke pasid_unbind for each bound pasid
+             * Here, vtd_pasid_cache_devsi() invalidates the pasid
+             * caches, while for the piotlb there is no handling yet in
+             * QEMU. For an assigned device, the host iommu driver
+             * flushes the piotlb when a pasid unbind is passed down to
+             * it.
+             */
+            vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
         }
     }
 }
@@ -3351,6 +3375,12 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         /* Fall through */
     case VTD_PASID_CACHE_GLOBAL_INV:
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        if (pc_info->bus != bus ||
+            pc_info->devfn != devfn) {
+            return false;
+        }
+        break;
     default:
         error_report("invalid pc_info->type");
         abort();
@@ -3574,6 +3604,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     case VTD_PASID_CACHE_GLOBAL_INV:
         /* loop all assigned devices */
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        walk_info.bus = pc_info->bus;
+        walk_info.devfn = pc_info->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+        return;
     case VTD_PASID_CACHE_FORCE_RESET:
         /* For force reset, no need to go further replay */
         return;
@@ -3666,6 +3701,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
     vtd_iommu_unlock(s);
 }
 
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  PCIBus *bus, uint16_t devfn)
+{
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
+    trace_vtd_pasid_cache_devsi(devfn);
+
+    pc_info.type = VTD_PASID_CACHE_DEVSI;
+    pc_info.bus = bus;
+    pc_info.devfn = devfn;
+
+    vtd_pasid_cache_sync(s, &pc_info);
+}
+
 /* Caller of this function should hold iommu_lock */
 static void vtd_pasid_cache_reset(IntelIOMMUState *s)
 {
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 17e7191696..66f7c1ba59 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 13/23] intel_iommu: process PASID-based iotlb invalidation
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (11 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 12/23] intel_iommu: replay pasid binds after context cache invalidation Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 14/23] intel_iommu: propagate PASID-based iotlb invalidation to host Zhenzhong Duan
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

The PASID-based iotlb (piotlb) is used when walking the Intel
VT-d stage-1 page table.

This adds the basic framework for piotlb invalidation; the
detailed handling will be added in the next patch.
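
The framework dispatches on the descriptor granularity (condensed from
the diff below; the two invalidate helpers are still stubs):

    switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
    case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
        vtd_piotlb_pasid_invalidate(s, domain_id, pasid);
        break;
    case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am, ih);
        break;
    default:
        return false; /* invalid granularity */
    }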

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h | 13 +++++++++
 hw/i386/intel_iommu.c          | 52 ++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 65fe07c13b..40361de207 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -458,6 +458,19 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
 #define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
 
+#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
+#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
+
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000ffc0ULL
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
+
+#define VTD_INV_DESC_PIOTLB_PASID(val)    (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PIOTLB_DID(val)      (((val) >> 16) & \
+                                             VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PIOTLB_ADDR(val)     ((val) & ~0xfffULL)
+#define VTD_INV_DESC_PIOTLB_AM(val)       ((val) & 0x3fULL)
+#define VTD_INV_DESC_PIOTLB_IH(val)       (((val) >> 6) & 0x1)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9058be9efd..6aa44b80d6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3786,6 +3786,54 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     return !pc_info.error_happened ? true : false;
 }
 
+static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
+                                        uint16_t domain_id, uint32_t pasid)
+{
+}
+
+static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
+                                       uint32_t pasid, hwaddr addr, uint8_t am,
+                                       bool ih)
+{
+}
+
+static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
+                                    VTDInvDesc *inv_desc)
+{
+    uint16_t domain_id;
+    uint32_t pasid;
+    uint8_t am;
+    hwaddr addr;
+
+    if ((inv_desc->val[0] & VTD_INV_DESC_PIOTLB_RSVD_VAL0) ||
+        (inv_desc->val[1] & VTD_INV_DESC_PIOTLB_RSVD_VAL1)) {
+        error_report_once("non-zero-field-in-piotlb_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    domain_id = VTD_INV_DESC_PIOTLB_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PIOTLB_PASID(inv_desc->val[0]);
+    switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
+    case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
+        vtd_piotlb_pasid_invalidate(s, domain_id, pasid);
+        break;
+
+    case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
+        am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
+        addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
+        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
+        break;
+
+    default:
+        error_report_once("Invalid granularity in P-IOTLB desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -3895,6 +3943,10 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         break;
 
     case VTD_INV_DESC_PIOTLB:
+        trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_piotlb_desc(s, &inv_desc)) {
+            return false;
+        }
         break;
 
     case VTD_INV_DESC_WAIT:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 14/23] intel_iommu: propagate PASID-based iotlb invalidation to host
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (12 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 13/23] intel_iommu: process PASID-based iotlb invalidation Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 15/23] intel_iommu: process PASID-based Device-TLB invalidation Zhenzhong Duan
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

This traps guest PASID-based iotlb invalidation requests and
propagates them to the host.

Intel VT-d 3.0 supports nested translation at PASID granularity.
Guest SVA support could be implemented by configuring nested
translation on a specific PASID. This is also known as dual-stage
DMA translation.

Under such a configuration, the guest owns the GVA->GPA translation,
which is configured as the stage-1 page table on the host side for a
specific pasid, while the host owns the GPA->HPA translation. Since
the guest owns the stage-1 translation table, piotlb invalidations
must be propagated to the host, because the host IOMMU caches
first-level page table mappings during DMA address translation.
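
For the page-selective case the host-side flush request encodes the
range with the descriptor's address-mask scheme. A small worked
example (hypothetical values, struct fields as used in this patch):

    /* Guest PSI: iova 0x1000, am = 2, IH set. This covers
     * 1 << 2 = 4 contiguous 4KiB pages, i.e. 16KiB at 0x1000. */
    struct iommu_hwpt_vtd_s1_invalidate inv = {
        .addr   = 0x1000,
        .npages = 1ULL << 2,
        .flags  = IOMMU_VTD_INV_FLAGS_LEAF,
    };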

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |   7 +++
 hw/i386/intel_iommu.c          | 103 +++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 40361de207..ed0d5cd99b 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -560,6 +560,13 @@ struct VTDPASIDCacheInfo {
 };
 typedef struct VTDPASIDCacheInfo VTDPASIDCacheInfo;
 
+struct VTDPIOTLBInvInfo {
+    uint16_t domain_id;
+    uint32_t pasid;
+    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+};
+typedef struct VTDPIOTLBInvInfo VTDPIOTLBInvInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6aa44b80d6..2912fc6b88 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3786,15 +3786,118 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     return !pc_info.error_happened ? true : false;
 }
 
+/**
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_invalidate_piotlb(VTDPASIDAddressSpace *vtd_pasid_as,
+                                  struct iommu_hwpt_vtd_s1_invalidate *cache)
+{
+    VTDIOMMUFDDevice *vtd_idev;
+    VTDHwpt *hwpt = &vtd_pasid_as->hwpt;
+    int devfn = vtd_pasid_as->devfn;
+    struct vtd_as_key key = {
+        .bus = vtd_pasid_as->bus,
+        .devfn = devfn,
+    };
+    IntelIOMMUState *s = vtd_pasid_as->iommu_state;
+    uint32_t entry_num = 1; /* Only implement one request for simplicity */
+
+    if (!hwpt) {
+        return;
+    }
+
+    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &key);
+    if (!vtd_idev || !vtd_idev->idev) {
+        return;
+    }
+    if (iommufd_backend_invalidate_cache(vtd_idev->idev->iommufd, hwpt->hwpt_id,
+                                         IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+                                         sizeof(*cache), &entry_num, cache)) {
+        error_report("Cache flush failed, entry_num %d", entry_num);
+    }
+}
+
+/**
+ * Loop function run on each entry of the s->vtd_pasid_as hash
+ * table, with VTDPIOTLBInvInfo as the match filter. It propagates
+ * the piotlb invalidation to the host. Caller of this function
+ * should hold iommu_lock.
+ */
+static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
+                                  gpointer user_data)
+{
+    VTDPIOTLBInvInfo *piotlb_info = user_data;
+    VTDPASIDAddressSpace *vtd_pasid_as = value;
+    VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+    uint16_t did;
+
+    if (!vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
+        return;
+    }
+
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+
+    if ((piotlb_info->domain_id == did) &&
+        (piotlb_info->pasid == vtd_pasid_as->pasid)) {
+        vtd_invalidate_piotlb(vtd_pasid_as,
+                              piotlb_info->inv_data);
+    }
+
+    /*
+     * TODO: needs to add QEMU piotlb flush when QEMU piotlb
+     * infrastructure is ready. For now, it is enough for passthru
+     * devices.
+     */
+}
+
 static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
                                         uint16_t domain_id, uint32_t pasid)
 {
+    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+    VTDPIOTLBInvInfo piotlb_info;
+
+    cache_info.addr = 0;
+    cache_info.npages = (uint64_t)-1;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.inv_data = &cache_info;
+
+    vtd_iommu_lock(s);
+    /*
+     * Here loops all the vtd_pasid_as instances in s->vtd_pasid_as
+     * to find out the affected devices since piotlb invalidation
+     * should check pasid cache per architecture point of view.
+     */
+    g_hash_table_foreach(s->vtd_pasid_as,
+                         vtd_flush_pasid_iotlb, &piotlb_info);
+    vtd_iommu_unlock(s);
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
                                        uint32_t pasid, hwaddr addr, uint8_t am,
                                        bool ih)
 {
+    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+    VTDPIOTLBInvInfo piotlb_info;
+
+    cache_info.addr = addr;
+    cache_info.npages = 1 << am;
+    cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.inv_data = &cache_info;
+
+    vtd_iommu_lock(s);
+    /*
+     * Loop over all vtd_pasid_as instances in s->vtd_pasid_as to
+     * find the affected devices, since piotlb invalidation must
+     * check the pasid cache from an architectural point of view.
+     */
+    g_hash_table_foreach(s->vtd_pasid_as,
+                         vtd_flush_pasid_iotlb, &piotlb_info);
+    vtd_iommu_unlock(s);
 }
 
 static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 15/23] intel_iommu: process PASID-based Device-TLB invalidation
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (13 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 14/23] intel_iommu: propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 16/23] intel_iommu: rename slpte in iotlb_entry to pte Zhenzhong Duan
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

This adds empty handling for PASID-based Device-TLB invalidation.
For now this is enough, as it is not necessary to propagate it to
the host for passthrough devices.

The reason the device tlb invalidation handling can be empty is that
iommufd's Intel VT-d cache invalidation uAPI covers all caches
related to the mapping, so there is no need to send a separate device
tlb invalidation. The spec says an iotlb invalidation should be
issued before the corresponding device tlb invalidation, which means
the host has already received a cache invalidation for the guest
iotlb invalidation by the time the device tlb one arrives.

Chapter 6.5.2.5:
"Since translation requests-without-PASID from a device may be serviced by
hardware from the IOTLB, software must always request IOTLB invalidation
(iotlb_inv_dsc) before requesting corresponding Device-TLB (dev_tlb_inv_dsc)
invalidation."

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 hw/i386/intel_iommu.c          | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index ed0d5cd99b..dcf1410fcf 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -380,6 +380,7 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_WAIT               0x5 /* Invalidation Wait Descriptor */
 #define VTD_INV_DESC_PIOTLB             0x6 /* PASID-IOTLB Invalidate Desc */
 #define VTD_INV_DESC_PC                 0x7 /* PASID-cache Invalidate Desc */
+#define VTD_INV_DESC_DEV_PIOTLB         0x8 /* PASID-based-DIOTLB inv_desc */
 #define VTD_INV_DESC_NONE               0   /* Not an Invalidate Descriptor */
 
 /* Masks for Invalidation Wait Descriptor*/
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2912fc6b88..5e7d445d98 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3950,6 +3950,17 @@ static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
     return true;
 }
 
+static bool vtd_process_device_piotlb_desc(IntelIOMMUState *s,
+                                           VTDInvDesc *inv_desc)
+{
+    /*
+     * No need to handle it for passthrough devices. For emulated
+     * devices with a device tlb it may be required, but for now
+     * simply returning is enough.
+     */
+    return true;
+}
+
 static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
                                           VTDInvDesc *inv_desc)
 {
@@ -4066,6 +4077,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    case VTD_INV_DESC_DEV_PIOTLB:
+        trace_vtd_inv_desc("device-piotlb", inv_desc.hi, inv_desc.lo);
+        if (!vtd_process_device_piotlb_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
     case VTD_INV_DESC_DEVICE:
         trace_vtd_inv_desc("device", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_device_iotlb_desc(s, &inv_desc)) {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 16/23] intel_iommu: rename slpte in iotlb_entry to pte
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (14 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 15/23] intel_iommu: process PASID-based Device-TLB invalidation Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 17/23] intel_iommu: implement first level translation Zhenzhong Duan
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

Because we support both FST and SST translation, rename slpte in iotlb_entry
to pte to make it generic.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/i386/intel_iommu.h |  2 +-
 hw/i386/intel_iommu.c         | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 72702e10a2..dedaab5ac9 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -210,7 +210,7 @@ struct VTDIOTLBEntry {
     uint64_t gfn;
     uint16_t domain_id;
     uint32_t pasid;
-    uint64_t slpte;
+    uint64_t pte;
     uint64_t mask;
     uint8_t access_flags;
 };
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5e7d445d98..7c24f8f677 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -384,7 +384,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
 
     entry->gfn = gfn;
     entry->domain_id = domain_id;
-    entry->slpte = slpte;
+    entry->pte = slpte;
     entry->access_flags = access_flags;
     entry->mask = vtd_slpt_level_page_mask(level);
     entry->pasid = pasid;
@@ -1949,9 +1949,9 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     if (!rid2pasid) {
         iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
         if (iotlb_entry) {
-            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
+            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
                                      iotlb_entry->domain_id);
-            slpte = iotlb_entry->slpte;
+            slpte = iotlb_entry->pte;
             access_flags = iotlb_entry->access_flags;
             page_mask = iotlb_entry->mask;
             goto out;
@@ -2027,9 +2027,9 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     if (rid2pasid) {
         iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
         if (iotlb_entry) {
-            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
+            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
                                      iotlb_entry->domain_id);
-            slpte = iotlb_entry->slpte;
+            slpte = iotlb_entry->pte;
             access_flags = iotlb_entry->access_flags;
             page_mask = iotlb_entry->mask;
             goto out;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 17/23] intel_iommu: implement first level translation
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (15 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 16/23] intel_iommu: rename slpte in iotlb_entry to pte Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 18/23] intel_iommu: fix the fault reason report Zhenzhong Duan
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

This adds stage-1 page table walking to support stage-1-only
translation in scalable mode.
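
The walk below consumes one 9-bit index per level starting from the
4KiB page shift. A minimal sketch of the arithmetic
(VTD_PAGE_SHIFT_4K = 12, VTD_FL_LEVEL_BITS = 9), assuming the usual
x86-style large-page levels:

    /* level 4 (PML4): iova bits 47:39
     * level 3 (PDPT): iova bits 38:30 (1GiB leaf if PS is set)
     * level 2 (PD):   iova bits 29:21 (2MiB leaf if PS is set)
     * level 1 (PT):   iova bits 20:12 (always a leaf) */
    static inline uint32_t flpt_index(uint64_t iova, uint32_t level)
    {
        return (iova >> (12 + (level - 1) * 9)) & 0x1ff; /* 512 slots */
    }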

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  16 +++
 hw/i386/intel_iommu.c          | 242 ++++++++++++++++++++++++++++++++-
 hw/i386/trace-events           |   2 +
 3 files changed, 258 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index dcf1410fcf..41b958cd5d 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -598,6 +598,22 @@ typedef struct VTDPIOTLBInvInfo VTDPIOTLBInvInfo;
 #define VTD_SM_PASID_ENTRY_WPE_BIT(val)  (!!(((val) >> 4) & 1ULL))
 #define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
 
+#define VTD_PASID_IOTLB_MAX_SIZE       1024 /* Max size of the hash table */
+
+/* Paging Structure common */
+#define VTD_FL_PT_PAGE_SIZE_MASK    (1ULL << 7)
+/* Bits to decide the offset for each level */
+#define VTD_FL_LEVEL_BITS           9
+
+/* First Level Paging Structure */
+#define VTD_FL_PT_LEVEL             1
+#define VTD_FL_PT_ENTRY_NR          512
+
+/* Masks for First Level Paging Entry */
+#define VTD_FL_RW_MASK              (1ULL << 1)
+#define VTD_FL_PT_BASE_ADDR_MASK(aw) (~(VTD_PAGE_SIZE - 1) & VTD_HAW_MASK(aw))
+#define VTD_PASID_ENTRY_FPD         (1ULL << 1) /* Fault Processing Disable */
+
 /* Second Level Page Translation Pointer*/
 #define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
 
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7c24f8f677..1c21f40ccd 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -78,6 +78,10 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
                                  VTDPASIDCacheInfo *pc_info);
 static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
                                   PCIBus *bus, uint16_t devfn);
+static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
+                                                   PCIBus *bus,
+                                                   int devfn,
+                                                   uint32_t pasid);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -1888,6 +1892,114 @@ out:
     trace_vtd_pt_enable_fast_path(source_id, success);
 }
 
+/* The shift of an addr for a certain level of paging structure */
+static inline uint32_t vtd_flpt_level_shift(uint32_t level)
+{
+    assert(level != 0);
+    return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_FL_LEVEL_BITS;
+}
+
+static inline uint64_t vtd_flpt_level_page_mask(uint32_t level)
+{
+    return ~((1ULL << vtd_flpt_level_shift(level)) - 1);
+}
+
+static inline dma_addr_t vtd_pe_get_flpt_level(VTDPASIDEntry *pe)
+{
+    return 4 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM);
+}
+
+/*
+ * Given an iova and the level of the paging structure, return the
+ * offset within the current level.
+ */
+static inline uint32_t vtd_iova_fl_level_offset(uint64_t iova, uint32_t level)
+{
+    return (iova >> vtd_flpt_level_shift(level)) &
+            ((1ULL << VTD_FL_LEVEL_BITS) - 1);
+}
+
+/* Get the content of a flpte located in @base_addr[@index] */
+static uint64_t vtd_get_flpte(dma_addr_t base_addr, uint32_t index)
+{
+    uint64_t flpte;
+
+    assert(index < VTD_FL_PT_ENTRY_NR);
+
+    if (dma_memory_read(&address_space_memory,
+                        base_addr + index * sizeof(flpte), &flpte,
+                        sizeof(flpte), MEMTXATTRS_UNSPECIFIED)) {
+        flpte = (uint64_t)-1;
+        return flpte;
+    }
+    flpte = le64_to_cpu(flpte);
+    return flpte;
+}
+
+static inline bool vtd_flpte_present(uint64_t flpte)
+{
+    return !!(flpte & 0x1);
+}
+
+/* Whether the pte indicates the address of the page frame */
+static inline bool vtd_is_last_flpte(uint64_t flpte, uint32_t level)
+{
+    return level == VTD_FL_PT_LEVEL || (flpte & VTD_FL_PT_PAGE_SIZE_MASK);
+}
+
+static inline uint64_t vtd_get_flpte_addr(uint64_t flpte, uint8_t aw)
+{
+    return flpte & VTD_FL_PT_BASE_ADDR_MASK(aw);
+}
+
+/*
+ * Given the @iova, get relevant @flptep. @flpte_level will be the last level
+ * of the translation, can be used for deciding the size of large page.
+ */
+static int vtd_iova_to_flpte(VTDPASIDEntry *pe, uint64_t iova, bool is_write,
+                             uint64_t *flptep, uint32_t *flpte_level,
+                             bool *reads, bool *writes, uint8_t aw_bits)
+{
+    dma_addr_t addr = vtd_pe_get_flpt_base(pe);
+    uint32_t level = vtd_pe_get_flpt_level(pe);
+    uint32_t offset;
+    uint64_t flpte;
+
+    while (true) {
+        offset = vtd_iova_fl_level_offset(iova, level);
+        flpte = vtd_get_flpte(addr, offset);
+        if (flpte == (uint64_t)-1) {
+            if (level == vtd_pe_get_flpt_level(pe)) {
+                /* Invalid programming of context-entry */
+                return -VTD_FR_CONTEXT_ENTRY_INV;
+            } else {
+                return -VTD_FR_PAGING_ENTRY_INV;
+            }
+        }
+
+        if (!vtd_flpte_present(flpte)) {
+            *reads = false;
+            *writes = false;
+            return -VTD_FR_PAGING_ENTRY_INV;
+        }
+
+        *reads = true;
+        *writes = (*writes) && (flpte & VTD_FL_RW_MASK);
+        if (is_write && !(flpte & VTD_FL_RW_MASK)) {
+            return -VTD_FR_WRITE;
+        }
+
+        if (vtd_is_last_flpte(flpte, level)) {
+            *flptep = flpte;
+            *flpte_level = level;
+            return 0;
+        }
+
+        addr = vtd_get_flpte_addr(flpte, aw_bits);
+        level--;
+    }
+}
+
 static void vtd_report_fault(IntelIOMMUState *s,
                              int err, bool is_fpd_set,
                              uint16_t source_id,
@@ -1904,6 +2016,105 @@ static void vtd_report_fault(IntelIOMMUState *s,
     }
 }
 
+/*
+ * Map dev to a pasid entry then do a paging-structures walk to do an
+ * iommu translation.
+ *
+ * Called from RCU critical section.
+ *
+ * @vtd_as: The untranslated address space
+ * @bus: The PCI bus of the device
+ * @devfn: The devfn, which is the combined device and function number
+ * @is_write: The access is a write operation
+ * @entry: IOMMUTLBEntry that contains the addr to be translated and the result
+ *
+ * Returns true if translation is successful, otherwise false.
+ */
+static bool vtd_do_iommu_fl_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
+                                      uint8_t devfn, hwaddr addr, bool is_write,
+                                      IOMMUTLBEntry *entry)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDContextEntry ce;
+    VTDPASIDEntry pe;
+    uint8_t bus_num = pci_bus_num(bus);
+    uint64_t flpte, page_mask;
+    uint32_t level;
+    uint16_t source_id = PCI_BUILD_BDF(bus_num, devfn);
+    int ret;
+    bool is_fpd_set = false;
+    bool reads = true;
+    bool writes = true;
+    uint8_t access_flags;
+
+    /*
+     * We have standalone memory region for interrupt addresses, we
+     * should never receive translation requests in this region.
+     */
+    assert(!vtd_is_interrupt_addr(addr));
+
+    ret = vtd_dev_to_context_entry(s, pci_bus_num(bus), devfn, &ce);
+    if (ret) {
+        error_report_once("%s: detected translation failure 1 "
+                          "(dev=%02x:%02x:%02x, iova=0x%" PRIx64 ")",
+                          __func__, pci_bus_num(bus),
+                          VTD_PCI_SLOT(devfn),
+                          VTD_PCI_FUNC(devfn),
+                          addr);
+        return false;
+    }
+
+    vtd_iommu_lock(s);
+
+    ret = vtd_ce_get_rid2pasid_entry(s, &ce, &pe, PCI_NO_PASID);
+    is_fpd_set = pe.val[0] & VTD_PASID_ENTRY_FPD;
+    if (ret) {
+        vtd_report_fault(s, -ret, is_fpd_set, source_id, addr, is_write,
+                         false, PCI_NO_PASID);
+        goto error;
+    }
+
+    /*
+     * We don't need to translate for pass-through context entries.
+     * Also, let's ignore IOTLB caching as well for PT devices.
+     */
+    if (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT) {
+        entry->iova = addr & VTD_PAGE_MASK_4K;
+        entry->translated_addr = entry->iova;
+        entry->addr_mask = ~VTD_PAGE_MASK_4K;
+        entry->perm = IOMMU_RW;
+        vtd_iommu_unlock(s);
+        return true;
+    }
+
+    ret = vtd_iova_to_flpte(&pe, addr, is_write, &flpte, &level,
+                            &reads, &writes, s->aw_bits);
+    if (ret) {
+        vtd_report_fault(s, -ret, is_fpd_set, source_id, addr, is_write,
+                         false, PCI_NO_PASID);
+        goto error;
+    }
+
+    page_mask = vtd_flpt_level_page_mask(level);
+    access_flags = IOMMU_ACCESS_FLAG(reads, writes);
+
+    vtd_iommu_unlock(s);
+
+    entry->iova = addr & page_mask;
+    entry->translated_addr = vtd_get_flpte_addr(flpte, s->aw_bits) & page_mask;
+    entry->addr_mask = ~page_mask;
+    entry->perm = access_flags;
+    return true;
+
+error:
+    vtd_iommu_unlock(s);
+    entry->iova = 0;
+    entry->translated_addr = 0;
+    entry->addr_mask = 0;
+    entry->perm = IOMMU_NONE;
+    return false;
+}
+
 /* Map dev to context-entry then do a paging-structures walk to do a iommu
  * translation.
  *
@@ -4516,10 +4727,37 @@ static IOMMUTLBEntry vtd_iommu_translate(IOMMUMemoryRegion *iommu, hwaddr addr,
         .target_as = &address_space_memory,
     };
     bool success;
+    VTDContextEntry ce;
+    VTDPASIDEntry pe;
+    int ret = 0;
 
     if (likely(s->dmar_enabled)) {
-        success = vtd_do_iommu_translate(vtd_as, vtd_as->bus, vtd_as->devfn,
-                                         addr, flag & IOMMU_WO, &iotlb);
+        if (s->root_scalable) {
+            ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                           vtd_as->devfn, &ce);
+            if (!ret) {
+                ret = vtd_ce_get_rid2pasid_entry(s, &ce, &pe, PCI_NO_PASID);
+            }
+            if (ret) {
+                error_report_once("%s: detected translation failure 1 "
+                                  "(dev=%02x:%02x:%02x, iova=0x%" PRIx64 ")",
+                                  __func__, pci_bus_num(vtd_as->bus),
+                                  VTD_PCI_SLOT(vtd_as->devfn),
+                                  VTD_PCI_FUNC(vtd_as->devfn),
+                                  addr);
+                return iotlb;
+            }
+            if (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_FLT) {
+                success = vtd_do_iommu_fl_translate(vtd_as, vtd_as->bus,
+                                                    vtd_as->devfn, addr,
+                                                    flag & IOMMU_WO, &iotlb);
+            } else {
+                success = vtd_do_iommu_translate(vtd_as, vtd_as->bus,
+                                                 vtd_as->devfn, addr,
+                                                 flag & IOMMU_WO, &iotlb);
+            }
+        } else {
+            success = vtd_do_iommu_translate(vtd_as, vtd_as->bus, vtd_as->devfn,
+                                             addr, flag & IOMMU_WO, &iotlb);
+        }
     } else {
         /* DMAR disabled, passthrough, use 4k-page*/
         iotlb.iova = addr & VTD_PAGE_MASK_4K;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 66f7c1ba59..00b27bc5b1 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -33,6 +33,8 @@ vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
 vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page update sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_pe_hit(uint32_t pasid, uint64_t val0, uint32_t gen) "IOTLB pasid hit pasid %"PRIu32" val[0] 0x%"PRIx64" gen %"PRIu32
+vtd_iotlb_pe_update(uint32_t pasid, uint64_t val0, uint32_t gen1, uint32_t gen2) "IOTLB pasid update pasid %"PRIu32" val[0] 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
 vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 18/23] intel_iommu: fix the fault reason report
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (16 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 17/23] intel_iommu: implement first level translation Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 19/23] intel_iommu: introduce pasid iotlb cache Zhenzhong Duan
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yu Zhang, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yu Zhang <yu.c.zhang@linux.intel.com>

Currently we use only VTD_FR_PASID_TABLE_INV as the fault reason.
Fix this by using the correct fault reasons listed in VT-d spec
7.2.3. For example, a non-present PASID directory entry now reports
0x51 (VTD_FR_PASID_DIR_ENTRY_P) instead of the catch-all 0x58.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  8 ++++++-
 hw/i386/intel_iommu.c          | 42 +++++++++++++++++++++-------------
 2 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 41b958cd5d..21fa767740 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -325,8 +325,14 @@ typedef enum VTDFaultReason {
                                   * request while disabled */
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
-    VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
+    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
+    /* PASID directory entry access failure */
+    VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
+    /* The Present(P) field of pasid directory entry is 0 */
+    VTD_FR_PASID_DIR_ENTRY_P = 0x51,
+    VTD_FR_PASID_TABLE_ACCESS_ERR = 0x58, /* PASID table entry access failure */
     VTD_FR_PASID_ENTRY_P = 0x59, /* The Present(P) field of pasidt-entry is 0 */
+    VTD_FR_PASID_TABLE_ENTRY_INV = 0x5b,  /* Invalid PASID table entry */
 
     /* Output address in the interrupt address range for scalable mode */
     VTD_FR_SM_INTERRUPT_ADDR = 0x87,
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1c21f40ccd..1e87383a41 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -819,7 +819,7 @@ static int vtd_get_pdire_from_pdir_table(dma_addr_t pasid_dir_base,
     addr = pasid_dir_base + index * entry_size;
     if (dma_memory_read(&address_space_memory, addr,
                         pdire, entry_size, MEMTXATTRS_UNSPECIFIED)) {
-        return -VTD_FR_PASID_TABLE_INV;
+        return -VTD_FR_PASID_DIR_ACCESS_ERR;
     }
 
     pdire->val = le64_to_cpu(pdire->val);
@@ -832,6 +832,11 @@ static inline bool vtd_pe_present(VTDPASIDEntry *pe)
     return pe->val[0] & VTD_PASID_ENTRY_P;
 }
 
+static inline uint32_t vtd_pe_get_flpt_level(VTDPASIDEntry *pe)
+{
+    return 4 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM);
+}
+
 static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
                                           uint32_t pasid,
                                           dma_addr_t addr,
@@ -840,13 +845,14 @@ static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
     uint32_t index;
     dma_addr_t entry_size;
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
+    uint8_t pgtt;
 
     index = VTD_PASID_TABLE_INDEX(pasid);
     entry_size = VTD_PASID_ENTRY_SIZE;
     addr = addr + index * entry_size;
     if (dma_memory_read(&address_space_memory, addr,
                         pe, entry_size, MEMTXATTRS_UNSPECIFIED)) {
-        return -VTD_FR_PASID_TABLE_INV;
+        return -VTD_FR_PASID_TABLE_ACCESS_ERR;
     }
     for (size_t i = 0; i < ARRAY_SIZE(pe->val); i++) {
         pe->val[i] = le64_to_cpu(pe->val[i]);
@@ -854,12 +860,17 @@ static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
 
     /* Do translation type check */
     if (!vtd_pe_type_check(x86_iommu, pe)) {
-        return -VTD_FR_PASID_TABLE_INV;
+        return -VTD_FR_PASID_TABLE_ENTRY_INV;
     }
 
-    if (!vtd_is_level_supported(s, VTD_PE_GET_LEVEL(pe))) {
-        return -VTD_FR_PASID_TABLE_INV;
-    }
+    pgtt = VTD_PE_GET_TYPE(pe);
+    if (pgtt == VTD_SM_PASID_ENTRY_SLT &&
+        !vtd_is_level_supported(s, VTD_PE_GET_LEVEL(pe))) {
+        return -VTD_FR_PASID_TABLE_ENTRY_INV;
+    }
+
+    if (pgtt == VTD_SM_PASID_ENTRY_FLT &&
+        vtd_pe_get_flpt_level(pe) != 4) {
+        return -VTD_FR_PASID_TABLE_ENTRY_INV;
+    }
 
     return 0;
 }
@@ -899,7 +910,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
     }
 
     if (!vtd_pdire_present(&pdire)) {
-        return -VTD_FR_PASID_TABLE_INV;
+        return -VTD_FR_PASID_DIR_ENTRY_P;
     }
 
     ret = vtd_get_pe_from_pdire(s, pasid, &pdire, pe);
@@ -908,7 +919,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
     }
 
     if (!vtd_pe_present(pe)) {
-        return -VTD_FR_PASID_TABLE_INV;
+        return -VTD_FR_PASID_ENTRY_P;
     }
 
     return 0;
@@ -961,7 +972,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
     }
 
     if (!vtd_pdire_present(&pdire)) {
-        return -VTD_FR_PASID_TABLE_INV;
+        return -VTD_FR_PASID_DIR_ENTRY_P;
     }
 
     /*
@@ -1829,7 +1840,11 @@ static const bool vtd_qualified_faults[] = {
     [VTD_FR_ROOT_ENTRY_RSVD] = false,
     [VTD_FR_PAGING_ENTRY_RSVD] = true,
     [VTD_FR_CONTEXT_ENTRY_TT] = true,
-    [VTD_FR_PASID_TABLE_INV] = false,
+    [VTD_FR_PASID_DIR_ACCESS_ERR] = false,
+    [VTD_FR_PASID_DIR_ENTRY_P] = true,
+    [VTD_FR_PASID_TABLE_ACCESS_ERR] = false,
+    [VTD_FR_PASID_ENTRY_P] = true,
+    [VTD_FR_PASID_TABLE_ENTRY_INV] = true,
     [VTD_FR_SM_INTERRUPT_ADDR] = true,
     [VTD_FR_MAX] = false,
 };
@@ -1904,11 +1919,6 @@ static inline uint64_t vtd_flpt_level_page_mask(uint32_t level)
     return ~((1ULL << vtd_flpt_level_shift(level)) - 1);
 }
 
-static inline dma_addr_t vtd_pe_get_flpt_level(VTDPASIDEntry *pe)
-{
-    return 4 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM);
-}
-
 /*
  * Given an iova and the level of paging structure, return the offset
  * of current level.
@@ -3499,7 +3509,7 @@ static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
     dma_addr_t pasid_dir_base;
 
     if (!s->root_scalable) {
-        return -VTD_FR_PASID_TABLE_INV;
+        return -VTD_FR_RTADDR_INV_TTM;
     }
 
     ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 19/23] intel_iommu: introduce pasid iotlb cache
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (17 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 18/23] intel_iommu: fix the fault reason report Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 20/23] intel_iommu: piotlb invalidation should notify unmap Zhenzhong Duan
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

To accelerate stage-1 translation, introduce a pasid-based iotlb
(piotlb) cache keyed by source-id, pasid, gfn and level.
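
The cache is keyed by a fixed-layout string so that a plain g_str_hash
table can be used. A hypothetical key for sid 0x10, pasid 1,
gfn 0x12345, level 1 (format as in vtd_get_piotlb_key() below):

    char key[64];
    snprintf(key, sizeof(key),
             "rsv%010dsid%06dpasid%010dgfn%017lldlevel%01d",
             0, 0x10, 1, (unsigned long long int)0x12345, 1);
    /* key is now:
     * "rsv0000000000sid000016pasid0000000001gfn00000000000074565level1"
     */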

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |   1 +
 include/hw/i386/intel_iommu.h  |   1 +
 hw/i386/intel_iommu.c          | 126 +++++++++++++++++++++++++++++++--
 hw/i386/trace-events           |   1 +
 4 files changed, 124 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 21fa767740..08701f5457 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -480,6 +480,7 @@ typedef union VTDInvDesc VTDInvDesc;
 
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
+    bool is_piotlb;
     uint16_t domain_id;
     uint32_t pasid;
     uint64_t addr;
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index dedaab5ac9..f3e75263b7 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -348,6 +348,7 @@ struct IntelIOMMUState {
 
     uint32_t context_cache_gen;     /* Should be in [1,MAX] */
     GHashTable *iotlb;              /* IOTLB */
+    GHashTable *p_iotlb;            /* pasid based IOTLB */
 
     GHashTable *vtd_address_spaces;             /* VTD address spaces */
     VTDAddressSpace *vtd_as_cache[VTD_PCI_BUS_MAX]; /* VTD address space cache */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1e87383a41..e9480608a5 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -82,6 +82,8 @@ static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
                                                    PCIBus *bus,
                                                    int devfn,
                                                    uint32_t pasid);
+static int vtd_dev_get_rid2pasid(IntelIOMMUState *s, uint8_t bus_num,
+                                 uint8_t devfn, uint32_t *rid_pasid);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -297,6 +299,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
     uint64_t gfn = (info->addr >> VTD_PAGE_SHIFT_4K) & info->mask;
     uint64_t gfn_tlb = (info->addr & entry->mask) >> VTD_PAGE_SHIFT_4K;
     return (entry->domain_id == info->domain_id) &&
+            (info->is_piotlb ? (entry->pasid == info->pasid) : 1) &&
             (((entry->gfn & info->mask) == gfn) ||
              (entry->gfn == gfn_tlb));
 }
@@ -333,12 +336,19 @@ static void vtd_reset_iotlb(IntelIOMMUState *s)
     vtd_iommu_unlock(s);
 }
 
+static void vtd_reset_piotlb(IntelIOMMUState *s)
+{
+    assert(s->p_iotlb);
+    g_hash_table_remove_all(s->p_iotlb);
+}
+
 static void vtd_reset_caches(IntelIOMMUState *s)
 {
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
     vtd_pasid_cache_reset(s);
+    vtd_reset_piotlb(s);
     vtd_iommu_unlock(s);
 }
 
@@ -2026,6 +2036,63 @@ static void vtd_report_fault(IntelIOMMUState *s,
     }
 }
 
+static uint64_t vtd_get_piotlb_gfn(hwaddr addr, uint32_t level)
+{
+    return (addr & vtd_flpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
+}
+
+static int vtd_get_piotlb_key(char *key, int key_size, uint64_t gfn,
+                              uint32_t pasid, uint32_t level,
+                              uint16_t source_id)
+{
+    return snprintf(key, key_size,
+                    "rsv%010dsid%06dpasid%010dgfn%017lldlevel%01d",
+                    0, source_id, pasid, (unsigned long long int)gfn, level);
+}
+
+static VTDIOTLBEntry *vtd_lookup_piotlb(IntelIOMMUState *s, uint32_t pasid,
+                                        hwaddr addr, uint16_t source_id)
+{
+    VTDIOTLBEntry *entry;
+    char key[64];
+    int level;
+
+    for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
+        vtd_get_piotlb_key(&key[0], 64, vtd_get_piotlb_gfn(addr, level),
+                           pasid, level, source_id);
+        entry = g_hash_table_lookup(s->p_iotlb, &key[0]);
+        if (entry) {
+            goto out;
+        }
+    }
+
+out:
+    return entry;
+}
+
+static void vtd_update_piotlb(IntelIOMMUState *s, uint32_t pasid,
+                              uint16_t domain_id, hwaddr addr, uint64_t flpte,
+                              uint8_t access_flags, uint32_t level,
+                              uint16_t source_id)
+{
+    VTDIOTLBEntry *entry = g_malloc(sizeof(*entry));
+    char *key = g_malloc(64);
+    uint64_t gfn = vtd_get_piotlb_gfn(addr, level);
+
+    if (g_hash_table_size(s->p_iotlb) >= VTD_PASID_IOTLB_MAX_SIZE) {
+        vtd_reset_piotlb(s);
+    }
+
+    entry->gfn = gfn;
+    entry->domain_id = domain_id;
+    entry->pte = flpte;
+    entry->pasid = pasid;
+    entry->access_flags = access_flags;
+    entry->mask = vtd_flpt_level_page_mask(level);
+    vtd_get_piotlb_key(key, 64, gfn, pasid, level, source_id);
+    g_hash_table_replace(s->p_iotlb, key, entry);
+}
+
 /*
  * Map dev to pasid-entry then do a paging-structures walk to do a iommu
  * translation.
@@ -2056,6 +2123,8 @@ static bool vtd_do_iommu_fl_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     bool reads = true;
     bool writes = true;
     uint8_t access_flags;
+    uint32_t pasid;
+    VTDIOTLBEntry *piotlb_entry;
 
     /*
      * We have standalone memory region for interrupt addresses, we
@@ -2074,8 +2143,30 @@ static bool vtd_do_iommu_fl_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         return false;
     }
 
+    /* For emulated device IOVA translation, use RID2PASID. */
+    if (vtd_dev_get_rid2pasid(s, pci_bus_num(bus), devfn, &pasid)) {
+        error_report_once("%s: detected translation failure 2 "
+                          "(dev=%02x:%02x:%02x, iova=0x%" PRIx64 ")",
+                          __func__, pci_bus_num(bus),
+                          VTD_PCI_SLOT(devfn),
+                          VTD_PCI_FUNC(devfn),
+                          addr);
+        return false;
+    }
+
     vtd_iommu_lock(s);
 
+    /* Try to fetch the flpte from the piotlb first */
+    piotlb_entry = vtd_lookup_piotlb(s, pasid, addr, source_id);
+    if (piotlb_entry) {
+        trace_vtd_piotlb_page_hit(source_id, pasid, addr, piotlb_entry->pte,
+                                  piotlb_entry->domain_id);
+        flpte = piotlb_entry->pte;
+        access_flags = piotlb_entry->access_flags;
+        page_mask = piotlb_entry->mask;
+        goto out;
+    }
+
     ret = vtd_ce_get_rid2pasid_entry(s, &ce, &pe, PCI_NO_PASID);
     is_fpd_set = pe.val[0] & VTD_PASID_ENTRY_FPD;
     if (ret) {
@@ -2108,6 +2199,9 @@ static bool vtd_do_iommu_fl_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     page_mask = vtd_flpt_level_page_mask(level);
     access_flags = IOMMU_ACCESS_FLAG(reads, writes);
 
+    vtd_update_piotlb(s, pasid, vtd_pe_get_domain_id(&pe), addr, flpte,
+                      access_flags, level, source_id);
+out:
     vtd_iommu_unlock(s);
 
     entry->iova = addr & page_mask;
@@ -3080,6 +3174,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     trace_vtd_inv_desc_iotlb_pages(domain_id, addr, am);
 
     assert(am <= VTD_MAMV);
+    info.is_piotlb = false;
     info.domain_id = domain_id;
     info.addr = addr;
     info.mask = ~((1 << am) - 1);
@@ -4063,12 +4158,16 @@ static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
         vtd_invalidate_piotlb(vtd_pasid_as,
                               piotlb_info->inv_data);
     }
+}
 
-    /*
-     * TODO: needs to add QEMU piotlb flush when QEMU piotlb
-     * infrastructure is ready. For now, it is enough for passthru
-     * devices.
-     */
+static gboolean vtd_hash_remove_by_pasid(gpointer key, gpointer value,
+                                         gpointer user_data)
+{
+    VTDIOTLBEntry *entry = (VTDIOTLBEntry *)value;
+    VTDIOTLBPageInvInfo *info = (VTDIOTLBPageInvInfo *)user_data;
+
+    return ((entry->domain_id == info->domain_id) &&
+            (entry->pasid == info->pasid));
 }
 
 static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
@@ -4076,6 +4175,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
 {
     struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
     VTDPIOTLBInvInfo piotlb_info;
+    VTDIOTLBPageInvInfo info;
 
     cache_info.addr = 0;
     cache_info.npages = (uint64_t)-1;
@@ -4084,6 +4184,9 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     piotlb_info.pasid = pasid;
     piotlb_info.inv_data = &cache_info;
 
+    info.domain_id = domain_id;
+    info.pasid = pasid;
+
     vtd_iommu_lock(s);
     /*
      * Loop over all vtd_pasid_as instances in s->vtd_pasid_as to
      * find the affected devices, since piotlb invalidation must
      * check the pasid cache from an architectural point of view.
     g_hash_table_foreach(s->vtd_pasid_as,
                          vtd_flush_pasid_iotlb, &piotlb_info);
+    g_hash_table_foreach_remove(s->p_iotlb, vtd_hash_remove_by_pasid,
+                                &info);
     vtd_iommu_unlock(s);
 }
 
@@ -4101,6 +4206,7 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
 {
     struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
     VTDPIOTLBInvInfo piotlb_info;
+    VTDIOTLBPageInvInfo info;
 
     cache_info.addr = addr;
     cache_info.npages = 1 << am;
@@ -4110,6 +4216,12 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     piotlb_info.pasid = pasid;
     piotlb_info.inv_data = &cache_info;
 
+    info.is_piotlb = true;
+    info.domain_id = domain_id;
+    info.pasid = pasid;
+    info.addr = addr;
+    info.mask = ~((1 << am) - 1);
+
     vtd_iommu_lock(s);
     /*
      * Loop over all vtd_pasid_as instances in s->vtd_pasid_as to
      * find the affected devices, since piotlb invalidation must
      * check the pasid cache from an architectural point of view.
     g_hash_table_foreach(s->vtd_pasid_as,
                          vtd_flush_pasid_iotlb, &piotlb_info);
+    g_hash_table_foreach_remove(s->p_iotlb,
+                                vtd_hash_remove_by_page, &info);
     vtd_iommu_unlock(s);
 }
 
@@ -6034,6 +6148,8 @@ static void vtd_realize(DeviceState *dev, Error **errp)
     /* No corresponding destroy */
     s->iotlb = g_hash_table_new_full(vtd_iotlb_hash, vtd_iotlb_equal,
                                      g_free, g_free);
+    s->p_iotlb = g_hash_table_new_full(&g_str_hash, &g_str_equal,
+                                       g_free, g_free);
     s->vtd_address_spaces = g_hash_table_new_full(vtd_as_hash, vtd_as_equal,
                                       g_free, g_free);
     s->vtd_iommufd_dev = g_hash_table_new_full(vtd_as_hash, vtd_as_idev_equal,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 00b27bc5b1..7c36f34ae8 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -31,6 +31,7 @@ vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalida
 vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
+vtd_piotlb_page_hit(uint16_t sid, uint32_t pasid, uint64_t addr, uint64_t pte, uint16_t domain) "PIOTLB page hit sid 0x%"PRIx16" pasid %"PRIu32" iova 0x%"PRIx64" pte 0x%"PRIx64" domain 0x%"PRIx16
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
 vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page update sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
 vtd_iotlb_pe_hit(uint32_t pasid, uint64_t val0, uint32_t gen) "IOTLB pasid hit pasid %"PRIu32" val[0] 0x%"PRIx64" gen %"PRIu32
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 20/23] intel_iommu: piotlb invalidation should notify unmap
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (18 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 19/23] intel_iommu: introduce pasid iotlb cache Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 21/23] intel_iommu: invalidate piotlb when flush pasid Zhenzhong Duan
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Sun <yi.y.sun@linux.intel.com>

This is used by some emulated devices which cache address translation
results. When a piotlb invalidation is issued in the guest, those
caches should be refreshed.
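
A minimal sketch (hypothetical device code; the notifier API itself is
the standard QEMU one) of how such an emulated device would observe
these events:

    static void my_dev_unmap(IOMMUNotifier *n, IOMMUTLBEntry *entry)
    {
        /* entry->perm == IOMMU_NONE: drop cached translations in
         * [entry->iova, entry->iova + entry->addr_mask] */
    }

    iommu_notifier_init(&dev->n, my_dev_unmap,
                        IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_DEVIOTLB_UNMAP,
                        0, HWADDR_MAX, 0);
    memory_region_register_iommu_notifier(iommu_mr, &dev->n, &error_fatal);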

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 56 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e9480608a5..6a6478e865 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4176,6 +4176,9 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
     VTDPIOTLBInvInfo piotlb_info;
     VTDIOTLBPageInvInfo info;
+    VTDAddressSpace *vtd_as;
+    VTDContextEntry ce;
+    int ret;
 
     cache_info.addr = 0;
     cache_info.npages = (uint64_t)-1;
@@ -4198,6 +4201,33 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     g_hash_table_foreach_remove(s->p_iotlb, vtd_hash_remove_by_pasid,
                                 &info);
     vtd_iommu_unlock(s);
+
+    QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
+        uint32_t rid2pasid = 0;
+        vtd_dev_get_rid2pasid(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
+                              &rid2pasid);
+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                       vtd_as->devfn, &ce);
+        if (!ret && s->root_scalable && likely(s->dmar_enabled) &&
+            domain_id == vtd_get_domain_id(s, &ce, pasid) &&
+            pasid == rid2pasid && !vtd_as_has_map_notifier(vtd_as)) {
+            IOMMUNotifier *notifier;
+
+            IOMMU_NOTIFIER_FOREACH(notifier, &vtd_as->iommu) {
+                IOMMUTLBEvent event;
+
+                event.type = IOMMU_NOTIFIER_UNMAP |
+                             IOMMU_NOTIFIER_DEVIOTLB_UNMAP;
+                event.entry.target_as = &address_space_memory;
+                event.entry.iova = notifier->start;
+                event.entry.perm = IOMMU_NONE;
+                event.entry.addr_mask = notifier->end - notifier->start;
+                event.entry.translated_addr = 0;
+
+                memory_region_notify_iommu_one(notifier, &event);
+            }
+        }
+    }
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
@@ -4207,6 +4237,10 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
     VTDPIOTLBInvInfo piotlb_info;
     VTDIOTLBPageInvInfo info;
+    VTDAddressSpace *vtd_as;
+    VTDContextEntry ce;
+    hwaddr size = (1 << am) * VTD_PAGE_SIZE;
+    int ret;
 
     cache_info.addr = addr;
     cache_info.npages = 1 << am;
@@ -4233,6 +4267,28 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     g_hash_table_foreach_remove(s->p_iotlb,
                                 vtd_hash_remove_by_page, &info);
     vtd_iommu_unlock(s);
+
+    QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
+        uint32_t rid2pasid = 0;
+        vtd_dev_get_rid2pasid(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
+                              &rid2pasid);
+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                       vtd_as->devfn, &ce);
+        if (!ret && s->root_scalable && likely(s->dmar_enabled) &&
+            domain_id == vtd_get_domain_id(s, &ce, pasid) &&
+            pasid == rid2pasid && !vtd_as_has_map_notifier(vtd_as)) {
+            IOMMUTLBEvent event;
+
+            event.type = IOMMU_NOTIFIER_UNMAP | IOMMU_NOTIFIER_DEVIOTLB_UNMAP;
+            event.entry.target_as = &address_space_memory;
+            event.entry.iova = addr;
+            event.entry.perm = IOMMU_NONE;
+            event.entry.addr_mask = size - 1;
+            event.entry.translated_addr = 0;
+
+            memory_region_notify_iommu(&vtd_as->iommu, 0, event);
+        }
+    }
 }
 
 static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 21/23] intel_iommu: invalidate piotlb when flush pasid
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (19 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 20/23] intel_iommu: piotlb invalidation should notify unmap Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 22/23] intel_iommu: refresh pasid bind after pasid cache force reset Zhenzhong Duan
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Sun <yi.y.sun@linux.intel.com>

When binding/unbinding emulated devices, we should invalidate the QEMU
piotlb. The host flushes the piotlb for passthrough devices, so we
don't need to handle passthrough devices here.

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
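
Note: the GHRFunc passed to g_hash_table_foreach_remove() returns TRUE
for entries that should be dropped, and vtd_hash_remove_by_pasid is only
forward-declared in this patch. A sketch of the expected matching logic,
with simplified stand-in types (not the real structures from
intel_iommu_internal.h):

    #include <glib.h>
    #include <stdint.h>

    typedef struct VTDIOTLBEntry {          /* stand-in, fields assumed */
        uint16_t domain_id;
        uint32_t pasid;
    } VTDIOTLBEntry;

    typedef struct VTDIOTLBPageInvInfo {    /* stand-in, fields assumed */
        uint16_t domain_id;
        uint32_t pasid;
    } VTDIOTLBPageInvInfo;

    static gboolean vtd_hash_remove_by_pasid(gpointer key, gpointer value,
                                             gpointer user_data)
    {
        VTDIOTLBEntry *entry = (VTDIOTLBEntry *)value;
        VTDIOTLBPageInvInfo *info = (VTDIOTLBPageInvInfo *)user_data;

        /* drop a cached piotlb entry iff both domain id and pasid match */
        return entry->domain_id == info->domain_id &&
               entry->pasid == info->pasid;
    }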
 hw/i386/intel_iommu.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6a6478e865..2f3d3a28b0 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -84,6 +84,8 @@ static VTDPASIDAddressSpace *vtd_add_find_pasid_as(IntelIOMMUState *s,
                                                    uint32_t pasid);
 static int vtd_dev_get_rid2pasid(IntelIOMMUState *s, uint8_t bus_num,
                                  uint8_t devfn, uint32_t *rid_pasid);
+static gboolean vtd_hash_remove_by_pasid(gpointer key, gpointer value,
+                                         gpointer user_data);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -3667,14 +3669,21 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
     VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
     PCIBus *bus = vtd_pasid_as->bus;
     VTDPASIDEntry pe;
+    VTDIOMMUFDDevice *vtd_idev;
+    VTDIOTLBPageInvInfo info;
     uint16_t did;
     uint32_t pasid;
     uint16_t devfn;
     int ret;
+    struct vtd_as_key as_key = {
+        .bus = vtd_pasid_as->bus,
+        .devfn = vtd_pasid_as->devfn,
+    };
 
     did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
     pasid = vtd_pasid_as->pasid;
     devfn = vtd_pasid_as->devfn;
+    vtd_idev = g_hash_table_lookup(s->vtd_iommufd_dev, &as_key);
 
     switch (pc_info->type) {
     case VTD_PASID_CACHE_FORCE_RESET:
@@ -3702,6 +3711,14 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         abort();
     }
 
+    info.domain_id = did;
+    info.pasid = pasid;
+    /* For passthrough devices, we don't need to invalidate the QEMU piotlb */
+    if (s->root_scalable && likely(s->dmar_enabled) && !vtd_idev) {
+        g_hash_table_foreach_remove(s->p_iotlb, vtd_hash_remove_by_pasid,
+                                    &info);
+    }
+
     /*
      * pasid cache invalidation may indicate a present pasid
      * entry to present pasid entry modification. To cover such
@@ -3725,18 +3742,9 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         return true;
     }
 
-    /*
-     * TODO:
-     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
-     *   should invalidate QEMU piotlb togehter with this change.
-     */
     return false;
+
 remove:
-    /*
-     * TODO:
-     * - when pasid-base-iotlb(piotlb) infrastructure is ready,
-     *   should invalidate QEMU piotlb togehter with this change.
-     */
     if (vtd_bind_guest_pasid(vtd_pasid_as,
                              NULL, VTD_PASID_UNBIND)) {
         pasid_cache_info_set_error(pc_info);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 22/23] intel_iommu: refresh pasid bind after pasid cache force reset
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (20 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 21/23] intel_iommu: invalidate piotlb when flush pasid Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
  2024-01-15 10:37 ` [PATCH rfcv1 23/23] intel_iommu: modify x-scalable-mode to be string option Zhenzhong Duan
  2024-01-22  4:29 ` [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Jason Wang
  23 siblings, 0 replies; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

The force reset clears the vtd_pasid_as and also unbinds the pasid on
the host side. This is problematic when the reset is triggered after
some pasid binding has been set up, e.g. enabling gcmd.TE resets the
cache but should keep the pasid #0 (gIOVA) binding. So we need to
refresh the pasid bindings from the guest pasid table accordingly.
Without this, an issue has been observed with legacy device passthrough
(e.g. a NIC without PASID support).

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
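
A condensed model of the ordering this patch enforces on the gcmd.TE
path; the declarations refer to real functions in hw/i386/intel_iommu.c,
but model_gcmd_te_enable is a hypothetical name and only the call order
mirrors the patch:

    typedef struct IntelIOMMUState IntelIOMMUState;

    void vtd_reset_caches(IntelIOMMUState *s);
    void vtd_address_space_refresh_all(IntelIOMMUState *s);
    void vtd_refresh_pasid_bind(IntelIOMMUState *s);

    static void model_gcmd_te_enable(IntelIOMMUState *s)
    {
        vtd_reset_caches(s);              /* clears vtd_pasid_as, unbinds */
        vtd_address_space_refresh_all(s);
        /* without this, the pasid #0 (gIOVA) binding is lost after TE */
        vtd_refresh_pasid_bind(s);
    }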
 hw/i386/intel_iommu.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2f3d3a28b0..e418305f6e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -72,6 +72,7 @@ struct vtd_iotlb_key {
 
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
 static void vtd_pasid_cache_sync(IntelIOMMUState *s,
@@ -3292,6 +3293,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_refresh_pasid_bind(s);
 }
 
 /* Set Interrupt Remap Table Pointer */
@@ -3326,6 +3328,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_refresh_pasid_bind(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -3960,6 +3963,28 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     }
 }
 
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info = { .error_happened = false,
+                                  .type = VTD_PASID_CACHE_GLOBAL_INV };
+
+    /*
+     * Only when dmar is enabled should pasid bindings be replayed;
+     * otherwise there is no need to replay.
+     */
+    if (!s->dmar_enabled) {
+        return;
+    }
+
+    if (!s->scalable_modern || !s->root_scalable) {
+        return;
+    }
+
+    vtd_iommu_lock(s);
+    vtd_replay_guest_pasid_bindings(s, &pc_info);
+    vtd_iommu_unlock(s);
+}
+
 /*
  * This function syncs the pasid bindings between guest and host.
  * It includes updating the pasid cache in vIOMMU and updating the
@@ -6051,6 +6076,7 @@ static void vtd_reset(DeviceState *dev)
 
     vtd_init(s);
     vtd_address_space_refresh_all(s);
+    vtd_refresh_pasid_bind(s);
 }
 
 static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH rfcv1 23/23] intel_iommu: modify x-scalable-mode to be string option
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (21 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 22/23] intel_iommu: refresh pasid bind after pasid cache force reset Zhenzhong Duan
@ 2024-01-15 10:37 ` Zhenzhong Duan
       [not found]   ` <CGME20240131144013eucas1p22d46339ae42f54dd59c23e8b95502dda@eucas1p2.samsung.com>
  2024-01-22  4:29 ` [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Jason Wang
  23 siblings, 1 reply; 29+ messages in thread
From: Zhenzhong Duan @ 2024-01-15 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, peterx, jasowang, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng, Yi Sun, Zhenzhong Duan, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

Intel VT-d 3.0 introduces scalable mode, which has a bunch of capabilities
related to scalable mode translation, so there are multiple possible
combinations. This vIOMMU implementation simplifies it for the user by
providing typical combinations. The user can configure it with the
"x-scalable-mode" option. The usage is as below:

"-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"

 - "legacy": gives support for stage-2 page table
 - "modern": gives support for stage-1 page table
 - "off": no scalable mode support
 - if not configured, there is no scalable mode support; if improperly
   configured, an error is thrown

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
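
For completeness, a full "modern"-mode invocation then looks like the
following; the PCI address and object id are placeholders, and (per the
cover letter) passthrough devices must use the iommufd backend in this
mode:

    qemu-system-x86_64 -M q35 -enable-kvm \
        -device intel-iommu,x-scalable-mode=modern \
        -object iommufd,id=iommufd0 \
        -device vfio-pci,host=0000:01:00.0,iommufd=iommufd0 \
        ...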
 include/hw/i386/intel_iommu.h |  1 +
 hw/i386/intel_iommu.c         | 25 +++++++++++++++++++++++--
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index f3e75263b7..9cbd568171 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -320,6 +320,7 @@ struct IntelIOMMUState {
 
     bool caching_mode;              /* RO - is cap CM enabled? */
     bool scalable_mode;             /* RO - is Scalable Mode supported? */
+    char *scalable_mode_str;        /* RO - admin's Scalable Mode config */
     bool scalable_modern;           /* RO - is modern SM supported? */
     bool snoop_control;             /* RO - is SNP filed supported? */
 
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e418305f6e..b507112069 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5111,7 +5111,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
                       VTD_HOST_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
-    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
+    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
     DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
     DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
@@ -6122,7 +6122,28 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         }
     }
 
-    /* Currently only address widths supported are 39 and 48 bits */
+    if (s->scalable_mode_str &&
+        (strcmp(s->scalable_mode_str, "off") &&
+         strcmp(s->scalable_mode_str, "modern") &&
+         strcmp(s->scalable_mode_str, "legacy"))) {
+        error_setg(errp, "Invalid x-scalable-mode config,"
+                         "Please use \"modern\", \"legacy\" or \"off\"");
+        return false;
+    }
+
+    if (s->scalable_mode_str &&
+        !strcmp(s->scalable_mode_str, "legacy")) {
+        s->scalable_mode = true;
+        s->scalable_modern = false;
+    } else if (s->scalable_mode_str &&
+        !strcmp(s->scalable_mode_str, "modern")) {
+        s->scalable_mode = true;
+        s->scalable_modern = true;
+    } else {
+        s->scalable_mode = false;
+        s->scalable_modern = false;
+    }
+
     if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
         (s->aw_bits != VTD_HOST_AW_48BIT) &&
         !s->scalable_modern) {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
  2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
                   ` (22 preceding siblings ...)
  2024-01-15 10:37 ` [PATCH rfcv1 23/23] intel_iommu: modify x-scalable-mode to be string option Zhenzhong Duan
@ 2024-01-22  4:29 ` Jason Wang
  2024-01-22  5:59   ` Duan, Zhenzhong
  23 siblings, 1 reply; 29+ messages in thread
From: Jason Wang @ 2024-01-22  4:29 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, mst, jgg,
	nicolinc, joao.m.martins, kevin.tian, yi.l.liu, yi.y.sun,
	chao.p.peng

On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
>
> Hi,
>
> This series enables stage-1 translation support in intel iommu which
> we called "modern" mode. In this mode, we don't do shadowing of
> guest page table for passthrough device but pass stage-1 page table
> to host side to construct a nested domain; we also support emulated
> device by translating the stage-1 page table. There was some effort
> to enable this feature in old days, see [1] for details.
>
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in host IOMMU.
> As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform
> the stage-1 address translation. Along with it, modifications to
> present mappings in the guest I/O page table should be followed
> with an IOTLB invalidation.
>
>         .-------------.  .---------------------------.
>         |   vIOMMU    |  | Guest I/O page table      |
>         |             |  '---------------------------'
>         .----------------/
>         | PASID Entry |--- PASID cache flush --+
>         '-------------'                        |
>         |             |                        V
>         |             |           I/O page table pointer in GPA
>         '-------------'
>     Guest
>     ------| Shadow |---------------------------|--------
>           v        v                           v
>     Host
>         .-------------.  .------------------------.
>         |   pIOMMU    |  |  FS for GIOVA->GPA     |
>         |             |  '------------------------'
>         .----------------/  |
>         | PASID Entry |     V (Nested xlate)
>         '----------------\.----------------------------------.
>         |             |   | SS for GPA->HPA, unmanaged domain|
>         |             |   '----------------------------------'
>         '-------------'
> Where:
>  - FS = First stage page tables
>  - SS = Second stage page tables
> <Intel VT-d Nested translation>
>
> There are some interactions between VFIO and vIOMMU.
> * vIOMMU registers PCIIOMMUOps to PCI subsystem which VFIO can
>   use to registers/unregisters IOMMUDevice object.
> * VFIO registers an IOMMUFDDevice object at vfio device realize
>   stage to vIOMMU, this is implemented as a prerequisite series[2].
> * vIOMMU calls IOMMUFDDevice interface callback IOMMUFDDeviceOps
>   to bind/unbind device to IOMMUFD backed domains, either nested
>   domain or not.
>
> See below diagram:
>
>         VFIO Device                                 Intel IOMMU
>     .-----------------.                         .-------------------.
>     |                 |                         |                   |
>     |       .---------|PCIIOMMUOps              |.-------------.    |
>     |       | IOMMUFD |(set_iommu_device)       || IOMMUFD     |    |
>     |       | Device  |------------------------>|| Device list |    |
>     |       .---------|(unset_iommu_device)     |.-------------.    |
>     |                 |                         |       |           |
>     |                 |                         |       V           |
>     |       .---------|         IOMMUFDDeviceOps|  .---------.      |
>     |       | IOMMUFD |            (attach_hwpt)|  | IOMMUFD |      |
>     |       | link    |<------------------------|  | Device  |      |
>     |       .---------|            (detach_hwpt)|  .---------.      |
>     |                 |                         |       |           |
>     |                 |                         |       ...         |
>     .-----------------.                         .-------------------.
>
> Based on Yi's suggestion, we updated a new design of managing ioas and
> hwpt, made it support multiple iommufd objects and the ERRATA_772415
> case, meanwhile tried to be optimal to share ioas and hwpt whenever
> possible.
>
> Stage-2 page table could be shared by different devices if there is
> no conflict and devices link to same iommufd object, i.e. devices
> under same host IOMMU can share same stage-2 page table. If there
> is conflict, i.e. there is one device under non cache coherency
> mode which is different from others, it requires a seperate
> stage-2 page table in non-CC mode.
>
> SPR platform has ERRATA_772415 which requires no readonly mappings
> in stage-2 page table. This series supports creating VTDIOASContainer
> with no readonly mappings. I'm not clear if there is a rare case that
> some IOMMUs on a multiple IOMMUs host have ERRATA_772415, this design
> can survive even in that case.
>
> See below example diagram for a full view:
>
>       IntelIOMMUState
>              |
>              V
>     .------------------.    .------------------.    .-------------------.
>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>     .------------------.    .------------------.    .-------------------.
>              |                       |                              |
>              |                       .-->...                        |
>              V                                                      V
>       .-------------------.    .-------------------.          .---------------.
>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>       .-------------------.    .-------------------.          .---------------.
>           |            |               |                            |
>           |            |               |                            |
>     .-----------.  .-----------.  .------------.              .------------.
>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>     .-----------.  .-----------.  .------------.              .------------.
>
> This series is also a prerequisite work for vSVA, i.e. Sharing
> guest application address space with passthrough devices.
>
> To enable "modern" mode, only need to add "x-scalable-mode=modern".
> i.e. -device intel-iommu,x-scalable-mode=modern,...
>
> Passthrough device should use iommufd backend to work in "modern" mode.
> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
> If host doens't support nested translation, qemu will fail
> with an unsupported report.
>
> Test done:
> - devices hotplug/unplug
> - different devices linked to different iommufds
>
> PATCH1-2:  Some preparing work to update header and IOMMUFD uAPI
> PATCH3-4:  Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
> PATCH5:    Introduce a placeholder variable for scalable modern mode
> PATCH6:    Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
> PATCH7-22: Implement first stage page table for passthrough and emulated device

Can we split the series and start from the emulated devices (and have
a qtest for that)? This might help with reviewing.

Thanks



^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
  2024-01-22  4:29 ` [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Jason Wang
@ 2024-01-22  5:59   ` Duan, Zhenzhong
  0 siblings, 0 replies; 29+ messages in thread
From: Duan, Zhenzhong @ 2024-01-22  5:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, mst, jgg,
	nicolinc, joao.m.martins, Tian, Kevin, Liu, Yi L, Sun, Yi Y,
	Peng, Chao P



>-----Original Message-----
>From: Jason Wang <jasowang@redhat.com>
>Subject: Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
>
>On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan
><zhenzhong.duan@intel.com> wrote:
>>
>> Hi,
>>
>> This series enables stage-1 translation support in intel iommu which
>> we called "modern" mode. In this mode, we don't do shadowing of
>> guest page table for passthrough device but pass stage-1 page table
>> to host side to construct a nested domain; we also support emulated
>> device by translating the stage-1 page table. There was some effort
>> to enable this feature in old days, see [1] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>         .-------------.  .---------------------------.
>>         |   vIOMMU    |  | Guest I/O page table      |
>>         |             |  '---------------------------'
>>         .----------------/
>>         | PASID Entry |--- PASID cache flush --+
>>         '-------------'                        |
>>         |             |                        V
>>         |             |           I/O page table pointer in GPA
>>         '-------------'
>>     Guest
>>     ------| Shadow |---------------------------|--------
>>           v        v                           v
>>     Host
>>         .-------------.  .------------------------.
>>         |   pIOMMU    |  |  FS for GIOVA->GPA     |
>>         |             |  '------------------------'
>>         .----------------/  |
>>         | PASID Entry |     V (Nested xlate)
>>         '----------------\.----------------------------------.
>>         |             |   | SS for GPA->HPA, unmanaged domain|
>>         |             |   '----------------------------------'
>>         '-------------'
>> Where:
>>  - FS = First stage page tables
>>  - SS = Second stage page tables
>> <Intel VT-d Nested translation>
>>
>> There are some interactions between VFIO and vIOMMU.
>> * vIOMMU registers PCIIOMMUOps to PCI subsystem which VFIO can
>>   use to registers/unregisters IOMMUDevice object.
>> * VFIO registers an IOMMUFDDevice object at vfio device realize
>>   stage to vIOMMU, this is implemented as a prerequisite series[2].
>> * vIOMMU calls IOMMUFDDevice interface callback IOMMUFDDeviceOps
>>   to bind/unbind device to IOMMUFD backed domains, either nested
>>   domain or not.
>>
>> See below diagram:
>>
>>         VFIO Device                                 Intel IOMMU
>>     .-----------------.                         .-------------------.
>>     |                 |                         |                   |
>>     |       .---------|PCIIOMMUOps              |.-------------.    |
>>     |       | IOMMUFD |(set_iommu_device)       || IOMMUFD     |    |
>>     |       | Device  |------------------------>|| Device list |    |
>>     |       .---------|(unset_iommu_device)     |.-------------.    |
>>     |                 |                         |       |           |
>>     |                 |                         |       V           |
>>     |       .---------|         IOMMUFDDeviceOps|  .---------.      |
>>     |       | IOMMUFD |            (attach_hwpt)|  | IOMMUFD |      |
>>     |       | link    |<------------------------|  | Device  |      |
>>     |       .---------|            (detach_hwpt)|  .---------.      |
>>     |                 |                         |       |           |
>>     |                 |                         |       ...         |
>>     .-----------------.                         .-------------------.
>>
>> Based on Yi's suggestion, we updated a new design of managing ioas and
>> hwpt, made it support multiple iommufd objects and the ERRATA_772415
>> case, meanwhile tried to be optimal to share ioas and hwpt whenever
>> possible.
>>
>> Stage-2 page table could be shared by different devices if there is
>> no conflict and devices link to same iommufd object, i.e. devices
>> under same host IOMMU can share same stage-2 page table. If there
>> is conflict, i.e. there is one device under non cache coherency
>> mode which is different from others, it requires a seperate
>> stage-2 page table in non-CC mode.
>>
>> SPR platform has ERRATA_772415 which requires no readonly mappings
>> in stage-2 page table. This series supports creating VTDIOASContainer
>> with no readonly mappings. I'm not clear if there is a rare case that
>> some IOMMUs on a multiple IOMMUs host have ERRATA_772415, this
>design
>> can survive even in that case.
>>
>> See below example diagram for a full view:
>>
>>       IntelIOMMUState
>>              |
>>              V
>>     .------------------.    .------------------.    .-------------------.
>>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer
>|-->...
>>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW
>only)|
>>     .------------------.    .------------------.    .-------------------.
>>              |                       |                              |
>>              |                       .-->...                        |
>>              V                                                      V
>>       .-------------------.    .-------------------.          .---------------.
>>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    |
>VTDS2Hwpt(CC) |-->...
>>       .-------------------.    .-------------------.          .---------------.
>>           |            |               |                            |
>>           |            |               |                            |
>>     .-----------.  .-----------.  .------------.              .------------.
>>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>     .-----------.  .-----------.  .------------.              .------------.
>>
>> This series is also a prerequisite work for vSVA, i.e. Sharing
>> guest application address space with passthrough devices.
>>
>> To enable "modern" mode, only need to add "x-scalable-mode=modern".
>> i.e. -device intel-iommu,x-scalable-mode=modern,...
>>
>> Passthrough device should use iommufd backend to work in "modern"
>mode.
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If host doens't support nested translation, qemu will fail
>> with an unsupported report.
>>
>> Test done:
>> - devices hotplug/unplug
>> - different devices linked to different iommufds
>>
>> PATCH1-2:  Some preparing work to update header and IOMMUFD uAPI
>> PATCH3-4:  Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
>> PATCH5:    Introduce a placeholder variable for scalable modern mode
>> PATCH6:    Sync host cap/ecap with vIOMMU default cap/ecap in modern
>mode
>> PATCH7-22: Implement first stage page table for passthrough and
>emulated device
>
>Can we split the series and start from the emulated devices (and have
>a qtest for that)? This might help for reviewing.

Sure, will do in rfcv2.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH rfcv1 23/23] intel_iommu: modify x-scalable-mode to be string option
       [not found]   ` <CGME20240131144013eucas1p22d46339ae42f54dd59c23e8b95502dda@eucas1p2.samsung.com>
@ 2024-01-31 14:40     ` Joel Granados
  2024-01-31 15:24       ` Yi Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Joel Granados @ 2024-01-31 14:40 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	mst, jgg, nicolinc, joao.m.martins, kevin.tian, yi.l.liu,
	yi.y.sun, chao.p.peng, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

[-- Attachment #1: Type: text/plain, Size: 4140 bytes --]

On Mon, Jan 15, 2024 at 06:37:35PM +0800, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
> related to scalable mode translation, thus there are multiple combinations.
> While this vIOMMU implementation wants to simplify it for user by providing
> typical combinations. User could config it by "x-scalable-mode" option. The
> usage is as below:
> 
> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
> 
>  - "legacy": gives support for stage-2 page table
>  - "modern": gives support for stage-1 page table
>  - "off": no scalable mode support
>  -  if not configured, means no scalable mode support, if not proper
>     configured, will throw error
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/hw/i386/intel_iommu.h |  1 +
>  hw/i386/intel_iommu.c         | 25 +++++++++++++++++++++++--
>  2 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index f3e75263b7..9cbd568171 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -320,6 +320,7 @@ struct IntelIOMMUState {
>  
>      bool caching_mode;              /* RO - is cap CM enabled? */
>      bool scalable_mode;             /* RO - is Scalable Mode supported? */
> +    char *scalable_mode_str;        /* RO - admin's Scalable Mode config */
>      bool scalable_modern;           /* RO - is modern SM supported? */
>      bool snoop_control;             /* RO - is SNP filed supported? */
>  
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index e418305f6e..b507112069 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -5111,7 +5111,7 @@ static Property vtd_properties[] = {
>      DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>                        VTD_HOST_ADDRESS_WIDTH),
>      DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
> -    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
> +    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
>      DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
>      DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
>      DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
> @@ -6122,7 +6122,28 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>          }
>      }
>  
> -    /* Currently only address widths supported are 39 and 48 bits */
> +    if (s->scalable_mode_str &&
> +        (strcmp(s->scalable_mode_str, "off") &&
> +         strcmp(s->scalable_mode_str, "modern") &&
> +         strcmp(s->scalable_mode_str, "legacy"))) {
> +        error_setg(errp, "Invalid x-scalable-mode config,"
> +                         "Please use \"modern\", \"legacy\" or \"off\"");
> +        return false;
> +    }
> +
> +    if (s->scalable_mode_str &&
> +        !strcmp(s->scalable_mode_str, "legacy")) {
> +        s->scalable_mode = true;
> +        s->scalable_modern = false;
> +    } else if (s->scalable_mode_str &&
> +        !strcmp(s->scalable_mode_str, "modern")) {
> +        s->scalable_mode = true;
> +        s->scalable_modern = true;
> +    } else {
> +        s->scalable_mode = false;
> +        s->scalable_modern = false;
> +    }
> +
>      if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
>          (s->aw_bits != VTD_HOST_AW_48BIT) &&
>          !s->scalable_modern) {
> -- 
> 2.34.1
> 
> 

I noticed that this patch changes quite a bit from the previous version
that you had. I specifically noticed that you dropped VTD_ECAP_PRS from
intel_iommu_internal.h. I was under the impression that this set the
Page Request Services capability in the IOMMU, effectively enabling PRI
in the iommu.

Why did you remove it from your original patch?

Thx in advance

Best

-- 

Joel Granados

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH rfcv1 23/23] intel_iommu: modify x-scalable-mode to be string option
  2024-01-31 14:40     ` Joel Granados
@ 2024-01-31 15:24       ` Yi Liu
  2024-02-04 21:05         ` Joel Granados
  0 siblings, 1 reply; 29+ messages in thread
From: Yi Liu @ 2024-01-31 15:24 UTC (permalink / raw)
  To: Joel Granados, Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, peterx, jasowang,
	mst, jgg, nicolinc, joao.m.martins, kevin.tian, yi.y.sun,
	chao.p.peng, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On 2024/1/31 22:40, Joel Granados wrote:
> On Mon, Jan 15, 2024 at 06:37:35PM +0800, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
>> related to scalable mode translation, thus there are multiple combinations.
>> While this vIOMMU implementation wants to simplify it for user by providing
>> typical combinations. User could config it by "x-scalable-mode" option. The
>> usage is as below:
>>
>> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
>>
>>   - "legacy": gives support for stage-2 page table
>>   - "modern": gives support for stage-1 page table
>>   - "off": no scalable mode support
>>   -  if not configured, means no scalable mode support, if not proper
>>      configured, will throw error
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   include/hw/i386/intel_iommu.h |  1 +
>>   hw/i386/intel_iommu.c         | 25 +++++++++++++++++++++++--
>>   2 files changed, 24 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>> index f3e75263b7..9cbd568171 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -320,6 +320,7 @@ struct IntelIOMMUState {
>>   
>>       bool caching_mode;              /* RO - is cap CM enabled? */
>>       bool scalable_mode;             /* RO - is Scalable Mode supported? */
>> +    char *scalable_mode_str;        /* RO - admin's Scalable Mode config */
>>       bool scalable_modern;           /* RO - is modern SM supported? */
>>       bool snoop_control;             /* RO - is SNP filed supported? */
>>   
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index e418305f6e..b507112069 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -5111,7 +5111,7 @@ static Property vtd_properties[] = {
>>       DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>>                         VTD_HOST_ADDRESS_WIDTH),
>>       DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
>> -    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
>> +    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
>>       DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
>>       DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
>>       DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
>> @@ -6122,7 +6122,28 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>>           }
>>       }
>>   
>> -    /* Currently only address widths supported are 39 and 48 bits */
>> +    if (s->scalable_mode_str &&
>> +        (strcmp(s->scalable_mode_str, "off") &&
>> +         strcmp(s->scalable_mode_str, "modern") &&
>> +         strcmp(s->scalable_mode_str, "legacy"))) {
>> +        error_setg(errp, "Invalid x-scalable-mode config,"
>> +                         "Please use \"modern\", \"legacy\" or \"off\"");
>> +        return false;
>> +    }
>> +
>> +    if (s->scalable_mode_str &&
>> +        !strcmp(s->scalable_mode_str, "legacy")) {
>> +        s->scalable_mode = true;
>> +        s->scalable_modern = false;
>> +    } else if (s->scalable_mode_str &&
>> +        !strcmp(s->scalable_mode_str, "modern")) {
>> +        s->scalable_mode = true;
>> +        s->scalable_modern = true;
>> +    } else {
>> +        s->scalable_mode = false;
>> +        s->scalable_modern = false;
>> +    }
>> +
>>       if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
>>           (s->aw_bits != VTD_HOST_AW_48BIT) &&
>>           !s->scalable_modern) {
>> -- 
>> 2.34.1
>>
>>
> 
> I noticed that this patch changes quite a bit from the previous version
> that you had. I Specifically noticed that you dropped VTD_ECAP_PRS from
> intel_iommu_internal.h. I was under the impression that this set the
> Page Request Servicves capability in the IOMMU effectively enabling PRI
> in the iommu.
> 
> Why did you remove it from your original patch?

It is because this series does not cover PRS support; that should come
in a future series.
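
For reference, what was dropped is just the extended-capability bit and
its wiring into the emulated ecap. Assuming the bit layout from the
VT-d spec (PRS is bit 29 of ECAP, matching Linux's ecap_prs() accessor),
the deferred definition would be along the lines of:

    /* hw/i386/intel_iommu_internal.h -- sketch only, deferred with PRS */
    #define VTD_ECAP_PRS        (1ULL << 29)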

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH rfcv1 23/23] intel_iommu: modify x-scalable-mode to be string option
  2024-01-31 15:24       ` Yi Liu
@ 2024-02-04 21:05         ` Joel Granados
  0 siblings, 0 replies; 29+ messages in thread
From: Joel Granados @ 2024-02-04 21:05 UTC (permalink / raw)
  To: Yi Liu
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, eric.auger,
	peterx, jasowang, mst, jgg, nicolinc, joao.m.martins, kevin.tian,
	yi.y.sun, chao.p.peng, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

[-- Attachment #1: Type: text/plain, Size: 4691 bytes --]

On Wed, Jan 31, 2024 at 11:24:18PM +0800, Yi Liu wrote:
> On 2024/1/31 22:40, Joel Granados wrote:
> > On Mon, Jan 15, 2024 at 06:37:35PM +0800, Zhenzhong Duan wrote:
> >> From: Yi Liu <yi.l.liu@intel.com>
> >>
> >> Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
> >> related to scalable mode translation, thus there are multiple combinations.
> >> While this vIOMMU implementation wants to simplify it for user by providing
> >> typical combinations. User could config it by "x-scalable-mode" option. The
> >> usage is as below:
> >>
> >> "-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"
> >>
> >>   - "legacy": gives support for stage-2 page table
> >>   - "modern": gives support for stage-1 page table
> >>   - "off": no scalable mode support
> >>   -  if not configured, means no scalable mode support, if not proper
> >>      configured, will throw error
> >>
> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> >> ---
> >>   include/hw/i386/intel_iommu.h |  1 +
> >>   hw/i386/intel_iommu.c         | 25 +++++++++++++++++++++++--
> >>   2 files changed, 24 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> >> index f3e75263b7..9cbd568171 100644
> >> --- a/include/hw/i386/intel_iommu.h
> >> +++ b/include/hw/i386/intel_iommu.h
> >> @@ -320,6 +320,7 @@ struct IntelIOMMUState {
> >>   
> >>       bool caching_mode;              /* RO - is cap CM enabled? */
> >>       bool scalable_mode;             /* RO - is Scalable Mode supported? */
> >> +    char *scalable_mode_str;        /* RO - admin's Scalable Mode config */
> >>       bool scalable_modern;           /* RO - is modern SM supported? */
> >>       bool snoop_control;             /* RO - is SNP filed supported? */
> >>   
> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >> index e418305f6e..b507112069 100644
> >> --- a/hw/i386/intel_iommu.c
> >> +++ b/hw/i386/intel_iommu.c
> >> @@ -5111,7 +5111,7 @@ static Property vtd_properties[] = {
> >>       DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
> >>                         VTD_HOST_ADDRESS_WIDTH),
> >>       DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
> >> -    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
> >> +    DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
> >>       DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
> >>       DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
> >>       DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
> >> @@ -6122,7 +6122,28 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> >>           }
> >>       }
> >>   
> >> -    /* Currently only address widths supported are 39 and 48 bits */
> >> +    if (s->scalable_mode_str &&
> >> +        (strcmp(s->scalable_mode_str, "off") &&
> >> +         strcmp(s->scalable_mode_str, "modern") &&
> >> +         strcmp(s->scalable_mode_str, "legacy"))) {
> >> +        error_setg(errp, "Invalid x-scalable-mode config,"
> >> +                         "Please use \"modern\", \"legacy\" or \"off\"");
> >> +        return false;
> >> +    }
> >> +
> >> +    if (s->scalable_mode_str &&
> >> +        !strcmp(s->scalable_mode_str, "legacy")) {
> >> +        s->scalable_mode = true;
> >> +        s->scalable_modern = false;
> >> +    } else if (s->scalable_mode_str &&
> >> +        !strcmp(s->scalable_mode_str, "modern")) {
> >> +        s->scalable_mode = true;
> >> +        s->scalable_modern = true;
> >> +    } else {
> >> +        s->scalable_mode = false;
> >> +        s->scalable_modern = false;
> >> +    }
> >> +
> >>       if ((s->aw_bits != VTD_HOST_AW_39BIT) &&
> >>           (s->aw_bits != VTD_HOST_AW_48BIT) &&
> >>           !s->scalable_modern) {
> >> -- 
> >> 2.34.1
> >>
> >>
> > 
> > I noticed that this patch changes quite a bit from the previous version
> > that you had. I Specifically noticed that you dropped VTD_ECAP_PRS from
> > intel_iommu_internal.h. I was under the impression that this set the
> > Page Request Servicves capability in the IOMMU effectively enabling PRI
> > in the iommu.
> > 
> > Why did you remove it from your original patch?
> 
> It is because this series does not cover the PRS support. It should come
> in a future series.

ok. Thx for getting back to me.

Best
> 
> Regards,
> Yi Liu

-- 

Joel Granados

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 659 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-02-04 21:06 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-15 10:37 [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 01/23] Update linux header to support nested hwpt alloc Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 02/23] backends/iommufd: add helpers for allocating user-managed HWPT Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 03/23] backends/iommufd_device: introduce IOMMUFDDevice targeted interface Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 04/23] vfio: implement IOMMUFDDevice interface callbacks Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 05/23] intel_iommu: add a placeholder variable for scalable modern mode Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 06/23] intel_iommu: check and sync host IOMMU cap/ecap in " Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 07/23] intel_iommu: process PASID cache invalidation Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 08/23] intel_iommu: add PASID cache management infrastructure Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 09/23] vfio/iommufd_device: Add ioas_id in IOMMUFDDevice and pass to vIOMMU Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 10/23] intel_iommu: bind/unbind guest page table to host Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 11/23] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 12/23] intel_iommu: replay pasid binds after context cache invalidation Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 13/23] intel_iommu: process PASID-based iotlb invalidation Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 14/23] intel_iommu: propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 15/23] intel_iommu: process PASID-based Device-TLB invalidation Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 16/23] intel_iommu: rename slpte in iotlb_entry to pte Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 17/23] intel_iommu: implement firt level translation Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 18/23] intel_iommu: fix the fault reason report Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 19/23] intel_iommu: introduce pasid iotlb cache Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 20/23] intel_iommu: piotlb invalidation should notify unmap Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 21/23] intel_iommu: invalidate piotlb when flush pasid Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 22/23] intel_iommu: refresh pasid bind after pasid cache force reset Zhenzhong Duan
2024-01-15 10:37 ` [PATCH rfcv1 23/23] intel_iommu: modify x-scalable-mode to be string option Zhenzhong Duan
     [not found]   ` <CGME20240131144013eucas1p22d46339ae42f54dd59c23e8b95502dda@eucas1p2.samsung.com>
2024-01-31 14:40     ` Joel Granados
2024-01-31 15:24       ` Yi Liu
2024-02-04 21:05         ` Joel Granados
2024-01-22  4:29 ` [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation Jason Wang
2024-01-22  5:59   ` Duan, Zhenzhong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).