qemu-devel.nongnu.org archive mirror
* [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration
@ 2021-04-11 12:08 Eric Auger
  2021-04-11 12:08 ` [RFC v9 01/29] hw/vfio/common: trace vfio_connect_container operations Eric Auger
                   ` (28 more replies)
  0 siblings, 29 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Up to now vSMMUv3 has not been integrated with VFIO. VFIO
integration requires programming the physical IOMMU consistently
with the guest mappings. However, as opposed to VT-d, SMMUv3 has
no "Caching Mode" which allows easy trapping of guest mappings.
This means the vSMMUv3 cannot use the same VFIO integration as VT-d.

However SMMUv3 has 2 translation stages. This was devised with
the virtualization use case in mind, where stage 1 is "owned" by
the guest whereas the host uses stage 2 for VM isolation.

This series sets up this nested translation stage. It only works
if there is one physical SMMUv3 used along with the QEMU vSMMUv3
(in other words, it does not work if there is a physical SMMUv2).

- We force the host to use stage 2 instead of stage 1, when we
  detect a vSMMUv3 is behind a VFIO device. For a VFIO device
  without any virtual IOMMU, we still use stage 1 as many existing
  SMMUs expect this behavior.
- We use PCIPASIDOps to propagate guest stage 1 config changes on
  STE (Stream Table Entry) changes.
- We implement a specific UNMAP notifier that conveys guest
  IOTLB invalidations to the host.
- We register MSI IOVA/GPA bindings to the host so that the latter
  can build a nested stage translation.
- As the legacy MAP notifier is not called anymore, we must make
  sure stage 2 mappings are set. This is achieved through another
  prereg memory listener.
- Physical SMMU stage 1 related faults are reported to the guest
  via an eventfd mechanism and exposed through a dedicated VFIO-PCI
  region. Then they are reinjected into the guest.

Best Regards

Eric

All the patches can be found at:
https://github.com/eauger/qemu/tree/v6.0.0-rc2-2stage-rfcv9

Previous version:
v8: https://github.com/eauger/qemu/tree/v5.2.0-2stage-rfcv8

Kernel Dependencies:
[1] [PATCH v15 00/12] SMMUv3 Nested Stage Setup (IOMMU part)
[2] [PATCH v13 00/13] SMMUv3 Nested Stage Setup (VFIO part)
branch containing both:
https://github.com/eauger/linux/tree/v5.12-rc6-jean-iopf-14-2stage-v15

History:

v8 -> v9:
- added
  hw/arm/smmu-common: Allow domain invalidation for NH_ALL/NSNH_ALL
  following Chenxiang's report

v7 -> v8:
- adapt to changes to the kernel uapi
- Fix unregistration of MSI bindings
- applies on top of range invalidation fixes
- changes in IOTLBEntry (flags)
- addressed all the comments from reviewers/testers, I hope.
  Many thanks to all of you! See the individual logs.


Eric Auger (28):
  hw/vfio/common: trace vfio_connect_container operations
  update-linux-headers: Import iommu.h
  header update against 5.12-rc6 and IOMMU/VFIO nested stage APIs
  memory: Add new fields in IOTLBEntry
  hw/arm/smmuv3: Improve stage1 ASID invalidation
  hw/arm/smmu-common: Allow domain invalidation for NH_ALL/NSNH_ALL
  memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute
  memory: Add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
  memory: Introduce IOMMU Memory Region inject_faults API
  iommu: Introduce generic header
  vfio: Force nested if iommu requires it
  vfio: Introduce hostwin_from_range helper
  vfio: Introduce helpers to DMA map/unmap a RAM section
  vfio: Set up nested stage mappings
  vfio: Pass stage 1 MSI bindings to the host
  vfio: Helper to get IRQ info including capabilities
  vfio/pci: Register handler for iommu fault
  vfio/pci: Set up the DMA FAULT region
  vfio/pci: Implement the DMA fault handler
  hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute
  hw/arm/smmuv3: Store the PASID table GPA in the translation config
  hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
  hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
  hw/arm/smmuv3: Pass stage 1 configurations to the host
  hw/arm/smmuv3: Implement fault injection
  hw/arm/smmuv3: Allow MAP notifiers
  pci: Add return_page_response pci ops
  vfio/pci: Implement return_page_response page response callback

Liu Yi L (1):
  pci: introduce PCIPASIDOps to PCIDevice

 hw/arm/smmu-internal.h                        |   1 +
 hw/vfio/pci.h                                 |  11 +
 include/exec/memory.h                         |  64 +-
 include/hw/arm/smmu-common.h                  |   1 +
 include/hw/iommu/iommu.h                      |  36 ++
 include/hw/pci/pci.h                          |  15 +
 include/hw/vfio/vfio-common.h                 |  19 +
 include/standard-headers/drm/drm_fourcc.h     |  23 +-
 include/standard-headers/linux/ethtool.h      |  54 +-
 include/standard-headers/linux/fuse.h         |   3 +-
 include/standard-headers/linux/input.h        |   2 +-
 .../standard-headers/rdma/vmw_pvrdma-abi.h    |   7 +
 linux-headers/asm-generic/unistd.h            |   4 +-
 linux-headers/asm-mips/unistd_n32.h           |   1 +
 linux-headers/asm-mips/unistd_n64.h           |   1 +
 linux-headers/asm-mips/unistd_o32.h           |   1 +
 linux-headers/asm-powerpc/kvm.h               |   2 +
 linux-headers/asm-powerpc/unistd_32.h         |   1 +
 linux-headers/asm-powerpc/unistd_64.h         |   1 +
 linux-headers/asm-s390/unistd_32.h            |   1 +
 linux-headers/asm-s390/unistd_64.h            |   1 +
 linux-headers/asm-x86/kvm.h                   |   1 +
 linux-headers/asm-x86/unistd_32.h             |   1 +
 linux-headers/asm-x86/unistd_64.h             |   1 +
 linux-headers/asm-x86/unistd_x32.h            |   1 +
 linux-headers/linux/kvm.h                     |  89 +++
 linux-headers/linux/vfio.h                    | 169 +++++-
 hw/arm/smmu-common.c                          |   3 +-
 hw/arm/smmuv3.c                               | 224 ++++++-
 hw/i386/intel_iommu.c                         |   6 +-
 hw/pci/pci.c                                  |  50 ++
 hw/ppc/spapr_iommu.c                          |   2 +-
 hw/vfio/common.c                              | 565 ++++++++++++++----
 hw/vfio/pci.c                                 | 415 ++++++++++++-
 hw/virtio/virtio-iommu.c                      |   4 +-
 softmmu/memory.c                              |  10 +
 hw/arm/trace-events                           |   2 +
 hw/vfio/trace-events                          |  11 +-
 scripts/update-linux-headers.sh               |   2 +-
 39 files changed, 1626 insertions(+), 179 deletions(-)
 create mode 100644 include/hw/iommu/iommu.h

-- 
2.26.3




* [RFC v9 01/29] hw/vfio/common: trace vfio_connect_container operations
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 02/29] update-linux-headers: Import iommu.h Eric Auger
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

We currently trace vfio_disconnect_container() but we do not trace
the container <-> group creation, which can be useful to understand
the VFIO topology.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c     | 3 +++
 hw/vfio/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ae5654fcdb..a456455517 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1742,6 +1742,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     QLIST_FOREACH(container, &space->containers, next) {
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             group->container = container;
+            trace_vfio_connect_existing_container(group->groupid,
+                                                  container->fd);
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
             vfio_kvm_device_add_group(group);
             return 0;
@@ -1775,6 +1777,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     if (ret) {
         goto free_container_exit;
     }
+    trace_vfio_connect_new_container(group->groupid, container->fd);
 
     switch (container->iommu_type) {
     case VFIO_TYPE1v2_IOMMU:
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 079f53acf2..2a41326c0f 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -104,6 +104,8 @@ vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t si
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_disconnect_container(int fd) "close container->fd=%d"
+vfio_connect_existing_container(int groupid, int container_fd) "group=%d existing container fd=%d"
+vfio_connect_new_container(int groupid, int container_fd) "group=%d new container fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
-- 
2.26.3




* [RFC v9 02/29] update-linux-headers: Import iommu.h
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
  2021-04-11 12:08 ` [RFC v9 01/29] hw/vfio/common: trace vfio_connect_container operations Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 03/29] header update against 5.12-rc6 and IOMMU/VFIO nested stage APIs Eric Auger
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Update the script to import the new iommu.h uapi header.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 scripts/update-linux-headers.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 1050e36169..b1abafac3c 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -142,7 +142,7 @@ done
 
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
-for header in kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \
+for header in kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h iommu.h \
               psci.h psp-sev.h userfaultfd.h mman.h; do
     cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
-- 
2.26.3




* [RFC v9 03/29] header update against 5.12-rc6 and IOMMU/VFIO nested stage APIs
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
  2021-04-11 12:08 ` [RFC v9 01/29] hw/vfio/common: trace vfio_connect_container operations Eric Auger
  2021-04-11 12:08 ` [RFC v9 02/29] update-linux-headers: Import iommu.h Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 04/29] memory: Add new fields in IOTLBEntry Eric Auger
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/standard-headers/drm/drm_fourcc.h     |  23 ++-
 include/standard-headers/linux/ethtool.h      |  54 +++---
 include/standard-headers/linux/fuse.h         |   3 +-
 include/standard-headers/linux/input.h        |   2 +-
 .../standard-headers/rdma/vmw_pvrdma-abi.h    |   7 +
 linux-headers/asm-generic/unistd.h            |   4 +-
 linux-headers/asm-mips/unistd_n32.h           |   1 +
 linux-headers/asm-mips/unistd_n64.h           |   1 +
 linux-headers/asm-mips/unistd_o32.h           |   1 +
 linux-headers/asm-powerpc/kvm.h               |   2 +
 linux-headers/asm-powerpc/unistd_32.h         |   1 +
 linux-headers/asm-powerpc/unistd_64.h         |   1 +
 linux-headers/asm-s390/unistd_32.h            |   1 +
 linux-headers/asm-s390/unistd_64.h            |   1 +
 linux-headers/asm-x86/kvm.h                   |   1 +
 linux-headers/asm-x86/unistd_32.h             |   1 +
 linux-headers/asm-x86/unistd_64.h             |   1 +
 linux-headers/asm-x86/unistd_x32.h            |   1 +
 linux-headers/linux/kvm.h                     |  89 +++++++++
 linux-headers/linux/vfio.h                    | 169 +++++++++++++++++-
 20 files changed, 337 insertions(+), 27 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index c47e19810c..a61ae520c2 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -526,6 +526,25 @@ extern "C" {
  */
 #define I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS fourcc_mod_code(INTEL, 7)
 
+/*
+ * Intel Color Control Surface with Clear Color (CCS) for Gen-12 render
+ * compression.
+ *
+ * The main surface is Y-tiled and is at plane index 0 whereas CCS is linear
+ * and at index 1. The clear color is stored at index 2, and the pitch should
+ * be ignored. The clear color structure is 256 bits. The first 128 bits
+ * represents Raw Clear Color Red, Green, Blue and Alpha color each represented
+ * by 32 bits. The raw clear color is consumed by the 3d engine and generates
+ * the converted clear color of size 64 bits. The first 32 bits store the Lower
+ * Converted Clear Color value and the next 32 bits store the Higher Converted
+ * Clear Color value when applicable. The Converted Clear Color values are
+ * consumed by the DE. The last 64 bits are used to store Color Discard Enable
+ * and Depth Clear Value Valid which are ignored by the DE. A CCS cache line
+ * corresponds to an area of 4x1 tiles in the main surface. The main surface
+ * pitch is required to be a multiple of 4 tile widths.
+ */
+#define I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS_CC fourcc_mod_code(INTEL, 8)
+
 /*
  * Tiled, NV12MT, grouped in 64 (pixels) x 32 (lines) -sized macroblocks
  *
@@ -1035,9 +1054,9 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t modifier)
  * Not all combinations are valid, and different SoCs may support different
  * combinations of layout and options.
  */
-#define __fourcc_mod_amlogic_layout_mask 0xf
+#define __fourcc_mod_amlogic_layout_mask 0xff
 #define __fourcc_mod_amlogic_options_shift 8
-#define __fourcc_mod_amlogic_options_mask 0xf
+#define __fourcc_mod_amlogic_options_mask 0xff
 
 #define DRM_FORMAT_MOD_AMLOGIC_FBC(__layout, __options) \
 	fourcc_mod_code(AMLOGIC, \
diff --git a/include/standard-headers/linux/ethtool.h b/include/standard-headers/linux/ethtool.h
index 8bfd01d230..8e166b3c49 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -26,6 +26,14 @@
  * have the same layout for 32-bit and 64-bit userland.
  */
 
+/* Note on reserved space.
+ * Reserved fields must not be accessed directly by user space because
+ * they may be replaced by a different field in the future. They must
+ * be initialized to zero before making the request, e.g. via memset
+ * of the entire structure or implicitly by not being set in a structure
+ * initializer.
+ */
+
 /**
  * struct ethtool_cmd - DEPRECATED, link control and status
  * This structure is DEPRECATED, please use struct ethtool_link_settings.
@@ -67,6 +75,7 @@
  *	and other link features that the link partner advertised
  *	through autonegotiation; 0 if unknown or not applicable.
  *	Read-only.
+ * @reserved: Reserved for future use; see the note on reserved space.
  *
  * The link speed in Mbps is split between @speed and @speed_hi.  Use
  * the ethtool_cmd_speed() and ethtool_cmd_speed_set() functions to
@@ -155,6 +164,7 @@ static inline uint32_t ethtool_cmd_speed(const struct ethtool_cmd *ep)
  * @bus_info: Device bus address.  This should match the dev_name()
  *	string for the underlying bus device, if there is one.  May be
  *	an empty string.
+ * @reserved2: Reserved for future use; see the note on reserved space.
  * @n_priv_flags: Number of flags valid for %ETHTOOL_GPFLAGS and
  *	%ETHTOOL_SPFLAGS commands; also the number of strings in the
  *	%ETH_SS_PRIV_FLAGS set
@@ -356,6 +366,7 @@ struct ethtool_eeprom {
  * @tx_lpi_timer: Time in microseconds the interface delays prior to asserting
  *	its tx lpi (after reaching 'idle' state). Effective only when eee
  *	was negotiated and tx_lpi_enabled was set.
+ * @reserved: Reserved for future use; see the note on reserved space.
  */
 struct ethtool_eee {
 	uint32_t	cmd;
@@ -374,6 +385,7 @@ struct ethtool_eee {
  * @cmd: %ETHTOOL_GMODULEINFO
  * @type: Standard the module information conforms to %ETH_MODULE_SFF_xxxx
  * @eeprom_len: Length of the eeprom
+ * @reserved: Reserved for future use; see the note on reserved space.
  *
  * This structure is used to return the information to
  * properly size memory for a subsequent call to %ETHTOOL_GMODULEEEPROM.
@@ -579,9 +591,7 @@ struct ethtool_pauseparam {
 	uint32_t	tx_pause;
 };
 
-/**
- * enum ethtool_link_ext_state - link extended state
- */
+/* Link extended state */
 enum ethtool_link_ext_state {
 	ETHTOOL_LINK_EXT_STATE_AUTONEG,
 	ETHTOOL_LINK_EXT_STATE_LINK_TRAINING_FAILURE,
@@ -595,10 +605,7 @@ enum ethtool_link_ext_state {
 	ETHTOOL_LINK_EXT_STATE_OVERHEAT,
 };
 
-/**
- * enum ethtool_link_ext_substate_autoneg - more information in addition to
- * ETHTOOL_LINK_EXT_STATE_AUTONEG.
- */
+/* More information in addition to ETHTOOL_LINK_EXT_STATE_AUTONEG. */
 enum ethtool_link_ext_substate_autoneg {
 	ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED = 1,
 	ETHTOOL_LINK_EXT_SUBSTATE_AN_ACK_NOT_RECEIVED,
@@ -608,9 +615,7 @@ enum ethtool_link_ext_substate_autoneg {
 	ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_HCD,
 };
 
-/**
- * enum ethtool_link_ext_substate_link_training - more information in addition to
- * ETHTOOL_LINK_EXT_STATE_LINK_TRAINING_FAILURE.
+/* More information in addition to ETHTOOL_LINK_EXT_STATE_LINK_TRAINING_FAILURE.
  */
 enum ethtool_link_ext_substate_link_training {
 	ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_FRAME_LOCK_NOT_ACQUIRED = 1,
@@ -619,9 +624,7 @@ enum ethtool_link_ext_substate_link_training {
 	ETHTOOL_LINK_EXT_SUBSTATE_LT_REMOTE_FAULT,
 };
 
-/**
- * enum ethtool_link_ext_substate_logical_mismatch - more information in addition
- * to ETHTOOL_LINK_EXT_STATE_LINK_LOGICAL_MISMATCH.
+/* More information in addition to ETHTOOL_LINK_EXT_STATE_LINK_LOGICAL_MISMATCH.
  */
 enum ethtool_link_ext_substate_link_logical_mismatch {
 	ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_BLOCK_LOCK = 1,
@@ -631,19 +634,14 @@ enum ethtool_link_ext_substate_link_logical_mismatch {
 	ETHTOOL_LINK_EXT_SUBSTATE_LLM_RS_FEC_IS_NOT_LOCKED,
 };
 
-/**
- * enum ethtool_link_ext_substate_bad_signal_integrity - more information in
- * addition to ETHTOOL_LINK_EXT_STATE_BAD_SIGNAL_INTEGRITY.
+/* More information in addition to ETHTOOL_LINK_EXT_STATE_BAD_SIGNAL_INTEGRITY.
  */
 enum ethtool_link_ext_substate_bad_signal_integrity {
 	ETHTOOL_LINK_EXT_SUBSTATE_BSI_LARGE_NUMBER_OF_PHYSICAL_ERRORS = 1,
 	ETHTOOL_LINK_EXT_SUBSTATE_BSI_UNSUPPORTED_RATE,
 };
 
-/**
- * enum ethtool_link_ext_substate_cable_issue - more information in
- * addition to ETHTOOL_LINK_EXT_STATE_CABLE_ISSUE.
- */
+/* More information in addition to ETHTOOL_LINK_EXT_STATE_CABLE_ISSUE. */
 enum ethtool_link_ext_substate_cable_issue {
 	ETHTOOL_LINK_EXT_SUBSTATE_CI_UNSUPPORTED_CABLE = 1,
 	ETHTOOL_LINK_EXT_SUBSTATE_CI_CABLE_TEST_FAILURE,
@@ -661,6 +659,7 @@ enum ethtool_link_ext_substate_cable_issue {
  *	now deprecated
  * @ETH_SS_FEATURES: Device feature names
  * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
+ * @ETH_SS_TUNABLES: tunable names
  * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
  * @ETH_SS_PHY_TUNABLES: PHY tunable names
  * @ETH_SS_LINK_MODES: link mode names
@@ -670,6 +669,8 @@ enum ethtool_link_ext_substate_cable_issue {
  * @ETH_SS_TS_TX_TYPES: timestamping Tx types
  * @ETH_SS_TS_RX_FILTERS: timestamping Rx filters
  * @ETH_SS_UDP_TUNNEL_TYPES: UDP tunnel types
+ *
+ * @ETH_SS_COUNT: number of defined string sets
  */
 enum ethtool_stringset {
 	ETH_SS_TEST		= 0,
@@ -715,6 +716,7 @@ struct ethtool_gstrings {
 /**
  * struct ethtool_sset_info - string set information
  * @cmd: Command number = %ETHTOOL_GSSET_INFO
+ * @reserved: Reserved for future use; see the note on reserved space.
  * @sset_mask: On entry, a bitmask of string sets to query, with bits
  *	numbered according to &enum ethtool_stringset.  On return, a
  *	bitmask of those string sets queried that are supported.
@@ -759,6 +761,7 @@ enum ethtool_test_flags {
  * @flags: A bitmask of flags from &enum ethtool_test_flags.  Some
  *	flags may be set by the user on entry; others may be set by
  *	the driver on return.
+ * @reserved: Reserved for future use; see the note on reserved space.
  * @len: On return, the number of test results
  * @data: Array of test results
  *
@@ -959,6 +962,7 @@ union ethtool_flow_union {
  * @vlan_etype: VLAN EtherType
  * @vlan_tci: VLAN tag control information
  * @data: user defined data
+ * @padding: Reserved for future use; see the note on reserved space.
  *
  * Note, @vlan_etype, @vlan_tci, and @data are only valid if %FLOW_EXT
  * is set in &struct ethtool_rx_flow_spec @flow_type.
@@ -1134,7 +1138,8 @@ struct ethtool_rxfh_indir {
  *	hardware hash key.
  * @hfunc: Defines the current RSS hash function used by HW (or to be set to).
  *	Valid values are one of the %ETH_RSS_HASH_*.
- * @rsvd:	Reserved for future extensions.
+ * @rsvd8: Reserved for future use; see the note on reserved space.
+ * @rsvd32: Reserved for future use; see the note on reserved space.
  * @rss_config: RX ring/queue index for each hash value i.e., indirection table
  *	of @indir_size uint32_t elements, followed by hash key of @key_size
  *	bytes.
@@ -1302,7 +1307,9 @@ struct ethtool_sfeatures {
  * @so_timestamping: bit mask of the sum of the supported SO_TIMESTAMPING flags
  * @phc_index: device index of the associated PHC, or -1 if there is none
  * @tx_types: bit mask of the supported hwtstamp_tx_types enumeration values
+ * @tx_reserved: Reserved for future use; see the note on reserved space.
  * @rx_filters: bit mask of the supported hwtstamp_rx_filters enumeration values
+ * @rx_reserved: Reserved for future use; see the note on reserved space.
  *
  * The bits in the 'tx_types' and 'rx_filters' fields correspond to
  * the 'hwtstamp_tx_types' and 'hwtstamp_rx_filters' enumeration values,
@@ -1958,6 +1965,11 @@ enum ethtool_reset_flags {
  *	autonegotiation; 0 if unknown or not applicable.  Read-only.
  * @transceiver: Used to distinguish different possible PHY types,
  *	reported consistently by PHYLIB.  Read-only.
+ * @master_slave_cfg: Master/slave port mode.
+ * @master_slave_state: Master/slave port state.
+ * @reserved: Reserved for future use; see the note on reserved space.
+ * @reserved1: Reserved for future use; see the note on reserved space.
+ * @link_mode_masks: Variable length bitmaps.
  *
  * If autonegotiation is disabled, the speed and @duplex represent the
  * fixed link mode and are writable if the driver supports multiple
diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h
index 950d7edb7e..4dca160517 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -899,7 +899,8 @@ struct fuse_notify_retrieve_in {
 };
 
 /* Device ioctls: */
-#define FUSE_DEV_IOC_CLONE	_IOR(229, 0, uint32_t)
+#define FUSE_DEV_IOC_MAGIC		229
+#define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
diff --git a/include/standard-headers/linux/input.h b/include/standard-headers/linux/input.h
index f89c986190..7822c24178 100644
--- a/include/standard-headers/linux/input.h
+++ b/include/standard-headers/linux/input.h
@@ -81,7 +81,7 @@ struct input_id {
  * in units per radian.
  * When INPUT_PROP_ACCELEROMETER is set the resolution changes.
  * The main axes (ABS_X, ABS_Y, ABS_Z) are then reported in
- * in units per g (units/g) and in units per degree per second
+ * units per g (units/g) and in units per degree per second
  * (units/deg/s) for rotational axes (ABS_RX, ABS_RY, ABS_RZ).
  */
 struct input_absinfo {
diff --git a/include/standard-headers/rdma/vmw_pvrdma-abi.h b/include/standard-headers/rdma/vmw_pvrdma-abi.h
index 0989426a3f..c30182a7ae 100644
--- a/include/standard-headers/rdma/vmw_pvrdma-abi.h
+++ b/include/standard-headers/rdma/vmw_pvrdma-abi.h
@@ -133,6 +133,13 @@ enum pvrdma_wc_flags {
 	PVRDMA_WC_FLAGS_MAX		= PVRDMA_WC_WITH_NETWORK_HDR_TYPE,
 };
 
+enum pvrdma_network_type {
+	PVRDMA_NETWORK_IB,
+	PVRDMA_NETWORK_ROCE_V1 = PVRDMA_NETWORK_IB,
+	PVRDMA_NETWORK_IPV4,
+	PVRDMA_NETWORK_IPV6
+};
+
 struct pvrdma_alloc_ucontext_resp {
 	uint32_t qp_tab_size;
 	uint32_t reserved;
diff --git a/linux-headers/asm-generic/unistd.h b/linux-headers/asm-generic/unistd.h
index 7287529177..ce58cff99b 100644
--- a/linux-headers/asm-generic/unistd.h
+++ b/linux-headers/asm-generic/unistd.h
@@ -861,9 +861,11 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
 __SYSCALL(__NR_process_madvise, sys_process_madvise)
 #define __NR_epoll_pwait2 441
 __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
+#define __NR_mount_setattr 442
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)
 
 #undef __NR_syscalls
-#define __NR_syscalls 442
+#define __NR_syscalls 443
 
 /*
  * 32 bit systems traditionally used different
diff --git a/linux-headers/asm-mips/unistd_n32.h b/linux-headers/asm-mips/unistd_n32.h
index 59e53b6e07..2ca45a0122 100644
--- a/linux-headers/asm-mips/unistd_n32.h
+++ b/linux-headers/asm-mips/unistd_n32.h
@@ -371,6 +371,7 @@
 #define __NR_faccessat2	(__NR_Linux + 439)
 #define __NR_process_madvise	(__NR_Linux + 440)
 #define __NR_epoll_pwait2	(__NR_Linux + 441)
+#define __NR_mount_setattr	(__NR_Linux + 442)
 
 
 #endif /* _ASM_MIPS_UNISTD_N32_H */
diff --git a/linux-headers/asm-mips/unistd_n64.h b/linux-headers/asm-mips/unistd_n64.h
index 683558a7f8..c8df45e69c 100644
--- a/linux-headers/asm-mips/unistd_n64.h
+++ b/linux-headers/asm-mips/unistd_n64.h
@@ -347,6 +347,7 @@
 #define __NR_faccessat2	(__NR_Linux + 439)
 #define __NR_process_madvise	(__NR_Linux + 440)
 #define __NR_epoll_pwait2	(__NR_Linux + 441)
+#define __NR_mount_setattr	(__NR_Linux + 442)
 
 
 #endif /* _ASM_MIPS_UNISTD_N64_H */
diff --git a/linux-headers/asm-mips/unistd_o32.h b/linux-headers/asm-mips/unistd_o32.h
index ca6a7e5c0b..10ba4cf9f5 100644
--- a/linux-headers/asm-mips/unistd_o32.h
+++ b/linux-headers/asm-mips/unistd_o32.h
@@ -417,6 +417,7 @@
 #define __NR_faccessat2	(__NR_Linux + 439)
 #define __NR_process_madvise	(__NR_Linux + 440)
 #define __NR_epoll_pwait2	(__NR_Linux + 441)
+#define __NR_mount_setattr	(__NR_Linux + 442)
 
 
 #endif /* _ASM_MIPS_UNISTD_O32_H */
diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
index c3af3f324c..9f18fa090f 100644
--- a/linux-headers/asm-powerpc/kvm.h
+++ b/linux-headers/asm-powerpc/kvm.h
@@ -644,6 +644,8 @@ struct kvm_ppc_cpu_char {
 #define KVM_REG_PPC_MMCR3	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc1)
 #define KVM_REG_PPC_SIER2	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc2)
 #define KVM_REG_PPC_SIER3	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc3)
+#define KVM_REG_PPC_DAWR1	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc4)
+#define KVM_REG_PPC_DAWRX1	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc5)
 
 /* Transactional Memory checkpointed state:
  * This is all GPRs, all VSX regs and a subset of SPRs
diff --git a/linux-headers/asm-powerpc/unistd_32.h b/linux-headers/asm-powerpc/unistd_32.h
index 4624c90043..1d63e42fc4 100644
--- a/linux-headers/asm-powerpc/unistd_32.h
+++ b/linux-headers/asm-powerpc/unistd_32.h
@@ -424,6 +424,7 @@
 #define __NR_faccessat2	439
 #define __NR_process_madvise	440
 #define __NR_epoll_pwait2	441
+#define __NR_mount_setattr	442
 
 
 #endif /* _ASM_POWERPC_UNISTD_32_H */
diff --git a/linux-headers/asm-powerpc/unistd_64.h b/linux-headers/asm-powerpc/unistd_64.h
index 7e851b30bb..6a8708c0c5 100644
--- a/linux-headers/asm-powerpc/unistd_64.h
+++ b/linux-headers/asm-powerpc/unistd_64.h
@@ -396,6 +396,7 @@
 #define __NR_faccessat2	439
 #define __NR_process_madvise	440
 #define __NR_epoll_pwait2	441
+#define __NR_mount_setattr	442
 
 
 #endif /* _ASM_POWERPC_UNISTD_64_H */
diff --git a/linux-headers/asm-s390/unistd_32.h b/linux-headers/asm-s390/unistd_32.h
index c94d2c3a22..e5efe406e3 100644
--- a/linux-headers/asm-s390/unistd_32.h
+++ b/linux-headers/asm-s390/unistd_32.h
@@ -414,5 +414,6 @@
 #define __NR_faccessat2 439
 #define __NR_process_madvise 440
 #define __NR_epoll_pwait2 441
+#define __NR_mount_setattr 442
 
 #endif /* _ASM_S390_UNISTD_32_H */
diff --git a/linux-headers/asm-s390/unistd_64.h b/linux-headers/asm-s390/unistd_64.h
index 984a06b7eb..f0392fc6c7 100644
--- a/linux-headers/asm-s390/unistd_64.h
+++ b/linux-headers/asm-s390/unistd_64.h
@@ -362,5 +362,6 @@
 #define __NR_faccessat2 439
 #define __NR_process_madvise 440
 #define __NR_epoll_pwait2 441
+#define __NR_mount_setattr 442
 
 #endif /* _ASM_S390_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 8e76d3701d..5a3022c8af 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -112,6 +112,7 @@ struct kvm_ioapic_state {
 #define KVM_NR_IRQCHIPS          3
 
 #define KVM_RUN_X86_SMM		 (1 << 0)
+#define KVM_RUN_X86_BUS_LOCK     (1 << 1)
 
 /* for KVM_GET_REGS and KVM_SET_REGS */
 struct kvm_regs {
diff --git a/linux-headers/asm-x86/unistd_32.h b/linux-headers/asm-x86/unistd_32.h
index 18fb99dfa2..1374427c66 100644
--- a/linux-headers/asm-x86/unistd_32.h
+++ b/linux-headers/asm-x86/unistd_32.h
@@ -432,6 +432,7 @@
 #define __NR_faccessat2 439
 #define __NR_process_madvise 440
 #define __NR_epoll_pwait2 441
+#define __NR_mount_setattr 442
 
 
 #endif /* _ASM_X86_UNISTD_32_H */
diff --git a/linux-headers/asm-x86/unistd_64.h b/linux-headers/asm-x86/unistd_64.h
index bde959328d..e9d0707bc3 100644
--- a/linux-headers/asm-x86/unistd_64.h
+++ b/linux-headers/asm-x86/unistd_64.h
@@ -354,6 +354,7 @@
 #define __NR_faccessat2 439
 #define __NR_process_madvise 440
 #define __NR_epoll_pwait2 441
+#define __NR_mount_setattr 442
 
 
 #endif /* _ASM_X86_UNISTD_64_H */
diff --git a/linux-headers/asm-x86/unistd_x32.h b/linux-headers/asm-x86/unistd_x32.h
index 4ff6b17d3b..107aee76f2 100644
--- a/linux-headers/asm-x86/unistd_x32.h
+++ b/linux-headers/asm-x86/unistd_x32.h
@@ -307,6 +307,7 @@
 #define __NR_faccessat2 (__X32_SYSCALL_BIT + 439)
 #define __NR_process_madvise (__X32_SYSCALL_BIT + 440)
 #define __NR_epoll_pwait2 (__X32_SYSCALL_BIT + 441)
+#define __NR_mount_setattr (__X32_SYSCALL_BIT + 442)
 #define __NR_rt_sigaction (__X32_SYSCALL_BIT + 512)
 #define __NR_rt_sigreturn (__X32_SYSCALL_BIT + 513)
 #define __NR_ioctl (__X32_SYSCALL_BIT + 514)
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 020b62a619..238c6c5847 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -216,6 +216,20 @@ struct kvm_hyperv_exit {
 	} u;
 };
 
+struct kvm_xen_exit {
+#define KVM_EXIT_XEN_HCALL          1
+	__u32 type;
+	union {
+		struct {
+			__u32 longmode;
+			__u32 cpl;
+			__u64 input;
+			__u64 result;
+			__u64 params[6];
+		} hcall;
+	} u;
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -251,6 +265,9 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_X86_RDMSR        29
 #define KVM_EXIT_X86_WRMSR        30
 #define KVM_EXIT_DIRTY_RING_FULL  31
+#define KVM_EXIT_AP_RESET_HOLD    32
+#define KVM_EXIT_X86_BUS_LOCK     33
+#define KVM_EXIT_XEN              34
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -427,6 +444,8 @@ struct kvm_run {
 			__u32 index; /* kernel -> user */
 			__u64 data; /* kernel <-> user */
 		} msr;
+		/* KVM_EXIT_XEN */
+		struct kvm_xen_exit xen;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -573,6 +592,7 @@ struct kvm_vapic_addr {
 #define KVM_MP_STATE_CHECK_STOP        6
 #define KVM_MP_STATE_OPERATING         7
 #define KVM_MP_STATE_LOAD              8
+#define KVM_MP_STATE_AP_RESET_HOLD     9
 
 struct kvm_mp_state {
 	__u32 mp_state;
@@ -1056,6 +1076,8 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_X86_BUS_LOCK_EXIT 193
+#define KVM_CAP_PPC_DAWR1 194
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1129,6 +1151,11 @@ struct kvm_x86_mce {
 #endif
 
 #ifdef KVM_CAP_XEN_HVM
+#define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR	(1 << 0)
+#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL	(1 << 1)
+#define KVM_XEN_HVM_CONFIG_SHARED_INFO		(1 << 2)
+#define KVM_XEN_HVM_CONFIG_RUNSTATE		(1 << 3)
+
 struct kvm_xen_hvm_config {
 	__u32 flags;
 	__u32 msr;
@@ -1563,6 +1590,57 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
 
+/* Per-VM Xen attributes */
+#define KVM_XEN_HVM_GET_ATTR	_IOWR(KVMIO, 0xc8, struct kvm_xen_hvm_attr)
+#define KVM_XEN_HVM_SET_ATTR	_IOW(KVMIO,  0xc9, struct kvm_xen_hvm_attr)
+
+struct kvm_xen_hvm_attr {
+	__u16 type;
+	__u16 pad[3];
+	union {
+		__u8 long_mode;
+		__u8 vector;
+		struct {
+			__u64 gfn;
+		} shared_info;
+		__u64 pad[8];
+	} u;
+};
+
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO */
+#define KVM_XEN_ATTR_TYPE_LONG_MODE		0x0
+#define KVM_XEN_ATTR_TYPE_SHARED_INFO		0x1
+#define KVM_XEN_ATTR_TYPE_UPCALL_VECTOR		0x2
+
+/* Per-vCPU Xen attributes */
+#define KVM_XEN_VCPU_GET_ATTR	_IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr)
+#define KVM_XEN_VCPU_SET_ATTR	_IOW(KVMIO,  0xcb, struct kvm_xen_vcpu_attr)
+
+struct kvm_xen_vcpu_attr {
+	__u16 type;
+	__u16 pad[3];
+	union {
+		__u64 gpa;
+		__u64 pad[8];
+		struct {
+			__u64 state;
+			__u64 state_entry_time;
+			__u64 time_running;
+			__u64 time_runnable;
+			__u64 time_blocked;
+			__u64 time_offline;
+		} runstate;
+	} u;
+};
+
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO */
+#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO	0x0
+#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO	0x1
+#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR	0x2
+#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT	0x3
+#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA	0x4
+#define KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST	0x5
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1591,6 +1669,8 @@ enum sev_cmd_id {
 	KVM_SEV_DBG_ENCRYPT,
 	/* Guest certificates commands */
 	KVM_SEV_CERT_EXPORT,
+	/* Attestation report */
+	KVM_SEV_GET_ATTESTATION_REPORT,
 
 	KVM_SEV_NR_MAX,
 };
@@ -1643,6 +1723,12 @@ struct kvm_sev_dbg {
 	__u32 len;
 };
 
+struct kvm_sev_attestation_report {
+	__u8 mnonce[16];
+	__u64 uaddr;
+	__u32 len;
+};
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
@@ -1764,4 +1850,7 @@ struct kvm_dirty_gfn {
 	__u64 offset;
 };
 
+#define KVM_BUS_LOCK_DETECTION_OFF             (1 << 0)
+#define KVM_BUS_LOCK_DETECTION_EXIT            (1 << 1)
+
 #endif /* __LINUX_KVM_H */
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 609099e455..f413396ff9 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -46,6 +47,12 @@
  */
 #define VFIO_NOIOMMU_IOMMU		8
 
+/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
+#define VFIO_UNMAP_ALL			9
+
+/* Supports the vaddr flag for DMA map and unmap */
+#define VFIO_UPDATE_VADDR		10
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -318,6 +325,7 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
 #define VFIO_REGION_TYPE_MIGRATION              (3)
+#define VFIO_REGION_TYPE_NESTED			(4)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -342,6 +350,10 @@ struct vfio_region_info_cap_type {
 /* sub-types for VFIO_REGION_TYPE_GFX */
 #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
 
+/* sub-types for VFIO_REGION_TYPE_NESTED */
+#define VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT	(1)
+#define VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT_RESPONSE	(2)
+
 /**
  * struct vfio_region_gfx_edid - EDID region layout.
  *
@@ -697,11 +709,30 @@ struct vfio_irq_info {
 #define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
 #define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
 #define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+#define VFIO_IRQ_INFO_FLAG_CAPS		(1 << 4) /* Info supports caps */
 	__u32	index;		/* IRQ index */
 	__u32	count;		/* Number of IRQs within this index */
+	__u32	cap_offset;	/* Offset within info struct of first cap */
 };
 #define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
 
+/*
+ * The irq type capability allows IRQs unique to a specific device or
+ * class of devices to be exposed.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_IRQ_INFO_CAP_TYPE      3
+
+struct vfio_irq_info_cap_type {
+	struct vfio_info_cap_header header;
+	__u32 type;     /* global per bus driver */
+	__u32 subtype;  /* type specific */
+};
+
+#define VFIO_IRQ_TYPE_NESTED				(1)
+#define VFIO_IRQ_SUBTYPE_DMA_FAULT			(1)
+
 /**
  * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
  *
@@ -803,7 +834,8 @@ enum {
 	VFIO_PCI_MSIX_IRQ_INDEX,
 	VFIO_PCI_ERR_IRQ_INDEX,
 	VFIO_PCI_REQ_IRQ_INDEX,
-	VFIO_PCI_NUM_IRQS
+	VFIO_PCI_NUM_IRQS = 5	/* Fixed user ABI, IRQ indexes >=5 use   */
+				/* device specific cap to define content */
 };
 
 /*
@@ -988,6 +1020,68 @@ struct vfio_device_feature {
  */
 #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
 
+/*
+ * Capability exposed by the DMA fault region
+ * @version: ABI version
+ */
+#define VFIO_REGION_INFO_CAP_DMA_FAULT	6
+
+struct vfio_region_info_cap_fault {
+	struct vfio_info_cap_header header;
+	__u32 version;
+};
+
+/*
+ * Capability exposed by the DMA fault response region
+ * @version: ABI version
+ */
+#define VFIO_REGION_INFO_CAP_DMA_FAULT_RESPONSE	7
+
+struct vfio_region_info_cap_fault_response {
+	struct vfio_info_cap_header header;
+	__u32 version;
+};
+
+/*
+ * DMA Fault Region Layout
+ * @tail: index relative to the start of the ring buffer at which the
+ *        consumer finds the next item in the buffer
+ * @entry_size: fault ring buffer entry size in bytes
+ * @nb_entries: max capacity of the fault ring buffer
+ * @offset: ring buffer offset relative to the start of the region
+ * @head: index relative to the start of the ring buffer at which the
+ *        producer (kernel) inserts items into the buffers
+ */
+struct vfio_region_dma_fault {
+	/* Write-Only */
+	__u32   tail;
+	/* Read-Only */
+	__u32   entry_size;
+	__u32	nb_entries;
+	__u32	offset;
+	__u32   head;
+};
+
+/*
+ * DMA Fault Response Region Layout
+ * @head: index relative to the start of the ring buffer at which the
+ *        producer (userspace) insert responses into the buffer
+ * @entry_size: fault ring buffer entry size in bytes
+ * @nb_entries: max capacity of the fault ring buffer
+ * @offset: ring buffer offset relative to the start of the region
+ * @tail: index relative to the start of the ring buffer at which the
+ *        consumer (kernel) finds the next item in the buffer
+ */
+struct vfio_region_dma_fault_response {
+	/* Write-Only */
+	__u32   head;
+	/* Read-Only */
+	__u32   entry_size;
+	__u32	nb_entries;
+	__u32	offset;
+	__u32   tail;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
@@ -1074,12 +1168,22 @@ struct vfio_iommu_type1_info_dma_avail {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and
+ * unblock translation of host virtual addresses in the iova range.  The vaddr
+ * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
+ * maintain memory consistency within the user application, the updated vaddr
+ * must address the same memory object as originally mapped.  Failure to do so
+ * will result in user memory corruption and/or device misbehavior.  iova and
+ * size must match those in the original MAP_DMA call.  Protection is not
+ * changed, and the READ & WRITE flags must be 0.
  */
 struct vfio_iommu_type1_dma_map {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
 	__u64	vaddr;				/* Process virtual address */
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
@@ -1102,6 +1206,7 @@ struct vfio_bitmap {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ *
  * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
  * before unmapping IO virtual addresses. When this flag is set, the user must
  * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
@@ -1111,11 +1216,21 @@ struct vfio_bitmap {
  * indicates that the page at that offset from iova is dirty. A Bitmap of the
  * pages in the range of unmapped size is returned in the user-provided
  * vfio_bitmap.data.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
+ * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
+ * virtual addresses in the iova range.  Tasks that attempt to translate an
+ * iova's vaddr will block.  DMA to already-mapped pages continues.  This
+ * cannot be combined with the get-dirty-bitmap flag.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
+#define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
+#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
 	__u8    data[];
@@ -1181,6 +1296,58 @@ struct vfio_iommu_type1_dirty_bitmap_get {
 
 #define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
 
+/*
+ * VFIO_IOMMU_SET_PASID_TABLE - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
+ *			struct vfio_iommu_type1_set_pasid_table)
+ *
+ * The SET operation passes a PASID table to the host while the
+ * UNSET operation detaches the one currently programmed. It is
+ * allowed to "SET" the table several times without unsetting as
+ * long as the table config does not stay IOMMU_PASID_CONFIG_TRANSLATE.
+ */
+struct vfio_iommu_type1_set_pasid_table {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_PASID_TABLE_FLAG_SET	(1 << 0)
+#define VFIO_PASID_TABLE_FLAG_UNSET	(1 << 1)
+	struct iommu_pasid_table_config config; /* used on SET */
+};
+
+#define VFIO_IOMMU_SET_PASID_TABLE	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19,
+ *			struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct iommu_cache_invalidate_info info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/**
+ * VFIO_IOMMU_SET_MSI_BINDING - _IOWR(VFIO_TYPE, VFIO_BASE + 20,
+ *			struct vfio_iommu_type1_set_msi_binding)
+ *
+ * Pass a stage 1 MSI doorbell mapping to the host so that this
+ * latter can build a nested stage2 mapping. Or conversely tear
+ * down a previously bound stage 1 MSI binding.
+ */
+struct vfio_iommu_type1_set_msi_binding {
+	__u32   argsz;
+	__u32   flags;
+#define VFIO_IOMMU_BIND_MSI	(1 << 0)
+#define VFIO_IOMMU_UNBIND_MSI	(1 << 1)
+	__u64	iova;	/* MSI guest IOVA */
+	/* Fields below are used on BIND */
+	__u64	gpa;	/* MSI guest physical address */
+	__u64	size;	/* size of stage1 mapping (bytes) */
+};
+#define VFIO_IOMMU_SET_MSI_BINDING      _IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.26.3



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC v9 04/29] memory: Add new fields in IOTLBEntry
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (2 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 03/29] header update against 5.12-rc6 and IOMMU/VFIO nested stage APIs Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 05/29] hw/arm/smmuv3: Improve stage1 ASID invalidation Eric Auger
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

The current IOTLBEntry is too simple to interact with
some physical IOMMUs. IOTLBs can be invalidated at different
granularities: domain, pasid, addr. The current IOTLB entry only
supports page-selective invalidation. Let's add a granularity
field that conveys this information.

TLB entries are usually tagged with some ids such as the asid
or pasid. When propagating an invalidation command from the
guest to the host, we need to pass those IDs.

We also add a leaf field which indicates, in case of an
invalidation notification, whether only cache entries for the
last level of translation need to be invalidated.

A flags field is introduced to indicate which of those new fields are set.

To make sure existing users do not rely on uninitialized new
fields, zero-initialize the IOMMUTLBEvents where needed.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v7 -> v8:
- add pasid, granularity and flags
---
 include/exec/memory.h    | 36 +++++++++++++++++++++++++++++++++++-
 hw/arm/smmu-common.c     |  2 +-
 hw/arm/smmuv3.c          |  2 +-
 hw/i386/intel_iommu.c    |  6 +++---
 hw/ppc/spapr_iommu.c     |  2 +-
 hw/virtio/virtio-iommu.c |  4 ++--
 6 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5728a681b2..94b9157249 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -75,14 +75,48 @@ typedef enum {
     IOMMU_RW   = 3,
 } IOMMUAccessFlags;
 
+/* Granularity of the cache invalidation */
+typedef enum {
+    IOMMU_INV_GRAN_ADDR = 0,
+    IOMMU_INV_GRAN_PASID,
+    IOMMU_INV_GRAN_DOMAIN,
+} IOMMUInvGranularity;
+
 #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
 
+/**
+ * IOMMUTLBEntry - IOMMU TLB entry
+ *
+ * Structure used when performing a translation or when notifying MAP or
+ * UNMAP (invalidation) events
+ *
+ * @target_as: target address space
+ * @iova: IO virtual address (input)
+ * @translated_addr: translated address (output)
+ * @addr_mask: address mask (0xfff means 4K binding), must be multiple of 2
+ * @perm: permission flag of the mapping (NONE encodes no mapping or
+ * invalidation notification)
+ * @granularity: granularity of the invalidation
+ * @flags: informs whether the following fields are set
+ * @arch_id: architecture specific ID tagging the TLB
+ * @pasid: PASID tagging the TLB
+ * @leaf: when @perm is NONE, indicates whether only caches for the last
+ * level of translation need to be invalidated.
+ */
 struct IOMMUTLBEntry {
     AddressSpace    *target_as;
     hwaddr           iova;
     hwaddr           translated_addr;
-    hwaddr           addr_mask;  /* 0xfff = 4k translation */
+    hwaddr           addr_mask;
     IOMMUAccessFlags perm;
+    IOMMUInvGranularity granularity;
+#define IOMMU_INV_FLAGS_PASID  (1 << 0)
+#define IOMMU_INV_FLAGS_ARCHID (1 << 1)
+#define IOMMU_INV_FLAGS_LEAF   (1 << 2)
+    uint32_t         flags;
+    uint32_t         arch_id;
+    uint32_t         pasid;
+    bool             leaf;
 };
 
 /*
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 84d2c62c26..0ba3dca3b8 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -471,7 +471,7 @@ IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)
 /* Unmap the whole notifier's range */
 static void smmu_unmap_notifier_range(IOMMUNotifier *n)
 {
-    IOMMUTLBEvent event;
+    IOMMUTLBEvent event = {};
 
     event.type = IOMMU_NOTIFIER_UNMAP;
     event.entry.target_as = &address_space_memory;
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 3b87324ce2..d037d6df5b 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -801,7 +801,7 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
                                uint8_t tg, uint64_t num_pages)
 {
     SMMUDevice *sdev = container_of(mr, SMMUDevice, iommu);
-    IOMMUTLBEvent event;
+    IOMMUTLBEvent event = {};
     uint8_t granule;
 
     if (!tg) {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6be8f32918..1c5b43f902 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1195,7 +1195,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
     uint32_t offset;
     uint64_t slpte;
     uint64_t subpage_size, subpage_mask;
-    IOMMUTLBEvent event;
+    IOMMUTLBEvent event = {};
     uint64_t iova = start;
     uint64_t iova_next;
     int ret = 0;
@@ -2427,7 +2427,7 @@ static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
                                           VTDInvDesc *inv_desc)
 {
     VTDAddressSpace *vtd_dev_as;
-    IOMMUTLBEvent event;
+    IOMMUTLBEvent event = {};
     struct VTDBus *vtd_bus;
     hwaddr addr;
     uint64_t sz;
@@ -3483,7 +3483,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
     size = remain = end - start + 1;
 
     while (remain >= VTD_PAGE_SIZE) {
-        IOMMUTLBEvent event;
+        IOMMUTLBEvent event = {};
         uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
         uint64_t size = mask + 1;
 
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 24537ffcbd..1ad3709e34 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -450,7 +450,7 @@ static void spapr_tce_reset(DeviceState *dev)
 static target_ulong put_tce_emu(SpaprTceTable *tcet, target_ulong ioba,
                                 target_ulong tce)
 {
-    IOMMUTLBEvent event;
+    IOMMUTLBEvent event = {};
     hwaddr page_mask = IOMMU_PAGE_MASK(tcet->page_shift);
     unsigned long index = (ioba - tcet->bus_offset) >> tcet->page_shift;
 
diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 1b23e8e18c..83ed2b82e6 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -129,7 +129,7 @@ static void virtio_iommu_notify_map(IOMMUMemoryRegion *mr, hwaddr virt_start,
                                     hwaddr virt_end, hwaddr paddr,
                                     uint32_t flags)
 {
-    IOMMUTLBEvent event;
+    IOMMUTLBEvent event = {};
     IOMMUAccessFlags perm = IOMMU_ACCESS_FLAG(flags & VIRTIO_IOMMU_MAP_F_READ,
                                               flags & VIRTIO_IOMMU_MAP_F_WRITE);
 
@@ -154,7 +154,7 @@ static void virtio_iommu_notify_map(IOMMUMemoryRegion *mr, hwaddr virt_start,
 static void virtio_iommu_notify_unmap(IOMMUMemoryRegion *mr, hwaddr virt_start,
                                       hwaddr virt_end)
 {
-    IOMMUTLBEvent event;
+    IOMMUTLBEvent event = {};
     uint64_t delta = virt_end - virt_start;
 
     if (!(mr->iommu_notify_flags & IOMMU_NOTIFIER_UNMAP)) {
-- 
2.26.3




* [RFC v9 05/29] hw/arm/smmuv3: Improve stage1 ASID invalidation
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (3 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 04/29] memory: Add new fields in IOTLBEntry Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 06/29] hw/arm/smmu-common: Allow domain invalidation for NH_ALL/NSNH_ALL Eric Auger
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

At the moment the ASID invalidation command (CMD_TLBI_NH_ASID) is
propagated as a domain invalidation (the whole notifier range
is invalidated regardless of the ASID information).

The new granularity field now allows us to be more precise and
to restrict the invalidation to a particular ASID. Set the
corresponding fields and flag.

We still keep the iova and addr_mask settings for consumers that
do not support the new fields, like VHOST.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v8 -> v9:
- restore the iova and addr_mask settings for consumers that do
  not support the new fields like VHOST
---
 hw/arm/smmuv3.c     | 44 ++++++++++++++++++++++++++++++++++++++++++--
 hw/arm/trace-events |  1 +
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index d037d6df5b..a4436868ba 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -835,6 +835,31 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
     memory_region_notify_iommu_one(n, &event);
 }
 
+/**
+ * smmuv3_notify_asid - call the notifier @n for a given asid
+ *
+ * @mr: IOMMU mr region handle
+ * @n: notifier to be called
+ * @asid: address space ID or negative value if we don't care
+ */
+static void smmuv3_notify_asid(IOMMUMemoryRegion *mr,
+                               IOMMUNotifier *n, int asid)
+{
+    IOMMUTLBEvent event = {};
+
+    event.type = IOMMU_NOTIFIER_UNMAP;
+    event.entry.target_as = &address_space_memory;
+    event.entry.perm = IOMMU_NONE;
+    event.entry.granularity = IOMMU_INV_GRAN_PASID;
+    event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
+    event.entry.arch_id = asid;
+    event.entry.iova = n->start;
+    event.entry.addr_mask = n->end - n->start;
+
+    memory_region_notify_iommu_one(n, &event);
+}
+
+
 /* invalidate an asid/iova range tuple in all mr's */
 static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,
                                       uint8_t tg, uint64_t num_pages)
@@ -910,6 +935,22 @@ smmuv3_invalidate_ste(gpointer key, gpointer value, gpointer user_data)
     return true;
 }
 
+static void smmuv3_s1_asid_inval(SMMUState *s, uint16_t asid)
+{
+    SMMUDevice *sdev;
+
+    trace_smmuv3_s1_asid_inval(asid);
+    QLIST_FOREACH(sdev, &s->devices_with_notifiers, next) {
+        IOMMUMemoryRegion *mr = &sdev->iommu;
+        IOMMUNotifier *n;
+
+        IOMMU_NOTIFIER_FOREACH(n, mr) {
+            smmuv3_notify_asid(mr, n, asid);
+        }
+    }
+    smmu_iotlb_inv_asid(s, asid);
+}
+
 static int smmuv3_cmdq_consume(SMMUv3State *s)
 {
     SMMUState *bs = ARM_SMMU(s);
@@ -1020,8 +1061,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             uint16_t asid = CMD_ASID(&cmd);
 
             trace_smmuv3_cmdq_tlbi_nh_asid(asid);
-            smmu_inv_notifiers_all(&s->smmu_state);
-            smmu_iotlb_inv_asid(bs, asid);
+            smmuv3_s1_asid_inval(bs, asid);
             break;
         }
         case SMMU_CMD_TLBI_NH_ALL:
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index b79a91af5f..8e530ba79d 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -46,6 +46,7 @@ smmuv3_cmdq_cfgi_cd(uint32_t sid) "sid=0x%x"
 smmuv3_config_cache_hit(uint32_t sid, uint32_t hits, uint32_t misses, uint32_t perc) "Config cache HIT for sid=0x%x (hits=%d, misses=%d, hit rate=%d)"
 smmuv3_config_cache_miss(uint32_t sid, uint32_t hits, uint32_t misses, uint32_t perc) "Config cache MISS for sid=0x%x (hits=%d, misses=%d, hit rate=%d)"
 smmuv3_s1_range_inval(int vmid, int asid, uint64_t addr, uint8_t tg, uint64_t num_pages, uint8_t ttl, bool leaf) "vmid=%d asid=%d addr=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64" ttl=%d leaf=%d"
+smmuv3_s1_asid_inval(int asid) "asid=%d"
 smmuv3_cmdq_tlbi_nh(void) ""
 smmuv3_cmdq_tlbi_nh_asid(uint16_t asid) "asid=%d"
 smmuv3_config_cache_inv(uint32_t sid) "Config cache INV for sid=0x%x"
-- 
2.26.3




* [RFC v9 06/29] hw/arm/smmu-common: Allow domain invalidation for NH_ALL/NSNH_ALL
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (4 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 05/29] hw/arm/smmuv3: Improve stage1 ASID invalidation Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 07/29] memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute Eric Auger
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

NH_ALL/NSNH_ALL corresponds to a domain-granularity invalidation,
i.e. the whole notifier range gets invalidated, whatever the ASID.
So let's set the granularity to IOMMU_INV_GRAN_DOMAIN so that the
consumer can benefit from this information if it supports it.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: chenxiang (M) <chenxiang66@hisilicon.com>
---
 hw/arm/smmu-common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 0ba3dca3b8..c33d03de67 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -478,6 +478,7 @@ static void smmu_unmap_notifier_range(IOMMUNotifier *n)
     event.entry.iova = n->start;
     event.entry.perm = IOMMU_NONE;
     event.entry.addr_mask = n->end - n->start;
+    event.entry.granularity = IOMMU_INV_GRAN_DOMAIN;
 
     memory_region_notify_iommu_one(n, &event);
 }
-- 
2.26.3




* [RFC v9 07/29] memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (5 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 06/29] hw/arm/smmu-common: Allow domain invalidation for NH_ALL/NSNH_ALL Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 08/29] memory: Add IOMMU_ATTR_MSI_TRANSLATE " Eric Auger
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

We introduce a new IOMMU Memory Region attribute,
IOMMU_ATTR_VFIO_NESTED, that tells whether the virtual IOMMU
requires HW nested paging for VFIO integration.

The current Intel virtual IOMMU device supports "Caching
Mode" and does not require two stages at the physical level to be
integrated with VFIO. However SMMUv3 does not implement such a
"Caching Mode" and requires the use of HW nested paging.

As such, SMMUv3 is the first IOMMU device to advertise this
attribute.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/exec/memory.h |  3 ++-
 hw/arm/smmuv3.c       | 12 ++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 94b9157249..3af3cc1adb 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -273,7 +273,8 @@ typedef struct MemoryRegionClass {
 
 
 enum IOMMUMemoryRegionAttr {
-    IOMMU_ATTR_SPAPR_TCE_FD
+    IOMMU_ATTR_SPAPR_TCE_FD,
+    IOMMU_ATTR_VFIO_NESTED,
 };
 
 /*
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index a4436868ba..7166008ab0 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1582,6 +1582,17 @@ static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
     return 0;
 }
 
+static int smmuv3_get_attr(IOMMUMemoryRegion *iommu,
+                           enum IOMMUMemoryRegionAttr attr,
+                           void *data)
+{
+    if (attr == IOMMU_ATTR_VFIO_NESTED) {
+        *(bool *) data = true;
+        return 0;
+    }
+    return -EINVAL;
+}
+
 static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
                                                   void *data)
 {
@@ -1589,6 +1600,7 @@ static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
 
     imrc->translate = smmuv3_translate;
     imrc->notify_flag_changed = smmuv3_notify_flag_changed;
+    imrc->get_attr = smmuv3_get_attr;
 }
 
 static const TypeInfo smmuv3_type_info = {
-- 
2.26.3




* [RFC v9 08/29] memory: Add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (6 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 07/29] memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 09/29] memory: Introduce IOMMU Memory Region inject_faults API Eric Auger
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

We introduce a new IOMMU Memory Region attribute, IOMMU_ATTR_MSI_TRANSLATE,
which tells whether the virtual IOMMU translates MSIs. The ARM SMMU
will expose this attribute since, as opposed to Intel DMAR, MSIs
are translated like any other DMA request.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/exec/memory.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 3af3cc1adb..ac8521b29a 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -275,6 +275,7 @@ typedef struct MemoryRegionClass {
 enum IOMMUMemoryRegionAttr {
     IOMMU_ATTR_SPAPR_TCE_FD,
     IOMMU_ATTR_VFIO_NESTED,
+    IOMMU_ATTR_MSI_TRANSLATE,
 };
 
 /*
-- 
2.26.3




* [RFC v9 09/29] memory: Introduce IOMMU Memory Region inject_faults API
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (7 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 08/29] memory: Add IOMMU_ATTR_MSI_TRANSLATE " Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 10/29] iommu: Introduce generic header Eric Auger
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

This new API allows the caller to inject @count iommu_faults into
the IOMMU memory region.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/exec/memory.h | 24 ++++++++++++++++++++++++
 softmmu/memory.c      | 10 ++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index ac8521b29a..527f77c453 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -65,6 +65,8 @@ struct ReservedRegion {
     unsigned type;
 };
 
+struct iommu_fault;
+
 typedef struct IOMMUTLBEntry IOMMUTLBEntry;
 
 /* See address_space_translate: bit 0 is read, bit 1 is write.  */
@@ -475,6 +477,19 @@ struct IOMMUMemoryRegionClass {
      int (*iommu_set_page_size_mask)(IOMMUMemoryRegion *iommu,
                                      uint64_t page_size_mask,
                                      Error **errp);
+
+    /*
+     * Inject @count faults into the IOMMU memory region
+     *
+     * Optional method: if this method is not provided, then
+     * memory_region_injection_faults() will return -ENOENT
+     *
+     * @iommu: the IOMMU memory region to inject the faults in
+     * @count: number of faults to inject
+     * @buf: fault buffer
+     */
+    int (*inject_faults)(IOMMUMemoryRegion *iommu, int count,
+                         struct iommu_fault *buf);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -1520,6 +1535,15 @@ int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr);
 int memory_region_iommu_set_page_size_mask(IOMMUMemoryRegion *iommu_mr,
                                            uint64_t page_size_mask,
                                            Error **errp);
+/**
+ * memory_region_inject_faults : inject @count faults stored in @buf
+ *
+ * @iommu_mr: the IOMMU memory region
+ * @count: number of faults to be injected
+ * @buf: buffer containing the faults
+ */
+int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
+                                struct iommu_fault *buf);
 
 /**
  * memory_region_name: get a memory region's name
diff --git a/softmmu/memory.c b/softmmu/memory.c
index d4493ef9e4..1dd34356c0 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2030,6 +2030,16 @@ int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr)
     return imrc->num_indexes(iommu_mr);
 }
 
+int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
+                                struct iommu_fault *buf)
+{
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
+    if (!imrc->inject_faults) {
+        return -ENOENT;
+    }
+    return imrc->inject_faults(iommu_mr, count, buf);
+}
+
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
     uint8_t mask = 1 << client;
-- 
2.26.3




* [RFC v9 10/29] iommu: Introduce generic header
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (8 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 09/29] memory: Introduce IOMMU Memory Region inject_faults API Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 11/29] pci: introduce PCIPASIDOps to PCIDevice Eric Auger
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

This header is meant to expose data types used by
several IOMMU devices, such as the structs for SVA and
nested stage configuration.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/iommu/iommu.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 include/hw/iommu/iommu.h

diff --git a/include/hw/iommu/iommu.h b/include/hw/iommu/iommu.h
new file mode 100644
index 0000000000..12092bda7b
--- /dev/null
+++ b/include/hw/iommu/iommu.h
@@ -0,0 +1,28 @@
+/*
+ * common header for iommu devices
+ *
+ * Copyright Red Hat, Inc. 2019
+ *
+ * Authors:
+ *  Eric Auger <eric.auger@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_HW_IOMMU_IOMMU_H
+#define QEMU_HW_IOMMU_IOMMU_H
+#ifdef __linux__
+#include <linux/iommu.h>
+#endif
+
+typedef struct IOMMUConfig {
+    union {
+#ifdef __linux__
+        struct iommu_pasid_table_config pasid_cfg;
+#endif
+          };
+} IOMMUConfig;
+
+
+#endif /* QEMU_HW_IOMMU_IOMMU_H */
-- 
2.26.3




* [RFC v9 11/29] pci: introduce PCIPASIDOps to PCIDevice
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (9 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 10/29] iommu: Introduce generic header Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 12/29] vfio: Force nested if iommu requires it Eric Auger
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

From: Liu Yi L <yi.l.liu@intel.com>

This patch introduces PCIPASIDOps for IOMMU related operations.

https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html
https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00940.html

So far, to set up virt-SVA for an assigned SVA-capable device, one
needs to configure the host translation structures for a specific
PASID (e.g. bind the guest page table to the host and enable nested
translation in the host). Besides, the vIOMMU emulator needs to
forward the guest's cache invalidations to the host since host nested
translation is enabled; e.g. on VT-d the guest owns the 1st level
translation table, thus cache invalidations for the 1st level should
be propagated to the host.

This patch adds two functions, alloc_pasid and free_pasid, to support
guest PASID allocation and free. The callbacks are implemented by
device passthrough modules such as VFIO.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 include/hw/pci/pci.h | 11 +++++++++++
 hw/pci/pci.c         | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6be4e0c460..1f73c04975 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -9,6 +9,7 @@
 
 #include "hw/pci/pcie.h"
 #include "qom/object.h"
+#include "hw/iommu/iommu.h"
 
 extern bool pci_available;
 
@@ -265,6 +266,11 @@ struct PCIReqIDCache {
 };
 typedef struct PCIReqIDCache PCIReqIDCache;
 
+struct PCIPASIDOps {
+    int (*set_pasid_table)(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
+};
+typedef struct PCIPASIDOps PCIPASIDOps;
+
 struct PCIDevice {
     DeviceState qdev;
     bool partially_hotplugged;
@@ -360,6 +366,7 @@ struct PCIDevice {
     /* ID of standby device in net_failover pair */
     char *failover_pair_id;
     uint32_t acpi_index;
+    PCIPASIDOps *pasid_ops;
 };
 
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
@@ -491,6 +498,10 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
+void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops);
+bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn);
+int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
+
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
 {
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 8f35e13a0c..114855a0ac 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2731,6 +2731,40 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
     bus->iommu_opaque = opaque;
 }
 
+void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops)
+{
+    assert(ops && !dev->pasid_ops);
+    dev->pasid_ops = ops;
+}
+
+bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn)
+{
+    PCIDevice *dev;
+
+    if (!bus) {
+        return false;
+    }
+
+    dev = bus->devices[devfn];
+    return !!(dev && dev->pasid_ops);
+}
+
+int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn,
+                               IOMMUConfig *config)
+{
+    PCIDevice *dev;
+
+    if (!bus) {
+        return -EINVAL;
+    }
+
+    dev = bus->devices[devfn];
+    if (dev && dev->pasid_ops && dev->pasid_ops->set_pasid_table) {
+        return dev->pasid_ops->set_pasid_table(bus, devfn, config);
+    }
+    return -ENOENT;
+}
+
 static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
 {
     Range *range = opaque;
-- 
2.26.3




* [RFC v9 12/29] vfio: Force nested if iommu requires it
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (10 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 11/29] pci: introduce PCIPASIDOps to PCIDevice Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 13/29] vfio: Introduce hostwin_from_range helper Eric Auger
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

In case we detect the address space is translated by
a virtual IOMMU which requires HW nested paging to
integrate with VFIO, let's set up the container with
the VFIO_TYPE1_NESTING_IOMMU iommu_type.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v7 -> v8
- remove as != &address_space_memory as
  memory_region_is_iommu(as->root) is sufficient [Kunkun]

v4 -> v5:
- fail immediately if nested is wanted but not supported

v2 -> v3:
- add "nested only is selected if requested by @force_nested"
  comment in this patch
---
 hw/vfio/common.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a456455517..30dc45df90 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1579,27 +1579,38 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
  * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
  */
 static int vfio_get_iommu_type(VFIOContainer *container,
+                               bool want_nested,
                                Error **errp)
 {
-    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
+                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
                           VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
-    int i;
+    int i, ret = -EINVAL;
 
     for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
         if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
-            return iommu_types[i];
+            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU && !want_nested) {
+                continue;
+            }
+            ret = iommu_types[i];
+            break;
         }
     }
-    error_setg(errp, "No available IOMMU models");
-    return -EINVAL;
+    if (ret < 0) {
+        error_setg(errp, "No available IOMMU models");
+    } else if (want_nested && ret != VFIO_TYPE1_NESTING_IOMMU) {
+        error_setg(errp, "Nested mode requested but not supported");
+        ret = -EINVAL;
+    }
+    return ret;
 }
 
 static int vfio_init_container(VFIOContainer *container, int group_fd,
-                               Error **errp)
+                               bool want_nested, Error **errp)
 {
     int iommu_type, ret;
 
-    iommu_type = vfio_get_iommu_type(container, errp);
+    iommu_type = vfio_get_iommu_type(container, want_nested, errp);
     if (iommu_type < 0) {
         return iommu_type;
     }
@@ -1704,6 +1715,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     VFIOContainer *container;
     int ret, fd;
     VFIOAddressSpace *space;
+    IOMMUMemoryRegion *iommu_mr;
+    bool nested = false;
+
+    if (memory_region_is_iommu(as->root)) {
+        iommu_mr = IOMMU_MEMORY_REGION(as->root);
+        memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_VFIO_NESTED,
+                                     (void *)&nested);
+    }
 
     space = vfio_get_address_space(as);
 
@@ -1773,13 +1792,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
 
-    ret = vfio_init_container(container, group->fd, errp);
+    ret = vfio_init_container(container, group->fd, nested, errp);
     if (ret) {
         goto free_container_exit;
     }
     trace_vfio_connect_new_container(group->groupid, container->fd);
 
     switch (container->iommu_type) {
+    case VFIO_TYPE1_NESTING_IOMMU:
     case VFIO_TYPE1v2_IOMMU:
     case VFIO_TYPE1_IOMMU:
     {
-- 
2.26.3




* [RFC v9 13/29] vfio: Introduce hostwin_from_range helper
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (11 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 12/29] vfio: Force nested if iommu requires it Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-11 12:08 ` [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section Eric Auger
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Let's introduce a hostwin_from_range() helper that returns the
hostwin encapsulating an IOVA range, or NULL if none is found.

This improves the readability of callers and removes the usage
of hostwin_found.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 30dc45df90..a8f835328e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -649,6 +649,19 @@ out:
     rcu_read_unlock();
 }
 
+static VFIOHostDMAWindow *
+hostwin_from_range(VFIOContainer *container, hwaddr iova, hwaddr end)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
+            return hostwin;
+        }
+    }
+    return NULL;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -658,7 +671,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
     void *vaddr;
     int ret;
     VFIOHostDMAWindow *hostwin;
-    bool hostwin_found;
     Error *err = NULL;
 
     if (vfio_listener_skipped_section(section)) {
@@ -744,15 +756,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
 #endif
     }
 
-    hostwin_found = false;
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-        if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
-            hostwin_found = true;
-            break;
-        }
-    }
-
-    if (!hostwin_found) {
+    hostwin = hostwin_from_range(container, iova, end);
+    if (!hostwin) {
         error_setg(&err, "Container %p can't map guest IOVA region"
                    " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
         goto fail;
@@ -934,16 +939,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
     if (memory_region_is_ram_device(section->mr)) {
         hwaddr pgmask;
-        VFIOHostDMAWindow *hostwin;
-        bool hostwin_found = false;
+        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
 
-        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-            if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
-                hostwin_found = true;
-                break;
-            }
-        }
-        assert(hostwin_found); /* or region_add() would have failed */
+        assert(hostwin); /* or region_add() would have failed */
 
         pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
         try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
-- 
2.26.3




* [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (12 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 13/29] vfio: Introduce hostwin_from_range helper Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-27 14:05   ` Kunkun Jiang
  2021-09-03  8:22   ` Kunkun Jiang
  2021-04-11 12:08 ` [RFC v9 15/29] vfio: Set up nested stage mappings Eric Auger
                   ` (14 subsequent siblings)
  28 siblings, 2 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Let's introduce two helpers that DMA map/unmap a RAM
section. Those helpers will be called for nested stage setup from
another call site. This also makes the structure of
vfio_listener_region_add/del() clearer.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v8 -> v9
- rebase on top of
  1eb7f642750c ("vfio: Support host translation granule size")

v5 -> v6:
- add Error **
---
 hw/vfio/common.c     | 199 +++++++++++++++++++++++++------------------
 hw/vfio/trace-events |   4 +-
 2 files changed, 119 insertions(+), 84 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a8f835328e..0cd7ef2139 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -662,13 +662,126 @@ hostwin_from_range(VFIOContainer *container, hwaddr iova, hwaddr end)
     return NULL;
 }
 
+static int vfio_dma_map_ram_section(VFIOContainer *container,
+                                    MemoryRegionSection *section, Error **err)
+{
+    VFIOHostDMAWindow *hostwin;
+    Int128 llend, llsize;
+    hwaddr iova, end;
+    void *vaddr;
+    int ret;
+
+    assert(memory_region_is_ram(section->mr));
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    hostwin = hostwin_from_range(container, iova, end);
+    if (!hostwin) {
+        error_setg(err, "Container %p can't map guest IOVA region"
+                   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
+        return -EFAULT;
+    }
+
+    trace_vfio_dma_map_ram(iova, end, vaddr);
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+
+        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                pgmask + 1);
+            return 0;
+        }
+    }
+
+    ret = vfio_dma_map(container, iova, int128_get64(llsize),
+                       vaddr, section->readonly);
+    if (ret) {
+        error_setg(err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                   container, iova, int128_get64(llsize), vaddr, ret);
+        if (memory_region_is_ram_device(section->mr)) {
+            /* Allow unexpected mappings not to be fatal for RAM devices */
+            error_report_err(*err);
+            return 0;
+        }
+        return ret;
+    }
+    return 0;
+}
+
+static void vfio_dma_unmap_ram_section(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    Int128 llend, llsize;
+    hwaddr iova, end;
+    bool try_unmap = true;
+    int ret;
+
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return;
+    }
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    trace_vfio_dma_unmap_ram(iova, end);
+
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask;
+        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
+
+        assert(hostwin); /* or region_add() would have failed */
+
+        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
+    }
+
+    if (try_unmap) {
+        if (int128_eq(llsize, int128_2_64())) {
+            /* The unmap ioctl doesn't accept a full 64-bit span. */
+            llsize = int128_rshift(llsize, 1);
+            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+            if (ret) {
+                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                             "0x%"HWADDR_PRIx") = %d (%m)",
+                             container, iova, int128_get64(llsize), ret);
+            }
+            iova += int128_get64(llsize);
+        }
+        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
+        if (ret) {
+            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iova, int128_get64(llsize), ret);
+        }
+    }
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     hwaddr iova, end;
-    Int128 llend, llsize;
-    void *vaddr;
+    Int128 llend;
     int ret;
     VFIOHostDMAWindow *hostwin;
     Error *err = NULL;
@@ -814,39 +927,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
 
     /* Here we assume that memory_region_is_ram(section->mr)==true */
-
-    vaddr = memory_region_get_ram_ptr(section->mr) +
-            section->offset_within_region +
-            (iova - section->offset_within_address_space);
-
-    trace_vfio_listener_region_add_ram(iova, end, vaddr);
-
-    llsize = int128_sub(llend, int128_make64(iova));
-
-    if (memory_region_is_ram_device(section->mr)) {
-        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
-
-        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
-            trace_vfio_listener_region_add_no_dma_map(
-                memory_region_name(section->mr),
-                section->offset_within_address_space,
-                int128_getlo(section->size),
-                pgmask + 1);
-            return;
-        }
-    }
-
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
-                       vaddr, section->readonly);
-    if (ret) {
-        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                   container, iova, int128_get64(llsize), vaddr, ret);
-        if (memory_region_is_ram_device(section->mr)) {
-            /* Allow unexpected mappings not to be fatal for RAM devices */
-            error_report_err(err);
-            return;
-        }
+    if (vfio_dma_map_ram_section(container, section, &err)) {
         goto fail;
     }
 
@@ -880,10 +961,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-    hwaddr iova, end;
-    Int128 llend, llsize;
-    int ret;
-    bool try_unmap = true;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -923,49 +1000,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
          */
     }
 
-    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
-    llend = int128_make64(section->offset_within_address_space);
-    llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
-
-    if (int128_ge(int128_make64(iova), llend)) {
-        return;
-    }
-    end = int128_get64(int128_sub(llend, int128_one()));
-
-    llsize = int128_sub(llend, int128_make64(iova));
-
-    trace_vfio_listener_region_del(iova, end);
-
-    if (memory_region_is_ram_device(section->mr)) {
-        hwaddr pgmask;
-        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
-
-        assert(hostwin); /* or region_add() would have failed */
-
-        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
-        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
-    }
-
-    if (try_unmap) {
-        if (int128_eq(llsize, int128_2_64())) {
-            /* The unmap ioctl doesn't accept a full 64-bit span. */
-            llsize = int128_rshift(llsize, 1);
-            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
-            if (ret) {
-                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                             "0x%"HWADDR_PRIx") = %d (%m)",
-                             container, iova, int128_get64(llsize), ret);
-            }
-            iova += int128_get64(llsize);
-        }
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
-        if (ret) {
-            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova, int128_get64(llsize), ret);
-        }
-    }
+    vfio_dma_unmap_ram_section(container, section);
 
     memory_region_unref(section->mr);
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 2a41326c0f..936d29d150 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -99,10 +99,10 @@ vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "i
 vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add 0x%"PRIx64" - 0x%"PRIx64
 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
 vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] 0x%"PRIx64" - 0x%"PRIx64
-vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
+vfio_dma_map_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
 vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
-vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
+vfio_dma_unmap_ram(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_connect_existing_container(int groupid, int container_fd) "group=%d existing container fd=%d"
 vfio_connect_new_container(int groupid, int container_fd) "group=%d new container fd=%d"
-- 
2.26.3




* [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (13 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-04-13 12:10   ` Kunkun Jiang
  2021-04-11 12:08 ` [RFC v9 16/29] vfio: Pass stage 1 MSI bindings to the host Eric Auger
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

In nested mode, legacy vfio_iommu_map_notify cannot be used as
there is no "caching" mode and we do not trap on map.

On Intel, vfio_iommu_map_notify was used to DMA map the RAM
through the host single stage.

With nested mode, we need to setup the stage 2 and the stage 1
separately. This patch introduces a prereg_listener to setup
the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host
when the guest invalidates the stage 1 configuration, through
a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
are cascaded down to the host through another IOMMU MR UNMAP
notifier.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v7 -> v8:
- properly handle new IOMMUTLBEntry fields and especially
  propagate DOMAIN and PASID based invalidations

v6 -> v7:
- remove PASID based invalidation

v5 -> v6:
- add error_report_err()
- remove the abort in the nested stage case

v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification

v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation

v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf

v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
 hw/vfio/common.c     | 139 +++++++++++++++++++++++++++++++++++++++++--
 hw/vfio/pci.c        |  21 +++++++
 hw/vfio/trace-events |   2 +
 3 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0cd7ef2139..e369d451e7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
     return true;
 }
 
+/* Propagate a guest IOTLB invalidation to the host (nested mode) */
+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    struct vfio_iommu_type1_cache_invalidate ustruct = {};
+    VFIOContainer *container = giommu->container;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+
+    switch (iotlb->granularity) {
+    case IOMMU_INV_GRAN_DOMAIN:
+        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
+        break;
+    case IOMMU_INV_GRAN_PASID:
+    {
+        struct iommu_inv_pasid_info *pasid_info;
+        int archid = -1;
+
+        pasid_info = &ustruct.info.granu.pasid_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        pasid_info->archid = archid;
+        trace_vfio_iommu_asid_inv_iotlb(archid);
+        break;
+    }
+    case IOMMU_INV_GRAN_ADDR:
+    {
+        hwaddr start = iotlb->iova + giommu->iommu_offset;
+        struct iommu_inv_addr_info *addr_info;
+        size_t size = iotlb->addr_mask + 1;
+        int archid = -1;
+
+        addr_info = &ustruct.info.granu.addr_info;
+        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+        if (iotlb->leaf) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+        }
+        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
+            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
+            archid = iotlb->arch_id;
+        }
+        addr_info->archid = archid;
+        addr_info->addr = start;
+        addr_info->granule_size = size;
+        addr_info->nb_granules = 1;
+        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
+                                        1, iotlb->leaf);
+        break;
+    }
+    }
+
+    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
+    if (ret) {
+        error_report("%p: failed to invalidate CACHE (%d)", container, ret);
+    }
+}
+
 static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
@@ -776,6 +843,35 @@ static void vfio_dma_unmap_ram_section(VFIOContainer *container,
     }
 }
 
+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container =
+        container_of(listener, VFIOContainer, prereg_listener);
+    Error *err = NULL;
+
+    if (!memory_region_is_ram(section->mr)) {
+        return;
+    }
+
+    vfio_dma_map_ram_section(container, section, &err);
+    if (err) {
+        error_report_err(err);
+    }
+}
+static void vfio_prereg_listener_region_del(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container =
+        container_of(listener, VFIOContainer, prereg_listener);
+
+    if (!memory_region_is_ram(section->mr)) {
+        return;
+    }
+
+    vfio_dma_unmap_ram_section(container, section);
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -879,9 +975,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
     memory_region_ref(section->mr);
 
     if (memory_region_is_iommu(section->mr)) {
+        IOMMUNotify notify;
         VFIOGuestIOMMU *giommu;
         IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
-        int iommu_idx;
+        int iommu_idx, flags;
 
         trace_vfio_listener_region_add_iommu(iova, end);
         /*
@@ -900,8 +997,18 @@ static void vfio_listener_region_add(MemoryListener *listener,
         llend = int128_sub(llend, int128_one());
         iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
                                                        MEMTXATTRS_UNSPECIFIED);
-        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
-                            IOMMU_NOTIFIER_IOTLB_EVENTS,
+
+        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+            /* IOTLB unmap notifier to propagate guest IOTLB invalidations */
+            flags = IOMMU_NOTIFIER_UNMAP;
+            notify = vfio_iommu_unmap_notify;
+        } else {
+            /* MAP/UNMAP IOTLB notifier */
+            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
+            notify = vfio_iommu_map_notify;
+        }
+
+        iommu_notifier_init(&giommu->n, notify, flags,
                             section->offset_within_region,
                             int128_get64(llend),
                             iommu_idx);
@@ -921,7 +1028,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
             goto fail;
         }
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
-        memory_region_iommu_replay(giommu->iommu, &giommu->n);
+        if (flags & IOMMU_NOTIFIER_MAP) {
+            memory_region_iommu_replay(giommu->iommu, &giommu->n);
+        }
 
         return;
     }
@@ -1205,10 +1314,16 @@ static const MemoryListener vfio_memory_listener = {
     .log_sync = vfio_listener_log_sync,
 };
 
+static MemoryListener vfio_memory_prereg_listener = {
+    .region_add = vfio_prereg_listener_region_add,
+    .region_del = vfio_prereg_listener_region_del,
+};
+
 static void vfio_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->listener);
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
+        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
         memory_listener_unregister(&container->prereg_listener);
     }
 }
@@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             vfio_get_iommu_info_migration(container, info);
         }
         g_free(info);
+
+        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+            container->prereg_listener = vfio_memory_prereg_listener;
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                memory_listener_unregister(&container->prereg_listener);
+                ret = -1;
+                error_propagate_prepend(errp, container->error,
+                                    "RAM memory listener initialization failed "
+                                    "for container");
+                goto free_container_exit;
+            }
+        }
         break;
     }
     case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5c65aa0a98..cad7deec71 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2773,6 +2773,25 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     vdev->req_enabled = false;
 }
 
+static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
+                                      IOMMUConfig *config)
+{
+    PCIDevice *pdev = bus->devices[devfn];
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    VFIOContainer *container = vdev->vbasedev.group->container;
+    struct vfio_iommu_type1_set_pasid_table info;
+
+    info.argsz = sizeof(info);
+    info.flags = VFIO_PASID_TABLE_FLAG_SET;
+    memcpy(&info.config, &config->pasid_cfg, sizeof(config->pasid_cfg));
+
+    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
+}
+
+static PCIPASIDOps vfio_pci_pasid_ops = {
+    .set_pasid_table = vfio_iommu_set_pasid_table,
+};
+
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
@@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
 
+    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
+
     return;
 
 out_deregister:
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 936d29d150..43696afc15 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
+vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.26.3




* [RFC v9 16/29] vfio: Pass stage 1 MSI bindings to the host
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (14 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 15/29] vfio: Set up nested stage mappings Eric Auger
@ 2021-04-11 12:08 ` Eric Auger
  2021-10-15 10:54   ` Shameerali Kolothum Thodi
  2021-04-11 12:09 ` [RFC v9 17/29] vfio: Helper to get IRQ info including capabilities Eric Auger
                   ` (12 subsequent siblings)
  28 siblings, 1 reply; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:08 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

We register the stage 1 MSI bindings when enabling the vectors
and unregister them on MSI disable.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v7 -> v8:
- add unregistration on msix_disable
- remove vfio_container_unbind_msis()

v4 -> v5:
- use VFIO_IOMMU_SET_MSI_BINDING

v2 -> v3:
- only register the notifier if the IOMMU translates MSIs
- record the msi bindings in a container list and unregister on
  container release
---
 include/hw/vfio/vfio-common.h | 12 ++++++
 hw/vfio/common.c              | 59 +++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 76 ++++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events          |  2 +
 4 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 6141162d7a..f30133b2a3 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -74,6 +74,14 @@ typedef struct VFIOAddressSpace {
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
 
+typedef struct VFIOMSIBinding {
+    int index;
+    hwaddr iova;
+    hwaddr gpa;
+    hwaddr size;
+    QLIST_ENTRY(VFIOMSIBinding) next;
+} VFIOMSIBinding;
+
 struct VFIOGroup;
 
 typedef struct VFIOContainer {
@@ -91,6 +99,7 @@ typedef struct VFIOContainer {
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
+    QLIST_HEAD(, VFIOMSIBinding) msibinding_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
 
@@ -200,6 +209,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_iommu_set_msi_binding(VFIOContainer *container, int n,
+                               IOMMUTLBEntry *entry);
+int vfio_iommu_unset_msi_binding(VFIOContainer *container, int n);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index e369d451e7..970a5a7be7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -662,6 +662,65 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 }
 
+int vfio_iommu_set_msi_binding(VFIOContainer *container, int n,
+                               IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_set_msi_binding ustruct;
+    VFIOMSIBinding *binding;
+    int ret;
+
+    QLIST_FOREACH(binding, &container->msibinding_list, next) {
+        if (binding->index == n) {
+            return 0;
+        }
+    }
+
+    ustruct.argsz = sizeof(struct vfio_iommu_type1_set_msi_binding);
+    ustruct.iova = iotlb->iova;
+    ustruct.flags = VFIO_IOMMU_BIND_MSI;
+    ustruct.gpa = iotlb->translated_addr;
+    ustruct.size = iotlb->addr_mask + 1;
+    ret = ioctl(container->fd, VFIO_IOMMU_SET_MSI_BINDING, &ustruct);
+    if (ret) {
+        error_report("%s: failed to register the stage1 MSI binding (%m)",
+                     __func__);
+        return ret;
+    }
+    binding = g_new0(VFIOMSIBinding, 1);
+    binding->iova = ustruct.iova;
+    binding->gpa = ustruct.gpa;
+    binding->size = ustruct.size;
+    binding->index = n;
+
+    QLIST_INSERT_HEAD(&container->msibinding_list, binding, next);
+    return 0;
+}
+
+int vfio_iommu_unset_msi_binding(VFIOContainer *container, int n)
+{
+    struct vfio_iommu_type1_set_msi_binding ustruct;
+    VFIOMSIBinding *binding, *tmp;
+    int ret;
+
+    ustruct.argsz = sizeof(struct vfio_iommu_type1_set_msi_binding);
+    QLIST_FOREACH_SAFE(binding, &container->msibinding_list, next, tmp) {
+        if (binding->index != n) {
+            continue;
+        }
+        ustruct.flags = VFIO_IOMMU_UNBIND_MSI;
+        ustruct.iova = binding->iova;
+        ret = ioctl(container->fd, VFIO_IOMMU_SET_MSI_BINDING, &ustruct);
+        if (ret) {
+            error_report("Failed to unregister the stage1 MSI binding "
+                         "for iova=0x%"PRIx64" (%m)", binding->iova);
+        }
+        QLIST_REMOVE(binding, next);
+        g_free(binding);
+        return ret;
+    }
+    return 0;
+}
+
 static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index cad7deec71..a49029dfa4 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -366,6 +366,65 @@ static void vfio_msi_interrupt(void *opaque)
     notify(&vdev->pdev, nr);
 }
 
+static bool vfio_iommu_require_msi_binding(IOMMUMemoryRegion *iommu_mr)
+{
+    bool msi_translate = false, nested = false;
+
+    memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_MSI_TRANSLATE,
+                                 (void *)&msi_translate);
+    memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_VFIO_NESTED,
+                                 (void *)&nested);
+    if (!nested || !msi_translate) {
+        return false;
+    }
+    return true;
+}
+
+static int vfio_register_msi_binding(VFIOPCIDevice *vdev,
+                                     int vector_n, bool set)
+{
+    VFIOContainer *container = vdev->vbasedev.group->container;
+    PCIDevice *dev = &vdev->pdev;
+    AddressSpace *as = pci_device_iommu_address_space(dev);
+    IOMMUMemoryRegionClass *imrc;
+    IOMMUMemoryRegion *iommu_mr;
+    IOMMUTLBEntry entry;
+    MSIMessage msg;
+
+    if (as == &address_space_memory) {
+        return 0;
+    }
+
+    iommu_mr = IOMMU_MEMORY_REGION(as->root);
+    if (!vfio_iommu_require_msi_binding(iommu_mr)) {
+        return 0;
+    }
+
+    /* MSI doorbell address is translated by an IOMMU */
+
+    if (!set) { /* unregister */
+        trace_vfio_unregister_msi_binding(vdev->vbasedev.name, vector_n);
+
+        return vfio_iommu_unset_msi_binding(container, vector_n);
+    }
+
+    msg = pci_get_msi_message(dev, vector_n);
+    imrc = memory_region_get_iommu_class_nocheck(iommu_mr);
+
+    rcu_read_lock();
+    entry = imrc->translate(iommu_mr, msg.address, IOMMU_WO, 0);
+    rcu_read_unlock();
+
+    if (entry.perm == IOMMU_NONE) {
+        return -ENOENT;
+    }
+
+    trace_vfio_register_msi_binding(vdev->vbasedev.name, vector_n,
+                                    msg.address, entry.translated_addr);
+
+    return vfio_iommu_set_msi_binding(container, vector_n, &entry);
+}
+
 static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 {
     struct vfio_irq_set *irq_set;
@@ -383,7 +442,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
     fds = (int32_t *)&irq_set->data;
 
     for (i = 0; i < vdev->nr_vectors; i++) {
-        int fd = -1;
+        int ret, fd = -1;
 
         /*
          * MSI vs MSI-X - The guest has direct access to MSI mask and pending
@@ -392,6 +451,12 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
          * KVM signaling path only when configured and unmasked.
          */
         if (vdev->msi_vectors[i].use) {
+            ret = vfio_register_msi_binding(vdev, i, true);
+            if (ret) {
+                error_report("%s: failed to register S1 MSI binding "
+                             "for vector %d(%d)", vdev->vbasedev.name, i, ret);
+                goto out;
+            }
             if (vdev->msi_vectors[i].virq < 0 ||
                 (msix && msix_is_masked(&vdev->pdev, i))) {
                 fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
@@ -405,6 +470,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 
     ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
 
+out:
     g_free(irq_set);
 
     return ret;
@@ -719,7 +785,8 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
 
 static void vfio_msix_disable(VFIOPCIDevice *vdev)
 {
-    int i;
+    int ret, i;
+
 
     msix_unset_vector_notifiers(&vdev->pdev);
 
@@ -731,6 +798,11 @@ static void vfio_msix_disable(VFIOPCIDevice *vdev)
         if (vdev->msi_vectors[i].use) {
             vfio_msix_vector_release(&vdev->pdev, i);
             msix_vector_unuse(&vdev->pdev, i);
+            ret = vfio_register_msi_binding(vdev, i, false);
+            if (ret) {
+                error_report("%s: failed to unregister S1 MSI binding "
+                             "for vector %d(%d)", vdev->vbasedev.name, i, ret);
+            }
         }
     }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 43696afc15..5c1b28d0d4 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -122,6 +122,8 @@ vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype
 vfio_dma_unmap_overflow_workaround(void) ""
 vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
 vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
+vfio_register_msi_binding(const char *name, int vector, uint64_t giova, uint64_t gdb) "%s: register vector %d gIOVA=0x%"PRIx64" -> gDB=0x%"PRIx64" stage 1 mapping"
+vfio_unregister_msi_binding(const char *name, int vector) "%s: unregister vector %d stage 1 mapping"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.26.3



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC v9 17/29] vfio: Helper to get IRQ info including capabilities
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (15 preceding siblings ...)
  2021-04-11 12:08 ` [RFC v9 16/29] vfio: Pass stage 1 MSI bindings to the host Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 18/29] vfio/pci: Register handler for iommu fault Eric Auger
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

As done for vfio regions, add helpers to retrieve IRQ info,
including their optional capabilities.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/vfio/vfio-common.h |  7 +++
 hw/vfio/common.c              | 97 +++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 3 files changed, 105 insertions(+)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f30133b2a3..fcbda2d071 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -232,6 +232,13 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
                              unsigned int *avail);
 struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
+int vfio_get_irq_info(VFIODevice *vbasedev, int index,
+                      struct vfio_irq_info **info);
+int vfio_get_dev_irq_info(VFIODevice *vbasedev, uint32_t type,
+                          uint32_t subtype, struct vfio_irq_info **info);
+bool vfio_has_irq_cap(VFIODevice *vbasedev, int irq, uint16_t cap_type);
+struct vfio_info_cap_header *
+vfio_get_irq_info_cap(struct vfio_irq_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 970a5a7be7..dc8372c772 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1452,6 +1452,25 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
     return true;
 }
 
+struct vfio_info_cap_header *
+vfio_get_irq_info_cap(struct vfio_irq_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IRQ_INFO_FLAG_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
 static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
                                           struct vfio_region_info *info)
 {
@@ -2385,6 +2404,33 @@ retry:
     return 0;
 }
 
+int vfio_get_irq_info(VFIODevice *vbasedev, int index,
+                      struct vfio_irq_info **info)
+{
+    size_t argsz = sizeof(struct vfio_irq_info);
+
+    *info = g_malloc0(argsz);
+
+    (*info)->index = index;
+retry:
+    (*info)->argsz = argsz;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if ((*info)->argsz > argsz) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+
+        goto retry;
+    }
+
+    return 0;
+}
+
 int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
                              uint32_t subtype, struct vfio_region_info **info)
 {
@@ -2420,6 +2466,42 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
     return -ENODEV;
 }
 
+int vfio_get_dev_irq_info(VFIODevice *vbasedev, uint32_t type,
+                          uint32_t subtype, struct vfio_irq_info **info)
+{
+    int i;
+
+    for (i = 0; i < vbasedev->num_irqs; i++) {
+        struct vfio_info_cap_header *hdr;
+        struct vfio_irq_info_cap_type *cap_type;
+
+        if (vfio_get_irq_info(vbasedev, i, info)) {
+            continue;
+        }
+
+        hdr = vfio_get_irq_info_cap(*info, VFIO_IRQ_INFO_CAP_TYPE);
+        if (!hdr) {
+            g_free(*info);
+            continue;
+        }
+
+        cap_type = container_of(hdr, struct vfio_irq_info_cap_type, header);
+
+        trace_vfio_get_dev_irq(vbasedev->name, i,
+                               cap_type->type, cap_type->subtype);
+
+        if (cap_type->type == type && cap_type->subtype == subtype) {
+            return 0;
+        }
+
+        g_free(*info);
+    }
+
+    *info = NULL;
+    return -ENODEV;
+}
+
+
 bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
 {
     struct vfio_region_info *info = NULL;
@@ -2435,6 +2517,21 @@ bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
     return ret;
 }
 
+bool vfio_has_irq_cap(VFIODevice *vbasedev, int irq, uint16_t cap_type)
+{
+    struct vfio_irq_info *info = NULL;
+    bool ret = false;
+
+    if (!vfio_get_irq_info(vbasedev, irq, &info)) {
+        if (vfio_get_irq_info_cap(info, cap_type)) {
+            ret = true;
+        }
+        g_free(info);
+    }
+
+    return ret;
+}
+
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 5c1b28d0d4..1d87c40c1b 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -119,6 +119,7 @@ vfio_region_unmap(const char *name, unsigned long offset, unsigned long end) "Re
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
+vfio_get_dev_irq(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%08x"
 vfio_dma_unmap_overflow_workaround(void) ""
 vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
 vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
-- 
2.26.3




* [RFC v9 18/29] vfio/pci: Register handler for iommu fault
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (16 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 17/29] vfio: Helper to get IRQ info including capabilities Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 19/29] vfio/pci: Set up the DMA FAULT region Eric Auger
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

We use the new extended IRQ VFIO_IRQ_TYPE_NESTED type and
VFIO_IRQ_SUBTYPE_DMA_FAULT subtype to set/unset
a notifier for physical DMA faults. The associated eventfd is
triggered, in nested mode, whenever a fault is detected at IOMMU
physical level.

The actual handler will be implemented in subsequent patches.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v4 -> v5:
- index_to_str now returns the index name, ie. DMA_FAULT
- use the extended IRQ

v3 -> v4:
- check VFIO_PCI_DMA_FAULT_IRQ_INDEX is supported at kernel level
  before attempting to set signaling for it.
---
 hw/vfio/pci.h |  7 +++++
 hw/vfio/pci.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 64777516d1..a8b06737fb 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -114,6 +114,12 @@ typedef struct VFIOMSIXInfo {
     unsigned long *pending;
 } VFIOMSIXInfo;
 
+typedef struct VFIOPCIExtIRQ {
+    struct VFIOPCIDevice *vdev;
+    EventNotifier notifier;
+    uint32_t index;
+} VFIOPCIExtIRQ;
+
 #define TYPE_VFIO_PCI "vfio-pci"
 OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI)
 
@@ -138,6 +144,7 @@ struct VFIOPCIDevice {
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
+    VFIOPCIExtIRQ *ext_irqs;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a49029dfa4..71b411b61c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2864,6 +2864,76 @@ static PCIPASIDOps vfio_pci_pasid_ops = {
     .set_pasid_table = vfio_iommu_set_pasid_table,
 };
 
+static void vfio_dma_fault_notifier_handler(void *opaque)
+{
+    VFIOPCIExtIRQ *ext_irq = opaque;
+
+    if (!event_notifier_test_and_clear(&ext_irq->notifier)) {
+        return;
+    }
+}
+
+static int vfio_register_ext_irq_handler(VFIOPCIDevice *vdev,
+                                         uint32_t type, uint32_t subtype,
+                                         IOHandler *handler)
+{
+    int32_t fd, ext_irq_index, index;
+    struct vfio_irq_info *irq_info;
+    Error *err = NULL;
+    EventNotifier *n;
+    int ret;
+
+    ret = vfio_get_dev_irq_info(&vdev->vbasedev, type, subtype, &irq_info);
+    if (ret) {
+        return ret;
+    }
+    index = irq_info->index;
+    ext_irq_index = irq_info->index - VFIO_PCI_NUM_IRQS;
+    g_free(irq_info);
+
+    vdev->ext_irqs[ext_irq_index].vdev = vdev;
+    vdev->ext_irqs[ext_irq_index].index = index;
+    n = &vdev->ext_irqs[ext_irq_index].notifier;
+
+    ret = event_notifier_init(n, 0);
+    if (ret) {
+        error_report("vfio: Unable to init event notifier for ext irq %d(%d)",
+                     ext_irq_index, ret);
+        return ret;
+    }
+
+    fd = event_notifier_get_fd(n);
+    qemu_set_fd_handler(fd, vfio_dma_fault_notifier_handler, NULL,
+                        &vdev->ext_irqs[ext_irq_index]);
+
+    ret = vfio_set_irq_signaling(&vdev->vbasedev, index, 0,
+                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err);
+    if (ret) {
+        error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+        qemu_set_fd_handler(fd, NULL, NULL, vdev);
+        event_notifier_cleanup(n);
+    }
+    return ret;
+}
+
+static void vfio_unregister_ext_irq_notifiers(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    Error *err = NULL;
+    int i;
+
+    for (i = 0; i < vbasedev->num_irqs - VFIO_PCI_NUM_IRQS; i++) {
+        if (vfio_set_irq_signaling(vbasedev, i + VFIO_PCI_NUM_IRQS, 0,
+                                   VFIO_IRQ_SET_ACTION_TRIGGER, -1, &err)) {
+            error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+        }
+        qemu_set_fd_handler(event_notifier_get_fd(&vdev->ext_irqs[i].notifier),
+                            NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->ext_irqs[i].notifier);
+    }
+    g_free(vdev->ext_irqs);
+}
+
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
@@ -2874,7 +2944,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     ssize_t len;
     struct stat st;
     int groupid;
-    int i, ret;
+    int i, ret, nb_ext_irqs;
     bool is_mdev;
 
     if (!vdev->vbasedev.sysfsdev) {
@@ -2962,6 +3032,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    nb_ext_irqs = vdev->vbasedev.num_irqs - VFIO_PCI_NUM_IRQS;
+    if (nb_ext_irqs > 0) {
+        vdev->ext_irqs = g_new0(VFIOPCIExtIRQ, nb_ext_irqs);
+    }
+
     vfio_populate_device(vdev, &err);
     if (err) {
         error_propagate(errp, err);
@@ -3173,6 +3248,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
+    vfio_register_ext_irq_handler(vdev, VFIO_IRQ_TYPE_NESTED,
+                                  VFIO_IRQ_SUBTYPE_DMA_FAULT,
+                                  vfio_dma_fault_notifier_handler);
     vfio_setup_resetfn_quirk(vdev);
 
     pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
@@ -3215,6 +3293,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
+    vfio_unregister_ext_irq_notifiers(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     if (vdev->irqchip_change_notifier.notify) {
         kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
-- 
2.26.3



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC v9 19/29] vfio/pci: Set up the DMA FAULT region
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (17 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 18/29] vfio/pci: Register handler for iommu fault Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 20/29] vfio/pci: Implement the DMA fault handler Eric Auger
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Set up the fault region, which is composed of the actual fault
queue (mmappable) and a header used to handle it. The fault
queue itself is mmapped by QEMU.
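As a rough sketch, the region consumed here is a header followed by a ring of fault records, with the queue size derived from the header fields. The struct below is a hypothetical mirror of the proposed kernel-side header (field names follow this RFC's kernel counterpart and are not a stable UAPI):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the proposed vfio_region_dma_fault header
 * (illustrative only; the real layout lives in the kernel UAPI). */
struct dma_fault_header {
    uint32_t tail;        /* consumer index, written back by userspace */
    uint32_t entry_size;  /* size of one fault record in bytes */
    uint32_t nb_entries;  /* number of records in the ring */
    uint32_t offset;      /* offset of the ring from the region start */
    uint32_t head;        /* producer index, updated by the kernel */
};

/* Total size of the mmappable fault queue, as computed by the slow
 * (non-mmapped) read path in the fault handler. */
static size_t fault_queue_size(const struct dma_fault_header *h)
{
    return (size_t)h->nb_entries * h->entry_size;
}
```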

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v4 -> v5:
- use a single DMA FAULT region. No version selection anymore
---
 hw/vfio/pci.h |  1 +
 hw/vfio/pci.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index a8b06737fb..eef91065f1 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -145,6 +145,7 @@ struct VFIOPCIDevice {
     EventNotifier err_notifier;
     EventNotifier req_notifier;
     VFIOPCIExtIRQ *ext_irqs;
+    VFIORegion dma_fault_region;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 71b411b61c..9d4e020b97 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2614,11 +2614,67 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
     return 0;
 }
 
+static void vfio_init_fault_regions(VFIOPCIDevice *vdev, Error **errp)
+{
+    struct vfio_region_info *fault_region_info = NULL;
+    struct vfio_region_info_cap_fault *cap_fault;
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_info_cap_header *hdr;
+    char *fault_region_name;
+    int ret;
+
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
+                                   VFIO_REGION_TYPE_NESTED,
+                                   VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT,
+                                   &fault_region_info);
+    if (ret) {
+        goto out;
+    }
+
+    hdr = vfio_get_region_info_cap(fault_region_info,
+                                   VFIO_REGION_INFO_CAP_DMA_FAULT);
+    if (!hdr) {
+        error_setg(errp, "failed to retrieve DMA FAULT capability");
+        goto out;
+    }
+    cap_fault = container_of(hdr, struct vfio_region_info_cap_fault,
+                             header);
+    if (cap_fault->version != 1) {
+        error_setg(errp, "Unsupported DMA FAULT API version %d",
+                   cap_fault->version);
+        goto out;
+    }
+
+    fault_region_name = g_strdup_printf("%s DMA FAULT %d",
+                                        vbasedev->name,
+                                        fault_region_info->index);
+
+    ret = vfio_region_setup(OBJECT(vdev), vbasedev,
+                            &vdev->dma_fault_region,
+                            fault_region_info->index,
+                            fault_region_name);
+    g_free(fault_region_name);
+    if (ret) {
+        error_setg_errno(errp, -ret,
+                         "failed to set up the DMA FAULT region %d",
+                         fault_region_info->index);
+        goto out;
+    }
+
+    ret = vfio_region_mmap(&vdev->dma_fault_region);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Failed to mmap the DMA FAULT queue");
+    }
+out:
+    g_free(fault_region_info);
+}
+
 static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 {
     VFIODevice *vbasedev = &vdev->vbasedev;
     struct vfio_region_info *reg_info;
     struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
+    Error *err = NULL;
     int i, ret = -1;
 
     /* Sanity check device */
@@ -2682,6 +2738,12 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
         }
     }
 
+    vfio_init_fault_regions(vdev, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
     irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
 
     ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
@@ -3274,6 +3336,7 @@ static void vfio_instance_finalize(Object *obj)
 
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
+    vfio_region_finalize(&vdev->dma_fault_region);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
     /*
@@ -3294,6 +3357,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
     vfio_unregister_ext_irq_notifiers(vdev);
+    vfio_region_exit(&vdev->dma_fault_region);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     if (vdev->irqchip_change_notifier.notify) {
         kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
-- 
2.26.3




* [RFC v9 20/29] vfio/pci: Implement the DMA fault handler
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (18 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 19/29] vfio/pci: Set up the DMA FAULT region Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 21/29] hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute Eric Auger
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Whenever the eventfd is triggered, we retrieve the DMA fault(s)
from the mmapped fault region and inject them into the IOMMU
memory region.
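The handler's consumption of the ring reduces to a tail-chasing loop modulo the number of entries. A simplified model (not the QEMU code itself, which injects each record into the vIOMMU as it advances):

```c
#include <stdint.h>

/* Sketch of the consumer loop in vfio_dma_fault_notifier_handler():
 * advance the cached tail index until it catches up with the head
 * published by the kernel, wrapping modulo the ring size.
 * Returns how many fault records were consumed. */
static uint32_t consume_faults(uint32_t *tail, uint32_t head,
                               uint32_t nb_entries)
{
    uint32_t consumed = 0;

    while (*tail != head) {
        /* the real handler injects queue[*tail] into the vIOMMU here */
        *tail = (*tail + 1) % nb_entries;
        consumed++;
    }
    return consumed;
}
```

The wraparound is why the tail is advanced with a modulo rather than a plain increment: the kernel producer writes records circularly into the fixed-size queue.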

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/pci.h |  1 +
 hw/vfio/pci.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index eef91065f1..03ac8919ef 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -146,6 +146,7 @@ struct VFIOPCIDevice {
     EventNotifier req_notifier;
     VFIOPCIExtIRQ *ext_irqs;
     VFIORegion dma_fault_region;
+    uint32_t fault_tail_index;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9d4e020b97..d7e563859f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2929,10 +2929,60 @@ static PCIPASIDOps vfio_pci_pasid_ops = {
 static void vfio_dma_fault_notifier_handler(void *opaque)
 {
     VFIOPCIExtIRQ *ext_irq = opaque;
+    VFIOPCIDevice *vdev = ext_irq->vdev;
+    PCIDevice *pdev = &vdev->pdev;
+    AddressSpace *as = pci_device_iommu_address_space(pdev);
+    IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(as->root);
+    struct vfio_region_dma_fault header;
+    struct iommu_fault *queue;
+    char *queue_buffer = NULL;
+    ssize_t bytes;
 
     if (!event_notifier_test_and_clear(&ext_irq->notifier)) {
         return;
     }
+
+    bytes = pread(vdev->vbasedev.fd, &header, sizeof(header),
+                  vdev->dma_fault_region.fd_offset);
+    if (bytes != sizeof(header)) {
+        error_report("%s unable to read the fault region header (0x%lx)",
+                     __func__, bytes);
+        return;
+    }
+
+    /* Normally the fault queue is mmapped */
+    queue = (struct iommu_fault *)vdev->dma_fault_region.mmaps[0].mmap;
+    if (!queue) {
+        size_t queue_size = header.nb_entries * header.entry_size;
+
+        error_report("%s: fault queue not mmapped: slower fault handling",
+                     vdev->vbasedev.name);
+
+        queue_buffer = g_malloc(queue_size);
+        bytes = pread(vdev->vbasedev.fd, queue_buffer, queue_size,
+                       vdev->dma_fault_region.fd_offset + header.offset);
+        if (bytes != queue_size) {
+            error_report("%s unable to read the fault queue (0x%lx)",
+                         __func__, bytes);
+            g_free(queue_buffer);
+            return;
+        }
+
+        queue = (struct iommu_fault *)queue_buffer;
+    }
+
+    while (vdev->fault_tail_index != header.head) {
+        memory_region_inject_faults(iommu_mr, 1,
+                                    &queue[vdev->fault_tail_index]);
+        vdev->fault_tail_index =
+            (vdev->fault_tail_index + 1) % header.nb_entries;
+    }
+    bytes = pwrite(vdev->vbasedev.fd, &vdev->fault_tail_index, 4,
+                   vdev->dma_fault_region.fd_offset);
+    if (bytes != 4) {
+        error_report("%s unable to write the fault region tail index (0x%lx)",
+                     __func__, bytes);
+    }
+    g_free(queue_buffer);
 }
 
 static int vfio_register_ext_irq_handler(VFIOPCIDevice *vdev,
-- 
2.26.3




* [RFC v9 21/29] hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (19 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 20/29] vfio/pci: Implement the DMA fault handler Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 22/29] hw/arm/smmuv3: Store the PASID table GPA in the translation config Eric Auger
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

The SMMUv3 has the peculiarity of translating MSI
transactions. Let's advertise the corresponding
attribute.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---
---
 hw/arm/smmuv3.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 7166008ab0..1ee81a25e9 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1589,6 +1589,9 @@ static int smmuv3_get_attr(IOMMUMemoryRegion *iommu,
     if (attr == IOMMU_ATTR_VFIO_NESTED) {
         *(bool *) data = true;
         return 0;
+    } else if (attr == IOMMU_ATTR_MSI_TRANSLATE) {
+        *(bool *) data = true;
+        return 0;
     }
     return -EINVAL;
 }
-- 
2.26.3




* [RFC v9 22/29] hw/arm/smmuv3: Store the PASID table GPA in the translation config
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (20 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 21/29] hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 23/29] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation Eric Auger
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

For VFIO integration we will need to pass the Context Descriptor (CD)
table GPA to the host. The CD table is also referred to as the PASID
table. Its GPA corresponds to the s1ctrptr field of the Stream Table
Entry. So let's decode and store it in the configuration structure.
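The decode itself is a two-word field extraction. A sketch, assuming the STE layout used by QEMU where S1ContextPtr spans word[0] bits [31:6] and word[1] bits [15:0] (check STE_CTXPTR in hw/arm/smmuv3-internal.h for the authoritative macro):

```c
#include <stdint.h>

/* Sketch of the s1ctxptr decode from the first two 32-bit STE words.
 * The low 6 bits are zero because the CD table is 64-byte aligned. */
static uint64_t ste_s1ctxptr(const uint32_t *word)
{
    uint64_t addr = (uint64_t)(word[1] & 0xffff) << 32;

    addr |= word[0] & 0xffffffc0;
    return addr;
}
```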

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/arm/smmu-common.h | 1 +
 hw/arm/smmuv3.c              | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index 706be3c6d0..d578339935 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -76,6 +76,7 @@ typedef struct SMMUTransCfg {
     uint8_t tbi;               /* Top Byte Ignore */
     uint16_t asid;
     SMMUTransTableInfo tt[2];
+    dma_addr_t s1ctxptr;
     uint32_t iotlb_hits;       /* counts IOTLB hits for this asid */
     uint32_t iotlb_misses;     /* counts IOTLB misses for this asid */
 } SMMUTransCfg;
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 1ee81a25e9..a7608af5dd 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -358,6 +358,7 @@ static int decode_ste(SMMUv3State *s, SMMUTransCfg *cfg,
                       "SMMUv3 S1 stalling fault model not allowed yet\n");
         goto bad_ste;
     }
+    cfg->s1ctxptr = STE_CTXPTR(ste);
     return 0;
 
 bad_ste:
-- 
2.26.3




* [RFC v9 23/29] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (21 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 22/29] hw/arm/smmuv3: Store the PASID table GPA in the translation config Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 24/29] hw/arm/smmuv3: Fill the IOTLBEntry leaf field " Eric Auger
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

When the guest invalidates one S1 entry, it passes the ASID.
When propagating this invalidation down to the host, the ASID
information must also be passed. So let's fill the arch_id field
introduced for that purpose and set the flags accordingly to
indicate its presence.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v7 -> v8:
- set flags
---
 hw/arm/smmuv3.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index a7608af5dd..7beb55cd89 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -832,6 +832,8 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
     event.entry.iova = iova;
     event.entry.addr_mask = num_pages * (1 << granule) - 1;
     event.entry.perm = IOMMU_NONE;
+    event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
+    event.entry.arch_id = asid;
 
     memory_region_notify_iommu_one(n, &event);
 }
-- 
2.26.3




* [RFC v9 24/29] hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (22 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 23/29] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-05-13  7:09   ` Kunkun Jiang
  2021-04-11 12:09 ` [RFC v9 25/29] hw/arm/smmuv3: Pass stage 1 configurations to the host Eric Auger
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Let's propagate the leaf attribute throughout the invalidation path.
This hint is used to reduce the scope of the invalidations to the
last level of translation. Not enforcing it induces large performance
penalties in nested mode.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmuv3.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 7beb55cd89..74a6408146 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -799,7 +799,7 @@ epilogue:
 static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
                                IOMMUNotifier *n,
                                int asid, dma_addr_t iova,
-                               uint8_t tg, uint64_t num_pages)
+                               uint8_t tg, uint64_t num_pages, bool leaf)
 {
     SMMUDevice *sdev = container_of(mr, SMMUDevice, iommu);
     IOMMUTLBEvent event = {};
@@ -834,6 +834,7 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
     event.entry.perm = IOMMU_NONE;
     event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
     event.entry.arch_id = asid;
+    event.entry.leaf = leaf;
 
     memory_region_notify_iommu_one(n, &event);
 }
@@ -865,7 +866,7 @@ static void smmuv3_notify_asid(IOMMUMemoryRegion *mr,
 
 /* invalidate an asid/iova range tuple in all mr's */
 static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,
-                                      uint8_t tg, uint64_t num_pages)
+                                      uint8_t tg, uint64_t num_pages, bool leaf)
 {
     SMMUDevice *sdev;
 
@@ -877,7 +878,7 @@ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,
                                         tg, num_pages);
 
         IOMMU_NOTIFIER_FOREACH(n, mr) {
-            smmuv3_notify_iova(mr, n, asid, iova, tg, num_pages);
+            smmuv3_notify_iova(mr, n, asid, iova, tg, num_pages, leaf);
         }
     }
 }
@@ -915,7 +916,7 @@ static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd)
         count = mask + 1;
 
         trace_smmuv3_s1_range_inval(vmid, asid, addr, tg, count, ttl, leaf);
-        smmuv3_inv_notifiers_iova(s, asid, addr, tg, count);
+        smmuv3_inv_notifiers_iova(s, asid, addr, tg, count, leaf);
         smmu_iotlb_inv_iova(s, asid, addr, tg, count, ttl);
 
         num_pages -= count;
-- 
2.26.3




* [RFC v9 25/29] hw/arm/smmuv3: Pass stage 1 configurations to the host
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (23 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 24/29] hw/arm/smmuv3: Fill the IOTLBEntry leaf field " Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 26/29] hw/arm/smmuv3: Implement fault injection Eric Auger
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

In case PASID PciOps are set for the device, we call
the set_pasid_table() callback on each STE update.

This allows passing the guest stage 1 configuration
to the host and applying it at the physical level.
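The configuration passed down reduces to a three-way decision taken from the decoded STE state. A minimal model of that selection (the enum values are illustrative stand-ins for the proposed IOMMU_PASID_CONFIG_* UAPI constants, not the real ones):

```c
#include <stdbool.h>

/* Decision logic of smmuv3_notify_config_change() for the PASID table
 * config handed to the host. */
enum pasid_config {
    PASID_CFG_BYPASS,
    PASID_CFG_ABORT,
    PASID_CFG_TRANSLATE,
};

static enum pasid_config select_pasid_config(bool disabled, bool bypassed,
                                             bool aborted)
{
    if (disabled || bypassed) {
        return PASID_CFG_BYPASS;    /* STE disabled or in bypass */
    }
    if (aborted) {
        return PASID_CFG_ABORT;     /* STE aborts transactions */
    }
    return PASID_CFG_TRANSLATE;     /* stage 1 active: pass s1ctxptr */
}
```

Only in the TRANSLATE case does the host actually walk the guest CD table pointed to by s1ctxptr; BYPASS and ABORT are terminal states.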

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v4 -> v5:
- Use PciOps instead of config notifiers

v3 -> v4:
- fix compile issue with mingw

v2 -> v3:
- adapt to pasid_cfg field changes. Use local variable
- add trace event
- set version fields
- use CONFIG_PASID

v1 -> v2:
- do not notify anymore on CD change. Anyway the smmuv3 linux
  driver is not sending any CD invalidation commands. If we were
  to propagate CD invalidation commands, we would use the
  CACHE_INVALIDATE VFIO ioctl.
- notify precise config flags to prepare for the addition of new
  flags
---
 hw/arm/smmu-internal.h |  1 +
 hw/arm/smmuv3.c        | 72 ++++++++++++++++++++++++++++++++++++------
 hw/arm/trace-events    |  1 +
 3 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/hw/arm/smmu-internal.h b/hw/arm/smmu-internal.h
index 2d75b31953..5ef8c598c6 100644
--- a/hw/arm/smmu-internal.h
+++ b/hw/arm/smmu-internal.h
@@ -105,6 +105,7 @@ typedef struct SMMUIOTLBPageInvInfo {
 } SMMUIOTLBPageInvInfo;
 
 typedef struct SMMUSIDRange {
+    SMMUState *state;
     uint32_t start;
     uint32_t end;
 } SMMUSIDRange;
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 74a6408146..aefc55a607 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -16,6 +16,10 @@
  * with this program; if not, see <http://www.gnu.org/licenses/>.
  */
 
+#ifdef __linux__
+#include "linux/iommu.h"
+#endif
+
 #include "qemu/osdep.h"
 #include "qemu/bitops.h"
 #include "hw/irq.h"
@@ -925,6 +929,61 @@ static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd)
     }
 }
 
+static void smmuv3_notify_config_change(SMMUState *bs, uint32_t sid)
+{
+#ifdef __linux__
+    IOMMUMemoryRegion *mr = smmu_iommu_mr(bs, sid);
+    SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
+                           .inval_ste_allowed = true};
+    IOMMUConfig iommu_config = {};
+    SMMUTransCfg *cfg;
+    SMMUDevice *sdev;
+
+    if (!mr) {
+        return;
+    }
+
+    sdev = container_of(mr, SMMUDevice, iommu);
+
+    /* flush QEMU config cache */
+    smmuv3_flush_config(sdev);
+
+    if (!pci_device_is_pasid_ops_set(sdev->bus, sdev->devfn)) {
+        return;
+    }
+
+    cfg = smmuv3_get_config(sdev, &event);
+
+    if (!cfg) {
+        return;
+    }
+
+    iommu_config.pasid_cfg.argsz = sizeof(struct iommu_pasid_table_config);
+    iommu_config.pasid_cfg.version = PASID_TABLE_CFG_VERSION_1;
+    iommu_config.pasid_cfg.format = IOMMU_PASID_FORMAT_SMMUV3;
+    iommu_config.pasid_cfg.base_ptr = cfg->s1ctxptr;
+    iommu_config.pasid_cfg.pasid_bits = 0;
+    iommu_config.pasid_cfg.vendor_data.smmuv3.version = PASID_TABLE_SMMUV3_CFG_VERSION_1;
+
+    if (cfg->disabled || cfg->bypassed) {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_BYPASS;
+    } else if (cfg->aborted) {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_ABORT;
+    } else {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_TRANSLATE;
+    }
+
+    trace_smmuv3_notify_config_change(mr->parent_obj.name,
+                                      iommu_config.pasid_cfg.config,
+                                      iommu_config.pasid_cfg.base_ptr);
+
+    if (pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config)) {
+        error_report("Failed to pass PASID table to host for iommu mr %s (%m)",
+                     mr->parent_obj.name);
+    }
+#endif
+}
+
 static gboolean
 smmuv3_invalidate_ste(gpointer key, gpointer value, gpointer user_data)
 {
@@ -935,6 +994,7 @@ smmuv3_invalidate_ste(gpointer key, gpointer value, gpointer user_data)
     if (sid < sid_range->start || sid > sid_range->end) {
         return false;
     }
+    smmuv3_notify_config_change(sid_range->state, sid);
     trace_smmuv3_config_cache_inv(sid);
     return true;
 }
@@ -1005,22 +1065,14 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
         case SMMU_CMD_CFGI_STE:
         {
             uint32_t sid = CMD_SID(&cmd);
-            IOMMUMemoryRegion *mr = smmu_iommu_mr(bs, sid);
-            SMMUDevice *sdev;
 
             if (CMD_SSEC(&cmd)) {
                 cmd_error = SMMU_CERROR_ILL;
                 break;
             }
 
-            if (!mr) {
-                break;
-            }
-
             trace_smmuv3_cmdq_cfgi_ste(sid);
-            sdev = container_of(mr, SMMUDevice, iommu);
-            smmuv3_flush_config(sdev);
-
+            smmuv3_notify_config_change(bs, sid);
             break;
         }
         case SMMU_CMD_CFGI_STE_RANGE: /* same as SMMU_CMD_CFGI_ALL */
@@ -1028,7 +1080,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             uint32_t start = CMD_SID(&cmd);
             uint8_t range = CMD_STE_RANGE(&cmd);
             uint64_t end = start + (1ULL << (range + 1)) - 1;
-            SMMUSIDRange sid_range = {start, end};
+            SMMUSIDRange sid_range = {bs, start, end};
 
             if (CMD_SSEC(&cmd)) {
                 cmd_error = SMMU_CERROR_ILL;
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 8e530ba79d..b0b0030d24 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -53,4 +53,5 @@ smmuv3_config_cache_inv(uint32_t sid) "Config cache INV for sid=0x%x"
 smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node for iommu mr=%s"
 smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s"
 smmuv3_inv_notifiers_iova(const char *name, uint16_t asid, uint64_t iova, uint8_t tg, uint64_t num_pages) "iommu mr=%s asid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64
+smmuv3_notify_config_change(const char *name, uint8_t config, uint64_t s1ctxptr) "iommu mr=%s config=%d s1ctxptr=0x%"PRIx64
 
-- 
2.26.3




* [RFC v9 26/29] hw/arm/smmuv3: Implement fault injection
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (24 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 25/29] hw/arm/smmuv3: Pass stage 1 configurations to the host Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 27/29] hw/arm/smmuv3: Allow MAP notifiers Eric Auger
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

We convert the iommu_fault structs received from the kernel
into the data structure used by the emulation code and record
the events into the virtual event queue.
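The conversion is essentially a reason-to-event-type mapping plus copying the faulting addresses. An illustrative subset (the enum values are hypothetical stand-ins for the kernel's IOMMU_FAULT_REASON_* codes and QEMU's SMMU_EVT_* types, chosen only to show the shape of the switch):

```c
/* Illustrative subset of the mapping done by smmuv3_inject_faults(). */
enum fault_reason {
    REASON_PASID_INVALID,
    REASON_PTE_FETCH,
    REASON_PERMISSION,
    REASON_UNKNOWN,
};

enum smmu_event {
    EVT_NONE,
    EVT_C_BAD_SUBSTREAMID,
    EVT_F_TRANSLATION,
    EVT_F_PERMISSION,
};

static enum smmu_event map_fault_reason(enum fault_reason r)
{
    switch (r) {
    case REASON_PASID_INVALID:
        return EVT_C_BAD_SUBSTREAMID;
    case REASON_PTE_FETCH:
        return EVT_F_TRANSLATION;
    case REASON_PERMISSION:
        return EVT_F_PERMISSION;
    default:
        return EVT_NONE; /* unexpected reason: warn and skip the record */
    }
}
```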

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v3 -> v4:
- fix compile issue on mingw

Exhaustive mapping remains to be done
---
 hw/arm/smmuv3.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index aefc55a607..53b71c895c 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1652,6 +1652,76 @@ static int smmuv3_get_attr(IOMMUMemoryRegion *iommu,
     return -EINVAL;
 }
 
+struct iommu_fault;
+
+static inline int
+smmuv3_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
+                     struct iommu_fault *buf)
+{
+#ifdef __linux__
+    SMMUDevice *sdev = container_of(iommu_mr, SMMUDevice, iommu);
+    SMMUv3State *s3 = sdev->smmu;
+    uint32_t sid = smmu_get_sid(sdev);
+    int i;
+
+    for (i = 0; i < count; i++) {
+        SMMUEventInfo info = {};
+        struct iommu_fault_unrecoverable *record;
+
+        if (buf[i].type != IOMMU_FAULT_DMA_UNRECOV) {
+            continue;
+        }
+
+        info.sid = sid;
+        record = &buf[i].event;
+
+        switch (record->reason) {
+        case IOMMU_FAULT_REASON_PASID_INVALID:
+            info.type = SMMU_EVT_C_BAD_SUBSTREAMID;
+            /* TODO further fill info.u.c_bad_substream */
+            break;
+        case IOMMU_FAULT_REASON_PASID_FETCH:
+            info.type = SMMU_EVT_F_CD_FETCH;
+            break;
+        case IOMMU_FAULT_REASON_BAD_PASID_ENTRY:
+            info.type = SMMU_EVT_C_BAD_CD;
+            /* TODO further fill info.u.c_bad_cd */
+            break;
+        case IOMMU_FAULT_REASON_WALK_EABT:
+            info.type = SMMU_EVT_F_WALK_EABT;
+            info.u.f_walk_eabt.addr = record->addr;
+            info.u.f_walk_eabt.addr2 = record->fetch_addr;
+            break;
+        case IOMMU_FAULT_REASON_PTE_FETCH:
+            info.type = SMMU_EVT_F_TRANSLATION;
+            info.u.f_translation.addr = record->addr;
+            break;
+        case IOMMU_FAULT_REASON_OOR_ADDRESS:
+            info.type = SMMU_EVT_F_ADDR_SIZE;
+            info.u.f_addr_size.addr = record->addr;
+            break;
+        case IOMMU_FAULT_REASON_ACCESS:
+            info.type = SMMU_EVT_F_ACCESS;
+            info.u.f_access.addr = record->addr;
+            break;
+        case IOMMU_FAULT_REASON_PERMISSION:
+            info.type = SMMU_EVT_F_PERMISSION;
+            info.u.f_permission.addr = record->addr;
+            break;
+        default:
+            warn_report("%s Unexpected fault reason received from host: %d",
+                        __func__, record->reason);
+            continue;
+        }
+
+        smmuv3_record_event(s3, &info);
+    }
+    return 0;
+#else
+    return -1;
+#endif
+}
+
 static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
                                                   void *data)
 {
@@ -1660,6 +1730,7 @@ static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->translate = smmuv3_translate;
     imrc->notify_flag_changed = smmuv3_notify_flag_changed;
     imrc->get_attr = smmuv3_get_attr;
+    imrc->inject_faults = smmuv3_inject_faults;
 }
 
 static const TypeInfo smmuv3_type_info = {
-- 
2.26.3




* [RFC v9 27/29] hw/arm/smmuv3: Allow MAP notifiers
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (25 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 26/29] hw/arm/smmuv3: Implement fault injection Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 28/29] pci: Add return_page_response pci ops Eric Auger
  2021-04-11 12:09 ` [RFC v9 29/29] vfio/pci: Implement return_page_response page response callback Eric Auger
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

We now have all the building blocks to support nested
paging, which uses MAP notifiers to map the MSIs. So let's
allow MAP notifiers to be registered.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmuv3.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 53b71c895c..ca690513e6 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1620,14 +1620,6 @@ static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
         return -EINVAL;
     }
 
-    if (new & IOMMU_NOTIFIER_MAP) {
-        error_setg(errp,
-                   "device %02x.%02x.%x requires iommu MAP notifier which is "
-                   "not currently supported", pci_bus_num(sdev->bus),
-                   PCI_SLOT(sdev->devfn), PCI_FUNC(sdev->devfn));
-        return -EINVAL;
-    }
-
     if (old == IOMMU_NOTIFIER_NONE) {
         trace_smmuv3_notify_flag_add(iommu->parent_obj.name);
         QLIST_INSERT_HEAD(&s->devices_with_notifiers, sdev, next);
-- 
2.26.3




* [RFC v9 28/29] pci: Add return_page_response pci ops
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (26 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 27/29] hw/arm/smmuv3: Allow MAP notifiers Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  2021-04-11 12:09 ` [RFC v9 29/29] vfio/pci: Implement return_page_response page response callback Eric Auger
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

Add a new PCI operation that allows page responses to be returned
to registered VFIO devices.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/iommu/iommu.h |  8 ++++++++
 include/hw/pci/pci.h     |  4 ++++
 hw/pci/pci.c             | 16 ++++++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/include/hw/iommu/iommu.h b/include/hw/iommu/iommu.h
index 12092bda7b..5890f095b1 100644
--- a/include/hw/iommu/iommu.h
+++ b/include/hw/iommu/iommu.h
@@ -24,5 +24,13 @@ typedef struct IOMMUConfig {
           };
 } IOMMUConfig;
 
+typedef struct IOMMUPageResponse {
+    union {
+#ifdef __linux__
+        struct iommu_page_response resp;
+#endif
+          };
+} IOMMUPageResponse;
+
 
 #endif /* QEMU_HW_IOMMU_IOMMU_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 1f73c04975..9bc0919352 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -268,6 +268,8 @@ typedef struct PCIReqIDCache PCIReqIDCache;
 
 struct PCIPASIDOps {
     int (*set_pasid_table)(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
+    int (*return_page_response)(PCIBus *bus, int32_t devfn,
+                                IOMMUPageResponse *resp);
 };
 typedef struct PCIPASIDOps PCIPASIDOps;
 
@@ -501,6 +503,8 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops);
 bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn);
 int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
+int pci_device_return_page_response(PCIBus *bus, int32_t devfn,
+                                    IOMMUPageResponse *resp);
 
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 114855a0ac..18d84ff42e 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2765,6 +2765,22 @@ int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn,
     return -ENOENT;
 }
 
+int pci_device_return_page_response(PCIBus *bus, int32_t devfn,
+                                    IOMMUPageResponse *resp)
+{
+    PCIDevice *dev;
+
+    if (!bus) {
+        return -EINVAL;
+    }
+
+    dev = bus->devices[devfn];
+    if (dev && dev->pasid_ops && dev->pasid_ops->return_page_response) {
+        return dev->pasid_ops->return_page_response(bus, devfn, resp);
+    }
+    return -ENOENT;
+}
+
 static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
 {
     Range *range = opaque;
-- 
2.26.3




* [RFC v9 29/29] vfio/pci: Implement return_page_response page response callback
  2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (27 preceding siblings ...)
  2021-04-11 12:09 ` [RFC v9 28/29] pci: Add return_page_response pci ops Eric Auger
@ 2021-04-11 12:09 ` Eric Auger
  28 siblings, 0 replies; 46+ messages in thread
From: Eric Auger @ 2021-04-11 12:09 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	jiangkunkun, shameerali.kolothum.thodi, nicoleotsuka,
	vivek.gautam, vdumpa, yi.l.liu, peterx, zhangfei.gao, yuzenghui,
	zhukeqian1

This patch implements the page response path. The response is
written into the page response ring buffer and then the header's
head index is updated. This path is not used by this series. It is
introduced here as a POC for vSVA/ARM integration.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v11 -> v12:
- use VFIO_REGION_INFO_CAP_DMA_FAULT_RESPONSE [Shameer]
- fix hot del regression reported and fixed by Shameer
---
 hw/vfio/pci.h |   2 +
 hw/vfio/pci.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 03ac8919ef..61b3bf1303 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -147,6 +147,8 @@ struct VFIOPCIDevice {
     VFIOPCIExtIRQ *ext_irqs;
     VFIORegion dma_fault_region;
     uint32_t fault_tail_index;
+    VFIORegion dma_fault_response_region;
+    uint32_t fault_response_head_index;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d7e563859f..0f23c8f343 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2669,6 +2669,61 @@ out:
     g_free(fault_region_info);
 }
 
+static void vfio_init_fault_response_regions(VFIOPCIDevice *vdev, Error **errp)
+{
+    struct vfio_region_info *fault_region_info = NULL;
+    struct vfio_region_info_cap_fault *cap_fault;
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_info_cap_header *hdr;
+    char *fault_region_name;
+    int ret;
+
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
+                                   VFIO_REGION_TYPE_NESTED,
+                                   VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT_RESPONSE,
+                                   &fault_region_info);
+    if (ret) {
+        goto out;
+    }
+
+    hdr = vfio_get_region_info_cap(fault_region_info,
+                                   VFIO_REGION_INFO_CAP_DMA_FAULT_RESPONSE);
+    if (!hdr) {
+        error_setg(errp, "failed to retrieve DMA FAULT RESPONSE capability");
+        goto out;
+    }
+    cap_fault = container_of(hdr, struct vfio_region_info_cap_fault,
+                             header);
+    if (cap_fault->version != 1) {
+        error_setg(errp, "Unsupported DMA FAULT RESPONSE API version %d",
+                   cap_fault->version);
+        goto out;
+    }
+
+    fault_region_name = g_strdup_printf("%s DMA FAULT RESPONSE %d",
+                                        vbasedev->name,
+                                        fault_region_info->index);
+
+    ret = vfio_region_setup(OBJECT(vdev), vbasedev,
+                            &vdev->dma_fault_response_region,
+                            fault_region_info->index,
+                            fault_region_name);
+    g_free(fault_region_name);
+    if (ret) {
+        error_setg_errno(errp, -ret,
+                         "failed to set up the DMA FAULT RESPONSE region %d",
+                         fault_region_info->index);
+        goto out;
+    }
+
+    ret = vfio_region_mmap(&vdev->dma_fault_response_region);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Failed to mmap the DMA FAULT RESPONSE queue");
+    }
+out:
+    g_free(fault_region_info);
+}
+
 static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 {
     VFIODevice *vbasedev = &vdev->vbasedev;
@@ -2744,6 +2799,12 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
         return;
     }
 
+    vfio_init_fault_response_regions(vdev, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
     irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
 
     ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
@@ -2922,8 +2983,68 @@ static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
     return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
 }
 
+static int vfio_iommu_return_page_response(PCIBus *bus, int32_t devfn,
+                                           IOMMUPageResponse *resp)
+{
+    PCIDevice *pdev = bus->devices[devfn];
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    struct iommu_page_response *response = &resp->resp;
+    struct vfio_region_dma_fault_response header;
+    struct iommu_page_response *queue;
+    char *queue_buffer = NULL;
+    ssize_t bytes;
+
+    if (!vdev->dma_fault_response_region.mem) {
+        return -EINVAL;
+    }
+
+    /* read the header */
+    bytes = pread(vdev->vbasedev.fd, &header, sizeof(header),
+                  vdev->dma_fault_response_region.fd_offset);
+    if (bytes != sizeof(header)) {
+        error_report("%s unable to read the fault region header (0x%lx)",
+                     __func__, bytes);
+        return -1;
+    }
+
+    /* Normally the fault queue is mmapped */
+    queue = (struct iommu_page_response *)vdev->dma_fault_response_region.mmaps[0].mmap;
+    if (!queue) {
+        size_t queue_size = header.nb_entries * header.entry_size;
+
+        error_report("%s: fault queue not mmapped: slower fault handling",
+                     vdev->vbasedev.name);
+
+        queue_buffer = g_malloc(queue_size);
+        bytes = pread(vdev->vbasedev.fd, queue_buffer, queue_size,
+                      vdev->dma_fault_response_region.fd_offset + header.offset);
+        if (bytes != queue_size) {
+            error_report("%s unable to read the fault queue (0x%lx)",
+                         __func__, bytes);
+            return -1;
+        }
+
+        queue = (struct iommu_page_response *)queue_buffer;
+    }
+    /* deposit the new response in the queue and increment the head */
+    memcpy(queue + header.head, response, header.entry_size);
+
+    vdev->fault_response_head_index =
+        (vdev->fault_response_head_index + 1) % header.nb_entries;
+    bytes = pwrite(vdev->vbasedev.fd, &vdev->fault_response_head_index, 4,
+                   vdev->dma_fault_response_region.fd_offset);
+    if (bytes != 4) {
+        error_report("%s unable to write the fault response region head index (0x%lx)",
+                     __func__, bytes);
+    }
+    g_free(queue_buffer);
+
+    return 0;
+}
+
 static PCIPASIDOps vfio_pci_pasid_ops = {
     .set_pasid_table = vfio_iommu_set_pasid_table,
+    .return_page_response = vfio_iommu_return_page_response,
 };
 
 static void vfio_dma_fault_notifier_handler(void *opaque)
@@ -3387,6 +3508,7 @@ static void vfio_instance_finalize(Object *obj)
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
     vfio_region_finalize(&vdev->dma_fault_region);
+    vfio_region_finalize(&vdev->dma_fault_response_region);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
     /*
@@ -3408,6 +3530,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     vfio_unregister_err_notifier(vdev);
     vfio_unregister_ext_irq_notifiers(vdev);
     vfio_region_exit(&vdev->dma_fault_region);
+    vfio_region_exit(&vdev->dma_fault_response_region);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     if (vdev->irqchip_change_notifier.notify) {
         kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
-- 
2.26.3




* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-11 12:08 ` [RFC v9 15/29] vfio: Set up nested stage mappings Eric Auger
@ 2021-04-13 12:10   ` Kunkun Jiang
  2021-04-13 12:57     ` Auger Eric
  0 siblings, 1 reply; 46+ messages in thread
From: Kunkun Jiang @ 2021-04-13 12:10 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, yuzenghui, wanghaibin.wang,
	zhukeqian1

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:
> In nested mode, legacy vfio_iommu_map_notify cannot be used as
> there is no "caching" mode and we do not trap on map.
>
> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
> through the host single stage.
>
> With nested mode, we need to setup the stage 2 and the stage 1
> separately. This patch introduces a prereg_listener to setup
> the stage 2 mapping.
>
> The stage 1 mapping, owned by the guest, is passed to the host
> when the guest invalidates the stage 1 configuration, through
> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
> are cascaded down to the host through another IOMMU MR UNMAP
> notifier.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>
> ---
>
> v7 -> v8:
> - properly handle new IOMMUTLBEntry fields and especially
>    propagate DOMAIN and PASID based invalidations
>
> v6 -> v7:
> - remove PASID based invalidation
>
> v5 -> v6:
> - add error_report_err()
> - remove the abort in case of nested stage case
>
> v4 -> v5:
> - use VFIO_IOMMU_SET_PASID_TABLE
> - use PCIPASIDOps for config notification
>
> v3 -> v4:
> - use iommu_inv_pasid_info for ASID invalidation
>
> v2 -> v3:
> - use VFIO_IOMMU_ATTACH_PASID_TABLE
> - new user API
> - handle leaf
>
> v1 -> v2:
> - adapt to uapi changes
> - pass the asid
> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
> ---
>   hw/vfio/common.c     | 139 +++++++++++++++++++++++++++++++++++++++++--
>   hw/vfio/pci.c        |  21 +++++++
>   hw/vfio/trace-events |   2 +
>   3 files changed, 157 insertions(+), 5 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0cd7ef2139..e369d451e7 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>       return true;
>   }
>   
> +/* Propagate a guest IOTLB invalidation to the host (nested mode) */
> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> +{
> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
> +    VFIOContainer *container = giommu->container;
> +    int ret;
> +
> +    assert(iotlb->perm == IOMMU_NONE);
> +
> +    ustruct.argsz = sizeof(ustruct);
> +    ustruct.flags = 0;
> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
> +
> +    switch (iotlb->granularity) {
> +    case IOMMU_INV_GRAN_DOMAIN:
> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
> +        break;
> +    case IOMMU_INV_GRAN_PASID:
> +    {
> +        struct iommu_inv_pasid_info *pasid_info;
> +        int archid = -1;
> +
> +        pasid_info = &ustruct.info.granu.pasid_info;
> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
> +            archid = iotlb->arch_id;
> +        }
> +        pasid_info->archid = archid;
> +        trace_vfio_iommu_asid_inv_iotlb(archid);
> +        break;
> +    }
> +    case IOMMU_INV_GRAN_ADDR:
> +    {
> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
> +        struct iommu_inv_addr_info *addr_info;
> +        size_t size = iotlb->addr_mask + 1;
> +        int archid = -1;
> +
> +        addr_info = &ustruct.info.granu.addr_info;
> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
> +        if (iotlb->leaf) {
> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
> +        }
> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
> +            archid = iotlb->arch_id;
> +        }
> +        addr_info->archid = archid;
> +        addr_info->addr = start;
> +        addr_info->granule_size = size;
> +        addr_info->nb_granules = 1;
> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
> +                                        1, iotlb->leaf);
> +        break;
> +    }
Should we pass a size to the host kernel here, even if the vSMMU doesn't
support RIL or the guest kernel doesn't use RIL?

It will cause a TLBI issue in this scenario: the guest kernel issues a
TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. QEMU
then passes the iova and size (4K) to the host kernel. Finally, the
host kernel issues a TLBI cmd with "range" (4K), which cannot
invalidate the TLB entry of the 2M huge page. (pSMMU supports RIL)

Thanks,
Kunkun Jiang
> +    }
> +
> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
> +    if (ret) {
> +        error_report("%p: failed to invalidate CACHE (%d)", container, ret);
> +    }
> +}
> +
>   static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>   {
>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> @@ -776,6 +843,35 @@ static void vfio_dma_unmap_ram_section(VFIOContainer *container,
>       }
>   }
>   
> +static void vfio_prereg_listener_region_add(MemoryListener *listener,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIOContainer *container =
> +        container_of(listener, VFIOContainer, prereg_listener);
> +    Error *err = NULL;
> +
> +    if (!memory_region_is_ram(section->mr)) {
> +        return;
> +    }
> +
> +    vfio_dma_map_ram_section(container, section, &err);
> +    if (err) {
> +        error_report_err(err);
> +    }
> +}
> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
> +                                     MemoryRegionSection *section)
> +{
> +    VFIOContainer *container =
> +        container_of(listener, VFIOContainer, prereg_listener);
> +
> +    if (!memory_region_is_ram(section->mr)) {
> +        return;
> +    }
> +
> +    vfio_dma_unmap_ram_section(container, section);
> +}
> +
>   static void vfio_listener_region_add(MemoryListener *listener,
>                                        MemoryRegionSection *section)
>   {
> @@ -879,9 +975,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
>       memory_region_ref(section->mr);
>   
>       if (memory_region_is_iommu(section->mr)) {
> +        IOMMUNotify notify;
>           VFIOGuestIOMMU *giommu;
>           IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
> -        int iommu_idx;
> +        int iommu_idx, flags;
>   
>           trace_vfio_listener_region_add_iommu(iova, end);
>           /*
> @@ -900,8 +997,18 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           llend = int128_sub(llend, int128_one());
>           iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>                                                          MEMTXATTRS_UNSPECIFIED);
> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
> +
> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> +            /* IOTLB unmap notifier to propagate guest IOTLB invalidations */
> +            flags = IOMMU_NOTIFIER_UNMAP;
> +            notify = vfio_iommu_unmap_notify;
> +        } else {
> +            /* MAP/UNMAP IOTLB notifier */
> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
> +            notify = vfio_iommu_map_notify;
> +        }
> +
> +        iommu_notifier_init(&giommu->n, notify, flags,
>                               section->offset_within_region,
>                               int128_get64(llend),
>                               iommu_idx);
> @@ -921,7 +1028,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
>               goto fail;
>           }
>           QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
> +        if (flags & IOMMU_NOTIFIER_MAP) {
> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
> +        }
>   
>           return;
>       }
> @@ -1205,10 +1314,16 @@ static const MemoryListener vfio_memory_listener = {
>       .log_sync = vfio_listener_log_sync,
>   };
>   
> +static MemoryListener vfio_memory_prereg_listener = {
> +    .region_add = vfio_prereg_listener_region_add,
> +    .region_del = vfio_prereg_listener_region_del,
> +};
> +
>   static void vfio_listener_release(VFIOContainer *container)
>   {
>       memory_listener_unregister(&container->listener);
> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>           memory_listener_unregister(&container->prereg_listener);
>       }
>   }
> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>               vfio_get_iommu_info_migration(container, info);
>           }
>           g_free(info);
> +
> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
> +            container->prereg_listener = vfio_memory_prereg_listener;
> +            memory_listener_register(&container->prereg_listener,
> +                                     &address_space_memory);
> +            if (container->error) {
> +                memory_listener_unregister(&container->prereg_listener);
> +                ret = -1;
> +                error_propagate_prepend(errp, container->error,
> +                                    "RAM memory listener initialization failed "
> +                                    "for container");
> +                goto free_container_exit;
> +            }
> +        }
>           break;
>       }
>       case VFIO_SPAPR_TCE_v2_IOMMU:
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5c65aa0a98..cad7deec71 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2773,6 +2773,25 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>       vdev->req_enabled = false;
>   }
>   
> +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
> +                                      IOMMUConfig *config)
> +{
> +    PCIDevice *pdev = bus->devices[devfn];
> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> +    VFIOContainer *container = vdev->vbasedev.group->container;
> +    struct vfio_iommu_type1_set_pasid_table info;
> +
> +    info.argsz = sizeof(info);
> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
> +    memcpy(&info.config, &config->pasid_cfg, sizeof(config->pasid_cfg));
> +
> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
> +}
> +
> +static PCIPASIDOps vfio_pci_pasid_ops = {
> +    .set_pasid_table = vfio_iommu_set_pasid_table,
> +};
> +
>   static void vfio_realize(PCIDevice *pdev, Error **errp)
>   {
>       VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>       vfio_register_req_notifier(vdev);
>       vfio_setup_resetfn_quirk(vdev);
>   
> +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
> +
>       return;
>   
>   out_deregister:
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 936d29d150..43696afc15 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>   vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>   vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>   vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>   
>   # platform.c
>   vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"





* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-13 12:10   ` Kunkun Jiang
@ 2021-04-13 12:57     ` Auger Eric
  2021-04-14  1:45       ` Kunkun Jiang
  0 siblings, 1 reply; 46+ messages in thread
From: Auger Eric @ 2021-04-13 12:57 UTC (permalink / raw)
  To: Kunkun Jiang, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Kunkun,

On 4/13/21 2:10 PM, Kunkun Jiang wrote:
> Hi Eric,
> 
> On 2021/4/11 20:08, Eric Auger wrote:
>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>> there is no "caching" mode and we do not trap on map.
>>
>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>> through the host single stage.
>>
>> With nested mode, we need to setup the stage 2 and the stage 1
>> separately. This patch introduces a prereg_listener to setup
>> the stage 2 mapping.
>>
>> The stage 1 mapping, owned by the guest, is passed to the host
>> when the guest invalidates the stage 1 configuration, through
>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>> are cascaded down to the host through another IOMMU MR UNMAP
>> notifier.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>>
>> v7 -> v8:
>> - properly handle new IOMMUTLBEntry fields and especially
>>    propagate DOMAIN and PASID based invalidations
>>
>> v6 -> v7:
>> - remove PASID based invalidation
>>
>> v5 -> v6:
>> - add error_report_err()
>> - remove the abort in case of nested stage case
>>
>> v4 -> v5:
>> - use VFIO_IOMMU_SET_PASID_TABLE
>> - use PCIPASIDOps for config notification
>>
>> v3 -> v4:
>> - use iommu_inv_pasid_info for ASID invalidation
>>
>> v2 -> v3:
>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>> - new user API
>> - handle leaf
>>
>> v1 -> v2:
>> - adapt to uapi changes
>> - pass the asid
>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>> ---
>>   hw/vfio/common.c     | 139 +++++++++++++++++++++++++++++++++++++++++--
>>   hw/vfio/pci.c        |  21 +++++++
>>   hw/vfio/trace-events |   2 +
>>   3 files changed, 157 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 0cd7ef2139..e369d451e7 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>> *iotlb, void **vaddr,
>>       return true;
>>   }
>>   +/* Propagate a guest IOTLB invalidation to the host (nested mode) */
>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>> *iotlb)
>> +{
>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>> +    VFIOContainer *container = giommu->container;
>> +    int ret;
>> +
>> +    assert(iotlb->perm == IOMMU_NONE);
>> +
>> +    ustruct.argsz = sizeof(ustruct);
>> +    ustruct.flags = 0;
>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>> +
>> +    switch (iotlb->granularity) {
>> +    case IOMMU_INV_GRAN_DOMAIN:
>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>> +        break;
>> +    case IOMMU_INV_GRAN_PASID:
>> +    {
>> +        struct iommu_inv_pasid_info *pasid_info;
>> +        int archid = -1;
>> +
>> +        pasid_info = &ustruct.info.granu.pasid_info;
>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>> +            archid = iotlb->arch_id;
>> +        }
>> +        pasid_info->archid = archid;
>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>> +        break;
>> +    }
>> +    case IOMMU_INV_GRAN_ADDR:
>> +    {
>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>> +        struct iommu_inv_addr_info *addr_info;
>> +        size_t size = iotlb->addr_mask + 1;
>> +        int archid = -1;
>> +
>> +        addr_info = &ustruct.info.granu.addr_info;
>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>> +        if (iotlb->leaf) {
>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>> +        }
>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>> +            archid = iotlb->arch_id;
>> +        }
>> +        addr_info->archid = archid;
>> +        addr_info->addr = start;
>> +        addr_info->granule_size = size;
>> +        addr_info->nb_granules = 1;
>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>> +                                        1, iotlb->leaf);
>> +        break;
>> +    }
> Should we pass a size to the host kernel here, even if the vSMMU
> doesn't support RIL or the guest kernel doesn't use RIL?
> 
> It will cause a TLBI issue in this scenario: the guest kernel issues a
> TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. QEMU
> then passes the iova and size (4K) to the host kernel. Finally, the
> host kernel issues a TLBI cmd with "range" (4K), which cannot
> invalidate the TLB entry of the 2M huge page. (pSMMU supports RIL)

In that case the guest will loop over all 4K images belonging to the 2M
huge page and invalidate each of them. This should turn into qemu
notifications for each 4kB page, no? This is totally inefficient, hence
the support of RIL on guest side and QEMU device.

What do I miss?

Thanks

Eric
> 
> Thanks,
> Kunkun Jiang
>> +    }
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
>> +    if (ret) {
>> +        error_report("%p: failed to invalidate CACHE (%d)",
>> container, ret);
>> +    }
>> +}
>> +
>>   static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>> *iotlb)
>>   {
>>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>> @@ -776,6 +843,35 @@ static void
>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>       }
>>   }
>>   +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>> +                                            MemoryRegionSection
>> *section)
>> +{
>> +    VFIOContainer *container =
>> +        container_of(listener, VFIOContainer, prereg_listener);
>> +    Error *err = NULL;
>> +
>> +    if (!memory_region_is_ram(section->mr)) {
>> +        return;
>> +    }
>> +
>> +    vfio_dma_map_ram_section(container, section, &err);
>> +    if (err) {
>> +        error_report_err(err);
>> +    }
>> +}
>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>> +                                     MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container =
>> +        container_of(listener, VFIOContainer, prereg_listener);
>> +
>> +    if (!memory_region_is_ram(section->mr)) {
>> +        return;
>> +    }
>> +
>> +    vfio_dma_unmap_ram_section(container, section);
>> +}
>> +
>>   static void vfio_listener_region_add(MemoryListener *listener,
>>                                        MemoryRegionSection *section)
>>   {
>> @@ -879,9 +975,10 @@ static void
>> vfio_listener_region_add(MemoryListener *listener,
>>       memory_region_ref(section->mr);
>>         if (memory_region_is_iommu(section->mr)) {
>> +        IOMMUNotify notify;
>>           VFIOGuestIOMMU *giommu;
>>           IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
>> -        int iommu_idx;
>> +        int iommu_idx, flags;
>>             trace_vfio_listener_region_add_iommu(iova, end);
>>           /*
>> @@ -900,8 +997,18 @@ static void
>> vfio_listener_region_add(MemoryListener *listener,
>>           llend = int128_sub(llend, int128_one());
>>           iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>                                                         
>> MEMTXATTRS_UNSPECIFIED);
>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>> +
>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>> invalidations */
>> +            flags = IOMMU_NOTIFIER_UNMAP;
>> +            notify = vfio_iommu_unmap_notify;
>> +        } else {
>> +            /* MAP/UNMAP IOTLB notifier */
>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>> +            notify = vfio_iommu_map_notify;
>> +        }
>> +
>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>                               section->offset_within_region,
>>                               int128_get64(llend),
>>                               iommu_idx);
>> @@ -921,7 +1028,9 @@ static void
>> vfio_listener_region_add(MemoryListener *listener,
>>               goto fail;
>>           }
>>           QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>> giommu_next);
>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>> +        }
>>             return;
>>       }
>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>> vfio_memory_listener = {
>>       .log_sync = vfio_listener_log_sync,
>>   };
>>   +static MemoryListener vfio_memory_prereg_listener = {
>> +    .region_add = vfio_prereg_listener_region_add,
>> +    .region_del = vfio_prereg_listener_region_del,
>> +};
>> +
>>   static void vfio_listener_release(VFIOContainer *container)
>>   {
>>       memory_listener_unregister(&container->listener);
>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>           memory_listener_unregister(&container->prereg_listener);
>>       }
>>   }
>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>> *group, AddressSpace *as,
>>               vfio_get_iommu_info_migration(container, info);
>>           }
>>           g_free(info);
>> +
>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>> +            container->prereg_listener = vfio_memory_prereg_listener;
>> +            memory_listener_register(&container->prereg_listener,
>> +                                     &address_space_memory);
>> +            if (container->error) {
>> +                memory_listener_unregister(&container->prereg_listener);
>> +                ret = -1;
>> +                error_propagate_prepend(errp, container->error,
>> +                                    "RAM memory listener
>> initialization failed "
>> +                                    "for container");
>> +                goto free_container_exit;
>> +            }
>> +        }
>>           break;
>>       }
>>       case VFIO_SPAPR_TCE_v2_IOMMU:
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 5c65aa0a98..cad7deec71 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2773,6 +2773,25 @@ static void
>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>       vdev->req_enabled = false;
>>   }
>>   +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
>> +                                      IOMMUConfig *config)
>> +{
>> +    PCIDevice *pdev = bus->devices[devfn];
>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>> +    struct vfio_iommu_type1_set_pasid_table info;
>> +
>> +    info.argsz = sizeof(info);
>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>> +    memcpy(&info.config, &config->pasid_cfg, sizeof(config->pasid_cfg));
>> +
>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>> +}
>> +
>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>> +};
>> +
>>   static void vfio_realize(PCIDevice *pdev, Error **errp)
>>   {
>>       VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>> **errp)
>>       vfio_register_req_notifier(vdev);
>>       vfio_setup_resetfn_quirk(vdev);
>>   +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>> +
>>       return;
>>     out_deregister:
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 936d29d150..43696afc15 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>> int index, int nr_areas) "Devic
>>   vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>   vfio_get_dev_region(const char *name, int index, uint32_t type,
>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>   vfio_dma_unmap_overflow_workaround(void) ""
>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size,
>> uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d,
>> addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64"
>> leaf=%d"
>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>     # platform.c
>>   vfio_platform_base_device_init(char *name, int groupid) "%s belongs
>> to group #%d"
> 
> 
> 



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-13 12:57     ` Auger Eric
@ 2021-04-14  1:45       ` Kunkun Jiang
  2021-04-14  8:05         ` Auger Eric
                           ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Kunkun Jiang @ 2021-04-14  1:45 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

On 2021/4/13 20:57, Auger Eric wrote:
> Hi Kunkun,
>
> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>> Hi Eric,
>>
>> On 2021/4/11 20:08, Eric Auger wrote:
>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>> there is no "caching" mode and we do not trap on map.
>>>
>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>> through the host single stage.
>>>
>>> With nested mode, we need to setup the stage 2 and the stage 1
>>> separately. This patch introduces a prereg_listener to setup
>>> the stage 2 mapping.
>>>
>>> The stage 1 mapping, owned by the guest, is passed to the host
>>> when the guest invalidates the stage 1 configuration, through
>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>> are cascaded downto the host through another IOMMU MR UNMAP
>>> notifier.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>
>>> ---
>>>
>>> v7 -> v8:
>>> - properly handle new IOMMUTLBEntry fields and especially
>>>     propagate DOMAIN and PASID based invalidations
>>>
>>> v6 -> v7:
>>> - remove PASID based invalidation
>>>
>>> v5 -> v6:
>>> - add error_report_err()
>>> - remove the abort in case of nested stage case
>>>
>>> v4 -> v5:
>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>> - use PCIPASIDOps for config notification
>>>
>>> v3 -> v4:
>>> - use iommu_inv_pasid_info for ASID invalidation
>>>
>>> v2 -> v3:
>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>> - new user API
>>> - handle leaf
>>>
>>> v1 -> v2:
>>> - adapt to uapi changes
>>> - pass the asid
>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>> ---
>>>    hw/vfio/common.c     | 139 +++++++++++++++++++++++++++++++++++++++++--
>>>    hw/vfio/pci.c        |  21 +++++++
>>>    hw/vfio/trace-events |   2 +
>>>    3 files changed, 157 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 0cd7ef2139..e369d451e7 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>> *iotlb, void **vaddr,
>>>        return true;
>>>    }
>>>    +/* Propagate a guest IOTLB invalidation to the host (nested mode) */
>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>> *iotlb)
>>> +{
>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>> +    VFIOContainer *container = giommu->container;
>>> +    int ret;
>>> +
>>> +    assert(iotlb->perm == IOMMU_NONE);
>>> +
>>> +    ustruct.argsz = sizeof(ustruct);
>>> +    ustruct.flags = 0;
>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>> +
>>> +    switch (iotlb->granularity) {
>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>> +        break;
>>> +    case IOMMU_INV_GRAN_PASID:
>>> +    {
>>> +        struct iommu_inv_pasid_info *pasid_info;
>>> +        int archid = -1;
>>> +
>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>> +            archid = iotlb->arch_id;
>>> +        }
>>> +        pasid_info->archid = archid;
>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>> +        break;
>>> +    }
>>> +    case IOMMU_INV_GRAN_ADDR:
>>> +    {
>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>> +        struct iommu_inv_addr_info *addr_info;
>>> +        size_t size = iotlb->addr_mask + 1;
>>> +        int archid = -1;
>>> +
>>> +        addr_info = &ustruct.info.granu.addr_info;
>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>> +        if (iotlb->leaf) {
>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>> +        }
>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>> +            archid = iotlb->arch_id;
>>> +        }
>>> +        addr_info->archid = archid;
>>> +        addr_info->addr = start;
>>> +        addr_info->granule_size = size;
>>> +        addr_info->nb_granules = 1;
>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>> +                                        1, iotlb->leaf);
>>> +        break;
>>> +    }
>> Should we pass a size to  host kernel here, even if vSMMU doesn't support
>> RIL or guest kernel doesn't use RIL?
>>
>> It will cause TLBI issue in  this scenario: Guest kernel issues a TLBI cmd
>> without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed
>> the iova and size (4K) to host kernel. Finally, host kernel issues a
>> TLBI cmd
>> with "range" (4K) which can not invalidate the TLB entry of 2M huge page.
>> (pSMMU supports RIL)
> In that case the guest will loop over all 4K images belonging to the 2M
> huge page and invalidate each of them. This should turn into qemu
> notifications for each 4kB page, no? This is totally inefficient, hence
The guest will not loop over all 4K pages belonging to the 2M huge page.
The iommu_iotlb_gather->pgsize will be 2M if the page is a 2M huge page.
The gather->pgsize will be passed to __arm_smmu_tlb_inv_range as the "granule":

iommu_iotlb_gather_add_page
     iommu_iotlb_sync
         domain->ops->iotlb_sync
             arm_smmu_iotlb_sync
                 arm_smmu_tlb_inv_range_domain
                     __arm_smmu_tlb_inv_range

In the above mentioned scenario, the guest kernel will issue a TLBI cmd
with only the "iova" (tg = 0).

Thanks,
Kunkun Jiang
> the support of RIL on guest side and QEMU device.
>
> What do I miss?
>
> Thanks
>
> Eric
>> Thanks,
>> Kunkun Jiang
>>> +    }
>>> +
>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
>>> +    if (ret) {
>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>> container, ret);
>>> +    }
>>> +}
>>> +
>>>    static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>> *iotlb)
>>>    {
>>>        VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>> @@ -776,6 +843,35 @@ static void
>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>        }
>>>    }
>>>    +static void vfio_prereg_listener_region_add(MemoryListener *listener,
>>> +                                            MemoryRegionSection
>>> *section)
>>> +{
>>> +    VFIOContainer *container =
>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>> +    Error *err = NULL;
>>> +
>>> +    if (!memory_region_is_ram(section->mr)) {
>>> +        return;
>>> +    }
>>> +
>>> +    vfio_dma_map_ram_section(container, section, &err);
>>> +    if (err) {
>>> +        error_report_err(err);
>>> +    }
>>> +}
>>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>> +                                     MemoryRegionSection *section)
>>> +{
>>> +    VFIOContainer *container =
>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>> +
>>> +    if (!memory_region_is_ram(section->mr)) {
>>> +        return;
>>> +    }
>>> +
>>> +    vfio_dma_unmap_ram_section(container, section);
>>> +}
>>> +
>>>    static void vfio_listener_region_add(MemoryListener *listener,
>>>                                         MemoryRegionSection *section)
>>>    {
>>> @@ -879,9 +975,10 @@ static void
>>> vfio_listener_region_add(MemoryListener *listener,
>>>        memory_region_ref(section->mr);
>>>          if (memory_region_is_iommu(section->mr)) {
>>> +        IOMMUNotify notify;
>>>            VFIOGuestIOMMU *giommu;
>>>            IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
>>> -        int iommu_idx;
>>> +        int iommu_idx, flags;
>>>              trace_vfio_listener_region_add_iommu(iova, end);
>>>            /*
>>> @@ -900,8 +997,18 @@ static void
>>> vfio_listener_region_add(MemoryListener *listener,
>>>            llend = int128_sub(llend, int128_one());
>>>            iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>                                                          
>>> MEMTXATTRS_UNSPECIFIED);
>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>> +
>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>> invalidations */
>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>> +            notify = vfio_iommu_unmap_notify;
>>> +        } else {
>>> +            /* MAP/UNMAP IOTLB notifier */
>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>> +            notify = vfio_iommu_map_notify;
>>> +        }
>>> +
>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>                                section->offset_within_region,
>>>                                int128_get64(llend),
>>>                                iommu_idx);
>>> @@ -921,7 +1028,9 @@ static void
>>> vfio_listener_region_add(MemoryListener *listener,
>>>                goto fail;
>>>            }
>>>            QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>> giommu_next);
>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>> +        }
>>>              return;
>>>        }
>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>> vfio_memory_listener = {
>>>        .log_sync = vfio_listener_log_sync,
>>>    };
>>>    +static MemoryListener vfio_memory_prereg_listener = {
>>> +    .region_add = vfio_prereg_listener_region_add,
>>> +    .region_del = vfio_prereg_listener_region_del,
>>> +};
>>> +
>>>    static void vfio_listener_release(VFIOContainer *container)
>>>    {
>>>        memory_listener_unregister(&container->listener);
>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>            memory_listener_unregister(&container->prereg_listener);
>>>        }
>>>    }
>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>> *group, AddressSpace *as,
>>>                vfio_get_iommu_info_migration(container, info);
>>>            }
>>>            g_free(info);
>>> +
>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>> +            container->prereg_listener = vfio_memory_prereg_listener;
>>> +            memory_listener_register(&container->prereg_listener,
>>> +                                     &address_space_memory);
>>> +            if (container->error) {
>>> +                memory_listener_unregister(&container->prereg_listener);
>>> +                ret = -1;
>>> +                error_propagate_prepend(errp, container->error,
>>> +                                    "RAM memory listener
>>> initialization failed "
>>> +                                    "for container");
>>> +                goto free_container_exit;
>>> +            }
>>> +        }
>>>            break;
>>>        }
>>>        case VFIO_SPAPR_TCE_v2_IOMMU:
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 5c65aa0a98..cad7deec71 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -2773,6 +2773,25 @@ static void
>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>        vdev->req_enabled = false;
>>>    }
>>>    +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
>>> +                                      IOMMUConfig *config)
>>> +{
>>> +    PCIDevice *pdev = bus->devices[devfn];
>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>> +
>>> +    info.argsz = sizeof(info);
>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>> +    memcpy(&info.config, &config->pasid_cfg, sizeof(config->pasid_cfg));
>>> +
>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>> +}
>>> +
>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>> +};
>>> +
>>>    static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>    {
>>>        VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>> **errp)
>>>        vfio_register_req_notifier(vdev);
>>>        vfio_setup_resetfn_quirk(vdev);
>>>    +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>> +
>>>        return;
>>>      out_deregister:
>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>> index 936d29d150..43696afc15 100644
>>> --- a/hw/vfio/trace-events
>>> +++ b/hw/vfio/trace-events
>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>> int index, int nr_areas) "Devic
>>>    vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>    vfio_get_dev_region(const char *name, int index, uint32_t type,
>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>    vfio_dma_unmap_overflow_workaround(void) ""
>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size,
>>> uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d,
>>> addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64"
>>> leaf=%d"
>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>>      # platform.c
>>>    vfio_platform_base_device_init(char *name, int groupid) "%s belongs
>>> to group #%d"
>>
>>
> .




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-14  1:45       ` Kunkun Jiang
@ 2021-04-14  8:05         ` Auger Eric
  2021-04-15  2:03           ` Kunkun Jiang
  2021-04-26 12:30         ` Auger Eric
  2021-10-07 16:58         ` Eric Auger
  2 siblings, 1 reply; 46+ messages in thread
From: Auger Eric @ 2021-04-14  8:05 UTC (permalink / raw)
  To: Kunkun Jiang, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Kunkun,

On 4/14/21 3:45 AM, Kunkun Jiang wrote:
> On 2021/4/13 20:57, Auger Eric wrote:
>> Hi Kunkun,
>>
>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>> Hi Eric,
>>>
>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>> there is no "caching" mode and we do not trap on map.
>>>>
>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>> through the host single stage.
>>>>
>>>> With nested mode, we need to setup the stage 2 and the stage 1
>>>> separately. This patch introduces a prereg_listener to setup
>>>> the stage 2 mapping.
>>>>
>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>> when the guest invalidates the stage 1 configuration, through
>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>> are cascaded downto the host through another IOMMU MR UNMAP
>>>> notifier.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>
>>>> ---
>>>>
>>>> v7 -> v8:
>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>     propagate DOMAIN and PASID based invalidations
>>>>
>>>> v6 -> v7:
>>>> - remove PASID based invalidation
>>>>
>>>> v5 -> v6:
>>>> - add error_report_err()
>>>> - remove the abort in case of nested stage case
>>>>
>>>> v4 -> v5:
>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>> - use PCIPASIDOps for config notification
>>>>
>>>> v3 -> v4:
>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>
>>>> v2 -> v3:
>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>> - new user API
>>>> - handle leaf
>>>>
>>>> v1 -> v2:
>>>> - adapt to uapi changes
>>>> - pass the asid
>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>> ---
>>>>    hw/vfio/common.c     | 139
>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>    hw/vfio/pci.c        |  21 +++++++
>>>>    hw/vfio/trace-events |   2 +
>>>>    3 files changed, 157 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 0cd7ef2139..e369d451e7 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>> *iotlb, void **vaddr,
>>>>        return true;
>>>>    }
>>>>    +/* Propagate a guest IOTLB invalidation to the host (nested
>>>> mode) */
>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>> *iotlb)
>>>> +{
>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>> +    VFIOContainer *container = giommu->container;
>>>> +    int ret;
>>>> +
>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>> +
>>>> +    ustruct.argsz = sizeof(ustruct);
>>>> +    ustruct.flags = 0;
>>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>> +
>>>> +    switch (iotlb->granularity) {
>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>> +        break;
>>>> +    case IOMMU_INV_GRAN_PASID:
>>>> +    {
>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>> +        int archid = -1;
>>>> +
>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>> +            archid = iotlb->arch_id;
>>>> +        }
>>>> +        pasid_info->archid = archid;
>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>> +        break;
>>>> +    }
>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>> +    {
>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>> +        struct iommu_inv_addr_info *addr_info;
>>>> +        size_t size = iotlb->addr_mask + 1;
>>>> +        int archid = -1;
>>>> +
>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>> +        if (iotlb->leaf) {
>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>> +        }
>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>> +            archid = iotlb->arch_id;
>>>> +        }
>>>> +        addr_info->archid = archid;
>>>> +        addr_info->addr = start;
>>>> +        addr_info->granule_size = size;
>>>> +        addr_info->nb_granules = 1;
>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>> +                                        1, iotlb->leaf);
>>>> +        break;
>>>> +    }
>>> Should we pass a size to  host kernel here, even if vSMMU doesn't
>>> support
>>> RIL or guest kernel doesn't use RIL?
>>>
>>> It will cause TLBI issue in  this scenario: Guest kernel issues a
>>> TLBI cmd
>>> without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed
>>> the iova and size (4K) to host kernel. Finally, host kernel issues a
>>> TLBI cmd
>>> with "range" (4K) which can not invalidate the TLB entry of 2M huge
>>> page.
>>> (pSMMU supports RIL)
>> In that case the guest will loop over all 4K images belonging to the 2M
>> huge page and invalidate each of them. This should turn into qemu
>> notifications for each 4kB page, no? This is totally inefficient, hence
> The guest will not loop over all 4K images belonging to the 2M huge page.
> The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The
> gather->pgsize will be passed to __arm_smmu_tlb_inv_range as "granule":
> 
> iommu_iotlb_gather_add_page
>     iommu_iotlb_sync
>         domain->ops->iotlb_sync
>             arm_smmu_iotlb_sync
>                 arm_smmu_tlb_inv_range_domain
>                     __arm_smmu_tlb_inv_range
> 
> In the above mentioned scenario, guest kernel will issue a TLBI cmd only
> with
> "iova" (tg = 0).

OK I get it now. In that case I think I should do a TG=0 invalidation
on the host even if the host does support RIL. Does that sound correct?

The trouble is the uapi struct does not convey such info explicitly.
Maybe I should use nb_granules = 0 to detect that case.

I think for this use case RIL should be supported by the guest. Thoughts?

Thanks

Eric
> 
> Thanks,
> Kunkun Jiang
>> the support of RIL on guest side and QEMU device.
>>
>> What do I miss?
>>
>> Thanks
>>
>> Eric
>>> Thanks,
>>> Kunkun Jiang
>>>> +    }
>>>> +
>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
>>>> +    if (ret) {
>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>> container, ret);
>>>> +    }
>>>> +}
>>>> +
>>>>    static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>> *iotlb)
>>>>    {
>>>>        VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>> @@ -776,6 +843,35 @@ static void
>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>        }
>>>>    }
>>>>    +static void vfio_prereg_listener_region_add(MemoryListener
>>>> *listener,
>>>> +                                            MemoryRegionSection
>>>> *section)
>>>> +{
>>>> +    VFIOContainer *container =
>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>> +    Error *err = NULL;
>>>> +
>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>> +    if (err) {
>>>> +        error_report_err(err);
>>>> +    }
>>>> +}
>>>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>>> +                                     MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container =
>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>> +
>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>> +}
>>>> +
>>>>    static void vfio_listener_region_add(MemoryListener *listener,
>>>>                                         MemoryRegionSection *section)
>>>>    {
>>>> @@ -879,9 +975,10 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>        memory_region_ref(section->mr);
>>>>          if (memory_region_is_iommu(section->mr)) {
>>>> +        IOMMUNotify notify;
>>>>            VFIOGuestIOMMU *giommu;
>>>>            IOMMUMemoryRegion *iommu_mr =
>>>> IOMMU_MEMORY_REGION(section->mr);
>>>> -        int iommu_idx;
>>>> +        int iommu_idx, flags;
>>>>              trace_vfio_listener_region_add_iommu(iova, end);
>>>>            /*
>>>> @@ -900,8 +997,18 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>            llend = int128_sub(llend, int128_one());
>>>>            iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>                                                         
>>>> MEMTXATTRS_UNSPECIFIED);
>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>> +
>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>> invalidations */
>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>> +            notify = vfio_iommu_unmap_notify;
>>>> +        } else {
>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>> +            notify = vfio_iommu_map_notify;
>>>> +        }
>>>> +
>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>                                section->offset_within_region,
>>>>                                int128_get64(llend),
>>>>                                iommu_idx);
>>>> @@ -921,7 +1028,9 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>                goto fail;
>>>>            }
>>>>            QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>> giommu_next);
>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>> +        }
>>>>              return;
>>>>        }
>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>> vfio_memory_listener = {
>>>>        .log_sync = vfio_listener_log_sync,
>>>>    };
>>>>    +static MemoryListener vfio_memory_prereg_listener = {
>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>> +};
>>>> +
>>>>    static void vfio_listener_release(VFIOContainer *container)
>>>>    {
>>>>        memory_listener_unregister(&container->listener);
>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>            memory_listener_unregister(&container->prereg_listener);
>>>>        }
>>>>    }
>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>> *group, AddressSpace *as,
>>>>                vfio_get_iommu_info_migration(container, info);
>>>>            }
>>>>            g_free(info);
>>>> +
>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>> +            container->prereg_listener = vfio_memory_prereg_listener;
>>>> +            memory_listener_register(&container->prereg_listener,
>>>> +                                     &address_space_memory);
>>>> +            if (container->error) {
>>>> +               
>>>> memory_listener_unregister(&container->prereg_listener);
>>>> +                ret = -1;
>>>> +                error_propagate_prepend(errp, container->error,
>>>> +                                    "RAM memory listener
>>>> initialization failed "
>>>> +                                    "for container");
>>>> +                goto free_container_exit;
>>>> +            }
>>>> +        }
>>>>            break;
>>>>        }
>>>>        case VFIO_SPAPR_TCE_v2_IOMMU:
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 5c65aa0a98..cad7deec71 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -2773,6 +2773,25 @@ static void
>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>        vdev->req_enabled = false;
>>>>    }
>>>>    +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
>>>> +                                      IOMMUConfig *config)
>>>> +{
>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>> +
>>>> +    info.argsz = sizeof(info);
>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>> +    memcpy(&info.config, &config->pasid_cfg,
>>>> sizeof(config->pasid_cfg));
>>>> +
>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>>> +}
>>>> +
>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>> +};
>>>> +
>>>>    static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>    {
>>>>        VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>>> **errp)
>>>>        vfio_register_req_notifier(vdev);
>>>>        vfio_setup_resetfn_quirk(vdev);
>>>>    +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>> +
>>>>        return;
>>>>      out_deregister:
>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>> index 936d29d150..43696afc15 100644
>>>> --- a/hw/vfio/trace-events
>>>> +++ b/hw/vfio/trace-events
>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>>> int index, int nr_areas) "Devic
>>>>    vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>    vfio_get_dev_region(const char *name, int index, uint32_t type,
>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>    vfio_dma_unmap_overflow_workaround(void) ""
>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size,
>>>> uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d,
>>>> addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64"
>>>> leaf=%d"
>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>>>      # platform.c
>>>>    vfio_platform_base_device_init(char *name, int groupid) "%s belongs
>>>> to group #%d"
>>>
>>>
>> .
> 
> 



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-14  8:05         ` Auger Eric
@ 2021-04-15  2:03           ` Kunkun Jiang
  2021-04-26 19:16             ` Auger Eric
  0 siblings, 1 reply; 46+ messages in thread
From: Kunkun Jiang @ 2021-04-15  2:03 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Eric,

On 2021/4/14 16:05, Auger Eric wrote:
> Hi Kunkun,
>
> On 4/14/21 3:45 AM, Kunkun Jiang wrote:
>> On 2021/4/13 20:57, Auger Eric wrote:
>>> Hi Kunkun,
>>>
>>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>>> Hi Eric,
>>>>
>>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>>> there is no "caching" mode and we do not trap on map.
>>>>>
>>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>>> through the host single stage.
>>>>>
>>>>> With nested mode, we need to set up the stage 2 and the stage 1
>>>>> separately. This patch introduces a prereg_listener to setup
>>>>> the stage 2 mapping.
>>>>>
>>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>>> when the guest invalidates the stage 1 configuration, through
>>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>>> are cascaded down to the host through another IOMMU MR UNMAP
>>>>> notifier.
>>>>>
>>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>>
>>>>> ---
>>>>>
>>>>> v7 -> v8:
>>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>>      propagate DOMAIN and PASID based invalidations
>>>>>
>>>>> v6 -> v7:
>>>>> - remove PASID based invalidation
>>>>>
>>>>> v5 -> v6:
>>>>> - add error_report_err()
>>>>> - remove the abort in case of nested stage case
>>>>>
>>>>> v4 -> v5:
>>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>>> - use PCIPASIDOps for config notification
>>>>>
>>>>> v3 -> v4:
>>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>>
>>>>> v2 -> v3:
>>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>>> - new user API
>>>>> - handle leaf
>>>>>
>>>>> v1 -> v2:
>>>>> - adapt to uapi changes
>>>>> - pass the asid
>>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>>> ---
>>>>>     hw/vfio/common.c     | 139
>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>     hw/vfio/pci.c        |  21 +++++++
>>>>>     hw/vfio/trace-events |   2 +
>>>>>     3 files changed, 157 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>> index 0cd7ef2139..e369d451e7 100644
>>>>> --- a/hw/vfio/common.c
>>>>> +++ b/hw/vfio/common.c
>>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>>> *iotlb, void **vaddr,
>>>>>         return true;
>>>>>     }
>>>>>     +/* Propagate a guest IOTLB invalidation to the host (nested
>>>>> mode) */
>>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>> *iotlb)
>>>>> +{
>>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>>> +    VFIOContainer *container = giommu->container;
>>>>> +    int ret;
>>>>> +
>>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>>> +
>>>>> +    ustruct.argsz = sizeof(ustruct);
>>>>> +    ustruct.flags = 0;
>>>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>>> +
>>>>> +    switch (iotlb->granularity) {
>>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>>> +        break;
>>>>> +    case IOMMU_INV_GRAN_PASID:
>>>>> +    {
>>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>>> +        int archid = -1;
>>>>> +
>>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>> +            archid = iotlb->arch_id;
>>>>> +        }
>>>>> +        pasid_info->archid = archid;
>>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>>> +        break;
>>>>> +    }
>>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>>> +    {
>>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>>> +        struct iommu_inv_addr_info *addr_info;
>>>>> +        size_t size = iotlb->addr_mask + 1;
>>>>> +        int archid = -1;
>>>>> +
>>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>>> +        if (iotlb->leaf) {
>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>>> +        }
>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>> +            archid = iotlb->arch_id;
>>>>> +        }
>>>>> +        addr_info->archid = archid;
>>>>> +        addr_info->addr = start;
>>>>> +        addr_info->granule_size = size;
>>>>> +        addr_info->nb_granules = 1;
>>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>>> +                                        1, iotlb->leaf);
>>>>> +        break;
>>>>> +    }
>>>> Should we pass a size to the host kernel here, even if the vSMMU
>>>> doesn't support RIL or the guest kernel doesn't use RIL?
>>>>
>>>> Otherwise it causes a TLBI issue in this scenario: the guest kernel
>>>> issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge
>>>> page. QEMU then passes the iova and size (4K) to the host kernel.
>>>> Finally, the host kernel issues a TLBI cmd with "range" (4K), which
>>>> cannot invalidate the TLB entry of the 2M huge page.
>>>> (pSMMU supports RIL)
>>> In that case the guest will loop over all 4K pages belonging to the 2M
>>> huge page and invalidate each of them. This should turn into QEMU
>>> notifications for each 4kB page, no? This is totally inefficient, hence
>> The guest will not loop over all 4K pages belonging to the 2M huge page.
>> The iommu_iotlb_gather->pgsize will be 2M if the page is a 2M huge page. The
>> gather->pgsize will be passed to __arm_smmu_tlb_inv_range as "granule":
>>
>> iommu_iotlb_gather_add_page
>>      iommu_iotlb_sync
>>          domain->ops->iotlb_sync
>>              arm_smmu_iotlb_sync
>>                  arm_smmu_tlb_inv_range_domain
>>                      __arm_smmu_tlb_inv_range
>>
>> In the above-mentioned scenario, the guest kernel will issue a TLBI cmd
>> only with "iova" (tg = 0).
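[Editor's note: to make the failure mode concrete, here is a small hypothetical C model, not QEMU or kernel code, of a TLB that caches one entry per translation granule. In this simplified model a ranged invalidation only drops an entry when the range covers the whole cached granule, so a 4K-range TLBI leaves a cached 2M block translation in place, mirroring the mismatch described above. All names here are invented for illustration.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy TLB entry: one cached translation per granule (4K page or 2M block). */
typedef struct {
    uint64_t iova;   /* base of the cached translation */
    uint64_t size;   /* granule size actually cached */
    bool     valid;
} ToyTLBEntry;

/*
 * Ranged invalidation: an entry is dropped only if [start, start + len)
 * covers the entire cached granule. A 4K range therefore cannot evict a
 * 2M block entry, which is the problem when QEMU forwards a 4K size for
 * a guest TLBI that carried no range information (tg = 0).
 */
static void toy_tlb_inv_range(ToyTLBEntry *tlb, int n,
                              uint64_t start, uint64_t len)
{
    for (int i = 0; i < n; i++) {
        if (tlb[i].valid &&
            start <= tlb[i].iova &&
            tlb[i].iova + tlb[i].size <= start + len) {
            tlb[i].valid = false;
        }
    }
}
```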
> OK, I get it now. In that case I think I should do a TG=0 invalidation
> on the host even if the host supports RIL. Does that sound correct?
Yeah, that's what I thought.
> The trouble is the uapi struct does not convey such info explicitly.
> Maybe I should use nb_granules = 0 to detect that case.
It is troublesome to correct this issue. Using nb_granules = 0 may be
a good choice.
> I think for this use case RIL should be supported by the guest. Thoughts?
Yes. If guest supports RIL, the scenario mentioned above does not exist.
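[Editor's note: as a rough illustration of the nb_granules = 0 idea floated above, the QEMU side could fill the address-granularity invalidation info along these lines. This is a hypothetical sketch only: the struct is a stand-in for the uapi's addr_info fields, and guest_used_range is an assumed flag derived from the guest TLBI (TG != 0); it is not the actual patch code.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the uapi addr_info fields used in the patch. */
struct addr_inv_info {
    uint64_t addr;
    uint64_t granule_size;
    uint64_t nb_granules;   /* 0 would mean "no range info from the guest" */
};

/*
 * Sketch: when the guest issued a TG=0 (non-range) TLBI, signal that with
 * nb_granules = 0 so the host can emit a non-range, by-VA TLBI even if
 * the physical SMMU supports RIL; otherwise pass the range through.
 */
static void fill_addr_inv(struct addr_inv_info *info, uint64_t iova,
                          size_t size, bool guest_used_range)
{
    info->addr = iova;
    if (guest_used_range) {
        info->granule_size = size;
        info->nb_granules = 1;
    } else {
        info->granule_size = 0;
        info->nb_granules = 0;   /* host should not assume a 4K granule */
    }
}
```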

Thanks,
Kunkun Jiang
>
> Thanks
>
> Eric
>> Thanks,
>> Kunkun Jiang
>>> the support of RIL on guest side and QEMU device.
>>>
>>> What do I miss?
>>>
>>> Thanks
>>>
>>> Eric
>>>> Thanks,
>>>> Kunkun Jiang
>>>>> +    }
>>>>> +
>>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
>>>>> +    if (ret) {
>>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>>> container, ret);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>     static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>> *iotlb)
>>>>>     {
>>>>>         VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>> @@ -776,6 +843,35 @@ static void
>>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>>         }
>>>>>     }
>>>>>     +static void vfio_prereg_listener_region_add(MemoryListener
>>>>> *listener,
>>>>> +                                            MemoryRegionSection
>>>>> *section)
>>>>> +{
>>>>> +    VFIOContainer *container =
>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>> +    Error *err = NULL;
>>>>> +
>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>>> +    if (err) {
>>>>> +        error_report_err(err);
>>>>> +    }
>>>>> +}
>>>>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>>>> +                                     MemoryRegionSection *section)
>>>>> +{
>>>>> +    VFIOContainer *container =
>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>> +
>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>>> +}
>>>>> +
>>>>>     static void vfio_listener_region_add(MemoryListener *listener,
>>>>>                                          MemoryRegionSection *section)
>>>>>     {
>>>>> @@ -879,9 +975,10 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>         memory_region_ref(section->mr);
>>>>>           if (memory_region_is_iommu(section->mr)) {
>>>>> +        IOMMUNotify notify;
>>>>>             VFIOGuestIOMMU *giommu;
>>>>>             IOMMUMemoryRegion *iommu_mr =
>>>>> IOMMU_MEMORY_REGION(section->mr);
>>>>> -        int iommu_idx;
>>>>> +        int iommu_idx, flags;
>>>>>               trace_vfio_listener_region_add_iommu(iova, end);
>>>>>             /*
>>>>> @@ -900,8 +997,18 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>             llend = int128_sub(llend, int128_one());
>>>>>             iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>>                                                          
>>>>> MEMTXATTRS_UNSPECIFIED);
>>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>>> +
>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>>> invalidations */
>>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>>> +            notify = vfio_iommu_unmap_notify;
>>>>> +        } else {
>>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>>> +            notify = vfio_iommu_map_notify;
>>>>> +        }
>>>>> +
>>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>>                                 section->offset_within_region,
>>>>>                                 int128_get64(llend),
>>>>>                                 iommu_idx);
>>>>> @@ -921,7 +1028,9 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>                 goto fail;
>>>>>             }
>>>>>             QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>>> giommu_next);
>>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>> +        }
>>>>>               return;
>>>>>         }
>>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>>> vfio_memory_listener = {
>>>>>         .log_sync = vfio_listener_log_sync,
>>>>>     };
>>>>>     +static MemoryListener vfio_memory_prereg_listener = {
>>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>>> +};
>>>>> +
>>>>>     static void vfio_listener_release(VFIOContainer *container)
>>>>>     {
>>>>>         memory_listener_unregister(&container->listener);
>>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>             memory_listener_unregister(&container->prereg_listener);
>>>>>         }
>>>>>     }
>>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>>> *group, AddressSpace *as,
>>>>>                 vfio_get_iommu_info_migration(container, info);
>>>>>             }
>>>>>             g_free(info);
>>>>> +
>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>> +            container->prereg_listener = vfio_memory_prereg_listener;
>>>>> +            memory_listener_register(&container->prereg_listener,
>>>>> +                                     &address_space_memory);
>>>>> +            if (container->error) {
>>>>> +
>>>>> memory_listener_unregister(&container->prereg_listener);
>>>>> +                ret = -1;
>>>>> +                error_propagate_prepend(errp, container->error,
>>>>> +                                    "RAM memory listener
>>>>> initialization failed "
>>>>> +                                    "for container");
>>>>> +                goto free_container_exit;
>>>>> +            }
>>>>> +        }
>>>>>             break;
>>>>>         }
>>>>>         case VFIO_SPAPR_TCE_v2_IOMMU:
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index 5c65aa0a98..cad7deec71 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -2773,6 +2773,25 @@ static void
>>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>>         vdev->req_enabled = false;
>>>>>     }
>>>>>     +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
>>>>> +                                      IOMMUConfig *config)
>>>>> +{
>>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>>> +
>>>>> +    info.argsz = sizeof(info);
>>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>>> +    memcpy(&info.config, &config->pasid_cfg,
>>>>> sizeof(config->pasid_cfg));
>>>>> +
>>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>>>> +}
>>>>> +
>>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>>> +};
>>>>> +
>>>>>     static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>>     {
>>>>>         VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>>>> **errp)
>>>>>         vfio_register_req_notifier(vdev);
>>>>>         vfio_setup_resetfn_quirk(vdev);
>>>>>     +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>>> +
>>>>>         return;
>>>>>       out_deregister:
>>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>>> index 936d29d150..43696afc15 100644
>>>>> --- a/hw/vfio/trace-events
>>>>> +++ b/hw/vfio/trace-events
>>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>>>> int index, int nr_areas) "Devic
>>>>>     vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>>     vfio_get_dev_region(const char *name, int index, uint32_t type,
>>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>>     vfio_dma_unmap_overflow_workaround(void) ""
>>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size,
>>>>> uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d,
>>>>> addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64"
>>>>> leaf=%d"
>>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>>>>       # platform.c
>>>>>     vfio_platform_base_device_init(char *name, int groupid) "%s belongs
>>>>> to group #%d"
>>>>
>>> .
>>
> .




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-14  1:45       ` Kunkun Jiang
  2021-04-14  8:05         ` Auger Eric
@ 2021-04-26 12:30         ` Auger Eric
  2021-04-27  8:58           ` Kunkun Jiang
  2021-10-07 16:58         ` Eric Auger
  2 siblings, 1 reply; 46+ messages in thread
From: Auger Eric @ 2021-04-26 12:30 UTC (permalink / raw)
  To: Kunkun Jiang, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Kunkun,

On 4/14/21 3:45 AM, Kunkun Jiang wrote:
> On 2021/4/13 20:57, Auger Eric wrote:
>> Hi Kunkun,
>>
>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>> Hi Eric,
>>>
>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>> there is no "caching" mode and we do not trap on map.
>>>>
>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>> through the host single stage.
>>>>
>>>> With nested mode, we need to set up the stage 2 and the stage 1
>>>> separately. This patch introduces a prereg_listener to setup
>>>> the stage 2 mapping.
>>>>
>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>> when the guest invalidates the stage 1 configuration, through
>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>> are cascaded down to the host through another IOMMU MR UNMAP
>>>> notifier.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>
>>>> ---
>>>>
>>>> v7 -> v8:
>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>     propagate DOMAIN and PASID based invalidations
>>>>
>>>> v6 -> v7:
>>>> - remove PASID based invalidation
>>>>
>>>> v5 -> v6:
>>>> - add error_report_err()
>>>> - remove the abort in case of nested stage case
>>>>
>>>> v4 -> v5:
>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>> - use PCIPASIDOps for config notification
>>>>
>>>> v3 -> v4:
>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>
>>>> v2 -> v3:
>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>> - new user API
>>>> - handle leaf
>>>>
>>>> v1 -> v2:
>>>> - adapt to uapi changes
>>>> - pass the asid
>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>> ---
>>>>    hw/vfio/common.c     | 139
>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>    hw/vfio/pci.c        |  21 +++++++
>>>>    hw/vfio/trace-events |   2 +
>>>>    3 files changed, 157 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 0cd7ef2139..e369d451e7 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>> *iotlb, void **vaddr,
>>>>        return true;
>>>>    }
>>>>    +/* Propagate a guest IOTLB invalidation to the host (nested
>>>> mode) */
>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>> *iotlb)
>>>> +{
>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>> +    VFIOContainer *container = giommu->container;
>>>> +    int ret;
>>>> +
>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>> +
>>>> +    ustruct.argsz = sizeof(ustruct);
>>>> +    ustruct.flags = 0;
>>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>> +
>>>> +    switch (iotlb->granularity) {
>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>> +        break;
>>>> +    case IOMMU_INV_GRAN_PASID:
>>>> +    {
>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>> +        int archid = -1;
>>>> +
>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>> +            archid = iotlb->arch_id;
>>>> +        }
>>>> +        pasid_info->archid = archid;
>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>> +        break;
>>>> +    }
>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>> +    {
>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>> +        struct iommu_inv_addr_info *addr_info;
>>>> +        size_t size = iotlb->addr_mask + 1;
>>>> +        int archid = -1;
>>>> +
>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>> +        if (iotlb->leaf) {
>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>> +        }
>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>> +            archid = iotlb->arch_id;
>>>> +        }
>>>> +        addr_info->archid = archid;
>>>> +        addr_info->addr = start;
>>>> +        addr_info->granule_size = size;
>>>> +        addr_info->nb_granules = 1;
>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>> +                                        1, iotlb->leaf);
>>>> +        break;
>>>> +    }
>>> Should we pass a size to the host kernel here, even if the vSMMU
>>> doesn't support RIL or the guest kernel doesn't use RIL?
>>>
>>> Otherwise it causes a TLBI issue in this scenario: the guest kernel
>>> issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge
>>> page. QEMU then passes the iova and size (4K) to the host kernel.
>>> Finally, the host kernel issues a TLBI cmd with "range" (4K), which
>>> cannot invalidate the TLB entry of the 2M huge page.
>>> (pSMMU supports RIL)
>> In that case the guest will loop over all 4K pages belonging to the 2M
>> huge page and invalidate each of them. This should turn into QEMU
>> notifications for each 4kB page, no? This is totally inefficient, hence
> The guest will not loop over all 4K pages belonging to the 2M huge page.
> The iommu_iotlb_gather->pgsize will be 2M if the page is a 2M huge page. The
> gather->pgsize will be passed to __arm_smmu_tlb_inv_range as "granule":
> 
> iommu_iotlb_gather_add_page
>     iommu_iotlb_sync
>         domain->ops->iotlb_sync
>             arm_smmu_iotlb_sync
>                 arm_smmu_tlb_inv_range_domain
>                     __arm_smmu_tlb_inv_range
> 
> In the above-mentioned scenario, the guest kernel will issue a TLBI cmd
> only with "iova" (tg = 0).

I am busy fixing this case. Could you share your guest use case? Is it
DPDK? If so, I would be interested in getting your DPDK setup/commands.

Thank you in advance

Eric
> 
> Thanks,
> Kunkun Jiang
>> the support of RIL on guest side and QEMU device.
>>
>> What do I miss?
>>
>> Thanks
>>
>> Eric
>>> Thanks,
>>> Kunkun Jiang
>>>> +    }
>>>> +
>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
>>>> +    if (ret) {
>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>> container, ret);
>>>> +    }
>>>> +}
>>>> +
>>>>    static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>> *iotlb)
>>>>    {
>>>>        VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>> @@ -776,6 +843,35 @@ static void
>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>        }
>>>>    }
>>>>    +static void vfio_prereg_listener_region_add(MemoryListener
>>>> *listener,
>>>> +                                            MemoryRegionSection
>>>> *section)
>>>> +{
>>>> +    VFIOContainer *container =
>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>> +    Error *err = NULL;
>>>> +
>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>> +    if (err) {
>>>> +        error_report_err(err);
>>>> +    }
>>>> +}
>>>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>>> +                                     MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container =
>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>> +
>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>> +}
>>>> +
>>>>    static void vfio_listener_region_add(MemoryListener *listener,
>>>>                                         MemoryRegionSection *section)
>>>>    {
>>>> @@ -879,9 +975,10 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>        memory_region_ref(section->mr);
>>>>          if (memory_region_is_iommu(section->mr)) {
>>>> +        IOMMUNotify notify;
>>>>            VFIOGuestIOMMU *giommu;
>>>>            IOMMUMemoryRegion *iommu_mr =
>>>> IOMMU_MEMORY_REGION(section->mr);
>>>> -        int iommu_idx;
>>>> +        int iommu_idx, flags;
>>>>              trace_vfio_listener_region_add_iommu(iova, end);
>>>>            /*
>>>> @@ -900,8 +997,18 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>            llend = int128_sub(llend, int128_one());
>>>>            iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>                                                         
>>>> MEMTXATTRS_UNSPECIFIED);
>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>> +
>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>> invalidations */
>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>> +            notify = vfio_iommu_unmap_notify;
>>>> +        } else {
>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>> +            notify = vfio_iommu_map_notify;
>>>> +        }
>>>> +
>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>                                section->offset_within_region,
>>>>                                int128_get64(llend),
>>>>                                iommu_idx);
>>>> @@ -921,7 +1028,9 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>                goto fail;
>>>>            }
>>>>            QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>> giommu_next);
>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>> +        }
>>>>              return;
>>>>        }
>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>> vfio_memory_listener = {
>>>>        .log_sync = vfio_listener_log_sync,
>>>>    };
>>>>    +static MemoryListener vfio_memory_prereg_listener = {
>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>> +};
>>>> +
>>>>    static void vfio_listener_release(VFIOContainer *container)
>>>>    {
>>>>        memory_listener_unregister(&container->listener);
>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>            memory_listener_unregister(&container->prereg_listener);
>>>>        }
>>>>    }
>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>> *group, AddressSpace *as,
>>>>                vfio_get_iommu_info_migration(container, info);
>>>>            }
>>>>            g_free(info);
>>>> +
>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>> +            container->prereg_listener = vfio_memory_prereg_listener;
>>>> +            memory_listener_register(&container->prereg_listener,
>>>> +                                     &address_space_memory);
>>>> +            if (container->error) {
>>>> +               
>>>> memory_listener_unregister(&container->prereg_listener);
>>>> +                ret = -1;
>>>> +                error_propagate_prepend(errp, container->error,
>>>> +                                    "RAM memory listener
>>>> initialization failed "
>>>> +                                    "for container");
>>>> +                goto free_container_exit;
>>>> +            }
>>>> +        }
>>>>            break;
>>>>        }
>>>>        case VFIO_SPAPR_TCE_v2_IOMMU:
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 5c65aa0a98..cad7deec71 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -2773,6 +2773,25 @@ static void
>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>        vdev->req_enabled = false;
>>>>    }
>>>>    +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
>>>> +                                      IOMMUConfig *config)
>>>> +{
>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>> +
>>>> +    info.argsz = sizeof(info);
>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>> +    memcpy(&info.config, &config->pasid_cfg, sizeof(config->pasid_cfg));
>>>> +
>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>>> +}
>>>> +
>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>> +};
>>>> +
>>>>    static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>    {
>>>>        VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>>> **errp)
>>>>        vfio_register_req_notifier(vdev);
>>>>        vfio_setup_resetfn_quirk(vdev);
>>>>    +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>> +
>>>>        return;
>>>>      out_deregister:
>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>> index 936d29d150..43696afc15 100644
>>>> --- a/hw/vfio/trace-events
>>>> +++ b/hw/vfio/trace-events
>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>>> int index, int nr_areas) "Devic
>>>>    vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>    vfio_get_dev_region(const char *name, int index, uint32_t type,
>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>    vfio_dma_unmap_overflow_workaround(void) ""
>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>>>      # platform.c
>>>>    vfio_platform_base_device_init(char *name, int groupid) "%s belongs
>>>> to group #%d"


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-15  2:03           ` Kunkun Jiang
@ 2021-04-26 19:16             ` Auger Eric
  2021-04-28  9:51               ` Kunkun Jiang
  0 siblings, 1 reply; 46+ messages in thread
From: Auger Eric @ 2021-04-26 19:16 UTC (permalink / raw)
  To: Kunkun Jiang, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Kunkun,

On 4/15/21 4:03 AM, Kunkun Jiang wrote:
> Hi Eric,
> 
> On 2021/4/14 16:05, Auger Eric wrote:
>> Hi Kunkun,
>>
>> On 4/14/21 3:45 AM, Kunkun Jiang wrote:
>>> On 2021/4/13 20:57, Auger Eric wrote:
>>>> Hi Kunkun,
>>>>
>>>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>>>> Hi Eric,
>>>>>
>>>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>>>> there is no "caching" mode and we do not trap on map.
>>>>>>
>>>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>>>> through the host single stage.
>>>>>>
>>>>>> With nested mode, we need to setup the stage 2 and the stage 1
>>>>>> separately. This patch introduces a prereg_listener to setup
>>>>>> the stage 2 mapping.
>>>>>>
>>>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>>>> when the guest invalidates the stage 1 configuration, through
>>>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>>>> are cascaded down to the host through another IOMMU MR UNMAP
>>>>>> notifier.
>>>>>>
>>>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> v7 -> v8:
>>>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>>>      propagate DOMAIN and PASID based invalidations
>>>>>>
>>>>>> v6 -> v7:
>>>>>> - remove PASID based invalidation
>>>>>>
>>>>>> v5 -> v6:
>>>>>> - add error_report_err()
>>>>>> - remove the abort in case of nested stage case
>>>>>>
>>>>>> v4 -> v5:
>>>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>>>> - use PCIPASIDOps for config notification
>>>>>>
>>>>>> v3 -> v4:
>>>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>>>
>>>>>> v2 -> v3:
>>>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>>>> - new user API
>>>>>> - handle leaf
>>>>>>
>>>>>> v1 -> v2:
>>>>>> - adapt to uapi changes
>>>>>> - pass the asid
>>>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>>>> ---
>>>>>>     hw/vfio/common.c     | 139
>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>     hw/vfio/pci.c        |  21 +++++++
>>>>>>     hw/vfio/trace-events |   2 +
>>>>>>     3 files changed, 157 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>> index 0cd7ef2139..e369d451e7 100644
>>>>>> --- a/hw/vfio/common.c
>>>>>> +++ b/hw/vfio/common.c
>>>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>>>> *iotlb, void **vaddr,
>>>>>>         return true;
>>>>>>     }
>>>>>>     +/* Propagate a guest IOTLB invalidation to the host (nested
>>>>>> mode) */
>>>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>>> *iotlb)
>>>>>> +{
>>>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>>>> +    VFIOContainer *container = giommu->container;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>>>> +
>>>>>> +    ustruct.argsz = sizeof(ustruct);
>>>>>> +    ustruct.flags = 0;
>>>>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>>>> +
>>>>>> +    switch (iotlb->granularity) {
>>>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>>>> +        break;
>>>>>> +    case IOMMU_INV_GRAN_PASID:
>>>>>> +    {
>>>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>>>> +        int archid = -1;
>>>>>> +
>>>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>>> +            archid = iotlb->arch_id;
>>>>>> +        }
>>>>>> +        pasid_info->archid = archid;
>>>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>>>> +        break;
>>>>>> +    }
>>>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>>>> +    {
>>>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>>>> +        struct iommu_inv_addr_info *addr_info;
>>>>>> +        size_t size = iotlb->addr_mask + 1;
>>>>>> +        int archid = -1;
>>>>>> +
>>>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>>>> +        if (iotlb->leaf) {
>>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>>>> +        }
>>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>>> +            archid = iotlb->arch_id;
>>>>>> +        }
>>>>>> +        addr_info->archid = archid;
>>>>>> +        addr_info->addr = start;
>>>>>> +        addr_info->granule_size = size;
>>>>>> +        addr_info->nb_granules = 1;
>>>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>>>> +                                        1, iotlb->leaf);
>>>>>> +        break;
>>>>>> +    }
>>>>> Should we pass a size to  host kernel here, even if vSMMU doesn't
>>>>> support
>>>>> RIL or guest kernel doesn't use RIL?
>>>>>
>>>>> It will cause TLBI issue in  this scenario: Guest kernel issues a
>>>>> TLBI cmd
>>>>> without "range" (tg = 0) to invalidate a 2M huge page. Then qemu
>>>>> passed
>>>>> the iova and size (4K) to host kernel. Finally, host kernel issues a
>>>>> TLBI cmd
>>>>> with "range" (4K) which can not invalidate the TLB entry of 2M huge
>>>>> page.
>>>>> (pSMMU supports RIL)
>>>> In that case the guest will loop over all 4K images belonging to the 2M
>>>> huge page and invalidate each of them. This should turn into qemu
>>>> notifications for each 4kB page, no? This is totally inefficient, hence
>>> The guest will not loop over all 4K images belonging to the 2M huge
>>> page.
>>> The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page.
>>> The
>>> gather->pgsize will be passed to __arm_smmu_tlb_inv_range as "granule":
>>>
>>> iommu_iotlb_gather_add_page
>>>      iommu_iotlb_sync
>>>          domain->ops->iotlb_sync
>>>              arm_smmu_iotlb_sync
>>>                  arm_smmu_tlb_inv_range_domain
>>>                      __arm_smmu_tlb_inv_range
>>>
>>> In the above mentioned scenario, guest kernel will issue a TLBI cmd only
>>> with
>>> "iova" (tg = 0).
>> OK I get it now. In that case I think I should do a TG=0 invalidation
>> on host even if the host does support RIL. Does that sound correct?
> Yeah, that's what I thought.
>> The trouble is the uapi struct does not convey such info explicitly.
>> Maybe I should use nb_granules = 0 to detect that case.
> It is troublesome to correct this issue. Using nb_granules = 0 may be
> a good choice.
>> I think for this use case RIL should be supported by the guest. Thoughts?
> Yes. If guest supports RIL, the scenario mentioned above does not exist.

After further study I really wonder if it is worth supporting the case
where the guest does not use range invalidation. Passing nb_granules = 0
(i.e. no size) would be OK to perform the TG=0 invalidation on the
host. However, host arm_smmu_tlb_inv_range_domain() then calls
arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size), which needs a size.
We would need to trap guest CMD_ATC_INV and use different code paths in
the host SMMU driver to cascade the various cache invalidations. At the
moment, without maintainer guidance, I am a bit reluctant to add this
extra complexity.

So I would be inclined to make QEMU fail if we detect the guest is
issuing TG=0 invalidations. Recent guest kernels (with range
invalidation support) would then be a prerequisite for this use case.
What do you think?

Thanks

Eric
> 
> Thanks,
> Kunkun Jiang
>>
>> Thanks
>>
>> Eric
>>> Thanks,
>>> Kunkun Jiang
>>>> the support of RIL on guest side and QEMU device.
>>>>
>>>> What do I miss?
>>>>
>>>> Thanks
>>>>
>>>> Eric
>>>>> Thanks,
>>>>> Kunkun Jiang
>>>>>> +    }
>>>>>> +
>>>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE,
>>>>>> &ustruct);
>>>>>> +    if (ret) {
>>>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>>>> container, ret);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>>     static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>>> *iotlb)
>>>>>>     {
>>>>>>         VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>>> @@ -776,6 +843,35 @@ static void
>>>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>>>         }
>>>>>>     }
>>>>>>     +static void vfio_prereg_listener_region_add(MemoryListener
>>>>>> *listener,
>>>>>> +                                            MemoryRegionSection
>>>>>> *section)
>>>>>> +{
>>>>>> +    VFIOContainer *container =
>>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>>> +    Error *err = NULL;
>>>>>> +
>>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>>>> +    if (err) {
>>>>>> +        error_report_err(err);
>>>>>> +    }
>>>>>> +}
>>>>>> +static void vfio_prereg_listener_region_del(MemoryListener
>>>>>> *listener,
>>>>>> +                                     MemoryRegionSection *section)
>>>>>> +{
>>>>>> +    VFIOContainer *container =
>>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>>> +
>>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>>>> +}
>>>>>> +
>>>>>>     static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>                                          MemoryRegionSection
>>>>>> *section)
>>>>>>     {
>>>>>> @@ -879,9 +975,10 @@ static void
>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>         memory_region_ref(section->mr);
>>>>>>           if (memory_region_is_iommu(section->mr)) {
>>>>>> +        IOMMUNotify notify;
>>>>>>             VFIOGuestIOMMU *giommu;
>>>>>>             IOMMUMemoryRegion *iommu_mr =
>>>>>> IOMMU_MEMORY_REGION(section->mr);
>>>>>> -        int iommu_idx;
>>>>>> +        int iommu_idx, flags;
>>>>>>               trace_vfio_listener_region_add_iommu(iova, end);
>>>>>>             /*
>>>>>> @@ -900,8 +997,18 @@ static void
>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>             llend = int128_sub(llend, int128_one());
>>>>>>             iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>>>                                                         
>>>>>> MEMTXATTRS_UNSPECIFIED);
>>>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>>>> +
>>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>>>> invalidations */
>>>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>>>> +            notify = vfio_iommu_unmap_notify;
>>>>>> +        } else {
>>>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>>>> +            notify = vfio_iommu_map_notify;
>>>>>> +        }
>>>>>> +
>>>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>>>                                 section->offset_within_region,
>>>>>>                                 int128_get64(llend),
>>>>>>                                 iommu_idx);
>>>>>> @@ -921,7 +1028,9 @@ static void
>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>                 goto fail;
>>>>>>             }
>>>>>>             QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>>>> giommu_next);
>>>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>>> +        }
>>>>>>               return;
>>>>>>         }
>>>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>>>> vfio_memory_listener = {
>>>>>>         .log_sync = vfio_listener_log_sync,
>>>>>>     };
>>>>>>     +static MemoryListener vfio_memory_prereg_listener = {
>>>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>>>> +};
>>>>>> +
>>>>>>     static void vfio_listener_release(VFIOContainer *container)
>>>>>>     {
>>>>>>         memory_listener_unregister(&container->listener);
>>>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>>             memory_listener_unregister(&container->prereg_listener);
>>>>>>         }
>>>>>>     }
>>>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>>>> *group, AddressSpace *as,
>>>>>>                 vfio_get_iommu_info_migration(container, info);
>>>>>>             }
>>>>>>             g_free(info);
>>>>>> +
>>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>> +            container->prereg_listener = vfio_memory_prereg_listener;
>>>>>> +            memory_listener_register(&container->prereg_listener,
>>>>>> +                                     &address_space_memory);
>>>>>> +            if (container->error) {
>>>>>> +                memory_listener_unregister(&container->prereg_listener);
>>>>>> +                ret = -1;
>>>>>> +                error_propagate_prepend(errp, container->error,
>>>>>> +                                    "RAM memory listener initialization failed "
>>>>>> +                                    "for container");
>>>>>> +                goto free_container_exit;
>>>>>> +            }
>>>>>> +        }
>>>>>>             break;
>>>>>>         }
>>>>>>         case VFIO_SPAPR_TCE_v2_IOMMU:
>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>> index 5c65aa0a98..cad7deec71 100644
>>>>>> --- a/hw/vfio/pci.c
>>>>>> +++ b/hw/vfio/pci.c
>>>>>> @@ -2773,6 +2773,25 @@ static void
>>>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>>>         vdev->req_enabled = false;
>>>>>>     }
>>>>>>     +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t
>>>>>> devfn,
>>>>>> +                                      IOMMUConfig *config)
>>>>>> +{
>>>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>>>> +
>>>>>> +    info.argsz = sizeof(info);
>>>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>>>> +    memcpy(&info.config, &config->pasid_cfg, sizeof(config->pasid_cfg));
>>>>>> +
>>>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>>>>> +}
>>>>>> +
>>>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>>>> +};
>>>>>> +
>>>>>>     static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>>>     {
>>>>>>         VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>>>>> **errp)
>>>>>>         vfio_register_req_notifier(vdev);
>>>>>>         vfio_setup_resetfn_quirk(vdev);
>>>>>>     +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>>>> +
>>>>>>         return;
>>>>>>       out_deregister:
>>>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>>>> index 936d29d150..43696afc15 100644
>>>>>> --- a/hw/vfio/trace-events
>>>>>> +++ b/hw/vfio/trace-events
>>>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>>>>> int index, int nr_areas) "Devic
>>>>>>     vfio_region_sparse_mmap_entry(int i, unsigned long start,
>>>>>> unsigned
>>>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>>>     vfio_get_dev_region(const char *name, int index, uint32_t type,
>>>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>>>     vfio_dma_unmap_overflow_workaround(void) ""
>>>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
>>>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>>>>>       # platform.c
>>>>>>     vfio_platform_base_device_init(char *name, int groupid) "%s
>>>>>> belongs
>>>>>> to group #%d"


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-26 12:30         ` Auger Eric
@ 2021-04-27  8:58           ` Kunkun Jiang
  0 siblings, 0 replies; 46+ messages in thread
From: Kunkun Jiang @ 2021-04-27  8:58 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, Xingang Wang, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Eric,

On 2021/4/26 20:30, Auger Eric wrote:
> Hi Kunkun,
>
> On 4/14/21 3:45 AM, Kunkun Jiang wrote:
>> On 2021/4/13 20:57, Auger Eric wrote:
>>> Hi Kunkun,
>>>
>>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>>> Hi Eric,
>>>>
>>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>>> there is no "caching" mode and we do not trap on map.
>>>>>
>>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>>> through the host single stage.
>>>>>
>>>>> With nested mode, we need to setup the stage 2 and the stage 1
>>>>> separately. This patch introduces a prereg_listener to setup
>>>>> the stage 2 mapping.
>>>>>
>>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>>> when the guest invalidates the stage 1 configuration, through
>>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>>> are cascaded down to the host through another IOMMU MR UNMAP
>>>>> notifier.
>>>>>
>>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>>
>>>>> ---
>>>>>
>>>>> v7 -> v8:
>>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>>      propagate DOMAIN and PASID based invalidations
>>>>>
>>>>> v6 -> v7:
>>>>> - remove PASID based invalidation
>>>>>
>>>>> v5 -> v6:
>>>>> - add error_report_err()
>>>>> - remove the abort in case of nested stage case
>>>>>
>>>>> v4 -> v5:
>>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>>> - use PCIPASIDOps for config notification
>>>>>
>>>>> v3 -> v4:
>>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>>
>>>>> v2 -> v3:
>>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>>> - new user API
>>>>> - handle leaf
>>>>>
>>>>> v1 -> v2:
>>>>> - adapt to uapi changes
>>>>> - pass the asid
>>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>>> ---
>>>>>     hw/vfio/common.c     | 139
>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>     hw/vfio/pci.c        |  21 +++++++
>>>>>     hw/vfio/trace-events |   2 +
>>>>>     3 files changed, 157 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>> index 0cd7ef2139..e369d451e7 100644
>>>>> --- a/hw/vfio/common.c
>>>>> +++ b/hw/vfio/common.c
>>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>>> *iotlb, void **vaddr,
>>>>>         return true;
>>>>>     }
>>>>>     +/* Propagate a guest IOTLB invalidation to the host (nested
>>>>> mode) */
>>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>> *iotlb)
>>>>> +{
>>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>>> +    VFIOContainer *container = giommu->container;
>>>>> +    int ret;
>>>>> +
>>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>>> +
>>>>> +    ustruct.argsz = sizeof(ustruct);
>>>>> +    ustruct.flags = 0;
>>>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>>> +
>>>>> +    switch (iotlb->granularity) {
>>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>>> +        break;
>>>>> +    case IOMMU_INV_GRAN_PASID:
>>>>> +    {
>>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>>> +        int archid = -1;
>>>>> +
>>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>> +            archid = iotlb->arch_id;
>>>>> +        }
>>>>> +        pasid_info->archid = archid;
>>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>>> +        break;
>>>>> +    }
>>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>>> +    {
>>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>>> +        struct iommu_inv_addr_info *addr_info;
>>>>> +        size_t size = iotlb->addr_mask + 1;
>>>>> +        int archid = -1;
>>>>> +
>>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>>> +        if (iotlb->leaf) {
>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>>> +        }
>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>> +            archid = iotlb->arch_id;
>>>>> +        }
>>>>> +        addr_info->archid = archid;
>>>>> +        addr_info->addr = start;
>>>>> +        addr_info->granule_size = size;
>>>>> +        addr_info->nb_granules = 1;
>>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>>> +                                        1, iotlb->leaf);
>>>>> +        break;
>>>>> +    }
>>>> Should we pass a size to  host kernel here, even if vSMMU doesn't
>>>> support
>>>> RIL or guest kernel doesn't use RIL?
>>>>
>>>> It will cause TLBI issue in  this scenario: Guest kernel issues a
>>>> TLBI cmd
>>>> without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed
>>>> the iova and size (4K) to host kernel. Finally, host kernel issues a
>>>> TLBI cmd
>>>> with "range" (4K) which can not invalidate the TLB entry of 2M huge
>>>> page.
>>>> (pSMMU supports RIL)
>>> In that case the guest will loop over all 4K images belonging to the 2M
>>> huge page and invalidate each of them. This should turn into qemu
>>> notifications for each 4kB page, no? This is totally inefficient, hence
>> The guest will not loop over all 4K images belonging to the 2M huge page.
>> The iommu_iotlb_gather->pgsize will be 2M, if a page is 2M huge page. The
>> gather->pgsize will be passed to __arm_smmu_tlb_inv_range as "granule":
>>
>> iommu_iotlb_gather_add_page
>>      iommu_iotlb_sync
>>          domain->ops->iotlb_sync
>>              arm_smmu_iotlb_sync
>>                  arm_smmu_tlb_inv_range_domain
>>                      __arm_smmu_tlb_inv_range
>>
>> In the above mentioned scenario, guest kernel will issue a TLBI cmd only
>> with
>> "iova" (tg = 0).
> I am busy fixing this case. Could you share your guest use case. It is
> DPDK? In the positive I would be interested in getting your DPDK
> setup/commands.
I use UADK [1] with the Hisilicon SEC accelerator:

./test_hisi_sec --perf --sync --pktlen 1024 --block 1024 --blknum 1 --times 1 --multi 1 --ctxnum 1

By the way, some UADK code needs to be modified in the vSVA scenario
to map 2M huge pages. The modified UADK I used was provided by
@Xingang Wang.

[1] User Space Accelerator Development Kit
https://github.com/Linaro/uadk

Thanks,
Kunkun Jiang
> Thank you in advance
>
> Eric
>> Thanks,
>> Kunkun Jiang
>>> the support of RIL on guest side and QEMU device.
>>>
>>> What do I miss?
>>>
>>> Thanks
>>>
>>> Eric
>>>> Thanks,
>>>> Kunkun Jiang
>>>>> +    }
>>>>> +
>>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
>>>>> +    if (ret) {
>>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>>> container, ret);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>     static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>> *iotlb)
>>>>>     {
>>>>>         VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>> @@ -776,6 +843,35 @@ static void
>>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>>         }
>>>>>     }
>>>>>     +static void vfio_prereg_listener_region_add(MemoryListener
>>>>> *listener,
>>>>> +                                            MemoryRegionSection
>>>>> *section)
>>>>> +{
>>>>> +    VFIOContainer *container =
>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>> +    Error *err = NULL;
>>>>> +
>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>>> +    if (err) {
>>>>> +        error_report_err(err);
>>>>> +    }
>>>>> +}
>>>>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>>>> +                                     MemoryRegionSection *section)
>>>>> +{
>>>>> +    VFIOContainer *container =
>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>> +
>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>>> +}
>>>>> +
>>>>>     static void vfio_listener_region_add(MemoryListener *listener,
>>>>>                                          MemoryRegionSection *section)
>>>>>     {
>>>>> @@ -879,9 +975,10 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>         memory_region_ref(section->mr);
>>>>>           if (memory_region_is_iommu(section->mr)) {
>>>>> +        IOMMUNotify notify;
>>>>>             VFIOGuestIOMMU *giommu;
>>>>>             IOMMUMemoryRegion *iommu_mr =
>>>>> IOMMU_MEMORY_REGION(section->mr);
>>>>> -        int iommu_idx;
>>>>> +        int iommu_idx, flags;
>>>>>               trace_vfio_listener_region_add_iommu(iova, end);
>>>>>             /*
>>>>> @@ -900,8 +997,18 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>             llend = int128_sub(llend, int128_one());
>>>>>             iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>>                                                          
>>>>> MEMTXATTRS_UNSPECIFIED);
>>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>>> +
>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>>> invalidations */
>>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>>> +            notify = vfio_iommu_unmap_notify;
>>>>> +        } else {
>>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>>> +            notify = vfio_iommu_map_notify;
>>>>> +        }
>>>>> +
>>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>>                                 section->offset_within_region,
>>>>>                                 int128_get64(llend),
>>>>>                                 iommu_idx);
>>>>> @@ -921,7 +1028,9 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>                 goto fail;
>>>>>             }
>>>>>             QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>>> giommu_next);
>>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>> +        }
>>>>>               return;
>>>>>         }
>>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>>> vfio_memory_listener = {
>>>>>         .log_sync = vfio_listener_log_sync,
>>>>>     };
>>>>>     +static MemoryListener vfio_memory_prereg_listener = {
>>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>>> +};
>>>>> +
>>>>>     static void vfio_listener_release(VFIOContainer *container)
>>>>>     {
>>>>>         memory_listener_unregister(&container->listener);
>>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>             memory_listener_unregister(&container->prereg_listener);
>>>>>         }
>>>>>     }
>>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>>> *group, AddressSpace *as,
>>>>>                 vfio_get_iommu_info_migration(container, info);
>>>>>             }
>>>>>             g_free(info);
>>>>> +
>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>> +            container->prereg_listener = vfio_memory_prereg_listener;
>>>>> +            memory_listener_register(&container->prereg_listener,
>>>>> +                                     &address_space_memory);
>>>>> +            if (container->error) {
>>>>> +
>>>>> memory_listener_unregister(&container->prereg_listener);
>>>>> +                ret = -1;
>>>>> +                error_propagate_prepend(errp, container->error,
>>>>> +                                    "RAM memory listener
>>>>> initialization failed "
>>>>> +                                    "for container");
>>>>> +                goto free_container_exit;
>>>>> +            }
>>>>> +        }
>>>>>             break;
>>>>>         }
>>>>>         case VFIO_SPAPR_TCE_v2_IOMMU:
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index 5c65aa0a98..cad7deec71 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -2773,6 +2773,25 @@ static void
>>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>>         vdev->req_enabled = false;
>>>>>     }
>>>>>     +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
>>>>> +                                      IOMMUConfig *config)
>>>>> +{
>>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>>> +
>>>>> +    info.argsz = sizeof(info);
>>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>>> +    memcpy(&info.config, &config->pasid_cfg,
>>>>> sizeof(config->pasid_cfg));
>>>>> +
>>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>>>> +}
>>>>> +
>>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>>> +};
>>>>> +
>>>>>     static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>>     {
>>>>>         VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>>>> **errp)
>>>>>         vfio_register_req_notifier(vdev);
>>>>>         vfio_setup_resetfn_quirk(vdev);
>>>>>     +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>>> +
>>>>>         return;
>>>>>       out_deregister:
>>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>>> index 936d29d150..43696afc15 100644
>>>>> --- a/hw/vfio/trace-events
>>>>> +++ b/hw/vfio/trace-events
>>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>>>> int index, int nr_areas) "Devic
>>>>>     vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>>     vfio_get_dev_region(const char *name, int index, uint32_t type,
>>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>>     vfio_dma_unmap_overflow_workaround(void) ""
>>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size,
>>>>> uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d,
>>>>> addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64"
>>>>> leaf=%d"
>>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>>>>       # platform.c
>>>>>     vfio_platform_base_device_init(char *name, int groupid) "%s belongs
>>>>> to group #%d"
>>>>




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section
  2021-04-11 12:08 ` [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section Eric Auger
@ 2021-04-27 14:05   ` Kunkun Jiang
  2021-09-03  8:22   ` Kunkun Jiang
  1 sibling, 0 replies; 46+ messages in thread
From: Kunkun Jiang @ 2021-04-27 14:05 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, yuzenghui, zhukeqian1

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:
> Let's introduce two helpers that DMA map/unmap a RAM
> section. Those helpers will be called for nested stage setup from
> another call site. This also makes the structure of
> vfio_listener_region_add/del() clearer.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>
> ---
>
> v8 -> v9
> - rebase on top of
>    1eb7f642750c ("vfio: Support host translation granule size")
>
> v5 -> v6:
> - add Error **
> ---
>   hw/vfio/common.c     | 199 +++++++++++++++++++++++++------------------
>   hw/vfio/trace-events |   4 +-
>   2 files changed, 119 insertions(+), 84 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index a8f835328e..0cd7ef2139 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -662,13 +662,126 @@ hostwin_from_range(VFIOContainer *container, hwaddr iova, hwaddr end)
>       return NULL;
>   }
>   
> +static int vfio_dma_map_ram_section(VFIOContainer *container,
> +                                    MemoryRegionSection *section, Error **err)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +    Int128 llend, llsize;
> +    hwaddr iova, end;
> +    void *vaddr;
> +    int ret;
> +
> +    assert(memory_region_is_ram(section->mr));
> +
> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> +    end = int128_get64(int128_sub(llend, int128_one()));
> +
> +    vaddr = memory_region_get_ram_ptr(section->mr) +
> +            section->offset_within_region +
> +            (iova - section->offset_within_address_space);
> +
> +    hostwin = hostwin_from_range(container, iova, end);
> +    if (!hostwin) {
> +        error_setg(err, "Container %p can't map guest IOVA region"
> +                   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
> +        return -EFAULT;
> +    }
> +
> +    trace_vfio_dma_map_ram(iova, end, vaddr);
> +
> +    llsize = int128_sub(llend, int128_make64(iova));
> +
> +    if (memory_region_is_ram_device(section->mr)) {
> +        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> +
> +        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
> +            trace_vfio_listener_region_add_no_dma_map(
> +                memory_region_name(section->mr),
> +                section->offset_within_address_space,
> +                int128_getlo(section->size),
> +                pgmask + 1);
> +            return 0;
> +        }
> +    }
> +
> +    ret = vfio_dma_map(container, iova, int128_get64(llsize),
> +                       vaddr, section->readonly);
> +    if (ret) {
> +        error_setg(err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> +                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
> +                   container, iova, int128_get64(llsize), vaddr, ret);
> +        if (memory_region_is_ram_device(section->mr)) {
> +            /* Allow unexpected mappings not to be fatal for RAM devices */
> +            error_report_err(*err);
> +            return 0;
> +        }
> +        return ret;
> +    }
> +    return 0;
> +}
> +
> +static void vfio_dma_unmap_ram_section(VFIOContainer *container,
> +                                       MemoryRegionSection *section)
> +{
> +    Int128 llend, llsize;
> +    hwaddr iova, end;
> +    bool try_unmap = true;
> +    int ret;
> +
> +    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
> +
> +    if (int128_ge(int128_make64(iova), llend)) {
> +        return;
> +    }
> +    end = int128_get64(int128_sub(llend, int128_one()));
> +
> +    llsize = int128_sub(llend, int128_make64(iova));
> +
> +    trace_vfio_dma_unmap_ram(iova, end);
> +
> +    if (memory_region_is_ram_device(section->mr)) {
> +        hwaddr pgmask;
> +        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
> +
> +        assert(hostwin); /* or region_add() would have failed */
> +
> +        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> +        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
> +    }
> +
> +    if (try_unmap) {
> +        if (int128_eq(llsize, int128_2_64())) {
> +            /* The unmap ioctl doesn't accept a full 64-bit span. */
> +            llsize = int128_rshift(llsize, 1);
> +            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +            if (ret) {
> +                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +                             "0x%"HWADDR_PRIx") = %d (%m)",
> +                             container, iova, int128_get64(llsize), ret);
> +            }
> +            iova += int128_get64(llsize);
> +        }
> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +        if (ret) {
> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +                         "0x%"HWADDR_PRIx") = %d (%m)",
> +                         container, iova, int128_get64(llsize), ret);
> +        }
> +    }
> +}
> +
>   static void vfio_listener_region_add(MemoryListener *listener,
>                                        MemoryRegionSection *section)
>   {
>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>       hwaddr iova, end;
> -    Int128 llend, llsize;
> -    void *vaddr;
> +    Int128 llend;
>       int ret;
>       VFIOHostDMAWindow *hostwin;
>       Error *err = NULL;
> @@ -814,39 +927,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>       }
>   
>       /* Here we assume that memory_region_is_ram(section->mr)==true */
> -
> -    vaddr = memory_region_get_ram_ptr(section->mr) +
> -            section->offset_within_region +
> -            (iova - section->offset_within_address_space);
> -
> -    trace_vfio_listener_region_add_ram(iova, end, vaddr);
> -
> -    llsize = int128_sub(llend, int128_make64(iova));
> -
> -    if (memory_region_is_ram_device(section->mr)) {
> -        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> -
> -        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
> -            trace_vfio_listener_region_add_no_dma_map(
> -                memory_region_name(section->mr),
> -                section->offset_within_address_space,
> -                int128_getlo(section->size),
> -                pgmask + 1);
> -            return;
> -        }
> -    }
> -
> -    ret = vfio_dma_map(container, iova, int128_get64(llsize),
> -                       vaddr, section->readonly);
> -    if (ret) {
> -        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> -                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
> -                   container, iova, int128_get64(llsize), vaddr, ret);
> -        if (memory_region_is_ram_device(section->mr)) {
> -            /* Allow unexpected mappings not to be fatal for RAM devices */
> -            error_report_err(err);
> -            return;
> -        }
> +    if (vfio_dma_map_ram_section(container, section, &err)) {
>           goto fail;
>       }
>   
> @@ -880,10 +961,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                                        MemoryRegionSection *section)
>   {
>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> -    hwaddr iova, end;
> -    Int128 llend, llsize;
> -    int ret;
> -    bool try_unmap = true;
>   
>       if (vfio_listener_skipped_section(section)) {
>           trace_vfio_listener_region_del_skip(
> @@ -923,49 +1000,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>            */
>       }
There is some code in vfio_listener_region_del() that just doesn't show
up in the diff context. I have posted it below.

> if (memory_region_is_iommu(section->mr)) {
>         VFIOGuestIOMMU *giommu;
>
>         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>             if (MEMORY_REGION(giommu->iommu) == section->mr &&
>                 giommu->n.start == section->offset_within_region) {
> memory_region_unregister_iommu_notifier(section->mr,
> &giommu->n);
>                 QLIST_REMOVE(giommu, giommu_next);
>                 g_free(giommu);
>                 break;
>             }
>         }
>
>         /*
>          * FIXME: We assume the one big unmap below is adequate to
>          * remove any individual page mappings in the IOMMU which
>          * might have been copied into VFIO. This works for a page table
>          * based IOMMU where a big unmap flattens a large range of 
> IO-PTEs.
>          * That may not be true for all IOMMU types.
>          */
>     }
I think we need a check here. In nested mode we should just return after
g_free(giommu), because stage 2 (GPA->HPA) and stage 1 (gIOVA->GPA) are
set up separately.

When a PCI device is hot-removed, both vfio_listener_region_del() and
vfio_prereg_listener_region_del() are called. So it is not appropriate
to unmap the RAM section in vfio_listener_region_del(); the RAM section
will be unmapped in vfio_prereg_listener_region_del().

Thanks,
Kunkun Jiang
>   
> -    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
> -    llend = int128_make64(section->offset_within_address_space);
> -    llend = int128_add(llend, section->size);
> -    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
> -
> -    if (int128_ge(int128_make64(iova), llend)) {
> -        return;
> -    }
> -    end = int128_get64(int128_sub(llend, int128_one()));
> -
> -    llsize = int128_sub(llend, int128_make64(iova));
> -
> -    trace_vfio_listener_region_del(iova, end);
> -
> -    if (memory_region_is_ram_device(section->mr)) {
> -        hwaddr pgmask;
> -        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
> -
> -        assert(hostwin); /* or region_add() would have failed */
> -
> -        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> -        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
> -    }
> -
> -    if (try_unmap) {
> -        if (int128_eq(llsize, int128_2_64())) {
> -            /* The unmap ioctl doesn't accept a full 64-bit span. */
> -            llsize = int128_rshift(llsize, 1);
> -            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> -            if (ret) {
> -                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> -                             "0x%"HWADDR_PRIx") = %d (%m)",
> -                             container, iova, int128_get64(llsize), ret);
> -            }
> -            iova += int128_get64(llsize);
> -        }
> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> -        if (ret) {
> -            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> -                         "0x%"HWADDR_PRIx") = %d (%m)",
> -                         container, iova, int128_get64(llsize), ret);
> -        }
> -    }
> +    vfio_dma_unmap_ram_section(container, section);
>   
>       memory_region_unref(section->mr);
>   
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 2a41326c0f..936d29d150 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -99,10 +99,10 @@ vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "i
>   vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add 0x%"PRIx64" - 0x%"PRIx64
>   vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
>   vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] 0x%"PRIx64" - 0x%"PRIx64
> -vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
> +vfio_dma_map_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>   vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>   vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
> -vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
> +vfio_dma_unmap_ram(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
>   vfio_disconnect_container(int fd) "close container->fd=%d"
>   vfio_connect_existing_container(int groupid, int container_fd) "group=%d existing container fd=%d"
>   vfio_connect_new_container(int groupid, int container_fd) "group=%d new container fd=%d"




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-26 19:16             ` Auger Eric
@ 2021-04-28  9:51               ` Kunkun Jiang
  2021-04-29 13:58                 ` Auger Eric
  0 siblings, 1 reply; 46+ messages in thread
From: Kunkun Jiang @ 2021-04-28  9:51 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Eric,

On 2021/4/27 3:16, Auger Eric wrote:
> Hi Kunkun,
>
> On 4/15/21 4:03 AM, Kunkun Jiang wrote:
>> Hi Eric,
>>
>> On 2021/4/14 16:05, Auger Eric wrote:
>>> Hi Kunkun,
>>>
>>> On 4/14/21 3:45 AM, Kunkun Jiang wrote:
>>>> On 2021/4/13 20:57, Auger Eric wrote:
>>>>> Hi Kunkun,
>>>>>
>>>>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>>>>> Hi Eric,
>>>>>>
>>>>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>>>>> there is no "caching" mode and we do not trap on map.
>>>>>>>
>>>>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>>>>> through the host single stage.
>>>>>>>
>>>>>>> With nested mode, we need to setup the stage 2 and the stage 1
>>>>>>> separately. This patch introduces a prereg_listener to setup
>>>>>>> the stage 2 mapping.
>>>>>>>
>>>>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>>>>> when the guest invalidates the stage 1 configuration, through
>>>>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>>>>> are cascaded downto the host through another IOMMU MR UNMAP
>>>>>>> notifier.
>>>>>>>
>>>>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> v7 -> v8:
>>>>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>>>>       propagate DOMAIN and PASID based invalidations
>>>>>>>
>>>>>>> v6 -> v7:
>>>>>>> - remove PASID based invalidation
>>>>>>>
>>>>>>> v5 -> v6:
>>>>>>> - add error_report_err()
>>>>>>> - remove the abort in case of nested stage case
>>>>>>>
>>>>>>> v4 -> v5:
>>>>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>>>>> - use PCIPASIDOps for config notification
>>>>>>>
>>>>>>> v3 -> v4:
>>>>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>>>>
>>>>>>> v2 -> v3:
>>>>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>>>>> - new user API
>>>>>>> - handle leaf
>>>>>>>
>>>>>>> v1 -> v2:
>>>>>>> - adapt to uapi changes
>>>>>>> - pass the asid
>>>>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>>>>> ---
>>>>>>>      hw/vfio/common.c     | 139
>>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>      hw/vfio/pci.c        |  21 +++++++
>>>>>>>      hw/vfio/trace-events |   2 +
>>>>>>>      3 files changed, 157 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>> index 0cd7ef2139..e369d451e7 100644
>>>>>>> --- a/hw/vfio/common.c
>>>>>>> +++ b/hw/vfio/common.c
>>>>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>>>>> *iotlb, void **vaddr,
>>>>>>>          return true;
>>>>>>>      }
>>>>>>>      +/* Propagate a guest IOTLB invalidation to the host (nested
>>>>>>> mode) */
>>>>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>>>> *iotlb)
>>>>>>> +{
>>>>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>>>>> +    VFIOContainer *container = giommu->container;
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>>>>> +
>>>>>>> +    ustruct.argsz = sizeof(ustruct);
>>>>>>> +    ustruct.flags = 0;
>>>>>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>>>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>>>>> +
>>>>>>> +    switch (iotlb->granularity) {
>>>>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>>>>> +        break;
>>>>>>> +    case IOMMU_INV_GRAN_PASID:
>>>>>>> +    {
>>>>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>>>>> +        int archid = -1;
>>>>>>> +
>>>>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>>>> +            archid = iotlb->arch_id;
>>>>>>> +        }
>>>>>>> +        pasid_info->archid = archid;
>>>>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>>>>> +        break;
>>>>>>> +    }
>>>>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>>>>> +    {
>>>>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>>>>> +        struct iommu_inv_addr_info *addr_info;
>>>>>>> +        size_t size = iotlb->addr_mask + 1;
>>>>>>> +        int archid = -1;
>>>>>>> +
>>>>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>>>>> +        if (iotlb->leaf) {
>>>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>>>>> +        }
>>>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>>>> +            archid = iotlb->arch_id;
>>>>>>> +        }
>>>>>>> +        addr_info->archid = archid;
>>>>>>> +        addr_info->addr = start;
>>>>>>> +        addr_info->granule_size = size;
>>>>>>> +        addr_info->nb_granules = 1;
>>>>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>>>>> +                                        1, iotlb->leaf);
>>>>>>> +        break;
>>>>>>> +    }
>>>>>> Should we pass a size to the host kernel here, even if the vSMMU
>>>>>> doesn't support RIL or the guest kernel doesn't use RIL?
>>>>>>
>>>>>> It will cause a TLBI issue in this scenario: the guest kernel
>>>>>> issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge
>>>>>> page. QEMU then passes the iova and size (4K) to the host kernel.
>>>>>> Finally, the host kernel issues a TLBI cmd with "range" (4K),
>>>>>> which cannot invalidate the TLB entry of the 2M huge page.
>>>>>> (The pSMMU supports RIL.)
>>>>> In that case the guest will loop over all 4K images belonging to the 2M
>>>>> huge page and invalidate each of them. This should turn into qemu
>>>>> notifications for each 4kB page, no? This is totally inefficient, hence
>>>> The guest will not loop over all 4K images belonging to the 2M huge
>>>> page. iommu_iotlb_gather->pgsize will be 2M if the page is a 2M huge
>>>> page, and gather->pgsize will be passed to __arm_smmu_tlb_inv_range
>>>> as the "granule":
>>>>
>>>> iommu_iotlb_gather_add_page
>>>>       iommu_iotlb_sync
>>>>           domain->ops->iotlb_sync
>>>>               arm_smmu_iotlb_sync
>>>>                   arm_smmu_tlb_inv_range_domain
>>>>                       __arm_smmu_tlb_inv_range
>>>>
>>>> In the above-mentioned scenario, the guest kernel will issue a TLBI
>>>> cmd with only "iova" (tg = 0).
>>> OK I get it now. In that case I think I should do a TG=0 invalidation
>>> on host even if the host does support RIL. Does that sound correct?
>> Yeah, that's what I thought.
>>> The trouble is the uapi struct does not convey such info explicitly.
>>> Maybe I should use nb_granules = 0 to detect that case.
>> It is troublesome to correct this issue. Using nb_granules = 0 may be
>> a good choice.
>>> I think for this use case RIL should be supported by the guest. Thoughts?
>> Yes. If guest supports RIL, the scenario mentioned above does not exist.
> After further study I really wonder if it is worth supporting the case
> where the guest does not use range invalidation. Passing nb_granules = 0
> (~ size) would be OK to perform the TG=0 range invalidation on the host.
> However, the host arm_smmu_tlb_inv_range_domain() then calls
> arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size), which needs a size.
> We would need to trap guest CMD_ATC_INV and use different code paths in
> the host SMMU driver to cascade the various cache invalidations. At the
> moment, without maintainer guidance, I am a bit reluctant to add this
> extra complexity.
>
> So I would be inclined to fail in QEMU if we detect TG=0 being used
> by the guest. Recent guest kernels would be a prerequisite for this use
> case. What do you think?
Sorry for the late reply.

How do we determine whether the guest kernel is a recent one? If a
modified or old kernel is used, this approach will cause problems once
the VM runs production workloads.

I had not thought of the CMD_ATC_INV problem. In my opinion, adding
handling for CMD_ATC_INV is the best choice. Some functions of the SMMU
driver issue a separate CMD_ATC_INV, for example arm_smmu_enable_ats().
Is it possible that QEMU will need to support handling of CMD_ATC_INV
in the future?

I also thought of another, less attractive idea: use the granule_size to
derive a corresponding maximum page size for CMD_ATC_INV. This would
result in poor performance. For example:

granule_size    maximum_page_size
4K              1G

Thanks,
Kunkun Jiang
> Thanks
>
> Eric
>> Thanks,
>> Kunkun Jiang
>>> Thanks
>>>
>>> Eric
>>>> Thanks,
>>>> Kunkun Jiang
>>>>> the support of RIL on guest side and QEMU device.
>>>>>
>>>>> What do I miss?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Eric
>>>>>> Thanks,
>>>>>> Kunkun Jiang
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE,
>>>>>>> &ustruct);
>>>>>>> +    if (ret) {
>>>>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>>>>> container, ret);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>>      static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>>>> *iotlb)
>>>>>>>      {
>>>>>>>          VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>>>> @@ -776,6 +843,35 @@ static void
>>>>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>>>>          }
>>>>>>>      }
>>>>>>>      +static void vfio_prereg_listener_region_add(MemoryListener
>>>>>>> *listener,
>>>>>>> +                                            MemoryRegionSection
>>>>>>> *section)
>>>>>>> +{
>>>>>>> +    VFIOContainer *container =
>>>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>>>> +    Error *err = NULL;
>>>>>>> +
>>>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>>>>> +    if (err) {
>>>>>>> +        error_report_err(err);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +static void vfio_prereg_listener_region_del(MemoryListener
>>>>>>> *listener,
>>>>>>> +                                     MemoryRegionSection *section)
>>>>>>> +{
>>>>>>> +    VFIOContainer *container =
>>>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>>>> +
>>>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>>>>> +}
>>>>>>> +
>>>>>>>      static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>>                                           MemoryRegionSection
>>>>>>> *section)
>>>>>>>      {
>>>>>>> @@ -879,9 +975,10 @@ static void
>>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>>          memory_region_ref(section->mr);
>>>>>>>            if (memory_region_is_iommu(section->mr)) {
>>>>>>> +        IOMMUNotify notify;
>>>>>>>              VFIOGuestIOMMU *giommu;
>>>>>>>              IOMMUMemoryRegion *iommu_mr =
>>>>>>> IOMMU_MEMORY_REGION(section->mr);
>>>>>>> -        int iommu_idx;
>>>>>>> +        int iommu_idx, flags;
>>>>>>>                trace_vfio_listener_region_add_iommu(iova, end);
>>>>>>>              /*
>>>>>>> @@ -900,8 +997,18 @@ static void
>>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>>              llend = int128_sub(llend, int128_one());
>>>>>>>              iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>>>>                                                          
>>>>>>> MEMTXATTRS_UNSPECIFIED);
>>>>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>>>>> +
>>>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>>>>> invalidations */
>>>>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>>>>> +            notify = vfio_iommu_unmap_notify;
>>>>>>> +        } else {
>>>>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>>>>> +            notify = vfio_iommu_map_notify;
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>>>>                                  section->offset_within_region,
>>>>>>>                                  int128_get64(llend),
>>>>>>>                                  iommu_idx);
>>>>>>> @@ -921,7 +1028,9 @@ static void
>>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>>                  goto fail;
>>>>>>>              }
>>>>>>>              QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>>>>> giommu_next);
>>>>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>>>> +        }
>>>>>>>                return;
>>>>>>>          }
>>>>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>>>>> vfio_memory_listener = {
>>>>>>>          .log_sync = vfio_listener_log_sync,
>>>>>>>      };
>>>>>>>      +static MemoryListener vfio_memory_prereg_listener = {
>>>>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>>>>> +};
>>>>>>> +
>>>>>>>      static void vfio_listener_release(VFIOContainer *container)
>>>>>>>      {
>>>>>>>          memory_listener_unregister(&container->listener);
>>>>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>>>              memory_listener_unregister(&container->prereg_listener);
>>>>>>>          }
>>>>>>>      }
>>>>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>>>>> *group, AddressSpace *as,
>>>>>>>                  vfio_get_iommu_info_migration(container, info);
>>>>>>>              }
>>>>>>>              g_free(info);
>>>>>>> +
>>>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>>> +            container->prereg_listener =
>>>>>>> vfio_memory_prereg_listener;
>>>>>>> +            memory_listener_register(&container->prereg_listener,
>>>>>>> +                                     &address_space_memory);
>>>>>>> +            if (container->error) {
>>>>>>> +
>>>>>>> memory_listener_unregister(&container->prereg_listener);
>>>>>>> +                ret = -1;
>>>>>>> +                error_propagate_prepend(errp, container->error,
>>>>>>> +                                    "RAM memory listener
>>>>>>> initialization failed "
>>>>>>> +                                    "for container");
>>>>>>> +                goto free_container_exit;
>>>>>>> +            }
>>>>>>> +        }
>>>>>>>              break;
>>>>>>>          }
>>>>>>>          case VFIO_SPAPR_TCE_v2_IOMMU:
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index 5c65aa0a98..cad7deec71 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -2773,6 +2773,25 @@ static void
>>>>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>>>>          vdev->req_enabled = false;
>>>>>>>      }
>>>>>>>      +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t
>>>>>>> devfn,
>>>>>>> +                                      IOMMUConfig *config)
>>>>>>> +{
>>>>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>>>>> +
>>>>>>> +    info.argsz = sizeof(info);
>>>>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>>>>> +    memcpy(&info.config, &config->pasid_cfg,
>>>>>>> sizeof(config->pasid_cfg));
>>>>>>> +
>>>>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>>>>> +};
>>>>>>> +
>>>>>>>      static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>>>>      {
>>>>>>>          VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>>>>>> **errp)
>>>>>>>          vfio_register_req_notifier(vdev);
>>>>>>>          vfio_setup_resetfn_quirk(vdev);
>>>>>>>      +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>>>>> +
>>>>>>>          return;
>>>>>>>        out_deregister:
>>>>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>>>>> index 936d29d150..43696afc15 100644
>>>>>>> --- a/hw/vfio/trace-events
>>>>>>> +++ b/hw/vfio/trace-events
>>>>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>>>>>> int index, int nr_areas) "Devic
>>>>>>>      vfio_region_sparse_mmap_entry(int i, unsigned long start,
>>>>>>> unsigned
>>>>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>>>>      vfio_get_dev_region(const char *name, int index, uint32_t type,
>>>>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>>>>      vfio_dma_unmap_overflow_workaround(void) ""
>>>>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size,
>>>>>>> uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d,
>>>>>>> addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64"
>>>>>>> leaf=%d"
>>>>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate
>>>>>>> asid=%d"
>>>>>>>        # platform.c
>>>>>>>      vfio_platform_base_device_init(char *name, int groupid) "%s
>>>>>>> belongs
>>>>>>> to group #%d"

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-28  9:51               ` Kunkun Jiang
@ 2021-04-29 13:58                 ` Auger Eric
  0 siblings, 0 replies; 46+ messages in thread
From: Auger Eric @ 2021-04-29 13:58 UTC (permalink / raw)
  To: Kunkun Jiang, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, jean-philippe, yuzenghui,
	wanghaibin.wang, zhukeqian1

Hi Kunkun,

On 4/28/21 11:51 AM, Kunkun Jiang wrote:
> Hi Eric,
> 
> On 2021/4/27 3:16, Auger Eric wrote:
>> Hi Kunkun,
>>
>> On 4/15/21 4:03 AM, Kunkun Jiang wrote:
>>> Hi Eric,
>>>
>>> On 2021/4/14 16:05, Auger Eric wrote:
>>>> Hi Kunkun,
>>>>
>>>> On 4/14/21 3:45 AM, Kunkun Jiang wrote:
>>>>> On 2021/4/13 20:57, Auger Eric wrote:
>>>>>> Hi Kunkun,
>>>>>>
>>>>>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>>>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>>>>>> there is no "caching" mode and we do not trap on map.
>>>>>>>>
>>>>>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>>>>>> through the host single stage.
>>>>>>>>
>>>>>>>> With nested mode, we need to setup the stage 2 and the stage 1
>>>>>>>> separately. This patch introduces a prereg_listener to setup
>>>>>>>> the stage 2 mapping.
>>>>>>>>
>>>>>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>>>>>> when the guest invalidates the stage 1 configuration, through
>>>>>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>>>>>> are cascaded downto the host through another IOMMU MR UNMAP
>>>>>>>> notifier.
>>>>>>>>
>>>>>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> v7 -> v8:
>>>>>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>>>>>       propagate DOMAIN and PASID based invalidations
>>>>>>>>
>>>>>>>> v6 -> v7:
>>>>>>>> - remove PASID based invalidation
>>>>>>>>
>>>>>>>> v5 -> v6:
>>>>>>>> - add error_report_err()
>>>>>>>> - remove the abort in case of nested stage case
>>>>>>>>
>>>>>>>> v4 -> v5:
>>>>>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>>>>>> - use PCIPASIDOps for config notification
>>>>>>>>
>>>>>>>> v3 -> v4:
>>>>>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>>>>>
>>>>>>>> v2 -> v3:
>>>>>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>>>>>> - new user API
>>>>>>>> - handle leaf
>>>>>>>>
>>>>>>>> v1 -> v2:
>>>>>>>> - adapt to uapi changes
>>>>>>>> - pass the asid
>>>>>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>>>>>> ---
>>>>>>>>      hw/vfio/common.c     | 139
>>>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>>      hw/vfio/pci.c        |  21 +++++++
>>>>>>>>      hw/vfio/trace-events |   2 +
>>>>>>>>      3 files changed, 157 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>>> index 0cd7ef2139..e369d451e7 100644
>>>>>>>> --- a/hw/vfio/common.c
>>>>>>>> +++ b/hw/vfio/common.c
>>>>>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>>>>>> *iotlb, void **vaddr,
>>>>>>>>          return true;
>>>>>>>>      }
>>>>>>>>      +/* Propagate a guest IOTLB invalidation to the host (nested
>>>>>>>> mode) */
>>>>>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n,
>>>>>>>> IOMMUTLBEntry
>>>>>>>> *iotlb)
>>>>>>>> +{
>>>>>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>>>>>> +    VFIOContainer *container = giommu->container;
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>>>>>> +
>>>>>>>> +    ustruct.argsz = sizeof(ustruct);
>>>>>>>> +    ustruct.flags = 0;
>>>>>>>> +    ustruct.info.argsz = sizeof(struct
>>>>>>>> iommu_cache_invalidate_info);
>>>>>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>>>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>>>>>> +
>>>>>>>> +    switch (iotlb->granularity) {
>>>>>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>>>>>> +        break;
>>>>>>>> +    case IOMMU_INV_GRAN_PASID:
>>>>>>>> +    {
>>>>>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>>>>>> +        int archid = -1;
>>>>>>>> +
>>>>>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>>>>> +            archid = iotlb->arch_id;
>>>>>>>> +        }
>>>>>>>> +        pasid_info->archid = archid;
>>>>>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>>>>>> +        break;
>>>>>>>> +    }
>>>>>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>>>>>> +    {
>>>>>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>>>>>> +        struct iommu_inv_addr_info *addr_info;
>>>>>>>> +        size_t size = iotlb->addr_mask + 1;
>>>>>>>> +        int archid = -1;
>>>>>>>> +
>>>>>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>>>>>> +        if (iotlb->leaf) {
>>>>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>>>>>> +        }
>>>>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>>>>> +            archid = iotlb->arch_id;
>>>>>>>> +        }
>>>>>>>> +        addr_info->archid = archid;
>>>>>>>> +        addr_info->addr = start;
>>>>>>>> +        addr_info->granule_size = size;
>>>>>>>> +        addr_info->nb_granules = 1;
>>>>>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>>>>>> +                                        1, iotlb->leaf);
>>>>>>>> +        break;
>>>>>>>> +    }
>>>>>>> Should we pass a size to the host kernel here, even if the vSMMU
>>>>>>> doesn't support RIL or the guest kernel doesn't use RIL?
>>>>>>>
>>>>>>> It will cause a TLBI issue in this scenario: the guest kernel
>>>>>>> issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge
>>>>>>> page. QEMU then passes the iova and size (4K) to the host kernel.
>>>>>>> Finally, the host kernel issues a TLBI cmd with "range" (4K), which
>>>>>>> cannot invalidate the TLB entry of the 2M huge page (the pSMMU
>>>>>>> supports RIL).
>>>>>> In that case the guest will loop over all the 4K pages belonging to
>>>>>> the 2M huge page and invalidate each of them. This should turn into
>>>>>> QEMU notifications for each 4kB page, no? This is totally
>>>>>> inefficient, hence
>>>>> The guest will not loop over all the 4K pages belonging to the 2M
>>>>> huge page. If a page is a 2M huge page, iommu_iotlb_gather->pgsize
>>>>> will be 2M, and gather->pgsize is passed to __arm_smmu_tlb_inv_range
>>>>> as the "granule":
>>>>>
>>>>> iommu_iotlb_gather_add_page
>>>>>       iommu_iotlb_sync
>>>>>           domain->ops->iotlb_sync
>>>>>               arm_smmu_iotlb_sync
>>>>>                   arm_smmu_tlb_inv_range_domain
>>>>>                       __arm_smmu_tlb_inv_range
>>>>>
>>>>> In the above-mentioned scenario, the guest kernel will issue a TLBI
>>>>> cmd with only the "iova" (tg = 0).
>>>> OK I get it now. In that case I think I should do a TG=0 invalidation
>>>> on host even if the host does support RIL. Does that sound correct?
>>> Yeah, that's what I thought.
>>>> The trouble is the uapi struct does not convey such info explicitly.
>>>> Maybe I should use nb_granules = 0 to detect that case.
>>> It is troublesome to correct this issue. Using nb_granules = 0 may be
>>> a good choice.
>>>> I think for this use case RIL should be supported by the guest.
>>>> Thoughts?
>>> Yes. If guest supports RIL, the scenario mentioned above does not exist.
>> After further study I really wonder if it is worth supporting the case
>> where the guest does not use range inval. Passing nb_granules = 0 (~
>> size) would be OK to perform the TG=0 range invalidation on the host.
>> However, host arm_smmu_tlb_inv_range_domain() then calls
>> arm_smmu_atc_inv_domain(smmu_domain, 0, iova, size), which needs a size.
>> We would need to trap guest CMD_ATC_INV and use different code paths in
>> the host SMMU driver to cascade the various cache invalidations. At the
>> moment, without maintainer guidance, I am a bit reluctant to add this
>> extra complexity.
>>
>> So I would be inclined to fail in QEMU if we detect TG=0 is being used
>> by the guest. Recent guest kernels would be a prerequisite for this use
>> case. What do you think?
> Sorry for late reply.
> 
> How do I determine whether the guest kernel is a recent one? If I use a
> modified kernel or an old kernel, this method will cause problems while
> the VM is running workloads.
Yes, I acknowledge this detection would happen very late, when trapping
guest invals.
> 
> I didn't think of the CMD_ATC_INV problem. In my opinion, adding
> processing for CMD_ATC_INV is the best choice. Some functions of the
> SMMU driver issue a separate CMD_ATC_INV, for example
> arm_smmu_enable_ats().

Yes, this would mean that in arm_smmu_cache_invalidate() I should not use
arm_smmu_tlb_inv_range_domain() and should instead split the actual
inv_range from the atc_inv_domain. Or, for the moment, since the vIOMMU
does not support ATS, I can do without it.

> Is it possible that QEMU needs to support the
> processing of CMD_ATC_INV in the future?

Well, that's technically feasible, but we would need to make sure the
pIOMMU supports it.

Thanks

Eric
> 
> I also thought of another, not-so-good idea: we could use the
> granule_size to give CMD_ATC_INV a corresponding maximum page size.
> This will result in poor performance.
> For example:
> granule_size          maximum_page_size
>       4K                               1G
> 
> Thanks,
> Kunkun Jiang
>> Thanks
>>
>> Eric
>>> Thanks,
>>> Kunkun Jiang
>>>> Thanks
>>>>
>>>> Eric
>>>>> Thanks,
>>>>> Kunkun Jiang
>>>>>> the support of RIL on guest side and QEMU device.
>>>>>>
>>>>>> What do I miss?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Eric
>>>>>>> Thanks,
>>>>>>> Kunkun Jiang
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE,
>>>>>>>> &ustruct);
>>>>>>>> +    if (ret) {
>>>>>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>>>>>> container, ret);
>>>>>>>> +    }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>      static void vfio_iommu_map_notify(IOMMUNotifier *n,
>>>>>>>> IOMMUTLBEntry
>>>>>>>> *iotlb)
>>>>>>>>      {
>>>>>>>>          VFIOGuestIOMMU *giommu = container_of(n,
>>>>>>>> VFIOGuestIOMMU, n);
>>>>>>>> @@ -776,6 +843,35 @@ static void
>>>>>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>>>>>          }
>>>>>>>>      }
>>>>>>>>      +static void vfio_prereg_listener_region_add(MemoryListener
>>>>>>>> *listener,
>>>>>>>> +                                            MemoryRegionSection
>>>>>>>> *section)
>>>>>>>> +{
>>>>>>>> +    VFIOContainer *container =
>>>>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>>>>> +    Error *err = NULL;
>>>>>>>> +
>>>>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>>>>> +        return;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>>>>>> +    if (err) {
>>>>>>>> +        error_report_err(err);
>>>>>>>> +    }
>>>>>>>> +}
>>>>>>>> +static void vfio_prereg_listener_region_del(MemoryListener
>>>>>>>> *listener,
>>>>>>>> +                                     MemoryRegionSection *section)
>>>>>>>> +{
>>>>>>>> +    VFIOContainer *container =
>>>>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>>>>> +
>>>>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>>>>> +        return;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>      static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>                                           MemoryRegionSection
>>>>>>>> *section)
>>>>>>>>      {
>>>>>>>> @@ -879,9 +975,10 @@ static void
>>>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>          memory_region_ref(section->mr);
>>>>>>>>            if (memory_region_is_iommu(section->mr)) {
>>>>>>>> +        IOMMUNotify notify;
>>>>>>>>              VFIOGuestIOMMU *giommu;
>>>>>>>>              IOMMUMemoryRegion *iommu_mr =
>>>>>>>> IOMMU_MEMORY_REGION(section->mr);
>>>>>>>> -        int iommu_idx;
>>>>>>>> +        int iommu_idx, flags;
>>>>>>>>                trace_vfio_listener_region_add_iommu(iova, end);
>>>>>>>>              /*
>>>>>>>> @@ -900,8 +997,18 @@ static void
>>>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>              llend = int128_sub(llend, int128_one());
>>>>>>>>              iommu_idx =
>>>>>>>> memory_region_iommu_attrs_to_index(iommu_mr,
>>>>>>>>                                                         
>>>>>>>> MEMTXATTRS_UNSPECIFIED);
>>>>>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>>>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>>>>>> +
>>>>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>>>>>> invalidations */
>>>>>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>>>>>> +            notify = vfio_iommu_unmap_notify;
>>>>>>>> +        } else {
>>>>>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>>>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>>>>>> +            notify = vfio_iommu_map_notify;
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>>>>>                                  section->offset_within_region,
>>>>>>>>                                  int128_get64(llend),
>>>>>>>>                                  iommu_idx);
>>>>>>>> @@ -921,7 +1028,9 @@ static void
>>>>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>                  goto fail;
>>>>>>>>              }
>>>>>>>>              QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>>>>>> giommu_next);
>>>>>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>>>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>>>>> +        }
>>>>>>>>                return;
>>>>>>>>          }
>>>>>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>>>>>> vfio_memory_listener = {
>>>>>>>>          .log_sync = vfio_listener_log_sync,
>>>>>>>>      };
>>>>>>>>      +static MemoryListener vfio_memory_prereg_listener = {
>>>>>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>>>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>>>>>> +};
>>>>>>>> +
>>>>>>>>      static void vfio_listener_release(VFIOContainer *container)
>>>>>>>>      {
>>>>>>>>          memory_listener_unregister(&container->listener);
>>>>>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>>>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>>>>             
>>>>>>>> memory_listener_unregister(&container->prereg_listener);
>>>>>>>>          }
>>>>>>>>      }
>>>>>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>>>>>> *group, AddressSpace *as,
>>>>>>>>                  vfio_get_iommu_info_migration(container, info);
>>>>>>>>              }
>>>>>>>>              g_free(info);
>>>>>>>> +
>>>>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>>>> +            container->prereg_listener =
>>>>>>>> vfio_memory_prereg_listener;
>>>>>>>> +            memory_listener_register(&container->prereg_listener,
>>>>>>>> +                                     &address_space_memory);
>>>>>>>> +            if (container->error) {
>>>>>>>> +
>>>>>>>> memory_listener_unregister(&container->prereg_listener);
>>>>>>>> +                ret = -1;
>>>>>>>> +                error_propagate_prepend(errp, container->error,
>>>>>>>> +                                    "RAM memory listener
>>>>>>>> initialization failed "
>>>>>>>> +                                    "for container");
>>>>>>>> +                goto free_container_exit;
>>>>>>>> +            }
>>>>>>>> +        }
>>>>>>>>              break;
>>>>>>>>          }
>>>>>>>>          case VFIO_SPAPR_TCE_v2_IOMMU:
>>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>>> index 5c65aa0a98..cad7deec71 100644
>>>>>>>> --- a/hw/vfio/pci.c
>>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>>> @@ -2773,6 +2773,25 @@ static void
>>>>>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>>>>>          vdev->req_enabled = false;
>>>>>>>>      }
>>>>>>>>      +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t
>>>>>>>> devfn,
>>>>>>>> +                                      IOMMUConfig *config)
>>>>>>>> +{
>>>>>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>>>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>>>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>>>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>>>>>> +
>>>>>>>> +    info.argsz = sizeof(info);
>>>>>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>>>>>> +    memcpy(&info.config, &config->pasid_cfg,
>>>>>>>> sizeof(config->pasid_cfg));
>>>>>>>> +
>>>>>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE,
>>>>>>>> &info);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>>>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>>>>>> +};
>>>>>>>> +
>>>>>>>>      static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>>>>>      {
>>>>>>>>          VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>>>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev,
>>>>>>>> Error
>>>>>>>> **errp)
>>>>>>>>          vfio_register_req_notifier(vdev);
>>>>>>>>          vfio_setup_resetfn_quirk(vdev);
>>>>>>>>      +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>>>>>> +
>>>>>>>>          return;
>>>>>>>>        out_deregister:
>>>>>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>>>>>> index 936d29d150..43696afc15 100644
>>>>>>>> --- a/hw/vfio/trace-events
>>>>>>>> +++ b/hw/vfio/trace-events
>>>>>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char
>>>>>>>> *name,
>>>>>>>> int index, int nr_areas) "Devic
>>>>>>>>      vfio_region_sparse_mmap_entry(int i, unsigned long start,
>>>>>>>> unsigned
>>>>>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>>>>>      vfio_get_dev_region(const char *name, int index, uint32_t
>>>>>>>> type,
>>>>>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>>>>>      vfio_dma_unmap_overflow_workaround(void) ""
>>>>>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size,
>>>>>>>> uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d,
>>>>>>>> addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64"
>>>>>>>> leaf=%d"
>>>>>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate
>>>>>>>> asid=%d"
>>>>>>>>        # platform.c
>>>>>>>>      vfio_platform_base_device_init(char *name, int groupid) "%s
>>>>>>>> belongs
>>>>>>>> to group #%d"

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 24/29] hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
  2021-04-11 12:09 ` [RFC v9 24/29] hw/arm/smmuv3: Fill the IOTLBEntry leaf field " Eric Auger
@ 2021-05-13  7:09   ` Kunkun Jiang
  0 siblings, 0 replies; 46+ messages in thread
From: Kunkun Jiang @ 2021-05-13  7:09 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, yuzenghui, wanghaibin.wang,
	zhukeqian1

Hi Eric,

On 2021/4/11 20:09, Eric Auger wrote:
> Let's propagate the leaf attribute throughout the invalidation path.
> This hint is used to reduce the scope of the invalidations to the
> last level of translation. Not enforcing it induces large performance
> penalties in nested mode.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>   hw/arm/smmuv3.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 7beb55cd89..74a6408146 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -799,7 +799,7 @@ epilogue:
>   static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
>                                  IOMMUNotifier *n,
>                                  int asid, dma_addr_t iova,
> -                               uint8_t tg, uint64_t num_pages)
> +                               uint8_t tg, uint64_t num_pages, bool leaf)
>   {
>       SMMUDevice *sdev = container_of(mr, SMMUDevice, iommu);
>       IOMMUTLBEvent event = {};
> @@ -834,6 +834,7 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
>       event.entry.perm = IOMMU_NONE;
>       event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
>       event.entry.arch_id = asid;
> +    event.entry.leaf = leaf;
>   
>       memory_region_notify_iommu_one(n, &event);
>   }
> @@ -865,7 +866,7 @@ static void smmuv3_notify_asid(IOMMUMemoryRegion *mr,
>   
>   /* invalidate an asid/iova range tuple in all mr's */
>   static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,
> -                                      uint8_t tg, uint64_t num_pages)
> +                                      uint8_t tg, uint64_t num_pages, bool leaf)
>   {
>       SMMUDevice *sdev;
Does the parameter 'leaf' need to be added to the trace?
> trace_smmuv3_inv_notifiers_iova(mr->parent_obj.name, asid, iova,
>                                         tg, num_pages);

Thanks,
Kunkun Jiang
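If leaf is propagated, the updated trace-events entry could look like the following sketch (a hypothetical line following the style of the smmuv3 trace events; the exact argument types and format string are assumptions):

```
smmuv3_inv_notifiers_iova(const char *name, int asid, uint64_t iova, uint8_t tg, uint64_t num_pages, bool leaf) "iommu mr=%s asid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64" leaf=%d"
```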
>   
> @@ -877,7 +878,7 @@ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,
>                                           tg, num_pages);
>   
>           IOMMU_NOTIFIER_FOREACH(n, mr) {
> -            smmuv3_notify_iova(mr, n, asid, iova, tg, num_pages);
> +            smmuv3_notify_iova(mr, n, asid, iova, tg, num_pages, leaf);
>           }
>       }
>   }
> @@ -915,7 +916,7 @@ static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd)
>           count = mask + 1;
>   
>           trace_smmuv3_s1_range_inval(vmid, asid, addr, tg, count, ttl, leaf);
> -        smmuv3_inv_notifiers_iova(s, asid, addr, tg, count);
> +        smmuv3_inv_notifiers_iova(s, asid, addr, tg, count, leaf);
>           smmu_iotlb_inv_iova(s, asid, addr, tg, count, ttl);
>   
>           num_pages -= count;




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section
  2021-04-11 12:08 ` [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section Eric Auger
  2021-04-27 14:05   ` Kunkun Jiang
@ 2021-09-03  8:22   ` Kunkun Jiang
  1 sibling, 0 replies; 46+ messages in thread
From: Kunkun Jiang @ 2021-09-03  8:22 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang66,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, yuzenghui, wanghaibin.wang,
	zhukeqian1

Hi Eric,

On 2021/4/11 20:08, Eric Auger wrote:
> Let's introduce two helpers that allow to DMA map/unmap a RAM
> section. Those helpers will be called for nested stage setup in
> another call site. Also the vfio_listener_region_add/del()
> structure may be clearer.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>
> ---
>
> v8 -> v9
> - rebase on top of
>    1eb7f642750c ("vfio: Support host translation granule size")
>
> v5 -> v6:
> - add Error **
> ---
>   hw/vfio/common.c     | 199 +++++++++++++++++++++++++------------------
>   hw/vfio/trace-events |   4 +-
>   2 files changed, 119 insertions(+), 84 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index a8f835328e..0cd7ef2139 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -662,13 +662,126 @@ hostwin_from_range(VFIOContainer *container, hwaddr iova, hwaddr end)
>       return NULL;
>   }
>   
> +static int vfio_dma_map_ram_section(VFIOContainer *container,
> +                                    MemoryRegionSection *section, Error **err)
> +{
> +    VFIOHostDMAWindow *hostwin;
> +    Int128 llend, llsize;
> +    hwaddr iova, end;
> +    void *vaddr;
> +    int ret;
> +
> +    assert(memory_region_is_ram(section->mr));
> +
> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
Is there any special meaning to using TARGET_PAGE_ALIGN here?
REAL_HOST_PAGE_ALIGN is used in vfio_listener_region_add().

And I think a check should be added here if REAL_HOST_PAGE_ALIGN is used:
> if (int128_ge(int128_make64(iova), llend)) {
>      return;
> }

Otherwise, calling vfio_prereg_listener_region_add() will trigger an
error log or the 'assert(r == a)' in int128_get64() in some scenarios:
some devices' BARs may map MSI-X structures and other regions within one
host page.

By the way, is this patch set going to be updated after the "/dev/iommu"
series is sent out?

Thanks,
Kunkun Jiang

> +    end = int128_get64(int128_sub(llend, int128_one()));
> +
> +    vaddr = memory_region_get_ram_ptr(section->mr) +
> +            section->offset_within_region +
> +            (iova - section->offset_within_address_space);
> +
> +    hostwin = hostwin_from_range(container, iova, end);
> +    if (!hostwin) {
> +        error_setg(err, "Container %p can't map guest IOVA region"
> +                   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
> +        return -EFAULT;
> +    }
> +
> +    trace_vfio_dma_map_ram(iova, end, vaddr);
> +
> +    llsize = int128_sub(llend, int128_make64(iova));
> +
> +    if (memory_region_is_ram_device(section->mr)) {
> +        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> +
> +        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
> +            trace_vfio_listener_region_add_no_dma_map(
> +                memory_region_name(section->mr),
> +                section->offset_within_address_space,
> +                int128_getlo(section->size),
> +                pgmask + 1);
> +            return 0;
> +        }
> +    }
> +
> +    ret = vfio_dma_map(container, iova, int128_get64(llsize),
> +                       vaddr, section->readonly);
> +    if (ret) {
> +        error_setg(err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> +                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
> +                   container, iova, int128_get64(llsize), vaddr, ret);
> +        if (memory_region_is_ram_device(section->mr)) {
> +            /* Allow unexpected mappings not to be fatal for RAM devices */
> +            error_report_err(*err);
> +            return 0;
> +        }
> +        return ret;
> +    }
> +    return 0;
> +}
> +
> +static void vfio_dma_unmap_ram_section(VFIOContainer *container,
> +                                       MemoryRegionSection *section)
> +{
> +    Int128 llend, llsize;
> +    hwaddr iova, end;
> +    bool try_unmap = true;
> +    int ret;
> +
> +    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
> +
> +    if (int128_ge(int128_make64(iova), llend)) {
> +        return;
> +    }
> +    end = int128_get64(int128_sub(llend, int128_one()));
> +
> +    llsize = int128_sub(llend, int128_make64(iova));
> +
> +    trace_vfio_dma_unmap_ram(iova, end);
> +
> +    if (memory_region_is_ram_device(section->mr)) {
> +        hwaddr pgmask;
> +        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
> +
> +        assert(hostwin); /* or region_add() would have failed */
> +
> +        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> +        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
> +    }
> +
> +    if (try_unmap) {
> +        if (int128_eq(llsize, int128_2_64())) {
> +            /* The unmap ioctl doesn't accept a full 64-bit span. */
> +            llsize = int128_rshift(llsize, 1);
> +            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +            if (ret) {
> +                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +                             "0x%"HWADDR_PRIx") = %d (%m)",
> +                             container, iova, int128_get64(llsize), ret);
> +            }
> +            iova += int128_get64(llsize);
> +        }
> +        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> +        if (ret) {
> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> +                         "0x%"HWADDR_PRIx") = %d (%m)",
> +                         container, iova, int128_get64(llsize), ret);
> +        }
> +    }
> +}
> +
>   static void vfio_listener_region_add(MemoryListener *listener,
>                                        MemoryRegionSection *section)
>   {
>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>       hwaddr iova, end;
> -    Int128 llend, llsize;
> -    void *vaddr;
> +    Int128 llend;
>       int ret;
>       VFIOHostDMAWindow *hostwin;
>       Error *err = NULL;
> @@ -814,39 +927,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>       }
>   
>       /* Here we assume that memory_region_is_ram(section->mr)==true */
> -
> -    vaddr = memory_region_get_ram_ptr(section->mr) +
> -            section->offset_within_region +
> -            (iova - section->offset_within_address_space);
> -
> -    trace_vfio_listener_region_add_ram(iova, end, vaddr);
> -
> -    llsize = int128_sub(llend, int128_make64(iova));
> -
> -    if (memory_region_is_ram_device(section->mr)) {
> -        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> -
> -        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
> -            trace_vfio_listener_region_add_no_dma_map(
> -                memory_region_name(section->mr),
> -                section->offset_within_address_space,
> -                int128_getlo(section->size),
> -                pgmask + 1);
> -            return;
> -        }
> -    }
> -
> -    ret = vfio_dma_map(container, iova, int128_get64(llsize),
> -                       vaddr, section->readonly);
> -    if (ret) {
> -        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> -                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
> -                   container, iova, int128_get64(llsize), vaddr, ret);
> -        if (memory_region_is_ram_device(section->mr)) {
> -            /* Allow unexpected mappings not to be fatal for RAM devices */
> -            error_report_err(err);
> -            return;
> -        }
> +    if (vfio_dma_map_ram_section(container, section, &err)) {
>           goto fail;
>       }
>   
> @@ -880,10 +961,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                                        MemoryRegionSection *section)
>   {
>       VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> -    hwaddr iova, end;
> -    Int128 llend, llsize;
> -    int ret;
> -    bool try_unmap = true;
>   
>       if (vfio_listener_skipped_section(section)) {
>           trace_vfio_listener_region_del_skip(
> @@ -923,49 +1000,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>            */
>       }
>   
> -    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
> -    llend = int128_make64(section->offset_within_address_space);
> -    llend = int128_add(llend, section->size);
> -    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
> -
> -    if (int128_ge(int128_make64(iova), llend)) {
> -        return;
> -    }
> -    end = int128_get64(int128_sub(llend, int128_one()));
> -
> -    llsize = int128_sub(llend, int128_make64(iova));
> -
> -    trace_vfio_listener_region_del(iova, end);
> -
> -    if (memory_region_is_ram_device(section->mr)) {
> -        hwaddr pgmask;
> -        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
> -
> -        assert(hostwin); /* or region_add() would have failed */
> -
> -        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
> -        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
> -    }
> -
> -    if (try_unmap) {
> -        if (int128_eq(llsize, int128_2_64())) {
> -            /* The unmap ioctl doesn't accept a full 64-bit span. */
> -            llsize = int128_rshift(llsize, 1);
> -            ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> -            if (ret) {
> -                error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> -                             "0x%"HWADDR_PRIx") = %d (%m)",
> -                             container, iova, int128_get64(llsize), ret);
> -            }
> -            iova += int128_get64(llsize);
> -        }
> -        ret = vfio_dma_unmap(container, iova, int128_get64(llsize), NULL);
> -        if (ret) {
> -            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> -                         "0x%"HWADDR_PRIx") = %d (%m)",
> -                         container, iova, int128_get64(llsize), ret);
> -        }
> -    }
> +    vfio_dma_unmap_ram_section(container, section);
>   
>       memory_region_unref(section->mr);
>   
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 2a41326c0f..936d29d150 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -99,10 +99,10 @@ vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "i
>   vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add 0x%"PRIx64" - 0x%"PRIx64
>   vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
>   vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] 0x%"PRIx64" - 0x%"PRIx64
> -vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
> +vfio_dma_map_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>   vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
>   vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
> -vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
> +vfio_dma_unmap_ram(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
>   vfio_disconnect_container(int fd) "close container->fd=%d"
>   vfio_connect_existing_container(int groupid, int container_fd) "group=%d existing container fd=%d"
>   vfio_connect_new_container(int groupid, int container_fd) "group=%d new container fd=%d"




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-04-14  1:45       ` Kunkun Jiang
  2021-04-14  8:05         ` Auger Eric
  2021-04-26 12:30         ` Auger Eric
@ 2021-10-07 16:58         ` Eric Auger
  2021-10-08  2:13           ` Kunkun Jiang
  2 siblings, 1 reply; 46+ messages in thread
From: Eric Auger @ 2021-10-07 16:58 UTC (permalink / raw)
  To: Kunkun Jiang, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Kunkun,

On 4/14/21 3:45 AM, Kunkun Jiang wrote:
> On 2021/4/13 20:57, Auger Eric wrote:
>> Hi Kunkun,
>>
>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>> Hi Eric,
>>>
>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>> there is no "caching" mode and we do not trap on map.
>>>>
>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>> through the host single stage.
>>>>
>>>> With nested mode, we need to setup the stage 2 and the stage 1
>>>> separately. This patch introduces a prereg_listener to setup
>>>> the stage 2 mapping.
>>>>
>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>> when the guest invalidates the stage 1 configuration, through
>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>> are cascaded down to the host through another IOMMU MR UNMAP
>>>> notifier.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>
>>>> ---
>>>>
>>>> v7 -> v8:
>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>     propagate DOMAIN and PASID based invalidations
>>>>
>>>> v6 -> v7:
>>>> - remove PASID based invalidation
>>>>
>>>> v5 -> v6:
>>>> - add error_report_err()
>>>> - remove the abort in case of nested stage case
>>>>
>>>> v4 -> v5:
>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>> - use PCIPASIDOps for config notification
>>>>
>>>> v3 -> v4:
>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>
>>>> v2 -> v3:
>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>> - new user API
>>>> - handle leaf
>>>>
>>>> v1 -> v2:
>>>> - adapt to uapi changes
>>>> - pass the asid
>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>> ---
>>>>    hw/vfio/common.c     | 139 +++++++++++++++++++++++++++++++++++++++++--
>>>>    hw/vfio/pci.c        |  21 +++++++
>>>>    hw/vfio/trace-events |   2 +
>>>>    3 files changed, 157 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 0cd7ef2139..e369d451e7 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>> *iotlb, void **vaddr,
>>>>        return true;
>>>>    }
>>>>    +/* Propagate a guest IOTLB invalidation to the host (nested
>>>> mode) */
>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>> *iotlb)
>>>> +{
>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>> +    VFIOContainer *container = giommu->container;
>>>> +    int ret;
>>>> +
>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>> +
>>>> +    ustruct.argsz = sizeof(ustruct);
>>>> +    ustruct.flags = 0;
>>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>> +
>>>> +    switch (iotlb->granularity) {
>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>> +        break;
>>>> +    case IOMMU_INV_GRAN_PASID:
>>>> +    {
>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>> +        int archid = -1;
>>>> +
>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>> +            archid = iotlb->arch_id;
>>>> +        }
>>>> +        pasid_info->archid = archid;
>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>> +        break;
>>>> +    }
>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>> +    {
>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>> +        struct iommu_inv_addr_info *addr_info;
>>>> +        size_t size = iotlb->addr_mask + 1;
>>>> +        int archid = -1;
>>>> +
>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>> +        if (iotlb->leaf) {
>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>> +        }
>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>> +            archid = iotlb->arch_id;
>>>> +        }
>>>> +        addr_info->archid = archid;
>>>> +        addr_info->addr = start;
>>>> +        addr_info->granule_size = size;
>>>> +        addr_info->nb_granules = 1;
>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>> +                                        1, iotlb->leaf);
>>>> +        break;
>>>> +    }
>>> Should we pass a size to the host kernel here, even if the vSMMU
>>> doesn't support RIL or the guest kernel doesn't use RIL?
>>>
>>> It will cause a TLBI issue in this scenario: the guest kernel issues
>>> a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page.
>>> Then QEMU passes the iova and size (4K) to the host kernel. Finally,
>>> the host kernel issues a TLBI cmd with "range" (4K), which cannot
>>> invalidate the TLB entry of the 2M huge page.
>>> (pSMMU supports RIL)
>> In that case the guest will loop over all 4K pages belonging to the 2M
>> huge page and invalidate each of them. This should turn into QEMU
>> notifications for each 4KB page, no? This is totally inefficient, hence
> The guest will not loop over all 4K pages belonging to the 2M huge page.
> The iommu_iotlb_gather->pgsize will be 2M if the page is a 2M huge page.
> The gather->pgsize will be passed to __arm_smmu_tlb_inv_range as the
> "granule":
>
> iommu_iotlb_gather_add_page
>     iommu_iotlb_sync
>         domain->ops->iotlb_sync
>             arm_smmu_iotlb_sync
>                 arm_smmu_tlb_inv_range_domain
>                     __arm_smmu_tlb_inv_range
>
> In the above-mentioned scenario, the guest kernel will issue a TLBI cmd
> with only an "iova" (tg = 0).

I am currently respinning the SMMU part (the rest depends on
/dev/iommu). While thinking more about this issue you reported a long
time ago (sorry), I think a guest not using RIL is problematic wrt
vSMMU integration. Your case would not work with vhost either, because
when you notify vhost you need to pass an invalidation range. In this
specific case none is passed by the guest, as the INVAL CMD just
contains an iova and that's it. Hope I am not missing anything. And I
cannot guess the range size: I only have the granule size from the CD
and the IOVA. So how can I properly invalidate in the vhost case,
besides upgrading the API to pass the info that the addr range is not
set and invalidating the whole vhost cache in that case?

So I would be inclined to support a RIL-only guest, for nested and
globally with vhost. I can't ignore your argument that the detection of
non-interoperability comes late, with the first inval with TG=0. But I
think the SMMU spec without RIL simply is *not* virtualization
compatible, and this was first reported in Aug 2017
(https://lkml.org/lkml/2017/8/11/428); yeah, so much time has been
spent on this without an outcome :-(. Given the nested feature
timeframe, now with the /dev/iommu redesign, can't you consider that
requiring a guest kernel >= 5.7, which features RIL, is acceptable?

Thanks

Eric
>
> Thanks,
> Kunkun Jiang
>> the support of RIL on guest side and QEMU device.
>>
>> What do I miss?
>>
>> Thanks
>>
>> Eric
>>> Thanks,
>>> Kunkun Jiang
>>>> +    }
>>>> +
>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE,
>>>> &ustruct);
>>>> +    if (ret) {
>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>> container, ret);
>>>> +    }
>>>> +}
>>>> +
>>>>    static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>> *iotlb)
>>>>    {
>>>>        VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>> @@ -776,6 +843,35 @@ static void
>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>        }
>>>>    }
>>>>    +static void vfio_prereg_listener_region_add(MemoryListener
>>>> *listener,
>>>> +                                            MemoryRegionSection
>>>> *section)
>>>> +{
>>>> +    VFIOContainer *container =
>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>> +    Error *err = NULL;
>>>> +
>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>> +    if (err) {
>>>> +        error_report_err(err);
>>>> +    }
>>>> +}
>>>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>>> +                                     MemoryRegionSection *section)
>>>> +{
>>>> +    VFIOContainer *container =
>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>> +
>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>> +}
>>>> +
>>>>    static void vfio_listener_region_add(MemoryListener *listener,
>>>>                                         MemoryRegionSection *section)
>>>>    {
>>>> @@ -879,9 +975,10 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>        memory_region_ref(section->mr);
>>>>          if (memory_region_is_iommu(section->mr)) {
>>>> +        IOMMUNotify notify;
>>>>            VFIOGuestIOMMU *giommu;
>>>>            IOMMUMemoryRegion *iommu_mr =
>>>> IOMMU_MEMORY_REGION(section->mr);
>>>> -        int iommu_idx;
>>>> +        int iommu_idx, flags;
>>>>              trace_vfio_listener_region_add_iommu(iova, end);
>>>>            /*
>>>> @@ -900,8 +997,18 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>            llend = int128_sub(llend, int128_one());
>>>>            iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>                                                        MEMTXATTRS_UNSPECIFIED);
>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>> +
>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>> invalidations */
>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>> +            notify = vfio_iommu_unmap_notify;
>>>> +        } else {
>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>> +            notify = vfio_iommu_map_notify;
>>>> +        }
>>>> +
>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>                                section->offset_within_region,
>>>>                                int128_get64(llend),
>>>>                                iommu_idx);
>>>> @@ -921,7 +1028,9 @@ static void
>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>                goto fail;
>>>>            }
>>>>            QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>> giommu_next);
>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>> +        }
>>>>              return;
>>>>        }
>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>> vfio_memory_listener = {
>>>>        .log_sync = vfio_listener_log_sync,
>>>>    };
>>>>    +static MemoryListener vfio_memory_prereg_listener = {
>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>> +};
>>>> +
>>>>    static void vfio_listener_release(VFIOContainer *container)
>>>>    {
>>>>        memory_listener_unregister(&container->listener);
>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>            memory_listener_unregister(&container->prereg_listener);
>>>>        }
>>>>    }
>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>> *group, AddressSpace *as,
>>>>                vfio_get_iommu_info_migration(container, info);
>>>>            }
>>>>            g_free(info);
>>>> +
>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>> +            container->prereg_listener = vfio_memory_prereg_listener;
>>>> +            memory_listener_register(&container->prereg_listener,
>>>> +                                     &address_space_memory);
>>>> +            if (container->error) {
>>>> +                memory_listener_unregister(&container->prereg_listener);
>>>> +                ret = -1;
>>>> +                error_propagate_prepend(errp, container->error,
>>>> +                                    "RAM memory listener initialization failed "
>>>> +                                    "for container");
>>>> +                goto free_container_exit;
>>>> +            }
>>>> +        }
>>>>            break;
>>>>        }
>>>>        case VFIO_SPAPR_TCE_v2_IOMMU:
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 5c65aa0a98..cad7deec71 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -2773,6 +2773,25 @@ static void
>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>        vdev->req_enabled = false;
>>>>    }
>>>>    +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
>>>> +                                      IOMMUConfig *config)
>>>> +{
>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>> +
>>>> +    info.argsz = sizeof(info);
>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>> +    memcpy(&info.config, &config->pasid_cfg,
>>>> sizeof(config->pasid_cfg));
>>>> +
>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>>> +}
>>>> +
>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>> +};
>>>> +
>>>>    static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>    {
>>>>        VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>>> **errp)
>>>>        vfio_register_req_notifier(vdev);
>>>>        vfio_setup_resetfn_quirk(vdev);
>>>>    +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>> +
>>>>        return;
>>>>      out_deregister:
>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>> index 936d29d150..43696afc15 100644
>>>> --- a/hw/vfio/trace-events
>>>> +++ b/hw/vfio/trace-events
>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>>> int index, int nr_areas) "Devic
>>>>    vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>    vfio_get_dev_region(const char *name, int index, uint32_t type,
>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>    vfio_dma_unmap_overflow_workaround(void) ""
>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>>>      # platform.c
>>>>    vfio_platform_base_device_init(char *name, int groupid) "%s belongs
>>>> to group #%d"
>>>
>>>
>> .
>
>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC v9 15/29] vfio: Set up nested stage mappings
  2021-10-07 16:58         ` Eric Auger
@ 2021-10-08  2:13           ` Kunkun Jiang
  0 siblings, 0 replies; 46+ messages in thread
From: Kunkun Jiang @ 2021-10-08  2:13 UTC (permalink / raw)
  To: eric.auger, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, chenxiang66, tn,
	shameerali.kolothum.thodi, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, peterx, zhangfei.gao, wanghaibin.wang, yuzenghui,
	jean-philippe, zhukeqian1

Hi Eric,

On 2021/10/8 0:58, Eric Auger wrote:
> Hi Kunkun,
>
> On 4/14/21 3:45 AM, Kunkun Jiang wrote:
>> On 2021/4/13 20:57, Auger Eric wrote:
>>> Hi Kunkun,
>>>
>>> On 4/13/21 2:10 PM, Kunkun Jiang wrote:
>>>> Hi Eric,
>>>>
>>>> On 2021/4/11 20:08, Eric Auger wrote:
>>>>> In nested mode, legacy vfio_iommu_map_notify cannot be used as
>>>>> there is no "caching" mode and we do not trap on map.
>>>>>
>>>>> On Intel, vfio_iommu_map_notify was used to DMA map the RAM
>>>>> through the host single stage.
>>>>>
>>>>> With nested mode, we need to setup the stage 2 and the stage 1
>>>>> separately. This patch introduces a prereg_listener to setup
>>>>> the stage 2 mapping.
>>>>>
>>>>> The stage 1 mapping, owned by the guest, is passed to the host
>>>>> when the guest invalidates the stage 1 configuration, through
>>>>> a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
>>>>> are cascaded down to the host through another IOMMU MR UNMAP
>>>>> notifier.
>>>>>
>>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>>
>>>>> ---
>>>>>
>>>>> v7 -> v8:
>>>>> - properly handle new IOMMUTLBEntry fields and especially
>>>>>      propagate DOMAIN and PASID based invalidations
>>>>>
>>>>> v6 -> v7:
>>>>> - remove PASID based invalidation
>>>>>
>>>>> v5 -> v6:
>>>>> - add error_report_err()
>>>>> - remove the abort in case of nested stage case
>>>>>
>>>>> v4 -> v5:
>>>>> - use VFIO_IOMMU_SET_PASID_TABLE
>>>>> - use PCIPASIDOps for config notification
>>>>>
>>>>> v3 -> v4:
>>>>> - use iommu_inv_pasid_info for ASID invalidation
>>>>>
>>>>> v2 -> v3:
>>>>> - use VFIO_IOMMU_ATTACH_PASID_TABLE
>>>>> - new user API
>>>>> - handle leaf
>>>>>
>>>>> v1 -> v2:
>>>>> - adapt to uapi changes
>>>>> - pass the asid
>>>>> - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
>>>>> ---
>>>>>     hw/vfio/common.c     | 139 +++++++++++++++++++++++++++++++++++++++++--
>>>>>     hw/vfio/pci.c        |  21 +++++++
>>>>>     hw/vfio/trace-events |   2 +
>>>>>     3 files changed, 157 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>> index 0cd7ef2139..e369d451e7 100644
>>>>> --- a/hw/vfio/common.c
>>>>> +++ b/hw/vfio/common.c
>>>>> @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry
>>>>> *iotlb, void **vaddr,
>>>>>         return true;
>>>>>     }
>>>>>     +/* Propagate a guest IOTLB invalidation to the host (nested
>>>>> mode) */
>>>>> +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>> *iotlb)
>>>>> +{
>>>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>> +    struct vfio_iommu_type1_cache_invalidate ustruct = {};
>>>>> +    VFIOContainer *container = giommu->container;
>>>>> +    int ret;
>>>>> +
>>>>> +    assert(iotlb->perm == IOMMU_NONE);
>>>>> +
>>>>> +    ustruct.argsz = sizeof(ustruct);
>>>>> +    ustruct.flags = 0;
>>>>> +    ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info);
>>>>> +    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
>>>>> +    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
>>>>> +
>>>>> +    switch (iotlb->granularity) {
>>>>> +    case IOMMU_INV_GRAN_DOMAIN:
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN;
>>>>> +        break;
>>>>> +    case IOMMU_INV_GRAN_PASID:
>>>>> +    {
>>>>> +        struct iommu_inv_pasid_info *pasid_info;
>>>>> +        int archid = -1;
>>>>> +
>>>>> +        pasid_info = &ustruct.info.granu.pasid_info;
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_PASID;
>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>> +            pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>> +            archid = iotlb->arch_id;
>>>>> +        }
>>>>> +        pasid_info->archid = archid;
>>>>> +        trace_vfio_iommu_asid_inv_iotlb(archid);
>>>>> +        break;
>>>>> +    }
>>>>> +    case IOMMU_INV_GRAN_ADDR:
>>>>> +    {
>>>>> +        hwaddr start = iotlb->iova + giommu->iommu_offset;
>>>>> +        struct iommu_inv_addr_info *addr_info;
>>>>> +        size_t size = iotlb->addr_mask + 1;
>>>>> +        int archid = -1;
>>>>> +
>>>>> +        addr_info = &ustruct.info.granu.addr_info;
>>>>> +        ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
>>>>> +        if (iotlb->leaf) {
>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
>>>>> +        }
>>>>> +        if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) {
>>>>> +            addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID;
>>>>> +            archid = iotlb->arch_id;
>>>>> +        }
>>>>> +        addr_info->archid = archid;
>>>>> +        addr_info->addr = start;
>>>>> +        addr_info->granule_size = size;
>>>>> +        addr_info->nb_granules = 1;
>>>>> +        trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
>>>>> +                                        1, iotlb->leaf);
>>>>> +        break;
>>>>> +    }
>>>> Should we pass a size to the host kernel here, even if the vSMMU
>>>> doesn't support RIL or the guest kernel doesn't use RIL?
>>>>
>>>> It will cause a TLBI issue in this scenario: the guest kernel issues
>>>> a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page.
>>>> Then QEMU passes the iova and size (4K) to the host kernel. Finally,
>>>> the host kernel issues a TLBI cmd with "range" (4K), which cannot
>>>> invalidate the TLB entry of the 2M huge page.
>>>> (pSMMU supports RIL)
>>> In that case the guest will loop over all 4K pages belonging to the 2M
>>> huge page and invalidate each of them. This should turn into QEMU
>>> notifications for each 4kB page, no? This is totally inefficient, hence
>> The guest will not loop over all 4K pages belonging to the 2M huge page.
>> The iommu_iotlb_gather->pgsize will be 2M if the page is a 2M huge page.
>> The gather->pgsize will be passed to __arm_smmu_tlb_inv_range as the
>> "granule":
>>
>> iommu_iotlb_gather_add_page
>>      iommu_iotlb_sync
>>          domain->ops->iotlb_sync
>>              arm_smmu_iotlb_sync
>>                  arm_smmu_tlb_inv_range_domain
>>                      __arm_smmu_tlb_inv_range
>>
>> In the above-mentioned scenario, the guest kernel will issue a TLBI cmd
>> with only an "iova" (tg = 0).
> I am currently respinning the SMMU part (the rest depends on
> /dev/iommu). While thinking more about this issue you reported a long
> time ago (sorry), I think a guest not using RIL is problematic wrt
> vSMMU integration. Your case would not work with vhost either, because
> when you notify vhost you need to pass an invalidation range. In this
> specific case none is passed by the guest, as the INVAL CMD just
> contains an iova and that's it. Hope I am not missing anything. And I
> cannot guess the range size: I only have the granule size from the CD
> and the IOVA. So how can I properly invalidate in the vhost case,
> besides upgrading the API to pass the info that the addr range is not
> set and invalidating the whole vhost cache in that case?
>
> So I would be inclined to support a RIL-only guest, for nested and
> globally with vhost. I can't ignore your argument that the detection of
> non-interoperability comes late, with the first inval with TG=0. But I
> think the SMMU spec without RIL simply is *not* virtualization
> compatible, and this was first reported in Aug 2017
> (https://lkml.org/lkml/2017/8/11/428); yeah, so much time has been
> spent on this without an outcome :-(. Given the nested feature
> timeframe, now with the /dev/iommu redesign, can't you consider that
> requiring a guest kernel >= 5.7, which features RIL, is acceptable?
Ok, I see. Thanks for your reply.
Looking forward to your new patch set.

Thanks,
Kunkun Jiang
> Thanks
>
> Eric
>> Thanks,
>> Kunkun Jiang
>>> the support of RIL on guest side and QEMU device.
>>>
>>> What do I miss?
>>>
>>> Thanks
>>>
>>> Eric
>>>> Thanks,
>>>> Kunkun Jiang
>>>>> +    }
>>>>> +
>>>>> +    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE,
>>>>> &ustruct);
>>>>> +    if (ret) {
>>>>> +        error_report("%p: failed to invalidate CACHE (%d)",
>>>>> container, ret);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>     static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
>>>>> *iotlb)
>>>>>     {
>>>>>         VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>>>> @@ -776,6 +843,35 @@ static void
>>>>> vfio_dma_unmap_ram_section(VFIOContainer *container,
>>>>>         }
>>>>>     }
>>>>>     +static void vfio_prereg_listener_region_add(MemoryListener
>>>>> *listener,
>>>>> +                                            MemoryRegionSection
>>>>> *section)
>>>>> +{
>>>>> +    VFIOContainer *container =
>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>> +    Error *err = NULL;
>>>>> +
>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    vfio_dma_map_ram_section(container, section, &err);
>>>>> +    if (err) {
>>>>> +        error_report_err(err);
>>>>> +    }
>>>>> +}
>>>>> +static void vfio_prereg_listener_region_del(MemoryListener *listener,
>>>>> +                                     MemoryRegionSection *section)
>>>>> +{
>>>>> +    VFIOContainer *container =
>>>>> +        container_of(listener, VFIOContainer, prereg_listener);
>>>>> +
>>>>> +    if (!memory_region_is_ram(section->mr)) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    vfio_dma_unmap_ram_section(container, section);
>>>>> +}
>>>>> +
>>>>>     static void vfio_listener_region_add(MemoryListener *listener,
>>>>>                                          MemoryRegionSection *section)
>>>>>     {
>>>>> @@ -879,9 +975,10 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>         memory_region_ref(section->mr);
>>>>>           if (memory_region_is_iommu(section->mr)) {
>>>>> +        IOMMUNotify notify;
>>>>>             VFIOGuestIOMMU *giommu;
>>>>>             IOMMUMemoryRegion *iommu_mr =
>>>>> IOMMU_MEMORY_REGION(section->mr);
>>>>> -        int iommu_idx;
>>>>> +        int iommu_idx, flags;
>>>>>               trace_vfio_listener_region_add_iommu(iova, end);
>>>>>             /*
>>>>> @@ -900,8 +997,18 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>             llend = int128_sub(llend, int128_one());
>>>>>             iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
>>>>>                                                          
>>>>> MEMTXATTRS_UNSPECIFIED);
>>>>> -        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
>>>>> -                            IOMMU_NOTIFIER_IOTLB_EVENTS,
>>>>> +
>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>> +            /* IOTLB unmap notifier to propagate guest IOTLB
>>>>> invalidations */
>>>>> +            flags = IOMMU_NOTIFIER_UNMAP;
>>>>> +            notify = vfio_iommu_unmap_notify;
>>>>> +        } else {
>>>>> +            /* MAP/UNMAP IOTLB notifier */
>>>>> +            flags = IOMMU_NOTIFIER_IOTLB_EVENTS;
>>>>> +            notify = vfio_iommu_map_notify;
>>>>> +        }
>>>>> +
>>>>> +        iommu_notifier_init(&giommu->n, notify, flags,
>>>>>                                 section->offset_within_region,
>>>>>                                 int128_get64(llend),
>>>>>                                 iommu_idx);
>>>>> @@ -921,7 +1028,9 @@ static void
>>>>> vfio_listener_region_add(MemoryListener *listener,
>>>>>                 goto fail;
>>>>>             }
>>>>>             QLIST_INSERT_HEAD(&container->giommu_list, giommu,
>>>>> giommu_next);
>>>>> -        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>> +        if (flags & IOMMU_NOTIFIER_MAP) {
>>>>> +            memory_region_iommu_replay(giommu->iommu, &giommu->n);
>>>>> +        }
>>>>>               return;
>>>>>         }
>>>>> @@ -1205,10 +1314,16 @@ static const MemoryListener
>>>>> vfio_memory_listener = {
>>>>>         .log_sync = vfio_listener_log_sync,
>>>>>     };
>>>>>     +static MemoryListener vfio_memory_prereg_listener = {
>>>>> +    .region_add = vfio_prereg_listener_region_add,
>>>>> +    .region_del = vfio_prereg_listener_region_del,
>>>>> +};
>>>>> +
>>>>>     static void vfio_listener_release(VFIOContainer *container)
>>>>>     {
>>>>>         memory_listener_unregister(&container->listener);
>>>>> -    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
>>>>> +        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>>             memory_listener_unregister(&container->prereg_listener);
>>>>>         }
>>>>>     }
>>>>> @@ -1858,6 +1973,20 @@ static int vfio_connect_container(VFIOGroup
>>>>> *group, AddressSpace *as,
>>>>>                 vfio_get_iommu_info_migration(container, info);
>>>>>             }
>>>>>             g_free(info);
>>>>> +
>>>>> +        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
>>>>> +            container->prereg_listener = vfio_memory_prereg_listener;
>>>>> +            memory_listener_register(&container->prereg_listener,
>>>>> +                                     &address_space_memory);
>>>>> +            if (container->error) {
>>>>> +
>>>>> memory_listener_unregister(&container->prereg_listener);
>>>>> +                ret = -1;
>>>>> +                error_propagate_prepend(errp, container->error,
>>>>> +                                    "RAM memory listener
>>>>> initialization failed "
>>>>> +                                    "for container");
>>>>> +                goto free_container_exit;
>>>>> +            }
>>>>> +        }
>>>>>             break;
>>>>>         }
>>>>>         case VFIO_SPAPR_TCE_v2_IOMMU:
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index 5c65aa0a98..cad7deec71 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -2773,6 +2773,25 @@ static void
>>>>> vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>>>>>         vdev->req_enabled = false;
>>>>>     }
>>>>>     +static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
>>>>> +                                      IOMMUConfig *config)
>>>>> +{
>>>>> +    PCIDevice *pdev = bus->devices[devfn];
>>>>> +    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
>>>>> +    VFIOContainer *container = vdev->vbasedev.group->container;
>>>>> +    struct vfio_iommu_type1_set_pasid_table info;
>>>>> +
>>>>> +    info.argsz = sizeof(info);
>>>>> +    info.flags = VFIO_PASID_TABLE_FLAG_SET;
>>>>> +    memcpy(&info.config, &config->pasid_cfg,
>>>>> sizeof(config->pasid_cfg));
>>>>> +
>>>>> +    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
>>>>> +}
>>>>> +
>>>>> +static PCIPASIDOps vfio_pci_pasid_ops = {
>>>>> +    .set_pasid_table = vfio_iommu_set_pasid_table,
>>>>> +};
>>>>> +
>>>>>     static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>>     {
>>>>>         VFIOPCIDevice *vdev = VFIO_PCI(pdev);
>>>>> @@ -3084,6 +3103,8 @@ static void vfio_realize(PCIDevice *pdev, Error
>>>>> **errp)
>>>>>         vfio_register_req_notifier(vdev);
>>>>>         vfio_setup_resetfn_quirk(vdev);
>>>>>     +    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
>>>>> +
>>>>>         return;
>>>>>       out_deregister:
>>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>>> index 936d29d150..43696afc15 100644
>>>>> --- a/hw/vfio/trace-events
>>>>> +++ b/hw/vfio/trace-events
>>>>> @@ -120,6 +120,8 @@ vfio_region_sparse_mmap_header(const char *name,
>>>>> int index, int nr_areas) "Devic
>>>>>     vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned
>>>>> long end) "sparse entry %d [0x%lx - 0x%lx]"
>>>>>     vfio_get_dev_region(const char *name, int index, uint32_t type,
>>>>> uint32_t subtype) "%s index %d, %08x/%0x8"
>>>>>     vfio_dma_unmap_overflow_workaround(void) ""
>>>>> +vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size,
>>>>> uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d,
>>>>> addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64"
>>>>> leaf=%d"
>>>>> +vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
>>>>>       # platform.c
>>>>>     vfio_platform_base_device_init(char *name, int groupid) "%s belongs
>>>>> to group #%d"
>>>>
>>> .
>>
> .

* RE: [RFC v9 16/29] vfio: Pass stage 1 MSI bindings to the host
  2021-04-11 12:08 ` [RFC v9 16/29] vfio: Pass stage 1 MSI bindings to the host Eric Auger
@ 2021-10-15 10:54   ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 46+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-10-15 10:54 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, alex.williamson
  Cc: peter.maydell, jacob.jun.pan, jean-philippe, tn, chenxiang (M),
	jiangkunkun, peterx, nicoleotsuka, vivek.gautam, vdumpa,
	yi.l.liu, zhangfei.gao, yuzenghui, qubingbing, zhukeqian

Hi Eric,

> -----Original Message-----
> From: Eric Auger [mailto:eric.auger@redhat.com]
> Sent: 11 April 2021 13:09
> To: eric.auger.pro@gmail.com; eric.auger@redhat.com;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org;
> alex.williamson@redhat.com
> Cc: peter.maydell@linaro.org; jean-philippe@linaro.org; peterx@redhat.com;
> jacob.jun.pan@linux.intel.com; yi.l.liu@intel.com; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; tn@semihalf.com;
> nicoleotsuka@gmail.com; yuzenghui <yuzenghui@huawei.com>;
> zhangfei.gao@gmail.com; vivek.gautam@arm.com; jiangkunkun
> <jiangkunkun@huawei.com>; vdumpa@nvidia.com; chenxiang (M)
> <chenxiang66@hisilicon.com>; zhukeqian <zhukeqian1@huawei.com>
> Subject: [RFC v9 16/29] vfio: Pass stage 1 MSI bindings to the host
> 
> We register the stage1 MSI bindings when enabling the vectors
> and we unregister them on msi disable.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> v7 -> v8:
> - add unregistration on msix_disable
> - remove vfio_container_unbind_msis()
> 
> v4 -> v5:
> - use VFIO_IOMMU_SET_MSI_BINDING
> 
> v2 -> v3:
> - only register the notifier if the IOMMU translates MSIs
> - record the msi bindings in a container list and unregister on
>   container release
> ---
>  include/hw/vfio/vfio-common.h | 12 ++++++
>  hw/vfio/common.c              | 59 +++++++++++++++++++++++++++
>  hw/vfio/pci.c                 | 76
> ++++++++++++++++++++++++++++++++++-
>  hw/vfio/trace-events          |  2 +
>  4 files changed, 147 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 6141162d7a..f30133b2a3 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -74,6 +74,14 @@ typedef struct VFIOAddressSpace {
>      QLIST_ENTRY(VFIOAddressSpace) list;
>  } VFIOAddressSpace;
> 
> +typedef struct VFIOMSIBinding {
> +    int index;
> +    hwaddr iova;
> +    hwaddr gpa;
> +    hwaddr size;
> +    QLIST_ENTRY(VFIOMSIBinding) next;
> +} VFIOMSIBinding;
> +
>  struct VFIOGroup;
> 
>  typedef struct VFIOContainer {
> @@ -91,6 +99,7 @@ typedef struct VFIOContainer {
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>      QLIST_HEAD(, VFIOGroup) group_list;
> +    QLIST_HEAD(, VFIOMSIBinding) msibinding_list;
>      QLIST_ENTRY(VFIOContainer) next;
>  } VFIOContainer;
> 
> @@ -200,6 +209,9 @@ VFIOGroup *vfio_get_group(int groupid,
> AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +int vfio_iommu_set_msi_binding(VFIOContainer *container, int n,
> +                               IOMMUTLBEntry *entry);
> +int vfio_iommu_unset_msi_binding(VFIOContainer *container, int n);
> 
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index e369d451e7..970a5a7be7 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -662,6 +662,65 @@ static void
> vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      }
>  }
> 
> +int vfio_iommu_set_msi_binding(VFIOContainer *container, int n,
> +                               IOMMUTLBEntry *iotlb)
> +{
> +    struct vfio_iommu_type1_set_msi_binding ustruct;
> +    VFIOMSIBinding *binding;
> +    int ret;
> +
> +    QLIST_FOREACH(binding, &container->msibinding_list, next) {
> +        if (binding->index == n) {
> +            return 0;
> +        }
> +    }
> +
> +    ustruct.argsz = sizeof(struct vfio_iommu_type1_set_msi_binding);
> +    ustruct.iova = iotlb->iova;
> +    ustruct.flags = VFIO_IOMMU_BIND_MSI;
> +    ustruct.gpa = iotlb->translated_addr;
> +    ustruct.size = iotlb->addr_mask + 1;
> +    ret = ioctl(container->fd, VFIO_IOMMU_SET_MSI_BINDING , &ustruct);
> +    if (ret) {
> +        error_report("%s: failed to register the stage1 MSI binding (%m)",
> +                     __func__);
> +        return ret;
> +    }
> +    binding =  g_new0(VFIOMSIBinding, 1);
> +    binding->iova = ustruct.iova;
> +    binding->gpa = ustruct.gpa;
> +    binding->size = ustruct.size;
> +    binding->index = n;
> +
> +    QLIST_INSERT_HEAD(&container->msibinding_list, binding, next);
> +    return 0;
> +}
> +
> +int vfio_iommu_unset_msi_binding(VFIOContainer *container, int n)
> +{
> +    struct vfio_iommu_type1_set_msi_binding ustruct;
> +    VFIOMSIBinding *binding, *tmp;
> +    int ret;
> +
> +    ustruct.argsz = sizeof(struct vfio_iommu_type1_set_msi_binding);
> +    QLIST_FOREACH_SAFE(binding, &container->msibinding_list, next, tmp) {
> +        if (binding->index != n) {
> +            continue;
> +        }
> +        ustruct.flags = VFIO_IOMMU_UNBIND_MSI;
> +        ustruct.iova = binding->iova;
> +        ret = ioctl(container->fd, VFIO_IOMMU_SET_MSI_BINDING ,
> &ustruct);
> +        if (ret) {
> +            error_report("Failed to unregister the stage1 MSI binding "
> +                         "for iova=0x%"PRIx64" (%m)", binding->iova);
> +        }
> +        QLIST_REMOVE(binding, next);
> +        g_free(binding);
> +        return ret;
> +    }
> +    return 0;
> +}
> +
>  static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry
> *iotlb)
>  {
>      VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index cad7deec71..a49029dfa4 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -366,6 +366,65 @@ static void vfio_msi_interrupt(void *opaque)
>      notify(&vdev->pdev, nr);
>  }
> 
> +static bool vfio_iommu_require_msi_binding(IOMMUMemoryRegion
> *iommu_mr)
> +{
> +    bool msi_translate = false, nested = false;
> +
> +    memory_region_iommu_get_attr(iommu_mr,
> IOMMU_ATTR_MSI_TRANSLATE,
> +                                 (void *)&msi_translate);
> +    memory_region_iommu_get_attr(iommu_mr,
> IOMMU_ATTR_VFIO_NESTED,
> +                                 (void *)&nested);
> +    if (!nested || !msi_translate) {
> +        return false;
> +    }
> +   return true;
> +}
> +
> +static int vfio_register_msi_binding(VFIOPCIDevice *vdev,
> +                                     int vector_n, bool set)
> +{
> +    VFIOContainer *container = vdev->vbasedev.group->container;
> +    PCIDevice *dev = &vdev->pdev;
> +    AddressSpace *as = pci_device_iommu_address_space(dev);
> +    IOMMUMemoryRegionClass *imrc;
> +    IOMMUMemoryRegion *iommu_mr;
> +    IOMMUTLBEntry entry;
> +    MSIMessage msg;
> +
> +    if (as == &address_space_memory) {
> +        return 0;
> +    }
> +
> +    iommu_mr = IOMMU_MEMORY_REGION(as->root);
> +    if (!vfio_iommu_require_msi_binding(iommu_mr)) {
> +        return 0;
> +    }
> +
> +    /* MSI doorbell address is translated by an IOMMU */
> +
> +    if (!set) { /* unregister */
> +        trace_vfio_unregister_msi_binding(vdev->vbasedev.name,
> vector_n);
> +
> +        return vfio_iommu_unset_msi_binding(container, vector_n);
> +    }
> +
> +    msg = pci_get_msi_message(dev, vector_n);
> +    imrc = memory_region_get_iommu_class_nocheck(iommu_mr);
> +
> +    rcu_read_lock();
> +    entry = imrc->translate(iommu_mr, msg.address, IOMMU_WO, 0);
> +    rcu_read_unlock();
> +
> +    if (entry.perm == IOMMU_NONE) {
> +        return -ENOENT;
> +    }
> +
> +    trace_vfio_register_msi_binding(vdev->vbasedev.name, vector_n,
> +                                    msg.address,
> entry.translated_addr);
> +
> +    return vfio_iommu_set_msi_binding(container, vector_n, &entry);
> +}
> +
>  static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>  {
>      struct vfio_irq_set *irq_set;
> @@ -383,7 +442,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev,
> bool msix)
>      fds = (int32_t *)&irq_set->data;
> 
>      for (i = 0; i < vdev->nr_vectors; i++) {
> -        int fd = -1;
> +        int ret, fd = -1;
> 
>          /*
>           * MSI vs MSI-X - The guest has direct access to MSI mask and
> pending
> @@ -392,6 +451,12 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev,
> bool msix)
>           * KVM signaling path only when configured and unmasked.
>           */
>          if (vdev->msi_vectors[i].use) {
> +            ret = vfio_register_msi_binding(vdev, i, true);

Just a heads-up in case you have plans to re-spin this series soon.

With the recent kernel commit 438553958ba1 ("PCI/MSI: Enable and mask
MSI-X early") [1], MSI-X is enabled early but the vectors remain masked.

This creates a problem here when you call pci_get_msi_message() above,
as it returns msg.address as 0. Hence the stage 1 MSI binding fails for
devices with the MSI-X capability.

I have a temporary fix here [2] where we effectively wait until MASKALL
is cleared before enabling MSI-X. Otherwise we would need to delay this
binding for MSI-X to a later stage.

Please take a look and let me know.

Thanks,
Shameer
[1] https://lore.kernel.org/all/20210729222542.344136412@linutronix.de/
[2] https://github.com/Linaro/qemu/commit/568820e409417473eb6f16dfdf8e9075f5a5feaf
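
The window described above can be illustrated with a minimal check over
the MSI-X Message Control bits (bit positions per the PCI specification;
the helper itself is a sketch, not the actual QEMU code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MSI-X Message Control register bits (PCI spec) */
#define MSIX_CTRL_ENABLE   (1u << 15)  /* MSI-X Enable */
#define MSIX_CTRL_MASKALL  (1u << 14)  /* Function Mask (MASKALL) */

/*
 * True only once it is safe to read the vector's doorbell address:
 * MSI-X enabled, MASKALL cleared, and the vector table entry actually
 * programmed. With the early-enable behavior, MSI-X is enabled while
 * MASKALL is still set, so msg_addr reads back as 0 at that point.
 */
static bool msix_vector_addr_valid(uint16_t msix_ctrl, uint64_t msg_addr)
{
    if (!(msix_ctrl & MSIX_CTRL_ENABLE)) {
        return false;
    }
    if (msix_ctrl & MSIX_CTRL_MASKALL) {
        return false;  /* enabled early, vectors still masked */
    }
    return msg_addr != 0;
}
```

A workaround along the lines of [2] would defer the stage 1 MSI binding
until this predicate holds for the vector.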
> +            if (ret) {
> +                error_report("%s failed to register S1 MSI binding "
> +                             "for vector %d(%d)", vdev->vbasedev.name,
> i, ret);
> +                goto out;
> +            }
>              if (vdev->msi_vectors[i].virq < 0 ||
>                  (msix && msix_is_masked(&vdev->pdev, i))) {
>                  fd =
> event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
> @@ -405,6 +470,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev,
> bool msix)
> 
>      ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> 
> +out:
>      g_free(irq_set);
> 
>      return ret;
> @@ -719,7 +785,8 @@ static void vfio_msi_disable_common(VFIOPCIDevice
> *vdev)
> 
>  static void vfio_msix_disable(VFIOPCIDevice *vdev)
>  {
> -    int i;
> +    int ret, i;
> +
> 
>      msix_unset_vector_notifiers(&vdev->pdev);
> 
> @@ -731,6 +798,11 @@ static void vfio_msix_disable(VFIOPCIDevice *vdev)
>          if (vdev->msi_vectors[i].use) {
>              vfio_msix_vector_release(&vdev->pdev, i);
>              msix_vector_unuse(&vdev->pdev, i);
> +            ret = vfio_register_msi_binding(vdev, i, false);
> +            if (ret) {
> +                error_report("%s: failed to unregister S1 MSI binding "
> +                             "for vector %d(%d)", vdev->vbasedev.name,
> i, ret);
> +            }
>          }
>      }
> 
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 43696afc15..5c1b28d0d4 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -122,6 +122,8 @@ vfio_get_dev_region(const char *name, int index,
> uint32_t type, uint32_t subtype
>  vfio_dma_unmap_overflow_workaround(void) ""
>  vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t
> nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64"
> granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
>  vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
> +vfio_register_msi_binding(const char *name, int vector, uint64_t giova,
> uint64_t gdb) "%s: register vector %d gIOVA=0x%"PRIx64 "-> gDB=0x%"PRIx64"
> stage 1 mapping"
> +vfio_unregister_msi_binding(const char *name, int vector) "%s: unregister
> vector %d stage 1 mapping"
> 
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to
> group #%d"
> --
> 2.26.3




end of thread, other threads:[~2021-10-15 11:00 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-11 12:08 [RFC v9 00/29] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
2021-04-11 12:08 ` [RFC v9 01/29] hw/vfio/common: trace vfio_connect_container operations Eric Auger
2021-04-11 12:08 ` [RFC v9 02/29] update-linux-headers: Import iommu.h Eric Auger
2021-04-11 12:08 ` [RFC v9 03/29] header update against 5.12-rc6 and IOMMU/VFIO nested stage APIs Eric Auger
2021-04-11 12:08 ` [RFC v9 04/29] memory: Add new fields in IOTLBEntry Eric Auger
2021-04-11 12:08 ` [RFC v9 05/29] hw/arm/smmuv3: Improve stage1 ASID invalidation Eric Auger
2021-04-11 12:08 ` [RFC v9 06/29] hw/arm/smmu-common: Allow domain invalidation for NH_ALL/NSNH_ALL Eric Auger
2021-04-11 12:08 ` [RFC v9 07/29] memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute Eric Auger
2021-04-11 12:08 ` [RFC v9 08/29] memory: Add IOMMU_ATTR_MSI_TRANSLATE " Eric Auger
2021-04-11 12:08 ` [RFC v9 09/29] memory: Introduce IOMMU Memory Region inject_faults API Eric Auger
2021-04-11 12:08 ` [RFC v9 10/29] iommu: Introduce generic header Eric Auger
2021-04-11 12:08 ` [RFC v9 11/29] pci: introduce PCIPASIDOps to PCIDevice Eric Auger
2021-04-11 12:08 ` [RFC v9 12/29] vfio: Force nested if iommu requires it Eric Auger
2021-04-11 12:08 ` [RFC v9 13/29] vfio: Introduce hostwin_from_range helper Eric Auger
2021-04-11 12:08 ` [RFC v9 14/29] vfio: Introduce helpers to DMA map/unmap a RAM section Eric Auger
2021-04-27 14:05   ` Kunkun Jiang
2021-09-03  8:22   ` Kunkun Jiang
2021-04-11 12:08 ` [RFC v9 15/29] vfio: Set up nested stage mappings Eric Auger
2021-04-13 12:10   ` Kunkun Jiang
2021-04-13 12:57     ` Auger Eric
2021-04-14  1:45       ` Kunkun Jiang
2021-04-14  8:05         ` Auger Eric
2021-04-15  2:03           ` Kunkun Jiang
2021-04-26 19:16             ` Auger Eric
2021-04-28  9:51               ` Kunkun Jiang
2021-04-29 13:58                 ` Auger Eric
2021-04-26 12:30         ` Auger Eric
2021-04-27  8:58           ` Kunkun Jiang
2021-10-07 16:58         ` Eric Auger
2021-10-08  2:13           ` Kunkun Jiang
2021-04-11 12:08 ` [RFC v9 16/29] vfio: Pass stage 1 MSI bindings to the host Eric Auger
2021-10-15 10:54   ` Shameerali Kolothum Thodi
2021-04-11 12:09 ` [RFC v9 17/29] vfio: Helper to get IRQ info including capabilities Eric Auger
2021-04-11 12:09 ` [RFC v9 18/29] vfio/pci: Register handler for iommu fault Eric Auger
2021-04-11 12:09 ` [RFC v9 19/29] vfio/pci: Set up the DMA FAULT region Eric Auger
2021-04-11 12:09 ` [RFC v9 20/29] vfio/pci: Implement the DMA fault handler Eric Auger
2021-04-11 12:09 ` [RFC v9 21/29] hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute Eric Auger
2021-04-11 12:09 ` [RFC v9 22/29] hw/arm/smmuv3: Store the PASID table GPA in the translation config Eric Auger
2021-04-11 12:09 ` [RFC v9 23/29] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation Eric Auger
2021-04-11 12:09 ` [RFC v9 24/29] hw/arm/smmuv3: Fill the IOTLBEntry leaf field " Eric Auger
2021-05-13  7:09   ` Kunkun Jiang
2021-04-11 12:09 ` [RFC v9 25/29] hw/arm/smmuv3: Pass stage 1 configurations to the host Eric Auger
2021-04-11 12:09 ` [RFC v9 26/29] hw/arm/smmuv3: Implement fault injection Eric Auger
2021-04-11 12:09 ` [RFC v9 27/29] hw/arm/smmuv3: Allow MAP notifiers Eric Auger
2021-04-11 12:09 ` [RFC v9 28/29] pci: Add return_page_response pci ops Eric Auger
2021-04-11 12:09 ` [RFC v9 29/29] vfio/pci: Implement return_page_response page response callback Eric Auger
