QEMU-Devel Archive on lore.kernel.org
* [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
@ 2020-03-20 16:58 Eric Auger
  2020-03-20 16:58 ` [RFC v6 01/24] update-linux-headers: Import iommu.h Eric Auger
                   ` (25 more replies)
  0 siblings, 26 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Up to now vSMMUv3 has not been integrated with VFIO. VFIO
integration requires programming the physical IOMMU consistently
with the guest mappings. However, unlike VT-d, SMMUv3 has no
"Caching Mode" that allows easy trapping of guest mappings.
This means the vSMMUv3 cannot use the same VFIO integration as
VT-d.

However, SMMUv3 has 2 translation stages. These were devised with
the virtualization use case in mind, where stage 1 is "owned" by
the guest whereas the host uses stage 2 for VM isolation.

This series sets up this nested translation stage. It only works
if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
other words, it does not work if there is a physical SMMUv2).

- We force the host to use stage 2 instead of stage 1 when we
  detect a vSMMUv3 is behind a VFIO device. For a VFIO device
  without any virtual IOMMU, we still use stage 1 as many existing
  SMMUs expect this behavior.
- We use PCIPASIDOps to propagate guest stage 1 config changes on
  STE (Stream Table Entry) changes.
- We implement a specific UNMAP notifier that conveys guest
  IOTLB invalidations to the host.
- We register MSI IOVA/GPA bindings to the host so that the host
  can build a nested stage translation.
- As the legacy MAP notifier is not called anymore, we must make
  sure stage 2 mappings are set. This is achieved through another
  prereg memory listener.
- Physical SMMU stage 1 related faults are reported to the guest
  via an eventfd mechanism and exposed through a dedicated VFIO-PCI
  region. Then they are reinjected into the guest.
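
To make the fault reporting path more concrete, here is a minimal
sketch of how a consumer might drain the fault ring once the eventfd
has fired. It is not the code from this series: it works on an
in-memory image of the region rather than on a real VFIO region, the
header layout is a local copy of the struct vfio_region_dma_fault
proposed in patch 02, and the names drain_fault_ring(), fault_cb and
sum_first_byte_cb() are hypothetical. It also assumes head and tail
are entry indexes that wrap at nb_entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the DMA fault region header proposed in this series. */
struct dma_fault_region_hdr {
    uint32_t tail;       /* consumer index, written by userspace */
    uint32_t entry_size; /* size of one fault record, in bytes */
    uint32_t nb_entries; /* capacity of the ring buffer */
    uint32_t offset;     /* ring start, relative to the region start */
    uint32_t head;       /* producer index, written by the kernel */
};

/* Hypothetical callback type: receives one raw fault record. */
typedef void (*fault_cb)(const uint8_t *entry, uint32_t entry_size,
                         void *opaque);

/* Example callback: accumulates the first byte of each record. */
static void sum_first_byte_cb(const uint8_t *entry, uint32_t entry_size,
                              void *opaque)
{
    uint32_t *sum = opaque;

    if (entry_size > 0) {
        *sum += entry[0];
    }
}

/*
 * Consume every record between tail and head, then publish the new
 * tail. Returns the number of records consumed, or -1 if the header
 * is inconsistent with the region size.
 */
static int drain_fault_ring(uint8_t *region, size_t region_size,
                            fault_cb cb, void *opaque)
{
    struct dma_fault_region_hdr hdr;
    int consumed = 0;

    assert(region != NULL && cb != NULL);
    if (region_size < sizeof(hdr)) {
        return -1;
    }
    memcpy(&hdr, region, sizeof(hdr));

    if (hdr.nb_entries == 0 || hdr.entry_size == 0 ||
        hdr.head >= hdr.nb_entries || hdr.tail >= hdr.nb_entries ||
        hdr.offset < sizeof(hdr) || hdr.offset > region_size ||
        (uint64_t)hdr.nb_entries * hdr.entry_size >
            region_size - hdr.offset) {
        return -1;
    }

    while (hdr.tail != hdr.head) {
        cb(region + hdr.offset + (size_t)hdr.tail * hdr.entry_size,
           hdr.entry_size, opaque);
        hdr.tail = (hdr.tail + 1) % hdr.nb_entries; /* wrap around */
        consumed++;
    }

    /* Only the tail field is written back, as it is the consumer's. */
    memcpy(region + offsetof(struct dma_fault_region_hdr, tail),
           &hdr.tail, sizeof(hdr.tail));
    return consumed;
}
```

In a real consumer the callback would presumably decode a struct
iommu_fault and hand it to the vIOMMU for reinjection; the sketch
only shows the index handling.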

Best Regards

Eric

This series can be found at:
https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6

Kernel Dependencies:
[1] [PATCH v10 00/11] SMMUv3 Nested Stage Setup (VFIO part)
[2] [PATCH v10 00/13] SMMUv3 Nested Stage Setup (IOMMU part)
branch at: https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10

History:

v5 -> v6:
- just rebase work

v4 -> v5:
- Use PCIPASIDOps for config update notifications
- removal of notification for MSI binding which is not needed
  anymore
- Use a single fault region
- use the specific interrupt index

v3 -> v4:
- adapt to changes in uapi (asid cache invalidation)
- check VFIO_PCI_DMA_FAULT_IRQ_INDEX is supported at kernel level
  before attempting to set signaling for it.
- sync on 5.2-rc1 kernel headers + Drew's patch that imports sve_context.h
- fix MSI binding for MSI (not MSIX)
- fix mingw compilation

v2 -> v3:
- rework fault handling
- MSI binding registration done in vfio-pci. MSI binding tear down called
  on container cleanup path
- leaf parameter propagated

v1 -> v2:
- Fixed dual assignment (asid now correctly propagated on TLB invalidations)
- Integrated fault reporting


Eric Auger (23):
  update-linux-headers: Import iommu.h
  header update against 5.6.0-rc3 and IOMMU/VFIO nested stage APIs
  memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute
  memory: Add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
  memory: Introduce IOMMU Memory Region inject_faults API
  memory: Add arch_id and leaf fields in IOTLBEntry
  iommu: Introduce generic header
  vfio: Force nested if iommu requires it
  vfio: Introduce hostwin_from_range helper
  vfio: Introduce helpers to DMA map/unmap a RAM section
  vfio: Set up nested stage mappings
  vfio: Pass stage 1 MSI bindings to the host
  vfio: Helper to get IRQ info including capabilities
  vfio/pci: Register handler for iommu fault
  vfio/pci: Set up the DMA FAULT region
  vfio/pci: Implement the DMA fault handler
  hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute
  hw/arm/smmuv3: Store the PASID table GPA in the translation config
  hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
  hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
  hw/arm/smmuv3: Pass stage 1 configurations to the host
  hw/arm/smmuv3: Implement fault injection
  hw/arm/smmuv3: Allow MAP notifiers

Liu Yi L (1):
  pci: introduce PCIPASIDOps to PCIDevice

 hw/arm/smmuv3.c                 | 189 ++++++++++--
 hw/arm/trace-events             |   3 +-
 hw/pci/pci.c                    |  34 +++
 hw/vfio/common.c                | 506 +++++++++++++++++++++++++-------
 hw/vfio/pci.c                   | 267 ++++++++++++++++-
 hw/vfio/pci.h                   |   9 +
 hw/vfio/trace-events            |   9 +-
 include/exec/memory.h           |  49 +++-
 include/hw/arm/smmu-common.h    |   1 +
 include/hw/iommu/iommu.h        |  28 ++
 include/hw/pci/pci.h            |  11 +
 include/hw/vfio/vfio-common.h   |  16 +
 linux-headers/COPYING           |   2 +
 linux-headers/asm-x86/kvm.h     |   1 +
 linux-headers/linux/iommu.h     | 375 +++++++++++++++++++++++
 linux-headers/linux/vfio.h      | 109 ++++++-
 memory.c                        |  10 +
 scripts/update-linux-headers.sh |   2 +-
 18 files changed, 1478 insertions(+), 143 deletions(-)
 create mode 100644 include/hw/iommu/iommu.h
 create mode 100644 linux-headers/linux/iommu.h

-- 
2.20.1




* [RFC v6 01/24] update-linux-headers: Import iommu.h
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-26 12:58   ` Liu, Yi L
  2020-03-20 16:58 ` [RFC v6 02/24] header update against 5.6.0-rc3 and IOMMU/VFIO nested stage APIs Eric Auger
                   ` (24 subsequent siblings)
  25 siblings, 1 reply; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC

Update the script to import the new iommu.h uapi header.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 scripts/update-linux-headers.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 29c27f4681..5b64ee3912 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -141,7 +141,7 @@ done
 
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
-for header in kvm.h vfio.h vfio_ccw.h vhost.h \
+for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
               psci.h psp-sev.h userfaultfd.h mman.h; do
     cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
-- 
2.20.1




* [RFC v6 02/24] header update against 5.6.0-rc3 and IOMMU/VFIO nested stage APIs
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
  2020-03-20 16:58 ` [RFC v6 01/24] update-linux-headers: Import iommu.h Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 03/24] memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute Eric Auger
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC

This is an update against
https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 linux-headers/COPYING       |   2 +
 linux-headers/asm-x86/kvm.h |   1 +
 linux-headers/linux/iommu.h | 375 ++++++++++++++++++++++++++++++++++++
 linux-headers/linux/vfio.h  | 109 ++++++++++-
 4 files changed, 486 insertions(+), 1 deletion(-)
 create mode 100644 linux-headers/linux/iommu.h

diff --git a/linux-headers/COPYING b/linux-headers/COPYING
index da4cb28feb..a635a38ef9 100644
--- a/linux-headers/COPYING
+++ b/linux-headers/COPYING
@@ -16,3 +16,5 @@ In addition, other licenses may also apply. Please see:
 	Documentation/process/license-rules.rst
 
 for more details.
+
+All contributions to the Linux Kernel are subject to this COPYING file.
diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 503d3f42da..3f3f780c8c 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -390,6 +390,7 @@ struct kvm_sync_regs {
 #define KVM_STATE_NESTED_GUEST_MODE	0x00000001
 #define KVM_STATE_NESTED_RUN_PENDING	0x00000002
 #define KVM_STATE_NESTED_EVMCS		0x00000004
+#define KVM_STATE_NESTED_MTF_PENDING	0x00000008
 
 #define KVM_STATE_NESTED_SMM_GUEST_MODE	0x00000001
 #define KVM_STATE_NESTED_SMM_VMXON	0x00000002
diff --git a/linux-headers/linux/iommu.h b/linux-headers/linux/iommu.h
new file mode 100644
index 0000000000..1b3f6420bb
--- /dev/null
+++ b/linux-headers/linux/iommu.h
@@ -0,0 +1,375 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ */
+
+#ifndef _IOMMU_H
+#define _IOMMU_H
+
+#include <linux/types.h>
+
+#define IOMMU_FAULT_PERM_READ	(1 << 0) /* read */
+#define IOMMU_FAULT_PERM_WRITE	(1 << 1) /* write */
+#define IOMMU_FAULT_PERM_EXEC	(1 << 2) /* exec */
+#define IOMMU_FAULT_PERM_PRIV	(1 << 3) /* privileged */
+
+/* Generic fault types, can be expanded IRQ remapping fault */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* Could not access the PASID table (fetch caused external abort) */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/* PASID entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/*
+	 * An external abort occurred fetching (or updating) a translation
+	 * table descriptor
+	 */
+	IOMMU_FAULT_REASON_WALK_EABT,
+
+	/*
+	 * Could not access the page table entry (Bad address),
+	 * actual translation fault
+	 */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+
+	/* access flag check failed */
+	IOMMU_FAULT_REASON_ACCESS,
+
+	/* Output address of a translation stage caused Address Size fault */
+	IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * struct iommu_fault_unrecoverable - Unrecoverable fault data
+ * @reason: reason of the fault, from &enum iommu_fault_reason
+ * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
+ * @pasid: Process Address Space ID
+ * @perm: requested permission access using by the incoming transaction
+ *        (IOMMU_FAULT_PERM_* values)
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ */
+struct iommu_fault_unrecoverable {
+	__u32	reason;
+#define IOMMU_FAULT_UNRECOV_PASID_VALID		(1 << 0)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID		(1 << 1)
+#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	perm;
+	__u64	addr;
+	__u64	fetch_addr;
+};
+
+/**
+ * struct iommu_fault_page_request - Page Request data
+ * @flags: encodes whether the corresponding fields are valid and whether this
+ *         is the last page in group (IOMMU_FAULT_PAGE_REQUEST_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_fault_page_request {
+#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID	(1 << 0)
+#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE	(1 << 1)
+#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA	(1 << 2)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	perm;
+	__u64	addr;
+	__u64	private_data[2];
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ * @type: fault type from &enum iommu_fault_type
+ * @padding: reserved for future use (should be zero)
+ * @event: fault event, when @type is %IOMMU_FAULT_DMA_UNRECOV
+ * @prm: Page Request message, when @type is %IOMMU_FAULT_PAGE_REQ
+ * @padding2: sets the fault size to allow for future extensions
+ */
+struct iommu_fault {
+	__u32	type;
+	__u32	padding;
+	union {
+		struct iommu_fault_unrecoverable event;
+		struct iommu_fault_page_request prm;
+		__u8 padding2[56];
+	};
+};
+
+/**
+ * enum iommu_page_response_code - Return status of fault handlers
+ * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
+ *	populated, retry the access. This is "Success" in PCI PRI.
+ * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
+ *	this device if possible. This is "Response Failure" in PCI PRI.
+ * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
+ *	access. This is "Invalid Request" in PCI PRI.
+ */
+enum iommu_page_response_code {
+	IOMMU_PAGE_RESP_SUCCESS = 0,
+	IOMMU_PAGE_RESP_INVALID,
+	IOMMU_PAGE_RESP_FAILURE,
+};
+
+/**
+ * struct iommu_page_response - Generic page response information
+ * @version: API version of this structure
+ * @flags: encodes whether the corresponding fields are valid
+ *         (IOMMU_FAULT_PAGE_RESPONSE_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @code: response code from &enum iommu_page_response_code
+ */
+struct iommu_page_response {
+#define IOMMU_PAGE_RESP_VERSION_1	1
+	__u32	version;
+#define IOMMU_PAGE_RESP_PASID_VALID	(1 << 0)
+	__u32	flags;
+	__u32	pasid;
+	__u32	grpid;
+	__u32	code;
+};
+
+/* defines the granularity of the invalidation */
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_DOMAIN,	/* domain-selective invalidation */
+	IOMMU_INV_GRANU_PASID,	/* PASID-selective invalidation */
+	IOMMU_INV_GRANU_ADDR,	/* page-selective invalidation */
+	IOMMU_INV_GRANU_NR,	/* number of invalidation granularities */
+};
+
+/**
+ * struct iommu_inv_addr_info - Address Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the address-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If ARCHID bit is set, @archid is populated and the invalidation relates
+ *   to cache entries tagged with this architecture specific ID and matching
+ *   the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - If neither PASID or ARCHID is set, global addr invalidation applies.
+ * - The LEAF flag indicates whether only the leaf PTE caching needs to be
+ *   invalidated and other paging structure caches can be preserved.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ * @addr: first stage/level input address
+ * @granule_size: page/block size of the mapping in bytes
+ * @nb_granules: number of contiguous granules to be invalidated
+ */
+struct iommu_inv_addr_info {
+#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
+#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+	__u64	addr;
+	__u64	granule_size;
+	__u64	nb_granules;
+};
+
+/**
+ * struct iommu_inv_pasid_info - PASID Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the PASID-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If the ARCHID bit is set, the @archid is populated and the invalidation
+ *   relates to cache entries tagged with this architecture specific ID and
+ *   matching the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - At least one of PASID or ARCHID must be set.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ */
+struct iommu_inv_pasid_info {
+#define IOMMU_INV_PASID_FLAGS_PASID	(1 << 0)
+#define IOMMU_INV_PASID_FLAGS_ARCHID	(1 << 1)
+	__u32	flags;
+	__u32	archid;
+	__u64	pasid;
+};
+
+/**
+ * struct iommu_cache_invalidate_info - First level/stage invalidation
+ *     information
+ * @version: API version of this structure
+ * @cache: bitfield that allows to select which caches to invalidate
+ * @granularity: defines the lowest granularity used for the invalidation:
+ *     domain > PASID > addr
+ * @padding: reserved for future use (should be zero)
+ * @pasid_info: invalidation data when @granularity is %IOMMU_INV_GRANU_PASID
+ * @addr_info: invalidation data when @granularity is %IOMMU_INV_GRANU_ADDR
+ *
+ * Not all the combinations of cache/granularity are valid:
+ *
+ * +--------------+---------------+---------------+---------------+
+ * | type /       |   DEV_IOTLB   |     IOTLB     |      PASID    |
+ * | granularity  |               |               |      cache    |
+ * +==============+===============+===============+===============+
+ * | DOMAIN       |       N/A     |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | PASID        |       Y       |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | ADDR         |       Y       |       Y       |       N/A     |
+ * +--------------+---------------+---------------+---------------+
+ *
+ * Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
+ * @version and @cache.
+ *
+ * If multiple cache types are invalidated simultaneously, they all
+ * must support the used granularity.
+ */
+struct iommu_cache_invalidate_info {
+#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
+	__u32	version;
+/* IOMMU paging structure cache */
+#define IOMMU_CACHE_INV_TYPE_IOTLB	(1 << 0) /* IOMMU IOTLB */
+#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB */
+#define IOMMU_CACHE_INV_TYPE_PASID	(1 << 2) /* PASID cache */
+#define IOMMU_CACHE_INV_TYPE_NR		(3)
+	__u8	cache;
+	__u8	granularity;
+	__u8	padding[2];
+	union {
+		struct iommu_inv_pasid_info pasid_info;
+		struct iommu_inv_addr_info addr_info;
+	};
+};
+
+/**
+ * struct iommu_gpasid_bind_data_vtd - Intel VT-d specific data on device and guest
+ * SVA binding.
+ *
+ * @flags:	VT-d PASID table entry attributes
+ * @pat:	Page attribute table data to compute effective memory type
+ * @emt:	Extended memory type
+ *
+ * Only guest vIOMMU selectable and effective options are passed down to
+ * the host IOMMU.
+ */
+struct iommu_gpasid_bind_data_vtd {
+#define IOMMU_SVA_VTD_GPASID_SRE	(1 << 0) /* supervisor request */
+#define IOMMU_SVA_VTD_GPASID_EAFE	(1 << 1) /* extended access enable */
+#define IOMMU_SVA_VTD_GPASID_PCD	(1 << 2) /* page-level cache disable */
+#define IOMMU_SVA_VTD_GPASID_PWT	(1 << 3) /* page-level write through */
+#define IOMMU_SVA_VTD_GPASID_EMTE	(1 << 4) /* extended mem type enable */
+#define IOMMU_SVA_VTD_GPASID_CD		(1 << 5) /* PASID-level cache disable */
+	__u64 flags;
+	__u32 pat;
+	__u32 emt;
+};
+
+/**
+ * struct iommu_gpasid_bind_data - Information about device and guest PASID binding
+ * @version:	Version of this data structure
+ * @format:	PASID table entry format
+ * @flags:	Additional information on guest bind request
+ * @gpgd:	Guest page directory base of the guest mm to bind
+ * @hpasid:	Process address space ID used for the guest mm in host IOMMU
+ * @gpasid:	Process address space ID used for the guest mm in guest IOMMU
+ * @addr_width:	Guest virtual address width
+ * @padding:	Reserved for future use (should be zero)
+ * @vtd:	Intel VT-d specific data
+ *
+ * Guest to host PASID mapping can be an identity or non-identity, where guest
+ * has its own PASID space. For non-identify mapping, guest to host PASID lookup
+ * is needed when VM programs guest PASID into an assigned device. VMM may
+ * trap such PASID programming then request host IOMMU driver to convert guest
+ * PASID to host PASID based on this bind data.
+ */
+struct iommu_gpasid_bind_data {
+#define IOMMU_GPASID_BIND_VERSION_1	1
+	__u32 version;
+#define IOMMU_PASID_FORMAT_INTEL_VTD	1
+	__u32 format;
+#define IOMMU_SVA_GPASID_VAL	(1 << 0) /* guest PASID valid */
+	__u64 flags;
+	__u64 gpgd;
+	__u64 hpasid;
+	__u64 gpasid;
+	__u32 addr_width;
+	__u8  padding[12];
+	/* Vendor specific data */
+	union {
+		struct iommu_gpasid_bind_data_vtd vtd;
+	};
+};
+
+/**
+ * struct iommu_pasid_smmuv3 - ARM SMMUv3 Stream Table Entry stage 1 related
+ *     information
+ * @version: API version of this structure
+ * @s1fmt: STE s1fmt (format of the CD table: single CD, linear table
+ *         or 2-level table)
+ * @s1dss: STE s1dss (specifies the behavior when @pasid_bits != 0
+ *         and no PASID is passed along with the incoming transaction)
+ * @padding: reserved for future use (should be zero)
+ *
+ * The PASID table is referred to as the Context Descriptor (CD) table on ARM
+ * SMMUv3. Please refer to the ARM SMMU 3.x spec (ARM IHI 0070A) for full
+ * details.
+ */
+struct iommu_pasid_smmuv3 {
+#define PASID_TABLE_SMMUV3_CFG_VERSION_1 1
+	__u32	version;
+	__u8	s1fmt;
+	__u8	s1dss;
+	__u8	padding[2];
+};
+
+/**
+ * struct iommu_pasid_table_config - PASID table data used to bind guest PASID
+ *     table to the host IOMMU
+ * @version: API version to prepare for future extensions
+ * @format: format of the PASID table
+ * @base_ptr: guest physical address of the PASID table
+ * @pasid_bits: number of PASID bits used in the PASID table
+ * @config: indicates whether the guest translation stage must
+ *          be translated, bypassed or aborted.
+ * @padding: reserved for future use (should be zero)
+ * @smmuv3: table information when @format is %IOMMU_PASID_FORMAT_SMMUV3
+ */
+struct iommu_pasid_table_config {
+#define PASID_TABLE_CFG_VERSION_1 1
+	__u32	version;
+#define IOMMU_PASID_FORMAT_SMMUV3	1
+	__u32	format;
+	__u64	base_ptr;
+	__u8	pasid_bits;
+#define IOMMU_PASID_CONFIG_TRANSLATE	1
+#define IOMMU_PASID_CONFIG_BYPASS	2
+#define IOMMU_PASID_CONFIG_ABORT	3
+	__u8	config;
+	__u8    padding[6];
+	union {
+		struct iommu_pasid_smmuv3 smmuv3;
+	};
+};
+
+#endif /* _IOMMU_H */
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index fb10370d29..fc9adb8df1 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -329,6 +330,9 @@ struct vfio_region_info_cap_type {
 /* sub-types for VFIO_REGION_TYPE_GFX */
 #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
 
+#define VFIO_REGION_TYPE_NESTED			(2)
+#define VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT	(1)
+
 /**
  * struct vfio_region_gfx_edid - EDID region layout.
  *
@@ -455,11 +459,30 @@ struct vfio_irq_info {
 #define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
 #define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
 #define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+#define VFIO_IRQ_INFO_FLAG_CAPS		(1 << 4) /* Info supports caps */
 	__u32	index;		/* IRQ index */
 	__u32	count;		/* Number of IRQs within this index */
+	__u32	cap_offset;	/* Offset within info struct of first cap */
 };
 #define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
 
+/*
+ * The irq type capability allows IRQs unique to a specific device or
+ * class of devices to be exposed.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_IRQ_INFO_CAP_TYPE      3
+
+struct vfio_irq_info_cap_type {
+	struct vfio_info_cap_header header;
+	__u32 type;     /* global per bus driver */
+	__u32 subtype;  /* type specific */
+};
+
+#define VFIO_IRQ_TYPE_NESTED				(1)
+#define VFIO_IRQ_SUBTYPE_DMA_FAULT			(1)
+
 /**
  * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
  *
@@ -561,7 +584,8 @@ enum {
 	VFIO_PCI_MSIX_IRQ_INDEX,
 	VFIO_PCI_ERR_IRQ_INDEX,
 	VFIO_PCI_REQ_IRQ_INDEX,
-	VFIO_PCI_NUM_IRQS
+	VFIO_PCI_NUM_IRQS = 5	/* Fixed user ABI, IRQ indexes >=5 use   */
+				/* device specific cap to define content */
 };
 
 /*
@@ -707,6 +731,38 @@ struct vfio_device_ioeventfd {
 
 #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+
+/*
+ * Capability exposed by the DMA fault region
+ * @version: ABI version
+ */
+#define VFIO_REGION_INFO_CAP_DMA_FAULT	6
+
+struct vfio_region_info_cap_fault {
+	struct vfio_info_cap_header header;
+	__u32 version;
+};
+
+/*
+ * DMA Fault Region Layout
+ * @tail: index relative to the start of the ring buffer at which the
+ *        consumer finds the next item in the buffer
+ * @entry_size: fault ring buffer entry size in bytes
+ * @nb_entries: max capacity of the fault ring buffer
+ * @offset: ring buffer offset relative to the start of the region
+ * @head: index relative to the start of the ring buffer at which the
+ *        producer (kernel) inserts items into the buffers
+ */
+struct vfio_region_dma_fault {
+	/* Write-Only */
+	__u32   tail;
+	/* Read-Only */
+	__u32   entry_size;
+	__u32	nb_entries;
+	__u32	offset;
+	__u32   head;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
@@ -794,6 +850,57 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_SET_PASID_TABLE - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *			struct vfio_iommu_type1_set_pasid_table)
+ *
+ * The SET operation passes a PASID table to the host while the
+ * UNSET operation detaches the one currently programmed. Setting
+ * a table while another is already programmed replaces the old table.
+ */
+struct vfio_iommu_type1_set_pasid_table {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_PASID_TABLE_FLAG_SET	(1 << 0)
+#define VFIO_PASID_TABLE_FLAG_UNSET	(1 << 1)
+	struct iommu_pasid_table_config config; /* used on SET */
+};
+
+#define VFIO_IOMMU_SET_PASID_TABLE	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE + 23,
+ *			struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct iommu_cache_invalidate_info info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 23)
+
+/**
+ * VFIO_IOMMU_SET_MSI_BINDING - _IOWR(VFIO_TYPE, VFIO_BASE + 24,
+ *			struct vfio_iommu_type1_set_msi_binding)
+ *
+ * Pass a stage 1 MSI doorbell mapping to the host so that this
+ * latter can build a nested stage2 mapping. Or conversely tear
+ * down a previously bound stage 1 MSI binding.
+ */
+struct vfio_iommu_type1_set_msi_binding {
+	__u32   argsz;
+	__u32   flags;
+#define VFIO_IOMMU_BIND_MSI	(1 << 0)
+#define VFIO_IOMMU_UNBIND_MSI	(1 << 1)
+	__u64	iova;	/* MSI guest IOVA */
+	/* Fields below are used on BIND */
+	__u64	gpa;	/* MSI guest physical address */
+	__u64	size;	/* size of stage1 mapping (bytes) */
+};
+#define VFIO_IOMMU_SET_MSI_BINDING      _IO(VFIO_TYPE, VFIO_BASE + 24)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.20.1




* [RFC v6 03/24] memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
  2020-03-20 16:58 ` [RFC v6 01/24] update-linux-headers: Import iommu.h Eric Auger
  2020-03-20 16:58 ` [RFC v6 02/24] header update against 5.6.0-rc3 and IOMMU/VFIO nested stage APIs Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 04/24] memory: Add IOMMU_ATTR_MSI_TRANSLATE " Eric Auger
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC

We introduce a new IOMMU Memory Region attribute,
IOMMU_ATTR_VFIO_NESTED that tells whether the virtual IOMMU
requires HW nested paging for VFIO integration.

The current Intel virtual IOMMU device supports "Caching
Mode" and does not require 2 stages at the physical level to be
integrated with VFIO. However, SMMUv3 does not implement such a
"caching mode" and requires the use of HW nested paging.

As such SMMUv3 is the first IOMMU device to advertise this
attribute.
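
As an illustration of how a caller could consume such an attribute,
here is a small self-contained model of the get_attr pattern. The
types and names (struct iommu_region, region_requires_nested(),
smmu_like_get_attr()) are hypothetical stand-ins rather than QEMU's
own types; only the behaviour (true for the nested attribute, -EINVAL
for anything else) mirrors the callback added by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for enum IOMMUMemoryRegionAttr. */
enum iommu_attr {
    ATTR_SPAPR_TCE_FD,
    ATTR_VFIO_NESTED,
};

struct iommu_region;

typedef int (*get_attr_fn)(struct iommu_region *mr,
                           enum iommu_attr attr, void *data);

/* Hypothetical stand-in for an IOMMU memory region and its class. */
struct iommu_region {
    get_attr_fn get_attr; /* optional, may be NULL */
};

/* Mirrors the shape of the smmuv3 callback added in this patch. */
static int smmu_like_get_attr(struct iommu_region *mr,
                              enum iommu_attr attr, void *data)
{
    (void)mr;
    if (attr == ATTR_VFIO_NESTED) {
        *(bool *)data = true;
        return 0;
    }
    return -EINVAL;
}

/*
 * Returns true only if the region implements get_attr and reports
 * that it needs HW nested paging. A missing callback or an error
 * is treated as "not nested".
 */
static bool region_requires_nested(struct iommu_region *mr)
{
    bool nested = false;

    assert(mr != NULL);
    if (!mr->get_attr) {
        return false;
    }
    if (mr->get_attr(mr, ATTR_VFIO_NESTED, &nested) != 0) {
        return false;
    }
    return nested;
}
```

Treating errors as "not nested" keeps IOMMU models that do not know
about the attribute on their existing path, which is one possible
policy and not necessarily the one VFIO adopts later in the series.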

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmuv3.c       | 12 ++++++++++++
 include/exec/memory.h |  3 ++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 57a79df55b..e33eabd028 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1508,6 +1508,17 @@ static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
     return 0;
 }
 
+static int smmuv3_get_attr(IOMMUMemoryRegion *iommu,
+                           enum IOMMUMemoryRegionAttr attr,
+                           void *data)
+{
+    if (attr == IOMMU_ATTR_VFIO_NESTED) {
+        *(bool *) data = true;
+        return 0;
+    }
+    return -EINVAL;
+}
+
 static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
                                                   void *data)
 {
@@ -1515,6 +1526,7 @@ static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
 
     imrc->translate = smmuv3_translate;
     imrc->notify_flag_changed = smmuv3_notify_flag_changed;
+    imrc->get_attr = smmuv3_get_attr;
 }
 
 static const TypeInfo smmuv3_type_info = {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1614d9a02c..b9d2f0a437 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -213,7 +213,8 @@ typedef struct MemoryRegionClass {
 
 
 enum IOMMUMemoryRegionAttr {
-    IOMMU_ATTR_SPAPR_TCE_FD
+    IOMMU_ATTR_SPAPR_TCE_FD,
+    IOMMU_ATTR_VFIO_NESTED,
 };
 
 /**
-- 
2.20.1




* [RFC v6 04/24] memory: Add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (2 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 03/24] memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API Eric Auger
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC

We introduce a new IOMMU Memory Region attribute,
IOMMU_ATTR_MSI_TRANSLATE, which tells whether the virtual IOMMU
translates MSIs. The ARM SMMU will expose this attribute since, as
opposed to Intel DMAR, MSIs are translated like any other DMA
request.
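
When MSIs are translated, the guest stage 1 doorbell mapping has to
reach the host. The sketch below shows how a bind or unbind request
could be filled in. The struct is a local copy of the
vfio_iommu_type1_set_msi_binding layout proposed in patch 02, and the
helper names make_msi_bind() and make_msi_unbind() are hypothetical.
No ioctl is issued here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the BIND/UNBIND flag values from patch 02. */
#define MSI_BIND_FLAG   (1u << 0)
#define MSI_UNBIND_FLAG (1u << 1)

/* Local copy of the proposed MSI binding request layout. */
struct msi_binding_req {
    uint32_t argsz;
    uint32_t flags;
    uint64_t iova; /* MSI guest IOVA */
    uint64_t gpa;  /* MSI guest physical address, used on BIND */
    uint64_t size; /* size of the stage 1 mapping in bytes, on BIND */
};

/* Build a BIND request for one doorbell mapping. */
static struct msi_binding_req make_msi_bind(uint64_t iova, uint64_t gpa,
                                            uint64_t size)
{
    struct msi_binding_req req;

    assert(size != 0);
    memset(&req, 0, sizeof(req));
    req.argsz = sizeof(req);
    req.flags = MSI_BIND_FLAG;
    req.iova = iova;
    req.gpa = gpa;
    req.size = size;
    return req;
}

/* Build an UNBIND request: only the IOVA identifies the binding. */
static struct msi_binding_req make_msi_unbind(uint64_t iova)
{
    struct msi_binding_req req;

    memset(&req, 0, sizeof(req));
    req.argsz = sizeof(req);
    req.flags = MSI_UNBIND_FLAG;
    req.iova = iova;
    return req;
}
```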

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/exec/memory.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index b9d2f0a437..f2c773163f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -215,6 +215,7 @@ typedef struct MemoryRegionClass {
 enum IOMMUMemoryRegionAttr {
     IOMMU_ATTR_SPAPR_TCE_FD,
     IOMMU_ATTR_VFIO_NESTED,
+    IOMMU_ATTR_MSI_TRANSLATE,
 };
 
 /**
-- 
2.20.1
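
As an illustration of how a vIOMMU could answer such attribute queries, here is a minimal standalone sketch. The enum mirrors the one in the diff above; demo_get_attr() is a hypothetical callback (the real get_attr hook also takes the IOMMUMemoryRegion, dropped here to keep the example self-contained), and returning true for both new attributes is an assumption about an SMMU-like device, not a copy of smmuv3_get_attr().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the enum as it stands after this patch. */
enum IOMMUMemoryRegionAttr {
    IOMMU_ATTR_SPAPR_TCE_FD,
    IOMMU_ATTR_VFIO_NESTED,
    IOMMU_ATTR_MSI_TRANSLATE,
};

/*
 * Hypothetical attribute callback for an SMMU-like vIOMMU.
 * @data is expected to point to a bool for the two boolean attributes.
 */
static int demo_get_attr(enum IOMMUMemoryRegionAttr attr, void *data)
{
    assert(data != NULL);

    switch (attr) {
    case IOMMU_ATTR_VFIO_NESTED:     /* assumed: needs HW nested paging */
    case IOMMU_ATTR_MSI_TRANSLATE:   /* MSIs go through the vIOMMU */
        *(bool *)data = true;
        return 0;
    default:
        /* attribute not handled by this vIOMMU */
        return -EINVAL;
    }
}
```

A caller would initialize its bool to false before the query, so that an unsupported attribute leaves the default untouched.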




* [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (3 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 04/24] memory: Add IOMMU_ATTR_MSI_TRANSLATE " Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-26 13:13   ` Liu, Yi L
  2020-03-20 16:58 ` [RFC v6 06/24] memory: Add arch_id and leaf fields in IOTLBEntry Eric Auger
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

This new API allows injecting @count iommu_faults into
the IOMMU memory region.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/exec/memory.h | 25 +++++++++++++++++++++++++
 memory.c              | 10 ++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index f2c773163f..141a5dc197 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -57,6 +57,8 @@ struct MemoryRegionMmio {
     CPUWriteMemoryFunc *write[3];
 };
 
+struct iommu_fault;
+
 typedef struct IOMMUTLBEntry IOMMUTLBEntry;
 
 /* See address_space_translate: bit 0 is read, bit 1 is write.  */
@@ -357,6 +359,19 @@ typedef struct IOMMUMemoryRegionClass {
      * @iommu: the IOMMUMemoryRegion
      */
     int (*num_indexes)(IOMMUMemoryRegion *iommu);
+
+    /*
+     * Inject @count faults into the IOMMU memory region
+     *
+     * Optional method: if this method is not provided, then
+     * memory_region_inject_faults() will return -ENOENT
+     *
+     * @iommu: the IOMMU memory region to inject the faults in
+     * @count: number of faults to inject
+     * @buf: fault buffer
+     */
+    int (*inject_faults)(IOMMUMemoryRegion *iommu, int count,
+                         struct iommu_fault *buf);
 } IOMMUMemoryRegionClass;
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -1365,6 +1380,16 @@ int memory_region_iommu_attrs_to_index(IOMMUMemoryRegion *iommu_mr,
  */
 int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr);
 
+/**
+ * memory_region_inject_faults : inject @count faults stored in @buf
+ *
+ * @iommu_mr: the IOMMU memory region
+ * @count: number of faults to be injected
+ * @buf: buffer containing the faults
+ */
+int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
+                                struct iommu_fault *buf);
+
 /**
  * memory_region_name: get a memory region's name
  *
diff --git a/memory.c b/memory.c
index 09be40edd2..9cdd77e0de 100644
--- a/memory.c
+++ b/memory.c
@@ -2001,6 +2001,16 @@ int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr)
     return imrc->num_indexes(iommu_mr);
 }
 
+int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
+                                struct iommu_fault *buf)
+{
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
+    if (!imrc->inject_faults) {
+        return -ENOENT;
+    }
+    return imrc->inject_faults(iommu_mr, count, buf);
+}
+
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
     uint8_t mask = 1 << client;
-- 
2.20.1
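
The optional-callback dispatch used by memory_region_inject_faults() can be sketched outside of QEMU as follows. All Demo* names are hypothetical stand-ins (struct demo_fault replaces struct iommu_fault, and the class is reduced to the single method); only the "return -ENOENT when the method is not provided" behaviour is taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for struct iommu_fault (hypothetical, one field only). */
struct demo_fault {
    int reason;
};

typedef struct DemoIOMMUClass {
    /* Optional method: NULL when the vIOMMU cannot take injected faults. */
    int (*inject_faults)(void *iommu, int count, struct demo_fault *buf);
} DemoIOMMUClass;

/* Same shape as the wrapper added to memory.c by this patch. */
static int demo_inject_faults(const DemoIOMMUClass *imrc, void *iommu,
                              int count, struct demo_fault *buf)
{
    if (!imrc->inject_faults) {
        return -ENOENT;
    }
    return imrc->inject_faults(iommu, count, buf);
}

/* Example implementation: just accumulates the number of faults seen. */
static int demo_count_faults(void *iommu, int count, struct demo_fault *buf)
{
    int *seen = iommu;

    (void)buf;
    *seen += count;
    return 0;
}
```

The caller can thus tell "no fault injection support" (-ENOENT) apart from an error returned by the vIOMMU implementation itself.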




* [RFC v6 06/24] memory: Add arch_id and leaf fields in IOTLBEntry
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (4 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 07/24] iommu: Introduce generic header Eric Auger
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

TLB entries are usually tagged with some ids such as the asid
or pasid. When propagating an invalidation command from the
guest to the host, we need to pass this id.

Also we add a leaf field which indicates, in case of invalidation
notification, whether only cache entries for the last level of
translation are required to be invalidated.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/exec/memory.h | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 141a5dc197..d61311aeba 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -71,12 +71,30 @@ typedef enum {
 
 #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
 
+/**
+ * IOMMUTLBEntry - IOMMU TLB entry
+ *
+ * Structure used when performing a translation or when notifying MAP or
+ * UNMAP (invalidation) events
+ *
+ * @target_as: target address space
+ * @iova: IO virtual address (input)
+ * @translated_addr: translated address (output)
+ * @addr_mask: address mask (0xfff means 4K binding), must be 2^n - 1
+ * @perm: permission flag of the mapping (NONE encodes no mapping or
+ * invalidation notification)
+ * @arch_id: architecture specific ID tagging the TLB
+ * @leaf: when @perm is NONE, indicates whether only caches for the last
+ * level of translation need to be invalidated.
+ */
 struct IOMMUTLBEntry {
     AddressSpace    *target_as;
     hwaddr           iova;
     hwaddr           translated_addr;
-    hwaddr           addr_mask;  /* 0xfff = 4k translation */
+    hwaddr           addr_mask;
     IOMMUAccessFlags perm;
+    uint32_t         arch_id;
+    bool             leaf;
 };
 
 /*
-- 
2.20.1
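
To show how the two new fields might be filled for an invalidation, here is a small standalone sketch. DemoTLBEntry is a hypothetical reduced copy of IOMMUTLBEntry (no target_as, no translated_addr), perm is modelled as a plain int where 0 stands for NONE, and aligning the IOVA down to the invalidated size is an illustrative choice of this example rather than something the patch mandates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_hwaddr;

/* Reduced, hypothetical copy of IOMMUTLBEntry. */
typedef struct DemoTLBEntry {
    demo_hwaddr iova;
    demo_hwaddr addr_mask;  /* size - 1, e.g. 0xfff for a 4K range */
    int         perm;       /* 0 stands for NONE, i.e. an invalidation */
    uint32_t    arch_id;    /* e.g. the ASID tagging the TLB entries */
    bool        leaf;       /* only last-level caches must be invalidated */
} DemoTLBEntry;

/* A valid mask has all its set bits contiguous from bit 0 (2^n - 1). */
static bool demo_mask_is_valid(demo_hwaddr mask)
{
    return (mask & (mask + 1)) == 0;
}

/* Build an UNMAP entry covering @size bytes (a power of two). */
static DemoTLBEntry demo_make_inval(demo_hwaddr iova, demo_hwaddr size,
                                    uint32_t asid, bool leaf)
{
    DemoTLBEntry e;

    assert(size && (size & (size - 1)) == 0);
    e.addr_mask = size - 1;
    e.iova = iova & ~e.addr_mask;   /* align down to the range size */
    e.perm = 0;
    e.arch_id = asid;
    e.leaf = leaf;
    return e;
}
```

On the receiving side, the size of the range to invalidate is recovered as addr_mask + 1.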




* [RFC v6 07/24] iommu: Introduce generic header
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (5 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 06/24] memory: Add arch_id and leaf fields in IOTLBEntry Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice Eric Auger
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

This header is meant to expose data types used by
several IOMMU devices, such as structs for SVA and
nested stage configuration.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/iommu/iommu.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 include/hw/iommu/iommu.h

diff --git a/include/hw/iommu/iommu.h b/include/hw/iommu/iommu.h
new file mode 100644
index 0000000000..12092bda7b
--- /dev/null
+++ b/include/hw/iommu/iommu.h
@@ -0,0 +1,28 @@
+/*
+ * common header for iommu devices
+ *
+ * Copyright Red Hat, Inc. 2019
+ *
+ * Authors:
+ *  Eric Auger <eric.auger@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_HW_IOMMU_IOMMU_H
+#define QEMU_HW_IOMMU_IOMMU_H
+#ifdef __linux__
+#include <linux/iommu.h>
+#endif
+
+typedef struct IOMMUConfig {
+    union {
+#ifdef __linux__
+        struct iommu_pasid_table_config pasid_cfg;
+#endif
+    };
+} IOMMUConfig;
+
+
+#endif /* QEMU_HW_IOMMU_IOMMU_H */
-- 
2.20.1




* [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (6 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 07/24] iommu: Introduce generic header Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-26 13:01   ` Liu, Yi L
  2020-03-20 16:58 ` [RFC v6 09/24] vfio: Force nested if iommu requires it Eric Auger
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

From: Liu Yi L <yi.l.liu@intel.com>

This patch introduces PCIPASIDOps for IOMMU related operations.

https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html
https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00940.html

So far, setting up virt-SVA for an assigned SVA-capable device requires
configuring the host translation structures for a specific pasid (e.g.
binding the guest page table to the host and enabling nested translation
in the host). Besides, the vIOMMU emulator needs to forward the guest's
cache invalidations to the host since host nested translation is enabled.
E.g. on VT-d, the guest owns the 1st level translation table, thus cache
invalidations for the 1st level should be propagated to the host.

This patch adds one callback, set_pasid_table, which conveys the guest
PASID table configuration. The callbacks are meant to be implemented by
device passthrough modules, such as vfio.

Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Yi Sun <yi.y.sun@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
 hw/pci/pci.c         | 34 ++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h | 11 +++++++++++
 2 files changed, 45 insertions(+)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e1ed6677e1..67e03b8db1 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2695,6 +2695,40 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
     bus->iommu_opaque = opaque;
 }
 
+void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops)
+{
+    assert(ops && !dev->pasid_ops);
+    dev->pasid_ops = ops;
+}
+
+bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn)
+{
+    PCIDevice *dev;
+
+    if (!bus) {
+        return false;
+    }
+
+    dev = bus->devices[devfn];
+    return !!(dev && dev->pasid_ops);
+}
+
+int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn,
+                               IOMMUConfig *config)
+{
+    PCIDevice *dev;
+
+    if (!bus) {
+        return -EINVAL;
+    }
+
+    dev = bus->devices[devfn];
+    if (dev && dev->pasid_ops && dev->pasid_ops->set_pasid_table) {
+        return dev->pasid_ops->set_pasid_table(bus, devfn, config);
+    }
+    return -ENOENT;
+}
+
 static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
 {
     Range *range = opaque;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index cfedf5a995..2146cb7519 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -8,6 +8,7 @@
 #include "hw/isa/isa.h"
 
 #include "hw/pci/pcie.h"
+#include "hw/iommu/iommu.h"
 
 extern bool pci_available;
 
@@ -264,6 +265,11 @@ struct PCIReqIDCache {
 };
 typedef struct PCIReqIDCache PCIReqIDCache;
 
+struct PCIPASIDOps {
+    int (*set_pasid_table)(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
+};
+typedef struct PCIPASIDOps PCIPASIDOps;
+
 struct PCIDevice {
     DeviceState qdev;
     bool partially_hotplugged;
@@ -357,6 +363,7 @@ struct PCIDevice {
 
     /* ID of standby device in net_failover pair */
     char *failover_pair_id;
+    PCIPASIDOps *pasid_ops;
 };
 
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
@@ -490,6 +497,10 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
+void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops);
+bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn);
+int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
+
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
 {
-- 
2.20.1
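
The registration and lookup path added above can be exercised in isolation with the sketch below. All Demo* types are hypothetical stand-ins for PCIBus, PCIDevice, PCIPASIDOps and IOMMUConfig; the devfn range check is an addition of this example and is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoConfig { int stage1_enabled; } DemoConfig;
typedef struct DemoBus DemoBus;

typedef struct DemoPASIDOps {
    int (*set_pasid_table)(DemoBus *bus, int32_t devfn, DemoConfig *cfg);
} DemoPASIDOps;

typedef struct DemoDevice { DemoPASIDOps *pasid_ops; } DemoDevice;

struct DemoBus { DemoDevice *devices[256]; };

/* Ops can be registered only once per device, as in the patch. */
static void demo_setup_pasid_ops(DemoDevice *dev, DemoPASIDOps *ops)
{
    assert(ops && !dev->pasid_ops);
    dev->pasid_ops = ops;
}

/* What the vIOMMU side would call when the guest config changes. */
static int demo_set_pasid_table(DemoBus *bus, int32_t devfn, DemoConfig *cfg)
{
    DemoDevice *dev;

    if (!bus || devfn < 0 || devfn >= 256) {  /* range check: sketch only */
        return -EINVAL;
    }
    dev = bus->devices[devfn];
    if (dev && dev->pasid_ops && dev->pasid_ops->set_pasid_table) {
        return dev->pasid_ops->set_pasid_table(bus, devfn, cfg);
    }
    return -ENOENT;
}

/* Example backend, standing in for a passthrough module such as vfio. */
static int demo_backend_set_table(DemoBus *bus, int32_t devfn,
                                  DemoConfig *cfg)
{
    (void)bus;
    (void)devfn;
    cfg->stage1_enabled = 1;
    return 0;
}
```

A device without registered ops (e.g. an emulated one) makes the call return -ENOENT, which the vIOMMU can treat as "nothing to propagate".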




* [RFC v6 09/24] vfio: Force nested if iommu requires it
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (7 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-31  6:34   ` Liu, Yi L
  2020-03-20 16:58 ` [RFC v6 10/24] vfio: Introduce hostwin_from_range helper Eric Auger
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

In case we detect the address space is translated by
a virtual IOMMU which requires HW nested paging to
integrate with VFIO, let's set up the container with
the VFIO_TYPE1_NESTING_IOMMU iommu_type.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v4 -> v5:
- fail immediately if nested is wanted but not supported

v2 -> v3:
- add "nested only is selected if requested by @force_nested"
  comment in this patch
---
 hw/vfio/common.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0..ac417b5dbd 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1155,27 +1155,38 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
  * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
  */
 static int vfio_get_iommu_type(VFIOContainer *container,
+                               bool want_nested,
                                Error **errp)
 {
-    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
+                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
                           VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
-    int i;
+    int i, ret = -EINVAL;
 
     for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
         if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
-            return iommu_types[i];
+            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU && !want_nested) {
+                continue;
+            }
+            ret = iommu_types[i];
+            break;
         }
     }
-    error_setg(errp, "No available IOMMU models");
-    return -EINVAL;
+    if (ret < 0) {
+        error_setg(errp, "No available IOMMU models");
+    } else if (want_nested && ret != VFIO_TYPE1_NESTING_IOMMU) {
+        error_setg(errp, "Nested mode requested but not supported");
+        ret = -EINVAL;
+    }
+    return ret;
 }
 
 static int vfio_init_container(VFIOContainer *container, int group_fd,
-                               Error **errp)
+                               bool want_nested, Error **errp)
 {
     int iommu_type, ret;
 
-    iommu_type = vfio_get_iommu_type(container, errp);
+    iommu_type = vfio_get_iommu_type(container, want_nested, errp);
     if (iommu_type < 0) {
         return iommu_type;
     }
@@ -1211,6 +1222,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     VFIOContainer *container;
     int ret, fd;
     VFIOAddressSpace *space;
+    IOMMUMemoryRegion *iommu_mr;
+    bool nested = false;
+
+    if (as != &address_space_memory && memory_region_is_iommu(as->root)) {
+        iommu_mr = IOMMU_MEMORY_REGION(as->root);
+        memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_VFIO_NESTED,
+                                     (void *)&nested);
+    }
 
     space = vfio_get_address_space(as);
 
@@ -1272,12 +1291,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
 
-    ret = vfio_init_container(container, group->fd, errp);
+    ret = vfio_init_container(container, group->fd, nested, errp);
     if (ret) {
         goto free_container_exit;
     }
 
     switch (container->iommu_type) {
+    case VFIO_TYPE1_NESTING_IOMMU:
     case VFIO_TYPE1v2_IOMMU:
     case VFIO_TYPE1_IOMMU:
     {
-- 
2.20.1
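
The selection policy of vfio_get_iommu_type() after this patch can be summarized by the standalone sketch below. The DEMO_* values are arbitrary stand-ins for the VFIO_*_IOMMU constants, and a bitmask of supported types replaces the VFIO_CHECK_EXTENSION ioctl; both are assumptions made to keep the example runnable without a VFIO container.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary stand-ins for the VFIO IOMMU type constants. */
enum demo_iommu_type {
    DEMO_TYPE1   = 1,
    DEMO_TYPE1V2 = 2,
    DEMO_NESTING = 3,
};

/*
 * @supported: bitmask, bit N set when type N is supported by the host.
 * Returns the chosen type, or -EINVAL when nothing suitable exists.
 */
static int demo_get_iommu_type(unsigned supported, bool want_nested)
{
    static const int pref[] = { DEMO_NESTING, DEMO_TYPE1V2, DEMO_TYPE1 };
    int ret = -EINVAL;
    size_t i;

    for (i = 0; i < sizeof(pref) / sizeof(pref[0]); i++) {
        if (!(supported & (1u << pref[i]))) {
            continue;
        }
        /* nested is only selected when explicitly requested */
        if (pref[i] == DEMO_NESTING && !want_nested) {
            continue;
        }
        ret = pref[i];
        break;
    }
    /* fail immediately if nested is wanted but not supported */
    if (ret >= 0 && want_nested && ret != DEMO_NESTING) {
        ret = -EINVAL;
    }
    return ret;
}
```

Note that a host supporting nesting still gets a plain type1 container when no vIOMMU asks for nested mode.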




* [RFC v6 10/24] vfio: Introduce hostwin_from_range helper
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (8 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 09/24] vfio: Force nested if iommu requires it Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 11/24] vfio: Introduce helpers to DMA map/unmap a RAM section Eric Auger
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Let's introduce a hostwin_from_range() helper that returns the
hostwin encapsulating an IOVA range or NULL if none is found.

This improves the readability of callers and removes the usage
of hostwin_found.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ac417b5dbd..f20b37fbee 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -499,6 +499,19 @@ out:
     rcu_read_unlock();
 }
 
+static VFIOHostDMAWindow *
+hostwin_from_range(VFIOContainer *container, hwaddr iova, hwaddr end)
+{
+    VFIOHostDMAWindow *hostwin;
+
+    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
+        if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
+            return hostwin;
+        }
+    }
+    return NULL;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -508,7 +521,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
     void *vaddr;
     int ret;
     VFIOHostDMAWindow *hostwin;
-    bool hostwin_found;
     Error *err = NULL;
 
     if (vfio_listener_skipped_section(section)) {
@@ -593,15 +605,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
 #endif
     }
 
-    hostwin_found = false;
-    QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-        if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
-            hostwin_found = true;
-            break;
-        }
-    }
-
-    if (!hostwin_found) {
+    hostwin = hostwin_from_range(container, iova, end);
+    if (!hostwin) {
         error_setg(&err, "Container %p can't map guest IOVA region"
                    " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
         goto fail;
@@ -774,16 +779,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
     if (memory_region_is_ram_device(section->mr)) {
         hwaddr pgmask;
-        VFIOHostDMAWindow *hostwin;
-        bool hostwin_found = false;
+        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
 
-        QLIST_FOREACH(hostwin, &container->hostwin_list, hostwin_next) {
-            if (hostwin->min_iova <= iova && end <= hostwin->max_iova) {
-                hostwin_found = true;
-                break;
-            }
-        }
-        assert(hostwin_found); /* or region_add() would have failed */
+        assert(hostwin); /* or region_add() would have failed */
 
         pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
         try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
-- 
2.20.1
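
The containment test performed by hostwin_from_range() can be tried out with the sketch below, which assumes a plain array of windows in place of the container's QLIST. DemoWindow and demo_window_from_range() are hypothetical names; both bounds are inclusive, as in the helper above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced host DMA window: inclusive IOVA bounds. */
typedef struct DemoWindow {
    uint64_t min_iova;
    uint64_t max_iova;
} DemoWindow;

/*
 * Return the first window that fully contains [iova, end],
 * or NULL when no single window encapsulates the range.
 */
static const DemoWindow *
demo_window_from_range(const DemoWindow *wins, size_t n,
                       uint64_t iova, uint64_t end)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (wins[i].min_iova <= iova && end <= wins[i].max_iova) {
            return &wins[i];
        }
    }
    return NULL;
}
```

A range that straddles two adjacent windows is rejected, since neither window encapsulates it on its own.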




* [RFC v6 11/24] vfio: Introduce helpers to DMA map/unmap a RAM section
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (9 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 10/24] vfio: Introduce hostwin_from_range helper Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 12/24] vfio: Set up nested stage mappings Eric Auger
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Let's introduce two helpers that DMA map/unmap a RAM
section. Those helpers will also be called from another call
site for nested stage setup. The structure of
vfio_listener_region_add/del() may also become clearer.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v5 -> v6:
- add Error **
---
 hw/vfio/common.c     | 177 ++++++++++++++++++++++++++-----------------
 hw/vfio/trace-events |   4 +-
 2 files changed, 108 insertions(+), 73 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f20b37fbee..e067009da8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -512,13 +512,115 @@ hostwin_from_range(VFIOContainer *container, hwaddr iova, hwaddr end)
     return NULL;
 }
 
+static int vfio_dma_map_ram_section(VFIOContainer *container,
+                                    MemoryRegionSection *section, Error **err)
+{
+    VFIOHostDMAWindow *hostwin;
+    Int128 llend, llsize;
+    hwaddr iova, end;
+    void *vaddr;
+    int ret;
+
+    assert(memory_region_is_ram(section->mr));
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    hostwin = hostwin_from_range(container, iova, end);
+    if (!hostwin) {
+        error_setg(err, "Container %p can't map guest IOVA region"
+                   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, container, iova, end);
+        return -EFAULT;
+    }
+
+    trace_vfio_dma_map_ram(iova, end, vaddr);
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+
+        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
+            trace_vfio_listener_region_add_no_dma_map(
+                memory_region_name(section->mr),
+                section->offset_within_address_space,
+                int128_getlo(section->size),
+                pgmask + 1);
+            return 0;
+        }
+    }
+
+    ret = vfio_dma_map(container, iova, int128_get64(llsize),
+                       vaddr, section->readonly);
+    if (ret) {
+        error_setg(err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                   container, iova, int128_get64(llsize), vaddr, ret);
+        if (memory_region_is_ram_device(section->mr)) {
+            /* Allow unexpected mappings not to be fatal for RAM devices */
+            error_report_err(*err);
+            return 0;
+        }
+        return ret;
+    }
+    return 0;
+}
+
+static void vfio_dma_unmap_ram_section(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    Int128 llend, llsize;
+    hwaddr iova, end;
+    bool try_unmap = true;
+    int ret;
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return;
+    }
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    trace_vfio_dma_unmap_ram(iova, end);
+
+    if (memory_region_is_ram_device(section->mr)) {
+        hwaddr pgmask;
+        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
+
+        assert(hostwin); /* or region_add() would have failed */
+
+        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
+    }
+
+    if (try_unmap) {
+        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
+        if (ret) {
+            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx") = %d (%m)",
+                         container, iova, int128_get64(llsize), ret);
+        }
+    }
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     hwaddr iova, end;
-    Int128 llend, llsize;
-    void *vaddr;
+    Int128 llend;
     int ret;
     VFIOHostDMAWindow *hostwin;
     Error *err = NULL;
@@ -655,39 +757,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
 
     /* Here we assume that memory_region_is_ram(section->mr)==true */
-
-    vaddr = memory_region_get_ram_ptr(section->mr) +
-            section->offset_within_region +
-            (iova - section->offset_within_address_space);
-
-    trace_vfio_listener_region_add_ram(iova, end, vaddr);
-
-    llsize = int128_sub(llend, int128_make64(iova));
-
-    if (memory_region_is_ram_device(section->mr)) {
-        hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
-
-        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
-            trace_vfio_listener_region_add_no_dma_map(
-                memory_region_name(section->mr),
-                section->offset_within_address_space,
-                int128_getlo(section->size),
-                pgmask + 1);
-            return;
-        }
-    }
-
-    ret = vfio_dma_map(container, iova, int128_get64(llsize),
-                       vaddr, section->readonly);
-    if (ret) {
-        error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                   container, iova, int128_get64(llsize), vaddr, ret);
-        if (memory_region_is_ram_device(section->mr)) {
-            /* Allow unexpected mappings not to be fatal for RAM devices */
-            error_report_err(err);
-            return;
-        }
+    if (vfio_dma_map_ram_section(container, section, &err)) {
         goto fail;
     }
 
@@ -721,10 +791,6 @@ static void vfio_listener_region_del(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-    hwaddr iova, end;
-    Int128 llend, llsize;
-    int ret;
-    bool try_unmap = true;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -763,38 +829,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
          */
     }
 
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-    llend = int128_make64(section->offset_within_address_space);
-    llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
-
-    if (int128_ge(int128_make64(iova), llend)) {
-        return;
-    }
-    end = int128_get64(int128_sub(llend, int128_one()));
-
-    llsize = int128_sub(llend, int128_make64(iova));
-
-    trace_vfio_listener_region_del(iova, end);
-
-    if (memory_region_is_ram_device(section->mr)) {
-        hwaddr pgmask;
-        VFIOHostDMAWindow *hostwin = hostwin_from_range(container, iova, end);
-
-        assert(hostwin); /* or region_add() would have failed */
-
-        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
-        try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
-    }
-
-    if (try_unmap) {
-        ret = vfio_dma_unmap(container, iova, int128_get64(llsize));
-        if (ret) {
-            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
-                         "0x%"HWADDR_PRIx") = %d (%m)",
-                         container, iova, int128_get64(llsize), ret);
-        }
-    }
+    vfio_dma_unmap_ram_section(container, section);
 
     memory_region_unref(section->mr);
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a33f..410801de6e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -97,10 +97,10 @@ vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "i
 vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add 0x%"PRIx64" - 0x%"PRIx64
 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
 vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] 0x%"PRIx64" - 0x%"PRIx64
-vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
+vfio_dma_map_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] 0x%"PRIx64" - 0x%"PRIx64" [%p]"
 vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
-vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
+vfio_dma_unmap_ram(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
-- 
2.20.1
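
Both helpers start by shrinking the section to whole pages: the start is rounded up, the end is rounded down, and nothing is mapped or unmapped when no full page remains. The sketch below reproduces that arithmetic with plain 64-bit integers. It assumes a fixed 4K page (DEMO_PAGE_SIZE is a hypothetical stand-in for the target page size) and that start + len does not overflow, which is the case the real code guards against with Int128.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 0x1000ULL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

typedef struct DemoRange {
    uint64_t iova;   /* first page-aligned address */
    uint64_t end;    /* last byte covered (inclusive) */
    uint64_t size;   /* number of bytes to map or unmap */
    bool     empty;  /* true when no full page fits in the section */
} DemoRange;

/* Assumes start + len does not overflow 64 bits. */
static DemoRange demo_page_bounds(uint64_t start, uint64_t len)
{
    DemoRange r = { 0, 0, 0, true };
    /* round the start up and the (exclusive) end down */
    uint64_t iova = (start + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK;
    uint64_t llend = (start + len) & DEMO_PAGE_MASK;

    if (iova >= llend) {
        return r;            /* nothing page-aligned to handle */
    }
    r.iova = iova;
    r.end = llend - 1;
    r.size = llend - iova;
    r.empty = false;
    return r;
}
```

For instance, a section starting at 0x1800 with length 0x3000 covers only the two full pages at 0x2000 and 0x3000.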




* [RFC v6 12/24] vfio: Set up nested stage mappings
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (10 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 11/24] vfio: Introduce helpers to DMA map/unmap a RAM section Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 13/24] vfio: Pass stage 1 MSI bindings to the host Eric Auger
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

In nested mode, legacy vfio_iommu_map_notify cannot be used as
there is no "caching" mode and we do not trap on map.

On Intel, vfio_iommu_map_notify was used to DMA map the RAM
through the host single stage.

With nested mode, we need to set up the stage 2 and the stage 1
separately. This patch introduces a prereg_listener to set up
the stage 2 mapping.

The stage 1 mapping, owned by the guest, is passed to the host
when the guest invalidates the stage 1 configuration, through
a dedicated PCIPASIDOps callback. Guest IOTLB invalidations
are cascaded down to the host through another IOMMU MR UNMAP
notifier.
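
As a rough illustration of the UNMAP path described above, the sketch
below shows how an invalidation request can be derived from an IOTLB
entry: the address is the guest IOVA plus the region offset, and the
granule size is addr_mask + 1. The struct and function names
(inv_entry, inv_request, make_inv_request) are hypothetical stand-ins,
not the QEMU or kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields the notifier reads. */
struct inv_entry {
    uint64_t iova;       /* guest IOVA of the invalidated range */
    uint64_t addr_mask;  /* range size minus one, e.g. 0xfff for 4 KiB */
    int arch_id;         /* architecture tag (an ASID on SMMUv3) */
    bool leaf;           /* true if only last-level entries are affected */
};

/* Hypothetical address-granularity invalidation request. */
struct inv_request {
    uint64_t addr;
    uint64_t granule_size;
    uint64_t nb_granules;
    int arch_id;
    bool leaf;
};

static struct inv_request make_inv_request(const struct inv_entry *e,
                                           uint64_t region_offset)
{
    struct inv_request req;

    /*
     * The mask is expected to be of the form 2^n - 1. A full 64-bit
     * mask would wrap the size to 0; this sketch does not handle it.
     */
    assert((e->addr_mask & (e->addr_mask + 1)) == 0);

    req.addr = e->iova + region_offset;
    req.granule_size = e->addr_mask + 1;
    req.nb_granules = 1;  /* one granule per notification */
    req.arch_id = e->arch_id;
    req.leaf = e->leaf;
    return req;
}
```

With iova = 0x1000, addr_mask = 0xfff and a region offset of 0x100000,
the request covers one 4 KiB granule at 0x101000.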

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v6 -> v7:
- remove PASID based invalidation

v5 -> v6:
- add error_report_err()
- remove the abort in case of nested stage case

v4 -> v5:
- use VFIO_IOMMU_SET_PASID_TABLE
- use PCIPASIDOps for config notification

v3 -> v4:
- use iommu_inv_pasid_info for ASID invalidation

v2 -> v3:
- use VFIO_IOMMU_ATTACH_PASID_TABLE
- new user API
- handle leaf

v1 -> v2:
- adapt to uapi changes
- pass the asid
- pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier
---
 hw/vfio/common.c     | 112 ++++++++++++++++++++++++++++++++++++++++---
 hw/vfio/pci.c        |  21 ++++++++
 hw/vfio/trace-events |   2 +
 3 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index e067009da8..c0ae59bfe6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -446,6 +446,44 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
     return true;
 }
 
+/* Propagate a guest IOTLB invalidation to the host (nested mode) */
+static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    hwaddr start = iotlb->iova + giommu->iommu_offset;
+
+    VFIOContainer *container = giommu->container;
+    struct vfio_iommu_type1_cache_invalidate ustruct;
+    size_t size = iotlb->addr_mask + 1;
+    int ret;
+
+    assert(iotlb->perm == IOMMU_NONE);
+
+    ustruct.argsz = sizeof(ustruct);
+    ustruct.flags = 0;
+    ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
+
+    ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+    ustruct.info.granularity = IOMMU_INV_GRANU_ADDR;
+    ustruct.info.addr_info.flags = IOMMU_INV_ADDR_FLAGS_ARCHID;
+    if (iotlb->leaf) {
+        ustruct.info.addr_info.flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
+    }
+    ustruct.info.addr_info.archid = iotlb->arch_id;
+    ustruct.info.addr_info.addr = start;
+    ustruct.info.addr_info.granule_size = size;
+    ustruct.info.addr_info.nb_granules = 1;
+    trace_vfio_iommu_addr_inv_iotlb(iotlb->arch_id, start, size, 1,
+                                    iotlb->leaf);
+
+    ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);
+    if (ret) {
+        error_report("%p: failed to invalidate CACHE for 0x%"PRIx64
+                     " mask=0x%"PRIx64" (%d)",
+                     container, start, iotlb->addr_mask, ret);
+    }
+}
+
 static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
@@ -615,6 +653,35 @@ static void vfio_dma_unmap_ram_section(VFIOContainer *container,
     }
 }
 
+static void vfio_prereg_listener_region_add(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container =
+        container_of(listener, VFIOContainer, prereg_listener);
+    Error *err = NULL;
+
+    if (!memory_region_is_ram(section->mr)) {
+        return;
+    }
+
+    vfio_dma_map_ram_section(container, section, &err);
+    if (err) {
+        error_report_err(err);
+    }
+}
+static void vfio_prereg_listener_region_del(MemoryListener *listener,
+                                            MemoryRegionSection *section)
+{
+    VFIOContainer *container =
+        container_of(listener, VFIOContainer, prereg_listener);
+
+    if (!memory_region_is_ram(section->mr)) {
+        return;
+    }
+
+    vfio_dma_unmap_ram_section(container, section);
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -717,9 +784,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
     memory_region_ref(section->mr);
 
     if (memory_region_is_iommu(section->mr)) {
+        IOMMUNotify notify;
         VFIOGuestIOMMU *giommu;
         IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
-        int iommu_idx;
+        int iommu_idx, flags;
 
         trace_vfio_listener_region_add_iommu(iova, end);
         /*
@@ -738,8 +806,18 @@ static void vfio_listener_region_add(MemoryListener *listener,
         llend = int128_sub(llend, int128_one());
         iommu_idx = memory_region_iommu_attrs_to_index(iommu_mr,
                                                        MEMTXATTRS_UNSPECIFIED);
-        iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
-                            IOMMU_NOTIFIER_ALL,
+
+        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+            /* IOTLB unmap notifier to propagate guest IOTLB invalidations */
+            flags = IOMMU_NOTIFIER_UNMAP;
+            notify = vfio_iommu_unmap_notify;
+        } else {
+            /* MAP/UNMAP IOTLB notifier */
+            flags = IOMMU_NOTIFIER_ALL;
+            notify = vfio_iommu_map_notify;
+        }
+
+        iommu_notifier_init(&giommu->n, notify, flags,
                             section->offset_within_region,
                             int128_get64(llend),
                             iommu_idx);
@@ -751,7 +829,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
             goto fail;
         }
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
-        memory_region_iommu_replay(giommu->iommu, &giommu->n);
+        if (flags & IOMMU_NOTIFIER_MAP) {
+            memory_region_iommu_replay(giommu->iommu, &giommu->n);
+        }
 
         return;
     }
@@ -846,15 +926,21 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
-static const MemoryListener vfio_memory_listener = {
+static MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
 };
 
+static MemoryListener vfio_memory_prereg_listener = {
+    .region_add = vfio_prereg_listener_region_add,
+    .region_del = vfio_prereg_listener_region_del,
+};
+
 static void vfio_listener_release(VFIOContainer *container)
 {
     memory_listener_unregister(&container->listener);
-    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU ||
+        container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
         memory_listener_unregister(&container->prereg_listener);
     }
 }
@@ -1352,6 +1438,20 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
         vfio_host_win_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
         container->pgsizes = info.iova_pgsizes;
+
+        if (container->iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+            container->prereg_listener = vfio_memory_prereg_listener;
+            memory_listener_register(&container->prereg_listener,
+                                     &address_space_memory);
+            if (container->error) {
+                memory_listener_unregister(&container->prereg_listener);
+                ret = -1;
+                error_propagate_prepend(errp, container->error,
+                                    "RAM memory listener initialization failed "
+                                    "for container");
+                goto free_container_exit;
+            }
+        }
         break;
     }
     case VFIO_SPAPR_TCE_v2_IOMMU:
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95129..fc314cc6a9 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2712,6 +2712,25 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     vdev->req_enabled = false;
 }
 
+static int vfio_iommu_set_pasid_table(PCIBus *bus, int32_t devfn,
+                                      IOMMUConfig *config)
+{
+    PCIDevice *pdev = bus->devices[devfn];
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    VFIOContainer *container = vdev->vbasedev.group->container;
+    struct vfio_iommu_type1_set_pasid_table info;
+
+    info.argsz = sizeof(info);
+    info.flags = VFIO_PASID_TABLE_FLAG_SET;
+    memcpy(&info.config, &config->pasid_cfg, sizeof(config->pasid_cfg));
+
+    return ioctl(container->fd, VFIO_IOMMU_SET_PASID_TABLE, &info);
+}
+
+static PCIPASIDOps vfio_pci_pasid_ops = {
+    .set_pasid_table = vfio_iommu_set_pasid_table,
+};
+
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = PCI_VFIO(pdev);
@@ -3028,6 +3047,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
 
+    pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
+
     return;
 
 out_deregister:
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 410801de6e..9f1868af2d 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -115,6 +115,8 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
+vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.20.1




* [RFC v6 13/24] vfio: Pass stage 1 MSI bindings to the host
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (11 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 12/24] vfio: Set up nested stage mappings Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 14/24] vfio: Helper to get IRQ info including capabilities Eric Auger
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

We register the stage1 MSI bindings when enabling the vectors
and we unregister them on container disconnection.
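
The bookkeeping this implies can be sketched as follows: a binding is
recorded once per IOVA (several vectors may share a doorbell), and the
whole list is torn down in one pass on release. This is a minimal
sketch with hypothetical names (msi_binding, binding_list,
binding_register, binding_release_all); the host ioctl calls are left
out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of one stage 1 MSI binding (gIOVA -> gPA). */
struct msi_binding {
    uint64_t iova;
    uint64_t gpa;
    uint64_t size;
    struct msi_binding *next;
};

struct binding_list {
    struct msi_binding *head;
    unsigned int count;
};

/* Returns 1 if a new binding was recorded, 0 if the IOVA was already
 * bound, -1 on allocation failure. */
static int binding_register(struct binding_list *l, uint64_t iova,
                            uint64_t gpa, uint64_t size)
{
    struct msi_binding *b;

    assert(l != NULL);

    for (b = l->head; b; b = b->next) {
        if (b->iova == iova) {
            return 0;  /* doorbell already registered, nothing to do */
        }
    }

    b = calloc(1, sizeof(*b));
    if (!b) {
        return -1;
    }
    b->iova = iova;
    b->gpa = gpa;
    b->size = size;
    b->next = l->head;
    l->head = b;
    l->count++;
    return 1;
}

/* Frees every recorded binding and returns how many were released. */
static unsigned int binding_release_all(struct binding_list *l)
{
    unsigned int released = 0;
    struct msi_binding *b;

    assert(l != NULL);

    b = l->head;
    while (b) {
        struct msi_binding *next = b->next;

        free(b);
        released++;
        b = next;
    }
    l->head = NULL;
    l->count = 0;
    return released;
}
```

Registering the same IOVA twice leaves a single entry in the list, so
the release pass unbinds each doorbell exactly once.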

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v4 -> v5:
- use VFIO_IOMMU_SET_MSI_BINDING

v2 -> v3:
- only register the notifier if the IOMMU translates MSIs
- record the msi bindings in a container list and unregister on
  container release
---
 hw/vfio/common.c              | 52 +++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 51 +++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events          |  2 ++
 include/hw/vfio/vfio-common.h |  9 ++++++
 4 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c0ae59bfe6..4d51b1f63b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -484,6 +484,56 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 }
 
+int vfio_iommu_set_msi_binding(VFIOContainer *container,
+                               IOMMUTLBEntry *iotlb)
+{
+    struct vfio_iommu_type1_set_msi_binding ustruct;
+    VFIOMSIBinding *binding;
+    int ret;
+
+    QLIST_FOREACH(binding, &container->msibinding_list, next) {
+        if (binding->iova == iotlb->iova) {
+            return 0;
+        }
+    }
+
+    ustruct.argsz = sizeof(struct vfio_iommu_type1_set_msi_binding);
+    ustruct.iova = iotlb->iova;
+    ustruct.flags = VFIO_IOMMU_BIND_MSI;
+    ustruct.gpa = iotlb->translated_addr;
+    ustruct.size = iotlb->addr_mask + 1;
+    ret = ioctl(container->fd, VFIO_IOMMU_SET_MSI_BINDING, &ustruct);
+    if (ret) {
+        error_report("%s: failed to register the stage1 MSI binding (%m)",
+                     __func__);
+        return ret;
+    }
+    binding = g_new0(VFIOMSIBinding, 1);
+    binding->iova = ustruct.iova;
+    binding->gpa = ustruct.gpa;
+    binding->size = ustruct.size;
+
+    QLIST_INSERT_HEAD(&container->msibinding_list, binding, next);
+    return 0;
+}
+
+static void vfio_container_unbind_msis(VFIOContainer *container)
+{
+    VFIOMSIBinding *binding, *tmp;
+
+    QLIST_FOREACH_SAFE(binding, &container->msibinding_list, next, tmp) {
+        struct vfio_iommu_type1_set_msi_binding ustruct;
+
+        /* the MSI doorbell is not used anymore, unregister it */
+        ustruct.argsz = sizeof(struct vfio_iommu_type1_set_msi_binding);
+        ustruct.flags = VFIO_IOMMU_UNBIND_MSI;
+        ustruct.iova = binding->iova;
+        ioctl(container->fd, VFIO_IOMMU_SET_MSI_BINDING, &ustruct);
+        QLIST_REMOVE(binding, next);
+        g_free(binding);
+    }
+}
+
 static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
@@ -1598,6 +1648,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
             g_free(giommu);
         }
 
+        vfio_container_unbind_msis(container);
+
         trace_vfio_disconnect_container(container->fd);
         close(container->fd);
         g_free(container);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index fc314cc6a9..6f2d5696c3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -377,6 +377,49 @@ static void vfio_msi_interrupt(void *opaque)
     notify(&vdev->pdev, nr);
 }
 
+static int vfio_register_msi_binding(VFIOPCIDevice *vdev, int vector_n)
+{
+    VFIOContainer *container = vdev->vbasedev.group->container;
+    PCIDevice *dev = &vdev->pdev;
+    AddressSpace *as = pci_device_iommu_address_space(dev);
+    MSIMessage msg = pci_get_msi_message(dev, vector_n);
+    IOMMUMemoryRegionClass *imrc;
+    IOMMUMemoryRegion *iommu_mr;
+    bool msi_translate = false, nested = false;
+    IOMMUTLBEntry entry;
+
+    if (as == &address_space_memory) {
+        return 0;
+    }
+
+    iommu_mr = IOMMU_MEMORY_REGION(as->root);
+    memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_MSI_TRANSLATE,
+                                 (void *)&msi_translate);
+    memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_VFIO_NESTED,
+                                 (void *)&nested);
+    imrc = memory_region_get_iommu_class_nocheck(iommu_mr);
+
+    if (!nested || !msi_translate) {
+        return 0;
+    }
+
+    /* MSI doorbell address is translated by an IOMMU */
+
+    rcu_read_lock();
+    entry = imrc->translate(iommu_mr, msg.address, IOMMU_WO, 0);
+    rcu_read_unlock();
+
+    if (entry.perm == IOMMU_NONE) {
+        return -ENOENT;
+    }
+
+    trace_vfio_register_msi_binding(vdev->vbasedev.name, vector_n,
+                                    msg.address, entry.translated_addr);
+
+    /* propagate a failure to register the binding with the host */
+    return vfio_iommu_set_msi_binding(container, &entry);
+}
+
 static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 {
     struct vfio_irq_set *irq_set;
@@ -394,7 +437,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
     fds = (int32_t *)&irq_set->data;
 
     for (i = 0; i < vdev->nr_vectors; i++) {
-        int fd = -1;
+        int ret, fd = -1;
 
         /*
          * MSI vs MSI-X - The guest has direct access to MSI mask and pending
@@ -409,6 +452,12 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
             } else {
                 fd = event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt);
             }
+            ret = vfio_register_msi_binding(vdev, i);
+            if (ret) {
+                error_report("%s failed to register S1 MSI binding "
+                             "for vector %d(%d)", __func__, i, ret);
+                return ret;
+            }
         }
 
         fds[i] = fd;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 9f1868af2d..5de97a8882 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -117,6 +117,8 @@ vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype
 vfio_dma_unmap_overflow_workaround(void) ""
 vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
 vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
+vfio_register_msi_binding(const char *name, int vector, uint64_t giova, uint64_t gdb) "%s: register vector %d gIOVA=0x%"PRIx64 "-> gDB=0x%"PRIx64" stage 1 mapping"
+vfio_unregister_msi_binding(const char *name, int vector, uint64_t giova) "%s: unregister vector %d gIOVA=0x%"PRIx64 " stage 1 mapping"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd564209ac..8ca34146d7 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -63,6 +63,13 @@ typedef struct VFIOAddressSpace {
     QLIST_ENTRY(VFIOAddressSpace) list;
 } VFIOAddressSpace;
 
+typedef struct VFIOMSIBinding {
+    hwaddr iova;
+    hwaddr gpa;
+    hwaddr size;
+    QLIST_ENTRY(VFIOMSIBinding) next;
+} VFIOMSIBinding;
+
 struct VFIOGroup;
 
 typedef struct VFIOContainer {
@@ -77,6 +84,7 @@ typedef struct VFIOContainer {
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
+    QLIST_HEAD(, VFIOMSIBinding) msibinding_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
 
@@ -178,6 +186,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_iommu_set_msi_binding(VFIOContainer *container, IOMMUTLBEntry *entry);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
-- 
2.20.1




* [RFC v6 14/24] vfio: Helper to get IRQ info including capabilities
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (12 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 13/24] vfio: Pass stage 1 MSI bindings to the host Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 15/24] vfio/pci: Register handler for iommu fault Eric Auger
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

As done for vfio regions, add helpers to retrieve irq info
including their optional capabilities.
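
For readers unfamiliar with the capability chain these helpers walk,
here is a minimal standalone sketch. It assumes a header made of an id,
a version and a "next" offset measured from the start of the info
buffer, with 0 terminating the chain, which is my understanding of the
VFIO layout. The names cap_header and find_cap are hypothetical, and
the bounds and loop checks are additions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical capability header: id, version, offset of the next one. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;  /* offset from the buffer start; 0 ends the chain */
};

/*
 * Walks the chain starting at offset 'first' and copies the header whose
 * id matches into *out. Returns the offset of the match, or -1 if the id
 * is absent or the chain is malformed. *out may be overwritten even when
 * nothing is found.
 */
static int find_cap(const uint8_t *buf, size_t len, uint32_t first,
                    uint16_t id, struct cap_header *out)
{
    uint32_t off = first;
    size_t hops = 0;

    assert(buf != NULL && out != NULL);

    while (off != 0) {
        if (off > len || len - off < sizeof(*out)) {
            return -1;  /* header would run past the buffer */
        }
        if (hops++ > len / sizeof(*out)) {
            return -1;  /* more hops than headers can fit: a cycle */
        }
        memcpy(out, buf + off, sizeof(*out));
        if (out->id == id) {
            return (int)off;
        }
        off = out->next;
    }
    return -1;
}
```

A caller would pass the cap_offset reported in the info structure as
'first' and treat -1 as "capability not present".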

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/common.c              | 97 +++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events          |  1 +
 include/hw/vfio/vfio-common.h |  7 +++
 3 files changed, 105 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4d51b1f63b..327fedf7e4 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1014,6 +1014,25 @@ vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
     return NULL;
 }
 
+struct vfio_info_cap_header *
+vfio_get_irq_info_cap(struct vfio_irq_info *info, uint16_t id)
+{
+    struct vfio_info_cap_header *hdr;
+    void *ptr = info;
+
+    if (!(info->flags & VFIO_IRQ_INFO_FLAG_CAPS)) {
+        return NULL;
+    }
+
+    for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+        if (hdr->id == id) {
+            return hdr;
+        }
+    }
+
+    return NULL;
+}
+
 static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
                                           struct vfio_region_info *info)
 {
@@ -1842,6 +1861,33 @@ retry:
     return 0;
 }
 
+int vfio_get_irq_info(VFIODevice *vbasedev, int index,
+                      struct vfio_irq_info **info)
+{
+    size_t argsz = sizeof(struct vfio_irq_info);
+
+    *info = g_malloc0(argsz);
+
+    (*info)->index = index;
+retry:
+    (*info)->argsz = argsz;
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, *info)) {
+        g_free(*info);
+        *info = NULL;
+        return -errno;
+    }
+
+    if ((*info)->argsz > argsz) {
+        argsz = (*info)->argsz;
+        *info = g_realloc(*info, argsz);
+
+        goto retry;
+    }
+
+    return 0;
+}
+
 int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
                              uint32_t subtype, struct vfio_region_info **info)
 {
@@ -1877,6 +1923,42 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
     return -ENODEV;
 }
 
+int vfio_get_dev_irq_info(VFIODevice *vbasedev, uint32_t type,
+                          uint32_t subtype, struct vfio_irq_info **info)
+{
+    int i;
+
+    for (i = 0; i < vbasedev->num_irqs; i++) {
+        struct vfio_info_cap_header *hdr;
+        struct vfio_irq_info_cap_type *cap_type;
+
+        if (vfio_get_irq_info(vbasedev, i, info)) {
+            continue;
+        }
+
+        hdr = vfio_get_irq_info_cap(*info, VFIO_IRQ_INFO_CAP_TYPE);
+        if (!hdr) {
+            g_free(*info);
+            continue;
+        }
+
+        cap_type = container_of(hdr, struct vfio_irq_info_cap_type, header);
+
+        trace_vfio_get_dev_irq(vbasedev->name, i,
+                               cap_type->type, cap_type->subtype);
+
+        if (cap_type->type == type && cap_type->subtype == subtype) {
+            return 0;
+        }
+
+        g_free(*info);
+    }
+
+    *info = NULL;
+    return -ENODEV;
+}
+
+
 bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
 {
     struct vfio_region_info *info = NULL;
@@ -1892,6 +1974,21 @@ bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
     return ret;
 }
 
+bool vfio_has_irq_cap(VFIODevice *vbasedev, int irq, uint16_t cap_type)
+{
+    struct vfio_irq_info *info = NULL;
+    bool ret = false;
+
+    if (!vfio_get_irq_info(vbasedev, irq, &info)) {
+        if (vfio_get_irq_info_cap(info, cap_type)) {
+            ret = true;
+        }
+        g_free(info);
+    }
+
+    return ret;
+}
+
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 5de97a8882..c04a8c12d8 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -114,6 +114,7 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
+vfio_get_dev_irq(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%08x"
 vfio_dma_unmap_overflow_workaround(void) ""
 vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
 vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8ca34146d7..2ef39cbbc3 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -200,6 +200,13 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
 bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type);
 struct vfio_info_cap_header *
 vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id);
+int vfio_get_irq_info(VFIODevice *vbasedev, int index,
+                      struct vfio_irq_info **info);
+int vfio_get_dev_irq_info(VFIODevice *vbasedev, uint32_t type,
+                          uint32_t subtype, struct vfio_irq_info **info);
+bool vfio_has_irq_cap(VFIODevice *vbasedev, int irq, uint16_t cap_type);
+struct vfio_info_cap_header *
+vfio_get_irq_info_cap(struct vfio_irq_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
-- 
2.20.1




* [RFC v6 15/24] vfio/pci: Register handler for iommu fault
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (13 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 14/24] vfio: Helper to get IRQ info including capabilities Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 16/24] vfio/pci: Set up the DMA FAULT region Eric Auger
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

We use the new extended IRQ VFIO_IRQ_TYPE_NESTED type and
VFIO_IRQ_SUBTYPE_DMA_FAULT subtype to set/unset
a notifier for physical DMA faults. The associated eventfd is
triggered, in nested mode, whenever a fault is detected at the
physical IOMMU level.

The actual handler will be implemented in subsequent patches.
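
The signalling model can be sketched with a plain Linux eventfd: the
producer writes to the fd for each event and the consumer reads it once
to fetch and reset the pending count. The function names below
(fault_event_create, fault_event_signal, fault_event_test_and_clear)
are hypothetical. Unlike QEMU's event_notifier_test_and_clear(), which
reports a boolean, this sketch returns the raw counter.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates a non-blocking eventfd with an initial count of 0. */
static int fault_event_create(void)
{
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Adds one pending event, as the producer side would on a fault. */
static int fault_event_signal(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Returns the number of pending events and resets it to 0.
 * Returns 0 when nothing is pending (the read fails with EAGAIN). */
static uint64_t fault_event_test_and_clear(int fd)
{
    uint64_t n = 0;

    assert(fd >= 0);
    if (read(fd, &n, sizeof(n)) != (ssize_t)sizeof(n)) {
        return 0;
    }
    return n;
}
```

Two signals followed by one test-and-clear yield a count of 2, which is
why a handler woken by the fd has to drain every queued fault rather
than assume a single one.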

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v4 -> v5:
- index_to_str now returns the index name, ie. DMA_FAULT
- use the extended IRQ

v3 -> v4:
- check VFIO_PCI_DMA_FAULT_IRQ_INDEX is supported at kernel level
  before attempting to set signaling for it.
---
 hw/vfio/pci.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/pci.h |  7 +++++
 2 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6f2d5696c3..7579f476b0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2780,6 +2780,76 @@ static PCIPASIDOps vfio_pci_pasid_ops = {
     .set_pasid_table = vfio_iommu_set_pasid_table,
 };
 
+static void vfio_dma_fault_notifier_handler(void *opaque)
+{
+    VFIOPCIExtIRQ *ext_irq = opaque;
+
+    if (!event_notifier_test_and_clear(&ext_irq->notifier)) {
+        return;
+    }
+}
+
+static int vfio_register_ext_irq_handler(VFIOPCIDevice *vdev,
+                                         uint32_t type, uint32_t subtype,
+                                         IOHandler *handler)
+{
+    int32_t fd, ext_irq_index, index;
+    struct vfio_irq_info *irq_info;
+    Error *err = NULL;
+    EventNotifier *n;
+    int ret;
+
+    ret = vfio_get_dev_irq_info(&vdev->vbasedev, type, subtype, &irq_info);
+    if (ret) {
+        return ret;
+    }
+    index = irq_info->index;
+    ext_irq_index = irq_info->index - VFIO_PCI_NUM_IRQS;
+    g_free(irq_info);
+
+    vdev->ext_irqs[ext_irq_index].vdev = vdev;
+    vdev->ext_irqs[ext_irq_index].index = index;
+    n = &vdev->ext_irqs[ext_irq_index].notifier;
+
+    ret = event_notifier_init(n, 0);
+    if (ret) {
+        error_report("vfio: Unable to init event notifier for ext irq %d(%d)",
+                     ext_irq_index, ret);
+        return ret;
+    }
+
+    fd = event_notifier_get_fd(n);
+    qemu_set_fd_handler(fd, handler, NULL,
+                        &vdev->ext_irqs[ext_irq_index]);
+
+    ret = vfio_set_irq_signaling(&vdev->vbasedev, index, 0,
+                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err);
+    if (ret) {
+        error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+        qemu_set_fd_handler(fd, NULL, NULL, vdev);
+        event_notifier_cleanup(n);
+    }
+    return ret;
+}
+
+static void vfio_unregister_ext_irq_notifiers(VFIOPCIDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    Error *err = NULL;
+    int i;
+
+    for (i = 0; i < vbasedev->num_irqs - VFIO_PCI_NUM_IRQS; i++) {
+        if (vfio_set_irq_signaling(vbasedev, i + VFIO_PCI_NUM_IRQS, 0,
+                                   VFIO_IRQ_SET_ACTION_TRIGGER, -1, &err)) {
+            error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
+        }
+        qemu_set_fd_handler(event_notifier_get_fd(&vdev->ext_irqs[i].notifier),
+                            NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->ext_irqs[i].notifier);
+    }
+    g_free(vdev->ext_irqs);
+}
+
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = PCI_VFIO(pdev);
@@ -2790,7 +2860,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     ssize_t len;
     struct stat st;
     int groupid;
-    int i, ret;
+    int i, ret, nb_ext_irqs;
     bool is_mdev;
 
     if (!vdev->vbasedev.sysfsdev) {
@@ -2890,6 +2960,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
+    nb_ext_irqs = vdev->vbasedev.num_irqs - VFIO_PCI_NUM_IRQS;
+    if (nb_ext_irqs > 0) {
+        vdev->ext_irqs = g_new0(VFIOPCIExtIRQ, nb_ext_irqs);
+    }
+
     vfio_populate_device(vdev, &err);
     if (err) {
         error_propagate(errp, err);
@@ -3094,6 +3169,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
+    vfio_register_ext_irq_handler(vdev, VFIO_IRQ_TYPE_NESTED,
+                                  VFIO_IRQ_SUBTYPE_DMA_FAULT,
+                                  vfio_dma_fault_notifier_handler);
     vfio_setup_resetfn_quirk(vdev);
 
     pci_setup_pasid_ops(pdev, &vfio_pci_pasid_ops);
@@ -3145,6 +3223,7 @@ static void vfio_exitfn(PCIDevice *pdev)
 
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
+    vfio_unregister_ext_irq_notifiers(vdev);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     if (vdev->irqchip_change_notifier.notify) {
         kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 0da7a20a7e..56f0fabb33 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -113,6 +113,12 @@ typedef struct VFIOMSIXInfo {
     unsigned long *pending;
 } VFIOMSIXInfo;
 
+typedef struct VFIOPCIExtIRQ {
+    struct VFIOPCIDevice *vdev;
+    EventNotifier notifier;
+    uint32_t index;
+} VFIOPCIExtIRQ;
+
 typedef struct VFIOPCIDevice {
     PCIDevice pdev;
     VFIODevice vbasedev;
@@ -134,6 +140,7 @@ typedef struct VFIOPCIDevice {
     PCIHostDeviceAddress host;
     EventNotifier err_notifier;
     EventNotifier req_notifier;
+    VFIOPCIExtIRQ *ext_irqs;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
-- 
2.20.1




* [RFC v6 16/24] vfio/pci: Set up the DMA FAULT region
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (14 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 15/24] vfio/pci: Register handler for iommu fault Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 17/24] vfio/pci: Implement the DMA fault handler Eric Auger
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Set up the fault region, which is composed of the actual fault
queue (mmappable) and a header used to manage it. The fault
queue is mmapped.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v4 -> v5:
- use a single DMA FAULT region. No version selection anymore
---
 hw/vfio/pci.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h |  1 +
 2 files changed, 65 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7579f476b0..029652a507 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2530,11 +2530,67 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
     return 0;
 }
 
+static void vfio_init_fault_regions(VFIOPCIDevice *vdev, Error **errp)
+{
+    struct vfio_region_info *fault_region_info = NULL;
+    struct vfio_region_info_cap_fault *cap_fault;
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_info_cap_header *hdr;
+    char *fault_region_name;
+    int ret;
+
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
+                                   VFIO_REGION_TYPE_NESTED,
+                                   VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT,
+                                   &fault_region_info);
+    if (ret) {
+        goto out;
+    }
+
+    hdr = vfio_get_region_info_cap(fault_region_info,
+                                   VFIO_REGION_INFO_CAP_DMA_FAULT);
+    if (!hdr) {
+        error_setg(errp, "failed to retrieve DMA FAULT capability");
+        goto out;
+    }
+    cap_fault = container_of(hdr, struct vfio_region_info_cap_fault,
+                             header);
+    if (cap_fault->version != 1) {
+        error_setg(errp, "Unsupported DMA FAULT API version %d",
+                   cap_fault->version);
+        goto out;
+    }
+
+    fault_region_name = g_strdup_printf("%s DMA FAULT %d",
+                                        vbasedev->name,
+                                        fault_region_info->index);
+
+    ret = vfio_region_setup(OBJECT(vdev), vbasedev,
+                            &vdev->dma_fault_region,
+                            fault_region_info->index,
+                            fault_region_name);
+    g_free(fault_region_name);
+    if (ret) {
+        error_setg_errno(errp, -ret,
+                         "failed to set up the DMA FAULT region %d",
+                         fault_region_info->index);
+        goto out;
+    }
+
+    ret = vfio_region_mmap(&vdev->dma_fault_region);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Failed to mmap the DMA FAULT queue");
+    }
+out:
+    g_free(fault_region_info);
+}
+
 static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
 {
     VFIODevice *vbasedev = &vdev->vbasedev;
     struct vfio_region_info *reg_info;
     struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
+    Error *err = NULL;
     int i, ret = -1;
 
     /* Sanity check device */
@@ -2598,6 +2654,12 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
         }
     }
 
+    vfio_init_fault_regions(vdev, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
     irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
 
     ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
@@ -3200,6 +3262,7 @@ static void vfio_instance_finalize(Object *obj)
 
     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);
+    vfio_region_finalize(&vdev->dma_fault_region);
     g_free(vdev->emulated_config_bits);
     g_free(vdev->rom);
     if (vdev->migration_blocker) {
@@ -3224,6 +3287,7 @@ static void vfio_exitfn(PCIDevice *pdev)
     vfio_unregister_req_notifier(vdev);
     vfio_unregister_err_notifier(vdev);
     vfio_unregister_ext_irq_notifiers(vdev);
+    vfio_region_exit(&vdev->dma_fault_region);
     pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
     if (vdev->irqchip_change_notifier.notify) {
         kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 56f0fabb33..c5a59a8e3d 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -141,6 +141,7 @@ typedef struct VFIOPCIDevice {
     EventNotifier err_notifier;
     EventNotifier req_notifier;
     VFIOPCIExtIRQ *ext_irqs;
+    VFIORegion dma_fault_region;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
-- 
2.20.1
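
For readers following the region layout described in this patch, here is a
minimal sketch of how the header relates to the queue. The field names
(tail, entry_size, nb_entries, offset, head) are the ones used by the next
patch in the series; the struct name, the field order after tail, and both
helper functions are assumptions made for illustration only, not the kernel
uapi definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header layout; only the field names come from the series. */
struct fault_region_header {
    uint32_t tail;        /* consumer index, written back by user space */
    uint32_t entry_size;  /* size of one fault record, in bytes */
    uint32_t nb_entries;  /* number of slots in the queue */
    uint32_t offset;      /* start of the queue, relative to the region */
    uint32_t head;        /* producer index, advanced by the kernel */
};

/* Size in bytes of the fault queue described by the header. */
static inline uint64_t fault_queue_size(const struct fault_region_header *h)
{
    assert(h);
    return (uint64_t)h->nb_entries * h->entry_size;
}

/* File offset of the queue, given the region's own offset in the device fd. */
static inline uint64_t fault_queue_fd_offset(uint64_t region_fd_offset,
                                             const struct fault_region_header *h)
{
    assert(h);
    return region_fd_offset + h->offset;
}
```

The multiplication is done in 64 bits so that a large nb_entries cannot
overflow before the result is used as a read length.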



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC v6 17/24] vfio/pci: Implement the DMA fault handler
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (15 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 16/24] vfio/pci: Set up the DMA FAULT region Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 18/24] hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute Eric Auger
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Whenever the eventfd is triggered, we retrieve the DMA fault(s)
from the mmapped fault region and inject them into the IOMMU
memory region.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/vfio/pci.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h |  1 +
 2 files changed, 51 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 029652a507..86ee4b6b47 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2845,10 +2845,60 @@ static PCIPASIDOps vfio_pci_pasid_ops = {
 static void vfio_dma_fault_notifier_handler(void *opaque)
 {
     VFIOPCIExtIRQ *ext_irq = opaque;
+    VFIOPCIDevice *vdev = ext_irq->vdev;
+    PCIDevice *pdev = &vdev->pdev;
+    AddressSpace *as = pci_device_iommu_address_space(pdev);
+    IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(as->root);
+    struct vfio_region_dma_fault header;
+    struct iommu_fault *queue;
+    char *queue_buffer = NULL;
+    ssize_t bytes;
 
     if (!event_notifier_test_and_clear(&ext_irq->notifier)) {
         return;
     }
+
+    bytes = pread(vdev->vbasedev.fd, &header, sizeof(header),
+                  vdev->dma_fault_region.fd_offset);
+    if (bytes != sizeof(header)) {
+        error_report("%s unable to read the fault region header (0x%lx)",
+                     __func__, bytes);
+        return;
+    }
+
+    /* Normally the fault queue is mmapped */
+    queue = (struct iommu_fault *)vdev->dma_fault_region.mmaps[0].mmap;
+    if (!queue) {
+        size_t queue_size = header.nb_entries * header.entry_size;
+
+        error_report("%s: fault queue not mmapped: slower fault handling",
+                     vdev->vbasedev.name);
+
+        queue_buffer = g_malloc(queue_size);
+        bytes = pread(vdev->vbasedev.fd, queue_buffer, queue_size,
+                       vdev->dma_fault_region.fd_offset + header.offset);
+        if (bytes != queue_size) {
+            error_report("%s unable to read the fault queue (0x%lx)",
+                         __func__, bytes);
+            return;
+        }
+
+        queue = (struct iommu_fault *)queue_buffer;
+    }
+
+    while (vdev->fault_tail_index != header.head) {
+        memory_region_inject_faults(iommu_mr, 1,
+                                    &queue[vdev->fault_tail_index]);
+        vdev->fault_tail_index =
+            (vdev->fault_tail_index + 1) % header.nb_entries;
+    }
+    bytes = pwrite(vdev->vbasedev.fd, &vdev->fault_tail_index, 4,
+                   vdev->dma_fault_region.fd_offset);
+    if (bytes != 4) {
+        error_report("%s unable to write the fault region tail index (0x%lx)",
+                     __func__, bytes);
+    }
+    g_free(queue_buffer);
 }
 
 static int vfio_register_ext_irq_handler(VFIOPCIDevice *vdev,
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index c5a59a8e3d..2d0b65d8ff 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -142,6 +142,7 @@ typedef struct VFIOPCIDevice {
     EventNotifier req_notifier;
     VFIOPCIExtIRQ *ext_irqs;
     VFIORegion dma_fault_region;
+    uint32_t fault_tail_index;
     int (*resetfn)(struct VFIOPCIDevice *);
     uint32_t vendor_id;
     uint32_t device_id;
-- 
2.20.1
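
The consumer loop in this patch walks a circular queue from the locally
stored tail index up to the head index read from the region header. The
sketch below isolates that walk. drain_fault_ring() and its callback are
hypothetical names, and the range check on head and tail is an addition of
this sketch, not something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Consume entries from tail up to (but excluding) head, wrapping at
 * nb_entries. Returns the new tail; *consumed receives the entry count.
 * handle() is a hypothetical per-entry callback and may be NULL.
 */
static inline uint32_t drain_fault_ring(uint32_t tail, uint32_t head,
                                        uint32_t nb_entries,
                                        void (*handle)(uint32_t index,
                                                       void *opaque),
                                        void *opaque, uint32_t *consumed)
{
    uint32_t n = 0;

    assert(consumed);

    /* An empty ring or an out-of-range index would never terminate. */
    if (nb_entries == 0 || head >= nb_entries || tail >= nb_entries) {
        *consumed = 0;
        return tail;
    }

    while (tail != head) {
        if (handle) {
            handle(tail, opaque);
        }
        tail = (tail + 1) % nb_entries;
        n++;
    }

    *consumed = n;
    return tail;
}
```

With tail = 6, head = 2 and an 8-entry ring, the loop visits slots 6, 7, 0
and 1, then stops with the tail equal to the head.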



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC v6 18/24] hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (16 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 17/24] vfio/pci: Implement the DMA fault handler Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 19/24] hw/arm/smmuv3: Store the PASID table GPA in the translation config Eric Auger
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

The SMMUv3 has the peculiarity of translating MSI
transactions. Let's advertise the corresponding
attribute.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---
---
 hw/arm/smmuv3.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index e33eabd028..9bea5f65ae 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1515,6 +1515,9 @@ static int smmuv3_get_attr(IOMMUMemoryRegion *iommu,
     if (attr == IOMMU_ATTR_VFIO_NESTED) {
         *(bool *) data = true;
         return 0;
+    } else if (attr == IOMMU_ATTR_MSI_TRANSLATE) {
+        *(bool *) data = true;
+        return 0;
     }
     return -EINVAL;
 }
-- 
2.20.1



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC v6 19/24] hw/arm/smmuv3: Store the PASID table GPA in the translation config
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (17 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 18/24] hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 20/24] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation Eric Auger
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

For VFIO integration we will need to pass the Context Descriptor (CD)
table GPA to the host. The CD table is also referred to as the PASID
table. Its GPA corresponds to the s1ctrptr field of the Stream Table
Entry. So let's decode and store it in the configuration structure.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmuv3.c              | 1 +
 include/hw/arm/smmu-common.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 9bea5f65ae..1424e08c31 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -352,6 +352,7 @@ static int decode_ste(SMMUv3State *s, SMMUTransCfg *cfg,
                       "SMMUv3 S1 stalling fault model not allowed yet\n");
         goto bad_ste;
     }
+    cfg->s1ctxptr = STE_CTXPTR(ste);
     return 0;
 
 bad_ste:
diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index 1f37844e5c..353668f4ea 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -68,6 +68,7 @@ typedef struct SMMUTransCfg {
     uint8_t tbi;               /* Top Byte Ignore */
     uint16_t asid;
     SMMUTransTableInfo tt[2];
+    dma_addr_t s1ctxptr;
     uint32_t iotlb_hits;       /* counts IOTLB hits for this asid */
     uint32_t iotlb_misses;     /* counts IOTLB misses for this asid */
 } SMMUTransCfg;
-- 
2.20.1
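
As an illustration of the decode step, here is a rough sketch of pulling a
context pointer out of the first two 32-bit words of a Stream Table Entry.
It assumes the pointer occupies bits [47:6] of the first 64 bits, which is
my reading of what STE_CTXPTR does in QEMU; that bit range and the function
name ste_s1_context_ptr() are assumptions and should be checked against
smmuv3-internal.h and the SMMUv3 specification.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: rebuild a 64-byte-aligned table address from the
 * two lowest 32-bit words of an STE. The low 6 bits of word0 hold other
 * fields and are masked out.
 */
static inline uint64_t ste_s1_context_ptr(uint32_t word0, uint32_t word1)
{
    uint64_t addr = (uint64_t)(word1 & 0xffffu) << 32;  /* bits [47:32] */

    addr |= word0 & 0xffffffc0u;                        /* bits [31:6] */
    return addr;
}
```

The result is a guest physical address, so it still has to go through
stage 2 translation before the physical SMMU can fetch the table.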



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC v6 20/24] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (18 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 19/24] hw/arm/smmuv3: Store the PASID table GPA in the translation config Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 21/24] hw/arm/smmuv3: Fill the IOTLBEntry leaf field " Eric Auger
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

When the guest invalidates one S1 entry, it passes the asid.
When propagating this invalidation down to the host, the asid
information must also be passed. So let's fill the arch_id field
introduced for that purpose.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmuv3.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 1424e08c31..66603c1fde 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -838,6 +838,7 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
     entry.iova = iova;
     entry.addr_mask = (1 << tt->granule_sz) - 1;
     entry.perm = IOMMU_NONE;
+    entry.arch_id = asid;
 
     memory_region_notify_one(n, &entry);
 }
-- 
2.20.1



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC v6 21/24] hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (19 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 20/24] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 22/24] hw/arm/smmuv3: Pass stage 1 configurations to the host Eric Auger
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Let's propagate the leaf attribute throughout the invalidation path.
This hint is used to reduce the scope of the invalidations to the
last level of translation. Not enforcing it induces large performance
penalties in nested mode.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmuv3.c     | 16 +++++++++-------
 hw/arm/trace-events |  2 +-
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 66603c1fde..edd76bce4c 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -811,8 +811,7 @@ epilogue:
  */
 static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
                                IOMMUNotifier *n,
-                               int asid,
-                               dma_addr_t iova)
+                               int asid, dma_addr_t iova, bool leaf)
 {
     SMMUDevice *sdev = container_of(mr, SMMUDevice, iommu);
     SMMUEventInfo event = {.inval_ste_allowed = true};
@@ -839,12 +838,14 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
     entry.addr_mask = (1 << tt->granule_sz) - 1;
     entry.perm = IOMMU_NONE;
     entry.arch_id = asid;
+    entry.leaf = leaf;
 
     memory_region_notify_one(n, &entry);
 }
 
 /* invalidate an asid/iova tuple in all mr's */
-static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova)
+static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid,
+                                      dma_addr_t iova, bool leaf)
 {
     SMMUDevice *sdev;
 
@@ -855,7 +856,7 @@ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova)
         trace_smmuv3_inv_notifiers_iova(mr->parent_obj.name, asid, iova);
 
         IOMMU_NOTIFIER_FOREACH(n, mr) {
-            smmuv3_notify_iova(mr, n, asid, iova);
+            smmuv3_notify_iova(mr, n, asid, iova, leaf);
         }
     }
 }
@@ -993,9 +994,10 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
         {
             dma_addr_t addr = CMD_ADDR(&cmd);
             uint16_t vmid = CMD_VMID(&cmd);
+            bool leaf = CMD_LEAF(&cmd);
 
-            trace_smmuv3_cmdq_tlbi_nh_vaa(vmid, addr);
-            smmuv3_inv_notifiers_iova(bs, -1, addr);
+            trace_smmuv3_cmdq_tlbi_nh_vaa(vmid, addr, leaf);
+            smmuv3_inv_notifiers_iova(bs, -1, addr, leaf);
             smmu_iotlb_inv_all(bs);
             break;
         }
@@ -1007,7 +1009,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             bool leaf = CMD_LEAF(&cmd);
 
             trace_smmuv3_cmdq_tlbi_nh_va(vmid, asid, addr, leaf);
-            smmuv3_inv_notifiers_iova(bs, asid, addr);
+            smmuv3_inv_notifiers_iova(bs, asid, addr, leaf);
             smmu_iotlb_inv_iova(bs, asid, addr);
             break;
         }
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 0acedcedc6..3809005cba 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -43,7 +43,7 @@ smmuv3_cmdq_cfgi_cd(uint32_t sid) "streamid = %d"
 smmuv3_config_cache_hit(uint32_t sid, uint32_t hits, uint32_t misses, uint32_t perc) "Config cache HIT for sid %d (hits=%d, misses=%d, hit rate=%d)"
 smmuv3_config_cache_miss(uint32_t sid, uint32_t hits, uint32_t misses, uint32_t perc) "Config cache MISS for sid %d (hits=%d, misses=%d, hit rate=%d)"
 smmuv3_cmdq_tlbi_nh_va(int vmid, int asid, uint64_t addr, bool leaf) "vmid =%d asid =%d addr=0x%"PRIx64" leaf=%d"
-smmuv3_cmdq_tlbi_nh_vaa(int vmid, uint64_t addr) "vmid =%d addr=0x%"PRIx64
+smmuv3_cmdq_tlbi_nh_vaa(int vmid, uint64_t addr, bool leaf) "vmid =%d addr=0x%"PRIx64" leaf=%d"
 smmuv3_cmdq_tlbi_nh(void) ""
 smmuv3_cmdq_tlbi_nh_asid(uint16_t asid) "asid=%d"
 smmu_iotlb_cache_hit(uint16_t asid, uint64_t addr, uint32_t hit, uint32_t miss, uint32_t p) "IOTLB cache HIT asid=%d addr=0x%"PRIx64" hit=%d miss=%d hit rate=%d"
-- 
2.20.1
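
To summarize what an invalidation notification carries after this patch and
the previous one, here is a small sketch. struct inv_entry and
make_inv_entry() are hypothetical stand-ins for the IOTLB entry fields set
in smmuv3_notify_iova(); the 64-bit shift is a choice of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the fields the series sets on the IOTLB entry. */
struct inv_entry {
    uint64_t iova;
    uint64_t addr_mask;  /* covers one translation granule */
    int arch_id;         /* ASID, or -1 when the command carries none */
    bool leaf;           /* true: only last-level entries are targeted */
};

static inline struct inv_entry make_inv_entry(uint64_t iova,
                                              unsigned granule_sz,
                                              int asid, bool leaf)
{
    struct inv_entry e;

    assert(granule_sz < 64);
    e.iova = iova;
    e.addr_mask = ((uint64_t)1 << granule_sz) - 1;
    e.arch_id = asid;
    e.leaf = leaf;
    return e;
}
```

A TLBI_NH_VAA command has no ASID, which is why the patch passes -1 in that
case while still forwarding the leaf bit.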



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC v6 22/24] hw/arm/smmuv3: Pass stage 1 configurations to the host
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (20 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 21/24] hw/arm/smmuv3: Fill the IOTLBEntry leaf field " Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 23/24] hw/arm/smmuv3: Implement fault injection Eric Auger
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

If PASID PciOps are set for the device, we call
the set_pasid_table() callback on each STE update.

This allows the guest stage 1 configuration to be passed
to the host and applied at the physical level.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v4 -> v5:
- Use PciOps instead of config notifiers

v3 -> v4:
- fix compile issue with mingw

v2 -> v3:
- adapt to pasid_cfg field changes. Use local variable
- add trace event
- set version fields
- use CONFIG_PASID

v1 -> v2:
- do not notify anymore on CD change. Anyway the smmuv3 linux
  driver is not sending any CD invalidation commands. If we were
  to propagate CD invalidation commands, we would use the
  CACHE_INVALIDATE VFIO ioctl.
- notify precise config flags to prepare for the addition of new
  flags
---
 hw/arm/smmuv3.c     | 77 +++++++++++++++++++++++++++++++++++----------
 hw/arm/trace-events |  1 +
 2 files changed, 61 insertions(+), 17 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index edd76bce4c..7a805030e2 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -16,6 +16,10 @@
  * with this program; if not, see <http://www.gnu.org/licenses/>.
  */
 
+#ifdef __linux__
+#include "linux/iommu.h"
+#endif
+
 #include "qemu/osdep.h"
 #include "hw/irq.h"
 #include "hw/sysbus.h"
@@ -861,6 +865,60 @@ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid,
     }
 }
 
+static void smmuv3_notify_config_change(SMMUState *bs, uint32_t sid)
+{
+#ifdef __linux__
+    IOMMUMemoryRegion *mr = smmu_iommu_mr(bs, sid);
+    SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
+                           .inval_ste_allowed = true};
+    IOMMUConfig iommu_config;
+    SMMUTransCfg *cfg;
+    SMMUDevice *sdev;
+
+    if (!mr) {
+        return;
+    }
+
+    sdev = container_of(mr, SMMUDevice, iommu);
+
+    /* flush QEMU config cache */
+    smmuv3_flush_config(sdev);
+
+    if (!pci_device_is_pasid_ops_set(sdev->bus, sdev->devfn)) {
+        return;
+    }
+
+    cfg = smmuv3_get_config(sdev, &event);
+
+    if (!cfg) {
+        return;
+    }
+
+    iommu_config.pasid_cfg.version = PASID_TABLE_CFG_VERSION_1;
+    iommu_config.pasid_cfg.format = IOMMU_PASID_FORMAT_SMMUV3;
+    iommu_config.pasid_cfg.base_ptr = cfg->s1ctxptr;
+    iommu_config.pasid_cfg.pasid_bits = 0;
+    iommu_config.pasid_cfg.smmuv3.version = PASID_TABLE_SMMUV3_CFG_VERSION_1;
+
+    if (cfg->disabled || cfg->bypassed) {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_BYPASS;
+    } else if (cfg->aborted) {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_ABORT;
+    } else {
+        iommu_config.pasid_cfg.config = IOMMU_PASID_CONFIG_TRANSLATE;
+    }
+
+    trace_smmuv3_notify_config_change(mr->parent_obj.name,
+                                      iommu_config.pasid_cfg.config,
+                                      iommu_config.pasid_cfg.base_ptr);
+
+    if (pci_device_set_pasid_table(sdev->bus, sdev->devfn, &iommu_config)) {
+        error_report("Failed to pass PASID table to host for iommu mr %s (%m)",
+                     mr->parent_obj.name);
+    }
+#endif
+}
+
 static int smmuv3_cmdq_consume(SMMUv3State *s)
 {
     SMMUState *bs = ARM_SMMU(s);
@@ -911,22 +969,14 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
         case SMMU_CMD_CFGI_STE:
         {
             uint32_t sid = CMD_SID(&cmd);
-            IOMMUMemoryRegion *mr = smmu_iommu_mr(bs, sid);
-            SMMUDevice *sdev;
 
             if (CMD_SSEC(&cmd)) {
                 cmd_error = SMMU_CERROR_ILL;
                 break;
             }
 
-            if (!mr) {
-                break;
-            }
-
             trace_smmuv3_cmdq_cfgi_ste(sid);
-            sdev = container_of(mr, SMMUDevice, iommu);
-            smmuv3_flush_config(sdev);
-
+            smmuv3_notify_config_change(bs, sid);
             break;
         }
         case SMMU_CMD_CFGI_STE_RANGE: /* same as SMMU_CMD_CFGI_ALL */
@@ -943,14 +993,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             trace_smmuv3_cmdq_cfgi_ste_range(start, end);
 
             for (i = start; i <= end; i++) {
-                IOMMUMemoryRegion *mr = smmu_iommu_mr(bs, i);
-                SMMUDevice *sdev;
-
-                if (!mr) {
-                    continue;
-                }
-                sdev = container_of(mr, SMMUDevice, iommu);
-                smmuv3_flush_config(sdev);
+                smmuv3_notify_config_change(bs, i);
             }
             break;
         }
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 3809005cba..741e645ae2 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -52,4 +52,5 @@ smmuv3_config_cache_inv(uint32_t sid) "Config cache INV for sid %d"
 smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node for iommu mr=%s"
 smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s"
 smmuv3_inv_notifiers_iova(const char *name, uint16_t asid, uint64_t iova) "iommu mr=%s asid=%d iova=0x%"PRIx64
+smmuv3_notify_config_change(const char *name, uint8_t config, uint64_t s1ctxptr) "iommu mr=%s config=%d s1ctxptr=0x%"PRIx64
 
-- 
2.20.1
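
The selection of the PASID table config in smmuv3_notify_config_change()
boils down to a three-way decision on the decoded STE state. A minimal
sketch, with hypothetical enum and struct names standing in for the
IOMMU_PASID_CONFIG_* values and SMMUTransCfg:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real values come from linux/iommu.h. */
enum pasid_config {
    PASID_CONFIG_TRANSLATE,
    PASID_CONFIG_BYPASS,
    PASID_CONFIG_ABORT,
};

/* Hypothetical subset of the decoded stream configuration. */
struct trans_cfg {
    bool disabled;
    bool bypassed;
    bool aborted;
};

static inline enum pasid_config
pasid_config_from_cfg(const struct trans_cfg *cfg)
{
    assert(cfg);

    /* Same precedence as the patch: bypass is checked before abort. */
    if (cfg->disabled || cfg->bypassed) {
        return PASID_CONFIG_BYPASS;
    }
    if (cfg->aborted) {
        return PASID_CONFIG_ABORT;
    }
    return PASID_CONFIG_TRANSLATE;
}
```

Only the translate case needs the s1ctxptr value to be meaningful; in the
other two cases the host does not walk the guest context descriptor table.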



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC v6 23/24] hw/arm/smmuv3: Implement fault injection
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (21 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 22/24] hw/arm/smmuv3: Pass stage 1 configurations to the host Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-20 16:58 ` [RFC v6 24/24] hw/arm/smmuv3: Allow MAP notifiers Eric Auger
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

We convert iommu_fault structs received from the kernel
into the data struct used by the emulation code and record
the events into the virtual event queue.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v3 -> v4:
- fix compile issue on mingw

Exhaustive mapping remains to be done
---
 hw/arm/smmuv3.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 7a805030e2..6db3d2f218 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1569,6 +1569,76 @@ static int smmuv3_get_attr(IOMMUMemoryRegion *iommu,
     return -EINVAL;
 }
 
+struct iommu_fault;
+
+static inline int
+smmuv3_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
+                     struct iommu_fault *buf)
+{
+#ifdef __linux__
+    SMMUDevice *sdev = container_of(iommu_mr, SMMUDevice, iommu);
+    SMMUv3State *s3 = sdev->smmu;
+    uint32_t sid = smmu_get_sid(sdev);
+    int i;
+
+    for (i = 0; i < count; i++) {
+        SMMUEventInfo info = {};
+        struct iommu_fault_unrecoverable *record;
+
+        if (buf[i].type != IOMMU_FAULT_DMA_UNRECOV) {
+            continue;
+        }
+
+        info.sid = sid;
+        record = &buf[i].event;
+
+        switch (record->reason) {
+        case IOMMU_FAULT_REASON_PASID_INVALID:
+            info.type = SMMU_EVT_C_BAD_SUBSTREAMID;
+            /* TODO further fill info.u.c_bad_substream */
+            break;
+        case IOMMU_FAULT_REASON_PASID_FETCH:
+            info.type = SMMU_EVT_F_CD_FETCH;
+            break;
+        case IOMMU_FAULT_REASON_BAD_PASID_ENTRY:
+            info.type = SMMU_EVT_C_BAD_CD;
+            /* TODO further fill info.u.c_bad_cd */
+            break;
+        case IOMMU_FAULT_REASON_WALK_EABT:
+            info.type = SMMU_EVT_F_WALK_EABT;
+            info.u.f_walk_eabt.addr = record->addr;
+            info.u.f_walk_eabt.addr2 = record->fetch_addr;
+            break;
+        case IOMMU_FAULT_REASON_PTE_FETCH:
+            info.type = SMMU_EVT_F_TRANSLATION;
+            info.u.f_translation.addr = record->addr;
+            break;
+        case IOMMU_FAULT_REASON_OOR_ADDRESS:
+            info.type = SMMU_EVT_F_ADDR_SIZE;
+            info.u.f_addr_size.addr = record->addr;
+            break;
+        case IOMMU_FAULT_REASON_ACCESS:
+            info.type = SMMU_EVT_F_ACCESS;
+            info.u.f_access.addr = record->addr;
+            break;
+        case IOMMU_FAULT_REASON_PERMISSION:
+            info.type = SMMU_EVT_F_PERMISSION;
+            info.u.f_permission.addr = record->addr;
+            break;
+        default:
+            warn_report("%s Unexpected fault reason received from host: %d",
+                        __func__, record->reason);
+            continue;
+        }
+
+        smmuv3_record_event(s3, &info);
+    }
+    return 0;
+#else
+    return -1;
+#endif
+}
+
 static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
                                                   void *data)
 {
@@ -1577,6 +1647,7 @@ static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
     imrc->translate = smmuv3_translate;
     imrc->notify_flag_changed = smmuv3_notify_flag_changed;
     imrc->get_attr = smmuv3_get_attr;
+    imrc->inject_faults = smmuv3_inject_faults;
 }
 
 static const TypeInfo smmuv3_type_info = {
-- 
2.20.1
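
The switch statement in smmuv3_inject_faults() is a fixed mapping from host
fault reasons to SMMUv3 event types. The same mapping can be sketched as a
lookup table. The enum names and values below are hypothetical placeholders
chosen for this sketch; they are not the IOMMU_FAULT_REASON_* or SMMU_EVT_*
numeric values.

```c
#include <assert.h>

/* Hypothetical placeholders for the host-side fault reasons. */
enum host_reason {
    HOST_PASID_INVALID,
    HOST_PASID_FETCH,
    HOST_BAD_PASID_ENTRY,
    HOST_WALK_EABT,
    HOST_PTE_FETCH,
    HOST_OOR_ADDRESS,
    HOST_ACCESS,
    HOST_PERMISSION,
    HOST_REASON_COUNT
};

/* Hypothetical placeholders for the guest-visible event types. */
enum guest_event {
    EVT_NONE = 0,
    EVT_C_BAD_SUBSTREAMID,
    EVT_F_CD_FETCH,
    EVT_C_BAD_CD,
    EVT_F_WALK_EABT,
    EVT_F_TRANSLATION,
    EVT_F_ADDR_SIZE,
    EVT_F_ACCESS,
    EVT_F_PERMISSION
};

static const enum guest_event reason_to_event[HOST_REASON_COUNT] = {
    [HOST_PASID_INVALID]   = EVT_C_BAD_SUBSTREAMID,
    [HOST_PASID_FETCH]     = EVT_F_CD_FETCH,
    [HOST_BAD_PASID_ENTRY] = EVT_C_BAD_CD,
    [HOST_WALK_EABT]       = EVT_F_WALK_EABT,
    [HOST_PTE_FETCH]       = EVT_F_TRANSLATION,
    [HOST_OOR_ADDRESS]     = EVT_F_ADDR_SIZE,
    [HOST_ACCESS]          = EVT_F_ACCESS,
    [HOST_PERMISSION]      = EVT_F_PERMISSION,
};

/* Returns EVT_NONE for reasons the table does not know about. */
static inline enum guest_event map_fault_reason(int reason)
{
    if (reason < 0 || reason >= HOST_REASON_COUNT) {
        return EVT_NONE;
    }
    return reason_to_event[reason];
}
```

A table keeps the mapping in one place, but the switch in the patch also
fills per-event address fields, which a plain table does not capture.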



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC v6 24/24] hw/arm/smmuv3: Allow MAP notifiers
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (22 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 23/24] hw/arm/smmuv3: Implement fault injection Eric Auger
@ 2020-03-20 16:58 ` Eric Auger
  2020-03-25 11:35 ` [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Shameerali Kolothum Thodi
  2020-03-31  6:42 ` Zhangfei Gao
  25 siblings, 0 replies; 40+ messages in thread
From: Eric Auger @ 2020-03-20 16:58 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

We now have all the bricks needed to support nested paging. This
uses MAP notifiers to map the MSIs, so let's allow MAP
notifiers to be registered.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmuv3.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 6db3d2f218..dc716d7d59 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1537,14 +1537,6 @@ static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
     SMMUv3State *s3 = sdev->smmu;
     SMMUState *s = &(s3->smmu_state);
 
-    if (new & IOMMU_NOTIFIER_MAP) {
-        error_setg(errp,
-                   "device %02x.%02x.%x requires iommu MAP notifier which is "
-                   "not currently supported", pci_bus_num(sdev->bus),
-                   PCI_SLOT(sdev->devfn), PCI_FUNC(sdev->devfn));
-        return -EINVAL;
-    }
-
     if (old == IOMMU_NOTIFIER_NONE) {
         trace_smmuv3_notify_flag_add(iommu->parent_obj.name);
         QLIST_INSERT_HEAD(&s->devices_with_notifiers, sdev, next);
-- 
2.20.1



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (23 preceding siblings ...)
  2020-03-20 16:58 ` [RFC v6 24/24] hw/arm/smmuv3: Allow MAP notifiers Eric Auger
@ 2020-03-25 11:35 ` Shameerali Kolothum Thodi
  2020-03-25 12:42   ` Auger Eric
  2020-04-03 10:45   ` Auger Eric
  2020-03-31  6:42 ` Zhangfei Gao
  25 siblings, 2 replies; 40+ messages in thread
From: Shameerali Kolothum Thodi @ 2020-03-25 11:35 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx, zhangfei.gao,
	bbhushan2, will

Hi Eric,

> -----Original Message-----
> From: Eric Auger [mailto:eric.auger@redhat.com]
> Sent: 20 March 2020 16:58
> To: eric.auger.pro@gmail.com; eric.auger@redhat.com;
> qemu-devel@nongnu.org; qemu-arm@nongnu.org; peter.maydell@linaro.org;
> mst@redhat.com; alex.williamson@redhat.com;
> jacob.jun.pan@linux.intel.com; yi.l.liu@intel.com
> Cc: peterx@redhat.com; jean-philippe@linaro.org; will@kernel.org;
> tnowicki@marvell.com; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; zhangfei.gao@foxmail.com;
> zhangfei.gao@linaro.org; maz@kernel.org; bbhushan2@marvell.com
> Subject: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
> 
> Up to now vSMMUv3 has not been integrated with VFIO. VFIO
> integration requires to program the physical IOMMU consistently
> with the guest mappings. However, as opposed to VTD, SMMUv3 has
> no "Caching Mode" which allows easy trapping of guest mappings.
> This means the vSMMUV3 cannot use the same VFIO integration as VTD.
> 
> However SMMUv3 has 2 translation stages. This was devised with
> virtualization use case in mind where stage 1 is "owned" by the
> guest whereas the host uses stage 2 for VM isolation.
> 
> This series sets up this nested translation stage. It only works
> if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
> other words, it does not work if there is a physical SMMUv2).

I was testing this series on one of our hardware boards with SMMUv3, and I
observed an issue when bringing up a guest first with and then without the
vSMMUv3.

The steps are as follows:

1. Start a guest with "iommu=smmuv3" and a network VF device.

2. Exit the VM.

3. Start the guest again without "iommu=smmuv3".

This time QEMU crashes with:

[ 0.447830] hns3 0000:00:01.0: enabling device (0000 -> 0002)
/home/shameer/qemu-eric/qemu/hw/vfio/pci.c:2851:vfio_dma_fault_notifier_handler:
Object 0xaaaaeeb47c00 is not an instance of type
qemu:iommu-memory-region
./qemu_run-vsmmu-hns: line 9: 13609 Aborted                 (core
dumped) ./qemu-system-aarch64-vsmmuv3v10 -machine
virt,kernel_irqchip=on,gic-version=3 -cpu host -smp cpus=1 -kernel
Image-ericv10-uacce -initrd rootfs-iperf.cpio -bios
QEMU_EFI_Dec2018.fd -device vfio-pci,host=0000:7d:02.1 -net none -m
4096 -nographic -D -d -enable-kvm -append "console=ttyAMA0
root=/dev/vda -m 4096 rw earlycon=pl011,0x9000000"

And you can see that the host kernel receives an SMMUv3 C_BAD_STE event:

[10499.379288] vfio-pci 0000:7d:02.1: enabling device (0000 -> 0002)
[10501.943881] arm-smmu-v3 arm-smmu-v3.2.auto: event 0x04 received:
[10501.943884] arm-smmu-v3 arm-smmu-v3.2.auto: 0x00007d1100000004
[10501.943886] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000100800000080
[10501.943887] arm-smmu-v3 arm-smmu-v3.2.auto: 0x00000000fe040000
[10501.943889] arm-smmu-v3 arm-smmu-v3.2.auto: 0x000000007e04c440

So I suspect we didn't clear the nested stage configuration, and that affects
the translation in the second run. I tried to force a call to
vfio_detach_pasid_table(), but that didn't solve the problem.

Maybe I am missing something. Could you please take a look and let me know?
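
For readers following the trace: the abort message has the shape of a failed
checked type cast, presumably hit when the fault handler treats a plain memory
region as an IOMMU memory region in the run without a vIOMMU. The sketch below
is only a toy model of guarding such a cast. ToyObject, toy_is_iommu_region()
and toy_dma_fault_handler() are made-up names, not QEMU APIs, and this is not
claimed to be the fix for the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a typed object: it only carries a type name. */
typedef struct ToyObject {
    const char *type_name;
} ToyObject;

#define TOY_TYPE_IOMMU_MR "toy-iommu-memory-region"
#define TOY_TYPE_PLAIN_MR "toy-memory-region"

/* True only if the object exists and is tagged as an IOMMU region. */
bool toy_is_iommu_region(const ToyObject *obj)
{
    return obj != NULL && strcmp(obj->type_name, TOY_TYPE_IOMMU_MR) == 0;
}

/*
 * Hypothetical fault handler: returns 0 when the faults could be forwarded
 * to a vIOMMU, and -1 when there is no IOMMU region to forward them to.
 * Checking first avoids an asserting cast on the wrong object type.
 */
int toy_dma_fault_handler(const ToyObject *root, int nr_faults)
{
    if (!toy_is_iommu_region(root)) {
        return -1; /* no vIOMMU behind the device: nothing to inject into */
    }
    (void)nr_faults; /* a real handler would forward the faults here */
    return 0;
}
```

With a region tagged TOY_TYPE_PLAIN_MR the handler returns -1 instead of
aborting, which is the behaviour one would want in step 3 above.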

Thanks,
Shameer

> - We force the host to use stage 2 instead of stage 1, when we
>   detect a vSMMUV3 is behind a VFIO device. For a VFIO device
>   without any virtual IOMMU, we still use stage 1 as many existing
>   SMMUs expect this behavior.
> - We use PCIPASIDOps to propage guest stage1 config changes on
>   STE (Stream Table Entry) changes.
> - We implement a specific UNMAP notifier that conveys guest
>   IOTLB invalidations to the host
> - We register MSI IOVA/GPA bindings to the host so that this latter
>   can build a nested stage translation
> - As the legacy MAP notifier is not called anymore, we must make
>   sure stage 2 mappings are set. This is achieved through another
>   prereg memory listener.
> - Physical SMMU stage 1 related faults are reported to the guest
>   via en eventfd mechanism and exposed trhough a dedicated VFIO-PCI
>   region. Then they are reinjected into the guest.
> 
> Best Regards
> 
> Eric
> 
> This series can be found at:
> https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
> 
> Kernel Dependencies:
> [1] [PATCH v10 00/11] SMMUv3 Nested Stage Setup (VFIO part)
> [2] [PATCH v10 00/13] SMMUv3 Nested Stage Setup (IOMMU part)
> branch at:
> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
> 
> History:
> 
> v5 -> v6:
> - just rebase work
> 
> v4 -> v5:
> - Use PCIPASIDOps for config update notifications
> - removal of notification for MSI binding which is not needed
>   anymore
> - Use a single fault region
> - use the specific interrupt index
> 
> v3 -> v4:
> - adapt to changes in uapi (asid cache invalidation)
> - check VFIO_PCI_DMA_FAULT_IRQ_INDEX is supported at kernel level
>   before attempting to set signaling for it.
> - sync on 5.2-rc1 kernel headers + Drew's patch that imports sve_context.h
> - fix MSI binding for MSI (not MSIX)
> - fix mingw compilation
> 
> v2 -> v3:
> - rework fault handling
> - MSI binding registration done in vfio-pci. MSI binding tear down called
>   on container cleanup path
> - leaf parameter propagated
> 
> v1 -> v2:
> - Fixed dual assignment (asid now correctly propagated on TLB invalidations)
> - Integrated fault reporting
> 
> 
> Eric Auger (23):
>   update-linux-headers: Import iommu.h
>   header update against 5.6.0-rc3 and IOMMU/VFIO nested stage APIs
>   memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute
>   memory: Add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
>   memory: Introduce IOMMU Memory Region inject_faults API
>   memory: Add arch_id and leaf fields in IOTLBEntry
>   iommu: Introduce generic header
>   vfio: Force nested if iommu requires it
>   vfio: Introduce hostwin_from_range helper
>   vfio: Introduce helpers to DMA map/unmap a RAM section
>   vfio: Set up nested stage mappings
>   vfio: Pass stage 1 MSI bindings to the host
>   vfio: Helper to get IRQ info including capabilities
>   vfio/pci: Register handler for iommu fault
>   vfio/pci: Set up the DMA FAULT region
>   vfio/pci: Implement the DMA fault handler
>   hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute
>   hw/arm/smmuv3: Store the PASID table GPA in the translation config
>   hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
>   hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
>   hw/arm/smmuv3: Pass stage 1 configurations to the host
>   hw/arm/smmuv3: Implement fault injection
>   hw/arm/smmuv3: Allow MAP notifiers
> 
> Liu Yi L (1):
>   pci: introduce PCIPASIDOps to PCIDevice
> 
>  hw/arm/smmuv3.c                 | 189 ++++++++++--
>  hw/arm/trace-events             |   3 +-
>  hw/pci/pci.c                    |  34 +++
>  hw/vfio/common.c                | 506 +++++++++++++++++++++++++-------
>  hw/vfio/pci.c                   | 267 ++++++++++++++++-
>  hw/vfio/pci.h                   |   9 +
>  hw/vfio/trace-events            |   9 +-
>  include/exec/memory.h           |  49 +++-
>  include/hw/arm/smmu-common.h    |   1 +
>  include/hw/iommu/iommu.h        |  28 ++
>  include/hw/pci/pci.h            |  11 +
>  include/hw/vfio/vfio-common.h   |  16 +
>  linux-headers/COPYING           |   2 +
>  linux-headers/asm-x86/kvm.h     |   1 +
>  linux-headers/linux/iommu.h     | 375 +++++++++++++++++++++++
>  linux-headers/linux/vfio.h      | 109 ++++++-
>  memory.c                        |  10 +
>  scripts/update-linux-headers.sh |   2 +-
>  18 files changed, 1478 insertions(+), 143 deletions(-)
>  create mode 100644 include/hw/iommu/iommu.h
>  create mode 100644 linux-headers/linux/iommu.h
> 
> --
> 2.20.1



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
  2020-03-25 11:35 ` [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Shameerali Kolothum Thodi
@ 2020-03-25 12:42   ` Auger Eric
  2020-04-03 10:45   ` Auger Eric
  1 sibling, 0 replies; 40+ messages in thread
From: Auger Eric @ 2020-03-25 12:42 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, eric.auger.pro, qemu-devel, qemu-arm,
	peter.maydell, mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx, zhangfei.gao,
	bbhushan2, will

Hi Shameer,

On 3/25/20 12:35 PM, Shameerali Kolothum Thodi wrote:
> Hi Eric,
> 
>> -----Original Message-----
>> From: Eric Auger [mailto:eric.auger@redhat.com]
>> Sent: 20 March 2020 16:58
>> To: eric.auger.pro@gmail.com; eric.auger@redhat.com;
>> qemu-devel@nongnu.org; qemu-arm@nongnu.org; peter.maydell@linaro.org;
>> mst@redhat.com; alex.williamson@redhat.com;
>> jacob.jun.pan@linux.intel.com; yi.l.liu@intel.com
>> Cc: peterx@redhat.com; jean-philippe@linaro.org; will@kernel.org;
>> tnowicki@marvell.com; Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>; zhangfei.gao@foxmail.com;
>> zhangfei.gao@linaro.org; maz@kernel.org; bbhushan2@marvell.com
>> Subject: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
>>
>> Up to now vSMMUv3 has not been integrated with VFIO. VFIO
>> integration requires to program the physical IOMMU consistently
>> with the guest mappings. However, as opposed to VTD, SMMUv3 has
>> no "Caching Mode" which allows easy trapping of guest mappings.
>> This means the vSMMUV3 cannot use the same VFIO integration as VTD.
>>
>> However SMMUv3 has 2 translation stages. This was devised with
>> virtualization use case in mind where stage 1 is "owned" by the
>> guest whereas the host uses stage 2 for VM isolation.
>>
>> This series sets up this nested translation stage. It only works
>> if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
>> other words, it does not work if there is a physical SMMUv2).
> 
> I was testing this series on one of our hardware board with SMMUv3. I did
> observe an issue while trying to bring up Guest with and without the vsmmuV3.
> 
> Steps are like below,
> 
> 1. start a guest with "iommu=smmuv3" and a n/w vf device.
> 
> 2.Exit the VM.
> 
> 3. start the guest again without "iommu=smmuv3"
> 
> This time qemu crashes with,
> 
> [ 0.447830] hns3 0000:00:01.0: enabling device (0000 -> 0002)
> /home/shameer/qemu-eric/qemu/hw/vfio/pci.c:2851:vfio_dma_fault_notifier_handler:
> Object 0xaaaaeeb47c00 is not an instance of type
> qemu:iommu-memory-region
> ./qemu_run-vsmmu-hns: line 9: 13609 Aborted                 (core
> dumped) ./qemu-system-aarch64-vsmmuv3v10 -machine
> virt,kernel_irqchip=on,gic-version=3 -cpu host -smp cpus=1 -kernel
> Image-ericv10-uacce -initrd rootfs-iperf.cpio -bios
> QEMU_EFI_Dec2018.fd -device vfio-pci,host=0000:7d:02.1 -net none -m
> 4096 -nographic -D -d -enable-kvm -append "console=ttyAMA0
> root=/dev/vda -m 4096 rw earlycon=pl011,0x9000000"
> 
> And you can see that host kernel receives smmuv3 C_BAD_STE event,
> 
> [10499.379288] vfio-pci 0000:7d:02.1: enabling device (0000 -> 0002)
> [10501.943881] arm-smmu-v3 arm-smmu-v3.2.auto: event 0x04 received:
> [10501.943884] arm-smmu-v3 arm-smmu-v3.2.auto: 0x00007d1100000004
> [10501.943886] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000100800000080
> [10501.943887] arm-smmu-v3 arm-smmu-v3.2.auto: 0x00000000fe040000
> [10501.943889] arm-smmu-v3 arm-smmu-v3.2.auto: 0x000000007e04c440
> 
> So I suspect we didn't clear nested stage configuration and that affects the 
> translation in the second run. I tried to issue(force) a vfio_detach_pasid_table() but 
> that didn't solve the problem.
> 
> May be I am missing something. Could you please take a look and let me know.

Sure, I will try to reproduce on my end. Thank you for the bug report!

Best Regards

Eric



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [RFC v6 01/24] update-linux-headers: Import iommu.h
  2020-03-20 16:58 ` [RFC v6 01/24] update-linux-headers: Import iommu.h Eric Auger
@ 2020-03-26 12:58   ` Liu, Yi L
  2020-03-26 17:51     ` Auger Eric
  0 siblings, 1 reply; 40+ messages in thread
From: Liu, Yi L @ 2020-03-26 12:58 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Saturday, March 21, 2020 12:58 AM
> To: eric.auger.pro@gmail.com; eric.auger@redhat.com; qemu-devel@nongnu.org;
> Subject: [RFC v6 01/24] update-linux-headers: Import iommu.h
> 
> Update the script to import the new iommu.h uapi header.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>  scripts/update-linux-headers.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
> index 29c27f4681..5b64ee3912 100755
> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh
> @@ -141,7 +141,7 @@ done
> 
>  rm -rf "$output/linux-headers/linux"
>  mkdir -p "$output/linux-headers/linux"
> -for header in kvm.h vfio.h vfio_ccw.h vhost.h \
> +for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
>                psci.h psp-sev.h userfaultfd.h mman.h; do
>      cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
>  done

Hi Eric,

This patch has already been acked by Cornelia. :-)

https://patchwork.ozlabs.org/patch/1259643/

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice
  2020-03-20 16:58 ` [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice Eric Auger
@ 2020-03-26 13:01   ` Liu, Yi L
  0 siblings, 0 replies; 40+ messages in thread
From: Liu, Yi L @ 2020-03-26 13:01 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Hi Eric,

Not sure about your preference. I've modified my patch as below, which now
uses HostIOMMUContext to provide callbacks for the vIOMMU to call into VFIO.
Please feel free to give your suggestions.

https://patchwork.ozlabs.org/patch/1259665/

Regards,
Yi Liu

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Saturday, March 21, 2020 12:58 AM
> To: eric.auger.pro@gmail.com; eric.auger@redhat.com; qemu-devel@nongnu.org;
> Subject: [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch introduces PCIPASIDOps for IOMMU related operations.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html
> https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00940.html
> 
> So far, to setup virt-SVA for assigned SVA capable device, needs to
> configure host translation structures for specific pasid. (e.g. bind
> guest page table to host and enable nested translation in host).
> Besides, vIOMMU emulator needs to forward guest's cache invalidation
> to host since host nested translation is enabled. e.g. on VT-d, guest
> owns 1st level translation table, thus cache invalidation for 1st
> level should be propagated to host.
> 
> This patch adds two functions: alloc_pasid and free_pasid to support
> guest pasid allocation and free. The implementations of the callbacks
> would be device passthru modules. Like vfio.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Yi Sun <yi.y.sun@linux.intel.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
>  hw/pci/pci.c         | 34 ++++++++++++++++++++++++++++++++++
>  include/hw/pci/pci.h | 11 +++++++++++
>  2 files changed, 45 insertions(+)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index e1ed6677e1..67e03b8db1 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2695,6 +2695,40 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque)
>      bus->iommu_opaque = opaque;
>  }
> 
> +void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops)
> +{
> +    assert(ops && !dev->pasid_ops);
> +    dev->pasid_ops = ops;
> +}
> +
> +bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn)
> +{
> +    PCIDevice *dev;
> +
> +    if (!bus) {
> +        return false;
> +    }
> +
> +    dev = bus->devices[devfn];
> +    return !!(dev && dev->pasid_ops);
> +}
> +
> +int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn,
> +                               IOMMUConfig *config)
> +{
> +    PCIDevice *dev;
> +
> +    if (!bus) {
> +        return -EINVAL;
> +    }
> +
> +    dev = bus->devices[devfn];
> +    if (dev && dev->pasid_ops && dev->pasid_ops->set_pasid_table) {
> +        return dev->pasid_ops->set_pasid_table(bus, devfn, config);
> +    }
> +    return -ENOENT;
> +}
> +
>  static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
>  {
>      Range *range = opaque;
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cfedf5a995..2146cb7519 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -8,6 +8,7 @@
>  #include "hw/isa/isa.h"
> 
>  #include "hw/pci/pcie.h"
> +#include "hw/iommu/iommu.h"
> 
>  extern bool pci_available;
> 
> @@ -264,6 +265,11 @@ struct PCIReqIDCache {
>  };
>  typedef struct PCIReqIDCache PCIReqIDCache;
> 
> +struct PCIPASIDOps {
> +    int (*set_pasid_table)(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
> +};
> +typedef struct PCIPASIDOps PCIPASIDOps;
> +
>  struct PCIDevice {
>      DeviceState qdev;
>      bool partially_hotplugged;
> @@ -357,6 +363,7 @@ struct PCIDevice {
> 
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
> +    PCIPASIDOps *pasid_ops;
>  };
> 
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> @@ -490,6 +497,10 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>  void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
> 
> +void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops);
> +bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn);
> +int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
> +
>  static inline void
>  pci_set_byte(uint8_t *config, uint8_t val)
>  {
> --
> 2.20.1



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API
  2020-03-20 16:58 ` [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API Eric Auger
@ 2020-03-26 13:13   ` Liu, Yi L
  0 siblings, 0 replies; 40+ messages in thread
From: Liu, Yi L @ 2020-03-26 13:13 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Hi Eric,

I'm also considering how to inject IOMMU faults into the vIOMMU. As per our
previous discussion (a long time ago), the MemoryRegion approach doesn't work
well for the VT-d case. So I'd like to hear your opinion on the proposal
below:
I have a patch that makes vIOMMUs register PCIIOMMUOps with the PCI layer.
Its current usage is to get the address space and to set/unset the
HostIOMMUContext (added by me). I think it may also be nice to add the fault
injection callback to PCIIOMMUOps. Thoughts?

https://patchwork.ozlabs.org/patch/1259645/
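
To make the proposal concrete, here is a rough sketch of what an optional
fault-injection callback in a per-vIOMMU ops table could look like. It is not
taken from the linked patch: ToyIOMMUOps, ToyBus, toy_bus_inject_faults() and
struct toy_iommu_fault are hypothetical stand-ins. The only part borrowed from
the quoted patch is the convention of returning -ENOENT when the callback is
not provided.

```c
#include <errno.h>
#include <stddef.h>

/* Opaque stand-in for the kernel's struct iommu_fault. */
struct toy_iommu_fault {
    unsigned int reason;
};

/* Hypothetical ops table registered by a vIOMMU; inject_faults is optional. */
typedef struct ToyIOMMUOps {
    int (*inject_faults)(void *opaque, int count,
                         struct toy_iommu_fault *buf);
} ToyIOMMUOps;

/* Hypothetical bus state holding the registered ops and their context. */
typedef struct ToyBus {
    const ToyIOMMUOps *iommu_ops;
    void *iommu_opaque;
} ToyBus;

/* Dispatch to the vIOMMU if it registered a callback, else return -ENOENT. */
int toy_bus_inject_faults(ToyBus *bus, int count, struct toy_iommu_fault *buf)
{
    if (!bus || !bus->iommu_ops || !bus->iommu_ops->inject_faults) {
        return -ENOENT;
    }
    return bus->iommu_ops->inject_faults(bus->iommu_opaque, count, buf);
}

/* Example callback: adds the number of received faults to an int counter. */
int toy_count_faults(void *opaque, int count, struct toy_iommu_fault *buf)
{
    int *total = opaque;

    (void)buf; /* a real vIOMMU would decode and re-inject each record */
    *total += count;
    return 0;
}
```

The point of the design is that the caller (e.g. the VFIO fault handler) only
needs the bus and never has to cast a memory region to an IOMMU-specific type.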

Regards,
Yi Liu

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Saturday, March 21, 2020 12:58 AM
> To: eric.auger.pro@gmail.com; eric.auger@redhat.com; qemu-devel@nongnu.org;
> Subject: [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API
> 
> This new API allows to inject @count iommu_faults into
> the IOMMU memory region.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>  include/exec/memory.h | 25 +++++++++++++++++++++++++
>  memory.c              | 10 ++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f2c773163f..141a5dc197 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -57,6 +57,8 @@ struct MemoryRegionMmio {
>      CPUWriteMemoryFunc *write[3];
>  };
> 
> +struct iommu_fault;
> +
>  typedef struct IOMMUTLBEntry IOMMUTLBEntry;
> 
>  /* See address_space_translate: bit 0 is read, bit 1 is write.  */
> @@ -357,6 +359,19 @@ typedef struct IOMMUMemoryRegionClass {
>       * @iommu: the IOMMUMemoryRegion
>       */
>      int (*num_indexes)(IOMMUMemoryRegion *iommu);
> +
> +    /*
> +     * Inject @count faults into the IOMMU memory region
> +     *
> +     * Optional method: if this method is not provided, then
> +     * memory_region_injection_faults() will return -ENOENT
> +     *
> +     * @iommu: the IOMMU memory region to inject the faults in
> +     * @count: number of faults to inject
> +     * @buf: fault buffer
> +     */
> +    int (*inject_faults)(IOMMUMemoryRegion *iommu, int count,
> +                         struct iommu_fault *buf);
>  } IOMMUMemoryRegionClass;
> 
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -1365,6 +1380,16 @@ int memory_region_iommu_attrs_to_index(IOMMUMemoryRegion *iommu_mr,
>   */
>  int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr);
> 
> +/**
> + * memory_region_inject_faults : inject @count faults stored in @buf
> + *
> + * @iommu_mr: the IOMMU memory region
> + * @count: number of faults to be injected
> + * @buf: buffer containing the faults
> + */
> +int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
> +                                struct iommu_fault *buf);
> +
>  /**
>   * memory_region_name: get a memory region's name
>   *
> diff --git a/memory.c b/memory.c
> index 09be40edd2..9cdd77e0de 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2001,6 +2001,16 @@ int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr)
>      return imrc->num_indexes(iommu_mr);
>  }
> 
> +int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
> +                                struct iommu_fault *buf)
> +{
> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
> +    if (!imrc->inject_faults) {
> +        return -ENOENT;
> +    }
> +    return imrc->inject_faults(iommu_mr, count, buf);
> +}
> +
>  void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
>  {
>      uint8_t mask = 1 << client;
> --
> 2.20.1



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC v6 01/24] update-linux-headers: Import iommu.h
  2020-03-26 12:58   ` Liu, Yi L
@ 2020-03-26 17:51     ` Auger Eric
  0 siblings, 0 replies; 40+ messages in thread
From: Auger Eric @ 2020-03-26 17:51 UTC (permalink / raw)
  To: Liu, Yi L, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Hi Yi,

On 3/26/20 1:58 PM, Liu, Yi L wrote:
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Saturday, March 21, 2020 12:58 AM
>> To: eric.auger.pro@gmail.com; eric.auger@redhat.com; qemu-devel@nongnu.org;
>> Subject: [RFC v6 01/24] update-linux-headers: Import iommu.h
>>
>> Update the script to import the new iommu.h uapi header.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> ---
>>  scripts/update-linux-headers.sh | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
>> index 29c27f4681..5b64ee3912 100755
>> --- a/scripts/update-linux-headers.sh
>> +++ b/scripts/update-linux-headers.sh
>> @@ -141,7 +141,7 @@ done
>>
>>  rm -rf "$output/linux-headers/linux"
>>  mkdir -p "$output/linux-headers/linux"
>> -for header in kvm.h vfio.h vfio_ccw.h vhost.h \
>> +for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
>>                psci.h psp-sev.h userfaultfd.h mman.h; do
>>      cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
>>  done
> 
> Hi Eric,
> 
> This patch already got acked by from Cornelia. :-)
> 
> https://patchwork.ozlabs.org/patch/1259643/
thanks for the heads-up! Every little step ... ;-)

Thanks

Eric
> 
> Regards,
> Yi Liu
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [RFC v6 09/24] vfio: Force nested if iommu requires it
  2020-03-20 16:58 ` [RFC v6 09/24] vfio: Force nested if iommu requires it Eric Auger
@ 2020-03-31  6:34   ` Liu, Yi L
  2020-03-31  8:04     ` Auger Eric
  0 siblings, 1 reply; 40+ messages in thread
From: Liu, Yi L @ 2020-03-31  6:34 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx,
	shameerali.kolothum.thodi, zhangfei.gao, bbhushan2, will

Hi Eric,

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Saturday, March 21, 2020 12:58 AM
> To: eric.auger.pro@gmail.com; eric.auger@redhat.com; qemu-devel@nongnu.org;
> Subject: [RFC v6 09/24] vfio: Force nested if iommu requires it
> 
> In case we detect the address space is translated by
> a virtual IOMMU which requires HW nested paging to
> integrate with VFIO, let's set up the container with
> the VFIO_TYPE1_NESTING_IOMMU iommu_type.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> v4 -> v5:
> - fail immediatly if nested is wanted but not supported
> 
> v2 -> v3:
> - add "nested only is selected if requested by @force_nested"
>   comment in this patch
> ---
>  hw/vfio/common.c | 36 ++++++++++++++++++++++++++++--------
>  1 file changed, 28 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 0b3593b3c0..ac417b5dbd 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1155,27 +1155,38 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
>   * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>   */
>  static int vfio_get_iommu_type(VFIOContainer *container,
> +                               bool want_nested,
>                                 Error **errp)
>  {
> -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> +    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
> +                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>                            VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
> -    int i;
> +    int i, ret = -EINVAL;
> 
>      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
>          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> -            return iommu_types[i];
> +            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU && !want_nested) {
> +                continue;
> +            }
> +            ret = iommu_types[i];
> +            break;
>          }
>      }
> -    error_setg(errp, "No available IOMMU models");
> -    return -EINVAL;
> +    if (ret < 0) {
> +        error_setg(errp, "No available IOMMU models");
> +    } else if (want_nested && ret != VFIO_TYPE1_NESTING_IOMMU) {
> +        error_setg(errp, "Nested mode requested but not supported");
> +        ret = -EINVAL;
> +    }
> +    return ret;
>  }
> 
>  static int vfio_init_container(VFIOContainer *container, int group_fd,
> -                               Error **errp)
> +                               bool want_nested, Error **errp)
>  {
>      int iommu_type, ret;
> 
> -    iommu_type = vfio_get_iommu_type(container, errp);
> +    iommu_type = vfio_get_iommu_type(container, want_nested, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
>      }
> @@ -1211,6 +1222,14 @@ static int vfio_connect_container(VFIOGroup *group,
> AddressSpace *as,
>      VFIOContainer *container;
>      int ret, fd;
>      VFIOAddressSpace *space;
> +    IOMMUMemoryRegion *iommu_mr;
> +    bool nested = false;
> +
> +    if (as != &address_space_memory && memory_region_is_iommu(as->root)) {

I tried this on my side. It doesn't work for virtual VT-d: intel_iommu has
a dynamic switch mechanism, so memory_region_is_iommu(as->root) won't
return true as expected. So firstly, I'm wondering whether
as != &address_space_memory is enough. Secondly, I'm considering whether it
would be good to let the vfio_get_group() caller provide a hint about whether
a vIOMMU is exposed, e.g. vfio_realize() in vfio/pci.c could easily figure
out whether a vIOMMU is set. Thoughts?
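
For reference, the effect of want_nested on the choice made in the quoted hunk can be modeled outside QEMU. This is only a sketch under assumptions: the probe callback stands in for the VFIO_CHECK_EXTENSION ioctl, the SKETCH_* names and values are made up for illustration, and the SPAPR types are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in type IDs for this sketch only; real code uses <linux/vfio.h>. */
enum sketch_iommu_type {
    SKETCH_TYPE1 = 1,
    SKETCH_TYPE1V2 = 3,
    SKETCH_TYPE1_NESTING = 6,
};

/* Hypothetical probe replacing ioctl(fd, VFIO_CHECK_EXTENSION, type). */
typedef bool (*sketch_probe_fn)(int type);

/*
 * Pick the first supported type in preference order, skipping the nesting
 * type unless it was requested. Returns -1 when nothing is supported, or
 * when nesting was requested but only a non-nesting type is available.
 */
int sketch_pick_iommu_type(sketch_probe_fn probe, bool want_nested)
{
    static const int order[] = {
        SKETCH_TYPE1_NESTING, SKETCH_TYPE1V2, SKETCH_TYPE1,
    };
    int ret = -1;

    assert(probe != NULL);
    for (size_t i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
        if (!probe(order[i])) {
            continue;
        }
        if (order[i] == SKETCH_TYPE1_NESTING && !want_nested) {
            continue;
        }
        ret = order[i];
        break;
    }
    if (ret >= 0 && want_nested && ret != SKETCH_TYPE1_NESTING) {
        ret = -1; /* nested requested but not supported */
    }
    return ret;
}

/* Two example hosts: one supporting every type, one without nesting. */
bool sketch_host_all(int type) { (void)type; return true; }
bool sketch_host_no_nesting(int type) { return type != SKETCH_TYPE1_NESTING; }
```

With these two example hosts, a nested request succeeds only on the first one, while a non-nested request falls through to the v2 type on both.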

Regards,
Yi Liu



* Re: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
  2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
                   ` (24 preceding siblings ...)
  2020-03-25 11:35 ` [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Shameerali Kolothum Thodi
@ 2020-03-31  6:42 ` Zhangfei Gao
  2020-03-31  8:12   ` Auger Eric
  25 siblings, 1 reply; 40+ messages in thread
From: Zhangfei Gao @ 2020-03-31  6:42 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, peterx, shameerali.kolothum.thodi,
	zhangfei.gao, bbhushan2, will

Hi, Eric

On 2020/3/21 上午12:58, Eric Auger wrote:
> Up to now vSMMUv3 has not been integrated with VFIO. VFIO
> integration requires to program the physical IOMMU consistently
> with the guest mappings. However, as opposed to VTD, SMMUv3 has
> no "Caching Mode" which allows easy trapping of guest mappings.
> This means the vSMMUV3 cannot use the same VFIO integration as VTD.
>
> However SMMUv3 has 2 translation stages. This was devised with
> virtualization use case in mind where stage 1 is "owned" by the
> guest whereas the host uses stage 2 for VM isolation.
>
> This series sets up this nested translation stage. It only works
> if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
> other words, it does not work if there is a physical SMMUv2).
>
> - We force the host to use stage 2 instead of stage 1, when we
>    detect a vSMMUV3 is behind a VFIO device. For a VFIO device
>    without any virtual IOMMU, we still use stage 1 as many existing
>    SMMUs expect this behavior.
> - We use PCIPASIDOps to propage guest stage1 config changes on
>    STE (Stream Table Entry) changes.
> - We implement a specific UNMAP notifier that conveys guest
>    IOTLB invalidations to the host
> - We register MSI IOVA/GPA bindings to the host so that this latter
>    can build a nested stage translation
> - As the legacy MAP notifier is not called anymore, we must make
>    sure stage 2 mappings are set. This is achieved through another
>    prereg memory listener.
> - Physical SMMU stage 1 related faults are reported to the guest
>    via en eventfd mechanism and exposed trhough a dedicated VFIO-PCI
>    region. Then they are reinjected into the guest.
>
> Best Regards
>
> Eric
>
> This series can be found at:
> https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
>
> Kernel Dependencies:
> [1] [PATCH v10 00/11] SMMUv3 Nested Stage Setup (VFIO part)
> [2] [PATCH v10 00/13] SMMUv3 Nested Stage Setup (IOMMU part)
> branch at: https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
I really appreciate that you restarted this work.

I tested your branch; here is an update.

Guest: https://github.com/Linaro/linux-kernel-warpdrive/tree/sva-devel
Host: https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
qemu: https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6

The guest I am using contains Jean's sva patches.
Since there are currently many patch conflicts, I use two different trees.

Result
No-sva mode works.
In this mode, the guest directly gets the physical address via ioctl.

vSVA does not work yet; there is still much work to do.
I tried SVA mode and it fails, so I chose no-sva instead.
iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA)

I am debugging how to enable this.

Thanks




* Re: [RFC v6 09/24] vfio: Force nested if iommu requires it
  2020-03-31  6:34   ` Liu, Yi L
@ 2020-03-31  8:04     ` Auger Eric
  2020-03-31  8:34       ` Liu, Yi L
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2020-03-31  8:04 UTC (permalink / raw)
  To: Liu, Yi L, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao,
	shameerali.kolothum.thodi, peterx, zhangfei.gao, bbhushan2, will

Yi,

On 3/31/20 8:34 AM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Saturday, March 21, 2020 12:58 AM
>> To: eric.auger.pro@gmail.com; eric.auger@redhat.com; qemu-devel@nongnu.org;
>> Subject: [RFC v6 09/24] vfio: Force nested if iommu requires it
>>
>> In case we detect the address space is translated by
>> a virtual IOMMU which requires HW nested paging to
>> integrate with VFIO, let's set up the container with
>> the VFIO_TYPE1_NESTING_IOMMU iommu_type.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>>
>> v4 -> v5:
>> - fail immediatly if nested is wanted but not supported
>>
>> v2 -> v3:
>> - add "nested only is selected if requested by @force_nested"
>>   comment in this patch
>> ---
>>  hw/vfio/common.c | 36 ++++++++++++++++++++++++++++--------
>>  1 file changed, 28 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 0b3593b3c0..ac417b5dbd 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -1155,27 +1155,38 @@ static void vfio_put_address_space(VFIOAddressSpace
>> *space)
>>   * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>>   */
>>  static int vfio_get_iommu_type(VFIOContainer *container,
>> +                               bool want_nested,
>>                                 Error **errp)
>>  {
>> -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>> +    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
>> +                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>>                            VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
>> -    int i;
>> +    int i, ret = -EINVAL;
>>
>>      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
>>          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
>> -            return iommu_types[i];
>> +            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU && !want_nested) {
>> +                continue;
>> +            }
>> +            ret = iommu_types[i];
>> +            break;
>>          }
>>      }
>> -    error_setg(errp, "No available IOMMU models");
>> -    return -EINVAL;
>> +    if (ret < 0) {
>> +        error_setg(errp, "No available IOMMU models");
>> +    } else if (want_nested && ret != VFIO_TYPE1_NESTING_IOMMU) {
>> +        error_setg(errp, "Nested mode requested but not supported");
>> +        ret = -EINVAL;
>> +    }
>> +    return ret;
>>  }
>>
>>  static int vfio_init_container(VFIOContainer *container, int group_fd,
>> -                               Error **errp)
>> +                               bool want_nested, Error **errp)
>>  {
>>      int iommu_type, ret;
>>
>> -    iommu_type = vfio_get_iommu_type(container, errp);
>> +    iommu_type = vfio_get_iommu_type(container, want_nested, errp);
>>      if (iommu_type < 0) {
>>          return iommu_type;
>>      }
>> @@ -1211,6 +1222,14 @@ static int vfio_connect_container(VFIOGroup *group,
>> AddressSpace *as,
>>      VFIOContainer *container;
>>      int ret, fd;
>>      VFIOAddressSpace *space;
>> +    IOMMUMemoryRegion *iommu_mr;
>> +    bool nested = false;
>> +
>> +    if (as != &address_space_memory && memory_region_is_iommu(as->root)) {
> 
> I tried on my side. For virtual VT-d, it doesn't work as in intel_iommu,
> we have a dynamic switch mechanism. Thus that, the
> memory_region_is_iommu(as->root) won't return true as expected. I'm afraid
> it doesn't work for virtual VT-d.  So firstly, I'm wondering if
> as != &address_space_memory is enough.

(as != &address_space_memory) should be sufficient to tell that a vIOMMU
is being used. But then, for example, you don't want to set nested
paging for the virtio-iommu, because virtio-iommu/VFIO uses notify-on-map
(an implementation similar to CM). That's why I devised an attribute to
retrieve whether the vIOMMU needs nested paging.
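
To illustrate, the decision could be modeled roughly as follows. Everything here is a stand-in invented for the sketch (the struct names, the needs_nested flag, the example address spaces); it is not the attribute API used in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not QEMU types. */
struct sketch_region {
    bool is_iommu;      /* root region is an IOMMU region */
    bool needs_nested;  /* attribute reported by the vIOMMU model */
};

struct sketch_as {
    const struct sketch_region *root;
};

/* Plays the role of address_space_memory. */
static const struct sketch_region sketch_sysmem_root = { false, false };
const struct sketch_as sketch_system_memory = { &sketch_sysmem_root };

/*
 * Nested mode is wanted only when the device sits behind a vIOMMU
 * and that vIOMMU says it relies on HW nested paging.
 */
bool sketch_want_nested(const struct sketch_as *as)
{
    assert(as != NULL && as->root != NULL);
    if (as == &sketch_system_memory) {
        return false;   /* no vIOMMU at all */
    }
    if (!as->root->is_iommu) {
        return false;
    }
    return as->root->needs_nested;
}

/* Example address spaces: a vSMMUv3-like one and a virtio-iommu-like one. */
static const struct sketch_region sketch_smmu_root = { true, true };
static const struct sketch_region sketch_virtio_root = { true, false };
const struct sketch_as sketch_vsmmu_as = { &sketch_smmu_root };
const struct sketch_as sketch_virtio_iommu_as = { &sketch_virtio_root };
```

In this model both example vIOMMUs differ from system memory, but only the vSMMUv3-like one ends up requesting nested mode.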

> Secondly, I'm considering if it is
> good to let vfio_get_group() caller to provide a hint whether vIOMMU is
> exposed. e.g. vfio_realize() in vfio/pci.c could figure out whether vIOMMU
> is set easily. Thoughts?
Sorry I don't get your point here. Why is it easier to figure out
whether vIOMMU is set in vfio_realize()?

pci_device_iommu_address_space(pdev) !=  &address_space_memory
does determine whether a vIOMMU is in place, no?

Thanks

Eric
> 
> Regards,
> Yi Liu
> 




* Re: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
  2020-03-31  6:42 ` Zhangfei Gao
@ 2020-03-31  8:12   ` Auger Eric
  2020-03-31  8:24     ` Zhangfei Gao
  0 siblings, 1 reply; 40+ messages in thread
From: Auger Eric @ 2020-03-31  8:12 UTC (permalink / raw)
  To: Zhangfei Gao, eric.auger.pro, qemu-devel, qemu-arm,
	peter.maydell, mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, shameerali.kolothum.thodi, peterx,
	bbhushan2, will

Hi Zhangfei,

On 3/31/20 8:42 AM, Zhangfei Gao wrote:
> Hi, Eric
> 
> On 2020/3/21 上午12:58, Eric Auger wrote:
>> Up to now vSMMUv3 has not been integrated with VFIO. VFIO
>> integration requires to program the physical IOMMU consistently
>> with the guest mappings. However, as opposed to VTD, SMMUv3 has
>> no "Caching Mode" which allows easy trapping of guest mappings.
>> This means the vSMMUV3 cannot use the same VFIO integration as VTD.
>>
>> However SMMUv3 has 2 translation stages. This was devised with
>> virtualization use case in mind where stage 1 is "owned" by the
>> guest whereas the host uses stage 2 for VM isolation.
>>
>> This series sets up this nested translation stage. It only works
>> if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
>> other words, it does not work if there is a physical SMMUv2).
>>
>> - We force the host to use stage 2 instead of stage 1, when we
>>    detect a vSMMUV3 is behind a VFIO device. For a VFIO device
>>    without any virtual IOMMU, we still use stage 1 as many existing
>>    SMMUs expect this behavior.
>> - We use PCIPASIDOps to propage guest stage1 config changes on
>>    STE (Stream Table Entry) changes.
>> - We implement a specific UNMAP notifier that conveys guest
>>    IOTLB invalidations to the host
>> - We register MSI IOVA/GPA bindings to the host so that this latter
>>    can build a nested stage translation
>> - As the legacy MAP notifier is not called anymore, we must make
>>    sure stage 2 mappings are set. This is achieved through another
>>    prereg memory listener.
>> - Physical SMMU stage 1 related faults are reported to the guest
>>    via en eventfd mechanism and exposed trhough a dedicated VFIO-PCI
>>    region. Then they are reinjected into the guest.
>>
>> Best Regards
>>
>> Eric
>>
>> This series can be found at:
>> https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
>>
>> Kernel Dependencies:
>> [1] [PATCH v10 00/11] SMMUv3 Nested Stage Setup (VFIO part)
>> [2] [PATCH v10 00/13] SMMUv3 Nested Stage Setup (IOMMU part)
>> branch at:
>> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
> Really appreciated that you re-start this work.
> 
> I tested your branch and some update.
> 
> Guest: https://github.com/Linaro/linux-kernel-warpdrive/tree/sva-devel
> <https://github.com/Linaro/linux-kernel-warpdrive/tree/sva-devel>
> Host:
> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
> <https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10>
> qemu: https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
> <https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6>
> 
> The guest I am using is contains Jean's sva patches.
> Since currently they are many patch conflict, so use two different tree.
> 
> Result
> No-sva mode works.
> This mode, guest directly get physical address via ioctl.
OK thanks for testing
> 
> While vSVA can not work, there are still much work to do.
> I am trying to SVA mode, and it fails, so choose no-sva instead.
> iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA)
Indeed I assume there are plenty of things missing to make vSVM work on
ARM (iommu, vfio, QEMU). I am currently reviewing Jacob and Yi's kernel
and QEMU series on Intel side. After that, I will come back to you to
help. Also vSMMUv3 does not support multiple contexts at the moment. I
will add this soon.


Still the problem I have is testing. Any suggestion welcome.

Thanks

Eric
> 
> I am in debugging how to enable this.
> 
> Thanks
> 
> 




* Re: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
  2020-03-31  8:12   ` Auger Eric
@ 2020-03-31  8:24     ` Zhangfei Gao
  2020-04-02 16:46       ` Auger Eric
  0 siblings, 1 reply; 40+ messages in thread
From: Zhangfei Gao @ 2020-03-31  8:24 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, shameerali.kolothum.thodi, peterx,
	bbhushan2, will

Hi, Eric

On 2020/3/31 下午4:12, Auger Eric wrote:
> Hi Zhangfei,
>
> On 3/31/20 8:42 AM, Zhangfei Gao wrote:
>> Hi, Eric
>>
>> On 2020/3/21 上午12:58, Eric Auger wrote:
>>> Up to now vSMMUv3 has not been integrated with VFIO. VFIO
>>> integration requires to program the physical IOMMU consistently
>>> with the guest mappings. However, as opposed to VTD, SMMUv3 has
>>> no "Caching Mode" which allows easy trapping of guest mappings.
>>> This means the vSMMUV3 cannot use the same VFIO integration as VTD.
>>>
>>> However SMMUv3 has 2 translation stages. This was devised with
>>> virtualization use case in mind where stage 1 is "owned" by the
>>> guest whereas the host uses stage 2 for VM isolation.
>>>
>>> This series sets up this nested translation stage. It only works
>>> if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
>>> other words, it does not work if there is a physical SMMUv2).
>>>
>>> - We force the host to use stage 2 instead of stage 1, when we
>>>     detect a vSMMUV3 is behind a VFIO device. For a VFIO device
>>>     without any virtual IOMMU, we still use stage 1 as many existing
>>>     SMMUs expect this behavior.
>>> - We use PCIPASIDOps to propage guest stage1 config changes on
>>>     STE (Stream Table Entry) changes.
>>> - We implement a specific UNMAP notifier that conveys guest
>>>     IOTLB invalidations to the host
>>> - We register MSI IOVA/GPA bindings to the host so that this latter
>>>     can build a nested stage translation
>>> - As the legacy MAP notifier is not called anymore, we must make
>>>     sure stage 2 mappings are set. This is achieved through another
>>>     prereg memory listener.
>>> - Physical SMMU stage 1 related faults are reported to the guest
>>>     via en eventfd mechanism and exposed trhough a dedicated VFIO-PCI
>>>     region. Then they are reinjected into the guest.
>>>
>>> Best Regards
>>>
>>> Eric
>>>
>>> This series can be found at:
>>> https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
>>>
>>> Kernel Dependencies:
>>> [1] [PATCH v10 00/11] SMMUv3 Nested Stage Setup (VFIO part)
>>> [2] [PATCH v10 00/13] SMMUv3 Nested Stage Setup (IOMMU part)
>>> branch at:
>>> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
>> Really appreciated that you re-start this work.
>>
>> I tested your branch and some update.
>>
>> Guest: https://github.com/Linaro/linux-kernel-warpdrive/tree/sva-devel
>> <https://github.com/Linaro/linux-kernel-warpdrive/tree/sva-devel>
>> Host:
>> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
>> <https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10>
>> qemu: https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
>> <https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6>
>>
>> The guest I am using is contains Jean's sva patches.
>> Since currently they are many patch conflict, so use two different tree.
>>
>> Result
>> No-sva mode works.
>> This mode, guest directly get physical address via ioctl.
> OK thanks for testing
>> While vSVA can not work, there are still much work to do.
>> I am trying to SVA mode, and it fails, so choose no-sva instead.
>> iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA)
> Indeed I assume there are plenty of things missing to make vSVM work on
> ARM (iommu, vfio, QEMU). I am currently reviewing Jacob and Yi's kernel
> and QEMU series on Intel side. After that, I will come back to you to
> help. Also vSMMUv3 does not support multiple contexts at the moment. I
> will add this soon.
>
>
> Still the problem I have is testing. Any suggestion welcome.
>
>
Just to make sure: do you mean you need an environment for testing?

How about Hisilicon kunpeng920? It is an arm64 platform that supports SVA
in the host now.
There is such a platform in linaro mlab that I think we can share.
Currently I am testing with uacce.
By testing a user driver (hisi zip accelerator) in the guest, we can test
vSVA and PASID easily.

Thanks




* RE: [RFC v6 09/24] vfio: Force nested if iommu requires it
  2020-03-31  8:04     ` Auger Eric
@ 2020-03-31  8:34       ` Liu, Yi L
  0 siblings, 0 replies; 40+ messages in thread
From: Liu, Yi L @ 2020-03-31  8:34 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao,
	shameerali.kolothum.thodi, peterx, zhangfei.gao, bbhushan2, will

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Tuesday, March 31, 2020 4:05 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; eric.auger.pro@gmail.com; qemu-
> Subject: Re: [RFC v6 09/24] vfio: Force nested if iommu requires it
> 
> Yi,
> 
> On 3/31/20 8:34 AM, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Eric Auger <eric.auger@redhat.com>
> >> Sent: Saturday, March 21, 2020 12:58 AM
> >> To: eric.auger.pro@gmail.com; eric.auger@redhat.com; qemu-devel@nongnu.org;
> >> Subject: [RFC v6 09/24] vfio: Force nested if iommu requires it
> >>
> >> In case we detect the address space is translated by
> >> a virtual IOMMU which requires HW nested paging to
> >> integrate with VFIO, let's set up the container with
> >> the VFIO_TYPE1_NESTING_IOMMU iommu_type.
> >>
> >> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> >>
> >> ---
> >>
> >> v4 -> v5:
> >> - fail immediatly if nested is wanted but not supported
> >>
> >> v2 -> v3:
> >> - add "nested only is selected if requested by @force_nested"
> >>   comment in this patch
> >> ---
> >>  hw/vfio/common.c | 36 ++++++++++++++++++++++++++++--------
> >>  1 file changed, 28 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 0b3593b3c0..ac417b5dbd 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -1155,27 +1155,38 @@ static void vfio_put_address_space(VFIOAddressSpace
> >> *space)
> >>   * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
> >>   */
> >>  static int vfio_get_iommu_type(VFIOContainer *container,
> >> +                               bool want_nested,
> >>                                 Error **errp)
> >>  {
> >> -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> >> +    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
> >> +                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> >>                            VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
> >> -    int i;
> >> +    int i, ret = -EINVAL;
> >>
> >>      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
> >>          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> >> -            return iommu_types[i];
> >> +            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU && !want_nested) {
> >> +                continue;
> >> +            }
> >> +            ret = iommu_types[i];
> >> +            break;
> >>          }
> >>      }
> >> -    error_setg(errp, "No available IOMMU models");
> >> -    return -EINVAL;
> >> +    if (ret < 0) {
> >> +        error_setg(errp, "No available IOMMU models");
> >> +    } else if (want_nested && ret != VFIO_TYPE1_NESTING_IOMMU) {
> >> +        error_setg(errp, "Nested mode requested but not supported");
> >> +        ret = -EINVAL;
> >> +    }
> >> +    return ret;
> >>  }
> >>
> >>  static int vfio_init_container(VFIOContainer *container, int group_fd,
> >> -                               Error **errp)
> >> +                               bool want_nested, Error **errp)
> >>  {
> >>      int iommu_type, ret;
> >>
> >> -    iommu_type = vfio_get_iommu_type(container, errp);
> >> +    iommu_type = vfio_get_iommu_type(container, want_nested, errp);
> >>      if (iommu_type < 0) {
> >>          return iommu_type;
> >>      }
> >> @@ -1211,6 +1222,14 @@ static int vfio_connect_container(VFIOGroup *group,
> >> AddressSpace *as,
> >>      VFIOContainer *container;
> >>      int ret, fd;
> >>      VFIOAddressSpace *space;
> >> +    IOMMUMemoryRegion *iommu_mr;
> >> +    bool nested = false;
> >> +
> >> +    if (as != &address_space_memory && memory_region_is_iommu(as->root)) {
> >
> > I tried on my side. For virtual VT-d, it doesn't work as in intel_iommu,
> > we have a dynamic switch mechanism. Thus that, the
> > memory_region_is_iommu(as->root) won't return true as expected. I'm afraid
> > it doesn't work for virtual VT-d.  So firstly, I'm wondering if
> > as != &address_space_memory is enough.
> 
> (as != &address_space_memory) should be sufficient to tell that a vIOMMU
> is being used. But then, for example, you don't want to set nested
> paging for the virtio-iommu because virtio-iommu/VFIO uses notify-on-my
> (CM similar implementation). That's why I devised an attribute to
> retrieve the vIOMMU need for nested.
> 
> > Secondly, I'm considering if it is
> > good to let vfio_get_group() caller to provide a hint whether vIOMMU is
> > exposed. e.g. vfio_realize() in vfio/pci.c could figure out whether vIOMMU
> > is set easily. Thoughts?
> Sorry I don't get your point here. Why is it easier to figure out
> whether vIOMMU is set in vfio_realize()?
> 
> pci_device_iommu_address_space(pdev) !=  &address_space_memory
> does determine whether a vIOMMU is in place, no?
> 
No, it's not just pci_device_iommu_address_space(pdev) != &address_space_memory.
I agree with your comment above: that check is not enough to tell whether
nesting is needed. I'd like to add an API such as
pci_device_iommu_nesting_required() so that it can be determined, together
with a query callback in the PCIIOMMUOps introduced in the patch below.
Would that work?

[v2,05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps
https://patchwork.kernel.org/patch/11464577/
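
The proposal could look roughly like the following sketch. The ops structure, its field name and the example buses are hypothetical and do not reflect the PCIIOMMUOps definition in the patch above; only the helper name pci_device_iommu_nesting_required() comes from this mail, and it carries a sketch_ prefix here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops table with an optional query callback. */
struct sketch_iommu_ops {
    bool (*nesting_required)(void *opaque, int devfn);
};

/* Hypothetical bus state; iommu_ops is NULL when no vIOMMU is set up. */
struct sketch_pci_bus {
    const struct sketch_iommu_ops *iommu_ops;
    void *iommu_opaque;
};

/* Ask the vIOMMU, if any, whether this device needs nested paging. */
bool sketch_pci_device_iommu_nesting_required(const struct sketch_pci_bus *bus,
                                              int devfn)
{
    assert(bus != NULL);
    if (!bus->iommu_ops || !bus->iommu_ops->nesting_required) {
        return false;   /* no vIOMMU, or a vIOMMU without the query */
    }
    return bus->iommu_ops->nesting_required(bus->iommu_opaque, devfn);
}

/* Example vIOMMU model that always asks for nesting. */
static bool sketch_vsmmu_nesting_required(void *opaque, int devfn)
{
    (void)opaque;
    (void)devfn;
    return true;
}

const struct sketch_iommu_ops sketch_vsmmu_ops = { sketch_vsmmu_nesting_required };
const struct sketch_iommu_ops sketch_no_query_ops = { NULL };

const struct sketch_pci_bus sketch_bus_vsmmu = { &sketch_vsmmu_ops, NULL };
const struct sketch_pci_bus sketch_bus_no_query = { &sketch_no_query_ops, NULL };
const struct sketch_pci_bus sketch_bus_no_viommu = { NULL, NULL };
```

Making the callback optional keeps vIOMMU models that never need nesting unchanged: they simply leave the field unset and the helper answers false.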

Regards,
Yi Liu




* Re: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
  2020-03-31  8:24     ` Zhangfei Gao
@ 2020-04-02 16:46       ` Auger Eric
  0 siblings, 0 replies; 40+ messages in thread
From: Auger Eric @ 2020-04-02 16:46 UTC (permalink / raw)
  To: Zhangfei Gao, eric.auger.pro, qemu-devel, qemu-arm,
	peter.maydell, mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, shameerali.kolothum.thodi, peterx,
	bbhushan2, will

Hi Zhangfei,

On 3/31/20 10:24 AM, Zhangfei Gao wrote:
> Hi, Eric
> 
> On 2020/3/31 下午4:12, Auger Eric wrote:
>> Hi Zhangfei,
>>
>> On 3/31/20 8:42 AM, Zhangfei Gao wrote:
>>> Hi, Eric
>>>
>>> On 2020/3/21 上午12:58, Eric Auger wrote:
>>>> Up to now vSMMUv3 has not been integrated with VFIO. VFIO
>>>> integration requires to program the physical IOMMU consistently
>>>> with the guest mappings. However, as opposed to VTD, SMMUv3 has
>>>> no "Caching Mode" which allows easy trapping of guest mappings.
>>>> This means the vSMMUV3 cannot use the same VFIO integration as VTD.
>>>>
>>>> However SMMUv3 has 2 translation stages. This was devised with
>>>> virtualization use case in mind where stage 1 is "owned" by the
>>>> guest whereas the host uses stage 2 for VM isolation.
>>>>
>>>> This series sets up this nested translation stage. It only works
>>>> if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
>>>> other words, it does not work if there is a physical SMMUv2).
>>>>
>>>> - We force the host to use stage 2 instead of stage 1, when we
>>>>     detect a vSMMUV3 is behind a VFIO device. For a VFIO device
>>>>     without any virtual IOMMU, we still use stage 1 as many existing
>>>>     SMMUs expect this behavior.
>>>> - We use PCIPASIDOps to propage guest stage1 config changes on
>>>>     STE (Stream Table Entry) changes.
>>>> - We implement a specific UNMAP notifier that conveys guest
>>>>     IOTLB invalidations to the host
>>>> - We register MSI IOVA/GPA bindings to the host so that this latter
>>>>     can build a nested stage translation
>>>> - As the legacy MAP notifier is not called anymore, we must make
>>>>     sure stage 2 mappings are set. This is achieved through another
>>>>     prereg memory listener.
>>>> - Physical SMMU stage 1 related faults are reported to the guest
>>>>     via en eventfd mechanism and exposed trhough a dedicated VFIO-PCI
>>>>     region. Then they are reinjected into the guest.
>>>>
>>>> Best Regards
>>>>
>>>> Eric
>>>>
>>>> This series can be found at:
>>>> https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
>>>>
>>>> Kernel Dependencies:
>>>> [1] [PATCH v10 00/11] SMMUv3 Nested Stage Setup (VFIO part)
>>>> [2] [PATCH v10 00/13] SMMUv3 Nested Stage Setup (IOMMU part)
>>>> branch at:
>>>> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
>>> Really appreciated that you re-start this work.
>>>
>>> I tested your branch and some update.
>>>
>>> Guest: https://github.com/Linaro/linux-kernel-warpdrive/tree/sva-devel
>>> <https://github.com/Linaro/linux-kernel-warpdrive/tree/sva-devel>
>>> Host:
>>> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
>>> <https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10>
>>> qemu: https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
>>> <https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6>
>>>
>>> The guest I am using is contains Jean's sva patches.
>>> Since currently they are many patch conflict, so use two different tree.
>>>
>>> Result
>>> No-sva mode works.
>>> This mode, guest directly get physical address via ioctl.
>> OK thanks for testing
>>> While vSVA can not work, there are still much work to do.
>>> I am trying to SVA mode, and it fails, so choose no-sva instead.
>>> iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA)
>> Indeed I assume there are plenty of things missing to make vSVM work on
>> ARM (iommu, vfio, QEMU). I am currently reviewing Jacob and Yi's kernel
>> and QEMU series on Intel side. After that, I will come back to you to
>> help. Also vSMMUv3 does not support multiple contexts at the moment. I
>> will add this soon.
>>
>>
>> Still the problem I have is testing. Any suggestion welcome.
>>
>>
> To make sure
> Do you mean you need a environment for testing?
> 
> How about Hisilicon kunpeng920, arm64 platform supporting SVA in host now.
> There is such a platform in linaro mlab that I think we can share.
> Currently I am testing with uacce,
> By testing a user driver (hisi zip accelerator) in guest, we can test
> vSVA and PASID easily.
Sorry for the delay. I am currently investigating if this could be
possible. Thank you for the suggestion!

Best Regards

Eric
> 
> Thanks
> 




* Re: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
  2020-03-25 11:35 ` [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Shameerali Kolothum Thodi
  2020-03-25 12:42   ` Auger Eric
@ 2020-04-03 10:45   ` Auger Eric
  2020-04-03 12:10     ` Shameerali Kolothum Thodi
  1 sibling, 1 reply; 40+ messages in thread
From: Auger Eric @ 2020-04-03 10:45 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, eric.auger.pro, qemu-devel, qemu-arm,
	peter.maydell, mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx, zhangfei.gao,
	bbhushan2, will

Hi Shameer,

On 3/25/20 12:35 PM, Shameerali Kolothum Thodi wrote:
> Hi Eric,
> 
>> -----Original Message-----
>> From: Eric Auger [mailto:eric.auger@redhat.com]
>> Sent: 20 March 2020 16:58
>> To: eric.auger.pro@gmail.com; eric.auger@redhat.com;
>> qemu-devel@nongnu.org; qemu-arm@nongnu.org; peter.maydell@linaro.org;
>> mst@redhat.com; alex.williamson@redhat.com;
>> jacob.jun.pan@linux.intel.com; yi.l.liu@intel.com
>> Cc: peterx@redhat.com; jean-philippe@linaro.org; will@kernel.org;
>> tnowicki@marvell.com; Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>; zhangfei.gao@foxmail.com;
>> zhangfei.gao@linaro.org; maz@kernel.org; bbhushan2@marvell.com
>> Subject: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
>>
>> Up to now vSMMUv3 has not been integrated with VFIO. VFIO
>> integration requires to program the physical IOMMU consistently
>> with the guest mappings. However, as opposed to VTD, SMMUv3 has
>> no "Caching Mode" which allows easy trapping of guest mappings.
>> This means the vSMMUV3 cannot use the same VFIO integration as VTD.
>>
>> However SMMUv3 has 2 translation stages. This was devised with
>> virtualization use case in mind where stage 1 is "owned" by the
>> guest whereas the host uses stage 2 for VM isolation.
>>
>> This series sets up this nested translation stage. It only works
>> if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
>> other words, it does not work if there is a physical SMMUv2).
> 
> I was testing this series on one of our hardware board with SMMUv3. I did
> observe an issue while trying to bring up Guest with and without the vsmmuV3.

I am currently investigating, and so far I have failed to reproduce this on my end.
> 
> Steps are like below,
> 
> 1. start a guest with "iommu=smmuv3" and a n/w vf device.
> 
> 2.Exit the VM.
How do you exit the VM?
> 
> 3. start the guest again without "iommu=smmuv3"
> 
> This time qemu crashes with,
> 
> [ 0.447830] hns3 0000:00:01.0: enabling device (0000 -> 0002)
> /home/shameer/qemu-eric/qemu/hw/vfio/pci.c:2851:vfio_dma_fault_notifier_handler:
> Object 0xaaaaeeb47c00 is not an instance of type
So I think I understand the qemu crash. At the moment vfio_pci
registers a fault handler even if we are not in nested mode. The smmuv3
host driver calls any registered fault handler when it encounters an
error in !nested mode. So the eventfd is triggered to userspace, but qemu
does not expect that. However, the root cause is that we got some physical
faults on the second run.
> qemu:iommu-memory-region
> ./qemu_run-vsmmu-hns: line 9: 13609 Aborted                 (core
> dumped) ./qemu-system-aarch64-vsmmuv3v10 -machine
> virt,kernel_irqchip=on,gic-version=3 -cpu host -smp cpus=1 -kernel
> Image-ericv10-uacce -initrd rootfs-iperf.cpio -bios
Just to double check with you,
host: will-arm-smmu-updates-2stage-v10
qemu: v4.2.0-2stage-rfcv6
guest version?
> QEMU_EFI_Dec2018.fd -device vfio-pci,host=0000:7d:02.1 -net none -m
Do you assign exactly the same VF as during the 1st run?
> 4096 -nographic -D -d -enable-kvm -append "console=ttyAMA0
> root=/dev/vda -m 4096 rw earlycon=pl011,0x9000000"
> 
> And you can see that host kernel receives smmuv3 C_BAD_STE event,
> 
> [10499.379288] vfio-pci 0000:7d:02.1: enabling device (0000 -> 0002)
> [10501.943881] arm-smmu-v3 arm-smmu-v3.2.auto: event 0x04 received:
> [10501.943884] arm-smmu-v3 arm-smmu-v3.2.auto: 0x00007d1100000004
> [10501.943886] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000100800000080
> [10501.943887] arm-smmu-v3 arm-smmu-v3.2.auto: 0x00000000fe040000
> [10501.943889] arm-smmu-v3 arm-smmu-v3.2.auto: 0x000000007e04c440
I will try to prepare a kernel branch with additional traces.
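
As a reading aid for the event dump quoted above: in an SMMUv3 event
record, bits [7:0] of the first 64-bit word hold the event number and
bits [63:32] hold the StreamID. A minimal sketch of decoding that word
follows. The names `toy_evt` and `toy_decode_evt` are made up for this
example, and treating the low 16 bits of the StreamID as a PCI requester
ID (bus/device/function) is an assumption about how this platform maps
StreamIDs, not something the log states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_EVT_ID_C_BAD_STE 0x04u

struct toy_evt {
    uint8_t  id;   /* event number, bits [7:0] of word 0 */
    uint32_t sid;  /* StreamID, bits [63:32] of word 0 */
    uint8_t  bus;  /* ASSUMPTION: sid[15:8] is the PCI bus number */
    uint8_t  dev;  /* ASSUMPTION: sid[7:3] is the PCI device number */
    uint8_t  fn;   /* ASSUMPTION: sid[2:0] is the PCI function number */
};

/* Decode the first word of a 4 x 64-bit event record as printed by the host. */
static struct toy_evt toy_decode_evt(const uint64_t rec[4])
{
    struct toy_evt e;

    assert(rec != NULL);
    e.id  = (uint8_t)(rec[0] & 0xff);
    e.sid = (uint32_t)(rec[0] >> 32);
    e.bus = (uint8_t)((e.sid >> 8) & 0xff);
    e.dev = (uint8_t)((e.sid >> 3) & 0x1f);
    e.fn  = (uint8_t)(e.sid & 0x7);
    return e;
}
```

Applied to the first word of the dump, 0x00007d1100000004, this gives
event 0x04 (C_BAD_STE) and StreamID 0x7d11, which under the requester-ID
assumption is 7d:02.1, i.e. the assigned VF.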

Thanks

Eric
> 
> So I suspect we didn't clear nested stage configuration and that affects the
> translation in the second run. I tried to issue (force) a vfio_detach_pasid_table() but
> that didn't solve the problem.
> 
> Maybe I am missing something. Could you please take a look and let me know.
> 
> Thanks,
> Shameer
> 
>> - We force the host to use stage 2 instead of stage 1, when we
>>   detect a vSMMUV3 is behind a VFIO device. For a VFIO device
>>   without any virtual IOMMU, we still use stage 1 as many existing
>>   SMMUs expect this behavior.
>> - We use PCIPASIDOps to propagate guest stage 1 config changes on
>>   STE (Stream Table Entry) changes.
>> - We implement a specific UNMAP notifier that conveys guest
>>   IOTLB invalidations to the host.
>> - We register MSI IOVA/GPA bindings to the host so that the latter
>>   can build a nested stage translation.
>> - As the legacy MAP notifier is not called anymore, we must make
>>   sure stage 2 mappings are set. This is achieved through another
>>   prereg memory listener.
>> - Physical SMMU stage 1 related faults are reported to the guest
>>   via an eventfd mechanism and exposed through a dedicated VFIO-PCI
>>   region. Then they are reinjected into the guest.
>>
>> Best Regards
>>
>> Eric
>>
>> This series can be found at:
>> https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
>>
>> Kernel Dependencies:
>> [1] [PATCH v10 00/11] SMMUv3 Nested Stage Setup (VFIO part)
>> [2] [PATCH v10 00/13] SMMUv3 Nested Stage Setup (IOMMU part)
>> branch at:
>> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
>>
>> History:
>>
>> v5 -> v6:
>> - just rebase work
>>
>> v4 -> v5:
>> - Use PCIPASIDOps for config update notifications
>> - removal of notification for MSI binding which is not needed
>>   anymore
>> - Use a single fault region
>> - use the specific interrupt index
>>
>> v3 -> v4:
>> - adapt to changes in uapi (asid cache invalidation)
>> - check VFIO_PCI_DMA_FAULT_IRQ_INDEX is supported at kernel level
>>   before attempting to set signaling for it.
>> - sync on 5.2-rc1 kernel headers + Drew's patch that imports sve_context.h
>> - fix MSI binding for MSI (not MSIX)
>> - fix mingw compilation
>>
>> v2 -> v3:
>> - rework fault handling
>> - MSI binding registration done in vfio-pci. MSI binding tear down called
>>   on container cleanup path
>> - leaf parameter propagated
>>
>> v1 -> v2:
>> - Fixed dual assignment (asid now correctly propagated on TLB invalidations)
>> - Integrated fault reporting
>>
>>
>> Eric Auger (23):
>>   update-linux-headers: Import iommu.h
>>   header update against 5.6.0-rc3 and IOMMU/VFIO nested stage APIs
>>   memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute
>>   memory: Add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
>>   memory: Introduce IOMMU Memory Region inject_faults API
>>   memory: Add arch_id and leaf fields in IOTLBEntry
>>   iommu: Introduce generic header
>>   vfio: Force nested if iommu requires it
>>   vfio: Introduce hostwin_from_range helper
>>   vfio: Introduce helpers to DMA map/unmap a RAM section
>>   vfio: Set up nested stage mappings
>>   vfio: Pass stage 1 MSI bindings to the host
>>   vfio: Helper to get IRQ info including capabilities
>>   vfio/pci: Register handler for iommu fault
>>   vfio/pci: Set up the DMA FAULT region
>>   vfio/pci: Implement the DMA fault handler
>>   hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute
>>   hw/arm/smmuv3: Store the PASID table GPA in the translation config
>>   hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
>>   hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
>>   hw/arm/smmuv3: Pass stage 1 configurations to the host
>>   hw/arm/smmuv3: Implement fault injection
>>   hw/arm/smmuv3: Allow MAP notifiers
>>
>> Liu Yi L (1):
>>   pci: introduce PCIPASIDOps to PCIDevice
>>
>>  hw/arm/smmuv3.c                 | 189 ++++++++++--
>>  hw/arm/trace-events             |   3 +-
>>  hw/pci/pci.c                    |  34 +++
>>  hw/vfio/common.c                | 506 +++++++++++++++++++++++++-------
>>  hw/vfio/pci.c                   | 267 ++++++++++++++++-
>>  hw/vfio/pci.h                   |   9 +
>>  hw/vfio/trace-events            |   9 +-
>>  include/exec/memory.h           |  49 +++-
>>  include/hw/arm/smmu-common.h    |   1 +
>>  include/hw/iommu/iommu.h        |  28 ++
>>  include/hw/pci/pci.h            |  11 +
>>  include/hw/vfio/vfio-common.h   |  16 +
>>  linux-headers/COPYING           |   2 +
>>  linux-headers/asm-x86/kvm.h     |   1 +
>>  linux-headers/linux/iommu.h     | 375 +++++++++++++++++++++++
>>  linux-headers/linux/vfio.h      | 109 ++++++-
>>  memory.c                        |  10 +
>>  scripts/update-linux-headers.sh |   2 +-
>>  18 files changed, 1478 insertions(+), 143 deletions(-)
>>  create mode 100644 include/hw/iommu/iommu.h
>>  create mode 100644 linux-headers/linux/iommu.h
>>
>> --
>> 2.20.1
> 




* RE: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
  2020-04-03 10:45   ` Auger Eric
@ 2020-04-03 12:10     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 40+ messages in thread
From: Shameerali Kolothum Thodi @ 2020-04-03 12:10 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, qemu-devel, qemu-arm, peter.maydell,
	mst, alex.williamson, jacob.jun.pan, yi.l.liu
  Cc: jean-philippe, tnowicki, maz, zhangfei.gao, peterx, zhangfei.gao,
	bbhushan2, will

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: 03 April 2020 11:45
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> eric.auger.pro@gmail.com; qemu-devel@nongnu.org; qemu-arm@nongnu.org;
> peter.maydell@linaro.org; mst@redhat.com; alex.williamson@redhat.com;
> jacob.jun.pan@linux.intel.com; yi.l.liu@intel.com
> Cc: peterx@redhat.com; jean-philippe@linaro.org; will@kernel.org;
> tnowicki@marvell.com; zhangfei.gao@foxmail.com; zhangfei.gao@linaro.org;
> maz@kernel.org; bbhushan2@marvell.com
> Subject: Re: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
> 
> Hi Shameer,
> 
> On 3/25/20 12:35 PM, Shameerali Kolothum Thodi wrote:
> > Hi Eric,
> >
> >> -----Original Message-----
> >> From: Eric Auger [mailto:eric.auger@redhat.com]
> >> Sent: 20 March 2020 16:58
> >> To: eric.auger.pro@gmail.com; eric.auger@redhat.com;
> >> qemu-devel@nongnu.org; qemu-arm@nongnu.org; peter.maydell@linaro.org;
> >> mst@redhat.com; alex.williamson@redhat.com;
> >> jacob.jun.pan@linux.intel.com; yi.l.liu@intel.com
> >> Cc: peterx@redhat.com; jean-philippe@linaro.org; will@kernel.org;
> >> tnowicki@marvell.com; Shameerali Kolothum Thodi
> >> <shameerali.kolothum.thodi@huawei.com>; zhangfei.gao@foxmail.com;
> >> zhangfei.gao@linaro.org; maz@kernel.org; bbhushan2@marvell.com
> >> Subject: [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration
> >>
> >> Up to now vSMMUv3 has not been integrated with VFIO. VFIO
> >> integration requires programming the physical IOMMU consistently
> >> with the guest mappings. However, as opposed to VTD, SMMUv3 has
> >> no "Caching Mode" which allows easy trapping of guest mappings.
> >> This means the vSMMUv3 cannot use the same VFIO integration as VTD.
> >>
> >> However SMMUv3 has 2 translation stages. This was devised with
> >> virtualization use case in mind where stage 1 is "owned" by the
> >> guest whereas the host uses stage 2 for VM isolation.
> >>
> >> This series sets up this nested translation stage. It only works
> >> if there is one physical SMMUv3 used along with QEMU vSMMUv3 (in
> >> other words, it does not work if there is a physical SMMUv2).
> >
> > I was testing this series on one of our hardware boards with SMMUv3. I did
> > observe an issue while trying to bring up a guest with and without the vSMMUv3.
> 
> I am currently investigating; so far I have failed to reproduce this on my end.
> >
> > Steps are as follows,
> >
> > 1. start a guest with "iommu=smmuv3" and a n/w vf device.
> >
> > 2. Exit the VM.
> How do you exit the VM?

QMP system_powerdown

> >
> > 3. start the guest again without "iommu=smmuv3"
> >
> > This time qemu crashes with,
> >
> > [ 0.447830] hns3 0000:00:01.0: enabling device (0000 -> 0002)
> > /home/shameer/qemu-eric/qemu/hw/vfio/pci.c:2851:vfio_dma_fault_notifier_handler:
> > Object 0xaaaaeeb47c00 is not an instance of type
> > qemu:iommu-memory-region
> So I think I understand the qemu crash. At the moment vfio_pci
> registers a fault handler even if we are not in nested mode. The smmuv3
> host driver calls any registered fault handler when it encounters an
> error in !nested mode. So the eventfd is signalled to userspace, but qemu
> does not expect that. However, the root cause is that we got some physical
> faults on the second run.

True. And qemu works fine if I run again with the iommu=smmuv3 option.
That's why I suspect the mapping for the device in the physical SMMU
is not cleared, so the vfio-pci device enable path hits an error?

> > ./qemu_run-vsmmu-hns: line 9: 13609 Aborted                 (core
> > dumped) ./qemu-system-aarch64-vsmmuv3v10 -machine
> > virt,kernel_irqchip=on,gic-version=3 -cpu host -smp cpus=1 -kernel
> > Image-ericv10-uacce -initrd rootfs-iperf.cpio -bios
> Just to double check with you,
> host: will-arm-smmu-updates-2stage-v10
> qemu: v4.2.0-2stage-rfcv6
> guest version?

Yes. And guest = host image.

> > QEMU_EFI_Dec2018.fd -device vfio-pci,host=0000:7d:02.1 -net none -m
> Do you assign exactly the same VF as during the 1st run?

Yes, the same. The only change is the omission of "iommu=smmuv3".

> > 4096 -nographic -D -d -enable-kvm -append "console=ttyAMA0
> > root=/dev/vda -m 4096 rw earlycon=pl011,0x9000000"
> >
> > And you can see that host kernel receives smmuv3 C_BAD_STE event,
> >
> > [10499.379288] vfio-pci 0000:7d:02.1: enabling device (0000 -> 0002)
> > [10501.943881] arm-smmu-v3 arm-smmu-v3.2.auto: event 0x04 received:
> > [10501.943884] arm-smmu-v3 arm-smmu-v3.2.auto: 0x00007d1100000004
> > [10501.943886] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000100800000080
> > [10501.943887] arm-smmu-v3 arm-smmu-v3.2.auto: 0x00000000fe040000
> > [10501.943889] arm-smmu-v3 arm-smmu-v3.2.auto: 0x000000007e04c440
> I will try to prepare a kernel branch with additional traces.

OK. You can find the qemu traces (vfio*/smmu*) for the runs with and without
iommu=smmuv3 at the link below (they may not be that useful).

https://github.com/hisilicon/qemu/tree/v4.2.0-2stage-rfcv6-eric/traces

Thanks,
Shameer

> Thanks
> 
> Eric
> >
> > So I suspect we didn't clear nested stage configuration and that affects the
> > translation in the second run. I tried to issue (force) a vfio_detach_pasid_table() but
> > that didn't solve the problem.
> >
> > Maybe I am missing something. Could you please take a look and let me know.
> >
> > Thanks,
> > Shameer
> >
> >> - We force the host to use stage 2 instead of stage 1, when we
> >>   detect a vSMMUV3 is behind a VFIO device. For a VFIO device
> >>   without any virtual IOMMU, we still use stage 1 as many existing
> >>   SMMUs expect this behavior.
> >> - We use PCIPASIDOps to propagate guest stage 1 config changes on
> >>   STE (Stream Table Entry) changes.
> >> - We implement a specific UNMAP notifier that conveys guest
> >>   IOTLB invalidations to the host.
> >> - We register MSI IOVA/GPA bindings to the host so that the latter
> >>   can build a nested stage translation.
> >> - As the legacy MAP notifier is not called anymore, we must make
> >>   sure stage 2 mappings are set. This is achieved through another
> >>   prereg memory listener.
> >> - Physical SMMU stage 1 related faults are reported to the guest
> >>   via an eventfd mechanism and exposed through a dedicated VFIO-PCI
> >>   region. Then they are reinjected into the guest.
> >>
> >> Best Regards
> >>
> >> Eric
> >>
> >> This series can be found at:
> >> https://github.com/eauger/qemu/tree/v4.2.0-2stage-rfcv6
> >>
> >> Kernel Dependencies:
> >> [1] [PATCH v10 00/11] SMMUv3 Nested Stage Setup (VFIO part)
> >> [2] [PATCH v10 00/13] SMMUv3 Nested Stage Setup (IOMMU part)
> >> branch at:
> >> https://github.com/eauger/linux/tree/will-arm-smmu-updates-2stage-v10
> >>
> >> History:
> >>
> >> v5 -> v6:
> >> - just rebase work
> >>
> >> v4 -> v5:
> >> - Use PCIPASIDOps for config update notifications
> >> - removal of notification for MSI binding which is not needed
> >>   anymore
> >> - Use a single fault region
> >> - use the specific interrupt index
> >>
> >> v3 -> v4:
> >> - adapt to changes in uapi (asid cache invalidation)
> >> - check VFIO_PCI_DMA_FAULT_IRQ_INDEX is supported at kernel level
> >>   before attempting to set signaling for it.
> >> - sync on 5.2-rc1 kernel headers + Drew's patch that imports sve_context.h
> >> - fix MSI binding for MSI (not MSIX)
> >> - fix mingw compilation
> >>
> >> v2 -> v3:
> >> - rework fault handling
> >> - MSI binding registration done in vfio-pci. MSI binding tear down called
> >>   on container cleanup path
> >> - leaf parameter propagated
> >>
> >> v1 -> v2:
> >> - Fixed dual assignment (asid now correctly propagated on TLB invalidations)
> >> - Integrated fault reporting
> >>
> >>
> >> Eric Auger (23):
> >>   update-linux-headers: Import iommu.h
> >>   header update against 5.6.0-rc3 and IOMMU/VFIO nested stage APIs
> >>   memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute
> >>   memory: Add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
> >>   memory: Introduce IOMMU Memory Region inject_faults API
> >>   memory: Add arch_id and leaf fields in IOTLBEntry
> >>   iommu: Introduce generic header
> >>   vfio: Force nested if iommu requires it
> >>   vfio: Introduce hostwin_from_range helper
> >>   vfio: Introduce helpers to DMA map/unmap a RAM section
> >>   vfio: Set up nested stage mappings
> >>   vfio: Pass stage 1 MSI bindings to the host
> >>   vfio: Helper to get IRQ info including capabilities
> >>   vfio/pci: Register handler for iommu fault
> >>   vfio/pci: Set up the DMA FAULT region
> >>   vfio/pci: Implement the DMA fault handler
> >>   hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute
> >>   hw/arm/smmuv3: Store the PASID table GPA in the translation config
> >>   hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
> >>   hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
> >>   hw/arm/smmuv3: Pass stage 1 configurations to the host
> >>   hw/arm/smmuv3: Implement fault injection
> >>   hw/arm/smmuv3: Allow MAP notifiers
> >>
> >> Liu Yi L (1):
> >>   pci: introduce PCIPASIDOps to PCIDevice
> >>
> >>  hw/arm/smmuv3.c                 | 189 ++++++++++--
> >>  hw/arm/trace-events             |   3 +-
> >>  hw/pci/pci.c                    |  34 +++
> >>  hw/vfio/common.c                | 506 +++++++++++++++++++++++++-------
> >>  hw/vfio/pci.c                   | 267 ++++++++++++++++-
> >>  hw/vfio/pci.h                   |   9 +
> >>  hw/vfio/trace-events            |   9 +-
> >>  include/exec/memory.h           |  49 +++-
> >>  include/hw/arm/smmu-common.h    |   1 +
> >>  include/hw/iommu/iommu.h        |  28 ++
> >>  include/hw/pci/pci.h            |  11 +
> >>  include/hw/vfio/vfio-common.h   |  16 +
> >>  linux-headers/COPYING           |   2 +
> >>  linux-headers/asm-x86/kvm.h     |   1 +
> >>  linux-headers/linux/iommu.h     | 375 +++++++++++++++++++++++
> >>  linux-headers/linux/vfio.h      | 109 ++++++-
> >>  memory.c                        |  10 +
> >>  scripts/update-linux-headers.sh |   2 +-
> >>  18 files changed, 1478 insertions(+), 143 deletions(-)
> >>  create mode 100644 include/hw/iommu/iommu.h
> >>  create mode 100644 linux-headers/linux/iommu.h
> >>
> >> --
> >> 2.20.1
> >




Thread overview: 40+ messages
2020-03-20 16:58 [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Eric Auger
2020-03-20 16:58 ` [RFC v6 01/24] update-linux-headers: Import iommu.h Eric Auger
2020-03-26 12:58   ` Liu, Yi L
2020-03-26 17:51     ` Auger Eric
2020-03-20 16:58 ` [RFC v6 02/24] header update against 5.6.0-rc3 and IOMMU/VFIO nested stage APIs Eric Auger
2020-03-20 16:58 ` [RFC v6 03/24] memory: Add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute Eric Auger
2020-03-20 16:58 ` [RFC v6 04/24] memory: Add IOMMU_ATTR_MSI_TRANSLATE " Eric Auger
2020-03-20 16:58 ` [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API Eric Auger
2020-03-26 13:13   ` Liu, Yi L
2020-03-20 16:58 ` [RFC v6 06/24] memory: Add arch_id and leaf fields in IOTLBEntry Eric Auger
2020-03-20 16:58 ` [RFC v6 07/24] iommu: Introduce generic header Eric Auger
2020-03-20 16:58 ` [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice Eric Auger
2020-03-26 13:01   ` Liu, Yi L
2020-03-20 16:58 ` [RFC v6 09/24] vfio: Force nested if iommu requires it Eric Auger
2020-03-31  6:34   ` Liu, Yi L
2020-03-31  8:04     ` Auger Eric
2020-03-31  8:34       ` Liu, Yi L
2020-03-20 16:58 ` [RFC v6 10/24] vfio: Introduce hostwin_from_range helper Eric Auger
2020-03-20 16:58 ` [RFC v6 11/24] vfio: Introduce helpers to DMA map/unmap a RAM section Eric Auger
2020-03-20 16:58 ` [RFC v6 12/24] vfio: Set up nested stage mappings Eric Auger
2020-03-20 16:58 ` [RFC v6 13/24] vfio: Pass stage 1 MSI bindings to the host Eric Auger
2020-03-20 16:58 ` [RFC v6 14/24] vfio: Helper to get IRQ info including capabilities Eric Auger
2020-03-20 16:58 ` [RFC v6 15/24] vfio/pci: Register handler for iommu fault Eric Auger
2020-03-20 16:58 ` [RFC v6 16/24] vfio/pci: Set up the DMA FAULT region Eric Auger
2020-03-20 16:58 ` [RFC v6 17/24] vfio/pci: Implement the DMA fault handler Eric Auger
2020-03-20 16:58 ` [RFC v6 18/24] hw/arm/smmuv3: Advertise MSI_TRANSLATE attribute Eric Auger
2020-03-20 16:58 ` [RFC v6 19/24] hw/arm/smmuv3: Store the PASID table GPA in the translation config Eric Auger
2020-03-20 16:58 ` [RFC v6 20/24] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation Eric Auger
2020-03-20 16:58 ` [RFC v6 21/24] hw/arm/smmuv3: Fill the IOTLBEntry leaf field " Eric Auger
2020-03-20 16:58 ` [RFC v6 22/24] hw/arm/smmuv3: Pass stage 1 configurations to the host Eric Auger
2020-03-20 16:58 ` [RFC v6 23/24] hw/arm/smmuv3: Implement fault injection Eric Auger
2020-03-20 16:58 ` [RFC v6 24/24] hw/arm/smmuv3: Allow MAP notifiers Eric Auger
2020-03-25 11:35 ` [RFC v6 00/24] vSMMUv3/pSMMUv3 2 stage VFIO integration Shameerali Kolothum Thodi
2020-03-25 12:42   ` Auger Eric
2020-04-03 10:45   ` Auger Eric
2020-04-03 12:10     ` Shameerali Kolothum Thodi
2020-03-31  6:42 ` Zhangfei Gao
2020-03-31  8:12   ` Auger Eric
2020-03-31  8:24     ` Zhangfei Gao
2020-04-02 16:46       ` Auger Eric
