* [RFC v3 00/21] SMMUv3 Nested Stage Setup
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

This series allows a virtualizer to program the nested stage mode.
This is useful when both the host and the guest are exposed with
an SMMUv3 and a PCI device is assigned to the guest using VFIO.

In this mode, the physical IOMMU must be programmed to translate
the two stages: the one set up by the guest (IOVA -> GPA) and the
one set up by the host VFIO driver as part of the assignment process
(GPA -> HPA).

On Intel, this is traditionally achieved by combining the 2 stages
into a single physical stage. However this relies on the capability
to trap on each guest translation structure update, which is possible
with the VT-d Caching Mode. Unfortunately the ARM SMMUv3 does
not offer a similar mechanism.

However, the ARM SMMUv3 architecture supports 2 physical stages! Those
were devised exactly with that use case in mind. Assuming the HW
implements both stages (they are optional), the guest can then use
stage 1 while the host uses stage 2.

This assumes the virtualizer has a means to propagate guest settings
to the host SMMUv3 driver. This series brings that VFIO/IOMMU
infrastructure. The services are (a usage sketch follows below):
- bind the guest stage 1 configuration to the stream table entry
- propagate guest TLB invalidations
- bind MSI IOVAs
- propagate faults collected at physical level up to the virtualizer

This series largely reuses the user API and infrastructure originally
devised for SVA/SVM and patches submitted by Jacob, Yi Liu, Tianyu in
[1-2] and Jean-Philippe [3-4].

Best Regards

Eric

This series can be found at:
https://github.com/eauger/linux/tree/v5.0-rc1-2stage-rfc-v3

This was tested on Qualcomm HW featuring an SMMUv3 and with an adapted
QEMU vSMMUv3.

References:
[1] [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual
    Address (SVA)
    https://lwn.net/Articles/754331/
[2] [RFC PATCH 0/8] Shared Virtual Memory virtualization for VT-d
    (VFIO part)
    https://lists.linuxfoundation.org/pipermail/iommu/2017-April/021475.html
[3] [v2,00/40] Shared Virtual Addressing for the IOMMU
    https://patchwork.ozlabs.org/cover/912129/
[4] [PATCH v3 00/10] Shared Virtual Addressing for the IOMMU
    https://patchwork.kernel.org/cover/10608299/

History:

v2 -> v3:
- When registering the S1 MSI binding we now store the device handle. This
  addresses Robin's comment about discrimination of devices belonging to
  different S1 groups and using different physical MSI doorbells.
- Change the fault reporting API: use VFIO_PCI_DMA_FAULT_IRQ_INDEX to
  set the eventfd and expose the faults through an mmappable fault region

v1 -> v2:
- Added the fault reporting capability
- asid properly passed on invalidation (fix assignment of multiple
  devices)
- see individual change logs for more info

Eric Auger (12):
  iommu: Introduce bind_guest_msi
  vfio: VFIO_IOMMU_BIND_MSI
  iommu/smmuv3: Get prepared for nested stage support
  iommu/smmuv3: Implement set_pasid_table
  iommu/smmuv3: Implement cache_invalidate
  dma-iommu: Implement NESTED_MSI cookie
  iommu/smmuv3: Implement bind_guest_msi
  iommu/smmuv3: Report non recoverable faults
  vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type
  vfio-pci: Register an iommu fault handler
  vfio-pci: Add VFIO_PCI_DMA_FAULT_IRQ_INDEX
  vfio: Document nested stage control

Jacob Pan (4):
  iommu: Introduce set_pasid_table API
  iommu: introduce device fault data
  driver core: add per device iommu param
  iommu: introduce device fault report API

Jean-Philippe Brucker (2):
  iommu/arm-smmu-v3: Link domains and devices
  iommu/arm-smmu-v3: Maintain a SID->device structure

Liu, Yi L (3):
  iommu: Introduce cache_invalidate API
  vfio: VFIO_IOMMU_SET_PASID_TABLE
  vfio: VFIO_IOMMU_CACHE_INVALIDATE

 Documentation/vfio.txt              |  62 ++++
 drivers/iommu/arm-smmu-v3.c         | 460 ++++++++++++++++++++++++++--
 drivers/iommu/dma-iommu.c           | 112 ++++++-
 drivers/iommu/iommu.c               | 187 ++++++++++-
 drivers/vfio/pci/vfio_pci.c         | 147 ++++++++-
 drivers/vfio/pci/vfio_pci_intrs.c   |  19 ++
 drivers/vfio/pci/vfio_pci_private.h |   3 +
 drivers/vfio/vfio_iommu_type1.c     | 105 +++++++
 include/linux/device.h              |   3 +
 include/linux/dma-iommu.h           |  11 +
 include/linux/iommu.h               | 127 +++++++-
 include/uapi/linux/iommu.h          | 234 ++++++++++++++
 include/uapi/linux/vfio.h           |  38 +++
 13 files changed, 1476 insertions(+), 32 deletions(-)
 create mode 100644 include/uapi/linux/iommu.h

-- 
2.17.2



* [RFC v3 01/21] iommu: Introduce set_pasid_table API
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: Jacob Pan <jacob.jun.pan@linux.intel.com>

In a virtualization use case, when a guest is assigned
a PCI host device protected by a virtual IOMMU, the
physical IOMMU must be programmed to be consistent with
the guest mappings. If the physical IOMMU supports two
translation stages, it makes sense to program the guest
mappings onto the first stage/level (ARM/VT-d terminology)
while the host owns stage/level 2.

In that case, guest configuration settings must be trapped
and passed to the physical IOMMU driver.

This patch adds a new API to the iommu subsystem that allows
setting the PASID table information.

A generic iommu_pasid_table_config struct is introduced in
a new iommu.h uapi header. This is going to be used by the VFIO
user API. We foresee at least two specializations of this struct,
one for PASID table passing and one for ARM SMMUv3.
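
As an illustration, a possible caller (with made-up values) would fill
the generic part plus the SMMUv3 specialization as follows; this is a
sketch, not code from this series:

	struct iommu_pasid_table_config cfg = {
		.version    = PASID_TABLE_CFG_VERSION_1,
		.format     = IOMMU_PASID_FORMAT_SMMUV3,
		.base_ptr   = ctx_table_gpa,	/* guest CD table GPA */
		.pasid_bits = 0,		/* a single context descriptor */
		.bypass     = 0,
		.smmuv3 = {
			.abort = 0,
			.s1fmt = 0,	/* linear CD table */
			.s1dss = 0,
		},
	};

	ret = iommu_set_pasid_table(domain, &cfg);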

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

This patch generalizes the API introduced by Jacob & co-authors in
https://lwn.net/Articles/754331/

v2 -> v3:
- replace unbind/bind by set_pasid_table
- move table pointer and pasid bits in the generic part of the struct

v1 -> v2:
- restore the original pasid table name
- remove the struct device * parameter in the API
- reworked iommu_pasid_smmuv3
---
 drivers/iommu/iommu.c      | 10 ++++++++
 include/linux/iommu.h      | 14 +++++++++++
 include/uapi/linux/iommu.h | 50 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+)
 create mode 100644 include/uapi/linux/iommu.h

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3ed4db334341..0f2b7f1fc7c8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device);
 
+int iommu_set_pasid_table(struct iommu_domain *domain,
+			  struct iommu_pasid_table_config *cfg)
+{
+	if (unlikely(!domain->ops->set_pasid_table))
+		return -ENODEV;
+
+	return domain->ops->set_pasid_table(domain, cfg);
+}
+EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e90da6b6f3d1..1da2a2357ea4 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -25,6 +25,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/of.h>
+#include <uapi/linux/iommu.h>
 
 #define IOMMU_READ	(1 << 0)
 #define IOMMU_WRITE	(1 << 1)
@@ -184,6 +185,7 @@ struct iommu_resv_region {
  * @domain_window_disable: Disable a particular window for a domain
  * @of_xlate: add OF master IDs to iommu grouping
  * @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @set_pasid_table: set pasid table
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -226,6 +228,9 @@ struct iommu_ops {
 	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
 	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
 
+	int (*set_pasid_table)(struct iommu_domain *domain,
+			       struct iommu_pasid_table_config *cfg);
+
 	unsigned long pgsize_bitmap;
 };
 
@@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
 extern void iommu_detach_device(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_set_pasid_table(struct iommu_domain *domain,
+				 struct iommu_pasid_table_config *cfg);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
@@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
 	return NULL;
 }
 
+static inline
+int iommu_set_pasid_table(struct iommu_domain *domain,
+			  struct iommu_pasid_table_config *cfg)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #ifdef CONFIG_IOMMU_DEBUGFS
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
new file mode 100644
index 000000000000..7a7cf7a3de7c
--- /dev/null
+++ b/include/uapi/linux/iommu.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _UAPI_IOMMU_H
+#define _UAPI_IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * SMMUv3 Stream Table Entry stage 1 related information
+ * @abort: shall the STE lead to abort
+ * @s1fmt: STE s1fmt field as set by the guest
+ * @s1dss: STE s1dss as set by the guest
+ * All field names match the smmu 3.0/3.1 spec (ARM IHI 0070A)
+ */
+struct iommu_pasid_smmuv3 {
+	__u8 abort;
+	__u8 s1fmt;
+	__u8 s1dss;
+};
+
+/**
+ * PASID table data used to bind guest PASID table to the host IOMMU
+ * Note PASID table corresponds to the Context Table on ARM SMMUv3.
+ *
+ * @version: API version to prepare for future extensions
+ * @format: format of the PASID table
+ *
+ */
+struct iommu_pasid_table_config {
+#define PASID_TABLE_CFG_VERSION_1 1
+	__u32	version;
+#define IOMMU_PASID_FORMAT_SMMUV3	(1 << 0)
+	__u32	format;
+	__u64	base_ptr;
+	__u8	pasid_bits;
+	__u8	bypass;
+	union {
+		struct iommu_pasid_smmuv3 smmuv3;
+	};
+};
+
+#endif /* _UAPI_IOMMU_H */
-- 
2.17.2



* [RFC v3 02/21] iommu: Introduce cache_invalidate API
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: "Liu, Yi L" <yi.l.liu@linux.intel.com>

In any virtualization use case, when the first translation stage
is "owned" by the guest OS, the host IOMMU driver has no knowledge
of caching structure updates unless the guest invalidation activities
are trapped by the virtualizer and passed down to the host.

Since the invalidation data come from user space and will be
written into the physical IOMMU, we must allow security checks at
various layers. Therefore a generic invalidation data format is
proposed here; model-specific IOMMU drivers need to convert it into
their own format.
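
For instance, a model-specific driver could implement this conversion
with a static 2D table indexed by cache type and granularity (the
lookup described in the uapi comment below). This is a hypothetical
sketch; the MODEL_OP_* opcodes are made up:

	enum model_op {
		MODEL_OP_INVALID,
		MODEL_OP_DEV_ALL, MODEL_OP_DEV_PASID, MODEL_OP_DEV_PAGE,
		MODEL_OP_TLB_DOM, MODEL_OP_TLB_PASID, MODEL_OP_TLB_PAGE,
		MODEL_OP_PC_DOM,  MODEL_OP_PC_PASID,
	};

	static const enum model_op
	op_table[IOMMU_INV_NR_TYPE][IOMMU_INV_NR_GRANU] = {
		[IOMMU_INV_TYPE_DTLB]  = { MODEL_OP_DEV_ALL,
					   MODEL_OP_DEV_PASID,
					   MODEL_OP_DEV_PAGE },
		[IOMMU_INV_TYPE_TLB]   = { MODEL_OP_TLB_DOM,
					   MODEL_OP_TLB_PASID,
					   MODEL_OP_TLB_PAGE },
		[IOMMU_INV_TYPE_PASID] = { MODEL_OP_PC_DOM,
					   MODEL_OP_PC_PASID,
					   MODEL_OP_INVALID /* N/A */ },
	};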

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>

---
v1 -> v2:
- add arch_id field
- renamed tlb_invalidate into cache_invalidate as this API allows
  to invalidate context caches on top of IOTLBs

v1:
renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
header. Commit message reworded.
---
 drivers/iommu/iommu.c      | 14 ++++++
 include/linux/iommu.h      | 14 ++++++
 include/uapi/linux/iommu.h | 95 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 123 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0f2b7f1fc7c8..b2e248770508 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1403,6 +1403,20 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
 
+int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+			   struct iommu_cache_invalidate_info *inv_info)
+{
+	int ret = 0;
+
+	if (unlikely(!domain->ops->cache_invalidate))
+		return -ENODEV;
+
+	ret = domain->ops->cache_invalidate(domain, dev, inv_info);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 1da2a2357ea4..96d59886f230 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -186,6 +186,7 @@ struct iommu_resv_region {
  * @of_xlate: add OF master IDs to iommu grouping
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @set_pasid_table: set pasid table
+ * @cache_invalidate: invalidate translation caches
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -231,6 +232,9 @@ struct iommu_ops {
 	int (*set_pasid_table)(struct iommu_domain *domain,
 			       struct iommu_pasid_table_config *cfg);
 
+	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
+				struct iommu_cache_invalidate_info *inv_info);
+
 	unsigned long pgsize_bitmap;
 };
 
@@ -294,6 +298,9 @@ extern void iommu_detach_device(struct iommu_domain *domain,
 				struct device *dev);
 extern int iommu_set_pasid_table(struct iommu_domain *domain,
 				 struct iommu_pasid_table_config *cfg);
+extern int iommu_cache_invalidate(struct iommu_domain *domain,
+				struct device *dev,
+				struct iommu_cache_invalidate_info *inv_info);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
@@ -709,6 +716,13 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
 {
 	return -ENODEV;
 }
+static inline int
+iommu_cache_invalidate(struct iommu_domain *domain,
+		       struct device *dev,
+		       struct iommu_cache_invalidate_info *inv_info)
+{
+	return -ENODEV;
+}
 
 #endif /* CONFIG_IOMMU_API */
 
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 7a7cf7a3de7c..4605f5cfac84 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
 	};
 };
 
+/**
+ * enum iommu_inv_granularity - Generic invalidation granularity
+ * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:	TLB entries or PASID caches of all
+ *					PASIDs associated with a domain ID
+ * @IOMMU_INV_GRANU_PASID_SEL:		TLB entries or PASID cache associated
+ *					with a PASID and a domain
+ * @IOMMU_INV_GRANU_PAGE_PASID:		TLB entries of selected page range
+ *					within a PASID
+ *
+ * When an invalidation request is passed down to IOMMU to flush translation
+ * caches, it may carry different granularity levels, which can be specific
+ * to certain types of translation caches.
+ * This enum is a collection of granularities for all types of translation
+ * caches. The idea is to make it easy for IOMMU model specific driver to
+ * convert from generic to model specific value. Each IOMMU driver
+ * can enforce check based on its own conversion table. The conversion is
+ * based on 2D look-up with inputs as follows:
+ * - translation cache types
+ * - granularity
+ *
+ *             type |   DTLB    |    TLB    |   PASID   |
+ *  granule         |           |           |   cache   |
+ * -----------------+-----------+-----------+-----------+
+ *  DN_ALL_PASID    |   Y       |   Y       |   Y       |
+ *  PASID_SEL       |   Y       |   Y       |   Y       |
+ *  PAGE_PASID      |   Y       |   Y       |   N/A     |
+ *
+ */
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
+	IOMMU_INV_GRANU_PASID_SEL,
+	IOMMU_INV_GRANU_PAGE_PASID,
+	IOMMU_INV_NR_GRANU,
+};
+
+/**
+ * enum iommu_inv_type - Generic translation cache types for invalidation
+ *
+ * @IOMMU_INV_TYPE_DTLB:	device IOTLB
+ * @IOMMU_INV_TYPE_TLB:		IOMMU paging structure cache
+ * @IOMMU_INV_TYPE_PASID:	PASID cache
+ * Invalidation requests sent to IOMMU for a given device need to indicate
+ * which type of translation cache to be operated on. Combined with enum
+ * iommu_inv_granularity, model specific driver can do a simple lookup to
+ * convert from generic to model specific value.
+ */
+enum iommu_inv_type {
+	IOMMU_INV_TYPE_DTLB,
+	IOMMU_INV_TYPE_TLB,
+	IOMMU_INV_TYPE_PASID,
+	IOMMU_INV_NR_TYPE
+};
+
+/**
+ * Translation cache invalidation header that contains mandatory meta data.
+ * @version:	info format version, expecting future extensions
+ * @type:	type of translation cache to be invalidated
+ */
+struct iommu_cache_invalidate_hdr {
+	__u32 version;
+#define TLB_INV_HDR_VERSION_1 1
+	enum iommu_inv_type type;
+};
+
+/**
+ * Translation cache invalidation information, contains generic IOMMU
+ * data which can be parsed based on model ID by model specific drivers.
+ * Since the invalidation of second level page tables is included in the
+ * unmap operation, this info is only applicable to the first level
+ * translation caches, i.e. DMA requests with PASID.
+ *
+ * @granularity:	requested invalidation granularity, type dependent
+ * @size:		page order: granule is 2^size * 4KB (0 for 4KB, 9 for 2MB, etc.)
+ * @nr_pages:		number of pages to invalidate
+ * @pasid:		processor address space ID value per PCI spec.
+ * @arch_id:		architecture dependent id characterizing a context
+ *			and tagging the caches, i.e. domain identifier on
+ *			VT-d, ASID on ARM SMMU
+ * @addr:		page address to be invalidated
+ * @flags		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
+ *			IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
+ *
+ */
+struct iommu_cache_invalidate_info {
+	struct iommu_cache_invalidate_hdr	hdr;
+	enum iommu_inv_granularity	granularity;
+	__u32		flags;
+#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 0)
+#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 1)
+	__u8		size;
+	__u64		nr_pages;
+	__u32		pasid;
+	__u64		arch_id;
+	__u64		addr;
+};
 #endif /* _UAPI_IOMMU_H */
-- 
2.17.2



* [RFC v3 03/21] iommu: Introduce bind_guest_msi
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

On ARM, MSIs are translated by the SMMU. An IOVA is allocated
for each MSI doorbell. If both the host and the guest are exposed
with SMMUs, we end up with 2 different IOVAs, one allocated by each:
the guest allocates an IOVA (gIOVA) to map onto the guest MSI
doorbell (gDB), while the host allocates another IOVA (hIOVA) to map
onto the physical doorbell (hDB).

So we end up with 2 untied mappings:
         S1            S2
gIOVA    ->    gDB
              hIOVA    ->    hDB

Currently the PCI device is programmed by the host with hIOVA
as the MSI doorbell, so this does not work.

This patch introduces an API to pass gIOVA/gDB to the host so
that gIOVA can be reused by the host instead of allocating
a new IOVA. The goal is to create the following nested mapping:

         S1            S2
gIOVA    ->    gDB     ->    hDB

and to program the PCI device with the gIOVA MSI doorbell.
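
For illustration, the virtualizer-facing caller would end up invoking
the new API roughly like this (giova/gdb_gpa are values trapped from
the guest; the granule encoding is an assumption of this sketch):

	struct iommu_guest_msi_binding binding = {
		.iova    = giova,	/* gIOVA the guest mapped to gDB */
		.gpa     = gdb_gpa,	/* guest doorbell GPA */
		.granule = 12,		/* page size order, assumed */
	};

	ret = iommu_bind_guest_msi(domain, dev, &binding);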

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v2 -> v3:
- add a struct device handle
---
 drivers/iommu/iommu.c      | 10 ++++++++++
 include/linux/iommu.h      | 13 +++++++++++++
 include/uapi/linux/iommu.h |  6 ++++++
 3 files changed, 29 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b2e248770508..ea11442e7054 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1431,6 +1431,16 @@ static void __iommu_detach_device(struct iommu_domain *domain,
 	trace_detach_device_from_domain(dev);
 }
 
+int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
+			 struct iommu_guest_msi_binding *binding)
+{
+	if (unlikely(!domain->ops->bind_guest_msi))
+		return -ENODEV;
+
+	return domain->ops->bind_guest_msi(domain, dev, binding);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
+
 void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
 {
 	struct iommu_group *group;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 96d59886f230..244c1a3d5989 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -235,6 +235,9 @@ struct iommu_ops {
 	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
 				struct iommu_cache_invalidate_info *inv_info);
 
+	int (*bind_guest_msi)(struct iommu_domain *domain, struct device *dev,
+			      struct iommu_guest_msi_binding *binding);
+
 	unsigned long pgsize_bitmap;
 };
 
@@ -301,6 +304,9 @@ extern int iommu_set_pasid_table(struct iommu_domain *domain,
 extern int iommu_cache_invalidate(struct iommu_domain *domain,
 				struct device *dev,
 				struct iommu_cache_invalidate_info *inv_info);
+extern int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
+				struct iommu_guest_msi_binding *binding);
+
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
@@ -724,6 +730,13 @@ iommu_cache_invalidate(struct iommu_domain *domain,
 	return -ENODEV;
 }
 
+static inline
+int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
+			 struct iommu_guest_msi_binding *binding)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #ifdef CONFIG_IOMMU_DEBUGFS
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 4605f5cfac84..f28cd9a1aa96 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -142,4 +142,10 @@ struct iommu_cache_invalidate_info {
 	__u64		arch_id;
 	__u64		addr;
 };
+
+struct iommu_guest_msi_binding {
+	__u64		iova;
+	__u64		gpa;
+	__u32		granule;
+};
 #endif /* _UAPI_IOMMU_H */
-- 
2.17.2



* [RFC v3 04/21] vfio: VFIO_IOMMU_SET_PASID_TABLE
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: "Liu, Yi L" <yi.l.liu@linux.intel.com>

This patch adds the VFIO_IOMMU_SET_PASID_TABLE ioctl, which aims at
passing the guest's virtual IOMMU configuration through the VFIO
driver down to the IOMMU subsystem.
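
A minimal userspace invocation could look as follows (container_fd and
cfg, a filled struct iommu_pasid_table_config, are assumed; non-zero
flags are rejected by the kernel):

	struct vfio_iommu_type1_set_pasid_table ustruct = {
		.argsz  = sizeof(ustruct),
		.flags  = 0,
		.config = cfg,
	};

	if (ioctl(container_fd, VFIO_IOMMU_SET_PASID_TABLE, &ustruct))
		perror("VFIO_IOMMU_SET_PASID_TABLE");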

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>

---
v2 -> v3:
- s/BIND_PASID_TABLE/SET_PASID_TABLE

v1 -> v2:
- s/BIND_GUEST_STAGE/BIND_PASID_TABLE
- remove the struct device arg
---
 drivers/vfio/vfio_iommu_type1.c | 31 +++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  8 ++++++++
 2 files changed, 39 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7651cfb14836..d9dd23f64f00 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1644,6 +1644,24 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
 	return ret;
 }
 
+static int
+vfio_set_pasid_table(struct vfio_iommu *iommu,
+		      struct vfio_iommu_type1_set_pasid_table *ustruct)
+{
+	struct vfio_domain *d;
+	int ret = 0;
+
+	mutex_lock(&iommu->lock);
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		ret = iommu_set_pasid_table(d->domain, &ustruct->config);
+		if (ret)
+			break;
+	}
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -1714,6 +1732,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+	} else if (cmd == VFIO_IOMMU_SET_PASID_TABLE) {
+		struct vfio_iommu_type1_set_pasid_table ustruct;
+
+		minsz = offsetofend(struct vfio_iommu_type1_set_pasid_table,
+				    config);
+
+		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ustruct.argsz < minsz || ustruct.flags)
+			return -EINVAL;
+
+		return vfio_set_pasid_table(iommu, &ustruct);
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 02bb7ad6e986..0d9f4090c95d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -759,6 +760,13 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+struct vfio_iommu_type1_set_pasid_table {
+	__u32	argsz;
+	__u32	flags;
+	struct iommu_pasid_table_config config;
+};
+#define VFIO_IOMMU_SET_PASID_TABLE	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.17.2



* [RFC v3 05/21] vfio: VFIO_IOMMU_CACHE_INVALIDATE
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: "Liu, Yi L" <yi.l.liu@linux.intel.com>

When the guest "owns" the stage 1 translation structures,  the host
IOMMU driver has no knowledge of caching structure updates unless
the guest invalidation requests are trapped and passed down to the
host.

This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl with aims
at propagating guest stage1 IOMMU cache invalidations to the host.
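
For example, forwarding a trapped guest TLBI targeting a single 4KB
page could look like this from userspace (guest_asid and guest_iova
come from the trapped guest command; a hypothetical sketch):

	struct vfio_iommu_type1_cache_invalidate ustruct = {
		.argsz = sizeof(ustruct),
		.info = {
			.hdr = {
				.version = TLB_INV_HDR_VERSION_1,
				.type    = IOMMU_INV_TYPE_TLB,
			},
			.granularity = IOMMU_INV_GRANU_PAGE_PASID,
			.size        = 0,	/* 4KB granule */
			.nr_pages    = 1,
			.arch_id     = guest_asid,
			.addr        = guest_iova,
		},
	};

	ret = ioctl(container_fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct);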

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v2 -> v3:
- introduce vfio_iommu_for_each_dev back in this patch

v1 -> v2:
- s/TLB/CACHE
- remove vfio_iommu_task usage
- commit message rewording
---
 drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  7 +++++
 2 files changed, 54 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d9dd23f64f00..c3ba3f249438 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -113,6 +113,26 @@ struct vfio_regions {
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
 					(!list_empty(&iommu->domain_list))
 
+/* iommu->lock must be held */
+static int
+vfio_iommu_for_each_dev(struct vfio_iommu *iommu, void *data,
+			int (*fn)(struct device *, void *))
+{
+	struct vfio_domain *d;
+	struct vfio_group *g;
+	int ret = 0;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			ret = iommu_group_for_each_dev(g->iommu_group,
+						       data, fn);
+			if (ret)
+				break;
+		}
+	}
+	return ret;
+}
+
 static int put_pfn(unsigned long pfn, int prot);
 
 /*
@@ -1644,6 +1664,15 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
 	return ret;
 }
 
+static int vfio_cache_inv_fn(struct device *dev, void *data)
+{
+	struct vfio_iommu_type1_cache_invalidate *ustruct =
+		(struct vfio_iommu_type1_cache_invalidate *)data;
+	struct iommu_domain *d = iommu_get_domain_for_dev(dev);
+
+	return iommu_cache_invalidate(d, dev, &ustruct->info);
+}
+
 static int
 vfio_set_pasid_table(struct vfio_iommu *iommu,
 		      struct vfio_iommu_type1_set_pasid_table *ustruct)
@@ -1745,6 +1774,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 			return -EINVAL;
 
 		return vfio_set_pasid_table(iommu, &ustruct);
+	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
+		struct vfio_iommu_type1_cache_invalidate ustruct;
+		int ret;
+
+		minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
+				    info);
+
+		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ustruct.argsz < minsz || ustruct.flags)
+			return -EINVAL;
+
+		mutex_lock(&iommu->lock);
+		ret = vfio_iommu_for_each_dev(iommu, &ustruct,
+					      vfio_cache_inv_fn);
+		mutex_unlock(&iommu->lock);
+		return ret;
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0d9f4090c95d..11a07165e7e1 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -767,6 +767,13 @@ struct vfio_iommu_type1_set_pasid_table {
 };
 #define VFIO_IOMMU_SET_PASID_TABLE	_IO(VFIO_TYPE, VFIO_BASE + 22)
 
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct iommu_cache_invalidate_info info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 23)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.17.2



* [RFC v3 06/21] vfio: VFIO_IOMMU_BIND_MSI
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

This patch adds the VFIO_IOMMU_BIND_MSI ioctl which aims at
passing the guest MSI binding to the host.
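
For example, once the VMM has trapped a gIOVA -> gDB mapping set up by
the guest, the binding would be pushed down like this (a hypothetical
sketch; the granule encoding is an assumption):

	struct vfio_iommu_type1_bind_guest_msi ustruct = {
		.argsz = sizeof(ustruct),
		.binding = {
			.iova    = giova,
			.gpa     = gdb_gpa,
			.granule = 12,	/* page size order, assumed */
		},
	};

	ret = ioctl(container_fd, VFIO_IOMMU_BIND_MSI, &ustruct);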

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v2 -> v3:
- adapt to new proto of bind_guest_msi
- directly use vfio_iommu_for_each_dev

v1 -> v2:
- s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
---
 drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  7 +++++++
 2 files changed, 34 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c3ba3f249438..59229f6e2d84 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1673,6 +1673,15 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
 	return iommu_cache_invalidate(d, dev, &ustruct->info);
 }
 
+static int vfio_bind_guest_msi_fn(struct device *dev, void *data)
+{
+	struct vfio_iommu_type1_bind_guest_msi *ustruct =
+		(struct vfio_iommu_type1_bind_guest_msi *)data;
+	struct iommu_domain *d = iommu_get_domain_for_dev(dev);
+
+	return iommu_bind_guest_msi(d, dev, &ustruct->binding);
+}
+
 static int
 vfio_set_pasid_table(struct vfio_iommu *iommu,
 		      struct vfio_iommu_type1_set_pasid_table *ustruct)
@@ -1792,6 +1801,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 					      vfio_cache_inv_fn);
 		mutex_unlock(&iommu->lock);
 		return ret;
+	} else if (cmd == VFIO_IOMMU_BIND_MSI) {
+		struct vfio_iommu_type1_bind_guest_msi ustruct;
+		int ret;
+
+		minsz = offsetofend(struct vfio_iommu_type1_bind_guest_msi,
+				    binding);
+
+		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ustruct.argsz < minsz || ustruct.flags)
+			return -EINVAL;
+
+		mutex_lock(&iommu->lock);
+		ret = vfio_iommu_for_each_dev(iommu, &ustruct,
+					      vfio_bind_guest_msi_fn);
+		mutex_unlock(&iommu->lock);
+		return ret;
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 11a07165e7e1..352e795a93c8 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -774,6 +774,13 @@ struct vfio_iommu_type1_cache_invalidate {
 };
 #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 23)
 
+struct vfio_iommu_type1_bind_guest_msi {
+	__u32   argsz;
+	__u32   flags;
+	struct iommu_guest_msi_binding binding;
+};
+#define VFIO_IOMMU_BIND_MSI      _IO(VFIO_TYPE, VFIO_BASE + 24)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.17.2



* [RFC v3 07/21] iommu/arm-smmu-v3: Link domains and devices
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

When removing a mapping from a domain, we need to send an invalidation to
all devices that might have stored it in their Address Translation Cache
(ATC). In addition, with SVM we'll need to invalidate the context
descriptors of all devices attached to a live domain.

Maintain a list of devices in each domain, protected by a spinlock. It is
updated every time we attach or detach devices to and from domains.

It needs to be a spinlock because we'll invalidate ATC entries from
within hardirq-safe contexts, but it may be possible to relax the read
side with RCU later.
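
A later consumer would then walk the list under the lock along these
lines (send_atc_invalidation() is a made-up placeholder):

	unsigned long flags;
	struct arm_smmu_master_data *master;

	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
	list_for_each_entry(master, &smmu_domain->devices, list)
		send_atc_invalidation(master);
	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);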

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 0d284029dc73..ce222705f52b 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -595,6 +595,11 @@ struct arm_smmu_device {
 struct arm_smmu_master_data {
 	struct arm_smmu_device		*smmu;
 	struct arm_smmu_strtab_ent	ste;
+
+	struct arm_smmu_domain		*domain;
+	struct list_head		list; /* domain->devices */
+
+	struct device			*dev;
 };
 
 /* SMMU private data for an IOMMU domain */
@@ -619,6 +624,9 @@ struct arm_smmu_domain {
 	};
 
 	struct iommu_domain		domain;
+
+	struct list_head		devices;
+	spinlock_t			devices_lock;
 };
 
 struct arm_smmu_option_prop {
@@ -1494,6 +1502,9 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 	}
 
 	mutex_init(&smmu_domain->init_mutex);
+	INIT_LIST_HEAD(&smmu_domain->devices);
+	spin_lock_init(&smmu_domain->devices_lock);
+
 	return &smmu_domain->domain;
 }
 
@@ -1714,6 +1725,16 @@ static void arm_smmu_detach_dev(struct device *dev)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct arm_smmu_master_data *master = fwspec->iommu_priv;
+	unsigned long flags;
+	struct arm_smmu_domain *smmu_domain = master->domain;
+
+	if (smmu_domain) {
+		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+		list_del(&master->list);
+		spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
+		master->domain = NULL;
+	}
 
 	master->ste.assigned = false;
 	arm_smmu_install_ste_for_dev(fwspec);
@@ -1723,6 +1744,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int ret = 0;
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	unsigned long flags;
 	struct arm_smmu_device *smmu;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 	struct arm_smmu_master_data *master;
@@ -1758,6 +1780,11 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	}
 
 	ste->assigned = true;
+	master->domain = smmu_domain;
+
+	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+	list_add(&master->list, &smmu_domain->devices);
+	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_BYPASS) {
 		ste->s1_cfg = NULL;
@@ -1884,6 +1911,7 @@ static int arm_smmu_add_device(struct device *dev)
 			return -ENOMEM;
 
 		master->smmu = smmu;
+		master->dev = dev;
 		fwspec->iommu_priv = master;
 	}
 
-- 
2.17.2



* [RFC v3 08/21] iommu/arm-smmu-v3: Maintain a SID->device structure
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

When handling faults from the event or PRI queue, we need to find the
struct device associated to a SID. Add a rb_tree to keep track of SIDs.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 136 ++++++++++++++++++++++++++++++++++--
 1 file changed, 132 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index ce222705f52b..9af68266bbb1 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -589,6 +589,16 @@ struct arm_smmu_device {
 
 	/* IOMMU core code handle */
 	struct iommu_device		iommu;
+
+	struct rb_root			streams;
+	struct mutex			streams_mutex;
+
+};
+
+struct arm_smmu_stream {
+	u32				id;
+	struct arm_smmu_master_data	*master;
+	struct rb_node			node;
 };
 
 /* SMMU private data for each master */
@@ -598,6 +608,7 @@ struct arm_smmu_master_data {
 
 	struct arm_smmu_domain		*domain;
 	struct list_head		list; /* domain->devices */
+	struct arm_smmu_stream		*streams;
 
 	struct device			*dev;
 };
@@ -1244,6 +1255,32 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 	return 0;
 }
 
+__maybe_unused
+static struct arm_smmu_master_data *
+arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
+{
+	struct rb_node *node;
+	struct arm_smmu_stream *stream;
+	struct arm_smmu_master_data *master = NULL;
+
+	mutex_lock(&smmu->streams_mutex);
+	node = smmu->streams.rb_node;
+	while (node) {
+		stream = rb_entry(node, struct arm_smmu_stream, node);
+		if (stream->id < sid) {
+			node = node->rb_right;
+		} else if (stream->id > sid) {
+			node = node->rb_left;
+		} else {
+			master = stream->master;
+			break;
+		}
+	}
+	mutex_unlock(&smmu->streams_mutex);
+
+	return master;
+}
+
 /* IRQ and event handlers */
 static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 {
@@ -1882,6 +1919,71 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device *smmu, u32 sid)
 	return sid < limit;
 }
 
+static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
+				  struct arm_smmu_master_data *master)
+{
+	int i;
+	int ret = 0;
+	struct arm_smmu_stream *new_stream, *cur_stream;
+	struct rb_node **new_node, *parent_node = NULL;
+	struct iommu_fwspec *fwspec = master->dev->iommu_fwspec;
+
+	master->streams = kcalloc(fwspec->num_ids,
+				  sizeof(struct arm_smmu_stream), GFP_KERNEL);
+	if (!master->streams)
+		return -ENOMEM;
+
+	mutex_lock(&smmu->streams_mutex);
+	for (i = 0; i < fwspec->num_ids && !ret; i++) {
+		new_stream = &master->streams[i];
+		new_stream->id = fwspec->ids[i];
+		new_stream->master = master;
+
+		new_node = &(smmu->streams.rb_node);
+		while (*new_node) {
+			cur_stream = rb_entry(*new_node, struct arm_smmu_stream,
+					      node);
+			parent_node = *new_node;
+			if (cur_stream->id > new_stream->id) {
+				new_node = &((*new_node)->rb_left);
+			} else if (cur_stream->id < new_stream->id) {
+				new_node = &((*new_node)->rb_right);
+			} else {
+				dev_warn(master->dev,
+					 "stream %u already in tree\n",
+					 cur_stream->id);
+				ret = -EINVAL;
+				break;
+			}
+		}
+
+		if (!ret) {
+			rb_link_node(&new_stream->node, parent_node, new_node);
+			rb_insert_color(&new_stream->node, &smmu->streams);
+		}
+	}
+	mutex_unlock(&smmu->streams_mutex);
+
+	return ret;
+}
+
+static void arm_smmu_remove_master(struct arm_smmu_device *smmu,
+				   struct arm_smmu_master_data *master)
+{
+	int i;
+	struct iommu_fwspec *fwspec = master->dev->iommu_fwspec;
+
+	if (!master->streams)
+		return;
+
+	mutex_lock(&smmu->streams_mutex);
+	for (i = 0; i < fwspec->num_ids; i++)
+		rb_erase(&master->streams[i].node, &smmu->streams);
+	mutex_unlock(&smmu->streams_mutex);
+
+	kfree(master->streams);
+}
+
 static struct iommu_ops arm_smmu_ops;
 
 static int arm_smmu_add_device(struct device *dev)
@@ -1930,13 +2032,35 @@ static int arm_smmu_add_device(struct device *dev)
 		}
 	}
 
+	ret = iommu_device_link(&smmu->iommu, dev);
+	if (ret)
+		goto err_free_master;
+
+	ret = arm_smmu_insert_master(smmu, master);
+	if (ret)
+		goto err_unlink;
+
 	group = iommu_group_get_for_dev(dev);
-	if (!IS_ERR(group)) {
-		iommu_group_put(group);
-		iommu_device_link(&smmu->iommu, dev);
+	if (IS_ERR(group)) {
+		ret = PTR_ERR(group);
+		goto err_remove_master;
 	}
 
-	return PTR_ERR_OR_ZERO(group);
+	iommu_group_put(group);
+
+	return 0;
+
+err_remove_master:
+	arm_smmu_remove_master(smmu, master);
+
+err_unlink:
+	iommu_device_unlink(&smmu->iommu, dev);
+
+err_free_master:
+	kfree(master);
+	fwspec->iommu_priv = NULL;
+
+	return ret;
 }
 
 static void arm_smmu_remove_device(struct device *dev)
@@ -1953,6 +2077,7 @@ static void arm_smmu_remove_device(struct device *dev)
 	if (master && master->ste.assigned)
 		arm_smmu_detach_dev(dev);
 	iommu_group_remove_device(dev);
+	arm_smmu_remove_master(smmu, master);
 	iommu_device_unlink(&smmu->iommu, dev);
 	kfree(master);
 	iommu_fwspec_free(dev);
@@ -2266,6 +2391,9 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
 {
 	int ret;
 
+	mutex_init(&smmu->streams_mutex);
+	smmu->streams = RB_ROOT;
+
 	ret = arm_smmu_init_queues(smmu);
 	if (ret)
 		return ret;
-- 
2.17.2



* [RFC v3 09/21] iommu/smmuv3: Get prepared for nested stage support
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

To allow nested stage support, we need to store both
stage 1 and stage 2 configurations (and remove the former
union).

arm_smmu_write_strtab_ent() is modified to write both stage
fields in the STE.

We add a nested_bypass field to the S1 configuration as the first
stage can be bypassed. Also the guest may force the STE to abort:
this information gets stored into the nested_abort field.

Only the S2 stage is "finalized", as the host does not configure
the S1 context descriptor (CD); the guest does.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v1 -> v2:
- invalidate the STE before moving from a live STE config to another
- add the nested_abort and nested_bypass fields
---
 drivers/iommu/arm-smmu-v3.c | 43 ++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 9af68266bbb1..9716a301d9ae 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -212,6 +212,7 @@
 #define STRTAB_STE_0_CFG_BYPASS		4
 #define STRTAB_STE_0_CFG_S1_TRANS	5
 #define STRTAB_STE_0_CFG_S2_TRANS	6
+#define STRTAB_STE_0_CFG_NESTED		7
 
 #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)
 #define STRTAB_STE_0_S1FMT_LINEAR	0
@@ -491,6 +492,10 @@ struct arm_smmu_strtab_l1_desc {
 struct arm_smmu_s1_cfg {
 	__le64				*cdptr;
 	dma_addr_t			cdptr_dma;
+	/* in nested mode, tells s1 must be bypassed */
+	bool				nested_bypass;
+	/* in nested mode, abort is forced by guest */
+	bool				nested_abort;
 
 	struct arm_smmu_ctx_desc {
 		u16	asid;
@@ -515,6 +520,7 @@ struct arm_smmu_strtab_ent {
 	 * configured according to the domain type.
 	 */
 	bool				assigned;
+	bool				nested;
 	struct arm_smmu_s1_cfg		*s1_cfg;
 	struct arm_smmu_s2_cfg		*s2_cfg;
 };
@@ -629,10 +635,8 @@ struct arm_smmu_domain {
 	bool				non_strict;
 
 	enum arm_smmu_domain_stage	stage;
-	union {
-		struct arm_smmu_s1_cfg	s1_cfg;
-		struct arm_smmu_s2_cfg	s2_cfg;
-	};
+	struct arm_smmu_s1_cfg	s1_cfg;
+	struct arm_smmu_s2_cfg	s2_cfg;
 
 	struct iommu_domain		domain;
 
@@ -1139,10 +1143,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
 			break;
 		case STRTAB_STE_0_CFG_S1_TRANS:
 		case STRTAB_STE_0_CFG_S2_TRANS:
+		case STRTAB_STE_0_CFG_NESTED:
 			ste_live = true;
 			break;
 		case STRTAB_STE_0_CFG_ABORT:
-			if (disable_bypass)
+			if (disable_bypass || ste->nested)
 				break;
 		default:
 			BUG(); /* STE corruption */
@@ -1154,7 +1159,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
 
 	/* Bypass/fault */
 	if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) {
-		if (!ste->assigned && disable_bypass)
+		if ((!ste->assigned && disable_bypass) ||
+				(ste->s1_cfg && ste->s1_cfg->nested_abort))
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
@@ -1172,8 +1178,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
 		return;
 	}
 
+	if (ste->nested && ste_live) {
+		/*
+		 * When enabling nested, the STE may be transitioning from
+		 * s2 to nested and back. Invalidate the STE before changing it.
+		 */
+		dst[0] = cpu_to_le64(0);
+		arm_smmu_sync_ste_for_sid(smmu, sid);
+		val = STRTAB_STE_0_V;
+	}
+
 	if (ste->s1_cfg) {
-		BUG_ON(ste_live);
 		dst[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1187,12 +1202,12 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
 		   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
 			dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
-		val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
-			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS);
+		if (!ste->s1_cfg->nested_bypass)
+			val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
+				FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS);
 	}
 
 	if (ste->s2_cfg) {
-		BUG_ON(ste_live);
 		dst[2] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, ste->s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, ste->s2_cfg->vtcr) |
@@ -1454,6 +1469,10 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 		cmd.opcode	= CMDQ_OP_TLBI_NH_ASID;
 		cmd.tlbi.asid	= smmu_domain->s1_cfg.cd.asid;
 		cmd.tlbi.vmid	= 0;
+	} else if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED) {
+		cmd.opcode      = CMDQ_OP_TLBI_NH_ASID;
+		cmd.tlbi.asid   = smmu_domain->s1_cfg.cd.asid;
+		cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
 	} else {
 		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
 		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
@@ -1484,6 +1503,10 @@ static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
 		cmd.opcode	= CMDQ_OP_TLBI_NH_VA;
 		cmd.tlbi.asid	= smmu_domain->s1_cfg.cd.asid;
+	} else if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED) {
+		cmd.opcode      = CMDQ_OP_TLBI_NH_VA;
+		cmd.tlbi.asid   = smmu_domain->s1_cfg.cd.asid;
+		cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
 	} else {
 		cmd.opcode	= CMDQ_OP_TLBI_S2_IPA;
 		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
-- 
2.17.2



* [RFC v3 10/21] iommu/smmuv3: Implement set_pasid_table
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

In set_pasid_table() we program the STE stage 1 related info set
by the guest into the actual physical STEs. At minimum
we need to program the context descriptor GPA and compute
whether the guest wanted to bypass stage 1 or to induce
aborts for this STE.

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v2 -> v3:
- callback now is named set_pasid_table and struct fields
  are laid out differently.

v1 -> v2:
- invalidate the STE before changing them
- hold init_mutex
- handle new fields
---
 drivers/iommu/arm-smmu-v3.c | 68 +++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 9716a301d9ae..0e006babc8a6 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2226,6 +2226,73 @@ static void arm_smmu_put_resv_regions(struct device *dev,
 		kfree(entry);
 }
 
+static int arm_smmu_set_pasid_table(struct iommu_domain *domain,
+				     struct iommu_pasid_table_config *cfg)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_master_data *entry;
+	struct arm_smmu_s1_cfg *s1_cfg;
+	struct arm_smmu_device *smmu;
+	unsigned long flags;
+	int ret = -EINVAL;
+
+	if (cfg->format != IOMMU_PASID_FORMAT_SMMUV3)
+		return -EINVAL;
+
+	mutex_lock(&smmu_domain->init_mutex);
+
+	smmu = smmu_domain->smmu;
+
+	if (!smmu)
+		goto out;
+
+	if (!((smmu->features & ARM_SMMU_FEAT_TRANS_S1) &&
+	      (smmu->features & ARM_SMMU_FEAT_TRANS_S2))) {
+		dev_info(smmu_domain->smmu->dev,
+			 "does not implement two stages\n");
+		goto out;
+	}
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+		goto out;
+
+	if (cfg->bypass) {
+		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+		list_for_each_entry(entry, &smmu_domain->devices, list) {
+			entry->ste.s1_cfg = NULL;
+			entry->ste.nested = false;
+			arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec);
+		}
+		spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
+		smmu_domain->s1_cfg.nested_abort = false;
+		smmu_domain->s1_cfg.nested_bypass = false;
+		ret = 0;
+		goto out;
+	}
+
+	/* we currently support a single CD. S1DSS and S1FMT are ignored */
+	if (cfg->pasid_bits)
+		goto out;
+
+	s1_cfg = &smmu_domain->s1_cfg;
+	s1_cfg->nested_bypass = cfg->bypass;
+	s1_cfg->nested_abort = cfg->smmuv3.abort;
+	s1_cfg->cdptr_dma = cfg->base_ptr;
+
+	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+	list_for_each_entry(entry, &smmu_domain->devices, list) {
+		entry->ste.s1_cfg = &smmu_domain->s1_cfg;
+		entry->ste.nested = true;
+		arm_smmu_install_ste_for_dev(entry->dev->iommu_fwspec);
+	}
+	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+	ret = 0;
+out:
+	mutex_unlock(&smmu_domain->init_mutex);
+	return ret;
+}
+
 static struct iommu_ops arm_smmu_ops = {
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
@@ -2244,6 +2311,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
 	.put_resv_regions	= arm_smmu_put_resv_regions,
+	.set_pasid_table	= arm_smmu_set_pasid_table,
 	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
 };
 
-- 
2.17.2



* [RFC v3 11/21] iommu/smmuv3: Implement cache_invalidate
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

Implement IOMMU_INV_TYPE_TLB invalidations. When
nr_pages is 0 we interpret this as a context
invalidation.
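
As a hypothetical userspace-side sketch of both flavors
(guest_asid and iova are placeholders; the uapi is still
expected to be refined):

	struct iommu_cache_invalidate_info info = {
		.hdr.type = IOMMU_INV_TYPE_TLB,
		.arch_id  = guest_asid,	/* ASID set up by the guest */
	};

	/* context invalidation: leave nr_pages at 0 */

	/* or range invalidation: 16 pages, granule = 1 << (size + 12) */
	info.addr     = iova;
	info.size     = 0;		/* 4kB granule */
	info.nr_pages = 16;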

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

The user API needs to be refined to discriminate context
invalidations from NH_VA invalidations. Also the leaf attribute
is not yet properly handled.

v2 -> v3:
- replace __arm_smmu_tlb_sync by arm_smmu_cmdq_issue_sync

v1 -> v2:
- properly pass the asid
---
 drivers/iommu/arm-smmu-v3.c | 40 +++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 0e006babc8a6..ca72e0ce92f6 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2293,6 +2293,45 @@ static int arm_smmu_set_pasid_table(struct iommu_domain *domain,
 	return ret;
 }
 
+static int
+arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
+			  struct iommu_cache_invalidate_info *inv_info)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+		return -EINVAL;
+
+	if (!smmu)
+		return -EINVAL;
+
+	switch (inv_info->hdr.type) {
+	case IOMMU_INV_TYPE_TLB:
+		/*
+		 * TODO: On context invalidation, the userspace sets nr_pages
+		 * to 0. Refine the API to add a dedicated flag and also
+		 * properly handle the leaf parameter.
+		 */
+		if (!inv_info->nr_pages) {
+			smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
+			arm_smmu_tlb_inv_context(smmu_domain);
+		} else {
+			size_t granule = 1 << (inv_info->size + 12);
+			size_t size = inv_info->nr_pages * granule;
+
+			smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
+			arm_smmu_tlb_inv_range_nosync(inv_info->addr, size,
+						      granule, false,
+						      smmu_domain);
+			arm_smmu_cmdq_issue_sync(smmu);
+		}
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static struct iommu_ops arm_smmu_ops = {
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
@@ -2312,6 +2351,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.get_resv_regions	= arm_smmu_get_resv_regions,
 	.put_resv_regions	= arm_smmu_put_resv_regions,
 	.set_pasid_table	= arm_smmu_set_pasid_table,
+	.cache_invalidate	= arm_smmu_cache_invalidate,
 	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
 };
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 12/21] dma-iommu: Implement NESTED_MSI cookie
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (10 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 11/21] iommu/smmuv3: Implement cache_invalidate Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-08 10:26 ` [RFC v3 13/21] iommu/smmuv3: Implement bind_guest_msi Eric Auger
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

Up to now, when the type was UNMANAGED, we used to
allocate IOVA pages within a range provided by the user.
This does not work in nested mode.

If both the host and the guest are exposed with SMMUs, each
would allocate an IOVA. The guest allocates an IOVA (gIOVA)
to map onto the guest MSI doorbell (gDB). The Host allocates
another IOVA (hIOVA) to map onto the physical doorbell (hDB).

So we end up with 2 unrelated mappings, at S1 and S2:
         S1             S2
gIOVA    ->     gDB
               hIOVA    ->    hDB

The PCI device would be programmed with hIOVA.

iommu_dma_bind_doorbell allows the gIOVA/gDB binding to be passed
to the host so that gIOVA can be reused by the host instead of
allocating a new IOVA. The device handle is also passed to
guarantee that devices belonging to different stage1 domains
record distinguishable stage1 mappings. That way the host can
create the following nested mapping:

         S1           S2
gIOVA    ->    gDB    ->    hDB

this time, the PCI device will be programmed with the gIOVA MSI
doorbell which is correctly mapped through the 2 stages.
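
For illustration, the host side is expected to register such a
binding roughly as follows (giova/gdb are placeholder values
provided by the virtualizer):

	struct iommu_guest_msi_binding binding = {
		.iova    = giova,	/* gIOVA programmed by the guest */
		.gpa     = gdb,		/* GPA of the guest MSI doorbell */
		.granule = 12,		/* log2 of the mapping size (4kB) */
	};

	ret = iommu_dma_bind_doorbell(domain, dev, &binding);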

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v2 -> v3:
- also store the device handle on S1 mapping registration.
  This guarantees that the associated S2 mapping binds to the
  correct physical MSI controller.

v1 -> v2:
- unmap stage2 on put()
---
 drivers/iommu/dma-iommu.c | 112 ++++++++++++++++++++++++++++++++++++--
 include/linux/dma-iommu.h |  11 ++++
 2 files changed, 119 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d19f3d6b43c1..19af8107e959 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -35,12 +35,15 @@
 struct iommu_dma_msi_page {
 	struct list_head	list;
 	dma_addr_t		iova;
+	dma_addr_t		ipa;
 	phys_addr_t		phys;
+	struct device		*dev;
 };
 
 enum iommu_dma_cookie_type {
 	IOMMU_DMA_IOVA_COOKIE,
 	IOMMU_DMA_MSI_COOKIE,
+	IOMMU_DMA_NESTED_MSI_COOKIE,
 };
 
 struct iommu_dma_cookie {
@@ -110,14 +113,17 @@ EXPORT_SYMBOL(iommu_get_dma_cookie);
  *
  * Users who manage their own IOVA allocation and do not want DMA API support,
  * but would still like to take advantage of automatic MSI remapping, can use
- * this to initialise their own domain appropriately. Users should reserve a
+ * this to initialise their own domain appropriately. Users may reserve a
  * contiguous IOVA region, starting at @base, large enough to accommodate the
  * number of PAGE_SIZE mappings necessary to cover every MSI doorbell address
- * used by the devices attached to @domain.
+ * used by the devices attached to @domain. Alternatively, usable IOVA pages
+ * can be provided through the iommu_dma_bind_doorbell API (nested stages
+ * use case).
  */
 int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
 {
 	struct iommu_dma_cookie *cookie;
+	int nesting, ret;
 
 	if (domain->type != IOMMU_DOMAIN_UNMANAGED)
 		return -EINVAL;
@@ -125,7 +131,12 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
 	if (domain->iova_cookie)
 		return -EEXIST;
 
-	cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
+	ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting);
+	if (!ret && nesting)
+		cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE);
+	else
+		cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
+
 	if (!cookie)
 		return -ENOMEM;
 
@@ -146,6 +157,7 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 {
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iommu_dma_msi_page *msi, *tmp;
+	bool s2_unmap = false;
 
 	if (!cookie)
 		return;
@@ -153,7 +165,15 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 	if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
 		put_iova_domain(&cookie->iovad);
 
+	if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE)
+		s2_unmap = true;
+
 	list_for_each_entry_safe(msi, tmp, &cookie->msi_page_list, list) {
+		if (s2_unmap && msi->phys) {
+			size_t size = cookie_msi_granule(cookie);
+
+			WARN_ON(iommu_unmap(domain, msi->ipa, size) != size);
+		}
 		list_del(&msi->list);
 		kfree(msi);
 	}
@@ -162,6 +182,52 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 }
 EXPORT_SYMBOL(iommu_put_dma_cookie);
 
+/**
+ * iommu_dma_bind_doorbell - Provide a usable IOVA page
+ * @domain: domain handle
+ * @dev: device handle
+ * @binding: IOVA/IPA binding
+ *
+ * In nested stage use case, the user can provide IOVA/IPA bindings
+ * corresponding to a guest MSI stage 1 mapping. When the host needs
+ * to map its own MSI doorbells, it can use the IPA as stage 2 input
+ * and map it onto the physical MSI doorbell.
+ */
+int iommu_dma_bind_doorbell(struct iommu_domain *domain, struct device *dev,
+			    struct iommu_guest_msi_binding *binding)
+{
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iommu_dma_msi_page *msi;
+	dma_addr_t ipa, iova;
+	size_t size;
+
+	if (!cookie)
+		return -EINVAL;
+
+	if (cookie->type != IOMMU_DMA_NESTED_MSI_COOKIE)
+		return -EINVAL;
+
+	size = 1 << binding->granule;
+	iova = binding->iova & ~(phys_addr_t)(size - 1);
+	ipa = binding->gpa & ~(phys_addr_t)(size - 1);
+
+	list_for_each_entry(msi, &cookie->msi_page_list, list) {
+		if (msi->iova == iova && msi->dev == dev)
+			return 0; /* this page is already registered */
+	}
+
+	msi = kzalloc(sizeof(*msi), GFP_KERNEL);
+	if (!msi)
+		return -ENOMEM;
+
+	msi->iova = iova;
+	msi->ipa = ipa;
+	msi->dev = dev;
+	list_add(&msi->list, &cookie->msi_page_list);
+	return 0;
+}
+EXPORT_SYMBOL(iommu_dma_bind_doorbell);
+
 /**
  * iommu_dma_get_resv_regions - Reserved region driver helper
  * @dev: Device from iommu_get_resv_regions()
@@ -856,6 +922,16 @@ void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
 	__iommu_dma_unmap(iommu_get_dma_domain(dev), handle, size);
 }
 
+static bool msi_page_match(struct iommu_dma_msi_page *msi_page,
+			   struct device *dev, phys_addr_t msi_addr)
+{
+	bool match = msi_page->phys == msi_addr;
+
+	if (msi_page->dev)
+		match &= (msi_page->dev == dev);
+	return match;
+}
+
 static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 		phys_addr_t msi_addr, struct iommu_domain *domain)
 {
@@ -867,9 +943,37 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 
 	msi_addr &= ~(phys_addr_t)(size - 1);
 	list_for_each_entry(msi_page, &cookie->msi_page_list, list)
-		if (msi_page->phys == msi_addr)
+		if (msi_page_match(msi_page, dev, msi_addr))
 			return msi_page;
 
+	/*
+	 * In nested stage mode, we do not allocate an MSI page in
+	 * a range provided by the user. Instead, IOVA/IPA bindings are
+	 * individually provided. We reuse these IOVAs to build the
+	 * IOVA -> IPA -> MSI PA nested stage mapping.
+	 */
+	if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE) {
+		list_for_each_entry(msi_page, &cookie->msi_page_list, list)
+			if (!msi_page->phys && msi_page->dev == dev) {
+				dma_addr_t ipa = msi_page->ipa;
+				int ret;
+
+				msi_page->phys = msi_addr;
+
+				/* do the stage 2 mapping */
+				ret = iommu_map(domain, ipa, msi_addr, size,
+						IOMMU_MMIO | IOMMU_WRITE);
+				if (ret) {
+					pr_warn("MSI S2 mapping failed (%d)\n",
+						ret);
+					return NULL;
+				}
+				return msi_page;
+			}
+		pr_warn("%s no MSI binding found\n", __func__);
+		return NULL;
+	}
+
 	msi_page = kzalloc(sizeof(*msi_page), GFP_ATOMIC);
 	if (!msi_page)
 		return NULL;
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index e760dc5d1fa8..778243719462 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -24,6 +24,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/iommu.h>
 #include <linux/msi.h>
+#include <uapi/linux/iommu.h>
 
 int iommu_dma_init(void);
 
@@ -73,12 +74,15 @@ void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
 /* The DMA API isn't _quite_ the whole story, though... */
 void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg);
 void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
+int iommu_dma_bind_doorbell(struct iommu_domain *domain, struct device *dev,
+			    struct iommu_guest_msi_binding *binding);
 
 #else
 
 struct iommu_domain;
 struct msi_msg;
 struct device;
+struct iommu_guest_msi_binding;
 
 static inline int iommu_dma_init(void)
 {
@@ -103,6 +107,13 @@ static inline void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg)
 {
 }
 
+static inline int
+iommu_dma_bind_doorbell(struct iommu_domain *domain, struct device *dev,
+			struct iommu_guest_msi_binding *binding)
+{
+	return -ENODEV;
+}
+
 static inline void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)
 {
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 13/21] iommu/smmuv3: Implement bind_guest_msi
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (11 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 12/21] dma-iommu: Implement NESTED_MSI cookie Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-08 10:26 ` [RFC v3 14/21] iommu: introduce device fault data Eric Auger
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

The bind_guest_msi() callback checks that the domain
is NESTED and redirects to the dma-iommu implementation.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 drivers/iommu/arm-smmu-v3.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index ca72e0ce92f6..999ee470a2ae 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2226,6 +2226,28 @@ static void arm_smmu_put_resv_regions(struct device *dev,
 		kfree(entry);
 }
 
+static int arm_smmu_bind_guest_msi(struct iommu_domain *domain,
+				   struct device *dev,
+				   struct iommu_guest_msi_binding *binding)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_device *smmu;
+	int ret = -EINVAL;
+
+	mutex_lock(&smmu_domain->init_mutex);
+	smmu = smmu_domain->smmu;
+	if (!smmu)
+		goto out;
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+		goto out;
+
+	ret = iommu_dma_bind_doorbell(domain, dev, binding);
+out:
+	mutex_unlock(&smmu_domain->init_mutex);
+	return ret;
+}
+
 static int arm_smmu_set_pasid_table(struct iommu_domain *domain,
 				     struct iommu_pasid_table_config *cfg)
 {
@@ -2352,6 +2374,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.put_resv_regions	= arm_smmu_put_resv_regions,
 	.set_pasid_table	= arm_smmu_set_pasid_table,
 	.cache_invalidate	= arm_smmu_cache_invalidate,
+	.bind_guest_msi		= arm_smmu_bind_guest_msi,
 	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
 };
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 14/21] iommu: introduce device fault data
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (12 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 13/21] iommu/smmuv3: Implement bind_guest_msi Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
       [not found]   ` <20190110104544.26f3bcb1@jacob-builder>
  2019-01-08 10:26 ` [RFC v3 15/21] driver core: add per device iommu param Eric Auger
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: Jacob Pan <jacob.jun.pan@linux.intel.com>

Device faults detected by the IOMMU can be reported outside the
IOMMU subsystem for further processing. This patch provides
generic device fault data such that device drivers can be
informed of IOMMU faults without model-specific knowledge.

The proposed format is the result of discussion at:
https://lkml.org/lkml/2017/11/10/291
Part of the code is based on Jean-Philippe Brucker's patchset
(https://patchwork.kernel.org/patch/9989315/).

The assumption is that the model-specific IOMMU driver can filter
and handle most of the internal faults if the cause is within the
IOMMU driver's control. Therefore, the fault reasons that can be
reported are grouped and generalized based on common specifications
such as PCI ATS.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
[moved part of the iommu_fault_event struct in the uapi, enriched
 the fault reasons to be able to map unrecoverable SMMUv3 errors]
---
 include/linux/iommu.h      | 55 ++++++++++++++++++++++++-
 include/uapi/linux/iommu.h | 83 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 244c1a3d5989..1dedc2d247c2 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -49,13 +49,17 @@ struct bus_type;
 struct device;
 struct iommu_domain;
 struct notifier_block;
+struct iommu_fault_event;
 
 /* iommu fault flags */
-#define IOMMU_FAULT_READ	0x0
-#define IOMMU_FAULT_WRITE	0x1
+#define IOMMU_FAULT_READ		(1 << 0)
+#define IOMMU_FAULT_WRITE		(1 << 1)
+#define IOMMU_FAULT_EXEC		(1 << 2)
+#define IOMMU_FAULT_PRIV		(1 << 3)
 
 typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
 			struct device *, unsigned long, int, void *);
+typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
 
 struct iommu_domain_geometry {
 	dma_addr_t aperture_start; /* First address that can be mapped    */
@@ -255,6 +259,52 @@ struct iommu_device {
 	struct device *dev;
 };
 
+/**
+ * struct iommu_fault_event - Generic per device fault data
+ *
+ * - PCI and non-PCI devices
+ * - Recoverable faults (e.g. page request), information based on PCI ATS
+ * and PASID spec.
+ * - Un-recoverable faults of device interest
+ * - DMA remapping and IRQ remapping faults
+ *
+ * @fault: fault descriptor
+ * @device_private: if present, uniquely identify device-specific
+ *                  private data for an individual page request.
+ * @iommu_private: used by the IOMMU driver for storing fault-specific
+ *                 data. Users should not modify this field before
+ *                 sending the fault response.
+ */
+struct iommu_fault_event {
+	struct iommu_fault fault;
+	u64 device_private;
+	u64 iommu_private;
+};
+
+/**
+ * struct iommu_fault_param - per-device IOMMU fault data
+ * @dev_fault_handler: Callback function to handle IOMMU faults at device level
+ * @data: handler private data
+ *
+ */
+struct iommu_fault_param {
+	iommu_dev_fault_handler_t handler;
+	void *data;
+};
+
+/**
+ * struct iommu_param - collection of per-device IOMMU data
+ *
+ * @fault_param: IOMMU detected device fault reporting data
+ *
+ * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
+ *	struct iommu_group	*iommu_group;
+ *	struct iommu_fwspec	*iommu_fwspec;
+ */
+struct iommu_param {
+	struct iommu_fault_param *fault_param;
+};
+
 int  iommu_device_register(struct iommu_device *iommu);
 void iommu_device_unregister(struct iommu_device *iommu);
 int  iommu_device_sysfs_add(struct iommu_device *iommu,
@@ -438,6 +488,7 @@ struct iommu_ops {};
 struct iommu_group {};
 struct iommu_fwspec {};
 struct iommu_device {};
+struct iommu_fault_param {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index f28cd9a1aa96..e9b5330a13c8 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -148,4 +148,87 @@ struct iommu_guest_msi_binding {
 	__u64		gpa;
 	__u32		granule;
 };
+
+/* Generic fault types, can be expanded for IRQ remapping faults */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* IOMMU internal error, no specific reason to report out */
+	IOMMU_FAULT_REASON_INTERNAL,
+
+	/* Could not access the PASID table (fetch caused external abort) */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/* could not access the device context (fetch caused external abort) */
+	IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH,
+
+	/* pasid entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+	/* device context entry is invalid or has configuration errors */
+	IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY,
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/* source id is out of range */
+	IOMMU_FAULT_REASON_SOURCEID_INVALID,
+
+	/*
+	 * An external abort occurred fetching (or updating) a translation
+	 * table descriptor
+	 */
+	IOMMU_FAULT_REASON_WALK_EABT,
+
+	/*
+	 * Could not access the page table entry (Bad address),
+	 * actual translation fault
+	 */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+
+	/* access flag check failed */
+	IOMMU_FAULT_REASON_ACCESS,
+
+	/* Output address of a translation stage caused Address Size fault */
+	IOMMU_FAULT_REASON_OOR_ADDRESS
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ *
+ * @type: fault type
+ * @reason: fault reason, if relevant outside the IOMMU driver.
+ * IOMMU driver internal faults are not reported.
+ * @addr: tells the offending page address
+ * @fetch_addr: tells the address that caused an abort, if any
+ * @pasid: contains process address space ID, used in shared virtual memory
+ * @page_req_group_id: page request group index
+ * @last_req: last request in a page request group
+ * @pasid_valid: indicates if the PRQ has a valid PASID
+ * @prot: page access protection flag:
+ *	IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
+ */
+struct iommu_fault {
+	__u32	type;   /* enum iommu_fault_type */
+	__u32	reason; /* enum iommu_fault_reason */
+	__u64	addr;
+	__u64	fetch_addr;
+	__u32	pasid;
+	__u32	page_req_group_id;
+	__u32	last_req;
+	__u32	pasid_valid;
+	__u32	prot;
+	__u32	access;
+};
 #endif /* _UAPI_IOMMU_H */
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 15/21] driver core: add per device iommu param
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (13 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 14/21] iommu: introduce device fault data Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-08 10:26 ` [RFC v3 16/21] iommu: introduce device fault report API Eric Auger
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: Jacob Pan <jacob.jun.pan@linux.intel.com>

DMA faults can be detected by IOMMU at device level. Adding a pointer
to struct device allows IOMMU subsystem to report relevant faults
back to the device driver for further handling.
For directly assigned devices (or user space drivers), the guest
OS holds the responsibility to handle and respond to per-device
IOMMU faults. Therefore we need a fault reporting mechanism to
propagate faults beyond the IOMMU subsystem.

There are two other IOMMU data pointers under struct device today, here
we introduce iommu_param as a parent pointer such that all device IOMMU
data can be consolidated here. The idea was suggested here by Greg KH
and Joerg. The name iommu_param is chosen here since iommu_data has been used.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Link: https://lkml.org/lkml/2017/10/6/81
---
 include/linux/device.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 6cb4640b6160..fd7f9fae404e 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -42,6 +42,7 @@ struct iommu_ops;
 struct iommu_group;
 struct iommu_fwspec;
 struct dev_pin_info;
+struct iommu_param;
 
 struct bus_attribute {
 	struct attribute	attr;
@@ -950,6 +951,7 @@ struct dev_links_info {
  * 		device (i.e. the bus driver that discovered the device).
  * @iommu_group: IOMMU group the device belongs to.
  * @iommu_fwspec: IOMMU-specific properties supplied by firmware.
+ * @iommu_param: Per device generic IOMMU runtime data
  *
  * @offline_disabled: If set, the device is permanently online.
  * @offline:	Set after successful invocation of bus type's .offline().
@@ -1042,6 +1044,7 @@ struct device {
 	void	(*release)(struct device *dev);
 	struct iommu_group	*iommu_group;
 	struct iommu_fwspec	*iommu_fwspec;
+	struct iommu_param	*iommu_param;
 
 	bool			offline_disabled:1;
 	bool			offline:1;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 16/21] iommu: introduce device fault report API
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (14 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 15/21] driver core: add per device iommu param Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-08 10:26 ` [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults Eric Auger
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

From: Jacob Pan <jacob.jun.pan@linux.intel.com>

Traditionally, device specific faults are detected and handled within
their own device drivers. When an IOMMU is enabled, faults on DMA
related transactions are detected by the IOMMU. There is no generic
mechanism to report such faults back to the in-kernel device driver
or the guest OS in case of assigned devices.

Faults detected by the IOMMU are based on the transaction's source ID
and can be reported on a per-device basis, regardless of whether the
device is a PCI device or not.

The fault types include recoverable (e.g. page request) and
unrecoverable faults (e.g. access error). In most cases, faults can
be handled by IOMMU drivers internally. The primary use cases are as
follows:
1. page request fault originated from an SVM capable device that is
assigned to guest via vIOMMU. In this case, the first level page tables
are owned by the guest. Page request must be propagated to the guest to
let guest OS fault in the pages then send page response. In this
mechanism, the direct receiver of IOMMU fault notification is VFIO,
which can relay notification events to QEMU or other user space
software.

2. faults that need more subtle handling by device drivers. Rather
than simply invoking a reset function, the device driver may handle
the fault with a smaller impact.

This patchset is intended to create a generic fault report API such
that it can scale as follows:
- all IOMMU types
- PCI and non-PCI devices
- recoverable and unrecoverable faults
- VFIO and other in-kernel users
- DMA & IRQ remapping (TBD)
The original idea was brought up by David Woodhouse and discussions
summarized at https://lwn.net/Articles/608914/.
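
For illustration, an in-kernel consumer would use the API along
these lines (my_fault_handler and struct my_driver_data are
hypothetical):

	static int my_fault_handler(struct iommu_fault_event *evt, void *data)
	{
		struct my_driver_data *drv = data;

		/* inspect evt->fault.type/reason/addr and react */
		return 0;
	}

	ret = iommu_register_device_fault_handler(dev, my_fault_handler, drv);
	...
	ret = iommu_unregister_device_fault_handler(dev);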

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
[adapt to new iommu_fault fault field, test fault_param on
 iommu_unregister_device_fault_handler]
---
 drivers/iommu/iommu.c | 153 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h |  33 ++++++++-
 2 files changed, 184 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ea11442e7054..fb13d83914a6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -648,6 +648,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 		goto err_free_name;
 	}
 
+	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
+	if (!dev->iommu_param) {
+		ret = -ENOMEM;
+		goto err_free_name;
+	}
+	mutex_init(&dev->iommu_param->lock);
+
 	kobject_get(group->devices_kobj);
 
 	dev->iommu_group = group;
@@ -678,6 +685,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 	mutex_unlock(&group->mutex);
 	dev->iommu_group = NULL;
 	kobject_put(group->devices_kobj);
+	kfree(dev->iommu_param);
 err_free_name:
 	kfree(device->name);
 err_remove_link:
@@ -724,7 +732,7 @@ void iommu_group_remove_device(struct device *dev)
 	sysfs_remove_link(&dev->kobj, "iommu_group");
 
 	trace_remove_device_from_group(group->id, dev);
-
+	kfree(dev->iommu_param);
 	kfree(device->name);
 	kfree(device);
 	dev->iommu_group = NULL;
@@ -858,6 +866,149 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
 }
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
+/**
+ * iommu_register_device_fault_handler() - Register a device fault handler
+ * @dev: the device
+ * @handler: the fault handler
+ * @data: private data passed as argument to the handler
+ *
+ * When an IOMMU fault event is received, call this handler with the fault event
+ * and data as argument. The handler should return 0 on success. If the fault is
+ * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
+ * the fault by calling iommu_page_response() with one of the following
+ * response code:
+ * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
+ * - IOMMU_PAGE_RESP_INVALID: terminate the fault
+ * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
+ *   page faults if possible.
+ *
+ * Return 0 if the fault handler was installed successfully, or an error.
+ */
+int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data)
+{
+	struct iommu_param *param = dev->iommu_param;
+	int ret = 0;
+
+	/*
+	 * Device iommu_param should have been allocated when device is
+	 * added to its iommu_group.
+	 */
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->lock);
+	/* Only allow one fault handler registered for each device */
+	if (param->fault_param) {
+		ret = -EBUSY;
+		goto done_unlock;
+	}
+
+	get_device(dev);
+	param->fault_param =
+		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
+	if (!param->fault_param) {
+		put_device(dev);
+		ret = -ENOMEM;
+		goto done_unlock;
+	}
+	mutex_init(&param->fault_param->lock);
+	param->fault_param->handler = handler;
+	param->fault_param->data = data;
+	INIT_LIST_HEAD(&param->fault_param->faults);
+
+done_unlock:
+	mutex_unlock(&param->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
+
+/**
+ * iommu_unregister_device_fault_handler() - Unregister the device fault handler
+ * @dev: the device
+ *
+ * Remove the device fault handler installed with
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_unregister_device_fault_handler(struct device *dev)
+{
+	struct iommu_param *param = dev->iommu_param;
+	int ret = 0;
+
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->lock);
+
+	if (!param->fault_param)
+		goto unlock;
+
+	/* we cannot unregister handler if there are pending faults */
+	if (!list_empty(&param->fault_param->faults)) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	kfree(param->fault_param);
+	param->fault_param = NULL;
+	put_device(dev);
+unlock:
+	mutex_unlock(&param->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
+
+
+/**
+ * iommu_report_device_fault() - Report fault event to device
+ * @dev: the device
+ * @evt: fault event data
+ *
+ * Called by IOMMU model specific drivers when fault is detected, typically
+ * in a threaded IRQ handler.
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+	int ret = 0;
+	struct iommu_fault_event *evt_pending;
+	struct iommu_fault_param *fparam;
+
+	/* iommu_param is allocated when device is added to group */
+	if (!dev->iommu_param || !evt)
+		return -EINVAL;
+	/* we only report device fault if there is a handler registered */
+	mutex_lock(&dev->iommu_param->lock);
+	if (!dev->iommu_param->fault_param ||
+		!dev->iommu_param->fault_param->handler) {
+		ret = -EINVAL;
+		goto done_unlock;
+	}
+	fparam = dev->iommu_param->fault_param;
+	if (evt->fault.type == IOMMU_FAULT_PAGE_REQ && evt->fault.last_req) {
+		evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
+				GFP_KERNEL);
+		if (!evt_pending) {
+			ret = -ENOMEM;
+			goto done_unlock;
+		}
+		mutex_lock(&fparam->lock);
+		list_add_tail(&evt_pending->list, &fparam->faults);
+		mutex_unlock(&fparam->lock);
+	}
+	ret = fparam->handler(evt, fparam->data);
+done_unlock:
+	mutex_unlock(&dev->iommu_param->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_fault);
+
 /**
  * iommu_group_id - Return ID for a group
  * @group: the group to ID
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 1dedc2d247c2..a39bf9e040d4 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -269,6 +269,7 @@ struct iommu_device {
  * - DMA remapping and IRQ remapping faults
  *
  * @fault: fault descriptor
+ * @list: pending fault event list, used for tracking responses
  * @device_private: if present, uniquely identify device-specific
  *                  private data for an individual page request.
  * @iommu_private: used by the IOMMU driver for storing fault-specific
@@ -276,6 +277,7 @@ struct iommu_device {
  *                 sending the fault response.
  */
 struct iommu_fault_event {
+	struct list_head list;
 	struct iommu_fault fault;
 	u64 device_private;
 	u64 iommu_private;
@@ -285,10 +287,13 @@ struct iommu_fault_event {
  * struct iommu_fault_param - per-device IOMMU fault data
  * @dev_fault_handler: Callback function to handle IOMMU faults at device level
  * @data: handler private data
- *
+ * @faults: holds the pending faults which need a response, e.g. page response.
+ * @lock: protect pending PRQ event list
  */
 struct iommu_fault_param {
 	iommu_dev_fault_handler_t handler;
+	struct list_head faults;
+	struct mutex lock;
 	void *data;
 };
 
@@ -302,6 +307,7 @@ struct iommu_fault_param {
  *	struct iommu_fwspec	*iommu_fwspec;
  */
 struct iommu_param {
+	struct mutex lock;
 	struct iommu_fault_param *fault_param;
 };
 
@@ -402,6 +408,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
 					 struct notifier_block *nb);
 extern int iommu_group_unregister_notifier(struct iommu_group *group,
 					   struct notifier_block *nb);
+extern int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data);
+
+extern int iommu_unregister_device_fault_handler(struct device *dev);
+
+extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
+
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
@@ -682,6 +696,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
 	return 0;
 }
 
+static inline int iommu_register_device_fault_handler(struct device *dev,
+						iommu_dev_fault_handler_t handler,
+						void *data)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_unregister_device_fault_handler(struct device *dev)
+{
+	return 0;
+}
+
+static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+	return -ENODEV;
+}
+
 static inline int iommu_group_id(struct iommu_group *group)
 {
 	return -ENODEV;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (15 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 16/21] iommu: introduce device fault report API Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-11 17:46   ` Jean-Philippe Brucker
  2019-01-08 10:26 ` [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type Eric Auger
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

When a stage 1 related fault event is read from the event queue,
let's propagate it to potential external fault listeners, i.e. users
who registered a fault handler.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 drivers/iommu/arm-smmu-v3.c | 124 ++++++++++++++++++++++++++++++++----
 1 file changed, 113 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 999ee470a2ae..6a711cbbb228 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -168,6 +168,26 @@
 #define ARM_SMMU_PRIQ_IRQ_CFG1		0xd8
 #define ARM_SMMU_PRIQ_IRQ_CFG2		0xdc
 
+/* Events */
+#define ARM_SMMU_EVT_F_UUT		0x01
+#define ARM_SMMU_EVT_C_BAD_STREAMID	0x02
+#define ARM_SMMU_EVT_F_STE_FETCH	0x03
+#define ARM_SMMU_EVT_C_BAD_STE		0x04
+#define ARM_SMMU_EVT_F_BAD_ATS_TREQ	0x05
+#define ARM_SMMU_EVT_F_STREAM_DISABLED	0x06
+#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN	0x07
+#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID	0x08
+#define ARM_SMMU_EVT_F_CD_FETCH		0x09
+#define ARM_SMMU_EVT_C_BAD_CD		0x0a
+#define ARM_SMMU_EVT_F_WALK_EABT	0x0b
+#define ARM_SMMU_EVT_F_TRANSLATION	0x10
+#define ARM_SMMU_EVT_F_ADDR_SIZE	0x11
+#define ARM_SMMU_EVT_F_ACCESS		0x12
+#define ARM_SMMU_EVT_F_PERMISSION	0x13
+#define ARM_SMMU_EVT_F_TLB_CONFLICT	0x20
+#define ARM_SMMU_EVT_F_CFG_CONFLICT	0x21
+#define ARM_SMMU_EVT_E_PAGE_REQUEST	0x24
+
 /* Common MSI config fields */
 #define MSI_CFG0_ADDR_MASK		GENMASK_ULL(51, 2)
 #define MSI_CFG2_SH			GENMASK(5, 4)
@@ -333,6 +353,11 @@
 #define EVTQ_MAX_SZ_SHIFT		7
 
 #define EVTQ_0_ID			GENMASK_ULL(7, 0)
+#define EVTQ_0_SUBSTREAMID		GENMASK_ULL(31, 12)
+#define EVTQ_0_STREAMID			GENMASK_ULL(63, 32)
+#define EVTQ_1_S2			GENMASK_ULL(39, 39)
+#define EVTQ_1_CLASS			GENMASK_ULL(41, 40)
+#define EVTQ_3_FETCH_ADDR		GENMASK_ULL(51, 3)
 
 /* PRI queue */
 #define PRIQ_ENT_DWORDS			2
@@ -1270,7 +1295,6 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 	return 0;
 }
 
-__maybe_unused
 static struct arm_smmu_master_data *
 arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
 {
@@ -1296,24 +1320,102 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
 	return master;
 }
 
+static void arm_smmu_report_event(struct arm_smmu_device *smmu, u64 *evt)
+{
+	u64 fetch_addr = FIELD_GET(EVTQ_3_FETCH_ADDR, evt[3]);
+	u32 sid = FIELD_GET(EVTQ_0_STREAMID, evt[0]);
+	bool s1 = !FIELD_GET(EVTQ_1_S2, evt[1]);
+	u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
+	struct arm_smmu_master_data *master;
+	struct iommu_fault_event event;
+	bool propagate = true;
+	u64 addr = evt[2];
+	int i;
+
+	master = arm_smmu_find_master(smmu, sid);
+	if (WARN_ON(!master))
+		return;
+
+	event.fault.type = IOMMU_FAULT_DMA_UNRECOV;
+
+	switch (type) {
+	case ARM_SMMU_EVT_C_BAD_STREAMID:
+		event.fault.reason = IOMMU_FAULT_REASON_SOURCEID_INVALID;
+		break;
+	case ARM_SMMU_EVT_F_STREAM_DISABLED:
+	case ARM_SMMU_EVT_C_BAD_SUBSTREAMID:
+		event.fault.reason = IOMMU_FAULT_REASON_PASID_INVALID;
+		break;
+	case ARM_SMMU_EVT_F_CD_FETCH:
+		event.fault.reason = IOMMU_FAULT_REASON_PASID_FETCH;
+		break;
+	case ARM_SMMU_EVT_F_WALK_EABT:
+		event.fault.reason = IOMMU_FAULT_REASON_WALK_EABT;
+		event.fault.addr = addr;
+		event.fault.fetch_addr = fetch_addr;
+		propagate = s1;
+		break;
+	case ARM_SMMU_EVT_F_TRANSLATION:
+		event.fault.reason = IOMMU_FAULT_REASON_PTE_FETCH;
+		event.fault.addr = addr;
+		event.fault.fetch_addr = fetch_addr;
+		propagate = s1;
+		break;
+	case ARM_SMMU_EVT_F_PERMISSION:
+		event.fault.reason = IOMMU_FAULT_REASON_PERMISSION;
+		event.fault.addr = addr;
+		propagate = s1;
+		break;
+	case ARM_SMMU_EVT_F_ACCESS:
+		event.fault.reason = IOMMU_FAULT_REASON_ACCESS;
+		event.fault.addr = addr;
+		propagate = s1;
+		break;
+	case ARM_SMMU_EVT_C_BAD_STE:
+		event.fault.reason = IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY;
+		break;
+	case ARM_SMMU_EVT_C_BAD_CD:
+		event.fault.reason = IOMMU_FAULT_REASON_BAD_PASID_ENTRY;
+		break;
+	case ARM_SMMU_EVT_F_ADDR_SIZE:
+		event.fault.reason = IOMMU_FAULT_REASON_OOR_ADDRESS;
+		propagate = s1;
+		break;
+	case ARM_SMMU_EVT_F_STE_FETCH:
+		event.fault.reason = IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH;
+		event.fault.fetch_addr = fetch_addr;
+		break;
+	case ARM_SMMU_EVT_E_PAGE_REQUEST:
+	case ARM_SMMU_EVT_F_TLB_CONFLICT:
+	case ARM_SMMU_EVT_F_CFG_CONFLICT:
+	case ARM_SMMU_EVT_F_BAD_ATS_TREQ:
+	case ARM_SMMU_EVT_F_TRANSL_FORBIDDEN:
+	case ARM_SMMU_EVT_F_UUT:
+	default:
+		event.fault.reason = IOMMU_FAULT_REASON_UNKNOWN;
+	}
+	/* only propagate faults that relate to stage 1 */
+	if (propagate)
+		iommu_report_device_fault(master->dev, &event);
+
+	dev_info(smmu->dev, "event 0x%02x received:\n", type);
+	for (i = 0; i < EVTQ_ENT_DWORDS; ++i) {
+		dev_info(smmu->dev, "\t0x%016llx\n",
+			 (unsigned long long)evt[i]);
+	}
+}
+
 /* IRQ and event handlers */
 static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 {
-	int i;
 	struct arm_smmu_device *smmu = dev;
 	struct arm_smmu_queue *q = &smmu->evtq.q;
 	u64 evt[EVTQ_ENT_DWORDS];
 
 	do {
-		while (!queue_remove_raw(q, evt)) {
-			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
-
-			dev_info(smmu->dev, "event 0x%02x received:\n", id);
-			for (i = 0; i < ARRAY_SIZE(evt); ++i)
-				dev_info(smmu->dev, "\t0x%016llx\n",
-					 (unsigned long long)evt[i]);
-
-		}
+		while (!queue_remove_raw(q, evt))
+			arm_smmu_report_event(smmu, evt);
 
 		/*
 		 * Not much we can do on overflow, so scream and pretend we're
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (16 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-11 23:58   ` Alex Williamson
  2019-01-08 10:26 ` [RFC v3 19/21] vfio-pci: Register an iommu fault handler Eric Auger
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

This patch adds a new 64kB region used to report nested mode
translation faults.

The region contains a header with the size of the queue,
the producer and consumer indices and then the actual
fault queue data. The producer index is updated by the kernel
while the consumer index is updated by userspace.
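
A rough sketch of the userspace consumer side (device_fd and
region_offset are placeholders obtained through
VFIO_DEVICE_GET_REGION_INFO):

	struct vfio_fault_region *region;

	region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
		      MAP_SHARED, device_fd, region_offset);

	while (region->header.cons != region->header.prod) {
		struct iommu_fault *fault = &region->queue[region->header.cons];

		/* ... forward the fault to the vIOMMU emulation ... */
		region->header.cons =
			(region->header.cons + 1) % region->header.size;
	}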

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---
 drivers/vfio/pci/vfio_pci.c         | 102 +++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_private.h |   2 +
 include/uapi/linux/vfio.h           |  15 ++++
 3 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index ff60bd1ea587..2ba181ab2edd 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -56,6 +56,11 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 		 "Disable using the PCI D3 low power state for idle, unused devices");
 
+#define VFIO_FAULT_REGION_SIZE 0x10000
+#define VFIO_FAULT_QUEUE_SIZE	\
+	((VFIO_FAULT_REGION_SIZE - sizeof(struct vfio_fault_region_header)) / \
+	sizeof(struct iommu_fault))
+
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -1226,6 +1231,100 @@ static const struct vfio_device_ops vfio_pci_ops = {
 static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
 static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
 
+static size_t
+vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf,
+		      size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	void *base = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	/*
+	 * Only reads are handled here; userspace updates the consumer
+	 * index through the mmapped header.
+	 */
+	if (iswrite)
+		return -EINVAL;
+
+	if (copy_to_user(buf, base + pos, count))
+		return -EFAULT;
+
+	*ppos += count;
+
+	return count;
+}
+
+static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev,
+				   struct vfio_pci_region *region,
+				   struct vm_area_struct *vma)
+{
+	u64 phys_len, req_len, pgoff, req_start;
+	unsigned long long addr;
+	unsigned int index;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+
+	phys_len = VFIO_FAULT_REGION_SIZE;
+
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (req_start + req_len > phys_len)
+		return -EINVAL;
+
+	addr = virt_to_phys(vdev->fault_region);
+	vma->vm_private_data = vdev;
+	vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff;
+
+	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+			       req_len, vma->vm_page_prot);
+}
+
+static void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev,
+				struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {
+	.rw		= vfio_pci_dma_fault_rw,
+	.mmap		= vfio_pci_dma_fault_mmap,
+	.release	= vfio_pci_dma_fault_release,
+};
+
+static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
+{
+	u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
+		    VFIO_REGION_INFO_FLAG_MMAP;
+	int ret;
+
+	spin_lock_init(&vdev->fault_queue_lock);
+
+	vdev->fault_region = kmalloc(VFIO_FAULT_REGION_SIZE, GFP_KERNEL);
+	if (!vdev->fault_region)
+		return -ENOMEM;
+
+	ret = vfio_pci_register_dev_region(vdev,
+		VFIO_REGION_TYPE_NESTED,
+		VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION,
+		&vfio_pci_dma_fault_regops, VFIO_FAULT_REGION_SIZE,
+		flags, vdev->fault_region);
+	if (ret) {
+		kfree(vdev->fault_region);
+		return ret;
+	}
+
+	vdev->fault_region->header.prod = 0;
+	vdev->fault_region->header.cons = 0;
+	vdev->fault_region->header.reserved = 0;
+	vdev->fault_region->header.size = VFIO_FAULT_QUEUE_SIZE;
+	return 0;
+}
+
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vdev;
@@ -1300,7 +1399,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		pci_set_power_state(pdev, PCI_D3hot);
 	}
 
-	return ret;
+	return vfio_pci_init_dma_fault_region(vdev);
 }
 
 static void vfio_pci_remove(struct pci_dev *pdev)
@@ -1315,6 +1414,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 
 	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
 	kfree(vdev->region);
+	kfree(vdev->fault_region);
 	mutex_destroy(&vdev->ioeventfds_lock);
 	kfree(vdev);
 
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8c0009f00818..38b5d1764a26 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -120,6 +120,8 @@ struct vfio_pci_device {
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
+	spinlock_t              fault_queue_lock;
+	struct vfio_fault_region *fault_region;
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 352e795a93c8..b78c2c62af6d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -307,6 +307,9 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
 
+#define VFIO_REGION_TYPE_NESTED			(2)
+#define VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION	(1)
+
 /**
  * struct vfio_region_gfx_edid - EDID region layout.
  *
@@ -697,6 +700,18 @@ struct vfio_device_ioeventfd {
 
 #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+struct vfio_fault_region_header {
+	__u32	size;		/* Read-Only */
+	__u32	prod;		/* Read-Only */
+	__u32	cons;
+	__u32	reserved;	/* must be 0 */
+};
+
+struct vfio_fault_region {
+	struct vfio_fault_region_header header;
+	struct iommu_fault queue[0];
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 19/21] vfio-pci: Register an iommu fault handler
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (17 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-08 10:26 ` [RFC v3 20/21] vfio-pci: Add VFIO_PCI_DMA_FAULT_IRQ_INDEX Eric Auger
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

This patch registers a fault handler which records faults in
a circular buffer and then signals an eventfd. This buffer is
exposed within the fault region.
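
A rough sketch of the userspace side, once the eventfd has been
wired through VFIO_DEVICE_SET_IRQS (see the next patch):

	uint64_t cnt;

	/* block until the kernel signals at least one new fault ... */
	read(dma_fault_eventfd, &cnt, sizeof(cnt));
	/* ... then drain the mmapped fault queue and update header.cons */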

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c         | 44 ++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_private.h |  1 +
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 2ba181ab2edd..f9e2c8292e60 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -29,6 +29,7 @@
 #include <linux/vfio.h>
 #include <linux/vgaarb.h>
 #include <linux/nospec.h>
+#include <linux/circ_buf.h>
 
 #include "vfio_pci_private.h"
 
@@ -1296,6 +1297,44 @@ static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {
 	.release	= vfio_pci_dma_fault_release,
 };
 
+static int vfio_pci_iommu_dev_fault_handler(struct iommu_fault_event *evt,
+					    void *data)
+{
+	struct vfio_pci_device *vdev = (struct vfio_pci_device *) data;
+	int prod, cons, size;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vdev->fault_queue_lock, flags);
+	prod = vdev->fault_region->header.prod;
+	cons = vdev->fault_region->header.cons;
+	size = vdev->fault_region->header.size;
+
+	if (cons > VFIO_FAULT_QUEUE_SIZE - 1)
+		goto unlock;
+	if (prod > VFIO_FAULT_QUEUE_SIZE - 1)
+		goto unlock;
+	if (size != VFIO_FAULT_QUEUE_SIZE)
+		goto unlock;
+	if (vdev->fault_region->header.reserved)
+		goto unlock;
+	if (CIRC_SPACE(prod, cons, size) < 1)
+		goto unlock;
+
+	vdev->fault_region->queue[prod] = evt->fault;
+	prod = (prod + 1) % size;
+	vdev->fault_region->header.prod = prod;
+	spin_unlock_irqrestore(&vdev->fault_queue_lock, flags);
+
+	mutex_lock(&vdev->igate);
+	if (vdev->dma_fault_trigger)
+		eventfd_signal(vdev->dma_fault_trigger, 1);
+	mutex_unlock(&vdev->igate);
+	return 0;
+
+unlock:
+	spin_unlock_irqrestore(&vdev->fault_queue_lock, flags);
+	return -EINVAL;
+}
+
 static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
 {
 	u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
@@ -1322,7 +1361,9 @@ static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
 	vdev->fault_region->header.cons = 0;
 	vdev->fault_region->header.reserved = 0;
 	vdev->fault_region->header.size = VFIO_FAULT_QUEUE_SIZE;
-	return 0;
+	return iommu_register_device_fault_handler(&vdev->pdev->dev,
+					vfio_pci_iommu_dev_fault_handler,
+					vdev);
 }
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
@@ -1414,6 +1455,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 
 	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
 	kfree(vdev->region);
+	iommu_unregister_device_fault_handler(&pdev->dev);
 	kfree(vdev->fault_region);
 	mutex_destroy(&vdev->ioeventfds_lock);
 	kfree(vdev);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 38b5d1764a26..5936802cbbd0 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -120,6 +120,7 @@ struct vfio_pci_device {
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
+	struct eventfd_ctx	*dma_fault_trigger;
 	spinlock_t              fault_queue_lock;
 	struct vfio_fault_region *fault_region;
 	struct list_head	dummy_resources_list;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 20/21] vfio-pci: Add VFIO_PCI_DMA_FAULT_IRQ_INDEX
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (18 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 19/21] vfio-pci: Register an iommu fault handler Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-08 10:26 ` [RFC v3 21/21] vfio: Document nested stage control Eric Auger
  2019-01-18 10:02 ` [RFC v3 00/21] SMMUv3 Nested Stage Setup Auger Eric
  21 siblings, 0 replies; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

Add a new VFIO_PCI_DMA_FAULT_IRQ_INDEX index. This allows
setting/unsetting an eventfd that will be triggered when DMA
translation faults are detected at the physical level while the
nested mode is in use.
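
For instance, userspace could wire the eventfd as follows (a
minimal sketch following the usual VFIO_DEVICE_SET_IRQS
conventions):

	struct vfio_irq_set *irq_set;
	int32_t *pfd;

	irq_set = malloc(sizeof(*irq_set) + sizeof(*pfd));
	irq_set->argsz = sizeof(*irq_set) + sizeof(*pfd);
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
			 VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_DMA_FAULT_IRQ_INDEX;
	irq_set->start = 0;
	irq_set->count = 1;
	pfd = (int32_t *)&irq_set->data;
	*pfd = eventfd(0, 0);
	ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);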

Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c       |  3 +++
 drivers/vfio/pci/vfio_pci_intrs.c | 19 +++++++++++++++++++
 include/uapi/linux/vfio.h         |  1 +
 3 files changed, 23 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f9e2c8292e60..66d44736e71d 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -495,6 +495,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
 			return 1;
 	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
 		return 1;
+	} else if (irq_type == VFIO_PCI_DMA_FAULT_IRQ_INDEX) {
+		return 1;
 	}
 
 	return 0;
@@ -822,6 +824,7 @@ static long vfio_pci_ioctl(void *device_data,
 		switch (info.index) {
 		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
 		case VFIO_PCI_REQ_IRQ_INDEX:
+		case VFIO_PCI_DMA_FAULT_IRQ_INDEX:
 			break;
 		case VFIO_PCI_ERR_IRQ_INDEX:
 			if (pci_is_pcie(vdev->pdev))
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 1c46045b0e7f..28a96117daf3 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -622,6 +622,18 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_device *vdev,
 					       count, flags, data);
 }
 
+static int vfio_pci_set_dma_fault_trigger(struct vfio_pci_device *vdev,
+					  unsigned index, unsigned start,
+					  unsigned count, uint32_t flags,
+					  void *data)
+{
+	if (index != VFIO_PCI_DMA_FAULT_IRQ_INDEX || start != 0 || count > 1)
+		return -EINVAL;
+
+	return vfio_pci_set_ctx_trigger_single(&vdev->dma_fault_trigger,
+					       count, flags, data);
+}
+
 int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
 			    unsigned index, unsigned start, unsigned count,
 			    void *data)
@@ -671,6 +683,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_device *vdev, uint32_t flags,
 			break;
 		}
 		break;
+	case VFIO_PCI_DMA_FAULT_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_dma_fault_trigger;
+			break;
+		}
+		break;
 	}
 
 	if (!func)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b78c2c62af6d..47b65ef9d448 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -554,6 +554,7 @@ enum {
 	VFIO_PCI_MSIX_IRQ_INDEX,
 	VFIO_PCI_ERR_IRQ_INDEX,
 	VFIO_PCI_REQ_IRQ_INDEX,
+	VFIO_PCI_DMA_FAULT_IRQ_INDEX,
 	VFIO_PCI_NUM_IRQS
 };
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC v3 21/21] vfio: Document nested stage control
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (19 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 20/21] vfio-pci: Add VFIO_PCI_DMA_FAULT_IRQ_INDEX Eric Auger
@ 2019-01-08 10:26 ` Eric Auger
  2019-01-18 10:02 ` [RFC v3 00/21] SMMUv3 Nested Stage Setup Auger Eric
  21 siblings, 0 replies; 59+ messages in thread
From: Eric Auger @ 2019-01-08 10:26 UTC (permalink / raw)
  To: eric.auger.pro, eric.auger, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

New ioctls were introduced to pass information about the guest stage 1
to the host through VFIO. Let's document the nested stage control.
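
For reference, the overall sequence documented below can be condensed into
the following sketch (assuming the uapi introduced by the previous patches;
setup details and error handling elided):

	/* select nested mode; stage 2 is owned by the host */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
	/* ... attach groups, set up stage 2 (GPA -> HPA) mappings ... */

	/* pass the guest stage 1 configuration down to the host */
	ioctl(container, VFIO_IOMMU_SET_PASID_TABLE, &pasid_table_info);

	/* tie the guest IOVA/doorbell-GPA MSI binding to the host */
	ioctl(container, VFIO_IOMMU_BIND_MSI, &guest_binding);

	/* whenever the guest invalidates stage 1 caches */
	ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data);

	/* faults: eventfd registered via VFIO_DEVICE_SET_IRQS (previous
	 * patch); actual records are read from the mmapped NESTED fault
	 * region, incrementing the consumer index as faults are consumed */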

Signed-off-by: Eric Auger <eric.auger@redhat.com>

---

v2 -> v3:
- document the new fault API

v1 -> v2:
- use the new ioctl names
- add doc related to fault handling
---
 Documentation/vfio.txt | 62 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index f1a4d3c3ba0b..620e38ed0c4a 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -239,6 +239,68 @@ group and can access them as follows::
 	/* Gratuitous device reset and go... */
 	ioctl(device, VFIO_DEVICE_RESET);
 
+IOMMU Dual Stage Control
+------------------------
+
+Some IOMMUs support 2 stages/levels of translation. "Stage" corresponds to
+the ARM terminology while "level" corresponds to Intel's VTD terminology. In
+the following text we use either without distinction.
+
+This is useful when the guest is exposed with a virtual IOMMU and some
+devices are assigned to the guest through VFIO. Then the guest OS can use
+stage 1 (IOVA -> GPA), while the hypervisor uses stage 2 for VM isolation
+(GPA -> HPA).
+
+The guest gets ownership of the stage 1 page tables and also owns the stage 1
+configuration structures. The hypervisor owns the root configuration structure
+(for security reasons), including the stage 2 configuration. This works as
+long as the configuration structures and page table formats are compatible
+between the virtual IOMMU and the physical IOMMU.
+
+Assuming the HW supports it, this nested mode is selected by choosing the
+VFIO_TYPE1_NESTING_IOMMU type through:
+
+ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
+
+This forces the hypervisor to use stage 2, leaving stage 1 available for
+guest usage.
+
+Once groups are attached to the container, the guest stage 1 translation
+configuration data can be passed to VFIO by using
+
+ioctl(container, VFIO_IOMMU_SET_PASID_TABLE, &pasid_table_info);
+
+This combines the guest stage 1 configuration structure with the hypervisor
+stage 2 configuration structure. Stage 1 configuration structures are
+dependent on the IOMMU type.
+
+As the stage 1 translation is fully delegated to the HW, physical events
+that may occur (especially translation faults) need to be propagated up to
+the virtualizer and re-injected into the guest.
+
+By using VFIO_DEVICE_SET_IRQS along with the VFIO_PCI_DMA_FAULT_IRQ_INDEX
+index, the virtualizer can register an eventfd signalled whenever a
+fault is observed at the physical level. The actual faults can be retrieved
+from the device fault region whose type/subtype is:
+VFIO_REGION_TYPE_NESTED/VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION.
+
+This region can be mmapped. When a fault is consumed, the user must increment
+the consumer index.
+
+When the guest invalidates stage 1 related caches, invalidations must be
+forwarded to the host through
+ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data);
+Those invalidations can happen at various granularities (page, context, ...).
+
+The ARM SMMU specification introduces another challenge: MSIs are translated by
+both the virtual SMMU and the physical SMMU. To build a nested mapping for the
+IOVA programmed into the assigned device, the guest needs to pass its IOVA/MSI
+doorbell GPA binding to the host. The hypervisor can then build a nested
+stage 2 binding that eventually translates into the physical MSI doorbell.
+
+This is achieved by
+ioctl(container, VFIO_IOMMU_BIND_MSI, &guest_binding);
+
 VFIO User API
 -------------------------------------------------------------------------------
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC v3 14/21] iommu: introduce device fault data
       [not found]   ` <20190110104544.26f3bcb1@jacob-builder>
@ 2019-01-11 11:06     ` Jean-Philippe Brucker
  2019-01-14 22:32       ` Jacob Pan
  2019-01-15 21:27       ` Auger Eric
  0 siblings, 2 replies; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-11 11:06 UTC (permalink / raw)
  To: Jacob Pan, Eric Auger
  Cc: yi.l.liu, kevin.tian, alex.williamson, ashok.raj, kvm,
	peter.maydell, Will Deacon, linux-kernel, Christoffer Dall,
	Marc Zyngier, iommu, Robin Murphy, kvmarm, eric.auger.pro

On 10/01/2019 18:45, Jacob Pan wrote:
> On Tue,  8 Jan 2019 11:26:26 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>
>> Device faults detected by the IOMMU can be reported outside the IOMMU
>> subsystem for further processing. This patch intends to provide
>> generic device fault data such that device drivers can be notified
>> of IOMMU faults without model specific knowledge.
>>
>> The proposed format is the result of discussion at:
>> https://lkml.org/lkml/2017/11/10/291
>> Part of the code is based on Jean-Philippe Brucker's patchset
>> (https://patchwork.kernel.org/patch/9989315/).
>>
>> The assumption is that the model specific IOMMU driver can filter and
>> handle most of the internal faults if the cause is within IOMMU driver
>> control. Therefore, the fault reasons that can be reported are grouped
>> and generalized based on common specifications such as PCI ATS.
>>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> [moved part of the iommu_fault_event struct in the uapi, enriched
>>  the fault reasons to be able to map unrecoverable SMMUv3 errors]
>> ---
>>  include/linux/iommu.h      | 55 ++++++++++++++++++++++++-
>>  include/uapi/linux/iommu.h | 83
>> ++++++++++++++++++++++++++++++++++++++ 2 files changed, 136
>> insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 244c1a3d5989..1dedc2d247c2 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -49,13 +49,17 @@ struct bus_type;
>>  struct device;
>>  struct iommu_domain;
>>  struct notifier_block;
>> +struct iommu_fault_event;
>>  
>>  /* iommu fault flags */
>> -#define IOMMU_FAULT_READ	0x0
>> -#define IOMMU_FAULT_WRITE	0x1
>> +#define IOMMU_FAULT_READ		(1 << 0)
>> +#define IOMMU_FAULT_WRITE		(1 << 1)
>> +#define IOMMU_FAULT_EXEC		(1 << 2)
>> +#define IOMMU_FAULT_PRIV		(1 << 3)
>>  
>>  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
>>  			struct device *, unsigned long, int, void *);
>> +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *,
>> void *); 
>>  struct iommu_domain_geometry {
>>  	dma_addr_t aperture_start; /* First address that can be
>> mapped    */ @@ -255,6 +259,52 @@ struct iommu_device {
>>  	struct device *dev;
>>  };
>>  
>> +/**
>> + * struct iommu_fault_event - Generic per device fault data
>> + *
>> + * - PCI and non-PCI devices
>> + * - Recoverable faults (e.g. page request), information based on
>> PCI ATS
>> + * and PASID spec.
>> + * - Un-recoverable faults of device interest
>> + * - DMA remapping and IRQ remapping faults
>> + *
>> + * @fault: fault descriptor
>> + * @device_private: if present, uniquely identify device-specific
>> + *                  private data for an individual page request.
>> + * @iommu_private: used by the IOMMU driver for storing
>> fault-specific
>> + *                 data. Users should not modify this field before
>> + *                 sending the fault response.
>> + */
>> +struct iommu_fault_event {
>> +	struct iommu_fault fault;
>> +	u64 device_private;
> I think we want to move device_private to uapi since it gets injected
> into the guest, then returned by guest in case of page response. For
> VT-d we also need 128 bits of private data. VT-d spec. 7.7.1

Ah, I didn't notice the format changed in VT-d rev3. On that topic, how
do we manage future extensions to the iommu_fault struct? Should we add
~48 bytes of padding after device_private, along with some flags telling
which field is valid, or deal with it using a structure version like we
do for the invalidate and bind structs? In the first case, iommu_fault
wouldn't fit in a 64-byte cacheline anymore, but I'm not sure we care.
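
For illustration, the first option might look like this (purely a sketch;
the flag name and exact layout are invented):

	struct iommu_fault {
		__u32	type;
		__u32	reason;
		__u64	addr;
		__u64	fetch_addr;
		__u32	pasid;
		__u32	page_req_group_id;
		__u32	last_req;
		__u32	pasid_valid;
		__u32	prot;
		__u32	flags;		/* says which optional fields are valid */
	#define IOMMU_FAULT_DEV_PRIVATE_VALID	(1 << 0)
		__u64	device_private[2];	/* 128 bits for VT-d rev3 */
		__u8	padding[48];		/* room for future extensions */
	};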

> For exception tracking (e.g. unanswered page request), I can add timer
> and list info later when I include PRQ. Sounds OK?
>> +	u64 iommu_private;
[...]
>> +/**
>> + * struct iommu_fault - Generic fault data
>> + *
>> + * @type contains fault type
>> + * @reason fault reasons if relevant outside IOMMU driver.
>> + * IOMMU driver internal faults are not reported.
>> + * @addr: tells the offending page address
>> + * @fetch_addr: tells the address that caused an abort, if any
>> + * @pasid: contains process address space ID, used in shared virtual
>> memory
>> + * @page_req_group_id: page request group index
>> + * @last_req: last request in a page request group
>> + * @pasid_valid: indicates if the PRQ has a valid PASID
>> + * @prot: page access protection flag:
>> + *	IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
>> + */
>> +
>> +struct iommu_fault {
>> +	__u32	type;   /* enum iommu_fault_type */
>> +	__u32	reason; /* enum iommu_fault_reason */
>> +	__u64	addr;
>> +	__u64	fetch_addr;
>> +	__u32	pasid;
>> +	__u32	page_req_group_id;
>> +	__u32	last_req;
>> +	__u32	pasid_valid;
>> +	__u32	prot;
>> +	__u32	access;

What does @access contain? Can it be squashed into @prot?

Thanks,
Jean

> relocated to uapi, Yi can you confirm?
> 	__u64 device_private[2];
> 
>> +};
>>  #endif /* _UAPI_IOMMU_H */
> 
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 09/21] iommu/smmuv3: Get prepared for nested stage support
  2019-01-08 10:26 ` [RFC v3 09/21] iommu/smmuv3: Get prepared for nested stage support Eric Auger
@ 2019-01-11 16:04   ` Jean-Philippe Brucker
  2019-01-25 19:27   ` Robin Murphy
  1 sibling, 0 replies; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-11 16:04 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu, will.deacon,
	robin.murphy
  Cc: marc.zyngier, peter.maydell, kevin.tian, ashok.raj, christoffer.dall

Hi Eric,

On 08/01/2019 10:26, Eric Auger wrote:
> To allow nested stage support, we need to store both
> stage 1 and stage 2 configurations (and remove the former
> union).
> 
> arm_smmu_write_strtab_ent() is modified to write both stage
> fields in the STE.
> 
> We add a nested_bypass field to the S1 configuration as the first
> stage can be bypassed. Also the guest may force the STE to abort:
> this information gets stored into the nested_abort field.
> 
> Only S2 stage is "finalized" as the host does not configure
> S1 CD, guest does.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> v1 -> v2:
> - invalidate the STE before moving from a live STE config to another
> - add the nested_abort and nested_bypass fields
> ---
>  drivers/iommu/arm-smmu-v3.c | 43 ++++++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 9af68266bbb1..9716a301d9ae 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -212,6 +212,7 @@
>  #define STRTAB_STE_0_CFG_BYPASS		4
>  #define STRTAB_STE_0_CFG_S1_TRANS	5
>  #define STRTAB_STE_0_CFG_S2_TRANS	6
> +#define STRTAB_STE_0_CFG_NESTED		7
>  
>  #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)
>  #define STRTAB_STE_0_S1FMT_LINEAR	0
> @@ -491,6 +492,10 @@ struct arm_smmu_strtab_l1_desc {
>  struct arm_smmu_s1_cfg {
>  	__le64				*cdptr;
>  	dma_addr_t			cdptr_dma;
> +	/* in nested mode, tells s1 must be bypassed */
> +	bool				nested_bypass;
> +	/* in nested mode, abort is forced by guest */
> +	bool				nested_abort;
>  
>  	struct arm_smmu_ctx_desc {
>  		u16	asid;
> @@ -515,6 +520,7 @@ struct arm_smmu_strtab_ent {
>  	 * configured according to the domain type.
>  	 */
>  	bool				assigned;
> +	bool				nested;
>  	struct arm_smmu_s1_cfg		*s1_cfg;
>  	struct arm_smmu_s2_cfg		*s2_cfg;
>  };
> @@ -629,10 +635,8 @@ struct arm_smmu_domain {
>  	bool				non_strict;
>  
>  	enum arm_smmu_domain_stage	stage;
> -	union {
> -		struct arm_smmu_s1_cfg	s1_cfg;
> -		struct arm_smmu_s2_cfg	s2_cfg;
> -	};
> +	struct arm_smmu_s1_cfg	s1_cfg;
> +	struct arm_smmu_s2_cfg	s2_cfg;
>  
>  	struct iommu_domain		domain;
>  
> @@ -1139,10 +1143,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,

Could you also update the "This is hideously complicated..." comment
with the nested case? This function was complicated before, but it
becomes hell when adding nested and SVA support, so we really need the
comments :)

>  			break;
>  		case STRTAB_STE_0_CFG_S1_TRANS:
>  		case STRTAB_STE_0_CFG_S2_TRANS:
> +		case STRTAB_STE_0_CFG_NESTED:
>  			ste_live = true;
>  			break;
>  		case STRTAB_STE_0_CFG_ABORT:
> -			if (disable_bypass)
> +			if (disable_bypass || ste->nested)
>  				break;
>  		default:
>  			BUG(); /* STE corruption */
> @@ -1154,7 +1159,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
>  
>  	/* Bypass/fault */
>  	if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) {
> -		if (!ste->assigned && disable_bypass)
> +		if ((!ste->assigned && disable_bypass) ||
> +				(ste->s1_cfg && ste->s1_cfg->nested_abort))

I don't think we're ever reaching this, given that ste->assigned is true
and ste->s2_cfg is set.

Something I find noteworthy is that with STRTAB_STE_0_CFG_ABORT, no
event is recorded in case of DMA fault. For vSMMU you'd want to emulate
the SMMU behavior closely, so you don't want to inject faults if the
guest sets CFG_ABORT, but this way you also can't report errors to the
VMM. If we did want to notify the VMM of faults, we'd need to implement
nested_abort differently, for example by installing an empty context
descriptor with Config=s1translate-s2translate.

>  			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
>  		else
>  			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
> @@ -1172,8 +1178,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
>  		return;
>  	}
>  
> +	if (ste->nested && ste_live) {
> +		/*
> +		 * When enabling nested, the STE may be transitionning from

transitioning (my bad)

> +		 * s2 to nested and back. Invalidate the STE before changing it.
> +		 */
> +		dst[0] = cpu_to_le64(0);
> +		arm_smmu_sync_ste_for_sid(smmu, sid);
> +		val = STRTAB_STE_0_V;

val is already STRTAB_STE_0_V

> +	}
> +
>  	if (ste->s1_cfg) {
> -		BUG_ON(ste_live);
>  		dst[1] = cpu_to_le64(
>  			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
>  			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> @@ -1187,12 +1202,12 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
>  		   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
>  			dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
>  
> -		val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
> -			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS);
> +		if (!ste->s1_cfg->nested_bypass)
> +			val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
> +				FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS);

In patch 10/21, you're handling cfg->bypass == 1 by clearing ste->s1_cfg
(which I think is the best way - resetting the STE like it was when
initializing the nested domain). So can we get rid of
s1_cfg->nested_bypass and this change?

>  	}
>  
>  	if (ste->s2_cfg) {
> -		BUG_ON(ste_live);
>  		dst[2] = cpu_to_le64(
>  			 FIELD_PREP(STRTAB_STE_2_S2VMID, ste->s2_cfg->vmid) |
>  			 FIELD_PREP(STRTAB_STE_2_VTCR, ste->s2_cfg->vtcr) |
> @@ -1454,6 +1469,10 @@ static void arm_smmu_tlb_inv_context(void *cookie)
>  		cmd.opcode	= CMDQ_OP_TLBI_NH_ASID;
>  		cmd.tlbi.asid	= smmu_domain->s1_cfg.cd.asid;
>  		cmd.tlbi.vmid	= 0;
> +	} else if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED) {
> +		cmd.opcode      = CMDQ_OP_TLBI_NH_ASID;
> +		cmd.tlbi.asid   = smmu_domain->s1_cfg.cd.asid;

Using s1_cfg.cd.asid as an interface between cache_invalidate() and
tlb_inv_context() seems racy. In nested mode, s1_cfg.cd really shouldn't
be used. I'd rather cache_invalidate() crafted the commands itself
instead of going through these callbacks. Or you could add a leaf
function that takes asid as argument and is called by both
arm_smmu_tlb_inv_context() and cache_invalidate().
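
Something like this, roughly (a sketch only; the helper name is invented):

	static void arm_smmu_tlb_inv_asid(struct arm_smmu_domain *smmu_domain,
					  u16 asid)
	{
		struct arm_smmu_cmdq_ent cmd = {
			.opcode		= CMDQ_OP_TLBI_NH_ASID,
			.tlbi.asid	= asid,
			/* vmid is 0 unless the domain is nested */
			.tlbi.vmid	= smmu_domain->s2_cfg.vmid,
		};

		arm_smmu_cmdq_issue_cmd(smmu_domain->smmu, &cmd);
		arm_smmu_cmdq_issue_sync(smmu_domain->smmu);
	}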

Thanks,
Jean

> +		cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
>  	} else {
>  		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
>  		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
> @@ -1484,6 +1503,10 @@ static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
>  	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
>  		cmd.opcode	= CMDQ_OP_TLBI_NH_VA;
>  		cmd.tlbi.asid	= smmu_domain->s1_cfg.cd.asid;
> +	} else if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED) {
> +		cmd.opcode      = CMDQ_OP_TLBI_NH_VA;
> +		cmd.tlbi.asid   = smmu_domain->s1_cfg.cd.asid;
> +		cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
>  	} else {
>  		cmd.opcode	= CMDQ_OP_TLBI_S2_IPA;
>  		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 11/21] iommu/smmuv3: Implement cache_invalidate
  2019-01-08 10:26 ` [RFC v3 11/21] iommu/smmuv3: Implement cache_invalidate Eric Auger
@ 2019-01-11 16:59   ` Jean-Philippe Brucker
  0 siblings, 0 replies; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-11 16:59 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu, will.deacon,
	robin.murphy
  Cc: marc.zyngier, peter.maydell, kevin.tian, ashok.raj, christoffer.dall

On 08/01/2019 10:26, Eric Auger wrote:
> Implement IOMMU_INV_TYPE_TLB invalidations. When
> nr_pages is zero we interpret this as a context
> invalidation.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> The user API needs to be refined to discriminate context
> invalidations from NH_VA invalidations. Also the leaf attribute
> is not yet properly handled.
> 
> v2 -> v3:
> - replace __arm_smmu_tlb_sync by arm_smmu_cmdq_issue_sync
> 
> v1 -> v2:
> - properly pass the asid
> ---
>  drivers/iommu/arm-smmu-v3.c | 40 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 0e006babc8a6..ca72e0ce92f6 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -2293,6 +2293,45 @@ static int arm_smmu_set_pasid_table(struct iommu_domain *domain,
>  	return ret;
>  }
>  
> +static int
> +arm_smmu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
> +			  struct iommu_cache_invalidate_info *inv_info)
> +{
> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +
> +	if (smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> +		return -EINVAL;
> +
> +	if (!smmu)
> +		return -EINVAL;
> +
> +	switch (inv_info->hdr.type) {
> +	case IOMMU_INV_TYPE_TLB:
> +		/*
> +		 * TODO: On context invalidation, the userspace sets nr_pages
> +		 * to 0. Refine the API to add a dedicated flags and also
> +		 * properly handle the leaf parameter.
> +		 */

That's what inv->granularity is for: if inv->granularity is PASID_SEL,
then the invalidation is for the whole context (and nr_pages, size,
addr, etc. should be ignored). If inv->granularity is PAGE_PASID, then
it's a range. The names could probably be improved but it's already in
the API.
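
i.e., on the driver side, something along these lines (sketch):

	switch (inv_info->granularity) {
	case IOMMU_INV_GRANU_PASID_SEL:
		/* whole context: nr_pages, size and addr are ignored */
		smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
		arm_smmu_tlb_inv_context(smmu_domain);
		break;
	case IOMMU_INV_GRANU_PAGE_PASID:
		/* range invalidation, as in the else branch above */
		break;
	default:
		return -EINVAL;
	}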

Thanks,
Jean

> +		if (!inv_info->nr_pages) {
> +			smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
> +			arm_smmu_tlb_inv_context(smmu_domain);
> +		} else {
> +			size_t granule = 1 << (inv_info->size + 12);
> +			size_t size = inv_info->nr_pages * granule;
> +
> +			smmu_domain->s1_cfg.cd.asid = inv_info->arch_id;
> +			arm_smmu_tlb_inv_range_nosync(inv_info->addr, size,
> +						      granule, false,
> +						      smmu_domain);
> +			arm_smmu_cmdq_issue_sync(smmu);
> +		}
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
>  static struct iommu_ops arm_smmu_ops = {
>  	.capable		= arm_smmu_capable,
>  	.domain_alloc		= arm_smmu_domain_alloc,
> @@ -2312,6 +2351,7 @@ static struct iommu_ops arm_smmu_ops = {
>  	.get_resv_regions	= arm_smmu_get_resv_regions,
>  	.put_resv_regions	= arm_smmu_put_resv_regions,
>  	.set_pasid_table	= arm_smmu_set_pasid_table,
> +	.cache_invalidate	= arm_smmu_cache_invalidate,
>  	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
>  };
>  
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults
  2019-01-08 10:26 ` [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults Eric Auger
@ 2019-01-11 17:46   ` Jean-Philippe Brucker
  2019-01-15 21:06     ` Auger Eric
  0 siblings, 1 reply; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-11 17:46 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu, will.deacon,
	robin.murphy
  Cc: marc.zyngier, peter.maydell, kevin.tian, ashok.raj, christoffer.dall

On 08/01/2019 10:26, Eric Auger wrote:
> When a stage 1 related fault event is read from the event queue,
> let's propagate it to potential external fault listeners, ie. users
> who registered a fault handler.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>  drivers/iommu/arm-smmu-v3.c | 124 ++++++++++++++++++++++++++++++++----
>  1 file changed, 113 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 999ee470a2ae..6a711cbbb228 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -168,6 +168,26 @@
>  #define ARM_SMMU_PRIQ_IRQ_CFG1		0xd8
>  #define ARM_SMMU_PRIQ_IRQ_CFG2		0xdc
>  
> +/* Events */
> +#define ARM_SMMU_EVT_F_UUT		0x01
> +#define ARM_SMMU_EVT_C_BAD_STREAMID	0x02
> +#define ARM_SMMU_EVT_F_STE_FETCH	0x03
> +#define ARM_SMMU_EVT_C_BAD_STE		0x04
> +#define ARM_SMMU_EVT_F_BAD_ATS_TREQ	0x05
> +#define ARM_SMMU_EVT_F_STREAM_DISABLED	0x06
> +#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN	0x07
> +#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID	0x08
> +#define ARM_SMMU_EVT_F_CD_FETCH		0x09
> +#define ARM_SMMU_EVT_C_BAD_CD		0x0a
> +#define ARM_SMMU_EVT_F_WALK_EABT	0x0b
> +#define ARM_SMMU_EVT_F_TRANSLATION	0x10
> +#define ARM_SMMU_EVT_F_ADDR_SIZE	0x11
> +#define ARM_SMMU_EVT_F_ACCESS		0x12
> +#define ARM_SMMU_EVT_F_PERMISSION	0x13
> +#define ARM_SMMU_EVT_F_TLB_CONFLICT	0x20
> +#define ARM_SMMU_EVT_F_CFG_CONFLICT	0x21
> +#define ARM_SMMU_EVT_E_PAGE_REQUEST	0x24
> +
>  /* Common MSI config fields */
>  #define MSI_CFG0_ADDR_MASK		GENMASK_ULL(51, 2)
>  #define MSI_CFG2_SH			GENMASK(5, 4)
> @@ -333,6 +353,11 @@
>  #define EVTQ_MAX_SZ_SHIFT		7
>  
>  #define EVTQ_0_ID			GENMASK_ULL(7, 0)
> +#define EVTQ_0_SUBSTREAMID		GENMASK_ULL(31, 12)
> +#define EVTQ_0_STREAMID			GENMASK_ULL(63, 32)
> +#define EVTQ_1_S2			GENMASK_ULL(39, 39)
> +#define EVTQ_1_CLASS			GENMASK_ULL(40, 41)
> +#define EVTQ_3_FETCH_ADDR		GENMASK_ULL(51, 3)
>  
>  /* PRI queue */
>  #define PRIQ_ENT_DWORDS			2
> @@ -1270,7 +1295,6 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
>  	return 0;
>  }
>  
> -__maybe_unused
>  static struct arm_smmu_master_data *
>  arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
>  {
> @@ -1296,24 +1320,102 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
>  	return master;
>  }
>  
> +static void arm_smmu_report_event(struct arm_smmu_device *smmu, u64 *evt)
> +{
> +	u64 fetch_addr = FIELD_GET(EVTQ_3_FETCH_ADDR, evt[3]);
> +	u32 sid = FIELD_GET(EVTQ_0_STREAMID, evt[0]);
> +	bool s1 = !FIELD_GET(EVTQ_1_S2, evt[1]);
> +	u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
> +	struct arm_smmu_master_data *master;
> +	struct iommu_fault_event event;
> +	bool propagate = true;
> +	u64 addr = evt[2];
> +	int i;
> +
> +	master = arm_smmu_find_master(smmu, sid);
> +	if (WARN_ON(!master))
> +		return;
> +
> +	event.fault.type = IOMMU_FAULT_DMA_UNRECOV;
> +
> +	switch (type) {
> +	case ARM_SMMU_EVT_C_BAD_STREAMID:
> +		event.fault.reason = IOMMU_FAULT_REASON_SOURCEID_INVALID;
> +		break;
> +	case ARM_SMMU_EVT_F_STREAM_DISABLED:
> +	case ARM_SMMU_EVT_C_BAD_SUBSTREAMID:
> +		event.fault.reason = IOMMU_FAULT_REASON_PASID_INVALID;
> +		break;
> +	case ARM_SMMU_EVT_F_CD_FETCH:
> +		event.fault.reason = IOMMU_FAULT_REASON_PASID_FETCH;
> +		break;
> +	case ARM_SMMU_EVT_F_WALK_EABT:
> +		event.fault.reason = IOMMU_FAULT_REASON_WALK_EABT;
> +		event.fault.addr = addr;
> +		event.fault.fetch_addr = fetch_addr;
> +		propagate = s1;
> +		break;
> +	case ARM_SMMU_EVT_F_TRANSLATION:
> +		event.fault.reason = IOMMU_FAULT_REASON_PTE_FETCH;
> +		event.fault.addr = addr;
> +		event.fault.fetch_addr = fetch_addr;
> +		propagate = s1;
> +		break;
> +	case ARM_SMMU_EVT_F_PERMISSION:
> +		event.fault.reason = IOMMU_FAULT_REASON_PERMISSION;
> +		event.fault.addr = addr;
> +		propagate = s1;
> +		break;
> +	case ARM_SMMU_EVT_F_ACCESS:
> +		event.fault.reason = IOMMU_FAULT_REASON_ACCESS;
> +		event.fault.addr = addr;
> +		propagate = s1;
> +		break;
> +	case ARM_SMMU_EVT_C_BAD_STE:
> +		event.fault.reason = IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY;
> +		break;
> +	case ARM_SMMU_EVT_C_BAD_CD:
> +		event.fault.reason = IOMMU_FAULT_REASON_BAD_PASID_ENTRY;
> +		break;
> +	case ARM_SMMU_EVT_F_ADDR_SIZE:
> +		event.fault.reason = IOMMU_FAULT_REASON_OOR_ADDRESS;
> +		propagate = s1;
> +		break;
> +	case ARM_SMMU_EVT_F_STE_FETCH:
> +		event.fault.reason = IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH;
> +		event.fault.fetch_addr = fetch_addr;
> +		break;
> +	/* End of addition */
> +	case ARM_SMMU_EVT_E_PAGE_REQUEST:
> +	case ARM_SMMU_EVT_F_TLB_CONFLICT:
> +	case ARM_SMMU_EVT_F_CFG_CONFLICT:
> +	case ARM_SMMU_EVT_F_BAD_ATS_TREQ:
> +	case ARM_SMMU_EVT_F_TRANSL_FORBIDDEN:
> +	case ARM_SMMU_EVT_F_UUT:
> +	default:
> +		event.fault.reason = IOMMU_FAULT_REASON_UNKNOWN;
> +	}
> +	/* only propagate the error if it relates to stage 1 */
> +	if (s1)

if (propagate)

But I don't quite understand how we're deciding what to propagate: a
C_BAD_STE is most likely a bug in the SMMU driver, but is reported to
userspace. On the other hand a stage-2 F_TRANSLATION is likely an error
from the VMM (didn't setup stage-2 mappings properly), but we're not
reporting it. Maybe we should add a bit to event.fault that tells
whether the fault was stage 1 or 2, and let the VMM deal with it?
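
For instance, assuming a new flags field were added to iommu_fault
(a sketch; flag name invented):

	#define IOMMU_FAULT_F_STAGE2	(1 << 0)

	event.fault.flags = s1 ? 0 : IOMMU_FAULT_F_STAGE2;
	iommu_report_device_fault(master->dev, &event);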

> +		iommu_report_device_fault(master->dev, &event);

We should return here if the fault is successfully injected

Thanks,
Jean

> +
> +	dev_info(smmu->dev, "event 0x%02x received:\n", type);
> +	for (i = 0; i < EVTQ_ENT_DWORDS; ++i) {
> +		dev_info(smmu->dev, "\t0x%016llx\n",
> +			 (unsigned long long)evt[i]);
> +	}
> +}
> +
>  /* IRQ and event handlers */
>  static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
>  {
> -	int i;
>  	struct arm_smmu_device *smmu = dev;
>  	struct arm_smmu_queue *q = &smmu->evtq.q;
>  	u64 evt[EVTQ_ENT_DWORDS];
>  
>  	do {
> -		while (!queue_remove_raw(q, evt)) {
> -			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
> -
> -			dev_info(smmu->dev, "event 0x%02x received:\n", id);
> -			for (i = 0; i < ARRAY_SIZE(evt); ++i)
> -				dev_info(smmu->dev, "\t0x%016llx\n",
> -					 (unsigned long long)evt[i]);
> -
> -		}
> +		while (!queue_remove_raw(q, evt))
> +			arm_smmu_report_event(smmu, evt);
>  
>  		/*
>  		 * Not much we can do on overflow, so scream and pretend we're
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API
  2019-01-08 10:26 ` [RFC v3 01/21] iommu: Introduce set_pasid_table API Eric Auger
@ 2019-01-11 18:16   ` Jean-Philippe Brucker
  2019-01-25  8:39     ` Auger Eric
  2019-01-11 18:43   ` Alex Williamson
  1 sibling, 1 reply; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-11 18:16 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu, will.deacon,
	robin.murphy
  Cc: marc.zyngier, peter.maydell, kevin.tian, ashok.raj, christoffer.dall

On 08/01/2019 10:26, Eric Auger wrote:
> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> 
> In the virtualization use case, when a guest is assigned
> a PCI host device, protected by a virtual IOMMU,
> the physical IOMMU must be programmed to be consistent with
> the guest mappings. If the physical IOMMU supports two
> translation stages it makes sense to program guest mappings
> onto the first stage/level (ARM/VTD terminology) while the host
> owns the stage/level 2.
> 
> In that case, guest configuration settings must be trapped
> and passed down to the physical iommu driver.
> 
> This patch adds a new API to the iommu subsystem that allows
> setting the pasid table information.
> 
> A generic iommu_pasid_table_config struct is introduced in
> a new iommu.h uapi header. This is going to be used by the VFIO
> user API. We foresee at least two specializations of this struct,
> for PASID table passing and ARM SMMUv3.

Last sentence is a bit confusing. With SMMUv3 it is also used for the
PASID table, even when it only has one entry and PASID is disabled.

> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> This patch generalizes the API introduced by Jacob & co-authors in
> https://lwn.net/Articles/754331/
> 
> v2 -> v3:
> - replace unbind/bind by set_pasid_table
> - move table pointer and pasid bits in the generic part of the struct
> 
> v1 -> v2:
> - restore the original pasid table name
> - remove the struct device * parameter in the API
> - reworked iommu_pasid_smmuv3
> ---
>  drivers/iommu/iommu.c      | 10 ++++++++
>  include/linux/iommu.h      | 14 +++++++++++
>  include/uapi/linux/iommu.h | 50 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 74 insertions(+)
>  create mode 100644 include/uapi/linux/iommu.h
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3ed4db334341..0f2b7f1fc7c8 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_set_pasid_table(struct iommu_domain *domain,
> +			  struct iommu_pasid_table_config *cfg)
> +{
> +	if (unlikely(!domain->ops->set_pasid_table))
> +		return -ENODEV;
> +
> +	return domain->ops->set_pasid_table(domain, cfg);
> +}
> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e90da6b6f3d1..1da2a2357ea4 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -25,6 +25,7 @@
>  #include <linux/errno.h>
>  #include <linux/err.h>
>  #include <linux/of.h>
> +#include <uapi/linux/iommu.h>
>  
>  #define IOMMU_READ	(1 << 0)
>  #define IOMMU_WRITE	(1 << 1)
> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>   * @domain_window_disable: Disable a particular window for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @set_pasid_table: set pasid table
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -226,6 +228,9 @@ struct iommu_ops {
>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>  	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>  
> +	int (*set_pasid_table)(struct iommu_domain *domain,
> +			       struct iommu_pasid_table_config *cfg);
> +
>  	unsigned long pgsize_bitmap;
>  };
>  
> @@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>  			       struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>  				struct device *dev);
> +extern int iommu_set_pasid_table(struct iommu_domain *domain,
> +				 struct iommu_pasid_table_config *cfg);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>  	return NULL;
>  }
>  
> +static inline
> +int iommu_set_pasid_table(struct iommu_domain *domain,
> +			  struct iommu_pasid_table_config *cfg)
> +{
> +	return -ENODEV;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #ifdef CONFIG_IOMMU_DEBUGFS
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> new file mode 100644
> index 000000000000..7a7cf7a3de7c
> --- /dev/null
> +++ b/include/uapi/linux/iommu.h
> @@ -0,0 +1,50 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * IOMMU user API definitions
> + *
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.

I don't think we need both the boilerplate and the SPDX header

> + */
> +
> +#ifndef _UAPI_IOMMU_H
> +#define _UAPI_IOMMU_H
> +
> +#include <linux/types.h>
> +
> +/**
> + * SMMUv3 Stream Table Entry stage 1 related information
> + * @abort: shall the STE lead to abort
> + * @s1fmt: STE s1fmt field as set by the guest
> + * @s1dss: STE s1dss as set by the guest
> + * All field names match the smmu 3.0/3.1 spec (ARM IHI 0070A)

Not really the case for @abort. Could you clarify whether @abort is
valid in combination with @bypass?

> + */
> +struct iommu_pasid_smmuv3 {
> +	__u8 abort;
> +	__u8 s1fmt;
> +	__u8 s1dss;
> +};
> +
> +/**
> + * PASID table data used to bind guest PASID table to the host IOMMU
> + * Note PASID table corresponds to the Context Table on ARM SMMUv3.
> + *
> + * @version: API version to prepare for future extensions
> + * @format: format of the PASID table
> + *
> + */
> +struct iommu_pasid_table_config {
> +#define PASID_TABLE_CFG_VERSION_1 1
> +	__u32	version;
> +#define IOMMU_PASID_FORMAT_SMMUV3	(1 << 0)
> +	__u32	format;
> +	__u64	base_ptr;
> +	__u8	pasid_bits;
> +	__u8	bypass;

We need some padding, in case someone adds a new struct to the union
that requires 64-byte alignment.

And 'bypass' might not be the right name if we're making it common,
maybe 'reset' would be clearer? Or we just need to explain that bypass
is the initial state of a nesting domain.
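
For illustration, explicit padding might look like (sketch):

	struct iommu_pasid_table_config {
		__u32	version;
		__u32	format;
		__u64	base_ptr;
		__u8	pasid_bits;
		__u8	bypass;
		__u8	padding[6];	/* keep the union 8-byte aligned */
		union {
			struct iommu_pasid_smmuv3 smmuv3;
		};
	};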

Thanks,
Jean

> +	union {
> +		struct iommu_pasid_smmuv3 smmuv3;
> +	};
> +};
> +
> +#endif /* _UAPI_IOMMU_H */
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API
  2019-01-08 10:26 ` [RFC v3 01/21] iommu: Introduce set_pasid_table API Eric Auger
  2019-01-11 18:16   ` Jean-Philippe Brucker
@ 2019-01-11 18:43   ` Alex Williamson
  2019-01-25  9:20     ` Auger Eric
  1 sibling, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2019-01-11 18:43 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Tue,  8 Jan 2019 11:26:13 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> 
> In the virtualization use case, when a guest is assigned
> a PCI host device, protected by a virtual IOMMU,
> the physical IOMMU must be programmed to be consistent with
> the guest mappings. If the physical IOMMU supports two
> translation stages it makes sense to program guest mappings
> onto the first stage/level (ARM/VTD terminology) while the host
> owns the stage/level 2.
> 
> In that case, guest configuration settings must be trapped
> and passed down to the physical iommu driver.
> 
> This patch adds a new API to the iommu subsystem that allows
> setting the pasid table information.
> 
> A generic iommu_pasid_table_config struct is introduced in
> a new iommu.h uapi header. This is going to be used by the VFIO
> user API. We foresee at least two specializations of this struct,
> for PASID table passing and ARM SMMUv3.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> This patch generalizes the API introduced by Jacob & co-authors in
> https://lwn.net/Articles/754331/
> 
> v2 -> v3:
> - replace unbind/bind by set_pasid_table
> - move table pointer and pasid bits in the generic part of the struct
> 
> v1 -> v2:
> - restore the original pasid table name
> - remove the struct device * parameter in the API
> - reworked iommu_pasid_smmuv3
> ---
>  drivers/iommu/iommu.c      | 10 ++++++++
>  include/linux/iommu.h      | 14 +++++++++++
>  include/uapi/linux/iommu.h | 50 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 74 insertions(+)
>  create mode 100644 include/uapi/linux/iommu.h
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3ed4db334341..0f2b7f1fc7c8 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_set_pasid_table(struct iommu_domain *domain,
> +			  struct iommu_pasid_table_config *cfg)
> +{
> +	if (unlikely(!domain->ops->set_pasid_table))
> +		return -ENODEV;
> +
> +	return domain->ops->set_pasid_table(domain, cfg);
> +}
> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e90da6b6f3d1..1da2a2357ea4 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -25,6 +25,7 @@
>  #include <linux/errno.h>
>  #include <linux/err.h>
>  #include <linux/of.h>
> +#include <uapi/linux/iommu.h>
>  
>  #define IOMMU_READ	(1 << 0)
>  #define IOMMU_WRITE	(1 << 1)
> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>   * @domain_window_disable: Disable a particular window for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @set_pasid_table: set pasid table
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -226,6 +228,9 @@ struct iommu_ops {
>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>  	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>  
> +	int (*set_pasid_table)(struct iommu_domain *domain,
> +			       struct iommu_pasid_table_config *cfg);
> +
>  	unsigned long pgsize_bitmap;
>  };
>  
> @@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>  			       struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>  				struct device *dev);
> +extern int iommu_set_pasid_table(struct iommu_domain *domain,
> +				 struct iommu_pasid_table_config *cfg);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>  	return NULL;
>  }
>  
> +static inline
> +int iommu_set_pasid_table(struct iommu_domain *domain,
> +			  struct iommu_pasid_table_config *cfg)
> +{
> +	return -ENODEV;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #ifdef CONFIG_IOMMU_DEBUGFS
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> new file mode 100644
> index 000000000000..7a7cf7a3de7c
> --- /dev/null
> +++ b/include/uapi/linux/iommu.h
> @@ -0,0 +1,50 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * IOMMU user API definitions
> + *
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef _UAPI_IOMMU_H
> +#define _UAPI_IOMMU_H
> +
> +#include <linux/types.h>
> +
> +/**
> + * SMMUv3 Stream Table Entry stage 1 related information
> + * @abort: shall the STE lead to abort
> + * @s1fmt: STE s1fmt field as set by the guest
> + * @s1dss: STE s1dss as set by the guest
> + * All field names match the smmu 3.0/3.1 spec (ARM IHI 0070A)
> + */
> +struct iommu_pasid_smmuv3 {
> +	__u8 abort;
> +	__u8 s1fmt;
> +	__u8 s1dss;
> +};
> +

I can find STE.S1DSS and STE.S1FMT in the spec, but not STE.ABORT; is
this something to do with Config[2:0]?  Are we allowed to describe what
these fields are beyond their name and why they're necessary here vs
the other fields, or do the spec restrictions preclude that?

> +/**
> + * PASID table data used to bind guest PASID table to the host IOMMU
> + * Note PASID table corresponds to the Context Table on ARM SMMUv3.
> + *
> + * @version: API version to prepare for future extensions
> + * @format: format of the PASID table
> + *
> + */
> +struct iommu_pasid_table_config {
> +#define PASID_TABLE_CFG_VERSION_1 1
> +	__u32	version;
> +#define IOMMU_PASID_FORMAT_SMMUV3	(1 << 0)
> +	__u32	format;
> +	__u64	base_ptr;
> +	__u8	pasid_bits;
> +	__u8	bypass;
> +	union {
> +		struct iommu_pasid_smmuv3 smmuv3;
> +	};
> +};

Structure is not naturally aligned or explicitly aligned for
interchange with userspace.  It might work for smmuv3 since the
structure is only composed of bytes, but looks troublesome in general.
Should each format type also contain a version?  Is format intended to
be a bit-field or a signature?  It seems we only need a signature, but
with only a single format defined, it looks like a bit-field, which
makes me worry about what we do when we exhaust the bits.  The bypass
field should be better defined: is it 0/1?  zero/non-zero?  more selective?
Thanks,

Alex

> +
> +#endif /* _UAPI_IOMMU_H */


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API
  2019-01-08 10:26 ` [RFC v3 02/21] iommu: Introduce cache_invalidate API Eric Auger
@ 2019-01-11 21:30   ` Alex Williamson
  2019-01-25 16:49     ` Auger Eric
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2019-01-11 21:30 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Tue,  8 Jan 2019 11:26:14 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
> 
> In any virtualization use case, when the first translation stage
> is "owned" by the guest OS, the host IOMMU driver has no knowledge
> of caching structure updates unless the guest invalidation activities
> are trapped by the virtualizer and passed down to the host.
> 
> Since the invalidation data are obtained from user space and will be
> written into the physical IOMMU, we must allow security checks at various
> layers. Therefore, a generic invalidation data format is proposed here;
> model specific IOMMU drivers need to convert it into their own format.
> 
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> v1 -> v2:
> - add arch_id field
> - renamed tlb_invalidate into cache_invalidate as this API allows
>   to invalidate context caches on top of IOTLBs
> 
> v1:
> renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
> header. Commit message reworded.
> ---
>  drivers/iommu/iommu.c      | 14 ++++++
>  include/linux/iommu.h      | 14 ++++++
>  include/uapi/linux/iommu.h | 95 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 123 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 0f2b7f1fc7c8..b2e248770508 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1403,6 +1403,20 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>  
> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
> +			   struct iommu_cache_invalidate_info *inv_info)
> +{
> +	int ret = 0;
> +
> +	if (unlikely(!domain->ops->cache_invalidate))
> +		return -ENODEV;
> +
> +	ret = domain->ops->cache_invalidate(domain, dev, inv_info);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 1da2a2357ea4..96d59886f230 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -186,6 +186,7 @@ struct iommu_resv_region {
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>   * @set_pasid_table: set pasid table
> + * @cache_invalidate: invalidate translation caches
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -231,6 +232,9 @@ struct iommu_ops {
>  	int (*set_pasid_table)(struct iommu_domain *domain,
>  			       struct iommu_pasid_table_config *cfg);
>  
> +	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
> +				struct iommu_cache_invalidate_info *inv_info);
> +
>  	unsigned long pgsize_bitmap;
>  };
>  
> @@ -294,6 +298,9 @@ extern void iommu_detach_device(struct iommu_domain *domain,
>  				struct device *dev);
>  extern int iommu_set_pasid_table(struct iommu_domain *domain,
>  				 struct iommu_pasid_table_config *cfg);
> +extern int iommu_cache_invalidate(struct iommu_domain *domain,
> +				struct device *dev,
> +				struct iommu_cache_invalidate_info *inv_info);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -709,6 +716,13 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
>  {
>  	return -ENODEV;
>  }
> +static inline int
> +iommu_cache_invalidate(struct iommu_domain *domain,
> +		       struct device *dev,
> +		       struct iommu_cache_invalidate_info *inv_info)
> +{
> +	return -ENODEV;
> +}
>  
>  #endif /* CONFIG_IOMMU_API */
>  
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 7a7cf7a3de7c..4605f5cfac84 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
>  	};
>  };
>  
> +/**
> + * enum iommu_inv_granularity - Generic invalidation granularity
> + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:	TLB entries or PASID caches of all
> + *					PASIDs associated with a domain ID
> + * @IOMMU_INV_GRANU_PASID_SEL:		TLB entries or PASID cache associated
> + *					with a PASID and a domain
> + * @IOMMU_INV_GRANU_PAGE_PASID:		TLB entries of selected page range
> + *					within a PASID
> + *
> + * When an invalidation request is passed down to IOMMU to flush translation
> + * caches, it may carry different granularity levels, which can be specific
> + * to certain types of translation caches.
> + * This enum is a collection of granularities for all types of translation
> + * caches. The idea is to make it easy for IOMMU model specific driver to
> + * convert from generic to model specific value. Each IOMMU driver
> + * can enforce check based on its own conversion table. The conversion is
> + * based on 2D look-up with inputs as follows:
> + * - translation cache types
> + * - granularity
> + *
> + *             type |   DTLB    |    TLB    |   PASID   |
> + *  granule         |           |           |   cache   |
> + * -----------------+-----------+-----------+-----------+
> + *  DN_ALL_PASID    |   Y       |   Y       |   Y       |
> + *  PASID_SEL       |   Y       |   Y       |   Y       |
> + *  PAGE_PASID      |   Y       |   Y       |   N/A     |
> + *
> + */
> +enum iommu_inv_granularity {
> +	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
> +	IOMMU_INV_GRANU_PASID_SEL,
> +	IOMMU_INV_GRANU_PAGE_PASID,
> +	IOMMU_INV_NR_GRANU,
> +};
> +
> +/**
> + * enum iommu_inv_type - Generic translation cache types for invalidation
> + *
> + * @IOMMU_INV_TYPE_DTLB:	device IOTLB
> + * @IOMMU_INV_TYPE_TLB:		IOMMU paging structure cache
> + * @IOMMU_INV_TYPE_PASID:	PASID cache
> + * Invalidation requests sent to IOMMU for a given device need to indicate
> + * which type of translation cache to be operated on. Combined with enum
> + * iommu_inv_granularity, model specific driver can do a simple lookup to
> + * convert from generic to model specific value.
> + */
> +enum iommu_inv_type {
> +	IOMMU_INV_TYPE_DTLB,
> +	IOMMU_INV_TYPE_TLB,
> +	IOMMU_INV_TYPE_PASID,
> +	IOMMU_INV_NR_TYPE
> +};
> +
> +/**
> + * Translation cache invalidation header that contains mandatory meta data.
> + * @version:	info format version, expecting future extensions
> + * @type:	type of translation cache to be invalidated
> + */
> +struct iommu_cache_invalidate_hdr {
> +	__u32 version;
> +#define TLB_INV_HDR_VERSION_1 1
> +	enum iommu_inv_type type;
> +};
> +
> +/**
> + * Translation cache invalidation information, contains generic IOMMU
> + * data which can be parsed based on model ID by model specific drivers.
> + * Since the invalidation of second level page tables is included in the
> + * unmap operation, this info is only applicable to the first level
> + * translation caches, i.e. DMA request with PASID.
> + *
> + * @granularity:	requested invalidation granularity, type dependent
> + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.

Why is this a 4K page centric interface?

> + * @nr_pages:		number of pages to invalidate
> + * @pasid:		processor address space ID value per PCI spec.
> + * @arch_id:		architecture dependent id characterizing a context
> + *			and tagging the caches, ie. domain Identfier on VTD,
> + *			asid on ARM SMMU
> + * @addr:		page address to be invalidated
> + * @flags		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
> + *			IOMMU_INVALIDATE_GLOBAL_PAGE: global pages

Shouldn't some of these be tied to the granularity of the
invalidation?  It seems like this should be more similar to
iommu_pasid_table_config where the granularity of the invalidation
defines which entry within a union at the end of the structure is valid
and populated.  Otherwise we have fields that don't make sense for
certain invalidations.
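
For instance (a sketch; the layout is invented purely to illustrate the
point):

	struct iommu_cache_invalidate_info {
		__u32	version;
		__u32	type;		/* enum iommu_inv_type */
		__u32	granularity;	/* selects the valid union member */
		__u32	padding;
		union {
			struct {
				__u64	arch_id;
			} domain_all_pasid;
			struct {
				__u64	arch_id;
				__u32	pasid;
				__u32	padding;
			} pasid_sel;
			struct {
				__u64	addr;
				__u64	nr_pages;
				__u32	pasid;
				__u32	flags;
				__u8	size;
				__u8	padding[7];
			} page_pasid;
		};
	};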

> + *
> + */
> +struct iommu_cache_invalidate_info {
> +	struct iommu_cache_invalidate_hdr	hdr;
> +	enum iommu_inv_granularity	granularity;

A separate structure for hdr seems a little pointless.

> +	__u32		flags;
> +#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 0)
> +#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 1)
> +	__u8		size;

Really need some padding or packing here for any hope of having
consistency with userspace.

> +	__u64		nr_pages;
> +	__u32		pasid;

Sub-optimal ordering for packing/padding.  Thanks,

Alex

> +	__u64		arch_id;
> +	__u64		addr;
> +};
>  #endif /* _UAPI_IOMMU_H */


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 03/21] iommu: Introduce bind_guest_msi
  2019-01-08 10:26 ` [RFC v3 03/21] iommu: Introduce bind_guest_msi Eric Auger
@ 2019-01-11 22:44   ` Alex Williamson
  2019-01-25 17:51     ` Auger Eric
  2019-01-25 18:11     ` Auger Eric
  0 siblings, 2 replies; 59+ messages in thread
From: Alex Williamson @ 2019-01-11 22:44 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Tue,  8 Jan 2019 11:26:15 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> On ARM, MSIs are translated by the SMMU. An IOVA is allocated
> for each MSI doorbell. If both the host and the guest are exposed
> with SMMUs, we end up with 2 different IOVAs allocated by each: the
> guest allocates an IOVA (gIOVA) to map onto the guest MSI
> doorbell (gDB), and the host allocates another IOVA (hIOVA) to map
> onto the physical doorbell (hDB).
> 
> So we end up with 2 untied mappings:
>          S1            S2
> gIOVA    ->    gDB
>               hIOVA    ->    gDB
                               ^^^ hDB

> Currently the PCI device is programmed by the host with hIOVA
> as MSI doorbell. So this does not work.
> 
> This patch introduces an API to pass gIOVA/gDB to the host so
> that gIOVA can be reused by the host instead of re-allocating
> a new IOVA. So the goal is to create the following nested mapping:
> 
>          S1            S2
> gIOVA    ->    gDB     ->    hDB
> 
> and program the PCI device with gIOVA MSI doorbell.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> v2 -> v3:
> - add a struct device handle
> ---
>  drivers/iommu/iommu.c      | 10 ++++++++++
>  include/linux/iommu.h      | 13 +++++++++++++
>  include/uapi/linux/iommu.h |  6 ++++++
>  3 files changed, 29 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index b2e248770508..ea11442e7054 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1431,6 +1431,16 @@ static void __iommu_detach_device(struct iommu_domain *domain,
>  	trace_detach_device_from_domain(dev);
>  }
>  
> +int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
> +			 struct iommu_guest_msi_binding *binding)
> +{
> +	if (unlikely(!domain->ops->bind_guest_msi))
> +		return -ENODEV;
> +
> +	return domain->ops->bind_guest_msi(domain, dev, binding);
> +}
> +EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
> +
>  void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
>  {
>  	struct iommu_group *group;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 96d59886f230..244c1a3d5989 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -235,6 +235,9 @@ struct iommu_ops {
>  	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
>  				struct iommu_cache_invalidate_info *inv_info);
>  
> +	int (*bind_guest_msi)(struct iommu_domain *domain, struct device *dev,
> +			      struct iommu_guest_msi_binding *binding);
> +
>  	unsigned long pgsize_bitmap;
>  };
>  
> @@ -301,6 +304,9 @@ extern int iommu_set_pasid_table(struct iommu_domain *domain,
>  extern int iommu_cache_invalidate(struct iommu_domain *domain,
>  				struct device *dev,
>  				struct iommu_cache_invalidate_info *inv_info);
> +extern int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
> +				struct iommu_guest_msi_binding *binding);
> +
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> @@ -724,6 +730,13 @@ iommu_cache_invalidate(struct iommu_domain *domain,
>  	return -ENODEV;
>  }
>  
> +static inline
> +int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
> +			 struct iommu_guest_msi_binding *binding)
> +{
> +	return -ENODEV;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #ifdef CONFIG_IOMMU_DEBUGFS
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 4605f5cfac84..f28cd9a1aa96 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -142,4 +142,10 @@ struct iommu_cache_invalidate_info {
>  	__u64		arch_id;
>  	__u64		addr;
>  };
> +
> +struct iommu_guest_msi_binding {
> +	__u64		iova;
> +	__u64		gpa;
> +	__u32		granule;

What's granule?  The size?  This looks a lot like just a stage 1
mapping interface; I can't really figure out from the description how
this maps to any specific MSI mapping.  Zero comments in the code
or headers here about how this is supposed to work.  Thanks,

Alex
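
For reference, a minimal sketch of how a caller might use the new hook,
assuming (since nothing documents it) that granule encodes the size in
bytes of the doorbell mapping:

	/*
	 * Hypothetical caller, run once the VMM learns which gIOVA the
	 * guest programmed for its MSI doorbell (gDB). The field
	 * semantics here are assumptions, not documented by the patch.
	 */
	struct iommu_guest_msi_binding binding = {
		.iova		= giova,	/* guest IOVA of the doorbell */
		.gpa		= gdb_gpa,	/* GPA of the guest doorbell */
		.granule	= 0x1000,	/* assumed: mapping size */
	};
	ret = iommu_bind_guest_msi(domain, dev, &binding);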

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 04/21] vfio: VFIO_IOMMU_SET_PASID_TABLE
  2019-01-08 10:26 ` [RFC v3 04/21] vfio: VFIO_IOMMU_SET_PASID_TABLE Eric Auger
@ 2019-01-11 22:50   ` Alex Williamson
  2019-01-15 21:34     ` Auger Eric
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2019-01-11 22:50 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Tue,  8 Jan 2019 11:26:16 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
> 
> This patch adds VFIO_IOMMU_SET_PASID_TABLE ioctl which aims at
> passing the virtual iommu guest configuration to the VFIO driver and
> down to the iommu subsystem.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> v2 -> v3:
> - s/BIND_PASID_TABLE/SET_PASID_TABLE
> 
> v1 -> v2:
> - s/BIND_GUEST_STAGE/BIND_PASID_TABLE
> - remove the struct device arg
> ---
>  drivers/vfio/vfio_iommu_type1.c | 31 +++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  8 ++++++++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 7651cfb14836..d9dd23f64f00 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1644,6 +1644,24 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
>  	return ret;
>  }
>  
> +static int
> +vfio_set_pasid_table(struct vfio_iommu *iommu,
> +		      struct vfio_iommu_type1_set_pasid_table *ustruct)
> +{
> +	struct vfio_domain *d;
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		ret = iommu_set_pasid_table(d->domain, &ustruct->config);
> +		if (ret)
> +			break;
> +	}

There's no unwind on failure here, which leaves us in an inconsistent
state should something go wrong or should domains not have homogeneous
PASID support.  What's expected to happen if a PASID table is already set for
a domain, does it replace the old one or return -EBUSY?
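
One possible shape for the unwind, sketched under the assumption that a
previously saved config can be re-installed (no such save/restore
exists in this series; old_config is hypothetical):

	list_for_each_entry(d, &iommu->domain_list, next) {
		ret = iommu_set_pasid_table(d->domain, &ustruct->config);
		if (ret)
			goto unwind;
	}
	mutex_unlock(&iommu->lock);
	return 0;
unwind:
	/* roll back only the domains already updated */
	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
		iommu_set_pasid_table(d->domain, &old_config);
	mutex_unlock(&iommu->lock);
	return ret;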

> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -1714,6 +1732,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_SET_PASID_TABLE) {
> +		struct vfio_iommu_type1_set_pasid_table ustruct;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_set_pasid_table,
> +				    config);
> +
> +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (ustruct.argsz < minsz || ustruct.flags)
> +			return -EINVAL;
> +
> +		return vfio_set_pasid_table(iommu, &ustruct);
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 02bb7ad6e986..0d9f4090c95d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -14,6 +14,7 @@
>  
>  #include <linux/types.h>
>  #include <linux/ioctl.h>
> +#include <linux/iommu.h>
>  
>  #define VFIO_API_VERSION	0
>  
> @@ -759,6 +760,13 @@ struct vfio_iommu_type1_dma_unmap {
>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +struct vfio_iommu_type1_set_pasid_table {
> +	__u32	argsz;
> +	__u32	flags;
> +	struct iommu_pasid_table_config config;
> +};
> +#define VFIO_IOMMU_SET_PASID_TABLE	_IO(VFIO_TYPE, VFIO_BASE + 22)

-ENOCOMMENTS  Thanks,

Alex

> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 06/21] vfio: VFIO_IOMMU_BIND_MSI
  2019-01-08 10:26 ` [RFC v3 06/21] vfio: VFIO_IOMMU_BIND_MSI Eric Auger
@ 2019-01-11 23:02   ` Alex Williamson
  2019-01-11 23:23     ` Alex Williamson
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2019-01-11 23:02 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Tue,  8 Jan 2019 11:26:18 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> This patch adds the VFIO_IOMMU_BIND_MSI ioctl which aims at
> passing the guest MSI binding to the host.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> v2 -> v3:
> - adapt to new proto of bind_guest_msi
> - directly use vfio_iommu_for_each_dev
> 
> v1 -> v2:
> - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
> ---
>  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  7 +++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c3ba3f249438..59229f6e2d84 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1673,6 +1673,15 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
>  	return iommu_cache_invalidate(d, dev, &ustruct->info);
>  }
>  
> +static int vfio_bind_guest_msi_fn(struct device *dev, void *data)
> +{
> +	struct vfio_iommu_type1_bind_guest_msi *ustruct =
> +		(struct vfio_iommu_type1_bind_guest_msi *)data;
> +	struct iommu_domain *d = iommu_get_domain_for_dev(dev);
> +
> +	return iommu_bind_guest_msi(d, dev, &ustruct->binding);
> +}
> +
>  static int
>  vfio_set_pasid_table(struct vfio_iommu *iommu,
>  		      struct vfio_iommu_type1_set_pasid_table *ustruct)
> @@ -1792,6 +1801,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  					      vfio_cache_inv_fn);
>  		mutex_unlock(&iommu->lock);
>  		return ret;
> +	} else if (cmd == VFIO_IOMMU_BIND_MSI) {
> +		struct vfio_iommu_type1_bind_guest_msi ustruct;
> +		int ret;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_bind_guest_msi,
> +				    binding);
> +
> +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (ustruct.argsz < minsz || ustruct.flags)
> +			return -EINVAL;
> +
> +		mutex_lock(&iommu->lock);
> +		ret = vfio_iommu_for_each_dev(iommu, &ustruct,
> +					      vfio_bind_guest_msi_fn);

The vfio_iommu_for_each_dev() interface is fine for invalidation, where
a partial failure requires no unwind, but it's not sufficiently robust
here.
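
On a mid-iteration failure we would want something like the sketch
below, which isn't possible as proposed since there is no unbind
counterpart (vfio_unbind_guest_msi_fn here is hypothetical):

	ret = vfio_iommu_for_each_dev(iommu, &ustruct,
				      vfio_bind_guest_msi_fn);
	if (ret)	/* walk back the devices already bound */
		vfio_iommu_for_each_dev(iommu, &ustruct,
					vfio_unbind_guest_msi_fn);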

> +		mutex_unlock(&iommu->lock);
> +		return ret;
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 11a07165e7e1..352e795a93c8 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -774,6 +774,13 @@ struct vfio_iommu_type1_cache_invalidate {
>  };
>  #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 23)
>  
> +struct vfio_iommu_type1_bind_guest_msi {
> +	__u32   argsz;
> +	__u32   flags;
> +	struct iommu_guest_msi_binding binding;
> +};
> +#define VFIO_IOMMU_BIND_MSI      _IO(VFIO_TYPE, VFIO_BASE + 24)

-ENOCOMMENTS  MSIs are set up and torn down, is this only a machine-init
sort of interface?  How does the user un-bind?  Thanks,

Alex

> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 06/21] vfio: VFIO_IOMMU_BIND_MSI
  2019-01-11 23:02   ` Alex Williamson
@ 2019-01-11 23:23     ` Alex Williamson
  0 siblings, 0 replies; 59+ messages in thread
From: Alex Williamson @ 2019-01-11 23:23 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Fri, 11 Jan 2019 16:02:44 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue,  8 Jan 2019 11:26:18 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
> > This patch adds the VFIO_IOMMU_BIND_MSI ioctl which aims at
> > passing the guest MSI binding to the host.
> > 
> > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > 
> > ---
> > 
> > v2 -> v3:
> > - adapt to new proto of bind_guest_msi
> > - directly use vfio_iommu_for_each_dev
> > 
> > v1 -> v2:
> > - s/vfio_iommu_type1_guest_msi_binding/vfio_iommu_type1_bind_guest_msi
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  7 +++++++
> >  2 files changed, 34 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index c3ba3f249438..59229f6e2d84 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -1673,6 +1673,15 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
> >  	return iommu_cache_invalidate(d, dev, &ustruct->info);
> >  }
> >  
> > +static int vfio_bind_guest_msi_fn(struct device *dev, void *data)
> > +{
> > +	struct vfio_iommu_type1_bind_guest_msi *ustruct =
> > +		(struct vfio_iommu_type1_bind_guest_msi *)data;
> > +	struct iommu_domain *d = iommu_get_domain_for_dev(dev);
> > +
> > +	return iommu_bind_guest_msi(d, dev, &ustruct->binding);
> > +}
> > +
> >  static int
> >  vfio_set_pasid_table(struct vfio_iommu *iommu,
> >  		      struct vfio_iommu_type1_set_pasid_table *ustruct)
> > @@ -1792,6 +1801,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  					      vfio_cache_inv_fn);
> >  		mutex_unlock(&iommu->lock);
> >  		return ret;
> > +	} else if (cmd == VFIO_IOMMU_BIND_MSI) {
> > +		struct vfio_iommu_type1_bind_guest_msi ustruct;
> > +		int ret;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_bind_guest_msi,
> > +				    binding);
> > +
> > +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (ustruct.argsz < minsz || ustruct.flags)
> > +			return -EINVAL;
> > +
> > +		mutex_lock(&iommu->lock);
> > +		ret = vfio_iommu_for_each_dev(iommu, &ustruct,
> > +					      vfio_bind_guest_msi_fn);  
> 
> The vfio_iommu_for_each_dev() interface is fine for invalidation, where
> a partial failure requires no unwind, but it's not sufficiently robust
> here.

Additionally, what happens as devices are added and removed from the
guest?  Are we designing an interface that specifically precludes
hotplug?  Thanks,

Alex
 
> > +		mutex_unlock(&iommu->lock);
> > +		return ret;
> >  	}
> >  
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 11a07165e7e1..352e795a93c8 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -774,6 +774,13 @@ struct vfio_iommu_type1_cache_invalidate {
> >  };
> >  #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 23)
> >  
> > +struct vfio_iommu_type1_bind_guest_msi {
> > +	__u32   argsz;
> > +	__u32   flags;
> > +	struct iommu_guest_msi_binding binding;
> > +};
> > +#define VFIO_IOMMU_BIND_MSI      _IO(VFIO_TYPE, VFIO_BASE + 24)  
> 
> -ENOCOMMENTS  MSIs are set up and torn down, is this only a machine-init
> sort of interface?  How does the user un-bind?  Thanks,  
> 
> Alex
> 
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >  
> >  /*  
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type
  2019-01-08 10:26 ` [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type Eric Auger
@ 2019-01-11 23:58   ` Alex Williamson
  2019-01-14 20:48     ` Auger Eric
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2019-01-11 23:58 UTC (permalink / raw)
  To: Eric Auger
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Tue,  8 Jan 2019 11:26:30 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> This patch adds a new 64kB region aiming to report nested mode
> translation faults.
> 
> The region contains a header with the size of the queue,
> the producer and consumer indices and then the actual
> fault queue data. The producer is updated by the kernel while
> the consumer is updated by the userspace.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> ---
>  drivers/vfio/pci/vfio_pci.c         | 102 +++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>  include/uapi/linux/vfio.h           |  15 ++++
>  3 files changed, 118 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index ff60bd1ea587..2ba181ab2edd 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -56,6 +56,11 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(disable_idle_d3,
>  		 "Disable using the PCI D3 low power state for idle, unused devices");
>  
> +#define VFIO_FAULT_REGION_SIZE 0x10000

Why 64K?

> +#define VFIO_FAULT_QUEUE_SIZE	\
> +	((VFIO_FAULT_REGION_SIZE - sizeof(struct vfio_fault_region_header)) / \
> +	sizeof(struct iommu_fault))
> +
>  static inline bool vfio_vga_disabled(void)
>  {
>  #ifdef CONFIG_VFIO_PCI_VGA
> @@ -1226,6 +1231,100 @@ static const struct vfio_device_ops vfio_pci_ops = {
>  static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
>  static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
>  
> +static size_t
> +vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf,
> +		      size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	void *base = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (copy_to_user(buf, base + pos, count))
> +		return -EFAULT;
> +
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev,
> +				   struct vfio_pci_region *region,
> +				   struct vm_area_struct *vma)
> +{
> +	u64 phys_len, req_len, pgoff, req_start;
> +	unsigned long long addr;
> +	unsigned int index;
> +
> +	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +
> +	if (vma->vm_end < vma->vm_start)
> +		return -EINVAL;
> +	if ((vma->vm_flags & VM_SHARED) == 0)
> +		return -EINVAL;
> +
> +	phys_len = VFIO_FAULT_REGION_SIZE;
> +
> +	req_len = vma->vm_end - vma->vm_start;
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	req_start = pgoff << PAGE_SHIFT;
> +
> +	if (req_start + req_len > phys_len)
> +		return -EINVAL;
> +
> +	addr = virt_to_phys(vdev->fault_region);
> +	vma->vm_private_data = vdev;
> +	vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff;
> +
> +	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> +			       req_len, vma->vm_page_prot);
> +}
> +
> +void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev,
> +				struct vfio_pci_region *region)
> +{
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {
> +	.rw		= vfio_pci_dma_fault_rw,
> +	.mmap		= vfio_pci_dma_fault_mmap,
> +	.release	= vfio_pci_dma_fault_release,
> +};
> +
> +static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
> +{
> +	u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
> +		    VFIO_REGION_INFO_FLAG_MMAP;
> +	int ret;
> +
> +	spin_lock_init(&vdev->fault_queue_lock);
> +
> +	vdev->fault_region = kmalloc(VFIO_FAULT_REGION_SIZE, GFP_KERNEL);
> +	if (!vdev->fault_region)
> +		return -ENOMEM;
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +		VFIO_REGION_TYPE_NESTED,
> +		VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION,
> +		&vfio_pci_dma_fault_regops, VFIO_FAULT_REGION_SIZE,
> +		flags, vdev->fault_region);
> +	if (ret) {
> +		kfree(vdev->fault_region);
> +		return ret;
> +	}
> +
> +	vdev->fault_region->header.prod = 0;
> +	vdev->fault_region->header.cons = 0;
> +	vdev->fault_region->header.reserved = 0;

Use kzalloc above or else we're leaking kernel memory to userspace
anyway.

> +	vdev->fault_region->header.size = VFIO_FAULT_QUEUE_SIZE;
> +	return 0;
> +}
> +
>  static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
>  	struct vfio_pci_device *vdev;
> @@ -1300,7 +1399,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  		pci_set_power_state(pdev, PCI_D3hot);
>  	}
>  
> -	return ret;
> +	return vfio_pci_init_dma_fault_region(vdev);

Missing lots of cleanup should this fail.  Why is this done on probe
anyway?  This looks like something we'd do from vfio_pci_enable() and
therefore our release callback would free fault_region rather than what
we have below.

>  }
>  
>  static void vfio_pci_remove(struct pci_dev *pdev)
> @@ -1315,6 +1414,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>  
>  	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
>  	kfree(vdev->region);
> +	kfree(vdev->fault_region);
>  	mutex_destroy(&vdev->ioeventfds_lock);
>  	kfree(vdev);
>  
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 8c0009f00818..38b5d1764a26 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -120,6 +120,8 @@ struct vfio_pci_device {
>  	int			ioeventfds_nr;
>  	struct eventfd_ctx	*err_trigger;
>  	struct eventfd_ctx	*req_trigger;
> +	spinlock_t              fault_queue_lock;
> +	struct vfio_fault_region *fault_region;
>  	struct list_head	dummy_resources_list;
>  	struct mutex		ioeventfds_lock;
>  	struct list_head	ioeventfds_list;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 352e795a93c8..b78c2c62af6d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -307,6 +307,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
>  
> +#define VFIO_REGION_TYPE_NESTED			(2)
> +#define VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION	(1)
> +
>  /**
>   * struct vfio_region_gfx_edid - EDID region layout.
>   *
> @@ -697,6 +700,18 @@ struct vfio_device_ioeventfd {
>  
>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +struct vfio_fault_region_header {
> +	__u32	size;		/* Read-Only */
> +	__u32	prod;		/* Read-Only */

We can't really enforce read-only if it's mmap'd.  I worry about
synchronization here too, perhaps there should be a ring offset such
that the ring can be in a separate page from the header and then sparse
mmap support can ensure that the user access is restricted.  I also
wonder if there are other transports that make sense here, this almost
feels like a vhost sort of thing.  Thanks,

Alex
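
To sketch the ring-offset idea: keep the header in the first page,
reachable only through read/write, and advertise only the ring pages
via the existing sparse mmap capability (offsets illustrative):

	/* one mmap'able area covering the ring, excluding the header page */
	struct vfio_region_sparse_mmap_area area = {
		.offset	= PAGE_SIZE,
		.size	= VFIO_FAULT_REGION_SIZE - PAGE_SIZE,
	};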

> +	__u32	cons;
> +	__u32	reserved;	/* must be 0 */
> +};
> +
> +struct vfio_fault_region {
> +	struct vfio_fault_region_header header;
> +	struct iommu_fault queue[0];
> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type
  2019-01-11 23:58   ` Alex Williamson
@ 2019-01-14 20:48     ` Auger Eric
  2019-01-14 23:04       ` Alex Williamson
  0 siblings, 1 reply; 59+ messages in thread
From: Auger Eric @ 2019-01-14 20:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

Hi Alex,

On 1/12/19 12:58 AM, Alex Williamson wrote:
> On Tue,  8 Jan 2019 11:26:30 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
>> This patch adds a new 64kB region aiming to report nested mode
>> translation faults.
>>
>> The region contains a header with the size of the queue,
>> the producer and consumer indices and then the actual
>> fault queue data. The producer is updated by the kernel while
>> the consumer is updated by the userspace.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>> ---
>>  drivers/vfio/pci/vfio_pci.c         | 102 +++++++++++++++++++++++++++-
>>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>>  include/uapi/linux/vfio.h           |  15 ++++
>>  3 files changed, 118 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index ff60bd1ea587..2ba181ab2edd 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -56,6 +56,11 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
>>  MODULE_PARM_DESC(disable_idle_d3,
>>  		 "Disable using the PCI D3 low power state for idle, unused devices");
>>  
>> +#define VFIO_FAULT_REGION_SIZE 0x10000
> 
> Why 64K?
For the region to be mmappable with 64kB page size.
> 
>> +#define VFIO_FAULT_QUEUE_SIZE	\
>> +	((VFIO_FAULT_REGION_SIZE - sizeof(struct vfio_fault_region_header)) / \
>> +	sizeof(struct iommu_fault))
>> +
>>  static inline bool vfio_vga_disabled(void)
>>  {
>>  #ifdef CONFIG_VFIO_PCI_VGA
>> @@ -1226,6 +1231,100 @@ static const struct vfio_device_ops vfio_pci_ops = {
>>  static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
>>  static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
>>  
>> +static size_t
>> +vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf,
>> +		      size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> +	void *base = vdev->region[i].data;
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +
>> +	if (pos >= vdev->region[i].size)
>> +		return -EINVAL;
>> +
>> +	count = min(count, (size_t)(vdev->region[i].size - pos));
>> +
>> +	if (copy_to_user(buf, base + pos, count))
>> +		return -EFAULT;
>> +
>> +	*ppos += count;
>> +
>> +	return count;
>> +}
>> +
>> +static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev,
>> +				   struct vfio_pci_region *region,
>> +				   struct vm_area_struct *vma)
>> +{
>> +	u64 phys_len, req_len, pgoff, req_start;
>> +	unsigned long long addr;
>> +	unsigned int index;
>> +
>> +	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>> +
>> +	if (vma->vm_end < vma->vm_start)
>> +		return -EINVAL;
>> +	if ((vma->vm_flags & VM_SHARED) == 0)
>> +		return -EINVAL;
>> +
>> +	phys_len = VFIO_FAULT_REGION_SIZE;
>> +
>> +	req_len = vma->vm_end - vma->vm_start;
>> +	pgoff = vma->vm_pgoff &
>> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>> +	req_start = pgoff << PAGE_SHIFT;
>> +
>> +	if (req_start + req_len > phys_len)
>> +		return -EINVAL;
>> +
>> +	addr = virt_to_phys(vdev->fault_region);
>> +	vma->vm_private_data = vdev;
>> +	vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff;
>> +
>> +	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
>> +			       req_len, vma->vm_page_prot);
>> +}
>> +
>> +void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev,
>> +				struct vfio_pci_region *region)
>> +{
>> +}
>> +
>> +static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {
>> +	.rw		= vfio_pci_dma_fault_rw,
>> +	.mmap		= vfio_pci_dma_fault_mmap,
>> +	.release	= vfio_pci_dma_fault_release,
>> +};
>> +
>> +static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
>> +{
>> +	u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
>> +		    VFIO_REGION_INFO_FLAG_MMAP;
>> +	int ret;
>> +
>> +	spin_lock_init(&vdev->fault_queue_lock);
>> +
>> +	vdev->fault_region = kmalloc(VFIO_FAULT_REGION_SIZE, GFP_KERNEL);
>> +	if (!vdev->fault_region)
>> +		return -ENOMEM;
>> +
>> +	ret = vfio_pci_register_dev_region(vdev,
>> +		VFIO_REGION_TYPE_NESTED,
>> +		VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION,
>> +		&vfio_pci_dma_fault_regops, VFIO_FAULT_REGION_SIZE,
>> +		flags, vdev->fault_region);
>> +	if (ret) {
>> +		kfree(vdev->fault_region);
>> +		return ret;
>> +	}
>> +
>> +	vdev->fault_region->header.prod = 0;
>> +	vdev->fault_region->header.cons = 0;
>> +	vdev->fault_region->header.reserved = 0;
> 
> Use kzalloc above or else we're leaking kernel memory to userspace
> anyway.
sure
> 
>> +	vdev->fault_region->header.size = VFIO_FAULT_QUEUE_SIZE;
>> +	return 0;
>> +}
>> +
>>  static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>  {
>>  	struct vfio_pci_device *vdev;
>> @@ -1300,7 +1399,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>  		pci_set_power_state(pdev, PCI_D3hot);
>>  	}
>>  
>> -	return ret;
>> +	return vfio_pci_init_dma_fault_region(vdev);
> 
> Missing lots of cleanup should this fail.  Why is this done on probe
> anyway?  This looks like something we'd do from vfio_pci_enable() and
> therefore our release callback would free fault_region rather than what
> we have below.
OK. That's fine to put in the vfio_pci_enable().
> 
>>  }
>>  
>>  static void vfio_pci_remove(struct pci_dev *pdev)
>> @@ -1315,6 +1414,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>>  
>>  	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
>>  	kfree(vdev->region);
>> +	kfree(vdev->fault_region);
>>  	mutex_destroy(&vdev->ioeventfds_lock);
>>  	kfree(vdev);
>>  
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>> index 8c0009f00818..38b5d1764a26 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -120,6 +120,8 @@ struct vfio_pci_device {
>>  	int			ioeventfds_nr;
>>  	struct eventfd_ctx	*err_trigger;
>>  	struct eventfd_ctx	*req_trigger;
>> +	spinlock_t              fault_queue_lock;
>> +	struct vfio_fault_region *fault_region;
>>  	struct list_head	dummy_resources_list;
>>  	struct mutex		ioeventfds_lock;
>>  	struct list_head	ioeventfds_list;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 352e795a93c8..b78c2c62af6d 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -307,6 +307,9 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_TYPE_GFX                    (1)
>>  #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
>>  
>> +#define VFIO_REGION_TYPE_NESTED			(2)
>> +#define VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION	(1)
>> +
>>  /**
>>   * struct vfio_region_gfx_edid - EDID region layout.
>>   *
>> @@ -697,6 +700,18 @@ struct vfio_device_ioeventfd {
>>  
>>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>>  
>> +struct vfio_fault_region_header {
>> +	__u32	size;		/* Read-Only */
>> +	__u32	prod;		/* Read-Only */
> 
> We can't really enforce read-only if it's mmap'd.
Do we really need to? Assuming the kernel always uses
VFIO_FAULT_QUEUE_SIZE to check the prod and cons indices - which is not
the case at the moment by the way :-( - the queue cannot be overflowed.

The header also can be checked each time the kernel fills in an event
in the queue (vfio_pci_iommu_dev_fault_handler). If it is inconsistent
the kernel may stop using the queue. If the user-space tampers with
those RO fields, this will break error reporting for the guest, but the
problem should be confined there?


> I worry about synchronization here too, perhaps there should be a ring offset such
> that the ring can be in a separate page from the header and then sparse
> mmap support can ensure that the user access is restricted.

I was assuming a single-writer, single-reader lock-free circular
buffer here. My understanding was that it was safe to consider
concurrent read and write. What I am missing anyway are the atomic
counter operations that guarantee the indices are updated only after
the push/pop action, as explained in
https://www.kernel.org/doc/Documentation/circular-buffers.txt. I am not
comfortable about how to enforce this on the user side though.
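
Concretely, on the kernel (producer) side I would expect something like
the following sketch, with free-running indices taken modulo a
power-of-two number of entries (which the current layout does not
guarantee):

	/* push one fault, following circular-buffers.txt */
	u32 prod = READ_ONCE(header->prod);
	u32 cons = smp_load_acquire(&header->cons); /* pairs with user release */

	if (prod - cons < VFIO_FAULT_QUEUE_SIZE) {
		queue[prod % VFIO_FAULT_QUEUE_SIZE] = *evt;
		/* make the entry visible before publishing the new index */
		smp_store_release(&header->prod, prod + 1);
	}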

In case I split the header and the actual buffer into 2 different,
possibly 64kB, pages, the first one will be very sparsely used.

> wonder if there are other transports that make sense here, this almost
> feels like a vhost sort of thing.  Thanks,
Using something more sophisticated may be useful for PRI where answers
need to be provided. For the case of unrecoverable faults, I wonder
whether it is worth the pain exposing a fault region compared to the
original IOCTL approach introduced in
[RFC v2 18/20] vfio: VFIO_IOMMU_GET_FAULT_EVENTS
https://lkml.org/lkml/2018/9/18/1094

Thanks

Eric
> 
> Alex
> 
>> +	__u32	cons;
>> +	__u32	reserved;	/* must be 0 */
>> +};
>> +
>> +struct vfio_fault_region {
>> +	struct vfio_fault_region_header header;
>> +	struct iommu_fault queue[0];
>> +};
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 14/21] iommu: introduce device fault data
  2019-01-11 11:06     ` Jean-Philippe Brucker
@ 2019-01-14 22:32       ` Jacob Pan
  2019-01-16 15:52         ` Jean-Philippe Brucker
  2019-01-15 21:27       ` Auger Eric
  1 sibling, 1 reply; 59+ messages in thread
From: Jacob Pan @ 2019-01-14 22:32 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Eric Auger, yi.l.liu, kevin.tian, alex.williamson, ashok.raj,
	kvm, peter.maydell, Will Deacon, linux-kernel, Christoffer Dall,
	Marc Zyngier, iommu, Robin Murphy, kvmarm, eric.auger.pro,
	jacob.jun.pan

On Fri, 11 Jan 2019 11:06:29 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 10/01/2019 18:45, Jacob Pan wrote:
> > On Tue,  8 Jan 2019 11:26:26 +0100
> > Eric Auger <eric.auger@redhat.com> wrote:
> >   
> >> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >>
> >> Device faults detected by an IOMMU can be reported outside the IOMMU
> >> subsystem for further processing. This patch intends to provide
> >> generic device fault data such that IOMMU faults can be
> >> communicated to device drivers without model-specific knowledge.
> >>
> >> The proposed format is the result of discussion at:
> >> https://lkml.org/lkml/2017/11/10/291
> >> Part of the code is based on Jean-Philippe Brucker's patchset
> >> (https://patchwork.kernel.org/patch/9989315/).
> >>
> >> The assumption is that a model-specific IOMMU driver can filter and
> >> handle most of the internal faults if the cause is within IOMMU
> >> driver control. Therefore, the fault reasons that can be reported
> >> are grouped and generalized based on common specifications such as
> >> PCI ATS.
> >>
> >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Signed-off-by: Jean-Philippe Brucker
> >> <jean-philippe.brucker@arm.com> Signed-off-by: Liu, Yi L
> >> <yi.l.liu@linux.intel.com> Signed-off-by: Ashok Raj
> >> <ashok.raj@intel.com> Signed-off-by: Eric Auger
> >> <eric.auger@redhat.com> [moved part of the iommu_fault_event
> >> struct in the uapi, enriched the fault reasons to be able to map
> >> unrecoverable SMMUv3 errors] ---
> >>  include/linux/iommu.h      | 55 ++++++++++++++++++++++++-
> >>  include/uapi/linux/iommu.h | 83
> >> ++++++++++++++++++++++++++++++++++++++ 2 files changed, 136
> >> insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> >> index 244c1a3d5989..1dedc2d247c2 100644
> >> --- a/include/linux/iommu.h
> >> +++ b/include/linux/iommu.h
> >> @@ -49,13 +49,17 @@ struct bus_type;
> >>  struct device;
> >>  struct iommu_domain;
> >>  struct notifier_block;
> >> +struct iommu_fault_event;
> >>  
> >>  /* iommu fault flags */
> >> -#define IOMMU_FAULT_READ	0x0
> >> -#define IOMMU_FAULT_WRITE	0x1
> >> +#define IOMMU_FAULT_READ		(1 << 0)
> >> +#define IOMMU_FAULT_WRITE		(1 << 1)
> >> +#define IOMMU_FAULT_EXEC		(1 << 2)
> >> +#define IOMMU_FAULT_PRIV		(1 << 3)
> >>  
> >>  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
> >>  			struct device *, unsigned long, int, void
> >> *); +typedef int (*iommu_dev_fault_handler_t)(struct
> >> iommu_fault_event *, void *); 
> >>  struct iommu_domain_geometry {
> >>  	dma_addr_t aperture_start; /* First address that can be
> >> mapped    */ @@ -255,6 +259,52 @@ struct iommu_device {
> >>  	struct device *dev;
> >>  };
> >>  
> >> +/**
> >> + * struct iommu_fault_event - Generic per device fault data
> >> + *
> >> + * - PCI and non-PCI devices
> >> + * - Recoverable faults (e.g. page request), information based on
> >> PCI ATS
> >> + * and PASID spec.
> >> + * - Un-recoverable faults of device interest
> >> + * - DMA remapping and IRQ remapping faults
> >> + *
> >> + * @fault: fault descriptor
> >> + * @device_private: if present, uniquely identify device-specific
> >> + *                  private data for an individual page request.
> >> + * @iommu_private: used by the IOMMU driver for storing
> >> fault-specific
> >> + *                 data. Users should not modify this field before
> >> + *                 sending the fault response.
> >> + */
> >> +struct iommu_fault_event {
> >> +	struct iommu_fault fault;
> >> +	u64 device_private;  
> > I think we want to move device_private to uapi since it gets
> > injected into the guest, then returned by guest in case of page
> > response. For VT-d we also need 128 bits of private data. VT-d
> > spec. 7.7.1  
> 
> Ah, I didn't notice the format changed in VT-d rev3. On that topic,
> how do we manage future extensions to the iommu_fault struct? Should
> we add ~48 bytes of padding after device_private, along with some
> flags telling which field is valid, or deal with it using a structure
> version like we do for the invalidate and bind structs? In the first
> case, iommu_fault wouldn't fit in a 64-byte cacheline anymore, but
> I'm not sure we care.
> 
IMHO, I like version and padding. I don't see a need for flags once we
have version.

> > For exception tracking (e.g. unanswered page request), I can add
> > timer and list info later when I include PRQ. sounds ok?  
> >> +	u64 iommu_private;  
> [...]
> >> +/**
> >> + * struct iommu_fault - Generic fault data
> >> + *
> >> + * @type contains fault type
> >> + * @reason fault reasons if relevant outside IOMMU driver.
> >> + * IOMMU driver internal faults are not reported.
> >> + * @addr: tells the offending page address
> >> + * @fetch_addr: tells the address that caused an abort, if any
> >> + * @pasid: contains process address space ID, used in shared
> >> virtual memory
> >> + * @page_req_group_id: page request group index
> >> + * @last_req: last request in a page request group
> >> + * @pasid_valid: indicates if the PRQ has a valid PASID
> >> + * @prot: page access protection flag:
> >> + *	IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
> >> + */
> >> +
> >> +struct iommu_fault {
> >> +	__u32	type;   /* enum iommu_fault_type */
> >> +	__u32	reason; /* enum iommu_fault_reason */
> >> +	__u64	addr;
> >> +	__u64	fetch_addr;
> >> +	__u32	pasid;
> >> +	__u32	page_req_group_id;
> >> +	__u32	last_req;
> >> +	__u32	pasid_valid;
> >> +	__u32	prot;
> >> +	__u32	access;  
> 
> What does @access contain? Can it be squashed into @prot?
> 
I agreed.

how about this?
#define IOMMU_FAULT_VERSION_V1 0x1
struct iommu_fault {
	__u16 version;
	__u16 type;
	__u32 reason;
	__u64 addr;
	__u32 pasid;
	__u32 page_req_group_id;
	__u32 last_req : 1;
	__u32 pasid_valid : 1;
	__u32 prot;
	__u64 device_private[2];
	__u8 padding[48];
};
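
If my arithmetic is right that layout packs to 96 bytes (the two
bitfields share one __u32 and device_private is naturally 8-byte
aligned), so we could pin the ABI down at build time:

	/* sketch: catch accidental layout changes */
	BUILD_BUG_ON(sizeof(struct iommu_fault) != 96);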


> Thanks,
> Jean
> 
> > relocated to uapi, Yi can you confirm?
> > 	__u64 device_private[2];
> >   
> >> +};
> >>  #endif /* _UAPI_IOMMU_H */  
> > 
> > _______________________________________________
> > iommu mailing list
> > iommu@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/iommu
> >   
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type
  2019-01-14 20:48     ` Auger Eric
@ 2019-01-14 23:04       ` Alex Williamson
  2019-01-15 21:56         ` Auger Eric
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2019-01-14 23:04 UTC (permalink / raw)
  To: Auger Eric
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Mon, 14 Jan 2019 21:48:06 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 1/12/19 12:58 AM, Alex Williamson wrote:
> > On Tue,  8 Jan 2019 11:26:30 +0100
> > Eric Auger <eric.auger@redhat.com> wrote:
> >   
> >> This patch adds a new 64kB region aiming to report nested mode
> >> translation faults.
> >>
> >> The region contains a header with the size of the queue,
> >> the producer and consumer indices and then the actual
> >> fault queue data. The producer is updated by the kernel while
> >> the consumer is updated by the userspace.
> >>
> >> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> >>
> >> ---
> >> ---
> >>  drivers/vfio/pci/vfio_pci.c         | 102 +++++++++++++++++++++++++++-
> >>  drivers/vfio/pci/vfio_pci_private.h |   2 +
> >>  include/uapi/linux/vfio.h           |  15 ++++
> >>  3 files changed, 118 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index ff60bd1ea587..2ba181ab2edd 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -56,6 +56,11 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> >>  MODULE_PARM_DESC(disable_idle_d3,
> >>  		 "Disable using the PCI D3 low power state for idle, unused devices");
> >>  
> >> +#define VFIO_FAULT_REGION_SIZE 0x10000  
> > 
> > Why 64K?  
> For the region to be mmappable with 64kB page size.

Isn't hard coding 64K just as bad as hard coding 4K?  The kernel knows
what PAGE_SIZE is after all.  Is there some target number of queue
entries here that we could round up to a multiple of PAGE_SIZE?
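
Something along these lines, for instance, with the entry count as the
tunable rather than the region size (the count below is purely
illustrative):

	#define VFIO_FAULT_QUEUE_ENTRIES	128	/* hypothetical target */
	#define VFIO_FAULT_REGION_SIZE					\
		PAGE_ALIGN(sizeof(struct vfio_fault_region_header) +	\
			   VFIO_FAULT_QUEUE_ENTRIES * sizeof(struct iommu_fault))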
 
> >> +#define VFIO_FAULT_QUEUE_SIZE	\
> >> +	((VFIO_FAULT_REGION_SIZE - sizeof(struct vfio_fault_region_header)) / \
> >> +	sizeof(struct iommu_fault))
> >> +
> >>  static inline bool vfio_vga_disabled(void)
> >>  {
> >>  #ifdef CONFIG_VFIO_PCI_VGA
> >> @@ -1226,6 +1231,100 @@ static const struct vfio_device_ops vfio_pci_ops = {
> >>  static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
> >>  static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
> >>  
> >> +static size_t
> >> +vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf,
> >> +		      size_t count, loff_t *ppos, bool iswrite)
> >> +{
> >> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> >> +	void *base = vdev->region[i].data;
> >> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +
> >> +	if (pos >= vdev->region[i].size)
> >> +		return -EINVAL;
> >> +
> >> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> >> +
> >> +	if (copy_to_user(buf, base + pos, count))
> >> +		return -EFAULT;
> >> +
> >> +	*ppos += count;
> >> +
> >> +	return count;
> >> +}
> >> +
> >> +static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev,
> >> +				   struct vfio_pci_region *region,
> >> +				   struct vm_area_struct *vma)
> >> +{
> >> +	u64 phys_len, req_len, pgoff, req_start;
> >> +	unsigned long long addr;
> >> +	unsigned int index;
> >> +
> >> +	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> >> +
> >> +	if (vma->vm_end < vma->vm_start)
> >> +		return -EINVAL;
> >> +	if ((vma->vm_flags & VM_SHARED) == 0)
> >> +		return -EINVAL;
> >> +
> >> +	phys_len = VFIO_FAULT_REGION_SIZE;
> >> +
> >> +	req_len = vma->vm_end - vma->vm_start;
> >> +	pgoff = vma->vm_pgoff &
> >> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> >> +	req_start = pgoff << PAGE_SHIFT;
> >> +
> >> +	if (req_start + req_len > phys_len)
> >> +		return -EINVAL;
> >> +
> >> +	addr = virt_to_phys(vdev->fault_region);
> >> +	vma->vm_private_data = vdev;
> >> +	vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff;
> >> +
> >> +	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> >> +			       req_len, vma->vm_page_prot);
> >> +}
> >> +
> >> +void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev,
> >> +				struct vfio_pci_region *region)
> >> +{
> >> +}
> >> +
> >> +static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {
> >> +	.rw		= vfio_pci_dma_fault_rw,
> >> +	.mmap		= vfio_pci_dma_fault_mmap,
> >> +	.release	= vfio_pci_dma_fault_release,
> >> +};
> >> +
> >> +static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
> >> +{
> >> +	u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
> >> +		    VFIO_REGION_INFO_FLAG_MMAP;
> >> +	int ret;
> >> +
> >> +	spin_lock_init(&vdev->fault_queue_lock);
> >> +
> >> +	vdev->fault_region = kmalloc(VFIO_FAULT_REGION_SIZE, GFP_KERNEL);
> >> +	if (!vdev->fault_region)
> >> +		return -ENOMEM;
> >> +
> >> +	ret = vfio_pci_register_dev_region(vdev,
> >> +		VFIO_REGION_TYPE_NESTED,
> >> +		VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION,
> >> +		&vfio_pci_dma_fault_regops, VFIO_FAULT_REGION_SIZE,
> >> +		flags, vdev->fault_region);
> >> +	if (ret) {
> >> +		kfree(vdev->fault_region);
> >> +		return ret;
> >> +	}
> >> +
> >> +	vdev->fault_region->header.prod = 0;
> >> +	vdev->fault_region->header.cons = 0;
> >> +	vdev->fault_region->header.reserved = 0;  
> > 
> > Use kzalloc above or else we're leaking kernel memory to userspace
> > anyway.  
> sure
> >   
> >> +	vdev->fault_region->header.size = VFIO_FAULT_QUEUE_SIZE;
> >> +	return 0;
> >> +}
> >> +
> >>  static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >>  {
> >>  	struct vfio_pci_device *vdev;
> >> @@ -1300,7 +1399,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >>  		pci_set_power_state(pdev, PCI_D3hot);
> >>  	}
> >>  
> >> -	return ret;
> >> +	return vfio_pci_init_dma_fault_region(vdev);  
> > 
> > Missing lots of cleanup should this fail.  Why is this done on probe
> > anyway?  This looks like something we'd do from vfio_pci_enable() and
> > therefore our release callback would free fault_region rather than what
> > we have below.  
> OK. That's fine to put in the vfio_pci_enable().
> >   
> >>  }
> >>  
> >>  static void vfio_pci_remove(struct pci_dev *pdev)
> >> @@ -1315,6 +1414,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
> >>  
> >>  	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
> >>  	kfree(vdev->region);
> >> +	kfree(vdev->fault_region);
> >>  	mutex_destroy(&vdev->ioeventfds_lock);
> >>  	kfree(vdev);
> >>  
> >> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> >> index 8c0009f00818..38b5d1764a26 100644
> >> --- a/drivers/vfio/pci/vfio_pci_private.h
> >> +++ b/drivers/vfio/pci/vfio_pci_private.h
> >> @@ -120,6 +120,8 @@ struct vfio_pci_device {
> >>  	int			ioeventfds_nr;
> >>  	struct eventfd_ctx	*err_trigger;
> >>  	struct eventfd_ctx	*req_trigger;
> >> +	spinlock_t              fault_queue_lock;
> >> +	struct vfio_fault_region *fault_region;
> >>  	struct list_head	dummy_resources_list;
> >>  	struct mutex		ioeventfds_lock;
> >>  	struct list_head	ioeventfds_list;
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 352e795a93c8..b78c2c62af6d 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -307,6 +307,9 @@ struct vfio_region_info_cap_type {
> >>  #define VFIO_REGION_TYPE_GFX                    (1)
> >>  #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
> >>  
> >> +#define VFIO_REGION_TYPE_NESTED			(2)
> >> +#define VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION	(1)
> >> +
> >>  /**
> >>   * struct vfio_region_gfx_edid - EDID region layout.
> >>   *
> >> @@ -697,6 +700,18 @@ struct vfio_device_ioeventfd {
> >>  
> >>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
> >>  
> >> +struct vfio_fault_region_header {
> >> +	__u32	size;		/* Read-Only */
> >> +	__u32	prod;		/* Read-Only */  
> > 
> > We can't really enforce read-only if it's mmap'd.  
> Do we really need to? Assuming the kernel always uses
> VFIO_FAULT_QUEUE_SIZE to check the prod and cons indices - which is not
> the case at the moment by the way :-( - the queue cannot be overflowed.
> 
> The header also can be checked each time the kernel fills in an event
> in the queue (vfio_pci_iommu_dev_fault_handler). If it is inconsistent
> the kernel may stop using the queue. If the user-space tampers with
> those RO fields, this will break error reporting for the guest, but the
> problem should be confined there?

I guess this is a matter of whether the performance benefit is worth
the hardening and I can imagine that it could be, but we do need that
hardening and well defined behavior to the user.  If we put it into a
single page then we need to define if a user write to the producer index
resets the index or if it's just a shadow of internal state and will be
restored on the next update.  Does a user write to the size register
change the size of the queue or is it ignored?  Is there an event that
will restore it?  Maybe the size should be exposed as a part of a
region info capability like we've done for some of the new nvlink
regions.
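
That is, roughly along these lines (a hypothetical capability,
mirroring how the nvlink regions expose their extra attributes):

	/* sketch: report the ring geometry out-of-band, not in the mmap'd header */
	struct vfio_region_info_cap_fault {
		struct vfio_info_cap_header	header;
		__u32				entries;	/* queue size */
		__u32				ring_offset;	/* ring offset in region */
	};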

> > I worry about synchronization here too, perhaps there should be a ring offset such
> > that the ring can be in a separate page from the header and then sparse
> > mmap support can ensure that the user access is restricted.  
> 
> I was assuming a single-writer, single-reader lock-free circular
> buffer here. My understanding was that it was safe to consider
> concurrent read and write. What I am missing anyway are the atomic
> counter operations that guarantee the indices are updated only after
> the push/pop action, as explained in
> https://www.kernel.org/doc/Documentation/circular-buffers.txt. I am not
> comfortable about how to enforce this on the user side though.

It doesn't seem enforceable to the user without slowing down the
interface, either via ioctl or non-mmap.  I think it's fine to have
acquire and release semantics, we just need to define them and have
safe, predictable behavior when the user does something wrong (and
expect them to do something wrong, intentionally or not).

> In case I split the header and the actual buffer into 2 different,
> possibly 64kB, pages, the first one will be very sparsely used.

Yes, it's wasteful, a shared page would be preferred, but it's also
only a page.

> > wonder if there are other transports that make sense here, this almost
> > feels like a vhost sort of thing.  Thanks,  
> Using something more sophisticated may be useful for PRI where answers
> need to be provided. For the case of unrecoverable faults, I wonder
> whether it is worth the pain exposing a fault region compared to the
> original IOCTL approach introduced in
> [RFC v2 18/20] vfio: VFIO_IOMMU_GET_FAULT_EVENTS
> https://lkml.org/lkml/2018/9/18/1094

Not sure I understand the pain aspect, if we don't support mmap on a
region then we can fill a user read from a region just as easily as we
can fill a buffer passed via ioctl.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults
  2019-01-11 17:46   ` Jean-Philippe Brucker
@ 2019-01-15 21:06     ` Auger Eric
  2019-01-16 12:25       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 59+ messages in thread
From: Auger Eric @ 2019-01-15 21:06 UTC (permalink / raw)
  To: Jean-Philippe Brucker, eric.auger.pro, iommu, linux-kernel, kvm,
	kvmarm, joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	will.deacon, robin.murphy
  Cc: marc.zyngier, peter.maydell, kevin.tian, ashok.raj, christoffer.dall

Hi Jean,

On 1/11/19 6:46 PM, Jean-Philippe Brucker wrote:
> On 08/01/2019 10:26, Eric Auger wrote:
>> When a stage 1 related fault event is read from the event queue,
>> let's propagate it to potential external fault listeners, ie. users
>> who registered a fault handler.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>> ---
>>  drivers/iommu/arm-smmu-v3.c | 124 ++++++++++++++++++++++++++++++++----
>>  1 file changed, 113 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 999ee470a2ae..6a711cbbb228 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -168,6 +168,26 @@
>>  #define ARM_SMMU_PRIQ_IRQ_CFG1		0xd8
>>  #define ARM_SMMU_PRIQ_IRQ_CFG2		0xdc
>>  
>> +/* Events */
>> +#define ARM_SMMU_EVT_F_UUT		0x01
>> +#define ARM_SMMU_EVT_C_BAD_STREAMID	0x02
>> +#define ARM_SMMU_EVT_F_STE_FETCH	0x03
>> +#define ARM_SMMU_EVT_C_BAD_STE		0x04
>> +#define ARM_SMMU_EVT_F_BAD_ATS_TREQ	0x05
>> +#define ARM_SMMU_EVT_F_STREAM_DISABLED	0x06
>> +#define ARM_SMMU_EVT_F_TRANSL_FORBIDDEN	0x07
>> +#define ARM_SMMU_EVT_C_BAD_SUBSTREAMID	0x08
>> +#define ARM_SMMU_EVT_F_CD_FETCH		0x09
>> +#define ARM_SMMU_EVT_C_BAD_CD		0x0a
>> +#define ARM_SMMU_EVT_F_WALK_EABT	0x0b
>> +#define ARM_SMMU_EVT_F_TRANSLATION	0x10
>> +#define ARM_SMMU_EVT_F_ADDR_SIZE	0x11
>> +#define ARM_SMMU_EVT_F_ACCESS		0x12
>> +#define ARM_SMMU_EVT_F_PERMISSION	0x13
>> +#define ARM_SMMU_EVT_F_TLB_CONFLICT	0x20
>> +#define ARM_SMMU_EVT_F_CFG_CONFLICT	0x21
>> +#define ARM_SMMU_EVT_E_PAGE_REQUEST	0x24
>> +
>>  /* Common MSI config fields */
>>  #define MSI_CFG0_ADDR_MASK		GENMASK_ULL(51, 2)
>>  #define MSI_CFG2_SH			GENMASK(5, 4)
>> @@ -333,6 +353,11 @@
>>  #define EVTQ_MAX_SZ_SHIFT		7
>>  
>>  #define EVTQ_0_ID			GENMASK_ULL(7, 0)
>> +#define EVTQ_0_SUBSTREAMID		GENMASK_ULL(31, 12)
>> +#define EVTQ_0_STREAMID			GENMASK_ULL(63, 32)
>> +#define EVTQ_1_S2			GENMASK_ULL(39, 39)
>> +#define EVTQ_1_CLASS			GENMASK_ULL(40, 41)
>> +#define EVTQ_3_FETCH_ADDR		GENMASK_ULL(51, 3)
>>  
>>  /* PRI queue */
>>  #define PRIQ_ENT_DWORDS			2
>> @@ -1270,7 +1295,6 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
>>  	return 0;
>>  }
>>  
>> -__maybe_unused
>>  static struct arm_smmu_master_data *
>>  arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
>>  {
>> @@ -1296,24 +1320,102 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
>>  	return master;
>>  }
>>  
>> +static void arm_smmu_report_event(struct arm_smmu_device *smmu, u64 *evt)
>> +{
>> +	u64 fetch_addr = FIELD_GET(EVTQ_3_FETCH_ADDR, evt[3]);
>> +	u32 sid = FIELD_GET(EVTQ_0_STREAMID, evt[0]);
>> +	bool s1 = !FIELD_GET(EVTQ_1_S2, evt[1]);
>> +	u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
>> +	struct arm_smmu_master_data *master;
>> +	struct iommu_fault_event event;
>> +	bool propagate = true;
>> +	u64 addr = evt[2];
>> +	int i;
>> +
>> +	master = arm_smmu_find_master(smmu, sid);
>> +	if (WARN_ON(!master))
>> +		return;
>> +
>> +	event.fault.type = IOMMU_FAULT_DMA_UNRECOV;
>> +
>> +	switch (type) {
>> +	case ARM_SMMU_EVT_C_BAD_STREAMID:
>> +		event.fault.reason = IOMMU_FAULT_REASON_SOURCEID_INVALID;
>> +		break;
>> +	case ARM_SMMU_EVT_F_STREAM_DISABLED:
>> +	case ARM_SMMU_EVT_C_BAD_SUBSTREAMID:
>> +		event.fault.reason = IOMMU_FAULT_REASON_PASID_INVALID;
>> +		break;
>> +	case ARM_SMMU_EVT_F_CD_FETCH:
>> +		event.fault.reason = IOMMU_FAULT_REASON_PASID_FETCH;
>> +		break;
>> +	case ARM_SMMU_EVT_F_WALK_EABT:
>> +		event.fault.reason = IOMMU_FAULT_REASON_WALK_EABT;
>> +		event.fault.addr = addr;
>> +		event.fault.fetch_addr = fetch_addr;
>> +		propagate = s1;
>> +		break;
>> +	case ARM_SMMU_EVT_F_TRANSLATION:
>> +		event.fault.reason = IOMMU_FAULT_REASON_PTE_FETCH;
>> +		event.fault.addr = addr;
>> +		event.fault.fetch_addr = fetch_addr;
>> +		propagate = s1;
>> +		break;
>> +	case ARM_SMMU_EVT_F_PERMISSION:
>> +		event.fault.reason = IOMMU_FAULT_REASON_PERMISSION;
>> +		event.fault.addr = addr;
>> +		propagate = s1;
>> +		break;
>> +	case ARM_SMMU_EVT_F_ACCESS:
>> +		event.fault.reason = IOMMU_FAULT_REASON_ACCESS;
>> +		event.fault.addr = addr;
>> +		propagate = s1;
>> +		break;
>> +	case ARM_SMMU_EVT_C_BAD_STE:
>> +		event.fault.reason = IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY;
>> +		break;
>> +	case ARM_SMMU_EVT_C_BAD_CD:
>> +		event.fault.reason = IOMMU_FAULT_REASON_BAD_PASID_ENTRY;
>> +		break;
>> +	case ARM_SMMU_EVT_F_ADDR_SIZE:
>> +		event.fault.reason = IOMMU_FAULT_REASON_OOR_ADDRESS;
>> +		propagate = s1;
>> +		break;
>> +	case ARM_SMMU_EVT_F_STE_FETCH:
>> +		event.fault.reason = IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH;
>> +		event.fault.fetch_addr = fetch_addr;
>> +		break;
>> +	/* End of addition */
>> +	case ARM_SMMU_EVT_E_PAGE_REQUEST:
>> +	case ARM_SMMU_EVT_F_TLB_CONFLICT:
>> +	case ARM_SMMU_EVT_F_CFG_CONFLICT:
>> +	case ARM_SMMU_EVT_F_BAD_ATS_TREQ:
>> +	case ARM_SMMU_EVT_F_TRANSL_FORBIDDEN:
>> +	case ARM_SMMU_EVT_F_UUT:
>> +	default:
>> +		event.fault.reason = IOMMU_FAULT_REASON_UNKNOWN;
>> +	}
>> +	/* only propagate the error if it relates to stage 1 */
>> +	if (s1)
> 
> if (propagate)
> 
> But I don't quite understand how we're deciding what to propagate: a
> C_BAD_STE is most likely a bug in the SMMU driver, but is reported to
> userspace. On the other hand a stage-2 F_TRANSLATION is likely an error
> from the VMM (didn't setup stage-2 mappings properly), but we're not
> reporting it. Maybe we should add a bit to event.fault that tells
> whether the fault was stage 1 or 2, and let the VMM deal with it?
Yes I mixed this up. propagate should be false by default. In case the
event has an S2 field and S2 == 0, we can safely propagate to the
guest. Otherwise it is case by case and I need to do another review: if
the fault relates to structures owned by the guest, propagate it.
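
So the switch would become something like this sketch:

	bool propagate = false;	/* default: keep host-side faults private */
	...
	case ARM_SMMU_EVT_F_WALK_EABT:
	case ARM_SMMU_EVT_F_TRANSLATION:
		/* S2 == 0 means the fault hit the guest-owned stage 1 */
		propagate = !FIELD_GET(EVTQ_1_S2, evt[1]);
		break;
	case ARM_SMMU_EVT_C_BAD_CD:
		propagate = true;	/* the guest owns the CD table */
		break;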
> 
>> +		iommu_report_device_fault(master->dev, &event);
> 
> We should return here if the fault is successfully injected

Even if the fault gets injected into the guest, can't it still be
useful to get the message below on the host side?

Thanks

Eric
> 
> Thanks,
> Jean
> 
>> +
>> +	dev_info(smmu->dev, "event 0x%02x received:\n", type);
>> +	for (i = 0; i < EVTQ_ENT_DWORDS; ++i) {
>> +		dev_info(smmu->dev, "\t0x%016llx\n",
>> +			 (unsigned long long)evt[i]);
>> +	}
>> +}
>> +
>>  /* IRQ and event handlers */
>>  static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
>>  {
>> -	int i;
>>  	struct arm_smmu_device *smmu = dev;
>>  	struct arm_smmu_queue *q = &smmu->evtq.q;
>>  	u64 evt[EVTQ_ENT_DWORDS];
>>  
>>  	do {
>> -		while (!queue_remove_raw(q, evt)) {
>> -			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
>> -
>> -			dev_info(smmu->dev, "event 0x%02x received:\n", id);
>> -			for (i = 0; i < ARRAY_SIZE(evt); ++i)
>> -				dev_info(smmu->dev, "\t0x%016llx\n",
>> -					 (unsigned long long)evt[i]);
>> -
>> -		}
>> +		while (!queue_remove_raw(q, evt))
>> +			arm_smmu_report_event(smmu, evt);
>>  
>>  		/*
>>  		 * Not much we can do on overflow, so scream and pretend we're
>>
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 14/21] iommu: introduce device fault data
  2019-01-11 11:06     ` Jean-Philippe Brucker
  2019-01-14 22:32       ` Jacob Pan
@ 2019-01-15 21:27       ` Auger Eric
  2019-01-16 16:54         ` Jean-Philippe Brucker
  1 sibling, 1 reply; 59+ messages in thread
From: Auger Eric @ 2019-01-15 21:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Jacob Pan
  Cc: yi.l.liu, kevin.tian, alex.williamson, ashok.raj, kvm,
	peter.maydell, Will Deacon, linux-kernel, Christoffer Dall,
	Marc Zyngier, iommu, Robin Murphy, kvmarm, eric.auger.pro

Hi Jean,

On 1/11/19 12:06 PM, Jean-Philippe Brucker wrote:
> On 10/01/2019 18:45, Jacob Pan wrote:
>> On Tue,  8 Jan 2019 11:26:26 +0100
>> Eric Auger <eric.auger@redhat.com> wrote:
>>
>>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>
>>> Device faults detected by an IOMMU can be reported outside the IOMMU
>>> subsystem for further processing. This patch intends to provide
>>> generic device fault data such that IOMMU faults can be
>>> communicated to device drivers without model-specific knowledge.
>>>
>>> The proposed format is the result of discussion at:
>>> https://lkml.org/lkml/2017/11/10/291
>>> Part of the code is based on Jean-Philippe Brucker's patchset
>>> (https://patchwork.kernel.org/patch/9989315/).
>>>
>>> The assumption is that model specific IOMMU driver can filter and
>>> handle most of the internal faults if the cause is within IOMMU driver
>>> control. Therefore, the fault reasons can be reported are grouped
>>> and generalized based common specifications such as PCI ATS.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> [moved part of the iommu_fault_event struct in the uapi, enriched
>>>  the fault reasons to be able to map unrecoverable SMMUv3 errors]
>>> ---
>>>  include/linux/iommu.h      | 55 ++++++++++++++++++++++++-
>>>  include/uapi/linux/iommu.h | 83
>>> ++++++++++++++++++++++++++++++++++++++ 2 files changed, 136
>>> insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>> index 244c1a3d5989..1dedc2d247c2 100644
>>> --- a/include/linux/iommu.h
>>> +++ b/include/linux/iommu.h
>>> @@ -49,13 +49,17 @@ struct bus_type;
>>>  struct device;
>>>  struct iommu_domain;
>>>  struct notifier_block;
>>> +struct iommu_fault_event;
>>>  
>>>  /* iommu fault flags */
>>> -#define IOMMU_FAULT_READ	0x0
>>> -#define IOMMU_FAULT_WRITE	0x1
>>> +#define IOMMU_FAULT_READ		(1 << 0)
>>> +#define IOMMU_FAULT_WRITE		(1 << 1)
>>> +#define IOMMU_FAULT_EXEC		(1 << 2)
>>> +#define IOMMU_FAULT_PRIV		(1 << 3)
>>>  
>>>  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
>>>  			struct device *, unsigned long, int, void *);
>>> +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *,
>>> void *); 
>>>  struct iommu_domain_geometry {
>>>  	dma_addr_t aperture_start; /* First address that can be
>>> mapped    */ @@ -255,6 +259,52 @@ struct iommu_device {
>>>  	struct device *dev;
>>>  };
>>>  
>>> +/**
>>> + * struct iommu_fault_event - Generic per device fault data
>>> + *
>>> + * - PCI and non-PCI devices
>>> + * - Recoverable faults (e.g. page request), information based on
>>> PCI ATS
>>> + * and PASID spec.
>>> + * - Un-recoverable faults of device interest
>>> + * - DMA remapping and IRQ remapping faults
>>> + *
>>> + * @fault: fault descriptor
>>> + * @device_private: if present, uniquely identify device-specific
>>> + *                  private data for an individual page request.
>>> + * @iommu_private: used by the IOMMU driver for storing
>>> fault-specific
>>> + *                 data. Users should not modify this field before
>>> + *                 sending the fault response.
>>> + */
>>> +struct iommu_fault_event {
>>> +	struct iommu_fault fault;
>>> +	u64 device_private;
>> I think we want to move device_private to uapi since it gets injected
>> into the guest, then returned by guest in case of page response. For
>> VT-d we also need 128 bits of private data. VT-d spec. 7.7.1
> 
> Ah, I didn't notice the format changed in VT-d rev3. On that topic, how
> do we manage future extensions to the iommu_fault struct? Should we add
> ~48 bytes of padding after device_private, along with some flags telling
> which field is valid, or deal with it using a structure version like we
> do for the invalidate and bind structs? In the first case, iommu_fault
> wouldn't fit in a 64-byte cacheline anymore, but I'm not sure we care.
> 
>> For exception tracking (e.g. unanswered page request), I can add timer
>> and list info later when I include PRQ. sounds ok?
>>> +	u64 iommu_private;
> [...]
>>> +/**
>>> + * struct iommu_fault - Generic fault data
>>> + *
>>> + * @type contains fault type
>>> + * @reason fault reasons if relevant outside IOMMU driver.
>>> + * IOMMU driver internal faults are not reported.
>>> + * @addr: tells the offending page address
>>> + * @fetch_addr: tells the address that caused an abort, if any
>>> + * @pasid: contains process address space ID, used in shared virtual
>>> memory
>>> + * @page_req_group_id: page request group index
>>> + * @last_req: last request in a page request group
>>> + * @pasid_valid: indicates if the PRQ has a valid PASID
>>> + * @prot: page access protection flag:
>>> + *	IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
>>> + */
>>> +
>>> +struct iommu_fault {
>>> +	__u32	type;   /* enum iommu_fault_type */
>>> +	__u32	reason; /* enum iommu_fault_reason */
>>> +	__u64	addr;
>>> +	__u64	fetch_addr;
>>> +	__u32	pasid;
>>> +	__u32	page_req_group_id;
>>> +	__u32	last_req;
>>> +	__u32	pasid_valid;
>>> +	__u32	prot;
>>> +	__u32	access;
> 
> What does @access contain? Can it be squashed into @prot?
it was related to the F_ACCESS event record and was a placeholder for
reporting the access attributes of the input transaction (RnW, InD, PnU
fields). But I wonder whether this is needed to implement such
fine-grained fault reporting. Do we really care?

Thanks

Eric
> 
> Thanks,
> Jean
> 
>> relocated to uapi, Yi can you confirm?
>> 	__u64 device_private[2];
>>
>>> +};
>>>  #endif /* _UAPI_IOMMU_H */
>>
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 04/21] vfio: VFIO_IOMMU_SET_PASID_TABLE
  2019-01-11 22:50   ` Alex Williamson
@ 2019-01-15 21:34     ` Auger Eric
  0 siblings, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-15 21:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

Hi Alex,

On 1/11/19 11:50 PM, Alex Williamson wrote:
> On Tue,  8 Jan 2019 11:26:16 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
>> From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
>>
>> This patch adds VFIO_IOMMU_SET_PASID_TABLE ioctl which aims at
>> passing the virtual iommu guest configuration to the VFIO driver
>> downto to the iommu subsystem.
>>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>> v2 -> v3:
>> - s/BIND_PASID_TABLE/SET_PASID_TABLE
>>
>> v1 -> v2:
>> - s/BIND_GUEST_STAGE/BIND_PASID_TABLE
>> - remove the struct device arg
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 31 +++++++++++++++++++++++++++++++
>>  include/uapi/linux/vfio.h       |  8 ++++++++
>>  2 files changed, 39 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 7651cfb14836..d9dd23f64f00 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1644,6 +1644,24 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
>>  	return ret;
>>  }
>>  
>> +static int
>> +vfio_set_pasid_table(struct vfio_iommu *iommu,
>> +		      struct vfio_iommu_type1_set_pasid_table *ustruct)
>> +{
>> +	struct vfio_domain *d;
>> +	int ret = 0;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	list_for_each_entry(d, &iommu->domain_list, next) {
>> +		ret = iommu_set_pasid_table(d->domain, &ustruct->config);
>> +		if (ret)
>> +			break;
>> +	}
> 
> There's no unwind on failure here, leaves us in an inconsistent state
> should something go wrong or domains don't have homogeneous PASID
> support.  What's expected to happen if a PASID table is already set for
> a domain, does it replace the old one or return -EBUSY?
Effectively the setting can succeed on one domain and fail on another
one, typically if one underlying SMMU does not support nested mode
while the others do.

At the moment setting a new PASID table while one is already installed
does not return any error in the SMMU code; it just overwrites the
associated fields in the stream table entry. This is also the way we
turn nested mode off (cfg->bypass == true). This needs to be clarified
anyway.
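For the unwind, one possible shape (sketch only; bypass_cfg is a
hypothetical config that restores the default/bypass state on the
domains already updated):

	list_for_each_entry(d, &iommu->domain_list, next) {
		ret = iommu_set_pasid_table(d->domain, &ustruct->config);
		if (ret)
			goto unwind;
	}
	mutex_unlock(&iommu->lock);
	return 0;
unwind:
	/* roll back, in reverse, the domains already switched */
	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
		iommu_set_pasid_table(d->domain, &bypass_cfg);
	mutex_unlock(&iommu->lock);
	return ret;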
> 
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +}
>> +
>>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  				   unsigned int cmd, unsigned long arg)
>>  {
>> @@ -1714,6 +1732,19 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>  
>>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>  			-EFAULT : 0;
>> +	} else if (cmd == VFIO_IOMMU_SET_PASID_TABLE) {
>> +		struct vfio_iommu_type1_set_pasid_table ustruct;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_set_pasid_table,
>> +				    config);
>> +
>> +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (ustruct.argsz < minsz || ustruct.flags)
>> +			return -EINVAL;
>> +
>> +		return vfio_set_pasid_table(iommu, &ustruct);
>>  	}
>>  
>>  	return -ENOTTY;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 02bb7ad6e986..0d9f4090c95d 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -14,6 +14,7 @@
>>  
>>  #include <linux/types.h>
>>  #include <linux/ioctl.h>
>> +#include <linux/iommu.h>
>>  
>>  #define VFIO_API_VERSION	0
>>  
>> @@ -759,6 +760,13 @@ struct vfio_iommu_type1_dma_unmap {
>>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>>  
>> +struct vfio_iommu_type1_set_pasid_table {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	struct iommu_pasid_table_config config;
>> +};
>> +#define VFIO_IOMMU_SET_PASID_TABLE	_IO(VFIO_TYPE, VFIO_BASE + 22)
> 
> -ENOCOMMENTS  Thanks,
sure

Thanks

Eric
> 
> Alex
> 
>> +
>>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>>  
>>  /*
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type
  2019-01-14 23:04       ` Alex Williamson
@ 2019-01-15 21:56         ` Auger Eric
  0 siblings, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-15 21:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

Hi Alex,

On 1/15/19 12:04 AM, Alex Williamson wrote:
> On Mon, 14 Jan 2019 21:48:06 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alex,
>>
>> On 1/12/19 12:58 AM, Alex Williamson wrote:
>>> On Tue,  8 Jan 2019 11:26:30 +0100
>>> Eric Auger <eric.auger@redhat.com> wrote:
>>>   
>>>> This patch adds a new 64kB region aiming to report nested mode
>>>> translation faults.
>>>>
>>>> The region contains a header with the size of the queue,
>>>> the producer and consumer indices and then the actual
>>>> fault queue data. The producer is updated by the kernel while
>>>> the consumer is updated by the userspace.
>>>>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>
>>>> ---
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci.c         | 102 +++++++++++++++++++++++++++-
>>>>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>>>>  include/uapi/linux/vfio.h           |  15 ++++
>>>>  3 files changed, 118 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>>> index ff60bd1ea587..2ba181ab2edd 100644
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -56,6 +56,11 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
>>>>  MODULE_PARM_DESC(disable_idle_d3,
>>>>  		 "Disable using the PCI D3 low power state for idle, unused devices");
>>>>  
>>>> +#define VFIO_FAULT_REGION_SIZE 0x10000  
>>>
>>> Why 64K?  
>> For the region to be mmappable with 64kB page size.
> 
> Isn't hard coding 64K just as bad as hard coding 4K?  The kernel knows
> what PAGE_SIZE is after all.  Is there some target number of queue
> entries here that we could round up to a multiple of PAGE_SIZE?
The spec says the queue holds at most 2^n events with n <= 19, n being
implementation dependent. In practice the driver uses 2^8 entries, each
entry being 32 bytes, so the event queue is 8kB. So depending on the
entry size we choose here, we may want a queue holding around the same
number of entries.
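To make the sizing concrete, something like the below would avoid the
hard coded 64kB (sketch; VFIO_FAULT_QUEUE_ENTRIES is an arbitrary
target matching the 2^8 HW queue):

#define VFIO_FAULT_QUEUE_ENTRIES	256

/* round the header plus the entries up to the host page size */
#define VFIO_FAULT_REGION_SIZE \
	PAGE_ALIGN(sizeof(struct vfio_fault_region_header) + \
		   VFIO_FAULT_QUEUE_ENTRIES * sizeof(struct iommu_fault))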

>  
>>>> +#define VFIO_FAULT_QUEUE_SIZE	\
>>>> +	((VFIO_FAULT_REGION_SIZE - sizeof(struct vfio_fault_region_header)) / \
>>>> +	sizeof(struct iommu_fault))
>>>> +
>>>>  static inline bool vfio_vga_disabled(void)
>>>>  {
>>>>  #ifdef CONFIG_VFIO_PCI_VGA
>>>> @@ -1226,6 +1231,100 @@ static const struct vfio_device_ops vfio_pci_ops = {
>>>>  static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
>>>>  static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
>>>>  
>>>> +static size_t
>>>> +vfio_pci_dma_fault_rw(struct vfio_pci_device *vdev, char __user *buf,
>>>> +		      size_t count, loff_t *ppos, bool iswrite)
>>>> +{
>>>> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>>>> +	void *base = vdev->region[i].data;
>>>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>>>> +
>>>> +	if (pos >= vdev->region[i].size)
>>>> +		return -EINVAL;
>>>> +
>>>> +	count = min(count, (size_t)(vdev->region[i].size - pos));
>>>> +
>>>> +	if (copy_to_user(buf, base + pos, count))
>>>> +		return -EFAULT;
>>>> +
>>>> +	*ppos += count;
>>>> +
>>>> +	return count;
>>>> +}
>>>> +
>>>> +static int vfio_pci_dma_fault_mmap(struct vfio_pci_device *vdev,
>>>> +				   struct vfio_pci_region *region,
>>>> +				   struct vm_area_struct *vma)
>>>> +{
>>>> +	u64 phys_len, req_len, pgoff, req_start;
>>>> +	unsigned long long addr;
>>>> +	unsigned int index;
>>>> +
>>>> +	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>>>> +
>>>> +	if (vma->vm_end < vma->vm_start)
>>>> +		return -EINVAL;
>>>> +	if ((vma->vm_flags & VM_SHARED) == 0)
>>>> +		return -EINVAL;
>>>> +
>>>> +	phys_len = VFIO_FAULT_REGION_SIZE;
>>>> +
>>>> +	req_len = vma->vm_end - vma->vm_start;
>>>> +	pgoff = vma->vm_pgoff &
>>>> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>>>> +	req_start = pgoff << PAGE_SHIFT;
>>>> +
>>>> +	if (req_start + req_len > phys_len)
>>>> +		return -EINVAL;
>>>> +
>>>> +	addr = virt_to_phys(vdev->fault_region);
>>>> +	vma->vm_private_data = vdev;
>>>> +	vma->vm_pgoff = (addr >> PAGE_SHIFT) + pgoff;
>>>> +
>>>> +	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
>>>> +			       req_len, vma->vm_page_prot);
>>>> +}
>>>> +
>>>> +void vfio_pci_dma_fault_release(struct vfio_pci_device *vdev,
>>>> +				struct vfio_pci_region *region)
>>>> +{
>>>> +}
>>>> +
>>>> +static const struct vfio_pci_regops vfio_pci_dma_fault_regops = {
>>>> +	.rw		= vfio_pci_dma_fault_rw,
>>>> +	.mmap		= vfio_pci_dma_fault_mmap,
>>>> +	.release	= vfio_pci_dma_fault_release,
>>>> +};
>>>> +
>>>> +static int vfio_pci_init_dma_fault_region(struct vfio_pci_device *vdev)
>>>> +{
>>>> +	u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
>>>> +		    VFIO_REGION_INFO_FLAG_MMAP;
>>>> +	int ret;
>>>> +
>>>> +	spin_lock_init(&vdev->fault_queue_lock);
>>>> +
>>>> +	vdev->fault_region = kmalloc(VFIO_FAULT_REGION_SIZE, GFP_KERNEL);
>>>> +	if (!vdev->fault_region)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	ret = vfio_pci_register_dev_region(vdev,
>>>> +		VFIO_REGION_TYPE_NESTED,
>>>> +		VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION,
>>>> +		&vfio_pci_dma_fault_regops, VFIO_FAULT_REGION_SIZE,
>>>> +		flags, vdev->fault_region);
>>>> +	if (ret) {
>>>> +		kfree(vdev->fault_region);
>>>> +		return ret;
>>>> +	}
>>>> +
>>>> +	vdev->fault_region->header.prod = 0;
>>>> +	vdev->fault_region->header.cons = 0;
>>>> +	vdev->fault_region->header.reserved = 0;  
>>>
>>> Use kzalloc above or else we're leaking kernel memory to userspace
>>> anyway.  
>> sure
>>>   
>>>> +	vdev->fault_region->header.size = VFIO_FAULT_QUEUE_SIZE;
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>>>  {
>>>>  	struct vfio_pci_device *vdev;
>>>> @@ -1300,7 +1399,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>>>  		pci_set_power_state(pdev, PCI_D3hot);
>>>>  	}
>>>>  
>>>> -	return ret;
>>>> +	return vfio_pci_init_dma_fault_region(vdev);  
>>>
>>> Missing lots of cleanup should this fail.  Why is this done on probe
>>> anyway?  This looks like something we'd do from vfio_pci_enable() and
>>> therefore our release callback would free fault_region rather than what
>>> we have below.  
>> OK. That's fine to put in the vfio_pci_enable().
>>>   
>>>>  }
>>>>  
>>>>  static void vfio_pci_remove(struct pci_dev *pdev)
>>>> @@ -1315,6 +1414,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>>>>  
>>>>  	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
>>>>  	kfree(vdev->region);
>>>> +	kfree(vdev->fault_region);
>>>>  	mutex_destroy(&vdev->ioeventfds_lock);
>>>>  	kfree(vdev);
>>>>  
>>>> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>>>> index 8c0009f00818..38b5d1764a26 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_private.h
>>>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>>>> @@ -120,6 +120,8 @@ struct vfio_pci_device {
>>>>  	int			ioeventfds_nr;
>>>>  	struct eventfd_ctx	*err_trigger;
>>>>  	struct eventfd_ctx	*req_trigger;
>>>> +	spinlock_t              fault_queue_lock;
>>>> +	struct vfio_fault_region *fault_region;
>>>>  	struct list_head	dummy_resources_list;
>>>>  	struct mutex		ioeventfds_lock;
>>>>  	struct list_head	ioeventfds_list;
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 352e795a93c8..b78c2c62af6d 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -307,6 +307,9 @@ struct vfio_region_info_cap_type {
>>>>  #define VFIO_REGION_TYPE_GFX                    (1)
>>>>  #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
>>>>  
>>>> +#define VFIO_REGION_TYPE_NESTED			(2)
>>>> +#define VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION	(1)
>>>> +
>>>>  /**
>>>>   * struct vfio_region_gfx_edid - EDID region layout.
>>>>   *
>>>> @@ -697,6 +700,18 @@ struct vfio_device_ioeventfd {
>>>>  
>>>>  #define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
>>>>  
>>>> +struct vfio_fault_region_header {
>>>> +	__u32	size;		/* Read-Only */
>>>> +	__u32	prod;		/* Read-Only */  
>>>
>>> We can't really enforce read-only if it's mmap'd.  
>> Do we really need to? Assuming the kernel always uses
>> VFIO_FAULT_QUEUE_SIZE to check the prod and cons indices - which is
>> not the case at the moment by the way :-( - the queue cannot be
>> overflowed.
>>
>> The header also can be checked each time the kernel fills an event in
>> the queue (vfio_pci_iommu_dev_fault_handler). If it is inconsistent
>> the kernel may stop using the queue. If the user-space tampers with
>> those RO fields, this will break error reporting in the guest but the
>> problem should be confined there?
> 
> I guess this is a matter of whether the performance benefit is worth
> the hardening and I can imagine that it could be, but we do need that
> hardening and well defined behavior to the user.  If we put it into a
> single page then we need to define if a user write to the producer index
> resets the index or if it's just a shadow of internal state and will be
> restored on the next update. 
I had in mind the kernel would leave the prod value as written by the
userspace. Obviously the error reporting would be broken then.
> Does a user write to the size register
> change the size of the queue or is it ignored?
it would be ignored and restored on the subsequent kernel push
> Is there an event that
> will restore it?  Maybe the size should be exposed as a part of a
> region info capability like we've done for some of the new nvlink
> regions.
> 
>>> I worry about synchronization here too, perhaps there should be a ring offset such
>>> that the ring can be in a separate page from the header and then sparse
>>> mmap support can ensure that the user access is restricted.  
>>
>> I was assuming a single writer and single reader lock-free circular
>> buffer here. My understanding was it was safe to consider concurrent
>> read and write. What I am missing anyway is atomic counter operations to
>> guarantee the indices are updated after the push/pop action as explained in
>> https://www.kernel.org/doc/Documentation/circular-buffers.txt. I am not
>> comfortable about how to enforce this on user side though.
> 
> It doesn't seem enforceable to the user without slowing down the
> interface, either via ioctl or non-mmap.  I think it's fine to have
> acquire and release semantics, we just need to define them and have
> safe, predictable behavior when the user does something wrong (and
> expect them to do something wrong, intentionally or not).
Agreed. I am not comfortable either with the existing protocol, even
though it works in my case.
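For reference, on the kernel producer side the circular-buffers.txt
pattern would look something like this (sketch; header/queue layout as
in patch 18, free-running u32 indices assumed):

	u32 prod = header->prod;
	u32 cons = smp_load_acquire(&header->cons);

	/* full when the producer is a whole queue ahead of the consumer */
	if (prod - cons >= VFIO_FAULT_QUEUE_SIZE)
		return -ENOSPC;

	queue[prod % VFIO_FAULT_QUEUE_SIZE] = *fault;

	/* make the entry visible before publishing the new index */
	smp_store_release(&header->prod, prod + 1);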
> 
>> In case I split the header and the actual buffer into 2 different
>> possible 64kB pages, the first one will be very scarcely used.
> 
> Yes, it's wasteful, a shared page would be preferred, but it's also
> only a page.
agreed
> 
>>> wonder if there are other transports that make sense here, this almost
>>> feels like a vhost sort of thing.  Thanks,  
>> Using something more sophisticated may be useful for PRI where answers
>> need to be provided. For the case of unrecoverable faults, I wonder
>> whether it is worth the pain exposing a fault region compared to the
>> original IOCTL approach introduced in
>> [RFC v2 18/20] vfio: VFIO_IOMMU_GET_FAULT_EVENTS
>> https://lkml.org/lkml/2018/9/18/1094
> 
> Not sure I understand the pain aspect, if we don't support mmap on a
> region then we can fill a user read from a region just as easily as we
> can fill a buffer passed via ioctl.  Thanks,
My impression was that the previous VFIO_IOMMU_GET_FAULT_EVENTS ioctl
based approach, filling the user provided buffer with available events,
was rather simple compared to the region implementation (my own
feeling). We could use the kfifo utilities without needing to define a
region layout.

Thanks

Eric
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults
  2019-01-15 21:06     ` Auger Eric
@ 2019-01-16 12:25       ` Jean-Philippe Brucker
  2019-01-16 12:49         ` Auger Eric
  0 siblings, 1 reply; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-16 12:25 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu, Will Deacon,
	Robin Murphy
  Cc: Marc Zyngier, peter.maydell, kevin.tian, ashok.raj, Christoffer Dall

On 15/01/2019 21:06, Auger Eric wrote:
>>> +		iommu_report_device_fault(master->dev, &event);
>>
>> We should return here if the fault is successfully injected
> 
> Even if the fault gets injected in the guest can't it be still useful to
> get the message below on host side?

I don't think we should let the guest flood the host log by issuing
invalid DMA (or are there other cases where the guest can freely print
stuff in the host?) We do print all errors at the moment, but we should
tighten this once there is an upstream solution to let the guest control
DMA mappings.
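In the meantime, rate limiting would at least bound the noise, e.g.
(sketch):

	dev_warn_ratelimited(smmu->dev, "event 0x%02x received:\n", type);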

Thanks,
Jean

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults
  2019-01-16 12:25       ` Jean-Philippe Brucker
@ 2019-01-16 12:49         ` Auger Eric
  0 siblings, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-16 12:49 UTC (permalink / raw)
  To: Jean-Philippe Brucker, eric.auger.pro, iommu, linux-kernel, kvm,
	kvmarm, joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	Will Deacon, Robin Murphy
  Cc: Marc Zyngier, peter.maydell, kevin.tian, ashok.raj, Christoffer Dall

Hi Jean,

On 1/16/19 1:25 PM, Jean-Philippe Brucker wrote:
> On 15/01/2019 21:06, Auger Eric wrote:
>>>> +		iommu_report_device_fault(master->dev, &event);
>>>
>>> We should return here if the fault is successfully injected
>>
>> Even if the fault gets injected in the guest can't it be still useful to
>> get the message below on host side?
> 
> I don't think we should let the guest flood the host log by issuing
> invalid DMA (or are there other cases where the guest can freely print
> stuff in the host?) We do print all errors at the moment, but we should
> tighten this once there is an upstream solution to let the guest control
> DMA mappings.
OK thank you for the clarification. Yes it makes sense to skip those
traces then.

Thanks

Eric
> 
> Thanks,
> Jean
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 14/21] iommu: introduce device fault data
  2019-01-14 22:32       ` Jacob Pan
@ 2019-01-16 15:52         ` Jean-Philippe Brucker
  2019-01-16 18:33           ` Auger Eric
  0 siblings, 1 reply; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-16 15:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: yi.l.liu, kevin.tian, ashok.raj, kvm, peter.maydell,
	Marc Zyngier, Will Deacon, linux-kernel, iommu, Christoffer Dall,
	alex.williamson, Robin Murphy, kvmarm, eric.auger.pro

On 14/01/2019 22:32, Jacob Pan wrote:
>> [...]
>>>> +/**
>>>> + * struct iommu_fault - Generic fault data
>>>> + *
>>>> + * @type contains fault type
>>>> + * @reason fault reasons if relevant outside IOMMU driver.
>>>> + * IOMMU driver internal faults are not reported.
>>>> + * @addr: tells the offending page address
>>>> + * @fetch_addr: tells the address that caused an abort, if any
>>>> + * @pasid: contains process address space ID, used in shared
>>>> virtual memory
>>>> + * @page_req_group_id: page request group index
>>>> + * @last_req: last request in a page request group
>>>> + * @pasid_valid: indicates if the PRQ has a valid PASID
>>>> + * @prot: page access protection flag:
>>>> + *	IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
>>>> + */
>>>> +
>>>> +struct iommu_fault {
>>>> +	__u32	type;   /* enum iommu_fault_type */
>>>> +	__u32	reason; /* enum iommu_fault_reason */
>>>> +	__u64	addr;
>>>> +	__u64	fetch_addr;
>>>> +	__u32	pasid;
>>>> +	__u32	page_req_group_id;
>>>> +	__u32	last_req;
>>>> +	__u32	pasid_valid;
>>>> +	__u32	prot;
>>>> +	__u32	access;  
>>
>> What does @access contain? Can it be squashed into @prot?
>>
> I agreed.
> 
> how about this?
> #define IOMMU_FAULT_VERSION_V1 0x1
> struct iommu_fault {
> 	__u16 version;

Right, but the version field becomes redundant when we present a batch
of these to userspace, in patch 18 (assuming we don't want to mix fault
structure versions within a batch... I certainly don't).

When introducing IOMMU_FAULT_VERSION_V2, in a distant future, I think we
still need to support a userspace that uses IOMMU_FAULT_VERSION_V1. One
strategy for this:

* We define structs iommu_fault_v1 (the current iommu_fault) and
  iommu_fault_v2.
* Userspace selects IOMMU_FAULT_VERSION_V1 when registering the fault
  queue
* The IOMMU driver fills iommu_fault_v2 and passes it to VFIO
> * VFIO does its best to translate this into an iommu_fault_v1 struct

So what we need now, is a way for userspace to tell the kernel which
structure version it expects. I'm not sure we even need to pass the
actual version number we're using back to userspace. Agreeing on one
version at registration should be sufficient.
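In code that could look like (sketch; iommu_fault_v2, fault_queue and
vfio_fault_push are all made-up names):

static void vfio_report_fault(struct fault_queue *q,
			      struct iommu_fault_v2 *in)
{
	if (q->user_version == IOMMU_FAULT_VERSION_V1) {
		/* downgrade to the layout userspace registered with */
		struct iommu_fault_v1 out = {
			.type	= in->type,
			.reason	= in->reason,
			.addr	= in->addr,
			/* fields that only exist in v2 are dropped */
		};

		vfio_fault_push(q, &out, sizeof(out));
	} else {
		vfio_fault_push(q, in, sizeof(*in));
	}
}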

> 	__u16 type;
> 	__u32 reason;
> 	__u64 addr;

I'm in favor of keeping @fetch_addr as well, it can contain useful
information. For example, while attempting to translate an IOVA
0xfffff000, the IOMMU can't find the PASID table that we installed with
address 0xdead - the guest passed an invalid address to
bind_pasid_table(). We can then report 0xfffff000 in @addr, and 0xdead
in @fetch_addr.

> 	__u32 pasid;
> 	__u32 page_req_group_id;
> 	__u32 last_req : 1;
> 	__u32 pasid_valid : 1;

Agreed, with some explicit padding or combined as a @flag field. In fact
if we do add the @fetch_addr field, I think we need a bit that indicates
its validity as well.
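i.e. something like (sketch combining the above; layout and flag names
not final):

struct iommu_fault {
	__u16	version;
	__u16	type;
	__u32	reason;
	__u64	addr;
	__u64	fetch_addr;
	__u32	pasid;
	__u32	page_req_group_id;
#define IOMMU_FAULT_LAST_REQ		(1 << 0)
#define IOMMU_FAULT_PASID_VALID		(1 << 1)
#define IOMMU_FAULT_FETCH_ADDR_VALID	(1 << 2)
	__u32	flags;
	__u32	prot;
	__u64	device_private[2];
	__u8	padding[8];	/* pads to 64 bytes; could grow for
				 * future extensions */
};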

Thanks,
Jean

> 	__u32 prot;
> 	__u64 device_private[2];
> 	__u8 padding[48];
> };
> 
> 
>> Thanks,
>> Jean
>>
>>> relocated to uapi, Yi can you confirm?
>>> 	__u64 device_private[2];
>>>   
>>>> +};
>>>>  #endif /* _UAPI_IOMMU_H */  
>>>
>>
> 
> [Jacob Pan]
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 14/21] iommu: introduce device fault data
  2019-01-15 21:27       ` Auger Eric
@ 2019-01-16 16:54         ` Jean-Philippe Brucker
  0 siblings, 0 replies; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-16 16:54 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan
  Cc: peter.maydell, kevin.tian, ashok.raj, kvm, yi.l.liu,
	Marc Zyngier, Will Deacon, linux-kernel, iommu, Christoffer Dall,
	alex.williamson, Robin Murphy, kvmarm, eric.auger.pro

On 15/01/2019 21:27, Auger Eric wrote:
[...]
>>>>  /* iommu fault flags */
>>>> -#define IOMMU_FAULT_READ	0x0
>>>> -#define IOMMU_FAULT_WRITE	0x1
>>>> +#define IOMMU_FAULT_READ		(1 << 0)
>>>> +#define IOMMU_FAULT_WRITE		(1 << 1)
>>>> +#define IOMMU_FAULT_EXEC		(1 << 2)
>>>> +#define IOMMU_FAULT_PRIV		(1 << 3)
>>>>  
>>>>  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
>>>>  			struct device *, unsigned long, int, void *);
>>>> +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *,
>>>> void *); 
>>>>  struct iommu_domain_geometry {
>>>>  	dma_addr_t aperture_start; /* First address that can be
>>>> mapped    */ @@ -255,6 +259,52 @@ struct iommu_device {
>>>>  	struct device *dev;
>>>>  };
>>>>  
>>>> +/**
>>>> + * struct iommu_fault_event - Generic per device fault data
>>>> + *
>>>> + * - PCI and non-PCI devices
>>>> + * - Recoverable faults (e.g. page request), information based on
>>>> PCI ATS
>>>> + * and PASID spec.
>>>> + * - Un-recoverable faults of device interest
>>>> + * - DMA remapping and IRQ remapping faults
>>>> + *
>>>> + * @fault: fault descriptor
>>>> + * @device_private: if present, uniquely identify device-specific
>>>> + *                  private data for an individual page request.
>>>> + * @iommu_private: used by the IOMMU driver for storing
>>>> fault-specific
>>>> + *                 data. Users should not modify this field before
>>>> + *                 sending the fault response.
>>>> + */
>>>> +struct iommu_fault_event {
>>>> +	struct iommu_fault fault;
>>>> +	u64 device_private;
>>> I think we want to move device_private to uapi since it gets injected
>>> into the guest, then returned by guest in case of page response. For
>>> VT-d we also need 128 bits of private data. VT-d spec. 7.7.1
>>
>> Ah, I didn't notice the format changed in VT-d rev3. On that topic, how
>> do we manage future extensions to the iommu_fault struct? Should we add
>> ~48 bytes of padding after device_private, along with some flags telling
>> which field is valid, or deal with it using a structure version like we
>> do for the invalidate and bind structs? In the first case, iommu_fault
>> wouldn't fit in a 64-byte cacheline anymore, but I'm not sure we care.
>>
>>> For exception tracking (e.g. unanswered page request), I can add timer
>>> and list info later when I include PRQ. sounds ok?
>>>> +	u64 iommu_private;
>> [...]
>>>> +/**
>>>> + * struct iommu_fault - Generic fault data
>>>> + *
>>>> + * @type contains fault type
>>>> + * @reason fault reasons if relevant outside IOMMU driver.
>>>> + * IOMMU driver internal faults are not reported.
>>>> + * @addr: tells the offending page address
>>>> + * @fetch_addr: tells the address that caused an abort, if any
>>>> + * @pasid: contains process address space ID, used in shared virtual
>>>> memory
>>>> + * @page_req_group_id: page request group index
>>>> + * @last_req: last request in a page request group
>>>> + * @pasid_valid: indicates if the PRQ has a valid PASID
>>>> + * @prot: page access protection flag:
>>>> + *	IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
>>>> + */
>>>> +
>>>> +struct iommu_fault {
>>>> +	__u32	type;   /* enum iommu_fault_type */
>>>> +	__u32	reason; /* enum iommu_fault_reason */
>>>> +	__u64	addr;
>>>> +	__u64	fetch_addr;
>>>> +	__u32	pasid;
>>>> +	__u32	page_req_group_id;
>>>> +	__u32	last_req;
>>>> +	__u32	pasid_valid;
>>>> +	__u32	prot;
>>>> +	__u32	access;
>>
>> What does @access contain? Can it be squashed into @prot?
> it was related to the F_ACCESS event record and was a placeholder for
> reporting the access attributes of the input transaction (RnW, InD, PnU
> fields). But I wonder whether this is needed to implement such
> fine-grained fault reporting. Do we really care?

I think we do, to properly inject PRI/Stall later. But RnW, InD and PnU
can already be described with the IOMMU_FAULT_* flags defined above.
We're missing CLASS and S2, which could also be useful for debugging.
CLASS is specific to SMMUv3 but could probably be represented with
@reason. For S2, we could keep printing stage-2 faults in the driver,
and not report them to userspace.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 14/21] iommu: introduce device fault data
  2019-01-16 15:52         ` Jean-Philippe Brucker
@ 2019-01-16 18:33           ` Auger Eric
  0 siblings, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-16 18:33 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Jacob Pan
  Cc: yi.l.liu, kevin.tian, alex.williamson, ashok.raj, kvm,
	Marc Zyngier, Will Deacon, linux-kernel, iommu, Robin Murphy,
	kvmarm, eric.auger.pro

Hi Jean,

On 1/16/19 4:52 PM, Jean-Philippe Brucker wrote:
> On 14/01/2019 22:32, Jacob Pan wrote:
>>> [...]
>>>>> +/**
>>>>> + * struct iommu_fault - Generic fault data
>>>>> + *
>>>>> + * @type contains fault type
>>>>> + * @reason fault reasons if relevant outside IOMMU driver.
>>>>> + * IOMMU driver internal faults are not reported.
>>>>> + * @addr: tells the offending page address
>>>>> + * @fetch_addr: tells the address that caused an abort, if any
>>>>> + * @pasid: contains process address space ID, used in shared
>>>>> virtual memory
>>>>> + * @page_req_group_id: page request group index
>>>>> + * @last_req: last request in a page request group
>>>>> + * @pasid_valid: indicates if the PRQ has a valid PASID
>>>>> + * @prot: page access protection flag:
>>>>> + *	IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
>>>>> + */
>>>>> +
>>>>> +struct iommu_fault {
>>>>> +	__u32	type;   /* enum iommu_fault_type */
>>>>> +	__u32	reason; /* enum iommu_fault_reason */
>>>>> +	__u64	addr;
>>>>> +	__u64	fetch_addr;
>>>>> +	__u32	pasid;
>>>>> +	__u32	page_req_group_id;
>>>>> +	__u32	last_req;
>>>>> +	__u32	pasid_valid;
>>>>> +	__u32	prot;
>>>>> +	__u32	access;  
>>>
>>> What does @access contain? Can it be squashed into @prot?
>>>
>> I agreed.
>>
>> how about this?
>> #define IOMMU_FAULT_VERSION_V1 0x1
>> struct iommu_fault {
>> 	__u16 version;
> 
> Right, but the version field becomes redundant when we present a batch
> of these to userspace, in patch 18 (assuming we don't want to mix fault
> structure versions within a batch... I certainly don't).
>
> When introducing IOMMU_FAULT_VERSION_V2, in a distant future, I think we
> still need to support a userspace that uses IOMMU_FAULT_VERSION_V1. One
> strategy for this:
> 
> * We define structs iommu_fault_v1 (the current iommu_fault) and
>   iommu_fault_v2.
> * Userspace selects IOMMU_FAULT_VERSION_V1 when registering the fault
>   queue
> * The IOMMU driver fills iommu_fault_v2 and passes it to VFIO
> * VFIO does its best to translate this into an iommu_fault_v1 struct
> 
> So what we need now, is a way for userspace to tell the kernel which
> structure version it expects. I'm not sure we even need to pass the
> actual version number we're using back to userspace. Agreeing on one
> version at registration should be sufficient.

As we expose a VFIO region we can report its size, entry size, max
supported entry version and actual entry version in the region
capabilities.

Conveying the version along with the eventfd at registration time will
require introducing a new flag at the vfio_irq_set level (if we still
plan to use the VFIO_DEVICE_SET_IRQS API), but that should be feasible.
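Something like this region info capability could carry it (sketch,
modeled on the existing vfio_info_cap mechanism; names hypothetical):

struct vfio_region_info_cap_fault {
	struct vfio_info_cap_header header;
	__u32	nb_entries;	/* queue depth */
	__u32	entry_size;	/* sizeof(struct iommu_fault) */
	__u16	max_version;	/* highest fault ABI the kernel supports */
	__u16	version;	/* fault ABI currently in use */
	__u32	reserved;
};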
> 
>> 	__u16 type;
>> 	__u32 reason;
>> 	__u64 addr;
> 
> I'm in favor of keeping @fetch_addr as well, it can contain useful
> information. For example, while attempting to translate an IOVA
> 0xfffff000, the IOMMU can't find the PASID table that we installed with
> address 0xdead - the guest passed an invalid address to
> bind_pasid_table(). We can then report 0xfffff000 in @addr, and 0xdead
> in @fetch_addr.
agreed
> 
>> 	__u32 pasid;
>> 	__u32 page_req_group_id;
>> 	__u32 last_req : 1;
>> 	__u32 pasid_valid : 1;
> 
> Agreed, with some explicit padding or combined as a @flag field. In fact
> if we do add the @fetch_addr field, I think we need a bit that indicates
> its validity as well.
Can't we simply state that fetch_addr is valid for
IOMMU_FAULT_REASON_PASID_FETCH, IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH
and IOMMU_FAULT_REASON_WALK_EABT, which maps to its usage in the SMMU
spec?
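That rule would be easy to capture on the consumer side (sketch):

/* fetch_addr is defined only for the fetch/walk related reasons */
static inline bool iommu_fault_fetch_addr_valid(__u32 reason)
{
	switch (reason) {
	case IOMMU_FAULT_REASON_PASID_FETCH:
	case IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH:
	case IOMMU_FAULT_REASON_WALK_EABT:
		return true;
	default:
		return false;
	}
}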

Thanks

Eric

> 
> Thanks,
> Jean
> 
>> 	__u32 prot;
>> 	__u64 device_private[2];
>> 	__u8 padding[48];
>> };
>>
>>
>>> Thanks,
>>> Jean
>>>
>>>> relocated to uapi, Yi can you confirm?
>>>> 	__u64 device_private[2];
>>>>   
>>>>> +};
>>>>>  #endif /* _UAPI_IOMMU_H */  
>>>>
>>>> _______________________________________________
>>>> iommu mailing list
>>>> iommu@lists.linux-foundation.org
>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>>>   
>>>
>>
>> [Jacob Pan]
>> _______________________________________________
>> iommu mailing list
>> iommu@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>
> 
> _______________________________________________
^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 00/21] SMMUv3 Nested Stage Setup
  2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
                   ` (20 preceding siblings ...)
  2019-01-08 10:26 ` [RFC v3 21/21] vfio: Document nested stage control Eric Auger
@ 2019-01-18 10:02 ` Auger Eric
  21 siblings, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-18 10:02 UTC (permalink / raw)
  To: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	alex.williamson, jacob.jun.pan, yi.l.liu, jean-philippe.brucker,
	will.deacon, robin.murphy
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

Hi,
On 1/8/19 11:26 AM, Eric Auger wrote:
> This series allows a virtualizer to program the nested stage mode.
> This is useful when both the host and the guest are exposed with
> an SMMUv3 and a PCI device is assigned to the guest using VFIO.
> 
> In this mode, the physical IOMMU must be programmed to translate
> the two stages: the one set up by the guest (IOVA -> GPA) and the
> one set up by the host VFIO driver as part of the assignment process
> (GPA -> HPA).
> 
> On Intel, this is traditionnaly achieved by combining the 2 stages
> into a single physical stage. However this relies on the capability
> to trap on each guest translation structure update. This is possible
> by using the VTD Caching Mode. Unfortunately the ARM SMMUv3 does
> not offer a similar mechanism.
> 
> However, the ARM SMMUv3 architecture supports 2 physical stages! Those
> were devised exactly with that use case in mind. Assuming the HW
> implements both stages (optional), the guest now can use stage 1
> while the host uses stage 2.
> 
> This assumes the virtualizer has means to propagate guest settings
> to the host SMMUv3 driver. This series brings this VFIO/IOMMU
> infrastructure.  Those services are:
> - bind the guest stage 1 configuration to the stream table entry
> - propagate guest TLB invalidations
> - bind MSI IOVAs
> - propagate faults collected at physical level up to the virtualizer
> 
> This series largely reuses the user API and infrastructure originally
> devised for SVA/SVM and patches submitted by Jacob, Yi Liu, Tianyu in
> [1-2] and Jean-Philippe [3-4].
> 
> Best Regards
> 
> Eric
> 
> This series can be found at:
> https://github.com/eauger/linux/tree/v5.0-rc1-2stage-rfc-v3
If someone is willing to test this series with QEMU you can use the
branch below, until I send a formal respin.

https://github.com/eauger/qemu/commits/v3.1.0-rc5-2stage-v3-for-rfc3-test-only.


Thanks

Eric


> 
> This was tested on Qualcomm HW featuring SMMUv3 and with adapted QEMU
> vSMMUv3.
> 
> References:
> [1] [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual
>     Address (SVA)
>     https://lwn.net/Articles/754331/
> [2] [RFC PATCH 0/8] Shared Virtual Memory virtualization for VT-d
>     (VFIO part)
>     https://lists.linuxfoundation.org/pipermail/iommu/2017-April/021475.html
> [3] [v2,00/40] Shared Virtual Addressing for the IOMMU
>     https://patchwork.ozlabs.org/cover/912129/
> [4] [PATCH v3 00/10] Shared Virtual Addressing for the IOMMU
>     https://patchwork.kernel.org/cover/10608299/
> 
> History:
> 
> v2 -> v3:
> - When registering the S1 MSI binding we now store the device handle. This
>   addresses Robin's comment about discimination of devices beonging to
>   different S1 groups and using different physical MSI doorbells.
> - Change the fault reporting API: use VFIO_PCI_DMA_FAULT_IRQ_INDEX to
>   set the eventfd and expose the faults through an mmappable fault region
> 
> v1 -> v2:
> - Added the fault reporting capability
> - asid properly passed on invalidation (fix assignment of multiple
>   devices)
> - see individual change logs for more info
> 
> Eric Auger (12):
>   iommu: Introduce bind_guest_msi
>   vfio: VFIO_IOMMU_BIND_MSI
>   iommu/smmuv3: Get prepared for nested stage support
>   iommu/smmuv3: Implement set_pasid_table
>   iommu/smmuv3: Implement cache_invalidate
>   dma-iommu: Implement NESTED_MSI cookie
>   iommu/smmuv3: Implement bind_guest_msi
>   iommu/smmuv3: Report non recoverable faults
>   vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type
>   vfio-pci: Register an iommu fault handler
>   vfio-pci: Add VFIO_PCI_DMA_FAULT_IRQ_INDEX
>   vfio: Document nested stage control
> 
> Jacob Pan (4):
>   iommu: Introduce set_pasid_table API
>   iommu: introduce device fault data
>   driver core: add per device iommu param
>   iommu: introduce device fault report API
> 
> Jean-Philippe Brucker (2):
>   iommu/arm-smmu-v3: Link domains and devices
>   iommu/arm-smmu-v3: Maintain a SID->device structure
> 
> Liu, Yi L (3):
>   iommu: Introduce cache_invalidate API
>   vfio: VFIO_IOMMU_SET_PASID_TABLE
>   vfio: VFIO_IOMMU_CACHE_INVALIDATE
> 
>  Documentation/vfio.txt              |  62 ++++
>  drivers/iommu/arm-smmu-v3.c         | 460 ++++++++++++++++++++++++++--
>  drivers/iommu/dma-iommu.c           | 112 ++++++-
>  drivers/iommu/iommu.c               | 187 ++++++++++-
>  drivers/vfio/pci/vfio_pci.c         | 147 ++++++++-
>  drivers/vfio/pci/vfio_pci_intrs.c   |  19 ++
>  drivers/vfio/pci/vfio_pci_private.h |   3 +
>  drivers/vfio/vfio_iommu_type1.c     | 105 +++++++
>  include/linux/device.h              |   3 +
>  include/linux/dma-iommu.h           |  11 +
>  include/linux/iommu.h               | 127 +++++++-
>  include/uapi/linux/iommu.h          | 234 ++++++++++++++
>  include/uapi/linux/vfio.h           |  38 +++
>  13 files changed, 1476 insertions(+), 32 deletions(-)
>  create mode 100644 include/uapi/linux/iommu.h
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API
  2019-01-11 18:16   ` Jean-Philippe Brucker
@ 2019-01-25  8:39     ` Auger Eric
  2019-01-25  8:55       ` Auger Eric
  0 siblings, 1 reply; 59+ messages in thread
From: Auger Eric @ 2019-01-25  8:39 UTC (permalink / raw)
  To: Jean-Philippe Brucker, eric.auger.pro, iommu, linux-kernel, kvm,
	kvmarm, joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	will.deacon, robin.murphy
  Cc: marc.zyngier, peter.maydell, kevin.tian, ashok.raj, christoffer.dall

Hi Jean-Philippe,

On 1/11/19 7:16 PM, Jean-Philippe Brucker wrote:
> On 08/01/2019 10:26, Eric Auger wrote:
>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>
>> In virtualization use case, when a guest is assigned
>> a PCI host device, protected by a virtual IOMMU on a guest,
>> the physical IOMMU must be programmed to be consistent with
>> the guest mappings. If the physical IOMMU supports two
>> translation stages it makes sense to program guest mappings
>> onto the first stage/level (ARM/VTD terminology) while to host
>> owns the stage/level 2.
>>
>> In that case, it is mandated to trap on guest configuration
>> settings and pass those to the physical iommu driver.
>>
>> This patch adds a new API to the iommu subsystem that allows
>> to set the pasid table information.
>>
>> A generic iommu_pasid_table_config struct is introduced in
>> a new iommu.h uapi header. This is going to be used by the VFIO
>> user API. We foresee at least two specializations of this struct,
>> for PASID table passing and ARM SMMUv3.
> 
> Last sentence is a bit confusing. With SMMUv3 it is also used for the
> PASID table, even when it only has one entry and PASID is disabled.
OK removed
> 
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>>
>> This patch generalizes the API introduced by Jacob & co-authors in
>> https://lwn.net/Articles/754331/
>>
>> v2 -> v3:
>> - replace unbind/bind by set_pasid_table
>> - move table pointer and pasid bits in the generic part of the struct
>>
>> v1 -> v2:
>> - restore the original pasid table name
>> - remove the struct device * parameter in the API
>> - reworked iommu_pasid_smmuv3
>> ---
>>  drivers/iommu/iommu.c      | 10 ++++++++
>>  include/linux/iommu.h      | 14 +++++++++++
>>  include/uapi/linux/iommu.h | 50 ++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 74 insertions(+)
>>  create mode 100644 include/uapi/linux/iommu.h
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 3ed4db334341..0f2b7f1fc7c8 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>  
>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>> +			  struct iommu_pasid_table_config *cfg)
>> +{
>> +	if (unlikely(!domain->ops->set_pasid_table))
>> +		return -ENODEV;
>> +
>> +	return domain->ops->set_pasid_table(domain, cfg);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>> +
>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>  				  struct device *dev)
>>  {
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index e90da6b6f3d1..1da2a2357ea4 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -25,6 +25,7 @@
>>  #include <linux/errno.h>
>>  #include <linux/err.h>
>>  #include <linux/of.h>
>> +#include <uapi/linux/iommu.h>
>>  
>>  #define IOMMU_READ	(1 << 0)
>>  #define IOMMU_WRITE	(1 << 1)
>> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>>   * @domain_window_disable: Disable a particular window for a domain
>>   * @of_xlate: add OF master IDs to iommu grouping
>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>> + * @set_pasid_table: set pasid table
>>   */
>>  struct iommu_ops {
>>  	bool (*capable)(enum iommu_cap);
>> @@ -226,6 +228,9 @@ struct iommu_ops {
>>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>>  	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>>  
>> +	int (*set_pasid_table)(struct iommu_domain *domain,
>> +			       struct iommu_pasid_table_config *cfg);
>> +
>>  	unsigned long pgsize_bitmap;
>>  };
>>  
>> @@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>>  			       struct device *dev);
>>  extern void iommu_detach_device(struct iommu_domain *domain,
>>  				struct device *dev);
>> +extern int iommu_set_pasid_table(struct iommu_domain *domain,
>> +				 struct iommu_pasid_table_config *cfg);
>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>> @@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>>  	return NULL;
>>  }
>>  
>> +static inline
>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>> +			  struct iommu_pasid_table_config *cfg)
>> +{
>> +	return -ENODEV;
>> +}
>> +
>>  #endif /* CONFIG_IOMMU_API */
>>  
>>  #ifdef CONFIG_IOMMU_DEBUGFS
>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>> new file mode 100644
>> index 000000000000..7a7cf7a3de7c
>> --- /dev/null
>> +++ b/include/uapi/linux/iommu.h
>> @@ -0,0 +1,50 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + * IOMMU user API definitions
>> + *
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
> 
> I don't think we need both the boilerplate and the SPDX header
OK I only kept the SPDX header.
> 
>> + */
>> +
>> +#ifndef _UAPI_IOMMU_H
>> +#define _UAPI_IOMMU_H
>> +
>> +#include <linux/types.h>
>> +
>> +/**
>> + * SMMUv3 Stream Table Entry stage 1 related information
>> + * @abort: shall the STE lead to abort
>> + * @s1fmt: STE s1fmt field as set by the guest
>> + * @s1dss: STE s1dss as set by the guest
>> + * All field names match the smmu 3.0/3.1 spec (ARM IHI 0070A)
> 
> Not really the case for @abort. Could you clarify whether @abort is
> valid in combination with @bypass?
abort corresponds to !Config[2]. In that case the spec says "report
abort to device, no event recorded". S1 bypass corresponds to
Config=0b1x0. What about removing abort from the SMMUv3 specific part
and encoding the stage state in the generic part? See the proposal
below ...
>> + */
>> +struct iommu_pasid_smmuv3 {
>> +	__u8 abort;
>> +	__u8 s1fmt;
>> +	__u8 s1dss;
>> +};
>> +
>> +/**
>> + * PASID table data used to bind guest PASID table to the host IOMMU
>> + * Note PASID table corresponds to the Context Table on ARM SMMUv3.
>> + *
>> + * @version: API version to prepare for future extensions
>> + * @format: format of the PASID table
>> + *
>> + */
>> +struct iommu_pasid_table_config {
>> +#define PASID_TABLE_CFG_VERSION_1 1
>> +	__u32	version;
>> +#define IOMMU_PASID_FORMAT_SMMUV3	(1 << 0)
>> +	__u32	format;
>> +	__u64	base_ptr;
>> +	__u8	pasid_bits;
>> +	__u8	bypass
#define IOMMU_PASID_STREAM_ABORT  (1 << 0)
#define IOMMU_PASID_STREAM_BYPASS (1 << 1)
#define IOMMU_PASID_STREAM_TRANSLATE (1 << 2)
__u8 config;
> 
> We need some padding, in case someone adds a new struct to the union
> that requires 64-byte alignment
OK
> 
> And 'bypass' might not be the right name if we're making it common,
> maybe 'reset' would be clearer? Or we just need to explain that bypass
> is the initial state of a nesting domain
I will add such a comment. To me the "bypass" terminology sounds
clearer than "reset".
> 
> Thanks,
> Jean
> 
>> +	union {
>> +		struct iommu_pasid_smmuv3 smmuv3;
>> +	};
>> +};
>> +
>> +#endif /* _UAPI_IOMMU_H */
>>
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API
  2019-01-25  8:39     ` Auger Eric
@ 2019-01-25  8:55       ` Auger Eric
  2019-01-25 10:33         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 59+ messages in thread
From: Auger Eric @ 2019-01-25  8:55 UTC (permalink / raw)
  To: Jean-Philippe Brucker, eric.auger.pro, iommu, linux-kernel, kvm,
	kvmarm, joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	will.deacon, robin.murphy
  Cc: marc.zyngier, peter.maydell, kevin.tian, ashok.raj, christoffer.dall

Hi Jean-Philippe,

On 1/25/19 9:39 AM, Auger Eric wrote:
> Hi Jean-Philippe,
> 
> On 1/11/19 7:16 PM, Jean-Philippe Brucker wrote:
>> On 08/01/2019 10:26, Eric Auger wrote:
>>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>
>>> In virtualization use case, when a guest is assigned
>>> a PCI host device, protected by a virtual IOMMU on a guest,
>>> the physical IOMMU must be programmed to be consistent with
>>> the guest mappings. If the physical IOMMU supports two
>>> translation stages it makes sense to program guest mappings
>>> onto the first stage/level (ARM/VTD terminology) while to host
>>> owns the stage/level 2.
>>>
>>> In that case, it is mandated to trap on guest configuration
>>> settings and pass those to the physical iommu driver.
>>>
>>> This patch adds a new API to the iommu subsystem that allows
>>> to set the pasid table information.
>>>
>>> A generic iommu_pasid_table_config struct is introduced in
>>> a new iommu.h uapi header. This is going to be used by the VFIO
>>> user API. We foresee at least two specializations of this struct,
>>> for PASID table passing and ARM SMMUv3.
>>
>> Last sentence is a bit confusing. With SMMUv3 it is also used for the
>> PASID table, even when it only has one entry and PASID is disabled.
> OK removed
>>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>
>>> ---
>>>
>>> This patch generalizes the API introduced by Jacob & co-authors in
>>> https://lwn.net/Articles/754331/
>>>
>>> v2 -> v3:
>>> - replace unbind/bind by set_pasid_table
>>> - move table pointer and pasid bits in the generic part of the struct
>>>
>>> v1 -> v2:
>>> - restore the original pasid table name
>>> - remove the struct device * parameter in the API
>>> - reworked iommu_pasid_smmuv3
>>> ---
>>>  drivers/iommu/iommu.c      | 10 ++++++++
>>>  include/linux/iommu.h      | 14 +++++++++++
>>>  include/uapi/linux/iommu.h | 50 ++++++++++++++++++++++++++++++++++++++
>>>  3 files changed, 74 insertions(+)
>>>  create mode 100644 include/uapi/linux/iommu.h
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index 3ed4db334341..0f2b7f1fc7c8 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>>>  }
>>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>>  
>>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>>> +			  struct iommu_pasid_table_config *cfg)
>>> +{
>>> +	if (unlikely(!domain->ops->set_pasid_table))
>>> +		return -ENODEV;
>>> +
>>> +	return domain->ops->set_pasid_table(domain, cfg);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>>> +
>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>  				  struct device *dev)
>>>  {
>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>> index e90da6b6f3d1..1da2a2357ea4 100644
>>> --- a/include/linux/iommu.h
>>> +++ b/include/linux/iommu.h
>>> @@ -25,6 +25,7 @@
>>>  #include <linux/errno.h>
>>>  #include <linux/err.h>
>>>  #include <linux/of.h>
>>> +#include <uapi/linux/iommu.h>
>>>  
>>>  #define IOMMU_READ	(1 << 0)
>>>  #define IOMMU_WRITE	(1 << 1)
>>> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>>>   * @domain_window_disable: Disable a particular window for a domain
>>>   * @of_xlate: add OF master IDs to iommu grouping
>>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>>> + * @set_pasid_table: set pasid table
>>>   */
>>>  struct iommu_ops {
>>>  	bool (*capable)(enum iommu_cap);
>>> @@ -226,6 +228,9 @@ struct iommu_ops {
>>>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>>>  	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>>>  
>>> +	int (*set_pasid_table)(struct iommu_domain *domain,
>>> +			       struct iommu_pasid_table_config *cfg);
>>> +
>>>  	unsigned long pgsize_bitmap;
>>>  };
>>>  
>>> @@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>>>  			       struct device *dev);
>>>  extern void iommu_detach_device(struct iommu_domain *domain,
>>>  				struct device *dev);
>>> +extern int iommu_set_pasid_table(struct iommu_domain *domain,
>>> +				 struct iommu_pasid_table_config *cfg);
>>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>>>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>>>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>>> @@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>>>  	return NULL;
>>>  }
>>>  
>>> +static inline
>>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>>> +			  struct iommu_pasid_table_config *cfg)
>>> +{
>>> +	return -ENODEV;
>>> +}
>>> +
>>>  #endif /* CONFIG_IOMMU_API */
>>>  
>>>  #ifdef CONFIG_IOMMU_DEBUGFS
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> new file mode 100644
>>> index 000000000000..7a7cf7a3de7c
>>> --- /dev/null
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -0,0 +1,50 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +/*
>>> + * IOMMU user API definitions
>>> + *
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License version 2 as
>>> + * published by the Free Software Foundation.
>>
>> I don't think we need both the boilerplate and the SPDX header
> OK I only kept the SPDX header.
>>
>>> + */
>>> +
>>> +#ifndef _UAPI_IOMMU_H
>>> +#define _UAPI_IOMMU_H
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +/**
>>> + * SMMUv3 Stream Table Entry stage 1 related information
>>> + * @abort: shall the STE lead to abort
>>> + * @s1fmt: STE s1fmt field as set by the guest
>>> + * @s1dss: STE s1dss as set by the guest
>>> + * All field names match the smmu 3.0/3.1 spec (ARM IHI 0070A)
>>
>> Not really the case for @abort. Could you clarify whether @abort is
>> valid in combination with @bypass?
> abort corresponds to !Config[2]. In that case the spec says "report
> abort to device, no event recorded". S1 bypass corresponds to
> Config=0b1x0. What about removing abort in the SMMUv3 specific part and
> encode the stage state in the generic part. See below proposal ...
>>> + */
>>> +struct iommu_pasid_smmuv3 {
>>> +	__u8 abort;
>>> +	__u8 s1fmt;
>>> +	__u8 s1dss;
>>> +};
>>> +
>>> +/**
>>> + * PASID table data used to bind guest PASID table to the host IOMMU
>>> + * Note PASID table corresponds to the Context Table on ARM SMMUv3.
>>> + *
>>> + * @version: API version to prepare for future extensions
>>> + * @format: format of the PASID table
>>> + *
>>> + */
>>> +struct iommu_pasid_table_config {
>>> +#define PASID_TABLE_CFG_VERSION_1 1
>>> +	__u32	version;
>>> +#define IOMMU_PASID_FORMAT_SMMUV3	(1 << 0)
>>> +	__u32	format;
>>> +	__u64	base_ptr;
>>> +	__u8	pasid_bits;
>>> +	__u8	bypass;
> #define IOMMU_PASID_STREAM_ABORT  (1 << 0)
> #define IOMMU_PASID_STREAM_BYPASS (1 << 1)
> #define IOMMU_PASID_STREAM_TRANSLATE (1 << 2)
> __u8 config;
Sorry for the confusion, we don't want a bitfield here as those values
are exclusive.

What about:
struct iommu_pasid_table_config {
#define PASID_TABLE_CFG_VERSION_1 1
        __u32   version;
#define IOMMU_PASID_FORMAT_SMMUV3       (1 << 0)
        __u32   format;
        __u64   base_ptr;
        __u8    pasid_bits;
#define IOMMU_PASID_CONFIG_BYPASS       1
#define IOMMU_PASID_CONFIG_ABORT        2
#define IOMMU_PASID_CONFIG_TRANSLATE    3
        __u8    config;
        __u8    padding[6];
        union {
                struct iommu_pasid_smmuv3 smmuv3;
        };
};
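
To make the intended usage concrete, here is a minimal, untested sketch
of how a virtualizer-side caller could fill this in when it traps a
guest STE update. The s1ctxptr/s1cdmax/s1fmt/s1dss locals stand for
values decoded from the guest STE and are purely illustrative:

	/* illustrative only, not part of the proposal */
	struct iommu_pasid_table_config cfg = {
		.version    = PASID_TABLE_CFG_VERSION_1,
		.format     = IOMMU_PASID_FORMAT_SMMUV3,
		.base_ptr   = s1ctxptr,	/* GPA of the guest CD table */
		.pasid_bits = s1cdmax,
		.config     = IOMMU_PASID_CONFIG_TRANSLATE,
		.smmuv3     = {
			.s1fmt = s1fmt,
			.s1dss = s1dss,
		},
	};

	ret = iommu_set_pasid_table(domain, &cfg);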


Thanks

Eric
>>
>> We need some padding, in case someone adds a new struct to the union
>> that requires 64-byte alignment
> OK
>>
>> And 'bypass' might not be the right name if we're making it common,
>> maybe 'reset' would be clearer? Or we just need to explain that bypass
>> is the initial state of a nesting domain
> I will add such a comment. To me, the "bypass" terminology sounds clearer
> than "reset"
>>
>> Thanks,
>> Jean
>>
>>> +	union {
>>> +		struct iommu_pasid_smmuv3 smmuv3;
>>> +	};
>>> +};
>>> +
>>> +#endif /* _UAPI_IOMMU_H */
>>>
>>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API
  2019-01-11 18:43   ` Alex Williamson
@ 2019-01-25  9:20     ` Auger Eric
  0 siblings, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-25  9:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

Hi Alex,

On 1/11/19 7:43 PM, Alex Williamson wrote:
> On Tue,  8 Jan 2019 11:26:13 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>
>> In a virtualization use case, when a guest is assigned
>> a PCI host device, protected by a virtual IOMMU on the guest,
>> the physical IOMMU must be programmed to be consistent with
>> the guest mappings. If the physical IOMMU supports two
>> translation stages, it makes sense to program guest mappings
>> onto the first stage/level (ARM/VTD terminology) while the host
>> owns stage/level 2.
>>
>> In that case, the host must trap guest configuration
>> settings and pass those to the physical iommu driver.
>>
>> This patch adds a new API to the iommu subsystem that allows
>> setting the pasid table information.
>>
>> A generic iommu_pasid_table_config struct is introduced in
>> a new iommu.h uapi header. This is going to be used by the VFIO
>> user API. We foresee at least two specializations of this struct,
>> for PASID table passing and ARM SMMUv3.
>>
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>>
>> This patch generalizes the API introduced by Jacob & co-authors in
>> https://lwn.net/Articles/754331/
>>
>> v2 -> v3:
>> - replace unbind/bind by set_pasid_table
>> - move table pointer and pasid bits in the generic part of the struct
>>
>> v1 -> v2:
>> - restore the original pasid table name
>> - remove the struct device * parameter in the API
>> - reworked iommu_pasid_smmuv3
>> ---
>>  drivers/iommu/iommu.c      | 10 ++++++++
>>  include/linux/iommu.h      | 14 +++++++++++
>>  include/uapi/linux/iommu.h | 50 ++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 74 insertions(+)
>>  create mode 100644 include/uapi/linux/iommu.h
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 3ed4db334341..0f2b7f1fc7c8 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>  
>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>> +			  struct iommu_pasid_table_config *cfg)
>> +{
>> +	if (unlikely(!domain->ops->set_pasid_table))
>> +		return -ENODEV;
>> +
>> +	return domain->ops->set_pasid_table(domain, cfg);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>> +
>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>  				  struct device *dev)
>>  {
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index e90da6b6f3d1..1da2a2357ea4 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -25,6 +25,7 @@
>>  #include <linux/errno.h>
>>  #include <linux/err.h>
>>  #include <linux/of.h>
>> +#include <uapi/linux/iommu.h>
>>  
>>  #define IOMMU_READ	(1 << 0)
>>  #define IOMMU_WRITE	(1 << 1)
>> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>>   * @domain_window_disable: Disable a particular window for a domain
>>   * @of_xlate: add OF master IDs to iommu grouping
>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>> + * @set_pasid_table: set pasid table
>>   */
>>  struct iommu_ops {
>>  	bool (*capable)(enum iommu_cap);
>> @@ -226,6 +228,9 @@ struct iommu_ops {
>>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>>  	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>>  
>> +	int (*set_pasid_table)(struct iommu_domain *domain,
>> +			       struct iommu_pasid_table_config *cfg);
>> +
>>  	unsigned long pgsize_bitmap;
>>  };
>>  
>> @@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>>  			       struct device *dev);
>>  extern void iommu_detach_device(struct iommu_domain *domain,
>>  				struct device *dev);
>> +extern int iommu_set_pasid_table(struct iommu_domain *domain,
>> +				 struct iommu_pasid_table_config *cfg);
>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>> @@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>>  	return NULL;
>>  }
>>  
>> +static inline
>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>> +			  struct iommu_pasid_table_config *cfg)
>> +{
>> +	return -ENODEV;
>> +}
>> +
>>  #endif /* CONFIG_IOMMU_API */
>>  
>>  #ifdef CONFIG_IOMMU_DEBUGFS
>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>> new file mode 100644
>> index 000000000000..7a7cf7a3de7c
>> --- /dev/null
>> +++ b/include/uapi/linux/iommu.h
>> @@ -0,0 +1,50 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + * IOMMU user API definitions
>> + *
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#ifndef _UAPI_IOMMU_H
>> +#define _UAPI_IOMMU_H
>> +
>> +#include <linux/types.h>
>> +
>> +/**
>> + * SMMUv3 Stream Table Entry stage 1 related information
>> + * @abort: shall the STE lead to abort
>> + * @s1fmt: STE s1fmt field as set by the guest
>> + * @s1dss: STE s1dss as set by the guest
>> + * All field names match the smmu 3.0/3.1 spec (ARM IHI 0070A)
>> + */
>> +struct iommu_pasid_smmuv3 {
>> +	__u8 abort;
>> +	__u8 s1fmt;
>> +	__u8 s1dss;
>> +};
>> +
> 
> I can find STE.S1DSS and STE.S1FMT in the spec, but not STE.ABORT, is
> this something to do with Config[2:0]?  Are we allowed to describe what
> these fields are beyond their name and why they're necessary here vs
> the other fields or do the spec restrictions preclude that?
Yes, you're right: abort matches !Config[2].

what about:

/**
 * SMMUv3 Stream Table Entry stage 1 related information
 * The PASID table is referred to as the context descriptor (CD) table.
 *
 * @s1fmt: STE s1fmt (format of the CD table: single CD, linear table
   or 2-level table)
 * @s1dss: STE s1dss (specifies the behavior when pasid_bits != 0
   and no pasid is passed along with the incoming transaction)
 * Please refer to the smmu 3.x spec (ARM IHI 0070A) for full details
 */
struct iommu_pasid_smmuv3 {
#define PASID_TABLE_SMMUV3_CFG_VERSION_1 1
        __u32   version;
        __u8 s1fmt;
        __u8 s1dss;
        __u8 padding[2];
};
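
For reference (my reading of the spec, to be double checked), the raw
encodings the guest passes through these two fields are:

	/*
	 * STE.S1Fmt: 0b00 linear CD table, 0b01 2-level with 4KB leaf
	 * tables, 0b10 2-level with 64KB leaf tables.
	 * STE.S1DSS: 0b00 terminate, 0b01 bypass stage 1, 0b10 use the
	 * substream 0 (SSID0) CD.
	 */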


> 
>> +/**
>> + * PASID table data used to bind guest PASID table to the host IOMMU
>> + * Note PASID table corresponds to the Context Table on ARM SMMUv3.
>> + *
>> + * @version: API version to prepare for future extensions
>> + * @format: format of the PASID table
>> + *
>> + */
>> +struct iommu_pasid_table_config {
>> +#define PASID_TABLE_CFG_VERSION_1 1
>> +	__u32	version;
>> +#define IOMMU_PASID_FORMAT_SMMUV3	(1 << 0)
>> +	__u32	format;
>> +	__u64	base_ptr;
>> +	__u8	pasid_bits;
>> +	__u8	bypass;
>> +	union {
>> +		struct iommu_pasid_smmuv3 smmuv3;
>> +	};
>> +};
> 
> Structure is not naturally aligned or explicitly aligned for
> interchange with userspace.  It might work for smmuv3 since the
> structure is only composed of bytes, but looks troublesome in general.
> Should each format type also contain a version?  Is format intended to
> be a bit-field or a signature?  It seems we only need a signature, but
> only having a single format defined, it looks like a bit-field, which
> makes me worry what we do when we exhaust the bits.

I think a signature is what we need.

> The bypass field
> should be better defined, is it 0/1?  zero/non-zero?  more selective?

I suggest replacing it with a discrete config field (bypass, abort, translate).

Thanks

Eric
> Thanks,
> 
> Alex
> 
>> +
>> +#endif /* _UAPI_IOMMU_H */
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 01/21] iommu: Introduce set_pasid_table API
  2019-01-25  8:55       ` Auger Eric
@ 2019-01-25 10:33         ` Jean-Philippe Brucker
  0 siblings, 0 replies; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-25 10:33 UTC (permalink / raw)
  To: Auger Eric, eric.auger.pro, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu, Will Deacon,
	Robin Murphy
  Cc: Marc Zyngier, peter.maydell, kevin.tian, ashok.raj, Christoffer Dall

On 25/01/2019 08:55, Auger Eric wrote:
> Hi Jean-Philippe,
> 
> On 1/25/19 9:39 AM, Auger Eric wrote:
>> Hi Jean-Philippe,
>>
>> On 1/11/19 7:16 PM, Jean-Philippe Brucker wrote:
>>> On 08/01/2019 10:26, Eric Auger wrote:
>>>> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>
>>>> In a virtualization use case, when a guest is assigned
>>>> a PCI host device, protected by a virtual IOMMU on the guest,
>>>> the physical IOMMU must be programmed to be consistent with
>>>> the guest mappings. If the physical IOMMU supports two
>>>> translation stages, it makes sense to program guest mappings
>>>> onto the first stage/level (ARM/VTD terminology) while the host
>>>> owns stage/level 2.
>>>>
>>>> In that case, the host must trap guest configuration
>>>> settings and pass those to the physical iommu driver.
>>>>
>>>> This patch adds a new API to the iommu subsystem that allows
>>>> setting the pasid table information.
>>>>
>>>> A generic iommu_pasid_table_config struct is introduced in
>>>> a new iommu.h uapi header. This is going to be used by the VFIO
>>>> user API. We foresee at least two specializations of this struct,
>>>> for PASID table passing and ARM SMMUv3.
>>>
>>> Last sentence is a bit confusing. With SMMUv3 it is also used for the
>>> PASID table, even when it only has one entry and PASID is disabled.
>> OK removed
>>>
>>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>
>>>> ---
>>>>
>>>> This patch generalizes the API introduced by Jacob & co-authors in
>>>> https://lwn.net/Articles/754331/
>>>>
>>>> v2 -> v3:
>>>> - replace unbind/bind by set_pasid_table
>>>> - move table pointer and pasid bits in the generic part of the struct
>>>>
>>>> v1 -> v2:
>>>> - restore the original pasid table name
>>>> - remove the struct device * parameter in the API
>>>> - reworked iommu_pasid_smmuv3
>>>> ---
>>>>  drivers/iommu/iommu.c      | 10 ++++++++
>>>>  include/linux/iommu.h      | 14 +++++++++++
>>>>  include/uapi/linux/iommu.h | 50 ++++++++++++++++++++++++++++++++++++++
>>>>  3 files changed, 74 insertions(+)
>>>>  create mode 100644 include/uapi/linux/iommu.h
>>>>
>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>> index 3ed4db334341..0f2b7f1fc7c8 100644
>>>> --- a/drivers/iommu/iommu.c
>>>> +++ b/drivers/iommu/iommu.c
>>>> @@ -1393,6 +1393,16 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>>>  
>>>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>>>> +			  struct iommu_pasid_table_config *cfg)
>>>> +{
>>>> +	if (unlikely(!domain->ops->set_pasid_table))
>>>> +		return -ENODEV;
>>>> +
>>>> +	return domain->ops->set_pasid_table(domain, cfg);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>>>> +
>>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>>  				  struct device *dev)
>>>>  {
>>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>>> index e90da6b6f3d1..1da2a2357ea4 100644
>>>> --- a/include/linux/iommu.h
>>>> +++ b/include/linux/iommu.h
>>>> @@ -25,6 +25,7 @@
>>>>  #include <linux/errno.h>
>>>>  #include <linux/err.h>
>>>>  #include <linux/of.h>
>>>> +#include <uapi/linux/iommu.h>
>>>>  
>>>>  #define IOMMU_READ	(1 << 0)
>>>>  #define IOMMU_WRITE	(1 << 1)
>>>> @@ -184,6 +185,7 @@ struct iommu_resv_region {
>>>>   * @domain_window_disable: Disable a particular window for a domain
>>>>   * @of_xlate: add OF master IDs to iommu grouping
>>>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>>>> + * @set_pasid_table: set pasid table
>>>>   */
>>>>  struct iommu_ops {
>>>>  	bool (*capable)(enum iommu_cap);
>>>> @@ -226,6 +228,9 @@ struct iommu_ops {
>>>>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
>>>>  	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>>>>  
>>>> +	int (*set_pasid_table)(struct iommu_domain *domain,
>>>> +			       struct iommu_pasid_table_config *cfg);
>>>> +
>>>>  	unsigned long pgsize_bitmap;
>>>>  };
>>>>  
>>>> @@ -287,6 +292,8 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>>>>  			       struct device *dev);
>>>>  extern void iommu_detach_device(struct iommu_domain *domain,
>>>>  				struct device *dev);
>>>> +extern int iommu_set_pasid_table(struct iommu_domain *domain,
>>>> +				 struct iommu_pasid_table_config *cfg);
>>>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>>>>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>>>>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>>>> @@ -696,6 +703,13 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>>>>  	return NULL;
>>>>  }
>>>>  
>>>> +static inline
>>>> +int iommu_set_pasid_table(struct iommu_domain *domain,
>>>> +			  struct iommu_pasid_table_config *cfg)
>>>> +{
>>>> +	return -ENODEV;
>>>> +}
>>>> +
>>>>  #endif /* CONFIG_IOMMU_API */
>>>>  
>>>>  #ifdef CONFIG_IOMMU_DEBUGFS
>>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>>> new file mode 100644
>>>> index 000000000000..7a7cf7a3de7c
>>>> --- /dev/null
>>>> +++ b/include/uapi/linux/iommu.h
>>>> @@ -0,0 +1,50 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>>> +/*
>>>> + * IOMMU user API definitions
>>>> + *
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License version 2 as
>>>> + * published by the Free Software Foundation.
>>>
>>> I don't think we need both the boilerplate and the SPDX header
>> OK I only kept the SPDX header.
>>>
>>>> + */
>>>> +
>>>> +#ifndef _UAPI_IOMMU_H
>>>> +#define _UAPI_IOMMU_H
>>>> +
>>>> +#include <linux/types.h>
>>>> +
>>>> +/**
>>>> + * SMMUv3 Stream Table Entry stage 1 related information
>>>> + * @abort: shall the STE lead to abort
>>>> + * @s1fmt: STE s1fmt field as set by the guest
>>>> + * @s1dss: STE s1dss as set by the guest
>>>> + * All field names match the smmu 3.0/3.1 spec (ARM IHI 0070A)
>>>
>>> Not really the case for @abort. Could you clarify whether @abort is
>>> valid in combination with @bypass?
>> abort corresponds to !Config[2]. In that case the spec says "report
>> abort to device, no event recorded". S1 bypass corresponds to
>> Config=0b1x0. What about removing abort in the SMMUv3 specific part and
>> encode the stage state in the generic part. See below proposal ...
>>>> + */
>>>> +struct iommu_pasid_smmuv3 {
>>>> +	__u8 abort;
>>>> +	__u8 s1fmt;
>>>> +	__u8 s1dss;
>>>> +};
>>>> +
>>>> +/**
>>>> + * PASID table data used to bind guest PASID table to the host IOMMU
>>>> + * Note PASID table corresponds to the Context Table on ARM SMMUv3.
>>>> + *
>>>> + * @version: API version to prepare for future extensions
>>>> + * @format: format of the PASID table
>>>> + *
>>>> + */
>>>> +struct iommu_pasid_table_config {
>>>> +#define PASID_TABLE_CFG_VERSION_1 1
>>>> +	__u32	version;
>>>> +#define IOMMU_PASID_FORMAT_SMMUV3	(1 << 0)
>>>> +	__u32	format;
>>>> +	__u64	base_ptr;
>>>> +	__u8	pasid_bits;
>>>> +	__u8	bypass
>> #define IOMMU_PASID_STREAM_ABORT  (1 << 0)
>> #define IOMMU_PASID_STREAM_BYPASS (1 << 1)
>> #define IOMMU_PASID_STREAM_TRANSLATE (1 << 2)
>> __u8 config;
> Sorry for the confusion, we don't want a bitfield here as those values
> are exclusive.
> 
> What about:
> struct iommu_pasid_table_config {
> #define PASID_TABLE_CFG_VERSION_1 1
>         __u32   version;
> #define IOMMU_PASID_FORMAT_SMMUV3       (1 << 0)
>         __u32   format;
>         __u64   base_ptr;
>         __u8    pasid_bits;
> #define IOMMU_PASID_CONFIG_BYPASS       1
> #define IOMMU_PASID_CONFIG_ABORT        2
> #define IOMMU_PASID_CONFIG_TRANSLATE    3

Yes, this makes more sense :)

>         __u8    config;
>         __u8    padding[6];
>         union {
>                 struct iommu_pasid_smmuv3 smmuv3;
>         };
> };

Thanks,
Jean

> 
> Thanks
> 
> Eric
>>>
>>> We need some padding, in case someone adds a new struct to the union
>>> that requires 64-byte alignment
>> OK
>>>
>>> And 'bypass' might not be the right name if we're making it common,
>>> maybe 'reset' would be clearer? Or we just need to explain that bypass
>>> is the initial state of a nesting domain
>> I will add such a comment. To me, the "bypass" terminology sounds clearer
>> than "reset"
>>>
>>> Thanks,
>>> Jean
>>>
>>>> +	union {
>>>> +		struct iommu_pasid_smmuv3 smmuv3;
>>>> +	};
>>>> +};
>>>> +
>>>> +#endif /* _UAPI_IOMMU_H */
>>>>
>>>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API
  2019-01-11 21:30   ` Alex Williamson
@ 2019-01-25 16:49     ` Auger Eric
  2019-01-28 17:32       ` Jean-Philippe Brucker
  2019-01-29 23:16       ` Alex Williamson
  0 siblings, 2 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-25 16:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

Hi Alex,

On 1/11/19 10:30 PM, Alex Williamson wrote:
> On Tue,  8 Jan 2019 11:26:14 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
>> From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
>>
>> In any virtualization use case, when the first translation stage
>> is "owned" by the guest OS, the host IOMMU driver has no knowledge
>> of caching structure updates unless the guest invalidation activities
>> are trapped by the virtualizer and passed down to the host.
>>
>> Since the invalidation data are obtained from user space and will be
>> written into the physical IOMMU, we must allow security checks at
>> various layers. Therefore, a generic invalidation data format is
>> proposed here; model specific IOMMU drivers need to convert it into
>> their own format.
>>
>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>> v1 -> v2:
>> - add arch_id field
>> - renamed tlb_invalidate into cache_invalidate as this API allows
>>   to invalidate context caches on top of IOTLBs
>>
>> v1:
>> renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
>> header. Commit message reworded.
>> ---
>>  drivers/iommu/iommu.c      | 14 ++++++
>>  include/linux/iommu.h      | 14 ++++++
>>  include/uapi/linux/iommu.h | 95 ++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 123 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 0f2b7f1fc7c8..b2e248770508 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1403,6 +1403,20 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>>  
>> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
>> +			   struct iommu_cache_invalidate_info *inv_info)
>> +{
>> +	int ret = 0;
>> +
>> +	if (unlikely(!domain->ops->cache_invalidate))
>> +		return -ENODEV;
>> +
>> +	ret = domain->ops->cache_invalidate(domain, dev, inv_info);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
>> +
>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>  				  struct device *dev)
>>  {
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 1da2a2357ea4..96d59886f230 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -186,6 +186,7 @@ struct iommu_resv_region {
>>   * @of_xlate: add OF master IDs to iommu grouping
>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>>   * @set_pasid_table: set pasid table
>> + * @cache_invalidate: invalidate translation caches
>>   */
>>  struct iommu_ops {
>>  	bool (*capable)(enum iommu_cap);
>> @@ -231,6 +232,9 @@ struct iommu_ops {
>>  	int (*set_pasid_table)(struct iommu_domain *domain,
>>  			       struct iommu_pasid_table_config *cfg);
>>  
>> +	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
>> +				struct iommu_cache_invalidate_info *inv_info);
>> +
>>  	unsigned long pgsize_bitmap;
>>  };
>>  
>> @@ -294,6 +298,9 @@ extern void iommu_detach_device(struct iommu_domain *domain,
>>  				struct device *dev);
>>  extern int iommu_set_pasid_table(struct iommu_domain *domain,
>>  				 struct iommu_pasid_table_config *cfg);
>> +extern int iommu_cache_invalidate(struct iommu_domain *domain,
>> +				struct device *dev,
>> +				struct iommu_cache_invalidate_info *inv_info);
>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>> @@ -709,6 +716,13 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
>>  {
>>  	return -ENODEV;
>>  }
>> +static inline int
>> +iommu_cache_invalidate(struct iommu_domain *domain,
>> +		       struct device *dev,
>> +		       struct iommu_cache_invalidate_info *inv_info)
>> +{
>> +	return -ENODEV;
>> +}
>>  
>>  #endif /* CONFIG_IOMMU_API */
>>  
>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>> index 7a7cf7a3de7c..4605f5cfac84 100644
>> --- a/include/uapi/linux/iommu.h
>> +++ b/include/uapi/linux/iommu.h
>> @@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
>>  	};
>>  };
>>  
>> +/**
>> + * enum iommu_inv_granularity - Generic invalidation granularity
>> + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:	TLB entries or PASID caches of all
>> + *					PASIDs associated with a domain ID
>> + * @IOMMU_INV_GRANU_PASID_SEL:		TLB entries or PASID cache associated
>> + *					with a PASID and a domain
>> + * @IOMMU_INV_GRANU_PAGE_PASID:		TLB entries of selected page range
>> + *					within a PASID
>> + *
>> + * When an invalidation request is passed down to IOMMU to flush translation
>> + * caches, it may carry different granularity levels, which can be specific
>> + * to certain types of translation caches.
>> + * This enum is a collection of granularities for all types of translation
>> + * caches. The idea is to make it easy for IOMMU model specific driver to
>> + * convert from generic to model specific value. Each IOMMU driver
>> + * can enforce check based on its own conversion table. The conversion is
>> + * based on 2D look-up with inputs as follows:
>> + * - translation cache types
>> + * - granularity
>> + *
>> + *             type |   DTLB    |    TLB    |   PASID   |
>> + *  granule         |           |           |   cache   |
>> + * -----------------+-----------+-----------+-----------+
>> + *  DN_ALL_PASID    |   Y       |   Y       |   Y       |
>> + *  PASID_SEL       |   Y       |   Y       |   Y       |
>> + *  PAGE_PASID      |   Y       |   Y       |   N/A     |
>> + *
>> + */
>> +enum iommu_inv_granularity {
>> +	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
>> +	IOMMU_INV_GRANU_PASID_SEL,
>> +	IOMMU_INV_GRANU_PAGE_PASID,
>> +	IOMMU_INV_NR_GRANU,
>> +};
>> +
>> +/**
>> + * enum iommu_inv_type - Generic translation cache types for invalidation
>> + *
>> + * @IOMMU_INV_TYPE_DTLB:	device IOTLB
>> + * @IOMMU_INV_TYPE_TLB:		IOMMU paging structure cache
>> + * @IOMMU_INV_TYPE_PASID:	PASID cache
>> + * Invalidation requests sent to IOMMU for a given device need to indicate
>> + * which type of translation cache to be operated on. Combined with enum
>> + * iommu_inv_granularity, model specific driver can do a simple lookup to
>> + * convert from generic to model specific value.
>> + */
>> +enum iommu_inv_type {
>> +	IOMMU_INV_TYPE_DTLB,
>> +	IOMMU_INV_TYPE_TLB,
>> +	IOMMU_INV_TYPE_PASID,
>> +	IOMMU_INV_NR_TYPE
>> +};
>> +
>> +/**
>> + * Translation cache invalidation header that contains mandatory meta data.
>> + * @version:	info format version, expecting future extensions
>> + * @type:	type of translation cache to be invalidated
>> + */
>> +struct iommu_cache_invalidate_hdr {
>> +	__u32 version;
>> +#define TLB_INV_HDR_VERSION_1 1
>> +	enum iommu_inv_type type;
>> +};
>> +
>> +/**
>> + * Translation cache invalidation information, contains generic IOMMU
>> + * data which can be parsed based on model ID by model specific drivers.
>> + * Since the invalidation of second level page tables is included in the
>> + * unmap operation, this info is only applicable to the first level
>> + * translation caches, i.e. DMA requests with PASID.
>> + *
>> + * @granularity:	requested invalidation granularity, type dependent
>> + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> 
> Why is this a 4K page centric interface?
This matches the vt-d Address Mask (AM) field of the IOTLB Invalidate
Descriptor. We can pass a log2size instead.
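
To illustrate how the two encodings would relate (sketch only):

	/* vt-d AM style:  bytes = 4K << size        (size 0 -> 4K, 9 -> 2MB) */
	/* log2size style: bytes = 1ULL << log2size  (12 -> 4K, 21 -> 2MB)    */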
> 
>> + * @nr_pages:		number of pages to invalidate
>> + * @pasid:		processor address space ID value per PCI spec.
>> + * @arch_id:		architecture dependent id characterizing a context
>> + *			and tagging the caches, i.e. domain identifier on VT-d,
>> + *			asid on ARM SMMU
>> + * @addr:		page address to be invalidated
>> + * @flags		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
>> + *			IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
> 
> Shouldn't some of these be tied to the granularity of the
> invalidation?  It seems like this should be more similar to
> iommu_pasid_table_config where the granularity of the invalidation
> defines which entry within a union at the end of the structure is valid
> and populated.  Otherwise we have fields that don't make sense for
> certain invalidations.

I am a little bit embarrassed here as this API version is the outcome of
> long discussions held by Jacob, Jean-Philippe and many others. I don't
want to hijack that work as I am "simply" reusing this API. Nevertheless
I am willing to help on this. So following your recommendation above I
dare to propose an updated API:

struct iommu_device_iotlb_inv_info {
        __u32   version;
#define IOMMU_DEV_IOTLB_INV_GLOBAL   0
#define IOMMU_DEV_IOTLB_INV_SOURCEID (1 << 0)
#define IOMMU_DEV_IOTLB_INV_PASID    (1 << 1)
        __u8    granularity;
        __u64   addr;
        __u8    log2size;
        __u64   sourceid;
        __u64   pasid;
        __u8    padding[2];
};

struct iommu_iotlb_inv_info {
        __u32   version;
#define IOMMU_IOTLB_INV_GLOBAL  0
#define IOMMU_IOTLB_INV_ARCHID  (1 << 0)
#define IOMMU_IOTLB_INV_PASID   (1 << 1)
#define IOMMU_IOTLB_INV_PAGE    (1 << 2)
        __u8    granularity;
        __u64   archid;
        __u64   pasid;
        __u64   addr;
        __u8    log2size;
        __u8    padding[2];
};

struct iommu_pasid_inv_info {
        __u32   version;
#define IOMMU_PASID_INV_GLOBAL     0
#define IOMMU_PASID_INV_ARCHID     (1 << 0)
#define IOMMU_PASID_INV_PASID      (1 << 1)
#define IOMMU_PASID_INV_SOURCEID   (1 << 2)
        __u8    granularity;
        __u64   archid;
        __u64   pasid;
        __u64   sourceid;
        __u8    padding[3];
};
/**
 * Translation cache invalidation information, contains generic IOMMU
 * data which can be parsed based on model ID by model specific drivers.
 * Since the invalidation of second level page tables is included in
 * the unmap operation, this info is only applicable to the first level
 * translation caches, i.e. DMA requests with PASID.
 *
 */
struct iommu_cache_invalidate_info {
#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
        __u32 version;
#define IOMMU_INV_TYPE_IOTLB        1 /* IOMMU paging structure cache */
#define IOMMU_INV_TYPE_DEV_IOTLB    2 /* Device IOTLB */
#define IOMMU_INV_TYPE_PASID        3 /* PASID cache */
        __u8 type;
        union {
                struct iommu_iotlb_inv_info iotlb_inv_info;
                struct iommu_device_iotlb_inv_info dev_iotlb_inv_info;
                struct iommu_pasid_inv_info pasid_inv_info;
        };
};
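
As a sanity check of the proposed layout, an (untested) example
invalidating a single 2MB range for a given archid/pasid could look as
follows; asid, pasid and iova are illustrative locals, and I leave the
nested version fields out for brevity:

struct iommu_cache_invalidate_info inv = {
        .version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1,
        .type    = IOMMU_INV_TYPE_IOTLB,
        .iotlb_inv_info = {
                .granularity = IOMMU_IOTLB_INV_ARCHID |
                               IOMMU_IOTLB_INV_PASID |
                               IOMMU_IOTLB_INV_PAGE,
                .archid   = asid,
                .pasid    = pasid,
                .addr     = iova,
                .log2size = 21,	/* 2MB */
        },
};

ret = iommu_cache_invalidate(domain, dev, &inv);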

At the moment I ignore the leaf bool parameter used on ARM for PASID
invalidation and TLB invalidation. Maybe we can just invalidate more
than the leaf cache structures for now?

On ARM the PASID table can be invalidated per streamid. On vt-d, as far
as I understand, the sourceid does not tag the entries.





> 
>> + *
>> + */
>> +struct iommu_cache_invalidate_info {
>> +	struct iommu_cache_invalidate_hdr	hdr;
>> +	enum iommu_inv_granularity	granularity;
> 
> A separate structure for hdr seems a little pointless.
removed
> 
>> +	__u32		flags;
>> +#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 0)
>> +#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 1)
>> +	__u8		size;
> 
> Really need some padding or packing here for any hope of having
> consistency with userspace.
> 
>> +	__u64		nr_pages;
>> +	__u32		pasid;
> 
> Sub-optimal ordering for packing/padding.  Thanks,
I introduced some padding above. Is that OK?

Again, if this introduces more noise than it helps, I will simply rely on
the initial contributors for the respin of their series according to your
comments. Also if we can't define generic enough structures for ARM and x86

Thanks

Eric
> 
> Alex
> 
>> +	__u64		arch_id;
>> +	__u64		addr;
>> +};
>>  #endif /* _UAPI_IOMMU_H */
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 03/21] iommu: Introduce bind_guest_msi
  2019-01-11 22:44   ` Alex Williamson
@ 2019-01-25 17:51     ` Auger Eric
  2019-01-25 18:11     ` Auger Eric
  1 sibling, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-25 17:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

Hi Alex,
On 1/11/19 11:44 PM, Alex Williamson wrote:
> On Tue,  8 Jan 2019 11:26:15 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
>> On ARM, MSIs are translated by the SMMU. An IOVA is allocated
>> for each MSI doorbell. If both the host and the guest are exposed
>> with SMMUs, we end up with 2 different IOVAs allocated by each. The
>> guest allocates an IOVA (gIOVA) to map onto the guest MSI
>> doorbell (gDB). The host allocates another IOVA (hIOVA) to map
>> onto the physical doorbell (hDB).
>>
>> So we end up with 2 untied mappings:
>>          S1            S2
>> gIOVA    ->    gDB
>>               hIOVA    ->    gDB
>                                ^^^ hDB
right!
> 
>> Currently the PCI device is programmed by the host with hIOVA
>> as MSI doorbell. So this does not work.
>>
>> This patch introduces an API to pass gIOVA/gDB to the host so
>> that gIOVA can be reused by the host instead of re-allocating
>> a new IOVA. So the goal is to create the following nested mapping:
>>
>>          S1            S2
>> gIOVA    ->    gDB     ->    hDB
>>
>> and program the PCI device with gIOVA MSI doorbell.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>>
>> v2 -> v3:
>> - add a struct device handle
>> ---
>>  drivers/iommu/iommu.c      | 10 ++++++++++
>>  include/linux/iommu.h      | 13 +++++++++++++
>>  include/uapi/linux/iommu.h |  6 ++++++
>>  3 files changed, 29 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index b2e248770508..ea11442e7054 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1431,6 +1431,16 @@ static void __iommu_detach_device(struct iommu_domain *domain,
>>  	trace_detach_device_from_domain(dev);
>>  }
>>  
>> +int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
>> +			 struct iommu_guest_msi_binding *binding)
>> +{
>> +	if (unlikely(!domain->ops->bind_guest_msi))
>> +		return -ENODEV;
>> +
>> +	return domain->ops->bind_guest_msi(domain, dev, binding);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
>> +
>>  void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
>>  {
>>  	struct iommu_group *group;
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 96d59886f230..244c1a3d5989 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -235,6 +235,9 @@ struct iommu_ops {
>>  	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
>>  				struct iommu_cache_invalidate_info *inv_info);
>>  
>> +	int (*bind_guest_msi)(struct iommu_domain *domain, struct device *dev,
>> +			      struct iommu_guest_msi_binding *binding);
>> +
>>  	unsigned long pgsize_bitmap;
>>  };
>>  
>> @@ -301,6 +304,9 @@ extern int iommu_set_pasid_table(struct iommu_domain *domain,
>>  extern int iommu_cache_invalidate(struct iommu_domain *domain,
>>  				struct device *dev,
>>  				struct iommu_cache_invalidate_info *inv_info);
>> +extern int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
>> +				struct iommu_guest_msi_binding *binding);
>> +
>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>> @@ -724,6 +730,13 @@ iommu_cache_invalidate(struct iommu_domain *domain,
>>  	return -ENODEV;
>>  }
>>  
>> +static inline
>> +int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
>> +			 struct iommu_guest_msi_binding *binding)
>> +{
>> +	return -ENODEV;
>> +}
>> +
>>  #endif /* CONFIG_IOMMU_API */
>>  
>>  #ifdef CONFIG_IOMMU_DEBUGFS
>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>> index 4605f5cfac84..f28cd9a1aa96 100644
>> --- a/include/uapi/linux/iommu.h
>> +++ b/include/uapi/linux/iommu.h
>> @@ -142,4 +142,10 @@ struct iommu_cache_invalidate_info {
>>  	__u64		arch_id;
>>  	__u64		addr;
>>  };
>> +
>> +struct iommu_guest_msi_binding {
>> +	__u64		iova;
>> +	__u64		gpa;
>> +	__u32		granule;
> 
> What's granule?  The size?  This looks a lot like just a stage 1
> mapping interface, I can't really figure out from the description how
> this matches to any specific MSI mapping.
Yes that's just a stage 1 binding. The granule is the log2size of the
stage1 page. As this is a guest mapping of a virtual doorbell, this is
WRITE only.

What about something like:
/**
 * 1st level/stage1 binding of a virtual MSI doorbell
 *
 * @iova:       guest IOVA (gIOVA) mapped onto the virtual doorbell
 * @gpa:        guest physical address of the virtual doorbell
 * @log2size:   log2size of the doorbell (generally a guest page)
 *
 * As this is an MSI doorbell, the mapping is write only.
 */
struct iommu_guest_msi_binding {
        __u64   iova;
        __u64   gpa;
        __u32   log2size;
};

Also added:

/**
 * iommu_bind_guest_msi - Passes the stage1 binding of the virtual
 * doorbell used by the assigned device @dev.
 *
 * @domain: iommu domain the stage 1 mapping will be attached to
 * @dev: assigned device which uses this stage1 mapping
 * @binding: stage1 MSI binding
 *
 * The associated IOVA can be reused by the host to create a nested
 * stage2 binding mapping onto the physical doorbell used by @dev
 */
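
To illustrate, a rough sketch of the trap-path call site (giova and gdb
are illustrative names for the values trapped from the guest):

struct iommu_guest_msi_binding binding = {
        .iova     = giova,	/* gIOVA the guest mapped to its doorbell */
        .gpa      = gdb,	/* guest physical doorbell address */
        .log2size = 12,		/* guest stage 1 uses 4K pages */
};

ret = iommu_bind_guest_msi(domain, dev, &binding);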


Thanks

Eric

> Zero comments in the code
> or headers here about how this is supposed to work.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 03/21] iommu: Introduce bind_guest_msi
  2019-01-11 22:44   ` Alex Williamson
  2019-01-25 17:51     ` Auger Eric
@ 2019-01-25 18:11     ` Auger Eric
  1 sibling, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-25 18:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

Hi Alex,

On 1/11/19 11:44 PM, Alex Williamson wrote:
> On Tue,  8 Jan 2019 11:26:15 +0100
> Eric Auger <eric.auger@redhat.com> wrote:
> 
>> On ARM, MSIs are translated by the SMMU. An IOVA is allocated
>> for each MSI doorbell. If both the host and the guest are exposed
>> with SMMUs, we end up with 2 different IOVAs allocated by each. The
>> guest allocates an IOVA (gIOVA) to map onto the guest MSI
>> doorbell (gDB). The host allocates another IOVA (hIOVA) to map
>> onto the physical doorbell (hDB).
>>
>> So we end up with 2 untied mappings:
>>          S1            S2
>> gIOVA    ->    gDB
>>               hIOVA    ->    gDB
>                                ^^^ hDB
> 
>> Currently the PCI device is programmed by the host with hIOVA
>> as MSI doorbell. So this does not work.
>>
>> This patch introduces an API to pass gIOVA/gDB to the host so
>> that gIOVA can be reused by the host instead of re-allocating
>> a new IOVA. So the goal is to create the following nested mapping:
>>
>>          S1            S2
>> gIOVA    ->    gDB     ->    hDB
>>
>> and program the PCI device with gIOVA MSI doorbell.
>>
>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>
>> ---
>>
>> v2 -> v3:
>> - add a struct device handle
>> ---
>>  drivers/iommu/iommu.c      | 10 ++++++++++
>>  include/linux/iommu.h      | 13 +++++++++++++
>>  include/uapi/linux/iommu.h |  6 ++++++
>>  3 files changed, 29 insertions(+)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index b2e248770508..ea11442e7054 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1431,6 +1431,16 @@ static void __iommu_detach_device(struct iommu_domain *domain,
>>  	trace_detach_device_from_domain(dev);
>>  }
>>  
>> +int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
>> +			 struct iommu_guest_msi_binding *binding)
>> +{
>> +	if (unlikely(!domain->ops->bind_guest_msi))
>> +		return -ENODEV;
>> +
>> +	return domain->ops->bind_guest_msi(domain, dev, binding);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_bind_guest_msi);
>> +
>>  void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
>>  {
>>  	struct iommu_group *group;
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 96d59886f230..244c1a3d5989 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -235,6 +235,9 @@ struct iommu_ops {
>>  	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
>>  				struct iommu_cache_invalidate_info *inv_info);
>>  
>> +	int (*bind_guest_msi)(struct iommu_domain *domain, struct device *dev,
>> +			      struct iommu_guest_msi_binding *binding);
>> +
>>  	unsigned long pgsize_bitmap;
>>  };
>>  
>> @@ -301,6 +304,9 @@ extern int iommu_set_pasid_table(struct iommu_domain *domain,
>>  extern int iommu_cache_invalidate(struct iommu_domain *domain,
>>  				struct device *dev,
>>  				struct iommu_cache_invalidate_info *inv_info);
>> +extern int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
>> +				struct iommu_guest_msi_binding *binding);
>> +
>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>> @@ -724,6 +730,13 @@ iommu_cache_invalidate(struct iommu_domain *domain,
>>  	return -ENODEV;
>>  }
>>  
>> +static inline
>> +int iommu_bind_guest_msi(struct iommu_domain *domain, struct device *dev,
>> +			 struct iommu_guest_msi_binding *binding)
>> +{
>> +	return -ENODEV;
>> +}
>> +
>>  #endif /* CONFIG_IOMMU_API */
>>  
>>  #ifdef CONFIG_IOMMU_DEBUGFS
>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>> index 4605f5cfac84..f28cd9a1aa96 100644
>> --- a/include/uapi/linux/iommu.h
>> +++ b/include/uapi/linux/iommu.h
>> @@ -142,4 +142,10 @@ struct iommu_cache_invalidate_info {
>>  	__u64		arch_id;
>>  	__u64		addr;
>>  };
>> +
>> +struct iommu_guest_msi_binding {
>> +	__u64		iova;
>> +	__u64		gpa;
>> +	__u32		granule;
> 
> What's granule?  The size?  This looks a lot like just a stage 1
> mapping interface, I can't really figure out from the description how
> this matches to any specific MSI mapping.  Zero comments in the code
> or headers here about how this is supposed to work.  Thanks,
Considering your next comment about the unbind() operation, I think the
struct can simply be removed. Instead we could directly have:
int (*bind_guest_msi)(struct iommu_domain *domain, struct device *dev,
                              dma_addr_t iova, gpa_t gpa, size_t size);
int (*unbind_guest_msi)(struct iommu_domain *domain, struct device *dev,
                                dma_addr_t iova);
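
With that, the trap handlers reduce to something like (illustrative,
assuming the unbind counterpart above gets added):

        ret = iommu_bind_guest_msi(domain, dev, giova, gdb, SZ_4K);
        ...
        iommu_unbind_guest_msi(domain, dev, giova);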


Thanks

Eric
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 09/21] iommu/smmuv3: Get prepared for nested stage support
  2019-01-08 10:26 ` [RFC v3 09/21] iommu/smmuv3: Get prepared for nested stage support Eric Auger
  2019-01-11 16:04   ` Jean-Philippe Brucker
@ 2019-01-25 19:27   ` Robin Murphy
  1 sibling, 0 replies; 59+ messages in thread
From: Robin Murphy @ 2019-01-25 19:27 UTC (permalink / raw)
  To: Eric Auger, eric.auger.pro, iommu, linux-kernel, kvm, kvmarm,
	joro, alex.williamson, jacob.jun.pan, yi.l.liu,
	jean-philippe.brucker, will.deacon
  Cc: kevin.tian, ashok.raj, marc.zyngier, christoffer.dall, peter.maydell

On 08/01/2019 10:26, Eric Auger wrote:
> To allow nested stage support, we need to store both
> stage 1 and stage 2 configurations (and remove the former
> union).
> 
> arm_smmu_write_strtab_ent() is modified to write both stage
> fields in the STE.
> 
> We add a nested_bypass field to the S1 configuration as the first
> stage can be bypassed. Also the guest may force the STE to abort:
> this information gets stored into the nested_abort field.
> 
> Only S2 stage is "finalized" as the host does not configure
> S1 CD, guest does.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> v1 -> v2:
> - invalidate the STE before moving from a live STE config to another
> - add the nested_abort and nested_bypass fields
> ---
>   drivers/iommu/arm-smmu-v3.c | 43 ++++++++++++++++++++++++++++---------
>   1 file changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 9af68266bbb1..9716a301d9ae 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -212,6 +212,7 @@
>   #define STRTAB_STE_0_CFG_BYPASS		4
>   #define STRTAB_STE_0_CFG_S1_TRANS	5
>   #define STRTAB_STE_0_CFG_S2_TRANS	6
> +#define STRTAB_STE_0_CFG_NESTED		7
>   
>   #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)
>   #define STRTAB_STE_0_S1FMT_LINEAR	0
> @@ -491,6 +492,10 @@ struct arm_smmu_strtab_l1_desc {
>   struct arm_smmu_s1_cfg {
>   	__le64				*cdptr;
>   	dma_addr_t			cdptr_dma;
> +	/* in nested mode, tells s1 must be bypassed */
> +	bool				nested_bypass;

Couldn't that be inferred from "s1_cfg == NULL"?

> +	/* in nested mode, abort is forced by guest */
> +	bool				nested_abort;

Couldn't that be inferred from "s1_cfg == NULL && s2_cfg == NULL && 
smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED"?

>   	struct arm_smmu_ctx_desc {
>   		u16	asid;
> @@ -515,6 +520,7 @@ struct arm_smmu_strtab_ent {
>   	 * configured according to the domain type.
>   	 */
>   	bool				assigned;
> +	bool				nested;

AFAICS, "nested" really only serves a differentiator between the 
assigned-as-bypass and assigned-as-fault cases. The latter isn't 
actually unique to nested though, I'd say it's more just that nobody's 
found reason to do anything with IOMMU_DOMAIN_BLOCKED yet. There's some 
argument for upgrading "assigned" into a tristate enum, but I think it 
might have a few drawbacks elsewhere, so an extra flag here seems 
reasonable, but I think it should just be named "abort". If we have both 
s1_cfg and s2_cfg set, we can see it's nested; if we only have s2_cfg, I 
don't think we really care whether the host or guest asked for stage 1 
bypass; and if in future we care about the difference between host- vs. 
guest-requested abort, leaving s2_cfg set for the latter would probably 
suffice.

>   	struct arm_smmu_s1_cfg		*s1_cfg;
>   	struct arm_smmu_s2_cfg		*s2_cfg;
>   };
> @@ -629,10 +635,8 @@ struct arm_smmu_domain {
>   	bool				non_strict;
>   
>   	enum arm_smmu_domain_stage	stage;
> -	union {
> -		struct arm_smmu_s1_cfg	s1_cfg;
> -		struct arm_smmu_s2_cfg	s2_cfg;
> -	};
> +	struct arm_smmu_s1_cfg	s1_cfg;
> +	struct arm_smmu_s2_cfg	s2_cfg;
>   
>   	struct iommu_domain		domain;
>   
> @@ -1139,10 +1143,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
>   			break;
>   		case STRTAB_STE_0_CFG_S1_TRANS:
>   		case STRTAB_STE_0_CFG_S2_TRANS:
> +		case STRTAB_STE_0_CFG_NESTED:
>   			ste_live = true;
>   			break;
>   		case STRTAB_STE_0_CFG_ABORT:
> -			if (disable_bypass)
> +			if (disable_bypass || ste->nested)
>   				break;
>   		default:
>   			BUG(); /* STE corruption */
> @@ -1154,7 +1159,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
>   
>   	/* Bypass/fault */
>   	if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) {
> -		if (!ste->assigned && disable_bypass)
> +		if ((!ste->assigned && disable_bypass) ||
> +				(ste->s1_cfg && ste->s1_cfg->nested_abort))

Yikes, these conditions were hard enough to follow before...


I think what I've proposed above might allow the logic here to be a bit 
less convoluted, but even then it may be time to hoist all these checks 
out and have a temporary decision variable for the bypass/abort/valid 
config outcome.
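
Something like this, perhaps (rough sketch only, using the renamed
"abort" flag suggested above):

	enum { STE_CFG_BYPASS, STE_CFG_ABORT, STE_CFG_VALID } cfg;

	if (!ste->assigned)
		cfg = disable_bypass ? STE_CFG_ABORT : STE_CFG_BYPASS;
	else if (ste->abort)
		cfg = STE_CFG_ABORT;
	else if (!ste->s1_cfg && !ste->s2_cfg)
		cfg = STE_CFG_BYPASS;
	else
		cfg = STE_CFG_VALID;

	/* ...then derive val and dst[] from cfg and the s1/s2 configs. */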

Robin.

>   			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
>   		else
>   			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
> @@ -1172,8 +1178,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
>   		return;
>   	}
>   
> +	if (ste->nested && ste_live) {
> +		/*
> +		 * When enabling nested, the STE may be transitioning from
> +		 * s2 to nested and back. Invalidate the STE before changing it.
> +		 */
> +		dst[0] = cpu_to_le64(0);
> +		arm_smmu_sync_ste_for_sid(smmu, sid);
> +		val = STRTAB_STE_0_V;
> +	}
> +
>   	if (ste->s1_cfg) {
> -		BUG_ON(ste_live);
>   		dst[1] = cpu_to_le64(
>   			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
>   			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> @@ -1187,12 +1202,12 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
>   		   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
>   			dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
>   
> -		val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
> -			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS);
> +		if (!ste->s1_cfg->nested_bypass)
> +			val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
> +				FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS);
>   	}
>   
>   	if (ste->s2_cfg) {
> -		BUG_ON(ste_live);
>   		dst[2] = cpu_to_le64(
>   			 FIELD_PREP(STRTAB_STE_2_S2VMID, ste->s2_cfg->vmid) |
>   			 FIELD_PREP(STRTAB_STE_2_VTCR, ste->s2_cfg->vtcr) |
> @@ -1454,6 +1469,10 @@ static void arm_smmu_tlb_inv_context(void *cookie)
>   		cmd.opcode	= CMDQ_OP_TLBI_NH_ASID;
>   		cmd.tlbi.asid	= smmu_domain->s1_cfg.cd.asid;
>   		cmd.tlbi.vmid	= 0;
> +	} else if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED) {
> +		cmd.opcode      = CMDQ_OP_TLBI_NH_ASID;
> +		cmd.tlbi.asid   = smmu_domain->s1_cfg.cd.asid;
> +		cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
>   	} else {
>   		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
>   		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
> @@ -1484,6 +1503,10 @@ static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
>   	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
>   		cmd.opcode	= CMDQ_OP_TLBI_NH_VA;
>   		cmd.tlbi.asid	= smmu_domain->s1_cfg.cd.asid;
> +	} else if (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED) {
> +		cmd.opcode      = CMDQ_OP_TLBI_NH_VA;
> +		cmd.tlbi.asid   = smmu_domain->s1_cfg.cd.asid;
> +		cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
>   	} else {
>   		cmd.opcode	= CMDQ_OP_TLBI_S2_IPA;
>   		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API
  2019-01-25 16:49     ` Auger Eric
@ 2019-01-28 17:32       ` Jean-Philippe Brucker
  2019-01-29 17:49         ` Auger Eric
  2019-01-29 23:16       ` Alex Williamson
  1 sibling, 1 reply; 59+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-28 17:32 UTC (permalink / raw)
  To: Auger Eric, Alex Williamson
  Cc: yi.l.liu, kevin.tian, ashok.raj, kvm, peter.maydell, will.deacon,
	linux-kernel, christoffer.dall, marc.zyngier, iommu,
	robin.murphy, kvmarm, eric.auger.pro

Hi Eric,

On 25/01/2019 16:49, Auger Eric wrote:
[...]
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> index 7a7cf7a3de7c..4605f5cfac84 100644
>>> --- a/include/uapi/linux/iommu.h
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
>>>  	};
>>>  };
>>>  
>>> +/**
>>> + * enum iommu_inv_granularity - Generic invalidation granularity
>>> + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:	TLB entries or PASID caches of all
>>> + *					PASIDs associated with a domain ID
>>> + * @IOMMU_INV_GRANU_PASID_SEL:		TLB entries or PASID cache associated
>>> + *					with a PASID and a domain
>>> + * @IOMMU_INV_GRANU_PAGE_PASID:		TLB entries of selected page range
>>> + *					within a PASID
>>> + *
>>> + * When an invalidation request is passed down to IOMMU to flush translation
>>> + * caches, it may carry different granularity levels, which can be specific
>>> + * to certain types of translation caches.
>>> + * This enum is a collection of granularities for all types of translation
>>> + * caches. The idea is to make it easy for IOMMU model specific driver to
>>> + * convert from generic to model specific value. Each IOMMU driver
>>> + * can enforce check based on its own conversion table. The conversion is
>>> + * based on 2D look-up with inputs as follows:
>>> + * - translation cache types
>>> + * - granularity
>>> + *
>>> + *             type |   DTLB    |    TLB    |   PASID   |
>>> + *  granule         |           |           |   cache   |
>>> + * -----------------+-----------+-----------+-----------+
>>> + *  DN_ALL_PASID    |   Y       |   Y       |   Y       |
>>> + *  PASID_SEL       |   Y       |   Y       |   Y       |
>>> + *  PAGE_PASID      |   Y       |   Y       |   N/A     |
>>> + *
>>> + */
>>> +enum iommu_inv_granularity {
>>> +	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
>>> +	IOMMU_INV_GRANU_PASID_SEL,
>>> +	IOMMU_INV_GRANU_PAGE_PASID,
>>> +	IOMMU_INV_NR_GRANU,
>>> +};
>>> +
>>> +/**
>>> + * enum iommu_inv_type - Generic translation cache types for invalidation
>>> + *
>>> + * @IOMMU_INV_TYPE_DTLB:	device IOTLB
>>> + * @IOMMU_INV_TYPE_TLB:		IOMMU paging structure cache
>>> + * @IOMMU_INV_TYPE_PASID:	PASID cache
>>> + * Invalidation requests sent to IOMMU for a given device need to indicate
>>> + * which type of translation cache to be operated on. Combined with enum
>>> + * iommu_inv_granularity, model specific driver can do a simple lookup to
>>> + * convert from generic to model specific value.
>>> + */
>>> +enum iommu_inv_type {
>>> +	IOMMU_INV_TYPE_DTLB,
>>> +	IOMMU_INV_TYPE_TLB,
>>> +	IOMMU_INV_TYPE_PASID,
>>> +	IOMMU_INV_NR_TYPE
>>> +};
>>> +
>>> +/**
>>> + * Translation cache invalidation header that contains mandatory meta data.
>>> + * @version:	info format version, expecting future extensions
>>> + * @type:	type of translation cache to be invalidated
>>> + */
>>> +struct iommu_cache_invalidate_hdr {
>>> +	__u32 version;
>>> +#define TLB_INV_HDR_VERSION_1 1
>>> +	enum iommu_inv_type type;
>>> +};
>>> +
>>> +/**
>>> + * Translation cache invalidation information. It contains generic IOMMU
>>> + * data which can be parsed by model-specific drivers based on model ID.
>>> + * Since the invalidation of second-level page tables is included in the
>>> + * unmap operation, this info is only applicable to the first-level
>>> + * translation caches, i.e. DMA requests with PASID.
>>> + *
>>> + * @granularity:	requested invalidation granularity, type dependent
>>> + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
>>
>> Why is this a 4K page centric interface?
> This matches the vt-d Address Mask (AM) field of the IOTLB Invalidate
> Descriptor. We can pass a log2size instead.
>>
>>> + * @nr_pages:		number of pages to invalidate
>>> + * @pasid:		processor address space ID value per PCI spec.
>>> + * @arch_id:		architecture dependent id characterizing a context
>>> + *			and tagging the caches, i.e. domain identifier on VT-d,
>>> + *			asid on ARM SMMU
>>> + * @addr:		page address to be invalidated
>>> + * @flags		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
>>> + *			IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
>>
>> Shouldn't some of these be tied to the granularity of the
>> invalidation?  It seems like this should be more similar to
>> iommu_pasid_table_config where the granularity of the invalidation
>> defines which entry within a union at the end of the structure is valid
>> and populated.  Otherwise we have fields that don't make sense for
>> certain invalidations.
> 
> I am a little bit embarrassed here as this API version is the outcome of
> long discussions held by Jacob, Jean-Philippe and many others. I don't
> want to hijack that work as I am "simply" reusing this API. Nevertheless
> I am willing to help on this. So following your recommendation above I
> dare to propose an updated API:

Discussing this again is completely fine by me. I have some concerns
with this proposal though, some of which apply to our previous versions
as well.

> struct iommu_device_iotlb_inv_info {
>         __u32   version;
> #define IOMMU_DEV_IOTLB_INV_GLOBAL   0
> #define IOMMU_DEV_IOTLB_INV_SOURCEID (1 << 0)
> #define IOMMU_DEV_IOTLB_INV_PASID    (1 << 1)
>         __u8    granularity;
>         __u64   addr;
>         __u8    log2size;
>         __u64   sourceid;
>         __u64   pasid;
>         __u8    padding[2];
> };
> 
> struct iommu_iotlb_inv_info {
>         __u32   version;
> #define IOMMU_IOTLB_INV_GLOBAL  0

Using "global" for granularity=0 will be confusing, let's call this
"domain" instead. In the SMMU and ATS specifications (and I think VT-d
as well), the global flag is used to invalidate VA ranges that are
cached for all PASIDs. In the Arm architecture these TLB entries are
created from PTE entries without the "nG" bit. So "global" usually means
granularity=IOMMU_IOTLB_INV_PAGE.

> #define IOMMU_IOTLB_INV_ARCHID  (1 << 0)
> #define IOMMU_IOTLB_INV_PASID   (1 << 1)
> #define IOMMU_IOTLB_INV_PAGE    (1 << 2)

We might as well call this bit "INV_ADDR" to make it clear that it
describes the validity of field @addr.

>         __u8    granularity;
>         __u64   archid;
>         __u64   pasid;
>         __u64   addr;
>         __u8    log2size;
>         __u8    padding[2];
> };
> 
> struct iommu_pasid_inv_info {
>         __u32   version;
> #define IOMMU_PASID_INV_GLOBAL     0
> #define IOMMU_PASID_INV_ARCHID     (1 << 0)
> #define IOMMU_PASID_INV_PASID      (1 << 1)
> #define IOMMU_PASID_INV_SOURCEID   (1 << 2)
>         __u8    granularity;
>         __u64   archid;
>         __u64   pasid;
>         __u64   sourceid;
>         __u8    padding[3];
> };
> /**
>  * Translation cache invalidation information. It contains generic IOMMU
>  * data which can be parsed by model-specific drivers based on model ID.
>  * Since the invalidation of second-level page tables is included in
>  * the unmap operation, this info is only applicable to the first-level
>  * translation caches, i.e. DMA requests with PASID.
>  *
>  */
> struct iommu_cache_invalidate_info {
> #define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>         __u32 version;
> #define IOMMU_INV_TYPE_IOTLB        1 /* IOMMU paging structure cache */
> #define IOMMU_INV_TYPE_DEV_IOTLB    2 /* Device IOTLB */
> #define IOMMU_INV_TYPE_PASID        3 /* PASID cache */
>         __u8 type;
>         union {
>                 struct iommu_iotlb_inv_info iotlb_inv_info;
>                 struct iommu_device_iotlb_inv_info dev_iotlb_inv_info;
>                 struct iommu_pasid_inv_info pasid_inv_info;
>         };
> };

Although I find the new structure and field names clearer in general, I
believe we're losing something by splitting structures this way.

The one concept I'd like to keep is the possibility to multiplex ATC and
IOTLB invalidation. That is, when unmapping a range, the guest (using a
pvIOMMU) shouldn't need to send both iotlb_inv and device_iotlb_inv to
the host - it's completely redundant. Instead the host IOMMU driver
should receive a single invalidation packet, and send both TLB and ATC
invalidation to the hardware.

So in my opinion the invalidation type needs to be a bitfield: userspace
can select either TYPE_IOTLB, TYPE_DEV_IOTLB, or both. And if type is a
bitfield, the content that follows has to be a single structure.

Same for the PASID cache: when the guest changes the config for a PASID,
it shouldn't have to also send TLB and ATC invalidations. A single
packet should be enough. However config changes won't be a fast path, so
optimizing the API is less important here.

We could say that IOMMU_INV_TYPE_IOTLB implies IOMMU_INV_TYPE_DEV_IOTLB,
and that IOMMU_INV_TYPE_PASID implies the others. In fact I think we do,
in this patch. But that in turn would be suboptimal with vSMMU and other
emulated solutions, which will receive both TLB and ATC invalidation
from the guest, and have to signal both separately to the kernel. So for
@type, a bitfield would be best.
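
For illustration, such a bitfield could look like the sketch below (the
names are hypothetical, not taken from any posted patch):

/* Hypothetical sketch: @type as a bitfield, so a single invalidation
 * packet can target the IOTLB, the ATC (device IOTLB), or both.
 */
#define IOMMU_INV_TYPE_IOTLB		(1 << 0) /* IOMMU paging structure cache */
#define IOMMU_INV_TYPE_DEV_IOTLB	(1 << 1) /* Device IOTLB (ATC) */
#define IOMMU_INV_TYPE_PASID		(1 << 2) /* PASID cache */

/* On unmap, a pvIOMMU guest would send one packet with
 *	type = IOMMU_INV_TYPE_IOTLB | IOMMU_INV_TYPE_DEV_IOTLB;
 * while an emulated vSMMU, which traps TLB and ATC invalidations
 * separately, would set a single bit per packet.
 */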

If we want to split the structure, I think splitting by @granularity
might make more sense. It might require at least 4 structures:
* domain invalidation:		granule = 0
* pasid invalidation:		granule = pasid|archid
* global va invalidation:	granule = addr
* va invalidation		granule = pasid|archid|addr
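
As a rough sketch of that split (hypothetical names and field layout,
padding ignored for brevity):

/* Hypothetical granularity-based structures, one per granule above. */
struct iommu_inv_domain {		/* granule = 0: whole domain */
	__u32	flags;
};

struct iommu_inv_pasid {		/* granule = pasid|archid */
	__u64	pasid;
	__u64	archid;
};

struct iommu_inv_global_va {		/* granule = addr */
	__u64	addr;
	__u64	log2size;
};

struct iommu_inv_va {			/* granule = pasid|archid|addr */
	__u64	pasid;
	__u64	archid;
	__u64	addr;
	__u64	log2size;
};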

> At the moment I ignore the leaf bool parameter used on ARM for PASID
> invalidation and TLB invalidation. Maybe we can just invalidate more
> than the leaf cache structures for now?

Sure, though we could add a flags field and leave it unused for now,
which is easier to extend than introducing a new version. I wonder if
Intel's Invalidation Hint (IH) does the same as Arm's leaf flag?

> On ARM the PASID table can be invalidated per streamid. On VT-d, as far
> as I understand, the sourceid does not tag the entries.

I don't think sourceid is the right thing to use in this interface. The
device ID (streamid or sourceid) that the IOMMU sees in DMA transactions
isn't really made visible to userspace. And for mdev it doesn't really
exist, VFIO will have to pass the parent device's handle to IOMMU.

To userspace a device is identified by the fd provided by VFIO. So if we
do want to have device-scope in the invalidation (as opposed to
the current domain-scope) I think userspace needs to provide a device fd
to VFIO (outside the structures defined here), and then VFIO would pass
a struct device to iommu_cache_inval().

Thanks,
Jean

>>
>>> + *
>>> + */
>>> +struct iommu_cache_invalidate_info {
>>> +	struct iommu_cache_invalidate_hdr	hdr;
>>> +	enum iommu_inv_granularity	granularity;
>>
>> A separate structure for hdr seems a little pointless.
> removed
>>
>>> +	__u32		flags;
>>> +#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 0)
>>> +#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 1)
>>> +	__u8		size;
>>
>> Really need some padding or packing here for any hope of having
>> consistency with userspace.
>>
>>> +	__u64		nr_pages;
>>> +	__u32		pasid;
>>
>> Sub-optimal ordering for packing/padding.  Thanks,
> I introduced some padding above. Is that OK?
> 
> Again, if this introduces more noise than it helps, I will simply rely on
> the initial contributors to respin their series according to your
> comments. Also, if we can't define generic enough structures for ARM and x86...
> 
> Thanks
> 
> Eric
>>
>> Alex
>>
>>> +	__u64		arch_id;
>>> +	__u64		addr;
>>> +};
>>>  #endif /* _UAPI_IOMMU_H */
>>
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API
  2019-01-28 17:32       ` Jean-Philippe Brucker
@ 2019-01-29 17:49         ` Auger Eric
  0 siblings, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-29 17:49 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Alex Williamson
  Cc: yi.l.liu, kevin.tian, ashok.raj, kvm, peter.maydell, will.deacon,
	linux-kernel, christoffer.dall, marc.zyngier, iommu,
	robin.murphy, kvmarm, eric.auger.pro

Hi Jean-Philippe,

On 1/28/19 6:32 PM, Jean-Philippe Brucker wrote:
> Hi Eric,
> 
> On 25/01/2019 16:49, Auger Eric wrote:
> [...]
>>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>>> index 7a7cf7a3de7c..4605f5cfac84 100644
>>>> --- a/include/uapi/linux/iommu.h
>>>> +++ b/include/uapi/linux/iommu.h
>>>> @@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
>>>>  	};
>>>>  };
>>>>  
>>>> +/**
>>>> + * enum iommu_inv_granularity - Generic invalidation granularity
>>>> + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:	TLB entries or PASID caches of all
>>>> + *					PASIDs associated with a domain ID
>>>> + * @IOMMU_INV_GRANU_PASID_SEL:		TLB entries or PASID cache associated
>>>> + *					with a PASID and a domain
>>>> + * @IOMMU_INV_GRANU_PAGE_PASID:		TLB entries of selected page range
>>>> + *					within a PASID
>>>> + *
>>>> + * When an invalidation request is passed down to IOMMU to flush translation
>>>> + * caches, it may carry different granularity levels, which can be specific
>>>> + * to certain types of translation caches.
>>>> + * This enum is a collection of granularities for all types of translation
>>>> + * caches. The idea is to make it easy for an IOMMU model-specific driver
>>>> + * to convert from generic to model-specific values. Each IOMMU driver
>>>> + * can enforce checks based on its own conversion table. The conversion is
>>>> + * based on 2D look-up with inputs as follows:
>>>> + * - translation cache types
>>>> + * - granularity
>>>> + *
>>>> + *             type |   DTLB    |    TLB    |   PASID   |
>>>> + *  granule         |           |           |   cache   |
>>>> + * -----------------+-----------+-----------+-----------+
>>>> + *  DN_ALL_PASID    |   Y       |   Y       |   Y       |
>>>> + *  PASID_SEL       |   Y       |   Y       |   Y       |
>>>> + *  PAGE_PASID      |   Y       |   Y       |   N/A     |
>>>> + *
>>>> + */
>>>> +enum iommu_inv_granularity {
>>>> +	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
>>>> +	IOMMU_INV_GRANU_PASID_SEL,
>>>> +	IOMMU_INV_GRANU_PAGE_PASID,
>>>> +	IOMMU_INV_NR_GRANU,
>>>> +};
>>>> +
>>>> +/**
>>>> + * enum iommu_inv_type - Generic translation cache types for invalidation
>>>> + *
>>>> + * @IOMMU_INV_TYPE_DTLB:	device IOTLB
>>>> + * @IOMMU_INV_TYPE_TLB:		IOMMU paging structure cache
>>>> + * @IOMMU_INV_TYPE_PASID:	PASID cache
>>>> + * Invalidation requests sent to IOMMU for a given device need to indicate
>>>> + * which type of translation cache is to be operated on. Combined with enum
>>>> + * iommu_inv_granularity, a model-specific driver can do a simple lookup to
>>>> + * convert from generic to model specific value.
>>>> + */
>>>> +enum iommu_inv_type {
>>>> +	IOMMU_INV_TYPE_DTLB,
>>>> +	IOMMU_INV_TYPE_TLB,
>>>> +	IOMMU_INV_TYPE_PASID,
>>>> +	IOMMU_INV_NR_TYPE
>>>> +};
>>>> +
>>>> +/**
>>>> + * Translation cache invalidation header that contains mandatory metadata.
>>>> + * @version:	info format version, expecting future extensions
>>>> + * @type:	type of translation cache to be invalidated
>>>> + */
>>>> +struct iommu_cache_invalidate_hdr {
>>>> +	__u32 version;
>>>> +#define TLB_INV_HDR_VERSION_1 1
>>>> +	enum iommu_inv_type type;
>>>> +};
>>>> +
>>>> +/**
>>>> + * Translation cache invalidation information. It contains generic IOMMU
>>>> + * data which can be parsed by model-specific drivers based on model ID.
>>>> + * Since the invalidation of second-level page tables is included in the
>>>> + * unmap operation, this info is only applicable to the first-level
>>>> + * translation caches, i.e. DMA requests with PASID.
>>>> + *
>>>> + * @granularity:	requested invalidation granularity, type dependent
>>>> + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
>>>
>>> Why is this a 4K page centric interface?
>> This matches the vt-d Address Mask (AM) field of the IOTLB Invalidate
>> Descriptor. We can pass a log2size instead.
>>>
>>>> + * @nr_pages:		number of pages to invalidate
>>>> + * @pasid:		processor address space ID value per PCI spec.
>>>> + * @arch_id:		architecture dependent id characterizing a context
>>>> + *			and tagging the caches, i.e. domain identifier on VT-d,
>>>> + *			asid on ARM SMMU
>>>> + * @addr:		page address to be invalidated
>>>> + * @flags		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
>>>> + *			IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
>>>
>>> Shouldn't some of these be tied to the granularity of the
>>> invalidation?  It seems like this should be more similar to
>>> iommu_pasid_table_config where the granularity of the invalidation
>>> defines which entry within a union at the end of the structure is valid
>>> and populated.  Otherwise we have fields that don't make sense for
>>> certain invalidations.
>>
>> I am a little bit embarrassed here as this API version is the outcome of
>> long discussions held by Jacob, Jean-Philippe and many others. I don't
>> want to hijack that work as I am "simply" reusing this API. Nevertheless
>> I am willing to help on this. So following your recommendation above I
>> dare to propose an updated API:
> 
> Discussing this again is completely fine by me. I have some concerns
> with this proposal though, some of which apply to our previous versions
> as well.
> 
>> struct iommu_device_iotlb_inv_info {
>>         __u32   version;
>> #define IOMMU_DEV_IOTLB_INV_GLOBAL   0
>> #define IOMMU_DEV_IOTLB_INV_SOURCEID (1 << 0)
>> #define IOMMU_DEV_IOTLB_INV_PASID    (1 << 1)
>>         __u8    granularity;
>>         __u64   addr;
>>         __u8    log2size;
>>         __u64   sourceid;
>>         __u64   pasid;
>>         __u8    padding[2];
>> };
>>
>> struct iommu_iotlb_inv_info {
>>         __u32   version;
>> #define IOMMU_IOTLB_INV_GLOBAL  0
> 
> Using "global" for granularity=0 will be confusing, let's call this
> "domain" instead. In the SMMU and ATS specifications (and I think VT-d
> as well), the global flag is used to invalidate VA ranges that are
> cached for all PASIDs. In the Arm architecture these TLB entries are
> created from PTE entries without the "nG" bit. So "global" usually means
> granularity=IOMMU_IOTLB_INV_PAGE.
In VT-d, a global IOTLB invalidation means all IOTLB entries
are invalidated.

Indeed, for the device IOTLB, global means invalidating this addr/S
for all PASIDs.

Domain refers to the domain-id (= archid) on VT-d.
> 
>> #define IOMMU_IOTLB_INV_ARCHID  (1 << 0)
>> #define IOMMU_IOTLB_INV_PASID   (1 << 1)
>> #define IOMMU_IOTLB_INV_PAGE    (1 << 2)
> 
> We might as well call this bit "INV_ADDR" to make it clear that it
> describes the validity of field @addr.
Agreed.
> 
>>         __u8    granularity;
>>         __u64   archid;
>>         __u64   pasid;
>>         __u64   addr;
>>         __u8    log2size;
>>         __u8    padding[2];
>> };
>>
>> struct iommu_pasid_inv_info {
>>         __u32   version;
>> #define IOMMU_PASID_INV_GLOBAL     0
>> #define IOMMU_PASID_INV_ARCHID     (1 << 0)
>> #define IOMMU_PASID_INV_PASID      (1 << 1)
>> #define IOMMU_PASID_INV_SOURCEID   (1 << 2)
>>         __u8    granularity;
>>         __u64   archid;
>>         __u64   pasid;
>>         __u64   sourceid;
>>         __u8    padding[3];
>> };
>> /**
>>  * Translation cache invalidation information. It contains generic IOMMU
>>  * data which can be parsed by model-specific drivers based on model ID.
>>  * Since the invalidation of second-level page tables is included in
>>  * the unmap operation, this info is only applicable to the first-level
>>  * translation caches, i.e. DMA requests with PASID.
>>  *
>>  */
>> struct iommu_cache_invalidate_info {
>> #define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>>         __u32 version;
>> #define IOMMU_INV_TYPE_IOTLB        1 /* IOMMU paging structure cache */
>> #define IOMMU_INV_TYPE_DEV_IOTLB    2 /* Device IOTLB */
>> #define IOMMU_INV_TYPE_PASID        3 /* PASID cache */
>>         __u8 type;
>>         union {
>>                 struct iommu_iotlb_inv_info iotlb_inv_info;
>>                 struct iommu_device_iotlb_inv_info dev_iotlb_inv_info;
>>                 struct iommu_pasid_inv_info pasid_inv_info;
>>         };
>> };
> 
> Although I find the new structure and field names clearer in general, I
> believe we're losing something by splitting structures this way.
> 
> The one concept I'd like to keep is the possibility to multiplex ATC and
> IOTLB invalidation. That is, when unmapping a range, the guest (using a
> pvIOMMU) shouldn't need to send both iotlb_inv and device_iotlb_inv to
> the host - it's completely redundant. Instead the host IOMMU driver
> should receive a single invalidation packet, and send both TLB and ATC
> invalidation to the hardware.
> 
> So in my opinion the invalidation type needs to be a bitfield: userspace
> can select either TYPE_IOTLB, TYPE_DEV_IOTLB, or both. And if type is a
> bitfield, the content that follows has to be a single structure.
OK, I missed this requirement.
> 
> Same for the PASID cache: when the guest changes the config for a PASID,
> it shouldn't have to also send TLB and ATC invalidations. A single
> packet should be enough. However config changes won't be a fast path, so
> optimizing the API is less important here.
OK.
> 
> We could say that IOMMU_INV_TYPE_IOTLB implies IOMMU_INV_TYPE_DEV_IOTLB,
> and that IOMMU_INV_TYPE_PASID implies the others. In fact I think we do,
> in this patch. But that in turn would be suboptimal with vSMMU and other
> emulated solutions, which will receive both TLB and ATC invalidation
> from the guest, and have to signal both separately to the kernel. So for
> @type, a bitfield would be best.
Indeed, vSMMUv3 will react to each trapped event.
> 
> If we want to split the structure, I think splitting by @granularity
> might make more sense. It might require at least 4 structures:
> * domain invalidation:		granule = 0
> * pasid invalidation:		granule = pasid|archid
> * global va invalidation:	granule = addr
> * va invalidation		granule = pasid|archid|addr

I have questions about the mapping of the following commands:
- SMMUv3 CMD_TLBI_NH_VA uses ASID and addr, so it would use the va
invalidation struct. Would you set pasid=0?
- SMMUv3 ATC_INV with G=0 uses pasid and addr only. What would you put as
archid? The same applies to the VT-d PASID-based device TLB invalidate.
Does this mean we need a flag within the va invalidation struct telling
which fields are used?
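
A sketch of such flags, with hypothetical names, could be:

/* Hypothetical flags stating which fields of the va invalidation
 * struct are valid: ATC_INV with G=0 would set
 * IOMMU_INV_VA_PASID | IOMMU_INV_VA_ADDR and leave archid unused,
 * while CMD_TLBI_NH_VA would set IOMMU_INV_VA_ARCHID | IOMMU_INV_VA_ADDR.
 */
#define IOMMU_INV_VA_ARCHID	(1 << 0)	/* archid field is valid */
#define IOMMU_INV_VA_PASID	(1 << 1)	/* pasid field is valid */
#define IOMMU_INV_VA_ADDR	(1 << 2)	/* addr field is valid */
#define IOMMU_INV_VA_LEAF	(1 << 3)	/* leaf entries only (Arm leaf/VT-d IH) */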

The fact that one operation applies to several caches makes things quite
intricate, although I understand the need.


> 
>> At the moment I ignore the leaf bool parameter used on ARM for PASID
>> invalidation and TLB invalidation. Maybe we can just invalidate more
>> than the leaf cache structures for now?
> 
> Sure, though we could add a flags field and leave it unused for now,
> which is easier to extend than introducing a new version. I wonder if
> Intel's Invalidation Hint (IH) does the same as Arm's leaf flag?
Could this be added to the va invalidation struct? IH looks the same to me.
> 
>> On ARM the PASID table can be invalidated per streamid. On VT-d, as far
>> as I understand, the sourceid does not tag the entries.
> 
> I don't think sourceid is the right thing to use in this interface. The
> device ID (streamid or sourceid) that the IOMMU sees in DMA transactions
> isn't really made visible to userspace. And for mdev it doesn't really
> exist, VFIO will have to pass the parent device's handle to IOMMU.
Agreed. I copy/pasted the VT-d descriptors here, and the source-id is
not relevant; it is embodied by the struct device.

Thank you for your feedback!

Eric
> 
> To userspace a device is identified by the fd provided by VFIO. So if we
> do want to have device-scope in the invalidation (as opposed to
> the current domain-scope) I think userspace needs to provide a device fd
> to VFIO (outside the structures defined here), and then VFIO would pass
> a struct device to iommu_cache_inval().
> 
> Thanks,
> Jean
> 
>>>
>>>> + *
>>>> + */
>>>> +struct iommu_cache_invalidate_info {
>>>> +	struct iommu_cache_invalidate_hdr	hdr;
>>>> +	enum iommu_inv_granularity	granularity;
>>>
>>> A separate structure for hdr seems a little pointless.
>> removed
>>>
>>>> +	__u32		flags;
>>>> +#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 0)
>>>> +#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 1)
>>>> +	__u8		size;
>>>
>>> Really need some padding or packing here for any hope of having
>>> consistency with userspace.
>>>
>>>> +	__u64		nr_pages;
>>>> +	__u32		pasid;
>>>
>>> Sub-optimal ordering for packing/padding.  Thanks,
>> I introduced some padding above. Is that OK?
>>
>> Again, if this introduces more noise than it helps, I will simply rely on
>> the initial contributors to respin their series according to your
>> comments. Also, if we can't define generic enough structures for ARM and x86...
>>
>> Thanks
>>
>> Eric
>>>
>>> Alex
>>>
>>>> +	__u64		arch_id;
>>>> +	__u64		addr;
>>>> +};
>>>>  #endif /* _UAPI_IOMMU_H */
>>>
>> _______________________________________________
>> iommu mailing list
>> iommu@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API
  2019-01-25 16:49     ` Auger Eric
  2019-01-28 17:32       ` Jean-Philippe Brucker
@ 2019-01-29 23:16       ` Alex Williamson
  2019-01-30  8:48         ` Auger Eric
  1 sibling, 1 reply; 59+ messages in thread
From: Alex Williamson @ 2019-01-29 23:16 UTC (permalink / raw)
  To: Auger Eric
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

On Fri, 25 Jan 2019 17:49:20 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 1/11/19 10:30 PM, Alex Williamson wrote:
> > On Tue,  8 Jan 2019 11:26:14 +0100
> > Eric Auger <eric.auger@redhat.com> wrote:
> >   
> >> From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
> >>
> >> In any virtualization use case, when the first translation stage
> >> is "owned" by the guest OS, the host IOMMU driver has no knowledge
> >> of caching structure updates unless the guest invalidation activities
> >> are trapped by the virtualizer and passed down to the host.
> >>
> >> Since the invalidation data are obtained from user space and will be
> >> written into physical IOMMU, we must allow security check at various
> >> layers. Therefore, generic invalidation data format are proposed here,
> >> model specific IOMMU drivers need to convert them into their own format.
> >>
> >> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> >> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> >> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> >> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> >> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> >>
> >> ---
> >> v1 -> v2:
> >> - add arch_id field
> >> - renamed tlb_invalidate into cache_invalidate as this API allows
> >>   to invalidate context caches on top of IOTLBs
> >>
> >> v1:
> >> renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
> >> header. Commit message reworded.
> >> ---
> >>  drivers/iommu/iommu.c      | 14 ++++++
> >>  include/linux/iommu.h      | 14 ++++++
> >>  include/uapi/linux/iommu.h | 95 ++++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 123 insertions(+)
> >>
> >> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >> index 0f2b7f1fc7c8..b2e248770508 100644
> >> --- a/drivers/iommu/iommu.c
> >> +++ b/drivers/iommu/iommu.c
> >> @@ -1403,6 +1403,20 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
> >>  }
> >>  EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
> >>  
> >> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
> >> +			   struct iommu_cache_invalidate_info *inv_info)
> >> +{
> >> +	int ret = 0;
> >> +
> >> +	if (unlikely(!domain->ops->cache_invalidate))
> >> +		return -ENODEV;
> >> +
> >> +	ret = domain->ops->cache_invalidate(domain, dev, inv_info);
> >> +
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
> >> +
> >>  static void __iommu_detach_device(struct iommu_domain *domain,
> >>  				  struct device *dev)
> >>  {
> >> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> >> index 1da2a2357ea4..96d59886f230 100644
> >> --- a/include/linux/iommu.h
> >> +++ b/include/linux/iommu.h
> >> @@ -186,6 +186,7 @@ struct iommu_resv_region {
> >>   * @of_xlate: add OF master IDs to iommu grouping
> >>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> >>   * @set_pasid_table: set pasid table
> >> + * @cache_invalidate: invalidate translation caches
> >>   */
> >>  struct iommu_ops {
> >>  	bool (*capable)(enum iommu_cap);
> >> @@ -231,6 +232,9 @@ struct iommu_ops {
> >>  	int (*set_pasid_table)(struct iommu_domain *domain,
> >>  			       struct iommu_pasid_table_config *cfg);
> >>  
> >> +	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
> >> +				struct iommu_cache_invalidate_info *inv_info);
> >> +
> >>  	unsigned long pgsize_bitmap;
> >>  };
> >>  
> >> @@ -294,6 +298,9 @@ extern void iommu_detach_device(struct iommu_domain *domain,
> >>  				struct device *dev);
> >>  extern int iommu_set_pasid_table(struct iommu_domain *domain,
> >>  				 struct iommu_pasid_table_config *cfg);
> >> +extern int iommu_cache_invalidate(struct iommu_domain *domain,
> >> +				struct device *dev,
> >> +				struct iommu_cache_invalidate_info *inv_info);
> >>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
> >>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
> >>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> >> @@ -709,6 +716,13 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
> >>  {
> >>  	return -ENODEV;
> >>  }
> >> +static inline int
> >> +iommu_cache_invalidate(struct iommu_domain *domain,
> >> +		       struct device *dev,
> >> +		       struct iommu_cache_invalidate_info *inv_info)
> >> +{
> >> +	return -ENODEV;
> >> +}
> >>  
> >>  #endif /* CONFIG_IOMMU_API */
> >>  
> >> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> >> index 7a7cf7a3de7c..4605f5cfac84 100644
> >> --- a/include/uapi/linux/iommu.h
> >> +++ b/include/uapi/linux/iommu.h
> >> @@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
> >>  	};
> >>  };
> >>  
> >> +/**
> >> + * enum iommu_inv_granularity - Generic invalidation granularity
> >> + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:	TLB entries or PASID caches of all
> >> + *					PASIDs associated with a domain ID
> >> + * @IOMMU_INV_GRANU_PASID_SEL:		TLB entries or PASID cache associated
> >> + *					with a PASID and a domain
> >> + * @IOMMU_INV_GRANU_PAGE_PASID:		TLB entries of selected page range
> >> + *					within a PASID
> >> + *
> >> + * When an invalidation request is passed down to IOMMU to flush translation
> >> + * caches, it may carry different granularity levels, which can be specific
> >> + * to certain types of translation caches.
> >> + * This enum is a collection of granularities for all types of translation
> >> + * caches. The idea is to make it easy for an IOMMU model-specific driver
> >> + * to convert from generic to model-specific values. Each IOMMU driver
> >> + * can enforce checks based on its own conversion table. The conversion is
> >> + * based on 2D look-up with inputs as follows:
> >> + * - translation cache types
> >> + * - granularity
> >> + *
> >> + *             type |   DTLB    |    TLB    |   PASID   |
> >> + *  granule         |           |           |   cache   |
> >> + * -----------------+-----------+-----------+-----------+
> >> + *  DN_ALL_PASID    |   Y       |   Y       |   Y       |
> >> + *  PASID_SEL       |   Y       |   Y       |   Y       |
> >> + *  PAGE_PASID      |   Y       |   Y       |   N/A     |
> >> + *
> >> + */
> >> +enum iommu_inv_granularity {
> >> +	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
> >> +	IOMMU_INV_GRANU_PASID_SEL,
> >> +	IOMMU_INV_GRANU_PAGE_PASID,
> >> +	IOMMU_INV_NR_GRANU,
> >> +};
> >> +
> >> +/**
> >> + * enum iommu_inv_type - Generic translation cache types for invalidation
> >> + *
> >> + * @IOMMU_INV_TYPE_DTLB:	device IOTLB
> >> + * @IOMMU_INV_TYPE_TLB:		IOMMU paging structure cache
> >> + * @IOMMU_INV_TYPE_PASID:	PASID cache
> >> + * Invalidation requests sent to IOMMU for a given device need to indicate
> >> + * which type of translation cache is to be operated on. Combined with enum
> >> + * iommu_inv_granularity, a model-specific driver can do a simple lookup to
> >> + * convert from generic to model specific value.
> >> + */
> >> +enum iommu_inv_type {
> >> +	IOMMU_INV_TYPE_DTLB,
> >> +	IOMMU_INV_TYPE_TLB,
> >> +	IOMMU_INV_TYPE_PASID,
> >> +	IOMMU_INV_NR_TYPE
> >> +};
> >> +
> >> +/**
> >> + * Translation cache invalidation header that contains mandatory metadata.
> >> + * @version:	info format version, expecting future extensions
> >> + * @type:	type of translation cache to be invalidated
> >> + */
> >> +struct iommu_cache_invalidate_hdr {
> >> +	__u32 version;
> >> +#define TLB_INV_HDR_VERSION_1 1
> >> +	enum iommu_inv_type type;
> >> +};
> >> +
> >> +/**
> >> + * Translation cache invalidation information. It contains generic IOMMU
> >> + * data which can be parsed by model-specific drivers based on model ID.
> >> + * Since the invalidation of second-level page tables is included in the
> >> + * unmap operation, this info is only applicable to the first-level
> >> + * translation caches, i.e. DMA requests with PASID.
> >> + *
> >> + * @granularity:	requested invalidation granularity, type dependent
> >> + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.  
> > 
> > Why is this a 4K page centric interface?  
> This matches the vt-d Address Mask (AM) field of the IOTLB Invalidate
> Descriptor. We can pass a log2size instead.

Could some options not require a power of two size?

> >> + * @nr_pages:		number of pages to invalidate
> >> + * @pasid:		processor address space ID value per PCI spec.
> >> + * @arch_id:		architecture dependent id characterizing a context
> >> + *			and tagging the caches, i.e. domain identifier on VT-d,
> >> + *			asid on ARM SMMU
> >> + * @addr:		page address to be invalidated
> >> + * @flags		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
> >> + *			IOMMU_INVALIDATE_GLOBAL_PAGE: global pages  
> > 
> > Shouldn't some of these be tied to the granularity of the
> > invalidation?  It seems like this should be more similar to
> > iommu_pasid_table_config where the granularity of the invalidation
> > defines which entry within a union at the end of the structure is valid
> > and populated.  Otherwise we have fields that don't make sense for
> > certain invalidations.  
> 
> I am a little bit embarrassed here as this API version is the outcome of
> long discussions held by Jacob, Jean-Philippe and many others. I don't
> want to hijack that work as I am "simply" reusing this API. Nevertheless
> I am willing to help on this. So following your recommendation above I
> dare to propose an updated API:
> 
> struct iommu_device_iotlb_inv_info {
>         __u32   version;
> #define IOMMU_DEV_IOTLB_INV_GLOBAL   0
> #define IOMMU_DEV_IOTLB_INV_SOURCEID (1 << 0)
> #define IOMMU_DEV_IOTLB_INV_PASID    (1 << 1)
>         __u8    granularity;
>         __u64   addr;
>         __u8    log2size;
>         __u64   sourceid;
>         __u64   pasid;
>         __u8    padding[2];
> };
> 
> struct iommu_iotlb_inv_info {
>         __u32   version;
> #define IOMMU_IOTLB_INV_GLOBAL  0
> #define IOMMU_IOTLB_INV_ARCHID  (1 << 0)
> #define IOMMU_IOTLB_INV_PASID   (1 << 1)
> #define IOMMU_IOTLB_INV_PAGE    (1 << 2)
>         __u8    granularity;
>         __u64   archid;
>         __u64   pasid;
>         __u64   addr;
>         __u8    log2size;
>         __u8    padding[2];
> };
> 
> struct iommu_pasid_inv_info {
>         __u32   version;
> #define IOMMU_PASID_INV_GLOBAL     0
> #define IOMMU_PASID_INV_ARCHID     (1 << 0)
> #define IOMMU_PASID_INV_PASID      (1 << 1)
> #define IOMMU_PASID_INV_SOURCEID   (1 << 2)
>         __u8    granularity;
>         __u64   archid;
>         __u64   pasid;
>         __u64   sourceid;
>         __u8    padding[3];
> };
> /**
>  * Translation cache invalidation information. It contains generic IOMMU
>  * data which can be parsed by model-specific drivers based on model ID.
>  * Since the invalidation of second-level page tables is included in
>  * the unmap operation, this info is only applicable to the first-level
>  * translation caches, i.e. DMA requests with PASID.
>  *
>  */
> struct iommu_cache_invalidate_info {
> #define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>         __u32 version;
> #define IOMMU_INV_TYPE_IOTLB        1 /* IOMMU paging structure cache */
> #define IOMMU_INV_TYPE_DEV_IOTLB    2 /* Device IOTLB */
> #define IOMMU_INV_TYPE_PASID        3 /* PASID cache */
>         __u8 type;
>         union {
>                 struct iommu_iotlb_inv_info iotlb_inv_info;
>                 struct iommu_device_iotlb_inv_info dev_iotlb_inv_info;
>                 struct iommu_pasid_inv_info pasid_inv_info;
>         };
> };
> 
> At the moment I ignore the leaf bool parameter used on ARM for PASID
> invalidation and TLB invalidation. Maybe we can just invalidate more
> than the leaf cache structures for now?
> 
> On ARM the PASID table can be invalidated per streamid. On VT-d, as far
> as I understand, the sourceid does not tag the entries.
> 
> >   
> >> + *
> >> + */
> >> +struct iommu_cache_invalidate_info {
> >> +	struct iommu_cache_invalidate_hdr	hdr;
> >> +	enum iommu_inv_granularity	granularity;  
> > 
> > A separate structure for hdr seems a little pointless.  
> removed
> >   
> >> +	__u32		flags;
> >> +#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 0)
> >> +#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 1)
> >> +	__u8		size;  
> > 
> > Really need some padding or packing here for any hope of having
> > consistency with userspace.
> >   
> >> +	__u64		nr_pages;
> >> +	__u32		pasid;  
> > 
> > Sub-optimal ordering for packing/padding.  Thanks,  
> I introduced some padding above. Is that OK?

No, you're not taking field alignment into account; processors don't
like unaligned data.  If we have:

struct foo {
	uint32_t	a;
	uint8_t		b;
	uint64_t	c;
	uint8_t		d;
	uint64_t	e;
};

In memory on a 64 bit system, that would look like:

aaaab...ccccccccd.......eeeeeeee

While on a 32 bit system, it would look like:

aaaab...ccccccccd...eeeeeeee

In this example we have 22 bytes of data (4 + 1 + 8 + 1 + 8), but the
structure is 32 bytes when provided by a 64 bit userspace or 28 bytes
when provided by a 32 bit userspace and the start address of the 'e'
field changes.  A 64 bit kernel would process the latter structure
incorrectly or fault trying to copy the expected length from userspace.
Adding padding to the end doesn't solve this. If we instead reconstruct
the structure as:

struct foo {
	uint32_t	a;
	uint8_t		b;
	uint8_t		d;
	uint8_t		pad[2];
	uint64_t	c;
	uint64_t	e;
};

Then we create a structure that looks the same from either a 32 bit or
64 bit userspace, is only 24 bytes in memory, and works for any
reasonable compiler, though we might choose to add a packed attribute to
make sure the compiler doesn't do anything screwy if we were paranoid.
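
As a quick check (a sketch, not part of the series), compile-time
assertions can pin the layout down; with the reordered structure they
hold for both 32 bit and 64 bit userspace:

#include <stddef.h>
#include <stdint.h>

struct foo {
	uint32_t	a;
	uint8_t		b;
	uint8_t		d;
	uint8_t		pad[2];
	uint64_t	c;
	uint64_t	e;
};

/* No implicit padding remains, so size and offsets match on both ABIs. */
_Static_assert(sizeof(struct foo) == 24, "unexpected struct size");
_Static_assert(offsetof(struct foo, c) == 8, "unexpected offset of c");
_Static_assert(offsetof(struct foo, e) == 16, "unexpected offset of e");
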
Thanks,

Alex

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC v3 02/21] iommu: Introduce cache_invalidate API
  2019-01-29 23:16       ` Alex Williamson
@ 2019-01-30  8:48         ` Auger Eric
  0 siblings, 0 replies; 59+ messages in thread
From: Auger Eric @ 2019-01-30  8:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger.pro, iommu, linux-kernel, kvm, kvmarm, joro,
	jacob.jun.pan, yi.l.liu, jean-philippe.brucker, will.deacon,
	robin.murphy, kevin.tian, ashok.raj, marc.zyngier,
	christoffer.dall, peter.maydell

Hi Alex,

On 1/30/19 12:16 AM, Alex Williamson wrote:
> On Fri, 25 Jan 2019 17:49:20 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alex,
>>
>> On 1/11/19 10:30 PM, Alex Williamson wrote:
>>> On Tue,  8 Jan 2019 11:26:14 +0100
>>> Eric Auger <eric.auger@redhat.com> wrote:
>>>   
>>>> From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
>>>>
>>>> In any virtualization use case, when the first translation stage
>>>> is "owned" by the guest OS, the host IOMMU driver has no knowledge
>>>> of caching structure updates unless the guest invalidation activities
>>>> are trapped by the virtualizer and passed down to the host.
>>>>
>>>> Since the invalidation data are obtained from user space and will be
>>>> written into the physical IOMMU, we must allow security checks at various
>>>> layers. Therefore, a generic invalidation data format is proposed here;
>>>> model specific IOMMU drivers need to convert it into their own format.
>>>>
>>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>
>>>> ---
>>>> v1 -> v2:
>>>> - add arch_id field
>>>> - renamed tlb_invalidate into cache_invalidate as this API allows
>>>>   to invalidate context caches on top of IOTLBs
>>>>
>>>> v1:
>>>> renamed sva_invalidate into tlb_invalidate and add iommu_ prefix in
>>>> header. Commit message reworded.
>>>> ---
>>>>  drivers/iommu/iommu.c      | 14 ++++++
>>>>  include/linux/iommu.h      | 14 ++++++
>>>>  include/uapi/linux/iommu.h | 95 ++++++++++++++++++++++++++++++++++++++
>>>>  3 files changed, 123 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>> index 0f2b7f1fc7c8..b2e248770508 100644
>>>> --- a/drivers/iommu/iommu.c
>>>> +++ b/drivers/iommu/iommu.c
>>>> @@ -1403,6 +1403,20 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(iommu_set_pasid_table);
>>>>  
>>>> +int iommu_cache_invalidate(struct iommu_domain *domain, struct device *dev,
>>>> +			   struct iommu_cache_invalidate_info *inv_info)
>>>> +{
>>>> +	int ret = 0;
>>>> +
>>>> +	if (unlikely(!domain->ops->cache_invalidate))
>>>> +		return -ENODEV;
>>>> +
>>>> +	ret = domain->ops->cache_invalidate(domain, dev, inv_info);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(iommu_cache_invalidate);
>>>> +
>>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>>  				  struct device *dev)
>>>>  {
>>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>>> index 1da2a2357ea4..96d59886f230 100644
>>>> --- a/include/linux/iommu.h
>>>> +++ b/include/linux/iommu.h
>>>> @@ -186,6 +186,7 @@ struct iommu_resv_region {
>>>>   * @of_xlate: add OF master IDs to iommu grouping
>>>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>>>>   * @set_pasid_table: set pasid table
>>>> + * @cache_invalidate: invalidate translation caches
>>>>   */
>>>>  struct iommu_ops {
>>>>  	bool (*capable)(enum iommu_cap);
>>>> @@ -231,6 +232,9 @@ struct iommu_ops {
>>>>  	int (*set_pasid_table)(struct iommu_domain *domain,
>>>>  			       struct iommu_pasid_table_config *cfg);
>>>>  
>>>> +	int (*cache_invalidate)(struct iommu_domain *domain, struct device *dev,
>>>> +				struct iommu_cache_invalidate_info *inv_info);
>>>> +
>>>>  	unsigned long pgsize_bitmap;
>>>>  };
>>>>  
>>>> @@ -294,6 +298,9 @@ extern void iommu_detach_device(struct iommu_domain *domain,
>>>>  				struct device *dev);
>>>>  extern int iommu_set_pasid_table(struct iommu_domain *domain,
>>>>  				 struct iommu_pasid_table_config *cfg);
>>>> +extern int iommu_cache_invalidate(struct iommu_domain *domain,
>>>> +				struct device *dev,
>>>> +				struct iommu_cache_invalidate_info *inv_info);
>>>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>>>>  extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
>>>>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>>>> @@ -709,6 +716,13 @@ int iommu_set_pasid_table(struct iommu_domain *domain,
>>>>  {
>>>>  	return -ENODEV;
>>>>  }
>>>> +static inline int
>>>> +iommu_cache_invalidate(struct iommu_domain *domain,
>>>> +		       struct device *dev,
>>>> +		       struct iommu_cache_invalidate_info *inv_info)
>>>> +{
>>>> +	return -ENODEV;
>>>> +}
>>>>  
>>>>  #endif /* CONFIG_IOMMU_API */
>>>>  
>>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>>> index 7a7cf7a3de7c..4605f5cfac84 100644
>>>> --- a/include/uapi/linux/iommu.h
>>>> +++ b/include/uapi/linux/iommu.h
>>>> @@ -47,4 +47,99 @@ struct iommu_pasid_table_config {
>>>>  	};
>>>>  };
>>>>  
>>>> +/**
>>>> + * enum iommu_inv_granularity - Generic invalidation granularity
>>>> + * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:	TLB entries or PASID caches of all
>>>> + *					PASIDs associated with a domain ID
>>>> + * @IOMMU_INV_GRANU_PASID_SEL:		TLB entries or PASID cache associated
>>>> + *					with a PASID and a domain
>>>> + * @IOMMU_INV_GRANU_PAGE_PASID:		TLB entries of selected page range
>>>> + *					within a PASID
>>>> + *
>>>> + * When an invalidation request is passed down to IOMMU to flush translation
>>>> + * caches, it may carry different granularity levels, which can be specific
>>>> + * to certain types of translation caches.
>>>> + * This enum is a collection of granularities for all types of translation
>>>> + * caches. The idea is to make it easy for an IOMMU model-specific driver
>>>> + * to convert from generic to model-specific values. Each IOMMU driver
>>>> + * can enforce checks based on its own conversion table. The conversion is
>>>> + * based on 2D look-up with inputs as follows:
>>>> + * - translation cache types
>>>> + * - granularity
>>>> + *
>>>> + *             type |   DTLB    |    TLB    |   PASID   |
>>>> + *  granule         |           |           |   cache   |
>>>> + * -----------------+-----------+-----------+-----------+
>>>> + *  DN_ALL_PASID    |   Y       |   Y       |   Y       |
>>>> + *  PASID_SEL       |   Y       |   Y       |   Y       |
>>>> + *  PAGE_PASID      |   Y       |   Y       |   N/A     |
>>>> + *
>>>> + */
>>>> +enum iommu_inv_granularity {
>>>> +	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
>>>> +	IOMMU_INV_GRANU_PASID_SEL,
>>>> +	IOMMU_INV_GRANU_PAGE_PASID,
>>>> +	IOMMU_INV_NR_GRANU,
>>>> +};
>>>> +
>>>> +/**
>>>> + * enum iommu_inv_type - Generic translation cache types for invalidation
>>>> + *
>>>> + * @IOMMU_INV_TYPE_DTLB:	device IOTLB
>>>> + * @IOMMU_INV_TYPE_TLB:		IOMMU paging structure cache
>>>> + * @IOMMU_INV_TYPE_PASID:	PASID cache
>>>> + * Invalidation requests sent to IOMMU for a given device need to indicate
>>>> + * which type of translation cache is to be operated on. Combined with enum
>>>> + * iommu_inv_granularity, a model-specific driver can do a simple lookup to
>>>> + * convert from generic to model specific value.
>>>> + */
>>>> +enum iommu_inv_type {
>>>> +	IOMMU_INV_TYPE_DTLB,
>>>> +	IOMMU_INV_TYPE_TLB,
>>>> +	IOMMU_INV_TYPE_PASID,
>>>> +	IOMMU_INV_NR_TYPE
>>>> +};
>>>> +
>>>> +/**
>>>> + * Translation cache invalidation header that contains mandatory metadata.
>>>> + * @version:	info format version, expecting future extensions
>>>> + * @type:	type of translation cache to be invalidated
>>>> + */
>>>> +struct iommu_cache_invalidate_hdr {
>>>> +	__u32 version;
>>>> +#define TLB_INV_HDR_VERSION_1 1
>>>> +	enum iommu_inv_type type;
>>>> +};
>>>> +
>>>> +/**
>>>> + * Translation cache invalidation information. It contains generic IOMMU
>>>> + * data which can be parsed by model-specific drivers based on model ID.
>>>> + * Since the invalidation of second-level page tables is included in the
>>>> + * unmap operation, this info is only applicable to the first-level
>>>> + * translation caches, i.e. DMA requests with PASID.
>>>> + *
>>>> + * @granularity:	requested invalidation granularity, type dependent
>>>> + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.  
>>>
>>> Why is this a 4K page centric interface?  
>> This matches the vt-d Address Mask (AM) field of the IOTLB Invalidate
>> Descriptor. We can pass a log2size instead.
> 
> Could some options not require a power of two size?
> 
>>>> + * @nr_pages:		number of pages to invalidate
>>>> + * @pasid:		processor address space ID value per PCI spec.
>>>> + * @arch_id:		architecture dependent id characterizing a context
>>>> + *			and tagging the caches, i.e. domain identifier on VT-d,
>>>> + *			asid on ARM SMMU
>>>> + * @addr:		page address to be invalidated
>>>> + * @flags		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
>>>> + *			IOMMU_INVALIDATE_GLOBAL_PAGE: global pages  
>>>
>>> Shouldn't some of these be tied to the granularity of the
>>> invalidation?  It seems like this should be more similar to
>>> iommu_pasid_table_config where the granularity of the invalidation
>>> defines which entry within a union at the end of the structure is valid
>>> and populated.  Otherwise we have fields that don't make sense for
>>> certain invalidations.  
>>
>> I am a little bit embarrassed here as this API version is the outcome of
>> long discussions held by Jacob, Jean-Philippe and many others. I don't
>> want to hijack that work as I am "simply" reusing this API. Nevertheless
>> I am willing to help on this. So following your recommendation above I
>> dare to propose an updated API:
>>
>> struct iommu_device_iotlb_inv_info {
>>         __u32   version;
>> #define IOMMU_DEV_IOTLB_INV_GLOBAL   0
>> #define IOMMU_DEV_IOTLB_INV_SOURCEID (1 << 0)
>> #define IOMMU_DEV_IOTLB_INV_PASID    (1 << 1)
>>         __u8    granularity;
>>         __u64   addr;
>>         __u8    log2size;
>>         __u64   sourceid;
>>         __u64   pasid;
>>         __u8    padding[2];
>> };
>>
>> struct iommu_iotlb_inv_info {
>>         __u32   version;
>> #define IOMMU_IOTLB_INV_GLOBAL  0
>> #define IOMMU_IOTLB_INV_ARCHID  (1 << 0)
>> #define IOMMU_IOTLB_INV_PASID   (1 << 1)
>> #define IOMMU_IOTLB_INV_PAGE    (1 << 2)
>>         __u8    granularity;
>>         __u64   archid;
>>         __u64   pasid;
>>         __u64   addr;
>>         __u8    log2size;
>>         __u8    padding[2];
>> };
>>
>> struct iommu_pasid_inv_info {
>>         __u32   version;
>> #define IOMMU_PASID_INV_GLOBAL     0
>> #define IOMMU_PASID_INV_ARCHID     (1 << 0)
>> #define IOMMU_PASID_INV_PASID      (1 << 1)
>> #define IOMMU_PASID_INV_SOURCEID   (1 << 2)
>>         __u8    granularity;
>>         __u64   archid;
>>         __u64   pasid;
>>         __u64   sourceid;
>>         __u8    padding[3];
>> };
>> /**
>>  * Translation cache invalidation information. It contains generic IOMMU
>>  * data which can be parsed by model-specific drivers based on model ID.
>>  * Since the invalidation of second-level page tables is included in
>>  * the unmap operation, this info is only applicable to the first-level
>>  * translation caches, i.e. DMA requests with PASID.
>>  *
>>  */
>> struct iommu_cache_invalidate_info {
>> #define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
>>         __u32 version;
>> #define IOMMU_INV_TYPE_IOTLB        1 /* IOMMU paging structure cache */
>> #define IOMMU_INV_TYPE_DEV_IOTLB    2 /* Device IOTLB */
>> #define IOMMU_INV_TYPE_PASID        3 /* PASID cache */
>>         __u8 type;
>>         union {
>>                 struct iommu_iotlb_inv_info iotlb_inv_info;
>>                 struct iommu_device_iotlb_inv_info dev_iotlb_inv_info;
>>                 struct iommu_pasid_inv_info pasid_inv_info;
>>         };
>> };
>>
>> At the moment I ignore the leaf bool parameter used on ARM for PASID
>> invalidation and TLB invalidation. Maybe we can just invalidate more
>> than the leaf cache structures for now?
>>
>> On ARM the PASID table can be invalidated per streamid. On VT-d, as far
>> as I understand, the sourceid does not tag the entries.
>>
>>>   
>>>> + *
>>>> + */
>>>> +struct iommu_cache_invalidate_info {
>>>> +	struct iommu_cache_invalidate_hdr	hdr;
>>>> +	enum iommu_inv_granularity	granularity;  
>>>
>>> A separate structure for hdr seems a little pointless.  
>> removed
>>>   
>>>> +	__u32		flags;
>>>> +#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 0)
>>>> +#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 1)
>>>> +	__u8		size;  
>>>
>>> Really need some padding or packing here for any hope of having
>>> consistency with userspace.
>>>   
>>>> +	__u64		nr_pages;
>>>> +	__u32		pasid;  
>>>
>>> Sub-optimal ordering for packing/padding.  Thanks,  
>> I introduced some padding above. Is that OK?
> 
> No, you're not taking field alignment into account; processors don't
> like unaligned data.  If we have:
> 
> struct foo {
> 	uint32_t	a;
> 	uint8_t		b;
> 	uint64_t	c;
> 	uint8_t		d;
> 	uint64_t	e;
> };
> 
> In memory on a 64 bit system, that would look like:
> 
> aaaab...ccccccccd.......eeeeeeee
> 
> While on a 32 bit system, it would look like:
> 
> aaaab...ccccccccd...eeeeeeee
> 
> In this example we have 22 bytes of data (4 + 1 + 8 + 1 + 8), but the
> structure is 32 bytes when provided by a 64 bit userspace or 28 bytes
> when provided by a 32 bit userspace and the start address of the 'e'
> field changes.  A 64 bit kernel would process the latter structure
> incorrectly or fault trying to copy the expected length from userspace.
> Adding padding to the end doesn't solve this. If we instead reconstruct
> the structure as:
> 
> struct foo {
> 	uint32_t	a;
> 	uint8_t		b;
> 	uint8_t		d;
> 	uint8_t		pad[2];
> 	uint64_t	c;
> 	uint64_t	e;
> };
> 
> Then we create a structure that looks the same from either a 32 bit or
> 64 bit userspace, is only 24 bytes in memory, and works for any
> reasonable compiler, though we might choose to add a packed attribute to
> make sure the compiler doesn't do anything screwy if we were paranoid.
> Thanks,

Thank you very much for your time and explanations. That's understood now.

Eric
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2019-01-30  8:49 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-08 10:26 [RFC v3 00/21] SMMUv3 Nested Stage Setup Eric Auger
2019-01-08 10:26 ` [RFC v3 01/21] iommu: Introduce set_pasid_table API Eric Auger
2019-01-11 18:16   ` Jean-Philippe Brucker
2019-01-25  8:39     ` Auger Eric
2019-01-25  8:55       ` Auger Eric
2019-01-25 10:33         ` Jean-Philippe Brucker
2019-01-11 18:43   ` Alex Williamson
2019-01-25  9:20     ` Auger Eric
2019-01-08 10:26 ` [RFC v3 02/21] iommu: Introduce cache_invalidate API Eric Auger
2019-01-11 21:30   ` Alex Williamson
2019-01-25 16:49     ` Auger Eric
2019-01-28 17:32       ` Jean-Philippe Brucker
2019-01-29 17:49         ` Auger Eric
2019-01-29 23:16       ` Alex Williamson
2019-01-30  8:48         ` Auger Eric
2019-01-08 10:26 ` [RFC v3 03/21] iommu: Introduce bind_guest_msi Eric Auger
2019-01-11 22:44   ` Alex Williamson
2019-01-25 17:51     ` Auger Eric
2019-01-25 18:11     ` Auger Eric
2019-01-08 10:26 ` [RFC v3 04/21] vfio: VFIO_IOMMU_SET_PASID_TABLE Eric Auger
2019-01-11 22:50   ` Alex Williamson
2019-01-15 21:34     ` Auger Eric
2019-01-08 10:26 ` [RFC v3 05/21] vfio: VFIO_IOMMU_CACHE_INVALIDATE Eric Auger
2019-01-08 10:26 ` [RFC v3 06/21] vfio: VFIO_IOMMU_BIND_MSI Eric Auger
2019-01-11 23:02   ` Alex Williamson
2019-01-11 23:23     ` Alex Williamson
2019-01-08 10:26 ` [RFC v3 07/21] iommu/arm-smmu-v3: Link domains and devices Eric Auger
2019-01-08 10:26 ` [RFC v3 08/21] iommu/arm-smmu-v3: Maintain a SID->device structure Eric Auger
2019-01-08 10:26 ` [RFC v3 09/21] iommu/smmuv3: Get prepared for nested stage support Eric Auger
2019-01-11 16:04   ` Jean-Philippe Brucker
2019-01-25 19:27   ` Robin Murphy
2019-01-08 10:26 ` [RFC v3 10/21] iommu/smmuv3: Implement set_pasid_table Eric Auger
2019-01-08 10:26 ` [RFC v3 11/21] iommu/smmuv3: Implement cache_invalidate Eric Auger
2019-01-11 16:59   ` Jean-Philippe Brucker
2019-01-08 10:26 ` [RFC v3 12/21] dma-iommu: Implement NESTED_MSI cookie Eric Auger
2019-01-08 10:26 ` [RFC v3 13/21] iommu/smmuv3: Implement bind_guest_msi Eric Auger
2019-01-08 10:26 ` [RFC v3 14/21] iommu: introduce device fault data Eric Auger
     [not found]   ` <20190110104544.26f3bcb1@jacob-builder>
2019-01-11 11:06     ` Jean-Philippe Brucker
2019-01-14 22:32       ` Jacob Pan
2019-01-16 15:52         ` Jean-Philippe Brucker
2019-01-16 18:33           ` Auger Eric
2019-01-15 21:27       ` Auger Eric
2019-01-16 16:54         ` Jean-Philippe Brucker
2019-01-08 10:26 ` [RFC v3 15/21] driver core: add per device iommu param Eric Auger
2019-01-08 10:26 ` [RFC v3 16/21] iommu: introduce device fault report API Eric Auger
2019-01-08 10:26 ` [RFC v3 17/21] iommu/smmuv3: Report non recoverable faults Eric Auger
2019-01-11 17:46   ` Jean-Philippe Brucker
2019-01-15 21:06     ` Auger Eric
2019-01-16 12:25       ` Jean-Philippe Brucker
2019-01-16 12:49         ` Auger Eric
2019-01-08 10:26 ` [RFC v3 18/21] vfio-pci: Add a new VFIO_REGION_TYPE_NESTED region type Eric Auger
2019-01-11 23:58   ` Alex Williamson
2019-01-14 20:48     ` Auger Eric
2019-01-14 23:04       ` Alex Williamson
2019-01-15 21:56         ` Auger Eric
2019-01-08 10:26 ` [RFC v3 19/21] vfio-pci: Register an iommu fault handler Eric Auger
2019-01-08 10:26 ` [RFC v3 20/21] vfio-pci: Add VFIO_PCI_DMA_FAULT_IRQ_INDEX Eric Auger
2019-01-08 10:26 ` [RFC v3 21/21] vfio: Document nested stage control Eric Auger
2019-01-18 10:02 ` [RFC v3 00/21] SMMUv3 Nested Stage Setup Auger Eric

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).