linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization
@ 2017-06-14 22:22 Jacob Pan
  2017-06-14 22:22 ` [RFC 1/9] iommu: Introduce bind_pasid_table API function Jacob Pan
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:22 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan

This patchset provides IOMMU driver support of shared virtual memory (SVM)
virtualization. Generic APIs are introduced in addition to Intel VT-d specific
changes, the goal is to have common interfaces across IOMMU and device types for
both VFIO and other in-kernel users.

This is the IOMMU portion follow up of the more complete series of
the kernel changes to support SVM. Please refer to the link below for more
details.
https://www.spinics.net/lists/kvm/msg148819.html
New in this series are the IOMMU fault notification APIs.

At the top level, three new IOMMU interfaces are introduced:
 - bind PASID table
 - passdown invalidation
 - per device IOMMU fault notification

The additional patches are Intel VT-d specific, which either implements or
replaces existing private interfaces with the generic ones.

Thanks,

Jacob


Jacob Pan (8):
  iommu: Introduce bind_pasid_table API function
  iommu/vt-d: add bind_pasid_table function
  iommu/vt-d: Add iommu do invalidate function
  iommu: Introduce fault notifier API
  iommu/vt-d: track device with pasid table bond to a guest
  iommu/dmar: notify unrecoverable faults
  iommu/intel-svm: notify page request to guest
  iommu/intel-svm: replace dev ops with generic fault notifier

Liu, Yi L (1):
  iommu: Introduce iommu do invalidate API function

 drivers/iommu/dmar.c          |  37 ++++++++-
 drivers/iommu/intel-iommu.c   | 169 +++++++++++++++++++++++++++++++++++++-----
 drivers/iommu/intel-svm.c     |  94 ++++++++++++++++++++---
 drivers/iommu/iommu.c         |  95 ++++++++++++++++++++++++
 include/linux/dma_remapping.h |   1 +
 include/linux/intel-iommu.h   |  30 +++++++-
 include/linux/intel-svm.h     |  20 +----
 include/linux/iommu.h         |  88 ++++++++++++++++++++++
 include/uapi/linux/iommu.h    |  37 +++++++++
 9 files changed, 522 insertions(+), 49 deletions(-)
 create mode 100644 include/uapi/linux/iommu.h

-- 
2.7.4

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC 1/9] iommu: Introduce bind_pasid_table API function
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
@ 2017-06-14 22:22 ` Jacob Pan
  2017-06-22 22:52   ` Alex Williamson
  2017-06-14 22:22 ` [RFC 2/9] iommu/vt-d: add bind_pasid_table function Jacob Pan
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:22 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan, Liu, Yi L

Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
case in the guest:
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html

As part of the proposed architecture, when a SVM capable PCI
device is assigned to a guest, nested mode is turned on. Guest owns the
first level page tables (request with PASID) and performs GVA->GPA
translation. Second level page tables are owned by the host for GPA->HPA
translation for both request with and without PASID.

A new IOMMU driver interface is therefore needed to perform tasks as
follows:
* Enable nested translation and appropriate translation type
* Assign guest PASID table pointer (in GPA) and size to host IOMMU

This patch introduces new functions called iommu_(un)bind_pasid_table()
to IOMMU APIs. Architecture specific IOMMU function can be added later
to perform the specific steps for binding pasid table of assigned devices.

This patch also adds model definition in iommu.h. It would be used to
check if the bind request is from a compatible entity. e.g. a bind
request from an intel_iommu emulator may not be supported by an ARM SMMU
driver.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c      | 19 +++++++++++++++++++
 include/linux/iommu.h      | 23 +++++++++++++++++++++++
 include/uapi/linux/iommu.h | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 75 insertions(+)
 create mode 100644 include/uapi/linux/iommu.h

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index cf7ca7e..494309b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1328,6 +1328,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device);
 
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+			struct pasid_table_info *pasidt_binfo)
+{
+	if (unlikely(!domain->ops->bind_pasid_table))
+		return -EINVAL;
+
+	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
+
+int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+	if (unlikely(!domain->ops->unbind_pasid_table))
+		return -EINVAL;
+
+	return domain->ops->unbind_pasid_table(domain, dev);
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2cb54ad..7122add 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -25,6 +25,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/of.h>
+#include <uapi/linux/iommu.h>
 
 #define IOMMU_READ	(1 << 0)
 #define IOMMU_WRITE	(1 << 1)
@@ -183,6 +184,8 @@ struct iommu_resv_region {
  * @domain_get_windows: Return the number of windows for a domain
  * @of_xlate: add OF master IDs to iommu grouping
  * @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @bind_pasid_table: bind pasid table pointer for guest SVM
+ * @unbind_pasid_table: unbind pasid table pointer and restore defaults
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -225,6 +228,10 @@ struct iommu_ops {
 	u32 (*domain_get_windows)(struct iommu_domain *domain);
 
 	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
+	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
+				struct pasid_table_info *pasidt_binfo);
+	int (*unbind_pasid_table)(struct iommu_domain *domain,
+				struct device *dev);
 
 	unsigned long pgsize_bitmap;
 };
@@ -282,6 +289,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
 extern void iommu_detach_device(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_bind_pasid_table(struct iommu_domain *domain,
+		struct device *dev, struct pasid_table_info *pasidt_binfo);
+extern int iommu_unbind_pasid_table(struct iommu_domain *domain,
+				struct device *dev);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
@@ -637,6 +648,18 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
 	return NULL;
 }
 
+static inline
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+			struct pasid_table_info *pasidt_binfo)
+{
+	return -EINVAL;
+}
+static inline
+int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
new file mode 100644
index 0000000..5ef0e7c
--- /dev/null
+++ b/include/uapi/linux/iommu.h
@@ -0,0 +1,33 @@
+/*
+ * IOMMU user API definitions
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _UAPI_IOMMU_H
+#define _UAPI_IOMMU_H
+
+/**
+ * PASID table data used to bind guest PASID table to the host IOMMU. This will
+ * enable guest managed first level page tables.
+ * @ptr		PASID table pointer in GPA
+ * @size	size of the guest PASID table, must be <= host table size
+ * @model	magic number tells vendor apart
+ * @length	length of the opaque data
+ * @opaque	architecture specific IOMMU data
+ */
+struct pasid_table_info {
+	__u64	ptr;
+	__u64	size;
+	__u32	model;
+#define INTEL_IOMMU	(1 << 0)
+#define ARM_SMMU	(1 << 1)
+	__u32	length;
+	__u8	opaque[];
+};
+
+
+#endif /* _UAPI_IOMMU_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 2/9] iommu/vt-d: add bind_pasid_table function
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
  2017-06-14 22:22 ` [RFC 1/9] iommu: Introduce bind_pasid_table API function Jacob Pan
@ 2017-06-14 22:22 ` Jacob Pan
  2017-06-22 22:52   ` Alex Williamson
  2017-06-14 22:22 ` [RFC 3/9] iommu: Introduce iommu do invalidate API function Jacob Pan
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:22 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan, Liu, Yi L

Add Intel VT-d ops to the generic iommu_bind_pasid_table API
functions.

The primary use case is for direct assignment of SVM capable
device. Originated from emulated IOMMU in the guest, the request goes
through many layers (e.g. VFIO). Upon calling host IOMMU driver, caller
passes guest PASID table pointer (GPA) and size.

Device context table entry is modified by Intel IOMMU specific
bind_pasid_table function. This will turn on nesting mode and matching
translation type.

The unbind operation restores default context mapping.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-iommu.c   | 109 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/dma_remapping.h |   1 +
 2 files changed, 110 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index fc2765c..1d5d9ab9 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5430,6 +5430,111 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
 
 	return iommu;
 }
+
+static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
+		struct device *dev, struct pasid_table_info *pasidt_binfo)
+{
+	struct intel_iommu *iommu;
+	struct context_entry *context;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	struct pci_dev *pdev;
+	u8 bus, devfn;
+	u16 did, *sid;
+	int ret = 0;
+	unsigned long flags;
+	u64 ctx_lo;
+
+	if (pasidt_binfo == NULL || pasidt_binfo->model != INTEL_IOMMU) {
+		pr_warn("%s: Invalid bind request!\n", __func__);
+		return -EINVAL;
+	}
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+
+	sid = (u16 *)&pasidt_binfo->opaque;
+	/*
+	 * check SID, if it is not correct, return success to allow looping
+	 * through all devices within a group
+	 */
+	if (PCI_DEVID(bus, devfn) != *sid)
+		return 0;
+
+	pdev = to_pci_dev(dev);
+	info = dev->archdata.iommu;
+	if (!info || !info->pasid_supported) {
+		pr_err("PCI %04x:%02x:%02x.%d: has no PASID support\n",
+			       pci_domain_nr(pdev->bus), bus, PCI_SLOT(devfn),
+			       PCI_FUNC(devfn));
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (pasidt_binfo->size > intel_iommu_get_pts(iommu)) {
+		pr_err("Invalid gPASID table size %llu, host size %lu\n",
+			pasidt_binfo->size,
+			intel_iommu_get_pts(iommu));
+		ret = -EINVAL;
+		goto out;
+	}
+	spin_lock_irqsave(&iommu->lock, flags);
+	context = iommu_context_addr(iommu, bus, devfn, 0);
+	if (!context || !context_present(context)) {
+		pr_warn("%s: ctx not present for bus devfn %x:%x\n",
+			__func__, bus, devfn);
+		spin_unlock_irqrestore(&iommu->lock, flags);
+		goto out;
+	}
+	/* Anticipate guest to use SVM and owns the first level */
+	ctx_lo = context[0].lo;
+	ctx_lo |= CONTEXT_NESTE;
+	ctx_lo |= CONTEXT_PRS;
+	ctx_lo |= CONTEXT_PASIDE;
+	ctx_lo &= ~CONTEXT_TT_MASK;
+	ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
+	context[0].lo = ctx_lo;
+
+	/* Assign guest PASID table pointer and size */
+	ctx_lo = (pasidt_binfo->ptr & VTD_PAGE_MASK) | pasidt_binfo->size;
+	context[1].lo = ctx_lo;
+	/* make sure context entry is updated before flushing */
+	wmb();
+	did = dmar_domain->iommu_did[iommu->seq_id];
+	iommu->flush.flush_context(iommu, did,
+				(((u16)bus) << 8) | devfn,
+				DMA_CCMD_MASK_NOBIT,
+				DMA_CCMD_DEVICE_INVL);
+	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
+	spin_unlock_irqrestore(&iommu->lock, flags);
+
+
+out:
+	return ret;
+}
+
+static int intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
+					struct device *dev)
+{
+	struct intel_iommu *iommu;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	u8 bus, devfn;
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+	/*
+	 * REVISIT: we might want to clear the PASID table pointer
+	 * as part of context clear operation. Currently, it leaves
+	 * stale data but should be ignored by hardware since PASIDE
+	 * is clear.
+	 */
+	/* ATS will be reenabled when remapping is restored */
+	pci_disable_ats(to_pci_dev(dev));
+	domain_context_clear(iommu, dev);
+	return domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
+}
 #endif /* CONFIG_INTEL_IOMMU_SVM */
 
 const struct iommu_ops intel_iommu_ops = {
@@ -5438,6 +5543,10 @@ const struct iommu_ops intel_iommu_ops = {
 	.domain_free		= intel_iommu_domain_free,
 	.attach_dev		= intel_iommu_attach_device,
 	.detach_dev		= intel_iommu_detach_device,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+	.bind_pasid_table	= intel_iommu_bind_pasid_table,
+	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
+#endif
 	.map			= intel_iommu_map,
 	.unmap			= intel_iommu_unmap,
 	.map_sg			= default_iommu_map_sg,
diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
index 9088407..85367b7 100644
--- a/include/linux/dma_remapping.h
+++ b/include/linux/dma_remapping.h
@@ -27,6 +27,7 @@
 
 #define CONTEXT_DINVE		(1ULL << 8)
 #define CONTEXT_PRS		(1ULL << 9)
+#define CONTEXT_NESTE		(1ULL << 10)
 #define CONTEXT_PASIDE		(1ULL << 11)
 
 struct intel_iommu;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 3/9] iommu: Introduce iommu do invalidate API function
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
  2017-06-14 22:22 ` [RFC 1/9] iommu: Introduce bind_pasid_table API function Jacob Pan
  2017-06-14 22:22 ` [RFC 2/9] iommu/vt-d: add bind_pasid_table function Jacob Pan
@ 2017-06-14 22:22 ` Jacob Pan
  2017-06-22 22:52   ` Alex Williamson
  2017-06-14 22:22 ` [RFC 4/9] iommu/vt-d: Add iommu do invalidate function Jacob Pan
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:22 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Liu, Yi L, Liu, Jacob Pan

From: "Liu, Yi L" <yi.l.liu@linux.intel.com>

When a SVM capable device is assigned to a guest, the first level page
tables are owned by the guest and the guest PASID table pointer is
linked to the device context entry of the physical IOMMU.

Host IOMMU driver has no knowledge of caching structure updates unless
the guest invalidation activities are passed down to the host. The
primary usage is derived from emulated IOMMU in the guest, where QEMU
can trap invalidation activities before pass them down the
host/physical IOMMU. There are IOMMU architectural specific actions
need to be taken which requires the generic APIs introduced in this
patch to have opaque data in the tlb_invalidate_info argument.

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c      | 13 +++++++++++++
 include/linux/iommu.h      | 11 +++++++++++
 include/uapi/linux/iommu.h |  4 ++++
 3 files changed, 28 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 494309b..c786370 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1347,6 +1347,19 @@ int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
 
+int iommu_do_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	int ret = 0;
+
+	if (unlikely(domain->ops->do_invalidate == NULL))
+		return -ENODEV;
+
+	ret = domain->ops->do_invalidate(domain, dev, inv_info);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_do_invalidate);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7122add..2cdbaa3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -232,6 +232,8 @@ struct iommu_ops {
 				struct pasid_table_info *pasidt_binfo);
 	int (*unbind_pasid_table)(struct iommu_domain *domain,
 				struct device *dev);
+	int (*do_invalidate)(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info);
 
 	unsigned long pgsize_bitmap;
 };
@@ -293,6 +295,9 @@ extern int iommu_bind_pasid_table(struct iommu_domain *domain,
 		struct device *dev, struct pasid_table_info *pasidt_binfo);
 extern int iommu_unbind_pasid_table(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_do_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info);
+
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
@@ -660,6 +665,12 @@ int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
 	return -EINVAL;
 }
 
+static inline int iommu_do_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 5ef0e7c..fcee18f 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -29,5 +29,9 @@ struct pasid_table_info {
 	__u8	opaque[];
 };
 
+struct tlb_invalidate_info {
+	__u32	model;
+	__u8	opaque[];
+};
 
 #endif /* _UAPI_IOMMU_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 4/9] iommu/vt-d: Add iommu do invalidate function
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
                   ` (2 preceding siblings ...)
  2017-06-14 22:22 ` [RFC 3/9] iommu: Introduce iommu do invalidate API function Jacob Pan
@ 2017-06-14 22:22 ` Jacob Pan
  2017-06-22 22:52   ` Alex Williamson
  2017-06-14 22:22 ` [RFC 5/9] iommu: Introduce fault notifier API Jacob Pan
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:22 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan, Liu, Yi L

This patch adds Intel VT-d specific function to implement
iommu_do_invalidate API.

The use case is for supporting caching structure invalidation
of assigned SVM capable devices. Emulated IOMMU exposes queue
invalidation capability and passes down all descriptors from the guest
to the physical IOMMU.

The assumption is that guest to host device ID mapping should be
resolved prior to calling IOMMU driver. Based on the device handle,
host IOMMU driver can replace certain fields before submit to the
invalidation queue.

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-iommu.c | 41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h | 11 ++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 1d5d9ab9..6b8e997 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5127,6 +5127,46 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
 	dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
 }
 
+static int intel_iommu_do_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	struct intel_iommu *iommu;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct intel_invalidate_data *inv_data;
+	struct qi_desc *qi;
+	u16 did;
+	u8 bus, devfn;
+
+	if (!inv_info || !dmar_domain || (inv_info->model != INTEL_IOMMU))
+		return -EINVAL;
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+
+	inv_data = (struct intel_invalidate_data *)&inv_info->opaque;
+
+	/* check SID */
+	if (PCI_DEVID(bus, devfn) != inv_data->sid)
+		return 0;
+
+	qi = &inv_data->inv_desc;
+
+	switch (qi->low & QI_TYPE_MASK) {
+	case QI_DIOTLB_TYPE:
+	case QI_DEIOTLB_TYPE:
+		/* for device IOTLB, we just let it pass through */
+		break;
+	default:
+		did = dmar_domain->iommu_did[iommu->seq_id];
+		qi->low &= ~QI_DID_MASK;
+		qi->low |= QI_DID(did);
+		break;
+	}
+
+	return qi_submit_sync(qi, iommu);
+}
+
 static int intel_iommu_map(struct iommu_domain *domain,
 			   unsigned long iova, phys_addr_t hpa,
 			   size_t size, int iommu_prot)
@@ -5546,6 +5586,7 @@ const struct iommu_ops intel_iommu_ops = {
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	.bind_pasid_table	= intel_iommu_bind_pasid_table,
 	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
+	.do_invalidate		= intel_iommu_do_invalidate,
 #endif
 	.map			= intel_iommu_map,
 	.unmap			= intel_iommu_unmap,
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 485a5b4..8df6c91 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -31,7 +31,6 @@
 #include <linux/list.h>
 #include <linux/iommu.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
-
 #include <asm/cacheflush.h>
 #include <asm/iommu.h>
 
@@ -258,6 +257,10 @@ enum {
 #define QI_PGRP_RESP_TYPE	0x9
 #define QI_PSTRM_RESP_TYPE	0xa
 
+#define QI_DID(did)		(((u64)did & 0xffff) << 16)
+#define QI_DID_MASK		GENMASK(31, 16)
+#define QI_TYPE_MASK		GENMASK(3, 0)
+
 #define QI_IEC_SELECTIVE	(((u64)1) << 4)
 #define QI_IEC_IIDEX(idx)	(((u64)(idx & 0xffff) << 32))
 #define QI_IEC_IM(m)		(((u64)(m & 0x1f) << 27))
@@ -489,6 +492,12 @@ extern int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_
 extern struct intel_iommu *intel_svm_device_to_iommu(struct device *dev);
 #endif
 
+struct intel_invalidate_data {
+	u16 sid;
+	u32 pasid;
+	struct qi_desc inv_desc;
+};
+
 extern const struct attribute_group *intel_iommu_groups[];
 
 #endif
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 5/9] iommu: Introduce fault notifier API
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
                   ` (3 preceding siblings ...)
  2017-06-14 22:22 ` [RFC 4/9] iommu/vt-d: Add iommu do invalidate function Jacob Pan
@ 2017-06-14 22:22 ` Jacob Pan
  2017-06-22 22:53   ` Alex Williamson
  2017-06-14 22:23 ` [RFC 6/9] iommu/vt-d: track device with pasid table bond to a guest Jacob Pan
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:22 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan

Traditionally, device specific faults are detected and handled within
their own device drivers. When IOMMU is enabled, faults such as DMA
related transactions are detected by IOMMU. There is no generic
reporting mechanism to report faults back to the in-kernel device
driver or the guest OS in case of assigned devices.

Faults detected by IOMMU is based on the transaction's source ID which
can be reported at per device basis, regardless of the device type is a
PCI device or not.

The fault types includes recoverable (e.g. page request) and
unrecoverable faults(e.g. invalid context). In most cases, faults can be
handled by IOMMU drivers. However, there are use cases that require
fault processing outside IOMMU driver, e.g.

1. page request fault originated from an SVM capable device that is
assigned to guest via vIOMMU. In this case, the first level page tables
are owned by the guest. Page request must be propagated to the guest to
let guest OS fault in the pages then send page response. In this
mechanism, the direct receiver of IOMMU fault notification is VFIO,
which can relay notification events to QEMU or other user space
software.

2. faults need more subtle handling by device drivers. Other than
simply invoke reset function, there are needs to let device driver
handle the fault with a smaller impact.

This patchset is intended to create a generic fault notification API such
that it can scale as follows:
- all IOMMU types
- PCI and non-PCI devices
- recoverable and unrecoverable faults
- VFIO and other other in kernel users
- DMA & IRQ remapping (TBD)

The event data contains both generic and raw architectural data
such that performance is not compromised as the data propagation may
involve many layers.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/iommu.h | 54 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c786370..04c73f3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -48,6 +48,7 @@ struct iommu_group {
 	struct list_head devices;
 	struct mutex mutex;
 	struct blocking_notifier_head notifier;
+	struct blocking_notifier_head fault_notifier;
 	void *iommu_data;
 	void (*iommu_data_release)(void *iommu_data);
 	char *name;
@@ -345,6 +346,7 @@ struct iommu_group *iommu_group_alloc(void)
 	mutex_init(&group->mutex);
 	INIT_LIST_HEAD(&group->devices);
 	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
+	BLOCKING_INIT_NOTIFIER_HEAD(&group->fault_notifier);
 
 	ret = ida_simple_get(&iommu_group_ida, 0, 0, GFP_KERNEL);
 	if (ret < 0) {
@@ -790,6 +792,67 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
 /**
+ * iommu_register_fault_notifier - Register a notifier for fault reporting
+ * @dev: device to notify fault events
+ * @nb: notifier block to signal
+ *
+ */
+int iommu_register_fault_notifier(struct device *dev,
+				struct notifier_block *nb)
+{
+	int ret;
+	struct iommu_group *group = iommu_group_get(dev);
+
+	if (!group)
+		return -EINVAL;
+
+	ret = blocking_notifier_chain_register(&group->fault_notifier, nb);
+	iommu_group_put(group);
+
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(iommu_register_fault_notifier);
+
+/**
+ * iommu_unregister_fault_notifier - Unregister a notifier for fault reporting
+ * @domain: the domain to watch
+ * @nb: notifier block to signal
+ *
+ */
+int iommu_unregister_fault_notifier(struct device *dev,
+				  struct notifier_block *nb)
+{
+	int ret;
+	struct iommu_group *group = iommu_group_get(dev);
+
+	if (!group)
+		return -EINVAL;
+
+	ret = blocking_notifier_chain_unregister(&group->fault_notifier, nb);
+	iommu_group_put(group);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_unregister_fault_notifier);
+
+int iommu_fault_notifier_call_chain(struct iommu_fault_event *event)
+{
+	int ret;
+	struct iommu_group *group = iommu_group_get(event->dev);
+
+	if (!group)
+		return -EINVAL;
+	/* caller provide generic data related to the event, TBD */
+	ret = (blocking_notifier_call_chain(&group->fault_notifier, 0, (void *)event)
+		== NOTIFY_BAD) ? -EINVAL : 0;
+	iommu_group_put(group);
+
+	return ret;
+}
+EXPORT_SYMBOL(iommu_fault_notifier_call_chain);
+
+/**
  * iommu_group_id - Return ID for a group
  * @group: the group to ID
  *
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 2cdbaa3..fe89e88 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -42,6 +42,7 @@
  * if the IOMMU page table format is equivalent.
  */
 #define IOMMU_PRIV	(1 << 5)
+#define IOMMU_EXEC	(1 << 6)
 
 struct iommu_ops;
 struct iommu_group;
@@ -97,6 +98,36 @@ struct iommu_domain {
 	void *iova_cookie;
 };
 
+/*
+ * Generic fault event notification data, used by all IOMMU architectures.
+ *
+ * - PCI and non-PCI devices
+ * - Recoverable faults (e.g. page request) & un-recoverable faults
+ * - DMA remapping and IRQ remapping faults
+ *
+ * @dev The device which faults are reported by IOMMU
+ * @addr tells the offending address
+ * @pasid contains process address space ID, used in shared virtual memory (SVM)
+ * @prot page access protection flag, e.g. IOMMU_READ, IOMMU_WRITE
+ * @flags contains fault type, etc.
+ * @length tells the size of the buf
+ * @buf contains any raw or arch specific data
+ *
+ */
+struct iommu_fault_event {
+	struct device *dev;
+	__u64 addr;
+	__u32 pasid;
+	__u32 prot;
+	__u32 flags;
+#define IOMMU_FAULT_PAGE_REQ	BIT(0)
+#define IOMMU_FAULT_UNRECOV	BIT(1)
+#define IOMMU_FAULT_IRQ_REMAP	BIT(2)
+#define IOMMU_FAULT_INVAL	BIT(3)
+	__u32 length;
+	__u8  buf[];
+};
+
 enum iommu_cap {
 	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU can enforce cache coherent DMA
 					   transactions */
@@ -341,6 +372,12 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
 					 struct notifier_block *nb);
 extern int iommu_group_unregister_notifier(struct iommu_group *group,
 					   struct notifier_block *nb);
+extern int iommu_register_fault_notifier(struct device *dev,
+					 struct notifier_block *nb);
+extern int iommu_unregister_fault_notifier(struct device *dev,
+					 struct notifier_block *nb);
+extern int iommu_fault_notifier_call_chain(struct iommu_fault_event *event);
+
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
@@ -573,6 +610,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
 	return 0;
 }
 
+static inline int iommu_register_fault_notifier(struct device *dev,
+						  struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int iommu_unregister_fault_notifier(struct device *dev,
+						  struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int iommu_fault_notifier_call_chain(struct iommu_fault_event *event)
+{
+	return 0;
+}
+
 static inline int iommu_group_id(struct iommu_group *group)
 {
 	return -ENODEV;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 6/9] iommu/vt-d: track device with pasid table bond to a guest
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
                   ` (4 preceding siblings ...)
  2017-06-14 22:22 ` [RFC 5/9] iommu: Introduce fault notifier API Jacob Pan
@ 2017-06-14 22:23 ` Jacob Pan
  2017-06-22 22:54   ` Alex Williamson
  2017-06-14 22:23 ` [RFC 7/9] iommu/dmar: notify unrecoverable faults Jacob Pan
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:23 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan

When PASID table pointer of an assigned device is bond to a guest,
the first level page tables are managed by the guest. However, only
host/physical IOMMU can detect fault events, e.g. page requests.
Therefore, we need to keep track of which device has its PASID table
pointer bond to a guest such that page request and other events can
be propagated to the guest as needed.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-iommu.c | 19 +------------------
 include/linux/intel-iommu.h | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6b8e997..9765277 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -416,24 +416,6 @@ struct dmar_domain {
 					   iommu core */
 };
 
-/* PCI domain-device relationship */
-struct device_domain_info {
-	struct list_head link;	/* link to domain siblings */
-	struct list_head global; /* link to global list */
-	u8 bus;			/* PCI bus number */
-	u8 devfn;		/* PCI devfn number */
-	u8 pasid_supported:3;
-	u8 pasid_enabled:1;
-	u8 pri_supported:1;
-	u8 pri_enabled:1;
-	u8 ats_supported:1;
-	u8 ats_enabled:1;
-	u8 ats_qdep;
-	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
-	struct intel_iommu *iommu; /* IOMMU used by this device */
-	struct dmar_domain *domain; /* pointer to domain */
-};
-
 struct dmar_rmrr_unit {
 	struct list_head list;		/* list of rmrr units	*/
 	struct acpi_dmar_header *hdr;	/* ACPI header		*/
@@ -5547,6 +5529,7 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
 				DMA_CCMD_MASK_NOBIT,
 				DMA_CCMD_DEVICE_INVL);
 	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
+	info->pasid_tbl_bond = 1;
 	spin_unlock_irqrestore(&iommu->lock, flags);
 
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 8df6c91..ed39c56 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -434,6 +434,25 @@ struct intel_iommu {
 	u32		flags;      /* Software defined flags */
 };
 
+/* PCI domain-device relationship */
+struct device_domain_info {
+	struct list_head link;	/* link to domain siblings */
+	struct list_head global; /* link to global list */
+	u8 bus;			/* PCI bus number */
+	u8 devfn;		/* PCI devfn number */
+	u8 pasid_supported:3;
+	u8 pasid_enabled:1;
+	u8 pasid_tbl_bond:1;	/* bond to guest PASID table */
+	u8 pri_supported:1;
+	u8 pri_enabled:1;
+	u8 ats_supported:1;
+	u8 ats_enabled:1;
+	u8 ats_qdep;
+	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
+	struct intel_iommu *iommu; /* IOMMU used by this device */
+	struct dmar_domain *domain; /* pointer to domain */
+};
+
 static inline void __iommu_flush_cache(
 	struct intel_iommu *iommu, void *addr, int size)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 7/9] iommu/dmar: notify unrecoverable faults
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
                   ` (5 preceding siblings ...)
  2017-06-14 22:23 ` [RFC 6/9] iommu/vt-d: track device with pasid table bond to a guest Jacob Pan
@ 2017-06-14 22:23 ` Jacob Pan
  2017-06-22 22:54   ` Alex Williamson
  2017-06-14 22:23 ` [RFC 8/9] iommu/intel-svm: notify page request to guest Jacob Pan
  2017-06-14 22:23 ` [RFC 9/9] iommu/intel-svm: replace dev ops with generic fault notifier Jacob Pan
  8 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:23 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan

Currently, when device DMA faults are detected by IOMMU the fault
reasons are printed but the offending device is not notified.
This patch allows device drivers to be optionally notified for fault
conditions when device specific handling is needed for more subtle
processing, e.g. request with PASID transactions.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/dmar.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index cbf7763..2c0b80d464 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1587,11 +1587,43 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
 	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
 }
 
+static int dmar_unrecov_fault_notify(u8 fault_reason, u16 source_id,
+			unsigned long long addr)
+{
+	int ret;
+	struct pci_dev *pdev;
+	struct iommu_fault_event *event;
+
+	pdev = pci_get_bus_and_slot(source_id >> 8, source_id & 0xFF);
+	if (!pdev)
+		return -ENODEV;
+	pr_debug("Notify PCI device fault [%02x:%02x.%d]\n",
+		source_id >> 8, PCI_SLOT(source_id & 0xff),
+		PCI_FUNC(source_id & 0xff));
+	event = kzalloc(sizeof(*event) + sizeof(fault_reason), GFP_KERNEL);
+	if (!event)
+		return -ENOMEM;
+
+	pci_dev_get(pdev);
+	event->dev = &pdev->dev;
+	event->buf[0] = fault_reason;
+	event->addr = addr;
+	event->length = sizeof(fault_reason);
+	event->flags = IOMMU_FAULT_UNRECOV;
+	ret = iommu_fault_notifier_call_chain(event);
+
+	pci_dev_put(pdev);
+	kfree(event);
+
+	return ret;
+}
+
 static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 		u8 fault_reason, u16 source_id, unsigned long long addr)
 {
 	const char *reason;
 	int fault_type;
+	int ret = 0;
 
 	reason = dmar_get_fault_reason(fault_reason, &fault_type);
 
@@ -1600,11 +1632,14 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 			source_id >> 8, PCI_SLOT(source_id & 0xFF),
 			PCI_FUNC(source_id & 0xFF), addr >> 48,
 			fault_reason, reason);
-	else
+	else {
 		pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx [fault reason %02d] %s\n",
 		       type ? "DMA Read" : "DMA Write",
 		       source_id >> 8, PCI_SLOT(source_id & 0xFF),
 		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+		ret = dmar_unrecov_fault_notify(fault_reason, source_id, addr);
+	}
+
 	return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 8/9] iommu/intel-svm: notify page request to guest
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
                   ` (6 preceding siblings ...)
  2017-06-14 22:23 ` [RFC 7/9] iommu/dmar: notify unrecoverable faults Jacob Pan
@ 2017-06-14 22:23 ` Jacob Pan
  2017-06-22 22:53   ` Alex Williamson
  2017-06-14 22:23 ` [RFC 9/9] iommu/intel-svm: replace dev ops with generic fault notifier Jacob Pan
  8 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:23 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan

If the source device of a page request has its PASID table pointer
bond to a guest, the first level page tables are owned by the guest.
In this case, we shall let guest OS to manage page fault.

This patch uses the IOMMU fault notification API to send notifications,
possibly via VFIO, to the guest OS. Once guest pages are fault in, guest
will issue page response which will be passed down via the invalidation
passdown APIs.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-svm.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 23c4276..d1d2d23 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -525,6 +525,80 @@ static bool access_error(struct vm_area_struct *vma, struct page_req_dsc *req)
 	return (requested & ~vma->vm_flags) != 0;
 }
 
+static int prq_to_iommu_prot(struct page_req_dsc *req)
+{
+	int prot = 0;
+
+	if (req->rd_req)
+		prot |= IOMMU_READ;
+	if (req->wr_req)
+		prot |= IOMMU_WRITE;
+	if (req->exe_req)
+		prot |= IOMMU_EXEC;
+	if (req->priv_req)
+		prot |= IOMMU_PRIV;
+
+	return prot;
+}
+
+static int intel_svm_prq_notify(struct device *dev, struct page_req_dsc *desc)
+{
+	int ret = 0;
+	struct iommu_fault_event *event;
+	struct pci_dev *pdev;
+	struct device_domain_info *info;
+	unsigned long buf_offset;
+
+	/**
+	 * If caller does not provide struct device, this is the case where
+	 * guest PASID table is bond to the device. So we need to retrieve
+	 * struct device from the page request deescriptor then proceed.
+	 */
+	if (!dev) {
+		pdev = pci_get_bus_and_slot(desc->bus, desc->devfn);
+		if (!pdev) {
+			pr_err("No PCI device found for PRQ [%02x:%02x.%d]\n",
+				desc->bus, PCI_SLOT(desc->devfn),
+				PCI_FUNC(desc->devfn));
+			return -ENODEV;
+		}
+		/**
+		 * Make sure PASID table pointer is bond to guest, if yes notify
+		 * handler in the guest, e.g. via VFIO.
+		 */
+		info = pdev->dev.archdata.iommu;
+		if (!info || !info->pasid_tbl_bond) {
+			pr_debug("PRQ device pasid table not bond.\n");
+			return -EINVAL;
+		}
+		dev = &pdev->dev;
+	}
+
+	pr_debug("Notify PRQ device [%02x:%02x.%d]\n",
+		desc->bus, PCI_SLOT(desc->devfn),
+		PCI_FUNC(desc->devfn));
+	event = kzalloc(sizeof(*event) + sizeof(*desc), GFP_KERNEL);
+	if (!event)
+		return -ENOMEM;
+
+	get_device(dev);
+	/* Fill in event data for device specific processing */
+	event->dev = dev;
+	buf_offset = offsetofend(struct iommu_fault_event, length);
+	memcpy(buf_offset + event, desc, sizeof(*desc));
+	event->addr = desc->addr;
+	event->pasid = desc->pasid;
+	event->prot = prq_to_iommu_prot(desc);
+	event->length = sizeof(*desc);
+	event->flags = IOMMU_FAULT_PAGE_REQ;
+
+	ret = iommu_fault_notifier_call_chain(event);
+	put_device(dev);
+	kfree(event);
+
+	return ret;
+}
+
 static irqreturn_t prq_event_thread(int irq, void *d)
 {
 	struct intel_iommu *iommu = d;
@@ -548,7 +622,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		handled = 1;
 
 		req = &iommu->prq[head / sizeof(*req)];
-
+		/**
+		 * If prq is to be handled outside iommu driver via receiver of
+		 * the fault notifiers, we skip the page response here.
+		 */
+		if (!intel_svm_prq_notify(NULL, req))
+			continue;
 		result = QI_RESP_FAILURE;
 		address = (u64)req->addr << VTD_PAGE_SHIFT;
 		if (!req->pasid_present) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC 9/9] iommu/intel-svm: replace dev ops with generic fault notifier
  2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
                   ` (7 preceding siblings ...)
  2017-06-14 22:23 ` [RFC 8/9] iommu/intel-svm: notify page request to guest Jacob Pan
@ 2017-06-14 22:23 ` Jacob Pan
  8 siblings, 0 replies; 30+ messages in thread
From: Jacob Pan @ 2017-06-14 22:23 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Alex Williamson,
	Jean Delvare, Jacob Pan

Intel SVM devices register callbacks when bind task and PASID. These fault
callbacks are optional which can be replaced by the generic IOMMU fault
notifier APIs.

Currently, there is no main line kernel user of these callback functions.
Fault event data delivered by the IOMMU notification API is both generic
and contains raw data for architectural specific uses.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-svm.c | 13 ++-----------
 include/linux/intel-svm.h | 20 +++-----------------
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index d1d2d23..112ec558 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -291,7 +291,7 @@ static const struct mmu_notifier_ops intel_mmuops = {
 
 static DEFINE_MUTEX(pasid_mutex);
 
-int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
+int intel_svm_bind_mm(struct device *dev, int *pasid, int flags)
 {
 	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
 	struct intel_svm_dev *sdev;
@@ -337,10 +337,6 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 
 			list_for_each_entry(sdev, &svm->devs, list) {
 				if (dev == sdev->dev) {
-					if (sdev->ops != ops) {
-						ret = -EBUSY;
-						goto out;
-					}
 					sdev->users++;
 					goto success;
 				}
@@ -366,7 +362,6 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 	}
 	/* Finish the setup now we know we're keeping it */
 	sdev->users = 1;
-	sdev->ops = ops;
 	init_rcu_head(&sdev->rcu);
 
 	if (!svm) {
@@ -693,11 +688,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		if (WARN_ON(&sdev->list == &svm->devs))
 			sdev = NULL;
 
-		if (sdev && sdev->ops && sdev->ops->fault_cb) {
-			int rwxp = (req->rd_req << 3) | (req->wr_req << 2) |
-				(req->exe_req << 1) | (req->priv_req);
-			sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr, req->private, rwxp, result);
-		}
+		intel_svm_prq_notify(sdev->dev, req);
 		/* We get here in the error case where the PASID lookup failed,
 		   and these can be NULL. Do not use them below this point! */
 		sdev = NULL;
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index 3c25794..67e056d3 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -18,18 +18,6 @@
 
 struct device;
 
-struct svm_dev_ops {
-	void (*fault_cb)(struct device *dev, int pasid, u64 address,
-			 u32 private, int rwxp, int response);
-};
-
-/* Values for rxwp in fault_cb callback */
-#define SVM_REQ_READ	(1<<3)
-#define SVM_REQ_WRITE	(1<<2)
-#define SVM_REQ_EXEC	(1<<1)
-#define SVM_REQ_PRIV	(1<<0)
-
-
 /*
  * The SVM_FLAG_PRIVATE_PASID flag requests a PASID which is *not* the "main"
  * PASID for the current process. Even if a PASID already exists, a new one
@@ -60,7 +48,6 @@ struct svm_dev_ops {
  * @dev:	Device to be granted acccess
  * @pasid:	Address for allocated PASID
  * @flags:	Flags. Later for requesting supervisor mode, etc.
- * @ops:	Callbacks to device driver
  *
  * This function attempts to enable PASID support for the given device.
  * If the @pasid argument is non-%NULL, a PASID is allocated for access
@@ -82,8 +69,7 @@ struct svm_dev_ops {
  * Multiple calls from the same process may result in the same PASID
  * being re-used. A reference count is kept.
  */
-extern int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
-			     struct svm_dev_ops *ops);
+extern int intel_svm_bind_mm(struct device *dev, int *pasid, int flags);
 
 /**
  * intel_svm_unbind_mm() - Unbind a specified PASID
@@ -105,7 +91,7 @@ extern int intel_svm_unbind_mm(struct device *dev, int pasid);
 #else /* CONFIG_INTEL_IOMMU_SVM */
 
 static inline int intel_svm_bind_mm(struct device *dev, int *pasid,
-				    int flags, struct svm_dev_ops *ops)
+				int flags)
 {
 	return -ENOSYS;
 }
@@ -116,6 +102,6 @@ static inline int intel_svm_unbind_mm(struct device *dev, int pasid)
 }
 #endif /* CONFIG_INTEL_IOMMU_SVM */
 
-#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0, NULL))
+#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0))
 
 #endif /* __INTEL_SVM_H__ */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC 1/9] iommu: Introduce bind_pasid_table API function
  2017-06-14 22:22 ` [RFC 1/9] iommu: Introduce bind_pasid_table API function Jacob Pan
@ 2017-06-22 22:52   ` Alex Williamson
  2017-06-23 18:20     ` Jacob Pan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2017-06-22 22:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, Liu Yi L

On Wed, 14 Jun 2017 15:22:55 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM) use
> case in the guest:
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> 
> As part of the proposed architecture, when a SVM capable PCI
> device is assigned to a guest, nested mode is turned on. Guest owns the
> first level page tables (request with PASID) and performs GVA->GPA
> translation. Second level page tables are owned by the host for GPA->HPA
> translation for both request with and without PASID.
> 
> A new IOMMU driver interface is therefore needed to perform tasks as
> follows:
> * Enable nested translation and appropriate translation type
> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> 
> This patch introduces new functions called iommu_(un)bind_pasid_table()
> to IOMMU APIs. Architecture specific IOMMU function can be added later
> to perform the specific steps for binding pasid table of assigned devices.
> 
> This patch also adds model definition in iommu.h. It would be used to
> check if the bind request is from a compatible entity. e.g. a bind
> request from an intel_iommu emulator may not be supported by an ARM SMMU
> driver.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/iommu.c      | 19 +++++++++++++++++++
>  include/linux/iommu.h      | 23 +++++++++++++++++++++++
>  include/uapi/linux/iommu.h | 33 +++++++++++++++++++++++++++++++++
>  3 files changed, 75 insertions(+)
>  create mode 100644 include/uapi/linux/iommu.h
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index cf7ca7e..494309b 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1328,6 +1328,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_info *pasidt_binfo)
> +{
> +	if (unlikely(!domain->ops->bind_pasid_table))
> +		return -EINVAL;
> +
> +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
> +}
> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> +
> +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +	if (unlikely(!domain->ops->unbind_pasid_table))
> +		return -EINVAL;
> +
> +	return domain->ops->unbind_pasid_table(domain, dev);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 2cb54ad..7122add 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -25,6 +25,7 @@
>  #include <linux/errno.h>
>  #include <linux/err.h>
>  #include <linux/of.h>
> +#include <uapi/linux/iommu.h>
>  
>  #define IOMMU_READ	(1 << 0)
>  #define IOMMU_WRITE	(1 << 1)
> @@ -183,6 +184,8 @@ struct iommu_resv_region {
>   * @domain_get_windows: Return the number of windows for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @bind_pasid_table: bind pasid table pointer for guest SVM
> + * @unbind_pasid_table: unbind pasid table pointer and restore defaults
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -225,6 +228,10 @@ struct iommu_ops {
>  	u32 (*domain_get_windows)(struct iommu_domain *domain);
>  
>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
> +	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
> +				struct pasid_table_info *pasidt_binfo);
> +	int (*unbind_pasid_table)(struct iommu_domain *domain,
> +				struct device *dev);
>  
>  	unsigned long pgsize_bitmap;
>  };
> @@ -282,6 +289,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>  			       struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>  				struct device *dev);
> +extern int iommu_bind_pasid_table(struct iommu_domain *domain,
> +		struct device *dev, struct pasid_table_info *pasidt_binfo);
> +extern int iommu_unbind_pasid_table(struct iommu_domain *domain,
> +				struct device *dev);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>  		     phys_addr_t paddr, size_t size, int prot);
> @@ -637,6 +648,18 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>  	return NULL;
>  }
>  
> +static inline
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_info *pasidt_binfo)
> +{
> +	return -EINVAL;
> +}
> +static inline
> +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +	return -EINVAL;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #endif /* __LINUX_IOMMU_H */
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> new file mode 100644
> index 0000000..5ef0e7c
> --- /dev/null
> +++ b/include/uapi/linux/iommu.h
> @@ -0,0 +1,33 @@
> +/*
> + * IOMMU user API definitions
> + *
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef _UAPI_IOMMU_H
> +#define _UAPI_IOMMU_H
> +
> +/**
> + * PASID table data used to bind guest PASID table to the host IOMMU. This will
> + * enable guest managed first level page tables.
> + * @ptr		PASID table pointer in GPA

GPA?  I'm confused how the host physical IOMMU needs the guest physical
address of the PASID table here.  The first level translation does
GVA to GPA lookup, but doesn't that translation still come from a
table that the IOMMU references via a host physical address?  If not,
is the address space of this pointer implementation/architecture
specific?  In general I think it's good policy to not make the interface
specific to the VM use case. After all, vfio is a userspace driver
interface and device assignment is just one use case.  

> + * @size	size of the guest PASID table, must be <= host table size

Presumably in bytes, best to say so.

> + * @model	magic number tells vendor apart
> + * @length	length of the opaque data

Also in bytes.

> + * @opaque	architecture specific IOMMU data

s/architecture/model/?

> + */
> +struct pasid_table_info {
> +	__u64	ptr;
> +	__u64	size;
> +	__u32	model;
> +#define INTEL_IOMMU	(1 << 0)
> +#define ARM_SMMU	(1 << 1)

Why are we using this as a bit field rather than an enum?  Does it make
sense for model to be (INTEL_IOMMU|ARM_SMMU)?

> +	__u32	length;
> +	__u8	opaque[];
> +};
> +
> +
> +#endif /* _UAPI_IOMMU_H */

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 2/9] iommu/vt-d: add bind_pasid_table function
  2017-06-14 22:22 ` [RFC 2/9] iommu/vt-d: add bind_pasid_table function Jacob Pan
@ 2017-06-22 22:52   ` Alex Williamson
  2017-06-23 18:19     ` Jacob Pan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2017-06-22 22:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, Yi L

On Wed, 14 Jun 2017 15:22:56 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> Add Intel VT-d ops to the generic iommu_bind_pasid_table API
> functions.
> 
> The primary use case is for direct assignment of SVM capable
> device. Originated from emulated IOMMU in the guest, the request goes
> through many layers (e.g. VFIO). Upon calling host IOMMU driver, caller
> passes guest PASID table pointer (GPA) and size.
> 
> Device context table entry is modified by Intel IOMMU specific
> bind_pasid_table function. This will turn on nesting mode and matching
> translation type.
> 
> The unbind operation restores default context mapping.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/intel-iommu.c   | 109 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dma_remapping.h |   1 +
>  2 files changed, 110 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index fc2765c..1d5d9ab9 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5430,6 +5430,111 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
>  
>  	return iommu;
>  }
> +
> +static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
> +		struct device *dev, struct pasid_table_info *pasidt_binfo)
> +{
> +	struct intel_iommu *iommu;
> +	struct context_entry *context;
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	struct pci_dev *pdev;
> +	u8 bus, devfn;
> +	u16 did, *sid;
> +	int ret = 0;
> +	unsigned long flags;
> +	u64 ctx_lo;
> +
> +	if (pasidt_binfo == NULL || pasidt_binfo->model != INTEL_IOMMU) {

Clearly model cannot be used as a bit field so it would appear wrong to
limit ourselves to 32 possible models by using it that way.

> +		pr_warn("%s: Invalid bind request!\n", __func__);

Let the callers deal with error reporting.

> +		return -EINVAL;
> +	}
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	sid = (u16 *)&pasidt_binfo->opaque;

Failed to check length.


> +	/*
> +	 * check SID, if it is not correct, return success to allow looping
> +	 * through all devices within a group
> +	 */
> +	if (PCI_DEVID(bus, devfn) != *sid)
> +		return 0;
> +
> +	pdev = to_pci_dev(dev);

Better test dev_is_pci() first!

> +	info = dev->archdata.iommu;
> +	if (!info || !info->pasid_supported) {
> +		pr_err("PCI %04x:%02x:%02x.%d: has no PASID support\n",
> +			       pci_domain_nr(pdev->bus), bus, PCI_SLOT(devfn),
> +			       PCI_FUNC(devfn));
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (pasidt_binfo->size > intel_iommu_get_pts(iommu)) {
> +		pr_err("Invalid gPASID table size %llu, host size %lu\n",
> +			pasidt_binfo->size,
> +			intel_iommu_get_pts(iommu));
> +		ret = -EINVAL;
> +		goto out;
> +	}

Different errnos here would be more useful to code that handles the
return than these pr_err()s.

> +	spin_lock_irqsave(&iommu->lock, flags);
> +	context = iommu_context_addr(iommu, bus, devfn, 0);
> +	if (!context || !context_present(context)) {
> +		pr_warn("%s: ctx not present for bus devfn %x:%x\n",
> +			__func__, bus, devfn);
> +		spin_unlock_irqrestore(&iommu->lock, flags);
> +		goto out;

Return success?!

> +	}
> +	/* Anticipate guest to use SVM and owns the first level */
> +	ctx_lo = context[0].lo;
> +	ctx_lo |= CONTEXT_NESTE;
> +	ctx_lo |= CONTEXT_PRS;
> +	ctx_lo |= CONTEXT_PASIDE;
> +	ctx_lo &= ~CONTEXT_TT_MASK;
> +	ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
> +	context[0].lo = ctx_lo;
> +
> +	/* Assign guest PASID table pointer and size */
> +	ctx_lo = (pasidt_binfo->ptr & VTD_PAGE_MASK) | pasidt_binfo->size;
> +	context[1].lo = ctx_lo;
> +	/* make sure context entry is updated before flushing */
> +	wmb();
> +	did = dmar_domain->iommu_did[iommu->seq_id];
> +	iommu->flush.flush_context(iommu, did,
> +				(((u16)bus) << 8) | devfn,
> +				DMA_CCMD_MASK_NOBIT,
> +				DMA_CCMD_DEVICE_INVL);
> +	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> +	spin_unlock_irqrestore(&iommu->lock, flags);
> +
> +
> +out:
> +	return ret;
> +}
> +
> +static int intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
> +					struct device *dev)
> +{
> +	struct intel_iommu *iommu;
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	u8 bus, devfn;
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +	/*
> +	 * REVISIT: we might want to clear the PASID table pointer
> +	 * as part of context clear operation. Currently, it leaves
> +	 * stale data but should be ignored by hardware since PASIDE
> +	 * is clear.
> +	 */
> +	/* ATS will be reenabled when remapping is restored */
> +	pci_disable_ats(to_pci_dev(dev));

dev_is_pci()?

> +	domain_context_clear(iommu, dev);
> +	return domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
> +}
>  #endif /* CONFIG_INTEL_IOMMU_SVM */
>  
>  const struct iommu_ops intel_iommu_ops = {
> @@ -5438,6 +5543,10 @@ const struct iommu_ops intel_iommu_ops = {
>  	.domain_free		= intel_iommu_domain_free,
>  	.attach_dev		= intel_iommu_attach_device,
>  	.detach_dev		= intel_iommu_detach_device,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +	.bind_pasid_table	= intel_iommu_bind_pasid_table,
> +	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
> +#endif
>  	.map			= intel_iommu_map,
>  	.unmap			= intel_iommu_unmap,
>  	.map_sg			= default_iommu_map_sg,
> diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
> index 9088407..85367b7 100644
> --- a/include/linux/dma_remapping.h
> +++ b/include/linux/dma_remapping.h
> @@ -27,6 +27,7 @@
>  
>  #define CONTEXT_DINVE		(1ULL << 8)
>  #define CONTEXT_PRS		(1ULL << 9)
> +#define CONTEXT_NESTE		(1ULL << 10)
>  #define CONTEXT_PASIDE		(1ULL << 11)
>  
>  struct intel_iommu;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 3/9] iommu: Introduce iommu do invalidate API function
  2017-06-14 22:22 ` [RFC 3/9] iommu: Introduce iommu do invalidate API function Jacob Pan
@ 2017-06-22 22:52   ` Alex Williamson
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Williamson @ 2017-06-22 22:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, Liu, Yi L

On Wed, 14 Jun 2017 15:22:57 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
> 
> When a SVM capable device is assigned to a guest, the first level page
> tables are owned by the guest and the guest PASID table pointer is
> linked to the device context entry of the physical IOMMU.
> 
> Host IOMMU driver has no knowledge of caching structure updates unless
> the guest invalidation activities are passed down to the host. The
> primary usage is derived from emulated IOMMU in the guest, where QEMU
> can trap invalidation activities before pass them down the
> host/physical IOMMU. There are IOMMU architectural specific actions
> need to be taken which requires the generic APIs introduced in this
> patch to have opaque data in the tlb_invalidate_info argument.
> 
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/iommu.c      | 13 +++++++++++++
>  include/linux/iommu.h      | 11 +++++++++++
>  include/uapi/linux/iommu.h |  4 ++++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 494309b..c786370 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1347,6 +1347,19 @@ int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
>  
> +int iommu_do_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct tlb_invalidate_info *inv_info)
> +{
> +	int ret = 0;
> +
> +	if (unlikely(domain->ops->do_invalidate == NULL))
> +		return -ENODEV;
> +
> +	ret = domain->ops->do_invalidate(domain, dev, inv_info);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_do_invalidate);


s/do_invalidate/invalidate/?  'do_invalidate' seems like something
'invalidate' might call, not the exported interface.  Invalidate what?
Is that entirely buried in the opaque data?  It seems like there are
plenty of things to be invalidated that could be common.


> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 7122add..2cdbaa3 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -232,6 +232,8 @@ struct iommu_ops {
>  				struct pasid_table_info *pasidt_binfo);
>  	int (*unbind_pasid_table)(struct iommu_domain *domain,
>  				struct device *dev);
> +	int (*do_invalidate)(struct iommu_domain *domain,
> +		struct device *dev, struct tlb_invalidate_info *inv_info);
>  
>  	unsigned long pgsize_bitmap;
>  };
> @@ -293,6 +295,9 @@ extern int iommu_bind_pasid_table(struct iommu_domain *domain,
>  		struct device *dev, struct pasid_table_info *pasidt_binfo);
>  extern int iommu_unbind_pasid_table(struct iommu_domain *domain,
>  				struct device *dev);
> +extern int iommu_do_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct tlb_invalidate_info *inv_info);
> +
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>  		     phys_addr_t paddr, size_t size, int prot);
> @@ -660,6 +665,12 @@ int iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
>  	return -EINVAL;
>  }
>  
> +static inline int iommu_do_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct tlb_invalidate_info *inv_info)
> +{
> +	return -EINVAL;
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #endif /* __LINUX_IOMMU_H */
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 5ef0e7c..fcee18f 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -29,5 +29,9 @@ struct pasid_table_info {
>  	__u8	opaque[];
>  };
>  
> +struct tlb_invalidate_info {
> +	__u32	model;
> +	__u8	opaque[];

Why did we specify a length (and then not use it anyway) of opaque data
field in pasid_table_info but not here?

> +};
>  
>  #endif /* _UAPI_IOMMU_H */

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 4/9] iommu/vt-d: Add iommu do invalidate function
  2017-06-14 22:22 ` [RFC 4/9] iommu/vt-d: Add iommu do invalidate function Jacob Pan
@ 2017-06-22 22:52   ` Alex Williamson
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Williamson @ 2017-06-22 22:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, Yi L

On Wed, 14 Jun 2017 15:22:58 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> This patch adds Intel VT-d specific function to implement
> iommu_do_invalidate API.
> 
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
> 
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submit to the
> invalidation queue.
> 
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h | 11 ++++++++++-
>  2 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 1d5d9ab9..6b8e997 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5127,6 +5127,46 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
>  	dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
>  }
>  
> +static int intel_iommu_do_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct tlb_invalidate_info *inv_info)
> +{
> +	struct intel_iommu *iommu;
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct intel_invalidate_data *inv_data;
> +	struct qi_desc *qi;
> +	u16 did;
> +	u8 bus, devfn;
> +
> +	if (!inv_info || !dmar_domain || (inv_info->model != INTEL_IOMMU))
> +		return -EINVAL;
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	inv_data = (struct intel_invalidate_data *)&inv_info->opaque;
> +
> +	/* check SID */

dev_is_pci()!

> +	if (PCI_DEVID(bus, devfn) != inv_data->sid)
> +		return 0;
> +
> +	qi = &inv_data->inv_desc;
> +
> +	switch (qi->low & QI_TYPE_MASK) {
> +	case QI_DIOTLB_TYPE:
> +	case QI_DEIOTLB_TYPE:
> +		/* for device IOTLB, we just let it pass through */
> +		break;
> +	default:
> +		did = dmar_domain->iommu_did[iommu->seq_id];
> +		qi->low &= ~QI_DID_MASK;
> +		qi->low |= QI_DID(did);
> +		break;
> +	}
> +
> +	return qi_submit_sync(qi, iommu);
> +}
> +
>  static int intel_iommu_map(struct iommu_domain *domain,
>  			   unsigned long iova, phys_addr_t hpa,
>  			   size_t size, int iommu_prot)
> @@ -5546,6 +5586,7 @@ const struct iommu_ops intel_iommu_ops = {
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  	.bind_pasid_table	= intel_iommu_bind_pasid_table,
>  	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
> +	.do_invalidate		= intel_iommu_do_invalidate,
>  #endif
>  	.map			= intel_iommu_map,
>  	.unmap			= intel_iommu_unmap,
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 485a5b4..8df6c91 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -31,7 +31,6 @@
>  #include <linux/list.h>
>  #include <linux/iommu.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
> -
>  #include <asm/cacheflush.h>
>  #include <asm/iommu.h>
>  
> @@ -258,6 +257,10 @@ enum {
>  #define QI_PGRP_RESP_TYPE	0x9
>  #define QI_PSTRM_RESP_TYPE	0xa
>  
> +#define QI_DID(did)		(((u64)did & 0xffff) << 16)
> +#define QI_DID_MASK		GENMASK(31, 16)
> +#define QI_TYPE_MASK		GENMASK(3, 0)
> +
>  #define QI_IEC_SELECTIVE	(((u64)1) << 4)
>  #define QI_IEC_IIDEX(idx)	(((u64)(idx & 0xffff) << 32))
>  #define QI_IEC_IM(m)		(((u64)(m & 0x1f) << 27))
> @@ -489,6 +492,12 @@ extern int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct intel_svm_
>  extern struct intel_iommu *intel_svm_device_to_iommu(struct device *dev);
>  #endif
>  
> +struct intel_invalidate_data {
> +	u16 sid;
> +	u32 pasid;
> +	struct qi_desc inv_desc;
> +};
> +

If userspace is ever going to construct this to pass it through vfio,
it'll need to be defined in UAPI.

>  extern const struct attribute_group *intel_iommu_groups[];
>  
>  #endif

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 5/9] iommu: Introduce fault notifier API
  2017-06-14 22:22 ` [RFC 5/9] iommu: Introduce fault notifier API Jacob Pan
@ 2017-06-22 22:53   ` Alex Williamson
  2017-06-23 18:59     ` Jacob Pan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2017-06-22 22:53 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare

On Wed, 14 Jun 2017 15:22:59 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> Traditionally, device specific faults are detected and handled within
> their own device drivers. When IOMMU is enabled, faults such as DMA
> related transactions are detected by IOMMU. There is no generic
> reporting mechanism to report faults back to the in-kernel device
> driver or the guest OS in case of assigned devices.
> 
> Faults detected by IOMMU is based on the transaction's source ID which
> can be reported at per device basis, regardless of the device type is a
> PCI device or not.
> 
> The fault types includes recoverable (e.g. page request) and
> unrecoverable faults(e.g. invalid context). In most cases, faults can be
> handled by IOMMU drivers. However, there are use cases that require
> fault processing outside IOMMU driver, e.g.
> 
> 1. page request fault originated from an SVM capable device that is
> assigned to guest via vIOMMU. In this case, the first level page tables
> are owned by the guest. Page request must be propagated to the guest to
> let guest OS fault in the pages then send page response. In this
> mechanism, the direct receiver of IOMMU fault notification is VFIO,
> which can relay notification events to QEMU or other user space
> software.
> 
> 2. faults need more subtle handling by device drivers. Other than
> simply invoke reset function, there are needs to let device driver
> handle the fault with a smaller impact.
> 
> This patchset is intended to create a generic fault notification API such
> that it can scale as follows:
> - all IOMMU types
> - PCI and non-PCI devices
> - recoverable and unrecoverable faults
> - VFIO and other other in kernel users
> - DMA & IRQ remapping (TBD)
> 
> The event data contains both generic and raw architectural data
> such that performance is not compromised as the data propagation may
> involve many layers.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/iommu.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/iommu.h | 54 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 117 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index c786370..04c73f3 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -48,6 +48,7 @@ struct iommu_group {
>  	struct list_head devices;
>  	struct mutex mutex;
>  	struct blocking_notifier_head notifier;
> +	struct blocking_notifier_head fault_notifier;
>  	void *iommu_data;
>  	void (*iommu_data_release)(void *iommu_data);
>  	char *name;
> @@ -345,6 +346,7 @@ struct iommu_group *iommu_group_alloc(void)
>  	mutex_init(&group->mutex);
>  	INIT_LIST_HEAD(&group->devices);
>  	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
> +	BLOCKING_INIT_NOTIFIER_HEAD(&group->fault_notifier);
>  
>  	ret = ida_simple_get(&iommu_group_ida, 0, 0, GFP_KERNEL);
>  	if (ret < 0) {
> @@ -790,6 +792,67 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>  
>  /**
> + * iommu_register_fault_notifier - Register a notifier for fault reporting
> + * @dev: device to notify fault events
> + * @nb: notifier block to signal
> + *
> + */
> +int iommu_register_fault_notifier(struct device *dev,
> +				struct notifier_block *nb)
> +{
> +	int ret;
> +	struct iommu_group *group = iommu_group_get(dev);
> +
> +	if (!group)
> +		return -EINVAL;
> +
> +	ret = blocking_notifier_chain_register(&group->fault_notifier, nb);
> +	iommu_group_put(group);
> +
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_fault_notifier);
> +
> +/**
> + * iommu_unregister_fault_notifier - Unregister a notifier for fault reporting
> + * @domain: the domain to watch
> + * @nb: notifier block to signal
> + *
> + */
> +int iommu_unregister_fault_notifier(struct device *dev,
> +				  struct notifier_block *nb)
> +{
> +	int ret;
> +	struct iommu_group *group = iommu_group_get(dev);
> +
> +	if (!group)
> +		return -EINVAL;
> +
> +	ret = blocking_notifier_chain_unregister(&group->fault_notifier, nb);
> +	iommu_group_put(group);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_fault_notifier);


If the call chains are on the group, why do we register with a device?
If I registered with a group, I'd know that I only need to register
once per group, that's not clear here and may lead to callers of this
getting multiple notifications, one for each device in a group.

> +
> +int iommu_fault_notifier_call_chain(struct iommu_fault_event *event)
> +{
> +	int ret;
> +	struct iommu_group *group = iommu_group_get(event->dev);
> +
> +	if (!group)
> +		return -EINVAL;
> +	/* caller provide generic data related to the event, TBD */
> +	ret = (blocking_notifier_call_chain(&group->fault_notifier, 0, (void *)event)
> +		== NOTIFY_BAD) ? -EINVAL : 0;
> +	iommu_group_put(group);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(iommu_fault_notifier_call_chain);
> +
> +/**
>   * iommu_group_id - Return ID for a group
>   * @group: the group to ID
>   *
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 2cdbaa3..fe89e88 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -42,6 +42,7 @@
>   * if the IOMMU page table format is equivalent.
>   */
>  #define IOMMU_PRIV	(1 << 5)
> +#define IOMMU_EXEC	(1 << 6)

Irrelevant change?

>  
>  struct iommu_ops;
>  struct iommu_group;
> @@ -97,6 +98,36 @@ struct iommu_domain {
>  	void *iova_cookie;
>  };
>  
> +/*
> + * Generic fault event notification data, used by all IOMMU architectures.
> + *
> + * - PCI and non-PCI devices
> + * - Recoverable faults (e.g. page request) & un-recoverable faults
> + * - DMA remapping and IRQ remapping faults
> + *
> + * @dev The device which faults are reported by IOMMU
> + * @addr tells the offending address
> + * @pasid contains process address space ID, used in shared virtual memory (SVM)
> + * @prot page access protection flag, e.g. IOMMU_READ, IOMMU_WRITE
> + * @flags contains fault type, etc.
> + * @length tells the size of the buf
> + * @buf contains any raw or arch specific data
> + *
> + */
> +struct iommu_fault_event {
> +	struct device *dev;
> +	__u64 addr;
> +	__u32 pasid;
> +	__u32 prot;
> +	__u32 flags;
> +#define IOMMU_FAULT_PAGE_REQ	BIT(0)
> +#define IOMMU_FAULT_UNRECOV	BIT(1)
> +#define IOMMU_FAULT_IRQ_REMAP	BIT(2)
> +#define IOMMU_FAULT_INVAL	BIT(3)

Details of each of these are defined where?

> +	__u32 length;
> +	__u8  buf[];
> +};

This is not UAPI, so I'll be curious how vfio is supposed to expose this
to userspace.

> +
>  enum iommu_cap {
>  	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU can enforce cache coherent DMA
>  					   transactions */
> @@ -341,6 +372,12 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
>  					 struct notifier_block *nb);
>  extern int iommu_group_unregister_notifier(struct iommu_group *group,
>  					   struct notifier_block *nb);
> +extern int iommu_register_fault_notifier(struct device *dev,
> +					 struct notifier_block *nb);
> +extern int iommu_unregister_fault_notifier(struct device *dev,
> +					 struct notifier_block *nb);
> +extern int iommu_fault_notifier_call_chain(struct iommu_fault_event *event);
> +
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -573,6 +610,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
>  	return 0;
>  }
>  
> +static inline int iommu_register_fault_notifier(struct device *dev,
> +						  struct notifier_block *nb)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_unregister_fault_notifier(struct device *dev,
> +						  struct notifier_block *nb)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_fault_notifier_call_chain(struct iommu_fault_event *event)
> +{
> +	return 0;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 8/9] iommu/intel-svm: notify page request to guest
  2017-06-14 22:23 ` [RFC 8/9] iommu/intel-svm: notify page request to guest Jacob Pan
@ 2017-06-22 22:53   ` Alex Williamson
  2017-06-23 20:16     ` Jacob Pan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2017-06-22 22:53 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare

On Wed, 14 Jun 2017 15:23:02 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> If the source device of a page request has its PASID table pointer
> bond to a guest, the first level page tables are owned by the guest.
> In this case, we shall let guest OS to manage page fault.
> 
> This patch uses the IOMMU fault notification API to send notifications,
> possibly via VFIO, to the guest OS. Once guest pages are fault in, guest
> will issue page response which will be passed down via the invalidation
> passdown APIs.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/intel-svm.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 80 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 23c4276..d1d2d23 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -525,6 +525,80 @@ static bool access_error(struct vm_area_struct *vma, struct page_req_dsc *req)
>  	return (requested & ~vma->vm_flags) != 0;
>  }
>  
> +static int prq_to_iommu_prot(struct page_req_dsc *req)
> +{
> +	int prot = 0;
> +
> +	if (req->rd_req)
> +		prot |= IOMMU_READ;
> +	if (req->wr_req)
> +		prot |= IOMMU_WRITE;
> +	if (req->exe_req)
> +		prot |= IOMMU_EXEC;
> +	if (req->priv_req)
> +		prot |= IOMMU_PRIV;
> +
> +	return prot;
> +}
> +
> +static int intel_svm_prq_notify(struct device *dev, struct page_req_dsc *desc)
> +{
> +	int ret = 0;
> +	struct iommu_fault_event *event;
> +	struct pci_dev *pdev;
> +	struct device_domain_info *info;
> +	unsigned long buf_offset;
> +
> +	/**
> +	 * If caller does not provide struct device, this is the case where
> +	 * guest PASID table is bond to the device. So we need to retrieve
> +	 * struct device from the page request deescriptor then proceed.
> +	 */
> +	if (!dev) {
> +		pdev = pci_get_bus_and_slot(desc->bus, desc->devfn);
> +		if (!pdev) {
> +			pr_err("No PCI device found for PRQ [%02x:%02x.%d]\n",
> +				desc->bus, PCI_SLOT(desc->devfn),
> +				PCI_FUNC(desc->devfn));
> +			return -ENODEV;
> +		}
> +		/**
> +		 * Make sure PASID table pointer is bond to guest, if yes notify
> +		 * handler in the guest, e.g. via VFIO.
> +		 */
> +		info = pdev->dev.archdata.iommu;
> +		if (!info || !info->pasid_tbl_bond) {
> +			pr_debug("PRQ device pasid table not bond.\n");

I can "bond" two things together, they are then "bound".

> +			return -EINVAL;
> +		}
> +		dev = &pdev->dev;

Leaks pdev reference.  Both normal and error path.

> +	}
> +
> +	pr_debug("Notify PRQ device [%02x:%02x.%d]\n",
> +		desc->bus, PCI_SLOT(desc->devfn),
> +		PCI_FUNC(desc->devfn));
> +	event = kzalloc(sizeof(*event) + sizeof(*desc), GFP_KERNEL);
> +	if (!event)
> +		return -ENOMEM;
> +
> +	get_device(dev);
> +	/* Fill in event data for device specific processing */
> +	event->dev = dev;
> +	buf_offset = offsetofend(struct iommu_fault_event, length);
> +	memcpy(buf_offset + event, desc, sizeof(*desc));
> +	event->addr = desc->addr;
> +	event->pasid = desc->pasid;
> +	event->prot = prq_to_iommu_prot(desc);
> +	event->length = sizeof(*desc);
> +	event->flags = IOMMU_FAULT_PAGE_REQ;
> +
> +	ret = iommu_fault_notifier_call_chain(event);
> +	put_device(dev);
> +	kfree(event);
> +
> +	return ret;
> +}
> +
>  static irqreturn_t prq_event_thread(int irq, void *d)
>  {
>  	struct intel_iommu *iommu = d;
> @@ -548,7 +622,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>  		handled = 1;
>  
>  		req = &iommu->prq[head / sizeof(*req)];
> -
> +		/**
> +		 * If prq is to be handled outside iommu driver via receiver of
> +		 * the fault notifiers, we skip the page response here.
> +		 */
> +		if (!intel_svm_prq_notify(NULL, req))
> +			continue;
>  		result = QI_RESP_FAILURE;
>  		address = (u64)req->addr << VTD_PAGE_SHIFT;
>  		if (!req->pasid_present) {

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 7/9] iommu/dmar: notify unrecoverable faults
  2017-06-14 22:23 ` [RFC 7/9] iommu/dmar: notify unrecoverable faults Jacob Pan
@ 2017-06-22 22:54   ` Alex Williamson
  2017-06-23 20:19     ` Jacob Pan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2017-06-22 22:54 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare

On Wed, 14 Jun 2017 15:23:01 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> Currently, when device DMA faults are detected by IOMMU the fault
> reasons are printed but the offending device is not notified.
> This patch allows device drivers to be optionally notified for fault
> conditions when device specific handling is needed for more subtle
> processing, e.g. request with PASID transactions.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/dmar.c | 37 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index cbf7763..2c0b80d464 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1587,11 +1587,43 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
>  	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
>  }
>  
> +static int dmar_unrecov_fault_notify(u8 fault_reason, u16 source_id,
> +			unsigned long long addr)
> +{
> +	int ret;
> +	struct pci_dev *pdev;
> +	struct iommu_fault_event *event;
> +
> +	pdev = pci_get_bus_and_slot(source_id >> 8, source_id & 0xFF);
> +	if (!pdev)
> +		return -ENODEV;
> +	pr_debug("Notify PCI device fault [%02x:%02x.%d]\n",
> +		source_id >> 8, PCI_SLOT(source_id & 0xff),
> +		PCI_FUNC(source_id & 0xff));
> +	event = kzalloc(sizeof(*event) + sizeof(fault_reason), GFP_KERNEL);
> +	if (!event)
> +		return -ENOMEM;

Leaks pdev reference.

> +
> +	pci_dev_get(pdev);
> +	event->dev = &pdev->dev;
> +	event->buf[0] = fault_reason;
> +	event->addr = addr;
> +	event->length = sizeof(fault_reason);
> +	event->flags = IOMMU_FAULT_UNRECOV;
> +	ret = iommu_fault_notifier_call_chain(event);
> +
> +	pci_dev_put(pdev);
> +	kfree(event);
> +
> +	return ret;
> +}
> +
>  static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
>  		u8 fault_reason, u16 source_id, unsigned long long addr)
>  {
>  	const char *reason;
>  	int fault_type;
> +	int ret = 0;
>  
>  	reason = dmar_get_fault_reason(fault_reason, &fault_type);
>  
> @@ -1600,11 +1632,14 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
>  			source_id >> 8, PCI_SLOT(source_id & 0xFF),
>  			PCI_FUNC(source_id & 0xFF), addr >> 48,
>  			fault_reason, reason);
> -	else
> +	else {
>  		pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx [fault reason %02d] %s\n",
>  		       type ? "DMA Read" : "DMA Write",
>  		       source_id >> 8, PCI_SLOT(source_id & 0xFF),
>  		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
> +		ret = dmar_unrecov_fault_notify(fault_reason, source_id, addr);

For what purpose are we recording this return code?

> +	}
> +
>  	return 0;
>  }
>  

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 6/9] iommu/vt-d: track device with pasid table bond to a guest
  2017-06-14 22:23 ` [RFC 6/9] iommu/vt-d: track device with pasid table bond to a guest Jacob Pan
@ 2017-06-22 22:54   ` Alex Williamson
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Williamson @ 2017-06-22 22:54 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare

On Wed, 14 Jun 2017 15:23:00 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> When PASID table pointer of an assigned device is bond to a guest,

s/bond/bound/

> the first level page tables are managed by the guest. However, only
> host/physical IOMMU can detect fault events, e.g. page requests.
> Therefore, we need to keep track of which device has its PASID table
> pointer bond to a guest such that page request and other events can

same and throughout...

> be propagated to the guest as needed.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 19 +------------------
>  include/linux/intel-iommu.h | 19 +++++++++++++++++++
>  2 files changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 6b8e997..9765277 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -416,24 +416,6 @@ struct dmar_domain {
>  					   iommu core */
>  };
>  
> -/* PCI domain-device relationship */
> -struct device_domain_info {
> -	struct list_head link;	/* link to domain siblings */
> -	struct list_head global; /* link to global list */
> -	u8 bus;			/* PCI bus number */
> -	u8 devfn;		/* PCI devfn number */
> -	u8 pasid_supported:3;
> -	u8 pasid_enabled:1;
> -	u8 pri_supported:1;
> -	u8 pri_enabled:1;
> -	u8 ats_supported:1;
> -	u8 ats_enabled:1;
> -	u8 ats_qdep;
> -	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
> -	struct intel_iommu *iommu; /* IOMMU used by this device */
> -	struct dmar_domain *domain; /* pointer to domain */
> -};
> -
>  struct dmar_rmrr_unit {
>  	struct list_head list;		/* list of rmrr units	*/
>  	struct acpi_dmar_header *hdr;	/* ACPI header		*/
> @@ -5547,6 +5529,7 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
>  				DMA_CCMD_MASK_NOBIT,
>  				DMA_CCMD_DEVICE_INVL);
>  	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> +	info->pasid_tbl_bond = 1;
>  	spin_unlock_irqrestore(&iommu->lock, flags);
>  
>  
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 8df6c91..ed39c56 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -434,6 +434,25 @@ struct intel_iommu {
>  	u32		flags;      /* Software defined flags */
>  };
>  
> +/* PCI domain-device relationship */
> +struct device_domain_info {
> +	struct list_head link;	/* link to domain siblings */
> +	struct list_head global; /* link to global list */
> +	u8 bus;			/* PCI bus number */
> +	u8 devfn;		/* PCI devfn number */
> +	u8 pasid_supported:3;
> +	u8 pasid_enabled:1;
> +	u8 pasid_tbl_bond:1;	/* bond to guest PASID table */
> +	u8 pri_supported:1;
> +	u8 pri_enabled:1;
> +	u8 ats_supported:1;
> +	u8 ats_enabled:1;
> +	u8 ats_qdep;
> +	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
> +	struct intel_iommu *iommu; /* IOMMU used by this device */
> +	struct dmar_domain *domain; /* pointer to domain */
> +};
> +
>  static inline void __iommu_flush_cache(
>  	struct intel_iommu *iommu, void *addr, int size)
>  {

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 2/9] iommu/vt-d: add bind_pasid_table function
  2017-06-22 22:52   ` Alex Williamson
@ 2017-06-23 18:19     ` Jacob Pan
  2017-06-23 18:59       ` Alex Williamson
  0 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-23 18:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, Yi L,
	jacob.jun.pan

On Thu, 22 Jun 2017 16:52:15 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 14 Jun 2017 15:22:56 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > Add Intel VT-d ops to the generic iommu_bind_pasid_table API
> > functions.
> > 
> > The primary use case is for direct assignment of SVM capable
> > device. Originated from emulated IOMMU in the guest, the request
> > goes through many layers (e.g. VFIO). Upon calling host IOMMU
> > driver, caller passes guest PASID table pointer (GPA) and size.
> > 
> > Device context table entry is modified by Intel IOMMU specific
> > bind_pasid_table function. This will turn on nesting mode and
> > matching translation type.
> > 
> > The unbind operation restores default context mapping.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c   | 109
> > ++++++++++++++++++++++++++++++++++++++++++
> > include/linux/dma_remapping.h |   1 + 2 files changed, 110
> > insertions(+)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index fc2765c..1d5d9ab9 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -5430,6 +5430,111 @@ struct intel_iommu
> > *intel_svm_device_to_iommu(struct device *dev) 
> >  	return iommu;
> >  }
> > +
> > +static int intel_iommu_bind_pasid_table(struct iommu_domain
> > *domain,
> > +		struct device *dev, struct pasid_table_info
> > *pasidt_binfo) +{
> > +	struct intel_iommu *iommu;
> > +	struct context_entry *context;
> > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +	struct device_domain_info *info;
> > +	struct pci_dev *pdev;
> > +	u8 bus, devfn;
> > +	u16 did, *sid;
> > +	int ret = 0;
> > +	unsigned long flags;
> > +	u64 ctx_lo;
> > +
> > +	if (pasidt_binfo == NULL || pasidt_binfo->model !=
> > INTEL_IOMMU) {  
> 
> Clearly model cannot be used as a bit field so it would appear wrong
> to limit ourselves to 32 possible models by using it that way.
> 
agreed.
> > +		pr_warn("%s: Invalid bind request!\n", __func__);  
> 
> Let the callers deal with error reporting.
> 
ditto.
> > +		return -EINVAL;
> > +	}
> > +
> > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > +	if (!iommu)
> > +		return -ENODEV;
> > +
> > +	sid = (u16 *)&pasidt_binfo->opaque;  
> 
> Failed to check length.
> 
ditto.
> 
> > +	/*
> > +	 * check SID, if it is not correct, return success to
> > allow looping
> > +	 * through all devices within a group
> > +	 */
> > +	if (PCI_DEVID(bus, devfn) != *sid)
> > +		return 0;
> > +
> > +	pdev = to_pci_dev(dev);  
> 
> Better test dev_is_pci() first!
> 
good point.
> > +	info = dev->archdata.iommu;
> > +	if (!info || !info->pasid_supported) {
> > +		pr_err("PCI %04x:%02x:%02x.%d: has no PASID
> > support\n",
> > +			       pci_domain_nr(pdev->bus), bus,
> > PCI_SLOT(devfn),
> > +			       PCI_FUNC(devfn));
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if (pasidt_binfo->size > intel_iommu_get_pts(iommu)) {
> > +		pr_err("Invalid gPASID table size %llu, host size
> > %lu\n",
> > +			pasidt_binfo->size,
> > +			intel_iommu_get_pts(iommu));
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}  
> 
> Different errnos here would be more useful to code that handles the
> return than these pr_err()s.
> 
OK.
> > +	spin_lock_irqsave(&iommu->lock, flags);
> > +	context = iommu_context_addr(iommu, bus, devfn, 0);
> > +	if (!context || !context_present(context)) {
> > +		pr_warn("%s: ctx not present for bus devfn
> > %x:%x\n",
> > +			__func__, bus, devfn);
> > +		spin_unlock_irqrestore(&iommu->lock, flags);
> > +		goto out;  
> 
> Return success?!
> 
good catch, should fail.
> > +	}
> > +	/* Anticipate guest to use SVM and owns the first level */
> > +	ctx_lo = context[0].lo;
> > +	ctx_lo |= CONTEXT_NESTE;
> > +	ctx_lo |= CONTEXT_PRS;
> > +	ctx_lo |= CONTEXT_PASIDE;
> > +	ctx_lo &= ~CONTEXT_TT_MASK;
> > +	ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
> > +	context[0].lo = ctx_lo;
> > +
> > +	/* Assign guest PASID table pointer and size */
> > +	ctx_lo = (pasidt_binfo->ptr & VTD_PAGE_MASK) |
> > pasidt_binfo->size;
> > +	context[1].lo = ctx_lo;
> > +	/* make sure context entry is updated before flushing */
> > +	wmb();
> > +	did = dmar_domain->iommu_did[iommu->seq_id];
> > +	iommu->flush.flush_context(iommu, did,
> > +				(((u16)bus) << 8) | devfn,
> > +				DMA_CCMD_MASK_NOBIT,
> > +				DMA_CCMD_DEVICE_INVL);
> > +	iommu->flush.flush_iotlb(iommu, did, 0, 0,
> > DMA_TLB_DSI_FLUSH);
> > +	spin_unlock_irqrestore(&iommu->lock, flags);
> > +
> > +
> > +out:
> > +	return ret;
> > +}
> > +
> > +static int intel_iommu_unbind_pasid_table(struct iommu_domain
> > *domain,
> > +					struct device *dev)
> > +{
> > +	struct intel_iommu *iommu;
> > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +	u8 bus, devfn;
> > +
> > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > +	if (!iommu)
> > +		return -ENODEV;
> > +	/*
> > +	 * REVISIT: we might want to clear the PASID table pointer
> > +	 * as part of context clear operation. Currently, it leaves
> > +	 * stale data but should be ignored by hardware since
> > PASIDE
> > +	 * is clear.
> > +	 */
> > +	/* ATS will be reenabled when remapping is restored */
> > +	pci_disable_ats(to_pci_dev(dev));  
> 
> dev_is_pci()?
> 
good to check, even thought intel iommu supports PCI only.
> > +	domain_context_clear(iommu, dev);
> > +	return domain_context_mapping_one(dmar_domain, iommu, bus,
> > devfn); +}
> >  #endif /* CONFIG_INTEL_IOMMU_SVM */
> >  
> >  const struct iommu_ops intel_iommu_ops = {
> > @@ -5438,6 +5543,10 @@ const struct iommu_ops intel_iommu_ops = {
> >  	.domain_free		= intel_iommu_domain_free,
> >  	.attach_dev		= intel_iommu_attach_device,
> >  	.detach_dev		= intel_iommu_detach_device,
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.bind_pasid_table	= intel_iommu_bind_pasid_table,
> > +	.unbind_pasid_table	=
> > intel_iommu_unbind_pasid_table, +#endif
> >  	.map			= intel_iommu_map,
> >  	.unmap			= intel_iommu_unmap,
> >  	.map_sg			= default_iommu_map_sg,
> > diff --git a/include/linux/dma_remapping.h
> > b/include/linux/dma_remapping.h index 9088407..85367b7 100644
> > --- a/include/linux/dma_remapping.h
> > +++ b/include/linux/dma_remapping.h
> > @@ -27,6 +27,7 @@
> >  
> >  #define CONTEXT_DINVE		(1ULL << 8)
> >  #define CONTEXT_PRS		(1ULL << 9)
> > +#define CONTEXT_NESTE		(1ULL << 10)
> >  #define CONTEXT_PASIDE		(1ULL << 11)
> >  
> >  struct intel_iommu;  

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 1/9] iommu: Introduce bind_pasid_table API function
  2017-06-22 22:52   ` Alex Williamson
@ 2017-06-23 18:20     ` Jacob Pan
  0 siblings, 0 replies; 30+ messages in thread
From: Jacob Pan @ 2017-06-23 18:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, Liu Yi L,
	jacob.jun.pan

On Thu, 22 Jun 2017 16:52:01 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 14 Jun 2017 15:22:55 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> > use case in the guest:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> > 
> > As part of the proposed architecture, when a SVM capable PCI
> > device is assigned to a guest, nested mode is turned on. Guest owns
> > the first level page tables (request with PASID) and performs
> > GVA->GPA translation. Second level page tables are owned by the
> > host for GPA->HPA translation for both request with and without
> > PASID.
> > 
> > A new IOMMU driver interface is therefore needed to perform tasks as
> > follows:
> > * Enable nested translation and appropriate translation type
> > * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> > 
> > This patch introduces new functions called
> > iommu_(un)bind_pasid_table() to IOMMU APIs. Architecture specific
> > IOMMU function can be added later to perform the specific steps for
> > binding pasid table of assigned devices.
> > 
> > This patch also adds model definition in iommu.h. It would be used
> > to check if the bind request is from a compatible entity. e.g. a
> > bind request from an intel_iommu emulator may not be supported by
> > an ARM SMMU driver.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/iommu.c      | 19 +++++++++++++++++++
> >  include/linux/iommu.h      | 23 +++++++++++++++++++++++
> >  include/uapi/linux/iommu.h | 33 +++++++++++++++++++++++++++++++++
> >  3 files changed, 75 insertions(+)
> >  create mode 100644 include/uapi/linux/iommu.h
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index cf7ca7e..494309b 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1328,6 +1328,25 @@ int iommu_attach_device(struct iommu_domain
> > *domain, struct device *dev) }
> >  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >  
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct
> > device *dev,
> > +			struct pasid_table_info *pasidt_binfo)
> > +{
> > +	if (unlikely(!domain->ops->bind_pasid_table))
> > +		return -EINVAL;
> > +
> > +	return domain->ops->bind_pasid_table(domain, dev,
> > pasidt_binfo); +}
> > +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> > +
> > +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct
> > device *dev) +{
> > +	if (unlikely(!domain->ops->unbind_pasid_table))
> > +		return -EINVAL;
> > +
> > +	return domain->ops->unbind_pasid_table(domain, dev);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> > +
> >  static void __iommu_detach_device(struct iommu_domain *domain,
> >  				  struct device *dev)
> >  {
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 2cb54ad..7122add 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -25,6 +25,7 @@
> >  #include <linux/errno.h>
> >  #include <linux/err.h>
> >  #include <linux/of.h>
> > +#include <uapi/linux/iommu.h>
> >  
> >  #define IOMMU_READ	(1 << 0)
> >  #define IOMMU_WRITE	(1 << 1)
> > @@ -183,6 +184,8 @@ struct iommu_resv_region {
> >   * @domain_get_windows: Return the number of windows for a domain
> >   * @of_xlate: add OF master IDs to iommu grouping
> >   * @pgsize_bitmap: bitmap of all possible supported page sizes
> > + * @bind_pasid_table: bind pasid table pointer for guest SVM
> > + * @unbind_pasid_table: unbind pasid table pointer and restore
> > defaults */
> >  struct iommu_ops {
> >  	bool (*capable)(enum iommu_cap);
> > @@ -225,6 +228,10 @@ struct iommu_ops {
> >  	u32 (*domain_get_windows)(struct iommu_domain *domain);
> >  
> >  	int (*of_xlate)(struct device *dev, struct of_phandle_args
> > *args);
> > +	int (*bind_pasid_table)(struct iommu_domain *domain,
> > struct device *dev,
> > +				struct pasid_table_info
> > *pasidt_binfo);
> > +	int (*unbind_pasid_table)(struct iommu_domain *domain,
> > +				struct device *dev);
> >  
> >  	unsigned long pgsize_bitmap;
> >  };
> > @@ -282,6 +289,10 @@ extern int iommu_attach_device(struct
> > iommu_domain *domain, struct device *dev);
> >  extern void iommu_detach_device(struct iommu_domain *domain,
> >  				struct device *dev);
> > +extern int iommu_bind_pasid_table(struct iommu_domain *domain,
> > +		struct device *dev, struct pasid_table_info
> > *pasidt_binfo); +extern int iommu_unbind_pasid_table(struct
> > iommu_domain *domain,
> > +				struct device *dev);
> >  extern struct iommu_domain *iommu_get_domain_for_dev(struct device
> > *dev); extern int iommu_map(struct iommu_domain *domain, unsigned
> > long iova, phys_addr_t paddr, size_t size, int prot);
> > @@ -637,6 +648,18 @@ const struct iommu_ops
> > *iommu_ops_from_fwnode(struct fwnode_handle *fwnode) return NULL;
> >  }
> >  
> > +static inline
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct
> > device *dev,
> > +			struct pasid_table_info *pasidt_binfo)
> > +{
> > +	return -EINVAL;
> > +}
> > +static inline
> > +int iommu_unbind_pasid_table(struct iommu_domain *domain, struct
> > device *dev) +{
> > +	return -EINVAL;
> > +}
> > +
> >  #endif /* CONFIG_IOMMU_API */
> >  
> >  #endif /* __LINUX_IOMMU_H */
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > new file mode 100644
> > index 0000000..5ef0e7c
> > --- /dev/null
> > +++ b/include/uapi/linux/iommu.h
> > @@ -0,0 +1,33 @@
> > +/*
> > + * IOMMU user API definitions
> > + *
> > + *
> > + * This program is free software; you can redistribute it and/or
> > modify
> > + * it under the terms of the GNU General Public License version 2
> > as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#ifndef _UAPI_IOMMU_H
> > +#define _UAPI_IOMMU_H
> > +
> > +/**
> > + * PASID table data used to bind guest PASID table to the host
> > IOMMU. This will
> > + * enable guest managed first level page tables.
> > + * @ptr		PASID table pointer in GPA  
> 
> GPA?  I'm confused how the host physical IOMMU needs the guest
> physical address of the PASID table here.  The first level
> translation does GVA to GPA lookup, but doesn't that translation
> still come from a table that the IOMMU references via a host physical
> address?  If not, is the address space of this pointer
> implementation/architecture specific?  In general I think it's good
> policy to not make the interface specific to the VM use case. After
> all, vfio is a userspace driver interface and device assignment is
> just one use case.  
> 
This is the case we have nested translation turned on. 2nd level does
GPA-HPA translation therefore PASID table pointer is GPA. But I agree
we don't have to restrict this to specific architecture. Perhaps just
leave it as pointer, w/o mentioning GPA.

At IOMMU driver level, binding PASID table and nested translation should
be generic. Perhaps at VFIO level, this can be further abstracted to
encompass both VM and non-VM use cases?

> > + * @size	size of the guest PASID table, must be <= host
> > table size  
> 
> Presumably in bytes, best to say so.
> 
> > + * @model	magic number tells vendor apart
> > + * @length	length of the opaque data  
> 
> Also in bytes.
> 
> > + * @opaque	architecture specific IOMMU data  
> 
> s/architecture/model/?
> 
> > + */
> > +struct pasid_table_info {
> > +	__u64	ptr;
> > +	__u64	size;
> > +	__u32	model;
> > +#define INTEL_IOMMU	(1 << 0)
> > +#define ARM_SMMU	(1 << 1)  
> 
> Why are we using this as a bit field rather than an enum?  Does it
> make sense for model to be (INTEL_IOMMU|ARM_SMMU)?
> 
make sense.
> > +	__u32	length;
> > +	__u8	opaque[];
> > +};
> > +
> > +
> > +#endif /* _UAPI_IOMMU_H */  
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 2/9] iommu/vt-d: add bind_pasid_table function
  2017-06-23 18:19     ` Jacob Pan
@ 2017-06-23 18:59       ` Alex Williamson
  2017-06-23 20:21         ` Jacob Pan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2017-06-23 18:59 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, Yi L

On Fri, 23 Jun 2017 11:19:52 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> On Thu, 22 Jun 2017 16:52:15 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 14 Jun 2017 15:22:56 -0700
> > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> > > +static int intel_iommu_unbind_pasid_table(struct iommu_domain
> > > *domain,
> > > +					struct device *dev)
> > > +{
> > > +	struct intel_iommu *iommu;
> > > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > > +	u8 bus, devfn;
> > > +
> > > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > > +	if (!iommu)
> > > +		return -ENODEV;
> > > +	/*
> > > +	 * REVISIT: we might want to clear the PASID table pointer
> > > +	 * as part of context clear operation. Currently, it leaves
> > > +	 * stale data but should be ignored by hardware since
> > > PASIDE
> > > +	 * is clear.
> > > +	 */
> > > +	/* ATS will be reenabled when remapping is restored */
> > > +	pci_disable_ats(to_pci_dev(dev));    
> > 
> > dev_is_pci()?
> >   
> good to check, even thought intel iommu supports PCI only.

That's not true, intel-iommu supports non-PCI devices defined in ACPI
as well.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 5/9] iommu: Introduce fault notifier API
  2017-06-22 22:53   ` Alex Williamson
@ 2017-06-23 18:59     ` Jacob Pan
  2017-06-23 19:15       ` Alex Williamson
  0 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-23 18:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, jacob.jun.pan

On Thu, 22 Jun 2017 16:53:17 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 14 Jun 2017 15:22:59 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > Traditionally, device specific faults are detected and handled
> > within their own device drivers. When IOMMU is enabled, faults such
> > as DMA related transactions are detected by IOMMU. There is no
> > generic reporting mechanism to report faults back to the in-kernel
> > device driver or the guest OS in case of assigned devices.
> > 
> > Faults detected by IOMMU is based on the transaction's source ID
> > which can be reported at per device basis, regardless of the device
> > type is a PCI device or not.
> > 
> > The fault types includes recoverable (e.g. page request) and
> > unrecoverable faults(e.g. invalid context). In most cases, faults
> > can be handled by IOMMU drivers. However, there are use cases that
> > require fault processing outside IOMMU driver, e.g.
> > 
> > 1. page request fault originated from an SVM capable device that is
> > assigned to guest via vIOMMU. In this case, the first level page
> > tables are owned by the guest. Page request must be propagated to
> > the guest to let guest OS fault in the pages then send page
> > response. In this mechanism, the direct receiver of IOMMU fault
> > notification is VFIO, which can relay notification events to QEMU
> > or other user space software.
> > 
> > 2. faults need more subtle handling by device drivers. Other than
> > simply invoke reset function, there are needs to let device driver
> > handle the fault with a smaller impact.
> > 
> > This patchset is intended to create a generic fault notification
> > API such that it can scale as follows:
> > - all IOMMU types
> > - PCI and non-PCI devices
> > - recoverable and unrecoverable faults
> > - VFIO and other other in kernel users
> > - DMA & IRQ remapping (TBD)
> > 
> > The event data contains both generic and raw architectural data
> > such that performance is not compromised as the data propagation may
> > involve many layers.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/iommu.c | 63
> > +++++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/iommu.h | 54
> > +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 117
> > insertions(+)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index c786370..04c73f3 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -48,6 +48,7 @@ struct iommu_group {
> >  	struct list_head devices;
> >  	struct mutex mutex;
> >  	struct blocking_notifier_head notifier;
> > +	struct blocking_notifier_head fault_notifier;
> >  	void *iommu_data;
> >  	void (*iommu_data_release)(void *iommu_data);
> >  	char *name;
> > @@ -345,6 +346,7 @@ struct iommu_group *iommu_group_alloc(void)
> >  	mutex_init(&group->mutex);
> >  	INIT_LIST_HEAD(&group->devices);
> >  	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
> > +	BLOCKING_INIT_NOTIFIER_HEAD(&group->fault_notifier);
> >  
> >  	ret = ida_simple_get(&iommu_group_ida, 0, 0, GFP_KERNEL);
> >  	if (ret < 0) {
> > @@ -790,6 +792,67 @@ int iommu_group_unregister_notifier(struct
> > iommu_group *group,
> > EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); 
> >  /**
> > + * iommu_register_fault_notifier - Register a notifier for fault
> > reporting
> > + * @dev: device to notify fault events
> > + * @nb: notifier block to signal
> > + *
> > + */
> > +int iommu_register_fault_notifier(struct device *dev,
> > +				struct notifier_block *nb)
> > +{
> > +	int ret;
> > +	struct iommu_group *group = iommu_group_get(dev);
> > +
> > +	if (!group)
> > +		return -EINVAL;
> > +
> > +	ret =
> > blocking_notifier_chain_register(&group->fault_notifier, nb);
> > +	iommu_group_put(group);
> > +
> > +	return ret;
> > +
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_register_fault_notifier);
> > +
> > +/**
> > + * iommu_unregister_fault_notifier - Unregister a notifier for
> > fault reporting
> > + * @domain: the domain to watch
> > + * @nb: notifier block to signal
> > + *
> > + */
> > +int iommu_unregister_fault_notifier(struct device *dev,
> > +				  struct notifier_block *nb)
> > +{
> > +	int ret;
> > +	struct iommu_group *group = iommu_group_get(dev);
> > +
> > +	if (!group)
> > +		return -EINVAL;
> > +
> > +	ret =
> > blocking_notifier_chain_unregister(&group->fault_notifier, nb);
> > +	iommu_group_put(group);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_unregister_fault_notifier);  
> 
> 
> If the call chains are on the group, why do we register with a device?
> If I registered with a group, I'd know that I only need to register
> once per group, that's not clear here and may lead to callers of this
> getting multiple notifications, one for each device in a group.
> 
My thought was that since iommu_group is already part of struct device,
it is more efficient to share group level notifier than introducing per
device fault callbacks or notifiers.

So you do need to register per device not just once for the group. In
most cases device with SVM has ACS, it will be 1:1 between device and
group.

When the fault event occurs, IOMMU can identify the offending device
and struct device is embedded in the event data sent to the device
driver or VFIO, so the receiver can filter out unwanted events.

> > +
> > +int iommu_fault_notifier_call_chain(struct iommu_fault_event
> > *event) +{
> > +	int ret;
> > +	struct iommu_group *group = iommu_group_get(event->dev);
> > +
> > +	if (!group)
> > +		return -EINVAL;
> > +	/* caller provide generic data related to the event, TBD */
> > +	ret =
> > (blocking_notifier_call_chain(&group->fault_notifier, 0, (void
> > *)event)
> > +		== NOTIFY_BAD) ? -EINVAL : 0;
> > +	iommu_group_put(group);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(iommu_fault_notifier_call_chain);
> > +
> > +/**
> >   * iommu_group_id - Return ID for a group
> >   * @group: the group to ID
> >   *
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 2cdbaa3..fe89e88 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -42,6 +42,7 @@
> >   * if the IOMMU page table format is equivalent.
> >   */
> >  #define IOMMU_PRIV	(1 << 5)
> > +#define IOMMU_EXEC	(1 << 6)  
> 
> Irrelevant change?
> 
yes, it should be separate. used later in svm patch.
> >  
> >  struct iommu_ops;
> >  struct iommu_group;
> > @@ -97,6 +98,36 @@ struct iommu_domain {
> >  	void *iova_cookie;
> >  };
> >  
> > +/*
> > + * Generic fault event notification data, used by all IOMMU
> > architectures.
> > + *
> > + * - PCI and non-PCI devices
> > + * - Recoverable faults (e.g. page request) & un-recoverable faults
> > + * - DMA remapping and IRQ remapping faults
> > + *
> > + * @dev The device which faults are reported by IOMMU
> > + * @addr tells the offending address
> > + * @pasid contains process address space ID, used in shared
> > virtual memory (SVM)
> > + * @prot page access protection flag, e.g. IOMMU_READ, IOMMU_WRITE
> > + * @flags contains fault type, etc.
> > + * @length tells the size of the buf
> > + * @buf contains any raw or arch specific data
> > + *
> > + */
> > +struct iommu_fault_event {
> > +	struct device *dev;
> > +	__u64 addr;
> > +	__u32 pasid;
> > +	__u32 prot;
> > +	__u32 flags;
> > +#define IOMMU_FAULT_PAGE_REQ	BIT(0)
> > +#define IOMMU_FAULT_UNRECOV	BIT(1)
> > +#define IOMMU_FAULT_IRQ_REMAP	BIT(2)
> > +#define IOMMU_FAULT_INVAL	BIT(3)  
> 
> Details of each of these are defined where?
> 
good point, will add description.
> > +	__u32 length;
> > +	__u8  buf[];
> > +};  
> 
> This is not UAPI, so I'll be curious how vfio is supposed to expose
> this to userspace.
> 
But this also has other in-kernel users.
> > +
> >  enum iommu_cap {
> >  	IOMMU_CAP_CACHE_COHERENCY,	/* IOMMU can enforce
> > cache coherent DMA transactions */
> > @@ -341,6 +372,12 @@ extern int
> > iommu_group_register_notifier(struct iommu_group *group, struct
> > notifier_block *nb); extern int
> > iommu_group_unregister_notifier(struct iommu_group *group, struct
> > notifier_block *nb); +extern int
> > iommu_register_fault_notifier(struct device *dev,
> > +					 struct notifier_block
> > *nb); +extern int iommu_unregister_fault_notifier(struct device
> > *dev,
> > +					 struct notifier_block
> > *nb); +extern int iommu_fault_notifier_call_chain(struct
> > iommu_fault_event *event); +
> >  extern int iommu_group_id(struct iommu_group *group);
> >  extern struct iommu_group *iommu_group_get_for_dev(struct device
> > *dev); extern struct iommu_domain
> > *iommu_group_default_domain(struct iommu_group *); @@ -573,6
> > +610,23 @@ static inline int iommu_group_unregister_notifier(struct
> > iommu_group *group, return 0; }
> >  
> > +static inline int iommu_register_fault_notifier(struct device *dev,
> > +						  struct
> > notifier_block *nb) +{
> > +	return 0;
> > +}
> > +
> > +static inline int iommu_unregister_fault_notifier(struct device
> > *dev,
> > +						  struct
> > notifier_block *nb) +{
> > +	return 0;
> > +}
> > +
> > +static inline int iommu_fault_notifier_call_chain(struct
> > iommu_fault_event *event) +{
> > +	return 0;
> > +}
> > +
> >  static inline int iommu_group_id(struct iommu_group *group)
> >  {
> >  	return -ENODEV;  
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 5/9] iommu: Introduce fault notifier API
  2017-06-23 18:59     ` Jacob Pan
@ 2017-06-23 19:15       ` Alex Williamson
  2017-06-26 15:27         ` Jacob Pan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2017-06-23 19:15 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare

On Fri, 23 Jun 2017 11:59:28 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> On Thu, 22 Jun 2017 16:53:17 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 14 Jun 2017 15:22:59 -0700
> > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> >   
> > > Traditionally, device specific faults are detected and handled
> > > within their own device drivers. When IOMMU is enabled, faults such
> > > as DMA related transactions are detected by IOMMU. There is no
> > > generic reporting mechanism to report faults back to the in-kernel
> > > device driver or the guest OS in case of assigned devices.
> > > 
> > > Faults detected by IOMMU is based on the transaction's source ID
> > > which can be reported at per device basis, regardless of the device
> > > type is a PCI device or not.
> > > 
> > > The fault types includes recoverable (e.g. page request) and
> > > unrecoverable faults(e.g. invalid context). In most cases, faults
> > > can be handled by IOMMU drivers. However, there are use cases that
> > > require fault processing outside IOMMU driver, e.g.
> > > 
> > > 1. page request fault originated from an SVM capable device that is
> > > assigned to guest via vIOMMU. In this case, the first level page
> > > tables are owned by the guest. Page request must be propagated to
> > > the guest to let guest OS fault in the pages then send page
> > > response. In this mechanism, the direct receiver of IOMMU fault
> > > notification is VFIO, which can relay notification events to QEMU
> > > or other user space software.
> > > 
> > > 2. faults need more subtle handling by device drivers. Other than
> > > simply invoke reset function, there are needs to let device driver
> > > handle the fault with a smaller impact.
> > > 
> > > This patchset is intended to create a generic fault notification
> > > API such that it can scale as follows:
> > > - all IOMMU types
> > > - PCI and non-PCI devices
> > > - recoverable and unrecoverable faults
> > > - VFIO and other other in kernel users
> > > - DMA & IRQ remapping (TBD)
> > > 
> > > The event data contains both generic and raw architectural data
> > > such that performance is not compromised as the data propagation may
> > > involve many layers.
> > > 
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > ---
> > >  drivers/iommu/iommu.c | 63
> > > +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > include/linux/iommu.h | 54
> > > +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 117
> > > insertions(+)
> > > 
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index c786370..04c73f3 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -48,6 +48,7 @@ struct iommu_group {
> > >  	struct list_head devices;
> > >  	struct mutex mutex;
> > >  	struct blocking_notifier_head notifier;
> > > +	struct blocking_notifier_head fault_notifier;
> > >  	void *iommu_data;
> > >  	void (*iommu_data_release)(void *iommu_data);
> > >  	char *name;
> > > @@ -345,6 +346,7 @@ struct iommu_group *iommu_group_alloc(void)
> > >  	mutex_init(&group->mutex);
> > >  	INIT_LIST_HEAD(&group->devices);
> > >  	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
> > > +	BLOCKING_INIT_NOTIFIER_HEAD(&group->fault_notifier);
> > >  
> > >  	ret = ida_simple_get(&iommu_group_ida, 0, 0, GFP_KERNEL);
> > >  	if (ret < 0) {
> > > @@ -790,6 +792,67 @@ int iommu_group_unregister_notifier(struct
> > > iommu_group *group,
> > > EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); 
> > >  /**
> > > + * iommu_register_fault_notifier - Register a notifier for fault
> > > reporting
> > > + * @dev: device to notify fault events
> > > + * @nb: notifier block to signal
> > > + *
> > > + */
> > > +int iommu_register_fault_notifier(struct device *dev,
> > > +				struct notifier_block *nb)
> > > +{
> > > +	int ret;
> > > +	struct iommu_group *group = iommu_group_get(dev);
> > > +
> > > +	if (!group)
> > > +		return -EINVAL;
> > > +
> > > +	ret =
> > > blocking_notifier_chain_register(&group->fault_notifier, nb);
> > > +	iommu_group_put(group);
> > > +
> > > +	return ret;
> > > +
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_register_fault_notifier);
> > > +
> > > +/**
> > > + * iommu_unregister_fault_notifier - Unregister a notifier for
> > > fault reporting
> > > + * @domain: the domain to watch
> > > + * @nb: notifier block to signal
> > > + *
> > > + */
> > > +int iommu_unregister_fault_notifier(struct device *dev,
> > > +				  struct notifier_block *nb)
> > > +{
> > > +	int ret;
> > > +	struct iommu_group *group = iommu_group_get(dev);
> > > +
> > > +	if (!group)
> > > +		return -EINVAL;
> > > +
> > > +	ret =
> > > blocking_notifier_chain_unregister(&group->fault_notifier, nb);
> > > +	iommu_group_put(group);
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_unregister_fault_notifier);    
> > 
> > 
> > If the call chains are on the group, why do we register with a device?
> > If I registered with a group, I'd know that I only need to register
> > once per group, that's not clear here and may lead to callers of this
> > getting multiple notifications, one for each device in a group.
> >   
> My thought was that since iommu_group is already part of struct device,
> it is more efficient to share group level notifier than introducing per
> device fault callbacks or notifiers.

That's fine, but the interface should reflect the same.
 
> So you do need to register per device not just once for the group. In
> most cases device with SVM has ACS, it will be 1:1 between device and
> group.

ACS at the endpoint is only one factor in determining grouping, we need
only look at Intel Core processors to find root ports lacking ACS which
will result in multiple devices per group as a very common case.
 
> When the fault event occurs, IOMMU can identify the offending device
> and struct device is embedded in the event data sent to the device
> driver or VFIO, so the receiver can filter out unwanted events.

I think this is a poor design choice.  If the notification list is on
the group then make the registration be on the group.  Otherwise the
number of notifies seen by the caller is automatically multiplied by
the number of devices they're using in the group.  The caller not only
needs to disregard unintended devices, but they need to make sure they
only act on a notification for the device on the correct notifier.
This is designing in unnecessary complexity and overhead.

> > > +
> > > +int iommu_fault_notifier_call_chain(struct iommu_fault_event
> > > *event) +{
> > > +	int ret;
> > > +	struct iommu_group *group = iommu_group_get(event->dev);
> > > +
> > > +	if (!group)
> > > +		return -EINVAL;
> > > +	/* caller provide generic data related to the event, TBD */
> > > +	ret =
> > > (blocking_notifier_call_chain(&group->fault_notifier, 0, (void
> > > *)event)
> > > +		== NOTIFY_BAD) ? -EINVAL : 0;
> > > +	iommu_group_put(group);
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(iommu_fault_notifier_call_chain);
> > > +
> > > +/**
> > >   * iommu_group_id - Return ID for a group
> > >   * @group: the group to ID
> > >   *
> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > index 2cdbaa3..fe89e88 100644
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -42,6 +42,7 @@
> > >   * if the IOMMU page table format is equivalent.
> > >   */
> > >  #define IOMMU_PRIV	(1 << 5)
> > > +#define IOMMU_EXEC	(1 << 6)    
> > 
> > Irrelevant change?
> >   
> yes, it should be separate. used later in svm patch.
> > >  
> > >  struct iommu_ops;
> > >  struct iommu_group;
> > > @@ -97,6 +98,36 @@ struct iommu_domain {
> > >  	void *iova_cookie;
> > >  };
> > >  
> > > +/*
> > > + * Generic fault event notification data, used by all IOMMU
> > > architectures.
> > > + *
> > > + * - PCI and non-PCI devices
> > > + * - Recoverable faults (e.g. page request) & un-recoverable faults
> > > + * - DMA remapping and IRQ remapping faults
> > > + *
> > > + * @dev The device which faults are reported by IOMMU
> > > + * @addr tells the offending address
> > > + * @pasid contains process address space ID, used in shared
> > > virtual memory (SVM)
> > > + * @prot page access protection flag, e.g. IOMMU_READ, IOMMU_WRITE
> > > + * @flags contains fault type, etc.
> > > + * @length tells the size of the buf
> > > + * @buf contains any raw or arch specific data
> > > + *
> > > + */
> > > +struct iommu_fault_event {
> > > +	struct device *dev;
> > > +	__u64 addr;
> > > +	__u32 pasid;
> > > +	__u32 prot;
> > > +	__u32 flags;
> > > +#define IOMMU_FAULT_PAGE_REQ	BIT(0)
> > > +#define IOMMU_FAULT_UNRECOV	BIT(1)
> > > +#define IOMMU_FAULT_IRQ_REMAP	BIT(2)
> > > +#define IOMMU_FAULT_INVAL	BIT(3)    
> > 
> > Details of each of these are defined where?
> >   
> good point, will add description.
> > > +	__u32 length;
> > > +	__u8  buf[];
> > > +};    
> > 
> > This is not UAPI, so I'll be curious how vfio is supposed to expose
> > this to userspace.
> >   
> But this also has other in-kernel users.


Any point where you expect vfio to pass through one of these model
specific data structures needs to be fully specified in a uapi header.
No magic opaque data please.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 8/9] iommu/intel-svm: notify page request to guest
  2017-06-22 22:53   ` Alex Williamson
@ 2017-06-23 20:16     ` Jacob Pan
  2017-06-23 20:34       ` Alex Williamson
  0 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-23 20:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, jacob.jun.pan

On Thu, 22 Jun 2017 16:53:58 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 14 Jun 2017 15:23:02 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > If the source device of a page request has its PASID table pointer
> > bond to a guest, the first level page tables are owned by the guest.
> > In this case, we shall let guest OS to manage page fault.
> > 
> > This patch uses the IOMMU fault notification API to send
> > notifications, possibly via VFIO, to the guest OS. Once guest pages
> > are fault in, guest will issue page response which will be passed
> > down via the invalidation passdown APIs.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/intel-svm.c | 81
> > ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 80
> > insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index 23c4276..d1d2d23 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -525,6 +525,80 @@ static bool access_error(struct vm_area_struct
> > *vma, struct page_req_dsc *req) return (requested &
> > ~vma->vm_flags) != 0; }
> >  
> > +static int prq_to_iommu_prot(struct page_req_dsc *req)
> > +{
> > +	int prot = 0;
> > +
> > +	if (req->rd_req)
> > +		prot |= IOMMU_READ;
> > +	if (req->wr_req)
> > +		prot |= IOMMU_WRITE;
> > +	if (req->exe_req)
> > +		prot |= IOMMU_EXEC;
> > +	if (req->priv_req)
> > +		prot |= IOMMU_PRIV;
> > +
> > +	return prot;
> > +}
> > +
> > +static int intel_svm_prq_notify(struct device *dev, struct
> > page_req_dsc *desc) +{
> > +	int ret = 0;
> > +	struct iommu_fault_event *event;
> > +	struct pci_dev *pdev;
> > +	struct device_domain_info *info;
> > +	unsigned long buf_offset;
> > +
> > +	/**
> > +	 * If caller does not provide struct device, this is the
> > case where
> > +	 * guest PASID table is bond to the device. So we need to
> > retrieve
> > +	 * struct device from the page request deescriptor then
> > proceed.
> > +	 */
> > +	if (!dev) {
> > +		pdev = pci_get_bus_and_slot(desc->bus,
> > desc->devfn);
> > +		if (!pdev) {
> > +			pr_err("No PCI device found for PRQ
> > [%02x:%02x.%d]\n",
> > +				desc->bus, PCI_SLOT(desc->devfn),
> > +				PCI_FUNC(desc->devfn));
> > +			return -ENODEV;
> > +		}
> > +		/**
> > +		 * Make sure PASID table pointer is bond to guest,
> > if yes notify
> > +		 * handler in the guest, e.g. via VFIO.
> > +		 */
> > +		info = pdev->dev.archdata.iommu;
> > +		if (!info || !info->pasid_tbl_bond) {
> > +			pr_debug("PRQ device pasid table not
> > bond.\n");  
> 
> I can "bond" two things together, they are then "bound".
> 
will fix that :)
> > +			return -EINVAL;
> > +		}
> > +		dev = &pdev->dev;  
> 
> Leaks pdev reference.  Both normal and error path.
> 
I guess you are referring to ref count in pci_get_bus_and_slot()? I did
look at the code, it does not seem to do the count increment. Perhaps
the comment is stale?

 * pointer to its data structure.  The caller must decrement the
 * reference count by calling pci_dev_put().  If no device is found,
 * %NULL is returned.
 */
struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
					    unsigned int devfn)
{
	struct pci_dev *dev = NULL;

	for_each_pci_dev(dev) {
		if (pci_domain_nr(dev->bus) == domain &&
		    (dev->bus->number == bus && dev->devfn == devfn))
			return dev;
	}
	return NULL;
}
EXPORT_SYMBOL(pci_get_domain_bus_and_slot);



> > +	}
> > +
> > +	pr_debug("Notify PRQ device [%02x:%02x.%d]\n",
> > +		desc->bus, PCI_SLOT(desc->devfn),
> > +		PCI_FUNC(desc->devfn));
> > +	event = kzalloc(sizeof(*event) + sizeof(*desc),
> > GFP_KERNEL);
> > +	if (!event)
> > +		return -ENOMEM;
> > +
> > +	get_device(dev);
> > +	/* Fill in event data for device specific processing */
> > +	event->dev = dev;
> > +	buf_offset = offsetofend(struct iommu_fault_event, length);
> > +	memcpy(buf_offset + event, desc, sizeof(*desc));
> > +	event->addr = desc->addr;
> > +	event->pasid = desc->pasid;
> > +	event->prot = prq_to_iommu_prot(desc);
> > +	event->length = sizeof(*desc);
> > +	event->flags = IOMMU_FAULT_PAGE_REQ;
> > +
> > +	ret = iommu_fault_notifier_call_chain(event);
> > +	put_device(dev);
> > +	kfree(event);
> > +
> > +	return ret;
> > +}
> > +
> >  static irqreturn_t prq_event_thread(int irq, void *d)
> >  {
> >  	struct intel_iommu *iommu = d;
> > @@ -548,7 +622,12 @@ static irqreturn_t prq_event_thread(int irq,
> > void *d) handled = 1;
> >  
> >  		req = &iommu->prq[head / sizeof(*req)];
> > -
> > +		/**
> > +		 * If prq is to be handled outside iommu driver
> > via receiver of
> > +		 * the fault notifiers, we skip the page response
> > here.
> > +		 */
> > +		if (!intel_svm_prq_notify(NULL, req))
> > +			continue;
> >  		result = QI_RESP_FAILURE;
> >  		address = (u64)req->addr << VTD_PAGE_SHIFT;
> >  		if (!req->pasid_present) {  
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 7/9] iommu/dmar: notify unrecoverable faults
  2017-06-22 22:54   ` Alex Williamson
@ 2017-06-23 20:19     ` Jacob Pan
  0 siblings, 0 replies; 30+ messages in thread
From: Jacob Pan @ 2017-06-23 20:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, jacob.jun.pan

On Thu, 22 Jun 2017 16:54:16 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 14 Jun 2017 15:23:01 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > Currently, when device DMA faults are detected by IOMMU the fault
> > reasons are printed but the offending device is not notified.
> > This patch allows device drivers to be optionally notified for fault
> > conditions when device specific handling is needed for more subtle
> > processing, e.g. request with PASID transactions.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/dmar.c | 37 ++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index cbf7763..2c0b80d464 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1587,11 +1587,43 @@ void dmar_msi_read(int irq, struct msi_msg
> > *msg) raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
> >  }
> >  
> > +static int dmar_unrecov_fault_notify(u8 fault_reason, u16
> > source_id,
> > +			unsigned long long addr)
> > +{
> > +	int ret;
> > +	struct pci_dev *pdev;
> > +	struct iommu_fault_event *event;
> > +
> > +	pdev = pci_get_bus_and_slot(source_id >> 8, source_id &
> > 0xFF);
> > +	if (!pdev)
> > +		return -ENODEV;
> > +	pr_debug("Notify PCI device fault [%02x:%02x.%d]\n",
> > +		source_id >> 8, PCI_SLOT(source_id & 0xff),
> > +		PCI_FUNC(source_id & 0xff));
> > +	event = kzalloc(sizeof(*event) + sizeof(fault_reason),
> > GFP_KERNEL);
> > +	if (!event)
> > +		return -ENOMEM;  
> 
> Leaks pdev reference.
> 
same as before. pci_get_bus_and_slot() does not do ref counting.
> > +
> > +	pci_dev_get(pdev);
> > +	event->dev = &pdev->dev;
> > +	event->buf[0] = fault_reason;
> > +	event->addr = addr;
> > +	event->length = sizeof(fault_reason);
> > +	event->flags = IOMMU_FAULT_UNRECOV;
> > +	ret = iommu_fault_notifier_call_chain(event);
> > +
> > +	pci_dev_put(pdev);
> > +	kfree(event);
> > +
> > +	return ret;
> > +}
> > +
> >  static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
> >  		u8 fault_reason, u16 source_id, unsigned long long
> > addr) {
> >  	const char *reason;
> >  	int fault_type;
> > +	int ret = 0;
> >  
> >  	reason = dmar_get_fault_reason(fault_reason, &fault_type);
> >  
> > @@ -1600,11 +1632,14 @@ static int dmar_fault_do_one(struct
> > intel_iommu *iommu, int type, source_id >> 8, PCI_SLOT(source_id &
> > 0xFF), PCI_FUNC(source_id & 0xFF), addr >> 48,
> >  			fault_reason, reason);
> > -	else
> > +	else {
> >  		pr_err("[%s] Request device [%02x:%02x.%d] fault
> > addr %llx [fault reason %02d] %s\n", type ? "DMA Read" : "DMA
> > Write", source_id >> 8, PCI_SLOT(source_id & 0xFF),
> >  		       PCI_FUNC(source_id & 0xFF), addr,
> > fault_reason, reason);
> > +		ret = dmar_unrecov_fault_notify(fault_reason,
> > source_id, addr);  
> 
> For what purpose are we recording this return code?
> 
good catch. I will drop the return code.
> > +	}
> > +
> >  	return 0;
> >  }
> >    
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 2/9] iommu/vt-d: add bind_pasid_table function
  2017-06-23 18:59       ` Alex Williamson
@ 2017-06-23 20:21         ` Jacob Pan
  0 siblings, 0 replies; 30+ messages in thread
From: Jacob Pan @ 2017-06-23 20:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, Yi L,
	jacob.jun.pan

On Fri, 23 Jun 2017 12:59:00 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 23 Jun 2017 11:19:52 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > On Thu, 22 Jun 2017 16:52:15 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Wed, 14 Jun 2017 15:22:56 -0700
> > > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:  
> > > > +static int intel_iommu_unbind_pasid_table(struct iommu_domain
> > > > *domain,
> > > > +					struct device *dev)
> > > > +{
> > > > +	struct intel_iommu *iommu;
> > > > +	struct dmar_domain *dmar_domain =
> > > > to_dmar_domain(domain);
> > > > +	u8 bus, devfn;
> > > > +
> > > > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > > > +	if (!iommu)
> > > > +		return -ENODEV;
> > > > +	/*
> > > > +	 * REVISIT: we might want to clear the PASID table
> > > > pointer
> > > > +	 * as part of context clear operation. Currently, it
> > > > leaves
> > > > +	 * stale data but should be ignored by hardware since
> > > > PASIDE
> > > > +	 * is clear.
> > > > +	 */
> > > > +	/* ATS will be reenabled when remapping is restored */
> > > > +	pci_disable_ats(to_pci_dev(dev));      
> > > 
> > > dev_is_pci()?
> > >     
> > good to check, even thought intel iommu supports PCI only.  
> 
> That's not true, intel-iommu supports non-PCI devices defined in ACPI
> as well.  Thanks,
> 
For non-pci device, there is still a pci BDF allocated for it (shown in
ACPI) such that it can have its own IOMMU context, right? e.g. HPET

> Alex

[Jacob Pan]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 8/9] iommu/intel-svm: notify page request to guest
  2017-06-23 20:16     ` Jacob Pan
@ 2017-06-23 20:34       ` Alex Williamson
  2017-06-23 21:33         ` Jacob Pan
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2017-06-23 20:34 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare

On Fri, 23 Jun 2017 13:16:29 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> On Thu, 22 Jun 2017 16:53:58 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 14 Jun 2017 15:23:02 -0700
> > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> >   
> > > If the source device of a page request has its PASID table pointer
> > > bond to a guest, the first level page tables are owned by the guest.
> > > In this case, we shall let guest OS to manage page fault.
> > > 
> > > This patch uses the IOMMU fault notification API to send
> > > notifications, possibly via VFIO, to the guest OS. Once guest pages
> > > are fault in, guest will issue page response which will be passed
> > > down via the invalidation passdown APIs.
> > > 
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > ---
> > >  drivers/iommu/intel-svm.c | 81
> > > ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 80
> > > insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > > index 23c4276..d1d2d23 100644
> > > --- a/drivers/iommu/intel-svm.c
> > > +++ b/drivers/iommu/intel-svm.c
> > > @@ -525,6 +525,80 @@ static bool access_error(struct vm_area_struct
> > > *vma, struct page_req_dsc *req) return (requested &
> > > ~vma->vm_flags) != 0; }
> > >  
> > > +static int prq_to_iommu_prot(struct page_req_dsc *req)
> > > +{
> > > +	int prot = 0;
> > > +
> > > +	if (req->rd_req)
> > > +		prot |= IOMMU_READ;
> > > +	if (req->wr_req)
> > > +		prot |= IOMMU_WRITE;
> > > +	if (req->exe_req)
> > > +		prot |= IOMMU_EXEC;
> > > +	if (req->priv_req)
> > > +		prot |= IOMMU_PRIV;
> > > +
> > > +	return prot;
> > > +}
> > > +
> > > +static int intel_svm_prq_notify(struct device *dev, struct
> > > page_req_dsc *desc) +{
> > > +	int ret = 0;
> > > +	struct iommu_fault_event *event;
> > > +	struct pci_dev *pdev;
> > > +	struct device_domain_info *info;
> > > +	unsigned long buf_offset;
> > > +
> > > +	/**
> > > +	 * If caller does not provide struct device, this is the
> > > case where
> > > +	 * guest PASID table is bond to the device. So we need to
> > > retrieve
> > > +	 * struct device from the page request deescriptor then
> > > proceed.
> > > +	 */
> > > +	if (!dev) {
> > > +		pdev = pci_get_bus_and_slot(desc->bus,
> > > desc->devfn);
> > > +		if (!pdev) {
> > > +			pr_err("No PCI device found for PRQ
> > > [%02x:%02x.%d]\n",
> > > +				desc->bus, PCI_SLOT(desc->devfn),
> > > +				PCI_FUNC(desc->devfn));
> > > +			return -ENODEV;
> > > +		}
> > > +		/**
> > > +		 * Make sure PASID table pointer is bond to guest,
> > > if yes notify
> > > +		 * handler in the guest, e.g. via VFIO.
> > > +		 */
> > > +		info = pdev->dev.archdata.iommu;
> > > +		if (!info || !info->pasid_tbl_bond) {
> > > +			pr_debug("PRQ device pasid table not
> > > bond.\n");    
> > 
> > I can "bond" two things together, they are then "bound".
> >   
> will fix that :)
> > > +			return -EINVAL;
> > > +		}
> > > +		dev = &pdev->dev;    
> > 
> > Leaks pdev reference.  Both normal and error path.
> >   
> I guess you are referring to ref count in pci_get_bus_and_slot()? I did
> look at the code, it does not seem to do the count increment. Perhaps
> the comment is stale?
> 
>  * pointer to its data structure.  The caller must decrement the
>  * reference count by calling pci_dev_put().  If no device is found,
>  * %NULL is returned.
>  */
> struct pci_dev *pci_get_domain_bus_and_slot(int domain, unsigned int bus,
> 					    unsigned int devfn)
> {
> 	struct pci_dev *dev = NULL;
> 
> 	for_each_pci_dev(dev) {
        ^^^^^^^^^^^^^^^^ <-- look in here, it's trickier than it appears


> 		if (pci_domain_nr(dev->bus) == domain &&
> 		    (dev->bus->number == bus && dev->devfn == devfn))
> 			return dev;
> 	}
> 	return NULL;
> }
> EXPORT_SYMBOL(pci_get_domain_bus_and_slot);
> 
> 
> 
> > > +	}
> > > +
> > > +	pr_debug("Notify PRQ device [%02x:%02x.%d]\n",
> > > +		desc->bus, PCI_SLOT(desc->devfn),
> > > +		PCI_FUNC(desc->devfn));
> > > +	event = kzalloc(sizeof(*event) + sizeof(*desc),
> > > GFP_KERNEL);
> > > +	if (!event)
> > > +		return -ENOMEM;
> > > +
> > > +	get_device(dev);
> > > +	/* Fill in event data for device specific processing */
> > > +	event->dev = dev;
> > > +	buf_offset = offsetofend(struct iommu_fault_event, length);
> > > +	memcpy(buf_offset + event, desc, sizeof(*desc));
> > > +	event->addr = desc->addr;
> > > +	event->pasid = desc->pasid;
> > > +	event->prot = prq_to_iommu_prot(desc);
> > > +	event->length = sizeof(*desc);
> > > +	event->flags = IOMMU_FAULT_PAGE_REQ;
> > > +
> > > +	ret = iommu_fault_notifier_call_chain(event);
> > > +	put_device(dev);
> > > +	kfree(event);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  static irqreturn_t prq_event_thread(int irq, void *d)
> > >  {
> > >  	struct intel_iommu *iommu = d;
> > > @@ -548,7 +622,12 @@ static irqreturn_t prq_event_thread(int irq,
> > > void *d) handled = 1;
> > >  
> > >  		req = &iommu->prq[head / sizeof(*req)];
> > > -
> > > +		/**
> > > +		 * If prq is to be handled outside iommu driver
> > > via receiver of
> > > +		 * the fault notifiers, we skip the page response
> > > here.
> > > +		 */
> > > +		if (!intel_svm_prq_notify(NULL, req))
> > > +			continue;
> > >  		result = QI_RESP_FAILURE;
> > >  		address = (u64)req->addr << VTD_PAGE_SHIFT;
> > >  		if (!req->pasid_present) {    
> >   
> 
> [Jacob Pan]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 8/9] iommu/intel-svm: notify page request to guest
  2017-06-23 20:34       ` Alex Williamson
@ 2017-06-23 21:33         ` Jacob Pan
  0 siblings, 0 replies; 30+ messages in thread
From: Jacob Pan @ 2017-06-23 21:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, jacob.jun.pan

On Fri, 23 Jun 2017 14:34:34 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> > 	for_each_pci_dev(dev) {  
>         ^^^^^^^^^^^^^^^^ <-- look in here, it's trickier than it
> appears
you are right, thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 5/9] iommu: Introduce fault notifier API
  2017-06-23 19:15       ` Alex Williamson
@ 2017-06-26 15:27         ` Jacob Pan
  2017-06-26 15:32           ` Alex Williamson
  0 siblings, 1 reply; 30+ messages in thread
From: Jacob Pan @ 2017-06-26 15:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare, jacob.jun.pan

On Fri, 23 Jun 2017 13:15:51 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 23 Jun 2017 11:59:28 -0700
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > On Thu, 22 Jun 2017 16:53:17 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Wed, 14 Jun 2017 15:22:59 -0700
> > > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> > >     
> > > > Traditionally, device specific faults are detected and handled
> > > > within their own device drivers. When IOMMU is enabled, faults
> > > > such as DMA related transactions are detected by IOMMU. There
> > > > is no generic reporting mechanism to report faults back to the
> > > > in-kernel device driver or the guest OS in case of assigned
> > > > devices.
> > > > 
> > > > Faults detected by IOMMU is based on the transaction's source ID
> > > > which can be reported at per device basis, regardless of the
> > > > device type is a PCI device or not.
> > > > 
> > > > The fault types includes recoverable (e.g. page request) and
> > > > unrecoverable faults(e.g. invalid context). In most cases,
> > > > faults can be handled by IOMMU drivers. However, there are use
> > > > cases that require fault processing outside IOMMU driver, e.g.
> > > > 
> > > > 1. page request fault originated from an SVM capable device
> > > > that is assigned to guest via vIOMMU. In this case, the first
> > > > level page tables are owned by the guest. Page request must be
> > > > propagated to the guest to let guest OS fault in the pages then
> > > > send page response. In this mechanism, the direct receiver of
> > > > IOMMU fault notification is VFIO, which can relay notification
> > > > events to QEMU or other user space software.
> > > > 
> > > > 2. faults need more subtle handling by device drivers. Other
> > > > than simply invoke reset function, there are needs to let
> > > > device driver handle the fault with a smaller impact.
> > > > 
> > > > This patchset is intended to create a generic fault notification
> > > > API such that it can scale as follows:
> > > > - all IOMMU types
> > > > - PCI and non-PCI devices
> > > > - recoverable and unrecoverable faults
> > > > - VFIO and other other in kernel users
> > > > - DMA & IRQ remapping (TBD)
> > > > 
> > > > The event data contains both generic and raw architectural data
> > > > such that performance is not compromised as the data
> > > > propagation may involve many layers.
> > > > 
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > > ---
> > > >  drivers/iommu/iommu.c | 63
> > > > +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > include/linux/iommu.h | 54
> > > > +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 117
> > > > insertions(+)
> > > > 
> > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > index c786370..04c73f3 100644
> > > > --- a/drivers/iommu/iommu.c
> > > > +++ b/drivers/iommu/iommu.c
> > > > @@ -48,6 +48,7 @@ struct iommu_group {
> > > >  	struct list_head devices;
> > > >  	struct mutex mutex;
> > > >  	struct blocking_notifier_head notifier;
> > > > +	struct blocking_notifier_head fault_notifier;
> > > >  	void *iommu_data;
> > > >  	void (*iommu_data_release)(void *iommu_data);
> > > >  	char *name;
> > > > @@ -345,6 +346,7 @@ struct iommu_group *iommu_group_alloc(void)
> > > >  	mutex_init(&group->mutex);
> > > >  	INIT_LIST_HEAD(&group->devices);
> > > >  	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
> > > > +	BLOCKING_INIT_NOTIFIER_HEAD(&group->fault_notifier);
> > > >  
> > > >  	ret = ida_simple_get(&iommu_group_ida, 0, 0,
> > > > GFP_KERNEL); if (ret < 0) {
> > > > @@ -790,6 +792,67 @@ int iommu_group_unregister_notifier(struct
> > > > iommu_group *group,
> > > > EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); 
> > > >  /**
> > > > + * iommu_register_fault_notifier - Register a notifier for
> > > > fault reporting
> > > > + * @dev: device to notify fault events
> > > > + * @nb: notifier block to signal
> > > > + *
> > > > + */
> > > > +int iommu_register_fault_notifier(struct device *dev,
> > > > +				struct notifier_block *nb)
> > > > +{
> > > > +	int ret;
> > > > +	struct iommu_group *group = iommu_group_get(dev);
> > > > +
> > > > +	if (!group)
> > > > +		return -EINVAL;
> > > > +
> > > > +	ret =
> > > > blocking_notifier_chain_register(&group->fault_notifier, nb);
> > > > +	iommu_group_put(group);
> > > > +
> > > > +	return ret;
> > > > +
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(iommu_register_fault_notifier);
> > > > +
> > > > +/**
> > > > + * iommu_unregister_fault_notifier - Unregister a notifier for
> > > > fault reporting
> > > > + * @domain: the domain to watch
> > > > + * @nb: notifier block to signal
> > > > + *
> > > > + */
> > > > +int iommu_unregister_fault_notifier(struct device *dev,
> > > > +				  struct notifier_block *nb)
> > > > +{
> > > > +	int ret;
> > > > +	struct iommu_group *group = iommu_group_get(dev);
> > > > +
> > > > +	if (!group)
> > > > +		return -EINVAL;
> > > > +
> > > > +	ret =
> > > > blocking_notifier_chain_unregister(&group->fault_notifier, nb);
> > > > +	iommu_group_put(group);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(iommu_unregister_fault_notifier);      
> > > 
> > > 
> > > If the call chains are on the group, why do we register with a
> > > device? If I registered with a group, I'd know that I only need
> > > to register once per group, that's not clear here and may lead to
> > > callers of this getting multiple notifications, one for each
> > > device in a group. 
> > My thought was that since iommu_group is already part of struct
> > device, it is more efficient to share group level notifier than
> > introducing per device fault callbacks or notifiers.  
> 
> That's fine, but the interface should reflect the same.
>  
> > So you do need to register per device not just once for the group.
> > In most cases device with SVM has ACS, it will be 1:1 between
> > device and group.  
> 
> ACS at the endpoint is only one factor in determining grouping, we
> need only look at Intel Core processors to find root ports lacking
> ACS which will result in multiple devices per group as a very common
> case. 
> > When the fault event occurs, IOMMU can identify the offending device
> > and struct device is embedded in the event data sent to the device
> > driver or VFIO, so the receiver can filter out unwanted events.  
> 
> I think this is a poor design choice.  If the notification list is on
> the group then make the registration be on the group.  Otherwise the
> number of notifies seen by the caller is automatically multiplied by
> the number of devices they're using in the group.  The caller not only
> needs to disregard unintended devices, but they need to make sure they
> only act on a notification for the device on the correct notifier.
> This is designing in unnecessary complexity and overhead.
> 
If the design is for high frequency events, I would totally agree with
you that having notifier on the group does not scale. But since this is
for fault notification, which implies infrequent use, is it still
worthwhile to tax strict device to have a per device notifier?
Even with recoverable fault, i.e. page request, a properly configured
use case would pre-fault pages to minimize PRQ.

I was referring to this previous discussion by David for the need of
per device fault handler(https://lwn.net/Articles/608914/). Perhaps I
misinterpreted your suggestion for an alternative solution.

Thanks,

Jacob

> > > > +
> > > > +int iommu_fault_notifier_call_chain(struct iommu_fault_event
> > > > *event) +{
> > > > +	int ret;
> > > > +	struct iommu_group *group =
> > > > iommu_group_get(event->dev); +
> > > > +	if (!group)
> > > > +		return -EINVAL;
> > > > +	/* caller provide generic data related to the event,
> > > > TBD */
> > > > +	ret =
> > > > (blocking_notifier_call_chain(&group->fault_notifier, 0, (void
> > > > *)event)
> > > > +		== NOTIFY_BAD) ? -EINVAL : 0;
> > > > +	iommu_group_put(group);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL(iommu_fault_notifier_call_chain);
> > > > +
> > > > +/**
> > > >   * iommu_group_id - Return ID for a group
> > > >   * @group: the group to ID
> > > >   *
> > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > > index 2cdbaa3..fe89e88 100644
> > > > --- a/include/linux/iommu.h
> > > > +++ b/include/linux/iommu.h
> > > > @@ -42,6 +42,7 @@
> > > >   * if the IOMMU page table format is equivalent.
> > > >   */
> > > >  #define IOMMU_PRIV	(1 << 5)
> > > > +#define IOMMU_EXEC	(1 << 6)      
> > > 
> > > Irrelevant change?
> > >     
> > yes, it should be separate. used later in svm patch.  
> > > >  
> > > >  struct iommu_ops;
> > > >  struct iommu_group;
> > > > @@ -97,6 +98,36 @@ struct iommu_domain {
> > > >  	void *iova_cookie;
> > > >  };
> > > >  
> > > > +/*
> > > > + * Generic fault event notification data, used by all IOMMU
> > > > architectures.
> > > > + *
> > > > + * - PCI and non-PCI devices
> > > > + * - Recoverable faults (e.g. page request) & un-recoverable
> > > > faults
> > > > + * - DMA remapping and IRQ remapping faults
> > > > + *
> > > > + * @dev The device which faults are reported by IOMMU
> > > > + * @addr tells the offending address
> > > > + * @pasid contains process address space ID, used in shared
> > > > virtual memory (SVM)
> > > > + * @prot page access protection flag, e.g. IOMMU_READ,
> > > > IOMMU_WRITE
> > > > + * @flags contains fault type, etc.
> > > > + * @length tells the size of the buf
> > > > + * @buf contains any raw or arch specific data
> > > > + *
> > > > + */
> > > > +struct iommu_fault_event {
> > > > +	struct device *dev;
> > > > +	__u64 addr;
> > > > +	__u32 pasid;
> > > > +	__u32 prot;
> > > > +	__u32 flags;
> > > > +#define IOMMU_FAULT_PAGE_REQ	BIT(0)
> > > > +#define IOMMU_FAULT_UNRECOV	BIT(1)
> > > > +#define IOMMU_FAULT_IRQ_REMAP	BIT(2)
> > > > +#define IOMMU_FAULT_INVAL	BIT(3)      
> > > 
> > > Details of each of these are defined where?
> > >     
> > good point, will add description.  
> > > > +	__u32 length;
> > > > +	__u8  buf[];
> > > > +};      
> > > 
> > > This is not UAPI, so I'll be curious how vfio is supposed to
> > > expose this to userspace.
> > >     
> > But this also has other in-kernel users.  
> 
> 
> Any point where you expect vfio to pass through one of these model
> specific data structures needs to be fully specified in a uapi header.
> No magic opaque data please.  Thanks,
> 
> Alex

[Jacob Pan]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC 5/9] iommu: Introduce fault notifier API
  2017-06-26 15:27         ` Jacob Pan
@ 2017-06-26 15:32           ` Alex Williamson
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Williamson @ 2017-06-26 15:32 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Liu, Yi L,
	Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare

On Mon, 26 Jun 2017 08:27:52 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> On Fri, 23 Jun 2017 13:15:51 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Fri, 23 Jun 2017 11:59:28 -0700
> > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> >   
> > > On Thu, 22 Jun 2017 16:53:17 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > >     
> > > > On Wed, 14 Jun 2017 15:22:59 -0700
> > > > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> > > >       
> > > > > Traditionally, device specific faults are detected and handled
> > > > > within their own device drivers. When IOMMU is enabled, faults
> > > > > such as DMA related transactions are detected by IOMMU. There
> > > > > is no generic reporting mechanism to report faults back to the
> > > > > in-kernel device driver or the guest OS in case of assigned
> > > > > devices.
> > > > > 
> > > > > Faults detected by IOMMU is based on the transaction's source ID
> > > > > which can be reported at per device basis, regardless of the
> > > > > device type is a PCI device or not.
> > > > > 
> > > > > The fault types includes recoverable (e.g. page request) and
> > > > > unrecoverable faults(e.g. invalid context). In most cases,
> > > > > faults can be handled by IOMMU drivers. However, there are use
> > > > > cases that require fault processing outside IOMMU driver, e.g.
> > > > > 
> > > > > 1. page request fault originated from an SVM capable device
> > > > > that is assigned to guest via vIOMMU. In this case, the first
> > > > > level page tables are owned by the guest. Page request must be
> > > > > propagated to the guest to let guest OS fault in the pages then
> > > > > send page response. In this mechanism, the direct receiver of
> > > > > IOMMU fault notification is VFIO, which can relay notification
> > > > > events to QEMU or other user space software.
> > > > > 
> > > > > 2. faults need more subtle handling by device drivers. Other
> > > > > than simply invoke reset function, there are needs to let
> > > > > device driver handle the fault with a smaller impact.
> > > > > 
> > > > > This patchset is intended to create a generic fault notification
> > > > > API such that it can scale as follows:
> > > > > - all IOMMU types
> > > > > - PCI and non-PCI devices
> > > > > - recoverable and unrecoverable faults
> > > > > - VFIO and other other in kernel users
> > > > > - DMA & IRQ remapping (TBD)
> > > > > 
> > > > > The event data contains both generic and raw architectural data
> > > > > such that performance is not compromised as the data
> > > > > propagation may involve many layers.
> > > > > 
> > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > > > ---
> > > > >  drivers/iommu/iommu.c | 63
> > > > > +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > include/linux/iommu.h | 54
> > > > > +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 117
> > > > > insertions(+)
> > > > > 
> > > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > > index c786370..04c73f3 100644
> > > > > --- a/drivers/iommu/iommu.c
> > > > > +++ b/drivers/iommu/iommu.c
> > > > > @@ -48,6 +48,7 @@ struct iommu_group {
> > > > >  	struct list_head devices;
> > > > >  	struct mutex mutex;
> > > > >  	struct blocking_notifier_head notifier;
> > > > > +	struct blocking_notifier_head fault_notifier;
> > > > >  	void *iommu_data;
> > > > >  	void (*iommu_data_release)(void *iommu_data);
> > > > >  	char *name;
> > > > > @@ -345,6 +346,7 @@ struct iommu_group *iommu_group_alloc(void)
> > > > >  	mutex_init(&group->mutex);
> > > > >  	INIT_LIST_HEAD(&group->devices);
> > > > >  	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
> > > > > +	BLOCKING_INIT_NOTIFIER_HEAD(&group->fault_notifier);
> > > > >  
> > > > >  	ret = ida_simple_get(&iommu_group_ida, 0, 0,
> > > > > GFP_KERNEL); if (ret < 0) {
> > > > > @@ -790,6 +792,67 @@ int iommu_group_unregister_notifier(struct
> > > > > iommu_group *group,
> > > > > EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); 
> > > > >  /**
> > > > > + * iommu_register_fault_notifier - Register a notifier for
> > > > > fault reporting
> > > > > + * @dev: device to notify fault events
> > > > > + * @nb: notifier block to signal
> > > > > + *
> > > > > + */
> > > > > +int iommu_register_fault_notifier(struct device *dev,
> > > > > +				struct notifier_block *nb)
> > > > > +{
> > > > > +	int ret;
> > > > > +	struct iommu_group *group = iommu_group_get(dev);
> > > > > +
> > > > > +	if (!group)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	ret =
> > > > > blocking_notifier_chain_register(&group->fault_notifier, nb);
> > > > > +	iommu_group_put(group);
> > > > > +
> > > > > +	return ret;
> > > > > +
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(iommu_register_fault_notifier);
> > > > > +
> > > > > +/**
> > > > > + * iommu_unregister_fault_notifier - Unregister a notifier for
> > > > > fault reporting
> > > > > + * @domain: the domain to watch
> > > > > + * @nb: notifier block to signal
> > > > > + *
> > > > > + */
> > > > > +int iommu_unregister_fault_notifier(struct device *dev,
> > > > > +				  struct notifier_block *nb)
> > > > > +{
> > > > > +	int ret;
> > > > > +	struct iommu_group *group = iommu_group_get(dev);
> > > > > +
> > > > > +	if (!group)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	ret =
> > > > > blocking_notifier_chain_unregister(&group->fault_notifier, nb);
> > > > > +	iommu_group_put(group);
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(iommu_unregister_fault_notifier);        
> > > > 
> > > > 
> > > > If the call chains are on the group, why do we register with a
> > > > device? If I registered with a group, I'd know that I only need
> > > > to register once per group, that's not clear here and may lead to
> > > > callers of this getting multiple notifications, one for each
> > > > device in a group.   
> > > My thought was that since iommu_group is already part of struct
> > > device, it is more efficient to share group level notifier than
> > > introducing per device fault callbacks or notifiers.    
> > 
> > That's fine, but the interface should reflect the same.
> >    
> > > So you do need to register per device not just once for the group.
> > > In most cases device with SVM has ACS, it will be 1:1 between
> > > device and group.    
> > 
> > ACS at the endpoint is only one factor in determining grouping, we
> > need only look at Intel Core processors to find root ports lacking
> > ACS which will result in multiple devices per group as a very common
> > case.   
> > > When the fault event occurs, IOMMU can identify the offending device
> > > and struct device is embedded in the event data sent to the device
> > > driver or VFIO, so the receiver can filter out unwanted events.    
> > 
> > I think this is a poor design choice.  If the notification list is on
> > the group then make the registration be on the group.  Otherwise the
> > number of notifies seen by the caller is automatically multiplied by
> > the number of devices they're using in the group.  The caller not only
> > needs to disregard unintended devices, but they need to make sure they
> > only act on a notification for the device on the correct notifier.
> > This is designing in unnecessary complexity and overhead.
> >   
> If the design is for high frequency events, I would totally agree with
> you that having notifier on the group does not scale. But since this is
> for fault notification, which implies infrequent use, is it still
> worthwhile to tax strict device to have a per device notifier?
> Even with recoverable fault, i.e. page request, a properly configured
> use case would pre-fault pages to minimize PRQ.
> 
> I was referring to this previous discussion by David for the need of
> per device fault handler(https://lwn.net/Articles/608914/). Perhaps I
> misinterpreted your suggestion for an alternative solution.

I'm not suggesting per device notifiers, I'm suggesting that if the
notifier is per group then make the registration on the group, not the
device.  The interface is too complicated to use with too many
apparently duplicate notifications otherwise.  Thanks,

Alex

> > > > > +
> > > > > +int iommu_fault_notifier_call_chain(struct iommu_fault_event
> > > > > *event) +{
> > > > > +	int ret;
> > > > > +	struct iommu_group *group =
> > > > > iommu_group_get(event->dev); +
> > > > > +	if (!group)
> > > > > +		return -EINVAL;
> > > > > +	/* caller provide generic data related to the event,
> > > > > TBD */
> > > > > +	ret =
> > > > > (blocking_notifier_call_chain(&group->fault_notifier, 0, (void
> > > > > *)event)
> > > > > +		== NOTIFY_BAD) ? -EINVAL : 0;
> > > > > +	iommu_group_put(group);
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > > +EXPORT_SYMBOL(iommu_fault_notifier_call_chain);
> > > > > +
> > > > > +/**
> > > > >   * iommu_group_id - Return ID for a group
> > > > >   * @group: the group to ID
> > > > >   *
> > > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > > > index 2cdbaa3..fe89e88 100644
> > > > > --- a/include/linux/iommu.h
> > > > > +++ b/include/linux/iommu.h
> > > > > @@ -42,6 +42,7 @@
> > > > >   * if the IOMMU page table format is equivalent.
> > > > >   */
> > > > >  #define IOMMU_PRIV	(1 << 5)
> > > > > +#define IOMMU_EXEC	(1 << 6)        
> > > > 
> > > > Irrelevant change?
> > > >       
> > > yes, it should be separate. used later in svm patch.    
> > > > >  
> > > > >  struct iommu_ops;
> > > > >  struct iommu_group;
> > > > > @@ -97,6 +98,36 @@ struct iommu_domain {
> > > > >  	void *iova_cookie;
> > > > >  };
> > > > >  
> > > > > +/*
> > > > > + * Generic fault event notification data, used by all IOMMU
> > > > > architectures.
> > > > > + *
> > > > > + * - PCI and non-PCI devices
> > > > > + * - Recoverable faults (e.g. page request) & un-recoverable
> > > > > faults
> > > > > + * - DMA remapping and IRQ remapping faults
> > > > > + *
> > > > > + * @dev The device which faults are reported by IOMMU
> > > > > + * @addr tells the offending address
> > > > > + * @pasid contains process address space ID, used in shared
> > > > > virtual memory (SVM)
> > > > > + * @prot page access protection flag, e.g. IOMMU_READ,
> > > > > IOMMU_WRITE
> > > > > + * @flags contains fault type, etc.
> > > > > + * @length tells the size of the buf
> > > > > + * @buf contains any raw or arch specific data
> > > > > + *
> > > > > + */
> > > > > +struct iommu_fault_event {
> > > > > +	struct device *dev;
> > > > > +	__u64 addr;
> > > > > +	__u32 pasid;
> > > > > +	__u32 prot;
> > > > > +	__u32 flags;
> > > > > +#define IOMMU_FAULT_PAGE_REQ	BIT(0)
> > > > > +#define IOMMU_FAULT_UNRECOV	BIT(1)
> > > > > +#define IOMMU_FAULT_IRQ_REMAP	BIT(2)
> > > > > +#define IOMMU_FAULT_INVAL	BIT(3)        
> > > > 
> > > > Details of each of these are defined where?
> > > >       
> > > good point, will add description.    
> > > > > +	__u32 length;
> > > > > +	__u8  buf[];
> > > > > +};        
> > > > 
> > > > This is not UAPI, so I'll be curious how vfio is supposed to
> > > > expose this to userspace.
> > > >       
> > > But this also has other in-kernel users.    
> > 
> > 
> > Any point where you expect vfio to pass through one of these model
> > specific data structures needs to be fully specified in a uapi header.
> > No magic opaque data please.  Thanks,
> > 
> > Alex  
> 
> [Jacob Pan]

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-06-26 15:32 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-14 22:22 [RFC 0/9] IOMMU driver changes for shared virtual memory virtualization Jacob Pan
2017-06-14 22:22 ` [RFC 1/9] iommu: Introduce bind_pasid_table API function Jacob Pan
2017-06-22 22:52   ` Alex Williamson
2017-06-23 18:20     ` Jacob Pan
2017-06-14 22:22 ` [RFC 2/9] iommu/vt-d: add bind_pasid_table function Jacob Pan
2017-06-22 22:52   ` Alex Williamson
2017-06-23 18:19     ` Jacob Pan
2017-06-23 18:59       ` Alex Williamson
2017-06-23 20:21         ` Jacob Pan
2017-06-14 22:22 ` [RFC 3/9] iommu: Introduce iommu do invalidate API function Jacob Pan
2017-06-22 22:52   ` Alex Williamson
2017-06-14 22:22 ` [RFC 4/9] iommu/vt-d: Add iommu do invalidate function Jacob Pan
2017-06-22 22:52   ` Alex Williamson
2017-06-14 22:22 ` [RFC 5/9] iommu: Introduce fault notifier API Jacob Pan
2017-06-22 22:53   ` Alex Williamson
2017-06-23 18:59     ` Jacob Pan
2017-06-23 19:15       ` Alex Williamson
2017-06-26 15:27         ` Jacob Pan
2017-06-26 15:32           ` Alex Williamson
2017-06-14 22:23 ` [RFC 6/9] iommu/vt-d: track device with pasid table bond to a guest Jacob Pan
2017-06-22 22:54   ` Alex Williamson
2017-06-14 22:23 ` [RFC 7/9] iommu/dmar: notify unrecoverable faults Jacob Pan
2017-06-22 22:54   ` Alex Williamson
2017-06-23 20:19     ` Jacob Pan
2017-06-14 22:23 ` [RFC 8/9] iommu/intel-svm: notify page request to guest Jacob Pan
2017-06-22 22:53   ` Alex Williamson
2017-06-23 20:16     ` Jacob Pan
2017-06-23 20:34       ` Alex Williamson
2017-06-23 21:33         ` Jacob Pan
2017-06-14 22:23 ` [RFC 9/9] iommu/intel-svm: replace dev ops with generic fault notifier Jacob Pan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).