LKML Archive on lore.kernel.org
* [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA)
@ 2018-05-11 20:53 Jacob Pan
  2018-05-11 20:53 ` [PATCH v5 01/23] iommu: introduce bind_pasid_table API function Jacob Pan
                   ` (23 more replies)
  0 siblings, 24 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:53 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

Shared virtual address (SVA), a.k.a. shared virtual memory (SVM) on Intel
platforms, allows address space sharing between device DMA and applications.
SVA can reduce programming complexity and enhance security. To enable SVA
in the guest, i.e. to share a guest application address space with physical
device DMA, the IOMMU driver must provide some new functionality.

This patchset is a follow-up on the discussions held at LPC 2017
VFIO/IOMMU/PCI track. Slides and notes can be found here:
https://linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/636

Complete guest SVA support also involves changes in QEMU and VFIO,
which have been posted earlier:
https://www.spinics.net/lists/kvm/msg148798.html

This series is the IOMMU portion of a more complete set of kernel
changes to support vSVA. Please refer to the link below for more
details. https://www.spinics.net/lists/kvm/msg148819.html

Generic APIs are introduced in addition to Intel VT-d specific changes.
The goal is to have common interfaces across IOMMU and device types for
both VFIO and other in-kernel users.

At the top level, new IOMMU interfaces are introduced as follows:
 - bind guest PASID table
 - pass down invalidations of translation caches
 - IOMMU device fault reporting including page request/response and
   non-recoverable faults.

For IOMMU detected device fault reporting, struct device is extended to
provide callback and tracking at device level. The original proposal was
discussed in "Error handling for I/O memory management units"
(https://lwn.net/Articles/608914/). I have experimented with two alternative
solutions:
1. Use a shared group notifier. This does not scale well and also causes
unwanted notification traffic when a sibling device in the group reports faults.
2. Place the fault callback in device IOMMU arch data, e.g. device_domain_info
in the Intel/FSL IOMMU drivers. This causes code duplication, since per
device fault reporting is generic.

The additional patches are Intel VT-d specific; they either implement the
generic interfaces or replace existing private interfaces with them.

This patchset is based on the work and ideas from many people, especially:
Ashok Raj <ashok.raj@intel.com>
Liu, Yi L <yi.l.liu@linux.intel.com>
Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Thanks,

Jacob

V5
	- Removed device context cache and non-pasid TLB invalidation type
	- Simplified and sorted granularities for the remaining TLB
	invalidation types, per discussion and review by Jean-Philippe Brucker.
	- Added a setup parameter for page response timeout
	- Added version and size checking in bind PASID and invalidation APIs
	- Fixed locking and error handling in device fault reporting API based
	  on Jean's review

V4
	- Further integrate feedback for iommu_param and iommu_fault_param
	  from Jean and others.
	- Handle fault reporting error and race conditions. Keep tracking per
	  device pending page requests such that page group response can be
	  sanitized.
	- Added a timer to handle an unresponsive guest that does not send a
	  page response in time.
	- Use a workqueue for VT-d non-recoverable IRQ fault handling.
	- Added trace events for invalidation and fault reporting.
V3
	- Consolidated fault reporting data format based on discussions on v2,
	  including input from ARM and AMD.
	- Renamed invalidation APIs from svm to sva based on discussions on v2
	- Use a parent pointer under struct device for all iommu per device data
	- Simplified device fault callback, allow driver private data to be
	  registered. This might make it easy to replace domain fault handler.
V2
	- Replaced hybrid interface data model (generic data + vendor specific
	data) with all generic data. This will have the security benefit where
	data passed from user space can be sanitized by all software layers if
	needed.
	- Addressed review comments from V1
	- Use per device fault report data
	- Support page request/response communications between host IOMMU and
	guest or other in-kernel users.
	- Added unrecoverable fault reporting to DMAR
	- Use threaded IRQ function for DMAR fault interrupt and fault
	  reporting

Jacob Pan (22):
  iommu: introduce bind_pasid_table API function
  iommu/vt-d: move device_domain_info to header
  iommu/vt-d: add a flag for pasid table bound status
  iommu/vt-d: add bind_pasid_table function
  iommu/vt-d: add definitions for PFSID
  iommu/vt-d: fix dev iotlb pfsid use
  iommu/vt-d: support flushing more translation cache types
  iommu/vt-d: add svm/sva invalidate function
  iommu: introduce device fault data
  driver core: add per device iommu param
  iommu: add a timeout parameter for prq response
  iommu: introduce device fault report API
  iommu: introduce page response function
  iommu: handle page response timeout
  iommu/config: add build dependency for dmar
  iommu/vt-d: report non-recoverable faults to device
  iommu/intel-svm: report device page request
  iommu/intel-svm: replace dev ops with fault report API
  iommu/intel-svm: do not flush iotlb for viommu
  iommu/vt-d: add intel iommu page response function
  trace/iommu: add sva trace events
  iommu: use sva invalidate and device fault trace event

Liu, Yi L (1):
  iommu: introduce iommu invalidate API function

 Documentation/admin-guide/kernel-parameters.txt |   8 +
 drivers/iommu/Kconfig                           |   1 +
 drivers/iommu/dmar.c                            | 209 ++++++++++++++-
 drivers/iommu/intel-iommu.c                     | 338 ++++++++++++++++++++++--
 drivers/iommu/intel-svm.c                       |  84 ++++--
 drivers/iommu/iommu.c                           | 311 +++++++++++++++++++++-
 include/linux/device.h                          |   3 +
 include/linux/dma_remapping.h                   |   1 +
 include/linux/dmar.h                            |   2 +-
 include/linux/intel-iommu.h                     |  52 +++-
 include/linux/intel-svm.h                       |  20 +-
 include/linux/iommu.h                           | 216 ++++++++++++++-
 include/trace/events/iommu.h                    | 112 ++++++++
 include/uapi/linux/iommu.h                      | 124 +++++++++
 14 files changed, 1409 insertions(+), 72 deletions(-)
 create mode 100644 include/uapi/linux/iommu.h

-- 
2.7.4


* [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
@ 2018-05-11 20:53 ` Jacob Pan
  2018-08-23 16:34   ` Auger Eric
  2018-08-24 15:00   ` Auger Eric
  2018-05-11 20:53 ` [PATCH v5 02/23] iommu/vt-d: move device_domain_info to header Jacob Pan
                   ` (22 subsequent siblings)
  23 siblings, 2 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:53 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan, Liu, Yi L

Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
use in the guest:
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html

As part of the proposed architecture, when an SVM capable PCI
device is assigned to a guest, nested mode is turned on. The guest owns the
first level page tables (requests with PASID), which perform GVA->GPA
translation. Second level page tables are owned by the host and perform
GPA->HPA translation for requests both with and without PASID.
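
The two-stage lookup described above can be pictured with a toy model. This
is only an illustrative sketch, not how the hardware or this series
implements it: the flat per-page tables and all toy_* names are
hypothetical, and a real walk uses multi-level page tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12
#define TOY_NR_PAGES	4
#define TOY_INVALID	UINT64_MAX

/* Hypothetical one-level "page table": index is input PFN, value is output PFN */
struct toy_stage {
	uint64_t pfn[TOY_NR_PAGES];
};

/* Translate one address through a single stage, or return TOY_INVALID */
static uint64_t toy_stage_walk(const struct toy_stage *st, uint64_t addr)
{
	uint64_t idx = addr >> TOY_PAGE_SHIFT;
	uint64_t offset = addr & ((1ULL << TOY_PAGE_SHIFT) - 1);

	if (idx >= TOY_NR_PAGES || st->pfn[idx] == TOY_INVALID)
		return TOY_INVALID;
	return (st->pfn[idx] << TOY_PAGE_SHIFT) | offset;
}

/*
 * Nested translation: the guest-owned first stage maps GVA->GPA and the
 * host-owned second stage maps GPA->HPA. A request without PASID is
 * modelled by passing first == NULL, so the input is treated as a GPA.
 */
static uint64_t toy_nested_translate(const struct toy_stage *first,
				     const struct toy_stage *second,
				     uint64_t addr)
{
	uint64_t gpa = addr;

	assert(second != NULL);
	if (first) {
		gpa = toy_stage_walk(first, addr);
		if (gpa == TOY_INVALID)
			return TOY_INVALID;
	}
	return toy_stage_walk(second, gpa);
}
```

In this model a fault in either stage aborts the walk, which is why the host
needs the guest's invalidations and page requests passed down later in the
series.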

A new IOMMU driver interface is therefore needed to perform tasks as
follows:
* Enable nested translation and appropriate translation type
* Assign guest PASID table pointer (in GPA) and size to host IOMMU

This patch introduces new API functions to perform bind/unbind guest PASID
tables. Based on common data, model specific IOMMU drivers can be extended
to perform the specific steps for binding pasid table of assigned devices.
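
As a rough illustration of the common data, a caller might fill in the
structure as sketched below. This is a userspace mirror for illustration
only: the stdint types stand in for __u32/__u64/__u8, and the two helper
functions are hypothetical names that are not part of this patch. The
version/size check mirrors the one the VT-d driver performs in patch 04.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of struct pasid_table_config from include/uapi/linux/iommu.h */
struct pasid_table_config {
	uint32_t version;
#define PASID_TABLE_CFG_VERSION_1 1
	uint32_t bytes;
	uint64_t base_ptr;
	uint8_t pasid_bits;
};

/* Hypothetical helper: describe a guest PASID table located at table_gpa */
static void pasid_table_config_init(struct pasid_table_config *cfg,
				    uint64_t table_gpa, uint8_t pasid_bits)
{
	assert(cfg != NULL);
	memset(cfg, 0, sizeof(*cfg));
	cfg->version = PASID_TABLE_CFG_VERSION_1;
	cfg->bytes = sizeof(*cfg);
	cfg->base_ptr = table_gpa;
	cfg->pasid_bits = pasid_bits;
}

/* Hypothetical check: reject a NULL pointer or an unknown version/size */
static int pasid_table_config_check(const struct pasid_table_config *cfg)
{
	if (!cfg)
		return -EINVAL;
	if (cfg->version != PASID_TABLE_CFG_VERSION_1 ||
	    cfg->bytes != sizeof(*cfg))
		return -EINVAL;
	return 0;
}
```

An in-kernel user such as VFIO would then pass the populated structure to
iommu_bind_pasid_table() together with the domain and device.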

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/iommu.c      | 19 +++++++++++++++++++
 include/linux/iommu.h      | 24 ++++++++++++++++++++++++
 include/uapi/linux/iommu.h | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 76 insertions(+)
 create mode 100644 include/uapi/linux/iommu.h

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d2aa2320..3a69620 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1325,6 +1325,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device);
 
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+			struct pasid_table_config *pasidt_binfo)
+{
+	if (unlikely(!domain->ops->bind_pasid_table))
+		return -ENODEV;
+
+	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
+
+void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+	if (unlikely(!domain->ops->unbind_pasid_table))
+		return;
+
+	domain->ops->unbind_pasid_table(domain, dev);
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 19938ee..5199ca4 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -25,6 +25,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/of.h>
+#include <uapi/linux/iommu.h>
 
 #define IOMMU_READ	(1 << 0)
 #define IOMMU_WRITE	(1 << 1)
@@ -187,6 +188,8 @@ struct iommu_resv_region {
  * @domain_get_windows: Return the number of windows for a domain
  * @of_xlate: add OF master IDs to iommu grouping
  * @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @bind_pasid_table: bind pasid table pointer for guest SVM
+ * @unbind_pasid_table: unbind pasid table pointer and restore defaults
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -233,8 +236,14 @@ struct iommu_ops {
 	u32 (*domain_get_windows)(struct iommu_domain *domain);
 
 	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
+
 	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
 
+	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
+				struct pasid_table_config *pasidt_binfo);
+	void (*unbind_pasid_table)(struct iommu_domain *domain,
+				struct device *dev);
+
 	unsigned long pgsize_bitmap;
 };
 
@@ -296,6 +305,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
 extern void iommu_detach_device(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_bind_pasid_table(struct iommu_domain *domain,
+		struct device *dev, struct pasid_table_config *pasidt_binfo);
+extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
+				struct device *dev);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
@@ -696,6 +709,17 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
 	return NULL;
 }
 
+static inline
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+			struct pasid_table_config *pasidt_binfo)
+{
+	return -ENODEV;
+}
+static inline
+void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
new file mode 100644
index 0000000..cb2d625
--- /dev/null
+++ b/include/uapi/linux/iommu.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _UAPI_IOMMU_H
+#define _UAPI_IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * PASID table data used to bind guest PASID table to the host IOMMU. This will
+ * enable guest managed first level page tables.
+ * @version: for future extensions and identification of the data format
+ * @bytes: size of this structure
+ * @base_ptr:	PASID table pointer
+ * @pasid_bits:	number of bits supported in the guest PASID table, must be less
+ *		than or equal to the host supported PASID size.
+ */
+struct pasid_table_config {
+	__u32 version;
+#define PASID_TABLE_CFG_VERSION_1 1
+	__u32 bytes;
+	__u64 base_ptr;
+	__u8 pasid_bits;
+};
+
+#endif /* _UAPI_IOMMU_H */
-- 
2.7.4


* [PATCH v5 02/23] iommu/vt-d: move device_domain_info to header
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
  2018-05-11 20:53 ` [PATCH v5 01/23] iommu: introduce bind_pasid_table API function Jacob Pan
@ 2018-05-11 20:53 ` Jacob Pan
  2018-05-11 20:53 ` [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status Jacob Pan
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:53 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

Allow both intel-iommu.c and dmar.c to access device_domain_info.
Prepare for additional per device arch data used in the TLB flush function.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 18 ------------------
 include/linux/intel-iommu.h | 19 +++++++++++++++++++
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d60b2fb..a0f81a4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -391,24 +391,6 @@ struct dmar_domain {
 					   iommu core */
 };
 
-/* PCI domain-device relationship */
-struct device_domain_info {
-	struct list_head link;	/* link to domain siblings */
-	struct list_head global; /* link to global list */
-	u8 bus;			/* PCI bus number */
-	u8 devfn;		/* PCI devfn number */
-	u8 pasid_supported:3;
-	u8 pasid_enabled:1;
-	u8 pri_supported:1;
-	u8 pri_enabled:1;
-	u8 ats_supported:1;
-	u8 ats_enabled:1;
-	u8 ats_qdep;
-	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
-	struct intel_iommu *iommu; /* IOMMU used by this device */
-	struct dmar_domain *domain; /* pointer to domain */
-};
-
 struct dmar_rmrr_unit {
 	struct list_head list;		/* list of rmrr units	*/
 	struct acpi_dmar_header *hdr;	/* ACPI header		*/
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index eec4827..304afae 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -461,6 +461,25 @@ struct intel_iommu {
 	u32		flags;      /* Software defined flags */
 };
 
+/* PCI domain-device relationship */
+struct device_domain_info {
+	struct list_head link;	/* link to domain siblings */
+	struct list_head global; /* link to global list */
+	u8 bus;			/* PCI bus number */
+	u8 devfn;		/* PCI devfn number */
+	u8 pasid_supported:3;
+	u8 pasid_enabled:1;
+	u8 pri_supported:1;
+	u8 pri_enabled:1;
+	u8 ats_supported:1;
+	u8 ats_enabled:1;
+	u8 ats_qdep;
+	u64 fault_mask;	/* selected IOMMU faults to be reported */
+	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
+	struct intel_iommu *iommu; /* IOMMU used by this device */
+	struct dmar_domain *domain; /* pointer to domain */
+};
+
 static inline void __iommu_flush_cache(
 	struct intel_iommu *iommu, void *addr, int size)
 {
-- 
2.7.4


* [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
  2018-05-11 20:53 ` [PATCH v5 01/23] iommu: introduce bind_pasid_table API function Jacob Pan
  2018-05-11 20:53 ` [PATCH v5 02/23] iommu/vt-d: move device_domain_info to header Jacob Pan
@ 2018-05-11 20:53 ` Jacob Pan
  2018-05-13  7:33   ` Lu Baolu
  2018-05-13  8:01   ` Lu Baolu
  2018-05-11 20:53 ` [PATCH v5 04/23] iommu/vt-d: add bind_pasid_table function Jacob Pan
                   ` (20 subsequent siblings)
  23 siblings, 2 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:53 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

Add a flag in device domain info to track whether a guest or
user PASID table is bound to a device.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/linux/intel-iommu.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 304afae..ddc7d79 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -473,6 +473,7 @@ struct device_domain_info {
 	u8 pri_enabled:1;
 	u8 ats_supported:1;
 	u8 ats_enabled:1;
+	u8 pasid_table_bound:1;
 	u8 ats_qdep;
 	u64 fault_mask;	/* selected IOMMU faults to be reported */
 	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
-- 
2.7.4


* [PATCH v5 04/23] iommu/vt-d: add bind_pasid_table function
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (2 preceding siblings ...)
  2018-05-11 20:53 ` [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status Jacob Pan
@ 2018-05-11 20:53 ` Jacob Pan
  2018-05-13  9:29   ` Lu Baolu
  2018-05-11 20:53 ` [PATCH v5 05/23] iommu: introduce iommu invalidate API function Jacob Pan
                   ` (19 subsequent siblings)
  23 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:53 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan, Liu, Yi L

Add Intel VT-d ops to the generic iommu_bind_pasid_table API
functions.

The primary use case is direct assignment of an SVM capable
device. The request originates from the emulated IOMMU in the guest and goes
through many layers (e.g. VFIO). When calling the host IOMMU driver, the
caller passes the guest PASID table pointer (GPA) and size.

The device context table entry is modified by the Intel IOMMU specific
bind_pasid_table function. This turns on nesting mode and sets the matching
translation type.

The unbind operation restores default context mapping.
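
The size encoding used when the guest PASID table pointer is written into
the context entry can be sketched as follows. This is a standalone
illustration based on the "2^(x+5)" encoding noted in the code below;
encode_pasid_table_ptr() and SKETCH_PAGE_MASK are hypothetical names, and the
mask assumes the 4KiB VT-d page size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_NR_PASID_BITS	5
/* Assumption: 4KiB pages, so the low 12 bits hold no address bits */
#define SKETCH_PAGE_MASK	(~(uint64_t)0xfff)

/*
 * Hypothetical helper: combine the page-aligned guest PASID table address
 * with the size field x, where the table covers 2^(x + 5) PASIDs.
 * Returns -ERANGE if guest_bits falls outside [MIN_NR_PASID_BITS, host_bits].
 */
static int encode_pasid_table_ptr(uint64_t base_gpa, uint8_t guest_bits,
				  uint8_t host_bits, uint64_t *out)
{
	assert(out != NULL);
	if (guest_bits < MIN_NR_PASID_BITS || guest_bits > host_bits)
		return -ERANGE;

	*out = (base_gpa & SKETCH_PAGE_MASK) |
	       (uint64_t)(guest_bits - MIN_NR_PASID_BITS);
	return 0;
}
```

For example, a 20-bit guest PASID space is encoded as a size field of 15 in
the low bits of the pointer.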

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-iommu.c   | 122 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/dma_remapping.h |   1 +
 2 files changed, 123 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a0f81a4..4623294 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2409,6 +2409,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
 	info->ats_supported = info->pasid_supported = info->pri_supported = 0;
 	info->ats_enabled = info->pasid_enabled = info->pri_enabled = 0;
 	info->ats_qdep = 0;
+	info->pasid_table_bound = 0;
 	info->dev = dev;
 	info->domain = domain;
 	info->iommu = iommu;
@@ -5132,6 +5133,7 @@ static void intel_iommu_put_resv_regions(struct device *dev,
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
 #define MAX_NR_PASID_BITS (20)
+#define MIN_NR_PASID_BITS (5)
 static inline unsigned long intel_iommu_get_pts(struct intel_iommu *iommu)
 {
 	/*
@@ -5258,6 +5260,122 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
 
 	return iommu;
 }
+
+static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
+		struct device *dev, struct pasid_table_config *pasidt_binfo)
+{
+	struct intel_iommu *iommu;
+	struct context_entry *context;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	struct pci_dev *pdev;
+	u8 bus, devfn, host_table_pasid_bits;
+	u16 did, sid;
+	int ret = 0;
+	unsigned long flags;
+	u64 ctx_lo;
+
+	if ((pasidt_binfo->version != PASID_TABLE_CFG_VERSION_1) ||
+		pasidt_binfo->bytes != sizeof(*pasidt_binfo))
+		return -EINVAL;
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+	/* VT-d spec section 9.4 says pasid table size is encoded as 2^(x+5) */
+	host_table_pasid_bits = intel_iommu_get_pts(iommu) + MIN_NR_PASID_BITS;
+	if (!pasidt_binfo || pasidt_binfo->pasid_bits > host_table_pasid_bits ||
+		pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
+		pr_err("Invalid gPASID bits %d, host range %d - %d\n",
+			pasidt_binfo->pasid_bits,
+			MIN_NR_PASID_BITS, host_table_pasid_bits);
+		return -ERANGE;
+	}
+	if (!ecap_nest(iommu->ecap)) {
+		dev_err(dev, "Cannot bind PASID table, no nested translation\n");
+		ret = -ENODEV;
+		goto out;
+	}
+	pdev = to_pci_dev(dev);
+	sid = PCI_DEVID(bus, devfn);
+	info = dev->archdata.iommu;
+
+	if (!info) {
+		dev_err(dev, "Invalid device domain info\n");
+		ret = -EINVAL;
+		goto out;
+	}
+	if (info->pasid_table_bound) {
+		dev_err(dev, "Device PASID table already bound\n");
+		ret = -EBUSY;
+		goto out;
+	}
+	if (!info->pasid_enabled) {
+		ret = pci_enable_pasid(pdev, info->pasid_supported & ~1);
+		if (ret) {
+			dev_err(dev, "Failed to enable PASID\n");
+			goto out;
+		}
+	}
+	spin_lock_irqsave(&iommu->lock, flags);
+	context = iommu_context_addr(iommu, bus, devfn, 0);
+	if (!context_present(context)) {
+		dev_err(dev, "Context not present\n");
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Anticipate guest to use SVM and owns the first level, so we turn
+	 * nested mode on
+	 */
+	ctx_lo = context[0].lo;
+	ctx_lo |= CONTEXT_NESTE | CONTEXT_PRS | CONTEXT_PASIDE;
+	ctx_lo &= ~CONTEXT_TT_MASK;
+	ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
+	context[0].lo = ctx_lo;
+
+	/* Assign guest PASID table pointer and size order */
+	ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
+		(pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
+	context[1].lo = ctx_lo;
+	/* make sure context entry is updated before flushing */
+	wmb();
+	did = dmar_domain->iommu_did[iommu->seq_id];
+	iommu->flush.flush_context(iommu, did,
+				(((u16)bus) << 8) | devfn,
+				DMA_CCMD_MASK_NOBIT,
+				DMA_CCMD_DEVICE_INVL);
+	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
+	info->pasid_table_bound = 1;
+out_unlock:
+	spin_unlock_irqrestore(&iommu->lock, flags);
+out:
+	return ret;
+}
+
+static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
+					struct device *dev)
+{
+	struct intel_iommu *iommu;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	u8 bus, devfn;
+
+	info = dev->archdata.iommu;
+	if (!info) {
+		dev_err(dev, "Invalid device domain info\n");
+		return;
+	}
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu) {
+		dev_err(dev, "No IOMMU for device to unbind PASID table\n");
+		return;
+	}
+
+	domain_context_clear(iommu, dev);
+
+	domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
+	info->pasid_table_bound = 0;
+}
 #endif /* CONFIG_INTEL_IOMMU_SVM */
 
 const struct iommu_ops intel_iommu_ops = {
@@ -5266,6 +5384,10 @@ const struct iommu_ops intel_iommu_ops = {
 	.domain_free		= intel_iommu_domain_free,
 	.attach_dev		= intel_iommu_attach_device,
 	.detach_dev		= intel_iommu_detach_device,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+	.bind_pasid_table	= intel_iommu_bind_pasid_table,
+	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
+#endif
 	.map			= intel_iommu_map,
 	.unmap			= intel_iommu_unmap,
 	.map_sg			= default_iommu_map_sg,
diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
index 21b3e7d..db290b2 100644
--- a/include/linux/dma_remapping.h
+++ b/include/linux/dma_remapping.h
@@ -28,6 +28,7 @@
 
 #define CONTEXT_DINVE		(1ULL << 8)
 #define CONTEXT_PRS		(1ULL << 9)
+#define CONTEXT_NESTE		(1ULL << 10)
 #define CONTEXT_PASIDE		(1ULL << 11)
 
 struct intel_iommu;
-- 
2.7.4


* [PATCH v5 05/23] iommu: introduce iommu invalidate API function
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (3 preceding siblings ...)
  2018-05-11 20:53 ` [PATCH v5 04/23] iommu/vt-d: add bind_pasid_table function Jacob Pan
@ 2018-05-11 20:53 ` Jacob Pan
  2018-05-11 20:53 ` [PATCH v5 06/23] iommu/vt-d: add definitions for PFSID Jacob Pan
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:53 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Liu, Yi L, Liu, Jacob Pan

From: "Liu, Yi L" <yi.l.liu@linux.intel.com>

When an SVM capable device is assigned to a guest, the first level page
tables are owned by the guest and the guest PASID table pointer is
linked to the device context entry of the physical IOMMU.

The host IOMMU driver has no knowledge of caching structure updates unless
the guest invalidation activities are passed down to the host. The
primary usage is derived from the emulated IOMMU in the guest, where QEMU
can trap invalidation activities before passing them down to the
host/physical IOMMU.
Since the invalidation data are obtained from user space and will be
written into the physical IOMMU, we must allow security checks at various
layers. Therefore, a generic invalidation data format is proposed here;
model specific IOMMU drivers need to convert it into their own format.
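
One way a model specific driver could validate a generic request before
converting it is a small two-dimensional lookup over cache type and
granularity. The sketch below is an assumption about how such a check might
look, not code from this series: the enums mirror the ones added to the uapi
header, while inv_supported[] and inv_request_valid() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors enum iommu_inv_granularity from the uapi header in this patch */
enum iommu_inv_granularity {
	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
	IOMMU_INV_GRANU_PASID_SEL,
	IOMMU_INV_GRANU_PAGE_PASID,
	IOMMU_INV_NR_GRANU,
};

/* Mirrors enum iommu_inv_type from the uapi header in this patch */
enum iommu_inv_type {
	IOMMU_INV_TYPE_DTLB,
	IOMMU_INV_TYPE_TLB,
	IOMMU_INV_TYPE_PASID,
	IOMMU_INV_NR_TYPE
};

static_assert(IOMMU_INV_NR_GRANU == 3 && IOMMU_INV_NR_TYPE == 3,
	      "lookup table below assumes a 3x3 layout");

/* Hypothetical table: rows are granularities, columns are cache types */
static const bool inv_supported[IOMMU_INV_NR_GRANU][IOMMU_INV_NR_TYPE] = {
	/*                      DTLB   TLB    PASID cache */
	/* DOMAIN_ALL_PASID */ { true, true, true  },
	/* PASID_SEL        */ { true, true, true  },
	/* PAGE_PASID       */ { true, true, false },
};

/* Hypothetical check: reject out-of-range values and unsupported pairs */
static bool inv_request_valid(enum iommu_inv_type type,
			      enum iommu_inv_granularity granu)
{
	if ((unsigned int)type >= IOMMU_INV_NR_TYPE ||
	    (unsigned int)granu >= IOMMU_INV_NR_GRANU)
		return false;
	return inv_supported[granu][type];
}
```

A driver would typically keep a second table of the same shape that holds
its own model specific granularity values for the supported pairs.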

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c      | 14 +++++++
 include/linux/iommu.h      | 12 ++++++
 include/uapi/linux/iommu.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 117 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3a69620..784e019 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1344,6 +1344,20 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
 
+int iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	int ret = 0;
+
+	if (unlikely(!domain->ops->sva_invalidate))
+		return -ENODEV;
+
+	ret = domain->ops->sva_invalidate(domain, dev, inv_info);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 5199ca4..e8cadb6 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -190,6 +190,7 @@ struct iommu_resv_region {
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @bind_pasid_table: bind pasid table pointer for guest SVM
  * @unbind_pasid_table: unbind pasid table pointer and restore defaults
+ * @sva_invalidate: invalidate translation caches of shared virtual address
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -243,6 +244,8 @@ struct iommu_ops {
 				struct pasid_table_config *pasidt_binfo);
 	void (*unbind_pasid_table)(struct iommu_domain *domain,
 				struct device *dev);
+	int (*sva_invalidate)(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info);
 
 	unsigned long pgsize_bitmap;
 };
@@ -309,6 +312,9 @@ extern int iommu_bind_pasid_table(struct iommu_domain *domain,
 		struct device *dev, struct pasid_table_config *pasidt_binfo);
 extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info);
+
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
@@ -720,6 +726,12 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
 {
 }
 
+static inline int iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index cb2d625..79d93ef 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -30,4 +30,95 @@ struct pasid_table_config {
 	__u8 pasid_bits;
 };
 
+/**
+ * enum iommu_inv_granularity - Generic invalidation granularity
+ * @IOMMU_INV_GRANU_DOMAIN_ALL_PASID:	TLB entries or PASID caches of all
+ *					PASIDs associated with a domain ID
+ * @IOMMU_INV_GRANU_PASID_SEL:		TLB entries or PASID cache associated
+ *					with a PASID and a domain
+ * @IOMMU_INV_GRANU_PAGE_PASID:		TLB entries of selected page range
+ *					within a PASID
+ *
+ * When an invalidation request is passed down to IOMMU to flush translation
+ * caches, it may carry different granularity levels, which can be specific
+ * to certain types of translation caches.
+ * This enum is a collection of granularities for all types of translation
+ * caches. The idea is to make it easy for IOMMU model specific driver to
+ * convert from generic to model specific value. Each IOMMU driver
+ * can enforce check based on its own conversion table. The conversion is
+ * based on 2D look-up with inputs as follows:
+ * - translation cache types
+ * - granularity
+ *
+ *             type |   DTLB    |    TLB    |   PASID   |
+ *  granule         |           |           |   cache   |
+ * -----------------+-----------+-----------+-----------+
+ *  DN_ALL_PASID    |   Y       |   Y       |   Y       |
+ *  PASID_SEL       |   Y       |   Y       |   Y       |
+ *  PAGE_PASID      |   Y       |   Y       |   N/A     |
+ *
+ */
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_DOMAIN_ALL_PASID,
+	IOMMU_INV_GRANU_PASID_SEL,
+	IOMMU_INV_GRANU_PAGE_PASID,
+	IOMMU_INV_NR_GRANU,
+};
+
+/**
+ * enum iommu_inv_type - Generic translation cache types for invalidation
+ *
+ * @IOMMU_INV_TYPE_DTLB:	device IOTLB
+ * @IOMMU_INV_TYPE_TLB:		IOMMU paging structure cache
+ * @IOMMU_INV_TYPE_PASID:	PASID cache
+ * Invalidation requests sent to IOMMU for a given device need to indicate
+ * which type of translation cache to be operated on. Combined with enum
+ * iommu_inv_granularity, model specific driver can do a simple lookup to
+ * convert from generic to model specific value.
+ */
+enum iommu_inv_type {
+	IOMMU_INV_TYPE_DTLB,
+	IOMMU_INV_TYPE_TLB,
+	IOMMU_INV_TYPE_PASID,
+	IOMMU_INV_NR_TYPE
+};
+
+/**
+ * Translation cache invalidation header that contains mandatory meta data.
+ * @version:	info format version, expecting future extensions
+ * @type:	type of translation cache to be invalidated
+ */
+struct tlb_invalidate_hdr {
+	__u32 version;
+#define TLB_INV_HDR_VERSION_1 1
+	enum iommu_inv_type type;
+};
+
+/**
+ * Translation cache invalidation information, contains generic IOMMU
+ * data which can be parsed based on model ID by model specific drivers.
+ * Since the invalidation of second level page tables is included in the
+ * unmap operation, this info is only applicable to the first level
+ * translation caches, i.e. DMA request with PASID.
+ *
+ * @granularity:	requested invalidation granularity, type dependent
+ * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
+ * @nr_pages:		number of pages to invalidate
+ * @pasid:		processor address space ID value per PCI spec.
+ * @addr:		page address to be invalidated
+ * @flags:		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
+ *			IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
+ *
+ */
+struct tlb_invalidate_info {
+	struct tlb_invalidate_hdr	hdr;
+	enum iommu_inv_granularity	granularity;
+	__u32		flags;
+#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 0)
+#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 1)
+	__u8		size;
+	__u64		nr_pages;
+	__u32		pasid;
+	__u64		addr;
+};
 #endif /* _UAPI_IOMMU_H */
-- 
2.7.4

* [PATCH v5 06/23] iommu/vt-d: add definitions for PFSID
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (4 preceding siblings ...)
  2018-05-11 20:53 ` [PATCH v5 05/23] iommu: introduce iommu invalidate API function Jacob Pan
@ 2018-05-11 20:53 ` Jacob Pan
  2018-05-14  1:36   ` Lu Baolu
  2018-05-11 20:53 ` [PATCH v5 07/23] iommu/vt-d: fix dev iotlb pfsid use Jacob Pan
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:53 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

When the device IOTLB of an SR-IOV VF is invalidated, we need to provide
the PF source ID (PFSID) so that IOMMU hardware can gauge the depth
of the invalidation queue, which is shared among VFs. This is needed
when the device invalidation throttle (DIT) capability is supported.

This patch adds bit definitions for checking and tracking PFSID.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
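Note for readers: below is a minimal user-space sketch of how a 16-bit
PFSID can be split across a device-TLB invalidation descriptor. It assumes
that PFSID[3:0] sits in bits 15:12 and PFSID[15:4] in bits 63:52 of the low
qword; that layout is an assumption taken from the VT-d specification and
should be checked against the revision in use. The demo_* names are
hypothetical and not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout (low qword of the device-TLB invalidate descriptor):
 *   bits 15:12 = PFSID[3:0], bits 63:52 = PFSID[15:4].
 */
uint64_t demo_pfsid_encode(uint16_t pfsid)
{
	uint64_t lo = (uint64_t)(pfsid & 0xf) << 12;
	uint64_t hi = (uint64_t)((pfsid >> 4) & 0xfff) << 52;

	return lo | hi;
}

/* Inverse of demo_pfsid_encode(), handy for checking the bit positions. */
uint16_t demo_pfsid_decode(uint64_t desc_low)
{
	uint16_t lo = (desc_low >> 12) & 0xf;
	uint16_t hi = (desc_low >> 52) & 0xfff;

	return (uint16_t)(lo | (hi << 4));
}

/* DIT capability check, using the bit position of ecap_dit() in this patch. */
int demo_ecap_dit(uint64_t ecap)
{
	return (ecap >> 41) & 0x1;
}
```

A round trip through encode and decode should return the original source
ID for any 16-bit value, which is a quick way to catch a mask or shift that
drops the upper bus bits.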
 include/linux/intel-iommu.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ddc7d79..dfacd49 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -114,6 +114,7 @@
  * Extended Capability Register
  */
 
+#define ecap_dit(e)		((e >> 41) & 0x1)
 #define ecap_pasid(e)		((e >> 40) & 0x1)
 #define ecap_pss(e)		((e >> 35) & 0x1f)
 #define ecap_eafs(e)		((e >> 34) & 0x1)
@@ -284,6 +285,7 @@ enum {
 #define QI_DEV_IOTLB_SID(sid)	((u64)((sid) & 0xffff) << 32)
 #define QI_DEV_IOTLB_QDEP(qdep)	(((qdep) & 0x1f) << 16)
 #define QI_DEV_IOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
+#define QI_DEV_IOTLB_PFSID(pfsid) (((u64)((pfsid) & 0xf) << 12) | ((u64)(((pfsid) >> 4) & 0xfff) << 52))
 #define QI_DEV_IOTLB_SIZE	1
 #define QI_DEV_IOTLB_MAX_INVS	32
 
@@ -308,6 +310,7 @@ enum {
 #define QI_DEV_EIOTLB_PASID(p)	(((u64)p) << 32)
 #define QI_DEV_EIOTLB_SID(sid)	((u64)((sid) & 0xffff) << 16)
 #define QI_DEV_EIOTLB_QDEP(qd)	((u64)((qd) & 0x1f) << 4)
+#define QI_DEV_EIOTLB_PFSID(pfsid) (((u64)((pfsid) & 0xf) << 12) | ((u64)(((pfsid) >> 4) & 0xfff) << 52))
 #define QI_DEV_EIOTLB_MAX_INVS	32
 
 #define QI_PGRP_IDX(idx)	(((u64)(idx)) << 55)
@@ -467,6 +470,7 @@ struct device_domain_info {
 	struct list_head global; /* link to global list */
 	u8 bus;			/* PCI bus number */
 	u8 devfn;		/* PCI devfn number */
+	u16 pfsid;		/* SRIOV physical function source ID */
 	u8 pasid_supported:3;
 	u8 pasid_enabled:1;
 	u8 pri_supported:1;
-- 
2.7.4

* [PATCH v5 07/23] iommu/vt-d: fix dev iotlb pfsid use
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (5 preceding siblings ...)
  2018-05-11 20:53 ` [PATCH v5 06/23] iommu/vt-d: add definitions for PFSID Jacob Pan
@ 2018-05-11 20:53 ` Jacob Pan
  2018-05-14  1:52   ` Lu Baolu
  2018-05-11 20:54 ` [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types Jacob Pan
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:53 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

PFSID should be used in the invalidation descriptor for flushing
device IOTLBs on SRIOV VFs.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
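Note for readers: the PFSID selection rule described in the comment added
to iommu_enable_dev_iotlb() can be sketched in user space as follows. The
struct and function names are hypothetical stand-ins, not kernel APIs; only
the rule itself (0 without DIT, the PF's source ID for a VF, the device's
own source ID otherwise) is taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few pci_dev fields the rule needs. */
struct demo_pci_dev {
	uint8_t bus;
	uint8_t devfn;
	bool is_virtfn;
	const struct demo_pci_dev *physfn;	/* parent PF, set for VFs only */
};

/* Same packing as the kernel's PCI_DEVID(): bus number in the high byte. */
uint16_t demo_devid(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)((bus << 8) | devfn);
}

uint16_t demo_pick_pfsid(bool dit_supported, const struct demo_pci_dev *pdev)
{
	assert(pdev != NULL);

	/* Without DIT the PFSID field is reserved and must be 0. */
	if (!dit_supported)
		return 0;
	/* A VF reports its parent PF so HW can track queue depth per PF. */
	if (pdev->is_virtfn && pdev->physfn)
		return demo_devid(pdev->physfn->bus, pdev->physfn->devfn);
	/* A PF or a non-SR-IOV device uses its own source ID. */
	return demo_devid(pdev->bus, pdev->devfn);
}
```

Written this way, each of the three cases is reached by exactly one
condition, so there is no need to test the DIT capability a second time
inside the VF branch.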
 drivers/iommu/dmar.c        |  6 +++---
 drivers/iommu/intel-iommu.c | 16 +++++++++++++++-
 include/linux/intel-iommu.h |  5 ++---
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 460bed4..7852678 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1339,8 +1339,8 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 	qi_submit_sync(&desc, iommu);
 }
 
-void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
-			u64 addr, unsigned mask)
+void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+			u16 qdep, u64 addr, unsigned mask)
 {
 	struct qi_desc desc;
 
@@ -1355,7 +1355,7 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
 		qdep = 0;
 
 	desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) |
-		   QI_DIOTLB_TYPE;
+		   QI_DIOTLB_TYPE | QI_DEV_IOTLB_PFSID(pfsid);
 
 	qi_submit_sync(&desc, iommu);
 }
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4623294..732a10f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1459,6 +1459,19 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
 		return;
 
 	pdev = to_pci_dev(info->dev);
+	/* For IOMMU that supports device IOTLB throttling (DIT), we assign
+	 * PFSID to the invalidation desc of a VF such that IOMMU HW can gauge
+	 * queue depth at PF level. If DIT is not set, PFSID will be treated as
+	 * reserved, which should be set to 0.
+	 */
+	if (!ecap_dit(info->iommu->ecap))
+		info->pfsid = 0;
+	else if (pdev && pdev->is_virtfn) {
+		if (ecap_dit(info->iommu->ecap))
+			dev_warn(&pdev->dev, "SRIOV VF device IOTLB enabled without flow control\n");
+		info->pfsid = PCI_DEVID(pdev->physfn->bus->number, pdev->physfn->devfn);
+	} else
+		info->pfsid = PCI_DEVID(info->bus, info->devfn);
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	/* The PCIe spec, in its wisdom, declares that the behaviour of
@@ -1524,7 +1537,8 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
 
 		sid = info->bus << 8 | info->devfn;
 		qdep = info->ats_qdep;
-		qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
+		qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+				qdep, addr, mask);
 	}
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index dfacd49..678a0f4 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -504,9 +504,8 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
 			     u8 fm, u64 type);
 extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 			  unsigned int size_order, u64 type);
-extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
-			       u64 addr, unsigned mask);
-
+extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+			u16 qdep, u64 addr, unsigned mask);
 extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
 
 extern int dmar_ir_support(void);
-- 
2.7.4

* [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (6 preceding siblings ...)
  2018-05-11 20:53 ` [PATCH v5 07/23] iommu/vt-d: fix dev iotlb pfsid use Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-14  2:18   ` Lu Baolu
  2018-05-17  8:44   ` kbuild test robot
  2018-05-11 20:54 ` [PATCH v5 09/23] iommu/vt-d: add svm/sva invalidate function Jacob Pan
                   ` (15 subsequent siblings)
  23 siblings, 2 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

When Shared Virtual Memory is exposed to a guest via vIOMMU, extended
IOTLB invalidation may be passed down from outside IOMMU subsystems.
This patch adds invalidation functions that can be used for additional
translation cache types.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
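Note for readers: the comment in qi_flush_dev_eiotlb() refers to the
"S bit plus least significant zero bit" size encoding. Below is a
stand-alone sketch of that convention as it is commonly described for ATS
invalidations: with S = 0 a single 4K page is flushed, and with S = 1 the
lowest clear address bit at or above bit 12 gives the range (bit 12 clear
for 8K, bit 13 clear for 16K, and so on). This is an illustration under
that assumption, not the code of this patch, and the DEMO_/demo_ names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12
#define DEMO_S_BIT	(1ULL << 11)

/*
 * Build the address/size field for a range of 2^order 4K pages.
 * addr must be aligned to the size of the range.
 */
uint64_t demo_encode_range(uint64_t addr, unsigned int order)
{
	uint64_t range_bytes;
	uint64_t field;

	assert(order < 52);
	range_bytes = 1ULL << (DEMO_PAGE_SHIFT + order);
	assert((addr & (range_bytes - 1)) == 0);

	/* S = 0: a single page, the address is used as is. */
	if (order == 0)
		return addr;

	/* Set bits [12, 12 + order - 2]; bit (12 + order - 1) stays clear. */
	field = (range_bytes >> 1) - (1ULL << DEMO_PAGE_SHIFT);
	return addr | field | DEMO_S_BIT;
}
```

For example, under this encoding a 2MB range (order 9) sets address bits
12 through 19 and leaves bit 20 clear, so the alignment check on addr is
what guarantees that the "first zero bit" really marks the range size.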
 drivers/iommu/dmar.c        | 44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h | 21 +++++++++++++++++++--
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 7852678..0b5b052 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1339,6 +1339,18 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 	qi_submit_sync(&desc, iommu);
 }
 
+void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr, u32 pasid,
+		unsigned int size_order, u64 granu, bool global)
+{
+	struct qi_desc desc;
+
+	desc.low = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
+		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
+	desc.high = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_GL(global) |
+		QI_EIOTLB_IH(0) | QI_EIOTLB_AM(size_order);
+	qi_submit_sync(&desc, iommu);
+}
+
 void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 			u16 qdep, u64 addr, unsigned mask)
 {
@@ -1360,6 +1372,38 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 	qi_submit_sync(&desc, iommu);
 }
 
+void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid,
+		u32 pasid,  u16 qdep, u64 addr, unsigned size, u64 granu)
+{
+	struct qi_desc desc;
+
+	desc.low = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
+		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE;
+	desc.high = QI_DEV_EIOTLB_GLOB(granu);
+
+	/* If S bit is 0, we only flush a single page. If S bit is set,
+	 * the least significant zero bit indicates the size. VT-d spec
+	 * 6.5.2.6
+	 */
+	if (!size)
+		desc.high |= QI_DEV_EIOTLB_ADDR(addr) & ~QI_DEV_EIOTLB_SIZE;
+	else {
+		unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size);
+
+		desc.high |= QI_DEV_EIOTLB_ADDR(addr & ~mask) | QI_DEV_EIOTLB_SIZE;
+	}
+	qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
+{
+	struct qi_desc desc;
+
+	desc.high = 0;
+	desc.low = QI_PC_TYPE | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_PASID(pasid);
+
+	qi_submit_sync(&desc, iommu);
+}
 /*
  * Disable Queued Invalidation interface.
  */
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 678a0f4..5ac0c28 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -262,6 +262,10 @@ enum {
 #define QI_PGRP_RESP_TYPE	0x9
 #define QI_PSTRM_RESP_TYPE	0xa
 
+#define QI_DID(did)		(((u64)did & 0xffff) << 16)
+#define QI_DID_MASK		GENMASK(31, 16)
+#define QI_TYPE_MASK		GENMASK(3, 0)
+
 #define QI_IEC_SELECTIVE	(((u64)1) << 4)
 #define QI_IEC_IIDEX(idx)	(((u64)(idx & 0xffff) << 32))
 #define QI_IEC_IM(m)		(((u64)(m & 0x1f) << 27))
@@ -293,8 +297,9 @@ enum {
 #define QI_PC_DID(did)		(((u64)did) << 16)
 #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
 
-#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
-#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
+/* PASID cache invalidation granu */
+#define QI_PC_ALL_PASIDS	0
+#define QI_PC_PASID_SEL		1
 
 #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
 #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
@@ -304,6 +309,10 @@ enum {
 #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
 #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
 
+/* QI Dev-IOTLB inv granu */
+#define QI_DEV_IOTLB_GRAN_ALL		1
+#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
+
 #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
 #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
 #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
@@ -332,6 +341,7 @@ enum {
 #define QI_RESP_INVALID		0x1
 #define QI_RESP_FAILURE		0xf
 
+/* QI EIOTLB inv granu */
 #define QI_GRAN_ALL_ALL			0
 #define QI_GRAN_NONG_ALL		1
 #define QI_GRAN_NONG_PASID		2
@@ -504,8 +514,15 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
 			     u8 fm, u64 type);
 extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 			  unsigned int size_order, u64 type);
+extern void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr,
+			u32 pasid, unsigned int size_order, u64 type, bool global);
 extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 			u16 qdep, u64 addr, unsigned mask);
+
+extern void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid,
+			u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu);
+extern void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
+
 extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
 
 extern int dmar_ir_support(void);
-- 
2.7.4

* [PATCH v5 09/23] iommu/vt-d: add svm/sva invalidate function
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (7 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-14  3:35   ` Lu Baolu
  2018-05-11 20:54 ` [PATCH v5 10/23] iommu: introduce device fault data Jacob Pan
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan, Liu, Yi L

When Shared Virtual Address (SVA) is enabled for a guest OS via
vIOMMU, we need to provide invalidation support at the IOMMU API and
driver level. This patch adds an Intel VT-d specific function that
implements the IOMMU passdown invalidate API for shared virtual address.

The use case is caching structure invalidation for assigned SVM
capable devices. The emulated IOMMU exposes the queued invalidation
capability and passes down all descriptors from the guest to the
physical IOMMU.

The assumption is that the guest to host device ID mapping is
resolved prior to calling the IOMMU driver. Based on the device handle,
the host IOMMU driver can replace certain fields before submitting to
the invalidation queue.

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 129 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 732a10f..684bd98 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4973,6 +4973,134 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
 	dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
 }
 
+/*
+ * 2D array for converting and sanitizing IOMMU generic TLB granularity to
+ * VT-d granularity. Invalidation is typically included in the unmap operation
+ * as a result of DMA or VFIO unmap. However, for an assigned device the guest
+ * can own the first level page tables without them being shadowed by QEMU. In
+ * this case no unmap is passed down to the host IOMMU as a result of an unmap
+ * in the guest. Only invalidations are trapped and passed down.
+ * In all cases, only first level TLB invalidation (request with PASID) can be
+ * passed down, therefore we do not include IOTLB granularity for request
+ * without PASID (second level).
+ *
+ * For example, to find the VT-d granularity encoding for IOTLB
+ * type and page selective granularity within PASID:
+ * X: indexed by enum iommu_inv_type
+ * Y: indexed by enum iommu_inv_granularity
+ * [IOMMU_INV_TYPE_TLB][IOMMU_INV_GRANU_PAGE_PASID]
+ *
+ * Granu_map array indicates validity of the table. 1: valid, 0: invalid
+ *
+ */
+static const int inv_type_granu_map[IOMMU_INV_NR_TYPE][IOMMU_INV_NR_GRANU] = {
+	/* Extended dev TLBs */
+	{1, 1, 1},
+	/* Extended IOTLB */
+	{1, 1, 1},
+	/* PASID cache */
+	{1, 1, 0}
+};
+
+static const u64 inv_type_granu_table[IOMMU_INV_NR_TYPE][IOMMU_INV_NR_GRANU] = {
+	/* extended dev IOTLBs */
+	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
+	/* Extended IOTLB */
+	{QI_GRAN_NONG_ALL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
+	/* PASID cache */
+	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
+};
+
+static inline int to_vtd_granularity(int type, int granu, u64 *vtd_granu)
+{
+	if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU ||
+		!inv_type_granu_map[type][granu])
+		return -EINVAL;
+
+	*vtd_granu = inv_type_granu_table[type][granu];
+
+	return 0;
+}
+
+static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	struct intel_iommu *iommu;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	u16 did, sid;
+	u8 bus, devfn;
+	int ret = 0;
+	u64 granu;
+	unsigned long flags;
+
+	if (!inv_info || !dmar_domain ||
+		inv_info->hdr.version != TLB_INV_HDR_VERSION_1)
+		return -EINVAL;
+
+	if (!dev || !dev_is_pci(dev))
+		return -ENODEV;
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+
+	did = dmar_domain->iommu_did[iommu->seq_id];
+	sid = PCI_DEVID(bus, devfn);
+	ret = to_vtd_granularity(inv_info->hdr.type, inv_info->granularity,
+				&granu);
+	if (ret) {
+		pr_err("Invalid range type %d, granu %d\n", inv_info->hdr.type,
+			inv_info->granularity);
+		return ret;
+	}
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	spin_lock(&iommu->lock);
+
+	switch (inv_info->hdr.type) {
+	case IOMMU_INV_TYPE_TLB:
+		if (inv_info->size &&
+			(inv_info->addr & ((1ULL << (VTD_PAGE_SHIFT + inv_info->size)) - 1))) {
+			pr_err("Addr out of range, addr 0x%llx, size order %d\n",
+				inv_info->addr, inv_info->size);
+			ret = -ERANGE;
+			goto out_unlock;
+		}
+
+		qi_flush_eiotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
+				inv_info->pasid,
+				inv_info->size, granu,
+				inv_info->flags & IOMMU_INVALIDATE_GLOBAL_PAGE);
+		/*
+		 * Always flush device IOTLB if ATS is enabled. Since the guest
+		 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
+		 * down.
+		 */
+		info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
+		if (info && info->ats_enabled) {
+			qi_flush_dev_eiotlb(iommu, sid,
+					inv_info->pasid, info->ats_qdep,
+					inv_info->addr, inv_info->size,
+					granu);
+		}
+		break;
+	case IOMMU_INV_TYPE_PASID:
+		qi_flush_pasid(iommu, did, granu, inv_info->pasid);
+
+		break;
+	default:
+		dev_err(dev, "Unknown IOMMU invalidation type %d\n",
+			inv_info->hdr.type);
+		ret = -EINVAL;
+	}
+out_unlock:
+	spin_unlock(&iommu->lock);
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	return ret;
+}
+
 static int intel_iommu_map(struct iommu_domain *domain,
 			   unsigned long iova, phys_addr_t hpa,
 			   size_t size, int iommu_prot)
@@ -5401,6 +5529,7 @@ const struct iommu_ops intel_iommu_ops = {
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	.bind_pasid_table	= intel_iommu_bind_pasid_table,
 	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
+	.sva_invalidate		= intel_iommu_sva_invalidate,
 #endif
 	.map			= intel_iommu_map,
 	.unmap			= intel_iommu_unmap,
-- 
2.7.4

* [PATCH v5 10/23] iommu: introduce device fault data
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (8 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 09/23] iommu/vt-d: add svm/sva invalidate function Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-09-21 10:07   ` Auger Eric
  2018-05-11 20:54 ` [PATCH v5 11/23] driver core: add per device iommu param Jacob Pan
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan, Liu, Yi L

Device faults detected by the IOMMU can be reported outside the IOMMU
subsystem for further processing. This patch provides generic device
fault data so that IOMMU faults can be communicated to device drivers
without model specific knowledge.

The proposed format is the result of discussion at:
https://lkml.org/lkml/2017/11/10/291
Part of the code is based on Jean-Philippe Brucker's patchset
(https://patchwork.kernel.org/patch/9989315/).

The assumption is that the model specific IOMMU driver can filter and
handle most of the internal faults if the cause is within IOMMU driver
control. Therefore, the fault reasons that can be reported are grouped
and generalized based on common specifications such as PCI ATS.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 include/linux/iommu.h | 101 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 99 insertions(+), 2 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e8cadb6..aeadb4f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -49,13 +49,17 @@ struct bus_type;
 struct device;
 struct iommu_domain;
 struct notifier_block;
+struct iommu_fault_event;
 
 /* iommu fault flags */
-#define IOMMU_FAULT_READ	0x0
-#define IOMMU_FAULT_WRITE	0x1
+#define IOMMU_FAULT_READ		(1 << 0)
+#define IOMMU_FAULT_WRITE		(1 << 1)
+#define IOMMU_FAULT_EXEC		(1 << 2)
+#define IOMMU_FAULT_PRIV		(1 << 3)
 
 typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
 			struct device *, unsigned long, int, void *);
+typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
 
 struct iommu_domain_geometry {
 	dma_addr_t aperture_start; /* First address that can be mapped    */
@@ -264,6 +268,98 @@ struct iommu_device {
 	struct device *dev;
 };
 
+/* Generic fault types, can be expanded to cover IRQ remapping faults */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* IOMMU internal error, no specific reason to report out */
+	IOMMU_FAULT_REASON_INTERNAL,
+
+	/* Could not access the PASID table */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/* Could not access the page directory (Invalid PASID entry) */
+	IOMMU_FAULT_REASON_PGD_FETCH,
+
+	/* Could not access the page table entry (Bad address) */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+};
+
+/**
+ * struct iommu_fault_event - Generic per device fault data
+ *
+ * - PCI and non-PCI devices
+ * - Recoverable faults (e.g. page request), information based on PCI ATS
+ * and PASID spec.
+ * - Un-recoverable faults of device interest
+ * - DMA remapping and IRQ remapping faults
+ *
+ * @type: contains fault type
+ * @reason: fault reason if relevant outside the IOMMU driver; IOMMU driver
+ *          internal faults are not reported
+ * @addr: tells the offending page address
+ * @pasid: contains process address space ID, used in shared virtual memory(SVM)
+ * @page_req_group_id: page request group index
+ * @last_req: last request in a page request group
+ * @pasid_valid: indicates if the PRQ has a valid PASID
+ * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
+ * @device_private: if present, uniquely identify device-specific
+ *                  private data for an individual page request.
+ * @iommu_private: used by the IOMMU driver for storing fault-specific
+ *                 data. Users should not modify this field before
+ *                 sending the fault response.
+ */
+struct iommu_fault_event {
+	enum iommu_fault_type type;
+	enum iommu_fault_reason reason;
+	u64 addr;
+	u32 pasid;
+	u32 page_req_group_id;
+	u32 last_req : 1;
+	u32 pasid_valid : 1;
+	u32 prot;
+	u64 device_private;
+	u64 iommu_private;
+};
+
+/**
+ * struct iommu_fault_param - per-device IOMMU fault data
+ * @handler: Callback function to handle IOMMU faults at device level
+ * @data: handler private data
+ *
+ */
+struct iommu_fault_param {
+	iommu_dev_fault_handler_t handler;
+	void *data;
+};
+
+/**
+ * struct iommu_param - collection of per-device IOMMU data
+ *
+ * @fault_param: IOMMU detected device fault reporting data
+ *
+ * TODO: migrate other per device data pointers under iommu_param, e.g.
+ *	struct iommu_group	*iommu_group;
+ *	struct iommu_fwspec	*iommu_fwspec;
+ */
+struct iommu_param {
+	struct iommu_fault_param *fault_param;
+};
+
 int  iommu_device_register(struct iommu_device *iommu);
 void iommu_device_unregister(struct iommu_device *iommu);
 int  iommu_device_sysfs_add(struct iommu_device *iommu,
@@ -437,6 +533,7 @@ struct iommu_ops {};
 struct iommu_group {};
 struct iommu_fwspec {};
 struct iommu_device {};
+struct iommu_fault_param {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {
-- 
2.7.4

* [PATCH v5 11/23] driver core: add per device iommu param
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (9 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 10/23] iommu: introduce device fault data Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-14  5:27   ` Lu Baolu
  2018-05-11 20:54 ` [PATCH v5 12/23] iommu: add a timeout parameter for prq response Jacob Pan
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

DMA faults can be detected by the IOMMU at device level. Adding a pointer
to struct device allows the IOMMU subsystem to report relevant faults
back to the device driver for further handling.
For directly assigned devices (or user space drivers), the guest OS is
responsible for handling and responding to per device IOMMU faults.
Therefore we need a fault reporting mechanism to propagate faults beyond
the IOMMU subsystem.

There are two other IOMMU data pointers under struct device today. Here
we introduce iommu_param as a parent pointer so that all device IOMMU
data can be consolidated under it. The idea was suggested by Greg KH
and Joerg. The name iommu_param is chosen since iommu_data is already in use.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Link: https://lkml.org/lkml/2017/10/6/81
---
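Note for readers: the pointer chain this patch enables (device, then
iommu_param, then fault_param, then handler) can be pictured with the
user-space sketch below. All demo_ types and functions are hypothetical
stand-ins; the real reporting API is introduced by a later patch in this
series and may differ.

```c
#include <assert.h>
#include <stddef.h>

struct demo_fault_event {
	int type;
};

typedef int (*demo_dev_fault_handler_t)(struct demo_fault_event *, void *);

struct demo_fault_param {
	demo_dev_fault_handler_t handler;
	void *data;
};

struct demo_iommu_param {
	struct demo_fault_param *fault_param;
};

struct demo_device {
	struct demo_iommu_param *iommu_param;
};

/* Deliver evt to the device's handler; -1 if none is registered. */
int demo_report_fault(struct demo_device *dev, struct demo_fault_event *evt)
{
	struct demo_fault_param *fp;

	if (!dev || !evt || !dev->iommu_param)
		return -1;
	fp = dev->iommu_param->fault_param;
	if (!fp || !fp->handler)
		return -1;
	return fp->handler(evt, fp->data);
}

/* Example handler: stores the last seen fault type in its private data. */
int demo_record_type(struct demo_fault_event *evt, void *data)
{
	*(int *)data = evt->type;
	return 0;
}
```

Keeping a single parent pointer in struct device means a device without
any IOMMU fault handling costs one NULL pointer, and further per-device
IOMMU data can later be added under the same parent.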
 include/linux/device.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 4779569..c1b1796 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -41,6 +41,7 @@ struct iommu_ops;
 struct iommu_group;
 struct iommu_fwspec;
 struct dev_pin_info;
+struct iommu_param;
 
 struct bus_attribute {
 	struct attribute	attr;
@@ -899,6 +900,7 @@ struct dev_links_info {
  * 		device (i.e. the bus driver that discovered the device).
  * @iommu_group: IOMMU group the device belongs to.
  * @iommu_fwspec: IOMMU-specific properties supplied by firmware.
+ * @iommu_param: Per device generic IOMMU runtime data
  *
  * @offline_disabled: If set, the device is permanently online.
  * @offline:	Set after successful invocation of bus type's .offline().
@@ -988,6 +990,7 @@ struct device {
 	void	(*release)(struct device *dev);
 	struct iommu_group	*iommu_group;
 	struct iommu_fwspec	*iommu_fwspec;
+	struct iommu_param	*iommu_param;
 
 	bool			offline_disabled:1;
 	bool			offline:1;
-- 
2.7.4

* [PATCH v5 12/23] iommu: add a timeout parameter for prq response
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (10 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 11/23] driver core: add per device iommu param Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-11 20:54 ` [PATCH v5 13/23] iommu: introduce device fault report API Jacob Pan
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

When an IO page request is processed outside the IOMMU subsystem, the
response can be delayed or lost. Add a tunable setup parameter so that
the user can choose the timeout for the IOMMU to track pending page requests.
This timeout mechanism is a basic safety net which can be implemented in
conjunction with credit based or device level page response exception handling.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
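Note for readers: the seconds-to-jiffies conversion and range check for
iommu.prq_timeout can be tried out in user space with the sketch below. It
uses strtoul() from the C library rather than the kernel helpers, and
DEMO_HZ = 250 is only an assumed tick rate, since HZ depends on the kernel
configuration. The demo_/DEMO_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEMO_HZ			250UL	/* assumption: HZ is build-dependent */
#define DEMO_MAX_TIMEOUT	(DEMO_HZ * 100)

/*
 * Parse a timeout given in seconds and store it in jiffies.
 * Accepts 0 (no timeout tracking) and 1 to 100 seconds.
 */
int demo_parse_prq_timeout(const char *str, unsigned long *jiffies_out)
{
	char *end;
	unsigned long secs;

	if (!str || !jiffies_out || *str == '\0')
		return -EINVAL;

	errno = 0;
	secs = strtoul(str, &end, 0);
	if (errno || *end != '\0')
		return -EINVAL;
	/* Compare in seconds first so the multiply below cannot overflow. */
	if (secs > DEMO_MAX_TIMEOUT / DEMO_HZ)
		return -EINVAL;

	*jiffies_out = secs * DEMO_HZ;	/* 0 keeps its "no tracking" meaning */
	return 0;
}
```

Checking the range before multiplying by the tick rate avoids a very large
input wrapping around and slipping under the maximum, and rejecting
trailing characters keeps inputs such as "10s" from being silently
accepted.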
 Documentation/admin-guide/kernel-parameters.txt |  8 +++++++
 drivers/iommu/iommu.c                           | 28 +++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 11fc28e..5c1e836 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1718,6 +1718,14 @@
 			1 - Bypass the IOMMU for DMA.
 			unset - Use IOMMU translation for DMA.
 
+	iommu.prq_timeout=
+			Timeout in seconds to wait for page response
+			of a pending page request.
+			Format: <integer>
+			Default: 10
+			0 - no timeout tracking
+			1 to 100 - allowed range
+
 	io7=		[HW] IO7 for Marvel based alpha systems
 			See comment before marvel_specify_io7 in
 			arch/alpha/kernel/core_marvel.c.
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 784e019..3a49b96 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -37,6 +37,18 @@
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
 static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA;
+/*
+ * Timeout to wait for page response of a pending page request. This is
+ * intended as a basic safety net in case a pending page request is not
+ * responded to for an exceptionally long time. Device may also implement
+ * its own protection mechanism against this exception.
+ * Units are in jiffies with a range between 1 - 100 seconds equivalent.
+ * Default to 10 seconds.
+ * Setting 0 means no timeout tracking.
+ */
+#define IOMMU_PAGE_RESPONSE_MAX_TIMEOUT (HZ * 100)
+#define IOMMU_PAGE_RESPONSE_DEF_TIMEOUT (HZ * 10)
+static unsigned long prq_timeout = IOMMU_PAGE_RESPONSE_DEF_TIMEOUT;
 
 struct iommu_callback_data {
 	const struct iommu_ops *ops;
@@ -125,6 +137,22 @@ static int __init iommu_set_def_domain_type(char *str)
 }
 early_param("iommu.passthrough", iommu_set_def_domain_type);
 
+static int __init iommu_set_prq_timeout(char *str)
+{
+	unsigned long timeout;
+
+	if (!str)
+		return -EINVAL;
+	timeout = simple_strtoul(str, NULL, 0);
+	timeout = timeout * HZ;
+	if (timeout > IOMMU_PAGE_RESPONSE_MAX_TIMEOUT)
+		return -EINVAL;
+	prq_timeout = timeout;
+
+	return 0;
+}
+early_param("iommu.prq_timeout", iommu_set_prq_timeout);
+
 static ssize_t iommu_group_attr_show(struct kobject *kobj,
 				     struct attribute *__attr, char *buf)
 {
-- 
2.7.4


* [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (11 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 12/23] iommu: add a timeout parameter for prq response Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-14  6:01   ` Lu Baolu
                     ` (4 more replies)
  2018-05-11 20:54 ` [PATCH v5 14/23] iommu: introduce page response function Jacob Pan
                   ` (10 subsequent siblings)
  23 siblings, 5 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

Traditionally, device specific faults are detected and handled within
their own device drivers. When IOMMU is enabled, faults such as DMA
related transactions are detected by IOMMU. There is no generic
reporting mechanism to report faults back to the in-kernel device
driver or the guest OS in case of assigned devices.

Faults detected by the IOMMU are based on the transaction's source ID, which
allows them to be reported on a per-device basis, regardless of whether the
device is a PCI device or not.

The fault types include recoverable (e.g. page request) and
unrecoverable faults (e.g. access error). In most cases, faults can be
handled by IOMMU drivers internally. The primary use cases for reporting
faults outside the IOMMU driver are as follows:
1. page request fault originated from an SVM capable device that is
assigned to guest via vIOMMU. In this case, the first level page tables
are owned by the guest. Page request must be propagated to the guest to
let guest OS fault in the pages then send page response. In this
mechanism, the direct receiver of IOMMU fault notification is VFIO,
which can relay notification events to QEMU or other user space
software.

2. faults that need more subtle handling by device drivers. Rather than
simply invoking a reset function, a device driver may need to handle
the fault with a smaller impact.

This patchset is intended to create a generic fault report API such
that it can scale as follows:
- all IOMMU types
- PCI and non-PCI devices
- recoverable and unrecoverable faults
- VFIO and other in-kernel users
- DMA & IRQ remapping (TBD)
The original idea was brought up by David Woodhouse and the discussions
are summarized at https://lwn.net/Articles/608914/.
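
As a rough illustration of the intended calling pattern (one handler per
device, reports rejected when no handler is registered), here is a
self-contained userspace model. Every sketch_* name is hypothetical; the
model leaves out locking, reference counting and the pending fault list
that the real API in the diff below deals with.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_fault_type {
	SKETCH_FAULT_DMA_UNRECOV,
	SKETCH_FAULT_PAGE_REQ,
};

struct sketch_fault_event {
	enum sketch_fault_type type;
	unsigned long long addr;
};

typedef int (*sketch_fault_handler_t)(struct sketch_fault_event *evt,
				      void *data);

/* Stand-in for the per-device fault state. */
struct sketch_device {
	sketch_fault_handler_t handler;
	void *data;
};

/* Only one handler may be registered per device. */
static int sketch_register_fault_handler(struct sketch_device *dev,
					 sketch_fault_handler_t handler,
					 void *data)
{
	if (!dev || !handler)
		return -EINVAL;
	if (dev->handler)
		return -EBUSY;
	dev->handler = handler;
	dev->data = data;
	return 0;
}

static int sketch_unregister_fault_handler(struct sketch_device *dev)
{
	if (!dev || !dev->handler)
		return -EINVAL;
	dev->handler = NULL;
	dev->data = NULL;
	return 0;
}

/* Faults are only delivered when a handler is registered. */
static int sketch_report_fault(struct sketch_device *dev,
			       struct sketch_fault_event *evt)
{
	if (!dev || !evt || !dev->handler)
		return -EINVAL;
	return dev->handler(evt, dev->data);
}

/* Example consumer: counts the faults it is handed. */
static int sketch_count_faults(struct sketch_fault_event *evt, void *data)
{
	assert(evt != NULL);
	(*(int *)data)++;
	return 0;
}
```

A consumer such as VFIO would register once per assigned device, receive
each reported event through its handler, and unregister on teardown.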

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/iommu.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h |  35 +++++++++++-
 2 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3a49b96..b3f9daf 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -609,6 +609,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 		goto err_free_name;
 	}
 
+	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
+	if (!dev->iommu_param) {
+		ret = -ENOMEM;
+		goto err_free_name;
+	}
+	mutex_init(&dev->iommu_param->lock);
+
 	kobject_get(group->devices_kobj);
 
 	dev->iommu_group = group;
@@ -639,6 +646,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 	mutex_unlock(&group->mutex);
 	dev->iommu_group = NULL;
 	kobject_put(group->devices_kobj);
+	kfree(dev->iommu_param);
 err_free_name:
 	kfree(device->name);
 err_remove_link:
@@ -685,7 +693,7 @@ void iommu_group_remove_device(struct device *dev)
 	sysfs_remove_link(&dev->kobj, "iommu_group");
 
 	trace_remove_device_from_group(group->id, dev);
-
+	kfree(dev->iommu_param);
 	kfree(device->name);
 	kfree(device);
 	dev->iommu_group = NULL;
@@ -820,6 +828,145 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
 /**
+ * iommu_register_device_fault_handler() - Register a device fault handler
+ * @dev: the device
+ * @handler: the fault handler
+ * @data: private data passed as argument to the handler
+ *
+ * When an IOMMU fault event is received, call this handler with the fault event
+ * and data as argument. The handler should return 0 on success. If the fault is
+ * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
+ * the fault by calling iommu_page_response() with one of the following
+ * response code:
+ * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
+ * - IOMMU_PAGE_RESP_INVALID: terminate the fault
+ * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
+ *   page faults if possible.
+ *
+ * Return 0 if the fault handler was installed successfully, or an error.
+ */
+int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data)
+{
+	struct iommu_param *param = dev->iommu_param;
+	int ret = 0;
+
+	/*
+	 * Device iommu_param should have been allocated when device is
+	 * added to its iommu_group.
+	 */
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->lock);
+	/* Only allow one fault handler registered for each device */
+	if (param->fault_param) {
+		ret = -EBUSY;
+		goto done_unlock;
+	}
+
+	get_device(dev);
+	param->fault_param =
+		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
+	if (!param->fault_param) {
+		put_device(dev);
+		ret = -ENOMEM;
+		goto done_unlock;
+	}
+	mutex_init(&param->fault_param->lock);
+	param->fault_param->handler = handler;
+	param->fault_param->data = data;
+	INIT_LIST_HEAD(&param->fault_param->faults);
+
+done_unlock:
+	mutex_unlock(&param->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
+
+/**
+ * iommu_unregister_device_fault_handler() - Unregister the device fault handler
+ * @dev: the device
+ *
+ * Remove the device fault handler installed with
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_unregister_device_fault_handler(struct device *dev)
+{
+	struct iommu_param *param = dev->iommu_param;
+	int ret = 0;
+
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->lock);
+	/* we cannot unregister handler if there are pending faults */
+	if (!list_empty(&param->fault_param->faults)) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	kfree(param->fault_param);
+	param->fault_param = NULL;
+	put_device(dev);
+unlock:
+	mutex_unlock(&param->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
+
+
+/**
+ * iommu_report_device_fault() - Report fault event to device
+ * @dev: the device
+ * @evt: fault event data
+ *
+ * Called by IOMMU model specific drivers when fault is detected, typically
+ * in a threaded IRQ handler.
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+	int ret = 0;
+	struct iommu_fault_event *evt_pending;
+	struct iommu_fault_param *fparam;
+
+	/* iommu_param is allocated when device is added to group */
+	if (!dev->iommu_param || !evt)
+		return -EINVAL;
+	/* we only report device fault if there is a handler registered */
+	mutex_lock(&dev->iommu_param->lock);
+	if (!dev->iommu_param->fault_param ||
+		!dev->iommu_param->fault_param->handler) {
+		ret = -EINVAL;
+		goto done_unlock;
+	}
+	fparam = dev->iommu_param->fault_param;
+	if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
+		evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
+				GFP_KERNEL);
+		if (!evt_pending) {
+			ret = -ENOMEM;
+			goto done_unlock;
+		}
+		mutex_lock(&fparam->lock);
+		list_add_tail(&evt_pending->list, &fparam->faults);
+		mutex_unlock(&fparam->lock);
+	}
+	ret = fparam->handler(evt, fparam->data);
+done_unlock:
+	mutex_unlock(&dev->iommu_param->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_fault);
+
+/**
  * iommu_group_id - Return ID for a group
  * @group: the group to ID
  *
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index aeadb4f..b3312ee 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -307,7 +307,8 @@ enum iommu_fault_reason {
  * and PASID spec.
  * - Un-recoverable faults of device interest
  * - DMA remapping and IRQ remapping faults
-
+ *
+ * @list pending fault event list, used for tracking responses
  * @type contains fault type.
  * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
  *         faults are not reported
@@ -324,6 +325,7 @@ enum iommu_fault_reason {
  *                 sending the fault response.
  */
 struct iommu_fault_event {
+	struct list_head list;
 	enum iommu_fault_type type;
 	enum iommu_fault_reason reason;
 	u64 addr;
@@ -340,10 +342,13 @@ struct iommu_fault_event {
  * struct iommu_fault_param - per-device IOMMU fault data
  * @dev_fault_handler: Callback function to handle IOMMU faults at device level
  * @data: handler private data
- *
+ * @faults: holds the pending faults which need a response, e.g. page response.
+ * @lock: protect pending PRQ event list
  */
 struct iommu_fault_param {
 	iommu_dev_fault_handler_t handler;
+	struct list_head faults;
+	struct mutex lock;
 	void *data;
 };
 
@@ -357,6 +362,7 @@ struct iommu_fault_param {
  *	struct iommu_fwspec	*iommu_fwspec;
  */
 struct iommu_param {
+	struct mutex lock;
 	struct iommu_fault_param *fault_param;
 };
 
@@ -456,6 +462,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
 					 struct notifier_block *nb);
 extern int iommu_group_unregister_notifier(struct iommu_group *group,
 					   struct notifier_block *nb);
+extern int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data);
+
+extern int iommu_unregister_device_fault_handler(struct device *dev);
+
+extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
+
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
@@ -727,6 +741,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
 	return 0;
 }
 
+static inline int iommu_register_device_fault_handler(struct device *dev,
+						iommu_dev_fault_handler_t handler,
+						void *data)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_unregister_device_fault_handler(struct device *dev)
+{
+	return 0;
+}
+
+static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+	return -ENODEV;
+}
+
 static inline int iommu_group_id(struct iommu_group *group)
 {
 	return -ENODEV;
-- 
2.7.4


* [PATCH v5 14/23] iommu: introduce page response function
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (12 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 13/23] iommu: introduce device fault report API Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-14  6:39   ` Lu Baolu
  2018-09-10 14:52   ` Auger Eric
  2018-05-11 20:54 ` [PATCH v5 15/23] iommu: handle page response timeout Jacob Pan
                   ` (9 subsequent siblings)
  23 siblings, 2 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

IO page faults can be handled outside the IOMMU subsystem. For example,
when nested translation is turned on and the guest owns the
first level page tables, device page requests can be forwarded
to the guest for fault handling. When the guest returns the page
response, the IOMMU driver on the host needs to process the
response, which informs the device and completes the page request
transaction.

This patch introduces a generic API function for passing page
responses from the guest or other in-kernel users. The definitions of
the generic data are based on the PCI ATS specification and are not
limited to any vendor.
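
The matching rule can be pictured with the small userspace model below: a
response is only accepted if a pending request with the same PASID and page
request group ID exists, and the pending entry is dropped once it has been
answered. The sketch_* names are invented for this example. The model uses
a plain singly linked list without locking, and it skips the steps where
the real code copies the IOMMU private data into the response and calls
the driver's page_response op.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* One outstanding page request awaiting a response. */
struct sketch_pending {
	uint32_t pasid;
	uint32_t group_id;
	struct sketch_pending *next;
};

/* Record a page request; insertion order is irrelevant for matching. */
static int sketch_add_pending(struct sketch_pending **head,
			      uint32_t pasid, uint32_t group_id)
{
	struct sketch_pending *p = malloc(sizeof(*p));

	if (!p)
		return -ENOMEM;
	p->pasid = pasid;
	p->group_id = group_id;
	p->next = *head;
	*head = p;
	return 0;
}

/*
 * Complete the request matching (pasid, group_id) and remove it from
 * the pending list. Returns -EINVAL when nothing matches, in which
 * case the response is dropped.
 */
static int sketch_page_response(struct sketch_pending **head,
				uint32_t pasid, uint32_t group_id)
{
	struct sketch_pending **pp;

	assert(head != NULL);
	for (pp = head; *pp; pp = &(*pp)->next) {
		struct sketch_pending *p = *pp;

		if (p->pasid == pasid && p->group_id == group_id) {
			*pp = p->next;	/* unlink before freeing */
			free(p);
			return 0;
		}
	}
	return -EINVAL;
}
```

A second response for the same (PASID, group ID) pair fails because the
first one already consumed the pending entry.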

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Link: https://lkml.org/lkml/2017/12/7/1725
---
 drivers/iommu/iommu.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/iommu.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b3f9daf..02fed3e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1533,6 +1533,51 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
 
+int iommu_page_response(struct device *dev,
+			struct page_response_msg *msg)
+{
+	struct iommu_param *param = dev->iommu_param;
+	int ret = -EINVAL;
+	struct iommu_fault_event *evt;
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+	if (!domain || !domain->ops->page_response)
+		return -ENODEV;
+
+	/*
+	 * Device iommu_param should have been allocated when device is
+	 * added to its iommu_group.
+	 */
+	if (!param || !param->fault_param)
+		return -EINVAL;
+
+	/* Only send response if there is a fault report pending */
+	mutex_lock(&param->fault_param->lock);
+	if (list_empty(&param->fault_param->faults)) {
+		pr_warn("no pending PRQ, drop response\n");
+		goto done_unlock;
+	}
+	/*
+	 * Check if we have a matching page request pending to respond,
+	 * otherwise return -EINVAL
+	 */
+	list_for_each_entry(evt, &param->fault_param->faults, list) {
+		if (evt->pasid == msg->pasid &&
+		    msg->page_req_group_id == evt->page_req_group_id) {
+			msg->private_data = evt->iommu_private;
+			ret = domain->ops->page_response(dev, msg);
+			list_del(&evt->list);
+			kfree(evt);
+			break;
+		}
+	}
+
+done_unlock:
+	mutex_unlock(&param->fault_param->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_page_response);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b3312ee..722b90f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -163,6 +163,41 @@ struct iommu_resv_region {
 #ifdef CONFIG_IOMMU_API
 
 /**
+ * enum page_response_code - Return status of fault handlers, telling the IOMMU
+ * driver how to proceed with the fault.
+ *
+ * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
+ *	populated, retry the access. This is "Success" in PCI PRI.
+ * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
+ *	this device if possible. This is "Response Failure" in PCI PRI.
+ * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
+ *	access. This is "Invalid Request" in PCI PRI.
+ */
+enum page_response_code {
+	IOMMU_PAGE_RESP_SUCCESS = 0,
+	IOMMU_PAGE_RESP_INVALID,
+	IOMMU_PAGE_RESP_FAILURE,
+};
+
+/**
+ * Generic page response information based on PCI ATS and PASID spec.
+ * @addr: servicing page address
+ * @pasid: contains process address space ID
+ * @resp_code: response code
+ * @page_req_group_id: page request group index
+ * @private_data: uniquely identify device-specific private data for an
+ *                individual page response
+ */
+struct page_response_msg {
+	u64 addr;
+	u32 pasid;
+	enum page_response_code resp_code;
+	u32 pasid_present:1;
+	u32 page_req_group_id;
+	u64 private_data;
+};
+
+/**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
  * @domain_alloc: allocate iommu domain
@@ -195,6 +230,7 @@ struct iommu_resv_region {
  * @bind_pasid_table: bind pasid table pointer for guest SVM
  * @unbind_pasid_table: unbind pasid table pointer and restore defaults
  * @sva_invalidate: invalidate translation caches of shared virtual address
+ * @page_response: handle page request response
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -250,6 +286,7 @@ struct iommu_ops {
 				struct device *dev);
 	int (*sva_invalidate)(struct iommu_domain *domain,
 		struct device *dev, struct tlb_invalidate_info *inv_info);
+	int (*page_response)(struct device *dev, struct page_response_msg *msg);
 
 	unsigned long pgsize_bitmap;
 };
@@ -470,6 +507,7 @@ extern int iommu_unregister_device_fault_handler(struct device *dev);
 
 extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
 
+extern int iommu_page_response(struct device *dev, struct page_response_msg *msg);
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
@@ -758,6 +796,11 @@ static inline int iommu_report_device_fault(struct device *dev, struct iommu_fau
 	return -ENODEV;
 }
 
+static inline int iommu_page_response(struct device *dev, struct page_response_msg *msg)
+{
+	return -ENODEV;
+}
+
 static inline int iommu_group_id(struct iommu_group *group)
 {
 	return -ENODEV;
-- 
2.7.4


* [PATCH v5 15/23] iommu: handle page response timeout
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (13 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 14/23] iommu: introduce page response function Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-14  7:43   ` Lu Baolu
  2018-05-11 20:54 ` [PATCH v5 16/23] iommu/config: add build dependency for dmar Jacob Pan
                   ` (8 subsequent siblings)
  23 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

When IO page faults are reported outside the IOMMU subsystem, the page
request handler may fail for various reasons, e.g. a guest received
page requests but did not have a chance to run for a long time. The
unresponsive behavior could tie up limited resources on the pending
device.
There can be hardware or credit-based software solutions as suggested
in PCI ATS Ch-4. To provide a basic safety net, this patch
introduces a per-device deferrable timer which monitors the longest
pending page fault that requires a response. Proper action, such as
sending a failure response code, could be taken when the timer expires,
but is not included in this patch. We need to consider the life cycle of
the page group ID to prevent confusion with a group ID reused by a device.
For now, a warning message provides a clue of such a failure.
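
To make the expiry check concrete, here is a small userspace sketch of
what the timer callback does on each run: compare the current tick count
against the deadline stored with each pending request. The sketch_* names
are assumptions for this example only. The comparison helper mirrors the
wraparound-tolerant idea behind the kernel's time_after64(), and an array
stands in for the per-device pending list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A pending page request with its response deadline in ticks. */
struct sketch_prq {
	uint32_t pasid;
	uint32_t group_id;
	uint64_t expire;
};

/* True if tick a is strictly later than tick b, tolerating wraparound. */
static int sketch_time_after64(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0;
}

/* Count pending requests whose deadline has passed at time now. */
static size_t sketch_count_expired(const struct sketch_prq *reqs, size_t n,
				   uint64_t now)
{
	size_t i, expired = 0;

	assert(reqs != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (sketch_time_after64(now, reqs[i].expire))
			expired++;
	return expired;
}
```

A request whose deadline equals the current tick is not yet counted as
expired, since the comparison is strict.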

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/iommu.h |  4 ++++
 2 files changed, 57 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 02fed3e..1f2f49e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -827,6 +827,37 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
 }
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
+static void iommu_dev_fault_timer_fn(struct timer_list *t)
+{
+	struct iommu_fault_param *fparam = from_timer(fparam, t, timer);
+	struct iommu_fault_event *evt;
+
+	u64 now;
+
+	now = get_jiffies_64();
+
+	/* The goal is to ensure the driver or guest page fault handler (via vfio)
+	 * sends page responses on time. Otherwise, limited queue resources
+	 * may be occupied by unresponsive guests or drivers.
+	 * When the per-device pending fault list is not empty, we periodically
+	 * check if any anticipated page response time has expired.
+	 *
+	 * TODO:
+	 * We could do the following if response time expires:
+	 * 1. send page response code FAILURE to all pending PRQ
+	 * 2. inform device driver or vfio
+	 * 3. drain in-flight page requests and responses for this device
+	 * 4. clear pending fault list such that driver can unregister fault
+	 *    handler(otherwise blocked when pending faults are present).
+	 */
+	list_for_each_entry(evt, &fparam->faults, list) {
+		if (time_after64(now, evt->expire))
+			pr_err("Page response time expired!, pasid %d gid %d exp %llu now %llu\n",
+				evt->pasid, evt->page_req_group_id, evt->expire, now);
+	}
+	mod_timer(t, now + prq_timeout);
+}
+
 /**
  * iommu_register_device_fault_handler() - Register a device fault handler
  * @dev: the device
@@ -879,6 +910,9 @@ int iommu_register_device_fault_handler(struct device *dev,
 	param->fault_param->data = data;
 	INIT_LIST_HEAD(&param->fault_param->faults);
 
+	if (prq_timeout)
+		timer_setup(&param->fault_param->timer, iommu_dev_fault_timer_fn,
+			TIMER_DEFERRABLE);
 done_unlock:
 	mutex_unlock(&param->lock);
 
@@ -935,6 +969,8 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
 {
 	int ret = 0;
 	struct iommu_fault_event *evt_pending;
+	struct timer_list *tmr;
+	u64 exp;
 	struct iommu_fault_param *fparam;
 
 	/* iommu_param is allocated when device is added to group */
@@ -955,7 +991,17 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
 			ret = -ENOMEM;
 			goto done_unlock;
 		}
+		/* Keep track of response expiration time */
+		exp = get_jiffies_64() + prq_timeout;
+		evt_pending->expire = exp;
 		mutex_lock(&fparam->lock);
+		if (list_empty(&fparam->faults)) {
+			/* First pending event, start timer */
+			tmr = &dev->iommu_param->fault_param->timer;
+			WARN_ON(timer_pending(tmr));
+			mod_timer(tmr, exp);
+		}
+
 		list_add_tail(&evt_pending->list, &fparam->faults);
 		mutex_unlock(&fparam->lock);
 	}
@@ -1572,6 +1618,13 @@ int iommu_page_response(struct device *dev,
 		}
 	}
 
+	/* stop response timer if no more pending request */
+	if (list_empty(&param->fault_param->faults) &&
+		timer_pending(&param->fault_param->timer)) {
+		pr_debug("no pending PRQ, stop timer\n");
+		del_timer(&param->fault_param->timer);
+	}
+
 done_unlock:
 	mutex_unlock(&param->fault_param->lock);
 	return ret;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 722b90f..f3665b7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -360,6 +360,7 @@ enum iommu_fault_reason {
  * @iommu_private: used by the IOMMU driver for storing fault-specific
  *                 data. Users should not modify this field before
  *                 sending the fault response.
+ * @expire: time limit in jiffies to wait for page response
  */
 struct iommu_fault_event {
 	struct list_head list;
@@ -373,6 +374,7 @@ struct iommu_fault_event {
 	u32 prot;
 	u64 device_private;
 	u64 iommu_private;
+	u64 expire;
 };
 
 /**
@@ -380,11 +382,13 @@ struct iommu_fault_event {
  * @dev_fault_handler: Callback function to handle IOMMU faults at device level
  * @data: handler private data
  * @faults: holds the pending faults which needs response, e.g. page response.
+ * @timer: track page request pending time limit
  * @lock: protect pending PRQ event list
  */
 struct iommu_fault_param {
 	iommu_dev_fault_handler_t handler;
 	struct list_head faults;
+	struct timer_list timer;
 	struct mutex lock;
 	void *data;
 };
-- 
2.7.4


* [PATCH v5 16/23] iommu/config: add build dependency for dmar
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (14 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 15/23] iommu: handle page response timeout Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-11 20:54 ` [PATCH v5 17/23] iommu/vt-d: report non-recoverable faults to device Jacob Pan
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

Intel VT-d interrupts come from both IRQ remapping and DMA remapping.
In order to report non-recoverable faults back to the device driver, we
need to have access to the IOMMU fault reporting APIs. This patch adds a
build dependency to the DMAR code, where fault IRQ handlers can selectively
report faults.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 73590ba..8d8b63f 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -139,6 +139,7 @@ config AMD_IOMMU_V2
 # Intel IOMMU support
 config DMAR_TABLE
 	bool
+	select IOMMU_API
 
 config INTEL_IOMMU
 	bool "Support for Intel IOMMU using DMA Remapping Devices"
-- 
2.7.4


* [PATCH v5 17/23] iommu/vt-d: report non-recoverable faults to device
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (15 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 16/23] iommu/config: add build dependency for dmar Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-14  8:17   ` Lu Baolu
  2018-05-11 20:54 ` [PATCH v5 18/23] iommu/intel-svm: report device page request Jacob Pan
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan, Liu, Yi L

Currently, the DMAR fault IRQ handler does nothing more than a
rate-limited printk; no critical hardware handling needs to be done
in IRQ context.
For some use cases such as vIOMMU, it might be useful to report
non-recoverable faults outside the host IOMMU subsystem. DMAR faults
can come from both DMA and interrupt remapping, which has to be
set up early, before threaded IRQs are available.
This patch adds an option and a workqueue such that when fault reporting
is requested, the DMAR fault IRQ handler can use the IOMMU fault
reporting API to report.
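
The deferral pattern can be sketched in userspace as follows: the
interrupt side only copies the fault details into a small work item and
queues it, and a worker later drains the queue and does the reporting.
All sketch_* names are hypothetical. This is a single-threaded model with
no locking, so it only shows the data flow, not the workqueue machinery
used in the diff below.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Snapshot of one fault, taken on the interrupt side. */
struct sketch_fault_work {
	uint64_t addr;
	uint16_t sid;
	uint8_t reason;
	struct sketch_fault_work *next;
};

/* FIFO of deferred faults; tail points at the last next pointer. */
struct sketch_fault_queue {
	struct sketch_fault_work *head;
	struct sketch_fault_work **tail;
};

typedef void (*sketch_report_fn)(const struct sketch_fault_work *w,
				 void *data);

static void sketch_queue_init(struct sketch_fault_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* "IRQ side": copy the fault details and append them to the queue. */
static int sketch_queue_fault(struct sketch_fault_queue *q, uint16_t sid,
			      uint8_t reason, uint64_t addr)
{
	struct sketch_fault_work *w = malloc(sizeof(*w));

	if (!w)
		return -ENOMEM;
	w->addr = addr;
	w->sid = sid;
	w->reason = reason;
	w->next = NULL;
	*q->tail = w;
	q->tail = &w->next;
	return 0;
}

/* "Worker side": report and free queued faults in arrival order. */
static unsigned int sketch_drain_faults(struct sketch_fault_queue *q,
					sketch_report_fn report, void *data)
{
	unsigned int n = 0;

	assert(report != NULL);
	while (q->head) {
		struct sketch_fault_work *w = q->head;

		q->head = w->next;
		report(w, data);
		free(w);
		n++;
	}
	q->tail = &q->head;
	return n;
}

/* Example reporter: adds up the source IDs it sees. */
static void sketch_sum_sids(const struct sketch_fault_work *w, void *data)
{
	*(unsigned int *)data += w->sid;
}
```

Keeping the interrupt side down to an allocation and a copy is what lets
the heavier lookup and reporting work run later in process context.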

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/dmar.c        | 159 ++++++++++++++++++++++++++++++++++++++++++--
 drivers/iommu/intel-iommu.c |   6 +-
 include/linux/dmar.h        |   2 +-
 include/linux/intel-iommu.h |   1 +
 4 files changed, 159 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 0b5b052..ef846e3 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1110,6 +1110,12 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd)
 	return err;
 }
 
+static inline void dmar_free_fault_wq(struct intel_iommu *iommu)
+{
+	if (iommu->fault_wq)
+		destroy_workqueue(iommu->fault_wq);
+}
+
 static void free_iommu(struct intel_iommu *iommu)
 {
 	if (intel_iommu_enabled) {
@@ -1126,6 +1132,7 @@ static void free_iommu(struct intel_iommu *iommu)
 		free_irq(iommu->irq, iommu);
 		dmar_free_hwirq(iommu->irq);
 		iommu->irq = 0;
+		dmar_free_fault_wq(iommu);
 	}
 
 	if (iommu->qi) {
@@ -1554,6 +1561,31 @@ static const char *irq_remap_fault_reasons[] =
 	"Blocked an interrupt request due to source-id verification failure",
 };
 
+/* fault data and status */
+enum intel_iommu_fault_reason {
+	INTEL_IOMMU_FAULT_REASON_SW,
+	INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT,
+	INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT,
+	INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID,
+	INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH,
+	INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS,
+	INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS,
+	INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID,
+	INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID,
+	INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID,
+	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_RTP,
+	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_CTP,
+	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_PTE,
+	NR_INTEL_IOMMU_FAULT_REASON,
+};
+
+/* fault reasons that are allowed to be reported outside IOMMU subsystem */
+#define INTEL_IOMMU_FAULT_REASON_ALLOWED			\
+	((1ULL << INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH) |	\
+		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS) |	\
+		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS))
+
+
 static const char *dmar_get_fault_reason(u8 fault_reason, int *fault_type)
 {
 	if (fault_reason >= 0x20 && (fault_reason - 0x20 <
@@ -1634,11 +1666,91 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
 	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
 }
 
+static enum iommu_fault_reason to_iommu_fault_reason(u8 reason)
+{
+	if (reason >= NR_INTEL_IOMMU_FAULT_REASON) {
+		pr_warn("unknown DMAR fault reason %d\n", reason);
+		return IOMMU_FAULT_REASON_UNKNOWN;
+	}
+	switch (reason) {
+	case INTEL_IOMMU_FAULT_REASON_SW:
+	case INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT:
+	case INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT:
+	case INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID:
+	case INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH:
+	case INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID:
+	case INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID:
+		return IOMMU_FAULT_REASON_INTERNAL;
+	case INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID:
+	case INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS:
+	case INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS:
+		return IOMMU_FAULT_REASON_PERMISSION;
+	default:
+		return IOMMU_FAULT_REASON_UNKNOWN;
+	}
+}
+
+struct dmar_fault_work {
+	struct work_struct fault_work;
+	struct intel_iommu *iommu;
+	u64 addr;
+	int type;
+	int fault_type;
+	enum intel_iommu_fault_reason reason;
+	u16 sid;
+};
+
+static void report_fault_to_device(struct work_struct *work)
+{
+	struct dmar_fault_work *dfw = container_of(work, struct dmar_fault_work,
+						fault_work);
+	struct iommu_fault_event event;
+	struct pci_dev *pdev;
+	u8 bus, devfn;
+
+	memset(&event, 0, sizeof(struct iommu_fault_event));
+
+	/* check if fault reason is permitted to report outside IOMMU */
+	if (!((1ULL << dfw->reason) & INTEL_IOMMU_FAULT_REASON_ALLOWED)) {
+		pr_debug("Fault reason %d not allowed to report to device\n",
+			dfw->reason);
+		goto free_work;
+	}
+
+	bus = PCI_BUS_NUM(dfw->sid);
+	devfn = PCI_DEVFN(PCI_SLOT(dfw->sid), PCI_FUNC(dfw->sid));
+	/*
+	 * we need to check if the fault reporting is requested for the
+	 * offending device.
+	 */
+	pdev = pci_get_domain_bus_and_slot(dfw->iommu->segment, bus, devfn);
+	if (!pdev) {
+		pr_warn("No PCI device found for source ID %x\n", dfw->sid);
+		goto free_work;
+	}
+	/*
+	 * unrecoverable fault is reported per IOMMU, notifier handler can
+	 * resolve PCI device based on source ID.
+	 */
+	event.reason = to_iommu_fault_reason(dfw->reason);
+	event.addr = dfw->addr;
+	event.type = IOMMU_FAULT_DMA_UNRECOV;
+	event.prot = dfw->type ? IOMMU_READ : IOMMU_WRITE;
+	dev_warn(&pdev->dev, "report device unrecoverable fault: %d, %x, %d\n",
+		event.reason, dfw->sid, event.type);
+	iommu_report_device_fault(&pdev->dev, &event);
+	pci_dev_put(pdev);
+
+free_work:
+	kfree(dfw);
+}
+
 static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 		u8 fault_reason, u16 source_id, unsigned long long addr)
 {
 	const char *reason;
 	int fault_type;
+	struct dmar_fault_work *dfw;
 
 	reason = dmar_get_fault_reason(fault_reason, &fault_type);
 
@@ -1647,11 +1759,29 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 			source_id >> 8, PCI_SLOT(source_id & 0xFF),
 			PCI_FUNC(source_id & 0xFF), addr >> 48,
 			fault_reason, reason);
-	else
+	else {
 		pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx [fault reason %02d] %s\n",
 		       type ? "DMA Read" : "DMA Write",
 		       source_id >> 8, PCI_SLOT(source_id & 0xFF),
 		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+	}
+
+	dfw = kmalloc(sizeof(*dfw), GFP_ATOMIC);
+	if (!dfw)
+		return -ENOMEM;
+
+	INIT_WORK(&dfw->fault_work, report_fault_to_device);
+	dfw->addr = addr;
+	dfw->type = type;
+	dfw->fault_type = fault_type;
+	dfw->reason = fault_reason;
+	dfw->sid = source_id;
+	dfw->iommu = iommu;
+	if (!queue_work(iommu->fault_wq, &dfw->fault_work)) {
+		kfree(dfw);
+		return -EBUSY;
+	}
+
 	return 0;
 }
 
@@ -1731,10 +1861,28 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
-int dmar_set_interrupt(struct intel_iommu *iommu)
+static int dmar_set_fault_wq(struct intel_iommu *iommu)
+{
+	if (iommu->fault_wq)
+		return 0;
+
+	iommu->fault_wq = alloc_ordered_workqueue(iommu->name, 0);
+	if (!iommu->fault_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+int dmar_set_interrupt(struct intel_iommu *iommu, bool queue_fault)
 {
 	int irq, ret;
 
+	/* fault can be reported back to device drivers via a wq */
+	if (queue_fault) {
+		ret = dmar_set_fault_wq(iommu);
+		if (ret)
+			pr_err("Failed to create fault handling workqueue\n");
+	}
 	/*
 	 * Check if the fault interrupt is already initialized.
 	 */
@@ -1748,10 +1896,11 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
 		pr_err("No free IRQ vectors\n");
 		return -EINVAL;
 	}
-
 	ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
-	if (ret)
+	if (ret) {
 		pr_err("Can't request irq\n");
+		dmar_free_fault_wq(iommu);
+	}
 	return ret;
 }
 
@@ -1765,7 +1914,7 @@ int __init enable_drhd_fault_handling(void)
 	 */
 	for_each_iommu(iommu, drhd) {
 		u32 fault_status;
-		int ret = dmar_set_interrupt(iommu);
+		int ret = dmar_set_interrupt(iommu, false);
 
 		if (ret) {
 			pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 684bd98..3949b3cf 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3401,10 +3401,10 @@ static int __init init_dmars(void)
 				goto free_iommu;
 		}
 #endif
-		ret = dmar_set_interrupt(iommu);
+		ret = dmar_set_interrupt(iommu, true);
+
 		if (ret)
 			goto free_iommu;
-
 		if (!translation_pre_enabled(iommu))
 			iommu_enable_translation(iommu);
 
@@ -4291,7 +4291,7 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
 			goto disable_iommu;
 	}
 #endif
-	ret = dmar_set_interrupt(iommu);
+	ret = dmar_set_interrupt(iommu, true);
 	if (ret)
 		goto disable_iommu;
 
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index e2433bc..21f2162 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -278,7 +278,7 @@ extern void dmar_msi_unmask(struct irq_data *data);
 extern void dmar_msi_mask(struct irq_data *data);
 extern void dmar_msi_read(int irq, struct msi_msg *msg);
 extern void dmar_msi_write(int irq, struct msi_msg *msg);
-extern int dmar_set_interrupt(struct intel_iommu *iommu);
+extern int dmar_set_interrupt(struct intel_iommu *iommu, bool queue_fault);
 extern irqreturn_t dmar_fault(int irq, void *dev_id);
 extern int dmar_alloc_hwirq(int id, int node, void *arg);
 extern void dmar_free_hwirq(int irq);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 5ac0c28..b3a26c7 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -472,6 +472,7 @@ struct intel_iommu {
 	struct iommu_device iommu;  /* IOMMU core code handle */
 	int		node;
 	u32		flags;      /* Software defined flags */
+	struct workqueue_struct *fault_wq; /* Reporting IOMMU fault to device */
 };
 
 /* PCI domain-device relationship */
-- 
2.7.4
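
Note on the allowed-reason filter in report_fault_to_device(): it is a
plain bitmask test, one bit per fault reason. Below is a minimal
standalone sketch of that test, not the driver code. The demo_* names
and the numeric reason codes are assumptions made for illustration; the
real values come from enum intel_iommu_fault_reason in the patch. The
shift is done in 64 bits so that reason codes above 31 stay well
defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reason codes, for illustration only. */
enum demo_fault_reason {
	DEMO_REASON_BEYOND_ADDR_WIDTH = 4,
	DEMO_REASON_PTE_WRITE_ACCESS = 5,
	DEMO_REASON_PTE_READ_ACCESS = 6,
	DEMO_REASON_HIGH = 40,	/* a code that does not fit in 32 bits */
	DEMO_NR_REASONS = 64,
};

/* One bit per reason that may be reported outside the IOMMU code. */
#define DEMO_REASON_ALLOWED \
	((1ULL << DEMO_REASON_BEYOND_ADDR_WIDTH) | \
	 (1ULL << DEMO_REASON_PTE_WRITE_ACCESS) | \
	 (1ULL << DEMO_REASON_PTE_READ_ACCESS))

/* Return true if the given reason may be forwarded to a device driver. */
static bool demo_reason_allowed(unsigned int reason)
{
	/* Reject out-of-range codes before shifting. */
	if (reason >= DEMO_NR_REASONS)
		return false;

	/* 64-bit shift: shifting a 32-bit int by 32 or more is undefined. */
	return ((1ULL << reason) & DEMO_REASON_ALLOWED) != 0;
}
```

With these assumed codes, demo_reason_allowed(5) is true, while codes 3
and 40 are filtered out.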


* [PATCH v5 18/23] iommu/intel-svm: report device page request
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (16 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 17/23] iommu/vt-d: report non-recoverable faults to device Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-11 20:54 ` [PATCH v5 19/23] iommu/intel-svm: replace dev ops with fault report API Jacob Pan
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

If the source device of a page request has its PASID table pointer
bound to a guest, the first-level page tables are owned by the guest.
In this case, we let the guest OS manage the page fault.

This patch uses the IOMMU fault reporting API to send fault events,
possibly via VFIO, to the guest OS. Once the guest pages are faulted in,
the guest issues a page response, which is passed down via the
invalidation passdown APIs.

Recoverable fault reporting, such as page requests, is not limited to
guest use. In-kernel drivers can also request a chance to receive fault
notifications.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-svm.c | 73 ++++++++++++++++++++++++++++++++++++++++-------
 include/linux/iommu.h     |  1 +
 2 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index e8cd984..a8186f8 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -577,6 +577,58 @@ static bool is_canonical_address(u64 addr)
 	return (((saddr << shift) >> shift) == saddr);
 }
 
+static int prq_to_iommu_prot(struct page_req_dsc *req)
+{
+	int prot = 0;
+
+	if (req->rd_req)
+		prot |= IOMMU_FAULT_READ;
+	if (req->wr_req)
+		prot |= IOMMU_FAULT_WRITE;
+	if (req->exe_req)
+		prot |= IOMMU_FAULT_EXEC;
+	if (req->priv_req)
+		prot |= IOMMU_FAULT_PRIV;
+
+	return prot;
+}
+
+static int intel_svm_prq_report(struct intel_iommu *iommu,
+				struct page_req_dsc *desc)
+{
+	int ret = 0;
+	struct iommu_fault_event event;
+	struct pci_dev *pdev;
+
+	memset(&event, 0, sizeof(struct iommu_fault_event));
+	pdev = pci_get_domain_bus_and_slot(iommu->segment,
+					desc->bus, desc->devfn);
+	if (!pdev) {
+		pr_err("No PCI device found for PRQ [%02x:%02x.%d]\n",
+			desc->bus, PCI_SLOT(desc->devfn),
+			PCI_FUNC(desc->devfn));
+		return -ENODEV;
+	}
+
+	/* Fill in event data for device specific processing */
+	event.type = IOMMU_FAULT_PAGE_REQ;
+	event.addr = (u64)desc->addr << VTD_PAGE_SHIFT;
+	event.pasid = desc->pasid;
+	event.page_req_group_id = desc->prg_index;
+	event.prot = prq_to_iommu_prot(desc);
+	event.last_req = desc->lpig;
+	event.pasid_valid = 1;
+	/* keep track of PRQ so that when the response comes back, we know
+	 * whether we do group response or stream response. SRR[0] and
+	 * private[54:32] bits in the descriptor are stored.
+	 */
+	event.iommu_private = *(u64 *)desc;
+	ret = iommu_report_device_fault(&pdev->dev, &event);
+	pci_dev_put(pdev);
+
+	return ret;
+}
+
 static irqreturn_t prq_event_thread(int irq, void *d)
 {
 	struct intel_iommu *iommu = d;
@@ -625,6 +677,16 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 				goto no_pasid;
 			}
 		}
+		/* If address is not canonical, return invalid response */
+		if (!is_canonical_address(address))
+			goto bad_req;
+
+		/*
+		 * If prq is to be handled outside iommu driver via receiver of
+		 * the fault notifiers, we skip the page response here.
+		 */
+		if (!intel_svm_prq_report(iommu, req))
+			goto prq_advance;
 
 		result = QI_RESP_INVALID;
 		/* Since we're using init_mm.pgd directly, we should never take
@@ -635,9 +697,6 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		if (!mmget_not_zero(svm->mm))
 			goto bad_req;
 
-		/* If address is not canonical, return invalid response */
-		if (!is_canonical_address(address))
-			goto bad_req;
 
 		down_read(&svm->mm->mmap_sem);
 		vma = find_extend_vma(svm->mm, address);
@@ -670,12 +729,6 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 
 		if (WARN_ON(&sdev->list == &svm->devs))
 			sdev = NULL;
-
-		if (sdev && sdev->ops && sdev->ops->fault_cb) {
-			int rwxp = (req->rd_req << 3) | (req->wr_req << 2) |
-				(req->exe_req << 1) | (req->priv_req);
-			sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr, req->private, rwxp, result);
-		}
 		/* We get here in the error case where the PASID lookup failed,
 		   and these can be NULL. Do not use them below this point! */
 		sdev = NULL;
@@ -701,7 +754,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 
 			qi_submit_sync(&resp, iommu);
 		}
-
+	prq_advance:
 		head = (head + sizeof(*req)) & PRQ_RING_MASK;
 	}
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f3665b7..8a973ee 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -42,6 +42,7 @@
  * if the IOMMU page table format is equivalent.
  */
 #define IOMMU_PRIV	(1 << 5)
+#define IOMMU_EXEC	(1 << 6)
 
 struct iommu_ops;
 struct iommu_group;
-- 
2.7.4
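
Note on how a page request is turned into generic event data: the
request's access bits are folded into a protection mask, and the page
frame number is shifted into a byte address. The sketch below is a
standalone approximation under stated assumptions. The demo_* struct,
the flag values and the 4KiB page shift are hypothetical stand-ins, not
the definitions used by struct page_req_dsc or the IOMMU headers.

```c
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define DEMO_FAULT_READ		(1u << 0)
#define DEMO_FAULT_WRITE	(1u << 1)
#define DEMO_FAULT_EXEC		(1u << 2)
#define DEMO_FAULT_PRIV		(1u << 3)

/* Assumed page shift (4KiB pages). */
#define DEMO_PAGE_SHIFT		12

/* Simplified stand-in for the access bits of a page request descriptor. */
struct demo_page_req {
	unsigned int rd_req:1;
	unsigned int wr_req:1;
	unsigned int exe_req:1;
	unsigned int priv_req:1;
	uint64_t pfn;		/* page frame number of the faulting address */
};

/* Fold the per-request access bits into one protection mask. */
static unsigned int demo_prq_to_prot(const struct demo_page_req *req)
{
	unsigned int prot = 0;

	if (req->rd_req)
		prot |= DEMO_FAULT_READ;
	if (req->wr_req)
		prot |= DEMO_FAULT_WRITE;
	if (req->exe_req)
		prot |= DEMO_FAULT_EXEC;
	if (req->priv_req)
		prot |= DEMO_FAULT_PRIV;

	return prot;
}

/* Convert the page frame number into a byte address. */
static uint64_t demo_prq_addr(const struct demo_page_req *req)
{
	return req->pfn << DEMO_PAGE_SHIFT;
}
```

A read/write request for page frame 0x7ff would yield a mask of
DEMO_FAULT_READ | DEMO_FAULT_WRITE and the address 0x7ff000.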


* [PATCH v5 19/23] iommu/intel-svm: replace dev ops with fault report API
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (17 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 18/23] iommu/intel-svm: report device page request Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-11 20:54 ` [PATCH v5 20/23] iommu/intel-svm: do not flush iotlb for viommu Jacob Pan
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

With the introduction of the generic IOMMU device fault reporting API,
we can replace the private fault callback functions with the standard
function and event data.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-svm.c |  7 +------
 include/linux/intel-svm.h | 20 +++-----------------
 2 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index a8186f8..bdda1b6 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -299,7 +299,7 @@ static const struct mmu_notifier_ops intel_mmuops = {
 
 static DEFINE_MUTEX(pasid_mutex);
 
-int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
+int intel_svm_bind_mm(struct device *dev, int *pasid, int flags)
 {
 	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
 	struct intel_svm_dev *sdev;
@@ -346,10 +346,6 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 
 			list_for_each_entry(sdev, &svm->devs, list) {
 				if (dev == sdev->dev) {
-					if (sdev->ops != ops) {
-						ret = -EBUSY;
-						goto out;
-					}
 					sdev->users++;
 					goto success;
 				}
@@ -375,7 +371,6 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 	}
 	/* Finish the setup now we know we're keeping it */
 	sdev->users = 1;
-	sdev->ops = ops;
 	init_rcu_head(&sdev->rcu);
 
 	if (!svm) {
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index 99bc5b3..a39a502 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -18,18 +18,6 @@
 
 struct device;
 
-struct svm_dev_ops {
-	void (*fault_cb)(struct device *dev, int pasid, u64 address,
-			 u32 private, int rwxp, int response);
-};
-
-/* Values for rxwp in fault_cb callback */
-#define SVM_REQ_READ	(1<<3)
-#define SVM_REQ_WRITE	(1<<2)
-#define SVM_REQ_EXEC	(1<<1)
-#define SVM_REQ_PRIV	(1<<0)
-
-
 /*
  * The SVM_FLAG_PRIVATE_PASID flag requests a PASID which is *not* the "main"
  * PASID for the current process. Even if a PASID already exists, a new one
@@ -60,7 +48,6 @@ struct svm_dev_ops {
  * @dev:	Device to be granted acccess
  * @pasid:	Address for allocated PASID
  * @flags:	Flags. Later for requesting supervisor mode, etc.
- * @ops:	Callbacks to device driver
  *
  * This function attempts to enable PASID support for the given device.
  * If the @pasid argument is non-%NULL, a PASID is allocated for access
@@ -82,8 +69,7 @@ struct svm_dev_ops {
  * Multiple calls from the same process may result in the same PASID
  * being re-used. A reference count is kept.
  */
-extern int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
-			     struct svm_dev_ops *ops);
+extern int intel_svm_bind_mm(struct device *dev, int *pasid, int flags);
 
 /**
  * intel_svm_unbind_mm() - Unbind a specified PASID
@@ -120,7 +106,7 @@ extern int intel_svm_is_pasid_valid(struct device *dev, int pasid);
 #else /* CONFIG_INTEL_IOMMU_SVM */
 
 static inline int intel_svm_bind_mm(struct device *dev, int *pasid,
-				    int flags, struct svm_dev_ops *ops)
+				int flags)
 {
 	return -ENOSYS;
 }
@@ -136,6 +122,6 @@ static int intel_svm_is_pasid_valid(struct device *dev, int pasid)
 }
 #endif /* CONFIG_INTEL_IOMMU_SVM */
 
-#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0, NULL))
+#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0))
 
 #endif /* __INTEL_SVM_H__ */
-- 
2.7.4


* [PATCH v5 20/23] iommu/intel-svm: do not flush iotlb for viommu
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (18 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 19/23] iommu/intel-svm: replace dev ops with fault report API Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-11 20:54 ` [PATCH v5 21/23] iommu/vt-d: add intel iommu page response function Jacob Pan
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

vIOMMU passdown invalidation is inclusive: PASID cache invalidation
also covers the TLBs. See Intel VT-d Specification Ch 6.5.2.2 for details.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-svm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index bdda1b6..697d5c2 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -284,7 +284,9 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	rcu_read_lock();
 	list_for_each_entry_rcu(sdev, &svm->devs, list) {
 		intel_flush_pasid_dev(svm, sdev, svm->pasid);
-		intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
+		/* for emulated iommu, PASID cache invalidation implies IOTLB/DTLB */
+		if (!cap_caching_mode(svm->iommu->cap))
+			intel_flush_svm_range_dev(svm, sdev, 0, -1, 0, !svm->mm);
 	}
 	rcu_read_unlock();
 
-- 
2.7.4


* [PATCH v5 21/23] iommu/vt-d: add intel iommu page response function
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (19 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 20/23] iommu/intel-svm: do not flush iotlb for viommu Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-11 20:54 ` [PATCH v5 22/23] trace/iommu: add sva trace events Jacob Pan
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

This patch adds page response support for Intel VT-d.
Generic response data is taken from the IOMMU API and
then converted into the VT-d specific response descriptor format.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/intel-iommu.h |  3 +++
 2 files changed, 50 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3949b3cf..c261639 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5101,6 +5101,52 @@ static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
 	return ret;
 }
 
+static int intel_iommu_page_response(struct device *dev, struct page_response_msg *msg)
+{
+	struct qi_desc resp;
+	struct intel_iommu *iommu;
+	struct pci_dev *pdev;
+	u8 bus, devfn;
+	u16 rid;
+	u64 desc;
+
+	pdev = to_pci_dev(dev);
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu) {
+		dev_err(dev, "No IOMMU for device to send page response\n");
+		return -ENODEV;
+	}
+
+	pci_dev_get(pdev);
+	rid = ((u16)bus << 8) | devfn;
+	/* IOMMU private data contains the preserved page request descriptor, so we
+	 * inspect the SRR bit for response type then queue response with only
+	 * the private data [54:32].
+	 */
+	desc = msg->private_data;
+	if (desc & QI_PRQ_SRR) {
+		/* Page Stream Response */
+		resp.low = QI_PSTRM_IDX(msg->page_req_group_id) |
+			(desc & QI_PRQ_PRIV) | QI_PSTRM_BUS(pdev->bus->number) |
+			QI_PSTRM_PASID(msg->pasid) | QI_PSTRM_RESP_TYPE;
+		resp.high = QI_PSTRM_ADDR(msg->addr) | QI_PSTRM_DEVFN(pdev->devfn & 0xff) |
+			QI_PSTRM_RESP_CODE(msg->resp_code);
+	} else {
+		/* Page Group Response */
+		resp.low = QI_PGRP_PASID(msg->pasid) |
+			QI_PGRP_DID(rid) |
+			QI_PGRP_PASID_P(msg->pasid_present) |
+			QI_PGRP_RESP_TYPE;
+		resp.high = QI_PGRP_IDX(msg->page_req_group_id) |
+			(desc & QI_PRQ_PRIV) | QI_PGRP_RESP_CODE(msg->resp_code);
+
+	}
+	qi_submit_sync(&resp, iommu);
+	pci_dev_put(pdev);
+
+	return 0;
+}
+
 static int intel_iommu_map(struct iommu_domain *domain,
 			   unsigned long iova, phys_addr_t hpa,
 			   size_t size, int iommu_prot)
@@ -5530,6 +5576,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.bind_pasid_table	= intel_iommu_bind_pasid_table,
 	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
 	.sva_invalidate		= intel_iommu_sva_invalidate,
+	.page_response		= intel_iommu_page_response,
 #endif
 	.map			= intel_iommu_map,
 	.unmap			= intel_iommu_unmap,
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index b3a26c7..94366d9 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -337,6 +337,9 @@ enum {
 #define QI_PSTRM_BUS(bus)	(((u64)(bus)) << 24)
 #define QI_PSTRM_PASID(pasid)	(((u64)(pasid)) << 4)
 
+#define QI_PRQ_SRR	BIT_ULL(0)
+#define QI_PRQ_PRIV	GENMASK_ULL(54, 32)
+
 #define QI_RESP_SUCCESS		0x0
 #define QI_RESP_INVALID		0x1
 #define QI_RESP_FAILURE		0xf
-- 
2.7.4
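
Note on the preserved descriptor bits: the response path looks at bit 0
(SRR) to choose between a stream response and a group response, and
carries bits [54:32] of the saved request back as private data. The
sketch below shows that bit handling in isolation. The DEMO_* macros
are local stand-ins written for this example; they are assumed to
behave like the kernel's BIT_ULL() and GENMASK_ULL() but are not taken
from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for BIT_ULL() and GENMASK_ULL(high, low). */
#define DEMO_BIT_ULL(n)		(1ULL << (n))
#define DEMO_GENMASK_ULL(h, l)	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

#define DEMO_PRQ_SRR	DEMO_BIT_ULL(0)			/* stream response requested */
#define DEMO_PRQ_PRIV	DEMO_GENMASK_ULL(54, 32)	/* private data field */

/* True if the saved request asks for a page stream response. */
static bool demo_is_stream_response(uint64_t desc)
{
	return (desc & DEMO_PRQ_SRR) != 0;
}

/* Keep only the private data bits, left in place at [54:32]. */
static uint64_t demo_private_bits(uint64_t desc)
{
	return desc & DEMO_PRQ_PRIV;
}
```

The private bits are deliberately not shifted down, since the response
descriptor expects them at the same bit positions.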


* [PATCH v5 22/23] trace/iommu: add sva trace events
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (20 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 21/23] iommu/vt-d: add intel iommu page response function Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-11 20:54 ` [PATCH v5 23/23] iommu: use sva invalidate and device fault trace event Jacob Pan
  2018-05-29 15:54 ` [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 include/trace/events/iommu.h | 112 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)

diff --git a/include/trace/events/iommu.h b/include/trace/events/iommu.h
index 72b4582..e64eb29 100644
--- a/include/trace/events/iommu.h
+++ b/include/trace/events/iommu.h
@@ -12,6 +12,8 @@
 #define _TRACE_IOMMU_H
 
 #include <linux/tracepoint.h>
+#include <linux/iommu.h>
+#include <uapi/linux/iommu.h>
 
 struct device;
 
@@ -161,6 +163,116 @@ DEFINE_EVENT(iommu_error, io_page_fault,
 
 	TP_ARGS(dev, iova, flags)
 );
+
+TRACE_EVENT(dev_fault,
+
+	TP_PROTO(struct device *dev,  struct iommu_fault_event *evt),
+
+	TP_ARGS(dev, evt),
+
+	TP_STRUCT__entry(
+		__string(device, dev_name(dev))
+		__field(int, type)
+		__field(int, reason)
+		__field(u64, addr)
+		__field(u32, pasid)
+		__field(u32, pgid)
+		__field(u32, last_req)
+		__field(u32, prot)
+	),
+
+	TP_fast_assign(
+		__assign_str(device, dev_name(dev));
+		__entry->type = evt->type;
+		__entry->reason = evt->reason;
+		__entry->addr = evt->addr;
+		__entry->pasid = evt->pasid;
+		__entry->pgid = evt->page_req_group_id;
+		__entry->last_req = evt->last_req;
+		__entry->prot = evt->prot;
+	),
+
+	TP_printk("IOMMU:%s type=%d reason=%d addr=0x%016llx pasid=%d group=%d last=%d prot=%d",
+		__get_str(device),
+		__entry->type,
+		__entry->reason,
+		__entry->addr,
+		__entry->pasid,
+		__entry->pgid,
+		__entry->last_req,
+		__entry->prot
+	)
+);
+
+TRACE_EVENT(dev_page_response,
+
+	TP_PROTO(struct device *dev,  struct page_response_msg *msg),
+
+	TP_ARGS(dev, msg),
+
+	TP_STRUCT__entry(
+		__string(device, dev_name(dev))
+		__field(int, code)
+		__field(u64, addr)
+		__field(u32, pasid)
+		__field(u32, pgid)
+	),
+
+	TP_fast_assign(
+		__assign_str(device, dev_name(dev));
+		__entry->code = msg->resp_code;
+		__entry->addr = msg->addr;
+		__entry->pasid = msg->pasid;
+		__entry->pgid = msg->page_req_group_id;
+	),
+
+	TP_printk("IOMMU:%s code=%d addr=0x%016llx pasid=%d group=%d",
+		__get_str(device),
+		__entry->code,
+		__entry->addr,
+		__entry->pasid,
+		__entry->pgid
+	)
+);
+
+TRACE_EVENT(sva_invalidate,
+
+	TP_PROTO(struct device *dev,  struct tlb_invalidate_info *ti),
+
+	TP_ARGS(dev, ti),
+
+	TP_STRUCT__entry(
+		__string(device, dev_name(dev))
+		__field(int, type)
+		__field(u32, granu)
+		__field(u32, flags)
+		__field(u8, size)
+		__field(u32, pasid)
+		__field(u64, addr)
+	),
+
+	TP_fast_assign(
+		__assign_str(device, dev_name(dev));
+		__entry->type = ti->hdr.type;
+		__entry->flags = ti->flags;
+		__entry->granu = ti->granularity;
+		__entry->size = ti->size;
+		__entry->pasid = ti->pasid;
+		__entry->addr = ti->addr;
+	),
+
+	TP_printk("IOMMU:%s type=%d flags=0x%08x granu=%d size=%d pasid=%d addr=0x%016llx",
+		__get_str(device),
+		__entry->type,
+		__entry->flags,
+		__entry->granu,
+		__entry->size,
+		__entry->pasid,
+		__entry->addr
+	)
+);
+
+
 #endif /* _TRACE_IOMMU_H */
 
 /* This part must be outside protection */
-- 
2.7.4


* [PATCH v5 23/23] iommu: use sva invalidate and device fault trace event
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (21 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 22/23] trace/iommu: add sva trace events Jacob Pan
@ 2018-05-11 20:54 ` Jacob Pan
  2018-05-29 15:54 ` [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-11 20:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

For performance and debugging purposes, these trace events help
analyze device faults and passdown invalidations that interact
with the IOMMU subsystem.
E.g.
IOMMU:0000:00:0a.0 type=2 reason=0 addr=0x00000000007ff000 pasid=1
group=1 last=0 prot=1

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/iommu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1f2f49e..0108970 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1006,6 +1006,7 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
 		mutex_unlock(&fparam->lock);
 	}
 	ret = fparam->handler(evt, fparam->data);
+	trace_dev_fault(dev, evt);
 done_unlock:
 	mutex_unlock(&dev->iommu_param->lock);
 	return ret;
@@ -1574,6 +1575,7 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
 		return -ENODEV;
 
 	ret = domain->ops->sva_invalidate(domain, dev, inv_info);
+	trace_sva_invalidate(dev, inv_info);
 
 	return ret;
 }
@@ -1611,6 +1613,7 @@ int iommu_page_response(struct device *dev,
 		if (evt->pasid == msg->pasid &&
 		    msg->page_req_group_id == evt->page_req_group_id) {
 			msg->private_data = evt->iommu_private;
+			trace_dev_page_response(dev, msg);
 			ret = domain->ops->page_response(dev, msg);
 			list_del(&evt->list);
 			kfree(evt);
-- 
2.7.4
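
Note on the trace output: the example line in the commit message above
follows the dev_fault format string from the trace event definition.
The userspace sketch below reproduces that layout with snprintf() so
the field order is easy to check. The demo_* struct and function are
hypothetical and only mirror the traced fields; unsigned fields are
printed with %u here instead of %d.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the fields recorded by the dev_fault event. */
struct demo_fault_event {
	int type;
	int reason;
	uint64_t addr;
	uint32_t pasid;
	uint32_t pgid;
	uint32_t last_req;
	uint32_t prot;
};

/* Format one event in the same layout as the dev_fault trace line. */
static int demo_format_dev_fault(char *buf, size_t len, const char *devname,
				 const struct demo_fault_event *evt)
{
	return snprintf(buf, len,
			"IOMMU:%s type=%d reason=%d addr=0x%016llx pasid=%u group=%u last=%u prot=%u",
			devname, evt->type, evt->reason,
			(unsigned long long)evt->addr,
			(unsigned int)evt->pasid, (unsigned int)evt->pgid,
			(unsigned int)evt->last_req, (unsigned int)evt->prot);
}
```

Feeding it type=2, reason=0, addr=0x7ff000, pasid=1, group=1, last=0
and prot=1 for device 0000:00:0a.0 gives the same text as the example
in the commit message, joined onto one line.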


* Re: [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status
  2018-05-11 20:53 ` [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status Jacob Pan
@ 2018-05-13  7:33   ` Lu Baolu
  2018-05-14 18:51     ` Jacob Pan
  2018-05-13  8:01   ` Lu Baolu
  1 sibling, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-13  7:33 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 05/12/2018 04:53 AM, Jacob Pan wrote:
> Adding a flag in device domain into to track whether a guest or
typo:                                       ^^info

Best regards,
Lu Baolu

> user PASID table is bound to a device.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  include/linux/intel-iommu.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 304afae..ddc7d79 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -473,6 +473,7 @@ struct device_domain_info {
>  	u8 pri_enabled:1;
>  	u8 ats_supported:1;
>  	u8 ats_enabled:1;
> +	u8 pasid_table_bound:1;
>  	u8 ats_qdep;
>  	u64 fault_mask;	/* selected IOMMU faults to be reported */
>  	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */


* Re: [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status
  2018-05-11 20:53 ` [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status Jacob Pan
  2018-05-13  7:33   ` Lu Baolu
@ 2018-05-13  8:01   ` Lu Baolu
  2018-05-14 18:52     ` Jacob Pan
  1 sibling, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-13  8:01 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi again,

On 05/12/2018 04:53 AM, Jacob Pan wrote:
> Adding a flag in device domain into to track whether a guest or
> user PASID table is bound to a device.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  include/linux/intel-iommu.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 304afae..ddc7d79 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -473,6 +473,7 @@ struct device_domain_info {
>  	u8 pri_enabled:1;
>  	u8 ats_supported:1;
>  	u8 ats_enabled:1;
> +	u8 pasid_table_bound:1;

Can you please add a comment here, so that people can understand
exactly what this bit is for?

Best regards,
Lu Baolu

>  	u8 ats_qdep;
>  	u64 fault_mask;	/* selected IOMMU faults to be reported */
>  	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */


* Re: [PATCH v5 04/23] iommu/vt-d: add bind_pasid_table function
  2018-05-11 20:53 ` [PATCH v5 04/23] iommu/vt-d: add bind_pasid_table function Jacob Pan
@ 2018-05-13  9:29   ` Lu Baolu
  2018-05-14 20:22     ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-13  9:29 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Liu, Yi L

Hi,

On 05/12/2018 04:53 AM, Jacob Pan wrote:
> Add Intel VT-d ops to the generic iommu_bind_pasid_table API
> functions.
>
> The primary use case is for direct assignment of SVM capable
> device. Originated from emulated IOMMU in the guest, the request goes
> through many layers (e.g. VFIO). Upon calling host IOMMU driver, caller
> passes guest PASID table pointer (GPA) and size.
>
> Device context table entry is modified by Intel IOMMU specific
> bind_pasid_table function. This will turn on nesting mode and matching
> translation type.
>
> The unbind operation restores default context mapping.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/intel-iommu.c   | 122 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dma_remapping.h |   1 +
>  2 files changed, 123 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index a0f81a4..4623294 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2409,6 +2409,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
>  	info->ats_supported = info->pasid_supported = info->pri_supported = 0;
>  	info->ats_enabled = info->pasid_enabled = info->pri_enabled = 0;
>  	info->ats_qdep = 0;
> +	info->pasid_table_bound = 0;
>  	info->dev = dev;
>  	info->domain = domain;
>  	info->iommu = iommu;
> @@ -5132,6 +5133,7 @@ static void intel_iommu_put_resv_regions(struct device *dev,
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  #define MAX_NR_PASID_BITS (20)
> +#define MIN_NR_PASID_BITS (5)
>  static inline unsigned long intel_iommu_get_pts(struct intel_iommu *iommu)
>  {
>  	/*
> @@ -5258,6 +5260,122 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
>  
>  	return iommu;
>  }
> +
> +static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
> +		struct device *dev, struct pasid_table_config *pasidt_binfo)
> +{
> +	struct intel_iommu *iommu;
> +	struct context_entry *context;
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	struct pci_dev *pdev;
> +	u8 bus, devfn, host_table_pasid_bits;
> +	u16 did, sid;
> +	int ret = 0;
> +	unsigned long flags;
> +	u64 ctx_lo;

I personally prefer to have these declarations ordered from the longest
line to the shortest:

        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
        u8 bus, devfn, host_table_pasid_bits;
        struct device_domain_info *info;
        struct context_entry *context;
        struct intel_iommu *iommu;
        struct pci_dev *pdev;
        unsigned long flags;
        u16 did, sid;
        int ret = 0;
        u64 ctx_lo;

> +
> +	if ((pasidt_binfo->version != PASID_TABLE_CFG_VERSION_1) ||

Unnecessary parentheses.

> +		pasidt_binfo->bytes != sizeof(*pasidt_binfo))

Alignment should match open parenthesis.

> +		return -EINVAL;
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +	/* VT-d spec section 9.4 says pasid table size is encoded as 2^(x+5) */
> +	host_table_pasid_bits = intel_iommu_get_pts(iommu) + MIN_NR_PASID_BITS;
> +	if (!pasidt_binfo || pasidt_binfo->pasid_bits > host_table_pasid_bits ||

The "!pasidt_binfo" check should be moved up, before the version check,
since pasidt_binfo has already been dereferenced there.

> +		pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
> +		pr_err("Invalid gPASID bits %d, host range %d - %d\n",

How about dev_err()? 

> +			pasidt_binfo->pasid_bits,
> +			MIN_NR_PASID_BITS, host_table_pasid_bits);
> +		return -ERANGE;
> +	}
> +	if (!ecap_nest(iommu->ecap)) {
> +		dev_err(dev, "Cannot bind PASID table, no nested translation\n");
> +		ret = -ENODEV;
> +		goto out;

How about
+        return -ENODEV;
?

> +	}
> +	pdev = to_pci_dev(dev);

We can't always assume that it is a PCI device, right?

> +	sid = PCI_DEVID(bus, devfn);
> +	info = dev->archdata.iommu;
> +
> +	if (!info) {
> +		dev_err(dev, "Invalid device domain info\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if (info->pasid_table_bound) {

This check should be done with the lock held.

Otherwise, two threads can both pass the check and bind the table twice:

Thread A on CPUx           Thread B on CPUy
================           ================
check pasid_table_bound    check pasid_table_bound

mutex_lock()
Setup context
pasid_table_bound = 1
mutex_unlock()

                           mutex_lock()
                           Setup context
                           pasid_table_bound = 1
                           mutex_unlock()
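
The interleaving above can be avoided by testing and setting the flag
inside the same critical section. What follows is only a minimal userspace
sketch of that pattern, not the driver code: struct dev_state, setup_count
and bind_pasid_table_sketch() are hypothetical names, and a pthread mutex
stands in for whatever lock the driver ends up using.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device state touched by the patch. */
struct dev_state {
	pthread_mutex_t lock;
	bool pasid_table_bound;
	int setup_count;	/* how many times the context was set up */
};

/*
 * Test the flag and set it inside one critical section, so that a second
 * caller always observes the first caller's update.
 */
static int bind_pasid_table_sketch(struct dev_state *st)
{
	int ret = 0;

	pthread_mutex_lock(&st->lock);
	if (st->pasid_table_bound) {
		ret = -EBUSY;
		goto out_unlock;
	}
	st->setup_count++;		/* stands for "Setup context" */
	st->pasid_table_bound = true;
out_unlock:
	pthread_mutex_unlock(&st->lock);
	return ret;
}
```

With this shape, the second of two racing callers gets -EBUSY instead of
setting up the context a second time.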


> +		dev_err(dev, "Device PASID table already bound\n");
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +	if (!info->pasid_enabled) {
> +		ret = pci_enable_pasid(pdev, info->pasid_supported & ~1);
> +		if (ret) {
> +			dev_err(dev, "Failed to enable PASID\n");
> +			goto out;
> +		}
> +	}

I prefer a blank line here.

> +	spin_lock_irqsave(&iommu->lock, flags);
> +	context = iommu_context_addr(iommu, bus, devfn, 0);
> +	if (!context_present(context)) {
> +		dev_err(dev, "Context not present\n");
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	/* Anticipate guest to use SVM and owns the first level, so we turn
> +	 * nested mode on
> +	 */
> +	ctx_lo = context[0].lo;
> +	ctx_lo |= CONTEXT_NESTE | CONTEXT_PRS | CONTEXT_PASIDE;
> +	ctx_lo &= ~CONTEXT_TT_MASK;
> +	ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
> +	context[0].lo = ctx_lo;
> +
> +	/* Assign guest PASID table pointer and size order */
> +	ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
> +		(pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
> +	context[1].lo = ctx_lo;
> +	/* make sure context entry is updated before flushing */
> +	wmb();
> +	did = dmar_domain->iommu_did[iommu->seq_id];
> +	iommu->flush.flush_context(iommu, did,
> +				(((u16)bus) << 8) | devfn,
> +				DMA_CCMD_MASK_NOBIT,
> +				DMA_CCMD_DEVICE_INVL);
> +	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> +	info->pasid_table_bound = 1;
> +out_unlock:
> +	spin_unlock_irqrestore(&iommu->lock, flags);
> +out:
> +	return ret;
> +}
> +
> +static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
> +					struct device *dev)
> +{
> +	struct intel_iommu *iommu;
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	u8 bus, devfn;
> +
> +	info = dev->archdata.iommu;
> +	if (!info) {
> +		dev_err(dev, "Invalid device domain info\n");
> +		return;
> +	}
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu) {
> +		dev_err(dev, "No IOMMU for device to unbind PASID table\n");
> +		return;
> +	}
> +
> +	domain_context_clear(iommu, dev);
> +
> +	domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
> +	info->pasid_table_bound = 0;
> +}
>  #endif /* CONFIG_INTEL_IOMMU_SVM */
>  
>  const struct iommu_ops intel_iommu_ops = {
> @@ -5266,6 +5384,10 @@ const struct iommu_ops intel_iommu_ops = {
>  	.domain_free		= intel_iommu_domain_free,
>  	.attach_dev		= intel_iommu_attach_device,
>  	.detach_dev		= intel_iommu_detach_device,
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +	.bind_pasid_table	= intel_iommu_bind_pasid_table,
> +	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
> +#endif
>  	.map			= intel_iommu_map,
>  	.unmap			= intel_iommu_unmap,
>  	.map_sg			= default_iommu_map_sg,
> diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
> index 21b3e7d..db290b2 100644
> --- a/include/linux/dma_remapping.h
> +++ b/include/linux/dma_remapping.h
> @@ -28,6 +28,7 @@
>  
>  #define CONTEXT_DINVE		(1ULL << 8)
>  #define CONTEXT_PRS		(1ULL << 9)
> +#define CONTEXT_NESTE		(1ULL << 10)
>  #define CONTEXT_PASIDE		(1ULL << 11)
>  
>  struct intel_iommu;

Best regards,
Lu Baolu


* Re: [PATCH v5 06/23] iommu/vt-d: add definitions for PFSID
  2018-05-11 20:53 ` [PATCH v5 06/23] iommu/vt-d: add definitions for PFSID Jacob Pan
@ 2018-05-14  1:36   ` Lu Baolu
  2018-05-14 20:30     ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  1:36 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 05/12/2018 04:53 AM, Jacob Pan wrote:
> When SRIOV VF device IOTLB is invalidated, we need to provide
> the PF source ID such that IOMMU hardware can gauge the depth
> of invalidation queue which is shared among VFs. This is needed
> when device invalidation throttle (DIT) capability is supported.
>
> This patch adds bit definitions for checking and tracking PFSID.

Patches 6 and 7 could be posted as a separate patch series.

>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  include/linux/intel-iommu.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index ddc7d79..dfacd49 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -114,6 +114,7 @@
>   * Extended Capability Register
>   */
>  
> +#define ecap_dit(e)		((e >> 41) & 0x1)
>  #define ecap_pasid(e)		((e >> 40) & 0x1)
>  #define ecap_pss(e)		((e >> 35) & 0x1f)
>  #define ecap_eafs(e)		((e >> 34) & 0x1)
> @@ -284,6 +285,7 @@ enum {
>  #define QI_DEV_IOTLB_SID(sid)	((u64)((sid) & 0xffff) << 32)
>  #define QI_DEV_IOTLB_QDEP(qdep)	(((qdep) & 0x1f) << 16)
>  #define QI_DEV_IOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
> +#define QI_DEV_IOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xff0) << 48))
>  #define QI_DEV_IOTLB_SIZE	1
>  #define QI_DEV_IOTLB_MAX_INVS	32
>  
> @@ -308,6 +310,7 @@ enum {
>  #define QI_DEV_EIOTLB_PASID(p)	(((u64)p) << 32)
>  #define QI_DEV_EIOTLB_SID(sid)	((u64)((sid) & 0xffff) << 16)
>  #define QI_DEV_EIOTLB_QDEP(qd)	((u64)((qd) & 0x1f) << 4)
> +#define QI_DEV_EIOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xff0) << 48))

PFSID[15:4] is stored in descriptor bits [63:52], so the second mask must
be 0xfff0 rather than 0xff0. It should look like:

+#define QI_DEV_EIOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xfff0) << 48))
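
To make the difference between the two masks concrete, here is a small
standalone sketch of the encoding. The function names are made up for
illustration and fixed-width userspace types replace the kernel's u16/u64;
the shifts and masks are the ones quoted above. The same 0xff0 mask also
appears in QI_DEV_IOTLB_PFSID in the earlier hunk.

```c
#include <stdint.h>

/*
 * Place PFSID[3:0] in descriptor bits 15:12 and PFSID[15:4] in descriptor
 * bits 63:52, using the corrected 0xfff0 mask.
 */
static inline uint64_t qi_dev_eiotlb_pfsid(uint16_t pfsid)
{
	return ((uint64_t)(pfsid & 0xf) << 12) |
	       ((uint64_t)(pfsid & 0xfff0) << 48);
}

/* The 0xff0 mask from the posted patch, kept only for comparison. */
static inline uint64_t qi_dev_eiotlb_pfsid_posted(uint16_t pfsid)
{
	return ((uint64_t)(pfsid & 0xf) << 12) |
	       ((uint64_t)(pfsid & 0xff0) << 48);
}
```

For pfsid = 0xabcd the corrected version yields 0xabc000000000d000, while
the posted mask yields 0x0bc000000000d000, i.e. PFSID[15:12] is lost.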



>  #define QI_DEV_EIOTLB_MAX_INVS	32
>  
>  #define QI_PGRP_IDX(idx)	(((u64)(idx)) << 55)
> @@ -467,6 +470,7 @@ struct device_domain_info {
>  	struct list_head global; /* link to global list */
>  	u8 bus;			/* PCI bus number */
>  	u8 devfn;		/* PCI devfn number */
> +	u16 pfsid;		/* SRIOV physical function source ID */
>  	u8 pasid_supported:3;
>  	u8 pasid_enabled:1;
>  	u8 pri_supported:1;

Best regards,
Lu Baolu


* Re: [PATCH v5 07/23] iommu/vt-d: fix dev iotlb pfsid use
  2018-05-11 20:53 ` [PATCH v5 07/23] iommu/vt-d: fix dev iotlb pfsid use Jacob Pan
@ 2018-05-14  1:52   ` Lu Baolu
  2018-05-14 20:38     ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  1:52 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 05/12/2018 04:53 AM, Jacob Pan wrote:
> PFSID should be used in the invalidation descriptor for flushing
> device IOTLBs on SRIOV VFs.

This patch could be submitted separately.

>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/dmar.c        |  6 +++---
>  drivers/iommu/intel-iommu.c | 16 +++++++++++++++-
>  include/linux/intel-iommu.h |  5 ++---
>  3 files changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 460bed4..7852678 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1339,8 +1339,8 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  	qi_submit_sync(&desc, iommu);
>  }
>  
> -void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
> -			u64 addr, unsigned mask)
> +void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +			u16 qdep, u64 addr, unsigned mask)
>  {
>  	struct qi_desc desc;
>  
> @@ -1355,7 +1355,7 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
>  		qdep = 0;
>  
>  	desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) |
> -		   QI_DIOTLB_TYPE;
> +		   QI_DIOTLB_TYPE | QI_DEV_IOTLB_PFSID(pfsid);
>  
>  	qi_submit_sync(&desc, iommu);
>  }
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 4623294..732a10f 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1459,6 +1459,19 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
>  		return;
>  
>  	pdev = to_pci_dev(info->dev);
> +	/* For IOMMU that supports device IOTLB throttling (DIT), we assign
> +	 * PFSID to the invalidation desc of a VF such that IOMMU HW can gauge
> +	 * queue depth at PF level. If DIT is not set, PFSID will be treated as
> +	 * reserved, which should be set to 0.
> +	 */
> +	if (!ecap_dit(info->iommu->ecap))
> +		info->pfsid = 0;
> +	else if (pdev && pdev->is_virtfn) {
> +		if (ecap_dit(info->iommu->ecap))
> +			dev_warn(&pdev->dev, "SRIOV VF device IOTLB enabled without flow control\n");

I can't understand these two lines.

Isn't the condition always true? What does the error message mean?
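
To spell the question out: the "else if" branch is only reached when
ecap_dit() is non-zero, so the nested test cannot be false. Without the
warning, the selection reduces to something like the sketch below. This is
a standalone illustration, not a proposed replacement; select_pfsid() and
pci_devid() are hypothetical helpers, and the DIT capability and VF status
are passed in as plain booleans.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same encoding as the kernel's PCI_DEVID(): bus number in the high byte. */
static inline uint16_t pci_devid(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)((bus << 8) | devfn);
}

/* Pick the PFSID to place in a device IOTLB invalidation descriptor. */
static uint16_t select_pfsid(bool dit, bool is_virtfn,
			     uint16_t own_sid, uint16_t pf_sid)
{
	if (!dit)
		return 0;	/* field is reserved without DIT */
	if (is_virtfn)
		return pf_sid;	/* VF: use the parent PF's source ID */
	return own_sid;		/* PF or non-SRIOV device */
}
```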

> +		info->pfsid = PCI_DEVID(pdev->physfn->bus->number, pdev->physfn->devfn);
> +	} else
> +		info->pfsid = PCI_DEVID(info->bus, info->devfn);
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  	/* The PCIe spec, in its wisdom, declares that the behaviour of
> @@ -1524,7 +1537,8 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
>  
>  		sid = info->bus << 8 | info->devfn;
>  		qdep = info->ats_qdep;
> -		qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
> +		qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> +				qdep, addr, mask);

Alignment should match open parenthesis.

>  	}
>  	spin_unlock_irqrestore(&device_domain_lock, flags);
>  }
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index dfacd49..678a0f4 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -504,9 +504,8 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
>  			     u8 fm, u64 type);
>  extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  			  unsigned int size_order, u64 type);
> -extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
> -			       u64 addr, unsigned mask);
> -
> +extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +			u16 qdep, u64 addr, unsigned mask);

Alignment should match open parenthesis.

>  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
>  
>  extern int dmar_ir_support(void);

Best regards,
Lu Baolu


* Re: [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types
  2018-05-11 20:54 ` [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types Jacob Pan
@ 2018-05-14  2:18   ` Lu Baolu
  2018-05-14 20:46     ` Jacob Pan
  2018-05-17  8:44   ` kbuild test robot
  1 sibling, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  2:18 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 05/12/2018 04:54 AM, Jacob Pan wrote:
> When Shared Virtual Memory is exposed to a guest via vIOMMU, extended
> IOTLB invalidation may be passed down from outside IOMMU subsystems.
> This patch adds invalidation functions that can be used for additional
> translation cache types.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/dmar.c        | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/intel-iommu.h | 21 +++++++++++++++++++--
>  2 files changed, 63 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 7852678..0b5b052 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1339,6 +1339,18 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  	qi_submit_sync(&desc, iommu);
>  }
>  
> +void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr, u32 pasid,
> +		unsigned int size_order, u64 granu, bool global)

Alignment should match open parenthesis.

> +{
> +	struct qi_desc desc;
> +
> +	desc.low = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
> +		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
> +	desc.high = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_GL(global) |
> +		QI_EIOTLB_IH(0) | QI_EIOTLB_AM(size_order);
> +	qi_submit_sync(&desc, iommu);
> +}
> +
>  void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  			u16 qdep, u64 addr, unsigned mask)
>  {
> @@ -1360,6 +1372,38 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  	qi_submit_sync(&desc, iommu);
>  }
>  
> +void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid,
> +		u32 pasid,  u16 qdep, u64 addr, unsigned size, u64 granu)

Ditto.

> +{
> +	struct qi_desc desc;
> +
> +	desc.low = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
> +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE;

Have you forgotten PFSID here, or did I miss something?

> +	desc.high |= QI_DEV_EIOTLB_GLOB(granu);
> +
> +	/* If S bit is 0, we only flush a single page. If S bit is set,
> +	 * The least significant zero bit indicates the size. VT-d spec
> +	 * 6.5.2.6
> +	 */
> +	if (!size)
> +		desc.high = QI_DEV_EIOTLB_ADDR(addr) & ~QI_DEV_EIOTLB_SIZE;
> +	else {
> +		unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size);
> +
> +		desc.high = QI_DEV_EIOTLB_ADDR(addr & ~mask) | QI_DEV_EIOTLB_SIZE;
> +	}
> +	qi_submit_sync(&desc, iommu);
> +}
> +
> +void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
> +{
> +	struct qi_desc desc;
> +
> +	desc.high = 0;
> +	desc.low = QI_PC_TYPE | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_PASID(pasid);
> +
> +	qi_submit_sync(&desc, iommu);
> +}
>  /*
>   * Disable Queued Invalidation interface.
>   */
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 678a0f4..5ac0c28 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -262,6 +262,10 @@ enum {
>  #define QI_PGRP_RESP_TYPE	0x9
>  #define QI_PSTRM_RESP_TYPE	0xa
>  
> +#define QI_DID(did)		(((u64)did & 0xffff) << 16)
> +#define QI_DID_MASK		GENMASK(31, 16)
> +#define QI_TYPE_MASK		GENMASK(3, 0)
> +
>  #define QI_IEC_SELECTIVE	(((u64)1) << 4)
>  #define QI_IEC_IIDEX(idx)	(((u64)(idx & 0xffff) << 32))
>  #define QI_IEC_IM(m)		(((u64)(m & 0x1f) << 27))
> @@ -293,8 +297,9 @@ enum {
>  #define QI_PC_DID(did)		(((u64)did) << 16)
>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
>  
> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
> +/* PASID cache invalidation granu */
> +#define QI_PC_ALL_PASIDS	0
> +#define QI_PC_PASID_SEL		1
>  
>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
>  #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
> @@ -304,6 +309,10 @@ enum {
>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
>  
> +/* QI Dev-IOTLB inv granu */
> +#define QI_DEV_IOTLB_GRAN_ALL		1
> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	0
> +
>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> @@ -332,6 +341,7 @@ enum {
>  #define QI_RESP_INVALID		0x1
>  #define QI_RESP_FAILURE		0xf
>  
> +/* QI EIOTLB inv granu */
>  #define QI_GRAN_ALL_ALL			0
>  #define QI_GRAN_NONG_ALL		1
>  #define QI_GRAN_NONG_PASID		2
> @@ -504,8 +514,15 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
>  			     u8 fm, u64 type);
>  extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  			  unsigned int size_order, u64 type);
> +extern void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr,
> +			u32 pasid, unsigned int size_order, u64 type, bool global);
>  extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  			u16 qdep, u64 addr, unsigned mask);
> +
> +extern void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid,
> +			u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu);
> +extern void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
> +
>  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
>  
>  extern int dmar_ir_support(void);

Best regards,
Lu Baolu


* Re: [PATCH v5 09/23] iommu/vt-d: add svm/sva invalidate function
  2018-05-11 20:54 ` [PATCH v5 09/23] iommu/vt-d: add svm/sva invalidate function Jacob Pan
@ 2018-05-14  3:35   ` Lu Baolu
  2018-05-14 20:49     ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  3:35 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Liu, Yi L

Hi,

On 05/12/2018 04:54 AM, Jacob Pan wrote:
> When Shared Virtual Address (SVA) is enabled for a guest OS via
> vIOMMU, we need to provide invalidation support at IOMMU API and driver
> level. This patch adds Intel VT-d specific function to implement
> iommu passdown invalidate API for shared virtual address.
>
> The use case is for supporting caching structure invalidation
> of assigned SVM capable devices. Emulated IOMMU exposes queue
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
>
> The assumption is that guest to host device ID mapping should be
> resolved prior to calling IOMMU driver. Based on the device handle,
> host IOMMU driver can replace certain fields before submit to the
> invalidation queue.
>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 129 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 129 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 732a10f..684bd98 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -4973,6 +4973,134 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
>  	dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
>  }
>  
> +/*
> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> + * VT-d granularity. Invalidation is typically included in the unmap operation
> + * as a result of DMA or VFIO unmap. However, for assigned device where guest
> + * could own the first level page tables without being shadowed by QEMU. In
> + * this case there is no pass down unmap to the host IOMMU as a result of unmap
> + * in the guest. Only invalidations are trapped and passed down.
> + * In all cases, only first level TLB invalidation (request with PASID) can be
> + * passed down, therefore we do not include IOTLB granularity for request
> + * without PASID (second level).
> + *
> + * For an example, to find the VT-d granularity encoding for IOTLB
> + * type and page selective granularity within PASID:
> + * X: indexed by enum iommu_inv_type
> + * Y: indexed by enum iommu_inv_granularity
> + * [IOMMU_INV_TYPE_TLB][IOMMU_INV_GRANU_PAGE_PASID]
> + *
> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> + *
> + */
> +const static int inv_type_granu_map[IOMMU_INV_NR_TYPE][IOMMU_INV_NR_GRANU] = {
> +	/* Extended dev TLBs */
> +	{1, 1, 1},
> +	/* Extended IOTLB */
> +	{1, 1, 1},
> +	/* PASID cache */
> +	{1, 1, 0}
> +};
> +
> +const static u64 inv_type_granu_table[IOMMU_INV_NR_TYPE][IOMMU_INV_NR_GRANU] = {
> +	/* extended dev IOTLBs */
> +	{QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> +	/* Extended IOTLB */
> +	{QI_GRAN_NONG_ALL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> +	/* PASID cache */
> +	{QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu, u64 *vtd_granu)
> +{
> +	if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU ||
> +		!inv_type_granu_map[type][granu])

Alignment should match open parenthesis.

> +		return -EINVAL;
> +
> +	*vtd_granu = inv_type_granu_table[type][granu];
> +
> +	return 0;
> +}
> +
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct tlb_invalidate_info *inv_info)

Ditto.

> +{
> +	struct intel_iommu *iommu;
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	u16 did, sid;
> +	u8 bus, devfn;
> +	int ret = 0;
> +	u64 granu;
> +	unsigned long flags;
> +

I prefer to keep these declarations ordered from the longest line to the
shortest:

        struct dmar_domain *dmar_domain = to_dmar_domain(domain);
        struct device_domain_info *info;
        struct intel_iommu *iommu;
        unsigned long flags;
        u8 bus, devfn;
        u16 did, sid;
        int ret = 0;
        u64 granu;

> +	if (!inv_info || !dmar_domain ||
> +		inv_info->hdr.type != TLB_INV_HDR_VERSION_1)

Ditto.

> +		return -EINVAL;
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return -ENODEV;
> +
> +	did = dmar_domain->iommu_did[iommu->seq_id];
> +	sid = PCI_DEVID(bus, devfn);
> +	ret = to_vtd_granularity(inv_info->hdr.type, inv_info->granularity,
> +				&granu);
> +	if (ret) {
> +		pr_err("Invalid range type %d, granu %d\n", inv_info->hdr.type,
> +			inv_info->granularity);
> +		return ret;
> +	}
> +
> +	spin_lock(&iommu->lock);
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +
> +	switch (inv_info->hdr.type) {
> +	case IOMMU_INV_TYPE_TLB:
> +		if (inv_info->size &&
> +			(inv_info->addr & ((1 << (VTD_PAGE_SHIFT + inv_info->size)) - 1))) {
> +			pr_err("Addr out of range, addr 0x%llx, size order %d\n",
> +				inv_info->addr, inv_info->size);
> +			ret = -ERANGE;
> +			goto out_unlock;
> +		}
> +
> +		qi_flush_eiotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
> +				inv_info->pasid,
> +				inv_info->size, granu,
> +				inv_info->flags & IOMMU_INVALIDATE_GLOBAL_PAGE);
> +		/**
> +		 * Always flush device IOTLB if ATS is enabled since guest
> +		 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
> +		 * down.
> +		 */
> +		info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> +		if (info && info->ats_enabled) {
> +			qi_flush_dev_eiotlb(iommu, sid,
> +					inv_info->pasid, info->ats_qdep,
> +					inv_info->addr, inv_info->size,
> +					granu);
> +		}
> +		break;
> +	case IOMMU_INV_TYPE_PASID:
> +		qi_flush_pasid(iommu, did, granu, inv_info->pasid);
> +
> +		break;
> +	default:
> +		dev_err(dev, "Unknown IOMMU invalidation type %d\n",
> +			inv_info->hdr.type);

There are three types of invalidation:

enum iommu_inv_type {
        IOMMU_INV_TYPE_DTLB,
        IOMMU_INV_TYPE_TLB,
        IOMMU_INV_TYPE_PASID,
        IOMMU_INV_NR_TYPE
};

So "unsupported" looks better than "unknown" in the message.
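
As a standalone illustration of the point: IOMMU_INV_TYPE_DTLB is a valid
member of the enum, yet it falls through to the default label because the
switch only handles two of the three types. The function below is a
hypothetical sketch that mimics the shape of the switch in the patch and
returns the message instead of printing it.

```c
#include <errno.h>
#include <stddef.h>

enum iommu_inv_type {
	IOMMU_INV_TYPE_DTLB,
	IOMMU_INV_TYPE_TLB,
	IOMMU_INV_TYPE_PASID,
	IOMMU_INV_NR_TYPE
};

/* Returns 0 for handled types; otherwise -EINVAL and a message in *msg. */
static int invalidate_sketch(enum iommu_inv_type type, const char **msg)
{
	switch (type) {
	case IOMMU_INV_TYPE_TLB:
	case IOMMU_INV_TYPE_PASID:
		*msg = NULL;
		return 0;
	default:
		/* Reached for DTLB too, which is known but not handled. */
		*msg = "Unsupported IOMMU invalidation type";
		return -EINVAL;
	}
}
```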

> +		ret = -EINVAL;
> +	}
> +out_unlock:
> +	spin_unlock(&iommu->lock);
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	return ret;
> +}
> +
>  static int intel_iommu_map(struct iommu_domain *domain,
>  			   unsigned long iova, phys_addr_t hpa,
>  			   size_t size, int iommu_prot)
> @@ -5401,6 +5529,7 @@ const struct iommu_ops intel_iommu_ops = {
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  	.bind_pasid_table	= intel_iommu_bind_pasid_table,
>  	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
> +	.sva_invalidate		= intel_iommu_sva_invalidate,
>  #endif
>  	.map			= intel_iommu_map,
>  	.unmap			= intel_iommu_unmap,

Best regards,
Lu Baolu


* Re: [PATCH v5 11/23] driver core: add per device iommu param
  2018-05-11 20:54 ` [PATCH v5 11/23] driver core: add per device iommu param Jacob Pan
@ 2018-05-14  5:27   ` Lu Baolu
  2018-05-14 20:52     ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  5:27 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 05/12/2018 04:54 AM, Jacob Pan wrote:
> DMA faults can be detected by IOMMU at device level. Adding a pointer
> to struct device allows IOMMU subsystem to report relevant faults
> back to the device driver for further handling.
> For direct assigned device (or user space drivers), guest OS holds
> responsibility to handle and respond per device IOMMU fault.
> Therefore we need fault reporting mechanism to propagate faults beyond
> IOMMU subsystem.
>
> There are two other IOMMU data pointers under struct device today, here
> we introduce iommu_param as a parent pointer such that all device IOMMU
> data can be consolidated here. The idea was suggested here by Greg KH
> and Joerg. The name iommu_param is chosen here since iommu_data has been used.

This doesn't match what you've done in the patch. Maybe you
forgot to clean up? :-)

The idea is to create a parent pointer under struct device and
move the existing iommu_group and iommu_fwspec, together with
the iommu fault related data, into it.
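
A rough sketch of what such a consolidated layout could look like is given
below. It is only an illustration of the idea described above: the
*_sketch types, the member names and the accessor are all hypothetical,
and the real kernel types are left as opaque forward declarations.

```c
#include <stddef.h>

/* Opaque placeholders for the real kernel types. */
struct iommu_group;
struct iommu_fwspec;
struct iommu_fault_param;	/* hypothetical name for the fault data */

/* Hypothetical parent structure holding all per-device IOMMU data. */
struct iommu_param_sketch {
	struct iommu_group *group;
	struct iommu_fwspec *fwspec;
	struct iommu_fault_param *fault_param;
};

/* struct device would then carry a single IOMMU pointer. */
struct device_sketch {
	struct iommu_param_sketch *iommu_param;
};

/* Accessor that tolerates devices without any IOMMU data. */
static inline struct iommu_group *
dev_iommu_group_sketch(const struct device_sketch *dev)
{
	return dev->iommu_param ? dev->iommu_param->group : NULL;
}
```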

Best regards,
Lu Baolu

>
> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Link: https://lkml.org/lkml/2017/10/6/81
> ---
>  include/linux/device.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 4779569..c1b1796 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -41,6 +41,7 @@ struct iommu_ops;
>  struct iommu_group;
>  struct iommu_fwspec;
>  struct dev_pin_info;
> +struct iommu_param;
>  
>  struct bus_attribute {
>  	struct attribute	attr;
> @@ -899,6 +900,7 @@ struct dev_links_info {
>   * 		device (i.e. the bus driver that discovered the device).
>   * @iommu_group: IOMMU group the device belongs to.
>   * @iommu_fwspec: IOMMU-specific properties supplied by firmware.
> + * @iommu_param: Per device generic IOMMU runtime data
>   *
>   * @offline_disabled: If set, the device is permanently online.
>   * @offline:	Set after successful invocation of bus type's .offline().
> @@ -988,6 +990,7 @@ struct device {
>  	void	(*release)(struct device *dev);
>  	struct iommu_group	*iommu_group;
>  	struct iommu_fwspec	*iommu_fwspec;
> +	struct iommu_param	*iommu_param;
>  
>  	bool			offline_disabled:1;
>  	bool			offline:1;


* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-11 20:54 ` [PATCH v5 13/23] iommu: introduce device fault report API Jacob Pan
@ 2018-05-14  6:01   ` Lu Baolu
  2018-05-14 20:55     ` Jacob Pan
  2018-05-17 11:41   ` Liu, Yi L
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  6:01 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 05/12/2018 04:54 AM, Jacob Pan wrote:
> Traditionally, device specific faults are detected and handled within
> their own device drivers. When IOMMU is enabled, faults such as DMA
> related transactions are detected by IOMMU. There is no generic
> reporting mechanism to report faults back to the in-kernel device
> driver or the guest OS in case of assigned devices.
>
> Faults detected by IOMMU is based on the transaction's source ID which
> can be reported at per device basis, regardless of the device type is a
> PCI device or not.
>
> The fault types include recoverable (e.g. page request) and
> unrecoverable faults(e.g. access error). In most cases, faults can be
> handled by IOMMU drivers internally. The primary use cases are as
> follows:
> 1. page request fault originated from an SVM capable device that is
> assigned to guest via vIOMMU. In this case, the first level page tables
> are owned by the guest. Page request must be propagated to the guest to
> let guest OS fault in the pages then send page response. In this
> mechanism, the direct receiver of IOMMU fault notification is VFIO,
> which can relay notification events to QEMU or other user space
> software.
>
> 2. faults need more subtle handling by device drivers. Other than
> simply invoke reset function, there are needs to let device driver
> handle the fault with a smaller impact.
>
> This patchset is intended to create a generic fault report API such
> that it can scale as follows:
> - all IOMMU types
> - PCI and non-PCI devices
> - recoverable and unrecoverable faults
> - VFIO and other other in kernel users
> - DMA & IRQ remapping (TBD)
> The original idea was brought up by David Woodhouse and discussions
> summarized at https://lwn.net/Articles/608914/.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/iommu/iommu.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iommu.h |  35 +++++++++++-
>  2 files changed, 181 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3a49b96..b3f9daf 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -609,6 +609,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  		goto err_free_name;
>  	}
>  
> +	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
> +	if (!dev->iommu_param) {
> +		ret = -ENOMEM;
> +		goto err_free_name;
> +	}
> +	mutex_init(&dev->iommu_param->lock);
> +
>  	kobject_get(group->devices_kobj);
>  
>  	dev->iommu_group = group;
> @@ -639,6 +646,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  	mutex_unlock(&group->mutex);
>  	dev->iommu_group = NULL;
>  	kobject_put(group->devices_kobj);
> +	kfree(dev->iommu_param);
>  err_free_name:
>  	kfree(device->name);
>  err_remove_link:
> @@ -685,7 +693,7 @@ void iommu_group_remove_device(struct device *dev)
>  	sysfs_remove_link(&dev->kobj, "iommu_group");
>  
>  	trace_remove_device_from_group(group->id, dev);
> -
> +	kfree(dev->iommu_param);
>  	kfree(device->name);
>  	kfree(device);
>  	dev->iommu_group = NULL;
> @@ -820,6 +828,145 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>  
>  /**
> + * iommu_register_device_fault_handler() - Register a device fault handler
> + * @dev: the device
> + * @handler: the fault handler
> + * @data: private data passed as argument to the handler
> + *
> + * When an IOMMU fault event is received, call this handler with the fault event
> + * and data as argument. The handler should return 0 on success. If the fault is
> + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
> + * the fault by calling iommu_page_response() with one of the following
> + * response code:
> + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
> + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
> + *   page faults if possible.
> + *
> + * Return 0 if the fault handler was installed successfully, or an error.
> + */
> +int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = 0;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->lock);
> +	/* Only allow one fault handler registered for each device */
> +	if (param->fault_param) {
> +		ret = -EBUSY;
> +		goto done_unlock;
> +	}
> +
> +	get_device(dev);
> +	param->fault_param =
> +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!param->fault_param) {
> +		put_device(dev);
> +		ret = -ENOMEM;
> +		goto done_unlock;
> +	}
> +	mutex_init(&param->fault_param->lock);

Do we really need this mutex? Isn't param->lock enough?

> +	param->fault_param->handler = handler;
> +	param->fault_param->data = data;
> +	INIT_LIST_HEAD(&param->fault_param->faults);
> +
> +done_unlock:
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> +
> +/**
> + * iommu_unregister_device_fault_handler() - Unregister the device fault handler
> + * @dev: the device
> + *
> + * Remove the device fault handler installed with
> + * iommu_register_device_fault_handler().
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = 0;
> +
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->lock);
> +	/* we cannot unregister handler if there are pending faults */
> +	if (!list_empty(&param->fault_param->faults)) {
> +		ret = -EBUSY;
> +		goto unlock;
> +	}
> +
> +	kfree(param->fault_param);
> +	param->fault_param = NULL;
> +	put_device(dev);
> +unlock:
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> +
> +
> +/**
> + * iommu_report_device_fault() - Report fault event to device
> + * @dev: the device
> + * @evt: fault event data
> + *
> + * Called by IOMMU model specific drivers when fault is detected, typically
> + * in a threaded IRQ handler.
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	int ret = 0;
> +	struct iommu_fault_event *evt_pending;
> +	struct iommu_fault_param *fparam;
> +
> +	/* iommu_param is allocated when device is added to group */
> +	if (!dev->iommu_param | !evt)
> +		return -EINVAL;
> +	/* we only report device fault if there is a handler registered */
> +	mutex_lock(&dev->iommu_param->lock);
> +	if (!dev->iommu_param->fault_param ||
> +		!dev->iommu_param->fault_param->handler) {
> +		ret = -EINVAL;
> +		goto done_unlock;
> +	}
> +	fparam = dev->iommu_param->fault_param;
> +	if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
> +		evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
> +				GFP_KERNEL);
> +		if (!evt_pending) {
> +			ret = -ENOMEM;
> +			goto done_unlock;
> +		}
> +		mutex_lock(&fparam->lock);
> +		list_add_tail(&evt_pending->list, &fparam->faults);
> +		mutex_unlock(&fparam->lock);
> +	}
> +	ret = fparam->handler(evt, fparam->data);
> +done_unlock:
> +	mutex_unlock(&dev->iommu_param->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> +
> +/**
>   * iommu_group_id - Return ID for a group
>   * @group: the group to ID
>   *
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index aeadb4f..b3312ee 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -307,7 +307,8 @@ enum iommu_fault_reason {
>   * and PASID spec.
>   * - Un-recoverable faults of device interest
>   * - DMA remapping and IRQ remapping faults
> -
> + *
> + * @list pending fault event list, used for tracking responses
>   * @type contains fault type.
>   * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
>   *         faults are not reported
> @@ -324,6 +325,7 @@ enum iommu_fault_reason {
>   *                 sending the fault response.
>   */
>  struct iommu_fault_event {
> +	struct list_head list;
>  	enum iommu_fault_type type;
>  	enum iommu_fault_reason reason;
>  	u64 addr;
> @@ -340,10 +342,13 @@ struct iommu_fault_event {
>   * struct iommu_fault_param - per-device IOMMU fault data
>   * @dev_fault_handler: Callback function to handle IOMMU faults at device level
>   * @data: handler private data
> - *
> + * @faults: holds the pending faults which needs response, e.g. page response.
> + * @lock: protect pending PRQ event list
>   */
>  struct iommu_fault_param {
>  	iommu_dev_fault_handler_t handler;
> +	struct list_head faults;
> +	struct mutex lock;
>  	void *data;
>  };
>  
> @@ -357,6 +362,7 @@ struct iommu_fault_param {
>   *	struct iommu_fwspec	*iommu_fwspec;
>   */
>  struct iommu_param {
> +	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
>  };
>  
> @@ -456,6 +462,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
>  					 struct notifier_block *nb);
>  extern int iommu_group_unregister_notifier(struct iommu_group *group,
>  					   struct notifier_block *nb);
> +extern int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data);
> +
> +extern int iommu_unregister_device_fault_handler(struct device *dev);
> +
> +extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
> +
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -727,6 +741,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
>  	return 0;
>  }
>  
> +static inline int iommu_register_device_fault_handler(struct device *dev,
> +						iommu_dev_fault_handler_t handler,
> +						void *data)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 14/23] iommu: introduce page response function
  2018-05-11 20:54 ` [PATCH v5 14/23] iommu: introduce page response function Jacob Pan
@ 2018-05-14  6:39   ` Lu Baolu
  2018-05-29 16:13     ` Jacob Pan
  2018-09-10 14:52   ` Auger Eric
  1 sibling, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  6:39 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 05/12/2018 04:54 AM, Jacob Pan wrote:
> IO page faults can be handled outside IOMMU subsystem. For an example,
> when nested translation is turned on and guest owns the
> first level page tables, device page request can be forwarded
> to the guest for handling faults. As the page response returns
> by the guest, IOMMU driver on the host need to process the
> response which informs the device and completes the page request
> transaction.
>
> This patch introduces generic API function for page response
> passing from the guest or other in-kernel users. The definitions of
> the generic data is based on PCI ATS specification not limited to
> any vendor.
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Link: https://lkml.org/lkml/2017/12/7/1725
> ---
>  drivers/iommu/iommu.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/iommu.h | 43 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 88 insertions(+)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index b3f9daf..02fed3e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1533,6 +1533,51 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
>  
> +int iommu_page_response(struct device *dev,
> +			struct page_response_msg *msg)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = -EINVAL;
> +	struct iommu_fault_event *evt;
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	if (!domain || !domain->ops->page_response)
> +		return -ENODEV;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!param || !param->fault_param)
> +		return -EINVAL;
> +
> +	/* Only send response if there is a fault report pending */
> +	mutex_lock(&param->fault_param->lock);
> +	if (list_empty(&param->fault_param->faults)) {
> +		pr_warn("no pending PRQ, drop response\n");
> +		goto done_unlock;
> +	}
> +	/*
> +	 * Check if we have a matching page request pending to respond,
> +	 * otherwise return -EINVAL
> +	 */
> +	list_for_each_entry(evt, &param->fault_param->faults, list) {
> +		if (evt->pasid == msg->pasid &&
> +		    msg->page_req_group_id == evt->page_req_group_id) {
> +			msg->private_data = evt->iommu_private;
> +			ret = domain->ops->page_response(dev, msg);
> +			list_del(&evt->list);
> +			kfree(evt);
> +			break;
> +		}
> +	}

Aren't the two checks above redundant? We won't find a matching
request if the list is empty anyway. We also need to print a message
if no matching request is found.
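
A minimal userspace sketch of that shape, not kernel code: a plain
singly linked list stands in for list_head, and struct pending_req,
complete_pending() and add_pending() are hypothetical names used only
for illustration. A single loop covers both the empty-list case and the
no-match case, and the warning is printed whenever nothing matched.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pending page request (not the kernel struct). */
struct pending_req {
	unsigned int pasid;
	unsigned int group_id;
	struct pending_req *next;
};

/*
 * Remove and free the request matching (pasid, group_id).
 * Returns 0 on a match, -1 otherwise. An empty list needs no special
 * case: the loop body never runs and we fall through to the warning.
 */
static int complete_pending(struct pending_req **head,
			    unsigned int pasid, unsigned int group_id)
{
	struct pending_req **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		struct pending_req *req = *pp;

		if (req->pasid == pasid && req->group_id == group_id) {
			*pp = req->next;	/* unlink the matched entry */
			free(req);
			return 0;
		}
	}
	fprintf(stderr, "no matching PRQ for pasid %u gid %u, drop response\n",
		pasid, group_id);
	return -1;
}

/* Helper for the sketch: push a new pending request at the list head. */
static int add_pending(struct pending_req **head,
		       unsigned int pasid, unsigned int group_id)
{
	struct pending_req *req = malloc(sizeof(*req));

	if (!req)
		return -1;
	req->pasid = pasid;
	req->group_id = group_id;
	req->next = *head;
	*head = req;
	return 0;
}
```

In the patch itself the equivalent would be dropping the list_empty()
test and warning after list_for_each_entry() when ret is still -EINVAL.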

Best regards,
Lu Baolu

> +
> +done_unlock:
> +	mutex_unlock(&param->fault_param->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_page_response);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index b3312ee..722b90f 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -163,6 +163,41 @@ struct iommu_resv_region {
>  #ifdef CONFIG_IOMMU_API
>  
>  /**
> + * enum page_response_code - Return status of fault handlers, telling the IOMMU
> + * driver how to proceed with the fault.
> + *
> + * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
> + *	populated, retry the access. This is "Success" in PCI PRI.
> + * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
> + *	this device if possible. This is "Response Failure" in PCI PRI.
> + * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
> + *	access. This is "Invalid Request" in PCI PRI.
> + */
> +enum page_response_code {
> +	IOMMU_PAGE_RESP_SUCCESS = 0,
> +	IOMMU_PAGE_RESP_INVALID,
> +	IOMMU_PAGE_RESP_FAILURE,
> +};
> +
> +/**
> + * Generic page response information based on PCI ATS and PASID spec.
> + * @addr: servicing page address
> + * @pasid: contains process address space ID
> + * @resp_code: response code
> + * @page_req_group_id: page request group index
> + * @private_data: uniquely identify device-specific private data for an
> + *                individual page response
> + */
> +struct page_response_msg {
> +	u64 addr;
> +	u32 pasid;
> +	enum page_response_code resp_code;
> +	u32 pasid_present:1;
> +	u32 page_req_group_id;
> +	u64 private_data;
> +};
> +
> +/**
>   * struct iommu_ops - iommu ops and capabilities
>   * @capable: check capability
>   * @domain_alloc: allocate iommu domain
> @@ -195,6 +230,7 @@ struct iommu_resv_region {
>   * @bind_pasid_table: bind pasid table pointer for guest SVM
>   * @unbind_pasid_table: unbind pasid table pointer and restore defaults
>   * @sva_invalidate: invalidate translation caches of shared virtual address
> + * @page_response: handle page request response
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -250,6 +286,7 @@ struct iommu_ops {
>  				struct device *dev);
>  	int (*sva_invalidate)(struct iommu_domain *domain,
>  		struct device *dev, struct tlb_invalidate_info *inv_info);
> +	int (*page_response)(struct device *dev, struct page_response_msg *msg);
>  
>  	unsigned long pgsize_bitmap;
>  };
> @@ -470,6 +507,7 @@ extern int iommu_unregister_device_fault_handler(struct device *dev);
>  
>  extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
>  
> +extern int iommu_page_response(struct device *dev, struct page_response_msg *msg);
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -758,6 +796,11 @@ static inline int iommu_report_device_fault(struct device *dev, struct iommu_fau
>  	return -ENODEV;
>  }
>  
> +static inline int iommu_page_response(struct device *dev, struct page_response_msg *msg)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 15/23] iommu: handle page response timeout
  2018-05-11 20:54 ` [PATCH v5 15/23] iommu: handle page response timeout Jacob Pan
@ 2018-05-14  7:43   ` Lu Baolu
  2018-05-29 16:20     ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  7:43 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 05/12/2018 04:54 AM, Jacob Pan wrote:
> When IO page faults are reported outside IOMMU subsystem, the page
> request handler may fail for various reasons. E.g. a guest received
> page requests but did not have a chance to run for a long time. The
> irresponsive behavior could hold off limited resources on the pending
> device.
> There can be hardware or credit based software solutions as suggested
> in the PCI ATS Ch-4. To provide a basic safety net this patch
> introduces a per device deferrable timer which monitors the longest
> pending page fault that requires a response. Proper action such as
> sending failure response code could be taken when timer expires but not
> included in this patch. We need to consider the life cycle of page
> group ID to prevent confusion with reused group ID by a device.
> For now, a warning message provides clue of such failure.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/iommu.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/iommu.h |  4 ++++
>  2 files changed, 57 insertions(+)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 02fed3e..1f2f49e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -827,6 +827,37 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  }
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>  
> +static void iommu_dev_fault_timer_fn(struct timer_list *t)
> +{
> +	struct iommu_fault_param *fparam = from_timer(fparam, t, timer);
> +	struct iommu_fault_event *evt;
> +
> +	u64 now;
> +
> +	now = get_jiffies_64();
> +
> +	/* The goal is to ensure driver or guest page fault handler(via vfio)
> +	 * send page response on time. Otherwise, limited queue resources
> +	 * may be occupied by some irresponsive guests or drivers.
> +	 * When per device pending fault list is not empty, we periodically checks
> +	 * if any anticipated page response time has expired.
> +	 *
> +	 * TODO:
> +	 * We could do the following if response time expires:
> +	 * 1. send page response code FAILURE to all pending PRQ
> +	 * 2. inform device driver or vfio
> +	 * 3. drain in-flight page requests and responses for this device
> +	 * 4. clear pending fault list such that driver can unregister fault
> +	 *    handler(otherwise blocked when pending faults are present).
> +	 */
> +	list_for_each_entry(evt, &fparam->faults, list) {
> +		if (time_after64(now, evt->expire))
> +			pr_err("Page response time expired!, pasid %d gid %d exp %llu now %llu\n",
> +				evt->pasid, evt->page_req_group_id, evt->expire, now);
> +	}
> +	mod_timer(t, now + prq_timeout);
> +}
> +

This timer scheme is very rough.

The timer expires every 10 seconds (by default).

0               10              20              30              40
+---------------+---------------+---------------+---------------+
^  ^   ^  ^                     ^
|  |   |  |                     |
F0 F1  F2 F3                    (F1, F2, F3 will not be handled until here!)

F0, F1, F2 and F3 are four page faults that happen during the [0, 10s)
time window. The timeouts of F1, F2 and F3 won't be handled until the
timer expires again at 20s. That means a fault might stay pending for
up to about twice the prq_timeout period before it is noticed.

Out of curiosity, why not add a timer to iommu_fault_event, start it in
iommu_report_device_fault() and remove it in iommu_page_response()?
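
The difference can be put in numbers with a toy model (userspace C, not
kernel code). The function names are hypothetical, times are in seconds,
and the model assumes that no response ever arrives, that arrival >= t0,
and that a fault counts as overdue once now >= arrival + timeout.

```c
/*
 * Shared per-device timer: armed when the first fault arrives at t0,
 * so it fires at t0 + timeout and is then re-armed every timeout.
 * Returns the time at which it first sees the given fault as overdue.
 */
static unsigned long shared_timer_detect(unsigned long t0,
					 unsigned long timeout,
					 unsigned long arrival)
{
	unsigned long deadline = arrival + timeout;
	unsigned long tick = t0 + timeout;

	if (!timeout)
		return arrival;	/* degenerate case, avoids looping forever */

	/* Skip expirations that happen before this fault is overdue. */
	while (tick < deadline)
		tick += timeout;
	return tick;
}

/* One timer per fault event: fires exactly at that fault's deadline. */
static unsigned long per_event_detect(unsigned long timeout,
				      unsigned long arrival)
{
	return arrival + timeout;
}
```

With t0 = 0 and timeout = 10, a fault arriving at 1s is noticed at 20s
by the shared timer (19s after it arrived), but at 11s with a per-event
timer.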

Best regards,
Lu Baolu


>  /**
>   * iommu_register_device_fault_handler() - Register a device fault handler
>   * @dev: the device
> @@ -879,6 +910,9 @@ int iommu_register_device_fault_handler(struct device *dev,
>  	param->fault_param->data = data;
>  	INIT_LIST_HEAD(&param->fault_param->faults);
>  
> +	if (prq_timeout)
> +		timer_setup(&param->fault_param->timer, iommu_dev_fault_timer_fn,
> +			TIMER_DEFERRABLE);
>  done_unlock:
>  	mutex_unlock(&param->lock);
>  
> @@ -935,6 +969,8 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
>  {
>  	int ret = 0;
>  	struct iommu_fault_event *evt_pending;
> +	struct timer_list *tmr;
> +	u64 exp;
>  	struct iommu_fault_param *fparam;
>  
>  	/* iommu_param is allocated when device is added to group */
> @@ -955,7 +991,17 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
>  			ret = -ENOMEM;
>  			goto done_unlock;
>  		}
> +		/* Keep track of response expiration time */
> +		exp = get_jiffies_64() + prq_timeout;
> +		evt_pending->expire = exp;
>  		mutex_lock(&fparam->lock);
> +		if (list_empty(&fparam->faults)) {
> +			/* First pending event, start timer */
> +			tmr = &dev->iommu_param->fault_param->timer;
> +			WARN_ON(timer_pending(tmr));
> +			mod_timer(tmr, exp);
> +		}
> +
>  		list_add_tail(&evt_pending->list, &fparam->faults);
>  		mutex_unlock(&fparam->lock);
>  	}
> @@ -1572,6 +1618,13 @@ int iommu_page_response(struct device *dev,
>  		}
>  	}
>  
> +	/* stop response timer if no more pending request */
> +	if (list_empty(&param->fault_param->faults) &&
> +		timer_pending(&param->fault_param->timer)) {
> +		pr_debug("no pending PRQ, stop timer\n");
> +		del_timer(&param->fault_param->timer);
> +	}
> +
>  done_unlock:
>  	mutex_unlock(&param->fault_param->lock);
>  	return ret;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 722b90f..f3665b7 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -360,6 +360,7 @@ enum iommu_fault_reason {
>   * @iommu_private: used by the IOMMU driver for storing fault-specific
>   *                 data. Users should not modify this field before
>   *                 sending the fault response.
> + * @expire: time limit in jiffies will wait for page response
>   */
>  struct iommu_fault_event {
>  	struct list_head list;
> @@ -373,6 +374,7 @@ struct iommu_fault_event {
>  	u32 prot;
>  	u64 device_private;
>  	u64 iommu_private;
> +	u64 expire;
>  };
>  
>  /**
> @@ -380,11 +382,13 @@ struct iommu_fault_event {
>   * @dev_fault_handler: Callback function to handle IOMMU faults at device level
>   * @data: handler private data
>   * @faults: holds the pending faults which needs response, e.g. page response.
> + * @timer: track page request pending time limit
>   * @lock: protect pending PRQ event list
>   */
>  struct iommu_fault_param {
>  	iommu_dev_fault_handler_t handler;
>  	struct list_head faults;
> +	struct timer_list timer;
>  	struct mutex lock;
>  	void *data;
>  };

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 17/23] iommu/vt-d: report non-recoverable faults to device
  2018-05-11 20:54 ` [PATCH v5 17/23] iommu/vt-d: report non-recoverable faults to device Jacob Pan
@ 2018-05-14  8:17   ` Lu Baolu
  2018-05-29 17:33     ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Lu Baolu @ 2018-05-14  8:17 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Liu, Yi L

Hi,

On 05/12/2018 04:54 AM, Jacob Pan wrote:
> Currently, dmar fault IRQ handler does nothing more than rate
> limited printk, no critical hardware handling need to be done
> in IRQ context.

Not exactly. dmar_fault() needs to clear all the faults so that
subsequent faults can be logged.

> For some use case such as vIOMMU, it might be useful to report
> non-recoverable faults outside host IOMMU subsystem. DMAR fault
> can come from both DMA and interrupt remapping which has to be
> set up early before threaded IRQ is available.
> This patch adds an option and a workqueue such that when faults
> are requested, DMAR fault IRQ handler can use the IOMMU fault
> reporting API to report.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/dmar.c        | 159 ++++++++++++++++++++++++++++++++++++++++++--
>  drivers/iommu/intel-iommu.c |   6 +-
>  include/linux/dmar.h        |   2 +-
>  include/linux/intel-iommu.h |   1 +
>  4 files changed, 159 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 0b5b052..ef846e3 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1110,6 +1110,12 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd)
>  	return err;
>  }
>  
> +static inline void dmar_free_fault_wq(struct intel_iommu *iommu)
> +{
> +	if (iommu->fault_wq)
> +		destroy_workqueue(iommu->fault_wq);
> +}
> +
>  static void free_iommu(struct intel_iommu *iommu)
>  {
>  	if (intel_iommu_enabled) {
> @@ -1126,6 +1132,7 @@ static void free_iommu(struct intel_iommu *iommu)
>  		free_irq(iommu->irq, iommu);
>  		dmar_free_hwirq(iommu->irq);
>  		iommu->irq = 0;
> +		dmar_free_fault_wq(iommu);
>  	}
>  
>  	if (iommu->qi) {
> @@ -1554,6 +1561,31 @@ static const char *irq_remap_fault_reasons[] =
>  	"Blocked an interrupt request due to source-id verification failure",
>  };
>  
> +/* fault data and status */
> +enum intel_iommu_fault_reason {
> +	INTEL_IOMMU_FAULT_REASON_SW,
> +	INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT,
> +	INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT,
> +	INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID,
> +	INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH,
> +	INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS,
> +	INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS,
> +	INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID,
> +	INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID,
> +	INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID,
> +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_RTP,
> +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_CTP,
> +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_PTE,
> +	NR_INTEL_IOMMU_FAULT_REASON,
> +};
> +
> +/* fault reasons that are allowed to be reported outside IOMMU subsystem */
> +#define INTEL_IOMMU_FAULT_REASON_ALLOWED			\
> +	((1ULL << INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH) |	\
> +		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS) |	\
> +		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS))
> +
> +
>  static const char *dmar_get_fault_reason(u8 fault_reason, int *fault_type)
>  {
>  	if (fault_reason >= 0x20 && (fault_reason - 0x20 <
> @@ -1634,11 +1666,91 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
>  	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
>  }
>  
> +static enum iommu_fault_reason to_iommu_fault_reason(u8 reason)
> +{
> +	if (reason >= NR_INTEL_IOMMU_FAULT_REASON) {
> +		pr_warn("unknown DMAR fault reason %d\n", reason);
> +		return IOMMU_FAULT_REASON_UNKNOWN;
> +	}
> +	switch (reason) {
> +	case INTEL_IOMMU_FAULT_REASON_SW:
> +	case INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT:
> +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT:
> +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID:
> +	case INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH:
> +	case INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID:
> +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID:
> +		return IOMMU_FAULT_REASON_INTERNAL;
> +	case INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID:
> +	case INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS:
> +	case INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS:
> +		return IOMMU_FAULT_REASON_PERMISSION;
> +	default:
> +		return IOMMU_FAULT_REASON_UNKNOWN;
> +	}
> +}
> +
> +struct dmar_fault_work {
> +	struct work_struct fault_work;
> +	struct intel_iommu *iommu;
> +	u64 addr;
> +	int type;
> +	int fault_type;
> +	enum intel_iommu_fault_reason reason;
> +	u16 sid;
> +};
> +
> +static void report_fault_to_device(struct work_struct *work)
> +{
> +	struct dmar_fault_work *dfw = container_of(work, struct dmar_fault_work,
> +						fault_work);
> +	struct iommu_fault_event event;
> +	struct pci_dev *pdev;
> +	u8 bus, devfn;
> +
> +	memset(&event, 0, sizeof(struct iommu_fault_event));
> +
> +	/* check if fault reason is permitted to report outside IOMMU */
> +	if (!((1 << dfw->reason) & INTEL_IOMMU_FAULT_REASON_ALLOWED)) {
> +		pr_debug("Fault reason %d not allowed to report to device\n",
> +			dfw->reason);

No need to print this message. Also, how about moving this check
earlier, before the work is queued?

> +		goto free_work;
> +	}
> +
> +	bus = PCI_BUS_NUM(dfw->sid);
> +	devfn = PCI_DEVFN(PCI_SLOT(dfw->sid), PCI_FUNC(dfw->sid));
> +	/*
> +	 * we need to check if the fault reporting is requested for the
> +	 * offending device.
> +	 */
> +	pdev = pci_get_domain_bus_and_slot(dfw->iommu->segment, bus, devfn);
> +	if (!pdev) {
> +		pr_warn("No PCI device found for source ID %x\n", dfw->sid);
> +		goto free_work;
> +	}
> +	/*
> +	 * unrecoverable fault is reported per IOMMU, notifier handler can
> +	 * resolve PCI device based on source ID.
> +	 */
> +	event.reason = to_iommu_fault_reason(dfw->reason);
> +	event.addr = dfw->addr;
> +	event.type = IOMMU_FAULT_DMA_UNRECOV;
> +	event.prot = dfw->type ? IOMMU_READ : IOMMU_WRITE;
> +	dev_warn(&pdev->dev, "report device unrecoverable fault: %d, %x, %d\n",
> +		event.reason, dfw->sid, event.type);

No need to print this warning message.

> +	iommu_report_device_fault(&pdev->dev, &event);
> +	pci_dev_put(pdev);
> +
> +free_work:
> +	kfree(dfw);
> +}
> +
>  static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
>  		u8 fault_reason, u16 source_id, unsigned long long addr)
>  {
>  	const char *reason;
>  	int fault_type;
> +	struct dmar_fault_work *dfw;
>  
>  	reason = dmar_get_fault_reason(fault_reason, &fault_type);
>  
> @@ -1647,11 +1759,29 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
>  			source_id >> 8, PCI_SLOT(source_id & 0xFF),
>  			PCI_FUNC(source_id & 0xFF), addr >> 48,
>  			fault_reason, reason);
> -	else
> +	else {
>  		pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx [fault reason %02d] %s\n",
>  		       type ? "DMA Read" : "DMA Write",
>  		       source_id >> 8, PCI_SLOT(source_id & 0xFF),
>  		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
> +	}

No need to add braces.

> +
> +	dfw = kmalloc(sizeof(*dfw), GFP_ATOMIC);
> +	if (!dfw)
> +		return -ENOMEM;
> +
> +	INIT_WORK(&dfw->fault_work, report_fault_to_device);
> +	dfw->addr = addr;
> +	dfw->type = type;
> +	dfw->fault_type = fault_type;
> +	dfw->reason = fault_reason;
> +	dfw->sid = source_id;
> +	dfw->iommu = iommu;
> +	if (!queue_work(iommu->fault_wq, &dfw->fault_work)) {

Check whether this fault is allowed to be reported to the device
before queuing the work.
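
A minimal sketch of such a check, written as standalone userspace C for
illustration. The enum and mask are copied from the patch;
dmar_fault_reportable() is a hypothetical helper name that does not
exist in the patch.

```c
#include <stdbool.h>

/* Userspace copy of the enum from the patch, for illustration only. */
enum intel_iommu_fault_reason {
	INTEL_IOMMU_FAULT_REASON_SW,
	INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT,
	INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT,
	INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID,
	INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH,
	INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS,
	INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS,
	INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID,
	INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID,
	INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID,
	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_RTP,
	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_CTP,
	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_PTE,
	NR_INTEL_IOMMU_FAULT_REASON,
};

/* Fault reasons that may be reported outside the IOMMU subsystem. */
#define INTEL_IOMMU_FAULT_REASON_ALLOWED			\
	((1ULL << INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH) |	\
	 (1ULL << INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS) |	\
	 (1ULL << INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS))

/*
 * Hypothetical helper: true if this fault reason may be reported to a
 * device. Out-of-range values are rejected before shifting so the
 * shift count always stays well below 64.
 */
static bool dmar_fault_reportable(unsigned int reason)
{
	if (reason >= NR_INTEL_IOMMU_FAULT_REASON)
		return false;
	return ((1ULL << reason) & INTEL_IOMMU_FAULT_REASON_ALLOWED) != 0;
}
```

dmar_fault_do_one() could then return 0 right after the pr_err() when
the helper returns false, which skips the GFP_ATOMIC allocation and the
queue_work() call for faults that would be dropped in the worker anyway.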

> +		kfree(dfw);
> +		return -EBUSY;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -1731,10 +1861,28 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
>  	return IRQ_HANDLED;
>  }
>  
> -int dmar_set_interrupt(struct intel_iommu *iommu)
> +static int dmar_set_fault_wq(struct intel_iommu *iommu)
> +{
> +	if (iommu->fault_wq)
> +		return 0;
> +
> +	iommu->fault_wq = alloc_ordered_workqueue(iommu->name, 0);
> +	if (!iommu->fault_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +int dmar_set_interrupt(struct intel_iommu *iommu, bool queue_fault)
>  {
>  	int irq, ret;
>  
> +	/* fault can be reported back to device drivers via a wq */
> +	if (queue_fault) {
> +		ret = dmar_set_fault_wq(iommu);
> +		if (ret)
> +			pr_err("Failed to create fault handling workqueue\n");
> +	}
>  	/*
>  	 * Check if the fault interrupt is already initialized.
>  	 */
> @@ -1748,10 +1896,11 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
>  		pr_err("No free IRQ vectors\n");
>  		return -EINVAL;
>  	}
> -
>  	ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
> -	if (ret)
> +	if (ret) {
>  		pr_err("Can't request irq\n");
> +		dmar_free_fault_wq(iommu);
> +	}
>  	return ret;
>  }
>  
> @@ -1765,7 +1914,7 @@ int __init enable_drhd_fault_handling(void)
>  	 */
>  	for_each_iommu(iommu, drhd) {
>  		u32 fault_status;
> -		int ret = dmar_set_interrupt(iommu);
> +		int ret = dmar_set_interrupt(iommu, false);
>  
>  		if (ret) {
>  			pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 684bd98..3949b3cf 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3401,10 +3401,10 @@ static int __init init_dmars(void)
>  				goto free_iommu;
>  		}
>  #endif
> -		ret = dmar_set_interrupt(iommu);
> +		ret = dmar_set_interrupt(iommu, true);
> +
>  		if (ret)
>  			goto free_iommu;
> -
>  		if (!translation_pre_enabled(iommu))
>  			iommu_enable_translation(iommu);
>  
> @@ -4291,7 +4291,7 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
>  			goto disable_iommu;
>  	}
>  #endif
> -	ret = dmar_set_interrupt(iommu);
> +	ret = dmar_set_interrupt(iommu, true);
>  	if (ret)
>  		goto disable_iommu;
>  
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index e2433bc..21f2162 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -278,7 +278,7 @@ extern void dmar_msi_unmask(struct irq_data *data);
>  extern void dmar_msi_mask(struct irq_data *data);
>  extern void dmar_msi_read(int irq, struct msi_msg *msg);
>  extern void dmar_msi_write(int irq, struct msi_msg *msg);
> -extern int dmar_set_interrupt(struct intel_iommu *iommu);
> +extern int dmar_set_interrupt(struct intel_iommu *iommu, bool queue_fault);
>  extern irqreturn_t dmar_fault(int irq, void *dev_id);
>  extern int dmar_alloc_hwirq(int id, int node, void *arg);
>  extern void dmar_free_hwirq(int irq);
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 5ac0c28..b3a26c7 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -472,6 +472,7 @@ struct intel_iommu {
>  	struct iommu_device iommu;  /* IOMMU core code handle */
>  	int		node;
>  	u32		flags;      /* Software defined flags */
> +	struct workqueue_struct *fault_wq; /* Reporting IOMMU fault to device */
>  };
>  
>  /* PCI domain-device relationship */

Best regards,
Lu Baolu


* Re: [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status
  2018-05-13  7:33   ` Lu Baolu
@ 2018-05-14 18:51     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 18:51 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Sun, 13 May 2018 15:33:23 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:53 AM, Jacob Pan wrote:
> > Adding a flag in device domain into to track whether a guest or  
> typo:                                       ^^info
> 
good catch, will fix. thanks
> Best regards,
> Lu Baolu
> 
>  [...]  
> 


* Re: [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status
  2018-05-13  8:01   ` Lu Baolu
@ 2018-05-14 18:52     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 18:52 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Sun, 13 May 2018 16:01:50 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> > +	u8 pasid_table_bound:1;  
> 
> Can you please add a comment here, so that people can understand
> exactly what this bit is for?
will do. I will add:
u8 pasid_table_bound:1;	/* PASID table is bound to a guest */


* Re: [PATCH v5 04/23] iommu/vt-d: add bind_pasid_table function
  2018-05-13  9:29   ` Lu Baolu
@ 2018-05-14 20:22     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 20:22 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Sun, 13 May 2018 17:29:47 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:53 AM, Jacob Pan wrote:
> > Add Intel VT-d ops to the generic iommu_bind_pasid_table API
> > functions.
> >
> > The primary use case is for direct assignment of SVM capable
> > device. Originated from emulated IOMMU in the guest, the request
> > goes through many layers (e.g. VFIO). Upon calling host IOMMU
> > driver, caller passes guest PASID table pointer (GPA) and size.
> >
> > Device context table entry is modified by Intel IOMMU specific
> > bind_pasid_table function. This will turn on nesting mode and
> > matching translation type.
> >
> > The unbind operation restores default context mapping.
> >
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/intel-iommu.c   | 122
> > ++++++++++++++++++++++++++++++++++++++++++
> > include/linux/dma_remapping.h |   1 + 2 files changed, 123
> > insertions(+)
> >
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index a0f81a4..4623294 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -2409,6 +2409,7 @@ static struct dmar_domain
> > *dmar_insert_one_dev_info(struct intel_iommu *iommu,
> > info->ats_supported = info->pasid_supported = info->pri_supported =
> > 0; info->ats_enabled = info->pasid_enabled = info->pri_enabled = 0;
> > info->ats_qdep = 0;
> > +	info->pasid_table_bound = 0;
> >  	info->dev = dev;
> >  	info->domain = domain;
> >  	info->iommu = iommu;
> > @@ -5132,6 +5133,7 @@ static void
> > intel_iommu_put_resv_regions(struct device *dev, 
> >  #ifdef CONFIG_INTEL_IOMMU_SVM
> >  #define MAX_NR_PASID_BITS (20)
> > +#define MIN_NR_PASID_BITS (5)
> >  static inline unsigned long intel_iommu_get_pts(struct intel_iommu
> > *iommu) {
> >  	/*
> > @@ -5258,6 +5260,122 @@ struct intel_iommu
> > *intel_svm_device_to_iommu(struct device *dev) 
> >  	return iommu;
> >  }
> > +
> > +static int intel_iommu_bind_pasid_table(struct iommu_domain
> > *domain,
> > +		struct device *dev, struct pasid_table_config
> > *pasidt_binfo) +{
> > +	struct intel_iommu *iommu;
> > +	struct context_entry *context;
> > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +	struct device_domain_info *info;
> > +	struct pci_dev *pdev;
> > +	u8 bus, devfn, host_table_pasid_bits;
> > +	u16 did, sid;
> > +	int ret = 0;
> > +	unsigned long flags;
> > +	u64 ctx_lo;  
> 
> I personally prefer to have these declarations ordered by length,
> longest first:
> 
>         struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>         u8 bus, devfn, host_table_pasid_bits;
>         struct device_domain_info *info;
>         struct context_entry *context;
>         struct intel_iommu *iommu;
>         struct pci_dev *pdev;
>         unsigned long flags;
>         u16 did, sid;
>         int ret = 0;
>         u64 ctx_lo;
> 
looks better.
> > +
> > +	if ((pasidt_binfo->version != PASID_TABLE_CFG_VERSION_1)
> > ||  
> 
> Unnecessary parentheses.
> 
here for readability.
> > +		pasidt_binfo->bytes != sizeof(*pasidt_binfo))  
> 
> Alignment should match open parenthesis.
> 
> > +		return -EINVAL;
> > +	iommu = device_to_iommu(dev, &bus, &devfn);
> > +	if (!iommu)
> > +		return -ENODEV;
> > +	/* VT-d spec section 9.4 says pasid table size is encoded
> > as 2^(x+5) */
> > +	host_table_pasid_bits = intel_iommu_get_pts(iommu) +
> > MIN_NR_PASID_BITS;
> > +	if (!pasidt_binfo || pasidt_binfo->pasid_bits >
> > host_table_pasid_bits ||  
> 
> The "!pasidt_binfo" check should be moved up, ahead of the version check,
> which already dereferences pasidt_binfo.
> 
good point!
> > +		pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
> > +		pr_err("Invalid gPASID bits %d, host range %d -
> > %d\n",  
> 
> How about dev_err()? 
> 
the error is not exactly specific to the device but rather the guest.
> > +			pasidt_binfo->pasid_bits,
> > +			MIN_NR_PASID_BITS, host_table_pasid_bits);
> > +		return -ERANGE;
> > +	}
> > +	if (!ecap_nest(iommu->ecap)) {
> > +		dev_err(dev, "Cannot bind PASID table, no nested
> > translation\n");
> > +		ret = -ENODEV;
> > +		goto out;  
> 
> How about
> +        return -ENODEV;
> ?
> 
> > +	}
> > +	pdev = to_pci_dev(dev);  
> 
> We can't always assume that it is a PCI device, right?
> 
for vt-d, I don't think we expect any non-pci device.
> > +	sid = PCI_DEVID(bus, devfn);
> > +	info = dev->archdata.iommu;
> > +
> > +	if (!info) {
> > +		dev_err(dev, "Invalid device domain info\n");
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +	if (info->pasid_table_bound) {  
> 
> We should do this check with the lock held.
> 
agreed. will hold the device_domain_lock.
> Otherwise,
> 
> Thread A on CPUx           Thread B on CPUy
> ================           ================
> check pasid_table_bound    check pasid_table_bound
> 
> mutex_lock()
> Setup context
> pasid_table_bound = 1
> mutex_unlock()
> 
>                            mutex_lock()
>                            Setup context
>                            pasid_table_bound = 1
>                            mutex_unlock()
> 
> 
> > +		dev_err(dev, "Device PASID table already bound\n");
> > +		ret = -EBUSY;
> > +		goto out;
> > +	}
> > +	if (!info->pasid_enabled) {
> > +		ret = pci_enable_pasid(pdev, info->pasid_supported
> > & ~1);
> > +		if (ret) {
> > +			dev_err(dev, "Failed to enable PASID\n");
> > +			goto out;
> > +		}
> > +	}  
> 
> I prefer a blank line here.
> 
>  [...]  
> 
> Best regards,
> Lu Baolu
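
For illustration only, two of the review points above (check the pointer
before anything dereferences it, and test pasid_table_bound only while
holding the lock) can be sketched in user space. Everything below is
hypothetical scaffolding: struct pasid_cfg, struct dev_info,
bind_pasid_table_sketch() and the pthread mutex merely stand in for
pasid_table_config, device_domain_info, the VT-d op and
device_domain_lock. It is not the driver code.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_VERSION_1     1   /* stand-in for PASID_TABLE_CFG_VERSION_1 */
#define MIN_NR_PASID_BITS 5   /* table size is encoded as 2^(x + 5) */

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct pasid_cfg {
	uint32_t version;
	uint32_t bytes;
	uint8_t  pasid_bits;
};

struct dev_info {
	unsigned int pasid_table_bound:1;
};

static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;

/* pts: hypothetical raw size field; the host table covers pts + 5 bits. */
static int bind_pasid_table_sketch(struct dev_info *info,
				   const struct pasid_cfg *cfg,
				   unsigned int pts)
{
	unsigned int host_bits = pts + MIN_NR_PASID_BITS;
	int ret = 0;

	/* NULL checks come first: every later test dereferences cfg. */
	if (!cfg || !info)
		return -EINVAL;
	if (cfg->version != CFG_VERSION_1 || cfg->bytes != sizeof(*cfg))
		return -EINVAL;
	if (cfg->pasid_bits < MIN_NR_PASID_BITS || cfg->pasid_bits > host_bits)
		return -ERANGE;

	/* Test and set the flag in one critical section to close the race. */
	pthread_mutex_lock(&domain_lock);
	if (info->pasid_table_bound)
		ret = -EBUSY;
	else
		info->pasid_table_bound = 1;	/* context setup would go here */
	pthread_mutex_unlock(&domain_lock);

	return ret;
}
```

With the check inside the critical section, the second of two concurrent
callers sees the flag already set and gets -EBUSY instead of setting up
the context a second time.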


* Re: [PATCH v5 06/23] iommu/vt-d: add definitions for PFSID
  2018-05-14  1:36   ` Lu Baolu
@ 2018-05-14 20:30     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 20:30 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 09:36:08 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:53 AM, Jacob Pan wrote:
> > When SRIOV VF device IOTLB is invalidated, we need to provide
> > the PF source ID such that IOMMU hardware can gauge the depth
> > of invalidation queue which is shared among VFs. This is needed
> > when device invalidation throttle (DIT) capability is supported.
> >
> > This patch adds bit definitions for checking and tracking PFSID.  
> 
> Patches 6 and 7 could be posted as a separate patch series.
> 
Thought about that also, but we need to include PFSID passdown
invalidation in this patchset. That is why I prefer to include this fix;
otherwise this patchset will continue to be wrong.
> >
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  include/linux/intel-iommu.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index ddc7d79..dfacd49 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -114,6 +114,7 @@
> >   * Extended Capability Register
> >   */
> >  
> > +#define ecap_dit(e)		((e >> 41) & 0x1)
> >  #define ecap_pasid(e)		((e >> 40) & 0x1)
> >  #define ecap_pss(e)		((e >> 35) & 0x1f)
> >  #define ecap_eafs(e)		((e >> 34) & 0x1)
> > @@ -284,6 +285,7 @@ enum {
> >  #define QI_DEV_IOTLB_SID(sid)	((u64)((sid) & 0xffff) << 32)
> >  #define QI_DEV_IOTLB_QDEP(qdep)	(((qdep) & 0x1f) << 16)
> >  #define QI_DEV_IOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
> > +#define QI_DEV_IOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xff0) << 48))
> >  #define QI_DEV_IOTLB_SIZE	1
> >  #define QI_DEV_IOTLB_MAX_INVS	32
> >  
> > @@ -308,6 +310,7 @@ enum {
> >  #define QI_DEV_EIOTLB_PASID(p)	(((u64)p) << 32)
> >  #define QI_DEV_EIOTLB_SID(sid)	((u64)((sid) & 0xffff) << 16)
> >  #define QI_DEV_EIOTLB_QDEP(qd)	((u64)((qd) & 0x1f) << 4)
> > +#define QI_DEV_EIOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xff0) << 48))  
> 
> PFSID[15:4] are stored in Descriptor [63:52], hence it should look
> like:
> 
> +#define QI_DEV_EIOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xfff0) << 48))
> 
> 
good catch! thanks.
> 
> >  #define QI_DEV_EIOTLB_MAX_INVS	32
> >  
> >  #define QI_PGRP_IDX(idx)	(((u64)(idx)) << 55)
> > @@ -467,6 +470,7 @@ struct device_domain_info {
> >  	struct list_head global; /* link to global list */
> >  	u8 bus;			/* PCI bus number */
> >  	u8 devfn;		/* PCI devfn number */
> > +	u16 pfsid;		/* SRIOV physical function source ID */
> >  	u8 pasid_supported:3;
> >  	u8 pasid_enabled:1;
> >  	u8 pri_supported:1;  
> 
> Best regards,
> Lu Baolu
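
To make the mask issue concrete, here is a small stand-alone sketch of
the two encodings discussed above. The function names are made up for
this example; only the shifts and masks come from the thread (PFSID[3:0]
in descriptor bits 15:12, PFSID[15:4] in bits 63:52).

```c
#include <stdint.h>

/* Mask as posted in v5: bits 15:12 of the PFSID are silently dropped. */
static inline uint64_t pfsid_encode_v5(uint16_t pfsid)
{
	return ((uint64_t)(pfsid & 0xf) << 12) |
	       ((uint64_t)(pfsid & 0xff0) << 48);
}

/* Mask suggested in review: PFSID[15:4] lands in descriptor bits 63:52. */
static inline uint64_t pfsid_encode_fixed(uint16_t pfsid)
{
	return ((uint64_t)(pfsid & 0xf) << 12) |
	       ((uint64_t)(pfsid & 0xfff0) << 48);
}

/* Recover the 16-bit PFSID from an encoded descriptor word. */
static inline uint16_t pfsid_decode(uint64_t desc)
{
	return (uint16_t)(((desc >> 12) & 0xf) | ((desc >> 48) & 0xfff0));
}
```

Round-tripping a source ID such as 0xabcd through the v5 mask gives back
0x0bcd, so any PF on a bus numbered 0x10 or above would be misreported;
the 0xfff0 mask preserves all 16 bits.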


* Re: [PATCH v5 07/23] iommu/vt-d: fix dev iotlb pfsid use
  2018-05-14  1:52   ` Lu Baolu
@ 2018-05-14 20:38     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 20:38 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 09:52:04 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index 4623294..732a10f 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -1459,6 +1459,19 @@ static void iommu_enable_dev_iotlb(struct
> > device_domain_info *info) return;
> >  
> >  	pdev = to_pci_dev(info->dev);
> > +	/* For IOMMU that supports device IOTLB throttling (DIT),
> > we assign
> > +	 * PFSID to the invalidation desc of a VF such that IOMMU
> > HW can gauge
> > +	 * queue depth at PF level. If DIT is not set, PFSID will
> > be treated as
> > +	 * reserved, which should be set to 0.
> > +	 */
> > +	if (!ecap_dit(info->iommu->ecap))
> > +		info->pfsid = 0;
> > +	else if (pdev && pdev->is_virtfn) {
> > +		if (ecap_dit(info->iommu->ecap))
> > +			dev_warn(&pdev->dev, "SRIOV VF device IOTLB enabled without flow control\n");  
> 
> I can't understand these two lines.
> 
> Isn't the condition always true? What does the error message mean?

you are right, there is no need to check ecap_dit again. thanks!


* Re: [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types
  2018-05-14  2:18   ` Lu Baolu
@ 2018-05-14 20:46     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 20:46 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 10:18:44 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:54 AM, Jacob Pan wrote:
> > When Shared Virtual Memory is exposed to a guest via vIOMMU,
> > extended IOTLB invalidation may be passed down from outside IOMMU
> > subsystems. This patch adds invalidation functions that can be used
> > for additional translation cache types.
> >
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/iommu/dmar.c        | 44
> > ++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/intel-iommu.h | 21 +++++++++++++++++++-- 2 files
> > changed, 63 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index 7852678..0b5b052 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1339,6 +1339,18 @@ void qi_flush_iotlb(struct intel_iommu
> > *iommu, u16 did, u64 addr, qi_submit_sync(&desc, iommu);
> >  }
> >  
> > +void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr,
> > u32 pasid,
> > +		unsigned int size_order, u64 granu, bool global)  
> 
> Alignment should match open parenthesis.
> 
> > +{
> > +	struct qi_desc desc;
> > +
> > +	desc.low = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
> > +		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
> > +	desc.high = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_GL(global) |
> > +		QI_EIOTLB_IH(0) | QI_EIOTLB_AM(size_order);
> > +	qi_submit_sync(&desc, iommu);
> > +}
> > +
> >  void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16
> > pfsid, u16 qdep, u64 addr, unsigned mask)
> >  {
> > @@ -1360,6 +1372,38 @@ void qi_flush_dev_iotlb(struct intel_iommu
> > *iommu, u16 sid, u16 pfsid, qi_submit_sync(&desc, iommu);
> >  }
> >  
> > +void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid,
> > +		u32 pasid,  u16 qdep, u64 addr, unsigned size, u64
> > granu)  
> 
> Ditto.
> 
> > +{
> > +	struct qi_desc desc;
> > +
> > +	desc.low = QI_DEV_EIOTLB_PASID(pasid) |
> > QI_DEV_EIOTLB_SID(sid) |
> > +		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE;  
> 
> Have you forgotten PFSID, or did I miss something here?
you are right, missed pfsid in this case.
> 
>  [...]  
> 
> Best regards,
> Lu Baolu
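
As a rough sketch of what the low descriptor word could look like once
PFSID is included, the helper below combines the field layouts quoted in
this thread. dev_eiotlb_desc_low() is a hypothetical name, the type value
0x8 is an assumption about QI_DEIOTLB_TYPE, and the 20-bit PASID mask is
an addition of this sketch (the quoted macro shifts without masking).

```c
#include <stdint.h>

#define DEIOTLB_TYPE 0x8ULL	/* assumed value of QI_DEIOTLB_TYPE */

static inline uint64_t dev_eiotlb_desc_low(uint16_t sid, uint16_t pfsid,
					   uint32_t pasid, uint16_t qdep)
{
	return ((uint64_t)(pasid & 0xfffff) << 32) |	/* PASID, bits 51:32 */
	       ((uint64_t)(sid & 0xffff) << 16) |	/* SID, bits 31:16 */
	       ((uint64_t)(qdep & 0x1f) << 4) |		/* QDEP, bits 8:4 */
	       ((uint64_t)(pfsid & 0xf) << 12) |	/* PFSID[3:0], bits 15:12 */
	       ((uint64_t)(pfsid & 0xfff0) << 48) |	/* PFSID[15:4], bits 63:52 */
	       DEIOTLB_TYPE;
}
```

None of the fields overlap, so the PFSID can simply be OR-ed into the
same word as the PASID, SID and queue depth.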


* Re: [PATCH v5 09/23] iommu/vt-d: add svm/sva invalidate function
  2018-05-14  3:35   ` Lu Baolu
@ 2018-05-14 20:49     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 20:49 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 11:35:05 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> > +	switch (inv_info->hdr.type) {
> > +	case IOMMU_INV_TYPE_TLB:
> > +		if (inv_info->size &&
> > +			(inv_info->addr & ((1 << (VTD_PAGE_SHIFT +
> > inv_info->size)) - 1))) {
> > +			pr_err("Addr out of range, addr 0x%llx,
> > size order %d\n",
> > +				inv_info->addr, inv_info->size);
> > +			ret = -ERANGE;
> > +			goto out_unlock;
> > +		}
> > +
> > +		qi_flush_eiotlb(iommu, did,
> > mm_to_dma_pfn(inv_info->addr),
> > +				inv_info->pasid,
> > +				inv_info->size, granu,
> > +				inv_info->flags &
> > IOMMU_INVALIDATE_GLOBAL_PAGE);
> > +		/**
> > +		 * Always flush device IOTLB if ATS is enabled
> > since guest
> > +		 * vIOMMU exposes CM = 1, no device IOTLB flush
> > will be passed
> > +		 * down.
> > +		 */
> > +		info = iommu_support_dev_iotlb(dmar_domain, iommu,
> > bus, devfn);
> > +		if (info && info->ats_enabled) {
> > +			qi_flush_dev_eiotlb(iommu, sid,
> > +					inv_info->pasid,
> > info->ats_qdep,
> > +					inv_info->addr,
> > inv_info->size,
> > +					granu);
> > +		}
> > +		break;
> > +	case IOMMU_INV_TYPE_PASID:
> > +		qi_flush_pasid(iommu, did, granu, inv_info->pasid);
> > +
> > +		break;
> > +	default:
> > +		dev_err(dev, "Unknown IOMMU invalidation type
> > %d\n",
> > +			inv_info->hdr.type);  
> 
> There are three types of invalidation:
> 
> enum iommu_inv_type {
>         IOMMU_INV_TYPE_DTLB,
>         IOMMU_INV_TYPE_TLB,
>         IOMMU_INV_TYPE_PASID,
>         IOMMU_INV_NR_TYPE
> };
> 
> So "unsupported" looks better than "unknown" in the message.
> 
agreed, makes more sense.
>  [...]  
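
A side note on the range check in the quoted hunk: the mask there is
built with "1 <<", which is an int shift, so it would overflow once
VTD_PAGE_SHIFT + size reaches 31. The stand-alone sketch below shows the
same alignment test with a 64-bit constant. inv_range_misaligned() is a
hypothetical helper and the upper bound on the order is an assumption of
this example, not something the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

#define VTD_PAGE_SHIFT 12

/*
 * True when addr is not aligned to a range of 2^size_order pages.
 * As in the quoted hunk, order 0 (a single page) is not checked.
 */
static inline bool inv_range_misaligned(uint64_t addr, unsigned int size_order)
{
	uint64_t mask;

	if (!size_order)
		return false;
	if (VTD_PAGE_SHIFT + size_order >= 64)
		return true;	/* treat an out-of-range order as invalid */

	mask = (1ULL << (VTD_PAGE_SHIFT + size_order)) - 1;
	return (addr & mask) != 0;
}
```

For example, order 9 describes a 2MB range, so 0x200000 passes while
0x201000 would be rejected with -ERANGE by the caller.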


* Re: [PATCH v5 11/23] driver core: add per device iommu param
  2018-05-14  5:27   ` Lu Baolu
@ 2018-05-14 20:52     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 20:52 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 13:27:13 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:54 AM, Jacob Pan wrote:
> > DMA faults can be detected by IOMMU at device level. Adding a
> > pointer to struct device allows IOMMU subsystem to report relevant
> > faults back to the device driver for further handling.
> > For direct assigned device (or user space drivers), guest OS holds
> > responsibility to handle and respond per device IOMMU fault.
> > Therefore we need fault reporting mechanism to propagate faults
> > beyond IOMMU subsystem.
> >
> > There are two other IOMMU data pointers under struct device today,
> > here we introduce iommu_param as a parent pointer such that all
> > device IOMMU data can be consolidated here. The idea was suggested
> > here by Greg KH and Joerg. The name iommu_param is chosen here
> > since iommu_data has been used.  
> 
> This doesn't match what you've done in the patch. Maybe you
> forgot to clean up? :-)
> 
No, I was trying to explain the thought process behind naming
iommu_param. I meant to say iommu_data is probably a better name but
it is taken already.
> The idea is to create a parent pointer under device struct and
> move previous iommu_group and iommu_fwspec together with
> the iommu fault related data into it.
> 
> Best regards,
> Lu Baolu
> 
>  [...]  
> 

[Jacob Pan]


* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-14  6:01   ` Lu Baolu
@ 2018-05-14 20:55     ` Jacob Pan
  2018-05-15  6:52       ` Lu Baolu
  0 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-14 20:55 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 14:01:06 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:54 AM, Jacob Pan wrote:
> > Traditionally, device specific faults are detected and handled
> > within their own device drivers. When IOMMU is enabled, faults such
> > as DMA related transactions are detected by IOMMU. There is no
> > generic reporting mechanism to report faults back to the in-kernel
> > device driver or the guest OS in case of assigned devices.
> >
> > Faults detected by the IOMMU are based on the transaction's source ID,
> > which can be reported on a per-device basis, regardless of whether the
> > device is a PCI device or not.
> >
> > The fault types include recoverable (e.g. page request) and
> > unrecoverable faults(e.g. access error). In most cases, faults can
> > be handled by IOMMU drivers internally. The primary use cases are as
> > follows:
> > 1. page request fault originated from an SVM capable device that is
> > assigned to guest via vIOMMU. In this case, the first level page
> > tables are owned by the guest. Page request must be propagated to
> > the guest to let guest OS fault in the pages then send page
> > response. In this mechanism, the direct receiver of IOMMU fault
> > notification is VFIO, which can relay notification events to QEMU
> > or other user space software.
> >
> > 2. faults that need more subtle handling by device drivers. Rather than
> > simply invoking the reset function, the device driver should be able to
> > handle the fault with a smaller impact.
> >
> > This patchset is intended to create a generic fault report API such
> > that it can scale as follows:
> > - all IOMMU types
> > - PCI and non-PCI devices
> > - recoverable and unrecoverable faults
> > - VFIO and other in-kernel users
> > - DMA & IRQ remapping (TBD)
> > The original idea was brought up by David Woodhouse and discussions
> > summarized at https://lwn.net/Articles/608914/.
> >
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > ---
> >  drivers/iommu/iommu.c | 149
> > +++++++++++++++++++++++++++++++++++++++++++++++++-
> > include/linux/iommu.h |  35 +++++++++++- 2 files changed, 181
> > insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 3a49b96..b3f9daf 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -609,6 +609,13 @@ int iommu_group_add_device(struct iommu_group
> > *group, struct device *dev) goto err_free_name;
> >  	}
> >  
> > +	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param),
> > GFP_KERNEL);
> > +	if (!dev->iommu_param) {
> > +		ret = -ENOMEM;
> > +		goto err_free_name;
> > +	}
> > +	mutex_init(&dev->iommu_param->lock);
> > +
> >  	kobject_get(group->devices_kobj);
> >  
> >  	dev->iommu_group = group;
> > @@ -639,6 +646,7 @@ int iommu_group_add_device(struct iommu_group
> > *group, struct device *dev) mutex_unlock(&group->mutex);
> >  	dev->iommu_group = NULL;
> >  	kobject_put(group->devices_kobj);
> > +	kfree(dev->iommu_param);
> >  err_free_name:
> >  	kfree(device->name);
> >  err_remove_link:
> > @@ -685,7 +693,7 @@ void iommu_group_remove_device(struct device
> > *dev) sysfs_remove_link(&dev->kobj, "iommu_group");
> >  
> >  	trace_remove_device_from_group(group->id, dev);
> > -
> > +	kfree(dev->iommu_param);
> >  	kfree(device->name);
> >  	kfree(device);
> >  	dev->iommu_group = NULL;
> > @@ -820,6 +828,145 @@ int iommu_group_unregister_notifier(struct
> > iommu_group *group,
> > EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); 
> >  /**
> > + * iommu_register_device_fault_handler() - Register a device fault
> > handler
> > + * @dev: the device
> > + * @handler: the fault handler
> > + * @data: private data passed as argument to the handler
> > + *
> > + * When an IOMMU fault event is received, call this handler with
> > the fault event
> > + * and data as argument. The handler should return 0 on success.
> > If the fault is
> > + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also
> > complete
> > + * the fault by calling iommu_page_response() with one of the
> > following
> > + * response code:
> > + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> > + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
> > + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop
> > reporting
> > + *   page faults if possible.
> > + *
> > + * Return 0 if the fault handler was installed successfully, or an
> > error.
> > + */
> > +int iommu_register_device_fault_handler(struct device *dev,
> > +					iommu_dev_fault_handler_t
> > handler,
> > +					void *data)
> > +{
> > +	struct iommu_param *param = dev->iommu_param;
> > +	int ret = 0;
> > +
> > +	/*
> > +	 * Device iommu_param should have been allocated when
> > device is
> > +	 * added to its iommu_group.
> > +	 */
> > +	if (!param)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&param->lock);
> > +	/* Only allow one fault handler registered for each device
> > */
> > +	if (param->fault_param) {
> > +		ret = -EBUSY;
> > +		goto done_unlock;
> > +	}
> > +
> > +	get_device(dev);
> > +	param->fault_param =
> > +		kzalloc(sizeof(struct iommu_fault_param),
> > GFP_KERNEL);
> > +	if (!param->fault_param) {
> > +		put_device(dev);
> > +		ret = -ENOMEM;
> > +		goto done_unlock;
> > +	}
> > +	mutex_init(&param->fault_param->lock);  
> 
> Do we really need this mutex lock? Is param->lock enough?
> 
I am trying to provide finer locking granularity, in that
iommu_param is meant to be expanded as the sole iommu data under struct
device, so the scope of param->lock may expand.
>  [...]  
> 
> Best regards,
> Lu Baolu

[Jacob Pan]
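
A user-space sketch of the two-level locking described above, for
illustration only. The outer lock guards the fault_param pointer (and
whatever else iommu_param grows to hold), while the inner lock is left
for per-device fault state. The struct and function names are
hypothetical, and pthread mutexes and calloc() stand in for the kernel
mutex and kzalloc().

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

typedef int (*fault_handler_t)(void *fault, void *data);

struct fault_param_sketch {
	fault_handler_t handler;
	void *data;
	pthread_mutex_t lock;	/* inner lock: per-device fault state only */
};

struct iommu_param_sketch {
	pthread_mutex_t lock;	/* outer lock: guards the pointer below */
	struct fault_param_sketch *fault_param;
};

static int register_fault_handler(struct iommu_param_sketch *param,
				  fault_handler_t handler, void *data)
{
	struct fault_param_sketch *fp;
	int ret = 0;

	if (!param)
		return -EINVAL;

	pthread_mutex_lock(&param->lock);
	/* Only one fault handler may be registered per device. */
	if (param->fault_param) {
		ret = -EBUSY;
		goto done_unlock;
	}

	fp = calloc(1, sizeof(*fp));
	if (!fp) {
		ret = -ENOMEM;
		goto done_unlock;
	}
	pthread_mutex_init(&fp->lock, NULL);
	fp->handler = handler;
	fp->data = data;
	param->fault_param = fp;

done_unlock:
	pthread_mutex_unlock(&param->lock);
	return ret;
}
```

The point of the split is that a fault reporting path could later take
only the inner lock, so it would not contend with unrelated users of the
outer one as iommu_param gains more members.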


* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-14 20:55     ` Jacob Pan
@ 2018-05-15  6:52       ` Lu Baolu
  0 siblings, 0 replies; 78+ messages in thread
From: Lu Baolu @ 2018-05-15  6:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig

Hi,

On 05/15/2018 04:55 AM, Jacob Pan wrote:
> On Mon, 14 May 2018 14:01:06 +0800
> Lu Baolu <baolu.lu@linux.intel.com> wrote:
>
>> Hi,
>>
>> On 05/12/2018 04:54 AM, Jacob Pan wrote:
>>> Traditionally, device specific faults are detected and handled
>>> within their own device drivers. When IOMMU is enabled, faults such
>>> as DMA related transactions are detected by IOMMU. There is no
>>> generic reporting mechanism to report faults back to the in-kernel
>>> device driver or the guest OS in case of assigned devices.
>>>
>>> Faults detected by the IOMMU are based on the transaction's source ID,
>>> which can be reported on a per-device basis, regardless of whether the
>>> device is a PCI device or not.
>>>
>>> The fault types include recoverable (e.g. page request) and
>>> unrecoverable faults(e.g. access error). In most cases, faults can
>>> be handled by IOMMU drivers internally. The primary use cases are as
>>> follows:
>>> 1. page request fault originated from an SVM capable device that is
>>> assigned to guest via vIOMMU. In this case, the first level page
>>> tables are owned by the guest. Page request must be propagated to
>>> the guest to let guest OS fault in the pages then send page
>>> response. In this mechanism, the direct receiver of IOMMU fault
>>> notification is VFIO, which can relay notification events to QEMU
>>> or other user space software.
>>>
>>> 2. faults need more subtle handling by device drivers. Other than
>>> simply invoke reset function, there are needs to let device driver
>>> handle the fault with a smaller impact.
>>>
>>> This patchset is intended to create a generic fault report API such
>>> that it can scale as follows:
>>> - all IOMMU types
>>> - PCI and non-PCI devices
>>> - recoverable and unrecoverable faults
>>> - VFIO and other in-kernel users
>>> - DMA & IRQ remapping (TBD)
>>> The original idea was brought up by David Woodhouse and discussions
>>> summarized at https://lwn.net/Articles/608914/.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>>> ---
>>>  drivers/iommu/iommu.c | 149
>>> +++++++++++++++++++++++++++++++++++++++++++++++++-
>>> include/linux/iommu.h |  35 +++++++++++- 2 files changed, 181
>>> insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index 3a49b96..b3f9daf 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -609,6 +609,13 @@ int iommu_group_add_device(struct iommu_group
>>> *group, struct device *dev) goto err_free_name;
>>>  	}
>>>  
>>> +	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param),
>>> GFP_KERNEL);
>>> +	if (!dev->iommu_param) {
>>> +		ret = -ENOMEM;
>>> +		goto err_free_name;
>>> +	}
>>> +	mutex_init(&dev->iommu_param->lock);
>>> +
>>>  	kobject_get(group->devices_kobj);
>>>  
>>>  	dev->iommu_group = group;
>>> @@ -639,6 +646,7 @@ int iommu_group_add_device(struct iommu_group
>>> *group, struct device *dev) mutex_unlock(&group->mutex);
>>>  	dev->iommu_group = NULL;
>>>  	kobject_put(group->devices_kobj);
>>> +	kfree(dev->iommu_param);
>>>  err_free_name:
>>>  	kfree(device->name);
>>>  err_remove_link:
>>> @@ -685,7 +693,7 @@ void iommu_group_remove_device(struct device
>>> *dev) sysfs_remove_link(&dev->kobj, "iommu_group");
>>>  
>>>  	trace_remove_device_from_group(group->id, dev);
>>> -
>>> +	kfree(dev->iommu_param);
>>>  	kfree(device->name);
>>>  	kfree(device);
>>>  	dev->iommu_group = NULL;
>>> @@ -820,6 +828,145 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>>>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>>>  
>>>  /**
>>> + * iommu_register_device_fault_handler() - Register a device fault handler
>>> + * @dev: the device
>>> + * @handler: the fault handler
>>> + * @data: private data passed as argument to the handler
>>> + *
>>> + * When an IOMMU fault event is received, call this handler with the fault event
>>> + * and data as argument. The handler should return 0 on success. If the fault is
>>> + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
>>> + * the fault by calling iommu_page_response() with one of the following
>>> + * response code:
>>> + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
>>> + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
>>> + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
>>> + *   page faults if possible.
>>> + *
>>> + * Return 0 if the fault handler was installed successfully, or an error.
>>> + */
>>> +int iommu_register_device_fault_handler(struct device *dev,
>>> +					iommu_dev_fault_handler_t handler,
>>> +					void *data)
>>> +{
>>> +	struct iommu_param *param = dev->iommu_param;
>>> +	int ret = 0;
>>> +
>>> +	/*
>>> +	 * Device iommu_param should have been allocated when device is
>>> +	 * added to its iommu_group.
>>> +	 */
>>> +	if (!param)
>>> +		return -EINVAL;
>>> +
>>> +	mutex_lock(&param->lock);
>>> +	/* Only allow one fault handler registered for each device */
>>> +	if (param->fault_param) {
>>> +		ret = -EBUSY;
>>> +		goto done_unlock;
>>> +	}
>>> +
>>> +	get_device(dev);
>>> +	param->fault_param =
>>> +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
>>> +	if (!param->fault_param) {
>>> +		put_device(dev);
>>> +		ret = -ENOMEM;
>>> +		goto done_unlock;
>>> +	}
>>> +	mutex_init(&param->fault_param->lock);  
>> Do we really need this mutex lock? Is param->lock enough?
>>
> I am trying to provide more fine locking granularity in that
> iommu_param is meant to be expanded as the sole iommu data under struct
> device, so the scope of param->lock may expand.

Okay, got it.

Best regards,
Lu Baolu

>>  [...]  
>>
>> Best regards,
>> Lu Baolu
> [Jacob Pan]
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types
  2018-05-11 20:54 ` [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types Jacob Pan
  2018-05-14  2:18   ` Lu Baolu
@ 2018-05-17  8:44   ` kbuild test robot
  1 sibling, 0 replies; 78+ messages in thread
From: kbuild test robot @ 2018-05-17  8:44 UTC (permalink / raw)
  To: Jacob Pan
  Cc: kbuild-all, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker,
	Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, Jacob Pan

[-- Attachment #1: Type: text/plain, Size: 1975 bytes --]

Hi Jacob,

I love your patch! Perhaps something to improve:

[auto build test WARNING on iommu/next]
[also build test WARNING on v4.17-rc5 next-20180516]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jacob-Pan/IOMMU-and-VT-d-driver-support-for-Shared-Virtual-Address-SVA/20180512-114854
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: x86_64-rhel-7.2 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/iommu/dmar.c: In function 'qi_flush_dev_eiotlb':
>> drivers/iommu/dmar.c:1382:12: warning: 'desc.high' is used uninitialized in this function [-Wuninitialized]
     desc.high |= QI_DEV_EIOTLB_GLOB(granu);
               ^~

vim +1382 drivers/iommu/dmar.c

  1374	
  1375	void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid,
  1376			u32 pasid,  u16 qdep, u64 addr, unsigned size, u64 granu)
  1377	{
  1378		struct qi_desc desc;
  1379	
  1380		desc.low = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
  1381			QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE;
> 1382		desc.high |= QI_DEV_EIOTLB_GLOB(granu);
  1383	
  1384		/* If S bit is 0, we only flush a single page. If S bit is set,
  1385		 * The least significant zero bit indicates the size. VT-d spec
  1386		 * 6.5.2.6
  1387		 */
  1388		if (!size)
  1389			desc.high = QI_DEV_EIOTLB_ADDR(addr) & ~QI_DEV_EIOTLB_SIZE;
  1390		else {
  1391			unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size);
  1392	
  1393			desc.high = QI_DEV_EIOTLB_ADDR(addr & ~mask) | QI_DEV_EIOTLB_SIZE;
  1394		}
  1395		qi_submit_sync(&desc, iommu);
  1396	}
  1397	
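
One possible way to address the warning above, sketched as a standalone user-space fragment rather than as the actual kernel fix. The `DEMO_*` macros and `struct demo_qi_desc` are illustrative stand-ins (assumptions), not the definitions from include/linux/intel-iommu.h. The address/size arithmetic is copied unchanged from the listing. The only changes are zero-initializing the descriptor and OR-ing in the global bit after the `high` word has been assigned, so that it is neither read uninitialized nor overwritten.

```c
#include <stdint.h>

/* Illustrative stand-ins only; NOT the kernel's QI_DEV_EIOTLB_* macros. */
#define DEMO_VTD_PAGE_SHIFT	12
#define DEMO_EIOTLB_ADDR(a)	((uint64_t)(a) & ~((1ULL << DEMO_VTD_PAGE_SHIFT) - 1))
#define DEMO_EIOTLB_SIZE	(1ULL << 11)
#define DEMO_EIOTLB_GLOB(g)	((uint64_t)(g) & 0x1)

/* Hypothetical two-qword descriptor, mirroring the low/high layout above. */
struct demo_qi_desc {
	uint64_t low;
	uint64_t high;
};

/* Build the high qword: assign address/size first, then OR in the global bit. */
static uint64_t demo_eiotlb_high(uint64_t addr, unsigned int size, uint64_t granu)
{
	struct demo_qi_desc desc = { 0 };	/* zero-init: no uninitialized read */

	if (!size) {
		desc.high = DEMO_EIOTLB_ADDR(addr) & ~DEMO_EIOTLB_SIZE;
	} else {
		uint64_t mask = 1ULL << (DEMO_VTD_PAGE_SHIFT + size);

		desc.high = DEMO_EIOTLB_ADDR(addr & ~mask) | DEMO_EIOTLB_SIZE;
	}
	/* OR the global bit last so the assignments above cannot clobber it. */
	desc.high |= DEMO_EIOTLB_GLOB(granu);

	return desc.high;
}
```

With these stand-in macros, a call such as `demo_eiotlb_high(addr, 0, 1)` keeps both the page-aligned address and the global bit in the result.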

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40417 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-11 20:54 ` [PATCH v5 13/23] iommu: introduce device fault report API Jacob Pan
  2018-05-14  6:01   ` Lu Baolu
@ 2018-05-17 11:41   ` Liu, Yi L
  2018-05-17 15:59     ` Jacob Pan
  2018-09-06  9:25   ` Auger Eric
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 78+ messages in thread
From: Liu, Yi L @ 2018-05-17 11:41 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Wysocki, Rafael J, Tian, Kevin, Raj, Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu

> From: Jacob Pan [mailto:jacob.jun.pan@linux.intel.com]
> Sent: Saturday, May 12, 2018 4:54 AM
> 
> Traditionally, device specific faults are detected and handled within their own device
> drivers. When IOMMU is enabled, faults such as DMA related transactions are
> detected by IOMMU. There is no generic reporting mechanism to report faults back
> to the in-kernel device driver or the guest OS in case of assigned devices.
> 
> Faults detected by the IOMMU are based on the transaction's source ID, which can be
> reported on a per-device basis, regardless of whether the device is a PCI device or not.
> 
> The fault types include recoverable (e.g. page request) and unrecoverable faults (e.g.
> access error). In most cases, faults can be handled by IOMMU drivers internally. The
> primary use cases are as follows:
> 1. page request fault originated from an SVM capable device that is assigned to
> guest via vIOMMU. In this case, the first level page tables are owned by the guest.
> Page request must be propagated to the guest to let guest OS fault in the pages then
> send page response. In this mechanism, the direct receiver of IOMMU fault
> notification is VFIO, which can relay notification events to QEMU or other user space
> software.
> 
> 2. faults need more subtle handling by device drivers. Rather than simply invoking a reset
> function, there is a need to let the device driver handle the fault with a smaller impact.
> 
> This patchset is intended to create a generic fault report API such that it can scale as
> follows:
> - all IOMMU types
> - PCI and non-PCI devices
> - recoverable and unrecoverable faults
> - VFIO and other in-kernel users
> - DMA & IRQ remapping (TBD)
> The original idea was brought up by David Woodhouse and discussions summarized
> at https://lwn.net/Articles/608914/.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/iommu/iommu.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iommu.h |  35 +++++++++++-
>  2 files changed, 181 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3a49b96..b3f9daf 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -609,6 +609,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  		goto err_free_name;
>  	}
> 
> +	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
> +	if (!dev->iommu_param) {
> +		ret = -ENOMEM;
> +		goto err_free_name;
> +	}
> +	mutex_init(&dev->iommu_param->lock);
> +
>  	kobject_get(group->devices_kobj);
> 
>  	dev->iommu_group = group;
> @@ -639,6 +646,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  	mutex_unlock(&group->mutex);
>  	dev->iommu_group = NULL;
>  	kobject_put(group->devices_kobj);
> +	kfree(dev->iommu_param);
>  err_free_name:
>  	kfree(device->name);
>  err_remove_link:
> @@ -685,7 +693,7 @@ void iommu_group_remove_device(struct device *dev)
>  	sysfs_remove_link(&dev->kobj, "iommu_group");
> 
>  	trace_remove_device_from_group(group->id, dev);
> -
> +	kfree(dev->iommu_param);
>  	kfree(device->name);
>  	kfree(device);
>  	dev->iommu_group = NULL;
> @@ -820,6 +828,145 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> 
>  /**
> + * iommu_register_device_fault_handler() - Register a device fault handler
> + * @dev: the device
> + * @handler: the fault handler
> + * @data: private data passed as argument to the handler
> + *
> + * When an IOMMU fault event is received, call this handler with the fault event
> + * and data as argument. The handler should return 0 on success. If the fault is
> + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
> + * the fault by calling iommu_page_response() with one of the following
> + * response code:
> + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
> + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
> + *   page faults if possible.
> + *
> + * Return 0 if the fault handler was installed successfully, or an error.
> + */
> +int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = 0;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->lock);
> +	/* Only allow one fault handler registered for each device */
> +	if (param->fault_param) {
> +		ret = -EBUSY;
> +		goto done_unlock;
> +	}
> +
> +	get_device(dev);
> +	param->fault_param =
> +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!param->fault_param) {
> +		put_device(dev);
> +		ret = -ENOMEM;
> +		goto done_unlock;
> +	}
> +	mutex_init(&param->fault_param->lock);
> +	param->fault_param->handler = handler;
> +	param->fault_param->data = data;
> +	INIT_LIST_HEAD(&param->fault_param->faults);
> +
> +done_unlock:
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> +
> +/**
> + * iommu_unregister_device_fault_handler() - Unregister the device fault handler
> + * @dev: the device
> + *
> + * Remove the device fault handler installed with
> + * iommu_register_device_fault_handler().
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = 0;
> +
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->lock);
> +	/* we cannot unregister handler if there are pending faults */
> +	if (!list_empty(&param->fault_param->faults)) {
> +		ret = -EBUSY;
> +		goto unlock;
> +	}
> +
> +	kfree(param->fault_param);
> +	param->fault_param = NULL;
> +	put_device(dev);
> +unlock:
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> +
> +
> +/**
> + * iommu_report_device_fault() - Report fault event to device
> + * @dev: the device
> + * @evt: fault event data
> + *
> + * Called by IOMMU model specific drivers when fault is detected, typically
> + * in a threaded IRQ handler.
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	int ret = 0;
> +	struct iommu_fault_event *evt_pending;
> +	struct iommu_fault_param *fparam;
> +
> +	/* iommu_param is allocated when device is added to group */
> +	if (!dev->iommu_param | !evt)
> +		return -EINVAL;
> +	/* we only report device fault if there is a handler registered */
> +	mutex_lock(&dev->iommu_param->lock);
> +	if (!dev->iommu_param->fault_param ||
> +		!dev->iommu_param->fault_param->handler) {
> +		ret = -EINVAL;
> +		goto done_unlock;
> +	}
> +	fparam = dev->iommu_param->fault_param;
> +	if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
> +		evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
> +				GFP_KERNEL);
> +		if (!evt_pending) {
> +			ret = -ENOMEM;
> +			goto done_unlock;
> +		}
> +		mutex_lock(&fparam->lock);
> +		list_add_tail(&evt_pending->list, &fparam->faults);

I may have missed it, but I only see the list add here. How about removal? Who
removes entries from the fault list?

> +		mutex_unlock(&fparam->lock);
> +	}
> +	ret = fparam->handler(evt, fparam->data);

I remember you mentioned there will be a queue to store the faults. Is it in the
fparam->faults list? Or there is no such queue?

Thanks,
Yi Liu

> +done_unlock:
> +	mutex_unlock(&dev->iommu_param->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> +
> +/**
>   * iommu_group_id - Return ID for a group
>   * @group: the group to ID
>   *
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index aeadb4f..b3312ee 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -307,7 +307,8 @@ enum iommu_fault_reason {
>   * and PASID spec.
>   * - Un-recoverable faults of device interest
>   * - DMA remapping and IRQ remapping faults
> -
> + *
> + * @list pending fault event list, used for tracking responses
>   * @type contains fault type.
>   * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
>   *         faults are not reported
> @@ -324,6 +325,7 @@ enum iommu_fault_reason {
>   *                 sending the fault response.
>   */
>  struct iommu_fault_event {
> +	struct list_head list;
>  	enum iommu_fault_type type;
>  	enum iommu_fault_reason reason;
>  	u64 addr;
> @@ -340,10 +342,13 @@ struct iommu_fault_event {
>   * struct iommu_fault_param - per-device IOMMU fault data
>   * @dev_fault_handler: Callback function to handle IOMMU faults at device level
>   * @data: handler private data
> - *
> + * @faults: holds the pending faults which needs response, e.g. page response.
> + * @lock: protect pending PRQ event list
>   */
>  struct iommu_fault_param {
>  	iommu_dev_fault_handler_t handler;
> +	struct list_head faults;
> +	struct mutex lock;
>  	void *data;
>  };
> 
> @@ -357,6 +362,7 @@ struct iommu_fault_param {
>   *	struct iommu_fwspec	*iommu_fwspec;
>   */
>  struct iommu_param {
> +	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
>  };
> 
> @@ -456,6 +462,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
>  					 struct notifier_block *nb);
>  extern int iommu_group_unregister_notifier(struct iommu_group *group,
>  					   struct notifier_block *nb);
> +extern int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data);
> +
> +extern int iommu_unregister_device_fault_handler(struct device *dev);
> +
> +extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
> +
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -727,6 +741,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
>  	return 0;
>  }
> 
> +static inline int iommu_register_device_fault_handler(struct device *dev,
> +						iommu_dev_fault_handler_t handler,
> +						void *data)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-17 11:41   ` Liu, Yi L
@ 2018-05-17 15:59     ` Jacob Pan
  2018-05-17 23:22       ` Liu, Yi L
  0 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-17 15:59 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Wysocki, Rafael J, Tian,
	Kevin, Raj, Ashok, Jean Delvare, Christoph Hellwig, Lu Baolu,
	jacob.jun.pan

On Thu, 17 May 2018 11:41:56 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > +int iommu_report_device_fault(struct device *dev, struct
> > +iommu_fault_event *evt) {
> > +	int ret = 0;
> > +	struct iommu_fault_event *evt_pending;
> > +	struct iommu_fault_param *fparam;
> > +
> > +	/* iommu_param is allocated when device is added to group
> > */
> > +	if (!dev->iommu_param | !evt)
> > +		return -EINVAL;
> > +	/* we only report device fault if there is a handler
> > registered */
> > +	mutex_lock(&dev->iommu_param->lock);
> > +	if (!dev->iommu_param->fault_param ||
> > +		!dev->iommu_param->fault_param->handler) {
> > +		ret = -EINVAL;
> > +		goto done_unlock;
> > +	}
> > +	fparam = dev->iommu_param->fault_param;
> > +	if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
> > +		evt_pending = kmemdup(evt, sizeof(struct
> > iommu_fault_event),
> > +				GFP_KERNEL);
> > +		if (!evt_pending) {
> > +			ret = -ENOMEM;
> > +			goto done_unlock;
> > +		}
> > +		mutex_lock(&fparam->lock);
> > +		list_add_tail(&evt_pending->list,
> > &fparam->faults);  
> 
> I may have missed it, but I only see the list add here. How about
> removal? Who removes entries from the fault list?
> 
Deletion of the pending event happens in the page response function
(iommu_page_response()). Once the IOMMU driver finds a matching response
for the pending request, it deletes the pending event.

If the response never comes, right now we don't delete the event; we just
give a warning.
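
The add/remove pairing described above can be sketched in user space as follows. This is only an illustration under stated assumptions: the `demo_*` types and functions are hypothetical, the match on PASID plus page request group ID is a guess at what "a matching response" means, the list is a plain singly linked list that prepends (the patch uses list_add_tail() on a list_head), and locking and the response timeout are left out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct iommu_fault_event. */
struct demo_fault_event {
	struct demo_fault_event *next;
	unsigned int pasid;
	unsigned int page_req_group_id;
	int last_req;
};

/* Hypothetical stand-in for the per-device pending fault list. */
struct demo_fault_param {
	struct demo_fault_event *pending;	/* requests awaiting a response */
};

/* Report path: keep a copy of each request that expects a response. */
static int demo_report_fault(struct demo_fault_param *fp,
			     const struct demo_fault_event *evt)
{
	struct demo_fault_event *copy;

	if (!evt->last_req)
		return 0;	/* nothing to track for this event */

	copy = malloc(sizeof(*copy));
	if (!copy)
		return -ENOMEM;
	*copy = *evt;
	copy->next = fp->pending;	/* prepend; ordering is not modeled here */
	fp->pending = copy;
	return 0;
}

/* Response path: delete the pending entry that the response matches. */
static int demo_page_response(struct demo_fault_param *fp,
			      unsigned int pasid, unsigned int group_id)
{
	struct demo_fault_event **pp;

	for (pp = &fp->pending; *pp; pp = &(*pp)->next) {
		struct demo_fault_event *evt = *pp;

		if (evt->pasid == pasid && evt->page_req_group_id == group_id) {
			*pp = evt->next;
			free(evt);
			return 0;
		}
	}
	return -ENOENT;	/* no match: any pending entries stay on the list */
}
```

In this model an entry whose response never arrives simply stays on the pending list, which mirrors the "don't delete, just warn" behaviour described above.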

> > +		mutex_unlock(&fparam->lock);
> > +	}
> > +	ret = fparam->handler(evt, fparam->data);  
> 
> I remember you mentioned there will be a queue to store the faults.
> Is it in the fparam->faults list? Or there is no such queue?
There are two use cases:
case A: guest SVA, PRQ events are reported outside IOMMU subsystem,
	e.g. vfio
case B: in-kernel

The io page fault queuing is Jean's patchset, mostly for case
B (in-kernel IO page fault handling). I will convert intel-svm to Jean's
io page fault mechanism so that we can also have parallel and out of
order queuing of PRQ. I still need some time to evaluate intel specific
needs such as streaming page request/response.

For case A, there is no queuing in host IOMMU driver. My understanding
of the flow is as follows:
1. host IOMMU receives PRQ
2. host IOMMU driver reports PRQ fault event to the registered caller, i.e.
vfio
3. VFIO reports fault event to QEMU
4. QEMU injects PRQ to guest
5. Guest IOMMU driver receives PRQ in IRQ
6. Guest IOMMU driver queues PRQ by group and PASID.
So as long as in-kernel PRQ handling can do queuing, there is no need
for queuing in the host reporting path.
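
Steps 1-3 of the case A flow can be modeled in user space roughly as below. This is a sketch under assumptions, not the kernel API: every `relay_*` name is hypothetical, the consumer only counts events where VFIO would forward them to QEMU, and the NULL-handler check in registration is a simplification of this model. It shows the two properties discussed in this thread: one handler per device, and each event handed to that handler synchronously with no host-side queue.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, minimal fault event. */
struct relay_fault_event {
	int type;
	unsigned long long addr;
};

typedef int (*relay_fault_handler_t)(struct relay_fault_event *evt, void *data);

/* Hypothetical per-device state: at most one registered handler. */
struct relay_device {
	relay_fault_handler_t handler;
	void *data;
};

/* Register the single consumer; a second registration is refused. */
static int relay_register_handler(struct relay_device *dev,
				  relay_fault_handler_t handler, void *data)
{
	if (!dev || !handler)
		return -EINVAL;	/* simplification of this model */
	if (dev->handler)
		return -EBUSY;
	dev->handler = handler;
	dev->data = data;
	return 0;
}

/* Report one event: it goes straight to the handler, nothing is queued. */
static int relay_report_fault(struct relay_device *dev,
			      struct relay_fault_event *evt)
{
	if (!dev || !evt || !dev->handler)
		return -EINVAL;
	return dev->handler(evt, dev->data);
}

/* Stand-in for the VFIO side: count the events it would pass on to QEMU. */
static int relay_to_user(struct relay_fault_event *evt, void *data)
{
	int *forwarded = data;

	(void)evt;
	(*forwarded)++;
	return 0;
}
```

Any queuing by group and PASID (step 6) would then live in the guest IOMMU driver, outside this model.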

Jacob

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-17 15:59     ` Jacob Pan
@ 2018-05-17 23:22       ` Liu, Yi L
  2018-05-21 23:03         ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Liu, Yi L @ 2018-05-17 23:22 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Wysocki, Rafael J, Tian,
	Kevin, Raj, Ashok, Jean Delvare, Christoph Hellwig, Lu Baolu

> From: Jacob Pan [mailto:jacob.jun.pan@linux.intel.com]
> Sent: Thursday, May 17, 2018 11:59 PM
> On Thu, 17 May 2018 11:41:56 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > +int iommu_report_device_fault(struct device *dev, struct
> > > +iommu_fault_event *evt) {
> > > +	int ret = 0;
> > > +	struct iommu_fault_event *evt_pending;
> > > +	struct iommu_fault_param *fparam;
> > > +
> > > +	/* iommu_param is allocated when device is added to group
> > > */
> > > +	if (!dev->iommu_param | !evt)
> > > +		return -EINVAL;
> > > +	/* we only report device fault if there is a handler
> > > registered */
> > > +	mutex_lock(&dev->iommu_param->lock);
> > > +	if (!dev->iommu_param->fault_param ||
> > > +		!dev->iommu_param->fault_param->handler) {
> > > +		ret = -EINVAL;
> > > +		goto done_unlock;
> > > +	}
> > > +	fparam = dev->iommu_param->fault_param;
> > > +	if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
> > > +		evt_pending = kmemdup(evt, sizeof(struct
> > > iommu_fault_event),
> > > +				GFP_KERNEL);
> > > +		if (!evt_pending) {
> > > +			ret = -ENOMEM;
> > > +			goto done_unlock;
> > > +		}
> > > +		mutex_lock(&fparam->lock);
> > > +		list_add_tail(&evt_pending->list,
> > > &fparam->faults);
> >
> > I may have missed it, but I only see the list add here. How about removal? Who
> > removes entries from the fault list?
> >
> Deletion of the pending event happens in the page response function
> (iommu_page_response()). Once the IOMMU driver finds a matching response for the
> pending request, it deletes the pending event.
> 
> If the response never comes, right now we don't delete the event; we just give a warning.

Got it.

> 
> > > +		mutex_unlock(&fparam->lock);
> > > +	}
> > > +	ret = fparam->handler(evt, fparam->data);
> >
> > I remember you mentioned there will be a queue to store the faults.
> > Is it in the fparam->faults list? Or there is no such queue?
> There are two use cases:
> case A: guest SVA, PRQ events are reported outside IOMMU subsystem,
> 	e.g. vfio
> case B: in-kernel
> 
> The io page fault queuing is Jean's patchset, mostly for case B (in-kernel IO page
> fault handling). I will convert intel-svm to Jean's io page fault mechanism so that we
> can also have parallel and out of order queuing of PRQ. I still need some time to
> evaluate intel specific needs such as streaming page request/response.
> 
> For case A, there is no queuing in host IOMMU driver. My understanding of the flow
> is as follows:
> 1. host IOMMU receives PRQ
> 2. host IOMMU driver reports PRQ fault event to the registered caller, i.e.
> vfio
> 3. VFIO reports fault event to QEMU
> 4. QEMU injects PRQ to guest
> 5. Guest IOMMU driver receives PRQ in IRQ
> 6. Guest IOMMU driver queues PRQ by group and PASID.

Correct.

> So as long as in-kernel PRQ handling can do queuing, there is no need for queuing in
> the host reporting path.

Will it affect the current interface? Here the handler only gets one "evt" per PRQ IRQ. And I suppose
vfio need not rely on host iommu queuing?

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-17 23:22       ` Liu, Yi L
@ 2018-05-21 23:03         ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-21 23:03 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Wysocki, Rafael J, Tian,
	Kevin, Raj, Ashok, Jean Delvare, Christoph Hellwig, Lu Baolu,
	jacob.jun.pan

On Thu, 17 May 2018 23:22:43 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > So as long as in-kernel PRQ handling can do queuing, there is no
> > need for queuing in the host reporting path.  
> 
> Will it affect the current interface? Here the handler only gets one "evt"
> per PRQ IRQ. And I suppose vfio need not rely on host iommu
> queuing?

I don't think it needs iommu driver queuing.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA)
  2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
                   ` (22 preceding siblings ...)
  2018-05-11 20:54 ` [PATCH v5 23/23] iommu: use sva invalidate and device fault trace event Jacob Pan
@ 2018-05-29 15:54 ` Jacob Pan
  23 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-29 15:54 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu, jacob.jun.pan

Hi Joerg,

Just wondering if you had a chance to review this version before I
spin another one mostly based on Baolu's comments. I have incorporated
feedback from your review of the previous version, but it was a while
ago.

Thanks,

Jacob

On Fri, 11 May 2018 13:53:52 -0700
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> Shared virtual address (SVA), a.k.a, Shared virtual memory (SVM) on
> Intel platforms allow address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance
> security. To enable SVA in the guest, i.e. shared guest application
> address space and physical device DMA address, IOMMU driver must
> provide some new functionalities.
> 
> This patchset is a follow-up on the discussions held at LPC 2017
> VFIO/IOMMU/PCI track. Slides and notes can be found here:
> https://linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/636
> 
> The complete guest SVA support also involves changes in QEMU and VFIO,
> which has been posted earlier.
> https://www.spinics.net/lists/kvm/msg148798.html
> 
> This is the IOMMU portion follow up of the more complete series of the
> kernel changes to support vSVA. Please refer to the link below for
> more details. https://www.spinics.net/lists/kvm/msg148819.html
> 
> Generic APIs are introduced in addition to Intel VT-d specific
> changes, the goal is to have common interfaces across IOMMU and
> device types for both VFIO and other in-kernel users.
> 
> At the top level, new IOMMU interfaces are introduced as follows:
>  - bind guest PASID table
>  - passdown invalidations of translation caches
>  - IOMMU device fault reporting including page request/response and
>    non-recoverable faults.
> 
> For IOMMU detected device fault reporting, struct device is extended
> to provide callback and tracking at device level. The original
> proposal was discussed here "Error handling for I/O memory management
> units" (https://lwn.net/Articles/608914/). I have experimented with two
> alternative solutions:
> 1. use a shared group notifier, this does not scale well also causes
> unwanted notification traffic when group sibling device is reported
> with faults.
> 2. place fault callback at device IOMMU arch data, e.g.
> device_domain_info in Intel/FSL IOMMU driver. This will cause code
> duplication, since per device fault reporting is generic.
> 
> The additional patches are Intel VT-d specific, which either
> implements or replaces existing private interfaces with the generic
> ones.
> 
> This patchset is based on the work and ideas from many people,
> especially:
> Ashok Raj <ashok.raj@intel.com>
> Liu, Yi L <yi.l.liu@linux.intel.com>
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> Thanks,
> 
> Jacob
> 
> V5
> 	- Removed device context cache and non-pasid TLB invalidation type
> 	- Simplified and sorted granularities for the remaining TLB
> 	  invalidation types, per discussion and review by Jean-Philippe
> 	  Brucker.
> 	- Added a setup parameter for page response timeout
> 	- Added version and size checking in bind PASID and invalidation APIs
> 	- Fixed locking and error handling in device fault reporting API
> 	  based on Jean's review
> 
> V4
> 	- Further integrate feedback for iommu_param and iommu_fault_param
> 	  from Jean and others.
> 	- Handle fault reporting error and race conditions. Keep tracking per
> 	  device pending page requests such that page group response can be
> 	  sanitized.
> 	- Added a timer to handle an unresponsive guest that does not send
> 	  page response on time.
> 	- Use a workqueue for VT-d non-recoverable IRQ fault handling.
> 	- Added trace events for invalidation and fault reporting.
> V3
> 	- Consolidated fault reporting data format based on discussions on
> 	  v2, including input from ARM and AMD.
> 	- Renamed invalidation APIs from svm to sva based on discussions on v2
> 	- Use a parent pointer under struct device for all iommu per device
> 	  data
> 	- Simplified device fault callback, allow driver private data to be
> 	  registered. This might make it easy to replace domain fault handler.
> V2
> 	- Replaced hybrid interface data model (generic data + vendor specific
> 	  data) with all generic data. This will have the security benefit
> 	  where data passed from user space can be sanitized by all software
> 	  layers if needed.
> 	- Addressed review comments from V1
> 	- Use per device fault report data
> 	- Support page request/response communications between host IOMMU and
> 	  guest or other in-kernel users.
> 	- Added unrecoverable fault reporting to DMAR
> 	- Use threaded IRQ function for DMAR fault interrupt and fault
> 	  reporting
> Jacob Pan (22):
>   iommu: introduce bind_pasid_table API function
>   iommu/vt-d: move device_domain_info to header
>   iommu/vt-d: add a flag for pasid table bound status
>   iommu/vt-d: add bind_pasid_table function
>   iommu/vt-d: add definitions for PFSID
>   iommu/vt-d: fix dev iotlb pfsid use
>   iommu/vt-d: support flushing more translation cache types
>   iommu/vt-d: add svm/sva invalidate function
>   iommu: introduce device fault data
>   driver core: add per device iommu param
>   iommu: add a timeout parameter for prq response
>   iommu: introduce device fault report API
>   iommu: introduce page response function
>   iommu: handle page response timeout
>   iommu/config: add build dependency for dmar
>   iommu/vt-d: report non-recoverable faults to device
>   iommu/intel-svm: report device page request
>   iommu/intel-svm: replace dev ops with fault report API
>   iommu/intel-svm: do not flush iotlb for viommu
>   iommu/vt-d: add intel iommu page response function
>   trace/iommu: add sva trace events
>   iommu: use sva invalidate and device fault trace event
> 
> Liu, Yi L (1):
>   iommu: introduce iommu invalidate API function
> 
>  Documentation/admin-guide/kernel-parameters.txt |   8 +
>  drivers/iommu/Kconfig                           |   1 +
>  drivers/iommu/dmar.c                            | 209 ++++++++++++++-
>  drivers/iommu/intel-iommu.c                     | 338 ++++++++++++++++++++++--
>  drivers/iommu/intel-svm.c                       |  84 ++++--
>  drivers/iommu/iommu.c                           | 311 +++++++++++++++++++++-
>  include/linux/device.h                          |   3 +
>  include/linux/dma_remapping.h                   |   1 +
>  include/linux/dmar.h                            |   2 +-
>  include/linux/intel-iommu.h                     |  52 +++-
>  include/linux/intel-svm.h                       |  20 +-
>  include/linux/iommu.h                           | 216 ++++++++++++++-
>  include/trace/events/iommu.h                    | 112 ++++++++
>  include/uapi/linux/iommu.h                      | 124 +++++++++
>  14 files changed, 1409 insertions(+), 72 deletions(-)
>  create mode 100644 include/uapi/linux/iommu.h
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 14/23] iommu: introduce page response function
  2018-05-14  6:39   ` Lu Baolu
@ 2018-05-29 16:13     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-29 16:13 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 14:39:51 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:54 AM, Jacob Pan wrote:
> > IO page faults can be handled outside the IOMMU subsystem. For
> > example, when nested translation is turned on and the guest owns the
> > first level page tables, device page requests can be forwarded
> > to the guest for fault handling. When the guest returns the page
> > response, the IOMMU driver on the host needs to process the
> > response, which informs the device and completes the page request
> > transaction.
> >
> > This patch introduces a generic API function for page responses
> > passed from the guest or other in-kernel users. The definitions of
> > the generic data are based on the PCI ATS specification and are not
> > limited to any vendor.
> >
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Link: https://lkml.org/lkml/2017/12/7/1725
> > ---
> >  drivers/iommu/iommu.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/iommu.h | 43 +++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 88 insertions(+)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index b3f9daf..02fed3e 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1533,6 +1533,51 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
> >  
> > +int iommu_page_response(struct device *dev,
> > +			struct page_response_msg *msg)
> > +{
> > +	struct iommu_param *param = dev->iommu_param;
> > +	int ret = -EINVAL;
> > +	struct iommu_fault_event *evt;
> > +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> > +
> > +	if (!domain || !domain->ops->page_response)
> > +		return -ENODEV;
> > +
> > +	/*
> > +	 * Device iommu_param should have been allocated when device is
> > +	 * added to its iommu_group.
> > +	 */
> > +	if (!param || !param->fault_param)
> > +		return -EINVAL;
> > +
> > +	/* Only send response if there is a fault report pending */
> > +	mutex_lock(&param->fault_param->lock);
> > +	if (list_empty(&param->fault_param->faults)) {
> > +		pr_warn("no pending PRQ, drop response\n");
> > +		goto done_unlock;
> > +	}
> > +	/*
> > +	 * Check if we have a matching page request pending to respond,
> > +	 * otherwise return -EINVAL
> > +	 */
> > +	list_for_each_entry(evt, &param->fault_param->faults, list) {
> > +		if (evt->pasid == msg->pasid &&
> > +		    msg->page_req_group_id == evt->page_req_group_id) {
> > +			msg->private_data = evt->iommu_private;
> > +			ret = domain->ops->page_response(dev, msg);
> > +			list_del(&evt->list);
> > +			kfree(evt);
> > +			break;
> > +		}
> > +	}  
> 
> Are the above two checks duplicated? We won't find a matching
> request if the list is empty. And we need to printk a message
> if we can't find the matching request.
> 
Sorry about the delay. I am not sure which two checks you mean; we only
search for a pending page request if the list is not empty.
Yes, it is a good idea to print a warning if the page response has no
matching request.
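
For illustration only, a minimal userspace sketch of the lookup being
discussed, assuming a plain singly linked list in place of the kernel's
list_head and mutex. The names fault_event, page_response,
respond_to_pending() and add_pending() are hypothetical stand-ins and are
not taken from the patch. It shows that a single loop already covers the
empty-list case, with one warning when nothing matches:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the kernel structures in the patch. */
struct fault_event {
	int pasid;
	int page_req_group_id;
	struct fault_event *next;
};

struct page_response {
	int pasid;
	int page_req_group_id;
};

/*
 * Remove and free the pending event matching @msg. Returns 0 on a match and
 * -1 otherwise. An empty list simply falls through the loop, so no separate
 * emptiness check is needed; the warning covers both cases.
 */
static int respond_to_pending(struct fault_event **head,
			      const struct page_response *msg)
{
	struct fault_event **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		struct fault_event *evt = *pp;

		if (evt->pasid == msg->pasid &&
		    evt->page_req_group_id == msg->page_req_group_id) {
			*pp = evt->next;	/* unlink before freeing */
			free(evt);
			return 0;
		}
	}
	fprintf(stderr, "no matching PRQ for pasid %d gid %d, drop response\n",
		msg->pasid, msg->page_req_group_id);
	return -1;
}

/* Push a new pending event at the head of the list; 0 on success, -1 on OOM. */
static int add_pending(struct fault_event **head, int pasid, int gid)
{
	struct fault_event *evt = malloc(sizeof(*evt));

	if (!evt)
		return -1;
	evt->pasid = pasid;
	evt->page_req_group_id = gid;
	evt->next = *head;
	*head = evt;
	return 0;
}
```

In the kernel version the same walk would stay under the fault_param lock,
and the driver's page_response callback would run before the event is freed.
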
> Best regards,
> Lu Baolu
> 
>  [...]  
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 15/23] iommu: handle page response timeout
  2018-05-14  7:43   ` Lu Baolu
@ 2018-05-29 16:20     ` Jacob Pan
  2018-05-30  7:46       ` Lu Baolu
  0 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-05-29 16:20 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 15:43:54 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:54 AM, Jacob Pan wrote:
> > When IO page faults are reported outside the IOMMU subsystem, the page
> > request handler may fail for various reasons, e.g. a guest received
> > page requests but did not have a chance to run for a long time. The
> > unresponsive behavior could hold off limited resources on the
> > pending device.
> > There can be hardware or credit based software solutions as
> > suggested in the PCI ATS Ch-4. To provide a basic safety net, this
> > patch introduces a per device deferrable timer which monitors the
> > longest pending page fault that requires a response. Proper action
> > such as sending a failure response code could be taken when the timer
> > expires, but is not included in this patch. We need to consider the
> > life cycle of the page group ID to prevent confusion with a group
> > ID reused by a device. For now, a warning message provides a clue of
> > such a failure.
> >
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/iommu.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/iommu.h |  4 ++++
> >  2 files changed, 57 insertions(+)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 02fed3e..1f2f49e 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -827,6 +827,37 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> >  
> > +static void iommu_dev_fault_timer_fn(struct timer_list *t)
> > +{
> > +	struct iommu_fault_param *fparam = from_timer(fparam, t, timer);
> > +	struct iommu_fault_event *evt;
> > +
> > +	u64 now;
> > +
> > +	now = get_jiffies_64();
> > +
> > +	/* The goal is to ensure driver or guest page fault handler(via vfio)
> > +	 * send page response on time. Otherwise, limited queue resources
> > +	 * may be occupied by some irresponsive guests or drivers.
> > +	 * When per device pending fault list is not empty, we periodically checks
> > +	 * if any anticipated page response time has expired.
> > +	 *
> > +	 * TODO:
> > +	 * We could do the following if response time expires:
> > +	 * 1. send page response code FAILURE to all pending PRQ
> > +	 * 2. inform device driver or vfio
> > +	 * 3. drain in-flight page requests and responses for this device
> > +	 * 4. clear pending fault list such that driver can unregister fault
> > +	 *    handler(otherwise blocked when pending faults are present).
> > +	 */
> > +	list_for_each_entry(evt, &fparam->faults, list) {
> > +		if (time_after64(now, evt->expire))
> > +			pr_err("Page response time expired!, pasid %d gid %d exp %llu now %llu\n",
> > +				evt->pasid, evt->page_req_group_id, evt->expire, now);
> > +	}
> > +	mod_timer(t, now + prq_timeout);
> > +}
> > +  
> 
> This timer scheme is very rough.
> 
Yes, the timer is a rough safety net for misbehaving PRQ handlers such
as a guest.
> The timer expires every 10 seconds (by default).
> 
> 0               10              20              30              40
> +---------------+---------------+---------------+---------------+
> ^   ^     ^  ^                  ^
> |   |     |  |                  |
> F0  F1    F2 F3                 (F1,F2,F3 will not be handled until here!)
> 
> F0, F1, F2, F3 are four page faults that happen during the [0, 10s) time
> window. The F1, F2, F3 timeouts won't be handled until the timer expires
> again at 20s. That means a fault might be pending there until about
> (2 * prq_timeout) seconds later.
> 
Correct, it could be 2x in the worst case. I should explain that in the
comments.
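
As a rough illustration of that 2x bound, here is a small userspace model.
It assumes the timer fires at exact multiples of its period and that the
callback uses the same strict "now is after expire" test as time_after64()
in the patch. The helper names detection_time() and detection_delay() are
made up for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time at which a periodic timer with the given period first notices that a
 * fault reported at @fault_time has exceeded @timeout. A tick landing exactly
 * on the expiry does not count, mirroring the strict time_after64() test.
 */
static uint64_t detection_time(uint64_t fault_time, uint64_t timeout,
			       uint64_t period)
{
	uint64_t expire = fault_time + timeout;

	assert(period > 0);
	/* First multiple of @period that is strictly greater than @expire. */
	return (expire / period + 1) * period;
}

/* Elapsed time between the fault being reported and the timer flagging it. */
static uint64_t detection_delay(uint64_t fault_time, uint64_t timeout,
				uint64_t period)
{
	return detection_time(fault_time, timeout, period) - fault_time;
}
```

With timeout and period both set to 10, a fault reported at t=3 expires at
13 and is first flagged by the tick at 20, a delay of 17. In this model the
delay always falls in the range (prq_timeout, 2 * prq_timeout].
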
> Out of curiosity, why not add a timer to iommu_fault_event,
> starting it in iommu_report_device_fault() and removing it in
> iommu_page_response()?
> 
I thought about that as well, but since we are just trying to have a broad
and rough safety net (in addition to a potential HW mechanism or credit
based solution), my thought was that having a per device timer is more
economical than one per event.
Thanks for the in-depth check!

> Best regards,
> Lu Baolu
> 
> 
>  [...]  
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 17/23] iommu/vt-d: report non-recoverable faults to device
  2018-05-14  8:17   ` Lu Baolu
@ 2018-05-29 17:33     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-05-29 17:33 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig,
	jacob.jun.pan

On Mon, 14 May 2018 16:17:28 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> Hi,
> 
> On 05/12/2018 04:54 AM, Jacob Pan wrote:
> > Currently, dmar fault IRQ handler does nothing more than rate
> > limited printk, no critical hardware handling need to be done
> > in IRQ context.  
> 
> Not exactly. dmar_fault() needs to clear all the faults so that
> the subsequent faults could be logged.
True, but this is standard IRQ handling. Moving to a threaded IRQ should
not cause any functional problems; that is what I am trying to say.
> 
> > For some use case such as vIOMMU, it might be useful to report
> > non-recoverable faults outside host IOMMU subsystem. DMAR fault
> > can come from both DMA and interrupt remapping which has to be
> > set up early before threaded IRQ is available.
> > This patch adds an option and a workqueue such that when faults
> > are requested, DMAR fault IRQ handler can use the IOMMU fault
> > reporting API to report.
> >
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/dmar.c        | 159 ++++++++++++++++++++++++++++++++++++++++++--
> >  drivers/iommu/intel-iommu.c |   6 +-
> >  include/linux/dmar.h        |   2 +-
> >  include/linux/intel-iommu.h |   1 +
> >  4 files changed, 159 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index 0b5b052..ef846e3 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1110,6 +1110,12 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd)
> >  	return err;
> >  }
> >  
> > +static inline void dmar_free_fault_wq(struct intel_iommu *iommu)
> > +{
> > +	if (iommu->fault_wq)
> > +		destroy_workqueue(iommu->fault_wq);
> > +}
> > +
> >  static void free_iommu(struct intel_iommu *iommu)
> >  {
> >  	if (intel_iommu_enabled) {
> > @@ -1126,6 +1132,7 @@ static void free_iommu(struct intel_iommu *iommu)
> >  		free_irq(iommu->irq, iommu);
> >  		dmar_free_hwirq(iommu->irq);
> >  		iommu->irq = 0;
> > +		dmar_free_fault_wq(iommu);
> >  	}
> >  
> >  	if (iommu->qi) {
> > @@ -1554,6 +1561,31 @@ static const char *irq_remap_fault_reasons[] =
> >  	"Blocked an interrupt request due to source-id verification failure",
> >  };
> >  
> > +/* fault data and status */
> > +enum intel_iommu_fault_reason {
> > +	INTEL_IOMMU_FAULT_REASON_SW,
> > +	INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT,
> > +	INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT,
> > +	INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID,
> > +	INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH,
> > +	INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS,
> > +	INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS,
> > +	INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID,
> > +	INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID,
> > +	INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID,
> > +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_RTP,
> > +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_CTP,
> > +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_PTE,
> > +	NR_INTEL_IOMMU_FAULT_REASON,
> > +};
> > +
> > +/* fault reasons that are allowed to be reported outside IOMMU subsystem */
> > +#define INTEL_IOMMU_FAULT_REASON_ALLOWED			\
> > +	((1ULL << INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH) |	\
> > +		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS) |	\
> > +		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS))
> > +
> > +
> >  static const char *dmar_get_fault_reason(u8 fault_reason, int *fault_type)
> >  {
> >  	if (fault_reason >= 0x20 && (fault_reason - 0x20 <
> > @@ -1634,11 +1666,91 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
> >  	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
> >  }
> >  
> > +static enum iommu_fault_reason to_iommu_fault_reason(u8 reason)
> > +{
> > +	if (reason >= NR_INTEL_IOMMU_FAULT_REASON) {
> > +		pr_warn("unknown DMAR fault reason %d\n", reason);
> > +		return IOMMU_FAULT_REASON_UNKNOWN;
> > +	}
> > +	switch (reason) {
> > +	case INTEL_IOMMU_FAULT_REASON_SW:
> > +	case INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT:
> > +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT:
> > +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID:
> > +	case INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH:
> > +	case INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID:
> > +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID:
> > +		return IOMMU_FAULT_REASON_INTERNAL;
> > +	case INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID:
> > +	case INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS:
> > +	case INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS:
> > +		return IOMMU_FAULT_REASON_PERMISSION;
> > +	default:
> > +		return IOMMU_FAULT_REASON_UNKNOWN;
> > +	}
> > +}
> > +
> > +struct dmar_fault_work {
> > +	struct work_struct fault_work;
> > +	struct intel_iommu *iommu;
> > +	u64 addr;
> > +	int type;
> > +	int fault_type;
> > +	enum intel_iommu_fault_reason reason;
> > +	u16 sid;
> > +};
> > +
> > +static void report_fault_to_device(struct work_struct *work)
> > +{
> > +	struct dmar_fault_work *dfw = container_of(work, struct dmar_fault_work,
> > +						fault_work);
> > +	struct iommu_fault_event event;
> > +	struct pci_dev *pdev;
> > +	u8 bus, devfn;
> > +
> > +	memset(&event, 0, sizeof(struct iommu_fault_event));
> > +
> > +	/* check if fault reason is permitted to report outside IOMMU */
> > +	if (!((1 << dfw->reason) & INTEL_IOMMU_FAULT_REASON_ALLOWED)) {
> > +		pr_debug("Fault reason %d not allowed to report to device\n",
> > +			dfw->reason);  
> 
> No need to print this message. And how about moving this check ahead
> before queue the work?
> 
Good point. The rest of the points are taken. Thanks!
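
A minimal sketch of such an early check, assuming the enum and mask from
the patch are reused as-is. The helper name fault_reason_reportable() is
hypothetical, and the sketch is plain userspace C rather than kernel code.
It would be called before allocating and queuing the work item, so that
disallowed reasons never reach the workqueue:

```c
#include <stdbool.h>

/* Fault reasons as listed in the patch; the values are enum positions. */
enum intel_iommu_fault_reason {
	INTEL_IOMMU_FAULT_REASON_SW,
	INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT,
	INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT,
	INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID,
	INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH,
	INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS,
	INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS,
	INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID,
	INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID,
	INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID,
	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_RTP,
	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_CTP,
	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_PTE,
	NR_INTEL_IOMMU_FAULT_REASON,
};

/* Fault reasons that may be reported outside the IOMMU subsystem. */
#define INTEL_IOMMU_FAULT_REASON_ALLOWED				\
	((1ULL << INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH) |	\
	 (1ULL << INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS) |	\
	 (1ULL << INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS))

/*
 * Hypothetical helper: true if @reason may be forwarded to the device.
 * Out-of-range values are rejected first so the shift stays well defined.
 */
static bool fault_reason_reportable(unsigned int reason)
{
	if (reason >= NR_INTEL_IOMMU_FAULT_REASON)
		return false;
	return ((1ULL << reason) & INTEL_IOMMU_FAULT_REASON_ALLOWED) != 0;
}
```

The shift uses 1ULL so that it has the same width as the mask, which is a
choice made for this sketch only.
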
>  [...]  
> 
> No need to print this warn message.
> 
>  [...]  
> 
> No need to add braces.
> 
> > +
> > +	dfw = kmalloc(sizeof(*dfw), GFP_ATOMIC);
> > +	if (!dfw)
> > +		return -ENOMEM;
> > +
> > +	INIT_WORK(&dfw->fault_work, report_fault_to_device);
> > +	dfw->addr = addr;
> > +	dfw->type = type;
> > +	dfw->fault_type = fault_type;
> > +	dfw->reason = fault_reason;
> > +	dfw->sid = source_id;
> > +	dfw->iommu = iommu;
> > +	if (!queue_work(iommu->fault_wq, &dfw->fault_work)) {  
> 
> Check whether this fault is allowed to report to device before
> queuing the work.
> 
>  [...]  
> 
> Best regards,
> Lu Baolu

[Jacob Pan]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 15/23] iommu: handle page response timeout
  2018-05-29 16:20     ` Jacob Pan
@ 2018-05-30  7:46       ` Lu Baolu
  0 siblings, 0 replies; 78+ messages in thread
From: Lu Baolu @ 2018-05-30  7:46 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Rafael Wysocki, Liu,
	Yi L, Tian, Kevin, Raj Ashok, Jean Delvare, Christoph Hellwig

Hi,

On 05/30/2018 12:20 AM, Jacob Pan wrote:
> On Mon, 14 May 2018 15:43:54 +0800
> Lu Baolu <baolu.lu@linux.intel.com> wrote:
>
>> Hi,
>>
>> On 05/12/2018 04:54 AM, Jacob Pan wrote:
>>> When IO page faults are reported outside the IOMMU subsystem, the page
>>> request handler may fail for various reasons, e.g. a guest received
>>> page requests but did not have a chance to run for a long time. The
>>> unresponsive behavior could hold off limited resources on the
>>> pending device.
>>> There can be hardware or credit based software solutions as
>>> suggested in the PCI ATS Ch-4. To provide a basic safety net, this
>>> patch introduces a per device deferrable timer which monitors the
>>> longest pending page fault that requires a response. Proper action
>>> such as sending a failure response code could be taken when the timer
>>> expires, but is not included in this patch. We need to consider the
>>> life cycle of the page group ID to prevent confusion with a group
>>> ID reused by a device. For now, a warning message provides a clue of
>>> such a failure.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> ---
>>>  drivers/iommu/iommu.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/iommu.h |  4 ++++
>>>  2 files changed, 57 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index 02fed3e..1f2f49e 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -827,6 +827,37 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>>>  }
>>>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>>>  
>>> +static void iommu_dev_fault_timer_fn(struct timer_list *t)
>>> +{
>>> +	struct iommu_fault_param *fparam = from_timer(fparam, t, timer);
>>> +	struct iommu_fault_event *evt;
>>> +
>>> +	u64 now;
>>> +
>>> +	now = get_jiffies_64();
>>> +
>>> +	/* The goal is to ensure driver or guest page fault handler(via vfio)
>>> +	 * send page response on time. Otherwise, limited queue resources
>>> +	 * may be occupied by some irresponsive guests or drivers.
>>> +	 * When per device pending fault list is not empty, we periodically checks
>>> +	 * if any anticipated page response time has expired.
>>> +	 *
>>> +	 * TODO:
>>> +	 * We could do the following if response time expires:
>>> +	 * 1. send page response code FAILURE to all pending PRQ
>>> +	 * 2. inform device driver or vfio
>>> +	 * 3. drain in-flight page requests and responses for this device
>>> +	 * 4. clear pending fault list such that driver can unregister fault
>>> +	 *    handler(otherwise blocked when pending faults are present).
>>> +	 */
>>> +	list_for_each_entry(evt, &fparam->faults, list) {
>>> +		if (time_after64(now, evt->expire))
>>> +			pr_err("Page response time expired!, pasid %d gid %d exp %llu now %llu\n",
>>> +				evt->pasid, evt->page_req_group_id, evt->expire, now);
>>> +	}
>>> +	mod_timer(t, now + prq_timeout);
>>> +}
>>> +  
>> This timer scheme is very rough.
>>
> yes, the timer is a rough safety net for misbehaved PRQ handlers such
> as a guest.
>> The timer expires every 10 seconds (by default).
>>
>> 0               10              20              30              40
>> +---------------+---------------+---------------+---------------+
>> ^   ^     ^  ^                  ^
>> |   |     |  |                  |
>> F0  F1    F2 F3                 (F1,F2,F3 will not be handled until here!)
>>
>> F0, F1, F2, F3 are four page faults happens during [0, 10s) time
>> window. F1, F2, F3 timeout won't be handled until the timer expires
>> again at 20s. That means a fault might be pending there until about
>> (2 * prq_timeout) seconds later.
>>
> correct. it could be 2x for the worst case. I should explain in
> comments.
>> Out of curiosity, Why not adding a timer in iommu_fault_event,
>> starting it in iommu_report_device_fault() and removing it in
>> iommu_page_response()?
>>
> I thought about that also but since we are just trying to have a broad
> and rough safety net (in addition to potential HW mechanism or credit
> based solution), my thought was that having a per device timer is more
> economical than per event.
> Thanks for the in-depth check!

Okay, got your idea. Thanks for the explanation.

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-05-11 20:53 ` [PATCH v5 01/23] iommu: introduce bind_pasid_table API function Jacob Pan
@ 2018-08-23 16:34   ` Auger Eric
  2018-08-24 12:47     ` Liu, Yi L
  2018-08-24 15:00   ` Auger Eric
  1 sibling, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-08-23 16:34 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Raj Ashok, Rafael Wysocki, Liu, Jean Delvare

Hi Jacob,

On 05/11/2018 10:53 PM, Jacob Pan wrote:
> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> use in the guest:
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> 
> As part of the proposed architecture, when an SVM capable PCI
> device is assigned to a guest, nested mode is turned on. Guest owns the
> first level page tables (request with PASID) which performs GVA->GPA
> translation. Second level page tables are owned by the host for GPA->HPA
> translation for both request with and without PASID.
> 
> A new IOMMU driver interface is therefore needed to perform tasks as
> follows:
> * Enable nested translation and appropriate translation type
> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> 
> This patch introduces new API functions to perform bind/unbind guest PASID
> tables. Based on common data, model specific IOMMU drivers can be extended
> to perform the specific steps for binding pasid table of assigned devices.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/iommu.c      | 19 +++++++++++++++++++
>  include/linux/iommu.h      | 24 ++++++++++++++++++++++++
>  include/uapi/linux/iommu.h | 33 +++++++++++++++++++++++++++++++++
>  3 files changed, 76 insertions(+)
>  create mode 100644 include/uapi/linux/iommu.h
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d2aa2320..3a69620 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1325,6 +1325,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_config *pasidt_binfo)
> +{
> +	if (unlikely(!domain->ops->bind_pasid_table))
> +		return -ENODEV;
> +
> +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
> +}
> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> +
> +void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +	if (unlikely(!domain->ops->unbind_pasid_table))
> +		return;
> +
> +	domain->ops->unbind_pasid_table(domain, dev);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 19938ee..5199ca4 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -25,6 +25,7 @@
>  #include <linux/errno.h>
>  #include <linux/err.h>
>  #include <linux/of.h>
> +#include <uapi/linux/iommu.h>
>  
>  #define IOMMU_READ	(1 << 0)
>  #define IOMMU_WRITE	(1 << 1)
> @@ -187,6 +188,8 @@ struct iommu_resv_region {
>   * @domain_get_windows: Return the number of windows for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @bind_pasid_table: bind pasid table pointer for guest SVM
> + * @unbind_pasid_table: unbind pasid table pointer and restore defaults
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -233,8 +236,14 @@ struct iommu_ops {
>  	u32 (*domain_get_windows)(struct iommu_domain *domain);
>  
>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
> +
>  	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>  
> +	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
> +				struct pasid_table_config *pasidt_binfo);
> +	void (*unbind_pasid_table)(struct iommu_domain *domain,
> +				struct device *dev);
> +
>  	unsigned long pgsize_bitmap;
>  };
>  
> @@ -296,6 +305,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>  			       struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>  				struct device *dev);
> +extern int iommu_bind_pasid_table(struct iommu_domain *domain,
> +		struct device *dev, struct pasid_table_config *pasidt_binfo);
> +extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
> +				struct device *dev);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>  		     phys_addr_t paddr, size_t size, int prot);
> @@ -696,6 +709,17 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>  	return NULL;
>  }
>  
> +static inline
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_config *pasidt_binfo)
> +{
> +	return -ENODEV;
> +}
> +static inline
> +void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #endif /* __LINUX_IOMMU_H */
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> new file mode 100644
> index 0000000..cb2d625
> --- /dev/null
> +++ b/include/uapi/linux/iommu.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * IOMMU user API definitions
> + *
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef _UAPI_IOMMU_H
> +#define _UAPI_IOMMU_H
> +
> +#include <linux/types.h>
> +
> +/**
> + * PASID table data used to bind guest PASID table to the host IOMMU. This will
> + * enable guest managed first level page tables.
> + * @version: for future extensions and identification of the data format
> + * @bytes: size of this structure
> + * @base_ptr:	PASID table pointer
> + * @pasid_bits:	number of bits supported in the guest PASID table, must be less
> + *		or equal than the host supported PASID size.
> + */
> +struct pasid_table_config {
> +	__u32 version;
> +#define PASID_TABLE_CFG_VERSION_1 1
> +	__u32 bytes;
> +	__u64 base_ptr;
> +	__u8 pasid_bits;

As reported in the "[RFC 00/13] SMMUv3 Nested Stage Setup" thread, this API
could be used for ARM SMMUv3 nested stage enablement without many
changes. Assuming SMMUv3 nested stage is confirmed to be interesting for
vendors and maintainers, we could try to unify the APIs.

As far as I understand, the VT-d PASID table is equivalent to the ARM
SMMUv3 context descriptor (CD) table. This corresponds to the stage 1
context table with one or more entries, each corresponding to one PASID.
Maybe using the s1ctx_table_config terminology instead of
pasid_table_config would be more generic, the PASID table being Intel
naming.

On top of pasid_bits, I think an "asid_bits" field may be needed too.
The guest IOMMU might support a different number of ASID bits from the
host one.

Although I have not skimmed through the whole series yet, I wonder
how you handle the case where stage 1 is bypassed or disabled. The guest
may define the S1 context entries but bypass or abort stage 1
translations globally. It looks like something is missing at first sight.

Thanks

Eric
> +};
> +
> +#endif /* _UAPI_IOMMU_H */
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-08-23 16:34   ` Auger Eric
@ 2018-08-24 12:47     ` Liu, Yi L
  2018-08-24 13:20       ` Auger Eric
  0 siblings, 1 reply; 78+ messages in thread
From: Liu, Yi L @ 2018-08-24 12:47 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Greg Kroah-Hartman, Alex Williamson,
	Jean-Philippe Brucker
  Cc: Liu, Yi L, Jean Delvare, Wysocki, Rafael J, Raj, Ashok

Hi Eric,

> From: iommu-bounces@lists.linux-foundation.org
> [mailto:iommu-bounces@lists.linux-foundation.org] On Behalf Of Auger Eric
> Sent: Friday, August 24, 2018 12:35 AM
> 
> Hi Jacob,
> 
> On 05/11/2018 10:53 PM, Jacob Pan wrote:
> > Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> > use in the guest:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> >
> > As part of the proposed architecture, when an SVM capable PCI
> > device is assigned to a guest, nested mode is turned on. Guest owns the
> > first level page tables (request with PASID) which performs GVA->GPA
> > translation. Second level page tables are owned by the host for GPA->HPA
> > translation for both request with and without PASID.
> >
> > A new IOMMU driver interface is therefore needed to perform tasks as
> > follows:
> > * Enable nested translation and appropriate translation type
> > * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> >
> > This patch introduces new API functions to perform bind/unbind guest PASID
> > tables. Based on common data, model specific IOMMU drivers can be extended
> > to perform the specific steps for binding pasid table of assigned devices.
> >
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---

[...]

> > +#ifndef _UAPI_IOMMU_H
> > +#define _UAPI_IOMMU_H
> > +
> > +#include <linux/types.h>
> > +
> > +/**
> > + * PASID table data used to bind guest PASID table to the host IOMMU. This will
> > + * enable guest managed first level page tables.
> > + * @version: for future extensions and identification of the data format
> > + * @bytes: size of this structure
> > + * @base_ptr:	PASID table pointer
> > + * @pasid_bits:	number of bits supported in the guest PASID table, must be less
> > + *		or equal than the host supported PASID size.
> > + */
> > +struct pasid_table_config {
> > +	__u32 version;
> > +#define PASID_TABLE_CFG_VERSION_1 1
> > +	__u32 bytes;
> > +	__u64 base_ptr;
> > +	__u8 pasid_bits;
> 
> As reported in "[RFC 00/13] SMMUv3 Nested Stage Setup" thread, this API
> could be used for ARM SMMUv3 nested stage enablement without many
> changes. Assuming SMMUv3 nested stage is confirmed to be interesting for
> vendors and maintainers, we could try to unify the APIs.

Just a quick question on nested stage on SMMUv3. If the virtualizer wants to
enable nested stage on SMMUv3, does it link the whole guest CD table to the
host, or does it do it in some other manner?

> As far as I understand the VTD PASID table is equivalent to the ARM
> SMMUv3 context descriptor table (CD). This corresponds to the stage 1
> context table with one or more entries, each corresponding to one PASID.

The PASID table is indexed by PASID and has multiple entries. A PASID table
would have 2^PASID_BITS entries.
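
To make the sizing concrete, a small userspace sketch follows. The helper
names pasid_table_entries() and check_guest_pasid_bits() are invented for
illustration and do not exist in the series. The width check follows the
pasid_bits kernel-doc quoted above, which requires the guest width to be
less than or equal to the host supported PASID size:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Number of entries in a PASID table indexed by a pasid_bits-wide PASID. */
static uint64_t pasid_table_entries(uint8_t pasid_bits)
{
	assert(pasid_bits < 64);	/* keep the shift well defined */
	return 1ULL << pasid_bits;
}

/*
 * Hypothetical validation step for a bind request: the guest PASID width
 * must not exceed what the host supports. Returns 0 or -EINVAL.
 */
static int check_guest_pasid_bits(uint8_t guest_bits, uint8_t host_bits)
{
	return guest_bits <= host_bits ? 0 : -EINVAL;
}
```

For example, a 20-bit PASID space gives 1048576 entries, and a guest asking
for 20 bits on a host that supports only 16 would be rejected.
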

> maybe using the s1ctx_table_config terminology instead of
> pasid_table_config would be more generic, the pasid table being Intel
> naming.
>
> on top of pasid_bits, I think an "asid_bits" field may be needed too.
> The guest IOMMU might support a different number of asid bits from the
> host one.

Maybe needed for SMMUv3. I've noticed you've placed it in
struct iommu_smmu_s1_config.

> 
> Although without having skimmed through the whole series yet, I wonder
> how you handle the case where stage1 is bypassed or disabled? The guest
> may define the S1 context entries but bypass or abort stage 1
> translations globally. Looks something missing to me at first sight.

Sorry, I didn't quite follow here. What use case is this for, i.e. stage 1
being bypassed or disabled? IOVA or SVA?

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-08-24 12:47     ` Liu, Yi L
@ 2018-08-24 13:20       ` Auger Eric
  2018-08-28 17:04         ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-08-24 13:20 UTC (permalink / raw)
  To: Liu, Yi L, Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Liu, Yi L, Jean Delvare, Wysocki, Rafael J, Raj, Ashok

Hi Yi Liu,

On 08/24/2018 02:47 PM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: iommu-bounces@lists.linux-foundation.org [mailto:iommu-
>> bounces@lists.linux-foundation.org] On Behalf Of Auger Eric
>> Sent: Friday, August 24, 2018 12:35 AM
>>
>> Hi Jacob,
>>
>> On 05/11/2018 10:53 PM, Jacob Pan wrote:
>>> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
>>> use in the guest:
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
>>>
>>> As part of the proposed architecture, when an SVM capable PCI
>>> device is assigned to a guest, nested mode is turned on. Guest owns the
>>> first level page tables (request with PASID) which performs GVA->GPA
>>> translation. Second level page tables are owned by the host for GPA->HPA
>>> translation for both request with and without PASID.
>>>
>>> A new IOMMU driver interface is therefore needed to perform tasks as
>>> follows:
>>> * Enable nested translation and appropriate translation type
>>> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
>>>
>>> This patch introduces new API functions to perform bind/unbind guest PASID
>>> tables. Based on common data, model specific IOMMU drivers can be extended
>>> to perform the specific steps for binding pasid table of assigned devices.
>>>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> ---
> 
> [...]
> 
>>> +#ifndef _UAPI_IOMMU_H
>>> +#define _UAPI_IOMMU_H
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +/**
>>> + * PASID table data used to bind guest PASID table to the host IOMMU. This will
>>> + * enable guest managed first level page tables.
>>> + * @version: for future extensions and identification of the data format
>>> + * @bytes: size of this structure
>>> + * @base_ptr:	PASID table pointer
>>> + * @pasid_bits:	number of bits supported in the guest PASID table, must be
>> less
>>> + *		or equal than the host supported PASID size.
>>> + */
>>> +struct pasid_table_config {
>>> +	__u32 version;
>>> +#define PASID_TABLE_CFG_VERSION_1 1
>>> +	__u32 bytes;
>>> +	__u64 base_ptr;
>>> +	__u8 pasid_bits;
>>
>> As reported in "[RFC 00/13] SMMUv3 Nested Stage Setup" thread, this API
>> could be used for ARM SMMUv3 nested stage enablement without many
>> changes. Assuming SMMUv3 nested stage is confirmed to be interesting for
>> vendors and maintainers, we could try to unify the APIs.
> 
> Just a quick question on nested stage on SMMUv3. If virtualizer wants to
> enable nested stage on SMMUv3, does it link the whole guest CD table to
> host or do it in other manner?
Yes that's correct. On ARM SMMUv3 you have Stream Table Entries (STEs,
indexed by ReqID=streamid). If stage 1 is used, the STE points to 1 or
more contiguous Context Descriptors (CDs).
So STE looks like the VTD Context-Entry and CD table looks like the VTD
PASID table as far as I understand.
> 
>> As far as I understand the VTD PASID table is equivalent to the ARM
>> SMMUv3 context descriptor table (CD). This corresponds to the stage 1
>> context table with one or more entries, each corresponding to one PASID.
> 
> PASID table is index by PASID, and have multiple entries. A PASID table
> would have 2^PASID_BITS entries.
On ARM SMMUv3 the number of CDs is 2^STE.S1CDMax.
> 
>> maybe using the s1ctx_table_config terminology instead of
>> pasid_table_config would be more generic, the pasid table being Intel
>> naming.
>>
>> on top of pasid_bits, I think an "asid_bits" field may be needed too.
>> The guest IOMMU might support a different number of asid bits from the
>> host one.
> 
> Maybe needed for SMMUv3. I've noticed you've placed it in
> struct iommu_smmu_s1_config.
> 
>>
>> Although without having skimmed through the whole series yet, I wonder
>> how you handle the case where stage1 is bypassed or disabled? The guest
>> may define the S1 context entries but bypass or abort stage 1
>> translations globally. Looks something missing to me at first sight.
> 
> Sorry, I didn't quite follow here. What usage is case such for? like stage 1 is
> bypassed or disabled. IOVA or SVA?
Each STE entry has a config field which tells how S1 and S2 behave.

Options are no traffic at all or any combination of the following:

S1        S2
bypass    bypass
transl    bypass
bypass    transl
transl    transl

The host manages the S2 info; the guest sets the S1-related fields.

To me the guest STE.Config should be passed to the host so that the
latter writes the correct global Config field value in the STE,
including S1 + S2 info.
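
As a rough sketch of what I mean, the host could merge the guest's S1
setting with its own S2 setting into one Config value. The enum values below
are the 3-bit STE.Config encodings as I recall them from the SMMUv3 spec
(please double check against the spec); enum guest_s1_mode and
compose_ste_config() are hypothetical names made up for this example and do
not exist in the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed STE.Config encodings; verify against the SMMUv3 specification. */
enum ste_config {
	STE_CFG_ABORT    = 0x0, /* no traffic at all */
	STE_CFG_BYPASS   = 0x4, /* S1 bypass, S2 bypass */
	STE_CFG_S1_TRANS = 0x5, /* S1 transl, S2 bypass */
	STE_CFG_S2_TRANS = 0x6, /* S1 bypass, S2 transl */
	STE_CFG_NESTED   = 0x7, /* S1 transl, S2 transl */
};

static_assert(STE_CFG_NESTED == (STE_CFG_S1_TRANS | STE_CFG_S2_TRANS),
	      "nested is the union of the S1 and S2 translate bits");

/* Hypothetical S1 setting chosen by the guest and passed down to the host. */
enum guest_s1_mode {
	GUEST_S1_ABORT,
	GUEST_S1_BYPASS,
	GUEST_S1_TRANSLATE,
};

/* Combine the guest-owned S1 mode with the host-owned S2 mode. */
enum ste_config compose_ste_config(enum guest_s1_mode s1, bool host_s2_translate)
{
	unsigned int cfg;

	if (s1 == GUEST_S1_ABORT)
		return STE_CFG_ABORT;

	cfg = 0x4;			/* bit 2: traffic is allowed */
	if (host_s2_translate)
		cfg |= 0x2;		/* bit 1: stage 2 translates */
	if (s1 == GUEST_S1_TRANSLATE)
		cfg |= 0x1;		/* bit 0: stage 1 translates */

	return (enum ste_config)cfg;
}
```

With only a PASID table pointer in the bind call, the host has no way to
tell GUEST_S1_BYPASS from GUEST_S1_TRANSLATE, which is the information I
think is missing.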

Thanks

Eric
> 
> Thanks,
> Yi Liu
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-05-11 20:53 ` [PATCH v5 01/23] iommu: introduce bind_pasid_table API function Jacob Pan
  2018-08-23 16:34   ` Auger Eric
@ 2018-08-24 15:00   ` Auger Eric
  2018-08-28  5:14     ` Jacob Pan
  1 sibling, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-08-24 15:00 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Raj Ashok, Rafael Wysocki, Liu, Jean Delvare

Hi Jacob,

On 05/11/2018 10:53 PM, Jacob Pan wrote:
> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> use in the guest:
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> 
> As part of the proposed architecture, when an SVM capable PCI
> device is assigned to a guest, nested mode is turned on. Guest owns the
> first level page tables (request with PASID) which performs GVA->GPA
> translation. Second level page tables are owned by the host for GPA->HPA
> translation for both request with and without PASID.
> 
> A new IOMMU driver interface is therefore needed to perform tasks as
> follows:
> * Enable nested translation and appropriate translation type
> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> 
> This patch introduces new API functions to perform bind/unbind guest PASID
> tables. Based on common data, model specific IOMMU drivers can be extended
> to perform the specific steps for binding pasid table of assigned devices.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/iommu.c      | 19 +++++++++++++++++++
>  include/linux/iommu.h      | 24 ++++++++++++++++++++++++
>  include/uapi/linux/iommu.h | 33 +++++++++++++++++++++++++++++++++
>  3 files changed, 76 insertions(+)
>  create mode 100644 include/uapi/linux/iommu.h
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d2aa2320..3a69620 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1325,6 +1325,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>  
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_config *pasidt_binfo)
Like Jean-Philippe, I must confess I am very confused by having both the
iommu_domain and dev passed as arguments.

I know this was discussed when the RFC was submitted and maybe I missed
the main justification behind that choice. I understand that at the HW
level we want to change the context entry or ARM CD in my case for a
specific device. But on the other hand, at the logical level, I understand
the iommu_domain as representing a set of translation config & page
tables shared by all the devices within the domain (hope this is
fundamentally correct?!). So to me we can't change the device
translation setup without changing the whole iommu_domain setup;
otherwise this would mean this device has a translation configuration
that is no longer consistent with the other devices in the same
domain. Is that correct? So can't we keep only the iommu_domain arg?

> +{
> +	if (unlikely(!domain->ops->bind_pasid_table))
> +		return -ENODEV;
> +
> +	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
> +}
> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> +
> +void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +	if (unlikely(!domain->ops->unbind_pasid_table))
> +		return;
> +
> +	domain->ops->unbind_pasid_table(domain, dev);
> +}
> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 19938ee..5199ca4 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -25,6 +25,7 @@
>  #include <linux/errno.h>
>  #include <linux/err.h>
>  #include <linux/of.h>
> +#include <uapi/linux/iommu.h>
>  
>  #define IOMMU_READ	(1 << 0)
>  #define IOMMU_WRITE	(1 << 1)
> @@ -187,6 +188,8 @@ struct iommu_resv_region {
>   * @domain_get_windows: Return the number of windows for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @bind_pasid_table: bind pasid table pointer for guest SVM
> + * @unbind_pasid_table: unbind pasid table pointer and restore defaults
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -233,8 +236,14 @@ struct iommu_ops {
>  	u32 (*domain_get_windows)(struct iommu_domain *domain);
>  
>  	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
> +
>  	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
>  
> +	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
> +				struct pasid_table_config *pasidt_binfo);
> +	void (*unbind_pasid_table)(struct iommu_domain *domain,
> +				struct device *dev);
> +
>  	unsigned long pgsize_bitmap;
>  };
>  
> @@ -296,6 +305,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
>  			       struct device *dev);
>  extern void iommu_detach_device(struct iommu_domain *domain,
>  				struct device *dev);
> +extern int iommu_bind_pasid_table(struct iommu_domain *domain,
> +		struct device *dev, struct pasid_table_config *pasidt_binfo);
> +extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
> +				struct device *dev);
>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
>  extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
>  		     phys_addr_t paddr, size_t size, int prot);
> @@ -696,6 +709,17 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
>  	return NULL;
>  }
>  
> +static inline
> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
> +			struct pasid_table_config *pasidt_binfo)
> +{
> +	return -ENODEV;
> +}
> +static inline
> +void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
> +{
> +}
> +
>  #endif /* CONFIG_IOMMU_API */
>  
>  #endif /* __LINUX_IOMMU_H */
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> new file mode 100644
> index 0000000..cb2d625
> --- /dev/null
> +++ b/include/uapi/linux/iommu.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * IOMMU user API definitions
> + *
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef _UAPI_IOMMU_H
> +#define _UAPI_IOMMU_H
> +
> +#include <linux/types.h>
> +
> +/**
> + * PASID table data used to bind guest PASID table to the host IOMMU. This will
> + * enable guest managed first level page tables.
> + * @version: for future extensions and identification of the data format
> + * @bytes: size of this structure
> + * @base_ptr:	PASID table pointer
> + * @pasid_bits:	number of bits supported in the guest PASID table, must be less
> + *		or equal than the host supported PASID size.
> + */
> +struct pasid_table_config {
> +	__u32 version;
> +#define PASID_TABLE_CFG_VERSION_1 1
> +	__u32 bytes;
> +	__u64 base_ptr;
> +	__u8 pasid_bits;
> +};
Don't we need to prefix all structs with iommu_ to protect the
namespace? Same comment on other patches impacting the uapi.

A question about the alignment. Don't we need to be 64b aligned? VFIO
uapi structs are.
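
To illustrate the concern, here is a small user-space sketch. The first
struct mirrors the proposed layout with stdint types; the second one,
pasid_table_config_padded, is a hypothetical variant with explicit padding
that I made up for this example. As far as I know, the first struct is
typically 24 bytes on x86-64 but 20 bytes on i386, because the ABIs align
64-bit members differently, so a 32-bit userspace and a 64-bit kernel could
disagree on the "bytes" field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the proposed layout. */
struct pasid_table_config_v1 {
	uint32_t version;
	uint32_t bytes;
	uint64_t base_ptr;
	uint8_t  pasid_bits;
	/* implicit tail padding here; its size depends on the ABI */
};

/* Hypothetical variant: explicit padding up to a multiple of 8 bytes. */
struct pasid_table_config_padded {
	uint32_t version;
	uint32_t bytes;
	uint64_t base_ptr;
	uint8_t  pasid_bits;
	uint8_t  padding[7];
};

/* With explicit padding the size no longer depends on the ABI. */
static_assert(sizeof(struct pasid_table_config_padded) == 24,
	      "padded layout must be 24 bytes everywhere");
static_assert(offsetof(struct pasid_table_config_padded, base_ptr) == 8,
	      "base_ptr sits at a naturally aligned offset");
```

The member offsets are the same in both structs; only the total size of the
unpadded one varies between ABIs.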

Thanks

Eric
> +
> +#endif /* _UAPI_IOMMU_H */
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-08-24 15:00   ` Auger Eric
@ 2018-08-28  5:14     ` Jacob Pan
  2018-08-28  8:34       ` Auger Eric
  0 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-08-28  5:14 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Yi L, Raj Ashok,
	Rafael Wysocki, Liu, Jean Delvare, jacob.jun.pan

On Fri, 24 Aug 2018 17:00:51 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 05/11/2018 10:53 PM, Jacob Pan wrote:
> > Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> > use in the guest:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> > 
> > As part of the proposed architecture, when an SVM capable PCI
> > device is assigned to a guest, nested mode is turned on. Guest owns
> > the first level page tables (request with PASID) which performs
> > GVA->GPA translation. Second level page tables are owned by the
> > host for GPA->HPA translation for both request with and without
> > PASID.
> > 
> > A new IOMMU driver interface is therefore needed to perform tasks as
> > follows:
> > * Enable nested translation and appropriate translation type
> > * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> > 
> > This patch introduces new API functions to perform bind/unbind
> > guest PASID tables. Based on common data, model specific IOMMU
> > drivers can be extended to perform the specific steps for binding
> > pasid table of assigned devices.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/iommu/iommu.c      | 19 +++++++++++++++++++
> >  include/linux/iommu.h      | 24 ++++++++++++++++++++++++
> >  include/uapi/linux/iommu.h | 33 +++++++++++++++++++++++++++++++++
> >  3 files changed, 76 insertions(+)
> >  create mode 100644 include/uapi/linux/iommu.h
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index d2aa2320..3a69620 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1325,6 +1325,25 @@ int iommu_attach_device(struct iommu_domain
> > *domain, struct device *dev) }
> >  EXPORT_SYMBOL_GPL(iommu_attach_device);
> >  
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct
> > device *dev,
> > +			struct pasid_table_config *pasidt_binfo)  
> As Jean-Philippe, I must confessed i am very confused by having both
> the iommu_domain and dev passed as argument.
> 
> I know this was discussed when the RFC was submitted and maybe I
> missed the main justification behind that choice. I understand that
> at the HW level we want to change the context entry or ARM CD in my
> case for a specific device. But on other hand, at the logical level,
> I understand the iommu_domain is representing a set of translation
> config & page tables shared by all the devices within the domain
> (hope this is fundamentally correct ?!). So to me we can't change the
> device translation setup without changing the whole iommu_device setup
> otherwise this would mean this device has a translation configuration
> that is not consistent anymore with the other devices in the same
> domain. Is that correct? So can't we only keep the iommu_domain arg?
> 
I agree with your understanding of the HW and logical levels. I think
there is a new twist to the definition of domain introduced by having
PASID and vSVA. Up until now, domain has only meant 2nd level mapping. In
that sense, binding the guest PASID table does not alter the domain. For
the VT-d 2.5 spec implementation of bind_pasid_table(), we needed some
per device data, plus flags such as an indication for IO page fault
handling.

Anyway, for the new VT-d 3.0 spec we no longer need this API.
Instead, I will introduce a bind_guest_pasid() API, where the per device
PASID table is allocated by the host.

> > +{
> > +	if (unlikely(!domain->ops->bind_pasid_table))
> > +		return -ENODEV;
> > +
> > +	return domain->ops->bind_pasid_table(domain, dev,
> > pasidt_binfo); +}
> > +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
> > +
> > +void iommu_unbind_pasid_table(struct iommu_domain *domain, struct
> > device *dev) +{
> > +	if (unlikely(!domain->ops->unbind_pasid_table))
> > +		return;
> > +
> > +	domain->ops->unbind_pasid_table(domain, dev);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
> > +
> >  static void __iommu_detach_device(struct iommu_domain *domain,
> >  				  struct device *dev)
> >  {
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 19938ee..5199ca4 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -25,6 +25,7 @@
> >  #include <linux/errno.h>
> >  #include <linux/err.h>
> >  #include <linux/of.h>
> > +#include <uapi/linux/iommu.h>
> >  
> >  #define IOMMU_READ	(1 << 0)
> >  #define IOMMU_WRITE	(1 << 1)
> > @@ -187,6 +188,8 @@ struct iommu_resv_region {
> >   * @domain_get_windows: Return the number of windows for a domain
> >   * @of_xlate: add OF master IDs to iommu grouping
> >   * @pgsize_bitmap: bitmap of all possible supported page sizes
> > + * @bind_pasid_table: bind pasid table pointer for guest SVM
> > + * @unbind_pasid_table: unbind pasid table pointer and restore
> > defaults */
> >  struct iommu_ops {
> >  	bool (*capable)(enum iommu_cap);
> > @@ -233,8 +236,14 @@ struct iommu_ops {
> >  	u32 (*domain_get_windows)(struct iommu_domain *domain);
> >  
> >  	int (*of_xlate)(struct device *dev, struct of_phandle_args
> > *args); +
> >  	bool (*is_attach_deferred)(struct iommu_domain *domain,
> > struct device *dev); 
> > +	int (*bind_pasid_table)(struct iommu_domain *domain,
> > struct device *dev,
> > +				struct pasid_table_config
> > *pasidt_binfo);
> > +	void (*unbind_pasid_table)(struct iommu_domain *domain,
> > +				struct device *dev);
> > +
> >  	unsigned long pgsize_bitmap;
> >  };
> >  
> > @@ -296,6 +305,10 @@ extern int iommu_attach_device(struct
> > iommu_domain *domain, struct device *dev);
> >  extern void iommu_detach_device(struct iommu_domain *domain,
> >  				struct device *dev);
> > +extern int iommu_bind_pasid_table(struct iommu_domain *domain,
> > +		struct device *dev, struct pasid_table_config
> > *pasidt_binfo); +extern void iommu_unbind_pasid_table(struct
> > iommu_domain *domain,
> > +				struct device *dev);
> >  extern struct iommu_domain *iommu_get_domain_for_dev(struct device
> > *dev); extern int iommu_map(struct iommu_domain *domain, unsigned
> > long iova, phys_addr_t paddr, size_t size, int prot);
> > @@ -696,6 +709,17 @@ const struct iommu_ops
> > *iommu_ops_from_fwnode(struct fwnode_handle *fwnode) return NULL;
> >  }
> >  
> > +static inline
> > +int iommu_bind_pasid_table(struct iommu_domain *domain, struct
> > device *dev,
> > +			struct pasid_table_config *pasidt_binfo)
> > +{
> > +	return -ENODEV;
> > +}
> > +static inline
> > +void iommu_unbind_pasid_table(struct iommu_domain *domain, struct
> > device *dev) +{
> > +}
> > +
> >  #endif /* CONFIG_IOMMU_API */
> >  
> >  #endif /* __LINUX_IOMMU_H */
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > new file mode 100644
> > index 0000000..cb2d625
> > --- /dev/null
> > +++ b/include/uapi/linux/iommu.h
> > @@ -0,0 +1,33 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + * IOMMU user API definitions
> > + *
> > + *
> > + * This program is free software; you can redistribute it and/or
> > modify
> > + * it under the terms of the GNU General Public License version 2
> > as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#ifndef _UAPI_IOMMU_H
> > +#define _UAPI_IOMMU_H
> > +
> > +#include <linux/types.h>
> > +
> > +/**
> > + * PASID table data used to bind guest PASID table to the host
> > IOMMU. This will
> > + * enable guest managed first level page tables.
> > + * @version: for future extensions and identification of the data
> > format
> > + * @bytes: size of this structure
> > + * @base_ptr:	PASID table pointer
> > + * @pasid_bits:	number of bits supported in the guest PASID
> > table, must be less
> > + *		or equal than the host supported PASID size.
> > + */
> > +struct pasid_table_config {
> > +	__u32 version;
> > +#define PASID_TABLE_CFG_VERSION_1 1
> > +	__u32 bytes;
> > +	__u64 base_ptr;
> > +	__u8 pasid_bits;
> > +};  
> Don't we need to index all structs with iommu_ to protect the naming
> spaces? Same comment on other patches impacting the uapi.
> 
Yeah, it would be better to use the iommu_ prefix. I was thinking vfio also
uses it and pasid itself is an industry standard.
> A question about the alignment. Don't we need to be 64b aligned? VFIO
> uapi structs are.
> 
I am not sure about the benefit; this is neither a HW interface nor on a
speed path.
> Thanks
> 
> Eric
> > +
> > +#endif /* _UAPI_IOMMU_H */
> >   

[Jacob Pan]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-08-28  5:14     ` Jacob Pan
@ 2018-08-28  8:34       ` Auger Eric
  2018-08-28 16:36         ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-08-28  8:34 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Yi L, Raj Ashok,
	Rafael Wysocki, Liu, Jean Delvare

Hi Jacob,

On 08/28/2018 07:14 AM, Jacob Pan wrote:
> On Fri, 24 Aug 2018 17:00:51 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
>> On 05/11/2018 10:53 PM, Jacob Pan wrote:
>>> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
>>> use in the guest:
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
>>>
>>> As part of the proposed architecture, when an SVM capable PCI
>>> device is assigned to a guest, nested mode is turned on. Guest owns
>>> the first level page tables (request with PASID) which performs
>>> GVA->GPA translation. Second level page tables are owned by the
>>> host for GPA->HPA translation for both request with and without
>>> PASID.
>>>
>>> A new IOMMU driver interface is therefore needed to perform tasks as
>>> follows:
>>> * Enable nested translation and appropriate translation type
>>> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
>>>
>>> This patch introduces new API functions to perform bind/unbind
>>> guest PASID tables. Based on common data, model specific IOMMU
>>> drivers can be extended to perform the specific steps for binding
>>> pasid table of assigned devices.
>>>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>>> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
>>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> ---
>>>  drivers/iommu/iommu.c      | 19 +++++++++++++++++++
>>>  include/linux/iommu.h      | 24 ++++++++++++++++++++++++
>>>  include/uapi/linux/iommu.h | 33 +++++++++++++++++++++++++++++++++
>>>  3 files changed, 76 insertions(+)
>>>  create mode 100644 include/uapi/linux/iommu.h
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index d2aa2320..3a69620 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -1325,6 +1325,25 @@ int iommu_attach_device(struct iommu_domain
>>> *domain, struct device *dev) }
>>>  EXPORT_SYMBOL_GPL(iommu_attach_device);
>>>  
>>> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct
>>> device *dev,
>>> +			struct pasid_table_config *pasidt_binfo)  
>> As Jean-Philippe, I must confessed i am very confused by having both
>> the iommu_domain and dev passed as argument.
>>
>> I know this was discussed when the RFC was submitted and maybe I
>> missed the main justification behind that choice. I understand that
>> at the HW level we want to change the context entry or ARM CD in my
>> case for a specific device. But on other hand, at the logical level,
>> I understand the iommu_domain is representing a set of translation
>> config & page tables shared by all the devices within the domain
>> (hope this is fundamentally correct ?!). So to me we can't change the
>> device translation setup without changing the whole iommu_device setup
>> otherwise this would mean this device has a translation configuration
>> that is not consistent anymore with the other devices in the same
>> domain. Is that correct? So can't we only keep the iommu_domain arg?
>>
> I agree with you on your understanding of HW and logical level. I think
> there is a new twist to the definition of domain introduced by having
> PASID and vSVA. Up until now, domain only means 2nd level mapping. In
> that sense, bind guest PASID table does not alter domain. For VT-d 2.5
> spec. implementation of the bind_pasid_table(), we needed some per
> device data, also flags such as indication for IO page fault handling.
> 
> Anyway, for the new VT-d 3.0 spec. we no longer need this API. In
> stead, I will introduce bind_guest_pasid() API, where per device PASID
> table is allocated by the host.

So what is the exact state of this series? Is it outdated as you don't
target VT-d 2.5 anymore? Will you keep the rest of the API?

Thanks

Eric
> 
>>> +{
>>> +	if (unlikely(!domain->ops->bind_pasid_table))
>>> +		return -ENODEV;
>>> +
>>> +	return domain->ops->bind_pasid_table(domain, dev,
>>> pasidt_binfo); +}
>>> +EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
>>> +
>>> +void iommu_unbind_pasid_table(struct iommu_domain *domain, struct
>>> device *dev) +{
>>> +	if (unlikely(!domain->ops->unbind_pasid_table))
>>> +		return;
>>> +
>>> +	domain->ops->unbind_pasid_table(domain, dev);
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
>>> +
>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>  				  struct device *dev)
>>>  {
>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>> index 19938ee..5199ca4 100644
>>> --- a/include/linux/iommu.h
>>> +++ b/include/linux/iommu.h
>>> @@ -25,6 +25,7 @@
>>>  #include <linux/errno.h>
>>>  #include <linux/err.h>
>>>  #include <linux/of.h>
>>> +#include <uapi/linux/iommu.h>
>>>  
>>>  #define IOMMU_READ	(1 << 0)
>>>  #define IOMMU_WRITE	(1 << 1)
>>> @@ -187,6 +188,8 @@ struct iommu_resv_region {
>>>   * @domain_get_windows: Return the number of windows for a domain
>>>   * @of_xlate: add OF master IDs to iommu grouping
>>>   * @pgsize_bitmap: bitmap of all possible supported page sizes
>>> + * @bind_pasid_table: bind pasid table pointer for guest SVM
>>> + * @unbind_pasid_table: unbind pasid table pointer and restore
>>> defaults */
>>>  struct iommu_ops {
>>>  	bool (*capable)(enum iommu_cap);
>>> @@ -233,8 +236,14 @@ struct iommu_ops {
>>>  	u32 (*domain_get_windows)(struct iommu_domain *domain);
>>>  
>>>  	int (*of_xlate)(struct device *dev, struct of_phandle_args
>>> *args); +
>>>  	bool (*is_attach_deferred)(struct iommu_domain *domain,
>>> struct device *dev); 
>>> +	int (*bind_pasid_table)(struct iommu_domain *domain,
>>> struct device *dev,
>>> +				struct pasid_table_config
>>> *pasidt_binfo);
>>> +	void (*unbind_pasid_table)(struct iommu_domain *domain,
>>> +				struct device *dev);
>>> +
>>>  	unsigned long pgsize_bitmap;
>>>  };
>>>  
>>> @@ -296,6 +305,10 @@ extern int iommu_attach_device(struct
>>> iommu_domain *domain, struct device *dev);
>>>  extern void iommu_detach_device(struct iommu_domain *domain,
>>>  				struct device *dev);
>>> +extern int iommu_bind_pasid_table(struct iommu_domain *domain,
>>> +		struct device *dev, struct pasid_table_config
>>> *pasidt_binfo); +extern void iommu_unbind_pasid_table(struct
>>> iommu_domain *domain,
>>> +				struct device *dev);
>>>  extern struct iommu_domain *iommu_get_domain_for_dev(struct device
>>> *dev); extern int iommu_map(struct iommu_domain *domain, unsigned
>>> long iova, phys_addr_t paddr, size_t size, int prot);
>>> @@ -696,6 +709,17 @@ const struct iommu_ops
>>> *iommu_ops_from_fwnode(struct fwnode_handle *fwnode) return NULL;
>>>  }
>>>  
>>> +static inline
>>> +int iommu_bind_pasid_table(struct iommu_domain *domain, struct
>>> device *dev,
>>> +			struct pasid_table_config *pasidt_binfo)
>>> +{
>>> +	return -ENODEV;
>>> +}
>>> +static inline
>>> +void iommu_unbind_pasid_table(struct iommu_domain *domain, struct
>>> device *dev) +{
>>> +}
>>> +
>>>  #endif /* CONFIG_IOMMU_API */
>>>  
>>>  #endif /* __LINUX_IOMMU_H */
>>> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
>>> new file mode 100644
>>> index 0000000..cb2d625
>>> --- /dev/null
>>> +++ b/include/uapi/linux/iommu.h
>>> @@ -0,0 +1,33 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +/*
>>> + * IOMMU user API definitions
>>> + *
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> modify
>>> + * it under the terms of the GNU General Public License version 2
>>> as
>>> + * published by the Free Software Foundation.
>>> + */
>>> +
>>> +#ifndef _UAPI_IOMMU_H
>>> +#define _UAPI_IOMMU_H
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +/**
>>> + * PASID table data used to bind guest PASID table to the host
>>> IOMMU. This will
>>> + * enable guest managed first level page tables.
>>> + * @version: for future extensions and identification of the data
>>> format
>>> + * @bytes: size of this structure
>>> + * @base_ptr:	PASID table pointer
>>> + * @pasid_bits:	number of bits supported in the guest PASID
>>> table, must be less
>>> + *		or equal than the host supported PASID size.
>>> + */
>>> +struct pasid_table_config {
>>> +	__u32 version;
>>> +#define PASID_TABLE_CFG_VERSION_1 1
>>> +	__u32 bytes;
>>> +	__u64 base_ptr;
>>> +	__u8 pasid_bits;
>>> +};  
>> Don't we need to index all structs with iommu_ to protect the naming
>> spaces? Same comment on other patches impacting the uapi.
>>
> yeah, it would be better to use iommu_ prefix, i was thinking vfio also
> uses it and pasid itself is a industry standard.
>> A question about the alignment. Don't we need to be 64b aligned? VFIO
>> uapi structs are.
>>
> I am not sure about the benefit, this is not a HW interface nor on
> speed path.
>> Thanks
>>
>> Eric
>>> +
>>> +#endif /* _UAPI_IOMMU_H */
>>>   
> 
> [Jacob Pan]
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-08-28  8:34       ` Auger Eric
@ 2018-08-28 16:36         ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-08-28 16:36 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Yi L, Raj Ashok,
	Rafael Wysocki, Liu, Jean Delvare, jacob.jun.pan

On Tue, 28 Aug 2018 10:34:19 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> > Anyway, for the new VT-d 3.0 spec we no longer need this API.
> > Instead, I will introduce bind_guest_pasid() API, where per device
> > PASID table is allocated by the host.  
> 
> So what is the exact state of this series? Is it outdated as you don't
> target VT-d 2.5 anymore? Will you keep the rest of the API?

Hi Eric,

I am not targeting VT-d 2.5 for SVA related work. I am working on the
rest of the APIs for supporting VT-d v3, which includes guest PASID
bind, fault reporting, and invalidation passdown from the guest. These
are based on some recent patches from Baolu.
https://lkml.org/lkml/2018/7/16/62

So I feel it is better for you to take over bind_pasid_table() API in
your series. I will drop it from my next version.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 01/23] iommu: introduce bind_pasid_table API function
  2018-08-24 13:20       ` Auger Eric
@ 2018-08-28 17:04         ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-08-28 17:04 UTC (permalink / raw)
  To: Auger Eric
  Cc: Liu, Yi L, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker, Liu,
	Yi L, Jean Delvare, Wysocki, Rafael J, Raj, Ashok, jacob.jun.pan

On Fri, 24 Aug 2018 15:20:08 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Yi Liu,
> 
> On 08/24/2018 02:47 PM, Liu, Yi L wrote:
> > Hi Eric,
> >   
> >> From: iommu-bounces@lists.linux-foundation.org [mailto:iommu-
> >> bounces@lists.linux-foundation.org] On Behalf Of Auger Eric
> >> Sent: Friday, August 24, 2018 12:35 AM
> >>
> >> Hi Jacob,
> >>
> >> On 05/11/2018 10:53 PM, Jacob Pan wrote:  
> >>> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> >>> use in the guest:
> >>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> >>>
> >>> As part of the proposed architecture, when an SVM capable PCI
> >>> device is assigned to a guest, nested mode is turned on. Guest
> >>> owns the first level page tables (request with PASID) which
> >>> performs GVA->GPA translation. Second level page tables are owned
> >>> by the host for GPA->HPA translation for both request with and
> >>> without PASID.
> >>>
> >>> A new IOMMU driver interface is therefore needed to perform tasks
> >>> as follows:
> >>> * Enable nested translation and appropriate translation type
> >>> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> >>>
> >>> This patch introduces new API functions to perform bind/unbind
> >>> guest PASID tables. Based on common data, model specific IOMMU
> >>> drivers can be extended to perform the specific steps for binding
> >>> pasid table of assigned devices.
> >>>
> >>> Signed-off-by: Jean-Philippe Brucker
> >>> <jean-philippe.brucker@arm.com> Signed-off-by: Liu, Yi L
> >>> <yi.l.liu@linux.intel.com> Signed-off-by: Ashok Raj
> >>> <ashok.raj@intel.com> Signed-off-by: Jacob Pan
> >>> <jacob.jun.pan@linux.intel.com> ---  
> > 
> > [...]
> >   
> >>> +#ifndef _UAPI_IOMMU_H
> >>> +#define _UAPI_IOMMU_H
> >>> +
> >>> +#include <linux/types.h>
> >>> +
> >>> +/**
> >>> + * PASID table data used to bind guest PASID table to the host
> >>> IOMMU. This will
> >>> + * enable guest managed first level page tables.
> >>> + * @version: for future extensions and identification of the
> >>> data format
> >>> + * @bytes: size of this structure
> >>> + * @base_ptr:	PASID table pointer
> >>> + * @pasid_bits:	number of bits supported in the guest
> >>> PASID table, must be  
> >> less  
> >>> + *		or equal than the host supported PASID size.
> >>> + */
> >>> +struct pasid_table_config {
> >>> +	__u32 version;
> >>> +#define PASID_TABLE_CFG_VERSION_1 1
> >>> +	__u32 bytes;
> >>> +	__u64 base_ptr;
> >>> +	__u8 pasid_bits;  
> >>
> >> As reported in "[RFC 00/13] SMMUv3 Nested Stage Setup" thread,
> >> this API could be used for ARM SMMUv3 nested stage enablement
> >> without many changes. Assuming SMMUv3 nested stage is confirmed to
> >> be interesting for vendors and maintainers, we could try to unify
> >> the APIs.  
> > 
> > Just a quick question on nested stage on SMMUv3. If virtualizer
> > wants to enable nested stage on SMMUv3, does it link the whole
> > guest CD table to host or do it in other manner?  
> Yes that's correct. On ARM SMMUv3 you have Stream Table Entries (STEs,
> indexed by ReqID=streamid). If stage 1 is used, the STE points to 1 or
> more contiguous Context Descriptors (CDs).
> So STE looks like the VTD Context-Entry and CD table looks like the
> VTD PASID table as far as I understand.
> >   
> >> As far as I understand the VTD PASID table is equivalent to the ARM
> >> SMMUv3 context descriptor table (CD). This corresponds to the
> >> stage 1 context table with one or more entries, each corresponding
> >> to one PASID.  
> > 
> > PASID table is indexed by PASID and has multiple entries. A PASID
> > table would have 2^PASID_BITS entries.
> On ARM SMMUv3 the number of CDs is 2^STE.S1CDMax.
> >   
> >> maybe using the s1ctx_table_config terminology instead of
> >> pasid_table_config would be more generic, the pasid table being
> >> Intel naming.
> >>
> >> on top of pasid_bits, I think an "asid_bits" field may be needed
> >> too. The guest IOMMU might support a different number of asid bits
> >> from the host one.  
> > 
> > Maybe needed for SMMUv3. I've noticed you've placed it in
> > struct iommu_smmu_s1_config.
> >   
> >>
> >> Although without having skimmed through the whole series yet, I
> >> wonder how you handle the case where stage1 is bypassed or
> >> disabled? The guest may define the S1 context entries but bypass
> >> or abort stage 1 translations globally. Looks something missing to
> >> me at first sight.  
> > 
> > Sorry, I didn't quite follow here. What use case is this for,
> > where stage 1 is bypassed or disabled? IOVA or SVA?
> Each STE entry has a config field which tells how S1 and S2 behave
> 
> Options are no traffic at all or any combination of the following:
> 
> S1        S2
> bypass    bypass
> transl    bypass
> bypass    transl
> transl    transl
> 
> host manages S2 info. guest sets S1 related fields.
> 
> To me the guest STE.Config should be passed to the host so that the
> latter writes the correct global Config field value in the STE,
> including S1 + S2 info.
> 
Global config (VT-d global command reg) is IOMMU-wide; we cannot let a
guest config change directly modify global settings. I think it is
up to the vIOMMU emulation code to unbind the guest PASID table, and thus
disable S1, if the guest is setting S1 to bypass/disabled.

I am still perplexed by valid use cases of S1 bypass; to me it means
neither SVA nor guest IOVA, which means no need for a vIOMMU.

> Thanks
> 
> Eric
> > 
> > Thanks,
> > Yi Liu
> >   

[Jacob Pan]
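
A minimal sketch of the policy described above, assuming the vIOMMU emulation
sees a per-device "stage 1 mode" chosen by the guest. All names here
(guest_s1_mode, host_action, viommu_s1_action, pasid_table_entries) are
hypothetical and exist only for illustration; none of them is an API from
this series. The pasid_bits check mirrors the documented rule that the guest
value must not exceed what the host supports.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: what the guest programmed for stage 1. */
enum guest_s1_mode { GUEST_S1_BYPASS, GUEST_S1_TRANSLATE };

/* Hypothetical: what the emulation code asks the host IOMMU driver to do. */
enum host_action {
	HOST_UNBIND_PASID_TABLE,
	HOST_BIND_PASID_TABLE,
	HOST_REJECT,
};

static enum host_action viommu_s1_action(enum guest_s1_mode mode,
					 uint8_t guest_pasid_bits,
					 uint8_t host_pasid_bits)
{
	/* Guest bypasses stage 1: drop the binding, stage 2 stays in place. */
	if (mode != GUEST_S1_TRANSLATE)
		return HOST_UNBIND_PASID_TABLE;

	/* The guest table must not be wider than the host PASID size. */
	if (guest_pasid_bits > host_pasid_bits)
		return HOST_REJECT;

	return HOST_BIND_PASID_TABLE;
}

/* A PASID table indexed by PASID has 2^pasid_bits entries. */
static uint64_t pasid_table_entries(uint8_t pasid_bits)
{
	assert(pasid_bits < 64);
	return (uint64_t)1 << pasid_bits;
}
```

With this shape the global setting is never written on behalf of the guest;
only the per-device binding changes.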

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-11 20:54 ` [PATCH v5 13/23] iommu: introduce device fault report API Jacob Pan
  2018-05-14  6:01   ` Lu Baolu
  2018-05-17 11:41   ` Liu, Yi L
@ 2018-09-06  9:25   ` Auger Eric
  2018-09-06 12:42     ` Jean-Philippe Brucker
  2018-09-14 13:24   ` Auger Eric
  2018-09-25 14:58   ` Jean-Philippe Brucker
  4 siblings, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-09-06  9:25 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj Ashok, Rafael Wysocki, Jean Delvare

Hi Jacob,

On 05/11/2018 10:54 PM, Jacob Pan wrote:
> Traditionally, device specific faults are detected and handled within
> their own device drivers. When IOMMU is enabled, faults such as DMA
> related transactions are detected by IOMMU. There is no generic
> reporting mechanism to report faults back to the in-kernel device
> driver or the guest OS in case of assigned devices.
> 
> Faults detected by the IOMMU are based on the transaction's source ID, which
> can be reported on a per-device basis, regardless of whether the device is a
> PCI device or not.
> 
> The fault types include recoverable (e.g. page request) and
> unrecoverable faults(e.g. access error). In most cases, faults can be
> handled by IOMMU drivers internally. The primary use cases are as
> follows:
> 1. page request fault originated from an SVM capable device that is
> assigned to guest via vIOMMU. In this case, the first level page tables
> are owned by the guest. Page request must be propagated to the guest to
> let guest OS fault in the pages then send page response. In this
> mechanism, the direct receiver of IOMMU fault notification is VFIO,
> which can relay notification events to QEMU or other user space
> software.
> 
> 2. faults need more subtle handling by device drivers. Other than
> simply invoke reset function, there are needs to let device driver
> handle the fault with a smaller impact.
> 
> This patchset is intended to create a generic fault report API such
> that it can scale as follows:
> - all IOMMU types
> - PCI and non-PCI devices
> - recoverable and unrecoverable faults
> - VFIO and other in-kernel users
> - DMA & IRQ remapping (TBD)
> The original idea was brought up by David Woodhouse and discussions
> summarized at https://lwn.net/Articles/608914/.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/iommu/iommu.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iommu.h |  35 +++++++++++-
>  2 files changed, 181 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3a49b96..b3f9daf 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -609,6 +609,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  		goto err_free_name;
>  	}
>  
> +	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
> +	if (!dev->iommu_param) {
> +		ret = -ENOMEM;
> +		goto err_free_name;
> +	}
> +	mutex_init(&dev->iommu_param->lock);
> +
>  	kobject_get(group->devices_kobj);
>  
>  	dev->iommu_group = group;
> @@ -639,6 +646,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  	mutex_unlock(&group->mutex);
>  	dev->iommu_group = NULL;
>  	kobject_put(group->devices_kobj);
> +	kfree(dev->iommu_param);
>  err_free_name:
>  	kfree(device->name);
>  err_remove_link:
> @@ -685,7 +693,7 @@ void iommu_group_remove_device(struct device *dev)
>  	sysfs_remove_link(&dev->kobj, "iommu_group");
>  
>  	trace_remove_device_from_group(group->id, dev);
> -
> +	kfree(dev->iommu_param);
>  	kfree(device->name);
>  	kfree(device);
>  	dev->iommu_group = NULL;
> @@ -820,6 +828,145 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>  
>  /**
> + * iommu_register_device_fault_handler() - Register a device fault handler
> + * @dev: the device
> + * @handler: the fault handler
> + * @data: private data passed as argument to the handler
> + *
> + * When an IOMMU fault event is received, call this handler with the fault event
> + * and data as argument. The handler should return 0 on success. If the fault is
> + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
> + * the fault by calling iommu_page_response() with one of the following
The iommu_page_response name looks too specific to the PRI use case. Why not
use iommu_fault_response?
> + * response code:
> + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
> + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
Same here: s/IOMMU_PAGE_RESP/IOMMU_FAULT_RESP

That way I can easily reuse the API for SMMU nested stage handling.
> + *   page faults if possible.
> + *
> + * Return 0 if the fault handler was installed successfully, or an error.
> + */
> +int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = 0;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->lock);
> +	/* Only allow one fault handler registered for each device */
> +	if (param->fault_param) {
> +		ret = -EBUSY;
> +		goto done_unlock;
> +	}
> +
> +	get_device(dev);
> +	param->fault_param =
> +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!param->fault_param) {
> +		put_device(dev);
> +		ret = -ENOMEM;
> +		goto done_unlock;
> +	}
> +	mutex_init(&param->fault_param->lock);
> +	param->fault_param->handler = handler;
> +	param->fault_param->data = data;
> +	INIT_LIST_HEAD(&param->fault_param->faults);
> +
> +done_unlock:
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> +
> +/**
> + * iommu_unregister_device_fault_handler() - Unregister the device fault handler
> + * @dev: the device
> + *
> + * Remove the device fault handler installed with
> + * iommu_register_device_fault_handler().
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = 0;
> +
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->lock);
> +	/* we cannot unregister handler if there are pending faults */
> +	if (!list_empty(&param->fault_param->faults)) {
> +		ret = -EBUSY;
> +		goto unlock;
> +	}
> +
> +	kfree(param->fault_param);
> +	param->fault_param = NULL;
> +	put_device(dev);
> +unlock:
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> +
> +
> +/**
> + * iommu_report_device_fault() - Report fault event to device
> + * @dev: the device
> + * @evt: fault event data
> + *
> + * Called by IOMMU model specific drivers when fault is detected, typically
> + * in a threaded IRQ handler.
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	int ret = 0;
> +	struct iommu_fault_event *evt_pending;
> +	struct iommu_fault_param *fparam;
> +
> +	/* iommu_param is allocated when device is added to group */
> +	if (!dev->iommu_param || !evt)
> +		return -EINVAL;
> +	/* we only report device fault if there is a handler registered */
> +	mutex_lock(&dev->iommu_param->lock);
> +	if (!dev->iommu_param->fault_param ||
> +		!dev->iommu_param->fault_param->handler) {
> +		ret = -EINVAL;
> +		goto done_unlock;
> +	}
> +	fparam = dev->iommu_param->fault_param;
> +	if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
> +		evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
> +				GFP_KERNEL);
> +		if (!evt_pending) {
> +			ret = -ENOMEM;
> +			goto done_unlock;
> +		}
> +		mutex_lock(&fparam->lock);
> +		list_add_tail(&evt_pending->list, &fparam->faults);
same doubt as Yi Liu. You cannot rely on the userspace willingness to
void the queue and deallocate this memory.

SMMUv3 holds a queue of events whose size is implementation dependent.
I think such a queue should be available at SW level and its size should
be negotiated.

Note SMMU has separate queues for PRI and fault events. Here you use the
same queue for all events. I don't know if it would make sense to have
separate APIs?

Thanks

Eric

> +		mutex_unlock(&fparam->lock);
> +	}
> +	ret = fparam->handler(evt, fparam->data);
> +done_unlock:
> +	mutex_unlock(&dev->iommu_param->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> +
> +/**
>   * iommu_group_id - Return ID for a group
>   * @group: the group to ID
>   *
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index aeadb4f..b3312ee 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -307,7 +307,8 @@ enum iommu_fault_reason {
>   * and PASID spec.
>   * - Un-recoverable faults of device interest
>   * - DMA remapping and IRQ remapping faults
> -
> + *
> + * @list pending fault event list, used for tracking responses
>   * @type contains fault type.
>   * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
>   *         faults are not reported
> @@ -324,6 +325,7 @@ enum iommu_fault_reason {
>   *                 sending the fault response.
>   */
>  struct iommu_fault_event {
> +	struct list_head list;
>  	enum iommu_fault_type type;
>  	enum iommu_fault_reason reason;
>  	u64 addr;
> @@ -340,10 +342,13 @@ struct iommu_fault_event {
>   * struct iommu_fault_param - per-device IOMMU fault data
>   * @dev_fault_handler: Callback function to handle IOMMU faults at device level
>   * @data: handler private data
> - *
> + * @faults: holds the pending faults which needs response, e.g. page response.
> + * @lock: protect pending PRQ event list
>   */
>  struct iommu_fault_param {
>  	iommu_dev_fault_handler_t handler;
> +	struct list_head faults;
> +	struct mutex lock;
>  	void *data;
>  };
>  
> @@ -357,6 +362,7 @@ struct iommu_fault_param {
>   *	struct iommu_fwspec	*iommu_fwspec;
>   */
>  struct iommu_param {
> +	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
>  };
>  
> @@ -456,6 +462,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
>  					 struct notifier_block *nb);
>  extern int iommu_group_unregister_notifier(struct iommu_group *group,
>  					   struct notifier_block *nb);
> +extern int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data);
> +
> +extern int iommu_unregister_device_fault_handler(struct device *dev);
> +
> +extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
> +
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -727,6 +741,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
>  	return 0;
>  }
>  
> +static inline int iommu_register_device_fault_handler(struct device *dev,
> +						iommu_dev_fault_handler_t handler,
> +						void *data)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;
> 
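
To make the registration rules in the quoted patch easier to follow, here is
a user-space model of them: one handler per device, -EBUSY on a second
registration, and no unregistration while faults are pending. It is a sketch,
not the kernel code: mock_device, fault_param, nr_pending and the mock_*
functions are hypothetical stand-ins, locking and device refcounting are left
out, and returning 0 when nothing is registered is an assumption added here
(the quoted version dereferences fault_param without checking it).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct fault_event { int type; };
typedef int (*fault_handler_t)(struct fault_event *evt, void *data);

struct fault_param {
	fault_handler_t handler;
	void *data;
	unsigned int nr_pending;	/* stands in for the pending-fault list */
};

struct mock_device {
	struct fault_param *fault_param;
};

static int mock_register_handler(struct mock_device *dev,
				 fault_handler_t handler, void *data)
{
	if (!dev || !handler)
		return -EINVAL;
	/* Only one fault handler may be registered per device. */
	if (dev->fault_param)
		return -EBUSY;

	dev->fault_param = calloc(1, sizeof(*dev->fault_param));
	if (!dev->fault_param)
		return -ENOMEM;
	dev->fault_param->handler = handler;
	dev->fault_param->data = data;
	return 0;
}

static int mock_unregister_handler(struct mock_device *dev)
{
	if (!dev)
		return -EINVAL;
	/* Assumption of this sketch: nothing registered is not an error. */
	if (!dev->fault_param)
		return 0;
	/* Refuse to unregister while faults still await a response. */
	if (dev->fault_param->nr_pending)
		return -EBUSY;

	free(dev->fault_param);
	dev->fault_param = NULL;
	return 0;
}

/* Example handler: counts the events it is given. */
static int count_faults(struct fault_event *evt, void *data)
{
	(void)evt;
	assert(data);
	++*(int *)data;
	return 0;
}
```

A caller would register count_faults with a pointer to an int counter, and
has to drain nr_pending to zero before unregistering succeeds.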

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-09-06  9:25   ` Auger Eric
@ 2018-09-06 12:42     ` Jean-Philippe Brucker
  2018-09-06 13:14       ` Auger Eric
  0 siblings, 1 reply; 78+ messages in thread
From: Jean-Philippe Brucker @ 2018-09-06 12:42 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Greg Kroah-Hartman, Alex Williamson
  Cc: Jean Delvare, Rafael Wysocki, Raj Ashok

On 06/09/2018 10:25, Auger Eric wrote:
>> +		mutex_lock(&fparam->lock);
>> +		list_add_tail(&evt_pending->list, &fparam->faults);
> same doubt as Yi Liu. You cannot rely on the userspace willingness to
> void the queue and deallocate this memory.
> 
> SMMUv3 holds a queue of events whose size is implementation dependent.
> I think such a queue should be available at SW level and its size should
> be negotiated.

Note that this fault API can also be used by host device drivers that
want to be notified on fault, in which case a direct callback seems
easier to use, and perhaps more efficient than an intermediate queue.

When injecting faults into userspace it makes sense to batch events, to
avoid context switches. Even though that queue management should
probably be done by VFIO, the IOMMU API has to help in some way, at
least to tell VFIO when the IOMMU driver is done processing a batch of
events.

> Note SMMU has separate queues for PRI and fault events. Here you use the
> same queue for all events. I don't know if it would make sense to have
> separate APIs?

Host device drivers that use this API to be notified on fault can't deal
with arch-specific event formats (SMMU event, Vt-d fault event, etc), so
the APIs should be arch-agnostic. Given that requirement, using a single
iommu_fault_event structure for both PRI and event queues made sense,
especially since the event queue can have stall events that look a lot
like PRI page requests.

Or do you mean separate APIs for recoverable and non-recoverable faults?
Using the same queue for PRI and stall event, and a separate one for
non-recoverable events?

Separate queues may be useful for the device driver scenario
(non-virtualization), where recoverable faults are handled by
io-pgfaults while non-recoverable ones could be reported directly to the
device driver. For this case I was thinking of adding a
multiple-consumer thing: both io-pgfaults and the device driver register
a fault handler. io-pgfault handles recoverable ones and what's left
goes to the device driver. This could also allow the device driver to be
notified when io-pgfault doesn't successfully handle a fault
(handle_mm_fault returns an error).

Thanks,
Jean
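
A rough sketch of the multiple-consumer idea, assuming consumers are tried in
order and each one either handles the event or passes it on. Everything here
is hypothetical naming for illustration (chain_result, consumer,
dispatch_fault, the mm_fault_fails flag that simulates handle_mm_fault
returning an error); it is not code from this series or from io-pgfault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fault_type { FAULT_PAGE_REQ, FAULT_UNRECOV };
enum chain_result { FAULT_HANDLED, FAULT_PASS_ON };

struct fault_event {
	enum fault_type type;
	bool mm_fault_fails;	/* simulates a failed handle_mm_fault() */
};

typedef enum chain_result (*consumer_t)(const struct fault_event *evt,
					void *data);

struct consumer {
	consumer_t fn;
	void *data;
};

/* Stand-in for io-pgfault: takes recoverable faults it can resolve. */
static enum chain_result pgfault_consumer(const struct fault_event *evt,
					  void *data)
{
	(void)data;
	if (evt->type != FAULT_PAGE_REQ)
		return FAULT_PASS_ON;
	return evt->mm_fault_fails ? FAULT_PASS_ON : FAULT_HANDLED;
}

/* Stand-in for the device driver: sees whatever is left over. */
static enum chain_result driver_consumer(const struct fault_event *evt,
					 void *data)
{
	(void)evt;
	++*(unsigned int *)data;
	return FAULT_HANDLED;
}

/* Returns the index of the consumer that handled the event, or -1. */
static int dispatch_fault(const struct consumer *chain, size_t n,
			  const struct fault_event *evt)
{
	for (size_t i = 0; i < n; i++) {
		assert(chain[i].fn);
		if (chain[i].fn(evt, chain[i].data) == FAULT_HANDLED)
			return (int)i;
	}
	return -1;
}
```

With the page-fault consumer first and the driver second, the driver is
notified both of non-recoverable faults and of page requests that could not
be resolved.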

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-09-06 12:42     ` Jean-Philippe Brucker
@ 2018-09-06 13:14       ` Auger Eric
  2018-09-06 17:06         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-09-06 13:14 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Greg Kroah-Hartman, Alex Williamson
  Cc: Jean Delvare, Rafael Wysocki, Raj Ashok

Hi Jean-Philippe,

On 09/06/2018 02:42 PM, Jean-Philippe Brucker wrote:
> On 06/09/2018 10:25, Auger Eric wrote:
>>> +		mutex_lock(&fparam->lock);
>>> +		list_add_tail(&evt_pending->list, &fparam->faults);
>> same doubt as Yi Liu. You cannot rely on the userspace willingness to
>> void the queue and deallocate this memory.

By the way, I saw there is a kind of garbage collector for faults that
have not received a response. However, I am not sure this removes
the concern of the kernel-side fault list growing beyond
acceptable limits.
>>
>> SMMUv3 holds a queue of events whose size is implementation dependent.
>> I think such a queue should be available at SW level and its size should
>> be negotiated.
> 
> Note that this fault API can also be used by host device drivers that
> want to be notified on fault, in which case a direct callback seems
> easier to use, and perhaps more efficient than an intermediate queue.

> 
> When injecting faults into userspace it makes sense to batch events, to
> avoid context switches. Even though that queue management should
> probably be done by VFIO, the IOMMU API has to help in some way, at
> least to tell VFIO when the IOMMU driver is done processing a batch of
> events.
Yes I am currently investigating the usage of a kfifo in
vfio_iommu_type1, filled by the direct callback.
> 
>> Note SMMU has separate queues for PRI and fault events. Here you use the
>> same queue for all events. I don't know if it would make sense to have
>> separate APIs?
> 
> Host device drivers that use this API to be notified on fault can't deal
> with arch-specific event formats (SMMU event, Vt-d fault event, etc), so
> the APIs should be arch-agnostic. Given that requirement, using a single
> iommu_fault_event structure for both PRI and event queues made sense,
> especially since the event queue can have stall events that look a lot
> like PRI page requests.
I understand the data structure needs to be generic. Now I guess PRI
events and other standard translator error events (that can happen
without PRI) may have different characteristics in event fields, queue
size, that may deserve to create different APIs and internal data
structs. Also this may help separating the concerns. My remark also
stems from the fact the SMMU uses 2 different queues, whose size can
also be different.

Thanks

Eric
> 
> Or do you mean separate APIs for recoverable and non-recoverable faults?
> Using the same queue for PRI and stall event, and a separate one for
> non-recoverable events?
> 
> Separate queues may be useful for the device driver scenario
> (non-virtualization), where recoverable faults are handled by
> io-pgfaults while non-recoverable ones could be reported directly to the
> device driver. For this case I was thinking of adding a
> multiple-consumer thing: both io-pgfaults and the device driver register
> a fault handler. io-pgfault handles recoverable ones and what's left
> goes to the device driver. This could also allow the device driver to be
> notified when io-pgfault doesn't successfully handle a fault
> (handle_mm_fault returns an error).
> 
> Thanks,
> Jean
> 
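
As a user-space illustration of the bounded-FIFO idea (not the kernel kfifo
API), the sketch below shows a fixed-size ring filled from a callback, where
a full queue drops the record instead of allocating more memory. The names
fault_fifo, fault_record and FAULT_FIFO_LOG2SIZE are hypothetical; a real
implementation would also need locking or single-producer/single-consumer
ordering guarantees, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical negotiated size: 2^2 = 4 slots. */
#define FAULT_FIFO_LOG2SIZE	2u
#define FAULT_FIFO_SIZE		(1u << FAULT_FIFO_LOG2SIZE)

struct fault_record {
	uint64_t addr;
	uint32_t pasid;
};

struct fault_fifo {
	struct fault_record slot[FAULT_FIFO_SIZE];
	uint32_t head;		/* next slot to write, free-running */
	uint32_t tail;		/* next slot to read, free-running */
	uint32_t dropped;	/* records refused because the ring was full */
};

/* Producer side, e.g. called from the fault callback. */
static bool fault_fifo_put(struct fault_fifo *f, const struct fault_record *r)
{
	assert(f && r);
	if (f->head - f->tail == FAULT_FIFO_SIZE) {
		f->dropped++;
		return false;
	}
	f->slot[f->head & (FAULT_FIFO_SIZE - 1)] = *r;
	f->head++;
	return true;
}

/* Consumer side, e.g. called when user space reads the queue. */
static bool fault_fifo_get(struct fault_fifo *f, struct fault_record *out)
{
	assert(f && out);
	if (f->head == f->tail)
		return false;
	*out = f->slot[f->tail & (FAULT_FIFO_SIZE - 1)];
	f->tail++;
	return true;
}
```

Because the ring never grows, kernel memory use stays bounded no matter how
slowly user space drains it; the cost is that excess records are lost and
only counted.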

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-09-06 13:14       ` Auger Eric
@ 2018-09-06 17:06         ` Jean-Philippe Brucker
  2018-09-07  7:11           ` Auger Eric
  0 siblings, 1 reply; 78+ messages in thread
From: Jean-Philippe Brucker @ 2018-09-06 17:06 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Greg Kroah-Hartman, Alex Williamson
  Cc: Jean Delvare, Rafael Wysocki, Raj Ashok

On 06/09/2018 14:14, Auger Eric wrote:
> Hi Jean-Philippe,
> 
> On 09/06/2018 02:42 PM, Jean-Philippe Brucker wrote:
>> On 06/09/2018 10:25, Auger Eric wrote:
>>>> +		mutex_lock(&fparam->lock);
>>>> +		list_add_tail(&evt_pending->list, &fparam->faults);
>>> same doubt as Yi Liu. You cannot rely on the userspace willingness to
>>> void the queue and deallocate this memory.
> 
> By the way I saw there is a kind of garbage collectors for faults which
> wouldn't have received any responses. However I am not sure this removes
> the concern of having the fault list on kernel side growing beyond
> acceptable limits.

How about per-device quotas? (https://lkml.org/lkml/2018/4/23/706 for
reference) With PRI the IOMMU driver already sets per-device credits
when initializing the device (pci_enable_pri), so if the device behaves
properly it shouldn't send new page requests once the number of
outstanding ones is maxed out.

The stall mode of SMMU doesn't have per-device limit, and depending on
the implementation it might be easy for one guest using stall to prevent
other guests from receiving faults. For this reason we'll have to
enforce a per-device stall quota in the SMMU driver, and immediately
terminate faults that exceed this quota. We could easily do the same for
PRI, if we don't trust devices to follow the spec. The difficult part is
finding the right number of credits...

>> Host device drivers that use this API to be notified on fault can't deal
>> with arch-specific event formats (SMMU event, Vt-d fault event, etc), so
>> the APIs should be arch-agnostic. Given that requirement, using a single
>> iommu_fault_event structure for both PRI and event queues made sense,
>> especially since the event queue can have stall events that look a lot
>> like PRI page requests.
> I understand the data structure needs to be generic. Now I guess PRI
> events and other standard translator error events (that can happen
> without PRI) may have different characteristics in event fields,

Right, an event contains more information than a PRI page request.
Stage-2 fields (CLASS, S2, IPA, TTRnW) cannot be represented by
iommu_fault_event at the moment. For precise emulation it might be
useful to at least add the S2 flag (as a new iommu_fault_reason?), so
that when the guest maps stage-1 to an invalid GPA, QEMU could for
example inject an external abort.

> queue
> size, that may deserve to create different APIs and internal data
> structs. Also this may help separating the concerns.

It might duplicate them. If the consumer of the event report is a host
device driver, the SMMU needs to report a "generic" iommu_fault_event,
and if the consumer is VFIO it would report a specialized one.

> My remark also
> stems from the fact the SMMU uses 2 different queues, whose size can
> also be different.

Hm, for PRI requests the kernel-userspace queue size should actually be
the number of PRI credits for that device. Hadn't thought about it
before, where do we pass that info to userspace? For fault events, the
queue could be as big as the SMMU event queue, though using all that
space might be wasteful. Non-stalled events should be rare and reporting
them isn't urgent. Stalled ones would need the number of stall credits I
mentioned above, which realistically will be a lot less than the SMMU
event queue size. Given that a device will use either PRI or stall but
not both, I still think events and PRI could go through the same queue.

Thanks,
Jean
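
The per-device quota could look roughly like the sketch below. It assumes a
fixed number of credits per device and terminates any fault that arrives
once that many are outstanding. fault_quota, quota_verdict and the two
helpers are hypothetical names used only for illustration, and choosing the
credit value is left open, as noted above.

```c
#include <assert.h>

enum quota_verdict { FAULT_FORWARD, FAULT_TERMINATE };

struct fault_quota {
	unsigned int credits;		/* maximum outstanding faults */
	unsigned int outstanding;	/* faults reported but not yet answered */
};

/* Called when a new recoverable fault arrives for the device. */
static enum quota_verdict fault_quota_charge(struct fault_quota *q)
{
	assert(q);
	if (q->outstanding >= q->credits)
		return FAULT_TERMINATE;	/* over quota: fail it right away */
	q->outstanding++;
	return FAULT_FORWARD;
}

/* Called when the consumer sends the response for one fault. */
static void fault_quota_release(struct fault_quota *q)
{
	assert(q && q->outstanding > 0);
	q->outstanding--;
}
```

Since a device uses either PRI or stall, one counter per device is enough in
this model, and it also bounds how many entries that device can occupy in a
shared queue.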

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-09-06 17:06         ` Jean-Philippe Brucker
@ 2018-09-07  7:11           ` Auger Eric
  2018-09-07 11:23             ` Jean-Philippe Brucker
  0 siblings, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-09-07  7:11 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Greg Kroah-Hartman, Alex Williamson
  Cc: Jean Delvare, Rafael Wysocki, Raj Ashok

Hi Jean-Philippe,

On 09/06/2018 07:06 PM, Jean-Philippe Brucker wrote:
> On 06/09/2018 14:14, Auger Eric wrote:
>> Hi Jean-Philippe,
>>
>> On 09/06/2018 02:42 PM, Jean-Philippe Brucker wrote:
>>> On 06/09/2018 10:25, Auger Eric wrote:
>>>>> +		mutex_lock(&fparam->lock);
>>>>> +		list_add_tail(&evt_pending->list, &fparam->faults);
>>>> same doubt as Yi Liu. You cannot rely on the userspace willingness to
>>>> void the queue and deallocate this memory.
>>
>> By the way I saw there is a kind of garbage collectors for faults which
>> wouldn't have received any responses. However I am not sure this removes
>> the concern of having the fault list on kernel side growing beyond
>> acceptable limits.
> 
> How about per-device quotas? (https://lkml.org/lkml/2018/4/23/706 for
> reference) With PRI the IOMMU driver already sets per-device credits
> when initializing the device (pci_enable_pri), so if the device behaves
> properly it shouldn't send new page requests once the number of
> outstanding ones is maxed out.

But this needs to work for the non-PRI use case too?
> 
> The stall mode of SMMU doesn't have per-device limit, and depending on
> the implementation it might be easy for one guest using stall to prevent
> other guests from receiving faults. For this reason we'll have to
> enforce a per-device stall quota in the SMMU driver, and immediately
> terminate faults that exceed this quota. We could easily do the same for
> PRI, if we don't trust devices to follow the spec. The difficult part is
> finding the right number of credits...
> 
>>> Host device drivers that use this API to be notified on fault can't deal
>>> with arch-specific event formats (SMMU event, Vt-d fault event, etc), so
>>> the APIs should be arch-agnostic. Given that requirement, using a single
>>> iommu_fault_event structure for both PRI and event queues made sense,
>>> especially since the event queue can have stall events that look a lot
>>> like PRI page requests.
>> I understand the data structure needs to be generic. Now I guess PRI
>> events and other standard translator error events (that can happen
>> without PRI) may have different characteristics in event fields,
> 
> Right, an event contains more information than a PRI page request.
> Stage-2 fields (CLASS, S2, IPA, TTRnW) cannot be represented by
> iommu_fault_event at the moment.

Yes, I am currently doing the mapping exercise between SMMUv3 events and
iommu_fault_event, and config errors are missing, for instance.
> For precise emulation it might be
> useful to at least add the S2 flag (as a new iommu_fault_reason?), so
> that when the guest maps stage-1 to an invalid GPA, QEMU could for
> example inject an external abort.

Actually we may even need to filter events and return only the
S1-related ones to the guest.
> 
>> queue
>> size, that may deserve to create different APIs and internal data
>> structs. Also this may help separating the concerns.
> 
> It might duplicate them. If the consumer of the event report is a host
> device driver, the SMMU needs to report a "generic" iommu_fault_event,
> and if the consumer is VFIO it would report a specialized one

I am unsure of my understanding of the UNRECOVERABLE error type. Is it
everything other than PRI? For instance, are all SMMUv3 event errors
supposed to be put under the IOMMU_FAULT_DMA_UNRECOV umbrella?

If I understand correctly there are different consumers for PRI and
unrecoverable data, so why not have two different APIs?
> 
>> My remark also
>> stems from the fact the SMMU uses 2 different queues, whose size can
>> also be different.
> 
> Hm, for PRI requests the kernel-userspace queue size should actually be
> the number of PRI credits for that device. Hadn't thought about it
> before, where do we pass that info to userspace?
Cannot help here at the moment, sorry.
> For fault events, the
> queue could be as big as the SMMU event queue, though using all that
> space might be wasteful.
The guest has its own programming of SMMU_EVENTQ_BASE.LOG2SIZE. This
could be used to program the SW FIFO.

> Non-stalled events should be rare and reporting
> them isn't urgent. Stalled ones would need the number of stall credits I
> mentioned above, which realistically will be a lot less than the SMMU
> event queue size. Given that a device will use either PRI or stall but
> not both, I still think events and PRI could go through the same queue.
Did I get it right that PRI is for PCIe and STALL is for non-PCIe? But
all that stuff is also related to the Page Request use case, right?

Thanks

Eric
> 
> Thanks,
> Jean
> 


* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-09-07  7:11           ` Auger Eric
@ 2018-09-07 11:23             ` Jean-Philippe Brucker
  0 siblings, 0 replies; 78+ messages in thread
From: Jean-Philippe Brucker @ 2018-09-07 11:23 UTC (permalink / raw)
  To: Auger Eric, Jacob Pan, iommu, LKML, Joerg Roedel,
	David Woodhouse, Greg Kroah-Hartman, Alex Williamson
  Cc: Jean Delvare, Rafael Wysocki, Raj Ashok

On 07/09/2018 08:11, Auger Eric wrote:
>>> On 09/06/2018 02:42 PM, Jean-Philippe Brucker wrote:
>>>> On 06/09/2018 10:25, Auger Eric wrote:
>>>>>> +		mutex_lock(&fparam->lock);
>>>>>> +		list_add_tail(&evt_pending->list, &fparam->faults);
>>>>> same doubt as Yi Liu. You cannot rely on the userspace willingness to
>>>>> void the queue and deallocate this memory.
>>>
>>> By the way I saw there is a kind of garbage collectors for faults which
>>> wouldn't have received any responses. However I am not sure this removes
>>> the concern of having the fault list on kernel side growing beyond
>>> acceptable limits.
>>
>> How about per-device quotas? (https://lkml.org/lkml/2018/4/23/706 for
>> reference) With PRI the IOMMU driver already sets per-device credits
>> when initializing the device (pci_enable_pri), so if the device behaves
>> properly it shouldn't send new page requests once the number of
>> outstanding ones is maxed out.
> 
> But this needs to work for non PRI use case too?

Only recoverable faults, PRI and stall, are added to the fparam->faults
list, because the kernel needs to make sure that each of these faults
gets a reply, or else they are held in hardware indefinitely.
Non-recoverable faults don't need tracking, the IOMMU API can forget
about them after they're reported. Rate-limiting could be done by the
consumer if it gets flooded by non-recoverable faults, for example by
dropping some of them.
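
To illustrate the split described above, here is a minimal userspace
sketch, not kernel code. It assumes a hypothetical per-device tracker
(fault_tracker, fault_tracker_report() and fault_tracker_respond() are
invented names) with a fixed quota standing in for the PRI or stall
credits. Only recoverable faults are counted, and one that exceeds the
quota is refused so that the caller can terminate it immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum fault_kind { FAULT_RECOVERABLE, FAULT_UNRECOVERABLE };

struct fault_tracker {
	unsigned int quota;	/* max outstanding recoverable faults */
	unsigned int pending;	/* recoverable faults awaiting a response */
};

/*
 * Returns true if the fault may be forwarded to the consumer. A recoverable
 * fault that exceeds the quota returns false and must be terminated by the
 * caller instead of being queued.
 */
static bool fault_tracker_report(struct fault_tracker *t, enum fault_kind kind)
{
	assert(t != NULL);

	if (kind == FAULT_UNRECOVERABLE)
		return true;	/* report and forget, nothing to track */
	if (t->pending >= t->quota)
		return false;	/* over quota */
	t->pending++;
	return true;
}

/* Called when the consumer answers one recoverable fault. */
static bool fault_tracker_respond(struct fault_tracker *t)
{
	assert(t != NULL);

	if (t->pending == 0)
		return false;	/* no pending fault, drop the response */
	t->pending--;
	return true;
}
```

Picking the quota value is the open question raised above; the sketch
only shows where it would be enforced.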

>> Right, an event contains more information than a PRI page request.
>> Stage-2 fields (CLASS, S2, IPA, TTRnW) cannot be represented by
>> iommu_fault_event at the moment.
> 
> Yes I am currently doing the mapping exercise between SMMUv3 events and
> iommu_fault_event and I miss config errors for instance.

We may have initially focused only on guest and userspace config errors
(IOMMU_FAULT_REASON_PASID_FETCH, IOMMU_FAULT_REASON_PASID_INVALID, etc),
since other config errors are most likely a bug in the host IOMMU
driver, and could be reported with pr_err.

>> For precise emulation it might be
>> useful to at least add the S2 flag (as a new iommu_fault_reason?), so
>> that when the guest maps stage-1 to an invalid GPA, QEMU could for
>> example inject an external abort.
> 
> Actually we may even need to filter events and return to the guest only
> the S1 related.
>>
>>> queue
>>> size, that may deserve to create different APIs and internal data
>>> structs. Also this may help separating the concerns.
>>
>> It might duplicate them. If the consumer of the event report is a host
>> device driver, the SMMU needs to report a "generic" iommu_fault_event,
>> and if the consumer is VFIO it would report a specialized one
> 
> I am unsure of my understanding of the UNRECOVERABLE error type. Is it
> everything else than a PRI. For instance are all SMMUv3 event errors
> supposed to be put under the IOMMU_FAULT_DMA_UNRECOV umbrella?

I guess it's more clear-cut in VT-d, which defines recoverable and
non-recoverable faults. In SMMUv3, PRI Page Requests are recoverable,
but event errors can also be recoverable if they have the Stall flag set.

Stall is a way for non-PCI endpoints to do SVA, and I have a patch in my
series that sorts events into PAGE_REQ and DMA_UNRECOV before feeding
them to this API: https://patchwork.kernel.org/patch/10395043/
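
The sorting rule can be sketched in a few lines. This is a simplified
userspace model, not the code from the patch linked above: struct
hw_event, its fields, the enum values and classify_event() are all
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's or the SMMU driver's types. */
enum fault_class { CLASS_PAGE_REQ, CLASS_DMA_UNRECOV };

struct hw_event {
	bool from_pri_queue;	/* arrived as a PRI Page Request */
	bool stall;		/* event queue record with the Stall flag set */
};

/*
 * PRI Page Requests and stalled events are recoverable and expect a
 * response; every other event is reported as unrecoverable.
 */
static enum fault_class classify_event(const struct hw_event *ev)
{
	assert(ev != NULL);

	if (ev->from_pri_queue || ev->stall)
		return CLASS_PAGE_REQ;
	return CLASS_DMA_UNRECOV;
}
```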

> If I understand correctly there are different consumers for PRI and
> unrecoverable data, so why not having 2 different APIs.

My reasoning was that for virtualization they go through the same
channel, VFIO, until the guest or the vIOMMU dispatches them depending
on their type, so we might as well use the same API.

In addition, host device drivers might also want to handle stall or PRI
events themselves instead of relying on the SVA infrastructure. For
example the MSM GPU with SMMUv2: https://patchwork.kernel.org/patch/9953803/

>>> My remark also
>>> stems from the fact the SMMU uses 2 different queues, whose size can
>>> also be different.
>>
>> Hm, for PRI requests the kernel-userspace queue size should actually be
>> the number of PRI credits for that device. Hadn't thought about it
>> before, where do we pass that info to userspace?
> Cannot help here at the moment, sorry.
>> For fault events, the
>> queue could be as big as the SMMU event queue, though using all that
>> space might be wasteful.
> The guest has its own programming of the SMMU_EVENTQ_BASE.LOG2SIZE. This
> could be used to program the SW fifo
> 
>> Non-stalled events should be rare and reporting
>> them isn't urgent. Stalled ones would need the number of stall credits I
>> mentioned above, which realistically will be a lot less than the SMMU
>> event queue size. Given that a device will use either PRI or stall but
>> not both, I still think events and PRI could go through the same queue.
> Did I get it right PRI is for PCIe and STALL for non PCIe? But all that
> stuff also is related to Page Request use case, right?

Yes, a stall event is a page request from a non-PCI device, but it comes
in through the SMMU event queue.

Thanks,
Jean


* Re: [PATCH v5 14/23] iommu: introduce page response function
  2018-05-11 20:54 ` [PATCH v5 14/23] iommu: introduce page response function Jacob Pan
  2018-05-14  6:39   ` Lu Baolu
@ 2018-09-10 14:52   ` Auger Eric
  2018-09-10 17:50     ` Jacob Pan
  1 sibling, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-09-10 14:52 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj Ashok, Rafael Wysocki, Jean Delvare

Hi Jacob,

On 05/11/2018 10:54 PM, Jacob Pan wrote:
> IO page faults can be handled outside IOMMU subsystem. For an example,
> when nested translation is turned on and guest owns the
> first level page tables, device page request can be forwared
forwarded
> to the guest for handling faults. As the page response returns
> by the guest, IOMMU driver on the host need to process the
from the guest ...  host needs
> response which informs the device and completes the page request
> transaction.
> 
> This patch introduces generic API function for page response
> passing from the guest or other in-kernel users. The definitions of
> the generic data is based on PCI ATS specification not limited to
> any vendor.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Link: https://lkml.org/lkml/2017/12/7/1725
> ---
>  drivers/iommu/iommu.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/iommu.h | 43 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 88 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index b3f9daf..02fed3e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1533,6 +1533,51 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
>  
> +int iommu_page_response(struct device *dev,
> +			struct page_response_msg *msg)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = -EINVAL;
> +	struct iommu_fault_event *evt;
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	if (!domain || !domain->ops->page_response)
> +		return -ENODEV;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!param || !param->fault_param)
> +		return -EINVAL;
> +
> +	/* Only send response if there is a fault report pending */
> +	mutex_lock(&param->fault_param->lock);
> +	if (list_empty(&param->fault_param->faults)) {
> +		pr_warn("no pending PRQ, drop response\n");
> +		goto done_unlock;
> +	}
> +	/*
> +	 * Check if we have a matching page request pending to respond,
> +	 * otherwise return -EINVAL
> +	 */
> +	list_for_each_entry(evt, &param->fault_param->faults, list) {
> +		if (evt->pasid == msg->pasid &&
> +		    msg->page_req_group_id == evt->page_req_group_id) {
> +			msg->private_data = evt->iommu_private;
> +			ret = domain->ops->page_response(dev, msg);
> +			list_del(&evt->list);
don't you need a list_for_each_entry_safe?
> +			kfree(evt);
> +			break;
> +		}
> +	}
> +
> +done_unlock:
> +	mutex_unlock(&param->fault_param->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_page_response);
> +
>  static void __iommu_detach_device(struct iommu_domain *domain,
>  				  struct device *dev)
>  {
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index b3312ee..722b90f 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -163,6 +163,41 @@ struct iommu_resv_region {
>  #ifdef CONFIG_IOMMU_API
>  
>  /**
> + * enum page_response_code - Return status of fault handlers, telling the IOMMU
> + * driver how to proceed with the fault.
> + *
> + * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page tables
> + *	populated, retry the access. This is "Success" in PCI PRI.
> + * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent faults from
> + *	this device if possible. This is "Response Failure" in PCI PRI.
> + * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't retry the
> + *	access. This is "Invalid Request" in PCI PRI.
> + */
> +enum page_response_code {
> +	IOMMU_PAGE_RESP_SUCCESS = 0,
> +	IOMMU_PAGE_RESP_INVALID,
> +	IOMMU_PAGE_RESP_FAILURE,
> +};
> +
> +/**
> + * Generic page response information based on PCI ATS and PASID spec.
> + * @addr: servicing page address
> + * @pasid: contains process address space ID
> + * @resp_code: response code
nit: @pasid_present doc missing although quite obvious
> + * @page_req_group_id: page request group index
> + * @private_data: uniquely identify device-specific private data for an
> + *                individual page response
> + */
> +struct page_response_msg {
> +	u64 addr;
> +	u32 pasid;
> +	enum page_response_code resp_code;
> +	u32 pasid_present:1;
> +	u32 page_req_group_id;
> +	u64 private_data;
> +};
Doesn't it need to be part of iommu uapi header since the virtualizer
will pass the response through VFIO?

As mentioned in the previous discussion, this is really PRI related and
does not really fit unrecoverable fault reporting. To me we should
clarify whether this API targets both use cases or only the PRI response
use case. Also, in the implementation we check the PASID and PRG index.
As mentioned by Jean-Philippe, unrecoverable "traditional" faults do not
require managing a list in the iommu subsystem.

Have you considered using a kfifo instead of a list to manage the
pending PRI requests?

Thanks

Eric
> +
> +/**
>   * struct iommu_ops - iommu ops and capabilities
>   * @capable: check capability
>   * @domain_alloc: allocate iommu domain
> @@ -195,6 +230,7 @@ struct iommu_resv_region {
>   * @bind_pasid_table: bind pasid table pointer for guest SVM
>   * @unbind_pasid_table: unbind pasid table pointer and restore defaults
>   * @sva_invalidate: invalidate translation caches of shared virtual address
> + * @page_response: handle page request response
>   */
>  struct iommu_ops {
>  	bool (*capable)(enum iommu_cap);
> @@ -250,6 +286,7 @@ struct iommu_ops {
>  				struct device *dev);
>  	int (*sva_invalidate)(struct iommu_domain *domain,
>  		struct device *dev, struct tlb_invalidate_info *inv_info);
> +	int (*page_response)(struct device *dev, struct page_response_msg *msg);
>  
>  	unsigned long pgsize_bitmap;
>  };
> @@ -470,6 +507,7 @@ extern int iommu_unregister_device_fault_handler(struct device *dev);
>  
>  extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
>  
> +extern int iommu_page_response(struct device *dev, struct page_response_msg *msg);
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -758,6 +796,11 @@ static inline int iommu_report_device_fault(struct device *dev, struct iommu_fau
>  	return -ENODEV;
>  }
>  
> +static inline int iommu_page_response(struct device *dev, struct page_response_msg *msg)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;
> 


* Re: [PATCH v5 14/23] iommu: introduce page response function
  2018-09-10 14:52   ` Auger Eric
@ 2018-09-10 17:50     ` Jacob Pan
  2018-09-10 19:06       ` Auger Eric
  0 siblings, 1 reply; 78+ messages in thread
From: Jacob Pan @ 2018-09-10 17:50 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Raj Ashok,
	Rafael Wysocki, Jean Delvare, jacob.jun.pan

On Mon, 10 Sep 2018 16:52:24 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
Hi Eric,

Thanks for the review, please see comments inline.
> On 05/11/2018 10:54 PM, Jacob Pan wrote:
> > IO page faults can be handled outside IOMMU subsystem. For an
> > example, when nested translation is turned on and guest owns the
> > first level page tables, device page request can be forwared  
> forwarded
> > to the guest for handling faults. As the page response returns
> > by the guest, IOMMU driver on the host need to process the  
> from the guest ...  host needs
> > response which informs the device and completes the page request
> > transaction.
> > 
> > This patch introduces generic API function for page response
> > passing from the guest or other in-kernel users. The definitions of
> > the generic data is based on PCI ATS specification not limited to
> > any vendor.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Link: https://lkml.org/lkml/2017/12/7/1725
> > ---
> >  drivers/iommu/iommu.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/iommu.h | 43 +++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 88 insertions(+)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index b3f9daf..02fed3e 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1533,6 +1533,51 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
> >  
> > +int iommu_page_response(struct device *dev,
> > +			struct page_response_msg *msg)
> > +{
> > +	struct iommu_param *param = dev->iommu_param;
> > +	int ret = -EINVAL;
> > +	struct iommu_fault_event *evt;
> > +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> > +
> > +	if (!domain || !domain->ops->page_response)
> > +		return -ENODEV;
> > +
> > +	/*
> > +	 * Device iommu_param should have been allocated when
> > device is
> > +	 * added to its iommu_group.
> > +	 */
> > +	if (!param || !param->fault_param)
> > +		return -EINVAL;
> > +
> > +	/* Only send response if there is a fault report pending */
> > +	mutex_lock(&param->fault_param->lock);
> > +	if (list_empty(&param->fault_param->faults)) {
> > +		pr_warn("no pending PRQ, drop response\n");
> > +		goto done_unlock;
> > +	}
> > +	/*
> > +	 * Check if we have a matching page request pending to
> > respond,
> > +	 * otherwise return -EINVAL
> > +	 */
> > +	list_for_each_entry(evt, &param->fault_param->faults,
> > list) {
> > +		if (evt->pasid == msg->pasid &&
> > +		    msg->page_req_group_id ==
> > evt->page_req_group_id) {
> > +			msg->private_data = evt->iommu_private;
> > +			ret = domain->ops->page_response(dev, msg);
> > +			list_del(&evt->list);  
> don't you need a list_for_each_entry_safe?
why? I am here exiting the loop.
> > +			kfree(evt);
> > +			break;
> > +		}
> > +	}
> > +
> > +done_unlock:
> > +	mutex_unlock(&param->fault_param->lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_page_response);
> > +
> >  static void __iommu_detach_device(struct iommu_domain *domain,
> >  				  struct device *dev)
> >  {
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index b3312ee..722b90f 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -163,6 +163,41 @@ struct iommu_resv_region {
> >  #ifdef CONFIG_IOMMU_API
> >  
> >  /**
> > + * enum page_response_code - Return status of fault handlers,
> > telling the IOMMU
> > + * driver how to proceed with the fault.
> > + *
> > + * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page
> > tables
> > + *	populated, retry the access. This is "Success" in PCI
> > PRI.
> > + * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent
> > faults from
> > + *	this device if possible. This is "Response Failure" in
> > PCI PRI.
> > + * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't
> > retry the
> > + *	access. This is "Invalid Request" in PCI PRI.
> > + */
> > +enum page_response_code {
> > +	IOMMU_PAGE_RESP_SUCCESS = 0,
> > +	IOMMU_PAGE_RESP_INVALID,
> > +	IOMMU_PAGE_RESP_FAILURE,
> > +};
> > +
> > +/**
> > + * Generic page response information based on PCI ATS and PASID
> > spec.
> > + * @addr: servicing page address
> > + * @pasid: contains process address space ID
> > + * @resp_code: response code  
> nit: @pasid_present doc missing although quite obvious
> > + * @page_req_group_id: page request group index
> > + * @private_data: uniquely identify device-specific private data
> > for an
> > + *                individual page response
> > + */
> > +struct page_response_msg {
> > +	u64 addr;
> > +	u32 pasid;
> > +	enum page_response_code resp_code;
> > +	u32 pasid_present:1;
> > +	u32 page_req_group_id;
> > +	u64 private_data;
> > +};  
> Doesn't it need to be part of iommu uapi header since the virtualizer
> will pass the response through VFIO?
> 
Right, that has been the same feedback from others as well. I am moving
it to uapi in the next rev.
> As mentioned in previous discussion this is really PRI related and
> does not really fit unrecoverable fault reporting. To me we should
> clarify if this API targets both use cases or only the PRI response
> use case.
Yes, I should clarify this is for PRI only. It is a little bit asymmetric
in that the IOMMU device fault reporting covers both unrecoverable
faults and PRI, but only PRI needs a page response.

> Also in the implementation we check pasid and PRGindex. As
> mentionned by Jean-Philippe, unrecoverable "traditional" faults do
> not require to manage a list in the iommu subsystem.
> 
I am not sure if that is a question. We support PRI with PASID only.
We keep the group ID for page responses.
> Have you considered using a kfifo instead of a list to manage the
> pending PRI requests?
> 
No, I will look into it. But we may need to traverse the list in case
of exceptions, e.g. dropping some pending requests if the device faults
or the process/VM terminates.
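
The traversal requirement can be shown with a small userspace model. It
is only a sketch: it uses a plain singly linked list rather than the
kernel's list_head, and struct pending_req, pending_add(),
pending_complete() and pending_drop_pasid() are hypothetical names. A
response is matched on PASID and group ID wherever the request sits in
the list, and a whole PASID can be flushed in one pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of a pending page request list. */
struct pending_req {
	uint32_t pasid;
	uint32_t group_id;
	struct pending_req *next;
};

/* Queue a request at the head; returns false on allocation failure. */
static bool pending_add(struct pending_req **head, uint32_t pasid,
			uint32_t group_id)
{
	struct pending_req *req;

	assert(head != NULL);
	req = malloc(sizeof(*req));
	if (!req)
		return false;
	req->pasid = pasid;
	req->group_id = group_id;
	req->next = *head;
	*head = req;
	return true;
}

/* Complete the single request matching a response, in any order. */
static bool pending_complete(struct pending_req **head, uint32_t pasid,
			     uint32_t group_id)
{
	struct pending_req **link;

	assert(head != NULL);
	for (link = head; *link; link = &(*link)->next) {
		struct pending_req *req = *link;

		if (req->pasid == pasid && req->group_id == group_id) {
			*link = req->next;	/* unlink, then stop walking */
			free(req);
			return true;
		}
	}
	return false;	/* no matching request, drop the response */
}

/* Drop every request of a PASID, e.g. when its process or VM goes away. */
static unsigned int pending_drop_pasid(struct pending_req **head,
				       uint32_t pasid)
{
	struct pending_req **link = head;
	unsigned int dropped = 0;

	assert(head != NULL);
	while (*link) {
		struct pending_req *req = *link;

		if (req->pasid == pasid) {
			*link = req->next;	/* link already points past req */
			free(req);
			dropped++;
		} else {
			link = &req->next;
		}
	}
	return dropped;
}
```

Both operations remove entries from the middle of the list, which a
strict first-in, first-out queue does not offer.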

> Thanks
> 
> Eric
> > +
> > +/**
> >   * struct iommu_ops - iommu ops and capabilities
> >   * @capable: check capability
> >   * @domain_alloc: allocate iommu domain
> > @@ -195,6 +230,7 @@ struct iommu_resv_region {
> >   * @bind_pasid_table: bind pasid table pointer for guest SVM
> >   * @unbind_pasid_table: unbind pasid table pointer and restore
> > defaults
> >   * @sva_invalidate: invalidate translation caches of shared
> > virtual address
> > + * @page_response: handle page request response
> >   */
> >  struct iommu_ops {
> >  	bool (*capable)(enum iommu_cap);
> > @@ -250,6 +286,7 @@ struct iommu_ops {
> >  				struct device *dev);
> >  	int (*sva_invalidate)(struct iommu_domain *domain,
> >  		struct device *dev, struct tlb_invalidate_info *inv_info);
> > +	int (*page_response)(struct device *dev, struct page_response_msg *msg);
> >  
> >  	unsigned long pgsize_bitmap;
> >  };
> > @@ -470,6 +507,7 @@ extern int iommu_unregister_device_fault_handler(struct device *dev);
> >  
> >  extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
> >  
> > +extern int iommu_page_response(struct device *dev, struct page_response_msg *msg);
> >  extern int iommu_group_id(struct iommu_group *group);
> >  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
> >  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> > @@ -758,6 +796,11 @@ static inline int iommu_report_device_fault(struct device *dev, struct iommu_fau
> >  	return -ENODEV;
> >  }
> >  
> > +static inline int iommu_page_response(struct device *dev, struct page_response_msg *msg)
> > +{
> > +	return -ENODEV;
> > +}
> > +
> >  static inline int iommu_group_id(struct iommu_group *group)
> >  {
> >  	return -ENODEV;
> >   

[Jacob Pan]


* Re: [PATCH v5 14/23] iommu: introduce page response function
  2018-09-10 17:50     ` Jacob Pan
@ 2018-09-10 19:06       ` Auger Eric
  0 siblings, 0 replies; 78+ messages in thread
From: Auger Eric @ 2018-09-10 19:06 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Raj Ashok,
	Rafael Wysocki, Jean Delvare

Hi Jacob,

On 09/10/2018 07:50 PM, Jacob Pan wrote:
> On Mon, 10 Sep 2018 16:52:24 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>>
> Hi Eric,
> 
> Thanks for the review, please comments inline.
>> On 05/11/2018 10:54 PM, Jacob Pan wrote:
>>> IO page faults can be handled outside IOMMU subsystem. For an
>>> example, when nested translation is turned on and guest owns the
>>> first level page tables, device page request can be forwared  
>> forwarded
>>> to the guest for handling faults. As the page response returns
>>> by the guest, IOMMU driver on the host need to process the  
>> from the guest ...  host needs
>>> response which informs the device and completes the page request
>>> transaction.
>>>
>>> This patch introduces generic API function for page response
>>> passing from the guest or other in-kernel users. The definitions of
>>> the generic data is based on PCI ATS specification not limited to
>>> any vendor.
>>>
>>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Link: https://lkml.org/lkml/2017/12/7/1725
>>> ---
>>>  drivers/iommu/iommu.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/iommu.h | 43 +++++++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 88 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index b3f9daf..02fed3e 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -1533,6 +1533,51 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
>>>  }
>>>  EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
>>>  
>>> +int iommu_page_response(struct device *dev,
>>> +			struct page_response_msg *msg)
>>> +{
>>> +	struct iommu_param *param = dev->iommu_param;
>>> +	int ret = -EINVAL;
>>> +	struct iommu_fault_event *evt;
>>> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>>> +
>>> +	if (!domain || !domain->ops->page_response)
>>> +		return -ENODEV;
>>> +
>>> +	/*
>>> +	 * Device iommu_param should have been allocated when
>>> device is
>>> +	 * added to its iommu_group.
>>> +	 */
>>> +	if (!param || !param->fault_param)
>>> +		return -EINVAL;
>>> +
>>> +	/* Only send response if there is a fault report pending */
>>> +	mutex_lock(&param->fault_param->lock);
>>> +	if (list_empty(&param->fault_param->faults)) {
>>> +		pr_warn("no pending PRQ, drop response\n");
>>> +		goto done_unlock;
>>> +	}
>>> +	/*
>>> +	 * Check if we have a matching page request pending to
>>> respond,
>>> +	 * otherwise return -EINVAL
>>> +	 */
>>> +	list_for_each_entry(evt, &param->fault_param->faults,
>>> list) {
>>> +		if (evt->pasid == msg->pasid &&
>>> +		    msg->page_req_group_id ==
>>> evt->page_req_group_id) {
>>> +			msg->private_data = evt->iommu_private;
>>> +			ret = domain->ops->page_response(dev, msg);
>>> +			list_del(&evt->list);  
>> don't you need a list_for_each_entry_safe?
> why? I am here exiting the loop.
>>> +			kfree(evt);
>>> +			break;
Ah OK, I missed the break. If you delete a single entry per page response
it is OK then. Sorry for the noise.
>>> +		}
>>> +	}
>>> +
>>> +done_unlock:
>>> +	mutex_unlock(&param->fault_param->lock);
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(iommu_page_response);
>>> +
>>>  static void __iommu_detach_device(struct iommu_domain *domain,
>>>  				  struct device *dev)
>>>  {
>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>> index b3312ee..722b90f 100644
>>> --- a/include/linux/iommu.h
>>> +++ b/include/linux/iommu.h
>>> @@ -163,6 +163,41 @@ struct iommu_resv_region {
>>>  #ifdef CONFIG_IOMMU_API
>>>  
>>>  /**
>>> + * enum page_response_code - Return status of fault handlers,
>>> telling the IOMMU
>>> + * driver how to proceed with the fault.
>>> + *
>>> + * @IOMMU_PAGE_RESP_SUCCESS: Fault has been handled and the page
>>> tables
>>> + *	populated, retry the access. This is "Success" in PCI
>>> PRI.
>>> + * @IOMMU_PAGE_RESP_FAILURE: General error. Drop all subsequent
>>> faults from
>>> + *	this device if possible. This is "Response Failure" in
>>> PCI PRI.
>>> + * @IOMMU_PAGE_RESP_INVALID: Could not handle this fault, don't
>>> retry the
>>> + *	access. This is "Invalid Request" in PCI PRI.
>>> + */
>>> +enum page_response_code {
>>> +	IOMMU_PAGE_RESP_SUCCESS = 0,
>>> +	IOMMU_PAGE_RESP_INVALID,
>>> +	IOMMU_PAGE_RESP_FAILURE,
>>> +};
>>> +
>>> +/**
>>> + * Generic page response information based on PCI ATS and PASID
>>> spec.
>>> + * @addr: servicing page address
>>> + * @pasid: contains process address space ID
>>> + * @resp_code: response code  
>> nit: @pasid_present doc missing although quite obvious
>>> + * @page_req_group_id: page request group index
>>> + * @private_data: uniquely identify device-specific private data
>>> for an
>>> + *                individual page response
>>> + */
>>> +struct page_response_msg {
>>> +	u64 addr;
>>> +	u32 pasid;
>>> +	enum page_response_code resp_code;
>>> +	u32 pasid_present:1;
>>> +	u32 page_req_group_id;
>>> +	u64 private_data;
>>> +};  
>> Doesn't it need to be part of iommu uapi header since the virtualizer
>> will pass the response through VFIO?
>>
> Right, that has been the same feedback from others as well. I am moving
> it to uapi in the next rev.
>> As mentioned in previous discussion this is really PRI related and
>> does not really fit unrecoverable fault reporting. To me we should
>> clarify if this API targets both use cases or only the PRI response
>> use case.
> Yes, I should clarify this is for PRI only. It is little bit asymmetric
> in that per IOMMU device fault reporting covers both unrecoverable
> faults and PRI, but only PRI needs page response.
OK. Still unrecoverable errors need a "read" API as the virtualizer may
inject them into a guest. The fault handler may signal an eventfd and
the userspace handler needs to retrieve the pending fault event(s).
> 
>> Also in the implementation we check pasid and PRGindex. As
>> mentionned by Jean-Philippe, unrecoverable "traditional" faults do
>> not require to manage a list in the iommu subsystem.

>>
> I am not sure if that is a question. We support PRI with PASID only.
> We keep the group ID for page responses.
As I was trying to reuse this API for unrecoverable errors for SMMU
stage 1 (unrelated to PRI management), the check of PASID and PRG index
looked very PRI-specific.
>> Have you considered using a kfifo instead of a list to manage the
>> pending PRI requests?
>>
> No, I will look into it. But we may need too traverse the list in case
> of exceptions. e.g. dropping some pending requests if device faults or
> process/vm terminates.
Yes, thinking more about it, the kfifo does not seem to be suited to your
needs. Also I think the PRI requests may be sent out of order (?). Kfifo
looks better suited to unrecoverable errors.
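
For the unrecoverable case, a FIFO could look roughly like the sketch
below. It is a plain userspace ring buffer, not the kernel's kfifo API,
and struct unrecov_fault, struct unrecov_fifo and the put/get helpers
are hypothetical names. When the buffer is full the new fault is simply
dropped, in line with the earlier remark in this thread that
non-recoverable faults need no tracking once reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNRECOV_FIFO_SIZE 4	/* arbitrary; must be a power of two */

/* Hypothetical record; a real fault event carries more fields. */
struct unrecov_fault {
	uint64_t addr;
	uint32_t reason;
};

struct unrecov_fifo {
	struct unrecov_fault slots[UNRECOV_FIFO_SIZE];
	unsigned int head;	/* free-running write index */
	unsigned int tail;	/* free-running read index */
};

/* Producer side: returns false and drops the fault when the FIFO is full. */
static bool unrecov_fifo_put(struct unrecov_fifo *f,
			     const struct unrecov_fault *ev)
{
	assert(f != NULL && ev != NULL);

	if (f->head - f->tail == UNRECOV_FIFO_SIZE)
		return false;
	f->slots[f->head++ % UNRECOV_FIFO_SIZE] = *ev;
	return true;
}

/* Consumer side: returns false when there is nothing to read. */
static bool unrecov_fifo_get(struct unrecov_fifo *f,
			     struct unrecov_fault *out)
{
	assert(f != NULL && out != NULL);

	if (f->head == f->tail)
		return false;
	*out = f->slots[f->tail++ % UNRECOV_FIFO_SIZE];
	return true;
}
```

Entries leave strictly in arrival order, which is fine here because no
response has to be matched against them.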

Thanks

Eric
> 
>> Thanks
>>
>> Eric
>>> +
>>> +/**
>>>   * struct iommu_ops - iommu ops and capabilities
>>>   * @capable: check capability
>>>   * @domain_alloc: allocate iommu domain
>>> @@ -195,6 +230,7 @@ struct iommu_resv_region {
>>>   * @bind_pasid_table: bind pasid table pointer for guest SVM
>>>   * @unbind_pasid_table: unbind pasid table pointer and restore
>>> defaults
>>>   * @sva_invalidate: invalidate translation caches of shared
>>> virtual address
>>> + * @page_response: handle page request response
>>>   */
>>>  struct iommu_ops {
>>>  	bool (*capable)(enum iommu_cap);
>>> @@ -250,6 +286,7 @@ struct iommu_ops {
>>>  				struct device *dev);
>>>  	int (*sva_invalidate)(struct iommu_domain *domain,
>>>  		struct device *dev, struct tlb_invalidate_info *inv_info);
>>> +	int (*page_response)(struct device *dev, struct page_response_msg *msg);
>>>  
>>>  	unsigned long pgsize_bitmap;
>>>  };
>>> @@ -470,6 +507,7 @@ extern int iommu_unregister_device_fault_handler(struct device *dev);
>>>  
>>>  extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
>>>  
>>> +extern int iommu_page_response(struct device *dev, struct page_response_msg *msg);
>>>  extern int iommu_group_id(struct iommu_group *group);
>>>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>>>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
>>> @@ -758,6 +796,11 @@ static inline int iommu_report_device_fault(struct device *dev, struct iommu_fau
>>>  	return -ENODEV;
>>>  }
>>>  
>>> +static inline int iommu_page_response(struct device *dev, struct page_response_msg *msg)
>>> +{
>>> +}
>>> +
>>>  static inline int iommu_group_id(struct iommu_group *group)
>>>  {
>>>  	return -ENODEV;
>>>   
> 
> [Jacob Pan]
> 

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-11 20:54 ` [PATCH v5 13/23] iommu: introduce device fault report API Jacob Pan
                     ` (2 preceding siblings ...)
  2018-09-06  9:25   ` Auger Eric
@ 2018-09-14 13:24   ` Auger Eric
  2018-09-17 16:57     ` Jacob Pan
  2018-09-25 14:58   ` Jean-Philippe Brucker
  4 siblings, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-09-14 13:24 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Raj Ashok, Rafael Wysocki, Jean Delvare

Hi Jacob,

On 5/11/18 10:54 PM, Jacob Pan wrote:
> Traditionally, device specific faults are detected and handled within
> their own device drivers. When IOMMU is enabled, faults such as DMA
> related transactions are detected by IOMMU. There is no generic
> reporting mechanism to report faults back to the in-kernel device
> driver or the guest OS in case of assigned devices.
> 
> Faults detected by IOMMU is based on the transaction's source ID which
> can be reported at per device basis, regardless of the device type is a
> PCI device or not.
> 
> The fault types include recoverable (e.g. page request) and
> unrecoverable faults(e.g. access error). In most cases, faults can be
> handled by IOMMU drivers internally. The primary use cases are as
> follows:
> 1. page request fault originated from an SVM capable device that is
> assigned to guest via vIOMMU. In this case, the first level page tables
> are owned by the guest. Page request must be propagated to the guest to
> let guest OS fault in the pages then send page response. In this
> mechanism, the direct receiver of IOMMU fault notification is VFIO,
> which can relay notification events to QEMU or other user space
> software.
> 
> 2. faults need more subtle handling by device drivers. Other than
> simply invoke reset function, there are needs to let device driver
> handle the fault with a smaller impact.
> 
> This patchset is intended to create a generic fault report API such
> that it can scale as follows:
> - all IOMMU types
> - PCI and non-PCI devices
> - recoverable and unrecoverable faults
> - VFIO and other in-kernel users
> - DMA & IRQ remapping (TBD)
> The original idea was brought up by David Woodhouse and discussions
> summarized at https://lwn.net/Articles/608914/.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/iommu/iommu.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iommu.h |  35 +++++++++++-
>  2 files changed, 181 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3a49b96..b3f9daf 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -609,6 +609,13 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  		goto err_free_name;
>  	}
>  
> +	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
> +	if (!dev->iommu_param) {
> +		ret = -ENOMEM;
> +		goto err_free_name;
> +	}
> +	mutex_init(&dev->iommu_param->lock);
> +
>  	kobject_get(group->devices_kobj);
>  
>  	dev->iommu_group = group;
> @@ -639,6 +646,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  	mutex_unlock(&group->mutex);
>  	dev->iommu_group = NULL;
>  	kobject_put(group->devices_kobj);
> +	kfree(dev->iommu_param);
>  err_free_name:
>  	kfree(device->name);
>  err_remove_link:
> @@ -685,7 +693,7 @@ void iommu_group_remove_device(struct device *dev)
>  	sysfs_remove_link(&dev->kobj, "iommu_group");
>  
>  	trace_remove_device_from_group(group->id, dev);
> -
> +	kfree(dev->iommu_param);
>  	kfree(device->name);
>  	kfree(device);
>  	dev->iommu_group = NULL;
> @@ -820,6 +828,145 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>  
>  /**
> + * iommu_register_device_fault_handler() - Register a device fault handler
> + * @dev: the device
> + * @handler: the fault handler
> + * @data: private data passed as argument to the handler
> + *
> + * When an IOMMU fault event is received, call this handler with the fault event
> + * and data as argument. The handler should return 0 on success. If the fault is
> + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also complete
> + * the fault by calling iommu_page_response() with one of the following
> + * response code:
> + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
> + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
> + *   page faults if possible.
> + *
> + * Return 0 if the fault handler was installed successfully, or an error.
> + */
> +int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = 0;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->lock);
> +	/* Only allow one fault handler registered for each device */
> +	if (param->fault_param) {
> +		ret = -EBUSY;
> +		goto done_unlock;
> +	}
> +
> +	get_device(dev);
> +	param->fault_param =
> +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!param->fault_param) {
> +		put_device(dev);
> +		ret = -ENOMEM;
> +		goto done_unlock;
> +	}
> +	mutex_init(&param->fault_param->lock);
> +	param->fault_param->handler = handler;
> +	param->fault_param->data = data;
> +	INIT_LIST_HEAD(&param->fault_param->faults);
> +
> +done_unlock:
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> +
> +/**
> + * iommu_unregister_device_fault_handler() - Unregister the device fault handler
> + * @dev: the device
> + *
> + * Remove the device fault handler installed with
> + * iommu_register_device_fault_handler().
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	struct iommu_param *param = dev->iommu_param;
> +	int ret = 0;
> +
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->lock);
> +	/* we cannot unregister handler if there are pending faults */
> +	if (!list_empty(&param->fault_param->faults)) {
> +		ret = -EBUSY;
> +		goto unlock;
> +	}
> +
> +	kfree(param->fault_param);
> +	param->fault_param = NULL;
> +	put_device(dev);
Don't you need to test whether param->fault_param is set first? Otherwise
you may end up with an unpaired put_device().
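
A minimal user-space model of the check being asked for, as a sketch only.
The types dev_model and fault_param and the functions model_register() and
model_unregister() are hypothetical stand-ins; locking is omitted, and
returning 0 when no handler is registered is an assumption, not something
the patch specifies. Note that without the check, the list_empty() test
above would also dereference a NULL fault_param.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct iommu_fault_param. */
struct fault_param {
	int pending;		/* faults still awaiting a response */
};

/* Hypothetical stand-in for struct device plus its iommu_param. */
struct dev_model {
	int refcount;		/* models get_device()/put_device() */
	struct fault_param *fault_param;
};

static int model_register(struct dev_model *dev)
{
	if (dev->fault_param)
		return -EBUSY;	/* only one handler per device */
	dev->fault_param = calloc(1, sizeof(*dev->fault_param));
	if (!dev->fault_param)
		return -ENOMEM;
	dev->refcount++;	/* get_device() */
	return 0;
}

static int model_unregister(struct dev_model *dev)
{
	/* Nothing registered: leave the reference count untouched. */
	if (!dev->fault_param)
		return 0;
	/* Cannot unregister while faults are pending. */
	if (dev->fault_param->pending)
		return -EBUSY;
	free(dev->fault_param);
	dev->fault_param = NULL;
	dev->refcount--;	/* put_device(), paired with register */
	return 0;
}
```

With the early return, calling model_unregister() twice in a row, or before
any registration, leaves the reference count balanced.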

> +unlock:
> +	mutex_unlock(&param->lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> +
> +
> +/**
> + * iommu_report_device_fault() - Report fault event to device
> + * @dev: the device
> + * @evt: fault event data
> + *
> + * Called by IOMMU model specific drivers when fault is detected, typically
> + * in a threaded IRQ handler.
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	int ret = 0;
> +	struct iommu_fault_event *evt_pending;
> +	struct iommu_fault_param *fparam;
> +
> +	/* iommu_param is allocated when device is added to group */
> +	if (!dev->iommu_param | !evt)
> +		return -EINVAL;
> +	/* we only report device fault if there is a handler registered */
> +	mutex_lock(&dev->iommu_param->lock);
> +	if (!dev->iommu_param->fault_param ||
> +		!dev->iommu_param->fault_param->handler) {
> +		ret = -EINVAL;
> +		goto done_unlock;
> +	}
> +	fparam = dev->iommu_param->fault_param;
> +	if (evt->type == IOMMU_FAULT_PAGE_REQ && evt->last_req) {
> +		evt_pending = kmemdup(evt, sizeof(struct iommu_fault_event),
> +				GFP_KERNEL);
> +		if (!evt_pending) {
> +			ret = -ENOMEM;
> +			goto done_unlock;
> +		}
> +		mutex_lock(&fparam->lock);
> +		list_add_tail(&evt_pending->list, &fparam->faults);
> +		mutex_unlock(&fparam->lock);
> +	}
> +	ret = fparam->handler(evt, fparam->data);
> +done_unlock:
> +	mutex_unlock(&dev->iommu_param->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> +
> +/**
>   * iommu_group_id - Return ID for a group
>   * @group: the group to ID
>   *
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index aeadb4f..b3312ee 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -307,7 +307,8 @@ enum iommu_fault_reason {
>   * and PASID spec.
>   * - Un-recoverable faults of device interest
>   * - DMA remapping and IRQ remapping faults
> -
> + *
> + * @list pending fault event list, used for tracking responses
>   * @type contains fault type.
>   * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
>   *         faults are not reported
> @@ -324,6 +325,7 @@ enum iommu_fault_reason {
>   *                 sending the fault response.
>   */
>  struct iommu_fault_event {
> +	struct list_head list;
>  	enum iommu_fault_type type;
>  	enum iommu_fault_reason reason;
>  	u64 addr;
> @@ -340,10 +342,13 @@ struct iommu_fault_event {
>   * struct iommu_fault_param - per-device IOMMU fault data
>   * @dev_fault_handler: Callback function to handle IOMMU faults at device level
>   * @data: handler private data
> - *
> + * @faults: holds the pending faults which needs response, e.g. page response.
s/needs/need

Thanks

Eric
> + * @lock: protect pending PRQ event list
>   */
>  struct iommu_fault_param {
>  	iommu_dev_fault_handler_t handler;
> +	struct list_head faults;
> +	struct mutex lock;
>  	void *data;
>  };
>  
> @@ -357,6 +362,7 @@ struct iommu_fault_param {
>   *	struct iommu_fwspec	*iommu_fwspec;
>   */
>  struct iommu_param {
> +	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
>  };
>  
> @@ -456,6 +462,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
>  					 struct notifier_block *nb);
>  extern int iommu_group_unregister_notifier(struct iommu_group *group,
>  					   struct notifier_block *nb);
> +extern int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data);
> +
> +extern int iommu_unregister_device_fault_handler(struct device *dev);
> +
> +extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
> +
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -727,6 +741,23 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
>  	return 0;
>  }
>  
> +static inline int iommu_register_device_fault_handler(struct device *dev,
> +						iommu_dev_fault_handler_t handler,
> +						void *data)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;
> 

* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-09-14 13:24   ` Auger Eric
@ 2018-09-17 16:57     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-09-17 16:57 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Raj Ashok,
	Rafael Wysocki, Jean Delvare, jacob.jun.pan

On Fri, 14 Sep 2018 15:24:41 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 5/11/18 10:54 PM, Jacob Pan wrote:
> > Traditionally, device specific faults are detected and handled
> > within their own device drivers. When IOMMU is enabled, faults such
> > as DMA related transactions are detected by IOMMU. There is no
> > generic reporting mechanism to report faults back to the in-kernel
> > device driver or the guest OS in case of assigned devices.
> > 
> > Faults detected by IOMMU is based on the transaction's source ID
> > which can be reported at per device basis, regardless of the device
> > type is a PCI device or not.
> > 
> > The fault types include recoverable (e.g. page request) and
> > unrecoverable faults(e.g. access error). In most cases, faults can
> > be handled by IOMMU drivers internally. The primary use cases are as
> > follows:
> > 1. page request fault originated from an SVM capable device that is
> > assigned to guest via vIOMMU. In this case, the first level page
> > tables are owned by the guest. Page request must be propagated to
> > the guest to let guest OS fault in the pages then send page
> > response. In this mechanism, the direct receiver of IOMMU fault
> > notification is VFIO, which can relay notification events to QEMU
> > or other user space software.
> > 
> > 2. faults need more subtle handling by device drivers. Other than
> > simply invoke reset function, there are needs to let device driver
> > handle the fault with a smaller impact.
> > 
> > This patchset is intended to create a generic fault report API such
> > that it can scale as follows:
> > - all IOMMU types
> > - PCI and non-PCI devices
> > - recoverable and unrecoverable faults
> > - VFIO and other in-kernel users
> > - DMA & IRQ remapping (TBD)
> > The original idea was brought up by David Woodhouse and discussions
> > summarized at https://lwn.net/Articles/608914/.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > ---
> >  drivers/iommu/iommu.c | 149
> > +++++++++++++++++++++++++++++++++++++++++++++++++-
> > include/linux/iommu.h |  35 +++++++++++- 2 files changed, 181
> > insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 3a49b96..b3f9daf 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -609,6 +609,13 @@ int iommu_group_add_device(struct iommu_group
> > *group, struct device *dev) goto err_free_name;
> >  	}
> >  
> > +	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param),
> > GFP_KERNEL);
> > +	if (!dev->iommu_param) {
> > +		ret = -ENOMEM;
> > +		goto err_free_name;
> > +	}
> > +	mutex_init(&dev->iommu_param->lock);
> > +
> >  	kobject_get(group->devices_kobj);
> >  
> >  	dev->iommu_group = group;
> > @@ -639,6 +646,7 @@ int iommu_group_add_device(struct iommu_group
> > *group, struct device *dev) mutex_unlock(&group->mutex);
> >  	dev->iommu_group = NULL;
> >  	kobject_put(group->devices_kobj);
> > +	kfree(dev->iommu_param);
> >  err_free_name:
> >  	kfree(device->name);
> >  err_remove_link:
> > @@ -685,7 +693,7 @@ void iommu_group_remove_device(struct device
> > *dev) sysfs_remove_link(&dev->kobj, "iommu_group");
> >  
> >  	trace_remove_device_from_group(group->id, dev);
> > -
> > +	kfree(dev->iommu_param);
> >  	kfree(device->name);
> >  	kfree(device);
> >  	dev->iommu_group = NULL;
> > @@ -820,6 +828,145 @@ int iommu_group_unregister_notifier(struct
> > iommu_group *group,
> > EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); 
> >  /**
> > + * iommu_register_device_fault_handler() - Register a device fault
> > handler
> > + * @dev: the device
> > + * @handler: the fault handler
> > + * @data: private data passed as argument to the handler
> > + *
> > + * When an IOMMU fault event is received, call this handler with
> > the fault event
> > + * and data as argument. The handler should return 0 on success.
> > If the fault is
> > + * recoverable (IOMMU_FAULT_PAGE_REQ), the handler can also
> > complete
> > + * the fault by calling iommu_page_response() with one of the
> > following
> > + * response code:
> > + * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
> > + * - IOMMU_PAGE_RESP_INVALID: terminate the fault
> > + * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop
> > reporting
> > + *   page faults if possible.
> > + *
> > + * Return 0 if the fault handler was installed successfully, or an
> > error.
> > + */
> > +int iommu_register_device_fault_handler(struct device *dev,
> > +					iommu_dev_fault_handler_t
> > handler,
> > +					void *data)
> > +{
> > +	struct iommu_param *param = dev->iommu_param;
> > +	int ret = 0;
> > +
> > +	/*
> > +	 * Device iommu_param should have been allocated when
> > device is
> > +	 * added to its iommu_group.
> > +	 */
> > +	if (!param)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&param->lock);
> > +	/* Only allow one fault handler registered for each device
> > */
> > +	if (param->fault_param) {
> > +		ret = -EBUSY;
> > +		goto done_unlock;
> > +	}
> > +
> > +	get_device(dev);
> > +	param->fault_param =
> > +		kzalloc(sizeof(struct iommu_fault_param),
> > GFP_KERNEL);
> > +	if (!param->fault_param) {
> > +		put_device(dev);
> > +		ret = -ENOMEM;
> > +		goto done_unlock;
> > +	}
> > +	mutex_init(&param->fault_param->lock);
> > +	param->fault_param->handler = handler;
> > +	param->fault_param->data = data;
> > +	INIT_LIST_HEAD(&param->fault_param->faults);
> > +
> > +done_unlock:
> > +	mutex_unlock(&param->lock);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> > +
> > +/**
> > + * iommu_unregister_device_fault_handler() - Unregister the device
> > fault handler
> > + * @dev: the device
> > + *
> > + * Remove the device fault handler installed with
> > + * iommu_register_device_fault_handler().
> > + *
> > + * Return 0 on success, or an error.
> > + */
> > +int iommu_unregister_device_fault_handler(struct device *dev)
> > +{
> > +	struct iommu_param *param = dev->iommu_param;
> > +	int ret = 0;
> > +
> > +	if (!param)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&param->lock);
> > +	/* we cannot unregister handler if there are pending
> > faults */
> > +	if (!list_empty(&param->fault_param->faults)) {
> > +		ret = -EBUSY;
> > +		goto unlock;
> > +	}
> > +
> > +	kfree(param->fault_param);
> > +	param->fault_param = NULL;
> > +	put_device(dev);  
> don't you need to test if (param->fault_param) is set first. Otherwise
> you may end up with an unpaired put_device()?
You are right, thanks.

I am also working on allowing multiple registrations per handler, i.e. a
device can register the same fault handler with different data. Then I
will add a refcount. The motivation is that for a PCIe device with
sub-devices partitioned at PASID granularity, fault reporting needs to
be at the PCI device + PASID level.
> 
>  [...]  
> s/needs/need
> 
taken, thanks
> Thanks
> 
> Eric
> > + * @lock: protect pending PRQ event list
> >   */
> >  struct iommu_fault_param {
> >  	iommu_dev_fault_handler_t handler;
> > +	struct list_head faults;
> > +	struct mutex lock;
> >  	void *data;
> >  };
> >  
> > @@ -357,6 +362,7 @@ struct iommu_fault_param {
> >   *	struct iommu_fwspec	*iommu_fwspec;
> >   */
> >  struct iommu_param {
> > +	struct mutex lock;
> >  	struct iommu_fault_param *fault_param;
> >  };
> >  
> > @@ -456,6 +462,14 @@ extern int
> > iommu_group_register_notifier(struct iommu_group *group, struct
> > notifier_block *nb); extern int
> > iommu_group_unregister_notifier(struct iommu_group *group, struct
> > notifier_block *nb); +extern int
> > iommu_register_device_fault_handler(struct device *dev,
> > +					iommu_dev_fault_handler_t
> > handler,
> > +					void *data);
> > +
> > +extern int iommu_unregister_device_fault_handler(struct device
> > *dev); +
> > +extern int iommu_report_device_fault(struct device *dev, struct
> > iommu_fault_event *evt); +
> >  extern int iommu_group_id(struct iommu_group *group);
> >  extern struct iommu_group *iommu_group_get_for_dev(struct device
> > *dev); extern struct iommu_domain
> > *iommu_group_default_domain(struct iommu_group *); @@ -727,6
> > +741,23 @@ static inline int iommu_group_unregister_notifier(struct
> > iommu_group *group, return 0; }
> >  
> > +static inline int iommu_register_device_fault_handler(struct
> > device *dev,
> > +
> > iommu_dev_fault_handler_t handler,
> > +						void *data)
> > +{
> > +	return -ENODEV;
> > +}
> > +
> > +static inline int iommu_unregister_device_fault_handler(struct
> > device *dev) +{
> > +	return 0;
> > +}
> > +
> > +static inline int iommu_report_device_fault(struct device *dev,
> > struct iommu_fault_event *evt) +{
> > +	return -ENODEV;
> > +}
> > +
> >  static inline int iommu_group_id(struct iommu_group *group)
> >  {
> >  	return -ENODEV;
> >   

[Jacob Pan]

* Re: [PATCH v5 10/23] iommu: introduce device fault data
  2018-05-11 20:54 ` [PATCH v5 10/23] iommu: introduce device fault data Jacob Pan
@ 2018-09-21 10:07   ` Auger Eric
  2018-09-21 17:05     ` Jacob Pan
  0 siblings, 1 reply; 78+ messages in thread
From: Auger Eric @ 2018-09-21 10:07 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson, Jean-Philippe Brucker
  Cc: Yi L, Raj Ashok, Rafael Wysocki, Liu, Jean Delvare

Hi Jacob,

On 5/11/18 10:54 PM, Jacob Pan wrote:
> Device faults detected by IOMMU can be reported outside IOMMU
> subsystem for further processing. This patch intends to provide
> a generic device fault data such that device drivers can be
> communicated with IOMMU faults without model specific knowledge.
> 
> The proposed format is the result of discussion at:
> https://lkml.org/lkml/2017/11/10/291
> Part of the code is based on Jean-Philippe Brucker's patchset
> (https://patchwork.kernel.org/patch/9989315/).
> 
> The assumption is that model specific IOMMU driver can filter and
> handle most of the internal faults if the cause is within IOMMU driver
> control. Therefore, the fault reasons can be reported are grouped
> and generalized based common specifications such as PCI ATS.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  include/linux/iommu.h | 101 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 99 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e8cadb6..aeadb4f 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -49,13 +49,17 @@ struct bus_type;
>  struct device;
>  struct iommu_domain;
>  struct notifier_block;
> +struct iommu_fault_event;
>  
>  /* iommu fault flags */
> -#define IOMMU_FAULT_READ	0x0
> -#define IOMMU_FAULT_WRITE	0x1
> +#define IOMMU_FAULT_READ		(1 << 0)
> +#define IOMMU_FAULT_WRITE		(1 << 1)
> +#define IOMMU_FAULT_EXEC		(1 << 2)
> +#define IOMMU_FAULT_PRIV		(1 << 3)
>  
>  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
>  			struct device *, unsigned long, int, void *);
> +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
>  
>  struct iommu_domain_geometry {
>  	dma_addr_t aperture_start; /* First address that can be mapped    */
> @@ -264,6 +268,98 @@ struct iommu_device {
>  	struct device *dev;
>  };
>  
> +/*  Generic fault types, can be expanded IRQ remapping fault */
> +enum iommu_fault_type {
> +	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
> +	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
> +};

While doing the exercise of mapping the SMMUv3 events to this, I failed
to map some event types to iommu_fault_reason enum values.
> +
> +enum iommu_fault_reason {
> +	IOMMU_FAULT_REASON_UNKNOWN = 0,
> +
> +	/* IOMMU internal error, no specific reason to report out */
> +	IOMMU_FAULT_REASON_INTERNAL,
> +
> +	/* Could not access the PASID table */
> +	IOMMU_FAULT_REASON_PASID_FETCH,

Would it be possible to add
 /* could not access the device context (fetch caused external abort) */
IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH,

> +
> +	/*
> +	 * PASID is out of range (e.g. exceeds the maximum PASID
> +	 * supported by the IOMMU) or disabled.
> +	 */
> +	IOMMU_FAULT_REASON_PASID_INVALID,
Would it be possible to add
/* source id is out of range */
IOMMU_FAULT_REASON_SOURCEID_INVALID,

or similar.
On ARM the source ID matches the StreamID and the PASID matches the SubstreamID.

It would be useful to have:
/* pasid entry is invalid or has configuration errors */
IOMMU_FAULT_REASON_BAD_PASID_ENTRY,

/* device context entry is invalid or has configuration errors */
IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY,

This typically allows returning information to the guest about fields in
the device context entry or PASID entry that are incorrect, not matching the
physical IOMMU capability.
> +
> +	/* Could not access the page directory (Invalid PASID entry) */
> +	IOMMU_FAULT_REASON_PGD_FETCH,
I was unsure about this one. On my side I needed something more general
such as:
/*
* An external abort occurred fetching (or updating) a translation
* table descriptor
*/
IOMMU_FAULT_REASON_WALK_EABT,

> +
> +	/* Could not access the page table entry (Bad address) */
> +	IOMMU_FAULT_REASON_PTE_FETCH,
I interpreted this one as the actual translation failure, but that's not
obvious to me either. Is it a fetch abort, or is it that the PTE is
marked invalid? Maybe if we have the former we can just have a
translation fault reason instead.
> +
> +	/* Protection flag check failed */
> +	IOMMU_FAULT_REASON_PERMISSION,
On ARM we also have:

/* access flag check failed */
IOMMU_FAULT_REASON_ACCESS,

and

/* Output address of a translation stage caused Address Size fault */
 IOMMU_FAULT_REASON_OOR_ADDRESS

I am aware that all those suggestions do not match the original goal of your
series, which mostly targets SVA support. However, with a view to making
those APIs as generic as possible, it may be useful to take those
requirements into account as well.
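
For reference, the suggestions above can be collected into one sketch of the
extended enum. This is a standalone illustration, not a patch: appending the
new values at the end is an assumption, and IOMMU_FAULT_REASON_COUNT and
iommu_fault_reason_str() are hypothetical helpers that do not appear in the
series.

```c
#include <assert.h>
#include <string.h>

enum iommu_fault_reason {
	/* Values from the patch under review. */
	IOMMU_FAULT_REASON_UNKNOWN = 0,
	IOMMU_FAULT_REASON_INTERNAL,
	IOMMU_FAULT_REASON_PASID_FETCH,
	IOMMU_FAULT_REASON_PASID_INVALID,
	IOMMU_FAULT_REASON_PGD_FETCH,
	IOMMU_FAULT_REASON_PTE_FETCH,
	IOMMU_FAULT_REASON_PERMISSION,
	/* Suggested in this reply; placement at the end is an assumption. */
	IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH,
	IOMMU_FAULT_REASON_SOURCEID_INVALID,
	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
	IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY,
	IOMMU_FAULT_REASON_WALK_EABT,
	IOMMU_FAULT_REASON_ACCESS,
	IOMMU_FAULT_REASON_OOR_ADDRESS,
	IOMMU_FAULT_REASON_COUNT	/* hypothetical sentinel */
};

/* Short descriptions, paraphrasing the comments in the thread. */
static const char *const reason_desc[IOMMU_FAULT_REASON_COUNT] = {
	[IOMMU_FAULT_REASON_UNKNOWN] = "unknown",
	[IOMMU_FAULT_REASON_INTERNAL] = "IOMMU internal error",
	[IOMMU_FAULT_REASON_PASID_FETCH] = "could not access the PASID table",
	[IOMMU_FAULT_REASON_PASID_INVALID] = "PASID out of range or disabled",
	[IOMMU_FAULT_REASON_PGD_FETCH] = "could not access the page directory",
	[IOMMU_FAULT_REASON_PTE_FETCH] = "could not access the page table entry",
	[IOMMU_FAULT_REASON_PERMISSION] = "protection flag check failed",
	[IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH] =
		"could not access the device context",
	[IOMMU_FAULT_REASON_SOURCEID_INVALID] = "source id out of range",
	[IOMMU_FAULT_REASON_BAD_PASID_ENTRY] =
		"pasid entry invalid or misconfigured",
	[IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY] =
		"device context entry invalid or misconfigured",
	[IOMMU_FAULT_REASON_WALK_EABT] =
		"external abort on translation table walk",
	[IOMMU_FAULT_REASON_ACCESS] = "access flag check failed",
	[IOMMU_FAULT_REASON_OOR_ADDRESS] = "output address size fault",
};

/* Hypothetical helper: map a reason to its description. */
static const char *iommu_fault_reason_str(enum iommu_fault_reason reason)
{
	if ((unsigned int)reason >= IOMMU_FAULT_REASON_COUNT)
		return "unknown";
	return reason_desc[reason];
}
```

Keeping the existing values first leaves their numeric values unchanged,
which is why the new reasons are appended rather than interleaved here.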

Hope it does not bring extra noise to the topic ;-)

Thanks

Eric




> +};
> +
> +/**
> + * struct iommu_fault_event - Generic per device fault data
> + *
> + * - PCI and non-PCI devices
> + * - Recoverable faults (e.g. page request), information based on PCI ATS
> + * and PASID spec.
> + * - Un-recoverable faults of device interest
> + * - DMA remapping and IRQ remapping faults
> +
> + * @type contains fault type.
> + * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
> + *         faults are not reported
> + * @addr: tells the offending page address
> + * @pasid: contains process address space ID, used in shared virtual memory(SVM)
> + * @page_req_group_id: page request group index
> + * @last_req: last request in a page request group
> + * @pasid_valid: indicates if the PRQ has a valid PASID
> + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
> + * @device_private: if present, uniquely identify device-specific
> + *                  private data for an individual page request.
> + * @iommu_private: used by the IOMMU driver for storing fault-specific
> + *                 data. Users should not modify this field before
> + *                 sending the fault response.
> + */
> +struct iommu_fault_event {
> +	enum iommu_fault_type type;
> +	enum iommu_fault_reason reason;
> +	u64 addr;
> +	u32 pasid;
> +	u32 page_req_group_id;
> +	u32 last_req : 1;
> +	u32 pasid_valid : 1;
> +	u32 prot;
> +	u64 device_private;
> +	u64 iommu_private;
> +};
> +
> +/**
> + * struct iommu_fault_param - per-device IOMMU fault data
> + * @dev_fault_handler: Callback function to handle IOMMU faults at device level
> + * @data: handler private data
> + *
> + */
> +struct iommu_fault_param {
> +	iommu_dev_fault_handler_t handler;
> +	void *data;
> +};
> +
> +/**
> + * struct iommu_param - collection of per-device IOMMU data
> + *
> + * @fault_param: IOMMU detected device fault reporting data
> + *
> + * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
> + *	struct iommu_group	*iommu_group;
> + *	struct iommu_fwspec	*iommu_fwspec;
> + */
> +struct iommu_param {
> +	struct iommu_fault_param *fault_param;
> +};
> +
>  int  iommu_device_register(struct iommu_device *iommu);
>  void iommu_device_unregister(struct iommu_device *iommu);
>  int  iommu_device_sysfs_add(struct iommu_device *iommu,
> @@ -437,6 +533,7 @@ struct iommu_ops {};
>  struct iommu_group {};
>  struct iommu_fwspec {};
>  struct iommu_device {};
> +struct iommu_fault_param {};
>  
>  static inline bool iommu_present(struct bus_type *bus)
>  {
> 

* Re: [PATCH v5 10/23] iommu: introduce device fault data
  2018-09-21 10:07   ` Auger Eric
@ 2018-09-21 17:05     ` Jacob Pan
  0 siblings, 0 replies; 78+ messages in thread
From: Jacob Pan @ 2018-09-21 17:05 UTC (permalink / raw)
  To: Auger Eric
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Alex Williamson, Jean-Philippe Brucker, Yi L, Raj Ashok,
	Rafael Wysocki, Liu, Jean Delvare, jacob.jun.pan

On Fri, 21 Sep 2018 12:07:09 +0200
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Jacob,
> 
> On 5/11/18 10:54 PM, Jacob Pan wrote:
> > Device faults detected by IOMMU can be reported outside IOMMU
> > subsystem for further processing. This patch intends to provide
> > a generic device fault data such that device drivers can be
> > communicated with IOMMU faults without model specific knowledge.
> > 
> > The proposed format is the result of discussion at:
> > https://lkml.org/lkml/2017/11/10/291
> > Part of the code is based on Jean-Philippe Brucker's patchset
> > (https://patchwork.kernel.org/patch/9989315/).
> > 
> > The assumption is that model specific IOMMU driver can filter and
> > handle most of the internal faults if the cause is within IOMMU
> > driver control. Therefore, the fault reasons can be reported are
> > grouped and generalized based common specifications such as PCI ATS.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  include/linux/iommu.h | 101
> > +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed,
> > 99 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index e8cadb6..aeadb4f 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -49,13 +49,17 @@ struct bus_type;
> >  struct device;
> >  struct iommu_domain;
> >  struct notifier_block;
> > +struct iommu_fault_event;
> >  
> >  /* iommu fault flags */
> > -#define IOMMU_FAULT_READ	0x0
> > -#define IOMMU_FAULT_WRITE	0x1
> > +#define IOMMU_FAULT_READ		(1 << 0)
> > +#define IOMMU_FAULT_WRITE		(1 << 1)
> > +#define IOMMU_FAULT_EXEC		(1 << 2)
> > +#define IOMMU_FAULT_PRIV		(1 << 3)
> >  
> >  typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
> >  			struct device *, unsigned long, int, void *);
> > +typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
> >  
> >  struct iommu_domain_geometry {
> >  	dma_addr_t aperture_start; /* First address that can be mapped    */
> > @@ -264,6 +268,98 @@ struct iommu_device {
> >  	struct device *dev;
> >  };
> >  
> > +/*  Generic fault types, can be expanded IRQ remapping fault */
> > +enum iommu_fault_type {
> > +	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
> > +	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
> > +};
> 
> While doing the exercise of mapping the SMMUv3 events to this, I
> failed to map some event types to iommu_fault_reason enum values.
I am not surprised :), this list is intended to grow as we add support
for more IOMMU models. I was thinking of these guidelines when adding to
this list:
- model agnostic
- needs to be reported outside iommu subsystem
- per device identifiable

> > +
> > +enum iommu_fault_reason {
> > +	IOMMU_FAULT_REASON_UNKNOWN = 0,
> > +
> > +	/* IOMMU internal error, no specific reason to report out */
> > +	IOMMU_FAULT_REASON_INTERNAL,
> > +
> > +	/* Could not access the PASID table */
> > +	IOMMU_FAULT_REASON_PASID_FETCH,  
> 
> Would it be possible to add
>  /* could not access the device context (fetch caused external abort) */
>  IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH,
> 
sounds reasonable.
> > +
> > +	/*
> > +	 * PASID is out of range (e.g. exceeds the maximum PASID
> > +	 * supported by the IOMMU) or disabled.
> > +	 */
> > +	IOMMU_FAULT_REASON_PASID_INVALID,  
> Would it be possible to add
> /* source id is out of range */
> IOMMU_FAULT_REASON_SOURCEID_INVALID,
> 
Hmm, the fault here should be per device. I guess source ID is the
equivalent of the PCI device requester ID. If the source ID is invalid,
how could it be reported to the right device in the vIOMMU? Should it be
handled by the host IOMMU itself?
> or alike
> on ARM the sourceid matches the streamid and pasid matches the
> substreamid.
> 
> It would be useful to have:
> /* pasid entry is invalid or has configuration errors */
> IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
> 
> /* device context entry is invalid or has configuration errors */
> IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY,
> 
> This typically allows to return information to the guest about fields
> in device context entry or pasid entry that are incorrect, not
> matching the physical IOMMU capability
Sounds good.
> > +
> > +	/* Could not access the page directory (Invalid PASID entry) */
> > +	IOMMU_FAULT_REASON_PGD_FETCH,  
> I was unsure about this one. On my side I needed something more
> general such as:
> /*
> * An external abort occurred fetching (or updating) a translation
> * table descriptor
> */
> IOMMU_FAULT_REASON_WALK_EABT,
> 
> > +
> > +	/* Could not access the page table entry (Bad address) */
> > +	IOMMU_FAULT_REASON_PTE_FETCH,  
> I interpreted this one as the actual translation failure but that's
> not obvious to me either. Is it a fetch abort or is it that the PTE is
> marked invalid. Maybe if we have the former we can just have a
> translation fault reason instead.
How about these two?
IOMMU_FAULT_REASON_TRANSL_TBL
IOMMU_FAULT_REASON_TRANSL

> > +
> > +	/* Protection flag check failed */
> > +	IOMMU_FAULT_REASON_PERMISSION,  
> On ARM we also have:
> 
> /* access flag check failed */
> IOMMU_FAULT_REASON_ACCESS,
> 
> and
> 
> /* Output address of a translation stage caused Address Size fault */
>  IOMMU_FAULT_REASON_OOR_ADDRESS
> 
Is that for nested translation, where the stage 1 result is invalid for
stage 2? I am thinking that any misconfiguration should be handled
locally by the host IOMMU driver.
> I am aware all those suggestions do not match the original goal of
> your series, mostly targeted at SVA support. However in the prospect
> to make those APIs as generic as possible it may be useful to take
> those requirements as well.
> 
> Hope it does not bring extra noise to the topic ;-)
> 
Not at all. I appreciate the effort to make it generally useful.
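
For reference, a rough sketch of how the reason list could look with the
three additions that got a positive reply above
(DEVICE_CONTEXT_FETCH, BAD_PASID_ENTRY, BAD_DEVICE_CONTEXT_ENTRY). The
ordering, the numeric values and the fault_reason_name() helper are
assumptions made for illustration only; they are not part of the patch,
and the entries still under discussion (SOURCEID_INVALID, WALK_EABT,
TRANSL_TBL/TRANSL, ACCESS, OOR_ADDRESS) are left out.

```c
/*
 * Sketch only: reasons from the posted patch plus the three suggested
 * additions that were acknowledged in this thread. Ordering and values
 * are assumptions.
 */
enum iommu_fault_reason {
	IOMMU_FAULT_REASON_UNKNOWN = 0,
	IOMMU_FAULT_REASON_INTERNAL,
	IOMMU_FAULT_REASON_PASID_FETCH,
	IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH,	/* suggested addition */
	IOMMU_FAULT_REASON_PASID_INVALID,
	IOMMU_FAULT_REASON_BAD_PASID_ENTRY,		/* suggested addition */
	IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY,	/* suggested addition */
	IOMMU_FAULT_REASON_PGD_FETCH,
	IOMMU_FAULT_REASON_PTE_FETCH,
	IOMMU_FAULT_REASON_PERMISSION,
};

/* Hypothetical helper for log messages; not part of the patch. */
static const char *fault_reason_name(enum iommu_fault_reason reason)
{
	switch (reason) {
	case IOMMU_FAULT_REASON_UNKNOWN:
		return "unknown";
	case IOMMU_FAULT_REASON_INTERNAL:
		return "internal";
	case IOMMU_FAULT_REASON_PASID_FETCH:
		return "pasid fetch";
	case IOMMU_FAULT_REASON_DEVICE_CONTEXT_FETCH:
		return "device context fetch";
	case IOMMU_FAULT_REASON_PASID_INVALID:
		return "pasid invalid";
	case IOMMU_FAULT_REASON_BAD_PASID_ENTRY:
		return "bad pasid entry";
	case IOMMU_FAULT_REASON_BAD_DEVICE_CONTEXT_ENTRY:
		return "bad device context entry";
	case IOMMU_FAULT_REASON_PGD_FETCH:
		return "pgd fetch";
	case IOMMU_FAULT_REASON_PTE_FETCH:
		return "pte fetch";
	case IOMMU_FAULT_REASON_PERMISSION:
		return "permission";
	}
	/* Values outside the enum fall back to the generic name. */
	return "unknown";
}
```

Keeping IOMMU_FAULT_REASON_UNKNOWN at 0 matches the posted patch, so a
zero-initialized event still reads as "no specific reason".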

> Thanks
> 
> Eric
> 
> 
> 
> 
> > +};
> > +
> > +/**
> > + * struct iommu_fault_event - Generic per device fault data
> > + *
> > + * - PCI and non-PCI devices
> > + * - Recoverable faults (e.g. page request), information based on PCI ATS
> > + * and PASID spec.
> > + * - Un-recoverable faults of device interest
> > + * - DMA remapping and IRQ remapping faults
> > +
> > + * @type contains fault type.
> > + * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
> > + *         faults are not reported
> > + * @addr: tells the offending page address
> > + * @pasid: contains process address space ID, used in shared virtual memory(SVM)
> > + * @page_req_group_id: page request group index
> > + * @last_req: last request in a page request group
> > + * @pasid_valid: indicates if the PRQ has a valid PASID
> > + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
> > + * @device_private: if present, uniquely identify device-specific
> > + *                  private data for an individual page request.
> > + * @iommu_private: used by the IOMMU driver for storing fault-specific
> > + *                 data. Users should not modify this field before
> > + *                 sending the fault response.
> > + */
> > +struct iommu_fault_event {
> > +	enum iommu_fault_type type;
> > +	enum iommu_fault_reason reason;
> > +	u64 addr;
> > +	u32 pasid;
> > +	u32 page_req_group_id;
> > +	u32 last_req : 1;
> > +	u32 pasid_valid : 1;
> > +	u32 prot;
> > +	u64 device_private;
> > +	u64 iommu_private;
> > +};
> > +
> > +/**
> > + * struct iommu_fault_param - per-device IOMMU fault data
> > + * @dev_fault_handler: Callback function to handle IOMMU faults at device level
> > + * @data: handler private data
> > + *
> > + */
> > +struct iommu_fault_param {
> > +	iommu_dev_fault_handler_t handler;
> > +	void *data;
> > +};
> > +
> > +/**
> > + * struct iommu_param - collection of per-device IOMMU data
> > + *
> > + * @fault_param: IOMMU detected device fault reporting data
> > + *
> > + * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
> > + *	struct iommu_group	*iommu_group;
> > + *	struct iommu_fwspec	*iommu_fwspec;
> > + */
> > +struct iommu_param {
> > +	struct iommu_fault_param *fault_param;
> > +};
> > +
> >  int  iommu_device_register(struct iommu_device *iommu);
> >  void iommu_device_unregister(struct iommu_device *iommu);
> >  int  iommu_device_sysfs_add(struct iommu_device *iommu,
> > @@ -437,6 +533,7 @@ struct iommu_ops {};
> >  struct iommu_group {};
> >  struct iommu_fwspec {};
> >  struct iommu_device {};
> > +struct iommu_fault_param {};
> >  
> >  static inline bool iommu_present(struct bus_type *bus)
> >  {
> >   

[Jacob Pan]


* Re: [PATCH v5 13/23] iommu: introduce device fault report API
  2018-05-11 20:54 ` [PATCH v5 13/23] iommu: introduce device fault report API Jacob Pan
                     ` (3 preceding siblings ...)
  2018-09-14 13:24   ` Auger Eric
@ 2018-09-25 14:58   ` Jean-Philippe Brucker
  4 siblings, 0 replies; 78+ messages in thread
From: Jean-Philippe Brucker @ 2018-09-25 14:58 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Alex Williamson
  Cc: Rafael Wysocki, Liu, Yi L, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Lu Baolu

Hi Jacob,

Just two minor things below that I noticed while using fault handlers
for SVA. From my perspective the series is fine otherwise.

On 11/05/2018 21:54, Jacob Pan wrote:
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +       struct iommu_param *param = dev->iommu_param;
> +       int ret = 0;
> +
> +       if (!param)
> +               return -EINVAL;
> +
> +       mutex_lock(&param->lock);

Could we check that param->fault_param isn't NULL here, so that the
driver can call this function unconditionally in a cleanup path?
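
To illustrate what I mean, here is a small userspace sketch of the
control flow, not the kernel code itself. The *_model structs and the
pending_faults counter are hypothetical stand-ins for struct iommu_param,
struct iommu_fault_param and the faults list; locking and put_device()
are left out. Returning 0 when no handler is registered is one possible
choice and an assumption on my part.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct iommu_fault_param. */
struct fault_param_model {
	int pending_faults;	/* stands in for the list of pending faults */
};

/* Hypothetical stand-in for struct iommu_param. */
struct iommu_param_model {
	struct fault_param_model *fault_param;
};

/*
 * Models the unregister path with the extra NULL check, so that a
 * cleanup path can call it whether or not a handler was registered.
 */
static int unregister_fault_handler_model(struct iommu_param_model *param)
{
	if (!param)
		return -EINVAL;

	/* No handler registered: nothing to undo (assumed to be success). */
	if (!param->fault_param)
		return 0;

	/* As in the patch: refuse while faults are still pending. */
	if (param->fault_param->pending_faults > 0)
		return -EBUSY;

	free(param->fault_param);
	param->fault_param = NULL;
	return 0;
}
```

With the check in place, calling the function twice in a row is harmless:
the second call sees fault_param == NULL and returns early instead of
dereferencing it.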

> +       /* we cannot unregister handler if there are pending faults */
> +       if (!list_empty(&param->fault_param->faults)) {
> +               ret = -EBUSY;
> +               goto unlock;
> +       }
> +
> +       kfree(param->fault_param);
> +       param->fault_param = NULL;
> +       put_device(dev);
> +unlock:
> +       mutex_unlock(&param->lock);
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> +
> +
> +/**
> + * iommu_report_device_fault() - Report fault event to device
> + * @dev: the device
> + * @evt: fault event data
> + *
> + * Called by IOMMU model specific drivers when fault is detected, typically
> + * in a threaded IRQ handler.
> + *
> + * Return 0 on success, or an error.
> + */
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +       int ret = 0;
> +       struct iommu_fault_event *evt_pending;
> +       struct iommu_fault_param *fparam;
> +
> +       /* iommu_param is allocated when device is added to group */
> +       if (!dev->iommu_param | !evt)

Should probably be ||
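
For this particular test the result is the same either way, since both
operands of '|' are already 0 or 1 after the '!'. The difference is that
'|' always evaluates both sides, while '||' stops as soon as the left
side is true, which is the usual idiom for NULL checks. A small userspace
sketch of that difference follows; the helper names and the call counter
are made up for the illustration.

```c
#include <stddef.h>

static int calls;	/* counts evaluations of the right-hand operand */

/* Returns 1 if p is NULL and records that it was evaluated. */
static int rhs_is_null(const void *p)
{
	calls++;
	return p == NULL;
}

/* Bitwise OR: both operands are always evaluated. */
static int check_bitwise(const void *a, const void *b)
{
	return !a | rhs_is_null(b);
}

/* Logical OR: the right operand is skipped when the left one is true. */
static int check_logical(const void *a, const void *b)
{
	return !a || rhs_is_null(b);
}
```

Both helpers return the same value for every input; only the number of
times rhs_is_null() runs differs when the first pointer is NULL.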

Thanks,
Jean

Thread overview: 78+ messages
2018-05-11 20:53 [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
2018-05-11 20:53 ` [PATCH v5 01/23] iommu: introduce bind_pasid_table API function Jacob Pan
2018-08-23 16:34   ` Auger Eric
2018-08-24 12:47     ` Liu, Yi L
2018-08-24 13:20       ` Auger Eric
2018-08-28 17:04         ` Jacob Pan
2018-08-24 15:00   ` Auger Eric
2018-08-28  5:14     ` Jacob Pan
2018-08-28  8:34       ` Auger Eric
2018-08-28 16:36         ` Jacob Pan
2018-05-11 20:53 ` [PATCH v5 02/23] iommu/vt-d: move device_domain_info to header Jacob Pan
2018-05-11 20:53 ` [PATCH v5 03/23] iommu/vt-d: add a flag for pasid table bound status Jacob Pan
2018-05-13  7:33   ` Lu Baolu
2018-05-14 18:51     ` Jacob Pan
2018-05-13  8:01   ` Lu Baolu
2018-05-14 18:52     ` Jacob Pan
2018-05-11 20:53 ` [PATCH v5 04/23] iommu/vt-d: add bind_pasid_table function Jacob Pan
2018-05-13  9:29   ` Lu Baolu
2018-05-14 20:22     ` Jacob Pan
2018-05-11 20:53 ` [PATCH v5 05/23] iommu: introduce iommu invalidate API function Jacob Pan
2018-05-11 20:53 ` [PATCH v5 06/23] iommu/vt-d: add definitions for PFSID Jacob Pan
2018-05-14  1:36   ` Lu Baolu
2018-05-14 20:30     ` Jacob Pan
2018-05-11 20:53 ` [PATCH v5 07/23] iommu/vt-d: fix dev iotlb pfsid use Jacob Pan
2018-05-14  1:52   ` Lu Baolu
2018-05-14 20:38     ` Jacob Pan
2018-05-11 20:54 ` [PATCH v5 08/23] iommu/vt-d: support flushing more translation cache types Jacob Pan
2018-05-14  2:18   ` Lu Baolu
2018-05-14 20:46     ` Jacob Pan
2018-05-17  8:44   ` kbuild test robot
2018-05-11 20:54 ` [PATCH v5 09/23] iommu/vt-d: add svm/sva invalidate function Jacob Pan
2018-05-14  3:35   ` Lu Baolu
2018-05-14 20:49     ` Jacob Pan
2018-05-11 20:54 ` [PATCH v5 10/23] iommu: introduce device fault data Jacob Pan
2018-09-21 10:07   ` Auger Eric
2018-09-21 17:05     ` Jacob Pan
2018-05-11 20:54 ` [PATCH v5 11/23] driver core: add per device iommu param Jacob Pan
2018-05-14  5:27   ` Lu Baolu
2018-05-14 20:52     ` Jacob Pan
2018-05-11 20:54 ` [PATCH v5 12/23] iommu: add a timeout parameter for prq response Jacob Pan
2018-05-11 20:54 ` [PATCH v5 13/23] iommu: introduce device fault report API Jacob Pan
2018-05-14  6:01   ` Lu Baolu
2018-05-14 20:55     ` Jacob Pan
2018-05-15  6:52       ` Lu Baolu
2018-05-17 11:41   ` Liu, Yi L
2018-05-17 15:59     ` Jacob Pan
2018-05-17 23:22       ` Liu, Yi L
2018-05-21 23:03         ` Jacob Pan
2018-09-06  9:25   ` Auger Eric
2018-09-06 12:42     ` Jean-Philippe Brucker
2018-09-06 13:14       ` Auger Eric
2018-09-06 17:06         ` Jean-Philippe Brucker
2018-09-07  7:11           ` Auger Eric
2018-09-07 11:23             ` Jean-Philippe Brucker
2018-09-14 13:24   ` Auger Eric
2018-09-17 16:57     ` Jacob Pan
2018-09-25 14:58   ` Jean-Philippe Brucker
2018-05-11 20:54 ` [PATCH v5 14/23] iommu: introduce page response function Jacob Pan
2018-05-14  6:39   ` Lu Baolu
2018-05-29 16:13     ` Jacob Pan
2018-09-10 14:52   ` Auger Eric
2018-09-10 17:50     ` Jacob Pan
2018-09-10 19:06       ` Auger Eric
2018-05-11 20:54 ` [PATCH v5 15/23] iommu: handle page response timeout Jacob Pan
2018-05-14  7:43   ` Lu Baolu
2018-05-29 16:20     ` Jacob Pan
2018-05-30  7:46       ` Lu Baolu
2018-05-11 20:54 ` [PATCH v5 16/23] iommu/config: add build dependency for dmar Jacob Pan
2018-05-11 20:54 ` [PATCH v5 17/23] iommu/vt-d: report non-recoverable faults to device Jacob Pan
2018-05-14  8:17   ` Lu Baolu
2018-05-29 17:33     ` Jacob Pan
2018-05-11 20:54 ` [PATCH v5 18/23] iommu/intel-svm: report device page request Jacob Pan
2018-05-11 20:54 ` [PATCH v5 19/23] iommu/intel-svm: replace dev ops with fault report API Jacob Pan
2018-05-11 20:54 ` [PATCH v5 20/23] iommu/intel-svm: do not flush iotlb for viommu Jacob Pan
2018-05-11 20:54 ` [PATCH v5 21/23] iommu/vt-d: add intel iommu page response function Jacob Pan
2018-05-11 20:54 ` [PATCH v5 22/23] trace/iommu: add sva trace events Jacob Pan
2018-05-11 20:54 ` [PATCH v5 23/23] iommu: use sva invalidate and device fault trace event Jacob Pan
2018-05-29 15:54 ` [PATCH v5 00/23] IOMMU and VT-d driver support for Shared Virtual Address (SVA) Jacob Pan
