* [PATCH v3 00/16] IOMMU driver support for SVM virtualization
From: Jacob Pan @ 2017-11-17 18:54 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

Hi All,

Shared virtual memory (SVM), or more precisely shared virtual address (SVA),
between device DMA and applications can reduce programming complexity
and enhance security. To enable SVM in the guest, i.e. to share a guest
application's address space with a physical device's DMA address space, the
IOMMU driver must provide some new functionality.

This patchset is a follow-up on the discussions held at LPC 2017
VFIO/IOMMU/PCI track. Slides and notes can be found here:
https://linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/636

The complete guest SVM support also involves changes in QEMU and VFIO,
which have been posted earlier:
https://www.spinics.net/lists/kvm/msg148798.html

This is the IOMMU portion, a follow-up to the more complete series of
kernel changes supporting vSVM. Please refer to the link below for more
details: https://www.spinics.net/lists/kvm/msg148819.html

Generic APIs are introduced in addition to the Intel VT-d specific changes;
the goal is to have common interfaces across IOMMU and device types for
both VFIO and other in-kernel users.

At the top level, the new IOMMU interfaces are (a usage sketch follows the list):
 - bind guest PASID table
 - passdown invalidations of translation caches
 - IOMMU device fault reporting including page request/response and
   non-recoverable faults.
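
For orientation, here is a minimal sketch of how an in-kernel caller such as
VFIO might drive the bind and invalidate interfaces. The wrapper names and
calling context are hypothetical; only iommu_bind_pasid_table() and
iommu_sva_invalidate() come from this series.

	#include <linux/iommu.h>

	/* Hypothetical host-side glue, e.g. in the VFIO layer */
	static int vsvm_attach(struct iommu_domain *domain, struct device *dev,
			       struct pasid_table_config *cfg)
	{
		/* link the guest PASID table into the host context entry */
		return iommu_bind_pasid_table(domain, dev, cfg);
	}

	static int vsvm_invalidate(struct iommu_domain *domain,
				   struct device *dev,
				   struct tlb_invalidate_info *info)
	{
		/* pass a guest-issued invalidation down to the physical IOMMU */
		return iommu_sva_invalidate(domain, dev, info);
	}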

For IOMMU-detected device fault reporting, struct device is extended to
provide callback and tracking at the device level. The original proposal was
discussed in "Error handling for I/O memory management units"
(https://lwn.net/Articles/608914/). I have experimented with two alternative
solutions:
1. Use a shared group notifier. This does not scale well and also causes
unwanted notification traffic when a sibling device in the group reports
faults.
2. Place the fault callback in the device IOMMU arch data, e.g.
device_domain_info in the Intel/FSL IOMMU drivers. This would cause code
duplication, since per-device fault reporting is generic.

The additional patches are Intel VT-d specific; they either implement the
generic interfaces or replace existing private interfaces with them.

This patchset is based on the work and ideas from many people, especially:
Ashok Raj <ashok.raj@intel.com>
Liu, Yi L <yi.l.liu@linux.intel.com>
Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Thanks,

Jacob

V3
	- Consolidated fault reporting data format based on discussions on v2,
	  including input from ARM and AMD.
	- Renamed invalidation APIs from svm to sva based on discussions on v2
	- Use a parent pointer under struct device for all IOMMU per-device data
	- Simplified device fault callback, allowing driver private data to be
	  registered. This might make it easier to replace the domain fault handler.
V2
	- Replaced the hybrid interface data model (generic data + vendor-specific
	data) with all generic data. This has the security benefit that data
	passed from user space can be sanitized by all software layers if
	needed.
	- Addressed review comments from V1
	- Use per device fault report data
	- Support page request/response communications between host IOMMU and
	guest or other in-kernel users.
	- Added unrecoverable fault reporting to DMAR
	- Use threaded IRQ function for DMAR fault interrupt and fault
	reporting


Jacob Pan (15):
  iommu: introduce bind_pasid_table API function
  iommu/vt-d: add bind_pasid_table function
  iommu/vt-d: move device_domain_info to header
  iommu/vt-d: support flushing more TLB types
  iommu/vt-d: add svm/sva invalidate function
  iommu/vt-d: assign PFSID in device TLB invalidation
  iommu: introduce device fault data
  driver core: add iommu device fault reporting data
  iommu: introduce device fault report API
  iommu/vt-d: use threaded irq for dmar_fault
  iommu/vt-d: report unrecoverable device faults
  iommu/intel-svm: notify page request to guest
  iommu/intel-svm: replace dev ops with fault report API
  iommu: introduce page response function
  iommu/vt-d: add intel iommu page response function

Liu, Yi L (1):
  iommu: introduce iommu invalidate API function

 drivers/iommu/dmar.c          | 151 ++++++++++++++++-
 drivers/iommu/intel-iommu.c   | 365 +++++++++++++++++++++++++++++++++++++++---
 drivers/iommu/intel-svm.c     |  87 ++++++++--
 drivers/iommu/iommu.c         | 110 ++++++++++++-
 include/linux/device.h        |   3 +
 include/linux/dma_remapping.h |   1 +
 include/linux/intel-iommu.h   |  47 +++++-
 include/linux/intel-svm.h     |  20 +--
 include/linux/iommu.h         | 223 +++++++++++++++++++++++++-
 include/uapi/linux/iommu.h    | 101 ++++++++++++
 10 files changed, 1047 insertions(+), 61 deletions(-)
 create mode 100644 include/uapi/linux/iommu.h

-- 
2.7.4

* [PATCH v3 01/16] iommu: introduce bind_pasid_table API function
From: Jacob Pan @ 2017-11-17 18:54 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan, Liu, Yi L

Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
use in the guest:
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html

As part of the proposed architecture, when an SVM-capable PCI device is
assigned to a guest, nested mode is turned on. The guest owns the first-level
page tables (requests with PASID), which perform GVA->GPA translation. The
second-level page tables are owned by the host and perform GPA->HPA
translation for requests both with and without PASID.

A new IOMMU driver interface is therefore needed to perform the following
tasks:
* Enable nested translation and appropriate translation type
* Assign guest PASID table pointer (in GPA) and size to host IOMMU

This patch introduces new API functions to bind/unbind guest PASID tables.
Based on the common data, model-specific IOMMU drivers can be extended to
perform the specific steps for binding the PASID table of assigned devices.
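
As a concrete illustration (not part of the patch; the base address and
PASID width are hypothetical values a guest might report), a caller would
populate the generic binding data roughly like this:

	struct pasid_table_config pasidt_binfo = {
		.version    = PASID_TABLE_CFG_VERSION,
		.bytes      = sizeof(pasidt_binfo),
		.base_ptr   = 0x12345000,	/* guest PASID table base, a GPA */
		.pasid_bits = 20,		/* table covers 2^20 PASIDs */
	};
	int ret = iommu_bind_pasid_table(domain, dev, &pasidt_binfo);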

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c      | 19 +++++++++++++++++++
 include/linux/iommu.h      | 24 ++++++++++++++++++++++++
 include/uapi/linux/iommu.h | 39 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+)
 create mode 100644 include/uapi/linux/iommu.h

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3de5c0b..c7e0d64 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1322,6 +1322,25 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device);
 
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+			struct pasid_table_config *pasidt_binfo)
+{
+	if (unlikely(!domain->ops->bind_pasid_table))
+		return -ENODEV;
+
+	return domain->ops->bind_pasid_table(domain, dev, pasidt_binfo);
+}
+EXPORT_SYMBOL_GPL(iommu_bind_pasid_table);
+
+void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+	if (unlikely(!domain->ops->unbind_pasid_table))
+		return;
+
+	domain->ops->unbind_pasid_table(domain, dev);
+}
+EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 41b8c57..0f6f6c5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -25,6 +25,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/of.h>
+#include <uapi/linux/iommu.h>
 
 #define IOMMU_READ	(1 << 0)
 #define IOMMU_WRITE	(1 << 1)
@@ -187,6 +188,8 @@ struct iommu_resv_region {
  * @domain_get_windows: Return the number of windows for a domain
  * @of_xlate: add OF master IDs to iommu grouping
  * @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @bind_pasid_table: bind pasid table pointer for guest SVM
+ * @unbind_pasid_table: unbind pasid table pointer and restore defaults
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -233,8 +236,14 @@ struct iommu_ops {
 	u32 (*domain_get_windows)(struct iommu_domain *domain);
 
 	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
+
 	bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);
 
+	int (*bind_pasid_table)(struct iommu_domain *domain, struct device *dev,
+				struct pasid_table_config *pasidt_binfo);
+	void (*unbind_pasid_table)(struct iommu_domain *domain,
+				struct device *dev);
+
 	unsigned long pgsize_bitmap;
 };
 
@@ -296,6 +305,10 @@ extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
 extern void iommu_detach_device(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_bind_pasid_table(struct iommu_domain *domain,
+		struct device *dev, struct pasid_table_config *pasidt_binfo);
+extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
+				struct device *dev);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
@@ -696,6 +709,17 @@ const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
 	return NULL;
 }
 
+static inline
+int iommu_bind_pasid_table(struct iommu_domain *domain, struct device *dev,
+			struct pasid_table_config *pasidt_binfo)
+{
+	return -EINVAL;
+}
+static inline
+void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
+{
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
new file mode 100644
index 0000000..651ad5d
--- /dev/null
+++ b/include/uapi/linux/iommu.h
@@ -0,0 +1,39 @@
+/*
+ * IOMMU user API definitions
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _UAPI_IOMMU_H
+#define _UAPI_IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * PASID table data used to bind guest PASID table to the host IOMMU. This will
+ * enable guest managed first level page tables.
+ * @version: for future extensions and identification of the data format
+ * @bytes: size of this structure
+ * @base_ptr:	PASID table pointer
+ * @pasid_bits:	number of bits supported in the guest PASID table; must be
+ *		less than or equal to the host-supported PASID size.
+ */
+struct pasid_table_config {
+	__u32 version;
+#define PASID_TABLE_CFG_VERSION 1
+	__u32 bytes;
+	__u64 base_ptr;
+	__u8 pasid_bits;
+	/* reserved for extension of vendor specific config */
+	union {
+		struct {
+			/* ARM specific fields */
+			bool pasid0_dma_no_pasid;
+		} arm;
+	};
+};
+
+#endif /* _UAPI_IOMMU_H */
-- 
2.7.4

* [PATCH v3 02/16] iommu/vt-d: add bind_pasid_table function
From: Jacob Pan @ 2017-11-17 18:55 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan, Liu, Yi L

Add Intel VT-d ops for the generic iommu_bind_pasid_table API
functions.

The primary use case is direct assignment of an SVM-capable device.
Originating from the emulated IOMMU in the guest, the request goes
through many layers (e.g. VFIO). Upon calling the host IOMMU driver, the
caller passes the guest PASID table pointer (a GPA) and its size.

The device context table entry is modified by the Intel IOMMU specific
bind_pasid_table function. This turns on nested mode and the matching
translation type.

The unbind operation restores the default context mapping.
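
One encoding detail worth spelling out: per VT-d spec 9.4 the PASID table
size is stored as an order, where a field value x means 2^(x+5) entries.
That is why the bind path below programs pasid_bits - MIN_NR_PASID_BITS
into the context entry. Illustrative arithmetic only:

	/* pasid_bits = 20 -> size field = 20 - 5 = 15 -> 2^20 entries */
	u8 size_field = pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS;
	u64 nr_entries = 1ULL << (size_field + MIN_NR_PASID_BITS);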

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-iommu.c   | 107 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/dma_remapping.h |   1 +
 2 files changed, 108 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 2087cd8..3d1901d 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5176,6 +5176,7 @@ static void intel_iommu_put_resv_regions(struct device *dev,
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
 #define MAX_NR_PASID_BITS (20)
+#define MIN_NR_PASID_BITS (5)
 static inline unsigned long intel_iommu_get_pts(struct intel_iommu *iommu)
 {
 	/*
@@ -5302,6 +5303,108 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
 
 	return iommu;
 }
+
+static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
+		struct device *dev, struct pasid_table_config *pasidt_binfo)
+{
+	struct intel_iommu *iommu;
+	struct context_entry *context;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	struct pci_dev *pdev;
+	u8 bus, devfn, host_table_pasid_bits;
+	u16 did, sid;
+	int ret = 0;
+	unsigned long flags;
+	u64 ctx_lo;
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+	/* VT-d spec 9.4 says pasid table size is encoded as 2^(x+5) */
+	host_table_pasid_bits = intel_iommu_get_pts(iommu) + MIN_NR_PASID_BITS;
+	/* Check for a NULL pointer before dereferencing it in the message */
+	if (!pasidt_binfo)
+		return -EINVAL;
+	if (pasidt_binfo->pasid_bits > host_table_pasid_bits ||
+		pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
+		pr_err("Invalid gPASID bits %d, host range %d - %d\n",
+			pasidt_binfo->pasid_bits,
+			MIN_NR_PASID_BITS, host_table_pasid_bits);
+		return -ERANGE;
+	}
+
+	pdev = to_pci_dev(dev);
+	sid = PCI_DEVID(bus, devfn);
+	info = dev->archdata.iommu;
+
+	if (!info) {
+		dev_err(dev, "Invalid device domain info\n");
+		ret = -EINVAL;
+		goto out;
+	}
+	if (!info->pasid_enabled) {
+		ret = pci_enable_pasid(pdev, info->pasid_supported & ~1);
+		if (ret) {
+			dev_err(dev, "Failed to enable PASID\n");
+			goto out;
+		}
+	}
+	if (!device_context_mapped(iommu, bus, devfn)) {
+		pr_warn("ctx not mapped for bus devfn %x:%x\n", bus, devfn);
+		ret = -EINVAL;
+		goto out;
+	}
+	spin_lock_irqsave(&iommu->lock, flags);
+	context = iommu_context_addr(iommu, bus, devfn, 0);
+	if (!context) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* The guest is expected to use SVM and own the first-level tables,
+	 * so turn nested mode on
+	 */
+	ctx_lo = context[0].lo;
+	ctx_lo |= CONTEXT_NESTE | CONTEXT_PRS | CONTEXT_PASIDE;
+	ctx_lo &= ~CONTEXT_TT_MASK;
+	ctx_lo |= CONTEXT_TT_DEV_IOTLB << 2;
+	context[0].lo = ctx_lo;
+
+	/* Assign guest PASID table pointer and size order */
+	ctx_lo = (pasidt_binfo->base_ptr & VTD_PAGE_MASK) |
+		(pasidt_binfo->pasid_bits - MIN_NR_PASID_BITS);
+	context[1].lo = ctx_lo;
+	/* make sure context entry is updated before flushing */
+	wmb();
+	did = dmar_domain->iommu_did[iommu->seq_id];
+	iommu->flush.flush_context(iommu, did,
+				(((u16)bus) << 8) | devfn,
+				DMA_CCMD_MASK_NOBIT,
+				DMA_CCMD_DEVICE_INVL);
+	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
+
+out_unlock:
+	spin_unlock_irqrestore(&iommu->lock, flags);
+out:
+	return ret;
+}
+
+static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
+					struct device *dev)
+{
+	struct intel_iommu *iommu;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	u8 bus, devfn;
+
+	assert_spin_locked(&device_domain_lock);
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu) {
+		dev_err(dev, "No IOMMU for device to unbind PASID table\n");
+		return;
+	}
+
+	domain_context_clear(iommu, dev);
+
+	domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
+}
 #endif /* CONFIG_INTEL_IOMMU_SVM */
 
 const struct iommu_ops intel_iommu_ops = {
@@ -5310,6 +5413,10 @@ const struct iommu_ops intel_iommu_ops = {
 	.domain_free		= intel_iommu_domain_free,
 	.attach_dev		= intel_iommu_attach_device,
 	.detach_dev		= intel_iommu_detach_device,
+#ifdef CONFIG_INTEL_IOMMU_SVM
+	.bind_pasid_table	= intel_iommu_bind_pasid_table,
+	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
+#endif
 	.map			= intel_iommu_map,
 	.unmap			= intel_iommu_unmap,
 	.map_sg			= default_iommu_map_sg,
diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
index 21b3e7d..db290b2 100644
--- a/include/linux/dma_remapping.h
+++ b/include/linux/dma_remapping.h
@@ -28,6 +28,7 @@
 
 #define CONTEXT_DINVE		(1ULL << 8)
 #define CONTEXT_PRS		(1ULL << 9)
+#define CONTEXT_NESTE		(1ULL << 10)
 #define CONTEXT_PASIDE		(1ULL << 11)
 
 struct intel_iommu;
-- 
2.7.4

* [PATCH v3 03/16] iommu: introduce iommu invalidate API function
From: Jacob Pan @ 2017-11-17 18:55 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Liu, Yi L, Jacob Pan

From: "Liu, Yi L" <yi.l.liu@linux.intel.com>

When an SVM capable device is assigned to a guest, the first level page
tables are owned by the guest and the guest PASID table pointer is
linked to the device context entry of the physical IOMMU.

The host IOMMU driver has no knowledge of caching structure updates unless
the guest invalidation activities are passed down to the host. The
primary usage is derived from the emulated IOMMU in the guest, where QEMU
can trap invalidation activities before passing them down to the
host/physical IOMMU.
Since the invalidation data are obtained from user space and will be
written into the physical IOMMU, we must allow security checks at various
layers. Therefore, a generic invalidation data format is proposed here;
model-specific IOMMU drivers need to convert it into their own format.
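
As an illustration of the generic format (the PASID, address, and device
are hypothetical), a page-selective invalidation within one PASID would be
described and passed down like this:

	struct tlb_invalidate_info inv_info = {
		.hdr.version	= TLB_INV_HDR_VERSION_1,
		.hdr.type	= IOMMU_INV_TYPE_TLB,
		.granularity	= IOMMU_INV_GRANU_PAGE_PASID,
		.flags		= IOMMU_INVALIDATE_PASID_TAGGED,
		.size		= 0,		/* 2^0 = one 4KB page */
		.pasid		= 1,
		.addr		= 0x1000,	/* page to invalidate */
	};
	int ret = iommu_sva_invalidate(domain, dev, &inv_info);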

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c      | 14 +++++++++++
 include/linux/iommu.h      | 12 +++++++++
 include/uapi/linux/iommu.h | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c7e0d64..829e9e9 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1341,6 +1341,20 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_unbind_pasid_table);
 
+int iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	int ret = 0;
+
+	if (unlikely(!domain->ops->sva_invalidate))
+		return -ENODEV;
+
+	ret = domain->ops->sva_invalidate(domain, dev, inv_info);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 0f6f6c5..da684a7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -190,6 +190,7 @@ struct iommu_resv_region {
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @bind_pasid_table: bind pasid table pointer for guest SVM
  * @unbind_pasid_table: unbind pasid table pointer and restore defaults
+ * @sva_invalidate: invalidate translation caches of shared virtual address
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -243,6 +244,8 @@ struct iommu_ops {
 				struct pasid_table_config *pasidt_binfo);
 	void (*unbind_pasid_table)(struct iommu_domain *domain,
 				struct device *dev);
+	int (*sva_invalidate)(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info);
 
 	unsigned long pgsize_bitmap;
 };
@@ -309,6 +312,9 @@ extern int iommu_bind_pasid_table(struct iommu_domain *domain,
 		struct device *dev, struct pasid_table_config *pasidt_binfo);
 extern void iommu_unbind_pasid_table(struct iommu_domain *domain,
 				struct device *dev);
+extern int iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info);
+
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot);
@@ -720,6 +726,12 @@ void iommu_unbind_pasid_table(struct iommu_domain *domain, struct device *dev)
 {
 }
 
+static inline int iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */
diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 651ad5d..039ba36 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -36,4 +36,66 @@ struct pasid_table_config {
 	};
 };
 
+enum iommu_inv_granularity {
+	IOMMU_INV_GRANU_GLOBAL,		/* all TLBs invalidated */
+	IOMMU_INV_GRANU_DOMAIN,		/* all TLBs associated with a domain */
+	IOMMU_INV_GRANU_DEVICE,		/* caching structure associated with a
+					 * device ID
+					 */
+	IOMMU_INV_GRANU_DOMAIN_PAGE,	/* address range within a domain */
+	IOMMU_INV_GRANU_ALL_PASID,	/* cache of a given PASID */
+	IOMMU_INV_GRANU_PASID_SEL,	/* only invalidate specified PASID */
+
+	IOMMU_INV_GRANU_NG_ALL_PASID,	/* non-global within all PASIDs */
+	IOMMU_INV_GRANU_NG_PASID,	/* non-global within a PASID */
+	IOMMU_INV_GRANU_PAGE_PASID,	/* page-selective within a PASID */
+	IOMMU_INV_NR_GRANU,
+};
+
+enum iommu_inv_type {
+	IOMMU_INV_TYPE_DTLB,	/* device IOTLB */
+	IOMMU_INV_TYPE_TLB,	/* IOMMU paging structure cache */
+	IOMMU_INV_TYPE_PASID,	/* PASID cache */
+	IOMMU_INV_TYPE_CONTEXT,	/* device context entry cache */
+	IOMMU_INV_NR_TYPE
+};
+
+/**
+ * Translation cache invalidation header that contains mandatory meta data.
+ * @version:	info format version, expecting future extensions
+ * @type:	type of translation cache to be invalidated
+ */
+struct tlb_invalidate_hdr {
+	__u32 version;
+#define TLB_INV_HDR_VERSION_1 1
+	enum iommu_inv_type type;
+};
+
+/**
+ * Translation cache invalidation information, contains generic IOMMU
+ * data which can be parsed based on model ID by model specific drivers.
+ *
+ * @granularity:	requested invalidation granularity, type dependent
+ * @size:		2^size 4KB pages; 0 for 4KB, 9 for 2MB, etc.
+ * @pasid:		processor address space ID value per PCI spec.
+ * @addr:		page address to be invalidated
+ * @flags	IOMMU_INVALIDATE_PASID_TAGGED: DMA with PASID tagged,
+ *						@pasid validity can be
+ *						deduced from @granularity
+ *		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries
+ *		IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
+ *
+ */
+struct tlb_invalidate_info {
+	struct tlb_invalidate_hdr	hdr;
+	enum iommu_inv_granularity	granularity;
+	__u32		flags;
+#define IOMMU_INVALIDATE_NO_PASID	(1 << 0)
+#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 1)
+#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 2)
+#define IOMMU_INVALIDATE_PASID_TAGGED	(1 << 3)
+	__u8		size;
+	__u32		pasid;
+	__u64		addr;
+};
 #endif /* _UAPI_IOMMU_H */
-- 
2.7.4


* [PATCH v3 04/16] iommu/vt-d: move device_domain_info to header
From: Jacob Pan @ 2017-11-17 18:55 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

Allow both intel-iommu.c and dmar.c to access device_domain_info.
Prepare for additional per-device arch data used in the TLB flush functions.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 18 ------------------
 include/linux/intel-iommu.h | 19 +++++++++++++++++++
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3d1901d..399b504 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -391,24 +391,6 @@ struct dmar_domain {
 					   iommu core */
 };
 
-/* PCI domain-device relationship */
-struct device_domain_info {
-	struct list_head link;	/* link to domain siblings */
-	struct list_head global; /* link to global list */
-	u8 bus;			/* PCI bus number */
-	u8 devfn;		/* PCI devfn number */
-	u8 pasid_supported:3;
-	u8 pasid_enabled:1;
-	u8 pri_supported:1;
-	u8 pri_enabled:1;
-	u8 ats_supported:1;
-	u8 ats_enabled:1;
-	u8 ats_qdep;
-	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
-	struct intel_iommu *iommu; /* IOMMU used by this device */
-	struct dmar_domain *domain; /* pointer to domain */
-};
-
 struct dmar_rmrr_unit {
 	struct list_head list;		/* list of rmrr units	*/
 	struct acpi_dmar_header *hdr;	/* ACPI header		*/
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 77ea056..8d38e24 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -458,6 +458,25 @@ struct intel_iommu {
 	u32		flags;      /* Software defined flags */
 };
 
+/* PCI domain-device relationship */
+struct device_domain_info {
+	struct list_head link;	/* link to domain siblings */
+	struct list_head global; /* link to global list */
+	u8 bus;			/* PCI bus number */
+	u8 devfn;		/* PCI devfn number */
+	u8 pasid_supported:3;
+	u8 pasid_enabled:1;
+	u8 pri_supported:1;
+	u8 pri_enabled:1;
+	u8 ats_supported:1;
+	u8 ats_enabled:1;
+	u8 ats_qdep;
+	u64 fault_mask;	/* selected IOMMU faults to be reported */
+	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
+	struct intel_iommu *iommu; /* IOMMU used by this device */
+	struct dmar_domain *domain; /* pointer to domain */
+};
+
 static inline void __iommu_flush_cache(
 	struct intel_iommu *iommu, void *addr, int size)
 {
-- 
2.7.4

* [PATCH v3 05/16] iommu/vt-d: support flushing more TLB types
From: Jacob Pan @ 2017-11-17 18:55 UTC
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan, Liu, Yi L

With shared virtual memory virtualization, extended IOTLB invalidations
may be passed down from outside the IOMMU subsystem. This patch adds
invalidation functions that can be used for each IOTLB type.
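
For illustration only, a PASID-selective flush of a single 4KB page could be
issued through the new helper as below; "granu" stands for whichever VT-d
extended-IOTLB granularity encoding the driver resolved from the generic
passdown request:

	static void flush_one_page_pasid(struct intel_iommu *iommu, u16 did,
					 u64 addr, u32 pasid, u64 granu)
	{
		/* size_order 0 = one 4KB page; global = false, one PASID only */
		qi_flush_eiotlb(iommu, did, addr, pasid, 0, granu, false);
	}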

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/dmar.c        | 54 ++++++++++++++++++++++++++++++++++++++++++---
 drivers/iommu/intel-iommu.c |  3 ++-
 include/linux/intel-iommu.h | 10 +++++++--
 3 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 57c920c..f69f6ee 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1336,11 +1336,25 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 	qi_submit_sync(&desc, iommu);
 }
 
-void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
-			u64 addr, unsigned mask)
+void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr, u32 pasid,
+		unsigned int size_order, u64 granu, bool global)
 {
 	struct qi_desc desc;
 
+	desc.low = QI_EIOTLB_PASID(pasid) | QI_EIOTLB_DID(did) |
+		QI_EIOTLB_GRAN(granu) | QI_EIOTLB_TYPE;
+	desc.high = QI_EIOTLB_ADDR(addr) | QI_EIOTLB_GL(global) |
+		QI_EIOTLB_IH(0) | QI_EIOTLB_AM(size_order);
+	qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+			u16 qdep, u64 addr, unsigned mask)
+{
+	struct qi_desc desc;
+
+	pr_debug_ratelimited("%s: sid %d, pfsid %d, qdep %d, addr %llx, mask %d\n",
+		__func__, sid, pfsid, qdep, addr, mask);
 	if (mask) {
 		BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1));
 		addr |= (1ULL << (VTD_PAGE_SHIFT + mask - 1)) - 1;
@@ -1352,7 +1366,41 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
 		qdep = 0;
 
 	desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) |
-		   QI_DIOTLB_TYPE;
+		   QI_DIOTLB_TYPE | QI_DEV_EIOTLB_PFSID(pfsid);
+
+	qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+		u32 pasid,  u16 qdep, u64 addr, unsigned size, u64 granu)
+{
+	struct qi_desc desc;
+
+	desc.low = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
+		QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
+		QI_DEV_EIOTLB_PFSID(pfsid);
+	/* initialize high with the global hint; the address bits are ORed
+	 * in below so the hint is not clobbered
+	 */
+	desc.high = QI_DEV_EIOTLB_GLOB(granu);
+
+	/* If S bit is 0, we only flush a single page. If S bit is set,
+	 * the least significant zero bit indicates the size. VT-d spec
+	 * 6.5.2.6
+	 */
+	if (!size)
+		desc.high |= QI_DEV_EIOTLB_ADDR(addr) & ~QI_DEV_EIOTLB_SIZE;
+	else {
+		unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size);
+
+		desc.high |= QI_DEV_EIOTLB_ADDR(addr & ~mask) | QI_DEV_EIOTLB_SIZE;
+	}
+	qi_submit_sync(&desc, iommu);
+}
+
+void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
+{
+	struct qi_desc desc;
+
+	desc.high = 0;
+	desc.low = QI_PC_TYPE | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_PASID(pasid);
 
 	qi_submit_sync(&desc, iommu);
 }
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 399b504..556bdd2 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1524,7 +1524,8 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
 
 		sid = info->bus << 8 | info->devfn;
 		qdep = info->ats_qdep;
-		qi_flush_dev_iotlb(info->iommu, sid, qdep, addr, mask);
+		qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+				qdep, addr, mask);
 	}
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 8d38e24..3c83f7e 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -305,6 +305,7 @@ enum {
 #define QI_DEV_EIOTLB_PASID(p)	(((u64)p) << 32)
 #define QI_DEV_EIOTLB_SID(sid)	((u64)((sid) & 0xffff) << 16)
 #define QI_DEV_EIOTLB_QDEP(qd)	((u64)((qd) & 0x1f) << 4)
+#define QI_DEV_EIOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xff0) << 48))
 #define QI_DEV_EIOTLB_MAX_INVS	32
 
 #define QI_PGRP_IDX(idx)	(((u64)(idx)) << 55)
@@ -496,8 +497,13 @@ extern void qi_flush_context(struct intel_iommu *iommu, u16 did, u16 sid,
 			     u8 fm, u64 type);
 extern void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 			  unsigned int size_order, u64 type);
-extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
-			       u64 addr, unsigned mask);
+extern void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr,
+			u32 pasid, unsigned int size_order, u64 type, bool global);
+extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+			u16 qdep, u64 addr, unsigned mask);
+extern void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
+				u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu);
+extern void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
 
 extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
 
-- 
2.7.4

* [PATCH v3 06/16] iommu/vt-d: add svm/sva invalidate function
  2017-11-17 18:54 ` Jacob Pan
                   ` (5 preceding siblings ...)
  (?)
@ 2017-11-17 18:55 ` Jacob Pan
  2017-12-05  5:43     ` Lu Baolu
  -1 siblings, 1 reply; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan, Liu, Yi L

This patch adds an Intel VT-d specific function to implement the
IOMMU passdown invalidate API for shared virtual address.

The use case is to support caching structure invalidation
for assigned SVM capable devices. The emulated IOMMU exposes the queued
invalidation capability and passes down all descriptors from the guest
to the physical IOMMU.

The assumption is that the guest to host device ID mapping is
resolved prior to calling the IOMMU driver. Based on the device handle,
the host IOMMU driver can replace certain fields before submitting to
the invalidation queue.
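
For illustration, a minimal caller-side sketch of the passdown path using
the data structures from this patch. The iommu_sva_invalidate() wrapper
name is an assumption of this sketch; this patch only adds the
.sva_invalidate op:

/*
 * Sketch only: how a caller such as VFIO might build a page-selective,
 * PASID-tagged invalidation request once the guest device handle has
 * been resolved to a host struct device.
 */
static int flush_guest_pages(struct iommu_domain *domain, struct device *dev,
			     u32 pasid, u64 addr, unsigned int size_order)
{
	struct tlb_invalidate_info inv_info = {
		.hdr.type	= IOMMU_INV_TYPE_TLB,
		.granularity	= IOMMU_INV_GRANU_PAGE_PASID,
		.flags		= IOMMU_INVALIDATE_PASID_TAGGED,
		.pasid		= pasid,
		.addr		= addr,
		.size		= size_order,
	};

	/* assumed generic wrapper dispatching to ops->sva_invalidate */
	return iommu_sva_invalidate(domain, dev, &inv_info);
}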

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-iommu.c | 200 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/intel-iommu.h |  17 +++-
 2 files changed, 211 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 556bdd2..000b2b3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4981,6 +4981,183 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
 	dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
 }
 
+/*
+ * 3D array for converting IOMMU generic type-granularity to VT-d granularity
+ * X indexed by enum iommu_inv_type
+ * Y indicates request without and with PASID
+ * Z indexed by enum iommu_inv_granularity
+ *
+ * For example, to find the VT-d granularity encoding for IOTLB
+ * type, DMA request with PASID, and page selective, the lookup indices are:
+ * [1][1][8], where
+ * 1: IOMMU_INV_TYPE_TLB
+ * 1: with PASID
+ * 8: IOMMU_INV_GRANU_PAGE_PASID
+ *
+ */
+static const int inv_type_granu_map[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
+	/*
+	 * Extended dev IOTLBs: for dev-IOTLB only global is valid;
+	 * for dev-EXIOTLB two granularities are valid.
+	 */
+	{
+		{1},
+		{0, 0, 0, 0, 1, 1, 0, 0, 0}
+	},
+	/* IOTLB and EIOTLB */
+	{
+		{1, 1, 0, 1, 0, 0, 0, 0, 0},
+		{0, 0, 0, 0, 1, 0, 1, 1, 1}
+	},
+	/* PASID cache */
+	{
+		{0},
+		{0, 0, 0, 0, 1, 1, 0, 0, 0}
+	},
+	/* context cache */
+	{
+		{1, 1, 1}
+	}
+};
+
+static const u64 inv_type_granu_table[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
+	/* extended dev IOTLBs: without PASID only global is valid */
+	{
+		{QI_DEV_IOTLB_GRAN_ALL},
+		{0, 0, 0, 0, QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0, 0, 0}
+	},
+	/* IOTLB and EIOTLB */
+	{
+		{DMA_TLB_GLOBAL_FLUSH, DMA_TLB_DSI_FLUSH, 0, DMA_TLB_PSI_FLUSH},
+		{0, 0, 0, 0, QI_GRAN_ALL_ALL, 0, QI_GRAN_NONG_ALL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID}
+	},
+	/* PASID cache */
+	{
+		{0},
+		{0, 0, 0, 0, QI_PC_ALL_PASIDS, QI_PC_PASID_SEL}
+	},
+	/* context cache */
+	{
+		{DMA_CCMD_GLOBAL_INVL, DMA_CCMD_DOMAIN_INVL, DMA_CCMD_DEVICE_INVL}
+	}
+};
+
+static inline int to_vtd_granularity(int type, int granu, int with_pasid, u64 *vtd_granu)
+{
+	if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU || with_pasid > 1)
+		return -EINVAL;
+
+	if (inv_type_granu_map[type][with_pasid][granu] == 0)
+		return -EINVAL;
+
+	*vtd_granu = inv_type_granu_table[type][with_pasid][granu];
+
+	return 0;
+}
+
+static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
+		struct device *dev, struct tlb_invalidate_info *inv_info)
+{
+	struct intel_iommu *iommu;
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
+	u16 did, sid;
+	u8 bus, devfn;
+	int ret = 0;
+	u64 granu;
+	unsigned long flags;
+
+	if (!inv_info || !dmar_domain)
+		return -EINVAL;
+
+	if (!dev || !dev_is_pci(dev))
+		return -ENODEV;
+
+	iommu = device_to_iommu(dev, &bus, &devfn);
+	if (!iommu)
+		return -ENODEV;
+
+	did = dmar_domain->iommu_did[iommu->seq_id];
+	sid = PCI_DEVID(bus, devfn);
+	ret = to_vtd_granularity(inv_info->hdr.type, inv_info->granularity,
+				!!(inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED), &granu);
+	if (ret) {
+		pr_err("Invalid invalidation type %d, granu %d\n", inv_info->hdr.type,
+			inv_info->granularity);
+		return ret;
+	}
+
+	spin_lock(&iommu->lock);
+	spin_lock_irqsave(&device_domain_lock, flags);
+
+	switch (inv_info->hdr.type) {
+	case IOMMU_INV_TYPE_CONTEXT:
+		iommu->flush.flush_context(iommu, did, sid,
+					DMA_CCMD_MASK_NOBIT, granu);
+		break;
+	case IOMMU_INV_TYPE_TLB:
+		/*
+		 * We need to deal with two scenarios:
+		 * - IOTLB for request w/o PASID
+		 * - extended IOTLB for request with PASID.
+		 */
+		if (inv_info->size &&
+			(inv_info->addr & ((1ULL << (VTD_PAGE_SHIFT + inv_info->size)) - 1))) {
+			pr_err("Addr out of range, addr 0x%llx, size order %d\n",
+				inv_info->addr, inv_info->size);
+			ret = -ERANGE;
+			goto out_unlock;
+		}
+
+		if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
+			qi_flush_eiotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
+					inv_info->pasid,
+					inv_info->size, granu,
+					inv_info->flags & IOMMU_INVALIDATE_GLOBAL_PAGE);
+		else
+			qi_flush_iotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
+				inv_info->size, granu);
+		/*
+		 * Always flush the device IOTLB if ATS is enabled, since the
+		 * guest vIOMMU exposes CM = 1 and no device IOTLB flush will
+		 * be passed down.
+		 * TODO: if the device is a VF, use PF ATS data (qdep and
+		 * ats_enabled) in case the spec does not require a VF to
+		 * include all PF capabilities.
+		 */
+		info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
+		if (info && info->ats_enabled) {
+			if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
+				qi_flush_dev_eiotlb(iommu, sid, info->pfsid,
+						inv_info->pasid, info->ats_qdep,
+						inv_info->addr, inv_info->size,
+						granu);
+			else
+				qi_flush_dev_iotlb(iommu, sid, info->pfsid,
+						info->ats_qdep, inv_info->addr,
+						inv_info->size);
+		}
+		break;
+	case IOMMU_INV_TYPE_PASID:
+		qi_flush_pasid(iommu, did, granu, inv_info->pasid);
+
+		break;
+	default:
+		dev_err(dev, "Unknown IOMMU invalidation type %d\n",
+			inv_info->hdr.type);
+		ret = -EINVAL;
+	}
+out_unlock:
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+	spin_unlock(&iommu->lock);
+
+	return ret;
+}
+
 static int intel_iommu_map(struct iommu_domain *domain,
 			   unsigned long iova, phys_addr_t hpa,
 			   size_t size, int iommu_prot)
@@ -5304,7 +5481,7 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
 	iommu = device_to_iommu(dev, &bus, &devfn);
 	if (!iommu)
 		return -ENODEV;
-	/* VT-d spec 9.4 says pasid table size is encoded as 2^(x+5) */
+	/* VT-d spec section 9.4 says pasid table size is encoded as 2^(x+5) */
 	host_table_pasid_bits = intel_iommu_get_pts(iommu) + MIN_NR_PASID_BITS;
 	if (!pasidt_binfo || pasidt_binfo->pasid_bits > host_table_pasid_bits ||
 		pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
@@ -5313,7 +5490,11 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
 			MIN_NR_PASID_BITS, host_table_pasid_bits);
 		return -ERANGE;
 	}
-
+	if (!ecap_nest(iommu->ecap)) {
+		dev_err(dev, "Cannot bind PASID table, no nested translation\n");
+		ret = -EINVAL;
+		goto out;
+	}
 	pdev = to_pci_dev(dev);
 	sid = PCI_DEVID(bus, devfn);
 	info = dev->archdata.iommu;
@@ -5323,6 +5504,11 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
 		ret = -EINVAL;
 		goto out;
 	}
+	if (info->pasid_table_bound) {
+		dev_err(dev, "Device PASID table already bound\n");
+		ret = -EBUSY;
+		goto out;
+	}
 	if (!info->pasid_enabled) {
 		ret = pci_enable_pasid(pdev, info->pasid_supported & ~1);
 		if (ret) {
@@ -5363,7 +5549,7 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
 				DMA_CCMD_MASK_NOBIT,
 				DMA_CCMD_DEVICE_INVL);
 	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
-
+	info->pasid_table_bound = 1;
 out_unlock:
 	spin_unlock_irqrestore(&iommu->lock, flags);
 out:
@@ -5375,8 +5561,14 @@ static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
 {
 	struct intel_iommu *iommu;
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct device_domain_info *info;
 	u8 bus, devfn;
 
+	info = dev->archdata.iommu;
+	if (!info) {
+		dev_err(dev, "Invalid device domain info\n");
+		return;
+	}
 	assert_spin_locked(&device_domain_lock);
 	iommu = device_to_iommu(dev, &bus, &devfn);
 	if (!iommu) {
@@ -5387,6 +5579,7 @@ static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
 	domain_context_clear(iommu, dev);
 
 	domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
+	info->pasid_table_bound = 0;
 }
 #endif /* CONFIG_INTEL_IOMMU_SVM */
 
@@ -5399,6 +5592,7 @@ const struct iommu_ops intel_iommu_ops = {
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	.bind_pasid_table	= intel_iommu_bind_pasid_table,
 	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
+	.sva_invalidate		= intel_iommu_sva_invalidate,
 #endif
 	.map			= intel_iommu_map,
 	.unmap			= intel_iommu_unmap,
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 3c83f7e..7f05e36 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -258,6 +258,10 @@ enum {
 #define QI_PGRP_RESP_TYPE	0x9
 #define QI_PSTRM_RESP_TYPE	0xa
 
+#define QI_DID(did)		(((u64)did & 0xffff) << 16)
+#define QI_DID_MASK		GENMASK(31, 16)
+#define QI_TYPE_MASK		GENMASK(3, 0)
+
 #define QI_IEC_SELECTIVE	(((u64)1) << 4)
 #define QI_IEC_IIDEX(idx)	(((u64)(idx & 0xffff) << 32))
 #define QI_IEC_IM(m)		(((u64)(m & 0x1f) << 27))
@@ -288,8 +292,9 @@ enum {
 #define QI_PC_DID(did)		(((u64)did) << 16)
 #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
 
-#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
-#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
+/* PC inv granu */
+#define QI_PC_ALL_PASIDS	0
+#define QI_PC_PASID_SEL		1
 
 #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
 #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
@@ -299,6 +304,10 @@ enum {
 #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
 #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
 
+/* QI Dev-IOTLB inv granu */
+#define QI_DEV_IOTLB_GRAN_ALL		0
+#define QI_DEV_IOTLB_GRAN_PASID_SEL	1
+
 #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
 #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
 #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
@@ -327,6 +336,7 @@ enum {
 #define QI_RESP_INVALID		0x1
 #define QI_RESP_FAILURE		0xf
 
+/* QI EIOTLB inv granu */
 #define QI_GRAN_ALL_ALL			0
 #define QI_GRAN_NONG_ALL		1
 #define QI_GRAN_NONG_PASID		2
@@ -471,6 +481,7 @@ struct device_domain_info {
 	u8 pri_enabled:1;
 	u8 ats_supported:1;
 	u8 ats_enabled:1;
+	u8 pasid_table_bound:1;
 	u8 ats_qdep;
 	u64 fault_mask;	/* selected IOMMU faults to be reported */
 	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
@@ -502,7 +513,7 @@ extern void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr,
 extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 			u16 qdep, u64 addr, unsigned mask);
 extern void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
-				u32 pasid, u16 qdep, u64 addr, unsigned size);
+			u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu);
 extern void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
 
 extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 07/16] iommu/vt-d: assign PFSID in device TLB invalidation
  2017-11-17 18:54 ` Jacob Pan
                   ` (6 preceding siblings ...)
  (?)
@ 2017-11-17 18:55 ` Jacob Pan
  2017-12-05  5:45   ` Lu Baolu
  -1 siblings, 1 reply; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

When an SRIOV VF device IOTLB is invalidated, we need to provide
the PF source ID so that IOMMU hardware can gauge the depth
of the invalidation queue, which is shared among VFs. This is needed
when the device invalidation throttle (DIT) capability is supported.
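
As a sketch, the PFSID can be derived with pci_physfn(), which returns
the device itself for a non-VF:

/* Sketch: source ID the invalidation descriptor should carry */
static u16 device_pfsid(struct pci_dev *pdev)
{
	struct pci_dev *pf = pci_physfn(pdev);	/* pdev itself if not a VF */

	return PCI_DEVID(pf->bus->number, pf->devfn);
}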

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 13 +++++++++++++
 include/linux/intel-iommu.h |  3 +++
 2 files changed, 16 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 000b2b3..e1bd219 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1459,6 +1459,19 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
 		return;
 
 	pdev = to_pci_dev(info->dev);
+	/*
+	 * For an IOMMU that supports device IOTLB throttling (DIT), we assign
+	 * a PFSID to the invalidation descriptor of a VF such that IOMMU HW
+	 * can gauge queue depth at PF level. If DIT is not supported, PFSID
+	 * is treated as reserved and should be set to 0.
+	 */
+	if (!ecap_dit(info->iommu->ecap)) {
+		info->pfsid = 0;
+		if (pdev && pdev->is_virtfn)
+			dev_warn(&pdev->dev, "SRIOV VF device IOTLB enabled without flow control\n");
+	} else if (pdev && pdev->is_virtfn) {
+		info->pfsid = PCI_DEVID(pdev->physfn->bus->number, pdev->physfn->devfn);
+	} else {
+		info->pfsid = PCI_DEVID(info->bus, info->devfn);
+	}
 
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	/* The PCIe spec, in its wisdom, declares that the behaviour of
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 7f05e36..6956a4e 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -112,6 +112,7 @@
  * Extended Capability Register
  */
 
+#define ecap_dit(e)		((e >> 41) & 0x1)
 #define ecap_pasid(e)		((e >> 40) & 0x1)
 #define ecap_pss(e)		((e >> 35) & 0x1f)
 #define ecap_eafs(e)		((e >> 34) & 0x1)
@@ -285,6 +286,7 @@ enum {
 #define QI_DEV_IOTLB_SID(sid)	((u64)((sid) & 0xffff) << 32)
 #define QI_DEV_IOTLB_QDEP(qdep)	(((qdep) & 0x1f) << 16)
 #define QI_DEV_IOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
+#define QI_DEV_IOTLB_PFSID(pfsid) (((u64)((pfsid) & 0xf) << 12) | ((u64)((pfsid) & 0xfff0) << 48))
 #define QI_DEV_IOTLB_SIZE	1
 #define QI_DEV_IOTLB_MAX_INVS	32
 
@@ -475,6 +477,7 @@ struct device_domain_info {
 	struct list_head global; /* link to global list */
 	u8 bus;			/* PCI bus number */
 	u8 devfn;		/* PCI devfn number */
+	u16 pfsid;		/* SRIOV physical function source ID */
 	u8 pasid_supported:3;
 	u8 pasid_enabled:1;
 	u8 pri_supported:1;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 08/16] iommu: introduce device fault data
  2017-11-17 18:54 ` Jacob Pan
                   ` (7 preceding siblings ...)
  (?)
@ 2017-11-17 18:55 ` Jacob Pan
  2017-11-24 12:03   ` Jean-Philippe Brucker
  2018-01-10 11:41   ` Jean-Philippe Brucker
  -1 siblings, 2 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan, Liu, Yi L

Device faults detected by an IOMMU can be reported outside the IOMMU
subsystem for further processing. This patch intends to provide
generic device fault data such that IOMMU faults can be communicated
to device drivers without model-specific knowledge.

The proposed format is the result of discussion at:
https://lkml.org/lkml/2017/11/10/291
Part of the code is based on Jean-Philippe Brucker's patchset
(https://patchwork.kernel.org/patch/9989315/).

The assumption is that the model-specific IOMMU driver can filter and
handle most of the internal faults if the cause is within the IOMMU
driver's control. Therefore, the fault reasons that can be reported are
grouped and generalized based on common specifications such as PCI ATS.
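
A minimal sketch, under the format proposed here, of how a driver might
populate the generic data for a recoverable page request (the report
call itself is introduced later in this series):

static void report_page_request(struct device *dev, u64 addr, u32 pasid)
{
	struct iommu_fault_event event = {
		.type		= IOMMU_FAULT_PAGE_REQ,
		.reason		= IOMMU_FAULT_REASON_UNKNOWN,
		.addr		= addr,
		.pasid		= pasid,
		.pasid_valid	= 1,
		.prot		= IOMMU_FAULT_READ,
	};

	iommu_report_device_fault(dev, &event);
}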

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 include/linux/iommu.h | 108 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 106 insertions(+), 2 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index da684a7..dfda89b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -49,13 +49,17 @@ struct bus_type;
 struct device;
 struct iommu_domain;
 struct notifier_block;
+struct iommu_fault_event;
 
 /* iommu fault flags */
-#define IOMMU_FAULT_READ	0x0
-#define IOMMU_FAULT_WRITE	0x1
+#define IOMMU_FAULT_READ		(1 << 0)
+#define IOMMU_FAULT_WRITE		(1 << 1)
+#define IOMMU_FAULT_EXEC		(1 << 2)
+#define IOMMU_FAULT_PRIV		(1 << 3)
 
 typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
 			struct device *, unsigned long, int, void *);
+typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault_event *, void *);
 
 struct iommu_domain_geometry {
 	dma_addr_t aperture_start; /* First address that can be mapped    */
@@ -264,6 +268,105 @@ struct iommu_device {
 	struct device *dev;
 };
 
+enum iommu_model {
+	IOMMU_MODEL_INTEL = 1,
+	IOMMU_MODEL_AMD,
+	IOMMU_MODEL_SMMU3,
+};
+
+/* Generic fault types, can be expanded for IRQ remapping faults */
+enum iommu_fault_type {
+	IOMMU_FAULT_DMA_UNRECOV = 1,	/* unrecoverable fault */
+	IOMMU_FAULT_PAGE_REQ,		/* page request fault */
+};
+
+enum iommu_fault_reason {
+	IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+	/* IOMMU internal error, no specific reason to report out */
+	IOMMU_FAULT_REASON_INTERNAL,
+
+	/* Could not access the PASID table */
+	IOMMU_FAULT_REASON_PASID_FETCH,
+
+	/*
+	 * PASID is out of range (e.g. exceeds the maximum PASID
+	 * supported by the IOMMU) or disabled.
+	 */
+	IOMMU_FAULT_REASON_PASID_INVALID,
+
+	/* Could not access the page directory (Invalid PASID entry) */
+	IOMMU_FAULT_REASON_PGD_FETCH,
+
+	/* Could not access the page table entry (Bad address) */
+	IOMMU_FAULT_REASON_PTE_FETCH,
+
+	/* Protection flag check failed */
+	IOMMU_FAULT_REASON_PERMISSION,
+};
+
+/**
+ * struct iommu_fault_event - Generic per device fault data
+ *
+ * - PCI and non-PCI devices
+ * - Recoverable faults (e.g. page request), information based on PCI ATS
+ *   and PASID spec
+ * - Unrecoverable faults of device interest
+ * - DMA remapping and IRQ remapping faults
+ *
+ * @type: fault type
+ * @reason: fault reason, if relevant outside the IOMMU driver; faults
+ *          internal to the IOMMU driver are not reported
+ * @addr: the offending page address
+ * @pasid: process address space ID, used in shared virtual memory (SVM)
+ * @page_req_group_id: page request group index
+ * @last_req: last request in a page request group
+ * @pasid_valid: indicates if the PRQ has a valid PASID
+ * @prot: page access protection flags, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
+ * @device_private: if present, uniquely identifies device-specific
+ *                  private data for an individual page request.
+ * @iommu_private: used by the IOMMU driver for storing fault-specific
+ *                 data. Users should not modify this field before
+ *                 sending the fault response.
+ */
+struct iommu_fault_event {
+	enum iommu_fault_type type;
+	enum iommu_fault_reason reason;
+	u64 addr;
+	u32 pasid;
+	u32 page_req_group_id : 9;
+	u32 last_req : 1;
+	u32 pasid_valid : 1;
+	u32 prot;
+	u64 device_private;
+	u64 iommu_private;
+};
+
+/**
+ * struct iommu_fault_param - per-device IOMMU fault data
+ * @handler: Callback function to handle IOMMU faults at device level
+ * @data: handler private data
+ *
+ */
+struct iommu_fault_param {
+	iommu_dev_fault_handler_t handler;
+	void *data;
+};
+
+/**
+ * struct iommu_param - collection of per-device IOMMU data
+ *
+ * @fault_param: IOMMU detected device fault reporting data
+ *
+ * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
+ *	struct iommu_group	*iommu_group;
+ *	struct iommu_fwspec	*iommu_fwspec;
+ */
+struct iommu_param {
+	struct iommu_fault_param *fault_param;
+};
+
 int  iommu_device_register(struct iommu_device *iommu);
 void iommu_device_unregister(struct iommu_device *iommu);
 int  iommu_device_sysfs_add(struct iommu_device *iommu,
@@ -437,6 +540,7 @@ struct iommu_ops {};
 struct iommu_group {};
 struct iommu_fwspec {};
 struct iommu_device {};
+struct iommu_fault_param {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 09/16] driver core: add iommu device fault reporting data
  2017-11-17 18:54 ` Jacob Pan
                   ` (8 preceding siblings ...)
  (?)
@ 2017-11-17 18:55 ` Jacob Pan
  2017-12-18 14:37     ` Greg Kroah-Hartman
  -1 siblings, 1 reply; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

DMA faults can be detected by the IOMMU at device level. Adding a pointer
to struct device allows the IOMMU subsystem to report relevant faults
back to the device driver for further handling.
For directly assigned devices (or user space drivers), the guest OS holds
responsibility to handle and respond to per-device IOMMU faults.
Therefore we need a fault reporting mechanism that propagates faults beyond
the IOMMU subsystem.

There are two other IOMMU data pointers under struct device today; here
we introduce iommu_param as a parent pointer such that all device IOMMU
data can be consolidated there. The idea was suggested by Greg KH
and Joerg. The name iommu_param is chosen since iommu_data has been used.
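
With the parent pointer in place, per-device fault data is reached from
struct device. A minimal sketch of the access pattern, mirroring the
helper added later in this series:

static bool dev_has_iommu_fault_handler(struct device *dev)
{
	/* iommu_param and fault_param are allocated by the IOMMU core */
	return dev->iommu_param && dev->iommu_param->fault_param &&
	       dev->iommu_param->fault_param->handler;
}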

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Link: https://lkml.org/lkml/2017/10/6/81
---
 include/linux/device.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 66fe271..540e5e5 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -42,6 +42,7 @@ struct fwnode_handle;
 struct iommu_ops;
 struct iommu_group;
 struct iommu_fwspec;
+struct iommu_param;
 
 struct bus_attribute {
 	struct attribute	attr;
@@ -871,6 +872,7 @@ struct dev_links_info {
  * 		device (i.e. the bus driver that discovered the device).
  * @iommu_group: IOMMU group the device belongs to.
  * @iommu_fwspec: IOMMU-specific properties supplied by firmware.
+ * @iommu_param: Per device generic IOMMU runtime data
  *
  * @offline_disabled: If set, the device is permanently online.
  * @offline:	Set after successful invocation of bus type's .offline().
@@ -960,6 +962,7 @@ struct device {
 	void	(*release)(struct device *dev);
 	struct iommu_group	*iommu_group;
 	struct iommu_fwspec	*iommu_fwspec;
+	struct iommu_param	*iommu_param;
 
 	bool			offline_disabled:1;
 	bool			offline:1;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 10/16] iommu: introduce device fault report API
@ 2017-11-17 18:55   ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

Traditionally, device specific faults are detected and handled within
their own device drivers. When an IOMMU is enabled, faults on DMA
transactions are detected by the IOMMU. There is no generic
reporting mechanism to report faults back to the in-kernel device
driver or the guest OS in case of assigned devices.

Faults detected by the IOMMU are based on the transaction's source ID,
which can be reported on a per-device basis, regardless of whether the
device is a PCI device or not.

The fault types include recoverable faults (e.g. page request) and
unrecoverable faults (e.g. access error). In most cases, faults can be
handled by IOMMU drivers internally. The primary use cases are as
follows:
1. page request fault originated from an SVM capable device that is
assigned to a guest via vIOMMU. In this case, the first level page tables
are owned by the guest. Page requests must be propagated to the guest to
let the guest OS fault in the pages and then send a page response. In this
mechanism, the direct receiver of the IOMMU fault notification is VFIO,
which can relay notification events to QEMU or other user space
software.

2. faults that need more subtle handling by device drivers. Rather than
simply invoking a reset function, the device driver may need to handle
the fault with a smaller impact.

This patchset is intended to create a generic fault report API such
that it can scale as follows:
- all IOMMU types
- PCI and non-PCI devices
- recoverable and unrecoverable faults
- VFIO and other in-kernel users
- DMA & IRQ remapping (TBD)
The original idea was brought up by David Woodhouse and discussions
summarized at https://lwn.net/Articles/608914/.
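
A consumer-side sketch of the API added below; the driver context and
the mydrv_* helpers are hypothetical, and the register/unregister calls
are shown as probe/remove fragments:

static int mydrv_iommu_fault_handler(struct iommu_fault_event *evt, void *data)
{
	struct mydrv_ctx *ctx = data;		/* hypothetical context */

	if (evt->type == IOMMU_FAULT_PAGE_REQ)
		return mydrv_queue_page_request(ctx, evt);	/* hypothetical */

	dev_err(ctx->dev, "unrecoverable IOMMU fault at 0x%llx\n", evt->addr);
	return 0;
}

/* at probe time */
ret = iommu_register_device_fault_handler(dev, mydrv_iommu_fault_handler, ctx);
if (ret)
	goto err;

/* at remove time */
iommu_unregister_device_fault_handler(dev);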

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/iommu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h | 36 +++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 829e9e9..97b7990 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -581,6 +581,12 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 		goto err_free_name;
 	}
 
+	dev->iommu_param = kzalloc(sizeof(*dev->iommu_param), GFP_KERNEL);
+	if (!dev->iommu_param) {
+		ret = -ENOMEM;
+		goto err_free_name;
+	}
+
 	kobject_get(group->devices_kobj);
 
 	dev->iommu_group = group;
@@ -657,7 +663,7 @@ void iommu_group_remove_device(struct device *dev)
 	sysfs_remove_link(&dev->kobj, "iommu_group");
 
 	trace_remove_device_from_group(group->id, dev);
-
+	kfree(dev->iommu_param);
 	kfree(device->name);
 	kfree(device);
 	dev->iommu_group = NULL;
@@ -791,6 +797,61 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
 }
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
+int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data)
+{
+	struct iommu_param *idata = dev->iommu_param;
+
+	/*
+	 * Device iommu_param should have been allocated when device is
+	 * added to its iommu_group.
+	 */
+	if (!idata)
+		return -EINVAL;
+	/* Only allow one fault handler registered for each device */
+	if (idata->fault_param)
+		return -EBUSY;
+	get_device(dev);
+	idata->fault_param =
+		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
+	if (!idata->fault_param) {
+		put_device(dev);
+		return -ENOMEM;
+	}
+	idata->fault_param->handler = handler;
+	idata->fault_param->data = data;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
+
+int iommu_unregister_device_fault_handler(struct device *dev)
+{
+	struct iommu_param *idata = dev->iommu_param;
+
+	if (!idata)
+		return -EINVAL;
+
+	/* Only drop the reference taken when a handler was registered */
+	if (!idata->fault_param)
+		return -EINVAL;
+
+	kfree(idata->fault_param);
+	idata->fault_param = NULL;
+	put_device(dev);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
+
+int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+	/* we only report device fault if there is a handler registered */
+	if (!dev->iommu_param || !dev->iommu_param->fault_param ||
+		!dev->iommu_param->fault_param->handler)
+		return -ENOSYS;
+
+	return dev->iommu_param->fault_param->handler(evt,
+						dev->iommu_param->fault_param->data);
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_fault);
+
 /**
  * iommu_group_id - Return ID for a group
  * @group: the group to ID
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index dfda89b..841c044 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -463,6 +463,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
 					 struct notifier_block *nb);
 extern int iommu_group_unregister_notifier(struct iommu_group *group,
 					   struct notifier_block *nb);
+extern int iommu_register_device_fault_handler(struct device *dev,
+					iommu_dev_fault_handler_t handler,
+					void *data);
+
+extern int iommu_unregister_device_fault_handler(struct device *dev);
+
+extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
+
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
@@ -481,6 +489,12 @@ extern void iommu_domain_window_disable(struct iommu_domain *domain, u32 wnd_nr)
 extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
 			      unsigned long iova, int flags);
 
+static inline bool iommu_has_device_fault_handler(struct device *dev)
+{
+	return dev->iommu_param && dev->iommu_param->fault_param &&
+		dev->iommu_param->fault_param->handler;
+}
+
 static inline void iommu_flush_tlb_all(struct iommu_domain *domain)
 {
 	if (domain->ops->flush_iotlb_all)
@@ -734,6 +748,28 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
 	return 0;
 }
 
+static inline int iommu_register_device_fault_handler(struct device *dev,
+						iommu_dev_fault_handler_t handler,
+						void *data)
+{
+	return 0;
+}
+
+static inline int iommu_unregister_device_fault_handler(struct device *dev)
+{
+	return 0;
+}
+
+static inline bool iommu_has_device_fault_handler(struct device *dev)
+{
+	return false;
+}
+
+static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
+{
+	return 0;
+}
+
 static inline int iommu_group_id(struct iommu_group *group)
 {
 	return -ENODEV;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 11/16] iommu/vt-d: use threaded irq for dmar_fault
@ 2017-11-17 18:55   ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

Currently, the dmar fault IRQ handler does nothing more than a
rate-limited printk; no critical hardware handling needs to be done
in IRQ context.
Converting it to a threaded IRQ allows fault processing that
requires process context, e.g. finding the offending device based
on the source ID in the fault reasons.
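
As a sketch of what process context permits, a threaded handler may use
sleeping primitives while processing faults; the lock below is
hypothetical:

static irqreturn_t dmar_fault_thread(int irq, void *dev_id)
{
	mutex_lock(&fault_report_mutex);	/* hypothetical, may sleep */
	/* walk the fault recording registers and report per-device faults */
	mutex_unlock(&fault_report_mutex);

	return IRQ_HANDLED;
}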

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/dmar.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index f69f6ee..38ee91b 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1749,7 +1749,8 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
 		return -EINVAL;
 	}
 
-	ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
+	ret = request_threaded_irq(irq, NULL, dmar_fault,
+				IRQF_ONESHOT, iommu->name, iommu);
 	if (ret)
 		pr_err("Can't request irq\n");
 	return ret;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 12/16] iommu/vt-d: report unrecoverable device faults
@ 2017-11-17 18:55   ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan, Liu, Yi L

Currently, when device DMA faults are detected by the IOMMU, the fault
reasons are printed but the driver of the offending device is not
involved in fault handling.
This patch uses the per-device fault reporting API to send fault event
data for further processing.
The offending device is identified by the source ID in the VT-d fault
recording registers.
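
For reference, a fragment showing how the 16-bit source ID decomposes
into the PCI coordinates used to look up the offending device:

u8 bus = PCI_BUS_NUM(source_id);	/* source_id >> 8 */
u8 devfn = PCI_DEVFN(PCI_SLOT(source_id), PCI_FUNC(source_id));
struct pci_dev *pdev = pci_get_bus_and_slot(bus, devfn);

if (pdev)
	pci_dev_put(pdev);	/* drop the reference when done */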

Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/dmar.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 93 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 38ee91b..b1f67fc2 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1555,6 +1555,31 @@ static const char *irq_remap_fault_reasons[] =
 	"Blocked an interrupt request due to source-id verification failure",
 };
 
+/* fault data and status */
+enum intel_iommu_fault_reason {
+	INTEL_IOMMU_FAULT_REASON_SW,
+	INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT,
+	INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT,
+	INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID,
+	INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH,
+	INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS,
+	INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS,
+	INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID,
+	INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID,
+	INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID,
+	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_RTP,
+	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_CTP,
+	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_PTE,
+	NR_INTEL_IOMMU_FAULT_REASON,
+};
+
+/* fault reasons that are allowed to be reported outside IOMMU subsystem */
+#define INTEL_IOMMU_FAULT_REASON_ALLOWED			\
+	((1ULL << INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH) |	\
+		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS) |	\
+		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS))
+
 static const char *dmar_get_fault_reason(u8 fault_reason, int *fault_type)
 {
 	if (fault_reason >= 0x20 && (fault_reason - 0x20 <
@@ -1635,6 +1660,69 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
 	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
 }
 
+static enum iommu_fault_reason to_iommu_fault_reason(u8 reason)
+{
+	if (reason >= NR_INTEL_IOMMU_FAULT_REASON) {
+		pr_warn("unknown DMAR fault reason %d\n", reason);
+		return IOMMU_FAULT_REASON_UNKNOWN;
+	}
+	switch (reason) {
+	case INTEL_IOMMU_FAULT_REASON_SW:
+	case INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT:
+	case INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT:
+	case INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID:
+	case INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH:
+	case INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID:
+	case INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID:
+		return IOMMU_FAULT_REASON_INTERNAL;
+	case INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID:
+	case INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS:
+	case INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS:
+		return IOMMU_FAULT_REASON_PERMISSION;
+	default:
+		return IOMMU_FAULT_REASON_UNKNOWN;
+	}
+}
+
+static void report_fault_to_device(struct intel_iommu *iommu, u64 addr, int type,
+				int fault_type, enum intel_iommu_fault_reason reason, u16 sid)
+{
+	struct iommu_fault_event event = {0};
+	struct pci_dev *pdev;
+	u8 bus, devfn;
+
+	/* check if fault reason is worth reporting outside IOMMU */
+	if (!((1 << reason) & INTEL_IOMMU_FAULT_REASON_ALLOWED)) {
+		pr_debug("Fault reason %d not allowed to report to device\n",
+			reason);
+		return;
+	}
+
+	bus = PCI_BUS_NUM(sid);
+	devfn = PCI_DEVFN(PCI_SLOT(sid), PCI_FUNC(sid));
+	/*
+	 * Resolve the struct device for the offending source ID so the
+	 * fault can be reported against it.
+	 */
+	pdev = pci_get_bus_and_slot(bus, devfn);
+	if (!pdev) {
+		pr_warn("No PCI device found for source ID %x\n", sid);
+		return;
+	}
+	/*
+	 * Unrecoverable faults are recorded per IOMMU; the registered
+	 * handler can resolve the PCI device based on the source ID.
+	 */
+	event.reason = to_iommu_fault_reason(reason);
+	event.addr = addr;
+	event.type = IOMMU_FAULT_DMA_UNRECOV;
+	event.prot = type ? IOMMU_READ : IOMMU_WRITE;
+	dev_warn(&pdev->dev, "report device unrecoverable fault: %d, %x, %d\n",
+		event.reason, sid, event.type);
+	iommu_report_device_fault(&pdev->dev, &event);
+	pci_dev_put(pdev);
+}
+
 static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 		u8 fault_reason, u16 source_id, unsigned long long addr)
 {
@@ -1648,11 +1736,15 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 			source_id >> 8, PCI_SLOT(source_id & 0xFF),
 			PCI_FUNC(source_id & 0xFF), addr >> 48,
 			fault_reason, reason);
-	else
+	else {
 		pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx [fault reason %02d] %s\n",
 		       type ? "DMA Read" : "DMA Write",
 		       source_id >> 8, PCI_SLOT(source_id & 0xFF),
 		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+	}
+	report_fault_to_device(iommu, addr, type, fault_type,
+			fault_reason, source_id);
+
 	return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 13/16] iommu/intel-svm: notify page request to guest
@ 2017-11-17 18:55   ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

If the source device of a page request has its PASID table pointer
bound to a guest, the first level page tables are owned by the guest.
In this case, we shall let the guest OS manage the page fault.

This patch uses the IOMMU fault notification API to send notifications,
possibly via VFIO, to the guest OS. Once guest pages are faulted in, the
guest will issue a page response which will be passed down via the
invalidation passdown APIs.
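
A sketch of the consumer side of this flow; the VFIO relay helper and
context are hypothetical, not part of this series:

static int vfio_iommu_fault_handler(struct iommu_fault_event *evt, void *data)
{
	struct vfio_relay_ctx *ctx = data;	/* hypothetical */

	if (evt->type != IOMMU_FAULT_PAGE_REQ)
		return -EOPNOTSUPP;

	/*
	 * page_req_group_id and iommu_private must be preserved for the
	 * eventual page response.
	 */
	return vfio_relay_page_request(ctx, evt);	/* hypothetical */
}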

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-svm.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/iommu.h     |  1 +
 2 files changed, 74 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index f6697e5..77c25d8 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -555,6 +555,71 @@ static bool is_canonical_address(u64 addr)
 	return (((saddr << shift) >> shift) == saddr);
 }
 
+static int prq_to_iommu_prot(struct page_req_dsc *req)
+{
+	int prot = 0;
+
+	if (req->rd_req)
+		prot |= IOMMU_FAULT_READ;
+	if (req->wr_req)
+		prot |= IOMMU_FAULT_WRITE;
+	if (req->exe_req)
+		prot |= IOMMU_FAULT_EXEC;
+	if (req->priv_req)
+		prot |= IOMMU_FAULT_PRIV;
+
+	return prot;
+}
+
+static int intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
+{
+	int ret = 0;
+	struct iommu_fault_event event = {0};
+	struct pci_dev *pdev;
+
+	/*
+	 * If the caller does not provide a struct device, this is the case
+	 * where the guest PASID table is bound to the device. So we need to
+	 * retrieve the struct device from the page request descriptor, then
+	 * proceed.
+	 */
+	if (!dev) {
+		pdev = pci_get_bus_and_slot(desc->bus, desc->devfn);
+		if (!pdev) {
+			pr_err("No PCI device found for PRQ [%02x:%02x.%d]\n",
+				desc->bus, PCI_SLOT(desc->devfn),
+				PCI_FUNC(desc->devfn));
+			return -ENODEV;
+		}
+		dev = &pdev->dev;
+	} else if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+		pci_dev_get(pdev);
+	} else {
+		return -ENODEV;
+	}
+
+	pr_debug("Notify PRQ device [%02x:%02x.%d]\n",
+		desc->bus, PCI_SLOT(desc->devfn),
+		PCI_FUNC(desc->devfn));
+
+	/* invoke device fault handler if registered */
+	if (iommu_has_device_fault_handler(dev)) {
+		/* Fill in event data for device specific processing */
+		event.type = IOMMU_FAULT_PAGE_REQ;
+		event.addr = desc->addr;
+		event.pasid = desc->pasid;
+		event.page_req_group_id = desc->prg_index;
+		event.prot = prq_to_iommu_prot(desc);
+		event.last_req = desc->lpig;
+		event.pasid_valid = 1;
+		event.iommu_private = desc->private;
+		ret = iommu_report_device_fault(&pdev->dev, &event);
+	}
+
+	pci_dev_put(pdev);
+
+	return ret;
+}
+
 static irqreturn_t prq_event_thread(int irq, void *d)
 {
 	struct intel_iommu *iommu = d;
@@ -578,7 +643,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		handled = 1;
 
 		req = &iommu->prq[head / sizeof(*req)];
-
+		/*
+		 * If the PRQ is to be handled outside the IOMMU driver via a
+		 * receiver of the fault notifiers, we skip the page response
+		 * here.
+		 */
+		if (!intel_svm_prq_report(NULL, req))
+			goto prq_advance;
 		result = QI_RESP_FAILURE;
 		address = (u64)req->addr << VTD_PAGE_SHIFT;
 		if (!req->pasid_present) {
@@ -649,11 +719,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		if (WARN_ON(&sdev->list == &svm->devs))
 			sdev = NULL;
 
-		if (sdev && sdev->ops && sdev->ops->fault_cb) {
-			int rwxp = (req->rd_req << 3) | (req->wr_req << 2) |
-				(req->exe_req << 1) | (req->priv_req);
-			sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr, req->private, rwxp, result);
-		}
+		intel_svm_prq_report(sdev->dev, req);
 		/* We get here in the error case where the PASID lookup failed,
 		   and these can be NULL. Do not use them below this point! */
 		sdev = NULL;
@@ -679,7 +745,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 
 			qi_submit_sync(&resp, iommu);
 		}
-
+prq_advance:
 		head = (head + sizeof(*req)) & PRQ_RING_MASK;
 	}
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 841c044..3083796b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -42,6 +42,7 @@
  * if the IOMMU page table format is equivalent.
  */
 #define IOMMU_PRIV	(1 << 5)
+#define IOMMU_EXEC	(1 << 6)
 
 struct iommu_ops;
 struct iommu_group;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 13/16] iommu/intel-svm: notify page request to guest
@ 2017-11-17 18:55   ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Jean Delvare

If the source device of a page request has its PASID table pointer
bound to a guest, the first level page tables are owned by the guest.
In this case, we shall let the guest OS manage the page fault.

This patch uses the IOMMU fault notification API to send notifications,
possibly via VFIO, to the guest OS. Once the guest pages are faulted in,
the guest will issue a page response which will be passed down via the
invalidation passdown APIs.
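
As a rough sketch of the consumer side (the exact
iommu_register_device_fault_handler() arguments are assumed here for
illustration; the iommu_fault_event fields are the ones introduced in
this series):

	static int vfio_prq_handler(struct iommu_fault_event *evt, void *data)
	{
		if (evt->type != IOMMU_FAULT_PAGE_REQ)
			return -EOPNOTSUPP;
		/*
		 * Forward evt->pasid, evt->addr, evt->page_req_group_id
		 * and evt->last_req to the guest, e.g. through a VFIO
		 * eventfd/ring, then reply later via the passdown APIs.
		 */
		return 0;
	}

	/* at device setup time (sketch) */
	iommu_register_device_fault_handler(dev, vfio_prq_handler, vdev);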

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/iommu/intel-svm.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/iommu.h     |  1 +
 2 files changed, 74 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index f6697e5..77c25d8 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -555,6 +555,71 @@ static bool is_canonical_address(u64 addr)
 	return (((saddr << shift) >> shift) == saddr);
 }
 
+static int prq_to_iommu_prot(struct page_req_dsc *req)
+{
+	int prot = 0;
+
+	if (req->rd_req)
+		prot |= IOMMU_FAULT_READ;
+	if (req->wr_req)
+		prot |= IOMMU_FAULT_WRITE;
+	if (req->exe_req)
+		prot |= IOMMU_FAULT_EXEC;
+	if (req->priv_req)
+		prot |= IOMMU_FAULT_PRIV;
+
+	return prot;
+}
+
+static int intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
+{
+	int ret = 0;
+	struct iommu_fault_event event;
+	struct pci_dev *pdev;
+
+	/*
+	 * If the caller does not provide a struct device, the guest PASID
+	 * table is bound to the device. We then need to retrieve the
+	 * struct device from the page request descriptor before proceeding.
+	 */
+	if (!dev) {
+		pdev = pci_get_bus_and_slot(desc->bus, desc->devfn);
+		if (!pdev) {
+			pr_err("No PCI device found for PRQ [%02x:%02x.%d]\n",
+				desc->bus, PCI_SLOT(desc->devfn),
+				PCI_FUNC(desc->devfn));
+			return -ENODEV;
+		}
+		dev = &pdev->dev;
+	} else if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+		pci_dev_get(pdev);
+	} else
+		return -ENODEV;
+
+	pr_debug("Notify PRQ device [%02x:%02x.%d]\n",
+		desc->bus, PCI_SLOT(desc->devfn),
+		PCI_FUNC(desc->devfn));
+
+	/* invoke device fault handler if registered */
+	if (iommu_has_device_fault_handler(dev)) {
+		/* Fill in event data for device specific processing */
+		event.type = IOMMU_FAULT_PAGE_REQ;
+		event.addr = desc->addr;
+		event.pasid = desc->pasid;
+		event.page_req_group_id = desc->prg_index;
+		event.prot = prq_to_iommu_prot(desc);
+		event.last_req = desc->lpig;
+		event.pasid_valid = 1;
+		event.iommu_private = desc->private;
+		ret = iommu_report_device_fault(&pdev->dev, &event);
+	}
+
+	pci_dev_put(pdev);
+
+	return ret;
+}
+
 static irqreturn_t prq_event_thread(int irq, void *d)
 {
 	struct intel_iommu *iommu = d;
@@ -578,7 +643,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		handled = 1;
 
 		req = &iommu->prq[head / sizeof(*req)];
-
+		/*
+		 * If the PRQ is to be handled outside the IOMMU driver, by a
+		 * receiver of the fault notifiers, skip the page response here.
+		 */
+		if (!intel_svm_prq_report(NULL, req))
+			goto prq_advance;
 		result = QI_RESP_FAILURE;
 		address = (u64)req->addr << VTD_PAGE_SHIFT;
 		if (!req->pasid_present) {
@@ -649,11 +719,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		if (WARN_ON(&sdev->list == &svm->devs))
 			sdev = NULL;
 
-		if (sdev && sdev->ops && sdev->ops->fault_cb) {
-			int rwxp = (req->rd_req << 3) | (req->wr_req << 2) |
-				(req->exe_req << 1) | (req->priv_req);
-			sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr, req->private, rwxp, result);
-		}
+		intel_svm_prq_report(sdev->dev, req);
 		/* We get here in the error case where the PASID lookup failed,
 		   and these can be NULL. Do not use them below this point! */
 		sdev = NULL;
@@ -679,7 +745,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 
 			qi_submit_sync(&resp, iommu);
 		}
-
+	prq_advance:
 		head = (head + sizeof(*req)) & PRQ_RING_MASK;
 	}
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 841c044..3083796b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -42,6 +42,7 @@
  * if the IOMMU page table format is equivalent.
  */
 #define IOMMU_PRIV	(1 << 5)
+#define IOMMU_EXEC	(1 << 6)
 
 struct iommu_ops;
 struct iommu_group;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 14/16] iommu/intel-svm: replace dev ops with fault report API
@ 2017-11-17 18:55   ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

With the introduction of the generic IOMMU device fault reporting API,
we can replace the private fault callback functions with the standard
fault reporting function and generic event data.
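
For a driver converting from the old interface, the change looks roughly
like this (sketch; the fault handler registration arguments are assumed
for illustration):

	/* Before: private callback supplied at bind time */
	ret = intel_svm_bind_mm(dev, &pasid, 0, &my_svm_ops);

	/*
	 * After: bind without ops; faults arrive through the generic
	 * fault reporting API instead.
	 */
	ret = intel_svm_bind_mm(dev, &pasid, 0);
	if (!ret)
		ret = iommu_register_device_fault_handler(dev, my_fault_handler,
							  drvdata);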

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-svm.c |  7 +------
 include/linux/intel-svm.h | 20 +++-----------------
 2 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 77c25d8..93b1849 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -283,7 +283,7 @@ static const struct mmu_notifier_ops intel_mmuops = {
 
 static DEFINE_MUTEX(pasid_mutex);
 
-int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
+int intel_svm_bind_mm(struct device *dev, int *pasid, int flags)
 {
 	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
 	struct intel_svm_dev *sdev;
@@ -329,10 +329,6 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 
 			list_for_each_entry(sdev, &svm->devs, list) {
 				if (dev == sdev->dev) {
-					if (sdev->ops != ops) {
-						ret = -EBUSY;
-						goto out;
-					}
 					sdev->users++;
 					goto success;
 				}
@@ -358,7 +354,6 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 	}
 	/* Finish the setup now we know we're keeping it */
 	sdev->users = 1;
-	sdev->ops = ops;
 	init_rcu_head(&sdev->rcu);
 
 	if (!svm) {
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index 99bc5b3..a39a502 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -18,18 +18,6 @@
 
 struct device;
 
-struct svm_dev_ops {
-	void (*fault_cb)(struct device *dev, int pasid, u64 address,
-			 u32 private, int rwxp, int response);
-};
-
-/* Values for rxwp in fault_cb callback */
-#define SVM_REQ_READ	(1<<3)
-#define SVM_REQ_WRITE	(1<<2)
-#define SVM_REQ_EXEC	(1<<1)
-#define SVM_REQ_PRIV	(1<<0)
-
-
 /*
  * The SVM_FLAG_PRIVATE_PASID flag requests a PASID which is *not* the "main"
  * PASID for the current process. Even if a PASID already exists, a new one
@@ -60,7 +48,6 @@ struct svm_dev_ops {
  * @dev:	Device to be granted access
  * @pasid:	Address for allocated PASID
  * @flags:	Flags. Later for requesting supervisor mode, etc.
- * @ops:	Callbacks to device driver
  *
  * This function attempts to enable PASID support for the given device.
  * If the @pasid argument is non-%NULL, a PASID is allocated for access
@@ -82,8 +69,7 @@ struct svm_dev_ops {
  * Multiple calls from the same process may result in the same PASID
  * being re-used. A reference count is kept.
  */
-extern int intel_svm_bind_mm(struct device *dev, int *pasid, int flags,
-			     struct svm_dev_ops *ops);
+extern int intel_svm_bind_mm(struct device *dev, int *pasid, int flags);
 
 /**
  * intel_svm_unbind_mm() - Unbind a specified PASID
@@ -120,7 +106,7 @@ extern int intel_svm_is_pasid_valid(struct device *dev, int pasid);
 #else /* CONFIG_INTEL_IOMMU_SVM */
 
 static inline int intel_svm_bind_mm(struct device *dev, int *pasid,
-				    int flags, struct svm_dev_ops *ops)
+				int flags)
 {
 	return -ENOSYS;
 }
@@ -136,6 +122,6 @@ static int intel_svm_is_pasid_valid(struct device *dev, int pasid)
 }
 #endif /* CONFIG_INTEL_IOMMU_SVM */
 
-#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0, NULL))
+#define intel_svm_available(dev) (!intel_svm_bind_mm((dev), NULL, 0))
 
 #endif /* __INTEL_SVM_H__ */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 15/16] iommu: introduce page response function
@ 2017-11-17 18:55   ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

When nested translation is turned on and the guest owns the
first level page tables, device page requests can be forwarded
to the guest for fault handling. As the page response is returned
by the guest, the IOMMU driver on the host needs to process the
response, which informs the device and completes the page request
transaction.

This patch introduces a generic API function for page response
passing from the guest or other in-kernel users. The definition of
the generic data is based on the PCI ATS specification and is not
limited to any vendor.
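
A minimal sketch of an in-kernel user completing a page request it
received earlier through the fault reporting API (evt is the reported
iommu_fault_event; responding with SUCCESS is just an example):

	struct page_response_msg msg = {
		.paddr			= evt->addr,
		.pasid			= evt->pasid,
		.pasid_present		= evt->pasid_valid,
		.page_req_group_id	= evt->page_req_group_id,
		.last_req		= evt->last_req,
		.resp_code		= IOMMU_PAGE_RESP_SUCCESS,
		.type			= IOMMU_PAGE_GROUP_RESP,
	};

	ret = iommu_page_response(domain, dev, &msg);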

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/iommu.c | 14 ++++++++++++++
 include/linux/iommu.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 97b7990..7aefb40 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1416,6 +1416,20 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
 
+int iommu_page_response(struct iommu_domain *domain, struct device *dev,
+			struct page_response_msg *msg)
+{
+	int ret = 0;
+
+	if (unlikely(!domain->ops->page_response))
+		return -ENODEV;
+
+	ret = domain->ops->page_response(domain, dev, msg);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_page_response);
+
 static void __iommu_detach_device(struct iommu_domain *domain,
 				  struct device *dev)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 3083796b..17f698b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -163,6 +163,43 @@ struct iommu_resv_region {
 
 #ifdef CONFIG_IOMMU_API
 
+enum page_response_type {
+	IOMMU_PAGE_STREAM_RESP = 1,
+	IOMMU_PAGE_GROUP_RESP,
+};
+
+/**
+ * Generic page response information based on PCI ATS and PASID spec.
+ * @paddr: servicing page address
+ * @pasid: contains process address space ID, used in shared virtual memory(SVM)
+ * @rid: requestor ID
+ * @did: destination device ID
+ * @last_req: last request in a page request group
+ * @resp_code: response code
+ * @page_req_group_id: page request group index
+ * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
+ * @type: group or stream response
+ * @private_data: uniquely identifies device-specific private data for an
+ *                individual page response
+ *
+ */
+struct page_response_msg {
+	u64 paddr;
+	u32 pasid;
+	u32 rid:16;
+	u32 did:16;
+	u32 resp_code:4;
+	u32 last_req:1;
+	u32 pasid_present:1;
+#define IOMMU_PAGE_RESP_SUCCESS	0
+#define IOMMU_PAGE_RESP_INVALID	1
+#define IOMMU_PAGE_RESP_FAILURE	0xF
+	u32 page_req_group_id : 9;
+	u32 prot;
+	enum page_response_type type;
+	u32 private_data;
+};
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
@@ -196,6 +233,7 @@ struct iommu_resv_region {
  * @bind_pasid_table: bind pasid table pointer for guest SVM
  * @unbind_pasid_table: unbind pasid table pointer and restore defaults
  * @sva_invalidate: invalidate translation caches of shared virtual address
+ * @page_response: handle page request response
  */
 struct iommu_ops {
 	bool (*capable)(enum iommu_cap);
@@ -251,6 +289,8 @@ struct iommu_ops {
 				struct device *dev);
 	int (*sva_invalidate)(struct iommu_domain *domain,
 		struct device *dev, struct tlb_invalidate_info *inv_info);
+	int (*page_response)(struct iommu_domain *domain, struct device *dev,
+			struct page_response_msg *msg);
 
 	unsigned long pgsize_bitmap;
 };
@@ -472,6 +512,8 @@ extern int iommu_unregister_device_fault_handler(struct device *dev);
 
 extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
 
+extern int iommu_page_response(struct iommu_domain *domain, struct device *dev,
+			struct page_response_msg *msg);
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v3 16/16] iommu/vt-d: add intel iommu page response function
@ 2017-11-17 18:55   ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-17 18:55 UTC (permalink / raw)
  To: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Jacob Pan

This patch adds page response support for Intel VT-d.
The generic response data is taken from the IOMMU API
and parsed into the VT-d specific response descriptor format.
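
For reference, the generic fields map onto the VT-d page group response
descriptor roughly as follows (derived from the code below):

	/*
	 * page_response_msg field       VT-d descriptor field
	 * ---------------------------------------------------
	 * msg->pasid                ->  QI_PGRP_PASID()
	 * msg->did                  ->  QI_PGRP_DID()
	 * msg->pasid_present        ->  QI_PGRP_PASID_P()
	 * msg->page_req_group_id    ->  QI_PGRP_IDX()
	 * msg->resp_code            ->  QI_PGRP_RESP_CODE()
	 */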

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel-iommu.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e1bd219..7f95827 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5171,6 +5171,35 @@ static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
 	return ret;
 }
 
+int intel_iommu_page_response(struct iommu_domain *domain, struct device *dev,
+			struct page_response_msg *msg)
+{
+	struct qi_desc resp;
+	struct intel_iommu *iommu = dev_to_intel_iommu(dev);
+
+	/* TODO: sanitize response message */
+	if (msg->last_req) {
+		/* Page Group Response */
+		resp.low = QI_PGRP_PASID(msg->pasid) |
+			QI_PGRP_DID(msg->did) |
+			QI_PGRP_PASID_P(msg->pasid_present) |
+			QI_PGRP_RESP_TYPE;
+		/* REVISIT: allow private data passing from device prq */
+		resp.high = QI_PGRP_IDX(msg->page_req_group_id) |
+			QI_PGRP_PRIV(msg->private_data) | QI_PGRP_RESP_CODE(msg->resp_code);
+	} else {
+		/* Page Stream Response */
+		resp.low = QI_PSTRM_IDX(msg->page_req_group_id) |
+			QI_PSTRM_PRIV(msg->private_data) | QI_PSTRM_BUS(PCI_BUS_NUM(msg->did)) |
+			QI_PSTRM_PASID(msg->pasid) | QI_PSTRM_RESP_TYPE;
+		resp.high = QI_PSTRM_ADDR(msg->paddr) | QI_PSTRM_DEVFN(msg->did & 0xff) |
+			QI_PSTRM_RESP_CODE(msg->resp_code);
+	}
+	qi_submit_sync(&resp, iommu);
+
+	return 0;
+}
+
 static int intel_iommu_map(struct iommu_domain *domain,
 			   unsigned long iova, phys_addr_t hpa,
 			   size_t size, int iommu_prot)
@@ -5606,6 +5635,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.bind_pasid_table	= intel_iommu_bind_pasid_table,
 	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
 	.sva_invalidate		= intel_iommu_sva_invalidate,
+	.page_response		= intel_iommu_page_response,
 #endif
 	.map			= intel_iommu_map,
 	.unmap			= intel_iommu_unmap,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 05/16] iommu/vt-d: support flushing more TLB types
@ 2017-11-20 14:20     ` Lukoshkov, Maksim
  0 siblings, 0 replies; 94+ messages in thread
From: Lukoshkov, Maksim @ 2017-11-20 14:20 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Wysocki, Rafael J, Alex Williamson
  Cc: Lan, Tianyu, Yi L, Liu, Jean Delvare

On 11/17/2017 18:55, Jacob Pan wrote:
> +void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> +			u16 qdep, u64 addr, unsigned mask)
> +{
> +	struct qi_desc desc;
> +
> +	pr_debug_ratelimited("%s: sid %d, pfsid %d, qdep %d, addr %llx, mask %d\n",
> +		__func__, sid, pfsid, qdep, addr, mask);
>   	if (mask) {
>   		BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1));
>   		addr |= (1ULL << (VTD_PAGE_SHIFT + mask - 1)) - 1;
> @@ -1352,7 +1366,41 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
>   		qdep = 0;
>   
>   	desc.low = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) |
> -		   QI_DIOTLB_TYPE;
> +		   QI_DIOTLB_TYPE | QI_DEV_IOTLB_SID(pfsid);

QI_DEV_IOTLB_SID(pfsid) -> QI_DEV_EIOTLB_PFSID(pfsid)?

> +
> +	qi_submit_sync(&desc, iommu);
> +}
> +

Regards,
Maksim Lukoshkov

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 05/16] iommu/vt-d: support flushing more TLB types
@ 2017-11-20 18:40       ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-20 18:40 UTC (permalink / raw)
  To: Lukoshkov, Maksim
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Wysocki, Rafael J, Alex Williamson, Lan, Tianyu, Yi L, Liu,
	Jean Delvare, jacob.jun.pan

On Mon, 20 Nov 2017 14:20:31 +0000
"Lukoshkov, Maksim" <maksim.lukoshkov@intel.com> wrote:

> On 11/17/2017 18:55, Jacob Pan wrote:
> > +void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16
> > pfsid,
> > +			u16 qdep, u64 addr, unsigned mask)
> > +{
> > +	struct qi_desc desc;
> > +
> > +	pr_debug_ratelimited("%s: sid %d, pfsid %d, qdep %d, addr
> > %llx, mask %d\n",
> > +		__func__, sid, pfsid, qdep, addr, mask);
> >   	if (mask) {
> >   		BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) -
> > 1)); addr |= (1ULL << (VTD_PAGE_SHIFT + mask - 1)) - 1;
> > @@ -1352,7 +1366,41 @@ void qi_flush_dev_iotlb(struct intel_iommu
> > *iommu, u16 sid, u16 qdep, qdep = 0;
> >   
> >   	desc.low = QI_DEV_IOTLB_SID(sid) |
> > QI_DEV_IOTLB_QDEP(qdep) |
> > -		   QI_DIOTLB_TYPE;
> > +		   QI_DIOTLB_TYPE | QI_DEV_IOTLB_SID(pfsid);  
> 
> QI_DEV_IOTLB_SID(pfsid) -> QI_DEV_EIOTLB_PFSID(pfsid)?
> 
good catch! thank you.
> > +
> > +	qi_submit_sync(&desc, iommu);
> > +}
> > +  
> 
> Regards,
> Maksim Lukoshkov

[Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 08/16] iommu: introduce device fault data
  2017-11-17 18:55 ` [PATCH v3 08/16] iommu: introduce device fault data Jacob Pan
@ 2017-11-24 12:03   ` Jean-Philippe Brucker
  2017-11-29 21:55       ` Jacob Pan
  2018-01-10 11:41   ` Jean-Philippe Brucker
  1 sibling, 1 reply; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-11-24 12:03 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Yi L, Liu, Jean Delvare

Hi Jacob,

On 17/11/17 18:55, Jacob Pan wrote:
> Device faults detected by IOMMU can be reported outside IOMMU
> subsystem for further processing. This patch intends to provide
> a generic device fault data such that device drivers can be
> communicated with IOMMU faults without model specific knowledge.
> 
> The proposed format is the result of discussion at:
> https://lkml.org/lkml/2017/11/10/291
> Part of the code is based on Jean-Philippe Brucker's patchset
> (https://patchwork.kernel.org/patch/9989315/).
> 
> The assumption is that model specific IOMMU driver can filter and
> handle most of the internal faults if the cause is within IOMMU driver
> control. Therefore, the fault reasons can be reported are grouped
> and generalized based common specifications such as PCI ATS.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>

This looks good from my point of view. And since it's not UAPI, we can
always update it if it turns out that device drivers need more information.

[...]
> +/**
> + * struct iommu_fault_event - Generic per device fault data
> + *
> + * - PCI and non-PCI devices
> + * - Recoverable faults (e.g. page request), information based on PCI ATS
> + * and PASID spec.
> + * - Un-recoverable faults of device interest
> + * - DMA remapping and IRQ remapping faults
> +
> + * @type contains fault type.
> + * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
> + *         faults are not reported
> + * @addr: tells the offending page address
> + * @pasid: contains process address space ID, used in shared virtual memory(SVM)
> + * @rid: requestor ID

This comment can be removed

Thanks,
Jean

> + * @page_req_group_id: page request group index
> + * @last_req: last request in a page request group
> + * @pasid_valid: indicates if the PRQ has a valid PASID
> + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
> + * @device_private: if present, uniquely identify device-specific
> + *                  private data for an individual page request.
> + * @iommu_private: used by the IOMMU driver for storing fault-specific
> + *                 data. Users should not modify this field before
> + *                 sending the fault response.
> + */
> +struct iommu_fault_event {
> +	enum iommu_fault_type type;
> +	enum iommu_fault_reason reason;
> +	u64 addr;
> +	u32 pasid;
> +	u32 page_req_group_id : 9;
> +	u32 last_req : 1;
> +	u32 pasid_valid : 1;
> +	u32 prot;
> +	u64 device_private;
> +	u64 iommu_private;
> +};

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
  2017-11-17 18:55   ` Jacob Pan
  (?)
@ 2017-11-24 12:03   ` Jean-Philippe Brucker
  2017-12-04 21:37       ` Jacob Pan
  -1 siblings, 1 reply; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-11-24 12:03 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Jean Delvare

On 17/11/17 18:55, Jacob Pan wrote:
> When nested translation is turned on and guest owns the
> first level page tables, device page request can be forwared
> to the guest for handling faults. As the page response returns
> by the guest, IOMMU driver on the host need to process the
> response which informs the device and completes the page request
> transaction.
> 
> This patch introduces generic API function for page response
> passing from the guest or other in-kernel users. The definitions of
> the generic data is based on PCI ATS specification not limited to
> any vendor.>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/iommu.c | 14 ++++++++++++++
>  include/linux/iommu.h | 42 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 97b7990..7aefb40 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1416,6 +1416,20 @@ int iommu_sva_invalidate(struct iommu_domain *domain,
>  }
>  EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
>  
> +int iommu_page_response(struct iommu_domain *domain, struct device *dev,
> +			struct page_response_msg *msg)

I think it's simpler, both for IOMMU and device drivers, to pass the exact
structure received in the fault handler back to the IOMMU driver, along
with a separate response status. So maybe

int iommu_page_response(struct iommu_domain *domain, struct device *dev,
			struct iommu_fault_event *event, u32 response)

And then you'd just need to define the IOMMU_PAGE_RESPONSE_* values.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 01/16] iommu: introduce bind_pasid_table API function
  2017-11-17 18:54   ` Jacob Pan
  (?)
@ 2017-11-24 12:04   ` Jean-Philippe Brucker
  2017-11-29 22:01     ` Jacob Pan
  -1 siblings, 1 reply; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-11-24 12:04 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Yi L, Liu, Jean Delvare

On 17/11/17 18:54, Jacob Pan wrote:
> Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> use in the guest:
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> 
> As part of the proposed architecture, when an SVM capable PCI
> device is assigned to a guest, nested mode is turned on. Guest owns the
> first level page tables (request with PASID) which performs GVA->GPA
> translation. Second level page tables are owned by the host for GPA->HPA
> translation for both request with and without PASID.
> 
> A new IOMMU driver interface is therefore needed to perform tasks as
> follows:
> * Enable nested translation and appropriate translation type
> * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> 
> This patch introduces new API functions to perform bind/unbind guest PASID
> tables. Based on common data, model specific IOMMU drivers can be extended
> to perform the specific steps for binding pasid table of assigned devices.
> 
[...]
>  
>  #define IOMMU_READ	(1 << 0)
>  #define IOMMU_WRITE	(1 << 1)
> @@ -187,6 +188,8 @@ struct iommu_resv_region {
>   * @domain_get_windows: Return the number of windows for a domain
>   * @of_xlate: add OF master IDs to iommu grouping
>   * @pgsize_bitmap: bitmap of all possible supported page sizes
> + * @bind_pasid_table: bind pasid table pointer for guest SVM
> + * @unbind_pasid_table: unbind pasid table pointer and restore defaults

I was wondering, are you planning on using the IOMMU_DOMAIN_NESTING
attribute? It differentiates a domain that supports
bind/unbind_pasid_table and map/unmap GPA (virt SVM), from the domain that
supports bind/unbind individual PASIDs and map/unmap IOVA (host SVM)?

Users can set this attribute by using the VFIO_TYPE1_NESTING_IOMMU type
instead of VFIO_TYPE1v2_IOMMU, which seems ideal for what we're trying to do.

[...]
> +/**
> + * PASID table data used to bind guest PASID table to the host IOMMU. This will
> + * enable guest managed first level page tables.
> + * @version: for future extensions and identification of the data format
> + * @bytes: size of this structure
> + * @base_ptr:	PASID table pointer
> + * @pasid_bits:	number of bits supported in the guest PASID table, must be less
> + *		or equal than the host supported PASID size.

Why remove the @model parameter?

> + */
> +struct pasid_table_config {
> +	__u32 version;
> +#define PASID_TABLE_CFG_VERSION 1
> +	__u32 bytes;
> +	__u64 base_ptr;
> +	__u8 pasid_bits;
> +	/* reserved for extension of vendor specific config */
> +	union {
> +		struct {
> +			/* ARM specific fields */
> +			bool pasid0_dma_no_pasid;
> +		} arm;

I think @model is still required for sanity check, but could you remove
the whole union for the moment? Other parameters will be needed and I'm
still thinking about it, so I'll add the arm struct back in a future patch.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 03/16] iommu: introduce iommu invalidate API function
  2017-11-17 18:55 ` [PATCH v3 03/16] iommu: introduce iommu invalidate API function Jacob Pan
@ 2017-11-24 12:04   ` Jean-Philippe Brucker
  2017-12-15 19:02       ` Jean-Philippe Brucker
  2017-12-28 19:25       ` Jacob Pan
  0 siblings, 2 replies; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-11-24 12:04 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Liu, Yi L, Liu, Jean Delvare

Hi,

On 17/11/17 18:55, Jacob Pan wrote:
> From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
> 
> When an SVM capable device is assigned to a guest, the first level page
> tables are owned by the guest and the guest PASID table pointer is
> linked to the device context entry of the physical IOMMU.
> 
> Host IOMMU driver has no knowledge of caching structure updates unless
> the guest invalidation activities are passed down to the host. The
> primary usage is derived from emulated IOMMU in the guest, where QEMU
> can trap invalidation activities before passing them down to the
> host/physical IOMMU.
> Since the invalidation data are obtained from user space and will be
> written into physical IOMMU, we must allow security check at various
> layers. Therefore, generic invalidation data format are proposed here,
> model specific IOMMU drivers need to convert them into their own format.
> 
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[...]
>  #endif /* __LINUX_IOMMU_H */
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 651ad5d..039ba36 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -36,4 +36,66 @@ struct pasid_table_config {
>  	};
>  };
>  
> +enum iommu_inv_granularity {
> +	IOMMU_INV_GRANU_GLOBAL,		/* all TLBs invalidated */
> +	IOMMU_INV_GRANU_DOMAIN,		/* all TLBs associated with a domain */
> +	IOMMU_INV_GRANU_DEVICE,		/* caching structure associated with a
> +					 * device ID
> +					 */

I thought you were planning on removing these? If we do need global
invalidation, for example the guest clears the whole PASID table and
doesn't want to send individual GRANU_ALL_PASID invalidations, maybe keep
only GRANU_DOMAIN?

> +	IOMMU_INV_GRANU_DOMAIN_PAGE,	/* address range with a domain */
> +	IOMMU_INV_GRANU_ALL_PASID,	/* cache of a given PASID */
> +	IOMMU_INV_GRANU_PASID_SEL,	/* only invalidate specified PASID */

GRANU_PASID_SEL seems redundant, don't you already get it by default with
GRANU_ALL_PASID and GRANU_DOMAIN_PAGE (with IOMMU_INVALIDATE_PASID_TAGGED
flag)?

> +
> +	IOMMU_INV_GRANU_NG_ALL_PASID,	/* non-global within all PASIDs */
> +	IOMMU_INV_GRANU_NG_PASID,	/* non-global within a PASIDs */

Don't you get the "NG" behavior by not passing the
IOMMU_INVALIDATE_GLOBAL_PAGE flag defined below?

> +	IOMMU_INV_GRANU_PAGE_PASID,	/* page-selective within a PASID */

And don't you get this with GRANU_DOMAIN_PAGE+IOMMU_INVALIDATE_PASID_TAGGED?

> +	IOMMU_INV_NR_GRANU,
> +};
> +
> +enum iommu_inv_type {
> +	IOMMU_INV_TYPE_DTLB,	/* device IOTLB */
> +	IOMMU_INV_TYPE_TLB,	/* IOMMU paging structure cache */
> +	IOMMU_INV_TYPE_PASID,	/* PASID cache */
> +	IOMMU_INV_TYPE_CONTEXT,	/* device context entry cache */
> +	IOMMU_INV_NR_TYPE
> +};

When the guest removes a PASID entry, it would have to send DTLB, TLB and
PASID invalidations separately? Could we define this inv_type as
cumulative, to avoid redundant invalidation requests:

* TYPE_DTLB only invalidates ATC entries.
* TYPE_TLB invalidates both ATC and IOTLB entries.
* TYPE_PASID invalidates all ATC and IOTLB entries for a PASID, and also
the PASID cache entry.
* TYPE_CONTEXT invalidates all. Although is it needed by userspace or just
here for completeness? "CONTEXT" is specific to VT-d (doesn't exist on
AMD and has a different meaning on SMMU), how about "DEVICE" instead?

This is important because invalidation will probably become the
bottleneck. The guest shouldn't have to send DTLB and TLB invalidation
separately after each unmapping.
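
In code, the cumulative semantics could look something like this
(illustrative sketch only, not a concrete proposal for the header):

	enum iommu_inv_type {
		IOMMU_INV_TYPE_DTLB,	/* ATC entries only */
		IOMMU_INV_TYPE_TLB,	/* ATC + IOTLB entries */
		IOMMU_INV_TYPE_PASID,	/* ATC + IOTLB + PASID cache entry */
		IOMMU_INV_TYPE_DEVICE,	/* everything cached for the device */
		IOMMU_INV_NR_TYPE
	};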

> +/**
> + * Translation cache invalidation header that contains mandatory meta data.
> + * @version:	info format version, expecting future extesions
> + * @type:	type of translation cache to be invalidated
> + */
> +struct tlb_invalidate_hdr {
> +	__u32 version;
> +#define TLB_INV_HDR_VERSION_1 1
> +	enum iommu_inv_type type;
> +};
> +
> +/**
> + * Translation cache invalidation information, contains generic IOMMU
> + * data which can be parsed based on model ID by model specific drivers.
> + *
> + * @granularity:	requested invalidation granularity, type dependent
> + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.

Having only power-of-two invalidation seems too restrictive for a software
interface. You might have the same problem as above, where the guest or
userspace needs to send lots of invalidation requests. They could be
multiplexed by passing an arbitrary range instead. How about making @size
a __u64?

> + * @pasid:		processor address space ID value per PCI spec.
> + * @addr:		page address to be invalidated
> + * @flags	IOMMU_INVALIDATE_PASID_TAGGED: DMA with PASID tagged,
> + *						@pasid validity can be
> + *						deduced from @granularity

What's the use for this PASID_TAGGED flag if it doesn't define the @pasid
validity?

> + *		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries

LEAF could be reused for multi-level PASID tables, when your first-level
table is already in place and you install a leaf entry, so maybe this
could be:

"IOMMU_INVALIDATE_LEAF: only invalidate leaf table entry"

Thanks,
Jean

> + *		IOMMU_INVALIDATE_GLOBAL_PAGE: global pages> + *
> + */
> +struct tlb_invalidate_info {
> +	struct tlb_invalidate_hdr	hdr;
> +	enum iommu_inv_granularity	granularity;
> +	__u32		flags;
> +#define IOMMU_INVALIDATE_NO_PASID	(1 << 0)
> +#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 1)
> +#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 2)
> +#define IOMMU_INVALIDATE_PASID_TAGGED	(1 << 3)
> +	__u8		size;
> +	__u32		pasid;
> +	__u64		addr;
> +};
>  #endif /* _UAPI_IOMMU_H */
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 08/16] iommu: introduce device fault data
@ 2017-11-29 21:55       ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-29 21:55 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Yi L, Liu,
	Jean Delvare, jacob.jun.pan

On Fri, 24 Nov 2017 12:03:33 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> > + * @rid: requestor ID  
> 
> This comment can be removed

will do, thanks.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 01/16] iommu: introduce bind_pasid_table API function
  2017-11-24 12:04   ` Jean-Philippe Brucker
@ 2017-11-29 22:01     ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-11-29 22:01 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Yi L, Liu,
	Jean Delvare, jacob.jun.pan

On Fri, 24 Nov 2017 12:04:08 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 17/11/17 18:54, Jacob Pan wrote:
> > Virtual IOMMU was proposed to support Shared Virtual Memory (SVM)
> > use in the guest:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg05311.html
> > 
> > As part of the proposed architecture, when an SVM capable PCI
> > device is assigned to a guest, nested mode is turned on. Guest owns
> > the first level page tables (request with PASID) which performs
> > GVA->GPA translation. Second level page tables are owned by the
> > host for GPA->HPA translation for both request with and without
> > PASID.
> > 
> > A new IOMMU driver interface is therefore needed to perform tasks as
> > follows:
> > * Enable nested translation and appropriate translation type
> > * Assign guest PASID table pointer (in GPA) and size to host IOMMU
> > 
> > This patch introduces new API functions to perform bind/unbind
> > guest PASID tables. Based on common data, model specific IOMMU
> > drivers can be extended to perform the specific steps for binding
> > pasid table of assigned devices. 
> [...]
> >  
> >  #define IOMMU_READ	(1 << 0)
> >  #define IOMMU_WRITE	(1 << 1)
> > @@ -187,6 +188,8 @@ struct iommu_resv_region {
> >   * @domain_get_windows: Return the number of windows for a domain
> >   * @of_xlate: add OF master IDs to iommu grouping
> >   * @pgsize_bitmap: bitmap of all possible supported page sizes
> > + * @bind_pasid_table: bind pasid table pointer for guest SVM
> > + * @unbind_pasid_table: unbind pasid table pointer and restore
> > defaults  
> 
> I was wondering, are you planning on using the IOMMU_DOMAIN_NESTING
> attribute? It differentiates a domain that supports
> bind/unbind_pasid_table and map/unmap GPA (virt SVM), from the domain
> that supports bind/unbind individual PASIDs and map/unmap IOVA (host
> SVM)?
> 
> Users can set this attribute by using the VFIO_TYPE1_NESTING_IOMMU
> type instead of VFIO_TYPE1v2_IOMMU, which seems ideal for what we're
> trying to do.
> 
Hmmm, I am not sure. I think the bind/unbind is strictly a per-device
attribute.
Yi, could you comment on the use via VFIO or QEMU?
> [...]
> > +/**
> > + * PASID table data used to bind guest PASID table to the host
> > IOMMU. This will
> > + * enable guest managed first level page tables.
> > + * @version: for future extensions and identification of the data
> > format
> > + * @bytes: size of this structure
> > + * @base_ptr:	PASID table pointer
> > + * @pasid_bits:	number of bits supported in the guest PASID
> > table, must be less
> > + *		or equal than the host supported PASID size.  
> 
> Why remove the @model parameter?
> 
We removed it because we want the config data to be model agnostic. Any
sanity check should be done via some query interface, e.g. sysfs, to
ensure model matching. Once set up, there is no need to embed model info
in every bind operation.

> > + */
> > +struct pasid_table_config {
> > +	__u32 version;
> > +#define PASID_TABLE_CFG_VERSION 1
> > +	__u32 bytes;
> > +	__u64 base_ptr;
> > +	__u8 pasid_bits;
> > +	/* reserved for extension of vendor specific config */
> > +	union {
> > +		struct {
> > +			/* ARM specific fields */
> > +			bool pasid0_dma_no_pasid;
> > +		} arm;  
> 
> I think @model is still required for sanity check, but could you
> remove the whole union for the moment? Other parameters will be
> needed and I'm still thinking about it, so I'll add the arm struct
> back in a future patch.
> 
sure, I will remove it for now.

Thanks,

Jacob
> Thanks,
> Jean
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
@ 2017-12-04 21:37       ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-12-04 21:37 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Jean Delvare,
	jacob.jun.pan

On Fri, 24 Nov 2017 12:03:50 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 17/11/17 18:55, Jacob Pan wrote:
> > When nested translation is turned on and guest owns the
> > first level page tables, device page request can be forwared
> > to the guest for handling faults. As the page response returns
> > by the guest, IOMMU driver on the host need to process the
> > response which informs the device and completes the page request
> > transaction.
> > 
> > This patch introduces generic API function for page response
> > passing from the guest or other in-kernel users. The definitions of
> > the generic data is based on PCI ATS specification not limited to
> > any vendor.>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/iommu/iommu.c | 14 ++++++++++++++
> >  include/linux/iommu.h | 42
> > ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 56
> > insertions(+)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 97b7990..7aefb40 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1416,6 +1416,20 @@ int iommu_sva_invalidate(struct iommu_domain
> > *domain, }
> >  EXPORT_SYMBOL_GPL(iommu_sva_invalidate);
> >  
> > +int iommu_page_response(struct iommu_domain *domain, struct device
> > *dev,
> > +			struct page_response_msg *msg)  
> 
> I think it's simpler, both for IOMMU and device drivers, to pass the
> exact structure received in the fault handler back to the IOMMU
> driver, along with a separate response status. So maybe
> 
> int iommu_page_response(struct iommu_domain *domain, struct device
> *dev, struct iommu_fault_event *event, u32 response)
> 
> And then you'd just need to define the IOMMU_PAGE_RESPONSE_* values.
> 
Apologies for the late response.

I think the simpler interface works very well for the in-kernel driver
use case. But in the case of VFIO, the callback function does not turn
around and send back the page response. The page response comes from the
guest and QEMU, which do not keep track of the PRQ event data.
> Thanks,
> Jean

[Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 06/16] iommu/vt-d: add svm/sva invalidate function
@ 2017-12-05  5:43     ` Lu Baolu
  0 siblings, 0 replies; 94+ messages in thread
From: Lu Baolu @ 2017-12-05  5:43 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig, Liu, Yi L

Hi,

On 11/18/2017 02:55 AM, Jacob Pan wrote:
> This patch adds an Intel VT-d specific function to implement
> the iommu passdown invalidate API for shared virtual address.
>
> The use case is to support caching structure invalidation
> of assigned SVM capable devices. The emulated IOMMU exposes the queue
> invalidation capability and passes down all descriptors from the guest
> to the physical IOMMU.
>
> The assumption is that the guest to host device ID mapping should be
> resolved prior to calling the IOMMU driver. Based on the device handle,
> the host IOMMU driver can replace certain fields before submitting to
> the invalidation queue.
>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 200 +++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/intel-iommu.h |  17 +++-
>  2 files changed, 211 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 556bdd2..000b2b3 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -4981,6 +4981,183 @@ static void intel_iommu_detach_device(struct iommu_domain *domain,
>  	dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
>  }
>  
> +/*
> + * 3D array for converting IOMMU generic type-granularity to VT-d granularity
> + * X indexed by enum iommu_inv_type
> + * Y indicates request without and with PASID
> > + * Z indexed by enum iommu_inv_granularity
> > + *
> > + * For example, to find the VT-d granularity encoding for IOTLB
> > + * type, DMA request with PASID, and page selective, the lookup indices are:
> > + * [1][1][8], where
> + * 1: IOMMU_INV_TYPE_TLB
> + * 1: with PASID
> + * 8: IOMMU_INV_GRANU_PAGE_PASID
> + *
> + */
> +const static int inv_type_granu_map[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
> +	/* extended dev IOTLBs, for dev-IOTLB, only global is valid,
> +	   for dev-EXIOTLB, two valid granu */
> +	{
> +		{1},
> +		{0, 0, 0, 0, 1, 1, 0, 0, 0}
> +	},
> +	/* IOTLB and EIOTLB */
> +	{
> +		{1, 1, 0, 1, 0, 0, 0, 0, 0},
> +		{0, 0, 0, 0, 1, 0, 1, 1, 1}
> +	},
> +	/* PASID cache */
> +	{
> +		{0},
> +		{0, 0, 0, 0, 1, 1, 0, 0, 0}
> +	},
> +	/* context cache */
> +	{
> +		{1, 1, 1}
> +	}
> +};
> +
> +const static u64 inv_type_granu_table[IOMMU_INV_NR_TYPE][2][IOMMU_INV_NR_GRANU] = {
> +	/* extended dev IOTLBs, only global is valid */
> +	{
> +		{QI_DEV_IOTLB_GRAN_ALL},
> +		{0, 0, 0, 0, QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0, 0, 0}
> +	},
> +	/* IOTLB and EIOTLB */
> +	{
> +		{DMA_TLB_GLOBAL_FLUSH, DMA_TLB_DSI_FLUSH, 0, DMA_TLB_PSI_FLUSH},
> +		{0, 0, 0, 0, QI_GRAN_ALL_ALL, 0, QI_GRAN_NONG_ALL, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID}
> +	},
> +	/* PASID cache */
> +	{
> +		{0},
> +		{0, 0, 0, 0, QI_PC_ALL_PASIDS, QI_PC_PASID_SEL}
> +	},
> +	/* context cache */
> +	{
> +		{DMA_CCMD_GLOBAL_INVL, DMA_CCMD_DOMAIN_INVL, DMA_CCMD_DEVICE_INVL}
> +	}
> +};
> +
> +static inline int to_vtd_granularity(int type, int granu, int with_pasid, u64 *vtd_granu)
> +{
> +	if (type >= IOMMU_INV_NR_TYPE || granu >= IOMMU_INV_NR_GRANU || with_pasid > 1)
> +		return -EINVAL;
> +
> +	if (inv_type_granu_map[type][with_pasid][granu] == 0)
> +		return -EINVAL;
> +
> +	*vtd_granu = inv_type_granu_table[type][with_pasid][granu];
> +
> +	return 0;
> +}
> +
> +static int intel_iommu_sva_invalidate(struct iommu_domain *domain,
> +		struct device *dev, struct tlb_invalidate_info *inv_info)
> +{
> +	struct intel_iommu *iommu;
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
> +	struct pci_dev *pdev;
> +	u16 did, sid, pfsid;
> +	u8 bus, devfn;
> +	int ret = 0;
> +	u64 granu;
> +	unsigned long flags;
> +
> +	if (!inv_info || !dmar_domain)
> +		return -EINVAL;
> +
> +	iommu = device_to_iommu(dev, &bus, &devfn);
> +	if (!iommu)
> +		return -ENODEV;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return -ENODEV;
> +
> +	did = dmar_domain->iommu_did[iommu->seq_id];
> +	sid = PCI_DEVID(bus, devfn);
> +	ret = to_vtd_granularity(inv_info->hdr.type, inv_info->granularity,
> +				!!(inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED), &granu);
> +	if (ret) {
> +		pr_err("Invalid range type %d, granu %d\n", inv_info->hdr.type,
> +			inv_info->granularity);
> +		return ret;
> +	}
> +
> +	spin_lock(&iommu->lock);
> +	spin_lock_irqsave(&device_domain_lock, flags);
> +
> +	switch (inv_info->hdr.type) {
> +	case IOMMU_INV_TYPE_CONTEXT:
> +		iommu->flush.flush_context(iommu, did, sid,
> +					DMA_CCMD_MASK_NOBIT, granu);
> +		break;
> +	case IOMMU_INV_TYPE_TLB:
> +		/* We need to deal with two scenarios:
> +		 * - IOTLB for request w/o PASID
> +		 * - extended IOTLB for request with PASID.
> +		 */
> +		if (inv_info->size &&
> +			(inv_info->addr & ((1 << (VTD_PAGE_SHIFT + inv_info->size)) - 1))) {
> +			pr_err("Addr out of range, addr 0x%llx, size order %d\n",
> +				inv_info->addr, inv_info->size);
> +			ret = -ERANGE;
> +			goto out_unlock;
> +		}
> +
> +		if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
> +			qi_flush_eiotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
> +					inv_info->pasid,
> +					inv_info->size, granu,
> +					inv_info->flags & IOMMU_INVALIDATE_GLOBAL_PAGE);
> +		else
> +			qi_flush_iotlb(iommu, did, mm_to_dma_pfn(inv_info->addr),
> +				inv_info->size, granu);
> +		/* For SRIOV VF, invalidation of device IOTLB requires PFSID */
> +		pdev = to_pci_dev(dev);
> +		if (pdev && pdev->is_virtfn)
> +			pfsid = PCI_DEVID(pdev->physfn->bus->number, pdev->physfn->devfn);
> +		else
> +			pfsid = sid;
> +
> +		/**
> +		 * Always flush device IOTLB if ATS is enabled since guest
> +		 * vIOMMU exposes CM = 1, no device IOTLB flush will be passed
> +		 * down.
> +		 * TODO: check if device is VF, use PF ATS data if spec does not require
> +		 * VF to include all PF capabilities,  VF qdep and VF ats_enabled.
> +		 */
> +		info = iommu_support_dev_iotlb(dmar_domain, iommu, bus, devfn);
> +		if (info && info->ats_enabled) {
> +			if (inv_info->flags & IOMMU_INVALIDATE_PASID_TAGGED)
> +				qi_flush_dev_eiotlb(iommu, sid, info->pfsid,
> +						inv_info->pasid, info->ats_qdep,
> +						inv_info->addr, inv_info->size,
> +						granu);
> +			else
> +				qi_flush_dev_iotlb(iommu, sid, info->pfsid,
> +						info->ats_qdep, inv_info->addr,
> +						inv_info->size);
> +		}
> +		break;
> +	case IOMMU_INV_TYPE_PASID:
> +		qi_flush_pasid(iommu, did, granu, inv_info->pasid);
> +
> +		break;
> +	default:
> +		dev_err(dev, "Unknown IOMMU invalidation type %d\n",
> +			inv_info->hdr.type);
> +		ret = -EINVAL;
> +	}
> +out_unlock:
> +	spin_unlock(&iommu->lock);
> +	spin_unlock_irqrestore(&device_domain_lock, flags);
> +
> +	return ret;
> +}
> +
>  static int intel_iommu_map(struct iommu_domain *domain,
>  			   unsigned long iova, phys_addr_t hpa,
>  			   size_t size, int iommu_prot)
> @@ -5304,7 +5481,7 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
>  	iommu = device_to_iommu(dev, &bus, &devfn);
>  	if (!iommu)
>  		return -ENODEV;
> -	/* VT-d spec 9.4 says pasid table size is encoded as 2^(x+5) */
> +	/* VT-d spec section 9.4 says pasid table size is encoded as 2^(x+5) */
>  	host_table_pasid_bits = intel_iommu_get_pts(iommu) + MIN_NR_PASID_BITS;
>  	if (!pasidt_binfo || pasidt_binfo->pasid_bits > host_table_pasid_bits ||
>  		pasidt_binfo->pasid_bits < MIN_NR_PASID_BITS) {
> @@ -5313,7 +5490,11 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
>  			MIN_NR_PASID_BITS, host_table_pasid_bits);
>  		return -ERANGE;
>  	}
> -
> +	if (!ecap_nest(iommu->ecap)) {
> +		dev_err(dev, "Cannot bind PASID table, no nested translation\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}

This and some of the changes below could be included in patch 02/16.

Best regards,
Lu Baolu

>  	pdev = to_pci_dev(dev);
>  	sid = PCI_DEVID(bus, devfn);
>  	info = dev->archdata.iommu;
> @@ -5323,6 +5504,11 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
>  		ret = -EINVAL;
>  		goto out;
>  	}
> +	if (info->pasid_table_bound) {
> +		dev_err(dev, "Device PASID table already bound\n");
> +		ret = -EBUSY;
> +		goto out;
> +	}
>  	if (!info->pasid_enabled) {
>  		ret = pci_enable_pasid(pdev, info->pasid_supported & ~1);
>  		if (ret) {
> @@ -5363,7 +5549,7 @@ static int intel_iommu_bind_pasid_table(struct iommu_domain *domain,
>  				DMA_CCMD_MASK_NOBIT,
>  				DMA_CCMD_DEVICE_INVL);
>  	iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> -
> +	info->pasid_table_bound = 1;
>  out_unlock:
>  	spin_unlock_irqrestore(&iommu->lock, flags);
>  out:
> @@ -5375,8 +5561,14 @@ static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
>  {
>  	struct intel_iommu *iommu;
>  	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct device_domain_info *info;
>  	u8 bus, devfn;
>  
> +	info = dev->archdata.iommu;
> +	if (!info) {
> +		dev_err(dev, "Invalid device domain info\n");
> +		return;
> +	}
>  	assert_spin_locked(&device_domain_lock);
>  	iommu = device_to_iommu(dev, &bus, &devfn);
>  	if (!iommu) {
> @@ -5387,6 +5579,7 @@ static void intel_iommu_unbind_pasid_table(struct iommu_domain *domain,
>  	domain_context_clear(iommu, dev);
>  
>  	domain_context_mapping_one(dmar_domain, iommu, bus, devfn);
> +	info->pasid_table_bound = 0;
>  }
>  #endif /* CONFIG_INTEL_IOMMU_SVM */
>  
> @@ -5399,6 +5592,7 @@ const struct iommu_ops intel_iommu_ops = {
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  	.bind_pasid_table	= intel_iommu_bind_pasid_table,
>  	.unbind_pasid_table	= intel_iommu_unbind_pasid_table,
> +	.sva_invalidate		= intel_iommu_sva_invalidate,
>  #endif
>  	.map			= intel_iommu_map,
>  	.unmap			= intel_iommu_unmap,
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 3c83f7e..7f05e36 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -258,6 +258,10 @@ enum {
>  #define QI_PGRP_RESP_TYPE	0x9
>  #define QI_PSTRM_RESP_TYPE	0xa
>  
> +#define QI_DID(did)		(((u64)did & 0xffff) << 16)
> +#define QI_DID_MASK		GENMASK(31, 16)
> +#define QI_TYPE_MASK		GENMASK(3, 0)
> +
>  #define QI_IEC_SELECTIVE	(((u64)1) << 4)
>  #define QI_IEC_IIDEX(idx)	(((u64)(idx & 0xffff) << 32))
>  #define QI_IEC_IM(m)		(((u64)(m & 0x1f) << 27))
> @@ -288,8 +292,9 @@ enum {
>  #define QI_PC_DID(did)		(((u64)did) << 16)
>  #define QI_PC_GRAN(gran)	(((u64)gran) << 4)
>  
> -#define QI_PC_ALL_PASIDS	(QI_PC_TYPE | QI_PC_GRAN(0))
> -#define QI_PC_PASID_SEL		(QI_PC_TYPE | QI_PC_GRAN(1))
> +/* PC inv granu */
> +#define QI_PC_ALL_PASIDS	0
> +#define QI_PC_PASID_SEL		1
>  
>  #define QI_EIOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
>  #define QI_EIOTLB_GL(gl)	(((u64)gl) << 7)
> @@ -299,6 +304,10 @@ enum {
>  #define QI_EIOTLB_DID(did)	(((u64)did) << 16)
>  #define QI_EIOTLB_GRAN(gran) 	(((u64)gran) << 4)
>  
> +/* QI Dev-IOTLB inv granu */
> +#define QI_DEV_IOTLB_GRAN_ALL		0
> +#define QI_DEV_IOTLB_GRAN_PASID_SEL	1
> +
>  #define QI_DEV_EIOTLB_ADDR(a)	((u64)(a) & VTD_PAGE_MASK)
>  #define QI_DEV_EIOTLB_SIZE	(((u64)1) << 11)
>  #define QI_DEV_EIOTLB_GLOB(g)	((u64)g)
> @@ -327,6 +336,7 @@ enum {
>  #define QI_RESP_INVALID		0x1
>  #define QI_RESP_FAILURE		0xf
>  
> +/* QI EIOTLB inv granu */
>  #define QI_GRAN_ALL_ALL			0
>  #define QI_GRAN_NONG_ALL		1
>  #define QI_GRAN_NONG_PASID		2
> @@ -471,6 +481,7 @@ struct device_domain_info {
>  	u8 pri_enabled:1;
>  	u8 ats_supported:1;
>  	u8 ats_enabled:1;
> +	u8 pasid_table_bound:1;
>  	u8 ats_qdep;
>  	u64 fault_mask;	/* selected IOMMU faults to be reported */
>  	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
> @@ -502,7 +513,7 @@ extern void qi_flush_eiotlb(struct intel_iommu *iommu, u16 did, u64 addr,
>  extern void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>  			u16 qdep, u64 addr, unsigned mask);
>  extern void qi_flush_dev_eiotlb(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> -				u32 pasid, u16 qdep, u64 addr, unsigned size);
> +			u32 pasid, u16 qdep, u64 addr, unsigned size, u64 granu);
>  extern void qi_flush_pasid(struct intel_iommu *iommu, u16 did, u64 granu, int pasid);
>  
>  extern int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu);

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 07/16] iommu/vt-d: assign PFSID in device TLB invalidation
  2017-11-17 18:55 ` [PATCH v3 07/16] iommu/vt-d: assign PFSID in device TLB invalidation Jacob Pan
@ 2017-12-05  5:45   ` Lu Baolu
  0 siblings, 0 replies; 94+ messages in thread
From: Lu Baolu @ 2017-12-05  5:45 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok, Jean Delvare,
	Christoph Hellwig

Hi,

On 11/18/2017 02:55 AM, Jacob Pan wrote:
> When an SR-IOV VF device IOTLB is invalidated, we need to provide
> the PF source SID so that the IOMMU hardware can gauge the depth
> of the invalidation queue, which is shared among VFs. This is needed
> when the device invalidation throttle (DIT) capability is supported.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/iommu/intel-iommu.c | 13 +++++++++++++
>  include/linux/intel-iommu.h |  3 +++
>  2 files changed, 16 insertions(+)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 000b2b3..e1bd219 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1459,6 +1459,19 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
>  		return;
>  
>  	pdev = to_pci_dev(info->dev);
> +	/* For IOMMU that supports device IOTLB throttling (DIT), we assign
> +	 * PFSID to the invalidation desc of a VF such that IOMMU HW can gauge
> +	 * queue depth at PF level. If DIT is not set, PFSID will be treated as
> +	 * reserved, which should be set to 0.
> +	 */
> +	if (!ecap_dit(info->iommu->ecap))
> +		info->pfsid = 0;
> +	else if (pdev && pdev->is_virtfn) {
> +		if (ecap_dit(info->iommu->ecap))

Isn't this condition always true when it comes here?
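I.e., something like this sketch would be equivalent (the inner
ecap_dit() check is always true at that point, so the dev_warn()
placement probably needs rethinking as well):

	if (!ecap_dit(info->iommu->ecap))
		info->pfsid = 0;
	else if (pdev && pdev->is_virtfn)
		info->pfsid = PCI_DEVID(pdev->physfn->bus->number,
					pdev->physfn->devfn);
	else
		info->pfsid = PCI_DEVID(info->bus, info->devfn);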

Best regards,
Lu Baolu

> +			dev_warn(&pdev->dev, "SRIOV VF device IOTLB enabled without flow control\n");
> +		info->pfsid = PCI_DEVID(pdev->physfn->bus->number, pdev->physfn->devfn);
> +	} else
> +		info->pfsid = PCI_DEVID(info->bus, info->devfn);
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
>  	/* The PCIe spec, in its wisdom, declares that the behaviour of
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 7f05e36..6956a4e 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -112,6 +112,7 @@
>   * Extended Capability Register
>   */
>  
> +#define ecap_dit(e)		((e >> 41) & 0x1)
>  #define ecap_pasid(e)		((e >> 40) & 0x1)
>  #define ecap_pss(e)		((e >> 35) & 0x1f)
>  #define ecap_eafs(e)		((e >> 34) & 0x1)
> @@ -285,6 +286,7 @@ enum {
>  #define QI_DEV_IOTLB_SID(sid)	((u64)((sid) & 0xffff) << 32)
>  #define QI_DEV_IOTLB_QDEP(qdep)	(((qdep) & 0x1f) << 16)
>  #define QI_DEV_IOTLB_ADDR(addr)	((u64)(addr) & VTD_PAGE_MASK)
> > +#define QI_DEV_IOTLB_PFSID(pfsid) (((u64)(pfsid & 0xf) << 12) | ((u64)(pfsid & 0xfff0) << 48))
>  #define QI_DEV_IOTLB_SIZE	1
>  #define QI_DEV_IOTLB_MAX_INVS	32
>  
> @@ -475,6 +477,7 @@ struct device_domain_info {
>  	struct list_head global; /* link to global list */
>  	u8 bus;			/* PCI bus number */
>  	u8 devfn;		/* PCI devfn number */
> +	u16 pfsid;		/* SRIOV physical function source ID */
>  	u8 pasid_supported:3;
>  	u8 pasid_enabled:1;
>  	u8 pri_supported:1;

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
  2017-11-17 18:55   ` Jacob Pan
  (?)
@ 2017-12-05  6:22   ` Lu Baolu
  2017-12-08 21:22       ` Jacob Pan
  -1 siblings, 1 reply; 94+ messages in thread
From: Lu Baolu @ 2017-12-05  6:22 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Jean Delvare

Hi,

On 11/18/2017 02:55 AM, Jacob Pan wrote:
> Traditionally, device specific faults are detected and handled within
> their own device drivers. When an IOMMU is enabled, faults such as those
> in DMA related transactions are detected by the IOMMU. There is no generic
> reporting mechanism to report faults back to the in-kernel device
> driver or the guest OS in case of assigned devices.
>
> Faults detected by the IOMMU are based on the transaction's source ID,
> which can be reported on a per device basis, regardless of whether the
> device is a PCI device or not.
>
> The fault types include recoverable (e.g. page request) and
> unrecoverable faults (e.g. access error). In most cases, faults can be
> handled by IOMMU drivers internally. The primary use cases are as
> follows:
> 1. page request faults originated from an SVM capable device that is
> assigned to a guest via vIOMMU. In this case, the first level page tables
> are owned by the guest. Page requests must be propagated to the guest to
> let the guest OS fault in the pages and then send a page response. In this
> mechanism, the direct receiver of the IOMMU fault notification is VFIO,
> which can relay notification events to QEMU or other user space
> software.
>
> 2. faults that need more subtle handling by device drivers. Rather than
> simply invoking a reset function, there is a need to let the device
> driver handle the fault with a smaller impact.
>
> This patchset is intended to create a generic fault report API such
> that it can scale as follows:
> - all IOMMU types
> - PCI and non-PCI devices
> - recoverable and unrecoverable faults
> - VFIO and other in-kernel users
> - DMA & IRQ remapping (TBD)
> The original idea was brought up by David Woodhouse and discussions
> summarized at https://lwn.net/Articles/608914/.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/iommu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iommu.h | 36 +++++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 829e9e9..97b7990 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -581,6 +581,12 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  		goto err_free_name;
>  	}
>  
> +	dev->iommu_param = kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!dev->iommu_param) {
> +		ret = -ENOMEM;
> +		goto err_free_name;
> +	}
> +
>  	kobject_get(group->devices_kobj);
>  
>  	dev->iommu_group = group;
> @@ -657,7 +663,7 @@ void iommu_group_remove_device(struct device *dev)
>  	sysfs_remove_link(&dev->kobj, "iommu_group");
>  
>  	trace_remove_device_from_group(group->id, dev);
> -
> +	kfree(dev->iommu_param);
>  	kfree(device->name);
>  	kfree(device);
>  	dev->iommu_group = NULL;
> @@ -791,6 +797,61 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  }
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>  
> +int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data)
> +{
> +	struct iommu_param *idata = dev->iommu_param;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!idata)
> +		return -EINVAL;
> +	/* Only allow one fault handler registered for each device */
> +	if (idata->fault_param)
> +		return -EBUSY;
> +	get_device(dev);
> +	idata->fault_param =
> +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!idata->fault_param)
> +		return -ENOMEM;
> +	idata->fault_param->handler = handler;
> +	idata->fault_param->data = data;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> +
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	struct iommu_param *idata = dev->iommu_param;
> +
> +	if (!idata)
> +		return -EINVAL;
> +
> +	kfree(idata->fault_param);
> +	idata->fault_param = NULL;
> +	put_device(dev);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> +
> +
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	/* we only report device fault if there is a handler registered */
> +	if (!dev->iommu_param || !dev->iommu_param->fault_param ||
> +		!dev->iommu_param->fault_param->handler)

Can this be replaced by:

    if (!iommu_has_device_fault_handler(dev))

?
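
I.e., a sketch of what I have in mind:

	int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
	{
		/* we only report device fault if there is a handler registered */
		if (!iommu_has_device_fault_handler(dev))
			return -ENOSYS;

		return dev->iommu_param->fault_param->handler(evt,
					dev->iommu_param->fault_param->data);
	}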

Best regards,
Lu Baolu

> +		return -ENOSYS;
> +
> +	return dev->iommu_param->fault_param->handler(evt,
> +						dev->iommu_param->fault_param->data);
> +}
> +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> +
>  /**
>   * iommu_group_id - Return ID for a group
>   * @group: the group to ID
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index dfda89b..841c044 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -463,6 +463,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
>  					 struct notifier_block *nb);
>  extern int iommu_group_unregister_notifier(struct iommu_group *group,
>  					   struct notifier_block *nb);
> +extern int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data);
> +
> +extern int iommu_unregister_device_fault_handler(struct device *dev);
> +
> +extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
> +
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -481,6 +489,12 @@ extern void iommu_domain_window_disable(struct iommu_domain *domain, u32 wnd_nr)
>  extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
>  			      unsigned long iova, int flags);
>  
> +static inline bool iommu_has_device_fault_handler(struct device *dev)
> +{
> +	return dev->iommu_param && dev->iommu_param->fault_param &&
> +		dev->iommu_param->fault_param->handler;
> +}
> +
>  static inline void iommu_flush_tlb_all(struct iommu_domain *domain)
>  {
>  	if (domain->ops->flush_iotlb_all)
> @@ -734,6 +748,28 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
>  	return 0;
>  }
>  
> +static inline int iommu_register_device_fault_handler(struct device *dev,
> +						iommu_dev_fault_handler_t handler,
> +						void *data)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline bool iommu_has_device_fault_handler(struct device *dev)
> +{
> +	return false;
> +}
> +
> +static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	return 0;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 12/16] iommu/vt-d: report unrecoverable device faults
@ 2017-12-05  6:34     ` Lu Baolu
  0 siblings, 0 replies; 94+ messages in thread
From: Lu Baolu @ 2017-12-05  6:34 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Yi L, Liu, Jean Delvare

Hi,

On 11/18/2017 02:55 AM, Jacob Pan wrote:
> Currently, when device DMA faults are detected by IOMMU the fault
> reasons are printed but the driver of the offending device is

"... but the driver of the offending device is not involved in ..."

Best regards,
Lu Baolu

> involved in fault handling.
> This patch uses per device fault reporting API to send fault event
> data for further processing.
> Offending device is identified by the source ID in VT-d fault reason
> report registers.
>
> Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/dmar.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 38ee91b..b1f67fc2 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1555,6 +1555,31 @@ static const char *irq_remap_fault_reasons[] =
>  	"Blocked an interrupt request due to source-id verification failure",
>  };
>  
> +/* fault data and status */
> +enum intel_iommu_fault_reason {
> +	INTEL_IOMMU_FAULT_REASON_SW,
> +	INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT,
> +	INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT,
> +	INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID,
> +	INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH,
> +	INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS,
> +	INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS,
> +	INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID,
> +	INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID,
> +	INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID,
> +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_RTP,
> +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_CTP,
> +	INTEL_IOMMU_FAULT_REASON_NONE_ZERO_PTE,
> +	NR_INTEL_IOMMU_FAULT_REASON,
> +};
> +
> +/* fault reasons that are allowed to be reported outside IOMMU subsystem */
> +#define INTEL_IOMMU_FAULT_REASON_ALLOWED			\
> +	((1ULL << INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH) |	\
> +		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS) |	\
> +		(1ULL << INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS))
> +
> +
>  static const char *dmar_get_fault_reason(u8 fault_reason, int *fault_type)
>  {
>  	if (fault_reason >= 0x20 && (fault_reason - 0x20 <
> @@ -1635,6 +1660,69 @@ void dmar_msi_read(int irq, struct msi_msg *msg)
>  	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
>  }
>  
> +static enum iommu_fault_reason to_iommu_fault_reason(u8 reason)
> +{
> +	if (reason >= NR_INTEL_IOMMU_FAULT_REASON) {
> +		pr_warn("unknown DMAR fault reason %d\n", reason);
> +		return IOMMU_FAULT_REASON_UNKNOWN;
> +	}
> +	switch (reason) {
> +	case INTEL_IOMMU_FAULT_REASON_SW:
> +	case INTEL_IOMMU_FAULT_REASON_ROOT_NOT_PRESENT:
> +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_NOT_PRESENT:
> +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_INVALID:
> +	case INTEL_IOMMU_FAULT_REASON_BEYOND_ADDR_WIDTH:
> +	case INTEL_IOMMU_FAULT_REASON_ROOT_ADDR_INVALID:
> +	case INTEL_IOMMU_FAULT_REASON_CONTEXT_PTR_INVALID:
> +		return IOMMU_FAULT_REASON_INTERNAL;
> +	case INTEL_IOMMU_FAULT_REASON_NEXT_PT_INVALID:
> +	case INTEL_IOMMU_FAULT_REASON_PTE_WRITE_ACCESS:
> +	case INTEL_IOMMU_FAULT_REASON_PTE_READ_ACCESS:
> +		return IOMMU_FAULT_REASON_PERMISSION;
> +	default:
> +		return IOMMU_FAULT_REASON_UNKNOWN;
> +	}
> +}
> +
> +static void report_fault_to_device(struct intel_iommu *iommu, u64 addr, int type,
> +				int fault_type, enum intel_iommu_fault_reason reason, u16 sid)
> +{
> +	struct iommu_fault_event event;
> +	struct pci_dev *pdev;
> +	u8 bus, devfn;
> +
> +	/* check if fault reason is worth reporting outside IOMMU */
> +	if (!((1 << reason) & INTEL_IOMMU_FAULT_REASON_ALLOWED)) {
> +		pr_debug("Fault reason %d not allowed to report to device\n",
> +			reason);
> +		return;
> +	}
> +
> +	bus = PCI_BUS_NUM(sid);
> +	devfn = PCI_DEVFN(PCI_SLOT(sid), PCI_FUNC(sid));
> +	/*
> +	 * we need to check if the fault reporting is requested for the
> +	 * offending device.
> +	 */
> +	pdev = pci_get_bus_and_slot(bus, devfn);
> +	if (!pdev) {
> +		pr_warn("No PCI device found for source ID %x\n", sid);
> +		return;
> +	}
> +	/*
> +	 * unrecoverable fault is reported per IOMMU, notifier handler can
> +	 * resolve PCI device based on source ID.
> +	 */
> +	event.reason = to_iommu_fault_reason(reason);
> +	event.addr = addr;
> +	event.type = IOMMU_FAULT_DMA_UNRECOV;
> +	event.prot = type ? IOMMU_READ : IOMMU_WRITE;
> +	dev_warn(&pdev->dev, "report device unrecoverable fault: %d, %x, %d\n",
> +		event.reason, sid, event.type);
> +	iommu_report_device_fault(&pdev->dev, &event);
> +	pci_dev_put(pdev);
> +}
> +
>  static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
>  		u8 fault_reason, u16 source_id, unsigned long long addr)
>  {
> @@ -1648,11 +1736,15 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
>  			source_id >> 8, PCI_SLOT(source_id & 0xFF),
>  			PCI_FUNC(source_id & 0xFF), addr >> 48,
>  			fault_reason, reason);
> -	else
> +	else {
>  		pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx [fault reason %02d] %s\n",
>  		       type ? "DMA Read" : "DMA Write",
>  		       source_id >> 8, PCI_SLOT(source_id & 0xFF),
>  		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
> +	}
> +	report_fault_to_device(iommu, addr, type, fault_type,
> +			fault_reason, source_id);
> +
>  	return 0;
>  }
>  

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 13/16] iommu/intel-svm: notify page request to guest
@ 2017-12-05  7:37     ` Lu Baolu
  0 siblings, 0 replies; 94+ messages in thread
From: Lu Baolu @ 2017-12-05  7:37 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Jean Delvare

Hi,

On 11/18/2017 02:55 AM, Jacob Pan wrote:
> If the source device of a page request has its PASID table pointer
> bound to a guest, the first level page tables are owned by the guest.
> In this case, we shall let the guest OS manage the page fault.
>
> This patch uses the IOMMU fault notification API to send notifications,
> possibly via VFIO, to the guest OS. Once guest pages are faulted in, the
> guest will issue a page response, which will be passed down via the
> invalidation passdown APIs.
>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/intel-svm.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>  include/linux/iommu.h     |  1 +
>  2 files changed, 74 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index f6697e5..77c25d8 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -555,6 +555,71 @@ static bool is_canonical_address(u64 addr)
>  	return (((saddr << shift) >> shift) == saddr);
>  }
>  
> +static int prq_to_iommu_prot(struct page_req_dsc *req)
> +{
> +	int prot = 0;
> +
> +	if (req->rd_req)
> +		prot |= IOMMU_FAULT_READ;
> +	if (req->wr_req)
> +		prot |= IOMMU_FAULT_WRITE;
> +	if (req->exe_req)
> +		prot |= IOMMU_FAULT_EXEC;
> +	if (req->priv_req)
> +		prot |= IOMMU_FAULT_PRIV;
> +
> +	return prot;
> +}
> +
> +static int intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
> +{
> +	int ret = 0;

It seems that "ret" should be initialized to -EINVAL. Otherwise, this function
will return 0 for devices which have no fault handlers, and all page requests
will be ignored by the iommu driver.
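
I.e., starting with:

	int ret = -EINVAL;

would make intel_svm_prq_report() fail when no handler is registered,
so that prq_event_thread() falls back to the normal in-kernel response
path.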

> +	struct iommu_fault_event event;
> +	struct pci_dev *pdev;
> +
> +	/**
> +	 * If caller does not provide struct device, this is the case where
> +	 * guest PASID table is bound to the device. So we need to retrieve
> +	 * struct device from the page request descriptor then proceed.
> +	 */
> +	if (!dev) {
> +		pdev = pci_get_bus_and_slot(desc->bus, desc->devfn);
> +		if (!pdev) {
> +			pr_err("No PCI device found for PRQ [%02x:%02x.%d]\n",
> +				desc->bus, PCI_SLOT(desc->devfn),
> +				PCI_FUNC(desc->devfn));
> +			return -ENODEV;
> +		}
> +		dev = &pdev->dev;
> +	} else if (dev_is_pci(dev)) {
> +		pdev = to_pci_dev(dev);
> +		pci_dev_get(pdev);
> +	} else
> +		return -ENODEV;
> +
> +	pr_debug("Notify PRQ device [%02x:%02x.%d]\n",
> +		desc->bus, PCI_SLOT(desc->devfn),
> +		PCI_FUNC(desc->devfn));
> +
> +	/* invoke device fault handler if registered */
> +	if (iommu_has_device_fault_handler(dev)) {
> +		/* Fill in event data for device specific processing */
> +		event.type = IOMMU_FAULT_PAGE_REQ;
> +		event.addr = desc->addr;
> +		event.pasid = desc->pasid;
> +		event.page_req_group_id = desc->prg_index;
> +		event.prot = prq_to_iommu_prot(desc);
> +		event.last_req = desc->lpig;
> +		event.pasid_valid = 1;
> +		event.iommu_private = desc->private;
> +		ret = iommu_report_device_fault(&pdev->dev, &event);
> +	}
> +
> +	pci_dev_put(pdev);
> +
> +	return ret;
> +}
> +
>  static irqreturn_t prq_event_thread(int irq, void *d)
>  {
>  	struct intel_iommu *iommu = d;
> @@ -578,7 +643,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>  		handled = 1;
>  
>  		req = &iommu->prq[head / sizeof(*req)];
> -
> +		/**
> +		 * If prq is to be handled outside iommu driver via receiver of
> +		 * the fault notifiers, we skip the page response here.
> +		 */
> +		if (!intel_svm_prq_report(NULL, req))
> +			goto prq_advance;
>  		result = QI_RESP_FAILURE;
>  		address = (u64)req->addr << VTD_PAGE_SHIFT;
>  		if (!req->pasid_present) {
> @@ -649,11 +719,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>  		if (WARN_ON(&sdev->list == &svm->devs))
>  			sdev = NULL;
>  
> -		if (sdev && sdev->ops && sdev->ops->fault_cb) {
> -			int rwxp = (req->rd_req << 3) | (req->wr_req << 2) |
> -				(req->exe_req << 1) | (req->priv_req);
> -			sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr, req->private, rwxp, result);
> -		}
> +		intel_svm_prq_report(sdev->dev, req);

Do you mind explaining why we need to report this request twice?

Best regards,
Lu Baolu

>  		/* We get here in the error case where the PASID lookup failed,
>  		   and these can be NULL. Do not use them below this point! */
>  		sdev = NULL;
> @@ -679,7 +745,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
>  
>  			qi_submit_sync(&resp, iommu);
>  		}
> -
> +	prq_advance:
>  		head = (head + sizeof(*req)) & PRQ_RING_MASK;
>  	}
>  
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 841c044..3083796b 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -42,6 +42,7 @@
>   * if the IOMMU page table format is equivalent.
>   */
>  #define IOMMU_PRIV	(1 << 5)
> +#define IOMMU_EXEC	(1 << 6)
>  
>  struct iommu_ops;
>  struct iommu_group;

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
  2017-12-04 21:37       ` Jacob Pan
@ 2017-12-05 17:21         ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-12-05 17:21 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Jean Delvare,
	Will Deacon

Hi Jacob,

On 04/12/17 21:37, Jacob Pan wrote:
> On Fri, 24 Nov 2017 12:03:50 +0000
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> 
>> On 17/11/17 18:55, Jacob Pan wrote:
>>> When nested translation is turned on and the guest owns the
>>> first level page tables, device page requests can be forwarded
>>> to the guest for handling faults. As the page response is returned
>>> by the guest, the IOMMU driver on the host needs to process the
>>> response, which informs the device and completes the page request
>>> transaction.
>>>
>>> This patch introduces a generic API function for page response
>>> passing from the guest or other in-kernel users. The definitions of
>>> the generic data are based on the PCI ATS specification and not
>>> limited to any vendor.
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
[...]
> I think the simpler interface works for the in-kernel driver use case very
> well. But in the case of VFIO, the callback function does not turn around
> and send back the page response. The page response comes from the guest and
> QEMU, where they don't keep track of the PRQ event data.

Is it safe to trust whatever response the guest or userspace gives us? The
answer seems fairly vendor- and device-specific so I wonder if VFIO or
IOMMU shouldn't do a bit of sanity checking somewhere, and keep track of
all injected page requests.

From SMMUv3 POV, it seems safe (haven't looked at SMMUv2 but I'm not so
confident).

* The guest can only send page responses to devices assigned to it, that's
  a given.

* If, after we injected a page request, the guest doesn't reply at all,
  then the device leaks page request credits and at some point it will
  stop sending requests.
  -> So the PRI capability needs to be reset whenever we change the
     device's domain, to clear the credit counter and pending states.

  For SMMUv3, the stall buffer may be shared between devices on some
  implementations, in which case the guest could prevent other devices from
  stalling by letting the buffer fill up.
  -> We might have to keep track of stalls in the host driver and set a
     credit or timeout to each stall, if it comes to that.
  -> In addition, send a terminate-all-stalls command when changing the
     device's domain.

* If the guest sends spurious or duplicate page responses (where the PRGI
  or PASID doesn't exist in any outstanding page request of the device)

  For PRI, if we send an invalid PRG Response, the endpoint sets UPRGI in
  the PRI cap and issues an Unexpected Completion. Then I suppose the
  worst that happens is we get an AER report that we can't handle? I'm not
  too familiar with that part of PCIe.

  Stall is designed to tolerate this and will just ignore the response.

* If PRI/stall isn't even enabled, the IOMMU driver can check that in the
  device configuration and not send the reply.
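
Keeping track of all injected page requests, as suggested above, could be
a small per-device list that any response has to match before the host
forwards it to the hardware. A rough sketch, with every name made up for
illustration:

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct pending_fault {
	struct list_head list;
	u32 pasid;
	u32 prgi;		/* page request group index */
};

struct fault_tracker {			/* one per device */
	spinlock_t lock;
	struct list_head pending;	/* outstanding page requests */
	unsigned int count;
};

/* Accept a response only if it matches an outstanding request, so
 * spurious or duplicate responses are dropped before reaching the
 * device. Returns true if the response may be forwarded. */
static bool fault_tracker_consume(struct fault_tracker *t,
				  u32 pasid, u32 prgi)
{
	struct pending_fault *f;
	bool found = false;

	spin_lock(&t->lock);
	list_for_each_entry(f, &t->pending, list) {
		if (f->pasid == pasid && f->prgi == prgi) {
			list_del(&f->list);
			t->count--;
			kfree(f);
			found = true;
			break;
		}
	}
	spin_unlock(&t->lock);
	return found;
}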




Regardless, I have a few comments on the page_response_msg:

> +/**
> + * Generic page response information based on PCI ATS and PASID spec.
> + * @paddr: servicing page address

Maybe call it @addr, so we don't read this field as "phys addr"

> + * @pasid: contains process address space ID, used in shared virtual memory(SVM)

The "used in shared virtual memory(SVM)" part isn't necessary and we're
changing the API name.

> + * @rid: requestor ID
> + * @did: destination device ID

I guess you can remove @rid and @did

> + * @last_req: last request in a page request group

Is @last_req needed at all, since only the last request requires a response?

> + * @resp_code: response code

The comment is missing a description for @pasid_present here

> + * @page_req_group_id: page request group index
> + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE

Is @prot really needed in the response?

> + * @type: group or stream response

The page request doesn't provide this information

> + * @private_data: uniquely identify device-specific private data for an
> + *                individual page response
> +
> + */
> +struct page_response_msg {
> +	u64 paddr;
> +	u32 pasid;
> +	u32 rid:16;
> +	u32 did:16;
> +	u32 resp_code:4;
> +	u32 last_req:1;
> +	u32 pasid_present:1;
> +#define IOMMU_PAGE_RESP_SUCCESS	0
> +#define IOMMU_PAGE_RESP_INVALID	1
> +#define IOMMU_PAGE_RESP_FAILURE	0xF

Maybe move these defines closer to resp_code.
For someone not familiar with PRI, we should add some comments about those
values:

* SUCCESS: the request was paged-in successfully
* INVALID: could not page-in one or more pages in the group
* FAILURE: permanent PRI error, may disable faults in the device

> +	u32 page_req_group_id : 9;
> +	u32 prot;
> +	enum page_response_type type;
> +	u32 private_data;
> +};
> +
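
Putting the above together, the struct might end up looking something
like this (just a sketch of the suggestions, with guessed values for the
enum, and keeping @type since it was only queried, not rejected):

enum page_response_type {
	IOMMU_PAGE_GROUP_RESP,
	IOMMU_PAGE_STREAM_RESP,
};

/**
 * Generic page response information based on PCI ATS and PASID spec.
 * @addr: servicing page address
 * @pasid: process address space ID
 * @resp_code: response code
 * @pasid_present: whether @pasid is valid
 * @page_req_group_id: page request group index
 * @type: group or stream response
 * @private_data: device-specific private data for an individual response
 */
struct page_response_msg {
	u64 addr;
	u32 pasid;
/* The request was paged-in successfully */
#define IOMMU_PAGE_RESP_SUCCESS	0
/* Could not page-in one or more pages in the group */
#define IOMMU_PAGE_RESP_INVALID	1
/* Permanent PRI error, may disable faults in the device */
#define IOMMU_PAGE_RESP_FAILURE	0xF
	u32 resp_code:4;
	u32 pasid_present:1;
	u32 page_req_group_id:9;
	enum page_response_type type;
	u32 private_data;
};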

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
@ 2017-12-06 19:25           ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-12-06 19:25 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Jean Delvare,
	Will Deacon, jacob.jun.pan, Kumar, Sanjay K

On Tue, 5 Dec 2017 17:21:15 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Hi Jacob,
> 
> On 04/12/17 21:37, Jacob Pan wrote:
> > On Fri, 24 Nov 2017 12:03:50 +0000
> > Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> >   
> >> On 17/11/17 18:55, Jacob Pan wrote:  
> >>> When nested translation is turned on and the guest owns the
> >>> first level page tables, device page requests can be forwarded
> >>> to the guest for handling faults. As the page response is returned
> >>> by the guest, the IOMMU driver on the host needs to process the
> >>> response, which informs the device and completes the page request
> >>> transaction.
> >>>
> >>> This patch introduces a generic API function for page response
> >>> passing from the guest or other in-kernel users. The definitions
> >>> of the generic data are based on the PCI ATS specification and not
> >>> limited to any vendor.
> >>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>  
> [...]
> > I think the simpler interface works for the in-kernel driver use case
> > very well. But in the case of VFIO, the callback function does not turn
> > around and send back the page response. The page response comes from
> > the guest and QEMU, where they don't keep track of the PRQ event data.  
> 
> Is it safe to trust whatever response the guest or userspace gives
> us? The answer seems fairly vendor- and device-specific so I wonder
> if VFIO or IOMMU shouldn't do a bit of sanity checking somewhere, and
> keep track of all injected page requests.
> 
> From SMMUv3 POV, it seems safe (haven't looked at SMMUv2 but I'm not
> so confident).
> 
> * The guest can only send page responses to devices assigned to it,
> that's a given.
> 
Agree, the IOMMU driver cannot enforce it. I think the VFIO layer can make
sure page responses come from the assigned device and its guest/container.
> * If, after we injected a page request, the guest doesn't reply at
> all, then the device leaks page request credits and at some point it
> will stop sending requests.
>   -> So the PRI capability needs to be reset whenever we change the  
>      device's domain, to clear the credit counter and pending states.
> 
>   For SMMUv3, the stall buffer may be shared between devices on some
>   implementations, in which case the guest could prevent other
> devices to stall by letting the buffer fill up.
>   -> We might have to keep track of stalls in the host driver and set
> a credit or timeout to each stall, if it comes to that.
>   -> In addition, send a terminate-all-stalls command when changing
> the device's domain.
> 
We have the same situation in VT-d with a shared queue, which in turn may
affect other guests. Letting the host driver maintain a record of pending
page requests seems the best way to go. VT-d has a way to drain the PRQ
per PASID and RID combination. I guess this is the same as your
"terminate-all-stalls" but with finer control? Or does
"terminate-all-stalls" only apply to a given device?
Seems we can implement a generic timeout/credit mechanism in the IOMMU
driver with model-specific actions to drain/terminate. The timeout value
can also be model specific.
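
E.g. a small per-model ops structure could carry the drain/terminate
action and the limits (a sketch only; none of these names exist today):

#include <linux/device.h>

/* Hypothetical per-model hooks: VT-d would back drain_faults() with
 * its per-PASID/RID PRQ drain, SMMUv3 with terminate-all-stalls. */
struct iommu_fault_drain_ops {
	/* Terminate outstanding requests for a device; @pasid may be
	 * ignored by models that only support per-device termination */
	void (*drain_faults)(struct device *dev, int pasid);
	/* Model-specific time to wait for a guest response, in jiffies */
	unsigned long response_timeout;
	/* Model-specific credit: max outstanding requests per device */
	unsigned int max_outstanding;
};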

> * If the guest sends spurious or duplicate page responses (where the
> PRGI or PASID doesn't exist in any outstanding page request of the
> device)
> 
If we keep track of pending PRQs in the host IOMMU driver, then it can
detect the duplicated case.
>   For PRI if we send an invalid PRG Response, the endpoint sets UPRGI
> in the PRI cap, and issues an Unexpected Completion. Then I suppose
> the worst that happens is we get an AER report that we can't handle?
> I'm not too familiar with that part of PCIe.
> 
I don't see this mentioned in the PCI ATS spec, but in general this
sounds like a case HW has to handle; perhaps ignoring them is
reasonable, as you said below.
>   Stall is designed to tolerate this and will just ignore the
> response.
> 
> * If PRI/stall isn't even enabled, the IOMMU driver can check that in
> the device configuration and not send the reply.
> 
> 
> 
> 
> Regardless, I have a few comments on the page_response_msg:
> 
Thanks, all points are taken unless commented on below.
> > +/**
> > + * Generic page response information based on PCI ATS and PASID
> > spec.
> > + * @paddr: servicing page address  
> 
> Maybe call it @addr, so we don't read this field as "phys addr"
> 
> > + * @pasid: contains process address space ID, used in shared
> > virtual memory(SVM)  
> 
> The "used in shared virtual memory(SVM)" part isn't necessary and
> we're changing the API name.
> 
> > + * @rid: requestor ID
> > + * @did: destination device ID  
> 
> I guess you can remove @rid and @did
> 
> > + * @last_req: last request in a page request group  
> 
> Is @last_req needed at all, since only the last request requires a
> response?
> 
Right, I was thinking we had a single page response in VT-d, but there is
no need for it either.
> > + * @resp_code: response code  
> 
> The comment is missing a description for @pasid_present here
> 
> > + * @page_req_group_id: page request group index
> > + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ,
> > IOMMU_FAULT_WRITE  
> 
> Is @prot really needed in the response?
> 
no, you are right.
> > + * @type: group or stream response  
> 
> The page request doesn't provide this information
> 
This is VT-d specific. It is in the VT-d page request descriptor, and the
response descriptors are different depending on the type.
Since we intend the generic data to be a superset of all models, I added
this field.
> > + * @private_data: uniquely identify device-specific private data
> > for an
> > + *                individual page response
> > +
> > + */
> > +struct page_response_msg {
> > +	u64 paddr;
> > +	u32 pasid;
> > +	u32 rid:16;
> > +	u32 did:16;
> > +	u32 resp_code:4;
> > +	u32 last_req:1;
> > +	u32 pasid_present:1;
> > +#define IOMMU_PAGE_RESP_SUCCESS	0
> > +#define IOMMU_PAGE_RESP_INVALID	1
> > +#define IOMMU_PAGE_RESP_FAILURE	0xF  
> 
> Maybe move these defines closer to resp_code.
> For someone not familiar with PRI, we should add some comments about
> those values:
> 
> * SUCCESS: the request was paged-in successfully
> * INVALID: could not page-in one or more pages in the group
> * FAILURE: permanent PRI error, may disable faults in the device
> 
> > +	u32 page_req_group_id : 9;
> > +	u32 prot;
> > +	enum page_response_type type;
> > +	u32 private_data;
> > +};
> > +  
> 
> Thanks,
> Jean

[Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
  2017-12-06 19:25           ` Jacob Pan
@ 2017-12-07 12:56             ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-12-07 12:56 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Jean Delvare,
	Will Deacon, Kumar, Sanjay K

On 06/12/17 19:25, Jacob Pan wrote:
[...]
>>   For SMMUv3, the stall buffer may be shared between devices on some
>>   implementations, in which case the guest could prevent other
>> devices to stall by letting the buffer fill up.
>>   -> We might have to keep track of stalls in the host driver and set
>> a credit or timeout to each stall, if it comes to that.
>>   -> In addition, send a terminate-all-stalls command when changing
>> the device's domain.
>>
> We have the same situation in VT-d with shared queue which in turn may
> affect other guests. Letting host driver maintain record of pending page
> request seems the best way to go. VT-d has a way to drain PRQ per PASID
> and RID combination. I guess this is the same as your
> "terminate-all-stalls" but with finer control? Or
> "terminate-all-stalls" only applies to a given device.

That command terminates all stalls for a given device (for all PASIDs).
It's a bit awkward to implement but should be enough to ensure that we
don't leak any outstanding faults to the next VM.

> Seems we can implement a generic timeout/credit mechanism in IOMMU
> driver with model specific action to drain/terminate. The timeout value
> can also be model specific.

Sounds good. Timeout seems a bit complicated to implement (and how do we
guess what timeout would work?), so maybe it's simpler to enforce a quota
of outstanding faults per VM, for example half of the shared queue size
(the number can be chosen by the IOMMU driver). If a VM has that many
outstanding faults, then any new fault is immediately terminated by the
host. A bit rough but it might be enough to mitigate the problem
initially, and we can always tweak it later (for instance disable faulting
if a guest doesn't ever reply).

Seems like VFIO should enforce this quota, since the IOMMU layer doesn't
know which device is assigned to which VM. If it's the IOMMU that enforces
quotas per device and a VM has 15 devices assigned, then the guest can
still DoS the IOMMU.

[...]
>>> + * @type: group or stream response  
>>
>> The page request doesn't provide this information
>>
> this is vt-d specific. it is in the vt-d page request descriptor and
> response descriptors are different depending on the type.
> Since we intend the generic data to be super set of models, I add this
> field.

But don't you need to add the stream type to enum iommu_fault_type, in
patch 8? Otherwise the guest can't know what type to set in the response.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
@ 2017-12-07 21:27     ` Alex Williamson
  0 siblings, 0 replies; 94+ messages in thread
From: Alex Williamson @ 2017-12-07 21:27 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok,
	Jean Delvare, Christoph Hellwig

On Fri, 17 Nov 2017 10:55:08 -0800
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> Traditionally, device specific faults are detected and handled within
> their own device drivers. When an IOMMU is enabled, faults such as those
> on DMA transactions are detected by the IOMMU. There is no generic
> reporting mechanism to report faults back to the in-kernel device
> driver or the guest OS in the case of assigned devices.
> 
> Faults detected by the IOMMU are based on the transaction's source ID,
> which can be reported on a per-device basis, regardless of whether the
> device is a PCI device or not.
> 
> The fault types include recoverable (e.g. page request) and
> unrecoverable faults (e.g. access error). In most cases, faults can be
> handled by IOMMU drivers internally. The primary use cases are as
> follows:
> 1. A page request fault originating from an SVM capable device that is
> assigned to a guest via vIOMMU. In this case, the first level page tables
> are owned by the guest. The page request must be propagated to the guest
> to let the guest OS fault in the pages and then send a page response. In
> this mechanism, the direct receiver of the IOMMU fault notification is
> VFIO, which can relay notification events to QEMU or other user space
> software.
> 
> 2. Faults that need more subtle handling by device drivers. Rather than
> simply invoking a reset function, there is a need to let the device
> driver handle the fault with a smaller impact.
> 
> This patchset is intended to create a generic fault report API such
> that it can scale as follows:
> - all IOMMU types
> - PCI and non-PCI devices
> - recoverable and unrecoverable faults
> - VFIO and other in-kernel users
> - DMA & IRQ remapping (TBD)
> The original idea was brought up by David Woodhouse and discussions
> summarized at https://lwn.net/Articles/608914/.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  drivers/iommu/iommu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iommu.h | 36 +++++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 829e9e9..97b7990 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -581,6 +581,12 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  		goto err_free_name;
>  	}
>  
> +	dev->iommu_param = kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!dev->iommu_param) {
> +		ret = -ENOMEM;
> +		goto err_free_name;
> +	}
> +
>  	kobject_get(group->devices_kobj);
>  
>  	dev->iommu_group = group;
> @@ -657,7 +663,7 @@ void iommu_group_remove_device(struct device *dev)
>  	sysfs_remove_link(&dev->kobj, "iommu_group");
>  
>  	trace_remove_device_from_group(group->id, dev);
> -
> +	kfree(dev->iommu_param);
>  	kfree(device->name);
>  	kfree(device);
>  	dev->iommu_group = NULL;
> @@ -791,6 +797,61 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  }
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
>  
> +int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data)
> +{
> +	struct iommu_param *idata = dev->iommu_param;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!idata)
> +		return -EINVAL;
> +	/* Only allow one fault handler registered for each device */
> +	if (idata->fault_param)
> +		return -EBUSY;
> +	get_device(dev);
> +	idata->fault_param =
> +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!idata->fault_param)
> +		return -ENOMEM;
> +	idata->fault_param->handler = handler;
> +	idata->fault_param->data = data;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> +
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	struct iommu_param *idata = dev->iommu_param;
> +
> +	if (!idata)
> +		return -EINVAL;
> +
> +	kfree(idata->fault_param);
> +	idata->fault_param = NULL;
> +	put_device(dev);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> +
> +
> +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	/* we only report device fault if there is a handler registered */
> +	if (!dev->iommu_param || !dev->iommu_param->fault_param ||
> +		!dev->iommu_param->fault_param->handler)
> +		return -ENOSYS;
> +
> +	return dev->iommu_param->fault_param->handler(evt,
> +						dev->iommu_param->fault_param->data);
> +}
> +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> +

Isn't this all rather racy?  I see that we can have multiple callers
racing to register.  Unregister is buggy, allowing any caller to decrement
the device reference regardless of whether there's one outstanding
through this interface.  The reporting callout can also race with an
unregistration.  Might need a mutex on iommu_param to avoid.
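
For instance, assuming a mutex is added to struct iommu_param (a sketch
only, not tested):

int iommu_register_device_fault_handler(struct device *dev,
					iommu_dev_fault_handler_t handler,
					void *data)
{
	struct iommu_param *idata = dev->iommu_param;
	struct iommu_fault_param *fparam;
	int ret = 0;

	if (!idata)
		return -EINVAL;

	mutex_lock(&idata->lock);	/* hypothetical lock in iommu_param */
	/* Only allow one fault handler registered for each device */
	if (idata->fault_param) {
		ret = -EBUSY;
		goto out_unlock;
	}
	fparam = kzalloc(sizeof(*fparam), GFP_KERNEL);
	if (!fparam) {
		ret = -ENOMEM;
		goto out_unlock;
	}
	fparam->handler = handler;
	fparam->data = data;
	get_device(dev);	/* only taken on success, put in unregister */
	idata->fault_param = fparam;
out_unlock:
	mutex_unlock(&idata->lock);
	return ret;
}

int iommu_unregister_device_fault_handler(struct device *dev)
{
	struct iommu_param *idata = dev->iommu_param;
	int ret = 0;

	if (!idata)
		return -EINVAL;

	mutex_lock(&idata->lock);
	if (!idata->fault_param) {
		ret = -EINVAL;	/* nothing registered, don't drop a ref */
	} else {
		kfree(idata->fault_param);
		idata->fault_param = NULL;
		put_device(dev);
	}
	mutex_unlock(&idata->lock);
	return ret;
}

The point being that the check, the allocation, and the reference count
all happen under the same lock, and the reference is only taken once a
handler is actually installed.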

>  /**
>   * iommu_group_id - Return ID for a group
>   * @group: the group to ID
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index dfda89b..841c044 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -463,6 +463,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
>  					 struct notifier_block *nb);
>  extern int iommu_group_unregister_notifier(struct iommu_group *group,
>  					   struct notifier_block *nb);
> +extern int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data);
> +
> +extern int iommu_unregister_device_fault_handler(struct device *dev);
> +
> +extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
> +
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> @@ -481,6 +489,12 @@ extern void iommu_domain_window_disable(struct iommu_domain *domain, u32 wnd_nr)
>  extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
>  			      unsigned long iova, int flags);
>  
> +static inline bool iommu_has_device_fault_handler(struct device *dev)
> +{
> +	return dev->iommu_param && dev->iommu_param->fault_param &&
> +		dev->iommu_param->fault_param->handler;
> +}
> +

This interface is racy by design; there's no guarantee that the
handler isn't immediately unregistered after this check.  Thanks,

Alex

>  static inline void iommu_flush_tlb_all(struct iommu_domain *domain)
>  {
>  	if (domain->ops->flush_iotlb_all)
> @@ -734,6 +748,28 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
>  	return 0;
>  }
>  
> +static inline int iommu_register_device_fault_handler(struct device *dev,
> +						iommu_dev_fault_handler_t handler,
> +						void *data)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline bool iommu_has_device_fault_handler(struct device *dev)
> +{
> +	return false;
> +}
> +
> +static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	return 0;
> +}
> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
  2017-12-06 19:25           ` Jacob Pan
@ 2017-12-07 21:51             ` Alex Williamson
  -1 siblings, 0 replies; 94+ messages in thread
From: Alex Williamson @ 2017-12-07 21:51 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, iommu, LKML, Joerg Roedel,
	David Woodhouse, Greg Kroah-Hartman, Rafael Wysocki, Lan Tianyu,
	Jean Delvare, Will Deacon, Kumar, Sanjay K

On Wed, 6 Dec 2017 11:25:21 -0800
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> On Tue, 5 Dec 2017 17:21:15 +0000
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> 
> > Hi Jacob,
> > 
> > On 04/12/17 21:37, Jacob Pan wrote:  
> > > On Fri, 24 Nov 2017 12:03:50 +0000
> > > Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> > >     
> > >> On 17/11/17 18:55, Jacob Pan wrote:    
> > >>> When nested translation is turned on and the guest owns the
> > >>> first level page tables, device page requests can be forwarded
> > >>> to the guest for handling faults. As the page response is returned
> > >>> by the guest, the IOMMU driver on the host needs to process the
> > >>> response, which informs the device and completes the page request
> > >>> transaction.
> > >>>
> > >>> This patch introduces a generic API function for page response
> > >>> passing from the guest or other in-kernel users. The definitions
> > >>> of the generic data are based on the PCI ATS specification and
> > >>> not limited to any vendor.
> > >>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>    
> > [...]  
> > > I think the simpler interface works for the in-kernel driver use case
> > > very well. But in the case of VFIO, the callback function does not turn
> > > around and send back the page response. The page response comes from
> > > the guest and QEMU, where they don't keep track of the PRQ event data.    
> > 
> > Is it safe to trust whatever response the guest or userspace gives
> > us? The answer seems fairly vendor- and device-specific so I wonder
> > if VFIO or IOMMU shouldn't do a bit of sanity checking somewhere, and
> > keep track of all injected page requests.

This is always my question when we start embedding IDs in structures.
> > 
> > From SMMUv3 POV, it seems safe (haven't looked at SMMUv2 but I'm not
> > so confident).
> > 
> > * The guest can only send page responses to devices assigned to it,
> > that's a given.
> >   
> Agree, IOMMU driver cannot enforce it. I think VFIO layer can make sure
> page response come from the assigned device and its guest/container.

Can we enforce it via the IOMMU/VFIO interface?  If the response is for
a struct device, and not an rid/did embedded in a structure, then vfio
can pass it through w/o worrying about it, i.e. the response comes in via
ioctl with an association to the vfio device fd -> struct vfio_device ->
struct device, and the iommu driver fills in the rid/did.
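
Something along these lines in the IOMMU layer, where the response entry
point and the model hook are hypothetical, so the requester ID is always
derived from the trusted struct device rather than taken from userspace:

#include <linux/pci.h>

int iommu_page_response(struct device *dev, struct page_response_msg *msg)
{
	struct pci_dev *pdev;

	if (!dev_is_pci(dev))
		return -ENODEV;

	pdev = to_pci_dev(dev);
	/* Overwrite whatever the caller put in the message; the ID comes
	 * from the struct device resolved via the vfio device fd */
	msg->rid = PCI_DEVID(pdev->bus->number, pdev->devfn);

	return iommu_model_page_response(dev, msg);	/* hypothetical hook */
}

Thanks,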

Alex

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
  2017-12-07 12:56             ` Jean-Philippe Brucker
  (?)
@ 2017-12-07 21:56             ` Alex Williamson
  2017-12-08 13:51                 ` Jean-Philippe Brucker
  -1 siblings, 1 reply; 94+ messages in thread
From: Alex Williamson @ 2017-12-07 21:56 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Lan Tianyu, Jean Delvare,
	Will Deacon, Kumar, Sanjay K

On Thu, 7 Dec 2017 12:56:55 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 06/12/17 19:25, Jacob Pan wrote:
> [...]
> >>   For SMMUv3, the stall buffer may be shared between devices on some
> >>   implementations, in which case the guest could prevent other
> >> devices to stall by letting the buffer fill up.  
> >>   -> We might have to keep track of stalls in the host driver and set  
> >> a credit or timeout to each stall, if it comes to that.  
> >>   -> In addition, send a terminate-all-stalls command when changing  
> >> the device's domain.
> >>  
> > We have the same situation in VT-d with shared queue which in turn may
> > affect other guests. Letting host driver maintain record of pending page
> > request seems the best way to go. VT-d has a way to drain PRQ per PASID
> > and RID combination. I guess this is the same as your
> > "terminate-all-stalls" but with finer control? Or
> > "terminate-all-stalls" only applies to a given device.  
> 
> That command terminates all stalls for a given device (for all PASIDs).
> It's a bit awkward to implement but should be enough to ensure that we
> don't leak any outstanding faults to the next VM.
> 
> > Seems we can implement a generic timeout/credit mechanism in IOMMU
> > driver with model specific action to drain/terminate. The timeout value
> > can also be model specific.  
> 
> Sounds good. Timeout seems a bit complicated to implement (and how do we
> guess what timeout would work?), so maybe it's simpler to enforce a quota
> of outstanding faults per VM, for example half of the shared queue size
> (the number can be chosen by the IOMMU driver). If a VM has that many
> outstanding faults, then any new fault is immediately terminated by the
> host. A bit rough but it might be enough to mitigate the problem
> initially, and we can always tweak it later (for instance disable faulting
> if a guest doesn't ever reply).
> 
> Seems like VFIO should enforce this quota, since the IOMMU layer doesn't
> know which device is assigned to which VM. If it's the IOMMU that enforces
> quotas per device and a VM has 15 devices assigned, then the guest can
> still DoS the IOMMU.

VFIO also doesn't know about VMs.  We know that devices attached to the
same container are probably used by the same user, but once we add
viommu, each device(group) uses its own container and we have no idea
they're associated.  So, no to VM based accounting, and it seems like
an IOMMU problem, X number of outstanding requests per device.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
@ 2017-12-08  1:17               ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-12-08  1:17 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Jean Delvare,
	Will Deacon, Kumar, Sanjay K, jacob.jun.pan

On Thu, 7 Dec 2017 12:56:55 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 06/12/17 19:25, Jacob Pan wrote:
> [...]
> >>   For SMMUv3, the stall buffer may be shared between devices on
> >> some implementations, in which case the guest could prevent other
> >> devices from stalling by letting the buffer fill up.  
> >>   -> We might have to keep track of stalls in the host driver and
> >> set a credit or timeout to each stall, if it comes to that.  
> >>   -> In addition, send a terminate-all-stalls command when
> >> changing the device's domain.
> >>  
> > We have the same situation in VT-d with a shared queue, which in turn
> > may affect other guests. Letting the host driver maintain a record of
> > pending page requests seems the best way to go. VT-d has a way to
> > drain the PRQ per PASID and RID combination. I guess this is the same
> > as your "terminate-all-stalls" but with finer control? Or does
> > "terminate-all-stalls" only apply to a given device?  
> 
> That command terminates all stalls for a given device (for all
> PASIDs). It's a bit awkward to implement but should be enough to
> ensure that we don't leak any outstanding faults to the next VM.
> 
OK. In any case, I think this terminate request should come from the
drivers or VFIO, not be initiated by the IOMMU.
> > Seems we can implement a generic timeout/credit mechanism in IOMMU
> > driver with model specific action to drain/terminate. The timeout
> > value can also be model specific.  
> 
> Sounds good. Timeout seems a bit complicated to implement (and how do
> we guess what timeout would work?), so maybe it's simpler to enforce
> a quota of outstanding faults per VM, for example half of the shared
> queue size (the number can be chosen by the IOMMU driver). If a VM
> has that many outstanding faults, then any new fault is immediately
> terminated by the host. A bit rough but it might be enough to
> mitigate the problem initially, and we can always tweak it later (for
> instance disable faulting if a guest doesn't ever reply).
> 
I have to make a correction/clarification: even though VT-d has a per-IOMMU
shared queue for the PRQ, we do not stall. Ashok reminded me of that.
So there is no constraint on the IOMMU if one of the guests does not
respond. All the pressure is on the device, which may have a limited number
of pending PRs.

> Seems like VFIO should enforce this quota, since the IOMMU layer
> doesn't know which device is assigned to which VM. If it's the IOMMU
> that enforces quotas per device and a VM has 15 devices assigned,
> then the guest can still DoS the IOMMU.
> 
I still think a timeout makes more sense than a quota, in that a VM could
be under quota but fail to respond to one of the devices forever.
I agree it is hard to devise a good timeout limit, but since this is to
prevent rare faults, we could pick a relatively large timeout. And we
only need to track the longest-pending request per device. The error
condition we try to prevent is not necessarily only stall buffer overflow
but timeout as well, right?
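
Roughly what I have in mind, as a sketch only (the fields, the timeout
value and the helper below are all made up for illustration, not actual
VT-d code):

/* hypothetical per-device bookkeeping, hung off the fault_param */
struct iommu_fault_param {
	iommu_dev_fault_handler_t handler;
	void *data;
	struct list_head pending;	/* outstanding page requests */
	ktime_t oldest;			/* submit time of the oldest request */
};

/* relatively large, could be made model specific */
#define PRQ_TIMEOUT_MS	10000

/* run periodically, e.g. from a delayed work item */
static void prq_check_timeout(struct device *dev)
{
	struct iommu_fault_param *fp = dev->iommu_param->fault_param;

	if (!list_empty(&fp->pending) &&
	    ktime_ms_delta(ktime_get(), fp->oldest) > PRQ_TIMEOUT_MS) {
		/* model specific action: drain/terminate all pending
		 * requests of this device and fail their responses */
	}
}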
> [...]
> >>> + * @type: group or stream response    
> >>
> >> The page request doesn't provide this information
> >>  
> > This is VT-d specific: it is in the VT-d page request descriptor, and
> > the response descriptors are different depending on the type.
> > Since we intend the generic data to be a superset of all models, I added
> > this field.  
> 
> But don't you need to add the stream type to enum iommu_fault_type, in
> patch 8? Otherwise the guest can't know what type to set in the
> response.
> 
> Thanks,
> Jean
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
@ 2017-12-08 13:51                 ` Jean-Philippe Brucker
  0 siblings, 0 replies; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-12-08 13:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Lan Tianyu, Jean Delvare,
	Will Deacon, Kumar, Sanjay K

On 07/12/17 21:56, Alex Williamson wrote:
[...]
>> Seems like VFIO should enforce this quota, since the IOMMU layer doesn't
>> know which device is assigned to which VM. If it's the IOMMU that enforces
>> quotas per device and a VM has 15 devices assigned, then the guest can
>> still DoS the IOMMU.
> 
> VFIO also doesn't know about VMs.  We know that devices attached to the
> same container are probably used by the same user, but once we add
> viommu, each device(group) uses its own container and we have no idea
> they're associated.  So, no to VM based accounting, and it seems like
> an IOMMU problem, X number of outstanding requests per device.  Thanks,

Ok. It's not clear anyway how the architecture and implementations expect
us to virtualize stall; I'll try to clarify that.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
  2017-12-08  1:17               ` Jacob Pan
@ 2017-12-08 13:51                 ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-12-08 13:51 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Jean Delvare,
	Will Deacon, Kumar, Sanjay K

On 08/12/17 01:17, Jacob Pan wrote:
[...]
>> Sounds good. Timeout seems a bit complicated to implement (and how do
>> we guess what timeout would work?), so maybe it's simpler to enforce
>> a quota of outstanding faults per VM, for example half of the shared
>> queue size (the number can be chosen by the IOMMU driver). If a VM
>> has that many outstanding faults, then any new fault is immediately
>> terminated by the host. A bit rough but it might be enough to
>> mitigate the problem initially, and we can always tweak it later (for
>> instance disable faulting if a guest doesn't ever reply).
>>
> I have to make a correction/clarification: even though VT-d has a per-IOMMU
> shared queue for the PRQ, we do not stall. Ashok reminded me of that.
> So there is no constraint on the IOMMU if one of the guests does not
> respond. All the pressure is on the device, which may have a limited number
> of pending PRs.

Right, that makes more sense: for PRI the IOMMU doesn't need to keep page
request state internally. Then it seems the problem only exists for Stall,
and someone's going to have a fun time working around it in the SMMU driver :(

>> Seems like VFIO should enforce this quota, since the IOMMU layer
>> doesn't know which device is assigned to which VM. If it's the IOMMU
>> that enforces quotas per device and a VM has 15 devices assigned,
>> then the guest can still DoS the IOMMU.
>>
> I still think a timeout makes more sense than a quota, in that a VM could
> be under quota but fail to respond to one of the devices forever.
> I agree it is hard to devise a good timeout limit, but since this is to
> prevent rare faults, we could pick a relatively large timeout. And we
> only need to track the longest-pending request per device. The error
> condition we try to prevent is not necessarily only stall buffer overflow
> but timeout as well, right?

Handling timeouts is less crucial than making sure a guest doesn't
monopolize all the shared resources, in my opinion. If a guest can't reply
to the injected faults, it's not really our problem as long as it doesn't
affect fault injection for other guests. We can reset the device and clean
pending faults when the guest terminates or resets the device.

I guess it's similar to IRQ injection: you don't care whether the guest
acknowledges your interrupt, as long as you make sure it is delivered.
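
To make the cleanup part above concrete, a very rough sketch of the hook
I have in mind; both helpers are hypothetical, nothing like this exists
yet:

/* run when the device is reset or detached from the guest's domain,
 * before the device can be handed to anyone else */
static void iommu_virt_fault_cleanup(struct device *dev)
{
	/*
	 * Abort whatever is outstanding for this device; on SMMUv3
	 * this would be a CMD_RESUME with the abort action (or a
	 * terminate-all-stalls style command) for each recorded stall.
	 */
	iommu_terminate_pending_faults(dev);

	/* drop the host-side records so the next VM starts clean */
	iommu_clear_fault_records(dev);
}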

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
  2017-12-07 21:51             ` Alex Williamson
  (?)
@ 2017-12-08 13:52             ` Jean-Philippe Brucker
  2017-12-08 20:40                 ` Jacob Pan
  -1 siblings, 1 reply; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-12-08 13:52 UTC (permalink / raw)
  To: Alex Williamson, Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Lan Tianyu, Jean Delvare, Will Deacon, Kumar,
	Sanjay K

On 07/12/17 21:51, Alex Williamson wrote:
>> Agree, the IOMMU driver cannot enforce it. I think the VFIO layer can make
>> sure page responses come from the assigned device and its guest/container.
> 
> Can we enforce it via the IOMMU/VFIO interface?  If the response is for
> a struct device, and not an rid/did embedded in a structure, then vfio
> can pass it through w/o worrying about it, ie. response comes in via
> ioctl with association to vfio device fd -> struct vfio_device -> struct
> device, iommu driver fills in rid/did.  Thanks,

Yes, that's probably the best way: reporting faults and receiving responses
on the device fd.
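
To sketch the receive side (the ioctl, the response struct and the
vfio_device layout here are all invented, just to show where the
association comes from):

/* stand-in for whatever the generic response struct ends up being */
struct iommu_page_response {
	u64 addr;
	u32 pasid;
	u32 resp_code;
};

extern int iommu_page_response(struct device *dev,
			       struct iommu_page_response *resp);

/* the ioctl arrives on a device fd, so VFIO already knows the struct
 * device and userspace never gets to pass a rid/did of its choosing */
static long vfio_pci_ioctl_page_response(struct vfio_device *vdev,
					 void __user *arg)
{
	struct iommu_page_response resp;

	if (copy_from_user(&resp, arg, sizeof(resp)))
		return -EFAULT;

	/* the IOMMU driver derives rid/did from the struct device */
	return iommu_page_response(vdev->dev, &resp);
}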

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
@ 2017-12-08 20:23       ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-12-08 20:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok,
	Jean Delvare, Christoph Hellwig, jacob.jun.pan

On Thu, 7 Dec 2017 14:27:25 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 17 Nov 2017 10:55:08 -0800
> Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> 
> > Traditionally, device-specific faults are detected and handled
> > within their own device drivers. When the IOMMU is enabled, faults such
> > as those on DMA transactions are detected by the IOMMU. There is no
> > generic reporting mechanism to report faults back to the in-kernel
> > device driver or the guest OS in case of assigned devices.
> > 
> > Faults detected by the IOMMU are based on the transaction's source ID,
> > which can be reported on a per-device basis, regardless of whether the
> > device is a PCI device or not.
> > 
> > The fault types include recoverable (e.g. page request) and
> > unrecoverable faults (e.g. access error). In most cases, faults can
> > be handled by IOMMU drivers internally. The primary use cases are as
> > follows:
> > 1. a page request fault originating from an SVM capable device that is
> > assigned to a guest via vIOMMU. In this case, the first level page
> > tables are owned by the guest. The page request must be propagated to
> > the guest to let the guest OS fault in the pages and then send a page
> > response. In this mechanism, the direct receiver of the IOMMU fault
> > notification is VFIO, which can relay notification events to QEMU
> > or other user space software.
> > 
> > 2. faults that need more subtle handling by device drivers. Rather than
> > simply invoking a reset function, there is a need to let the device
> > driver handle the fault with a smaller impact.
> > 
> > This patchset is intended to create a generic fault report API such
> > that it can scale as follows:
> > - all IOMMU types
> > - PCI and non-PCI devices
> > - recoverable and unrecoverable faults
> > - VFIO and other in-kernel users
> > - DMA & IRQ remapping (TBD)
> > The original idea was brought up by David Woodhouse and discussions
> > summarized at https://lwn.net/Articles/608914/.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > ---
> >  drivers/iommu/iommu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/iommu.h | 36 +++++++++++++++++++++++++++++
> >  2 files changed, 98 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 829e9e9..97b7990 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -581,6 +581,12 @@ int iommu_group_add_device(struct iommu_group
> > *group, struct device *dev) goto err_free_name;
> >  	}
> >  
> > +	dev->iommu_param = kzalloc(sizeof(struct
> > iommu_fault_param), GFP_KERNEL);
> > +	if (!dev->iommu_param) {
> > +		ret = -ENOMEM;
> > +		goto err_free_name;
> > +	}
> > +
> >  	kobject_get(group->devices_kobj);
> >  
> >  	dev->iommu_group = group;
> > @@ -657,7 +663,7 @@ void iommu_group_remove_device(struct device
> > *dev) sysfs_remove_link(&dev->kobj, "iommu_group");
> >  
> >  	trace_remove_device_from_group(group->id, dev);
> > -
> > +	kfree(dev->iommu_param);
> >  	kfree(device->name);
> >  	kfree(device);
> >  	dev->iommu_group = NULL;
> > @@ -791,6 +797,61 @@ int iommu_group_unregister_notifier(struct
> > iommu_group *group, }
> >  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> >  
> > +int iommu_register_device_fault_handler(struct device *dev,
> > +					iommu_dev_fault_handler_t
> > handler,
> > +					void *data)
> > +{
> > +	struct iommu_param *idata = dev->iommu_param;
> > +
> > +	/*
> > +	 * Device iommu_param should have been allocated when
> > device is
> > +	 * added to its iommu_group.
> > +	 */
> > +	if (!idata)
> > +		return -EINVAL;
> > +	/* Only allow one fault handler registered for each device
> > */
> > +	if (idata->fault_param)
> > +		return -EBUSY;
> > +	get_device(dev);
> > +	idata->fault_param =
> > +		kzalloc(sizeof(struct iommu_fault_param),
> > GFP_KERNEL);
> > +	if (!idata->fault_param)
> > +		return -ENOMEM;
> > +	idata->fault_param->handler = handler;
> > +	idata->fault_param->data = data;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> > +
> > +int iommu_unregister_device_fault_handler(struct device *dev)
> > +{
> > +	struct iommu_param *idata = dev->iommu_param;
> > +
> > +	if (!idata)
> > +		return -EINVAL;
> > +
> > +	kfree(idata->fault_param);
> > +	idata->fault_param = NULL;
> > +	put_device(dev);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> > +
> > +
> > +int iommu_report_device_fault(struct device *dev, struct
> > iommu_fault_event *evt) +{
> > +	/* we only report device fault if there is a handler
> > registered */
> > +	if (!dev->iommu_param || !dev->iommu_param->fault_param ||
> > +		!dev->iommu_param->fault_param->handler)
> > +		return -ENOSYS;
> > +
> > +	return dev->iommu_param->fault_param->handler(evt,
> > +
> > dev->iommu_param->fault_param->data); +}
> > +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> > +  
> 
> Isn't this all rather racy?  I see that we can have multiple callers
> racing to register.
I agree, we should use a lock here to guard against unregister. The
multiple-caller race won't happen, since only one caller can register a
handler.
>  Unregister is buggy, allowing any caller to
> decrement the device reference regardless of whether there's one
> outstanding through this interface.  The reporting callout can also
> race with an unregistration.  Might need a mutex on iommu_param to
> avoid.
> 
You are right, I forgot to check for an outstanding handler. Will add a mutex as well.
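Something along these lines perhaps; an untested sketch only, with a new
mutex in struct iommu_param and the rest mirroring the patch above:

struct iommu_param {
	struct mutex lock;	/* serializes fault_param updates */
	struct iommu_fault_param *fault_param;
};

int iommu_register_device_fault_handler(struct device *dev,
					iommu_dev_fault_handler_t handler,
					void *data)
{
	struct iommu_param *idata = dev->iommu_param;
	int ret = 0;

	if (!idata)
		return -EINVAL;

	mutex_lock(&idata->lock);
	/* only allow one fault handler per device */
	if (idata->fault_param) {
		ret = -EBUSY;
		goto out_unlock;
	}
	idata->fault_param = kzalloc(sizeof(*idata->fault_param), GFP_KERNEL);
	if (!idata->fault_param) {
		ret = -ENOMEM;
		goto out_unlock;
	}
	idata->fault_param->handler = handler;
	idata->fault_param->data = data;
	get_device(dev);	/* paired with put_device() in unregister */
out_unlock:
	mutex_unlock(&idata->lock);
	return ret;
}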

Thanks,
> >  /**
> >   * iommu_group_id - Return ID for a group
> >   * @group: the group to ID
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index dfda89b..841c044 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -463,6 +463,14 @@ extern int
> > iommu_group_register_notifier(struct iommu_group *group, struct
> > notifier_block *nb); extern int
> > iommu_group_unregister_notifier(struct iommu_group *group, struct
> > notifier_block *nb); +extern int
> > iommu_register_device_fault_handler(struct device *dev,
> > +					iommu_dev_fault_handler_t
> > handler,
> > +					void *data);
> > +
> > +extern int iommu_unregister_device_fault_handler(struct device
> > *dev); +
> > +extern int iommu_report_device_fault(struct device *dev, struct
> > iommu_fault_event *evt); +
> >  extern int iommu_group_id(struct iommu_group *group);
> >  extern struct iommu_group *iommu_group_get_for_dev(struct device
> > *dev); extern struct iommu_domain
> > *iommu_group_default_domain(struct iommu_group *); @@ -481,6
> > +489,12 @@ extern void iommu_domain_window_disable(struct
> > iommu_domain *domain, u32 wnd_nr) extern int
> > report_iommu_fault(struct iommu_domain *domain, struct device *dev,
> > unsigned long iova, int flags); +static inline bool
> > iommu_has_device_fault_handler(struct device *dev) +{
> > +	return dev->iommu_param && dev->iommu_param->fault_param &&
> > +		dev->iommu_param->fault_param->handler;
> > +}
> > +  
> 
> This interface is racy by design, there's no guarantee that the
> handler isn't immediately unregistered after this check. Thanks,
> 
Right, I will fold this check into the report function and protect it with
a lock. I was trying to save some cycles, but that does not work given the
race condition.
> Alex
> 
> >  static inline void iommu_flush_tlb_all(struct iommu_domain *domain)
> >  {
> >  	if (domain->ops->flush_iotlb_all)
> > @@ -734,6 +748,28 @@ static inline int
> > iommu_group_unregister_notifier(struct iommu_group *group, return 0;
> >  }
> >  
> > +static inline int iommu_register_device_fault_handler(struct
> > device *dev,
> > +
> > iommu_dev_fault_handler_t handler,
> > +						void *data)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline int iommu_unregister_device_fault_handler(struct
> > device *dev) +{
> > +	return 0;
> > +}
> > +
> > +static inline bool iommu_has_device_fault_handler(struct device
> > *dev) +{
> > +	return false;
> > +}
> > +
> > +static inline int iommu_report_device_fault(struct device *dev,
> > struct iommu_fault_event *evt) +{
> > +	return 0;
> > +}
> > +
> >  static inline int iommu_group_id(struct iommu_group *group)
> >  {
> >  	return -ENODEV;  
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
@ 2017-12-08 20:40                 ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-12-08 20:40 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Alex Williamson, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Lan Tianyu, Jean Delvare,
	Will Deacon, Kumar, Sanjay K, jacob.jun.pan

On Fri, 8 Dec 2017 13:52:00 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> On 07/12/17 21:51, Alex Williamson wrote:
> >> Agree, the IOMMU driver cannot enforce it. I think the VFIO layer can
> >> make sure page responses come from the assigned device and its
> >> guest/container.  
> > 
> > Can we enforce it via the IOMMU/VFIO interface?  If the response is
> > for a struct device, and not an rid/did embedded in a structure,
> > then vfio can pass it through w/o worrying about it, ie. response
> > comes in via ioctl with association to vfio device fd -> struct
> > vfio_device -> struct device, iommu driver fills in rid/did.
> > Thanks,  
> 
> Yes that's probably the best way, reporting faults and receiving
> responses on the device fd.
> 
Just to put these ideas into code: the IOMMU API used by VFIO takes a
struct device * (derived from the fd) and no did/rid (those are to be
derived from the struct device by the IOMMU driver).

int intel_iommu_page_response(struct iommu_domain *domain, struct device *dev,
			struct page_response_msg *msg)

The IOMMU driver can further sanitize by checking whether this matches a
pending page request for the device, and by refcounting outstanding PRQs.

Does it sound right?

/**
 * Generic page response information based on PCI ATS and PASID spec.
 * @addr: servicing page address
 * @pasid: contains process address space ID, used in shared virtual
 *         memory (SVM)
 * @resp_code: response code
 * @page_req_group_id: page request group index
 * @type: group or stream/single page response
 * @private_data: uniquely identifies device-specific private data for an
 *                individual page response
 */
struct page_response_msg {
	u64 addr;
	u32 pasid;
	u32 resp_code:4;
#define IOMMU_PAGE_RESP_SUCCESS	0
#define IOMMU_PAGE_RESP_INVALID	1
#define IOMMU_PAGE_RESP_FAILURE	0xF

	u32 pasid_present:1;
	u32 page_req_group_id : 9;
	enum page_response_type type;
	u32 private_data;
};
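
For the sanity check, something like this perhaps (the pending list, its
lock and the record struct are all made up for illustration):

/* hypothetical record kept per page request injected into the guest */
struct pending_req {
	struct list_head link;
	u32 pasid;
	u32 group_id;
};

static bool response_matches_pending(struct iommu_fault_param *fp,
				     struct page_response_msg *msg)
{
	struct pending_req *req;
	bool found = false;

	mutex_lock(&fp->lock);
	list_for_each_entry(req, &fp->pending, link) {
		if (req->pasid == msg->pasid &&
		    req->group_id == msg->page_req_group_id) {
			found = true;	/* ok to forward to the hardware */
			break;
		}
	}
	mutex_unlock(&fp->lock);

	return found;	/* false: unsolicited response, drop it */
}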

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
  2017-12-08 20:23       ` Jacob Pan
@ 2017-12-08 20:59         ` Alex Williamson
  -1 siblings, 0 replies; 94+ messages in thread
From: Alex Williamson @ 2017-12-08 20:59 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok,
	Jean Delvare, Christoph Hellwig

On Fri, 8 Dec 2017 12:23:58 -0800
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> On Thu, 7 Dec 2017 14:27:25 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Fri, 17 Nov 2017 10:55:08 -0800
> > Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> >   
> > > Traditionally, device-specific faults are detected and handled
> > > within their own device drivers. When the IOMMU is enabled, faults such
> > > as those on DMA transactions are detected by the IOMMU. There is no
> > > generic reporting mechanism to report faults back to the in-kernel
> > > device driver or the guest OS in case of assigned devices.
> > > 
> > > Faults detected by the IOMMU are based on the transaction's source ID,
> > > which can be reported on a per-device basis, regardless of whether the
> > > device is a PCI device or not.
> > > 
> > > The fault types include recoverable (e.g. page request) and
> > > unrecoverable faults (e.g. access error). In most cases, faults can
> > > be handled by IOMMU drivers internally. The primary use cases are as
> > > follows:
> > > 1. a page request fault originating from an SVM capable device that is
> > > assigned to a guest via vIOMMU. In this case, the first level page
> > > tables are owned by the guest. The page request must be propagated to
> > > the guest to let the guest OS fault in the pages and then send a page
> > > response. In this mechanism, the direct receiver of the IOMMU fault
> > > notification is VFIO, which can relay notification events to QEMU
> > > or other user space software.
> > > 
> > > 2. faults that need more subtle handling by device drivers. Rather than
> > > simply invoking a reset function, there is a need to let the device
> > > driver handle the fault with a smaller impact.
> > > 
> > > This patchset is intended to create a generic fault report API such
> > > that it can scale as follows:
> > > - all IOMMU types
> > > - PCI and non-PCI devices
> > > - recoverable and unrecoverable faults
> > > - VFIO and other in-kernel users
> > > - DMA & IRQ remapping (TBD)
> > > The original idea was brought up by David Woodhouse and discussions
> > > summarized at https://lwn.net/Articles/608914/.
> > > 
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> > > ---
> > >  drivers/iommu/iommu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  include/linux/iommu.h | 36 +++++++++++++++++++++++++++++
> > >  2 files changed, 98 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index 829e9e9..97b7990 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -581,6 +581,12 @@ int iommu_group_add_device(struct iommu_group
> > > *group, struct device *dev) goto err_free_name;
> > >  	}
> > >  
> > > +	dev->iommu_param = kzalloc(sizeof(struct
> > > iommu_fault_param), GFP_KERNEL);
> > > +	if (!dev->iommu_param) {
> > > +		ret = -ENOMEM;
> > > +		goto err_free_name;
> > > +	}
> > > +
> > >  	kobject_get(group->devices_kobj);
> > >  
> > >  	dev->iommu_group = group;
> > > @@ -657,7 +663,7 @@ void iommu_group_remove_device(struct device
> > > *dev) sysfs_remove_link(&dev->kobj, "iommu_group");
> > >  
> > >  	trace_remove_device_from_group(group->id, dev);
> > > -
> > > +	kfree(dev->iommu_param);
> > >  	kfree(device->name);
> > >  	kfree(device);
> > >  	dev->iommu_group = NULL;
> > > @@ -791,6 +797,61 @@ int iommu_group_unregister_notifier(struct
> > > iommu_group *group, }
> > >  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> > >  
> > > +int iommu_register_device_fault_handler(struct device *dev,
> > > +					iommu_dev_fault_handler_t
> > > handler,
> > > +					void *data)
> > > +{
> > > +	struct iommu_param *idata = dev->iommu_param;
> > > +
> > > +	/*
> > > +	 * Device iommu_param should have been allocated when
> > > device is
> > > +	 * added to its iommu_group.
> > > +	 */
> > > +	if (!idata)
> > > +		return -EINVAL;
> > > +	/* Only allow one fault handler registered for each device
> > > */
> > > +	if (idata->fault_param)
> > > +		return -EBUSY;
> > > +	get_device(dev);
> > > +	idata->fault_param =
> > > +		kzalloc(sizeof(struct iommu_fault_param),
> > > GFP_KERNEL);
> > > +	if (!idata->fault_param)
> > > +		return -ENOMEM;
> > > +	idata->fault_param->handler = handler;
> > > +	idata->fault_param->data = data;
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> > > +
> > > +int iommu_unregister_device_fault_handler(struct device *dev)
> > > +{
> > > +	struct iommu_param *idata = dev->iommu_param;
> > > +
> > > +	if (!idata)
> > > +		return -EINVAL;
> > > +
> > > +	kfree(idata->fault_param);
> > > +	idata->fault_param = NULL;
> > > +	put_device(dev);
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> > > +
> > > +
> > > +int iommu_report_device_fault(struct device *dev, struct
> > > iommu_fault_event *evt) +{
> > > +	/* we only report device fault if there is a handler
> > > registered */
> > > +	if (!dev->iommu_param || !dev->iommu_param->fault_param ||
> > > +		!dev->iommu_param->fault_param->handler)
> > > +		return -ENOSYS;
> > > +
> > > +	return dev->iommu_param->fault_param->handler(evt,
> > > +
> > > dev->iommu_param->fault_param->data); +}
> > > +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> > > +    
> > 
> > Isn't this all rather racy?  I see that we can have multiple callers
> > racing to register.  
> I agree, we should use a lock here to guard against unregister. The
> multiple-caller race won't happen, since only one caller can register a
> handler.

If you have multiple simultaneous callers to
iommu_register_device_fault_handler, they can all get past the test
for fault_param (testing and setting is not atomic), and then it's
indeterminate which handler gets installed.  Thanks,

Alex

> >  Unregister is buggy, allowing any caller to
> > decrement the device reference regardless of whether there's one
> > outstanding through this interface.  The reporting callout can also
> > race with an unregistration.  Might need a mutex on iommu_param to
> > avoid.
> >   
> You are right, I forgot to check for an outstanding handler. Will add a mutex as well.
> 
> Thanks,
> > >  /**
> > >   * iommu_group_id - Return ID for a group
> > >   * @group: the group to ID
> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > index dfda89b..841c044 100644
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -463,6 +463,14 @@ extern int
> > > iommu_group_register_notifier(struct iommu_group *group, struct
> > > notifier_block *nb); extern int
> > > iommu_group_unregister_notifier(struct iommu_group *group, struct
> > > notifier_block *nb); +extern int
> > > iommu_register_device_fault_handler(struct device *dev,
> > > +					iommu_dev_fault_handler_t
> > > handler,
> > > +					void *data);
> > > +
> > > +extern int iommu_unregister_device_fault_handler(struct device
> > > *dev); +
> > > +extern int iommu_report_device_fault(struct device *dev, struct
> > > iommu_fault_event *evt); +
> > >  extern int iommu_group_id(struct iommu_group *group);
> > >  extern struct iommu_group *iommu_group_get_for_dev(struct device
> > > *dev); extern struct iommu_domain
> > > *iommu_group_default_domain(struct iommu_group *); @@ -481,6
> > > +489,12 @@ extern void iommu_domain_window_disable(struct
> > > iommu_domain *domain, u32 wnd_nr) extern int
> > > report_iommu_fault(struct iommu_domain *domain, struct device *dev,
> > > unsigned long iova, int flags); +static inline bool
> > > iommu_has_device_fault_handler(struct device *dev) +{
> > > +	return dev->iommu_param && dev->iommu_param->fault_param &&
> > > +		dev->iommu_param->fault_param->handler;
> > > +}
> > > +    
> > 
> > This interface is racy by design; there's no guarantee that the
> > handler isn't immediately unregistered after this check. Thanks,
> >   
> Right, I will fold this check into the report function and protect it with
> a lock. I was trying to save some cycles, but that does not work given the
> race condition.
> > Alex
> >   
> > >  static inline void iommu_flush_tlb_all(struct iommu_domain *domain)
> > >  {
> > >  	if (domain->ops->flush_iotlb_all)
> > > @@ -734,6 +748,28 @@ static inline int
> > > iommu_group_unregister_notifier(struct iommu_group *group, return 0;
> > >  }
> > >  
> > > +static inline int iommu_register_device_fault_handler(struct
> > > device *dev,
> > > +
> > > iommu_dev_fault_handler_t handler,
> > > +						void *data)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static inline int iommu_unregister_device_fault_handler(struct
> > > device *dev) +{
> > > +	return 0;
> > > +}
> > > +
> > > +static inline bool iommu_has_device_fault_handler(struct device
> > > *dev) +{
> > > +	return false;
> > > +}
> > > +
> > > +static inline int iommu_report_device_fault(struct device *dev,
> > > struct iommu_fault_event *evt) +{
> > > +	return 0;
> > > +}
> > > +
> > >  static inline int iommu_group_id(struct iommu_group *group)
> > >  {
> > >  	return -ENODEV;    
> >   
> 
> [Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
@ 2017-12-08 20:59         ` Alex Williamson
  0 siblings, 0 replies; 94+ messages in thread
From: Alex Williamson @ 2017-12-08 20:59 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Lan Tianyu, Greg Kroah-Hartman, Rafael Wysocki, LKML,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Jean Delvare,
	David Woodhouse

On Fri, 8 Dec 2017 12:23:58 -0800
Jacob Pan <jacob.jun.pan-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> wrote:

> On Thu, 7 Dec 2017 14:27:25 -0700
> Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > On Fri, 17 Nov 2017 10:55:08 -0800
> > Jacob Pan <jacob.jun.pan-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> wrote:
> >   
> > > Traditionally, device specific faults are detected and handled
> > > within their own device drivers. When IOMMU is enabled, faults such
> > > as DMA related transactions are detected by IOMMU. There is no
> > > generic reporting mechanism to report faults back to the in-kernel
> > > device driver or the guest OS in case of assigned devices.
> > > 
> > > Faults detected by IOMMU is based on the transaction's source ID
> > > which can be reported at per device basis, regardless of the device
> > > type is a PCI device or not.
> > > 
> > > The fault types include recoverable (e.g. page request) and
> > > unrecoverable faults(e.g. access error). In most cases, faults can
> > > be handled by IOMMU drivers internally. The primary use cases are as
> > > follows:
> > > 1. page request fault originated from an SVM capable device that is
> > > assigned to guest via vIOMMU. In this case, the first level page
> > > tables are owned by the guest. Page request must be propagated to
> > > the guest to let guest OS fault in the pages then send page
> > > response. In this mechanism, the direct receiver of IOMMU fault
> > > notification is VFIO, which can relay notification events to QEMU
> > > or other user space software.
> > > 
> > > 2. faults need more subtle handling by device drivers. Other than
> > > simply invoke reset function, there are needs to let device driver
> > > handle the fault with a smaller impact.
> > > 
> > > This patchset is intended to create a generic fault report API such
> > > that it can scale as follows:
> > > - all IOMMU types
> > > - PCI and non-PCI devices
> > > - recoverable and unrecoverable faults
> > > - VFIO and other other in kernel users
> > > - DMA & IRQ remapping (TBD)
> > > The original idea was brought up by David Woodhouse and discussions
> > > summarized at https://lwn.net/Articles/608914/.
> > > 
> > > Signed-off-by: Jacob Pan <jacob.jun.pan-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> > > Signed-off-by: Ashok Raj <ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > > ---
> > >  drivers/iommu/iommu.c | 63
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++-
> > > include/linux/iommu.h | 36 +++++++++++++++++++++++++++++ 2 files
> > > changed, 98 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index 829e9e9..97b7990 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -581,6 +581,12 @@ int iommu_group_add_device(struct iommu_group
> > > *group, struct device *dev) goto err_free_name;
> > >  	}
> > >  
> > > +	dev->iommu_param = kzalloc(sizeof(struct
> > > iommu_fault_param), GFP_KERNEL);
> > > +	if (!dev->iommu_param) {
> > > +		ret = -ENOMEM;
> > > +		goto err_free_name;
> > > +	}
> > > +
> > >  	kobject_get(group->devices_kobj);
> > >  
> > >  	dev->iommu_group = group;
> > > @@ -657,7 +663,7 @@ void iommu_group_remove_device(struct device
> > > *dev) sysfs_remove_link(&dev->kobj, "iommu_group");
> > >  
> > >  	trace_remove_device_from_group(group->id, dev);
> > > -
> > > +	kfree(dev->iommu_param);
> > >  	kfree(device->name);
> > >  	kfree(device);
> > >  	dev->iommu_group = NULL;
> > > @@ -791,6 +797,61 @@ int iommu_group_unregister_notifier(struct
> > > iommu_group *group, }
> > >  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> > >  
> > > +int iommu_register_device_fault_handler(struct device *dev,
> > > +					iommu_dev_fault_handler_t
> > > handler,
> > > +					void *data)
> > > +{
> > > +	struct iommu_param *idata = dev->iommu_param;
> > > +
> > > +	/*
> > > +	 * Device iommu_param should have been allocated when
> > > device is
> > > +	 * added to its iommu_group.
> > > +	 */
> > > +	if (!idata)
> > > +		return -EINVAL;
> > > +	/* Only allow one fault handler registered for each device
> > > */
> > > +	if (idata->fault_param)
> > > +		return -EBUSY;
> > > +	get_device(dev);
> > > +	idata->fault_param =
> > > +		kzalloc(sizeof(struct iommu_fault_param),
> > > GFP_KERNEL);
> > > +	if (!idata->fault_param)
> > > +		return -ENOMEM;
> > > +	idata->fault_param->handler = handler;
> > > +	idata->fault_param->data = data;
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> > > +
> > > +int iommu_unregister_device_fault_handler(struct device *dev)
> > > +{
> > > +	struct iommu_param *idata = dev->iommu_param;
> > > +
> > > +	if (!idata)
> > > +		return -EINVAL;
> > > +
> > > +	kfree(idata->fault_param);
> > > +	idata->fault_param = NULL;
> > > +	put_device(dev);
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> > > +
> > > +
> > > +int iommu_report_device_fault(struct device *dev, struct
> > > iommu_fault_event *evt) +{
> > > +	/* we only report device fault if there is a handler
> > > registered */
> > > +	if (!dev->iommu_param || !dev->iommu_param->fault_param ||
> > > +		!dev->iommu_param->fault_param->handler)
> > > +		return -ENOSYS;
> > > +
> > > +	return dev->iommu_param->fault_param->handler(evt,
> > > +
> > > dev->iommu_param->fault_param->data); +}
> > > +EXPORT_SYMBOL_GPL(iommu_report_device_fault);
> > > +    
> > 
> > Isn't this all rather racy?  I see that we can have multiple callers
> > to register racing.  
> I agree, should use a lock here to prevent unregister. For multiple
> caller race, it won't happen since there is only one caller can
> register handler.

If you have multiple simultaneous callers to
iommu_register_device_fault_handler, they can all get past the test
for fault_param (testing and setting is not atomic), then it's
indeterminate which handler gets installed.  Thanks,

Alex

> >  Unregister is buggy, allowing any caller to
> > decrement the device reference regardless of whether there's one
> > outstanding through this interface.  The reporting callout can also
> > race with an unregistration.  Might need a mutex on iommu_param to
> > avoid.
> >   
> You are right, I forgot to check for an outstanding handler. Will add a mutex also.
> 
> Thanks,
> > >  /**
> > >   * iommu_group_id - Return ID for a group
> > >   * @group: the group to ID
> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > index dfda89b..841c044 100644
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -463,6 +463,14 @@ extern int iommu_group_register_notifier(struct iommu_group *group,
> > >  					struct notifier_block *nb);
> > >  extern int iommu_group_unregister_notifier(struct iommu_group *group,
> > >  					struct notifier_block *nb);
> > > +extern int iommu_register_device_fault_handler(struct device *dev,
> > > +					iommu_dev_fault_handler_t handler,
> > > +					void *data);
> > > +
> > > +extern int iommu_unregister_device_fault_handler(struct device *dev);
> > > +
> > > +extern int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt);
> > > +
> > >  extern int iommu_group_id(struct iommu_group *group);
> > >  extern struct iommu_group *iommu_group_get_for_dev(struct device *dev);
> > >  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
> > > @@ -481,6 +489,12 @@ extern void iommu_domain_window_disable(struct iommu_domain *domain, u32 wnd_nr)
> > >  extern int report_iommu_fault(struct iommu_domain *domain, struct device *dev,
> > >  			unsigned long iova, int flags);
> > > +static inline bool iommu_has_device_fault_handler(struct device *dev)
> > > +{
> > > +	return dev->iommu_param && dev->iommu_param->fault_param &&
> > > +		dev->iommu_param->fault_param->handler;
> > > +}
> > > +
> > 
> > This interface is racy by design, there's no guarantee that the
> > handler isn't immediately unregistered after this check. Thanks,
> >   
> right, I will fold this check into report function and protect by a
> lock. I was trying to save some cycles but it would not work with the
> race condition.
> > Alex
> >   
> > >  static inline void iommu_flush_tlb_all(struct iommu_domain *domain)
> > >  {
> > >  	if (domain->ops->flush_iotlb_all)
> > > @@ -734,6 +748,28 @@ static inline int iommu_group_unregister_notifier(struct iommu_group *group,
> > >  	return 0;
> > >  }
> > >  
> > > +static inline int iommu_register_device_fault_handler(struct device *dev,
> > > +						iommu_dev_fault_handler_t handler,
> > > +						void *data)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static inline int iommu_unregister_device_fault_handler(struct device *dev)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +static inline bool iommu_has_device_fault_handler(struct device *dev)
> > > +{
> > > +	return false;
> > > +}
> > > +
> > > +static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > >  static inline int iommu_group_id(struct iommu_group *group)
> > >  {
> > >  	return -ENODEV;
> >   
> 
> [Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
@ 2017-12-08 21:22       ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-12-08 21:22 UTC (permalink / raw)
  To: Lu Baolu
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Jean Delvare,
	jacob.jun.pan

On Tue, 5 Dec 2017 14:22:41 +0800
Lu Baolu <baolu.lu@linux.intel.com> wrote:

> > +int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> > +{
> > +	/* we only report device fault if there is a handler registered */
> > +	if (!dev->iommu_param || !dev->iommu_param->fault_param ||
> > +		!dev->iommu_param->fault_param->handler)
> 
> Can this replaced by:
> 
>     if (!iommu_has_device_fault_handler(dev))
> 
right, and under a lock too.
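i.e. roughly (a sketch only, assuming the mutex added to struct
iommu_param that we discussed in the other subthread):

int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
{
	struct iommu_param *idata = dev->iommu_param;
	int ret = -ENOSYS;

	if (!idata)
		return -ENOSYS;

	mutex_lock(&idata->lock);
	/* Only report the fault if a handler is registered, and keep the
	 * handler from being unregistered while we call it. */
	if (iommu_has_device_fault_handler(dev))
		ret = idata->fault_param->handler(evt,
						  idata->fault_param->data);
	mutex_unlock(&idata->lock);

	return ret;
}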

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
@ 2017-12-08 21:22           ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-12-08 21:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok,
	Jean Delvare, Christoph Hellwig, jacob.jun.pan

On Fri, 8 Dec 2017 13:59:09 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> > > 
> > > Isn't this all rather racy?  I see that we can have multiple
> > > callers racing to register.
> > I agree, we should use a lock here to prevent unregister. The multiple
> > caller race won't happen since only one caller can register a handler.
> 
> If you have multiple simultaneous callers to
> iommu_register_device_fault_handler, they can all get past the test
> for fault_param (testing and setting is not atomic), and then it is
> indeterminate which handler gets installed.  Thanks,
> 
I see, having the mutex would prevent it. Later callers would get
-EBUSY.
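Roughly (sketch only, with the same assumed mutex in struct iommu_param),
the unregister side becomes:

int iommu_unregister_device_fault_handler(struct device *dev)
{
	struct iommu_param *idata = dev->iommu_param;
	int ret = 0;

	if (!idata)
		return -EINVAL;

	mutex_lock(&idata->lock);
	/* Only drop the device reference if a handler was installed */
	if (!idata->fault_param) {
		ret = -EINVAL;
		goto done;
	}
	kfree(idata->fault_param);
	idata->fault_param = NULL;
	put_device(dev);
done:
	mutex_unlock(&idata->lock);
	return ret;
}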
Thanks a lot!
> Alex

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 15/16] iommu: introduce page response function
  2017-12-08 20:40                 ` Jacob Pan
@ 2017-12-08 23:01                   ` Alex Williamson
  -1 siblings, 0 replies; 94+ messages in thread
From: Alex Williamson @ 2017-12-08 23:01 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jean-Philippe Brucker, iommu, LKML, Joerg Roedel,
	David Woodhouse, Greg Kroah-Hartman, Rafael Wysocki, Lan Tianyu,
	Jean Delvare, Will Deacon, Kumar, Sanjay K

On Fri, 8 Dec 2017 12:40:17 -0800
Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:

> On Fri, 8 Dec 2017 13:52:00 +0000
> Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:
> 
> > On 07/12/17 21:51, Alex Williamson wrote:  
> > >> Agree, IOMMU driver cannot enforce it. I think VFIO layer can make
> > >> sure page response come from the assigned device and its
> > >> guest/container.    
> > > 
> > > Can we enforce it via the IOMMU/VFIO interface?  If the response is
> > > for a struct device, and not an rid/did embedded in a structure,
> > > then vfio can pass it through w/o worrying about it, ie. response
> > > comes in via ioctl with association to vfio device fd -> struct
> > > vfio_device -> struct device, iommu driver fills in rid/did.
> > > Thanks,    
> > 
> > Yes that's probably the best way, reporting faults and receiving
> > responses on the device fd.
> >   
> Just to put these ideas into code: the IOMMU API used by VFIO has a
> struct device* (derived from the fd) and no did/rid (to be derived from
> the struct device by the IOMMU driver).
> 
> int intel_iommu_page_response(struct iommu_domain *domain, struct device *dev,
> 			struct page_response_msg *msg)
> 
> The IOMMU driver can further sanitize the response by checking whether
> there is a pending page request for the device, and refcount outstanding PRQs.
> 
> Does it sound right?

Yep.  Thanks,

Alex
 
> /**
>  * Generic page response information based on PCI ATS and PASID spec.
>  * @addr: servicing page address
>  * @pasid: contains process address space ID, used in shared virtual
>  *         memory(SVM)
>  * @resp_code: response code
>  * @page_req_group_id: page request group index
>  * @type: group or stream/single page response
>  * @private_data: uniquely identify device-specific private data for an
>  *                individual page response
> 
>  */
> struct page_response_msg {
> 	u64 addr;
> 	u32 pasid;
> 	u32 resp_code:4;
> #define IOMMU_PAGE_RESP_SUCCESS	0
> #define IOMMU_PAGE_RESP_INVALID	1
> #define IOMMU_PAGE_RESP_FAILURE	0xF
> 
> 	u32 pasid_present:1;
> 	u32 page_req_group_id : 9;
> 	enum page_response_type type;
> 	u32 private_data;
> };
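
For the VFIO side, the flow could look roughly like below. This is a
hypothetical sketch: the ioctl plumbing and the vfio_domain() helper are
made up for illustration; the point is only that the rid/did never comes
from user space.

static long vfio_device_page_response(struct vfio_device *vdev,
				      struct page_response_msg __user *arg)
{
	struct page_response_msg msg;

	if (copy_from_user(&msg, arg, sizeof(msg)))
		return -EFAULT;

	/* vdev->dev identifies the device behind this fd; the IOMMU
	 * driver derives the rid/did from it, so the guest cannot
	 * respond on behalf of a device it does not own. */
	return intel_iommu_page_response(vfio_domain(vdev), vdev->dev, &msg);
}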

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 03/16] iommu: introduce iommu invalidate API function
@ 2017-12-15 19:02       ` Jean-Philippe Brucker
  0 siblings, 0 replies; 94+ messages in thread
From: Jean-Philippe Brucker @ 2017-12-15 19:02 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Liu, Yi L

A quick update on invalidations before I leave for holidays, since we're
struggling to define useful semantics. I worked on the virtio-iommu
prototype for vSVA, so I tried to pin down what I think is needed for vSVA
invalidation in the host. I don't know whether the VT-d and AMD emulations
can translate all of this from guest commands.

Scope selects which entries are invalidated, and flags cherry-pick what
caches to invalidate. For example a guest might remove GBs of sparse
mappings, and decide that it would be quicker to invalidate the whole
context instead of one at a time. Then it would set only flags = (TLB |
DEV_TLB) with scope = PASID. If the guest clears one entry in the PASID
table, then it would send scope = PASID and flags = (LEAF | CONFIG | TLB |
DEV_TLB). On an ARM system the guest can invalidate TLBs with CPU
instructions, but can't invalidate ATCs. So it would send an invalidate
with flags = (LEAF | TLB) and scope = VA.

enum iommu_sva_inval_scope {
	IOMMU_INVALIDATE_DOMAIN	= 1,
	IOMMU_INVALIDATE_PASID,
	IOMMU_INVALIDATE_VA,
};

/* Only invalidate leaf entry. Applies to PASID table if scope == PASID or
 * page tables if scope == VA. */
#define IOMMU_INVALIDATE_LEAF		(1 << 0)
/* Invalidate cached PASID table configuration */
#define IOMMU_INVALIDATE_CONFIG		(1 << 1)
/* Invalidate IOTLBs */
#define IOMMU_INVALIDATE_TLB		(1 << 2)
/* Invalidate ATCs */
#define IOMMU_INVALIDATE_DEV_TLB	(1 << 3)
/* + Need a global flag? */

struct iommu_sva_invalidate {
	enum iommu_sva_inval_scope	scope;
	u32				flags;
	u32				pasid;
	u64				iova;
	u64				size;
	/* Arch-specific, format is determined at bind time */
	union {
		struct {
			u16		asid;
			u8		granule;
		} arm;
	}
};

ARM needs two more fields. A 16-bit @asid (Address Space ID) targets TLB
entries and may be different from the PASID (up to the guest to decide);
the PASID targets ATC and config entries.

@granule is the TLB granule that we're invalidating. For instance if the
guest just unmapped a few 2M huge pages, it sets @granule to 21 bits, so
we issue fewer invalidation commands, since we only need to evict huge TLB
entries. I'm not sure about other architectures, but I'd be surprised if
this wasn't more common. Should we move it to the common part?


int iommu_sva_invalidate(struct iommu_domain *domain,
			 struct iommu_sva_invalidate *inval);
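
For example, the 2MB huge page case above would be expressed as (values
purely illustrative):

/* Guest unmapped sixteen 2MB huge pages for PASID 5: invalidate leaf
 * TLB entries only */
struct iommu_sva_invalidate inval = {
	.scope	= IOMMU_INVALIDATE_VA,
	.flags	= IOMMU_INVALIDATE_LEAF | IOMMU_INVALIDATE_TLB,
	.pasid	= 5,
	.iova	= 0x100200000,
	.size	= 16 * SZ_2M,
	.arm	= {
		.asid		= 1,
		.granule	= 21,	/* log2(2MB) */
	},
};

ret = iommu_sva_invalidate(domain, &inval);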

And so the host driver implementation is roughly:
--------------------------------------------------------------------------
bool leaf	= flags & IOMMU_INVALIDATE_LEAF;
bool config	= flags & IOMMU_INVALIDATE_CONFIG;
bool tlb	= flags & IOMMU_INVALIDATE_TLB;
bool atc	= flags & IOMMU_INVALIDATE_DEV_TLB;

if (config) {
	switch (scope) {
	case IOMMU_INVALIDATE_PASID:
		inval_cached_pasid_entry(domain, pasid, leaf);
		break;
	case IOMMU_INVALIDATE_DOMAIN:
		inval_all_cached_pasid_entries(domain);
		break;
	default:
		return -EINVAL;
	}

	/* Wait for caches to be clean, then invalidate TLBs */
	sync_commands();
}

if (tlb) {
	switch (scope) {
	case IOMMU_INVALIDATE_VA:
		inval_tlb_entries(domain, asid, iova, size, granule,
				  leaf);
		break;
	case IOMMU_INVALIDATE_PASID:
		inval_all_tlb_entries_for_asid(domain, asid);
		break;
	case IOMMU_INVALIDATE_DOMAIN:
		inval_all_tlb_entries(domain);
		break;
	default:
		return -EINVAL;
	}

	/* Wait for TLBs to be clean, then invalidate ATCs. */
	sync_commands();
}

if (atc) {
	/* ATC invalidations are sent to all devices in the domain */
	switch (scope) {
	case IOMMU_INVALIDATE_VA:
		inval_atc_entries(domain, pasid, iova, size);
		break;
	case IOMMU_INVALIDATE_PASID:
		/* Covers the full address space */
		inval_all_atc_entries_for_pasid(domain, pasid);
		break;
	case IOMMU_INVALIDATE_DOMAIN:
		/* Set Global Invalidate */
		inval_all_atc_entries(domain);
		break;
	default:
		return -EINVAL;
	}

	sync_commands();
}

/* Then return to guest. */
--------------------------------------------------------------------------

I think this covers what we need and allows userspace or the guest to
gather multiple invalidations into a single request/ioctl.

I don't think per-device ATC invalidation is needed, but might be wrong.
According to ATS it is implicit when the guest resets the device (FLR) or
disables the ATS capability. Are there other use-cases than reset? I still
need to see how QEMU handles when a device is detached from a domain (e.g.
its device table entry set to invalid). Kvmtool has one VFIO container per
device so can simply unmap-all to clear caches and TLBs when this happens.

Hope this helps,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 09/16] driver core: add iommu device fault reporting data
@ 2017-12-18 14:37     ` Greg Kroah-Hartman
  0 siblings, 0 replies; 94+ messages in thread
From: Greg Kroah-Hartman @ 2017-12-18 14:37 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Rafael Wysocki,
	Alex Williamson, Liu, Yi L, Lan Tianyu, Tian, Kevin, Raj Ashok,
	Jean Delvare, Christoph Hellwig

On Fri, Nov 17, 2017 at 10:55:07AM -0800, Jacob Pan wrote:
> DMA faults can be detected by the IOMMU at device level. Adding a pointer
> to struct device allows the IOMMU subsystem to report relevant faults
> back to the device driver for further handling.
> For directly assigned devices (or user space drivers), the guest OS holds
> responsibility to handle and respond to per device IOMMU faults.
> Therefore we need a fault reporting mechanism to propagate faults beyond
> the IOMMU subsystem.
> 
> There are two other IOMMU data pointers under struct device today; here
> we introduce iommu_param as a parent pointer such that all device IOMMU
> data can be consolidated there. The idea was suggested by Greg KH
> and Joerg. The name iommu_param is chosen since iommu_data has been used.
> 
> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Link: https://lkml.org/lkml/2017/10/6/81

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 03/16] iommu: introduce iommu invalidate API function
@ 2017-12-28 19:25       ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2017-12-28 19:25 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Liu, Yi L, Liu,
	Jean Delvare, jacob.jun.pan

On Fri, 24 Nov 2017 12:04:31 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Hi,
> 
> On 17/11/17 18:55, Jacob Pan wrote:
> > From: "Liu, Yi L" <yi.l.liu@linux.intel.com>
> > 
> > When an SVM capable device is assigned to a guest, the first level
> > page tables are owned by the guest and the guest PASID table
> > pointer is linked to the device context entry of the physical IOMMU.
> > 
> > Host IOMMU driver has no knowledge of caching structure updates
> > unless the guest invalidation activities are passed down to the
> > host. The primary usage is derived from emulated IOMMU in the
> > guest, where QEMU can trap invalidation activities before passing
> > them down to the host/physical IOMMU.
> > Since the invalidation data are obtained from user space and will be
> > written into the physical IOMMU, we must allow security checks at various
> > layers. Therefore, a generic invalidation data format is proposed
> > here; model-specific IOMMU drivers need to convert it into their
> > own format.
> > 
> > Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Ashok Raj <ashok.raj@intel.com>  
> [...]
> >  #endif /* __LINUX_IOMMU_H */
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 651ad5d..039ba36 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -36,4 +36,66 @@ struct pasid_table_config {
> >  	};
> >  };
> >  
> > +enum iommu_inv_granularity {
> > +	IOMMU_INV_GRANU_GLOBAL,		/* all TLBs invalidated */
> > +	IOMMU_INV_GRANU_DOMAIN,		/* all TLBs associated with a domain */
> > +	IOMMU_INV_GRANU_DEVICE,		/* caching structure associated with a
> > +					 * device ID
> > +					 */
> 
> I thought you were planning on removing these? If we do need global
> invalidation, for example the guest clears the whole PASID table and
> doesn't want to send individual GRANU_ALL_PASID invalidations, maybe
> keep only GRANU_DOMAIN?
> 
Yes, we can remove global and keep domain & pasid.
> > +	IOMMU_INV_GRANU_DOMAIN_PAGE,	/* address range with a domain */
> > +	IOMMU_INV_GRANU_ALL_PASID,	/* cache of a given PASID */
> > +	IOMMU_INV_GRANU_PASID_SEL,	/* only invalidate specified PASID */
> 
> GRANU_PASID_SEL seems redundant, don't you already get it by default
> with GRANU_ALL_PASID and GRANU_DOMAIN_PAGE (with
> IOMMU_INVALIDATE_PASID_TAGGED flag)?
> 
Yes, it can be deduced from certain combinations of flags. My thinking
was to allow an easy lookup from generic flags to model-specific
fields. Same for the one below. I will try to consolidate based on your
input in the next version.
> > +
> > +	IOMMU_INV_GRANU_NG_ALL_PASID,	/* non-global within
> > all PASIDs */
> > +	IOMMU_INV_GRANU_NG_PASID,	/* non-global within a
> > PASIDs */  
> 
> Don't you get the "NG" behavior by not passing the
> IOMMU_INVALIDATE_GLOBAL_PAGE flag defined below?
> 
> > +	IOMMU_INV_GRANU_PAGE_PASID,	/* page-selective within a PASID */
> 
> And don't you get this with
> GRANU_DOMAIN_PAGE+IOMMU_INVALIDATE_PASID_TAGGED?
> 
> > +	IOMMU_INV_NR_GRANU,
> > +};
> > +
> > +enum iommu_inv_type {
> > +	IOMMU_INV_TYPE_DTLB,	/* device IOTLB */
> > +	IOMMU_INV_TYPE_TLB,	/* IOMMU paging structure cache */
> > +	IOMMU_INV_TYPE_PASID,	/* PASID cache */
> > +	IOMMU_INV_TYPE_CONTEXT,	/* device context entry cache */
> > +	IOMMU_INV_NR_TYPE
> > +};  
> 
> When the guest removes a PASID entry, it would have to send DTLB, TLB
> and PASID invalidations separately? Could we define this inv_type as
> cumulative, to avoid redundant invalidation requests:
> 
That is a good idea, but it will require some changes to the VT-d driver.
For the emulated IOMMU, the current VT-d driver does send separate
requests for the PASID cache, followed by IOTLB/DTLB invalidation. But we
do have a caching mode capability bit to tell the driver whether it is
running on a real IOMMU or not, so we can combine requests and reduce
invalidation overhead as you said below. Not sure about AMD though?

> * TYPE_DTLB only invalidates ATC entries.
> * TYPE_TLB invalidates both ATC and IOTLB entries.
> * TYPE_PASID invalidates all ATC and IOTLB entries for a PASID, and
> also the PASID cache entry.
Sounds good to me.
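i.e. in the VT-d driver the dispatch could become roughly (sketch only,
helper names made up):

switch (inv_info->hdr.type) {
case IOMMU_INV_TYPE_PASID:
	flush_pasid_cache(iommu, did, pasid);
	/* fall through: also clean IOTLB entries for this PASID... */
case IOMMU_INV_TYPE_TLB:
	flush_iotlb(iommu, did, pasid, addr, size);
	/* fall through: ...and the device's ATC entries */
case IOMMU_INV_TYPE_DTLB:
	flush_dev_iotlb(info, pasid, addr, size);
	break;
default:
	return -EINVAL;
}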

> * TYPE_CONTEXT invalidates all. Although is it needed by userspace or
> just here for completeness? "CONTEXT" is specific to VT-d (doesn't
> exist on AMD and has a different meaning on SMMU), how about "DEVICE"
> instead?
It is here for completeness; the context entry is set during the
bind/unbind PASID table call. I can remove it for now.
> 
> This is important because invalidation will probably become the
> bottleneck. The guest shouldn't have to send DTLB and TLB invalidation
> separately after each unmapping.
> 
Agreed, I will change the VT-d driver to accommodate that, i.e. for an
emulated IOMMU (Caching Mode == 1), there is no need to send redundant
invalidation requests.
> > +/**
> > + * Translation cache invalidation header that contains mandatory meta data.
> > + * @version:	info format version, expecting future extensions
> > + * @type:	type of translation cache to be invalidated
> > + */
> > +struct tlb_invalidate_hdr {
> > +	__u32 version;
> > +#define TLB_INV_HDR_VERSION_1 1
> > +	enum iommu_inv_type type;
> > +};
> > +
> > +/**
> > + * Translation cache invalidation information, contains generic IOMMU
> > + * data which can be parsed based on model ID by model specific drivers.
> > + *
> > + * @granularity:	requested invalidation granularity, type dependent
> > + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB, etc.
> 
> Having only power of two invalidation seems too restrictive for a
> software interface. You might have the same problem as above, where
> the guest or userspace needs to send lots of invalidation requests.
> They could be multiplexed by passing an arbitrary range instead. How
> about making @size a __u64?
> 
Sure, if there is a need for non-power-of-two ranges. So it would be a
__u64 count of 4K pages?

> > + * @pasid:		processor address space ID value per PCI spec.
> > + * @addr:		page address to be invalidated
> > + * @flags	IOMMU_INVALIDATE_PASID_TAGGED: DMA with PASID tagged,
> > + *						@pasid validity can be
> > + *						deduced from @granularity
> 
> What's the use for this PASID_TAGGED flag if it doesn't define the
> @pasid validity?
> 
VT-d uses a different table format based on this PASID_TAGGED flag. With
PASID_TAGGED set, @pasid could still be invalid if the granularity is
not at the PASID-selective level.
> > + *		IOMMU_INVALIDATE_ADDR_LEAF: leaf paging entries  
> 
> LEAF could be reused for multi-level PASID tables, when your
> first-level table is already in place and you install a leaf entry,
> so maybe this could be:
> 
> "IOMMU_INVALIDATE_LEAF: only invalidate leaf table entry"
> 
Sounds good. I assume we will only have 2 levels for the foreseeable
future.
> Thanks,
> Jean
> 
> > + *		IOMMU_INVALIDATE_GLOBAL_PAGE: global pages
> > + *
> > + */
> > +struct tlb_invalidate_info {
> > +	struct tlb_invalidate_hdr	hdr;
> > +	enum iommu_inv_granularity	granularity;
> > +	__u32		flags;
> > +#define IOMMU_INVALIDATE_NO_PASID	(1 << 0)
> > +#define IOMMU_INVALIDATE_ADDR_LEAF	(1 << 1)
> > +#define IOMMU_INVALIDATE_GLOBAL_PAGE	(1 << 2)
> > +#define IOMMU_INVALIDATE_PASID_TAGGED	(1 << 3)
> > +	__u8		size;
> > +	__u32		pasid;
> > +	__u64		addr;
> > +};
> >  #endif /* _UAPI_IOMMU_H */
> >   
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 08/16] iommu: introduce device fault data
  2017-11-17 18:55 ` [PATCH v3 08/16] iommu: introduce device fault data Jacob Pan
  2017-11-24 12:03   ` Jean-Philippe Brucker
@ 2018-01-10 11:41   ` Jean-Philippe Brucker
  2018-01-11 21:10       ` Jacob Pan
  1 sibling, 1 reply; 94+ messages in thread
From: Jean-Philippe Brucker @ 2018-01-10 11:41 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Yi L, Liu

Hi Jacob,

On 17/11/17 18:55, Jacob Pan wrote:
[...]
> +/**
> + * struct iommu_fault_event - Generic per device fault data
> + *
> + * - PCI and non-PCI devices
> + * - Recoverable faults (e.g. page request), information based on PCI ATS
> + * and PASID spec.
> + * - Un-recoverable faults of device interest
> + * - DMA remapping and IRQ remapping faults
> +
> + * @type contains fault type.
> + * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
> + *         faults are not reported
> + * @addr: tells the offending page address
> + * @pasid: contains process address space ID, used in shared virtual memory(SVM)
> + * @rid: requestor ID
> + * @page_req_group_id: page request group index
> + * @last_req: last request in a page request group
> + * @pasid_valid: indicates if the PRQ has a valid PASID
> + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
> + * @device_private: if present, uniquely identify device-specific
> + *                  private data for an individual page request.
> + * @iommu_private: used by the IOMMU driver for storing fault-specific
> + *                 data. Users should not modify this field before
> + *                 sending the fault response.
> + */
> +struct iommu_fault_event {
> +	enum iommu_fault_type type;
> +	enum iommu_fault_reason reason;
> +	u64 addr;
> +	u32 pasid;
> +	u32 page_req_group_id : 9;

As I've been rebasing my work onto your series, I have a few more comments
about this structure. Is there any advantage in limiting the PRGI as a
bitfield? PCI uses 9 bits, but others might need more. For instance ARM
Stall uses 16-bit IDs to identify a fault event.

Could you please make it a u32 (as well as in page_response_msg), and
could page_req_group_id be renamed to simply "id"?
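
i.e. just a sketch of the fields being discussed:

struct iommu_fault_event {
	enum iommu_fault_type type;
	enum iommu_fault_reason reason;
	u64 addr;
	u32 pasid;
	u32 id;			/* wide enough for a PRGI or a 16-bit stall ID */
	u32 last_req : 1;
	u32 pasid_valid : 1;
	/* ... */
};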

> +	u32 last_req : 1;
> +	u32 pasid_valid : 1;
I noticed that page_response_msg in patch 15/16 calls this bit
"pasid_present". Could you rename it to "pasid_valid" for consistency?

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 03/16] iommu: introduce iommu invalidate API function
  2017-12-28 19:25       ` Jacob Pan
@ 2018-01-10 12:00         ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 94+ messages in thread
From: Jean-Philippe Brucker @ 2018-01-10 12:00 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Liu, Yi L, Liu,
	Jean Delvare

On 28/12/17 19:25, Jacob Pan wrote:
[...]
>>> + * @size:		2^size of 4K pages, 0 for 4k, 9 for 2MB,
>>> etc.  
>>
>> Having only power of two invalidation seems too restrictive for a
>> software interface. You might have the same problem as above, where
>> the guest or userspace needs to send lots of invalidation requests.
>> They could be multiplexed by passing an arbitrary range instead. How
>> about making @size a __u64?
>>
> Sure, if there is a need for non-power-of-two ranges. So it would be a
> __u64 count of 4K pages?

4k granule would work for us right now, but other architectures may plan
to support arbitrary sizes. The map/unmap API does support arbitrary
sizes, so it might be better to have a byte granularity in @size.
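
i.e. (sketch):

	__u64		size;	/* invalidation range size, in bytes */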

Thanks,
Jean

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
  2017-11-17 18:55   ` Jacob Pan
                     ` (2 preceding siblings ...)
  (?)
@ 2018-01-10 12:39   ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 94+ messages in thread
From: Jean-Philippe Brucker @ 2018-01-10 12:39 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu, Jean Delvare

On 17/11/17 18:55, Jacob Pan wrote:
[...]
> +static inline int iommu_register_device_fault_handler(struct device *dev,
> +						iommu_dev_fault_handler_t handler,
> +						void *data)
> +{
> +	return 0;
> +}
> +
> +static inline int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline bool iommu_has_device_fault_handler(struct device *dev)
> +{
> +	return false;
> +}
> +
> +static inline int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
> +{
> +	return 0;
> +}

Not too important but these stubs, when CONFIG_IOMMU_API is disabled,
usually return an error value (-ENODEV) instead of 0.
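
e.g. (sketch):

static inline int iommu_register_device_fault_handler(struct device *dev,
						iommu_dev_fault_handler_t handler,
						void *data)
{
	return -ENODEV;
}

and the same -ENODEV (or false for iommu_has_device_fault_handler) for the
other stubs.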

Thanks,
Jean

> +
>  static inline int iommu_group_id(struct iommu_group *group)
>  {
>  	return -ENODEV;
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 08/16] iommu: introduce device fault data
@ 2018-01-11 21:10       ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2018-01-11 21:10 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, Yi L, Liu,
	jacob.jun.pan

On Wed, 10 Jan 2018 11:41:58 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Hi Jacob,
> 
> On 17/11/17 18:55, Jacob Pan wrote:
> [...]
> > +/**
> > + * struct iommu_fault_event - Generic per device fault data
> > + *
> > + * - PCI and non-PCI devices
> > + * - Recoverable faults (e.g. page request), information based on PCI ATS
> > + * and PASID spec.
> > + * - Un-recoverable faults of device interest
> > + * - DMA remapping and IRQ remapping faults
> > +
> > + * @type contains fault type.
> > + * @reason fault reasons if relevant outside IOMMU driver, IOMMU driver internal
> > + *         faults are not reported
> > + * @addr: tells the offending page address
> > + * @pasid: contains process address space ID, used in shared virtual memory(SVM)
> > + * @rid: requestor ID
> > + * @page_req_group_id: page request group index
> > + * @last_req: last request in a page request group
> > + * @pasid_valid: indicates if the PRQ has a valid PASID
> > + * @prot: page access protection flag, e.g. IOMMU_FAULT_READ, IOMMU_FAULT_WRITE
> > + * @device_private: if present, uniquely identify device-specific
> > + *                  private data for an individual page request.
> > + * @iommu_private: used by the IOMMU driver for storing fault-specific
> > + *                 data. Users should not modify this field before
> > + *                 sending the fault response.
> > + */
> > +struct iommu_fault_event {
> > +	enum iommu_fault_type type;
> > +	enum iommu_fault_reason reason;
> > +	u64 addr;
> > +	u32 pasid;
> > +	u32 page_req_group_id : 9;  
> 
> As I've been rebasing my work onto your series, I have a few more
> comments about this structure. Is there any advantage in limiting the
> PRGI as a bitfield? PCI uses 9 bits, but others might need more. For
> instance ARM Stall uses 16-bit IDs to identify a fault event.
> 
> Could you please make it a u32 (as well as in page_response_msg), and
> could page_req_group_id be renamed to simply "id"?
> 
Sure, I will make it u32 in the v4 version of the patchset. I was using the
PCI standard as a base, with no specific advantage.
I am running into a little problem with testing, so perhaps next week.
> > +	u32 last_req : 1;
> > +	u32 pasid_valid : 1;  
> I noticed that page_response_msg in patch 15/16 calls this bit
> "pasid_present". Could you rename it to "pasid_valid" for consistency?
> 
Makes sense.
> Thanks,
> Jean

[Jacob Pan]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
  2017-11-17 18:55   ` Jacob Pan
                     ` (3 preceding siblings ...)
  (?)
@ 2018-01-18 19:24   ` Jean-Philippe Brucker
  2018-01-23 20:01     ` Jacob Pan
  -1 siblings, 1 reply; 94+ messages in thread
From: Jean-Philippe Brucker @ 2018-01-18 19:24 UTC (permalink / raw)
  To: Jacob Pan, iommu, LKML, Joerg Roedel, David Woodhouse,
	Greg Kroah-Hartman, Rafael Wysocki, Alex Williamson
  Cc: Lan Tianyu

Hi Jacob,

I've got minor comments after working with this patch, sorry for the
multiple replies.

On 17/11/17 18:55, Jacob Pan wrote:
[...]
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 829e9e9..97b7990 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -581,6 +581,12 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>  		goto err_free_name;
>  	}
>  
> +	dev->iommu_param = kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);

This should be "sizeof(struct iommu_param)" or maybe
"sizeof(*dev->iommu_param)".

> +	if (!dev->iommu_param) {
> +		ret = -ENOMEM;
> +		goto err_free_name;
> +	}
> +
>  	kobject_get(group->devices_kobj);
>  
>  	dev->iommu_group = group;
> @@ -657,7 +663,7 @@ void iommu_group_remove_device(struct device *dev)
>  	sysfs_remove_link(&dev->kobj, "iommu_group");
>  
>  	trace_remove_device_from_group(group->id, dev);
> -
> +	kfree(dev->iommu_param);
>  	kfree(device->name);
>  	kfree(device);
>  	dev->iommu_group = NULL;
> @@ -791,6 +797,61 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
>  }
>  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> 
> +int iommu_register_device_fault_handler(struct device *dev,
> +					iommu_dev_fault_handler_t handler,
> +					void *data)
> +{
> +	struct iommu_param *idata = dev->iommu_param;
> +
> +	/*
> +	 * Device iommu_param should have been allocated when device is
> +	 * added to its iommu_group.
> +	 */
> +	if (!idata)
> +		return -EINVAL;
> +	/* Only allow one fault handler registered for each device */
> +	if (idata->fault_param)
> +		return -EBUSY;
> +	get_device(dev);
> +	idata->fault_param =
> +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> +	if (!idata->fault_param)
> +		return -ENOMEM;
> +	idata->fault_param->handler = handler;
> +	idata->fault_param->data = data;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> +
> +int iommu_unregister_device_fault_handler(struct device *dev)
> +{
> +	struct iommu_param *idata = dev->iommu_param;
> +
> +	if (!idata)
> +		return -EINVAL;
> +
> +	kfree(idata->fault_param);
> +	idata->fault_param = NULL;
> +	put_device(dev);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);

We should probably document register() and unregister() functions since
they are part of the device driver API. If it helps I came up with:

/**
 * iommu_register_device_fault_handler() - Register a device fault handler
 * @dev: the device
 * @handler: the fault handler
 * @data: private data passed as argument to the handler
 *
 * When an IOMMU fault event is received, call this handler with the fault event
 * and data as arguments. The handler should return 0. If the fault is
 * recoverable (IOMMU_FAULT_PAGE_REQ), the handler must also complete
 * the fault by calling iommu_page_response() with one of the following
 * response codes:
 * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
 * - IOMMU_PAGE_RESP_INVALID: terminate the fault
 * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
 *   page faults if possible.
 *
 * Return 0 if the fault handler was installed successfully, or an error.
 */

/**
 * iommu_unregister_device_fault_handler() - Unregister the device fault handler
 * @dev: the device
 *
 * Remove the device fault handler installed with
 * iommu_register_device_fault_handler().
 *
 * Return 0 on success, or an error.
 */
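
As a usage illustration, here is a sketch of how a driver might pair the
two calls. The foo_* names are invented, and the handler arguments (fault
event plus private data) are assumed from the kernel-doc above rather
than taken verbatim from the series:

struct foo_device {
	struct device *dev;	/* the struct device the handler is registered on */
};

static int foo_dma_fault_handler(struct iommu_fault_event *evt, void *data)
{
	struct foo_device *fdev = data;

	/*
	 * Inspect evt here. A recoverable fault (IOMMU_FAULT_PAGE_REQ)
	 * must eventually be completed by calling iommu_page_response()
	 * with IOMMU_PAGE_RESP_SUCCESS, _INVALID or _FAILURE.
	 */
	return 0;
}

/* probe path */
ret = iommu_register_device_fault_handler(fdev->dev,
					  foo_dma_fault_handler, fdev);
if (ret)
	return ret;

/* remove path */
iommu_unregister_device_fault_handler(fdev->dev);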

Thanks,
Jean

* Re: [PATCH v3 10/16] iommu: introduce device fault report API
  2018-01-18 19:24   ` Jean-Philippe Brucker
@ 2018-01-23 20:01     ` Jacob Pan
  0 siblings, 0 replies; 94+ messages in thread
From: Jacob Pan @ 2018-01-23 20:01 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, LKML, Joerg Roedel, David Woodhouse, Greg Kroah-Hartman,
	Rafael Wysocki, Alex Williamson, Lan Tianyu, jacob.jun.pan

On Thu, 18 Jan 2018 19:24:52 +0000
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> wrote:

> Hi Jacob,
> 
> I've got a few minor comments after working with this patch; sorry for
> the multiple replies.
> 
> On 17/11/17 18:55, Jacob Pan wrote:
> [...]
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 829e9e9..97b7990 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -581,6 +581,12 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> >  		goto err_free_name;
> >  	}
> >  
> > +	dev->iommu_param = kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> 
> This should be "sizeof(struct iommu_param)" or maybe
> "sizeof(*dev->iommu_param)".
> 
Good catch, thanks.
> > +	if (!dev->iommu_param) {
> > +		ret = -ENOMEM;
> > +		goto err_free_name;
> > +	}
> > +
> >  	kobject_get(group->devices_kobj);
> >  
> >  	dev->iommu_group = group;
> > @@ -657,7 +663,7 @@ void iommu_group_remove_device(struct device *dev)
> >  	sysfs_remove_link(&dev->kobj, "iommu_group");
> >  
> >  	trace_remove_device_from_group(group->id, dev);
> > -
> > +	kfree(dev->iommu_param);
> >  	kfree(device->name);
> >  	kfree(device);
> >  	dev->iommu_group = NULL;
> > @@ -791,6 +797,61 @@ int iommu_group_unregister_notifier(struct iommu_group *group,
> >  }
> >  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
> > 
> > +int iommu_register_device_fault_handler(struct device *dev,
> > +					iommu_dev_fault_handler_t handler,
> > +					void *data)
> > +{
> > +	struct iommu_param *idata = dev->iommu_param;
> > +
> > +	/*
> > +	 * Device iommu_param should have been allocated when device is
> > +	 * added to its iommu_group.
> > +	 */
> > +	if (!idata)
> > +		return -EINVAL;
> > +	/* Only allow one fault handler registered for each device */
> > +	if (idata->fault_param)
> > +		return -EBUSY;
> > +	get_device(dev);
> > +	idata->fault_param =
> > +		kzalloc(sizeof(struct iommu_fault_param), GFP_KERNEL);
> > +	if (!idata->fault_param)
> > +		return -ENOMEM;
> > +	idata->fault_param->handler = handler;
> > +	idata->fault_param->data = data;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_register_device_fault_handler);
> > +
> > +int iommu_unregister_device_fault_handler(struct device *dev)
> > +{
> > +	struct iommu_param *idata = dev->iommu_param;
> > +
> > +	if (!idata)
> > +		return -EINVAL;
> > +
> > +	kfree(idata->fault_param);
> > +	idata->fault_param = NULL;
> > +	put_device(dev);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_unregister_device_fault_handler);
> 
> We should probably document register() and unregister() functions since
> they are part of the device driver API. If it helps I came up with:
> 
> /**
>  * iommu_register_device_fault_handler() - Register a device fault handler
>  * @dev: the device
>  * @handler: the fault handler
>  * @data: private data passed as argument to the handler
>  *
>  * When an IOMMU fault event is received, call this handler with the fault event
>  * and data as arguments. The handler should return 0. If the fault is
>  * recoverable (IOMMU_FAULT_PAGE_REQ), the handler must also complete
>  * the fault by calling iommu_page_response() with one of the following
>  * response codes:
>  * - IOMMU_PAGE_RESP_SUCCESS: retry the translation
>  * - IOMMU_PAGE_RESP_INVALID: terminate the fault
>  * - IOMMU_PAGE_RESP_FAILURE: terminate the fault and stop reporting
>  *   page faults if possible.
>  *
>  * Return 0 if the fault handler was installed successfully, or an error.
>  */
> 
> /**
>  * iommu_unregister_device_fault_handler() - Unregister the device fault handler
>  * @dev: the device
>  *
>  * Remove the device fault handler installed with
>  * iommu_register_device_fault_handler().
>  *
>  * Return 0 on success, or an error.
>  */
> 
Agreed, thanks. Sorry about the delay.
> Thanks,
> Jean

[Jacob Pan]

end of thread (newest message: 2018-01-23 20:01 UTC)

Thread overview: 94+ messages
2017-11-17 18:54 [PATCH v3 00/16] IOMMU driver support for SVM virtualization Jacob Pan
2017-11-17 18:54 ` [PATCH v3 01/16] iommu: introduce bind_pasid_table API function Jacob Pan
2017-11-24 12:04   ` Jean-Philippe Brucker
2017-11-29 22:01     ` Jacob Pan
2017-11-17 18:55 ` [PATCH v3 02/16] iommu/vt-d: add bind_pasid_table function Jacob Pan
2017-11-17 18:55 ` [PATCH v3 03/16] iommu: introduce iommu invalidate API function Jacob Pan
2017-11-24 12:04   ` Jean-Philippe Brucker
2017-12-15 19:02     ` Jean-Philippe Brucker
2017-12-28 19:25     ` Jacob Pan
2018-01-10 12:00       ` Jean-Philippe Brucker
2017-11-17 18:55 ` [PATCH v3 04/16] iommu/vt-d: move device_domain_info to header Jacob Pan
2017-11-17 18:55 ` [PATCH v3 05/16] iommu/vt-d: support flushing more TLB types Jacob Pan
2017-11-20 14:20   ` Lukoshkov, Maksim
2017-11-20 18:40     ` Jacob Pan
2017-11-17 18:55 ` [PATCH v3 06/16] iommu/vt-d: add svm/sva invalidate function Jacob Pan
2017-12-05  5:43   ` Lu Baolu
2017-11-17 18:55 ` [PATCH v3 07/16] iommu/vt-d: assign PFSID in device TLB invalidation Jacob Pan
2017-12-05  5:45   ` Lu Baolu
2017-11-17 18:55 ` [PATCH v3 08/16] iommu: introduce device fault data Jacob Pan
2017-11-24 12:03   ` Jean-Philippe Brucker
2017-11-29 21:55     ` Jacob Pan
2018-01-10 11:41   ` Jean-Philippe Brucker
2018-01-11 21:10     ` Jacob Pan
2017-11-17 18:55 ` [PATCH v3 09/16] driver core: add iommu device fault reporting data Jacob Pan
2017-12-18 14:37   ` Greg Kroah-Hartman
2017-11-17 18:55 ` [PATCH v3 10/16] iommu: introduce device fault report API Jacob Pan
2017-12-05  6:22   ` Lu Baolu
2017-12-08 21:22     ` Jacob Pan
2017-12-07 21:27   ` Alex Williamson
2017-12-08 20:23     ` Jacob Pan
2017-12-08 20:59       ` Alex Williamson
2017-12-08 21:22         ` Jacob Pan
2018-01-10 12:39   ` Jean-Philippe Brucker
2018-01-18 19:24   ` Jean-Philippe Brucker
2018-01-23 20:01     ` Jacob Pan
2017-11-17 18:55 ` [PATCH v3 11/16] iommu/vt-d: use threaded irq for dmar_fault Jacob Pan
2017-11-17 18:55 ` [PATCH v3 12/16] iommu/vt-d: report unrecoverable device faults Jacob Pan
2017-12-05  6:34   ` Lu Baolu
2017-11-17 18:55 ` [PATCH v3 13/16] iommu/intel-svm: notify page request to guest Jacob Pan
2017-12-05  7:37   ` Lu Baolu
2017-11-17 18:55 ` [PATCH v3 14/16] iommu/intel-svm: replace dev ops with fault report API Jacob Pan
2017-11-17 18:55 ` [PATCH v3 15/16] iommu: introduce page response function Jacob Pan
2017-11-24 12:03   ` Jean-Philippe Brucker
2017-12-04 21:37     ` Jacob Pan
2017-12-05 17:21       ` Jean-Philippe Brucker
2017-12-06 19:25         ` Jacob Pan
2017-12-07 12:56           ` Jean-Philippe Brucker
2017-12-07 21:56             ` Alex Williamson
2017-12-08 13:51               ` Jean-Philippe Brucker
2017-12-08  1:17             ` Jacob Pan
2017-12-08 13:51               ` Jean-Philippe Brucker
2017-12-07 21:51           ` Alex Williamson
2017-12-08 13:52             ` Jean-Philippe Brucker
2017-12-08 20:40               ` Jacob Pan
2017-12-08 23:01                 ` Alex Williamson
2017-11-17 18:55 ` [PATCH v3 16/16] iommu/vt-d: add intel iommu " Jacob Pan
