* [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support
@ 2020-02-24 18:23 Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier() Jean-Philippe Brucker
                   ` (26 more replies)
  0 siblings, 27 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao

Shared Virtual Addressing (SVA) allows process page tables to be shared
with devices through the IOMMU. Add a generic implementation of the IOMMU
SVA API, and add support in the Arm SMMUv3 driver.
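
For reference, a device driver consumes the bind() API mentioned below
roughly like this. This is only an illustrative sketch, with error handling
mostly omitted and dev/drvdata assumed to exist:

  struct iommu_sva *handle;
  int pasid;

  handle = iommu_sva_bind_device(dev, current->mm, drvdata);
  if (IS_ERR(handle))
          return PTR_ERR(handle);

  /* Program the PASID into the device and issue DMA on user virtual
   * addresses of the bound process... */
  pasid = iommu_sva_get_pasid(handle);

  /* ...then tear the bond down when the device is done with the mm */
  iommu_sva_unbind_device(handle);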

Previous versions of this patchset were sent over a year ago [1][2] but
we've made a lot of progress since then:

* ATS support for SMMUv3 was merged in v5.2.
* The bind() and fault reporting APIs have been merged in v5.3.
* IOASIDs were added in v5.5.
* SMMUv3 PASID support was added in v5.6, with some patches pending for v5.7.

* The first user of the bind() API will be merged in v5.7 [3]. The zip
  accelerator is also the first piece of hardware that I've been able to
  use for testing (previous versions were developed with software models)
  and I now have tools for evaluating SVA performance. Unfortunately I
  still don't have hardware that supports ATS and PRI; the zip accelerator
  uses stall.

These are the remaining changes for SVA support in SMMUv3. Since v3 [1]
I fixed countless bugs and - I think - addressed everyone's comments.
Thanks to recent MMU notifier rework, iommu-sva.c is a lot more
straightforward. I'm still unhappy with the complicated locking in the
SMMUv3 driver resulting from patch 12 (Seize private ASID), but I
haven't found anything better.

Please find all SVA patches on branches sva/current and sva/zip-devel at
https://jpbrucker.net/git/linux

[1] https://lore.kernel.org/linux-iommu/20180920170046.20154-1-jean-philippe.brucker@arm.com/
[2] https://lore.kernel.org/linux-iommu/20180511190641.23008-1-jean-philippe.brucker@arm.com/
[3] https://lore.kernel.org/linux-iommu/1581407665-13504-1-git-send-email-zhangfei.gao@linaro.org/

Jean-Philippe Brucker (26):
  mm/mmu_notifiers: pass private data down to alloc_notifier()
  iommu/sva: Manage process address spaces
  iommu: Add a page fault handler
  iommu/sva: Search mm by PASID
  iommu/iopf: Handle mm faults
  iommu/sva: Register page fault handler
  arm64: mm: Pin down ASIDs for sharing mm with devices
  iommu/io-pgtable-arm: Move some definitions to a header
  iommu/arm-smmu-v3: Manage ASIDs with xarray
  arm64: cpufeature: Export symbol read_sanitised_ftr_reg()
  iommu/arm-smmu-v3: Share process page tables
  iommu/arm-smmu-v3: Seize private ASID
  iommu/arm-smmu-v3: Add support for VHE
  iommu/arm-smmu-v3: Enable broadcast TLB maintenance
  iommu/arm-smmu-v3: Add SVA feature checking
  iommu/arm-smmu-v3: Add dev_to_master() helper
  iommu/arm-smmu-v3: Implement mm operations
  iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops
  iommu/arm-smmu-v3: Add support for Hardware Translation Table Update
  iommu/arm-smmu-v3: Maintain a SID->device structure
  iommu/arm-smmu-v3: Ratelimit event dump
  dt-bindings: document stall property for IOMMU masters
  iommu/arm-smmu-v3: Add stall support for platform devices
  PCI/ATS: Add PRI stubs
  PCI/ATS: Export symbols of PRI functions
  iommu/arm-smmu-v3: Add support for PRI

 .../devicetree/bindings/iommu/iommu.txt       |   18 +
 arch/arm64/include/asm/mmu.h                  |    1 +
 arch/arm64/include/asm/mmu_context.h          |   11 +-
 arch/arm64/kernel/cpufeature.c                |    1 +
 arch/arm64/mm/context.c                       |  103 +-
 drivers/iommu/Kconfig                         |   13 +
 drivers/iommu/Makefile                        |    2 +
 drivers/iommu/arm-smmu-v3.c                   | 1354 +++++++++++++++--
 drivers/iommu/io-pgfault.c                    |  533 +++++++
 drivers/iommu/io-pgtable-arm.c                |   27 +-
 drivers/iommu/io-pgtable-arm.h                |   30 +
 drivers/iommu/iommu-sva.c                     |  596 ++++++++
 drivers/iommu/iommu-sva.h                     |   64 +
 drivers/iommu/iommu.c                         |    1 +
 drivers/iommu/of_iommu.c                      |    5 +-
 drivers/misc/sgi-gru/grutlbpurge.c            |    4 +-
 drivers/pci/ats.c                             |    4 +
 include/linux/iommu.h                         |   73 +
 include/linux/mmu_notifier.h                  |   10 +-
 include/linux/pci-ats.h                       |    8 +
 mm/mmu_notifier.c                             |    6 +-
 21 files changed, 2699 insertions(+), 165 deletions(-)
 create mode 100644 drivers/iommu/io-pgfault.c
 create mode 100644 drivers/iommu/io-pgtable-arm.h
 create mode 100644 drivers/iommu/iommu-sva.c
 create mode 100644 drivers/iommu/iommu-sva.h

-- 
2.25.0



* [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 19:00   ` Jason Gunthorpe
  2020-03-05 16:36   ` Christoph Hellwig
  2020-02-24 18:23 ` [PATCH v4 02/26] iommu/sva: Manage process address spaces Jean-Philippe Brucker
                   ` (25 subsequent siblings)
  26 siblings, 2 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Andrew Morton,
	Arnd Bergmann, Dimitri Sivanich, Greg Kroah-Hartman,
	Jason Gunthorpe

The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
add a get/put scheme for the registration") provides a convenient way
for users to attach notifier data to an mm. However, it would be even
better to create this notifier data atomically.

Since the alloc_notifier() callback only takes an mm argument at the
moment, some users have to perform the allocation in two steps.
alloc_notifier() initially creates an incomplete structure, which is
then finalized using more context once mmu_notifier_get() returns. This
second step requires carrying an initialization lock in the notifier
data and playing dirty tricks to order memory accesses against live
invalidation.

To simplify MMU notifier allocation, pass an allocation context to
mmu_notifier_get().
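
For instance, a user that needs device context in its notifier can now
complete the initialization inside alloc_notifier() itself. The snippet
below is only an illustrative sketch; the my_* names are made up:

  struct my_notifier {
          struct mmu_notifier     mn;
          struct my_device        *mydev;
  };

  static struct mmu_notifier *my_alloc_notifier(struct mm_struct *mm,
                                                void *privdata)
  {
          struct my_notifier *mine;

          mine = kzalloc(sizeof(*mine), GFP_KERNEL);
          if (!mine)
                  return ERR_PTR(-ENOMEM);

          /* Fully initialized before the notifier becomes visible */
          mine->mydev = privdata;
          return &mine->mn;
  }

The caller simply passes its context along, for example
mmu_notifier_get(&my_mmu_ops, mm, mydev), and no longer needs a separate
initialization step after the notifier is published.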

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/misc/sgi-gru/grutlbpurge.c |  4 ++--
 include/linux/mmu_notifier.h       | 10 ++++++----
 mm/mmu_notifier.c                  |  6 ++++--
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 10921cd2608d..77610e1704f6 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -235,7 +235,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 		gms, range->start, range->end);
 }
 
-static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm)
+static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm, void *privdata)
 {
 	struct gru_mm_struct *gms;
 
@@ -266,7 +266,7 @@ struct gru_mm_struct *gru_register_mmu_notifier(void)
 {
 	struct mmu_notifier *mn;
 
-	mn = mmu_notifier_get_locked(&gru_mmuops, current->mm);
+	mn = mmu_notifier_get_locked(&gru_mmuops, current->mm, NULL);
 	if (IS_ERR(mn))
 		return ERR_CAST(mn);
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 736f6918335e..06e68fa2b019 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -207,7 +207,7 @@ struct mmu_notifier_ops {
 	 * callbacks are currently running. It is called from a SRCU callback
 	 * and cannot sleep.
 	 */
-	struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm);
+	struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm, void *privdata);
 	void (*free_notifier)(struct mmu_notifier *subscription);
 };
 
@@ -271,14 +271,16 @@ static inline int mm_has_notifiers(struct mm_struct *mm)
 }
 
 struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
-					     struct mm_struct *mm);
+					     struct mm_struct *mm,
+					     void *privdata);
 static inline struct mmu_notifier *
-mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm)
+mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm,
+		 void *privdata)
 {
 	struct mmu_notifier *ret;
 
 	down_write(&mm->mmap_sem);
-	ret = mmu_notifier_get_locked(ops, mm);
+	ret = mmu_notifier_get_locked(ops, mm, privdata);
 	up_write(&mm->mmap_sem);
 	return ret;
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index ef3973a5d34a..8beb9dcbe0fd 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -734,6 +734,7 @@ find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
  *                           the mm & ops
  * @ops: The operations struct being subscribe with
  * @mm : The mm to attach notifiers too
+ * @privdata: Initialization data passed down to ops->alloc_notifier()
  *
  * This function either allocates a new mmu_notifier via
  * ops->alloc_notifier(), or returns an already existing notifier on the
@@ -747,7 +748,8 @@ find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
  * and can be converted to an active mm pointer via mmget_not_zero().
  */
 struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
-					     struct mm_struct *mm)
+					     struct mm_struct *mm,
+					     void *privdata)
 {
 	struct mmu_notifier *subscription;
 	int ret;
@@ -760,7 +762,7 @@ struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
 			return subscription;
 	}
 
-	subscription = ops->alloc_notifier(mm);
+	subscription = ops->alloc_notifier(mm, privdata);
 	if (IS_ERR(subscription))
 		return subscription;
 	subscription->ops = ops;
-- 
2.25.0



* [PATCH v4 02/26] iommu/sva: Manage process address spaces
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier() Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-26 12:35   ` Jonathan Cameron
  2020-02-26 19:13   ` Jacob Pan
  2020-02-24 18:23 ` [PATCH v4 03/26] iommu: Add a page fault handler Jean-Philippe Brucker
                   ` (24 subsequent siblings)
  26 siblings, 2 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Add a small library to help IOMMU drivers manage process address spaces
bound to their devices. Register an MMU notifier to track modifications
to each address space bound to one or more devices.

IOMMU drivers must implement the io_mm_ops and can then use the helpers
provided by this library to easily implement the SVA API introduced by
commit 26b25a2b98e4. The io_mm_ops are:

void *alloc(struct mm_struct *)
  Allocate a PASID context private to the IOMMU driver. There is a
  single context per mm. IOMMU drivers may perform arch-specific
  operations in there, for example pinning down a CPU ASID (on Arm).

int attach(struct device *, int pasid, void *ctx, bool attach_domain)
  Attach a context to the device, by setting up the PASID table entry.

void invalidate(struct device *, int pasid, void *ctx,
                unsigned long vaddr, size_t size)
  Invalidate TLB entries for this address range. Cannot sleep.

void detach(struct device *, int pasid, void *ctx, bool detach_domain)
  Detach a context from the device, by clearing the PASID table entry
  and invalidating cached entries.

void release(void *ctx)
  Free a context. Cannot sleep.
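
As an illustration, an IOMMU driver plugs into the library along these
lines. This is only a rough sketch; the my_* callbacks are made up and
would contain the hardware-specific work described above:

  static const struct io_mm_ops my_io_mm_ops = {
          .alloc          = my_mm_alloc,          /* allocate per-mm context */
          .attach         = my_mm_attach,         /* write the PASID table entry */
          .invalidate     = my_mm_invalidate,     /* send TLB/ATC invalidations */
          .detach         = my_mm_detach,         /* clear entry, flush caches */
          .release        = my_mm_release,        /* free the per-mm context */
  };

  static struct iommu_sva *
  my_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
  {
          return iommu_sva_bind_generic(dev, mm, &my_io_mm_ops, drvdata);
  }

iommu_sva_unbind_generic() and iommu_sva_get_pasid_generic() back the
corresponding sva_unbind and sva_get_pasid iommu_ops in the same way.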

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/Kconfig     |   7 +
 drivers/iommu/Makefile    |   1 +
 drivers/iommu/iommu-sva.c | 561 ++++++++++++++++++++++++++++++++++++++
 drivers/iommu/iommu-sva.h |  64 +++++
 drivers/iommu/iommu.c     |   1 +
 include/linux/iommu.h     |   3 +
 6 files changed, 637 insertions(+)
 create mode 100644 drivers/iommu/iommu-sva.c
 create mode 100644 drivers/iommu/iommu-sva.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index d2fade984999..acca20e2da2f 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -102,6 +102,13 @@ config IOMMU_DMA
 	select IRQ_MSI_IOMMU
 	select NEED_SG_DMA_LENGTH
 
+# Shared Virtual Addressing library
+config IOMMU_SVA
+	bool
+	select IOASID
+	select IOMMU_API
+	select MMU_NOTIFIER
+
 config FSL_PAMU
 	bool "Freescale IOMMU support"
 	depends on PCI
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 9f33fdb3bb05..40c800dd4e3e 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
 obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
+obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
new file mode 100644
index 000000000000..64f1d1c82383
--- /dev/null
+++ b/drivers/iommu/iommu-sva.c
@@ -0,0 +1,561 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Manage PASIDs and bind process address spaces to devices.
+ *
+ * Copyright (C) 2018 ARM Ltd.
+ */
+
+#include <linux/idr.h>
+#include <linux/ioasid.h>
+#include <linux/iommu.h>
+#include <linux/sched/mm.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include "iommu-sva.h"
+
+/**
+ * DOC: io_mm model
+ *
+ * The io_mm keeps track of process address spaces shared between CPU and IOMMU.
+ * The following example illustrates the relation between structures
+ * iommu_domain, io_mm and iommu_sva. The iommu_sva struct is a bond between
+ * io_mm and device. A device can have multiple io_mm and an io_mm may be bound
+ * to multiple devices.
+ *              ___________________________
+ *             |  IOMMU domain A           |
+ *             |  ________________         |
+ *             | |  IOMMU group   |        +------- io_pgtables
+ *             | |                |        |
+ *             | |   dev 00:00.0 ----+------- bond 1 --- io_mm X
+ *             | |________________|   \    |
+ *             |                       '----- bond 2 ---.
+ *             |___________________________|             \
+ *              ___________________________               \
+ *             |  IOMMU domain B           |             io_mm Y
+ *             |  ________________         |             / /
+ *             | |  IOMMU group   |        |            / /
+ *             | |                |        |           / /
+ *             | |   dev 00:01.0 ------------ bond 3 -' /
+ *             | |   dev 00:01.1 ------------ bond 4 --'
+ *             | |________________|        |
+ *             |                           +------- io_pgtables
+ *             |___________________________|
+ *
+ * In this example, device 00:00.0 is in domain A, devices 00:01.* are in domain
+ * B. All devices within the same domain access the same address spaces. Device
+ * 00:00.0 accesses address spaces X and Y, each corresponding to an mm_struct.
+ * Devices 00:01.* only access address space Y. In addition each
+ * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, that is
+ * managed with iommu_map()/iommu_unmap(), and isn't shared with the CPU MMU.
+ *
+ * To obtain the above configuration, users would for instance issue the
+ * following calls:
+ *
+ *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> bond 1
+ *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> bond 2
+ *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> bond 3
+ *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> bond 4
+ *
+ * A single Process Address Space ID (PASID) is allocated for each mm. In the
+ * example, devices use PASID 1 to read/write into address space X and PASID 2
+ * to read/write into address space Y. Calling iommu_sva_get_pasid() on bond 1
+ * returns 1, and calling it on bonds 2-4 returns 2.
+ *
+ * Hardware tables describing this configuration in the IOMMU would typically
+ * look like this:
+ *
+ *                                PASID tables
+ *                                 of domain A
+ *                              .->+--------+
+ *                             / 0 |        |-------> io_pgtable
+ *                            /    +--------+
+ *            Device tables  /   1 |        |-------> pgd X
+ *              +--------+  /      +--------+
+ *      00:00.0 |      A |-'     2 |        |--.
+ *              +--------+         +--------+   \
+ *              :        :       3 |        |    \
+ *              +--------+         +--------+     --> pgd Y
+ *      00:01.0 |      B |--.                    /
+ *              +--------+   \                  |
+ *      00:01.1 |      B |----+   PASID tables  |
+ *              +--------+     \   of domain B  |
+ *                              '->+--------+   |
+ *                               0 |        |-- | --> io_pgtable
+ *                                 +--------+   |
+ *                               1 |        |   |
+ *                                 +--------+   |
+ *                               2 |        |---'
+ *                                 +--------+
+ *                               3 |        |
+ *                                 +--------+
+ *
+ * With this model, a single call binds all devices in a given domain to an
+ * address space. Other devices in the domain will get the same bond implicitly.
+ * However, users must issue one bind() for each device, because IOMMUs may
+ * implement SVA differently. Furthermore, mandating one bind() per device
+ * allows the driver to perform sanity-checks on device capabilities.
+ *
+ * In some IOMMUs, one entry of the PASID table (typically the first one) can
+ * hold non-PASID translations. In this case PASID 0 is reserved and the first
+ * entry points to the io_pgtable pointer. In other IOMMUs the io_pgtable
+ * pointer is held in the device table and PASID 0 is available to the
+ * allocator.
+ */
+
+struct io_mm {
+	struct list_head		devices;
+	struct mm_struct		*mm;
+	struct mmu_notifier		notifier;
+
+	/* Late initialization */
+	const struct io_mm_ops		*ops;
+	void				*ctx;
+	int				pasid;
+};
+
+#define to_io_mm(mmu_notifier)	container_of(mmu_notifier, struct io_mm, notifier)
+#define to_iommu_bond(handle)	container_of(handle, struct iommu_bond, sva)
+
+struct iommu_bond {
+	struct iommu_sva		sva;
+	struct io_mm __rcu		*io_mm;
+
+	struct list_head		mm_head;
+	void				*drvdata;
+	struct rcu_head			rcu_head;
+	refcount_t			refs;
+};
+
+static DECLARE_IOASID_SET(shared_pasid);
+
+static struct mmu_notifier_ops iommu_mmu_notifier_ops;
+
+/*
+ * Serializes modifications of bonds.
+ * Lock order: Device SVA mutex; global SVA mutex; IOASID lock
+ */
+static DEFINE_MUTEX(iommu_sva_lock);
+
+struct io_mm_alloc_params {
+	const struct io_mm_ops *ops;
+	int min_pasid, max_pasid;
+};
+
+static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm, void *privdata)
+{
+	int ret;
+	struct io_mm *io_mm;
+	struct io_mm_alloc_params *params = privdata;
+
+	io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL);
+	if (!io_mm)
+		return ERR_PTR(-ENOMEM);
+
+	io_mm->mm = mm;
+	io_mm->ops = params->ops;
+	INIT_LIST_HEAD(&io_mm->devices);
+
+	io_mm->pasid = ioasid_alloc(&shared_pasid, params->min_pasid,
+				    params->max_pasid, io_mm->mm);
+	if (io_mm->pasid == INVALID_IOASID) {
+		ret = -ENOSPC;
+		goto err_free_io_mm;
+	}
+
+	io_mm->ctx = params->ops->alloc(mm);
+	if (IS_ERR(io_mm->ctx)) {
+		ret = PTR_ERR(io_mm->ctx);
+		goto err_free_pasid;
+	}
+	return &io_mm->notifier;
+
+err_free_pasid:
+	ioasid_free(io_mm->pasid);
+err_free_io_mm:
+	kfree(io_mm);
+	return ERR_PTR(ret);
+}
+
+static void io_mm_free(struct mmu_notifier *mn)
+{
+	struct io_mm *io_mm = to_io_mm(mn);
+
+	WARN_ON(!list_empty(&io_mm->devices));
+
+	io_mm->ops->release(io_mm->ctx);
+	ioasid_free(io_mm->pasid);
+	kfree(io_mm);
+}
+
+/*
+ * io_mm_get - Allocate an io_mm or get the existing one for the given mm
+ * @mm: the mm
+ * @ops: callbacks for the IOMMU driver
+ * @min_pasid: minimum PASID value (inclusive)
+ * @max_pasid: maximum PASID value (inclusive)
+ *
+ * Returns a valid io_mm or an error pointer.
+ */
+static struct io_mm *io_mm_get(struct mm_struct *mm,
+			       const struct io_mm_ops *ops,
+			       int min_pasid, int max_pasid)
+{
+	struct io_mm *io_mm;
+	struct mmu_notifier *mn;
+	struct io_mm_alloc_params params = {
+		.ops		= ops,
+		.min_pasid	= min_pasid,
+		.max_pasid	= max_pasid,
+	};
+
+	/*
+	 * A single notifier can exist for this (ops, mm) pair. Allocate it if
+	 * necessary.
+	 */
+	mn = mmu_notifier_get(&iommu_mmu_notifier_ops, mm, &params);
+	if (IS_ERR(mn))
+		return ERR_CAST(mn);
+	io_mm = to_io_mm(mn);
+
+	if (WARN_ON(io_mm->ops != ops)) {
+		mmu_notifier_put(mn);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return io_mm;
+}
+
+static void io_mm_put(struct io_mm *io_mm)
+{
+	mmu_notifier_put(&io_mm->notifier);
+}
+
+static struct iommu_sva *
+io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata)
+{
+	int ret = 0;
+	bool attach_domain = true;
+	struct iommu_bond *bond, *tmp;
+	struct iommu_domain *domain, *other;
+	struct iommu_sva_param *param = dev->iommu_param->sva_param;
+
+	domain = iommu_get_domain_for_dev(dev);
+
+	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
+	if (!bond)
+		return ERR_PTR(-ENOMEM);
+
+	bond->sva.dev	= dev;
+	bond->drvdata	= drvdata;
+	refcount_set(&bond->refs, 1);
+	RCU_INIT_POINTER(bond->io_mm, io_mm);
+
+	mutex_lock(&iommu_sva_lock);
+	/* Is it already bound to the device or domain? */
+	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
+		if (tmp->sva.dev != dev) {
+			other = iommu_get_domain_for_dev(tmp->sva.dev);
+			if (domain == other)
+				attach_domain = false;
+
+			continue;
+		}
+
+		if (WARN_ON(tmp->drvdata != drvdata)) {
+			ret = -EINVAL;
+			goto err_free;
+		}
+
+		/*
+		 * Hold a single io_mm reference per bond. Note that we can't
+		 * return an error after this, otherwise the caller would drop
+		 * an additional reference to the io_mm.
+		 */
+		refcount_inc(&tmp->refs);
+		io_mm_put(io_mm);
+		kfree(bond);
+		mutex_unlock(&iommu_sva_lock);
+		return &tmp->sva;
+	}
+
+	list_add_rcu(&bond->mm_head, &io_mm->devices);
+	param->nr_bonds++;
+	mutex_unlock(&iommu_sva_lock);
+
+	ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx,
+				 attach_domain);
+	if (ret)
+		goto err_remove;
+
+	return &bond->sva;
+
+err_remove:
+	/*
+	 * At this point concurrent threads may have started to access the
+	 * io_mm->devices list in order to invalidate address ranges, which
+	 * requires freeing the bond via kfree_rcu()
+	 */
+	mutex_lock(&iommu_sva_lock);
+	param->nr_bonds--;
+	list_del_rcu(&bond->mm_head);
+
+err_free:
+	mutex_unlock(&iommu_sva_lock);
+	kfree_rcu(bond, rcu_head);
+	return ERR_PTR(ret);
+}
+
+static void io_mm_detach_locked(struct iommu_bond *bond)
+{
+	struct io_mm *io_mm;
+	struct iommu_bond *tmp;
+	bool detach_domain = true;
+	struct iommu_domain *domain, *other;
+
+	io_mm = rcu_dereference_protected(bond->io_mm,
+					  lockdep_is_held(&iommu_sva_lock));
+	if (!io_mm)
+		return;
+
+	domain = iommu_get_domain_for_dev(bond->sva.dev);
+
+	/* Are other devices in the same domain still attached to this mm? */
+	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
+		if (tmp == bond)
+			continue;
+		other = iommu_get_domain_for_dev(tmp->sva.dev);
+		if (domain == other) {
+			detach_domain = false;
+			break;
+		}
+	}
+
+	io_mm->ops->detach(bond->sva.dev, io_mm->pasid, io_mm->ctx,
+			   detach_domain);
+
+	list_del_rcu(&bond->mm_head);
+	RCU_INIT_POINTER(bond->io_mm, NULL);
+
+	/* Free after RCU grace period */
+	io_mm_put(io_mm);
+}
+
+/*
+ * io_mm_release - release MMU notifier
+ *
+ * Called when the mm exits. Some devices may still be bound to the io_mm. A few
+ * things need to be done before it is safe to release:
+ *
+ * - Tell the device driver to stop using this PASID.
+ * - Clear the PASID table and invalidate TLBs.
+ * - Drop all references to this io_mm.
+ */
+static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct iommu_bond *bond, *next;
+	struct io_mm *io_mm = to_io_mm(mn);
+
+	mutex_lock(&iommu_sva_lock);
+	list_for_each_entry_safe(bond, next, &io_mm->devices, mm_head) {
+		struct device *dev = bond->sva.dev;
+		struct iommu_sva *sva = &bond->sva;
+
+		if (sva->ops && sva->ops->mm_exit &&
+		    sva->ops->mm_exit(dev, sva, bond->drvdata))
+			dev_WARN(dev, "possible leak of PASID %u",
+				 io_mm->pasid);
+
+		/* unbind() frees the bond, we just detach it */
+		io_mm_detach_locked(bond);
+	}
+	mutex_unlock(&iommu_sva_lock);
+}
+
+static void io_mm_invalidate_range(struct mmu_notifier *mn,
+				   struct mm_struct *mm, unsigned long start,
+				   unsigned long end)
+{
+	struct iommu_bond *bond;
+	struct io_mm *io_mm = to_io_mm(mn);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
+		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
+				       start, end - start);
+	rcu_read_unlock();
+}
+
+static struct mmu_notifier_ops iommu_mmu_notifier_ops = {
+	.alloc_notifier		= io_mm_alloc,
+	.free_notifier		= io_mm_free,
+	.release		= io_mm_release,
+	.invalidate_range	= io_mm_invalidate_range,
+};
+
+struct iommu_sva *
+iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm,
+		       const struct io_mm_ops *ops, void *drvdata)
+{
+	struct io_mm *io_mm;
+	struct iommu_sva *handle;
+	struct iommu_param *param = dev->iommu_param;
+
+	if (!param)
+		return ERR_PTR(-ENODEV);
+
+	mutex_lock(&param->sva_lock);
+	if (!param->sva_param) {
+		handle = ERR_PTR(-ENODEV);
+		goto out_unlock;
+	}
+
+	io_mm = io_mm_get(mm, ops, param->sva_param->min_pasid,
+			  param->sva_param->max_pasid);
+	if (IS_ERR(io_mm)) {
+		handle = ERR_CAST(io_mm);
+		goto out_unlock;
+	}
+
+	handle = io_mm_attach(dev, io_mm, drvdata);
+	if (IS_ERR(handle))
+		io_mm_put(io_mm);
+
+out_unlock:
+	mutex_unlock(&param->sva_lock);
+	return handle;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_bind_generic);
+
+static void iommu_sva_unbind_locked(struct iommu_bond *bond)
+{
+	struct device *dev = bond->sva.dev;
+	struct iommu_sva_param *param = dev->iommu_param->sva_param;
+
+	if (!refcount_dec_and_test(&bond->refs))
+		return;
+
+	io_mm_detach_locked(bond);
+	param->nr_bonds--;
+	kfree_rcu(bond, rcu_head);
+}
+
+void iommu_sva_unbind_generic(struct iommu_sva *handle)
+{
+	struct iommu_param *param = handle->dev->iommu_param;
+
+	if (WARN_ON(!param))
+		return;
+
+	mutex_lock(&param->sva_lock);
+	mutex_lock(&iommu_sva_lock);
+	iommu_sva_unbind_locked(to_iommu_bond(handle));
+	mutex_unlock(&iommu_sva_lock);
+	mutex_unlock(&param->sva_lock);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_unbind_generic);
+
+/**
+ * iommu_sva_enable() - Enable Shared Virtual Addressing for a device
+ * @dev: the device
+ * @sva_param: the parameters.
+ *
+ * Called by an IOMMU driver to set up the SVA parameters.
+ * @sva_param is duplicated and can be freed when this function returns.
+ *
+ * Return 0 if initialization succeeded, or an error.
+ */
+int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param)
+{
+	int ret;
+	struct iommu_sva_param *new_param;
+	struct iommu_param *param = dev->iommu_param;
+
+	if (!param)
+		return -ENODEV;
+
+	new_param = kmemdup(sva_param, sizeof(*new_param), GFP_KERNEL);
+	if (!new_param)
+		return -ENOMEM;
+
+	mutex_lock(&param->sva_lock);
+	if (param->sva_param) {
+		ret = -EEXIST;
+		goto err_unlock;
+	}
+
+	dev->iommu_param->sva_param = new_param;
+	mutex_unlock(&param->sva_lock);
+	return 0;
+
+err_unlock:
+	mutex_unlock(&param->sva_lock);
+	kfree(new_param);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_enable);
+
+/**
+ * iommu_sva_disable() - Disable Shared Virtual Addressing for a device
+ * @dev: the device
+ *
+ * IOMMU drivers call this to disable SVA.
+ */
+int iommu_sva_disable(struct device *dev)
+{
+	int ret = 0;
+	struct iommu_param *param = dev->iommu_param;
+
+	if (!param)
+		return -EINVAL;
+
+	mutex_lock(&param->sva_lock);
+	if (!param->sva_param) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
+	/* Require that all contexts are unbound */
+	if (param->sva_param->nr_bonds) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	kfree(param->sva_param);
+	param->sva_param = NULL;
+out_unlock:
+	mutex_unlock(&param->sva_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_disable);
+
+bool iommu_sva_enabled(struct device *dev)
+{
+	bool enabled;
+	struct iommu_param *param = dev->iommu_param;
+
+	if (!param)
+		return false;
+
+	mutex_lock(&param->sva_lock);
+	enabled = !!param->sva_param;
+	mutex_unlock(&param->sva_lock);
+	return enabled;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_enabled);
+
+int iommu_sva_get_pasid_generic(struct iommu_sva *handle)
+{
+	struct io_mm *io_mm;
+	int pasid = IOMMU_PASID_INVALID;
+	struct iommu_bond *bond = to_iommu_bond(handle);
+
+	rcu_read_lock();
+	io_mm = rcu_dereference(bond->io_mm);
+	if (io_mm)
+		pasid = io_mm->pasid;
+	rcu_read_unlock();
+	return pasid;
+}
+EXPORT_SYMBOL_GPL(iommu_sva_get_pasid_generic);
diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/iommu-sva.h
new file mode 100644
index 000000000000..dd55c2db0936
--- /dev/null
+++ b/drivers/iommu/iommu-sva.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * SVA library for IOMMU drivers
+ */
+#ifndef _IOMMU_SVA_H
+#define _IOMMU_SVA_H
+
+#include <linux/iommu.h>
+#include <linux/kref.h>
+#include <linux/mmu_notifier.h>
+
+struct io_mm_ops {
+	/* Allocate a PASID context for an mm */
+	void *(*alloc)(struct mm_struct *mm);
+
+	/*
+	 * Attach a PASID context to a device. Write the entry into the PASID
+	 * table.
+	 *
+	 * @attach_domain is true when no other device in the IOMMU domain is
+	 *   already attached to this context. IOMMU drivers that share the
+	 *   PASID tables within a domain don't need to write the PASID entry
+	 *   when @attach_domain is false.
+	 */
+	int (*attach)(struct device *dev, int pasid, void *ctx,
+		      bool attach_domain);
+
+	/*
+	 * Detach a PASID context from a device. Clear the entry from the PASID
+	 * table and invalidate if necessary.
+	 *
+	 * @detach_domain is true when no other device in the IOMMU domain is
+	 *   still attached to this context. IOMMU drivers that share the PASID
+	 *   table within a domain don't need to clear the PASID entry when
+	 *   @detach_domain is false, only invalidate the caches.
+	 */
+	void (*detach)(struct device *dev, int pasid, void *ctx,
+		       bool detach_domain);
+
+	/* Invalidate a range of addresses. Cannot sleep. */
+	void (*invalidate)(struct device *dev, int pasid, void *ctx,
+			   unsigned long vaddr, size_t size);
+
+	/* Free a context. Cannot sleep. */
+	void (*release)(void *ctx);
+};
+
+struct iommu_sva_param {
+	u32			min_pasid;
+	u32			max_pasid;
+	int			nr_bonds;
+};
+
+struct iommu_sva *
+iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm,
+		       const struct io_mm_ops *ops, void *drvdata);
+void iommu_sva_unbind_generic(struct iommu_sva *handle);
+int iommu_sva_get_pasid_generic(struct iommu_sva *handle);
+
+int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param);
+int iommu_sva_disable(struct device *dev);
+bool iommu_sva_enabled(struct device *dev);
+
+#endif /* _IOMMU_SVA_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3e3528436e0b..c8bd972c1788 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -164,6 +164,7 @@ static struct iommu_param *iommu_get_dev_param(struct device *dev)
 		return NULL;
 
 	mutex_init(&param->lock);
+	mutex_init(&param->sva_lock);
 	dev->iommu_param = param;
 	return param;
 }
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 1739f8a7a4b4..83397ae88d2d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -368,6 +368,7 @@ struct iommu_fault_param {
  * struct iommu_param - collection of per-device IOMMU data
  *
  * @fault_param: IOMMU detected device fault reporting data
+ * @sva_param: IOMMU parameter for SVA
  *
  * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
  *	struct iommu_group	*iommu_group;
@@ -376,6 +377,8 @@ struct iommu_fault_param {
 struct iommu_param {
 	struct mutex lock;
 	struct iommu_fault_param *fault_param;
+	struct mutex sva_lock;
+	struct iommu_sva_param *sva_param;
 };
 
 int  iommu_device_register(struct iommu_device *iommu);
-- 
2.25.0



* [PATCH v4 03/26] iommu: Add a page fault handler
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier() Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 02/26] iommu/sva: Manage process address spaces Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-25  3:30   ` Xu Zaibo
  2020-02-26 13:59   ` Jonathan Cameron
  2020-02-24 18:23 ` [PATCH v4 04/26] iommu/sva: Search mm by PASID Jean-Philippe Brucker
                   ` (23 subsequent siblings)
  26 siblings, 2 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Some systems allow devices to handle I/O page faults in the core mm, for
example systems implementing the PCI PRI extension or the Arm SMMU stall
model. Infrastructure for reporting these recoverable page faults was
recently added to the IOMMU core. Add a page fault handler for host SVA.

IOMMU drivers can now instantiate several fault workqueues and link them to
IOPF-capable devices. Drivers can choose between a single global
workqueue, one per IOMMU device, one per low-level fault queue, one per
domain, etc.

When it receives a fault event, typically in an IRQ handler, the IOMMU
driver reports the fault using iommu_report_device_fault(), which calls
the registered handler. The page fault handler then calls the mm fault
handler, and reports either success or failure with iommu_page_response().
When the handler succeeds, the IOMMU retries the access.

The iopf_param pointer could be embedded into iommu_fault_param. But
putting iopf_param into the iommu_param structure allows us not to care
about ordering between calls to iopf_queue_add_device() and
iommu_register_device_fault_handler().
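
For illustration, the setup in an IOMMU driver looks roughly like this.
Only a sketch: the my_* names are made up, and where the queue lives
(global, per IOMMU device, per domain) is the driver's choice:

  static int my_iopf_flush(void *cookie, struct device *dev, int pasid)
  {
          /*
           * Commit all faults pending in the low-level queue (for example
           * the PRI queue) into the IOPF workqueue, using
           * iommu_report_device_fault(), before returning.
           */
          return 0;
  }

  static int my_enable_iopf(struct iopf_queue *queue, struct device *dev)
  {
          int ret;

          ret = iopf_queue_add_device(queue, dev);
          if (ret)
                  return ret;

          /* Recoverable faults for this device are queued by iommu_queue_iopf() */
          ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
          if (ret)
                  iopf_queue_remove_device(queue, dev);
          return ret;
  }

The queue itself is allocated once, for example at probe time, with
iopf_queue_alloc("my-iommu", my_iopf_flush, cookie), and released with
iopf_queue_free().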

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/Kconfig      |   4 +
 drivers/iommu/Makefile     |   1 +
 drivers/iommu/io-pgfault.c | 451 +++++++++++++++++++++++++++++++++++++
 include/linux/iommu.h      |  59 +++++
 4 files changed, 515 insertions(+)
 create mode 100644 drivers/iommu/io-pgfault.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index acca20e2da2f..e4a42e1708b4 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -109,6 +109,10 @@ config IOMMU_SVA
 	select IOMMU_API
 	select MMU_NOTIFIER
 
+config IOMMU_PAGE_FAULT
+	bool
+	select IOMMU_API
+
 config FSL_PAMU
 	bool "Freescale IOMMU support"
 	depends on PCI
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 40c800dd4e3e..bf5cb4ee8409 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
 obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
+obj-$(CONFIG_IOMMU_PAGE_FAULT) += io-pgfault.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
new file mode 100644
index 000000000000..76e153c59fe3
--- /dev/null
+++ b/drivers/iommu/io-pgfault.c
@@ -0,0 +1,451 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Handle device page faults
+ *
+ * Copyright (C) 2018 ARM Ltd.
+ */
+
+#include <linux/iommu.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+
+/**
+ * struct iopf_queue - IO Page Fault queue
+ * @wq: the fault workqueue
+ * @flush: low-level flush callback
+ * @flush_arg: flush() argument
+ * @devices: devices attached to this queue
+ * @lock: protects the device list
+ */
+struct iopf_queue {
+	struct workqueue_struct		*wq;
+	iopf_queue_flush_t		flush;
+	void				*flush_arg;
+	struct list_head		devices;
+	struct mutex			lock;
+};
+
+/**
+ * struct iopf_device_param - IO Page Fault data attached to a device
+ * @dev: the device that owns this param
+ * @queue: IOPF queue
+ * @queue_list: index into queue->devices
+ * @partial: faults that are part of a Page Request Group for which the last
+ *           request hasn't been submitted yet.
+ * @busy: the param is being used
+ * @wq_head: signal a change to @busy
+ */
+struct iopf_device_param {
+	struct device			*dev;
+	struct iopf_queue		*queue;
+	struct list_head		queue_list;
+	struct list_head		partial;
+	bool				busy;
+	wait_queue_head_t		wq_head;
+};
+
+struct iopf_fault {
+	struct iommu_fault		fault;
+	struct list_head		head;
+};
+
+struct iopf_group {
+	struct iopf_fault		last_fault;
+	struct list_head		faults;
+	struct work_struct		work;
+	struct device			*dev;
+};
+
+static int iopf_complete(struct device *dev, struct iopf_fault *iopf,
+			 enum iommu_page_response_code status)
+{
+	struct iommu_page_response resp = {
+		.version		= IOMMU_PAGE_RESP_VERSION_1,
+		.pasid			= iopf->fault.prm.pasid,
+		.grpid			= iopf->fault.prm.grpid,
+		.code			= status,
+	};
+
+	if (iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
+		resp.flags = IOMMU_PAGE_RESP_PASID_VALID;
+
+	return iommu_page_response(dev, &resp);
+}
+
+static enum iommu_page_response_code
+iopf_handle_single(struct iopf_fault *iopf)
+{
+	/* TODO */
+	return -ENODEV;
+}
+
+static void iopf_handle_group(struct work_struct *work)
+{
+	struct iopf_group *group;
+	struct iopf_fault *iopf, *next;
+	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
+
+	group = container_of(work, struct iopf_group, work);
+
+	list_for_each_entry_safe(iopf, next, &group->faults, head) {
+		/*
+		 * For the moment, errors are sticky: don't handle subsequent
+		 * faults in the group if there is an error.
+		 */
+		if (status == IOMMU_PAGE_RESP_SUCCESS)
+			status = iopf_handle_single(iopf);
+
+		if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
+			kfree(iopf);
+	}
+
+	iopf_complete(group->dev, &group->last_fault, status);
+	kfree(group);
+}
+
+/**
+ * iommu_queue_iopf - IO Page Fault handler
+ * @evt: fault event
+ * @cookie: struct device, passed to iommu_register_device_fault_handler.
+ *
+ * Add a fault to the device workqueue, to be handled by mm.
+ *
+ * Return: 0 on success and <0 on error.
+ */
+int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
+{
+	int ret;
+	struct iopf_group *group;
+	struct iopf_fault *iopf, *next;
+	struct iopf_device_param *iopf_param;
+
+	struct device *dev = cookie;
+	struct iommu_param *param = dev->iommu_param;
+
+	if (WARN_ON(!mutex_is_locked(&param->lock)))
+		return -EINVAL;
+
+	if (fault->type != IOMMU_FAULT_PAGE_REQ)
+		/* Not a recoverable page fault */
+		return -EOPNOTSUPP;
+
+	/*
+	 * As long as we're holding param->lock, the queue can't be unlinked
+	 * from the device and therefore cannot disappear.
+	 */
+	iopf_param = param->iopf_param;
+	if (!iopf_param)
+		return -ENODEV;
+
+	if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) {
+		iopf = kzalloc(sizeof(*iopf), GFP_KERNEL);
+		if (!iopf)
+			return -ENOMEM;
+
+		iopf->fault = *fault;
+
+		/* Non-last request of a group. Postpone until the last one */
+		list_add(&iopf->head, &iopf_param->partial);
+
+		return 0;
+	}
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group) {
+		/*
+		 * The caller will send a response to the hardware. But we do
+		 * need to clean up before leaving, otherwise partial faults
+		 * will be stuck.
+		 */
+		ret = -ENOMEM;
+		goto cleanup_partial;
+	}
+
+	group->dev = dev;
+	group->last_fault.fault = *fault;
+	INIT_LIST_HEAD(&group->faults);
+	list_add(&group->last_fault.head, &group->faults);
+	INIT_WORK(&group->work, iopf_handle_group);
+
+	/* See if we have partial faults for this group */
+	list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) {
+		if (iopf->fault.prm.grpid == fault->prm.grpid)
+			/* Insert *before* the last fault */
+			list_move(&iopf->head, &group->faults);
+	}
+
+	queue_work(iopf_param->queue->wq, &group->work);
+	return 0;
+
+cleanup_partial:
+	list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) {
+		if (iopf->fault.prm.grpid == fault->prm.grpid) {
+			list_del(&iopf->head);
+			kfree(iopf);
+		}
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_queue_iopf);
+
+/**
+ * iopf_queue_flush_dev - Ensure that all queued faults have been processed
+ * @dev: the endpoint whose faults need to be flushed.
+ * @pasid: the PASID affected by this flush
+ *
+ * Users must call this function when releasing a PASID, to ensure that all
+ * pending faults for this PASID have been handled, and won't hit the address
+ * space of the next process that uses this PASID.
+ *
+ * This function can also be called before shutting down the device, in which
+ * case @pasid should be IOMMU_PASID_INVALID.
+ *
+ * Return: 0 on success and <0 on error.
+ */
+int iopf_queue_flush_dev(struct device *dev, int pasid)
+{
+	int ret = 0;
+	struct iopf_queue *queue;
+	struct iopf_device_param *iopf_param;
+	struct iommu_param *param = dev->iommu_param;
+
+	if (!param)
+		return -ENODEV;
+
+	/*
+	 * It is incredibly easy to find ourselves in a deadlock situation if
+	 * we're not careful, because we're taking the opposite path as
+	 * iommu_queue_iopf:
+	 *
+	 *   iopf_queue_flush_dev()   |  PRI queue handler
+	 *    lock(&param->lock)      |   iommu_queue_iopf()
+	 *     queue->flush()         |    lock(&param->lock)
+	 *      wait PRI queue empty  |
+	 *
+	 * So we can't hold the device param lock while flushing. Take a
+	 * reference to the device param instead, to prevent the queue from
+	 * going away.
+	 */
+	mutex_lock(&param->lock);
+	iopf_param = param->iopf_param;
+	if (iopf_param) {
+		queue = param->iopf_param->queue;
+		iopf_param->busy = true;
+	} else {
+		ret = -ENODEV;
+	}
+	mutex_unlock(&param->lock);
+	if (ret)
+		return ret;
+
+	/*
+	 * When removing a PASID, the device driver tells the device to stop
+	 * using it, and flush any pending fault to the IOMMU. In this flush
+	 * callback, the IOMMU driver makes sure that there are no such faults
+	 * left in the low-level queue.
+	 */
+	queue->flush(queue->flush_arg, dev, pasid);
+
+	flush_workqueue(queue->wq);
+
+	mutex_lock(&param->lock);
+	iopf_param->busy = false;
+	wake_up(&iopf_param->wq_head);
+	mutex_unlock(&param->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iopf_queue_flush_dev);
+
+/**
+ * iopf_queue_discard_partial - Remove all pending partial faults
+ * @queue: the queue whose partial faults need to be discarded
+ *
+ * When the hardware queue overflows, last page faults in a group may have been
+ * lost and the IOMMU driver calls this to discard all partial faults. The
+ * driver shouldn't be adding new faults to this queue concurrently.
+ *
+ * Return: 0 on success and <0 on error.
+ */
+int iopf_queue_discard_partial(struct iopf_queue *queue)
+{
+	struct iopf_fault *iopf, *next;
+	struct iopf_device_param *iopf_param;
+
+	if (!queue)
+		return -EINVAL;
+
+	mutex_lock(&queue->lock);
+	list_for_each_entry(iopf_param, &queue->devices, queue_list) {
+		list_for_each_entry_safe(iopf, next, &iopf_param->partial, head)
+			kfree(iopf);
+	}
+	mutex_unlock(&queue->lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iopf_queue_discard_partial);
+
+/**
+ * iopf_queue_add_device - Add producer to the fault queue
+ * @queue: IOPF queue
+ * @dev: device to add
+ *
+ * Return: 0 on success and <0 on error.
+ */
+int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev)
+{
+	int ret = -EINVAL;
+	struct iopf_device_param *iopf_param;
+	struct iommu_param *param = dev->iommu_param;
+
+	if (!param)
+		return -ENODEV;
+
+	iopf_param = kzalloc(sizeof(*iopf_param), GFP_KERNEL);
+	if (!iopf_param)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&iopf_param->partial);
+	iopf_param->queue = queue;
+	iopf_param->dev = dev;
+	init_waitqueue_head(&iopf_param->wq_head);
+
+	mutex_lock(&queue->lock);
+	mutex_lock(&param->lock);
+	if (!param->iopf_param) {
+		list_add(&iopf_param->queue_list, &queue->devices);
+		param->iopf_param = iopf_param;
+		ret = 0;
+	}
+	mutex_unlock(&param->lock);
+	mutex_unlock(&queue->lock);
+
+	if (ret)
+		kfree(iopf_param);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iopf_queue_add_device);
+
+/**
+ * iopf_queue_remove_device - Remove producer from fault queue
+ * @queue: IOPF queue
+ * @dev: device to remove
+ *
+ * Caller makes sure that no more faults are reported for this device.
+ *
+ * Return: 0 on success and <0 on error.
+ */
+int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev)
+{
+	int ret = -EINVAL;
+	struct iopf_fault *iopf, *next;
+	struct iopf_device_param *iopf_param;
+	struct iommu_param *param = dev->iommu_param;
+
+	if (!param || !queue)
+		return -EINVAL;
+
+	do {
+		mutex_lock(&queue->lock);
+		mutex_lock(&param->lock);
+		iopf_param = param->iopf_param;
+		if (iopf_param && iopf_param->queue == queue) {
+			if (iopf_param->busy) {
+				ret = -EBUSY;
+			} else {
+				list_del(&iopf_param->queue_list);
+				param->iopf_param = NULL;
+				ret = 0;
+			}
+		}
+		mutex_unlock(&param->lock);
+		mutex_unlock(&queue->lock);
+
+		/*
+		 * If there is an ongoing flush, wait for it to complete and
+		 * then retry. iopf_param isn't going away since we're the only
+		 * thread that can free it.
+		 */
+		if (ret == -EBUSY)
+			wait_event(iopf_param->wq_head, !iopf_param->busy);
+		else if (ret)
+			return ret;
+	} while (ret == -EBUSY);
+
+	/* Just in case some faults are still stuck */
+	list_for_each_entry_safe(iopf, next, &iopf_param->partial, head)
+		kfree(iopf);
+
+	kfree(iopf_param);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iopf_queue_remove_device);
+
+/**
+ * iopf_queue_alloc - Allocate and initialize a fault queue
+ * @name: a unique string identifying the queue (for workqueue)
+ * @flush: a callback that flushes the low-level queue
+ * @cookie: driver-private data passed to the flush callback
+ *
+ * The callback is called before the workqueue is flushed. The IOMMU driver must
+ * commit all faults that are pending in its low-level queues at the time of the
+ * call, into the IOPF queue (with iommu_report_device_fault). The callback
+ * takes a device pointer as argument, hinting what endpoint is causing the
+ * flush. When the device is NULL, all faults should be committed.
+ *
+ * Return: the queue on success and NULL on error.
+ */
+struct iopf_queue *
+iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie)
+{
+	struct iopf_queue *queue;
+
+	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+	if (!queue)
+		return NULL;
+
+	/*
+	 * The WQ is unordered because the low-level handler enqueues faults by
+	 * group. PRI requests within a group have to be ordered, but once
+	 * that's dealt with, the high-level function can handle groups out of
+	 * order.
+	 */
+	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
+	if (!queue->wq) {
+		kfree(queue);
+		return NULL;
+	}
+
+	queue->flush = flush;
+	queue->flush_arg = cookie;
+	INIT_LIST_HEAD(&queue->devices);
+	mutex_init(&queue->lock);
+
+	return queue;
+}
+EXPORT_SYMBOL_GPL(iopf_queue_alloc);
+
+/**
+ * iopf_queue_free - Free IOPF queue
+ * @queue: queue to free
+ *
+ * Counterpart to iopf_queue_alloc(). The driver must not be queuing faults or
+ * adding/removing devices on this queue anymore.
+ */
+void iopf_queue_free(struct iopf_queue *queue)
+{
+	struct iopf_device_param *iopf_param, *next;
+
+	if (!queue)
+		return;
+
+	list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list)
+		iopf_queue_remove_device(queue, iopf_param->dev);
+
+	destroy_workqueue(queue->wq);
+	kfree(queue);
+}
+EXPORT_SYMBOL_GPL(iopf_queue_free);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 83397ae88d2d..e7bc47ba24f8 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -364,11 +364,20 @@ struct iommu_fault_param {
 	struct mutex lock;
 };
 
+/**
+ * iopf_queue_flush_t - Flush low-level page fault queue
+ *
+ * Report all faults currently pending in the low-level page fault queue
+ */
+struct iopf_queue;
+typedef int (*iopf_queue_flush_t)(void *cookie, struct device *dev, int pasid);
+
 /**
  * struct iommu_param - collection of per-device IOMMU data
  *
  * @fault_param: IOMMU detected device fault reporting data
  * @sva_param: IOMMU parameter for SVA
+ * @iopf_param: I/O Page Fault queue and data
  *
  * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
  *	struct iommu_group	*iommu_group;
@@ -377,6 +386,7 @@ struct iommu_fault_param {
 struct iommu_param {
 	struct mutex lock;
 	struct iommu_fault_param *fault_param;
+	struct iopf_device_param *iopf_param;
 	struct mutex sva_lock;
 	struct iommu_sva_param *sva_param;
 };
@@ -1081,4 +1091,53 @@ void iommu_debugfs_setup(void);
 static inline void iommu_debugfs_setup(void) {}
 #endif
 
+#ifdef CONFIG_IOMMU_PAGE_FAULT
+extern int iommu_queue_iopf(struct iommu_fault *fault, void *cookie);
+
+extern int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev);
+extern int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev);
+extern int iopf_queue_flush_dev(struct device *dev, int pasid);
+extern struct iopf_queue *
+iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie);
+extern void iopf_queue_free(struct iopf_queue *queue);
+extern int iopf_queue_discard_partial(struct iopf_queue *queue);
+#else /* CONFIG_IOMMU_PAGE_FAULT */
+static inline int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
+{
+	return -ENODEV;
+}
+
+static inline int iopf_queue_add_device(struct iopf_queue *queue,
+					struct device *dev)
+{
+	return -ENODEV;
+}
+
+static inline int iopf_queue_remove_device(struct iopf_queue *queue,
+					   struct device *dev)
+{
+	return -ENODEV;
+}
+
+static inline int iopf_queue_flush_dev(struct device *dev, int pasid)
+{
+	return -ENODEV;
+}
+
+static inline struct iopf_queue *
+iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie)
+{
+	return NULL;
+}
+
+static inline void iopf_queue_free(struct iopf_queue *queue)
+{
+}
+
+static inline int iopf_queue_discard_partial(struct iopf_queue *queue)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_IOMMU_PAGE_FAULT */
+
 #endif /* __LINUX_IOMMU_H */
-- 
2.25.0



* [PATCH v4 04/26] iommu/sva: Search mm by PASID
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (2 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 03/26] iommu: Add a page fault handler Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 05/26] iommu/iopf: Handle mm faults Jean-Philippe Brucker
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

The fault handler will need to find an mm given its PASID. This is the
reason address spaces are stored in an IOASID set, so hook it up.
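
The fault handler then looks up the mm and takes a reference roughly like
this (simplified sketch):

  struct mm_struct *mm;

  mm = iommu_sva_find(pasid);
  if (IS_ERR_OR_NULL(mm))
          return IOMMU_PAGE_RESP_INVALID;

  down_read(&mm->mmap_sem);
  /* find the VMA and call handle_mm_fault() ... */
  up_read(&mm->mmap_sem);
  mmput(mm);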

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/iommu-sva.c | 19 +++++++++++++++++++
 include/linux/iommu.h     |  9 +++++++++
 2 files changed, 28 insertions(+)

diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 64f1d1c82383..bfd0c477f290 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -559,3 +559,22 @@ int iommu_sva_get_pasid_generic(struct iommu_sva *handle)
 	return pasid;
 }
 EXPORT_SYMBOL_GPL(iommu_sva_get_pasid_generic);
+
+/* ioasid wants a void * argument */
+static bool __mmget_not_zero(void *mm)
+{
+	return mmget_not_zero(mm);
+}
+
+/**
+ * iommu_sva_find() - Find mm associated to the given PASID
+ * @pasid: Process Address Space ID assigned to the mm
+ *
+ * Returns the mm corresponding to this PASID, or an error if not found. A
+ * reference to the mm is taken, and must be released with mmput().
+ */
+struct mm_struct *iommu_sva_find(int pasid)
+{
+	return ioasid_find(&shared_pasid, pasid, __mmget_not_zero);
+}
+EXPORT_SYMBOL_GPL(iommu_sva_find);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e7bc47ba24f8..e52a8731e7a9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1091,6 +1091,15 @@ void iommu_debugfs_setup(void);
 static inline void iommu_debugfs_setup(void) {}
 #endif
 
+#ifdef CONFIG_IOMMU_SVA
+extern struct mm_struct *iommu_sva_find(int pasid);
+#else /* !CONFIG_IOMMU_SVA */
+static inline struct mm_struct *iommu_sva_find(int pasid)
+{
+	return NULL;
+}
+#endif /* !CONFIG_IOMMU_SVA */
+
 #ifdef CONFIG_IOMMU_PAGE_FAULT
 extern int iommu_queue_iopf(struct iommu_fault *fault, void *cookie);
 
-- 
2.25.0



* [PATCH v4 05/26] iommu/iopf: Handle mm faults
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (3 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 04/26] iommu/sva: Search mm by PASID Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 06/26] iommu/sva: Register page fault handler Jean-Philippe Brucker
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

When a recoverable page fault is handled by the fault workqueue, find the
associated mm and call handle_mm_fault.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/io-pgfault.c | 86 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index 76e153c59fe3..ffa9f14e0803 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -7,6 +7,7 @@
 
 #include <linux/iommu.h>
 #include <linux/list.h>
+#include <linux/sched/mm.h>
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 
@@ -76,8 +77,65 @@ static int iopf_complete(struct device *dev, struct iopf_fault *iopf,
 static enum iommu_page_response_code
 iopf_handle_single(struct iopf_fault *iopf)
 {
-	/* TODO */
-	return -ENODEV;
+	vm_fault_t ret;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	unsigned int access_flags = 0;
+	unsigned int fault_flags = FAULT_FLAG_REMOTE;
+	struct iommu_fault_page_request *prm = &iopf->fault.prm;
+	enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
+
+	if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID))
+		return status;
+
+	mm = iommu_sva_find(prm->pasid);
+	if (IS_ERR_OR_NULL(mm))
+		return status;
+
+	down_read(&mm->mmap_sem);
+
+	vma = find_extend_vma(mm, prm->addr);
+	if (!vma)
+		/* Unmapped area */
+		goto out_put_mm;
+
+	if (prm->perm & IOMMU_FAULT_PERM_READ)
+		access_flags |= VM_READ;
+
+	if (prm->perm & IOMMU_FAULT_PERM_WRITE) {
+		access_flags |= VM_WRITE;
+		fault_flags |= FAULT_FLAG_WRITE;
+	}
+
+	if (prm->perm & IOMMU_FAULT_PERM_EXEC) {
+		access_flags |= VM_EXEC;
+		fault_flags |= FAULT_FLAG_INSTRUCTION;
+	}
+
+	if (!(prm->perm & IOMMU_FAULT_PERM_PRIV))
+		fault_flags |= FAULT_FLAG_USER;
+
+	if (access_flags & ~vma->vm_flags)
+		/* Access fault */
+		goto out_put_mm;
+
+	ret = handle_mm_fault(vma, prm->addr, fault_flags);
+	status = ret & VM_FAULT_ERROR ? IOMMU_PAGE_RESP_INVALID :
+		IOMMU_PAGE_RESP_SUCCESS;
+
+out_put_mm:
+	up_read(&mm->mmap_sem);
+
+	/*
+	 * If the process exits while we're handling the fault on its mm, we
+	 * can't do mmput(). exit_mmap() would release the MMU notifier, calling
+	 * iommu_notifier_release(), which has to flush the fault queue that
+	 * we're executing on... So mmput_async() moves the release of the mm to
+	 * another thread, if we're the last user.
+	 */
+	mmput_async(mm);
+
+	return status;
 }
 
 static void iopf_handle_group(struct work_struct *work)
@@ -111,6 +169,30 @@ static void iopf_handle_group(struct work_struct *work)
  *
  * Add a fault to the device workqueue, to be handled by mm.
  *
+ * This module doesn't handle PCI PASID Stop Marker; IOMMU drivers must discard
+ * them before reporting faults. A PASID Stop Marker (LRW = 0b100) doesn't
+ * expect a response. It may be generated when disabling a PASID (issuing a
+ * PASID stop request) by some PCI devices.
+ *
+ * The PASID stop request is triggered by the mm_exit() callback. When the
+ * callback returns from the device driver, no page request is generated for
+ * this PASID anymore and outstanding ones have been pushed to the IOMMU (as per
+ * PCIe 4.0r1.0 - 6.20.1 and 10.4.1.2 - Managing PASID TLP Prefix Usage). Some
+ * PCI devices will wait for all outstanding page requests to come back with a
+ * response before completing the PASID stop request. Others do not wait for
+ * page responses, and instead issue this Stop Marker that tells us when the
+ * PASID can be reallocated.
+ *
+ * It is safe to discard the Stop Marker because it is an optimization.
+ * a. Page requests, which are posted requests, have been flushed to the IOMMU
+ *    when mm_exit() returns,
+ * b. We flush all fault queues after mm_exit() returns and before freeing the
+ *    PASID.
+ *
+ * So even though the Stop Marker might be issued by the device *after* the stop
+ * request completes, outstanding faults will have been dealt with by the time
+ * we free the PASID.
+ *
  * Return: 0 on success and <0 on error.
  */
 int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 06/26] iommu/sva: Register page fault handler
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (4 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 05/26] iommu/iopf: Handle mm faults Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-26 19:39   ` Jacob Pan
  2020-02-24 18:23 ` [PATCH v4 07/26] arm64: mm: Pin down ASIDs for sharing mm with devices Jean-Philippe Brucker
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

When enabling SVA, register the fault handler. The device driver registers
an I/O page fault queue before or after calling iommu_sva_enable(). The
fault queue must be flushed before any io_mm is freed, to make sure that
its PASID isn't referenced by any pending fault and can be reallocated. Add
iopf_queue_flush_dev() calls in a few strategic locations.
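
For illustration, the contract this creates for device drivers looks roughly
like the sketch below. This is not part of the patch, and
example_device_pasid_stop() is a placeholder for whatever device-specific
mechanism stops PASID traffic:

/*
 * Hedged sketch of a driver teardown path. The device must stop issuing
 * transactions tagged with the PASID before unbinding, so that the unbind
 * path can flush the fault queue and guarantee the PASID is no longer
 * referenced anywhere.
 */
static void example_stop_using_pasid(struct device *dev,
				     struct iommu_sva *handle)
{
	/* 1. Device-specific: issue a PASID stop request */
	example_device_pasid_stop(dev);

	/* 2. Unbind: flushes pending faults, then releases the PASID */
	iommu_sva_unbind_device(handle);
}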

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/Kconfig     |  1 +
 drivers/iommu/iommu-sva.c | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e4a42e1708b4..211684e785ea 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -106,6 +106,7 @@ config IOMMU_DMA
 config IOMMU_SVA
 	bool
 	select IOASID
+	select IOMMU_PAGE_FAULT
 	select IOMMU_API
 	select MMU_NOTIFIER
 
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index bfd0c477f290..494ca0824e4b 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -366,6 +366,8 @@ static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 			dev_WARN(dev, "possible leak of PASID %u",
 				 io_mm->pasid);
 
+		iopf_queue_flush_dev(dev, io_mm->pasid);
+
 		/* unbind() frees the bond, we just detach it */
 		io_mm_detach_locked(bond);
 	}
@@ -442,11 +444,20 @@ static void iommu_sva_unbind_locked(struct iommu_bond *bond)
 
 void iommu_sva_unbind_generic(struct iommu_sva *handle)
 {
+	int pasid;
 	struct iommu_param *param = handle->dev->iommu_param;
 
 	if (WARN_ON(!param))
 		return;
 
+	/*
+	 * Caller stopped the device from issuing PASIDs, now make sure they are
+	 * out of the fault queue.
+	 */
+	pasid = iommu_sva_get_pasid_generic(handle);
+	if (pasid != IOMMU_PASID_INVALID)
+		iopf_queue_flush_dev(handle->dev, pasid);
+
 	mutex_lock(&param->sva_lock);
 	mutex_lock(&iommu_sva_lock);
 	iommu_sva_unbind_locked(to_iommu_bond(handle));
@@ -484,6 +495,10 @@ int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param)
 		goto err_unlock;
 	}
 
+	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
+	if (ret)
+		goto err_unlock;
+
 	dev->iommu_param->sva_param = new_param;
 	mutex_unlock(&param->sva_lock);
 	return 0;
@@ -521,6 +536,7 @@ int iommu_sva_disable(struct device *dev)
 		goto out_unlock;
 	}
 
+	iommu_unregister_device_fault_handler(dev);
 	kfree(param->sva_param);
 	param->sva_param = NULL;
 out_unlock:
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 07/26] arm64: mm: Pin down ASIDs for sharing mm with devices
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (5 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 06/26] iommu/sva: Register page fault handler Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-27 17:43   ` Jonathan Cameron
  2020-02-24 18:23 ` [PATCH v4 08/26] iommu/io-pgtable-arm: Move some definitions to a header Jean-Philippe Brucker
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

To enable address space sharing with the IOMMU, introduce mm_context_get()
and mm_context_put(), which pin down a context and ensure that it keeps
its ASID across rollovers. Export the symbols so that the modular SMMUv3
driver can use them.

Pinning is necessary because a device constantly needs a valid ASID,
unlike tasks that only require one when running. Without pinning, we would
need to notify the IOMMU when we're about to use a new ASID for a task,
and it would get complicated when a new task is assigned a shared ASID.
Consider the following scenario with no ASID pinned:

1. Task t1 is running on CPUx with shared ASID (gen=1, asid=1)
2. Task t2 is scheduled on CPUx, gets ASID (1, 2)
3. Task tn is scheduled on CPUy, a rollover occurs, tn gets ASID (2, 1)
   We would now have to immediately generate a new ASID for t1, notify
   the IOMMU, and finally enable task tn. We are holding the lock during
   all that time, since we can't afford having another CPU trigger a
   rollover. The IOMMU issues invalidation commands that can take tens of
   milliseconds.

It gets needlessly complicated. All we wanted to do was schedule task tn,
which has no business with the IOMMU. By letting the IOMMU pin tasks when
needed, we avoid stalling the slow path, and let the pinning fail when
we're out of shareable ASIDs.

After a rollover, the allocator expects at least one ASID to be available
in addition to the reserved ones (one per CPU). So (NR_ASIDS - NR_CPUS -
1) is the maximum number of ASIDs that can be shared with the IOMMU.
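
As an illustration of the intended usage from an IOMMU driver (a rough
sketch only; struct example_iommu_ctx and the function names are made up,
and error handling is trimmed):

static int example_share_mm(struct example_iommu_ctx *ctx, struct mm_struct *mm)
{
	unsigned long asid;

	/* Pin the ASID for as long as a device shares this mm */
	asid = mm_context_get(mm);
	if (!asid)
		return -ENOSPC;	/* no shareable ASID left */

	/* Program the pinned ASID into the IOMMU context descriptor */
	ctx->asid = asid;
	return 0;
}

static void example_unshare_mm(struct mm_struct *mm)
{
	/* Drop the pin; the ASID may be recycled after a later rollover */
	mm_context_put(mm);
}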

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
v2->v4: handle KPTI
---
 arch/arm64/include/asm/mmu.h         |   1 +
 arch/arm64/include/asm/mmu_context.h |  11 ++-
 arch/arm64/mm/context.c              | 103 +++++++++++++++++++++++++--
 3 files changed, 109 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index e4d862420bb4..70ac3d4cbd3e 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -18,6 +18,7 @@
 
 typedef struct {
 	atomic64_t	id;
+	unsigned long	pinned;
 	void		*vdso;
 	unsigned long	flags;
 } mm_context_t;
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 3827ff4040a3..70715c10c02a 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -175,7 +175,13 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp)
 #define destroy_context(mm)		do { } while(0)
 void check_and_switch_context(struct mm_struct *mm, unsigned int cpu);
 
-#define init_new_context(tsk,mm)	({ atomic64_set(&(mm)->context.id, 0); 0; })
+static inline int
+init_new_context(struct task_struct *tsk, struct mm_struct *mm)
+{
+	atomic64_set(&mm->context.id, 0);
+	mm->context.pinned = 0;
+	return 0;
+}
 
 #ifdef CONFIG_ARM64_SW_TTBR0_PAN
 static inline void update_saved_ttbr0(struct task_struct *tsk,
@@ -248,6 +254,9 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
 void verify_cpu_asid_bits(void);
 void post_ttbr_update_workaround(void);
 
+unsigned long mm_context_get(struct mm_struct *mm);
+void mm_context_put(struct mm_struct *mm);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* !__ASM_MMU_CONTEXT_H */
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 121aba5b1941..5558de88b67d 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -26,6 +26,10 @@ static DEFINE_PER_CPU(atomic64_t, active_asids);
 static DEFINE_PER_CPU(u64, reserved_asids);
 static cpumask_t tlb_flush_pending;
 
+static unsigned long max_pinned_asids;
+static unsigned long nr_pinned_asids;
+static unsigned long *pinned_asid_map;
+
 #define ASID_MASK		(~GENMASK(asid_bits - 1, 0))
 #define ASID_FIRST_VERSION	(1UL << asid_bits)
 
@@ -73,6 +77,9 @@ void verify_cpu_asid_bits(void)
 
 static void set_kpti_asid_bits(void)
 {
+	unsigned int k;
+	u8 *dst = (u8 *)asid_map;
+	u8 *src = (u8 *)pinned_asid_map;
 	unsigned int len = BITS_TO_LONGS(NUM_USER_ASIDS) * sizeof(unsigned long);
 	/*
 	 * In case of KPTI kernel/user ASIDs are allocated in
@@ -80,7 +87,8 @@ static void set_kpti_asid_bits(void)
 	 * is set, then the ASID will map only userspace. Thus
 	 * mark even as reserved for kernel.
 	 */
-	memset(asid_map, 0xaa, len);
+	for (k = 0; k < len; k++)
+		dst[k] = src[k] | 0xaa;
 }
 
 static void set_reserved_asid_bits(void)
@@ -88,9 +96,12 @@ static void set_reserved_asid_bits(void)
 	if (arm64_kernel_unmapped_at_el0())
 		set_kpti_asid_bits();
 	else
-		bitmap_clear(asid_map, 0, NUM_USER_ASIDS);
+		bitmap_copy(asid_map, pinned_asid_map, NUM_USER_ASIDS);
 }
 
+#define asid_gen_match(asid) \
+	(!(((asid) ^ atomic64_read(&asid_generation)) >> asid_bits))
+
 static void flush_context(void)
 {
 	int i;
@@ -161,6 +172,14 @@ static u64 new_context(struct mm_struct *mm)
 		if (check_update_reserved_asid(asid, newasid))
 			return newasid;
 
+		/*
+		 * If it is pinned, we can keep using it. Note that reserved
+		 * takes priority, because even if it is also pinned, we need to
+		 * update the generation into the reserved_asids.
+		 */
+		if (mm->context.pinned)
+			return newasid;
+
 		/*
 		 * We had a valid ASID in a previous life, so try to re-use
 		 * it if possible.
@@ -219,8 +238,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 	 *   because atomic RmWs are totally ordered for a given location.
 	 */
 	old_active_asid = atomic64_read(&per_cpu(active_asids, cpu));
-	if (old_active_asid &&
-	    !((asid ^ atomic64_read(&asid_generation)) >> asid_bits) &&
+	if (old_active_asid && asid_gen_match(asid) &&
 	    atomic64_cmpxchg_relaxed(&per_cpu(active_asids, cpu),
 				     old_active_asid, asid))
 		goto switch_mm_fastpath;
@@ -228,7 +246,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
 	/* Check that our ASID belongs to the current generation. */
 	asid = atomic64_read(&mm->context.id);
-	if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) {
+	if (!asid_gen_match(asid)) {
 		asid = new_context(mm);
 		atomic64_set(&mm->context.id, asid);
 	}
@@ -251,6 +269,68 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 		cpu_switch_mm(mm->pgd, mm);
 }
 
+unsigned long mm_context_get(struct mm_struct *mm)
+{
+	unsigned long flags;
+	u64 asid;
+
+	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
+
+	asid = atomic64_read(&mm->context.id);
+
+	if (mm->context.pinned) {
+		mm->context.pinned++;
+		asid &= ~ASID_MASK;
+		goto out_unlock;
+	}
+
+	if (nr_pinned_asids >= max_pinned_asids) {
+		asid = 0;
+		goto out_unlock;
+	}
+
+	if (!asid_gen_match(asid)) {
+		/*
+		 * We went through one or more rollover since that ASID was
+		 * used. Ensure that it is still valid, or generate a new one.
+		 */
+		asid = new_context(mm);
+		atomic64_set(&mm->context.id, asid);
+	}
+
+	asid &= ~ASID_MASK;
+
+	nr_pinned_asids++;
+	__set_bit(asid2idx(asid), pinned_asid_map);
+	mm->context.pinned++;
+
+out_unlock:
+	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
+
+	/* Set the equivalent of USER_ASID_BIT */
+	if (asid && IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0))
+		asid |= 1;
+
+	return asid;
+}
+EXPORT_SYMBOL_GPL(mm_context_get);
+
+void mm_context_put(struct mm_struct *mm)
+{
+	unsigned long flags;
+	u64 asid = atomic64_read(&mm->context.id) & ~ASID_MASK;
+
+	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
+
+	if (--mm->context.pinned == 0) {
+		__clear_bit(asid2idx(asid), pinned_asid_map);
+		nr_pinned_asids--;
+	}
+
+	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
+}
+EXPORT_SYMBOL_GPL(mm_context_put);
+
 /* Errata workaround post TTBRx_EL1 update. */
 asmlinkage void post_ttbr_update_workaround(void)
 {
@@ -279,6 +359,19 @@ static int asids_init(void)
 		panic("Failed to allocate bitmap for %lu ASIDs\n",
 		      NUM_USER_ASIDS);
 
+	pinned_asid_map = kcalloc(BITS_TO_LONGS(NUM_USER_ASIDS),
+				  sizeof(*pinned_asid_map), GFP_KERNEL);
+	if (!pinned_asid_map)
+		panic("Failed to allocate pinned bitmap\n");
+
+	/*
+	 * We assume that an ASID is always available after a rollover. This
+	 * means that even if all CPUs have a reserved ASID, there still is at
+	 * least one slot available in the asid map.
+	 */
+	max_pinned_asids = num_available_asids - num_possible_cpus() - 1;
+	nr_pinned_asids = 0;
+
 	/*
 	 * We cannot call set_reserved_asid_bits() here because CPU
 	 * caps are not finalized yet, so it is safer to assume KPTI
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 08/26] iommu/io-pgtable-arm: Move some definitions to a header
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (6 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 07/26] arm64: mm: Pin down ASIDs for sharing mm with devices Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 09/26] iommu/arm-smmu-v3: Manage ASIDs with xarray Jean-Philippe Brucker
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao

Extract some of the most generic TCR defines, so they can be reused by
the page table sharing code.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/io-pgtable-arm.c | 27 ++-------------------------
 drivers/iommu/io-pgtable-arm.h | 30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 25 deletions(-)
 create mode 100644 drivers/iommu/io-pgtable-arm.h

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 983b08477e64..75782b525c2f 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -20,6 +20,8 @@
 
 #include <asm/barrier.h>
 
+#include "io-pgtable-arm.h"
+
 #define ARM_LPAE_MAX_ADDR_BITS		52
 #define ARM_LPAE_S2_MAX_CONCAT_PAGES	16
 #define ARM_LPAE_MAX_LEVELS		4
@@ -100,23 +102,6 @@
 #define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
 
 /* Register bits */
-#define ARM_LPAE_TCR_TG0_4K		0
-#define ARM_LPAE_TCR_TG0_64K		1
-#define ARM_LPAE_TCR_TG0_16K		2
-
-#define ARM_LPAE_TCR_TG1_16K		1
-#define ARM_LPAE_TCR_TG1_4K		2
-#define ARM_LPAE_TCR_TG1_64K		3
-
-#define ARM_LPAE_TCR_SH_NS		0
-#define ARM_LPAE_TCR_SH_OS		2
-#define ARM_LPAE_TCR_SH_IS		3
-
-#define ARM_LPAE_TCR_RGN_NC		0
-#define ARM_LPAE_TCR_RGN_WBWA		1
-#define ARM_LPAE_TCR_RGN_WT		2
-#define ARM_LPAE_TCR_RGN_WB		3
-
 #define ARM_LPAE_VTCR_SL0_MASK		0x3
 
 #define ARM_LPAE_TCR_T0SZ_SHIFT		0
@@ -124,14 +109,6 @@
 #define ARM_LPAE_VTCR_PS_SHIFT		16
 #define ARM_LPAE_VTCR_PS_MASK		0x7
 
-#define ARM_LPAE_TCR_PS_32_BIT		0x0ULL
-#define ARM_LPAE_TCR_PS_36_BIT		0x1ULL
-#define ARM_LPAE_TCR_PS_40_BIT		0x2ULL
-#define ARM_LPAE_TCR_PS_42_BIT		0x3ULL
-#define ARM_LPAE_TCR_PS_44_BIT		0x4ULL
-#define ARM_LPAE_TCR_PS_48_BIT		0x5ULL
-#define ARM_LPAE_TCR_PS_52_BIT		0x6ULL
-
 #define ARM_LPAE_MAIR_ATTR_SHIFT(n)	((n) << 3)
 #define ARM_LPAE_MAIR_ATTR_MASK		0xff
 #define ARM_LPAE_MAIR_ATTR_DEVICE	0x04
diff --git a/drivers/iommu/io-pgtable-arm.h b/drivers/iommu/io-pgtable-arm.h
new file mode 100644
index 000000000000..ba7cfdf7afa0
--- /dev/null
+++ b/drivers/iommu/io-pgtable-arm.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef IO_PGTABLE_ARM_H_
+#define IO_PGTABLE_ARM_H_
+
+#define ARM_LPAE_TCR_TG0_4K		0
+#define ARM_LPAE_TCR_TG0_64K		1
+#define ARM_LPAE_TCR_TG0_16K		2
+
+#define ARM_LPAE_TCR_TG1_16K		1
+#define ARM_LPAE_TCR_TG1_4K		2
+#define ARM_LPAE_TCR_TG1_64K		3
+
+#define ARM_LPAE_TCR_SH_NS		0
+#define ARM_LPAE_TCR_SH_OS		2
+#define ARM_LPAE_TCR_SH_IS		3
+
+#define ARM_LPAE_TCR_RGN_NC		0
+#define ARM_LPAE_TCR_RGN_WBWA		1
+#define ARM_LPAE_TCR_RGN_WT		2
+#define ARM_LPAE_TCR_RGN_WB		3
+
+#define ARM_LPAE_TCR_PS_32_BIT		0x0ULL
+#define ARM_LPAE_TCR_PS_36_BIT		0x1ULL
+#define ARM_LPAE_TCR_PS_40_BIT		0x2ULL
+#define ARM_LPAE_TCR_PS_42_BIT		0x3ULL
+#define ARM_LPAE_TCR_PS_44_BIT		0x4ULL
+#define ARM_LPAE_TCR_PS_48_BIT		0x5ULL
+#define ARM_LPAE_TCR_PS_52_BIT		0x6ULL
+
+#endif /* IO_PGTABLE_ARM_H_ */
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 09/26] iommu/arm-smmu-v3: Manage ASIDs with xarray
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (7 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 08/26] iommu/io-pgtable-arm: Move some definitions to a header Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 10/26] arm64: cpufeature: Export symbol read_sanitised_ftr_reg() Jean-Philippe Brucker
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

In preparation for sharing some ASIDs with the CPU, use a global xarray to
store ASIDs and their context. ASID#0 is now reserved, and the ASID
space is global.
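
For reference, a minimal sketch of the allocation pattern used below.
DEFINE_XARRAY_ALLOC1() never hands out index 0, which is why ASID#0 stays
reserved (example_alloc_asid() is a made-up name for illustration):

static DEFINE_XARRAY_ALLOC1(example_asid_xa);	/* allocated IDs start at 1 */

static int example_alloc_asid(struct arm_smmu_ctx_desc *cd, unsigned int asid_bits)
{
	u32 asid;
	int ret;

	/* Pick any free ASID in [1, 2^asid_bits - 1] and map it to cd */
	ret = xa_alloc(&example_asid_xa, &asid, cd,
		       XA_LIMIT(1, (1 << asid_bits) - 1), GFP_KERNEL);
	if (ret)
		return ret;

	cd->asid = asid;
	return 0;
}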

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 87ae31ef35a1..7737b70e74cd 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -651,7 +651,6 @@ struct arm_smmu_device {
 
 #define ARM_SMMU_MAX_ASIDS		(1 << 16)
 	unsigned int			asid_bits;
-	DECLARE_BITMAP(asid_map, ARM_SMMU_MAX_ASIDS);
 
 #define ARM_SMMU_MAX_VMIDS		(1 << 16)
 	unsigned int			vmid_bits;
@@ -711,6 +710,8 @@ struct arm_smmu_option_prop {
 	const char *prop;
 };
 
+static DEFINE_XARRAY_ALLOC1(asid_xa);
+
 static struct arm_smmu_option_prop arm_smmu_options[] = {
 	{ ARM_SMMU_OPT_SKIP_PREFETCH, "hisilicon,broken-prefetch-cmd" },
 	{ ARM_SMMU_OPT_PAGE0_REGS_ONLY, "cavium,cn9900-broken-page1-regspace"},
@@ -1742,6 +1743,14 @@ static void arm_smmu_free_cd_tables(struct arm_smmu_domain *smmu_domain)
 	cdcfg->cdtab = NULL;
 }
 
+static void arm_smmu_free_asid(struct arm_smmu_ctx_desc *cd)
+{
+	if (!cd->asid)
+		return;
+
+	xa_erase(&asid_xa, cd->asid);
+}
+
 /* Stream table manipulation functions */
 static void
 arm_smmu_write_strtab_l1_desc(__le64 *dst, struct arm_smmu_strtab_l1_desc *desc)
@@ -2388,10 +2397,9 @@ static void arm_smmu_domain_free(struct iommu_domain *domain)
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
 		struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg;
 
-		if (cfg->cdcfg.cdtab) {
+		if (cfg->cdcfg.cdtab)
 			arm_smmu_free_cd_tables(smmu_domain);
-			arm_smmu_bitmap_free(smmu->asid_map, cfg->cd.asid);
-		}
+		arm_smmu_free_asid(&cfg->cd);
 	} else {
 		struct arm_smmu_s2_cfg *cfg = &smmu_domain->s2_cfg;
 		if (cfg->vmid)
@@ -2406,14 +2414,15 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 				       struct io_pgtable_cfg *pgtbl_cfg)
 {
 	int ret;
-	int asid;
+	u32 asid;
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg;
 	typeof(&pgtbl_cfg->arm_lpae_s1_cfg.tcr) tcr = &pgtbl_cfg->arm_lpae_s1_cfg.tcr;
 
-	asid = arm_smmu_bitmap_alloc(smmu->asid_map, smmu->asid_bits);
-	if (asid < 0)
-		return asid;
+	ret = xa_alloc(&asid_xa, &asid, &cfg->cd,
+		       XA_LIMIT(1, (1 << smmu->asid_bits) - 1), GFP_KERNEL);
+	if (ret)
+		return ret;
 
 	cfg->s1cdmax = master->ssid_bits;
 
@@ -2446,7 +2455,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 out_free_cd_tables:
 	arm_smmu_free_cd_tables(smmu_domain);
 out_free_asid:
-	arm_smmu_bitmap_free(smmu->asid_map, asid);
+	arm_smmu_free_asid(&cfg->cd);
 	return ret;
 }
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 10/26] arm64: cpufeature: Export symbol read_sanitised_ftr_reg()
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (8 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 09/26] iommu/arm-smmu-v3: Manage ASIDs with xarray Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 11/26] iommu/arm-smmu-v3: Share process page tables Jean-Philippe Brucker
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao

The SMMUv3 driver would like to read the MMFR0 PARANGE field in order to
share CPU page tables with devices. Allow the driver to be built as a
module by exporting the read_sanitised_ftr_reg() cpufeature symbol.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 arch/arm64/kernel/cpufeature.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 0b6715625cf6..a96d2fb12e4d 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -838,6 +838,7 @@ u64 read_sanitised_ftr_reg(u32 id)
 	BUG_ON(!regp);
 	return regp->sys_val;
 }
+EXPORT_SYMBOL_GPL(read_sanitised_ftr_reg);
 
 #define read_sysreg_case(r)	\
 	case r:		return read_sysreg_s(r)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 11/26] iommu/arm-smmu-v3: Share process page tables
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (9 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 10/26] arm64: cpufeature: Export symbol read_sanitised_ftr_reg() Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 12/26] iommu/arm-smmu-v3: Seize private ASID Jean-Philippe Brucker
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

With Shared Virtual Addressing (SVA), we need to mirror CPU TTBR, TCR,
MAIR and ASIDs in SMMU contexts. Each SMMU has a single ASID space split
into two sets, shared and private. Shared ASIDs correspond to those
obtained from the arch ASID allocator, and private ASIDs are used for
"classic" map/unmap DMA.

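For illustration only (not part of the patch): the two sets are told apart
by whether a context descriptor carries an mm pointer, which is the test
the code below relies on:

/* Hedged sketch of how this patch distinguishes the two ASID sets */
static bool example_asid_is_shared(u16 asid)
{
	struct arm_smmu_ctx_desc *cd = xa_load(&asid_xa, asid);

	/* Shared entries mirror a process mm; private ones have cd->mm == NULL */
	return cd && cd->mm;
}
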
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 164 +++++++++++++++++++++++++++++++++++-
 1 file changed, 160 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 7737b70e74cd..3f9adfd1b015 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -22,6 +22,7 @@
 #include <linux/iommu.h>
 #include <linux/iopoll.h>
 #include <linux/module.h>
+#include <linux/mmu_context.h>
 #include <linux/msi.h>
 #include <linux/of.h>
 #include <linux/of_address.h>
@@ -33,6 +34,8 @@
 
 #include <linux/amba/bus.h>
 
+#include "io-pgtable-arm.h"
+
 /* MMIO registers */
 #define ARM_SMMU_IDR0			0x0
 #define IDR0_ST_LVL			GENMASK(28, 27)
@@ -575,6 +578,9 @@ struct arm_smmu_ctx_desc {
 	u64				ttbr;
 	u64				tcr;
 	u64				mair;
+
+	refcount_t			refs;
+	struct mm_struct		*mm;
 };
 
 struct arm_smmu_l1_ctx_desc {
@@ -1639,7 +1645,8 @@ static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
 #ifdef __BIG_ENDIAN
 			CTXDESC_CD_0_ENDI |
 #endif
-			CTXDESC_CD_0_R | CTXDESC_CD_0_A | CTXDESC_CD_0_ASET |
+			CTXDESC_CD_0_R | CTXDESC_CD_0_A |
+			(cd->mm ? 0 : CTXDESC_CD_0_ASET) |
 			CTXDESC_CD_0_AA64 |
 			FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) |
 			CTXDESC_CD_0_V;
@@ -1743,12 +1750,159 @@ static void arm_smmu_free_cd_tables(struct arm_smmu_domain *smmu_domain)
 	cdcfg->cdtab = NULL;
 }
 
-static void arm_smmu_free_asid(struct arm_smmu_ctx_desc *cd)
+static void arm_smmu_init_cd(struct arm_smmu_ctx_desc *cd)
 {
+	refcount_set(&cd->refs, 1);
+}
+
+static bool arm_smmu_free_asid(struct arm_smmu_ctx_desc *cd)
+{
+	bool free;
+	struct arm_smmu_ctx_desc *old_cd;
+
 	if (!cd->asid)
-		return;
+		return false;
+
+	xa_lock(&asid_xa);
+	free = refcount_dec_and_test(&cd->refs);
+	if (free) {
+		old_cd = __xa_erase(&asid_xa, cd->asid);
+		WARN_ON(old_cd != cd);
+	}
+	xa_unlock(&asid_xa);
+	return free;
+}
+
+static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid)
+{
+	struct arm_smmu_ctx_desc *cd;
+
+	cd = xa_load(&asid_xa, asid);
+	if (!cd)
+		return NULL;
 
-	xa_erase(&asid_xa, cd->asid);
+	if (cd->mm) {
+		/*
+		 * It's pretty common to find a stale CD when doing unbind-bind,
+		 * given that the release happens after a RCU grace period.
+		 * arm_smmu_free_asid() hasn't gone through yet, so reuse it.
+		 */
+		refcount_inc(&cd->refs);
+		return cd;
+	}
+
+	/*
+	 * Ouch, ASID is already in use for a private cd.
+	 * TODO: seize it.
+	 */
+	return ERR_PTR(-EEXIST);
+}
+
+__maybe_unused
+static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm)
+{
+	u16 asid;
+	int ret = 0;
+	u64 tcr, par, reg;
+	struct arm_smmu_ctx_desc *cd;
+	struct arm_smmu_ctx_desc *old_cd = NULL;
+
+	asid = mm_context_get(mm);
+	if (!asid)
+		return ERR_PTR(-ESRCH);
+
+	cd = kzalloc(sizeof(*cd), GFP_KERNEL);
+	if (!cd) {
+		ret = -ENOMEM;
+		goto err_put_context;
+	}
+
+	arm_smmu_init_cd(cd);
+
+	xa_lock(&asid_xa);
+	old_cd = arm_smmu_share_asid(asid);
+	if (!old_cd) {
+		old_cd = __xa_store(&asid_xa, asid, cd, GFP_ATOMIC);
+		/*
+		 * Keep error, clear valid pointers. If there was an old entry
+		 * it has been moved already by arm_smmu_share_asid().
+		 */
+		old_cd = ERR_PTR(xa_err(old_cd));
+		cd->asid = asid;
+	}
+	xa_unlock(&asid_xa);
+
+	if (IS_ERR(old_cd)) {
+		ret = PTR_ERR(old_cd);
+		goto err_free_cd;
+	} else if (old_cd) {
+		if (WARN_ON(old_cd->mm != mm)) {
+			ret = -EINVAL;
+			goto err_free_cd;
+		}
+		kfree(cd);
+		mm_context_put(mm);
+		return old_cd;
+	}
+
+	tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - VA_BITS) |
+	      FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) |
+	      FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) |
+	      FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) |
+	      CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
+
+	switch (PAGE_SIZE) {
+		case SZ_4K:
+			tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0,
+					  ARM_LPAE_TCR_TG0_4K);
+			break;
+		case SZ_16K:
+			tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0,
+					  ARM_LPAE_TCR_TG0_16K);
+			break;
+		case SZ_64K:
+			tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0,
+					  ARM_LPAE_TCR_TG0_64K);
+			break;
+		default:
+			WARN_ON(1);
+			ret = -EINVAL;
+			goto err_free_asid;
+	}
+
+	reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+	par = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_PARANGE_SHIFT);
+	tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_IPS, par);
+
+	cd->ttbr = virt_to_phys(mm->pgd);
+	cd->tcr = tcr;
+	/*
+	 * MAIR value is pretty much constant and global, so we can just get it
+	 * from the current CPU register
+	 */
+	cd->mair = read_sysreg(mair_el1);
+
+	cd->mm = mm;
+
+	return cd;
+
+err_free_asid:
+	arm_smmu_free_asid(cd);
+err_free_cd:
+	kfree(cd);
+err_put_context:
+	mm_context_put(mm);
+	return ERR_PTR(ret);
+}
+
+__maybe_unused
+static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd)
+{
+	if (arm_smmu_free_asid(cd)) {
+		/* Unpin ASID */
+		mm_context_put(cd->mm);
+		kfree(cd);
+	}
 }
 
 /* Stream table manipulation functions */
@@ -2419,6 +2573,8 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 	struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg;
 	typeof(&pgtbl_cfg->arm_lpae_s1_cfg.tcr) tcr = &pgtbl_cfg->arm_lpae_s1_cfg.tcr;
 
+	arm_smmu_init_cd(&cfg->cd);
+
 	ret = xa_alloc(&asid_xa, &asid, &cfg->cd,
 		       XA_LIMIT(1, (1 << smmu->asid_bits) - 1), GFP_KERNEL);
 	if (ret)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 12/26] iommu/arm-smmu-v3: Seize private ASID
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (10 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 11/26] iommu/arm-smmu-v3: Share process page tables Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 13/26] iommu/arm-smmu-v3: Add support for VHE Jean-Philippe Brucker
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

The SMMU has a single ASID space, the union of shared and private ASID
sets. This means that the SMMU driver competes with the arch allocator
for ASIDs. Shared ASIDs are those of Linux processes, allocated by the
arch, and take part in broadcast TLB maintenance. Private ASIDs are
allocated by the SMMU driver and used for "classic" map/unmap DMA. They
require explicit TLB invalidations.

When we pin down an mm_context and get an ASID that is already in use by
the SMMU, it belongs to a private context. We used to simply abort the
bind, but this is unfair to users, who would find a few seemingly random
processes impossible to bind. Try to allocate a new private ASID for the
context, and make the old ASID shared.

Introduce a new lock to prevent races when rewriting context
descriptors. Unfortunately it has to be a spinlock since we take it
while holding the asid lock, which will be held in non-sleepable context
(freeing ASIDs from an RCU callback).

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 83 +++++++++++++++++++++++++++++--------
 1 file changed, 66 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 3f9adfd1b015..2839527ec9ee 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -717,6 +717,7 @@ struct arm_smmu_option_prop {
 };
 
 static DEFINE_XARRAY_ALLOC1(asid_xa);
+static DEFINE_SPINLOCK(contexts_lock);
 
 static struct arm_smmu_option_prop arm_smmu_options[] = {
 	{ ARM_SMMU_OPT_SKIP_PREFETCH, "hisilicon,broken-prefetch-cmd" },
@@ -1513,6 +1514,17 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
 }
 
 /* Context descriptor manipulation functions */
+static void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
+{
+	struct arm_smmu_cmdq_ent cmd = {
+		.opcode = CMDQ_OP_TLBI_NH_ASID,
+		.tlbi.asid = asid,
+	};
+
+	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+	arm_smmu_cmdq_issue_sync(smmu);
+}
+
 static void arm_smmu_sync_cd(struct arm_smmu_domain *smmu_domain,
 			     int ssid, bool leaf)
 {
@@ -1547,7 +1559,7 @@ static int arm_smmu_alloc_cd_leaf_table(struct arm_smmu_device *smmu,
 	size_t size = CTXDESC_L2_ENTRIES * (CTXDESC_CD_DWORDS << 3);
 
 	l1_desc->l2ptr = dmam_alloc_coherent(smmu->dev, size,
-					     &l1_desc->l2ptr_dma, GFP_KERNEL);
+					     &l1_desc->l2ptr_dma, GFP_ATOMIC);
 	if (!l1_desc->l2ptr) {
 		dev_warn(smmu->dev,
 			 "failed to allocate context descriptor table\n");
@@ -1593,8 +1605,8 @@ static __le64 *arm_smmu_get_cd_ptr(struct arm_smmu_domain *smmu_domain,
 	return l1_desc->l2ptr + idx * CTXDESC_CD_DWORDS;
 }
 
-static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
-				   int ssid, struct arm_smmu_ctx_desc *cd)
+static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
+				     int ssid, struct arm_smmu_ctx_desc *cd)
 {
 	/*
 	 * This function handles the following cases:
@@ -1670,6 +1682,17 @@ static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
 	return 0;
 }
 
+static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
+				   int ssid, struct arm_smmu_ctx_desc *cd)
+{
+	int ret;
+
+	spin_lock(&contexts_lock);
+	ret = __arm_smmu_write_ctx_desc(smmu_domain, ssid, cd);
+	spin_unlock(&contexts_lock);
+	return ret;
+}
+
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_domain *smmu_domain)
 {
 	int ret;
@@ -1773,9 +1796,18 @@ static bool arm_smmu_free_asid(struct arm_smmu_ctx_desc *cd)
 	return free;
 }
 
+/*
+ * Try to reserve this ASID in the SMMU. If it is in use, try to steal it from
+ * the private entry. Careful here, we may be modifying the context tables of
+ * another SMMU!
+ */
 static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid)
 {
+	int ret;
+	u32 new_asid;
 	struct arm_smmu_ctx_desc *cd;
+	struct arm_smmu_device *smmu;
+	struct arm_smmu_domain *smmu_domain;
 
 	cd = xa_load(&asid_xa, asid);
 	if (!cd)
@@ -1791,11 +1823,31 @@ static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid)
 		return cd;
 	}
 
+	smmu_domain = container_of(cd, struct arm_smmu_domain, s1_cfg.cd);
+	smmu = smmu_domain->smmu;
+
+	/*
+	 * Race with unmap: TLB invalidations will start targeting the new ASID,
+	 * which isn't assigned yet. We'll do an invalidate-all on the old ASID
+	 * later, so it doesn't matter.
+	 */
+	ret = __xa_alloc(&asid_xa, &new_asid, cd,
+			 XA_LIMIT(1, (1 << smmu->asid_bits) - 1), GFP_ATOMIC);
+	if (ret)
+		return ERR_PTR(-ENOSPC);
+	cd->asid = new_asid;
+
 	/*
-	 * Ouch, ASID is already in use for a private cd.
-	 * TODO: seize it.
+	 * Update ASID and invalidate CD in all associated masters. There will
+	 * be some overlap between use of both ASIDs, until we invalidate the
+	 * TLB.
 	 */
-	return ERR_PTR(-EEXIST);
+	arm_smmu_write_ctx_desc(smmu_domain, 0, cd);
+
+	/* Invalidate TLB entries previously associated with that context */
+	arm_smmu_tlb_inv_asid(smmu, asid);
+
+	return NULL;
 }
 
 __maybe_unused
@@ -2389,15 +2441,6 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_cmdq_ent cmd;
 
-	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
-		cmd.opcode	= CMDQ_OP_TLBI_NH_ASID;
-		cmd.tlbi.asid	= smmu_domain->s1_cfg.cd.asid;
-		cmd.tlbi.vmid	= 0;
-	} else {
-		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
-		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
-	}
-
 	/*
 	 * NOTE: when io-pgtable is in non-strict mode, we may get here with
 	 * PTEs previously cleared by unmaps on the current CPU not yet visible
@@ -2405,8 +2448,14 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 	 * insertion to guarantee those are observed before the TLBI. Do be
 	 * careful, 007.
 	 */
-	arm_smmu_cmdq_issue_cmd(smmu, &cmd);
-	arm_smmu_cmdq_issue_sync(smmu);
+	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
+		arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
+	} else {
+		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
+		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
+		arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+		arm_smmu_cmdq_issue_sync(smmu);
+	}
 	arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
 }
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 13/26] iommu/arm-smmu-v3: Add support for VHE
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (11 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 12/26] iommu/arm-smmu-v3: Seize private ASID Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 14/26] iommu/arm-smmu-v3: Enable broadcast TLB maintenance Jean-Philippe Brucker
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

The ARMv8.1 extensions added Virtualization Host Extensions (VHE), which
allow the host kernel to run at EL2. When using normal DMA, device and CPU
address spaces are dissociated and do not need to implement the same
capabilities, so VHE hasn't been used in the SMMU until now.

With shared address spaces however, ASIDs are shared between MMU and SMMU,
and broadcast TLB invalidations issued by a CPU are taken into account by
the SMMU. TLB entries on both sides need to have identical exception level
in order to be cleared with a single invalidation.

When the CPU is using VHE, enable VHE in the SMMU for all STEs. Normal DMA
mappings will need to use TLBI_EL2 commands instead of TLBI_NH, but
shouldn't be otherwise affected by this change.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 2839527ec9ee..77554d89653b 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -13,6 +13,7 @@
 #include <linux/acpi_iort.h>
 #include <linux/bitfield.h>
 #include <linux/bitops.h>
+#include <linux/cpufeature.h>
 #include <linux/crash_dump.h>
 #include <linux/delay.h>
 #include <linux/dma-iommu.h>
@@ -472,6 +473,8 @@ struct arm_smmu_cmdq_ent {
 		#define CMDQ_OP_TLBI_NH_ASID	0x11
 		#define CMDQ_OP_TLBI_NH_VA	0x12
 		#define CMDQ_OP_TLBI_EL2_ALL	0x20
+		#define CMDQ_OP_TLBI_EL2_ASID	0x21
+		#define CMDQ_OP_TLBI_EL2_VA	0x22
 		#define CMDQ_OP_TLBI_S12_VMALL	0x28
 		#define CMDQ_OP_TLBI_S2_IPA	0x2a
 		#define CMDQ_OP_TLBI_NSNH_ALL	0x30
@@ -638,6 +641,7 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_HYP		(1 << 12)
 #define ARM_SMMU_FEAT_STALL_FORCE	(1 << 13)
 #define ARM_SMMU_FEAT_VAX		(1 << 14)
+#define ARM_SMMU_FEAT_E2H		(1 << 15)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
@@ -909,6 +913,8 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
 		break;
 	case CMDQ_OP_TLBI_NH_VA:
 		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
+		/* Fallthrough */
+	case CMDQ_OP_TLBI_EL2_VA:
 		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
 		cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
 		cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK;
@@ -924,6 +930,9 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
 	case CMDQ_OP_TLBI_S12_VMALL:
 		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
 		break;
+	case CMDQ_OP_TLBI_EL2_ASID:
+		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
+		break;
 	case CMDQ_OP_ATC_INV:
 		cmd[0] |= FIELD_PREP(CMDQ_0_SSV, ent->substream_valid);
 		cmd[0] |= FIELD_PREP(CMDQ_ATC_0_GLOBAL, ent->atc.global);
@@ -1517,7 +1526,8 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
 static void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 {
 	struct arm_smmu_cmdq_ent cmd = {
-		.opcode = CMDQ_OP_TLBI_NH_ASID,
+		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
+			CMDQ_OP_TLBI_EL2_ASID : CMDQ_OP_TLBI_NH_ASID,
 		.tlbi.asid = asid,
 	};
 
@@ -2075,13 +2085,16 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	}
 
 	if (s1_cfg) {
+		int strw = smmu->features & ARM_SMMU_FEAT_E2H ?
+			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
+
 		BUG_ON(ste_live);
 		dst[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
-			 FIELD_PREP(STRTAB_STE_1_STRW, STRTAB_STE_1_STRW_NSEL1));
+			 FIELD_PREP(STRTAB_STE_1_STRW, strw));
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
@@ -2476,7 +2489,8 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
 		return;
 
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
-		cmd.opcode	= CMDQ_OP_TLBI_NH_VA;
+		cmd.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
+				  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
 		cmd.tlbi.asid	= smmu_domain->s1_cfg.cd.asid;
 	} else {
 		cmd.opcode	= CMDQ_OP_TLBI_S2_IPA;
@@ -3743,7 +3757,11 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	writel_relaxed(reg, smmu->base + ARM_SMMU_CR1);
 
 	/* CR2 (random crap) */
-	reg = CR2_PTM | CR2_RECINVSID | CR2_E2H;
+	reg = CR2_PTM | CR2_RECINVSID;
+
+	if (smmu->features & ARM_SMMU_FEAT_E2H)
+		reg |= CR2_E2H;
+
 	writel_relaxed(reg, smmu->base + ARM_SMMU_CR2);
 
 	/* Stream table */
@@ -3901,8 +3919,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 	if (reg & IDR0_MSI)
 		smmu->features |= ARM_SMMU_FEAT_MSI;
 
-	if (reg & IDR0_HYP)
+	if (reg & IDR0_HYP) {
 		smmu->features |= ARM_SMMU_FEAT_HYP;
+		if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
+			smmu->features |= ARM_SMMU_FEAT_E2H;
+	}
 
 	/*
 	 * The coherency feature as set by FW is used in preference to the ID
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 14/26] iommu/arm-smmu-v3: Enable broadcast TLB maintenance
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (12 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 13/26] iommu/arm-smmu-v3: Add support for VHE Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 15/26] iommu/arm-smmu-v3: Add SVA feature checking Jean-Philippe Brucker
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

The SMMUv3 can handle invalidations targeted at TLB entries with shared
ASIDs. If the implementation supports broadcast TLB maintenance, enable it
and keep track of it in a feature bit. The SMMU will then be affected by
inner-shareable TLB invalidations from other agents.

A major side-effect of this change is that stage-2 translation contexts
are now affected by all invalidations by VMID. VMIDs are all shared and,
since the stage-2 page tables are not shared between CPU and SMMU, the
only ways to prevent over-invalidation are to either disable BTM or to
allocate different VMIDs. This patch does not address the problem.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 77554d89653b..b72b2fdcd21f 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -56,6 +56,7 @@
 #define IDR0_ASID16			(1 << 12)
 #define IDR0_ATS			(1 << 10)
 #define IDR0_HYP			(1 << 9)
+#define IDR0_BTM			(1 << 5)
 #define IDR0_COHACC			(1 << 4)
 #define IDR0_TTF			GENMASK(3, 2)
 #define IDR0_TTF_AARCH64		2
@@ -642,6 +643,7 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_STALL_FORCE	(1 << 13)
 #define ARM_SMMU_FEAT_VAX		(1 << 14)
 #define ARM_SMMU_FEAT_E2H		(1 << 15)
+#define ARM_SMMU_FEAT_BTM		(1 << 16)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
@@ -3757,11 +3759,14 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	writel_relaxed(reg, smmu->base + ARM_SMMU_CR1);
 
 	/* CR2 (random crap) */
-	reg = CR2_PTM | CR2_RECINVSID;
+	reg = CR2_RECINVSID;
 
 	if (smmu->features & ARM_SMMU_FEAT_E2H)
 		reg |= CR2_E2H;
 
+	if (!(smmu->features & ARM_SMMU_FEAT_BTM))
+		reg |= CR2_PTM;
+
 	writel_relaxed(reg, smmu->base + ARM_SMMU_CR2);
 
 	/* Stream table */
@@ -3872,6 +3877,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
 	u32 reg;
 	bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
+	bool vhe = cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN);
 
 	/* IDR0 */
 	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
@@ -3921,10 +3927,19 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
 	if (reg & IDR0_HYP) {
 		smmu->features |= ARM_SMMU_FEAT_HYP;
-		if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
+		if (vhe)
 			smmu->features |= ARM_SMMU_FEAT_E2H;
 	}
 
+	/*
+	 * If the CPU is using VHE, but the SMMU doesn't support it, the SMMU
+	 * will create TLB entries for NH-EL1 world and will miss the
+	 * broadcasted TLB invalidations that target EL2-E2H world. Don't enable
+	 * BTM in that case.
+	 */
+	if (reg & IDR0_BTM && (!vhe || reg & IDR0_HYP))
+		smmu->features |= ARM_SMMU_FEAT_BTM;
+
 	/*
 	 * The coherency feature as set by FW is used in preference to the ID
 	 * register, but warn on mismatch.
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 15/26] iommu/arm-smmu-v3: Add SVA feature checking
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (13 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 14/26] iommu/arm-smmu-v3: Enable broadcast TLB maintenance Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 16/26] iommu/arm-smmu-v3: Add dev_to_master() helper Jean-Philippe Brucker
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Aggregate all sanity-checks for sharing CPU page tables with the SMMU
under a single ARM_SMMU_FEAT_SVA bit. For PCIe SVA, users also need to
check FEAT_ATS and FEAT_PRI. For platform SVA, they will most likely have
to check FEAT_STALLS.
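
As an example of how a consumer could combine these bits, a hedged sketch
(the helper name is made up, and per-device ATS/PRI probing happens
elsewhere):

static bool example_supports_pcie_sva(struct arm_smmu_device *smmu)
{
	/* Base requirement: the SMMU can share CPU page tables at all */
	if (!(smmu->features & ARM_SMMU_FEAT_SVA))
		return false;

	/* PCIe SVA additionally relies on ATS and PRI */
	if (!(smmu->features & ARM_SMMU_FEAT_ATS))
		return false;
	if (!(smmu->features & ARM_SMMU_FEAT_PRI))
		return false;

	return true;
}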

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 72 +++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index b72b2fdcd21f..77a846440ba6 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -644,6 +644,7 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_VAX		(1 << 14)
 #define ARM_SMMU_FEAT_E2H		(1 << 15)
 #define ARM_SMMU_FEAT_BTM		(1 << 16)
+#define ARM_SMMU_FEAT_SVA		(1 << 17)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
@@ -3873,6 +3874,74 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	return 0;
 }
 
+static bool arm_smmu_supports_sva(struct arm_smmu_device *smmu)
+{
+	unsigned long reg, fld;
+	unsigned long oas;
+	unsigned long asid_bits;
+
+	u32 feat_mask = ARM_SMMU_FEAT_BTM | ARM_SMMU_FEAT_COHERENCY;
+
+	if ((smmu->features & feat_mask) != feat_mask)
+		return false;
+
+	if (!(smmu->pgsize_bitmap & PAGE_SIZE))
+		return false;
+
+	/*
+	 * Get the smallest PA size of all CPUs (sanitized by cpufeature). We're
+	 * not even pretending to support AArch32 here.
+	 */
+	reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+	fld = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_PARANGE_SHIFT);
+	switch (fld) {
+	case 0x0:
+		oas = 32;
+		break;
+	case 0x1:
+		oas = 36;
+		break;
+	case 0x2:
+		oas = 40;
+		break;
+	case 0x3:
+		oas = 42;
+		break;
+	case 0x4:
+		oas = 44;
+		break;
+	case 0x5:
+		oas = 48;
+		break;
+	case 0x6:
+		oas = 52;
+		break;
+	default:
+		return false;
+	}
+
+	/* abort if MMU outputs addresses greater than what we support. */
+	if (smmu->oas < oas)
+		return false;
+
+	/* We can support bigger ASIDs than the CPU, but not smaller */
+	fld = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_ASID_SHIFT);
+	asid_bits = fld ? 16 : 8;
+	if (smmu->asid_bits < asid_bits)
+		return false;
+
+	/*
+	 * See max_pinned_asids in arch/arm64/mm/context.c. The following is
+	 * generally the maximum number of bindable processes.
+	 */
+	if (IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0))
+		asid_bits--;
+	dev_dbg(smmu->dev, "%d shared contexts\n", (1 << asid_bits) -
+		num_possible_cpus() - 2);
+
+	return true;
+}
+
 static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 {
 	u32 reg;
@@ -4080,6 +4149,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
 	smmu->ias = max(smmu->ias, smmu->oas);
 
+	if (arm_smmu_supports_sva(smmu))
+		smmu->features |= ARM_SMMU_FEAT_SVA;
+
 	dev_info(smmu->dev, "ias %lu-bit, oas %lu-bit (features 0x%08x)\n",
 		 smmu->ias, smmu->oas, smmu->features);
 	return 0;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 16/26] iommu/arm-smmu-v3: Add dev_to_master() helper
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (14 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 15/26] iommu/arm-smmu-v3: Add SVA feature checking Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 17/26] iommu/arm-smmu-v3: Implement mm operations Jean-Philippe Brucker
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao

We'll frequently need to find the SMMU master associated with a device
when implementing SVA. Move the lookup into a separate helper.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 77a846440ba6..54bd6913d648 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -747,6 +747,15 @@ static struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 	return container_of(dom, struct arm_smmu_domain, domain);
 }
 
+static struct arm_smmu_master *dev_to_master(struct device *dev)
+{
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+
+	if (!fwspec)
+		return NULL;
+	return fwspec->iommu_priv;
+}
+
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
 	int i = 0;
@@ -2940,15 +2949,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int ret = 0;
 	unsigned long flags;
-	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct arm_smmu_device *smmu;
+	struct arm_smmu_master *master = dev_to_master(dev);
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct arm_smmu_master *master;
 
-	if (!fwspec)
+	if (!master)
 		return -ENOENT;
 
-	master = fwspec->iommu_priv;
 	smmu = master->smmu;
 
 	arm_smmu_detach_dev(master);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 17/26] iommu/arm-smmu-v3: Implement mm operations
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (15 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 16/26] iommu/arm-smmu-v3: Add dev_to_master() helper Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 18/26] iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops Jean-Philippe Brucker
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Hook up the SVA operations needed to share process page tables with the
SMMUv3 (a rough usage sketch follows the list):

* dev_enable/disable/has_feature, for device drivers to query and modify
  the SVA state.
* sva_bind/unbind and sva_get_pasid, to bind devices to address spaces.
* The mm_attach/detach/invalidate/free callbacks from iommu-sva.
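
For context, here is a rough sketch (not part of the patch) of how a
device driver might consume these operations through the generic IOMMU
API; my_dev_bind() and the PASID programming step are hypothetical, and
error handling is trimmed:

    static int my_dev_bind(struct device *dev, struct mm_struct *mm)
    {
            struct iommu_sva *handle;
            int ret, pasid;

            ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA);
            if (ret)
                    return ret;

            handle = iommu_sva_bind_device(dev, mm, NULL);
            if (IS_ERR(handle)) {
                    iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
                    return PTR_ERR(handle);
            }

            /* Program the PASID into the device so its DMA uses this mm */
            pasid = iommu_sva_get_pasid(handle);
            dev_dbg(dev, "bound to PASID %d\n", pasid);
            return 0;
    }

Teardown is the reverse: iommu_sva_unbind_device() on the handle, then
iommu_dev_disable_feature().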

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/Kconfig       |   1 +
 drivers/iommu/arm-smmu-v3.c | 176 +++++++++++++++++++++++++++++++++++-
 2 files changed, 175 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 211684e785ea..05341155d34b 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -434,6 +434,7 @@ config ARM_SMMU_V3
 	tristate "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
 	depends on ARM64
 	select IOMMU_API
+	select IOMMU_SVA
 	select IOMMU_IO_PGTABLE_LPAE
 	select GENERIC_MSI_IRQ_DOMAIN
 	help
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 54bd6913d648..3973f7222864 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -36,6 +36,7 @@
 #include <linux/amba/bus.h>
 
 #include "io-pgtable-arm.h"
+#include "iommu-sva.h"
 
 /* MMIO registers */
 #define ARM_SMMU_IDR0			0x0
@@ -1872,7 +1873,6 @@ static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid)
 	return NULL;
 }
 
-__maybe_unused
 static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm)
 {
 	u16 asid;
@@ -1969,7 +1969,6 @@ static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm)
 	return ERR_PTR(ret);
 }
 
-__maybe_unused
 static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd)
 {
 	if (arm_smmu_free_asid(cd)) {
@@ -2958,6 +2957,12 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 	smmu = master->smmu;
 
+	if (iommu_sva_enabled(dev)) {
+		/* Did the previous driver forget to release SVA handles? */
+		dev_err(dev, "cannot attach - SVA enabled\n");
+		return -EBUSY;
+	}
+
 	arm_smmu_detach_dev(master);
 
 	mutex_lock(&smmu_domain->init_mutex);
@@ -3057,6 +3062,81 @@ arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 	return ops->iova_to_phys(ops, iova);
 }
 
+static void arm_smmu_mm_invalidate(struct device *dev, int pasid, void *entry,
+				   unsigned long iova, size_t size)
+{
+	/* TODO: Invalidate ATC */
+}
+
+static int arm_smmu_mm_attach(struct device *dev, int pasid, void *entry,
+			      bool attach_domain)
+{
+	struct arm_smmu_ctx_desc *cd = entry;
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+	/*
+	 * If another device in the domain has already been attached, the
+	 * context descriptor is already valid.
+	 */
+	if (!attach_domain)
+		return 0;
+
+	return arm_smmu_write_ctx_desc(smmu_domain, pasid, cd);
+}
+
+static void arm_smmu_mm_detach(struct device *dev, int pasid, void *entry,
+			       bool detach_domain)
+{
+	struct arm_smmu_ctx_desc *cd = entry;
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+	if (detach_domain) {
+		arm_smmu_write_ctx_desc(smmu_domain, pasid, NULL);
+
+		/*
+		 * The ASID allocator won't broadcast the final TLB
+		 * invalidations for this ASID, so we need to do it manually.
+		 * For private contexts, freeing io-pgtable ops performs the
+		 * invalidation.
+		 */
+		arm_smmu_tlb_inv_asid(smmu_domain->smmu, cd->asid);
+	}
+
+	/* TODO: invalidate ATC */
+}
+
+static void *arm_smmu_mm_alloc(struct mm_struct *mm)
+{
+	return arm_smmu_alloc_shared_cd(mm);
+}
+
+static void arm_smmu_mm_free(void *entry)
+{
+	arm_smmu_free_shared_cd(entry);
+}
+
+static struct io_mm_ops arm_smmu_mm_ops = {
+	.alloc		= arm_smmu_mm_alloc,
+	.invalidate	= arm_smmu_mm_invalidate,
+	.attach		= arm_smmu_mm_attach,
+	.detach		= arm_smmu_mm_detach,
+	.release	= arm_smmu_mm_free,
+};
+
+static struct iommu_sva *
+arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+	if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1)
+		return ERR_PTR(-EINVAL);
+
+	return iommu_sva_bind_generic(dev, mm, &arm_smmu_mm_ops, drvdata);
+}
+
 static struct platform_driver arm_smmu_driver;
 
 static
@@ -3175,6 +3255,7 @@ static void arm_smmu_remove_device(struct device *dev)
 
 	master = fwspec->iommu_priv;
 	smmu = master->smmu;
+	iommu_sva_disable(dev);
 	arm_smmu_detach_dev(master);
 	iommu_group_remove_device(dev);
 	iommu_device_unlink(&smmu->iommu, dev);
@@ -3294,6 +3375,90 @@ static void arm_smmu_get_resv_regions(struct device *dev,
 	iommu_dma_get_resv_regions(dev, head);
 }
 
+static bool arm_smmu_iopf_supported(struct arm_smmu_master *master)
+{
+	return false;
+}
+
+static bool arm_smmu_dev_has_feature(struct device *dev,
+				     enum iommu_dev_features feat)
+{
+	struct arm_smmu_master *master = dev_to_master(dev);
+
+	if (!master)
+		return false;
+
+	switch (feat) {
+	case IOMMU_DEV_FEAT_SVA:
+		if (!(master->smmu->features & ARM_SMMU_FEAT_SVA))
+			return false;
+
+		/* SSID and IOPF support are mandatory for the moment */
+		return master->ssid_bits && arm_smmu_iopf_supported(master);
+	default:
+		return false;
+	}
+}
+
+static bool arm_smmu_dev_feature_enabled(struct device *dev,
+					 enum iommu_dev_features feat)
+{
+	struct arm_smmu_master *master = dev_to_master(dev);
+
+	if (!master)
+		return false;
+
+	switch (feat) {
+	case IOMMU_DEV_FEAT_SVA:
+		return iommu_sva_enabled(dev);
+	default:
+		return false;
+	}
+}
+
+static int arm_smmu_dev_enable_sva(struct device *dev)
+{
+	struct arm_smmu_master *master = dev_to_master(dev);
+	struct iommu_sva_param param = {
+		.min_pasid = 1,
+		.max_pasid = 0xfffffU,
+	};
+
+	param.max_pasid = min(param.max_pasid, (1U << master->ssid_bits) - 1);
+	return iommu_sva_enable(dev, &param);
+}
+
+static int arm_smmu_dev_enable_feature(struct device *dev,
+				       enum iommu_dev_features feat)
+{
+	if (!arm_smmu_dev_has_feature(dev, feat))
+		return -ENODEV;
+
+	if (arm_smmu_dev_feature_enabled(dev, feat))
+		return -EBUSY;
+
+	switch (feat) {
+	case IOMMU_DEV_FEAT_SVA:
+		return arm_smmu_dev_enable_sva(dev);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int arm_smmu_dev_disable_feature(struct device *dev,
+					enum iommu_dev_features feat)
+{
+	if (!arm_smmu_dev_feature_enabled(dev, feat))
+		return -EINVAL;
+
+	switch (feat) {
+	case IOMMU_DEV_FEAT_SVA:
+		return iommu_sva_disable(dev);
+	default:
+		return -EINVAL;
+	}
+}
+
 static struct iommu_ops arm_smmu_ops = {
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
@@ -3312,6 +3477,13 @@ static struct iommu_ops arm_smmu_ops = {
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
 	.put_resv_regions	= generic_iommu_put_resv_regions,
+	.dev_has_feat		= arm_smmu_dev_has_feature,
+	.dev_feat_enabled	= arm_smmu_dev_feature_enabled,
+	.dev_enable_feat	= arm_smmu_dev_enable_feature,
+	.dev_disable_feat	= arm_smmu_dev_disable_feature,
+	.sva_bind		= arm_smmu_sva_bind,
+	.sva_unbind		= iommu_sva_unbind_generic,
+	.sva_get_pasid		= iommu_sva_get_pasid_generic,
 	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
 };
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 18/26] iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (16 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 17/26] iommu/arm-smmu-v3: Implement mm operations Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 19/26] iommu/arm-smmu-v3: Add support for Hardware Translation Table Update Jean-Philippe Brucker
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

The IOMMU core calls back into the driver when a shared address space (mm)
is modified. Perform the required ATC invalidations.
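
For context, the rough call path when a process unmaps part of a shared
address space is the following (call names approximate, based on my
reading of this series):

    munmap() / zap_page_range()
      -> mmu_notifier invalidate_range()
        -> iommu-sva notifier
          -> io_mm_ops->invalidate, i.e. arm_smmu_mm_invalidate()  (this patch)
            -> arm_smmu_atc_inv_to_cmd() + arm_smmu_cmdq_batch_submit()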

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 44 ++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 3973f7222864..95b4caceae1a 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2354,6 +2354,20 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
 	size_t inval_grain_shift = 12;
 	unsigned long page_start, page_end;
 
+	/*
+	 * ATS and PASID:
+	 *
+	 * If substream_valid is clear, the PCIe TLP is sent without a PASID
+	 * prefix. In that case all ATC entries within the address range are
+	 * invalidated, including those that were requested with a PASID! There
+	 * is no way to invalidate only entries without PASID.
+	 *
+	 * When using STRTAB_STE_1_S1DSS_SSID0 (reserving CD 0 for non-PASID
+	 * traffic), translation requests without PASID create ATC entries
+	 * without PASID, which must be invalidated with substream_valid clear.
+	 * This has the unpleasant side-effect of invalidating all PASID-tagged
+	 * ATC entries within the address range.
+	 */
 	*cmd = (struct arm_smmu_cmdq_ent) {
 		.opcode			= CMDQ_OP_ATC_INV,
 		.substream_valid	= !!ssid,
@@ -2397,12 +2411,12 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
 	cmd->atc.size	= log2_span;
 }
 
-static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
+static int arm_smmu_atc_inv_master(struct arm_smmu_master *master, int ssid)
 {
 	int i;
 	struct arm_smmu_cmdq_ent cmd;
 
-	arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);
+	arm_smmu_atc_inv_to_cmd(ssid, 0, 0, &cmd);
 
 	for (i = 0; i < master->num_sids; i++) {
 		cmd.atc.sid = master->sids[i];
@@ -2874,7 +2888,7 @@ static void arm_smmu_disable_ats(struct arm_smmu_master *master)
 	 * ATC invalidation via the SMMU.
 	 */
 	wmb();
-	arm_smmu_atc_inv_master(master);
+	arm_smmu_atc_inv_master(master, 0);
 	atomic_dec(&smmu_domain->nr_ats_masters);
 }
 
@@ -3065,7 +3079,22 @@ arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
 static void arm_smmu_mm_invalidate(struct device *dev, int pasid, void *entry,
 				   unsigned long iova, size_t size)
 {
-	/* TODO: Invalidate ATC */
+	int i;
+	struct arm_smmu_cmdq_ent cmd;
+	struct arm_smmu_cmdq_batch cmds = {};
+	struct arm_smmu_master *master = dev_to_master(dev);
+
+	if (!master->ats_enabled)
+		return;
+
+	arm_smmu_atc_inv_to_cmd(pasid, iova, size, &cmd);
+
+	for (i = 0; i < master->num_sids; i++) {
+		cmd.atc.sid = master->sids[i];
+		arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
+	}
+
+	arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
 }
 
 static int arm_smmu_mm_attach(struct device *dev, int pasid, void *entry,
@@ -3089,6 +3118,7 @@ static void arm_smmu_mm_detach(struct device *dev, int pasid, void *entry,
 			       bool detach_domain)
 {
 	struct arm_smmu_ctx_desc *cd = entry;
+	struct arm_smmu_master *master = dev_to_master(dev);
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 
@@ -3102,9 +3132,11 @@ static void arm_smmu_mm_detach(struct device *dev, int pasid, void *entry,
 		 * invalidation.
 		 */
 		arm_smmu_tlb_inv_asid(smmu_domain->smmu, cd->asid);
-	}
+		arm_smmu_atc_inv_domain(smmu_domain, pasid, 0, 0);
 
-	/* TODO: invalidate ATC */
+	} else if (master->ats_enabled) {
+		arm_smmu_atc_inv_master(master, pasid);
+	}
 }
 
 static void *arm_smmu_mm_alloc(struct mm_struct *mm)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 19/26] iommu/arm-smmu-v3: Add support for Hardware Translation Table Update
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (17 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 18/26] iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 20/26] iommu/arm-smmu-v3: Maintain a SID->device structure Jean-Philippe Brucker
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

If the SMMU supports it and the kernel was built with HTTU support, enable
hardware update of the access and dirty flags. This is essential for shared
page tables: it reduces the number of access faults on the fault queue.

We can enable HTTU in the SMMU even when the CPUs don't support it, because
the kernel always checks the HW dirty bit and updates the PTE flags
atomically.
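
For reference, the CPU-side claim rests on arm64's pte_dirty() combining
the software dirty bit with the hardware DBM state. A simplified
paraphrase (not the exact mainline code, function name is made up):

    /* A PTE is dirty if either the SW bit or the HW (DBM) state says so */
    static inline bool example_pte_dirty(pte_t pte)
    {
            bool sw_dirty = pte_val(pte) & PTE_DIRTY;
            bool hw_dirty = (pte_val(pte) & PTE_DBM) &&
                            !(pte_val(pte) & PTE_RDONLY);

            return sw_dirty || hw_dirty;
    }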

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 95b4caceae1a..015e8e59e0ef 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -57,6 +57,8 @@
 #define IDR0_ASID16			(1 << 12)
 #define IDR0_ATS			(1 << 10)
 #define IDR0_HYP			(1 << 9)
+#define IDR0_HD				(1 << 7)
+#define IDR0_HA				(1 << 6)
 #define IDR0_BTM			(1 << 5)
 #define IDR0_COHACC			(1 << 4)
 #define IDR0_TTF			GENMASK(3, 2)
@@ -305,6 +307,9 @@
 #define CTXDESC_CD_0_TCR_IPS		GENMASK_ULL(34, 32)
 #define CTXDESC_CD_0_TCR_TBI0		(1ULL << 38)
 
+#define CTXDESC_CD_0_TCR_HA		(1UL << 43)
+#define CTXDESC_CD_0_TCR_HD		(1UL << 42)
+
 #define CTXDESC_CD_0_AA64		(1UL << 41)
 #define CTXDESC_CD_0_S			(1UL << 44)
 #define CTXDESC_CD_0_R			(1UL << 45)
@@ -646,6 +651,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_E2H		(1 << 15)
 #define ARM_SMMU_FEAT_BTM		(1 << 16)
 #define ARM_SMMU_FEAT_SVA		(1 << 17)
+#define ARM_SMMU_FEAT_HA		(1 << 18)
+#define ARM_SMMU_FEAT_HD		(1 << 19)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
@@ -1665,10 +1672,17 @@ static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
 		 * this substream's traffic
 		 */
 	} else { /* (1) and (2) */
+		u64 tcr = cd->tcr;
+
 		cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK);
 		cdptr[2] = 0;
 		cdptr[3] = cpu_to_le64(cd->mair);
 
+		if (!(smmu->features & ARM_SMMU_FEAT_HD))
+			tcr &= ~CTXDESC_CD_0_TCR_HD;
+		if (!(smmu->features & ARM_SMMU_FEAT_HA))
+			tcr &= ~CTXDESC_CD_0_TCR_HA;
+
 		/*
 		 * STE is live, and the SMMU might read dwords of this CD in any
 		 * order. Ensure that it observes valid values before reading
@@ -1676,7 +1690,7 @@ static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
 		 */
 		arm_smmu_sync_cd(smmu_domain, ssid, true);
 
-		val = cd->tcr |
+		val = tcr |
 #ifdef __BIG_ENDIAN
 			CTXDESC_CD_0_ENDI |
 #endif
@@ -1919,10 +1933,12 @@ static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm)
 		return old_cd;
 	}
 
+	/* HA and HD will be filtered out later if not supported by the SMMU */
 	tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - VA_BITS) |
 	      FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) |
 	      FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) |
 	      FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) |
+	      CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD |
 	      CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64;
 
 	switch (PAGE_SIZE) {
@@ -4211,6 +4227,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 			smmu->features |= ARM_SMMU_FEAT_E2H;
 	}
 
+	if (reg & (IDR0_HA | IDR0_HD)) {
+		smmu->features |= ARM_SMMU_FEAT_HA;
+		if (reg & IDR0_HD)
+			smmu->features |= ARM_SMMU_FEAT_HD;
+	}
+
 	/*
 	 * If the CPU is using VHE, but the SMMU doesn't support it, the SMMU
 	 * will create TLB entries for NH-EL1 world and will miss the
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 20/26] iommu/arm-smmu-v3: Maintain a SID->device structure
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (18 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 19/26] iommu/arm-smmu-v3: Add support for Hardware Translation Table Update Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 21/26] iommu/arm-smmu-v3: Ratelimit event dump Jean-Philippe Brucker
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

When handling faults from the event or PRI queue, we need to find the
struct device associated with a SID. Add an rb-tree to keep track of SIDs.
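
As a rough sketch of the intended use on the fault path (the EVTQ field
macros, fault_evt and the fault reporting call appear in a later patch of
this series):

    /* Sketch: resolve the faulting SID to its device and report the fault */
    u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]);
    struct arm_smmu_master *master = arm_smmu_find_master(smmu, sid);

    if (master)
            iommu_report_device_fault(master->dev, &fault_evt);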

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 177 +++++++++++++++++++++++++++++-------
 1 file changed, 145 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 015e8e59e0ef..28f8583cd47b 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -684,6 +684,15 @@ struct arm_smmu_device {
 
 	/* IOMMU core code handle */
 	struct iommu_device		iommu;
+
+	struct rb_root			streams;
+	struct mutex			streams_mutex;
+};
+
+struct arm_smmu_stream {
+	u32				id;
+	struct arm_smmu_master		*master;
+	struct rb_node			node;
 };
 
 /* SMMU private data for each master */
@@ -692,8 +701,8 @@ struct arm_smmu_master {
 	struct device			*dev;
 	struct arm_smmu_domain		*domain;
 	struct list_head		domain_head;
-	u32				*sids;
-	unsigned int			num_sids;
+	struct arm_smmu_stream		*streams;
+	unsigned int			num_streams;
 	bool				ats_enabled;
 	unsigned int			ssid_bits;
 };
@@ -1573,8 +1582,8 @@ static void arm_smmu_sync_cd(struct arm_smmu_domain *smmu_domain,
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_for_each_entry(master, &smmu_domain->devices, domain_head) {
-		for (i = 0; i < master->num_sids; i++) {
-			cmd.cfgi.sid = master->sids[i];
+		for (i = 0; i < master->num_streams; i++) {
+			cmd.cfgi.sid = master->streams[i].id;
 			arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
 		}
 	}
@@ -2201,6 +2210,32 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 	return 0;
 }
 
+__maybe_unused
+static struct arm_smmu_master *
+arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
+{
+	struct rb_node *node;
+	struct arm_smmu_stream *stream;
+	struct arm_smmu_master *master = NULL;
+
+	mutex_lock(&smmu->streams_mutex);
+	node = smmu->streams.rb_node;
+	while (node) {
+		stream = rb_entry(node, struct arm_smmu_stream, node);
+		if (stream->id < sid) {
+			node = node->rb_right;
+		} else if (stream->id > sid) {
+			node = node->rb_left;
+		} else {
+			master = stream->master;
+			break;
+		}
+	}
+	mutex_unlock(&smmu->streams_mutex);
+
+	return master;
+}
+
 /* IRQ and event handlers */
 static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 {
@@ -2434,8 +2469,8 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master, int ssid)
 
 	arm_smmu_atc_inv_to_cmd(ssid, 0, 0, &cmd);
 
-	for (i = 0; i < master->num_sids; i++) {
-		cmd.atc.sid = master->sids[i];
+	for (i = 0; i < master->num_streams; i++) {
+		cmd.atc.sid = master->streams[i].id;
 		arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
 	}
 
@@ -2478,8 +2513,8 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
 		if (!master->ats_enabled)
 			continue;
 
-		for (i = 0; i < master->num_sids; i++) {
-			cmd.atc.sid = master->sids[i];
+		for (i = 0; i < master->num_streams; i++) {
+			cmd.atc.sid = master->streams[i].id;
 			arm_smmu_cmdq_batch_add(smmu_domain->smmu, &cmds, &cmd);
 		}
 	}
@@ -2846,13 +2881,13 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
 	int i, j;
 	struct arm_smmu_device *smmu = master->smmu;
 
-	for (i = 0; i < master->num_sids; ++i) {
-		u32 sid = master->sids[i];
+	for (i = 0; i < master->num_streams; ++i) {
+		u32 sid = master->streams[i].id;
 		__le64 *step = arm_smmu_get_step_for_sid(smmu, sid);
 
 		/* Bridged PCI devices may end up with duplicated IDs */
 		for (j = 0; j < i; j++)
-			if (master->sids[j] == sid)
+			if (master->streams[j].id == sid)
 				break;
 		if (j < i)
 			continue;
@@ -3105,8 +3140,8 @@ static void arm_smmu_mm_invalidate(struct device *dev, int pasid, void *entry,
 
 	arm_smmu_atc_inv_to_cmd(pasid, iova, size, &cmd);
 
-	for (i = 0; i < master->num_sids; i++) {
-		cmd.atc.sid = master->sids[i];
+	for (i = 0; i < master->num_streams; i++) {
+		cmd.atc.sid = master->streams[i].id;
 		arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
 	}
 
@@ -3206,11 +3241,99 @@ static bool arm_smmu_sid_in_range(struct arm_smmu_device *smmu, u32 sid)
 	return sid < limit;
 }
 
+static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
+				  struct arm_smmu_master *master)
+{
+	int i;
+	int ret = 0;
+	struct arm_smmu_stream *new_stream, *cur_stream;
+	struct rb_node **new_node, *parent_node = NULL;
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
+
+	master->streams = kcalloc(fwspec->num_ids,
+				  sizeof(struct arm_smmu_stream), GFP_KERNEL);
+	if (!master->streams)
+		return -ENOMEM;
+	master->num_streams = fwspec->num_ids;
+
+	mutex_lock(&smmu->streams_mutex);
+	for (i = 0; i < fwspec->num_ids && !ret; i++) {
+		u32 sid = fwspec->ids[i];
+
+		new_stream = &master->streams[i];
+		new_stream->id = sid;
+		new_stream->master = master;
+
+		/* Check the SIDs are in range of the SMMU and our stream table */
+		if (!arm_smmu_sid_in_range(smmu, sid)) {
+			ret = -ERANGE;
+			break;
+		}
+
+		/* Ensure l2 strtab is initialised */
+		if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+			ret = arm_smmu_init_l2_strtab(smmu, sid);
+			if (ret)
+				break;
+		}
+
+		/* Insert into SID tree */
+		new_node = &(smmu->streams.rb_node);
+		while (*new_node) {
+			cur_stream = rb_entry(*new_node, struct arm_smmu_stream,
+					      node);
+			parent_node = *new_node;
+			if (cur_stream->id > new_stream->id) {
+				new_node = &((*new_node)->rb_left);
+			} else if (cur_stream->id < new_stream->id) {
+				new_node = &((*new_node)->rb_right);
+			} else {
+				dev_warn(master->dev,
+					 "stream %u already in tree\n",
+					 cur_stream->id);
+				ret = -EINVAL;
+				break;
+			}
+		}
+
+		if (ret)
+			break;
+
+		rb_link_node(&new_stream->node, parent_node, new_node);
+		rb_insert_color(&new_stream->node, &smmu->streams);
+	}
+
+	if (ret) {
+		/* Only unwind the streams that were actually inserted */
+		for (i--; i >= 0; i--)
+			rb_erase(&master->streams[i].node, &smmu->streams);
+		kfree(master->streams);
+	}
+	mutex_unlock(&smmu->streams_mutex);
+
+	return ret;
+}
+
+static void arm_smmu_remove_master(struct arm_smmu_device *smmu,
+				   struct arm_smmu_master *master)
+{
+	int i;
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
+
+	if (!master->streams)
+		return;
+
+	mutex_lock(&smmu->streams_mutex);
+	for (i = 0; i < fwspec->num_ids; i++)
+		rb_erase(&master->streams[i].node, &smmu->streams);
+	mutex_unlock(&smmu->streams_mutex);
+
+	kfree(master->streams);
+}
+
 static struct iommu_ops arm_smmu_ops;
 
 static int arm_smmu_add_device(struct device *dev)
 {
-	int i, ret;
+	int ret;
 	struct arm_smmu_device *smmu;
 	struct arm_smmu_master *master;
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
@@ -3232,26 +3355,11 @@ static int arm_smmu_add_device(struct device *dev)
 
 	master->dev = dev;
 	master->smmu = smmu;
-	master->sids = fwspec->ids;
-	master->num_sids = fwspec->num_ids;
 	fwspec->iommu_priv = master;
 
-	/* Check the SIDs are in range of the SMMU and our stream table */
-	for (i = 0; i < master->num_sids; i++) {
-		u32 sid = master->sids[i];
-
-		if (!arm_smmu_sid_in_range(smmu, sid)) {
-			ret = -ERANGE;
-			goto err_free_master;
-		}
-
-		/* Ensure l2 strtab is initialised */
-		if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
-			ret = arm_smmu_init_l2_strtab(smmu, sid);
-			if (ret)
-				goto err_free_master;
-		}
-	}
+	ret = arm_smmu_insert_master(smmu, master);
+	if (ret)
+		goto err_free_master;
 
 	master->ssid_bits = min(smmu->ssid_bits, fwspec->num_pasid_bits);
 
@@ -3286,6 +3394,7 @@ static int arm_smmu_add_device(struct device *dev)
 	iommu_device_unlink(&smmu->iommu, dev);
 err_disable_pasid:
 	arm_smmu_disable_pasid(master);
+	arm_smmu_remove_master(smmu, master);
 err_free_master:
 	kfree(master);
 	fwspec->iommu_priv = NULL;
@@ -3308,6 +3417,7 @@ static void arm_smmu_remove_device(struct device *dev)
 	iommu_group_remove_device(dev);
 	iommu_device_unlink(&smmu->iommu, dev);
 	arm_smmu_disable_pasid(master);
+	arm_smmu_remove_master(smmu, master);
 	kfree(master);
 	iommu_fwspec_free(dev);
 }
@@ -3751,6 +3861,9 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
 {
 	int ret;
 
+	mutex_init(&smmu->streams_mutex);
+	smmu->streams = RB_ROOT;
+
 	ret = arm_smmu_init_queues(smmu);
 	if (ret)
 		return ret;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 21/26] iommu/arm-smmu-v3: Ratelimit event dump
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (19 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 20/26] iommu/arm-smmu-v3: Maintain a SID->device structure Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2021-05-28  8:09   ` Aaro Koskinen
  2020-02-24 18:23 ` [PATCH v4 22/26] dt-bindings: document stall property for IOMMU masters Jean-Philippe Brucker
                   ` (5 subsequent siblings)
  26 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao

When a device or driver misbehaves, it is possible to receive events
much faster than we can print them out. Ratelimit the printing of
events.
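
The fix uses the stock printk ratelimit helpers from <linux/ratelimit.h>.
As a generic reminder of the pattern (not tied to this driver):

    #include <linux/ratelimit.h>

    static void report_noisy_event(void)
    {
            static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
                                          DEFAULT_RATELIMIT_BURST);

            /* By default: at most 10 messages every 5 seconds */
            if (__ratelimit(&rs))
                    pr_info("noisy event\n");
    }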

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
During SVA tests, when the device driver didn't properly stop DMA before
unbinding, the event queue thread would almost lock up the server with a
flood of event 0xa. This patch helped recover from the error.
---
 drivers/iommu/arm-smmu-v3.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 28f8583cd47b..6a5987cce03f 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2243,17 +2243,20 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 	struct arm_smmu_device *smmu = dev;
 	struct arm_smmu_queue *q = &smmu->evtq.q;
 	struct arm_smmu_ll_queue *llq = &q->llq;
+	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 	u64 evt[EVTQ_ENT_DWORDS];
 
 	do {
 		while (!queue_remove_raw(q, evt)) {
 			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
 
-			dev_info(smmu->dev, "event 0x%02x received:\n", id);
-			for (i = 0; i < ARRAY_SIZE(evt); ++i)
-				dev_info(smmu->dev, "\t0x%016llx\n",
-					 (unsigned long long)evt[i]);
-
+			if (__ratelimit(&rs)) {
+				dev_info(smmu->dev, "event 0x%02x received:\n", id);
+				for (i = 0; i < ARRAY_SIZE(evt); ++i)
+					dev_info(smmu->dev, "\t0x%016llx\n",
+						 (unsigned long long)evt[i]);
+			}
 		}
 
 		/*
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 22/26] dt-bindings: document stall property for IOMMU masters
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (20 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 21/26] iommu/arm-smmu-v3: Ratelimit event dump Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-24 18:23 ` [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices Jean-Philippe Brucker
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker,
	Rob Herring

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

On ARM systems, some platform devices behind an IOMMU may support stall,
which is the ability to recover from page faults. Let the firmware tell us
when a device supports stall.

Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 .../devicetree/bindings/iommu/iommu.txt        | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/Documentation/devicetree/bindings/iommu/iommu.txt b/Documentation/devicetree/bindings/iommu/iommu.txt
index 3c36334e4f94..26ba9e530f13 100644
--- a/Documentation/devicetree/bindings/iommu/iommu.txt
+++ b/Documentation/devicetree/bindings/iommu/iommu.txt
@@ -92,6 +92,24 @@ Optional properties:
   tagging DMA transactions with an address space identifier. By default,
   this is 0, which means that the device only has one address space.
 
+- dma-can-stall: When present, the master can wait for a transaction to
+  complete for an indefinite amount of time. Upon a translation fault, some
+  IOMMUs, instead of aborting the translation immediately, may first
+  notify the driver and keep the transaction in flight. This allows the OS
+  to inspect the fault and, for example, make physical pages resident
+  before updating the mappings and completing the transaction. Such an IOMMU
+  accepts a limited number of simultaneous stalled transactions before
+  having to either put back-pressure on the master, or abort new faulting
+  transactions.
+
+  Firmware has to opt in to stalling, because most buses and masters don't
+  support it. In particular, it isn't compatible with PCI, where
+  transactions have to complete within a time limit. More generally it
+  won't work in systems and masters that haven't been designed for
+  stalling. For example, the OS, in order to handle a stalled transaction,
+  may attempt to retrieve pages from secondary storage in a stalled
+  domain, leading to a deadlock.
+
 
 Notes:
 ======
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (21 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 22/26] dt-bindings: document stall property for IOMMU masters Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-26  8:44   ` Xu Zaibo
  2020-02-27 18:17   ` Jonathan Cameron
  2020-02-24 18:23 ` [PATCH v4 24/26] PCI/ATS: Add PRI stubs Jean-Philippe Brucker
                   ` (3 subsequent siblings)
  26 siblings, 2 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

The SMMU provides a Stall model for handling page faults in platform
devices. It is similar to PCI PRI, but doesn't require devices to have
their own translation cache. Instead, faulting transactions are parked and
the OS is given a chance to fix the page tables and retry the transaction.

Enable stall for devices that support it (the firmware has to opt in). When
an event corresponds to a translation error, call the IOMMU fault handler.
If the fault is recoverable, the handler will call us back to either retry
or terminate the stalled transaction.
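
Together with the I/O page fault queue added earlier in the series, the
recoverable-fault flow is roughly the following (my summary, names
approximate):

    stalled transaction -> event queue entry with STALL=1
      arm_smmu_evtq_thread()
        arm_smmu_handle_evt()             builds an iommu_fault_page_request
          iommu_report_device_fault()
            IOPF workqueue resolves the fault (handle_mm_fault on the bound mm)
              iommu_page_response()
                arm_smmu_page_response()  sends CMD_RESUME (retry or abort)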

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++--
 drivers/iommu/of_iommu.c    |   5 +-
 include/linux/iommu.h       |   2 +
 3 files changed, 269 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 6a5987cce03f..da5dda5ba26a 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -374,6 +374,13 @@
 #define CMDQ_PRI_1_GRPID		GENMASK_ULL(8, 0)
 #define CMDQ_PRI_1_RESP			GENMASK_ULL(13, 12)
 
+#define CMDQ_RESUME_0_SID		GENMASK_ULL(63, 32)
+#define CMDQ_RESUME_0_RESP_TERM		0UL
+#define CMDQ_RESUME_0_RESP_RETRY	1UL
+#define CMDQ_RESUME_0_RESP_ABORT	2UL
+#define CMDQ_RESUME_0_RESP		GENMASK_ULL(13, 12)
+#define CMDQ_RESUME_1_STAG		GENMASK_ULL(15, 0)
+
 #define CMDQ_SYNC_0_CS			GENMASK_ULL(13, 12)
 #define CMDQ_SYNC_0_CS_NONE		0
 #define CMDQ_SYNC_0_CS_IRQ		1
@@ -390,6 +397,25 @@
 
 #define EVTQ_0_ID			GENMASK_ULL(7, 0)
 
+#define EVT_ID_TRANSLATION_FAULT	0x10
+#define EVT_ID_ADDR_SIZE_FAULT		0x11
+#define EVT_ID_ACCESS_FAULT		0x12
+#define EVT_ID_PERMISSION_FAULT		0x13
+
+#define EVTQ_0_SSV			(1UL << 11)
+#define EVTQ_0_SSID			GENMASK_ULL(31, 12)
+#define EVTQ_0_SID			GENMASK_ULL(63, 32)
+#define EVTQ_1_STAG			GENMASK_ULL(15, 0)
+#define EVTQ_1_STALL			(1UL << 31)
+#define EVTQ_1_PRIV			(1UL << 33)
+#define EVTQ_1_EXEC			(1UL << 34)
+#define EVTQ_1_READ			(1UL << 35)
+#define EVTQ_1_S2			(1UL << 39)
+#define EVTQ_1_CLASS			GENMASK_ULL(41, 40)
+#define EVTQ_1_TT_READ			(1UL << 44)
+#define EVTQ_2_ADDR			GENMASK_ULL(63, 0)
+#define EVTQ_3_IPA			GENMASK_ULL(51, 12)
+
 /* PRI queue */
 #define PRIQ_ENT_SZ_SHIFT		4
 #define PRIQ_ENT_DWORDS			((1 << PRIQ_ENT_SZ_SHIFT) >> 3)
@@ -510,6 +536,13 @@ struct arm_smmu_cmdq_ent {
 			enum pri_resp		resp;
 		} pri;
 
+		#define CMDQ_OP_RESUME		0x44
+		struct {
+			u32			sid;
+			u16			stag;
+			u8			resp;
+		} resume;
+
 		#define CMDQ_OP_CMD_SYNC	0x46
 		struct {
 			u64			msiaddr;
@@ -545,6 +578,10 @@ struct arm_smmu_queue {
 
 	u32 __iomem			*prod_reg;
 	u32 __iomem			*cons_reg;
+
+	/* Event and PRI */
+	u64				batch;
+	wait_queue_head_t		wq;
 };
 
 struct arm_smmu_queue_poll {
@@ -568,6 +605,7 @@ struct arm_smmu_cmdq_batch {
 
 struct arm_smmu_evtq {
 	struct arm_smmu_queue		q;
+	struct iopf_queue		*iopf;
 	u32				max_stalls;
 };
 
@@ -704,6 +742,7 @@ struct arm_smmu_master {
 	struct arm_smmu_stream		*streams;
 	unsigned int			num_streams;
 	bool				ats_enabled;
+	bool				stall_enabled;
 	unsigned int			ssid_bits;
 };
 
@@ -721,6 +760,7 @@ struct arm_smmu_domain {
 
 	struct io_pgtable_ops		*pgtbl_ops;
 	bool				non_strict;
+	bool				stall_enabled;
 	atomic_t			nr_ats_masters;
 
 	enum arm_smmu_domain_stage	stage;
@@ -985,6 +1025,11 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
 		}
 		cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp);
 		break;
+	case CMDQ_OP_RESUME:
+		cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_SID, ent->resume.sid);
+		cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_RESP, ent->resume.resp);
+		cmd[1] |= FIELD_PREP(CMDQ_RESUME_1_STAG, ent->resume.stag);
+		break;
 	case CMDQ_OP_CMD_SYNC:
 		if (ent->sync.msiaddr) {
 			cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_IRQ);
@@ -1551,6 +1596,45 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
 	return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, true);
 }
 
+static int arm_smmu_page_response(struct device *dev,
+				  struct iommu_fault_event *unused,
+				  struct iommu_page_response *resp)
+{
+	struct arm_smmu_cmdq_ent cmd = {0};
+	struct arm_smmu_master *master = dev_iommu_fwspec_get(dev)->iommu_priv;
+	int sid = master->streams[0].id;
+
+	if (master->stall_enabled) {
+		cmd.opcode		= CMDQ_OP_RESUME;
+		cmd.resume.sid		= sid;
+		cmd.resume.stag		= resp->grpid;
+		switch (resp->code) {
+		case IOMMU_PAGE_RESP_INVALID:
+		case IOMMU_PAGE_RESP_FAILURE:
+			cmd.resume.resp = CMDQ_RESUME_0_RESP_ABORT;
+			break;
+		case IOMMU_PAGE_RESP_SUCCESS:
+			cmd.resume.resp = CMDQ_RESUME_0_RESP_RETRY;
+			break;
+		default:
+			return -EINVAL;
+		}
+	} else {
+		/* TODO: insert PRI response here */
+		return -ENODEV;
+	}
+
+	arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
+	/*
+	 * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP.
+	 * RESUME consumption guarantees that the stalled transaction will be
+	 * terminated... at some point in the future. PRI_RESP is fire and
+	 * forget.
+	 */
+
+	return 0;
+}
+
 /* Context descriptor manipulation functions */
 static void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 {
@@ -1709,8 +1793,7 @@ static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain,
 			FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) |
 			CTXDESC_CD_0_V;
 
-		/* STALL_MODEL==0b10 && CD.S==0 is ILLEGAL */
-		if (smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
+		if (smmu_domain->stall_enabled)
 			val |= CTXDESC_CD_0_S;
 	}
 
@@ -2133,7 +2216,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 FIELD_PREP(STRTAB_STE_1_STRW, strw));
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
-		   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
+		    !master->stall_enabled)
 			dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
 		val |= (s1_cfg->cdcfg.cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
@@ -2210,7 +2293,6 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 	return 0;
 }
 
-__maybe_unused
 static struct arm_smmu_master *
 arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
 {
@@ -2237,21 +2319,119 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
 }
 
 /* IRQ and event handlers */
+static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt)
+{
+	int ret;
+	u32 perm = 0;
+	struct arm_smmu_master *master;
+	bool ssid_valid = evt[0] & EVTQ_0_SSV;
+	u8 type = FIELD_GET(EVTQ_0_ID, evt[0]);
+	u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]);
+	struct iommu_fault_event fault_evt = { };
+	struct iommu_fault *flt = &fault_evt.fault;
+
+	/* Stage-2 is always pinned at the moment */
+	if (evt[1] & EVTQ_1_S2)
+		return -EFAULT;
+
+	master = arm_smmu_find_master(smmu, sid);
+	if (!master)
+		return -EINVAL;
+
+	if (evt[1] & EVTQ_1_READ)
+		perm |= IOMMU_FAULT_PERM_READ;
+	else
+		perm |= IOMMU_FAULT_PERM_WRITE;
+
+	if (evt[1] & EVTQ_1_EXEC)
+		perm |= IOMMU_FAULT_PERM_EXEC;
+
+	if (evt[1] & EVTQ_1_PRIV)
+		perm |= IOMMU_FAULT_PERM_PRIV;
+
+	if (evt[1] & EVTQ_1_STALL) {
+		flt->type = IOMMU_FAULT_PAGE_REQ;
+		flt->prm = (struct iommu_fault_page_request) {
+			.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
+			.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]),
+			.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
+			.perm = perm,
+			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
+		};
+
+		if (ssid_valid)
+			flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
+	} else {
+		flt->type = IOMMU_FAULT_DMA_UNRECOV;
+		flt->event = (struct iommu_fault_unrecoverable) {
+			.flags = IOMMU_FAULT_UNRECOV_ADDR_VALID |
+				 IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID,
+			.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]),
+			.perm = perm,
+			.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
+			.fetch_addr = FIELD_GET(EVTQ_3_IPA, evt[3]),
+		};
+
+		if (ssid_valid)
+			flt->event.flags |= IOMMU_FAULT_UNRECOV_PASID_VALID;
+
+		switch (type) {
+		case EVT_ID_TRANSLATION_FAULT:
+		case EVT_ID_ADDR_SIZE_FAULT:
+		case EVT_ID_ACCESS_FAULT:
+			flt->event.reason = IOMMU_FAULT_REASON_PTE_FETCH;
+			break;
+		case EVT_ID_PERMISSION_FAULT:
+			flt->event.reason = IOMMU_FAULT_REASON_PERMISSION;
+			break;
+		default:
+			/* TODO: report other unrecoverable faults. */
+			return -EFAULT;
+		}
+	}
+
+	ret = iommu_report_device_fault(master->dev, &fault_evt);
+	if (ret && flt->type == IOMMU_FAULT_PAGE_REQ) {
+		/* Nobody cared, abort the access */
+		struct iommu_page_response resp = {
+			.pasid		= flt->prm.pasid,
+			.grpid		= flt->prm.grpid,
+			.code		= IOMMU_PAGE_RESP_FAILURE,
+		};
+		arm_smmu_page_response(master->dev, NULL, &resp);
+	}
+
+	return ret;
+}
+
 static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 {
-	int i;
+	int i, ret;
+	int num_handled = 0;
 	struct arm_smmu_device *smmu = dev;
 	struct arm_smmu_queue *q = &smmu->evtq.q;
 	struct arm_smmu_ll_queue *llq = &q->llq;
+	size_t queue_size = 1 << llq->max_n_shift;
 	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 	u64 evt[EVTQ_ENT_DWORDS];
 
+	spin_lock(&q->wq.lock);
 	do {
 		while (!queue_remove_raw(q, evt)) {
 			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
 
-			if (__ratelimit(&rs)) {
+			spin_unlock(&q->wq.lock);
+			ret = arm_smmu_handle_evt(smmu, evt);
+			spin_lock(&q->wq.lock);
+
+			if (++num_handled == queue_size) {
+				q->batch++;
+				wake_up_all_locked(&q->wq);
+				num_handled = 0;
+			}
+
+			if (ret && __ratelimit(&rs)) {
 				dev_info(smmu->dev, "event 0x%02x received:\n", id);
 				for (i = 0; i < ARRAY_SIZE(evt); ++i)
 					dev_info(smmu->dev, "\t0x%016llx\n",
@@ -2270,6 +2450,11 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 	/* Sync our overflow flag, as we believe we're up to speed */
 	llq->cons = Q_OVF(llq->prod) | Q_WRP(llq, llq->cons) |
 		    Q_IDX(llq, llq->cons);
+	queue_sync_cons_out(q);
+
+	wake_up_all_locked(&q->wq);
+	spin_unlock(&q->wq.lock);
+
 	return IRQ_HANDLED;
 }
 
@@ -2333,6 +2518,33 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
 	return IRQ_HANDLED;
 }
 
+/*
+ * arm_smmu_flush_evtq - wait until all events currently in the queue have been
+ *                       consumed.
+ *
+ * Wait until the evtq thread has finished a batch, or until the queue is empty.
+ * Note that we don't handle overflows on q->batch. If it occurs, just wait for
+ * the queue to be empty.
+ */
+static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid)
+{
+	int ret;
+	u64 batch;
+	struct arm_smmu_device *smmu = cookie;
+	struct arm_smmu_queue *q = &smmu->evtq.q;
+
+	spin_lock(&q->wq.lock);
+	if (queue_sync_prod_in(q) == -EOVERFLOW)
+		dev_err(smmu->dev, "evtq overflow detected -- requests lost\n");
+
+	batch = q->batch;
+	ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) ||
+					      q->batch >= batch + 2);
+	spin_unlock(&q->wq.lock);
+
+	return ret;
+}
+
 static int arm_smmu_device_disable(struct arm_smmu_device *smmu);
 
 static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
@@ -2724,6 +2936,9 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 
 	cfg->s1cdmax = master->ssid_bits;
 
+	if (master->stall_enabled)
+		smmu_domain->stall_enabled = true;
+
 	ret = arm_smmu_alloc_cd_tables(smmu_domain);
 	if (ret)
 		goto out_free_asid;
@@ -3056,6 +3271,10 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			smmu_domain->s1_cfg.s1cdmax, master->ssid_bits);
 		ret = -EINVAL;
 		goto out_unlock;
+	} else if (smmu_domain->stall_enabled && !master->stall_enabled) {
+		dev_err(dev, "cannot attach to stall-enabled domain\n");
+		ret = -EINVAL;
+		goto out_unlock;
 	}
 
 	master->domain = smmu_domain;
@@ -3380,6 +3599,10 @@ static int arm_smmu_add_device(struct device *dev)
 		master->ssid_bits = min_t(u8, master->ssid_bits,
 					  CTXDESC_LINEAR_CDMAX);
 
+	if ((smmu->features & ARM_SMMU_FEAT_STALLS && fwspec->can_stall) ||
+	    smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
+		master->stall_enabled = true;
+
 	ret = iommu_device_link(&smmu->iommu, dev);
 	if (ret)
 		goto err_disable_pasid;
@@ -3415,6 +3638,7 @@ static void arm_smmu_remove_device(struct device *dev)
 
 	master = fwspec->iommu_priv;
 	smmu = master->smmu;
+	iopf_queue_remove_device(smmu->evtq.iopf, dev);
 	iommu_sva_disable(dev);
 	arm_smmu_detach_dev(master);
 	iommu_group_remove_device(dev);
@@ -3538,7 +3762,7 @@ static void arm_smmu_get_resv_regions(struct device *dev,
 
 static bool arm_smmu_iopf_supported(struct arm_smmu_master *master)
 {
-	return false;
+	return master->stall_enabled;
 }
 
 static bool arm_smmu_dev_has_feature(struct device *dev,
@@ -3579,6 +3803,7 @@ static bool arm_smmu_dev_feature_enabled(struct device *dev,
 
 static int arm_smmu_dev_enable_sva(struct device *dev)
 {
+	int ret;
 	struct arm_smmu_master *master = dev_to_master(dev);
 	struct iommu_sva_param param = {
 		.min_pasid = 1,
@@ -3586,7 +3811,21 @@ static int arm_smmu_dev_enable_sva(struct device *dev)
 	};
 
 	param.max_pasid = min(param.max_pasid, (1U << master->ssid_bits) - 1);
-	return iommu_sva_enable(dev, &param);
+
+	ret = iommu_sva_enable(dev, &param);
+	if (ret)
+		return ret;
+
+	if (master->stall_enabled) {
+		ret = iopf_queue_add_device(master->smmu->evtq.iopf, dev);
+		if (ret)
+			goto err_disable_sva;
+	}
+	return 0;
+
+err_disable_sva:
+	iommu_sva_disable(dev);
+	return ret;
 }
 
 static int arm_smmu_dev_enable_feature(struct device *dev,
@@ -3609,11 +3848,14 @@ static int arm_smmu_dev_enable_feature(struct device *dev,
 static int arm_smmu_dev_disable_feature(struct device *dev,
 					enum iommu_dev_features feat)
 {
+	struct arm_smmu_master *master = dev_to_master(dev);
+
 	if (!arm_smmu_dev_feature_enabled(dev, feat))
 		return -EINVAL;
 
 	switch (feat) {
 	case IOMMU_DEV_FEAT_SVA:
+		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
 		return iommu_sva_disable(dev);
 	default:
 		return -EINVAL;
@@ -3645,6 +3887,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.sva_bind		= arm_smmu_sva_bind,
 	.sva_unbind		= iommu_sva_unbind_generic,
 	.sva_get_pasid		= iommu_sva_get_pasid_generic,
+	.page_response		= arm_smmu_page_response,
 	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
 };
 
@@ -3688,6 +3931,10 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
 
 	q->llq.prod = q->llq.cons = 0;
+
+	init_waitqueue_head(&q->wq);
+	q->batch = 0;
+
 	return 0;
 }
 
@@ -3741,6 +3988,13 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
 	if (ret)
 		return ret;
 
+	if (smmu->features & ARM_SMMU_FEAT_STALLS) {
+		smmu->evtq.iopf = iopf_queue_alloc(dev_name(smmu->dev),
+						   arm_smmu_flush_evtq, smmu);
+		if (!smmu->evtq.iopf)
+			return -ENOMEM;
+	}
+
 	/* priq */
 	if (!(smmu->features & ARM_SMMU_FEAT_PRI))
 		return 0;
@@ -4716,6 +4970,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
 	iommu_device_unregister(&smmu->iommu);
 	iommu_device_sysfs_remove(&smmu->iommu);
 	arm_smmu_device_disable(smmu);
+	iopf_queue_free(smmu->evtq.iopf);
 
 	return 0;
 }
diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index 20738aacac89..dd7017750954 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -205,9 +205,12 @@ const struct iommu_ops *of_iommu_configure(struct device *dev,
 		}
 
 		fwspec = dev_iommu_fwspec_get(dev);
-		if (!err && fwspec)
+		if (!err && fwspec) {
 			of_property_read_u32(master_np, "pasid-num-bits",
 					     &fwspec->num_pasid_bits);
+			fwspec->can_stall = of_property_read_bool(master_np,
+								  "dma-can-stall");
+		}
 	}
 
 	/*
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e52a8731e7a9..b39dae6608c5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -595,6 +595,7 @@ struct iommu_group *fsl_mc_device_group(struct device *dev);
  * @iommu_fwnode: firmware handle for this device's IOMMU
  * @iommu_priv: IOMMU driver private data for this device
  * @num_pasid_bits: number of PASID bits supported by this device
+ * @can_stall: the device is allowed to stall
  * @num_ids: number of associated device IDs
  * @ids: IDs which this device may present to the IOMMU
  */
@@ -603,6 +604,7 @@ struct iommu_fwspec {
 	struct fwnode_handle	*iommu_fwnode;
 	void			*iommu_priv;
 	u32			num_pasid_bits;
+	bool			can_stall;
 	unsigned int		num_ids;
 	u32			ids[1];
 };
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 24/26] PCI/ATS: Add PRI stubs
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (22 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices Jean-Philippe Brucker
@ 2020-02-24 18:23 ` Jean-Philippe Brucker
  2020-02-27 20:55   ` Bjorn Helgaas
  2020-02-24 18:24 ` [PATCH v4 25/26] PCI/ATS: Export symbols of PRI functions Jean-Philippe Brucker
                   ` (2 subsequent siblings)
  26 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:23 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Bjorn Helgaas

The SMMUv3 driver, which can be built without CONFIG_PCI, will soon gain
support for PRI.  Partially revert commit c6e9aefbf9db ("PCI/ATS: Remove
unused PRI and PASID stubs") to re-introduce the PRI stubs, and avoid
adding more #ifdefs to the SMMU driver.
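
As a small illustration (the caller is hypothetical, and 32 is an
arbitrary number of outstanding page requests), the stubs let shared code
call the PRI helpers without guarding them:

    /* Compiles with or without CONFIG_PCI_PRI; the stub returns -ENODEV */
    static int example_enable_pri(struct pci_dev *pdev)
    {
            return pci_enable_pri(pdev, 32);
    }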

Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 include/linux/pci-ats.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
index f75c307f346d..e9e266df9b37 100644
--- a/include/linux/pci-ats.h
+++ b/include/linux/pci-ats.h
@@ -28,6 +28,14 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs);
 void pci_disable_pri(struct pci_dev *pdev);
 int pci_reset_pri(struct pci_dev *pdev);
 int pci_prg_resp_pasid_required(struct pci_dev *pdev);
+#else /* CONFIG_PCI_PRI */
+static inline int pci_enable_pri(struct pci_dev *pdev, u32 reqs)
+{ return -ENODEV; }
+static inline void pci_disable_pri(struct pci_dev *pdev) { }
+static inline int pci_reset_pri(struct pci_dev *pdev)
+{ return -ENODEV; }
+static inline int pci_prg_resp_pasid_required(struct pci_dev *pdev)
+{ return 0; }
 #endif /* CONFIG_PCI_PRI */
 
 #ifdef CONFIG_PCI_PASID
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 25/26] PCI/ATS: Export symbols of PRI functions
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (23 preceding siblings ...)
  2020-02-24 18:23 ` [PATCH v4 24/26] PCI/ATS: Add PRI stubs Jean-Philippe Brucker
@ 2020-02-24 18:24 ` Jean-Philippe Brucker
  2020-02-27 20:55   ` Bjorn Helgaas
  2020-02-24 18:24 ` [PATCH v4 26/26] iommu/arm-smmu-v3: Add support for PRI Jean-Philippe Brucker
  2020-02-27 18:22 ` [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jonathan Cameron
  26 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:24 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Bjorn Helgaas

The SMMUv3 driver uses pci_{enable,disable}_pri() and related
functions. Export those functions to allow the driver to be built as a
module.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/pci/ats.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index bbfd0d42b8b9..fc8fc6fc8bd5 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -197,6 +197,7 @@ void pci_pri_init(struct pci_dev *pdev)
 	if (status & PCI_PRI_STATUS_PASID)
 		pdev->pasid_required = 1;
 }
+EXPORT_SYMBOL_GPL(pci_pri_init);
 
 /**
  * pci_enable_pri - Enable PRI capability
@@ -243,6 +244,7 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(pci_enable_pri);
 
 /**
  * pci_disable_pri - Disable PRI capability
@@ -322,6 +324,7 @@ int pci_reset_pri(struct pci_dev *pdev)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(pci_reset_pri);
 
 /**
  * pci_prg_resp_pasid_required - Return PRG Response PASID Required bit
@@ -337,6 +340,7 @@ int pci_prg_resp_pasid_required(struct pci_dev *pdev)
 
 	return pdev->pasid_required;
 }
+EXPORT_SYMBOL_GPL(pci_prg_resp_pasid_required);
 #endif /* CONFIG_PCI_PRI */
 
 #ifdef CONFIG_PCI_PASID
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v4 26/26] iommu/arm-smmu-v3: Add support for PRI
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (24 preceding siblings ...)
  2020-02-24 18:24 ` [PATCH v4 25/26] PCI/ATS: Export symbols of PRI functions Jean-Philippe Brucker
@ 2020-02-24 18:24 ` Jean-Philippe Brucker
  2020-02-27 18:22 ` [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jonathan Cameron
  26 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-24 18:24 UTC (permalink / raw)
  To: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm
  Cc: joro, robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

For PCI devices that support it, enable the PRI capability and handle PRI
Page Requests with the generic fault handler. PRI is enabled on demand by
iommu_sva_device_init().

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 278 +++++++++++++++++++++++++++++-------
 1 file changed, 228 insertions(+), 50 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index da5dda5ba26a..f9732e397b2d 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -248,6 +248,7 @@
 #define STRTAB_STE_1_S1COR		GENMASK_ULL(5, 4)
 #define STRTAB_STE_1_S1CSH		GENMASK_ULL(7, 6)
 
+#define STRTAB_STE_1_PPAR		(1UL << 18)
 #define STRTAB_STE_1_S1STALLD		(1UL << 27)
 
 #define STRTAB_STE_1_EATS		GENMASK_ULL(29, 28)
@@ -373,6 +374,9 @@
 #define CMDQ_PRI_0_SID			GENMASK_ULL(63, 32)
 #define CMDQ_PRI_1_GRPID		GENMASK_ULL(8, 0)
 #define CMDQ_PRI_1_RESP			GENMASK_ULL(13, 12)
+#define CMDQ_PRI_1_RESP_FAILURE		0UL
+#define CMDQ_PRI_1_RESP_INVALID		1UL
+#define CMDQ_PRI_1_RESP_SUCCESS		2UL
 
 #define CMDQ_RESUME_0_SID		GENMASK_ULL(63, 32)
 #define CMDQ_RESUME_0_RESP_TERM		0UL
@@ -445,12 +449,6 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
 MODULE_PARM_DESC(disable_bypass,
 	"Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");
 
-enum pri_resp {
-	PRI_RESP_DENY = 0,
-	PRI_RESP_FAIL = 1,
-	PRI_RESP_SUCC = 2,
-};
-
 enum arm_smmu_msi_index {
 	EVTQ_MSI_INDEX,
 	GERROR_MSI_INDEX,
@@ -533,7 +531,7 @@ struct arm_smmu_cmdq_ent {
 			u32			sid;
 			u32			ssid;
 			u16			grpid;
-			enum pri_resp		resp;
+			u8			resp;
 		} pri;
 
 		#define CMDQ_OP_RESUME		0x44
@@ -611,6 +609,7 @@ struct arm_smmu_evtq {
 
 struct arm_smmu_priq {
 	struct arm_smmu_queue		q;
+	struct iopf_queue		*iopf;
 };
 
 /* High-level stream table and context descriptor structures */
@@ -743,6 +742,8 @@ struct arm_smmu_master {
 	unsigned int			num_streams;
 	bool				ats_enabled;
 	bool				stall_enabled;
+	bool				pri_supported;
+	bool				prg_resp_needs_ssid;
 	unsigned int			ssid_bits;
 };
 
@@ -1015,14 +1016,6 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
 		cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SSID, ent->pri.ssid);
 		cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SID, ent->pri.sid);
 		cmd[1] |= FIELD_PREP(CMDQ_PRI_1_GRPID, ent->pri.grpid);
-		switch (ent->pri.resp) {
-		case PRI_RESP_DENY:
-		case PRI_RESP_FAIL:
-		case PRI_RESP_SUCC:
-			break;
-		default:
-			return -EINVAL;
-		}
 		cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp);
 		break;
 	case CMDQ_OP_RESUME:
@@ -1602,6 +1595,7 @@ static int arm_smmu_page_response(struct device *dev,
 {
 	struct arm_smmu_cmdq_ent cmd = {0};
 	struct arm_smmu_master *master = dev_iommu_fwspec_get(dev)->iommu_priv;
+	bool pasid_valid = resp->flags & IOMMU_PAGE_RESP_PASID_VALID;
 	int sid = master->streams[0].id;
 
 	if (master->stall_enabled) {
@@ -1619,8 +1613,27 @@ static int arm_smmu_page_response(struct device *dev,
 		default:
 			return -EINVAL;
 		}
+	} else if (master->pri_supported) {
+		cmd.opcode		= CMDQ_OP_PRI_RESP;
+		cmd.substream_valid	= pasid_valid &&
+					  master->prg_resp_needs_ssid;
+		cmd.pri.sid		= sid;
+		cmd.pri.ssid		= resp->pasid;
+		cmd.pri.grpid		= resp->grpid;
+		switch (resp->code) {
+		case IOMMU_PAGE_RESP_FAILURE:
+			cmd.pri.resp = CMDQ_PRI_1_RESP_FAILURE;
+			break;
+		case IOMMU_PAGE_RESP_INVALID:
+			cmd.pri.resp = CMDQ_PRI_1_RESP_INVALID;
+			break;
+		case IOMMU_PAGE_RESP_SUCCESS:
+			cmd.pri.resp = CMDQ_PRI_1_RESP_SUCCESS;
+			break;
+		default:
+			return -EINVAL;
+		}
 	} else {
-		/* TODO: insert PRI response here */
 		return -ENODEV;
 	}
 
@@ -2215,6 +2228,9 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
 			 FIELD_PREP(STRTAB_STE_1_STRW, strw));
 
+		if (master->prg_resp_needs_ssid)
+			dst[1] |= STRTAB_STE_1_PPAR;
+
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		    !master->stall_enabled)
 			dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
@@ -2460,61 +2476,110 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 
 static void arm_smmu_handle_ppr(struct arm_smmu_device *smmu, u64 *evt)
 {
-	u32 sid, ssid;
-	u16 grpid;
-	bool ssv, last;
-
-	sid = FIELD_GET(PRIQ_0_SID, evt[0]);
-	ssv = FIELD_GET(PRIQ_0_SSID_V, evt[0]);
-	ssid = ssv ? FIELD_GET(PRIQ_0_SSID, evt[0]) : 0;
-	last = FIELD_GET(PRIQ_0_PRG_LAST, evt[0]);
-	grpid = FIELD_GET(PRIQ_1_PRG_IDX, evt[1]);
-
-	dev_info(smmu->dev, "unexpected PRI request received:\n");
-	dev_info(smmu->dev,
-		 "\tsid 0x%08x.0x%05x: [%u%s] %sprivileged %s%s%s access at iova 0x%016llx\n",
-		 sid, ssid, grpid, last ? "L" : "",
-		 evt[0] & PRIQ_0_PERM_PRIV ? "" : "un",
-		 evt[0] & PRIQ_0_PERM_READ ? "R" : "",
-		 evt[0] & PRIQ_0_PERM_WRITE ? "W" : "",
-		 evt[0] & PRIQ_0_PERM_EXEC ? "X" : "",
-		 evt[1] & PRIQ_1_ADDR_MASK);
-
-	if (last) {
-		struct arm_smmu_cmdq_ent cmd = {
-			.opcode			= CMDQ_OP_PRI_RESP,
-			.substream_valid	= ssv,
-			.pri			= {
-				.sid	= sid,
-				.ssid	= ssid,
-				.grpid	= grpid,
-				.resp	= PRI_RESP_DENY,
-			},
+	u32 sid = FIELD_GET(PRIQ_0_SID, evt[0]);
+
+	bool pasid_valid, last;
+	struct arm_smmu_master *master;
+	struct iommu_fault_event fault_evt = {
+		.fault.type = IOMMU_FAULT_PAGE_REQ,
+		.fault.prm = {
+			.pasid		= FIELD_GET(PRIQ_0_SSID, evt[0]),
+			.grpid		= FIELD_GET(PRIQ_1_PRG_IDX, evt[1]),
+			.addr		= evt[1] & PRIQ_1_ADDR_MASK,
+		},
+	};
+	struct iommu_fault_page_request *pr = &fault_evt.fault.prm;
+
+	pasid_valid = evt[0] & PRIQ_0_SSID_V;
+	last = evt[0] & PRIQ_0_PRG_LAST;
+
+	/* Discard Stop PASID marker, it isn't used */
+	if (!(evt[0] & (PRIQ_0_PERM_READ | PRIQ_0_PERM_WRITE)) && last)
+		return;
+
+	if (last)
+		pr->flags |= IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
+	if (pasid_valid)
+		pr->flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
+	if (evt[0] & PRIQ_0_PERM_READ)
+		pr->perm |= IOMMU_FAULT_PERM_READ;
+	if (evt[0] & PRIQ_0_PERM_WRITE)
+		pr->perm |= IOMMU_FAULT_PERM_WRITE;
+	if (evt[0] & PRIQ_0_PERM_EXEC)
+		pr->perm |= IOMMU_FAULT_PERM_EXEC;
+	if (evt[0] & PRIQ_0_PERM_PRIV)
+		pr->perm |= IOMMU_FAULT_PERM_PRIV;
+
+	master = arm_smmu_find_master(smmu, sid);
+	if (WARN_ON(!master))
+		return;
+
+	if (iommu_report_device_fault(master->dev, &fault_evt)) {
+		/*
+		 * No handler registered, so subsequent faults won't produce
+		 * better results. Try to disable PRI.
+		 */
+		struct iommu_page_response resp = {
+			.flags		= pasid_valid ?
+				          IOMMU_PAGE_RESP_PASID_VALID : 0,
+			.pasid		= pr->pasid,
+			.grpid		= pr->grpid,
+			.code		= IOMMU_PAGE_RESP_FAILURE,
 		};
 
-		arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+		dev_warn(master->dev,
+			 "PPR 0x%x:0x%llx 0x%x: nobody cared, disabling PRI\n",
+			 pasid_valid ? pr->pasid : 0, pr->addr, pr->perm);
+		if (last)
+			arm_smmu_page_response(master->dev, NULL, &resp);
 	}
 }
 
 static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
 {
+	int num_handled = 0;
+	bool overflow = false;
 	struct arm_smmu_device *smmu = dev;
 	struct arm_smmu_queue *q = &smmu->priq.q;
 	struct arm_smmu_ll_queue *llq = &q->llq;
+	size_t queue_size = 1 << llq->max_n_shift;
 	u64 evt[PRIQ_ENT_DWORDS];
 
+	spin_lock(&q->wq.lock);
 	do {
-		while (!queue_remove_raw(q, evt))
+		while (!queue_remove_raw(q, evt)) {
+			spin_unlock(&q->wq.lock);
 			arm_smmu_handle_ppr(smmu, evt);
+			spin_lock(&q->wq.lock);
+			if (++num_handled == queue_size) {
+				q->batch++;
+				wake_up_all_locked(&q->wq);
+				num_handled = 0;
+			}
+		}
 
-		if (queue_sync_prod_in(q) == -EOVERFLOW)
+		if (queue_sync_prod_in(q) == -EOVERFLOW) {
 			dev_err(smmu->dev, "PRIQ overflow detected -- requests lost\n");
+			overflow = true;
+		}
 	} while (!queue_empty(llq));
 
 	/* Sync our overflow flag, as we believe we're up to speed */
 	llq->cons = Q_OVF(llq->prod) | Q_WRP(llq, llq->cons) |
 		      Q_IDX(llq, llq->cons);
 	queue_sync_cons_out(q);
+
+	wake_up_all_locked(&q->wq);
+	spin_unlock(&q->wq.lock);
+
+	/*
+	 * On overflow, the SMMU might have discarded the last PPR in a group.
+	 * There is no way to know more about it, so we have to discard all
+	 * partial faults already queued.
+	 */
+	if (overflow)
+		iopf_queue_discard_partial(smmu->priq.iopf);
+
 	return IRQ_HANDLED;
 }
 
@@ -2545,6 +2610,30 @@ static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid)
 	return ret;
 }
 
+static int arm_smmu_flush_priq(void *cookie, struct device *dev, int pasid)
+{
+	int ret;
+	u64 batch;
+	bool overflow = false;
+	struct arm_smmu_device *smmu = cookie;
+	struct arm_smmu_queue *q = &smmu->priq.q;
+
+	spin_lock(&q->wq.lock);
+	if (queue_sync_prod_in(q) == -EOVERFLOW) {
+		dev_err(smmu->dev, "priq overflow detected -- requests lost\n");
+		overflow = true;
+	}
+
+	batch = q->batch;
+	ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) ||
+					      q->batch >= batch + 2);
+	spin_unlock(&q->wq.lock);
+
+	if (overflow)
+		iopf_queue_discard_partial(smmu->priq.iopf);
+	return ret;
+}
+
 static int arm_smmu_device_disable(struct arm_smmu_device *smmu);
 
 static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
@@ -3208,6 +3297,75 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 	pci_disable_pasid(pdev);
 }
 
+static int arm_smmu_init_pri(struct arm_smmu_master *master)
+{
+	int pos;
+	struct pci_dev *pdev;
+
+	if (!dev_is_pci(master->dev))
+		return -EINVAL;
+
+	if (!(master->smmu->features & ARM_SMMU_FEAT_PRI))
+		return 0;
+
+	pdev = to_pci_dev(master->dev);
+	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI);
+	if (!pos)
+		return 0;
+
+	/* If the device supports PASID and PRI, set STE.PPAR */
+	if (master->ssid_bits)
+		master->prg_resp_needs_ssid = pci_prg_resp_pasid_required(pdev);
+
+	master->pri_supported = true;
+	return 0;
+}
+
+static int arm_smmu_enable_pri(struct arm_smmu_master *master)
+{
+	int ret;
+	struct pci_dev *pdev;
+	/*
+	 * TODO: find a good inflight PPR number. We should divide the PRI queue
+	 * by the number of PRI-capable devices, but it's impossible to know
+	 * about future (probed late or hotplugged) devices. So we're at risk of
+	 * dropping PPRs (and leaking pending requests in the FQ).
+	 */
+	size_t max_inflight_pprs = 16;
+
+	if (!master->pri_supported || !master->ats_enabled)
+		return -ENOSYS;
+
+	pdev = to_pci_dev(master->dev);
+
+	ret = pci_reset_pri(pdev);
+	if (ret)
+		return ret;
+
+	ret = pci_enable_pri(pdev, max_inflight_pprs);
+	if (ret) {
+		dev_err(master->dev, "cannot enable PRI: %d\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void arm_smmu_disable_pri(struct arm_smmu_master *master)
+{
+	struct pci_dev *pdev;
+
+	if (!dev_is_pci(master->dev))
+		return;
+
+	pdev = to_pci_dev(master->dev);
+
+	if (!pdev->pri_enabled)
+		return;
+
+	pci_disable_pri(pdev);
+}
+
 static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 {
 	unsigned long flags;
@@ -3603,6 +3761,8 @@ static int arm_smmu_add_device(struct device *dev)
 	    smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
 		master->stall_enabled = true;
 
+	arm_smmu_init_pri(master);
+
 	ret = iommu_device_link(&smmu->iommu, dev);
 	if (ret)
 		goto err_disable_pasid;
@@ -3639,6 +3799,7 @@ static void arm_smmu_remove_device(struct device *dev)
 	master = fwspec->iommu_priv;
 	smmu = master->smmu;
 	iopf_queue_remove_device(smmu->evtq.iopf, dev);
+	iopf_queue_remove_device(smmu->priq.iopf, dev);
 	iommu_sva_disable(dev);
 	arm_smmu_detach_dev(master);
 	iommu_group_remove_device(dev);
@@ -3762,7 +3923,7 @@ static void arm_smmu_get_resv_regions(struct device *dev,
 
 static bool arm_smmu_iopf_supported(struct arm_smmu_master *master)
 {
-	return master->stall_enabled;
+	return master->stall_enabled || master->pri_supported;
 }
 
 static bool arm_smmu_dev_has_feature(struct device *dev,
@@ -3820,6 +3981,15 @@ static int arm_smmu_dev_enable_sva(struct device *dev)
 		ret = iopf_queue_add_device(master->smmu->evtq.iopf, dev);
 		if (ret)
 			goto err_disable_sva;
+	} else if (master->pri_supported) {
+		ret = iopf_queue_add_device(master->smmu->priq.iopf, dev);
+		if (ret)
+			goto err_disable_sva;
+
+		if (arm_smmu_enable_pri(master)) {
+			iopf_queue_remove_device(master->smmu->priq.iopf, dev);
+			goto err_disable_sva;
+		}
 	}
 	return 0;
 
@@ -3855,7 +4025,9 @@ static int arm_smmu_dev_disable_feature(struct device *dev,
 
 	switch (feat) {
 	case IOMMU_DEV_FEAT_SVA:
+		arm_smmu_disable_pri(master);
 		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
+		iopf_queue_remove_device(master->smmu->priq.iopf, dev);
 		return iommu_sva_disable(dev);
 	default:
 		return -EINVAL;
@@ -3999,6 +4171,11 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
 	if (!(smmu->features & ARM_SMMU_FEAT_PRI))
 		return 0;
 
+	smmu->priq.iopf = iopf_queue_alloc(dev_name(smmu->dev),
+					   arm_smmu_flush_priq, smmu);
+	if (!smmu->priq.iopf)
+		return -ENOMEM;
+
 	return arm_smmu_init_one_queue(smmu, &smmu->priq.q, ARM_SMMU_PRIQ_PROD,
 				       ARM_SMMU_PRIQ_CONS, PRIQ_ENT_DWORDS,
 				       "priq");
@@ -4971,6 +5148,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
 	iommu_device_sysfs_remove(&smmu->iommu);
 	arm_smmu_device_disable(smmu);
 	iopf_queue_free(smmu->evtq.iopf);
+	iopf_queue_free(smmu->priq.iopf);
 
 	return 0;
 }
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-24 18:23 ` [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier() Jean-Philippe Brucker
@ 2020-02-24 19:00   ` Jason Gunthorpe
  2020-02-25  9:24     ` Jean-Philippe Brucker
  2020-03-05 16:36   ` Christoph Hellwig
  1 sibling, 1 reply; 70+ messages in thread
From: Jason Gunthorpe @ 2020-02-24 19:00 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Andrew Morton,
	Arnd Bergmann, Dimitri Sivanich, Greg Kroah-Hartman

On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> add a get/put scheme for the registration") provides a convenient way
> for users to attach notifier data to an mm. However, it would be even
> better to create this notifier data atomically.
> 
> Since the alloc_notifier() callback only takes an mm argument at the
> moment, some users have to perform the allocation in two times.
> alloc_notifier() initially creates an incomplete structure, which is
> then finalized using more context once mmu_notifier_get() returns. This
> second step requires carrying an initialization lock in the notifier
> data and playing dirty tricks to order memory accesses against live
> invalidation.

This was the intended pattern. There shouldn't be a real issue, as
there shouldn't be any data on which to invalidate, i.e. the later patch
does:

+       list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)

And that list is empty post-allocation, so no 'dirty tricks' required.

The other op callback is release, which also cannot be called as the
caller must hold a mmget to establish the notifier.

So just use the locking that already exists. There is one function
that calls io_mm_get() which immediately calls io_mm_attach, which
immediately grabs the global iommu_sva_lock.

Thus init the pasid for the first time under that lock and everything
is fine.
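
Roughly, ignoring error handling and reusing the names from the later
patches (so treat the exact fields as illustrative):

	mn = mmu_notifier_get(&iommu_mmu_notifier_ops, mm);
	io_mm = to_io_mm(mn);

	mutex_lock(&iommu_sva_lock);
	if (!io_mm->ops) {
		/* First user of this mm: finish the init under the lock */
		io_mm->ops = ops;
		io_mm->ctx = ops->alloc(mm);
		io_mm->pasid = ioasid_alloc(&shared_pasid, min_pasid,
					    max_pasid, mm);
	}
	/* ... io_mm_attach() then publishes the bond on io_mm->devices ... */
	mutex_unlock(&iommu_sva_lock);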

There is nothing inherently wrong with the approach in this patch, but
it seems unneeded in this case.

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 03/26] iommu: Add a page fault handler
  2020-02-24 18:23 ` [PATCH v4 03/26] iommu: Add a page fault handler Jean-Philippe Brucker
@ 2020-02-25  3:30   ` Xu Zaibo
  2020-02-25  9:25     ` Jean-Philippe Brucker
  2020-02-26 13:59   ` Jonathan Cameron
  1 sibling, 1 reply; 70+ messages in thread
From: Xu Zaibo @ 2020-02-25  3:30 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, devicetree, linux-arm-kernel,
	linux-pci, linux-mm
  Cc: mark.rutland, kevin.tian, Jean-Philippe Brucker, catalin.marinas,
	robin.murphy, robh+dt, zhangfei.gao, will, christian.koenig

Hi,

On 2020/2/25 2:23, Jean-Philippe Brucker wrote:
> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>
> Some systems allow devices to handle I/O Page Faults in the core mm. For
> example systems implementing the PCI PRI extension or Arm SMMU stall
> model. Infrastructure for reporting these recoverable page faults was
> recently added to the IOMMU core. Add a page fault handler for host SVA.
>
> IOMMU driver can now instantiate several fault workqueues and link them to
> IOPF-capable devices. Drivers can choose between a single global
> workqueue, one per IOMMU device, one per low-level fault queue, one per
> domain, etc.
>
> When it receives a fault event, supposedly in an IRQ handler, the IOMMU
> driver reports the fault using iommu_report_device_fault(), which calls
> the registered handler. The page fault handler then calls the mm fault
> handler, and reports either success or failure with iommu_page_response().
> When the handler succeeded, the IOMMU retries the access.
>
> The iopf_param pointer could be embedded into iommu_fault_param. But
> putting iopf_param into the iommu_param structure allows us not to care
> about ordering between calls to iopf_queue_add_device() and
> iommu_register_device_fault_handler().
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>   drivers/iommu/Kconfig      |   4 +
>   drivers/iommu/Makefile     |   1 +
>   drivers/iommu/io-pgfault.c | 451 +++++++++++++++++++++++++++++++++++++
>   include/linux/iommu.h      |  59 +++++
>   4 files changed, 515 insertions(+)
>   create mode 100644 drivers/iommu/io-pgfault.c
>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index acca20e2da2f..e4a42e1708b4 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -109,6 +109,10 @@ config IOMMU_SVA
>   	select IOMMU_API
>   	select MMU_NOTIFIER
>   
> +config IOMMU_PAGE_FAULT
> +	bool
> +	select IOMMU_API
> +
>   config FSL_PAMU
>   	bool "Freescale IOMMU support"
>   	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 40c800dd4e3e..bf5cb4ee8409 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>   obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>   obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>   obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
> +obj-$(CONFIG_IOMMU_PAGE_FAULT) += io-pgfault.o
>   obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>   obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>   obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
> diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
> new file mode 100644
> index 000000000000..76e153c59fe3
> --- /dev/null
> +++ b/drivers/iommu/io-pgfault.c
> @@ -0,0 +1,451 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Handle device page faults
> + *
> + * Copyright (C) 2018 ARM Ltd.
> + */
> +
> +#include <linux/iommu.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/workqueue.h>
> +
> +/**
> + * struct iopf_queue - IO Page Fault queue
> + * @wq: the fault workqueue
> + * @flush: low-level flush callback
> + * @flush_arg: flush() argument
> + * @devices: devices attached to this queue
> + * @lock: protects the device list
> + */
> +struct iopf_queue {
> +	struct workqueue_struct		*wq;
> +	iopf_queue_flush_t		flush;
> +	void				*flush_arg;
> +	struct list_head		devices;
> +	struct mutex			lock;
> +};
> +
> +/**
> + * struct iopf_device_param - IO Page Fault data attached to a device
> + * @dev: the device that owns this param
> + * @queue: IOPF queue
> + * @queue_list: index into queue->devices
> + * @partial: faults that are part of a Page Request Group for which the last
> + *           request hasn't been submitted yet.
> + * @busy: the param is being used
> + * @wq_head: signal a change to @busy
> + */
> +struct iopf_device_param {
> +	struct device			*dev;
> +	struct iopf_queue		*queue;
> +	struct list_head		queue_list;
> +	struct list_head		partial;
> +	bool				busy;
> +	wait_queue_head_t		wq_head;
> +};
> +
> +struct iopf_fault {
> +	struct iommu_fault		fault;
> +	struct list_head		head;
> +};
> +
> +struct iopf_group {
> +	struct iopf_fault		last_fault;
> +	struct list_head		faults;
> +	struct work_struct		work;
> +	struct device			*dev;
> +};
> +
[...]
> +
> +/**
> + * iopf_queue_alloc - Allocate and initialize a fault queue
> + * @name: a unique string identifying the queue (for workqueue)
> + * @flush: a callback that flushes the low-level queue
> + * @cookie: driver-private data passed to the flush callback
> + *
> + * The callback is called before the workqueue is flushed. The IOMMU driver must
> + * commit all faults that are pending in its low-level queues at the time of the
> + * call, into the IOPF queue (with iommu_report_device_fault). The callback
> + * takes a device pointer as argument, hinting what endpoint is causing the
> + * flush. When the device is NULL, all faults should be committed.
> + *
> + * Return: the queue on success and NULL on error.
> + */
> +struct iopf_queue *
> +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie)
> +{
> +	struct iopf_queue *queue;
> +
> +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> +	if (!queue)
> +		return NULL;
> +
> +	/*
> +	 * The WQ is unordered because the low-level handler enqueues faults by
> +	 * group. PRI requests within a group have to be ordered, but once
> +	 * that's dealt with, the high-level function can handle groups out of
> +	 * order.
> +	 */
> +	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
Should this workqueue use 'WQ_HIGHPRI | WQ_UNBOUND' or similar flags to
decrease the unexpected latency of I/O page faults here? Otherwise the
workqueue may show uncontrolled latency, even on a busy system.

Cheers,
Zaibo

> +	if (!queue->wq) {
> +		kfree(queue);
> +		return NULL;
> +	}
> +
> +	queue->flush = flush;
> +	queue->flush_arg = cookie;
> +	INIT_LIST_HEAD(&queue->devices);
> +	mutex_init(&queue->lock);
> +
> +	return queue;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_alloc);
> +
> +/**
> + * iopf_queue_free - Free IOPF queue
> + * @queue: queue to free
> + *
> + * Counterpart to iopf_queue_alloc(). The driver must not be queuing faults or
> + * adding/removing devices on this queue anymore.
> + */
> +void iopf_queue_free(struct iopf_queue *queue)
> +{
> +	struct iopf_device_param *iopf_param, *next;
> +
> +	if (!queue)
> +		return;
> +
> +	list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list)
> +		iopf_queue_remove_device(queue, iopf_param->dev);
> +
> +	destroy_workqueue(queue->wq);
> +	kfree(queue);
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_free);
[...]


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-24 19:00   ` Jason Gunthorpe
@ 2020-02-25  9:24     ` Jean-Philippe Brucker
  2020-02-25 14:08       ` Jason Gunthorpe
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-25  9:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Andrew Morton,
	Arnd Bergmann, Dimitri Sivanich, Greg Kroah-Hartman

On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> > add a get/put scheme for the registration") provides a convenient way
> > for users to attach notifier data to an mm. However, it would be even
> > better to create this notifier data atomically.
> > 
> > Since the alloc_notifier() callback only takes an mm argument at the
> > moment, some users have to perform the allocation in two times.
> > alloc_notifier() initially creates an incomplete structure, which is
> > then finalized using more context once mmu_notifier_get() returns. This
> > second step requires carrying an initialization lock in the notifier
> > data and playing dirty tricks to order memory accesses against live
> > invalidation.
> 
> This was the intended pattern. Tthere shouldn't be an real issue as
> there shouldn't be any data on which to invalidate, ie the later patch
> does:
> 
> +       list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> 
> And that list is empty post-allocation, so no 'dirty tricks' required.

Before introducing this patch I had the following code:

+	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
+		/*
+		 * To ensure that we observe the initialization of io_mm fields
+		 * by io_mm_finalize() before the registration of this bond to
+		 * the list by io_mm_attach(), introduce an address dependency
+		 * between bond and io_mm. It pairs with the smp_store_release()
+		 * from list_add_rcu().
+		 */
+		io_mm = rcu_dereference(bond->io_mm);
+		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
+				       start, end - start);
+	}

(1) io_mm_get() would obtain an empty io_mm from iommu_notifier_get().
(2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc.
(3) finally io_mm_attach() would add the bond to io_mm->devices.

Since the above code can run before (2) it needs to observe valid
io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond
initialized by (3). Which I believe requires the address dependency from
the rcu_dereference() above or some stronger barrier to pair with the
list_add_rcu(). If io_mm->ctx and io_mm->ops are already valid before the
mmu notifier is published, then we don't need that stuff.

That's the main reason I would have liked moving everything to
alloc_notifier(); the locking below isn't a big deal.
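
With the private data argument everything can be set up before the
notifier is published, roughly (trimmed down from patch 02, error paths
and a couple of fields omitted, so only a sketch):

	static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm,
						void *privdata)
	{
		struct io_mm_alloc_params *params = privdata;
		struct io_mm *io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL);

		if (!io_mm)
			return ERR_PTR(-ENOMEM);

		io_mm->mm = mm;
		io_mm->ops = params->ops;
		io_mm->ctx = params->ops->alloc(mm);
		io_mm->pasid = ioasid_alloc(&shared_pasid, params->min_pasid,
					    params->max_pasid, mm);
		/*
		 * ops, ctx and pasid are all valid before the notifier is
		 * published, so invalidate_range() doesn't need any ordering
		 * beyond the list_add_rcu()/list_for_each_entry_rcu() pairing.
		 */
		return &io_mm->notifier;
	}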

> The other op callback is release, which also cannot be called as the
> caller must hold a mmget to establish the notifier.
> 
> So just use the locking that already exists. There is one function
> that calls io_mm_get() which immediately calls io_mm_attach, which
> immediately grabs the global iommu_sva_lock.
> 
> Thus init the pasid for the first time under that lock and everything
> is fine.

I agree with this, can't remember why I used a separate lock for
initialization rather than reusing iommu_sva_lock.

Thanks,
Jean

> 
> There is nothing inherently wrong with the approach in this patch, but
> it seems unneeded in this case..
> 
> Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 03/26] iommu: Add a page fault handler
  2020-02-25  3:30   ` Xu Zaibo
@ 2020-02-25  9:25     ` Jean-Philippe Brucker
  2020-02-26  3:05       ` Xu Zaibo
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-25  9:25 UTC (permalink / raw)
  To: Xu Zaibo
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm,
	mark.rutland, kevin.tian, Jean-Philippe Brucker, catalin.marinas,
	robin.murphy, robh+dt, zhangfei.gao, will, christian.koenig

Hi Zaibo,

On Tue, Feb 25, 2020 at 11:30:05AM +0800, Xu Zaibo wrote:
> > +struct iopf_queue *
> > +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie)
> > +{
> > +	struct iopf_queue *queue;
> > +
> > +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> > +	if (!queue)
> > +		return NULL;
> > +
> > +	/*
> > +	 * The WQ is unordered because the low-level handler enqueues faults by
> > +	 * group. PRI requests within a group have to be ordered, but once
> > +	 * that's dealt with, the high-level function can handle groups out of
> > +	 * order.
> > +	 */
> > +	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
> Should this workqueue use 'WQ_HIGHPRI | WQ_UNBOUND' or some flags like this
> to decrease the unexpected
> latency of I/O PageFault here? Or maybe, workqueue will show an uncontrolled
> latency, even in a busy system.

I'll investigate the effect of these flags. So far I've only run on
completely idle systems but it would be interesting to add some
workqueue-heavy load in my tests.
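
The change itself would just be a one-liner, assuming WQ_HIGHPRI turns
out to help:

	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND | WQ_HIGHPRI,
				    0, name);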

Thanks,
Jean

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-25  9:24     ` Jean-Philippe Brucker
@ 2020-02-25 14:08       ` Jason Gunthorpe
  2020-02-28 14:39         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 70+ messages in thread
From: Jason Gunthorpe @ 2020-02-25 14:08 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Andrew Morton,
	Arnd Bergmann, Dimitri Sivanich, Greg Kroah-Hartman

On Tue, Feb 25, 2020 at 10:24:39AM +0100, Jean-Philippe Brucker wrote:
> On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote:
> > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> > > add a get/put scheme for the registration") provides a convenient way
> > > for users to attach notifier data to an mm. However, it would be even
> > > better to create this notifier data atomically.
> > > 
> > > Since the alloc_notifier() callback only takes an mm argument at the
> > > moment, some users have to perform the allocation in two times.
> > > alloc_notifier() initially creates an incomplete structure, which is
> > > then finalized using more context once mmu_notifier_get() returns. This
> > > second step requires carrying an initialization lock in the notifier
> > > data and playing dirty tricks to order memory accesses against live
> > > invalidation.
> > 
> > This was the intended pattern. Tthere shouldn't be an real issue as
> > there shouldn't be any data on which to invalidate, ie the later patch
> > does:
> > 
> > +       list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> > 
> > And that list is empty post-allocation, so no 'dirty tricks' required.
> 
> Before introducing this patch I had the following code:
> 
> +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> +		/*
> +		 * To ensure that we observe the initialization of io_mm fields
> +		 * by io_mm_finalize() before the registration of this bond to
> +		 * the list by io_mm_attach(), introduce an address dependency
> +		 * between bond and io_mm. It pairs with the smp_store_release()
> +		 * from list_add_rcu().
> +		 */
> +		io_mm = rcu_dereference(bond->io_mm);

A rcu_dereference isn't needed here, just a normal dereference is fine.

> +		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> +				       start, end - start);
> +	}
> 
> (1) io_mm_get() would obtain an empty io_mm from iommu_notifier_get().
> (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc.
> (3) finally io_mm_attach() would add the bond to io_mm->devices.
> 
> Since the above code can run before (2) it needs to observe valid
> io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond
> initialized by (3). Which I believe requires the address dependency from
> the rcu_dereference() above or some stronger barrier to pair with the
> list_add_rcu().

The list_for_each_entry_rcu() is an acquire that already pairs with
the release in list_add_rcu(); all you need is a data dependency chain
starting on bond for the ordering to be correct.

But this is super tricky :\

> If io_mm->ctx and io_mm->ops are already valid before the
> mmu notifier is published, then we don't need that stuff.

So, this trickyness with RCU is not a bad reason to introduce the priv
scheme, maybe explain it in the commit message?

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 03/26] iommu: Add a page fault handler
  2020-02-25  9:25     ` Jean-Philippe Brucker
@ 2020-02-26  3:05       ` Xu Zaibo
  0 siblings, 0 replies; 70+ messages in thread
From: Xu Zaibo @ 2020-02-26  3:05 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm,
	mark.rutland, kevin.tian, Jean-Philippe Brucker, catalin.marinas,
	robin.murphy, robh+dt, zhangfei.gao, will, christian.koenig, tj

Hi,
On 2020/2/25 17:25, Jean-Philippe Brucker wrote:
> Hi Zaibo,
>
> On Tue, Feb 25, 2020 at 11:30:05AM +0800, Xu Zaibo wrote:
>>> +struct iopf_queue *
>>> +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie)
>>> +{
>>> +	struct iopf_queue *queue;
>>> +
>>> +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
>>> +	if (!queue)
>>> +		return NULL;
>>> +
>>> +	/*
>>> +	 * The WQ is unordered because the low-level handler enqueues faults by
>>> +	 * group. PRI requests within a group have to be ordered, but once
>>> +	 * that's dealt with, the high-level function can handle groups out of
>>> +	 * order.
>>> +	 */
>>> +	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
>> Should this workqueue use 'WQ_HIGHPRI | WQ_UNBOUND' or some flags like this
>> to decrease the unexpected
>> latency of I/O PageFault here? Or maybe, workqueue will show an uncontrolled
>> latency, even in a busy system.
> I'll investigate the effect of these flags. So far I've only run on
> completely idle systems but it would be interesting to add some
> workqueue-heavy load in my tests.
>
I'm not sure, just my concern. Hopefully, Tejun Heo can give us some 
hints. :)

+cc  Tejun Heo <tj@kernel.org>

Cheers,
Zaibo




^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices
  2020-02-24 18:23 ` [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices Jean-Philippe Brucker
@ 2020-02-26  8:44   ` Xu Zaibo
  2020-03-04 14:09     ` Jean-Philippe Brucker
  2020-02-27 18:17   ` Jonathan Cameron
  1 sibling, 1 reply; 70+ messages in thread
From: Xu Zaibo @ 2020-02-26  8:44 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, devicetree, linux-arm-kernel,
	linux-pci, linux-mm
  Cc: mark.rutland, kevin.tian, Jean-Philippe Brucker, catalin.marinas,
	robin.murphy, robh+dt, zhangfei.gao, will, christian.koenig

Hi,


On 2020/2/25 2:23, Jean-Philippe Brucker wrote:
> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>
> The SMMU provides a Stall model for handling page faults in platform
> devices. It is similar to PCI PRI, but doesn't require devices to have
> their own translation cache. Instead, faulting transactions are parked and
> the OS is given a chance to fix the page tables and retry the transaction.
>
> Enable stall for devices that support it (opt-in by firmware). When an
> event corresponds to a translation error, call the IOMMU fault handler. If
> the fault is recoverable, it will call us back to terminate or continue
> the stall.
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>   drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++--
>   drivers/iommu/of_iommu.c    |   5 +-
>   include/linux/iommu.h       |   2 +
>   3 files changed, 269 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 6a5987cce03f..da5dda5ba26a 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -374,6 +374,13 @@
>   #define CMDQ_PRI_1_GRPID		GENMASK_ULL(8, 0)
>   #define CMDQ_PRI_1_RESP			GENMASK_ULL(13, 12)
>   
[...]
> +static int arm_smmu_page_response(struct device *dev,
> +				  struct iommu_fault_event *unused,
> +				  struct iommu_page_response *resp)
> +{
> +	struct arm_smmu_cmdq_ent cmd = {0};
> +	struct arm_smmu_master *master = dev_iommu_fwspec_get(dev)->iommu_priv;
Could this use dev_to_master() here?
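i.e. something like this, assuming the helper introduced earlier in the
series:

	struct arm_smmu_master *master = dev_to_master(dev);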

Cheers,
Zaibo

> +	int sid = master->streams[0].id;
> +
> +	if (master->stall_enabled) {
> +		cmd.opcode		= CMDQ_OP_RESUME;
> +		cmd.resume.sid		= sid;
> +		cmd.resume.stag		= resp->grpid;
> +		switch (resp->code) {
> +		case IOMMU_PAGE_RESP_INVALID:
> +		case IOMMU_PAGE_RESP_FAILURE:
> +			cmd.resume.resp = CMDQ_RESUME_0_RESP_ABORT;
> +			break;
> +		case IOMMU_PAGE_RESP_SUCCESS:
> +			cmd.resume.resp = CMDQ_RESUME_0_RESP_RETRY;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	} else {
> +		/* TODO: insert PRI response here */
> +		return -ENODEV;
> +	}
> +
> +	arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
> +	/*
> +	 * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP.
> +	 * RESUME consumption guarantees that the stalled transaction will be
> +	 * terminated... at some point in the future. PRI_RESP is fire and
> +	 * forget.
> +	 */
> +
> +	return 0;
> +}
> +
[...]


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 02/26] iommu/sva: Manage process address spaces
  2020-02-24 18:23 ` [PATCH v4 02/26] iommu/sva: Manage process address spaces Jean-Philippe Brucker
@ 2020-02-26 12:35   ` Jonathan Cameron
  2020-02-28 14:43     ` Jean-Philippe Brucker
  2020-02-26 19:13   ` Jacob Pan
  1 sibling, 1 reply; 70+ messages in thread
From: Jonathan Cameron @ 2020-02-26 12:35 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Mon, 24 Feb 2020 19:23:37 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> Add a small library to help IOMMU drivers manage process address spaces
> bound to their devices. Register an MMU notifier to track modification
> on each address space bound to one or more devices.
> 
> IOMMU drivers must implement the io_mm_ops and can then use the helpers
> provided by this library to easily implement the SVA API introduced by
> commit 26b25a2b98e4. The io_mm_ops are:
> 
> void *alloc(struct mm_struct *)
>   Allocate a PASID context private to the IOMMU driver. There is a
>   single context per mm. IOMMU drivers may perform arch-specific
>   operations in there, for example pinning down a CPU ASID (on Arm).
> 
> int attach(struct device *, int pasid, void *ctx, bool attach_domain)
>   Attach a context to the device, by setting up the PASID table entry.
> 
> int invalidate(struct device *, int pasid, void *ctx,
>                unsigned long vaddr, size_t size)
>   Invalidate TLB entries for this address range.
> 
> int detach(struct device *, int pasid, void *ctx, bool detach_domain)
>   Detach a context from the device, by clearing the PASID table entry
>   and invalidating cached entries.
> 
> void free(void *ctx)
>   Free a context.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>

Hi Jean-Phillippe,

A few trivial comments from me inline.  Otherwise this all seems sensible.

Jonathan

> ---
>  drivers/iommu/Kconfig     |   7 +
>  drivers/iommu/Makefile    |   1 +
>  drivers/iommu/iommu-sva.c | 561 ++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/iommu-sva.h |  64 +++++
>  drivers/iommu/iommu.c     |   1 +
>  include/linux/iommu.h     |   3 +
>  6 files changed, 637 insertions(+)
>  create mode 100644 drivers/iommu/iommu-sva.c
>  create mode 100644 drivers/iommu/iommu-sva.h
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index d2fade984999..acca20e2da2f 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -102,6 +102,13 @@ config IOMMU_DMA
>  	select IRQ_MSI_IOMMU
>  	select NEED_SG_DMA_LENGTH
>  
> +# Shared Virtual Addressing library
> +config IOMMU_SVA
> +	bool
> +	select IOASID
> +	select IOMMU_API
> +	select MMU_NOTIFIER
> +
>  config FSL_PAMU
>  	bool "Freescale IOMMU support"
>  	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 9f33fdb3bb05..40c800dd4e3e 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>  obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
>  obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> new file mode 100644
> index 000000000000..64f1d1c82383
> --- /dev/null
> +++ b/drivers/iommu/iommu-sva.c
> @@ -0,0 +1,561 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Manage PASIDs and bind process address spaces to devices.
> + *
> + * Copyright (C) 2018 ARM Ltd.

Worth updating the date?

> + */
> +
> +#include <linux/idr.h>
> +#include <linux/ioasid.h>
> +#include <linux/iommu.h>
> +#include <linux/sched/mm.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +#include "iommu-sva.h"
> +
> +/**
> + * DOC: io_mm model
> + *
> + * The io_mm keeps track of process address spaces shared between CPU and IOMMU.
> + * The following example illustrates the relation between structures
> + * iommu_domain, io_mm and iommu_sva. The iommu_sva struct is a bond between
> + * io_mm and device. A device can have multiple io_mm and an io_mm may be bound
> + * to multiple devices.
> + *              ___________________________
> + *             |  IOMMU domain A           |
> + *             |  ________________         |
> + *             | |  IOMMU group   |        +------- io_pgtables
> + *             | |                |        |
> + *             | |   dev 00:00.0 ----+------- bond 1 --- io_mm X
> + *             | |________________|   \    |
> + *             |                       '----- bond 2 ---.
> + *             |___________________________|             \
> + *              ___________________________               \
> + *             |  IOMMU domain B           |             io_mm Y
> + *             |  ________________         |             / /
> + *             | |  IOMMU group   |        |            / /
> + *             | |                |        |           / /
> + *             | |   dev 00:01.0 ------------ bond 3 -' /
> + *             | |   dev 00:01.1 ------------ bond 4 --'
> + *             | |________________|        |
> + *             |                           +------- io_pgtables
> + *             |___________________________|
> + *
> + * In this example, device 00:00.0 is in domain A, devices 00:01.* are in domain
> + * B. All devices within the same domain access the same address spaces. Device
> + * 00:00.0 accesses address spaces X and Y, each corresponding to an mm_struct.
> + * Devices 00:01.* only access address space Y. In addition each
> + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, that is
> + * managed with iommu_map()/iommu_unmap(), and isn't shared with the CPU MMU.
> + *
> + * To obtain the above configuration, users would for instance issue the
> + * following calls:
> + *
> + *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> bond 1
> + *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> bond 2
> + *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> bond 3
> + *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> bond 4
> + *
> + * A single Process Address Space ID (PASID) is allocated for each mm. In the
> + * example, devices use PASID 1 to read/write into address space X and PASID 2
> + * to read/write into address space Y. Calling iommu_sva_get_pasid() on bond 1
> + * returns 1, and calling it on bonds 2-4 returns 2.
> + *
> + * Hardware tables describing this configuration in the IOMMU would typically
> + * look like this:
> + *
> + *                                PASID tables
> + *                                 of domain A
> + *                              .->+--------+
> + *                             / 0 |        |-------> io_pgtable
> + *                            /    +--------+
> + *            Device tables  /   1 |        |-------> pgd X
> + *              +--------+  /      +--------+
> + *      00:00.0 |      A |-'     2 |        |--.
> + *              +--------+         +--------+   \
> + *              :        :       3 |        |    \
> + *              +--------+         +--------+     --> pgd Y
> + *      00:01.0 |      B |--.                    /
> + *              +--------+   \                  |
> + *      00:01.1 |      B |----+   PASID tables  |
> + *              +--------+     \   of domain B  |
> + *                              '->+--------+   |
> + *                               0 |        |-- | --> io_pgtable
> + *                                 +--------+   |
> + *                               1 |        |   |
> + *                                 +--------+   |
> + *                               2 |        |---'
> + *                                 +--------+
> + *                               3 |        |
> + *                                 +--------+
> + *
> + * With this model, a single call binds all devices in a given domain to an
> + * address space. Other devices in the domain will get the same bond implicitly.
> + * However, users must issue one bind() for each device, because IOMMUs may
> + * implement SVA differently. Furthermore, mandating one bind() per device
> + * allows the driver to perform sanity-checks on device capabilities.

> + *
> + * In some IOMMUs, one entry of the PASID table (typically the first one) can
> + * hold non-PASID translations. In this case PASID 0 is reserved and the first
> + * entry points to the io_pgtable pointer. In other IOMMUs the io_pgtable
> + * pointer is held in the device table and PASID 0 is available to the
> + * allocator.

Is it worth hammering home here that we can only do this because the PASID space
is global (with the exception of PASID 0)?  It's a convenient simplification but not
necessarily a hardware restriction, so perhaps we should remind people somewhere in here?

> + */
> +
> +struct io_mm {
> +	struct list_head		devices;
> +	struct mm_struct		*mm;
> +	struct mmu_notifier		notifier;
> +
> +	/* Late initialization */
> +	const struct io_mm_ops		*ops;
> +	void				*ctx;
> +	int				pasid;
> +};
> +
> +#define to_io_mm(mmu_notifier)	container_of(mmu_notifier, struct io_mm, notifier)
> +#define to_iommu_bond(handle)	container_of(handle, struct iommu_bond, sva)

Code ordering wise, do we want this after the definition of iommu_bond?

For both of these it's a bit non-obvious what they convert 'from'.
I wouldn't naturally assume to_io_mm gets me from the notifier to the io_mm,
for example.  Not sure it matters though if these are only used in a few
places.
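
Something along these lines (the name is purely illustrative) would make
the direction of the conversion obvious at the call sites:

	#define mn_to_io_mm(mn)	container_of(mn, struct io_mm, notifier)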

> +
> +struct iommu_bond {
> +	struct iommu_sva		sva;
> +	struct io_mm __rcu		*io_mm;
> +
> +	struct list_head		mm_head;
> +	void				*drvdata;
> +	struct rcu_head			rcu_head;
> +	refcount_t			refs;
> +};
> +
> +static DECLARE_IOASID_SET(shared_pasid);
> +
> +static struct mmu_notifier_ops iommu_mmu_notifier_ops;
> +
> +/*
> + * Serializes modifications of bonds.
> + * Lock order: Device SVA mutex; global SVA mutex; IOASID lock
> + */
> +static DEFINE_MUTEX(iommu_sva_lock);
> +
> +struct io_mm_alloc_params {
> +	const struct io_mm_ops *ops;
> +	int min_pasid, max_pasid;
> +};
> +
> +static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm, void *privdata)
> +{
> +	int ret;
> +	struct io_mm *io_mm;
> +	struct io_mm_alloc_params *params = privdata;
> +
> +	io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL);
> +	if (!io_mm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	io_mm->mm = mm;
> +	io_mm->ops = params->ops;
> +	INIT_LIST_HEAD(&io_mm->devices);
> +
> +	io_mm->pasid = ioasid_alloc(&shared_pasid, params->min_pasid,
> +				    params->max_pasid, io_mm->mm);
> +	if (io_mm->pasid == INVALID_IOASID) {
> +		ret = -ENOSPC;
> +		goto err_free_io_mm;
> +	}
> +
> +	io_mm->ctx = params->ops->alloc(mm);
> +	if (IS_ERR(io_mm->ctx)) {
> +		ret = PTR_ERR(io_mm->ctx);
> +		goto err_free_pasid;
> +	}
> +	return &io_mm->notifier;
> +
> +err_free_pasid:
> +	ioasid_free(io_mm->pasid);
> +err_free_io_mm:
> +	kfree(io_mm);
> +	return ERR_PTR(ret);
> +}
> +
> +static void io_mm_free(struct mmu_notifier *mn)
> +{
> +	struct io_mm *io_mm = to_io_mm(mn);
> +
> +	WARN_ON(!list_empty(&io_mm->devices));
> +
> +	io_mm->ops->release(io_mm->ctx);
> +	ioasid_free(io_mm->pasid);
> +	kfree(io_mm);
> +}
> +
> +/*
> + * io_mm_get - Allocate an io_mm or get the existing one for the given mm
> + * @mm: the mm
> + * @ops: callbacks for the IOMMU driver
> + * @min_pasid: minimum PASID value (inclusive)
> + * @max_pasid: maximum PASID value (inclusive)
> + *
> + * Returns a valid io_mm or an error pointer.
> + */
> +static struct io_mm *io_mm_get(struct mm_struct *mm,
> +			       const struct io_mm_ops *ops,
> +			       int min_pasid, int max_pasid)
> +{
> +	struct io_mm *io_mm;
> +	struct mmu_notifier *mn;
> +	struct io_mm_alloc_params params = {
> +		.ops		= ops,
> +		.min_pasid	= min_pasid,
> +		.max_pasid	= max_pasid,
> +	};
> +
> +	/*
> +	 * A single notifier can exist for this (ops, mm) pair. Allocate it if
> +	 * necessary.
> +	 */
> +	mn = mmu_notifier_get(&iommu_mmu_notifier_ops, mm, &params);
> +	if (IS_ERR(mn))
> +		return ERR_CAST(mn);
> +	io_mm = to_io_mm(mn);
> +
> +	if (WARN_ON(io_mm->ops != ops)) {
> +		mmu_notifier_put(mn);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	return io_mm;
> +}
> +
> +static void io_mm_put(struct io_mm *io_mm)
> +{
> +	mmu_notifier_put(&io_mm->notifier);
> +}
> +
> +static struct iommu_sva *
> +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata)
> +{
> +	int ret = 0;

I'm fairly sure this is set in all paths below.  Now, of course the
compiler might not think that, in which case fair enough :)

> +	bool attach_domain = true;
> +	struct iommu_bond *bond, *tmp;
> +	struct iommu_domain *domain, *other;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +
> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +	if (!bond)
> +		return ERR_PTR(-ENOMEM);
> +
> +	bond->sva.dev	= dev;
> +	bond->drvdata	= drvdata;
> +	refcount_set(&bond->refs, 1);
> +	RCU_INIT_POINTER(bond->io_mm, io_mm);
> +
> +	mutex_lock(&iommu_sva_lock);
> +	/* Is it already bound to the device or domain? */
> +	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> +		if (tmp->sva.dev != dev) {
> +			other = iommu_get_domain_for_dev(tmp->sva.dev);
> +			if (domain == other)
> +				attach_domain = false;
> +
> +			continue;
> +		}
> +
> +		if (WARN_ON(tmp->drvdata != drvdata)) {
> +			ret = -EINVAL;
> +			goto err_free;
> +		}
> +
> +		/*
> +		 * Hold a single io_mm reference per bond. Note that we can't
> +		 * return an error after this, otherwise the caller would drop
> +		 * an additional reference to the io_mm.
> +		 */
> +		refcount_inc(&tmp->refs);
> +		io_mm_put(io_mm);
> +		kfree(bond);

Freeing outside the lock would be ever so slightly more logical given we allocated
before taking the lock.
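i.e. a sketch of the reordering, based on the code quoted above:

		refcount_inc(&tmp->refs);
		io_mm_put(io_mm);
		mutex_unlock(&iommu_sva_lock);
		kfree(bond);
		return &tmp->sva;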

> +		mutex_unlock(&iommu_sva_lock);
> +		return &tmp->sva;
> +	}
> +
> +	list_add_rcu(&bond->mm_head, &io_mm->devices);
> +	param->nr_bonds++;
> +	mutex_unlock(&iommu_sva_lock);
> +
> +	ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> +				 attach_domain);
> +	if (ret)
> +		goto err_remove;
> +
> +	return &bond->sva;
> +
> +err_remove:
> +	/*
> +	 * At this point concurrent threads may have started to access the
> +	 * io_mm->devices list in order to invalidate address ranges, which
> +	 * requires to free the bond via kfree_rcu()
> +	 */
> +	mutex_lock(&iommu_sva_lock);
> +	param->nr_bonds--;
> +	list_del_rcu(&bond->mm_head);
> +
> +err_free:
> +	mutex_unlock(&iommu_sva_lock);
> +	kfree_rcu(bond, rcu_head);

I don't suppose it really matters, but we don't need the RCU free if
we follow the err_free goto.  Perhaps it would be cleaner in this case
not to use a unified exit path but to handle that case inline?
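e.g. handling it inline, as a sketch based on the code quoted above:

		if (WARN_ON(tmp->drvdata != drvdata)) {
			mutex_unlock(&iommu_sva_lock);
			kfree(bond);
			return ERR_PTR(-EINVAL);
		}

with the kfree_rcu() then only needed on the err_remove path.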

> +	return ERR_PTR(ret);
> +}
> +
> +static void io_mm_detach_locked(struct iommu_bond *bond)
> +{
> +	struct io_mm *io_mm;
> +	struct iommu_bond *tmp;
> +	bool detach_domain = true;
> +	struct iommu_domain *domain, *other;
> +
> +	io_mm = rcu_dereference_protected(bond->io_mm,
> +					  lockdep_is_held(&iommu_sva_lock));
> +	if (!io_mm)
> +		return;
> +
> +	domain = iommu_get_domain_for_dev(bond->sva.dev);
> +
> +	/* Are other devices in the same domain still attached to this mm? */
> +	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> +		if (tmp == bond)
> +			continue;
> +		other = iommu_get_domain_for_dev(tmp->sva.dev);
> +		if (domain == other) {
> +			detach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	io_mm->ops->detach(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> +			   detach_domain);
> +
> +	list_del_rcu(&bond->mm_head);
> +	RCU_INIT_POINTER(bond->io_mm, NULL);
> +
> +	/* Free after RCU grace period */
> +	io_mm_put(io_mm);
> +}
> +
> +/*
> + * io_mm_release - release MMU notifier
> + *
> + * Called when the mm exits. Some devices may still be bound to the io_mm. A few
> + * things need to be done before it is safe to release:
> + *
> + * - Tell the device driver to stop using this PASID.
> + * - Clear the PASID table and invalidate TLBs.
> + * - Drop all references to this io_mm.
> + */
> +static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	struct iommu_bond *bond, *next;
> +	struct io_mm *io_mm = to_io_mm(mn);
> +
> +	mutex_lock(&iommu_sva_lock);
> +	list_for_each_entry_safe(bond, next, &io_mm->devices, mm_head) {
> +		struct device *dev = bond->sva.dev;
> +		struct iommu_sva *sva = &bond->sva;
> +
> +		if (sva->ops && sva->ops->mm_exit &&
> +		    sva->ops->mm_exit(dev, sva, bond->drvdata))
> +			dev_WARN(dev, "possible leak of PASID %u",
> +				 io_mm->pasid);
> +
> +		/* unbind() frees the bond, we just detach it */
> +		io_mm_detach_locked(bond);
> +	}
> +	mutex_unlock(&iommu_sva_lock);
> +}
> +
> +static void io_mm_invalidate_range(struct mmu_notifier *mn,
> +				   struct mm_struct *mm, unsigned long start,
> +				   unsigned long end)
> +{
> +	struct iommu_bond *bond;
> +	struct io_mm *io_mm = to_io_mm(mn);
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> +		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> +				       start, end - start);
> +	rcu_read_unlock();
> +}
> +
> +static struct mmu_notifier_ops iommu_mmu_notifier_ops = {
> +	.alloc_notifier		= io_mm_alloc,
> +	.free_notifier		= io_mm_free,
> +	.release		= io_mm_release,
> +	.invalidate_range	= io_mm_invalidate_range,
> +};
> +
> +struct iommu_sva *
> +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm,
> +		       const struct io_mm_ops *ops, void *drvdata)
> +{
> +	struct io_mm *io_mm;
> +	struct iommu_sva *handle;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return ERR_PTR(-ENODEV);
> +
> +	mutex_lock(&param->sva_lock);
> +	if (!param->sva_param) {
> +		handle = ERR_PTR(-ENODEV);
> +		goto out_unlock;
> +	}
> +
> +	io_mm = io_mm_get(mm, ops, param->sva_param->min_pasid,
> +			  param->sva_param->max_pasid);
> +	if (IS_ERR(io_mm)) {
> +		handle = ERR_CAST(io_mm);
> +		goto out_unlock;
> +	}
> +
> +	handle = io_mm_attach(dev, io_mm, drvdata);
> +	if (IS_ERR(handle))
> +		io_mm_put(io_mm);
> +
> +out_unlock:
> +	mutex_unlock(&param->sva_lock);
> +	return handle;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_bind_generic);
> +
> +static void iommu_sva_unbind_locked(struct iommu_bond *bond)
> +{
> +	struct device *dev = bond->sva.dev;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!refcount_dec_and_test(&bond->refs))
> +		return;
> +
> +	io_mm_detach_locked(bond);
> +	param->nr_bonds--;
> +	kfree_rcu(bond, rcu_head);
> +}
> +
> +void iommu_sva_unbind_generic(struct iommu_sva *handle)
> +{
> +	struct iommu_param *param = handle->dev->iommu_param;
> +
> +	if (WARN_ON(!param))
> +		return;
> +
> +	mutex_lock(&param->sva_lock);
> +	mutex_lock(&iommu_sva_lock);
> +	iommu_sva_unbind_locked(to_iommu_bond(handle));
> +	mutex_unlock(&iommu_sva_lock);
> +	mutex_unlock(&param->sva_lock);
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_unbind_generic);
> +
> +/**
> + * iommu_sva_enable() - Enable Shared Virtual Addressing for a device
> + * @dev: the device
> + * @sva_param: the parameters.
> + *
> + * Called by an IOMMU driver to setup the SVA parameters
> + * @sva_param is duplicated and can be freed when this function returns.
> + *
> + * Return 0 if initialization succeeded, or an error.
> + */
> +int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param)
> +{
> +	int ret;
> +	struct iommu_sva_param *new_param;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return -ENODEV;
> +
> +	new_param = kmemdup(sva_param, sizeof(*new_param), GFP_KERNEL);
> +	if (!new_param)
> +		return -ENOMEM;
> +
> +	mutex_lock(&param->sva_lock);
> +	if (param->sva_param) {
> +		ret = -EEXIST;
> +		goto err_unlock;
> +	}
> +
> +	dev->iommu_param->sva_param = new_param;
> +	mutex_unlock(&param->sva_lock);
> +	return 0;
> +
> +err_unlock:
> +	mutex_unlock(&param->sva_lock);
> +	kfree(new_param);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_enable);
> +
> +/**
> + * iommu_sva_disable() - Disable Shared Virtual Addressing for a device
> + * @dev: the device
> + *
> + * IOMMU drivers call this to disable SVA.
> + */
> +int iommu_sva_disable(struct device *dev)
> +{
> +	int ret = 0;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->sva_lock);
> +	if (!param->sva_param) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
> +	/* Require that all contexts are unbound */
> +	if (param->sva_param->nr_bonds) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +
> +	kfree(param->sva_param);
> +	param->sva_param = NULL;
> +out_unlock:
> +	mutex_unlock(&param->sva_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_disable);
> +
> +bool iommu_sva_enabled(struct device *dev)
> +{
> +	bool enabled;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return false;
> +
> +	mutex_lock(&param->sva_lock);
> +	enabled = !!param->sva_param;
> +	mutex_unlock(&param->sva_lock);
> +	return enabled;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_enabled);
> +
> +int iommu_sva_get_pasid_generic(struct iommu_sva *handle)
> +{
> +	struct io_mm *io_mm;
> +	int pasid = IOMMU_PASID_INVALID;
> +	struct iommu_bond *bond = to_iommu_bond(handle);
> +
> +	rcu_read_lock();
> +	io_mm = rcu_dereference(bond->io_mm);
> +	if (io_mm)
> +		pasid = io_mm->pasid;
> +	rcu_read_unlock();
> +	return pasid;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_get_pasid_generic);
> diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/iommu-sva.h
> new file mode 100644
> index 000000000000..dd55c2db0936
> --- /dev/null
> +++ b/drivers/iommu/iommu-sva.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * SVA library for IOMMU drivers
> + */
> +#ifndef _IOMMU_SVA_H
> +#define _IOMMU_SVA_H
> +
> +#include <linux/iommu.h>
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +
> +struct io_mm_ops {
> +	/* Allocate a PASID context for an mm */
> +	void *(*alloc)(struct mm_struct *mm);
> +
> +	/*
> +	 * Attach a PASID context to a device. Write the entry into the PASID
> +	 * table.
> +	 *
> +	 * @attach_domain is true when no other device in the IOMMU domain is
> +	 *   already attached to this context. IOMMU drivers that share the
> +	 *   PASID tables within a domain don't need to write the PASID entry
> +	 *   when @attach_domain is false.
> +	 */
> +	int (*attach)(struct device *dev, int pasid, void *ctx,
> +		      bool attach_domain);
> +
> +	/*
> +	 * Detach a PASID context from a device. Clear the entry from the PASID
> +	 * table and invalidate if necessary.
> +	 *
> +	 * @detach_domain is true when no other device in the IOMMU domain is
> +	 *   still attached to this context. IOMMU drivers that share the PASID
> +	 *   table within a domain don't need to clear the PASID entry when
> +	 *   @detach_domain is false, only invalidate the caches.
> +	 */
> +	void (*detach)(struct device *dev, int pasid, void *ctx,
> +		       bool detach_domain);
> +
> +	/* Invalidate a range of addresses. Cannot sleep. */
> +	void (*invalidate)(struct device *dev, int pasid, void *ctx,
> +			   unsigned long vaddr, size_t size);
> +
> +	/* Free a context. Cannot sleep. */
> +	void (*release)(void *ctx);
> +};
> +
> +struct iommu_sva_param {
> +	u32			min_pasid;
> +	u32			max_pasid;
> +	int			nr_bonds;
> +};
> +
> +struct iommu_sva *
> +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm,
> +		       const struct io_mm_ops *ops, void *drvdata);
> +void iommu_sva_unbind_generic(struct iommu_sva *handle);
> +int iommu_sva_get_pasid_generic(struct iommu_sva *handle);
> +
> +int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param);
> +int iommu_sva_disable(struct device *dev);
> +bool iommu_sva_enabled(struct device *dev);
> +
> +#endif /* _IOMMU_SVA_H */
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3e3528436e0b..c8bd972c1788 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -164,6 +164,7 @@ static struct iommu_param *iommu_get_dev_param(struct device *dev)
>  		return NULL;
>  
>  	mutex_init(&param->lock);
> +	mutex_init(&param->sva_lock);
>  	dev->iommu_param = param;
>  	return param;
>  }
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 1739f8a7a4b4..83397ae88d2d 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -368,6 +368,7 @@ struct iommu_fault_param {
>   * struct iommu_param - collection of per-device IOMMU data
>   *
>   * @fault_param: IOMMU detected device fault reporting data
> + * @sva_param: IOMMU parameter for SVA
>   *
>   * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
>   *	struct iommu_group	*iommu_group;
> @@ -376,6 +377,8 @@ struct iommu_fault_param {
>  struct iommu_param {
>  	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
> +	struct mutex sva_lock;
> +	struct iommu_sva_param *sva_param;
>  };
>  
>  int  iommu_device_register(struct iommu_device *iommu);




* Re: [PATCH v4 03/26] iommu: Add a page fault handler
  2020-02-24 18:23 ` [PATCH v4 03/26] iommu: Add a page fault handler Jean-Philippe Brucker
  2020-02-25  3:30   ` Xu Zaibo
@ 2020-02-26 13:59   ` Jonathan Cameron
  2020-02-28 14:44     ` Jean-Philippe Brucker
  1 sibling, 1 reply; 70+ messages in thread
From: Jonathan Cameron @ 2020-02-26 13:59 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Mon, 24 Feb 2020 19:23:38 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> Some systems allow devices to handle I/O Page Faults in the core mm. For
> example systems implementing the PCI PRI extension or Arm SMMU stall
> model. Infrastructure for reporting these recoverable page faults was
> recently added to the IOMMU core. Add a page fault handler for host SVA.
> 
> IOMMU driver can now instantiate several fault workqueues and link them to
> IOPF-capable devices. Drivers can choose between a single global
> workqueue, one per IOMMU device, one per low-level fault queue, one per
> domain, etc.
> 
> When it receives a fault event, supposedly in an IRQ handler, the IOMMU
> driver reports the fault using iommu_report_device_fault(), which calls
> the registered handler. The page fault handler then calls the mm fault
> handler, and reports either success or failure with iommu_page_response().
> When the handler succeeded, the IOMMU retries the access.
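(For context, the driver-side flow described above would look roughly like the
sketch below. The my_smmu_* names are purely illustrative;
iopf_queue_alloc(), iopf_queue_add_device() and iommu_report_device_fault()
are the interfaces used by this patch.)

	/* At probe time, for example one fault queue per IOMMU device: */
	queue = iopf_queue_alloc("my-smmu-evtq", my_smmu_flush, smmu);
	iopf_queue_add_device(queue, dev);

	/* In the event/PRI IRQ handler: */
	iommu_report_device_fault(dev, &fault_evt);

	/* io-pgfault.c handles the fault and answers with iommu_page_response() */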
> 
> The iopf_param pointer could be embedded into iommu_fault_param. But
> putting iopf_param into the iommu_param structure allows us not to care
> about ordering between calls to iopf_queue_add_device() and
> iommu_register_device_fault_handler().
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
A few more minor comments...

> ---
>  drivers/iommu/Kconfig      |   4 +
>  drivers/iommu/Makefile     |   1 +
>  drivers/iommu/io-pgfault.c | 451 +++++++++++++++++++++++++++++++++++++
>  include/linux/iommu.h      |  59 +++++
>  4 files changed, 515 insertions(+)
>  create mode 100644 drivers/iommu/io-pgfault.c
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index acca20e2da2f..e4a42e1708b4 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -109,6 +109,10 @@ config IOMMU_SVA
>  	select IOMMU_API
>  	select MMU_NOTIFIER
>  
> +config IOMMU_PAGE_FAULT
> +	bool
> +	select IOMMU_API
> +
>  config FSL_PAMU
>  	bool "Freescale IOMMU support"
>  	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 40c800dd4e3e..bf5cb4ee8409 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>  obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
>  obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
>  obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
> +obj-$(CONFIG_IOMMU_PAGE_FAULT) += io-pgfault.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
>  obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
> diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
> new file mode 100644
> index 000000000000..76e153c59fe3
> --- /dev/null
> +++ b/drivers/iommu/io-pgfault.c
> @@ -0,0 +1,451 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Handle device page faults
> + *
> + * Copyright (C) 2018 ARM Ltd.

As before. Date update perhaps?

> + */
> +
> +#include <linux/iommu.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/workqueue.h>
> +
> +/**
> + * struct iopf_queue - IO Page Fault queue
> + * @wq: the fault workqueue
> + * @flush: low-level flush callback
> + * @flush_arg: flush() argument
> + * @devices: devices attached to this queue
> + * @lock: protects the device list
> + */
> +struct iopf_queue {
> +	struct workqueue_struct		*wq;
> +	iopf_queue_flush_t		flush;
> +	void				*flush_arg;
> +	struct list_head		devices;
> +	struct mutex			lock;
> +};
> +
> +/**
> + * struct iopf_device_param - IO Page Fault data attached to a device
> + * @dev: the device that owns this param
> + * @queue: IOPF queue
> + * @queue_list: index into queue->devices
> + * @partial: faults that are part of a Page Request Group for which the last
> + *           request hasn't been submitted yet.
> + * @busy: the param is being used
> + * @wq_head: signal a change to @busy
> + */
> +struct iopf_device_param {
> +	struct device			*dev;
> +	struct iopf_queue		*queue;
> +	struct list_head		queue_list;
> +	struct list_head		partial;
> +	bool				busy;
> +	wait_queue_head_t		wq_head;
> +};
> +
> +struct iopf_fault {
> +	struct iommu_fault		fault;
> +	struct list_head		head;
> +};
> +
> +struct iopf_group {
> +	struct iopf_fault		last_fault;
> +	struct list_head		faults;
> +	struct work_struct		work;
> +	struct device			*dev;
> +};
> +
> +static int iopf_complete(struct device *dev, struct iopf_fault *iopf,
> +			 enum iommu_page_response_code status)

This is called once per group.  Should the name reflect that?
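Something like iopf_complete_group(), perhaps (the name is only a suggestion):

	static int iopf_complete_group(struct device *dev, struct iopf_fault *iopf,
				       enum iommu_page_response_code status)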

> +{
> +	struct iommu_page_response resp = {
> +		.version		= IOMMU_PAGE_RESP_VERSION_1,
> +		.pasid			= iopf->fault.prm.pasid,
> +		.grpid			= iopf->fault.prm.grpid,
> +		.code			= status,
> +	};
> +
> +	if (iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
> +		resp.flags = IOMMU_PAGE_RESP_PASID_VALID;
> +
> +	return iommu_page_response(dev, &resp);
> +}
> +
> +static enum iommu_page_response_code
> +iopf_handle_single(struct iopf_fault *iopf)
> +{
> +	/* TODO */
> +	return -ENODEV;
> +}
> +
> +static void iopf_handle_group(struct work_struct *work)
> +{
> +	struct iopf_group *group;
> +	struct iopf_fault *iopf, *next;
> +	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
> +
> +	group = container_of(work, struct iopf_group, work);
> +
> +	list_for_each_entry_safe(iopf, next, &group->faults, head) {
> +		/*
> +		 * For the moment, errors are sticky: don't handle subsequent
> +		 * faults in the group if there is an error.
> +		 */
> +		if (status == IOMMU_PAGE_RESP_SUCCESS)
> +			status = iopf_handle_single(iopf);
> +
> +		if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
> +			kfree(iopf);
> +	}
> +
> +	iopf_complete(group->dev, &group->last_fault, status);
> +	kfree(group);
> +}
> +
> +/**
> + * iommu_queue_iopf - IO Page Fault handler
> + * @evt: fault event
> + * @cookie: struct device, passed to iommu_register_device_fault_handler.
> + *
> + * Add a fault to the device workqueue, to be handled by mm.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> +{
> +	int ret;
> +	struct iopf_group *group;
> +	struct iopf_fault *iopf, *next;
> +	struct iopf_device_param *iopf_param;
> +
> +	struct device *dev = cookie;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (WARN_ON(!mutex_is_locked(&param->lock)))
> +		return -EINVAL;

Just curious...

Why do we always need a runtime check here rather than, say,
using lockdep_assert_held() or similar?
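i.e. something along these lines (a sketch, not tested):

	lockdep_assert_held(&param->lock);

which documents the locking requirement and is a no-op when lockdep is
disabled, instead of paying for a runtime check and returning an error.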

> +
> +	if (fault->type != IOMMU_FAULT_PAGE_REQ)
> +		/* Not a recoverable page fault */
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * As long as we're holding param->lock, the queue can't be unlinked
> +	 * from the device and therefore cannot disappear.
> +	 */
> +	iopf_param = param->iopf_param;
> +	if (!iopf_param)
> +		return -ENODEV;
> +
> +	if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) {
> +		iopf = kzalloc(sizeof(*iopf), GFP_KERNEL);
> +		if (!iopf)
> +			return -ENOMEM;
> +
> +		iopf->fault = *fault;
> +
> +		/* Non-last request of a group. Postpone until the last one */
> +		list_add(&iopf->head, &iopf_param->partial);
> +
> +		return 0;
> +	}
> +
> +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> +	if (!group) {
> +		/*
> +		 * The caller will send a response to the hardware. But we do
> +		 * need to clean up before leaving, otherwise partial faults
> +		 * will be stuck.
> +		 */
> +		ret = -ENOMEM;
> +		goto cleanup_partial;
> +	}
> +
> +	group->dev = dev;
> +	group->last_fault.fault = *fault;
> +	INIT_LIST_HEAD(&group->faults);
> +	list_add(&group->last_fault.head, &group->faults);
> +	INIT_WORK(&group->work, iopf_handle_group);
> +
> +	/* See if we have partial faults for this group */
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) {
> +		if (iopf->fault.prm.grpid == fault->prm.grpid)
> +			/* Insert *before* the last fault */
> +			list_move(&iopf->head, &group->faults);
> +	}
> +
> +	queue_work(iopf_param->queue->wq, &group->work);
> +	return 0;
> +
> +cleanup_partial:
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) {
> +		if (iopf->fault.prm.grpid == fault->prm.grpid) {
> +			list_del(&iopf->head);
> +			kfree(iopf);
> +		}
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_queue_iopf);
> +
> +/**
> + * iopf_queue_flush_dev - Ensure that all queued faults have been processed
> + * @dev: the endpoint whose faults need to be flushed.
> + * @pasid: the PASID affected by this flush
> + *
> + * Users must call this function when releasing a PASID, to ensure that all
> + * pending faults for this PASID have been handled, and won't hit the address
> + * space of the next process that uses this PASID.
> + *
> + * This function can also be called before shutting down the device, in which
> + * case @pasid should be IOMMU_PASID_INVALID.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_flush_dev(struct device *dev, int pasid)
> +{
> +	int ret = 0;
> +	struct iopf_queue *queue;
> +	struct iopf_device_param *iopf_param;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return -ENODEV;
> +
> +	/*
> +	 * It is incredibly easy to find ourselves in a deadlock situation if
> +	 * we're not careful, because we're taking the opposite path as
> +	 * iommu_queue_iopf:
> +	 *
> +	 *   iopf_queue_flush_dev()   |  PRI queue handler
> +	 *    lock(&param->lock)      |   iommu_queue_iopf()
> +	 *     queue->flush()         |    lock(&param->lock)
> +	 *      wait PRI queue empty  |
> +	 *
> +	 * So we can't hold the device param lock while flushing. Take a
> +	 * reference to the device param instead, to prevent the queue from
> +	 * going away.
> +	 */
> +	mutex_lock(&param->lock);
> +	iopf_param = param->iopf_param;
> +	if (iopf_param) {
> +		queue = param->iopf_param->queue;
> +		iopf_param->busy = true;

Describing this as taking a reference is not great...
I'd change the comment to say it sets a flag, or something like that.

Is there any potential for multiple copies of this running against
each other?  I've not totally gotten my head around when this
might be called yet.
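Maybe something like this for the comment (same logic, only the wording
changes):

	/*
	 * So we can't hold the device param lock while flushing. Set the busy
	 * flag instead, so that iopf_queue_remove_device() waits for us to
	 * finish before freeing the iopf_param.
	 */
	iopf_param->busy = true;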

> +	} else {
> +		ret = -ENODEV;
> +	}
> +	mutex_unlock(&param->lock);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * When removing a PASID, the device driver tells the device to stop
> +	 * using it, and flush any pending fault to the IOMMU. In this flush
> +	 * callback, the IOMMU driver makes sure that there are no such faults
> +	 * left in the low-level queue.
> +	 */
> +	queue->flush(queue->flush_arg, dev, pasid);
> +
> +	flush_workqueue(queue->wq);
> +
> +	mutex_lock(&param->lock);
> +	iopf_param->busy = false;
> +	wake_up(&iopf_param->wq_head);
> +	mutex_unlock(&param->lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_flush_dev);
> +
> +/**
> + * iopf_queue_discard_partial - Remove all pending partial fault
> + * @queue: the queue whose partial faults need to be discarded
> + *
> + * When the hardware queue overflows, last page faults in a group may have been
> + * lost and the IOMMU driver calls this to discard all partial faults. The
> + * driver shouldn't be adding new faults to this queue concurrently.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_discard_partial(struct iopf_queue *queue)
> +{
> +	struct iopf_fault *iopf, *next;
> +	struct iopf_device_param *iopf_param;
> +
> +	if (!queue)
> +		return -EINVAL;
> +
> +	mutex_lock(&queue->lock);
> +	list_for_each_entry(iopf_param, &queue->devices, queue_list) {
> +		list_for_each_entry_safe(iopf, next, &iopf_param->partial, head)
> +			kfree(iopf);
> +	}
> +	mutex_unlock(&queue->lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_discard_partial);
> +
> +/**
> + * iopf_queue_add_device - Add producer to the fault queue
> + * @queue: IOPF queue
> + * @dev: device to add
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev)
> +{
> +	int ret = -EINVAL;
> +	struct iopf_device_param *iopf_param;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return -ENODEV;
> +
> +	iopf_param = kzalloc(sizeof(*iopf_param), GFP_KERNEL);
> +	if (!iopf_param)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&iopf_param->partial);
> +	iopf_param->queue = queue;
> +	iopf_param->dev = dev;
> +	init_waitqueue_head(&iopf_param->wq_head);
> +
> +	mutex_lock(&queue->lock);
> +	mutex_lock(&param->lock);
> +	if (!param->iopf_param) {
> +		list_add(&iopf_param->queue_list, &queue->devices);
> +		param->iopf_param = iopf_param;
> +		ret = 0;
> +	}
> +	mutex_unlock(&param->lock);
> +	mutex_unlock(&queue->lock);
> +
> +	if (ret)
> +		kfree(iopf_param);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_add_device);
> +
> +/**
> + * iopf_queue_remove_device - Remove producer from fault queue
> + * @queue: IOPF queue
> + * @dev: device to remove
> + *
> + * Caller makes sure that no more faults are reported for this device.
> + *
> + * Return: 0 on success and <0 on error.
> + */
> +int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev)
> +{
> +	int ret = -EINVAL;
> +	struct iopf_fault *iopf, *next;
> +	struct iopf_device_param *iopf_param;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param || !queue)
> +		return -EINVAL;
> +
> +	do {
> +		mutex_lock(&queue->lock);
> +		mutex_lock(&param->lock);
> +		iopf_param = param->iopf_param;
> +		if (iopf_param && iopf_param->queue == queue) {
> +			if (iopf_param->busy) {
> +				ret = -EBUSY;
> +			} else {
> +				list_del(&iopf_param->queue_list);
> +				param->iopf_param = NULL;
> +				ret = 0;
> +			}
> +		}
> +		mutex_unlock(&param->lock);
> +		mutex_unlock(&queue->lock);
> +
> +		/*
> +		 * If there is an ongoing flush, wait for it to complete and
> +		 * then retry. iopf_param isn't going away since we're the only
> +		 * thread that can free it.
> +		 */
> +		if (ret == -EBUSY)
> +			wait_event(iopf_param->wq_head, !iopf_param->busy);
> +		else if (ret)
> +			return ret;
> +	} while (ret == -EBUSY);

I'm in two minds about the next comment (so it's up to you)...

Currently this looks a bit odd.  Would you be better off just having a
separate local variable for busy, and explicit separate handling for the
error path?

	bool busy;
	int ret = 0;

	do {
		mutex_lock(&queue->lock);
		mutex_lock(&param->lock);
		iopf_param = param->iopf_param;
		if (iopf_param && iopf_param->queue == queue) {
			busy = iopf_param->busy;
			if (!busy) {
				list_del(&iopf_param->queue_list);
				param->iopf_param = NULL;
			}
		} else {
			ret = -EINVAL;
		}
		mutex_unlock(&param->lock);
		mutex_unlock(&queue->lock);
		if (ret)
			return ret;
		if (busy)
			wait_event(iopf_param->wq_head, !iopf_param->busy);
		
	} while (busy);

	..

> +
> +	/* Just in case some faults are still stuck */
> +	list_for_each_entry_safe(iopf, next, &iopf_param->partial, head)
> +		kfree(iopf);
> +
> +	kfree(iopf_param);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_remove_device);
> +
> +/**
> + * iopf_queue_alloc - Allocate and initialize a fault queue
> + * @name: a unique string identifying the queue (for workqueue)
> + * @flush: a callback that flushes the low-level queue
> + * @cookie: driver-private data passed to the flush callback
> + *
> + * The callback is called before the workqueue is flushed. The IOMMU driver must
> + * commit all faults that are pending in its low-level queues at the time of the
> + * call, into the IOPF queue (with iommu_report_device_fault). The callback
> + * takes a device pointer as argument, hinting what endpoint is causing the
> + * flush. When the device is NULL, all faults should be committed.
> + *
> + * Return: the queue on success and NULL on error.
> + */
> +struct iopf_queue *
> +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie)
> +{
> +	struct iopf_queue *queue;
> +
> +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> +	if (!queue)
> +		return NULL;
> +
> +	/*
> +	 * The WQ is unordered because the low-level handler enqueues faults by
> +	 * group. PRI requests within a group have to be ordered, but once
> +	 * that's dealt with, the high-level function can handle groups out of
> +	 * order.
> +	 */
> +	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
> +	if (!queue->wq) {
> +		kfree(queue);
> +		return NULL;
> +	}
> +
> +	queue->flush = flush;
> +	queue->flush_arg = cookie;
> +	INIT_LIST_HEAD(&queue->devices);
> +	mutex_init(&queue->lock);
> +
> +	return queue;
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_alloc);
> +
> +/**
> + * iopf_queue_free - Free IOPF queue
> + * @queue: queue to free
> + *
> + * Counterpart to iopf_queue_alloc(). The driver must not be queuing faults or
> + * adding/removing devices on this queue anymore.
> + */
> +void iopf_queue_free(struct iopf_queue *queue)
> +{
> +	struct iopf_device_param *iopf_param, *next;
> +
> +	if (!queue)
> +		return;
> +
> +	list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list)
> +		iopf_queue_remove_device(queue, iopf_param->dev);
> +
> +	destroy_workqueue(queue->wq);
> +	kfree(queue);
> +}
> +EXPORT_SYMBOL_GPL(iopf_queue_free);
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 83397ae88d2d..e7bc47ba24f8 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -364,11 +364,20 @@ struct iommu_fault_param {
>  	struct mutex lock;
>  };
>  
> +/**
> + * iopf_queue_flush_t - Flush low-level page fault queue
> + *
> + * Report all faults currently pending in the low-level page fault queue
> + */
> +struct iopf_queue;
> +typedef int (*iopf_queue_flush_t)(void *cookie, struct device *dev, int pasid);
> +
>  /**
>   * struct iommu_param - collection of per-device IOMMU data
>   *
>   * @fault_param: IOMMU detected device fault reporting data
>   * @sva_param: IOMMU parameter for SVA
> + * @iopf_param: I/O Page Fault queue and data
>   *
>   * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
>   *	struct iommu_group	*iommu_group;
> @@ -377,6 +386,7 @@ struct iommu_fault_param {
>  struct iommu_param {
>  	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
> +	struct iopf_device_param *iopf_param;
>  	struct mutex sva_lock;
>  	struct iommu_sva_param *sva_param;
>  };
> @@ -1081,4 +1091,53 @@ void iommu_debugfs_setup(void);
>  static inline void iommu_debugfs_setup(void) {}
>  #endif
>  
> +#ifdef CONFIG_IOMMU_PAGE_FAULT
> +extern int iommu_queue_iopf(struct iommu_fault *fault, void *cookie);
> +
> +extern int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev);
> +extern int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev);
> +extern int iopf_queue_flush_dev(struct device *dev, int pasid);
> +extern struct iopf_queue *
> +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie);
> +extern void iopf_queue_free(struct iopf_queue *queue);
> +extern int iopf_queue_discard_partial(struct iopf_queue *queue);
> +#else /* CONFIG_IOMMU_PAGE_FAULT */
> +static inline int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iopf_queue_add_device(struct iopf_queue *queue,
> +					struct device *dev)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iopf_queue_remove_device(struct iopf_queue *queue,
> +					   struct device *dev)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int iopf_queue_flush_dev(struct device *dev, int pasid)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline struct iopf_queue *
> +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie)
> +{
> +	return NULL;
> +}
> +
> +static inline void iopf_queue_free(struct iopf_queue *queue)
> +{
> +}
> +
> +static inline int iopf_queue_discard_partial(struct iopf_queue *queue)
> +{
> +	return -ENODEV;
> +}
> +#endif /* CONFIG_IOMMU_PAGE_FAULT */
> +
>  #endif /* __LINUX_IOMMU_H */




* Re: [PATCH v4 02/26] iommu/sva: Manage process address spaces
  2020-02-24 18:23 ` [PATCH v4 02/26] iommu/sva: Manage process address spaces Jean-Philippe Brucker
  2020-02-26 12:35   ` Jonathan Cameron
@ 2020-02-26 19:13   ` Jacob Pan
  2020-02-28 14:40     ` Jean-Philippe Brucker
  1 sibling, 1 reply; 70+ messages in thread
From: Jacob Pan @ 2020-02-26 19:13 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, christian.koenig,
	yi.l.liu, zhangfei.gao, Jean-Philippe Brucker, jacob.jun.pan

Hi Jean,

A few comments inline. I am also trying to converge on the common SVA
APIs. I sent out the first step, without the I/O page fault handling and
the generic ops you have here.

On Mon, 24 Feb 2020 19:23:37 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> Add a small library to help IOMMU drivers manage process address
> spaces bound to their devices. Register an MMU notifier to track
> modification on each address space bound to one or more devices.
> 
> IOMMU drivers must implement the io_mm_ops and can then use the
> helpers provided by this library to easily implement the SVA API
> introduced by commit 26b25a2b98e4. The io_mm_ops are:
> 
> void *alloc(struct mm_struct *)
>   Allocate a PASID context private to the IOMMU driver. There is a
>   single context per mm. IOMMU drivers may perform arch-specific
>   operations in there, for example pinning down a CPU ASID (on Arm).
> 
> int attach(struct device *, int pasid, void *ctx, bool attach_domain)
>   Attach a context to the device, by setting up the PASID table entry.
> 
> int invalidate(struct device *, int pasid, void *ctx,
>                unsigned long vaddr, size_t size)
>   Invalidate TLB entries for this address range.
> 
> int detach(struct device *, int pasid, void *ctx, bool detach_domain)
>   Detach a context from the device, by clearing the PASID table entry
>   and invalidating cached entries.
> 
> void free(void *ctx)
you meant release()?

>   Free a context.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  drivers/iommu/Kconfig     |   7 +
>  drivers/iommu/Makefile    |   1 +
>  drivers/iommu/iommu-sva.c | 561 ++++++++++++++++++++++++++++++++++++++
>  drivers/iommu/iommu-sva.h |  64 +++++
>  drivers/iommu/iommu.c     |   1 +
>  include/linux/iommu.h     |   3 +
>  6 files changed, 637 insertions(+)
>  create mode 100644 drivers/iommu/iommu-sva.c
>  create mode 100644 drivers/iommu/iommu-sva.h
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index d2fade984999..acca20e2da2f 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -102,6 +102,13 @@ config IOMMU_DMA
>  	select IRQ_MSI_IOMMU
>  	select NEED_SG_DMA_LENGTH
>  
> +# Shared Virtual Addressing library
> +config IOMMU_SVA
> +	bool
> +	select IOASID
> +	select IOMMU_API
> +	select MMU_NOTIFIER
> +
>  config FSL_PAMU
>  	bool "Freescale IOMMU support"
>  	depends on PCI
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 9f33fdb3bb05..40c800dd4e3e 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>  obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
>  obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> new file mode 100644
> index 000000000000..64f1d1c82383
> --- /dev/null
> +++ b/drivers/iommu/iommu-sva.c
> @@ -0,0 +1,561 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Manage PASIDs and bind process address spaces to devices.
> + *
> + * Copyright (C) 2018 ARM Ltd.
> + */
> +
> +#include <linux/idr.h>
> +#include <linux/ioasid.h>
> +#include <linux/iommu.h>
> +#include <linux/sched/mm.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +#include "iommu-sva.h"
> +
> +/**
> + * DOC: io_mm model
> + *
> + * The io_mm keeps track of process address spaces shared between
> CPU and IOMMU.
> + * The following example illustrates the relation between structures
> + * iommu_domain, io_mm and iommu_sva. The iommu_sva struct is a bond
> between
> + * io_mm and device. A device can have multiple io_mm and an io_mm
> may be bound
> + * to multiple devices.
> + *              ___________________________
> + *             |  IOMMU domain A           |
> + *             |  ________________         |
> + *             | |  IOMMU group   |        +------- io_pgtables
> + *             | |                |        |
> + *             | |   dev 00:00.0 ----+------- bond 1 --- io_mm X
> + *             | |________________|   \    |
> + *             |                       '----- bond 2 ---.
> + *             |___________________________|             \
> + *              ___________________________               \
> + *             |  IOMMU domain B           |             io_mm Y
> + *             |  ________________         |             / /
> + *             | |  IOMMU group   |        |            / /
> + *             | |                |        |           / /
> + *             | |   dev 00:01.0 ------------ bond 3 -' /
> + *             | |   dev 00:01.1 ------------ bond 4 --'
> + *             | |________________|        |
> + *             |                           +------- io_pgtables
> + *             |___________________________|
> + *
> + * In this example, device 00:00.0 is in domain A, devices 00:01.*
> are in domain
> + * B. All devices within the same domain access the same address
> spaces.
Hmm, devices in domain A have access to both X and Y; isn't that
contradictory?

> Device
> + * 00:00.0 accesses address spaces X and Y, each corresponding to an
> mm_struct.
> + * Devices 00:01.* only access address space Y. In addition each
> + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable,
> that is
> + * managed with iommu_map()/iommu_unmap(), and isn't shared with the
> CPU MMU.
So this would allow IOVA and SVA to coexist in the same address space?
I guess this is PASID 0 for DMA requests without PASID. If that is the
case, this perhaps needs more explanation, since the private address space
also has a private PASID within the domain.

> + *
> + * To obtain the above configuration, users would for instance issue
> the
> + * following calls:
> + *
> + *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> bond 1
> + *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> bond 2
> + *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> bond 3
> + *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> bond 4
> + *
> + * A single Process Address Space ID (PASID) is allocated for each
> mm. In the
> + * example, devices use PASID 1 to read/write into address space X
> and PASID 2
> + * to read/write into address space Y. Calling iommu_sva_get_pasid()
> on bond 1
> + * returns 1, and calling it on bonds 2-4 returns 2.
> + *
> + * Hardware tables describing this configuration in the IOMMU would
> typically
> + * look like this:
> + *
> + *                                PASID tables
> + *                                 of domain A
> + *                              .->+--------+
> + *                             / 0 |        |-------> io_pgtable
> + *                            /    +--------+
> + *            Device tables  /   1 |        |-------> pgd X
> + *              +--------+  /      +--------+
> + *      00:00.0 |      A |-'     2 |        |--.
> + *              +--------+         +--------+   \
> + *              :        :       3 |        |    \
> + *              +--------+         +--------+     --> pgd Y
> + *      00:01.0 |      B |--.                    /
> + *              +--------+   \                  |
> + *      00:01.1 |      B |----+   PASID tables  |
> + *              +--------+     \   of domain B  |
> + *                              '->+--------+   |
> + *                               0 |        |-- | --> io_pgtable
> + *                                 +--------+   |
> + *                               1 |        |   |
> + *                                 +--------+   |
> + *                               2 |        |---'
> + *                                 +--------+
> + *                               3 |        |
> + *                                 +--------+
> + *
> + * With this model, a single call binds all devices in a given
> domain to an
> + * address space. Other devices in the domain will get the same bond
> implicitly.
> + * However, users must issue one bind() for each device, because
> IOMMUs may
> + * implement SVA differently. Furthermore, mandating one bind() per
> device
> + * allows the driver to perform sanity-checks on device capabilities.
> + *
> + * In some IOMMUs, one entry of the PASID table (typically the first
> one) can
> + * hold non-PASID translations. In this case PASID 0 is reserved and
> the first
> + * entry points to the io_pgtable pointer. In other IOMMUs the
> io_pgtable
> + * pointer is held in the device table and PASID 0 is available to
> the
> + * allocator.
> + */
> +
> +struct io_mm {
> +	struct list_head		devices;
> +	struct mm_struct		*mm;
> +	struct mmu_notifier		notifier;
> +
> +	/* Late initialization */
> +	const struct io_mm_ops		*ops;
> +	void				*ctx;
> +	int				pasid;
> +};
> +
> +#define to_io_mm(mmu_notifier)	container_of(mmu_notifier, struct io_mm, notifier)
> +#define to_iommu_bond(handle)	container_of(handle, struct iommu_bond, sva)
> +
> +struct iommu_bond {
> +	struct iommu_sva		sva;
> +	struct io_mm __rcu		*io_mm;
> +
> +	struct list_head		mm_head;
> +	void				*drvdata;
> +	struct rcu_head			rcu_head;
> +	refcount_t			refs;
> +};
> +
> +static DECLARE_IOASID_SET(shared_pasid);
> +
> +static struct mmu_notifier_ops iommu_mmu_notifier_ops;
> +
> +/*
> + * Serializes modifications of bonds.
> + * Lock order: Device SVA mutex; global SVA mutex; IOASID lock
> + */
> +static DEFINE_MUTEX(iommu_sva_lock);
> +
> +struct io_mm_alloc_params {
> +	const struct io_mm_ops *ops;
> +	int min_pasid, max_pasid;
> +};
> +
> +static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm, void *privdata)
> +{
> +	int ret;
> +	struct io_mm *io_mm;
> +	struct io_mm_alloc_params *params = privdata;
> +
> +	io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL);
> +	if (!io_mm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	io_mm->mm = mm;
> +	io_mm->ops = params->ops;
> +	INIT_LIST_HEAD(&io_mm->devices);
> +
> +	io_mm->pasid = ioasid_alloc(&shared_pasid, params->min_pasid,
> +				    params->max_pasid, io_mm->mm);
> +	if (io_mm->pasid == INVALID_IOASID) {
> +		ret = -ENOSPC;
> +		goto err_free_io_mm;
> +	}
> +
> +	io_mm->ctx = params->ops->alloc(mm);
> +	if (IS_ERR(io_mm->ctx)) {
> +		ret = PTR_ERR(io_mm->ctx);
> +		goto err_free_pasid;
> +	}
> +	return &io_mm->notifier;
> +
> +err_free_pasid:
> +	ioasid_free(io_mm->pasid);
> +err_free_io_mm:
> +	kfree(io_mm);
> +	return ERR_PTR(ret);
> +}
> +
> +static void io_mm_free(struct mmu_notifier *mn)
> +{
> +	struct io_mm *io_mm = to_io_mm(mn);
> +
> +	WARN_ON(!list_empty(&io_mm->devices));
> +
> +	io_mm->ops->release(io_mm->ctx);
> +	ioasid_free(io_mm->pasid);
> +	kfree(io_mm);
> +}
> +
> +/*
> + * io_mm_get - Allocate an io_mm or get the existing one for the
> given mm
> + * @mm: the mm
> + * @ops: callbacks for the IOMMU driver
> + * @min_pasid: minimum PASID value (inclusive)
> + * @max_pasid: maximum PASID value (inclusive)
> + *
> + * Returns a valid io_mm or an error pointer.
> + */
> +static struct io_mm *io_mm_get(struct mm_struct *mm,
> +			       const struct io_mm_ops *ops,
> +			       int min_pasid, int max_pasid)
> +{
> +	struct io_mm *io_mm;
> +	struct mmu_notifier *mn;
> +	struct io_mm_alloc_params params = {
> +		.ops		= ops,
> +		.min_pasid	= min_pasid,
> +		.max_pasid	= max_pasid,
> +	};
> +
> +	/*
> +	 * A single notifier can exist for this (ops, mm) pair.
> Allocate it if
> +	 * necessary.
> +	 */
> +	mn = mmu_notifier_get(&iommu_mmu_notifier_ops, mm, &params);
> +	if (IS_ERR(mn))
> +		return ERR_CAST(mn);
> +	io_mm = to_io_mm(mn);
> +
> +	if (WARN_ON(io_mm->ops != ops)) {
> +		mmu_notifier_put(mn);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	return io_mm;
> +}
> +
> +static void io_mm_put(struct io_mm *io_mm)
> +{
> +	mmu_notifier_put(&io_mm->notifier);
> +}
> +
> +static struct iommu_sva *
> +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata)
> +{
> +	int ret = 0;
> +	bool attach_domain = true;
> +	struct iommu_bond *bond, *tmp;
> +	struct iommu_domain *domain, *other;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	domain = iommu_get_domain_for_dev(dev);
> +
> +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +	if (!bond)
> +		return ERR_PTR(-ENOMEM);
> +
> +	bond->sva.dev	= dev;
> +	bond->drvdata	= drvdata;
> +	refcount_set(&bond->refs, 1);
> +	RCU_INIT_POINTER(bond->io_mm, io_mm);
> +
> +	mutex_lock(&iommu_sva_lock);
> +	/* Is it already bound to the device or domain? */
> +	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> +		if (tmp->sva.dev != dev) {
> +			other = iommu_get_domain_for_dev(tmp->sva.dev);
> +			if (domain == other)
> +				attach_domain = false;
> +
> +			continue;
At this point, we already know this is a new device trying to attach to
one of io_mm's existing domains. So there is no need to continue
checking, right? Perhaps check like this?
-               if (tmp->sva.dev != dev) {
+               if (tmp->sva.dev != dev && attach_domain) {


> +		}
> +
> +		if (WARN_ON(tmp->drvdata != drvdata)) {
> +			ret = -EINVAL;
> +			goto err_free;
> +		}
> +
> +		/*
> +		 * Hold a single io_mm reference per bond. Note that
> we can't
> +		 * return an error after this, otherwise the caller
> would drop
> +		 * an additional reference to the io_mm.
> +		 */
> +		refcount_inc(&tmp->refs);
> +		io_mm_put(io_mm);
> +		kfree(bond);
Can the bond be allocated after searching for an existing bond or domain?
If so, we can avoid freeing the bond here.
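Roughly (a sketch of the reordering; the existing checks stay as they are):

	mutex_lock(&iommu_sva_lock);
	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
		/* ... existing checks; return the existing bond if dev matches ... */
	}

	/* No existing bond for this device, allocate one now */
	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
	if (!bond) {
		mutex_unlock(&iommu_sva_lock);
		return ERR_PTR(-ENOMEM);
	}
	/* ... initialize bond, list_add_rcu() and ops->attach() as before ... */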

> +		mutex_unlock(&iommu_sva_lock);
> +		return &tmp->sva;
> +	}
> +
> +	list_add_rcu(&bond->mm_head, &io_mm->devices);
> +	param->nr_bonds++;
> +	mutex_unlock(&iommu_sva_lock);
> +
> +	ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> +				 attach_domain);
For VT-d, if a device is trying to do an SVA bind, there would not be a
DMA domain. SVA should own the entire address space, with no IOVA. So this
attach() call is for the VT-d driver to set up the first PASID table entry
regardless of whether attach_domain is true or false?

> +	if (ret)
> +		goto err_remove;
> +
> +	return &bond->sva;
> +
> +err_remove:
> +	/*
> +	 * At this point concurrent threads may have started to
> access the
> +	 * io_mm->devices list in order to invalidate address
> ranges, which
> +	 * requires to free the bond via kfree_rcu()
> +	 */
> +	mutex_lock(&iommu_sva_lock);
> +	param->nr_bonds--;
> +	list_del_rcu(&bond->mm_head);
> +
> +err_free:
> +	mutex_unlock(&iommu_sva_lock);
> +	kfree_rcu(bond, rcu_head);
> +	return ERR_PTR(ret);
> +}
> +
> +static void io_mm_detach_locked(struct iommu_bond *bond)
> +{
> +	struct io_mm *io_mm;
> +	struct iommu_bond *tmp;
> +	bool detach_domain = true;
> +	struct iommu_domain *domain, *other;
> +
> +	io_mm = rcu_dereference_protected(bond->io_mm,
> +					  lockdep_is_held(&iommu_sva_lock));
> +	if (!io_mm)
> +		return;
> +
> +	domain = iommu_get_domain_for_dev(bond->sva.dev);
> +
> +	/* Are other devices in the same domain still attached to
> this mm? */
> +	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> +		if (tmp == bond)
> +			continue;
> +		other = iommu_get_domain_for_dev(tmp->sva.dev);
> +		if (domain == other) {
> +			detach_domain = false;
> +			break;
> +		}
> +	}
> +
> +	io_mm->ops->detach(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> +			   detach_domain);
> +
> +	list_del_rcu(&bond->mm_head);
> +	RCU_INIT_POINTER(bond->io_mm, NULL);
> +
> +	/* Free after RCU grace period */
> +	io_mm_put(io_mm);
> +}
> +
> +/*
> + * io_mm_release - release MMU notifier
> + *
> + * Called when the mm exits. Some devices may still be bound to the
> io_mm. A few
> + * things need to be done before it is safe to release:
> + *
> + * - Tell the device driver to stop using this PASID.
> + * - Clear the PASID table and invalidate TLBs.
> + * - Drop all references to this io_mm.
> + */
> +static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	struct iommu_bond *bond, *next;
> +	struct io_mm *io_mm = to_io_mm(mn);
> +
> +	mutex_lock(&iommu_sva_lock);
> +	list_for_each_entry_safe(bond, next, &io_mm->devices, mm_head) {
> +		struct device *dev = bond->sva.dev;
> +		struct iommu_sva *sva = &bond->sva;
> +
> +		if (sva->ops && sva->ops->mm_exit &&
> +		    sva->ops->mm_exit(dev, sva, bond->drvdata))
> +			dev_WARN(dev, "possible leak of PASID %u",
> +				 io_mm->pasid);
> +
> +		/* unbind() frees the bond, we just detach it */
> +		io_mm_detach_locked(bond);
> +	}
> +	mutex_unlock(&iommu_sva_lock);
> +}
> +
> +static void io_mm_invalidate_range(struct mmu_notifier *mn,
> +				   struct mm_struct *mm, unsigned long start,
> +				   unsigned long end)
> +{
> +	struct iommu_bond *bond;
> +	struct io_mm *io_mm = to_io_mm(mn);
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> +		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> +				       start, end - start);
> +	rcu_read_unlock();
> +}
> +
> +static struct mmu_notifier_ops iommu_mmu_notifier_ops = {
> +	.alloc_notifier		= io_mm_alloc,
> +	.free_notifier		= io_mm_free,
> +	.release		= io_mm_release,
> +	.invalidate_range	= io_mm_invalidate_range,
> +};
> +
> +struct iommu_sva *
> +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm,
> +		       const struct io_mm_ops *ops, void *drvdata)
> +{
> +	struct io_mm *io_mm;
> +	struct iommu_sva *handle;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return ERR_PTR(-ENODEV);
> +
> +	mutex_lock(&param->sva_lock);
> +	if (!param->sva_param) {
> +		handle = ERR_PTR(-ENODEV);
> +		goto out_unlock;
> +	}
> +
> +	io_mm = io_mm_get(mm, ops, param->sva_param->min_pasid,
> +			  param->sva_param->max_pasid);
> +	if (IS_ERR(io_mm)) {
> +		handle = ERR_CAST(io_mm);
> +		goto out_unlock;
> +	}
> +
> +	handle = io_mm_attach(dev, io_mm, drvdata);
> +	if (IS_ERR(handle))
> +		io_mm_put(io_mm);
> +
> +out_unlock:
> +	mutex_unlock(&param->sva_lock);
> +	return handle;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_bind_generic);
> +
> +static void iommu_sva_unbind_locked(struct iommu_bond *bond)
> +{
> +	struct device *dev = bond->sva.dev;
> +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> +
> +	if (!refcount_dec_and_test(&bond->refs))
> +		return;
> +
Don't you need to free the bond here?

> +	io_mm_detach_locked(bond);
> +	param->nr_bonds--;
> +	kfree_rcu(bond, rcu_head);
> +}
> +
> +void iommu_sva_unbind_generic(struct iommu_sva *handle)
> +{
> +	struct iommu_param *param = handle->dev->iommu_param;
> +
> +	if (WARN_ON(!param))
> +		return;
> +
> +	mutex_lock(&param->sva_lock);
> +	mutex_lock(&iommu_sva_lock);
> +	iommu_sva_unbind_locked(to_iommu_bond(handle));
> +	mutex_unlock(&iommu_sva_lock);
> +	mutex_unlock(&param->sva_lock);
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_unbind_generic);
> +
> +/**
> + * iommu_sva_enable() - Enable Shared Virtual Addressing for a device
> + * @dev: the device
> + * @sva_param: the parameters.
> + *
> + * Called by an IOMMU driver to setup the SVA parameters
> + * @sva_param is duplicated and can be freed when this function
> returns.
> + *
> + * Return 0 if initialization succeeded, or an error.
> + */
The IOMMU vendor driver usually doesn't know whether a device will use the
SVA feature until the bind call. So we pretty much have to call this for
every device at init time?

> +int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param)
> +{
> +	int ret;
> +	struct iommu_sva_param *new_param;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return -ENODEV;
> +
> +	new_param = kmemdup(sva_param, sizeof(*new_param),
> GFP_KERNEL);
> +	if (!new_param)
> +		return -ENOMEM;
> +
> +	mutex_lock(&param->sva_lock);
> +	if (param->sva_param) {
> +		ret = -EEXIST;
> +		goto err_unlock;
> +	}
> +
> +	dev->iommu_param->sva_param = new_param;
> +	mutex_unlock(&param->sva_lock);
> +	return 0;
> +
> +err_unlock:
> +	mutex_unlock(&param->sva_lock);
> +	kfree(new_param);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_enable);
> +
> +/**
> + * iommu_sva_disable() - Disable Shared Virtual Addressing for a
> device
> + * @dev: the device
> + *
> + * IOMMU drivers call this to disable SVA.
> + */
> +int iommu_sva_disable(struct device *dev)
> +{
> +	int ret = 0;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return -EINVAL;
> +
> +	mutex_lock(&param->sva_lock);
> +	if (!param->sva_param) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
> +	/* Require that all contexts are unbound */
> +	if (param->sva_param->nr_bonds) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +
> +	kfree(param->sva_param);
> +	param->sva_param = NULL;
> +out_unlock:
> +	mutex_unlock(&param->sva_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_disable);
> +
> +bool iommu_sva_enabled(struct device *dev)
> +{
> +	bool enabled;
> +	struct iommu_param *param = dev->iommu_param;
> +
> +	if (!param)
> +		return false;
> +
> +	mutex_lock(&param->sva_lock);
> +	enabled = !!param->sva_param;
> +	mutex_unlock(&param->sva_lock);
> +	return enabled;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_enabled);
> +
> +int iommu_sva_get_pasid_generic(struct iommu_sva *handle)
> +{
> +	struct io_mm *io_mm;
> +	int pasid = IOMMU_PASID_INVALID;
> +	struct iommu_bond *bond = to_iommu_bond(handle);
> +
> +	rcu_read_lock();
> +	io_mm = rcu_dereference(bond->io_mm);
> +	if (io_mm)
> +		pasid = io_mm->pasid;
> +	rcu_read_unlock();
> +	return pasid;
> +}
> +EXPORT_SYMBOL_GPL(iommu_sva_get_pasid_generic);
> diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/iommu-sva.h
> new file mode 100644
> index 000000000000..dd55c2db0936
> --- /dev/null
> +++ b/drivers/iommu/iommu-sva.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * SVA library for IOMMU drivers
> + */
> +#ifndef _IOMMU_SVA_H
> +#define _IOMMU_SVA_H
> +
> +#include <linux/iommu.h>
> +#include <linux/kref.h>
> +#include <linux/mmu_notifier.h>
> +
> +struct io_mm_ops {
> +	/* Allocate a PASID context for an mm */
> +	void *(*alloc)(struct mm_struct *mm);
> +
> +	/*
> +	 * Attach a PASID context to a device. Write the entry into
> the PASID
> +	 * table.
> +	 *
> +	 * @attach_domain is true when no other device in the IOMMU
> domain is
> +	 *   already attached to this context. IOMMU drivers that
> share the
> +	 *   PASID tables within a domain don't need to write the
> PASID entry
> +	 *   when @attach_domain is false.
> +	 */
If we have a per-device PASID table, then we need to set up the PASID table
entry regardless of domain sharing. What is confusing to me is that the
domain is for DMA isolation of requests without PASID, but with SVA we don't
really care about domains. Sorry, it has been a long time since we
discussed this. I think this will work for VT-d, but I just wanted to make
sure I understand the intentions.
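For what it's worth, my reading is that a driver with per-device PASID
tables can simply ignore the flag; only drivers sharing one PASID table
across the domain use it to skip redundant writes. A sketch, with
hypothetical names:

	static int my_mm_attach(struct device *dev, int pasid, void *ctx,
				bool attach_domain)
	{
		/* Per-device PASID table: always write the entry */
		return my_write_pasid_entry(dev, pasid, ctx);
	}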

> +	int (*attach)(struct device *dev, int pasid, void *ctx,
> +		      bool attach_domain);
> +
> +	/*
> +	 * Detach a PASID context from a device. Clear the entry
> from the PASID
> +	 * table and invalidate if necessary.
> +	 *
> +	 * @detach_domain is true when no other device in the IOMMU
> domain is
> +	 *   still attached to this context. IOMMU drivers that
> share the PASID
> +	 *   table within a domain don't need to clear the PASID
> entry when
> +	 *   @detach_domain is false, only invalidate the caches.
> +	 */
> +	void (*detach)(struct device *dev, int pasid, void *ctx,
> +		       bool detach_domain);
> +
> +	/* Invalidate a range of addresses. Cannot sleep. */
> +	void (*invalidate)(struct device *dev, int pasid, void *ctx,
> +			   unsigned long vaddr, size_t size);
> +
> +	/* Free a context. Cannot sleep. */
> +	void (*release)(void *ctx);
> +};
> +
> +struct iommu_sva_param {
> +	u32			min_pasid;
> +	u32			max_pasid;
> +	int			nr_bonds;
> +};
> +
> +struct iommu_sva *
> +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm,
> +		       const struct io_mm_ops *ops, void *drvdata);
> +void iommu_sva_unbind_generic(struct iommu_sva *handle);
> +int iommu_sva_get_pasid_generic(struct iommu_sva *handle);
> +
> +int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param);
> +int iommu_sva_disable(struct device *dev);
> +bool iommu_sva_enabled(struct device *dev);
> +
> +#endif /* _IOMMU_SVA_H */
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3e3528436e0b..c8bd972c1788 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -164,6 +164,7 @@ static struct iommu_param *iommu_get_dev_param(struct device *dev)
>  		return NULL;
>  
>  	mutex_init(&param->lock);
> +	mutex_init(&param->sva_lock);
>  	dev->iommu_param = param;
>  	return param;
>  }
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 1739f8a7a4b4..83397ae88d2d 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -368,6 +368,7 @@ struct iommu_fault_param {
>   * struct iommu_param - collection of per-device IOMMU data
>   *
>   * @fault_param: IOMMU detected device fault reporting data
> + * @sva_param: IOMMU parameter for SVA
>   *
>   * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
>   *	struct iommu_group	*iommu_group;
> @@ -376,6 +377,8 @@ struct iommu_fault_param {
>  struct iommu_param {
>  	struct mutex lock;
>  	struct iommu_fault_param *fault_param;
> +	struct mutex sva_lock;
> +	struct iommu_sva_param *sva_param;
>  };
>  
>  int  iommu_device_register(struct iommu_device *iommu);

Thanks,

Jacob


* Re: [PATCH v4 06/26] iommu/sva: Register page fault handler
  2020-02-24 18:23 ` [PATCH v4 06/26] iommu/sva: Register page fault handler Jean-Philippe Brucker
@ 2020-02-26 19:39   ` Jacob Pan
  2020-02-28 14:44     ` Jean-Philippe Brucker
  0 siblings, 1 reply; 70+ messages in thread
From: Jacob Pan @ 2020-02-26 19:39 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, christian.koenig,
	yi.l.liu, zhangfei.gao, Jean-Philippe Brucker, jacob.jun.pan

On Mon, 24 Feb 2020 19:23:41 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> When enabling SVA, register the fault handler. Device driver will
> register an I/O page fault queue before or after calling
> iommu_sva_enable. The fault queue must be flushed before any io_mm is
> freed, to make sure that its PASID isn't used in any fault queue, and
> can be reallocated. Add iopf_queue_flush() calls in a few strategic
> locations.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  drivers/iommu/Kconfig     |  1 +
>  drivers/iommu/iommu-sva.c | 16 ++++++++++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index e4a42e1708b4..211684e785ea 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -106,6 +106,7 @@ config IOMMU_DMA
>  config IOMMU_SVA
>  	bool
>  	select IOASID
> +	select IOMMU_PAGE_FAULT
>  	select IOMMU_API
>  	select MMU_NOTIFIER
>  
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> index bfd0c477f290..494ca0824e4b 100644
> --- a/drivers/iommu/iommu-sva.c
> +++ b/drivers/iommu/iommu-sva.c
> @@ -366,6 +366,8 @@ static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  			dev_WARN(dev, "possible leak of PASID %u",
>  				 io_mm->pasid);
>  
> +		iopf_queue_flush_dev(dev, io_mm->pasid);
> +
>  		/* unbind() frees the bond, we just detach it */
>  		io_mm_detach_locked(bond);
>  	}
> @@ -442,11 +444,20 @@ static void iommu_sva_unbind_locked(struct iommu_bond *bond)
>  
>  void iommu_sva_unbind_generic(struct iommu_sva *handle)
>  {
> +	int pasid;
>  	struct iommu_param *param = handle->dev->iommu_param;
>  
>  	if (WARN_ON(!param))
>  		return;
>  
> +	/*
> +	 * Caller stopped the device from issuing PASIDs, now make
> sure they are
> +	 * out of the fault queue.
> +	 */
> +	pasid = iommu_sva_get_pasid_generic(handle);
> +	if (pasid != IOMMU_PASID_INVALID)
> +		iopf_queue_flush_dev(handle->dev, pasid);
> +
I have an ordering concern.
The caller can only stop the device from issuing page requests, but there
may still be in-flight requests inside the IOMMU. If we flush here before
clearing the PASID context, new requests might come in before the detach.
How about detaching first, then flushing? Then anything that comes in after
the detach would simply fault, and the flush would be clean.
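i.e. roughly (sketch only, keeping the device and PASID around for the
flush):

	struct device *dev = handle->dev;
	int pasid = iommu_sva_get_pasid_generic(handle);

	mutex_lock(&param->sva_lock);
	mutex_lock(&iommu_sva_lock);
	iommu_sva_unbind_locked(to_iommu_bond(handle));
	mutex_unlock(&iommu_sva_lock);
	mutex_unlock(&param->sva_lock);

	/* Anything the device sends after the detach simply faults */
	if (pasid != IOMMU_PASID_INVALID)
		iopf_queue_flush_dev(dev, pasid);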

>  	mutex_lock(&param->sva_lock);
>  	mutex_lock(&iommu_sva_lock);
>  	iommu_sva_unbind_locked(to_iommu_bond(handle));
> @@ -484,6 +495,10 @@ int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param)
>  		goto err_unlock;
>  	}
>  
> +	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
> +	if (ret)
> +		goto err_unlock;
> +
>  	dev->iommu_param->sva_param = new_param;
>  	mutex_unlock(&param->sva_lock);
>  	return 0;
> @@ -521,6 +536,7 @@ int iommu_sva_disable(struct device *dev)
>  		goto out_unlock;
>  	}
>  
> +	iommu_unregister_device_fault_handler(dev);
>  	kfree(param->sva_param);
>  	param->sva_param = NULL;
>  out_unlock:

[Jacob Pan]


* Re: [PATCH v4 07/26] arm64: mm: Pin down ASIDs for sharing mm with devices
  2020-02-24 18:23 ` [PATCH v4 07/26] arm64: mm: Pin down ASIDs for sharing mm with devices Jean-Philippe Brucker
@ 2020-02-27 17:43   ` Jonathan Cameron
  2020-03-04 14:10     ` Jean-Philippe Brucker
  0 siblings, 1 reply; 70+ messages in thread
From: Jonathan Cameron @ 2020-02-27 17:43 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Mon, 24 Feb 2020 19:23:42 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> To enable address space sharing with the IOMMU, introduce mm_context_get()
> and mm_context_put(), that pin down a context and ensure that it will keep
> its ASID after a rollover. Export the symbols to let the modular SMMUv3
> driver use them.
> 
> Pinning is necessary because a device constantly needs a valid ASID,
> unlike tasks that only require one when running. Without pinning, we would
> need to notify the IOMMU when we're about to use a new ASID for a task,
> and it would get complicated when a new task is assigned a shared ASID.
> Consider the following scenario with no ASID pinned:
> 
> 1. Task t1 is running on CPUx with shared ASID (gen=1, asid=1)
> 2. Task t2 is scheduled on CPUx, gets ASID (1, 2)
> 3. Task tn is scheduled on CPUy, a rollover occurs, tn gets ASID (2, 1)
>    We would now have to immediately generate a new ASID for t1, notify
>    the IOMMU, and finally enable task tn. We are holding the lock during
>    all that time, since we can't afford having another CPU trigger a
>    rollover. The IOMMU issues invalidation commands that can take tens of
>    milliseconds.
> 
> It gets needlessly complicated. All we wanted to do was schedule task tn,
> that has no business with the IOMMU. By letting the IOMMU pin tasks when
> needed, we avoid stalling the slow path, and let the pinning fail when
> we're out of shareable ASIDs.
> 
> After a rollover, the allocator expects at least one ASID to be available
> in addition to the reserved ones (one per CPU). So (NR_ASIDS - NR_CPUS -
> 1) is the maximum number of ASIDs that can be shared with the IOMMU.
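(For reference, a rough sketch of the intended usage from the SMMUv3 driver
side, assuming mm_context_get() returns 0 when it runs out of shareable
ASIDs:)

	asid = mm_context_get(mm);
	if (!asid)
		return ERR_PTR(-ENOSPC);	/* out of shareable ASIDs */

	/* ... install the ASID in the context descriptor ... */

	/* and on unbind/release: */
	mm_context_put(mm);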
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
A few more trivial points.

Thanks,

Jonathan

> ---
> v2->v4: handle KPTI
> ---
>  arch/arm64/include/asm/mmu.h         |   1 +
>  arch/arm64/include/asm/mmu_context.h |  11 ++-
>  arch/arm64/mm/context.c              | 103 +++++++++++++++++++++++++--
>  3 files changed, 109 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index e4d862420bb4..70ac3d4cbd3e 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -18,6 +18,7 @@
>  
>  typedef struct {
>  	atomic64_t	id;
> +	unsigned long	pinned;
>  	void		*vdso;
>  	unsigned long	flags;
>  } mm_context_t;
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 3827ff4040a3..70715c10c02a 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -175,7 +175,13 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp)
>  #define destroy_context(mm)		do { } while(0)
>  void check_and_switch_context(struct mm_struct *mm, unsigned int cpu);
>  
> -#define init_new_context(tsk,mm)	({ atomic64_set(&(mm)->context.id, 0); 0; })
> +static inline int
> +init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> +{
> +	atomic64_set(&mm->context.id, 0);
> +	mm->context.pinned = 0;
> +	return 0;
> +}
>  
>  #ifdef CONFIG_ARM64_SW_TTBR0_PAN
>  static inline void update_saved_ttbr0(struct task_struct *tsk,
> @@ -248,6 +254,9 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  void verify_cpu_asid_bits(void);
>  void post_ttbr_update_workaround(void);
>  
> +unsigned long mm_context_get(struct mm_struct *mm);
> +void mm_context_put(struct mm_struct *mm);
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* !__ASM_MMU_CONTEXT_H */
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 121aba5b1941..5558de88b67d 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -26,6 +26,10 @@ static DEFINE_PER_CPU(atomic64_t, active_asids);
>  static DEFINE_PER_CPU(u64, reserved_asids);
>  static cpumask_t tlb_flush_pending;
>  
> +static unsigned long max_pinned_asids;
> +static unsigned long nr_pinned_asids;
> +static unsigned long *pinned_asid_map;
> +
>  #define ASID_MASK		(~GENMASK(asid_bits - 1, 0))
>  #define ASID_FIRST_VERSION	(1UL << asid_bits)
>  
> @@ -73,6 +77,9 @@ void verify_cpu_asid_bits(void)
>  
>  static void set_kpti_asid_bits(void)
>  {
> +	unsigned int k;
> +	u8 *dst = (u8 *)asid_map;
> +	u8 *src = (u8 *)pinned_asid_map;
>  	unsigned int len = BITS_TO_LONGS(NUM_USER_ASIDS) * sizeof(unsigned long);
>  	/*
>  	 * In case of KPTI kernel/user ASIDs are allocated in
> @@ -80,7 +87,8 @@ static void set_kpti_asid_bits(void)
>  	 * is set, then the ASID will map only userspace. Thus
>  	 * mark even as reserved for kernel.
>  	 */
> -	memset(asid_map, 0xaa, len);
> +	for (k = 0; k < len; k++)
> +		dst[k] = src[k] | 0xaa;
>  }
>  
>  static void set_reserved_asid_bits(void)
> @@ -88,9 +96,12 @@ static void set_reserved_asid_bits(void)
>  	if (arm64_kernel_unmapped_at_el0())
>  		set_kpti_asid_bits();
>  	else
> -		bitmap_clear(asid_map, 0, NUM_USER_ASIDS);
> +		bitmap_copy(asid_map, pinned_asid_map, NUM_USER_ASIDS);
>  }
>  
> +#define asid_gen_match(asid) \
> +	(!(((asid) ^ atomic64_read(&asid_generation)) >> asid_bits))
> +

I'd have slightly preferred this bit of refactoring as a precursor patch.

>  static void flush_context(void)
>  {
>  	int i;
> @@ -161,6 +172,14 @@ static u64 new_context(struct mm_struct *mm)
>  		if (check_update_reserved_asid(asid, newasid))
>  			return newasid;
>  
> +		/*
> +		 * If it is pinned, we can keep using it. Note that reserved
> +		 * takes priority, because even if it is also pinned, we need to
> +		 * update the generation into the reserved_asids.
> +		 */
> +		if (mm->context.pinned)
> +			return newasid;
> +
>  		/*
>  		 * We had a valid ASID in a previous life, so try to re-use
>  		 * it if possible.
> @@ -219,8 +238,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>  	 *   because atomic RmWs are totally ordered for a given location.
>  	 */
>  	old_active_asid = atomic64_read(&per_cpu(active_asids, cpu));
> -	if (old_active_asid &&
> -	    !((asid ^ atomic64_read(&asid_generation)) >> asid_bits) &&
> +	if (old_active_asid && asid_gen_match(asid) &&
>  	    atomic64_cmpxchg_relaxed(&per_cpu(active_asids, cpu),
>  				     old_active_asid, asid))
>  		goto switch_mm_fastpath;
> @@ -228,7 +246,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>  	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
>  	/* Check that our ASID belongs to the current generation. */
>  	asid = atomic64_read(&mm->context.id);
> -	if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) {
> +	if (!asid_gen_match(asid)) {
>  		asid = new_context(mm);
>  		atomic64_set(&mm->context.id, asid);
>  	}
> @@ -251,6 +269,68 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>  		cpu_switch_mm(mm->pgd, mm);
>  }
>  
> +unsigned long mm_context_get(struct mm_struct *mm)
> +{
> +	unsigned long flags;
> +	u64 asid;
> +
> +	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
> +
> +	asid = atomic64_read(&mm->context.id);
> +
> +	if (mm->context.pinned) {
> +		mm->context.pinned++;
> +		asid &= ~ASID_MASK;
> +		goto out_unlock;
> +	}
> +
> +	if (nr_pinned_asids >= max_pinned_asids) {
> +		asid = 0;
> +		goto out_unlock;
> +	}
> +
> +	if (!asid_gen_match(asid)) {
> +		/*
> +		 * We went through one or more rollover since that ASID was
> +		 * used. Ensure that it is still valid, or generate a new one.
> +		 */
> +		asid = new_context(mm);
> +		atomic64_set(&mm->context.id, asid);
> +	}
> +
> +	asid &= ~ASID_MASK;
> +
> +	nr_pinned_asids++;
> +	__set_bit(asid2idx(asid), pinned_asid_map);
> +	mm->context.pinned++;
> +
> +out_unlock:
> +	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
> +
> +	/* Set the equivalent of USER_ASID_BIT */
> +	if (asid && IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0))
> +		asid |= 1;
> +
> +	return asid;
> +}
> +EXPORT_SYMBOL_GPL(mm_context_get);
> +
> +void mm_context_put(struct mm_struct *mm)
> +{
> +	unsigned long flags;
> +	u64 asid = atomic64_read(&mm->context.id) & ~ASID_MASK;
> +
> +	raw_spin_lock_irqsave(&cpu_asid_lock, flags);
> +
> +	if (--mm->context.pinned == 0) {
> +		__clear_bit(asid2idx(asid), pinned_asid_map);
> +		nr_pinned_asids--;
> +	}
> +
> +	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(mm_context_put);
> +
>  /* Errata workaround post TTBRx_EL1 update. */
>  asmlinkage void post_ttbr_update_workaround(void)
>  {
> @@ -279,6 +359,19 @@ static int asids_init(void)
>  		panic("Failed to allocate bitmap for %lu ASIDs\n",
>  		      NUM_USER_ASIDS);
>  
> +	pinned_asid_map = kcalloc(BITS_TO_LONGS(NUM_USER_ASIDS),
> +				  sizeof(*pinned_asid_map), GFP_KERNEL);
> +	if (!pinned_asid_map)
> +		panic("Failed to allocate pinned bitmap\n");
Perhaps "Failed to allocate pinnned asid bitmap\n"
> +
> +	/*
> +	 * We assume that an ASID is always available after a rollover. This
> +	 * means that even if all CPUs have a reserved ASID, there still is at
> +	 * least one slot available in the asid map.
> +	 */
> +	max_pinned_asids = num_available_asids - num_possible_cpus() - 1;
> +	nr_pinned_asids = 0;
> +
>  	/*
>  	 * We cannot call set_reserved_asid_bits() here because CPU
>  	 * caps are not finalized yet, so it is safer to assume KPTI




* Re: [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices
  2020-02-24 18:23 ` [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices Jean-Philippe Brucker
  2020-02-26  8:44   ` Xu Zaibo
@ 2020-02-27 18:17   ` Jonathan Cameron
  2020-03-04 14:08     ` Jean-Philippe Brucker
  1 sibling, 1 reply; 70+ messages in thread
From: Jonathan Cameron @ 2020-02-27 18:17 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Mon, 24 Feb 2020 19:23:58 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> 
> The SMMU provides a Stall model for handling page faults in platform
> devices. It is similar to PCI PRI, but doesn't require devices to have
> their own translation cache. Instead, faulting transactions are parked and
> the OS is given a chance to fix the page tables and retry the transaction.
> 
> Enable stall for devices that support it (opt-in by firmware). When an
> event corresponds to a translation error, call the IOMMU fault handler. If
> the fault is recoverable, it will call us back to terminate or continue
> the stall.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
One question inline.

Thanks,

> ---
>  drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++--
>  drivers/iommu/of_iommu.c    |   5 +-
>  include/linux/iommu.h       |   2 +
>  3 files changed, 269 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 6a5987cce03f..da5dda5ba26a 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -374,6 +374,13 @@


> +/*
> + * arm_smmu_flush_evtq - wait until all events currently in the queue have been
> + *                       consumed.
> + *
> + * Wait until the evtq thread finished a batch, or until the queue is empty.
> + * Note that we don't handle overflows on q->batch. If it occurs, just wait for
> + * the queue to be empty.
> + */
> +static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid)
> +{
> +	int ret;
> +	u64 batch;
> +	struct arm_smmu_device *smmu = cookie;
> +	struct arm_smmu_queue *q = &smmu->evtq.q;
> +
> +	spin_lock(&q->wq.lock);
> +	if (queue_sync_prod_in(q) == -EOVERFLOW)
> +		dev_err(smmu->dev, "evtq overflow detected -- requests lost\n");
> +
> +	batch = q->batch;

So this is trying to be sure we have advanced the queue 2 spots?

Is there a potential race here?  q->batch could have been updated before we take
a local copy.

> +	ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) ||
> +					      q->batch >= batch + 2);
> +	spin_unlock(&q->wq.lock);
> +
> +	return ret;
> +}
> +
...



* Re: [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support
  2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
                   ` (25 preceding siblings ...)
  2020-02-24 18:24 ` [PATCH v4 26/26] iommu/arm-smmu-v3: Add support for PRI Jean-Philippe Brucker
@ 2020-02-27 18:22 ` Jonathan Cameron
  26 siblings, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2020-02-27 18:22 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm,
	mark.rutland, kevin.tian, jacob.jun.pan, catalin.marinas, joro,
	robin.murphy, robh+dt, yi.l.liu, zhangfei.gao, will,
	christian.koenig, baolu.lu

On Mon, 24 Feb 2020 19:23:35 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> Shared Virtual Addressing (SVA) allows to share process page tables with
> devices using the IOMMU. Add a generic implementation of the IOMMU SVA
> API, and add support in the Arm SMMUv3 driver.
> 
> Previous versions of this patchset were sent over a year ago [1][2] but
> we've made a lot of progress since then:
> 
> * ATS support for SMMUv3 was merged in v5.2.
> * The bind() and fault reporting APIs have been merged in v5.3.
> * IOASID were added in v5.5.
> * SMMUv3 PASID was added in v5.6, with some pending for v5.7.
> 
> * The first user of the bind() API will be merged in v5.7 [3]. The zip
>   accelerator is also the first piece of hardware that I've been able to
>   use for testing (previous versions were developed with software models)
>   and I now have tools for evaluating SVA performance. Unfortunately I
>   still don't have hardware that supports ATS and PRI; the zip accelerator
>   uses stall.
> 
> These are the remaining changes for SVA support in SMMUv3. Since v3 [1]
> I fixed countless bugs and - I think - addressed everyone's comments.
> Thanks to recent MMU notifier rework, iommu-sva.c is a lot more
> straightforward. I'm still unhappy with the complicated locking in the
> SMMUv3 driver resulting from patch 12 (Seize private ASID), but I
> haven't found anything better.
> 
> Please find all SVA patches on branches sva/current and sva/zip-devel at
> https://jpbrucker.net/git/linux
> 
> [1] https://lore.kernel.org/linux-iommu/20180920170046.20154-1-jean-philippe.brucker@arm.com/
> [2] https://lore.kernel.org/linux-iommu/20180511190641.23008-1-jean-philippe.brucker@arm.com/
> [3] https://lore.kernel.org/linux-iommu/1581407665-13504-1-git-send-email-zhangfei.gao@linaro.org/

Hi Jean-Philippe.

Great to see this progressing.  Other than the few places I've commented,
it all looks good to me.

Thanks,

Jonathan

> 
> Jean-Philippe Brucker (26):
>   mm/mmu_notifiers: pass private data down to alloc_notifier()
>   iommu/sva: Manage process address spaces
>   iommu: Add a page fault handler
>   iommu/sva: Search mm by PASID
>   iommu/iopf: Handle mm faults
>   iommu/sva: Register page fault handler
>   arm64: mm: Pin down ASIDs for sharing mm with devices
>   iommu/io-pgtable-arm: Move some definitions to a header
>   iommu/arm-smmu-v3: Manage ASIDs with xarray
>   arm64: cpufeature: Export symbol read_sanitised_ftr_reg()
>   iommu/arm-smmu-v3: Share process page tables
>   iommu/arm-smmu-v3: Seize private ASID
>   iommu/arm-smmu-v3: Add support for VHE
>   iommu/arm-smmu-v3: Enable broadcast TLB maintenance
>   iommu/arm-smmu-v3: Add SVA feature checking
>   iommu/arm-smmu-v3: Add dev_to_master() helper
>   iommu/arm-smmu-v3: Implement mm operations
>   iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops
>   iommu/arm-smmu-v3: Add support for Hardware Translation Table Update
>   iommu/arm-smmu-v3: Maintain a SID->device structure
>   iommu/arm-smmu-v3: Ratelimit event dump
>   dt-bindings: document stall property for IOMMU masters
>   iommu/arm-smmu-v3: Add stall support for platform devices
>   PCI/ATS: Add PRI stubs
>   PCI/ATS: Export symbols of PRI functions
>   iommu/arm-smmu-v3: Add support for PRI
> 
>  .../devicetree/bindings/iommu/iommu.txt       |   18 +
>  arch/arm64/include/asm/mmu.h                  |    1 +
>  arch/arm64/include/asm/mmu_context.h          |   11 +-
>  arch/arm64/kernel/cpufeature.c                |    1 +
>  arch/arm64/mm/context.c                       |  103 +-
>  drivers/iommu/Kconfig                         |   13 +
>  drivers/iommu/Makefile                        |    2 +
>  drivers/iommu/arm-smmu-v3.c                   | 1354 +++++++++++++++--
>  drivers/iommu/io-pgfault.c                    |  533 +++++++
>  drivers/iommu/io-pgtable-arm.c                |   27 +-
>  drivers/iommu/io-pgtable-arm.h                |   30 +
>  drivers/iommu/iommu-sva.c                     |  596 ++++++++
>  drivers/iommu/iommu-sva.h                     |   64 +
>  drivers/iommu/iommu.c                         |    1 +
>  drivers/iommu/of_iommu.c                      |    5 +-
>  drivers/misc/sgi-gru/grutlbpurge.c            |    4 +-
>  drivers/pci/ats.c                             |    4 +
>  include/linux/iommu.h                         |   73 +
>  include/linux/mmu_notifier.h                  |   10 +-
>  include/linux/pci-ats.h                       |    8 +
>  mm/mmu_notifier.c                             |    6 +-
>  21 files changed, 2699 insertions(+), 165 deletions(-)
>  create mode 100644 drivers/iommu/io-pgfault.c
>  create mode 100644 drivers/iommu/io-pgtable-arm.h
>  create mode 100644 drivers/iommu/iommu-sva.c
>  create mode 100644 drivers/iommu/iommu-sva.h
> 




* Re: [PATCH v4 25/26] PCI/ATS: Export symbols of PRI functions
  2020-02-24 18:24 ` [PATCH v4 25/26] PCI/ATS: Export symbols of PRI functions Jean-Philippe Brucker
@ 2020-02-27 20:55   ` Bjorn Helgaas
  0 siblings, 0 replies; 70+ messages in thread
From: Bjorn Helgaas @ 2020-02-27 20:55 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao

Subject could be simply "PCI/ATS: Export PRI functions"

On Mon, Feb 24, 2020 at 07:24:00PM +0100, Jean-Philippe Brucker wrote:
> The SMMUv3 driver uses pci_{enable,disable}_pri() and related
> functions. Export those functions to allow the driver to be built as a
> module.
> 
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/ats.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index bbfd0d42b8b9..fc8fc6fc8bd5 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -197,6 +197,7 @@ void pci_pri_init(struct pci_dev *pdev)
>  	if (status & PCI_PRI_STATUS_PASID)
>  		pdev->pasid_required = 1;
>  }
> +EXPORT_SYMBOL_GPL(pci_pri_init);
>  
>  /**
>   * pci_enable_pri - Enable PRI capability
> @@ -243,6 +244,7 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs)
>  
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(pci_enable_pri);
>  
>  /**
>   * pci_disable_pri - Disable PRI capability
> @@ -322,6 +324,7 @@ int pci_reset_pri(struct pci_dev *pdev)
>  
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(pci_reset_pri);
>  
>  /**
>   * pci_prg_resp_pasid_required - Return PRG Response PASID Required bit
> @@ -337,6 +340,7 @@ int pci_prg_resp_pasid_required(struct pci_dev *pdev)
>  
>  	return pdev->pasid_required;
>  }
> +EXPORT_SYMBOL_GPL(pci_prg_resp_pasid_required);
>  #endif /* CONFIG_PCI_PRI */
>  
>  #ifdef CONFIG_PCI_PASID
> -- 
> 2.25.0
> 


* Re: [PATCH v4 24/26] PCI/ATS: Add PRI stubs
  2020-02-24 18:23 ` [PATCH v4 24/26] PCI/ATS: Add PRI stubs Jean-Philippe Brucker
@ 2020-02-27 20:55   ` Bjorn Helgaas
  0 siblings, 0 replies; 70+ messages in thread
From: Bjorn Helgaas @ 2020-02-27 20:55 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao

On Mon, Feb 24, 2020 at 07:23:59PM +0100, Jean-Philippe Brucker wrote:
> The SMMUv3 driver, which can be built without CONFIG_PCI, will soon gain
> support for PRI.  Partially revert commit c6e9aefbf9db ("PCI/ATS: Remove
> unused PRI and PASID stubs") to re-introduce the PRI stubs, and avoid
> adding more #ifdefs to the SMMU driver.
> 
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  include/linux/pci-ats.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
> index f75c307f346d..e9e266df9b37 100644
> --- a/include/linux/pci-ats.h
> +++ b/include/linux/pci-ats.h
> @@ -28,6 +28,14 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs);
>  void pci_disable_pri(struct pci_dev *pdev);
>  int pci_reset_pri(struct pci_dev *pdev);
>  int pci_prg_resp_pasid_required(struct pci_dev *pdev);
> +#else /* CONFIG_PCI_PRI */
> +static inline int pci_enable_pri(struct pci_dev *pdev, u32 reqs)
> +{ return -ENODEV; }
> +static inline void pci_disable_pri(struct pci_dev *pdev) { }
> +static inline int pci_reset_pri(struct pci_dev *pdev)
> +{ return -ENODEV; }
> +static inline int pci_prg_resp_pasid_required(struct pci_dev *pdev)
> +{ return 0; }
>  #endif /* CONFIG_PCI_PRI */
>  
>  #ifdef CONFIG_PCI_PASID
> -- 
> 2.25.0
> 


* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-25 14:08       ` Jason Gunthorpe
@ 2020-02-28 14:39         ` Jean-Philippe Brucker
  2020-02-28 14:48           ` Jason Gunthorpe
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-28 14:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Tue, Feb 25, 2020 at 10:08:14AM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 25, 2020 at 10:24:39AM +0100, Jean-Philippe Brucker wrote:
> > On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> > > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> > > > add a get/put scheme for the registration") provides a convenient way
> > > > for users to attach notifier data to an mm. However, it would be even
> > > > better to create this notifier data atomically.
> > > > 
> > > > Since the alloc_notifier() callback only takes an mm argument at the
> > > > moment, some users have to perform the allocation in two steps.
> > > > alloc_notifier() initially creates an incomplete structure, which is
> > > > then finalized using more context once mmu_notifier_get() returns. This
> > > > second step requires carrying an initialization lock in the notifier
> > > > data and playing dirty tricks to order memory accesses against live
> > > > invalidation.
> > > 
> > > This was the intended pattern. There shouldn't be a real issue as
> > > there shouldn't be any data on which to invalidate, i.e. the later patch
> > > does:
> > > 
> > > +       list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> > > 
> > > And that list is empty post-allocation, so no 'dirty tricks' required.
> > 
> > Before introducing this patch I had the following code:
> > 
> > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > +		/*
> > +		 * To ensure that we observe the initialization of io_mm fields
> > +		 * by io_mm_finalize() before the registration of this bond to
> > +		 * the list by io_mm_attach(), introduce an address dependency
> > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > +		 * from list_add_rcu().
> > +		 */
> > +		io_mm = rcu_dereference(bond->io_mm);
> 
> A rcu_dereference isn't need here, just a normal derference is fine.

bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
which does bond->io_mm under rcu_read_lock())

> 
> > +		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> > +				       start, end - start);
> > +	}
> > 
> > (1) io_mm_get() would obtain an empty io_mm from iommu_notifier_get().
> > (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc.
> > (3) finally io_mm_attach() would add the bond to io_mm->devices.
> > 
> > Since the above code can run before (2) it needs to observe valid
> > io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond
> > initialized by (3). Which I believe requires the address dependency from
> > the rcu_dereference() above or some stronger barrier to pair with the
> > list_add_rcu().
> 
> The list_for_each_entry_rcu() is an acquire that already pairs with
> the release in list_add_rcu(), all you need is a data dependency chain
> starting on bond to be correct on ordering.
> 
> But this is super tricky :\
> 
> > If io_mm->ctx and io_mm->ops are already valid before the
> > mmu notifier is published, then we don't need that stuff.
> 
> So, this trickyness with RCU is not a bad reason to introduce the priv
> scheme, maybe explain it in the commit message?

Ok, I've added this to the commit message:

    The IOMMU SVA module, which attaches an mm to multiple devices,
    exemplifies this situation. In essence it does:

            mmu_notifier_get()
              alloc_notifier()
                 A = kzalloc()
              /* MMU notifier is published */
            A->ctx = ctx;                           // (1)
            device->A = A;
            list_add_rcu(device, A->devices);       // (2)

    The invalidate notifier, which may start running before A is fully
    initialized at (1), does the following:

            io_mm_invalidate(A)
              list_for_each_entry_rcu(device, A->devices)
                A = device->A;                      // (3)
                device->invalidate(A->ctx)

    To ensure that an invalidate() thread observes write (1) before (2), it
    needs the address dependency (3). The resulting code is subtle and
    difficult to understand. If instead we fully initialize object A before
    publishing the MMU notifier, we don't need the complexity added by (3).
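
For contrast, the flow with this patch becomes roughly the following (sketch
only, assuming the private data simply carries ctx):

            mmu_notifier_get(priv)
              alloc_notifier(priv)
                 A = kzalloc()
                 A->ctx = priv->ctx;                  // (1) before publication
              /* MMU notifier is published */
            device->A = A;
            list_add_rcu(device, A->devices);         // (2)

so an invalidate() thread walking A->devices can use A->ctx directly, without
the address dependency (3).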


I'll try to improve the wording before sending next version.

Thanks,
Jean


* Re: [PATCH v4 02/26] iommu/sva: Manage process address spaces
  2020-02-26 19:13   ` Jacob Pan
@ 2020-02-28 14:40     ` Jean-Philippe Brucker
  2020-02-28 14:57       ` Jason Gunthorpe
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-28 14:40 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, christian.koenig,
	yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

On Wed, Feb 26, 2020 at 11:13:20AM -0800, Jacob Pan wrote:
> Hi Jean,
> 
> A few comments inline. I am also trying to converge to the common sva
> APIs. I sent out the first step w/o iopage fault and the generic ops
> you have here.

Great, thanks for sending it out; it's on my list to look at.

> On Mon, 24 Feb 2020 19:23:37 +0100
> Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> 
> > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > 
> > Add a small library to help IOMMU drivers manage process address
> > spaces bound to their devices. Register an MMU notifier to track
> > modification on each address space bound to one or more devices.
> > 
> > IOMMU drivers must implement the io_mm_ops and can then use the
> > helpers provided by this library to easily implement the SVA API
> > introduced by commit 26b25a2b98e4. The io_mm_ops are:
> > 
> > void *alloc(struct mm_struct *)
> >   Allocate a PASID context private to the IOMMU driver. There is a
> >   single context per mm. IOMMU drivers may perform arch-specific
> >   operations in there, for example pinning down a CPU ASID (on Arm).
> > 
> > int attach(struct device *, int pasid, void *ctx, bool attach_domain)
> >   Attach a context to the device, by setting up the PASID table entry.
> > 
> > int invalidate(struct device *, int pasid, void *ctx,
> >                unsigned long vaddr, size_t size)
> >   Invalidate TLB entries for this address range.
> > 
> > int detach(struct device *, int pasid, void *ctx, bool detach_domain)
> >   Detach a context from the device, by clearing the PASID table entry
> >   and invalidating cached entries.
> > 
> > void free(void *ctx)
> you meant release()?

Yes

[...]
> > +/**
> > + * DOC: io_mm model
> > + *
> > + * The io_mm keeps track of process address spaces shared between
> > CPU and IOMMU.
> > + * The following example illustrates the relation between structures
> > + * iommu_domain, io_mm and iommu_sva. The iommu_sva struct is a bond
> > between
> > + * io_mm and device. A device can have multiple io_mm and an io_mm
> > may be bound
> > + * to multiple devices.
> > + *              ___________________________
> > + *             |  IOMMU domain A           |
> > + *             |  ________________         |
> > + *             | |  IOMMU group   |        +------- io_pgtables
> > + *             | |                |        |
> > + *             | |   dev 00:00.0 ----+------- bond 1 --- io_mm X
> > + *             | |________________|   \    |
> > + *             |                       '----- bond 2 ---.
> > + *             |___________________________|             \
> > + *              ___________________________               \
> > + *             |  IOMMU domain B           |             io_mm Y
> > + *             |  ________________         |             / /
> > + *             | |  IOMMU group   |        |            / /
> > + *             | |                |        |           / /
> > + *             | |   dev 00:01.0 ------------ bond 3 -' /
> > + *             | |   dev 00:01.1 ------------ bond 4 --'
> > + *             | |________________|        |
> > + *             |                           +------- io_pgtables
> > + *             |___________________________|
> > + *
> > + * In this example, device 00:00.0 is in domain A, devices 00:01.*
> > are in domain
> > + * B. All devices within the same domain access the same address
> > spaces.
> Hmm, devices in domain A have access to both X & Y, isn't that
> contradictory?

I guess it's unclear; this is meant to explain that any device in domain B,
for example, would access all address spaces bound to any other device in
that domain.

> 
> > Device
> > + * 00:00.0 accesses address spaces X and Y, each corresponding to an
> > mm_struct.
> > + * Devices 00:01.* only access address space Y. In addition each
> > + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable,
> > that is
> > + * managed with iommu_map()/iommu_unmap(), and isn't shared with the
> > CPU MMU.
> So this would allow IOVA and SVA to co-exist in the same address space?

Hmm, not in the same address space, but they can co-exist in a device. In
fact the endpoint I'm testing (hisi zip accelerator) already needs normal
DMA alongside SVA for queue management. This one is integrated on an
Arm-based platform so shouldn't be a concern for VT-d at the moment, but
I suspect we might see more of this kind of device with mixed DMA.

In addition on Arm MSI addresses are translated by the IOMMU, and since
they are requests w/o PASID they need the private address space on entry 0.

Are you not planning to use the RID_PASID entry of Scalable-Mode
Context-Entry in VT-d?

> I guess this is the PASID 0 for DMA requests w/o PASID. If that is the
> case, perhaps it needs more explanation since the private address space
> also has a private PASID within the domain.

The last sentence refers to this private address space used for requests
w/o PASID. I don't like referring to it as "PASID 0" since it might be
more confusing. It's entry 0 of the PASID table reserved for requests
without PASID.

I think I should just remove this sentence and try to make the last
paragraph of the comment, which refers to the same thing, clearer. I'll
also drop io_pgtables from the above diagram to keep things on point.

> > + *
> > + * To obtain the above configuration, users would for instance issue
> > the
> > + * following calls:
> > + *
> > + *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> bond 1
> > + *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> bond 2
> > + *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> bond 3
> > + *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> bond 4
> > + *
> > + * A single Process Address Space ID (PASID) is allocated for each
> > mm. In the
> > + * example, devices use PASID 1 to read/write into address space X
> > and PASID 2
> > + * to read/write into address space Y. Calling iommu_sva_get_pasid()
> > on bond 1
> > + * returns 1, and calling it on bonds 2-4 returns 2.
> > + *
> > + * Hardware tables describing this configuration in the IOMMU would
> > typically
> > + * look like this:
> > + *
> > + *                                PASID tables
> > + *                                 of domain A
> > + *                              .->+--------+
> > + *                             / 0 |        |-------> io_pgtable
> > + *                            /    +--------+
> > + *            Device tables  /   1 |        |-------> pgd X
> > + *              +--------+  /      +--------+
> > + *      00:00.0 |      A |-'     2 |        |--.
> > + *              +--------+         +--------+   \
> > + *              :        :       3 |        |    \
> > + *              +--------+         +--------+     --> pgd Y
> > + *      00:01.0 |      B |--.                    /
> > + *              +--------+   \                  |
> > + *      00:01.1 |      B |----+   PASID tables  |
> > + *              +--------+     \   of domain B  |
> > + *                              '->+--------+   |
> > + *                               0 |        |-- | --> io_pgtable
> > + *                                 +--------+   |
> > + *                               1 |        |   |
> > + *                                 +--------+   |
> > + *                               2 |        |---'
> > + *                                 +--------+
> > + *                               3 |        |
> > + *                                 +--------+
> > + *
> > + * With this model, a single call binds all devices in a given
> > domain to an
> > + * address space. Other devices in the domain will get the same bond
> > implicitly.
> > + * However, users must issue one bind() for each device, because
> > IOMMUs may
> > + * implement SVA differently. Furthermore, mandating one bind() per
> > device
> > + * allows the driver to perform sanity-checks on device capabilities.
> > + *
> > + * In some IOMMUs, one entry of the PASID table (typically the first
> > one) can
> > + * hold non-PASID translations. In this case PASID 0 is reserved and
> > the first
> > + * entry points to the io_pgtable pointer. In other IOMMUs the
> > io_pgtable
> > + * pointer is held in the device table and PASID 0 is available to
> > the
> > + * allocator.
> > + */
[...]
> > +static struct iommu_sva *
> > +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata)
> > +{
> > +	int ret = 0;
> > +	bool attach_domain = true;
> > +	struct iommu_bond *bond, *tmp;
> > +	struct iommu_domain *domain, *other;
> > +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> > +
> > +	domain = iommu_get_domain_for_dev(dev);
> > +
> > +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> > +	if (!bond)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	bond->sva.dev	= dev;
> > +	bond->drvdata	= drvdata;
> > +	refcount_set(&bond->refs, 1);
> > +	RCU_INIT_POINTER(bond->io_mm, io_mm);
> > +
> > +	mutex_lock(&iommu_sva_lock);
> > +	/* Is it already bound to the device or domain? */
> > +	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> > +		if (tmp->sva.dev != dev) {
> > +			other =
> > iommu_get_domain_for_dev(tmp->sva.dev);
> > +			if (domain == other)
> > +				attach_domain = false;
> > +
> > +			continue;
> At this point, we already know this is a new device trying to attach to
> one of io_mm's existing domains.
>
> So there is no need to continue
> checking, right? Perhaps check like this?
> -               if (tmp->sva.dev != dev) {
> +               if (tmp->sva.dev != dev && attach_domain) {

That doesn't seem right, we need the 'continue'. I'll turn this around
into 'if (tmp->sva.dev == dev)' to make things more readable.
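
Roughly like this (sketch only; locking and the drvdata check are omitted, and
reuse_existing_bond() stands in for the refcount_inc()/io_mm_put()/kfree()
sequence above):

	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
		if (tmp->sva.dev == dev)
			/* Existing bond for this device: take a ref, return it */
			return reuse_existing_bond(tmp);

		/* Another device: note whether its domain is already attached */
		other = iommu_get_domain_for_dev(tmp->sva.dev);
		if (domain == other)
			attach_domain = false;
	}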

> > +		}
> > +
> > +		if (WARN_ON(tmp->drvdata != drvdata)) {
> > +			ret = -EINVAL;
> > +			goto err_free;
> > +		}
> > +
> > +		/*
> > +		 * Hold a single io_mm reference per bond. Note that
> > we can't
> > +		 * return an error after this, otherwise the caller
> > would drop
> > +		 * an additional reference to the io_mm.
> > +		 */
> > +		refcount_inc(&tmp->refs);
> > +		io_mm_put(io_mm);
> > +		kfree(bond);
> Can bond be allocated after searching for existing bond or domain? If
> so, we can avoid free bond here.

Yes, and I think we can simplify the whole function further. I think I
wrote it that way to have the kzalloc() be outside iommu_sva_lock, back
when it was a spinlock.

> > +		mutex_unlock(&iommu_sva_lock);
> > +		return &tmp->sva;
> > +	}
> > +
> > +	list_add_rcu(&bond->mm_head, &io_mm->devices);
> > +	param->nr_bonds++;
> > +	mutex_unlock(&iommu_sva_lock);
> > +
> > +	ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid,
> > io_mm->ctx,
> > +				 attach_domain);
> For VT-d, if a device is trying to do an SVA bind, there would not be a DMA
> domain. SVA should own the entire address space, no IOVA.

Do you mean PASID table rather than address space?

> So this
> attach() call is for the VT-d driver to set up the first PASID table entry
> regardless of whether attach_domain is true or false?

Yes ignoring the attach_domain parameter should be fine (more below). 

[...]
> > +static void iommu_sva_unbind_locked(struct iommu_bond *bond)
> > +{
> > +	struct device *dev = bond->sva.dev;
> > +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> > +
> > +	if (!refcount_dec_and_test(&bond->refs))
> > +		return;
> > +
> don't you need to free bond here?

We free it in the rcu callback below

> > +	io_mm_detach_locked(bond);
> > +	param->nr_bonds--;
> > +	kfree_rcu(bond, rcu_head);
> > +}
> > +
> > +void iommu_sva_unbind_generic(struct iommu_sva *handle)
> > +{
> > +	struct iommu_param *param = handle->dev->iommu_param;
> > +
> > +	if (WARN_ON(!param))
> > +		return;
> > +
> > +	mutex_lock(&param->sva_lock);
> > +	mutex_lock(&iommu_sva_lock);
> > +	iommu_sva_unbind_locked(to_iommu_bond(handle));
> > +	mutex_unlock(&iommu_sva_lock);
> > +	mutex_unlock(&param->sva_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_sva_unbind_generic);
> > +
> > +/**
> > + * iommu_sva_enable() - Enable Shared Virtual Addressing for a device
> > + * @dev: the device
> > + * @sva_param: the parameters.
> > + *
> > + * Called by an IOMMU driver to setup the SVA parameters
> > + * @sva_param is duplicated and can be freed when this function
> > returns.
> > + *
> > + * Return 0 if initialization succeeded, or an error.
> > + */
> IOMMU vendor drivers usually don't know when the device SVA feature will
> be used until the bind call. So we pretty much have to call this for every
> device during init time?

Not necessarily. Before bind the device driver should call
iommu_dev_enable_feature(dev, IOMMU_FEAT_SVA), which is when SMMUv3
invokes iommu_sva_enable().
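
So the expected driver-side ordering is roughly the following (sketch, using
the names from this series):

	/* once, at device setup time */
	ret = iommu_dev_enable_feature(dev, IOMMU_FEAT_SVA);
	if (ret)
		return ret;	/* SVA not available, fall back to classic DMA */

	/* later, for each process that wants to use the device */
	handle = iommu_sva_bind_device(dev, current->mm, drvdata);
	if (IS_ERR(handle))
		return PTR_ERR(handle);
	pasid = iommu_sva_get_pasid(handle);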

[...]
> > +struct io_mm_ops {
> > +	/* Allocate a PASID context for an mm */
> > +	void *(*alloc)(struct mm_struct *mm);
> > +
> > +	/*
> > +	 * Attach a PASID context to a device. Write the entry into
> > the PASID
> > +	 * table.
> > +	 *
> > +	 * @attach_domain is true when no other device in the IOMMU
> > domain is
> > +	 *   already attached to this context. IOMMU drivers that
> > share the
> > +	 *   PASID tables within a domain don't need to write the
> > PASID entry
> > +	 *   when @attach_domain is false.
> > +	 */
> If we have per device PASID table, then we need to set up PASID table
> entry regardless of the domain sharing.

Yes, the attach_domain is a hint for IOMMU drivers that handle PASID
tables per domain (SMMUv3). If PASID tables are per device then it can be
ignored. I added it to the interface because it's a lot more difficult to
check from within the SMMU driver, whereas iommu-sva already iterates over
all devices attached to an io_mm. Arguably the hint isn't as useful on
attach as on detach, where we must not clear the PASID table entry if
other devices in the domain are still using it.
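
To illustrate, a driver with per-domain PASID tables could implement the
attach op along these lines (sketch only, my_smmu_write_ctx_desc() is a
made-up helper):

	static int my_smmu_mm_attach(struct device *dev, int pasid, void *ctx,
				     bool attach_domain)
	{
		/*
		 * The PASID table is shared by the whole domain, so only the
		 * first device to attach needs to write the descriptor.
		 */
		if (attach_domain)
			return my_smmu_write_ctx_desc(dev, pasid, ctx);
		return 0;
	}

A driver with per-device PASID tables would ignore attach_domain and write its
own table unconditionally.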

> What is confusing to me is that
> domain is for DMA isolation on requests w/o PASID, but with SVA we don't
> really care about domains. Sorry, it has been a long time since we
> discussed this. I think it will work for VT-d but just wanted to make sure
> I understand the intentions.

No problem, it has been a while and I don't remember the rationale for
every choice. It's good to question whether they're still valid.

I find the per-domain PASID table to be a good model when reasoning about
IOMMU groups. In pci_device_group() a single group is created for devices
whose Requester ID alias, and they all get the same domain. In a buggy
system, if a device can issue DMA with the RID of another, then regardless
of PASID the IOMMU cannot isolate them. Having per-device PASID table
doesn't add any isolation but may hide the flaw from the user, if they
think that binding an mm to device A prevents a DMA-aliased device B from
accessing it.

This is hypothetical because we don't allow SVA for multi-device groups at
the moment (sanity-check would be messy) but maybe buggy implementations
will want this support in the future.

In the normal case, one device per domain, having PASID tables on the
domain rather than device doesn't make a difference. It makes a difference
if the user wants to put multiple devices in the same domain (e.g. VFIO
container). I don't know if that's a use-case.

Thanks,
Jean


* Re: [PATCH v4 02/26] iommu/sva: Manage process address spaces
  2020-02-26 12:35   ` Jonathan Cameron
@ 2020-02-28 14:43     ` Jean-Philippe Brucker
  2020-02-28 16:26       ` Jonathan Cameron
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-28 14:43 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Wed, Feb 26, 2020 at 12:35:06PM +0000, Jonathan Cameron wrote:
> > + * A single Process Address Space ID (PASID) is allocated for each mm. In the
> > + * example, devices use PASID 1 to read/write into address space X and PASID 2
> > + * to read/write into address space Y. Calling iommu_sva_get_pasid() on bond 1
> > + * returns 1, and calling it on bonds 2-4 returns 2.
> > + *
> > + * Hardware tables describing this configuration in the IOMMU would typically
> > + * look like this:
> > + *
> > + *                                PASID tables
> > + *                                 of domain A
> > + *                              .->+--------+
> > + *                             / 0 |        |-------> io_pgtable
> > + *                            /    +--------+
> > + *            Device tables  /   1 |        |-------> pgd X
> > + *              +--------+  /      +--------+
> > + *      00:00.0 |      A |-'     2 |        |--.
> > + *              +--------+         +--------+   \
> > + *              :        :       3 |        |    \
> > + *              +--------+         +--------+     --> pgd Y
> > + *      00:01.0 |      B |--.                    /
> > + *              +--------+   \                  |
> > + *      00:01.1 |      B |----+   PASID tables  |
> > + *              +--------+     \   of domain B  |
> > + *                              '->+--------+   |
> > + *                               0 |        |-- | --> io_pgtable
> > + *                                 +--------+   |
> > + *                               1 |        |   |
> > + *                                 +--------+   |
> > + *                               2 |        |---'
> > + *                                 +--------+
> > + *                               3 |        |
> > + *                                 +--------+
> > + *
> > + * With this model, a single call binds all devices in a given domain to an
> > + * address space. Other devices in the domain will get the same bond implicitly.
> > + * However, users must issue one bind() for each device, because IOMMUs may
> > + * implement SVA differently. Furthermore, mandating one bind() per device
> > + * allows the driver to perform sanity-checks on device capabilities.
> 
> > + *
> > + * In some IOMMUs, one entry of the PASID table (typically the first one) can
> > + * hold non-PASID translations. In this case PASID 0 is reserved and the first
> > + * entry points to the io_pgtable pointer. In other IOMMUs the io_pgtable
> > + * pointer is held in the device table and PASID 0 is available to the
> > + * allocator.
> 
> Is it worth hammering home in here that we can only do this because the PASID space
> is global (with the exception of PASID 0)?  It's a convenient simplification but not
> necessarily a hardware restriction so perhaps we should remind people somewhere in here?

I could add this four paragraphs up:

"A single Process Address Space ID (PASID) is allocated for each mm. It is
a choice made for the Linux SVA implementation, not a hardware
restriction."

> > + */
> > +
> > +struct io_mm {
> > +	struct list_head		devices;
> > +	struct mm_struct		*mm;
> > +	struct mmu_notifier		notifier;
> > +
> > +	/* Late initialization */
> > +	const struct io_mm_ops		*ops;
> > +	void				*ctx;
> > +	int				pasid;
> > +};
> > +
> > +#define to_io_mm(mmu_notifier)	container_of(mmu_notifier, struct io_mm, notifier)
> > +#define to_iommu_bond(handle)	container_of(handle, struct iommu_bond, sva)
> 
> Code ordering wise, do we want this after the definition of iommu_bond?
> 
> For both of these it's a bit non obvious what they come 'from'.
> I wouldn't naturally assume to_io_mm gets me from notifier to the io_mm
> for example.  Not sure it matters though if these are only used in a few
> places.

Right, I can rename the first one to mn_to_io_mm(). The second one I think
might be good enough.


> > +static struct iommu_sva *
> > +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata)
> > +{
> > +	int ret = 0;
> 
> I'm fairly sure this is set in all paths below.  Now, of course the
> compiler might not think that in which case fair enough :)
> 
> > +	bool attach_domain = true;
> > +	struct iommu_bond *bond, *tmp;
> > +	struct iommu_domain *domain, *other;
> > +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> > +
> > +	domain = iommu_get_domain_for_dev(dev);
> > +
> > +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> > +	if (!bond)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	bond->sva.dev	= dev;
> > +	bond->drvdata	= drvdata;
> > +	refcount_set(&bond->refs, 1);
> > +	RCU_INIT_POINTER(bond->io_mm, io_mm);
> > +
> > +	mutex_lock(&iommu_sva_lock);
> > +	/* Is it already bound to the device or domain? */
> > +	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> > +		if (tmp->sva.dev != dev) {
> > +			other = iommu_get_domain_for_dev(tmp->sva.dev);
> > +			if (domain == other)
> > +				attach_domain = false;
> > +
> > +			continue;
> > +		}
> > +
> > +		if (WARN_ON(tmp->drvdata != drvdata)) {
> > +			ret = -EINVAL;
> > +			goto err_free;
> > +		}
> > +
> > +		/*
> > +		 * Hold a single io_mm reference per bond. Note that we can't
> > +		 * return an error after this, otherwise the caller would drop
> > +		 * an additional reference to the io_mm.
> > +		 */
> > +		refcount_inc(&tmp->refs);
> > +		io_mm_put(io_mm);
> > +		kfree(bond);
> 
> Free outside the lock would be ever so slightly more logical given we allocated
> before taking the lock.
> 
> > +		mutex_unlock(&iommu_sva_lock);
> > +		return &tmp->sva;
> > +	}
> > +
> > +	list_add_rcu(&bond->mm_head, &io_mm->devices);
> > +	param->nr_bonds++;
> > +	mutex_unlock(&iommu_sva_lock);
> > +
> > +	ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> > +				 attach_domain);
> > +	if (ret)
> > +		goto err_remove;
> > +
> > +	return &bond->sva;
> > +
> > +err_remove:
> > +	/*
> > +	 * At this point concurrent threads may have started to access the
> > +	 * io_mm->devices list in order to invalidate address ranges, which
> > +	 * requires to free the bond via kfree_rcu()
> > +	 */
> > +	mutex_lock(&iommu_sva_lock);
> > +	param->nr_bonds--;
> > +	list_del_rcu(&bond->mm_head);
> > +
> > +err_free:
> > +	mutex_unlock(&iommu_sva_lock);
> > +	kfree_rcu(bond, rcu_head);
> 
> I don't suppose it matters really but we don't need the rcu free if
> we follow the err_free goto.  Perhaps it would be cleaner in this case
> to not use a unified exit path but do that case inline?

Agreed, though I moved the kzalloc() later as suggested by Jacob; I think
it looks a little better and simplifies the error paths.
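
Something like this (rough sketch of the reordering):

	mutex_lock(&iommu_sva_lock);
	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
		/* ... return the existing bond if this device already has one ... */
	}

	/* No existing bond: allocate one now, GFP_KERNEL is fine under a mutex */
	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
	if (!bond) {
		mutex_unlock(&iommu_sva_lock);
		return ERR_PTR(-ENOMEM);
	}

With the allocation done after the search, the existing-bond path no longer
has anything to free and the err_free label goes away.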

Thanks,
Jean


* Re: [PATCH v4 03/26] iommu: Add a page fault handler
  2020-02-26 13:59   ` Jonathan Cameron
@ 2020-02-28 14:44     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-28 14:44 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Wed, Feb 26, 2020 at 01:59:33PM +0000, Jonathan Cameron wrote:
> > +static int iopf_complete(struct device *dev, struct iopf_fault *iopf,
> > +			 enum iommu_page_response_code status)
> 
> This is called once per group.  Should the name reflect that?

Ok

[...]
> > +/**
> > + * iommu_queue_iopf - IO Page Fault handler
> > + * @evt: fault event
> > + * @cookie: struct device, passed to iommu_register_device_fault_handler.
> > + *
> > + * Add a fault to the device workqueue, to be handled by mm.
> > + *
> > + * Return: 0 on success and <0 on error.
> > + */
> > +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> > +{
> > +	int ret;
> > +	struct iopf_group *group;
> > +	struct iopf_fault *iopf, *next;
> > +	struct iopf_device_param *iopf_param;
> > +
> > +	struct device *dev = cookie;
> > +	struct iommu_param *param = dev->iommu_param;
> > +
> > +	if (WARN_ON(!mutex_is_locked(&param->lock)))
> > +		return -EINVAL;
> 
> Just curious...
> 
> Why do we always need a runtime check on this rather than say,
> using lockdep_assert_held or similar?

I probably didn't know about lockdep_assert at the time :)
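
For the next version the runtime check could simply become (sketch):

	-	if (WARN_ON(!mutex_is_locked(&param->lock)))
	-		return -EINVAL;
	+	lockdep_assert_held(&param->lock);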

> > +	/*
> > +	 * It is incredibly easy to find ourselves in a deadlock situation if
> > +	 * we're not careful, because we're taking the opposite path as
> > +	 * iommu_queue_iopf:
> > +	 *
> > +	 *   iopf_queue_flush_dev()   |  PRI queue handler
> > +	 *    lock(&param->lock)      |   iommu_queue_iopf()
> > +	 *     queue->flush()         |    lock(&param->lock)
> > +	 *      wait PRI queue empty  |
> > +	 *
> > +	 * So we can't hold the device param lock while flushing. Take a
> > +	 * reference to the device param instead, to prevent the queue from
> > +	 * going away.
> > +	 */
> > +	mutex_lock(&param->lock);
> > +	iopf_param = param->iopf_param;
> > +	if (iopf_param) {
> > +		queue = param->iopf_param->queue;
> > +		iopf_param->busy = true;
> 
> Describing this as taking a reference is not great...
> I'd change the comment to set a flag or something like that.
> 
> Is there any potential of multiple copies of this running against
> each other?  I've not totally gotten my head around when this
> might be called yet.

Yes, it's allowed; this should be a refcount

[...]
> > +int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev)
> > +{
> > +	int ret = -EINVAL;
> > +	struct iopf_fault *iopf, *next;
> > +	struct iopf_device_param *iopf_param;
> > +	struct iommu_param *param = dev->iommu_param;
> > +
> > +	if (!param || !queue)
> > +		return -EINVAL;
> > +
> > +	do {
> > +		mutex_lock(&queue->lock);
> > +		mutex_lock(&param->lock);
> > +		iopf_param = param->iopf_param;
> > +		if (iopf_param && iopf_param->queue == queue) {
> > +			if (iopf_param->busy) {
> > +				ret = -EBUSY;
> > +			} else {
> > +				list_del(&iopf_param->queue_list);
> > +				param->iopf_param = NULL;
> > +				ret = 0;
> > +			}
> > +		}
> > +		mutex_unlock(&param->lock);
> > +		mutex_unlock(&queue->lock);
> > +
> > +		/*
> > +		 * If there is an ongoing flush, wait for it to complete and
> > +		 * then retry. iopf_param isn't going away since we're the only
> > +		 * thread that can free it.
> > +		 */
> > +		if (ret == -EBUSY)
> > +			wait_event(iopf_param->wq_head, !iopf_param->busy);
> > +		else if (ret)
> > +			return ret;
> > +	} while (ret == -EBUSY);
> 
> I'm in two minds about the next comment (so up to you)...
> 
> Currently this looks a bit odd.  Would you be better off just having a separate
> parameter for busy and explicit separate handling for the error path?
> 
> 	bool busy;
> 	int ret = 0;
> 
> 	do {
> 		mutex_lock(&queue->lock);
> 		mutex_lock(&param->lock);
> 		iopf_param = param->iopf_param;
> 		if (iopf_param && iopf_param->queue == queue) {
> 			busy = iopf_param->busy;
> 			if (!busy) {
> 				list_del(&iopf_param->queue_list);
> 				param->iopf_param = NULL;
> 			}
> 		} else {
> 			ret = -EINVAL;
> 		}
> 		mutex_unlock(&param->lock);
> 		mutex_unlock(&queue->lock);
> 		if (ret)
> 			return ret;
> 		if (busy)
> 			wait_event(iopf_param->wq_head, !iopf_param->busy);
> 		
> 	} while (busy);
> 
> 	..

Sure, I think it looks better

Thanks,
Jean


* Re: [PATCH v4 06/26] iommu/sva: Register page fault handler
  2020-02-26 19:39   ` Jacob Pan
@ 2020-02-28 14:44     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-28 14:44 UTC (permalink / raw)
  To: Jacob Pan
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, christian.koenig,
	yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

On Wed, Feb 26, 2020 at 11:39:59AM -0800, Jacob Pan wrote:
> > @@ -442,11 +444,20 @@ static void iommu_sva_unbind_locked(struct
> > iommu_bond *bond) 
> >  void iommu_sva_unbind_generic(struct iommu_sva *handle)
> >  {
> > +	int pasid;
> >  	struct iommu_param *param = handle->dev->iommu_param;
> >  
> >  	if (WARN_ON(!param))
> >  		return;
> >  
> > +	/*
> > +	 * Caller stopped the device from issuing PASIDs, now make
> > sure they are
> > +	 * out of the fault queue.
> > +	 */
> > +	pasid = iommu_sva_get_pasid_generic(handle);
> > +	if (pasid != IOMMU_PASID_INVALID)
> > +		iopf_queue_flush_dev(handle->dev, pasid);
> > +
> I have an ordering concern.
> The caller can only stop the device issuing page requests but there will
> be in-flight requests inside the IOMMU. If we flush here before clearing
> the PASID context, there might be new requests coming in before the
> detach.

The goal of this flush is also to clear the IOMMU PRI queue. It calls the
IOMMU's flush() callback before flushing the workqueue. So when this
returns, there shouldn't be any more pending faults.
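
To spell out the ordering, the flush path is roughly (pseudocode, field names
approximate):

	iopf_queue_flush_dev(dev, pasid)
		queue->flush(cookie, dev, pasid)  /* drain the HW PRI/event queue */
		flush_workqueue(queue->wq)        /* run the fault work already queued */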

Thanks,
Jean

> How about detach first then flush? Then anything come after the detach
> would be faults. Flush will be clean.
> 
> >  	mutex_lock(&param->sva_lock);
> >  	mutex_lock(&iommu_sva_lock);
> >  	iommu_sva_unbind_locked(to_iommu_bond(handle));
> > @@ -484,6 +495,10 @@ int iommu_sva_enable(struct device *dev, struct
> > iommu_sva_param *sva_param) goto err_unlock;
> >  	}
> >  
> > +	ret = iommu_register_device_fault_handler(dev,
> > iommu_queue_iopf, dev);
> > +	if (ret)
> > +		goto err_unlock;
> > +
> >  	dev->iommu_param->sva_param = new_param;
> >  	mutex_unlock(&param->sva_lock);
> >  	return 0;
> > @@ -521,6 +536,7 @@ int iommu_sva_disable(struct device *dev)
> >  		goto out_unlock;
> >  	}
> >  
> > +	iommu_unregister_device_fault_handler(dev);
> >  	kfree(param->sva_param);
> >  	param->sva_param = NULL;
> >  out_unlock:
> 
> [Jacob Pan]


* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-28 14:39         ` Jean-Philippe Brucker
@ 2020-02-28 14:48           ` Jason Gunthorpe
  2020-02-28 15:04             ` Jean-Philippe Brucker
  0 siblings, 1 reply; 70+ messages in thread
From: Jason Gunthorpe @ 2020-02-28 14:48 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > +		/*
> > > +		 * To ensure that we observe the initialization of io_mm fields
> > > +		 * by io_mm_finalize() before the registration of this bond to
> > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > +		 * from list_add_rcu().
> > > +		 */
> > > +		io_mm = rcu_dereference(bond->io_mm);
> > 
> > A rcu_dereference isn't need here, just a normal derference is fine.
> 
> bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> which does bond->io_mm under rcu_read_lock())

I'm surprised the bond->io_mm can change over the lifetime of the
bond memory..

> > > If io_mm->ctx and io_mm->ops are already valid before the
> > > mmu notifier is published, then we don't need that stuff.
> > 
> > So, this trickiness with RCU is not a bad reason to introduce the priv
> > scheme, maybe explain it in the commit message?
> 
> Ok, I've added this to the commit message:
> 
>     The IOMMU SVA module, which attaches an mm to multiple devices,
>     exemplifies this situation. In essence it does:
> 
>             mmu_notifier_get()
>               alloc_notifier()
>                  A = kzalloc()
>               /* MMU notifier is published */
>             A->ctx = ctx;                           // (1)
>             device->A = A;
>             list_add_rcu(device, A->devices);       // (2)
> 
>     The invalidate notifier, which may start running before A is fully
>     initialized at (1), does the following:
> 
>             io_mm_invalidate(A)
>               list_for_each_entry_rcu(device, A->devices)
>                 A = device->A;                      // (3)

I would drop the workaround from the description; it is enough to say
that the line below needs to observe (1) after (2), and this is
trivially achieved by moving (1) to before publishing the notifier so
the core MM locking can be used.

Regards,
Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 02/26] iommu/sva: Manage process address spaces
  2020-02-28 14:40     ` Jean-Philippe Brucker
@ 2020-02-28 14:57       ` Jason Gunthorpe
  0 siblings, 0 replies; 70+ messages in thread
From: Jason Gunthorpe @ 2020-02-28 14:57 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jacob Pan, iommu, devicetree, linux-arm-kernel, linux-pci,
	linux-mm, joro, robh+dt, mark.rutland, catalin.marinas, will,
	robin.murphy, kevin.tian, baolu.lu, Jonathan.Cameron,
	christian.koenig, yi.l.liu, zhangfei.gao, Jean-Philippe Brucker

On Fri, Feb 28, 2020 at 03:40:07PM +0100, Jean-Philippe Brucker wrote:
> > > Device
> > > + * 00:00.0 accesses address spaces X and Y, each corresponding to an
> > > mm_struct.
> > > + * Devices 00:01.* only access address space Y. In addition each
> > > + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable,
> > > that is
> > > + * managed with iommu_map()/iommu_unmap(), and isn't shared with the
> > > CPU MMU.
> > So this would allow IOVA and SVA co-exist in the same address space?
> 
> Hmm, not in the same address space, but they can co-exist in a device. In
> fact the endpoint I'm testing (hisi zip accelerator) already needs normal
> DMA alongside SVA for queue management. This one is integrated on an
> Arm-based platform so shouldn't be a concern for VT-d at the moment, but
> I suspect we might see more of this kind of device with mixed DMA.

Probably the most interesting use cases for PASID definitely require
this, so this is more than a "suspect we might see".

We want to see the privileged kernel control the general behavior of
the PCI function and delegate only some DMAs to PASIDs associated with
the user mm_struct. The device is always trusted to label its DMA
properly.

These programming models have already been in use for years with the
OpenCAPI implementation.

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-28 14:48           ` Jason Gunthorpe
@ 2020-02-28 15:04             ` Jean-Philippe Brucker
  2020-02-28 15:13               ` Jason Gunthorpe
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-02-28 15:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > +		/*
> > > > +		 * To ensure that we observe the initialization of io_mm fields
> > > > +		 * by io_mm_finalize() before the registration of this bond to
> > > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > > +		 * from list_add_rcu().
> > > > +		 */
> > > > +		io_mm = rcu_dereference(bond->io_mm);
> > > 
> > > An rcu_dereference isn't needed here, just a normal dereference is fine.
> > 
> > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > which does bond->io_mm under rcu_read_lock())
> 
> I'm surprised the bond->io_mm can change over the lifetime of the
> bond memory..

The normal lifetime of the bond is between device driver calls to bind()
and unbind(). If the mm exits early, though, we clear bond->io_mm. The
bond is then stale but can only be freed when the device driver releases
it with unbind().
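
For reference, the reader that relies on the __rcu annotation is more or
less the following (a sketch from memory, not a verbatim copy of the patch):

        int iommu_sva_get_pasid_generic(struct iommu_sva *handle)
        {
                struct io_mm *io_mm;
                int pasid = IOMMU_PASID_INVALID;
                struct iommu_bond *bond = to_iommu_bond(handle);

                rcu_read_lock();
                /* bond->io_mm may be cleared concurrently by the mm exit path */
                io_mm = rcu_dereference(bond->io_mm);
                if (io_mm)
                        pasid = io_mm->pasid;
                rcu_read_unlock();

                return pasid;
        }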

> 
> > > > If io_mm->ctx and io_mm->ops are already valid before the
> > > > mmu notifier is published, then we don't need that stuff.
> > > 
> > > So, this trickiness with RCU is not a bad reason to introduce the priv
> > > scheme, maybe explain it in the commit message?
> > 
> > Ok, I've added this to the commit message:
> > 
> >     The IOMMU SVA module, which attaches an mm to multiple devices,
> >     exemplifies this situation. In essence it does:
> > 
> >             mmu_notifier_get()
> >               alloc_notifier()
> >                  A = kzalloc()
> >               /* MMU notifier is published */
> >             A->ctx = ctx;                           // (1)
> >             device->A = A;
> >             list_add_rcu(device, A->devices);       // (2)
> > 
> >     The invalidate notifier, which may start running before A is fully
> >     initialized at (1), does the following:
> > 
> >             io_mm_invalidate(A)
> >               list_for_each_entry_rcu(device, A->devices)
> >                 A = device->A;                      // (3)
> 
> I would drop the workaround from the description; it is enough to say
> that the line below needs to observe (1) after (2), and this is
> trivially achieved by moving (1) to before publishing the notifier so
> the core MM locking can be used.

Ok, will do

Thanks,
Jean

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-28 15:04             ` Jean-Philippe Brucker
@ 2020-02-28 15:13               ` Jason Gunthorpe
  2020-03-06  9:56                 ` Jean-Philippe Brucker
  0 siblings, 1 reply; 70+ messages in thread
From: Jason Gunthorpe @ 2020-02-28 15:13 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Feb 28, 2020 at 04:04:27PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > > +		/*
> > > > > +		 * To ensure that we observe the initialization of io_mm fields
> > > > > +		 * by io_mm_finalize() before the registration of this bond to
> > > > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > > > +		 * from list_add_rcu().
> > > > > +		 */
> > > > > +		io_mm = rcu_dereference(bond->io_mm);
> > > > 
> > > > An rcu_dereference isn't needed here, just a normal dereference is fine.
> > > 
> > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > > which does bond->io_mm under rcu_read_lock())
> > 
> > I'm surprised the bond->io_mm can change over the lifetime of the
> > bond memory..
> 
> The normal lifetime of the bond is between device driver calls to bind()
> and unbind(). If the mm exits early, though, we clear bond->io_mm. The
> bond is then stale but can only be freed when the device driver releases
> it with unbind().

I usually advocate for simple use of these APIs. The mmu_notifier_get()
should happen in bind() and the matching put should happen in the
call_rcu callback that does the kfree. Then you can never get a stale
pointer. Don't worry about exit_mmap().

release() is an unusual callback and I see a lot of places using it
wrong. The purpose of release is to invalidate_all, that is it.

Also, confusingly release may be called multiple times in some
situations, so it shouldn't disturb anything that might impact a 2nd
call.

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 02/26] iommu/sva: Manage process address spaces
  2020-02-28 14:43     ` Jean-Philippe Brucker
@ 2020-02-28 16:26       ` Jonathan Cameron
  0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2020-02-28 16:26 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Fri, 28 Feb 2020 15:43:04 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> On Wed, Feb 26, 2020 at 12:35:06PM +0000, Jonathan Cameron wrote:
> > > + * A single Process Address Space ID (PASID) is allocated for each mm. In the
> > > + * example, devices use PASID 1 to read/write into address space X and PASID 2
> > > + * to read/write into address space Y. Calling iommu_sva_get_pasid() on bond 1
> > > + * returns 1, and calling it on bonds 2-4 returns 2.
> > > + *
> > > + * Hardware tables describing this configuration in the IOMMU would typically
> > > + * look like this:
> > > + *
> > > + *                                PASID tables
> > > + *                                 of domain A
> > > + *                              .->+--------+
> > > + *                             / 0 |        |-------> io_pgtable
> > > + *                            /    +--------+
> > > + *            Device tables  /   1 |        |-------> pgd X
> > > + *              +--------+  /      +--------+
> > > + *      00:00.0 |      A |-'     2 |        |--.
> > > + *              +--------+         +--------+   \
> > > + *              :        :       3 |        |    \
> > > + *              +--------+         +--------+     --> pgd Y
> > > + *      00:01.0 |      B |--.                    /
> > > + *              +--------+   \                  |
> > > + *      00:01.1 |      B |----+   PASID tables  |
> > > + *              +--------+     \   of domain B  |
> > > + *                              '->+--------+   |
> > > + *                               0 |        |-- | --> io_pgtable
> > > + *                                 +--------+   |
> > > + *                               1 |        |   |
> > > + *                                 +--------+   |
> > > + *                               2 |        |---'
> > > + *                                 +--------+
> > > + *                               3 |        |
> > > + *                                 +--------+
> > > + *
> > > + * With this model, a single call binds all devices in a given domain to an
> > > + * address space. Other devices in the domain will get the same bond implicitly.
> > > + * However, users must issue one bind() for each device, because IOMMUs may
> > > + * implement SVA differently. Furthermore, mandating one bind() per device
> > > + * allows the driver to perform sanity-checks on device capabilities.  
> >   
> > > + *
> > > + * In some IOMMUs, one entry of the PASID table (typically the first one) can
> > > + * hold non-PASID translations. In this case PASID 0 is reserved and the first
> > > + * entry points to the io_pgtable pointer. In other IOMMUs the io_pgtable
> > > + * pointer is held in the device table and PASID 0 is available to the
> > > + * allocator.  
> > 
> > Is it worth hammering home in here that we can only do this because the PASID space
> > is global (with exception of PASID 0)?  It's a convenient simplification but not
> > necessarily a hardware restriction so perhaps we should remind people somewhere in here?  
> 
> I could add this four paragraphs up:
> 
> "A single Process Address Space ID (PASID) is allocated for each mm. It is
> a choice made for the Linux SVA implementation, not a hardware
> restriction."

Perfect.

> 
> > > + */
> > > +
> > > +struct io_mm {
> > > +	struct list_head		devices;
> > > +	struct mm_struct		*mm;
> > > +	struct mmu_notifier		notifier;
> > > +
> > > +	/* Late initialization */
> > > +	const struct io_mm_ops		*ops;
> > > +	void				*ctx;
> > > +	int				pasid;
> > > +};
> > > +
> > > +#define to_io_mm(mmu_notifier)	container_of(mmu_notifier, struct io_mm, notifier)
> > > +#define to_iommu_bond(handle)	container_of(handle, struct iommu_bond, sva)  
> > 
> > Code ordering wise, do we want this after the definition of iommu_bond?
> > 
> > For both of these it's a bit non obvious what they come 'from'.
> > I wouldn't naturally assume to_io_mm gets me from notifier to the io_mm
> > for example.  Not sure it matters though if these are only used in a few
> > places.  
> 
> Right, I can rename the first one to mn_to_io_mm(). The second one I think
> might be good enough.

Agreed. The second one does feel more natural.

> 
> 
> > > +static struct iommu_sva *
> > > +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata)
> > > +{
> > > +	int ret = 0;  
> > 
> > I'm fairly sure this is set in all paths below.  Now, of course the
> > compiler might not think that in which case fair enough :)
> >   
> > > +	bool attach_domain = true;
> > > +	struct iommu_bond *bond, *tmp;
> > > +	struct iommu_domain *domain, *other;
> > > +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> > > +
> > > +	domain = iommu_get_domain_for_dev(dev);
> > > +
> > > +	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> > > +	if (!bond)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	bond->sva.dev	= dev;
> > > +	bond->drvdata	= drvdata;
> > > +	refcount_set(&bond->refs, 1);
> > > +	RCU_INIT_POINTER(bond->io_mm, io_mm);
> > > +
> > > +	mutex_lock(&iommu_sva_lock);
> > > +	/* Is it already bound to the device or domain? */
> > > +	list_for_each_entry(tmp, &io_mm->devices, mm_head) {
> > > +		if (tmp->sva.dev != dev) {
> > > +			other = iommu_get_domain_for_dev(tmp->sva.dev);
> > > +			if (domain == other)
> > > +				attach_domain = false;
> > > +
> > > +			continue;
> > > +		}
> > > +
> > > +		if (WARN_ON(tmp->drvdata != drvdata)) {
> > > +			ret = -EINVAL;
> > > +			goto err_free;
> > > +		}
> > > +
> > > +		/*
> > > +		 * Hold a single io_mm reference per bond. Note that we can't
> > > +		 * return an error after this, otherwise the caller would drop
> > > +		 * an additional reference to the io_mm.
> > > +		 */
> > > +		refcount_inc(&tmp->refs);
> > > +		io_mm_put(io_mm);
> > > +		kfree(bond);  
> > 
> > Free outside the lock would be ever so slightly more logical given we allocated
> > before taking the lock.
> >   
> > > +		mutex_unlock(&iommu_sva_lock);
> > > +		return &tmp->sva;
> > > +	}
> > > +
> > > +	list_add_rcu(&bond->mm_head, &io_mm->devices);
> > > +	param->nr_bonds++;
> > > +	mutex_unlock(&iommu_sva_lock);
> > > +
> > > +	ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> > > +				 attach_domain);
> > > +	if (ret)
> > > +		goto err_remove;
> > > +
> > > +	return &bond->sva;
> > > +
> > > +err_remove:
> > > +	/*
> > > +	 * At this point concurrent threads may have started to access the
> > > +	 * io_mm->devices list in order to invalidate address ranges, which
> > > +	 * requires to free the bond via kfree_rcu()
> > > +	 */
> > > +	mutex_lock(&iommu_sva_lock);
> > > +	param->nr_bonds--;
> > > +	list_del_rcu(&bond->mm_head);
> > > +
> > > +err_free:
> > > +	mutex_unlock(&iommu_sva_lock);
> > > +	kfree_rcu(bond, rcu_head);  
> > 
> > I don't suppose it matters really but we don't need the rcu free if
> > we follow the err_free goto.  Perhaps we are cleaner in this case
> > to not use a unified exit path but do that case inline?  
> 
> Agreed, though I moved the kzalloc() later as suggested by Jacob, I think
> it looks a little better and simplifies the error paths
> 
> Thanks,
> Jean
Jonathan


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices
  2020-02-27 18:17   ` Jonathan Cameron
@ 2020-03-04 14:08     ` Jean-Philippe Brucker
  2020-03-09 10:48       ` Jonathan Cameron
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-04 14:08 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Thu, Feb 27, 2020 at 06:17:26PM +0000, Jonathan Cameron wrote:
> On Mon, 24 Feb 2020 19:23:58 +0100
> Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> 
> > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > 
> > The SMMU provides a Stall model for handling page faults in platform
> > devices. It is similar to PCI PRI, but doesn't require devices to have
> > their own translation cache. Instead, faulting transactions are parked and
> > the OS is given a chance to fix the page tables and retry the transaction.
> > 
> > Enable stall for devices that support it (opt-in by firmware). When an
> > event corresponds to a translation error, call the IOMMU fault handler. If
> > the fault is recoverable, it will call us back to terminate or continue
> > the stall.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> One question inline.
> 
> Thanks,
> 
> > ---
> >  drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++--
> >  drivers/iommu/of_iommu.c    |   5 +-
> >  include/linux/iommu.h       |   2 +
> >  3 files changed, 269 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 6a5987cce03f..da5dda5ba26a 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -374,6 +374,13 @@
> 
> 
> > +/*
> > + * arm_smmu_flush_evtq - wait until all events currently in the queue have been
> > + *                       consumed.
> > + *
> > + * Wait until the evtq thread finished a batch, or until the queue is empty.
> > + * Note that we don't handle overflows on q->batch. If it occurs, just wait for
> > + * the queue to be empty.
> > + */
> > +static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid)
> > +{
> > +	int ret;
> > +	u64 batch;
> > +	struct arm_smmu_device *smmu = cookie;
> > +	struct arm_smmu_queue *q = &smmu->evtq.q;
> > +
> > +	spin_lock(&q->wq.lock);
> > +	if (queue_sync_prod_in(q) == -EOVERFLOW)
> > +		dev_err(smmu->dev, "evtq overflow detected -- requests lost\n");
> > +
> > +	batch = q->batch;
> 
> So this is trying to be sure we have advanced the queue 2 spots?

So we call arm_smmu_flush_evtq() before decommissioning a PASID, to make
sure that there aren't any pending events for this PASID languishing in the
fault queues.

The main test is queue_empty(). If that succeeds then we know that there
aren't any pending events (and the PASID is safe to reuse). But if new
events are constantly added to the queue then we wait for the evtq thread
to handle a full batch, where one batch corresponds to the queue size. For
that we take the batch number when entering flush(), and wait for the evtq
thread to increment it twice.
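
Spelling the handshake out (simplified; the exact wake-up primitive in the
patch may differ):

        /* evtq thread, after consuming everything currently in the queue: */
        spin_lock(&q->wq.lock);
        q->batch++;
        wake_up_all_locked(&q->wq);
        spin_unlock(&q->wq.lock);

        /*
         * flusher: once q->batch >= batch + 2, a full pass that started
         * after we sampled 'batch' has completed, so any event that was
         * queued for this PASID when we entered has been handled.
         */
        spin_lock(&q->wq.lock);
        batch = q->batch;
        ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) ||
                                              q->batch >= batch + 2);
        spin_unlock(&q->wq.lock);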

> Is there a potential race here?  q->batch could have updated before we take
> a local copy.

Yes we're just checking on the progress of the evtq thread. All accesses
to batch are made while holding the wq lock.

Flush is a rare event so the lock isn't contended, but the wake_up() that
this patch introduces in arm_smmu_evtq_thread() does add some overhead
(0.85% of arm_smmu_evtq_thread(), according to perf). It would be nice to
get rid of it but I haven't found anything clever yet.

Thanks,
Jean

> 
> > +	ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) ||
> > +					      q->batch >= batch + 2);
> > +	spin_unlock(&q->wq.lock);
> > +
> > +	return ret;
> > +}
> > +
> ...
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices
  2020-02-26  8:44   ` Xu Zaibo
@ 2020-03-04 14:09     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-04 14:09 UTC (permalink / raw)
  To: Xu Zaibo
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm,
	mark.rutland, kevin.tian, Jean-Philippe Brucker, catalin.marinas,
	robin.murphy, robh+dt, zhangfei.gao, will, christian.koenig

On Wed, Feb 26, 2020 at 04:44:53PM +0800, Xu Zaibo wrote:
> Hi,
> 
> 
> On 2020/2/25 2:23, Jean-Philippe Brucker wrote:
> > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > 
> > The SMMU provides a Stall model for handling page faults in platform
> > devices. It is similar to PCI PRI, but doesn't require devices to have
> > their own translation cache. Instead, faulting transactions are parked and
> > the OS is given a chance to fix the page tables and retry the transaction.
> > 
> > Enable stall for devices that support it (opt-in by firmware). When an
> > event corresponds to a translation error, call the IOMMU fault handler. If
> > the fault is recoverable, it will call us back to terminate or continue
> > the stall.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> >   drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++--
> >   drivers/iommu/of_iommu.c    |   5 +-
> >   include/linux/iommu.h       |   2 +
> >   3 files changed, 269 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 6a5987cce03f..da5dda5ba26a 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -374,6 +374,13 @@
> >   #define CMDQ_PRI_1_GRPID		GENMASK_ULL(8, 0)
> >   #define CMDQ_PRI_1_RESP			GENMASK_ULL(13, 12)
> [...]
> > +static int arm_smmu_page_response(struct device *dev,
> > +				  struct iommu_fault_event *unused,
> > +				  struct iommu_page_response *resp)
> > +{
> > +	struct arm_smmu_cmdq_ent cmd = {0};
> > +	struct arm_smmu_master *master = dev_iommu_fwspec_get(dev)->iommu_priv;
> Can 'dev_to_master' be used here?

Certainly, good catch

Thanks,
Jean

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 07/26] arm64: mm: Pin down ASIDs for sharing mm with devices
  2020-02-27 17:43   ` Jonathan Cameron
@ 2020-03-04 14:10     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-04 14:10 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Thu, Feb 27, 2020 at 05:43:51PM +0000, Jonathan Cameron wrote:
> On Mon, 24 Feb 2020 19:23:42 +0100
> Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> 
> > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > 
> > To enable address space sharing with the IOMMU, introduce mm_context_get()
> > and mm_context_put(), that pin down a context and ensure that it will keep
> > its ASID after a rollover. Export the symbols to let the modular SMMUv3
> > driver use them.
> > 
> > Pinning is necessary because a device constantly needs a valid ASID,
> > unlike tasks that only require one when running. Without pinning, we would
> > need to notify the IOMMU when we're about to use a new ASID for a task,
> > and it would get complicated when a new task is assigned a shared ASID.
> > Consider the following scenario with no ASID pinned:
> > 
> > 1. Task t1 is running on CPUx with shared ASID (gen=1, asid=1)
> > 2. Task t2 is scheduled on CPUx, gets ASID (1, 2)
> > 3. Task tn is scheduled on CPUy, a rollover occurs, tn gets ASID (2, 1)
> >    We would now have to immediately generate a new ASID for t1, notify
> >    the IOMMU, and finally enable task tn. We are holding the lock during
> >    all that time, since we can't afford having another CPU trigger a
> >    rollover. The IOMMU issues invalidation commands that can take tens of
> >    milliseconds.
> > 
> > It gets needlessly complicated. All we wanted to do was schedule task tn,
> > that has no business with the IOMMU. By letting the IOMMU pin tasks when
> > needed, we avoid stalling the slow path, and let the pinning fail when
> > we're out of shareable ASIDs.
> > 
> > After a rollover, the allocator expects at least one ASID to be available
> > in addition to the reserved ones (one per CPU). So (NR_ASIDS - NR_CPUS -
> > 1) is the maximum number of ASIDs that can be shared with the IOMMU.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> A few more trivial points.

I'll fix those, thanks

Jean

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-24 18:23 ` [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier() Jean-Philippe Brucker
  2020-02-24 19:00   ` Jason Gunthorpe
@ 2020-03-05 16:36   ` Christoph Hellwig
  1 sibling, 0 replies; 70+ messages in thread
From: Christoph Hellwig @ 2020-03-05 16:36 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, Jonathan.Cameron, jacob.jun.pan,
	christian.koenig, yi.l.liu, zhangfei.gao, Andrew Morton,
	Arnd Bergmann, Dimitri Sivanich, Greg Kroah-Hartman,
	Jason Gunthorpe

On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> -static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm)
> +static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm, void *privdata)

Please don't introduce any > 80 char lines.  Not here, and not anywhere
else in this patch or the series.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-02-28 15:13               ` Jason Gunthorpe
@ 2020-03-06  9:56                 ` Jean-Philippe Brucker
  2020-03-06 13:09                   ` Jason Gunthorpe
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-06  9:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Feb 28, 2020 at 11:13:40AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 28, 2020 at 04:04:27PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > > > +		/*
> > > > > > +		 * To ensure that we observe the initialization of io_mm fields
> > > > > > +		 * by io_mm_finalize() before the registration of this bond to
> > > > > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > > > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > > > > +		 * from list_add_rcu().
> > > > > > +		 */
> > > > > > +		io_mm = rcu_dereference(bond->io_mm);
> > > > > 
> > > > > An rcu_dereference isn't needed here, just a normal dereference is fine.
> > > > 
> > > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > > > which does bond->io_mm under rcu_read_lock())
> > > 
> > > I'm surprised the bond->io_mm can change over the lifetime of the
> > > bond memory..
> > 
> > The normal lifetime of the bond is between device driver calls to bind()
> > and unbind(). If the mm exits early, though, we clear bond->io_mm. The
> > bond is then stale but can only be freed when the device driver releases
> > it with unbind().
> 
> I usually advocate for simple use of these APIs. The mmu_notifier_get()
> should happen in bind() and the matching put should happen in the
> call_rcu callback that does the kfree.

I tried to keep it simple like that: normally mmu_notifier_get() is called
in bind(), and mmu_notifier_put() is called in unbind(). 

Multiple device drivers may call bind() with the same mm. Each bind()
calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
(a device<->mm link). Each bond is freed by calling unbind(), which calls
mmu_notifier_put().

That's the most common case. Now if the process is killed and the mm
disappears, we do need to avoid use-after-free caused by DMA of the
mappings and the page tables. So the release() callback, before doing
invalidate_all, stops DMA and clears the page table pointer on the IOMMU
side. It detaches all bonds from the io_mm, calling mmu_notifier_put() for
each of them. After release(), bond objects still exists and device
drivers still need to free them with unbind(), but they don't point to an
io_mm anymore.
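
In pseudocode the lifecycle is roughly (error paths and bond refcounting
omitted, and the notifier ops name is made up):

            bind(dev, mm):
              io_mm = mmu_notifier_get(&iommu_sva_notifier_ops, mm)
              bond = kzalloc(); bond->io_mm = io_mm
              write mm's pgd into the PASID table entry

            unbind(bond):
              if bond->io_mm:                  /* release() hasn't run */
                clear PASID table entry, invalidate TLB/ATC
                mmu_notifier_put(io_mm)
              kfree_rcu(bond)

            release(io_mm):                    /* mm is exiting */
              for each bond on io_mm->devices:
                tell the device driver to stop DMA on this PASID
                clear PASID table entry, invalidate TLB/ATC
                clear bond->io_mm
                mmu_notifier_put(io_mm)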

> Then you can never get a stale
> pointer. Don't worry about exit_mmap().
> 
> release() is an unusual callback and I see alot of places using it
> wrong. The purpose of release is to invalidate_all, that is it.
> 
> Also, confusingly release may be called multiple times in some
> situations, so it shouldn't disturb anything that might impact a 2nd
> call.

I hadn't realized that. The current implementation should be safe against
it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you
have an example of such a situation?  I'm trying to write tests for this
kind of corner case.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-06  9:56                 ` Jean-Philippe Brucker
@ 2020-03-06 13:09                   ` Jason Gunthorpe
  2020-03-06 14:35                     ` Jean-Philippe Brucker
  2020-03-16 15:46                     ` Christoph Hellwig
  0 siblings, 2 replies; 70+ messages in thread
From: Jason Gunthorpe @ 2020-03-06 13:09 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> I tried to keep it simple like that: normally mmu_notifier_get() is called
> in bind(), and mmu_notifier_put() is called in unbind(). 
> 
> Multiple device drivers may call bind() with the same mm. Each bind()
> calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> (a device<->mm link). Each bond is freed by calling unbind(), which calls
> mmu_notifier_put().
> 
> That's the most common case. Now if the process is killed and the mm
> disappears, we do need to avoid use-after-free caused by DMA of the
> mappings and the page tables. 

This is why release must do invalidate all - but it doesn't need to do
any more - as no SPTE can be established without a mmget() - and
mmget() is no longer possible past release.
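
Something like the usual pattern, for illustration (not code from this
series; the helper name is made up and it follows the v5.6-era mm API):

        static vm_fault_t fault_in_for_device(struct mm_struct *mm,
                                              unsigned long addr)
        {
                vm_fault_t ret = VM_FAULT_SIGSEGV;
                struct vm_area_struct *vma;

                /* Refuse to populate anything once the mm is past exit_mmap() */
                if (!mmget_not_zero(mm))
                        return VM_FAULT_SIGSEGV;

                down_read(&mm->mmap_sem);
                vma = find_vma(mm, addr);
                if (vma && addr >= vma->vm_start)
                        ret = handle_mm_fault(vma, addr,
                                              FAULT_FLAG_USER | FAULT_FLAG_REMOTE);
                up_read(&mm->mmap_sem);
                mmput(mm);

                return ret;
        }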

> So the release() callback, before doing invalidate_all, stops DMA
> and clears the page table pointer on the IOMMU side. It detaches all
> bonds from the io_mm, calling mmu_notifier_put() for each of
> them. After release(), bond objects still exists and device drivers
> still need to free them with unbind(), but they don't point to an
> io_mm anymore.

Why is so much work needed in release? It really should just be
invalidate all, usually trying to sort out all the locking for the
more complicated stuff is not worthwhile.

If other stuff is implicitly relying on the mm being alive and release
to fence against that then it is already racy. If it doesn't, then why
bother doing complicated work in release?

> > Then you can never get a stale
> > pointer. Don't worry about exit_mmap().
> > 
> > release() is an unusual callback and I see a lot of places using it
> > wrong. The purpose of release is to invalidate_all, that is it.
> > 
> > Also, confusingly release may be called multiple times in some
> > situations, so it shouldn't disturb anything that might impact a 2nd
> > call.
> 
> I hadn't realized that. The current implementation should be safe against
> it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you
> have an example of such a situation?  I'm trying to write tests for this
> kind of corner case.

Hmm, let me think. Ah, you have to be using mmu_notifier_unregister()
to get that race. This is one of the things that get/put don't suffer
from - but they conversely don't guarantee that release() will be
called, so it is up to the caller to ensure everything is fenced
before calling put.

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-06 13:09                   ` Jason Gunthorpe
@ 2020-03-06 14:35                     ` Jean-Philippe Brucker
  2020-03-06 14:52                       ` Jason Gunthorpe
  2020-03-16 15:46                     ` Christoph Hellwig
  1 sibling, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-06 14:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > in bind(), and mmu_notifier_put() is called in unbind(). 
> > 
> > Multiple device drivers may call bind() with the same mm. Each bind()
> > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > mmu_notifier_put().
> > 
> > That's the most common case. Now if the process is killed and the mm
> > disappears, we do need to avoid use-after-free caused by DMA of the
> > mappings and the page tables. 
> 
> This is why release must do invalidate all - but it doesn't need to do
> any more - as no SPTE can be established without a mmget() - and
> mmget() is no longer possible past release.

In our case we don't have SPTEs, the whole pgd is shared between MMU and
IOMMU (isolated using PASID tables).

Taking the concrete example of the crypto accelerator:

1. A process opens a queue in the accelerator. That queue is bound to the
   address space: a PASID is allocated for the mm, and mm->pgd is written
   into the IOMMU PASID table.
2. The process queues some work and waits. In the background, the
   accelerators performs DMA on the process address space, by using the
   mm's PASID.
3. Now the process gets killed, and release() is called.

At this point no one told the device to stop working on this queue, it may
still be doing DMA on this address space. So the first thing we do is
notify the device driver that the bond is going away, and that it must
stop the queue and flush remaining DMA transactions for this PASID.

Then we also clear the pgd from the IOMMU PASID table. If we only did
invalidate-all and somehow the queue wasn't properly stopped, concurrent
DMA would immediately form new IOTLB entries since the page tables haven't
been wiped at this point. And later, it would use-after-free page tables
and mappings. Whereas with a clear pgd it would just generate IOMMU fault
events, which are undesirable but harmless.
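
Put as a timeline:

        With only invalidate-all in release(), and a queue that wasn't stopped:
          t0  release(): IOTLB entries for the PASID are invalidated
          t1  the device issues DMA -> the IOMMU walks mm->pgd and refills
              the IOTLB
          t2  exit_mmap() frees the page tables -> the table walker now reads
              and caches freed memory

        With the pgd also cleared from the PASID table in release():
          t1' the device issues DMA -> translation fault, quietly aborted or
              reported; the freed page tables are never touched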

Thanks,
Jean

> > So the release() callback, before doing invalidate_all, stops DMA
> > and clears the page table pointer on the IOMMU side. It detaches all
> > bonds from the io_mm, calling mmu_notifier_put() for each of
> > them. After release(), bond objects still exist and device drivers
> > still need to free them with unbind(), but they don't point to an
> > io_mm anymore.
> 
> Why is so much work needed in release? It really should just be
> invalidate all, usually trying to sort out all the locking for the
> more complicated stuff is not worthwhile.
> 
> If other stuff is implicitly relying on the mm being alive and release
> to fence against that then it is already racy. If it doesn't, then why
> bother doing complicated work in release?
> 
> > > Then you can never get a stale
> > > pointer. Don't worry about exit_mmap().
> > > 
> > > release() is an unusual callback and I see a lot of places using it
> > > wrong. The purpose of release is to invalidate_all, that is it.
> > > 
> > > Also, confusingly release may be called multiple times in some
> > > situations, so it shouldn't disturb anything that might impact a 2nd
> > > call.
> > 
> > I hadn't realized that. The current implementation should be safe against
> > it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you
> > have an example of such a situation?  I'm trying to write tests for this
> > kind of corner cases.
> 
> Hmm, let me think. Ah, you have to be using mmu_notifier_unregister()
> to get that race. This is one of the things that get/put don't suffer
> from - but they conversely don't guarantee that release() will be
> called, so it is up to the caller to ensure everything is fenced
> before calling put.
> 
> Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-06 14:35                     ` Jean-Philippe Brucker
@ 2020-03-06 14:52                       ` Jason Gunthorpe
  2020-03-06 16:15                         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 70+ messages in thread
From: Jason Gunthorpe @ 2020-03-06 14:52 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > 
> > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > mmu_notifier_put().
> > > 
> > > That's the most common case. Now if the process is killed and the mm
> > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > mappings and the page tables. 
> > 
> > This is why release must do invalidate all - but it doesn't need to do
> > any more - as no SPTE can be established without a mmget() - and
> > mmget() is no longer possible past release.
> 
> In our case we don't have SPTEs, the whole pgd is shared between MMU and
> IOMMU (isolated using PASID tables).

Okay, but this just means that 'invalidate all' also requires
switching the PASID to use some pgd that is permanently 'all fail'.

> At this point no one told the device to stop working on this queue,
> it may still be doing DMA on this address space.

Sure, but there are lots of cases where a defective user space can
cause pages under active DMA to disappear, like munmap for
instance. Process exit is really no different, the PASID should take
errors and the device & driver should do whatever error flow it has.

Involving a complex driver flow in the exit_mmap path seems like
dangerous complexity to me.

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-06 14:52                       ` Jason Gunthorpe
@ 2020-03-06 16:15                         ` Jean-Philippe Brucker
  2020-03-06 17:42                           ` Jason Gunthorpe
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-06 16:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > > 
> > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > mmu_notifier_put().
> > > > 
> > > > That's the most common case. Now if the process is killed and the mm
> > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > mappings and the page tables. 
> > > 
> > > This is why release must do invalidate all - but it doesn't need to do
> > > any more - as no SPTE can be established without a mmget() - and
> > > mmget() is no longer possible past release.
> > 
> > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > IOMMU (isolated using PASID tables).
> 
> Okay, but this just means that 'invalidate all' also requires
> switching the PASID to use some pgd that is permanently 'all fail'.
> 
> > At this point no one told the device to stop working on this queue,
> > it may still be doing DMA on this address space.
> 
> Sure, but there are lots of cases where a defective user space can
> cause pages under active DMA to disappear, like munmap for
> instance. Process exit is really no different, the PASID should take
> errors and the device & driver should do whatever error flow it has.

We do have the possibility to shut things down in order, so to me this
feels like a band-aid. The idea has come up before though [1], and I'm not
strongly opposed to this model, but I'm still not convinced it's
necessary. It does add more complexity to IOMMU drivers, to avoid printing
out the errors that we wouldn't otherwise see, whereas device drivers need
in any case to implement the logic that forcibly stops DMA.

Thanks,
Jean

[1] https://lore.kernel.org/linux-iommu/4d68da96-0ad5-b412-5987-2f7a6aa796c3@amd.com/

> 
> Involving a complex driver flow in the exit_mmap path seems like
> dangerous complexity to me.
> 
> Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-06 16:15                         ` Jean-Philippe Brucker
@ 2020-03-06 17:42                           ` Jason Gunthorpe
  2020-03-13 18:49                             ` Jean-Philippe Brucker
  0 siblings, 1 reply; 70+ messages in thread
From: Jason Gunthorpe @ 2020-03-06 17:42 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > > > 
> > > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > > mmu_notifier_put().
> > > > > 
> > > > > That's the most common case. Now if the process is killed and the mm
> > > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > > mappings and the page tables. 
> > > > 
> > > > This is why release must do invalidate all - but it doesn't need to do
> > > > any more - as no SPTE can be established without a mmget() - and
> > > > mmget() is no longer possible past release.
> > > 
> > > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > > IOMMU (isolated using PASID tables).
> > 
> > Okay, but this just means that 'invalidate all' also requires
> > switching the PASID to use some pgd that is permanently 'all fail'.
> > 
> > > At this point no one told the device to stop working on this queue,
> > > it may still be doing DMA on this address space.
> > 
> > Sure, but there are lots of cases where a defective user space can
> > cause pages under active DMA to disappear, like munmap for
> > instance. Process exit is really no different, the PASID should take
> > errors and the device & driver should do whatever error flow it has.
> 
> We do have the possibility to shut things down in order, so to me this
> feels like a band-aid. 

->release() is called by exit_mmap which is called by mmput. There are
over 100 callsites to mmput() and I'm not totally sure what the
rules are for release(). We've run into problems before with things
like this.

IMHO, due to this, it is best for release to be simple and have
conservative requirements on context like all the other notifier
callbacks. It is not a good place to put complex HW fencing driver
code.

In particular that link you referenced is suggesting the driver tear
down could take minutes - IMHO it is not OK to block mmput() for
minutes.

> The idea has come up before though [1], and I'm not strongly opposed
> to this model, but I'm still not convinced it's necessary. It does
> add more complexity to IOMMU drivers, to avoid printing out the
> errors that we wouldn't otherwise see, whereas device drivers need
> in any case to implement the logic that forcibly stops DMA.

Errors should not be printed to the kernel log for PASID cases
anyhow. PASID will be used by unprivileged users, and an unprivileged
user should not be able to trigger kernel prints at will, e.g. by doing
DMA to an unmapped VA or whatever.

Process exit is just another case of this, and should not be treated
specially.

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices
  2020-03-04 14:08     ` Jean-Philippe Brucker
@ 2020-03-09 10:48       ` Jonathan Cameron
  0 siblings, 0 replies; 70+ messages in thread
From: Jonathan Cameron @ 2020-03-09 10:48 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm, joro,
	robh+dt, mark.rutland, catalin.marinas, will, robin.murphy,
	kevin.tian, baolu.lu, jacob.jun.pan, christian.koenig, yi.l.liu,
	zhangfei.gao, Jean-Philippe Brucker

On Wed, 4 Mar 2020 15:08:33 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> On Thu, Feb 27, 2020 at 06:17:26PM +0000, Jonathan Cameron wrote:
> > On Mon, 24 Feb 2020 19:23:58 +0100
> > Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> >   
> > > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > > 
> > > The SMMU provides a Stall model for handling page faults in platform
> > > devices. It is similar to PCI PRI, but doesn't require devices to have
> > > their own translation cache. Instead, faulting transactions are parked and
> > > the OS is given a chance to fix the page tables and retry the transaction.
> > > 
> > > Enable stall for devices that support it (opt-in by firmware). When an
> > > event corresponds to a translation error, call the IOMMU fault handler. If
> > > the fault is recoverable, it will call us back to terminate or continue
> > > the stall.
> > > 
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>  
> > One question inline.
> > 
> > Thanks,
> >   
> > > ---
> > >  drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++--
> > >  drivers/iommu/of_iommu.c    |   5 +-
> > >  include/linux/iommu.h       |   2 +
> > >  3 files changed, 269 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > index 6a5987cce03f..da5dda5ba26a 100644
> > > --- a/drivers/iommu/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > @@ -374,6 +374,13 @@  
> > 
> >   
> > > +/*
> > > + * arm_smmu_flush_evtq - wait until all events currently in the queue have been
> > > + *                       consumed.
> > > + *
> > > + * Wait until the evtq thread finished a batch, or until the queue is empty.
> > > + * Note that we don't handle overflows on q->batch. If it occurs, just wait for
> > > + * the queue to be empty.
> > > + */
> > > +static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid)
> > > +{
> > > +	int ret;
> > > +	u64 batch;
> > > +	struct arm_smmu_device *smmu = cookie;
> > > +	struct arm_smmu_queue *q = &smmu->evtq.q;
> > > +
> > > +	spin_lock(&q->wq.lock);
> > > +	if (queue_sync_prod_in(q) == -EOVERFLOW)
> > > +		dev_err(smmu->dev, "evtq overflow detected -- requests lost\n");
> > > +
> > > +	batch = q->batch;  
> > 
> > So this is trying to be sure we have advanced the queue 2 spots?  
> 
> So we call arm_smmu_flush_evtq() before decommissioning a PASID, to make
> sure that there aren't any pending events for this PASID languishing in the
> fault queues.
> 
> The main test is queue_empty(). If that succeeds then we know that there
> aren't any pending events (and the PASID is safe to reuse). But if new
> events are constantly added to the queue then we wait for the evtq thread
> to handle a full batch, where one batch corresponds to the queue size. For
> that we take the batch number when entering flush(), and wait for the evtq
> thread to increment it twice.
> 
> > Is there a potential race here?  q->batch could have updated before we take
> > a local copy.  
> 
> Yes we're just checking on the progress of the evtq thread. All accesses
> to batch are made while holding the wq lock.
> 
> Flush is a rare event so the lock isn't contended, but the wake_up() that
> this patch introduces in arm_smmu_evtq_thread() does add some overhead
> (0.85% of arm_smmu_evtq_thread(), according to perf). It would be nice to
> get rid of it but I haven't found anything clever yet.
> 

Thanks.  Maybe worth a few comments in the code as this is a bit esoteric.

Thanks,

Jonathan

> Thanks,
> Jean
> 
> >   
> > > +	ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) ||
> > > +					      q->batch >= batch + 2);
> > > +	spin_unlock(&q->wq.lock);
> > > +
> > > +	return ret;
> > > +}
> > > +  
> > ...
> >   



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-06 17:42                           ` Jason Gunthorpe
@ 2020-03-13 18:49                             ` Jean-Philippe Brucker
  2020-03-13 19:13                               ` Jason Gunthorpe
  0 siblings, 1 reply; 70+ messages in thread
From: Jean-Philippe Brucker @ 2020-03-13 18:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Mar 06, 2020 at 01:42:39PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > > > > 
> > > > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > > > mmu_notifier_put().
> > > > > > 
> > > > > > That's the most common case. Now if the process is killed and the mm
> > > > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > > > mappings and the page tables. 
> > > > > 
> > > > > This is why release must do invalidate all - but it doesn't need to do
> > > > > any more - as no SPTE can be established without a mmget() - and
> > > > > mmget() is no longer possible past release.
> > > > 
> > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > > > IOMMU (isolated using PASID tables).
> > > 
> > > Okay, but this just means that 'invalidate all' also requires
> > > switching the PASID to use some pgd that is permanently 'all fail'.
> > > 
> > > > At this point no one told the device to stop working on this queue,
> > > > it may still be doing DMA on this address space.
> > > 
> > > Sure, but there are lots of cases where a defective user space can
> > > cause pages under active DMA to disappear, like munmap for
> > > instance. Process exit is really no different, the PASID should take
> > > errors and the device & driver should do whatever error flow it has.
> > 
> > We do have the possibility to shut things down in order, so to me this
> > feels like a band-aid. 
> 
> ->release() is called by exit_mmap which is called by mmput. There are
> over 100 callsites to mmput() and I'm not totally sure what the
> rules are for release(). We've run into problems before with things
> like this.

A concrete example of something that could go badly if mmput() takes too
long would greatly help. Otherwise I'll have a hard time justifying the
added complexity.

I wrote a prototype that removes the device driver callback from
release(). It works with SMMUv3, but complicates the PASID descriptor
code, which is already awful with my recent changes and this series.

> IMHO, due to this, it is best for release to be simple and have
> conservative requirements on context like all the other notifier
> callbacks. It is not a good place to put complex HW fencing driver
> code.
> 
> In particular that link you referenced is suggesting the driver tear
> down could take minutes - IMHO it is not OK to block mmput() for
> minutes.
> 
> > The idea has come up before though [1], and I'm not strongly opposed
> > to this model, but I'm still not convinced it's necessary. It does
> > add more complexity to IOMMU drivers, to avoid printing out the
> > errors that we wouldn't otherwise see, whereas device drivers need
> > in any case to implement the logic that forces stop DMA.
> 
> Errors should not be printed to the kernel log for PASID cases
> anyhow. PASID will be used by unprivileged users, and an unprivileged
> user should not be able to trigger kernel prints at will, e.g. by doing
> DMA to an unmapped VA or whatever.

I agree. There is a difference, though, between invalid mappings and the
absence of a pgd. The former comes from userspace issuing DMA on unmapped
buffers, while the latter is typically a device/driver error which
normally needs to be reported.

On Arm SMMUv3 they are handled differently by the hardware. But instead of
disabling the whole PASID context on mm exit, we can quietly abort
incoming transactions while waiting for unbind().
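
One possible shape of this in the SMMUv3 driver, assuming the driver's
arm_smmu_write_ctx_desc() and arm_smmu_atc_inv_domain() helpers; the quiet_cd
descriptor, the EPD0-only TCR and the surrounding function are a sketch, not
code from this series:

/*
 * A context descriptor that stays valid (no flood of C_BAD_CD events) but
 * has EPD0 set, so stage-1 walks for this PASID are disabled and incoming
 * transactions fault quietly instead of walking freed page tables.
 */
static struct arm_smmu_ctx_desc quiet_cd = {
        .tcr = CTXDESC_CD_0_TCR_EPD0,
};

static void arm_smmu_quiesce_pasid(struct arm_smmu_domain *smmu_domain,
                                   int pasid)
{
        arm_smmu_write_ctx_desc(smmu_domain, pasid, &quiet_cd);
        /* Drop cached translations so the device stops using them */
        arm_smmu_atc_inv_domain(smmu_domain, pasid, 0, 0);
}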

And I think the other IOMMUs treat an invalid PASID descriptor the same as
an invalid translation table descriptor. At least VT-d quietly returns a
no-translation response to ATS TR and rejects PRI PR. I haven't found the
equivalent in the AMD IOMMU spec yet.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-13 18:49                             ` Jean-Philippe Brucker
@ 2020-03-13 19:13                               ` Jason Gunthorpe
  0 siblings, 0 replies; 70+ messages in thread
From: Jason Gunthorpe @ 2020-03-13 19:13 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: mark.rutland, linux-pci, linux-mm, will, Dimitri Sivanich,
	catalin.marinas, zhangfei.gao, devicetree, kevin.tian,
	Arnd Bergmann, robh+dt, linux-arm-kernel, Greg Kroah-Hartman,
	iommu, Andrew Morton, robin.murphy, christian.koenig

On Fri, Mar 13, 2020 at 07:49:29PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 06, 2020 at 01:42:39PM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > > > > in bind(), and mmu_notifier_put() is called in unbind(). 
> > > > > > > 
> > > > > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > > > > mmu_notifier_put().
> > > > > > > 
> > > > > > > That's the most common case. Now if the process is killed and the mm
> > > > > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > > > > mappings and the page tables. 
> > > > > > 
> > > > > > This is why release must do invalidate all - but it doesn't need to do
> > > > > > any more - as no SPTE can be established without a mmget() - and
> > > > > > mmget() is no longer possible past release.
> > > > > 
> > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > > > > IOMMU (isolated using PASID tables).
> > > > 
> > > > Okay, but this just means that 'invalidate all' also requires
> > > > switching the PASID to use some pgd that is permanently 'all fail'.
> > > > 
> > > > > At this point no one told the device to stop working on this queue,
> > > > > it may still be doing DMA on this address space.
> > > > 
> > > > Sure, but there are lots of cases where a defective user space can
> > > > cause pages under active DMA to disappear, like munmap for
> > > > instance. Process exit is really no different, the PASID should take
> > > > errors and the device & driver should do whatever error flow it has.
> > > 
> > > We do have the possibility to shut things down in order, so to me this
> > > feels like a band-aid. 
> > 
> > ->release() is called by exit_mmap(), which is called by mmput(). There
> > are over a hundred call sites to mmput() and I'm not totally sure what
> > the rules are for release(). We've run into problems before with things
> > like this.
> 
> A concrete example of something that could go badly if mmput() takes too
> long would greatly help. Otherwise I'll have a hard time justifying the
> added complexity.

It is not just that it can take too long, but also that running very
complex code in the release callback can accidentally cause locking
problems. Unless you audit all the mmput() call sites to define the
calling conditions, I can't even say what the risk is here.

In particular, calling something with impossible-to-audit locking, like
the dma_fence stuff, from release() makes it practically impossible to
prove safety and then keep it safe.

It is easy enough to see where taking too long can have a bad impact:
mmput() is called all over the place. Just in the RDMA code, slowing it
down would block ODP page faulting completely for all processes.
This is not acceptable.

For this reason release callbacks must be simple/fast and must have
trivial locking.

> > Errors should not be printed to the kernel log for PASID cases
> > anyhow. PASID will be used by unpriv user, and unpriv user should not
> > be able to trigger kernel prints at will, e.g. by doing DMA to an unmapped VA
> > or whatever. 
> 
> I agree. There is a difference, though, between invalid mappings and the
> absence of a pgd. The former comes from userspace issuing DMA on unmapped
> buffers, while the latter is typically a device/driver error which
> normally needs to be reported.

Why not make the pgd present as I suggested? Point it at a static
dummy pgd that always fails to page fault during release? Make the pgd
not present only once the PASID is fully destroyed.

That really is the only thing release is supposed to mean -> unmap all
VAs.
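
A rough sketch of that suggestion as an mmu_notifier release callback, reusing
the hypothetical io_mm from the earlier sketch; empty_pgd, io_mm_set_pgd() and
io_mm_invalidate_all() are made-up names standing in for whatever the IOMMU
driver provides:

static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
        struct io_mm *io_mm = container_of(mn, struct io_mm, notifier);

        /*
         * Keep the PASID entry present but point it at a statically
         * allocated, empty pgd: every transaction now faults instead of
         * walking freed page tables.
         */
        io_mm_set_pgd(io_mm, empty_pgd);

        /* Flush IOTLB and ATC so cached translations are dropped */
        io_mm_invalidate_all(io_mm);

        /*
         * The entry only becomes non-present later, in unbind(), once the
         * PASID is fully destroyed.
         */
}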

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-06 13:09                   ` Jason Gunthorpe
  2020-03-06 14:35                     ` Jean-Philippe Brucker
@ 2020-03-16 15:46                     ` Christoph Hellwig
  2020-03-17 18:40                       ` Jason Gunthorpe
  1 sibling, 1 reply; 70+ messages in thread
From: Christoph Hellwig @ 2020-03-16 15:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, mark.rutland, devicetree, kevin.tian,
	Dimitri Sivanich, Arnd Bergmann, Greg Kroah-Hartman, linux-pci,
	robin.murphy, linux-mm, iommu, robh+dt, catalin.marinas,
	zhangfei.gao, Andrew Morton, will, christian.koenig,
	linux-arm-kernel

On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> This is why release must do invalidate all - but it doesn't need to do
> any more - as no SPTE can be established without a mmget() - and
> mmget() is no longer possible past release.

Maybe we should rename the release method to invalidate_all?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier()
  2020-03-16 15:46                     ` Christoph Hellwig
@ 2020-03-17 18:40                       ` Jason Gunthorpe
  0 siblings, 0 replies; 70+ messages in thread
From: Jason Gunthorpe @ 2020-03-17 18:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jean-Philippe Brucker, mark.rutland, devicetree, kevin.tian,
	Dimitri Sivanich, Arnd Bergmann, Greg Kroah-Hartman, linux-pci,
	robin.murphy, linux-mm, iommu, robh+dt, catalin.marinas,
	zhangfei.gao, Andrew Morton, will, christian.koenig,
	linux-arm-kernel

On Mon, Mar 16, 2020 at 08:46:59AM -0700, Christoph Hellwig wrote:
> On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > This is why release must do invalidate all - but it doesn't need to do
> > any more - as no SPTE can be established without a mmget() - and
> > mmget() is no longer possible past release.
> 
> Maybe we should rename the release method to invalidate_all?

It is a better name. The function must also fence future access if
the mirror is not using mmget(), and stop using the pgd/etc pointer if
the page tables are accessed directly.

Jason

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 21/26] iommu/arm-smmu-v3: Ratelimit event dump
  2020-02-24 18:23 ` [PATCH v4 21/26] iommu/arm-smmu-v3: Ratelimit event dump Jean-Philippe Brucker
@ 2021-05-28  8:09   ` Aaro Koskinen
  2021-05-28 16:25     ` Jean-Philippe Brucker
  0 siblings, 1 reply; 70+ messages in thread
From: Aaro Koskinen @ 2021-05-28  8:09 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm,
	mark.rutland, kevin.tian, jacob.jun.pan, catalin.marinas, joro,
	robin.murphy, robh+dt, yi.l.liu, Jonathan.Cameron, zhangfei.gao,
	will, christian.koenig, baolu.lu

Hi,

On Mon, Feb 24, 2020 at 07:23:56PM +0100, Jean-Philippe Brucker wrote:
> When a device or driver misbehaves, it is possible to receive events
> much faster than we can print them out. Ratelimit the printing of
> events.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>

Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>

> During the SVA tests when the device driver didn't properly stop DMA
> before unbinding, the event queue thread would almost lock-up the server
> with a flood of event 0xa. This patch helped recover from the error.

I was just debugging a similar case, and this patch was required to
prevent the system from locking up.

Could you please resend this patch independently from the other patches
in the series, as it seems to be a worthwhile fix and is still relevant for
current kernels. Thanks,

A.

> ---
>  drivers/iommu/arm-smmu-v3.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 28f8583cd47b..6a5987cce03f 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -2243,17 +2243,20 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
>  	struct arm_smmu_device *smmu = dev;
>  	struct arm_smmu_queue *q = &smmu->evtq.q;
>  	struct arm_smmu_ll_queue *llq = &q->llq;
> +	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
>  	u64 evt[EVTQ_ENT_DWORDS];
>  
>  	do {
>  		while (!queue_remove_raw(q, evt)) {
>  			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
>  
> -			dev_info(smmu->dev, "event 0x%02x received:\n", id);
> -			for (i = 0; i < ARRAY_SIZE(evt); ++i)
> -				dev_info(smmu->dev, "\t0x%016llx\n",
> -					 (unsigned long long)evt[i]);
> -
> +			if (__ratelimit(&rs)) {
> +				dev_info(smmu->dev, "event 0x%02x received:\n", id);
> +				for (i = 0; i < ARRAY_SIZE(evt); ++i)
> +					dev_info(smmu->dev, "\t0x%016llx\n",
> +						 (unsigned long long)evt[i]);
> +			}
>  		}
>  
>  		/*

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v4 21/26] iommu/arm-smmu-v3: Ratelimit event dump
  2021-05-28  8:09   ` Aaro Koskinen
@ 2021-05-28 16:25     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 70+ messages in thread
From: Jean-Philippe Brucker @ 2021-05-28 16:25 UTC (permalink / raw)
  To: Aaro Koskinen
  Cc: iommu, devicetree, linux-arm-kernel, linux-pci, linux-mm,
	mark.rutland, kevin.tian, jacob.jun.pan, catalin.marinas, joro,
	robin.murphy, robh+dt, yi.l.liu, Jonathan.Cameron, zhangfei.gao,
	will, christian.koenig, baolu.lu

Hi Aaro,

On Fri, May 28, 2021 at 11:09:58AM +0300, Aaro Koskinen wrote:
> Hi,
> 
> On Mon, Feb 24, 2020 at 07:23:56PM +0100, Jean-Philippe Brucker wrote:
> > When a device or driver misbehaves, it is possible to receive events
> > much faster than we can print them out. Ratelimit the printing of
> > events.
> > 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> 
> Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>
> 
> > During the SVA tests when the device driver didn't properly stop DMA
> > before unbinding, the event queue thread would almost lock-up the server
> > with a flood of event 0xa. This patch helped recover from the error.
> 
> I was just debugging a similar case, and this patch was required to
> prevent the system from locking up.
> 
> Could you please resend this patch independently from the other patches
> in the series, as it seems to be a worthwhile fix and is still relevant for
> current kernels. Thanks,

Ok, I'll resend it

Thanks,
Jean

> 
> A.
> 
> > ---
> >  drivers/iommu/arm-smmu-v3.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 28f8583cd47b..6a5987cce03f 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -2243,17 +2243,20 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
> >  	struct arm_smmu_device *smmu = dev;
> >  	struct arm_smmu_queue *q = &smmu->evtq.q;
> >  	struct arm_smmu_ll_queue *llq = &q->llq;
> > +	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
> > +				      DEFAULT_RATELIMIT_BURST);
> >  	u64 evt[EVTQ_ENT_DWORDS];
> >  
> >  	do {
> >  		while (!queue_remove_raw(q, evt)) {
> >  			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
> >  
> > -			dev_info(smmu->dev, "event 0x%02x received:\n", id);
> > -			for (i = 0; i < ARRAY_SIZE(evt); ++i)
> > -				dev_info(smmu->dev, "\t0x%016llx\n",
> > -					 (unsigned long long)evt[i]);
> > -
> > +			if (__ratelimit(&rs)) {
> > +				dev_info(smmu->dev, "event 0x%02x received:\n", id);
> > +				for (i = 0; i < ARRAY_SIZE(evt); ++i)
> > +					dev_info(smmu->dev, "\t0x%016llx\n",
> > +						 (unsigned long long)evt[i]);
> > +			}
> >  		}
> >  
> >  		/*

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2021-05-28 16:26 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-24 18:23 [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 01/26] mm/mmu_notifiers: pass private data down to alloc_notifier() Jean-Philippe Brucker
2020-02-24 19:00   ` Jason Gunthorpe
2020-02-25  9:24     ` Jean-Philippe Brucker
2020-02-25 14:08       ` Jason Gunthorpe
2020-02-28 14:39         ` Jean-Philippe Brucker
2020-02-28 14:48           ` Jason Gunthorpe
2020-02-28 15:04             ` Jean-Philippe Brucker
2020-02-28 15:13               ` Jason Gunthorpe
2020-03-06  9:56                 ` Jean-Philippe Brucker
2020-03-06 13:09                   ` Jason Gunthorpe
2020-03-06 14:35                     ` Jean-Philippe Brucker
2020-03-06 14:52                       ` Jason Gunthorpe
2020-03-06 16:15                         ` Jean-Philippe Brucker
2020-03-06 17:42                           ` Jason Gunthorpe
2020-03-13 18:49                             ` Jean-Philippe Brucker
2020-03-13 19:13                               ` Jason Gunthorpe
2020-03-16 15:46                     ` Christoph Hellwig
2020-03-17 18:40                       ` Jason Gunthorpe
2020-03-05 16:36   ` Christoph Hellwig
2020-02-24 18:23 ` [PATCH v4 02/26] iommu/sva: Manage process address spaces Jean-Philippe Brucker
2020-02-26 12:35   ` Jonathan Cameron
2020-02-28 14:43     ` Jean-Philippe Brucker
2020-02-28 16:26       ` Jonathan Cameron
2020-02-26 19:13   ` Jacob Pan
2020-02-28 14:40     ` Jean-Philippe Brucker
2020-02-28 14:57       ` Jason Gunthorpe
2020-02-24 18:23 ` [PATCH v4 03/26] iommu: Add a page fault handler Jean-Philippe Brucker
2020-02-25  3:30   ` Xu Zaibo
2020-02-25  9:25     ` Jean-Philippe Brucker
2020-02-26  3:05       ` Xu Zaibo
2020-02-26 13:59   ` Jonathan Cameron
2020-02-28 14:44     ` Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 04/26] iommu/sva: Search mm by PASID Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 05/26] iommu/iopf: Handle mm faults Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 06/26] iommu/sva: Register page fault handler Jean-Philippe Brucker
2020-02-26 19:39   ` Jacob Pan
2020-02-28 14:44     ` Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 07/26] arm64: mm: Pin down ASIDs for sharing mm with devices Jean-Philippe Brucker
2020-02-27 17:43   ` Jonathan Cameron
2020-03-04 14:10     ` Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 08/26] iommu/io-pgtable-arm: Move some definitions to a header Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 09/26] iommu/arm-smmu-v3: Manage ASIDs with xarray Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 10/26] arm64: cpufeature: Export symbol read_sanitised_ftr_reg() Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 11/26] iommu/arm-smmu-v3: Share process page tables Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 12/26] iommu/arm-smmu-v3: Seize private ASID Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 13/26] iommu/arm-smmu-v3: Add support for VHE Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 14/26] iommu/arm-smmu-v3: Enable broadcast TLB maintenance Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 15/26] iommu/arm-smmu-v3: Add SVA feature checking Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 16/26] iommu/arm-smmu-v3: Add dev_to_master() helper Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 17/26] iommu/arm-smmu-v3: Implement mm operations Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 18/26] iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 19/26] iommu/arm-smmu-v3: Add support for Hardware Translation Table Update Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 20/26] iommu/arm-smmu-v3: Maintain a SID->device structure Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 21/26] iommu/arm-smmu-v3: Ratelimit event dump Jean-Philippe Brucker
2021-05-28  8:09   ` Aaro Koskinen
2021-05-28 16:25     ` Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 22/26] dt-bindings: document stall property for IOMMU masters Jean-Philippe Brucker
2020-02-24 18:23 ` [PATCH v4 23/26] iommu/arm-smmu-v3: Add stall support for platform devices Jean-Philippe Brucker
2020-02-26  8:44   ` Xu Zaibo
2020-03-04 14:09     ` Jean-Philippe Brucker
2020-02-27 18:17   ` Jonathan Cameron
2020-03-04 14:08     ` Jean-Philippe Brucker
2020-03-09 10:48       ` Jonathan Cameron
2020-02-24 18:23 ` [PATCH v4 24/26] PCI/ATS: Add PRI stubs Jean-Philippe Brucker
2020-02-27 20:55   ` Bjorn Helgaas
2020-02-24 18:24 ` [PATCH v4 25/26] PCI/ATS: Export symbols of PRI functions Jean-Philippe Brucker
2020-02-27 20:55   ` Bjorn Helgaas
2020-02-24 18:24 ` [PATCH v4 26/26] iommu/arm-smmu-v3: Add support for PRI Jean-Philippe Brucker
2020-02-27 18:22 ` [PATCH v4 00/26] iommu: Shared Virtual Addressing and SMMUv3 support Jonathan Cameron

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).