Shared Virtual Addressing (SVA) allows sharing process page tables with devices using the IOMMU. Add a generic implementation of the IOMMU SVA API, and add support in the Arm SMMUv3 driver.

Previous versions of this patchset were sent over a year ago [1][2] but we've made a lot of progress since then:

* ATS support for SMMUv3 was merged in v5.2.
* The bind() and fault reporting APIs have been merged in v5.3.
* IOASID was added in v5.5.
* SMMUv3 PASID was added in v5.6, with some pending for v5.7.
* The first user of the bind() API will be merged in v5.7 [3].

The zip accelerator is also the first piece of hardware that I've been able to use for testing (previous versions were developed with software models) and I now have tools for evaluating SVA performance. Unfortunately I still don't have hardware that supports ATS and PRI; the zip accelerator uses stall.

These are the remaining changes for SVA support in SMMUv3. Since v3 [1] I fixed countless bugs and, I think, addressed everyone's comments. Thanks to the recent MMU notifier rework, iommu-sva.c is a lot more straightforward. I'm still unhappy with the complicated locking in the SMMUv3 driver resulting from patch 12 (Seize private ASID), but I haven't found anything better.
Please find all SVA patches on branches sva/current and sva/zip-devel at
https://jpbrucker.net/git/linux

[1] https://lore.kernel.org/linux-iommu/20180920170046.20154-1-jean-philippe.brucker@arm.com/
[2] https://lore.kernel.org/linux-iommu/20180511190641.23008-1-jean-philippe.brucker@arm.com/
[3] https://lore.kernel.org/linux-iommu/1581407665-13504-1-git-send-email-zhangfei.gao@linaro.org/

Jean-Philippe Brucker (26):
  mm/mmu_notifiers: pass private data down to alloc_notifier()
  iommu/sva: Manage process address spaces
  iommu: Add a page fault handler
  iommu/sva: Search mm by PASID
  iommu/iopf: Handle mm faults
  iommu/sva: Register page fault handler
  arm64: mm: Pin down ASIDs for sharing mm with devices
  iommu/io-pgtable-arm: Move some definitions to a header
  iommu/arm-smmu-v3: Manage ASIDs with xarray
  arm64: cpufeature: Export symbol read_sanitised_ftr_reg()
  iommu/arm-smmu-v3: Share process page tables
  iommu/arm-smmu-v3: Seize private ASID
  iommu/arm-smmu-v3: Add support for VHE
  iommu/arm-smmu-v3: Enable broadcast TLB maintenance
  iommu/arm-smmu-v3: Add SVA feature checking
  iommu/arm-smmu-v3: Add dev_to_master() helper
  iommu/arm-smmu-v3: Implement mm operations
  iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops
  iommu/arm-smmu-v3: Add support for Hardware Translation Table Update
  iommu/arm-smmu-v3: Maintain a SID->device structure
  iommu/arm-smmu-v3: Ratelimit event dump
  dt-bindings: document stall property for IOMMU masters
  iommu/arm-smmu-v3: Add stall support for platform devices
  PCI/ATS: Add PRI stubs
  PCI/ATS: Export symbols of PRI functions
  iommu/arm-smmu-v3: Add support for PRI

 .../devicetree/bindings/iommu/iommu.txt |   18 +
 arch/arm64/include/asm/mmu.h            |    1 +
 arch/arm64/include/asm/mmu_context.h    |   11 +-
 arch/arm64/kernel/cpufeature.c          |    1 +
 arch/arm64/mm/context.c                 |  103 +-
 drivers/iommu/Kconfig                   |   13 +
 drivers/iommu/Makefile                  |    2 +
 drivers/iommu/arm-smmu-v3.c             | 1354 +++++++++++++++--
 drivers/iommu/io-pgfault.c              |  533 +++++++
 drivers/iommu/io-pgtable-arm.c          |   27 +-
 drivers/iommu/io-pgtable-arm.h          |   30 +
 drivers/iommu/iommu-sva.c               |  596 ++++++++
 drivers/iommu/iommu-sva.h               |   64 +
 drivers/iommu/iommu.c                   |    1 +
 drivers/iommu/of_iommu.c                |    5 +-
 drivers/misc/sgi-gru/grutlbpurge.c      |    4 +-
 drivers/pci/ats.c                       |    4 +
 include/linux/iommu.h                   |   73 +
 include/linux/mmu_notifier.h            |   10 +-
 include/linux/pci-ats.h                 |    8 +
 mm/mmu_notifier.c                       |    6 +-
 21 files changed, 2699 insertions(+), 165 deletions(-)
 create mode 100644 drivers/iommu/io-pgfault.c
 create mode 100644 drivers/iommu/io-pgtable-arm.h
 create mode 100644 drivers/iommu/iommu-sva.c
 create mode 100644 drivers/iommu/iommu-sva.h

--
2.25.0
The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers: add a get/put scheme for the registration") provides a convenient way for users to attach notifier data to an mm. However, it would be even better to create this notifier data atomically.

Since the alloc_notifier() callback only takes an mm argument at the moment, some users have to perform the allocation in two steps. alloc_notifier() initially creates an incomplete structure, which is then finalized using more context once mmu_notifier_get() returns. This second step requires carrying an initialization lock in the notifier data and playing dirty tricks to order memory accesses against live invalidation.

To simplify MMU notifier allocation, pass an allocation context to mmu_notifier_get().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/misc/sgi-gru/grutlbpurge.c |  4 ++--
 include/linux/mmu_notifier.h       | 10 ++++++----
 mm/mmu_notifier.c                  |  6 ++++--
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c index 10921cd2608d..77610e1704f6 100644 --- a/drivers/misc/sgi-gru/grutlbpurge.c +++ b/drivers/misc/sgi-gru/grutlbpurge.c @@ -235,7 +235,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn, gms, range->start, range->end); } -static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm) +static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm, void *privdata) { struct gru_mm_struct *gms; @@ -266,7 +266,7 @@ struct gru_mm_struct *gru_register_mmu_notifier(void) { struct mmu_notifier *mn; - mn = mmu_notifier_get_locked(&gru_mmuops, current->mm); + mn = mmu_notifier_get_locked(&gru_mmuops, current->mm, NULL); if (IS_ERR(mn)) return ERR_CAST(mn); diff --git 
a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 736f6918335e..06e68fa2b019 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -207,7 +207,7 @@ struct mmu_notifier_ops { * callbacks are currently running. It is called from a SRCU callback * and cannot sleep. */ - struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm); + struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm, void *privdata); void (*free_notifier)(struct mmu_notifier *subscription); }; @@ -271,14 +271,16 @@ static inline int mm_has_notifiers(struct mm_struct *mm) } struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops, - struct mm_struct *mm); + struct mm_struct *mm, + void *privdata); static inline struct mmu_notifier * -mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm) +mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm, + void *privdata) { struct mmu_notifier *ret; down_write(&mm->mmap_sem); - ret = mmu_notifier_get_locked(ops, mm); + ret = mmu_notifier_get_locked(ops, mm, privdata); up_write(&mm->mmap_sem); return ret; } diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index ef3973a5d34a..8beb9dcbe0fd 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -734,6 +734,7 @@ find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops) * the mm & ops * @ops: The operations struct being subscribe with * @mm : The mm to attach notifiers too + * @privdata: Initialization data passed down to ops->alloc_notifier() * * This function either allocates a new mmu_notifier via * ops->alloc_notifier(), or returns an already existing notifier on the @@ -747,7 +748,8 @@ find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops) * and can be converted to an active mm pointer via mmget_not_zero(). 
*/ struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops, - struct mm_struct *mm) + struct mm_struct *mm, + void *privdata) { struct mmu_notifier *subscription; int ret; @@ -760,7 +762,7 @@ struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops, return subscription; } - subscription = ops->alloc_notifier(mm); + subscription = ops->alloc_notifier(mm, privdata); if (IS_ERR(subscription)) return subscription; subscription->ops = ops; -- 2.25.0
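
As a rough illustration of what the extra argument buys, here is a user-space sketch of the get-or-allocate pattern. It is not kernel code: struct notifier, notifier_get(), notifier_put() and the demo_* names are hypothetical stand-ins for struct mmu_notifier, mmu_notifier_get(), mmu_notifier_put() and a subscriber such as iommu-sva.c, the mm is reduced to an integer id, and locking is left out entirely. What it tries to show is that the allocation callback can build a complete object in one step because the caller's context arrives as privdata, and that privdata is only consumed when no notifier exists yet for the (ops, mm) pair.

```c
#include <stdlib.h>

struct notifier;

/* Hypothetical stand-in for struct mmu_notifier_ops. */
struct notifier_ops {
	struct notifier *(*alloc_notifier)(int mm_id, void *privdata);
	void (*free_notifier)(struct notifier *n);
};

/* Hypothetical stand-in for struct mmu_notifier. */
struct notifier {
	const struct notifier_ops *ops;
	int mm_id;
	unsigned int users;
	struct notifier *next;
};

/* All live notifiers; the kernel keeps them per mm and under a lock. */
static struct notifier *registry;

/* Return the notifier for (ops, mm_id), allocating it on first use. */
static struct notifier *notifier_get(const struct notifier_ops *ops,
				     int mm_id, void *privdata)
{
	struct notifier *n;

	for (n = registry; n; n = n->next) {
		if (n->ops == ops && n->mm_id == mm_id) {
			n->users++;	/* privdata is ignored here */
			return n;
		}
	}

	/* First user: the callback sees the caller's context right away. */
	n = ops->alloc_notifier(mm_id, privdata);
	if (!n)
		return NULL;
	n->ops = ops;
	n->mm_id = mm_id;
	n->users = 1;
	n->next = registry;
	registry = n;
	return n;
}

/* Drop one reference; unlink and free the notifier on the last put. */
static void notifier_put(struct notifier *n)
{
	struct notifier **p;

	if (--n->users)
		return;
	for (p = &registry; *p; p = &(*p)->next) {
		if (*p == n) {
			*p = n->next;
			break;
		}
	}
	n->ops->free_notifier(n);
}

/* Example subscriber: its private data is complete after allocation. */
struct demo_params {
	int min_id, max_id;
};

struct demo_notifier {
	struct notifier base;	/* must stay the first member */
	int min_id, max_id;
};

static struct notifier *demo_alloc(int mm_id, void *privdata)
{
	const struct demo_params *params = privdata;
	struct demo_notifier *d = calloc(1, sizeof(*d));

	(void)mm_id;
	if (!d)
		return NULL;
	d->min_id = params->min_id;
	d->max_id = params->max_id;
	return &d->base;
}

static void demo_free(struct notifier *n)
{
	/* base is the first member, so n also points at the demo_notifier */
	free(n);
}

static const struct notifier_ops demo_ops = {
	.alloc_notifier = demo_alloc,
	.free_notifier = demo_free,
};
```

In this sketch a second notifier_get() for the same pair may pass NULL as privdata, since the allocation callback is not invoked again. That mirrors how gru_register_mmu_notifier() above can simply pass NULL.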
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Add a small library to help IOMMU drivers manage process address spaces bound to their devices. Register an MMU notifier to track modifications to each address space bound to one or more devices.

IOMMU drivers must implement the io_mm_ops and can then use the helpers provided by this library to easily implement the SVA API introduced by commit 26b25a2b98e4. The io_mm_ops are:

void *alloc(struct mm_struct *)
  Allocate a PASID context private to the IOMMU driver. There is a single context per mm. IOMMU drivers may perform arch-specific operations in there, for example pinning down a CPU ASID (on Arm).

int attach(struct device *, int pasid, void *ctx, bool attach_domain)
  Attach a context to the device, by setting up the PASID table entry.

void invalidate(struct device *, int pasid, void *ctx, unsigned long vaddr, size_t size)
  Invalidate TLB entries for this address range.

void detach(struct device *, int pasid, void *ctx, bool detach_domain)
  Detach a context from the device, by clearing the PASID table entry and invalidating cached entries.

void release(void *ctx)
  Free a context.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/Kconfig | 7 + drivers/iommu/Makefile | 1 + drivers/iommu/iommu-sva.c | 561 ++++++++++++++++++++++++++++++++++++++ drivers/iommu/iommu-sva.h | 64 +++++ drivers/iommu/iommu.c | 1 + include/linux/iommu.h | 3 + 6 files changed, 637 insertions(+) create mode 100644 drivers/iommu/iommu-sva.c create mode 100644 drivers/iommu/iommu-sva.h diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index d2fade984999..acca20e2da2f 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -102,6 +102,13 @@ config IOMMU_DMA select IRQ_MSI_IOMMU select NEED_SG_DMA_LENGTH +# Shared Virtual Addressing library +config IOMMU_SVA + bool + select IOASID + select IOMMU_API + select MMU_NOTIFIER + config FSL_PAMU bool "Freescale IOMMU support" depends on PCI diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 9f33fdb3bb05..40c800dd4e3e 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c new file mode 100644 index 000000000000..64f1d1c82383 --- /dev/null +++ b/drivers/iommu/iommu-sva.c @@ -0,0 +1,561 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Manage PASIDs and bind process address spaces to devices. + * + * Copyright (C) 2018 ARM Ltd. + */ + +#include <linux/idr.h> +#include <linux/ioasid.h> +#include <linux/iommu.h> +#include <linux/sched/mm.h> +#include <linux/slab.h> +#include <linux/spinlock.h> + +#include "iommu-sva.h" + +/** + * DOC: io_mm model + * + * The io_mm keeps track of process address spaces shared between CPU and IOMMU. + * The following example illustrates the relation between structures + * iommu_domain, io_mm and iommu_sva. 
The iommu_sva struct is a bond between
+ * io_mm and device. A device can have multiple io_mm and an io_mm may be bound
+ * to multiple devices.
+ *      ___________________________
+ *     |  IOMMU domain A           |
+ *     |  ________________         |
+ *     | |  IOMMU group   |        +------- io_pgtables
+ *     | |                |        |
+ *     | |   dev 00:00.0 ----+------- bond 1 --- io_mm X
+ *     | |________________|   \    |
+ *     |                       '----- bond 2 ---.
+ *     |___________________________|             \
+ *      ___________________________               \
+ *     |  IOMMU domain B           |             io_mm Y
+ *     |  ________________         |             / /
+ *     | |  IOMMU group   |        |            / /
+ *     | |                |        |           / /
+ *     | |   dev 00:01.0 ------------ bond 3 -' /
+ *     | |   dev 00:01.1 ------------ bond 4 --'
+ *     | |________________|        |
+ *     |                           +------- io_pgtables
+ *     |___________________________|
+ *
+ * In this example, device 00:00.0 is in domain A, devices 00:01.* are in domain
+ * B. All devices within the same domain access the same address spaces. Device
+ * 00:00.0 accesses address spaces X and Y, each corresponding to an mm_struct.
+ * Devices 00:01.* only access address space Y. In addition each
+ * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, that is
+ * managed with iommu_map()/iommu_unmap(), and isn't shared with the CPU MMU.
+ *
+ * To obtain the above configuration, users would for instance issue the
+ * following calls:
+ *
+ *     iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> bond 1
+ *     iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> bond 2
+ *     iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> bond 3
+ *     iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> bond 4
+ *
+ * A single Process Address Space ID (PASID) is allocated for each mm. In the
+ * example, devices use PASID 1 to read/write into address space X and PASID 2
+ * to read/write into address space Y. Calling iommu_sva_get_pasid() on bond 1
+ * returns 1, and calling it on bonds 2-4 returns 2.
+ *
+ * Hardware tables describing this configuration in the IOMMU would typically
+ * look like this:
+ *
+ *                                PASID tables
+ *                                 of domain A
+ *                              .->+--------+
+ *                             / 0 |        |-------> io_pgtable
+ *                            /    +--------+
+ *            Device tables  /   1 |        |-------> pgd X
+ *              +--------+  /      +--------+
+ *      00:00.0 |      A |-'     2 |        |--.
+ *              +--------+         +--------+   \
+ *              :        :       3 |        |    \
+ *              +--------+         +--------+     --> pgd Y
+ *      00:01.0 |      B |--.                    /
+ *              +--------+   \                  |
+ *      00:01.1 |      B |----+   PASID tables  |
+ *              +--------+     \   of domain B  |
+ *                              '->+--------+   |
+ *                               0 |        |-- | --> io_pgtable
+ *                                 +--------+   |
+ *                               1 |        |   |
+ *                                 +--------+   |
+ *                               2 |        |---'
+ *                                 +--------+
+ *                               3 |        |
+ *                                 +--------+
+ *
+ * With this model, a single call binds all devices in a given domain to an
+ * address space. Other devices in the domain will get the same bond implicitly.
+ * However, users must issue one bind() for each device, because IOMMUs may
+ * implement SVA differently. Furthermore, mandating one bind() per device
+ * allows the driver to perform sanity-checks on device capabilities.
+ *
+ * In some IOMMUs, one entry of the PASID table (typically the first one) can
+ * hold non-PASID translations. In this case PASID 0 is reserved and the first
+ * entry points to the io_pgtable pointer. In other IOMMUs the io_pgtable
+ * pointer is held in the device table and PASID 0 is available to the
+ * allocator.
+ */ + +struct io_mm { + struct list_head devices; + struct mm_struct *mm; + struct mmu_notifier notifier; + + /* Late initialization */ + const struct io_mm_ops *ops; + void *ctx; + int pasid; +}; + +#define to_io_mm(mmu_notifier) container_of(mmu_notifier, struct io_mm, notifier) +#define to_iommu_bond(handle) container_of(handle, struct iommu_bond, sva) + +struct iommu_bond { + struct iommu_sva sva; + struct io_mm __rcu *io_mm; + + struct list_head mm_head; + void *drvdata; + struct rcu_head rcu_head; + refcount_t refs; +}; + +static DECLARE_IOASID_SET(shared_pasid); + +static struct mmu_notifier_ops iommu_mmu_notifier_ops; + +/* + * Serializes modifications of bonds. + * Lock order: Device SVA mutex; global SVA mutex; IOASID lock + */ +static DEFINE_MUTEX(iommu_sva_lock); + +struct io_mm_alloc_params { + const struct io_mm_ops *ops; + int min_pasid, max_pasid; +}; + +static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm, void *privdata) +{ + int ret; + struct io_mm *io_mm; + struct io_mm_alloc_params *params = privdata; + + io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL); + if (!io_mm) + return ERR_PTR(-ENOMEM); + + io_mm->mm = mm; + io_mm->ops = params->ops; + INIT_LIST_HEAD(&io_mm->devices); + + io_mm->pasid = ioasid_alloc(&shared_pasid, params->min_pasid, + params->max_pasid, io_mm->mm); + if (io_mm->pasid == INVALID_IOASID) { + ret = -ENOSPC; + goto err_free_io_mm; + } + + io_mm->ctx = params->ops->alloc(mm); + if (IS_ERR(io_mm->ctx)) { + ret = PTR_ERR(io_mm->ctx); + goto err_free_pasid; + } + return &io_mm->notifier; + +err_free_pasid: + ioasid_free(io_mm->pasid); +err_free_io_mm: + kfree(io_mm); + return ERR_PTR(ret); +} + +static void io_mm_free(struct mmu_notifier *mn) +{ + struct io_mm *io_mm = to_io_mm(mn); + + WARN_ON(!list_empty(&io_mm->devices)); + + io_mm->ops->release(io_mm->ctx); + ioasid_free(io_mm->pasid); + kfree(io_mm); +} + +/* + * io_mm_get - Allocate an io_mm or get the existing one for the given mm + * @mm: the mm + * @ops: callbacks 
for the IOMMU driver + * @min_pasid: minimum PASID value (inclusive) + * @max_pasid: maximum PASID value (inclusive) + * + * Returns a valid io_mm or an error pointer. + */ +static struct io_mm *io_mm_get(struct mm_struct *mm, + const struct io_mm_ops *ops, + int min_pasid, int max_pasid) +{ + struct io_mm *io_mm; + struct mmu_notifier *mn; + struct io_mm_alloc_params params = { + .ops = ops, + .min_pasid = min_pasid, + .max_pasid = max_pasid, + }; + + /* + * A single notifier can exist for this (ops, mm) pair. Allocate it if + * necessary. + */ + mn = mmu_notifier_get(&iommu_mmu_notifier_ops, mm, &params); + if (IS_ERR(mn)) + return ERR_CAST(mn); + io_mm = to_io_mm(mn); + + if (WARN_ON(io_mm->ops != ops)) { + mmu_notifier_put(mn); + return ERR_PTR(-EINVAL); + } + + return io_mm; +} + +static void io_mm_put(struct io_mm *io_mm) +{ + mmu_notifier_put(&io_mm->notifier); +} + +static struct iommu_sva * +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata) +{ + int ret = 0; + bool attach_domain = true; + struct iommu_bond *bond, *tmp; + struct iommu_domain *domain, *other; + struct iommu_sva_param *param = dev->iommu_param->sva_param; + + domain = iommu_get_domain_for_dev(dev); + + bond = kzalloc(sizeof(*bond), GFP_KERNEL); + if (!bond) + return ERR_PTR(-ENOMEM); + + bond->sva.dev = dev; + bond->drvdata = drvdata; + refcount_set(&bond->refs, 1); + RCU_INIT_POINTER(bond->io_mm, io_mm); + + mutex_lock(&iommu_sva_lock); + /* Is it already bound to the device or domain? */ + list_for_each_entry(tmp, &io_mm->devices, mm_head) { + if (tmp->sva.dev != dev) { + other = iommu_get_domain_for_dev(tmp->sva.dev); + if (domain == other) + attach_domain = false; + + continue; + } + + if (WARN_ON(tmp->drvdata != drvdata)) { + ret = -EINVAL; + goto err_free; + } + + /* + * Hold a single io_mm reference per bond. Note that we can't + * return an error after this, otherwise the caller would drop + * an additional reference to the io_mm. 
+ */ + refcount_inc(&tmp->refs); + io_mm_put(io_mm); + kfree(bond); + mutex_unlock(&iommu_sva_lock); + return &tmp->sva; + } + + list_add_rcu(&bond->mm_head, &io_mm->devices); + param->nr_bonds++; + mutex_unlock(&iommu_sva_lock); + + ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx, + attach_domain); + if (ret) + goto err_remove; + + return &bond->sva; + +err_remove: + /* + * At this point concurrent threads may have started to access the + * io_mm->devices list in order to invalidate address ranges, which + * requires to free the bond via kfree_rcu() + */ + mutex_lock(&iommu_sva_lock); + param->nr_bonds--; + list_del_rcu(&bond->mm_head); + +err_free: + mutex_unlock(&iommu_sva_lock); + kfree_rcu(bond, rcu_head); + return ERR_PTR(ret); +} + +static void io_mm_detach_locked(struct iommu_bond *bond) +{ + struct io_mm *io_mm; + struct iommu_bond *tmp; + bool detach_domain = true; + struct iommu_domain *domain, *other; + + io_mm = rcu_dereference_protected(bond->io_mm, + lockdep_is_held(&iommu_sva_lock)); + if (!io_mm) + return; + + domain = iommu_get_domain_for_dev(bond->sva.dev); + + /* Are other devices in the same domain still attached to this mm? */ + list_for_each_entry(tmp, &io_mm->devices, mm_head) { + if (tmp == bond) + continue; + other = iommu_get_domain_for_dev(tmp->sva.dev); + if (domain == other) { + detach_domain = false; + break; + } + } + + io_mm->ops->detach(bond->sva.dev, io_mm->pasid, io_mm->ctx, + detach_domain); + + list_del_rcu(&bond->mm_head); + RCU_INIT_POINTER(bond->io_mm, NULL); + + /* Free after RCU grace period */ + io_mm_put(io_mm); +} + +/* + * io_mm_release - release MMU notifier + * + * Called when the mm exits. Some devices may still be bound to the io_mm. A few + * things need to be done before it is safe to release: + * + * - Tell the device driver to stop using this PASID. + * - Clear the PASID table and invalidate TLBs. + * - Drop all references to this io_mm. 
+ */ +static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm) +{ + struct iommu_bond *bond, *next; + struct io_mm *io_mm = to_io_mm(mn); + + mutex_lock(&iommu_sva_lock); + list_for_each_entry_safe(bond, next, &io_mm->devices, mm_head) { + struct device *dev = bond->sva.dev; + struct iommu_sva *sva = &bond->sva; + + if (sva->ops && sva->ops->mm_exit && + sva->ops->mm_exit(dev, sva, bond->drvdata)) + dev_WARN(dev, "possible leak of PASID %u", + io_mm->pasid); + + /* unbind() frees the bond, we just detach it */ + io_mm_detach_locked(bond); + } + mutex_unlock(&iommu_sva_lock); +} + +static void io_mm_invalidate_range(struct mmu_notifier *mn, + struct mm_struct *mm, unsigned long start, + unsigned long end) +{ + struct iommu_bond *bond; + struct io_mm *io_mm = to_io_mm(mn); + + rcu_read_lock(); + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) + io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx, + start, end - start); + rcu_read_unlock(); +} + +static struct mmu_notifier_ops iommu_mmu_notifier_ops = { + .alloc_notifier = io_mm_alloc, + .free_notifier = io_mm_free, + .release = io_mm_release, + .invalidate_range = io_mm_invalidate_range, +}; + +struct iommu_sva * +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm, + const struct io_mm_ops *ops, void *drvdata) +{ + struct io_mm *io_mm; + struct iommu_sva *handle; + struct iommu_param *param = dev->iommu_param; + + if (!param) + return ERR_PTR(-ENODEV); + + mutex_lock(&param->sva_lock); + if (!param->sva_param) { + handle = ERR_PTR(-ENODEV); + goto out_unlock; + } + + io_mm = io_mm_get(mm, ops, param->sva_param->min_pasid, + param->sva_param->max_pasid); + if (IS_ERR(io_mm)) { + handle = ERR_CAST(io_mm); + goto out_unlock; + } + + handle = io_mm_attach(dev, io_mm, drvdata); + if (IS_ERR(handle)) + io_mm_put(io_mm); + +out_unlock: + mutex_unlock(&param->sva_lock); + return handle; +} +EXPORT_SYMBOL_GPL(iommu_sva_bind_generic); + +static void iommu_sva_unbind_locked(struct 
iommu_bond *bond) +{ + struct device *dev = bond->sva.dev; + struct iommu_sva_param *param = dev->iommu_param->sva_param; + + if (!refcount_dec_and_test(&bond->refs)) + return; + + io_mm_detach_locked(bond); + param->nr_bonds--; + kfree_rcu(bond, rcu_head); +} + +void iommu_sva_unbind_generic(struct iommu_sva *handle) +{ + struct iommu_param *param = handle->dev->iommu_param; + + if (WARN_ON(!param)) + return; + + mutex_lock(&param->sva_lock); + mutex_lock(&iommu_sva_lock); + iommu_sva_unbind_locked(to_iommu_bond(handle)); + mutex_unlock(&iommu_sva_lock); + mutex_unlock(&param->sva_lock); +} +EXPORT_SYMBOL_GPL(iommu_sva_unbind_generic); + +/** + * iommu_sva_enable() - Enable Shared Virtual Addressing for a device + * @dev: the device + * @sva_param: the parameters. + * + * Called by an IOMMU driver to set up the SVA parameters. + * @sva_param is duplicated and can be freed when this function returns. + * + * Return 0 if initialization succeeded, or an error. + */ +int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param) +{ + int ret; + struct iommu_sva_param *new_param; + struct iommu_param *param = dev->iommu_param; + + if (!param) + return -ENODEV; + + new_param = kmemdup(sva_param, sizeof(*new_param), GFP_KERNEL); + if (!new_param) + return -ENOMEM; + + mutex_lock(&param->sva_lock); + if (param->sva_param) { + ret = -EEXIST; + goto err_unlock; + } + + dev->iommu_param->sva_param = new_param; + mutex_unlock(&param->sva_lock); + return 0; + +err_unlock: + mutex_unlock(&param->sva_lock); + kfree(new_param); + return ret; +} +EXPORT_SYMBOL_GPL(iommu_sva_enable); + +/** + * iommu_sva_disable() - Disable Shared Virtual Addressing for a device + * @dev: the device + * + * IOMMU drivers call this to disable SVA. 
+ */ +int iommu_sva_disable(struct device *dev) +{ + int ret = 0; + struct iommu_param *param = dev->iommu_param; + + if (!param) + return -EINVAL; + + mutex_lock(&param->sva_lock); + if (!param->sva_param) { + ret = -ENODEV; + goto out_unlock; + } + + /* Require that all contexts are unbound */ + if (param->sva_param->nr_bonds) { + ret = -EBUSY; + goto out_unlock; + } + + kfree(param->sva_param); + param->sva_param = NULL; +out_unlock: + mutex_unlock(&param->sva_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(iommu_sva_disable); + +bool iommu_sva_enabled(struct device *dev) +{ + bool enabled; + struct iommu_param *param = dev->iommu_param; + + if (!param) + return false; + + mutex_lock(&param->sva_lock); + enabled = !!param->sva_param; + mutex_unlock(&param->sva_lock); + return enabled; +} +EXPORT_SYMBOL_GPL(iommu_sva_enabled); + +int iommu_sva_get_pasid_generic(struct iommu_sva *handle) +{ + struct io_mm *io_mm; + int pasid = IOMMU_PASID_INVALID; + struct iommu_bond *bond = to_iommu_bond(handle); + + rcu_read_lock(); + io_mm = rcu_dereference(bond->io_mm); + if (io_mm) + pasid = io_mm->pasid; + rcu_read_unlock(); + return pasid; +} +EXPORT_SYMBOL_GPL(iommu_sva_get_pasid_generic); diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/iommu-sva.h new file mode 100644 index 000000000000..dd55c2db0936 --- /dev/null +++ b/drivers/iommu/iommu-sva.h @@ -0,0 +1,64 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * SVA library for IOMMU drivers + */ +#ifndef _IOMMU_SVA_H +#define _IOMMU_SVA_H + +#include <linux/iommu.h> +#include <linux/kref.h> +#include <linux/mmu_notifier.h> + +struct io_mm_ops { + /* Allocate a PASID context for an mm */ + void *(*alloc)(struct mm_struct *mm); + + /* + * Attach a PASID context to a device. Write the entry into the PASID + * table. + * + * @attach_domain is true when no other device in the IOMMU domain is + * already attached to this context. 
IOMMU drivers that share the + * PASID tables within a domain don't need to write the PASID entry + * when @attach_domain is false. + */ + int (*attach)(struct device *dev, int pasid, void *ctx, + bool attach_domain); + + /* + * Detach a PASID context from a device. Clear the entry from the PASID + * table and invalidate if necessary. + * + * @detach_domain is true when no other device in the IOMMU domain is + * still attached to this context. IOMMU drivers that share the PASID + * table within a domain don't need to clear the PASID entry when + * @detach_domain is false, only invalidate the caches. + */ + void (*detach)(struct device *dev, int pasid, void *ctx, + bool detach_domain); + + /* Invalidate a range of addresses. Cannot sleep. */ + void (*invalidate)(struct device *dev, int pasid, void *ctx, + unsigned long vaddr, size_t size); + + /* Free a context. Cannot sleep. */ + void (*release)(void *ctx); +}; + +struct iommu_sva_param { + u32 min_pasid; + u32 max_pasid; + int nr_bonds; +}; + +struct iommu_sva * +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm, + const struct io_mm_ops *ops, void *drvdata); +void iommu_sva_unbind_generic(struct iommu_sva *handle); +int iommu_sva_get_pasid_generic(struct iommu_sva *handle); + +int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param); +int iommu_sva_disable(struct device *dev); +bool iommu_sva_enabled(struct device *dev); + +#endif /* _IOMMU_SVA_H */ diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 3e3528436e0b..c8bd972c1788 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -164,6 +164,7 @@ static struct iommu_param *iommu_get_dev_param(struct device *dev) return NULL; mutex_init(&param->lock); + mutex_init(&param->sva_lock); dev->iommu_param = param; return param; } diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 1739f8a7a4b4..83397ae88d2d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -368,6 +368,7 @@ struct 
iommu_fault_param { * struct iommu_param - collection of per-device IOMMU data * * @fault_param: IOMMU detected device fault reporting data + * @sva_param: IOMMU parameter for SVA * * TODO: migrate other per device data pointers under iommu_dev_data, e.g. * struct iommu_group *iommu_group; @@ -376,6 +377,8 @@ struct iommu_fault_param { struct iommu_param { struct mutex lock; struct iommu_fault_param *fault_param; + struct mutex sva_lock; + struct iommu_sva_param *sva_param; }; int iommu_device_register(struct iommu_device *iommu); -- 2.25.0
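
To make the bond bookkeeping from the io_mm model above a little more concrete, here is a user-space sketch of how the attach_domain and detach_domain flags could be derived. It is only an approximation under stated assumptions: devices, domains and address spaces are plain integers, the fixed-size bonds[] table and the sva_bind()/sva_unbind() helpers are hypothetical, and PASID allocation, locking and RCU are all omitted. In the patch a repeated bind of the same (device, mm) pair takes a reference and skips the attach callback; the sketch reports that case as attach_domain == false.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BONDS 8

/* Hypothetical, simplified bond: one (device, mm) pair. */
struct bond {
	int dev;
	int domain;
	int mm;
	unsigned int refs;	/* 0 marks a free slot */
};

static struct bond bonds[MAX_BONDS];

/*
 * Bind @mm to @dev. *attach_domain is set to true only when no other device
 * of @domain is bound to @mm yet, i.e. when a driver sharing PASID tables
 * within a domain would have to write the PASID entry.
 */
static struct bond *sva_bind(int dev, int domain, int mm, bool *attach_domain)
{
	struct bond *slot = NULL;
	size_t i;

	*attach_domain = true;
	for (i = 0; i < MAX_BONDS; i++) {
		struct bond *b = &bonds[i];

		if (!b->refs) {
			if (!slot)
				slot = b;
			continue;
		}
		if (b->mm != mm)
			continue;
		if (b->dev == dev) {
			/* Already bound: take a reference, nothing to attach */
			b->refs++;
			*attach_domain = false;
			return b;
		}
		if (b->domain == domain)
			*attach_domain = false;
	}

	if (!slot)
		return NULL;
	slot->dev = dev;
	slot->domain = domain;
	slot->mm = mm;
	slot->refs = 1;
	return slot;
}

/*
 * Drop one reference. Returns true when the bond was actually removed;
 * *detach_domain then tells whether it was the last bond to this mm in its
 * domain, in which case the PASID entry would have to be cleared.
 */
static bool sva_unbind(struct bond *bond, bool *detach_domain)
{
	size_t i;

	*detach_domain = false;
	if (--bond->refs)
		return false;

	*detach_domain = true;
	for (i = 0; i < MAX_BONDS; i++) {
		const struct bond *b = &bonds[i];

		if (b != bond && b->refs && b->mm == bond->mm &&
		    b->domain == bond->domain) {
			*detach_domain = false;
			break;
		}
	}
	return true;
}
```

Replaying the four binds of the io_mm model with this sketch gives attach_domain == true for bonds 1 to 3 and false for bond 4, because dev 00:01.1 shares domain B with dev 00:01.0, which is already bound to address space Y.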
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

Some systems allow devices to handle I/O Page Faults in the core mm, for example systems implementing the PCI PRI extension or the Arm SMMU stall model. Infrastructure for reporting these recoverable page faults was recently added to the IOMMU core. Add a page fault handler for host SVA.

IOMMU drivers can now instantiate several fault workqueues and link them to IOPF-capable devices. Drivers can choose between a single global workqueue, one per IOMMU device, one per low-level fault queue, one per domain, etc.

When it receives a fault event, presumably in an IRQ handler, the IOMMU driver reports the fault using iommu_report_device_fault(), which calls the registered handler. The page fault handler then calls the mm fault handler, and reports either success or failure with iommu_page_response(). When the handler succeeds, the IOMMU retries the access.

The iopf_param pointer could be embedded into iommu_fault_param. But putting iopf_param into the iommu_param structure allows us not to care about ordering between calls to iopf_queue_add_device() and iommu_register_device_fault_handler().

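The handler added by this patch holds back page requests that are not the last of their Page Request Group and only queues work once the last request arrives. The following user-space sketch tries to illustrate that grouping step in isolation. It rests on several assumptions: struct fault, the partial[] array and queue_fault() are hypothetical simplifications (the patch uses struct iommu_fault, linked lists and a workqueue), faults from a single device are considered, and the group is returned to the caller instead of being handed to a worker. The order of partial faults inside a group may also differ from what the list operations in the patch produce.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PARTIAL 16
#define MAX_GROUP (MAX_PARTIAL + 1)

/* Hypothetical, simplified page request; not struct iommu_fault. */
struct fault {
	unsigned int grpid;	/* Page Request Group index */
	unsigned long addr;	/* faulting address */
	bool last;		/* last request of its group */
};

/* Requests whose group is still incomplete, in arrival order. */
static struct fault partial[MAX_PARTIAL];
static size_t nr_partial;

/*
 * Queue one page request. Returns 0 when the request was held back because
 * its group is not complete yet, -1 when the partial list is full, or the
 * number of requests copied into @group once the last request of a group
 * arrives. The last request is always stored at the end of @group.
 */
static int queue_fault(struct fault f, struct fault group[MAX_GROUP])
{
	size_t i, n = 0, kept = 0;

	if (!f.last) {
		if (nr_partial == MAX_PARTIAL)
			return -1;
		partial[nr_partial++] = f;
		return 0;
	}

	/* Move matching partial faults into the group, compact the rest */
	for (i = 0; i < nr_partial; i++) {
		if (partial[i].grpid == f.grpid)
			group[n++] = partial[i];
		else
			partial[kept++] = partial[i];
	}
	nr_partial = kept;

	group[n++] = f;
	return (int)n;
}
```

A caller would then handle every request in the returned group and send a single response for the whole group, which corresponds to what iopf_handle_group() does with iopf_complete() on the last fault.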
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/Kconfig | 4 + drivers/iommu/Makefile | 1 + drivers/iommu/io-pgfault.c | 451 +++++++++++++++++++++++++++++++++++++ include/linux/iommu.h | 59 +++++ 4 files changed, 515 insertions(+) create mode 100644 drivers/iommu/io-pgfault.c diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index acca20e2da2f..e4a42e1708b4 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -109,6 +109,10 @@ config IOMMU_SVA select IOMMU_API select MMU_NOTIFIER +config IOMMU_PAGE_FAULT + bool + select IOMMU_API + config FSL_PAMU bool "Freescale IOMMU support" depends on PCI diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 40c800dd4e3e..bf5cb4ee8409 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -4,6 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o +obj-$(CONFIG_IOMMU_PAGE_FAULT) += io-pgfault.o obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c new file mode 100644 index 000000000000..76e153c59fe3 --- /dev/null +++ b/drivers/iommu/io-pgfault.c @@ -0,0 +1,451 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Handle device page faults + * + * Copyright (C) 2018 ARM Ltd. 
+ */ + +#include <linux/iommu.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/workqueue.h> + +/** + * struct iopf_queue - IO Page Fault queue + * @wq: the fault workqueue + * @flush: low-level flush callback + * @flush_arg: flush() argument + * @devices: devices attached to this queue + * @lock: protects the device list + */ +struct iopf_queue { + struct workqueue_struct *wq; + iopf_queue_flush_t flush; + void *flush_arg; + struct list_head devices; + struct mutex lock; +}; + +/** + * struct iopf_device_param - IO Page Fault data attached to a device + * @dev: the device that owns this param + * @queue: IOPF queue + * @queue_list: index into queue->devices + * @partial: faults that are part of a Page Request Group for which the last + * request hasn't been submitted yet. + * @busy: the param is being used + * @wq_head: signal a change to @busy + */ +struct iopf_device_param { + struct device *dev; + struct iopf_queue *queue; + struct list_head queue_list; + struct list_head partial; + bool busy; + wait_queue_head_t wq_head; +}; + +struct iopf_fault { + struct iommu_fault fault; + struct list_head head; +}; + +struct iopf_group { + struct iopf_fault last_fault; + struct list_head faults; + struct work_struct work; + struct device *dev; +}; + +static int iopf_complete(struct device *dev, struct iopf_fault *iopf, + enum iommu_page_response_code status) +{ + struct iommu_page_response resp = { + .version = IOMMU_PAGE_RESP_VERSION_1, + .pasid = iopf->fault.prm.pasid, + .grpid = iopf->fault.prm.grpid, + .code = status, + }; + + if (iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) + resp.flags = IOMMU_PAGE_RESP_PASID_VALID; + + return iommu_page_response(dev, &resp); +} + +static enum iommu_page_response_code +iopf_handle_single(struct iopf_fault *iopf) +{ + /* TODO */ + return -ENODEV; +} + +static void iopf_handle_group(struct work_struct *work) +{ + struct iopf_group *group; + struct iopf_fault *iopf, *next; + enum 
iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS; + + group = container_of(work, struct iopf_group, work); + + list_for_each_entry_safe(iopf, next, &group->faults, head) { + /* + * For the moment, errors are sticky: don't handle subsequent + * faults in the group if there is an error. + */ + if (status == IOMMU_PAGE_RESP_SUCCESS) + status = iopf_handle_single(iopf); + + if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) + kfree(iopf); + } + + iopf_complete(group->dev, &group->last_fault, status); + kfree(group); +} + +/** + * iommu_queue_iopf - IO Page Fault handler + * @fault: fault event + * @cookie: struct device, passed to iommu_register_device_fault_handler. + * + * Add a fault to the device workqueue, to be handled by mm. + * + * Return: 0 on success and <0 on error. + */ +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) +{ + int ret; + struct iopf_group *group; + struct iopf_fault *iopf, *next; + struct iopf_device_param *iopf_param; + + struct device *dev = cookie; + struct iommu_param *param = dev->iommu_param; + + if (WARN_ON(!mutex_is_locked(&param->lock))) + return -EINVAL; + + if (fault->type != IOMMU_FAULT_PAGE_REQ) + /* Not a recoverable page fault */ + return -EOPNOTSUPP; + + /* + * As long as we're holding param->lock, the queue can't be unlinked + * from the device and therefore cannot disappear. + */ + iopf_param = param->iopf_param; + if (!iopf_param) + return -ENODEV; + + if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) { + iopf = kzalloc(sizeof(*iopf), GFP_KERNEL); + if (!iopf) + return -ENOMEM; + + iopf->fault = *fault; + + /* Non-last request of a group. Postpone until the last one */ + list_add(&iopf->head, &iopf_param->partial); + + return 0; + } + + group = kzalloc(sizeof(*group), GFP_KERNEL); + if (!group) { + /* + * The caller will send a response to the hardware. But we do + * need to clean up before leaving, otherwise partial faults + * will be stuck.
+ */ + ret = -ENOMEM; + goto cleanup_partial; + } + + group->dev = dev; + group->last_fault.fault = *fault; + INIT_LIST_HEAD(&group->faults); + list_add(&group->last_fault.head, &group->faults); + INIT_WORK(&group->work, iopf_handle_group); + + /* See if we have partial faults for this group */ + list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) { + if (iopf->fault.prm.grpid == fault->prm.grpid) + /* Insert *before* the last fault */ + list_move(&iopf->head, &group->faults); + } + + queue_work(iopf_param->queue->wq, &group->work); + return 0; + +cleanup_partial: + list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) { + if (iopf->fault.prm.grpid == fault->prm.grpid) { + list_del(&iopf->head); + kfree(iopf); + } + } + return ret; +} +EXPORT_SYMBOL_GPL(iommu_queue_iopf); + +/** + * iopf_queue_flush_dev - Ensure that all queued faults have been processed + * @dev: the endpoint whose faults need to be flushed. + * @pasid: the PASID affected by this flush + * + * Users must call this function when releasing a PASID, to ensure that all + * pending faults for this PASID have been handled, and won't hit the address + * space of the next process that uses this PASID. + * + * This function can also be called before shutting down the device, in which + * case @pasid should be IOMMU_PASID_INVALID. + * + * Return: 0 on success and <0 on error. + */ +int iopf_queue_flush_dev(struct device *dev, int pasid) +{ + int ret = 0; + struct iopf_queue *queue; + struct iopf_device_param *iopf_param; + struct iommu_param *param = dev->iommu_param; + + if (!param) + return -ENODEV; + + /* + * It is incredibly easy to find ourselves in a deadlock situation if + * we're not careful, because we're taking the opposite path as + * iommu_queue_iopf: + * + * iopf_queue_flush_dev() | PRI queue handler + * lock(&param->lock) | iommu_queue_iopf() + * queue->flush() | lock(&param->lock) + * wait PRI queue empty | + * + * So we can't hold the device param lock while flushing.
Take a + * reference to the device param instead, to prevent the queue from + * going away. + */ + mutex_lock(&param->lock); + iopf_param = param->iopf_param; + if (iopf_param) { + queue = param->iopf_param->queue; + iopf_param->busy = true; + } else { + ret = -ENODEV; + } + mutex_unlock(&param->lock); + if (ret) + return ret; + + /* + * When removing a PASID, the device driver tells the device to stop + * using it, and flush any pending fault to the IOMMU. In this flush + * callback, the IOMMU driver makes sure that there are no such faults + * left in the low-level queue. + */ + queue->flush(queue->flush_arg, dev, pasid); + + flush_workqueue(queue->wq); + + mutex_lock(&param->lock); + iopf_param->busy = false; + wake_up(&iopf_param->wq_head); + mutex_unlock(&param->lock); + + return 0; +} +EXPORT_SYMBOL_GPL(iopf_queue_flush_dev); + +/** + * iopf_queue_discard_partial - Remove all pending partial faults + * @queue: the queue whose partial faults need to be discarded + * + * When the hardware queue overflows, last page faults in a group may have been + * lost and the IOMMU driver calls this to discard all partial faults. The + * driver shouldn't be adding new faults to this queue concurrently. + * + * Return: 0 on success and <0 on error. + */ +int iopf_queue_discard_partial(struct iopf_queue *queue) +{ + struct iopf_fault *iopf, *next; + struct iopf_device_param *iopf_param; + + if (!queue) + return -EINVAL; + + mutex_lock(&queue->lock); + list_for_each_entry(iopf_param, &queue->devices, queue_list) { + list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) { + /* Unlink first, so the partial list holds no freed entry */ + list_del(&iopf->head); + kfree(iopf); + } + } + mutex_unlock(&queue->lock); + return 0; +} +EXPORT_SYMBOL_GPL(iopf_queue_discard_partial); + +/** + * iopf_queue_add_device - Add producer to the fault queue + * @queue: IOPF queue + * @dev: device to add + * + * Return: 0 on success and <0 on error.
+ */ +int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev) +{ + int ret = -EINVAL; + struct iopf_device_param *iopf_param; + struct iommu_param *param = dev->iommu_param; + + if (!param) + return -ENODEV; + + iopf_param = kzalloc(sizeof(*iopf_param), GFP_KERNEL); + if (!iopf_param) + return -ENOMEM; + + INIT_LIST_HEAD(&iopf_param->partial); + iopf_param->queue = queue; + iopf_param->dev = dev; + init_waitqueue_head(&iopf_param->wq_head); + + mutex_lock(&queue->lock); + mutex_lock(&param->lock); + if (!param->iopf_param) { + list_add(&iopf_param->queue_list, &queue->devices); + param->iopf_param = iopf_param; + ret = 0; + } + mutex_unlock(&param->lock); + mutex_unlock(&queue->lock); + + if (ret) + kfree(iopf_param); + + return ret; +} +EXPORT_SYMBOL_GPL(iopf_queue_add_device); + +/** + * iopf_queue_remove_device - Remove producer from fault queue + * @queue: IOPF queue + * @dev: device to remove + * + * Caller makes sure that no more faults are reported for this device. + * + * Return: 0 on success and <0 on error. + */ +int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev) +{ + int ret = -EINVAL; + struct iopf_fault *iopf, *next; + struct iopf_device_param *iopf_param; + struct iommu_param *param = dev->iommu_param; + + if (!param || !queue) + return -EINVAL; + + do { + mutex_lock(&queue->lock); + mutex_lock(&param->lock); + iopf_param = param->iopf_param; + if (iopf_param && iopf_param->queue == queue) { + if (iopf_param->busy) { + ret = -EBUSY; + } else { + list_del(&iopf_param->queue_list); + param->iopf_param = NULL; + ret = 0; + } + } + mutex_unlock(&param->lock); + mutex_unlock(&queue->lock); + + /* + * If there is an ongoing flush, wait for it to complete and + * then retry. iopf_param isn't going away since we're the only + * thread that can free it.
+ */ + if (ret == -EBUSY) + wait_event(iopf_param->wq_head, !iopf_param->busy); + else if (ret) + return ret; + } while (ret == -EBUSY); + + /* Just in case some faults are still stuck */ + list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) + kfree(iopf); + + kfree(iopf_param); + + return 0; +} +EXPORT_SYMBOL_GPL(iopf_queue_remove_device); + +/** + * iopf_queue_alloc - Allocate and initialize a fault queue + * @name: a unique string identifying the queue (for workqueue) + * @flush: a callback that flushes the low-level queue + * @cookie: driver-private data passed to the flush callback + * + * The callback is called before the workqueue is flushed. The IOMMU driver must + * commit all faults that are pending in its low-level queues at the time of the + * call, into the IOPF queue (with iommu_report_device_fault). The callback + * takes a device pointer as argument, hinting what endpoint is causing the + * flush. When the device is NULL, all faults should be committed. + * + * Return: the queue on success and NULL on error. + */ +struct iopf_queue * +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie) +{ + struct iopf_queue *queue; + + queue = kzalloc(sizeof(*queue), GFP_KERNEL); + if (!queue) + return NULL; + + /* + * The WQ is unordered because the low-level handler enqueues faults by + * group. PRI requests within a group have to be ordered, but once + * that's dealt with, the high-level function can handle groups out of + * order. + */ + queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name); + if (!queue->wq) { + kfree(queue); + return NULL; + } + + queue->flush = flush; + queue->flush_arg = cookie; + INIT_LIST_HEAD(&queue->devices); + mutex_init(&queue->lock); + + return queue; +} +EXPORT_SYMBOL_GPL(iopf_queue_alloc); + +/** + * iopf_queue_free - Free IOPF queue + * @queue: queue to free + * + * Counterpart to iopf_queue_alloc(). 
The driver must not be queuing faults or + * adding/removing devices on this queue anymore. + */ +void iopf_queue_free(struct iopf_queue *queue) +{ + struct iopf_device_param *iopf_param, *next; + + if (!queue) + return; + + list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list) + iopf_queue_remove_device(queue, iopf_param->dev); + + destroy_workqueue(queue->wq); + kfree(queue); +} +EXPORT_SYMBOL_GPL(iopf_queue_free); diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 83397ae88d2d..e7bc47ba24f8 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -364,11 +364,20 @@ struct iommu_fault_param { struct mutex lock; }; +/** + * iopf_queue_flush_t - Flush low-level page fault queue + * + * Report all faults currently pending in the low-level page fault queue + */ +struct iopf_queue; +typedef int (*iopf_queue_flush_t)(void *cookie, struct device *dev, int pasid); + /** * struct iommu_param - collection of per-device IOMMU data * * @fault_param: IOMMU detected device fault reporting data * @sva_param: IOMMU parameter for SVA + * @iopf_param: I/O Page Fault queue and data * * TODO: migrate other per device data pointers under iommu_dev_data, e.g. 
* struct iommu_group *iommu_group; @@ -377,6 +386,7 @@ struct iommu_fault_param { struct iommu_param { struct mutex lock; struct iommu_fault_param *fault_param; + struct iopf_device_param *iopf_param; struct mutex sva_lock; struct iommu_sva_param *sva_param; }; @@ -1081,4 +1091,53 @@ void iommu_debugfs_setup(void); static inline void iommu_debugfs_setup(void) {} #endif +#ifdef CONFIG_IOMMU_PAGE_FAULT +extern int iommu_queue_iopf(struct iommu_fault *fault, void *cookie); + +extern int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev); +extern int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev); +extern int iopf_queue_flush_dev(struct device *dev, int pasid); +extern struct iopf_queue * +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie); +extern void iopf_queue_free(struct iopf_queue *queue); +extern int iopf_queue_discard_partial(struct iopf_queue *queue); +#else /* CONFIG_IOMMU_PAGE_FAULT */ +static inline int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) +{ + return -ENODEV; +} + +static inline int iopf_queue_add_device(struct iopf_queue *queue, + struct device *dev) +{ + return -ENODEV; +} + +static inline int iopf_queue_remove_device(struct iopf_queue *queue, + struct device *dev) +{ + return -ENODEV; +} + +static inline int iopf_queue_flush_dev(struct device *dev, int pasid) +{ + return -ENODEV; +} + +static inline struct iopf_queue * +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie) +{ + return NULL; +} + +static inline void iopf_queue_free(struct iopf_queue *queue) +{ +} + +static inline int iopf_queue_discard_partial(struct iopf_queue *queue) +{ + return -ENODEV; +} +#endif /* CONFIG_IOMMU_PAGE_FAULT */ + #endif /* __LINUX_IOMMU_H */ -- 2.25.0
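
A note for readers following the partial-fault bookkeeping in iommu_queue_iopf(): the idea is that non-last requests of a Page Request Group are parked on a per-device list, and the last request pulls every parked fault with the same group ID off that list. Below is a minimal user-space sketch of that idea only. The sketch_* names, the singly linked list and the return convention are assumptions made for illustration; the patch itself uses struct list_head, a workqueue and param->lock, none of which are modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct iopf_fault. */
struct sketch_fault {
	unsigned int grpid;           /* Page Request Group index */
	struct sketch_fault *next;
};

/* Hypothetical stand-in for struct iopf_device_param. */
struct sketch_dev {
	struct sketch_fault *partial; /* non-last faults waiting for their group */
};

/*
 * Queue one fault. Returns 0 if the fault was parked on the partial list,
 * the number of faults in the completed group (last one included) when
 * @last is set, or -1 on allocation failure.
 */
static int sketch_queue_fault(struct sketch_dev *dev, unsigned int grpid,
			      bool last)
{
	struct sketch_fault **pp, *f;
	int count = 1;

	assert(dev != NULL);

	if (!last) {
		f = malloc(sizeof(*f));
		if (!f)
			return -1;
		f->grpid = grpid;
		f->next = dev->partial;
		dev->partial = f;
		return 0;
	}

	/* Last fault: collect the parked faults that belong to this group */
	pp = &dev->partial;
	while (*pp) {
		f = *pp;
		if (f->grpid == grpid) {
			*pp = f->next;    /* unlink before freeing */
			free(f);
			count++;
		} else {
			pp = &f->next;
		}
	}
	return count;
}
```

Faults from other groups stay parked until their own last request shows up, which is why an overflow of the hardware queue needs a way to discard them.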
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> The fault handler will need to find an mm given its PASID. This is the reason we have an IDR for storing address spaces, so hook it up. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/iommu-sva.c | 19 +++++++++++++++++++ include/linux/iommu.h | 9 +++++++++ 2 files changed, 28 insertions(+) diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c index 64f1d1c82383..bfd0c477f290 100644 --- a/drivers/iommu/iommu-sva.c +++ b/drivers/iommu/iommu-sva.c @@ -559,3 +559,22 @@ int iommu_sva_get_pasid_generic(struct iommu_sva *handle) return pasid; } EXPORT_SYMBOL_GPL(iommu_sva_get_pasid_generic); + +/* ioasid wants a void * argument */ +static bool __mmget_not_zero(void *mm) +{ + return mmget_not_zero(mm); +} + +/** + * iommu_sva_find() - Find mm associated to the given PASID + * @pasid: Process Address Space ID assigned to the mm + * + * Returns the mm corresponding to this PASID, or an error if not found. A + * reference to the mm is taken, and must be released with mmput(). + */ +struct mm_struct *iommu_sva_find(int pasid) +{ + return ioasid_find(&shared_pasid, pasid, __mmget_not_zero); +} +EXPORT_SYMBOL_GPL(iommu_sva_find); diff --git a/include/linux/iommu.h b/include/linux/iommu.h index e7bc47ba24f8..e52a8731e7a9 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -1091,6 +1091,15 @@ void iommu_debugfs_setup(void); static inline void iommu_debugfs_setup(void) {} #endif +#ifdef CONFIG_IOMMU_SVA +extern struct mm_struct *iommu_sva_find(int pasid); +#else /* !CONFIG_IOMMU_SVA */ +static inline struct mm_struct *iommu_sva_find(int pasid) +{ + return NULL; +} +#endif /* !CONFIG_IOMMU_SVA */ + #ifdef CONFIG_IOMMU_PAGE_FAULT extern int iommu_queue_iopf(struct iommu_fault *fault, void *cookie); -- 2.25.0
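
To illustrate why iommu_sva_find() passes a getter to the lookup: the object is only handed back if a reference could be taken while its user count was still non-zero. Here is a rough user-space sketch of that pattern with C11 atomics. Everything named sketch_* is hypothetical, the table is a plain array rather than the IOASID allocator, and the sketch returns NULL on failure, which is a simplification and not a statement about what ioasid_find() returns.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with a user count, standing in for mm_struct. */
struct sketch_obj {
	atomic_int users;
};

/* Take a reference unless the count already dropped to zero. */
static bool sketch_get_not_zero(struct sketch_obj *obj)
{
	int old;

	assert(obj != NULL);
	old = atomic_load(&obj->users);
	while (old != 0) {
		/* On failure, 'old' is updated with the current value */
		if (atomic_compare_exchange_weak(&obj->users, &old, old + 1))
			return true;
	}
	return false;
}

#define SKETCH_MAX_ID 8
static struct sketch_obj *sketch_table[SKETCH_MAX_ID];

/* Look up an ID; only return the object if the getter succeeds. */
static struct sketch_obj *sketch_find(unsigned int id,
				      bool (*getter)(struct sketch_obj *))
{
	struct sketch_obj *obj;

	if (id >= SKETCH_MAX_ID)
		return NULL;

	obj = sketch_table[id];
	if (obj && getter && !getter(obj))
		return NULL;

	return obj;
}
```

A caller that gets a non-NULL object back owns one reference and has to drop it when done, which mirrors the mmput() requirement in the kerneldoc above.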
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> When a recoverable page fault is handled by the fault workqueue, find the associated mm and call handle_mm_fault. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/io-pgfault.c | 86 +++++++++++++++++++++++++++++++++++++- 1 file changed, 84 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c index 76e153c59fe3..ffa9f14e0803 100644 --- a/drivers/iommu/io-pgfault.c +++ b/drivers/iommu/io-pgfault.c @@ -7,6 +7,7 @@ #include <linux/iommu.h> #include <linux/list.h> +#include <linux/sched/mm.h> #include <linux/slab.h> #include <linux/workqueue.h> @@ -76,8 +77,65 @@ static int iopf_complete(struct device *dev, struct iopf_fault *iopf, static enum iommu_page_response_code iopf_handle_single(struct iopf_fault *iopf) { - /* TODO */ - return -ENODEV; + vm_fault_t ret; + struct mm_struct *mm; + struct vm_area_struct *vma; + unsigned int access_flags = 0; + unsigned int fault_flags = FAULT_FLAG_REMOTE; + struct iommu_fault_page_request *prm = &iopf->fault.prm; + enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID; + + if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)) + return status; + + mm = iommu_sva_find(prm->pasid); + if (IS_ERR_OR_NULL(mm)) + return status; + + down_read(&mm->mmap_sem); + + vma = find_extend_vma(mm, prm->addr); + if (!vma) + /* Unmapped area */ + goto out_put_mm; + + if (prm->perm & IOMMU_FAULT_PERM_READ) + access_flags |= VM_READ; + + if (prm->perm & IOMMU_FAULT_PERM_WRITE) { + access_flags |= VM_WRITE; + fault_flags |= FAULT_FLAG_WRITE; + } + + if (prm->perm & IOMMU_FAULT_PERM_EXEC) { + access_flags |= VM_EXEC; + fault_flags |= FAULT_FLAG_INSTRUCTION; + } + + if (!(prm->perm & IOMMU_FAULT_PERM_PRIV)) + fault_flags |= FAULT_FLAG_USER; + + if (access_flags & ~vma->vm_flags) + /* Access fault */ + goto out_put_mm; + + ret = handle_mm_fault(vma, prm->addr, fault_flags); + status = ret & VM_FAULT_ERROR ? 
IOMMU_PAGE_RESP_INVALID : + IOMMU_PAGE_RESP_SUCCESS; + +out_put_mm: + up_read(&mm->mmap_sem); + + /* + * If the process exits while we're handling the fault on its mm, we + * can't do mmput(). exit_mmap() would release the MMU notifier, calling + * iommu_notifier_release(), which has to flush the fault queue that + * we're executing on... So mmput_async() moves the release of the mm to + * another thread, if we're the last user. + */ + mmput_async(mm); + + return status; } static void iopf_handle_group(struct work_struct *work) @@ -111,6 +169,30 @@ static void iopf_handle_group(struct work_struct *work) * * Add a fault to the device workqueue, to be handled by mm. * + * This module doesn't handle PCI PASID Stop Marker; IOMMU drivers must discard + * them before reporting faults. A PASID Stop Marker (LRW = 0b100) doesn't + * expect a response. It may be generated when disabling a PASID (issuing a + * PASID stop request) by some PCI devices. + * + * The PASID stop request is triggered by the mm_exit() callback. When the + * callback returns from the device driver, no page request is generated for + * this PASID anymore and outstanding ones have been pushed to the IOMMU (as per + * PCIe 4.0r1.0 - 6.20.1 and 10.4.1.2 - Managing PASID TLP Prefix Usage). Some + * PCI devices will wait for all outstanding page requests to come back with a + * response before completing the PASID stop request. Others do not wait for + * page responses, and instead issue this Stop Marker that tells us when the + * PASID can be reallocated. + * + * It is safe to discard the Stop Marker because it is an optimization. + * a. Page requests, which are posted requests, have been flushed to the IOMMU + * when mm_exit() returns, + * b. We flush all fault queues after mm_exit() returns and before freeing the + * PASID. 
+ * + * So even though the Stop Marker might be issued by the device *after* the stop + * request completes, outstanding faults will have been dealt with by the time + * we free the PASID. + * * Return: 0 on success and <0 on error. */ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) -- 2.25.0
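
For reviewers checking the access test in iopf_handle_single(): the requested permissions are translated to VM_* style flags, and the fault is refused if any requested bit is missing from the VMA flags (access_flags & ~vma->vm_flags). A small stand-alone sketch of that bit arithmetic follows. The SK_* constants and their values are made up for this example and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request permission bits (values chosen for the sketch). */
#define SK_PERM_READ   (1u << 4)
#define SK_PERM_WRITE  (1u << 5)
#define SK_PERM_EXEC   (1u << 6)

/* Hypothetical VMA flag bits (values chosen for the sketch). */
#define SK_VM_READ     (1u << 0)
#define SK_VM_WRITE    (1u << 1)
#define SK_VM_EXEC     (1u << 2)

/* Translate the permissions requested by the device into VMA-style flags. */
static unsigned int sk_perm_to_access(unsigned int perm)
{
	unsigned int access = 0;

	if (perm & SK_PERM_READ)
		access |= SK_VM_READ;
	if (perm & SK_PERM_WRITE)
		access |= SK_VM_WRITE;
	if (perm & SK_PERM_EXEC)
		access |= SK_VM_EXEC;

	return access;
}

/* The access is allowed only if every requested bit is set in vm_flags. */
static bool sk_access_allowed(unsigned int perm, unsigned int vm_flags)
{
	return !(sk_perm_to_access(perm) & ~vm_flags);
}
```

For instance, a write request against a read-only mapping leaves SK_VM_WRITE set after masking, so the check fails and the fault would be answered as invalid.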
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> When enabling SVA, register the fault handler. The device driver will register an I/O page fault queue before or after calling iommu_sva_enable(). The fault queue must be flushed before any io_mm is freed, to make sure that its PASID isn't used in any fault queue, and can be reallocated. Add iopf_queue_flush_dev() calls in a few strategic locations. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/Kconfig | 1 + drivers/iommu/iommu-sva.c | 16 ++++++++++++++++ 2 files changed, 17 insertions(+) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index e4a42e1708b4..211684e785ea 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -106,6 +106,7 @@ config IOMMU_DMA config IOMMU_SVA bool select IOASID + select IOMMU_PAGE_FAULT select IOMMU_API select MMU_NOTIFIER diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c index bfd0c477f290..494ca0824e4b 100644 --- a/drivers/iommu/iommu-sva.c +++ b/drivers/iommu/iommu-sva.c @@ -366,6 +366,8 @@ static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm) dev_WARN(dev, "possible leak of PASID %u", io_mm->pasid); + iopf_queue_flush_dev(dev, io_mm->pasid); + /* unbind() frees the bond, we just detach it */ io_mm_detach_locked(bond); } @@ -442,11 +444,20 @@ static void iommu_sva_unbind_locked(struct iommu_bond *bond) void iommu_sva_unbind_generic(struct iommu_sva *handle) { + int pasid; struct iommu_param *param = handle->dev->iommu_param; if (WARN_ON(!param)) return; + /* + * Caller stopped the device from issuing PASIDs, now make sure they are + * out of the fault queue.
+ */ + pasid = iommu_sva_get_pasid_generic(handle); + if (pasid != IOMMU_PASID_INVALID) + iopf_queue_flush_dev(handle->dev, pasid); + mutex_lock(&param->sva_lock); mutex_lock(&iommu_sva_lock); iommu_sva_unbind_locked(to_iommu_bond(handle)); @@ -484,6 +495,10 @@ int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param) goto err_unlock; } + ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev); + if (ret) + goto err_unlock; + dev->iommu_param->sva_param = new_param; mutex_unlock(&param->sva_lock); return 0; @@ -521,6 +536,7 @@ int iommu_sva_disable(struct device *dev) goto out_unlock; } + iommu_unregister_device_fault_handler(dev); kfree(param->sva_param); param->sva_param = NULL; out_unlock: -- 2.25.0
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

To enable address space sharing with the IOMMU, introduce mm_context_get() and mm_context_put(), that pin down a context and ensure that it will keep its ASID after a rollover. Export the symbols to let the modular SMMUv3 driver use them.

Pinning is necessary because a device constantly needs a valid ASID, unlike tasks that only require one when running. Without pinning, we would need to notify the IOMMU when we're about to use a new ASID for a task, and it would get complicated when a new task is assigned a shared ASID. Consider the following scenario with no ASID pinned:

1. Task t1 is running on CPUx with shared ASID (gen=1, asid=1)
2. Task t2 is scheduled on CPUx, gets ASID (1, 2)
3. Task tn is scheduled on CPUy, a rollover occurs, tn gets ASID (2, 1)

We would now have to immediately generate a new ASID for t1, notify the IOMMU, and finally enable task tn. We are holding the lock during all that time, since we can't afford having another CPU trigger a rollover. The IOMMU issues invalidation commands that can take tens of milliseconds.

It gets needlessly complicated. All we wanted to do was schedule task tn, that has no business with the IOMMU. By letting the IOMMU pin tasks when needed, we avoid stalling the slow path, and let the pinning fail when we're out of shareable ASIDs.

After a rollover, the allocator expects at least one ASID to be available in addition to the reserved ones (one per CPU). So (NR_ASIDS - NR_CPUS - 1) is the maximum number of ASIDs that can be shared with the IOMMU.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- v2->v4: handle KPTI --- arch/arm64/include/asm/mmu.h | 1 + arch/arm64/include/asm/mmu_context.h | 11 ++- arch/arm64/mm/context.c | 103 +++++++++++++++++++++++++-- 3 files changed, 109 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h index e4d862420bb4..70ac3d4cbd3e 100644 --- a/arch/arm64/include/asm/mmu.h +++ b/arch/arm64/include/asm/mmu.h @@ -18,6 +18,7 @@ typedef struct { atomic64_t id; + unsigned long pinned; void *vdso; unsigned long flags; } mm_context_t; diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h index 3827ff4040a3..70715c10c02a 100644 --- a/arch/arm64/include/asm/mmu_context.h +++ b/arch/arm64/include/asm/mmu_context.h @@ -175,7 +175,13 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp) #define destroy_context(mm) do { } while(0) void check_and_switch_context(struct mm_struct *mm, unsigned int cpu); -#define init_new_context(tsk,mm) ({ atomic64_set(&(mm)->context.id, 0); 0; }) +static inline int +init_new_context(struct task_struct *tsk, struct mm_struct *mm) +{ + atomic64_set(&mm->context.id, 0); + mm->context.pinned = 0; + return 0; +} #ifdef CONFIG_ARM64_SW_TTBR0_PAN static inline void update_saved_ttbr0(struct task_struct *tsk, @@ -248,6 +254,9 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next, void verify_cpu_asid_bits(void); void post_ttbr_update_workaround(void); +unsigned long mm_context_get(struct mm_struct *mm); +void mm_context_put(struct mm_struct *mm); + #endif /* !__ASSEMBLY__ */ #endif /* !__ASM_MMU_CONTEXT_H */ diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c index 121aba5b1941..5558de88b67d 100644 --- a/arch/arm64/mm/context.c +++ b/arch/arm64/mm/context.c @@ -26,6 +26,10 @@ static DEFINE_PER_CPU(atomic64_t, active_asids); static DEFINE_PER_CPU(u64, reserved_asids); static cpumask_t tlb_flush_pending; +static unsigned long max_pinned_asids; +static 
unsigned long nr_pinned_asids; +static unsigned long *pinned_asid_map; + #define ASID_MASK (~GENMASK(asid_bits - 1, 0)) #define ASID_FIRST_VERSION (1UL << asid_bits) @@ -73,6 +77,9 @@ void verify_cpu_asid_bits(void) static void set_kpti_asid_bits(void) { + unsigned int k; + u8 *dst = (u8 *)asid_map; + u8 *src = (u8 *)pinned_asid_map; unsigned int len = BITS_TO_LONGS(NUM_USER_ASIDS) * sizeof(unsigned long); /* * In case of KPTI kernel/user ASIDs are allocated in @@ -80,7 +87,8 @@ static void set_kpti_asid_bits(void) * is set, then the ASID will map only userspace. Thus * mark even as reserved for kernel. */ - memset(asid_map, 0xaa, len); + for (k = 0; k < len; k++) + dst[k] = src[k] | 0xaa; } static void set_reserved_asid_bits(void) @@ -88,9 +96,12 @@ static void set_reserved_asid_bits(void) if (arm64_kernel_unmapped_at_el0()) set_kpti_asid_bits(); else - bitmap_clear(asid_map, 0, NUM_USER_ASIDS); + bitmap_copy(asid_map, pinned_asid_map, NUM_USER_ASIDS); } +#define asid_gen_match(asid) \ + (!(((asid) ^ atomic64_read(&asid_generation)) >> asid_bits)) + static void flush_context(void) { int i; @@ -161,6 +172,14 @@ static u64 new_context(struct mm_struct *mm) if (check_update_reserved_asid(asid, newasid)) return newasid; + /* + * If it is pinned, we can keep using it. Note that reserved + * takes priority, because even if it is also pinned, we need to + * update the generation into the reserved_asids. + */ + if (mm->context.pinned) + return newasid; + /* * We had a valid ASID in a previous life, so try to re-use * it if possible. @@ -219,8 +238,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) * because atomic RmWs are totally ordered for a given location. 
*/ old_active_asid = atomic64_read(&per_cpu(active_asids, cpu)); - if (old_active_asid && - !((asid ^ atomic64_read(&asid_generation)) >> asid_bits) && + if (old_active_asid && asid_gen_match(asid) && atomic64_cmpxchg_relaxed(&per_cpu(active_asids, cpu), old_active_asid, asid)) goto switch_mm_fastpath; @@ -228,7 +246,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) raw_spin_lock_irqsave(&cpu_asid_lock, flags); /* Check that our ASID belongs to the current generation. */ asid = atomic64_read(&mm->context.id); - if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) { + if (!asid_gen_match(asid)) { asid = new_context(mm); atomic64_set(&mm->context.id, asid); } @@ -251,6 +269,68 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) cpu_switch_mm(mm->pgd, mm); } +unsigned long mm_context_get(struct mm_struct *mm) +{ + unsigned long flags; + u64 asid; + + raw_spin_lock_irqsave(&cpu_asid_lock, flags); + + asid = atomic64_read(&mm->context.id); + + if (mm->context.pinned) { + mm->context.pinned++; + asid &= ~ASID_MASK; + goto out_unlock; + } + + if (nr_pinned_asids >= max_pinned_asids) { + asid = 0; + goto out_unlock; + } + + if (!asid_gen_match(asid)) { + /* + * We went through one or more rollover since that ASID was + * used. Ensure that it is still valid, or generate a new one. 
+ */ + asid = new_context(mm); + atomic64_set(&mm->context.id, asid); + } + + asid &= ~ASID_MASK; + + nr_pinned_asids++; + __set_bit(asid2idx(asid), pinned_asid_map); + mm->context.pinned++; + +out_unlock: + raw_spin_unlock_irqrestore(&cpu_asid_lock, flags); + + /* Set the equivalent of USER_ASID_BIT */ + if (asid && IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0)) + asid |= 1; + + return asid; +} +EXPORT_SYMBOL_GPL(mm_context_get); + +void mm_context_put(struct mm_struct *mm) +{ + unsigned long flags; + u64 asid = atomic64_read(&mm->context.id) & ~ASID_MASK; + + raw_spin_lock_irqsave(&cpu_asid_lock, flags); + + if (--mm->context.pinned == 0) { + __clear_bit(asid2idx(asid), pinned_asid_map); + nr_pinned_asids--; + } + + raw_spin_unlock_irqrestore(&cpu_asid_lock, flags); +} +EXPORT_SYMBOL_GPL(mm_context_put); + /* Errata workaround post TTBRx_EL1 update. */ asmlinkage void post_ttbr_update_workaround(void) { @@ -279,6 +359,19 @@ static int asids_init(void) panic("Failed to allocate bitmap for %lu ASIDs\n", NUM_USER_ASIDS); + pinned_asid_map = kcalloc(BITS_TO_LONGS(NUM_USER_ASIDS), + sizeof(*pinned_asid_map), GFP_KERNEL); + if (!pinned_asid_map) + panic("Failed to allocate pinned bitmap\n"); + + /* + * We assume that an ASID is always available after a rollover. This + * means that even if all CPUs have a reserved ASID, there still is at + * least one slot available in the asid map. + */ + max_pinned_asids = num_available_asids - num_possible_cpus() - 1; + nr_pinned_asids = 0; + /* * We cannot call set_reserved_asid_bits() here because CPU * caps are not finalized yet, so it is safer to assume KPTI -- 2.25.0
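
As a sanity check of the pin accounting in mm_context_get() and mm_context_put(), here is a small user-space model of the refcount and of the (NR_ASIDS - NR_CPUS - 1) budget. It is only a sketch under simplifying assumptions: the sk_* names and the tiny ASID space are invented, every context is assumed to already own a valid ASID, and there is no locking, no generation handling and no KPTI adjustment.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_NR_ASIDS 16u   /* hypothetical, tiny ASID space */
#define SK_NR_CPUS   4u   /* hypothetical CPU count */

struct sk_ctx {
	unsigned int asid;     /* 1..SK_NR_ASIDS-1, assumed already allocated */
	unsigned long pinned;  /* pin refcount, like mm->context.pinned */
};

static bool sk_pinned_map[SK_NR_ASIDS];
static unsigned long sk_nr_pinned;
/* One ASID per CPU may be reserved, plus one slot must stay allocatable */
static const unsigned long sk_max_pinned = SK_NR_ASIDS - SK_NR_CPUS - 1;

/* Returns the pinned ASID, or 0 when the pin budget is exhausted. */
static unsigned int sk_context_get(struct sk_ctx *ctx)
{
	if (ctx->pinned) {
		/* Already pinned: only bump the refcount */
		ctx->pinned++;
		return ctx->asid;
	}

	if (sk_nr_pinned >= sk_max_pinned)
		return 0;

	sk_nr_pinned++;
	sk_pinned_map[ctx->asid] = true;
	ctx->pinned++;
	return ctx->asid;
}

/* Drop one pin; the ASID becomes recyclable once the count reaches zero. */
static void sk_context_put(struct sk_ctx *ctx)
{
	assert(ctx->pinned > 0);

	if (--ctx->pinned == 0) {
		sk_pinned_map[ctx->asid] = false;
		sk_nr_pinned--;
	}
}
```

With 16 ASIDs and 4 CPUs the model allows 11 pinned contexts; the 12th pin attempt returns 0 until another context is fully unpinned.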
Extract some of the most generic TCR defines, so they can be reused by the page table sharing code. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/io-pgtable-arm.c | 27 ++------------------------- drivers/iommu/io-pgtable-arm.h | 30 ++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+), 25 deletions(-) create mode 100644 drivers/iommu/io-pgtable-arm.h diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index 983b08477e64..75782b525c2f 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -20,6 +20,8 @@ #include <asm/barrier.h> +#include "io-pgtable-arm.h" + #define ARM_LPAE_MAX_ADDR_BITS 52 #define ARM_LPAE_S2_MAX_CONCAT_PAGES 16 #define ARM_LPAE_MAX_LEVELS 4 @@ -100,23 +102,6 @@ #define ARM_LPAE_PTE_MEMATTR_DEV (((arm_lpae_iopte)0x1) << 2) /* Register bits */ -#define ARM_LPAE_TCR_TG0_4K 0 -#define ARM_LPAE_TCR_TG0_64K 1 -#define ARM_LPAE_TCR_TG0_16K 2 - -#define ARM_LPAE_TCR_TG1_16K 1 -#define ARM_LPAE_TCR_TG1_4K 2 -#define ARM_LPAE_TCR_TG1_64K 3 - -#define ARM_LPAE_TCR_SH_NS 0 -#define ARM_LPAE_TCR_SH_OS 2 -#define ARM_LPAE_TCR_SH_IS 3 - -#define ARM_LPAE_TCR_RGN_NC 0 -#define ARM_LPAE_TCR_RGN_WBWA 1 -#define ARM_LPAE_TCR_RGN_WT 2 -#define ARM_LPAE_TCR_RGN_WB 3 - #define ARM_LPAE_VTCR_SL0_MASK 0x3 #define ARM_LPAE_TCR_T0SZ_SHIFT 0 @@ -124,14 +109,6 @@ #define ARM_LPAE_VTCR_PS_SHIFT 16 #define ARM_LPAE_VTCR_PS_MASK 0x7 -#define ARM_LPAE_TCR_PS_32_BIT 0x0ULL -#define ARM_LPAE_TCR_PS_36_BIT 0x1ULL -#define ARM_LPAE_TCR_PS_40_BIT 0x2ULL -#define ARM_LPAE_TCR_PS_42_BIT 0x3ULL -#define ARM_LPAE_TCR_PS_44_BIT 0x4ULL -#define ARM_LPAE_TCR_PS_48_BIT 0x5ULL -#define ARM_LPAE_TCR_PS_52_BIT 0x6ULL - #define ARM_LPAE_MAIR_ATTR_SHIFT(n) ((n) << 3) #define ARM_LPAE_MAIR_ATTR_MASK 0xff #define ARM_LPAE_MAIR_ATTR_DEVICE 0x04 diff --git a/drivers/iommu/io-pgtable-arm.h b/drivers/iommu/io-pgtable-arm.h new file mode 100644 index 000000000000..ba7cfdf7afa0 --- /dev/null +++ 
b/drivers/iommu/io-pgtable-arm.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef IO_PGTABLE_ARM_H_ +#define IO_PGTABLE_ARM_H_ + +#define ARM_LPAE_TCR_TG0_4K 0 +#define ARM_LPAE_TCR_TG0_64K 1 +#define ARM_LPAE_TCR_TG0_16K 2 + +#define ARM_LPAE_TCR_TG1_16K 1 +#define ARM_LPAE_TCR_TG1_4K 2 +#define ARM_LPAE_TCR_TG1_64K 3 + +#define ARM_LPAE_TCR_SH_NS 0 +#define ARM_LPAE_TCR_SH_OS 2 +#define ARM_LPAE_TCR_SH_IS 3 + +#define ARM_LPAE_TCR_RGN_NC 0 +#define ARM_LPAE_TCR_RGN_WBWA 1 +#define ARM_LPAE_TCR_RGN_WT 2 +#define ARM_LPAE_TCR_RGN_WB 3 + +#define ARM_LPAE_TCR_PS_32_BIT 0x0ULL +#define ARM_LPAE_TCR_PS_36_BIT 0x1ULL +#define ARM_LPAE_TCR_PS_40_BIT 0x2ULL +#define ARM_LPAE_TCR_PS_42_BIT 0x3ULL +#define ARM_LPAE_TCR_PS_44_BIT 0x4ULL +#define ARM_LPAE_TCR_PS_48_BIT 0x5ULL +#define ARM_LPAE_TCR_PS_52_BIT 0x6ULL + +#endif /* IO_PGTABLE_ARM_H_ */ -- 2.25.0
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> In preparation for sharing some ASIDs with the CPU, use a global xarray to store ASIDs and their context. ASID#0 is now reserved, and the ASID space is now global. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 27 ++++++++++++++++++--------- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 87ae31ef35a1..7737b70e74cd 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -651,7 +651,6 @@ struct arm_smmu_device { #define ARM_SMMU_MAX_ASIDS (1 << 16) unsigned int asid_bits; - DECLARE_BITMAP(asid_map, ARM_SMMU_MAX_ASIDS); #define ARM_SMMU_MAX_VMIDS (1 << 16) unsigned int vmid_bits; @@ -711,6 +710,8 @@ struct arm_smmu_option_prop { const char *prop; }; +static DEFINE_XARRAY_ALLOC1(asid_xa); + static struct arm_smmu_option_prop arm_smmu_options[] = { { ARM_SMMU_OPT_SKIP_PREFETCH, "hisilicon,broken-prefetch-cmd" }, { ARM_SMMU_OPT_PAGE0_REGS_ONLY, "cavium,cn9900-broken-page1-regspace"}, @@ -1742,6 +1743,14 @@ static void arm_smmu_free_cd_tables(struct arm_smmu_domain *smmu_domain) cdcfg->cdtab = NULL; } +static void arm_smmu_free_asid(struct arm_smmu_ctx_desc *cd) +{ + if (!cd->asid) + return; + + xa_erase(&asid_xa, cd->asid); +} + /* Stream table manipulation functions */ static void arm_smmu_write_strtab_l1_desc(__le64 *dst, struct arm_smmu_strtab_l1_desc *desc) @@ -2388,10 +2397,9 @@ static void arm_smmu_domain_free(struct iommu_domain *domain) if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg; - if (cfg->cdcfg.cdtab) { + if (cfg->cdcfg.cdtab) arm_smmu_free_cd_tables(smmu_domain); - arm_smmu_bitmap_free(smmu->asid_map, cfg->cd.asid); - } + arm_smmu_free_asid(&cfg->cd); } else { struct arm_smmu_s2_cfg *cfg = &smmu_domain->s2_cfg; if (cfg->vmid) @@ -2406,14 +2414,15 @@ static int arm_smmu_domain_finalise_s1(struct 
arm_smmu_domain *smmu_domain, struct io_pgtable_cfg *pgtbl_cfg) { int ret; - int asid; + u32 asid; struct arm_smmu_device *smmu = smmu_domain->smmu; struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg; typeof(&pgtbl_cfg->arm_lpae_s1_cfg.tcr) tcr = &pgtbl_cfg->arm_lpae_s1_cfg.tcr; - asid = arm_smmu_bitmap_alloc(smmu->asid_map, smmu->asid_bits); - if (asid < 0) - return asid; + ret = xa_alloc(&asid_xa, &asid, &cfg->cd, + XA_LIMIT(1, (1 << smmu->asid_bits) - 1), GFP_KERNEL); + if (ret) + return ret; cfg->s1cdmax = master->ssid_bits; @@ -2446,7 +2455,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain, out_free_cd_tables: arm_smmu_free_cd_tables(smmu_domain); out_free_asid: - arm_smmu_bitmap_free(smmu->asid_map, asid); + arm_smmu_free_asid(&cfg->cd); return ret; } -- 2.25.0
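A note on the allocation range used above: the limit passed to xa_alloc() is inclusive, which is why the upper bound is (1 << asid_bits) - 1, and allocation starts at 1 so that an ASID of 0 can mean "nothing allocated". The following is a minimal user-space sketch of that behaviour, not driver code. The names model_asid_table, model_asid_alloc and model_asid_free, and the fixed table size, are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model size: large enough for the small cases shown here. */
#define MODEL_MAX_ASIDS (1u << 8)

struct model_asid_table {
	const void *ctx[MODEL_MAX_ASIDS];	/* NULL means the ASID is free */
};

/*
 * Store ctx at the lowest free ASID in [1, (1 << asid_bits) - 1].
 * Returns the ASID, or 0 when the range is exhausted. ASID 0 is never
 * handed out, so it can double as the "nothing allocated" value.
 */
static uint32_t model_asid_alloc(struct model_asid_table *t,
				 unsigned int asid_bits, const void *ctx)
{
	uint32_t max, id;

	assert(ctx != NULL);
	assert(asid_bits >= 1 && (1u << asid_bits) <= MODEL_MAX_ASIDS);

	max = (1u << asid_bits) - 1;	/* inclusive upper bound */
	for (id = 1; id <= max; id++) {
		if (!t->ctx[id]) {
			t->ctx[id] = ctx;
			return id;
		}
	}
	return 0;
}

/* Release an ASID; 0 is ignored, mirroring the early return in the patch. */
static void model_asid_free(struct model_asid_table *t, uint32_t asid)
{
	assert(asid < MODEL_MAX_ASIDS);
	if (asid)
		t->ctx[asid] = NULL;
}
```

With asid_bits = 2 the model hands out 1, 2 and 3 and then reports exhaustion; freeing ASID 2 makes it the next one returned.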
The SMMUv3 driver needs to read the PARANGE field of ID_AA64MMFR0_EL1 in
order to share CPU page tables with devices. Allow the driver to be built
as a module by exporting the read_sanitised_ftr_reg() cpufeature symbol.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 arch/arm64/kernel/cpufeature.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 0b6715625cf6..a96d2fb12e4d 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -838,6 +838,7 @@ u64 read_sanitised_ftr_reg(u32 id)
 	BUG_ON(!regp);
 	return regp->sys_val;
 }
+EXPORT_SYMBOL_GPL(read_sanitised_ftr_reg);
 
 #define read_sysreg_case(r)	\
 	case r:		return read_sysreg_s(r)
-- 
2.25.0
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> With Shared Virtual Addressing (SVA), we need to mirror CPU TTBR, TCR, MAIR and ASIDs in SMMU contexts. Each SMMU has a single ASID space split into two sets, shared and private. Shared ASIDs correspond to those obtained from the arch ASID allocator, and private ASIDs are used for "classic" map/unmap DMA. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 164 +++++++++++++++++++++++++++++++++++- 1 file changed, 160 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 7737b70e74cd..3f9adfd1b015 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -22,6 +22,7 @@ #include <linux/iommu.h> #include <linux/iopoll.h> #include <linux/module.h> +#include <linux/mmu_context.h> #include <linux/msi.h> #include <linux/of.h> #include <linux/of_address.h> @@ -33,6 +34,8 @@ #include <linux/amba/bus.h> +#include "io-pgtable-arm.h" + /* MMIO registers */ #define ARM_SMMU_IDR0 0x0 #define IDR0_ST_LVL GENMASK(28, 27) @@ -575,6 +578,9 @@ struct arm_smmu_ctx_desc { u64 ttbr; u64 tcr; u64 mair; + + refcount_t refs; + struct mm_struct *mm; }; struct arm_smmu_l1_ctx_desc { @@ -1639,7 +1645,8 @@ static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, #ifdef __BIG_ENDIAN CTXDESC_CD_0_ENDI | #endif - CTXDESC_CD_0_R | CTXDESC_CD_0_A | CTXDESC_CD_0_ASET | + CTXDESC_CD_0_R | CTXDESC_CD_0_A | + (cd->mm ? 
0 : CTXDESC_CD_0_ASET) | CTXDESC_CD_0_AA64 | FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) | CTXDESC_CD_0_V; @@ -1743,12 +1750,159 @@ static void arm_smmu_free_cd_tables(struct arm_smmu_domain *smmu_domain) cdcfg->cdtab = NULL; } -static void arm_smmu_free_asid(struct arm_smmu_ctx_desc *cd) +static void arm_smmu_init_cd(struct arm_smmu_ctx_desc *cd) { + refcount_set(&cd->refs, 1); +} + +static bool arm_smmu_free_asid(struct arm_smmu_ctx_desc *cd) +{ + bool free; + struct arm_smmu_ctx_desc *old_cd; + if (!cd->asid) - return; + return false; + + xa_lock(&asid_xa); + free = refcount_dec_and_test(&cd->refs); + if (free) { + old_cd = __xa_erase(&asid_xa, cd->asid); + WARN_ON(old_cd != cd); + } + xa_unlock(&asid_xa); + return free; +} + +static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid) +{ + struct arm_smmu_ctx_desc *cd; + + cd = xa_load(&asid_xa, asid); + if (!cd) + return NULL; - xa_erase(&asid_xa, cd->asid); + if (cd->mm) { + /* + * It's pretty common to find a stale CD when doing unbind-bind, + * given that the release happens after a RCU grace period. + * arm_smmu_free_asid() hasn't gone through yet, so reuse it. + */ + refcount_inc(&cd->refs); + return cd; + } + + /* + * Ouch, ASID is already in use for a private cd. + * TODO: seize it. + */ + return ERR_PTR(-EEXIST); +} + +__maybe_unused +static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm) +{ + u16 asid; + int ret = 0; + u64 tcr, par, reg; + struct arm_smmu_ctx_desc *cd; + struct arm_smmu_ctx_desc *old_cd = NULL; + + asid = mm_context_get(mm); + if (!asid) + return ERR_PTR(-ESRCH); + + cd = kzalloc(sizeof(*cd), GFP_KERNEL); + if (!cd) { + ret = -ENOMEM; + goto err_put_context; + } + + arm_smmu_init_cd(cd); + + xa_lock(&asid_xa); + old_cd = arm_smmu_share_asid(asid); + if (!old_cd) { + old_cd = __xa_store(&asid_xa, asid, cd, GFP_ATOMIC); + /* + * Keep error, clear valid pointers. If there was an old entry + * it has been moved already by arm_smmu_share_asid(). 
+ */ + old_cd = ERR_PTR(xa_err(old_cd)); + cd->asid = asid; + } + xa_unlock(&asid_xa); + + if (IS_ERR(old_cd)) { + ret = PTR_ERR(old_cd); + goto err_free_cd; + } else if (old_cd) { + if (WARN_ON(old_cd->mm != mm)) { + ret = -EINVAL; + goto err_free_cd; + } + kfree(cd); + mm_context_put(mm); + return old_cd; + } + + tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - VA_BITS) | + FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) | + FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) | + FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) | + CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64; + + switch (PAGE_SIZE) { + case SZ_4K: + tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, + ARM_LPAE_TCR_TG0_4K); + break; + case SZ_16K: + tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, + ARM_LPAE_TCR_TG0_16K); + break; + case SZ_64K: + tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_TG0, + ARM_LPAE_TCR_TG0_64K); + break; + default: + WARN_ON(1); + ret = -EINVAL; + goto err_free_asid; + } + + reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1); + par = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_PARANGE_SHIFT); + tcr |= FIELD_PREP(CTXDESC_CD_0_TCR_IPS, par); + + cd->ttbr = virt_to_phys(mm->pgd); + cd->tcr = tcr; + /* + * MAIR value is pretty much constant and global, so we can just get it + * from the current CPU register + */ + cd->mair = read_sysreg(mair_el1); + + cd->mm = mm; + + return cd; + +err_free_asid: + arm_smmu_free_asid(cd); +err_free_cd: + kfree(cd); +err_put_context: + mm_context_put(mm); + return ERR_PTR(ret); +} + +__maybe_unused +static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd) +{ + if (arm_smmu_free_asid(cd)) { + /* Unpin ASID */ + mm_context_put(cd->mm); + kfree(cd); + } } /* Stream table manipulation functions */ @@ -2419,6 +2573,8 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain, struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg; typeof(&pgtbl_cfg->arm_lpae_s1_cfg.tcr) tcr = &pgtbl_cfg->arm_lpae_s1_cfg.tcr; + 
arm_smmu_init_cd(&cfg->cd); + ret = xa_alloc(&asid_xa, &asid, &cfg->cd, XA_LIMIT(1, (1 << smmu->asid_bits) - 1), GFP_KERNEL); if (ret) -- 2.25.0
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>

The SMMU has a single ASID space, the union of shared and private ASID
sets. This means that the SMMU driver competes with the arch allocator
for ASIDs. Shared ASIDs are those of Linux processes, allocated by the
arch, and participate in broadcast TLB maintenance. Private ASIDs are
allocated by the SMMU driver and used for "classic" map/unmap DMA. They
require explicit TLB invalidations.

When we pin down an mm_context and get an ASID that is already in use by
the SMMU, it belongs to a private context. We used to simply abort the
bind, but this is unfair to users, who would be unable to bind a few
seemingly random processes. Instead, try to allocate a new private ASID
for the private context, and make the old ASID shared.

Introduce a new lock to prevent races when rewriting context descriptors.
Unfortunately it has to be a spinlock, since we take it while holding the
ASID lock, which will be held in non-sleepable context (freeing ASIDs
from an RCU callback).

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 83 +++++++++++++++++++++++++++++-------- 1 file changed, 66 insertions(+), 17 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 3f9adfd1b015..2839527ec9ee 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -717,6 +717,7 @@ struct arm_smmu_option_prop { }; static DEFINE_XARRAY_ALLOC1(asid_xa); +static DEFINE_SPINLOCK(contexts_lock); static struct arm_smmu_option_prop arm_smmu_options[] = { { ARM_SMMU_OPT_SKIP_PREFETCH, "hisilicon,broken-prefetch-cmd" }, @@ -1513,6 +1514,17 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu, } /* Context descriptor manipulation functions */ +static void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid) +{ + struct arm_smmu_cmdq_ent cmd = { + .opcode = CMDQ_OP_TLBI_NH_ASID, + .tlbi.asid = asid, + }; + + arm_smmu_cmdq_issue_cmd(smmu, &cmd); + arm_smmu_cmdq_issue_sync(smmu); +} + static void arm_smmu_sync_cd(struct arm_smmu_domain *smmu_domain, int ssid, bool leaf) { @@ -1547,7 +1559,7 @@ static int arm_smmu_alloc_cd_leaf_table(struct arm_smmu_device *smmu, size_t size = CTXDESC_L2_ENTRIES * (CTXDESC_CD_DWORDS << 3); l1_desc->l2ptr = dmam_alloc_coherent(smmu->dev, size, - &l1_desc->l2ptr_dma, GFP_KERNEL); + &l1_desc->l2ptr_dma, GFP_ATOMIC); if (!l1_desc->l2ptr) { dev_warn(smmu->dev, "failed to allocate context descriptor table\n"); @@ -1593,8 +1605,8 @@ static __le64 *arm_smmu_get_cd_ptr(struct arm_smmu_domain *smmu_domain, return l1_desc->l2ptr + idx * CTXDESC_CD_DWORDS; } -static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, - int ssid, struct arm_smmu_ctx_desc *cd) +static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, + int ssid, struct arm_smmu_ctx_desc *cd) { /* * This function handles the following cases: @@ -1670,6 +1682,17 @@ static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, 
return 0; } +static int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, + int ssid, struct arm_smmu_ctx_desc *cd) +{ + int ret; + + spin_lock(&contexts_lock); + ret = __arm_smmu_write_ctx_desc(smmu_domain, ssid, cd); + spin_unlock(&contexts_lock); + return ret; +} + static int arm_smmu_alloc_cd_tables(struct arm_smmu_domain *smmu_domain) { int ret; @@ -1773,9 +1796,18 @@ static bool arm_smmu_free_asid(struct arm_smmu_ctx_desc *cd) return free; } +/* + * Try to reserve this ASID in the SMMU. If it is in use, try to steal it from + * the private entry. Careful here, we may be modifying the context tables of + * another SMMU! + */ static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid) { + int ret; + u32 new_asid; struct arm_smmu_ctx_desc *cd; + struct arm_smmu_device *smmu; + struct arm_smmu_domain *smmu_domain; cd = xa_load(&asid_xa, asid); if (!cd) @@ -1791,11 +1823,31 @@ static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid) return cd; } + smmu_domain = container_of(cd, struct arm_smmu_domain, s1_cfg.cd); + smmu = smmu_domain->smmu; + + /* + * Race with unmap: TLB invalidations will start targeting the new ASID, + * which isn't assigned yet. We'll do an invalidate-all on the old ASID + * later, so it doesn't matter. + */ + ret = __xa_alloc(&asid_xa, &new_asid, cd, + XA_LIMIT(1, (1 << smmu->asid_bits) - 1), GFP_ATOMIC); + if (ret) + return ERR_PTR(-ENOSPC); + cd->asid = new_asid; + /* - * Ouch, ASID is already in use for a private cd. - * TODO: seize it. + * Update ASID and invalidate CD in all associated masters. There will + * be some overlap between use of both ASIDs, until we invalidate the + * TLB. 
*/ - return ERR_PTR(-EEXIST); + arm_smmu_write_ctx_desc(smmu_domain, 0, cd); + + /* Invalidate TLB entries previously associated with that context */ + arm_smmu_tlb_inv_asid(smmu, asid); + + return NULL; } __maybe_unused @@ -2389,15 +2441,6 @@ static void arm_smmu_tlb_inv_context(void *cookie) struct arm_smmu_device *smmu = smmu_domain->smmu; struct arm_smmu_cmdq_ent cmd; - if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { - cmd.opcode = CMDQ_OP_TLBI_NH_ASID; - cmd.tlbi.asid = smmu_domain->s1_cfg.cd.asid; - cmd.tlbi.vmid = 0; - } else { - cmd.opcode = CMDQ_OP_TLBI_S12_VMALL; - cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid; - } - /* * NOTE: when io-pgtable is in non-strict mode, we may get here with * PTEs previously cleared by unmaps on the current CPU not yet visible @@ -2405,8 +2448,14 @@ static void arm_smmu_tlb_inv_context(void *cookie) * insertion to guarantee those are observed before the TLBI. Do be * careful, 007. */ - arm_smmu_cmdq_issue_cmd(smmu, &cmd); - arm_smmu_cmdq_issue_sync(smmu); + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { + arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid); + } else { + cmd.opcode = CMDQ_OP_TLBI_S12_VMALL; + cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid; + arm_smmu_cmdq_issue_cmd(smmu, &cmd); + arm_smmu_cmdq_issue_sync(smmu); + } arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0); } -- 2.25.0
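The three outcomes of arm_smmu_share_asid() after this patch can be modelled in user space as below. This is a sketch under stated assumptions, not the driver's code: the model_* names, the 4-bit ASID space and the result enum are hypothetical. The model clears the old slot itself, whereas in the patch the caller overwrites it with __xa_store(). The context descriptor rewrite and the TLB invalidation of the old ASID are omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ASID_BITS 4u			/* hypothetical small space */
#define MODEL_NUM_ASIDS (1u << MODEL_ASID_BITS)

struct model_ctx {
	uint32_t asid;
	bool shared;		/* stands in for cd->mm != NULL */
	unsigned int refs;
};

struct model_table {
	struct model_ctx *slot[MODEL_NUM_ASIDS];	/* index 0 unused */
};

enum model_share_result {
	MODEL_SHARE_FREE,	/* ASID is now free for the shared context */
	MODEL_SHARE_REUSED,	/* an existing shared context was found */
	MODEL_SHARE_NOSPACE,	/* private owner could not be moved */
};

static enum model_share_result
model_share_asid(struct model_table *t, uint32_t asid, struct model_ctx **out)
{
	struct model_ctx *cd;
	uint32_t id;

	assert(asid >= 1 && asid < MODEL_NUM_ASIDS);
	*out = NULL;

	cd = t->slot[asid];
	if (!cd)
		return MODEL_SHARE_FREE;

	if (cd->shared) {
		/* Stale or concurrent shared context: take a reference. */
		cd->refs++;
		*out = cd;
		return MODEL_SHARE_REUSED;
	}

	/* Private owner: move it to any other free ASID in [1, max]. */
	for (id = 1; id < MODEL_NUM_ASIDS; id++) {
		if (!t->slot[id]) {
			t->slot[id] = cd;
			cd->asid = id;
			t->slot[asid] = NULL;
			return MODEL_SHARE_FREE;
		}
	}
	return MODEL_SHARE_NOSPACE;
}
```

The point the model makes is that a private context never blocks a bind as long as one free ASID remains: only a completely full ASID space leads to failure.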
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> ARMv8.1 extensions added Virtualization Host Extensions (VHE), which allow running a host kernel at EL2. When using normal DMA, device and CPU address spaces are dissociated and do not need to implement the same capabilities, so VHE hasn't been used in the SMMU until now. With shared address spaces however, ASIDs are shared between MMU and SMMU, and broadcast TLB invalidations issued by a CPU are taken into account by the SMMU. TLB entries on both sides need to be tagged with the same exception level in order to be cleared with a single invalidation. When the CPU is using VHE, enable VHE in the SMMU for all STEs. Normal DMA mappings will need to use TLBI_EL2 commands instead of TLBI_NH, but shouldn't be otherwise affected by this change. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 31 ++++++++++++++++++++++++++----- 1 file changed, 26 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 2839527ec9ee..77554d89653b 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -13,6 +13,7 @@ #include <linux/acpi_iort.h> #include <linux/bitfield.h> #include <linux/bitops.h> +#include <linux/cpufeature.h> #include <linux/crash_dump.h> #include <linux/delay.h> #include <linux/dma-iommu.h> @@ -472,6 +473,8 @@ struct arm_smmu_cmdq_ent { #define CMDQ_OP_TLBI_NH_ASID 0x11 #define CMDQ_OP_TLBI_NH_VA 0x12 #define CMDQ_OP_TLBI_EL2_ALL 0x20 + #define CMDQ_OP_TLBI_EL2_ASID 0x21 + #define CMDQ_OP_TLBI_EL2_VA 0x22 #define CMDQ_OP_TLBI_S12_VMALL 0x28 #define CMDQ_OP_TLBI_S2_IPA 0x2a #define CMDQ_OP_TLBI_NSNH_ALL 0x30 @@ -638,6 +641,7 @@ struct arm_smmu_device { #define ARM_SMMU_FEAT_HYP (1 << 12) #define ARM_SMMU_FEAT_STALL_FORCE (1 << 13) #define ARM_SMMU_FEAT_VAX (1 << 14) +#define ARM_SMMU_FEAT_E2H (1 << 15) u32 features; #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0) @@ -909,6 +913,8 @@ static int 
arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent) break; case CMDQ_OP_TLBI_NH_VA: cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid); + /* Fallthrough */ + case CMDQ_OP_TLBI_EL2_VA: cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid); cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf); cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK; @@ -924,6 +930,9 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent) case CMDQ_OP_TLBI_S12_VMALL: cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid); break; + case CMDQ_OP_TLBI_EL2_ASID: + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid); + break; case CMDQ_OP_ATC_INV: cmd[0] |= FIELD_PREP(CMDQ_0_SSV, ent->substream_valid); cmd[0] |= FIELD_PREP(CMDQ_ATC_0_GLOBAL, ent->atc.global); @@ -1517,7 +1526,8 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu, static void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid) { struct arm_smmu_cmdq_ent cmd = { - .opcode = CMDQ_OP_TLBI_NH_ASID, + .opcode = smmu->features & ARM_SMMU_FEAT_E2H ? + CMDQ_OP_TLBI_EL2_ASID : CMDQ_OP_TLBI_NH_ASID, .tlbi.asid = asid, }; @@ -2075,13 +2085,16 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, } if (s1_cfg) { + int strw = smmu->features & ARM_SMMU_FEAT_E2H ? 
+ STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1; + BUG_ON(ste_live); dst[1] = cpu_to_le64( FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) | FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) | FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) | FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) | - FIELD_PREP(STRTAB_STE_1_STRW, STRTAB_STE_1_STRW_NSEL1)); + FIELD_PREP(STRTAB_STE_1_STRW, strw)); if (smmu->features & ARM_SMMU_FEAT_STALLS && !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE)) @@ -2476,7 +2489,8 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size, return; if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { - cmd.opcode = CMDQ_OP_TLBI_NH_VA; + cmd.opcode = smmu->features & ARM_SMMU_FEAT_E2H ? + CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA; cmd.tlbi.asid = smmu_domain->s1_cfg.cd.asid; } else { cmd.opcode = CMDQ_OP_TLBI_S2_IPA; @@ -3743,7 +3757,11 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass) writel_relaxed(reg, smmu->base + ARM_SMMU_CR1); /* CR2 (random crap) */ - reg = CR2_PTM | CR2_RECINVSID | CR2_E2H; + reg = CR2_PTM | CR2_RECINVSID; + + if (smmu->features & ARM_SMMU_FEAT_E2H) + reg |= CR2_E2H; + writel_relaxed(reg, smmu->base + ARM_SMMU_CR2); /* Stream table */ @@ -3901,8 +3919,11 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) if (reg & IDR0_MSI) smmu->features |= ARM_SMMU_FEAT_MSI; - if (reg & IDR0_HYP) + if (reg & IDR0_HYP) { smmu->features |= ARM_SMMU_FEAT_HYP; + if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN)) + smmu->features |= ARM_SMMU_FEAT_E2H; + } /* * The coherency feature as set by FW is used in preference to the ID -- 2.25.0
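The opcode selection introduced by this patch is small enough to model directly. The sketch below is a user-space illustration, using the opcode values and the E2H feature bit from the hunks above. The struct and function names are hypothetical and the command is reduced to the two fields that matter here.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_FEAT_E2H (1u << 15)	/* same bit as ARM_SMMU_FEAT_E2H */

/* Opcode values as defined for struct arm_smmu_cmdq_ent in the patch. */
enum model_tlbi_op {
	MODEL_TLBI_NH_ASID  = 0x11,
	MODEL_TLBI_NH_VA    = 0x12,
	MODEL_TLBI_EL2_ASID = 0x21,
	MODEL_TLBI_EL2_VA   = 0x22,
};

/* Reduced command: the real entry also carries VMID, address and leaf. */
struct model_tlbi_cmd {
	enum model_tlbi_op opcode;
	uint16_t asid;
};

/* Invalidate-by-ASID: EL2 flavour when the SMMU runs with E2H set. */
static struct model_tlbi_cmd model_tlbi_by_asid(uint32_t features,
						uint32_t asid)
{
	struct model_tlbi_cmd cmd;

	assert(asid <= UINT16_MAX);	/* ASIDs are at most 16 bits wide */
	cmd.opcode = (features & MODEL_FEAT_E2H) ?
		     MODEL_TLBI_EL2_ASID : MODEL_TLBI_NH_ASID;
	cmd.asid = (uint16_t)asid;
	return cmd;
}

/* Invalidate-by-VA follows the same rule. */
static enum model_tlbi_op model_tlbi_va_opcode(uint32_t features)
{
	return (features & MODEL_FEAT_E2H) ?
	       MODEL_TLBI_EL2_VA : MODEL_TLBI_NH_VA;
}
```

Because the choice depends only on a per-SMMU feature bit, private and shared contexts on the same SMMU always use the same TLBI flavour, matching the STRW setting applied to every stage-1 STE.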
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> The SMMUv3 can handle invalidation targeted at TLB entries with shared ASIDs. If the implementation supports broadcast TLB maintenance, enable it and keep track of it in a feature bit. The SMMU will then be affected by inner-shareable TLB invalidations from other agents. A major side-effect of this change is that stage-2 translation contexts are now affected by all invalidations by VMID. VMIDs are all shared and the only ways to prevent over-invalidation, since the stage-2 page tables are not shared between CPU and SMMU, are to either disable BTM or allocate different VMIDs. This patch does not address the problem. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 77554d89653b..b72b2fdcd21f 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -56,6 +56,7 @@ #define IDR0_ASID16 (1 << 12) #define IDR0_ATS (1 << 10) #define IDR0_HYP (1 << 9) +#define IDR0_BTM (1 << 5) #define IDR0_COHACC (1 << 4) #define IDR0_TTF GENMASK(3, 2) #define IDR0_TTF_AARCH64 2 @@ -642,6 +643,7 @@ struct arm_smmu_device { #define ARM_SMMU_FEAT_STALL_FORCE (1 << 13) #define ARM_SMMU_FEAT_VAX (1 << 14) #define ARM_SMMU_FEAT_E2H (1 << 15) +#define ARM_SMMU_FEAT_BTM (1 << 16) u32 features; #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0) @@ -3757,11 +3759,14 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass) writel_relaxed(reg, smmu->base + ARM_SMMU_CR1); /* CR2 (random crap) */ - reg = CR2_PTM | CR2_RECINVSID; + reg = CR2_RECINVSID; if (smmu->features & ARM_SMMU_FEAT_E2H) reg |= CR2_E2H; + if (!(smmu->features & ARM_SMMU_FEAT_BTM)) + reg |= CR2_PTM; + writel_relaxed(reg, smmu->base + ARM_SMMU_CR2); /* Stream table */ @@ -3872,6 +3877,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device 
*smmu) { u32 reg; bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY; + bool vhe = cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN); /* IDR0 */ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0); @@ -3921,10 +3927,19 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) if (reg & IDR0_HYP) { smmu->features |= ARM_SMMU_FEAT_HYP; - if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN)) + if (vhe) smmu->features |= ARM_SMMU_FEAT_E2H; } + /* + * If the CPU is using VHE, but the SMMU doesn't support it, the SMMU + * will create TLB entries for NH-EL1 world and will miss the + * broadcasted TLB invalidations that target EL2-E2H world. Don't enable + * BTM in that case. + */ + if (reg & IDR0_BTM && (!vhe || reg & IDR0_HYP)) + smmu->features |= ARM_SMMU_FEAT_BTM; + /* * The coherency feature as set by FW is used in preference to the ID * register, but warn on mismatch. -- 2.25.0
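The BTM decision in the probe hunk above reduces to one boolean expression, and a truth table makes the vetoed case easier to see. This is a user-space sketch only: the IDR0 bit positions are the ones defined in this series, while the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_IDR0_HYP (1u << 9)	/* as IDR0_HYP in the driver */
#define MODEL_IDR0_BTM (1u << 5)	/* as IDR0_BTM in this patch */

/*
 * Broadcast TLB maintenance is usable when the SMMU implements it and
 * its TLB entries are tagged for the same regime as the CPU's: either
 * the CPU is not using VHE, or the SMMU can follow it into EL2 (HYP).
 */
static bool model_btm_usable(uint32_t idr0, bool cpu_vhe)
{
	bool btm = (idr0 & MODEL_IDR0_BTM) != 0;
	bool hyp = (idr0 & MODEL_IDR0_HYP) != 0;

	return btm && (!cpu_vhe || hyp);
}

/* CR2.PTM (private TLB maintenance) is set exactly when BTM is unused. */
static bool model_cr2_ptm(uint32_t idr0, bool cpu_vhe)
{
	return !model_btm_usable(idr0, cpu_vhe);
}

/* Usage: the only vetoed combination is a VHE host on a non-HYP SMMU. */
static void model_btm_selfcheck(void)
{
	assert(model_btm_usable(MODEL_IDR0_BTM, false));
	assert(!model_btm_usable(MODEL_IDR0_BTM, true));
	assert(model_btm_usable(MODEL_IDR0_BTM | MODEL_IDR0_HYP, true));
}
```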
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Aggregate all sanity-checks for sharing CPU page tables with the SMMU under a single ARM_SMMU_FEAT_SVA bit. For PCIe SVA, users also need to check FEAT_ATS and FEAT_PRI. For platform SVA, they will most likely have to check FEAT_STALLS. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 72 +++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index b72b2fdcd21f..77a846440ba6 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -644,6 +644,7 @@ struct arm_smmu_device { #define ARM_SMMU_FEAT_VAX (1 << 14) #define ARM_SMMU_FEAT_E2H (1 << 15) #define ARM_SMMU_FEAT_BTM (1 << 16) +#define ARM_SMMU_FEAT_SVA (1 << 17) u32 features; #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0) @@ -3873,6 +3874,74 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass) return 0; } +static bool arm_smmu_supports_sva(struct arm_smmu_device *smmu) +{ + unsigned long reg, fld; + unsigned long oas; + unsigned long asid_bits; + + u32 feat_mask = ARM_SMMU_FEAT_BTM | ARM_SMMU_FEAT_COHERENCY; + + if ((smmu->features & feat_mask) != feat_mask) + return false; + + if (!(smmu->pgsize_bitmap & PAGE_SIZE)) + return false; + + /* + * Get the smallest PA size of all CPUs (sanitized by cpufeature). We're + * not even pretending to support AArch32 here. + */ + reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1); + fld = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_PARANGE_SHIFT); + switch (fld) { + case 0x0: + oas = 32; + break; + case 0x1: + oas = 36; + break; + case 0x2: + oas = 40; + break; + case 0x3: + oas = 42; + break; + case 0x4: + oas = 44; + break; + case 0x5: + oas = 48; + break; + case 0x6: + oas = 52; + break; + default: + return false; + } + + /* abort if MMU outputs addresses greater than what we support. 
*/ + if (smmu->oas < oas) + return false; + + /* We can support bigger ASIDs than the CPU, but not smaller */ + fld = cpuid_feature_extract_unsigned_field(reg, ID_AA64MMFR0_ASID_SHIFT); + asid_bits = fld ? 16 : 8; + if (smmu->asid_bits < asid_bits) + return false; + + /* + * See max_pinned_asids in arch/arm64/mm/context.c. The following is + * generally the maximum number of bindable processes. + */ + if (IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0)) + asid_bits--; + dev_dbg(smmu->dev, "%d shared contexts\n", (1 << asid_bits) - + num_possible_cpus() - 2); + + return true; +} + static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) { u32 reg; @@ -4080,6 +4149,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) smmu->ias = max(smmu->ias, smmu->oas); + if (arm_smmu_supports_sva(smmu)) + smmu->features |= ARM_SMMU_FEAT_SVA; + dev_info(smmu->dev, "ias %lu-bit, oas %lu-bit (features 0x%08x)\n", smmu->ias, smmu->oas, smmu->features); return 0; -- 2.25.0
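The PARANGE switch and the shared-context estimate in arm_smmu_supports_sva() can be written compactly as a table lookup and a small formula. The following user-space sketch mirrors the values in the patch (PARANGE 0x0 to 0x6 mapping to 32 to 52 bits, and the "minus CPUs minus 2" estimate from the dev_dbg() call). The function names are hypothetical, and the sketch does not read any system register.

```c
#include <assert.h>
#include <stdbool.h>

/* PARANGE encoding to physical address bits; 0 means unknown encoding. */
static unsigned int model_parange_to_bits(unsigned int parange)
{
	static const unsigned int bits[] = { 32, 36, 40, 42, 44, 48, 52 };

	if (parange >= sizeof(bits) / sizeof(bits[0]))
		return 0;
	return bits[parange];
}

/* The SMMU output size must cover everything the CPU can address. */
static bool model_oas_compatible(unsigned int smmu_oas, unsigned int parange)
{
	unsigned int cpu_bits = model_parange_to_bits(parange);

	return cpu_bits != 0 && smmu_oas >= cpu_bits;
}

/*
 * Estimate of bindable processes, following the dev_dbg() in the patch:
 * one ASID bit is lost when the kernel is unmapped at EL0, then one
 * ASID per possible CPU and two more are set aside.
 */
static long model_max_shared_contexts(unsigned int asid_bits,
				      bool unmap_kernel_at_el0,
				      unsigned int num_cpus)
{
	assert(asid_bits == 8 || asid_bits == 16);
	if (unmap_kernel_at_el0)
		asid_bits--;
	return (1L << asid_bits) - (long)num_cpus - 2;
}
```

For example, a 16-bit ASID system with the kernel unmapped at EL0 and 8 possible CPUs gives 32768 - 8 - 2 = 32758 shared contexts under this estimate.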
We'll need to frequently find the SMMU master associated with a device
when implementing SVA. Move the lookup to a separate function.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/iommu/arm-smmu-v3.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 77a846440ba6..54bd6913d648 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -747,6 +747,15 @@ static struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 	return container_of(dom, struct arm_smmu_domain, domain);
 }
 
+static struct arm_smmu_master *dev_to_master(struct device *dev)
+{
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+
+	if (!fwspec)
+		return NULL;
+	return fwspec->iommu_priv;
+}
+
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
 	int i = 0;
@@ -2940,15 +2949,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int ret = 0;
 	unsigned long flags;
-	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct arm_smmu_device *smmu;
+	struct arm_smmu_master *master = dev_to_master(dev);
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct arm_smmu_master *master;
 
-	if (!fwspec)
+	if (!master)
 		return -ENOENT;
 
-	master = fwspec->iommu_priv;
 	smmu = master->smmu;
 
 	arm_smmu_detach_dev(master);
-- 
2.25.0
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Hook SVA operations to support sharing page tables with the SMMUv3: * dev_enable/disable/has_feature for device drivers to modify the SVA state. * sva_bind/unbind and sva_get_pasid to bind device and address spaces. * The mm_attach/detach/invalidate/free callbacks from iommu-sva Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/Kconfig | 1 + drivers/iommu/arm-smmu-v3.c | 176 +++++++++++++++++++++++++++++++++++- 2 files changed, 175 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 211684e785ea..05341155d34b 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -434,6 +434,7 @@ config ARM_SMMU_V3 tristate "ARM Ltd. System MMU Version 3 (SMMUv3) Support" depends on ARM64 select IOMMU_API + select IOMMU_SVA select IOMMU_IO_PGTABLE_LPAE select GENERIC_MSI_IRQ_DOMAIN help diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 54bd6913d648..3973f7222864 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -36,6 +36,7 @@ #include <linux/amba/bus.h> #include "io-pgtable-arm.h" +#include "iommu-sva.h" /* MMIO registers */ #define ARM_SMMU_IDR0 0x0 @@ -1872,7 +1873,6 @@ static struct arm_smmu_ctx_desc *arm_smmu_share_asid(u16 asid) return NULL; } -__maybe_unused static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm) { u16 asid; @@ -1969,7 +1969,6 @@ static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm) return ERR_PTR(ret); } -__maybe_unused static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd) { if (arm_smmu_free_asid(cd)) { @@ -2958,6 +2957,12 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) smmu = master->smmu; + if (iommu_sva_enabled(dev)) { + /* Did the previous driver forget to release SVA handles? 
*/ + dev_err(dev, "cannot attach - SVA enabled\n"); + return -EBUSY; + } + arm_smmu_detach_dev(master); mutex_lock(&smmu_domain->init_mutex); @@ -3057,6 +3062,81 @@ arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova) return ops->iova_to_phys(ops, iova); } +static void arm_smmu_mm_invalidate(struct device *dev, int pasid, void *entry, + unsigned long iova, size_t size) +{ + /* TODO: Invalidate ATC */ +} + +static int arm_smmu_mm_attach(struct device *dev, int pasid, void *entry, + bool attach_domain) +{ + struct arm_smmu_ctx_desc *cd = entry; + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + + /* + * If another device in the domain has already been attached, the + * context descriptor is already valid. + */ + if (!attach_domain) + return 0; + + return arm_smmu_write_ctx_desc(smmu_domain, pasid, cd); +} + +static void arm_smmu_mm_detach(struct device *dev, int pasid, void *entry, + bool detach_domain) +{ + struct arm_smmu_ctx_desc *cd = entry; + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + + if (detach_domain) { + arm_smmu_write_ctx_desc(smmu_domain, pasid, NULL); + + /* + * The ASID allocator won't broadcast the final TLB + * invalidations for this ASID, so we need to do it manually. + * For private contexts, freeing io-pgtable ops performs the + * invalidation. 
+ */ + arm_smmu_tlb_inv_asid(smmu_domain->smmu, cd->asid); + } + + /* TODO: invalidate ATC */ +} + +static void *arm_smmu_mm_alloc(struct mm_struct *mm) +{ + return arm_smmu_alloc_shared_cd(mm); +} + +static void arm_smmu_mm_free(void *entry) +{ + arm_smmu_free_shared_cd(entry); +} + +static struct io_mm_ops arm_smmu_mm_ops = { + .alloc = arm_smmu_mm_alloc, + .invalidate = arm_smmu_mm_invalidate, + .attach = arm_smmu_mm_attach, + .detach = arm_smmu_mm_detach, + .release = arm_smmu_mm_free, +}; + +static struct iommu_sva * +arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata) +{ + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + + if (smmu_domain->stage != ARM_SMMU_DOMAIN_S1) + return ERR_PTR(-EINVAL); + + return iommu_sva_bind_generic(dev, mm, &arm_smmu_mm_ops, drvdata); +} + static struct platform_driver arm_smmu_driver; static @@ -3175,6 +3255,7 @@ static void arm_smmu_remove_device(struct device *dev) master = fwspec->iommu_priv; smmu = master->smmu; + iommu_sva_disable(dev); arm_smmu_detach_dev(master); iommu_group_remove_device(dev); iommu_device_unlink(&smmu->iommu, dev); @@ -3294,6 +3375,90 @@ static void arm_smmu_get_resv_regions(struct device *dev, iommu_dma_get_resv_regions(dev, head); } +static bool arm_smmu_iopf_supported(struct arm_smmu_master *master) +{ + return false; +} + +static bool arm_smmu_dev_has_feature(struct device *dev, + enum iommu_dev_features feat) +{ + struct arm_smmu_master *master = dev_to_master(dev); + + if (!master) + return false; + + switch (feat) { + case IOMMU_DEV_FEAT_SVA: + if (!(master->smmu->features & ARM_SMMU_FEAT_SVA)) + return false; + + /* SSID and IOPF support are mandatory for the moment */ + return master->ssid_bits && arm_smmu_iopf_supported(master); + default: + return false; + } +} + +static bool arm_smmu_dev_feature_enabled(struct device *dev, + enum iommu_dev_features feat) +{ + struct arm_smmu_master *master = 
dev_to_master(dev); + + if (!master) + return false; + + switch (feat) { + case IOMMU_DEV_FEAT_SVA: + return iommu_sva_enabled(dev); + default: + return false; + } +} + +static int arm_smmu_dev_enable_sva(struct device *dev) +{ + struct arm_smmu_master *master = dev_to_master(dev); + struct iommu_sva_param param = { + .min_pasid = 1, + .max_pasid = 0xfffffU, + }; + + param.max_pasid = min(param.max_pasid, (1U << master->ssid_bits) - 1); + return iommu_sva_enable(dev, &param); +} + +static int arm_smmu_dev_enable_feature(struct device *dev, + enum iommu_dev_features feat) +{ + if (!arm_smmu_dev_has_feature(dev, feat)) + return -ENODEV; + + if (arm_smmu_dev_feature_enabled(dev, feat)) + return -EBUSY; + + switch (feat) { + case IOMMU_DEV_FEAT_SVA: + return arm_smmu_dev_enable_sva(dev); + default: + return -EINVAL; + } +} + +static int arm_smmu_dev_disable_feature(struct device *dev, + enum iommu_dev_features feat) +{ + if (!arm_smmu_dev_feature_enabled(dev, feat)) + return -EINVAL; + + switch (feat) { + case IOMMU_DEV_FEAT_SVA: + return iommu_sva_disable(dev); + default: + return -EINVAL; + } +} + static struct iommu_ops arm_smmu_ops = { .capable = arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -3312,6 +3477,13 @@ static struct iommu_ops arm_smmu_ops = { .of_xlate = arm_smmu_of_xlate, .get_resv_regions = arm_smmu_get_resv_regions, .put_resv_regions = generic_iommu_put_resv_regions, + .dev_has_feat = arm_smmu_dev_has_feature, + .dev_feat_enabled = arm_smmu_dev_feature_enabled, + .dev_enable_feat = arm_smmu_dev_enable_feature, + .dev_disable_feat = arm_smmu_dev_disable_feature, + .sva_bind = arm_smmu_sva_bind, + .sva_unbind = iommu_sva_unbind_generic, + .sva_get_pasid = iommu_sva_get_pasid_generic, .pgsize_bitmap = -1UL, /* Restricted during device attach */ }; -- 2.25.0
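For reference, the PASID range that arm_smmu_dev_enable_sva() hands to the core can be worked through in isolation. The following is a minimal user-space sketch, not driver code: the helper name sva_max_pasid() is hypothetical, and only the clamp arithmetic (20-bit upper bound, limited by the master's ssid_bits) is taken from the patch above.

```c
#include <stdint.h>

/* Upper bound used by the patch: PASIDs are at most 20 bits wide. */
#define PASID_MAX_20BIT 0xfffffU

/*
 * Hypothetical helper: largest PASID usable by a master that implements
 * ssid_bits substream ID bits. Assumes ssid_bits < 32.
 */
static uint32_t sva_max_pasid(unsigned int ssid_bits)
{
	uint32_t dev_max = (1U << ssid_bits) - 1;

	return dev_max < PASID_MAX_20BIT ? dev_max : PASID_MAX_20BIT;
}
```

With ssid_bits = 8 the usable range is 1..255, since min_pasid = 1 keeps PASID 0 out of the allocator. A master with ssid_bits = 0 would yield a maximum of 0, but arm_smmu_dev_has_feature() already rejects that case before the enable path runs.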
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> The core calls us when an mm is modified. Perform the required ATC invalidations. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 44 ++++++++++++++++++++++++++++++++----- 1 file changed, 38 insertions(+), 6 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 3973f7222864..95b4caceae1a 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2354,6 +2354,20 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size, size_t inval_grain_shift = 12; unsigned long page_start, page_end; + /* + * ATS and PASID: + * + * If substream_valid is clear, the PCIe TLP is sent without a PASID + * prefix. In that case all ATC entries within the address range are + * invalidated, including those that were requested with a PASID! There + * is no way to invalidate only entries without PASID. + * + * When using STRTAB_STE_1_S1DSS_SSID0 (reserving CD 0 for non-PASID + * traffic), translation requests without PASID create ATC entries + * without PASID, which must be invalidated with substream_valid clear. + * This has the unpleasant side-effect of invalidating all PASID-tagged + * ATC entries within the address range. + */ *cmd = (struct arm_smmu_cmdq_ent) { .opcode = CMDQ_OP_ATC_INV, .substream_valid = !!ssid, @@ -2397,12 +2411,12 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size, cmd->atc.size = log2_span; } -static int arm_smmu_atc_inv_master(struct arm_smmu_master *master) +static int arm_smmu_atc_inv_master(struct arm_smmu_master *master, int ssid) { int i; struct arm_smmu_cmdq_ent cmd; - arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd); + arm_smmu_atc_inv_to_cmd(ssid, 0, 0, &cmd); for (i = 0; i < master->num_sids; i++) { cmd.atc.sid = master->sids[i]; @@ -2874,7 +2888,7 @@ static void arm_smmu_disable_ats(struct arm_smmu_master *master) * ATC invalidation via the SMMU. 
*/ wmb(); - arm_smmu_atc_inv_master(master); + arm_smmu_atc_inv_master(master, 0); atomic_dec(&smmu_domain->nr_ats_masters); } @@ -3065,7 +3079,22 @@ arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova) static void arm_smmu_mm_invalidate(struct device *dev, int pasid, void *entry, unsigned long iova, size_t size) { - /* TODO: Invalidate ATC */ + int i; + struct arm_smmu_cmdq_ent cmd; + struct arm_smmu_cmdq_batch cmds = {}; + struct arm_smmu_master *master = dev_to_master(dev); + + if (!master->ats_enabled) + return; + + arm_smmu_atc_inv_to_cmd(pasid, iova, size, &cmd); + + for (i = 0; i < master->num_sids; i++) { + cmd.atc.sid = master->sids[i]; + arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd); + } + + arm_smmu_cmdq_batch_submit(master->smmu, &cmds); } static int arm_smmu_mm_attach(struct device *dev, int pasid, void *entry, @@ -3089,6 +3118,7 @@ static void arm_smmu_mm_detach(struct device *dev, int pasid, void *entry, bool detach_domain) { struct arm_smmu_ctx_desc *cd = entry; + struct arm_smmu_master *master = dev_to_master(dev); struct iommu_domain *domain = iommu_get_domain_for_dev(dev); struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); @@ -3102,9 +3132,11 @@ static void arm_smmu_mm_detach(struct device *dev, int pasid, void *entry, * invalidation. */ arm_smmu_tlb_inv_asid(smmu_domain->smmu, cd->asid); - } + arm_smmu_atc_inv_domain(smmu_domain, pasid, 0, 0); - /* TODO: invalidate ATC */ + } else if (master->ats_enabled) { + arm_smmu_atc_inv_master(master, pasid); + } } static void *arm_smmu_mm_alloc(struct mm_struct *mm) -- 2.25.0
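The substream_valid behaviour described in the ATS and PASID comment above can be illustrated with a toy model. This is only a sketch under stated assumptions: struct toy_atc_entry and toy_atc_inv() are hypothetical names, the "ATC" is a plain array, and the matching rule for PASID-tagged invalidations (only entries with the same PASID are hit) is assumed rather than taken from the patch. The driver's size == 0 "invalidate everything" convention is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one ATC entry; not a real hardware or driver structure. */
struct toy_atc_entry {
	bool valid;
	bool has_pasid;
	uint32_t pasid;
	uint64_t iova;
};

/*
 * Invalidate entries in [iova, iova + size). Mirrors the patch's
 * ".substream_valid = !!ssid": ssid 0 means "no PASID prefix".
 * Returns the number of entries dropped.
 */
static size_t toy_atc_inv(struct toy_atc_entry *atc, size_t n, uint32_t ssid,
			  uint64_t iova, uint64_t size)
{
	bool substream_valid = !!ssid;
	size_t i, dropped = 0;

	for (i = 0; i < n; i++) {
		struct toy_atc_entry *e = &atc[i];

		if (!e->valid || e->iova < iova || e->iova - iova >= size)
			continue;
		/* With a PASID prefix, only entries with that PASID match. */
		if (substream_valid && !(e->has_pasid && e->pasid == ssid))
			continue;
		/* Without one, every entry in range is hit. */
		e->valid = false;
		dropped++;
	}
	return dropped;
}
```

Calling toy_atc_inv() with ssid = 0 drops PASID-tagged entries in the range as well, which is the side effect the comment warns about when CD 0 is reserved for non-PASID traffic.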
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> If the SMMU supports it and the kernel was built with HTTU support, enable hardware update of access and dirty flags. This is essential for shared page tables, to reduce the number of access faults on the fault queue. We can enable HTTU even if CPUs don't support it, because the kernel always checks for HW dirty bit and updates the PTE flags atomically. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 95b4caceae1a..015e8e59e0ef 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -57,6 +57,8 @@ #define IDR0_ASID16 (1 << 12) #define IDR0_ATS (1 << 10) #define IDR0_HYP (1 << 9) +#define IDR0_HD (1 << 7) +#define IDR0_HA (1 << 6) #define IDR0_BTM (1 << 5) #define IDR0_COHACC (1 << 4) #define IDR0_TTF GENMASK(3, 2) @@ -305,6 +307,9 @@ #define CTXDESC_CD_0_TCR_IPS GENMASK_ULL(34, 32) #define CTXDESC_CD_0_TCR_TBI0 (1ULL << 38) +#define CTXDESC_CD_0_TCR_HA (1UL << 43) +#define CTXDESC_CD_0_TCR_HD (1UL << 42) + #define CTXDESC_CD_0_AA64 (1UL << 41) #define CTXDESC_CD_0_S (1UL << 44) #define CTXDESC_CD_0_R (1UL << 45) @@ -646,6 +651,8 @@ struct arm_smmu_device { #define ARM_SMMU_FEAT_E2H (1 << 15) #define ARM_SMMU_FEAT_BTM (1 << 16) #define ARM_SMMU_FEAT_SVA (1 << 17) +#define ARM_SMMU_FEAT_HA (1 << 18) +#define ARM_SMMU_FEAT_HD (1 << 19) u32 features; #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0) @@ -1665,10 +1672,17 @@ static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, * this substream's traffic */ } else { /* (1) and (2) */ + u64 tcr = cd->tcr; + cdptr[1] = cpu_to_le64(cd->ttbr & CTXDESC_CD_1_TTB0_MASK); cdptr[2] = 0; cdptr[3] = cpu_to_le64(cd->mair); + if (!(smmu->features & ARM_SMMU_FEAT_HD)) + tcr &= ~CTXDESC_CD_0_TCR_HD; + if (!(smmu->features & ARM_SMMU_FEAT_HA)) 
+ tcr &= ~CTXDESC_CD_0_TCR_HA; + /* * STE is live, and the SMMU might read dwords of this CD in any * order. Ensure that it observes valid values before reading @@ -1676,7 +1690,7 @@ static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, */ arm_smmu_sync_cd(smmu_domain, ssid, true); - val = cd->tcr | + val = tcr | #ifdef __BIG_ENDIAN CTXDESC_CD_0_ENDI | #endif @@ -1919,10 +1933,12 @@ static struct arm_smmu_ctx_desc *arm_smmu_alloc_shared_cd(struct mm_struct *mm) return old_cd; } + /* HA and HD will be filtered out later if not supported by the SMMU */ tcr = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, 64ULL - VA_BITS) | FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, ARM_LPAE_TCR_RGN_WBWA) | FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, ARM_LPAE_TCR_RGN_WBWA) | FIELD_PREP(CTXDESC_CD_0_TCR_SH0, ARM_LPAE_TCR_SH_IS) | + CTXDESC_CD_0_TCR_HA | CTXDESC_CD_0_TCR_HD | CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64; switch (PAGE_SIZE) { @@ -4211,6 +4227,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) smmu->features |= ARM_SMMU_FEAT_E2H; } + if (reg & (IDR0_HA | IDR0_HD)) { + smmu->features |= ARM_SMMU_FEAT_HA; + if (reg & IDR0_HD) + smmu->features |= ARM_SMMU_FEAT_HD; + } + /* * If the CPU is using VHE, but the SMMU doesn't support it, the SMMU * will create TLB entries for NH-EL1 world and will miss the -- 2.25.0
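The HTTU handling above has two halves: the probe derives feature bits from IDR0, and the context descriptor write filters the HA/HD requests against those bits. A minimal user-space sketch of both steps follows. The bit positions are copied from the patch; the function names httu_probe() and httu_filter_tcr() and the shortened FEAT_* and CD_TCR_* macro names are hypothetical.

```c
#include <stdint.h>

#define IDR0_HD		(1U << 7)
#define IDR0_HA		(1U << 6)
#define FEAT_HA		(1U << 18)
#define FEAT_HD		(1U << 19)
#define CD_TCR_HA	(1ULL << 43)
#define CD_TCR_HD	(1ULL << 42)

/* Probe step: dirty-flag update (HD) implies access-flag update (HA). */
static uint32_t httu_probe(uint32_t idr0)
{
	uint32_t features = 0;

	if (idr0 & (IDR0_HA | IDR0_HD)) {
		features |= FEAT_HA;
		if (idr0 & IDR0_HD)
			features |= FEAT_HD;
	}
	return features;
}

/* Write step: drop the HA/HD requests the SMMU cannot honour. */
static uint64_t httu_filter_tcr(uint64_t tcr, uint32_t features)
{
	if (!(features & FEAT_HD))
		tcr &= ~CD_TCR_HD;
	if (!(features & FEAT_HA))
		tcr &= ~CD_TCR_HA;
	return tcr;
}
```

Keeping the filter in the write path means arm_smmu_alloc_shared_cd() can always request both bits without knowing which SMMU the descriptor will end up on.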
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> When handling faults from the event or PRI queue, we need to find the struct device associated to a SID. Add a rb_tree to keep track of SIDs. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 177 +++++++++++++++++++++++++++++------- 1 file changed, 145 insertions(+), 32 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 015e8e59e0ef..28f8583cd47b 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -684,6 +684,15 @@ struct arm_smmu_device { /* IOMMU core code handle */ struct iommu_device iommu; + + struct rb_root streams; + struct mutex streams_mutex; +}; + +struct arm_smmu_stream { + u32 id; + struct arm_smmu_master *master; + struct rb_node node; }; /* SMMU private data for each master */ @@ -692,8 +701,8 @@ struct arm_smmu_master { struct device *dev; struct arm_smmu_domain *domain; struct list_head domain_head; - u32 *sids; - unsigned int num_sids; + struct arm_smmu_stream *streams; + unsigned int num_streams; bool ats_enabled; unsigned int ssid_bits; }; @@ -1573,8 +1582,8 @@ static void arm_smmu_sync_cd(struct arm_smmu_domain *smmu_domain, spin_lock_irqsave(&smmu_domain->devices_lock, flags); list_for_each_entry(master, &smmu_domain->devices, domain_head) { - for (i = 0; i < master->num_sids; i++) { - cmd.cfgi.sid = master->sids[i]; + for (i = 0; i < master->num_streams; i++) { + cmd.cfgi.sid = master->streams[i].id; arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd); } } @@ -2201,6 +2210,32 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid) return 0; } +__maybe_unused +static struct arm_smmu_master * +arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid) +{ + struct rb_node *node; + struct arm_smmu_stream *stream; + struct arm_smmu_master *master = NULL; + + mutex_lock(&smmu->streams_mutex); + node = smmu->streams.rb_node; + while (node) { + stream = rb_entry(node, 
struct arm_smmu_stream, node); + if (stream->id < sid) { + node = node->rb_right; + } else if (stream->id > sid) { + node = node->rb_left; + } else { + master = stream->master; + break; + } + } + mutex_unlock(&smmu->streams_mutex); + + return master; +} + /* IRQ and event handlers */ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev) { @@ -2434,8 +2469,8 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master, int ssid) arm_smmu_atc_inv_to_cmd(ssid, 0, 0, &cmd); - for (i = 0; i < master->num_sids; i++) { - cmd.atc.sid = master->sids[i]; + for (i = 0; i < master->num_streams; i++) { + cmd.atc.sid = master->streams[i].id; arm_smmu_cmdq_issue_cmd(master->smmu, &cmd); } @@ -2478,8 +2513,8 @@ static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, if (!master->ats_enabled) continue; - for (i = 0; i < master->num_sids; i++) { - cmd.atc.sid = master->sids[i]; + for (i = 0; i < master->num_streams; i++) { + cmd.atc.sid = master->streams[i].id; arm_smmu_cmdq_batch_add(smmu_domain->smmu, &cmds, &cmd); } } @@ -2846,13 +2881,13 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master) int i, j; struct arm_smmu_device *smmu = master->smmu; - for (i = 0; i < master->num_sids; ++i) { - u32 sid = master->sids[i]; + for (i = 0; i < master->num_streams; ++i) { + u32 sid = master->streams[i].id; __le64 *step = arm_smmu_get_step_for_sid(smmu, sid); /* Bridged PCI devices may end up with duplicated IDs */ for (j = 0; j < i; j++) - if (master->sids[j] == sid) + if (master->streams[j].id == sid) break; if (j < i) continue; @@ -3105,8 +3140,8 @@ static void arm_smmu_mm_invalidate(struct device *dev, int pasid, void *entry, arm_smmu_atc_inv_to_cmd(pasid, iova, size, &cmd); - for (i = 0; i < master->num_sids; i++) { - cmd.atc.sid = master->sids[i]; + for (i = 0; i < master->num_streams; i++) { + cmd.atc.sid = master->streams[i].id; arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd); } @@ -3206,11 +3241,99 @@ static bool 
arm_smmu_sid_in_range(struct arm_smmu_device *smmu, u32 sid) return sid < limit; } +static int arm_smmu_insert_master(struct arm_smmu_device *smmu, + struct arm_smmu_master *master) +{ + int i; + int ret = 0; + struct arm_smmu_stream *new_stream, *cur_stream; + struct rb_node **new_node, *parent_node = NULL; + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev); + + master->streams = kcalloc(fwspec->num_ids, + sizeof(struct arm_smmu_stream), GFP_KERNEL); + if (!master->streams) + return -ENOMEM; + master->num_streams = fwspec->num_ids; + + mutex_lock(&smmu->streams_mutex); + for (i = 0; i < fwspec->num_ids && !ret; i++) { + u32 sid = fwspec->ids[i]; + + new_stream = &master->streams[i]; + new_stream->id = sid; + new_stream->master = master; + + /* Check the SIDs are in range of the SMMU and our stream table */ + if (!arm_smmu_sid_in_range(smmu, sid)) { + ret = -ERANGE; + break; + } + + /* Ensure l2 strtab is initialised */ + if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) { + ret = arm_smmu_init_l2_strtab(smmu, sid); + if (ret) + break; + } + + /* Insert into SID tree */ + new_node = &(smmu->streams.rb_node); + while (*new_node) { + cur_stream = rb_entry(*new_node, struct arm_smmu_stream, + node); + parent_node = *new_node; + if (cur_stream->id > new_stream->id) { + new_node = &((*new_node)->rb_left); + } else if (cur_stream->id < new_stream->id) { + new_node = &((*new_node)->rb_right); + } else { + dev_warn(master->dev, + "stream %u already in tree\n", + cur_stream->id); + ret = -EINVAL; + break; + } + } + + if (ret) + break; + rb_link_node(&new_stream->node, parent_node, new_node); + rb_insert_color(&new_stream->node, &smmu->streams); + } + + if (ret) { + for (i--; i >= 0; i--) + rb_erase(&master->streams[i].node, &smmu->streams); + kfree(master->streams); + } + mutex_unlock(&smmu->streams_mutex); + + return ret; +} + +static void arm_smmu_remove_master(struct arm_smmu_device *smmu, + struct arm_smmu_master *master) +{ + int i; + struct iommu_fwspec 
*fwspec = dev_iommu_fwspec_get(master->dev); + + if (!master->streams) + return; + + mutex_lock(&smmu->streams_mutex); + for (i = 0; i < fwspec->num_ids; i++) + rb_erase(&master->streams[i].node, &smmu->streams); + mutex_unlock(&smmu->streams_mutex); + + kfree(master->streams); +} + static struct iommu_ops arm_smmu_ops; static int arm_smmu_add_device(struct device *dev) { - int i, ret; + int ret; struct arm_smmu_device *smmu; struct arm_smmu_master *master; struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); @@ -3232,26 +3355,11 @@ static int arm_smmu_add_device(struct device *dev) master->dev = dev; master->smmu = smmu; - master->sids = fwspec->ids; - master->num_sids = fwspec->num_ids; fwspec->iommu_priv = master; - /* Check the SIDs are in range of the SMMU and our stream table */ - for (i = 0; i < master->num_sids; i++) { - u32 sid = master->sids[i]; - - if (!arm_smmu_sid_in_range(smmu, sid)) { - ret = -ERANGE; - goto err_free_master; - } - - /* Ensure l2 strtab is initialised */ - if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) { - ret = arm_smmu_init_l2_strtab(smmu, sid); - if (ret) - goto err_free_master; - } - } + ret = arm_smmu_insert_master(smmu, master); + if (ret) + goto err_free_master; master->ssid_bits = min(smmu->ssid_bits, fwspec->num_pasid_bits); @@ -3286,6 +3394,7 @@ static int arm_smmu_add_device(struct device *dev) iommu_device_unlink(&smmu->iommu, dev); err_disable_pasid: arm_smmu_disable_pasid(master); + arm_smmu_remove_master(smmu, master); err_free_master: kfree(master); fwspec->iommu_priv = NULL; @@ -3308,6 +3417,7 @@ static void arm_smmu_remove_device(struct device *dev) iommu_group_remove_device(dev); iommu_device_unlink(&smmu->iommu, dev); arm_smmu_disable_pasid(master); + arm_smmu_remove_master(smmu, master); kfree(master); iommu_fwspec_free(dev); } @@ -3751,6 +3861,9 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu) { int ret; + mutex_init(&smmu->streams_mutex); + smmu->streams = RB_ROOT; + ret = 
arm_smmu_init_queues(smmu); if (ret) return ret; -- 2.25.0
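The SID lookup added by this patch is an ordinary ordered-tree walk keyed on the stream ID. As a rough illustration, here is a user-space sketch that uses a plain unbalanced binary search tree in place of the kernel rb-tree. All toy_* names are hypothetical, rebalancing and locking are left out, and only the comparison order and the duplicate check follow the patch.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_master {
	const char *name;		/* stands in for the real master state */
};

struct toy_stream {
	uint32_t id;			/* stream ID (SID) */
	struct toy_master *master;
	struct toy_stream *left, *right;
};

/* Returns 0 on success, -1 if the SID is already present. */
static int toy_stream_insert(struct toy_stream **root, struct toy_stream *new)
{
	struct toy_stream **link = root;

	while (*link) {
		if ((*link)->id > new->id)
			link = &(*link)->left;
		else if ((*link)->id < new->id)
			link = &(*link)->right;
		else
			return -1;
	}
	new->left = new->right = NULL;
	*link = new;
	return 0;
}

/* Same walk as arm_smmu_find_master(): go right when the node is smaller. */
static struct toy_master *toy_find_master(struct toy_stream *root, uint32_t sid)
{
	while (root) {
		if (root->id < sid)
			root = root->right;
		else if (root->id > sid)
			root = root->left;
		else
			return root->master;
	}
	return NULL;
}
```

Several streams can point at the same master, so a fault reported on any of a device's SIDs resolves to the same owner. A SID that was never inserted resolves to NULL, which the event handler has to treat as an unknown source.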
When a device or driver misbehaves, it is possible to receive events much faster than we can print them out. Ratelimit the printing of events. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- During the SVA tests, when the device driver didn't properly stop DMA before unbinding, the event queue thread would almost lock up the server with a flood of event 0xa. This patch helped recover from the error. --- drivers/iommu/arm-smmu-v3.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 28f8583cd47b..6a5987cce03f 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2243,17 +2243,20 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev) struct arm_smmu_device *smmu = dev; struct arm_smmu_queue *q = &smmu->evtq.q; struct arm_smmu_ll_queue *llq = &q->llq; + static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); u64 evt[EVTQ_ENT_DWORDS]; do { while (!queue_remove_raw(q, evt)) { u8 id = FIELD_GET(EVTQ_0_ID, evt[0]); - dev_info(smmu->dev, "event 0x%02x received:\n", id); - for (i = 0; i < ARRAY_SIZE(evt); ++i) - dev_info(smmu->dev, "\t0x%016llx\n", - (unsigned long long)evt[i]); - + if (__ratelimit(&rs)) { + dev_info(smmu->dev, "event 0x%02x received:\n", id); + for (i = 0; i < ARRAY_SIZE(evt); ++i) + dev_info(smmu->dev, "\t0x%016llx\n", + (unsigned long long)evt[i]); + } } /* -- 2.25.0
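To make the effect of the interval and burst parameters concrete, here is a simplified fixed-window model of a rate limiter. It is a sketch, not the kernel implementation: struct toy_ratelimit and toy_ratelimit_ok() are hypothetical names, time is passed in explicitly so the behaviour is deterministic, and details of the real __ratelimit() such as locking and counting suppressed messages are left out.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_ratelimit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* messages let through in this window */
};

/* Returns true if the caller may print at time 'now'. */
static bool toy_ratelimit_ok(struct toy_ratelimit *rs, uint64_t now)
{
	if (now - rs->begin >= rs->interval) {
		/* Window expired: start a new one. */
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed >= rs->burst)
		return false;
	rs->printed++;
	return true;
}
```

In the event thread the whole multi-line dump for one event sits behind a single check, so an event is either printed in full or skipped in full.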
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> On ARM systems, some platform devices behind an IOMMU may support stall, which is the ability to recover from page faults. Let the firmware tell us when a device supports stall. Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- .../devicetree/bindings/iommu/iommu.txt | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/Documentation/devicetree/bindings/iommu/iommu.txt b/Documentation/devicetree/bindings/iommu/iommu.txt index 3c36334e4f94..26ba9e530f13 100644 --- a/Documentation/devicetree/bindings/iommu/iommu.txt +++ b/Documentation/devicetree/bindings/iommu/iommu.txt @@ -92,6 +92,24 @@ Optional properties: tagging DMA transactions with an address space identifier. By default, this is 0, which means that the device only has one address space. +- dma-can-stall: When present, the master can wait for a transaction to + complete for an indefinite amount of time. Upon a translation fault, some + IOMMUs, instead of aborting the translation immediately, may first + notify the driver and keep the transaction in flight. This allows the OS + to inspect the fault and, for example, make physical pages resident + before updating the mappings and completing the transaction. Such an IOMMU + accepts a limited number of simultaneous stalled transactions before + having to either put back-pressure on the master, or abort new faulting + transactions. + + Firmware has to opt in to stalling, because most buses and masters don't + support it. In particular it isn't compatible with PCI, where + transactions have to complete before a time limit. More generally it + won't work in systems and masters that haven't been designed for + stalling. For example the OS, in order to handle a stalled transaction, + may attempt to retrieve pages from secondary storage in a stalled + domain, leading to a deadlock. + Notes: ====== -- 2.25.0
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> The SMMU provides a Stall model for handling page faults in platform devices. It is similar to PCI PRI, but doesn't require devices to have their own translation cache. Instead, faulting transactions are parked and the OS is given a chance to fix the page tables and retry the transaction. Enable stall for devices that support it (opt-in by firmware). When an event corresponds to a translation error, call the IOMMU fault handler. If the fault is recoverable, it will call us back to terminate or continue the stall. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++-- drivers/iommu/of_iommu.c | 5 +- include/linux/iommu.h | 2 + 3 files changed, 269 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 6a5987cce03f..da5dda5ba26a 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -374,6 +374,13 @@ #define CMDQ_PRI_1_GRPID GENMASK_ULL(8, 0) #define CMDQ_PRI_1_RESP GENMASK_ULL(13, 12) +#define CMDQ_RESUME_0_SID GENMASK_ULL(63, 32) +#define CMDQ_RESUME_0_RESP_TERM 0UL +#define CMDQ_RESUME_0_RESP_RETRY 1UL +#define CMDQ_RESUME_0_RESP_ABORT 2UL +#define CMDQ_RESUME_0_RESP GENMASK_ULL(13, 12) +#define CMDQ_RESUME_1_STAG GENMASK_ULL(15, 0) + #define CMDQ_SYNC_0_CS GENMASK_ULL(13, 12) #define CMDQ_SYNC_0_CS_NONE 0 #define CMDQ_SYNC_0_CS_IRQ 1 @@ -390,6 +397,25 @@ #define EVTQ_0_ID GENMASK_ULL(7, 0) +#define EVT_ID_TRANSLATION_FAULT 0x10 +#define EVT_ID_ADDR_SIZE_FAULT 0x11 +#define EVT_ID_ACCESS_FAULT 0x12 +#define EVT_ID_PERMISSION_FAULT 0x13 + +#define EVTQ_0_SSV (1UL << 11) +#define EVTQ_0_SSID GENMASK_ULL(31, 12) +#define EVTQ_0_SID GENMASK_ULL(63, 32) +#define EVTQ_1_STAG GENMASK_ULL(15, 0) +#define EVTQ_1_STALL (1UL << 31) +#define EVTQ_1_PRIV (1UL << 33) +#define EVTQ_1_EXEC (1UL << 34) +#define EVTQ_1_READ (1UL << 35) +#define EVTQ_1_S2 (1UL << 39) 
+#define EVTQ_1_CLASS GENMASK_ULL(41, 40) +#define EVTQ_1_TT_READ (1UL << 44) +#define EVTQ_2_ADDR GENMASK_ULL(63, 0) +#define EVTQ_3_IPA GENMASK_ULL(51, 12) + /* PRI queue */ #define PRIQ_ENT_SZ_SHIFT 4 #define PRIQ_ENT_DWORDS ((1 << PRIQ_ENT_SZ_SHIFT) >> 3) @@ -510,6 +536,13 @@ struct arm_smmu_cmdq_ent { enum pri_resp resp; } pri; + #define CMDQ_OP_RESUME 0x44 + struct { + u32 sid; + u16 stag; + u8 resp; + } resume; + #define CMDQ_OP_CMD_SYNC 0x46 struct { u64 msiaddr; @@ -545,6 +578,10 @@ struct arm_smmu_queue { u32 __iomem *prod_reg; u32 __iomem *cons_reg; + + /* Event and PRI */ + u64 batch; + wait_queue_head_t wq; }; struct arm_smmu_queue_poll { @@ -568,6 +605,7 @@ struct arm_smmu_cmdq_batch { struct arm_smmu_evtq { struct arm_smmu_queue q; + struct iopf_queue *iopf; u32 max_stalls; }; @@ -704,6 +742,7 @@ struct arm_smmu_master { struct arm_smmu_stream *streams; unsigned int num_streams; bool ats_enabled; + bool stall_enabled; unsigned int ssid_bits; }; @@ -721,6 +760,7 @@ struct arm_smmu_domain { struct io_pgtable_ops *pgtbl_ops; bool non_strict; + bool stall_enabled; atomic_t nr_ats_masters; enum arm_smmu_domain_stage stage; @@ -985,6 +1025,11 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent) } cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp); break; + case CMDQ_OP_RESUME: + cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_SID, ent->resume.sid); + cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_RESP, ent->resume.resp); + cmd[1] |= FIELD_PREP(CMDQ_RESUME_1_STAG, ent->resume.stag); + break; case CMDQ_OP_CMD_SYNC: if (ent->sync.msiaddr) { cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_IRQ); @@ -1551,6 +1596,45 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu, return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, true); } +static int arm_smmu_page_response(struct device *dev, + struct iommu_fault_event *unused, + struct iommu_page_response *resp) +{ + struct arm_smmu_cmdq_ent cmd = {0}; + struct arm_smmu_master *master 
= dev_iommu_fwspec_get(dev)->iommu_priv; + int sid = master->streams[0].id; + + if (master->stall_enabled) { + cmd.opcode = CMDQ_OP_RESUME; + cmd.resume.sid = sid; + cmd.resume.stag = resp->grpid; + switch (resp->code) { + case IOMMU_PAGE_RESP_INVALID: + case IOMMU_PAGE_RESP_FAILURE: + cmd.resume.resp = CMDQ_RESUME_0_RESP_ABORT; + break; + case IOMMU_PAGE_RESP_SUCCESS: + cmd.resume.resp = CMDQ_RESUME_0_RESP_RETRY; + break; + default: + return -EINVAL; + } + } else { + /* TODO: insert PRI response here */ + return -ENODEV; + } + + arm_smmu_cmdq_issue_cmd(master->smmu, &cmd); + /* + * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP. + * RESUME consumption guarantees that the stalled transaction will be + * terminated... at some point in the future. PRI_RESP is fire and + * forget. + */ + + return 0; +} + /* Context descriptor manipulation functions */ static void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid) { @@ -1709,8 +1793,7 @@ static int __arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) | CTXDESC_CD_0_V; - /* STALL_MODEL==0b10 && CD.S==0 is ILLEGAL */ - if (smmu->features & ARM_SMMU_FEAT_STALL_FORCE) + if (smmu_domain->stall_enabled) val |= CTXDESC_CD_0_S; } @@ -2133,7 +2216,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, FIELD_PREP(STRTAB_STE_1_STRW, strw)); if (smmu->features & ARM_SMMU_FEAT_STALLS && - !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE)) + !master->stall_enabled) dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD); val |= (s1_cfg->cdcfg.cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) | @@ -2210,7 +2293,6 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid) return 0; } -__maybe_unused static struct arm_smmu_master * arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid) { @@ -2237,21 +2319,119 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid) } /* IRQ and event handlers */ +static int 
arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) +{ + int ret; + u32 perm = 0; + struct arm_smmu_master *master; + bool ssid_valid = evt[0] & EVTQ_0_SSV; + u8 type = FIELD_GET(EVTQ_0_ID, evt[0]); + u32 sid = FIELD_GET(EVTQ_0_SID, evt[0]); + struct iommu_fault_event fault_evt = { }; + struct iommu_fault *flt = &fault_evt.fault; + + /* Stage-2 is always pinned at the moment */ + if (evt[1] & EVTQ_1_S2) + return -EFAULT; + + master = arm_smmu_find_master(smmu, sid); + if (!master) + return -EINVAL; + + if (evt[1] & EVTQ_1_READ) + perm |= IOMMU_FAULT_PERM_READ; + else + perm |= IOMMU_FAULT_PERM_WRITE; + + if (evt[1] & EVTQ_1_EXEC) + perm |= IOMMU_FAULT_PERM_EXEC; + + if (evt[1] & EVTQ_1_PRIV) + perm |= IOMMU_FAULT_PERM_PRIV; + + if (evt[1] & EVTQ_1_STALL) { + flt->type = IOMMU_FAULT_PAGE_REQ; + flt->prm = (struct iommu_fault_page_request) { + .flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, + .pasid = FIELD_GET(EVTQ_0_SSID, evt[0]), + .grpid = FIELD_GET(EVTQ_1_STAG, evt[1]), + .perm = perm, + .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]), + }; + + if (ssid_valid) + flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID; + } else { + flt->type = IOMMU_FAULT_DMA_UNRECOV; + flt->event = (struct iommu_fault_unrecoverable) { + .flags = IOMMU_FAULT_UNRECOV_ADDR_VALID | + IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID, + .pasid = FIELD_GET(EVTQ_0_SSID, evt[0]), + .perm = perm, + .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]), + .fetch_addr = FIELD_GET(EVTQ_3_IPA, evt[3]), + }; + + if (ssid_valid) + flt->event.flags |= IOMMU_FAULT_UNRECOV_PASID_VALID; + + switch (type) { + case EVT_ID_TRANSLATION_FAULT: + case EVT_ID_ADDR_SIZE_FAULT: + case EVT_ID_ACCESS_FAULT: + flt->event.reason = IOMMU_FAULT_REASON_PTE_FETCH; + break; + case EVT_ID_PERMISSION_FAULT: + flt->event.reason = IOMMU_FAULT_REASON_PERMISSION; + break; + default: + /* TODO: report other unrecoverable faults. 
*/ + return -EFAULT; + } + } + + ret = iommu_report_device_fault(master->dev, &fault_evt); + if (ret && flt->type == IOMMU_FAULT_PAGE_REQ) { + /* Nobody cared, abort the access */ + struct iommu_page_response resp = { + .pasid = flt->prm.pasid, + .grpid = flt->prm.grpid, + .code = IOMMU_PAGE_RESP_FAILURE, + }; + arm_smmu_page_response(master->dev, NULL, &resp); + } + + return ret; +} + static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev) { - int i; + int i, ret; + int num_handled = 0; struct arm_smmu_device *smmu = dev; struct arm_smmu_queue *q = &smmu->evtq.q; struct arm_smmu_ll_queue *llq = &q->llq; + size_t queue_size = 1 << llq->max_n_shift; static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); u64 evt[EVTQ_ENT_DWORDS]; + spin_lock(&q->wq.lock); do { while (!queue_remove_raw(q, evt)) { u8 id = FIELD_GET(EVTQ_0_ID, evt[0]); - if (__ratelimit(&rs)) { + spin_unlock(&q->wq.lock); + ret = arm_smmu_handle_evt(smmu, evt); + spin_lock(&q->wq.lock); + + if (++num_handled == queue_size) { + q->batch++; + wake_up_all_locked(&q->wq); + num_handled = 0; + } + + if (ret && __ratelimit(&rs)) { dev_info(smmu->dev, "event 0x%02x received:\n", id); for (i = 0; i < ARRAY_SIZE(evt); ++i) dev_info(smmu->dev, "\t0x%016llx\n", @@ -2270,6 +2450,11 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev) /* Sync our overflow flag, as we believe we're up to speed */ llq->cons = Q_OVF(llq->prod) | Q_WRP(llq, llq->cons) | Q_IDX(llq, llq->cons); + queue_sync_cons_out(q); + + wake_up_all_locked(&q->wq); + spin_unlock(&q->wq.lock); + return IRQ_HANDLED; } @@ -2333,6 +2518,33 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev) return IRQ_HANDLED; } +/* + * arm_smmu_flush_evtq - wait until all events currently in the queue have been + * consumed. + * + * Wait until the evtq thread finished a batch, or until the queue is empty. + * Note that we don't handle overflows on q->batch. If it occurs, just wait for + * the queue to be empty. 
+ */ +static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid) +{ + int ret; + u64 batch; + struct arm_smmu_device *smmu = cookie; + struct arm_smmu_queue *q = &smmu->evtq.q; + + spin_lock(&q->wq.lock); + if (queue_sync_prod_in(q) == -EOVERFLOW) + dev_err(smmu->dev, "evtq overflow detected -- requests lost\n"); + + batch = q->batch; + ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) || + q->batch >= batch + 2); + spin_unlock(&q->wq.lock); + + return ret; +} + static int arm_smmu_device_disable(struct arm_smmu_device *smmu); static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev) @@ -2724,6 +2936,9 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain, cfg->s1cdmax = master->ssid_bits; + if (master->stall_enabled) + smmu_domain->stall_enabled = true; + ret = arm_smmu_alloc_cd_tables(smmu_domain); if (ret) goto out_free_asid; @@ -3056,6 +3271,10 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) smmu_domain->s1_cfg.s1cdmax, master->ssid_bits); ret = -EINVAL; goto out_unlock; + } else if (smmu_domain->stall_enabled && !master->stall_enabled) { + dev_err(dev, "cannot attach to stall-enabled domain\n"); + ret = -EINVAL; + goto out_unlock; } master->domain = smmu_domain; @@ -3380,6 +3599,10 @@ static int arm_smmu_add_device(struct device *dev) master->ssid_bits = min_t(u8, master->ssid_bits, CTXDESC_LINEAR_CDMAX); + if ((smmu->features & ARM_SMMU_FEAT_STALLS && fwspec->can_stall) || + smmu->features & ARM_SMMU_FEAT_STALL_FORCE) + master->stall_enabled = true; + ret = iommu_device_link(&smmu->iommu, dev); if (ret) goto err_disable_pasid; @@ -3415,6 +3638,7 @@ static void arm_smmu_remove_device(struct device *dev) master = fwspec->iommu_priv; smmu = master->smmu; + iopf_queue_remove_device(smmu->evtq.iopf, dev); iommu_sva_disable(dev); arm_smmu_detach_dev(master); iommu_group_remove_device(dev); @@ -3538,7 +3762,7 @@ static void arm_smmu_get_resv_regions(struct device *dev, 
 static bool arm_smmu_iopf_supported(struct arm_smmu_master *master)
 {
-	return false;
+	return master->stall_enabled;
 }

 static bool arm_smmu_dev_has_feature(struct device *dev,
@@ -3579,6 +3803,7 @@ static bool arm_smmu_dev_feature_enabled(struct device *dev,

 static int arm_smmu_dev_enable_sva(struct device *dev)
 {
+	int ret;
 	struct arm_smmu_master *master = dev_to_master(dev);
 	struct iommu_sva_param param = {
 		.min_pasid = 1,
@@ -3586,7 +3811,21 @@ static int arm_smmu_dev_enable_sva(struct device *dev)
 	};

 	param.max_pasid = min(param.max_pasid, (1U << master->ssid_bits) - 1);
-	return iommu_sva_enable(dev, &param);
+
+	ret = iommu_sva_enable(dev, &param);
+	if (ret)
+		return ret;
+
+	if (master->stall_enabled) {
+		ret = iopf_queue_add_device(master->smmu->evtq.iopf, dev);
+		if (ret)
+			goto err_disable_sva;
+	}
+	return 0;
+
+err_disable_sva:
+	iommu_sva_disable(dev);
+	return ret;
 }

 static int arm_smmu_dev_enable_feature(struct device *dev,
@@ -3609,11 +3848,14 @@ static int arm_smmu_dev_enable_feature(struct device *dev,
 static int arm_smmu_dev_disable_feature(struct device *dev,
 					enum iommu_dev_features feat)
 {
+	struct arm_smmu_master *master = dev_to_master(dev);
+
 	if (!arm_smmu_dev_feature_enabled(dev, feat))
 		return -EINVAL;

 	switch (feat) {
 	case IOMMU_DEV_FEAT_SVA:
+		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
 		return iommu_sva_disable(dev);
 	default:
 		return -EINVAL;
@@ -3645,6 +3887,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.sva_bind = arm_smmu_sva_bind,
 	.sva_unbind = iommu_sva_unbind_generic,
 	.sva_get_pasid = iommu_sva_get_pasid_generic,
+	.page_response = arm_smmu_page_response,
 	.pgsize_bitmap = -1UL, /* Restricted during device attach */
 };

@@ -3688,6 +3931,10 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);

 	q->llq.prod = q->llq.cons = 0;
+
+	init_waitqueue_head(&q->wq);
+	q->batch = 0;
+
 	return 0;
 }

@@ -3741,6 +3988,13 @@ static int arm_smmu_init_queues(struct
arm_smmu_device *smmu) if (ret) return ret; + if (smmu->features & ARM_SMMU_FEAT_STALLS) { + smmu->evtq.iopf = iopf_queue_alloc(dev_name(smmu->dev), + arm_smmu_flush_evtq, smmu); + if (!smmu->evtq.iopf) + return -ENOMEM; + } + /* priq */ if (!(smmu->features & ARM_SMMU_FEAT_PRI)) return 0; @@ -4716,6 +4970,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev) iommu_device_unregister(&smmu->iommu); iommu_device_sysfs_remove(&smmu->iommu); arm_smmu_device_disable(smmu); + iopf_queue_free(smmu->evtq.iopf); return 0; } diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c index 20738aacac89..dd7017750954 100644 --- a/drivers/iommu/of_iommu.c +++ b/drivers/iommu/of_iommu.c @@ -205,9 +205,12 @@ const struct iommu_ops *of_iommu_configure(struct device *dev, } fwspec = dev_iommu_fwspec_get(dev); - if (!err && fwspec) + if (!err && fwspec) { of_property_read_u32(master_np, "pasid-num-bits", &fwspec->num_pasid_bits); + fwspec->can_stall = of_property_read_bool(master_np, + "dma-can-stall"); + } } /* diff --git a/include/linux/iommu.h b/include/linux/iommu.h index e52a8731e7a9..b39dae6608c5 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -595,6 +595,7 @@ struct iommu_group *fsl_mc_device_group(struct device *dev); * @iommu_fwnode: firmware handle for this device's IOMMU * @iommu_priv: IOMMU driver private data for this device * @num_pasid_bits: number of PASID bits supported by this device + * @can_stall: the device is allowed to stall * @num_ids: number of associated device IDs * @ids: IDs which this device may present to the IOMMU */ @@ -603,6 +604,7 @@ struct iommu_fwspec { struct fwnode_handle *iommu_fwnode; void *iommu_priv; u32 num_pasid_bits; + bool can_stall; unsigned int num_ids; u32 ids[1]; }; -- 2.25.0
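A note on the "q->batch >= batch + 2" condition in arm_smmu_flush_evtq() above: the flusher samples the batch counter at an arbitrary point inside the consumer's current batch, so the first increment may happen after a single event. Only the second increment guarantees that at least queue_size events were consumed since the flush started. The user-space model below is a rough sketch of that accounting, not driver code. All names (model_queue, model_consume(), model_flush_done(), model_events_until_flush()) are hypothetical, and it assumes, like the patch, that the consumer bumps the counter once every queue_size handled events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the consumer side of the event queue. */
struct model_queue {
	size_t   size;        /* number of slots (1 << max_n_shift in the driver) */
	size_t   pending;     /* entries currently waiting to be consumed */
	size_t   num_handled; /* events handled since the last batch increment */
	uint64_t batch;       /* incremented once every 'size' handled events */
	uint64_t consumed;    /* total events consumed, kept for checking only */
};

/* Consume one event, mirroring the accounting done in the evtq thread loop. */
static void model_consume(struct model_queue *q)
{
	if (!q->pending)
		return;
	q->pending--;
	q->consumed++;
	if (++q->num_handled == q->size) {
		q->batch++;
		q->num_handled = 0;
	}
}

/* Flush condition: queue drained, or two batch boundaries crossed. */
static bool model_flush_done(const struct model_queue *q, uint64_t start_batch)
{
	return q->pending == 0 || q->batch >= start_batch + 2;
}

/*
 * Start a flush with a full queue while the consumer is 'offset' events into
 * its current batch. The producer refills every freed slot, so the queue never
 * drains. Returns how many events were consumed before the flush completed.
 */
static uint64_t model_events_until_flush(size_t size, size_t offset)
{
	struct model_queue q = { 0 };
	uint64_t start_batch;

	assert(size > 0);
	q.size = size;
	q.pending = size;
	q.num_handled = offset % size;
	start_batch = q.batch;

	while (!model_flush_done(&q, start_batch)) {
		model_consume(&q);
		q.pending++;	/* producer immediately refills the slot */
	}
	return q.consumed;
}
```

In this model the flush always waits for (size - offset) + size events, which is never less than the queue size. Waiting for a single increment would let the worst case (offset = size - 1) return after one event.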
The SMMUv3 driver, which can be built without CONFIG_PCI, will soon gain
support for PRI. Partially revert commit c6e9aefbf9db ("PCI/ATS: Remove
unused PRI and PASID stubs") to re-introduce the PRI stubs, and avoid
adding more #ifdefs to the SMMU driver.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 include/linux/pci-ats.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
index f75c307f346d..e9e266df9b37 100644
--- a/include/linux/pci-ats.h
+++ b/include/linux/pci-ats.h
@@ -28,6 +28,14 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs);
 void pci_disable_pri(struct pci_dev *pdev);
 int pci_reset_pri(struct pci_dev *pdev);
 int pci_prg_resp_pasid_required(struct pci_dev *pdev);
+#else /* CONFIG_PCI_PRI */
+static inline int pci_enable_pri(struct pci_dev *pdev, u32 reqs)
+{ return -ENODEV; }
+static inline void pci_disable_pri(struct pci_dev *pdev) { }
+static inline int pci_reset_pri(struct pci_dev *pdev)
+{ return -ENODEV; }
+static inline int pci_prg_resp_pasid_required(struct pci_dev *pdev)
+{ return 0; }
 #endif /* CONFIG_PCI_PRI */

 #ifdef CONFIG_PCI_PASID
-- 
2.25.0
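For readers less familiar with the stub pattern this patch restores, here is a minimal user-space sketch of the idea. It is only an illustration: FAKE_CONFIG_PCI_PRI, struct fake_pci_dev, fake_enable_pri() and fake_driver_enable_pri() are hypothetical stand-ins, not kernel symbols. With the option left undefined, the inline stub returns -ENODEV and the caller compiles unchanged, without any #ifdef of its own.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev. */
struct fake_pci_dev {
	int pri_enabled;
};

/* FAKE_CONFIG_PCI_PRI plays the role of CONFIG_PCI_PRI; it is not defined here. */
#ifdef FAKE_CONFIG_PCI_PRI
int fake_enable_pri(struct fake_pci_dev *pdev, unsigned int reqs);
#else
/* Stub used when the feature is compiled out. */
static inline int fake_enable_pri(struct fake_pci_dev *pdev, unsigned int reqs)
{
	(void)pdev;
	(void)reqs;
	return -ENODEV;
}
#endif

/* Caller side: no #ifdef needed, the error path covers the stub case. */
static int fake_driver_enable_pri(struct fake_pci_dev *pdev)
{
	int ret;

	assert(pdev != NULL);
	ret = fake_enable_pri(pdev, 16);
	if (ret)
		return ret;

	pdev->pri_enabled = 1;
	return 0;
}
```

The trade-off is that a missing feature turns into a run-time error code rather than a build-time omission, so callers must already handle failure.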
The SMMUv3 driver uses pci_{enable,disable}_pri() and related functions.
Export those functions to allow the driver to be built as a module.

Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 drivers/pci/ats.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index bbfd0d42b8b9..fc8fc6fc8bd5 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -197,6 +197,7 @@ void pci_pri_init(struct pci_dev *pdev)
 	if (status & PCI_PRI_STATUS_PASID)
 		pdev->pasid_required = 1;
 }
+EXPORT_SYMBOL_GPL(pci_pri_init);

 /**
  * pci_enable_pri - Enable PRI capability
@@ -243,6 +244,7 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs)

 	return 0;
 }
+EXPORT_SYMBOL_GPL(pci_enable_pri);

 /**
  * pci_disable_pri - Disable PRI capability
@@ -322,6 +324,7 @@ int pci_reset_pri(struct pci_dev *pdev)

 	return 0;
 }
+EXPORT_SYMBOL_GPL(pci_reset_pri);

 /**
  * pci_prg_resp_pasid_required - Return PRG Response PASID Required bit
@@ -337,6 +340,7 @@ int pci_prg_resp_pasid_required(struct pci_dev *pdev)

 	return pdev->pasid_required;
 }
+EXPORT_SYMBOL_GPL(pci_prg_resp_pasid_required);
 #endif /* CONFIG_PCI_PRI */

 #ifdef CONFIG_PCI_PASID
-- 
2.25.0
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> For PCI devices that support it, enable the PRI capability and handle PRI Page Requests with the generic fault handler. It is enabled on demand by iommu_sva_device_init(). Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> --- drivers/iommu/arm-smmu-v3.c | 278 +++++++++++++++++++++++++++++------- 1 file changed, 228 insertions(+), 50 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index da5dda5ba26a..f9732e397b2d 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -248,6 +248,7 @@ #define STRTAB_STE_1_S1COR GENMASK_ULL(5, 4) #define STRTAB_STE_1_S1CSH GENMASK_ULL(7, 6) +#define STRTAB_STE_1_PPAR (1UL << 18) #define STRTAB_STE_1_S1STALLD (1UL << 27) #define STRTAB_STE_1_EATS GENMASK_ULL(29, 28) @@ -373,6 +374,9 @@ #define CMDQ_PRI_0_SID GENMASK_ULL(63, 32) #define CMDQ_PRI_1_GRPID GENMASK_ULL(8, 0) #define CMDQ_PRI_1_RESP GENMASK_ULL(13, 12) +#define CMDQ_PRI_1_RESP_FAILURE 0UL +#define CMDQ_PRI_1_RESP_INVALID 1UL +#define CMDQ_PRI_1_RESP_SUCCESS 2UL #define CMDQ_RESUME_0_SID GENMASK_ULL(63, 32) #define CMDQ_RESUME_0_RESP_TERM 0UL @@ -445,12 +449,6 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO); MODULE_PARM_DESC(disable_bypass, "Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU."); -enum pri_resp { - PRI_RESP_DENY = 0, - PRI_RESP_FAIL = 1, - PRI_RESP_SUCC = 2, -}; - enum arm_smmu_msi_index { EVTQ_MSI_INDEX, GERROR_MSI_INDEX, @@ -533,7 +531,7 @@ struct arm_smmu_cmdq_ent { u32 sid; u32 ssid; u16 grpid; - enum pri_resp resp; + u8 resp; } pri; #define CMDQ_OP_RESUME 0x44 @@ -611,6 +609,7 @@ struct arm_smmu_evtq { struct arm_smmu_priq { struct arm_smmu_queue q; + struct iopf_queue *iopf; }; /* High-level stream table and context descriptor structures */ @@ -743,6 +742,8 
@@ struct arm_smmu_master { unsigned int num_streams; bool ats_enabled; bool stall_enabled; + bool pri_supported; + bool prg_resp_needs_ssid; unsigned int ssid_bits; }; @@ -1015,14 +1016,6 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent) cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SSID, ent->pri.ssid); cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SID, ent->pri.sid); cmd[1] |= FIELD_PREP(CMDQ_PRI_1_GRPID, ent->pri.grpid); - switch (ent->pri.resp) { - case PRI_RESP_DENY: - case PRI_RESP_FAIL: - case PRI_RESP_SUCC: - break; - default: - return -EINVAL; - } cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp); break; case CMDQ_OP_RESUME: @@ -1602,6 +1595,7 @@ static int arm_smmu_page_response(struct device *dev, { struct arm_smmu_cmdq_ent cmd = {0}; struct arm_smmu_master *master = dev_iommu_fwspec_get(dev)->iommu_priv; + bool pasid_valid = resp->flags & IOMMU_PAGE_RESP_PASID_VALID; int sid = master->streams[0].id; if (master->stall_enabled) { @@ -1619,8 +1613,27 @@ static int arm_smmu_page_response(struct device *dev, default: return -EINVAL; } + } else if (master->pri_supported) { + cmd.opcode = CMDQ_OP_PRI_RESP; + cmd.substream_valid = pasid_valid && + master->prg_resp_needs_ssid; + cmd.pri.sid = sid; + cmd.pri.ssid = resp->pasid; + cmd.pri.grpid = resp->grpid; + switch (resp->code) { + case IOMMU_PAGE_RESP_FAILURE: + cmd.pri.resp = CMDQ_PRI_1_RESP_FAILURE; + break; + case IOMMU_PAGE_RESP_INVALID: + cmd.pri.resp = CMDQ_PRI_1_RESP_INVALID; + break; + case IOMMU_PAGE_RESP_SUCCESS: + cmd.pri.resp = CMDQ_PRI_1_RESP_SUCCESS; + break; + default: + return -EINVAL; + } } else { - /* TODO: insert PRI response here */ return -ENODEV; } @@ -2215,6 +2228,9 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid, FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) | FIELD_PREP(STRTAB_STE_1_STRW, strw)); + if (master->prg_resp_needs_ssid) + dst[1] |= STRTAB_STE_1_PPAR; + if (smmu->features & ARM_SMMU_FEAT_STALLS && !master->stall_enabled) dst[1] |= 
cpu_to_le64(STRTAB_STE_1_S1STALLD);

@@ -2460,61 +2476,110 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 static void arm_smmu_handle_ppr(struct arm_smmu_device *smmu, u64 *evt)
 {
-	u32 sid, ssid;
-	u16 grpid;
-	bool ssv, last;
-
-	sid = FIELD_GET(PRIQ_0_SID, evt[0]);
-	ssv = FIELD_GET(PRIQ_0_SSID_V, evt[0]);
-	ssid = ssv ? FIELD_GET(PRIQ_0_SSID, evt[0]) : 0;
-	last = FIELD_GET(PRIQ_0_PRG_LAST, evt[0]);
-	grpid = FIELD_GET(PRIQ_1_PRG_IDX, evt[1]);
-
-	dev_info(smmu->dev, "unexpected PRI request received:\n");
-	dev_info(smmu->dev,
-		 "\tsid 0x%08x.0x%05x: [%u%s] %sprivileged %s%s%s access at iova 0x%016llx\n",
-		 sid, ssid, grpid, last ? "L" : "",
-		 evt[0] & PRIQ_0_PERM_PRIV ? "" : "un",
-		 evt[0] & PRIQ_0_PERM_READ ? "R" : "",
-		 evt[0] & PRIQ_0_PERM_WRITE ? "W" : "",
-		 evt[0] & PRIQ_0_PERM_EXEC ? "X" : "",
-		 evt[1] & PRIQ_1_ADDR_MASK);
-
-	if (last) {
-		struct arm_smmu_cmdq_ent cmd = {
-			.opcode = CMDQ_OP_PRI_RESP,
-			.substream_valid = ssv,
-			.pri = {
-				.sid = sid,
-				.ssid = ssid,
-				.grpid = grpid,
-				.resp = PRI_RESP_DENY,
-			},
+	u32 sid = FIELD_GET(PRIQ_0_SID, evt[0]);
+
+	bool pasid_valid, last;
+	struct arm_smmu_master *master;
+	struct iommu_fault_event fault_evt = {
+		.fault.type = IOMMU_FAULT_PAGE_REQ,
+		.fault.prm = {
+			.pasid = FIELD_GET(PRIQ_0_SSID, evt[0]),
+			.grpid = FIELD_GET(PRIQ_1_PRG_IDX, evt[1]),
+			.addr = evt[1] & PRIQ_1_ADDR_MASK,
+		},
+	};
+	struct iommu_fault_page_request *pr = &fault_evt.fault.prm;
+
+	pasid_valid = evt[0] & PRIQ_0_SSID_V;
+	last = evt[0] & PRIQ_0_PRG_LAST;
+
+	/* Discard Stop PASID marker, it isn't used */
+	if (!(evt[0] & (PRIQ_0_PERM_READ | PRIQ_0_PERM_WRITE)) && last)
+		return;
+
+	if (last)
+		pr->flags |= IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;
+	if (pasid_valid)
+		pr->flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
+	if (evt[0] & PRIQ_0_PERM_READ)
+		pr->perm |= IOMMU_FAULT_PERM_READ;
+	if (evt[0] & PRIQ_0_PERM_WRITE)
+		pr->perm |= IOMMU_FAULT_PERM_WRITE;
+	if (evt[0] & PRIQ_0_PERM_EXEC)
+		pr->perm |=
IOMMU_FAULT_PERM_EXEC; + if (evt[0] & PRIQ_0_PERM_PRIV) + pr->perm |= IOMMU_FAULT_PERM_PRIV; + + master = arm_smmu_find_master(smmu, sid); + if (WARN_ON(!master)) + return; + + if (iommu_report_device_fault(master->dev, &fault_evt)) { + /* + * No handler registered, so subsequent faults won't produce + * better results. Try to disable PRI. + */ + struct iommu_page_response resp = { + .flags = pasid_valid ? + IOMMU_PAGE_RESP_PASID_VALID : 0, + .pasid = pr->pasid, + .grpid = pr->grpid, + .code = IOMMU_PAGE_RESP_FAILURE, }; - arm_smmu_cmdq_issue_cmd(smmu, &cmd); + dev_warn(master->dev, + "PPR 0x%x:0x%llx 0x%x: nobody cared, disabling PRI\n", + pasid_valid ? pr->pasid : 0, pr->addr, pr->perm); + if (last) + arm_smmu_page_response(master->dev, NULL, &resp); } } static irqreturn_t arm_smmu_priq_thread(int irq, void *dev) { + int num_handled = 0; + bool overflow = false; struct arm_smmu_device *smmu = dev; struct arm_smmu_queue *q = &smmu->priq.q; struct arm_smmu_ll_queue *llq = &q->llq; + size_t queue_size = 1 << llq->max_n_shift; u64 evt[PRIQ_ENT_DWORDS]; + spin_lock(&q->wq.lock); do { - while (!queue_remove_raw(q, evt)) + while (!queue_remove_raw(q, evt)) { + spin_unlock(&q->wq.lock); arm_smmu_handle_ppr(smmu, evt); + spin_lock(&q->wq.lock); + if (++num_handled == queue_size) { + q->batch++; + wake_up_all_locked(&q->wq); + num_handled = 0; + } + } - if (queue_sync_prod_in(q) == -EOVERFLOW) + if (queue_sync_prod_in(q) == -EOVERFLOW) { dev_err(smmu->dev, "PRIQ overflow detected -- requests lost\n"); + overflow = true; + } } while (!queue_empty(llq)); /* Sync our overflow flag, as we believe we're up to speed */ llq->cons = Q_OVF(llq->prod) | Q_WRP(llq, llq->cons) | Q_IDX(llq, llq->cons); queue_sync_cons_out(q); + + wake_up_all_locked(&q->wq); + spin_unlock(&q->wq.lock); + + /* + * On overflow, the SMMU might have discarded the last PPR in a group. + * There is no way to know more about it, so we have to discard all + * partial faults already queued. 
+ */ + if (overflow) + iopf_queue_discard_partial(smmu->priq.iopf); + return IRQ_HANDLED; } @@ -2545,6 +2610,30 @@ static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid) return ret; } +static int arm_smmu_flush_priq(void *cookie, struct device *dev, int pasid) +{ + int ret; + u64 batch; + bool overflow = false; + struct arm_smmu_device *smmu = cookie; + struct arm_smmu_queue *q = &smmu->priq.q; + + spin_lock(&q->wq.lock); + if (queue_sync_prod_in(q) == -EOVERFLOW) { + dev_err(smmu->dev, "priq overflow detected -- requests lost\n"); + overflow = true; + } + + batch = q->batch; + ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) || + q->batch >= batch + 2); + spin_unlock(&q->wq.lock); + + if (overflow) + iopf_queue_discard_partial(smmu->priq.iopf); + return ret; +} + static int arm_smmu_device_disable(struct arm_smmu_device *smmu); static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev) @@ -3208,6 +3297,75 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master) pci_disable_pasid(pdev); } +static int arm_smmu_init_pri(struct arm_smmu_master *master) +{ + int pos; + struct pci_dev *pdev; + + if (!dev_is_pci(master->dev)) + return -EINVAL; + + if (!(master->smmu->features & ARM_SMMU_FEAT_PRI)) + return 0; + + pdev = to_pci_dev(master->dev); + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI); + if (!pos) + return 0; + + /* If the device supports PASID and PRI, set STE.PPAR */ + if (master->ssid_bits) + master->prg_resp_needs_ssid = pci_prg_resp_pasid_required(pdev); + + master->pri_supported = true; + return 0; +} + +static int arm_smmu_enable_pri(struct arm_smmu_master *master) +{ + int ret; + struct pci_dev *pdev; + /* + * TODO: find a good inflight PPR number. We should divide the PRI queue + * by the number of PRI-capable devices, but it's impossible to know + * about future (probed late or hotplugged) devices. So we're at risk of + * dropping PPRs (and leaking pending requests in the FQ). 
+ */ + size_t max_inflight_pprs = 16; + + if (!master->pri_supported || !master->ats_enabled) + return -ENOSYS; + + pdev = to_pci_dev(master->dev); + + ret = pci_reset_pri(pdev); + if (ret) + return ret; + + ret = pci_enable_pri(pdev, max_inflight_pprs); + if (ret) { + dev_err(master->dev, "cannot enable PRI: %d\n", ret); + return ret; + } + + return 0; +} + +static void arm_smmu_disable_pri(struct arm_smmu_master *master) +{ + struct pci_dev *pdev; + + if (!dev_is_pci(master->dev)) + return; + + pdev = to_pci_dev(master->dev); + + if (!pdev->pri_enabled) + return; + + pci_disable_pri(pdev); +} + static void arm_smmu_detach_dev(struct arm_smmu_master *master) { unsigned long flags; @@ -3603,6 +3761,8 @@ static int arm_smmu_add_device(struct device *dev) smmu->features & ARM_SMMU_FEAT_STALL_FORCE) master->stall_enabled = true; + arm_smmu_init_pri(master); + ret = iommu_device_link(&smmu->iommu, dev); if (ret) goto err_disable_pasid; @@ -3639,6 +3799,7 @@ static void arm_smmu_remove_device(struct device *dev) master = fwspec->iommu_priv; smmu = master->smmu; iopf_queue_remove_device(smmu->evtq.iopf, dev); + iopf_queue_remove_device(smmu->priq.iopf, dev); iommu_sva_disable(dev); arm_smmu_detach_dev(master); iommu_group_remove_device(dev); @@ -3762,7 +3923,7 @@ static void arm_smmu_get_resv_regions(struct device *dev, static bool arm_smmu_iopf_supported(struct arm_smmu_master *master) { - return master->stall_enabled; + return master->stall_enabled || master->pri_supported; } static bool arm_smmu_dev_has_feature(struct device *dev, @@ -3820,6 +3981,15 @@ static int arm_smmu_dev_enable_sva(struct device *dev) ret = iopf_queue_add_device(master->smmu->evtq.iopf, dev); if (ret) goto err_disable_sva; + } else if (master->pri_supported) { + ret = iopf_queue_add_device(master->smmu->priq.iopf, dev); + if (ret) + goto err_disable_sva; + + if (arm_smmu_enable_pri(master)) { + iopf_queue_remove_device(master->smmu->priq.iopf, dev); + goto err_disable_sva; + } } return 0; @@ 
-3855,7 +4025,9 @@ static int arm_smmu_dev_disable_feature(struct device *dev, switch (feat) { case IOMMU_DEV_FEAT_SVA: + arm_smmu_disable_pri(master); iopf_queue_remove_device(master->smmu->evtq.iopf, dev); + iopf_queue_remove_device(master->smmu->priq.iopf, dev); return iommu_sva_disable(dev); default: return -EINVAL; @@ -3999,6 +4171,11 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu) if (!(smmu->features & ARM_SMMU_FEAT_PRI)) return 0; + smmu->priq.iopf = iopf_queue_alloc(dev_name(smmu->dev), + arm_smmu_flush_priq, smmu); + if (!smmu->priq.iopf) + return -ENOMEM; + return arm_smmu_init_one_queue(smmu, &smmu->priq.q, ARM_SMMU_PRIQ_PROD, ARM_SMMU_PRIQ_CONS, PRIQ_ENT_DWORDS, "priq"); @@ -4971,6 +5148,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev) iommu_device_sysfs_remove(&smmu->iommu); arm_smmu_device_disable(smmu); iopf_queue_free(smmu->evtq.iopf); + iopf_queue_free(smmu->priq.iopf); return 0; } -- 2.25.0
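Two small pieces of logic in the PRI patch above are easy to misread in diff form: the Stop PASID marker test in arm_smmu_handle_ppr() (a request with neither read nor write permission and the "last" bit set), and the translation from generic response codes to CMD_PRI_RESP values in arm_smmu_page_response(). The sketch below models both in user space. It is an illustration under assumptions: the MODEL_PPR_* bits are placeholders and do not reflect the SMMUv3 PRIQ layout, enum model_resp stands in for the IOMMU_PAGE_RESP_* codes, and only the three command values (0, 1, 2) are taken from the CMDQ_PRI_1_RESP_* definitions added by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits for this model only; NOT the hardware PRIQ entry layout. */
#define MODEL_PPR_READ  (1u << 0)
#define MODEL_PPR_WRITE (1u << 1)
#define MODEL_PPR_LAST  (1u << 2)

/* Hypothetical stand-ins for the generic IOMMU_PAGE_RESP_* codes. */
enum model_resp {
	MODEL_RESP_SUCCESS,
	MODEL_RESP_INVALID,
	MODEL_RESP_FAILURE,
};

/* Values as defined by the CMDQ_PRI_1_RESP_* macros in the patch. */
#define MODEL_CMD_RESP_FAILURE 0
#define MODEL_CMD_RESP_INVALID 1
#define MODEL_CMD_RESP_SUCCESS 2

/* A Stop marker carries no R/W permission but has the "last" bit set. */
static bool model_is_stop_marker(uint32_t ppr)
{
	return !(ppr & (MODEL_PPR_READ | MODEL_PPR_WRITE)) &&
	       (ppr & MODEL_PPR_LAST);
}

/* Map a generic response code to the command field, -EINVAL if unknown. */
static int model_resp_to_cmd(enum model_resp code)
{
	switch (code) {
	case MODEL_RESP_FAILURE:
		return MODEL_CMD_RESP_FAILURE;
	case MODEL_RESP_INVALID:
		return MODEL_CMD_RESP_INVALID;
	case MODEL_RESP_SUCCESS:
		return MODEL_CMD_RESP_SUCCESS;
	default:
		return -EINVAL;
	}
}
```

Note that the generic codes and the command values need not share numbering, which is why the patch uses an explicit switch rather than passing resp->code through.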
On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> add a get/put scheme for the registration") provides a convenient way
> for users to attach notifier data to an mm. However, it would be even
> better to create this notifier data atomically.
>
> Since the alloc_notifier() callback only takes an mm argument at the
> moment, some users have to perform the allocation in two steps.
> alloc_notifier() initially creates an incomplete structure, which is
> then finalized using more context once mmu_notifier_get() returns. This
> second step requires carrying an initialization lock in the notifier
> data and playing dirty tricks to order memory accesses against live
> invalidation.
This was the intended pattern. There shouldn't be a real issue, as
there shouldn't be any data on which to invalidate, i.e. the later patch
does:
+ list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
And that list is empty post-allocation, so no 'dirty tricks' required.
The other op callback is release, which also cannot be called as the
caller must hold a mmget to establish the notifier.
So just use the locking that already exists. There is one function
that calls io_mm_get() which immediately calls io_mm_attach(), which
immediately grabs the global iommu_sva_lock.
Thus init the pasid for the first time under that lock and everything
is fine.
There is nothing inherently wrong with the approach in this patch, but
it seems unneeded in this case.
Jason
Hi, On 2020/2/25 2:23, Jean-Philippe Brucker wrote: > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > Some systems allow devices to handle I/O Page Faults in the core mm. For > example systems implementing the PCI PRI extension or Arm SMMU stall > model. Infrastructure for reporting these recoverable page faults was > recently added to the IOMMU core. Add a page fault handler for host SVA. > > IOMMU driver can now instantiate several fault workqueues and link them to > IOPF-capable devices. Drivers can choose between a single global > workqueue, one per IOMMU device, one per low-level fault queue, one per > domain, etc. > > When it receives a fault event, supposedly in an IRQ handler, the IOMMU > driver reports the fault using iommu_report_device_fault(), which calls > the registered handler. The page fault handler then calls the mm fault > handler, and reports either success or failure with iommu_page_response(). > When the handler succeeded, the IOMMU retries the access. > > The iopf_param pointer could be embedded into iommu_fault_param. But > putting iopf_param into the iommu_param structure allows us not to care > about ordering between calls to iopf_queue_add_device() and > iommu_register_device_fault_handler(). 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > --- > drivers/iommu/Kconfig | 4 + > drivers/iommu/Makefile | 1 + > drivers/iommu/io-pgfault.c | 451 +++++++++++++++++++++++++++++++++++++ > include/linux/iommu.h | 59 +++++ > 4 files changed, 515 insertions(+) > create mode 100644 drivers/iommu/io-pgfault.c > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index acca20e2da2f..e4a42e1708b4 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -109,6 +109,10 @@ config IOMMU_SVA > select IOMMU_API > select MMU_NOTIFIER > > +config IOMMU_PAGE_FAULT > + bool > + select IOMMU_API > + > config FSL_PAMU > bool "Freescale IOMMU support" > depends on PCI > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile > index 40c800dd4e3e..bf5cb4ee8409 100644 > --- a/drivers/iommu/Makefile > +++ b/drivers/iommu/Makefile > @@ -4,6 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o > obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o > obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o > obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o > +obj-$(CONFIG_IOMMU_PAGE_FAULT) += io-pgfault.o > obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o > obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o > obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o > diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c > new file mode 100644 > index 000000000000..76e153c59fe3 > --- /dev/null > +++ b/drivers/iommu/io-pgfault.c > @@ -0,0 +1,451 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Handle device page faults > + * > + * Copyright (C) 2018 ARM Ltd. 
> + */ > + > +#include <linux/iommu.h> > +#include <linux/list.h> > +#include <linux/slab.h> > +#include <linux/workqueue.h> > + > +/** > + * struct iopf_queue - IO Page Fault queue > + * @wq: the fault workqueue > + * @flush: low-level flush callback > + * @flush_arg: flush() argument > + * @devices: devices attached to this queue > + * @lock: protects the device list > + */ > +struct iopf_queue { > + struct workqueue_struct *wq; > + iopf_queue_flush_t flush; > + void *flush_arg; > + struct list_head devices; > + struct mutex lock; > +}; > + > +/** > + * struct iopf_device_param - IO Page Fault data attached to a device > + * @dev: the device that owns this param > + * @queue: IOPF queue > + * @queue_list: index into queue->devices > + * @partial: faults that are part of a Page Request Group for which the last > + * request hasn't been submitted yet. > + * @busy: the param is being used > + * @wq_head: signal a change to @busy > + */ > +struct iopf_device_param { > + struct device *dev; > + struct iopf_queue *queue; > + struct list_head queue_list; > + struct list_head partial; > + bool busy; > + wait_queue_head_t wq_head; > +}; > + > +struct iopf_fault { > + struct iommu_fault fault; > + struct list_head head; > +}; > + > +struct iopf_group { > + struct iopf_fault last_fault; > + struct list_head faults; > + struct work_struct work; > + struct device *dev; > +}; > + [...] > + > +/** > + * iopf_queue_alloc - Allocate and initialize a fault queue > + * @name: a unique string identifying the queue (for workqueue) > + * @flush: a callback that flushes the low-level queue > + * @cookie: driver-private data passed to the flush callback > + * > + * The callback is called before the workqueue is flushed. The IOMMU driver must > + * commit all faults that are pending in its low-level queues at the time of the > + * call, into the IOPF queue (with iommu_report_device_fault). 
The callback > + * takes a device pointer as argument, hinting what endpoint is causing the > + * flush. When the device is NULL, all faults should be committed. > + * > + * Return: the queue on success and NULL on error. > + */ > +struct iopf_queue * > +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie) > +{ > + struct iopf_queue *queue; > + > + queue = kzalloc(sizeof(*queue), GFP_KERNEL); > + if (!queue) > + return NULL; > + > + /* > + * The WQ is unordered because the low-level handler enqueues faults by > + * group. PRI requests within a group have to be ordered, but once > + * that's dealt with, the high-level function can handle groups out of > + * order. > + */ > + queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name); Should this workqueue use 'WQ_HIGHPRI | WQ_UNBOUND' or some flags like this to decrease the unexpected latency of I/O PageFault here? Or maybe, workqueue will show an uncontrolled latency, even in a busy system. Cheers, Zaibo . > + if (!queue->wq) { > + kfree(queue); > + return NULL; > + } > + > + queue->flush = flush; > + queue->flush_arg = cookie; > + INIT_LIST_HEAD(&queue->devices); > + mutex_init(&queue->lock); > + > + return queue; > +} > +EXPORT_SYMBOL_GPL(iopf_queue_alloc); > + > +/** > + * iopf_queue_free - Free IOPF queue > + * @queue: queue to free > + * > + * Counterpart to iopf_queue_alloc(). The driver must not be queuing faults or > + * adding/removing devices on this queue anymore. > + */ > +void iopf_queue_free(struct iopf_queue *queue) > +{ > + struct iopf_device_param *iopf_param, *next; > + > + if (!queue) > + return; > + > + list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list) > + iopf_queue_remove_device(queue, iopf_param->dev); > + > + destroy_workqueue(queue->wq); > + kfree(queue); > +} > +EXPORT_SYMBOL_GPL(iopf_queue_free); [...]
On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> > add a get/put scheme for the registration") provides a convenient way
> > for users to attach notifier data to an mm. However, it would be even
> > better to create this notifier data atomically.
> >
> > Since the alloc_notifier() callback only takes an mm argument at the
> > moment, some users have to perform the allocation in two steps.
> > alloc_notifier() initially creates an incomplete structure, which is
> > then finalized using more context once mmu_notifier_get() returns. This
> > second step requires carrying an initialization lock in the notifier
> > data and playing dirty tricks to order memory accesses against live
> > invalidation.
>
> This was the intended pattern. There shouldn't be a real issue, as
> there shouldn't be any data on which to invalidate, i.e. the later patch
> does:
>
> + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
>
> And that list is empty post-allocation, so no 'dirty tricks' required.

Before introducing this patch I had the following code:

+	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
+		/*
+		 * To ensure that we observe the initialization of io_mm fields
+		 * by io_mm_finalize() before the registration of this bond to
+		 * the list by io_mm_attach(), introduce an address dependency
+		 * between bond and io_mm. It pairs with the smp_store_release()
+		 * from list_add_rcu().
+		 */
+		io_mm = rcu_dereference(bond->io_mm);
+		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
+				       start, end - start);
+	}

(1) io_mm_get() would obtain an empty io_mm from mmu_notifier_get().
(2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc.
(3) finally io_mm_attach() would add the bond to io_mm->devices.

Since the above code can run before (2), it needs to observe valid
io_mm->ctx and io_mm->ops initialized by (2) after obtaining the bond
initialized by (3). I believe this requires the address dependency from
the rcu_dereference() above, or some stronger barrier to pair with the
list_add_rcu().

If io_mm->ctx and io_mm->ops are already valid before the mmu notifier is
published, then we don't need that stuff.

That's the main reason I would have liked moving everything to
alloc_notifier(); the locking below isn't a big deal.

> The other op callback is release, which also cannot be called as the
> caller must hold a mmget to establish the notifier.
>
> So just use the locking that already exists. There is one function
> that calls io_mm_get() which immediately calls io_mm_attach, which
> immediately grabs the global iommu_sva_lock.
>
> Thus init the pasid for the first time under that lock and everything
> is fine.

I agree with this; I can't remember why I used a separate lock for
initialization rather than reusing iommu_sva_lock.

Thanks,
Jean

>
> There is nothing inherently wrong with the approach in this patch, but
> it seems unneeded in this case.
>
> Jason
Hi Zaibo,
On Tue, Feb 25, 2020 at 11:30:05AM +0800, Xu Zaibo wrote:
> > +struct iopf_queue *
> > +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie)
> > +{
> > + struct iopf_queue *queue;
> > +
> > + queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> > + if (!queue)
> > + return NULL;
> > +
> > + /*
> > + * The WQ is unordered because the low-level handler enqueues faults by
> > + * group. PRI requests within a group have to be ordered, but once
> > + * that's dealt with, the high-level function can handle groups out of
> > + * order.
> > + */
> > + queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
> Should this workqueue use 'WQ_HIGHPRI | WQ_UNBOUND' or some flags like this
> to decrease the unexpected
> latency of I/O PageFault here? Or maybe, workqueue will show an uncontrolled
> latency, even in a busy system.
I'll investigate the effect of these flags. So far I've only run on
completely idle systems but it would be interesting to add some
workqueue-heavy load in my tests.
Thanks,
Jean
On Tue, Feb 25, 2020 at 10:24:39AM +0100, Jean-Philippe Brucker wrote: > On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote: > > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote: > > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers: > > > add a get/put scheme for the registration") provides a convenient way > > > for users to attach notifier data to an mm. However, it would be even > > > better to create this notifier data atomically. > > > > > > Since the alloc_notifier() callback only takes an mm argument at the > > > moment, some users have to perform the allocation in two times. > > > alloc_notifier() initially creates an incomplete structure, which is > > > then finalized using more context once mmu_notifier_get() returns. This > > > second step requires carrying an initialization lock in the notifier > > > data and playing dirty tricks to order memory accesses against live > > > invalidation. > > > > This was the intended pattern. Tthere shouldn't be an real issue as > > there shouldn't be any data on which to invalidate, ie the later patch > > does: > > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) > > > > And that list is empty post-allocation, so no 'dirty tricks' required. > > Before introducing this patch I had the following code: > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) { > + /* > + * To ensure that we observe the initialization of io_mm fields > + * by io_mm_finalize() before the registration of this bond to > + * the list by io_mm_attach(), introduce an address dependency > + * between bond and io_mm. It pairs with the smp_store_release() > + * from list_add_rcu(). > + */ > + io_mm = rcu_dereference(bond->io_mm); A rcu_dereference isn't need here, just a normal derference is fine. > + io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx, > + start, end - start); > + } > > (1) io_mm_get() would obtain an empty io_mm from iommu_notifier_get(). 
> (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc. > (3) finally io_mm_attach() would add the bond to io_mm->devices. > > Since the above code can run before (2) it needs to observe valid > io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond > initialized by (3). Which I believe requires the address dependency from > the rcu_dereference() above or some stronger barrier to pair with the > list_add_rcu(). The list_for_each_entry_rcu() is an acquire that already pairs with the release in list_add_rcu(); all you need is a data dependency chain starting on bond to be correct on ordering. But this is super tricky :\ > If io_mm->ctx and io_mm->ops are already valid before the > mmu notifier is published, then we don't need that stuff. So, this trickiness with RCU is not a bad reason to introduce the priv scheme, maybe explain it in the commit message? Jason
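For anyone following the ordering argument, here is a minimal userspace sketch of the "initialize everything, then publish" pattern that the priv scheme makes possible. It is not kernel code: C11 atomics stand in for the release in list_add_rcu() and the acquire on the reader side, and the names io_mm_sketch, ctx_ops, publish_io_mm() and invalidate_if_published() are hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for io_mm and io_mm_ops; not the kernel types. */
struct ctx_ops {
	int (*invalidate)(int pasid);
};

struct io_mm_sketch {
	const struct ctx_ops *ops;
	int pasid;
};

/* Pointer that readers look at; NULL until an object is published. */
static _Atomic(struct io_mm_sketch *) published;

static int sketch_invalidate(int pasid)
{
	return pasid + 1;
}

static const struct ctx_ops sketch_ops = { .invalidate = sketch_invalidate };

/*
 * Writer: fill in every field first, then publish with release
 * semantics, so no reader can observe a half-initialized object.
 */
static int publish_io_mm(int pasid)
{
	struct io_mm_sketch *m = malloc(sizeof(*m));

	if (!m)
		return -1;
	m->ops = &sketch_ops;
	m->pasid = pasid;
	atomic_store_explicit(&published, m, memory_order_release);
	return 0;
}

/*
 * Reader: the acquire load pairs with the release store above, so
 * once the pointer is seen, ops and pasid are valid as well.
 */
static int invalidate_if_published(void)
{
	struct io_mm_sketch *m =
		atomic_load_explicit(&published, memory_order_acquire);

	if (!m)
		return -1;
	return m->ops->invalidate(m->pasid);
}
```

With all fields set before the store, the reader needs no initialization lock and no extra dependency tricks, which is the simplification discussed above.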
Hi, On 2020/2/25 17:25, Jean-Philippe Brucker wrote: > Hi Zaibo, > > On Tue, Feb 25, 2020 at 11:30:05AM +0800, Xu Zaibo wrote: >>> +struct iopf_queue * >>> +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie) >>> +{ >>> + struct iopf_queue *queue; >>> + >>> + queue = kzalloc(sizeof(*queue), GFP_KERNEL); >>> + if (!queue) >>> + return NULL; >>> + >>> + /* >>> + * The WQ is unordered because the low-level handler enqueues faults by >>> + * group. PRI requests within a group have to be ordered, but once >>> + * that's dealt with, the high-level function can handle groups out of >>> + * order. >>> + */ >>> + queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name); >> Should this workqueue use 'WQ_HIGHPRI | WQ_UNBOUND' or some flags like this >> to decrease the unexpected >> latency of I/O PageFault here? Or maybe, workqueue will show an uncontrolled >> latency, even in a busy system. > I'll investigate the effect of these flags. So far I've only run on > completely idle systems but it would be interesting to add some > workqueue-heavy load in my tests. > I'm not sure, just my concern. Hopefully, Tejun Heo can give us some hints. :) +cc Tejun Heo <tj@kernel.org> Cheers, Zaibo . > . >
Hi, On 2020/2/25 2:23, Jean-Philippe Brucker wrote: > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > The SMMU provides a Stall model for handling page faults in platform > devices. It is similar to PCI PRI, but doesn't require devices to have > their own translation cache. Instead, faulting transactions are parked and > the OS is given a chance to fix the page tables and retry the transaction. > > Enable stall for devices that support it (opt-in by firmware). When an > event corresponds to a translation error, call the IOMMU fault handler. If > the fault is recoverable, it will call us back to terminate or continue > the stall. > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > --- > drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++-- > drivers/iommu/of_iommu.c | 5 +- > include/linux/iommu.h | 2 + > 3 files changed, 269 insertions(+), 9 deletions(-) > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c > index 6a5987cce03f..da5dda5ba26a 100644 > --- a/drivers/iommu/arm-smmu-v3.c > +++ b/drivers/iommu/arm-smmu-v3.c > @@ -374,6 +374,13 @@ > #define CMDQ_PRI_1_GRPID GENMASK_ULL(8, 0) > #define CMDQ_PRI_1_RESP GENMASK_ULL(13, 12) > [...] > +static int arm_smmu_page_response(struct device *dev, > + struct iommu_fault_event *unused, > + struct iommu_page_response *resp) > +{ > + struct arm_smmu_cmdq_ent cmd = {0}; > + struct arm_smmu_master *master = dev_iommu_fwspec_get(dev)->iommu_priv; Could this use 'dev_to_master' here? Cheers, Zaibo . 
> + int sid = master->streams[0].id; > + > + if (master->stall_enabled) { > + cmd.opcode = CMDQ_OP_RESUME; > + cmd.resume.sid = sid; > + cmd.resume.stag = resp->grpid; > + switch (resp->code) { > + case IOMMU_PAGE_RESP_INVALID: > + case IOMMU_PAGE_RESP_FAILURE: > + cmd.resume.resp = CMDQ_RESUME_0_RESP_ABORT; > + break; > + case IOMMU_PAGE_RESP_SUCCESS: > + cmd.resume.resp = CMDQ_RESUME_0_RESP_RETRY; > + break; > + default: > + return -EINVAL; > + } > + } else { > + /* TODO: insert PRI response here */ > + return -ENODEV; > + } > + > + arm_smmu_cmdq_issue_cmd(master->smmu, &cmd); > + /* > + * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP. > + * RESUME consumption guarantees that the stalled transaction will be > + * terminated... at some point in the future. PRI_RESP is fire and > + * forget. > + */ > + > + return 0; > +} > + [...]
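As a reading aid for the switch statement quoted above, here is a small userspace sketch of the same mapping from a page response code to a stall resume action. The enum names and resp_to_resume() are hypothetical stand-ins chosen for this example; they only mirror the shape of IOMMU_PAGE_RESP_* and CMDQ_RESUME_0_RESP_* in the patch and do not carry their real values.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mirrors of the response codes named in the patch. */
enum page_resp_code {
	PAGE_RESP_SUCCESS,
	PAGE_RESP_INVALID,
	PAGE_RESP_FAILURE,
};

/* Hypothetical mirrors of the two resume actions used for stall. */
enum resume_action {
	RESUME_ABORT,
	RESUME_RETRY,
};

/*
 * Translate a page response into a resume action: success retries
 * the stalled transaction, invalid or failure aborts it, and any
 * other code is rejected with -EINVAL.
 */
static int resp_to_resume(int code, enum resume_action *out)
{
	switch (code) {
	case PAGE_RESP_INVALID:
	case PAGE_RESP_FAILURE:
		*out = RESUME_ABORT;
		return 0;
	case PAGE_RESP_SUCCESS:
		*out = RESUME_RETRY;
		return 0;
	default:
		return -EINVAL;
	}
}
```

Keeping the translation in a helper like this is only one possible layout; the patch does it inline, which is equally fine for two command types.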
On Mon, 24 Feb 2020 19:23:37 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > Add a small library to help IOMMU drivers manage process address spaces > bound to their devices. Register an MMU notifier to track modification > on each address space bound to one or more devices. > > IOMMU drivers must implement the io_mm_ops and can then use the helpers > provided by this library to easily implement the SVA API introduced by > commit 26b25a2b98e4. The io_mm_ops are: > > void *alloc(struct mm_struct *) > Allocate a PASID context private to the IOMMU driver. There is a > single context per mm. IOMMU drivers may perform arch-specific > operations in there, for example pinning down a CPU ASID (on Arm). > > int attach(struct device *, int pasid, void *ctx, bool attach_domain) > Attach a context to the device, by setting up the PASID table entry. > > int invalidate(struct device *, int pasid, void *ctx, > unsigned long vaddr, size_t size) > Invalidate TLB entries for this address range. > > int detach(struct device *, int pasid, void *ctx, bool detach_domain) > Detach a context from the device, by clearing the PASID table entry > and invalidating cached entries. > > void free(void *ctx) > Free a context. > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Hi Jean-Philippe, A few trivial comments from me inline. Otherwise this all seems sensible. 
Jonathan > --- > drivers/iommu/Kconfig | 7 + > drivers/iommu/Makefile | 1 + > drivers/iommu/iommu-sva.c | 561 ++++++++++++++++++++++++++++++++++++++ > drivers/iommu/iommu-sva.h | 64 +++++ > drivers/iommu/iommu.c | 1 + > include/linux/iommu.h | 3 + > 6 files changed, 637 insertions(+) > create mode 100644 drivers/iommu/iommu-sva.c > create mode 100644 drivers/iommu/iommu-sva.h > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index d2fade984999..acca20e2da2f 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -102,6 +102,13 @@ config IOMMU_DMA > select IRQ_MSI_IOMMU > select NEED_SG_DMA_LENGTH > > +# Shared Virtual Addressing library > +config IOMMU_SVA > + bool > + select IOASID > + select IOMMU_API > + select MMU_NOTIFIER > + > config FSL_PAMU > bool "Freescale IOMMU support" > depends on PCI > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile > index 9f33fdb3bb05..40c800dd4e3e 100644 > --- a/drivers/iommu/Makefile > +++ b/drivers/iommu/Makefile > @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o > obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o > obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o > obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o > +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o > diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c > new file mode 100644 > index 000000000000..64f1d1c82383 > --- /dev/null > +++ b/drivers/iommu/iommu-sva.c > @@ -0,0 +1,561 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Manage PASIDs and bind process address spaces to devices. > + * > + * Copyright (C) 2018 ARM Ltd. Worth updating the date? > + */ > + > +#include <linux/idr.h> > +#include <linux/ioasid.h> > +#include <linux/iommu.h> > +#include <linux/sched/mm.h> > +#include <linux/slab.h> > +#include <linux/spinlock.h> > + > +#include "iommu-sva.h" > + > +/** > + * DOC: io_mm model > + * > + * The io_mm keeps track of process address spaces shared between CPU and IOMMU. 
> + * The following example illustrates the relation between structures > + * iommu_domain, io_mm and iommu_sva. The iommu_sva struct is a bond between > + * io_mm and device. A device can have multiple io_mm and an io_mm may be bound > + * to multiple devices. > + * ___________________________ > + * | IOMMU domain A | > + * | ________________ | > + * | | IOMMU group | +------- io_pgtables > + * | | | | > + * | | dev 00:00.0 ----+------- bond 1 --- io_mm X > + * | |________________| \ | > + * | '----- bond 2 ---. > + * |___________________________| \ > + * ___________________________ \ > + * | IOMMU domain B | io_mm Y > + * | ________________ | / / > + * | | IOMMU group | | / / > + * | | | | / / > + * | | dev 00:01.0 ------------ bond 3 -' / > + * | | dev 00:01.1 ------------ bond 4 --' > + * | |________________| | > + * | +------- io_pgtables > + * |___________________________| > + * > + * In this example, device 00:00.0 is in domain A, devices 00:01.* are in domain > + * B. All devices within the same domain access the same address spaces. Device > + * 00:00.0 accesses address spaces X and Y, each corresponding to an mm_struct. > + * Devices 00:01.* only access address space Y. In addition each > + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, that is > + * managed with iommu_map()/iommu_unmap(), and isn't shared with the CPU MMU. > + * > + * To obtain the above configuration, users would for instance issue the > + * following calls: > + * > + * iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> bond 1 > + * iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> bond 2 > + * iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> bond 3 > + * iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> bond 4 > + * > + * A single Process Address Space ID (PASID) is allocated for each mm. In the > + * example, devices use PASID 1 to read/write into address space X and PASID 2 > + * to read/write into address space Y. 
Calling iommu_sva_get_pasid() on bond 1 > + * returns 1, and calling it on bonds 2-4 returns 2. > + * > + * Hardware tables describing this configuration in the IOMMU would typically > + * look like this: > + * > + * PASID tables > + * of domain A > + * .->+--------+ > + * / 0 | |-------> io_pgtable > + * / +--------+ > + * Device tables / 1 | |-------> pgd X > + * +--------+ / +--------+ > + * 00:00.0 | A |-' 2 | |--. > + * +--------+ +--------+ \ > + * : : 3 | | \ > + * +--------+ +--------+ --> pgd Y > + * 00:01.0 | B |--. / > + * +--------+ \ | > + * 00:01.1 | B |----+ PASID tables | > + * +--------+ \ of domain B | > + * '->+--------+ | > + * 0 | |-- | --> io_pgtable > + * +--------+ | > + * 1 | | | > + * +--------+ | > + * 2 | |---' > + * +--------+ > + * 3 | | > + * +--------+ > + * > + * With this model, a single call binds all devices in a given domain to an > + * address space. Other devices in the domain will get the same bond implicitly. > + * However, users must issue one bind() for each device, because IOMMUs may > + * implement SVA differently. Furthermore, mandating one bind() per device > + * allows the driver to perform sanity-checks on device capabilities. > + * > + * In some IOMMUs, one entry of the PASID table (typically the first one) can > + * hold non-PASID translations. In this case PASID 0 is reserved and the first > + * entry points to the io_pgtable pointer. In other IOMMUs the io_pgtable > + * pointer is held in the device table and PASID 0 is available to the > + * allocator. Is it worth hammering home in here that we can only do this because the PASID space is global (with exception of PASID 0)? It's a convenient simplification but not necessarily a hardware restriction so perhaps we should remind people somewhere in here? 
> + */ > + > +struct io_mm { > + struct list_head devices; > + struct mm_struct *mm; > + struct mmu_notifier notifier; > + > + /* Late initialization */ > + const struct io_mm_ops *ops; > + void *ctx; > + int pasid; > +}; > + > +#define to_io_mm(mmu_notifier) container_of(mmu_notifier, struct io_mm, notifier) > +#define to_iommu_bond(handle) container_of(handle, struct iommu_bond, sva) Code ordering wise, do we want this after the definition of iommu_bond? For both of these it's a bit non obvious what they come 'from'. I wouldn't naturally assume to_io_mm gets me from notifier to the io_mm for example. Not sure it matters though if these are only used in a few places. > + > +struct iommu_bond { > + struct iommu_sva sva; > + struct io_mm __rcu *io_mm; > + > + struct list_head mm_head; > + void *drvdata; > + struct rcu_head rcu_head; > + refcount_t refs; > +}; > + > +static DECLARE_IOASID_SET(shared_pasid); > + > +static struct mmu_notifier_ops iommu_mmu_notifier_ops; > + > +/* > + * Serializes modifications of bonds. 
> + * Lock order: Device SVA mutex; global SVA mutex; IOASID lock > + */ > +static DEFINE_MUTEX(iommu_sva_lock); > + > +struct io_mm_alloc_params { > + const struct io_mm_ops *ops; > + int min_pasid, max_pasid; > +}; > + > +static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm, void *privdata) > +{ > + int ret; > + struct io_mm *io_mm; > + struct io_mm_alloc_params *params = privdata; > + > + io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL); > + if (!io_mm) > + return ERR_PTR(-ENOMEM); > + > + io_mm->mm = mm; > + io_mm->ops = params->ops; > + INIT_LIST_HEAD(&io_mm->devices); > + > + io_mm->pasid = ioasid_alloc(&shared_pasid, params->min_pasid, > + params->max_pasid, io_mm->mm); > + if (io_mm->pasid == INVALID_IOASID) { > + ret = -ENOSPC; > + goto err_free_io_mm; > + } > + > + io_mm->ctx = params->ops->alloc(mm); > + if (IS_ERR(io_mm->ctx)) { > + ret = PTR_ERR(io_mm->ctx); > + goto err_free_pasid; > + } > + return &io_mm->notifier; > + > +err_free_pasid: > + ioasid_free(io_mm->pasid); > +err_free_io_mm: > + kfree(io_mm); > + return ERR_PTR(ret); > +} > + > +static void io_mm_free(struct mmu_notifier *mn) > +{ > + struct io_mm *io_mm = to_io_mm(mn); > + > + WARN_ON(!list_empty(&io_mm->devices)); > + > + io_mm->ops->release(io_mm->ctx); > + ioasid_free(io_mm->pasid); > + kfree(io_mm); > +} > + > +/* > + * io_mm_get - Allocate an io_mm or get the existing one for the given mm > + * @mm: the mm > + * @ops: callbacks for the IOMMU driver > + * @min_pasid: minimum PASID value (inclusive) > + * @max_pasid: maximum PASID value (inclusive) > + * > + * Returns a valid io_mm or an error pointer. > + */ > +static struct io_mm *io_mm_get(struct mm_struct *mm, > + const struct io_mm_ops *ops, > + int min_pasid, int max_pasid) > +{ > + struct io_mm *io_mm; > + struct mmu_notifier *mn; > + struct io_mm_alloc_params params = { > + .ops = ops, > + .min_pasid = min_pasid, > + .max_pasid = max_pasid, > + }; > + > + /* > + * A single notifier can exist for this (ops, mm) pair. 
Allocate it if > + * necessary. > + */ > + mn = mmu_notifier_get(&iommu_mmu_notifier_ops, mm, &params); > + if (IS_ERR(mn)) > + return ERR_CAST(mn); > + io_mm = to_io_mm(mn); > + > + if (WARN_ON(io_mm->ops != ops)) { > + mmu_notifier_put(mn); > + return ERR_PTR(-EINVAL); > + } > + > + return io_mm; > +} > + > +static void io_mm_put(struct io_mm *io_mm) > +{ > + mmu_notifier_put(&io_mm->notifier); > +} > + > +static struct iommu_sva * > +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata) > +{ > + int ret = 0; I'm fairly sure this is set in all paths below. Now, of course the compiler might not think that in which case fair enough :) > + bool attach_domain = true; > + struct iommu_bond *bond, *tmp; > + struct iommu_domain *domain, *other; > + struct iommu_sva_param *param = dev->iommu_param->sva_param; > + > + domain = iommu_get_domain_for_dev(dev); > + > + bond = kzalloc(sizeof(*bond), GFP_KERNEL); > + if (!bond) > + return ERR_PTR(-ENOMEM); > + > + bond->sva.dev = dev; > + bond->drvdata = drvdata; > + refcount_set(&bond->refs, 1); > + RCU_INIT_POINTER(bond->io_mm, io_mm); > + > + mutex_lock(&iommu_sva_lock); > + /* Is it already bound to the device or domain? */ > + list_for_each_entry(tmp, &io_mm->devices, mm_head) { > + if (tmp->sva.dev != dev) { > + other = iommu_get_domain_for_dev(tmp->sva.dev); > + if (domain == other) > + attach_domain = false; > + > + continue; > + } > + > + if (WARN_ON(tmp->drvdata != drvdata)) { > + ret = -EINVAL; > + goto err_free; > + } > + > + /* > + * Hold a single io_mm reference per bond. Note that we can't > + * return an error after this, otherwise the caller would drop > + * an additional reference to the io_mm. > + */ > + refcount_inc(&tmp->refs); > + io_mm_put(io_mm); > + kfree(bond); Free outside the lock would be ever so slightly more logical given we allocated before taking the lock. 
> + mutex_unlock(&iommu_sva_lock); > + return &tmp->sva; > + } > + > + list_add_rcu(&bond->mm_head, &io_mm->devices); > + param->nr_bonds++; > + mutex_unlock(&iommu_sva_lock); > + > + ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx, > + attach_domain); > + if (ret) > + goto err_remove; > + > + return &bond->sva; > + > +err_remove: > + /* > + * At this point concurrent threads may have started to access the > + * io_mm->devices list in order to invalidate address ranges, which > + * requires to free the bond via kfree_rcu() > + */ > + mutex_lock(&iommu_sva_lock); > + param->nr_bonds--; > + list_del_rcu(&bond->mm_head); > + > +err_free: > + mutex_unlock(&iommu_sva_lock); > + kfree_rcu(bond, rcu_head); I don't suppose it matters really but we don't need the rcu free if we follow the err_free goto. Perhaps we are cleaner in this case to not use a unified exit path but do that case inline? > + return ERR_PTR(ret); > +} > + > +static void io_mm_detach_locked(struct iommu_bond *bond) > +{ > + struct io_mm *io_mm; > + struct iommu_bond *tmp; > + bool detach_domain = true; > + struct iommu_domain *domain, *other; > + > + io_mm = rcu_dereference_protected(bond->io_mm, > + lockdep_is_held(&iommu_sva_lock)); > + if (!io_mm) > + return; > + > + domain = iommu_get_domain_for_dev(bond->sva.dev); > + > + /* Are other devices in the same domain still attached to this mm? */ > + list_for_each_entry(tmp, &io_mm->devices, mm_head) { > + if (tmp == bond) > + continue; > + other = iommu_get_domain_for_dev(tmp->sva.dev); > + if (domain == other) { > + detach_domain = false; > + break; > + } > + } > + > + io_mm->ops->detach(bond->sva.dev, io_mm->pasid, io_mm->ctx, > + detach_domain); > + > + list_del_rcu(&bond->mm_head); > + RCU_INIT_POINTER(bond->io_mm, NULL); > + > + /* Free after RCU grace period */ > + io_mm_put(io_mm); > +} > + > +/* > + * io_mm_release - release MMU notifier > + * > + * Called when the mm exits. Some devices may still be bound to the io_mm. 
A few > + * things need to be done before it is safe to release: > + * > + * - Tell the device driver to stop using this PASID. > + * - Clear the PASID table and invalidate TLBs. > + * - Drop all references to this io_mm. > + */ > +static void io_mm_release(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + struct iommu_bond *bond, *next; > + struct io_mm *io_mm = to_io_mm(mn); > + > + mutex_lock(&iommu_sva_lock); > + list_for_each_entry_safe(bond, next, &io_mm->devices, mm_head) { > + struct device *dev = bond->sva.dev; > + struct iommu_sva *sva = &bond->sva; > + > + if (sva->ops && sva->ops->mm_exit && > + sva->ops->mm_exit(dev, sva, bond->drvdata)) > + dev_WARN(dev, "possible leak of PASID %u", > + io_mm->pasid); > + > + /* unbind() frees the bond, we just detach it */ > + io_mm_detach_locked(bond); > + } > + mutex_unlock(&iommu_sva_lock); > +} > + > +static void io_mm_invalidate_range(struct mmu_notifier *mn, > + struct mm_struct *mm, unsigned long start, > + unsigned long end) > +{ > + struct iommu_bond *bond; > + struct io_mm *io_mm = to_io_mm(mn); > + > + rcu_read_lock(); > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) > + io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx, > + start, end - start); > + rcu_read_unlock(); > +} > + > +static struct mmu_notifier_ops iommu_mmu_notifier_ops = { > + .alloc_notifier = io_mm_alloc, > + .free_notifier = io_mm_free, > + .release = io_mm_release, > + .invalidate_range = io_mm_invalidate_range, > +}; > + > +struct iommu_sva * > +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm, > + const struct io_mm_ops *ops, void *drvdata) > +{ > + struct io_mm *io_mm; > + struct iommu_sva *handle; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return ERR_PTR(-ENODEV); > + > + mutex_lock(&param->sva_lock); > + if (!param->sva_param) { > + handle = ERR_PTR(-ENODEV); > + goto out_unlock; > + } > + > + io_mm = io_mm_get(mm, ops, param->sva_param->min_pasid, > + 
param->sva_param->max_pasid); > + if (IS_ERR(io_mm)) { > + handle = ERR_CAST(io_mm); > + goto out_unlock; > + } > + > + handle = io_mm_attach(dev, io_mm, drvdata); > + if (IS_ERR(handle)) > + io_mm_put(io_mm); > + > +out_unlock: > + mutex_unlock(&param->sva_lock); > + return handle; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_bind_generic); > + > +static void iommu_sva_unbind_locked(struct iommu_bond *bond) > +{ > + struct device *dev = bond->sva.dev; > + struct iommu_sva_param *param = dev->iommu_param->sva_param; > + > + if (!refcount_dec_and_test(&bond->refs)) > + return; > + > + io_mm_detach_locked(bond); > + param->nr_bonds--; > + kfree_rcu(bond, rcu_head); > +} > + > +void iommu_sva_unbind_generic(struct iommu_sva *handle) > +{ > + struct iommu_param *param = handle->dev->iommu_param; > + > + if (WARN_ON(!param)) > + return; > + > + mutex_lock(&param->sva_lock); > + mutex_lock(&iommu_sva_lock); > + iommu_sva_unbind_locked(to_iommu_bond(handle)); > + mutex_unlock(&iommu_sva_lock); > + mutex_unlock(&param->sva_lock); > +} > +EXPORT_SYMBOL_GPL(iommu_sva_unbind_generic); > + > +/** > + * iommu_sva_enable() - Enable Shared Virtual Addressing for a device > + * @dev: the device > + * @sva_param: the parameters. > + * > + * Called by an IOMMU driver to setup the SVA parameters > + * @sva_param is duplicated and can be freed when this function returns. > + * > + * Return 0 if initialization succeeded, or an error. 
> + */ > +int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param) > +{ > + int ret; > + struct iommu_sva_param *new_param; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return -ENODEV; > + > + new_param = kmemdup(sva_param, sizeof(*new_param), GFP_KERNEL); > + if (!new_param) > + return -ENOMEM; > + > + mutex_lock(&param->sva_lock); > + if (param->sva_param) { > + ret = -EEXIST; > + goto err_unlock; > + } > + > + dev->iommu_param->sva_param = new_param; > + mutex_unlock(&param->sva_lock); > + return 0; > + > +err_unlock: > + mutex_unlock(&param->sva_lock); > + kfree(new_param); > + return ret; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_enable); > + > +/** > + * iommu_sva_disable() - Disable Shared Virtual Addressing for a device > + * @dev: the device > + * > + * IOMMU drivers call this to disable SVA. > + */ > +int iommu_sva_disable(struct device *dev) > +{ > + int ret = 0; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return -EINVAL; > + > + mutex_lock(&param->sva_lock); > + if (!param->sva_param) { > + ret = -ENODEV; > + goto out_unlock; > + } > + > + /* Require that all contexts are unbound */ > + if (param->sva_param->nr_bonds) { > + ret = -EBUSY; > + goto out_unlock; > + } > + > + kfree(param->sva_param); > + param->sva_param = NULL; > +out_unlock: > + mutex_unlock(&param->sva_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_disable); > + > +bool iommu_sva_enabled(struct device *dev) > +{ > + bool enabled; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return false; > + > + mutex_lock(&param->sva_lock); > + enabled = !!param->sva_param; > + mutex_unlock(&param->sva_lock); > + return enabled; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_enabled); > + > +int iommu_sva_get_pasid_generic(struct iommu_sva *handle) > +{ > + struct io_mm *io_mm; > + int pasid = IOMMU_PASID_INVALID; > + struct iommu_bond *bond = to_iommu_bond(handle); > + > + rcu_read_lock(); > + io_mm = 
rcu_dereference(bond->io_mm); > + if (io_mm) > + pasid = io_mm->pasid; > + rcu_read_unlock(); > + return pasid; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_get_pasid_generic); > diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/iommu-sva.h > new file mode 100644 > index 000000000000..dd55c2db0936 > --- /dev/null > +++ b/drivers/iommu/iommu-sva.h > @@ -0,0 +1,64 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * SVA library for IOMMU drivers > + */ > +#ifndef _IOMMU_SVA_H > +#define _IOMMU_SVA_H > + > +#include <linux/iommu.h> > +#include <linux/kref.h> > +#include <linux/mmu_notifier.h> > + > +struct io_mm_ops { > + /* Allocate a PASID context for an mm */ > + void *(*alloc)(struct mm_struct *mm); > + > + /* > + * Attach a PASID context to a device. Write the entry into the PASID > + * table. > + * > + * @attach_domain is true when no other device in the IOMMU domain is > + * already attached to this context. IOMMU drivers that share the > + * PASID tables within a domain don't need to write the PASID entry > + * when @attach_domain is false. > + */ > + int (*attach)(struct device *dev, int pasid, void *ctx, > + bool attach_domain); > + > + /* > + * Detach a PASID context from a device. Clear the entry from the PASID > + * table and invalidate if necessary. > + * > + * @detach_domain is true when no other device in the IOMMU domain is > + * still attached to this context. IOMMU drivers that share the PASID > + * table within a domain don't need to clear the PASID entry when > + * @detach_domain is false, only invalidate the caches. > + */ > + void (*detach)(struct device *dev, int pasid, void *ctx, > + bool detach_domain); > + > + /* Invalidate a range of addresses. Cannot sleep. */ > + void (*invalidate)(struct device *dev, int pasid, void *ctx, > + unsigned long vaddr, size_t size); > + > + /* Free a context. Cannot sleep. 
*/ > + void (*release)(void *ctx); > +}; > + > +struct iommu_sva_param { > + u32 min_pasid; > + u32 max_pasid; > + int nr_bonds; > +}; > + > +struct iommu_sva * > +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm, > + const struct io_mm_ops *ops, void *drvdata); > +void iommu_sva_unbind_generic(struct iommu_sva *handle); > +int iommu_sva_get_pasid_generic(struct iommu_sva *handle); > + > +int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param); > +int iommu_sva_disable(struct device *dev); > +bool iommu_sva_enabled(struct device *dev); > + > +#endif /* _IOMMU_SVA_H */ > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index 3e3528436e0b..c8bd972c1788 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -164,6 +164,7 @@ static struct iommu_param *iommu_get_dev_param(struct device *dev) > return NULL; > > mutex_init(&param->lock); > + mutex_init(&param->sva_lock); > dev->iommu_param = param; > return param; > } > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > index 1739f8a7a4b4..83397ae88d2d 100644 > --- a/include/linux/iommu.h > +++ b/include/linux/iommu.h > @@ -368,6 +368,7 @@ struct iommu_fault_param { > * struct iommu_param - collection of per-device IOMMU data > * > * @fault_param: IOMMU detected device fault reporting data > + * @sva_param: IOMMU parameter for SVA > * > * TODO: migrate other per device data pointers under iommu_dev_data, e.g. > * struct iommu_group *iommu_group; > @@ -376,6 +377,8 @@ struct iommu_fault_param { > struct iommu_param { > struct mutex lock; > struct iommu_fault_param *fault_param; > + struct mutex sva_lock; > + struct iommu_sva_param *sva_param; > }; > > int iommu_device_register(struct iommu_device *iommu);
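To make the @attach_domain rule in the quoted patch easier to picture, here is a minimal userspace sketch of the decision io_mm_attach() makes while walking io_mm->devices. It is an illustration only: struct bond_sketch, the integer device and domain ids, and needs_domain_attach() are hypothetical stand-ins, and the real code also handles the case where the same device is bound twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bond: which device, in which domain. */
struct bond_sketch {
	int dev_id;
	int domain_id;
};

/*
 * Return true when no other device of the same domain is already
 * bound to this address space, i.e. the driver has to write the
 * PASID table entry for the domain. Mirrors the loop that clears
 * attach_domain in the quoted io_mm_attach().
 */
static bool needs_domain_attach(const struct bond_sketch *bonds, size_t n,
				int dev_id, int domain_id)
{
	for (size_t i = 0; i < n; i++) {
		if (bonds[i].dev_id != dev_id &&
		    bonds[i].domain_id == domain_id)
			return false;
	}
	return true;
}
```

In the diagram from the patch, binding mm Y to dev 00:01.0 would need the domain attach, while the later bind of dev 00:01.1 in the same domain B would not.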
On Mon, 24 Feb 2020 19:23:38 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > Some systems allow devices to handle I/O Page Faults in the core mm. For > example systems implementing the PCI PRI extension or Arm SMMU stall > model. Infrastructure for reporting these recoverable page faults was > recently added to the IOMMU core. Add a page fault handler for host SVA. > > IOMMU driver can now instantiate several fault workqueues and link them to > IOPF-capable devices. Drivers can choose between a single global > workqueue, one per IOMMU device, one per low-level fault queue, one per > domain, etc. > > When it receives a fault event, supposedly in an IRQ handler, the IOMMU > driver reports the fault using iommu_report_device_fault(), which calls > the registered handler. The page fault handler then calls the mm fault > handler, and reports either success or failure with iommu_page_response(). > When the handler succeeded, the IOMMU retries the access. > > The iopf_param pointer could be embedded into iommu_fault_param. But > putting iopf_param into the iommu_param structure allows us not to care > about ordering between calls to iopf_queue_add_device() and > iommu_register_device_fault_handler(). > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> A few more minor comments... 
> --- > drivers/iommu/Kconfig | 4 + > drivers/iommu/Makefile | 1 + > drivers/iommu/io-pgfault.c | 451 +++++++++++++++++++++++++++++++++++++ > include/linux/iommu.h | 59 +++++ > 4 files changed, 515 insertions(+) > create mode 100644 drivers/iommu/io-pgfault.c > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index acca20e2da2f..e4a42e1708b4 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -109,6 +109,10 @@ config IOMMU_SVA > select IOMMU_API > select MMU_NOTIFIER > > +config IOMMU_PAGE_FAULT > + bool > + select IOMMU_API > + > config FSL_PAMU > bool "Freescale IOMMU support" > depends on PCI > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile > index 40c800dd4e3e..bf5cb4ee8409 100644 > --- a/drivers/iommu/Makefile > +++ b/drivers/iommu/Makefile > @@ -4,6 +4,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o > obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o > obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o > obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o > +obj-$(CONFIG_IOMMU_PAGE_FAULT) += io-pgfault.o > obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o > obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o > obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o > diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c > new file mode 100644 > index 000000000000..76e153c59fe3 > --- /dev/null > +++ b/drivers/iommu/io-pgfault.c > @@ -0,0 +1,451 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Handle device page faults > + * > + * Copyright (C) 2018 ARM Ltd. As before. Date update perhaps? 
> + */ > + > +#include <linux/iommu.h> > +#include <linux/list.h> > +#include <linux/slab.h> > +#include <linux/workqueue.h> > + > +/** > + * struct iopf_queue - IO Page Fault queue > + * @wq: the fault workqueue > + * @flush: low-level flush callback > + * @flush_arg: flush() argument > + * @devices: devices attached to this queue > + * @lock: protects the device list > + */ > +struct iopf_queue { > + struct workqueue_struct *wq; > + iopf_queue_flush_t flush; > + void *flush_arg; > + struct list_head devices; > + struct mutex lock; > +}; > + > +/** > + * struct iopf_device_param - IO Page Fault data attached to a device > + * @dev: the device that owns this param > + * @queue: IOPF queue > + * @queue_list: index into queue->devices > + * @partial: faults that are part of a Page Request Group for which the last > + * request hasn't been submitted yet. > + * @busy: the param is being used > + * @wq_head: signal a change to @busy > + */ > +struct iopf_device_param { > + struct device *dev; > + struct iopf_queue *queue; > + struct list_head queue_list; > + struct list_head partial; > + bool busy; > + wait_queue_head_t wq_head; > +}; > + > +struct iopf_fault { > + struct iommu_fault fault; > + struct list_head head; > +}; > + > +struct iopf_group { > + struct iopf_fault last_fault; > + struct list_head faults; > + struct work_struct work; > + struct device *dev; > +}; > + > +static int iopf_complete(struct device *dev, struct iopf_fault *iopf, > + enum iommu_page_response_code status) This is called once per group. Should name reflect that? 
> +{ > + struct iommu_page_response resp = { > + .version = IOMMU_PAGE_RESP_VERSION_1, > + .pasid = iopf->fault.prm.pasid, > + .grpid = iopf->fault.prm.grpid, > + .code = status, > + }; > + > + if (iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) > + resp.flags = IOMMU_PAGE_RESP_PASID_VALID; > + > + return iommu_page_response(dev, &resp); > +} > + > +static enum iommu_page_response_code > +iopf_handle_single(struct iopf_fault *iopf) > +{ > + /* TODO */ > + return -ENODEV; > +} > + > +static void iopf_handle_group(struct work_struct *work) > +{ > + struct iopf_group *group; > + struct iopf_fault *iopf, *next; > + enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS; > + > + group = container_of(work, struct iopf_group, work); > + > + list_for_each_entry_safe(iopf, next, &group->faults, head) { > + /* > + * For the moment, errors are sticky: don't handle subsequent > + * faults in the group if there is an error. > + */ > + if (status == IOMMU_PAGE_RESP_SUCCESS) > + status = iopf_handle_single(iopf); > + > + if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) > + kfree(iopf); > + } > + > + iopf_complete(group->dev, &group->last_fault, status); > + kfree(group); > +} > + > +/** > + * iommu_queue_iopf - IO Page Fault handler > + * @evt: fault event > + * @cookie: struct device, passed to iommu_register_device_fault_handler. > + * > + * Add a fault to the device workqueue, to be handled by mm. > + * > + * Return: 0 on success and <0 on error. > + */ > +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) > +{ > + int ret; > + struct iopf_group *group; > + struct iopf_fault *iopf, *next; > + struct iopf_device_param *iopf_param; > + > + struct device *dev = cookie; > + struct iommu_param *param = dev->iommu_param; > + > + if (WARN_ON(!mutex_is_locked(&param->lock))) > + return -EINVAL; Just curious... Why do we always need a runtime check on this rather than say, using lockdep_assert_held or similar? 
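For reference, the distinction behind this question can be modelled in user space. mutex_is_locked() only reports that somebody holds the mutex, whereas lockdep_assert_held() checks that the current context holds it and compiles away when lockdep is disabled. The sketch below is not kernel code: toy_mutex and its helpers are names invented for illustration, and task identity is reduced to a plain integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy lock: owner is the id of the holding task, 0 when free. */
struct toy_mutex {
	int owner;
};

static void toy_lock(struct toy_mutex *m, int task)
{
	assert(m->owner == 0);		/* no contention in this model */
	m->owner = task;
}

static void toy_unlock(struct toy_mutex *m, int task)
{
	assert(m->owner == task);	/* only the holder may unlock */
	m->owner = 0;
}

/* Models mutex_is_locked(): true if anyone at all holds the lock. */
static bool toy_is_locked(const struct toy_mutex *m)
{
	return m->owner != 0;
}

/* Models what a lockdep-style assertion checks: held by this caller. */
static bool toy_held_by(const struct toy_mutex *m, int task)
{
	return m->owner == task;
}
```

With the lock taken by task 1, toy_is_locked() is true for every caller, while toy_held_by() is true only for task 1. That is the extra precision a lockdep-style assertion would give in iommu_queue_iopf().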
> + > + if (fault->type != IOMMU_FAULT_PAGE_REQ) > + /* Not a recoverable page fault */ > + return -EOPNOTSUPP; > + > + /* > + * As long as we're holding param->lock, the queue can't be unlinked > + * from the device and therefore cannot disappear. > + */ > + iopf_param = param->iopf_param; > + if (!iopf_param) > + return -ENODEV; > + > + if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) { > + iopf = kzalloc(sizeof(*iopf), GFP_KERNEL); > + if (!iopf) > + return -ENOMEM; > + > + iopf->fault = *fault; > + > + /* Non-last request of a group. Postpone until the last one */ > + list_add(&iopf->head, &iopf_param->partial); > + > + return 0; > + } > + > + group = kzalloc(sizeof(*group), GFP_KERNEL); > + if (!group) { > + /* > + * The caller will send a response to the hardware. But we do > + * need to clean up before leaving, otherwise partial faults > + * will be stuck. > + */ > + ret = -ENOMEM; > + goto cleanup_partial; > + } > + > + group->dev = dev; > + group->last_fault.fault = *fault; > + INIT_LIST_HEAD(&group->faults); > + list_add(&group->last_fault.head, &group->faults); > + INIT_WORK(&group->work, iopf_handle_group); > + > + /* See if we have partial faults for this group */ > + list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) { > + if (iopf->fault.prm.grpid == fault->prm.grpid) > + /* Insert *before* the last fault */ > + list_move(&iopf->head, &group->faults); > + } > + > + queue_work(iopf_param->queue->wq, &group->work); > + return 0; > + > +cleanup_partial: > + list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) { > + if (iopf->fault.prm.grpid == fault->prm.grpid) { > + list_del(&iopf->head); > + kfree(iopf); > + } > + } > + return ret; > +} > +EXPORT_SYMBOL_GPL(iommu_queue_iopf); > + > +/** > + * iopf_queue_flush_dev - Ensure that all queued faults have been processed > + * @dev: the endpoint whose faults need to be flushed. 
> + * @pasid: the PASID affected by this flush > + * > + * Users must call this function when releasing a PASID, to ensure that all > + * pending faults for this PASID have been handled, and won't hit the address > + * space of the next process that uses this PASID. > + * > + * This function can also be called before shutting down the device, in which > + * case @pasid should be IOMMU_PASID_INVALID. > + * > + * Return: 0 on success and <0 on error. > + */ > +int iopf_queue_flush_dev(struct device *dev, int pasid) > +{ > + int ret = 0; > + struct iopf_queue *queue; > + struct iopf_device_param *iopf_param; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return -ENODEV; > + > + /* > + * It is incredibly easy to find ourselves in a deadlock situation if > + * we're not careful, because we're taking the opposite path as > + * iommu_queue_iopf: > + * > + * iopf_queue_flush_dev() | PRI queue handler > + * lock(&param->lock) | iommu_queue_iopf() > + * queue->flush() | lock(&param->lock) > + * wait PRI queue empty | > + * > + * So we can't hold the device param lock while flushing. Take a > + * reference to the device param instead, to prevent the queue from > + * going away. > + */ > + mutex_lock(&param->lock); > + iopf_param = param->iopf_param; > + if (iopf_param) { > + queue = param->iopf_param->queue; > + iopf_param->busy = true; Describing this as taking a reference is not great... I'd change the comment to set a flag or something like that. Is there any potential of multiple copies of this running against each other? I've not totally gotten my head around when this might be called yet. > + } else { > + ret = -ENODEV; > + } > + mutex_unlock(&param->lock); > + if (ret) > + return ret; > + > + /* > + * When removing a PASID, the device driver tells the device to stop > + * using it, and flush any pending fault to the IOMMU. In this flush > + * callback, the IOMMU driver makes sure that there are no such faults > + * left in the low-level queue. 
> + */ > + queue->flush(queue->flush_arg, dev, pasid); > + > + flush_workqueue(queue->wq); > + > + mutex_lock(&param->lock); > + iopf_param->busy = false; > + wake_up(&iopf_param->wq_head); > + mutex_unlock(&param->lock); > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(iopf_queue_flush_dev); > + > +/** > + * iopf_queue_discard_partial - Remove all pending partial fault > + * @queue: the queue whose partial faults need to be discarded > + * > + * When the hardware queue overflows, last page faults in a group may have been > + * lost and the IOMMU driver calls this to discard all partial faults. The > + * driver shouldn't be adding new faults to this queue concurrently. > + * > + * Return: 0 on success and <0 on error. > + */ > +int iopf_queue_discard_partial(struct iopf_queue *queue) > +{ > + struct iopf_fault *iopf, *next; > + struct iopf_device_param *iopf_param; > + > + if (!queue) > + return -EINVAL; > + > + mutex_lock(&queue->lock); > + list_for_each_entry(iopf_param, &queue->devices, queue_list) { > + list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) > + kfree(iopf); > + } > + mutex_unlock(&queue->lock); > + return 0; > +} > +EXPORT_SYMBOL_GPL(iopf_queue_discard_partial); > + > +/** > + * iopf_queue_add_device - Add producer to the fault queue > + * @queue: IOPF queue > + * @dev: device to add > + * > + * Return: 0 on success and <0 on error. 
> + */ > +int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev) > +{ > + int ret = -EINVAL; > + struct iopf_device_param *iopf_param; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return -ENODEV; > + > + iopf_param = kzalloc(sizeof(*iopf_param), GFP_KERNEL); > + if (!iopf_param) > + return -ENOMEM; > + > + INIT_LIST_HEAD(&iopf_param->partial); > + iopf_param->queue = queue; > + iopf_param->dev = dev; > + init_waitqueue_head(&iopf_param->wq_head); > + > + mutex_lock(&queue->lock); > + mutex_lock(&param->lock); > + if (!param->iopf_param) { > + list_add(&iopf_param->queue_list, &queue->devices); > + param->iopf_param = iopf_param; > + ret = 0; > + } > + mutex_unlock(&param->lock); > + mutex_unlock(&queue->lock); > + > + if (ret) > + kfree(iopf_param); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(iopf_queue_add_device); > + > +/** > + * iopf_queue_remove_device - Remove producer from fault queue > + * @queue: IOPF queue > + * @dev: device to remove > + * > + * Caller makes sure that no more faults are reported for this device. > + * > + * Return: 0 on success and <0 on error. > + */ > +int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev) > +{ > + int ret = -EINVAL; > + struct iopf_fault *iopf, *next; > + struct iopf_device_param *iopf_param; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param || !queue) > + return -EINVAL; > + > + do { > + mutex_lock(&queue->lock); > + mutex_lock(&param->lock); > + iopf_param = param->iopf_param; > + if (iopf_param && iopf_param->queue == queue) { > + if (iopf_param->busy) { > + ret = -EBUSY; > + } else { > + list_del(&iopf_param->queue_list); > + param->iopf_param = NULL; > + ret = 0; > + } > + } > + mutex_unlock(&param->lock); > + mutex_unlock(&queue->lock); > + > + /* > + * If there is an ongoing flush, wait for it to complete and > + * then retry. iopf_param isn't going away since we're the only > + * thread that can free it. 
> + */ > + if (ret == -EBUSY) > + wait_event(iopf_param->wq_head, !iopf_param->busy); > + else if (ret) > + return ret; > + } while (ret == -EBUSY); I'm in two minds about the next comment (so up to you)... Currently this looks a bit odd. Would you be better off just having a separate parameter for busy and explicit separate handling for the error path? bool busy; int ret = 0; do { mutex_lock(&queue->lock); mutex_lock(&param->lock); iopf_param = param->iopf_param; if (iopf_param && iopf_param->queue == queue) { busy = iopf_param->busy; if (!busy) { list_del(&iopf_param->queue_list); param->iopf_param = NULL; } } else { ret = -EINVAL; } mutex_unlock(&param->lock); mutex_unlock(&queue->lock); if (ret) return ret; if (busy) wait_event(iopf_param->wq_head, !iopf_param->busy); } while (busy); .. > + > + /* Just in case some faults are still stuck */ > + list_for_each_entry_safe(iopf, next, &iopf_param->partial, head) > + kfree(iopf); > + > + kfree(iopf_param); > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(iopf_queue_remove_device); > + > +/** > + * iopf_queue_alloc - Allocate and initialize a fault queue > + * @name: a unique string identifying the queue (for workqueue) > + * @flush: a callback that flushes the low-level queue > + * @cookie: driver-private data passed to the flush callback > + * > + * The callback is called before the workqueue is flushed. The IOMMU driver must > + * commit all faults that are pending in its low-level queues at the time of the > + * call, into the IOPF queue (with iommu_report_device_fault). The callback > + * takes a device pointer as argument, hinting what endpoint is causing the > + * flush. When the device is NULL, all faults should be committed. > + * > + * Return: the queue on success and NULL on error. 
> + */ > +struct iopf_queue * > +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie) > +{ > + struct iopf_queue *queue; > + > + queue = kzalloc(sizeof(*queue), GFP_KERNEL); > + if (!queue) > + return NULL; > + > + /* > + * The WQ is unordered because the low-level handler enqueues faults by > + * group. PRI requests within a group have to be ordered, but once > + * that's dealt with, the high-level function can handle groups out of > + * order. > + */ > + queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name); > + if (!queue->wq) { > + kfree(queue); > + return NULL; > + } > + > + queue->flush = flush; > + queue->flush_arg = cookie; > + INIT_LIST_HEAD(&queue->devices); > + mutex_init(&queue->lock); > + > + return queue; > +} > +EXPORT_SYMBOL_GPL(iopf_queue_alloc); > + > +/** > + * iopf_queue_free - Free IOPF queue > + * @queue: queue to free > + * > + * Counterpart to iopf_queue_alloc(). The driver must not be queuing faults or > + * adding/removing devices on this queue anymore. 
> + */ > +void iopf_queue_free(struct iopf_queue *queue) > +{ > + struct iopf_device_param *iopf_param, *next; > + > + if (!queue) > + return; > + > + list_for_each_entry_safe(iopf_param, next, &queue->devices, queue_list) > + iopf_queue_remove_device(queue, iopf_param->dev); > + > + destroy_workqueue(queue->wq); > + kfree(queue); > +} > +EXPORT_SYMBOL_GPL(iopf_queue_free); > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > index 83397ae88d2d..e7bc47ba24f8 100644 > --- a/include/linux/iommu.h > +++ b/include/linux/iommu.h > @@ -364,11 +364,20 @@ struct iommu_fault_param { > struct mutex lock; > }; > > +/** > + * iopf_queue_flush_t - Flush low-level page fault queue > + * > + * Report all faults currently pending in the low-level page fault queue > + */ > +struct iopf_queue; > +typedef int (*iopf_queue_flush_t)(void *cookie, struct device *dev, int pasid); > + > /** > * struct iommu_param - collection of per-device IOMMU data > * > * @fault_param: IOMMU detected device fault reporting data > * @sva_param: IOMMU parameter for SVA > + * @iopf_param: I/O Page Fault queue and data > * > * TODO: migrate other per device data pointers under iommu_dev_data, e.g. 
> * struct iommu_group *iommu_group; > @@ -377,6 +386,7 @@ struct iommu_fault_param { > struct iommu_param { > struct mutex lock; > struct iommu_fault_param *fault_param; > + struct iopf_device_param *iopf_param; > struct mutex sva_lock; > struct iommu_sva_param *sva_param; > }; > @@ -1081,4 +1091,53 @@ void iommu_debugfs_setup(void); > static inline void iommu_debugfs_setup(void) {} > #endif > > +#ifdef CONFIG_IOMMU_PAGE_FAULT > +extern int iommu_queue_iopf(struct iommu_fault *fault, void *cookie); > + > +extern int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev); > +extern int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev); > +extern int iopf_queue_flush_dev(struct device *dev, int pasid); > +extern struct iopf_queue * > +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie); > +extern void iopf_queue_free(struct iopf_queue *queue); > +extern int iopf_queue_discard_partial(struct iopf_queue *queue); > +#else /* CONFIG_IOMMU_PAGE_FAULT */ > +static inline int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) > +{ > + return -ENODEV; > +} > + > +static inline int iopf_queue_add_device(struct iopf_queue *queue, > + struct device *dev) > +{ > + return -ENODEV; > +} > + > +static inline int iopf_queue_remove_device(struct iopf_queue *queue, > + struct device *dev) > +{ > + return -ENODEV; > +} > + > +static inline int iopf_queue_flush_dev(struct device *dev, int pasid) > +{ > + return -ENODEV; > +} > + > +static inline struct iopf_queue * > +iopf_queue_alloc(const char *name, iopf_queue_flush_t flush, void *cookie) > +{ > + return NULL; > +} > + > +static inline void iopf_queue_free(struct iopf_queue *queue) > +{ > +} > + > +static inline int iopf_queue_discard_partial(struct iopf_queue *queue) > +{ > + return -ENODEV; > +} > +#endif /* CONFIG_IOMMU_PAGE_FAULT */ > + > #endif /* __LINUX_IOMMU_H */
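As an illustration of the grouping that iommu_queue_iopf() performs in the patch quoted above, here is a minimal user-space sketch. It is a model under stated assumptions, not the kernel implementation: toy_fault, toy_param and toy_queue_fault() are hypothetical names, a singly linked list stands in for list_head, and neither locking nor the workqueue is modelled. Non-last faults are parked on a per-device partial list; the fault carrying the last-page flag pulls every parked fault with the same group ID in front of itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified fault record: only what the grouping needs. */
struct toy_fault {
	int grpid;		/* Page Request Group index */
	bool last;		/* models IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE */
	struct toy_fault *next;	/* singly linked, unlike the kernel list_head */
};

/* Per-device list of faults whose group is not complete yet. */
struct toy_param {
	struct toy_fault *partial;
};

/*
 * Queue one fault. Returns 0 on success and -1 on allocation failure.
 * For a non-last fault, *group is set to NULL and the fault is parked.
 * For a last fault, *group receives the completed group: all parked
 * faults with the same grpid, followed by the last fault at the tail.
 */
static int toy_queue_fault(struct toy_param *param, int grpid, bool last,
			   struct toy_fault **group)
{
	struct toy_fault **pp;
	struct toy_fault *f = calloc(1, sizeof(*f));

	*group = NULL;
	if (!f)
		return -1;
	f->grpid = grpid;
	f->last = last;

	if (!last) {
		/* Non-last request of a group: postpone until the last one */
		f->next = param->partial;
		param->partial = f;
		return 0;
	}

	*group = f;
	pp = &param->partial;
	while (*pp) {
		struct toy_fault *cur = *pp;

		if (cur->grpid == grpid) {
			*pp = cur->next;	/* unlink from the partial list */
			cur->next = *group;	/* insert before the last fault */
			*group = cur;
		} else {
			pp = &cur->next;
		}
	}
	return 0;
}

/* Number of faults on a list. */
static size_t toy_count(const struct toy_fault *f)
{
	size_t n = 0;

	for (; f; f = f->next)
		n++;
	return n;
}

/* Free a whole list, as the cleanup paths in the patch do for partial faults. */
static void toy_free(struct toy_fault *f)
{
	while (f) {
		struct toy_fault *next = f->next;

		free(f);
		f = next;
	}
}
```

Faults from other groups stay on the partial list until their own last request arrives, which is why the patch needs iopf_queue_discard_partial() when the hardware queue overflows and a last request may have been lost.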
Hi Jean, A few comments inline. I am also trying to converge to the common sva APIs. I sent out the first step w/o iopage fault and the generic ops you have here. On Mon, 24 Feb 2020 19:23:37 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > Add a small library to help IOMMU drivers manage process address > spaces bound to their devices. Register an MMU notifier to track > modification on each address space bound to one or more devices. > > IOMMU drivers must implement the io_mm_ops and can then use the > helpers provided by this library to easily implement the SVA API > introduced by commit 26b25a2b98e4. The io_mm_ops are: > > void *alloc(struct mm_struct *) > Allocate a PASID context private to the IOMMU driver. There is a > single context per mm. IOMMU drivers may perform arch-specific > operations in there, for example pinning down a CPU ASID (on Arm). > > int attach(struct device *, int pasid, void *ctx, bool attach_domain) > Attach a context to the device, by setting up the PASID table entry. > > int invalidate(struct device *, int pasid, void *ctx, > unsigned long vaddr, size_t size) > Invalidate TLB entries for this address range. > > int detach(struct device *, int pasid, void *ctx, bool detach_domain) > Detach a context from the device, by clearing the PASID table entry > and invalidating cached entries. > > void free(void *ctx) you meant release()? > Free a context. 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > --- > drivers/iommu/Kconfig | 7 + > drivers/iommu/Makefile | 1 + > drivers/iommu/iommu-sva.c | 561 > ++++++++++++++++++++++++++++++++++++++ drivers/iommu/iommu-sva.h | > 64 +++++ drivers/iommu/iommu.c | 1 + > include/linux/iommu.h | 3 + > 6 files changed, 637 insertions(+) > create mode 100644 drivers/iommu/iommu-sva.c > create mode 100644 drivers/iommu/iommu-sva.h > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index d2fade984999..acca20e2da2f 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -102,6 +102,13 @@ config IOMMU_DMA > select IRQ_MSI_IOMMU > select NEED_SG_DMA_LENGTH > > +# Shared Virtual Addressing library > +config IOMMU_SVA > + bool > + select IOASID > + select IOMMU_API > + select MMU_NOTIFIER > + > config FSL_PAMU > bool "Freescale IOMMU support" > depends on PCI > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile > index 9f33fdb3bb05..40c800dd4e3e 100644 > --- a/drivers/iommu/Makefile > +++ b/drivers/iommu/Makefile > @@ -37,3 +37,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o > obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o > obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o > obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o > +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o > diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c > new file mode 100644 > index 000000000000..64f1d1c82383 > --- /dev/null > +++ b/drivers/iommu/iommu-sva.c > @@ -0,0 +1,561 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Manage PASIDs and bind process address spaces to devices. > + * > + * Copyright (C) 2018 ARM Ltd. > + */ > + > +#include <linux/idr.h> > +#include <linux/ioasid.h> > +#include <linux/iommu.h> > +#include <linux/sched/mm.h> > +#include <linux/slab.h> > +#include <linux/spinlock.h> > + > +#include "iommu-sva.h" > + > +/** > + * DOC: io_mm model > + * > + * The io_mm keeps track of process address spaces shared between > CPU and IOMMU. 
> + * The following example illustrates the relation between structures > + * iommu_domain, io_mm and iommu_sva. The iommu_sva struct is a bond > between > + * io_mm and device. A device can have multiple io_mm and an io_mm > may be bound > + * to multiple devices. > + * ___________________________ > + * | IOMMU domain A | > + * | ________________ | > + * | | IOMMU group | +------- io_pgtables > + * | | | | > + * | | dev 00:00.0 ----+------- bond 1 --- io_mm X > + * | |________________| \ | > + * | '----- bond 2 ---. > + * |___________________________| \ > + * ___________________________ \ > + * | IOMMU domain B | io_mm Y > + * | ________________ | / / > + * | | IOMMU group | | / / > + * | | | | / / > + * | | dev 00:01.0 ------------ bond 3 -' / > + * | | dev 00:01.1 ------------ bond 4 --' > + * | |________________| | > + * | +------- io_pgtables > + * |___________________________| > + * > + * In this example, device 00:00.0 is in domain A, devices 00:01.* > are in domain > + * B. All devices within the same domain access the same address > spaces. Hmm, devices in domain A has access to both X & Y, isn't it contradictory? > Device > + * 00:00.0 accesses address spaces X and Y, each corresponding to an > mm_struct. > + * Devices 00:01.* only access address space Y. In addition each > + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, > that is > + * managed with iommu_map()/iommu_unmap(), and isn't shared with the > CPU MMU. So this would allow IOVA and SVA co-exist in the same address space? I guess this is the PASID 0 for DMA request w/o PASID. If that is the case, perhaps needs more explanation since the private address space also has a private PASID within the domain. > + * > + * To obtain the above configuration, users would for instance issue > the > + * following calls: > + * > + * iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> bond 1 > + * iommu_sva_bind_device(dev 00:00.0, mm Y, ...) 
-> bond 2 > + * iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> bond 3 > + * iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> bond 4 > + * > + * A single Process Address Space ID (PASID) is allocated for each > mm. In the > + * example, devices use PASID 1 to read/write into address space X > and PASID 2 > + * to read/write into address space Y. Calling iommu_sva_get_pasid() > on bond 1 > + * returns 1, and calling it on bonds 2-4 returns 2. > + * > + * Hardware tables describing this configuration in the IOMMU would > typically > + * look like this: > + * > + * PASID tables > + * of domain A > + * .->+--------+ > + * / 0 | |-------> io_pgtable > + * / +--------+ > + * Device tables / 1 | |-------> pgd X > + * +--------+ / +--------+ > + * 00:00.0 | A |-' 2 | |--. > + * +--------+ +--------+ \ > + * : : 3 | | \ > + * +--------+ +--------+ --> pgd Y > + * 00:01.0 | B |--. / > + * +--------+ \ | > + * 00:01.1 | B |----+ PASID tables | > + * +--------+ \ of domain B | > + * '->+--------+ | > + * 0 | |-- | --> io_pgtable > + * +--------+ | > + * 1 | | | > + * +--------+ | > + * 2 | |---' > + * +--------+ > + * 3 | | > + * +--------+ > + * > + * With this model, a single call binds all devices in a given > domain to an > + * address space. Other devices in the domain will get the same bond > implicitly. > + * However, users must issue one bind() for each device, because > IOMMUs may > + * implement SVA differently. Furthermore, mandating one bind() per > device > + * allows the driver to perform sanity-checks on device capabilities. > + * > + * In some IOMMUs, one entry of the PASID table (typically the first > one) can > + * hold non-PASID translations. In this case PASID 0 is reserved and > the first > + * entry points to the io_pgtable pointer. In other IOMMUs the > io_pgtable > + * pointer is held in the device table and PASID 0 is available to > the > + * allocator. 
> + */ > + > +struct io_mm { > + struct list_head devices; > + struct mm_struct *mm; > + struct mmu_notifier notifier; > + > + /* Late initialization */ > + const struct io_mm_ops *ops; > + void *ctx; > + int pasid; > +}; > + > +#define to_io_mm(mmu_notifier) container_of(mmu_notifier, > struct io_mm, notifier) +#define to_iommu_bond(handle) > container_of(handle, struct iommu_bond, sva) + > +struct iommu_bond { > + struct iommu_sva sva; > + struct io_mm __rcu *io_mm; > + > + struct list_head mm_head; > + void *drvdata; > + struct rcu_head rcu_head; > + refcount_t refs; > +}; > + > +static DECLARE_IOASID_SET(shared_pasid); > + > +static struct mmu_notifier_ops iommu_mmu_notifier_ops; > + > +/* > + * Serializes modifications of bonds. > + * Lock order: Device SVA mutex; global SVA mutex; IOASID lock > + */ > +static DEFINE_MUTEX(iommu_sva_lock); > + > +struct io_mm_alloc_params { > + const struct io_mm_ops *ops; > + int min_pasid, max_pasid; > +}; > + > +static struct mmu_notifier *io_mm_alloc(struct mm_struct *mm, void > *privdata) +{ > + int ret; > + struct io_mm *io_mm; > + struct io_mm_alloc_params *params = privdata; > + > + io_mm = kzalloc(sizeof(*io_mm), GFP_KERNEL); > + if (!io_mm) > + return ERR_PTR(-ENOMEM); > + > + io_mm->mm = mm; > + io_mm->ops = params->ops; > + INIT_LIST_HEAD(&io_mm->devices); > + > + io_mm->pasid = ioasid_alloc(&shared_pasid, params->min_pasid, > + params->max_pasid, io_mm->mm); > + if (io_mm->pasid == INVALID_IOASID) { > + ret = -ENOSPC; > + goto err_free_io_mm; > + } > + > + io_mm->ctx = params->ops->alloc(mm); > + if (IS_ERR(io_mm->ctx)) { > + ret = PTR_ERR(io_mm->ctx); > + goto err_free_pasid; > + } > + return &io_mm->notifier; > + > +err_free_pasid: > + ioasid_free(io_mm->pasid); > +err_free_io_mm: > + kfree(io_mm); > + return ERR_PTR(ret); > +} > + > +static void io_mm_free(struct mmu_notifier *mn) > +{ > + struct io_mm *io_mm = to_io_mm(mn); > + > + WARN_ON(!list_empty(&io_mm->devices)); > + > + io_mm->ops->release(io_mm->ctx); 
> + ioasid_free(io_mm->pasid); > + kfree(io_mm); > +} > + > +/* > + * io_mm_get - Allocate an io_mm or get the existing one for the > given mm > + * @mm: the mm > + * @ops: callbacks for the IOMMU driver > + * @min_pasid: minimum PASID value (inclusive) > + * @max_pasid: maximum PASID value (inclusive) > + * > + * Returns a valid io_mm or an error pointer. > + */ > +static struct io_mm *io_mm_get(struct mm_struct *mm, > + const struct io_mm_ops *ops, > + int min_pasid, int max_pasid) > +{ > + struct io_mm *io_mm; > + struct mmu_notifier *mn; > + struct io_mm_alloc_params params = { > + .ops = ops, > + .min_pasid = min_pasid, > + .max_pasid = max_pasid, > + }; > + > + /* > + * A single notifier can exist for this (ops, mm) pair. > Allocate it if > + * necessary. > + */ > + mn = mmu_notifier_get(&iommu_mmu_notifier_ops, mm, &params); > + if (IS_ERR(mn)) > + return ERR_CAST(mn); > + io_mm = to_io_mm(mn); > + > + if (WARN_ON(io_mm->ops != ops)) { > + mmu_notifier_put(mn); > + return ERR_PTR(-EINVAL); > + } > + > + return io_mm; > +} > + > +static void io_mm_put(struct io_mm *io_mm) > +{ > + mmu_notifier_put(&io_mm->notifier); > +} > + > +static struct iommu_sva * > +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata) > +{ > + int ret = 0; > + bool attach_domain = true; > + struct iommu_bond *bond, *tmp; > + struct iommu_domain *domain, *other; > + struct iommu_sva_param *param = dev->iommu_param->sva_param; > + > + domain = iommu_get_domain_for_dev(dev); > + > + bond = kzalloc(sizeof(*bond), GFP_KERNEL); > + if (!bond) > + return ERR_PTR(-ENOMEM); > + > + bond->sva.dev = dev; > + bond->drvdata = drvdata; > + refcount_set(&bond->refs, 1); > + RCU_INIT_POINTER(bond->io_mm, io_mm); > + > + mutex_lock(&iommu_sva_lock); > + /* Is it already bound to the device or domain? 
*/ > + list_for_each_entry(tmp, &io_mm->devices, mm_head) { > + if (tmp->sva.dev != dev) { > + other = > iommu_get_domain_for_dev(tmp->sva.dev); > + if (domain == other) > + attach_domain = false; > + > + continue; At this point, we already know this is a new device trying to attach to one of io_mm's existing domains. So there is no need to continue checking, right? Perhaps check like this? - if (tmp->sva.dev != dev) { + if (tmp->sva.dev != dev && attach_domain) { > + } > + > + if (WARN_ON(tmp->drvdata != drvdata)) { > + ret = -EINVAL; > + goto err_free; > + } > + > + /* > + * Hold a single io_mm reference per bond. Note that > we can't > + * return an error after this, otherwise the caller > would drop > + * an additional reference to the io_mm. > + */ > + refcount_inc(&tmp->refs); > + io_mm_put(io_mm); > + kfree(bond); Can bond be allocated after searching for an existing bond or domain? If so, we can avoid freeing bond here. > + mutex_unlock(&iommu_sva_lock); > + return &tmp->sva; > + } > + > + list_add_rcu(&bond->mm_head, &io_mm->devices); > + param->nr_bonds++; > + mutex_unlock(&iommu_sva_lock); > + > + ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, > io_mm->ctx, > + attach_domain); For VT-d, if a device is trying to do an SVA bind, there would not be a DMA domain. SVA should own the entire address space, no IOVA. So is this attach() call for the VT-d driver to set up the first PASID table entry regardless of whether attach_domain is true or false? 
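To make the meaning of attach_domain in this discussion concrete, here is a minimal user-space model of the search loop quoted above. It is only a sketch under assumptions: toy_bond and toy_find_bond() are hypothetical names, bonds live in a flat array instead of io_mm->devices, and each bond stores its domain directly rather than looking it up with iommu_get_domain_for_dev().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened bond: which device, and which domain it is in. */
struct toy_bond {
	int dev;
	int domain;
};

/*
 * Search the bonds of one mm for @dev.
 *
 * Returns the index of an existing bond for @dev, or -1 if there is none.
 * *attach_domain ends up false when another device of @domain was seen
 * before the search stopped, i.e. the domain already has this mm attached.
 */
static int toy_find_bond(const struct toy_bond *bonds, size_t n,
			 int dev, int domain, bool *attach_domain)
{
	size_t i;

	*attach_domain = true;
	for (i = 0; i < n; i++) {
		if (bonds[i].dev != dev) {
			/* Another device: only its domain matters */
			if (bonds[i].domain == domain)
				*attach_domain = false;
			continue;
		}
		/* Same device: reuse the existing bond */
		return (int)i;
	}
	return -1;
}
```

In this model a driver with per-device PASID tables would write the PASID entry whenever the function returns -1, whatever the value of *attach_domain, while a driver sharing PASID tables per domain could skip the write when *attach_domain is false.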
> + if (ret) > + goto err_remove; > + > + return &bond->sva; > + > +err_remove: > + /* > + * At this point concurrent threads may have started to > access the > + * io_mm->devices list in order to invalidate address > ranges, which > + * requires to free the bond via kfree_rcu() > + */ > + mutex_lock(&iommu_sva_lock); > + param->nr_bonds--; > + list_del_rcu(&bond->mm_head); > + > +err_free: > + mutex_unlock(&iommu_sva_lock); > + kfree_rcu(bond, rcu_head); > + return ERR_PTR(ret); > +} > + > +static void io_mm_detach_locked(struct iommu_bond *bond) > +{ > + struct io_mm *io_mm; > + struct iommu_bond *tmp; > + bool detach_domain = true; > + struct iommu_domain *domain, *other; > + > + io_mm = rcu_dereference_protected(bond->io_mm, > + > lockdep_is_held(&iommu_sva_lock)); > + if (!io_mm) > + return; > + > + domain = iommu_get_domain_for_dev(bond->sva.dev); > + > + /* Are other devices in the same domain still attached to > this mm? */ > + list_for_each_entry(tmp, &io_mm->devices, mm_head) { > + if (tmp == bond) > + continue; > + other = iommu_get_domain_for_dev(tmp->sva.dev); > + if (domain == other) { > + detach_domain = false; > + break; > + } > + } > + > + io_mm->ops->detach(bond->sva.dev, io_mm->pasid, io_mm->ctx, > + detach_domain); > + > + list_del_rcu(&bond->mm_head); > + RCU_INIT_POINTER(bond->io_mm, NULL); > + > + /* Free after RCU grace period */ > + io_mm_put(io_mm); > +} > + > +/* > + * io_mm_release - release MMU notifier > + * > + * Called when the mm exits. Some devices may still be bound to the > io_mm. A few > + * things need to be done before it is safe to release: > + * > + * - Tell the device driver to stop using this PASID. > + * - Clear the PASID table and invalidate TLBs. > + * - Drop all references to this io_mm. 
> + */ > +static void io_mm_release(struct mmu_notifier *mn, struct mm_struct > *mm) +{ > + struct iommu_bond *bond, *next; > + struct io_mm *io_mm = to_io_mm(mn); > + > + mutex_lock(&iommu_sva_lock); > + list_for_each_entry_safe(bond, next, &io_mm->devices, > mm_head) { > + struct device *dev = bond->sva.dev; > + struct iommu_sva *sva = &bond->sva; > + > + if (sva->ops && sva->ops->mm_exit && > + sva->ops->mm_exit(dev, sva, bond->drvdata)) > + dev_WARN(dev, "possible leak of PASID %u", > + io_mm->pasid); > + > + /* unbind() frees the bond, we just detach it */ > + io_mm_detach_locked(bond); > + } > + mutex_unlock(&iommu_sva_lock); > +} > + > +static void io_mm_invalidate_range(struct mmu_notifier *mn, > + struct mm_struct *mm, unsigned > long start, > + unsigned long end) > +{ > + struct iommu_bond *bond; > + struct io_mm *io_mm = to_io_mm(mn); > + > + rcu_read_lock(); > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) > + io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, > io_mm->ctx, > + start, end - start); > + rcu_read_unlock(); > +} > + > +static struct mmu_notifier_ops iommu_mmu_notifier_ops = { > + .alloc_notifier = io_mm_alloc, > + .free_notifier = io_mm_free, > + .release = io_mm_release, > + .invalidate_range = io_mm_invalidate_range, > +}; > + > +struct iommu_sva * > +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm, > + const struct io_mm_ops *ops, void *drvdata) > +{ > + struct io_mm *io_mm; > + struct iommu_sva *handle; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return ERR_PTR(-ENODEV); > + > + mutex_lock(&param->sva_lock); > + if (!param->sva_param) { > + handle = ERR_PTR(-ENODEV); > + goto out_unlock; > + } > + > + io_mm = io_mm_get(mm, ops, param->sva_param->min_pasid, > + param->sva_param->max_pasid); > + if (IS_ERR(io_mm)) { > + handle = ERR_CAST(io_mm); > + goto out_unlock; > + } > + > + handle = io_mm_attach(dev, io_mm, drvdata); > + if (IS_ERR(handle)) > + io_mm_put(io_mm); > + > 
+out_unlock: > + mutex_unlock(&param->sva_lock); > + return handle; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_bind_generic); > + > +static void iommu_sva_unbind_locked(struct iommu_bond *bond) > +{ > + struct device *dev = bond->sva.dev; > + struct iommu_sva_param *param = dev->iommu_param->sva_param; > + > + if (!refcount_dec_and_test(&bond->refs)) > + return; > + Don't you need to free bond here? > + io_mm_detach_locked(bond); > + param->nr_bonds--; > + kfree_rcu(bond, rcu_head); > +} > + > +void iommu_sva_unbind_generic(struct iommu_sva *handle) > +{ > + struct iommu_param *param = handle->dev->iommu_param; > + > + if (WARN_ON(!param)) > + return; > + > + mutex_lock(&param->sva_lock); > + mutex_lock(&iommu_sva_lock); > + iommu_sva_unbind_locked(to_iommu_bond(handle)); > + mutex_unlock(&iommu_sva_lock); > + mutex_unlock(&param->sva_lock); > +} > +EXPORT_SYMBOL_GPL(iommu_sva_unbind_generic); > + > +/** > + * iommu_sva_enable() - Enable Shared Virtual Addressing for a device > + * @dev: the device > + * @sva_param: the parameters. > + * > + * Called by an IOMMU driver to setup the SVA parameters > + * @sva_param is duplicated and can be freed when this function > returns. > + * > + * Return 0 if initialization succeeded, or an error. > + */ IOMMU vendor drivers usually don't know when the device SVA feature will be used until the bind call. So we pretty much have to call this for every device during init time? 
> +int iommu_sva_enable(struct device *dev, struct iommu_sva_param > *sva_param) +{ > + int ret; > + struct iommu_sva_param *new_param; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return -ENODEV; > + > + new_param = kmemdup(sva_param, sizeof(*new_param), > GFP_KERNEL); > + if (!new_param) > + return -ENOMEM; > + > + mutex_lock(&param->sva_lock); > + if (param->sva_param) { > + ret = -EEXIST; > + goto err_unlock; > + } > + > + dev->iommu_param->sva_param = new_param; > + mutex_unlock(&param->sva_lock); > + return 0; > + > +err_unlock: > + mutex_unlock(&param->sva_lock); > + kfree(new_param); > + return ret; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_enable); > + > +/** > + * iommu_sva_disable() - Disable Shared Virtual Addressing for a > device > + * @dev: the device > + * > + * IOMMU drivers call this to disable SVA. > + */ > +int iommu_sva_disable(struct device *dev) > +{ > + int ret = 0; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return -EINVAL; > + > + mutex_lock(&param->sva_lock); > + if (!param->sva_param) { > + ret = -ENODEV; > + goto out_unlock; > + } > + > + /* Require that all contexts are unbound */ > + if (param->sva_param->nr_bonds) { > + ret = -EBUSY; > + goto out_unlock; > + } > + > + kfree(param->sva_param); > + param->sva_param = NULL; > +out_unlock: > + mutex_unlock(&param->sva_lock); > + > + return ret; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_disable); > + > +bool iommu_sva_enabled(struct device *dev) > +{ > + bool enabled; > + struct iommu_param *param = dev->iommu_param; > + > + if (!param) > + return false; > + > + mutex_lock(&param->sva_lock); > + enabled = !!param->sva_param; > + mutex_unlock(&param->sva_lock); > + return enabled; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_enabled); > + > +int iommu_sva_get_pasid_generic(struct iommu_sva *handle) > +{ > + struct io_mm *io_mm; > + int pasid = IOMMU_PASID_INVALID; > + struct iommu_bond *bond = to_iommu_bond(handle); > + > + rcu_read_lock(); > + io_mm = 
rcu_dereference(bond->io_mm); > + if (io_mm) > + pasid = io_mm->pasid; > + rcu_read_unlock(); > + return pasid; > +} > +EXPORT_SYMBOL_GPL(iommu_sva_get_pasid_generic); > diff --git a/drivers/iommu/iommu-sva.h b/drivers/iommu/iommu-sva.h > new file mode 100644 > index 000000000000..dd55c2db0936 > --- /dev/null > +++ b/drivers/iommu/iommu-sva.h > @@ -0,0 +1,64 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * SVA library for IOMMU drivers > + */ > +#ifndef _IOMMU_SVA_H > +#define _IOMMU_SVA_H > + > +#include <linux/iommu.h> > +#include <linux/kref.h> > +#include <linux/mmu_notifier.h> > + > +struct io_mm_ops { > + /* Allocate a PASID context for an mm */ > + void *(*alloc)(struct mm_struct *mm); > + > + /* > + * Attach a PASID context to a device. Write the entry into > the PASID > + * table. > + * > + * @attach_domain is true when no other device in the IOMMU > domain is > + * already attached to this context. IOMMU drivers that > share the > + * PASID tables within a domain don't need to write the > PASID entry > + * when @attach_domain is false. > + */ If we have per device PASID table, then we need to set up PASID table entry regardless of the domain sharing. What is confusing to me is that domain is for DMA isolation on request w/o PASID, but with SVA we don't really care about domains. Sorry, it has been a long time since we discussed this. I think will work for VT-d but just wanted to make sure I understand the intentions. > + int (*attach)(struct device *dev, int pasid, void *ctx, > + bool attach_domain); > + > + /* > + * Detach a PASID context from a device. Clear the entry > from the PASID > + * table and invalidate if necessary. > + * > + * @detach_domain is true when no other device in the IOMMU > domain is > + * still attached to this context. IOMMU drivers that > share the PASID > + * table within a domain don't need to clear the PASID > entry when > + * @detach_domain is false, only invalidate the caches. 
> + */ > + void (*detach)(struct device *dev, int pasid, void *ctx, > + bool detach_domain); > + > + /* Invalidate a range of addresses. Cannot sleep. */ > + void (*invalidate)(struct device *dev, int pasid, void *ctx, > + unsigned long vaddr, size_t size); > + > + /* Free a context. Cannot sleep. */ > + void (*release)(void *ctx); > +}; > + > +struct iommu_sva_param { > + u32 min_pasid; > + u32 max_pasid; > + int nr_bonds; > +}; > + > +struct iommu_sva * > +iommu_sva_bind_generic(struct device *dev, struct mm_struct *mm, > + const struct io_mm_ops *ops, void *drvdata); > +void iommu_sva_unbind_generic(struct iommu_sva *handle); > +int iommu_sva_get_pasid_generic(struct iommu_sva *handle); > + > +int iommu_sva_enable(struct device *dev, struct iommu_sva_param > *sva_param); +int iommu_sva_disable(struct device *dev); > +bool iommu_sva_enabled(struct device *dev); > + > +#endif /* _IOMMU_SVA_H */ > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index 3e3528436e0b..c8bd972c1788 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -164,6 +164,7 @@ static struct iommu_param > *iommu_get_dev_param(struct device *dev) return NULL; > > mutex_init(&param->lock); > + mutex_init(&param->sva_lock); > dev->iommu_param = param; > return param; > } > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > index 1739f8a7a4b4..83397ae88d2d 100644 > --- a/include/linux/iommu.h > +++ b/include/linux/iommu.h > @@ -368,6 +368,7 @@ struct iommu_fault_param { > * struct iommu_param - collection of per-device IOMMU data > * > * @fault_param: IOMMU detected device fault reporting data > + * @sva_param: IOMMU parameter for SVA > * > * TODO: migrate other per device data pointers under > iommu_dev_data, e.g. 
> * struct iommu_group *iommu_group; > @@ -376,6 +377,8 @@ struct iommu_fault_param { > struct iommu_param { > struct mutex lock; > struct iommu_fault_param *fault_param; > + struct mutex sva_lock; > + struct iommu_sva_param *sva_param; > }; > > int iommu_device_register(struct iommu_device *iommu); Thanks, Jacob
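The enable/disable lifecycle discussed in this review can be sketched in user space. This is a minimal model only, not the kernel implementation: `toy_dev`, `toy_sva_param`, `toy_sva_enable()` and `toy_sva_disable()` are hypothetical names, `malloc()` stands in for `kmemdup()`, and the `sva_lock` locking of the patch is left out. It only illustrates the error codes the quoted patch returns: -EEXIST on a second enable, -ENODEV when disabling a device that was never enabled, and -EBUSY while bonds remain.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins for iommu_sva_param and the device. */
struct toy_sva_param {
	unsigned int min_pasid;
	unsigned int max_pasid;
	int nr_bonds;
};

struct toy_dev {
	struct toy_sva_param *sva_param; /* NULL while SVA is disabled */
};

/* Duplicate the caller's parameters; refuse a second enable. */
static int toy_sva_enable(struct toy_dev *dev, const struct toy_sva_param *p)
{
	struct toy_sva_param *copy;

	if (dev->sva_param)
		return -EEXIST;

	copy = malloc(sizeof(*copy));
	if (!copy)
		return -ENOMEM;

	*copy = *p;
	dev->sva_param = copy;
	return 0;
}

/* Refuse to disable while any bond is still attached. */
static int toy_sva_disable(struct toy_dev *dev)
{
	if (!dev->sva_param)
		return -ENODEV;
	if (dev->sva_param->nr_bonds)
		return -EBUSY;

	free(dev->sva_param);
	dev->sva_param = NULL;
	return 0;
}
```

Because the parameters are copied, the caller may keep its own `toy_sva_param` on the stack, which mirrors the "duplicated and can be freed" note in the kernel-doc comment quoted above.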
On Mon, 24 Feb 2020 19:23:41 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > When enabling SVA, register the fault handler. Device driver will > register an I/O page fault queue before or after calling > iommu_sva_enable. The fault queue must be flushed before any io_mm is > freed, to make sure that its PASID isn't used in any fault queue, and > can be reallocated. Add iopf_queue_flush() calls in a few strategic > locations. > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > --- > drivers/iommu/Kconfig | 1 + > drivers/iommu/iommu-sva.c | 16 ++++++++++++++++ > 2 files changed, 17 insertions(+) > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index e4a42e1708b4..211684e785ea 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -106,6 +106,7 @@ config IOMMU_DMA > config IOMMU_SVA > bool > select IOASID > + select IOMMU_PAGE_FAULT > select IOMMU_API > select MMU_NOTIFIER > > diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c > index bfd0c477f290..494ca0824e4b 100644 > --- a/drivers/iommu/iommu-sva.c > +++ b/drivers/iommu/iommu-sva.c > @@ -366,6 +366,8 @@ static void io_mm_release(struct mmu_notifier > *mn, struct mm_struct *mm) dev_WARN(dev, "possible leak of PASID %u", > io_mm->pasid); > > + iopf_queue_flush_dev(dev, io_mm->pasid); > + > /* unbind() frees the bond, we just detach it */ > io_mm_detach_locked(bond); > } > @@ -442,11 +444,20 @@ static void iommu_sva_unbind_locked(struct > iommu_bond *bond) > void iommu_sva_unbind_generic(struct iommu_sva *handle) > { > + int pasid; > struct iommu_param *param = handle->dev->iommu_param; > > if (WARN_ON(!param)) > return; > > + /* > + * Caller stopped the device from issuing PASIDs, now make > sure they are > + * out of the fault queue. 
> + */ > + pasid = iommu_sva_get_pasid_generic(handle); > + if (pasid != IOMMU_PASID_INVALID) > + iopf_queue_flush_dev(handle->dev, pasid); > + I have an ordering concern. The caller can only stop the device from issuing page requests, but there will still be in-flight requests inside the IOMMU. If we flush here before clearing the PASID context, new requests might come in before the detach. How about detaching first, then flushing? Then anything that comes after the detach would be a fault, and the flush will be clean. > mutex_lock(&param->sva_lock); > mutex_lock(&iommu_sva_lock); > iommu_sva_unbind_locked(to_iommu_bond(handle)); > @@ -484,6 +495,10 @@ int iommu_sva_enable(struct device *dev, struct > iommu_sva_param *sva_param) goto err_unlock; > } > > + ret = iommu_register_device_fault_handler(dev, > iommu_queue_iopf, dev); > + if (ret) > + goto err_unlock; > + > dev->iommu_param->sva_param = new_param; > mutex_unlock(&param->sva_lock); > return 0; > @@ -521,6 +536,7 @@ int iommu_sva_disable(struct device *dev) > goto out_unlock; > } > > + iommu_unregister_device_fault_handler(dev); > kfree(param->sva_param); > param->sva_param = NULL; > out_unlock: [Jacob Pan]
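The ordering concern raised above can be illustrated with a toy model. It is only a sketch under stated assumptions: `toy_iommu` and its helpers are hypothetical, the fault queue is a plain array, and a single in-flight request is assumed to arrive between the two steps of the unbind. It is not a description of how any particular IOMMU behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_LEN 8

/* Toy model, not kernel code: PASID entry state plus a page request queue. */
struct toy_iommu {
	bool pasid_valid;          /* is the PASID context still installed? */
	int queue[TOY_QUEUE_LEN];  /* PASIDs of pending recoverable requests */
	size_t nr_queued;
};

/* A page request is only queued while the PASID entry is installed. */
static bool toy_page_request(struct toy_iommu *iommu, int pasid)
{
	if (!iommu->pasid_valid || iommu->nr_queued == TOY_QUEUE_LEN)
		return false; /* would be reported as a plain fault instead */

	iommu->queue[iommu->nr_queued++] = pasid;
	return true;
}

static void toy_detach(struct toy_iommu *iommu)
{
	iommu->pasid_valid = false;
}

static void toy_flush(struct toy_iommu *iommu)
{
	iommu->nr_queued = 0;
}

/* Returns how many stale requests remain queued once unbind is done. */
static size_t toy_unbind(struct toy_iommu *iommu, int pasid, bool detach_first)
{
	if (detach_first) {
		toy_detach(iommu);
		toy_page_request(iommu, pasid); /* in-flight request: rejected */
		toy_flush(iommu);
	} else {
		toy_flush(iommu);
		toy_page_request(iommu, pasid); /* in-flight request: slips in */
		toy_detach(iommu);
	}
	return iommu->nr_queued;
}
```

In this model, flushing first leaves one request for the PASID in the queue after unbind, while detaching first leaves none, which is the property needed before the PASID can be reallocated.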
On Mon, 24 Feb 2020 19:23:42 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > To enable address space sharing with the IOMMU, introduce mm_context_get() > and mm_context_put(), that pin down a context and ensure that it will keep > its ASID after a rollover. Export the symbols to let the modular SMMUv3 > driver use them. > > Pinning is necessary because a device constantly needs a valid ASID, > unlike tasks that only require one when running. Without pinning, we would > need to notify the IOMMU when we're about to use a new ASID for a task, > and it would get complicated when a new task is assigned a shared ASID. > Consider the following scenario with no ASID pinned: > > 1. Task t1 is running on CPUx with shared ASID (gen=1, asid=1) > 2. Task t2 is scheduled on CPUx, gets ASID (1, 2) > 3. Task tn is scheduled on CPUy, a rollover occurs, tn gets ASID (2, 1) > We would now have to immediately generate a new ASID for t1, notify > the IOMMU, and finally enable task tn. We are holding the lock during > all that time, since we can't afford having another CPU trigger a > rollover. The IOMMU issues invalidation commands that can take tens of > milliseconds. > > It gets needlessly complicated. All we wanted to do was schedule task tn, > that has no business with the IOMMU. By letting the IOMMU pin tasks when > needed, we avoid stalling the slow path, and let the pinning fail when > we're out of shareable ASIDs. > > After a rollover, the allocator expects at least one ASID to be available > in addition to the reserved ones (one per CPU). So (NR_ASIDS - NR_CPUS - > 1) is the maximum number of ASIDs that can be shared with the IOMMU. > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> A few more trivial points. 
Thanks, Jonathan > --- > v2->v4: handle KPTI > --- > arch/arm64/include/asm/mmu.h | 1 + > arch/arm64/include/asm/mmu_context.h | 11 ++- > arch/arm64/mm/context.c | 103 +++++++++++++++++++++++++-- > 3 files changed, 109 insertions(+), 6 deletions(-) > > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h > index e4d862420bb4..70ac3d4cbd3e 100644 > --- a/arch/arm64/include/asm/mmu.h > +++ b/arch/arm64/include/asm/mmu.h > @@ -18,6 +18,7 @@ > > typedef struct { > atomic64_t id; > + unsigned long pinned; > void *vdso; > unsigned long flags; > } mm_context_t; > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h > index 3827ff4040a3..70715c10c02a 100644 > --- a/arch/arm64/include/asm/mmu_context.h > +++ b/arch/arm64/include/asm/mmu_context.h > @@ -175,7 +175,13 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp) > #define destroy_context(mm) do { } while(0) > void check_and_switch_context(struct mm_struct *mm, unsigned int cpu); > > -#define init_new_context(tsk,mm) ({ atomic64_set(&(mm)->context.id, 0); 0; }) > +static inline int > +init_new_context(struct task_struct *tsk, struct mm_struct *mm) > +{ > + atomic64_set(&mm->context.id, 0); > + mm->context.pinned = 0; > + return 0; > +} > > #ifdef CONFIG_ARM64_SW_TTBR0_PAN > static inline void update_saved_ttbr0(struct task_struct *tsk, > @@ -248,6 +254,9 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next, > void verify_cpu_asid_bits(void); > void post_ttbr_update_workaround(void); > > +unsigned long mm_context_get(struct mm_struct *mm); > +void mm_context_put(struct mm_struct *mm); > + > #endif /* !__ASSEMBLY__ */ > > #endif /* !__ASM_MMU_CONTEXT_H */ > diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c > index 121aba5b1941..5558de88b67d 100644 > --- a/arch/arm64/mm/context.c > +++ b/arch/arm64/mm/context.c > @@ -26,6 +26,10 @@ static DEFINE_PER_CPU(atomic64_t, active_asids); > static DEFINE_PER_CPU(u64, reserved_asids); > static cpumask_t 
tlb_flush_pending; > > +static unsigned long max_pinned_asids; > +static unsigned long nr_pinned_asids; > +static unsigned long *pinned_asid_map; > + > #define ASID_MASK (~GENMASK(asid_bits - 1, 0)) > #define ASID_FIRST_VERSION (1UL << asid_bits) > > @@ -73,6 +77,9 @@ void verify_cpu_asid_bits(void) > > static void set_kpti_asid_bits(void) > { > + unsigned int k; > + u8 *dst = (u8 *)asid_map; > + u8 *src = (u8 *)pinned_asid_map; > unsigned int len = BITS_TO_LONGS(NUM_USER_ASIDS) * sizeof(unsigned long); > /* > * In case of KPTI kernel/user ASIDs are allocated in > @@ -80,7 +87,8 @@ static void set_kpti_asid_bits(void) > * is set, then the ASID will map only userspace. Thus > * mark even as reserved for kernel. > */ > - memset(asid_map, 0xaa, len); > + for (k = 0; k < len; k++) > + dst[k] = src[k] | 0xaa; > } > > static void set_reserved_asid_bits(void) > @@ -88,9 +96,12 @@ static void set_reserved_asid_bits(void) > if (arm64_kernel_unmapped_at_el0()) > set_kpti_asid_bits(); > else > - bitmap_clear(asid_map, 0, NUM_USER_ASIDS); > + bitmap_copy(asid_map, pinned_asid_map, NUM_USER_ASIDS); > } > > +#define asid_gen_match(asid) \ > + (!(((asid) ^ atomic64_read(&asid_generation)) >> asid_bits)) > + I'd have slightly preferred this bit of refactoring as a precursor patch. > static void flush_context(void) > { > int i; > @@ -161,6 +172,14 @@ static u64 new_context(struct mm_struct *mm) > if (check_update_reserved_asid(asid, newasid)) > return newasid; > > + /* > + * If it is pinned, we can keep using it. Note that reserved > + * takes priority, because even if it is also pinned, we need to > + * update the generation into the reserved_asids. > + */ > + if (mm->context.pinned) > + return newasid; > + > /* > * We had a valid ASID in a previous life, so try to re-use > * it if possible. > @@ -219,8 +238,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) > * because atomic RmWs are totally ordered for a given location. 
> */ > old_active_asid = atomic64_read(&per_cpu(active_asids, cpu)); > - if (old_active_asid && > - !((asid ^ atomic64_read(&asid_generation)) >> asid_bits) && > + if (old_active_asid && asid_gen_match(asid) && > atomic64_cmpxchg_relaxed(&per_cpu(active_asids, cpu), > old_active_asid, asid)) > goto switch_mm_fastpath; > @@ -228,7 +246,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) > raw_spin_lock_irqsave(&cpu_asid_lock, flags); > /* Check that our ASID belongs to the current generation. */ > asid = atomic64_read(&mm->context.id); > - if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) { > + if (!asid_gen_match(asid)) { > asid = new_context(mm); > atomic64_set(&mm->context.id, asid); > } > @@ -251,6 +269,68 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu) > cpu_switch_mm(mm->pgd, mm); > } > > +unsigned long mm_context_get(struct mm_struct *mm) > +{ > + unsigned long flags; > + u64 asid; > + > + raw_spin_lock_irqsave(&cpu_asid_lock, flags); > + > + asid = atomic64_read(&mm->context.id); > + > + if (mm->context.pinned) { > + mm->context.pinned++; > + asid &= ~ASID_MASK; > + goto out_unlock; > + } > + > + if (nr_pinned_asids >= max_pinned_asids) { > + asid = 0; > + goto out_unlock; > + } > + > + if (!asid_gen_match(asid)) { > + /* > + * We went through one or more rollover since that ASID was > + * used. Ensure that it is still valid, or generate a new one. 
> + */ > + asid = new_context(mm); > + atomic64_set(&mm->context.id, asid); > + } > + > + asid &= ~ASID_MASK; > + > + nr_pinned_asids++; > + __set_bit(asid2idx(asid), pinned_asid_map); > + mm->context.pinned++; > + > +out_unlock: > + raw_spin_unlock_irqrestore(&cpu_asid_lock, flags); > + > + /* Set the equivalent of USER_ASID_BIT */ > + if (asid && IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0)) > + asid |= 1; > + > + return asid; > +} > +EXPORT_SYMBOL_GPL(mm_context_get); > + > +void mm_context_put(struct mm_struct *mm) > +{ > + unsigned long flags; > + u64 asid = atomic64_read(&mm->context.id) & ~ASID_MASK; > + > + raw_spin_lock_irqsave(&cpu_asid_lock, flags); > + > + if (--mm->context.pinned == 0) { > + __clear_bit(asid2idx(asid), pinned_asid_map); > + nr_pinned_asids--; > + } > + > + raw_spin_unlock_irqrestore(&cpu_asid_lock, flags); > +} > +EXPORT_SYMBOL_GPL(mm_context_put); > + > /* Errata workaround post TTBRx_EL1 update. */ > asmlinkage void post_ttbr_update_workaround(void) > { > @@ -279,6 +359,19 @@ static int asids_init(void) > panic("Failed to allocate bitmap for %lu ASIDs\n", > NUM_USER_ASIDS); > > + pinned_asid_map = kcalloc(BITS_TO_LONGS(NUM_USER_ASIDS), > + sizeof(*pinned_asid_map), GFP_KERNEL); > + if (!pinned_asid_map) > + panic("Failed to allocate pinned bitmap\n"); Perhaps "Failed to allocate pinned asid bitmap\n" > + > + /* > + * We assume that an ASID is always available after a rollover. This > + * means that even if all CPUs have a reserved ASID, there still is at > + * least one slot available in the asid map. > + */ > + max_pinned_asids = num_available_asids - num_possible_cpus() - 1; > + nr_pinned_asids = 0; > + > /* > * We cannot call set_reserved_asid_bits() here because CPU > * caps are not finalized yet, so it is safer to assume KPTI
On Mon, 24 Feb 2020 19:23:58 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > The SMMU provides a Stall model for handling page faults in platform > devices. It is similar to PCI PRI, but doesn't require devices to have > their own translation cache. Instead, faulting transactions are parked and > the OS is given a chance to fix the page tables and retry the transaction. > > Enable stall for devices that support it (opt-in by firmware). When an > event corresponds to a translation error, call the IOMMU fault handler. If > the fault is recoverable, it will call us back to terminate or continue > the stall. > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> One question inline. Thanks, > --- > drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++-- > drivers/iommu/of_iommu.c | 5 +- > include/linux/iommu.h | 2 + > 3 files changed, 269 insertions(+), 9 deletions(-) > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c > index 6a5987cce03f..da5dda5ba26a 100644 > --- a/drivers/iommu/arm-smmu-v3.c > +++ b/drivers/iommu/arm-smmu-v3.c > @@ -374,6 +374,13 @@ > +/* > + * arm_smmu_flush_evtq - wait until all events currently in the queue have been > + * consumed. > + * > + * Wait until the evtq thread finished a batch, or until the queue is empty. > + * Note that we don't handle overflows on q->batch. If it occurs, just wait for > + * the queue to be empty. > + */ > +static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid) > +{ > + int ret; > + u64 batch; > + struct arm_smmu_device *smmu = cookie; > + struct arm_smmu_queue *q = &smmu->evtq.q; > + > + spin_lock(&q->wq.lock); > + if (queue_sync_prod_in(q) == -EOVERFLOW) > + dev_err(smmu->dev, "evtq overflow detected -- requests lost\n"); > + > + batch = q->batch; So this is trying to be sure we have advanced the queue 2 spots? Is there a potential race here? 
q->batch could have updated before we take a local copy. > + ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) || > + q->batch >= batch + 2); > + spin_unlock(&q->wq.lock); > + > + return ret; > +} > + ...
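One possible reading of the `batch + 2` condition, offered as an assumption rather than as the author's stated rationale: if a batch is already in progress when the counter is sampled, that batch may have started before the caller's events were queued, so only the following batch is guaranteed to have started after the flush request. The toy timeline below checks that reading. `toy_batch`, `toy_batch_count()` and `toy_full_batch_seen()` are hypothetical names, and batches are assumed to be sorted and non-overlapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy timeline: each batch of the consumer thread has a start and end time. */
struct toy_batch {
	unsigned long start;
	unsigned long end;
};

/* Value of the batch counter at time t: number of batches finished by then. */
static size_t toy_batch_count(const struct toy_batch *b, size_t n,
			      unsigned long t)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++)
		if (b[i].end <= t)
			done++;
	return done;
}

/*
 * After waiting for the counter to reach snapshot + delta, has at least one
 * whole batch started at or after flush_time?
 */
static bool toy_full_batch_seen(const struct toy_batch *b, size_t n,
				unsigned long flush_time, size_t delta)
{
	size_t snapshot = toy_batch_count(b, n, flush_time);
	size_t target = snapshot + delta;

	if (target > n)
		return false; /* the waiter would still be blocked */

	/* Batches snapshot .. target - 1 are the ones that finished meanwhile. */
	for (size_t i = snapshot; i < target; i++)
		if (b[i].start >= flush_time)
			return true;
	return false;
}
```

With a flush requested in the middle of the first batch, waiting for one increment only covers the batch that was already running, while waiting for two covers a batch that began after the request.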
On Mon, 24 Feb 2020 19:23:35 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > Shared Virtual Addressing (SVA) allows to share process page tables with > devices using the IOMMU. Add a generic implementation of the IOMMU SVA > API, and add support in the Arm SMMUv3 driver. > > Previous versions of this patchset were sent over a year ago [1][2] but > we've made a lot of progress since then: > > * ATS support for SMMUv3 was merged in v5.2. > * The bind() and fault reporting APIs have been merged in v5.3. > * IOASID were added in v5.5. > * SMMUv3 PASID was added in v5.6, with some pending for v5.7. > > * The first user of the bind() API will be merged in v5.7 [3]. The zip > accelerator is also the first piece of hardware that I've been able to > use for testing (previous versions were developed with software models) > and I now have tools for evaluating SVA performance. Unfortunately I > still don't have hardware that supports ATS and PRI; the zip accelerator > uses stall. > > These are the remaining changes for SVA support in SMMUv3. Since v3 [1] > I fixed countless bugs and - I think - addressed everyone's comments. > Thanks to recent MMU notifier rework, iommu-sva.c is a lot more > straightforward. I'm still unhappy with the complicated locking in the > SMMUv3 driver resulting from patch 12 (Seize private ASID), but I > haven't found anything better. > > Please find all SVA patches on branches sva/current and sva/zip-devel at > https://jpbrucker.net/git/linux > > [1] https://lore.kernel.org/linux-iommu/20180920170046.20154-1-jean-philippe.brucker@arm.com/ > [2] https://lore.kernel.org/linux-iommu/20180511190641.23008-1-jean-philippe.brucker@arm.com/ > [3] https://lore.kernel.org/linux-iommu/1581407665-13504-1-git-send-email-zhangfei.gao@linaro.org/ Hi Jean-Phillippe. Great to see this progressing. Other than the few places I've commented it all looks good to me. 
Thanks, Jonathan > > Jean-Philippe Brucker (26): > mm/mmu_notifiers: pass private data down to alloc_notifier() > iommu/sva: Manage process address spaces > iommu: Add a page fault handler > iommu/sva: Search mm by PASID > iommu/iopf: Handle mm faults > iommu/sva: Register page fault handler > arm64: mm: Pin down ASIDs for sharing mm with devices > iommu/io-pgtable-arm: Move some definitions to a header > iommu/arm-smmu-v3: Manage ASIDs with xarray > arm64: cpufeature: Export symbol read_sanitised_ftr_reg() > iommu/arm-smmu-v3: Share process page tables > iommu/arm-smmu-v3: Seize private ASID > iommu/arm-smmu-v3: Add support for VHE > iommu/arm-smmu-v3: Enable broadcast TLB maintenance > iommu/arm-smmu-v3: Add SVA feature checking > iommu/arm-smmu-v3: Add dev_to_master() helper > iommu/arm-smmu-v3: Implement mm operations > iommu/arm-smmu-v3: Hook up ATC invalidation to mm ops > iommu/arm-smmu-v3: Add support for Hardware Translation Table Update > iommu/arm-smmu-v3: Maintain a SID->device structure > iommu/arm-smmu-v3: Ratelimit event dump > dt-bindings: document stall property for IOMMU masters > iommu/arm-smmu-v3: Add stall support for platform devices > PCI/ATS: Add PRI stubs > PCI/ATS: Export symbols of PRI functions > iommu/arm-smmu-v3: Add support for PRI > > .../devicetree/bindings/iommu/iommu.txt | 18 + > arch/arm64/include/asm/mmu.h | 1 + > arch/arm64/include/asm/mmu_context.h | 11 +- > arch/arm64/kernel/cpufeature.c | 1 + > arch/arm64/mm/context.c | 103 +- > drivers/iommu/Kconfig | 13 + > drivers/iommu/Makefile | 2 + > drivers/iommu/arm-smmu-v3.c | 1354 +++++++++++++++-- > drivers/iommu/io-pgfault.c | 533 +++++++ > drivers/iommu/io-pgtable-arm.c | 27 +- > drivers/iommu/io-pgtable-arm.h | 30 + > drivers/iommu/iommu-sva.c | 596 ++++++++ > drivers/iommu/iommu-sva.h | 64 + > drivers/iommu/iommu.c | 1 + > drivers/iommu/of_iommu.c | 5 +- > drivers/misc/sgi-gru/grutlbpurge.c | 4 +- > drivers/pci/ats.c | 4 + > include/linux/iommu.h | 73 + > 
include/linux/mmu_notifier.h | 10 +- > include/linux/pci-ats.h | 8 + > mm/mmu_notifier.c | 6 +- > 21 files changed, 2699 insertions(+), 165 deletions(-) > create mode 100644 drivers/iommu/io-pgfault.c > create mode 100644 drivers/iommu/io-pgtable-arm.h > create mode 100644 drivers/iommu/iommu-sva.c > create mode 100644 drivers/iommu/iommu-sva.h >
Subject could be simply "PCI/ATS: Export PRI functions" On Mon, Feb 24, 2020 at 07:24:00PM +0100, Jean-Philippe Brucker wrote: > The SMMUv3 driver uses pci_{enable,disable}_pri() and related > functions. Export those functions to allow the driver to be built as a > module. > > Cc: Bjorn Helgaas <bhelgaas@google.com> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> > --- > drivers/pci/ats.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c > index bbfd0d42b8b9..fc8fc6fc8bd5 100644 > --- a/drivers/pci/ats.c > +++ b/drivers/pci/ats.c > @@ -197,6 +197,7 @@ void pci_pri_init(struct pci_dev *pdev) > if (status & PCI_PRI_STATUS_PASID) > pdev->pasid_required = 1; > } > +EXPORT_SYMBOL_GPL(pci_pri_init); > > /** > * pci_enable_pri - Enable PRI capability > @@ -243,6 +244,7 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs) > > return 0; > } > +EXPORT_SYMBOL_GPL(pci_enable_pri); > > /** > * pci_disable_pri - Disable PRI capability > @@ -322,6 +324,7 @@ int pci_reset_pri(struct pci_dev *pdev) > > return 0; > } > +EXPORT_SYMBOL_GPL(pci_reset_pri); > > /** > * pci_prg_resp_pasid_required - Return PRG Response PASID Required bit > @@ -337,6 +340,7 @@ int pci_prg_resp_pasid_required(struct pci_dev *pdev) > > return pdev->pasid_required; > } > +EXPORT_SYMBOL_GPL(pci_prg_resp_pasid_required); > #endif /* CONFIG_PCI_PRI */ > > #ifdef CONFIG_PCI_PASID > -- > 2.25.0 >
On Mon, Feb 24, 2020 at 07:23:59PM +0100, Jean-Philippe Brucker wrote: > The SMMUv3 driver, which can be built without CONFIG_PCI, will soon gain > support for PRI. Partially revert commit c6e9aefbf9db ("PCI/ATS: Remove > unused PRI and PASID stubs") to re-introduce the PRI stubs, and avoid > adding more #ifdefs to the SMMU driver. > > Cc: Bjorn Helgaas <bhelgaas@google.com> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> > --- > include/linux/pci-ats.h | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h > index f75c307f346d..e9e266df9b37 100644 > --- a/include/linux/pci-ats.h > +++ b/include/linux/pci-ats.h > @@ -28,6 +28,14 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs); > void pci_disable_pri(struct pci_dev *pdev); > int pci_reset_pri(struct pci_dev *pdev); > int pci_prg_resp_pasid_required(struct pci_dev *pdev); > +#else /* CONFIG_PCI_PRI */ > +static inline int pci_enable_pri(struct pci_dev *pdev, u32 reqs) > +{ return -ENODEV; } > +static inline void pci_disable_pri(struct pci_dev *pdev) { } > +static inline int pci_reset_pri(struct pci_dev *pdev) > +{ return -ENODEV; } > +static inline int pci_prg_resp_pasid_required(struct pci_dev *pdev) > +{ return 0; } > #endif /* CONFIG_PCI_PRI */ > > #ifdef CONFIG_PCI_PASID > -- > 2.25.0 >
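The stub pattern that this patch restores can be sketched outside the kernel as follows. The names are hypothetical (`TOY_CONFIG_PCI_PRI`, `toy_pci_dev`, `toy_enable_pri()`, `toy_setup_faults()`), and treating -ENODEV as "carry on without PRI" is an assumption made for the example, not something the patch mandates. The point is only that the caller compiles unchanged whether or not the feature is built in.

```c
#include <assert.h>
#include <errno.h>

struct toy_pci_dev {
	int pri_enabled;
};

#ifdef TOY_CONFIG_PCI_PRI
/* Real implementation lives elsewhere when the feature is built in. */
int toy_enable_pri(struct toy_pci_dev *pdev, unsigned int reqs);
#else
/* Stub: same prototype, always reports that PRI is not available. */
static inline int toy_enable_pri(struct toy_pci_dev *pdev, unsigned int reqs)
{
	(void)pdev;
	(void)reqs;
	return -ENODEV;
}
#endif

/* The caller needs no #ifdef of its own. */
static int toy_setup_faults(struct toy_pci_dev *pdev)
{
	int ret = toy_enable_pri(pdev, 32);

	if (ret == -ENODEV)
		return 0; /* assumed policy: continue without PRI */
	if (ret)
		return ret;

	pdev->pri_enabled = 1;
	return 0;
}
```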
On Tue, Feb 25, 2020 at 10:08:14AM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 25, 2020 at 10:24:39AM +0100, Jean-Philippe Brucker wrote:
> > On Mon, Feb 24, 2020 at 03:00:56PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> > > > The new allocation scheme introduced by 2c7933f53f6b ("mm/mmu_notifiers:
> > > > add a get/put scheme for the registration") provides a convenient way
> > > > for users to attach notifier data to an mm. However, it would be even
> > > > better to create this notifier data atomically.
> > > >
> > > > Since the alloc_notifier() callback only takes an mm argument at the
> > > > moment, some users have to perform the allocation in two steps.
> > > > alloc_notifier() initially creates an incomplete structure, which is
> > > > then finalized using more context once mmu_notifier_get() returns. This
> > > > second step requires carrying an initialization lock in the notifier
> > > > data and playing dirty tricks to order memory accesses against live
> > > > invalidation.
> > >
> > > This was the intended pattern. There shouldn't be a real issue as
> > > there shouldn't be any data on which to invalidate, i.e. the later patch
> > > does:
> > >
> > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head)
> > >
> > > And that list is empty post-allocation, so no 'dirty tricks' required.
> >
> > Before introducing this patch I had the following code:
> >
> > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > +		/*
> > +		 * To ensure that we observe the initialization of io_mm fields
> > +		 * by io_mm_finalize() before the registration of this bond to
> > +		 * the list by io_mm_attach(), introduce an address dependency
> > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > +		 * from list_add_rcu().
> > +		 */
> > +		io_mm = rcu_dereference(bond->io_mm);
>
> A rcu_dereference isn't needed here, just a normal dereference is fine.

bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
which does bond->io_mm under rcu_read_lock())

>
> > +		io_mm->ops->invalidate(bond->sva.dev, io_mm->pasid, io_mm->ctx,
> > +				       start, end - start);
> > +	}
> >
> > (1) io_mm_get() would obtain an empty io_mm from iommu_notifier_get().
> > (2) then io_mm_finalize() would initialize io_mm->ops, io_mm->ctx, etc.
> > (3) finally io_mm_attach() would add the bond to io_mm->devices.
> >
> > Since the above code can run before (2) it needs to observe valid
> > io_mm->ctx, io_mm->ops initialized by (2) after obtaining the bond
> > initialized by (3). Which I believe requires the address dependency from
> > the rcu_dereference() above or some stronger barrier to pair with the
> > list_add_rcu().
>
> The list_for_each_entry_rcu() is an acquire that already pairs with
> the release in list_add_rcu(), all you need is a data dependency chain
> starting on bond to be correct on ordering.
>
> But this is super tricky :\
>
> > If io_mm->ctx and io_mm->ops are already valid before the
> > mmu notifier is published, then we don't need that stuff.
>
> So, this trickiness with RCU is not a bad reason to introduce the priv
> scheme, maybe explain it in the commit message?

Ok, I've added this to the commit message:

    The IOMMU SVA module, which attaches an mm to multiple devices,
    exemplifies this situation. In essence it does:

	mmu_notifier_get()
	  alloc_notifier()
	     A = kzalloc()
	  /* MMU notifier is published */
	A->ctx = ctx;				// (1)
	device->A = A;
	list_add_rcu(device, A->devices);	// (2)

    The invalidate notifier, which may start running before A is fully
    initialized at (1), does the following:

	io_mm_invalidate(A)
	  list_for_each_entry_rcu(device, A->devices)
	    A = device->A;			// (3)
	    device->invalidate(A->ctx)

    To ensure that an invalidate() thread observes write (1) before (2),
    it needs the address dependency (3). The resulting code is subtle and
    difficult to understand. If instead we fully initialize object A before
    publishing the MMU notifier, we don't need the complexity added by (3).

I'll try to improve the wording before sending the next version.

Thanks,
Jean
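The "initialize fully, then publish" idea from the commit message above can be sketched in user space with C11 atomics. This is an analogy only: `toy_io_mm`, `toy_published`, `toy_publish()` and `toy_reader_ctx()` are hypothetical names, and a release store paired with an acquire load stands in for whatever ordering the MMU notifier registration provides in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-mm notifier data. */
struct toy_io_mm {
	int pasid;
	void *ctx;
};

/* Slot that a concurrent reader would poll; NULL means "not published". */
static _Atomic(struct toy_io_mm *) toy_published;

/*
 * Fill in every field first, then publish with a release store. A reader
 * that sees the pointer is then guaranteed to see the initialized fields.
 */
static void toy_publish(struct toy_io_mm *io_mm, int pasid, void *ctx)
{
	io_mm->pasid = pasid;
	io_mm->ctx = ctx;
	atomic_store_explicit(&toy_published, io_mm, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static void *toy_reader_ctx(void)
{
	struct toy_io_mm *io_mm =
		atomic_load_explicit(&toy_published, memory_order_acquire);

	return io_mm ? io_mm->ctx : NULL;
}
```

Since the object is complete before it becomes visible, the reader needs no second dependency through another structure, which is the simplification the commit message argues for.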
On Wed, Feb 26, 2020 at 11:13:20AM -0800, Jacob Pan wrote: > Hi Jean, > > A few comments inline. I am also trying to converge to the common sva > APIs. I sent out the first step w/o iopage fault and the generic ops > you have here. Great, thanks for sending it out, it's on my list to look at > On Mon, 24 Feb 2020 19:23:37 +0100 > Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > > > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > > > Add a small library to help IOMMU drivers manage process address > > spaces bound to their devices. Register an MMU notifier to track > > modification on each address space bound to one or more devices. > > > > IOMMU drivers must implement the io_mm_ops and can then use the > > helpers provided by this library to easily implement the SVA API > > introduced by commit 26b25a2b98e4. The io_mm_ops are: > > > > void *alloc(struct mm_struct *) > > Allocate a PASID context private to the IOMMU driver. There is a > > single context per mm. IOMMU drivers may perform arch-specific > > operations in there, for example pinning down a CPU ASID (on Arm). > > > > int attach(struct device *, int pasid, void *ctx, bool attach_domain) > > Attach a context to the device, by setting up the PASID table entry. > > > > int invalidate(struct device *, int pasid, void *ctx, > > unsigned long vaddr, size_t size) > > Invalidate TLB entries for this address range. > > > > int detach(struct device *, int pasid, void *ctx, bool detach_domain) > > Detach a context from the device, by clearing the PASID table entry > > and invalidating cached entries. > > > > void free(void *ctx) > you meant release()? Yes [...] > > +/** > > + * DOC: io_mm model > > + * > > + * The io_mm keeps track of process address spaces shared between > > CPU and IOMMU. > > + * The following example illustrates the relation between structures > > + * iommu_domain, io_mm and iommu_sva. The iommu_sva struct is a bond > > between > > + * io_mm and device. 
A device can have multiple io_mm and an io_mm > > may be bound > > + * to multiple devices. > > + * ___________________________ > > + * | IOMMU domain A | > > + * | ________________ | > > + * | | IOMMU group | +------- io_pgtables > > + * | | | | > > + * | | dev 00:00.0 ----+------- bond 1 --- io_mm X > > + * | |________________| \ | > > + * | '----- bond 2 ---. > > + * |___________________________| \ > > + * ___________________________ \ > > + * | IOMMU domain B | io_mm Y > > + * | ________________ | / / > > + * | | IOMMU group | | / / > > + * | | | | / / > > + * | | dev 00:01.0 ------------ bond 3 -' / > > + * | | dev 00:01.1 ------------ bond 4 --' > > + * | |________________| | > > + * | +------- io_pgtables > > + * |___________________________| > > + * > > + * In this example, device 00:00.0 is in domain A, devices 00:01.* > > are in domain > > + * B. All devices within the same domain access the same address > > spaces. > Hmm, devices in domain A has access to both X & Y, isn't it > contradictory? I guess it's unclear, this is meant to explain that any device in domain B for example, would access all address spaces bound to any other device in that domain. > > > Device > > + * 00:00.0 accesses address spaces X and Y, each corresponding to an > > mm_struct. > > + * Devices 00:01.* only access address space Y. In addition each > > + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable, > > that is > > + * managed with iommu_map()/iommu_unmap(), and isn't shared with the > > CPU MMU. > So this would allow IOVA and SVA co-exist in the same address space? Hmm, not in the same address space, but they can co-exist in a device. In fact the endpoint I'm testing (hisi zip accelerator) already needs normal DMA alongside SVA for queue management. This one is integrated on an Arm-based platform so shouldn't be a concern for VT-d at the moment, but I suspect we might see more of this kind of device with mixed DMA. 
In addition on Arm MSI addresses are translated by the IOMMU, and since they are requests w/o PASID they need the private address space on entry 0. Are you not planning to use the RID_PASID entry of Scalable-Mode Context-Entry in VT-d? > I guess this is the PASID 0 for DMA request w/o PASID. If that is the > case, perhaps needs more explanation since the private address space > also has a private PASID within the domain. The last sentence refers to this private address space used for requests w/o PASID. I don't like referring to it as "PASID 0" since it might be more confusing. It's entry 0 of the PASID table reserved for requests without PASID. I think I should just remove this here sentence and try to make the last paragraph of the comment, which referes to the same thing, clearer. I'll also drop io_pgtables from the above diagram to keep things on point. > > + * > > + * To obtain the above configuration, users would for instance issue > > the > > + * following calls: > > + * > > + * iommu_sva_bind_device(dev 00:00.0, mm X, ...) -> bond 1 > > + * iommu_sva_bind_device(dev 00:00.0, mm Y, ...) -> bond 2 > > + * iommu_sva_bind_device(dev 00:01.0, mm Y, ...) -> bond 3 > > + * iommu_sva_bind_device(dev 00:01.1, mm Y, ...) -> bond 4 > > + * > > + * A single Process Address Space ID (PASID) is allocated for each > > mm. In the > > + * example, devices use PASID 1 to read/write into address space X > > and PASID 2 > > + * to read/write into address space Y. Calling iommu_sva_get_pasid() > > on bond 1 > > + * returns 1, and calling it on bonds 2-4 returns 2. > > + * > > + * Hardware tables describing this configuration in the IOMMU would > > typically > > + * look like this: > > + * > > + * PASID tables > > + * of domain A > > + * .->+--------+ > > + * / 0 | |-------> io_pgtable > > + * / +--------+ > > + * Device tables / 1 | |-------> pgd X > > + * +--------+ / +--------+ > > + * 00:00.0 | A |-' 2 | |--. 
> > + * +--------+ +--------+ \ > > + * : : 3 | | \ > > + * +--------+ +--------+ --> pgd Y > > + * 00:01.0 | B |--. / > > + * +--------+ \ | > > + * 00:01.1 | B |----+ PASID tables | > > + * +--------+ \ of domain B | > > + * '->+--------+ | > > + * 0 | |-- | --> io_pgtable > > + * +--------+ | > > + * 1 | | | > > + * +--------+ | > > + * 2 | |---' > > + * +--------+ > > + * 3 | | > > + * +--------+ > > + * > > + * With this model, a single call binds all devices in a given > > domain to an > > + * address space. Other devices in the domain will get the same bond > > implicitly. > > + * However, users must issue one bind() for each device, because > > IOMMUs may > > + * implement SVA differently. Furthermore, mandating one bind() per > > device > > + * allows the driver to perform sanity-checks on device capabilities. > > + * > > + * In some IOMMUs, one entry of the PASID table (typically the first > > one) can > > + * hold non-PASID translations. In this case PASID 0 is reserved and > > the first > > + * entry points to the io_pgtable pointer. In other IOMMUs the > > io_pgtable > > + * pointer is held in the device table and PASID 0 is available to > > the > > + * allocator. > > + */ [...] > > +static struct iommu_sva * > > +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata) > > +{ > > + int ret = 0; > > + bool attach_domain = true; > > + struct iommu_bond *bond, *tmp; > > + struct iommu_domain *domain, *other; > > + struct iommu_sva_param *param = dev->iommu_param->sva_param; > > + > > + domain = iommu_get_domain_for_dev(dev); > > + > > + bond = kzalloc(sizeof(*bond), GFP_KERNEL); > > + if (!bond) > > + return ERR_PTR(-ENOMEM); > > + > > + bond->sva.dev = dev; > > + bond->drvdata = drvdata; > > + refcount_set(&bond->refs, 1); > > + RCU_INIT_POINTER(bond->io_mm, io_mm); > > + > > + mutex_lock(&iommu_sva_lock); > > + /* Is it already bound to the device or domain? 
*/ > > + list_for_each_entry(tmp, &io_mm->devices, mm_head) { > > + if (tmp->sva.dev != dev) { > > + other = > > iommu_get_domain_for_dev(tmp->sva.dev); > > + if (domain == other) > > + attach_domain = false; > > + > > + continue; > At this point, we already know this is a new device trying to attach to > one of io_mm's existing domains. > > So there is no need to continue > checking, right? Perhaps check like this? > - if (tmp->sva.dev != dev) { > + if (tmp->sva.dev != dev && attach_domain) { That doesn't seem right, we need the 'continue'. I'll turn this around into 'if (tmp->sva.dev == dev)' to make things more readable. > > + } > > + > > + if (WARN_ON(tmp->drvdata != drvdata)) { > > + ret = -EINVAL; > > + goto err_free; > > + } > > + > > + /* > > + * Hold a single io_mm reference per bond. Note that > > we can't > > + * return an error after this, otherwise the caller > > would drop > > + * an additional reference to the io_mm. > > + */ > > + refcount_inc(&tmp->refs); > > + io_mm_put(io_mm); > > + kfree(bond); > Can bond be allocated after searching for existing bond or domain? If > so, we can avoid free bond here. Yes, and I think we can simplify the whole function further. I think I wrote it that way to have the kzalloc() be outside iommu_sva_lock, back when it was a spinlock. > > + mutex_unlock(&iommu_sva_lock); > > + return &tmp->sva; > > + } > > + > > + list_add_rcu(&bond->mm_head, &io_mm->devices); > > + param->nr_bonds++; > > + mutex_unlock(&iommu_sva_lock); > > + > > + ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, > > io_mm->ctx, > > + attach_domain); > For VT-d, if a device trying to do SVA bind, there would not be a DMA > domain. SVA should own the entire address space, no IOVA. Do you mean PASID table rather than address space? > So this > attach() call is for VT-d driver to setup the first PASID table entry > regardless attach_domain is true or false? Yes ignoring the attach_domain parameter should be fine (more below). [...] 
> > +static void iommu_sva_unbind_locked(struct iommu_bond *bond)
> > +{
> > +	struct device *dev = bond->sva.dev;
> > +	struct iommu_sva_param *param = dev->iommu_param->sva_param;
> > +
> > +	if (!refcount_dec_and_test(&bond->refs))
> > +		return;
> > +
> dont you need to free bond here?

We free it in the rcu callback below

> > +	io_mm_detach_locked(bond);
> > +	param->nr_bonds--;
> > +	kfree_rcu(bond, rcu_head);
> > +}
> > +
> > +void iommu_sva_unbind_generic(struct iommu_sva *handle)
> > +{
> > +	struct iommu_param *param = handle->dev->iommu_param;
> > +
> > +	if (WARN_ON(!param))
> > +		return;
> > +
> > +	mutex_lock(&param->sva_lock);
> > +	mutex_lock(&iommu_sva_lock);
> > +	iommu_sva_unbind_locked(to_iommu_bond(handle));
> > +	mutex_unlock(&iommu_sva_lock);
> > +	mutex_unlock(&param->sva_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_sva_unbind_generic);
> > +
> > +/**
> > + * iommu_sva_enable() - Enable Shared Virtual Addressing for a device
> > + * @dev: the device
> > + * @sva_param: the parameters.
> > + *
> > + * Called by an IOMMU driver to setup the SVA parameters
> > + * @sva_param is duplicated and can be freed when this function returns.
> > + *
> > + * Return 0 if initialization succeeded, or an error.
> > + */
> IOMMU vendor driver usually dont know when the device SVA feature will
> be used until bind call. So we pretty much have to call this for every
> device during init time?

Not necessarily. Before bind the device driver should call
iommu_dev_enable_feature(dev, IOMMU_FEAT_SVA), which is when SMMUv3
invokes iommu_sva_enable()

[...]
> > +struct io_mm_ops {
> > +	/* Allocate a PASID context for an mm */
> > +	void *(*alloc)(struct mm_struct *mm);
> > +
> > +	/*
> > +	 * Attach a PASID context to a device. Write the entry into the PASID
> > +	 * table.
> > +	 *
> > +	 * @attach_domain is true when no other device in the IOMMU domain is
> > +	 * already attached to this context. IOMMU drivers that share the
> > +	 * PASID tables within a domain don't need to write the PASID entry
> > +	 * when @attach_domain is false.
> > +	 */
> If we have per device PASID table, then we need to set up PASID table
> entry regardless of the domain sharing.

Yes, the attach_domain is a hint for IOMMU drivers that handle PASID
tables per domain (SMMUv3). If PASID tables are per device then it can be
ignored. I added it to the interface because it's a lot more difficult to
check from within the SMMU driver, whereas iommu-sva already iterates over
all devices attached to an io_mm. Arguably the hint isn't as useful on
attach as on detach, where we must not clear the PASID table entry if
other devices in the domain are still using it.

> What is confusing to me is that
> domain is for DMA isolation on request w/o PASID, but with SVA we don't
> really care about domains. Sorry, it has been a long time since we
> discussed this. I think will work for VT-d but just wanted to make sure
> I understand the intentions.

No problem, it has been a while and I don't remember the rationale for
every choice. It's good to question whether they're still valid.

I find the per-domain PASID table to be a good model when reasoning about
IOMMU groups. In pci_device_group() a single group is created for devices
whose Requester IDs alias, and they all get the same domain. In a buggy
system, if a device can issue DMA with the RID of another, then regardless
of PASID the IOMMU cannot isolate them. Having per-device PASID tables
doesn't add any isolation but may hide the flaw from the user, if they
think that binding an mm to device A prevents a DMA-aliased device B from
accessing it. This is hypothetical because we don't allow SVA for
multi-device groups at the moment (sanity-check would be messy) but maybe
buggy implementations will want this support in the future.

In the normal case, one device per domain, having PASID tables on the
domain rather than the device doesn't make a difference. It makes a
difference if the user wants to put multiple devices in the same domain
(e.g. VFIO container). I don't know if that's a use-case.

Thanks,
Jean
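For illustration, the attach_domain hint and the "turned around" device check discussed in this reply could be modelled as below. This is a simplified user-space sketch, not the patch itself: `struct bond_model` and `find_bond()` are hypothetical, and devices and domains are reduced to plain integer IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bond: device and domain are plain IDs. */
struct bond_model {
	int dev;
	int domain;
};

/*
 * Scan the bonds already attached to one io_mm.
 *
 * Returns the index of an existing bond for @dev, or -1 if there is none.
 * *attach_domain ends up false when another device of the same domain is
 * already bound; that is the hint handed to the attach() op.
 */
static int find_bond(const struct bond_model *bonds, size_t n,
		     int dev, int domain, bool *attach_domain)
{
	size_t i;

	assert(attach_domain != NULL);
	*attach_domain = true;

	for (i = 0; i < n; i++) {
		if (bonds[i].dev == dev)
			return (int)i;	/* already bound: reuse this bond */

		if (bonds[i].domain == domain)
			*attach_domain = false;
	}

	return -1;
}
```

A driver with per-device PASID tables would simply ignore the hint, while a per-domain driver could skip writing the PASID entry when it is false.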
On Wed, Feb 26, 2020 at 12:35:06PM +0000, Jonathan Cameron wrote: > > + * A single Process Address Space ID (PASID) is allocated for each mm. In the > > + * example, devices use PASID 1 to read/write into address space X and PASID 2 > > + * to read/write into address space Y. Calling iommu_sva_get_pasid() on bond 1 > > + * returns 1, and calling it on bonds 2-4 returns 2. > > + * > > + * Hardware tables describing this configuration in the IOMMU would typically > > + * look like this: > > + * > > + * PASID tables > > + * of domain A > > + * .->+--------+ > > + * / 0 | |-------> io_pgtable > > + * / +--------+ > > + * Device tables / 1 | |-------> pgd X > > + * +--------+ / +--------+ > > + * 00:00.0 | A |-' 2 | |--. > > + * +--------+ +--------+ \ > > + * : : 3 | | \ > > + * +--------+ +--------+ --> pgd Y > > + * 00:01.0 | B |--. / > > + * +--------+ \ | > > + * 00:01.1 | B |----+ PASID tables | > > + * +--------+ \ of domain B | > > + * '->+--------+ | > > + * 0 | |-- | --> io_pgtable > > + * +--------+ | > > + * 1 | | | > > + * +--------+ | > > + * 2 | |---' > > + * +--------+ > > + * 3 | | > > + * +--------+ > > + * > > + * With this model, a single call binds all devices in a given domain to an > > + * address space. Other devices in the domain will get the same bond implicitly. > > + * However, users must issue one bind() for each device, because IOMMUs may > > + * implement SVA differently. Furthermore, mandating one bind() per device > > + * allows the driver to perform sanity-checks on device capabilities. > > > + * > > + * In some IOMMUs, one entry of the PASID table (typically the first one) can > > + * hold non-PASID translations. In this case PASID 0 is reserved and the first > > + * entry points to the io_pgtable pointer. In other IOMMUs the io_pgtable > > + * pointer is held in the device table and PASID 0 is available to the > > + * allocator. 
> > Is it worth hammering home in here that we can only do this because the PASID space > is global (with exception of PASID 0)? It's a convenient simplification but not > necessarily a hardware restriction so perhaps we should remind people somewhere in here? I could add this four paragraphs up: "A single Process Address Space ID (PASID) is allocated for each mm. It is a choice made for the Linux SVA implementation, not a hardware restriction." > > + */ > > + > > +struct io_mm { > > + struct list_head devices; > > + struct mm_struct *mm; > > + struct mmu_notifier notifier; > > + > > + /* Late initialization */ > > + const struct io_mm_ops *ops; > > + void *ctx; > > + int pasid; > > +}; > > + > > +#define to_io_mm(mmu_notifier) container_of(mmu_notifier, struct io_mm, notifier) > > +#define to_iommu_bond(handle) container_of(handle, struct iommu_bond, sva) > > Code ordering wise, do we want this after the definition of iommu_bond? > > For both of these it's a bit non obvious what they come 'from'. > I wouldn't naturally assume to_io_mm gets me from notifier to the io_mm > for example. Not sure it matters though if these are only used in a few > places. Right, I can rename the first one to mn_to_io_mm(). The second one I think might be good enough. > > +static struct iommu_sva * > > +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata) > > +{ > > + int ret = 0; > > I'm fairly sure this is set in all paths below. 
Now, of course the > compiler might not think that in which case fair enough :) > > > + bool attach_domain = true; > > + struct iommu_bond *bond, *tmp; > > + struct iommu_domain *domain, *other; > > + struct iommu_sva_param *param = dev->iommu_param->sva_param; > > + > > + domain = iommu_get_domain_for_dev(dev); > > + > > + bond = kzalloc(sizeof(*bond), GFP_KERNEL); > > + if (!bond) > > + return ERR_PTR(-ENOMEM); > > + > > + bond->sva.dev = dev; > > + bond->drvdata = drvdata; > > + refcount_set(&bond->refs, 1); > > + RCU_INIT_POINTER(bond->io_mm, io_mm); > > + > > + mutex_lock(&iommu_sva_lock); > > + /* Is it already bound to the device or domain? */ > > + list_for_each_entry(tmp, &io_mm->devices, mm_head) { > > + if (tmp->sva.dev != dev) { > > + other = iommu_get_domain_for_dev(tmp->sva.dev); > > + if (domain == other) > > + attach_domain = false; > > + > > + continue; > > + } > > + > > + if (WARN_ON(tmp->drvdata != drvdata)) { > > + ret = -EINVAL; > > + goto err_free; > > + } > > + > > + /* > > + * Hold a single io_mm reference per bond. Note that we can't > > + * return an error after this, otherwise the caller would drop > > + * an additional reference to the io_mm. > > + */ > > + refcount_inc(&tmp->refs); > > + io_mm_put(io_mm); > > + kfree(bond); > > Free outside the lock would be ever so slightly more logical given we allocated > before taking the lock. 
> > > + mutex_unlock(&iommu_sva_lock); > > + return &tmp->sva; > > + } > > + > > + list_add_rcu(&bond->mm_head, &io_mm->devices); > > + param->nr_bonds++; > > + mutex_unlock(&iommu_sva_lock); > > + > > + ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx, > > + attach_domain); > > + if (ret) > > + goto err_remove; > > + > > + return &bond->sva; > > + > > +err_remove: > > + /* > > + * At this point concurrent threads may have started to access the > > + * io_mm->devices list in order to invalidate address ranges, which > > + * requires to free the bond via kfree_rcu() > > + */ > > + mutex_lock(&iommu_sva_lock); > > + param->nr_bonds--; > > + list_del_rcu(&bond->mm_head); > > + > > +err_free: > > + mutex_unlock(&iommu_sva_lock); > > + kfree_rcu(bond, rcu_head); > > I don't suppose it matters really but we don't need the rcu free if > we follow the err_free goto. Perhaps we are cleaner in this case > to not use a unified exit path but do that case inline? Agreed, though I moved the kzalloc() later as suggested by Jacob, I think it looks a little better and simplifies the error paths Thanks, Jean
On Wed, Feb 26, 2020 at 01:59:33PM +0000, Jonathan Cameron wrote:
> > +static int iopf_complete(struct device *dev, struct iopf_fault *iopf,
> > +			 enum iommu_page_response_code status)
>
> This is called once per group. Should name reflect that?

Ok

[...]
> > +/**
> > + * iommu_queue_iopf - IO Page Fault handler
> > + * @evt: fault event
> > + * @cookie: struct device, passed to iommu_register_device_fault_handler.
> > + *
> > + * Add a fault to the device workqueue, to be handled by mm.
> > + *
> > + * Return: 0 on success and <0 on error.
> > + */
> > +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> > +{
> > +	int ret;
> > +	struct iopf_group *group;
> > +	struct iopf_fault *iopf, *next;
> > +	struct iopf_device_param *iopf_param;
> > +
> > +	struct device *dev = cookie;
> > +	struct iommu_param *param = dev->iommu_param;
> > +
> > +	if (WARN_ON(!mutex_is_locked(&param->lock)))
> > +		return -EINVAL;
>
> Just curious...
>
> Why do we always need a runtime check on this rather than say,
> using lockdep_assert_held or similar?

I probably didn't know about lockdep_assert at the time :)

> > +	/*
> > +	 * It is incredibly easy to find ourselves in a deadlock situation if
> > +	 * we're not careful, because we're taking the opposite path as
> > +	 * iommu_queue_iopf:
> > +	 *
> > +	 *   iopf_queue_flush_dev()   |  PRI queue handler
> > +	 *    lock(&param->lock)      |   iommu_queue_iopf()
> > +	 *     queue->flush()         |    lock(&param->lock)
> > +	 *      wait PRI queue empty  |
> > +	 *
> > +	 * So we can't hold the device param lock while flushing. Take a
> > +	 * reference to the device param instead, to prevent the queue from
> > +	 * going away.
> > +	 */
> > +	mutex_lock(&param->lock);
> > +	iopf_param = param->iopf_param;
> > +	if (iopf_param) {
> > +		queue = param->iopf_param->queue;
> > +		iopf_param->busy = true;
>
> Describing this as taking a reference is not great...
> I'd change the comment to set a flag or something like that.
>
> Is there any potential of multiple copies of this running against
> each other? I've not totally gotten my head around when this
> might be called yet.

Yes it's allowed, this should be a refcount

[...]
> > +int iopf_queue_remove_device(struct iopf_queue *queue, struct device *dev)
> > +{
> > +	int ret = -EINVAL;
> > +	struct iopf_fault *iopf, *next;
> > +	struct iopf_device_param *iopf_param;
> > +	struct iommu_param *param = dev->iommu_param;
> > +
> > +	if (!param || !queue)
> > +		return -EINVAL;
> > +
> > +	do {
> > +		mutex_lock(&queue->lock);
> > +		mutex_lock(&param->lock);
> > +		iopf_param = param->iopf_param;
> > +		if (iopf_param && iopf_param->queue == queue) {
> > +			if (iopf_param->busy) {
> > +				ret = -EBUSY;
> > +			} else {
> > +				list_del(&iopf_param->queue_list);
> > +				param->iopf_param = NULL;
> > +				ret = 0;
> > +			}
> > +		}
> > +		mutex_unlock(&param->lock);
> > +		mutex_unlock(&queue->lock);
> > +
> > +		/*
> > +		 * If there is an ongoing flush, wait for it to complete and
> > +		 * then retry. iopf_param isn't going away since we're the only
> > +		 * thread that can free it.
> > +		 */
> > +		if (ret == -EBUSY)
> > +			wait_event(iopf_param->wq_head, !iopf_param->busy);
> > +		else if (ret)
> > +			return ret;
> > +	} while (ret == -EBUSY);
>
> I'm in two minds about the next comment (so up to you)...
>
> Currently this looks a bit odd. Would you be better off just having a separate
> parameter for busy and explicit separate handling for the error path?
>
> 	bool busy;
> 	int ret = 0;
>
> 	do {
> 		mutex_lock(&queue->lock);
> 		mutex_lock(&param->lock);
> 		iopf_param = param->iopf_param;
> 		if (iopf_param && iopf_param->queue == queue) {
> 			busy = iopf_param->busy;
> 			if (!busy) {
> 				list_del(&iopf_param->queue_list);
> 				param->iopf_param = NULL;
> 			}
> 		} else {
> 			ret = -EINVAL;
> 		}
> 		mutex_unlock(&param->lock);
> 		mutex_unlock(&queue->lock);
> 		if (ret)
> 			return ret;
> 		if (busy)
> 			wait_event(iopf_param->wq_head, !iopf_param->busy);
>
> 	} while (busy);
>
> ..

Sure, I think it looks better

Thanks,
Jean
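One way to picture the two points agreed on in this reply (the busy flag becoming a count so that concurrent flushes are allowed, and removal waiting for flushers) is the user-space model below. It is a rough sketch under stated assumptions, not the driver code: `struct param_model`, `flush_begin()`, `flush_end()` and `remove_device()` are hypothetical, POSIX threads stand in for the kernel mutex and wait queue, and only a single remover is assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the per-device fault parameters. */
struct param_model {
	pthread_mutex_t lock;
	pthread_cond_t wq;
	unsigned int flushers;	/* replaces the 'busy' boolean */
	bool registered;
};

/* A flush announces itself; the lock is not held while flushing. */
static void flush_begin(struct param_model *p)
{
	pthread_mutex_lock(&p->lock);
	p->flushers++;
	pthread_mutex_unlock(&p->lock);
}

/* The last flusher to finish wakes up a waiting remover. */
static void flush_end(struct param_model *p)
{
	pthread_mutex_lock(&p->lock);
	assert(p->flushers > 0);
	if (--p->flushers == 0)
		pthread_cond_broadcast(&p->wq);
	pthread_mutex_unlock(&p->lock);
}

/* Returns 0 on success, -1 if the device was not registered. */
static int remove_device(struct param_model *p)
{
	int ret = 0;

	pthread_mutex_lock(&p->lock);
	if (!p->registered) {
		ret = -1;
	} else {
		/* pthread_cond_wait() drops the lock while sleeping. */
		while (p->flushers > 0)
			pthread_cond_wait(&p->wq, &p->lock);
		p->registered = false;
	}
	pthread_mutex_unlock(&p->lock);
	return ret;
}
```

Keeping the error return separate from the wait condition mirrors the suggested rewrite of the loop, where `ret` only carries errors and the retry is driven by the busy state.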
On Wed, Feb 26, 2020 at 11:39:59AM -0800, Jacob Pan wrote:
> > @@ -442,11 +444,20 @@ static void iommu_sva_unbind_locked(struct iommu_bond *bond)
> >  void iommu_sva_unbind_generic(struct iommu_sva *handle)
> >  {
> > +	int pasid;
> >  	struct iommu_param *param = handle->dev->iommu_param;
> >
> >  	if (WARN_ON(!param))
> >  		return;
> >
> > +	/*
> > +	 * Caller stopped the device from issuing PASIDs, now make sure they are
> > +	 * out of the fault queue.
> > +	 */
> > +	pasid = iommu_sva_get_pasid_generic(handle);
> > +	if (pasid != IOMMU_PASID_INVALID)
> > +		iopf_queue_flush_dev(handle->dev, pasid);
> > +
> I have an ordering concern.
> The caller can only stop the device issuing page request but there will
> be in-flight request inside the IOMMU. If we flush here before clearing
> the PASID context, there might be new request coming in before the
> detach.

The goal of this flush is also to clear the IOMMU PRI queue. It calls the
IOMMU's flush() callback before flushing the workqueue. So when this
returns, there shouldn't be any more pending fault.

Thanks,
Jean

> How about detach first then flush? Then anything come after the detach
> would be faults. Flush will be clean.
>
> >  	mutex_lock(&param->sva_lock);
> >  	mutex_lock(&iommu_sva_lock);
> >  	iommu_sva_unbind_locked(to_iommu_bond(handle));
> > @@ -484,6 +495,10 @@ int iommu_sva_enable(struct device *dev, struct iommu_sva_param *sva_param)
> >  		goto err_unlock;
> >  	}
> >
> > +	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
> > +	if (ret)
> > +		goto err_unlock;
> > +
> >  	dev->iommu_param->sva_param = new_param;
> >  	mutex_unlock(&param->sva_lock);
> >  	return 0;
> > @@ -521,6 +536,7 @@ int iommu_sva_disable(struct device *dev)
> >  		goto out_unlock;
> >  	}
> >
> > +	iommu_unregister_device_fault_handler(dev);
> >  	kfree(param->sva_param);
> >  	param->sva_param = NULL;
> >  out_unlock:
>
> [Jacob Pan]
On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > +		/*
> > > +		 * To ensure that we observe the initialization of io_mm fields
> > > +		 * by io_mm_finalize() before the registration of this bond to
> > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > +		 * from list_add_rcu().
> > > +		 */
> > > +		io_mm = rcu_dereference(bond->io_mm);
> >
> > A rcu_dereference isn't need here, just a normal derference is fine.
>
> bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> which does bond->io_mm under rcu_read_lock())

I'm surprised the bond->io_mm can change over the lifetime of the
bond memory..

> > > If io_mm->ctx and io_mm->ops are already valid before the
> > > mmu notifier is published, then we don't need that stuff.
> >
> > So, this trickyness with RCU is not a bad reason to introduce the priv
> > scheme, maybe explain it in the commit message?
>
> Ok, I've added this to the commit message:
>
> The IOMMU SVA module, which attaches an mm to multiple devices,
> exemplifies this situation. In essence it does:
>
> 	mmu_notifier_get()
> 	  alloc_notifier()
> 	     A = kzalloc()
> 	  /* MMU notifier is published */
> 	  A->ctx = ctx;				// (1)
> 	  device->A = A;
> 	  list_add_rcu(device, A->devices);	// (2)
>
> The invalidate notifier, which may start running before A is fully
> initialized at (1), does the following:
>
> 	io_mm_invalidate(A)
> 	  list_for_each_entry_rcu(device, A->devices)
> 	    A = device->A;			// (3)

I would drop the workaround from the description, it is enough to say
that the line below needs to observe (1) after (2) and this is
trivially achieved by moving (1) to before publishing the notifier so
the core MM locking can be used.

Regards,
Jason
On Fri, Feb 28, 2020 at 03:40:07PM +0100, Jean-Philippe Brucker wrote:
> > > Device
> > > + * 00:00.0 accesses address spaces X and Y, each corresponding to an
> > > mm_struct.
> > > + * Devices 00:01.* only access address space Y. In addition each
> > > + * IOMMU_DOMAIN_DMA domain has a private address space, io_pgtable,
> > > that is
> > > + * managed with iommu_map()/iommu_unmap(), and isn't shared with the
> > > CPU MMU.
> > So this would allow IOVA and SVA co-exist in the same address space?
>
> Hmm, not in the same address space, but they can co-exist in a device. In
> fact the endpoint I'm testing (hisi zip accelerator) already needs normal
> DMA alongside SVA for queue management. This one is integrated on an
> Arm-based platform so shouldn't be a concern for VT-d at the moment, but
> I suspect we might see more of this kind of device with mixed DMA.
Probably the most interesting use cases for PASID definitely require
this, so this is more than a "suspect we might see".
We want to see the privileged kernel control the general behavior of
the PCI function and delegate only some DMAs to PASIDs associated with
the user mm_struct. The device is always trusted to label its DMA
properly.
These programming models have already been in use for years now with
the OpenCAPI implementation.
Jason
On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > +		/*
> > > > +		 * To ensure that we observe the initialization of io_mm fields
> > > > +		 * by io_mm_finalize() before the registration of this bond to
> > > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > > +		 * from list_add_rcu().
> > > > +		 */
> > > > +		io_mm = rcu_dereference(bond->io_mm);
> > >
> > > A rcu_dereference isn't need here, just a normal derference is fine.
> >
> > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > which does bond->io_mm under rcu_read_lock())
>
> I'm surprised the bond->io_mm can change over the lifetime of the
> bond memory..

The normal lifetime of the bond is between device driver calls to bind()
and unbind(). If the mm exits early, though, we clear bond->io_mm. The
bond is then stale but can only be freed when the device driver releases
it with unbind().

> > > > If io_mm->ctx and io_mm->ops are already valid before the
> > > > mmu notifier is published, then we don't need that stuff.
> > >
> > > So, this trickyness with RCU is not a bad reason to introduce the priv
> > > scheme, maybe explain it in the commit message?
> >
> > Ok, I've added this to the commit message:
> >
> > The IOMMU SVA module, which attaches an mm to multiple devices,
> > exemplifies this situation. In essence it does:
> >
> > 	mmu_notifier_get()
> > 	  alloc_notifier()
> > 	     A = kzalloc()
> > 	  /* MMU notifier is published */
> > 	  A->ctx = ctx;				// (1)
> > 	  device->A = A;
> > 	  list_add_rcu(device, A->devices);	// (2)
> >
> > The invalidate notifier, which may start running before A is fully
> > initialized at (1), does the following:
> >
> > 	io_mm_invalidate(A)
> > 	  list_for_each_entry_rcu(device, A->devices)
> > 	    A = device->A;			// (3)
>
> I would drop the work around from the decription, it is enough to say
> that the line below needs to observe (1) after (2) and this is
> trivially achieved by moving (1) to before publishing the notifier so
> the core MM locking can be used.

Ok, will do

Thanks,
Jean
On Fri, Feb 28, 2020 at 04:04:27PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > > + list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > > + /*
> > > > > + * To ensure that we observe the initialization of io_mm fields
> > > > > + * by io_mm_finalize() before the registration of this bond to
> > > > > + * the list by io_mm_attach(), introduce an address dependency
> > > > > + * between bond and io_mm. It pairs with the smp_store_release()
> > > > > + * from list_add_rcu().
> > > > > + */
> > > > > + io_mm = rcu_dereference(bond->io_mm);
> > > >
> > > > A rcu_dereference isn't need here, just a normal derference is fine.
> > >
> > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > > which does bond->io_mm under rcu_read_lock())
> >
> > I'm surprised the bond->io_mm can change over the lifetime of the
> > bond memory..
>
> The normal lifetime of the bond is between device driver calls to bind()
> and unbind(). If the mm exits early, though, we clear bond->io_mm. The
> bond is then stale but can only be freed when the device driver releases
> it with unbind().
I usually advocate for simple use of these APIs. The mmu_notifier_get()
should happen in bind() and the matching put should happen in the
call_rcu callback that does the kfree. Then you can never get a stale
pointer. Don't worry about exit_mmap().
release() is an unusual callback and I see a lot of places using it
wrong. The purpose of release is to invalidate_all, that is it.
Also, confusingly release may be called multiple times in some
situations, so it shouldn't disturb anything that might impact a 2nd
call.
Jason
On Fri, 28 Feb 2020 15:43:04 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > On Wed, Feb 26, 2020 at 12:35:06PM +0000, Jonathan Cameron wrote: > > > + * A single Process Address Space ID (PASID) is allocated for each mm. In the > > > + * example, devices use PASID 1 to read/write into address space X and PASID 2 > > > + * to read/write into address space Y. Calling iommu_sva_get_pasid() on bond 1 > > > + * returns 1, and calling it on bonds 2-4 returns 2. > > > + * > > > + * Hardware tables describing this configuration in the IOMMU would typically > > > + * look like this: > > > + * > > > + * PASID tables > > > + * of domain A > > > + * .->+--------+ > > > + * / 0 | |-------> io_pgtable > > > + * / +--------+ > > > + * Device tables / 1 | |-------> pgd X > > > + * +--------+ / +--------+ > > > + * 00:00.0 | A |-' 2 | |--. > > > + * +--------+ +--------+ \ > > > + * : : 3 | | \ > > > + * +--------+ +--------+ --> pgd Y > > > + * 00:01.0 | B |--. / > > > + * +--------+ \ | > > > + * 00:01.1 | B |----+ PASID tables | > > > + * +--------+ \ of domain B | > > > + * '->+--------+ | > > > + * 0 | |-- | --> io_pgtable > > > + * +--------+ | > > > + * 1 | | | > > > + * +--------+ | > > > + * 2 | |---' > > > + * +--------+ > > > + * 3 | | > > > + * +--------+ > > > + * > > > + * With this model, a single call binds all devices in a given domain to an > > > + * address space. Other devices in the domain will get the same bond implicitly. > > > + * However, users must issue one bind() for each device, because IOMMUs may > > > + * implement SVA differently. Furthermore, mandating one bind() per device > > > + * allows the driver to perform sanity-checks on device capabilities. > > > > > + * > > > + * In some IOMMUs, one entry of the PASID table (typically the first one) can > > > + * hold non-PASID translations. In this case PASID 0 is reserved and the first > > > + * entry points to the io_pgtable pointer. 
In other IOMMUs the io_pgtable > > > + * pointer is held in the device table and PASID 0 is available to the > > > + * allocator. > > > > Is it worth hammering home in here that we can only do this because the PASID space > > is global (with exception of PASID 0)? It's a convenient simplification but not > > necessarily a hardware restriction so perhaps we should remind people somewhere in here? > > I could add this four paragraphs up: > > "A single Process Address Space ID (PASID) is allocated for each mm. It is > a choice made for the Linux SVA implementation, not a hardware > restriction." Perfect. > > > > + */ > > > + > > > +struct io_mm { > > > + struct list_head devices; > > > + struct mm_struct *mm; > > > + struct mmu_notifier notifier; > > > + > > > + /* Late initialization */ > > > + const struct io_mm_ops *ops; > > > + void *ctx; > > > + int pasid; > > > +}; > > > + > > > +#define to_io_mm(mmu_notifier) container_of(mmu_notifier, struct io_mm, notifier) > > > +#define to_iommu_bond(handle) container_of(handle, struct iommu_bond, sva) > > > > Code ordering wise, do we want this after the definition of iommu_bond? > > > > For both of these it's a bit non obvious what they come 'from'. > > I wouldn't naturally assume to_io_mm gets me from notifier to the io_mm > > for example. Not sure it matters though if these are only used in a few > > places. > > Right, I can rename the first one to mn_to_io_mm(). The second one I think > might be good enough. Agreed. The second one does feel more natural. > > > > > +static struct iommu_sva * > > > +io_mm_attach(struct device *dev, struct io_mm *io_mm, void *drvdata) > > > +{ > > > + int ret = 0; > > > > I'm fairly sure this is set in all paths below. 
Now, of course the > > compiler might not think that in which case fair enough :) > > > > > + bool attach_domain = true; > > > + struct iommu_bond *bond, *tmp; > > > + struct iommu_domain *domain, *other; > > > + struct iommu_sva_param *param = dev->iommu_param->sva_param; > > > + > > > + domain = iommu_get_domain_for_dev(dev); > > > + > > > + bond = kzalloc(sizeof(*bond), GFP_KERNEL); > > > + if (!bond) > > > + return ERR_PTR(-ENOMEM); > > > + > > > + bond->sva.dev = dev; > > > + bond->drvdata = drvdata; > > > + refcount_set(&bond->refs, 1); > > > + RCU_INIT_POINTER(bond->io_mm, io_mm); > > > + > > > + mutex_lock(&iommu_sva_lock); > > > + /* Is it already bound to the device or domain? */ > > > + list_for_each_entry(tmp, &io_mm->devices, mm_head) { > > > + if (tmp->sva.dev != dev) { > > > + other = iommu_get_domain_for_dev(tmp->sva.dev); > > > + if (domain == other) > > > + attach_domain = false; > > > + > > > + continue; > > > + } > > > + > > > + if (WARN_ON(tmp->drvdata != drvdata)) { > > > + ret = -EINVAL; > > > + goto err_free; > > > + } > > > + > > > + /* > > > + * Hold a single io_mm reference per bond. Note that we can't > > > + * return an error after this, otherwise the caller would drop > > > + * an additional reference to the io_mm. > > > + */ > > > + refcount_inc(&tmp->refs); > > > + io_mm_put(io_mm); > > > + kfree(bond); > > > > Free outside the lock would be ever so slightly more logical given we allocated > > before taking the lock. 
> > > > > + mutex_unlock(&iommu_sva_lock); > > > + return &tmp->sva; > > > + } > > > + > > > + list_add_rcu(&bond->mm_head, &io_mm->devices); > > > + param->nr_bonds++; > > > + mutex_unlock(&iommu_sva_lock); > > > + > > > + ret = io_mm->ops->attach(bond->sva.dev, io_mm->pasid, io_mm->ctx, > > > + attach_domain); > > > + if (ret) > > > + goto err_remove; > > > + > > > + return &bond->sva; > > > + > > > +err_remove: > > > + /* > > > + * At this point concurrent threads may have started to access the > > > + * io_mm->devices list in order to invalidate address ranges, which > > > + * requires to free the bond via kfree_rcu() > > > + */ > > > + mutex_lock(&iommu_sva_lock); > > > + param->nr_bonds--; > > > + list_del_rcu(&bond->mm_head); > > > + > > > +err_free: > > > + mutex_unlock(&iommu_sva_lock); > > > + kfree_rcu(bond, rcu_head); > > > > I don't suppose it matters really but we don't need the rcu free if > > we follow the err_free goto. Perhaps we are cleaner in this case > > to not use a unified exit path but do that case inline? > > Agreed, though I moved the kzalloc() later as suggested by Jacob, I think > it looks a little better and simplifies the error paths > > Thanks, > Jean Jonathan
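As an illustration of the rename agreed above (to_io_mm() becoming mn_to_io_mm()), here is a minimal user-space sketch of a container_of()-style accessor whose name says what it converts from. The struct layouts and the local model_container_of() macro are simplified assumptions for the example, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's container_of(); assumption for this sketch. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified versions of the structures under review. */
struct mmu_notifier_model {
	int unused;
};

struct io_mm_model {
	int pasid;
	struct mmu_notifier_model notifier;
};

/* The "mn_" prefix makes it obvious that the input is the embedded notifier. */
static inline struct io_mm_model *mn_to_io_mm(struct mmu_notifier_model *mn)
{
	assert(mn != NULL);
	return model_container_of(mn, struct io_mm_model, notifier);
}
```

Given a pointer to the embedded notifier, mn_to_io_mm() returns the enclosing object, which is what the reviewer found non-obvious from the original macro name.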
On Thu, Feb 27, 2020 at 06:17:26PM +0000, Jonathan Cameron wrote: > On Mon, 24 Feb 2020 19:23:58 +0100 > Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > > > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > > > The SMMU provides a Stall model for handling page faults in platform > > devices. It is similar to PCI PRI, but doesn't require devices to have > > their own translation cache. Instead, faulting transactions are parked and > > the OS is given a chance to fix the page tables and retry the transaction. > > > > Enable stall for devices that support it (opt-in by firmware). When an > > event corresponds to a translation error, call the IOMMU fault handler. If > > the fault is recoverable, it will call us back to terminate or continue > > the stall. > > > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > One question inline. > > Thanks, > > > --- > > drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++-- > > drivers/iommu/of_iommu.c | 5 +- > > include/linux/iommu.h | 2 + > > 3 files changed, 269 insertions(+), 9 deletions(-) > > > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c > > index 6a5987cce03f..da5dda5ba26a 100644 > > --- a/drivers/iommu/arm-smmu-v3.c > > +++ b/drivers/iommu/arm-smmu-v3.c > > @@ -374,6 +374,13 @@ > > > > +/* > > + * arm_smmu_flush_evtq - wait until all events currently in the queue have been > > + * consumed. > > + * > > + * Wait until the evtq thread finished a batch, or until the queue is empty. > > + * Note that we don't handle overflows on q->batch. If it occurs, just wait for > > + * the queue to be empty. 
> > + */
> > +static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid)
> > +{
> > +	int ret;
> > +	u64 batch;
> > +	struct arm_smmu_device *smmu = cookie;
> > +	struct arm_smmu_queue *q = &smmu->evtq.q;
> > +
> > +	spin_lock(&q->wq.lock);
> > +	if (queue_sync_prod_in(q) == -EOVERFLOW)
> > +		dev_err(smmu->dev, "evtq overflow detected -- requests lost\n");
> > +
> > +	batch = q->batch;
>
> So this is trying to be sure we have advanced the queue 2 spots?

So we call arm_smmu_flush_evtq() before decommissioning a PASID, to make
sure that there aren't any pending events for this PASID languishing in the
fault queues.

The main test is queue_empty(). If that succeeds then we know that there
aren't any pending events (and the PASID is safe to reuse). But if new
events are constantly added to the queue then we wait for the evtq thread
to handle a full batch, where one batch corresponds to the queue size. For
that we take the batch number when entering flush(), and wait for the evtq
thread to increment it twice.

> Is there a potential race here? q->batch could have updated before we take
> a local copy.

Yes, we're just checking on the progress of the evtq thread. All accesses
to batch are made while holding the wq lock.

Flush is a rare event so the lock isn't contended, but the wake_up() that
this patch introduces in arm_smmu_evtq_thread() does add some overhead
(0.85% of arm_smmu_evtq_thread(), according to perf). It would be nice to
get rid of it but I haven't found anything clever yet.

Thanks,
Jean

>
> > +	ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) ||
> > +					       q->batch >= batch + 2);
> > +	spin_unlock(&q->wq.lock);
> > +
> > +	return ret;
> > +}
> > +
> ...
>
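The flush condition discussed in this message can be sketched in user space as follows. This is only a model under stated assumptions: struct evtq_model and its helpers are hypothetical, there is no locking or wait queue, and the consumer is reduced to a function that drains the queue and bumps the batch counter. It shows the predicate "queue empty, or batch counter advanced by two since the snapshot":

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the event queue state, not the driver's types. */
struct evtq_model {
	uint32_t prod;	/* next slot the producer (hardware) writes */
	uint32_t cons;	/* next slot the consumer thread reads */
	uint64_t batch;	/* bumped by the consumer after each batch */
};

static bool evtq_model_empty(const struct evtq_model *q)
{
	return q->prod == q->cons;
}

/* Consumer side: handle everything currently visible, then count the batch. */
static void evtq_model_consume_batch(struct evtq_model *q)
{
	/* Like the patch, this model does not handle overflow of the counter. */
	assert(q->batch != UINT64_MAX);
	q->cons = q->prod;
	q->batch++;
}

/*
 * Flush predicate: done when the queue is empty, or when the consumer has
 * incremented the batch counter twice since start_batch was sampled.
 */
static bool evtq_model_flush_done(const struct evtq_model *q,
				  uint64_t start_batch)
{
	return evtq_model_empty(q) || q->batch >= start_batch + 2;
}
```

A caller would sample q->batch once on entry and then re-evaluate evtq_model_flush_done() each time it is woken up. In the patch both steps happen under the wait-queue lock, which this sketch leaves out.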
On Wed, Feb 26, 2020 at 04:44:53PM +0800, Xu Zaibo wrote:
> Hi,
>
>
> On 2020/2/25 2:23, Jean-Philippe Brucker wrote:
> > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> >
> > The SMMU provides a Stall model for handling page faults in platform
> > devices. It is similar to PCI PRI, but doesn't require devices to have
> > their own translation cache. Instead, faulting transactions are parked and
> > the OS is given a chance to fix the page tables and retry the transaction.
> >
> > Enable stall for devices that support it (opt-in by firmware). When an
> > event corresponds to a translation error, call the IOMMU fault handler. If
> > the fault is recoverable, it will call us back to terminate or continue
> > the stall.
> >
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> > drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++--
> > drivers/iommu/of_iommu.c | 5 +-
> > include/linux/iommu.h | 2 +
> > 3 files changed, 269 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 6a5987cce03f..da5dda5ba26a 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -374,6 +374,13 @@
> > #define CMDQ_PRI_1_GRPID GENMASK_ULL(8, 0)
> > #define CMDQ_PRI_1_RESP GENMASK_ULL(13, 12)
> [...]
> > +static int arm_smmu_page_response(struct device *dev,
> > + struct iommu_fault_event *unused,
> > + struct iommu_page_response *resp)
> > +{
> > + struct arm_smmu_cmdq_ent cmd = {0};
> > + struct arm_smmu_master *master = dev_iommu_fwspec_get(dev)->iommu_priv;
> Could 'dev_to_master' be used here?
Certainly, good catch.
Thanks,
Jean
On Thu, Feb 27, 2020 at 05:43:51PM +0000, Jonathan Cameron wrote:
> On Mon, 24 Feb 2020 19:23:42 +0100
> Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
>
> > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> >
> > To enable address space sharing with the IOMMU, introduce mm_context_get()
> > and mm_context_put(), that pin down a context and ensure that it will keep
> > its ASID after a rollover. Export the symbols to let the modular SMMUv3
> > driver use them.
> >
> > Pinning is necessary because a device constantly needs a valid ASID,
> > unlike tasks that only require one when running. Without pinning, we would
> > need to notify the IOMMU when we're about to use a new ASID for a task,
> > and it would get complicated when a new task is assigned a shared ASID.
> > Consider the following scenario with no ASID pinned:
> >
> > 1. Task t1 is running on CPUx with shared ASID (gen=1, asid=1)
> > 2. Task t2 is scheduled on CPUx, gets ASID (1, 2)
> > 3. Task tn is scheduled on CPUy, a rollover occurs, tn gets ASID (2, 1)
> > We would now have to immediately generate a new ASID for t1, notify
> > the IOMMU, and finally enable task tn. We are holding the lock during
> > all that time, since we can't afford having another CPU trigger a
> > rollover. The IOMMU issues invalidation commands that can take tens of
> > milliseconds.
> >
> > It gets needlessly complicated. All we wanted to do was schedule task tn,
> > that has no business with the IOMMU. By letting the IOMMU pin tasks when
> > needed, we avoid stalling the slow path, and let the pinning fail when
> > we're out of shareable ASIDs.
> >
> > After a rollover, the allocator expects at least one ASID to be available
> > in addition to the reserved ones (one per CPU). So (NR_ASIDS - NR_CPUS -
> > 1) is the maximum number of ASIDs that can be shared with the IOMMU.
> >
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> A few more trivial points.
I'll fix those, thanks
Jean
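The limit stated in the quoted commit message, (NR_ASIDS - NR_CPUS - 1), can be illustrated with a small user-space sketch. The helper names and the asid_pin_model structure are hypothetical and only model a counter with an upper bound; the real allocator in arch/arm64/mm/context.c is not reproduced here:

```c
#include <assert.h>

/*
 * Hypothetical helper: upper bound on pinned ASIDs, following the formula
 * NR_ASIDS - NR_CPUS - 1 from the commit message. Returns 0 when the ASID
 * space is too small to pin anything.
 */
static unsigned long model_max_pinned_asids(unsigned long nr_asids,
					    unsigned long nr_cpus)
{
	if (nr_asids <= nr_cpus + 1)
		return 0;
	return nr_asids - nr_cpus - 1;
}

/* Minimal pin counter; an assumption for illustration only. */
struct asid_pin_model {
	unsigned long nr_pinned;
	unsigned long max_pinned;
};

/* Pin one more ASID: 0 on success, -1 once the limit has been reached. */
static int model_pin_asid(struct asid_pin_model *m)
{
	if (m->nr_pinned >= m->max_pinned)
		return -1;
	m->nr_pinned++;
	return 0;
}

static void model_unpin_asid(struct asid_pin_model *m)
{
	assert(m->nr_pinned > 0);
	m->nr_pinned--;
}
```

For example, with 256 ASIDs and 8 CPUs the bound is 256 - 8 - 1 = 247. Once that many contexts are pinned, further pin attempts fail instead of stalling the rollover path.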
On Mon, Feb 24, 2020 at 07:23:36PM +0100, Jean-Philippe Brucker wrote:
> -static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm)
> +static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm, void *privdata)
Please don't introduce any > 80 char lines. Not here, and not anywhere
else in this patch or the series.
On Fri, Feb 28, 2020 at 11:13:40AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 28, 2020 at 04:04:27PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Feb 28, 2020 at 10:48:44AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Feb 28, 2020 at 03:39:35PM +0100, Jean-Philippe Brucker wrote:
> > > > > > +	list_for_each_entry_rcu(bond, &io_mm->devices, mm_head) {
> > > > > > +		/*
> > > > > > +		 * To ensure that we observe the initialization of io_mm fields
> > > > > > +		 * by io_mm_finalize() before the registration of this bond to
> > > > > > +		 * the list by io_mm_attach(), introduce an address dependency
> > > > > > +		 * between bond and io_mm. It pairs with the smp_store_release()
> > > > > > +		 * from list_add_rcu().
> > > > > > +		 */
> > > > > > +		io_mm = rcu_dereference(bond->io_mm);
> > > > >
> > > > > A rcu_dereference isn't needed here, just a normal dereference is fine.
> > > >
> > > > bond->io_mm is annotated with __rcu (for iommu_sva_get_pasid_generic(),
> > > > which does bond->io_mm under rcu_read_lock())
> > >
> > > I'm surprised the bond->io_mm can change over the lifetime of the
> > > bond memory..
> >
> > The normal lifetime of the bond is between device driver calls to bind()
> > and unbind(). If the mm exits early, though, we clear bond->io_mm. The
> > bond is then stale but can only be freed when the device driver releases
> > it with unbind().
>
> I usually advocate for simple use of these APIs. The mmu_notifier_get()
> should happen in bind() and the matching put should happen in the
> call_rcu callback that does the kfree.

I tried to keep it simple like that: normally mmu_notifier_get() is called
in bind(), and mmu_notifier_put() is called in unbind().

Multiple device drivers may call bind() with the same mm. Each bind()
calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
(a device<->mm link). Each bond is freed by calling unbind(), which calls
mmu_notifier_put().

That's the most common case. Now if the process is killed and the mm
disappears, we do need to avoid use-after-free caused by DMA of the
mappings and the page tables. So the release() callback, before doing
invalidate_all, stops DMA and clears the page table pointer on the IOMMU
side. It detaches all bonds from the io_mm, calling mmu_notifier_put() for
each of them. After release(), bond objects still exist and device drivers
still need to free them with unbind(), but they don't point to an io_mm
anymore.

> Then you can never get a stale
> pointer. Don't worry about exit_mmap().
>
> release() is an unusual callback and I see a lot of places using it
> wrong. The purpose of release is to invalidate_all, that is it.
>
> Also, confusingly release may be called multiple times in some
> situations, so it shouldn't disturb anything that might impact a 2nd
> call.

I hadn't realized that. The current implementation should be safe against
it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you
have an example of such a situation? I'm trying to write tests for this
kind of corner case.

Thanks,
Jean
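The bond lifetime described in this message (one reference per bind(), release() detaching every bond, unbind() still required afterwards) can be modelled in user space as below. This is a sketch under stated assumptions: io_mm_model, bond_model and the model_*() functions are hypothetical, the reference count merely stands in for mmu_notifier_get()/mmu_notifier_put(), and there is no RCU or locking:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_BONDS 4

struct io_mm_model;

/* Hypothetical bond: a device<->mm link handed back by bind(). */
struct bond_model {
	struct io_mm_model *io_mm;	/* NULL once release() has detached it */
	int in_use;
};

struct io_mm_model {
	int refs;	/* stands in for mmu_notifier_get()/put() references */
	struct bond_model bonds[MODEL_MAX_BONDS];
};

/* bind(): take one reference and hand out a new bond. */
static struct bond_model *model_bind(struct io_mm_model *io_mm)
{
	for (size_t i = 0; i < MODEL_MAX_BONDS; i++) {
		struct bond_model *bond = &io_mm->bonds[i];

		if (!bond->in_use) {
			bond->in_use = 1;
			bond->io_mm = io_mm;
			io_mm->refs++;
			return bond;
		}
	}
	return NULL;
}

/* mm exit: detach every attached bond; a no-op when none is attached. */
static void model_release(struct io_mm_model *io_mm)
{
	for (size_t i = 0; i < MODEL_MAX_BONDS; i++) {
		struct bond_model *bond = &io_mm->bonds[i];

		if (bond->in_use && bond->io_mm) {
			bond->io_mm = NULL;
			io_mm->refs--;
		}
	}
}

/* unbind(): drop the reference if release() has not already done so. */
static void model_unbind(struct bond_model *bond)
{
	assert(bond->in_use);
	if (bond->io_mm) {
		bond->io_mm->refs--;
		bond->io_mm = NULL;
	}
	bond->in_use = 0;
}
```

In this model a second model_release() call finds no attached bond and changes nothing, which mirrors the "release() is a nop if the io_mm doesn't have bonds anymore" argument made above.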
On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> I tried to keep it simple like that: normally mmu_notifier_get() is called
> in bind(), and mmu_notifier_put() is called in unbind().
>
> Multiple device drivers may call bind() with the same mm. Each bind()
> calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> (a device<->mm link). Each bond is freed by calling unbind(), which calls
> mmu_notifier_put().
>
> That's the most common case. Now if the process is killed and the mm
> disappears, we do need to avoid use-after-free caused by DMA of the
> mappings and the page tables.

This is why release must do invalidate all - but it doesn't need to do
any more - as no SPTE can be established without a mmget() - and
mmget() is no longer possible past release.

> So the release() callback, before doing invalidate_all, stops DMA
> and clears the page table pointer on the IOMMU side. It detaches all
> bonds from the io_mm, calling mmu_notifier_put() for each of
> them. After release(), bond objects still exist and device drivers
> still need to free them with unbind(), but they don't point to an
> io_mm anymore.

Why is so much work needed in release? It really should just be
invalidate all, usually trying to sort out all the locking for the
more complicated stuff is not worthwhile.

If other stuff is implicitly relying on the mm being alive and release
to fence against that then it is already racy. If it doesn't, then why
bother doing complicated work in release?

> > Then you can never get a stale
> > pointer. Don't worry about exit_mmap().
> >
> > release() is an unusual callback and I see a lot of places using it
> > wrong. The purpose of release is to invalidate_all, that is it.
> >
> > Also, confusingly release may be called multiple times in some
> > situations, so it shouldn't disturb anything that might impact a 2nd
> > call.
>
> I hadn't realized that. The current implementation should be safe against
> it, as release() is a nop if the io_mm doesn't have bonds anymore. Do you
> have an example of such a situation? I'm trying to write tests for this
> kind of corner case.

Hmm, let me think. Ah, you have to be using mmu_notifier_unregister()
to get that race. This is one of the things that get/put don't suffer
from - but they conversely don't guarantee that release() will be
called, so it is up to the caller to ensure everything is fenced
before calling put.

Jason
On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > in bind(), and mmu_notifier_put() is called in unbind().
> >
> > Multiple device drivers may call bind() with the same mm. Each bind()
> > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > mmu_notifier_put().
> >
> > That's the most common case. Now if the process is killed and the mm
> > disappears, we do need to avoid use-after-free caused by DMA of the
> > mappings and the page tables.
>
> This is why release must do invalidate all - but it doesn't need to do
> any more - as no SPTE can be established without a mmget() - and
> mmget() is no longer possible past release.

In our case we don't have SPTEs, the whole pgd is shared between MMU and
IOMMU (isolated using PASID tables). Taking the concrete example of the
crypto accelerator:

1. A process opens a queue in the accelerator. That queue is bound to the
   address space: a PASID is allocated for the mm, and mm->pgd is written
   into the IOMMU PASID table.
2. The process queues some work and waits. In the background, the
   accelerator performs DMA on the process address space, by using the
   mm's PASID.
3. Now the process gets killed, and release() is called.

At this point no one told the device to stop working on this queue, it may
still be doing DMA on this address space. So the first thing we do is
notify the device driver that the bond is going away, and that it must
stop the queue and flush remaining DMA transactions for this PASID. Then
we also clear the pgd from the IOMMU PASID table.

If we only did invalidate-all and somehow the queue wasn't properly
stopped, concurrent DMA would immediately form new IOTLB entries since the
page tables haven't been wiped at this point. And later, it would
use-after-free page tables and mappings. Whereas with a cleared pgd it
would just generate IOMMU fault events, which are undesirable but harmless.

Thanks,
Jean

[...]
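The argument in this message, that clearing the pgd from the PASID table turns late DMA into faults rather than accesses to freed page tables, can be sketched with a toy PASID table. Everything here is a hypothetical model (pasid_table_model, model_dma_access(), model_release_pasid()); it does not reflect the SMMUv3 context descriptor format or any driver code:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_PASIDS 4

enum model_dma_result {
	MODEL_DMA_OK,		/* translation would walk the installed pgd */
	MODEL_DMA_FAULT,	/* no pgd installed: the IOMMU reports a fault */
};

/* Toy PASID table: one pgd pointer per PASID, NULL meaning "no pgd". */
struct pasid_table_model {
	const void *pgd[MODEL_NR_PASIDS];
};

/* A DMA access tagged with a PASID only proceeds if a pgd is installed. */
static enum model_dma_result
model_dma_access(const struct pasid_table_model *table, unsigned int pasid)
{
	if (pasid >= MODEL_NR_PASIDS || table->pgd[pasid] == NULL)
		return MODEL_DMA_FAULT;
	return MODEL_DMA_OK;
}

/* release() in this model: clear the pgd so later DMA can only fault. */
static void model_release_pasid(struct pasid_table_model *table,
				unsigned int pasid)
{
	assert(pasid < MODEL_NR_PASIDS);
	table->pgd[pasid] = NULL;
}
```

After model_release_pasid(), a queue that was not stopped in time gets MODEL_DMA_FAULT for every access, which corresponds to the "undesirable but harmless" fault events mentioned above.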
On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote: > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > > mmu_notifier_put(). > > > > > > That's the most common case. Now if the process is killed and the mm > > > disappears, we do need to avoid use-after-free caused by DMA of the > > > mappings and the page tables. > > > > This is why release must do invalidate all - but it doesn't need to do > > any more - as no SPTE can be established without a mmget() - and > > mmget() is no longer possible past release. > > In our case we don't have SPTEs, the whole pgd is shared between MMU and > IOMMU (isolated using PASID tables). Okay, but this just means that 'invalidate all' also requires switching the PASID to use some pgd that is permanently 'all fail'. > At this point no one told the device to stop working on this queue, > it may still be doing DMA on this address space. Sure, but there are lots of cases where a defective user space can cause pages under active DMA to disappear, like munmap for instance. Process exit is really no different, the PASID should take errors and the device & driver should do whatever error flow it has. Involving a complex driver flow in the exit_mmap path seems like dangerous complexity to me. Jason
On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote: > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote: > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > > > mmu_notifier_put(). > > > > > > > > That's the most common case. Now if the process is killed and the mm > > > > disappears, we do need to avoid use-after-free caused by DMA of the > > > > mappings and the page tables. > > > > > > This is why release must do invalidate all - but it doesn't need to do > > > any more - as no SPTE can be established without a mmget() - and > > > mmget() is no longer possible past release. > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and > > IOMMU (isolated using PASID tables). > > Okay, but this just means that 'invalidate all' also requires > switching the PASID to use some pgd that is permanently 'all fail'. > > > At this point no one told the device to stop working on this queue, > > it may still be doing DMA on this address space. > > Sure, but there are lots of cases where a defective user space can > cause pages under active DMA to disappear, like munmap for > instance. Process exit is really no different, the PASID should take > errors and the device & driver should do whatever error flow it has. We do have the possibility to shut things down in order, so to me this feels like a band-aid. The idea has come up before though [1], and I'm not strongly opposed to this model, but I'm still not convinced it's necessary. 
It does add more complexity to IOMMU drivers, to avoid printing out the errors that we wouldn't otherwise see, whereas device drivers need in any case to implement the logic that forces stop DMA. Thanks, Jean [1] https://lore.kernel.org/linux-iommu/4d68da96-0ad5-b412-5987-2f7a6aa796c3@amd.com/ > > Involving a complex driver flow in the exit_mmap path seems like > dangerous complexity to me. > > Jason
On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote: > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote: > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote: > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote: > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote: > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called > > > > > in bind(), and mmu_notifier_put() is called in unbind(). > > > > > > > > > > Multiple device drivers may call bind() with the same mm. Each bind() > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls > > > > > mmu_notifier_put(). > > > > > > > > > > That's the most common case. Now if the process is killed and the mm > > > > > disappears, we do need to avoid use-after-free caused by DMA of the > > > > > mappings and the page tables. > > > > > > > > This is why release must do invalidate all - but it doesn't need to do > > > > any more - as no SPTE can be established without a mmget() - and > > > > mmget() is no longer possible past release. > > > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and > > > IOMMU (isolated using PASID tables). > > > > Okay, but this just means that 'invalidate all' also requires > > switching the PASID to use some pgd that is permanently 'all fail'. > > > > > At this point no one told the device to stop working on this queue, > > > it may still be doing DMA on this address space. > > > > Sure, but there are lots of cases where a defective user space can > > cause pages under active DMA to disappear, like munmap for > > instance. Process exit is really no different, the PASID should take > > errors and the device & driver should do whatever error flow it has. 
>
> We do have the possibility to shut things down in order, so to me this
> feels like a band-aid.

->release() is called by exit_mmap() which is called by mmput(). There are
over 100 callsites to mmput() and I'm not totally sure what the
rules are for release(). We've run into problems before with things
like this.

IMHO, due to this, it is best for release to be simple and have
conservative requirements on context like all the other notifier
callbacks. It is not a good place to put complex HW fencing driver
code.

In particular that link you referenced is suggesting the driver tear
down could take minutes - IMHO it is not OK to block mmput() for
minutes.

> The idea has come up before though [1], and I'm not strongly opposed
> to this model, but I'm still not convinced it's necessary. It does
> add more complexity to IOMMU drivers, to avoid printing out the
> errors that we wouldn't otherwise see, whereas device drivers need
> in any case to implement the logic that forces stop DMA.

Errors should not be printed to the kernel log for PASID cases
anyhow. PASID will be used by unpriv user, and unpriv user should not
be able to trigger kernel prints at will, e.g. by doing dma to nmap VA
or whatever. Process exit is just another case of this, and should not
be treated specially.

Jason
On Wed, 4 Mar 2020 15:08:33 +0100 Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > On Thu, Feb 27, 2020 at 06:17:26PM +0000, Jonathan Cameron wrote: > > On Mon, 24 Feb 2020 19:23:58 +0100 > > Jean-Philippe Brucker <jean-philippe@linaro.org> wrote: > > > > > From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > > > > > The SMMU provides a Stall model for handling page faults in platform > > > devices. It is similar to PCI PRI, but doesn't require devices to have > > > their own translation cache. Instead, faulting transactions are parked and > > > the OS is given a chance to fix the page tables and retry the transaction. > > > > > > Enable stall for devices that support it (opt-in by firmware). When an > > > event corresponds to a translation error, call the IOMMU fault handler. If > > > the fault is recoverable, it will call us back to terminate or continue > > > the stall. > > > > > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > > One question inline. > > > > Thanks, > > > > > --- > > > drivers/iommu/arm-smmu-v3.c | 271 ++++++++++++++++++++++++++++++++++-- > > > drivers/iommu/of_iommu.c | 5 +- > > > include/linux/iommu.h | 2 + > > > 3 files changed, 269 insertions(+), 9 deletions(-) > > > > > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c > > > index 6a5987cce03f..da5dda5ba26a 100644 > > > --- a/drivers/iommu/arm-smmu-v3.c > > > +++ b/drivers/iommu/arm-smmu-v3.c > > > @@ -374,6 +374,13 @@ > > > > > > > +/* > > > + * arm_smmu_flush_evtq - wait until all events currently in the queue have been > > > + * consumed. > > > + * > > > + * Wait until the evtq thread finished a batch, or until the queue is empty. > > > + * Note that we don't handle overflows on q->batch. If it occurs, just wait for > > > + * the queue to be empty. 
> > > + */ > > > +static int arm_smmu_flush_evtq(void *cookie, struct device *dev, int pasid) > > > +{ > > > + int ret; > > > + u64 batch; > > > + struct arm_smmu_device *smmu = cookie; > > > + struct arm_smmu_queue *q = &smmu->evtq.q; > > > + > > > + spin_lock(&q->wq.lock); > > > + if (queue_sync_prod_in(q) == -EOVERFLOW) > > > + dev_err(smmu->dev, "evtq overflow detected -- requests lost\n"); > > > + > > > + batch = q->batch; > > > > So this is trying to be sure we have advanced the queue 2 spots? > > So we call arm_smmu_flush_evtq() before decommissioning a PASID, to make > sure that there aren't any pending event for this PASID languishing in the > fault queues. > > The main test is queue_empty(). If that succeeds then we know that there > aren't any pending event (and the PASID is safe to reuse). But if new > events are constantly added to the queue then we wait for the evtq thread > to handle a full batch, where one batch corresponds to the queue size. For > that we take the batch number when entering flush(), and wait for the evtq > thread to increment it twice. > > > Is there a potential race here? q->batch could have updated before we take > > a local copy. > > Yes we're just checking on the progress of the evtq thread. All accesses > to batch are made while holding the wq lock. > > Flush is a rare event so the lock isn't contended, but the wake_up() that > this patch introduces in arm_smmu_evtq_thread() does add some overhead > (0.85% of arm_smmu_evtq_thread(), according to perf). It would be nice to > get rid of it but I haven't found anything clever yet. > Thanks. Maybe worth a few comments in the code as this is a bit esoteric. Thanks, Jonathan > Thanks, > Jean > > > > > > + ret = wait_event_interruptible_locked(q->wq, queue_empty(&q->llq) || > > > + q->batch >= batch + 2); > > > + spin_unlock(&q->wq.lock); > > > + > > > + return ret; > > > +} > > > + > > ... > >
On Fri, Mar 06, 2020 at 01:42:39PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > > > in bind(), and mmu_notifier_put() is called in unbind().
> > > > > >
> > > > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > > > mmu_notifier_put().
> > > > > >
> > > > > > That's the most common case. Now if the process is killed and the mm
> > > > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > > > mappings and the page tables.
> > > > >
> > > > > This is why release must do invalidate all - but it doesn't need to do
> > > > > any more - as no SPTE can be established without a mmget() - and
> > > > > mmget() is no longer possible past release.
> > > >
> > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > > > IOMMU (isolated using PASID tables).
> > >
> > > Okay, but this just means that 'invalidate all' also requires
> > > switching the PASID to use some pgd that is permanently 'all fail'.
> > >
> > > > At this point no one told the device to stop working on this queue,
> > > > it may still be doing DMA on this address space.
> > >
> > > Sure, but there are lots of cases where a defective user space can
> > > cause pages under active DMA to disappear, like munmap for
> > > instance. Process exit is really no different, the PASID should take
> > > errors and the device & driver should do whatever error flow it has.
> >
> > We do have the possibility to shut things down in order, so to me this
> > feels like a band-aid.
>
> ->release() is called by exit_mmap which is called by mmput. There are
> over 100 callsites to mmput() and I'm not totally sure what the
> rules are for release(). We've run into problems before with things
> like this.

A concrete example of something that could go badly if mmput() takes too
long would greatly help. Otherwise I'll have a hard time justifying the
added complexity.

I wrote a prototype that removes the device driver callback from
release(). It works with SMMUv3, but complicates the PASID descriptor
code, which is already awful with my recent changes and this series.

> IMHO, due to this, it is best for release to be simple and have
> conservative requirements on context like all the other notifier
> callbacks. It is not a good place to put complex HW fencing driver
> code.
>
> In particular that link you referenced is suggesting the driver tear
> down could take minutes - IMHO it is not OK to block mmput() for
> minutes.
>
> > The idea has come up before though [1], and I'm not strongly opposed
> > to this model, but I'm still not convinced it's necessary. It does
> > add more complexity to IOMMU drivers, to avoid printing out the
> > errors that we wouldn't otherwise see, whereas device drivers need
> > in any case to implement the logic that forces stop DMA.
>
> Errors should not be printed to the kernel log for PASID cases
> anyhow. PASID will be used by unpriv user, and unpriv user should not
> be able to trigger kernel prints at will, eg by doing dma to nmap VA
> or whatever.

I agree. There is a difference, though, between invalid mappings and the
absence of a pgd. The former comes from userspace issuing DMA on unmapped
buffers, while the latter is typically a device/driver error which
normally needs to be reported.

On Arm SMMUv3 they are handled differently by the hardware. But instead of
disabling the whole PASID context on mm exit, we can quietly abort
incoming transactions while waiting for unbind().

And I think the other IOMMUs treat an invalid PASID descriptor the same as
an invalid translation table descriptor. At least VT-d quietly returns a
no-translation response to ATS TR and rejects PRI PR. I haven't found the
equivalent in the AMD IOMMU spec yet.

Thanks,
Jean
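The bind()/unbind() pairing quoted at the top of this message can be
sketched as a small userspace model. It is an illustration only:
io_mm_model, bond_model, model_bind() and model_unbind() are hypothetical
names, a fixed table replaces real allocation, and a plain counter stands
in for mmu_notifier_get()/mmu_notifier_put().

```c
#include <stddef.h>

#define MAX_IO_MM 4

/* Hypothetical stand-in for the per-mm structure shared by all bonds. */
struct io_mm_model {
	const void *mm;		/* NULL when the slot is free */
	unsigned int refs;	/* one reference per live bond */
};

/* Hypothetical device<->mm link handed back by bind(). */
struct bond_model {
	int dev_id;
	struct io_mm_model *io_mm;
};

static struct io_mm_model io_mm_slots[MAX_IO_MM];

/* Plays the role of mmu_notifier_get(): reuse the io_mm of a known mm. */
static struct io_mm_model *io_mm_model_get(const void *mm)
{
	struct io_mm_model *free_slot = NULL;
	size_t i;

	if (!mm)
		return NULL;

	for (i = 0; i < MAX_IO_MM; i++) {
		if (io_mm_slots[i].mm == mm) {
			io_mm_slots[i].refs++;
			return &io_mm_slots[i];
		}
		if (!io_mm_slots[i].mm && !free_slot)
			free_slot = &io_mm_slots[i];
	}

	if (free_slot) {
		free_slot->mm = mm;
		free_slot->refs = 1;
	}
	return free_slot;
}

/* Plays the role of mmu_notifier_put(): the last put frees the slot. */
static void io_mm_model_put(struct io_mm_model *io_mm)
{
	if (--io_mm->refs == 0)
		io_mm->mm = NULL;
}

/* Each bind() takes one reference and fills in a new bond. */
static int model_bind(int dev_id, const void *mm, struct bond_model *bond)
{
	struct io_mm_model *io_mm = io_mm_model_get(mm);

	if (!io_mm)
		return -1;
	bond->dev_id = dev_id;
	bond->io_mm = io_mm;
	return 0;
}

/* Each unbind() drops the reference taken by the matching bind(). */
static void model_unbind(struct bond_model *bond)
{
	io_mm_model_put(bond->io_mm);
	bond->io_mm = NULL;
}
```

Two devices binding the same mm end up with two bonds pointing at one
io_mm_model whose count is 2, and the slot is only recycled after both
have called model_unbind(). The model deliberately leaves out the harder
case discussed in this thread, where the mm goes away before unbind().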
On Fri, Mar 13, 2020 at 07:49:29PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 06, 2020 at 01:42:39PM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2020 at 05:15:19PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 06, 2020 at 10:52:45AM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Mar 06, 2020 at 03:35:56PM +0100, Jean-Philippe Brucker wrote:
> > > > > On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > > > > > On Fri, Mar 06, 2020 at 10:56:14AM +0100, Jean-Philippe Brucker wrote:
> > > > > > > I tried to keep it simple like that: normally mmu_notifier_get() is called
> > > > > > > in bind(), and mmu_notifier_put() is called in unbind().
> > > > > > >
> > > > > > > Multiple device drivers may call bind() with the same mm. Each bind()
> > > > > > > calls mmu_notifier_get(), obtains the same io_mm, and returns a new bond
> > > > > > > (a device<->mm link). Each bond is freed by calling unbind(), which calls
> > > > > > > mmu_notifier_put().
> > > > > > >
> > > > > > > That's the most common case. Now if the process is killed and the mm
> > > > > > > disappears, we do need to avoid use-after-free caused by DMA of the
> > > > > > > mappings and the page tables.
> > > > > >
> > > > > > This is why release must do invalidate all - but it doesn't need to do
> > > > > > any more - as no SPTE can be established without a mmget() - and
> > > > > > mmget() is no longer possible past release.
> > > > >
> > > > > In our case we don't have SPTEs, the whole pgd is shared between MMU and
> > > > > IOMMU (isolated using PASID tables).
> > > >
> > > > Okay, but this just means that 'invalidate all' also requires
> > > > switching the PASID to use some pgd that is permanently 'all fail'.
> > > >
> > > > > At this point no one told the device to stop working on this queue,
> > > > > it may still be doing DMA on this address space.
> > > >
> > > > Sure, but there are lots of cases where a defective user space can
> > > > cause pages under active DMA to disappear, like munmap for
> > > > instance. Process exit is really no different, the PASID should take
> > > > errors and the device & driver should do whatever error flow it has.
> > >
> > > We do have the possibility to shut things down in order, so to me this
> > > feels like a band-aid.
> >
> > ->release() is called by exit_mmap which is called by mmput. There are
> > over 100 callsites to mmput() and I'm not totally sure what the
> > rules are for release(). We've run into problems before with things
> > like this.
>
> A concrete example of something that could go badly if mmput() takes too
> long would greatly help. Otherwise I'll have a hard time justifying the
> added complexity.

It is not just 'takes too long', but also accidentally causing locking
problems by running very complex code in the release callback. Unless
you audit all the mmput call sites to define the calling conditions I
can't even say what the risk is here.

Particularly, calling something with impossible to audit locking like
the dma_fence stuff from release is probably impossible to prove safe
and then keep safe.

It is easy enough to see where 'takes too long' can have a bad impact,
mmput is called all over the place. Just in the RDMA code slowing it
down would block ODP page faulting completely for all processes.

This is not acceptable. For this reason release callbacks must be
simple/fast and must have trivial locking.

> > Errors should not be printed to the kernel log for PASID cases
> > anyhow. PASID will be used by unpriv user, and unpriv user should not
> > be able to trigger kernel prints at will, eg by doing dma to nmap VA
> > or whatever.
>
> I agree. There is a difference, though, between invalid mappings and the
> absence of a pgd. The former comes from userspace issuing DMA on unmapped
> buffers, while the latter is typically a device/driver error which
> normally needs to be reported.

Why not make the pgd present as I suggested? Point it at a static
dummy pgd that always fails to page fault during release? Make the pgd
not present only once the PASID is fully destroyed.

That really is the only thing release is supposed to mean -> unmap all
VAs.

Jason
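The three states this suggestion implies (real pgd while bound, dummy pgd
after release, no pgd after unbind) can be written down as a tiny model.
This is a sketch under stated assumptions and not SMMUv3 code:
pasid_ctx_model, dummy_pgd and the xlate_result values are hypothetical
names, and a counter stands in for messages that would reach the kernel
log.

```c
#include <stddef.h>

enum xlate_result {
	XLATE_OK,		/* walked the process page tables */
	XLATE_QUIET_FAULT,	/* dummy pgd: fails without a log message */
	XLATE_REPORTED_ERROR,	/* no pgd at all: treated as a driver bug */
};

/* Hypothetical per-PASID context. */
struct pasid_ctx_model {
	const void *pgd;	/* NULL means no pgd is installed */
	unsigned int reported;	/* errors that would be logged */
};

/* Static dummy pgd; only its address matters in this model. */
static const char dummy_pgd = 0;

/* bind(): install the pgd of the process. */
static void pasid_model_bind(struct pasid_ctx_model *ctx, const void *pgd)
{
	ctx->pgd = pgd;
}

/* mm release: keep the context present, but make every access fail. */
static void pasid_model_release(struct pasid_ctx_model *ctx)
{
	ctx->pgd = &dummy_pgd;
}

/* unbind(): the PASID is fully destroyed, so the pgd becomes absent. */
static void pasid_model_unbind(struct pasid_ctx_model *ctx)
{
	ctx->pgd = NULL;
}

/* Outcome of a DMA access arriving in each state. */
static enum xlate_result pasid_model_translate(struct pasid_ctx_model *ctx)
{
	if (!ctx->pgd) {
		ctx->reported++;
		return XLATE_REPORTED_ERROR;
	}
	if (ctx->pgd == &dummy_pgd)
		return XLATE_QUIET_FAULT;
	return XLATE_OK;
}
```

In this model, DMA that arrives between release and unbind fails quietly,
so an unprivileged process cannot fill the log by exiting with DMA in
flight, while an access after unbind still counts as a reportable error.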
On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> This is why release must do invalidate all - but it doesn't need to do
> any more - as no SPTE can be established without a mmget() - and
> mmget() is no longer possible past release.
Maybe we should rename the release method to invalidate_all?
On Mon, Mar 16, 2020 at 08:46:59AM -0700, Christoph Hellwig wrote:
> On Fri, Mar 06, 2020 at 09:09:19AM -0400, Jason Gunthorpe wrote:
> > This is why release must do invalidate all - but it doesn't need to do
> > any more - as no SPTE can be established without a mmget() - and
> > mmget() is no longer possible past release.
>
> Maybe we should rename the release method to invalidate_all?
It is a better name. The function must also fence future access if
the mirror is not using mmget(), and stop using the pgd/etc pointer if
the page tables are accessed directly.
Jason
Hi,

On Mon, Feb 24, 2020 at 07:23:56PM +0100, Jean-Philippe Brucker wrote:
> When a device or driver misbehaves, it is possible to receive events
> much faster than we can print them out. Ratelimit the printing of
> events.
>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>

Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>

> During the SVA tests when the device driver didn't properly stop DMA
> before unbinding, the event queue thread would almost lock-up the server
> with a flood of event 0xa. This patch helped recover from the error.

I was just debugging a similar case, and this patch was required to
prevent the system from locking up.

Could you please resend this patch independently from the other patches
in the series, as it seems it's a worthwhile fix and still relevant for
current kernels. Thanks,

A.

> ---
>  drivers/iommu/arm-smmu-v3.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 28f8583cd47b..6a5987cce03f 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -2243,17 +2243,20 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
>  	struct arm_smmu_device *smmu = dev;
>  	struct arm_smmu_queue *q = &smmu->evtq.q;
>  	struct arm_smmu_ll_queue *llq = &q->llq;
> +	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
>  	u64 evt[EVTQ_ENT_DWORDS];
>
>  	do {
>  		while (!queue_remove_raw(q, evt)) {
>  			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
>
> -			dev_info(smmu->dev, "event 0x%02x received:\n", id);
> -			for (i = 0; i < ARRAY_SIZE(evt); ++i)
> -				dev_info(smmu->dev, "\t0x%016llx\n",
> -					 (unsigned long long)evt[i]);
> -
> +			if (__ratelimit(&rs)) {
> +				dev_info(smmu->dev, "event 0x%02x received:\n", id);
> +				for (i = 0; i < ARRAY_SIZE(evt); ++i)
> +					dev_info(smmu->dev, "\t0x%016llx\n",
> +						 (unsigned long long)evt[i]);
> +			}
>  		}
>
>  		/*
Hi Aaro,

On Fri, May 28, 2021 at 11:09:58AM +0300, Aaro Koskinen wrote:
> Hi,
>
> On Mon, Feb 24, 2020 at 07:23:56PM +0100, Jean-Philippe Brucker wrote:
> > When a device or driver misbehaves, it is possible to receive events
> > much faster than we can print them out. Ratelimit the printing of
> > events.
> >
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
>
> Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>
>
> > During the SVA tests when the device driver didn't properly stop DMA
> > before unbinding, the event queue thread would almost lock-up the server
> > with a flood of event 0xa. This patch helped recover from the error.
>
> I was just debugging a similar case, and this patch was required to
> prevent system from locking up.
>
> Could you please resend this patch independently from the other patches
> in the series, as it seems it's a worthwhile fix and still relevant for
> current kernels. Thanks,

Ok, I'll resend it.

Thanks,
Jean

>
> A.
>
> > ---
> >  drivers/iommu/arm-smmu-v3.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 28f8583cd47b..6a5987cce03f 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -2243,17 +2243,20 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
> >  	struct arm_smmu_device *smmu = dev;
> >  	struct arm_smmu_queue *q = &smmu->evtq.q;
> >  	struct arm_smmu_ll_queue *llq = &q->llq;
> > +	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
> > +				      DEFAULT_RATELIMIT_BURST);
> >  	u64 evt[EVTQ_ENT_DWORDS];
> >
> >  	do {
> >  		while (!queue_remove_raw(q, evt)) {
> >  			u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
> >
> > -			dev_info(smmu->dev, "event 0x%02x received:\n", id);
> > -			for (i = 0; i < ARRAY_SIZE(evt); ++i)
> > -				dev_info(smmu->dev, "\t0x%016llx\n",
> > -					 (unsigned long long)evt[i]);
> > -
> > +			if (__ratelimit(&rs)) {
> > +				dev_info(smmu->dev, "event 0x%02x received:\n", id);
> > +				for (i = 0; i < ARRAY_SIZE(evt); ++i)
> > +					dev_info(smmu->dev, "\t0x%016llx\n",
> > +						 (unsigned long long)evt[i]);
> > +			}
> >  		}
> >
> >  		/*
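The quoted patch relies on the kernel's __ratelimit() helper. For readers
who want to see the interval/burst idea in isolation, here is a userspace
sketch. It is not the kernel implementation: ratelimit_model and
ratelimit_model_allow() are hypothetical names, time is passed in
explicitly as milliseconds, and the suppressed-message accounting is
simplified.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace analogue of a ratelimit state. */
struct ratelimit_model {
	uint64_t interval_ms;	/* length of one window */
	unsigned int burst;	/* messages allowed per window */
	uint64_t window_start;	/* start time of the current window */
	unsigned int printed;	/* messages allowed in this window */
	unsigned int missed;	/* messages suppressed so far (cumulative) */
	bool started;		/* false until the first call */
};

/* Return true if a message arriving at now_ms may be printed. */
static bool ratelimit_model_allow(struct ratelimit_model *rs, uint64_t now_ms)
{
	/* Open a new window on first use or once the interval has elapsed. */
	if (!rs->started || now_ms - rs->window_start >= rs->interval_ms) {
		rs->window_start = now_ms;
		rs->printed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

With interval_ms = 5000 and burst = 10 (which, as far as I know, matches
the kernel defaults of 5 * HZ and 10), a flood of 25 events in the same
window lets 10 through and suppresses 15, and printing resumes once the
window has elapsed. Events are still consumed from the queue in the patch;
only the printing is skipped.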