* [PATCH v6 0/3] arm64: IOMMU-backed DMA mapping
@ 2015-10-01 19:13 ` Robin Murphy
  0 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-01 19:13 UTC (permalink / raw)
  To: joro@8bytes.org, will.deacon@arm.com, catalin.marinas@arm.com
  Cc: laurent.pinchart+renesas@ideasonboard.com,
	iommu@lists.linux-foundation.org, djkurtz@chromium.org,
	linux-arm-kernel@lists.infradead.org, thunder.leizhen@huawei.com,
	yingjoe.chen@mediatek.com, treding@nvidia.com

Hi all,

Here's the latest, and hopefully last, revision of the initial arm64
IOMMU dma_ops support.

There are a couple of dependencies still currently in -next and the
intel-iommu tree[0]: "iommu: iova: Move iova cache management to the
iova library" is necessary for the rename of iova_cache_get(), and
"iommu/iova: Avoid over-allocating when size-aligned" will be needed
with some IOMMU drivers to prevent unmapping errors.

Changes from v5[1]:
- Change __iommu_dma_unmap() from BUG to WARN when things go wrong, and
  prevent a NULL dereference on double-free.
- Fix iommu_dma_map_sg() to ensure segments can never inadvertently end
  up mapped across a segment boundary (a short worked example of the
  padding scheme follows this list). As a result, we have to lose the
  segment-merging optimisation from before (I might revisit that if
  there's some evidence it's really worthwhile, though).
- Cleaned up the platform device workarounds for config order and
  default domains, and removed the other hacks. Demanding that the IOMMU
  drivers assign groups, and support IOMMU_DOMAIN_DMA via the methods
  provided, keeps things bearable, and the behaviour should now be
  consistent across all cases.
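
To illustrate the padding scheme in that map_sg change (the numbers here
are invented for the example, not taken from the patch): suppose 12KB of
IOVA space has been accounted for so far, and the next segment is 8KB long
after granule-alignment. Then, per the code in iommu_dma_map_sg():

	pad_len = roundup_pow_of_two(8K)  = 8K
	pad_len = (8K - 12K) & (8K - 1)   = 4K

so the previous segment is grown by 4KB and the 8KB segment starts at
offset 16KB into the (size-aligned) IOVA allocation, i.e. naturally
aligned to its own size, and therefore cannot straddle any boundary mask
of 8KB or larger.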

As a bonus, whilst the underlying of_iommu_configure() code only supports
platform devices at the moment, I can also say that this has now been
tested to work for PCI devices too, via some horrible hacks on a Juno r1.

Thanks,
Robin.

[0]:http://thread.gmane.org/gmane.linux.kernel.iommu/11033
[1]:http://thread.gmane.org/gmane.linux.kernel.iommu/10439

Robin Murphy (3):
  iommu: Implement common IOMMU ops for DMA mapping
  arm64: Add IOMMU dma_ops
  arm64: Hook up IOMMU dma_ops

 arch/arm64/Kconfig                   |   1 +
 arch/arm64/include/asm/dma-mapping.h |  15 +-
 arch/arm64/mm/dma-mapping.c          | 457 ++++++++++++++++++++++++++++++
 drivers/iommu/Kconfig                |   7 +
 drivers/iommu/Makefile               |   1 +
 drivers/iommu/dma-iommu.c            | 524 +++++++++++++++++++++++++++++++++++
 include/linux/dma-iommu.h            |  85 ++++++
 include/linux/iommu.h                |   1 +
 8 files changed, 1083 insertions(+), 8 deletions(-)
 create mode 100644 drivers/iommu/dma-iommu.c
 create mode 100644 include/linux/dma-iommu.h

-- 
1.9.1

* [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-10-01 19:13 ` Robin Murphy
@ 2015-10-01 19:13     ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-01 19:13 UTC (permalink / raw)
  To: joro@8bytes.org, will.deacon@arm.com, catalin.marinas@arm.com
  Cc: laurent.pinchart+renesas@ideasonboard.com,
	iommu@lists.linux-foundation.org, djkurtz@chromium.org,
	linux-arm-kernel@lists.infradead.org, thunder.leizhen@huawei.com,
	yingjoe.chen@mediatek.com, treding@nvidia.com

Taking inspiration from the existing arch/arm code, break out some
generic functions to interface the DMA-API to the IOMMU-API. This will
do the bulk of the heavy lifting for IOMMU-backed dma-mapping.

Since associating an IOVA allocator with an IOMMU domain is a fairly
common need, rather than introduce yet another private structure just to
do this for ourselves, extend the top-level struct iommu_domain with the
notion. A simple opaque cookie allows reuse by other IOMMU API users
with their various different incompatible allocator types.
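
For context, below is a minimal sketch of how an IOMMU driver might wire
these helpers into its domain_alloc/domain_free callbacks, along the lines
suggested by the kerneldoc comments; "my_domain" and the surrounding
driver structure are hypothetical and not part of this series:

	struct my_domain {
		struct iommu_domain	domain;
		/* driver-private page table state, etc. */
	};

	static struct iommu_domain *my_domain_alloc(unsigned type)
	{
		struct my_domain *md;

		if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
			return NULL;

		md = kzalloc(sizeof(*md), GFP_KERNEL);
		if (!md)
			return NULL;

		/* DMA-API domains get an IOVA allocator cookie attached */
		if (type == IOMMU_DOMAIN_DMA &&
		    iommu_get_dma_cookie(&md->domain)) {
			kfree(md);
			return NULL;
		}
		return &md->domain;
	}

	static void my_domain_free(struct iommu_domain *domain)
	{
		struct my_domain *md = container_of(domain, struct my_domain,
						    domain);

		iommu_put_dma_cookie(domain);	/* harmless if no cookie was set */
		kfree(md);
	}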

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/Kconfig     |   7 +
 drivers/iommu/Makefile    |   1 +
 drivers/iommu/dma-iommu.c | 524 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dma-iommu.h |  85 ++++++++
 include/linux/iommu.h     |   1 +
 5 files changed, 618 insertions(+)
 create mode 100644 drivers/iommu/dma-iommu.c
 create mode 100644 include/linux/dma-iommu.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 3dc1bcb..27d4d4b 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -48,6 +48,13 @@ config OF_IOMMU
        def_bool y
        depends on OF && IOMMU_API
 
+# IOMMU-agnostic DMA-mapping layer
+config IOMMU_DMA
+	bool
+	depends on NEED_SG_DMA_LENGTH
+	select IOMMU_API
+	select IOMMU_IOVA
+
 config FSL_PAMU
 	bool "Freescale IOMMU support"
 	depends on PPC32
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c6dcc51..f465cfb 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,6 +1,7 @@
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
+obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
 obj-$(CONFIG_IOMMU_IOVA) += iova.o
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
new file mode 100644
index 0000000..3a20db4
--- /dev/null
+++ b/drivers/iommu/dma-iommu.c
@@ -0,0 +1,524 @@
+/*
+ * A fairly generic DMA-API to IOMMU-API glue layer.
+ *
+ * Copyright (C) 2014-2015 ARM Ltd.
+ *
+ * based in part on arch/arm/mm/dma-mapping.c:
+ * Copyright (C) 2000-2004 Russell King
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/device.h>
+#include <linux/dma-iommu.h>
+#include <linux/huge_mm.h>
+#include <linux/iommu.h>
+#include <linux/iova.h>
+#include <linux/mm.h>
+
+int iommu_dma_init(void)
+{
+	return iova_cache_get();
+}
+
+/**
+ * iommu_get_dma_cookie - Acquire DMA-API resources for a domain
+ * @domain: IOMMU domain to prepare for DMA-API usage
+ *
+ * IOMMU drivers should normally call this from their domain_alloc
+ * callback when domain->type == IOMMU_DOMAIN_DMA.
+ */
+int iommu_get_dma_cookie(struct iommu_domain *domain)
+{
+	struct iova_domain *iovad;
+
+	if (domain->iova_cookie)
+		return -EEXIST;
+
+	iovad = kzalloc(sizeof(*iovad), GFP_KERNEL);
+	domain->iova_cookie = iovad;
+
+	return iovad ? 0 : -ENOMEM;
+}
+EXPORT_SYMBOL(iommu_get_dma_cookie);
+
+/**
+ * iommu_put_dma_cookie - Release a domain's DMA mapping resources
+ * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
+ *
+ * IOMMU drivers should normally call this from their domain_free callback.
+ */
+void iommu_put_dma_cookie(struct iommu_domain *domain)
+{
+	struct iova_domain *iovad = domain->iova_cookie;
+
+	if (!iovad)
+		return;
+
+	put_iova_domain(iovad);
+	kfree(iovad);
+	domain->iova_cookie = NULL;
+}
+EXPORT_SYMBOL(iommu_put_dma_cookie);
+
+/**
+ * iommu_dma_init_domain - Initialise a DMA mapping domain
+ * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
+ * @base: IOVA at which the mappable address space starts
+ * @size: Size of IOVA space
+ *
+ * @base and @size should be exact multiples of IOMMU page granularity to
+ * avoid rounding surprises. If necessary, we reserve the page at address 0
+ * to ensure it is an invalid IOVA. It is safe to reinitialise a domain, but
+ * any change which could make prior IOVAs invalid will fail.
+ */
+int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, u64 size)
+{
+	struct iova_domain *iovad = domain->iova_cookie;
+	unsigned long order, base_pfn, end_pfn;
+
+	if (!iovad)
+		return -ENODEV;
+
+	/* Use the smallest supported page size for IOVA granularity */
+	order = __ffs(domain->ops->pgsize_bitmap);
+	base_pfn = max_t(unsigned long, 1, base >> order);
+	end_pfn = (base + size - 1) >> order;
+
+	/* Check the domain allows at least some access to the device... */
+	if (domain->geometry.force_aperture) {
+		if (base > domain->geometry.aperture_end ||
+		    base + size <= domain->geometry.aperture_start) {
+			pr_warn("specified DMA range outside IOMMU capability\n");
+			return -EFAULT;
+		}
+		/* ...then finally give it a kicking to make sure it fits */
+		base_pfn = max_t(unsigned long, base_pfn,
+				domain->geometry.aperture_start >> order);
+		end_pfn = min_t(unsigned long, end_pfn,
+				domain->geometry.aperture_end >> order);
+	}
+
+	/* All we can safely do with an existing domain is enlarge it */
+	if (iovad->start_pfn) {
+		if (1UL << order != iovad->granule ||
+		    base_pfn != iovad->start_pfn ||
+		    end_pfn < iovad->dma_32bit_pfn) {
+			pr_warn("Incompatible range for DMA domain\n");
+			return -EFAULT;
+		}
+		iovad->dma_32bit_pfn = end_pfn;
+	} else {
+		init_iova_domain(iovad, 1UL << order, base_pfn, end_pfn);
+	}
+	return 0;
+}
+EXPORT_SYMBOL(iommu_dma_init_domain);
+
+/**
+ * dma_direction_to_prot - Translate DMA API directions to IOMMU API page flags
+ * @dir: Direction of DMA transfer
+ * @coherent: Is the DMA master cache-coherent?
+ *
+ * Return: corresponding IOMMU API page protection flags
+ */
+int dma_direction_to_prot(enum dma_data_direction dir, bool coherent)
+{
+	int prot = coherent ? IOMMU_CACHE : 0;
+
+	switch (dir) {
+	case DMA_BIDIRECTIONAL:
+		return prot | IOMMU_READ | IOMMU_WRITE;
+	case DMA_TO_DEVICE:
+		return prot | IOMMU_READ;
+	case DMA_FROM_DEVICE:
+		return prot | IOMMU_WRITE;
+	default:
+		return 0;
+	}
+}
+
+static struct iova *__alloc_iova(struct iova_domain *iovad, size_t size,
+		dma_addr_t dma_limit)
+{
+	unsigned long shift = iova_shift(iovad);
+	unsigned long length = iova_align(iovad, size) >> shift;
+
+	/*
+	 * Enforce size-alignment to be safe - there could perhaps be an
+	 * attribute to control this per-device, or at least per-domain...
+	 */
+	return alloc_iova(iovad, length, dma_limit >> shift, true);
+}
+
+/* The IOVA allocator knows what we mapped, so just unmap whatever that was */
+static void __iommu_dma_unmap(struct iommu_domain *domain, dma_addr_t dma_addr)
+{
+	struct iova_domain *iovad = domain->iova_cookie;
+	unsigned long shift = iova_shift(iovad);
+	unsigned long pfn = dma_addr >> shift;
+	struct iova *iova = find_iova(iovad, pfn);
+	size_t size;
+
+	if (WARN_ON(!iova))
+		return;
+
+	size = iova_size(iova) << shift;
+	size -= iommu_unmap(domain, pfn << shift, size);
+	/* ...and if we can't, then something is horribly, horribly wrong */
+	WARN_ON(size > 0);
+	__free_iova(iovad, iova);
+}
+
+static void __iommu_dma_free_pages(struct page **pages, int count)
+{
+	while (count--)
+		__free_page(pages[count]);
+	kvfree(pages);
+}
+
+static struct page **__iommu_dma_alloc_pages(unsigned int count, gfp_t gfp)
+{
+	struct page **pages;
+	unsigned int i = 0, array_size = count * sizeof(*pages);
+
+	if (array_size <= PAGE_SIZE)
+		pages = kzalloc(array_size, GFP_KERNEL);
+	else
+		pages = vzalloc(array_size);
+	if (!pages)
+		return NULL;
+
+	/* IOMMU can map any pages, so highmem can also be used here */
+	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
+
+	while (count) {
+		struct page *page = NULL;
+		int j, order = __fls(count);
+
+		/*
+		 * Higher-order allocations are a convenience rather
+		 * than a necessity, hence using __GFP_NORETRY until
+		 * falling back to single-page allocations.
+		 */
+		for (order = min(order, MAX_ORDER); order > 0; order--) {
+			page = alloc_pages(gfp | __GFP_NORETRY, order);
+			if (!page)
+				continue;
+			if (PageCompound(page)) {
+				if (!split_huge_page(page))
+					break;
+				__free_pages(page, order);
+			} else {
+				split_page(page, order);
+				break;
+			}
+		}
+		if (!page)
+			page = alloc_page(gfp);
+		if (!page) {
+			__iommu_dma_free_pages(pages, i);
+			return NULL;
+		}
+		j = 1 << order;
+		count -= j;
+		while (j--)
+			pages[i++] = page++;
+	}
+	return pages;
+}
+
+/**
+ * iommu_dma_free - Free a buffer allocated by iommu_dma_alloc()
+ * @dev: Device which owns this buffer
+ * @pages: Array of buffer pages as returned by iommu_dma_alloc()
+ * @size: Size of buffer in bytes
+ * @handle: DMA address of buffer
+ *
+ * Frees both the pages associated with the buffer, and the array
+ * describing them
+ */
+void iommu_dma_free(struct device *dev, struct page **pages, size_t size,
+		dma_addr_t *handle)
+{
+	__iommu_dma_unmap(iommu_get_domain_for_dev(dev), *handle);
+	__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+	*handle = DMA_ERROR_CODE;
+}
+
+/**
+ * iommu_dma_alloc - Allocate and map a buffer contiguous in IOVA space
+ * @dev: Device to allocate memory for. Must be a real device
+ *	 attached to an iommu_dma_domain
+ * @size: Size of buffer in bytes
+ * @gfp: Allocation flags
+ * @prot: IOMMU mapping flags
+ * @handle: Out argument for allocated DMA handle
+ * @flush_page: Arch callback which must ensure PAGE_SIZE bytes from the
+ *		given VA/PA are visible to the given non-coherent device.
+ *
+ * If @size is less than PAGE_SIZE, then a full CPU page will be allocated,
+ * but an IOMMU which supports smaller pages might not map the whole thing.
+ *
+ * Return: Array of struct page pointers describing the buffer,
+ *	   or NULL on failure.
+ */
+struct page **iommu_dma_alloc(struct device *dev, size_t size,
+		gfp_t gfp, int prot, dma_addr_t *handle,
+		void (*flush_page)(struct device *, const void *, phys_addr_t))
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct iova_domain *iovad = domain->iova_cookie;
+	struct iova *iova;
+	struct page **pages;
+	struct sg_table sgt;
+	dma_addr_t dma_addr;
+	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+	*handle = DMA_ERROR_CODE;
+
+	pages = __iommu_dma_alloc_pages(count, gfp);
+	if (!pages)
+		return NULL;
+
+	iova = __alloc_iova(iovad, size, dev->coherent_dma_mask);
+	if (!iova)
+		goto out_free_pages;
+
+	size = iova_align(iovad, size);
+	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
+		goto out_free_iova;
+
+	if (!(prot & IOMMU_CACHE)) {
+		struct sg_mapping_iter miter;
+		/*
+		 * The CPU-centric flushing implied by SG_MITER_TO_SG isn't
+		 * sufficient here, so skip it by using the "wrong" direction.
+		 */
+		sg_miter_start(&miter, sgt.sgl, sgt.orig_nents, SG_MITER_FROM_SG);
+		while (sg_miter_next(&miter))
+			flush_page(dev, miter.addr, page_to_phys(miter.page));
+		sg_miter_stop(&miter);
+	}
+
+	dma_addr = iova_dma_addr(iovad, iova);
+	if (iommu_map_sg(domain, dma_addr, sgt.sgl, sgt.orig_nents, prot)
+			< size)
+		goto out_free_sg;
+
+	*handle = dma_addr;
+	sg_free_table(&sgt);
+	return pages;
+
+out_free_sg:
+	sg_free_table(&sgt);
+out_free_iova:
+	__free_iova(iovad, iova);
+out_free_pages:
+	__iommu_dma_free_pages(pages, count);
+	return NULL;
+}
+
+/**
+ * iommu_dma_mmap - Map a buffer into provided user VMA
+ * @pages: Array representing buffer from iommu_dma_alloc()
+ * @size: Size of buffer in bytes
+ * @vma: VMA describing requested userspace mapping
+ *
+ * Maps the pages of the buffer in @pages into @vma. The caller is responsible
+ * for verifying the correct size and protection of @vma beforehand.
+ */
+
+int iommu_dma_mmap(struct page **pages, size_t size, struct vm_area_struct *vma)
+{
+	unsigned long uaddr = vma->vm_start;
+	unsigned int i, count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	int ret = -ENXIO;
+
+	for (i = vma->vm_pgoff; i < count && uaddr < vma->vm_end; i++) {
+		ret = vm_insert_page(vma, uaddr, pages[i]);
+		if (ret)
+			break;
+		uaddr += PAGE_SIZE;
+	}
+	return ret;
+}
+
+dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
+		unsigned long offset, size_t size, int prot)
+{
+	dma_addr_t dma_addr;
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct iova_domain *iovad = domain->iova_cookie;
+	phys_addr_t phys = page_to_phys(page) + offset;
+	size_t iova_off = iova_offset(iovad, phys);
+	size_t len = iova_align(iovad, size + iova_off);
+	struct iova *iova = __alloc_iova(iovad, len, dma_get_mask(dev));
+
+	if (!iova)
+		return DMA_ERROR_CODE;
+
+	dma_addr = iova_dma_addr(iovad, iova);
+	if (iommu_map(domain, dma_addr, phys - iova_off, len, prot)) {
+		__free_iova(iovad, iova);
+		return DMA_ERROR_CODE;
+	}
+	return dma_addr + iova_off;
+}
+
+void iommu_dma_unmap_page(struct device *dev, dma_addr_t handle, size_t size,
+		enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+	__iommu_dma_unmap(iommu_get_domain_for_dev(dev), handle);
+}
+
+/*
+ * Prepare a successfully-mapped scatterlist to give back to the caller.
+ * Handling IOVA concatenation can come later, if needed
+ */
+static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
+		dma_addr_t dma_addr)
+{
+	struct scatterlist *s;
+	int i;
+
+	for_each_sg(sg, s, nents, i) {
+		/* Un-swizzling the fields here, hence the naming mismatch */
+		unsigned int s_offset = sg_dma_address(s);
+		unsigned int s_length = sg_dma_len(s);
+		unsigned int s_dma_len = s->length;
+
+		s->offset = s_offset;
+		s->length = s_length;
+		sg_dma_address(s) = dma_addr + s_offset;
+		dma_addr += s_dma_len;
+	}
+	return i;
+}
+
+/*
+ * If mapping failed, then just restore the original list,
+ * but making sure the DMA fields are invalidated.
+ */
+static void __invalidate_sg(struct scatterlist *sg, int nents)
+{
+	struct scatterlist *s;
+	int i;
+
+	for_each_sg(sg, s, nents, i) {
+		if (sg_dma_address(s) != DMA_ERROR_CODE)
+			s->offset = sg_dma_address(s);
+		if (sg_dma_len(s))
+			s->length = sg_dma_len(s);
+		sg_dma_address(s) = DMA_ERROR_CODE;
+		sg_dma_len(s) = 0;
+	}
+}
+
+/*
+ * The DMA API client is passing in a scatterlist which could describe
+ * any old buffer layout, but the IOMMU API requires everything to be
+ * aligned to IOMMU pages. Hence the need for this complicated bit of
+ * impedance-matching, to be able to hand off a suitably-aligned list,
+ * but still preserve the original offsets and sizes for the caller.
+ */
+int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
+		int nents, int prot)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct iova_domain *iovad = domain->iova_cookie;
+	struct iova *iova;
+	struct scatterlist *s, *prev = NULL;
+	dma_addr_t dma_addr;
+	size_t iova_len = 0;
+	int i;
+
+	/*
+	 * Work out how much IOVA space we need, and align the segments to
+	 * IOVA granules for the IOMMU driver to handle. With some clever
+	 * trickery we can modify the list in-place, but reversibly, by
+	 * hiding the original data in the as-yet-unused DMA fields.
+	 */
+	for_each_sg(sg, s, nents, i) {
+		size_t s_offset = iova_offset(iovad, s->offset);
+		size_t s_length = s->length;
+
+		sg_dma_address(s) = s->offset;
+		sg_dma_len(s) = s_length;
+		s->offset -= s_offset;
+		s_length = iova_align(iovad, s_length + s_offset);
+		s->length = s_length;
+
+		/*
+		 * The simple way to avoid the rare case of a segment
+		 * crossing the boundary mask is to pad the previous one
+		 * to end at a naturally-aligned IOVA for this one's size,
+		 * at the cost of potentially over-allocating a little.
+		 */
+		if (prev) {
+			size_t pad_len = roundup_pow_of_two(s_length);
+
+			pad_len = (pad_len - iova_len) & (pad_len - 1);
+			prev->length += pad_len;
+			iova_len += pad_len;
+		}
+
+		iova_len += s_length;
+		prev = s;
+	}
+
+	iova = __alloc_iova(iovad, iova_len, dma_get_mask(dev));
+	if (!iova)
+		goto out_restore_sg;
+
+	/*
+	 * We'll leave any physical concatenation to the IOMMU driver's
+	 * implementation - it knows better than we do.
+	 */
+	dma_addr = iova_dma_addr(iovad, iova);
+	if (iommu_map_sg(domain, dma_addr, sg, nents, prot) < iova_len)
+		goto out_free_iova;
+
+	return __finalise_sg(dev, sg, nents, dma_addr);
+
+out_free_iova:
+	__free_iova(iovad, iova);
+out_restore_sg:
+	__invalidate_sg(sg, nents);
+	return 0;
+}
+
+void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+		enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+	/*
+	 * The scatterlist segments are mapped into a single
+	 * contiguous IOVA allocation, so this is incredibly easy.
+	 */
+	__iommu_dma_unmap(iommu_get_domain_for_dev(dev), sg_dma_address(sg));
+}
+
+int iommu_dma_supported(struct device *dev, u64 mask)
+{
+	/*
+	 * 'Special' IOMMUs which don't have the same addressing capability
+	 * as the CPU will have to wait until we have some way to query that
+	 * before they'll be able to use this framework.
+	 */
+	return 1;
+}
+
+int iommu_dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
+{
+	return dma_addr == DMA_ERROR_CODE;
+}
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
new file mode 100644
index 0000000..fc48103
--- /dev/null
+++ b/include/linux/dma-iommu.h
@@ -0,0 +1,85 @@
+/*
+ * Copyright (C) 2014-2015 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef __DMA_IOMMU_H
+#define __DMA_IOMMU_H
+
+#ifdef __KERNEL__
+#include <asm/errno.h>
+
+#ifdef CONFIG_IOMMU_DMA
+#include <linux/iommu.h>
+
+int iommu_dma_init(void);
+
+/* Domain management interface for IOMMU drivers */
+int iommu_get_dma_cookie(struct iommu_domain *domain);
+void iommu_put_dma_cookie(struct iommu_domain *domain);
+
+/* Setup call for arch DMA mapping code */
+int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, u64 size);
+
+/* General helpers for DMA-API <-> IOMMU-API interaction */
+int dma_direction_to_prot(enum dma_data_direction dir, bool coherent);
+
+/*
+ * These implement the bulk of the relevant DMA mapping callbacks, but require
+ * the arch code to take care of attributes and cache maintenance
+ */
+struct page **iommu_dma_alloc(struct device *dev, size_t size,
+		gfp_t gfp, int prot, dma_addr_t *handle,
+		void (*flush_page)(struct device *, const void *, phys_addr_t));
+void iommu_dma_free(struct device *dev, struct page **pages, size_t size,
+		dma_addr_t *handle);
+
+int iommu_dma_mmap(struct page **pages, size_t size, struct vm_area_struct *vma);
+
+dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
+		unsigned long offset, size_t size, int prot);
+int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
+		int nents, int prot);
+
+/*
+ * Arch code with no special attribute handling may use these
+ * directly as DMA mapping callbacks for simplicity
+ */
+void iommu_dma_unmap_page(struct device *dev, dma_addr_t handle, size_t size,
+		enum dma_data_direction dir, struct dma_attrs *attrs);
+void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+		enum dma_data_direction dir, struct dma_attrs *attrs);
+int iommu_dma_supported(struct device *dev, u64 mask);
+int iommu_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
+
+#else
+
+struct iommu_domain;
+
+static inline int iommu_dma_init(void)
+{
+	return 0;
+}
+
+static inline int iommu_get_dma_cookie(struct iommu_domain *domain)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_put_dma_cookie(struct iommu_domain *domain)
+{
+}
+
+#endif	/* CONFIG_IOMMU_DMA */
+#endif	/* __KERNEL__ */
+#endif	/* __DMA_IOMMU_H */
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f9c1b6d..f174506 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -81,6 +81,7 @@ struct iommu_domain {
 	iommu_fault_handler_t handler;
 	void *handler_token;
 	struct iommu_domain_geometry geometry;
+	void *iova_cookie;
 };
 
 enum iommu_cap {
-- 
1.9.1

* [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-01 19:13 ` Robin Murphy
@ 2015-10-01 19:13     ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-01 19:13 UTC (permalink / raw)
  To: joro@8bytes.org, will.deacon@arm.com, catalin.marinas@arm.com
  Cc: laurent.pinchart+renesas@ideasonboard.com,
	iommu@lists.linux-foundation.org, djkurtz@chromium.org,
	linux-arm-kernel@lists.infradead.org, thunder.leizhen@huawei.com,
	yingjoe.chen@mediatek.com, treding@nvidia.com

Taking some inspiration from the arch/arm code, implement the
arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.

Since there is still work to do elsewhere to make DMA configuration happen
in a more appropriate order and properly support platform devices in the
IOMMU core, the device setup code unfortunately starts out carrying some
workarounds to ensure it works correctly in the current state of things.
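
For context, once a device has been given these ops, a consumer driver
just uses the standard DMA API; the sketch below (hypothetical driver
code, for illustration only) notes which of the new callbacks each call
ends up in:

	static int example_dma_use(struct device *dev, void *data, size_t len)
	{
		dma_addr_t dma;

		/* dma_map_single() resolves to __iommu_map_page() */
		dma = dma_map_single(dev, data, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, dma))	/* iommu_dma_mapping_error() */
			return -ENOMEM;

		/* ... point the device at "dma" and let it run ... */

		/* dma_unmap_single() resolves to __iommu_unmap_page() */
		dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
		return 0;
	}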

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 435 insertions(+)

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 0bcc4bc..dd2d6e6 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -533,3 +533,438 @@ static int __init dma_debug_do_init(void)
 	return 0;
 }
 fs_initcall(dma_debug_do_init);
+
+
+#ifdef CONFIG_IOMMU_DMA
+#include <linux/dma-iommu.h>
+#include <linux/platform_device.h>
+#include <linux/amba/bus.h>
+
+/* Thankfully, all cache ops are by VA so we can ignore phys here */
+static void flush_page(struct device *dev, const void *virt, phys_addr_t phys)
+{
+	__dma_flush_range(virt, virt + PAGE_SIZE);
+}
+
+static void *__iommu_alloc_attrs(struct device *dev, size_t size,
+				 dma_addr_t *handle, gfp_t gfp,
+				 struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+	int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent);
+	void *addr;
+
+	if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
+		return NULL;
+	/*
+	 * Some drivers rely on this, and we probably don't want the
+	 * possibility of stale kernel data being read by devices anyway.
+	 */
+	gfp |= __GFP_ZERO;
+
+	if (gfp & __GFP_WAIT) {
+		struct page **pages;
+		pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent);
+
+		pages = iommu_dma_alloc(dev, size, gfp, ioprot,	handle,
+					flush_page);
+		if (!pages)
+			return NULL;
+
+		addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
+					      __builtin_return_address(0));
+		if (!addr)
+			iommu_dma_free(dev, pages, size, handle);
+	} else {
+		struct page *page;
+		/*
+		 * In atomic context we can't remap anything, so we'll only
+		 * get the virtually contiguous buffer we need by way of a
+		 * physically contiguous allocation.
+		 */
+		if (coherent) {
+			page = alloc_pages(gfp, get_order(size));
+			addr = page ? page_address(page) : NULL;
+		} else {
+			addr = __alloc_from_pool(size, &page, gfp);
+		}
+		if (!addr)
+			return NULL;
+
+		*handle = iommu_dma_map_page(dev, page, 0, size, ioprot);
+		if (iommu_dma_mapping_error(dev, *handle)) {
+			if (coherent)
+				__free_pages(page, get_order(size));
+			else
+				__free_from_pool(addr, size);
+			addr = NULL;
+		}
+	}
+	return addr;
+}
+
+static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr,
+			       dma_addr_t handle, struct dma_attrs *attrs)
+{
+	/*
+	 * @cpu_addr will be one of 3 things depending on how it was allocated:
+	 * - A remapped array of pages from iommu_dma_alloc(), for all
+	 *   non-atomic allocations.
+	 * - A non-cacheable alias from the atomic pool, for atomic
+	 *   allocations by non-coherent devices.
+	 * - A normal lowmem address, for atomic allocations by
+	 *   coherent devices.
+	 * Hence how dodgy the below logic looks...
+	 */
+	if (__in_atomic_pool(cpu_addr, size)) {
+		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
+		__free_from_pool(cpu_addr, size);
+	} else if (is_vmalloc_addr(cpu_addr)){
+		struct vm_struct *area = find_vm_area(cpu_addr);
+
+		if (WARN_ON(!area || !area->pages))
+			return;
+		iommu_dma_free(dev, area->pages, size, &handle);
+		dma_common_free_remap(cpu_addr, size, VM_USERMAP);
+	} else {
+		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
+		__free_pages(virt_to_page(cpu_addr), get_order(size));
+	}
+}
+
+static int __iommu_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
+			      void *cpu_addr, dma_addr_t dma_addr, size_t size,
+			      struct dma_attrs *attrs)
+{
+	struct vm_struct *area;
+	int ret;
+
+	vma->vm_page_prot = __get_dma_pgprot(attrs, vma->vm_page_prot,
+					     is_device_dma_coherent(dev));
+
+	if (dma_mmap_from_coherent(dev, vma, cpu_addr, size, &ret))
+		return ret;
+
+	area = find_vm_area(cpu_addr);
+	if (WARN_ON(!area || !area->pages))
+		return -ENXIO;
+
+	return iommu_dma_mmap(area->pages, size, vma);
+}
+
+static int __iommu_get_sgtable(struct device *dev, struct sg_table *sgt,
+			       void *cpu_addr, dma_addr_t dma_addr,
+			       size_t size, struct dma_attrs *attrs)
+{
+	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	struct vm_struct *area = find_vm_area(cpu_addr);
+
+	if (WARN_ON(!area || !area->pages))
+		return -ENXIO;
+
+	return sg_alloc_table_from_pages(sgt, area->pages, count, 0, size,
+					 GFP_KERNEL);
+}
+
+static void __iommu_sync_single_for_cpu(struct device *dev,
+					dma_addr_t dev_addr, size_t size,
+					enum dma_data_direction dir)
+{
+	phys_addr_t phys;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
+	__dma_unmap_area(phys_to_virt(phys), size, dir);
+}
+
+static void __iommu_sync_single_for_device(struct device *dev,
+					   dma_addr_t dev_addr, size_t size,
+					   enum dma_data_direction dir)
+{
+	phys_addr_t phys;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
+	__dma_map_area(phys_to_virt(phys), size, dir);
+}
+
+static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
+				   unsigned long offset, size_t size,
+				   enum dma_data_direction dir,
+				   struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+	int prot = dma_direction_to_prot(dir, coherent);
+	dma_addr_t dev_addr = iommu_dma_map_page(dev, page, offset, size, prot);
+
+	if (!iommu_dma_mapping_error(dev, dev_addr) &&
+	    !dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_single_for_device(dev, dev_addr, size, dir);
+
+	return dev_addr;
+}
+
+static void __iommu_unmap_page(struct device *dev, dma_addr_t dev_addr,
+			       size_t size, enum dma_data_direction dir,
+			       struct dma_attrs *attrs)
+{
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_single_for_cpu(dev, dev_addr, size, dir);
+
+	iommu_dma_unmap_page(dev, dev_addr, size, dir, attrs);
+}
+
+static void __iommu_sync_sg_for_cpu(struct device *dev,
+				    struct scatterlist *sgl, int nelems,
+				    enum dma_data_direction dir)
+{
+	struct scatterlist *sg;
+	int i;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	for_each_sg(sgl, sg, nelems, i)
+		__dma_unmap_area(sg_virt(sg), sg->length, dir);
+}
+
+static void __iommu_sync_sg_for_device(struct device *dev,
+				       struct scatterlist *sgl, int nelems,
+				       enum dma_data_direction dir)
+{
+	struct scatterlist *sg;
+	int i;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	for_each_sg(sgl, sg, nelems, i)
+		__dma_map_area(sg_virt(sg), sg->length, dir);
+}
+
+static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+				int nelems, enum dma_data_direction dir,
+				struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_sg_for_device(dev, sgl, nelems, dir);
+
+	return iommu_dma_map_sg(dev, sgl, nelems,
+			dma_direction_to_prot(dir, coherent));
+}
+
+static void __iommu_unmap_sg_attrs(struct device *dev,
+				   struct scatterlist *sgl, int nelems,
+				   enum dma_data_direction dir,
+				   struct dma_attrs *attrs)
+{
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_sg_for_cpu(dev, sgl, nelems, dir);
+
+	iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs);
+}
+
+static struct dma_map_ops iommu_dma_ops = {
+	.alloc = __iommu_alloc_attrs,
+	.free = __iommu_free_attrs,
+	.mmap = __iommu_mmap_attrs,
+	.get_sgtable = __iommu_get_sgtable,
+	.map_page = __iommu_map_page,
+	.unmap_page = __iommu_unmap_page,
+	.map_sg = __iommu_map_sg_attrs,
+	.unmap_sg = __iommu_unmap_sg_attrs,
+	.sync_single_for_cpu = __iommu_sync_single_for_cpu,
+	.sync_single_for_device = __iommu_sync_single_for_device,
+	.sync_sg_for_cpu = __iommu_sync_sg_for_cpu,
+	.sync_sg_for_device = __iommu_sync_sg_for_device,
+	.dma_supported = iommu_dma_supported,
+	.mapping_error = iommu_dma_mapping_error,
+};
+
+/*
+ * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
+ * everything it needs to - the device is only partially created and the
+ * IOMMU driver hasn't seen it yet, so it can't have a group. Thus we
+ * need this delayed attachment dance. Once IOMMU probe ordering is sorted
+ * to move the arch_setup_dma_ops() call later, all the notifier bits below
+ * become unnecessary, and will go away.
+ */
+struct iommu_dma_notifier_data {
+	struct list_head list;
+	struct device *dev;
+	const struct iommu_ops *ops;
+	u64 dma_base;
+	u64 size;
+};
+static LIST_HEAD(iommu_dma_masters);
+static DEFINE_MUTEX(iommu_dma_notifier_lock);
+
+/*
+ * Temporarily "borrow" a domain feature flag to tell if we had to resort
+ * to creating our own domain here, in case we need to clean it up again.
+ */
+#define __IOMMU_DOMAIN_FAKE_DEFAULT		(1U << 31)
+
+static bool do_iommu_attach(struct device *dev, const struct iommu_ops *ops,
+			   u64 dma_base, u64 size)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+	/*
+	 * Best case: The device is either part of a group which was
+	 * already attached to a domain in a previous call, or it's
+	 * been put in a default DMA domain by the IOMMU core.
+	 */
+	if (!domain) {
+		/*
+		 * Urgh. The IOMMU core isn't going to do default domains
+		 * for non-PCI devices anyway, until it has some means of
+		 * abstracting the entirely implementation-specific
+		 * sideband data/SoC topology/unicorn dust that may or
+		 * may not differentiate upstream masters.
+		 * So until then, HORRIBLE HACKS!
+		 */
+		domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
+		if (!domain)
+			goto out_no_domain;
+
+		domain->ops = ops;
+		domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;
+
+		if (iommu_attach_device(domain, dev))
+			goto out_put_domain;
+	}
+
+	if (iommu_dma_init_domain(domain, dma_base, size))
+		goto out_detach;
+
+	dev->archdata.dma_ops = &iommu_dma_ops;
+	return true;
+
+out_detach:
+	iommu_detach_device(domain, dev);
+out_put_domain:
+	if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
+		iommu_domain_free(domain);
+out_no_domain:
+	pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
+		dev_name(dev));
+	return false;
+}
+
+static void queue_iommu_attach(struct device *dev, const struct iommu_ops *ops,
+			      u64 dma_base, u64 size)
+{
+	struct iommu_dma_notifier_data *iommudata;
+
+	iommudata = kzalloc(sizeof(*iommudata), GFP_KERNEL);
+	if (!iommudata)
+		return;
+
+	iommudata->dev = dev;
+	iommudata->ops = ops;
+	iommudata->dma_base = dma_base;
+	iommudata->size = size;
+
+	mutex_lock(&iommu_dma_notifier_lock);
+	list_add(&iommudata->list, &iommu_dma_masters);
+	mutex_unlock(&iommu_dma_notifier_lock);
+}
+
+static int __iommu_attach_notifier(struct notifier_block *nb,
+				   unsigned long action, void *data)
+{
+	struct iommu_dma_notifier_data *master, *tmp;
+
+	if (action != BUS_NOTIFY_ADD_DEVICE)
+		return 0;
+
+	mutex_lock(&iommu_dma_notifier_lock);
+	list_for_each_entry_safe(master, tmp, &iommu_dma_masters, list) {
+		if (do_iommu_attach(master->dev, master->ops,
+				master->dma_base, master->size)) {
+			list_del(&master->list);
+			kfree(master);
+		}
+	}
+	mutex_unlock(&iommu_dma_notifier_lock);
+	return 0;
+}
+
+static int register_iommu_dma_ops_notifier(struct bus_type *bus)
+{
+	struct notifier_block *nb = kzalloc(sizeof(*nb), GFP_KERNEL);
+	int ret;
+
+	if (!nb)
+		return -ENOMEM;
+	/*
+	 * The device must be attached to a domain before the driver probe
+	 * routine gets a chance to start allocating DMA buffers. However,
+	 * the IOMMU driver also needs a chance to configure the iommu_group
+	 * via its add_device callback first, so we need to make the attach
+	 * happen between those two points. Since the IOMMU core uses a bus
+	 * notifier with default priority for add_device, do the same but
+	 * with a lower priority to ensure the appropriate ordering.
+	 */
+	nb->notifier_call = __iommu_attach_notifier;
+	nb->priority = -100;
+
+	ret = bus_register_notifier(bus, nb);
+	if (ret) {
+		pr_warn("Failed to register DMA domain notifier; IOMMU DMA ops unavailable on bus '%s'\n",
+			bus->name);
+		kfree(nb);
+	}
+	return ret;
+}
+
+static int __init __iommu_dma_init(void)
+{
+	int ret;
+
+	ret = iommu_dma_init();
+	if (!ret)
+		ret = register_iommu_dma_ops_notifier(&platform_bus_type);
+	if (!ret)
+		ret = register_iommu_dma_ops_notifier(&amba_bustype);
+	return ret;
+}
+arch_initcall(__iommu_dma_init);
+
+static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+				  const struct iommu_ops *ops)
+{
+	struct iommu_group *group;
+
+	if (!ops)
+		return;
+	/*
+	 * TODO: As a concession to the future, we're ready to handle being
+	 * called both early and late (i.e. after bus_add_device). Once all
+	 * the platform bus code is reworked to call us late and the notifier
+	 * junk above goes away, move the body of do_iommu_attach here.
+	 */
+	group = iommu_group_get(dev);
+	if (group) {
+		do_iommu_attach(dev, ops, dma_base, size);
+		iommu_group_put(group);
+	} else {
+		queue_iommu_attach(dev, ops, dma_base, size);
+	}
+}
+
+#else
+
+static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+				  struct iommu_ops *iommu)
+{ }
+
+#endif  /* CONFIG_IOMMU_DMA */
+
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v6 2/3] arm64: Add IOMMU dma_ops
@ 2015-10-01 19:13     ` Robin Murphy
  0 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-01 19:13 UTC (permalink / raw)
  To: linux-arm-kernel

Taking some inspiration from the arch/arm code, implement the
arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.

Since there is still work to do elsewhere to make DMA configuration happen
in a more appropriate order and properly support platform devices in the
IOMMU core, the device setup code unfortunately starts out carrying some
workarounds to ensure it works correctly in the current state of things.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 435 insertions(+)

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 0bcc4bc..dd2d6e6 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -533,3 +533,438 @@ static int __init dma_debug_do_init(void)
 	return 0;
 }
 fs_initcall(dma_debug_do_init);
+
+
+#ifdef CONFIG_IOMMU_DMA
+#include <linux/dma-iommu.h>
+#include <linux/platform_device.h>
+#include <linux/amba/bus.h>
+
+/* Thankfully, all cache ops are by VA so we can ignore phys here */
+static void flush_page(struct device *dev, const void *virt, phys_addr_t phys)
+{
+	__dma_flush_range(virt, virt + PAGE_SIZE);
+}
+
+static void *__iommu_alloc_attrs(struct device *dev, size_t size,
+				 dma_addr_t *handle, gfp_t gfp,
+				 struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+	int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent);
+	void *addr;
+
+	if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
+		return NULL;
+	/*
+	 * Some drivers rely on this, and we probably don't want the
+	 * possibility of stale kernel data being read by devices anyway.
+	 */
+	gfp |= __GFP_ZERO;
+
+	if (gfp & __GFP_WAIT) {
+		struct page **pages;
+		pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent);
+
+		pages = iommu_dma_alloc(dev, size, gfp, ioprot, handle,
+					flush_page);
+		if (!pages)
+			return NULL;
+
+		addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
+					      __builtin_return_address(0));
+		if (!addr)
+			iommu_dma_free(dev, pages, size, handle);
+	} else {
+		struct page *page;
+		/*
+		 * In atomic context we can't remap anything, so we'll only
+		 * get the virtually contiguous buffer we need by way of a
+		 * physically contiguous allocation.
+		 */
+		if (coherent) {
+			page = alloc_pages(gfp, get_order(size));
+			addr = page ? page_address(page) : NULL;
+		} else {
+			addr = __alloc_from_pool(size, &page, gfp);
+		}
+		if (!addr)
+			return NULL;
+
+		*handle = iommu_dma_map_page(dev, page, 0, size, ioprot);
+		if (iommu_dma_mapping_error(dev, *handle)) {
+			if (coherent)
+				__free_pages(page, get_order(size));
+			else
+				__free_from_pool(addr, size);
+			addr = NULL;
+		}
+	}
+	return addr;
+}
+
+static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr,
+			       dma_addr_t handle, struct dma_attrs *attrs)
+{
+	/*
+	 * @cpu_addr will be one of 3 things depending on how it was allocated:
+	 * - A remapped array of pages from iommu_dma_alloc(), for all
+	 *   non-atomic allocations.
+	 * - A non-cacheable alias from the atomic pool, for atomic
+	 *   allocations by non-coherent devices.
+	 * - A normal lowmem address, for atomic allocations by
+	 *   coherent devices.
+	 * Hence how dodgy the below logic looks...
+	 */
+	if (__in_atomic_pool(cpu_addr, size)) {
+		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
+		__free_from_pool(cpu_addr, size);
+	} else if (is_vmalloc_addr(cpu_addr)) {
+		struct vm_struct *area = find_vm_area(cpu_addr);
+
+		if (WARN_ON(!area || !area->pages))
+			return;
+		iommu_dma_free(dev, area->pages, size, &handle);
+		dma_common_free_remap(cpu_addr, size, VM_USERMAP);
+	} else {
+		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
+		__free_pages(virt_to_page(cpu_addr), get_order(size));
+	}
+}
+
+static int __iommu_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
+			      void *cpu_addr, dma_addr_t dma_addr, size_t size,
+			      struct dma_attrs *attrs)
+{
+	struct vm_struct *area;
+	int ret;
+
+	vma->vm_page_prot = __get_dma_pgprot(attrs, vma->vm_page_prot,
+					     is_device_dma_coherent(dev));
+
+	if (dma_mmap_from_coherent(dev, vma, cpu_addr, size, &ret))
+		return ret;
+
+	area = find_vm_area(cpu_addr);
+	if (WARN_ON(!area || !area->pages))
+		return -ENXIO;
+
+	return iommu_dma_mmap(area->pages, size, vma);
+}
+
+static int __iommu_get_sgtable(struct device *dev, struct sg_table *sgt,
+			       void *cpu_addr, dma_addr_t dma_addr,
+			       size_t size, struct dma_attrs *attrs)
+{
+	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	struct vm_struct *area = find_vm_area(cpu_addr);
+
+	if (WARN_ON(!area || !area->pages))
+		return -ENXIO;
+
+	return sg_alloc_table_from_pages(sgt, area->pages, count, 0, size,
+					 GFP_KERNEL);
+}
+
+static void __iommu_sync_single_for_cpu(struct device *dev,
+					dma_addr_t dev_addr, size_t size,
+					enum dma_data_direction dir)
+{
+	phys_addr_t phys;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
+	__dma_unmap_area(phys_to_virt(phys), size, dir);
+}
+
+static void __iommu_sync_single_for_device(struct device *dev,
+					   dma_addr_t dev_addr, size_t size,
+					   enum dma_data_direction dir)
+{
+	phys_addr_t phys;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
+	__dma_map_area(phys_to_virt(phys), size, dir);
+}
+
+static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
+				   unsigned long offset, size_t size,
+				   enum dma_data_direction dir,
+				   struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+	int prot = dma_direction_to_prot(dir, coherent);
+	dma_addr_t dev_addr = iommu_dma_map_page(dev, page, offset, size, prot);
+
+	if (!iommu_dma_mapping_error(dev, dev_addr) &&
+	    !dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_single_for_device(dev, dev_addr, size, dir);
+
+	return dev_addr;
+}
+
+static void __iommu_unmap_page(struct device *dev, dma_addr_t dev_addr,
+			       size_t size, enum dma_data_direction dir,
+			       struct dma_attrs *attrs)
+{
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_single_for_cpu(dev, dev_addr, size, dir);
+
+	iommu_dma_unmap_page(dev, dev_addr, size, dir, attrs);
+}
+
+static void __iommu_sync_sg_for_cpu(struct device *dev,
+				    struct scatterlist *sgl, int nelems,
+				    enum dma_data_direction dir)
+{
+	struct scatterlist *sg;
+	int i;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	for_each_sg(sgl, sg, nelems, i)
+		__dma_unmap_area(sg_virt(sg), sg->length, dir);
+}
+
+static void __iommu_sync_sg_for_device(struct device *dev,
+				       struct scatterlist *sgl, int nelems,
+				       enum dma_data_direction dir)
+{
+	struct scatterlist *sg;
+	int i;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	for_each_sg(sgl, sg, nelems, i)
+		__dma_map_area(sg_virt(sg), sg->length, dir);
+}
+
+static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+				int nelems, enum dma_data_direction dir,
+				struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_sg_for_device(dev, sgl, nelems, dir);
+
+	return iommu_dma_map_sg(dev, sgl, nelems,
+			dma_direction_to_prot(dir, coherent));
+}
+
+static void __iommu_unmap_sg_attrs(struct device *dev,
+				   struct scatterlist *sgl, int nelems,
+				   enum dma_data_direction dir,
+				   struct dma_attrs *attrs)
+{
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_sg_for_cpu(dev, sgl, nelems, dir);
+
+	iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs);
+}
+
+static struct dma_map_ops iommu_dma_ops = {
+	.alloc = __iommu_alloc_attrs,
+	.free = __iommu_free_attrs,
+	.mmap = __iommu_mmap_attrs,
+	.get_sgtable = __iommu_get_sgtable,
+	.map_page = __iommu_map_page,
+	.unmap_page = __iommu_unmap_page,
+	.map_sg = __iommu_map_sg_attrs,
+	.unmap_sg = __iommu_unmap_sg_attrs,
+	.sync_single_for_cpu = __iommu_sync_single_for_cpu,
+	.sync_single_for_device = __iommu_sync_single_for_device,
+	.sync_sg_for_cpu = __iommu_sync_sg_for_cpu,
+	.sync_sg_for_device = __iommu_sync_sg_for_device,
+	.dma_supported = iommu_dma_supported,
+	.mapping_error = iommu_dma_mapping_error,
+};
+
+/*
+ * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
+ * everything it needs to - the device is only partially created and the
+ * IOMMU driver hasn't seen it yet, so it can't have a group. Thus we
+ * need this delayed attachment dance. Once IOMMU probe ordering is sorted
+ * to move the arch_setup_dma_ops() call later, all the notifier bits below
+ * become unnecessary, and will go away.
+ */
+struct iommu_dma_notifier_data {
+	struct list_head list;
+	struct device *dev;
+	const struct iommu_ops *ops;
+	u64 dma_base;
+	u64 size;
+};
+static LIST_HEAD(iommu_dma_masters);
+static DEFINE_MUTEX(iommu_dma_notifier_lock);
+
+/*
+ * Temporarily "borrow" a domain feature flag to tell if we had to resort
+ * to creating our own domain here, in case we need to clean it up again.
+ */
+#define __IOMMU_DOMAIN_FAKE_DEFAULT		(1U << 31)
+
+static bool do_iommu_attach(struct device *dev, const struct iommu_ops *ops,
+			   u64 dma_base, u64 size)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+	/*
+	 * Best case: The device is either part of a group which was
+	 * already attached to a domain in a previous call, or it's
+	 * been put in a default DMA domain by the IOMMU core.
+	 */
+	if (!domain) {
+		/*
+		 * Urgh. The IOMMU core isn't going to do default domains
+		 * for non-PCI devices anyway, until it has some means of
+		 * abstracting the entirely implementation-specific
+		 * sideband data/SoC topology/unicorn dust that may or
+		 * may not differentiate upstream masters.
+		 * So until then, HORRIBLE HACKS!
+		 */
+		domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
+		if (!domain)
+			goto out_no_domain;
+
+		domain->ops = ops;
+		domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;
+
+		if (iommu_attach_device(domain, dev))
+			goto out_put_domain;
+	}
+
+	if (iommu_dma_init_domain(domain, dma_base, size))
+		goto out_detach;
+
+	dev->archdata.dma_ops = &iommu_dma_ops;
+	return true;
+
+out_detach:
+	iommu_detach_device(domain, dev);
+out_put_domain:
+	if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
+		iommu_domain_free(domain);
+out_no_domain:
+	pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
+		dev_name(dev));
+	return false;
+}
+
+static void queue_iommu_attach(struct device *dev, const struct iommu_ops *ops,
+			      u64 dma_base, u64 size)
+{
+	struct iommu_dma_notifier_data *iommudata;
+
+	iommudata = kzalloc(sizeof(*iommudata), GFP_KERNEL);
+	if (!iommudata)
+		return;
+
+	iommudata->dev = dev;
+	iommudata->ops = ops;
+	iommudata->dma_base = dma_base;
+	iommudata->size = size;
+
+	mutex_lock(&iommu_dma_notifier_lock);
+	list_add(&iommudata->list, &iommu_dma_masters);
+	mutex_unlock(&iommu_dma_notifier_lock);
+}
+
+static int __iommu_attach_notifier(struct notifier_block *nb,
+				   unsigned long action, void *data)
+{
+	struct iommu_dma_notifier_data *master, *tmp;
+
+	if (action != BUS_NOTIFY_ADD_DEVICE)
+		return 0;
+
+	mutex_lock(&iommu_dma_notifier_lock);
+	list_for_each_entry_safe(master, tmp, &iommu_dma_masters, list) {
+		if (do_iommu_attach(master->dev, master->ops,
+				master->dma_base, master->size)) {
+			list_del(&master->list);
+			kfree(master);
+		}
+	}
+	mutex_unlock(&iommu_dma_notifier_lock);
+	return 0;
+}
+
+static int register_iommu_dma_ops_notifier(struct bus_type *bus)
+{
+	struct notifier_block *nb = kzalloc(sizeof(*nb), GFP_KERNEL);
+	int ret;
+
+	if (!nb)
+		return -ENOMEM;
+	/*
+	 * The device must be attached to a domain before the driver probe
+	 * routine gets a chance to start allocating DMA buffers. However,
+	 * the IOMMU driver also needs a chance to configure the iommu_group
+	 * via its add_device callback first, so we need to make the attach
+	 * happen between those two points. Since the IOMMU core uses a bus
+	 * notifier with default priority for add_device, do the same but
+	 * with a lower priority to ensure the appropriate ordering.
+	 */
+	nb->notifier_call = __iommu_attach_notifier;
+	nb->priority = -100;
+
+	ret = bus_register_notifier(bus, nb);
+	if (ret) {
+		pr_warn("Failed to register DMA domain notifier; IOMMU DMA ops unavailable on bus '%s'\n",
+			bus->name);
+		kfree(nb);
+	}
+	return ret;
+}
+
+static int __init __iommu_dma_init(void)
+{
+	int ret;
+
+	ret = iommu_dma_init();
+	if (!ret)
+		ret = register_iommu_dma_ops_notifier(&platform_bus_type);
+	if (!ret)
+		ret = register_iommu_dma_ops_notifier(&amba_bustype);
+	return ret;
+}
+arch_initcall(__iommu_dma_init);
+
+static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+				  const struct iommu_ops *ops)
+{
+	struct iommu_group *group;
+
+	if (!ops)
+		return;
+	/*
+	 * TODO: As a concession to the future, we're ready to handle being
+	 * called both early and late (i.e. after bus_add_device). Once all
+	 * the platform bus code is reworked to call us late and the notifier
+	 * junk above goes away, move the body of do_iommu_attach here.
+	 */
+	group = iommu_group_get(dev);
+	if (group) {
+		do_iommu_attach(dev, ops, dma_base, size);
+		iommu_group_put(group);
+	} else {
+		queue_iommu_attach(dev, ops, dma_base, size);
+	}
+}
+
+#else
+
+static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+				  struct iommu_ops *iommu)
+{ }
+
+#endif  /* CONFIG_IOMMU_DMA */
+
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v6 3/3] arm64: Hook up IOMMU dma_ops
  2015-10-01 19:13 ` Robin Murphy
@ 2015-10-01 19:14     ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-01 19:14 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A, will.deacon-5wv7dgnIgG8,
	catalin.marinas-5wv7dgnIgG8
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA

With iommu_dma_ops in place, hook them up to the configuration code, so
IOMMU-fronted devices will get them automatically.

Acked-by: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>
Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/dma-mapping.h | 15 +++++++--------
 arch/arm64/mm/dma-mapping.c          | 22 ++++++++++++++++++++++
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7d95663..6597311 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -74,6 +74,7 @@ config ARM64
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE
 	select HAVE_SYSCALL_TRACEPOINTS
+	select IOMMU_DMA if IOMMU_SUPPORT
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
 	select MODULES_USE_ELF_RELA
diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
index cfdb34b..54d0ead 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -54,16 +54,15 @@ static inline struct dma_map_ops *get_dma_ops(struct device *dev)
 		return __generic_dma_ops(dev);
 }
 
-static inline void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-				      struct iommu_ops *iommu, bool coherent)
-{
-	if (!acpi_disabled && !dev->archdata.dma_ops)
-		dev->archdata.dma_ops = dma_ops;
-
-	dev->archdata.dma_coherent = coherent;
-}
+void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+			struct iommu_ops *iommu, bool coherent);
 #define arch_setup_dma_ops	arch_setup_dma_ops
 
+#ifdef CONFIG_IOMMU_DMA
+void arch_teardown_dma_ops(struct device *dev);
+#define arch_teardown_dma_ops	arch_teardown_dma_ops
+#endif
+
 /* do not use this function in a driver */
 static inline bool is_device_dma_coherent(struct device *dev)
 {
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index dd2d6e6..02ef19d 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -960,6 +960,19 @@ static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 	}
 }
 
+void arch_teardown_dma_ops(struct device *dev)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+	if (domain) {
+		iommu_detach_device(domain, dev);
+		if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
+			iommu_domain_free(domain);
+	}
+
+	dev->archdata.dma_ops = NULL;
+}
+
 #else
 
 static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
@@ -968,3 +981,12 @@ static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 
 #endif  /* CONFIG_IOMMU_DMA */
 
+void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+			struct iommu_ops *iommu, bool coherent)
+{
+	if (!acpi_disabled && !dev->archdata.dma_ops)
+		dev->archdata.dma_ops = dma_ops;
+
+	dev->archdata.dma_coherent = coherent;
+	__iommu_setup_dma_ops(dev, dma_base, size, iommu);
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v6 3/3] arm64: Hook up IOMMU dma_ops
@ 2015-10-01 19:14     ` Robin Murphy
  0 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-01 19:14 UTC (permalink / raw)
  To: linux-arm-kernel

With iommu_dma_ops in place, hook them up to the configuration code, so
IOMMU-fronted devices will get them automatically.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/dma-mapping.h | 15 +++++++--------
 arch/arm64/mm/dma-mapping.c          | 22 ++++++++++++++++++++++
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7d95663..6597311 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -74,6 +74,7 @@ config ARM64
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE
 	select HAVE_SYSCALL_TRACEPOINTS
+	select IOMMU_DMA if IOMMU_SUPPORT
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
 	select MODULES_USE_ELF_RELA
diff --git a/arch/arm64/include/asm/dma-mapping.h b/arch/arm64/include/asm/dma-mapping.h
index cfdb34b..54d0ead 100644
--- a/arch/arm64/include/asm/dma-mapping.h
+++ b/arch/arm64/include/asm/dma-mapping.h
@@ -54,16 +54,15 @@ static inline struct dma_map_ops *get_dma_ops(struct device *dev)
 		return __generic_dma_ops(dev);
 }
 
-static inline void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-				      struct iommu_ops *iommu, bool coherent)
-{
-	if (!acpi_disabled && !dev->archdata.dma_ops)
-		dev->archdata.dma_ops = dma_ops;
-
-	dev->archdata.dma_coherent = coherent;
-}
+void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+			struct iommu_ops *iommu, bool coherent);
 #define arch_setup_dma_ops	arch_setup_dma_ops
 
+#ifdef CONFIG_IOMMU_DMA
+void arch_teardown_dma_ops(struct device *dev);
+#define arch_teardown_dma_ops	arch_teardown_dma_ops
+#endif
+
 /* do not use this function in a driver */
 static inline bool is_device_dma_coherent(struct device *dev)
 {
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index dd2d6e6..02ef19d 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -960,6 +960,19 @@ static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 	}
 }
 
+void arch_teardown_dma_ops(struct device *dev)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+	if (domain) {
+		iommu_detach_device(domain, dev);
+		if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
+			iommu_domain_free(domain);
+	}
+
+	dev->archdata.dma_ops = NULL;
+}
+
 #else
 
 static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
@@ -968,3 +981,12 @@ static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 
 #endif  /* CONFIG_IOMMU_DMA */
 
+void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+			struct iommu_ops *iommu, bool coherent)
+{
+	if (!acpi_disabled && !dev->archdata.dma_ops)
+		dev->archdata.dma_ops = dma_ops;
+
+	dev->archdata.dma_coherent = coherent;
+	__iommu_setup_dma_ops(dev, dma_base, size, iommu);
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-01 19:13     ` Robin Murphy
@ 2015-10-06 11:00         ` Yong Wu
  -1 siblings, 0 replies; 78+ messages in thread
From: Yong Wu @ 2015-10-06 11:00 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
> 
> Since there is still work to do elsewhere to make DMA configuration happen
> in a more appropriate order and properly support platform devices in the
> IOMMU core, the device setup code unfortunately starts out carrying some
> workarounds to ensure it works correctly in the current state of things.
> 
> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
> ---
>  arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 435 insertions(+)
> 
[...]
> +/*
> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
> + * everything it needs to - the device is only partially created and the
> + * IOMMU driver hasn't seen it yet, so it can't have a group. Thus we
> + * need this delayed attachment dance. Once IOMMU probe ordering is sorted
> + * to move the arch_setup_dma_ops() call later, all the notifier bits below
> + * become unnecessary, and will go away.
> + */

Hi Robin,
      Could I ask a question about the future plan:
      How will the arch_setup_dma_ops() call be moved to run later than
the IOMMU probe?

      arch_setup_dma_ops() is called from of_dma_configure(), which comes
from arm64_device_init(), and the IOMMU probe is at subsys_init. So
arch_setup_dma_ops() will normally run before the IOMMU probe, is that
right?
      Could Laurent's probe-deferral series help with this? What's the
current status of that series?

> +struct iommu_dma_notifier_data {
> +	struct list_head list;
> +	struct device *dev;
> +	const struct iommu_ops *ops;
> +	u64 dma_base;
> +	u64 size;
> +};
> +static LIST_HEAD(iommu_dma_masters);
> +static DEFINE_MUTEX(iommu_dma_notifier_lock);
> +
> +/*
> + * Temporarily "borrow" a domain feature flag to tell if we had to resort
> + * to creating our own domain here, in case we need to clean it up again.
> + */
> +#define __IOMMU_DOMAIN_FAKE_DEFAULT		(1U << 31)
> +
> +static bool do_iommu_attach(struct device *dev, const struct iommu_ops *ops,
> +			   u64 dma_base, u64 size)
> +{
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	/*
> +	 * Best case: The device is either part of a group which was
> +	 * already attached to a domain in a previous call, or it's
> +	 * been put in a default DMA domain by the IOMMU core.
> +	 */
> +	if (!domain) {
> +		/*
> +		 * Urgh. The IOMMU core isn't going to do default domains
> +		 * for non-PCI devices anyway, until it has some means of
> +		 * abstracting the entirely implementation-specific
> +		 * sideband data/SoC topology/unicorn dust that may or
> +		 * may not differentiate upstream masters.
> +		 * So until then, HORRIBLE HACKS!
> +		 */
> +		domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
> +		if (!domain)
> +			goto out_no_domain;
> +
> +		domain->ops = ops;
> +		domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;
> +
> +		if (iommu_attach_device(domain, dev))
> +			goto out_put_domain;
> +	}
> +
> +	if (iommu_dma_init_domain(domain, dma_base, size))
> +		goto out_detach;
> +
> +	dev->archdata.dma_ops = &iommu_dma_ops;
> +	return true;
> +
> +out_detach:
> +	iommu_detach_device(domain, dev);
> +out_put_domain:
> +	if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
> +		iommu_domain_free(domain);
> +out_no_domain:
> +	pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
> +		dev_name(dev));
> +	return false;
> +}
[...]
> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +				  const struct iommu_ops *ops)
> +{
> +	struct iommu_group *group;
> +
> +	if (!ops)
> +		return;
> +	/*
> +	 * TODO: As a concession to the future, we're ready to handle being
> +	 * called both early and late (i.e. after bus_add_device). Once all
> +	 * the platform bus code is reworked to call us late and the notifier
> +	 * junk above goes away, move the body of do_iommu_attach here.
> +	 */
> +	group = iommu_group_get(dev);

   If __iommu_setup_dma_ops() runs after bus_add_device, then the device
already has its group here. It will enter do_iommu_attach(), which will
allocate a default IOMMU domain and attach this device to that new domain.
   But that is not what mtk-iommu expects: we would like all our masters
to attach to the same domain. So we should allocate a default IOMMU domain
(if there is no domain at that time) and attach each device to that shared
domain in our own xx_add_device callback, is that right? A rough sketch of
what I mean follows below.
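
   Purely as an illustration of the idea (the xx_* names below are
placeholders rather than real mtk-iommu code, and error handling is only
sketched), something along these lines:

	/* One group and one domain shared by every master behind the M4U */
	static struct iommu_group *xx_shared_group;
	static struct iommu_domain *xx_shared_domain;

	static int xx_iommu_add_device(struct device *dev)
	{
		int ret;

		/* Put every master into the same group... */
		if (!xx_shared_group) {
			xx_shared_group = iommu_group_alloc();
			if (IS_ERR(xx_shared_group)) {
				ret = PTR_ERR(xx_shared_group);
				xx_shared_group = NULL;
				return ret;
			}
		}
		ret = iommu_group_add_device(xx_shared_group, dev);
		if (ret)
			return ret;

		/* ...and attach that group once to a single shared domain */
		if (!xx_shared_domain) {
			xx_shared_domain = iommu_domain_alloc(&platform_bus_type);
			if (!xx_shared_domain)
				return -ENOMEM;
			ret = iommu_attach_group(xx_shared_domain, xx_shared_group);
		}
		return ret;
	}

With that, do_iommu_attach() would find the existing domain via the group
and just reuse it, if I understand the flow correctly.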

> +	if (group) {
> +		do_iommu_attach(dev, ops, dma_base, size);
> +		iommu_group_put(group);
> +	} else {
> +		queue_iommu_attach(dev, ops, dma_base, size);
> +	}
> +}
> +
> +#else
> +
> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +				  struct iommu_ops *iommu)
> +{ }
> +
> +#endif  /* CONFIG_IOMMU_DMA */
> +

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v6 2/3] arm64: Add IOMMU dma_ops
@ 2015-10-06 11:00         ` Yong Wu
  0 siblings, 0 replies; 78+ messages in thread
From: Yong Wu @ 2015-10-06 11:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
> 
> Since there is still work to do elsewhere to make DMA configuration happen
> in a more appropriate order and properly support platform devices in the
> IOMMU core, the device setup code unfortunately starts out carrying some
> workarounds to ensure it works correctly in the current state of things.
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 435 insertions(+)
> 
[...]
> +/*
> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
> + * everything it needs to - the device is only partially created and the
> + * IOMMU driver hasn't seen it yet, so it can't have a group. Thus we
> + * need this delayed attachment dance. Once IOMMU probe ordering is sorted
> + * to move the arch_setup_dma_ops() call later, all the notifier bits below
> + * become unnecessary, and will go away.
> + */

Hi Robin,
      Could I ask a question about the future plan:
      How will the arch_setup_dma_ops() call be moved to run later than
the IOMMU probe?

      arch_setup_dma_ops() is called from of_dma_configure(), which comes
from arm64_device_init(), and the IOMMU probe is at subsys_init. So
arch_setup_dma_ops() will normally run before the IOMMU probe, is that
right?
      Could Laurent's probe-deferral series help with this? What's the
current status of that series?

> +struct iommu_dma_notifier_data {
> +	struct list_head list;
> +	struct device *dev;
> +	const struct iommu_ops *ops;
> +	u64 dma_base;
> +	u64 size;
> +};
> +static LIST_HEAD(iommu_dma_masters);
> +static DEFINE_MUTEX(iommu_dma_notifier_lock);
> +
> +/*
> + * Temporarily "borrow" a domain feature flag to tell if we had to resort
> + * to creating our own domain here, in case we need to clean it up again.
> + */
> +#define __IOMMU_DOMAIN_FAKE_DEFAULT		(1U << 31)
> +
> +static bool do_iommu_attach(struct device *dev, const struct iommu_ops *ops,
> +			   u64 dma_base, u64 size)
> +{
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	/*
> +	 * Best case: The device is either part of a group which was
> +	 * already attached to a domain in a previous call, or it's
> +	 * been put in a default DMA domain by the IOMMU core.
> +	 */
> +	if (!domain) {
> +		/*
> +		 * Urgh. The IOMMU core isn't going to do default domains
> +		 * for non-PCI devices anyway, until it has some means of
> +		 * abstracting the entirely implementation-specific
> +		 * sideband data/SoC topology/unicorn dust that may or
> +		 * may not differentiate upstream masters.
> +		 * So until then, HORRIBLE HACKS!
> +		 */
> +		domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
> +		if (!domain)
> +			goto out_no_domain;
> +
> +		domain->ops = ops;
> +		domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;
> +
> +		if (iommu_attach_device(domain, dev))
> +			goto out_put_domain;
> +	}
> +
> +	if (iommu_dma_init_domain(domain, dma_base, size))
> +		goto out_detach;
> +
> +	dev->archdata.dma_ops = &iommu_dma_ops;
> +	return true;
> +
> +out_detach:
> +	iommu_detach_device(domain, dev);
> +out_put_domain:
> +	if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
> +		iommu_domain_free(domain);
> +out_no_domain:
> +	pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
> +		dev_name(dev));
> +	return false;
> +}
[...]
> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +				  const struct iommu_ops *ops)
> +{
> +	struct iommu_group *group;
> +
> +	if (!ops)
> +		return;
> +	/*
> +	 * TODO: As a concession to the future, we're ready to handle being
> +	 * called both early and late (i.e. after bus_add_device). Once all
> +	 * the platform bus code is reworked to call us late and the notifier
> +	 * junk above goes away, move the body of do_iommu_attach here.
> +	 */
> +	group = iommu_group_get(dev);

   If __iommu_setup_dma_ops() runs after bus_add_device, then the device
already has its group here. It will enter do_iommu_attach(), which will
allocate a default IOMMU domain and attach this device to that new domain.
   But that is not what mtk-iommu expects: we would like all our masters
to attach to the same domain. So we should allocate a default IOMMU domain
(if there is no domain at that time) and attach each device to that shared
domain in our own xx_add_device callback, is that right?

> +	if (group) {
> +		do_iommu_attach(dev, ops, dma_base, size);
> +		iommu_group_put(group);
> +	} else {
> +		queue_iommu_attach(dev, ops, dma_base, size);
> +	}
> +}
> +
> +#else
> +
> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +				  struct iommu_ops *iommu)
> +{ }
> +
> +#endif  /* CONFIG_IOMMU_DMA */
> +

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-01 19:13     ` Robin Murphy
@ 2015-10-07  9:03       ` Anup Patel
  -1 siblings, 0 replies; 78+ messages in thread
From: Anup Patel @ 2015-10-07  9:03 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas, Anup Patel, Catalin Marinas, joro,
	Will Deacon, iommu, djkurtz, yong.wu, thunder.leizhen,
	yingjoe.chen, treding, linux-arm-kernel

Hi Robin,

On Fri, Oct 2, 2015 at 12:43 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>
> Since there is still work to do elsewhere to make DMA configuration happen
> in a more appropriate order and properly support platform devices in the
> IOMMU core, the device setup code unfortunately starts out carrying some
> workarounds to ensure it works correctly in the current state of things.
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 435 insertions(+)
>
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index 0bcc4bc..dd2d6e6 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -533,3 +533,438 @@ static int __init dma_debug_do_init(void)
>         return 0;
>  }
>  fs_initcall(dma_debug_do_init);
> +
> +
> +#ifdef CONFIG_IOMMU_DMA
> +#include <linux/dma-iommu.h>
> +#include <linux/platform_device.h>
> +#include <linux/amba/bus.h>
> +
> +/* Thankfully, all cache ops are by VA so we can ignore phys here */
> +static void flush_page(struct device *dev, const void *virt, phys_addr_t phys)
> +{
> +       __dma_flush_range(virt, virt + PAGE_SIZE);
> +}
> +
> +static void *__iommu_alloc_attrs(struct device *dev, size_t size,
> +                                dma_addr_t *handle, gfp_t gfp,
> +                                struct dma_attrs *attrs)
> +{
> +       bool coherent = is_device_dma_coherent(dev);
> +       int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent);
> +       void *addr;
> +
> +       if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
> +               return NULL;
> +       /*
> +        * Some drivers rely on this, and we probably don't want the
> +        * possibility of stale kernel data being read by devices anyway.
> +        */
> +       gfp |= __GFP_ZERO;
> +
> +       if (gfp & __GFP_WAIT) {
> +               struct page **pages;
> +               pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent);
> +
> +               pages = iommu_dma_alloc(dev, size, gfp, ioprot, handle,
> +                                       flush_page);
> +               if (!pages)
> +                       return NULL;
> +
> +               addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
> +                                             __builtin_return_address(0));
> +               if (!addr)
> +                       iommu_dma_free(dev, pages, size, handle);
> +       } else {
> +               struct page *page;
> +               /*
> +                * In atomic context we can't remap anything, so we'll only
> +                * get the virtually contiguous buffer we need by way of a
> +                * physically contiguous allocation.
> +                */
> +               if (coherent) {
> +                       page = alloc_pages(gfp, get_order(size));
> +                       addr = page ? page_address(page) : NULL;
> +               } else {
> +                       addr = __alloc_from_pool(size, &page, gfp);
> +               }
> +               if (!addr)
> +                       return NULL;
> +
> +               *handle = iommu_dma_map_page(dev, page, 0, size, ioprot);
> +               if (iommu_dma_mapping_error(dev, *handle)) {
> +                       if (coherent)
> +                               __free_pages(page, get_order(size));
> +                       else
> +                               __free_from_pool(addr, size);
> +                       addr = NULL;
> +               }
> +       }
> +       return addr;
> +}
> +
> +static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr,
> +                              dma_addr_t handle, struct dma_attrs *attrs)
> +{
> +       /*
> +        * @cpu_addr will be one of 3 things depending on how it was allocated:
> +        * - A remapped array of pages from iommu_dma_alloc(), for all
> +        *   non-atomic allocations.
> +        * - A non-cacheable alias from the atomic pool, for atomic
> +        *   allocations by non-coherent devices.
> +        * - A normal lowmem address, for atomic allocations by
> +        *   coherent devices.
> +        * Hence how dodgy the below logic looks...
> +        */
> +       if (__in_atomic_pool(cpu_addr, size)) {
> +               iommu_dma_unmap_page(dev, handle, size, 0, NULL);
> +               __free_from_pool(cpu_addr, size);
> +       } else if (is_vmalloc_addr(cpu_addr)){
> +               struct vm_struct *area = find_vm_area(cpu_addr);
> +
> +               if (WARN_ON(!area || !area->pages))
> +                       return;
> +               iommu_dma_free(dev, area->pages, size, &handle);
> +               dma_common_free_remap(cpu_addr, size, VM_USERMAP);
> +       } else {
> +               iommu_dma_unmap_page(dev, handle, size, 0, NULL);
> +               __free_pages(virt_to_page(cpu_addr), get_order(size));
> +       }
> +}
> +
> +static int __iommu_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
> +                             void *cpu_addr, dma_addr_t dma_addr, size_t size,
> +                             struct dma_attrs *attrs)
> +{
> +       struct vm_struct *area;
> +       int ret;
> +
> +       vma->vm_page_prot = __get_dma_pgprot(attrs, vma->vm_page_prot,
> +                                            is_device_dma_coherent(dev));
> +
> +       if (dma_mmap_from_coherent(dev, vma, cpu_addr, size, &ret))
> +               return ret;
> +
> +       area = find_vm_area(cpu_addr);
> +       if (WARN_ON(!area || !area->pages))
> +               return -ENXIO;
> +
> +       return iommu_dma_mmap(area->pages, size, vma);
> +}
> +
> +static int __iommu_get_sgtable(struct device *dev, struct sg_table *sgt,
> +                              void *cpu_addr, dma_addr_t dma_addr,
> +                              size_t size, struct dma_attrs *attrs)
> +{
> +       unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +       struct vm_struct *area = find_vm_area(cpu_addr);
> +
> +       if (WARN_ON(!area || !area->pages))
> +               return -ENXIO;
> +
> +       return sg_alloc_table_from_pages(sgt, area->pages, count, 0, size,
> +                                        GFP_KERNEL);
> +}
> +
> +static void __iommu_sync_single_for_cpu(struct device *dev,
> +                                       dma_addr_t dev_addr, size_t size,
> +                                       enum dma_data_direction dir)
> +{
> +       phys_addr_t phys;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
> +       __dma_unmap_area(phys_to_virt(phys), size, dir);
> +}
> +
> +static void __iommu_sync_single_for_device(struct device *dev,
> +                                          dma_addr_t dev_addr, size_t size,
> +                                          enum dma_data_direction dir)
> +{
> +       phys_addr_t phys;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
> +       __dma_map_area(phys_to_virt(phys), size, dir);
> +}
> +
> +static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
> +                                  unsigned long offset, size_t size,
> +                                  enum dma_data_direction dir,
> +                                  struct dma_attrs *attrs)
> +{
> +       bool coherent = is_device_dma_coherent(dev);
> +       int prot = dma_direction_to_prot(dir, coherent);
> +       dma_addr_t dev_addr = iommu_dma_map_page(dev, page, offset, size, prot);
> +
> +       if (!iommu_dma_mapping_error(dev, dev_addr) &&
> +           !dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> +               __iommu_sync_single_for_device(dev, dev_addr, size, dir);
> +
> +       return dev_addr;
> +}
> +
> +static void __iommu_unmap_page(struct device *dev, dma_addr_t dev_addr,
> +                              size_t size, enum dma_data_direction dir,
> +                              struct dma_attrs *attrs)
> +{
> +       if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> +               __iommu_sync_single_for_cpu(dev, dev_addr, size, dir);
> +
> +       iommu_dma_unmap_page(dev, dev_addr, size, dir, attrs);
> +}
> +
> +static void __iommu_sync_sg_for_cpu(struct device *dev,
> +                                   struct scatterlist *sgl, int nelems,
> +                                   enum dma_data_direction dir)
> +{
> +       struct scatterlist *sg;
> +       int i;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       for_each_sg(sgl, sg, nelems, i)
> +               __dma_unmap_area(sg_virt(sg), sg->length, dir);
> +}
> +
> +static void __iommu_sync_sg_for_device(struct device *dev,
> +                                      struct scatterlist *sgl, int nelems,
> +                                      enum dma_data_direction dir)
> +{
> +       struct scatterlist *sg;
> +       int i;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       for_each_sg(sgl, sg, nelems, i)
> +               __dma_map_area(sg_virt(sg), sg->length, dir);
> +}
> +
> +static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
> +                               int nelems, enum dma_data_direction dir,
> +                               struct dma_attrs *attrs)
> +{
> +       bool coherent = is_device_dma_coherent(dev);
> +
> +       if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> +               __iommu_sync_sg_for_device(dev, sgl, nelems, dir);
> +
> +       return iommu_dma_map_sg(dev, sgl, nelems,
> +                       dma_direction_to_prot(dir, coherent));
> +}
> +
> +static void __iommu_unmap_sg_attrs(struct device *dev,
> +                                  struct scatterlist *sgl, int nelems,
> +                                  enum dma_data_direction dir,
> +                                  struct dma_attrs *attrs)
> +{
> +       if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> +               __iommu_sync_sg_for_cpu(dev, sgl, nelems, dir);
> +
> +       iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs);
> +}
> +
> +static struct dma_map_ops iommu_dma_ops = {
> +       .alloc = __iommu_alloc_attrs,
> +       .free = __iommu_free_attrs,
> +       .mmap = __iommu_mmap_attrs,
> +       .get_sgtable = __iommu_get_sgtable,
> +       .map_page = __iommu_map_page,
> +       .unmap_page = __iommu_unmap_page,
> +       .map_sg = __iommu_map_sg_attrs,
> +       .unmap_sg = __iommu_unmap_sg_attrs,
> +       .sync_single_for_cpu = __iommu_sync_single_for_cpu,
> +       .sync_single_for_device = __iommu_sync_single_for_device,
> +       .sync_sg_for_cpu = __iommu_sync_sg_for_cpu,
> +       .sync_sg_for_device = __iommu_sync_sg_for_device,
> +       .dma_supported = iommu_dma_supported,
> +       .mapping_error = iommu_dma_mapping_error,
> +};
> +
> +/*
> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
> + * everything it needs to - the device is only partially created and the
> + * IOMMU driver hasn't seen it yet, so it can't have a group. Thus we
> + * need this delayed attachment dance. Once IOMMU probe ordering is sorted
> + * to move the arch_setup_dma_ops() call later, all the notifier bits below
> + * become unnecessary, and will go away.
> + */
> +struct iommu_dma_notifier_data {
> +       struct list_head list;
> +       struct device *dev;
> +       const struct iommu_ops *ops;
> +       u64 dma_base;
> +       u64 size;
> +};
> +static LIST_HEAD(iommu_dma_masters);
> +static DEFINE_MUTEX(iommu_dma_notifier_lock);
> +
> +/*
> + * Temporarily "borrow" a domain feature flag to tell if we had to resort
> + * to creating our own domain here, in case we need to clean it up again.
> + */
> +#define __IOMMU_DOMAIN_FAKE_DEFAULT            (1U << 31)
> +
> +static bool do_iommu_attach(struct device *dev, const struct iommu_ops *ops,
> +                          u64 dma_base, u64 size)
> +{
> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +       /*
> +        * Best case: The device is either part of a group which was
> +        * already attached to a domain in a previous call, or it's
> +        * been put in a default DMA domain by the IOMMU core.
> +        */
> +       if (!domain) {
> +               /*
> +                * Urgh. The IOMMU core isn't going to do default domains
> +                * for non-PCI devices anyway, until it has some means of
> +                * abstracting the entirely implementation-specific
> +                * sideband data/SoC topology/unicorn dust that may or
> +                * may not differentiate upstream masters.
> +                * So until then, HORRIBLE HACKS!
> +                */
> +               domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
> +               if (!domain)
> +                       goto out_no_domain;
> +
> +               domain->ops = ops;
> +               domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;

We require an iommu_get_dma_cookie(domain) call here. If we don't
allocate the IOMMU DMA cookie, iommu_dma_init_domain() will fail.
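
For example (untested, just to show the placement, reusing the existing
error labels in do_iommu_attach()):

		domain->ops = ops;
		domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;

		/* allocate the iova cookie that iommu_dma_init_domain() needs */
		if (iommu_get_dma_cookie(domain))
			goto out_put_domain;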

> +
> +               if (iommu_attach_device(domain, dev))
> +                       goto out_put_domain;
> +       }
> +
> +       if (iommu_dma_init_domain(domain, dma_base, size))
> +               goto out_detach;
> +
> +       dev->archdata.dma_ops = &iommu_dma_ops;
> +       return true;
> +
> +out_detach:
> +       iommu_detach_device(domain, dev);
> +out_put_domain:
> +       if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
> +               iommu_domain_free(domain);
> +out_no_domain:
> +       pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
> +               dev_name(dev));
> +       return false;
> +}
> +
> +static void queue_iommu_attach(struct device *dev, const struct iommu_ops *ops,
> +                             u64 dma_base, u64 size)
> +{
> +       struct iommu_dma_notifier_data *iommudata;
> +
> +       iommudata = kzalloc(sizeof(*iommudata), GFP_KERNEL);
> +       if (!iommudata)
> +               return;
> +
> +       iommudata->dev = dev;
> +       iommudata->ops = ops;
> +       iommudata->dma_base = dma_base;
> +       iommudata->size = size;
> +
> +       mutex_lock(&iommu_dma_notifier_lock);
> +       list_add(&iommudata->list, &iommu_dma_masters);
> +       mutex_unlock(&iommu_dma_notifier_lock);
> +}
> +
> +static int __iommu_attach_notifier(struct notifier_block *nb,
> +                                  unsigned long action, void *data)
> +{
> +       struct iommu_dma_notifier_data *master, *tmp;
> +
> +       if (action != BUS_NOTIFY_ADD_DEVICE)
> +               return 0;
> +
> +       mutex_lock(&iommu_dma_notifier_lock);
> +       list_for_each_entry_safe(master, tmp, &iommu_dma_masters, list) {
> +               if (do_iommu_attach(master->dev, master->ops,
> +                               master->dma_base, master->size)) {
> +                       list_del(&master->list);
> +                       kfree(master);
> +               }
> +       }
> +       mutex_unlock(&iommu_dma_notifier_lock);
> +       return 0;
> +}
> +
> +static int register_iommu_dma_ops_notifier(struct bus_type *bus)
> +{
> +       struct notifier_block *nb = kzalloc(sizeof(*nb), GFP_KERNEL);
> +       int ret;
> +
> +       if (!nb)
> +               return -ENOMEM;
> +       /*
> +        * The device must be attached to a domain before the driver probe
> +        * routine gets a chance to start allocating DMA buffers. However,
> +        * the IOMMU driver also needs a chance to configure the iommu_group
> +        * via its add_device callback first, so we need to make the attach
> +        * happen between those two points. Since the IOMMU core uses a bus
> +        * notifier with default priority for add_device, do the same but
> +        * with a lower priority to ensure the appropriate ordering.
> +        */
> +       nb->notifier_call = __iommu_attach_notifier;
> +       nb->priority = -100;
> +
> +       ret = bus_register_notifier(bus, nb);
> +       if (ret) {
> +               pr_warn("Failed to register DMA domain notifier; IOMMU DMA ops unavailable on bus '%s'\n",
> +                       bus->name);
> +               kfree(nb);
> +       }
> +       return ret;
> +}
> +
> +static int __init __iommu_dma_init(void)
> +{
> +       int ret;
> +
> +       ret = iommu_dma_init();
> +       if (!ret)
> +               ret = register_iommu_dma_ops_notifier(&platform_bus_type);
> +       if (!ret)
> +               ret = register_iommu_dma_ops_notifier(&amba_bustype);
> +       return ret;
> +}
> +arch_initcall(__iommu_dma_init);
> +
> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +                                 const struct iommu_ops *ops)
> +{
> +       struct iommu_group *group;
> +
> +       if (!ops)
> +               return;
> +       /*
> +        * TODO: As a concession to the future, we're ready to handle being
> +        * called both early and late (i.e. after bus_add_device). Once all
> +        * the platform bus code is reworked to call us late and the notifier
> +        * junk above goes away, move the body of do_iommu_attach here.
> +        */
> +       group = iommu_group_get(dev);
> +       if (group) {
> +               do_iommu_attach(dev, ops, dma_base, size);
> +               iommu_group_put(group);
> +       } else {
> +               queue_iommu_attach(dev, ops, dma_base, size);
> +       }
> +}
> +
> +#else
> +
> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +                                 struct iommu_ops *iommu)
> +{ }
> +
> +#endif  /* CONFIG_IOMMU_DMA */
> +
> --
> 1.9.1
>

Regards,
Anup

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v6 2/3] arm64: Add IOMMU dma_ops
@ 2015-10-07  9:03       ` Anup Patel
  0 siblings, 0 replies; 78+ messages in thread
From: Anup Patel @ 2015-10-07  9:03 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Robin,

On Fri, Oct 2, 2015 at 12:43 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>
> Since there is still work to do elsewhere to make DMA configuration happen
> in a more appropriate order and properly support platform devices in the
> IOMMU core, the device setup code unfortunately starts out carrying some
> workarounds to ensure it works correctly in the current state of things.
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 435 insertions(+)
>
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index 0bcc4bc..dd2d6e6 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -533,3 +533,438 @@ static int __init dma_debug_do_init(void)
>         return 0;
>  }
>  fs_initcall(dma_debug_do_init);
> +
> +
> +#ifdef CONFIG_IOMMU_DMA
> +#include <linux/dma-iommu.h>
> +#include <linux/platform_device.h>
> +#include <linux/amba/bus.h>
> +
> +/* Thankfully, all cache ops are by VA so we can ignore phys here */
> +static void flush_page(struct device *dev, const void *virt, phys_addr_t phys)
> +{
> +       __dma_flush_range(virt, virt + PAGE_SIZE);
> +}
> +
> +static void *__iommu_alloc_attrs(struct device *dev, size_t size,
> +                                dma_addr_t *handle, gfp_t gfp,
> +                                struct dma_attrs *attrs)
> +{
> +       bool coherent = is_device_dma_coherent(dev);
> +       int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent);
> +       void *addr;
> +
> +       if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
> +               return NULL;
> +       /*
> +        * Some drivers rely on this, and we probably don't want the
> +        * possibility of stale kernel data being read by devices anyway.
> +        */
> +       gfp |= __GFP_ZERO;
> +
> +       if (gfp & __GFP_WAIT) {
> +               struct page **pages;
> +               pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent);
> +
> +               pages = iommu_dma_alloc(dev, size, gfp, ioprot, handle,
> +                                       flush_page);
> +               if (!pages)
> +                       return NULL;
> +
> +               addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
> +                                             __builtin_return_address(0));
> +               if (!addr)
> +                       iommu_dma_free(dev, pages, size, handle);
> +       } else {
> +               struct page *page;
> +               /*
> +                * In atomic context we can't remap anything, so we'll only
> +                * get the virtually contiguous buffer we need by way of a
> +                * physically contiguous allocation.
> +                */
> +               if (coherent) {
> +                       page = alloc_pages(gfp, get_order(size));
> +                       addr = page ? page_address(page) : NULL;
> +               } else {
> +                       addr = __alloc_from_pool(size, &page, gfp);
> +               }
> +               if (!addr)
> +                       return NULL;
> +
> +               *handle = iommu_dma_map_page(dev, page, 0, size, ioprot);
> +               if (iommu_dma_mapping_error(dev, *handle)) {
> +                       if (coherent)
> +                               __free_pages(page, get_order(size));
> +                       else
> +                               __free_from_pool(addr, size);
> +                       addr = NULL;
> +               }
> +       }
> +       return addr;
> +}
> +
> +static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr,
> +                              dma_addr_t handle, struct dma_attrs *attrs)
> +{
> +       /*
> +        * @cpu_addr will be one of 3 things depending on how it was allocated:
> +        * - A remapped array of pages from iommu_dma_alloc(), for all
> +        *   non-atomic allocations.
> +        * - A non-cacheable alias from the atomic pool, for atomic
> +        *   allocations by non-coherent devices.
> +        * - A normal lowmem address, for atomic allocations by
> +        *   coherent devices.
> +        * Hence how dodgy the below logic looks...
> +        */
> +       if (__in_atomic_pool(cpu_addr, size)) {
> +               iommu_dma_unmap_page(dev, handle, size, 0, NULL);
> +               __free_from_pool(cpu_addr, size);
> +       } else if (is_vmalloc_addr(cpu_addr)){
> +               struct vm_struct *area = find_vm_area(cpu_addr);
> +
> +               if (WARN_ON(!area || !area->pages))
> +                       return;
> +               iommu_dma_free(dev, area->pages, size, &handle);
> +               dma_common_free_remap(cpu_addr, size, VM_USERMAP);
> +       } else {
> +               iommu_dma_unmap_page(dev, handle, size, 0, NULL);
> +               __free_pages(virt_to_page(cpu_addr), get_order(size));
> +       }
> +}
> +
> +static int __iommu_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
> +                             void *cpu_addr, dma_addr_t dma_addr, size_t size,
> +                             struct dma_attrs *attrs)
> +{
> +       struct vm_struct *area;
> +       int ret;
> +
> +       vma->vm_page_prot = __get_dma_pgprot(attrs, vma->vm_page_prot,
> +                                            is_device_dma_coherent(dev));
> +
> +       if (dma_mmap_from_coherent(dev, vma, cpu_addr, size, &ret))
> +               return ret;
> +
> +       area = find_vm_area(cpu_addr);
> +       if (WARN_ON(!area || !area->pages))
> +               return -ENXIO;
> +
> +       return iommu_dma_mmap(area->pages, size, vma);
> +}
> +
> +static int __iommu_get_sgtable(struct device *dev, struct sg_table *sgt,
> +                              void *cpu_addr, dma_addr_t dma_addr,
> +                              size_t size, struct dma_attrs *attrs)
> +{
> +       unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +       struct vm_struct *area = find_vm_area(cpu_addr);
> +
> +       if (WARN_ON(!area || !area->pages))
> +               return -ENXIO;
> +
> +       return sg_alloc_table_from_pages(sgt, area->pages, count, 0, size,
> +                                        GFP_KERNEL);
> +}
> +
> +static void __iommu_sync_single_for_cpu(struct device *dev,
> +                                       dma_addr_t dev_addr, size_t size,
> +                                       enum dma_data_direction dir)
> +{
> +       phys_addr_t phys;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
> +       __dma_unmap_area(phys_to_virt(phys), size, dir);
> +}
> +
> +static void __iommu_sync_single_for_device(struct device *dev,
> +                                          dma_addr_t dev_addr, size_t size,
> +                                          enum dma_data_direction dir)
> +{
> +       phys_addr_t phys;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
> +       __dma_map_area(phys_to_virt(phys), size, dir);
> +}
> +
> +static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
> +                                  unsigned long offset, size_t size,
> +                                  enum dma_data_direction dir,
> +                                  struct dma_attrs *attrs)
> +{
> +       bool coherent = is_device_dma_coherent(dev);
> +       int prot = dma_direction_to_prot(dir, coherent);
> +       dma_addr_t dev_addr = iommu_dma_map_page(dev, page, offset, size, prot);
> +
> +       if (!iommu_dma_mapping_error(dev, dev_addr) &&
> +           !dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> +               __iommu_sync_single_for_device(dev, dev_addr, size, dir);
> +
> +       return dev_addr;
> +}
> +
> +static void __iommu_unmap_page(struct device *dev, dma_addr_t dev_addr,
> +                              size_t size, enum dma_data_direction dir,
> +                              struct dma_attrs *attrs)
> +{
> +       if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> +               __iommu_sync_single_for_cpu(dev, dev_addr, size, dir);
> +
> +       iommu_dma_unmap_page(dev, dev_addr, size, dir, attrs);
> +}
> +
> +static void __iommu_sync_sg_for_cpu(struct device *dev,
> +                                   struct scatterlist *sgl, int nelems,
> +                                   enum dma_data_direction dir)
> +{
> +       struct scatterlist *sg;
> +       int i;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       for_each_sg(sgl, sg, nelems, i)
> +               __dma_unmap_area(sg_virt(sg), sg->length, dir);
> +}
> +
> +static void __iommu_sync_sg_for_device(struct device *dev,
> +                                      struct scatterlist *sgl, int nelems,
> +                                      enum dma_data_direction dir)
> +{
> +       struct scatterlist *sg;
> +       int i;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       for_each_sg(sgl, sg, nelems, i)
> +               __dma_map_area(sg_virt(sg), sg->length, dir);
> +}
> +
> +static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
> +                               int nelems, enum dma_data_direction dir,
> +                               struct dma_attrs *attrs)
> +{
> +       bool coherent = is_device_dma_coherent(dev);
> +
> +       if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> +               __iommu_sync_sg_for_device(dev, sgl, nelems, dir);
> +
> +       return iommu_dma_map_sg(dev, sgl, nelems,
> +                       dma_direction_to_prot(dir, coherent));
> +}
> +
> +static void __iommu_unmap_sg_attrs(struct device *dev,
> +                                  struct scatterlist *sgl, int nelems,
> +                                  enum dma_data_direction dir,
> +                                  struct dma_attrs *attrs)
> +{
> +       if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
> +               __iommu_sync_sg_for_cpu(dev, sgl, nelems, dir);
> +
> +       iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs);
> +}
> +
> +static struct dma_map_ops iommu_dma_ops = {
> +       .alloc = __iommu_alloc_attrs,
> +       .free = __iommu_free_attrs,
> +       .mmap = __iommu_mmap_attrs,
> +       .get_sgtable = __iommu_get_sgtable,
> +       .map_page = __iommu_map_page,
> +       .unmap_page = __iommu_unmap_page,
> +       .map_sg = __iommu_map_sg_attrs,
> +       .unmap_sg = __iommu_unmap_sg_attrs,
> +       .sync_single_for_cpu = __iommu_sync_single_for_cpu,
> +       .sync_single_for_device = __iommu_sync_single_for_device,
> +       .sync_sg_for_cpu = __iommu_sync_sg_for_cpu,
> +       .sync_sg_for_device = __iommu_sync_sg_for_device,
> +       .dma_supported = iommu_dma_supported,
> +       .mapping_error = iommu_dma_mapping_error,
> +};
> +
> +/*
> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
> + * everything it needs to - the device is only partially created and the
> + * IOMMU driver hasn't seen it yet, so it can't have a group. Thus we
> + * need this delayed attachment dance. Once IOMMU probe ordering is sorted
> + * to move the arch_setup_dma_ops() call later, all the notifier bits below
> + * become unnecessary, and will go away.
> + */
> +struct iommu_dma_notifier_data {
> +       struct list_head list;
> +       struct device *dev;
> +       const struct iommu_ops *ops;
> +       u64 dma_base;
> +       u64 size;
> +};
> +static LIST_HEAD(iommu_dma_masters);
> +static DEFINE_MUTEX(iommu_dma_notifier_lock);
> +
> +/*
> + * Temporarily "borrow" a domain feature flag to tell if we had to resort
> + * to creating our own domain here, in case we need to clean it up again.
> + */
> +#define __IOMMU_DOMAIN_FAKE_DEFAULT            (1U << 31)
> +
> +static bool do_iommu_attach(struct device *dev, const struct iommu_ops *ops,
> +                          u64 dma_base, u64 size)
> +{
> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +       /*
> +        * Best case: The device is either part of a group which was
> +        * already attached to a domain in a previous call, or it's
> +        * been put in a default DMA domain by the IOMMU core.
> +        */
> +       if (!domain) {
> +               /*
> +                * Urgh. The IOMMU core isn't going to do default domains
> +                * for non-PCI devices anyway, until it has some means of
> +                * abstracting the entirely implementation-specific
> +                * sideband data/SoC topology/unicorn dust that may or
> +                * may not differentiate upstream masters.
> +                * So until then, HORRIBLE HACKS!
> +                */
> +               domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
> +               if (!domain)
> +                       goto out_no_domain;
> +
> +               domain->ops = ops;
> +               domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;

We require iommu_get_dma_cookie(domain) here. If we don't
allocate the iommu cookie, then iommu_dma_init_domain() will fail.
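
For illustration, the call being suggested would slot in at roughly this
point, reusing the patch's existing error labels. This is only a sketch of
the suggestion, not part of the posted patch, and Robin's reply further down
moves the cookie handling into the IOMMU driver's domain_alloc() instead:

	/* Sketch only: give the newly-created fake default domain an iova
	 * cookie so that iommu_dma_init_domain() below can succeed */
	if (iommu_get_dma_cookie(domain))
		goto out_put_domain;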

> +
> +               if (iommu_attach_device(domain, dev))
> +                       goto out_put_domain;
> +       }
> +
> +       if (iommu_dma_init_domain(domain, dma_base, size))
> +               goto out_detach;
> +
> +       dev->archdata.dma_ops = &iommu_dma_ops;
> +       return true;
> +
> +out_detach:
> +       iommu_detach_device(domain, dev);
> +out_put_domain:
> +       if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
> +               iommu_domain_free(domain);
> +out_no_domain:
> +       pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
> +               dev_name(dev));
> +       return false;
> +}
> +
> +static void queue_iommu_attach(struct device *dev, const struct iommu_ops *ops,
> +                             u64 dma_base, u64 size)
> +{
> +       struct iommu_dma_notifier_data *iommudata;
> +
> +       iommudata = kzalloc(sizeof(*iommudata), GFP_KERNEL);
> +       if (!iommudata)
> +               return;
> +
> +       iommudata->dev = dev;
> +       iommudata->ops = ops;
> +       iommudata->dma_base = dma_base;
> +       iommudata->size = size;
> +
> +       mutex_lock(&iommu_dma_notifier_lock);
> +       list_add(&iommudata->list, &iommu_dma_masters);
> +       mutex_unlock(&iommu_dma_notifier_lock);
> +}
> +
> +static int __iommu_attach_notifier(struct notifier_block *nb,
> +                                  unsigned long action, void *data)
> +{
> +       struct iommu_dma_notifier_data *master, *tmp;
> +
> +       if (action != BUS_NOTIFY_ADD_DEVICE)
> +               return 0;
> +
> +       mutex_lock(&iommu_dma_notifier_lock);
> +       list_for_each_entry_safe(master, tmp, &iommu_dma_masters, list) {
> +               if (do_iommu_attach(master->dev, master->ops,
> +                               master->dma_base, master->size)) {
> +                       list_del(&master->list);
> +                       kfree(master);
> +               }
> +       }
> +       mutex_unlock(&iommu_dma_notifier_lock);
> +       return 0;
> +}
> +
> +static int register_iommu_dma_ops_notifier(struct bus_type *bus)
> +{
> +       struct notifier_block *nb = kzalloc(sizeof(*nb), GFP_KERNEL);
> +       int ret;
> +
> +       if (!nb)
> +               return -ENOMEM;
> +       /*
> +        * The device must be attached to a domain before the driver probe
> +        * routine gets a chance to start allocating DMA buffers. However,
> +        * the IOMMU driver also needs a chance to configure the iommu_group
> +        * via its add_device callback first, so we need to make the attach
> +        * happen between those two points. Since the IOMMU core uses a bus
> +        * notifier with default priority for add_device, do the same but
> +        * with a lower priority to ensure the appropriate ordering.
> +        */
> +       nb->notifier_call = __iommu_attach_notifier;
> +       nb->priority = -100;
> +
> +       ret = bus_register_notifier(bus, nb);
> +       if (ret) {
> +               pr_warn("Failed to register DMA domain notifier; IOMMU DMA ops unavailable on bus '%s'\n",
> +                       bus->name);
> +               kfree(nb);
> +       }
> +       return ret;
> +}
> +
> +static int __init __iommu_dma_init(void)
> +{
> +       int ret;
> +
> +       ret = iommu_dma_init();
> +       if (!ret)
> +               ret = register_iommu_dma_ops_notifier(&platform_bus_type);
> +       if (!ret)
> +               ret = register_iommu_dma_ops_notifier(&amba_bustype);
> +       return ret;
> +}
> +arch_initcall(__iommu_dma_init);
> +
> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +                                 const struct iommu_ops *ops)
> +{
> +       struct iommu_group *group;
> +
> +       if (!ops)
> +               return;
> +       /*
> +        * TODO: As a concession to the future, we're ready to handle being
> +        * called both early and late (i.e. after bus_add_device). Once all
> +        * the platform bus code is reworked to call us late and the notifier
> +        * junk above goes away, move the body of do_iommu_attach here.
> +        */
> +       group = iommu_group_get(dev);
> +       if (group) {
> +               do_iommu_attach(dev, ops, dma_base, size);
> +               iommu_group_put(group);
> +       } else {
> +               queue_iommu_attach(dev, ops, dma_base, size);
> +       }
> +}
> +
> +#else
> +
> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +                                 struct iommu_ops *iommu)
> +{ }
> +
> +#endif  /* CONFIG_IOMMU_DMA */
> +
> --
> 1.9.1
>

Regards,
Anup

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-06 11:00         ` Yong Wu
@ 2015-10-07 16:07           ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-07 16:07 UTC (permalink / raw)
  To: Yong Wu
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 06/10/15 12:00, Yong Wu wrote:
> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>> Taking some inspiration from the arch/arm code, implement the
>> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>>
>> Since there is still work to do elsewhere to make DMA configuration happen
>> in a more appropriate order and properly support platform devices in the
>> IOMMU core, the device setup code unfortunately starts out carrying some
>> workarounds to ensure it works correctly in the current state of things.
>>
>> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
>> ---
>>   arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 435 insertions(+)
>>
> [...]
>> +/*
>> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
>> + * everything it needs to - the device is only partially created and the
>> + * IOMMU driver hasn't seen it yet, so it can't have a group. Thus we
>> + * need this delayed attachment dance. Once IOMMU probe ordering is sorted
>> + * to move the arch_setup_dma_ops() call later, all the notifier bits below
>> + * become unnecessary, and will go away.
>> + */
>
> Hi Robin,
>        Could I ask a question about the future plan:
>        How will the arch_setup_dma_ops() call be moved later than IOMMU probe?
>
>        arch_setup_dma_ops() is called from of_dma_configure(), which is called
> from arm64_device_init(), while the IOMMU probe happens at subsys_initcall
> level. So arch_setup_dma_ops() will normally run before the IOMMU probe, is
> that right?

Yup, hence the need to call of_platform_device_create() manually in your 
IOMMU_OF_DECLARE init function if you need the actual device instance to 
be ready before the root of_platform_populate() runs.
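
For reference, that early-create step looks roughly like the sketch below;
the driver name and compatible string are made up purely for illustration:

/* Minimal sketch: create the IOMMU's own platform device from the
 * IOMMU_OF_DECLARE hook, so that the device instance exists before the
 * root of_platform_populate() reaches the masters that reference it. */
static int __init foo_iommu_of_setup(struct device_node *np)
{
	if (!of_platform_device_create(np, NULL, platform_bus_type.dev_root))
		return -ENODEV;

	return 0;
}
IOMMU_OF_DECLARE(foo_iommu, "vendor,foo-iommu", foo_iommu_of_setup);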

>        Could Laurent's probe-deferral series help do this? What's the state
> of that series?

What Laurent's patches do is to leave the DMA mask configuration where 
it is early in device creation, but split out the dma_ops configuration 
to be called just before the actual driver probe, and defer that if the 
IOMMU device hasn't probed yet. At the moment, those patches (plus a bit 
of my own development on top) are working fairly well in the simple 
case, but I've seen things start falling apart if the client driver then 
requests its own probe deferral, and there are probably other 
troublesome edge cases to find - I need to dig into that further, but 
sorting out my ARM SMMU driver patches is currently looking like a 
higher priority.
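
For anyone who hasn't seen those patches, the rough shape is something like
the sketch below. The hook name is made up, and of_iommu_configure()
reporting -EPROBE_DEFER is part of the proposal rather than current mainline
behaviour:

/* Sketch: dma_ops configuration moves out of early device creation and
 * into a hook run just before the driver's probe routine, where probe
 * deferral is still an option. dma_base/size/coherent are whatever
 * of_dma_configure() parsed out of the DT earlier. */
static int dma_configure_before_probe(struct device *dev, u64 dma_base,
				      u64 size, bool coherent)
{
	struct iommu_ops *ops = NULL;

	if (dev->of_node)
		ops = of_iommu_configure(dev, dev->of_node);

	/* Proposed: a registered-but-not-yet-probed IOMMU comes back as
	 * ERR_PTR(-EPROBE_DEFER) instead of NULL */
	if (IS_ERR(ops))
		return PTR_ERR(ops);

	arch_setup_dma_ops(dev, dma_base, size, ops, coherent);
	return 0;
}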

>> +struct iommu_dma_notifier_data {
>> +	struct list_head list;
>> +	struct device *dev;
>> +	const struct iommu_ops *ops;
>> +	u64 dma_base;
>> +	u64 size;
>> +};
>> +static LIST_HEAD(iommu_dma_masters);
>> +static DEFINE_MUTEX(iommu_dma_notifier_lock);
>> +
>> +/*
>> + * Temporarily "borrow" a domain feature flag to tell if we had to resort
>> + * to creating our own domain here, in case we need to clean it up again.
>> + */
>> +#define __IOMMU_DOMAIN_FAKE_DEFAULT		(1U << 31)
>> +
>> +static bool do_iommu_attach(struct device *dev, const struct iommu_ops *ops,
>> +			   u64 dma_base, u64 size)
>> +{
>> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>> +
>> +	/*
>> +	 * Best case: The device is either part of a group which was
>> +	 * already attached to a domain in a previous call, or it's
>> +	 * been put in a default DMA domain by the IOMMU core.
>> +	 */
>> +	if (!domain) {
>> +		/*
>> +		 * Urgh. The IOMMU core isn't going to do default domains
>> +		 * for non-PCI devices anyway, until it has some means of
>> +		 * abstracting the entirely implementation-specific
>> +		 * sideband data/SoC topology/unicorn dust that may or
>> +		 * may not differentiate upstream masters.
>> +		 * So until then, HORRIBLE HACKS!
>> +		 */
>> +		domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
>> +		if (!domain)
>> +			goto out_no_domain;
>> +
>> +		domain->ops = ops;
>> +		domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;
>> +
>> +		if (iommu_attach_device(domain, dev))
>> +			goto out_put_domain;
>> +	}
>> +
>> +	if (iommu_dma_init_domain(domain, dma_base, size))
>> +		goto out_detach;
>> +
>> +	dev->archdata.dma_ops = &iommu_dma_ops;
>> +	return true;
>> +
>> +out_detach:
>> +	iommu_detach_device(domain, dev);
>> +out_put_domain:
>> +	if (domain->type & __IOMMU_DOMAIN_FAKE_DEFAULT)
>> +		iommu_domain_free(domain);
>> +out_no_domain:
>> +	pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
>> +		dev_name(dev));
>> +	return false;
>> +}
> [...]
>> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
>> +				  const struct iommu_ops *ops)
>> +{
>> +	struct iommu_group *group;
>> +
>> +	if (!ops)
>> +		return;
>> +	/*
>> +	 * TODO: As a concession to the future, we're ready to handle being
>> +	 * called both early and late (i.e. after bus_add_device). Once all
>> +	 * the platform bus code is reworked to call us late and the notifier
>> +	 * junk above goes away, move the body of do_iommu_attach here.
>> +	 */
>> +	group = iommu_group_get(dev);
>
>     If iommu_setup_dma_ops() runs after bus_add_device(), then the device
> already has its group here. It will enter do_iommu_attach(), which will
> allocate a default iommu domain and attach this device to that new domain.
>     But mtk-iommu doesn't want that; we would like all devices to attach to
> the same domain. So we should allocate a default iommu domain (if there is
> no iommu domain at that time) and attach the device to that same domain in
> our xx_add_device, is that right?

Yes, if you attach the device to your own 'real' default domain after 
setting up the group in add_device, then do_iommu_attach() will now pick 
that domain up and use it instead of trying to create a new one, and the 
arch code will stop short of tearing the domain down if the device probe 
fails and it gets detached again. Additionally, since from add_device 
you should hopefully have all the information you need to get back to 
the relevant m4u instance, it should now be OK to keep the default 
domain there and finally get rid of that pesky global variable.
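
To make that concrete, the arrangement might look something like the sketch
below. Every name here is hypothetical (m4u_domain_alloc() stands in for the
driver's own domain_alloc callback), and in a real driver the shared domain
pointer would live in the per-instance data rather than at file scope:

static struct iommu_domain *m4u_shared_domain;

static int m4u_add_device(struct device *dev)
{
	struct iommu_group *group;
	int ret;

	group = iommu_group_get_for_dev(dev);
	if (IS_ERR(group))
		return PTR_ERR(group);

	/* One "real" default domain shared by every master behind this
	 * IOMMU (in practice this would live in the per-instance data) */
	if (!m4u_shared_domain)
		m4u_shared_domain = m4u_domain_alloc(IOMMU_DOMAIN_DMA);
	if (!m4u_shared_domain) {
		ret = -ENOMEM;
		goto out_put;
	}

	/* Attach now, so do_iommu_attach() later finds this domain via
	 * iommu_get_domain_for_dev() and reuses it instead of creating
	 * its own (a real driver would skip already-attached groups) */
	ret = iommu_attach_group(m4u_shared_domain, group);
out_put:
	iommu_group_put(group);
	return ret;
}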

Robin.

>> +	if (group) {
>> +		do_iommu_attach(dev, ops, dma_base, size);
>> +		iommu_group_put(group);
>> +	} else {
>> +		queue_iommu_attach(dev, ops, dma_base, size);
>> +	}
>> +}
>> +
>> +#else
>> +
>> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
>> +				  struct iommu_ops *iommu)
>> +{ }
>> +
>> +#endif  /* CONFIG_IOMMU_DMA */
>> +
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-07  9:03       ` Anup Patel
@ 2015-10-07 16:36           ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-07 16:36 UTC (permalink / raw)
  To: Anup Patel
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw, Anup Patel,
	Catalin Marinas, Will Deacon,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA, linux-arm-kernel

On 07/10/15 10:03, Anup Patel wrote:
[...]
>> +static bool do_iommu_attach(struct device *dev, const struct iommu_ops *ops,
>> +                          u64 dma_base, u64 size)
>> +{
>> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>> +
>> +       /*
>> +        * Best case: The device is either part of a group which was
>> +        * already attached to a domain in a previous call, or it's
>> +        * been put in a default DMA domain by the IOMMU core.
>> +        */
>> +       if (!domain) {
>> +               /*
>> +                * Urgh. The IOMMU core isn't going to do default domains
>> +                * for non-PCI devices anyway, until it has some means of
>> +                * abstracting the entirely implementation-specific
>> +                * sideband data/SoC topology/unicorn dust that may or
>> +                * may not differentiate upstream masters.
>> +                * So until then, HORRIBLE HACKS!
>> +                */
>> +               domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
>> +               if (!domain)
>> +                       goto out_no_domain;
>> +
>> +               domain->ops = ops;
>> +               domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_FAKE_DEFAULT;
>
> We require iommu_get_dma_cookie(domain) here. If we don't
> allocate the iommu cookie, then iommu_dma_init_domain() will fail.

The iova cookie is tightly coupled with the domain, so it really only 
makes sense for the IOMMU driver to deal with it as part of its 
domain_alloc/domain_free callbacks.

Doing that here was one of the nastier 'compatibility' hacks which I've 
now taken out; trying to make things work without modifying existing 
IOMMU drivers was just too impractical. Drivers which want to play are 
now required to support the IOMMU_DOMAIN_DMA type appropriately, but 
it's still only a minimal change, as per the example diff for the ARM 
SMMU driver below (big pile of further patches necessary to make said 
driver compatible in other respects notwithstanding).

Robin.

--->8---
@@ -29,6 +29,7 @@
  #define pr_fmt(fmt) "arm-smmu: " fmt

  #include <linux/delay.h>
+#include <linux/dma-iommu.h>
  #include <linux/dma-mapping.h>
  #include <linux/err.h>
  #include <linux/interrupt.h>
@@ -973,7 +974,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
  {
  	struct arm_smmu_domain *smmu_domain;

-	if (type != IOMMU_DOMAIN_UNMANAGED)
+	if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
  		return NULL;
  	/*
  	 * Allocate the domain and initialise some of its data structures.
@@ -984,6 +985,12 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
  	if (!smmu_domain)
  		return NULL;

+	if (type == IOMMU_DOMAIN_DMA &&
+	    iommu_get_dma_cookie(&smmu_domain->domain)) {
+		kfree(smmu_domain);
+		return NULL;
+	}
+
  	mutex_init(&smmu_domain->init_mutex);
  	spin_lock_init(&smmu_domain->pgtbl_lock);

@@ -998,6 +1005,7 @@ static void arm_smmu_domain_free(struct iommu_domain *domain)
  	 * Free the domain resources. We assume that all devices have
  	 * already been detached.
  	 */
+	iommu_put_dma_cookie(domain);
  	arm_smmu_destroy_domain_context(domain);
  	kfree(smmu_domain);
  }

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-07 16:36           ` Robin Murphy
@ 2015-10-07 17:40             ` Anup Patel
  -1 siblings, 0 replies; 78+ messages in thread
From: Anup Patel @ 2015-10-07 17:40 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas, Anup Patel, Catalin Marinas, joro,
	Will Deacon, iommu, Daniel Kurtz, yong.wu, thunder.leizhen,
	yingjoe.chen, treding, linux-arm-kernel

On Wed, Oct 7, 2015 at 10:06 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 07/10/15 10:03, Anup Patel wrote:
> [...]
>
>>> +static bool do_iommu_attach(struct device *dev, const struct iommu_ops
>>> *ops,
>>> +                          u64 dma_base, u64 size)
>>> +{
>>> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>>> +
>>> +       /*
>>> +        * Best case: The device is either part of a group which was
>>> +        * already attached to a domain in a previous call, or it's
>>> +        * been put in a default DMA domain by the IOMMU core.
>>> +        */
>>> +       if (!domain) {
>>> +               /*
>>> +                * Urgh. The IOMMU core isn't going to do default domains
>>> +                * for non-PCI devices anyway, until it has some means of
>>> +                * abstracting the entirely implementation-specific
>>> +                * sideband data/SoC topology/unicorn dust that may or
>>> +                * may not differentiate upstream masters.
>>> +                * So until then, HORRIBLE HACKS!
>>> +                */
>>> +               domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
>>> +               if (!domain)
>>> +                       goto out_no_domain;
>>> +
>>> +               domain->ops = ops;
>>> +               domain->type = IOMMU_DOMAIN_DMA |
>>> __IOMMU_DOMAIN_FAKE_DEFAULT;
>>
>>
>> We require iommu_get_dma_cookie(domain) here. If we don't
>> allocate the iommu cookie, then iommu_dma_init_domain() will fail.
>
>
> The iova cookie is tightly coupled with the domain, so it really only makes
> sense for the IOMMU driver to deal with it as part of its
> domain_alloc/domain_free callbacks.
>
> Doing that here was one of the nastier 'compatibility' hacks which I've now
> taken out; trying to make things work without modifying existing IOMMU
> drivers was just too impractical. Drivers which want to play are now
> required to support the IOMMU_DOMAIN_DMA type appropriately, but it's still
> only a minimal change, as per the example diff for the ARM SMMU driver below
> (big pile of further patches necessary to make said driver compatible in
> other respects notwithstanding).

Many thanks for the info.

Actually, I have been trying your dma-mapping patches with the SMMU driver,
and I have got the SMMU driver somewhat working with IOMMU_DOMAIN_DMA. Your
suggestion will help me restrict my changes to the SMMU driver only.

This patchset looks good to me.

Thanks,
Anup

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-07 16:07           ` Robin Murphy
@ 2015-10-09  5:44               ` Yong Wu
  -1 siblings, 0 replies; 78+ messages in thread
From: Yong Wu @ 2015-10-09  5:44 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, 2015-10-07 at 17:07 +0100, Robin Murphy wrote:
> On 06/10/15 12:00, Yong Wu wrote:
> > On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
> >> Taking some inspiration from the arch/arm code, implement the
> >> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
> >>
> >> Since there is still work to do elsewhere to make DMA configuration happen
> >> in a more appropriate order and properly support platform devices in the
> >> IOMMU core, the device setup code unfortunately starts out carrying some
> >> workarounds to ensure it works correctly in the current state of things.
> >>
> >> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
> >> ---
> >>   arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 435 insertions(+)
> >>
> > [...]
> >> +/*
> >> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
> >> + * everything it needs to - the device is only partially created and the
> >> + * IOMMU driver hasn't seen it yet, so it can't have a group. Thus we
> >> + * need this delayed attachment dance. Once IOMMU probe ordering is sorted
> >> + * to move the arch_setup_dma_ops() call later, all the notifier bits below
> >> + * become unnecessary, and will go away.
> >> + */
> >
> > Hi Robin,
> >        Could I ask a question about the future plan:
> >        How will the arch_setup_dma_ops() call be moved later than IOMMU probe?
> >
> >        arch_setup_dma_ops() is called from of_dma_configure(), which is called
> > from arm64_device_init(), while the IOMMU probe happens at subsys_initcall
> > level. So arch_setup_dma_ops() will normally run before the IOMMU probe, is
> > that right?
> 
> Yup, hence the need to call of_platform_device_create() manually in your 
> IOMMU_OF_DECLARE init function if you need the actual device instance to 
> be ready before the root of_platform_populate() runs.

Thanks. I have added of_platform_device_create().

If arch_setup_dma_ops() is always called before the IOMMU probe, what is
the meaning of the TODO comment here? Will arch_setup_dma_ops() be moved
to run later than the IOMMU probe? How would that be done?

> 
> >        Could Laurent's probe-deferral series help do this? What's the state
> > of that series?
> 
> What Laurent's patches do is to leave the DMA mask configuration where 
> it is early in device creation, but split out the dma_ops configuration 
> to be called just before the actual driver probe, and defer that if the 
> IOMMU device hasn't probed yet. At the moment, those patches (plus a bit 
> of my own development on top) are working fairly well in the simple 
> case, but I've seen things start falling apart if the client driver then 
> requests its own probe deferral, and there are probably other 
> troublesome edge cases to find - I need to dig into that further, but 
> sorting out my ARM SMMU driver patches is currently looking like a 
> higher priority.
> 
> >> +struct iommu_dma_notifier_data {
> >> +	struct list_head list;
> >> +	struct device *dev;
> >> +	const struct iommu_ops *ops;
> >> +	u64 dma_base;
> >> +	u64 size;
> >> +};
> [...]
> >> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> >> +				  const struct iommu_ops *ops)
> >> +{
> >> +	struct iommu_group *group;
> >> +
> >> +	if (!ops)
> >> +		return;
> >> +	/*
> >> +	 * TODO: As a concession to the future, we're ready to handle being
> >> +	 * called both early and late (i.e. after bus_add_device). Once all
> >> +	 * the platform bus code is reworked to call us late and the notifier
> >> +	 * junk above goes away, move the body of do_iommu_attach here.
> >> +	 */
> >> +	group = iommu_group_get(dev);
> >
> >     If iommu_setup_dma_ops() runs after bus_add_device(), then the device
> > already has its group here. It will enter do_iommu_attach(), which will
> > allocate a default iommu domain and attach this device to that new domain.
> >     But mtk-iommu doesn't want that; we would like all devices to attach to
> > the same domain. So we should allocate a default iommu domain (if there is
> > no iommu domain at that time) and attach the device to that same domain in
> > our xx_add_device, is that right?
> 
> Yes, if you attach the device to your own 'real' default domain after 
> setting up the group in add_device, then do_iommu_attach() will now pick 
> that domain up and use it instead of trying to create a new one, and the 
> arch code will stop short of tearing the domain down if the device probe 
> fails and it gets detached again. Additionally, since from add_device 
> you should hopefully have all the information you need to get back to 
> the relevant m4u instance, it should now be OK to keep the default 
> domain there and finally get rid of that pesky global variable.
> 
> Robin.

Thanks very much for your confirmation.

I have added it following this approach. As above, arch_setup_dma_ops()
is always called before the IOMMU probe, so attach_device will be called
earlier than probe and the iommu domain will already exist by the time
add_device runs.

Meanwhile, I attach all the iommu devices to the same domain in
add_device and free the other, unnecessary domains in attach_device.

What I wrote here may not be clear, so I have sent the detailed code[1];
when you are free, please help take a look.

Currently this is based on the assumption that arch_setup_dma_ops() runs
before the IOMMU probe, but from your TODO comment the ordering of
arch_setup_dma_ops() may change; does that mean this series is not the
final version? You also asked me "how can you guarantee domain_alloc()
happens before this driver is probed?", so I am a little confused by
this TODO.

[1]:http://lists.linuxfoundation.org/pipermail/iommu/2015-October/014591.html

> 
> >> +	if (group) {
> >> +		do_iommu_attach(dev, ops, dma_base, size);
> >> +		iommu_group_put(group);
> >> +	} else {
> >> +		queue_iommu_attach(dev, ops, dma_base, size);
> >> +	}
> >> +}
> >> +
> >> +#else
> >> +
> >> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> >> +				  struct iommu_ops *iommu)
> >> +{ }
> >> +
> >> +#endif  /* CONFIG_IOMMU_DMA */
> >> +
> >
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 0/3] arm64: IOMMU-backed DMA mapping
  2015-10-01 19:13 ` Robin Murphy
@ 2015-10-13 12:12     ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-13 12:12 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw, Catalin Marinas,
	Will Deacon, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Joerg,

On 01/10/15 20:13, Robin Murphy wrote:
> Hi all,
>
> Here's the latest, and hopefully last, revision of the initial arm64
> IOMMU dma_ops support.
>
> There are a couple of dependencies still currently in -next and the
> intel-iommu tree[0]: "iommu: iova: Move iova cache management to the
> iova library" is necessary for the rename of iova_cache_get(), and
> "iommu/iova: Avoid over-allocating when size-aligned" will be needed
> with some IOMMU drivers to prevent unmapping errors.
>
> Changes from v5[1]:
> - Change __iommu_dma_unmap() from BUG to WARN when things go wrong, and
>    prevent a NULL dereference on double-free.
> - Fix iommu_dma_map_sg() to ensure segments can never inadvertently end
>    mapped across a segment boundary. As a result, we have to lose the
>    segment-merging optimisation from before (I might revisit that if
>    there's some evidence it's really worthwhile, though).
> - Cleaned up the platform device workarounds for config order and
>    default domains, and removed the other hacks. Demanding that the IOMMU
>    drivers assign groups, and support IOMMU_DOMAIN_DMA via the methods
>    provided, keeps things bearable, and the behaviour should now be
>    consistent across all cases.

I realise there was one other point you raised which I never explicitly 
addressed (I had to go off and do some investigation at the time) - the 
domain detach/free in arch_teardown_dma_ops does currently get called if 
the client device driver fails its probe (or asks to defer). This 
prevents us leaking any of our own domains we had to create, and lets us 
start from a clean slate if the device gets re-probed later.

Anyway, what are your thoughts on taking this for 4.4? Since the 
dependencies are now in and we're at -rc5 already, I'm on the verge of 
reposting a self-contained version to go through arm64, as we really 
need to unblock all the follow-on development there (it's a shame that 
of the people I'm aware of who want this, it's only me and the Mediatek/Chrome 
guys here on the list saying so).

Yes, there's still plenty more to do in the OF/device layer for platform 
device infrastructure as a whole (I hope my last reply[0] helped explain 
why it has to look so upside-down compared to the nice simple x86 PCI 
model), but I can only do it one chunk at a time ;)

Thanks,
Robin.

[0]:http://article.gmane.org/gmane.linux.kernel.iommu/10607

> As a bonus, whilst the underlying of_iommu_configure() code only supports
> platform devices at the moment, I can also say that this has now been
> tested to work for PCI devices too, via some horrible hacks on a Juno r1.
>
> Thanks,
> Robin.
>
> [0]:http://thread.gmane.org/gmane.linux.kernel.iommu/11033
> [1]:http://thread.gmane.org/gmane.linux.kernel.iommu/10439
>
> Robin Murphy (3):
>    iommu: Implement common IOMMU ops for DMA mapping
>    arm64: Add IOMMU dma_ops
>    arm64: Hook up IOMMU dma_ops
>
>   arch/arm64/Kconfig                   |   1 +
>   arch/arm64/include/asm/dma-mapping.h |  15 +-
>   arch/arm64/mm/dma-mapping.c          | 457 ++++++++++++++++++++++++++++++
>   drivers/iommu/Kconfig                |   7 +
>   drivers/iommu/Makefile               |   1 +
>   drivers/iommu/dma-iommu.c            | 524 +++++++++++++++++++++++++++++++++++
>   include/linux/dma-iommu.h            |  85 ++++++
>   include/linux/iommu.h                |   1 +
>   8 files changed, 1083 insertions(+), 8 deletions(-)
>   create mode 100644 drivers/iommu/dma-iommu.c
>   create mode 100644 include/linux/dma-iommu.h
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-01 19:13     ` Robin Murphy
@ 2015-10-14 11:47         ` Joerg Roedel
  -1 siblings, 0 replies; 78+ messages in thread
From: Joerg Roedel @ 2015-10-14 11:47 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA

On Thu, Oct 01, 2015 at 08:13:59PM +0100, Robin Murphy wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
> 
> Since there is still work to do elsewhere to make DMA configuration happen
> in a more appropriate order and properly support platform devices in the
> IOMMU core, the device setup code unfortunately starts out carrying some
> workarounds to ensure it works correctly in the current state of things.
> 
> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
> ---
>  arch/arm64/mm/dma-mapping.c | 435 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 435 insertions(+)

This needs an ack from the arm64 maintainers.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 0/3] arm64: IOMMU-backed DMA mapping
  2015-10-13 12:12     ` Robin Murphy
@ 2015-10-14 11:50         ` joro at 8bytes.org
  -1 siblings, 0 replies; 78+ messages in thread
From: joro-zLv9SwRftAIdnm+yROfE0A @ 2015-10-14 11:50 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw, Catalin Marinas,
	Will Deacon, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Robin,

On Tue, Oct 13, 2015 at 01:12:46PM +0100, Robin Murphy wrote:
> Anyway, what are your thoughts on taking this for 4.4? Since the
> dependencies are now in and we're at -rc5 already, I'm on the verge
> of reposting a self-contained version to go through arm64, as we
> really need to unblock all the follow-on development there (it's a
> shame that of the people I'm aware of who want this, it's only me and the
> Mediatek/Chrome guys here on the list saying so).

I plan to take it for 4.4, given the missing ack arrives. What are your
plans to resolve the remaining issues, like with the probe deferral
patch-set?


	Joerg

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-01 19:13     ` Robin Murphy
@ 2015-10-14 13:35         ` Catalin Marinas
  -1 siblings, 0 replies; 78+ messages in thread
From: Catalin Marinas @ 2015-10-14 13:35 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Thu, Oct 01, 2015 at 08:13:59PM +0100, Robin Murphy wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
> 
> Since there is still work to do elsewhere to make DMA configuration happen
> in a more appropriate order and properly support platform devices in the
> IOMMU core, the device setup code unfortunately starts out carrying some
> workarounds to ensure it works correctly in the current state of things.
> 
> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>

Sorry, I reviewed this patch before but forgot to ack it, so here it is:

Acked-by: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>

(and I'm fine for the arm64 patches here to go in via the iommu tree)

I assume part of this patch will disappear at some point when the device
probing order is sorted.

-- 
Catalin

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-14 13:35         ` Catalin Marinas
@ 2015-10-14 16:34             ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-14 16:34 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA

On 14/10/15 14:35, Catalin Marinas wrote:
> On Thu, Oct 01, 2015 at 08:13:59PM +0100, Robin Murphy wrote:
>> Taking some inspiration from the arch/arm code, implement the
>> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>>
>> Since there is still work to do elsewhere to make DMA configuration happen
>> in a more appropriate order and properly support platform devices in the
>> IOMMU core, the device setup code unfortunately starts out carrying some
>> workarounds to ensure it works correctly in the current state of things.
>>
>> Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
>
> Sorry, I reviewed this patch before but forgot to ack it, so here it is:
>
> Acked-by: Catalin Marinas <catalin.marinas-5wv7dgnIgG8@public.gmane.org>

Thanks! Although it turns out I'm at least partly to blame there - you 
did give a reviewed-by on v5, but I didn't add it here since I'd made 
significant changes - I should have checked and called that out, my bad.

> (and I'm fine for the arm64 patches here to go in via the iommu tree)
>
> I assume part of this patch will disappear at some point when the device
> probing order is sorted.

I'll be working on that for 4.5, indeed. Getting the initialisation 
order sorted out also stands in the way of converting 32-bit to the 
common ops, so it's very high up my priority list.

Robin.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 0/3] arm64: IOMMU-backed DMA mapping
  2015-10-14 11:50         ` joro at 8bytes.org
@ 2015-10-14 18:19             ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-14 18:19 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw, Catalin Marinas,
	Will Deacon, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 14/10/15 12:50, joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org wrote:
> Hi Robin,
>
> On Tue, Oct 13, 2015 at 01:12:46PM +0100, Robin Murphy wrote:
>> Anyway, what are your thoughts on taking this for 4.4? Since the
>> dependencies are now in and we're at -rc5 already, I'm on the verge
>> of reposting a self-contained version to go through arm64, as we
>> really need to unblock all the follow-on development there (it's a
>> shame that of the people I'm aware of who want this, it's only me and the
>> Mediatek/Chrome guys here on the list saying so).
>
> I plan to take it for 4.4, given the missing ack arrives. What are your
> plans to resolve the remaining issues, like with the probe deferral
> patch-set?

That's great, thanks. I've been keeping an eye on the on-demand probing 
series[0], and now that things are looking promising there I'm having a 
go at pulling it in as it lets us solve the device dependency issue in 
an even neater way than deferral. With the rest of Laurent's patches to 
move the OF dma_ops configuration into the probe path, I just need to 
finish working out how to handle probe failure/deferral, and we should 
be good to go. That's my priority for 4.5, in parallel with properly 
wiring up the ARM SMMU driver to work nicely with this stuff.

Sorting out the configuration order also gets us a lot closer to 
handling multiple IOMMU drivers for platform devices, and moving 32-bit 
ARM over to the common code - plus I now have some RK3288 hardware which 
can serve to test both of those, too.

Robin.

[0]:http://thread.gmane.org/gmane.linux.acpi.devel/78653

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 0/3] arm64: IOMMU-backed DMA mapping
  2015-10-01 19:13 ` Robin Murphy
@ 2015-10-15 15:04     ` Joerg Roedel
  -1 siblings, 0 replies; 78+ messages in thread
From: Joerg Roedel @ 2015-10-15 15:04 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA

On Thu, Oct 01, 2015 at 08:13:57PM +0100, Robin Murphy wrote:
> Robin Murphy (3):
>   iommu: Implement common IOMMU ops for DMA mapping
>   arm64: Add IOMMU dma_ops
>   arm64: Hook up IOMMU dma_ops
> 
>  arch/arm64/Kconfig                   |   1 +
>  arch/arm64/include/asm/dma-mapping.h |  15 +-
>  arch/arm64/mm/dma-mapping.c          | 457 ++++++++++++++++++++++++++++++
>  drivers/iommu/Kconfig                |   7 +
>  drivers/iommu/Makefile               |   1 +
>  drivers/iommu/dma-iommu.c            | 524 +++++++++++++++++++++++++++++++++++
>  include/linux/dma-iommu.h            |  85 ++++++
>  include/linux/iommu.h                |   1 +
>  8 files changed, 1083 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/iommu/dma-iommu.c
>  create mode 100644 include/linux/dma-iommu.h

Applied to the core branch.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-10-01 19:13     ` Robin Murphy
@ 2015-10-26 13:44         ` Yong Wu
  -1 siblings, 0 replies; 78+ messages in thread
From: Yong Wu @ 2015-10-26 13:44 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	pochun.lin-NuS5LvNUpcJWk0Htik3J/w,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
[...]
> +/*
> + * The DMA API client is passing in a scatterlist which could describe
> + * any old buffer layout, but the IOMMU API requires everything to be
> + * aligned to IOMMU pages. Hence the need for this complicated bit of
> + * impedance-matching, to be able to hand off a suitably-aligned list,
> + * but still preserve the original offsets and sizes for the caller.
> + */
> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> +		int nents, int prot)
> +{
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +	struct iova_domain *iovad = domain->iova_cookie;
> +	struct iova *iova;
> +	struct scatterlist *s, *prev = NULL;
> +	dma_addr_t dma_addr;
> +	size_t iova_len = 0;
> +	int i;
> +
> +	/*
> +	 * Work out how much IOVA space we need, and align the segments to
> +	 * IOVA granules for the IOMMU driver to handle. With some clever
> +	 * trickery we can modify the list in-place, but reversibly, by
> +	 * hiding the original data in the as-yet-unused DMA fields.
> +	 */
> +	for_each_sg(sg, s, nents, i) {
> +		size_t s_offset = iova_offset(iovad, s->offset);
> +		size_t s_length = s->length;
> +
> +		sg_dma_address(s) = s->offset;
> +		sg_dma_len(s) = s_length;
> +		s->offset -= s_offset;
> +		s_length = iova_align(iovad, s_length + s_offset);
> +		s->length = s_length;
> +
> +		/*
> +		 * The simple way to avoid the rare case of a segment
> +		 * crossing the boundary mask is to pad the previous one
> +		 * to end at a naturally-aligned IOVA for this one's size,
> +		 * at the cost of potentially over-allocating a little.
> +		 */
> +		if (prev) {
> +			size_t pad_len = roundup_pow_of_two(s_length);
> +
> +			pad_len = (pad_len - iova_len) & (pad_len - 1);
> +			prev->length += pad_len;

Hi Robin,
      During our v4l2 testing, it seems that we have hit a problem here.
      Here we update prev->length again; do we need to update
sg_dma_len(prev) again too?

      Some functions, like vb2_dc_get_contiguous_size[1], always use
sg_dma_len(s) for comparison instead of s->length, so they may break
unexpectedly when sg_dma_len(s) is not the same as s->length.

[1]:
http://lxr.free-electrons.com/source/drivers/media/v4l2-core/videobuf2-dma-contig.c#L70


> +			iova_len += pad_len;
> +		}
> +
> +		iova_len += s_length;
> +		prev = s;
> +	}
> +
> +	iova = __alloc_iova(iovad, iova_len, dma_get_mask(dev));
> +	if (!iova)
> +		goto out_restore_sg;
> +
> +	/*
> +	 * We'll leave any physical concatenation to the IOMMU driver's
> +	 * implementation - it knows better than we do.
> +	 */
> +	dma_addr = iova_dma_addr(iovad, iova);
> +	if (iommu_map_sg(domain, dma_addr, sg, nents, prot) < iova_len)
> +		goto out_free_iova;
> +
> +	return __finalise_sg(dev, sg, nents, dma_addr);
> +
> +out_free_iova:
> +	__free_iova(iovad, iova);
> +out_restore_sg:
> +	__invalidate_sg(sg, nents);
> +	return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-10-26 13:44         ` Yong Wu
@ 2015-10-26 16:55           ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-26 16:55 UTC (permalink / raw)
  To: Yong Wu
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	pochun.lin-NuS5LvNUpcJWk0Htik3J/w,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 26/10/15 13:44, Yong Wu wrote:
> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
> [...]
>> +/*
>> + * The DMA API client is passing in a scatterlist which could describe
>> + * any old buffer layout, but the IOMMU API requires everything to be
>> + * aligned to IOMMU pages. Hence the need for this complicated bit of
>> + * impedance-matching, to be able to hand off a suitably-aligned list,
>> + * but still preserve the original offsets and sizes for the caller.
>> + */
>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>> +		int nents, int prot)
>> +{
>> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>> +	struct iova_domain *iovad = domain->iova_cookie;
>> +	struct iova *iova;
>> +	struct scatterlist *s, *prev = NULL;
>> +	dma_addr_t dma_addr;
>> +	size_t iova_len = 0;
>> +	int i;
>> +
>> +	/*
>> +	 * Work out how much IOVA space we need, and align the segments to
>> +	 * IOVA granules for the IOMMU driver to handle. With some clever
>> +	 * trickery we can modify the list in-place, but reversibly, by
>> +	 * hiding the original data in the as-yet-unused DMA fields.
>> +	 */
>> +	for_each_sg(sg, s, nents, i) {
>> +		size_t s_offset = iova_offset(iovad, s->offset);
>> +		size_t s_length = s->length;
>> +
>> +		sg_dma_address(s) = s->offset;
>> +		sg_dma_len(s) = s_length;
>> +		s->offset -= s_offset;
>> +		s_length = iova_align(iovad, s_length + s_offset);
>> +		s->length = s_length;
>> +
>> +		/*
>> +		 * The simple way to avoid the rare case of a segment
>> +		 * crossing the boundary mask is to pad the previous one
>> +		 * to end at a naturally-aligned IOVA for this one's size,
>> +		 * at the cost of potentially over-allocating a little.
>> +		 */
>> +		if (prev) {
>> +			size_t pad_len = roundup_pow_of_two(s_length);
>> +
>> +			pad_len = (pad_len - iova_len) & (pad_len - 1);
>> +			prev->length += pad_len;
>
> Hi Robin,
>        During our v4l2 testing, it seems that we have hit a problem here.
>        Here we update prev->length again; do we need to update
> sg_dma_len(prev) again too?
>
>        Some functions, like vb2_dc_get_contiguous_size[1], always use
> sg_dma_len(s) for comparison instead of s->length, so they may break
> unexpectedly when sg_dma_len(s) is not the same as s->length.

This is just tweaking the faked-up length that we hand off to 
iommu_map_sg() (see also the iova_align() above), to trick it into 
bumping this segment up to a suitable starting IOVA. The real length at 
this point is stashed in sg_dma_len(s), and will be copied back into 
s->length in __finalise_sg(), so both will hold the same true length 
once we return to the caller.

Yes, it does mean that if you have a list where the segment lengths are 
page aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, 
then you'll still end up with a gap between the second and third 
segments, but that's fine because the DMA API offers no guarantees about 
what the resulting DMA addresses will be (consider the no-IOMMU case 
where they would each just be "mapped" to their physical address). If 
that breaks v4l, then it's probably v4l's DMA API use that needs looking 
at (again).
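
To put numbers on that {64k, 16k, 64k} case, here's a quick standalone
sketch of the padding arithmetic above (plain userspace C with a local
roundup_pow_of_two() helper, assuming the lengths are already
IOVA-granule-aligned): no padding is needed before the 16k segment, but
the 16k segment gets padded by 48k so that the final 64k segment starts
at a 64k-aligned IOVA - hence the gap.

#include <stdio.h>
#include <stddef.h>

static size_t roundup_pow_of_two(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	size_t lengths[] = { 0x10000, 0x4000, 0x10000 };	/* 64k, 16k, 64k */
	size_t iova_len = 0;
	int i;

	for (i = 0; i < 3; i++) {
		size_t s_length = lengths[i];

		if (i > 0) {
			size_t pad_len = roundup_pow_of_two(s_length);

			/* same unsigned arithmetic as in iommu_dma_map_sg() */
			pad_len = (pad_len - iova_len) & (pad_len - 1);
			iova_len += pad_len;	/* what gets added to prev->length */
		}
		printf("segment %d starts at IOVA offset 0x%zx\n", i, iova_len);
		iova_len += s_length;
	}
	printf("total IOVA allocated: 0x%zx\n", iova_len);	/* 0x30000 */
	return 0;
}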

Robin.

> [1]:
> http://lxr.free-electrons.com/source/drivers/media/v4l2-core/videobuf2-dma-contig.c#L70

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-10-26 16:55           ` Robin Murphy
@ 2015-10-30  1:17             ` Daniel Kurtz
  -1 siblings, 0 replies; 78+ messages in thread
From: Daniel Kurtz @ 2015-10-30  1:17 UTC (permalink / raw)
  To: Robin Murphy, Pawel Osciak
  Cc: Yong Wu, Joerg Roedel, Will Deacon, Catalin Marinas,
	open list:IOMMU DRIVERS, linux-arm-kernel, thunder.leizhen,
	Yingjoe Chen, laurent.pinchart+renesas, Thierry Reding,
	Lin PoChun, Bobby Batacharia (via Google Docs),
	linux-media, Marek Szyprowski, Kyungmin Park

+linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
v4l2-contig's usage of the DMA API.

Hi Robin,

On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 26/10/15 13:44, Yong Wu wrote:
>>
>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>> [...]
>>>
>>> +/*
>>> + * The DMA API client is passing in a scatterlist which could describe
>>> + * any old buffer layout, but the IOMMU API requires everything to be
>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of
>>> + * impedance-matching, to be able to hand off a suitably-aligned list,
>>> + * but still preserve the original offsets and sizes for the caller.
>>> + */
>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>> +               int nents, int prot)
>>> +{
>>> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>>> +       struct iova_domain *iovad = domain->iova_cookie;
>>> +       struct iova *iova;
>>> +       struct scatterlist *s, *prev = NULL;
>>> +       dma_addr_t dma_addr;
>>> +       size_t iova_len = 0;
>>> +       int i;
>>> +
>>> +       /*
>>> +        * Work out how much IOVA space we need, and align the segments
>>> to
>>> +        * IOVA granules for the IOMMU driver to handle. With some clever
>>> +        * trickery we can modify the list in-place, but reversibly, by
>>> +        * hiding the original data in the as-yet-unused DMA fields.
>>> +        */
>>> +       for_each_sg(sg, s, nents, i) {
>>> +               size_t s_offset = iova_offset(iovad, s->offset);
>>> +               size_t s_length = s->length;
>>> +
>>> +               sg_dma_address(s) = s->offset;
>>> +               sg_dma_len(s) = s_length;
>>> +               s->offset -= s_offset;
>>> +               s_length = iova_align(iovad, s_length + s_offset);
>>> +               s->length = s_length;
>>> +
>>> +               /*
>>> +                * The simple way to avoid the rare case of a segment
>>> +                * crossing the boundary mask is to pad the previous one
>>> +                * to end at a naturally-aligned IOVA for this one's
>>> size,
>>> +                * at the cost of potentially over-allocating a little.
>>> +                */
>>> +               if (prev) {
>>> +                       size_t pad_len = roundup_pow_of_two(s_length);
>>> +
>>> +                       pad_len = (pad_len - iova_len) & (pad_len - 1);
>>> +                       prev->length += pad_len;
>>
>>
>> Hi Robin,
>>        During our v4l2 testing, it seems that we have hit a problem here.
>>        Here we update prev->length again; do we need to update
>> sg_dma_len(prev) again too?
>>
>>        Some functions, like vb2_dc_get_contiguous_size[1], always use
>> sg_dma_len(s) for comparison instead of s->length, so they may break
>> unexpectedly when sg_dma_len(s) is not the same as s->length.
>
>
> This is just tweaking the faked-up length that we hand off to iommu_map_sg()
> (see also the iova_align() above), to trick it into bumping this segment up
> to a suitable starting IOVA. The real length at this point is stashed in
> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
> both will hold the same true length once we return to the caller.
>
> Yes, it does mean that if you have a list where the segment lengths are page
> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
> still end up with a gap between the second and third segments, but that's
> fine because the DMA API offers no guarantees about what the resulting DMA
> addresses will be (consider the no-IOMMU case where they would each just be
> "mapped" to their physical address). If that breaks v4l, then it's probably
> v4l's DMA API use that needs looking at (again).

Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
memory pages into a contiguous block in device memory address space.
This would allow passing a dma mapped buffer to device dma using just
a device address and length.
IIUC, the change above breaks this model by inserting gaps in how the
buffer is mapped to device memory, such that the buffer is no longer
contiguous in dma address space.

Here is the code in question from
drivers/media/v4l2-core/videobuf2-dma-contig.c :

static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt)
{
        struct scatterlist *s;
        dma_addr_t expected = sg_dma_address(sgt->sgl);
        unsigned int i;
        unsigned long size = 0;

        for_each_sg(sgt->sgl, s, sgt->nents, i) {
                if (sg_dma_address(s) != expected)
                        break;
                expected = sg_dma_address(s) + sg_dma_len(s);
                size += sg_dma_len(s);
        }
        return size;
}


static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
        unsigned long size, enum dma_data_direction dma_dir)
{
        struct vb2_dc_conf *conf = alloc_ctx;
        struct vb2_dc_buf *buf;
        struct frame_vector *vec;
        unsigned long offset;
        int n_pages, i;
        int ret = 0;
        struct sg_table *sgt;
        unsigned long contig_size;
        unsigned long dma_align = dma_get_cache_alignment();
        DEFINE_DMA_ATTRS(attrs);

        dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs);

        buf = kzalloc(sizeof *buf, GFP_KERNEL);
        buf->dma_dir = dma_dir;

        offset = vaddr & ~PAGE_MASK;
        vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE);
        buf->vec = vec;
        n_pages = frame_vector_count(vec);

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);

        ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages,
                offset, size, GFP_KERNEL);

        sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents,
                                      buf->dma_dir, &attrs);

        contig_size = vb2_dc_get_contiguous_size(sgt);
        if (contig_size < size) {

    <<<===   if the original buffer had sg entries that were not
aligned on the "natural" alignment for their size, the new arm64 iommu
core code inserts  a 'gap' in the iommu mapping, which causes
vb2_dc_get_contiguous_size() to exit early (and return a smaller size
than expected).

                pr_err("contiguous mapping is too small %lu/%lu\n",
                        contig_size, size);
                ret = -EFAULT;
                goto fail_map_sg;
        }


So, is the videobuf2-dma-contig.c based on an incorrect assumption
about how the DMA API is supposed to work?
Is it even possible to map a "contiguous-in-iova-range" mapping for a
buffer given as an sg_table with an arbitrary set of pages?

Thanks for helping to move this forward.

-Dan

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-10-30  1:17             ` Daniel Kurtz
@ 2015-10-30 14:09               ` Joerg Roedel
  -1 siblings, 0 replies; 78+ messages in thread
From: Joerg Roedel @ 2015-10-30 14:09 UTC (permalink / raw)
  To: Daniel Kurtz
  Cc: Robin Murphy, Pawel Osciak, Yong Wu, Will Deacon,
	Catalin Marinas, open list:IOMMU DRIVERS, linux-arm-kernel,
	thunder.leizhen, Yingjoe Chen, laurent.pinchart+renesas,
	Thierry Reding, Lin PoChun, Bobby Batacharia (via Google Docs),
	linux-media, Marek Szyprowski, Kyungmin Park

On Fri, Oct 30, 2015 at 09:17:52AM +0800, Daniel Kurtz wrote:
> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
> memory pages into a contiguous block in device memory address space.
> This would allow passing a dma mapped buffer to device dma using just
> a device address and length.

If you are speaking of the dma_map_sg interface, then there is absolutely
no guarantee from the API side that the buffers you pass in will end up
mapped contiguously.
IOMMU drivers handle this differently, and when there is no IOMMU at all
there is also no way to map these buffers together.
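
(For illustration, a minimal sketch of what honouring that looks like
from a driver's point of view, using only the generic DMA/scatterlist
API - my_map_buffer() and my_device_queue_segment() are made-up names,
not anything from a real driver:)

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical per-segment programming of the device's DMA engine */
static void my_device_queue_segment(struct device *dev, dma_addr_t addr,
				    unsigned int len);

static int my_map_buffer(struct device *dev, struct sg_table *sgt)
{
	struct scatterlist *s;
	int i, nents;

	nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE);
	if (!nents)
		return -EIO;

	/* Each returned segment may start at an unrelated DMA address */
	for_each_sg(sgt->sgl, s, nents, i)
		my_device_queue_segment(dev, sg_dma_address(s), sg_dma_len(s));

	return 0;
}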

> So, is the videobuf2-dma-contig.c based on an incorrect assumption
> about how the DMA API is supposed to work?

If it makes the above assumption, then yes.



	Joerg


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-10-30  1:17             ` Daniel Kurtz
@ 2015-10-30 14:27               ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-10-30 14:27 UTC (permalink / raw)
  To: Daniel Kurtz, Pawel Osciak
  Cc: Yong Wu, Joerg Roedel, Will Deacon, Catalin Marinas,
	open list:IOMMU DRIVERS, linux-arm-kernel, thunder.leizhen,
	Yingjoe Chen, laurent.pinchart+renesas, Thierry Reding,
	Lin PoChun, Bobby Batacharia (via Google Docs),
	linux-media, Marek Szyprowski, Kyungmin Park

Hi Dan,

On 30/10/15 01:17, Daniel Kurtz wrote:
> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
> v4l2-contig's usage of the DMA API.
>
> Hi Robin,
>
> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> On 26/10/15 13:44, Yong Wu wrote:
>>>
>>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>>> [...]
>>>>
>>>> +/*
>>>> + * The DMA API client is passing in a scatterlist which could describe
>>>> + * any old buffer layout, but the IOMMU API requires everything to be
>>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of
>>>> + * impedance-matching, to be able to hand off a suitably-aligned list,
>>>> + * but still preserve the original offsets and sizes for the caller.
>>>> + */
>>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>>> +               int nents, int prot)
>>>> +{
>>>> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>>>> +       struct iova_domain *iovad = domain->iova_cookie;
>>>> +       struct iova *iova;
>>>> +       struct scatterlist *s, *prev = NULL;
>>>> +       dma_addr_t dma_addr;
>>>> +       size_t iova_len = 0;
>>>> +       int i;
>>>> +
>>>> +       /*
>>>> +        * Work out how much IOVA space we need, and align the segments
>>>> to
>>>> +        * IOVA granules for the IOMMU driver to handle. With some clever
>>>> +        * trickery we can modify the list in-place, but reversibly, by
>>>> +        * hiding the original data in the as-yet-unused DMA fields.
>>>> +        */
>>>> +       for_each_sg(sg, s, nents, i) {
>>>> +               size_t s_offset = iova_offset(iovad, s->offset);
>>>> +               size_t s_length = s->length;
>>>> +
>>>> +               sg_dma_address(s) = s->offset;
>>>> +               sg_dma_len(s) = s_length;
>>>> +               s->offset -= s_offset;
>>>> +               s_length = iova_align(iovad, s_length + s_offset);
>>>> +               s->length = s_length;
>>>> +
>>>> +               /*
>>>> +                * The simple way to avoid the rare case of a segment
>>>> +                * crossing the boundary mask is to pad the previous one
>>>> +                * to end at a naturally-aligned IOVA for this one's
>>>> size,
>>>> +                * at the cost of potentially over-allocating a little.
>>>> +                */
>>>> +               if (prev) {
>>>> +                       size_t pad_len = roundup_pow_of_two(s_length);
>>>> +
>>>> +                       pad_len = (pad_len - iova_len) & (pad_len - 1);
>>>> +                       prev->length += pad_len;
>>>
>>>
>>> Hi Robin,
>>>         While our v4l2 testing, It seems that we met a problem here.
>>>         Here we update prev->length again, Do we need update
>>> sg_dma_len(prev) again too?
>>>
>>>         Some function like vb2_dc_get_contiguous_size[1] always get
>>> sg_dma_len(s) to compare instead of s->length. so it may break
>>> unexpectedly while sg_dma_len(s) is not same with s->length.
>>
>>
>> This is just tweaking the faked-up length that we hand off to iommu_map_sg()
>> (see also the iova_align() above), to trick it into bumping this segment up
>> to a suitable starting IOVA. The real length at this point is stashed in
>> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
>> both will hold the same true length once we return to the caller.
>>
>> Yes, it does mean that if you have a list where the segment lengths are page
>> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
>> still end up with a gap between the second and third segments, but that's
>> fine because the DMA API offers no guarantees about what the resulting DMA
>> addresses will be (consider the no-IOMMU case where they would each just be
>> "mapped" to their physical address). If that breaks v4l, then it's probably
>> v4l's DMA API use that needs looking at (again).
>
> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
> memory pages into a contiguous block in device memory address space.
> This would allow passing a dma mapped buffer to device dma using just
> a device address and length.

Not at all. The streaming DMA API (dma_map_* and friends) has two 
responsibilities: performing any necessary cache maintenance to ensure 
the device will correctly see data from the CPU, and the CPU will 
correctly see data from the device; and working out an address for that 
buffer from the device's point of view to actually hand off to the 
hardware (which is perfectly well allowed to fail).
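
That permission to fail is visible even in the single-buffer case; a
minimal sketch of the expected calling pattern:

        dma_addr_t addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, addr))
                return -ENOMEM;         /* the mapping is allowed to fail */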

Consider SWIOTLB's implementation - segments which already lie at 
physical addresses within the device's DMA mask just get passed through, 
while those that lie outside it get mapped into the bounce buffer, but 
still as individual allocations (arch code just handles cache 
maintenance on the resulting physical addresses and can apply any 
hard-wired DMA offset for the device concerned).

> IIUC, the change above breaks this model by inserting gaps in how the
> buffer is mapped to device memory, such that the buffer is no longer
> contiguous in dma address space.

Even the existing arch/arm IOMMU DMA code which I guess this implicitly 
relies on doesn't guarantee that behaviour - if the mapping happens to 
reach one of the segment length/boundary limits it won't just leave a 
gap, it'll start an entirely new IOVA allocation which could well start 
at a wildly different address[0].
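
(For reference, those limits are whatever the driver has declared via its
dma_parms; a minimal sketch with made-up values, assuming dev->dma_parms
has been set up:)

        /* a device that can't take segments over 64K or crossing a 64K boundary */
        dma_set_max_seg_size(dev, SZ_64K);
        dma_set_seg_boundary(dev, SZ_64K - 1);

        /* dma_map_sg() implementations read these back via
         * dma_get_max_seg_size()/dma_get_seg_boundary() when laying out
         * the mapped segments */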

> Here is the code in question from
> drivers/media/v4l2-core/videobuf2-dma-contig.c :
>
> static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt)
> {
>          struct scatterlist *s;
>          dma_addr_t expected = sg_dma_address(sgt->sgl);
>          unsigned int i;
>          unsigned long size = 0;
>
>          for_each_sg(sgt->sgl, s, sgt->nents, i) {
>                  if (sg_dma_address(s) != expected)
>                          break;
>                  expected = sg_dma_address(s) + sg_dma_len(s);
>                  size += sg_dma_len(s);
>          }
>          return size;
> }
>
>
> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
>          unsigned long size, enum dma_data_direction dma_dir)
> {
>          struct vb2_dc_conf *conf = alloc_ctx;
>          struct vb2_dc_buf *buf;
>          struct frame_vector *vec;
>          unsigned long offset;
>          int n_pages, i;
>          int ret = 0;
>          struct sg_table *sgt;
>          unsigned long contig_size;
>          unsigned long dma_align = dma_get_cache_alignment();
>          DEFINE_DMA_ATTRS(attrs);
>
>          dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs);
>
>          buf = kzalloc(sizeof *buf, GFP_KERNEL);
>          buf->dma_dir = dma_dir;
>
>          offset = vaddr & ~PAGE_MASK;
>          vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE);
>          buf->vec = vec;
>          n_pages = frame_vector_count(vec);
>
>          sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
>
>          ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages,
>                  offset, size, GFP_KERNEL);
>
>          sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents,
>                                        buf->dma_dir, &attrs);
>
>          contig_size = vb2_dc_get_contiguous_size(sgt);

(as an aside, it's rather unintuitive that the handling of the 
dma_map_sg call actually failing is entirely implicit here)
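
Something like the below would at least make that failure path explicit
(a sketch; the error label is hypothetical):

        sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents,
                                      buf->dma_dir, &attrs);
        if (!sgt->nents) {
                ret = -EIO;
                goto fail_sgt;          /* hypothetical cleanup label */
        }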

>          if (contig_size < size) {
>
>      <<<===   if the original buffer had sg entries that were not
> aligned on the "natural" alignment for their size, the new arm64 iommu
> core code inserts  a 'gap' in the iommu mapping, which causes
> vb2_dc_get_contiguous_size() to exit early (and return a smaller size
> than expected).
>
>                  pr_err("contiguous mapping is too small %lu/%lu\n",
>                          contig_size, size);
>                  ret = -EFAULT;
>                  goto fail_map_sg;
>          }
>
>
> So, is the videobuf2-dma-contig.c based on an incorrect assumption
> about how the DMA API is supposed to work?
> Is it even possible to map a "contiguous-in-iova-range" mapping for a
> buffer given as an sg_table with an arbitrary set of pages?

 From the Streaming DMA mappings section of Documentation/DMA-API.txt:

   Note also that the above constraints on physical contiguity and
   dma_mask may not apply if the platform has an IOMMU (a device which
   maps an I/O DMA address to a physical memory address).  However, to be
   portable, device driver writers may *not* assume that such an IOMMU
   exists.

There's not strictly any harm in using the DMA API this way and *hoping* 
you get what you want, as long as you're happy for it to fail pretty 
much 100% of the time on some systems, and still in a minority of corner 
cases on any system. However, if there's a real dependency on IOMMUs and 
tight control of IOVA allocation here, then the DMA API isn't really the 
right tool for the job, and maybe it's time to start looking to how to 
better fit these multimedia-subsystem-type use cases into the IOMMU API 
- as far as I understand it there's at least some conceptual overlap 
with the HSA PASID stuff being prototyped in PCI/x86-land at the moment, 
so it could be an apposite time to try and bang out some common 
requirements.
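
For illustration, driving the IOMMU API directly would look very roughly
like the sketch below (error handling elided; the IOVA choice is entirely
up to the caller, which is exactly the control these use cases seem to
want):

        struct iommu_domain *dom = iommu_domain_alloc(&platform_bus_type);
        unsigned long iova = 0x10000000;        /* made-up IOVA; the caller owns this space */
        size_t mapped;

        iommu_attach_device(dom, dev);
        mapped = iommu_map_sg(dom, iova, sgt->sgl, sgt->orig_nents,
                              IOMMU_READ | IOMMU_WRITE);
        if (!mapped)
                return -ENOMEM;         /* nothing was mapped */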

Robin.

[0]:http://article.gmane.org/gmane.linux.kernel.iommu/11185

>
> Thanks for helping to move this forward.
>
> -Dan
>


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
       [not found]               ` <20151030140923.GJ27420-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>
@ 2015-10-30 18:18                 ` Mark Hounschell
  0 siblings, 0 replies; 78+ messages in thread
From: Mark Hounschell @ 2015-10-30 18:18 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 10/30/2015 10:09 AM, Joerg Roedel wrote:
> On Fri, Oct 30, 2015 at 09:17:52AM +0800, Daniel Kurtz wrote:
>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
>> memory pages into a contiguous block in device memory address space.
>> This would allow passing a dma mapped buffer to device dma using just
>> a device address and length.
>
> If you are speaking of the dma_map_sg interface, than there is absolutly
> no guarantee from the API side that the buffers you pass in will end up
> mapped contiguously.
> IOMMU drivers handle this differently, and when there is no IOMMU at all
> there is also no way to map these buffers together.
>

That is what CMA is for, you know. It makes the buffer physically contiguous.
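
i.e. allocate through the coherent API and let CMA back it; a minimal
sketch:

        dma_addr_t dma_handle;
        void *cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);

        if (!cpu_addr)
                return -ENOMEM;
        /* with a CMA-backed allocator this region is physically contiguous */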

Mark

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-10-30 14:27               ` Robin Murphy
@ 2015-11-02 13:11                 ` Daniel Kurtz
  -1 siblings, 0 replies; 78+ messages in thread
From: Daniel Kurtz @ 2015-11-02 13:11 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Lin PoChun, linux-arm-kernel, Yingjoe Chen, Will Deacon,
	linux-media, Thierry Reding, open list:IOMMU DRIVERS,
	Bobby Batacharia (via Google Docs),
	Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak,
	laurent.pinchart+renesas, Joerg Roedel, thunder.leizhen,
	Catalin Marinas, Tomasz Figa, Russell King, linux-mediatek

+Tomasz, so he can reply to the thread
+Marek and Russell as recommended by Tomasz

On Oct 30, 2015 22:27, "Robin Murphy" <robin.murphy@arm.com> wrote:
>
> Hi Dan,
>
> On 30/10/15 01:17, Daniel Kurtz wrote:
>>
>> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
>> v4l2-contig's usage of the DMA API.
>>
>> Hi Robin,
>>
>> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>>>
>>> On 26/10/15 13:44, Yong Wu wrote:
>>>>
>>>>
>>>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>>>> [...]
>>>>>
>>>>>
>>>>> +/*
>>>>> + * The DMA API client is passing in a scatterlist which could describe
>>>>> + * any old buffer layout, but the IOMMU API requires everything to be
>>>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of
>>>>> + * impedance-matching, to be able to hand off a suitably-aligned list,
>>>>> + * but still preserve the original offsets and sizes for the caller.
>>>>> + */
>>>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>>>> +               int nents, int prot)
>>>>> +{
>>>>> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>>>>> +       struct iova_domain *iovad = domain->iova_cookie;
>>>>> +       struct iova *iova;
>>>>> +       struct scatterlist *s, *prev = NULL;
>>>>> +       dma_addr_t dma_addr;
>>>>> +       size_t iova_len = 0;
>>>>> +       int i;
>>>>> +
>>>>> +       /*
>>>>> +        * Work out how much IOVA space we need, and align the segments
>>>>> to
>>>>> +        * IOVA granules for the IOMMU driver to handle. With some clever
>>>>> +        * trickery we can modify the list in-place, but reversibly, by
>>>>> +        * hiding the original data in the as-yet-unused DMA fields.
>>>>> +        */
>>>>> +       for_each_sg(sg, s, nents, i) {
>>>>> +               size_t s_offset = iova_offset(iovad, s->offset);
>>>>> +               size_t s_length = s->length;
>>>>> +
>>>>> +               sg_dma_address(s) = s->offset;
>>>>> +               sg_dma_len(s) = s_length;
>>>>> +               s->offset -= s_offset;
>>>>> +               s_length = iova_align(iovad, s_length + s_offset);
>>>>> +               s->length = s_length;
>>>>> +
>>>>> +               /*
>>>>> +                * The simple way to avoid the rare case of a segment
>>>>> +                * crossing the boundary mask is to pad the previous one
>>>>> +                * to end at a naturally-aligned IOVA for this one's
>>>>> size,
>>>>> +                * at the cost of potentially over-allocating a little.
>>>>> +                */
>>>>> +               if (prev) {
>>>>> +                       size_t pad_len = roundup_pow_of_two(s_length);
>>>>> +
>>>>> +                       pad_len = (pad_len - iova_len) & (pad_len - 1);
>>>>> +                       prev->length += pad_len;
>>>>
>>>>
>>>>
>>>> Hi Robin,
>>>>         While our v4l2 testing, It seems that we met a problem here.
>>>>         Here we update prev->length again, Do we need update
>>>> sg_dma_len(prev) again too?
>>>>
>>>>         Some function like vb2_dc_get_contiguous_size[1] always get
>>>> sg_dma_len(s) to compare instead of s->length. so it may break
>>>> unexpectedly while sg_dma_len(s) is not same with s->length.
>>>
>>>
>>>
>>> This is just tweaking the faked-up length that we hand off to iommu_map_sg()
>>> (see also the iova_align() above), to trick it into bumping this segment up
>>> to a suitable starting IOVA. The real length at this point is stashed in
>>> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
>>> both will hold the same true length once we return to the caller.
>>>
>>> Yes, it does mean that if you have a list where the segment lengths are page
>>> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
>>> still end up with a gap between the second and third segments, but that's
>>> fine because the DMA API offers no guarantees about what the resulting DMA
>>> addresses will be (consider the no-IOMMU case where they would each just be
>>> "mapped" to their physical address). If that breaks v4l, then it's probably
>>> v4l's DMA API use that needs looking at (again).
>>
>>
>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
>> memory pages into a contiguous block in device memory address space.
>> This would allow passing a dma mapped buffer to device dma using just
>> a device address and length.
>
>
> Not at all. The streaming DMA API (dma_map_* and friends) has two responsibilities: performing any necessary cache maintenance to ensure the device will correctly see data from the CPU, and the CPU will correctly see data from the device; and working out an address for that buffer from the device's point of view to actually hand off to the hardware (which is perfectly well allowed to fail).
>
> Consider SWIOTLB's implementation - segments which already lie at physical addresses within the device's DMA mask just get passed through, while those that lie outside it get mapped into the bounce buffer, but still as individual allocations (arch code just handles cache maintenance on the resulting physical addresses and can apply any hard-wired DMA offset for the device concerned).
>
>> IIUC, the change above breaks this model by inserting gaps in how the
>> buffer is mapped to device memory, such that the buffer is no longer
>> contiguous in dma address space.
>
>
> Even the existing arch/arm IOMMU DMA code which I guess this implicitly relies on doesn't guarantee that behaviour - if the mapping happens to reach one of the segment length/boundary limits it won't just leave a gap, it'll start an entirely new IOVA allocation which could well start at a wildly different address[0].
>
>> Here is the code in question from
>> drivers/media/v4l2-core/videobuf2-dma-contig.c :
>>
>> static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt)
>> {
>>          struct scatterlist *s;
>>          dma_addr_t expected = sg_dma_address(sgt->sgl);
>>          unsigned int i;
>>          unsigned long size = 0;
>>
>>          for_each_sg(sgt->sgl, s, sgt->nents, i) {
>>                  if (sg_dma_address(s) != expected)
>>                          break;
>>                  expected = sg_dma_address(s) + sg_dma_len(s);
>>                  size += sg_dma_len(s);
>>          }
>>          return size;
>> }
>>
>>
>> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
>>          unsigned long size, enum dma_data_direction dma_dir)
>> {
>>          struct vb2_dc_conf *conf = alloc_ctx;
>>          struct vb2_dc_buf *buf;
>>          struct frame_vector *vec;
>>          unsigned long offset;
>>          int n_pages, i;
>>          int ret = 0;
>>          struct sg_table *sgt;
>>          unsigned long contig_size;
>>          unsigned long dma_align = dma_get_cache_alignment();
>>          DEFINE_DMA_ATTRS(attrs);
>>
>>          dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs);
>>
>>          buf = kzalloc(sizeof *buf, GFP_KERNEL);
>>          buf->dma_dir = dma_dir;
>>
>>          offset = vaddr & ~PAGE_MASK;
>>          vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE);
>>          buf->vec = vec;
>>          n_pages = frame_vector_count(vec);
>>
>>          sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
>>
>>          ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages,
>>                  offset, size, GFP_KERNEL);
>>
>>          sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents,
>>                                        buf->dma_dir, &attrs);
>>
>>          contig_size = vb2_dc_get_contiguous_size(sgt);
>
>
> (as an aside, it's rather unintuitive that the handling of the dma_map_sg call actually failing is entirely implicit here)
>
>>          if (contig_size < size) {
>>
>>      <<<===   if the original buffer had sg entries that were not
>> aligned on the "natural" alignment for their size, the new arm64 iommu
>> core code inserts  a 'gap' in the iommu mapping, which causes
>> vb2_dc_get_contiguous_size() to exit early (and return a smaller size
>> than expected).
>>
>>                  pr_err("contiguous mapping is too small %lu/%lu\n",
>>                          contig_size, size);
>>                  ret = -EFAULT;
>>                  goto fail_map_sg;
>>          }
>>
>>
>> So, is the videobuf2-dma-contig.c based on an incorrect assumption
>> about how the DMA API is supposed to work?
>> Is it even possible to map a "contiguous-in-iova-range" mapping for a
>> buffer given as an sg_table with an arbitrary set of pages?
>
>
> From the Streaming DMA mappings section of Documentation/DMA-API.txt:
>
>   Note also that the above constraints on physical contiguity and
>   dma_mask may not apply if the platform has an IOMMU (a device which
>   maps an I/O DMA address to a physical memory address).  However, to be
>   portable, device driver writers may *not* assume that such an IOMMU
>   exists.
>
> There's not strictly any harm in using the DMA API this way and *hoping* you get what you want, as long as you're happy for it to fail pretty much 100% of the time on some systems, and still in a minority of corner cases on any system. However, if there's a real dependency on IOMMUs and tight control of IOVA allocation here, then the DMA API isn't really the right tool for the job, and maybe it's time to start looking to how to better fit these multimedia-subsystem-type use cases into the IOMMU API - as far as I understand it there's at least some conceptual overlap with the HSA PASID stuff being prototyped in PCI/x86-land at the moment, so it could be an apposite time to try and bang out some common requirements.
>
> Robin.
>
> [0]:http://article.gmane.org/gmane.linux.kernel.iommu/11185
>
>>
>> Thanks for helping to move this forward.
>>
>> -Dan
>>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-11-02 13:11                 ` Daniel Kurtz
@ 2015-11-02 13:43                   ` Tomasz Figa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tomasz Figa @ 2015-11-02 13:43 UTC (permalink / raw)
  To: Daniel Kurtz
  Cc: Robin Murphy, Lin PoChun, linux-arm-kernel, Yingjoe Chen,
	Will Deacon, linux-media, Thierry Reding,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak,
	Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas,
	Russell King, linux-mediatek

On Mon, Nov 2, 2015 at 10:11 PM, Daniel Kurtz <djkurtz@chromium.org> wrote:
>
> +Tomasz, so he can reply to the thread
> +Marek and Russell as recommended by Tomasz
>
> On Oct 30, 2015 22:27, "Robin Murphy" <robin.murphy@arm.com> wrote:
> >
> > Hi Dan,
> >
> > On 30/10/15 01:17, Daniel Kurtz wrote:
> >>
> >> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
> >> v4l2-contig's usage of the DMA API.
> >>
> >> Hi Robin,
> >>
> >> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> >>>
> >>> On 26/10/15 13:44, Yong Wu wrote:
> >>>>
> >>>>
> >>>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
> >>>> [...]
> >>>>>
> >>>>>
> >>>>> +/*
> >>>>> + * The DMA API client is passing in a scatterlist which could describe
> >>>>> + * any old buffer layout, but the IOMMU API requires everything to be
> >>>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of
> >>>>> + * impedance-matching, to be able to hand off a suitably-aligned list,
> >>>>> + * but still preserve the original offsets and sizes for the caller.
> >>>>> + */
> >>>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >>>>> +               int nents, int prot)
> >>>>> +{
> >>>>> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> >>>>> +       struct iova_domain *iovad = domain->iova_cookie;
> >>>>> +       struct iova *iova;
> >>>>> +       struct scatterlist *s, *prev = NULL;
> >>>>> +       dma_addr_t dma_addr;
> >>>>> +       size_t iova_len = 0;
> >>>>> +       int i;
> >>>>> +
> >>>>> +       /*
> >>>>> +        * Work out how much IOVA space we need, and align the segments
> >>>>> to
> >>>>> +        * IOVA granules for the IOMMU driver to handle. With some clever
> >>>>> +        * trickery we can modify the list in-place, but reversibly, by
> >>>>> +        * hiding the original data in the as-yet-unused DMA fields.
> >>>>> +        */
> >>>>> +       for_each_sg(sg, s, nents, i) {
> >>>>> +               size_t s_offset = iova_offset(iovad, s->offset);
> >>>>> +               size_t s_length = s->length;
> >>>>> +
> >>>>> +               sg_dma_address(s) = s->offset;
> >>>>> +               sg_dma_len(s) = s_length;
> >>>>> +               s->offset -= s_offset;
> >>>>> +               s_length = iova_align(iovad, s_length + s_offset);
> >>>>> +               s->length = s_length;
> >>>>> +
> >>>>> +               /*
> >>>>> +                * The simple way to avoid the rare case of a segment
> >>>>> +                * crossing the boundary mask is to pad the previous one
> >>>>> +                * to end at a naturally-aligned IOVA for this one's
> >>>>> size,
> >>>>> +                * at the cost of potentially over-allocating a little.

I'd like to know what the boundary mask is and what hardware imposes
requirements like this. The cost here is not only over-allocating a
little, but also making many buffers that are contiguously mappable on
the CPU unmappable contiguously in the IOMMU, which just defeats the
purpose of having an IOMMU. I believe it should be there precisely so
that simple IP blocks taking one DMA address can view the buffer the
same way as the CPU.

> >>>>> +                */
> >>>>> +               if (prev) {
> >>>>> +                       size_t pad_len = roundup_pow_of_two(s_length);
> >>>>> +
> >>>>> +                       pad_len = (pad_len - iova_len) & (pad_len - 1);
> >>>>> +                       prev->length += pad_len;
> >>>>
> >>>>
> >>>>
> >>>> Hi Robin,
> >>>>         While our v4l2 testing, It seems that we met a problem here.
> >>>>         Here we update prev->length again, Do we need update
> >>>> sg_dma_len(prev) again too?
> >>>>
> >>>>         Some function like vb2_dc_get_contiguous_size[1] always get
> >>>> sg_dma_len(s) to compare instead of s->length. so it may break
> >>>> unexpectedly while sg_dma_len(s) is not same with s->length.
> >>>
> >>>
> >>>
> >>> This is just tweaking the faked-up length that we hand off to iommu_map_sg()
> >>> (see also the iova_align() above), to trick it into bumping this segment up
> >>> to a suitable starting IOVA. The real length at this point is stashed in
> >>> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
> >>> both will hold the same true length once we return to the caller.
> >>>
> >>> Yes, it does mean that if you have a list where the segment lengths are page
> >>> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
> >>> still end up with a gap between the second and third segments, but that's
> >>> fine because the DMA API offers no guarantees about what the resulting DMA
> >>> addresses will be (consider the no-IOMMU case where they would each just be
> >>> "mapped" to their physical address). If that breaks v4l, then it's probably
> >>> v4l's DMA API use that needs looking at (again).
> >>
> >>
> >> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
> >> memory pages into a contiguous block in device memory address space.
> >> This would allow passing a dma mapped buffer to device dma using just
> >> a device address and length.
> >
> >
> > Not at all. The streaming DMA API (dma_map_* and friends) has two responsibilities: performing any necessary cache maintenance to ensure the device will correctly see data from the CPU, and the CPU will correctly see data from the device; and working out an address for that buffer from the device's point of view to actually hand off to the hardware (which is perfectly well allowed to fail).

Agreed. The dma_map_*() API is not guaranteed to return a single
contiguous part of I/O virtual address space for any given SG list.
However, it was understood to be able to map buffers that are
contiguously mappable by the CPU into a single segment, and users,
videobuf2-dma-contig in particular, relied on this.

> >
> > Consider SWIOTLB's implementation - segments which already lie at physical addresses within the device's DMA mask just get passed through, while those that lie outside it get mapped into the bounce buffer, but still as individual allocations (arch code just handles cache maintenance on the resulting physical addresses and can apply any hard-wired DMA offset for the device concerned).

And this is fine for vb2-dma-contig, which was made for devices that
require buffers contiguous in their address space. Without an IOMMU it
will allow only physically contiguous buffers and fail otherwise, which
is fine, because that is a hardware requirement.
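
(What "physically contiguous" means for the userptr page list is simply
that the PFNs are consecutive; a rough sketch of such a check, assuming
pages/n_pages come from frame_vector_pages()/frame_vector_count():)

        int i;

        for (i = 1; i < n_pages; i++)
                if (page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1)
                        return -EINVAL;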

> >
> >> IIUC, the change above breaks this model by inserting gaps in how the
> >> buffer is mapped to device memory, such that the buffer is no longer
> >> contiguous in dma address space.
> >
> >
> > Even the existing arch/arm IOMMU DMA code which I guess this implicitly relies on doesn't guarantee that behaviour - if the mapping happens to reach one of the segment length/boundary limits it won't just leave a gap, it'll start an entirely new IOVA allocation which could well start at a wildly different address[0].

Could you explain the segment length/boundary limits and when buffers
can reach them? Sorry, I haven't been following all the discussions, but
I'm not aware of any similar requirements in the IOMMU hardware I have
worked with.

> >
> >> Here is the code in question from
> >> drivers/media/v4l2-core/videobuf2-dma-contig.c :
> >>
> >> static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt)
> >> {
> >>          struct scatterlist *s;
> >>          dma_addr_t expected = sg_dma_address(sgt->sgl);
> >>          unsigned int i;
> >>          unsigned long size = 0;
> >>
> >>          for_each_sg(sgt->sgl, s, sgt->nents, i) {
> >>                  if (sg_dma_address(s) != expected)
> >>                          break;
> >>                  expected = sg_dma_address(s) + sg_dma_len(s);
> >>                  size += sg_dma_len(s);
> >>          }
> >>          return size;
> >> }
> >>
> >>
> >> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
> >>          unsigned long size, enum dma_data_direction dma_dir)
> >> {
> >>          struct vb2_dc_conf *conf = alloc_ctx;
> >>          struct vb2_dc_buf *buf;
> >>          struct frame_vector *vec;
> >>          unsigned long offset;
> >>          int n_pages, i;
> >>          int ret = 0;
> >>          struct sg_table *sgt;
> >>          unsigned long contig_size;
> >>          unsigned long dma_align = dma_get_cache_alignment();
> >>          DEFINE_DMA_ATTRS(attrs);
> >>
> >>          dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs);
> >>
> >>          buf = kzalloc(sizeof *buf, GFP_KERNEL);
> >>          buf->dma_dir = dma_dir;
> >>
> >>          offset = vaddr & ~PAGE_MASK;
> >>          vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE);
> >>          buf->vec = vec;
> >>          n_pages = frame_vector_count(vec);
> >>
> >>          sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
> >>
> >>          ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages,
> >>                  offset, size, GFP_KERNEL);
> >>
> >>          sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents,
> >>                                        buf->dma_dir, &attrs);
> >>
> >>          contig_size = vb2_dc_get_contiguous_size(sgt);
> >
> >
> > (as an aside, it's rather unintuitive that the handling of the dma_map_sg call actually failing is entirely implicit here)

I'm not sure what you mean; please elaborate. The code considers only
the case of contiguously mapping at least the requested size to be a
success, because anything else is useless with the hardware.

> >
> >>          if (contig_size < size) {
> >>
> >>      <<<===   if the original buffer had sg entries that were not
> >> aligned on the "natural" alignment for their size, the new arm64 iommu
> >> core code inserts  a 'gap' in the iommu mapping, which causes
> >> vb2_dc_get_contiguous_size() to exit early (and return a smaller size
> >> than expected).
> >>
> >>                  pr_err("contiguous mapping is too small %lu/%lu\n",
> >>                          contig_size, size);
> >>                  ret = -EFAULT;
> >>                  goto fail_map_sg;
> >>          }
> >>
> >>
> >> So, is the videobuf2-dma-contig.c based on an incorrect assumption
> >> about how the DMA API is supposed to work?
> >> Is it even possible to map a "contiguous-in-iova-range" mapping for a
> >> buffer given as an sg_table with an arbitrary set of pages?
> >
> >
> > From the Streaming DMA mappings section of Documentation/DMA-API.txt:
> >
> >   Note also that the above constraints on physical contiguity and
> >   dma_mask may not apply if the platform has an IOMMU (a device which
> >   maps an I/O DMA address to a physical memory address).  However, to be
> >   portable, device driver writers may *not* assume that such an IOMMU
> >   exists.
> >
> > There's not strictly any harm in using the DMA API this way and *hoping* you get what you want, as long as you're happy for it to fail pretty much 100% of the time on some systems, and still in a minority of corner cases on any system.

Could you please elaborate? I'd like to see examples, because I can't
really imagine buffers that are mappable contiguously on the CPU but not
through an IOMMU. Also, as I said, the hardware I have worked with didn't
suffer from problems like this.

> > However, if there's a real dependency on IOMMUs and tight control of IOVA allocation here, then the DMA API isn't really the right tool for the job, and maybe it's time to start looking to how to better fit these multimedia-subsystem-type use cases into the IOMMU API - as far as I understand it there's at least some conceptual overlap with the HSA PASID stuff being prototyped in PCI/x86-land at the moment, so it could be an apposite time to try and bang out some common requirements.

The DMA API is actually the only good tool to use here to keep the
videobuf2-dma-contig code away from knowledge of platform-specific
details, e.g. the presence of an IOMMU. The only thing it knows is that
the target hardware requires a single contiguous buffer, and it relies
on the fact that in correct cases the buffer given to it will meet this
requirement (i.e. physically contiguous without an IOMMU; CPU-mappable
with an IOMMU).

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-11-02 13:43                   ` Tomasz Figa
@ 2015-11-03 17:41                     ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-11-03 17:41 UTC (permalink / raw)
  To: Tomasz Figa, Daniel Kurtz
  Cc: Lin PoChun, linux-arm-kernel, Yingjoe Chen, Will Deacon,
	linux-media, Thierry Reding, open list:IOMMU DRIVERS,
	Bobby Batacharia (via Google Docs),
	Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak,
	Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas,
	Russell King, linux-mediatek

Hi Tomasz,

On 02/11/15 13:43, Tomasz Figa wrote:
> I'd like to know what is the boundary mask and what hardware imposes
> requirements like this. The cost here is not only over-allocating a
> little, but making many, many buffers contiguously mappable on the
> CPU, unmappable contiguously in IOMMU, which just defeats the purpose
> of having an IOMMU, which I believe should be there for simple IP
> blocks taking one DMA address to be able to view the buffer the same
> way as the CPU.

The expectation with dma_map_sg() is that you're either going to be 
iterating over the buffer segments, handing off each address to the 
device to process one by one; or you have a scatter-gather-capable 
device, in which case you hand off the whole list at once. It's in the 
latter case where you have to make sure the list doesn't exceed the 
hardware limitations of that device. I believe the original concern was 
disk controllers (the introduction of dma_parms seems to originate from 
the linux-scsi list), but most scatter-gather engines are going to have 
some limit on how much they can handle per entry (IMO the dmaengine 
drivers are the easiest example to look at).
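
To make that distinction concrete, here is a minimal sketch (not taken from
the patch) of the first pattern, where each mapped segment is handed to the
device in turn; queue_segment() is a hypothetical placeholder for the
device-specific part:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Illustrative sketch only: the per-segment usage pattern dma_map_sg()
 * is designed around. queue_segment() stands in for whatever the real
 * driver does with a bus address/length pair.
 */
static int sketch_map_and_queue(struct device *dev, struct scatterlist *sgl,
                                int nents,
                                void (*queue_segment)(dma_addr_t addr,
                                                      unsigned int len))
{
        struct scatterlist *sg;
        int i, count;

        count = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
        if (!count)
                return -ENOMEM;

        /* Walk only the mapped entries: 'count' may be less than 'nents'. */
        for_each_sg(sgl, sg, count, i)
                queue_segment(sg_dma_address(sg), sg_dma_len(sg));

        /* Unmapping always uses the original nents. */
        dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
        return 0;
}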

Segment boundaries are a little more arcane, but my assumption is that 
they relate to the kind of devices whose addressing is not flat but 
relative to some separate segment register (The "64-bit" mode of USB 
EHCI is one concrete example I can think of) - since you cannot 
realistically change the segment register while the device is in the 
middle of accessing a single buffer entry, that entry must not fall 
across a segment boundary or at some point the device's accesses are 
going to overflow the offset address bits and wrap around to bogus 
addresses at the bottom of the segment.
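
For reference, these limits live in the device's dma_parms and are exactly
what dma_map_sg() implementations are expected to honour. A rough sketch of
how a driver might advertise them - the 64KB and 4GB values here are made up
purely for illustration:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/sizes.h>
#include <linux/slab.h>

/*
 * Illustrative sketch: a driver advertising its scatter-gather limits
 * so dma_map_sg() never produces a segment the device cannot address.
 * (On many buses the dma_parms structure is allocated by bus code.)
 */
static int sketch_set_dma_limits(struct device *dev)
{
        if (!dev->dma_parms) {
                dev->dma_parms = devm_kzalloc(dev, sizeof(*dev->dma_parms),
                                              GFP_KERNEL);
                if (!dev->dma_parms)
                        return -ENOMEM;
        }

        /* No single segment longer than 64KB... */
        dma_set_max_seg_size(dev, SZ_64K);
        /* ...and no segment may cross a 4GB boundary. */
        dma_set_seg_boundary(dev, DMA_BIT_MASK(32));
        return 0;
}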

Now yes, it will be possible under _most_ circumstances to use an IOMMU 
to lay out a list of segments with page-aligned lengths within a single 
IOVA allocation whilst still meeting all the necessary constraints. It 
just needs some unavoidably complicated calculations - quite likely 
significantly more complex than my v5 version of map_sg(), which tried 
to do that and merge segments but failed to take the initial alignment 
into account properly. Since there are much simpler ways to enforce just 
the _necessary_ behaviour for the DMA API, I've put the complicated 
stuff to one side for now to prevent it holding up the basic functional 
support.

>>>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
>>>> memory pages into a contiguous block in device memory address space.
>>>> This would allow passing a dma mapped buffer to device dma using just
>>>> a device address and length.
>>>
>>>
>>> Not at all. The streaming DMA API (dma_map_* and friends) has two responsibilities: performing any necessary cache maintenance to ensure the device will correctly see data from the CPU, and the CPU will correctly see data from the device; and working out an address for that buffer from the device's point of view to actually hand off to the hardware (which is perfectly well allowed to fail).
>
> Agreed. The dma_map_*() API is not guaranteed to return a single
> contiguous part of virtual address space for any given SG list.
> However it was understood to be able to map buffers contiguously
> mappable by the CPU into a single segment and users,
> videobuf2-dma-contig in particular, relied on this.

I don't follow that - _any_ buffer made of page-sized chunks is going to 
be mappable contiguously by the CPU; it's clearly impossible for the 
streaming DMA API itself to offer such a guarantee, because it's 
entirely orthogonal to the presence or otherwise of an IOMMU.

Furthermore, I can't see any existing dma_map_sg implementation (between 
arm/64 and x86, at least) that _won't_ break that expectation under 
certain conditions (ranging from "relatively pathological" to "always"), 
so it still seems questionable to have a dependency on it.

>>> Consider SWIOTLB's implementation - segments which already lie at physical addresses within the device's DMA mask just get passed through, while those that lie outside it get mapped into the bounce buffer, but still as individual allocations (arch code just handles cache maintenance on the resulting physical addresses and can apply any hard-wired DMA offset for the device concerned).
>
> And this is fine for vb2-dma-contig, which was made for devices that
> require buffers contiguous in its address space. Without IOMMU it will
> allow only physically contiguous buffers and fails otherwise, which is
> fine, because it's a hardware requirement.

If it depends on having contiguous-from-the-device's-view DMA buffers 
either way, that's a sign it should perhaps be using the coherent DMA 
API instead, which _does_ give such a guarantee. I'm well aware of the 
"but the noncacheable mappings make userspace access unacceptably slow!" 
issue many folks have with that, though, and don't particularly fancy 
going off on that tangent here.

>>>> IIUC, the change above breaks this model by inserting gaps in how the
>>>> buffer is mapped to device memory, such that the buffer is no longer
>>>> contiguous in dma address space.
>>>
>>>
>>> Even the existing arch/arm IOMMU DMA code which I guess this implicitly relies on doesn't guarantee that behaviour - if the mapping happens to reach one of the segment length/boundary limits it won't just leave a gap, it'll start an entirely new IOVA allocation which could well start at a wildly different address[0].
>
> Could you explain segment length/boundary limits and when buffers can
> reach them? Sorry, i haven't been following all the discussions, but
> I'm not aware of any similar requirements of the IOMMU hardware I
> worked with.

I hope the explanation at the top makes sense - it's purely about the 
requirements of the DMA master device itself, nothing to do with the 
IOMMU (or lack of) in the middle. Devices with scatter-gather DMA 
limitations exist, therefore the API for scatter-gather DMA is designed 
to represent and respect such limitations.

>>>> Here is the code in question from
>>>> drivers/media/v4l2-core/videobuf2-dma-contig.c :
[...]
>>>> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
>>>>           unsigned long size, enum dma_data_direction dma_dir)
>>>> {
[...]
>>>>           sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents,
>>>>                                         buf->dma_dir, &attrs);
>>>>
>>>>           contig_size = vb2_dc_get_contiguous_size(sgt);
>>>
>>>
>>> (as an aside, it's rather unintuitive that the handling of the dma_map_sg call actually failing is entirely implicit here)
>
> I'm not sure what you mean, please elaborate. The code considers only
> the case of contiguously mapping at least the requested size as a
> success, because anything else is useless with the hardware.

My bad; having now compared against the actual file I see this is just a 
cherry-picking of relevant lines with all the error checking stripped 
out. Objection withdrawn ;)

>>>> So, is the videobuf2-dma-contig.c based on an incorrect assumption
>>>> about how the DMA API is supposed to work?
>>>> Is it even possible to map a "contiguous-in-iova-range" mapping for a
>>>> buffer given as an sg_table with an arbitrary set of pages?
>>>
>>>
>>>  From the Streaming DMA mappings section of Documentation/DMA-API.txt:
>>>
>>>    Note also that the above constraints on physical contiguity and
>>>    dma_mask may not apply if the platform has an IOMMU (a device which
>>>    maps an I/O DMA address to a physical memory address).  However, to be
>>>    portable, device driver writers may *not* assume that such an IOMMU
>>>    exists.
>>>
>>> There's not strictly any harm in using the DMA API this way and *hoping* you get what you want, as long as you're happy for it to fail pretty much 100% of the time on some systems, and still in a minority of corner cases on any system.
>
> Could you please elaborate? I'd like to see examples, because I can't
> really imagine buffers mappable contiguously on CPU, but not on IOMMU.
> Also, as I said, the hardware I worked with didn't suffer from
> problems like this.

"...device driver writers may *not* assume that such an IOMMU exists."

>>> However, if there's a real dependency on IOMMUs and tight control of IOVA allocation here, then the DMA API isn't really the right tool for the job, and maybe it's time to start looking to how to better fit these multimedia-subsystem-type use cases into the IOMMU API - as far as I understand it there's at least some conceptual overlap with the HSA PASID stuff being prototyped in PCI/x86-land at the moment, so it could be an apposite time to try and bang out some common requirements.
>
> The DMA API is actually the only good tool to use here to keep the
> videobuf2-dma-contig code away from the knowledge about platform
> specific data, e.g. presence of IOMMU. The only thing it knows is that
> the target hardware requires a single contiguous buffer and it relies
> on the fact that in correct cases the buffer given to it will meet
> this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable
> with IOMMU).

As above; the DMA API guarantees only what the DMA API guarantees. An 
IOMMU-based implementation of streaming DMA is free to identity-map 
pages if it only cares about device isolation; a non-IOMMU 
implementation is free to provide streaming DMA remapping via some 
elaborate bounce-buffering scheme if it really wants to. GART-type 
IOMMUs... let's not even go there.

If v4l needs a guarantee of a single contiguous DMA buffer, then it 
needs to use dma_alloc_coherent() for that, not streaming mappings.
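
For comparison, a minimal sketch of the coherent-API alternative being
suggested here (size and error handling are left to the caller):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Sketch: the coherent DMA API is the one that actually guarantees a
 * buffer contiguous from the device's point of view - one dma_addr_t
 * covering the whole allocation - at the cost of a non-cacheable CPU
 * mapping on many platforms. Freed again with dma_free_coherent().
 */
static void *sketch_alloc_device_contiguous(struct device *dev, size_t size,
                                            dma_addr_t *dma)
{
        return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
}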

Robin.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-11-03 17:41                     ` Robin Murphy
@ 2015-11-03 18:40                       ` Russell King - ARM Linux
  -1 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2015-11-03 18:40 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Tomasz Figa, Daniel Kurtz, Lin PoChun, linux-arm-kernel,
	Yingjoe Chen, Will Deacon, linux-media, Thierry Reding,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak,
	Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas,
	linux-mediatek

On Tue, Nov 03, 2015 at 05:41:24PM +0000, Robin Murphy wrote:
> Hi Tomasz,
> 
> On 02/11/15 13:43, Tomasz Figa wrote:
> >Agreed. The dma_map_*() API is not guaranteed to return a single
> >contiguous part of virtual address space for any given SG list.
> >However it was understood to be able to map buffers contiguously
> >mappable by the CPU into a single segment and users,
> >videobuf2-dma-contig in particular, relied on this.
> 
> I don't follow that - _any_ buffer made of page-sized chunks is going to be
> mappable contiguously by the CPU; it's clearly impossible for the streaming
> DMA API itself to offer such a guarantee, because it's entirely orthogonal
> to the presence or otherwise of an IOMMU.

Tomasz's use of "virtual address space" above in combination with the
DMA API is really confusing.

dma_map_sg() does *not* construct a CPU view of the passed scatterlist.
The only thing dma_map_sg() might do with virtual addresses is to use
them as a way to achieve cache coherence for one particular view of
that memory, that being the kernel's own lowmem mapping and any kmaps.
It doesn't extend to vmalloc() or userspace mappings of the memory.

If the scatterlist is converted to an array of struct page pointers,
it's possible to map it with vmap(), but it's implementation-defined
whether such a mapping will receive cache maintenance as part of the
DMA API or not.  (If you have PIPT caches, it will; if they're VIPT
caches, maybe not.)

There is a separate set of calls to deal with the flushing issues for
vmap()'d memory in this case - see flush_kernel_vmap_range() and
invalidate_kernel_vmap_range().
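
To illustrate, a rough sketch (not from any particular driver) of the extra
maintenance a vmap() alias needs when the device has just written the pages;
the page array and length are whatever the caller built the scatterlist from:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Sketch: CPU reads through a vmap() alias of pages a device has just
 * DMA'd into need their own cache maintenance - the streaming DMA API
 * only covers the kernel lowmem mapping of those pages.
 */
static void sketch_read_via_vmap(struct page **pages, unsigned int count,
                                 size_t len)
{
        void *vaddr = vmap(pages, count, VM_MAP, PAGE_KERNEL);

        if (!vaddr)
                return;

        /* Drop any stale lines for the alias before reading through it. */
        invalidate_kernel_vmap_range(vaddr, len);

        /* ... consume the data via vaddr ... */

        vunmap(vaddr);
}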

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-11-03 17:41                     ` Robin Murphy
@ 2015-11-04  5:12                       ` Tomasz Figa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tomasz Figa @ 2015-11-04  5:12 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Daniel Kurtz, Lin PoChun, linux-arm-kernel, Yingjoe Chen,
	Will Deacon, linux-media, Thierry Reding,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak,
	Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas,
	Russell King, linux-mediatek

On Wed, Nov 4, 2015 at 2:41 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi Tomasz,
>
> On 02/11/15 13:43, Tomasz Figa wrote:
>>
>> I'd like to know what is the boundary mask and what hardware imposes
>> requirements like this. The cost here is not only over-allocating a
>> little, but making many, many buffers contiguously mappable on the
>> CPU, unmappable contiguously in IOMMU, which just defeats the purpose
>> of having an IOMMU, which I believe should be there for simple IP
>> blocks taking one DMA address to be able to view the buffer the same
>> way as the CPU.
>
>
> The expectation with dma_map_sg() is that you're either going to be
> iterating over the buffer segments, handing off each address to the device
> to process one by one;

My understanding of a scatterlist was that it represents a buffer as a
whole, by joining together its physically discontinuous segments.

I don't see how single segments (the layout of which is completely up
to the allocator; often just single pages) would be usable for hardware
that needs to do anything more serious than just writing a byte stream
continuously to subsequent buffers. In the case of such simple devices
you don't even need an IOMMU (for purposes other than protection and/or
getting over address space limitations).

However, IMHO the most important use case of an IOMMU is to make
buffers, which are contiguous in CPU virtual address space (VA),
contiguous in device's address space (IOVA). Your implementation of
dma_map_sg() effectively breaks this ability, so I'm not really
following why it's located under drivers/iommu and supposed to be used
with IOMMU-enabled platforms...

> or you have a scatter-gather-capable device, in which
> case you hand off the whole list at once.

No need for mapping ability of the IOMMU here as well (except for
working around address space issues, as I mentioned above).

> It's in the latter case where you
> have to make sure the list doesn't exceed the hardware limitations of that
> device. I believe the original concern was disk controllers (the
> introduction of dma_parms seems to originate from the linux-scsi list), but
> most scatter-gather engines are going to have some limit on how much they
> can handle per entry (IMO the dmaengine drivers are the easiest example to
> look at).
>
> Segment boundaries are a little more arcane, but my assumption is that they
> relate to the kind of devices whose addressing is not flat but relative to
> some separate segment register (The "64-bit" mode of USB EHCI is one
> concrete example I can think of) - since you cannot realistically change the
> segment register while the device is in the middle of accessing a single
> buffer entry, that entry must not fall across a segment boundary or at some
> point the device's accesses are going to overflow the offset address bits
> and wrap around to bogus addresses at the bottom of the segment.

The two requirements above sound like something really specific to
scatter-gather-capable hardware, which, as I pointed out above, barely
needs an IOMMU (at least its mapping capabilities). We are talking here
about very IOMMU-specific code, though...

Now, while I see that on some systems an IOMMU might be used for
improving protection and working around addressing issues with
SG-capable hardware, the code shouldn't be breaking the majority of
systems, where the IOMMU is the only possible way to make physically
discontiguous buffers appear (IO-virtually) contiguous to devices
incapable of scatter-gather.

>
> Now yes, it will be possible under _most_ circumstances to use an IOMMU to
> lay out a list of segments with page-aligned lengths within a single IOVA
> allocation whilst still meeting all the necessary constraints. It just needs
> some unavoidably complicated calculations - quite likely significantly more
> complex than my v5 version of map_sg() that tried to do that and merge
> segments but failed to take the initial alignment into account properly -
> since there are much simpler ways to enforce just the _necessary_ behaviour
> for the DMA API, I put the complicated stuff to one side for now to prevent
> it holding up getting the basic functional support in place.

Somehow, whatever is currently done in arch/arm/mm/dma-mapping.c was
sufficient and not overly complicated.

See http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 .

I can see that the code there at least tries to comply with the maximum
segment size constraint. The segment boundary seems to be ignored,
though. However, I'm convinced that in most (if not all) cases where an
IOVA-contiguous IOMMU mapping is needed, those two constraints don't
exist. Do we really have to break the good hardware only because the
bad^Wlimited one is broken?

Couldn't we preserve the ARM-like behavior whenever
dma_parms->segment_boundary_mask is set to all 1s and
dma_parms->max_segment_size to UINT_MAX (which is what drivers
currently set) or 0 (which sounds more logical as a way of saying "no
maximum given")?

>
>>>>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
>>>>> memory pages into a contiguous block in device memory address space.
>>>>> This would allow passing a dma mapped buffer to device dma using just
>>>>> a device address and length.
>>>>
>>>>
>>>>
>>>> Not at all. The streaming DMA API (dma_map_* and friends) has two
>>>> responsibilities: performing any necessary cache maintenance to ensure the
>>>> device will correctly see data from the CPU, and the CPU will correctly see
>>>> data from the device; and working out an address for that buffer from the
>>>> device's point of view to actually hand off to the hardware (which is
>>>> perfectly well allowed to fail).
>>
>>
>> Agreed. The dma_map_*() API is not guaranteed to return a single
>> contiguous part of virtual address space for any given SG list.
>> However it was understood to be able to map buffers contiguously
>> mappable by the CPU into a single segment and users,
>> videobuf2-dma-contig in particular, relied on this.
>
>
> I don't follow that - _any_ buffer made of page-sized chunks is going to be
> mappable contiguously by the CPU;'

Yes it is. Actually the last chunk might not even need to be
page-sized. However I believe we can have a scatterlist consisting of
non-page-sized chunks in the middle as well, which is obviously not
mappable in a contiguous way even for the CPU.

> it's clearly impossible for the streaming
> DMA API itself to offer such a guarantee, because it's entirely orthogonal
> to the presence or otherwise of an IOMMU.

But we are talking here about the very IOMMU-specific implementation of DMA API.

>
> Furthermore, I can't see any existing dma_map_sg implementation (between
> arm/64 and x86, at least), that _won't_ break that expectation under certain
> conditions (ranging from "relatively pathological" to "always"), so it still
> seems questionable to have a dependency on it.

The current implementation for arch/arm doesn't break that
expectation, as long as we fit inside the maximum segment size (which,
for most if not all of the hardware that actually requires such a
contiguous mapping to be created, is UINT_MAX).

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547

>
>>>> Consider SWIOTLB's implementation - segments which already lie at
>>>> physical addresses within the device's DMA mask just get passed through,
>>>> while those that lie outside it get mapped into the bounce buffer, but still
>>>> as individual allocations (arch code just handles cache maintenance on the
>>>> resulting physical addresses and can apply any hard-wired DMA offset for the
>>>> device concerned).
>>
>>
>> And this is fine for vb2-dma-contig, which was made for devices that
>> require buffers contiguous in its address space. Without IOMMU it will
>> allow only physically contiguous buffers and fails otherwise, which is
>> fine, because it's a hardware requirement.
>
>
> If it depends on having contiguous-from-the-device's-view DMA buffers either
> way, that's a sign it should perhaps be using the coherent DMA API instead,
> which _does_ give such a guarantee. I'm well aware of the "but the
> noncacheable mappings make userspace access unacceptably slow!" issue many
> folks have with that, though, and don't particularly fancy going off on that
> tangent here.

The keywords here are DMA-BUF and user pointer. Neither of these cases
can use the coherent DMA API, because the buffer is already allocated;
it just needs to be mapped into another device's (or its IOMMU's)
address space. Obviously we can't guarantee mappability of such
buffers, e.g. in the case of importing non-contiguous buffers to a
device without an IOMMU. However, we expect the pipelines to be sane
(physically contiguous buffers, or both devices IOMMU-enabled), so that
such things won't happen.

>
>>>>> IIUC, the change above breaks this model by inserting gaps in how the
>>>>> buffer is mapped to device memory, such that the buffer is no longer
>>>>> contiguous in dma address space.
>>>>
>>>>
>>>>
>>>> Even the existing arch/arm IOMMU DMA code which I guess this implicitly
>>>> relies on doesn't guarantee that behaviour - if the mapping happens to reach
>>>> one of the segment length/boundary limits it won't just leave a gap, it'll
>>>> start an entirely new IOVA allocation which could well start at a wildly
>>>> different address[0].
>>
>>
>> Could you explain segment length/boundary limits and when buffers can
>> reach them? Sorry, i haven't been following all the discussions, but
>> I'm not aware of any similar requirements of the IOMMU hardware I
>> worked with.
>
>
> I hope the explanation at the top makes sense - it's purely about the
> requirements of the DMA master device itself, nothing to do with the IOMMU
> (or lack of) in the middle. Devices with scatter-gather DMA limitations
> exist, therefore the API for scatter-gather DMA is designed to represent and
> respect such limitations.

Yes, it makes sense, thanks for the explanation. However there also
exist devices with no scatter-gather capability, but behind an IOMMU
without such fancy mapping limitations. I believe we should also
respect the limitation of such setups, which is the lack of support
for multiple IOVA segments.

>>>>> So, is the videobuf2-dma-contig.c based on an incorrect assumption
>>>>> about how the DMA API is supposed to work?
>>>>> Is it even possible to map a "contiguous-in-iova-range" mapping for a
>>>>> buffer given as an sg_table with an arbitrary set of pages?
>>>>
>>>>
>>>>
>>>>  From the Streaming DMA mappings section of Documentation/DMA-API.txt:
>>>>
>>>>    Note also that the above constraints on physical contiguity and
>>>>    dma_mask may not apply if the platform has an IOMMU (a device which
>>>>    maps an I/O DMA address to a physical memory address).  However, to
>>>> be
>>>>    portable, device driver writers may *not* assume that such an IOMMU
>>>>    exists.
>>>>
>>>> There's not strictly any harm in using the DMA API this way and *hoping*
>>>> you get what you want, as long as you're happy for it to fail pretty much
>>>> 100% of the time on some systems, and still in a minority of corner cases on
>>>> any system.
>>
>>
>> Could you please elaborate? I'd like to see examples, because I can't
>> really imagine buffers mappable contiguously on CPU, but not on IOMMU.
>> Also, as I said, the hardware I worked with didn't suffer from
>> problems like this.
>
>
> "...device driver writers may *not* assume that such an IOMMU exists."
>

And this is exactly why they _should_ use dma_map_sg(), because it was
supposed to work correctly for both physically contiguous (i.e. 1
segment) buffers and non-IOMMU-enabled devices, as well as with
non-contiguous (i.e. > 1 segment) buffers and IOMMU-enabled devices.
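
For context, the check that consumers like videobuf2-dma-contig rely on is
essentially the one quoted earlier in this thread; a condensed sketch of the
same idea:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Sketch modelled on vb2_dc_get_contiguous_size(): after dma_map_sg(),
 * measure how much of the buffer landed at consecutive DMA addresses.
 * A dma-contig style consumer proceeds only if this covers everything.
 */
static unsigned long sketch_contiguous_size(struct sg_table *sgt)
{
        struct scatterlist *s;
        dma_addr_t expected = sg_dma_address(sgt->sgl);
        unsigned long size = 0;
        unsigned int i;

        for_each_sg(sgt->sgl, s, sgt->nents, i) {
                if (sg_dma_address(s) != expected)
                        break;
                expected = sg_dma_address(s) + sg_dma_len(s);
                size += sg_dma_len(s);
        }
        return size;
}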

>>>> However, if there's a real dependency on IOMMUs and tight control of
>>>> IOVA allocation here, then the DMA API isn't really the right tool for the
>>>> job, and maybe it's time to start looking to how to better fit these
>>>> multimedia-subsystem-type use cases into the IOMMU API - as far as I
>>>> understand it there's at least some conceptual overlap with the HSA PASID
>>>> stuff being prototyped in PCI/x86-land at the moment, so it could be an
>>>> apposite time to try and bang out some common requirements.
>>
>>
>> The DMA API is actually the only good tool to use here to keep the
>> videobuf2-dma-contig code away from the knowledge about platform
>> specific data, e.g. presence of IOMMU. The only thing it knows is that
>> the target hardware requires a single contiguous buffer and it relies
>> on the fact that in correct cases the buffer given to it will meet
>> this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable
>> with IOMMU).
>
>
> As above; the DMA API guarantees only what the DMA API guarantees. An
> IOMMU-based implementation of streaming DMA is free to identity-map pages if
> it only cares about device isolation; a non-IOMMU implementation is free to
> provide streaming DMA remapping via some elaborate bounce-buffering scheme

I guess this is the area where our understandings of the IOMMU-backed
DMA API differ.

> if it really wants to. GART-type IOMMUs... let's not even go there.

I believe that's how the IOMMU-based implementation of the DMA API was
supposed to work when it was first implemented for ARM...

>
> If v4l needs a guarantee of a single contiguous DMA buffer, then it needs to
> use dma_alloc_coherent() for that, not streaming mappings.

Except that it can't use it, because the buffers are already allocated
by another entity.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
@ 2015-11-04  5:12                       ` Tomasz Figa
  0 siblings, 0 replies; 78+ messages in thread
From: Tomasz Figa @ 2015-11-04  5:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Nov 4, 2015 at 2:41 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi Tomasz,
>
> On 02/11/15 13:43, Tomasz Figa wrote:
>>
>> I'd like to know what is the boundary mask and what hardware imposes
>> requirements like this. The cost here is not only over-allocating a
>> little, but making many, many buffers contiguously mappable on the
>> CPU, unmappable contiguously in IOMMU, which just defeats the purpose
>> of having an IOMMU, which I believe should be there for simple IP
>> blocks taking one DMA address to be able to view the buffer the same
>> way as the CPU.
>
>
> The expectation with dma_map_sg() is that you're either going to be
> iterating over the buffer segments, handing off each address to the device
> to process one by one;

My understanding of a scatterlist was that it represents a buffer as a
whole, by joining together its physically discontinuous segments.

I don't see how single segments (the layout of which is completely up
to the allocator; often just single pages) would be usable for hardware
that needs to do anything more serious than just writing a byte stream
continuously to subsequent buffers. In the case of such simple devices
you don't even need an IOMMU (for purposes other than protection and/or
getting over address space limitations).

However, IMHO the most important use case of an IOMMU is to make
buffers, which are contiguous in CPU virtual address space (VA),
contiguous in device's address space (IOVA). Your implementation of
dma_map_sg() effectively breaks this ability, so I'm not really
following why it's located under drivers/iommu and supposed to be used
with IOMMU-enabled platforms...

> or you have a scatter-gather-capable device, in which
> case you hand off the whole list at once.

No need for mapping ability of the IOMMU here as well (except for
working around address space issues, as I mentioned above).

> It's in the latter case where you
> have to make sure the list doesn't exceed the hardware limitations of that
> device. I believe the original concern was disk controllers (the
> introduction of dma_parms seems to originate from the linux-scsi list), but
> most scatter-gather engines are going to have some limit on how much they
> can handle per entry (IMO the dmaengine drivers are the easiest example to
> look at).
>
> Segment boundaries are a little more arcane, but my assumption is that they
> relate to the kind of devices whose addressing is not flat but relative to
> some separate segment register (The "64-bit" mode of USB EHCI is one
> concrete example I can think of) - since you cannot realistically change the
> segment register while the device is in the middle of accessing a single
> buffer entry, that entry must not fall across a segment boundary or at some
> point the device's accesses are going to overflow the offset address bits
> and wrap around to bogus addresses at the bottom of the segment.

The two requirements above sound like something really specific to
scatter-gather-capable hardware, which, as I pointed out above, barely
needs an IOMMU (at least its mapping capabilities). We are talking here
about very IOMMU-specific code, though...

Now, while I see that on some systems an IOMMU might be used for
improving protection and working around addressing issues with
SG-capable hardware, the code shouldn't be breaking the majority of
systems, where the IOMMU is the only possible way to make physically
discontiguous buffers appear (IO-virtually) contiguous to devices
incapable of scatter-gather.

>
> Now yes, it will be possible under _most_ circumstances to use an IOMMU to
> lay out a list of segments with page-aligned lengths within a single IOVA
> allocation whilst still meeting all the necessary constraints. It just needs
> some unavoidably complicated calculations - quite likely significantly more
> complex than my v5 version of map_sg() that tried to do that and merge
> segments but failed to take the initial alignment into account properly -
> since there are much simpler ways to enforce just the _necessary_ behaviour
> for the DMA API, I put the complicated stuff to one side for now to prevent
> it holding up getting the basic functional support in place.

Somehow, whatever is currently done in arch/arm/mm/dma-mapping.c was
sufficient and not overly complicated.

See http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 .

I can see that the code there at least tries to comply with the maximum
segment size constraint. The segment boundary seems to be ignored,
though. However, I'm convinced that in most (if not all) cases where an
IOVA-contiguous IOMMU mapping is needed, those two constraints don't
exist. Do we really have to break the good hardware only because the
bad^Wlimited one is broken?

Couldn't we preserve the ARM-like behavior whenever
dma_parms->segment_boundary_mask is set to all 1s and
dma_parms->max_segment_size to UINT_MAX (which is what drivers
currently set) or 0 (which sounds more logical as a way of saying "no
maximum given")?

>
>>>>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
>>>>> memory pages into a contiguous block in device memory address space.
>>>>> This would allow passing a dma mapped buffer to device dma using just
>>>>> a device address and length.
>>>>
>>>>
>>>>
>>>> Not at all. The streaming DMA API (dma_map_* and friends) has two
>>>> responsibilities: performing any necessary cache maintenance to ensure the
>>>> device will correctly see data from the CPU, and the CPU will correctly see
>>>> data from the device; and working out an address for that buffer from the
>>>> device's point of view to actually hand off to the hardware (which is
>>>> perfectly well allowed to fail).
>>
>>
>> Agreed. The dma_map_*() API is not guaranteed to return a single
>> contiguous part of virtual address space for any given SG list.
>> However it was understood to be able to map buffers contiguously
>> mappable by the CPU into a single segment and users,
>> videobuf2-dma-contig in particular, relied on this.
>
>
> I don't follow that - _any_ buffer made of page-sized chunks is going to be
> mappable contiguously by the CPU;'

Yes it is. Actually the last chunk might not even need to be
page-sized. However I believe we can have a scatterlist consisting of
non-page-sized chunks in the middle as well, which is obviously not
mappable in a contiguous way even for the CPU.

> it's clearly impossible for the streaming
> DMA API itself to offer such a guarantee, because it's entirely orthogonal
> to the presence or otherwise of an IOMMU.

But we are talking here about the very IOMMU-specific implementation
of the DMA API.

>
> Furthermore, I can't see any existing dma_map_sg implementation (between
> arm/64 and x86, at least), that _won't_ break that expectation under certain
> conditions (ranging from "relatively pathological" to "always"), so it still
> seems questionable to have a dependency on it.

The current implementation for arch/arm doesn't break that
expectation, as long as we fit inside the maximum segment size (which,
for most if not all of the hardware that actually requires such a
contiguous mapping to be created, is UINT_MAX).

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547

>
>>>> Consider SWIOTLB's implementation - segments which already lie at
>>>> physical addresses within the device's DMA mask just get passed through,
>>>> while those that lie outside it get mapped into the bounce buffer, but still
>>>> as individual allocations (arch code just handles cache maintenance on the
>>>> resulting physical addresses and can apply any hard-wired DMA offset for the
>>>> device concerned).
>>
>>
>> And this is fine for vb2-dma-contig, which was made for devices that
>> require buffers contiguous in its address space. Without IOMMU it will
>> allow only physically contiguous buffers and fails otherwise, which is
>> fine, because it's a hardware requirement.
>
>
> If it depends on having contiguous-from-the-device's-view DMA buffers either
> way, that's a sign it should perhaps be using the coherent DMA API instead,
> which _does_ give such a guarantee. I'm well aware of the "but the
> noncacheable mappings make userspace access unacceptably slow!" issue many
> folks have with that, though, and don't particularly fancy going off on that
> tangent here.

The keywords here are DMA-BUF and user pointer. Neither of these cases
can use the coherent DMA API, because the buffer is already allocated,
so it just needs to be mapped into another device's (or its IOMMU's)
address space. Obviously we can't guarantee mappability of such
buffers, e.g. in the case of importing non-contiguous buffers into a
device without an IOMMU. However, we expect the pipelines to be sane
(physically contiguous buffers, or both devices IOMMU-enabled), so that
such things won't happen.
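
For concreteness, the import path described above is roughly the
hypothetical snippet below (sketch_import() and its policy are made up;
dma_buf_map_attachment()/dma_buf_unmap_attachment() are the existing
DMA-BUF API):

#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

static int sketch_import(struct dma_buf_attachment *attach, dma_addr_t *base)
{
	struct sg_table *sgt;

	/* the pages already exist in the exporter; this only creates a
	 * mapping in the importing device's (IOMMU) address space */
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		return PTR_ERR(sgt);

	/* hardware without scatter-gather needs one IOVA range */
	if (sgt->nents != 1) {
		dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
		return -EINVAL;
	}

	*base = sg_dma_address(sgt->sgl);
	return 0;
}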

>
>>>>> IIUC, the change above breaks this model by inserting gaps in how the
>>>>> buffer is mapped to device memory, such that the buffer is no longer
>>>>> contiguous in dma address space.
>>>>
>>>>
>>>>
>>>> Even the existing arch/arm IOMMU DMA code which I guess this implicitly
>>>> relies on doesn't guarantee that behaviour - if the mapping happens to reach
>>>> one of the segment length/boundary limits it won't just leave a gap, it'll
>>>> start an entirely new IOVA allocation which could well start at a wildly
>>>> different address[0].
>>
>>
>> Could you explain segment length/boundary limits and when buffers can
>> reach them? Sorry, i haven't been following all the discussions, but
>> I'm not aware of any similar requirements of the IOMMU hardware I
>> worked with.
>
>
> I hope the explanation at the top makes sense - it's purely about the
> requirements of the DMA master device itself, nothing to do with the IOMMU
> (or lack of) in the middle. Devices with scatter-gather DMA limitations
> exist, therefore the API for scatter-gather DMA is designed to represent and
> respect such limitations.

Yes, it makes sense, thanks for the explanation. However, there also
exist devices with no scatter-gather capability sitting behind an IOMMU
that has no such fancy mapping limitations. I believe we should also
respect the limitation of such setups, namely the lack of support for
multiple IOVA segments.

>>>>> So, is the videobuf2-dma-contig.c based on an incorrect assumption
>>>>> about how the DMA API is supposed to work?
>>>>> Is it even possible to map a "contiguous-in-iova-range" mapping for a
>>>>> buffer given as an sg_table with an arbitrary set of pages?
>>>>
>>>>
>>>>
>>>>  From the Streaming DMA mappings section of Documentation/DMA-API.txt:
>>>>
>>>>    Note also that the above constraints on physical contiguity and
>>>>    dma_mask may not apply if the platform has an IOMMU (a device which
>>>>    maps an I/O DMA address to a physical memory address).  However, to
>>>> be
>>>>    portable, device driver writers may *not* assume that such an IOMMU
>>>>    exists.
>>>>
>>>> There's not strictly any harm in using the DMA API this way and *hoping*
>>>> you get what you want, as long as you're happy for it to fail pretty much
>>>> 100% of the time on some systems, and still in a minority of corner cases on
>>>> any system.
>>
>>
>> Could you please elaborate? I'd like to see examples, because I can't
>> really imagine buffers mappable contiguously on CPU, but not on IOMMU.
>> Also, as I said, the hardware I worked with didn't suffer from
>> problems like this.
>
>
> "...device driver writers may *not* assume that such an IOMMU exists."
>

And this is exactly why they _should_ use dma_map_sg(): it was supposed
to work correctly both for physically contiguous (i.e. 1-segment)
buffers with non-IOMMU-enabled devices and for non-contiguous (i.e.
more than 1 segment) buffers with IOMMU-enabled devices.

>>>> However, if there's a real dependency on IOMMUs and tight control of
>>>> IOVA allocation here, then the DMA API isn't really the right tool for the
>>>> job, and maybe it's time to start looking to how to better fit these
>>>> multimedia-subsystem-type use cases into the IOMMU API - as far as I
>>>> understand it there's at least some conceptual overlap with the HSA PASID
>>>> stuff being prototyped in PCI/x86-land at the moment, so it could be an
>>>> apposite time to try and bang out some common requirements.
>>
>>
>> The DMA API is actually the only good tool to use here to keep the
>> videobuf2-dma-contig code away from the knowledge about platform
>> specific data, e.g. presence of IOMMU. The only thing it knows is that
>> the target hardware requires a single contiguous buffer and it relies
>> on the fact that in correct cases the buffer given to it will meet
>> this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable
>> with IOMMU).
>
>
> As above; the DMA API guarantees only what the DMA API guarantees. An
> IOMMU-based implementation of streaming DMA is free to identity-map pages if
> it only cares about device isolation; a non-IOMMU implementation is free to
> provide streaming DMA remapping via some elaborate bounce-buffering scheme

I guess this is the area where our understandings of the IOMMU-backed
DMA API differ.

> if it really wants to. GART-type IOMMUs... let's not even go there.

I believe that's how the IOMMU-based implementation of the DMA API was
supposed to work when it was first implemented for ARM...

>
> If v4l needs a guarantee of a single contiguous DMA buffer, then it needs to
> use dma_alloc_coherent() for that, not streaming mappings.

Except that it can't use it, because the buffers are already allocated
by another entity.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-11-03 18:40                       ` Russell King - ARM Linux
@ 2015-11-04  5:15                         ` Tomasz Figa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tomasz Figa @ 2015-11-04  5:15 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Robin Murphy, Daniel Kurtz, Lin PoChun, linux-arm-kernel,
	Yingjoe Chen, Will Deacon, linux-media, Thierry Reding,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak,
	Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas,
	linux-mediatek

On Wed, Nov 4, 2015 at 3:40 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Tue, Nov 03, 2015 at 05:41:24PM +0000, Robin Murphy wrote:
>> Hi Tomasz,
>>
>> On 02/11/15 13:43, Tomasz Figa wrote:
>> >Agreed. The dma_map_*() API is not guaranteed to return a single
>> >contiguous part of virtual address space for any given SG list.
>> >However it was understood to be able to map buffers contiguously
>> >mappable by the CPU into a single segment and users,
>> >videobuf2-dma-contig in particular, relied on this.
>>
>> I don't follow that - _any_ buffer made of page-sized chunks is going to be
>> mappable contiguously by the CPU; it's clearly impossible for the streaming
>> DMA API itself to offer such a guarantee, because it's entirely orthogonal
>> to the presence or otherwise of an IOMMU.
>
> Tomasz's use of "virtual address space" above in combination with the
> DMA API is really confusing.

I suppose I must have mistakenly used "virtual address space" somewhere
instead of "IO virtual address space". I'm sorry for causing
confusion.

The thing being discussed here is mapping of buffers described by
scatterlists into IO virtual address space, i.e. the operation
happening when dma_map_sg() is called for an IOMMU-enabled device.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-10-01 19:13     ` Robin Murphy
@ 2015-11-04  8:39         ` Yong Wu
  -1 siblings, 0 replies; 78+ messages in thread
From: Yong Wu @ 2015-11-04  8:39 UTC (permalink / raw)
  To: Robin Murphy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	tiffany.lin-NuS5LvNUpcJWk0Htik3J/w, Tomasz Figa,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
[...]
> +static void *__iommu_alloc_attrs(struct device *dev, size_t size,
> +				 dma_addr_t *handle, gfp_t gfp,
> +				 struct dma_attrs *attrs)
> +{
> +	bool coherent = is_device_dma_coherent(dev);
> +	int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent);
> +	void *addr;
> +
> +	if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
> +		return NULL;
> +	/*
> +	 * Some drivers rely on this, and we probably don't want the
> +	 * possibility of stale kernel data being read by devices anyway.
> +	 */
> +	gfp |= __GFP_ZERO;
> +
> +	if (gfp & __GFP_WAIT) {
> +		struct page **pages;
> +		pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent);
> +
> +		pages = iommu_dma_alloc(dev, size, gfp, ioprot,	handle,
> +					flush_page);
> +		if (!pages)
> +			return NULL;
> +
> +		addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
> +					      __builtin_return_address(0));
> +		if (!addr)
> +			iommu_dma_free(dev, pages, size, handle);
> +	} else {
> +		struct page *page;
> +		/*
> +		 * In atomic context we can't remap anything, so we'll only
> +		 * get the virtually contiguous buffer we need by way of a
> +		 * physically contiguous allocation.
> +		 */
> +		if (coherent) {
> +			page = alloc_pages(gfp, get_order(size));
> +			addr = page ? page_address(page) : NULL;
> +		} else {
> +			addr = __alloc_from_pool(size, &page, gfp);
> +		}
> +		if (!addr)
> +			return NULL;
> +
> +		*handle = iommu_dma_map_page(dev, page, 0, size, ioprot);
> +		if (iommu_dma_mapping_error(dev, *handle)) {
> +			if (coherent)
> +				__free_pages(page, get_order(size));
> +			else
> +				__free_from_pool(addr, size);
> +			addr = NULL;
> +		}
> +	}
> +	return addr;
> +}
> +
> +static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr,
> +			       dma_addr_t handle, struct dma_attrs *attrs)
> +{
> +	/*
> +	 * @cpu_addr will be one of 3 things depending on how it was allocated:
> +	 * - A remapped array of pages from iommu_dma_alloc(), for all
> +	 *   non-atomic allocations.
> +	 * - A non-cacheable alias from the atomic pool, for atomic
> +	 *   allocations by non-coherent devices.
> +	 * - A normal lowmem address, for atomic allocations by
> +	 *   coherent devices.
> +	 * Hence how dodgy the below logic looks...
> +	 */
> +	if (__in_atomic_pool(cpu_addr, size)) {
> +		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
> +		__free_from_pool(cpu_addr, size);
> +	} else if (is_vmalloc_addr(cpu_addr)){
> +		struct vm_struct *area = find_vm_area(cpu_addr);
> +
> +		if (WARN_ON(!area || !area->pages))
> +			return;
> +		iommu_dma_free(dev, area->pages, size, &handle);
> +		dma_common_free_remap(cpu_addr, size, VM_USERMAP);

Hi Robin,
    We get a WARN here when the size is not page-aligned.

    The WARN log is:
[  206.852002] WARNING: CPU: 0 PID: 23329
at /mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:65
vunmap_page_range+0x190/0x1b4()
[  206.864438] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat
rfcomm i2c_dev uinput dm9601 uvcvideo btmrvl_sdio mwifiex_sdio mwifiex
btmrvl bluetooth zram fuse cfg80211 nf_conntrack_ipv6 nf_defrag_ipv6
ip6table_filter ip6_tables cdc_ether usbnet mii joydev snd_seq_midi
snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device ppp_async
ppp_generic slhc tun
[  206.902983] CPU: 0 PID: 23329 Comm: chrome Not tainted 3.18.0 #17
[  206.910430] Hardware name: Mediatek Oak rev3 board (DT)
[  206.920018] Call trace:
[  206.925537] [<ffffffc000208c00>] dump_backtrace+0x0/0x140
[  206.931905] [<ffffffc000208d5c>] show_stack+0x1c/0x28
[  206.939158] [<ffffffc000870f80>] dump_stack+0x74/0x94
[  206.947459] [<ffffffc0002219a4>] warn_slowpath_common+0x90/0xb8
[  206.954100] [<ffffffc000221b58>] warn_slowpath_null+0x34/0x44
[  206.961537] [<ffffffc000321358>] vunmap_page_range+0x18c/0x1b4
[  206.967630] [<ffffffc0003213e4>] unmap_kernel_range+0x2c/0x78
[  206.976977] [<ffffffc000582224>] dma_common_free_remap+0x68/0x80
[  206.983581] [<ffffffc000217260>] __iommu_free_attrs+0x14c/0x160
[  206.989646] [<ffffffc00066fc1c>] mtk_vcodec_mem_free+0xa0/0x15c
[  206.996481] [<ffffffc00067e278>] vp9_free_work_buf+0x54/0x70
[  207.002260] [<ffffffc00067f168>] vdec_vp9_deinit+0x7c/0xe8
[  207.008134] [<ffffffc0006787d8>] vdec_if_deinit+0x84/0xec
[  207.013820] [<ffffffc000677898>] mtk_vcodec_vdec_release+0x54/0x6c
[  207.020672] [<ffffffc000673e3c>] fops_vcodec_release+0x7c/0xf8
[  207.026607] [<ffffffc000652b78>] v4l2_release+0x3c/0x84
[  207.031824] [<ffffffc00033b218>] __fput+0xf8/0x1c0
[  207.036599] [<ffffffc00033b350>] ____fput+0x1c/0x2c
[  207.041454] [<ffffffc00023ed78>] task_work_run+0xb0/0xd4
[  207.046756] [<ffffffc00020872c>] do_notify_resume+0x54/0x6c


   From the log in this failing case, the unmap size here is 0x10080,
the size passed to dma_common_pages_remap in __iommu_alloc_attrs is
also 0x10080, and the corresponding dma-map size is 0x11000 (after
iova_align). I think all the parameters of map and unmap are good, so
it looks like it is not a DMA issue, but I don't know why we get this
warning.
Have you met this problem before, and could you give us some advice? Thanks.

(If we add PAGE_ALIGN for the size in dma_alloc and dma_free, it is OK.)

> +	} else {
> +		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
> +		__free_pages(virt_to_page(cpu_addr), get_order(size));
> +	}
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
@ 2015-11-04  9:10                           ` Russell King - ARM Linux
  0 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2015-11-04  9:10 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Robin Murphy, Daniel Kurtz, Lin PoChun, linux-arm-kernel,
	Yingjoe Chen, Will Deacon, linux-media, Thierry Reding,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak,
	Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas,
	linux-mediatek

On Wed, Nov 04, 2015 at 02:15:41PM +0900, Tomasz Figa wrote:
> On Wed, Nov 4, 2015 at 3:40 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Tue, Nov 03, 2015 at 05:41:24PM +0000, Robin Murphy wrote:
> >> Hi Tomasz,
> >>
> >> On 02/11/15 13:43, Tomasz Figa wrote:
> >> >Agreed. The dma_map_*() API is not guaranteed to return a single
> >> >contiguous part of virtual address space for any given SG list.
> >> >However it was understood to be able to map buffers contiguously
> >> >mappable by the CPU into a single segment and users,
> >> >videobuf2-dma-contig in particular, relied on this.
> >>
> >> I don't follow that - _any_ buffer made of page-sized chunks is going to be
> >> mappable contiguously by the CPU; it's clearly impossible for the streaming
> >> DMA API itself to offer such a guarantee, because it's entirely orthogonal
> >> to the presence or otherwise of an IOMMU.
> >
> > Tomasz's use of "virtual address space" above in combination with the
> > DMA API is really confusing.
> 
> I suppose I must have mistakenly used "virtual address space" somewhere
> instead of "IO virtual address space". I'm sorry for causing
> confusion.
> 
> The thing being discussed here is mapping of buffers described by
> scatterlists into IO virtual address space, i.e. the operation
> happening when dma_map_sg() is called for an IOMMU-enabled device.

... and there, it's perfectly legal for an IOMMU to merge all entries
in a scatterlist into one mapping - so dma_map_sg() would return 1.

What that means is that the scatterlist still contains the original
number of entries, which describe the CPU view of the buffer list,
while the DMA device view of the same buffer is described by just the
first entry.

In other words, if you're walking a scatterlist and doing a mixture of
DMA and PIO, you can't assume that if you're at scatterlist entry N for
DMA, you can switch to PIO for entry N and you'll write to the same
memory.  (I know that there are badly written drivers in the kernel
which unfortunately do make this assumption, and if they're used in the
presence of an IOMMU, they _will_ silently corrupt data.)
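
In code terms that boils down to something like this (hypothetical
driver code; program_hw_descriptor() is a made-up stand-in for whatever
feeds the DMA engine):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static void program_hw_descriptor(dma_addr_t addr, unsigned int len) { }

static int sketch_setup_transfer(struct device *dev, struct scatterlist *sgl,
				 int orig_nents)
{
	struct scatterlist *s;
	int i, count;

	count = dma_map_sg(dev, sgl, orig_nents, DMA_TO_DEVICE);
	if (!count)
		return -ENOMEM;

	/* device view: walk the *returned* count with the DMA accessors */
	for_each_sg(sgl, s, count, i)
		program_hw_descriptor(sg_dma_address(s), sg_dma_len(s));

	/* the CPU view is still the original orig_nents entries
	 * (sg_page()/offset/length); once entries have been merged,
	 * CPU entry N and DMA segment N need not cover the same memory,
	 * so switching to PIO for "entry N" is unsafe */
	return 0;
}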

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
@ 2015-11-04  9:27                         ` Russell King - ARM Linux
  0 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2015-11-04  9:27 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Robin Murphy, Laurent Pinchart, Pawel Osciak, Catalin Marinas,
	Joerg Roedel, Will Deacon, Kyungmin Park, Daniel Kurtz, Yong Wu,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	linux-mediatek, Lin PoChun, thunder.leizhen, Marek Szyprowski,
	Yingjoe Chen, Thierry Reding, linux-arm-kernel, linux-media

On Wed, Nov 04, 2015 at 02:12:03PM +0900, Tomasz Figa wrote:
> My understanding of a scatterlist was that it represents a buffer as a
> whole, by joining together its physically discontinuous segments.

Correct, and it may be scattered in CPU virtual space as well.

> I don't see how single segments (layout of which is completely up to
> the allocator; often just single pages) would be usable for hardware
> that needs to do some work more serious than just writing a byte
> stream continuously to subsequent buffers. In case of such simple
> devices you don't even need an IOMMU (for means other than protection
> and/or getting over address space limitations).

All that's required is that the addresses described in the scatterlist
are accessed as an apparently contiguous series of bytes.  They don't
have to be contiguous in any address view, provided the device access
appears to be contiguous.  How that is done is really neither here nor
there.

IOMMUs are normally there as an address translator - for example, the
underlying device may not have the capability to address a scatterlist
(eg, because it makes effectively random access) and in order to be
accessible to the device, it needs to be made contiguous in device
address space.

Another scenario is that you have more bits of physical address than
a device can generate itself for DMA purposes, and you need an IOMMU
to create a (possibly scattered) mapping in device address space
within the ability of the device to address.

The requirements here depend on the device behind the IOMMU.

> However, IMHO the most important use case of an IOMMU is to make
> buffers, which are contiguous in CPU virtual address space (VA),
> contiguous in device's address space (IOVA).

No - there is no requirement for CPU virtual contiguous buffers to also
be contiguous in the device address space.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-11-04  9:27                         ` Russell King - ARM Linux
@ 2015-11-04  9:48                           ` Tomasz Figa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tomasz Figa @ 2015-11-04  9:48 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Robin Murphy, Laurent Pinchart, Pawel Osciak, Catalin Marinas,
	Joerg Roedel, Will Deacon, Kyungmin Park, Daniel Kurtz, Yong Wu,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	linux-mediatek, Lin PoChun, thunder.leizhen, Marek Szyprowski,
	Yingjoe Chen, Thierry Reding, linux-arm-kernel, linux-media

On Wed, Nov 4, 2015 at 6:27 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Nov 04, 2015 at 02:12:03PM +0900, Tomasz Figa wrote:
>> My understanding of a scatterlist was that it represents a buffer as a
>> whole, by joining together its physically discontinuous segments.
>
> Correct, and it may also be scattered in CPU virtual space as well.
>
>> I don't see how single segments (layout of which is completely up to
>> the allocator; often just single pages) would be usable for hardware
>> that needs to do some work more serious than just writing a byte
>> stream continuously to subsequent buffers. In case of such simple
>> devices you don't even need an IOMMU (for means other than protection
>> and/or getting over address space limitations).
>
> All that's required is that the addresses described in the scatterlist
> are accessed as an apparently contiguous series of bytes.  They don't
> have to be contiguous in any address view, provided the device access
> appears to be contiguous.  How that is done is really neither here nor
> there.
>
> IOMMUs are normally there as an address translator - for example, the
> underlying device may not have the capability to address a scatterlist
> (eg, because it makes effectively random access) and in order to be
> accessible to the device, it needs to be made contiguous in device
> address space.
>
> Another scenario is that you have more bits of physical address than
> a device can generate itself for DMA purposes, and you need an IOMMU
> to create a (possibly scattered) mapping in device address space
> within the ability of the device to address.
>
> The requirements here depend on the device behind the IOMMU.

I fully agree with you.

The problem is that the code being discussed here breaks the case of
devices that don't have the capability of addressing a scatterlist,
supposedly for the sake of devices that do have it (but, as I
suggested, both could happily be supported by giving special values of
the DMA max segment size and boundary mask a distinguished meaning).

>> However, IMHO the most important use case of an IOMMU is to make
>> buffers, which are contiguous in CPU virtual address space (VA),
>> contiguous in device's address space (IOVA).
>
> No - there is no requirement for CPU virtual contiguous buffers to also
> be contiguous in the device address space.

There is no requirement, but shouldn't it be desired for the mapping
code to map them as such? Otherwise, how could the IOMMU use case you
described above (address translator for devices which don't have the
capability to address a scatterlist) be handled properly?

Is the general conclusion now that dma_map_sg() should not be used to
create IOMMU mappings, and that we should take a step backwards and
make all drivers (or frameworks, such as videobuf2) do that manually?
That would be really backwards, because code that is not at all aware
of the IOMMU's existence would have to become aware of it.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
@ 2015-11-04 10:50                             ` Russell King - ARM Linux
  0 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2015-11-04 10:50 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Robin Murphy, Laurent Pinchart, Pawel Osciak, Catalin Marinas,
	Joerg Roedel, Will Deacon, Kyungmin Park, Daniel Kurtz, Yong Wu,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	linux-mediatek, Lin PoChun, thunder.leizhen, Marek Szyprowski,
	Yingjoe Chen, Thierry Reding, linux-arm-kernel, linux-media

On Wed, Nov 04, 2015 at 06:48:50PM +0900, Tomasz Figa wrote:
> There is no requirement, but shouldn't it be desired for the mapping
> code to map them as such? Otherwise, how could the IOMMU use case you
> described above (address translator for devices which don't have the
> capability to address a scatterlist) be handled properly?

It's up to the IOMMU code to respect the parameters that the device has
supplied to it via the device_dma_parameters.  This doesn't currently
allow a device to say "I want this scatterlist to be mapped as a
contiguous device address", so really if a device has such a requirement,
at the moment the device driver _must_ check the dma_map_sg() return
value and act accordingly.
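
i.e. something along these lines (hypothetical driver code, not any
existing helper):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int sketch_map_contig(struct device *dev, struct sg_table *sgt,
			     dma_addr_t *dev_addr)
{
	int nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE);

	if (nents <= 0)
		return -EIO;
	if (nents > 1) {
		/* no single contiguous device address: back out rather
		 * than carry on and assume */
		dma_unmap_sg(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE);
		return -EINVAL;
	}

	*dev_addr = sg_dma_address(sgt->sgl);
	return 0;
}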

While it's possible to say "an IOMMU should map this as a single
contiguous device address", what happens when the IOMMU's device
address space becomes fragmented?

> Is the general conclusion now that dma_map_sg() should not be used to
> create IOMMU mappings and we should make a step backwards making all
> drivers (or frameworks, such as videobuf2) do that manually? That
> would be really backwards, because code not aware of IOMMU existence
> at all would have to become aware of it.

No.  The DMA API has always had the responsibility for managing the
IOMMU device, which may well be shared between multiple different
devices.

However, if the IOMMU is part of a device IP block (such as a GPU)
then the decision on whether the DMA API should be used or not is up
to the driver author.  If it has special management requirements,
then it's probably appropriate for the device driver to manage it by
itself.

For example, a GPU's MMU may need something inserted into the GPU's
command stream to flush the MMU TLBs.  In such cases it is
inappropriate to use the DMA API for IOMMU management.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-11-04  8:39         ` Yong Wu
@ 2015-11-04 13:11           ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-11-04 13:11 UTC (permalink / raw)
  To: Yong Wu, labbott-rxtnV0ftBwyoClj4AeEUq9i2O/JbrIOy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	tiffany.lin-NuS5LvNUpcJWk0Htik3J/w, Tomasz Figa,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 04/11/15 08:39, Yong Wu wrote:
> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>> Taking some inspiration from the arch/arm code, implement the
>> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
> [...]
>> +static void *__iommu_alloc_attrs(struct device *dev, size_t size,
>> +				 dma_addr_t *handle, gfp_t gfp,
>> +				 struct dma_attrs *attrs)
>> +{
>> +	bool coherent = is_device_dma_coherent(dev);
>> +	int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent);
>> +	void *addr;
>> +
>> +	if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
>> +		return NULL;
>> +	/*
>> +	 * Some drivers rely on this, and we probably don't want the
>> +	 * possibility of stale kernel data being read by devices anyway.
>> +	 */
>> +	gfp |= __GFP_ZERO;
>> +
>> +	if (gfp & __GFP_WAIT) {
>> +		struct page **pages;
>> +		pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent);
>> +
>> +		pages = iommu_dma_alloc(dev, size, gfp, ioprot,	handle,
>> +					flush_page);
>> +		if (!pages)
>> +			return NULL;
>> +
>> +		addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
>> +					      __builtin_return_address(0));
>> +		if (!addr)
>> +			iommu_dma_free(dev, pages, size, handle);
>> +	} else {
>> +		struct page *page;
>> +		/*
>> +		 * In atomic context we can't remap anything, so we'll only
>> +		 * get the virtually contiguous buffer we need by way of a
>> +		 * physically contiguous allocation.
>> +		 */
>> +		if (coherent) {
>> +			page = alloc_pages(gfp, get_order(size));
>> +			addr = page ? page_address(page) : NULL;
>> +		} else {
>> +			addr = __alloc_from_pool(size, &page, gfp);
>> +		}
>> +		if (!addr)
>> +			return NULL;
>> +
>> +		*handle = iommu_dma_map_page(dev, page, 0, size, ioprot);
>> +		if (iommu_dma_mapping_error(dev, *handle)) {
>> +			if (coherent)
>> +				__free_pages(page, get_order(size));
>> +			else
>> +				__free_from_pool(addr, size);
>> +			addr = NULL;
>> +		}
>> +	}
>> +	return addr;
>> +}
>> +
>> +static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr,
>> +			       dma_addr_t handle, struct dma_attrs *attrs)
>> +{
>> +	/*
>> +	 * @cpu_addr will be one of 3 things depending on how it was allocated:
>> +	 * - A remapped array of pages from iommu_dma_alloc(), for all
>> +	 *   non-atomic allocations.
>> +	 * - A non-cacheable alias from the atomic pool, for atomic
>> +	 *   allocations by non-coherent devices.
>> +	 * - A normal lowmem address, for atomic allocations by
>> +	 *   coherent devices.
>> +	 * Hence how dodgy the below logic looks...
>> +	 */
>> +	if (__in_atomic_pool(cpu_addr, size)) {
>> +		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
>> +		__free_from_pool(cpu_addr, size);
>> +	} else if (is_vmalloc_addr(cpu_addr)){
>> +		struct vm_struct *area = find_vm_area(cpu_addr);
>> +
>> +		if (WARN_ON(!area || !area->pages))
>> +			return;
>> +		iommu_dma_free(dev, area->pages, size, &handle);
>> +		dma_common_free_remap(cpu_addr, size, VM_USERMAP);
>
> Hi Robin,
>      We get a WARN issue while the size is not aligned here.
>
>      The WARN log is:
> [  206.852002] WARNING: CPU: 0 PID: 23329
> at /mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:65
> vunmap_page_range+0x190/0x1b4()
> [  206.864438] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat
> rfcomm i2c_dev uinput dm9601 uvcvideo btmrvl_sdio mwifiex_sdio mwifiex
> btmrvl bluetooth zram fuse cfg80211 nf_conntrack_ipv6 nf_defrag_ipv6
> ip6table_filter ip6_tables cdc_ether usbnet mii joydev snd_seq_midi
> snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device ppp_async
> ppp_generic slhc tun
> [  206.902983] CPU: 0 PID: 23329 Comm: chrome Not tainted 3.18.0 #17
> [  206.910430] Hardware name: Mediatek Oak rev3 board (DT)
> [  206.920018] Call trace:
> [  206.925537] [<ffffffc000208c00>] dump_backtrace+0x0/0x140
> [  206.931905] [<ffffffc000208d5c>] show_stack+0x1c/0x28
> [  206.939158] [<ffffffc000870f80>] dump_stack+0x74/0x94
> [  206.947459] [<ffffffc0002219a4>] warn_slowpath_common+0x90/0xb8
> [  206.954100] [<ffffffc000221b58>] warn_slowpath_null+0x34/0x44
> [  206.961537] [<ffffffc000321358>] vunmap_page_range+0x18c/0x1b4
> [  206.967630] [<ffffffc0003213e4>] unmap_kernel_range+0x2c/0x78
> [  206.976977] [<ffffffc000582224>] dma_common_free_remap+0x68/0x80
> [  206.983581] [<ffffffc000217260>] __iommu_free_attrs+0x14c/0x160
> [  206.989646] [<ffffffc00066fc1c>] mtk_vcodec_mem_free+0xa0/0x15c
> [  206.996481] [<ffffffc00067e278>] vp9_free_work_buf+0x54/0x70
> [  207.002260] [<ffffffc00067f168>] vdec_vp9_deinit+0x7c/0xe8
> [  207.008134] [<ffffffc0006787d8>] vdec_if_deinit+0x84/0xec
> [  207.013820] [<ffffffc000677898>] mtk_vcodec_vdec_release+0x54/0x6c
> [  207.020672] [<ffffffc000673e3c>] fops_vcodec_release+0x7c/0xf8
> [  207.026607] [<ffffffc000652b78>] v4l2_release+0x3c/0x84
> [  207.031824] [<ffffffc00033b218>] __fput+0xf8/0x1c0
> [  207.036599] [<ffffffc00033b350>] ____fput+0x1c/0x2c
> [  207.041454] [<ffffffc00023ed78>] task_work_run+0xb0/0xd4
> [  207.046756] [<ffffffc00020872c>] do_notify_resume+0x54/0x6c
>
>
>     From the log I get in this fail case, the size of unmap here is
> 0x10080, the size mapped by dma_common_pages_remap in
> __iommu_alloc_attrs is also 0x10080, and the corresponding dma-map size
> is 0x11000 (after iova_align). I think all the parameters of map and
> unmap are good, so it looks like it is not a DMA issue, but I don't know
> why we get this warning.
> Have you seen this problem before, and could you give us some advice? Thanks.
>
> (If we add PAGE_ALIGN to the size in dma_alloc and dma_free, it is OK.)

OK, having dug into this, it looks like the root cause comes from some 
asymmetry in the common code: dma_common_pages_remap() just passes the 
size through to get_vm_area_caller(), and the first thing that does is 
to page-align it. On the other hand, neither dma_common_free_remap() nor 
unmap_kernel_range() does anything with the size, so we wind up giving 
an unaligned end address to vunmap_page_range() and messing up the 
vmalloc page tables.
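
For illustration, here's a trivial standalone C sketch of that mismatch
(plain userspace code, not kernel code), using the 0x10080 size from the
report above and a made-up base address:

	#include <stdio.h>
	#include <stddef.h>

	#define PAGE_SIZE	0x1000UL
	#define PAGE_MASK	(~(PAGE_SIZE - 1))
	#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & PAGE_MASK)

	int main(void)
	{
		/* values mirror the WARN report; the address is invented */
		unsigned long va = 0xffffff8000100000UL;
		size_t size = 0x10080;

		/* remap side: get_vm_area_caller() page-aligns the size */
		printf("mapped:   %#lx - %#lx\n", va, va + PAGE_ALIGN(size));
		/* free side: unmap_kernel_range() gets the raw size */
		printf("unmapped: %#lx - %#lx (end not page-aligned)\n",
		       va, va + size);
		return 0;
	}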

I wonder if dma_common_free_remap() should be page-aligning the size to 
match expectations (i.e. make it correctly unmap any request the other 
functions happily mapped), or conversely, perhaps both the map and unmap 
functions should have a WARN_ON(size & ~PAGE_MASK) to enforce being 
called as actually intended. Laura?

Either way, I'll send out a patch to make the arm64 side deal with it 
explicitly.

Robin.

>
>> +	} else {
>> +		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
>> +		__free_pages(virt_to_page(cpu_addr), get_order(size));
>> +	}
>> +}
>> +
> [...]
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 2/3] arm64: Add IOMMU dma_ops
  2015-11-04 13:11           ` Robin Murphy
@ 2015-11-04 17:35               ` Laura Abbott
  -1 siblings, 0 replies; 78+ messages in thread
From: Laura Abbott @ 2015-11-04 17:35 UTC (permalink / raw)
  To: Robin Murphy, Yong Wu, labbott-rxtnV0ftBwyoClj4AeEUq9i2O/JbrIOy
  Cc: laurent.pinchart+renesas-ryLnwIuWjnjg/C1BVhZhaw,
	catalin.marinas-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	tiffany.lin-NuS5LvNUpcJWk0Htik3J/w, Tomasz Figa,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	djkurtz-hpIqsD4AKlfQT0dZR+AlfA,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	yingjoe.chen-NuS5LvNUpcJWk0Htik3J/w,
	treding-DDmLM1+adcrQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 11/04/2015 05:11 AM, Robin Murphy wrote:
> On 04/11/15 08:39, Yong Wu wrote:
>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>>> Taking some inspiration from the arch/arm code, implement the
>>> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>> [...]
>>> +static void *__iommu_alloc_attrs(struct device *dev, size_t size,
>>> +                 dma_addr_t *handle, gfp_t gfp,
>>> +                 struct dma_attrs *attrs)
>>> +{
>>> +    bool coherent = is_device_dma_coherent(dev);
>>> +    int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent);
>>> +    void *addr;
>>> +
>>> +    if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
>>> +        return NULL;
>>> +    /*
>>> +     * Some drivers rely on this, and we probably don't want the
>>> +     * possibility of stale kernel data being read by devices anyway.
>>> +     */
>>> +    gfp |= __GFP_ZERO;
>>> +
>>> +    if (gfp & __GFP_WAIT) {
>>> +        struct page **pages;
>>> +        pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent);
>>> +
>>> +        pages = iommu_dma_alloc(dev, size, gfp, ioprot,    handle,
>>> +                    flush_page);
>>> +        if (!pages)
>>> +            return NULL;
>>> +
>>> +        addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
>>> +                          __builtin_return_address(0));
>>> +        if (!addr)
>>> +            iommu_dma_free(dev, pages, size, handle);
>>> +    } else {
>>> +        struct page *page;
>>> +        /*
>>> +         * In atomic context we can't remap anything, so we'll only
>>> +         * get the virtually contiguous buffer we need by way of a
>>> +         * physically contiguous allocation.
>>> +         */
>>> +        if (coherent) {
>>> +            page = alloc_pages(gfp, get_order(size));
>>> +            addr = page ? page_address(page) : NULL;
>>> +        } else {
>>> +            addr = __alloc_from_pool(size, &page, gfp);
>>> +        }
>>> +        if (!addr)
>>> +            return NULL;
>>> +
>>> +        *handle = iommu_dma_map_page(dev, page, 0, size, ioprot);
>>> +        if (iommu_dma_mapping_error(dev, *handle)) {
>>> +            if (coherent)
>>> +                __free_pages(page, get_order(size));
>>> +            else
>>> +                __free_from_pool(addr, size);
>>> +            addr = NULL;
>>> +        }
>>> +    }
>>> +    return addr;
>>> +}
>>> +
>>> +static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr,
>>> +                   dma_addr_t handle, struct dma_attrs *attrs)
>>> +{
>>> +    /*
>>> +     * @cpu_addr will be one of 3 things depending on how it was allocated:
>>> +     * - A remapped array of pages from iommu_dma_alloc(), for all
>>> +     *   non-atomic allocations.
>>> +     * - A non-cacheable alias from the atomic pool, for atomic
>>> +     *   allocations by non-coherent devices.
>>> +     * - A normal lowmem address, for atomic allocations by
>>> +     *   coherent devices.
>>> +     * Hence how dodgy the below logic looks...
>>> +     */
>>> +    if (__in_atomic_pool(cpu_addr, size)) {
>>> +        iommu_dma_unmap_page(dev, handle, size, 0, NULL);
>>> +        __free_from_pool(cpu_addr, size);
>>> +    } else if (is_vmalloc_addr(cpu_addr)){
>>> +        struct vm_struct *area = find_vm_area(cpu_addr);
>>> +
>>> +        if (WARN_ON(!area || !area->pages))
>>> +            return;
>>> +        iommu_dma_free(dev, area->pages, size, &handle);
>>> +        dma_common_free_remap(cpu_addr, size, VM_USERMAP);
>>
>> Hi Robin,
>>      We get a WARN issue while the size is not aligned here.
>>
>>      The WARN log is:
>> [  206.852002] WARNING: CPU: 0 PID: 23329
>> at /mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:65
>> vunmap_page_range+0x190/0x1b4()
>> [  206.864438] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat
>> rfcomm i2c_dev uinput dm9601 uvcvideo btmrvl_sdio mwifiex_sdio mwifiex
>> btmrvl bluetooth zram fuse cfg80211 nf_conntrack_ipv6 nf_defrag_ipv6
>> ip6table_filter ip6_tables cdc_ether usbnet mii joydev snd_seq_midi
>> snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device ppp_async
>> ppp_generic slhc tun
>> [  206.902983] CPU: 0 PID: 23329 Comm: chrome Not tainted 3.18.0 #17
>> [  206.910430] Hardware name: Mediatek Oak rev3 board (DT)
>> [  206.920018] Call trace:
>> [  206.925537] [<ffffffc000208c00>] dump_backtrace+0x0/0x140
>> [  206.931905] [<ffffffc000208d5c>] show_stack+0x1c/0x28
>> [  206.939158] [<ffffffc000870f80>] dump_stack+0x74/0x94
>> [  206.947459] [<ffffffc0002219a4>] warn_slowpath_common+0x90/0xb8
>> [  206.954100] [<ffffffc000221b58>] warn_slowpath_null+0x34/0x44
>> [  206.961537] [<ffffffc000321358>] vunmap_page_range+0x18c/0x1b4
>> [  206.967630] [<ffffffc0003213e4>] unmap_kernel_range+0x2c/0x78
>> [  206.976977] [<ffffffc000582224>] dma_common_free_remap+0x68/0x80
>> [  206.983581] [<ffffffc000217260>] __iommu_free_attrs+0x14c/0x160
>> [  206.989646] [<ffffffc00066fc1c>] mtk_vcodec_mem_free+0xa0/0x15c
>> [  206.996481] [<ffffffc00067e278>] vp9_free_work_buf+0x54/0x70
>> [  207.002260] [<ffffffc00067f168>] vdec_vp9_deinit+0x7c/0xe8
>> [  207.008134] [<ffffffc0006787d8>] vdec_if_deinit+0x84/0xec
>> [  207.013820] [<ffffffc000677898>] mtk_vcodec_vdec_release+0x54/0x6c
>> [  207.020672] [<ffffffc000673e3c>] fops_vcodec_release+0x7c/0xf8
>> [  207.026607] [<ffffffc000652b78>] v4l2_release+0x3c/0x84
>> [  207.031824] [<ffffffc00033b218>] __fput+0xf8/0x1c0
>> [  207.036599] [<ffffffc00033b350>] ____fput+0x1c/0x2c
>> [  207.041454] [<ffffffc00023ed78>] task_work_run+0xb0/0xd4
>> [  207.046756] [<ffffffc00020872c>] do_notify_resume+0x54/0x6c
>>
>>
>>     From the log I get in this fail case, the size of unmap here is
>> 0x10080, and its map size of dma_common_pages_remap in
>> __iommu_alloc_attrs is 0x10080, and the corresponding dma-map size is
>> 0x11000(after iova_align). I think all the parameters of map and unmap
>> are good, it look like not a DMA issue. but I don't know why we get this
>> warning.
>> Have you met this problem and give us some advices, Thanks.
>>
>> (If we add PAGE_ALIGN for the size in dma_alloc and dma_free, It is OK.)
>
> OK, having dug into this, it looks like the root cause comes from some asymmetry in the common code: dma_common_pages_remap() just passes the size through to get_vm_area_caller(), and the first thing that does is to page-align it. On the other hand, neither dma_common_free_remap() nor unmap_kernel_range() does anything with the size, so we wind up giving an unaligned end address to vunmap_page_range() and messing up the vmalloc page tables.
>
> I wonder if dma_common_free_remap() should be page-aligning the size to match expectations (i.e. make it correctly unmap any request the other functions happily mapped), or conversely, perhaps both the map and unmap functions should have a WARN_ON(size & ~PAGE_MASK) to enforce being called as actually intended. Laura?
>

Based on what I've found, the DMA mapping API needs to be able to handle unaligned sizes
gracefully, so I don't think a WARN is appropriate. I was aligning at the higher level, but
it would be best for dma_common_free_remap() to align as well.
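
Something along these lines, perhaps (a rough sketch only, loosely based
on the current drivers/base/dma-mapping.c helper and not a tested patch,
so the details are approximate):

	void dma_common_free_remap(void *cpu_addr, size_t size,
				   unsigned long vm_flags)
	{
		struct vm_struct *area = find_vm_area(cpu_addr);

		if (!area || !(area->flags & vm_flags)) {
			WARN(1, "trying to free invalid coherent area: %p\n",
			     cpu_addr);
			return;
		}

		/* align as the remap side effectively did */
		unmap_kernel_range((unsigned long)cpu_addr, PAGE_ALIGN(size));
		vunmap(cpu_addr);
	}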
  
> Either way, I'll send out a patch to make the arm64 side deal with it explicitly.
>
> Robin.
>

Thanks,
Laura

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-11-04  5:12                       ` Tomasz Figa
@ 2015-11-09 13:11                         ` Robin Murphy
  -1 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-11-09 13:11 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Daniel Kurtz, Lin PoChun, linux-arm-kernel, Yingjoe Chen,
	Will Deacon, linux-media, Thierry Reding,
	open list:IOMMU DRIVERS, Bobby Batacharia (via Google Docs),
	Kyungmin Park, Marek Szyprowski, Yong Wu, Pawel Osciak,
	Laurent Pinchart, Joerg Roedel, thunder.leizhen, Catalin Marinas,
	Russell King, linux-mediatek

On 04/11/15 05:12, Tomasz Figa wrote:
> On Wed, Nov 4, 2015 at 2:41 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi Tomasz,
>>
>> On 02/11/15 13:43, Tomasz Figa wrote:
>>>
>>> I'd like to know what is the boundary mask and what hardware imposes
>>> requirements like this. The cost here is not only over-allocating a
>>> little, but making many, many buffers contiguously mappable on the
>>> CPU, unmappable contiguously in IOMMU, which just defeats the purpose
>>> of having an IOMMU, which I believe should be there for simple IP
>>> blocks taking one DMA address to be able to view the buffer the same
>>> way as the CPU.
>>
>>
>> The expectation with dma_map_sg() is that you're either going to be
>> iterating over the buffer segments, handing off each address to the device
>> to process one by one;
>
> My understanding of a scatterlist was that it represents a buffer as a
> whole, by joining together its physically discontinuous segments.

It can, but there are also cases where a single scatterlist is used to 
batch up multiple I/O requests - see the stuff in block/blk-merge.c as 
described in section 2.2 of Documentation/biodoc.txt, and AFAICS anyone 
could quite happily use the dmaengine API, and possibly others, in the 
same way. Ultimately a scatterlist is no more specific than "a list of 
blocks of physical memory that each want giving a DMA address".
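
To make that first model concrete, a hypothetical driver fragment might
look something like this (hw_submit() stands in for however the device
consumes one address/length pair, and the eventual dma_unmap_sg() is
left out):

	static int queue_buffer(struct device *dev, struct scatterlist *sgl,
				int nents)
	{
		struct scatterlist *sg;
		int i, count;

		/* count may be smaller than nents if entries get merged */
		count = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
		if (!count)
			return -ENOMEM;

		for_each_sg(sgl, sg, count, i)
			hw_submit(sg_dma_address(sg), sg_dma_len(sg));

		return 0;
	}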

> I don't see how single segments (layout of which is completely up to
> the allocator; often just single pages) would be usable for hardware
> that needs to do some work more serious than just writing a byte
> stream continuously to subsequent buffers. In case of such simple
> devices you don't even need an IOMMU (for means other than protection
> and/or getting over address space limitations).
>
> However, IMHO the most important use case of an IOMMU is to make
> buffers, which are contiguous in CPU virtual address space (VA),
> contiguous in device's address space (IOVA). Your implementation of
> dma_map_sg() effectively breaks this ability, so I'm not really
> following why it's located under drivers/iommu and supposed to be used
> with IOMMU-enabled platforms...
>
>> or you have a scatter-gather-capable device, in which
>> case you hand off the whole list at once.
>
> No need for mapping ability of the IOMMU here as well (except for
> working around address space issues, as I mentioned above).

Ok, now I'm starting to wonder if you're wilfully choosing to miss the 
point. Look at 64-bit systems of any architecture, and those address 
space issues are pretty much the primary consideration for including an 
IOMMU in the first place (behind virtualisation, which we can forget 
about here). Take the Juno board on my desk - most of the peripherals 
cannot address 75% of the RAM, and CPU bounce buffers are both not 
overly efficient and a limited resource (try using dmatest with 
sufficiently large buffers to stress/measure memory bandwidth and watch 
it take down the kernel, and that's without any other SWIOTLB 
contention). The only one that really cares at all about contiguous 
buffers is the HDLCD, but that's perfectly happy when it calls 
dma_alloc_coherent() via drm_fb_cma_helper and pulls a contiguous 8MB 
framebuffer out of thin air, without even knowing that CMA itself is 
disabled and it couldn't natively address 75% of the memory that might 
be backing that buffer.

That last point also illustrates that the thing for providing 
DMA-contiguous buffers is indeed very good at providing DMA-contiguous 
buffers when backed by an IOMMU.

>> It's in the latter case where you
>> have to make sure the list doesn't exceed the hardware limitations of that
>> device. I believe the original concern was disk controllers (the
>> introduction of dma_parms seems to originate from the linux-scsi list), but
>> most scatter-gather engines are going to have some limit on how much they
>> can handle per entry (IMO the dmaengine drivers are the easiest example to
>> look at).
>>
>> Segment boundaries are a little more arcane, but my assumption is that they
>> relate to the kind of devices whose addressing is not flat but relative to
>> some separate segment register (The "64-bit" mode of USB EHCI is one
>> concrete example I can think of) - since you cannot realistically change the
>> segment register while the device is in the middle of accessing a single
>> buffer entry, that entry must not fall across a segment boundary or at some
>> point the device's accesses are going to overflow the offset address bits
>> and wrap around to bogus addresses at the bottom of the segment.
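
To put some numbers on the above: with a boundary mask of 0xffffffff,
i.e. 4GB segments, an entry crosses a boundary whenever its first and
last bytes land in different 4GB windows. A standalone (userspace C)
illustration, with made-up addresses:

	#include <stdio.h>
	#include <stdbool.h>

	static bool crosses_boundary(unsigned long long addr,
				     unsigned long long len,
				     unsigned long long mask)
	{
		return (addr & ~mask) != ((addr + len - 1) & ~mask);
	}

	int main(void)
	{
		unsigned long long mask = 0xffffffffULL;

		/* 0xfffff000 + 0x2000 straddles the 4GB line: crosses */
		printf("%d\n", crosses_boundary(0xfffff000ULL, 0x2000, mask));
		/* 0x100000000 + 0x2000 stays in one window: fine */
		printf("%d\n", crosses_boundary(0x100000000ULL, 0x2000, mask));
		return 0;
	}
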
>
> The two requirements above sound like something really specific to
> scatter-gather-capable hardware, which as I pointed above, barely need
> an IOMMU (at least its mapping capabilities). We are talking here
> about very IOMMU-specific code, though...
>
> Now, while I see that on some systems there might be IOMMU used for
> improving protection and working around addressing issues with
> SG-capable hardware, the code shouldn't be breaking the majority of
> systems with IOMMU used as the only possible way to make physically
> discontinuous memory appear (IO-virtually) contiguous to devices incapable of
> scatter-gather.

Unless this majority of systems are all 64-bit ARMv8 ones running code 
that works perfectly _with the existing SWIOTLB DMA API implementation_ 
but not with this implementation, then I disagree that anything is being 
broken that wasn't already broken with respect to portability. 
Otherwise, please give me the details of any regressions with these 
patches relative to SWIOTLB DMA on arm64 so I can look into them.

>> Now yes, it will be possible under _most_ circumstances to use an IOMMU to
>> lay out a list of segments with page-aligned lengths within a single IOVA
>> allocation whilst still meeting all the necessary constraints. It just needs
>> some unavoidably complicated calculations - quite likely significantly more
>> complex than my v5 version of map_sg() that tried to do that and merge
>> segments but failed to take the initial alignment into account properly -
>> since there are much simpler ways to enforce just the _necessary_ behaviour
>> for the DMA API, I put the complicated stuff to one side for now to prevent
>> it holding up getting the basic functional support in place.
>
> Somehow just whatever currently done in arch/arm/mm/dma-mapping.c was
> sufficient and not overly complicated.
>
> See http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 .
>
> I can see that the code there at least tries to comply with maximum
> segment size constraint. Segment boundary seems to be ignored, though.

It certainly doesn't map the entire list into a single IOVA allocation 
as here (such that everything is laid out in contiguous IOVA pages 
_regardless_ of the segment lengths, and unmapping becomes nicely 
trivial). That it also is the only implementation which fails to respect 
segment boundaries really just implies that it's probably not seen much 
use beyond supporting graphics hardware on 32-bit systems, and/or has 
just got lucky otherwise.

> However, I'm convinced that in most (if not all) cases where IOMMU
> IOVA-contiguous mapping is needed, those two requirements don't exist.
> Do we really have to break the good hardware only because the
> bad^Wlimited one is broken?

Where "is broken" at least encompasses "is a SATA controller", 
presumably. Here's an example I've actually played with:

http://lxr.free-electrons.com/source/drivers/ata/sata_sil24.c#L390

It doesn't seem all that unreasonable that hardware that fundamentally 
works in fixed-size blocks of data wants its data aligned to its block 
size (or some efficient multiple). Implementing an API which has 
guaranteed support for that requirement from the outset necessitates 
supporting that requirement. I'm not going to buy the argument that 
having some video device DMA into userspace pages is more important than 
being able to boot at all (and not corrupting your filesystem).

> Couldn't we preserve the ARM-like behavior whenever
> dma_parms->segment_boundary_mask is set to all 1s and
> dma_parms->max_segment_size to UINT_MAX (what currently drivers used
> to set) or 0 (sounds more logical for the meaning of "no maximum
> given")?

Sure, I was always aiming to ultimately improve on the arch/arm 
implementation (i.e. with the single allocation thing), but for a common 
general-purpose implementation that's going to be shared by multiple 
architectures, correctness comes way before optimisation for one 
specific use-case. Thus we start with a baseline version that we know 
correctly implements all the required behaviour specified by the DMA 
API, then start tweaking it for other considerations later.

FWIW, I've already sketched out such a follow-on patch to start 
tightening up map_sg (because exposing any pages to the device more than 
absolutely necessary is not what we want in the long run). The thought 
that it's likely to be jumped on and used as an excuse to justify bad 
code elsewhere does rather sour the idea, though.

>>>>>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
>>>>>> memory pages into a contiguous block in device memory address space.
>>>>>> This would allow passing a dma mapped buffer to device dma using just
>>>>>> a device address and length.
>>>>>
>>>>>
>>>>>
>>>>> Not at all. The streaming DMA API (dma_map_* and friends) has two
>>>>> responsibilities: performing any necessary cache maintenance to ensure the
>>>>> device will correctly see data from the CPU, and the CPU will correctly see
>>>>> data from the device; and working out an address for that buffer from the
>>>>> device's point of view to actually hand off to the hardware (which is
>>>>> perfectly well allowed to fail).
>>>
>>>
>>> Agreed. The dma_map_*() API is not guaranteed to return a single
>>> contiguous part of virtual address space for any given SG list.
>>> However it was understood to be able to map buffers contiguously
>>> mappable by the CPU into a single segment and users,
>>> videobuf2-dma-contig in particular, relied on this.
>>
>>
>> I don't follow that - _any_ buffer made of page-sized chunks is going to be
>> mappable contiguously by the CPU;'
>
> Yes it is. Actually the last chunk might not even need to be
> page-sized. However I believe we can have a scatterlist consisting of
> non-page-sized chunks in the middle as well, which is obviously not
> mappable in a contiguous way even for the CPU.
>
>> it's clearly impossible for the streaming
>> DMA API itself to offer such a guarantee, because it's entirely orthogonal
>> to the presence or otherwise of an IOMMU.
>
> But we are talking here about the very IOMMU-specific implementation of DMA API.

Exactly, therein lies the problem! The whole point of an API is that we 
write code against the provided _interface_, not against some particular 
implementation detail. To quote Raymond Chen, "I can't believe I had to 
write that".

I fail to see how anyone would be surprised that code which is reliant 
on specific non-contractual behaviour of a particular API implementation 
is not portable to other implementations of that API.

>> Furthermore, I can't see any existing dma_map_sg implementation (between
>> arm/64 and x86, at least), that _won't_ break that expectation under certain
>> conditions (ranging from "relatively pathological" to "always"), so it still
>> seems questionable to have a dependency on it.
>
> The current implementation for arch/arm doesn't break that
> expectation. As long as we fit inside the maximum segment size (which
> in most, if not all, cases of the hardware that actually requires such
> contiguous mapping to be created, is UINT_MAX).

Well, yes, that just restates my point exactly; outside of certain 
conditions you will still get a non-contiguous mapping. Put that exact 
code on a 64-bit system, throw a scatterlist describing a "relatively 
pathological" 5GB buffer into it, and see what you get out.

> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547
>
>>
>>>>> Consider SWIOTLB's implementation - segments which already lie at
>>>>> physical addresses within the device's DMA mask just get passed through,
>>>>> while those that lie outside it get mapped into the bounce buffer, but still
>>>>> as individual allocations (arch code just handles cache maintenance on the
>>>>> resulting physical addresses and can apply any hard-wired DMA offset for the
>>>>> device concerned).
>>>
>>>
>>> And this is fine for vb2-dma-contig, which was made for devices that
>>> require buffers contiguous in its address space. Without IOMMU it will
>>> allow only physically contiguous buffers and fails otherwise, which is
>>> fine, because it's a hardware requirement.
>>
>>
>> If it depends on having contiguous-from-the-device's-view DMA buffers either
>> way, that's a sign it should perhaps be using the coherent DMA API instead,
>> which _does_ give such a guarantee. I'm well aware of the "but the
>> noncacheable mappings make userspace access unacceptably slow!" issue many
>> folks have with that, though, and don't particularly fancy going off on that
>> tangent here.
>
> The keywords here are DMA-BUF and user pointer. Neither of these cases
> can use coherent DMA API, because the buffer is already allocated, so
> it just needs to be mapped into another device's (or its IOMMU's)
> address space. Obviously we can't guarantee mappability of such
> buffers, e.g. in case of importing non-contiguous buffers to a device
> without an IOMMU. However, we expect the pipelines to be sane
> (physically contiguous buffers or both devices IOMMU-enabled), so that
> such things won't happen.

The "guarantee to map these scatterlist pages contiguously in IOVA space 
if an IOMMU is present" function is named iommu_map_sg(). There is 
nothing in the DMA API offering that behaviour. How well does 
vb2-dma-contig work with the x86 IOMMUs?
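
For completeness, the sort of thing I mean - a rough sketch of a
hypothetical helper, assuming the caller owns the domain and has already
carved a suitable IOVA range out of its own allocator; cleanup of
partial mappings is glossed over:

	static int map_contig(struct iommu_domain *domain,
			      unsigned long iova_base,
			      struct scatterlist *sgl, unsigned int nents,
			      size_t len)
	{
		size_t mapped = iommu_map_sg(domain, iova_base, sgl, nents,
					     IOMMU_READ | IOMMU_WRITE);

		if (mapped < len) {
			if (mapped)
				iommu_unmap(domain, iova_base, mapped);
			return -ENOMEM;
		}
		return 0;
	}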

>>>>>> IIUC, the change above breaks this model by inserting gaps in how the
>>>>>> buffer is mapped to device memory, such that the buffer is no longer
>>>>>> contiguous in dma address space.
>>>>>
>>>>>
>>>>>
>>>>> Even the existing arch/arm IOMMU DMA code which I guess this implicitly
>>>>> relies on doesn't guarantee that behaviour - if the mapping happens to reach
>>>>> one of the segment length/boundary limits it won't just leave a gap, it'll
>>>>> start an entirely new IOVA allocation which could well start at a wildly
>>>>> different address[0].
>>>
>>>
>>> Could you explain segment length/boundary limits and when buffers can
>>> reach them? Sorry, i haven't been following all the discussions, but
>>> I'm not aware of any similar requirements of the IOMMU hardware I
>>> worked with.
>>
>>
>> I hope the explanation at the top makes sense - it's purely about the
>> requirements of the DMA master device itself, nothing to do with the IOMMU
>> (or lack of) in the middle. Devices with scatter-gather DMA limitations
>> exist, therefore the API for scatter-gather DMA is designed to represent and
>> respect such limitations.
>
> Yes, it makes sense, thanks for the explanation. However there also
> exist devices with no scatter-gather capability, but behind an IOMMU
> without such fancy mapping limitations. I believe we should also
> respect the limitation of such setups, which is the lack of support
> for multiple IOVA segments.
>
>>>>>> So, is the videobuf2-dma-contig.c based on an incorrect assumption
>>>>>> about how the DMA API is supposed to work?
>>>>>> Is it even possible to map a "contiguous-in-iova-range" mapping for a
>>>>>> buffer given as an sg_table with an arbitrary set of pages?
>>>>>
>>>>>
>>>>>
>>>>>   From the Streaming DMA mappings section of Documentation/DMA-API.txt:
>>>>>
>>>>>     Note also that the above constraints on physical contiguity and
>>>>>     dma_mask may not apply if the platform has an IOMMU (a device which
>>>>>     maps an I/O DMA address to a physical memory address).  However, to
>>>>> be
>>>>>     portable, device driver writers may *not* assume that such an IOMMU
>>>>>     exists.
>>>>>
>>>>> There's not strictly any harm in using the DMA API this way and *hoping*
>>>>> you get what you want, as long as you're happy for it to fail pretty much
>>>>> 100% of the time on some systems, and still in a minority of corner cases on
>>>>> any system.
>>>
>>>
>>> Could you please elaborate? I'd like to see examples, because I can't
>>> really imagine buffers mappable contiguously on CPU, but not on IOMMU.
>>> Also, as I said, the hardware I worked with didn't suffer from
>>> problems like this.
>>
>>
>> "...device driver writers may *not* assume that such an IOMMU exists."
>>
>
> And this is exactly why they _should_ use dma_map_sg(), because it was
> supposed to work correctly for both physically contiguous (i.e. 1
> segment) buffers and non-IOMMU-enabled devices, as well as with
> non-contiguous (i.e. > 1 segment) buffers and IOMMU-enabled devices.

Note that the number of segments has nothing to do with whether they are 
contiguous (in any address space) or not.

In fact, while I've been thinking about this I realise we have another 
misapprehension here: the point of dma_parms is to expose a device's 
scatter-gather capabilities to _restrict_ what an IOMMU-based DMA API 
implementation can do (see 6b7b65105522) - thus setting fake 
"restrictions" for non-scatter-gather hardware in an attempt to force an 
implementation into merging segments is entirely backwards.
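
The intended usage is along these lines - a hypothetical driver
advertising its genuine hardware limits at probe time (the values here
are invented):

	static int foo_probe(struct platform_device *pdev)
	{
		struct device *dev = &pdev->dev;

		dev->dma_parms = devm_kzalloc(dev, sizeof(*dev->dma_parms),
					      GFP_KERNEL);
		if (!dev->dma_parms)
			return -ENOMEM;

		/* real per-entry limit and boundary of this DMA engine */
		dma_set_max_seg_size(dev, SZ_64K);
		dma_set_seg_boundary(dev, DMA_BIT_MASK(32));
		return 0;
	}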

>>>>> However, if there's a real dependency on IOMMUs and tight control of
>>>>> IOVA allocation here, then the DMA API isn't really the right tool for the
>>>>> job, and maybe it's time to start looking to how to better fit these
>>>>> multimedia-subsystem-type use cases into the IOMMU API - as far as I
>>>>> understand it there's at least some conceptual overlap with the HSA PASID
>>>>> stuff being prototyped in PCI/x86-land at the moment, so it could be an
>>>>> apposite time to try and bang out some common requirements.
>>>
>>>
>>> The DMA API is actually the only good tool to use here to keep the
>>> videobuf2-dma-contig code away from the knowledge about platform
>>> specific data, e.g. presence of IOMMU. The only thing it knows is that
>>> the target hardware requires a single contiguous buffer and it relies
>>> on the fact that in correct cases the buffer given to it will meet
>>> this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable
>>> with IOMMU).
>>
>>
>> As above; the DMA API guarantees only what the DMA API guarantees. An
>> IOMMU-based implementation of streaming DMA is free to identity-map pages if
>> it only cares about device isolation; a non-IOMMU implementation is free to
>> provide streaming DMA remapping via some elaborate bounce-buffering scheme
>
> I guess this is the area where our understandings of IOMMU-backed DMA
> API differ.

The DMA API provides a hardware-independent abstraction of a set of 
operations for exposing kernel memory to devices. When someone calls a 
DMA API function, they don't get to choose the details of that 
abstraction, and they don't get to choose the semantics of those operations.

Of course they can always go ahead and propose adding something to the 
API, if they really believe there's something else it needs to offer.

>> if it really wants to. GART-type IOMMUs... let's not even go there.
>
> I believe that's how IOMMU-based implementation of DMA API was
> supposed to work when first implemented for ARM...
 >
>> If v4l needs a guarantee of a single contiguous DMA buffer, then it needs to
>> use dma_alloc_coherent() for that, not streaming mappings.
>
> Except that it can't use it, because the buffers are already allocated
> by another entity.

	dma_alloc_coherent(...
	for_each_sg(..
		memcpy(...
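
Spelled out a little, purely as an illustrative sketch (the name and
signature are invented, the pages are assumed to be kernel-mapped, and
the segment lengths are assumed to add up to no more than size):

	static void *bounce_to_coherent(struct device *dev,
					struct scatterlist *sgl,
					unsigned int nents, size_t size,
					dma_addr_t *dma)
	{
		struct scatterlist *sg;
		void *buf, *p;
		int i;

		buf = dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
		if (!buf)
			return NULL;

		p = buf;
		for_each_sg(sgl, sg, nents, i) {
			memcpy(p, sg_virt(sg), sg->length);
			p += sg->length;
		}
		return buf;
	}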

Or v4l is rearchitected such that the userspace pages came from 
mmap()ing a guaranteed-contiguous DMA buffer in the first place. Or 
vb2-dma-contig is rearchitected to use the IOMMU API directly where it 
has an IOMMU dependency. Or someone posts a patch to extend the DMA API 
with a dma_try_to_map_sg_as_contiguously_as_you_can_manage() operation 
that doesn't even necessarily have to depend on an IOMMU... Plenty of 
ways to replace incorrect assumptions with reliable ones.

Or to put it another way; Fast, Easy to implement, Correct: pick two.

With the caveat that for upstream, one of the two _must_ be "Correct".

Robin.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
@ 2015-11-09 13:11                         ` Robin Murphy
  0 siblings, 0 replies; 78+ messages in thread
From: Robin Murphy @ 2015-11-09 13:11 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/11/15 05:12, Tomasz Figa wrote:
> On Wed, Nov 4, 2015 at 2:41 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi Tomasz,
>>
>> On 02/11/15 13:43, Tomasz Figa wrote:
>>>
>>> I'd like to know what is the boundary mask and what hardware imposes
>>> requirements like this. The cost here is not only over-allocating a
>>> little, but making many, many buffers contiguously mappable on the
>>> CPU, unmappable contiguously in IOMMU, which just defeats the purpose
>>> of having an IOMMU, which I believe should be there for simple IP
>>> blocks taking one DMA address to be able to view the buffer the same
>>> way as the CPU.
>>
>>
>> The expectation with dma_map_sg() is that you're either going to be
>> iterating over the buffer segments, handing off each address to the device
>> to process one by one;
>
> My understanding of a scatterlist was that it represents a buffer as a
> whole, by joining together its physically discontinuous segments.

It can, but there are also cases where a single scatterlist is used to 
batch up multiple I/O requests - see the stuff in block/blk-merge.c as 
described in section 2.2 of Documentation/biodoc.txt, and AFAICS anyone 
could quite happily use the dmaengine API, and possibly others, in the 
same way. Ultimately a scatterlist is no more specific than "a list of 
blocks of physical memory that each want giving a DMA address".

> I don't see how single segments (layout of which is completely up to
> the allocator; often just single pages) would be usable for hardware
> that needs to do some work more serious than just writing a byte
> stream continuously to subsequent buffers. In case of such simple
> devices you don't even need an IOMMU (for means other than protection
> and/or getting over address space limitations).
>
> However, IMHO the most important use case of an IOMMU is to make
> buffers, which are contiguous in CPU virtual address space (VA),
> contiguous in device's address space (IOVA). Your implementation of
> dma_map_sg() effectively breaks this ability, so I'm not really
> following why it's located under drivers/iommu and supposed to be used
> with IOMMU-enabled platforms...
>
>> or you have a scatter-gather-capable device, in which
>> case you hand off the whole list at once.
>
> No need for mapping ability of the IOMMU here as well (except for
> working around address space issues, as I mentioned above).

Ok, now I'm starting to wonder if you're wilfully choosing to miss the 
point. Look at 64-bit systems of any architecture, and those address 
space issues are pretty much the primary consideration for including an 
IOMMU in the first place (behind virtualisation, which we can forget 
about here). Take the Juno board on my desk - most of the peripherals 
cannot address 75% of the RAM, and CPU bounce buffers are both not 
overly efficient and a limited resource (try using dmatest with 
sufficiently large buffers to stress/measure memory bandwidth and watch 
it take down the kernel, and that's without any other SWIOTLB 
contention). The only one that really cares at all about contiguous 
buffers is the HDLCD, but that's perfectly happy when it calls 
dma_alloc_coherent() via drm_fb_cma_helper and pulls a contiguous 8MB 
framebuffer out of thin air, without even knowing that CMA itself is 
disabled and it couldn't natively address 75% of the memory that might 
be backing that buffer.

That last point also illustrates that the thing for providing 
DMA-contiguous buffers is indeed very good at providing DMA-contiguous 
buffers when backed by an IOMMU.

>> It's in the latter case where you
>> have to make sure the list doesn't exceed the hardware limitations of that
>> device. I believe the original concern was disk controllers (the
>> introduction of dma_parms seems to originate from the linux-scsi list), but
>> most scatter-gather engines are going to have some limit on how much they
>> can handle per entry (IMO the dmaengine drivers are the easiest example to
>> look at).
>>
>> Segment boundaries are a little more arcane, but my assumption is that they
>> relate to the kind of devices whose addressing is not flat but relative to
>> some separate segment register (The "64-bit" mode of USB EHCI is one
>> concrete example I can think of) - since you cannot realistically change the
>> segment register while the device is in the middle of accessing a single
>> buffer entry, that entry must not fall across a segment boundary or at some
>> point the device's accesses are going to overflow the offset address bits
>> and wrap around to bogus addresses at the bottom of the segment.
>
> The two requirements above sound like something really specific to
> scatter-gather-capable hardware, which as I pointed above, barely need
> an IOMMU (at least its mapping capabilities). We are talking here
> about very IOMMU-specific code, though...
>
> Now, while I see that on some systems there might be IOMMU used for
> improving protection and working around addressing issues with
> SG-capable hardware, the code shouldn't be breaking the majority of
> systems with IOMMU used as the only possible way to make physically
> discontinuous memory appear (IO-virtually) contiguous to devices incapable of
> scatter-gather.

Unless this majority of systems are all 64-bit ARMv8 ones running code 
that works perfectly _with the existing SWIOTLB DMA API implementation_ 
but not with this implementation, then I disagree that anything is being 
broken that wasn't already broken with respect to portability. 
Otherwise, please give me the details of any regressions with these 
patches relative to SWIOTLB DMA on arm64 so I can look into them.

>> Now yes, it will be possible under _most_ circumstances to use an IOMMU to
>> lay out a list of segments with page-aligned lengths within a single IOVA
>> allocation whilst still meeting all the necessary constraints. It just needs
>> some unavoidably complicated calculations - quite likely significantly more
>> complex than my v5 version of map_sg() that tried to do that and merge
>> segments but failed to take the initial alignment into account properly -
>> since there are much simpler ways to enforce just the _necessary_ behaviour
>> for the DMA API, I put the complicated stuff to one side for now to prevent
>> it holding up getting the basic functional support in place.
>
> Somehow just whatever currently done in arch/arm/mm/dma-mapping.c was
> sufficient and not overly complicated.
>
> See http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 .
>
> I can see that the code there at least tries to comply with maximum
> segment size constraint. Segment boundary seems to be ignored, though.

It certainly doesn't map the entire list into a single IOVA allocation 
as here (such that everything is laid out in contiguous IOVA pages 
_regardless_ of the segment lengths, and unmapping becomes nicely 
trivial). That it also is the only implementation which fails to respect 
segment boundaries really just implies that it's probably not seen much 
use beyond supporting graphics hardware on 32-bit systems, and/or has 
just got lucky otherwise.

> However, I'm convinced that in most (if not all) cases where IOMMU
> IOVA-contiguous mapping is needed, those two requirements don't exist.
> Do we really have to break the good hardware only because the
> bad^Wlimited one is broken?

Where "is broken" at least encompasses "is a SATA controller", 
presumably. Here's an example I've actually played with:

http://lxr.free-electrons.com/source/drivers/ata/sata_sil24.c#L390

It doesn't seem all that unreasonable that hardware that fundamentally 
works in fixed-size blocks of data wants its data aligned to its block 
size (or some efficient multiple). Implementing an API which has 
guaranteed support for that requirement from the outset necessitates 
supporting that requirement. I'm not going to buy the argument that 
having some video device DMA into userspace pages is more important than 
being able to boot at all (and not corrupting your filesystem).

> Couldn't we preserve the ARM-like behavior whenever
> dma_parms->segment_boundary_mask is set to all 1s and
> dma_parms->max_segment_size to UINT_MAX (what currently drivers used
> to set) or 0 (sounds more logical for the meaning of "no maximum
> given")?

Sure, I was always aiming to ultimately improve on the arch/arm 
implementation (i.e. with the single allocation thing), but for a common 
general-purpose implementation that's going to be shared by multiple 
architectures, correctness comes way before optimisation for one 
specific use-case. Thus we start with a baseline version that we know 
correctly implements all the required behaviour specified by the DMA 
API, then start tweaking it for other considerations later.

FWIW, I've already sketched out such a follow-on patch to start 
tightening up map_sg (because exposing any pages to the device more than 
absolutely necessary is not what we want in the long run). The thought 
that it's likely to be jumped on and used as an excuse to justify bad 
code elsewhere does rather sour the idea, though.

>>>>>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
>>>>>> memory pages into a contiguous block in device memory address space.
>>>>>> This would allow passing a dma mapped buffer to device dma using just
>>>>>> a device address and length.
>>>>>
>>>>>
>>>>>
>>>>> Not at all. The streaming DMA API (dma_map_* and friends) has two
>>>>> responsibilities: performing any necessary cache maintenance to ensure the
>>>>> device will correctly see data from the CPU, and the CPU will correctly see
>>>>> data from the device; and working out an address for that buffer from the
>>>>> device's point of view to actually hand off to the hardware (which is
>>>>> perfectly well allowed to fail).
>>>
>>>
>>> Agreed. The dma_map_*() API is not guaranteed to return a single
>>> contiguous part of virtual address space for any given SG list.
>>> However it was understood to be able to map buffers contiguously
>>> mappable by the CPU into a single segment and users,
>>> videobuf2-dma-contig in particular, relied on this.
>>
>>
>> I don't follow that - _any_ buffer made of page-sized chunks is going to be
>> mappable contiguously by the CPU;
>
> Yes it is. Actually the last chunk might not even need to be
> page-sized. However I believe we can have a scatterlist consisting of
> non-page-sized chunks in the middle as well, which is obviously not
> mappable in a contiguous way even for the CPU.
>
>> it's clearly impossible for the streaming
>> DMA API itself to offer such a guarantee, because it's entirely orthogonal
>> to the presence or otherwise of an IOMMU.
>
> But we are talking here about the very IOMMU-specific implementation of DMA API.

Exactly, therein lies the problem! The whole point of an API is that we 
write code against the provided _interface_, not against some particular 
implementation detail. To quote Raymond Chen, "I can't believe I had to 
write that".

I fail to see how anyone would be surprised that code which is reliant 
on specific non-contractual behaviour of a particular API implementation 
is not portable to other implementations of that API.

>> Furthermore, I can't see any existing dma_map_sg implementation (between
>> arm/64 and x86, at least), that _won't_ break that expectation under certain
>> conditions (ranging from "relatively pathological" to "always"), so it still
>> seems questionable to have a dependency on it.
>
> The current implementation for arch/arm doesn't break that
> expectation. As long as we fit inside the maximum segment size (which
> in most, if not all, cases of the hardware that actually requires such
> contiguous mapping to be created, is UINT_MAX).

Well, yes, that just restates my point exactly; outside of certain 
conditions you will still get a non-contiguous mapping. Put that exact 
code on a 64-bit system, throw a scatterlist describing a "relatively 
pathological" 5GB buffer into it, and see what you get out.

> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547
>
>>
>>>>> Consider SWIOTLB's implementation - segments which already lie at
>>>>> physical addresses within the device's DMA mask just get passed through,
>>>>> while those that lie outside it get mapped into the bounce buffer, but still
>>>>> as individual allocations (arch code just handles cache maintenance on the
>>>>> resulting physical addresses and can apply any hard-wired DMA offset for the
>>>>> device concerned).
>>>
>>>
>>> And this is fine for vb2-dma-contig, which was made for devices that
>>> require buffers contiguous in its address space. Without IOMMU it will
>>> allow only physically contiguous buffers and fails otherwise, which is
>>> fine, because it's a hardware requirement.
>>
>>
>> If it depends on having contiguous-from-the-device's-view DMA buffers either
>> way, that's a sign it should perhaps be using the coherent DMA API instead,
>> which _does_ give such a guarantee. I'm well aware of the "but the
>> noncacheable mappings make userspace access unacceptably slow!" issue many
>> folks have with that, though, and don't particularly fancy going off on that
>> tangent here.
>
> The keywords here are DMA-BUF and user pointer. Neither of these cases
> can use coherent DMA API, because the buffer is already allocated, so
> it just needs to be mapped into another device's (or its IOMMU's)
> address space. Obviously we can't guarantee mappability of such
> buffers, e.g. in the case of importing non-contiguous buffers to a device
> without an IOMMU. However, we expect the pipelines to be sane
> (physically contiguous buffers or both devices IOMMU-enabled), so that
> such things won't happen.

The "guarantee to map these scatterlist pages contiguously in IOVA space 
if an IOMMU is present" function is named iommu_map_sg(). There is 
nothing in the DMA API offering that behaviour. How well does 
vb2-dma-contig work with the x86 IOMMUs?
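
(For reference, a minimal sketch of what driving iommu_map_sg() directly
looks like; the fixed base IOVA and the absence of any allocator or real
error handling are assumptions made purely for illustration, and the
scatterlist is assumed to already be IOMMU-page-aligned.)

#include <linux/errno.h>
#include <linux/iommu.h>
#include <linux/scatterlist.h>

/*
 * Sketch only: map an already-built, page-aligned scatterlist
 * back-to-back into a caller-chosen IOVA range via the IOMMU API.
 */
static int example_map_contiguous(struct iommu_domain *domain,
				  struct scatterlist *sgl, unsigned int nents,
				  unsigned long base, size_t total_len)
{
	size_t mapped = iommu_map_sg(domain, base, sgl, nents,
				     IOMMU_READ | IOMMU_WRITE);

	if (mapped < total_len) {
		iommu_unmap(domain, base, mapped);
		return -ENOMEM;
	}
	return 0;
}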

>>>>>> IIUC, the change above breaks this model by inserting gaps in how the
>>>>>> buffer is mapped to device memory, such that the buffer is no longer
>>>>>> contiguous in dma address space.
>>>>>
>>>>>
>>>>>
>>>>> Even the existing arch/arm IOMMU DMA code which I guess this implicitly
>>>>> relies on doesn't guarantee that behaviour - if the mapping happens to reach
>>>>> one of the segment length/boundary limits it won't just leave a gap, it'll
>>>>> start an entirely new IOVA allocation which could well start at a wildly
>>>>> different address[0].
>>>
>>>
>>> Could you explain segment length/boundary limits and when buffers can
>>> reach them? Sorry, I haven't been following all the discussions, but
>>> I'm not aware of any similar requirements of the IOMMU hardware I
>>> worked with.
>>
>>
>> I hope the explanation at the top makes sense - it's purely about the
>> requirements of the DMA master device itself, nothing to do with the IOMMU
>> (or lack of) in the middle. Devices with scatter-gather DMA limitations
>> exist, therefore the API for scatter-gather DMA is designed to represent and
>> respect such limitations.
>
> Yes, it makes sense, thanks for the explanation. However there also
> exist devices with no scatter-gather capability, but behind an IOMMU
> without such fancy mapping limitations. I believe we should also
> respect the limitation of such setups, which is the lack of support
> for multiple IOVA segments.
>
>>>>>> So, is the videobuf2-dma-contig.c based on an incorrect assumption
>>>>>> about how the DMA API is supposed to work?
>>>>>> Is it even possible to map a "contiguous-in-iova-range" mapping for a
>>>>>> buffer given as an sg_table with an arbitrary set of pages?
>>>>>
>>>>>
>>>>>
>>>>>   From the Streaming DMA mappings section of Documentation/DMA-API.txt:
>>>>>
>>>>>     Note also that the above constraints on physical contiguity and
>>>>>     dma_mask may not apply if the platform has an IOMMU (a device which
>>>>>     maps an I/O DMA address to a physical memory address).  However, to
>>>>> be
>>>>>     portable, device driver writers may *not* assume that such an IOMMU
>>>>>     exists.
>>>>>
>>>>> There's not strictly any harm in using the DMA API this way and *hoping*
>>>>> you get what you want, as long as you're happy for it to fail pretty much
>>>>> 100% of the time on some systems, and still in a minority of corner cases on
>>>>> any system.
>>>
>>>
>>> Could you please elaborate? I'd like to see examples, because I can't
>>> really imagine buffers mappable contiguously on CPU, but not on IOMMU.
>>> Also, as I said, the hardware I worked with didn't suffer from
>>> problems like this.
>>
>>
>> "...device driver writers may *not* assume that such an IOMMU exists."
>>
>
> And this is exactly why they _should_ use dma_map_sg(), because it was
> supposed to work correctly for both physically contiguous (i.e. 1
> segment) buffers and non-IOMMU-enabled devices, as well as with
> non-contiguous (i.e. > 1 segment) buffers and IOMMU-enabled devices.

Note that the number of segments has nothing to do with whether they are 
contiguous (in any address space) or not.

In fact, while I've been thinking about this I realise we have another 
misapprehension here: the point of dma_parms is to expose a device's 
scatter-gather capabilities to _restrict_ what an IOMMU-based DMA API 
implementation can do (see 6b7b65105522) - thus setting fake 
"restrictions" for non-scatter-gather hardware in an attempt to force an 
implementation into merging segments is entirely backwards.
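
For reference, dma_parms is the mechanism those capabilities are meant to
be described through; here is a minimal sketch (the 64K limits are
invented for illustration, while the dma_set_* helpers are the real
interface):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/sizes.h>

/*
 * Sketch of how a genuinely scatter-gather-limited driver declares its
 * constraints, which an IOMMU-backed dma_map_sg() implementation is
 * then expected to honour. The specific values are illustrative only.
 */
static int example_declare_dma_constraints(struct device *dev,
					   struct device_dma_parameters *parms)
{
	dev->dma_parms = parms;

	/* No single segment may exceed 64K... */
	dma_set_max_seg_size(dev, SZ_64K);

	/* ...and no segment may cross a 64K-aligned boundary. */
	return dma_set_seg_boundary(dev, SZ_64K - 1);
}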

>>>>> However, if there's a real dependency on IOMMUs and tight control of
>>>>> IOVA allocation here, then the DMA API isn't really the right tool for the
>>>>> job, and maybe it's time to start looking to how to better fit these
>>>>> multimedia-subsystem-type use cases into the IOMMU API - as far as I
>>>>> understand it there's at least some conceptual overlap with the HSA PASID
>>>>> stuff being prototyped in PCI/x86-land at the moment, so it could be an
>>>>> apposite time to try and bang out some common requirements.
>>>
>>>
>>> The DMA API is actually the only good tool to use here to keep the
>>> videobuf2-dma-contig code away from the knowledge about platform
>>> specific data, e.g. presence of IOMMU. The only thing it knows is that
>>> the target hardware requires a single contiguous buffer and it relies
>>> on the fact that in correct cases the buffer given to it will meet
>>> this requirement (i.e. physically contiguous w/o IOMMU; CPU mappable
>>> with IOMMU).
>>
>>
>> As above; the DMA API guarantees only what the DMA API guarantees. An
>> IOMMU-based implementation of streaming DMA is free to identity-map pages if
>> it only cares about device isolation; a non-IOMMU implementation is free to
>> provide streaming DMA remapping via some elaborate bounce-buffering scheme
>
> I guess this is the area where our understandings of IOMMU-backed DMA
> API differ.

The DMA API provides a hardware-independent abstraction of a set of 
operations for exposing kernel memory to devices. When someone calls a 
DMA API function, they don't get to choose the details of that 
abstraction, and they don't get to choose the semantics of those operations.

Of course they can always go ahead and propose adding something to the 
API, if they really believe there's something else it needs to offer.

>> if it really wants to. GART-type IOMMUs... let's not even go there.
>
> I believe that's how IOMMU-based implementation of DMA API was
> supposed to work when first implemented for ARM...
>
>> If v4l needs a guarantee of a single contiguous DMA buffer, then it needs to
>> use dma_alloc_coherent() for that, not streaming mappings.
>
> Except that it can't use it, because the buffers are already allocated
> by another entity.

	dma_alloc_coherent(...
	for_each_sg(..
		memcpy(...
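
(Spelled out a little more fully, that amounts to something like the
sketch below; purely illustrative, assuming lowmem pages so sg_virt() is
usable, and ignoring error handling and the eventual dma_free_coherent().)

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/string.h>

/*
 * Illustrative only: bounce a scattered buffer into one
 * guaranteed-contiguous DMA allocation.
 */
static void *example_bounce_to_coherent(struct device *dev,
					struct scatterlist *sgl, int nents,
					size_t size, dma_addr_t *dma)
{
	struct scatterlist *s;
	void *buf, *p;
	int i;

	buf = dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
	if (!buf)
		return NULL;

	p = buf;
	for_each_sg(sgl, s, nents, i) {
		memcpy(p, sg_virt(s), s->length);
		p += s->length;
	}
	return buf;
}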

Or v4l is rearchitected such that the userspace pages came from 
mmap()ing a guaranteed-contiguous DMA buffer in the first place. Or 
vb2-dma-contig is rearchitected to use the IOMMU API directly where it 
has an IOMMU dependency. Or someone posts a patch to extend the DMA API 
with a dma_try_to_map_sg_as_contiguously_as_you_can_manage() operation 
that doesn't even necessarily have to depend on an IOMMU... Plenty of 
ways to replace incorrect assumptions with reliable ones.

Or to put it another way; Fast, Easy to implement, Correct: pick two.

With the caveat that for upstream, one of the two _must_ be "Correct".

Robin.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping
  2015-10-30  1:17             ` Daniel Kurtz
@ 2015-11-17 12:02               ` Marek Szyprowski
  -1 siblings, 0 replies; 78+ messages in thread
From: Marek Szyprowski @ 2015-11-17 12:02 UTC (permalink / raw)
  To: Daniel Kurtz, Robin Murphy, Pawel Osciak
  Cc: Yong Wu, Joerg Roedel, Will Deacon, Catalin Marinas,
	open list:IOMMU DRIVERS, linux-arm-kernel, thunder.leizhen,
	Yingjoe Chen, laurent.pinchart+renesas, Thierry Reding,
	Lin PoChun, Bobby Batacharia (via Google Docs),
	linux-media, Kyungmin Park, Tomasz Figa,
	Russell King - ARM Linux, Bartlomiej Zolnierkiewicz

Hello,

I'm really sorry for joining this discussion so late, but I was terribly
busy with other things.

On 2015-10-30 02:17, Daniel Kurtz wrote:
> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
> v4l2-contig's usage of the DMA API.
>
> Hi Robin,
>
> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> On 26/10/15 13:44, Yong Wu wrote:
>>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>>> [...]
>>>> +/*
>>>> + * The DMA API client is passing in a scatterlist which could describe
>>>> + * any old buffer layout, but the IOMMU API requires everything to be
>>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of
>>>> + * impedance-matching, to be able to hand off a suitably-aligned list,
>>>> + * but still preserve the original offsets and sizes for the caller.
>>>> + */
>>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>>> +               int nents, int prot)
>>>> +{
>>>> +       struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>>>> +       struct iova_domain *iovad = domain->iova_cookie;
>>>> +       struct iova *iova;
>>>> +       struct scatterlist *s, *prev = NULL;
>>>> +       dma_addr_t dma_addr;
>>>> +       size_t iova_len = 0;
>>>> +       int i;
>>>> +
>>>> +       /*
>>>> +        * Work out how much IOVA space we need, and align the segments
>>>> to
>>>> +        * IOVA granules for the IOMMU driver to handle. With some clever
>>>> +        * trickery we can modify the list in-place, but reversibly, by
>>>> +        * hiding the original data in the as-yet-unused DMA fields.
>>>> +        */
>>>> +       for_each_sg(sg, s, nents, i) {
>>>> +               size_t s_offset = iova_offset(iovad, s->offset);
>>>> +               size_t s_length = s->length;
>>>> +
>>>> +               sg_dma_address(s) = s->offset;
>>>> +               sg_dma_len(s) = s_length;
>>>> +               s->offset -= s_offset;
>>>> +               s_length = iova_align(iovad, s_length + s_offset);
>>>> +               s->length = s_length;
>>>> +
>>>> +               /*
>>>> +                * The simple way to avoid the rare case of a segment
>>>> +                * crossing the boundary mask is to pad the previous one
>>>> +                * to end at a naturally-aligned IOVA for this one's
>>>> size,
>>>> +                * at the cost of potentially over-allocating a little.
>>>> +                */
>>>> +               if (prev) {
>>>> +                       size_t pad_len = roundup_pow_of_two(s_length);
>>>> +
>>>> +                       pad_len = (pad_len - iova_len) & (pad_len - 1);
>>>> +                       prev->length += pad_len;
>>>
>>> Hi Robin,
>>>         During our v4l2 testing, it seems that we hit a problem here.
>>>         Here we update prev->length again; do we need to update
>>> sg_dma_len(prev) again too?
>>>
>>>         Some functions like vb2_dc_get_contiguous_size[1] always use
>>> sg_dma_len(s) for comparison instead of s->length, so they may break
>>> unexpectedly when sg_dma_len(s) is not the same as s->length.
>>
>> This is just tweaking the faked-up length that we hand off to iommu_map_sg()
>> (see also the iova_align() above), to trick it into bumping this segment up
>> to a suitable starting IOVA. The real length at this point is stashed in
>> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
>> both will hold the same true length once we return to the caller.
>>
>> Yes, it does mean that if you have a list where the segment lengths are page
>> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
>> still end up with a gap between the second and third segments, but that's
>> fine because the DMA API offers no guarantees about what the resulting DMA
>> addresses will be (consider the no-IOMMU case where they would each just be
>> "mapped" to their physical address). If that breaks v4l, then it's probably
>> v4l's DMA API use that needs looking at (again).
> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
> memory pages into a contiguous block in device memory address space.
> This would allow passing a dma mapped buffer to device dma using just
> a device address and length.
> IIUC, the change above breaks this model by inserting gaps in how the
> buffer is mapped to device memory, such that the buffer is no longer
> contiguous in dma address space.
>
> Here is the code in question from
> drivers/media/v4l2-core/videobuf2-dma-contig.c :
>
> static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt)
> {
>          struct scatterlist *s;
>          dma_addr_t expected = sg_dma_address(sgt->sgl);
>          unsigned int i;
>          unsigned long size = 0;
>
>          for_each_sg(sgt->sgl, s, sgt->nents, i) {
>                  if (sg_dma_address(s) != expected)
>                          break;
>                  expected = sg_dma_address(s) + sg_dma_len(s);
>                  size += sg_dma_len(s);
>          }
>          return size;
> }
>
>
> static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
>          unsigned long size, enum dma_data_direction dma_dir)
> {
>          struct vb2_dc_conf *conf = alloc_ctx;
>          struct vb2_dc_buf *buf;
>          struct frame_vector *vec;
>          unsigned long offset;
>          int n_pages, i;
>          int ret = 0;
>          struct sg_table *sgt;
>          unsigned long contig_size;
>          unsigned long dma_align = dma_get_cache_alignment();
>          DEFINE_DMA_ATTRS(attrs);
>
>          dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs);
>
>          buf = kzalloc(sizeof *buf, GFP_KERNEL);
>          buf->dma_dir = dma_dir;
>
>          offset = vaddr & ~PAGE_MASK;
>          vec = vb2_create_framevec(vaddr, size, dma_dir == DMA_FROM_DEVICE);
>          buf->vec = vec;
>          n_pages = frame_vector_count(vec);
>
>          sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
>
>          ret = sg_alloc_table_from_pages(sgt, frame_vector_pages(vec), n_pages,
>                  offset, size, GFP_KERNEL);
>
>          sgt->nents = dma_map_sg_attrs(buf->dev, sgt->sgl, sgt->orig_nents,
>                                        buf->dma_dir, &attrs);
>
>          contig_size = vb2_dc_get_contiguous_size(sgt);
>          if (contig_size < size) {
>
>      <<<===   if the original buffer had sg entries that were not
> aligned on the "natural" alignment for their size, the new arm64 iommu
> core code inserts  a 'gap' in the iommu mapping, which causes
> vb2_dc_get_contiguous_size() to exit early (and return a smaller size
> than expected).
>
>                  pr_err("contiguous mapping is too small %lu/%lu\n",
>                          contig_size, size);
>                  ret = -EFAULT;
>                  goto fail_map_sg;
>          }
>
>
> So, is the videobuf2-dma-contig.c based on an incorrect assumption
> about how the DMA API is supposed to work?
> Is it even possible to map a "contiguous-in-iova-range" mapping for a
> buffer given as an sg_table with an arbitrary set of pages?
>
> Thanks for helping to move this forward.

As I'm responsible for both the dma-mapping IOMMU integration code for
ARM and the videobuf2-dc subdriver, I would like to share the background
behind them.

This code is a result of our (Samsung R&D Institute Poland) work on
mainlining drivers for various multimedia devices found in Exynos SoCs.
All those devices can only process buffers which are contiguous in the
DMA address space. This requirement was one of the fundamental reasons
for using IOMMU modules. However, it turned out that there was no
straightforward way of integrating them for our purposes, and some
extensions to the core frameworks were needed.

Our initial proposal integrated IOMMU drivers directly into the V4L2
helper code, as a separate memory-managing subdriver for videobuf2:
http://www.spinics.net/lists/linux-media/msg31455.html

Then I was advised to use the dma-mapping API and hide the IOMMU behind it.

Allocating a buffer suitable for DMA with the IOMMU mapper enabled was
easy. However, creating a contiguous mapping in DMA address space for a
buffer scattered in physical memory was still a bit tricky. The only
possible way was to use a scatterlist and assume that dma-mapping would
do it right (i.e. create only one segment for the whole scatterlist).
I found no other possibility. This part of the IOMMU and dma-mapping
integration became especially problematic when buffer sharing (dma-buf)
was introduced and it turned out that different drivers interpret
scatterlists in different ways.

In hindsight, I can see that I focused mainly on re-using the existing
dma-mapping API, which shouldn't have been the main goal. This resulted
in a somewhat tricky way of doing one of the most common operations for
existing multimedia devices. Scatterlists are also a bit over-engineered
for such a simple operation as mapping scattered memory into a single
contiguous DMA address space; they waste memory storing parameters that
are useless here, like the per-page offset and DMA address/length.

Maybe it would be better if something like a page vector (or a PFN
vector, to solve the problem of mapping buffers that cannot be described
by pages) had been introduced, and operations like dma_map_vector()
would make things much clearer. I can provide proof-of-concept code for
further discussion if needed.
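
To make the idea concrete, here is a purely hypothetical sketch of what
such an interface could look like (none of these names exist in the
kernel; they are assumptions for discussion only):

#include <linux/device.h>
#include <linux/dma-direction.h>
#include <linux/types.h>

/* Hypothetical PFN-vector descriptor; not an existing kernel structure. */
struct pfn_vector {
	unsigned long *pfns;	/* page frame numbers backing the buffer */
	unsigned int nr_pfns;	/* number of entries in @pfns */
	size_t offset;		/* offset of the data within the first page */
	size_t size;		/* total buffer size in bytes */
};

/*
 * Hypothetical operation: map the whole vector into one contiguous DMA
 * address range (or fail), returning the DMA address of its start.
 */
dma_addr_t dma_map_vector(struct device *dev, struct pfn_vector *vec,
			  enum dma_data_direction dir);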

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 78+ messages in thread


end of thread, other threads:[~2015-11-17 12:02 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-01 19:13 [PATCH v6 0/3] arm64: IOMMU-backed DMA mapping Robin Murphy
2015-10-01 19:13 ` Robin Murphy
     [not found] ` <cover.1443718557.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
2015-10-01 19:13   ` [PATCH v6 1/3] iommu: Implement common IOMMU ops for " Robin Murphy
2015-10-01 19:13     ` Robin Murphy
     [not found]     ` <ab8e1caa40d6da1afa4a49f30242ef4e6e1f17df.1443718557.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
2015-10-26 13:44       ` Yong Wu
2015-10-26 13:44         ` Yong Wu
2015-10-26 16:55         ` Robin Murphy
2015-10-26 16:55           ` Robin Murphy
2015-10-30  1:17           ` Daniel Kurtz
2015-10-30  1:17             ` Daniel Kurtz
2015-10-30 14:09             ` Joerg Roedel
2015-10-30 14:09               ` Joerg Roedel
     [not found]               ` <20151030140923.GJ27420-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>
2015-10-30 18:18                 ` Mark Hounschell
2015-10-30 14:27             ` Robin Murphy
2015-10-30 14:27               ` Robin Murphy
2015-11-02 13:11               ` Daniel Kurtz
2015-11-02 13:11                 ` Daniel Kurtz
2015-11-02 13:43                 ` Tomasz Figa
2015-11-02 13:43                   ` Tomasz Figa
2015-11-03 17:41                   ` Robin Murphy
2015-11-03 17:41                     ` Robin Murphy
2015-11-03 18:40                     ` Russell King - ARM Linux
2015-11-03 18:40                       ` Russell King - ARM Linux
2015-11-04  5:15                       ` Tomasz Figa
2015-11-04  5:15                         ` Tomasz Figa
2015-11-04  9:10                         ` Russell King - ARM Linux
2015-11-04  9:10                           ` Russell King - ARM Linux
2015-11-04  9:10                           ` Russell King - ARM Linux
2015-11-04  5:12                     ` Tomasz Figa
2015-11-04  5:12                       ` Tomasz Figa
2015-11-04  9:27                       ` Russell King - ARM Linux
2015-11-04  9:27                         ` Russell King - ARM Linux
2015-11-04  9:27                         ` Russell King - ARM Linux
2015-11-04  9:48                         ` Tomasz Figa
2015-11-04  9:48                           ` Tomasz Figa
2015-11-04 10:50                           ` Russell King - ARM Linux
2015-11-04 10:50                             ` Russell King - ARM Linux
2015-11-04 10:50                             ` Russell King - ARM Linux
2015-11-09 13:11                       ` Robin Murphy
2015-11-09 13:11                         ` Robin Murphy
2015-11-17 12:02             ` Marek Szyprowski
2015-11-17 12:02               ` Marek Szyprowski
2015-10-01 19:13   ` [PATCH v6 2/3] arm64: Add IOMMU dma_ops Robin Murphy
2015-10-01 19:13     ` Robin Murphy
2015-10-07  9:03     ` Anup Patel
2015-10-07  9:03       ` Anup Patel
     [not found]       ` <CAAhSdy2tpAfH+i=1axDkmRqZixsbVhd-_9VGvpyQ=5e06v=Kpg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-07 16:36         ` Robin Murphy
2015-10-07 16:36           ` Robin Murphy
2015-10-07 17:40           ` Anup Patel
2015-10-07 17:40             ` Anup Patel
     [not found]     ` <80cb035144a2648a5d94eb1fec3336f17ad249f1.1443718557.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
2015-10-06 11:00       ` Yong Wu
2015-10-06 11:00         ` Yong Wu
2015-10-07 16:07         ` Robin Murphy
2015-10-07 16:07           ` Robin Murphy
     [not found]           ` <56154349.8040101-5wv7dgnIgG8@public.gmane.org>
2015-10-09  5:44             ` Yong Wu
2015-10-09  5:44               ` Yong Wu
2015-10-14 11:47       ` Joerg Roedel
2015-10-14 11:47         ` Joerg Roedel
2015-10-14 13:35       ` Catalin Marinas
2015-10-14 13:35         ` Catalin Marinas
     [not found]         ` <20151014133538.GG4239-M2fw3Uu6cmfZROr8t4l/smS4ubULX0JqMm0uRHvK7Nw@public.gmane.org>
2015-10-14 16:34           ` Robin Murphy
2015-10-14 16:34             ` Robin Murphy
2015-11-04  8:39       ` Yong Wu
2015-11-04  8:39         ` Yong Wu
2015-11-04 13:11         ` Robin Murphy
2015-11-04 13:11           ` Robin Murphy
     [not found]           ` <563A0419.9070100-5wv7dgnIgG8@public.gmane.org>
2015-11-04 17:35             ` Laura Abbott
2015-11-04 17:35               ` Laura Abbott
2015-10-01 19:14   ` [PATCH v6 3/3] arm64: Hook up " Robin Murphy
2015-10-01 19:14     ` Robin Murphy
2015-10-13 12:12   ` [PATCH v6 0/3] arm64: IOMMU-backed DMA mapping Robin Murphy
2015-10-13 12:12     ` Robin Murphy
     [not found]     ` <561CF53E.7000809-5wv7dgnIgG8@public.gmane.org>
2015-10-14 11:50       ` joro-zLv9SwRftAIdnm+yROfE0A
2015-10-14 11:50         ` joro at 8bytes.org
     [not found]         ` <20151014115013.GM27420-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>
2015-10-14 18:19           ` Robin Murphy
2015-10-14 18:19             ` Robin Murphy
2015-10-15 15:04   ` Joerg Roedel
2015-10-15 15:04     ` Joerg Roedel
