* [PATCH 0/4] Add dynamic iommu backed bounce buffers
@ 2021-07-07  7:55 David Stevens
  2021-07-07  7:55 ` [PATCH 1/4] dma-iommu: add kalloc gfp flag to alloc helper David Stevens
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: David Stevens @ 2021-07-07  7:55 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon
  Cc: Christoph Hellwig, Sergey Senozhatsky, iommu, linux-kernel,
	David Stevens

Add support for per-domain dynamic pools of iommu bounce buffers to the 
dma-iommu API. This allows iommu mappings to be reused while still
maintaining strict iommu protection. Allocating buffers dynamically
instead of using swiotlb carveouts makes per-domain pools better suited
to systems with large numbers of devices or where devices are unknown.

When enabled, all non-direct streaming mappings below a configurable
size will go through bounce buffers. Note that this means drivers which
don't properly use the DMA API (e.g. i915) cannot use an iommu when this
feature is enabled. However, all drivers which work with swiotlb=force
should work.
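
The feature is selected with CONFIG_IOMMU_IO_BUFFER and tuned through the
num_slots and class_sizes parameters added in this series; the largest
enabled size class bounds what gets bounced. As a rough sketch of boot-time
configuration (the io_buffer_pool. prefix is an assumption based on the
usual naming of built-in module parameters; the values shown are the
defaults, i.e. 1024 slots and classes of 1, 4 and 32 pages):

  CONFIG_IOMMU_IO_BUFFER=y
  # kernel command line:
  io_buffer_pool.num_slots=1024 io_buffer_pool.class_sizes=37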

Bounce buffers serve as an optimization in situations where interactions
with the iommu are very costly. For example, virtio-iommu operations in
a guest on a Linux host require a vmexit, involvement of the VMM, and a
VFIO syscall. For relatively small DMA operations, memcpy can be
significantly faster.

As a performance comparison, on a device with an i5-10210U, I ran fio
with a VFIO passthrough NVMe drive with '--direct=1 --rw=read
--ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k, and
128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
by >99%, as bounce buffers don't require syncing here in the read case.
Running with multiple jobs doesn't serve as a useful performance
comparison because virtio-iommu and vfio_iommu_type1 both have big
locks that significantly limit multithreaded DMA performance.
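
Each fio invocation looked roughly like the following (sketch; the device
path is a placeholder and --bs was varied over the block sizes above):

  fio --name=bounce-test --filename=/dev/nvme0n1 --direct=1 --rw=read \
      --ioengine=libaio --iodepth=64 --bs=4k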

This patch set is based on v5.13-rc7 plus the patches at [1].

David Stevens (4):
  dma-iommu: add kalloc gfp flag to alloc helper
  dma-iommu: replace device arguments
  dma-iommu: expose a few helper functions to module
  dma-iommu: Add iommu bounce buffers to dma-iommu api

 drivers/iommu/Kconfig          |  10 +
 drivers/iommu/Makefile         |   1 +
 drivers/iommu/dma-iommu.c      | 119 ++++--
 drivers/iommu/io-buffer-pool.c | 656 +++++++++++++++++++++++++++++++++
 drivers/iommu/io-buffer-pool.h |  91 +++++
 include/linux/dma-iommu.h      |  12 +
 6 files changed, 861 insertions(+), 28 deletions(-)
 create mode 100644 drivers/iommu/io-buffer-pool.c
 create mode 100644 drivers/iommu/io-buffer-pool.h

-- 
2.32.0.93.g670b81a890-goog



* [PATCH 1/4] dma-iommu: add kalloc gfp flag to alloc helper
  2021-07-07  7:55 [PATCH 0/4] Add dynamic iommu backed bounce buffers David Stevens
@ 2021-07-07  7:55 ` David Stevens
  2021-07-08 17:22   ` Robin Murphy
  2021-07-07  7:55 ` [PATCH 2/4] dma-iommu: replace device arguments David Stevens
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: David Stevens @ 2021-07-07  7:55 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon
  Cc: Christoph Hellwig, Sergey Senozhatsky, iommu, linux-kernel,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Add gfp flag for kalloc calls within __iommu_dma_alloc_pages, so the
function can be called from atomic contexts.
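
As a sketch, a hypothetical atomic-context caller would simply pass an
atomic flag for both allocations (illustrative only, not part of this
patch):

	pages = __iommu_dma_alloc_pages(dev, count, order_mask,
					GFP_ATOMIC, GFP_ATOMIC);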

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/dma-iommu.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 614f0dd86b08..00993b56c977 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -593,7 +593,8 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
 }
 
 static struct page **__iommu_dma_alloc_pages(struct device *dev,
-		unsigned int count, unsigned long order_mask, gfp_t gfp)
+		unsigned int count, unsigned long order_mask,
+		gfp_t page_gfp, gfp_t kalloc_gfp)
 {
 	struct page **pages;
 	unsigned int i = 0, nid = dev_to_node(dev);
@@ -602,15 +603,15 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 	if (!order_mask)
 		return NULL;
 
-	pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
+	pages = kvzalloc(count * sizeof(*pages), kalloc_gfp);
 	if (!pages)
 		return NULL;
 
 	/* IOMMU can map any pages, so himem can also be used here */
-	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
+	page_gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
 
 	/* It makes no sense to muck about with huge pages */
-	gfp &= ~__GFP_COMP;
+	page_gfp &= ~__GFP_COMP;
 
 	while (count) {
 		struct page *page = NULL;
@@ -624,7 +625,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 		for (order_mask &= (2U << __fls(count)) - 1;
 		     order_mask; order_mask &= ~order_size) {
 			unsigned int order = __fls(order_mask);
-			gfp_t alloc_flags = gfp;
+			gfp_t alloc_flags = page_gfp;
 
 			order_size = 1U << order;
 			if (order_mask > order_size)
@@ -680,7 +681,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 
 	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
 	pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
-					gfp);
+					gfp, GFP_KERNEL);
 	if (!pages)
 		return NULL;
 
-- 
2.32.0.93.g670b81a890-goog



* [PATCH 2/4] dma-iommu: replace device arguments
  2021-07-07  7:55 [PATCH 0/4] Add dynamic iommu backed bounce buffers David Stevens
  2021-07-07  7:55 ` [PATCH 1/4] dma-iommu: add kalloc gfp flag to alloc helper David Stevens
@ 2021-07-07  7:55 ` David Stevens
  2021-07-07  7:55 ` [PATCH 3/4] dma-iommu: expose a few helper functions to module David Stevens
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: David Stevens @ 2021-07-07  7:55 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon
  Cc: Christoph Hellwig, Sergey Senozhatsky, iommu, linux-kernel,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Replace the struct device argument with the device's nid in
__iommu_dma_alloc_pages, since it doesn't need the whole struct. This
allows it to be called from places which don't have access to the
device.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/dma-iommu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 00993b56c977..98a5c566a303 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -592,12 +592,12 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
 	kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(struct device *dev,
+static struct page **__iommu_dma_alloc_pages(
 		unsigned int count, unsigned long order_mask,
-		gfp_t page_gfp, gfp_t kalloc_gfp)
+		unsigned int nid, gfp_t page_gfp, gfp_t kalloc_gfp)
 {
 	struct page **pages;
-	unsigned int i = 0, nid = dev_to_node(dev);
+	unsigned int i = 0;
 
 	order_mask &= (2U << MAX_ORDER) - 1;
 	if (!order_mask)
@@ -680,8 +680,8 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 		alloc_sizes = min_size;
 
 	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
-					gfp, GFP_KERNEL);
+	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT,
+					dev_to_node(dev), gfp, GFP_KERNEL);
 	if (!pages)
 		return NULL;
 
-- 
2.32.0.93.g670b81a890-goog



* [PATCH 3/4] dma-iommu: expose a few helper functions to module
  2021-07-07  7:55 [PATCH 0/4] Add dynamic iommu backed bounce buffers David Stevens
  2021-07-07  7:55 ` [PATCH 1/4] dma-iommu: add kalloc gfp flag to alloc helper David Stevens
  2021-07-07  7:55 ` [PATCH 2/4] dma-iommu: replace device arguments David Stevens
@ 2021-07-07  7:55 ` David Stevens
  2021-07-07  7:55 ` [PATCH 4/4] dma-iommu: Add iommu bounce buffers to dma-iommu api David Stevens
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: David Stevens @ 2021-07-07  7:55 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon
  Cc: Christoph Hellwig, Sergey Senozhatsky, iommu, linux-kernel,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Expose a few helper functions from dma-iommu to the rest of the module.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/dma-iommu.c | 27 ++++++++++++++-------------
 include/linux/dma-iommu.h | 12 ++++++++++++
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 98a5c566a303..48267d9f5152 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -413,7 +413,7 @@ static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
 	}
 }
 
-static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
+dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
 		size_t size, u64 dma_limit, struct device *dev)
 {
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -453,7 +453,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 	return (dma_addr_t)iova << shift;
 }
 
-static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 		dma_addr_t iova, size_t size, struct page *freelist)
 {
 	struct iova_domain *iovad = &cookie->iovad;
@@ -489,7 +489,7 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
 
 	if (!cookie->fq_domain)
 		iommu_iotlb_sync(domain, &iotlb_gather);
-	iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
+	__iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
 }
 
 static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
@@ -525,12 +525,12 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
 
 	size = iova_align(iovad, size + iova_off);
 
-	iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+	iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
 	if (!iova)
 		return DMA_MAPPING_ERROR;
 
 	if (iommu_map_atomic(domain, iova, phys - iova_off, size, prot)) {
-		iommu_dma_free_iova(cookie, iova, size, NULL);
+		__iommu_dma_free_iova(cookie, iova, size, NULL);
 		return DMA_MAPPING_ERROR;
 	}
 	return iova + iova_off;
@@ -585,14 +585,14 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
 	return iova;
 }
 
-static void __iommu_dma_free_pages(struct page **pages, int count)
+void __iommu_dma_free_pages(struct page **pages, int count)
 {
 	while (count--)
 		__free_page(pages[count]);
 	kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(
+struct page **__iommu_dma_alloc_pages(
 		unsigned int count, unsigned long order_mask,
 		unsigned int nid, gfp_t page_gfp, gfp_t kalloc_gfp)
 {
@@ -686,7 +686,8 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 		return NULL;
 
 	size = iova_align(iovad, size);
-	iova = iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
+	iova = __iommu_dma_alloc_iova(domain, size,
+				      dev->coherent_dma_mask, dev);
 	if (!iova)
 		goto out_free_pages;
 
@@ -712,7 +713,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
 out_free_sg:
 	sg_free_table(sgt);
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, size, NULL);
+	__iommu_dma_free_iova(cookie, iova, size, NULL);
 out_free_pages:
 	__iommu_dma_free_pages(pages, count);
 	return NULL;
@@ -1063,7 +1064,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		prev = s;
 	}
 
-	iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
+	iova = __iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
 	if (!iova)
 		goto out_restore_sg;
 
@@ -1077,7 +1078,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	return __finalise_sg(dev, sg, nents, iova);
 
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, iova_len, NULL);
+	__iommu_dma_free_iova(cookie, iova, iova_len, NULL);
 out_restore_sg:
 	__invalidate_sg(sg, nents);
 	return 0;
@@ -1370,7 +1371,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 	if (!msi_page)
 		return NULL;
 
-	iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+	iova = __iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
 	if (!iova)
 		goto out_free_page;
 
@@ -1384,7 +1385,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 	return msi_page;
 
 out_free_iova:
-	iommu_dma_free_iova(cookie, iova, size, NULL);
+	__iommu_dma_free_iova(cookie, iova, size, NULL);
 out_free_page:
 	kfree(msi_page);
 	return NULL;
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 6e75a2d689b4..fc9acc581db0 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -42,6 +42,18 @@ void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
 
 extern bool iommu_dma_forcedac;
 
+struct iommu_dma_cookie;
+
+struct page **__iommu_dma_alloc_pages(
+		unsigned int count, unsigned long order_mask,
+		unsigned int nid, gfp_t page_gfp, gfp_t kalloc_gfp);
+void __iommu_dma_free_pages(struct page **pages, int count);
+dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
+				  size_t size, u64 dma_limit,
+				  struct device *dev);
+void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+		dma_addr_t iova, size_t size, struct page *freelist);
+
 #else /* CONFIG_IOMMU_DMA */
 
 struct iommu_domain;
-- 
2.32.0.93.g670b81a890-goog



* [PATCH 4/4] dma-iommu: Add iommu bounce buffers to dma-iommu api
  2021-07-07  7:55 [PATCH 0/4] Add dynamic iommu backed bounce buffers David Stevens
                   ` (2 preceding siblings ...)
  2021-07-07  7:55 ` [PATCH 3/4] dma-iommu: expose a few helper functions to module David Stevens
@ 2021-07-07  7:55 ` David Stevens
  2021-07-08  9:29 ` [PATCH 0/4] Add dynamic iommu backed bounce buffers Joerg Roedel
  2021-07-08 13:38 ` Lu Baolu
  5 siblings, 0 replies; 11+ messages in thread
From: David Stevens @ 2021-07-07  7:55 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon
  Cc: Christoph Hellwig, Sergey Senozhatsky, iommu, linux-kernel,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Add support for per-domain dynamic pools of iommu bounce buffers to the
dma-iommu api. When enabled, all non-direct streaming mappings below a
configurable size will go through bounce buffers.

Each domain has its own buffer pool. Each buffer pool is split into
multiple power-of-2 size classes. Each class has a number of
preallocated slots that can hold bounce buffers. Bounce buffers are
allocated on demand, and unmapped bounce buffers are stored in a cache
with periodic eviction of unused cache entries. As the buffer pool is an
optimization, any failures simply result in falling back to the normal
dma-iommu handling.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 drivers/iommu/Kconfig          |  10 +
 drivers/iommu/Makefile         |   1 +
 drivers/iommu/dma-iommu.c      |  75 +++-
 drivers/iommu/io-buffer-pool.c | 656 +++++++++++++++++++++++++++++++++
 drivers/iommu/io-buffer-pool.h |  91 +++++
 5 files changed, 826 insertions(+), 7 deletions(-)
 create mode 100644 drivers/iommu/io-buffer-pool.c
 create mode 100644 drivers/iommu/io-buffer-pool.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 1f111b399bca..6eee57b03ff9 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -420,4 +420,14 @@ config SPRD_IOMMU
 
 	  Say Y here if you want to use the multimedia devices listed above.
 
+config IOMMU_IO_BUFFER
+	bool "Use IOMMU bounce buffers"
+	depends on IOMMU_DMA
+	help
+	  Use bounce buffers for small, streaming DMA operations. This may
+	  have performance benefits on systems where establishing IOMMU mappings
+	  is particularly expensive, such as when running as a guest.
+
+	  If unsure, say N here.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c0fb0ba88143..2287b2e3d92d 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
 obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
+obj-$(CONFIG_IOMMU_IO_BUFFER) += io-buffer-pool.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 48267d9f5152..1d2cfbbe03c1 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -26,6 +26,8 @@
 #include <linux/crash_dump.h>
 #include <linux/dma-direct.h>
 
+#include "io-buffer-pool.h"
+
 struct iommu_dma_msi_page {
 	struct list_head	list;
 	dma_addr_t		iova;
@@ -46,6 +48,7 @@ struct iommu_dma_cookie {
 		dma_addr_t		msi_iova;
 	};
 	struct list_head		msi_page_list;
+	struct io_buffer_pool		*bounce_buffers;
 
 	/* Domain for flush queue callback; NULL if flush queue not in use */
 	struct iommu_domain		*fq_domain;
@@ -83,6 +86,14 @@ static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
 	return PAGE_SIZE;
 }
 
+static inline struct io_buffer_pool *dev_to_io_buffer_pool(struct device *dev)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+
+	return cookie->bounce_buffers;
+}
+
 static struct iommu_dma_cookie *cookie_alloc(enum iommu_dma_cookie_type type)
 {
 	struct iommu_dma_cookie *cookie;
@@ -162,6 +173,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
 	if (!cookie)
 		return;
 
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER))
+		io_buffer_pool_destroy(cookie->bounce_buffers);
+
 	if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
 		put_iova_domain(&cookie->iovad);
 
@@ -334,6 +348,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	unsigned long order, base_pfn;
 	struct iova_domain *iovad;
+	int ret;
 
 	if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE)
 		return -EINVAL;
@@ -381,7 +396,13 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 	if (!dev)
 		return 0;
 
-	return iova_reserve_iommu_regions(dev, domain);
+	ret = iova_reserve_iommu_regions(dev, domain);
+
+	if (ret == 0 && IS_ENABLED(CONFIG_IOMMU_IO_BUFFER))
+		ret = io_buffer_pool_init(dev, domain, iovad,
+					  &cookie->bounce_buffers);
+
+	return ret;
 }
 
 /**
@@ -537,11 +558,10 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
 }
 
 static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
-		size_t org_size, dma_addr_t dma_mask, bool coherent,
+		size_t org_size, dma_addr_t dma_mask, int prot,
 		enum dma_data_direction dir, unsigned long attrs,
 		phys_addr_t *adj_phys)
 {
-	int prot = dma_info_to_prot(dir, coherent, attrs);
 	struct iommu_domain *domain = iommu_get_dma_domain(dev);
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
@@ -781,6 +801,11 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev,
 {
 	phys_addr_t phys;
 
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER) &&
+	    io_buffer_pool_sync_single(dev_to_io_buffer_pool(dev), dma_handle,
+				       size, dir, true))
+		return;
+
 	if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev))
 		return;
 
@@ -796,6 +821,11 @@ static void __iommu_dma_sync_single_for_device(struct device *dev,
 		dma_addr_t dma_handle, size_t size,
 		enum dma_data_direction dir, phys_addr_t phys)
 {
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER) &&
+	    io_buffer_pool_sync_single(dev_to_io_buffer_pool(dev), dma_handle,
+				       size, dir, false))
+		return;
+
 	if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev))
 		return;
 
@@ -823,6 +853,11 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
 	struct scatterlist *sg;
 	int i;
 
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER) &&
+	    io_buffer_pool_sync_sg(dev_to_io_buffer_pool(dev),
+				   sgl, nelems, dir, true))
+		return;
+
 	if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev))
 		return;
 
@@ -842,6 +877,11 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
 	struct scatterlist *sg;
 	int i;
 
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER) &&
+	    io_buffer_pool_sync_sg(dev_to_io_buffer_pool(dev),
+				   sgl, nelems, dir, false))
+		return;
+
 	if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev))
 		return;
 
@@ -861,10 +901,17 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 {
 	phys_addr_t phys = page_to_phys(page) + offset, adj_phys;
 	bool coherent = dev_is_dma_coherent(dev);
-	dma_addr_t dma_handle;
+	int prot = dma_info_to_prot(dir, coherent, attrs);
+	dma_addr_t dma_handle = DMA_MAPPING_ERROR;
+
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER))
+		dma_handle = io_buffer_pool_map_page(dev_to_io_buffer_pool(dev),
+				page, offset, size, prot, attrs);
+
+	if (dma_handle == DMA_MAPPING_ERROR)
+		dma_handle = __iommu_dma_map_swiotlb(dev, phys, size,
+				dma_get_mask(dev), prot, dir, attrs, &adj_phys);
 
-	dma_handle = __iommu_dma_map_swiotlb(dev, phys, size,
-			dma_get_mask(dev), coherent, dir, attrs, &adj_phys);
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
 	    dma_handle != DMA_MAPPING_ERROR)
 		__iommu_dma_sync_single_for_device(dev, dma_handle, size,
@@ -877,6 +924,11 @@ static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
 {
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
 		iommu_dma_sync_single_for_cpu(dev, dma_handle, size, dir);
+
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER) &&
+	    io_buffer_pool_unmap_buffer(dev_to_io_buffer_pool(dev), dma_handle))
+		return;
+
 	__iommu_dma_unmap_swiotlb(dev, dma_handle, size, dir, attrs);
 }
 
@@ -1012,7 +1064,11 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	    iommu_deferred_attach(dev, domain))
 		return 0;
 
-	if (dev_use_swiotlb(dev)) {
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER))
+		early_mapped = io_buffer_pool_map_sg(cookie->bounce_buffers,
+				dev, sg, nents, prot, attrs);
+
+	if (!early_mapped && dev_use_swiotlb(dev)) {
 		early_mapped = iommu_dma_map_sg_swiotlb(dev, sg, nents,
 							dir, attrs);
 		if (!early_mapped)
@@ -1110,6 +1166,11 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		sg = tmp;
 	}
 	end = sg_dma_address(sg) + sg_dma_len(sg);
+
+	if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER) &&
+	    io_buffer_pool_unmap_buffer(dev_to_io_buffer_pool(dev), start))
+		return;
+
 	__iommu_dma_unmap(dev, start, end - start);
 }
 
diff --git a/drivers/iommu/io-buffer-pool.c b/drivers/iommu/io-buffer-pool.c
new file mode 100644
index 000000000000..3f6f32411889
--- /dev/null
+++ b/drivers/iommu/io-buffer-pool.c
@@ -0,0 +1,656 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Dynamic pools of IOMMU mapped bounce buffers.
+ *
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#include <linux/dma-iommu.h>
+#include <linux/dma-map-ops.h>
+#include <linux/highmem.h>
+#include <linux/list.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/swiotlb.h>
+#include <linux/workqueue.h>
+
+#include "io-buffer-pool.h"
+
+struct io_buffer_slot {
+	struct list_head list;
+	void *orig_buffer;
+	struct page **bounce_buffer;
+	int prot;
+	bool old_cache_entry;
+};
+
+enum io_buffer_slot_type {
+	IO_BUFFER_SLOT_TYPE_RO = 0,
+	IO_BUFFER_SLOT_TYPE_WO,
+	IO_BUFFER_SLOT_TYPE_RW,
+	IO_BUFFER_SLOT_TYPE_COUNT,
+};
+
+struct io_buffer_class {
+	struct list_head cached_slots[IO_BUFFER_SLOT_TYPE_COUNT];
+	struct list_head empty_slots;
+	spinlock_t lock;
+	dma_addr_t unaligned_iova;
+	dma_addr_t iova_base;
+	size_t buffer_size;
+	struct io_buffer_slot *slots;
+};
+
+struct io_buffer_pool {
+	struct workqueue_struct *evict_wq;
+	struct delayed_work evict_work;
+	struct iommu_domain *domain;
+	unsigned int nid;
+	struct io_buffer_class classes[];
+};
+
+#define EVICT_PERIOD_MSEC 5000
+
+static unsigned int num_slots = 1024;
+module_param(num_slots, uint, 0);
+
+// A bitmask which selects the buffer classes used by the pool. Bit i enables
+// a class with buffer size (1 << i) * PAGE_SIZE.
+static u8 class_sizes = 32 | 4 | 1;
+module_param(class_sizes, byte, 0);
+static unsigned int num_classes;
+
+static unsigned int init_num_classes(void)
+{
+	unsigned char sizes_mask = class_sizes;
+	unsigned int count = 0;
+
+	while (sizes_mask) {
+		count += (sizes_mask & 1);
+		sizes_mask >>= 1;
+	}
+
+	num_classes = count;
+	return num_classes;
+}
+
+static dma_addr_t io_buffer_slot_to_iova(struct io_buffer_slot *slot,
+					 struct io_buffer_class *class)
+{
+	return class->iova_base + class->buffer_size * (slot - class->slots);
+}
+
+static struct list_head *
+io_buffer_class_get_cache(struct io_buffer_class *class, int prot)
+{
+	prot &= (IOMMU_READ | IOMMU_WRITE);
+	if (prot == IOMMU_READ)
+		return &class->cached_slots[IO_BUFFER_SLOT_TYPE_RO];
+	else if (prot == IOMMU_WRITE)
+		return &class->cached_slots[IO_BUFFER_SLOT_TYPE_WO];
+	BUG_ON(prot == 0);
+	return &class->cached_slots[IO_BUFFER_SLOT_TYPE_RW];
+}
+
+static bool io_buffer_pool_free_slot(struct io_buffer_pool *pool,
+				     struct io_buffer_class *class,
+				     struct io_buffer_slot *slot)
+{
+	dma_addr_t iova = io_buffer_slot_to_iova(slot, class);
+	size_t unmapped;
+
+	unmapped = iommu_unmap(pool->domain, iova, class->buffer_size);
+
+	if (unmapped != class->buffer_size) {
+		pr_err("Failed to unmap bounce buffer, leaking buffer.\n");
+		return false;
+	}
+
+	__iommu_dma_free_pages(slot->bounce_buffer,
+			       class->buffer_size >> PAGE_SHIFT);
+	return true;
+}
+
+static void __io_buffer_pool_evict(struct io_buffer_pool *pool,
+				   bool pool_teardown)
+{
+	struct io_buffer_class *class;
+	struct io_buffer_slot *slot;
+	struct list_head *cache;
+	unsigned long flags;
+	int i, j;
+	bool requeue = false, freed;
+
+	for (i = 0; i < num_classes; i++) {
+		class = &pool->classes[i];
+
+		spin_lock_irqsave(&class->lock, flags);
+		for (j = 0; j < IO_BUFFER_SLOT_TYPE_COUNT; j++) {
+			cache = &class->cached_slots[j];
+
+			while (!list_empty(cache)) {
+				slot = list_last_entry(
+					cache, struct io_buffer_slot, list);
+				if (!pool_teardown && !slot->old_cache_entry)
+					break;
+				list_del(&slot->list);
+				spin_unlock_irqrestore(&class->lock, flags);
+
+				freed = io_buffer_pool_free_slot(pool, class,
+								 slot);
+
+				spin_lock_irqsave(&class->lock, flags);
+
+				// If the iova is in an unknown state, then
+				// give up on the slot altogether.
+				if (freed)
+					list_add(&slot->list,
+						 &class->empty_slots);
+			}
+
+			list_for_each_entry(slot, cache, list) {
+				slot->old_cache_entry = true;
+				requeue = true;
+			}
+		}
+		spin_unlock_irqrestore(&class->lock, flags);
+	}
+
+	if (!pool_teardown && requeue)
+		queue_delayed_work(pool->evict_wq, &pool->evict_work,
+				   msecs_to_jiffies(EVICT_PERIOD_MSEC));
+}
+
+static void io_buffer_pool_evict(struct work_struct *work)
+{
+	struct io_buffer_pool *pool = container_of(
+		to_delayed_work(work), struct io_buffer_pool, evict_work);
+	__io_buffer_pool_evict(pool, false);
+}
+
+int io_buffer_class_init(struct io_buffer_class *class, struct device *dev,
+			 struct iommu_domain *domain, size_t buffer_size)
+{
+	int i;
+
+	class->slots = kcalloc(num_slots, sizeof(*class->slots), GFP_KERNEL);
+	if (!class->slots)
+		return -ENOMEM;
+
+	spin_lock_init(&class->lock);
+	for (i = 0; i < IO_BUFFER_SLOT_TYPE_COUNT; i++)
+		INIT_LIST_HEAD(&class->cached_slots[i]);
+	INIT_LIST_HEAD(&class->empty_slots);
+	for (i = 0; i < num_slots; i++)
+		list_add_tail(&class->slots[i].list, &class->empty_slots);
+
+	class->buffer_size = buffer_size;
+	class->unaligned_iova = __iommu_dma_alloc_iova(
+		domain, buffer_size * (num_slots + 1), dma_get_mask(dev), dev);
+	if (!class->unaligned_iova) {
+		kfree(class->slots);
+		return -ENOSPC;
+	}
+
+	// Since the class's buffer size is a power of 2, aligning the
+	// class's base iova to that power of 2 ensures that no slot has
+	// a segment which falls across a segment boundary.
+	class->iova_base = ALIGN(class->unaligned_iova, class->buffer_size);
+	return 0;
+}
+
+int io_buffer_pool_init(struct device *dev, struct iommu_domain *domain,
+			struct iova_domain *iovad,
+			struct io_buffer_pool **pool_out)
+{
+	int i, ret;
+	u8 class_size_mask;
+	struct io_buffer_pool *pool = *pool_out;
+
+	if (pool) {
+		for (i = 0; i < num_classes; i++) {
+			struct io_buffer_class *class = pool->classes + i;
+			dma_addr_t iova_end = class->iova_base +
+					      class->buffer_size * num_slots;
+			if (~dma_get_mask(dev) & (iova_end - 1)) {
+				pr_err("io-buffer-pool out of range of %s\n",
+				       dev_name(dev));
+				return -EFAULT;
+			}
+		}
+		if (pool->nid != dev_to_node(dev))
+			pr_warn("node mismatch: pool=%d dev=%d\n", pool->nid,
+				dev_to_node(dev));
+		return 0;
+	}
+
+	*pool_out = NULL;
+
+	if (init_num_classes() == 0 || num_slots == 0)
+		return 0;
+
+	pool = kzalloc(sizeof(*pool) +
+			       sizeof(struct io_buffer_class) * num_classes,
+		       GFP_KERNEL);
+	if (!pool)
+		return -ENOMEM;
+
+	INIT_DELAYED_WORK(&pool->evict_work, io_buffer_pool_evict);
+	pool->evict_wq = create_singlethread_workqueue("io-buffer-pool");
+	if (!pool->evict_wq) {
+		ret = -ENOMEM;
+		goto out_wq_alloc_fail;
+	}
+
+	for (i = 0, class_size_mask = class_sizes; i < num_classes; i++) {
+		int bit_idx;
+		size_t buffer_size;
+
+		bit_idx = ffs(class_size_mask) - 1;
+		BUG_ON(bit_idx < 0);
+		buffer_size = iova_align(iovad, PAGE_SIZE) << bit_idx;
+		class_size_mask &= ~(1 << bit_idx);
+
+		ret = io_buffer_class_init(pool->classes + i, dev, domain,
+					   buffer_size);
+		if (ret)
+			goto out_class_init_fail;
+	}
+
+	pool->domain = domain;
+	pool->nid = dev_to_node(dev);
+
+	*pool_out = pool;
+	return 0;
+
+out_class_init_fail:
+	while (--i >= 0) {
+		struct io_buffer_class *class = pool->classes + i;
+
+		__iommu_dma_free_iova(domain->iova_cookie,
+				      class->unaligned_iova,
+				      class->buffer_size * (num_slots + 1),
+				      NULL);
+		kfree(class->slots);
+	}
+	destroy_workqueue(pool->evict_wq);
+out_wq_alloc_fail:
+	kfree(pool);
+	return ret;
+}
+
+void io_buffer_pool_destroy(struct io_buffer_pool *pool)
+{
+	int i;
+
+	if (!pool)
+		return;
+
+	cancel_delayed_work_sync(&pool->evict_work);
+	destroy_workqueue(pool->evict_wq);
+	__io_buffer_pool_evict(pool, true);
+
+	for (i = 0; i < num_classes; i++) {
+		struct io_buffer_class *class = pool->classes + i;
+
+		__iommu_dma_free_iova(pool->domain->iova_cookie,
+				      class->unaligned_iova,
+				      class->buffer_size * (num_slots + 1),
+				      NULL);
+		kfree(class->slots);
+	}
+	kfree(pool);
+}
+
+static bool io_buffer_pool_find_buffer(struct io_buffer_pool *pool,
+				       dma_addr_t handle,
+				       struct io_buffer_class **class,
+				       struct io_buffer_slot **slot)
+{
+	int i;
+	struct io_buffer_class *cur;
+
+	for (i = 0; i < num_classes; i++) {
+		cur = pool->classes + i;
+		if (handle >= cur->iova_base &&
+		    handle < cur->iova_base + cur->buffer_size * num_slots) {
+			*class = cur;
+			*slot = cur->slots +
+				(handle - cur->iova_base) / cur->buffer_size;
+			return true;
+		}
+	}
+	return false;
+}
+
+bool io_buffer_pool_unmap_buffer(struct io_buffer_pool *pool, dma_addr_t handle)
+{
+	struct io_buffer_slot *slot;
+	struct io_buffer_class *class;
+	struct list_head *cache;
+	unsigned long flags;
+
+	if (!pool || !io_buffer_pool_find_buffer(pool, handle, &class, &slot))
+		return false;
+
+	spin_lock_irqsave(&class->lock, flags);
+
+	cache = io_buffer_class_get_cache(class, slot->prot);
+	if (list_empty(cache))
+		queue_delayed_work(pool->evict_wq, &pool->evict_work,
+				   msecs_to_jiffies(EVICT_PERIOD_MSEC));
+
+	slot->orig_buffer = NULL;
+	list_add(&slot->list, cache);
+	slot->old_cache_entry = false;
+
+	spin_unlock_irqrestore(&class->lock, flags);
+	return true;
+}
+
+static bool io_buffer_pool_fill_slot(struct io_buffer_pool *pool,
+				     struct io_buffer_class *class,
+				     struct io_buffer_slot *slot, int prot)
+{
+	dma_addr_t iova = io_buffer_slot_to_iova(slot, class);
+	struct sg_table sgt;
+	unsigned int count;
+	size_t mapped = 0;
+
+	count = class->buffer_size >> PAGE_SHIFT;
+	slot->bounce_buffer = __iommu_dma_alloc_pages(
+		count, (pool->domain->pgsize_bitmap | PAGE_SIZE) >> PAGE_SHIFT,
+		pool->nid, GFP_ATOMIC, GFP_ATOMIC);
+	if (!slot->bounce_buffer)
+		return false;
+
+	if (sg_alloc_table_from_pages(&sgt, slot->bounce_buffer, count, 0,
+				      class->buffer_size, GFP_ATOMIC))
+		goto out_free_pages;
+
+	mapped = iommu_map_sg_atomic(pool->domain, iova, sgt.sgl,
+				     sgt.orig_nents, prot);
+
+	sg_free_table(&sgt);
+
+out_free_pages:
+	if (mapped < class->buffer_size) {
+		__iommu_dma_free_pages(slot->bounce_buffer, count);
+		return false;
+	}
+
+	slot->prot = prot;
+	return true;
+}
+
+static dma_addr_t io_buffer_pool_map_buffer(struct io_buffer_pool *pool,
+					    void *orig_buffer, size_t size,
+					    int prot, unsigned long attrs)
+{
+	struct io_buffer_slot *slot = NULL, *tmp;
+	struct io_buffer_class *class = NULL;
+	struct list_head *cache;
+	unsigned long flags;
+	int i;
+
+	if (!pool)
+		return DMA_MAPPING_ERROR;
+
+	for (i = 0; i < num_classes; i++) {
+		class = pool->classes + i;
+		if (size <= class->buffer_size)
+			break;
+	}
+
+	if (i == num_classes)
+		return DMA_MAPPING_ERROR;
+
+	spin_lock_irqsave(&class->lock, flags);
+
+	cache = io_buffer_class_get_cache(class, prot);
+	if (!list_empty(cache)) {
+		list_for_each_entry(tmp, cache, list) {
+			if (tmp->prot == prot) {
+				slot = tmp;
+				list_del(&slot->list);
+				break;
+			}
+		}
+	}
+
+	if (slot == NULL) {
+		if (list_empty(&class->empty_slots)) {
+			spin_unlock_irqrestore(&class->lock, flags);
+			return DMA_MAPPING_ERROR;
+		}
+
+		slot = list_first_entry(&class->empty_slots,
+					struct io_buffer_slot, list);
+		list_del(&slot->list);
+		spin_unlock_irqrestore(&class->lock, flags);
+
+		if (!io_buffer_pool_fill_slot(pool, class, slot, prot)) {
+			spin_lock_irqsave(&class->lock, flags);
+			list_add(&slot->list, &class->empty_slots);
+			spin_unlock_irqrestore(&class->lock, flags);
+			return DMA_MAPPING_ERROR;
+		}
+	} else {
+		spin_unlock_irqrestore(&class->lock, flags);
+	}
+
+	slot->orig_buffer = orig_buffer;
+	return io_buffer_slot_to_iova(slot, class);
+}
+
+dma_addr_t io_buffer_pool_map_page(struct io_buffer_pool *pool,
+				   struct page *page, unsigned long offset,
+				   size_t size, int prot, unsigned long attrs)
+{
+	dma_addr_t iova = io_buffer_pool_map_buffer(pool, page, offset + size,
+						    prot, attrs);
+	return iova != DMA_MAPPING_ERROR ? iova + offset : iova;
+}
+
+int io_buffer_pool_map_sg(struct io_buffer_pool *pool, struct device *dev,
+			  struct scatterlist *sg, int nents, int prot,
+			  unsigned long attrs)
+{
+	struct scatterlist *s;
+	unsigned int size = 0;
+	dma_addr_t iova;
+	int i;
+
+	for_each_sg(sg, s, nents, i)
+		size += s->length;
+
+	iova = io_buffer_pool_map_buffer(pool, sg, size, prot, attrs);
+	if (iova == DMA_MAPPING_ERROR)
+		return 0;
+
+	i = 0;
+	while (size > 0) {
+		unsigned int seg_size = min(size, dma_get_max_seg_size(dev));
+
+		sg_dma_len(sg) = seg_size;
+		sg_dma_address(sg) = iova;
+
+		sg = sg_next(sg);
+		size -= seg_size;
+		iova += seg_size;
+		i++;
+	}
+
+	if (sg) {
+		sg_dma_address(sg) = DMA_MAPPING_ERROR;
+		sg_dma_len(sg) = 0;
+	}
+
+	return i;
+}
+
+static bool io_buffer_needs_sync(enum dma_data_direction dir,
+				 bool sync_for_cpu)
+{
+	return dir == DMA_BIDIRECTIONAL ||
+	       (dir == DMA_FROM_DEVICE && sync_for_cpu) ||
+	       (dir == DMA_TO_DEVICE && !sync_for_cpu);
+}
+
+static void io_buffer_pool_do_sync(struct io_buffer_pool *pool,
+				   struct io_buffer_class *class,
+				   struct io_buffer_slot *slot,
+				   size_t bounce_offset, struct page *orig,
+				   size_t orig_offset, size_t size,
+				   enum dma_data_direction dir,
+				   bool sync_for_cpu)
+{
+	bool needs_buffer_sync = io_buffer_needs_sync(dir, sync_for_cpu);
+	char *orig_lowmem_ptr;
+	bool dma_is_coherent = slot->prot & IOMMU_CACHE;
+
+	if (dma_is_coherent && !needs_buffer_sync)
+		return;
+
+	orig_lowmem_ptr = PageHighMem(orig) ? NULL : page_to_virt(orig);
+
+	while (size) {
+		size_t copy_len, bounce_page_offset;
+		struct page *bounce_page;
+
+		bounce_page = slot->bounce_buffer[bounce_offset / PAGE_SIZE];
+		bounce_page_offset = bounce_offset % PAGE_SIZE;
+
+		copy_len = size;
+		if (copy_len + bounce_page_offset > PAGE_SIZE) {
+			size_t new_copy_len = PAGE_SIZE - bounce_page_offset;
+			size_t page_idx = bounce_offset / PAGE_SIZE;
+			int consecutive_pages = 1;
+
+			while (++page_idx < class->buffer_size / PAGE_SIZE &&
+			       new_copy_len < copy_len) {
+				if (nth_page(bounce_page, consecutive_pages) !=
+				    slot->bounce_buffer[page_idx])
+					break;
+				consecutive_pages++;
+				new_copy_len =
+					min(copy_len, new_copy_len + PAGE_SIZE);
+			}
+			copy_len = new_copy_len;
+		}
+
+		if (!dma_is_coherent && sync_for_cpu) {
+			phys_addr_t paddr = page_to_phys(bounce_page);
+
+			arch_sync_dma_for_cpu(paddr + bounce_page_offset,
+					      copy_len, dir);
+		}
+
+		if (needs_buffer_sync) {
+			char *bounce_ptr =
+				page_to_virt(bounce_page) + bounce_page_offset;
+
+			if (!orig_lowmem_ptr) {
+				size_t remaining = copy_len;
+				size_t offset = orig_offset % PAGE_SIZE;
+				unsigned long flags;
+				size_t orig_page_idx = orig_offset / PAGE_SIZE;
+
+				while (remaining) {
+					char *orig_ptr;
+					size_t sz = min(remaining,
+							PAGE_SIZE - offset);
+
+					local_irq_save(flags);
+					orig_ptr = kmap_atomic(
+						nth_page(orig, orig_page_idx));
+					if (sync_for_cpu) {
+						memcpy(orig_ptr + offset,
+						       bounce_ptr, sz);
+					} else {
+						memcpy(bounce_ptr,
+						       orig_ptr + offset, sz);
+					}
+					kunmap_atomic(orig_ptr);
+					local_irq_restore(flags);
+
+					remaining -= sz;
+					orig_page_idx += 1;
+					bounce_ptr += sz;
+					offset = 0;
+				}
+			} else if (sync_for_cpu) {
+				memcpy(orig_lowmem_ptr + orig_offset,
+				       bounce_ptr, copy_len);
+			} else {
+				memcpy(bounce_ptr,
+				       orig_lowmem_ptr + orig_offset, copy_len);
+			}
+		}
+
+		if (!dma_is_coherent && !sync_for_cpu) {
+			phys_addr_t paddr = page_to_phys(bounce_page);
+
+			arch_sync_dma_for_device(paddr + bounce_page_offset,
+						 copy_len, dir);
+		}
+
+		bounce_offset += copy_len;
+		orig_offset += copy_len;
+		size -= copy_len;
+	}
+}
+
+bool io_buffer_pool_sync_single(struct io_buffer_pool *pool,
+				dma_addr_t dma_handle, size_t size,
+				enum dma_data_direction dir,
+				bool sync_for_cpu)
+{
+	struct io_buffer_slot *slot;
+	struct io_buffer_class *class;
+	size_t offset;
+
+	if (!pool ||
+	    !io_buffer_pool_find_buffer(pool, dma_handle, &class, &slot))
+		return false;
+
+	offset = dma_handle - io_buffer_slot_to_iova(slot, class);
+	io_buffer_pool_do_sync(pool, class, slot, offset, slot->orig_buffer,
+			       offset, size, dir, sync_for_cpu);
+
+	return true;
+}
+
+bool io_buffer_pool_sync_sg(struct io_buffer_pool *pool,
+			    struct scatterlist *sgl, int nelems,
+			    enum dma_data_direction dir,
+			    bool sync_for_cpu)
+{
+	struct io_buffer_slot *slot;
+	struct io_buffer_class *class;
+	struct scatterlist *iter;
+	size_t bounce_offset;
+	int i;
+
+	if (!pool || !io_buffer_pool_find_buffer(pool, sg_dma_address(sgl),
+						 &class, &slot))
+		return false;
+
+	// In the non io-buffer-pool case, iommu_dma_map_sg syncs before setting
+	// up the new mapping's dma address. This check handles false positives
+	// in find_buffer caused by sgl being reused for a non io-buffer-pool
+	// case after being used with the io-buffer-pool.
+	if (slot->orig_buffer != sgl)
+		return false;
+
+	bounce_offset = 0;
+	for_each_sg(sgl, iter, nelems, i) {
+		io_buffer_pool_do_sync(pool, class, slot, bounce_offset,
+				       sg_page(iter), iter->offset,
+				       iter->length, dir, sync_for_cpu);
+		bounce_offset += iter->length;
+	}
+
+	return true;
+}
diff --git a/drivers/iommu/io-buffer-pool.h b/drivers/iommu/io-buffer-pool.h
new file mode 100644
index 000000000000..27820436d1d7
--- /dev/null
+++ b/drivers/iommu/io-buffer-pool.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Dynamic pools of IOMMU mapped bounce buffers.
+ *
+ * Copyright (C) 2021 Google, Inc.
+ */
+
+#ifndef _LINUX_IO_BUFFER_POOL_H
+#define _LINUX_IO_BUFFER_POOL_H
+
+struct io_buffer_pool;
+
+#ifdef CONFIG_IOMMU_IO_BUFFER
+
+#include <linux/dma-iommu.h>
+#include <linux/iova.h>
+#include <linux/swiotlb.h>
+
+int io_buffer_pool_init(struct device *dev, struct iommu_domain *domain,
+			struct iova_domain *iovad,
+			struct io_buffer_pool **pool);
+void io_buffer_pool_destroy(struct io_buffer_pool *pool);
+bool io_buffer_pool_sync_single(struct io_buffer_pool *pool,
+				dma_addr_t dma_handle, size_t size,
+				enum dma_data_direction dir,
+				bool sync_for_cpu);
+bool io_buffer_pool_sync_sg(struct io_buffer_pool *pool,
+			    struct scatterlist *sgl, int nelems,
+			    enum dma_data_direction dir,
+			    bool sync_for_cpu);
+dma_addr_t io_buffer_pool_map_page(struct io_buffer_pool *pool,
+				   struct page *page, unsigned long offset,
+				   size_t size, int prot, unsigned long attrs);
+int io_buffer_pool_map_sg(struct io_buffer_pool *pool, struct device *dev,
+			  struct scatterlist *sg, int nents, int prot,
+			  unsigned long attrs);
+bool io_buffer_pool_unmap_buffer(struct io_buffer_pool *pool,
+				 dma_addr_t handle);
+
+#else
+
+static inline int io_buffer_pool_init(struct device *dev, struct iommu_domain *domain,
+			struct iova_domain *iovad,
+			struct io_buffer_pool **pool)
+{
+	return 0;
+}
+
+static inline void io_buffer_pool_destroy(struct io_buffer_pool *pool)
+{
+}
+
+static inline bool io_buffer_pool_sync_single(struct io_buffer_pool *pool,
+				dma_addr_t dma_handle, size_t size,
+				enum dma_data_direction dir,
+				bool sync_for_cpu)
+{
+	return false;
+}
+
+static inline bool io_buffer_pool_sync_sg(struct io_buffer_pool *pool,
+			    struct scatterlist *sgl, int nelems,
+			    enum dma_data_direction dir,
+			    bool sync_for_cpu)
+{
+	return false;
+}
+
+static inline dma_addr_t io_buffer_pool_map_page(struct io_buffer_pool *pool,
+				   struct page *page, unsigned long offset,
+				   size_t size, int prot, unsigned long attrs)
+{
+	return DMA_MAPPING_ERROR;
+}
+
+static inline int io_buffer_pool_map_sg(struct io_buffer_pool *pool, struct device *dev,
+			  struct scatterlist *sg, int nents, int prot,
+			  unsigned long attrs)
+{
+	return 0;
+}
+
+static inline bool io_buffer_pool_unmap_buffer(struct io_buffer_pool *pool,
+				 dma_addr_t handle)
+{
+	return false;
+}
+
+#endif /* CONFIG_IOMMU_IO_BUFFER */
+
+#endif /* _LINUX_IO_BUFFER_POOL_H */
-- 
2.32.0.93.g670b81a890-goog



* Re: [PATCH 0/4] Add dynamic iommu backed bounce buffers
  2021-07-07  7:55 [PATCH 0/4] Add dynamic iommu backed bounce buffers David Stevens
                   ` (3 preceding siblings ...)
  2021-07-07  7:55 ` [PATCH 4/4] dma-iommu: Add iommu bounce buffers to dma-iommu api David Stevens
@ 2021-07-08  9:29 ` Joerg Roedel
  2021-07-08 17:14   ` Robin Murphy
  2021-07-08 13:38 ` Lu Baolu
  5 siblings, 1 reply; 11+ messages in thread
From: Joerg Roedel @ 2021-07-08  9:29 UTC (permalink / raw)
  To: David Stevens, Robin Murphy
  Cc: Will Deacon, Christoph Hellwig, Sergey Senozhatsky, iommu,
	linux-kernel, David Stevens

Adding Robin too.

On Wed, Jul 07, 2021 at 04:55:01PM +0900, David Stevens wrote:
> Add support for per-domain dynamic pools of iommu bounce buffers to the 
> dma-iommu API. This allows iommu mappings to be reused while still
> maintaining strict iommu protection. Allocating buffers dynamically
> instead of using swiotlb carveouts makes per-domain pools better suited
> to systems with large numbers of devices or where devices are unknown.
> 
> When enabled, all non-direct streaming mappings below a configurable
> size will go through bounce buffers. Note that this means drivers which
> don't properly use the DMA API (e.g. i915) cannot use an iommu when this
> feature is enabled. However, all drivers which work with swiotlb=force
> should work.
> 
> Bounce buffers serve as an optimization in situations where interactions
> with the iommu are very costly. For example, virtio-iommu operations in
> a guest on a Linux host require a vmexit, involvement of the VMM, and a
> VFIO syscall. For relatively small DMA operations, memcpy can be
> significantly faster.
> 
> As a performance comparison, on a device with an i5-10210U, I ran fio
> with a VFIO passthrough NVMe drive with '--direct=1 --rw=read
> --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k, and
> 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> by >99%, as bounce buffers don't require syncing here in the read case.
> Running with multiple jobs doesn't serve as a useful performance
> comparison because virtio-iommu and vfio_iommu_type1 both have big
> locks that significantly limit multithreaded DMA performance.
> 
> This patch set is based on v5.13-rc7 plus the patches at [1].
> 
> David Stevens (4):
>   dma-iommu: add kalloc gfp flag to alloc helper
>   dma-iommu: replace device arguments
>   dma-iommu: expose a few helper functions to module
>   dma-iommu: Add iommu bounce buffers to dma-iommu api
> 
>  drivers/iommu/Kconfig          |  10 +
>  drivers/iommu/Makefile         |   1 +
>  drivers/iommu/dma-iommu.c      | 119 ++++--
>  drivers/iommu/io-buffer-pool.c | 656 +++++++++++++++++++++++++++++++++
>  drivers/iommu/io-buffer-pool.h |  91 +++++
>  include/linux/dma-iommu.h      |  12 +
>  6 files changed, 861 insertions(+), 28 deletions(-)
>  create mode 100644 drivers/iommu/io-buffer-pool.c
>  create mode 100644 drivers/iommu/io-buffer-pool.h
> 
> -- 
> 2.32.0.93.g670b81a890-goog


* Re: [PATCH 0/4] Add dynamic iommu backed bounce buffers
  2021-07-07  7:55 [PATCH 0/4] Add dynamic iommu backed bounce buffers David Stevens
                   ` (4 preceding siblings ...)
  2021-07-08  9:29 ` [PATCH 0/4] Add dynamic iommu backed bounce buffers Joerg Roedel
@ 2021-07-08 13:38 ` Lu Baolu
  2021-07-09  6:04   ` David Stevens
  5 siblings, 1 reply; 11+ messages in thread
From: Lu Baolu @ 2021-07-08 13:38 UTC (permalink / raw)
  To: David Stevens, Joerg Roedel, Will Deacon
  Cc: baolu.lu, Sergey Senozhatsky, iommu, David Stevens,
	Christoph Hellwig, linux-kernel

Hi David,

I like this idea. Thanks for proposing this.

On 2021/7/7 15:55, David Stevens wrote:
> Add support for per-domain dynamic pools of iommu bounce buffers to the
> dma-iommu API. This allows iommu mappings to be reused while still
> maintaining strict iommu protection. Allocating buffers dynamically
> instead of using swiotlb carveouts makes per-domain pools better suited
> to systems with large numbers of devices or where devices are unknown.

Have you ever considered leveraging the per-device swiotlb memory pool
added by below series?

https://lore.kernel.org/linux-iommu/20210625123004.GA3170@willie-the-truck/

> 
> When enabled, all non-direct streaming mappings below a configurable
> size will go through bounce buffers. Note that this means drivers which
> don't properly use the DMA API (e.g. i915) cannot use an iommu when this
> feature is enabled. However, all drivers which work with swiotlb=force
> should work.

If so, why not make it more scalable by adding a callback into vendor
iommu drivers? The vendor iommu drivers have enough information to tell
whether the bounce buffer is feasible for a specific domain.

> 
> Bounce buffers serve as an optimization in situations where interactions
> with the iommu are very costly. For example, virtio-iommu operations in

The simulated IOMMU does the same thing.

It's also an optimization for bare metal in cases where the strict mode
of cache invalidation is used. CPU moving data is faster than IOMMU
cache invalidation if the buffer is small.

Best regards,
baolu


* Re: [PATCH 0/4] Add dynamic iommu backed bounce buffers
  2021-07-08  9:29 ` [PATCH 0/4] Add dynamic iommu backed bounce buffers Joerg Roedel
@ 2021-07-08 17:14   ` Robin Murphy
  2021-07-09  7:25     ` David Stevens
  0 siblings, 1 reply; 11+ messages in thread
From: Robin Murphy @ 2021-07-08 17:14 UTC (permalink / raw)
  To: Joerg Roedel, David Stevens
  Cc: Will Deacon, Christoph Hellwig, Sergey Senozhatsky, iommu,
	linux-kernel, David Stevens

On 2021-07-08 10:29, Joerg Roedel wrote:
> Adding Robin too.
> 
> On Wed, Jul 07, 2021 at 04:55:01PM +0900, David Stevens wrote:
>> Add support for per-domain dynamic pools of iommu bounce buffers to the
>> dma-iommu API. This allows iommu mappings to be reused while still
>> maintaining strict iommu protection. Allocating buffers dynamically
>> instead of using swiotlb carveouts makes per-domain pools better suited
>> to systems with large numbers of devices or where devices are unknown.

But isn't that just as true for the currently-supported case? All you 
need is a large enough Thunderbolt enclosure and you could suddenly plug 
in a dozen untrusted GPUs all wanting to map hundreds of megabytes of 
memory. If there's a real concern worth addressing, surely it's worth 
addressing properly for everyone.

>> When enabled, all non-direct streaming mappings below a configurable
>> size will go through bounce buffers. Note that this means drivers which
>> don't properly use the DMA API (e.g. i915) cannot use an iommu when this
>> feature is enabled. However, all drivers which work with swiotlb=force
>> should work.
>>
>> Bounce buffers serve as an optimization in situations where interactions
>> with the iommu are very costly. For example, virtio-iommu operations in
>> a guest on a Linux host require a vmexit, involvement of the VMM, and a
>> VFIO syscall. For relatively small DMA operations, memcpy can be
>> significantly faster.

Yup, back when the bounce-buffering stuff first came up I know 
networking folks were interested in terms of latency for small packets - 
virtualised IOMMUs are indeed another interesting case I hadn't thought 
of. It's definitely been on the radar as another use-case we'd like to 
accommodate with the bounce-buffering scheme. However, that's the thing: 
bouncing is bouncing and however you look at it it still overlaps so 
much with the untrusted case - there's no reason that couldn't use 
pre-mapped bounce buffers too, for instance - that the only necessary 
difference is really the policy decision of when to bounce. iommu-dma 
has already grown complicated enough, and having *three* different ways 
of doing things internally just seems bonkers and untenable. Pre-map the 
bounce buffers? Absolutely. Dynamically grow them on demand? Yes please! 
Do it all as a special thing in its own NIH module and leave the 
existing mess to rot? Sorry, but no.

Thanks,
Robin.

>> As a performance comparison, on a device with an i5-10210U, I ran fio
>> with a VFIO passthrough NVMe drive with '--direct=1 --rw=read
>> --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k, and
>> 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
>> spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
>> 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
>> by >99%, as bounce buffers don't require syncing here in the read case.
>> Running with multiple jobs doesn't serve as a useful performance
>> comparison because virtio-iommu and vfio_iommu_type1 both have big
>> locks that significantly limit multithreaded DMA performance.
>>
>> This patch set is based on v5.13-rc7 plus the patches at [1].
>>
>> David Stevens (4):
>>    dma-iommu: add kalloc gfp flag to alloc helper
>>    dma-iommu: replace device arguments
>>    dma-iommu: expose a few helper functions to module
>>    dma-iommu: Add iommu bounce buffers to dma-iommu api
>>
>>   drivers/iommu/Kconfig          |  10 +
>>   drivers/iommu/Makefile         |   1 +
>>   drivers/iommu/dma-iommu.c      | 119 ++++--
>>   drivers/iommu/io-buffer-pool.c | 656 +++++++++++++++++++++++++++++++++
>>   drivers/iommu/io-buffer-pool.h |  91 +++++
>>   include/linux/dma-iommu.h      |  12 +
>>   6 files changed, 861 insertions(+), 28 deletions(-)
>>   create mode 100644 drivers/iommu/io-buffer-pool.c
>>   create mode 100644 drivers/iommu/io-buffer-pool.h
>>
>> -- 
>> 2.32.0.93.g670b81a890-goog


* Re: [PATCH 1/4] dma-iommu: add kalloc gfp flag to alloc helper
  2021-07-07  7:55 ` [PATCH 1/4] dma-iommu: add kalloc gfp flag to alloc helper David Stevens
@ 2021-07-08 17:22   ` Robin Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Robin Murphy @ 2021-07-08 17:22 UTC (permalink / raw)
  To: David Stevens, Joerg Roedel, Will Deacon
  Cc: Sergey Senozhatsky, iommu, Christoph Hellwig, linux-kernel

On 2021-07-07 08:55, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Add gfp flag for kalloc calls within __iommu_dma_alloc_pages, so the
> function can be called from atomic contexts.

Why bother? If you need GFP_ATOMIC for allocating the pages array, then
you also need it for allocating the pages themselves. It's hardly
rocket science to infer one from the other.

Robin.

> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>   drivers/iommu/dma-iommu.c | 13 +++++++------
>   1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 614f0dd86b08..00993b56c977 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -593,7 +593,8 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
>   }
>   
>   static struct page **__iommu_dma_alloc_pages(struct device *dev,
> -		unsigned int count, unsigned long order_mask, gfp_t gfp)
> +		unsigned int count, unsigned long order_mask,
> +		gfp_t page_gfp, gfp_t kalloc_gfp)
>   {
>   	struct page **pages;
>   	unsigned int i = 0, nid = dev_to_node(dev);
> @@ -602,15 +603,15 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>   	if (!order_mask)
>   		return NULL;
>   
> -	pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
> +	pages = kvzalloc(count * sizeof(*pages), kalloc_gfp);
>   	if (!pages)
>   		return NULL;
>   
>   	/* IOMMU can map any pages, so himem can also be used here */
> -	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
> +	page_gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
>   
>   	/* It makes no sense to muck about with huge pages */
> -	gfp &= ~__GFP_COMP;
> +	page_gfp &= ~__GFP_COMP;
>   
>   	while (count) {
>   		struct page *page = NULL;
> @@ -624,7 +625,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
>   		for (order_mask &= (2U << __fls(count)) - 1;
>   		     order_mask; order_mask &= ~order_size) {
>   			unsigned int order = __fls(order_mask);
> -			gfp_t alloc_flags = gfp;
> +			gfp_t alloc_flags = page_gfp;
>   
>   			order_size = 1U << order;
>   			if (order_mask > order_size)
> @@ -680,7 +681,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
>   
>   	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
>   	pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
> -					gfp);
> +					gfp, GFP_KERNEL);
>   	if (!pages)
>   		return NULL;
>   
> 


* Re: [PATCH 0/4] Add dynamic iommu backed bounce buffers
  2021-07-08 13:38 ` Lu Baolu
@ 2021-07-09  6:04   ` David Stevens
  0 siblings, 0 replies; 11+ messages in thread
From: David Stevens @ 2021-07-09  6:04 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Will Deacon, Sergey Senozhatsky, iommu,
	David Stevens, Christoph Hellwig, open list

On Thu, Jul 8, 2021 at 10:38 PM Lu Baolu <baolu.lu@linux.intel.com> wrote:
>
> Hi David,
>
> I like this idea. Thanks for proposing this.
>
> On 2021/7/7 15:55, David Stevens wrote:
> > Add support for per-domain dynamic pools of iommu bounce buffers to the
> > dma-iommu API. This allows iommu mappings to be reused while still
> > maintaining strict iommu protection. Allocating buffers dynamically
> > instead of using swiotlb carveouts makes per-domain pools more amenable
> > on systems with large numbers of devices or where devices are unknown.
>
> Have you ever considered leveraging the per-device swiotlb memory pool
> added by below series?
>
> https://lore.kernel.org/linux-iommu/20210625123004.GA3170@willie-the-truck/

I'm not sure if that's a good fit. The swiotlb pools are allocated
during device initialization, so they require setting aside the
worst-case amount of memory. That's okay if you only use it with a
small number of devices where you know in advance approximately how
much memory they use. However, it doesn't work as well if you want to
use it with a large number of devices, or with unknown (i.e.
hotplugged) devices.

> >
> > When enabled, all non-direct streaming mappings below a configurable
> > size will go through bounce buffers. Note that this means drivers which
> > don't properly use the DMA API (e.g. i915) cannot use an iommu when this
> > feature is enabled. However, all drivers which work with swiotlb=force
> > should work.
>
> If so, why not making it more scalable by adding a callback into vendor
> iommu drivers? The vendor iommu drivers have enough information to tell
> whether the bounce buffer is feasible for a specific domain.

I'm not very familiar with the specifics of VT-d or the restrictions
of the graphics hardware, but at least on the surface this looks like
a limitation of the i915 driver's implementation. The driver uses the
DMA_ATTR_SKIP_CPU_SYNC flag but never calls the dma_sync functions,
since DMA is coherent on x86 hardware. Bounce buffers, however, break
the driver's assumption that there is never a need to sync between the
CPU and device domains. I doubt there's an inherent hardware limitation
here; it's just how the driver is implemented. Given that, I don't know
whether it's something the iommu driver needs to handle.
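
For reference, the pattern that bounce buffers rely on looks roughly
like the following (an illustrative sketch, not code taken from i915):

	/* Map without an implicit CPU sync... */
	dma_addr_t addr = dma_map_page_attrs(dev, page, 0, size,
					     DMA_TO_DEVICE,
					     DMA_ATTR_SKIP_CPU_SYNC);
	if (dma_mapping_error(dev, addr))
		return -ENOMEM;

	/* ...then sync explicitly before the device touches the buffer,
	 * which is the point where a bounce buffer would be filled. */
	dma_sync_single_for_device(dev, addr, size, DMA_TO_DEVICE);

i915 does the first half but skips the second, which is fine on
coherent hardware but not once a bounce buffer sits in between.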

One potential way this could be addressed would be to add explicit
support to the DMA API for long-lived streaming mappings. Drivers can
get that behavior today via DMA_ATTR_SKIP_CPU_SYNC and dma_sync.
However, the DMA API doesn't really have enough information to treat
ephemeral and long-lived mappings differently. With a new DMA_ATTR
flag for long-lived streaming mappings, the DMA API could skip bounce
buffers. That flag could also be used as a performance optimization in
the various dma-buf implementations, since they seem to mostly fall
into the long-lived streaming category (the handful I checked do call
dma_sync, so there isn't a correctness issue).
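
As a purely hypothetical example of what a caller would do (the
DMA_ATTR_LONG_LIVED name is made up; no such flag exists today):

	/* Tell the DMA API this is a long-lived streaming mapping, so
	 * dma-iommu could map it directly instead of bouncing it. */
	dma_addr_t addr = dma_map_page_attrs(dev, page, 0, size,
					     DMA_BIDIRECTIONAL,
					     DMA_ATTR_SKIP_CPU_SYNC |
					     DMA_ATTR_LONG_LIVED);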

-David

> >
> > Bounce buffers serve as an optimization in situations where interactions
> > with the iommu are very costly. For example, virtio-iommu operations in
>
> The simulated IOMMU does the same thing.
>
> It's also an optimization for bare metal in cases where the strict mode
> of cache invalidation is used. CPU moving data is faster than IOMMU
> cache invalidation if the buffer is small.
>
> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] Add dynamic iommu backed bounce buffers
  2021-07-08 17:14   ` Robin Murphy
@ 2021-07-09  7:25     ` David Stevens
  0 siblings, 0 replies; 11+ messages in thread
From: David Stevens @ 2021-07-09  7:25 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, Will Deacon, Christoph Hellwig, Sergey Senozhatsky,
	iommu, open list, David Stevens

On Fri, Jul 9, 2021 at 2:14 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2021-07-08 10:29, Joerg Roedel wrote:
> > Adding Robin too.
> >
> > On Wed, Jul 07, 2021 at 04:55:01PM +0900, David Stevens wrote:
> >> Add support for per-domain dynamic pools of iommu bounce buffers to the
> >> dma-iommu API. This allows iommu mappings to be reused while still
> >> maintaining strict iommu protection. Allocating buffers dynamically
> >> instead of using swiotlb carveouts makes per-domain pools more amenable
> >> on systems with large numbers of devices or where devices are unknown.
>
> But isn't that just as true for the currently-supported case? All you
> need is a large enough Thunderbolt enclosure and you could suddenly plug
> in a dozen untrusted GPUs all wanting to map hundreds of megabytes of
> memory. If there's a real concern worth addressing, surely it's worth
> addressing properly for everyone.

Bounce buffers consume memory, so there is always going to be some
limit on how many devices can be supported. This patch series limits
the memory consumption at any given point in time to approximately the
amount of memory needed by the DMA transactions that are currently
active. There's really no way to improve significantly on that. The
'approximately' qualification could be removed by adding a shrinker,
but that doesn't change things materially.
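
For what it's worth, hooking up a shrinker would be fairly mechanical,
along these lines (an untested sketch; the io_buffer_pool_* helpers
are hypothetical names for "count idle buffers" and "evict idle
buffers"):

static unsigned long io_buffer_pool_shrink_count(struct shrinker *shrink,
						 struct shrink_control *sc)
{
	/* Report how many idle, already-unmapped buffers are cached. */
	return io_buffer_pool_nr_idle();
}

static unsigned long io_buffer_pool_shrink_scan(struct shrinker *shrink,
						struct shrink_control *sc)
{
	/* Unmap and free up to sc->nr_to_scan idle buffers. */
	return io_buffer_pool_evict_idle(sc->nr_to_scan);
}

static struct shrinker io_buffer_pool_shrinker = {
	.count_objects = io_buffer_pool_shrink_count,
	.scan_objects = io_buffer_pool_shrink_scan,
	.seeks = DEFAULT_SEEKS,
};

	/* at pool init: */
	register_shrinker(&io_buffer_pool_shrinker);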

Compare that to reusing swiotlb, where the memory consumed is fixed
up front at the largest volume of in-flight DMA you ever want the
bounce buffers to handle. I see two concrete shortcomings there.
First, most of the time you're not doing heavy IO, especially for
consumer workloads, so that memory sits idle. Second, it raises the
problem of per-device tuning: you don't want to hurt performance by
preallocating too few bounce buffers, but you also don't want to waste
memory by preallocating too many. That tuning becomes even more
problematic once you start dealing with external devices.

Also, although this doesn't directly address the raised concern, the
bounce buffers are only used for relatively small DMA transactions. So
large allocations like framebuffers won't actually consume extra
memory via bounce buffers.
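
Concretely, the decision being described is just a size check, roughly
(the names here are illustrative, not the actual code in this series):

	/* Only bounce mappings below the configurable threshold; larger
	 * mappings such as framebuffers go straight through the iommu
	 * and consume no bounce-buffer memory. */
	if (size > io_buffer_pool_size_limit)
		return iommu_dma_map_page_direct(dev, page, offset, size,
						 dir, attrs);
	return io_buffer_pool_map_page(dev, page, offset, size, dir, attrs);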

> >> When enabled, all non-direct streaming mappings below a configurable
> >> size will go through bounce buffers. Note that this means drivers which
> >> don't properly use the DMA API (e.g. i915) cannot use an iommu when this
> >> feature is enabled. However, all drivers which work with swiotlb=force
> >> should work.
> >>
> >> Bounce buffers serve as an optimization in situations where interactions
> >> with the iommu are very costly. For example, virtio-iommu operations in
> >> a guest on a linux host require a vmexit, involvement the VMM, and a
> >> VFIO syscall. For relatively small DMA operations, memcpy can be
> >> significantly faster.
>
> Yup, back when the bounce-buffering stuff first came up I know
> networking folks were interested in terms of latency for small packets -
> virtualised IOMMUs are indeed another interesting case I hadn't thought
> of. It's definitely been on the radar as another use-case we'd like to
> accommodate with the bounce-buffering scheme. However, that's the thing:
> bouncing is bouncing and however you look at it it still overlaps so
> much with the untrusted case - there's no reason that couldn't use
> pre-mapped bounce buffers too, for instance - that the only necessary
> difference is really the policy decision of when to bounce. iommu-dma
> has already grown complicated enough, and having *three* different ways
> of doing things internally just seems bonkers and untenable. Pre-map the
> bounce buffers? Absolutely. Dynamically grow them on demand? Yes please!
> Do it all as a special thing in its own NIH module and leave the
> existing mess to rot? Sorry, but no.

I do agree that iommu-dma is getting fairly complicated. Since a
virtualized IOMMU uses bounce buffers much more heavily than
sub-granule untrusted DMA, and for the reasons stated earlier in this
email, I don't think pre-allocated bounce buffers are viable for the
virtualized IOMMU case. I can look at migrating the sub-granule
untrusted DMA case to dynamic bounce buffers, if that's an acceptable
approach.

-David

> Thanks,
> Robin.
>
> >> As a performance comparison, on a device with an i5-10210U, I ran fio
> >> with a VFIO passthrough NVMe drive with '--direct=1 --rw=read
> >> --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k, and
> >> 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> >> spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> >> 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> >> by >99%, as bounce buffers don't require syncing here in the read case.
> >> Running with multiple jobs doesn't serve as a useful performance
> >> comparison because virtio-iommu and vfio_iommu_type1 both have big
> >> locks that significantly limit mulithreaded DMA performance.
> >>
> >> This patch set is based on v5.13-rc7 plus the patches at [1].
> >>
> >> David Stevens (4):
> >>    dma-iommu: add kalloc gfp flag to alloc helper
> >>    dma-iommu: replace device arguments
> >>    dma-iommu: expose a few helper functions to module
> >>    dma-iommu: Add iommu bounce buffers to dma-iommu api
> >>
> >>   drivers/iommu/Kconfig          |  10 +
> >>   drivers/iommu/Makefile         |   1 +
> >>   drivers/iommu/dma-iommu.c      | 119 ++++--
> >>   drivers/iommu/io-buffer-pool.c | 656 +++++++++++++++++++++++++++++++++
> >>   drivers/iommu/io-buffer-pool.h |  91 +++++
> >>   include/linux/dma-iommu.h      |  12 +
> >>   6 files changed, 861 insertions(+), 28 deletions(-)
> >>   create mode 100644 drivers/iommu/io-buffer-pool.c
> >>   create mode 100644 drivers/iommu/io-buffer-pool.h
> >>
> >> --
> >> 2.32.0.93.g670b81a890-goog

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread

Thread overview: 11+ messages
2021-07-07  7:55 [PATCH 0/4] Add dynamic iommu backed bounce buffers David Stevens
2021-07-07  7:55 ` [PATCH 1/4] dma-iommu: add kalloc gfp flag to alloc helper David Stevens
2021-07-08 17:22   ` Robin Murphy
2021-07-07  7:55 ` [PATCH 2/4] dma-iommu: replace device arguments David Stevens
2021-07-07  7:55 ` [PATCH 3/4] dma-iommu: expose a few helper functions to module David Stevens
2021-07-07  7:55 ` [PATCH 4/4] dma-iommu: Add iommu bounce buffers to dma-iommu api David Stevens
2021-07-08  9:29 ` [PATCH 0/4] Add dynamic iommu backed bounce buffers Joerg Roedel
2021-07-08 17:14   ` Robin Murphy
2021-07-09  7:25     ` David Stevens
2021-07-08 13:38 ` Lu Baolu
2021-07-09  6:04   ` David Stevens
