* [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured
@ 2021-03-19 13:25 John Garry
  2021-03-19 13:25 ` [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator John Garry
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: John Garry @ 2021-03-19 13:25 UTC (permalink / raw)
  To: joro, will, jejb, martin.petersen, hch, m.szyprowski, robin.murphy
  Cc: iommu, linux-kernel, linux-scsi, linuxarm, John Garry

For streaming DMA mappings involving an IOMMU and whose IOVA len regularly
exceeds the IOVA rcache upper limit (meaning that they are not cached),
performance can be reduced. 

This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry
from last rb tree node if iova search fails"), as discussed at [0].

IOVAs which cannot be cached are highly involved in the IOVA aging issue,
as discussed at [1].

This series attempts to allow the device driver to hint at the upper limit
of its DMA mapping IOVA lengths, so that the caching range may be
increased.
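
As a rough sketch of the intended driver-side usage (dma_set_max_opt_size()
is the API added in patch 5/6; the size value below is only illustrative,
patch 6/6 uses PAGE_SIZE * HISI_SAS_SGE_PAGE_CNT):

	/* LLD probe path: hint the largest streaming DMA mapping length expected */
	dma_set_max_opt_size(&pdev->dev, PAGE_SIZE * 124);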

Some figures for a storage scenario:
v5.12-rc3 baseline:			600K IOPS
With series:				1300K IOPS
With 4e89dce72521 reverted:		1250K IOPS

All of the above are for IOMMU strict mode. Non-strict mode gives ~1750K IOPS
in all scenarios.

I will say that the APIs and their semantics are a bit ropey - any better
ideas are welcome...

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leizhen@huawei.com/
[1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.garry@huawei.com/

John Garry (6):
  iommu: Move IOVA power-of-2 roundup into allocator
  iova: Add a per-domain count of reserved nodes
  iova: Allow rcache range upper limit to be configurable
  iommu: Add iommu_dma_set_opt_size()
  dma-mapping/iommu: Add dma_set_max_opt_size()
  scsi: hisi_sas: Set max optimal DMA size for v3 hw

 drivers/iommu/dma-iommu.c              | 23 ++++---
 drivers/iommu/iova.c                   | 88 ++++++++++++++++++++------
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c |  2 +
 include/linux/dma-map-ops.h            |  1 +
 include/linux/dma-mapping.h            |  5 ++
 include/linux/iova.h                   | 12 +++-
 kernel/dma/mapping.c                   | 11 ++++
 7 files changed, 115 insertions(+), 27 deletions(-)

-- 
2.26.2



* [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
  2021-03-19 13:25 [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured John Garry
@ 2021-03-19 13:25 ` John Garry
  2021-03-19 16:13   ` Robin Murphy
  2021-03-19 13:25 ` [PATCH 2/6] iova: Add a per-domain count of reserved nodes John Garry
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2021-03-19 13:25 UTC (permalink / raw)
  To: joro, will, jejb, martin.petersen, hch, m.szyprowski, robin.murphy
  Cc: iommu, linux-kernel, linux-scsi, linuxarm, John Garry

Move the IOVA size power-of-2 rcache roundup into the IOVA allocator.

This is to eventually make it possible to be able to configure the upper
limit of the IOVA rcache range.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/dma-iommu.c |  8 ------
 drivers/iommu/iova.c      | 51 ++++++++++++++++++++++++++-------------
 2 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..15b7270a5c2a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 
 	shift = iova_shift(iovad);
 	iova_len = size >> shift;
-	/*
-	 * Freeing non-power-of-two-sized allocations back into the IOVA caches
-	 * will come back to bite us badly, so we have to waste a bit of space
-	 * rounding up anything cacheable to make sure that can't happen. The
-	 * order of the unadjusted size will still match upon freeing.
-	 */
-	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
-		iova_len = roundup_pow_of_two(iova_len);
 
 	dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
 
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e6e2fa85271c..e62e9e30b30c 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
 
 static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		unsigned long size, unsigned long limit_pfn,
-			struct iova *new, bool size_aligned)
+			struct iova *new, bool size_aligned, bool fast)
 {
 	struct rb_node *curr, *prev;
 	struct iova *curr_iova;
@@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	unsigned long align_mask = ~0UL;
 	unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
 
+	/*
+	 * Freeing non-power-of-two-sized allocations back into the IOVA caches
+	 * will come back to bite us badly, so we have to waste a bit of space
+	 * rounding up anything cacheable to make sure that can't happen. The
+	 * order of the unadjusted size will still match upon freeing.
+	 */
+	if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+		size = roundup_pow_of_two(size);
+
 	if (size_aligned)
 		align_mask <<= fls_long(size - 1);
 
@@ -288,21 +297,10 @@ void iova_cache_put(void)
 }
 EXPORT_SYMBOL_GPL(iova_cache_put);
 
-/**
- * alloc_iova - allocates an iova
- * @iovad: - iova domain in question
- * @size: - size of page frames to allocate
- * @limit_pfn: - max limit address
- * @size_aligned: - set if size_aligned address range is required
- * This function allocates an iova in the range iovad->start_pfn to limit_pfn,
- * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned
- * flag is set then the allocated address iova->pfn_lo will be naturally
- * aligned on roundup_power_of_two(size).
- */
-struct iova *
-alloc_iova(struct iova_domain *iovad, unsigned long size,
+static struct iova *
+__alloc_iova(struct iova_domain *iovad, unsigned long size,
 	unsigned long limit_pfn,
-	bool size_aligned)
+	bool size_aligned, bool fast)
 {
 	struct iova *new_iova;
 	int ret;
@@ -312,7 +310,7 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
 		return NULL;
 
 	ret = __alloc_and_insert_iova_range(iovad, size, limit_pfn + 1,
-			new_iova, size_aligned);
+			new_iova, size_aligned, fast);
 
 	if (ret) {
 		free_iova_mem(new_iova);
@@ -321,6 +319,25 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
 
 	return new_iova;
 }
+
+/**
+ * alloc_iova - allocates an iova
+ * @iovad: - iova domain in question
+ * @size: - size of page frames to allocate
+ * @limit_pfn: - max limit address
+ * @size_aligned: - set if size_aligned address range is required
+ * This function allocates an iova in the range iovad->start_pfn to limit_pfn,
+ * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned
+ * flag is set then the allocated address iova->pfn_lo will be naturally
+ * aligned on roundup_power_of_two(size).
+ */
+struct iova *
+alloc_iova(struct iova_domain *iovad, unsigned long size,
+	unsigned long limit_pfn,
+	bool size_aligned)
+{
+	return __alloc_iova(iovad, size, limit_pfn, size_aligned, false);
+}
 EXPORT_SYMBOL_GPL(alloc_iova);
 
 static struct iova *
@@ -433,7 +450,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
 		return iova_pfn;
 
 retry:
-	new_iova = alloc_iova(iovad, size, limit_pfn, true);
+	new_iova = __alloc_iova(iovad, size, limit_pfn, true, true);
 	if (!new_iova) {
 		unsigned int cpu;
 
-- 
2.26.2



* [PATCH 2/6] iova: Add a per-domain count of reserved nodes
  2021-03-19 13:25 [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured John Garry
  2021-03-19 13:25 ` [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator John Garry
@ 2021-03-19 13:25 ` John Garry
  2021-03-19 13:25 ` [PATCH 3/6] iova: Allow rcache range upper limit to be configurable John Garry
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2021-03-19 13:25 UTC (permalink / raw)
  To: joro, will, jejb, martin.petersen, hch, m.szyprowski, robin.murphy
  Cc: iommu, linux-kernel, linux-scsi, linuxarm, John Garry

To help determine whether the domain has regular (non-reserved) IOVA nodes,
add a count of reserved nodes, calculated at init time.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/iova.c | 2 ++
 include/linux/iova.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e62e9e30b30c..cecc74fb8663 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -717,6 +717,8 @@ reserve_iova(struct iova_domain *iovad,
 	 * or need to insert remaining non overlap addr range
 	 */
 	iova = __insert_new_range(iovad, pfn_lo, pfn_hi);
+	if (iova)
+		iovad->reserved_node_count++;
 finish:
 
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index c834c01c0a5b..fd3217a605b2 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -95,6 +95,7 @@ struct iova_domain {
 						   flush-queues */
 	atomic_t fq_timer_on;			/* 1 when timer is active, 0
 						   when not */
+	int reserved_node_count;
 };
 
 static inline unsigned long iova_size(struct iova *iova)
-- 
2.26.2



* [PATCH 3/6] iova: Allow rcache range upper limit to be configurable
  2021-03-19 13:25 [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured John Garry
  2021-03-19 13:25 ` [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator John Garry
  2021-03-19 13:25 ` [PATCH 2/6] iova: Add a per-domain count of reserved nodes John Garry
@ 2021-03-19 13:25 ` John Garry
  2021-03-19 16:25   ` Robin Murphy
  2021-03-19 13:25 ` [PATCH 4/6] iommu: Add iommu_dma_set_opt_size() John Garry
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2021-03-19 13:25 UTC (permalink / raw)
  To: joro, will, jejb, martin.petersen, hch, m.szyprowski, robin.murphy
  Cc: iommu, linux-kernel, linux-scsi, linuxarm, John Garry

Some LLDs may request DMA mappings whose IOVA length exceeds that of the
current rcache upper limit.

This means that allocations for those IOVAs will never be cached, and
always must be allocated and freed from the RB tree per DMA mapping cycle.
This has a significant effect on performance, more so since commit
4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search
fails"), as discussed at [0].

Allow the range of cached IOVAs to be increased, by providing an API to set
the upper limit, which is capped at IOVA_RANGE_CACHE_MAX_SIZE.

For simplicity, the full range of IOVA rcaches is allocated and initialized
at IOVA domain init time.

Setting the range upper limit will fail if the domain is already live, i.e.
it must be done before the tree contains any IOVA nodes. This is required to
ensure that any cached IOVAs comply with the rule that their size is a
power-of-2.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leizhen@huawei.com/

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/iova.c | 37 +++++++++++++++++++++++++++++++++++--
 include/linux/iova.h | 11 ++++++++++-
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cecc74fb8663..d4f2f7fbbd84 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -49,6 +49,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->flush_cb = NULL;
 	iovad->fq = NULL;
 	iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
+	iovad->rcache_max_size = IOVA_RANGE_CACHE_DEFAULT_SIZE;
 	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
 	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
 	init_iova_rcaches(iovad);
@@ -194,7 +195,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	 * rounding up anything cacheable to make sure that can't happen. The
 	 * order of the unadjusted size will still match upon freeing.
 	 */
-	if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+	if (fast && size < (1 << (iovad->rcache_max_size - 1)))
 		size = roundup_pow_of_two(size);
 
 	if (size_aligned)
@@ -901,7 +902,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn,
 {
 	unsigned int log_size = order_base_2(size);
 
-	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
+	if (log_size >= iovad->rcache_max_size)
 		return false;
 
 	return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
@@ -946,6 +947,38 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
 	return iova_pfn;
 }
 
+void iova_rcache_set_upper_limit(struct iova_domain *iovad,
+				 unsigned long iova_len)
+{
+	unsigned int rcache_index = order_base_2(iova_len) + 1;
+	struct rb_node *rb_node = &iovad->anchor.node;
+	unsigned long flags;
+	int count = 0;
+
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	if (rcache_index <= iovad->rcache_max_size)
+		goto out;
+
+	while (1) {
+		rb_node = rb_prev(rb_node);
+		if (!rb_node)
+			break;
+		count++;
+	}
+
+	/*
+	 * If there are already IOVA nodes present in the tree, then don't
+	 * allow range upper limit to be set.
+	 */
+	if (count != iovad->reserved_node_count)
+		goto out;
+
+	iovad->rcache_max_size = min_t(unsigned long, rcache_index,
+				       IOVA_RANGE_CACHE_MAX_SIZE);
+out:
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+}
+
 /*
  * Try to satisfy IOVA allocation range from rcache.  Fail if requested
  * size is too big or the DMA limit we are given isn't satisfied by the
diff --git a/include/linux/iova.h b/include/linux/iova.h
index fd3217a605b2..952b81b54ef7 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -25,7 +25,8 @@ struct iova {
 struct iova_magazine;
 struct iova_cpu_rcache;
 
-#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
+#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
+#define IOVA_RANGE_CACHE_MAX_SIZE 10 /* log of max cached IOVA range size (in pages) */
 #define MAX_GLOBAL_MAGS 32	/* magazines per bin */
 
 struct iova_rcache {
@@ -74,6 +75,7 @@ struct iova_domain {
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
 	unsigned long	max32_alloc_size; /* Size of last failed allocation */
+	unsigned long	rcache_max_size; /* Upper limit of cached IOVA RANGE */
 	struct iova_fq __percpu *fq;	/* Flush Queue */
 
 	atomic64_t	fq_flush_start_cnt;	/* Number of TLB flushes that
@@ -158,6 +160,8 @@ int init_iova_flush_queue(struct iova_domain *iovad,
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
+void iova_rcache_set_upper_limit(struct iova_domain *iovad,
+				 unsigned long iova_len);
 #else
 static inline int iova_cache_get(void)
 {
@@ -238,6 +242,11 @@ static inline void free_cpu_cached_iovas(unsigned int cpu,
 					 struct iova_domain *iovad)
 {
 }
+
+static inline void iova_rcache_set_upper_limit(struct iova_domain *iovad,
+					       unsigned long iova_len)
+{
+}
 #endif
 
 #endif
-- 
2.26.2



* [PATCH 4/6] iommu: Add iommu_dma_set_opt_size()
  2021-03-19 13:25 [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured John Garry
                   ` (2 preceding siblings ...)
  2021-03-19 13:25 ` [PATCH 3/6] iova: Allow rcache range upper limit to be configurable John Garry
@ 2021-03-19 13:25 ` John Garry
  2021-03-19 13:25 ` [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size() John Garry
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2021-03-19 13:25 UTC (permalink / raw)
  To: joro, will, jejb, martin.petersen, hch, m.szyprowski, robin.murphy
  Cc: iommu, linux-kernel, linux-scsi, linuxarm, John Garry

Add a function which allows the max optimised IOMMU DMA size to be set.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/dma-iommu.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 15b7270a5c2a..a5dfbd6c0496 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -447,6 +447,21 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 	return (dma_addr_t)iova << shift;
 }
 
+__maybe_unused
+static void iommu_dma_set_opt_size(struct device *dev, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	unsigned long shift, iova_len;
+
+	shift = iova_shift(iovad);
+	iova_len = size >> shift;
+	iova_len = roundup_pow_of_two(iova_len);
+
+	iova_rcache_set_upper_limit(iovad, iova_len);
+}
+
 static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 		dma_addr_t iova, size_t size, struct page *freelist)
 {
-- 
2.26.2



* [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
  2021-03-19 13:25 [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured John Garry
                   ` (3 preceding siblings ...)
  2021-03-19 13:25 ` [PATCH 4/6] iommu: Add iommu_dma_set_opt_size() John Garry
@ 2021-03-19 13:25 ` John Garry
  2021-03-19 17:00   ` Robin Murphy
  2021-03-19 13:25 ` [PATCH 6/6] scsi: hisi_sas: Set max optimal DMA size for v3 hw John Garry
  2021-03-19 13:40 ` [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured Christoph Hellwig
  6 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2021-03-19 13:25 UTC (permalink / raw)
  To: joro, will, jejb, martin.petersen, hch, m.szyprowski, robin.murphy
  Cc: iommu, linux-kernel, linux-scsi, linuxarm, John Garry

Add a function to allow setting the max size which we want to optimise DMA
mappings for.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/iommu/dma-iommu.c   |  2 +-
 include/linux/dma-map-ops.h |  1 +
 include/linux/dma-mapping.h |  5 +++++
 kernel/dma/mapping.c        | 11 +++++++++++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index a5dfbd6c0496..d35881fcfb9c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -447,7 +447,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 	return (dma_addr_t)iova << shift;
 }
 
-__maybe_unused
 static void iommu_dma_set_opt_size(struct device *dev, size_t size)
 {
 	struct iommu_domain *domain = iommu_get_dma_domain(dev);
@@ -1278,6 +1277,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.set_max_opt_size	= iommu_dma_set_opt_size,
 };
 
 /*
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 51872e736e7b..fed7a183b3b9 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -64,6 +64,7 @@ struct dma_map_ops {
 	u64 (*get_required_mask)(struct device *dev);
 	size_t (*max_mapping_size)(struct device *dev);
 	unsigned long (*get_merge_boundary)(struct device *dev);
+	void (*set_max_opt_size)(struct device *dev, size_t size);
 };
 
 #ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2a984cb4d1e0..91fe770145d4 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
+void dma_set_max_opt_size(struct device *dev, size_t size);
 #else /* CONFIG_HAS_DMA */
 static inline dma_addr_t dma_map_page_attrs(struct device *dev,
 		struct page *page, size_t offset, size_t size,
@@ -257,6 +258,10 @@ static inline unsigned long dma_get_merge_boundary(struct device *dev)
 {
 	return 0;
 }
+static inline void dma_set_max_opt_size(struct device *dev, size_t size)
+{
+}
+
 #endif /* CONFIG_HAS_DMA */
 
 struct page *dma_alloc_pages(struct device *dev, size_t size,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6a633679933..59e6acb1c471 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -608,3 +608,14 @@ unsigned long dma_get_merge_boundary(struct device *dev)
 	return ops->get_merge_boundary(dev);
 }
 EXPORT_SYMBOL_GPL(dma_get_merge_boundary);
+
+void dma_set_max_opt_size(struct device *dev, size_t size)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (!ops || !ops->set_max_opt_size)
+		return;
+
+	ops->set_max_opt_size(dev, size);
+}
+EXPORT_SYMBOL_GPL(dma_set_max_opt_size);
-- 
2.26.2



* [PATCH 6/6] scsi: hisi_sas: Set max optimal DMA size for v3 hw
  2021-03-19 13:25 [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured John Garry
                   ` (4 preceding siblings ...)
  2021-03-19 13:25 ` [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size() John Garry
@ 2021-03-19 13:25 ` John Garry
  2021-03-19 13:40 ` [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured Christoph Hellwig
  6 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2021-03-19 13:25 UTC (permalink / raw)
  To: joro, will, jejb, martin.petersen, hch, m.szyprowski, robin.murphy
  Cc: iommu, linux-kernel, linux-scsi, linuxarm, John Garry

For IOMMU strict mode, this more than doubles throughput in some scenarios.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 4580e081e489..2f77b418bbeb 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -4684,6 +4684,8 @@ hisi_sas_v3_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_out_regions;
 	}
 
+	dma_set_max_opt_size(dev, PAGE_SIZE * HISI_SAS_SGE_PAGE_CNT);
+
 	shost = hisi_sas_shost_alloc_pci(pdev);
 	if (!shost) {
 		rc = -ENOMEM;
-- 
2.26.2



* Re: [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured
  2021-03-19 13:25 [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured John Garry
                   ` (5 preceding siblings ...)
  2021-03-19 13:25 ` [PATCH 6/6] scsi: hisi_sas: Set max optimal DMA size for v3 hw John Garry
@ 2021-03-19 13:40 ` Christoph Hellwig
  2021-03-19 15:42   ` John Garry
  6 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2021-03-19 13:40 UTC (permalink / raw)
  To: John Garry
  Cc: joro, will, jejb, martin.petersen, hch, m.szyprowski,
	robin.murphy, iommu, linux-kernel, linux-scsi, linuxarm

On Fri, Mar 19, 2021 at 09:25:42PM +0800, John Garry wrote:
> For streaming DMA mappings involving an IOMMU and whose IOVA len regularly
> exceeds the IOVA rcache upper limit (meaning that they are not cached),
> performance can be reduced. 
> 
> This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry
> from last rb tree node if iova search fails"), as discussed at [0].
> 
> IOVAs which cannot be cached are highly involved in the IOVA aging issue,
> as discussed at [1].

I'm confused.  If this is a limit in the IOVA allocator, dma-iommu should
be able to just not grow the allocation so large without help from
the driver.

If contrary to the above description it is device-specific, the driver
could simply use dma_get_max_seg_size().


* Re: [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured
  2021-03-19 13:40 ` [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured Christoph Hellwig
@ 2021-03-19 15:42   ` John Garry
  0 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2021-03-19 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: joro, will, jejb, martin.petersen, m.szyprowski, robin.murphy,
	iommu, linux-kernel, linux-scsi, linuxarm

On 19/03/2021 13:40, Christoph Hellwig wrote:
> On Fri, Mar 19, 2021 at 09:25:42PM +0800, John Garry wrote:
>> For streaming DMA mappings involving an IOMMU and whose IOVA len regularly
>> exceeds the IOVA rcache upper limit (meaning that they are not cached),
>> performance can be reduced.
>>
>> This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry
>> from last rb tree node if iova search fails"), as discussed at [0].
>>
>> IOVAs which cannot be cached are highly involved in the IOVA aging issue,
>> as discussed at [1].
> 
> I'm confused.  If this is a limit in the IOVA allocator, dma-iommu should
> be able to just not grow the allocation so large without help from
> the driver.

This is not an issue with the IOVA allocator.

The issue is with how the IOVA code handles caching of IOVAs. 
Specifically, when we DMA unmap, for an IOVA whose length is above a 
fixed threshold, the IOVA is freed, rather than being cached. See 
free_iova_fast().
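
Roughly, the gating check on the free path today is (paraphrasing
iova_rcache_insert(), which free_iova_fast() calls):

	unsigned int log_size = order_base_2(size);

	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
		return false;	/* not cached; fall back to free_iova() */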

For performance reasons, I want that threshold increased for my driver 
to avail of the caching of all lengths of IOVA which we may see - 
currently we see IOVAs whose length exceeds that threshold. But it may 
not be good to increase that threshold for everyone.

 > If contrary to the above description it is device-specific, the driver
 > could simply use dma_get_max_seg_size().
 > .
 >

But that is for a single segment, right? Is there something equivalent
to tell how many scatter-gather elements we may generate, like
scsi_host_template.sg_tablesize?
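
Hypothetically, what I would like to be able to derive is something along
the lines of the following (just a sketch, combining the two values):

	/* worst case: every SG element at its max size, in one mapping */
	size_t opt = (size_t)shost->sg_tablesize * dma_get_max_seg_size(dev);

	dma_set_max_opt_size(dev, opt);

but dma_get_max_seg_size() only covers the per-segment part of that.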

Thanks,
John




* Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
  2021-03-19 13:25 ` [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator John Garry
@ 2021-03-19 16:13   ` Robin Murphy
  2021-03-19 16:58     ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: Robin Murphy @ 2021-03-19 16:13 UTC (permalink / raw)
  To: John Garry, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, linuxarm

On 2021-03-19 13:25, John Garry wrote:
> Move the IOVA size power-of-2 rcache roundup into the IOVA allocator.
> 
> This is to eventually make it possible to be able to configure the upper
> limit of the IOVA rcache range.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   drivers/iommu/dma-iommu.c |  8 ------
>   drivers/iommu/iova.c      | 51 ++++++++++++++++++++++++++-------------
>   2 files changed, 34 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index af765c813cc8..15b7270a5c2a 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
>   
>   	shift = iova_shift(iovad);
>   	iova_len = size >> shift;
> -	/*
> -	 * Freeing non-power-of-two-sized allocations back into the IOVA caches
> -	 * will come back to bite us badly, so we have to waste a bit of space
> -	 * rounding up anything cacheable to make sure that can't happen. The
> -	 * order of the unadjusted size will still match upon freeing.
> -	 */
> -	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> -		iova_len = roundup_pow_of_two(iova_len);
>   
>   	dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
>   
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index e6e2fa85271c..e62e9e30b30c 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
>   
>   static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>   		unsigned long size, unsigned long limit_pfn,
> -			struct iova *new, bool size_aligned)
> +			struct iova *new, bool size_aligned, bool fast)
>   {
>   	struct rb_node *curr, *prev;
>   	struct iova *curr_iova;
> @@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>   	unsigned long align_mask = ~0UL;
>   	unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
>   
> +	/*
> +	 * Freeing non-power-of-two-sized allocations back into the IOVA caches
> +	 * will come back to bite us badly, so we have to waste a bit of space
> +	 * rounding up anything cacheable to make sure that can't happen. The
> +	 * order of the unadjusted size will still match upon freeing.
> +	 */
> +	if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> +		size = roundup_pow_of_two(size);

If this transformation is only relevant to alloc_iova_fast(), and we 
have to add a special parameter here to tell whether we were called from 
alloc_iova_fast(), doesn't it seem more sensible to just do it in 
alloc_iova_fast() rather than here?
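
i.e. roughly (sketch only), at the top of alloc_iova_fast() before the
rcache lookup, leaving alloc_iova() itself untouched:

	if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
		size = roundup_pow_of_two(size);

	iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1);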

But then the API itself has no strict requirement that a pfn passed to 
free_iova_fast() wasn't originally allocated with alloc_iova(), so 
arguably hiding the adjustment away makes it less clear that the 
responsibility is really on any caller of free_iova_fast() to make sure 
they don't get things wrong.

Robin.

> +
>   	if (size_aligned)
>   		align_mask <<= fls_long(size - 1);
>   
> @@ -288,21 +297,10 @@ void iova_cache_put(void)
>   }
>   EXPORT_SYMBOL_GPL(iova_cache_put);
>   
> -/**
> - * alloc_iova - allocates an iova
> - * @iovad: - iova domain in question
> - * @size: - size of page frames to allocate
> - * @limit_pfn: - max limit address
> - * @size_aligned: - set if size_aligned address range is required
> - * This function allocates an iova in the range iovad->start_pfn to limit_pfn,
> - * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned
> - * flag is set then the allocated address iova->pfn_lo will be naturally
> - * aligned on roundup_power_of_two(size).
> - */
> -struct iova *
> -alloc_iova(struct iova_domain *iovad, unsigned long size,
> +static struct iova *
> +__alloc_iova(struct iova_domain *iovad, unsigned long size,
>   	unsigned long limit_pfn,
> -	bool size_aligned)
> +	bool size_aligned, bool fast)
>   {
>   	struct iova *new_iova;
>   	int ret;
> @@ -312,7 +310,7 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>   		return NULL;
>   
>   	ret = __alloc_and_insert_iova_range(iovad, size, limit_pfn + 1,
> -			new_iova, size_aligned);
> +			new_iova, size_aligned, fast);
>   
>   	if (ret) {
>   		free_iova_mem(new_iova);
> @@ -321,6 +319,25 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>   
>   	return new_iova;
>   }
> +
> +/**
> + * alloc_iova - allocates an iova
> + * @iovad: - iova domain in question
> + * @size: - size of page frames to allocate
> + * @limit_pfn: - max limit address
> + * @size_aligned: - set if size_aligned address range is required
> + * This function allocates an iova in the range iovad->start_pfn to limit_pfn,
> + * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned
> + * flag is set then the allocated address iova->pfn_lo will be naturally
> + * aligned on roundup_power_of_two(size).
> + */
> +struct iova *
> +alloc_iova(struct iova_domain *iovad, unsigned long size,
> +	unsigned long limit_pfn,
> +	bool size_aligned)
> +{
> +	return __alloc_iova(iovad, size, limit_pfn, size_aligned, false);
> +}
>   EXPORT_SYMBOL_GPL(alloc_iova);
>   
>   static struct iova *
> @@ -433,7 +450,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
>   		return iova_pfn;
>   
>   retry:
> -	new_iova = alloc_iova(iovad, size, limit_pfn, true);
> +	new_iova = __alloc_iova(iovad, size, limit_pfn, true, true);
>   	if (!new_iova) {
>   		unsigned int cpu;
>   
> 


* Re: [PATCH 3/6] iova: Allow rcache range upper limit to be configurable
  2021-03-19 13:25 ` [PATCH 3/6] iova: Allow rcache range upper limit to be configurable John Garry
@ 2021-03-19 16:25   ` Robin Murphy
  2021-03-19 17:26     ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: Robin Murphy @ 2021-03-19 16:25 UTC (permalink / raw)
  To: John Garry, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, linuxarm

On 2021-03-19 13:25, John Garry wrote:
> Some LLDs may request DMA mappings whose IOVA length exceeds that of the
> current rcache upper limit.
> 
> This means that allocations for those IOVAs will never be cached, and
> always must be allocated and freed from the RB tree per DMA mapping cycle.
> This has a significant effect on performance, more so since commit
> 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search
> fails"), as discussed at [0].
> 
> Allow the range of cached IOVAs to be increased, by providing an API to set
> the upper limit, which is capped at IOVA_RANGE_CACHE_MAX_SIZE.
> 
> For simplicity, the full range of IOVA rcaches is allocated and initialized
> at IOVA domain init time.
> 
> Setting the range upper limit will fail if the domain is already live, i.e.
> it must be done before the tree contains any IOVA nodes. This is required to
> ensure that any cached IOVAs comply with the rule that their size is a
> power-of-2.
> 
> [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leizhen@huawei.com/
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   drivers/iommu/iova.c | 37 +++++++++++++++++++++++++++++++++++--
>   include/linux/iova.h | 11 ++++++++++-
>   2 files changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index cecc74fb8663..d4f2f7fbbd84 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -49,6 +49,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>   	iovad->flush_cb = NULL;
>   	iovad->fq = NULL;
>   	iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
> +	iovad->rcache_max_size = IOVA_RANGE_CACHE_DEFAULT_SIZE;
>   	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
>   	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
>   	init_iova_rcaches(iovad);
> @@ -194,7 +195,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>   	 * rounding up anything cacheable to make sure that can't happen. The
>   	 * order of the unadjusted size will still match upon freeing.
>   	 */
> -	if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
> +	if (fast && size < (1 << (iovad->rcache_max_size - 1)))
>   		size = roundup_pow_of_two(size);
>   
>   	if (size_aligned)
> @@ -901,7 +902,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn,
>   {
>   	unsigned int log_size = order_base_2(size);
>   
> -	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
> +	if (log_size >= iovad->rcache_max_size)
>   		return false;
>   
>   	return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
> @@ -946,6 +947,38 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
>   	return iova_pfn;
>   }
>   
> +void iova_rcache_set_upper_limit(struct iova_domain *iovad,
> +				 unsigned long iova_len)
> +{
> +	unsigned int rcache_index = order_base_2(iova_len) + 1;
> +	struct rb_node *rb_node = &iovad->anchor.node;
> +	unsigned long flags;
> +	int count = 0;
> +
> +	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +	if (rcache_index <= iovad->rcache_max_size)
> +		goto out;
> +
> +	while (1) {
> +		rb_node = rb_prev(rb_node);
> +		if (!rb_node)
> +			break;
> +		count++;
> +	}
> +
> +	/*
> +	 * If there are already IOVA nodes present in the tree, then don't
> +	 * allow range upper limit to be set.
> +	 */
> +	if (count != iovad->reserved_node_count)
> +		goto out;
> +
> +	iovad->rcache_max_size = min_t(unsigned long, rcache_index,
> +				       IOVA_RANGE_CACHE_MAX_SIZE);
> +out:
> +	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> +}
> +
>   /*
>    * Try to satisfy IOVA allocation range from rcache.  Fail if requested
>    * size is too big or the DMA limit we are given isn't satisfied by the
> diff --git a/include/linux/iova.h b/include/linux/iova.h
> index fd3217a605b2..952b81b54ef7 100644
> --- a/include/linux/iova.h
> +++ b/include/linux/iova.h
> @@ -25,7 +25,8 @@ struct iova {
>   struct iova_magazine;
>   struct iova_cpu_rcache;
>   
> -#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
> +#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
> +#define IOVA_RANGE_CACHE_MAX_SIZE 10 /* log of max cached IOVA range size (in pages) */

No.

And why? If we're going to allocate massive caches anyway, whatever is 
the point of *not* using all of them?

It only makes sense for a configuration knob to affect the actual rcache 
and depot allocations - that way, big high-throughput systems with 
plenty of memory can spend it on better performance, while small systems 
- that often need IOMMU scatter-gather precisely *because* memory is 
tight and thus easily fragmented - don't have to pay the (not 
insignificant) cost for caches they don't need.

Robin.

>   #define MAX_GLOBAL_MAGS 32	/* magazines per bin */
>   
>   struct iova_rcache {
> @@ -74,6 +75,7 @@ struct iova_domain {
>   	unsigned long	start_pfn;	/* Lower limit for this domain */
>   	unsigned long	dma_32bit_pfn;
>   	unsigned long	max32_alloc_size; /* Size of last failed allocation */
> +	unsigned long	rcache_max_size; /* Upper limit of cached IOVA RANGE */
>   	struct iova_fq __percpu *fq;	/* Flush Queue */
>   
>   	atomic64_t	fq_flush_start_cnt;	/* Number of TLB flushes that
> @@ -158,6 +160,8 @@ int init_iova_flush_queue(struct iova_domain *iovad,
>   struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
>   void put_iova_domain(struct iova_domain *iovad);
>   void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
> +void iova_rcache_set_upper_limit(struct iova_domain *iovad,
> +				 unsigned long iova_len);
>   #else
>   static inline int iova_cache_get(void)
>   {
> @@ -238,6 +242,11 @@ static inline void free_cpu_cached_iovas(unsigned int cpu,
>   					 struct iova_domain *iovad)
>   {
>   }
> +
> +static inline void iova_rcache_set_upper_limit(struct iova_domain *iovad,
> +					       unsigned long iova_len)
> +{
> +}
>   #endif
>   
>   #endif
> 


* Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
  2021-03-19 16:13   ` Robin Murphy
@ 2021-03-19 16:58     ` John Garry
  2021-03-19 19:20       ` Robin Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2021-03-19 16:58 UTC (permalink / raw)
  To: Robin Murphy, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, linuxarm

On 19/03/2021 16:13, Robin Murphy wrote:
> On 2021-03-19 13:25, John Garry wrote:
>> Move the IOVA size power-of-2 rcache roundup into the IOVA allocator.
>>
>> This is to eventually make it possible to be able to configure the upper
>> limit of the IOVA rcache range.
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>>   drivers/iommu/dma-iommu.c |  8 ------
>>   drivers/iommu/iova.c      | 51 ++++++++++++++++++++++++++-------------
>>   2 files changed, 34 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index af765c813cc8..15b7270a5c2a 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
>> iommu_domain *domain,
>>       shift = iova_shift(iovad);
>>       iova_len = size >> shift;
>> -    /*
>> -     * Freeing non-power-of-two-sized allocations back into the IOVA 
>> caches
>> -     * will come back to bite us badly, so we have to waste a bit of 
>> space
>> -     * rounding up anything cacheable to make sure that can't happen. 
>> The
>> -     * order of the unadjusted size will still match upon freeing.
>> -     */
>> -    if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>> -        iova_len = roundup_pow_of_two(iova_len);
>>       dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index e6e2fa85271c..e62e9e30b30c 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct 
>> iova *iova,
>>   static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>           unsigned long size, unsigned long limit_pfn,
>> -            struct iova *new, bool size_aligned)
>> +            struct iova *new, bool size_aligned, bool fast)
>>   {
>>       struct rb_node *curr, *prev;
>>       struct iova *curr_iova;
>> @@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct 
>> iova_domain *iovad,
>>       unsigned long align_mask = ~0UL;
>>       unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
>> +    /*
>> +     * Freeing non-power-of-two-sized allocations back into the IOVA 
>> caches
>> +     * will come back to bite us badly, so we have to waste a bit of 
>> space
>> +     * rounding up anything cacheable to make sure that can't happen. 
>> The
>> +     * order of the unadjusted size will still match upon freeing.
>> +     */
>> +    if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>> +        size = roundup_pow_of_two(size);
> 
> If this transformation is only relevant to alloc_iova_fast(), and we 
> have to add a special parameter here to tell whether we were called from 
> alloc_iova_fast(), doesn't it seem more sensible to just do it in 
> alloc_iova_fast() rather than here?

We have the restriction that anything we put in the rcache needs to be a
power-of-2.

So then we have the issue of how to dynamically increase this rcache
threshold. The problem is that we may have many devices associated with
the same domain. So, in theory, we can't assume that, when we increase
the threshold, some other device won't try to fast free an IOVA which
was allocated prior to the increase and so was not rounded up.

I'm very open to better (or less bad) suggestions on how to do this ...

I could say that we only allow this for a group with a single device, so 
these sort of things don't have to be worried about, but even then the 
iommu_group internals are not readily accessible here.

> 
> But then the API itself has no strict requirement that a pfn passed to 
> free_iova_fast() wasn't originally allocated with alloc_iova(), so 
> arguably hiding the adjustment away makes it less clear that the 
> responsibility is really on any caller of free_iova_fast() to make sure 
> they don't get things wrong.
> 

alloc_iova() doesn't round up to a power-of-2, so wouldn't it be broken to
do that?

Cheers,
John


* Re: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
  2021-03-19 13:25 ` [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size() John Garry
@ 2021-03-19 17:00   ` Robin Murphy
  2021-03-19 18:02     ` John Garry
                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Robin Murphy @ 2021-03-19 17:00 UTC (permalink / raw)
  To: John Garry, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, linuxarm

On 2021-03-19 13:25, John Garry wrote:
> Add a function to allow setting the max size which we want to optimise DMA
> mappings for.

It seems neat in theory - particularly for packet-based interfaces that 
might have a known fixed size of data unit that they're working on at 
any given time - but aren't there going to be many cases where the 
driver has no idea because it depends on whatever size(s) of request 
userspace happens to throw at it? Even if it does know the absolute 
maximum size of thing it could ever transfer, that could be 
impractically large in areas like video/AI/etc., so it could still be 
hard to make a reasonable decision.

Being largely workload-dependent is why I still think this should be a 
command-line or sysfs tuneable - we could set the default based on how 
much total memory is available, but ultimately it's the end user who 
knows what the workload is going to be and what they care about 
optimising for.

Another thought (which I'm almost reluctant to share) is that I would 
*love* to try implementing a self-tuning strategy that can detect high 
contention on particular allocation sizes and adjust the caches on the 
fly, but I can easily imagine that having enough inherent overhead to 
end up being an impractical (but fun) waste of time.

Robin.

> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   drivers/iommu/dma-iommu.c   |  2 +-
>   include/linux/dma-map-ops.h |  1 +
>   include/linux/dma-mapping.h |  5 +++++
>   kernel/dma/mapping.c        | 11 +++++++++++
>   4 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index a5dfbd6c0496..d35881fcfb9c 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -447,7 +447,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
>   	return (dma_addr_t)iova << shift;
>   }
>   
> -__maybe_unused
>   static void iommu_dma_set_opt_size(struct device *dev, size_t size)
>   {
>   	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> @@ -1278,6 +1277,7 @@ static const struct dma_map_ops iommu_dma_ops = {
>   	.map_resource		= iommu_dma_map_resource,
>   	.unmap_resource		= iommu_dma_unmap_resource,
>   	.get_merge_boundary	= iommu_dma_get_merge_boundary,
> +	.set_max_opt_size	= iommu_dma_set_opt_size,
>   };
>   
>   /*
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 51872e736e7b..fed7a183b3b9 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -64,6 +64,7 @@ struct dma_map_ops {
>   	u64 (*get_required_mask)(struct device *dev);
>   	size_t (*max_mapping_size)(struct device *dev);
>   	unsigned long (*get_merge_boundary)(struct device *dev);
> +	void (*set_max_opt_size)(struct device *dev, size_t size);
>   };
>   
>   #ifdef CONFIG_DMA_OPS
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 2a984cb4d1e0..91fe770145d4 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -144,6 +144,7 @@ u64 dma_get_required_mask(struct device *dev);
>   size_t dma_max_mapping_size(struct device *dev);
>   bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
>   unsigned long dma_get_merge_boundary(struct device *dev);
> +void dma_set_max_opt_size(struct device *dev, size_t size);
>   #else /* CONFIG_HAS_DMA */
>   static inline dma_addr_t dma_map_page_attrs(struct device *dev,
>   		struct page *page, size_t offset, size_t size,
> @@ -257,6 +258,10 @@ static inline unsigned long dma_get_merge_boundary(struct device *dev)
>   {
>   	return 0;
>   }
> +static inline void dma_set_max_opt_size(struct device *dev, size_t size)
> +{
> +}
> +
>   #endif /* CONFIG_HAS_DMA */
>   
>   struct page *dma_alloc_pages(struct device *dev, size_t size,
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index b6a633679933..59e6acb1c471 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -608,3 +608,14 @@ unsigned long dma_get_merge_boundary(struct device *dev)
>   	return ops->get_merge_boundary(dev);
>   }
>   EXPORT_SYMBOL_GPL(dma_get_merge_boundary);
> +
> +void dma_set_max_opt_size(struct device *dev, size_t size)
> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> +	if (!ops || !ops->set_max_opt_size)
> +		return;
> +
> +	ops->set_max_opt_size(dev, size);
> +}
> +EXPORT_SYMBOL_GPL(dma_set_max_opt_size);
> 


* Re: [PATCH 3/6] iova: Allow rcache range upper limit to be configurable
  2021-03-19 16:25   ` Robin Murphy
@ 2021-03-19 17:26     ` John Garry
  2021-03-31 10:53       ` Robin Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2021-03-19 17:26 UTC (permalink / raw)
  To: Robin Murphy, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, linuxarm

On 19/03/2021 16:25, Robin Murphy wrote:
> On 2021-03-19 13:25, John Garry wrote:
>> Some LLDs may request DMA mappings whose IOVA length exceeds that of the
>> current rcache upper limit.
>>
>> This means that allocations for those IOVAs will never be cached, and
>> always must be allocated and freed from the RB tree per DMA mapping 
>> cycle.
>> This has a significant effect on performance, more so since commit
>> 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search
>> fails"), as discussed at [0].
>>
>> Allow the range of cached IOVAs to be increased, by providing an API 
>> to set
>> the upper limit, which is capped at IOVA_RANGE_CACHE_MAX_SIZE.
>>
>> For simplicity, the full range of IOVA rcaches is allocated and 
>> initialized
>> at IOVA domain init time.
>>
>> Setting the range upper limit will fail if the domain is already live, i.e.
>> it must be done before the tree contains any IOVA nodes. This is required to
>> ensure that any cached IOVAs comply with the rule that their size is a
>> power-of-2.
>>
>> [0] 
>> https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leizhen@huawei.com/ 
>>
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>>   drivers/iommu/iova.c | 37 +++++++++++++++++++++++++++++++++++--
>>   include/linux/iova.h | 11 ++++++++++-
>>   2 files changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index cecc74fb8663..d4f2f7fbbd84 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -49,6 +49,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned 
>> long granule,
>>       iovad->flush_cb = NULL;
>>       iovad->fq = NULL;
>>       iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
>> +    iovad->rcache_max_size = IOVA_RANGE_CACHE_DEFAULT_SIZE;
>>       rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
>>       rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
>>       init_iova_rcaches(iovad);
>> @@ -194,7 +195,7 @@ static int __alloc_and_insert_iova_range(struct 
>> iova_domain *iovad,
>>        * rounding up anything cacheable to make sure that can't 
>> happen. The
>>        * order of the unadjusted size will still match upon freeing.
>>        */
>> -    if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>> +    if (fast && size < (1 << (iovad->rcache_max_size - 1)))
>>           size = roundup_pow_of_two(size);
>>       if (size_aligned)
>> @@ -901,7 +902,7 @@ static bool iova_rcache_insert(struct iova_domain 
>> *iovad, unsigned long pfn,
>>   {
>>       unsigned int log_size = order_base_2(size);
>> -    if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
>> +    if (log_size >= iovad->rcache_max_size)
>>           return false;
>>       return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
>> @@ -946,6 +947,38 @@ static unsigned long __iova_rcache_get(struct 
>> iova_rcache *rcache,
>>       return iova_pfn;
>>   }
>> +void iova_rcache_set_upper_limit(struct iova_domain *iovad,
>> +                 unsigned long iova_len)
>> +{
>> +    unsigned int rcache_index = order_base_2(iova_len) + 1;
>> +    struct rb_node *rb_node = &iovad->anchor.node;
>> +    unsigned long flags;
>> +    int count = 0;
>> +
>> +    spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>> +    if (rcache_index <= iovad->rcache_max_size)
>> +        goto out;
>> +
>> +    while (1) {
>> +        rb_node = rb_prev(rb_node);
>> +        if (!rb_node)
>> +            break;
>> +        count++;
>> +    }
>> +
>> +    /*
>> +     * If there are already IOVA nodes present in the tree, then don't
>> +     * allow range upper limit to be set.
>> +     */
>> +    if (count != iovad->reserved_node_count)
>> +        goto out;
>> +
>> +    iovad->rcache_max_size = min_t(unsigned long, rcache_index,
>> +                       IOVA_RANGE_CACHE_MAX_SIZE);
>> +out:
>> +    spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>> +}
>> +
>>   /*
>>    * Try to satisfy IOVA allocation range from rcache.  Fail if requested
>>    * size is too big or the DMA limit we are given isn't satisfied by the
>> diff --git a/include/linux/iova.h b/include/linux/iova.h
>> index fd3217a605b2..952b81b54ef7 100644
>> --- a/include/linux/iova.h
>> +++ b/include/linux/iova.h
>> @@ -25,7 +25,8 @@ struct iova {
>>   struct iova_magazine;
>>   struct iova_cpu_rcache;
>> -#define IOVA_RANGE_CACHE_MAX_SIZE 6    /* log of max cached IOVA 
>> range size (in pages) */
>> +#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
>> +#define IOVA_RANGE_CACHE_MAX_SIZE 10 /* log of max cached IOVA range 
>> size (in pages) */
> 
> No.
> 
> And why? If we're going to allocate massive caches anyway, whatever is 
> the point of *not* using all of them?
> 

I wanted to keep the same effective threshold for devices today, unless 
set otherwise.

The reason is that I didn't know if a blanket increase would cause 
regressions, and I was taking the super-safe road. Specifically some 
systems may be very IOVA space limited, and just work today by not 
caching large IOVAs.

And in the precursor thread you wrote "We can't arbitrarily *increase* 
the scope of caching once a domain is active due to the size-rounding-up 
requirement, which would be prohibitive to larger allocations if applied 
universally" (sorry for quoting)

I took the last part to mean that we shouldn't apply this increase in 
threshold globally.

> It only makes sense for a configuration knob to affect the actual rcache 
> and depot allocations - that way, big high-throughput systems with 
> plenty of memory can spend it on better performance, while small systems 
> - that often need IOMMU scatter-gather precisely *because* memory is 
> tight and thus easily fragmented - don't have to pay the (not 
> insignificant) cost for caches they don't need.

So do you suggest just making IOVA_RANGE_CACHE_MAX_SIZE a kconfig option?

Thanks,
John

> 
> Robin.
> 
>>   #define MAX_GLOBAL_MAGS 32    /* magazines per bin */
>>   struct iova_rcache {
>> @@ -74,6 +75,7 @@ struct iova_domain {
>>       unsigned long    start_pfn;    /* Lower limit for this domain */
>>       unsigned long    dma_32bit_pfn;
>>       unsigned long    max32_alloc_size; /* Size of last failed 
>> allocation */
>> +    unsigned long    rcache_max_size; /* Upper limit of cached IOVA 
>> RANGE */
>>       struct iova_fq __percpu *fq;    /* Flush Queue */
>>       atomic64_t    fq_flush_start_cnt;    /* Number of TLB flushes that
>> @@ -158,6 +160,8 @@ int init_iova_flush_queue(struct iova_domain *iovad,
>>   struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
>>   void put_iova_domain(struct iova_domain *iovad);
>>   void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain 
>> *iovad);
>> +void iova_rcache_set_upper_limit(struct iova_domain *iovad,
>> +                 unsigned long iova_len);
>>   #else
>>   static inline int iova_cache_get(void)
>>   {
>> @@ -238,6 +242,11 @@ static inline void free_cpu_cached_iovas(unsigned 
>> int cpu,
>>                        struct iova_domain *iovad)
>>   {
>>   }
>> +
>> +static inline void iova_rcache_set_upper_limit(struct iova_domain 
>> *iovad,
>> +                           unsigned long iova_len)
>> +{
>> +}
>>   #endif
>>   #endif
>>
> .



* Re: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
  2021-03-19 17:00   ` Robin Murphy
@ 2021-03-19 18:02     ` John Garry
  2021-03-31  8:01     ` Salil Mehta
  2021-03-31  8:08     ` Salil Mehta
  2 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2021-03-19 18:02 UTC (permalink / raw)
  To: Robin Murphy, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, linuxarm

On 19/03/2021 17:00, Robin Murphy wrote:
> On 2021-03-19 13:25, John Garry wrote:
>> Add a function to allow setting the max size which we want to optimise
>> DMA mappings for.
> 
> It seems neat in theory - particularly for packet-based interfaces that 
> might have a known fixed size of data unit that they're working on at 
> any given time - but aren't there going to be many cases where the 
> driver has no idea because it depends on whatever size(s) of request 
> userspace happens to throw at it? Even if it does know the absolute 
> maximum size of thing it could ever transfer, that could be 
> impractically large in areas like video/AI/etc., so it could still be 
> hard to make a reasonable decision.

So if you consider the SCSI stack, which is my interest, we know the max 
segment size and we know the max number of segments per request, so we 
should know the theoretical upper limit of the actual IOVA length we can 
get.

Indeed, from my experiment on my SCSI host, the max IOVA len is found to be 
507904 bytes, which is PAGE_SIZE * 124 (124 being the max sg ents there). 
Incidentally, that means that we want an RCACHE RANGE MAX of 8, not 6.
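
To sanity-check that figure, here is a tiny userspace sketch of the 
arithmetic (my own illustration, assuming 4K pages and that an IOVA is 
only cached while its base-2 order is below IOVA_RANGE_CACHE_MAX_SIZE, 
as in iova_rcache_insert()):

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096, max_sg_ents = 124;
	unsigned long max_len = page_size * max_sg_ents;	/* 507904 */
	unsigned int order = 0;

	/* order_base_2(): smallest order with 1 << order >= max_sg_ents */
	while ((1UL << order) < max_sg_ents)
		order++;

	printf("max IOVA len %lu bytes (%lu pages), order %u -> need MAX >= %u\n",
	       max_len, max_sg_ents, order, order + 1);
	return 0;
}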

> 
> Being largely workload-dependent is why I still think this should be a 
> command-line or sysfs tuneable - we could set the default based on how 
> much total memory is available, but ultimately it's the end user who 
> knows what the workload is going to be and what they care about 
> optimising for.

If that hardware is only found in a server, then the extra memory cost 
would be trivial, so setting it to the max is the standard approach.
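
For a rough sense of the memory cost (an estimate only, assuming the 
rcache still preallocates two magazines of IOVA_MAG_SIZE = 128 pfns per 
CPU for each cached order, with depot magazines only allocated on demand):

	2 magazines * 128 pfns * 8 bytes ~= 2 KiB per CPU per cached order

so raising the limit from order 6 to order 10 on, say, a 128-CPU system 
would be in the region of 2 KiB * 4 * 128 ~= 1 MiB up front.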

> 
> Another thought (which I'm almost reluctant to share) is that I would 
> *love* to try implementing a self-tuning strategy that can detect high 
> contention on particular allocation sizes and adjust the caches on the 
> fly, but I can easily imagine that having enough inherent overhead to 
> end up being an impractical (but fun) waste of time.
> 

For now, I just want to recover the performance lost recently :)

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
  2021-03-19 16:58     ` John Garry
@ 2021-03-19 19:20       ` Robin Murphy
  2021-03-22 15:01         ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: Robin Murphy @ 2021-03-19 19:20 UTC (permalink / raw)
  To: John Garry, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, linuxarm

On 2021-03-19 16:58, John Garry wrote:
> On 19/03/2021 16:13, Robin Murphy wrote:
>> On 2021-03-19 13:25, John Garry wrote:
>>> Move the IOVA size power-of-2 rcache roundup into the IOVA allocator.
>>>
>>> This is to eventually make it possible to be able to configure the upper
>>> limit of the IOVA rcache range.
>>>
>>> Signed-off-by: John Garry <john.garry@huawei.com>
>>> ---
>>>   drivers/iommu/dma-iommu.c |  8 ------
>>>   drivers/iommu/iova.c      | 51 ++++++++++++++++++++++++++-------------
>>>   2 files changed, 34 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>> index af765c813cc8..15b7270a5c2a 100644
>>> --- a/drivers/iommu/dma-iommu.c
>>> +++ b/drivers/iommu/dma-iommu.c
>>> @@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
>>> iommu_domain *domain,
>>>       shift = iova_shift(iovad);
>>>       iova_len = size >> shift;
>>> -    /*
>>> -     * Freeing non-power-of-two-sized allocations back into the IOVA 
>>> caches
>>> -     * will come back to bite us badly, so we have to waste a bit of 
>>> space
>>> -     * rounding up anything cacheable to make sure that can't 
>>> happen. The
>>> -     * order of the unadjusted size will still match upon freeing.
>>> -     */
>>> -    if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>>> -        iova_len = roundup_pow_of_two(iova_len);
>>>       dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index e6e2fa85271c..e62e9e30b30c 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct 
>>> iova *iova,
>>>   static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>>           unsigned long size, unsigned long limit_pfn,
>>> -            struct iova *new, bool size_aligned)
>>> +            struct iova *new, bool size_aligned, bool fast)
>>>   {
>>>       struct rb_node *curr, *prev;
>>>       struct iova *curr_iova;
>>> @@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct 
>>> iova_domain *iovad,
>>>       unsigned long align_mask = ~0UL;
>>>       unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
>>> +    /*
>>> +     * Freeing non-power-of-two-sized allocations back into the IOVA 
>>> caches
>>> +     * will come back to bite us badly, so we have to waste a bit of 
>>> space
>>> +     * rounding up anything cacheable to make sure that can't 
>>> happen. The
>>> +     * order of the unadjusted size will still match upon freeing.
>>> +     */
>>> +    if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
>>> +        size = roundup_pow_of_two(size);
>>
>> If this transformation is only relevant to alloc_iova_fast(), and we 
>> have to add a special parameter here to tell whether we were called 
>> from alloc_iova_fast(), doesn't it seem more sensible to just do it in 
>> alloc_iova_fast() rather than here?
> 
> We have the restriction that anything we put in the rcache needs to be a 
> power-of-2.

I was really only talking about the apparently silly structure of:

void foo(bool in_bar) {
	if (in_bar)
		//do thing
	...
}
void bar() {
	foo(true);
}

vs.:

void foo() {
	...
}
void bar() {
	//do thing
	foo();
}

> So then we have the issue of how to dynamically increase this rcache 
> threshold. The problem is that we may have many devices associated with 
> the same domain. So, in theory, we can't assume that, when we increase 
> the threshold, no other device will try to fast free an IOVA which was 
> allocated prior to the increase and was not rounded up.
> 
> I'm very open to better (or less bad) suggestions on how to do this ...

...but yes, regardless of exactly where it happens, rounding up or not 
is the problem for rcaches in general. I've said several times that my 
preferred approach is to not change it that dynamically at all, but 
instead treat it more like we treat the default domain type.

> I could say that we only allow this for a group with a single device, so 
> these sorts of things don't have to be worried about, but even then the 
> iommu_group internals are not readily accessible here.
> 
>>
>> But then the API itself has no strict requirement that a pfn passed to 
>> free_iova_fast() wasn't originally allocated with alloc_iova(), so 
>> arguably hiding the adjustment away makes it less clear that the 
>> responsibility is really on any caller of free_iova_fast() to make 
>> sure they don't get things wrong.
>>
> 
> alloc_iova() doesn't round up to pow-of-2, so wouldn't it be broken to do 
> that?

Well, right now neither call rounds up, which is why iommu-dma takes 
care to avoid any issues by explicitly rounding up for itself 
beforehand. I'm just concerned that giving the impression that the API 
takes care of everything for itself will make it easier to write broken 
code in future, if that impression is in fact not entirely true.

I don't even think it's very likely that someone would manage to hit 
that rather wacky alloc/free pattern either way, I just know that 
getting wrong-sized things into the rcaches is an absolute sod to debug, 
so...

Robin.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
  2021-03-19 19:20       ` Robin Murphy
@ 2021-03-22 15:01         ` John Garry
  2021-03-31  9:58           ` Robin Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2021-03-22 15:01 UTC (permalink / raw)
  To: Robin Murphy, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, Linuxarm

On 19/03/2021 19:20, Robin Murphy wrote:

Hi Robin,

>> So then we have the issue of how to dynamically increase this rcache
>> threshold. The problem is that we may have many devices associated with
>> the same domain. So, in theory, we can't assume that, when we increase
>> the threshold, no other device will try to fast free an IOVA which
>> was allocated prior to the increase and was not rounded up.
>>
>> I'm very open to better (or less bad) suggestions on how to do this ...
> ...but yes, regardless of exactly where it happens, rounding up or not
> is the problem for rcaches in general. I've said several times that my
> preferred approach is to not change it that dynamically at all, but
> instead treat it more like we treat the default domain type.
> 

Can you remind me of that idea? I don't remember you mentioning using 
default domain handling as a reference in any context.

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
  2021-03-19 17:00   ` Robin Murphy
  2021-03-19 18:02     ` John Garry
@ 2021-03-31  8:01     ` Salil Mehta
  2021-03-31  8:08     ` Salil Mehta
  2 siblings, 0 replies; 23+ messages in thread
From: Salil Mehta @ 2021-03-31  8:01 UTC (permalink / raw)
  To: Robin Murphy, John Garry, joro, will, jejb, martin.petersen, hch,
	m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, Linuxarm

> From: iommu [mailto:iommu-bounces@lists.linux-foundation.org] On Behalf Of
> Robin Murphy
> Sent: Friday, March 19, 2021 5:00 PM
> To: John Garry <john.garry@huawei.com>; joro@8bytes.org; will@kernel.org;
> jejb@linux.ibm.com; martin.petersen@oracle.com; hch@lst.de;
> m.szyprowski@samsung.com
> Cc: iommu@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
> linux-scsi@vger.kernel.org; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
> 
> On 2021-03-19 13:25, John Garry wrote:
> > Add a function to allow the max size which we want to optimise DMA mappings
> > for.
> 
> It seems neat in theory - particularly for packet-based interfaces that
> might have a known fixed size of data unit that they're working on at
> any given time - but aren't there going to be many cases where the
> driver has no idea because it depends on whatever size(s) of request
> userspace happens to throw at it? Even if it does know the absolute
> maximum size of thing it could ever transfer, that could be
> impractically large in areas like video/AI/etc., so it could still be
> hard to make a reasonable decision.


This is also the case in networking workloads where we have MTU set but
actual packet sizes might vary.


> 
> Being largely workload-dependent is why I still think this should be a
> command-line or sysfs tuneable - we could set the default based on how
> much total memory is available, but ultimately it's the end user who
> knows what the workload is going to be and what they care about
> optimising for.
> 
> Another thought (which I'm almost reluctant to share) is that I would
> *love* to try implementing a self-tuning strategy that can detect high
> contention on particular allocation sizes and adjust the caches on the
> fly, but I can easily imagine that having enough inherent overhead to
> end up being an impractical (but fun) waste of time.

This might be particularly useful for NICs, where packet sizes vary
from 64K to 9K. Hence, without an optimal strategy this can affect the
performance of networking workloads.


> 
> Robin.
> 
> > Signed-off-by: John Garry <john.garry@huawei.com>
> > ---
> >   drivers/iommu/dma-iommu.c   |  2 +-
> >   include/linux/dma-map-ops.h |  1 +
> >   include/linux/dma-mapping.h |  5 +++++
> >   kernel/dma/mapping.c        | 11 +++++++++++
> >   4 files changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index a5dfbd6c0496..d35881fcfb9c 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -447,7 +447,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain
> *domain,
> >   	return (dma_addr_t)iova << shift;
> >   }
> >
> > -__maybe_unused
> >   static void iommu_dma_set_opt_size(struct device *dev, size_t size)
> >   {
> >   	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> > @@ -1278,6 +1277,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> >   	.map_resource		= iommu_dma_map_resource,
> >   	.unmap_resource		= iommu_dma_unmap_resource,
> >   	.get_merge_boundary	= iommu_dma_get_merge_boundary,
> > +	.set_max_opt_size	= iommu_dma_set_opt_size,
> >   };
> >
> >   /*
> > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > index 51872e736e7b..fed7a183b3b9 100644
> > --- a/include/linux/dma-map-ops.h
> > +++ b/include/linux/dma-map-ops.h
> > @@ -64,6 +64,7 @@ struct dma_map_ops {
> >   	u64 (*get_required_mask)(struct device *dev);
> >   	size_t (*max_mapping_size)(struct device *dev);
> >   	unsigned long (*get_merge_boundary)(struct device *dev);
> > +	void (*set_max_opt_size)(struct device *dev, size_t size);
> >   };
> >
> >   #ifdef CONFIG_DMA_OPS
> > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> > index 2a984cb4d1e0..91fe770145d4 100644
> > --- a/include/linux/dma-mapping.h
> > +++ b/include/linux/dma-mapping.h
> > @@ -144,6 +144,7 @@ u64 dma_get_required_mask(struct device *dev);
> >   size_t dma_max_mapping_size(struct device *dev);
> >   bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
> >   unsigned long dma_get_merge_boundary(struct device *dev);
> > +void dma_set_max_opt_size(struct device *dev, size_t size);
> >   #else /* CONFIG_HAS_DMA */
> >   static inline dma_addr_t dma_map_page_attrs(struct device *dev,
> >   		struct page *page, size_t offset, size_t size,
> > @@ -257,6 +258,10 @@ static inline unsigned long dma_get_merge_boundary(struct
> device *dev)
> >   {
> >   	return 0;
> >   }
> > +static inline void dma_set_max_opt_size(struct device *dev, size_t size)
> > +{
> > +}
> > +
> >   #endif /* CONFIG_HAS_DMA */
> >
> >   struct page *dma_alloc_pages(struct device *dev, size_t size,
> > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > index b6a633679933..59e6acb1c471 100644
> > --- a/kernel/dma/mapping.c
> > +++ b/kernel/dma/mapping.c
> > @@ -608,3 +608,14 @@ unsigned long dma_get_merge_boundary(struct device *dev)
> >   	return ops->get_merge_boundary(dev);
> >   }
> >   EXPORT_SYMBOL_GPL(dma_get_merge_boundary);
> > +
> > +void dma_set_max_opt_size(struct device *dev, size_t size)
> > +{
> > +	const struct dma_map_ops *ops = get_dma_ops(dev);
> > +
> > +	if (!ops || !ops->set_max_opt_size)
> > +		return;
> > +
> > +	ops->set_max_opt_size(dev, size);
> > +}
> > +EXPORT_SYMBOL_GPL(dma_set_max_opt_size);
> >
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
  2021-03-19 17:00   ` Robin Murphy
  2021-03-19 18:02     ` John Garry
  2021-03-31  8:01     ` Salil Mehta
@ 2021-03-31  8:08     ` Salil Mehta
  2 siblings, 0 replies; 23+ messages in thread
From: Salil Mehta @ 2021-03-31  8:08 UTC (permalink / raw)
  To: Robin Murphy, John Garry, joro, will, jejb, martin.petersen, hch,
	m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, Linuxarm

(+) correction below, sorry for the typo in earlier post.

> From: iommu [mailto:iommu-bounces@lists.linux-foundation.org] On Behalf Of
> Robin Murphy
> Sent: Friday, March 19, 2021 5:00 PM
> To: John Garry <john.garry@huawei.com>; joro@8bytes.org; will@kernel.org;
> jejb@linux.ibm.com; martin.petersen@oracle.com; hch@lst.de;
> m.szyprowski@samsung.com
> Cc: iommu@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
> linux-scsi@vger.kernel.org; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
> 
> On 2021-03-19 13:25, John Garry wrote:
> > Add a function to allow the max size which we want to optimise DMA mappings
> > for.
> 
> It seems neat in theory - particularly for packet-based interfaces that
> might have a known fixed size of data unit that they're working on at
> any given time - but aren't there going to be many cases where the
> driver has no idea because it depends on whatever size(s) of request
> userspace happens to throw at it? Even if it does know the absolute
> maximum size of thing it could ever transfer, that could be
> impractically large in areas like video/AI/etc., so it could still be
> hard to make a reasonable decision.


This is also the case for networking workloads where we have MTU set but
actual packet sizes might vary.

> 
> Being largely workload-dependent is why I still think this should be a
> command-line or sysfs tuneable - we could set the default based on how
> much total memory is available, but ultimately it's the end user who
> knows what the workload is going to be and what they care about
> optimising for.
> 
> Another thought (which I'm almost reluctant to share) is that I would
> *love* to try implementing a self-tuning strategy that can detect high
> contention on particular allocation sizes and adjust the caches on the
> fly, but I can easily imagine that having enough inherent overhead to
> end up being an impractical (but fun) waste of time.


This might be particularly useful for NICs, where packet sizes vary
from 64B to 9K. But without an optimal strategy this can affect the
performance of networking workloads.


> 
> Robin.
> 
> > Signed-off-by: John Garry <john.garry@huawei.com>
> > ---
> >   drivers/iommu/dma-iommu.c   |  2 +-
> >   include/linux/dma-map-ops.h |  1 +
> >   include/linux/dma-mapping.h |  5 +++++
> >   kernel/dma/mapping.c        | 11 +++++++++++
> >   4 files changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index a5dfbd6c0496..d35881fcfb9c 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -447,7 +447,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain
> *domain,
> >   	return (dma_addr_t)iova << shift;
> >   }
> >
> > -__maybe_unused
> >   static void iommu_dma_set_opt_size(struct device *dev, size_t size)
> >   {
> >   	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> > @@ -1278,6 +1277,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> >   	.map_resource		= iommu_dma_map_resource,
> >   	.unmap_resource		= iommu_dma_unmap_resource,
> >   	.get_merge_boundary	= iommu_dma_get_merge_boundary,
> > +	.set_max_opt_size	= iommu_dma_set_opt_size,
> >   };
> >
> >   /*
> > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > index 51872e736e7b..fed7a183b3b9 100644
> > --- a/include/linux/dma-map-ops.h
> > +++ b/include/linux/dma-map-ops.h
> > @@ -64,6 +64,7 @@ struct dma_map_ops {
> >   	u64 (*get_required_mask)(struct device *dev);
> >   	size_t (*max_mapping_size)(struct device *dev);
> >   	unsigned long (*get_merge_boundary)(struct device *dev);
> > +	void (*set_max_opt_size)(struct device *dev, size_t size);
> >   };
> >
> >   #ifdef CONFIG_DMA_OPS
> > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> > index 2a984cb4d1e0..91fe770145d4 100644
> > --- a/include/linux/dma-mapping.h
> > +++ b/include/linux/dma-mapping.h
> > @@ -144,6 +144,7 @@ u64 dma_get_required_mask(struct device *dev);
> >   size_t dma_max_mapping_size(struct device *dev);
> >   bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
> >   unsigned long dma_get_merge_boundary(struct device *dev);
> > +void dma_set_max_opt_size(struct device *dev, size_t size);
> >   #else /* CONFIG_HAS_DMA */
> >   static inline dma_addr_t dma_map_page_attrs(struct device *dev,
> >   		struct page *page, size_t offset, size_t size,
> > @@ -257,6 +258,10 @@ static inline unsigned long dma_get_merge_boundary(struct
> device *dev)
> >   {
> >   	return 0;
> >   }
> > +static inline void dma_set_max_opt_size(struct device *dev, size_t size)
> > +{
> > +}
> > +
> >   #endif /* CONFIG_HAS_DMA */
> >
> >   struct page *dma_alloc_pages(struct device *dev, size_t size,
> > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > index b6a633679933..59e6acb1c471 100644
> > --- a/kernel/dma/mapping.c
> > +++ b/kernel/dma/mapping.c
> > @@ -608,3 +608,14 @@ unsigned long dma_get_merge_boundary(struct device *dev)
> >   	return ops->get_merge_boundary(dev);
> >   }
> >   EXPORT_SYMBOL_GPL(dma_get_merge_boundary);
> > +
> > +void dma_set_max_opt_size(struct device *dev, size_t size)
> > +{
> > +	const struct dma_map_ops *ops = get_dma_ops(dev);
> > +
> > +	if (!ops || !ops->set_max_opt_size)
> > +		return;
> > +
> > +	ops->set_max_opt_size(dev, size);
> > +}
> > +EXPORT_SYMBOL_GPL(dma_set_max_opt_size);
> >
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
  2021-03-22 15:01         ` John Garry
@ 2021-03-31  9:58           ` Robin Murphy
  2021-04-06 16:54             ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: Robin Murphy @ 2021-03-31  9:58 UTC (permalink / raw)
  To: John Garry, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, Linuxarm

On 2021-03-22 15:01, John Garry wrote:
> On 19/03/2021 19:20, Robin Murphy wrote:
> 
> Hi Robin,
> 
>>> So then we have the issue of how to dynamically increase this rcache
>>> threshold. The problem is that we may have many devices associated with
>>> the same domain. So, in theory, we can't assume that, when we increase
>>> the threshold, no other device will try to fast free an IOVA which
>>> was allocated prior to the increase and was not rounded up.
>>>
>>> I'm very open to better (or less bad) suggestions on how to do this ...
>> ...but yes, regardless of exactly where it happens, rounding up or not
>> is the problem for rcaches in general. I've said several times that my
>> preferred approach is to not change it that dynamically at all, but
>> instead treat it more like we treat the default domain type.
>>
> 
> Can you remind me of that idea? I don't remember you mentioning using 
> default domain handling as a reference in any context.

Sorry if the phrasing was unclear there - the allusion to default 
domains is new, it just occurred to me that what we do there is in fact 
fairly close to what I've suggested previously for this. In that case, 
we have a global policy set by the command line, which *can* be 
overridden per-domain via sysfs at runtime, provided the user is willing 
to tear the whole thing down. Using a similar approach here would give a 
fair degree of flexibility but still mean that changes never have to be 
made dynamically to a live domain.

Robin.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] iova: Allow rcache range upper limit to be configurable
  2021-03-19 17:26     ` John Garry
@ 2021-03-31 10:53       ` Robin Murphy
  0 siblings, 0 replies; 23+ messages in thread
From: Robin Murphy @ 2021-03-31 10:53 UTC (permalink / raw)
  To: John Garry, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, linuxarm

On 2021-03-19 17:26, John Garry wrote:
[...]
>>> @@ -25,7 +25,8 @@ struct iova {
>>>   struct iova_magazine;
>>>   struct iova_cpu_rcache;
>>> -#define IOVA_RANGE_CACHE_MAX_SIZE 6    /* log of max cached IOVA 
>>> range size (in pages) */
>>> +#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
>>> +#define IOVA_RANGE_CACHE_MAX_SIZE 10 /* log of max cached IOVA range 
>>> size (in pages) */
>>
>> No.
>>
>> And why? If we're going to allocate massive caches anyway, whatever is 
>> the point of *not* using all of them?
>>
> 
> I wanted to keep the same effective threshold for devices today, unless 
> set otherwise.
> 
> The reason is that I didn't know if a blanket increase would cause 
> regressions, and I was taking the super-safe road. Specifically some 
> systems may be very IOVA space limited, and just work today by not 
> caching large IOVAs.

alloc_iova_fast() will already clear out the caches if space is running 
low, so the caching itself shouldn't be an issue.

> And in the precursor thread you wrote "We can't arbitrarily *increase* 
> the scope of caching once a domain is active due to the size-rounding-up 
> requirement, which would be prohibitive to larger allocations if applied 
> universally" (sorry for quoting)
> 
> I took the last part to mean that we shouldn't apply this increase in 
> threshold globally.

I meant we can't increase the caching threshold as-is once the domain is 
in use, because that could result in odd-sized IOVAs previously 
allocated above the old threshold being later freed back into caches, 
then causing havoc the next time they get allocated (because they're not 
as big as the actual size being asked for). However, trying to address 
that by just size-aligning everything even above the caching threshold 
is liable to waste too much space on IOVA-constrained systems (e.g. a 
single 4K video frame may be ~35MB - rounding that up to 64MB each time 
would be hard to justify).
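
To make that failure mode concrete, a minimal userspace sketch of the 
bucket maths (assuming, as iova_rcache_insert() does, that magazines are 
indexed by the base-2 order of the size being freed):

#include <stdio.h>

/* order_base_2()-style rounding: smallest order with 1 << order >= size */
static unsigned int bucket(unsigned long size_pages)
{
	unsigned int order = 0;

	while ((1UL << order) < size_pages)
		order++;
	return order;
}

int main(void)
{
	/* A 124-page IOVA freed into the order-7 magazine will later be
	 * handed out for a 128-page request - 4 pages short of what the
	 * caller asked for. */
	printf("124 pages -> bucket %u (serves %lu-page allocations)\n",
	       bucket(124), 1UL << bucket(124));
	return 0;
}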

It follows from that that there's really no point in decoupling the 
rounding-up threshold from the actual caching threshold - you get all 
the wastage (both IOVA space and actual memory for the cache arrays) for 
no obvious benefit.

>> It only makes sense for a configuration knob to affect the actual 
>> rcache and depot allocations - that way, big high-throughput systems 
>> with plenty of memory can spend it on better performance, while small 
>> systems - that often need IOMMU scatter-gather precisely *because* 
>> memory is tight and thus easily fragmented - don't have to pay the 
>> (not insignificant) cost for caches they don't need.
> 
> So do you suggest just making IOVA_RANGE_CACHE_MAX_SIZE a Kconfig option?

Again, I'm less convinced by Kconfig since I imagine many people tuning 
server-class systems for their own particular workloads will be running 
standard enterprise distros, so I think end-user-accessible knobs will 
be the most valuable. That's not to say that a Kconfig option to set the 
default state of a command-line option (as we do elsewhere) won't be 
useful for embedded users, cloud providers, etc., just that I'm not sure 
it's worth it being the *only* option.
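
Purely as an illustration of that split - none of the names below exist, 
the parameter, Kconfig symbol and plumbing are all made up - a boot-time 
override of a Kconfig-provided default could look roughly like:

#include <linux/init.h>
#include <linux/kernel.h>

/* Hypothetical sketch in the spirit of iommu.strict/iommu.passthrough:
 * Kconfig picks the default, the command line can override it. */
static unsigned int iova_rcache_max_order = CONFIG_IOMMU_IOVA_RCACHE_ORDER;

static int __init iova_rcache_order_setup(char *str)
{
	return kstrtouint(str, 0, &iova_rcache_max_order);
}
early_param("iova_rcache_order", iova_rcache_order_setup);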

Robin.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
  2021-03-31  9:58           ` Robin Murphy
@ 2021-04-06 16:54             ` John Garry
  2021-04-14 17:44               ` John Garry
  0 siblings, 1 reply; 23+ messages in thread
From: John Garry @ 2021-04-06 16:54 UTC (permalink / raw)
  To: Robin Murphy, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, Linuxarm

>>>> So then we have the issue of how to dynamically increase this rcache
>>>> threshold. The problem is that we may have many devices associated with
>>>> the same domain. So, in theory, we can't assume that, when we increase
>>>> the threshold, no other device will try to fast free an IOVA which
>>>> was allocated prior to the increase and was not rounded up.
>>>>
>>>> I'm very open to better (or less bad) suggestions on how to do this ...
>>> ...but yes, regardless of exactly where it happens, rounding up or not
>>> is the problem for rcaches in general. I've said several times that my
>>> preferred approach is to not change it that dynamically at all, but
>>> instead treat it more like we treat the default domain type.
>>>
>>
>> Can you remind me of that idea? I don't remember you mentioning using 
>> default domain handling as a reference in any context.
> 

Hi Robin,

> Sorry if the phrasing was unclear there - the allusion to default 
> domains is new, it just occurred to me that what we do there is in fact 
> fairly close to what I've suggested previously for this. In that case, 
> we have a global policy set by the command line, which *can* be 
> overridden per-domain via sysfs at runtime, provided the user is willing 
> to tear the whole thing down. Using a similar approach here would give a 
> fair degree of flexibility but still mean that changes never have to be 
> made dynamically to a live domain.

So are you saying that we can handle it similarly to how we can now handle 
changing the default domain for an IOMMU group via sysfs? If so, that is 
just not practical here. The reason is that this particular DMA engine 
provides the block device backing the / mount point, so if we unbind the 
driver, we lose the / mount point.

And I am not sure if the end user would even know how to set such a 
tunable. Or, in this case, why the end user would not want the optimized 
range configured always.

I'd still rather if the device driver could provide info which can be 
used to configure this before or during probing.

Cheers,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
  2021-04-06 16:54             ` John Garry
@ 2021-04-14 17:44               ` John Garry
  0 siblings, 0 replies; 23+ messages in thread
From: John Garry @ 2021-04-14 17:44 UTC (permalink / raw)
  To: Robin Murphy, joro, will, jejb, martin.petersen, hch, m.szyprowski
  Cc: iommu, linux-kernel, linux-scsi, Linuxarm

On 06/04/2021 17:54, John Garry wrote:

Hi Robin,

> 
>> Sorry if the phrasing was unclear there - the allusion to default 
>> domains is new, it just occurred to me that what we do there is in 
>> fact fairly close to what I've suggested previously for this. In that 
>> case, we have a global policy set by the command line, which *can* be 
>> overridden per-domain via sysfs at runtime, provided the user is 
>> willing to tear the whole thing down. Using a similar approach here 
>> would give a fair degree of flexibility but still mean that changes 
>> never have to be made dynamically to a live domain.
> 
> So are you saying that we can handle it similarly to how we can now handle 
> changing the default domain for an IOMMU group via sysfs? If so, that is 
> just not practical here. The reason is that this particular DMA engine 
> provides the block device backing the / mount point, so if we unbind the 
> driver, we lose the / mount point.
> 
> And I am not sure if the end user would even know how to set such a 
> tunable. Or, in this case, why the end user would not want the optimized 
> range configured always.
> 
> I'd still rather if the device driver could provide info which can be 
> used to configure this before or during probing.

As a new solution, how about doing both of these:
a. Add a per-IOMMU group sysfs file to set this tunable. This works the 
same as how we change the default domain, and has all the same 
restrictions/steps. I think that this is what you are already suggesting.
b. Provide a DMA mapping API to set this value, similar to this current 
series. In the IOMMU backend for that API, we record the new range value 
and return -EPROBE_DEFER when successful. On the reprobe we reset the 
default domain for the device's IOMMU group, with the IOVA domain rcache 
range configured as previously requested. Again, allocating the new 
default domain is similar to how we change the default domain type today. 
This means that we don't play with a live domain; the downside is that we 
need to defer the probe (see the sketch below).
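
For illustration only, the driver side of (b) might look roughly like the 
below (the int return is purely hypothetical - dma_set_max_opt_size() as 
posted in this series returns void - and the names are made up):

#include <linux/dma-mapping.h>
#include <linux/platform_device.h>
#include <linux/sizes.h>

static int foo_probe(struct platform_device *pdev)
{
	int ret;

	/* Hypothetical int-returning variant: the first pass records the
	 * hint and returns -EPROBE_DEFER while the group's default domain
	 * is rebuilt with the new rcache range. */
	ret = dma_set_max_opt_size(&pdev->dev, SZ_512K);
	if (ret)
		return ret;

	/* ...rest of the normal probe path... */
	return 0;
}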

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-04-14 17:47 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-19 13:25 [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured John Garry
2021-03-19 13:25 ` [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator John Garry
2021-03-19 16:13   ` Robin Murphy
2021-03-19 16:58     ` John Garry
2021-03-19 19:20       ` Robin Murphy
2021-03-22 15:01         ` John Garry
2021-03-31  9:58           ` Robin Murphy
2021-04-06 16:54             ` John Garry
2021-04-14 17:44               ` John Garry
2021-03-19 13:25 ` [PATCH 2/6] iova: Add a per-domain count of reserved nodes John Garry
2021-03-19 13:25 ` [PATCH 3/6] iova: Allow rcache range upper limit to be configurable John Garry
2021-03-19 16:25   ` Robin Murphy
2021-03-19 17:26     ` John Garry
2021-03-31 10:53       ` Robin Murphy
2021-03-19 13:25 ` [PATCH 4/6] iommu: Add iommu_dma_set_opt_size() John Garry
2021-03-19 13:25 ` [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size() John Garry
2021-03-19 17:00   ` Robin Murphy
2021-03-19 18:02     ` John Garry
2021-03-31  8:01     ` Salil Mehta
2021-03-31  8:08     ` Salil Mehta
2021-03-19 13:25 ` [PATCH 6/6] scsi: hisi_sas: Set max optimal DMA size for v3 hw John Garry
2021-03-19 13:40 ` [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured Christoph Hellwig
2021-03-19 15:42   ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).