* [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
@ 2015-03-31 14:40 ` Sowmini Varadhan
  0 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-03-31 14:40 UTC (permalink / raw)
  To: sparclinux; +Cc: aik, anton, paulus, sowmini.varadhan, linuxppc-dev, davem

Addresses BenH's comments with one exception: I've left the
IOMMU_POOL_HASH as is, so that powerpc can tailor it to their
convenience.

I've not heard back from the IB folks, but I'm going to make
a judgement call here and go with the spin_lock. *If* they
report some significant benefit from the trylock, we'll probably
need to revisit this (and then probably start by re-examining
the hash function to avoid collisions, before resorting to
trylock).
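
(For reference, the pool selection that this hash feeds boils down to
the sketch below, condensed from patch 1/3; pick_pool() is a made-up
name, and the real code caches hash_32() per CPU at init time and
applies the mask at allocation time. Collisions are simply two busy
CPUs hashing to the same pool.)

#include <linux/hash.h>

/* IOMMU_POOL_HASHBITS is 4, so there are at most 16 pools */
static unsigned int pick_pool(unsigned int cpu, unsigned int npools)
{
	/* npools is a power of two, at most IOMMU_NR_POOLS */
	return hash_32(cpu, IOMMU_POOL_HASHBITS) & (npools - 1);
}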

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools and
    lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu pool management functions

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  171 +++++++--------------------
 arch/sparc/kernel/iommu_common.h  |    8 --
 arch/sparc/kernel/ldc.c           |  153 ++++++++++--------------
 arch/sparc/kernel/pci_sun4v.c     |  181 +++++++++++++----------------
 include/linux/iommu-common.h      |   48 ++++++++
 lib/Makefile                      |    2 +-
 lib/iommu-common.c                |  233 +++++++++++++++++++++++++++++++++++++
 8 files changed, 472 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
  2015-03-31 14:40 ` Sowmini Varadhan
@ 2015-03-31 14:40   ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-03-31 14:40 UTC (permalink / raw)
  To: sparclinux; +Cc: aik, anton, paulus, sowmini.varadhan, linuxppc-dev, davem

Investigation of multithreaded iperf experiments on an ethernet
interface shows the iommu->lock as the hottest lock identified by
lockstat, with on the order of 21M contentions out of 27M
acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_map_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.
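
As a rough sketch of how an arch driver is expected to drive this
infrastructure (illustration only, not part of the patch; my_tbl,
my_map, my_unmap and MY_PAGE_SHIFT are placeholder names, and the real
sparc64 conversion is in patch 2/3):

#include <linux/dma-mapping.h>
#include <linux/iommu-common.h>

#define MY_PAGE_SHIFT	13	/* stand-in for the arch IOMMU page shift */

/* .map (bitmap) and .table_map_base are filled in by the arch setup code */
static struct iommu_map_table my_tbl;

static void my_pool_setup(unsigned long num_entries)
{
	/* default number of pools, no large pool, span checking enabled */
	iommu_tbl_pool_init(&my_tbl, num_entries, MY_PAGE_SHIFT,
			    NULL /* no lazy_flush */, false, 0, false);
}

static dma_addr_t my_map(struct device *dev, unsigned long npages)
{
	/* no global table lock: the allocator locks only the pool it picks */
	unsigned long entry = iommu_tbl_range_alloc(dev, &my_tbl, npages, NULL);

	if (entry == DMA_ERROR_CODE)
		return DMA_ERROR_CODE;
	/* program the hardware entries for [entry, entry + npages) here */
	return my_tbl.table_map_base + (entry << MY_PAGE_SHIFT);
}

static void my_unmap(dma_addr_t dma_addr, unsigned long npages)
{
	/* passing DMA_ERROR_CODE derives the entry index from dma_addr */
	iommu_tbl_range_free(&my_tbl, dma_addr, npages, DMA_ERROR_CODE);
}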

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
    fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
    the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
    check, for use cases such as LDC.
v3: eliminate iommu_sparc, rearrange the ->demap indirection to
    be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: Feedback from benh@kernel.crashing.org and aik@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: caller should
    invoke these as needed before calling into the generic allocator

v6: Benh/DaveM discussion eliminating iommu_tbl_ops, but retaining flush_all
    optimization.

v7: one-time initialization of pool_hash from iommu_tbl_pool_init()

v8: Benh code review comments.

 include/linux/iommu-common.h |   48 +++++++++
 lib/Makefile                 |    2 +-
 lib/iommu-common.c           |  233 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 282 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 0000000..7b1b761
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,48 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include <linux/spinlock_types.h>
+#include <linux/device.h>
+#include <asm/page.h>
+
+#define IOMMU_POOL_HASHBITS     4
+#define IOMMU_NR_POOLS          (1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+	unsigned long	start;
+	unsigned long	end;
+	unsigned long	hint;
+	spinlock_t	lock;
+};
+
+struct iommu_map_table {
+	unsigned long		table_map_base;
+	unsigned long		table_shift;
+	unsigned long		nr_pools;
+	void			(*lazy_flush)(struct iommu_map_table *);
+	unsigned long		poolsize;
+	struct iommu_pool	pools[IOMMU_NR_POOLS];
+	u32			flags;
+#define	IOMMU_HAS_LARGE_POOL	0x00000001
+#define	IOMMU_NO_SPAN_BOUND	0x00000002
+	struct iommu_pool	large_pool;
+	unsigned long		*map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+				unsigned long num_entries,
+				u32 table_shift,
+				void (*lazy_flush)(struct iommu_map_table *),
+				bool large_pool, u32 npools,
+				bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+					   struct iommu_map_table *iommu,
+					   unsigned long npages,
+					   unsigned long *handle);
+
+extern void iommu_tbl_range_free(struct iommu_map_table *iommu,
+				 u64 dma_addr, unsigned long npages,
+				 unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2..60c22e6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,7 +106,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 0000000..35dacbe
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,233 @@
+/*
+ * IOMMU mmap management and range allocation functions.
+ * Based almost entirely upon the powerpc iommu allocator.
+ */
+
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/bug.h>
+#include <linux/iommu-helper.h>
+#include <linux/iommu-common.h>
+#include <linux/dma-mapping.h>
+#include <linux/hash.h>
+
+unsigned long iommu_large_alloc = 15;
+
+static	DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
+
+static void setup_iommu_pool_hash(void)
+{
+	unsigned int i;
+	static bool do_once;
+
+	if (do_once)
+		return;
+	do_once = true;
+	for_each_possible_cpu(i)
+		per_cpu(iommu_pool_hash, i) = hash_32(i, IOMMU_POOL_HASHBITS);
+}
+
+/*
+ * Initialize iommu_pool entries for the iommu_map_table. `num_entries'
+ * is the number of table entries. If `large_pool' is set to true,
+ * the top 1/4 of the table will be set aside for pool allocations
+ * of more than iommu_large_alloc pages.
+ */
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+				unsigned long num_entries,
+				u32 table_shift,
+				void (*lazy_flush)(struct iommu_map_table *),
+				bool large_pool, u32 npools,
+				bool skip_span_boundary_check)
+{
+	unsigned int start, i;
+	struct iommu_pool *p = &(iommu->large_pool);
+
+	setup_iommu_pool_hash();
+	if (npools == 0)
+		iommu->nr_pools = IOMMU_NR_POOLS;
+	else
+		iommu->nr_pools = npools;
+	BUG_ON(npools > IOMMU_NR_POOLS);
+
+	iommu->table_shift = table_shift;
+	iommu->lazy_flush = lazy_flush;
+	start = 0;
+	if (skip_span_boundary_check)
+		iommu->flags |= IOMMU_NO_SPAN_BOUND;
+	if (large_pool)
+		iommu->flags |= IOMMU_HAS_LARGE_POOL;
+
+	if (!large_pool)
+		iommu->poolsize = num_entries/iommu->nr_pools;
+	else
+		iommu->poolsize = (num_entries * 3 / 4)/iommu->nr_pools;
+	for (i = 0; i < iommu->nr_pools; i++) {
+		spin_lock_init(&(iommu->pools[i].lock));
+		iommu->pools[i].start = start;
+		iommu->pools[i].hint = start;
+		start += iommu->poolsize; /* start for next pool */
+		iommu->pools[i].end = start - 1;
+	}
+	if (!large_pool)
+		return;
+	/* initialize large_pool */
+	spin_lock_init(&(p->lock));
+	p->start = start;
+	p->hint = p->start;
+	p->end = num_entries;
+}
+EXPORT_SYMBOL(iommu_tbl_pool_init);
+
+unsigned long iommu_tbl_range_alloc(struct device *dev,
+				struct iommu_map_table *iommu,
+				unsigned long npages,
+				unsigned long *handle)
+{
+	unsigned int pool_hash = __this_cpu_read(iommu_pool_hash);
+	unsigned long n, end, start, limit, boundary_size;
+	struct iommu_pool *pool;
+	int pass = 0;
+	unsigned int pool_nr;
+	unsigned int npools = iommu->nr_pools;
+	unsigned long flags;
+	bool large_pool = ((iommu->flags & IOMMU_HAS_LARGE_POOL) != 0);
+	bool largealloc = (large_pool && npages > iommu_large_alloc);
+	unsigned long shift;
+	bool need_flush = false;
+
+	/* Sanity check */
+	if (unlikely(npages == 0)) {
+		WARN_ON_ONCE(1);
+		return DMA_ERROR_CODE;
+	}
+
+	if (largealloc) {
+		pool = &(iommu->large_pool);
+		spin_lock_irqsave(&pool->lock, flags);
+		pool_nr = 0; /* to keep compiler happy */
+	} else {
+		/* pick out pool_nr */
+		pool_nr =  pool_hash & (npools - 1);
+		pool = &(iommu->pools[pool_nr]);
+		spin_lock_irqsave(&(pool->lock), flags);
+	}
+
+ again:
+	if (pass == 0 && handle && *handle &&
+	    (*handle >= pool->start) && (*handle < pool->end))
+		start = *handle;
+	else
+		start = pool->hint;
+
+	limit = pool->end;
+
+	/* The case below can happen if we have a small segment appended
+	 * to a large, or when the previous alloc was at the very end of
+	 * the available space. If so, go back to the beginning and flush.
+	 */
+	if (start >= limit) {
+		start = pool->start;
+		if (!large_pool && iommu->lazy_flush != NULL)
+			iommu->lazy_flush(iommu);
+	}
+
+	if (dev)
+		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
+				      1 << iommu->table_shift);
+	else
+		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
+
+	shift = iommu->table_map_base >> iommu->table_shift;
+	boundary_size = boundary_size >> iommu->table_shift;
+	/*
+	 * if the skip_span_boundary_check had been set during init, we set
+	 * things up so that iommu_is_span_boundary() merely checks if the
+	 * (index + npages) < num_tsb_entries
+	 */
+	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
+		shift = 0;
+		boundary_size = iommu->poolsize * iommu->nr_pools;
+	}
+	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
+			     boundary_size, 0);
+	if (n == -1) {
+		if (likely(pass == 0)) {
+			/* First failure, rescan from the beginning.  */
+			pool->hint = pool->start;
+			if (!large_pool && iommu->lazy_flush != NULL)
+				need_flush = true;
+			pass++;
+			goto again;
+		} else if (!largealloc && pass <= iommu->nr_pools) {
+			spin_unlock(&(pool->lock));
+			pool_nr = (pool_nr + 1) & (iommu->nr_pools - 1);
+			pool = &(iommu->pools[pool_nr]);
+			spin_lock(&(pool->lock));
+			pool->hint = pool->start;
+			if (!large_pool && iommu->lazy_flush != NULL)
+				need_flush = true;
+			pass++;
+			goto again;
+		} else {
+			/* give up */
+			n = DMA_ERROR_CODE;
+			goto bail;
+		}
+	}
+
+	end = n + npages;
+
+	pool->hint = end;
+
+	/* Update handle for SG allocations */
+	if (handle)
+		*handle = end;
+bail:
+	if (need_flush)
+		iommu->lazy_flush(iommu);
+	spin_unlock_irqrestore(&(pool->lock), flags);
+
+	return n;
+}
+EXPORT_SYMBOL(iommu_tbl_range_alloc);
+
+static struct iommu_pool *get_pool(struct iommu_map_table *tbl,
+				   unsigned long entry)
+{
+	struct iommu_pool *p;
+	unsigned long largepool_start = tbl->large_pool.start;
+	bool large_pool = ((tbl->flags & IOMMU_HAS_LARGE_POOL) != 0);
+
+	/* The large pool is the last pool at the top of the table */
+	if (large_pool && entry >= largepool_start) {
+		p = &tbl->large_pool;
+	} else {
+		unsigned int pool_nr = entry / tbl->poolsize;
+
+		BUG_ON(pool_nr >= tbl->nr_pools);
+		p = &tbl->pools[pool_nr];
+	}
+	return p;
+}
+
+/* Caller supplies the index of the entry into the iommu map table
+ * itself when the mapping from dma_addr to the entry is not the
+ * default addr->entry mapping below.
+ */
+void iommu_tbl_range_free(struct iommu_map_table *iommu, u64 dma_addr,
+			  unsigned long npages, unsigned long entry)
+{
+	struct iommu_pool *pool;
+	unsigned long flags;
+	unsigned long shift = iommu->table_shift;
+
+	if (entry == DMA_ERROR_CODE) /* use default addr->entry mapping */
+		entry = (dma_addr - iommu->table_map_base) >> shift;
+	pool = get_pool(iommu, entry);
+
+	spin_lock_irqsave(&(pool->lock), flags);
+	bitmap_clear(iommu->map, entry, npages);
+	spin_unlock_irqrestore(&(pool->lock), flags);
+}
+EXPORT_SYMBOL(iommu_tbl_range_free);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v8 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions
  2015-03-31 14:40 ` Sowmini Varadhan
@ 2015-03-31 14:40   ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-03-31 14:40 UTC (permalink / raw)
  To: sparclinux; +Cc: aik, anton, paulus, sowmini.varadhan, linuxppc-dev, davem

iperf experiments running linux as the Tx side (TCP client) with
10 threads show a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using
the scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
   with default settings (TSO enabled):  9-9.5 Gbps
   with TSO disabled via ethtool:        2-3 Gbps

After this patch, an iperf client with 10 threads can give a
throughput of at least 8.5 Gbps, even when TSO is disabled.
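
The shape of the conversion, condensed from the hunks below (a sketch,
not a drop-in snippet): the single iommu->lock serializing every
allocation is replaced by the per-pool locking done inside
iommu_tbl_range_alloc(), e.g. on the allocation path:

static unsigned long alloc_entry_old(struct device *dev, struct iommu *iommu,
				     unsigned long npages)
{
	unsigned long flags, entry;

	/* before: one lock covers the whole table for every mapping */
	spin_lock_irqsave(&iommu->lock, flags);
	entry = iommu_range_alloc(dev, iommu, npages, NULL);
	spin_unlock_irqrestore(&iommu->lock, flags);
	return entry;
}

static unsigned long alloc_entry_new(struct device *dev, struct iommu *iommu,
				     unsigned long npages)
{
	/* after: locking is per pool, inside the library call */
	return iommu_tbl_range_alloc(dev, &iommu->tbl, npages, NULL);
}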

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: moved sparc-specific fields into iommu_sparc
v3: converted all sparc users of iommu, so a lot of cleanup and streamlining
v4: David Miller review change:
    - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
    - reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in
    modified args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to 
    iommu_tbl_pool_init
v7: move pool_hash initialization to iommu_tbl_pool_init()
v8: BenH comments

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  171 +++++++++--------------------------
 arch/sparc/kernel/iommu_common.h  |    8 --
 arch/sparc/kernel/pci_sun4v.c     |  181 ++++++++++++++++---------------------
 4 files changed, 125 insertions(+), 242 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..cd0d69f 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0000000000000002UL
 
 #define IOMMU_NUM_CTXS	4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
 	unsigned long	*map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+	struct iommu_map_table	tbl;
 	spinlock_t		lock;
-	struct iommu_arena	arena;
-	void			(*flush_all)(struct iommu *);
+	u32			dma_addr_mask;
 	iopte_t			*page_table;
-	u32			page_table_map_base;
 	unsigned long		iommu_control;
 	unsigned long		iommu_tsbbase;
 	unsigned long		iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
 	unsigned long		dummy_page_pa;
 	unsigned long		ctx_lowest_free;
 	DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-	u32			dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..8b76e21 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
@@ -45,8 +46,9 @@
 			       "i" (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_map_table *iommu_map_table)
 {
+	struct iommu *iommu = container_of(iommu_map_table, struct iommu, tbl);
 	if (iommu->iommu_flushinv) {
 		iommu_write(iommu->iommu_flushinv, ~(u64)0);
 	} else {
@@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, iopte_t *iopte)
 	iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator.  If you use the 'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations.  Therefore we only ever advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-				struct iommu *iommu,
-				unsigned long npages,
-				unsigned long *handle)
-{
-	unsigned long n, end, start, limit, boundary_size;
-	struct iommu_arena *arena = &iommu->arena;
-	int pass = 0;
-
-	/* This allocator was derived from x86_64's bit string search */
-
-	/* Sanity check */
-	if (unlikely(npages == 0)) {
-		if (printk_ratelimit())
-			WARN_ON(1);
-		return DMA_ERROR_CODE;
-	}
-
-	if (handle && *handle)
-		start = *handle;
-	else
-		start = arena->hint;
-
-	limit = arena->limit;
-
-	/* The case below can happen if we have a small segment appended
-	 * to a large, or when the previous alloc was at the very end of
-	 * the available space. If so, go back to the beginning and flush.
-	 */
-	if (start >= limit) {
-		start = 0;
-		if (iommu->flush_all)
-			iommu->flush_all(iommu);
-	}
-
- again:
-
-	if (dev)
-		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
-				      1 << IO_PAGE_SHIFT);
-	else
-		boundary_size = ALIGN(1UL << 32, 1 << IO_PAGE_SHIFT);
-
-	n = iommu_area_alloc(arena->map, limit, start, npages,
-			     iommu->page_table_map_base >> IO_PAGE_SHIFT,
-			     boundary_size >> IO_PAGE_SHIFT, 0);
-	if (n == -1) {
-		if (likely(pass < 1)) {
-			/* First failure, rescan from the beginning.  */
-			start = 0;
-			if (iommu->flush_all)
-				iommu->flush_all(iommu);
-			pass++;
-			goto again;
-		} else {
-			/* Second failure, give up */
-			return DMA_ERROR_CODE;
-		}
-	}
-
-	end = n + npages;
-
-	arena->hint = end;
-
-	/* Update handle for SG allocations */
-	if (handle)
-		*handle = end;
-
-	return n;
-}
-
-void iommu_range_free(struct iommu *iommu, dma_addr_t dma_addr, unsigned long npages)
-{
-	struct iommu_arena *arena = &iommu->arena;
-	unsigned long entry;
-
-	entry = (dma_addr - iommu->page_table_map_base) >> IO_PAGE_SHIFT;
-
-	bitmap_clear(arena->map, entry, npages);
-}
-
 int iommu_table_init(struct iommu *iommu, int tsbsize,
 		     u32 dma_offset, u32 dma_addr_mask,
 		     int numa_node)
@@ -187,22 +101,20 @@ int iommu_table_init(struct iommu *iommu, int tsbsize,
 	/* Setup initial software IOMMU state. */
 	spin_lock_init(&iommu->lock);
 	iommu->ctx_lowest_free = 1;
-	iommu->page_table_map_base = dma_offset;
+	iommu->tbl.table_map_base = dma_offset;
 	iommu->dma_addr_mask = dma_addr_mask;
 
 	/* Allocate and initialize the free area map.  */
 	sz = num_tsb_entries / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kmalloc_node(sz, GFP_KERNEL, numa_node);
-	if (!iommu->arena.map) {
-		printk(KERN_ERR "IOMMU: Error, kmalloc(arena.map) failed.\n");
+	iommu->tbl.map = kmalloc_node(sz, GFP_KERNEL, numa_node);
+	if (!iommu->tbl.map)
 		return -ENOMEM;
-	}
-	memset(iommu->arena.map, 0, sz);
-	iommu->arena.limit = num_tsb_entries;
+	memset(iommu->tbl.map, 0, sz);
 
-	if (tlb_type != hypervisor)
-		iommu->flush_all = iommu_flushall;
+	iommu_tbl_pool_init(&iommu->tbl, num_tsb_entries, IO_PAGE_SHIFT,
+			    (tlb_type != hypervisor ? iommu_flushall : NULL),
+			    false, 1, false);
 
 	/* Allocate and initialize the dummy page which we
 	 * set inactive IO PTEs to point to.
@@ -235,18 +147,19 @@ int iommu_table_init(struct iommu *iommu, int tsbsize,
 	iommu->dummy_page = 0UL;
 
 out_free_map:
-	kfree(iommu->arena.map);
-	iommu->arena.map = NULL;
+	kfree(iommu->tbl.map);
+	iommu->tbl.map = NULL;
 
 	return -ENOMEM;
 }
 
-static inline iopte_t *alloc_npages(struct device *dev, struct iommu *iommu,
+static inline iopte_t *alloc_npages(struct device *dev,
+				    struct iommu *iommu,
 				    unsigned long npages)
 {
 	unsigned long entry;
 
-	entry = iommu_range_alloc(dev, iommu, npages, NULL);
+	entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages, NULL);
 	if (unlikely(entry = DMA_ERROR_CODE))
 		return NULL;
 
@@ -284,7 +197,7 @@ static void *dma_4u_alloc_coherent(struct device *dev, size_t size,
 				   dma_addr_t *dma_addrp, gfp_t gfp,
 				   struct dma_attrs *attrs)
 {
-	unsigned long flags, order, first_page;
+	unsigned long order, first_page;
 	struct iommu *iommu;
 	struct page *page;
 	int npages, nid;
@@ -306,16 +219,14 @@ static void *dma_4u_alloc_coherent(struct device *dev, size_t size,
 
 	iommu = dev->archdata.iommu;
 
-	spin_lock_irqsave(&iommu->lock, flags);
 	iopte = alloc_npages(dev, iommu, size >> IO_PAGE_SHIFT);
-	spin_unlock_irqrestore(&iommu->lock, flags);
 
 	if (unlikely(iopte == NULL)) {
 		free_pages(first_page, order);
 		return NULL;
 	}
 
-	*dma_addrp = (iommu->page_table_map_base +
+	*dma_addrp = (iommu->tbl.table_map_base +
 		      ((iopte - iommu->page_table) << IO_PAGE_SHIFT));
 	ret = (void *) first_page;
 	npages = size >> IO_PAGE_SHIFT;
@@ -336,16 +247,12 @@ static void dma_4u_free_coherent(struct device *dev, size_t size,
 				 struct dma_attrs *attrs)
 {
 	struct iommu *iommu;
-	unsigned long flags, order, npages;
+	unsigned long order, npages;
 
 	npages = IO_PAGE_ALIGN(size) >> IO_PAGE_SHIFT;
 	iommu = dev->archdata.iommu;
 
-	spin_lock_irqsave(&iommu->lock, flags);
-
-	iommu_range_free(iommu, dvma, npages);
-
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	iommu_tbl_range_free(&iommu->tbl, dvma, npages, DMA_ERROR_CODE);
 
 	order = get_order(size);
 	if (order < 10)
@@ -375,8 +282,8 @@ static dma_addr_t dma_4u_map_page(struct device *dev, struct page *page,
 	npages = IO_PAGE_ALIGN(oaddr + sz) - (oaddr & IO_PAGE_MASK);
 	npages >>= IO_PAGE_SHIFT;
 
-	spin_lock_irqsave(&iommu->lock, flags);
 	base = alloc_npages(dev, iommu, npages);
+	spin_lock_irqsave(&iommu->lock, flags);
 	ctx = 0;
 	if (iommu->iommu_ctxflush)
 		ctx = iommu_alloc_ctx(iommu);
@@ -385,7 +292,7 @@ static dma_addr_t dma_4u_map_page(struct device *dev, struct page *page,
 	if (unlikely(!base))
 		goto bad;
 
-	bus_addr = (iommu->page_table_map_base +
+	bus_addr = (iommu->tbl.table_map_base +
 		    ((base - iommu->page_table) << IO_PAGE_SHIFT));
 	ret = bus_addr | (oaddr & ~IO_PAGE_MASK);
 	base_paddr = __pa(oaddr & IO_PAGE_MASK);
@@ -496,7 +403,7 @@ static void dma_4u_unmap_page(struct device *dev, dma_addr_t bus_addr,
 	npages = IO_PAGE_ALIGN(bus_addr + sz) - (bus_addr & IO_PAGE_MASK);
 	npages >>= IO_PAGE_SHIFT;
 	base = iommu->page_table +
-		((bus_addr - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
+		((bus_addr - iommu->tbl.table_map_base) >> IO_PAGE_SHIFT);
 	bus_addr &= IO_PAGE_MASK;
 
 	spin_lock_irqsave(&iommu->lock, flags);
@@ -515,11 +422,10 @@ static void dma_4u_unmap_page(struct device *dev, dma_addr_t bus_addr,
 	for (i = 0; i < npages; i++)
 		iopte_make_dummy(iommu, base + i);
 
-	iommu_range_free(iommu, bus_addr, npages);
-
 	iommu_free_ctx(iommu, ctx);
-
 	spin_unlock_irqrestore(&iommu->lock, flags);
+
+	iommu_tbl_range_free(&iommu->tbl, bus_addr, npages, DMA_ERROR_CODE);
 }
 
 static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -567,7 +473,7 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 	max_seg_size = dma_get_max_seg_size(dev);
 	seg_boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
 				  IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
-	base_shift = iommu->page_table_map_base >> IO_PAGE_SHIFT;
+	base_shift = iommu->tbl.table_map_base >> IO_PAGE_SHIFT;
 	for_each_sg(sglist, s, nelems, i) {
 		unsigned long paddr, npages, entry, out_entry = 0, slen;
 		iopte_t *base;
@@ -581,7 +487,8 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 		/* Allocate iommu entries for that segment */
 		paddr = (unsigned long) SG_ENT_PHYS_ADDRESS(s);
 		npages = iommu_num_pages(paddr, slen, IO_PAGE_SIZE);
-		entry = iommu_range_alloc(dev, iommu, npages, &handle);
+		entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages,
+					      &handle);
 
 		/* Handle failure */
 		if (unlikely(entry == DMA_ERROR_CODE)) {
@@ -594,7 +501,7 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 		base = iommu->page_table + entry;
 
 		/* Convert entry to a dma_addr_t */
-		dma_addr = iommu->page_table_map_base +
+		dma_addr = iommu->tbl.table_map_base +
 			(entry << IO_PAGE_SHIFT);
 		dma_addr |= (s->offset & ~IO_PAGE_MASK);
 
@@ -654,15 +561,17 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 			vaddr = s->dma_address & IO_PAGE_MASK;
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
 						 IO_PAGE_SIZE);
-			iommu_range_free(iommu, vaddr, npages);
 
-			entry = (vaddr - iommu->page_table_map_base)
+			entry = (vaddr - iommu->tbl.table_map_base)
 				>> IO_PAGE_SHIFT;
 			base = iommu->page_table + entry;
 
 			for (j = 0; j < npages; j++)
 				iopte_make_dummy(iommu, base + j);
 
+			iommu_tbl_range_free(&iommu->tbl, vaddr, npages,
+					     DMA_ERROR_CODE);
+
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
 		}
@@ -684,10 +593,11 @@ static unsigned long fetch_sg_ctx(struct iommu *iommu, struct scatterlist *sg)
 	if (iommu->iommu_ctxflush) {
 		iopte_t *base;
 		u32 bus_addr;
+		struct iommu_map_table *tbl = &iommu->tbl;
 
 		bus_addr = sg->dma_address & IO_PAGE_MASK;
 		base = iommu->page_table +
-			((bus_addr - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
+			((bus_addr - tbl->table_map_base) >> IO_PAGE_SHIFT);
 
 		ctx = (iopte_val(*base) & IOPTE_CONTEXT) >> 47UL;
 	}
@@ -723,9 +633,8 @@ static void dma_4u_unmap_sg(struct device *dev, struct scatterlist *sglist,
 		if (!len)
 			break;
 		npages = iommu_num_pages(dma_handle, len, IO_PAGE_SIZE);
-		iommu_range_free(iommu, dma_handle, npages);
 
-		entry = ((dma_handle - iommu->page_table_map_base)
+		entry = ((dma_handle - iommu->tbl.table_map_base)
 			 >> IO_PAGE_SHIFT);
 		base = iommu->page_table + entry;
 
@@ -737,6 +646,8 @@ static void dma_4u_unmap_sg(struct device *dev, struct scatterlist *sglist,
 		for (i = 0; i < npages; i++)
 			iopte_make_dummy(iommu, base + i);
 
+		iommu_tbl_range_free(&iommu->tbl, dma_handle, npages,
+				     DMA_ERROR_CODE);
 		sg = sg_next(sg);
 	}
 
@@ -770,9 +681,10 @@ static void dma_4u_sync_single_for_cpu(struct device *dev,
 	if (iommu->iommu_ctxflush &&
 	    strbuf->strbuf_ctxflush) {
 		iopte_t *iopte;
+		struct iommu_map_table *tbl = &iommu->tbl;
 
 		iopte = iommu->page_table +
-			((bus_addr - iommu->page_table_map_base)>>IO_PAGE_SHIFT);
+			((bus_addr - tbl->table_map_base)>>IO_PAGE_SHIFT);
 		ctx = (iopte_val(*iopte) & IOPTE_CONTEXT) >> 47UL;
 	}
 
@@ -805,9 +717,10 @@ static void dma_4u_sync_sg_for_cpu(struct device *dev,
 	if (iommu->iommu_ctxflush &&
 	    strbuf->strbuf_ctxflush) {
 		iopte_t *iopte;
+		struct iommu_map_table *tbl = &iommu->tbl;
 
-		iopte = iommu->page_table +
-			((sglist[0].dma_address - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
+		iopte = iommu->page_table + ((sglist[0].dma_address -
+			tbl->table_map_base) >> IO_PAGE_SHIFT);
 		ctx = (iopte_val(*iopte) & IOPTE_CONTEXT) >> 47UL;
 	}
 
diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h
index 1ec0de4..f4be0d7 100644
--- a/arch/sparc/kernel/iommu_common.h
+++ b/arch/sparc/kernel/iommu_common.h
@@ -48,12 +48,4 @@ static inline int is_span_boundary(unsigned long entry,
 	return iommu_is_span_boundary(entry, nr, shift, boundary_size);
 }
 
-unsigned long iommu_range_alloc(struct device *dev,
-				struct iommu *iommu,
-				unsigned long npages,
-				unsigned long *handle);
-void iommu_range_free(struct iommu *iommu,
-		      dma_addr_t dma_addr,
-		      unsigned long npages);
-
 #endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index 47ddbd4..6ca0190 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -15,6 +15,7 @@
 #include <linux/export.h>
 #include <linux/log2.h>
 #include <linux/of_device.h>
+#include <linux/iommu-common.h>
 
 #include <asm/iommu.h>
 #include <asm/irq.h>
@@ -155,15 +156,12 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
 
 	iommu = dev->archdata.iommu;
 
-	spin_lock_irqsave(&iommu->lock, flags);
-	entry = iommu_range_alloc(dev, iommu, npages, NULL);
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages, NULL);
 
 	if (unlikely(entry == DMA_ERROR_CODE))
 		goto range_alloc_fail;
 
-	*dma_addrp = (iommu->page_table_map_base +
-		      (entry << IO_PAGE_SHIFT));
+	*dma_addrp = (iommu->tbl.table_map_base + (entry << IO_PAGE_SHIFT));
 	ret = (void *) first_page;
 	first_page = __pa(first_page);
 
@@ -188,45 +186,46 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
 	return ret;
 
 iommu_map_fail:
-	/* Interrupts are disabled.  */
-	spin_lock(&iommu->lock);
-	iommu_range_free(iommu, *dma_addrp, npages);
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	iommu_tbl_range_free(&iommu->tbl, *dma_addrp, npages, DMA_ERROR_CODE);
 
 range_alloc_fail:
 	free_pages(first_page, order);
 	return NULL;
 }
 
+static void dma_4v_iommu_demap(void *demap_arg, unsigned long entry,
+			       unsigned long npages)
+{
+	u32 devhandle = *(u32 *)demap_arg;
+	unsigned long num, flags;
+
+	local_irq_save(flags);
+	do {
+		num = pci_sun4v_iommu_demap(devhandle,
+					    HV_PCI_TSBID(0, entry),
+					    npages);
+
+		entry += num;
+		npages -= num;
+	} while (npages != 0);
+	local_irq_restore(flags);
+}
+
 static void dma_4v_free_coherent(struct device *dev, size_t size, void *cpu,
 				 dma_addr_t dvma, struct dma_attrs *attrs)
 {
 	struct pci_pbm_info *pbm;
 	struct iommu *iommu;
-	unsigned long flags, order, npages, entry;
+	unsigned long order, npages, entry;
 	u32 devhandle;
 
 	npages = IO_PAGE_ALIGN(size) >> IO_PAGE_SHIFT;
 	iommu = dev->archdata.iommu;
 	pbm = dev->archdata.host_controller;
 	devhandle = pbm->devhandle;
-	entry = ((dvma - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
-
-	spin_lock_irqsave(&iommu->lock, flags);
-
-	iommu_range_free(iommu, dvma, npages);
-
-	do {
-		unsigned long num;
-
-		num = pci_sun4v_iommu_demap(devhandle, HV_PCI_TSBID(0, entry),
-					    npages);
-		entry += num;
-		npages -= num;
-	} while (npages != 0);
-
-	spin_unlock_irqrestore(&iommu->lock, flags);
-
+	entry = ((dvma - iommu->tbl.table_map_base) >> IO_PAGE_SHIFT);
+	dma_4v_iommu_demap(&devhandle, entry, npages);
+	iommu_tbl_range_free(&iommu->tbl, dvma, npages, DMA_ERROR_CODE);
 	order = get_order(size);
 	if (order < 10)
 		free_pages((unsigned long)cpu, order);
@@ -253,15 +252,12 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page,
 	npages = IO_PAGE_ALIGN(oaddr + sz) - (oaddr & IO_PAGE_MASK);
 	npages >>= IO_PAGE_SHIFT;
 
-	spin_lock_irqsave(&iommu->lock, flags);
-	entry = iommu_range_alloc(dev, iommu, npages, NULL);
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages, NULL);
 
 	if (unlikely(entry == DMA_ERROR_CODE))
 		goto bad;
 
-	bus_addr = (iommu->page_table_map_base +
-		    (entry << IO_PAGE_SHIFT));
+	bus_addr = (iommu->tbl.table_map_base + (entry << IO_PAGE_SHIFT));
 	ret = bus_addr | (oaddr & ~IO_PAGE_MASK);
 	base_paddr = __pa(oaddr & IO_PAGE_MASK);
 	prot = HV_PCI_MAP_ATTR_READ;
@@ -290,11 +286,7 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page,
 	return DMA_ERROR_CODE;
 
 iommu_map_fail:
-	/* Interrupts are disabled.  */
-	spin_lock(&iommu->lock);
-	iommu_range_free(iommu, bus_addr, npages);
-	spin_unlock_irqrestore(&iommu->lock, flags);
-
+	iommu_tbl_range_free(&iommu->tbl, bus_addr, npages, DMA_ERROR_CODE);
 	return DMA_ERROR_CODE;
 }
 
@@ -304,7 +296,7 @@ static void dma_4v_unmap_page(struct device *dev, dma_addr_t bus_addr,
 {
 	struct pci_pbm_info *pbm;
 	struct iommu *iommu;
-	unsigned long flags, npages;
+	unsigned long npages;
 	long entry;
 	u32 devhandle;
 
@@ -321,22 +313,9 @@ static void dma_4v_unmap_page(struct device *dev, dma_addr_t bus_addr,
 	npages = IO_PAGE_ALIGN(bus_addr + sz) - (bus_addr & IO_PAGE_MASK);
 	npages >>= IO_PAGE_SHIFT;
 	bus_addr &= IO_PAGE_MASK;
-
-	spin_lock_irqsave(&iommu->lock, flags);
-
-	iommu_range_free(iommu, bus_addr, npages);
-
-	entry = (bus_addr - iommu->page_table_map_base) >> IO_PAGE_SHIFT;
-	do {
-		unsigned long num;
-
-		num = pci_sun4v_iommu_demap(devhandle, HV_PCI_TSBID(0, entry),
-					    npages);
-		entry += num;
-		npages -= num;
-	} while (npages != 0);
-
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	entry = (bus_addr - iommu->tbl.table_map_base) >> IO_PAGE_SHIFT;
+	dma_4v_iommu_demap(&devhandle, entry, npages);
+	iommu_tbl_range_free(&iommu->tbl, bus_addr, npages, DMA_ERROR_CODE);
 }
 
 static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -371,14 +350,14 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 	/* Init first segment length for backout at failure */
 	outs->dma_length = 0;
 
-	spin_lock_irqsave(&iommu->lock, flags);
+	local_irq_save(flags);
 
 	iommu_batch_start(dev, prot, ~0UL);
 
 	max_seg_size = dma_get_max_seg_size(dev);
 	seg_boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
 				  IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
-	base_shift = iommu->page_table_map_base >> IO_PAGE_SHIFT;
+	base_shift = iommu->tbl.table_map_base >> IO_PAGE_SHIFT;
 	for_each_sg(sglist, s, nelems, i) {
 		unsigned long paddr, npages, entry, out_entry = 0, slen;
 
@@ -391,7 +370,8 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 		/* Allocate iommu entries for that segment */
 		paddr = (unsigned long) SG_ENT_PHYS_ADDRESS(s);
 		npages = iommu_num_pages(paddr, slen, IO_PAGE_SIZE);
-		entry = iommu_range_alloc(dev, iommu, npages, &handle);
+		entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages,
+					      &handle);
 
 		/* Handle failure */
 		if (unlikely(entry == DMA_ERROR_CODE)) {
@@ -404,8 +384,7 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 		iommu_batch_new_entry(entry);
 
 		/* Convert entry to a dma_addr_t */
-		dma_addr = iommu->page_table_map_base +
-			(entry << IO_PAGE_SHIFT);
+		dma_addr = iommu->tbl.table_map_base + (entry << IO_PAGE_SHIFT);
 		dma_addr |= (s->offset & ~IO_PAGE_MASK);
 
 		/* Insert into HW table */
@@ -451,7 +430,7 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 	if (unlikely(err < 0L))
 		goto iommu_map_failed;
 
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	local_irq_restore(flags);
 
 	if (outcount < incount) {
 		outs = sg_next(outs);
@@ -469,7 +448,8 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 			vaddr = s->dma_address & IO_PAGE_MASK;
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
 						 IO_PAGE_SIZE);
-			iommu_range_free(iommu, vaddr, npages);
+			iommu_tbl_range_free(&iommu->tbl, vaddr, npages,
+					     DMA_ERROR_CODE);
 			/* XXX demap? XXX */
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
@@ -477,7 +457,7 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 		if (s == outs)
 			break;
 	}
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	local_irq_restore(flags);
 
 	return 0;
 }
@@ -489,7 +469,7 @@ static void dma_4v_unmap_sg(struct device *dev, struct scatterlist *sglist,
 	struct pci_pbm_info *pbm;
 	struct scatterlist *sg;
 	struct iommu *iommu;
-	unsigned long flags;
+	unsigned long flags, entry;
 	u32 devhandle;
 
 	BUG_ON(direction == DMA_NONE);
@@ -498,33 +478,27 @@ static void dma_4v_unmap_sg(struct device *dev, struct scatterlist *sglist,
 	pbm = dev->archdata.host_controller;
 	devhandle = pbm->devhandle;
 	
-	spin_lock_irqsave(&iommu->lock, flags);
+	local_irq_save(flags);
 
 	sg = sglist;
 	while (nelems--) {
 		dma_addr_t dma_handle = sg->dma_address;
 		unsigned int len = sg->dma_length;
-		unsigned long npages, entry;
+		unsigned long npages;
+		struct iommu_map_table *tbl = &iommu->tbl;
+		unsigned long shift = IO_PAGE_SHIFT;
 
 		if (!len)
 			break;
 		npages = iommu_num_pages(dma_handle, len, IO_PAGE_SIZE);
-		iommu_range_free(iommu, dma_handle, npages);
-
-		entry = ((dma_handle - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
-		while (npages) {
-			unsigned long num;
-
-			num = pci_sun4v_iommu_demap(devhandle, HV_PCI_TSBID(0, entry),
-						    npages);
-			entry += num;
-			npages -= num;
-		}
-
+		entry = ((dma_handle - tbl->table_map_base) >> shift);
+		dma_4v_iommu_demap(&devhandle, entry, npages);
+		iommu_tbl_range_free(&iommu->tbl, dma_handle, npages,
+				     DMA_ERROR_CODE);
 		sg = sg_next(sg);
 	}
 
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	local_irq_restore(flags);
 }
 
 static struct dma_map_ops sun4v_dma_ops = {
@@ -550,30 +524,33 @@ static void pci_sun4v_scan_bus(struct pci_pbm_info *pbm, struct device *parent)
 }
 
 static unsigned long probe_existing_entries(struct pci_pbm_info *pbm,
-					    struct iommu *iommu)
+					    struct iommu_map_table *iommu)
 {
-	struct iommu_arena *arena = &iommu->arena;
-	unsigned long i, cnt = 0;
+	struct iommu_pool *pool;
+	unsigned long i, pool_nr, cnt = 0;
 	u32 devhandle;
 
 	devhandle = pbm->devhandle;
-	for (i = 0; i < arena->limit; i++) {
-		unsigned long ret, io_attrs, ra;
-
-		ret = pci_sun4v_iommu_getmap(devhandle,
-					     HV_PCI_TSBID(0, i),
-					     &io_attrs, &ra);
-		if (ret == HV_EOK) {
-			if (page_in_phys_avail(ra)) {
-				pci_sun4v_iommu_demap(devhandle,
-						      HV_PCI_TSBID(0, i), 1);
-			} else {
-				cnt++;
-				__set_bit(i, arena->map);
+	for (pool_nr = 0; pool_nr < iommu->nr_pools; pool_nr++) {
+		pool = &(iommu->pools[pool_nr]);
+		for (i = pool->start; i <= pool->end; i++) {
+			unsigned long ret, io_attrs, ra;
+
+			ret = pci_sun4v_iommu_getmap(devhandle,
+						     HV_PCI_TSBID(0, i),
+						     &io_attrs, &ra);
+			if (ret == HV_EOK) {
+				if (page_in_phys_avail(ra)) {
+					pci_sun4v_iommu_demap(devhandle,
+							      HV_PCI_TSBID(0,
+							      i), 1);
+				} else {
+					cnt++;
+					__set_bit(i, iommu->map);
+				}
 			}
 		}
 	}
-
 	return cnt;
 }
 
@@ -603,20 +580,22 @@ static int pci_sun4v_iommu_init(struct pci_pbm_info *pbm)
 	/* Setup initial software IOMMU state. */
 	spin_lock_init(&iommu->lock);
 	iommu->ctx_lowest_free = 1;
-	iommu->page_table_map_base = dma_offset;
+	iommu->tbl.table_map_base = dma_offset;
 	iommu->dma_addr_mask = dma_mask;
 
 	/* Allocate and initialize the free area map.  */
 	sz = (num_tsb_entries + 7) / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-	if (!iommu->arena.map) {
+	iommu->tbl.map = kzalloc(sz, GFP_KERNEL);
+	if (!iommu->tbl.map) {
 		printk(KERN_ERR PFX "Error, kmalloc(arena.map) failed.\n");
 		return -ENOMEM;
 	}
-	iommu->arena.limit = num_tsb_entries;
-
-	sz = probe_existing_entries(pbm, iommu);
+	iommu_tbl_pool_init(&iommu->tbl, num_tsb_entries, IO_PAGE_SHIFT,
+			    NULL, false /* no large_pool */,
+			    0 /* default npools */,
+			    false /* want span boundary checking */);
+	sz = probe_existing_entries(pbm, &iommu->tbl);
 	if (sz)
 		printk("%s: Imported %lu TSB entries from OBP\n",
 		       pbm->name, sz);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

 
-	spin_lock_irqsave(&iommu->lock, flags);
 	iopte = alloc_npages(dev, iommu, size >> IO_PAGE_SHIFT);
-	spin_unlock_irqrestore(&iommu->lock, flags);
 
 	if (unlikely(iopte == NULL)) {
 		free_pages(first_page, order);
 		return NULL;
 	}
 
-	*dma_addrp = (iommu->page_table_map_base +
+	*dma_addrp = (iommu->tbl.table_map_base +
 		      ((iopte - iommu->page_table) << IO_PAGE_SHIFT));
 	ret = (void *) first_page;
 	npages = size >> IO_PAGE_SHIFT;
@@ -336,16 +247,12 @@ static void dma_4u_free_coherent(struct device *dev, size_t size,
 				 struct dma_attrs *attrs)
 {
 	struct iommu *iommu;
-	unsigned long flags, order, npages;
+	unsigned long order, npages;
 
 	npages = IO_PAGE_ALIGN(size) >> IO_PAGE_SHIFT;
 	iommu = dev->archdata.iommu;
 
-	spin_lock_irqsave(&iommu->lock, flags);
-
-	iommu_range_free(iommu, dvma, npages);
-
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	iommu_tbl_range_free(&iommu->tbl, dvma, npages, DMA_ERROR_CODE);
 
 	order = get_order(size);
 	if (order < 10)
@@ -375,8 +282,8 @@ static dma_addr_t dma_4u_map_page(struct device *dev, struct page *page,
 	npages = IO_PAGE_ALIGN(oaddr + sz) - (oaddr & IO_PAGE_MASK);
 	npages >>= IO_PAGE_SHIFT;
 
-	spin_lock_irqsave(&iommu->lock, flags);
 	base = alloc_npages(dev, iommu, npages);
+	spin_lock_irqsave(&iommu->lock, flags);
 	ctx = 0;
 	if (iommu->iommu_ctxflush)
 		ctx = iommu_alloc_ctx(iommu);
@@ -385,7 +292,7 @@ static dma_addr_t dma_4u_map_page(struct device *dev, struct page *page,
 	if (unlikely(!base))
 		goto bad;
 
-	bus_addr = (iommu->page_table_map_base +
+	bus_addr = (iommu->tbl.table_map_base +
 		    ((base - iommu->page_table) << IO_PAGE_SHIFT));
 	ret = bus_addr | (oaddr & ~IO_PAGE_MASK);
 	base_paddr = __pa(oaddr & IO_PAGE_MASK);
@@ -496,7 +403,7 @@ static void dma_4u_unmap_page(struct device *dev, dma_addr_t bus_addr,
 	npages = IO_PAGE_ALIGN(bus_addr + sz) - (bus_addr & IO_PAGE_MASK);
 	npages >>= IO_PAGE_SHIFT;
 	base = iommu->page_table +
-		((bus_addr - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
+		((bus_addr - iommu->tbl.table_map_base) >> IO_PAGE_SHIFT);
 	bus_addr &= IO_PAGE_MASK;
 
 	spin_lock_irqsave(&iommu->lock, flags);
@@ -515,11 +422,10 @@ static void dma_4u_unmap_page(struct device *dev, dma_addr_t bus_addr,
 	for (i = 0; i < npages; i++)
 		iopte_make_dummy(iommu, base + i);
 
-	iommu_range_free(iommu, bus_addr, npages);
-
 	iommu_free_ctx(iommu, ctx);
-
 	spin_unlock_irqrestore(&iommu->lock, flags);
+
+	iommu_tbl_range_free(&iommu->tbl, bus_addr, npages, DMA_ERROR_CODE);
 }
 
 static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -567,7 +473,7 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 	max_seg_size = dma_get_max_seg_size(dev);
 	seg_boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
 				  IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
-	base_shift = iommu->page_table_map_base >> IO_PAGE_SHIFT;
+	base_shift = iommu->tbl.table_map_base >> IO_PAGE_SHIFT;
 	for_each_sg(sglist, s, nelems, i) {
 		unsigned long paddr, npages, entry, out_entry = 0, slen;
 		iopte_t *base;
@@ -581,7 +487,8 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 		/* Allocate iommu entries for that segment */
 		paddr = (unsigned long) SG_ENT_PHYS_ADDRESS(s);
 		npages = iommu_num_pages(paddr, slen, IO_PAGE_SIZE);
-		entry = iommu_range_alloc(dev, iommu, npages, &handle);
+		entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages,
+					      &handle);
 
 		/* Handle failure */
 		if (unlikely(entry == DMA_ERROR_CODE)) {
@@ -594,7 +501,7 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 		base = iommu->page_table + entry;
 
 		/* Convert entry to a dma_addr_t */
-		dma_addr = iommu->page_table_map_base +
+		dma_addr = iommu->tbl.table_map_base +
 			(entry << IO_PAGE_SHIFT);
 		dma_addr |= (s->offset & ~IO_PAGE_MASK);
 
@@ -654,15 +561,17 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 			vaddr = s->dma_address & IO_PAGE_MASK;
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
 						 IO_PAGE_SIZE);
-			iommu_range_free(iommu, vaddr, npages);
 
-			entry = (vaddr - iommu->page_table_map_base)
+			entry = (vaddr - iommu->tbl.table_map_base)
 				>> IO_PAGE_SHIFT;
 			base = iommu->page_table + entry;
 
 			for (j = 0; j < npages; j++)
 				iopte_make_dummy(iommu, base + j);
 
+			iommu_tbl_range_free(&iommu->tbl, vaddr, npages,
+					     DMA_ERROR_CODE);
+
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
 		}
@@ -684,10 +593,11 @@ static unsigned long fetch_sg_ctx(struct iommu *iommu, struct scatterlist *sg)
 	if (iommu->iommu_ctxflush) {
 		iopte_t *base;
 		u32 bus_addr;
+		struct iommu_map_table *tbl = &iommu->tbl;
 
 		bus_addr = sg->dma_address & IO_PAGE_MASK;
 		base = iommu->page_table +
-			((bus_addr - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
+			((bus_addr - tbl->table_map_base) >> IO_PAGE_SHIFT);
 
 		ctx = (iopte_val(*base) & IOPTE_CONTEXT) >> 47UL;
 	}
@@ -723,9 +633,8 @@ static void dma_4u_unmap_sg(struct device *dev, struct scatterlist *sglist,
 		if (!len)
 			break;
 		npages = iommu_num_pages(dma_handle, len, IO_PAGE_SIZE);
-		iommu_range_free(iommu, dma_handle, npages);
 
-		entry = ((dma_handle - iommu->page_table_map_base)
+		entry = ((dma_handle - iommu->tbl.table_map_base)
 			 >> IO_PAGE_SHIFT);
 		base = iommu->page_table + entry;
 
@@ -737,6 +646,8 @@ static void dma_4u_unmap_sg(struct device *dev, struct scatterlist *sglist,
 		for (i = 0; i < npages; i++)
 			iopte_make_dummy(iommu, base + i);
 
+		iommu_tbl_range_free(&iommu->tbl, dma_handle, npages,
+				     DMA_ERROR_CODE);
 		sg = sg_next(sg);
 	}
 
@@ -770,9 +681,10 @@ static void dma_4u_sync_single_for_cpu(struct device *dev,
 	if (iommu->iommu_ctxflush &&
 	    strbuf->strbuf_ctxflush) {
 		iopte_t *iopte;
+		struct iommu_map_table *tbl = &iommu->tbl;
 
 		iopte = iommu->page_table +
-			((bus_addr - iommu->page_table_map_base)>>IO_PAGE_SHIFT);
+			((bus_addr - tbl->table_map_base)>>IO_PAGE_SHIFT);
 		ctx = (iopte_val(*iopte) & IOPTE_CONTEXT) >> 47UL;
 	}
 
@@ -805,9 +717,10 @@ static void dma_4u_sync_sg_for_cpu(struct device *dev,
 	if (iommu->iommu_ctxflush &&
 	    strbuf->strbuf_ctxflush) {
 		iopte_t *iopte;
+		struct iommu_map_table *tbl = &iommu->tbl;
 
-		iopte = iommu->page_table +
-			((sglist[0].dma_address - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
+		iopte = iommu->page_table + ((sglist[0].dma_address -
+			tbl->table_map_base) >> IO_PAGE_SHIFT);
 		ctx = (iopte_val(*iopte) & IOPTE_CONTEXT) >> 47UL;
 	}
 
diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h
index 1ec0de4..f4be0d7 100644
--- a/arch/sparc/kernel/iommu_common.h
+++ b/arch/sparc/kernel/iommu_common.h
@@ -48,12 +48,4 @@ static inline int is_span_boundary(unsigned long entry,
 	return iommu_is_span_boundary(entry, nr, shift, boundary_size);
 }
 
-unsigned long iommu_range_alloc(struct device *dev,
-				struct iommu *iommu,
-				unsigned long npages,
-				unsigned long *handle);
-void iommu_range_free(struct iommu *iommu,
-		      dma_addr_t dma_addr,
-		      unsigned long npages);
-
 #endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index 47ddbd4..6ca0190 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -15,6 +15,7 @@
 #include <linux/export.h>
 #include <linux/log2.h>
 #include <linux/of_device.h>
+#include <linux/iommu-common.h>
 
 #include <asm/iommu.h>
 #include <asm/irq.h>
@@ -155,15 +156,12 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
 
 	iommu = dev->archdata.iommu;
 
-	spin_lock_irqsave(&iommu->lock, flags);
-	entry = iommu_range_alloc(dev, iommu, npages, NULL);
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages, NULL);
 
 	if (unlikely(entry == DMA_ERROR_CODE))
 		goto range_alloc_fail;
 
-	*dma_addrp = (iommu->page_table_map_base +
-		      (entry << IO_PAGE_SHIFT));
+	*dma_addrp = (iommu->tbl.table_map_base + (entry << IO_PAGE_SHIFT));
 	ret = (void *) first_page;
 	first_page = __pa(first_page);
 
@@ -188,45 +186,46 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
 	return ret;
 
 iommu_map_fail:
-	/* Interrupts are disabled.  */
-	spin_lock(&iommu->lock);
-	iommu_range_free(iommu, *dma_addrp, npages);
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	iommu_tbl_range_free(&iommu->tbl, *dma_addrp, npages, DMA_ERROR_CODE);
 
 range_alloc_fail:
 	free_pages(first_page, order);
 	return NULL;
 }
 
+static void dma_4v_iommu_demap(void *demap_arg, unsigned long entry,
+			       unsigned long npages)
+{
+	u32 devhandle = *(u32 *)demap_arg;
+	unsigned long num, flags;
+
+	local_irq_save(flags);
+	do {
+		num = pci_sun4v_iommu_demap(devhandle,
+					    HV_PCI_TSBID(0, entry),
+					    npages);
+
+		entry += num;
+		npages -= num;
+	} while (npages != 0);
+	local_irq_restore(flags);
+}
+
 static void dma_4v_free_coherent(struct device *dev, size_t size, void *cpu,
 				 dma_addr_t dvma, struct dma_attrs *attrs)
 {
 	struct pci_pbm_info *pbm;
 	struct iommu *iommu;
-	unsigned long flags, order, npages, entry;
+	unsigned long order, npages, entry;
 	u32 devhandle;
 
 	npages = IO_PAGE_ALIGN(size) >> IO_PAGE_SHIFT;
 	iommu = dev->archdata.iommu;
 	pbm = dev->archdata.host_controller;
 	devhandle = pbm->devhandle;
-	entry = ((dvma - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
-
-	spin_lock_irqsave(&iommu->lock, flags);
-
-	iommu_range_free(iommu, dvma, npages);
-
-	do {
-		unsigned long num;
-
-		num = pci_sun4v_iommu_demap(devhandle, HV_PCI_TSBID(0, entry),
-					    npages);
-		entry += num;
-		npages -= num;
-	} while (npages != 0);
-
-	spin_unlock_irqrestore(&iommu->lock, flags);
-
+	entry = ((dvma - iommu->tbl.table_map_base) >> IO_PAGE_SHIFT);
+	dma_4v_iommu_demap(&devhandle, entry, npages);
+	iommu_tbl_range_free(&iommu->tbl, dvma, npages, DMA_ERROR_CODE);
 	order = get_order(size);
 	if (order < 10)
 		free_pages((unsigned long)cpu, order);
@@ -253,15 +252,12 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page,
 	npages = IO_PAGE_ALIGN(oaddr + sz) - (oaddr & IO_PAGE_MASK);
 	npages >>= IO_PAGE_SHIFT;
 
-	spin_lock_irqsave(&iommu->lock, flags);
-	entry = iommu_range_alloc(dev, iommu, npages, NULL);
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages, NULL);
 
 	if (unlikely(entry == DMA_ERROR_CODE))
 		goto bad;
 
-	bus_addr = (iommu->page_table_map_base +
-		    (entry << IO_PAGE_SHIFT));
+	bus_addr = (iommu->tbl.table_map_base + (entry << IO_PAGE_SHIFT));
 	ret = bus_addr | (oaddr & ~IO_PAGE_MASK);
 	base_paddr = __pa(oaddr & IO_PAGE_MASK);
 	prot = HV_PCI_MAP_ATTR_READ;
@@ -290,11 +286,7 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page,
 	return DMA_ERROR_CODE;
 
 iommu_map_fail:
-	/* Interrupts are disabled.  */
-	spin_lock(&iommu->lock);
-	iommu_range_free(iommu, bus_addr, npages);
-	spin_unlock_irqrestore(&iommu->lock, flags);
-
+	iommu_tbl_range_free(&iommu->tbl, bus_addr, npages, DMA_ERROR_CODE);
 	return DMA_ERROR_CODE;
 }
 
@@ -304,7 +296,7 @@ static void dma_4v_unmap_page(struct device *dev, dma_addr_t bus_addr,
 {
 	struct pci_pbm_info *pbm;
 	struct iommu *iommu;
-	unsigned long flags, npages;
+	unsigned long npages;
 	long entry;
 	u32 devhandle;
 
@@ -321,22 +313,9 @@ static void dma_4v_unmap_page(struct device *dev, dma_addr_t bus_addr,
 	npages = IO_PAGE_ALIGN(bus_addr + sz) - (bus_addr & IO_PAGE_MASK);
 	npages >>= IO_PAGE_SHIFT;
 	bus_addr &= IO_PAGE_MASK;
-
-	spin_lock_irqsave(&iommu->lock, flags);
-
-	iommu_range_free(iommu, bus_addr, npages);
-
-	entry = (bus_addr - iommu->page_table_map_base) >> IO_PAGE_SHIFT;
-	do {
-		unsigned long num;
-
-		num = pci_sun4v_iommu_demap(devhandle, HV_PCI_TSBID(0, entry),
-					    npages);
-		entry += num;
-		npages -= num;
-	} while (npages != 0);
-
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	entry = (bus_addr - iommu->tbl.table_map_base) >> IO_PAGE_SHIFT;
+	dma_4v_iommu_demap(&devhandle, entry, npages);
+	iommu_tbl_range_free(&iommu->tbl, bus_addr, npages, DMA_ERROR_CODE);
 }
 
 static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -371,14 +350,14 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 	/* Init first segment length for backout at failure */
 	outs->dma_length = 0;
 
-	spin_lock_irqsave(&iommu->lock, flags);
+	local_irq_save(flags);
 
 	iommu_batch_start(dev, prot, ~0UL);
 
 	max_seg_size = dma_get_max_seg_size(dev);
 	seg_boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
 				  IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
-	base_shift = iommu->page_table_map_base >> IO_PAGE_SHIFT;
+	base_shift = iommu->tbl.table_map_base >> IO_PAGE_SHIFT;
 	for_each_sg(sglist, s, nelems, i) {
 		unsigned long paddr, npages, entry, out_entry = 0, slen;
 
@@ -391,7 +370,8 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 		/* Allocate iommu entries for that segment */
 		paddr = (unsigned long) SG_ENT_PHYS_ADDRESS(s);
 		npages = iommu_num_pages(paddr, slen, IO_PAGE_SIZE);
-		entry = iommu_range_alloc(dev, iommu, npages, &handle);
+		entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages,
+					      &handle);
 
 		/* Handle failure */
 		if (unlikely(entry == DMA_ERROR_CODE)) {
@@ -404,8 +384,7 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 		iommu_batch_new_entry(entry);
 
 		/* Convert entry to a dma_addr_t */
-		dma_addr = iommu->page_table_map_base +
-			(entry << IO_PAGE_SHIFT);
+		dma_addr = iommu->tbl.table_map_base + (entry << IO_PAGE_SHIFT);
 		dma_addr |= (s->offset & ~IO_PAGE_MASK);
 
 		/* Insert into HW table */
@@ -451,7 +430,7 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 	if (unlikely(err < 0L))
 		goto iommu_map_failed;
 
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	local_irq_restore(flags);
 
 	if (outcount < incount) {
 		outs = sg_next(outs);
@@ -469,7 +448,8 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 			vaddr = s->dma_address & IO_PAGE_MASK;
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
 						 IO_PAGE_SIZE);
-			iommu_range_free(iommu, vaddr, npages);
+			iommu_tbl_range_free(&iommu->tbl, vaddr, npages,
+					     DMA_ERROR_CODE);
 			/* XXX demap? XXX */
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
@@ -477,7 +457,7 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 		if (s == outs)
 			break;
 	}
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	local_irq_restore(flags);
 
 	return 0;
 }
@@ -489,7 +469,7 @@ static void dma_4v_unmap_sg(struct device *dev, struct scatterlist *sglist,
 	struct pci_pbm_info *pbm;
 	struct scatterlist *sg;
 	struct iommu *iommu;
-	unsigned long flags;
+	unsigned long flags, entry;
 	u32 devhandle;
 
 	BUG_ON(direction == DMA_NONE);
@@ -498,33 +478,27 @@ static void dma_4v_unmap_sg(struct device *dev, struct scatterlist *sglist,
 	pbm = dev->archdata.host_controller;
 	devhandle = pbm->devhandle;
 	
-	spin_lock_irqsave(&iommu->lock, flags);
+	local_irq_save(flags);
 
 	sg = sglist;
 	while (nelems--) {
 		dma_addr_t dma_handle = sg->dma_address;
 		unsigned int len = sg->dma_length;
-		unsigned long npages, entry;
+		unsigned long npages;
+		struct iommu_map_table *tbl = &iommu->tbl;
+		unsigned long shift = IO_PAGE_SHIFT;
 
 		if (!len)
 			break;
 		npages = iommu_num_pages(dma_handle, len, IO_PAGE_SIZE);
-		iommu_range_free(iommu, dma_handle, npages);
-
-		entry = ((dma_handle - iommu->page_table_map_base) >> IO_PAGE_SHIFT);
-		while (npages) {
-			unsigned long num;
-
-			num = pci_sun4v_iommu_demap(devhandle, HV_PCI_TSBID(0, entry),
-						    npages);
-			entry += num;
-			npages -= num;
-		}
-
+		entry = ((dma_handle - tbl->table_map_base) >> shift);
+		dma_4v_iommu_demap(&devhandle, entry, npages);
+		iommu_tbl_range_free(&iommu->tbl, dma_handle, npages,
+				     DMA_ERROR_CODE);
 		sg = sg_next(sg);
 	}
 
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	local_irq_restore(flags);
 }
 
 static struct dma_map_ops sun4v_dma_ops = {
@@ -550,30 +524,33 @@ static void pci_sun4v_scan_bus(struct pci_pbm_info *pbm, struct device *parent)
 }
 
 static unsigned long probe_existing_entries(struct pci_pbm_info *pbm,
-					    struct iommu *iommu)
+					    struct iommu_map_table *iommu)
 {
-	struct iommu_arena *arena = &iommu->arena;
-	unsigned long i, cnt = 0;
+	struct iommu_pool *pool;
+	unsigned long i, pool_nr, cnt = 0;
 	u32 devhandle;
 
 	devhandle = pbm->devhandle;
-	for (i = 0; i < arena->limit; i++) {
-		unsigned long ret, io_attrs, ra;
-
-		ret = pci_sun4v_iommu_getmap(devhandle,
-					     HV_PCI_TSBID(0, i),
-					     &io_attrs, &ra);
-		if (ret == HV_EOK) {
-			if (page_in_phys_avail(ra)) {
-				pci_sun4v_iommu_demap(devhandle,
-						      HV_PCI_TSBID(0, i), 1);
-			} else {
-				cnt++;
-				__set_bit(i, arena->map);
+	for (pool_nr = 0; pool_nr < iommu->nr_pools; pool_nr++) {
+		pool = &(iommu->pools[pool_nr]);
+		for (i = pool->start; i <= pool->end; i++) {
+			unsigned long ret, io_attrs, ra;
+
+			ret = pci_sun4v_iommu_getmap(devhandle,
+						     HV_PCI_TSBID(0, i),
+						     &io_attrs, &ra);
+			if (ret == HV_EOK) {
+				if (page_in_phys_avail(ra)) {
+					pci_sun4v_iommu_demap(devhandle,
+							      HV_PCI_TSBID(0,
+							      i), 1);
+				} else {
+					cnt++;
+					__set_bit(i, iommu->map);
+				}
 			}
 		}
 	}
-
 	return cnt;
 }
 
@@ -603,20 +580,22 @@ static int pci_sun4v_iommu_init(struct pci_pbm_info *pbm)
 	/* Setup initial software IOMMU state. */
 	spin_lock_init(&iommu->lock);
 	iommu->ctx_lowest_free = 1;
-	iommu->page_table_map_base = dma_offset;
+	iommu->tbl.table_map_base = dma_offset;
 	iommu->dma_addr_mask = dma_mask;
 
 	/* Allocate and initialize the free area map.  */
 	sz = (num_tsb_entries + 7) / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-	if (!iommu->arena.map) {
+	iommu->tbl.map = kzalloc(sz, GFP_KERNEL);
+	if (!iommu->tbl.map) {
 		printk(KERN_ERR PFX "Error, kmalloc(arena.map) failed.\n");
 		return -ENOMEM;
 	}
-	iommu->arena.limit = num_tsb_entries;
-
-	sz = probe_existing_entries(pbm, iommu);
+	iommu_tbl_pool_init(&iommu->tbl, num_tsb_entries, IO_PAGE_SHIFT,
+			    NULL, false /* no large_pool */,
+			    0 /* default npools */,
+			    false /* want span boundary checking */);
+	sz = probe_existing_entries(pbm, &iommu->tbl);
 	if (sz)
 		printk("%s: Imported %lu TSB entries from OBP\n",
 		       pbm->name, sz);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v8 RFC 3/3] sparc: Make LDC use common iommu poll management functions
  2015-03-31 14:40 ` Sowmini Varadhan
@ 2015-03-31 14:40   ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-03-31 14:40 UTC (permalink / raw)
  To: sparclinux; +Cc: aik, anton, paulus, sowmini.varadhan, linuxppc-dev, davem

Note that this conversion is only being done to consolidate the
code and ensure that the common code provides sufficient
abstraction. It is not expected to result in any noticeable
performance improvement, as there is typically one ldc_iommu
per vnet_port, and each one has 8k entries, with a typical
request for 1-4 pages.  Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
    inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops
v7: move pool_hash initialization to iommu_tbl_pool_init
v8: BenH comments

 arch/sparc/kernel/ldc.c |  153 ++++++++++++++++++++---------------------------
 1 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..f842512 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,7 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +28,10 @@
 #define DRV_MODULE_VERSION	"1.1"
 #define DRV_MODULE_RELDATE	"July 22, 2008"
 
+#define COOKIE_PGSZ_CODE	0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT	60ULL
+
+
 static char version[] =
 	DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 #define LDC_PACKET_SIZE		64
@@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-	/* Protects arena alloc/free.  */
+	/* Protects ldc_unmap.  */
 	spinlock_t			lock;
-	struct iommu_arena		arena;
 	struct ldc_mtable_entry		*page_table;
+	struct iommu_map_table		iommu_map_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct ldc_packet *q)
 	free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+	u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+	/* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+	cookie &= ~COOKIE_PGSZ_CODE;
+
+	return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+		      unsigned long entry, unsigned long npages)
+{
+	struct ldc_mtable_entry *base;
+	unsigned long i, shift;
+
+	shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+	base = iommu->page_table + entry;
+	for (i = 0; i < npages; i++) {
+		if (base->cookie)
+			sun4v_ldc_revoke(id, cookie + (i << shift),
+					 base->cookie);
+		base->mte = 0;
+	}
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE	(8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
 	unsigned long sz, num_tsb_entries, tsbsize, order;
-	struct ldc_iommu *iommu = &lp->iommu;
+	struct ldc_iommu *ldc_iommu = &lp->iommu;
+	struct iommu_map_table *iommu = &ldc_iommu->iommu_map_table;
 	struct ldc_mtable_entry *table;
 	unsigned long hv_err;
 	int err;
 
 	num_tsb_entries = LDC_IOTABLE_SIZE;
 	tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-	spin_lock_init(&iommu->lock);
+	spin_lock_init(&ldc_iommu->lock);
 
 	sz = num_tsb_entries / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-	if (!iommu->arena.map) {
+	iommu->map = kzalloc(sz, GFP_KERNEL);
+	if (!iommu->map) {
 		printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
 		return -ENOMEM;
 	}
-
-	iommu->arena.limit = num_tsb_entries;
+	iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+			    NULL, false /* no large pool */,
+			    1 /* npools */,
+			    true /* skip span boundary check */);
 
 	order = get_order(tsbsize);
 
@@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 	memset(table, 0, PAGE_SIZE << order);
 
-	iommu->page_table = table;
+	ldc_iommu->page_table = table;
 
 	hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table),
 					 num_tsb_entries);
@@ -1049,31 +1082,32 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 out_free_table:
 	free_pages((unsigned long) table, order);
-	iommu->page_table = NULL;
+	ldc_iommu->page_table = NULL;
 
 out_free_map:
-	kfree(iommu->arena.map);
-	iommu->arena.map = NULL;
+	kfree(iommu->map);
+	iommu->map = NULL;
 
 	return err;
 }
 
 static void ldc_iommu_release(struct ldc_channel *lp)
 {
-	struct ldc_iommu *iommu = &lp->iommu;
+	struct ldc_iommu *ldc_iommu = &lp->iommu;
+	struct iommu_map_table *iommu = &ldc_iommu->iommu_map_table;
 	unsigned long num_tsb_entries, tsbsize, order;
 
 	(void) sun4v_ldc_set_map_table(lp->id, 0, 0);
 
-	num_tsb_entries = iommu->arena.limit;
+	num_tsb_entries = iommu->poolsize * iommu->nr_pools;
 	tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
 	order = get_order(tsbsize);
 
-	free_pages((unsigned long) iommu->page_table, order);
-	iommu->page_table = NULL;
+	free_pages((unsigned long) ldc_iommu->page_table, order);
+	ldc_iommu->page_table = NULL;
 
-	kfree(iommu->arena.map);
-	iommu->arena.map = NULL;
+	kfree(iommu->map);
+	iommu->map = NULL;
 }
 
 struct ldc_channel *ldc_alloc(unsigned long id,
@@ -1140,7 +1174,7 @@ struct ldc_channel *ldc_alloc(unsigned long id,
 
 	lp->id = id;
 
-	err = ldc_iommu_init(lp);
+	err = ldc_iommu_init(name, lp);
 	if (err)
 		goto out_free_ldc;
 
@@ -1885,40 +1919,6 @@ int ldc_read(struct ldc_channel *lp, void *buf, unsigned int size)
 }
 EXPORT_SYMBOL(ldc_read);
 
-static long arena_alloc(struct ldc_iommu *iommu, unsigned long npages)
-{
-	struct iommu_arena *arena = &iommu->arena;
-	unsigned long n, start, end, limit;
-	int pass;
-
-	limit = arena->limit;
-	start = arena->hint;
-	pass = 0;
-
-again:
-	n = bitmap_find_next_zero_area(arena->map, limit, start, npages, 0);
-	end = n + npages;
-	if (unlikely(end >= limit)) {
-		if (likely(pass < 1)) {
-			limit = start;
-			start = 0;
-			pass++;
-			goto again;
-		} else {
-			/* Scanned the whole thing, give up. */
-			return -1;
-		}
-	}
-	bitmap_set(arena->map, n, npages);
-
-	arena->hint = end;
-
-	return n;
-}
-
-#define COOKIE_PGSZ_CODE	0xf000000000000000ULL
-#define COOKIE_PGSZ_CODE_SHIFT	60ULL
-
 static u64 pagesize_code(void)
 {
 	switch (PAGE_SIZE) {
@@ -1945,23 +1945,14 @@ static u64 make_cookie(u64 index, u64 pgsz_code, u64 page_offset)
 		page_offset);
 }
 
-static u64 cookie_to_index(u64 cookie, unsigned long *shift)
-{
-	u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
-
-	cookie &= ~COOKIE_PGSZ_CODE;
-
-	*shift = szcode * 3;
-
-	return (cookie >> (13ULL + (szcode * 3ULL)));
-}
 
 static struct ldc_mtable_entry *alloc_npages(struct ldc_iommu *iommu,
 					     unsigned long npages)
 {
 	long entry;
 
-	entry = arena_alloc(iommu, npages);
+	entry = iommu_tbl_range_alloc(NULL, &iommu->iommu_map_table,
+				      npages, NULL);
 	if (unlikely(entry < 0))
 		return NULL;
 
@@ -2090,7 +2081,7 @@ int ldc_map_sg(struct ldc_channel *lp,
 	       struct ldc_trans_cookie *cookies, int ncookies,
 	       unsigned int map_perm)
 {
-	unsigned long i, npages, flags;
+	unsigned long i, npages;
 	struct ldc_mtable_entry *base;
 	struct cookie_state state;
 	struct ldc_iommu *iommu;
@@ -2109,9 +2100,7 @@ int ldc_map_sg(struct ldc_channel *lp,
 
 	iommu = &lp->iommu;
 
-	spin_lock_irqsave(&iommu->lock, flags);
 	base = alloc_npages(iommu, npages);
-	spin_unlock_irqrestore(&iommu->lock, flags);
 
 	if (!base)
 		return -ENOMEM;
@@ -2136,7 +2125,7 @@ int ldc_map_single(struct ldc_channel *lp,
 		   struct ldc_trans_cookie *cookies, int ncookies,
 		   unsigned int map_perm)
 {
-	unsigned long npages, pa, flags;
+	unsigned long npages, pa;
 	struct ldc_mtable_entry *base;
 	struct cookie_state state;
 	struct ldc_iommu *iommu;
@@ -2152,9 +2141,7 @@ int ldc_map_single(struct ldc_channel *lp,
 
 	iommu = &lp->iommu;
 
-	spin_lock_irqsave(&iommu->lock, flags);
 	base = alloc_npages(iommu, npages);
-	spin_unlock_irqrestore(&iommu->lock, flags);
 
 	if (!base)
 		return -ENOMEM;
@@ -2172,35 +2159,25 @@ int ldc_map_single(struct ldc_channel *lp,
 }
 EXPORT_SYMBOL(ldc_map_single);
 
+
 static void free_npages(unsigned long id, struct ldc_iommu *iommu,
 			u64 cookie, u64 size)
 {
-	struct iommu_arena *arena = &iommu->arena;
-	unsigned long i, shift, index, npages;
-	struct ldc_mtable_entry *base;
+	unsigned long npages, entry;
 
 	npages = PAGE_ALIGN(((cookie & ~PAGE_MASK) + size)) >> PAGE_SHIFT;
-	index = cookie_to_index(cookie, &shift);
-	base = iommu->page_table + index;
 
-	BUG_ON(index > arena->limit ||
-	       (index + npages) > arena->limit);
-
-	for (i = 0; i < npages; i++) {
-		if (base->cookie)
-			sun4v_ldc_revoke(id, cookie + (i << shift),
-					 base->cookie);
-		base->mte = 0;
-		__clear_bit(index + i, arena->map);
-	}
+	entry = ldc_cookie_to_index(cookie, iommu);
+	ldc_demap(iommu, id, cookie, entry, npages);
+	iommu_tbl_range_free(&iommu->iommu_map_table, cookie, npages, entry);
 }
 
 void ldc_unmap(struct ldc_channel *lp, struct ldc_trans_cookie *cookies,
 	       int ncookies)
 {
 	struct ldc_iommu *iommu = &lp->iommu;
-	unsigned long flags;
 	int i;
+	unsigned long flags;
 
 	spin_lock_irqsave(&iommu->lock, flags);
 	for (i = 0; i < ncookies; i++) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* RE: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
  2015-03-31 14:40   ` Sowmini Varadhan
@ 2015-03-31 15:15     ` David Laight
  -1 siblings, 0 replies; 60+ messages in thread
From: David Laight @ 2015-03-31 15:15 UTC (permalink / raw)
  To: 'Sowmini Varadhan', sparclinux
  Cc: paulus, anton, aik, linuxppc-dev, davem

From: Sowmini Varadhan
> Investigation of multithreaded iperf experiments on an ethernet
> interface show the iommu->lock as the hottest lock identified by
> lockstat, with something of the order of  21M contentions out of
> 27M acquisitions, and an average wait time of 26 us for the lock.
> This is not efficient. A more scalable design is to follow the ppc
> model, where the iommu_map_table has multiple pools, each stretching
> over a segment of the map, and with a separate lock for each pool.
> This model allows for better parallelization of the iommu map search.

I've wondered whether the iommu setup for ethernet receive (in particular)
could be made much more efficient if there were a function that
would unmap one buffer and map a second buffer?
My thought is that iommu pte entry used by the old buffer could just
be modified to reference the new one.
In effect each ring entry would end up using a fixed iommu pte.

The other question is how much data can be copied in 26us ?
On iommu systems 'copybreak' limits on receive and transmit
may need to be quite high.

	David
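
In code, the fixed-iommu-pte-per-ring-entry idea is roughly the toy
userspace model below. The iopte[] array stands in for the hardware
IOMMU page table, the structure and names are invented for
illustration, and none of this is the sparc IOMMU or any driver code.

#include <stdint.h>
#include <stdio.h>

/* Toy model: every RX ring slot owns one IOMMU entry for the life of
 * the ring, and a refill simply rewrites that entry's target physical
 * address instead of doing a full unmap + map.
 */
#define RING_SIZE     256
#define IO_PAGE_SHIFT 13

static uint64_t iopte[RING_SIZE];	/* stand-in for IOMMU PTEs */

struct rx_slot {
	unsigned int idx;		/* fixed IOMMU entry index */
	uint64_t     dma_addr;		/* constant for the ring's lifetime */
};

static void rx_refill(struct rx_slot *s, uint64_t new_buf_phys)
{
	iopte[s->idx] = new_buf_phys;	/* replaces unmap + alloc + map */
}

int main(void)
{
	struct rx_slot s = { .idx = 7, .dma_addr = 7ULL << IO_PAGE_SHIFT };

	rx_refill(&s, 0x40123000ULL);
	printf("slot %u -> phys %#llx, dma %#llx\n", s.idx,
	       (unsigned long long)iopte[s.idx],
	       (unsigned long long)s.dma_addr);
	return 0;
}

The point of the sketch is that the per-packet path touches only the
slot it already owns, so the allocator lock and the bitmap search drop
out of the hot path entirely.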

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
  2015-03-31 15:15     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock David Laight
@ 2015-03-31 15:25       ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-03-31 15:25 UTC (permalink / raw)
  To: David Laight; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On (03/31/15 15:15), David Laight wrote:
> 
> I've wondered whether the iommu setup for ethernet receive (in particular)
> could be made much more efficient if there were a function that
> would unmap one buffer and map a second buffer?
> My thought is that iommu pte entry used by the old buffer could just
> be modified to reference the new one.
> In effect each ring entry would end up using a fixed iommu pte.

There are a number of interesting things to investigate in
this space, and the above is just one of them, ways to avoid
the overhead of a full-blown map/unmap on each call. See 
  http://www.spinics.net/lists/sparclinux/msg13613.html

But the scope of this patchset is actually very rigidly defined:
to refactor the iommu pool/arena allocator into a common library,
and avoid code duplication (today even the single sparc arch
duplicates it for sun4[u,v] and ldc, and that's not even counting
the duplication across other archs/pci-drivers).

Investigating ways to provide a generalized infra that can
avoid a dma map/unmap for every packet would be a good follow-on.

> The other question is how much data can be copied in 26us ?
> On iommu systems 'copybreak' limits on receive and transmit
> may need to be quite high.

where does the "26us" number come from? I may be missing that context?

--Sowmini

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-03-31 14:40 ` Sowmini Varadhan
@ 2015-03-31 18:06   ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-03-31 18:06 UTC (permalink / raw)
  To: sparclinux; +Cc: aik, anton, paulus, linuxppc-dev, davem

On (03/31/15 10:40), Sowmini Varadhan wrote:
> 
> I've not heard back from the IB folks, but I'm going to make
> a judgement call here and go with the spin_lock. *If* they
> report some significant benefit from the trylock, probably 
> need to revisit this (and then probably start by re-exmaining
> the hash function to avoid collisions, before resorting to 
> trylock).

Having bravely said that..

the IB team informs me that they see a 10% degradation using 
the spin_lock as opposed to the trylock.
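
For reference, the trylock variant being compared is roughly the
pattern below. This is a simplified pthread sketch of the idea only
(pool locks alone, names invented for illustration), not the powerpc
allocator or the code in this series: on contention the caller hops
to another pool instead of waiting for the one it hashed to.

#include <pthread.h>

#define NR_POOLS 16

static pthread_mutex_t pool_lock[NR_POOLS];	/* one lock per pool */

/* spin_lock flavour: wait for the pool we hashed to */
static unsigned int lock_pool_plain(unsigned int start)
{
	unsigned int n = start % NR_POOLS;

	pthread_mutex_lock(&pool_lock[n]);
	return n;
}

/* trylock flavour: skip a contended pool and move on to the next */
static unsigned int lock_pool_try(unsigned int start)
{
	unsigned int n = start % NR_POOLS;

	while (pthread_mutex_trylock(&pool_lock[n]) != 0)
		n = (n + 1) % NR_POOLS;
	return n;
}

int main(void)
{
	unsigned int n;

	for (n = 0; n < NR_POOLS; n++)
		pthread_mutex_init(&pool_lock[n], NULL);

	n = lock_pool_try(3);
	pthread_mutex_unlock(&pool_lock[n]);
	n = lock_pool_plain(3);
	pthread_mutex_unlock(&pool_lock[n]);
	return 0;
}

The trylock form trades waiting for extra pool switches, which helps
when a few pools are hot but does nothing for the cost of the search
inside each pool.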

one path going forward is to continue processing this patch-set 
as is. I can investigate this further, and later revise the spin_lock
to the trylock, after we are certain that it is good/necessary.

thoughts?

--Sowmini



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-03-31 18:06   ` Sowmini Varadhan
@ 2015-03-31 18:08     ` David Miller
  -1 siblings, 0 replies; 60+ messages in thread
From: David Miller @ 2015-03-31 18:08 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Tue, 31 Mar 2015 14:06:42 -0400

> Having bravely said that..
> 
> the IB team informs me that they see a 10% degradation using 
> the spin_lock as opposed to the trylock.
> 
> one path going forward is to continue processing this patch-set 
> as is. I can investigate this further, and later revise the spin_lock
> to the trylock, after we are certain that it is good/necessary.
> 
> thoughts?

Let's address the trylock vs. spin_lock thing later and use plain
spin_lock for now.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-03-31 18:06   ` Sowmini Varadhan
@ 2015-04-01  1:01     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-01  1:01 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Tue, 2015-03-31 at 14:06 -0400, Sowmini Varadhan wrote:
> Having bravely said that..
> 
> the IB team informs me that they see a 10% degradation using 
> the spin_lock as opposed to the trylock.
> 
> one path going forward is to continue processing this patch-set 
> as is. I can investigate this further, and later revise the spin_lock
> to the trylock, after we are certain that it is good/necessary.

Have they tried using more pools instead ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-04-01  1:01     ` Benjamin Herrenschmidt
@ 2015-04-01  1:08       ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-04-01  1:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On 03/31/2015 09:01 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2015-03-31 at 14:06 -0400, Sowmini Varadhan wrote:
>> Having bravely said that..
>>
>> the IB team informs me that they see a 10% degradation using
>> the spin_lock as opposed to the trylock.
>>
>> one path going forward is to continue processing this patch-set
>> as is. I can investigate this further, and later revise the spin_lock
>> to the trylock, after we are certain that it is good/necessary.
>
> Have they tried using more pools instead ?


we just tried 32 instead of 16, no change to perf.

Looks like their current bottleneck is the find_next_zero_bit (they
can get a 2X perf improvement with the lock fragmentation, but are
then hitting a new ceiling, even with the trylock version)


I'm starting to wonder if  some approximation of dma premapped
buffers may be needed. Doing a map/unmap on each packet is expensive.
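
To see why splitting the lock helps but the linear scan can remain a
ceiling, the pooled scheme can be modelled roughly as below. Plain
arrays and pthread mutexes stand in for the kernel bitmap helpers and
spinlocks, the sizes are invented, and this is not the proposed
lib/iommu-common.c code.

#include <pthread.h>
#include <stdbool.h>

#define NR_POOLS 16
#define POOL_SZ  4096			/* entries per pool; illustrative */

struct pool {
	pthread_mutex_t lock;		/* one lock per pool, not per table */
	unsigned long   hint;		/* where the last search stopped */
	bool            used[POOL_SZ];
};

static struct pool pools[NR_POOLS];

/* Allocate npages consecutive entries from the pool picked by 'cpu'.
 * Only that pool's lock is held, so CPUs hashed to different pools do
 * not contend, but the O(pool size) scan below is still paid on every
 * allocation -- the find_next_zero_bit cost mentioned above.  A real
 * allocator would also wrap around and spill into other pools; that is
 * omitted here.  Returns a table-wide index, or -1 if no run was found.
 */
static long pool_alloc(unsigned int cpu, unsigned long npages)
{
	struct pool *p = &pools[cpu % NR_POOLS];
	unsigned long i, j;
	long ret = -1;

	pthread_mutex_lock(&p->lock);
	for (i = p->hint; i + npages <= POOL_SZ; i++) {
		for (j = 0; j < npages; j++)
			if (p->used[i + j])
				break;
		if (j == npages) {		/* found a free run */
			for (j = 0; j < npages; j++)
				p->used[i + j] = true;
			p->hint = i + npages;
			ret = (long)((cpu % NR_POOLS) * POOL_SZ + i);
			break;
		}
	}
	pthread_mutex_unlock(&p->lock);
	return ret;
}

int main(void)
{
	for (int i = 0; i < NR_POOLS; i++)
		pthread_mutex_init(&pools[i].lock, NULL);

	return pool_alloc(0, 4) < 0;	/* grab 4 entries from pool 0 */
}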


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-04-01  1:08       ` Sowmini Varadhan
@ 2015-04-01  3:12         ` David Miller
  -1 siblings, 0 replies; 60+ messages in thread
From: David Miller @ 2015-04-01  3:12 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Tue, 31 Mar 2015 21:08:18 -0400

> I'm starting to wonder if  some approximation of dma premapped
> buffers may be needed. Doing a map/unmap on each packet is expensive.

It's much better amortized with smart buffering strategies, which are
common on current generation networking cards.

There you only eat one map/unmap per "PAGE_SIZE / rx_pkt_size".
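
As a rough sketch of that strategy (illustrative only; the rx_page_frag
structure and rx_buf_get() helper are made-up names, not from any
particular driver):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

struct rx_page_frag {
	struct page	*page;		/* currently mapped rx page */
	dma_addr_t	dma;		/* its bus address */
	unsigned int	offset;		/* next unused chunk in the page */
};

/*
 * Hand out one rx buffer of rx_pkt_size bytes.  A fresh page is
 * allocated and IOMMU-mapped only when the current one is exhausted,
 * so the map cost is paid once per PAGE_SIZE / rx_pkt_size buffers.
 * (Mapping-error checks and unmapping of the retired page are elided
 * from this sketch.)
 */
static dma_addr_t rx_buf_get(struct device *dev, struct rx_page_frag *f,
			     unsigned int rx_pkt_size)
{
	dma_addr_t buf;

	if (!f->page || f->offset + rx_pkt_size > PAGE_SIZE) {
		f->page = alloc_page(GFP_ATOMIC);
		if (!f->page)
			return 0;
		f->dma = dma_map_page(dev, f->page, 0, PAGE_SIZE,
				      DMA_FROM_DEVICE);
		f->offset = 0;
	}
	buf = f->dma + f->offset;
	f->offset += rx_pkt_size;
	return buf;
}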

Maybe the infiniband stuff is doing things very suboptimally, and
actually with that subsystem and drivers absolutely nothing would
surprise me.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-04-01  3:12         ` David Miller
@ 2015-04-02 12:51           ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-04-02 12:51 UTC (permalink / raw)
  To: David Miller; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev

On (03/31/15 23:12), David Miller wrote:
> 
> It's much more amortized with smart buffering strategies, which are
> common on current generation networking cards.
> 
> There you only eat one map/unmap per "PAGE_SIZE / rx_pkt_size".
> 
> Maybe the infiniband stuff is doing things very suboptimally, and
> actually with that subsystem and drivers absolutely nothing would
> surprise me.

Yeah, we are trying to get more info from them about what their
bottlenecks are. Until then, let's just stick with the spin_lock.

do I need to resubmit this without the RFC tag? Perhaps I should
have dropped that some time ago.

--Sowmini

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-04-02 12:51           ` Sowmini Varadhan
@ 2015-04-02 16:21             ` David Miller
  -1 siblings, 0 replies; 60+ messages in thread
From: David Miller @ 2015-04-02 16:21 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Thu, 2 Apr 2015 08:51:52 -0400

> do I need to resubmit this without the RFC tag? Perhaps I should
> have dropped that some time ago.

I want to hear from the powerpc folks whether they can positively
adopt the new generic code or not.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-04-02 16:21             ` David Miller
@ 2015-04-02 20:22               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-02 20:22 UTC (permalink / raw)
  To: David Miller
  Cc: aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Thu, 2015-04-02 at 12:21 -0400, David Miller wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Thu, 2 Apr 2015 08:51:52 -0400
> 
> > do I need to resubmit this without the RFC tag? Perhaps I should
> > have dropped that some time ago.
> 
> I want to hear from the powerpc folks whether they can positively
> adopt the new generic code or not.

Oh yes, I'm fine with it, it's basically our code slightly refactored,
but I still have a few minor review comments on Sowmini's patch I
started to put together yesterday but didn't finish. I'll send them
today.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
  2015-04-02 20:22               ` Benjamin Herrenschmidt
@ 2015-04-02 20:52                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-02 20:52 UTC (permalink / raw)
  To: David Miller
  Cc: aik, sowmini.varadhan, anton, paulus, sparclinux, linuxppc-dev

On Fri, 2015-04-03 at 07:22 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2015-04-02 at 12:21 -0400, David Miller wrote:
> > From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> > Date: Thu, 2 Apr 2015 08:51:52 -0400
> > 
> > > do I need to resubmit this without the RFC tag? Perhaps I should
> > > have dropped that some time ago.
> > 
> > I want to hear from the powerpc folks whether they can positively
> > adopt the new generic code or not.
> 
> Oh yes, I'm fine with it, it's basically our code slightly refactored,
> but I still have a few minor review comments on Swomini's patch I
> started to put together yesterday but didn't finish. I'll send them
> today.

Actually I found some more serious issues that for some reason I didn't
spot the other day. Will send comments in a jiffy.

Ben.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
@ 2015-04-02 20:54     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-02 20:54 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Tue, 2015-03-31 at 10:40 -0400, Sowmini Varadhan wrote:

> +	if (largealloc) {
> +		pool = &(iommu->large_pool);
> +		spin_lock_irqsave(&pool->lock, flags);
> +		pool_nr = 0; /* to keep compiler happy */
> +	} else {
> +		/* pick out pool_nr */
> +		pool_nr =  pool_hash & (npools - 1);
> +		pool = &(iommu->pools[pool_nr]);
> +		spin_lock_irqsave(&(pool->lock), flags);
> +	}

Move spin_lock_irqsave outside of if/else
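
i.e., something along these lines (a sketch, using the same variable
names as the quoted patch):

	if (largealloc) {
		pool = &(iommu->large_pool);
		pool_nr = 0; /* to keep compiler happy */
	} else {
		/* pick out pool_nr */
		pool_nr = pool_hash & (npools - 1);
		pool = &(iommu->pools[pool_nr]);
	}
	spin_lock_irqsave(&pool->lock, flags);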

> + again:
> +	if (pass == 0 && handle && *handle &&
> +	    (*handle >= pool->start) && (*handle < pool->end))
> +		start = *handle;
> +	else
> +		start = pool->hint;

Now this means "handle" might be < pool->hint, in that case you also
need a lazy flush. Or rather only if the resulting alloc is. My
suggestion originally was to test that after the area alloc. See further
down.

> +	limit = pool->end;
> +
> +	/* The case below can happen if we have a small segment appended
> +	 * to a large, or when the previous alloc was at the very end of
> +	 * the available space. If so, go back to the beginning and flush.
> +	 */
> +	if (start >= limit) {
> +		start = pool->start;
> +		if (!large_pool && iommu->lazy_flush != NULL)
> +			iommu->lazy_flush(iommu);

Add need_flush = false;

In fact, looking more closely, I see another problem:

You never flush the large pool. You should either remove all
your large_pool checks (since you treat it as a normal pool in
your code) and let it flush normally, or you should have an
unconditional flush somewhere in there for the large pool (or in
free()), either way. The current implementation you show me, afaik,
never calls lazy_flush on a large allocation.

Also, if you fix the problem I mentioned earlier about *handle < hint,
you also don't need the above flush, it will be handled further down
in the else case for the if (n == -1), see below

> +	}

I only just noticed too, you completely dropped the code to honor
the dma mask. Why that ? Some devices rely on this.

> +	if (dev)
> +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> +				      1 << iommu->table_shift);
> +	else
> +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> +
> +	shift = iommu->table_map_base >> iommu->table_shift;
> +	boundary_size = boundary_size >> iommu->table_shift;
> +	/*
> +	 * if the skip_span_boundary_check had been set during init, we set
> +	 * things up so that iommu_is_span_boundary() merely checks if the
> +	 * (index + npages) < num_tsb_entries
> +	 */
> +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> +		shift = 0;
> +		boundary_size = iommu->poolsize * iommu->nr_pools;
> +	}
> +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> +			     boundary_size, 0);

You have completely dropped the alignment support. This will break
drivers. There are cases (especially with consistent allocations) where
the drivers have alignment constraints on the address; those must be
preserved.

> +	if (n == -1) {
> +		if (likely(pass == 0)) {
> +			/* First failure, rescan from the beginning.  */
> +			pool->hint = pool->start;
> +			if (!large_pool && iommu->lazy_flush != NULL)
> +				need_flush = true;

Same question about the large pool check... I wouldn't bother with the
function pointer NULL check here either, only test it at the call site.

> +			pass++;
> +			goto again;
> +		} else if (!largealloc && pass <= iommu->nr_pools) {
> +	

Here, you might have moved the hint of the pool (case above), and then
hit this code path, no ? In that case, need_flush might be set, which
means you must flush before you unlock.

> 			spin_unlock(&(pool->lock));
> +
> 			pool_nr = (pool_nr + 1) & (iommu->nr_pools - 1);
> +			pool = &(iommu->pools[pool_nr]);
> +			spin_lock(&(pool->lock));
> +			pool->hint = pool->start;
> +			if (!large_pool && iommu->lazy_flush != NULL)
> +				need_flush = true;

I wouldn't bother with the test. This can't be large_alloc afaik and
the NULL check can be done at the call site.

> +			pass++;
> +			goto again;
> +		} else {
> +			/* give up */
> +			n = DMA_ERROR_CODE;
> +			goto bail;
> +		}
> +	}

Here, make this something like:

	} else if (end < pool->hint)
		need_flush = true;

This handles the case where you started your alloc below the hint without
having updated the hint yet, i.e., a *handle < hint or a wrap-around;
it will also handle the start >= limit case.

Basically you need to flush iff you are going to provide an alloc below
the current hint, or change the current hint to something lower than it
was.
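
Pieced together, the tail of the allocator would then look roughly
like this (a sketch; whether the comparison should use end or start is
sorted out further down the thread):

	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
			     boundary_size, 0);
	if (n == -1) {
		/* rescan, hop to the next pool, or give up, as above */
	} else if (n + npages < pool->hint) {
		/* i.e. end < pool->hint: the alloc lands below the
		 * current hint, so a lazy flush is needed before these
		 * entries can be handed out safely.
		 */
		need_flush = true;
	}
	...
 bail:
	if (need_flush && iommu->lazy_flush != NULL)
		iommu->lazy_flush(iommu);
	spin_unlock_irqrestore(&(pool->lock), flags);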

> +	end = n + npages;
> +
> +	pool->hint = end;

Here you completely removed our "blocky" allocation scheme for small
pools. Now that being said, I'm not sure what it bought us to be honest
so I'm not necessarily upset about it. I'll ask around here to see if
somebody can remember why we did that in the first place.

> +	/* Update handle for SG allocations */
> +	if (handle)
> +		*handle = end;
> +bail:
> +	if (need_flush)

Here I would put the test for lazy_flush being non-NULL

> +		iommu->lazy_flush(iommu);
> +	spin_unlock_irqrestore(&(pool->lock), flags);
> +
> +	return n;
> +}
> +EXPORT_SYMBOL(iommu_tbl_range_alloc);
> +
> +static struct iommu_pool *get_pool(struct iommu_map_table *tbl,
> +				   unsigned long entry)
> +{
> +	struct iommu_pool *p;
> +	unsigned long largepool_start = tbl->large_pool.start;
> +	bool large_pool = ((tbl->flags & IOMMU_HAS_LARGE_POOL) != 0);
> +
> +	/* The large pool is the last pool at the top of the table */
> +	if (large_pool && entry >= largepool_start) {
> +		p = &tbl->large_pool;
> +	} else {
> +		unsigned int pool_nr = entry / tbl->poolsize;
> +
> +		BUG_ON(pool_nr >= tbl->nr_pools);
> +		p = &tbl->pools[pool_nr];
> +	}
> +	return p;
> +}
> +
> +/* Caller supplies the index of the entry into the iommu map table
> + * itself when the mapping from dma_addr to the entry is not the
> + * default addr->entry mapping below.
> + */
> +void iommu_tbl_range_free(struct iommu_map_table *iommu, u64 dma_addr,
> +			  unsigned long npages, unsigned long entry)
> +{
> +	struct iommu_pool *pool;
> +	unsigned long flags;
> +	unsigned long shift = iommu->table_shift;
> +
> +	if (entry == DMA_ERROR_CODE) /* use default addr->entry mapping */
> +		entry = (dma_addr - iommu->table_map_base) >> shift;
> +	pool = get_pool(iommu, entry);
> +
> +	spin_lock_irqsave(&(pool->lock), flags);
> +	bitmap_clear(iommu->map, entry, npages);
> +	spin_unlock_irqrestore(&(pool->lock), flags);
> +}
> +EXPORT_SYMBOL(iommu_tbl_range_free);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l
  2015-04-02 20:54     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
@ 2015-04-02 21:43       ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-04-02 21:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On (04/03/15 07:54), Benjamin Herrenschmidt wrote:
> > +	limit = pool->end;
> > +
> > +	/* The case below can happen if we have a small segment appended
> > +	 * to a large, or when the previous alloc was at the very end of
> > +	 * the available space. If so, go back to the beginning and flush.
> > +	 */
> > +	if (start >= limit) {
> > +		start = pool->start;
> > +		if (!large_pool && iommu->lazy_flush != NULL)
> > +			iommu->lazy_flush(iommu);
> 
> Add need_flush = false;

A few clarifications, while I parse the rest of your comments:

Not sure I follow: need_flush is initialized to true at the start of the function?

> I only just noticed too, you completely dropped the code to honor
> the dma mask. Why that ? Some devices rely on this.

So that's an interesting question: the existing iommu_range_alloc() in
arch/sparc/kernel/iommu.c does not use the mask at all. I based most of
the code on this (except for the lock fragmentation part).
I don't know if this is arch-specific.

> 
> > +	if (dev)
> > +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > +				      1 << iommu->table_shift);
> > +	else
> > +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> > +
> > +	shift = iommu->table_map_base >> iommu->table_shift;
> > +	boundary_size = boundary_size >> iommu->table_shift;
> > +	/*
> > +	 * if the skip_span_boundary_check had been set during init, we set
> > +	 * things up so that iommu_is_span_boundary() merely checks if the
> > +	 * (index + npages) < num_tsb_entries
> > +	 */
> > +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> > +		shift = 0;
> > +		boundary_size = iommu->poolsize * iommu->nr_pools;
> > +	}
> > +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> > +			     boundary_size, 0);
> 
> You have completely dropped the alignment support. This will break
> drivers. There are cases (especially with consistent allocations) where

Again, not sure I follow. Are you referring to the IOMMU_NO_SPAN_BOUND case?
That's very specific to LDC (sparc ldoms virtualization infra). The default
is to not have IOMMU_NO_SPAN_BOUND set. 
For the rest of the drivers, the code that sets up boundary_size aligns things
in the same way as the ppc code.


> the driver have alignment constraints on the address, those must be
> preserved.
> 

--Sowmini

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l
  2015-04-02 20:54     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
@ 2015-04-02 21:54       ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-04-02 21:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

The other question that comes to my mind is: the whole lazy_flush
optimization probably works best when there is exactly one pool,
and no large pools. In most other cases, we'd end up doing a lazy_flush
when we wrap within our pool itself, losing the benefit of that 
optimization. 

Given that the lazy_flush is mostly there to avoid regressions for 
the older sun4u architectures (which have other hardware bottlenecks 
anyway), and this code is rapidly getting messy, does it make sense
to constrain the lazy_flush check to only apply for the 1-pool, 
no-large-pool case?



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l
  2015-04-02 21:43       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
@ 2015-04-02 21:57         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-02 21:57 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Thu, 2015-04-02 at 17:43 -0400, Sowmini Varadhan wrote:
> On (04/03/15 07:54), Benjamin Herrenschmidt wrote:
> > > +	limit = pool->end;
> > > +
> > > +	/* The case below can happen if we have a small segment appended
> > > +	 * to a large, or when the previous alloc was at the very end of
> > > +	 * the available space. If so, go back to the beginning and flush.
> > > +	 */
> > > +	if (start >= limit) {
> > > +		start = pool->start;
> > > +		if (!large_pool && iommu->lazy_flush != NULL)
> > > +			iommu->lazy_flush(iommu);
> > 
> > Add need_flush = false;
> 
> A few clarifications, while I parse the rest of your comments:
> 
> Not sure I follow- need_flush is initialized to true at the start of the function?

No, but you can loop back there via "goto again". However, if you follow
my other comment and move the flush back to the end, then you don't need
that at all.

> > I only just noticed too, you completely dropped the code to honor
> > the dma mask. Why that ? Some devices rely on this.
> 
> so that's an interesting question: the existing iommu_range_alloc() in
> arch/sparc/kernel/iommu.c does not use the mask at all. I based most of
> the code  on this (except for the lock fragmentation part). 
> I dont know if this is arch specific.

Probably. Not that many devices have limits on the DMA mask, but they do
exist. It becomes more important if we decide to create a very large
IOMMU window that spans beyond 4G in order to support devices with
32-bit DMA masks. Otherwise it's older devices mostly with <32-bit
masks.

In any case, for a generic piece of code, this should be supported.
Basically, assume that if we have something in the powerpc code, we need
it; if you remove it, we won't be able to use your code generically.

There are a few cases we can debate like our block allocation, but
things like the mask or the alignment constraints aren't in that group.
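
For reference, honoring the mask essentially means clamping the search
limit so that no entry whose DMA address exceeds the device's mask can
be handed out, roughly (a sketch against this patch's fields; dma_mask
would be a new caller-supplied argument, and the dma_mask below
table_map_base corner case is ignored):

	/* entry i maps to DMA address table_map_base + (i << table_shift) */
	if (dma_mask) {
		unsigned long mask_limit;

		mask_limit = (dma_mask - iommu->table_map_base + 1) >>
			     iommu->table_shift;
		if (mask_limit < limit)
			limit = mask_limit;
	}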

> > 
> > > +	if (dev)
> > > +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > > +				      1 << iommu->table_shift);
> > > +	else
> > > +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> > > +
> > > +	shift = iommu->table_map_base >> iommu->table_shift;
> > > +	boundary_size = boundary_size >> iommu->table_shift;
> > > +	/*
> > > +	 * if the skip_span_boundary_check had been set during init, we set
> > > +	 * things up so that iommu_is_span_boundary() merely checks if the
> > > +	 * (index + npages) < num_tsb_entries
> > > +	 */
> > > +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> > > +		shift = 0;
> > > +		boundary_size = iommu->poolsize * iommu->nr_pools;
> > > +	}
> > > +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> > > +			     boundary_size, 0);
> > 
> > You have completely dropped the alignment support. This will break
> > drivers. There are cases (especially with consistent allocations) where
> 
> Again, not sure I follow? are you referring to the IOMMU_NO_SPAN_BOUND case?

No, I mean the last argument to iommu_area_alloc(), which is passed from the
callers when doing consistent allocs. Basically, the DMA API mandates
that consistent allocs are naturally aligned (to their own size); we
implement that on powerpc by passing that alignment argument down.
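
Concretely, the generic allocator would need to grow an align_order
argument and feed it through as iommu_area_alloc()'s last (align_mask)
parameter, along the lines of (a sketch; align_order is the
hypothetical new parameter):

	unsigned long align_mask = (1UL << align_order) - 1;

	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
			     boundary_size, align_mask);

so that a consistent alloc of, say, 8 pages comes back 8-page aligned,
as the DMA API expects.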

> That's very specific to LDC (sparc ldoms virtualization infra). The default
> is to not have IOMMU_NO_SPAN_BOUND set. 
> For the rest of the drivers, the code that sets up boundary_size aligns things
> in the same way as the ppc code.
> 
> 
> > the driver have alignment constraints on the address, those must be
> > preserved.
> > 
> 
> --Sowmini



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l
  2015-04-02 21:57         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
@ 2015-04-02 22:01           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-02 22:01 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Fri, 2015-04-03 at 08:57 +1100, Benjamin Herrenschmidt wrote:

> No, the last argument to iommu_area_alloc() which is passed from the
> callers when doing consistent allocs. Basically, the DMA api mandates
> that consistent allocs are naturally aligned (to their own size), we
> implement that on powerpc by passing that alignment argument down.

Talking of this ... I notice this is not documented in DMA-API.txt...
However, many drivers make that assumption, and so does the DMA pool
allocator in mm/dmapool.c.

Maybe somebody should update DMA-API.txt...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l
  2015-04-02 22:01           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
@ 2015-04-02 22:02             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-02 22:02 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Fri, 2015-04-03 at 09:01 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2015-04-03 at 08:57 +1100, Benjamin Herrenschmidt wrote:
> 
> > No, the last argument to iommu_area_alloc() which is passed from the
> > callers when doing consistent allocs. Basically, the DMA api mandates
> > that consistent allocs are naturally aligned (to their own size), we
> > implement that on powerpc by passing that alignment argument down.
> 
> Talking of this ... I notice this is not documented in DMA-API.txt...
> however many drivers make that assumption, and the DMA pool allocator
> in mm/dmapool.c as well.
> 
> Maybe somebody should update DMA-API.txt...

Ah it *is* documented in DMA-API-HOWTO.txt :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
@ 2015-04-02 22:08         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-02 22:08 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Thu, 2015-04-02 at 17:54 -0400, Sowmini Varadhan wrote:
> the other question that comes to my mind is: the whole lazy_flush
> optimization probably works best when there is exactly one pool,
> and no large pools. In most other cases, we'd end up doing a lazy_flush
> when we wrap within our pool itself, losing the benefit of that 
> optimization. 

Not a big deal. The point of the optimization is to avoid flushing on
every map/unmap. As long as the pool wrapping is an order of magnitude
rarer than actual map/unmap operations, it's a win.

> Given that the lazy_flush is mostly there to avoid regressions for 
> the older sun4u architectures (which have other hardware bottlenecks 
> anyway), and this code is rapidly getting messy, does it make sense
> to constrain the lazy_flush check to only apply for the 1-pool, 
> no-large-pool case?

I was planning to use it to improve things on older G5s as well and I
definitely want pools on these. Leave it in; we just need to get it
right.

I think it's not that messy; if you refactor the way I suggested, it
basically boils down to conditionally flushing in two places: at
the point where the pool lock is dropped, if the hint for that pool
was moved backward.

Do the test once in the else case of the n == -1 for all the "normal"
cases (which include the hint > limit, multi-pass wrap and *handle
below hint; they are all covered) and once in the pool hop'ing case.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l
  2015-04-02 21:57         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
@ 2015-04-02 22:15           ` Sowmini Varadhan
  -1 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-04-02 22:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On (04/03/15 08:57), Benjamin Herrenschmidt wrote:
> 
> > > I only just noticed too, you completely dropped the code to honor
> > > the dma mask. Why that ? Some devices rely on this.

    /* Sowmini's comment about this coming from sparc origins.. */

> Probably, not that many devices have limits on DMA mask but they do
> exist. It becomes more important if we decide to create a very large
> IOMMU window that spans beyond 4G in order to support devices with
> 32-bit DMA masks. Otherwise it's older devices mostly with <32-bit
> masks.
> 
> In any case, for a generic piece of code, this should be supported.
> Basically, assume that if we have something in the powerpc code, we need
> it, if you remove it, we won't be able to use your code generically.

I see.

Is the mask something that can be stored in the iommu_map_table as
part of the init?

I can see that the align_order has to be an additional arg to
iommu_tbl_range_alloc; I'm not sure if the mask falls in that category
as well.

--Sowmini


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l
  2015-04-02 22:15           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
@ 2015-04-02 22:19             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-02 22:19 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Thu, 2015-04-02 at 18:15 -0400, Sowmini Varadhan wrote:
> On (04/03/15 08:57), Benjamin Herrenschmidt wrote:
> > 
> > > > I only just noticed too, you completely dropped the code to honor
> > > > the dma mask. Why that ? Some devices rely on this.
> 
>     /* Sowmini's comment about this coming from sparc origins.. */
> 
> > Probably, not that many devices have limits on DMA mask but they do
> > exist. It becomes more important if we decide to create a very large
> > IOMMU window that spans beyond 4G in order to support devices with
> > 32-bit DMA masks. Otherwise it's older devices mostly with <32-bit
> > masks.
> > 
> > In any case, for a generic piece of code, this should be supported.
> > Basically, assume that if we have something in the powerpc code, we need
> > it, if you remove it, we won't be able to use your code generically.
> 
> I see.
> 
> is the mask something that can be stored in the iommu_map_table as
> part of the init? 
> 
No, the mask is per device and has to be retrieved from the device.
Additionally the mask used for dma_map_* can be different from the
consistent mask used for alloc_coherent (we have a bug there on powerpc
which I'm trying to fix btw).

So it should be passed as an argument by the caller.

> I can see that the align_order has to be an additional arg to 
> iommu_tbl_range_alloc, not sure if mask falls in that category
> as well.

It does.
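
So the entry point would presumably end up looking something like this
(an illustrative signature only, not a final one):

	unsigned long iommu_tbl_range_alloc(struct device *dev,
					    struct iommu_map_table *iommu,
					    unsigned long npages,
					    unsigned long *handle,
					    unsigned long mask,
					    unsigned int align_order);

with mask and align_order supplied per call by the dma_map_* and
alloc_coherent wrappers, rather than stored in the table at init time.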

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l
  2015-04-02 21:54       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
@ 2015-04-03  0:42         ` David Miller
  -1 siblings, 0 replies; 60+ messages in thread
From: David Miller @ 2015-04-03  0:42 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Thu, 2 Apr 2015 17:54:53 -0400

> the other question that comes to my mind is: the whole lazy_flush
> optimization probably works best when there is exactly one pool,
> and no large pools. In most other cases, we'd end up doing a lazy_flush
> when we wrap within our pool itself, losing the benefit of that 
> optimization. 
> 
> Given that the lazy_flush is mostly there to avoid regressions for 
> the older sun4u architectures (which have other hardware bottlenecks 
> anyway), and this code is rapidly getting messy, does it make sense
> to constrain the lazy_flush check to only apply for the 1-pool, 
> no-large-pool case?

I think it's better to have the multiple pools with more frequent global
flushing.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
@ 2015-04-03  0:42         ` David Miller
  0 siblings, 0 replies; 60+ messages in thread
From: David Miller @ 2015-04-03  0:42 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Thu, 2 Apr 2015 17:54:53 -0400

> the other question that comes to my mind is: the whole lazy_flush
> optimization probably works best when there is exactly one pool,
> and no large pools. In most other cases, we'd end up doing a lazy_flush
> when we wrap within our pool itself, losing the benefit of that 
> optimization. 
> 
> Given that the lazy_flush is mostly there to avoid regressions for 
> the older sun4u architectures (which have other hardware bottlenecks 
> anyway), and this code is rapidly getting messy, does it make sense
> to constrain the lazy_flush check to only apply for the 1-pool, 
> no-large-pool case?

I think it's better to have the multiple pools, with more frequent
global flushing.

^ permalink raw reply	[flat|nested] 60+ messages in thread
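
To make the trade-off concrete: the series carves the single table (and
its single lock) into several pools, each with its own lock, start/end
bounds and allocation hint, plus a separate pool for large allocations.
A rough sketch of that layout, using the field names that appear in the
quoted snippets (the real include/linux/iommu-common.h structures may
differ in detail):

	#define NR_POOLS_SKETCH 4  /* illustrative; the real count is a tunable */

	struct iommu_pool_sketch {
		unsigned long start;  /* first entry owned by this pool */
		unsigned long end;    /* one past the last entry */
		unsigned long hint;   /* next entry to try; wraps back to start */
		int lock;             /* stands in for the per-pool spinlock_t */
	};

	struct iommu_map_table_sketch {
		struct iommu_pool_sketch pools[NR_POOLS_SKETCH];
		struct iommu_pool_sketch large_pool;  /* for large allocations */
		unsigned long table_size;
	};

With N pools, each pool spans roughly table_size / N entries, so its hint
wraps (and may trigger a lazy flush) about N times as often as a single
monolithic pool would; that extra global flushing is the cost accepted
above in exchange for finer-grained locking.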

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
@ 2015-04-03 18:28       ` Sowmini Varadhan
  0 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-04-03 18:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem



Just want to confirm:

> > + again:
> > +	if (pass == 0 && handle && *handle &&
> > +	    (*handle >= pool->start) && (*handle < pool->end))
> > +		start = *handle;
> > +	else
> > +		start = pool->hint;
> 
> Now this means "handle" might be < pool->hint, in that case you also
> need a lazy flush. Or rather only if the resulting alloc is. My
         :

> > +		} else {
> > +			/* give up */
> > +			n = DMA_ERROR_CODE;
> > +			goto bail;
> > +		}
> > +	}
> 
> Here, make this something like:
> 
> 	} else if (end < pool->hint)
> 		need_flush = true;

you mean 

 	} else if (start < pool->hint)

right?  (so I'm not missing some corner-case that you are thinking
about here)

--Sowmini
 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
@ 2015-04-03 21:06         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-03 21:06 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Fri, 2015-04-03 at 14:28 -0400, Sowmini Varadhan wrote:
> 
> Just want to confirm:
> 
> > > + again:
> > > +	if (pass == 0 && handle && *handle &&
> > > +	    (*handle >= pool->start) && (*handle < pool->end))
> > > +		start = *handle;
> > > +	else
> > > +		start = pool->hint;
> > 
> > Now this means "handle" might be < pool->hint, in that case you also
> > need a lazy flush. Or rather only if the resulting alloc is. My
>          :
> 
> > > +		} else {
> > > +			/* give up */
> > > +			n = DMA_ERROR_CODE;
> > > +			goto bail;
> > > +		}
> > > +	}
> > 
> > Here, make this something like:
> > 
> > 	} else if (end < pool->hint)
> > 		need_flush = true;
> 
> you mean 
> 
>  	} else if (start < pool->hint)
> 
> right?  (so I'm not missing some corner-case that you are thinking
> about here)

No, I meant "n < pool->hint", ie, the start of the newly allocated
block.

"end" hasn't been adjusted yet at that point but we don't want to
compare "end" anyway (which will be n + npages), we really want n, ie if
the *beginning* of the newly allocated chunk is before the end of the
previous one (after all they may even overlap if the previous one has
been freed).

Ben.

> --Sowmini
>  

^ permalink raw reply	[flat|nested] 60+ messages in thread
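
Restated as code, the check keys off n, the start of the block just
returned by the allocator, not off end (which is n + npages). A small
compilable sketch of only that post-allocation step, with the flush
itself stubbed out since the surrounding allocator is not shown in this
thread:

	struct pool_sketch {
		unsigned long start, end, hint;
	};

	static void lazy_flush_stub(void)
	{
		/* the real code would invoke the table's lazy_flush hook here */
	}

	static void post_alloc_hint_update(struct pool_sketch *pool,
					   unsigned long n, unsigned long npages)
	{
		unsigned long end = n + npages;

		/*
		 * Compare n, not end: if the beginning of the new chunk sits
		 * before the previous hint, the search wrapped around (the
		 * two ranges may even overlap if the earlier one was freed).
		 */
		if (n < pool->hint)
			lazy_flush_stub();

		pool->hint = end;
	}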

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
@ 2015-04-03 21:51           ` Sowmini Varadhan
  0 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-04-03 21:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On (04/04/15 08:06), Benjamin Herrenschmidt wrote:
> 
> No, I meant "n < pool->hint", ie, the start of the newly allocated
> block.

Ah, got it. I'll do my drill with the patchset and get back, probably
by Monday.

--Sowmini

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
@ 2015-04-04 11:27           ` Sowmini Varadhan
  0 siblings, 0 replies; 60+ messages in thread
From: Sowmini Varadhan @ 2015-04-04 11:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

One last question before I spin out v9: the dma_mask code is a bit
confusing to me, so I want to make sure... the code is

> if (limit + tbl->it_offset > mask) {
>               limit = mask - tbl->it_offset + 1;
>               /* If we're constrained on address range, first try
>                * at the masked hint to avoid O(n) search complexity,
>                * but on second pass, start at 0 in pool 0.
>                */
>               if ((start & mask) >= limit || pass > 0) {
>                       spin_unlock(&(pool->lock));
>                       pool = &(tbl->pools[0]);
>                       spin_lock(&(pool->lock));
>                       start = pool->start;

So here I would need to mark need_flush to true, right?


--Sowmini

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
@ 2015-04-04 13:33             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 60+ messages in thread
From: Benjamin Herrenschmidt @ 2015-04-04 13:33 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: aik, anton, paulus, sparclinux, linuxppc-dev, davem

On Sat, 2015-04-04 at 07:27 -0400, Sowmini Varadhan wrote:
> One last question before I spin out v9.. the dma_mask code
> is a bit confusing to me, so I want to make sure... the code is 
> 
> > if (limit + tbl->it_offset > mask) {
> >               limit = mask - tbl->it_offset + 1;
> >               /* If we're constrained on address range, first try
> >                * at the masked hint to avoid O(n) search complexity,
> >                * but on second pass, start at 0 in pool 0.
> >                */
> >               if ((start & mask) >= limit || pass > 0) {
> >                       spin_unlock(&(pool->lock));
> >                       pool = &(tbl->pools[0]);
> >                       spin_lock(&(pool->lock));
> >                       start = pool->start;
> 
> So here I would need to mark need_flush to true, right?

Well, no but ...

So you aren't changing any pool hint, so you don't have to mark any pool
for flushing. The "n < hint" check will do the flush later on if needed
(the one I told you to add).

You only need to force a flush if you modify the hint.

*However*, you do an unlock, which means that if you have modified any
*other* pool's hint (i.e., need_flush is true), you need to actually
perform a flush.

I think that business with flushing on unlock is becoming too ugly; I
would suggest you simplify things by moving need_flush into the pool
structure (maybe as a flag).

I.e., when you modify a pool's hint, you mark it for flushing.

When you allocate from a pool, you flush it at the end, just before you
return success, if either it was marked for flushing or n < hint (and
clear the mark of course).

That should cleanly factor all cases into a simpler & more readable code
path. The flush is done only when needed (when an actual allocation
happens on a pool); you just "mark them dirty" when you change their
hints, and you don't have to worry about the dropped-lock case & pool
hop'ing anymore.

Basically the logic looks like that:

	.../...

	n = iommu_area_alloc(...)
	if (n == -1) {
		if (pass == 0) {
			change pool hint
			pool->need_flush = true;
			pass++;
			goto again;
		} else if (!largealloc && .... ) {
			unlock/pool hop/lock
			// no need to flush anything here
			pool->hint = pool->start;
			pool->need_flush = true;
			etc...
		} else {
			n = DMA_ERROR_CODE;
			goto fail;
		}
	}
	end = n + pages;

	if (n < pool->hint || pool->need_flush) {
		pool->need_flush = false;
		iommu->lazy_flush(...);
	}
	pool->hint = end;

	.../...

Now this can be further simplified, imho. For example, that logic:

+       if (pass == 0 && handle && *handle &&
+           (*handle >= pool->start) && (*handle < pool->end))
+               start = *handle;
+       else
+               start = pool->hint;

can move above the again: label. That way, you remove the pass test,
and you no longer have to adjust pool->hint in the "if (likely(pass ==
0))" failure case. Just adjust "start" and try again, which means you
also no longer need to set need_flush.

Additionally, when pool hop'ing, you now only need to adjust "start" as
well. By not resetting the hint, you also remove the need to mark it
"need flush". I would add to that: why not set "start" to the new pool's
hint instead of the new pool's start? That would save some flushes, no?

Anton, do you remember why you reset the hint when changing pools? I
don't see that gaining us anything, and it does make lazy flushing more
complex.

At that point, you have removed all setters of need_flush, so that
logic can be entirely removed... or am I just too tired (it's past
midnight) and missing something entirely here?

We only need the "if (n < pool->hint)" check to do the lazy flush at
the end; that's it.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 60+ messages in thread
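
Fleshing the pseudocode above out into a compilable sketch may help. It
follows the final simplification (hint selection hoisted above the retry
label, the flush driven purely by n < pool->hint). Everything here is
illustrative: the helpers are stubs, pool hopping, large pools and the
DMA mask handling are elided, and none of it is the code that eventually
landed in lib/iommu-common.c.

	#define DMA_ERROR_SKETCH (~0UL)

	struct pool_sketch {
		unsigned long start, end, hint;
	};

	/* Stand-in for the bitmap search (lib/iommu-helper.c in real life). */
	static unsigned long iommu_area_alloc_stub(struct pool_sketch *pool,
						   unsigned long start,
						   unsigned long npages)
	{
		(void)pool; (void)start; (void)npages;
		return (unsigned long)-1;   /* placeholder: always "full" */
	}

	static void lazy_flush_stub(void)
	{
		/* placeholder for the table's lazy_flush hook */
	}

	static unsigned long range_alloc_sketch(struct pool_sketch *pool,
						unsigned long npages,
						unsigned long *handle)
	{
		unsigned long n, end, start;
		int pass = 0;

		/*
		 * Handle/hint selection hoisted above the retry label, so a
		 * retry only adjusts "start" and never touches pool->hint.
		 */
		if (handle && *handle &&
		    *handle >= pool->start && *handle < pool->end)
			start = *handle;
		else
			start = pool->hint;
	again:
		n = iommu_area_alloc_stub(pool, start, npages);
		if (n == (unsigned long)-1) {
			if (pass == 0) {
				start = pool->start;   /* wrap and retry once */
				pass++;
				goto again;
			}
			/* pool hopping / large-pool fallback elided */
			return DMA_ERROR_SKETCH;
		}
		end = n + npages;

		/*
		 * The only lazy-flush trigger left: the new block starts below
		 * the previous hint, i.e. the search wrapped (or an older
		 * handle was used).
		 */
		if (n < pool->hint)
			lazy_flush_stub();
		pool->hint = end;
		return n;
	}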

end of thread, other threads:[~2015-04-04 13:33 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-31 14:40 [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator Sowmini Varadhan
2015-03-31 14:40 ` Sowmini Varadhan
2015-03-31 14:40 ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
2015-03-31 14:40   ` Sowmini Varadhan
2015-03-31 15:15   ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l David Laight
2015-03-31 15:15     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock David Laight
2015-03-31 15:25     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Sowmini Varadhan
2015-03-31 15:25       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
2015-04-02 20:54   ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Benjamin Herrenschmidt
2015-04-02 20:54     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
2015-04-02 21:43     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Sowmini Varadhan
2015-04-02 21:43       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
2015-04-02 21:57       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Benjamin Herrenschmidt
2015-04-02 21:57         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
2015-04-02 22:01         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Benjamin Herrenschmidt
2015-04-02 22:01           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
2015-04-02 22:02           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Benjamin Herrenschmidt
2015-04-02 22:02             ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
2015-04-02 22:15         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Sowmini Varadhan
2015-04-02 22:15           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
2015-04-02 22:19           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Benjamin Herrenschmidt
2015-04-02 22:19             ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
2015-04-02 21:54     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Sowmini Varadhan
2015-04-02 21:54       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
2015-04-02 22:08       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Benjamin Herrenschmidt
2015-04-02 22:08         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
2015-04-03  0:42       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l David Miller
2015-04-03  0:42         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock David Miller
2015-04-03 18:28     ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Sowmini Varadhan
2015-04-03 18:28       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
2015-04-03 21:06       ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Benjamin Herrenschmidt
2015-04-03 21:06         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
2015-04-03 21:51         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Sowmini Varadhan
2015-04-03 21:51           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
2015-04-04 11:27         ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Sowmini Varadhan
2015-04-04 11:27           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Sowmini Varadhan
2015-04-04 13:33           ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and l Benjamin Herrenschmidt
2015-04-04 13:33             ` [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock Benjamin Herrenschmidt
2015-03-31 14:40 ` [PATCH v8 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions Sowmini Varadhan
2015-03-31 14:40   ` Sowmini Varadhan
2015-03-31 14:40 ` [PATCH v8 RFC 3/3] sparc: Make LDC use common iommu poll management functions Sowmini Varadhan
2015-03-31 14:40   ` Sowmini Varadhan
2015-03-31 18:06 ` [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator Sowmini Varadhan
2015-03-31 18:06   ` Sowmini Varadhan
2015-03-31 18:08   ` David Miller
2015-03-31 18:08     ` David Miller
2015-04-01  1:01   ` Benjamin Herrenschmidt
2015-04-01  1:01     ` Benjamin Herrenschmidt
2015-04-01  1:08     ` Sowmini Varadhan
2015-04-01  1:08       ` Sowmini Varadhan
2015-04-01  3:12       ` David Miller
2015-04-01  3:12         ` David Miller
2015-04-02 12:51         ` Sowmini Varadhan
2015-04-02 12:51           ` Sowmini Varadhan
2015-04-02 16:21           ` David Miller
2015-04-02 16:21             ` David Miller
2015-04-02 20:22             ` Benjamin Herrenschmidt
2015-04-02 20:22               ` Benjamin Herrenschmidt
2015-04-02 20:52               ` Benjamin Herrenschmidt
2015-04-02 20:52                 ` Benjamin Herrenschmidt
