* [PATCH v2 0/7] Intel IOMMU scalability improvements
@ 2016-04-13 18:50 Adam Morrison
       [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Adam Morrison @ 2016-04-13 18:50 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	shli-b10kYP2dOMg, gvdl-hpIqsD4AKlfQT0dZR+AlfA,
	Kernel-team-b10kYP2dOMg

This patchset improves the scalability of the Intel IOMMU code by
resolving two spinlock bottlenecks, yielding up to ~5x performance
improvement and approaching iommu=off performance.

For example, here's the throughput obtained by 16 memcached instances
running on a 16-core Sandy Bridge system, accessed using memslap on
another machine that has iommu=off, using the default memslap config
(64-byte keys, 1024-byte values, and 10%/90% SET/GET ops):

    stock iommu=off:
       990,803 memcached transactions/sec (=100%, median of 10 runs).
    stock iommu=on:
       221,416 memcached transactions/sec (=22%).
       [61.70%    0.63%  memcached       [kernel.kallsyms]      [k] _raw_spin_lock_irqsave]
    patched iommu=on:
       963,457 memcached transactions/sec (=97%).
       [1.04%     0.94%  memcached       [kernel.kallsyms]      [k] _raw_spin_lock_irqsave]

The two spinlock bottlenecks and how this patchset resolves them:

 - Deferred IOTLB invalidations are batched in a global data structure
   and serialized under a spinlock (add_unmap() & flush_unmaps()); this
   patchset batches IOTLB invalidations in a per-CPU data structure.

 - IOVA management (alloc_iova() & __free_iova()) is serialized under
   the rbtree spinlock; this patchset adds per-CPU caches of allocated
   IOVAs so that the rbtree doesn't get accessed frequently. (Adding a
   cache above the existing IOVA allocator is less intrusive than dynamic
   identity mapping and helps keep IOMMU page table usage low; see
   Patch 7.)

The paper "Utilizing the IOMMU Scalably" (presented at the 2015 USENIX
Annual Technical Conference) contains many more details and experiments:

  https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf
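
For orientation, here is a condensed sketch of the per-CPU state the
series introduces (taken from the structure definitions in patches 2
and 7; see those patches for the full details):

/* Patch 2: one deferred-invalidation queue per CPU, each with its own
 * lock and timer, replacing the single global queue that was guarded
 * by async_umap_flush_lock.
 */
struct deferred_flush_data {
	spinlock_t lock;
	int timer_on;
	struct timer_list timer;
	long size;
	struct deferred_flush_table *tables;	/* one table per IOMMU */
};
DEFINE_PER_CPU(struct deferred_flush_data, deferred_flush);

/* Patch 7: per-CPU magazines of recently freed IOVA ranges, backed by
 * a small global depot, so most allocations and frees avoid the
 * rbtree lock entirely.
 */
struct iova_cpu_rcache {
	struct iova_magazine *loaded;
	struct iova_magazine *prev;
};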

v2:

 * Extend the IOVA API instead of modifying it, so as not to break the
   API's other, non-Intel callers.
 * Flush all CPUs' pending invalidations when one CPU hits its per-CPU
   limit, so that invalidations are not deferred longer than before.
 * Smaller cap on per-CPU cache size, to consume less of the IOVA space.
 * Free resources and perform IOTLB invalidations when a CPU is hot-unplugged.


Omer Peleg (7):
  iommu: refactoring of deferred flush entries
  iommu: per-cpu deferred invalidation queues
  iommu: correct flush_unmaps pfn usage
  iommu: only unmap mapped entries
  iommu: avoid dev iotlb logic in intel-iommu for domains with no dev
    iotlbs
  iommu: change intel-iommu to use IOVA frame numbers
  iommu: introduce per-cpu caching to iova allocation

 drivers/iommu/intel-iommu.c | 318 ++++++++++++++++++++++++++-----------
 drivers/iommu/iova.c        | 372 +++++++++++++++++++++++++++++++++++++++++---
 include/linux/iova.h        |  23 ++-
 3 files changed, 593 insertions(+), 120 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/7] iommu: refactoring of deferred flush entries
       [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
@ 2016-04-13 18:51   ` Adam Morrison
  2016-04-13 18:51   ` [PATCH v2 2/7] iommu: per-cpu deferred invalidation queues Adam Morrison
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Adam Morrison @ 2016-04-13 18:51 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	shli-b10kYP2dOMg, gvdl-hpIqsD4AKlfQT0dZR+AlfA,
	Kernel-team-b10kYP2dOMg

From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>

Currently, the information for a deferred flush is striped across
several parallel arrays in the flush tables. Instead, move all the
information about a specific flush into a single entry in the table.

This patch does not introduce any functional change.

Signed-off-by: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
[mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org: rebased and reworded the commit message]
Signed-off-by: Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
---
 drivers/iommu/intel-iommu.c | 48 +++++++++++++++++++++++++++------------------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index b2bfb95..f66a45a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -458,15 +458,19 @@ static void flush_unmaps_timeout(unsigned long data);
 
 static DEFINE_TIMER(unmap_timer,  flush_unmaps_timeout, 0, 0);
 
+struct deferred_flush_entry {
+	struct iova *iova;
+	struct dmar_domain *domain;
+	struct page *freelist;
+};
+
 #define HIGH_WATER_MARK 250
-struct deferred_flush_tables {
+struct deferred_flush_table {
 	int next;
-	struct iova *iova[HIGH_WATER_MARK];
-	struct dmar_domain *domain[HIGH_WATER_MARK];
-	struct page *freelist[HIGH_WATER_MARK];
+	struct deferred_flush_entry entries[HIGH_WATER_MARK];
 };
 
-static struct deferred_flush_tables *deferred_flush;
+static struct deferred_flush_table *deferred_flush;
 
 /* bitmap for indexing intel_iommus */
 static int g_num_of_iommus;
@@ -3111,7 +3115,7 @@ static int __init init_dmars(void)
 	}
 
 	deferred_flush = kzalloc(g_num_of_iommus *
-		sizeof(struct deferred_flush_tables), GFP_KERNEL);
+		sizeof(struct deferred_flush_table), GFP_KERNEL);
 	if (!deferred_flush) {
 		ret = -ENOMEM;
 		goto free_g_iommus;
@@ -3518,22 +3522,25 @@ static void flush_unmaps(void)
 					 DMA_TLB_GLOBAL_FLUSH);
 		for (j = 0; j < deferred_flush[i].next; j++) {
 			unsigned long mask;
-			struct iova *iova = deferred_flush[i].iova[j];
-			struct dmar_domain *domain = deferred_flush[i].domain[j];
+			struct deferred_flush_entry *entry =
+						&deferred_flush->entries[j];
+			struct iova *iova = entry->iova;
+			struct dmar_domain *domain = entry->domain;
+			struct page *freelist = entry->freelist;
 
 			/* On real hardware multiple invalidations are expensive */
 			if (cap_caching_mode(iommu->cap))
 				iommu_flush_iotlb_psi(iommu, domain,
 					iova->pfn_lo, iova_size(iova),
-					!deferred_flush[i].freelist[j], 0);
+					!freelist, 0);
 			else {
 				mask = ilog2(mm_to_dma_pfn(iova_size(iova)));
-				iommu_flush_dev_iotlb(deferred_flush[i].domain[j],
+				iommu_flush_dev_iotlb(domain,
 						(uint64_t)iova->pfn_lo << PAGE_SHIFT, mask);
 			}
-			__free_iova(&deferred_flush[i].domain[j]->iovad, iova);
-			if (deferred_flush[i].freelist[j])
-				dma_free_pagelist(deferred_flush[i].freelist[j]);
+			__free_iova(&domain->iovad, iova);
+			if (freelist)
+				dma_free_pagelist(freelist);
 		}
 		deferred_flush[i].next = 0;
 	}
@@ -3553,8 +3560,9 @@ static void flush_unmaps_timeout(unsigned long data)
 static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *freelist)
 {
 	unsigned long flags;
-	int next, iommu_id;
+	int entry_id, iommu_id;
 	struct intel_iommu *iommu;
+	struct deferred_flush_entry *entry;
 
 	spin_lock_irqsave(&async_umap_flush_lock, flags);
 	if (list_size == HIGH_WATER_MARK)
@@ -3563,11 +3571,13 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *f
 	iommu = domain_get_iommu(dom);
 	iommu_id = iommu->seq_id;
 
-	next = deferred_flush[iommu_id].next;
-	deferred_flush[iommu_id].domain[next] = dom;
-	deferred_flush[iommu_id].iova[next] = iova;
-	deferred_flush[iommu_id].freelist[next] = freelist;
-	deferred_flush[iommu_id].next++;
+	entry_id = deferred_flush[iommu_id].next;
+	++(deferred_flush[iommu_id].next);
+
+	entry = &deferred_flush[iommu_id].entries[entry_id];
+	entry->domain = dom;
+	entry->iova = iova;
+	entry->freelist = freelist;
 
 	if (!timer_on) {
 		mod_timer(&unmap_timer, jiffies + msecs_to_jiffies(10));
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/7] iommu: per-cpu deferred invalidation queues
       [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
  2016-04-13 18:51   ` [PATCH v2 1/7] iommu: refactoring of deferred flush entries Adam Morrison
@ 2016-04-13 18:51   ` Adam Morrison
  2016-04-13 18:51   ` [PATCH v2 3/7] iommu: correct flush_unmaps pfn usage Adam Morrison
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Adam Morrison @ 2016-04-13 18:51 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	shli-b10kYP2dOMg, gvdl-hpIqsD4AKlfQT0dZR+AlfA,
	Kernel-team-b10kYP2dOMg

From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>

The IOMMU's IOTLB invalidation is a costly process.  When iommu mode
is not set to "strict", it is done asynchronously. The current code
amortizes the cost of invalidating IOTLB entries by batching all the
invalidations in the system and performing a single global invalidation
instead. The code queues pending invalidations in a global queue that
is accessed under the global "async_umap_flush_lock" spinlock, which
can result in significant spinlock contention.

This patch splits this deferred queue into multiple per-cpu deferred
queues, and thus gets rid of the "async_umap_flush_lock" and its
contention.  To preserve the existing deferred-invalidation behavior,
it still flushes the pending invalidations of all CPUs whenever a CPU
reaches its high-water mark or its flush timer fires.

Signed-off-by: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
[mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org: rebased, cleaned up and reworded the commit message]
Signed-off-by: Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
---
 drivers/iommu/intel-iommu.c | 133 ++++++++++++++++++++++++++++++--------------
 1 file changed, 92 insertions(+), 41 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f66a45a..be94444 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -33,6 +33,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/mempool.h>
 #include <linux/memory.h>
+#include <linux/cpu.h>
 #include <linux/timer.h>
 #include <linux/io.h>
 #include <linux/iova.h>
@@ -456,8 +457,6 @@ static LIST_HEAD(dmar_rmrr_units);
 
 static void flush_unmaps_timeout(unsigned long data);
 
-static DEFINE_TIMER(unmap_timer,  flush_unmaps_timeout, 0, 0);
-
 struct deferred_flush_entry {
 	struct iova *iova;
 	struct dmar_domain *domain;
@@ -470,17 +469,19 @@ struct deferred_flush_table {
 	struct deferred_flush_entry entries[HIGH_WATER_MARK];
 };
 
-static struct deferred_flush_table *deferred_flush;
+struct deferred_flush_data {
+	spinlock_t lock;
+	int timer_on;
+	struct timer_list timer;
+	long size;
+	struct deferred_flush_table *tables;
+};
+
+DEFINE_PER_CPU(struct deferred_flush_data, deferred_flush);
 
 /* bitmap for indexing intel_iommus */
 static int g_num_of_iommus;
 
-static DEFINE_SPINLOCK(async_umap_flush_lock);
-static LIST_HEAD(unmaps_to_do);
-
-static int timer_on;
-static long list_size;
-
 static void domain_exit(struct dmar_domain *domain);
 static void domain_remove_dev_info(struct dmar_domain *domain);
 static void dmar_remove_one_dev_info(struct dmar_domain *domain,
@@ -1922,8 +1923,12 @@ static void domain_exit(struct dmar_domain *domain)
 		return;
 
 	/* Flush any lazy unmaps that may reference this domain */
-	if (!intel_iommu_strict)
-		flush_unmaps_timeout(0);
+	if (!intel_iommu_strict) {
+		int cpu;
+
+		for_each_possible_cpu(cpu)
+			flush_unmaps_timeout(cpu);
+	}
 
 	/* Remove associated devices and clear attached or cached domains */
 	rcu_read_lock();
@@ -3081,7 +3086,7 @@ static int __init init_dmars(void)
 	bool copied_tables = false;
 	struct device *dev;
 	struct intel_iommu *iommu;
-	int i, ret;
+	int i, ret, cpu;
 
 	/*
 	 * for each drhd
@@ -3114,11 +3119,20 @@ static int __init init_dmars(void)
 		goto error;
 	}
 
-	deferred_flush = kzalloc(g_num_of_iommus *
-		sizeof(struct deferred_flush_table), GFP_KERNEL);
-	if (!deferred_flush) {
-		ret = -ENOMEM;
-		goto free_g_iommus;
+	for_each_possible_cpu(cpu) {
+		struct deferred_flush_data *dfd = per_cpu_ptr(&deferred_flush,
+							      cpu);
+
+		dfd->tables = kzalloc(g_num_of_iommus *
+				      sizeof(struct deferred_flush_table),
+				      GFP_KERNEL);
+		if (!dfd->tables) {
+			ret = -ENOMEM;
+			goto free_g_iommus;
+		}
+
+		spin_lock_init(&dfd->lock);
+		setup_timer(&dfd->timer, flush_unmaps_timeout, cpu);
 	}
 
 	for_each_active_iommu(iommu, drhd) {
@@ -3295,8 +3309,9 @@ free_iommu:
 		disable_dmar_iommu(iommu);
 		free_dmar_iommu(iommu);
 	}
-	kfree(deferred_flush);
 free_g_iommus:
+	for_each_possible_cpu(cpu)
+		kfree(per_cpu_ptr(&deferred_flush, cpu)->tables);
 	kfree(g_iommus);
 error:
 	return ret;
@@ -3501,29 +3516,31 @@ static dma_addr_t intel_map_page(struct device *dev, struct page *page,
 				  dir, *dev->dma_mask);
 }
 
-static void flush_unmaps(void)
+static void flush_unmaps(struct deferred_flush_data *flush_data)
 {
 	int i, j;
 
-	timer_on = 0;
+	flush_data->timer_on = 0;
 
 	/* just flush them all */
 	for (i = 0; i < g_num_of_iommus; i++) {
 		struct intel_iommu *iommu = g_iommus[i];
+		struct deferred_flush_table *flush_table =
+				&flush_data->tables[i];
 		if (!iommu)
 			continue;
 
-		if (!deferred_flush[i].next)
+		if (!flush_table->next)
 			continue;
 
 		/* In caching mode, global flushes turn emulation expensive */
 		if (!cap_caching_mode(iommu->cap))
 			iommu->flush.flush_iotlb(iommu, 0, 0, 0,
 					 DMA_TLB_GLOBAL_FLUSH);
-		for (j = 0; j < deferred_flush[i].next; j++) {
+		for (j = 0; j < flush_table->next; j++) {
 			unsigned long mask;
 			struct deferred_flush_entry *entry =
-						&deferred_flush->entries[j];
+						&flush_table->entries[j];
 			struct iova *iova = entry->iova;
 			struct dmar_domain *domain = entry->domain;
 			struct page *freelist = entry->freelist;
@@ -3542,19 +3559,20 @@ static void flush_unmaps(void)
 			if (freelist)
 				dma_free_pagelist(freelist);
 		}
-		deferred_flush[i].next = 0;
+		flush_table->next = 0;
 	}
 
-	list_size = 0;
+	flush_data->size = 0;
 }
 
-static void flush_unmaps_timeout(unsigned long data)
+static void flush_unmaps_timeout(unsigned long cpuid)
 {
+	struct deferred_flush_data *flush_data = per_cpu_ptr(&deferred_flush, cpuid);
 	unsigned long flags;
 
-	spin_lock_irqsave(&async_umap_flush_lock, flags);
-	flush_unmaps();
-	spin_unlock_irqrestore(&async_umap_flush_lock, flags);
+	spin_lock_irqsave(&flush_data->lock, flags);
+	flush_unmaps(flush_data);
+	spin_unlock_irqrestore(&flush_data->lock, flags);
 }
 
 static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *freelist)
@@ -3563,28 +3581,44 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *f
 	int entry_id, iommu_id;
 	struct intel_iommu *iommu;
 	struct deferred_flush_entry *entry;
+	struct deferred_flush_data *flush_data;
+	unsigned int cpuid;
 
-	spin_lock_irqsave(&async_umap_flush_lock, flags);
-	if (list_size == HIGH_WATER_MARK)
-		flush_unmaps();
+	cpuid = get_cpu();
+	flush_data = per_cpu_ptr(&deferred_flush, cpuid);
+
+	/* Flush all CPUs' entries to avoid deferring too much.  If
+	 * this becomes a bottleneck, can just flush us, and rely on
+	 * flush timer for the rest.
+	 */
+	if (flush_data->size == HIGH_WATER_MARK) {
+		int cpu;
+
+		for_each_online_cpu(cpu)
+			flush_unmaps_timeout(cpu);
+	}
+
+	spin_lock_irqsave(&flush_data->lock, flags);
 
 	iommu = domain_get_iommu(dom);
 	iommu_id = iommu->seq_id;
 
-	entry_id = deferred_flush[iommu_id].next;
-	++(deferred_flush[iommu_id].next);
+	entry_id = flush_data->tables[iommu_id].next;
+	++(flush_data->tables[iommu_id].next);
 
-	entry = &deferred_flush[iommu_id].entries[entry_id];
+	entry = &flush_data->tables[iommu_id].entries[entry_id];
 	entry->domain = dom;
 	entry->iova = iova;
 	entry->freelist = freelist;
 
-	if (!timer_on) {
-		mod_timer(&unmap_timer, jiffies + msecs_to_jiffies(10));
-		timer_on = 1;
+	if (!flush_data->timer_on) {
+		mod_timer(&flush_data->timer, jiffies + msecs_to_jiffies(10));
+		flush_data->timer_on = 1;
 	}
-	list_size++;
-	spin_unlock_irqrestore(&async_umap_flush_lock, flags);
+	flush_data->size++;
+	spin_unlock_irqrestore(&flush_data->lock, flags);
+
+	put_cpu();
 }
 
 static void intel_unmap(struct device *dev, dma_addr_t dev_addr)
@@ -4508,6 +4542,23 @@ static struct notifier_block intel_iommu_memory_nb = {
 	.priority = 0
 };
 
+static int intel_iommu_cpu_notifier(struct notifier_block *nfb,
+				    unsigned long action, void *v)
+{
+	unsigned int cpu = (unsigned long)v;
+
+	switch (action) {
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		flush_unmaps_timeout(cpu);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block intel_iommu_cpu_nb = {
+	.notifier_call = intel_iommu_cpu_notifier,
+};
 
 static ssize_t intel_iommu_show_version(struct device *dev,
 					struct device_attribute *attr,
@@ -4641,7 +4692,6 @@ int __init intel_iommu_init(void)
 	up_write(&dmar_global_lock);
 	pr_info("Intel(R) Virtualization Technology for Directed I/O\n");
 
-	init_timer(&unmap_timer);
 #ifdef CONFIG_SWIOTLB
 	swiotlb = 0;
 #endif
@@ -4658,6 +4708,7 @@ int __init intel_iommu_init(void)
 	bus_register_notifier(&pci_bus_type, &device_nb);
 	if (si_domain && !hw_pass_through)
 		register_memory_notifier(&intel_iommu_memory_nb);
+	register_hotcpu_notifier(&intel_iommu_cpu_nb);
 
 	intel_iommu_enabled = 1;
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 3/7] iommu: correct flush_unmaps pfn usage
       [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
  2016-04-13 18:51   ` [PATCH v2 1/7] iommu: refactoring of deferred flush entries Adam Morrison
  2016-04-13 18:51   ` [PATCH v2 2/7] iommu: per-cpu deferred invalidation queues Adam Morrison
@ 2016-04-13 18:51   ` Adam Morrison
  2016-04-13 18:52   ` [PATCH v2 4/7] iommu: only unmap mapped entries Adam Morrison
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Adam Morrison @ 2016-04-13 18:51 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	shli-b10kYP2dOMg, gvdl-hpIqsD4AKlfQT0dZR+AlfA,
	Kernel-team-b10kYP2dOMg

From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>

Change flush_unmaps() to pass dma pfns, as expected by
iommu_flush_iotlb_psi().  (Intel mm and dma page sizes are the same
at the moment, so this is not a functional change, but the explicit
conversion improves consistency.)

Signed-off-by: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
[mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org: rebased and reworded the commit message]
Signed-off-by: Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
---
 drivers/iommu/intel-iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index be94444..576a29a 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3548,7 +3548,8 @@ static void flush_unmaps(struct deferred_flush_data *flush_data)
 			/* On real hardware multiple invalidations are expensive */
 			if (cap_caching_mode(iommu->cap))
 				iommu_flush_iotlb_psi(iommu, domain,
-					iova->pfn_lo, iova_size(iova),
+					mm_to_dma_pfn(iova->pfn_lo),
+					mm_to_dma_pfn(iova_size(iova)),
 					!freelist, 0);
 			else {
 				mask = ilog2(mm_to_dma_pfn(iova_size(iova)));
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 4/7] iommu: only unmap mapped entries
       [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
                     ` (2 preceding siblings ...)
  2016-04-13 18:51   ` [PATCH v2 3/7] iommu: correct flush_unmaps pfn usage Adam Morrison
@ 2016-04-13 18:52   ` Adam Morrison
       [not found]     ` <e07164c8d0aaff68cabd2cf8e3aee9ed20882ae4.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
  2016-04-13 18:52   ` [PATCH v2 5/7] iommu: avoid dev iotlb logic in intel-iommu for domains with no dev iotlbs Adam Morrison
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Adam Morrison @ 2016-04-13 18:52 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	shli-b10kYP2dOMg, gvdl-hpIqsD4AKlfQT0dZR+AlfA,
	Kernel-team-b10kYP2dOMg

From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>

The current unmap implementation unmaps the entire area covered by the
IOVA range, which is a power-of-2 aligned region. The corresponding map,
however, only maps those pages originally passed in by the caller. This
discrepancy can lead to unmapping of entries that were never mapped,
which is unneeded work.

With this patch, only the mapped pages are unmapped. This also serves as
a baseline for a map/unmap implementation based on IOVA pfns rather than
struct iova pointers, which will allow caching.
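
Concretely, the change in intel_unmap() boils down to the following
(a simplified excerpt of the diff below):

	/* Before: unmap length taken from the IOVA allocation itself,
	 * i.e. the full reserved range.
	 */
	last_pfn = mm_to_dma_pfn(iova->pfn_hi + 1) - 1;

	/* After: unmap length derived from the size the caller passes
	 * to dma_unmap_*(), so only pages that were actually mapped
	 * are touched.
	 */
	nrpages  = aligned_nrpages(dev_addr, size);
	last_pfn = start_pfn + nrpages - 1;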

Signed-off-by: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
[mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org: rebased and reworded the commit message]
Signed-off-by: Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
---
 drivers/iommu/intel-iommu.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 576a29a..d40870b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -459,6 +459,7 @@ static void flush_unmaps_timeout(unsigned long data);
 
 struct deferred_flush_entry {
 	struct iova *iova;
+	unsigned long nrpages;
 	struct dmar_domain *domain;
 	struct page *freelist;
 };
@@ -3542,6 +3543,7 @@ static void flush_unmaps(struct deferred_flush_data *flush_data)
 			struct deferred_flush_entry *entry =
 						&flush_table->entries[j];
 			struct iova *iova = entry->iova;
+			unsigned long nrpages = entry->nrpages;
 			struct dmar_domain *domain = entry->domain;
 			struct page *freelist = entry->freelist;
 
@@ -3549,10 +3551,9 @@ static void flush_unmaps(struct deferred_flush_data *flush_data)
 			if (cap_caching_mode(iommu->cap))
 				iommu_flush_iotlb_psi(iommu, domain,
 					mm_to_dma_pfn(iova->pfn_lo),
-					mm_to_dma_pfn(iova_size(iova)),
-					!freelist, 0);
+					nrpages, !freelist, 0);
 			else {
-				mask = ilog2(mm_to_dma_pfn(iova_size(iova)));
+				mask = ilog2(nrpages);
 				iommu_flush_dev_iotlb(domain,
 						(uint64_t)iova->pfn_lo << PAGE_SHIFT, mask);
 			}
@@ -3576,7 +3577,8 @@ static void flush_unmaps_timeout(unsigned long cpuid)
 	spin_unlock_irqrestore(&flush_data->lock, flags);
 }
 
-static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *freelist)
+static void add_unmap(struct dmar_domain *dom, struct iova *iova,
+		      unsigned long nrpages, struct page *freelist)
 {
 	unsigned long flags;
 	int entry_id, iommu_id;
@@ -3610,6 +3612,7 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *f
 	entry = &flush_data->tables[iommu_id].entries[entry_id];
 	entry->domain = dom;
 	entry->iova = iova;
+	entry->nrpages = nrpages;
 	entry->freelist = freelist;
 
 	if (!flush_data->timer_on) {
@@ -3622,10 +3625,11 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *f
 	put_cpu();
 }
 
-static void intel_unmap(struct device *dev, dma_addr_t dev_addr)
+static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
 {
 	struct dmar_domain *domain;
 	unsigned long start_pfn, last_pfn;
+	unsigned long nrpages;
 	struct iova *iova;
 	struct intel_iommu *iommu;
 	struct page *freelist;
@@ -3643,8 +3647,9 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr)
 		      (unsigned long long)dev_addr))
 		return;
 
+	nrpages = aligned_nrpages(dev_addr, size);
 	start_pfn = mm_to_dma_pfn(iova->pfn_lo);
-	last_pfn = mm_to_dma_pfn(iova->pfn_hi + 1) - 1;
+	last_pfn = start_pfn + nrpages - 1;
 
 	pr_debug("Device %s unmapping: pfn %lx-%lx\n",
 		 dev_name(dev), start_pfn, last_pfn);
@@ -3653,12 +3658,12 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr)
 
 	if (intel_iommu_strict) {
 		iommu_flush_iotlb_psi(iommu, domain, start_pfn,
-				      last_pfn - start_pfn + 1, !freelist, 0);
+				      nrpages, !freelist, 0);
 		/* free iova */
 		__free_iova(&domain->iovad, iova);
 		dma_free_pagelist(freelist);
 	} else {
-		add_unmap(domain, iova, freelist);
+		add_unmap(domain, iova, nrpages, freelist);
 		/*
 		 * queue up the release of the unmap to save the 1/6th of the
 		 * cpu used up by the iotlb flush operation...
@@ -3670,7 +3675,7 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr,
 			     size_t size, enum dma_data_direction dir,
 			     struct dma_attrs *attrs)
 {
-	intel_unmap(dev, dev_addr);
+	intel_unmap(dev, dev_addr, size);
 }
 
 static void *intel_alloc_coherent(struct device *dev, size_t size,
@@ -3729,7 +3734,7 @@ static void intel_free_coherent(struct device *dev, size_t size, void *vaddr,
 	size = PAGE_ALIGN(size);
 	order = get_order(size);
 
-	intel_unmap(dev, dma_handle);
+	intel_unmap(dev, dma_handle, size);
 	if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
 		__free_pages(page, order);
 }
@@ -3738,7 +3743,16 @@ static void intel_unmap_sg(struct device *dev, struct scatterlist *sglist,
 			   int nelems, enum dma_data_direction dir,
 			   struct dma_attrs *attrs)
 {
-	intel_unmap(dev, sglist[0].dma_address);
+	dma_addr_t startaddr = sglist[0].dma_address - sglist[0].offset;
+	unsigned long nrpages = 0;
+	struct scatterlist *sg;
+	int i;
+
+	for_each_sg(sglist, sg, nelems, i) {
+		nrpages += aligned_nrpages(sg->offset, sg->length);
+	}
+
+	intel_unmap(dev, startaddr, nrpages << VTD_PAGE_SHIFT);
 }
 
 static int intel_nontranslate_map_sg(struct device *hddev,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 5/7] iommu: avoid dev iotlb logic in intel-iommu for domains with no dev iotlbs
       [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
                     ` (3 preceding siblings ...)
  2016-04-13 18:52   ` [PATCH v2 4/7] iommu: only unmap mapped entries Adam Morrison
@ 2016-04-13 18:52   ` Adam Morrison
  2016-04-13 18:52   ` [PATCH v2 6/7] iommu: change intel-iommu to use IOVA frame numbers Adam Morrison
  2016-04-13 18:52   ` [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation Adam Morrison
  6 siblings, 0 replies; 17+ messages in thread
From: Adam Morrison @ 2016-04-13 18:52 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	shli-b10kYP2dOMg, gvdl-hpIqsD4AKlfQT0dZR+AlfA,
	Kernel-team-b10kYP2dOMg

From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>

This patch avoids taking the device_domain_lock in iommu_flush_dev_iotlb()
for domains with no dev iotlb devices.

Signed-off-by: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
[gvdl-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org: fixed locking issues]
Signed-off-by: Godfrey van der Linden <gvdl-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
[mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org: rebased and reworded the commit message]
Signed-off-by: Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
---
 drivers/iommu/intel-iommu.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d40870b..acba66c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -392,6 +392,7 @@ struct dmar_domain {
 					 * to VT-d spec, section 9.3 */
 
 	struct list_head devices;	/* all devices' list */
+	bool has_iotlb_device;
 	struct iova_domain iovad;	/* iova's that belong to this domain */
 
 	struct dma_pte	*pgd;		/* virtual address */
@@ -1464,10 +1465,35 @@ iommu_support_dev_iotlb (struct dmar_domain *domain, struct intel_iommu *iommu,
 	return NULL;
 }
 
+static void domain_update_iotlb(struct dmar_domain *domain)
+{
+	struct device_domain_info *info;
+	bool has_iotlb_device = false;
+
+	assert_spin_locked(&device_domain_lock);
+
+	list_for_each_entry(info, &domain->devices, link) {
+		struct pci_dev *pdev;
+
+		if (!info->dev || !dev_is_pci(info->dev))
+			continue;
+
+		pdev = to_pci_dev(info->dev);
+		if (pdev->ats_enabled) {
+			has_iotlb_device = true;
+			break;
+		}
+	}
+
+	domain->has_iotlb_device = has_iotlb_device;
+}
+
 static void iommu_enable_dev_iotlb(struct device_domain_info *info)
 {
 	struct pci_dev *pdev;
 
+	assert_spin_locked(&device_domain_lock);
+
 	if (!info || !dev_is_pci(info->dev))
 		return;
 
@@ -1487,6 +1513,7 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
 #endif
 	if (info->ats_supported && !pci_enable_ats(pdev, VTD_PAGE_SHIFT)) {
 		info->ats_enabled = 1;
+		domain_update_iotlb(info->domain);
 		info->ats_qdep = pci_ats_queue_depth(pdev);
 	}
 }
@@ -1495,6 +1522,8 @@ static void iommu_disable_dev_iotlb(struct device_domain_info *info)
 {
 	struct pci_dev *pdev;
 
+	assert_spin_locked(&device_domain_lock);
+
 	if (!dev_is_pci(info->dev))
 		return;
 
@@ -1503,6 +1532,7 @@ static void iommu_disable_dev_iotlb(struct device_domain_info *info)
 	if (info->ats_enabled) {
 		pci_disable_ats(pdev);
 		info->ats_enabled = 0;
+		domain_update_iotlb(info->domain);
 	}
 #ifdef CONFIG_INTEL_IOMMU_SVM
 	if (info->pri_enabled) {
@@ -1523,6 +1553,9 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
 	unsigned long flags;
 	struct device_domain_info *info;
 
+	if (!domain->has_iotlb_device)
+		return;
+
 	spin_lock_irqsave(&device_domain_lock, flags);
 	list_for_each_entry(info, &domain->devices, link) {
 		if (!info->ats_enabled)
@@ -1740,6 +1773,7 @@ static struct dmar_domain *alloc_domain(int flags)
 	memset(domain, 0, sizeof(*domain));
 	domain->nid = -1;
 	domain->flags = flags;
+	domain->has_iotlb_device = false;
 	INIT_LIST_HEAD(&domain->devices);
 
 	return domain;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 6/7] iommu: change intel-iommu to use IOVA frame numbers
       [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
                     ` (4 preceding siblings ...)
  2016-04-13 18:52   ` [PATCH v2 5/7] iommu: avoid dev iotlb logic in intel-iommu for domains with no dev iotlbs Adam Morrison
@ 2016-04-13 18:52   ` Adam Morrison
  2016-04-13 18:52   ` [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation Adam Morrison
  6 siblings, 0 replies; 17+ messages in thread
From: Adam Morrison @ 2016-04-13 18:52 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	shli-b10kYP2dOMg, gvdl-hpIqsD4AKlfQT0dZR+AlfA,
	Kernel-team-b10kYP2dOMg

From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>

Make intel-iommu map/unmap/invalidate work with IOVA pfns instead of
pointers to "struct iova". This avoids using the iova struct from the IOVA
red-black tree and the resulting explicit find_iova() on unmap.

This patch will allow us to cache IOVAs in the next patch, in order to
avoid rbtree operations for the majority of map/unmap operations.

Note: In eliminating the find_iova() operation, we have also eliminated
the sanity check previously done in the unmap flow. Arguably, this was
overhead that is better avoided in production code, but it could be
brought back as a debug option for driver development.
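
A sketch of what such a debug option could look like (hypothetical;
not part of this series, and the config symbol name is invented here):

#ifdef CONFIG_INTEL_IOMMU_DEBUG_UNMAP
	/* Re-do the old sanity check: warn if the address being
	 * unmapped was never handed out by the IOVA allocator.
	 */
	WARN_ONCE(!find_iova(&domain->iovad, IOVA_PFN(dev_addr)),
		  "Driver unmaps unmatched page at PFN %llx\n",
		  (unsigned long long)dev_addr);
#endif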

Signed-off-by: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
[mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org: rebased, fixed to not break iova api, and reworded
 the commit message]
Signed-off-by: Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
---
 drivers/iommu/intel-iommu.c | 61 +++++++++++++++++++++------------------------
 1 file changed, 29 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index acba66c..7a9a3a6 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -459,7 +459,7 @@ static LIST_HEAD(dmar_rmrr_units);
 static void flush_unmaps_timeout(unsigned long data);
 
 struct deferred_flush_entry {
-	struct iova *iova;
+	unsigned long iova_pfn;
 	unsigned long nrpages;
 	struct dmar_domain *domain;
 	struct page *freelist;
@@ -3353,7 +3353,7 @@ error:
 }
 
 /* This takes a number of _MM_ pages, not VTD pages */
-static struct iova *intel_alloc_iova(struct device *dev,
+static unsigned long intel_alloc_iova(struct device *dev,
 				     struct dmar_domain *domain,
 				     unsigned long nrpages, uint64_t dma_mask)
 {
@@ -3373,16 +3373,16 @@ static struct iova *intel_alloc_iova(struct device *dev,
 		iova = alloc_iova(&domain->iovad, nrpages,
 				  IOVA_PFN(DMA_BIT_MASK(32)), 1);
 		if (iova)
-			return iova;
+			return iova->pfn_lo;
 	}
 	iova = alloc_iova(&domain->iovad, nrpages, IOVA_PFN(dma_mask), 1);
 	if (unlikely(!iova)) {
 		pr_err("Allocating %ld-page iova for %s failed",
 		       nrpages, dev_name(dev));
-		return NULL;
+		return 0;
 	}
 
-	return iova;
+	return iova->pfn_lo;
 }
 
 static struct dmar_domain *__get_valid_domain_for_dev(struct device *dev)
@@ -3480,7 +3480,7 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 {
 	struct dmar_domain *domain;
 	phys_addr_t start_paddr;
-	struct iova *iova;
+	unsigned long iova_pfn;
 	int prot = 0;
 	int ret;
 	struct intel_iommu *iommu;
@@ -3498,8 +3498,8 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 	iommu = domain_get_iommu(domain);
 	size = aligned_nrpages(paddr, size);
 
-	iova = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size), dma_mask);
-	if (!iova)
+	iova_pfn = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size), dma_mask);
+	if (!iova_pfn)
 		goto error;
 
 	/*
@@ -3517,7 +3517,7 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 	 * might have two guest_addr mapping to the same host paddr, but this
 	 * is not a big problem
 	 */
-	ret = domain_pfn_mapping(domain, mm_to_dma_pfn(iova->pfn_lo),
+	ret = domain_pfn_mapping(domain, mm_to_dma_pfn(iova_pfn),
 				 mm_to_dma_pfn(paddr_pfn), size, prot);
 	if (ret)
 		goto error;
@@ -3525,18 +3525,18 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 	/* it's a non-present to present mapping. Only flush if caching mode */
 	if (cap_caching_mode(iommu->cap))
 		iommu_flush_iotlb_psi(iommu, domain,
-				      mm_to_dma_pfn(iova->pfn_lo),
+				      mm_to_dma_pfn(iova_pfn),
 				      size, 0, 1);
 	else
 		iommu_flush_write_buffer(iommu);
 
-	start_paddr = (phys_addr_t)iova->pfn_lo << PAGE_SHIFT;
+	start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT;
 	start_paddr += paddr & ~PAGE_MASK;
 	return start_paddr;
 
 error:
-	if (iova)
-		__free_iova(&domain->iovad, iova);
+	if (iova_pfn)
+		free_iova(&domain->iovad, iova_pfn);
 	pr_err("Device %s request: %zx@%llx dir %d --- failed\n",
 		dev_name(dev), size, (unsigned long long)paddr, dir);
 	return 0;
@@ -3576,7 +3576,7 @@ static void flush_unmaps(struct deferred_flush_data *flush_data)
 			unsigned long mask;
 			struct deferred_flush_entry *entry =
 						&flush_table->entries[j];
-			struct iova *iova = entry->iova;
+			unsigned long iova_pfn = entry->iova_pfn;
 			unsigned long nrpages = entry->nrpages;
 			struct dmar_domain *domain = entry->domain;
 			struct page *freelist = entry->freelist;
@@ -3584,14 +3584,14 @@ static void flush_unmaps(struct deferred_flush_data *flush_data)
 			/* On real hardware multiple invalidations are expensive */
 			if (cap_caching_mode(iommu->cap))
 				iommu_flush_iotlb_psi(iommu, domain,
-					mm_to_dma_pfn(iova->pfn_lo),
+					mm_to_dma_pfn(iova_pfn),
 					nrpages, !freelist, 0);
 			else {
 				mask = ilog2(nrpages);
 				iommu_flush_dev_iotlb(domain,
-						(uint64_t)iova->pfn_lo << PAGE_SHIFT, mask);
+						(uint64_t)iova_pfn << PAGE_SHIFT, mask);
 			}
-			__free_iova(&domain->iovad, iova);
+			free_iova(&domain->iovad, iova_pfn);
 			if (freelist)
 				dma_free_pagelist(freelist);
 		}
@@ -3611,7 +3611,7 @@ static void flush_unmaps_timeout(unsigned long cpuid)
 	spin_unlock_irqrestore(&flush_data->lock, flags);
 }
 
-static void add_unmap(struct dmar_domain *dom, struct iova *iova,
+static void add_unmap(struct dmar_domain *dom, unsigned long iova_pfn,
 		      unsigned long nrpages, struct page *freelist)
 {
 	unsigned long flags;
@@ -3645,7 +3645,7 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova,
 
 	entry = &flush_data->tables[iommu_id].entries[entry_id];
 	entry->domain = dom;
-	entry->iova = iova;
+	entry->iova_pfn = iova_pfn;
 	entry->nrpages = nrpages;
 	entry->freelist = freelist;
 
@@ -3664,7 +3664,7 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
 	struct dmar_domain *domain;
 	unsigned long start_pfn, last_pfn;
 	unsigned long nrpages;
-	struct iova *iova;
+	unsigned long iova_pfn;
 	struct intel_iommu *iommu;
 	struct page *freelist;
 
@@ -3676,13 +3676,10 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
 
 	iommu = domain_get_iommu(domain);
 
-	iova = find_iova(&domain->iovad, IOVA_PFN(dev_addr));
-	if (WARN_ONCE(!iova, "Driver unmaps unmatched page at PFN %llx\n",
-		      (unsigned long long)dev_addr))
-		return;
+	iova_pfn = IOVA_PFN(dev_addr);
 
 	nrpages = aligned_nrpages(dev_addr, size);
-	start_pfn = mm_to_dma_pfn(iova->pfn_lo);
+	start_pfn = mm_to_dma_pfn(iova_pfn);
 	last_pfn = start_pfn + nrpages - 1;
 
 	pr_debug("Device %s unmapping: pfn %lx-%lx\n",
@@ -3694,10 +3691,10 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
 		iommu_flush_iotlb_psi(iommu, domain, start_pfn,
 				      nrpages, !freelist, 0);
 		/* free iova */
-		__free_iova(&domain->iovad, iova);
+		free_iova(&domain->iovad, iova_pfn);
 		dma_free_pagelist(freelist);
 	} else {
-		add_unmap(domain, iova, nrpages, freelist);
+		add_unmap(domain, iova_pfn, nrpages, freelist);
 		/*
 		 * queue up the release of the unmap to save the 1/6th of the
 		 * cpu used up by the iotlb flush operation...
@@ -3810,7 +3807,7 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele
 	struct dmar_domain *domain;
 	size_t size = 0;
 	int prot = 0;
-	struct iova *iova = NULL;
+	unsigned long iova_pfn;
 	int ret;
 	struct scatterlist *sg;
 	unsigned long start_vpfn;
@@ -3829,9 +3826,9 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele
 	for_each_sg(sglist, sg, nelems, i)
 		size += aligned_nrpages(sg->offset, sg->length);
 
-	iova = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size),
+	iova_pfn = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size),
 				*dev->dma_mask);
-	if (!iova) {
+	if (!iova_pfn) {
 		sglist->dma_length = 0;
 		return 0;
 	}
@@ -3846,13 +3843,13 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele
 	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
 		prot |= DMA_PTE_WRITE;
 
-	start_vpfn = mm_to_dma_pfn(iova->pfn_lo);
+	start_vpfn = mm_to_dma_pfn(iova_pfn);
 
 	ret = domain_sg_mapping(domain, start_vpfn, sglist, size, prot);
 	if (unlikely(ret)) {
 		dma_pte_free_pagetable(domain, start_vpfn,
 				       start_vpfn + size - 1);
-		__free_iova(&domain->iovad, iova);
+		free_iova(&domain->iovad, iova_pfn);
 		return 0;
 	}
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
                     ` (5 preceding siblings ...)
  2016-04-13 18:52   ` [PATCH v2 6/7] iommu: change intel-iommu to use IOVA frame numbers Adam Morrison
@ 2016-04-13 18:52   ` Adam Morrison
       [not found]     ` <b208a304d83088aae7ecac10a3062dc57c0a2f79.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
  6 siblings, 1 reply; 17+ messages in thread
From: Adam Morrison @ 2016-04-13 18:52 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	shli-b10kYP2dOMg, gvdl-hpIqsD4AKlfQT0dZR+AlfA,
	Kernel-team-b10kYP2dOMg

From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>

IOVA allocation has two problems that impede high-throughput I/O.
First, it can do a linear search over the allocated IOVA ranges.
Second, the rbtree spinlock that serializes IOVA allocations becomes
contended.

Address these problems by creating an API for caching allocated IOVA
ranges, so that the IOVA allocator isn't accessed frequently.  This
patch adds a per-CPU cache, from which CPUs can alloc/free IOVAs
without taking the rbtree spinlock.  The per-CPU caches are backed by
a global cache, which makes it possible to avoid invoking the
(linear-time) IOVA allocator without making the per-CPU cache size
excessive.  This design is based on magazines, as described in
"Magazines and Vmem: Extending the Slab Allocator to Many CPUs and
Arbitrary Resources" by Bonwick and Adams (currently available at
https://www.usenix.org/legacy/event/usenix01/bonwick.html).

Adding caching on top of the existing rbtree allocator maintains the
property that IOVAs are densely packed in the IO virtual address space,
which is important for keeping IOMMU page table usage low.

To keep the cache size reasonable, we limit caching to ranges of
size <= 128 KB.  Overall, a CPU can cache at most 32 MB of IOVA space,
and the global cache is bounded by 4 MB.
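
From a caller's point of view, the resulting API is used the way the
intel-iommu changes below use it; roughly (a condensed sketch, with
error handling and pfn conversions trimmed):

	unsigned long iova_pfn;

	/* Try the per-CPU cache first; alloc_iova_fast() falls back to
	 * the rbtree allocator internally if the cache is empty.
	 */
	iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
				   IOVA_PFN(dma_mask));
	if (!iova_pfn)
		return 0;	/* allocation failed */

	/* ... map nrpages pages at iova_pfn and perform the DMA ... */

	/* Put the range back into the per-CPU cache; free_iova_fast()
	 * falls back to free_iova() if the range is too large to cache
	 * or the cache cannot absorb it.
	 */
	free_iova_fast(&domain->iovad, iova_pfn, nrpages);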

Signed-off-by: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
[mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org: rebased, cleaned up and reworded the commit message]
Signed-off-by: Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
---
 drivers/iommu/intel-iommu.c |  47 ++++--
 drivers/iommu/iova.c        | 372 +++++++++++++++++++++++++++++++++++++++++---
 include/linux/iova.h        |  23 ++-
 3 files changed, 404 insertions(+), 38 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 7a9a3a6..629c5b19 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3357,7 +3357,7 @@ static unsigned long intel_alloc_iova(struct device *dev,
 				     struct dmar_domain *domain,
 				     unsigned long nrpages, uint64_t dma_mask)
 {
-	struct iova *iova = NULL;
+	unsigned long iova_pfn = 0;
 
 	/* Restrict dma_mask to the width that the iommu can handle */
 	dma_mask = min_t(uint64_t, DOMAIN_MAX_ADDR(domain->gaw), dma_mask);
@@ -3370,19 +3370,19 @@ static unsigned long intel_alloc_iova(struct device *dev,
 		 * DMA_BIT_MASK(32) and if that fails then try allocating
 		 * from higher range
 		 */
-		iova = alloc_iova(&domain->iovad, nrpages,
-				  IOVA_PFN(DMA_BIT_MASK(32)), 1);
-		if (iova)
-			return iova->pfn_lo;
+		iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
+					   IOVA_PFN(DMA_BIT_MASK(32)));
+		if (iova_pfn)
+			return iova_pfn;
 	}
-	iova = alloc_iova(&domain->iovad, nrpages, IOVA_PFN(dma_mask), 1);
-	if (unlikely(!iova)) {
+	iova_pfn = alloc_iova_fast(&domain->iovad, nrpages, IOVA_PFN(dma_mask));
+	if (unlikely(!iova_pfn)) {
 		pr_err("Allocating %ld-page iova for %s failed",
 		       nrpages, dev_name(dev));
 		return 0;
 	}
 
-	return iova->pfn_lo;
+	return iova_pfn;
 }
 
 static struct dmar_domain *__get_valid_domain_for_dev(struct device *dev)
@@ -3536,7 +3536,7 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 
 error:
 	if (iova_pfn)
-		free_iova(&domain->iovad, iova_pfn);
+		free_iova_fast(&domain->iovad, iova_pfn, dma_to_mm_pfn(size));
 	pr_err("Device %s request: %zx@%llx dir %d --- failed\n",
 		dev_name(dev), size, (unsigned long long)paddr, dir);
 	return 0;
@@ -3591,7 +3591,7 @@ static void flush_unmaps(struct deferred_flush_data *flush_data)
 				iommu_flush_dev_iotlb(domain,
 						(uint64_t)iova_pfn << PAGE_SHIFT, mask);
 			}
-			free_iova(&domain->iovad, iova_pfn);
+			free_iova_fast(&domain->iovad, iova_pfn, nrpages);
 			if (freelist)
 				dma_free_pagelist(freelist);
 		}
@@ -3691,7 +3691,7 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
 		iommu_flush_iotlb_psi(iommu, domain, start_pfn,
 				      nrpages, !freelist, 0);
 		/* free iova */
-		free_iova(&domain->iovad, iova_pfn);
+		free_iova_fast(&domain->iovad, iova_pfn, dma_to_mm_pfn(nrpages));
 		dma_free_pagelist(freelist);
 	} else {
 		add_unmap(domain, iova_pfn, nrpages, freelist);
@@ -3849,7 +3849,7 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele
 	if (unlikely(ret)) {
 		dma_pte_free_pagetable(domain, start_vpfn,
 				       start_vpfn + size - 1);
-		free_iova(&domain->iovad, iova_pfn);
+		free_iova_fast(&domain->iovad, iova_pfn, dma_to_mm_pfn(size));
 		return 0;
 	}
 
@@ -4588,6 +4588,28 @@ static struct notifier_block intel_iommu_memory_nb = {
 	.priority = 0
 };
 
+static void free_all_cpu_cached_iovas(unsigned int cpu)
+{
+	int i;
+
+	for (i = 0; i < g_num_of_iommus; i++) {
+		struct intel_iommu *iommu = g_iommus[i];
+		struct dmar_domain *domain;
+		u16 did;
+
+		if (!iommu)
+			continue;
+
+		for (did = 0; did < 0xffff; did++) {
+			domain = get_iommu_domain(iommu, did);
+
+			if (!domain)
+				continue;
+			free_cpu_cached_iovas(cpu, &domain->iovad);
+		}
+	}
+}
+
 static int intel_iommu_cpu_notifier(struct notifier_block *nfb,
 				    unsigned long action, void *v)
 {
@@ -4596,6 +4618,7 @@ static int intel_iommu_cpu_notifier(struct notifier_block *nfb,
 	switch (action) {
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
+		free_all_cpu_cached_iovas(cpu);
 		flush_unmaps_timeout(cpu);
 		break;
 	}
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index fa0adef..db3b914 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -20,6 +20,16 @@
 #include <linux/iova.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/bitops.h>
+
+static bool iova_rcache_insert(struct iova_domain *iovad,
+			       struct iova_rcache *rcache,
+			       unsigned long iova_pfn);
+static unsigned long iova_rcache_get(struct iova_rcache *rcache);
+
+static void init_iova_rcaches(struct iova_domain *iovad);
+static void free_iova_rcaches(struct iova_domain *iovad);
 
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -38,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
 	iovad->dma_32bit_pfn = pfn_32bit;
+	init_iova_rcaches(iovad);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
 
@@ -291,33 +302,18 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
 }
 EXPORT_SYMBOL_GPL(alloc_iova);
 
-/**
- * find_iova - find's an iova for a given pfn
- * @iovad: - iova domain in question.
- * @pfn: - page frame number
- * This function finds and returns an iova belonging to the
- * given doamin which matches the given pfn.
- */
-struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
+static struct iova *
+private_find_iova(struct iova_domain *iovad, unsigned long pfn)
 {
-	unsigned long flags;
-	struct rb_node *node;
+	struct rb_node *node = iovad->rbroot.rb_node;
+
+	assert_spin_locked(&iovad->iova_rbtree_lock);
 
-	/* Take the lock so that no other thread is manipulating the rbtree */
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-	node = iovad->rbroot.rb_node;
 	while (node) {
 		struct iova *iova = container_of(node, struct iova, node);
 
 		/* If pfn falls within iova's range, return iova */
 		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
-			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-			/* We are not holding the lock while this iova
-			 * is referenced by the caller as the same thread
-			 * which called this function also calls __free_iova()
-			 * and it is by design that only one thread can possibly
-			 * reference a particular iova and hence no conflict.
-			 */
 			return iova;
 		}
 
@@ -327,9 +323,35 @@ struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
 			node = node->rb_right;
 	}
 
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 	return NULL;
 }
+
+static void private_free_iova(struct iova_domain *iovad, struct iova *iova)
+{
+	assert_spin_locked(&iovad->iova_rbtree_lock);
+	__cached_rbnode_delete_update(iovad, iova);
+	rb_erase(&iova->node, &iovad->rbroot);
+	free_iova_mem(iova);
+}
+
+/**
+ * find_iova - finds an iova for a given pfn
+ * @iovad: - iova domain in question.
+ * @pfn: - page frame number
+ * This function finds and returns an iova belonging to the
+ * given doamin which matches the given pfn.
+ */
+struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
+{
+	unsigned long flags;
+	struct iova *iova;
+
+	/* Take the lock so that no other thread is manipulating the rbtree */
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	iova = private_find_iova(iovad, pfn);
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	return iova;
+}
 EXPORT_SYMBOL_GPL(find_iova);
 
 /**
@@ -344,10 +366,8 @@ __free_iova(struct iova_domain *iovad, struct iova *iova)
 	unsigned long flags;
 
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-	__cached_rbnode_delete_update(iovad, iova);
-	rb_erase(&iova->node, &iovad->rbroot);
+	private_free_iova(iovad, iova);
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-	free_iova_mem(iova);
 }
 EXPORT_SYMBOL_GPL(__free_iova);
 
@@ -370,6 +390,57 @@ free_iova(struct iova_domain *iovad, unsigned long pfn)
 EXPORT_SYMBOL_GPL(free_iova);
 
 /**
+ * alloc_iova_fast - allocates an iova from rcache
+ * @iovad: - iova domain in question
+ * @size: - size of page frames to allocate
+ * @limit_pfn: - max limit address
+ * This function tries to satisfy an iova allocation from the rcache,
+ * and falls back to regular allocation on failure.
+*/
+unsigned long
+alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
+		unsigned long limit_pfn)
+{
+	unsigned int log_size = fls_long(size - 1);
+	unsigned long iova_pfn;
+	struct iova *new_iova;
+
+	if (log_size < IOVA_RANGE_CACHE_MAX_SIZE) {
+		iova_pfn = iova_rcache_get(&iovad->rcaches[log_size]);
+		if (iova_pfn)
+			return iova_pfn;
+	}
+
+	new_iova = alloc_iova(iovad, size, limit_pfn, true);
+	if (!new_iova)
+		return 0;
+
+	return new_iova->pfn_lo;
+}
+EXPORT_SYMBOL_GPL(alloc_iova_fast);
+
+/**
+ * free_iova_fast - free iova pfn range into rcache
+ * @iovad: - iova domain in question.
+ * @pfn: - pfn that is allocated previously
+ * @size: - # of pages in range
+ * This functions frees an iova range by trying to put it into the rcache,
+ * falling back to regular iova deallocation via free_iova() if this fails.
+ */
+void
+free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)
+{
+	unsigned int log_size = order_base_2(size);
+
+	if (log_size < IOVA_RANGE_CACHE_MAX_SIZE &&
+	    iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn))
+		return;
+
+	free_iova(iovad, pfn);
+}
+EXPORT_SYMBOL_GPL(free_iova_fast);
+
+/**
  * put_iova_domain - destroys the iova doamin
  * @iovad: - iova domain in question.
  * All the iova's in that domain are destroyed.
@@ -379,6 +450,7 @@ void put_iova_domain(struct iova_domain *iovad)
 	struct rb_node *node;
 	unsigned long flags;
 
+	free_iova_rcaches(iovad);
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
 	node = rb_first(&iovad->rbroot);
 	while (node) {
@@ -550,5 +622,257 @@ error:
 	return NULL;
 }
 
+/*
+ * Magazine caches for IOVA ranges.  For an introduction to magazines,
+ * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab
+ * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams.
+ * For simplicity, we use a static magazine size and don't implement the
+ * dynamic size tuning described in the paper.
+ */
+
+#define IOVA_MAG_SIZE 128
+
+struct iova_magazine {
+	unsigned long size;
+	unsigned long pfns[IOVA_MAG_SIZE];
+};
+
+struct iova_cpu_rcache {
+	struct iova_magazine *loaded;
+	struct iova_magazine *prev;
+};
+
+static struct iova_magazine *iova_magazine_alloc(gfp_t flags)
+{
+	return kzalloc(sizeof(struct iova_magazine), flags);
+}
+
+static void iova_magazine_free(struct iova_magazine *mag)
+{
+	kfree(mag);
+}
+
+static void
+iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
+{
+	unsigned long flags;
+	int i;
+
+	if (!mag)
+		return;
+
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+
+	for (i = 0 ; i < mag->size; ++i) {
+		struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
+
+		BUG_ON(!iova);
+		private_free_iova(iovad, iova);
+	}
+
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+
+	mag->size = 0;
+}
+
+static bool iova_magazine_full(struct iova_magazine *mag)
+{
+	return (mag && mag->size == IOVA_MAG_SIZE);
+}
+
+static bool iova_magazine_empty(struct iova_magazine *mag)
+{
+	return (!mag || mag->size == 0);
+}
+
+static unsigned long iova_magazine_pop(struct iova_magazine *mag)
+{
+	BUG_ON(iova_magazine_empty(mag));
+
+	return mag->pfns[--mag->size];
+}
+
+static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)
+{
+	BUG_ON(iova_magazine_full(mag));
+
+	mag->pfns[mag->size++] = pfn;
+}
+
+static void init_iova_rcaches(struct iova_domain *iovad)
+{
+	struct iova_cpu_rcache *cpu_rcache;
+	struct iova_rcache *rcache;
+	unsigned int cpu;
+	int i;
+
+	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+		rcache = &iovad->rcaches[i];
+		spin_lock_init(&rcache->lock);
+		rcache->depot_size = 0;
+		rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
+		if (WARN_ON(!rcache->cpu_rcaches))
+			continue;
+		for_each_possible_cpu(cpu) {
+			cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
+			cpu_rcache->loaded = iova_magazine_alloc(GFP_KERNEL);
+			cpu_rcache->prev = iova_magazine_alloc(GFP_KERNEL);
+		}
+	}
+}
+
+/*
+ * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and
+ * return true on success.  Can fail if rcache is full and we can't free
+ * space, and free_iova() (our only caller) will then return the IOVA
+ * range to the rbtree instead.
+ */
+static bool iova_rcache_insert(struct iova_domain *iovad,
+			       struct iova_rcache *rcache,
+			       unsigned long iova_pfn)
+{
+	struct iova_magazine *mag_to_free = NULL;
+	struct iova_cpu_rcache *cpu_rcache;
+	bool can_insert = false;
+
+	preempt_disable();
+	cpu_rcache = this_cpu_ptr(rcache->cpu_rcaches);
+
+	if (!iova_magazine_full(cpu_rcache->loaded)) {
+		can_insert = true;
+	} else if (!iova_magazine_full(cpu_rcache->prev)) {
+		swap(cpu_rcache->prev, cpu_rcache->loaded);
+		can_insert = true;
+	} else {
+		struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
+
+		if (new_mag) {
+			unsigned long flags;
+
+			spin_lock_irqsave(&rcache->lock, flags);
+			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
+				rcache->depot[rcache->depot_size++] =
+						cpu_rcache->loaded;
+			} else {
+				mag_to_free = cpu_rcache->loaded;
+			}
+			spin_unlock_irqrestore(&rcache->lock, flags);
+
+			cpu_rcache->loaded = new_mag;
+			can_insert = true;
+		}
+	}
+
+	if (can_insert)
+		iova_magazine_push(cpu_rcache->loaded, iova_pfn);
+
+	preempt_enable();
+
+	if (mag_to_free) {
+		iova_magazine_free_pfns(mag_to_free, iovad);
+		iova_magazine_free(mag_to_free);
+	}
+
+	return can_insert;
+}
+
+/*
+ * Caller wants to allocate a new IOVA range from 'rcache'.  If we can
+ * satisfy the request, return a matching non-NULL range and remove
+ * it from the 'rcache'.
+ */
+static unsigned long iova_rcache_get(struct iova_rcache *rcache)
+{
+	struct iova_cpu_rcache *cpu_rcache;
+	unsigned long iova_pfn = 0;
+	bool has_pfn = false;
+
+	preempt_disable();
+	cpu_rcache = this_cpu_ptr(rcache->cpu_rcaches);
+
+	if (!iova_magazine_empty(cpu_rcache->loaded)) {
+		has_pfn = true;
+	} else if (!iova_magazine_empty(cpu_rcache->prev)) {
+		swap(cpu_rcache->prev, cpu_rcache->loaded);
+		has_pfn = true;
+	} else {
+		unsigned long flags;
+
+		spin_lock_irqsave(&rcache->lock, flags);
+		if (rcache->depot_size > 0) {
+			iova_magazine_free(cpu_rcache->loaded);
+			cpu_rcache->loaded = rcache->depot[--rcache->depot_size];
+			has_pfn = true;
+		}
+		spin_unlock_irqrestore(&rcache->lock, flags);
+	}
+
+	if (has_pfn)
+		iova_pfn = iova_magazine_pop(cpu_rcache->loaded);
+
+	preempt_enable();
+
+	return iova_pfn;
+}
+
+/*
+ * free a cpu's rcache; assumes it cannot race with the cpu accessing
+ * its rcache.
+ */
+static void free_cpu_iova_rcache(unsigned int cpu, struct iova_domain *iovad,
+				 struct iova_rcache *rcache)
+{
+	struct iova_cpu_rcache *cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
+
+	/* No one else touches cpu caches when they're freed */
+	iova_magazine_free_pfns(cpu_rcache->loaded, iovad);
+	iova_magazine_free(cpu_rcache->loaded);
+
+	iova_magazine_free_pfns(cpu_rcache->prev, iovad);
+	iova_magazine_free(cpu_rcache->prev);
+}
+
+/*
+ * reset entire rcache data structure.
+ */
+static void free_iova_rcaches(struct iova_domain *iovad)
+{
+	struct iova_rcache *rcache;
+	unsigned long flags;
+	unsigned int cpu;
+	int i, j;
+
+	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+		rcache = &iovad->rcaches[i];
+		for_each_possible_cpu(cpu)
+			free_cpu_iova_rcache(cpu, iovad, rcache);
+		spin_lock_irqsave(&rcache->lock, flags);
+		free_percpu(rcache->cpu_rcaches);
+		for (j = 0; j < rcache->depot_size; ++j) {
+			iova_magazine_free_pfns(rcache->depot[j], iovad);
+			iova_magazine_free(rcache->depot[j]);
+		}
+		spin_unlock_irqrestore(&rcache->lock, flags);
+	}
+}
+
+/*
+ * free all the IOVA ranges cached by a cpu (used when cpu is unplugged)
+ */
+void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
+{
+	struct iova_cpu_rcache *cpu_rcache;
+	struct iova_rcache *rcache;
+	int i;
+
+	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
+		rcache = &iovad->rcaches[i];
+		cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
+		iova_magazine_free_pfns(cpu_rcache->loaded, iovad);
+		iova_magazine_free_pfns(cpu_rcache->prev, iovad);
+	}
+}
+
+
 MODULE_AUTHOR("Anil S Keshavamurthy <anil.s.keshavamurthy-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>");
 MODULE_LICENSE("GPL");
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 92f7177..0652cdd 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -19,8 +19,21 @@
 /* iova structure */
 struct iova {
 	struct rb_node	node;
-	unsigned long	pfn_hi; /* IOMMU dish out addr hi */
-	unsigned long	pfn_lo; /* IOMMU dish out addr lo */
+	unsigned long	pfn_hi; /* Highest allocated pfn */
+	unsigned long	pfn_lo; /* Lowest allocated pfn */
+};
+
+struct iova_magazine;
+struct iova_cpu_rcache;
+
+#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
+#define MAX_GLOBAL_MAGS 32	/* magazines per bin */
+
+struct iova_rcache {
+	spinlock_t lock;
+	int depot_size;
+	struct iova_magazine *depot[MAX_GLOBAL_MAGS];
+	struct iova_cpu_rcache __percpu *cpu_rcaches;
 };
 
 /* holds all the iova translations for a domain */
@@ -31,6 +44,7 @@ struct iova_domain {
 	unsigned long	granule;	/* pfn granularity for this domain */
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
+	struct iova_rcache rcaches[IOVA_RANGE_CACHE_MAX_SIZE];	/* IOVA range caches */
 };
 
 static inline unsigned long iova_size(struct iova *iova)
@@ -78,6 +92,10 @@ void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
 	unsigned long limit_pfn,
 	bool size_aligned);
+void free_iova_fast(struct iova_domain *iovad, unsigned long pfn,
+		    unsigned long size);
+unsigned long alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
+			      unsigned long limit_pfn);
 struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
 	unsigned long pfn_hi);
 void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
@@ -87,5 +105,6 @@ struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 struct iova *split_and_remove_iova(struct iova_domain *iovad,
 	struct iova *iova, unsigned long pfn_lo, unsigned long pfn_hi);
+void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 
 #endif
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/7] iommu: only unmap mapped entries
       [not found]     ` <e07164c8d0aaff68cabd2cf8e3aee9ed20882ae4.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
@ 2016-04-13 20:37       ` Shaohua Li
  0 siblings, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2016-04-13 20:37 UTC (permalink / raw)
  To: Adam Morrison
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	gvdl-hpIqsD4AKlfQT0dZR+AlfA, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	Kernel-team-b10kYP2dOMg

On Wed, Apr 13, 2016 at 09:52:00PM +0300, Adam Morrison wrote:
> @@ -3738,7 +3743,16 @@ static void intel_unmap_sg(struct device *dev, struct scatterlist *sglist,
>  			   int nelems, enum dma_data_direction dir,
>  			   struct dma_attrs *attrs)
>  {
> -	intel_unmap(dev, sglist[0].dma_address);
> +	dma_addr_t startaddr = sglist[0].dma_address - sglist[0].offset;
> +	unsigned long nrpages = 0;
> +	struct scatterlist *sg;
> +	int i;
> +
> +	for_each_sg(sglist, sg, nelems, i) {
> +		nrpages += aligned_nrpages(sg->offset, sg->length);
> +	}
> +
> +	intel_unmap(dev, startaddr, nrpages << VTD_PAGE_SHIFT);
>  }

I'm uncomfortable with the raw .dma_address and .length accesses; there are
macros for these, sg_dma_address()/sg_dma_len().  And for .offset, are we sure
it's always valid in .unmap_sg?  Better to use sg_dma_address()/sg_dma_len()
and align them to get the page count, roughly as in the sketch below.
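
Something like this is what I have in mind (just a sketch of the suggestion,
not a tested patch; it relies on aligned_nrpages() already masking off the
sub-page offset carried in the low bits of the DMA address):

static void intel_unmap_sg(struct device *dev, struct scatterlist *sglist,
			   int nelems, enum dma_data_direction dir,
			   struct dma_attrs *attrs)
{
	/* All entries were mapped as one contiguous IOVA range. */
	dma_addr_t startaddr = sg_dma_address(sglist) & VTD_PAGE_MASK;
	unsigned long nrpages = 0;
	struct scatterlist *sg;
	int i;

	/* Sum the page-aligned length of every mapped entry. */
	for_each_sg(sglist, sg, nelems, i)
		nrpages += aligned_nrpages(sg_dma_address(sg), sg_dma_len(sg));

	intel_unmap(dev, startaddr, nrpages << VTD_PAGE_SHIFT);
}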

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found]     ` <b208a304d83088aae7ecac10a3062dc57c0a2f79.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
@ 2016-04-13 20:43       ` Shaohua Li
  2016-04-14 18:26       ` Benjamin Serebrin via iommu
  1 sibling, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2016-04-13 20:43 UTC (permalink / raw)
  To: Adam Morrison
  Cc: serebrin-hpIqsD4AKlfQT0dZR+AlfA,
	dan-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	gvdl-hpIqsD4AKlfQT0dZR+AlfA, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	Kernel-team-b10kYP2dOMg

On Wed, Apr 13, 2016 at 09:52:33PM +0300, Adam Morrison wrote:
> From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
> 
> IOVA allocation has two problems that impede high-throughput I/O.
> First, it can do a linear search over the allocated IOVA ranges.
> Second, the rbtree spinlock that serializes IOVA allocations becomes
> contended.
> 
> Address these problems by creating an API for caching allocated IOVA
> ranges, so that the IOVA allocator isn't accessed frequently.  This
> patch adds a per-CPU cache, from which CPUs can alloc/free IOVAs
> without taking the rbtree spinlock.  The per-CPU caches are backed by
> a global cache, to avoid invoking the (linear-time) IOVA allocator
> without needing to make the per-CPU cache size excessive.  This design
> is based on magazines, as described in "Magazines and Vmem: Extending
> the Slab Allocator to Many CPUs and Arbitrary Resources" (currently
> available at https://www.usenix.org/legacy/event/usenix01/bonwick.html)
> 
> Adding caching on top of the existing rbtree allocator maintains the
> property that IOVAs are densely packed in the IO virtual address space,
> which is important for keeping IOMMU page table usage low.
> 
> To keep the cache size reasonable, we limit caching to ranges of
> size <= 128 KB.  Overall, a CPU can cache at most 32 MB and the global
> cache is bounded by 4 MB.

So the cached case still ignores limit_pfn, as I pointed out before.
This can break drivers if a cached pfn is outside the DMA range the device
can handle, and it can hurt performance for devices because of DAC.  I think
we could have two caches, one for DMA32 and the other for DMA64, and choose
the corresponding cache according to limit_pfn.  For devices with a more
restrictive DMA mask, for example DMA24, just fall back to the slow path;
see the sketch below.
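
Roughly what I'm imagining (purely a sketch: the rcaches32/rcaches64 fields
are hypothetical and don't exist in this patch, and the rbtree fallback is
unchanged):

unsigned long
alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
		unsigned long limit_pfn)
{
	unsigned int log_size = order_base_2(size);
	struct iova_rcache *rcache = NULL;
	unsigned long iova_pfn;
	struct iova *new_iova;

	if (log_size < IOVA_RANGE_CACHE_MAX_SIZE) {
		if (limit_pfn > iovad->dma_32bit_pfn)
			rcache = &iovad->rcaches64[log_size];	/* hypothetical */
		else if (limit_pfn == iovad->dma_32bit_pfn)
			rcache = &iovad->rcaches32[log_size];	/* hypothetical */
		/* else (e.g. DMA24): leave rcache NULL and use the rbtree */

		if (rcache) {
			iova_pfn = iova_rcache_get(rcache);
			if (iova_pfn)
				return iova_pfn;
		}
	}

	new_iova = alloc_iova(iovad, size, limit_pfn, true);
	return new_iova ? new_iova->pfn_lo : 0;
}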

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found]     ` <b208a304d83088aae7ecac10a3062dc57c0a2f79.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
  2016-04-13 20:43       ` Shaohua Li
@ 2016-04-14 18:26       ` Benjamin Serebrin via iommu
       [not found]         ` <CAN+hb0W=+tuQp3cm_VKoU=LKiVQDPMtGrZGq=59rcaWsy2S-+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: Benjamin Serebrin via iommu @ 2016-04-14 18:26 UTC (permalink / raw)
  To: Adam Morrison
  Cc: Dan Tsafrir, Omer Peleg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Shaohua Li,
	Godfrey van der Linden, David Woodhouse, Kernel-team-b10kYP2dOMg

Looks very nice!  Inline comments.

On Wed, Apr 13, 2016 at 11:52 AM, Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org> wrote:
> From: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
>
> IOVA allocation has two problems that impede high-throughput I/O.
> First, it can do a linear search over the allocated IOVA ranges.
> Second, the rbtree spinlock that serializes IOVA allocations becomes
> contended.
>
> Address these problems by creating an API for caching allocated IOVA
> ranges, so that the IOVA allocator isn't accessed frequently.  This
> patch adds a per-CPU cache, from which CPUs can alloc/free IOVAs
> without taking the rbtree spinlock.  The per-CPU caches are backed by
> a global cache, to avoid invoking the (linear-time) IOVA allocator
> without needing to make the per-CPU cache size excessive.  This design
> is based on magazines, as described in "Magazines and Vmem: Extending
> the Slab Allocator to Many CPUs and Arbitrary Resources" (currently
> available at https://www.usenix.org/legacy/event/usenix01/bonwick.html)
>
> Adding caching on top of the existing rbtree allocator maintains the
> property that IOVAs are densely packed in the IO virtual address space,
> which is important for keeping IOMMU page table usage low.
>
> To keep the cache size reasonable, we limit caching to ranges of
> size <= 128 KB.  Overall, a CPU can cache at most 32 MB and the global
> cache is bounded by 4 MB.
>
> Signed-off-by: Omer Peleg <omer-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
> [mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org: rebased, cleaned up and reworded the commit message]
> Signed-off-by: Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
> ---
>  drivers/iommu/intel-iommu.c |  47 ++++--
>  drivers/iommu/iova.c        | 372 +++++++++++++++++++++++++++++++++++++++++---
>  include/linux/iova.h        |  23 ++-
>  3 files changed, 404 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 7a9a3a6..629c5b19 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3357,7 +3357,7 @@ static unsigned long intel_alloc_iova(struct device *dev,
>                                      struct dmar_domain *domain,
>                                      unsigned long nrpages, uint64_t dma_mask)
>  {
> -       struct iova *iova = NULL;
> +       unsigned long iova_pfn = 0;
>
>         /* Restrict dma_mask to the width that the iommu can handle */
>         dma_mask = min_t(uint64_t, DOMAIN_MAX_ADDR(domain->gaw), dma_mask);
> @@ -3370,19 +3370,19 @@ static unsigned long intel_alloc_iova(struct device *dev,
>                  * DMA_BIT_MASK(32) and if that fails then try allocating
>                  * from higher range
>                  */
> -               iova = alloc_iova(&domain->iovad, nrpages,
> -                                 IOVA_PFN(DMA_BIT_MASK(32)), 1);
> -               if (iova)
> -                       return iova->pfn_lo;
> +               iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
> +                                          IOVA_PFN(DMA_BIT_MASK(32)));
> +               if (iova_pfn)
> +                       return iova_pfn;
>         }
> -       iova = alloc_iova(&domain->iovad, nrpages, IOVA_PFN(dma_mask), 1);
> -       if (unlikely(!iova)) {
> +       iova_pfn = alloc_iova_fast(&domain->iovad, nrpages, IOVA_PFN(dma_mask));
> +       if (unlikely(!iova_pfn)) {
>                 pr_err("Allocating %ld-page iova for %s failed",
>                        nrpages, dev_name(dev));
>                 return 0;
>         }
>
> -       return iova->pfn_lo;
> +       return iova_pfn;
>  }
>
>  static struct dmar_domain *__get_valid_domain_for_dev(struct device *dev)
> @@ -3536,7 +3536,7 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
>
>  error:
>         if (iova_pfn)
> -               free_iova(&domain->iovad, iova_pfn);
> +               free_iova_fast(&domain->iovad, iova_pfn, dma_to_mm_pfn(size));
>         pr_err("Device %s request: %zx@%llx dir %d --- failed\n",
>                 dev_name(dev), size, (unsigned long long)paddr, dir);
>         return 0;
> @@ -3591,7 +3591,7 @@ static void flush_unmaps(struct deferred_flush_data *flush_data)
>                                 iommu_flush_dev_iotlb(domain,
>                                                 (uint64_t)iova_pfn << PAGE_SHIFT, mask);
>                         }
> -                       free_iova(&domain->iovad, iova_pfn);
> +                       free_iova_fast(&domain->iovad, iova_pfn, nrpages);
>                         if (freelist)
>                                 dma_free_pagelist(freelist);
>                 }
> @@ -3691,7 +3691,7 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
>                 iommu_flush_iotlb_psi(iommu, domain, start_pfn,
>                                       nrpages, !freelist, 0);
>                 /* free iova */
> -               free_iova(&domain->iovad, iova_pfn);
> +               free_iova_fast(&domain->iovad, iova_pfn, dma_to_mm_pfn(nrpages));
>                 dma_free_pagelist(freelist);
>         } else {
>                 add_unmap(domain, iova_pfn, nrpages, freelist);
> @@ -3849,7 +3849,7 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele
>         if (unlikely(ret)) {
>                 dma_pte_free_pagetable(domain, start_vpfn,
>                                        start_vpfn + size - 1);
> -               free_iova(&domain->iovad, iova_pfn);
> +               free_iova_fast(&domain->iovad, iova_pfn, dma_to_mm_pfn(size));
>                 return 0;
>         }
>
> @@ -4588,6 +4588,28 @@ static struct notifier_block intel_iommu_memory_nb = {
>         .priority = 0
>  };
>
> +static void free_all_cpu_cached_iovas(unsigned int cpu)
> +{
> +       int i;
> +
> +       for (i = 0; i < g_num_of_iommus; i++) {
> +               struct intel_iommu *iommu = g_iommus[i];
> +               struct dmar_domain *domain;
> +               u16 did;
> +
> +               if (!iommu)
> +                       continue;
> +
> +               for (did = 0; did < 0xffff; did++) {
> +                       domain = get_iommu_domain(iommu, did);
> +
> +                       if (!domain)
> +                               continue;
> +                       free_cpu_cached_iovas(cpu, &domain->iovad);
> +               }
> +       }
> +}
> +
>  static int intel_iommu_cpu_notifier(struct notifier_block *nfb,
>                                     unsigned long action, void *v)
>  {
> @@ -4596,6 +4618,7 @@ static int intel_iommu_cpu_notifier(struct notifier_block *nfb,
>         switch (action) {
>         case CPU_DEAD:
>         case CPU_DEAD_FROZEN:
> +               free_all_cpu_cached_iovas(cpu);
>                 flush_unmaps_timeout(cpu);
>                 break;
>         }
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index fa0adef..db3b914 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -20,6 +20,16 @@
>  #include <linux/iova.h>
>  #include <linux/module.h>
>  #include <linux/slab.h>
> +#include <linux/smp.h>
> +#include <linux/bitops.h>
> +
> +static bool iova_rcache_insert(struct iova_domain *iovad,
> +                              struct iova_rcache *rcache,
> +                              unsigned long iova_pfn);
> +static unsigned long iova_rcache_get(struct iova_rcache *rcache);
> +
> +static void init_iova_rcaches(struct iova_domain *iovad);
> +static void free_iova_rcaches(struct iova_domain *iovad);
>
>  void
>  init_iova_domain(struct iova_domain *iovad, unsigned long granule,
> @@ -38,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>         iovad->granule = granule;
>         iovad->start_pfn = start_pfn;
>         iovad->dma_32bit_pfn = pfn_32bit;
> +       init_iova_rcaches(iovad);
>  }
>  EXPORT_SYMBOL_GPL(init_iova_domain);
>
> @@ -291,33 +302,18 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
>  }
>  EXPORT_SYMBOL_GPL(alloc_iova);
>
> -/**
> - * find_iova - find's an iova for a given pfn
> - * @iovad: - iova domain in question.
> - * @pfn: - page frame number
> - * This function finds and returns an iova belonging to the
> - * given doamin which matches the given pfn.
> - */
> -struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
> +static struct iova *
> +private_find_iova(struct iova_domain *iovad, unsigned long pfn)
>  {
> -       unsigned long flags;
> -       struct rb_node *node;
> +       struct rb_node *node = iovad->rbroot.rb_node;
> +
> +       assert_spin_locked(&iovad->iova_rbtree_lock);
>
> -       /* Take the lock so that no other thread is manipulating the rbtree */
> -       spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> -       node = iovad->rbroot.rb_node;
>         while (node) {
>                 struct iova *iova = container_of(node, struct iova, node);
>
>                 /* If pfn falls within iova's range, return iova */
>                 if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
> -                       spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> -                       /* We are not holding the lock while this iova
> -                        * is referenced by the caller as the same thread
> -                        * which called this function also calls __free_iova()
> -                        * and it is by design that only one thread can possibly
> -                        * reference a particular iova and hence no conflict.
> -                        */
>                         return iova;
>                 }
>
> @@ -327,9 +323,35 @@ struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
>                         node = node->rb_right;
>         }
>
> -       spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>         return NULL;
>  }
> +
> +static void private_free_iova(struct iova_domain *iovad, struct iova *iova)
> +{
> +       assert_spin_locked(&iovad->iova_rbtree_lock);
> +       __cached_rbnode_delete_update(iovad, iova);
> +       rb_erase(&iova->node, &iovad->rbroot);
> +       free_iova_mem(iova);
> +}
> +
> +/**
> + * find_iova - finds an iova for a given pfn
> + * @iovad: - iova domain in question.
> + * @pfn: - page frame number
> + * This function finds and returns an iova belonging to the
> + * given doamin which matches the given pfn.
> + */
> +struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
> +{
> +       unsigned long flags;
> +       struct iova *iova;
> +
> +       /* Take the lock so that no other thread is manipulating the rbtree */
> +       spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +       iova = private_find_iova(iovad, pfn);
> +       spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> +       return iova;
> +}
>  EXPORT_SYMBOL_GPL(find_iova);
>
>  /**
> @@ -344,10 +366,8 @@ __free_iova(struct iova_domain *iovad, struct iova *iova)
>         unsigned long flags;
>
>         spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> -       __cached_rbnode_delete_update(iovad, iova);
> -       rb_erase(&iova->node, &iovad->rbroot);
> +       private_free_iova(iovad, iova);
>         spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> -       free_iova_mem(iova);
>  }
>  EXPORT_SYMBOL_GPL(__free_iova);
>
> @@ -370,6 +390,57 @@ free_iova(struct iova_domain *iovad, unsigned long pfn)
>  EXPORT_SYMBOL_GPL(free_iova);
>
>  /**
> + * alloc_iova_fast - allocates an iova from rcache
> + * @iovad: - iova domain in question
> + * @size: - size of page frames to allocate
> + * @limit_pfn: - max limit address
> + * This function tries to satisfy an iova allocation from the rcache,
> + * and falls back to regular allocation on failure.
> +*/
> +unsigned long
> +alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
> +               unsigned long limit_pfn)
> +{
> +       unsigned int log_size = fls_long(size - 1);

In free_iova_fast, order_base_2(size) is used. Why differ here?

> +       unsigned long iova_pfn;
> +       struct iova *new_iova;
> +
> +       if (log_size < IOVA_RANGE_CACHE_MAX_SIZE) {
> +               iova_pfn = iova_rcache_get(&iovad->rcaches[log_size]);
> +               if (iova_pfn)
> +                       return iova_pfn;
> +       }
> +
> +       new_iova = alloc_iova(iovad, size, limit_pfn, true);
> +       if (!new_iova)
> +               return 0;

It was pointed out that DMA_32 or _24 (or any other non-64 mask size) could
be starved if the magazines on all cores are full and the depot is empty.
(This gets more probable with increased core count.)  You could try one
more time: call free_iova_rcaches() and retry alloc_iova() before giving up.

> +
> +       return new_iova->pfn_lo;
> +}
> +EXPORT_SYMBOL_GPL(alloc_iova_fast);
> +
> +/**
> + * free_iova_fast - free iova pfn range into rcache
> + * @iovad: - iova domain in question.
> + * @pfn: - pfn that is allocated previously
> + * @size: - # of pages in range
> + * This functions frees an iova range by trying to put it into the rcache,
> + * falling back to regular iova deallocation via free_iova() if this fails.
> + */
> +void
> +free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size)
> +{
> +       unsigned int log_size = order_base_2(size);
> +
> +       if (log_size < IOVA_RANGE_CACHE_MAX_SIZE &&
> +           iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn))
> +               return;
> +
> +       free_iova(iovad, pfn);
> +}
> +EXPORT_SYMBOL_GPL(free_iova_fast);
> +
> +/**
>   * put_iova_domain - destroys the iova doamin
>   * @iovad: - iova domain in question.
>   * All the iova's in that domain are destroyed.
> @@ -379,6 +450,7 @@ void put_iova_domain(struct iova_domain *iovad)
>         struct rb_node *node;
>         unsigned long flags;
>
> +       free_iova_rcaches(iovad);
>         spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>         node = rb_first(&iovad->rbroot);
>         while (node) {
> @@ -550,5 +622,257 @@ error:
>         return NULL;
>  }
>
> +/*
> + * Magazine caches for IOVA ranges.  For an introduction to magazines,
> + * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab
> + * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams.
> + * For simplicity, we use a static magazine size and don't implement the
> + * dynamic size tuning described in the paper.
> + */
> +
> +#define IOVA_MAG_SIZE 128
> +
> +struct iova_magazine {
> +       unsigned long size;
> +       unsigned long pfns[IOVA_MAG_SIZE];
> +};
> +
> +struct iova_cpu_rcache {
> +       struct iova_magazine *loaded;
> +       struct iova_magazine *prev;
> +};
> +
> +static struct iova_magazine *iova_magazine_alloc(gfp_t flags)
> +{
> +       return kzalloc(sizeof(struct iova_magazine), flags);
> +}
> +
> +static void iova_magazine_free(struct iova_magazine *mag)
> +{
> +       kfree(mag);
> +}
> +
> +static void
> +iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
> +{
> +       unsigned long flags;
> +       int i;
> +
> +       if (!mag)
> +               return;
> +
> +       spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +
> +       for (i = 0 ; i < mag->size; ++i) {
> +               struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
> +
> +               BUG_ON(!iova);
> +               private_free_iova(iovad, iova);
> +       }
> +
> +       spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> +
> +       mag->size = 0;
> +}
> +
> +static bool iova_magazine_full(struct iova_magazine *mag)
> +{
> +       return (mag && mag->size == IOVA_MAG_SIZE);
> +}
> +
> +static bool iova_magazine_empty(struct iova_magazine *mag)
> +{
> +       return (!mag || mag->size == 0);
> +}
> +
> +static unsigned long iova_magazine_pop(struct iova_magazine *mag)
> +{
> +       BUG_ON(iova_magazine_empty(mag));
> +
> +       return mag->pfns[--mag->size];
> +}
> +
> +static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)
> +{
> +       BUG_ON(iova_magazine_full(mag));
> +
> +       mag->pfns[mag->size++] = pfn;
> +}
> +
> +static void init_iova_rcaches(struct iova_domain *iovad)
> +{
> +       struct iova_cpu_rcache *cpu_rcache;
> +       struct iova_rcache *rcache;
> +       unsigned int cpu;
> +       int i;
> +
> +       for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
> +               rcache = &iovad->rcaches[i];
> +               spin_lock_init(&rcache->lock);
> +               rcache->depot_size = 0;
> +               rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
> +               if (WARN_ON(!rcache->cpu_rcaches))
> +                       continue;
> +               for_each_possible_cpu(cpu) {
> +                       cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
> +                       cpu_rcache->loaded = iova_magazine_alloc(GFP_KERNEL);
> +                       cpu_rcache->prev = iova_magazine_alloc(GFP_KERNEL);
> +               }
> +       }
> +}
> +
> +/*
> + * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and
> + * return true on success.  Can fail if rcache is full and we can't free
> + * space, and free_iova() (our only caller) will then return the IOVA
> + * range to the rbtree instead.
> + */
> +static bool iova_rcache_insert(struct iova_domain *iovad,
> +                              struct iova_rcache *rcache,
> +                              unsigned long iova_pfn)
> +{
> +       struct iova_magazine *mag_to_free = NULL;
> +       struct iova_cpu_rcache *cpu_rcache;
> +       bool can_insert = false;
> +
> +       preempt_disable();
> +       cpu_rcache = this_cpu_ptr(rcache->cpu_rcaches);
> +
> +       if (!iova_magazine_full(cpu_rcache->loaded)) {
> +               can_insert = true;
> +       } else if (!iova_magazine_full(cpu_rcache->prev)) {
> +               swap(cpu_rcache->prev, cpu_rcache->loaded);
> +               can_insert = true;
> +       } else {
> +               struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
> +
> +               if (new_mag) {
> +                       unsigned long flags;
> +
> +                       spin_lock_irqsave(&rcache->lock, flags);
> +                       if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> +                               rcache->depot[rcache->depot_size++] =
> +                                               cpu_rcache->loaded;
> +                       } else {
> +                               mag_to_free = cpu_rcache->loaded;
> +                       }
> +                       spin_unlock_irqrestore(&rcache->lock, flags);
> +
> +                       cpu_rcache->loaded = new_mag;
> +                       can_insert = true;
> +               }
> +       }
> +
> +       if (can_insert)
> +               iova_magazine_push(cpu_rcache->loaded, iova_pfn);
> +
> +       preempt_enable();
> +
> +       if (mag_to_free) {
> +               iova_magazine_free_pfns(mag_to_free, iovad);
> +               iova_magazine_free(mag_to_free);
> +       }
> +
> +       return can_insert;
> +}
> +
> +/*
> + * Caller wants to allocate a new IOVA range from 'rcache'.  If we can
> + * satisfy the request, return a matching non-NULL range and remove
> + * it from the 'rcache'.
> + */
> +static unsigned long iova_rcache_get(struct iova_rcache *rcache)
> +{
> +       struct iova_cpu_rcache *cpu_rcache;
> +       unsigned long iova_pfn = 0;
> +       bool has_pfn = false;
> +
> +       preempt_disable();
> +       cpu_rcache = this_cpu_ptr(rcache->cpu_rcaches);
> +
> +       if (!iova_magazine_empty(cpu_rcache->loaded)) {
> +               has_pfn = true;
> +       } else if (!iova_magazine_empty(cpu_rcache->prev)) {
> +               swap(cpu_rcache->prev, cpu_rcache->loaded);
> +               has_pfn = true;
> +       } else {
> +               unsigned long flags;
> +
> +               spin_lock_irqsave(&rcache->lock, flags);
> +               if (rcache->depot_size > 0) {
> +                       iova_magazine_free(cpu_rcache->loaded);
> +                       cpu_rcache->loaded = rcache->depot[--rcache->depot_size];
> +                       has_pfn = true;
> +               }
> +               spin_unlock_irqrestore(&rcache->lock, flags);
> +       }
> +
> +       if (has_pfn)
> +               iova_pfn = iova_magazine_pop(cpu_rcache->loaded);
> +
> +       preempt_enable();
> +
> +       return iova_pfn;
> +}
> +
> +/*
> + * free a cpu's rcache; assumes it cannot race with the cpu accessing
> + * its rcache.
> + */
> +static void free_cpu_iova_rcache(unsigned int cpu, struct iova_domain *iovad,
> +                                struct iova_rcache *rcache)
> +{
> +       struct iova_cpu_rcache *cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
> +
> +       /* No one else touches cpu caches when they're freed */
> +       iova_magazine_free_pfns(cpu_rcache->loaded, iovad);
> +       iova_magazine_free(cpu_rcache->loaded);
> +
> +       iova_magazine_free_pfns(cpu_rcache->prev, iovad);
> +       iova_magazine_free(cpu_rcache->prev);
> +}
> +
> +/*
> + * reset entire rcache data structure.
> + */
> +static void free_iova_rcaches(struct iova_domain *iovad)
> +{
> +       struct iova_rcache *rcache;
> +       unsigned long flags;
> +       unsigned int cpu;
> +       int i, j;
> +
> +       for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
> +               rcache = &iovad->rcaches[i];
> +               for_each_possible_cpu(cpu)
> +                       free_cpu_iova_rcache(cpu, iovad, rcache);
> +               spin_lock_irqsave(&rcache->lock, flags);
> +               free_percpu(rcache->cpu_rcaches);
> +               for (j = 0; j < rcache->depot_size; ++j) {
> +                       iova_magazine_free_pfns(rcache->depot[j], iovad);
> +                       iova_magazine_free(rcache->depot[j]);
> +               }
> +               spin_unlock_irqrestore(&rcache->lock, flags);
> +       }
> +}
> +
> +/*
> + * free all the IOVA ranges cached by a cpu (used when cpu is unplugged)
> + */
> +void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
> +{
> +       struct iova_cpu_rcache *cpu_rcache;
> +       struct iova_rcache *rcache;
> +       int i;
> +
> +       for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
> +               rcache = &iovad->rcaches[i];
> +               cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
> +               iova_magazine_free_pfns(cpu_rcache->loaded, iovad);
> +               iova_magazine_free_pfns(cpu_rcache->prev, iovad);
> +       }
> +}
> +
> +
>  MODULE_AUTHOR("Anil S Keshavamurthy <anil.s.keshavamurthy-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>");
>  MODULE_LICENSE("GPL");
> diff --git a/include/linux/iova.h b/include/linux/iova.h
> index 92f7177..0652cdd 100644
> --- a/include/linux/iova.h
> +++ b/include/linux/iova.h
> @@ -19,8 +19,21 @@
>  /* iova structure */
>  struct iova {
>         struct rb_node  node;
> -       unsigned long   pfn_hi; /* IOMMU dish out addr hi */
> -       unsigned long   pfn_lo; /* IOMMU dish out addr lo */
> +       unsigned long   pfn_hi; /* Highest allocated pfn */
> +       unsigned long   pfn_lo; /* Lowest allocated pfn */
> +};
> +
> +struct iova_magazine;
> +struct iova_cpu_rcache;
> +
> +#define IOVA_RANGE_CACHE_MAX_SIZE 6    /* log of max cached IOVA range size (in pages) */
> +#define MAX_GLOBAL_MAGS 32     /* magazines per bin */
> +
> +struct iova_rcache {
> +       spinlock_t lock;
> +       int depot_size;
> +       struct iova_magazine *depot[MAX_GLOBAL_MAGS];
> +       struct iova_cpu_rcache __percpu *cpu_rcaches;
>  };
>
>  /* holds all the iova translations for a domain */
> @@ -31,6 +44,7 @@ struct iova_domain {
>         unsigned long   granule;        /* pfn granularity for this domain */
>         unsigned long   start_pfn;      /* Lower limit for this domain */
>         unsigned long   dma_32bit_pfn;
> +       struct iova_rcache rcaches[IOVA_RANGE_CACHE_MAX_SIZE];  /* IOVA range caches */
>  };
>
>  static inline unsigned long iova_size(struct iova *iova)
> @@ -78,6 +92,10 @@ void __free_iova(struct iova_domain *iovad, struct iova *iova);
>  struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
>         unsigned long limit_pfn,
>         bool size_aligned);
> +void free_iova_fast(struct iova_domain *iovad, unsigned long pfn,
> +                   unsigned long size);
> +unsigned long alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
> +                             unsigned long limit_pfn);
>  struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
>         unsigned long pfn_hi);
>  void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
> @@ -87,5 +105,6 @@ struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
>  void put_iova_domain(struct iova_domain *iovad);
>  struct iova *split_and_remove_iova(struct iova_domain *iovad,
>         struct iova *iova, unsigned long pfn_lo, unsigned long pfn_hi);
> +void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
>
>  #endif
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found]         ` <CAN+hb0W=+tuQp3cm_VKoU=LKiVQDPMtGrZGq=59rcaWsy2S-+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-04-14 21:05           ` Adam Morrison
       [not found]             ` <CAHMfzJmjZWeUpmTVb-Z7NMJUp0N84ZK4zwUWdAKHv4sd4TXPMg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Adam Morrison @ 2016-04-14 21:05 UTC (permalink / raw)
  To: Benjamin Serebrin
  Cc: Dan Tsafrir, Omer Peleg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Shaohua Li,
	Godfrey van der Linden, David Woodhouse, Kernel-team-b10kYP2dOMg

On Thu, Apr 14, 2016 at 9:26 PM, Benjamin Serebrin via iommu
<iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org> wrote:

> It was pointed out that DMA_32 or _24 (or anything other non-64 size)
> could be starved if the magazines on all cores are full and the depot
> is empty.  (This gets more probable with increased core count.)  You
> could try one more time: call free_iova_rcaches() and try alloc_iova
> again before giving up

That's not safe, unfortunately.  free_iova_rcaches() is meant to be
called only when the domain is dying and the CPUs won't access the
rcaches.

It's tempting to make the rcaches work only for DMA_64 allocations.
This would also solve the problem of respecting limit_pfn when
allocating, which Shaohua Li pointed out.  Sadly, intel-iommu.c
converts DMA_64 to DMA_32 by default, apparently to avoid dual address
cycles on the PCI bus.  I wonder about the importance of this, though,
as it doesn't seem that anything equivalent happens when iommu=off.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found]             ` <CAHMfzJmjZWeUpmTVb-Z7NMJUp0N84ZK4zwUWdAKHv4sd4TXPMg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-04-14 21:18               ` Benjamin Serebrin via iommu
       [not found]                 ` <CAN+hb0WOaokFYc6C+mR6rdj4WwmMUSzDHDZngfUvy-5cEve_-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Benjamin Serebrin via iommu @ 2016-04-14 21:18 UTC (permalink / raw)
  To: Adam Morrison
  Cc: Dan Tsafrir, Omer Peleg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Shaohua Li,
	Godfrey van der Linden, David Woodhouse, Kernel-team-b10kYP2dOMg

On Thu, Apr 14, 2016 at 2:05 PM, Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org> wrote:
> On Thu, Apr 14, 2016 at 9:26 PM, Benjamin Serebrin via iommu
> <iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org> wrote:
>
>> It was pointed out that DMA_32 or _24 (or anything other non-64 size)
>> could be starved if the magazines on all cores are full and the depot
>> is empty.  (This gets more probable with increased core count.)  You
>> could try one more time: call free_iova_rcaches() and try alloc_iova
>> again before giving up
>
> That's not safe, unfortunately.  free_iova_rcaches() is meant to be
> called only when the domain is dying and the CPUs won't access the
> rcaches.

Fair enough.  Is it possible to make this safe, cleanly and without
too much locking in the common case?

> It's tempting to make the rcaches work only for DMA_64 allocations.
> This would also solve the problem of respecting the pfn_limit when
> allocating, which Shaohua Li pointed out.  Sadly, intel-iommu.c
> converts DMA_64 to DMA_32 by default, apparently to avoid dual address
> cycles on the PCI bus.  I wonder about the importance of this, though,
> as it doesn't seem that anything equivalent happens when iommu=off.

I agree.  It's tempting to make all DMA_64 allocations grow up from
4G, leaving the entire 32 bit space free for small allocations.  I'd
be willing to argue that that should be the default, with some
override for anyone who finds it objectionable.

Dual address cycle is really "4 more bytes in the TLP header" on PCIe;
a request with a 32-bit address uses a 3-doubleword header (12 bytes) while
a 64-bit address needs a 4-DW header (16 bytes).  What's 25% of a read
request between friends?
And every read request has a read response 3DW TLP plus its data, so
the aggregate bandwidth consumed is getting uninteresting.  Similarly
for writes, the additional address bytes don't cost a large
percentage.

That being said, it's a rare device that needs more than 4GB of active
address space, and it's a rare system that needs to mix a
performance-critical DMA_32 (or 24) and _64 device in the same page
table.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found]                 ` <CAN+hb0WOaokFYc6C+mR6rdj4WwmMUSzDHDZngfUvy-5cEve_-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-04-14 21:33                   ` Shaohua Li
       [not found]                     ` <20160414213326.GA474260-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2016-04-14 21:33 UTC (permalink / raw)
  To: Benjamin Serebrin
  Cc: Dan Tsafrir, Omer Peleg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Adam Morrison,
	Godfrey van der Linden, David Woodhouse, Kernel-team-b10kYP2dOMg

On Thu, Apr 14, 2016 at 02:18:32PM -0700, Benjamin Serebrin wrote:
> On Thu, Apr 14, 2016 at 2:05 PM, Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org> wrote:
> > On Thu, Apr 14, 2016 at 9:26 PM, Benjamin Serebrin via iommu
> > <iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org> wrote:
> >
> >> It was pointed out that DMA_32 or _24 (or anything other non-64 size)
> >> could be starved if the magazines on all cores are full and the depot
> >> is empty.  (This gets more probable with increased core count.)  You
> >> could try one more time: call free_iova_rcaches() and try alloc_iova
> >> again before giving up
> >
> > That's not safe, unfortunately.  free_iova_rcaches() is meant to be
> > called only when the domain is dying and the CPUs won't access the
> > rcaches.
> 
> Fair enough.  Is it possible to make this safe, cleanly and without
> too much locking during the normal case?
> 
> > It's tempting to make the rcaches work only for DMA_64 allocations.
> > This would also solve the problem of respecting the pfn_limit when
> > allocating, which Shaohua Li pointed out.  Sadly, intel-iommu.c
> > converts DMA_64 to DMA_32 by default, apparently to avoid dual address
> > cycles on the PCI bus.  I wonder about the importance of this, though,
> > as it doesn't seem that anything equivalent happens when iommu=off.
> 
> I agree.  It's tempting to make all DMA_64 allocations grow up from
> 4G, leaving the entire 32 bit space free for small allocations.  I'd
> be willing to argue that that should be the default, with some
> override for anyone who finds it objectionable.
> 
> Dual address cycle is really "4 more bytes in the TLP header" on PCIe;
> a 32-bit address takes 3 doublewords (12 bytes) while a 64-bit address
> takes 4 DW (16 bytes).  What's 25% of a read request between friends?
> And every read request has a read response 3DW TLP plus its data, so
> the aggregate bandwidth consumed is getting uninteresting.  Similarly
> for writes, the additional address bytes don't cost a large
> percentage.
> 
> That being said, it's a rare device that needs more than 4GB of active
> address space, and it's a rare system that needs to mix a
> performance-critical DMA_32 (or 24) and _64 device in the same page
> table.

I'm not sure about the TLP overhead.  The IOMMU is not only for PCIe
devices: there are PCIe-to-PCI-X/PCI bridges, and any PCI device can reside
behind one.  Such a device might not handle DMA_64, and IIRC DAC has overhead
for PCI-X devices, which somebody might care about.  So let's not break such
devices.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found]                     ` <20160414213326.GA474260-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2016-04-15  4:59                       ` Benjamin Serebrin via iommu
       [not found]                         ` <CAN+hb0WRDYCpY8xoUvvGu4SSD83F9VTMW=9W=xfjYtJV-dijmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Benjamin Serebrin via iommu @ 2016-04-15  4:59 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Dan Tsafrir, Omer Peleg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Adam Morrison,
	Godfrey van der Linden, David Woodhouse, Kernel-team-b10kYP2dOMg


On Thu, Apr 14, 2016 at 2:33 PM, Shaohua Li <shli-b10kYP2dOMg@public.gmane.org> wrote:

> On Thu, Apr 14, 2016 at 02:18:32PM -0700, Benjamin Serebrin wrote:
> > On Thu, Apr 14, 2016 at 2:05 PM, Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
> wrote:
> > > On Thu, Apr 14, 2016 at 9:26 PM, Benjamin Serebrin via iommu
> > > <iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org> wrote:
> > >
> > >> It was pointed out that DMA_32 or _24 (or anything other non-64 size)
> > >> could be starved if the magazines on all cores are full and the depot
> > >> is empty.  (This gets more probable with increased core count.)  You
> > >> could try one more time: call free_iova_rcaches() and try alloc_iova
> > >> again before giving up
> > >
> > > That's not safe, unfortunately.  free_iova_rcaches() is meant to be
> > > called only when the domain is dying and the CPUs won't access the
> > > rcaches.
> >
> > Fair enough.  Is it possible to make this safe, cleanly and without
> > too much locking during the normal case?
> >
> > > It's tempting to make the rcaches work only for DMA_64 allocations.
> > > This would also solve the problem of respecting the pfn_limit when
> > > allocating, which Shaohua Li pointed out.  Sadly, intel-iommu.c
> > > converts DMA_64 to DMA_32 by default, apparently to avoid dual address
> > > cycles on the PCI bus.  I wonder about the importance of this, though,
> > > as it doesn't seem that anything equivalent happens when iommu=off.
> >
> > I agree.  It's tempting to make all DMA_64 allocations grow up from
> > 4G, leaving the entire 32 bit space free for small allocations.  I'd
> > be willing to argue that that should be the default, with some
> > override for anyone who finds it objectionable.
> >
> > Dual address cycle is really "4 more bytes in the TLP header" on PCIe;
> > a 32-bit address takes 3 doublewords (12 bytes) while a 64-bit address
> > takes 4 DW (16 bytes).  What's 25% of a read request between friends?
> > And every read request has a read response 3DW TLP plus its data, so
> > the aggregate bandwidth consumed is getting uninteresting.  Similarly
> > for writes, the additional address bytes don't cost a large
> > percentage.
> >
> > That being said, it's a rare device that needs more than 4GB of active
> > address space, and it's a rare system that needs to mix a
> > performance-critical DMA_32 (or 24) and _64 device in the same page
> > table.
>
> I'm not sure about the TLP overhead. IOMMU is not only for pcie device.
> there are pcie-to-pcix/pci bridge, any pci device can reside behind it.
> The device might not handle DMA_64. DAC has overhead for pcix device
> iirc, which somebody might care about. So let's not break such devices.
>
> Thanks,
> Shaohua
>

Thanks, Shaohua.

As Adam mentioned, with iommu=off there's no enforcement that keeps
any PCIe address below 4GB.  If you have a system with DRAM above 4GB and
you're using any of the iommu-disabled or 1:1 (passthrough) settings, devices
already see 64-bit DMA addresses.  So the change in allocation policy would
not add any new failure mode; we're already exposed to it today.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found]                         ` <CAN+hb0WRDYCpY8xoUvvGu4SSD83F9VTMW=9W=xfjYtJV-dijmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-04-15 17:55                           ` Shaohua Li
       [not found]                             ` <20160415175520.GA2644484-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2016-04-15 17:55 UTC (permalink / raw)
  To: Benjamin Serebrin
  Cc: Dan Tsafrir, Omer Peleg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Adam Morrison,
	Godfrey van der Linden, David Woodhouse, Kernel-team-b10kYP2dOMg

On Thu, Apr 14, 2016 at 09:59:39PM -0700, Benjamin Serebrin wrote:
> On Thu, Apr 14, 2016 at 2:33 PM, Shaohua Li <shli-b10kYP2dOMg@public.gmane.org> wrote:
> 
> > On Thu, Apr 14, 2016 at 02:18:32PM -0700, Benjamin Serebrin wrote:
> > > On Thu, Apr 14, 2016 at 2:05 PM, Adam Morrison <mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
> > wrote:
> > > > On Thu, Apr 14, 2016 at 9:26 PM, Benjamin Serebrin via iommu
> > > > <iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org> wrote:
> > > >
> > > >> It was pointed out that DMA_32 or _24 (or anything other non-64 size)
> > > >> could be starved if the magazines on all cores are full and the depot
> > > >> is empty.  (This gets more probable with increased core count.)  You
> > > >> could try one more time: call free_iova_rcaches() and try alloc_iova
> > > >> again before giving up
> > > >
> > > > That's not safe, unfortunately.  free_iova_rcaches() is meant to be
> > > > called only when the domain is dying and the CPUs won't access the
> > > > rcaches.
> > >
> > > Fair enough.  Is it possible to make this safe, cleanly and without
> > > too much locking during the normal case?
> > >
> > > > It's tempting to make the rcaches work only for DMA_64 allocations.
> > > > This would also solve the problem of respecting the pfn_limit when
> > > > allocating, which Shaohua Li pointed out.  Sadly, intel-iommu.c
> > > > converts DMA_64 to DMA_32 by default, apparently to avoid dual address
> > > > cycles on the PCI bus.  I wonder about the importance of this, though,
> > > > as it doesn't seem that anything equivalent happens when iommu=off.
> > >
> > > I agree.  It's tempting to make all DMA_64 allocations grow up from
> > > 4G, leaving the entire 32 bit space free for small allocations.  I'd
> > > be willing to argue that that should be the default, with some
> > > override for anyone who finds it objectionable.
> > >
> > > Dual address cycle is really "4 more bytes in the TLP header" on PCIe;
> > > a 32-bit address takes 3 doublewords (12 bytes) while a 64-bit address
> > > takes 4 DW (16 bytes).  What's 25% of a read request between friends?
> > > And every read request has a read response 3DW TLP plus its data, so
> > > the aggregate bandwidth consumed is getting uninteresting.  Similarly
> > > for writes, the additional address bytes don't cost a large
> > > percentage.
> > >
> > > That being said, it's a rare device that needs more than 4GB of active
> > > address space, and it's a rare system that needs to mix a
> > > performance-critical DMA_32 (or 24) and _64 device in the same page
> > > table.
> >
> > I'm not sure about the TLP overhead. IOMMU is not only for pcie device.
> > there are pcie-to-pcix/pci bridge, any pci device can reside behind it.
> > The device might not handle DMA_64. DAC has overhead for pcix device
> > iirc, which somebody might care about. So let's not break such devices.
> >
> > Thanks,
> > Shaohua
> >
> 
> Thanks, Shaohua.
> 
> As Adam mentioned, in iommu=off cases, there's no enforcement that keeps
> any PCIe address below 4GB.  If you have a system with DRAM addresses above
> 4GB and you're using any of the iommu disabled or 1:1 settings, you'll
> encounter this.  So the change in allocation policy would not add any new
> failure mode; we're already going to encounter it today.

Fair enough, it makes sense to ignore the DAC overhead.  My point is that
the cached case should respect the limit, since devices might not be able to
handle DMA64; even without an IOMMU we use GFP_DMA32 or swiotlb to guarantee
that the DMA address is in range.  As for the 'force DMA32' logic in
intel-iommu, it has been in the code since day one and I can't remember the
reason; my guess is that it works around driver/device bugs.  I'd suggest
just deleting that logic and letting the cache handle only DMA64.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation
       [not found]                             ` <20160415175520.GA2644484-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
@ 2016-04-17 18:05                               ` Adam Morrison
  0 siblings, 0 replies; 17+ messages in thread
From: Adam Morrison @ 2016-04-17 18:05 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Benjamin Serebrin, Dan Tsafrir, Omer Peleg,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Godfrey van der Linden, David Woodhouse, Kernel-team-b10kYP2dOMg

On Fri, Apr 15, 2016 at 8:55 PM, Shaohua Li <shli-b10kYP2dOMg@public.gmane.org> wrote:

> Fair enough, it makes sense to ignore the DAC overhead. My point is the
> cache case should respect the limit, since devices might not be able to
> handle DMA64. Even without IOMMU, we use GFP_DMA32 or swiotlb to
> guarantee DMA address is in range.

Absolutely, the cache should respect the limit.  That'll be fixed in
the next version; sorry I forgot to do it in this revision.
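
One shape that fix could take (a sketch only; the limit_pfn parameter doesn't
exist in this version's iova_magazine_pop()/iova_rcache_get(), and
alloc_iova_fast() would need to pass its limit down):

static unsigned long iova_magazine_pop(struct iova_magazine *mag,
				       unsigned long limit_pfn)
{
	int i;

	BUG_ON(iova_magazine_empty(mag));

	/* Hand out the most recently cached pfn the device can address. */
	for (i = mag->size - 1; i >= 0; i--) {
		if (mag->pfns[i] <= limit_pfn) {
			unsigned long pfn = mag->pfns[i];

			/* Fill the hole with the last entry and shrink. */
			mag->pfns[i] = mag->pfns[--mag->size];
			return pfn;
		}
	}

	return 0;	/* nothing suitable cached; fall back to alloc_iova() */
}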

> For the 'force DMA32' logic in
> intel-iommu, it's in the code in day one, but I can't remember the
> reason. My guess is it's to workaround driver/device bugs. I'd suggest
> just deleting the logic and let cache only handle DMA64.

Having looked at this further, DMA64 limits get cut down to the max
address the IOMMU supports (in intel_alloc_iova()).  In principle,
DMA32 (or close to it) might end up as the limit even for DMA64.  So
it seems cleaner to just flush the rcache if we can't allocate; that
takes care of everything.
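
Concretely, something along these lines is what I have in mind for the next
version (a sketch that assumes purging another CPU's cache is made safe,
e.g. with a per-CPU lock; the free_cpu_cached_iovas() in this version is
only safe for a dead CPU):

unsigned long
alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
		unsigned long limit_pfn)
{
	unsigned int log_size = order_base_2(size);
	bool flushed_rcache = false;
	unsigned long iova_pfn;
	struct iova *new_iova;
	unsigned int cpu;

	if (log_size < IOVA_RANGE_CACHE_MAX_SIZE) {
		iova_pfn = iova_rcache_get(&iovad->rcaches[log_size]);
		if (iova_pfn)
			return iova_pfn;
	}

retry:
	new_iova = alloc_iova(iovad, size, limit_pfn, true);
	if (!new_iova) {
		if (flushed_rcache)
			return 0;
		/* Return every cached range to the rbtree and retry once. */
		flushed_rcache = true;
		for_each_online_cpu(cpu)
			free_cpu_cached_iovas(cpu, iovad);
		goto retry;
	}

	return new_iova->pfn_lo;
}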

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-04-17 18:05 UTC | newest]

Thread overview: 17+ messages
2016-04-13 18:50 [PATCH v2 0/7] Intel IOMMU scalability improvements Adam Morrison
     [not found] ` <cover.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
2016-04-13 18:51   ` [PATCH v2 1/7] iommu: refactoring of deferred flush entries Adam Morrison
2016-04-13 18:51   ` [PATCH v2 2/7] iommu: per-cpu deferred invalidation queues Adam Morrison
2016-04-13 18:51   ` [PATCH v2 3/7] iommu: correct flush_unmaps pfn usage Adam Morrison
2016-04-13 18:52   ` [PATCH v2 4/7] iommu: only unmap mapped entries Adam Morrison
     [not found]     ` <e07164c8d0aaff68cabd2cf8e3aee9ed20882ae4.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
2016-04-13 20:37       ` Shaohua Li
2016-04-13 18:52   ` [PATCH v2 5/7] iommu: avoid dev iotlb logic in intel-iommu for domains with no dev iotlbs Adam Morrison
2016-04-13 18:52   ` [PATCH v2 6/7] iommu: change intel-iommu to use IOVA frame numbers Adam Morrison
2016-04-13 18:52   ` [PATCH v2 7/7] iommu: introduce per-cpu caching to iova allocation Adam Morrison
     [not found]     ` <b208a304d83088aae7ecac10a3062dc57c0a2f79.1460548546.git.mad-FrESSTt7Abv7r6psnUbsSmZHpeb/A1Y/@public.gmane.org>
2016-04-13 20:43       ` Shaohua Li
2016-04-14 18:26       ` Benjamin Serebrin via iommu
     [not found]         ` <CAN+hb0W=+tuQp3cm_VKoU=LKiVQDPMtGrZGq=59rcaWsy2S-+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-04-14 21:05           ` Adam Morrison
     [not found]             ` <CAHMfzJmjZWeUpmTVb-Z7NMJUp0N84ZK4zwUWdAKHv4sd4TXPMg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-04-14 21:18               ` Benjamin Serebrin via iommu
     [not found]                 ` <CAN+hb0WOaokFYc6C+mR6rdj4WwmMUSzDHDZngfUvy-5cEve_-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-04-14 21:33                   ` Shaohua Li
     [not found]                     ` <20160414213326.GA474260-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2016-04-15  4:59                       ` Benjamin Serebrin via iommu
     [not found]                         ` <CAN+hb0WRDYCpY8xoUvvGu4SSD83F9VTMW=9W=xfjYtJV-dijmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-04-15 17:55                           ` Shaohua Li
     [not found]                             ` <20160415175520.GA2644484-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2016-04-17 18:05                               ` Adam Morrison
