* [PATCH 00/23] AMD IOMMU DMA-API Scalability Improvements
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

Hi,

here is a patch-set to improve scalability in the dma_ops
path of the AMD IOMMU driver. The current code doesn't scale
well because of the per-domain spin-lock, which serializes
all DMA-API operations on a domain.

This lock protects the address allocator, the page-table
updates and the iommu tlb flushing.

As a first step these patches introduce a lock that only
protects the address allocator on a per-aperture basis. A
domain can have multiple apertures, each covering 128 MiB of
address space.

The page-table code is updated to work lock-less, like the
Intel VT-d page-table code. Also, the iommu tlb flushing is
no longer deferred to the end of the DMA-API operation, but
happens right before/after the address allocator is updated
(which is the point where we either own the addresses or
make them available to someone else). This also removes the
need to take a lock around the iommu tlb flushing.
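
Roughly, the resulting ordering on the free path looks like
this (an illustrative sketch, not the final code:
sketch_free_addresses() is a made-up name, while the flush
helpers and bitmap_clear() are the driver's existing ones):

	static void sketch_free_addresses(struct dma_ops_domain *dom,
					  struct aperture_range *range,
					  unsigned long bit,
					  unsigned long pages)
	{
		unsigned long flags;

		/* Nobody else has been handed these addresses yet,
		 * so the flush cannot race with a new mapping. */
		domain_flush_tlb(&dom->domain);
		domain_flush_complete(&dom->domain);

		/* Only now hand the addresses back to the allocator. */
		spin_lock_irqsave(&range->bitmap_lock, flags);
		bitmap_clear(range->bitmap, bit, pages);
		spin_unlock_irqrestore(&range->bitmap_lock, flags);
	}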

As a next step the patches change the address allocator path
to allocate from a non-contended aperture. This is done by
first using spin_trylock() on the available apertures. Only
if this fails does the allocator retry and spin on the lock.
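
Roughly, that allocation strategy looks like this (an
illustrative sketch only; sketch_alloc_in_range() is a
made-up helper standing in for the locked bitmap search):

	static unsigned long sketch_alloc(struct dma_ops_domain *dom,
					  unsigned long pages)
	{
		unsigned long addr, flags;
		int i, pass;

		for (pass = 0; pass < 2; ++pass) {
			for (i = 0; i < APERTURE_MAX_RANGES; ++i) {
				struct aperture_range *range = dom->aperture[i];

				if (!range)
					continue;

				/* First pass: skip contended apertures. */
				if (pass == 0 &&
				    !spin_trylock_irqsave(&range->bitmap_lock, flags))
					continue;
				if (pass != 0)
					spin_lock_irqsave(&range->bitmap_lock, flags);

				addr = sketch_alloc_in_range(range, pages);

				spin_unlock_irqrestore(&range->bitmap_lock, flags);

				if (addr != -1UL)
					return addr;
			}
		}

		return -1UL;
	}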

To make this work, more than one aperture per device is
needed by default. Based on the dma_mask of the device the
code now allocates between 4 and 8 apertures in the
set_dma_mask call-back.
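
A hypothetical sketch of the idea (constants and the helper
name are illustrative, not taken from the patch):

	static int sketch_nr_apertures(u64 dma_mask)
	{
		/* Devices limited to 32-bit DMA addresses get fewer
		 * preallocated 128 MiB apertures than devices with
		 * a wider mask. */
		if (dma_mask <= DMA_BIT_MASK(32))
			return 4;

		return 8;
	}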

In my tests on a single-node AMD IOMMU machine this resolves
the lock contention issues. It is expected that on bigger
machines there will be lock-contention again, but still to a
smaller degree than without these patches.

I also did some measurements to show the difference. I ran a
test that generates network packets over a 10 GBit link in a
loop and measured the average packets that could be queued
per second. Here are the results:

	stock   v4.4-rc6 iommu disabled : 1465946 PPS (100%)
	stock   v4.4-rc6 iommu enabled  : 815089  PPS (55.6%)
	patched v4.4-rc6 iommu enabled  : 1426606 PPS (97.3%)

So with the current code there is a 44.4% performance drop;
with these patches the performance drops by only 2.7%.

This is only a start. The long-term goal for resolving the
lock contention problem is to get rid of the address
allocator completely and implement dynamic identity mapping
for 64-bit devices. But there are still some problems to
solve with that, so until it is ready these patches at least
reduce the problem.

Feedback welcome!

Thanks,

	Joerg


Joerg Roedel (23):
  iommu/amd: Warn only once on unexpected pte value
  iommu/amd: Move 'struct dma_ops_domain' definition to amd_iommu.c
  iommu/amd: Introduce bitmap_lock in struct aperture_range
  iommu/amd: Flush IOMMU TLB on __map_single error path
  iommu/amd: Flush the IOMMU TLB before the addresses are freed
  iommu/amd: Pass correct shift to iommu_area_alloc()
  iommu/amd: Add dma_ops_aperture_alloc() function
  iommu/amd: Move aperture_range.offset to another cache-line
  iommu/amd: Retry address allocation within one aperture
  iommu/amd: Flush iommu tlb in dma_ops_aperture_alloc()
  iommu/amd: Remove 'start' parameter from dma_ops_area_alloc
  iommu/amd: Rename dma_ops_domain->next_address to next_index
  iommu/amd: Flush iommu tlb in dma_ops_free_addresses
  iommu/amd: Iterate over all aperture ranges in dma_ops_area_alloc
  iommu/amd: Remove need_flush from struct dma_ops_domain
  iommu/amd: Optimize dma_ops_free_addresses
  iommu/amd: Allocate new aperture ranges in dma_ops_alloc_addresses
  iommu/amd: Build io page-tables with cmpxchg64
  iommu/amd: Initialize new aperture range before making it visible
  iommu/amd: Relax locking in dma_ops path
  iommu/amd: Make dma_ops_domain->next_index percpu
  iommu/amd: Use trylock to acquire bitmap_lock
  iommu/amd: Preallocate dma_ops apertures based on dma_mask

 drivers/iommu/amd_iommu.c       | 388 +++++++++++++++++++++++++---------------
 drivers/iommu/amd_iommu_types.h |  40 -----
 2 files changed, 244 insertions(+), 184 deletions(-)

-- 
1.9.1



* [PATCH 01/23] iommu/amd: Warn only once on unexpected pte value
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

This prevents possible flooding of the kernel log.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8b2be1e..3cdfac6 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2328,7 +2328,7 @@ static dma_addr_t dma_ops_domain_map(struct dma_ops_domain *dom,
 	else if (direction == DMA_BIDIRECTIONAL)
 		__pte |= IOMMU_PTE_IR | IOMMU_PTE_IW;
 
-	WARN_ON(*pte);
+	WARN_ON_ONCE(*pte);
 
 	*pte = __pte;
 
@@ -2357,7 +2357,7 @@ static void dma_ops_domain_unmap(struct dma_ops_domain *dom,
 
 	pte += PM_LEVEL_INDEX(0, address);
 
-	WARN_ON(!*pte);
+	WARN_ON_ONCE(!*pte);
 
 	*pte = 0ULL;
 }
-- 
1.9.1



* [PATCH 02/23] iommu/amd: Move 'struct dma_ops_domain' definition to amd_iommu.c
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

It is only used in this file anyway, so keep it there. Same
with 'struct aperture_range'.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c       | 40 ++++++++++++++++++++++++++++++++++++++++
 drivers/iommu/amd_iommu_types.h | 40 ----------------------------------------
 2 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 3cdfac6..9ce51eb 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -114,6 +114,46 @@ struct kmem_cache *amd_iommu_irq_cache;
 static void update_domain(struct protection_domain *domain);
 static int protection_domain_init(struct protection_domain *domain);
 
+/*
+ * For dynamic growth the aperture size is split into ranges of 128MB of
+ * DMA address space each. This struct represents one such range.
+ */
+struct aperture_range {
+
+	/* address allocation bitmap */
+	unsigned long *bitmap;
+
+	/*
+	 * Array of PTE pages for the aperture. In this array we save all the
+	 * leaf pages of the domain page table used for the aperture. This way
+	 * we don't need to walk the page table to find a specific PTE. We can
+	 * just calculate its address in constant time.
+	 */
+	u64 *pte_pages[64];
+
+	unsigned long offset;
+};
+
+/*
+ * Data container for a dma_ops specific protection domain
+ */
+struct dma_ops_domain {
+	/* generic protection domain information */
+	struct protection_domain domain;
+
+	/* size of the aperture for the mappings */
+	unsigned long aperture_size;
+
+	/* address we start to search for free addresses */
+	unsigned long next_address;
+
+	/* address space relevant data */
+	struct aperture_range *aperture[APERTURE_MAX_RANGES];
+
+	/* This will be set to true when TLB needs to be flushed */
+	bool need_flush;
+};
+
 /****************************************************************************
  *
  * Helper functions
diff --git a/drivers/iommu/amd_iommu_types.h b/drivers/iommu/amd_iommu_types.h
index b08cf57..9d32b20 100644
--- a/drivers/iommu/amd_iommu_types.h
+++ b/drivers/iommu/amd_iommu_types.h
@@ -425,46 +425,6 @@ struct protection_domain {
 };
 
 /*
- * For dynamic growth the aperture size is split into ranges of 128MB of
- * DMA address space each. This struct represents one such range.
- */
-struct aperture_range {
-
-	/* address allocation bitmap */
-	unsigned long *bitmap;
-
-	/*
-	 * Array of PTE pages for the aperture. In this array we save all the
-	 * leaf pages of the domain page table used for the aperture. This way
-	 * we don't need to walk the page table to find a specific PTE. We can
-	 * just calculate its address in constant time.
-	 */
-	u64 *pte_pages[64];
-
-	unsigned long offset;
-};
-
-/*
- * Data container for a dma_ops specific protection domain
- */
-struct dma_ops_domain {
-	/* generic protection domain information */
-	struct protection_domain domain;
-
-	/* size of the aperture for the mappings */
-	unsigned long aperture_size;
-
-	/* address we start to search for free addresses */
-	unsigned long next_address;
-
-	/* address space relevant data */
-	struct aperture_range *aperture[APERTURE_MAX_RANGES];
-
-	/* This will be set to true when TLB needs to be flushed */
-	bool need_flush;
-};
-
-/*
  * Structure where we save information about one hardware AMD IOMMU in the
  * system.
  */
-- 
1.9.1



* [PATCH 03/23] iommu/amd: Introduce bitmap_lock in struct aperture_range
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

This lock only protects the address allocation bitmap in one
aperture.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 9ce51eb..8ff3331 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -120,6 +120,8 @@ static int protection_domain_init(struct protection_domain *domain);
  */
 struct aperture_range {
 
+	spinlock_t bitmap_lock;
+
 	/* address allocation bitmap */
 	unsigned long *bitmap;
 
@@ -1436,6 +1438,8 @@ static int alloc_new_range(struct dma_ops_domain *dma_dom,
 
 	dma_dom->aperture[index]->offset = dma_dom->aperture_size;
 
+	spin_lock_init(&dma_dom->aperture[index]->bitmap_lock);
+
 	if (populate) {
 		unsigned long address = dma_dom->aperture_size;
 		int i, num_ptes = APERTURE_RANGE_PAGES / 512;
@@ -1527,6 +1531,7 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 	unsigned long boundary_size, mask;
 	unsigned long address = -1;
 	unsigned long limit;
+	unsigned long flags;
 
 	next_bit >>= PAGE_SHIFT;
 
@@ -1544,9 +1549,11 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 		limit = iommu_device_max_index(APERTURE_RANGE_PAGES, offset,
 					       dma_mask >> PAGE_SHIFT);
 
+		spin_lock_irqsave(&dom->aperture[i]->bitmap_lock, flags);
 		address = iommu_area_alloc(dom->aperture[i]->bitmap,
 					   limit, next_bit, pages, 0,
 					    boundary_size, align_mask);
+		spin_unlock_irqrestore(&dom->aperture[i]->bitmap_lock, flags);
 		if (address != -1) {
 			address = dom->aperture[i]->offset +
 				  (address << PAGE_SHIFT);
@@ -1602,6 +1609,7 @@ static void dma_ops_free_addresses(struct dma_ops_domain *dom,
 {
 	unsigned i = address >> APERTURE_RANGE_SHIFT;
 	struct aperture_range *range = dom->aperture[i];
+	unsigned long flags;
 
 	BUG_ON(i >= APERTURE_MAX_RANGES || range == NULL);
 
@@ -1615,7 +1623,9 @@ static void dma_ops_free_addresses(struct dma_ops_domain *dom,
 
 	address = (address % APERTURE_RANGE_SIZE) >> PAGE_SHIFT;
 
+	spin_lock_irqsave(&range->bitmap_lock, flags);
 	bitmap_clear(range->bitmap, address, pages);
+	spin_unlock_irqrestore(&range->bitmap_lock, flags);
 
 }
 
-- 
1.9.1



* [PATCH 04/23] iommu/amd: Flush IOMMU TLB on __map_single error path
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

On this error path there may have been present PTEs which,
in theory, could have made it into the IOMMU TLB. Flush
those addresses out on the error path to make sure no stale
entries remain.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8ff3331..42c0a81 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2493,6 +2493,8 @@ out_unmap:
 		dma_ops_domain_unmap(dma_dom, start);
 	}
 
+	domain_flush_pages(&dma_dom->domain, address, size);
+
 	dma_ops_free_addresses(dma_dom, address, pages);
 
 	return DMA_ERROR_CODE;
-- 
1.9.1



* [PATCH 05/23] iommu/amd: Flush the IOMMU TLB before the addresses are freed
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

This allows us to hold the bitmap_lock for only a very
short period of time.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 42c0a81..69021ec 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2527,14 +2527,14 @@ static void __unmap_single(struct dma_ops_domain *dma_dom,
 		start += PAGE_SIZE;
 	}
 
-	SUB_STATS_COUNTER(alloced_io_mem, size);
-
-	dma_ops_free_addresses(dma_dom, dma_addr, pages);
-
 	if (amd_iommu_unmap_flush || dma_dom->need_flush) {
 		domain_flush_pages(&dma_dom->domain, flush_addr, size);
 		dma_dom->need_flush = false;
 	}
+
+	SUB_STATS_COUNTER(alloced_io_mem, size);
+
+	dma_ops_free_addresses(dma_dom, dma_addr, pages);
 }
 
 /*
-- 
1.9.1



* [PATCH 06/23] iommu/amd: Pass correct shift to iommu_area_alloc()
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

The page-offset of the aperture must be passed instead of 0.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 69021ec..1d1ef37 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1551,7 +1551,7 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 
 		spin_lock_irqsave(&dom->aperture[i]->bitmap_lock, flags);
 		address = iommu_area_alloc(dom->aperture[i]->bitmap,
-					   limit, next_bit, pages, 0,
+					   limit, next_bit, pages, offset,
 					    boundary_size, align_mask);
 		spin_unlock_irqrestore(&dom->aperture[i]->bitmap_lock, flags);
 		if (address != -1) {
-- 
1.9.1



* [PATCH 07/23] iommu/amd: Add dma_ops_aperture_alloc() function
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Make this a wrapper around iommu_area_alloc() for now and
add more logic to this function later on.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 1d1ef37..be0e81a 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1518,6 +1518,28 @@ out_free:
 	return -ENOMEM;
 }
 
+static dma_addr_t dma_ops_aperture_alloc(struct aperture_range *range,
+					 unsigned long pages,
+					 unsigned long next_bit,
+					 unsigned long dma_mask,
+					 unsigned long boundary_size,
+					 unsigned long align_mask)
+{
+	unsigned long offset, limit, flags;
+	dma_addr_t address;
+
+	offset = range->offset >> PAGE_SHIFT;
+	limit  = iommu_device_max_index(APERTURE_RANGE_PAGES, offset,
+					dma_mask >> PAGE_SHIFT);
+
+	spin_lock_irqsave(&range->bitmap_lock, flags);
+	address = iommu_area_alloc(range->bitmap, limit, next_bit, pages,
+				   offset, boundary_size, align_mask);
+	spin_unlock_irqrestore(&range->bitmap_lock, flags);
+
+	return address;
+}
+
 static unsigned long dma_ops_area_alloc(struct device *dev,
 					struct dma_ops_domain *dom,
 					unsigned int pages,
@@ -1530,8 +1552,6 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 	int i = start >> APERTURE_RANGE_SHIFT;
 	unsigned long boundary_size, mask;
 	unsigned long address = -1;
-	unsigned long limit;
-	unsigned long flags;
 
 	next_bit >>= PAGE_SHIFT;
 
@@ -1541,19 +1561,12 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 				   1UL << (BITS_PER_LONG - PAGE_SHIFT);
 
 	for (;i < max_index; ++i) {
-		unsigned long offset = dom->aperture[i]->offset >> PAGE_SHIFT;
-
 		if (dom->aperture[i]->offset >= dma_mask)
 			break;
 
-		limit = iommu_device_max_index(APERTURE_RANGE_PAGES, offset,
-					       dma_mask >> PAGE_SHIFT);
-
-		spin_lock_irqsave(&dom->aperture[i]->bitmap_lock, flags);
-		address = iommu_area_alloc(dom->aperture[i]->bitmap,
-					   limit, next_bit, pages, offset,
-					    boundary_size, align_mask);
-		spin_unlock_irqrestore(&dom->aperture[i]->bitmap_lock, flags);
+		address = dma_ops_aperture_alloc(dom->aperture[i], pages,
+						 next_bit, dma_mask,
+						 boundary_size, align_mask);
 		if (address != -1) {
 			address = dom->aperture[i]->offset +
 				  (address << PAGE_SHIFT);
-- 
1.9.1



* [PATCH 08/23] iommu/amd: Move aperture_range.offset to another cache-line
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Moving it before the pte_pages array puts it into the same
cache-line as the spin-lock and the bitmap array pointer.
This should save a cache-miss.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index be0e81a..2a22515 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -124,6 +124,7 @@ struct aperture_range {
 
 	/* address allocation bitmap */
 	unsigned long *bitmap;
+	unsigned long offset;
 
 	/*
 	 * Array of PTE pages for the aperture. In this array we save all the
@@ -132,8 +133,6 @@ struct aperture_range {
 	 * just calculate its address in constant time.
 	 */
 	u64 *pte_pages[64];
-
-	unsigned long offset;
 };
 
 /*
-- 
1.9.1



* [PATCH 09/23] iommu/amd: Retry address allocation within one aperture
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Instead of skipping to the next aperture, first try again in
the current one.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 2a22515..58d7d82 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -125,6 +125,7 @@ struct aperture_range {
 	/* address allocation bitmap */
 	unsigned long *bitmap;
 	unsigned long offset;
+	unsigned long next_bit;
 
 	/*
 	 * Array of PTE pages for the aperture. In this array we save all the
@@ -1519,7 +1520,6 @@ out_free:
 
 static dma_addr_t dma_ops_aperture_alloc(struct aperture_range *range,
 					 unsigned long pages,
-					 unsigned long next_bit,
 					 unsigned long dma_mask,
 					 unsigned long boundary_size,
 					 unsigned long align_mask)
@@ -1532,8 +1532,17 @@ static dma_addr_t dma_ops_aperture_alloc(struct aperture_range *range,
 					dma_mask >> PAGE_SHIFT);
 
 	spin_lock_irqsave(&range->bitmap_lock, flags);
-	address = iommu_area_alloc(range->bitmap, limit, next_bit, pages,
-				   offset, boundary_size, align_mask);
+	address = iommu_area_alloc(range->bitmap, limit, range->next_bit,
+				   pages, offset, boundary_size, align_mask);
+	if (address == -1)
+		/* Nothing found, retry one time */
+		address = iommu_area_alloc(range->bitmap, limit,
+					   0, pages, offset, boundary_size,
+					   align_mask);
+
+	if (address != -1)
+		range->next_bit = address + pages;
+
 	spin_unlock_irqrestore(&range->bitmap_lock, flags);
 
 	return address;
@@ -1546,14 +1555,11 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 					u64 dma_mask,
 					unsigned long start)
 {
-	unsigned long next_bit = dom->next_address % APERTURE_RANGE_SIZE;
 	int max_index = dom->aperture_size >> APERTURE_RANGE_SHIFT;
 	int i = start >> APERTURE_RANGE_SHIFT;
-	unsigned long boundary_size, mask;
+	unsigned long next_bit, boundary_size, mask;
 	unsigned long address = -1;
 
-	next_bit >>= PAGE_SHIFT;
-
 	mask = dma_get_seg_boundary(dev);
 
 	boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT :
@@ -1563,9 +1569,11 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 		if (dom->aperture[i]->offset >= dma_mask)
 			break;
 
+		next_bit = dom->aperture[i]->next_bit;
+
 		address = dma_ops_aperture_alloc(dom->aperture[i], pages,
-						 next_bit, dma_mask,
-						 boundary_size, align_mask);
+						 dma_mask, boundary_size,
+						 align_mask);
 		if (address != -1) {
 			address = dom->aperture[i]->offset +
 				  (address << PAGE_SHIFT);
@@ -1573,7 +1581,8 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 			break;
 		}
 
-		next_bit = 0;
+		if (next_bit > dom->aperture[i]->next_bit)
+			dom->need_flush = true;
 	}
 
 	return address;
-- 
1.9.1



* [PATCH 10/23] iommu/amd: Flush iommu tlb in dma_ops_aperture_alloc()
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Since the allocator wraparound happens in this function now,
flush the iommu tlb there too.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 58d7d82..eb11996 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1518,7 +1518,8 @@ out_free:
 	return -ENOMEM;
 }
 
-static dma_addr_t dma_ops_aperture_alloc(struct aperture_range *range,
+static dma_addr_t dma_ops_aperture_alloc(struct dma_ops_domain *dom,
+					 struct aperture_range *range,
 					 unsigned long pages,
 					 unsigned long dma_mask,
 					 unsigned long boundary_size,
@@ -1526,6 +1527,7 @@ static dma_addr_t dma_ops_aperture_alloc(struct aperture_range *range,
 {
 	unsigned long offset, limit, flags;
 	dma_addr_t address;
+	bool flush = false;
 
 	offset = range->offset >> PAGE_SHIFT;
 	limit  = iommu_device_max_index(APERTURE_RANGE_PAGES, offset,
@@ -1534,17 +1536,24 @@ static dma_addr_t dma_ops_aperture_alloc(struct aperture_range *range,
 	spin_lock_irqsave(&range->bitmap_lock, flags);
 	address = iommu_area_alloc(range->bitmap, limit, range->next_bit,
 				   pages, offset, boundary_size, align_mask);
-	if (address == -1)
+	if (address == -1) {
 		/* Nothing found, retry one time */
 		address = iommu_area_alloc(range->bitmap, limit,
 					   0, pages, offset, boundary_size,
 					   align_mask);
+		flush = true;
+	}
 
 	if (address != -1)
 		range->next_bit = address + pages;
 
 	spin_unlock_irqrestore(&range->bitmap_lock, flags);
 
+	if (flush) {
+		domain_flush_tlb(&dom->domain);
+		domain_flush_complete(&dom->domain);
+	}
+
 	return address;
 }
 
@@ -1566,12 +1575,14 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 				   1UL << (BITS_PER_LONG - PAGE_SHIFT);
 
 	for (;i < max_index; ++i) {
-		if (dom->aperture[i]->offset >= dma_mask)
+		struct aperture_range *range = dom->aperture[i];
+
+		if (range->offset >= dma_mask)
 			break;
 
-		next_bit = dom->aperture[i]->next_bit;
+		next_bit  = range->next_bit;
 
-		address = dma_ops_aperture_alloc(dom->aperture[i], pages,
+		address = dma_ops_aperture_alloc(dom, dom->aperture[i], pages,
 						 dma_mask, boundary_size,
 						 align_mask);
 		if (address != -1) {
-- 
1.9.1



* [PATCH 11/23] iommu/amd: Remove 'start' parameter from dma_ops_area_alloc
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

The parameter is not needed because the value is part of
the struct dma_ops_domain that is already passed in.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index eb11996..2962c62 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1561,11 +1561,10 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 					struct dma_ops_domain *dom,
 					unsigned int pages,
 					unsigned long align_mask,
-					u64 dma_mask,
-					unsigned long start)
+					u64 dma_mask)
 {
 	int max_index = dom->aperture_size >> APERTURE_RANGE_SHIFT;
-	int i = start >> APERTURE_RANGE_SHIFT;
+	int i = dom->next_address >> APERTURE_RANGE_SHIFT;
 	unsigned long next_bit, boundary_size, mask;
 	unsigned long address = -1;
 
@@ -1612,13 +1611,12 @@ static unsigned long dma_ops_alloc_addresses(struct device *dev,
 	dom->need_flush = true;
 #endif
 
-	address = dma_ops_area_alloc(dev, dom, pages, align_mask,
-				     dma_mask, dom->next_address);
+	address = dma_ops_area_alloc(dev, dom, pages, align_mask, dma_mask);
 
 	if (address == -1) {
 		dom->next_address = 0;
 		address = dma_ops_area_alloc(dev, dom, pages, align_mask,
-					     dma_mask, 0);
+					     dma_mask);
 		dom->need_flush = true;
 	}
 
-- 
1.9.1



* [PATCH 12/23] iommu/amd: Rename dma_ops_domain->next_address to next_index
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

It points to the next aperture index to allocate from. We
don't need the full address anymore because this is now
tracked in struct aperture_range.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 2962c62..a26cd76 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -146,8 +146,8 @@ struct dma_ops_domain {
 	/* size of the aperture for the mappings */
 	unsigned long aperture_size;
 
-	/* address we start to search for free addresses */
-	unsigned long next_address;
+	/* aperture index we start searching for free addresses */
+	unsigned long next_index;
 
 	/* address space relevant data */
 	struct aperture_range *aperture[APERTURE_MAX_RANGES];
@@ -1564,9 +1564,9 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 					u64 dma_mask)
 {
 	int max_index = dom->aperture_size >> APERTURE_RANGE_SHIFT;
-	int i = dom->next_address >> APERTURE_RANGE_SHIFT;
 	unsigned long next_bit, boundary_size, mask;
 	unsigned long address = -1;
+	int i = dom->next_index;
 
 	mask = dma_get_seg_boundary(dev);
 
@@ -1587,7 +1587,7 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 		if (address != -1) {
 			address = dom->aperture[i]->offset +
 				  (address << PAGE_SHIFT);
-			dom->next_address = address + (pages << PAGE_SHIFT);
+			dom->next_index = i;
 			break;
 		}
 
@@ -1607,14 +1607,14 @@ static unsigned long dma_ops_alloc_addresses(struct device *dev,
 	unsigned long address;
 
 #ifdef CONFIG_IOMMU_STRESS
-	dom->next_address = 0;
+	dom->next_index = 0;
 	dom->need_flush = true;
 #endif
 
 	address = dma_ops_area_alloc(dev, dom, pages, align_mask, dma_mask);
 
 	if (address == -1) {
-		dom->next_address = 0;
+		dom->next_index = 0;
 		address = dma_ops_area_alloc(dev, dom, pages, align_mask,
 					     dma_mask);
 		dom->need_flush = true;
@@ -1648,7 +1648,7 @@ static void dma_ops_free_addresses(struct dma_ops_domain *dom,
 		return;
 #endif
 
-	if (address >= dom->next_address)
+	if ((address >> APERTURE_RANGE_SHIFT) >= dom->next_index)
 		dom->need_flush = true;
 
 	address = (address % APERTURE_RANGE_SIZE) >> PAGE_SHIFT;
@@ -1884,7 +1884,7 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
 	 * a valid dma-address. So we can use 0 as error value
 	 */
 	dma_dom->aperture[0]->bitmap[0] = 1;
-	dma_dom->next_address = 0;
+	dma_dom->next_index = 0;
 
 
 	return dma_dom;
@@ -2477,15 +2477,15 @@ retry:
 	address = dma_ops_alloc_addresses(dev, dma_dom, pages, align_mask,
 					  dma_mask);
 	if (unlikely(address == DMA_ERROR_CODE)) {
+		if (alloc_new_range(dma_dom, false, GFP_ATOMIC))
+			goto out;
+
 		/*
-		 * setting next_address here will let the address
+		 * setting next_index here will let the address
 		 * allocator only scan the new allocated range in the
 		 * first run. This is a small optimization.
 		 */
-		dma_dom->next_address = dma_dom->aperture_size;
-
-		if (alloc_new_range(dma_dom, false, GFP_ATOMIC))
-			goto out;
+		dma_dom->next_index = dma_dom->aperture_size >> APERTURE_RANGE_SHIFT;
 
 		/*
 		 * aperture was successfully enlarged by 128 MB, try
-- 
1.9.1



* [PATCH 13/23] iommu/amd: Flush iommu tlb in dma_ops_free_addresses
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Instead of setting need_flush, do the flush directly in
dma_ops_free_addresses.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index a26cd76..62a4079 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1648,8 +1648,10 @@ static void dma_ops_free_addresses(struct dma_ops_domain *dom,
 		return;
 #endif
 
-	if ((address >> APERTURE_RANGE_SHIFT) >= dom->next_index)
-		dom->need_flush = true;
+	if (address + pages > range->next_bit) {
+		domain_flush_tlb(&dom->domain);
+		domain_flush_complete(&dom->domain);
+	}
 
 	address = (address % APERTURE_RANGE_SIZE) >> PAGE_SHIFT;
 
-- 
1.9.1



* [PATCH 14/23] iommu/amd: Iterate over all aperture ranges in dma_ops_area_alloc
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

This way we don't need to care about the next_index wrapping
around in dma_ops_alloc_addresses.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 62a4079..faf51a0 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1563,35 +1563,36 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 					unsigned long align_mask,
 					u64 dma_mask)
 {
-	int max_index = dom->aperture_size >> APERTURE_RANGE_SHIFT;
 	unsigned long next_bit, boundary_size, mask;
 	unsigned long address = -1;
-	int i = dom->next_index;
+	int start = dom->next_index;
+	int i;
 
 	mask = dma_get_seg_boundary(dev);
 
 	boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT :
 				   1UL << (BITS_PER_LONG - PAGE_SHIFT);
 
-	for (;i < max_index; ++i) {
-		struct aperture_range *range = dom->aperture[i];
+	for (i = 0; i < APERTURE_MAX_RANGES; ++i) {
+		struct aperture_range *range;
+
+		range = dom->aperture[(start + i) % APERTURE_MAX_RANGES];
 
-		if (range->offset >= dma_mask)
-			break;
+		if (!range || range->offset >= dma_mask)
+			continue;
 
 		next_bit  = range->next_bit;
 
-		address = dma_ops_aperture_alloc(dom, dom->aperture[i], pages,
+		address = dma_ops_aperture_alloc(dom, range, pages,
 						 dma_mask, boundary_size,
 						 align_mask);
 		if (address != -1) {
-			address = dom->aperture[i]->offset +
-				  (address << PAGE_SHIFT);
+			address = range->offset + (address << PAGE_SHIFT);
 			dom->next_index = i;
 			break;
 		}
 
-		if (next_bit > dom->aperture[i]->next_bit)
+		if (next_bit > range->next_bit)
 			dom->need_flush = true;
 	}
 
@@ -1613,13 +1614,6 @@ static unsigned long dma_ops_alloc_addresses(struct device *dev,
 
 	address = dma_ops_area_alloc(dev, dom, pages, align_mask, dma_mask);
 
-	if (address == -1) {
-		dom->next_index = 0;
-		address = dma_ops_area_alloc(dev, dom, pages, align_mask,
-					     dma_mask);
-		dom->need_flush = true;
-	}
-
 	if (unlikely(address == -1))
 		address = DMA_ERROR_CODE;
 
-- 
1.9.1



* [PATCH 15/23] iommu/amd: Remove need_flush from struct dma_ops_domain
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

The flushing of iommu tlbs is now done on a per-range basis.
So there is no need anymore for domain-wide flush tracking.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 30 ++++++------------------------
 1 file changed, 6 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index faf51a0..39a2048 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -151,9 +151,6 @@ struct dma_ops_domain {
 
 	/* address space relevant data */
 	struct aperture_range *aperture[APERTURE_MAX_RANGES];
-
-	/* This will be set to true when TLB needs to be flushed */
-	bool need_flush;
 };
 
 /****************************************************************************
@@ -1563,7 +1560,7 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 					unsigned long align_mask,
 					u64 dma_mask)
 {
-	unsigned long next_bit, boundary_size, mask;
+	unsigned long boundary_size, mask;
 	unsigned long address = -1;
 	int start = dom->next_index;
 	int i;
@@ -1581,8 +1578,6 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 		if (!range || range->offset >= dma_mask)
 			continue;
 
-		next_bit  = range->next_bit;
-
 		address = dma_ops_aperture_alloc(dom, range, pages,
 						 dma_mask, boundary_size,
 						 align_mask);
@@ -1591,9 +1586,6 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 			dom->next_index = i;
 			break;
 		}
-
-		if (next_bit > range->next_bit)
-			dom->need_flush = true;
 	}
 
 	return address;
@@ -1609,7 +1601,6 @@ static unsigned long dma_ops_alloc_addresses(struct device *dev,
 
 #ifdef CONFIG_IOMMU_STRESS
 	dom->next_index = 0;
-	dom->need_flush = true;
 #endif
 
 	address = dma_ops_area_alloc(dev, dom, pages, align_mask, dma_mask);
@@ -1642,7 +1633,8 @@ static void dma_ops_free_addresses(struct dma_ops_domain *dom,
 		return;
 #endif
 
-	if (address + pages > range->next_bit) {
+	if (amd_iommu_unmap_flush ||
+	    (address + pages > range->next_bit)) {
 		domain_flush_tlb(&dom->domain);
 		domain_flush_complete(&dom->domain);
 	}
@@ -1868,8 +1860,6 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
 	if (!dma_dom->domain.pt_root)
 		goto free_dma_dom;
 
-	dma_dom->need_flush = false;
-
 	add_domain_to_list(&dma_dom->domain);
 
 	if (alloc_new_range(dma_dom, true, GFP_KERNEL))
@@ -2503,11 +2493,10 @@ retry:
 
 	ADD_STATS_COUNTER(alloced_io_mem, size);
 
-	if (unlikely(dma_dom->need_flush && !amd_iommu_unmap_flush)) {
-		domain_flush_tlb(&dma_dom->domain);
-		dma_dom->need_flush = false;
-	} else if (unlikely(amd_iommu_np_cache))
+	if (unlikely(amd_iommu_np_cache)) {
 		domain_flush_pages(&dma_dom->domain, address, size);
+		domain_flush_complete(&dma_dom->domain);
+	}
 
 out:
 	return address;
@@ -2519,8 +2508,6 @@ out_unmap:
 		dma_ops_domain_unmap(dma_dom, start);
 	}
 
-	domain_flush_pages(&dma_dom->domain, address, size);
-
 	dma_ops_free_addresses(dma_dom, address, pages);
 
 	return DMA_ERROR_CODE;
@@ -2553,11 +2540,6 @@ static void __unmap_single(struct dma_ops_domain *dma_dom,
 		start += PAGE_SIZE;
 	}
 
-	if (amd_iommu_unmap_flush || dma_dom->need_flush) {
-		domain_flush_pages(&dma_dom->domain, flush_addr, size);
-		dma_dom->need_flush = false;
-	}
-
 	SUB_STATS_COUNTER(alloced_io_mem, size);
 
 	dma_ops_free_addresses(dma_dom, dma_addr, pages);
-- 
1.9.1



* [PATCH 16/23] iommu/amd: Optimize dma_ops_free_addresses
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Don't flush the iommu tlb when we free something behind the
current next_bit pointer. Update the next_bit pointer
instead and let the flush happen on the next wraparound in
the allocation path.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 39a2048..c657e48 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1633,8 +1633,7 @@ static void dma_ops_free_addresses(struct dma_ops_domain *dom,
 		return;
 #endif
 
-	if (amd_iommu_unmap_flush ||
-	    (address + pages > range->next_bit)) {
+	if (amd_iommu_unmap_flush) {
 		domain_flush_tlb(&dom->domain);
 		domain_flush_complete(&dom->domain);
 	}
@@ -1642,6 +1641,8 @@ static void dma_ops_free_addresses(struct dma_ops_domain *dom,
 	address = (address % APERTURE_RANGE_SIZE) >> PAGE_SHIFT;
 
 	spin_lock_irqsave(&range->bitmap_lock, flags);
+	if (address + pages > range->next_bit)
+		range->next_bit = address + pages;
 	bitmap_clear(range->bitmap, address, pages);
 	spin_unlock_irqrestore(&range->bitmap_lock, flags);
 
-- 
1.9.1



* [PATCH 17/23] iommu/amd: Allocate new aperture ranges in dma_ops_alloc_addresses
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

It really belongs there and not in __map_single.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index c657e48..4c926da 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1597,13 +1597,19 @@ static unsigned long dma_ops_alloc_addresses(struct device *dev,
 					     unsigned long align_mask,
 					     u64 dma_mask)
 {
-	unsigned long address;
+	unsigned long address = -1;
 
 #ifdef CONFIG_IOMMU_STRESS
 	dom->next_index = 0;
 #endif
 
-	address = dma_ops_area_alloc(dev, dom, pages, align_mask, dma_mask);
+	while (address == -1) {
+		address = dma_ops_area_alloc(dev, dom, pages,
+					     align_mask, dma_mask);
+
+		if (address == -1 && alloc_new_range(dom, true, GFP_ATOMIC))
+			break;
+	}
 
 	if (unlikely(address == -1))
 		address = DMA_ERROR_CODE;
@@ -2460,26 +2466,11 @@ static dma_addr_t __map_single(struct device *dev,
 	if (align)
 		align_mask = (1UL << get_order(size)) - 1;
 
-retry:
 	address = dma_ops_alloc_addresses(dev, dma_dom, pages, align_mask,
 					  dma_mask);
-	if (unlikely(address == DMA_ERROR_CODE)) {
-		if (alloc_new_range(dma_dom, false, GFP_ATOMIC))
-			goto out;
-
-		/*
-		 * setting next_index here will let the address
-		 * allocator only scan the new allocated range in the
-		 * first run. This is a small optimization.
-		 */
-		dma_dom->next_index = dma_dom->aperture_size >> APERTURE_RANGE_SHIFT;
 
-		/*
-		 * aperture was successfully enlarged by 128 MB, try
-		 * allocation again
-		 */
-		goto retry;
-	}
+	if (address == DMA_ERROR_CODE)
+		goto out;
 
 	start = address;
 	for (i = 0; i < pages; ++i) {
-- 
1.9.1



* [PATCH 18/23] iommu/amd: Build io page-tables with cmpxchg64
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

This allows the page-tables to be built up without holding
any locks. As a consequence it removes the need to
pre-populate the dma_ops page-tables.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 4c926da..ecdd3f7 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1206,11 +1206,21 @@ static u64 *alloc_pte(struct protection_domain *domain,
 	end_lvl = PAGE_SIZE_LEVEL(page_size);
 
 	while (level > end_lvl) {
-		if (!IOMMU_PTE_PRESENT(*pte)) {
+		u64 __pte, __npte;
+
+		__pte = *pte;
+
+		if (!IOMMU_PTE_PRESENT(__pte)) {
 			page = (u64 *)get_zeroed_page(gfp);
 			if (!page)
 				return NULL;
-			*pte = PM_LEVEL_PDE(level, virt_to_phys(page));
+
+			__npte = PM_LEVEL_PDE(level, virt_to_phys(page));
+
+			if (cmpxchg64(pte, __pte, __npte)) {
+				free_page((unsigned long)page);
+				continue;
+			}
 		}
 
 		/* No level skipping support yet */
@@ -1607,7 +1617,7 @@ static unsigned long dma_ops_alloc_addresses(struct device *dev,
 		address = dma_ops_area_alloc(dev, dom, pages,
 					     align_mask, dma_mask);
 
-		if (address == -1 && alloc_new_range(dom, true, GFP_ATOMIC))
+		if (address == -1 && alloc_new_range(dom, false, GFP_ATOMIC))
 			break;
 	}
 
-- 
1.9.1



* [PATCH 19/23] iommu/amd: Initialize new aperture range before making it visible
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Make sure the aperture range is fully initialized before it
is visible to the address allocator.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index ecdd3f7..11ee885 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1425,8 +1425,10 @@ static int alloc_new_range(struct dma_ops_domain *dma_dom,
 			   bool populate, gfp_t gfp)
 {
 	int index = dma_dom->aperture_size >> APERTURE_RANGE_SHIFT;
-	struct amd_iommu *iommu;
 	unsigned long i, old_size, pte_pgsize;
+	struct aperture_range *range;
+	struct amd_iommu *iommu;
+	unsigned long flags;
 
 #ifdef CONFIG_IOMMU_STRESS
 	populate = false;
@@ -1435,17 +1437,17 @@ static int alloc_new_range(struct dma_ops_domain *dma_dom,
 	if (index >= APERTURE_MAX_RANGES)
 		return -ENOMEM;
 
-	dma_dom->aperture[index] = kzalloc(sizeof(struct aperture_range), gfp);
-	if (!dma_dom->aperture[index])
+	range = kzalloc(sizeof(struct aperture_range), gfp);
+	if (!range)
 		return -ENOMEM;
 
-	dma_dom->aperture[index]->bitmap = (void *)get_zeroed_page(gfp);
-	if (!dma_dom->aperture[index]->bitmap)
+	range->bitmap = (void *)get_zeroed_page(gfp);
+	if (!range->bitmap)
 		goto out_free;
 
-	dma_dom->aperture[index]->offset = dma_dom->aperture_size;
+	range->offset = dma_dom->aperture_size;
 
-	spin_lock_init(&dma_dom->aperture[index]->bitmap_lock);
+	spin_lock_init(&range->bitmap_lock);
 
 	if (populate) {
 		unsigned long address = dma_dom->aperture_size;
@@ -1458,14 +1460,18 @@ static int alloc_new_range(struct dma_ops_domain *dma_dom,
 			if (!pte)
 				goto out_free;
 
-			dma_dom->aperture[index]->pte_pages[i] = pte_page;
+			range->pte_pages[i] = pte_page;
 
 			address += APERTURE_RANGE_SIZE / 64;
 		}
 	}
 
-	old_size                = dma_dom->aperture_size;
-	dma_dom->aperture_size += APERTURE_RANGE_SIZE;
+	/* First take the bitmap_lock and then publish the range */
+	spin_lock_irqsave(&range->bitmap_lock, flags);
+
+	old_size                 = dma_dom->aperture_size;
+	dma_dom->aperture[index] = range;
+	dma_dom->aperture_size  += APERTURE_RANGE_SIZE;
 
 	/* Reserve address range used for MSI messages */
 	if (old_size < MSI_ADDR_BASE_LO &&
@@ -1512,15 +1518,16 @@ static int alloc_new_range(struct dma_ops_domain *dma_dom,
 
 	update_domain(&dma_dom->domain);
 
+	spin_unlock_irqrestore(&range->bitmap_lock, flags);
+
 	return 0;
 
 out_free:
 	update_domain(&dma_dom->domain);
 
-	free_page((unsigned long)dma_dom->aperture[index]->bitmap);
+	free_page((unsigned long)range->bitmap);
 
-	kfree(dma_dom->aperture[index]);
-	dma_dom->aperture[index] = NULL;
+	kfree(range);
 
 	return -ENOMEM;
 }
-- 
1.9.1



* [PATCH 20/23] iommu/amd: Relax locking in dma_ops path
From: Joerg Roedel @ 2015-12-22 22:21 UTC
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Remove the long hold times of the domain->lock and rely on
the bitmap_lock instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 70 ++++++++---------------------------------------
 1 file changed, 11 insertions(+), 59 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 11ee885..e98a466 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1466,8 +1466,10 @@ static int alloc_new_range(struct dma_ops_domain *dma_dom,
 		}
 	}
 
+	spin_lock_irqsave(&dma_dom->domain.lock, flags);
+
 	/* First take the bitmap_lock and then publish the range */
-	spin_lock_irqsave(&range->bitmap_lock, flags);
+	spin_lock(&range->bitmap_lock);
 
 	old_size                 = dma_dom->aperture_size;
 	dma_dom->aperture[index] = range;
@@ -1518,7 +1520,9 @@ static int alloc_new_range(struct dma_ops_domain *dma_dom,
 
 	update_domain(&dma_dom->domain);
 
-	spin_unlock_irqrestore(&range->bitmap_lock, flags);
+	spin_unlock(&range->bitmap_lock);
+
+	spin_unlock_irqrestore(&dma_dom->domain.lock, flags);
 
 	return 0;
 
@@ -2562,11 +2566,9 @@ static dma_addr_t map_page(struct device *dev, struct page *page,
 			   enum dma_data_direction dir,
 			   struct dma_attrs *attrs)
 {
-	unsigned long flags;
+	phys_addr_t paddr = page_to_phys(page) + offset;
 	struct protection_domain *domain;
-	dma_addr_t addr;
 	u64 dma_mask;
-	phys_addr_t paddr = page_to_phys(page) + offset;
 
 	INC_STATS_COUNTER(cnt_map_single);
 
@@ -2578,19 +2580,8 @@ static dma_addr_t map_page(struct device *dev, struct page *page,
 
 	dma_mask = *dev->dma_mask;
 
-	spin_lock_irqsave(&domain->lock, flags);
-
-	addr = __map_single(dev, domain->priv, paddr, size, dir, false,
+	return __map_single(dev, domain->priv, paddr, size, dir, false,
 			    dma_mask);
-	if (addr == DMA_ERROR_CODE)
-		goto out;
-
-	domain_flush_complete(domain);
-
-out:
-	spin_unlock_irqrestore(&domain->lock, flags);
-
-	return addr;
 }
 
 /*
@@ -2599,7 +2590,6 @@ out:
 static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size,
 		       enum dma_data_direction dir, struct dma_attrs *attrs)
 {
-	unsigned long flags;
 	struct protection_domain *domain;
 
 	INC_STATS_COUNTER(cnt_unmap_single);
@@ -2608,13 +2598,7 @@ static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size,
 	if (IS_ERR(domain))
 		return;
 
-	spin_lock_irqsave(&domain->lock, flags);
-
 	__unmap_single(domain->priv, dma_addr, size, dir);
-
-	domain_flush_complete(domain);
-
-	spin_unlock_irqrestore(&domain->lock, flags);
 }
 
 /*
@@ -2625,7 +2609,6 @@ static int map_sg(struct device *dev, struct scatterlist *sglist,
 		  int nelems, enum dma_data_direction dir,
 		  struct dma_attrs *attrs)
 {
-	unsigned long flags;
 	struct protection_domain *domain;
 	int i;
 	struct scatterlist *s;
@@ -2641,8 +2624,6 @@ static int map_sg(struct device *dev, struct scatterlist *sglist,
 
 	dma_mask = *dev->dma_mask;
 
-	spin_lock_irqsave(&domain->lock, flags);
-
 	for_each_sg(sglist, s, nelems, i) {
 		paddr = sg_phys(s);
 
@@ -2657,12 +2638,8 @@ static int map_sg(struct device *dev, struct scatterlist *sglist,
 			goto unmap;
 	}
 
-	domain_flush_complete(domain);
-
-out:
-	spin_unlock_irqrestore(&domain->lock, flags);
-
 	return mapped_elems;
+
 unmap:
 	for_each_sg(sglist, s, mapped_elems, i) {
 		if (s->dma_address)
@@ -2671,9 +2648,7 @@ unmap:
 		s->dma_address = s->dma_length = 0;
 	}
 
-	mapped_elems = 0;
-
-	goto out;
+	return 0;
 }
 
 /*
@@ -2684,7 +2659,6 @@ static void unmap_sg(struct device *dev, struct scatterlist *sglist,
 		     int nelems, enum dma_data_direction dir,
 		     struct dma_attrs *attrs)
 {
-	unsigned long flags;
 	struct protection_domain *domain;
 	struct scatterlist *s;
 	int i;
@@ -2695,17 +2669,11 @@ static void unmap_sg(struct device *dev, struct scatterlist *sglist,
 	if (IS_ERR(domain))
 		return;
 
-	spin_lock_irqsave(&domain->lock, flags);
-
 	for_each_sg(sglist, s, nelems, i) {
 		__unmap_single(domain->priv, s->dma_address,
 			       s->dma_length, dir);
 		s->dma_address = s->dma_length = 0;
 	}
-
-	domain_flush_complete(domain);
-
-	spin_unlock_irqrestore(&domain->lock, flags);
 }
 
 /*
@@ -2717,7 +2685,6 @@ static void *alloc_coherent(struct device *dev, size_t size,
 {
 	u64 dma_mask = dev->coherent_dma_mask;
 	struct protection_domain *domain;
-	unsigned long flags;
 	struct page *page;
 
 	INC_STATS_COUNTER(cnt_alloc_coherent);
@@ -2749,19 +2716,11 @@ static void *alloc_coherent(struct device *dev, size_t size,
 	if (!dma_mask)
 		dma_mask = *dev->dma_mask;
 
-	spin_lock_irqsave(&domain->lock, flags);
-
 	*dma_addr = __map_single(dev, domain->priv, page_to_phys(page),
 				 size, DMA_BIDIRECTIONAL, true, dma_mask);
 
-	if (*dma_addr == DMA_ERROR_CODE) {
-		spin_unlock_irqrestore(&domain->lock, flags);
+	if (*dma_addr == DMA_ERROR_CODE)
 		goto out_free;
-	}
-
-	domain_flush_complete(domain);
-
-	spin_unlock_irqrestore(&domain->lock, flags);
 
 	return page_address(page);
 
@@ -2781,7 +2740,6 @@ static void free_coherent(struct device *dev, size_t size,
 			  struct dma_attrs *attrs)
 {
 	struct protection_domain *domain;
-	unsigned long flags;
 	struct page *page;
 
 	INC_STATS_COUNTER(cnt_free_coherent);
@@ -2793,14 +2751,8 @@ static void free_coherent(struct device *dev, size_t size,
 	if (IS_ERR(domain))
 		goto free_mem;
 
-	spin_lock_irqsave(&domain->lock, flags);
-
 	__unmap_single(domain->priv, dma_addr, size, DMA_BIDIRECTIONAL);
 
-	domain_flush_complete(domain);
-
-	spin_unlock_irqrestore(&domain->lock, flags);
-
 free_mem:
 	if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
 		__free_pages(page, get_order(size));
-- 
1.9.1



* [PATCH 21/23] iommu/amd: Make dma_ops_domain->next_index percpu
  2015-12-22 22:21 [PATCH 00/23] AMD IOMMU DMA-API Scalability Improvements Joerg Roedel
                   ` (19 preceding siblings ...)
  2015-12-22 22:21 ` [PATCH 20/23] iommu/amd: Relax locking in dma_ops path Joerg Roedel
@ 2015-12-22 22:21 ` Joerg Roedel
  2015-12-22 22:21 ` [PATCH 22/23] iommu/amd: Use trylock to acquire bitmap_lock Joerg Roedel
  2015-12-22 22:21 ` [PATCH 23/23] iommu/amd: Preallocate dma_ops apertures based on dma_mask Joerg Roedel
  22 siblings, 0 replies; 24+ messages in thread
From: Joerg Roedel @ 2015-12-22 22:21 UTC (permalink / raw)
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Make this field percpu so that each CPU starts searching
for new addresses in the range where it last stopped, which
has a higher probability of still being in the cache.
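
For readers less familiar with the percpu API used in the diff:
alloc_percpu() gives every possible CPU its own instance of the
variable, and this_cpu_read()/this_cpu_write() access the current
CPU's copy while preemption is disabled, as the diff does around
dma_ops_area_alloc(). A minimal, hypothetical usage sketch (the
toy_* names are illustrative only):

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/errno.h>

/* Hypothetical percpu cursor, analogous to dma_ops_domain->next_index. */
static u32 __percpu *toy_next_index;

static int toy_cursor_init(void)
{
	int cpu;

	toy_next_index = alloc_percpu(u32);
	if (!toy_next_index)
		return -ENOMEM;

	for_each_possible_cpu(cpu)
		*per_cpu_ptr(toy_next_index, cpu) = 0;

	return 0;
}

static u32 toy_cursor_next(u32 nranges)
{
	u32 idx;

	preempt_disable();	/* this_cpu_*() needs a stable CPU */
	idx = this_cpu_read(*toy_next_index);
	this_cpu_write(*toy_next_index, (idx + 1) % nranges);
	preempt_enable();

	return idx;
}

Since each CPU keeps its own cursor, the fast path no longer bounces
a shared cache line between cores just to remember where the previous
search ended.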

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 39 +++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index e98a466..84c7da1 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -35,6 +35,7 @@
 #include <linux/msi.h>
 #include <linux/dma-contiguous.h>
 #include <linux/irqdomain.h>
+#include <linux/percpu.h>
 #include <asm/irq_remapping.h>
 #include <asm/io_apic.h>
 #include <asm/apic.h>
@@ -147,7 +148,7 @@ struct dma_ops_domain {
 	unsigned long aperture_size;
 
 	/* aperture index we start searching for free addresses */
-	unsigned long next_index;
+	u32 __percpu *next_index;
 
 	/* address space relevant data */
 	struct aperture_range *aperture[APERTURE_MAX_RANGES];
@@ -1583,18 +1584,30 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 {
 	unsigned long boundary_size, mask;
 	unsigned long address = -1;
-	int start = dom->next_index;
-	int i;
+	u32 start, i;
+
+	preempt_disable();
 
 	mask = dma_get_seg_boundary(dev);
 
+	start = this_cpu_read(*dom->next_index);
+
+	/* Sanity check - is it really necessary? */
+	if (unlikely(start > APERTURE_MAX_RANGES)) {
+		start = 0;
+		this_cpu_write(*dom->next_index, 0);
+	}
+
 	boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT :
 				   1UL << (BITS_PER_LONG - PAGE_SHIFT);
 
 	for (i = 0; i < APERTURE_MAX_RANGES; ++i) {
 		struct aperture_range *range;
+		int index;
+
+		index = (start + i) % APERTURE_MAX_RANGES;
 
-		range = dom->aperture[(start + i) % APERTURE_MAX_RANGES];
+		range = dom->aperture[index];
 
 		if (!range || range->offset >= dma_mask)
 			continue;
@@ -1604,11 +1617,13 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 						 align_mask);
 		if (address != -1) {
 			address = range->offset + (address << PAGE_SHIFT);
-			dom->next_index = i;
+			this_cpu_write(*dom->next_index, index);
 			break;
 		}
 	}
 
+	preempt_enable();
+
 	return address;
 }
 
@@ -1620,10 +1635,6 @@ static unsigned long dma_ops_alloc_addresses(struct device *dev,
 {
 	unsigned long address = -1;
 
-#ifdef CONFIG_IOMMU_STRESS
-	dom->next_index = 0;
-#endif
-
 	while (address == -1) {
 		address = dma_ops_area_alloc(dev, dom, pages,
 					     align_mask, dma_mask);
@@ -1851,6 +1862,8 @@ static void dma_ops_domain_free(struct dma_ops_domain *dom)
 	if (!dom)
 		return;
 
+	free_percpu(dom->next_index);
+
 	del_domain_from_list(&dom->domain);
 
 	free_pagetable(&dom->domain);
@@ -1873,6 +1886,7 @@ static void dma_ops_domain_free(struct dma_ops_domain *dom)
 static struct dma_ops_domain *dma_ops_domain_alloc(void)
 {
 	struct dma_ops_domain *dma_dom;
+	int cpu;
 
 	dma_dom = kzalloc(sizeof(struct dma_ops_domain), GFP_KERNEL);
 	if (!dma_dom)
@@ -1881,6 +1895,10 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
 	if (protection_domain_init(&dma_dom->domain))
 		goto free_dma_dom;
 
+	dma_dom->next_index = alloc_percpu(u32);
+	if (!dma_dom->next_index)
+		goto free_dma_dom;
+
 	dma_dom->domain.mode = PAGE_MODE_2_LEVEL;
 	dma_dom->domain.pt_root = (void *)get_zeroed_page(GFP_KERNEL);
 	dma_dom->domain.flags = PD_DMA_OPS_MASK;
@@ -1898,8 +1916,9 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
 	 * a valid dma-address. So we can use 0 as error value
 	 */
 	dma_dom->aperture[0]->bitmap[0] = 1;
-	dma_dom->next_index = 0;
 
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(dma_dom->next_index, cpu) = 0;
 
 	return dma_dom;
 
-- 
1.9.1



* [PATCH 22/23] iommu/amd: Use trylock to acquire bitmap_lock
  2015-12-22 22:21 [PATCH 00/23] AMD IOMMU DMA-API Scalability Improvements Joerg Roedel
                   ` (20 preceding siblings ...)
  2015-12-22 22:21 ` [PATCH 21/23] iommu/amd: Make dma_ops_domain->next_index percpu Joerg Roedel
@ 2015-12-22 22:21 ` Joerg Roedel
  2015-12-22 22:21 ` [PATCH 23/23] iommu/amd: Preallocate dma_ops apertures based on dma_mask Joerg Roedel
  22 siblings, 0 replies; 24+ messages in thread
From: Joerg Roedel @ 2015-12-22 22:21 UTC (permalink / raw)
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

First search for a non-contended aperture with
spin_trylock(); only if all apertures are contended, fall
back to spinning on the bitmap_lock.
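
Sketched with the same hypothetical toy_range type as under patch 20
(an illustration of the two-pass idea, not the driver code): the
first pass skips any aperture whose bitmap_lock is contended; only if
that yields nothing does a second pass spin on the locks.

static long toy_area_alloc(struct toy_range **ranges, int nranges,
			   unsigned int npages)
{
	unsigned long flags, pfn;
	bool first = true;
	long found = -1;
	int i;

again:
	for (i = 0; i < nranges; ++i) {
		struct toy_range *range = ranges[i];

		if (first) {
			/* Skip contended apertures on the first pass. */
			if (!spin_trylock_irqsave(&range->bitmap_lock, flags))
				continue;
		} else {
			spin_lock_irqsave(&range->bitmap_lock, flags);
		}

		pfn = bitmap_find_next_zero_area(range->bitmap, range->pages,
						 0, npages, 0);
		if (pfn < range->pages) {
			bitmap_set(range->bitmap, pfn, npages);
			found = pfn;
		}
		spin_unlock_irqrestore(&range->bitmap_lock, flags);

		if (found >= 0)
			break;
	}

	if (found < 0 && first) {
		/* Everything was contended or full: retry, spinning. */
		first = false;
		goto again;
	}

	return found;
}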

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 84c7da1..eed355c 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1542,7 +1542,8 @@ static dma_addr_t dma_ops_aperture_alloc(struct dma_ops_domain *dom,
 					 unsigned long pages,
 					 unsigned long dma_mask,
 					 unsigned long boundary_size,
-					 unsigned long align_mask)
+					 unsigned long align_mask,
+					 bool trylock)
 {
 	unsigned long offset, limit, flags;
 	dma_addr_t address;
@@ -1552,7 +1553,13 @@ static dma_addr_t dma_ops_aperture_alloc(struct dma_ops_domain *dom,
 	limit  = iommu_device_max_index(APERTURE_RANGE_PAGES, offset,
 					dma_mask >> PAGE_SHIFT);
 
-	spin_lock_irqsave(&range->bitmap_lock, flags);
+	if (trylock) {
+		if (!spin_trylock_irqsave(&range->bitmap_lock, flags))
+			return -1;
+	} else {
+		spin_lock_irqsave(&range->bitmap_lock, flags);
+	}
+
 	address = iommu_area_alloc(range->bitmap, limit, range->next_bit,
 				   pages, offset, boundary_size, align_mask);
 	if (address == -1) {
@@ -1584,12 +1591,14 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 {
 	unsigned long boundary_size, mask;
 	unsigned long address = -1;
+	bool first = true;
 	u32 start, i;
 
 	preempt_disable();
 
 	mask = dma_get_seg_boundary(dev);
 
+again:
 	start = this_cpu_read(*dom->next_index);
 
 	/* Sanity check - is it really necessary? */
@@ -1614,7 +1623,7 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 
 		address = dma_ops_aperture_alloc(dom, range, pages,
 						 dma_mask, boundary_size,
-						 align_mask);
+						 align_mask, first);
 		if (address != -1) {
 			address = range->offset + (address << PAGE_SHIFT);
 			this_cpu_write(*dom->next_index, index);
@@ -1622,6 +1631,11 @@ static unsigned long dma_ops_area_alloc(struct device *dev,
 		}
 	}
 
+	if (address == -1 && first) {
+		first = false;
+		goto again;
+	}
+
 	preempt_enable();
 
 	return address;
-- 
1.9.1



* [PATCH 23/23] iommu/amd: Preallocate dma_ops apertures based on dma_mask
  2015-12-22 22:21 [PATCH 00/23] AMD IOMMU DMA-API Scalability Improvements Joerg Roedel
                   ` (21 preceding siblings ...)
  2015-12-22 22:21 ` [PATCH 22/23] iommu/amd: Use trylock to acquire bitmap_lock Joerg Roedel
@ 2015-12-22 22:21 ` Joerg Roedel
  22 siblings, 0 replies; 24+ messages in thread
From: Joerg Roedel @ 2015-12-22 22:21 UTC (permalink / raw)
  To: iommu; +Cc: linux-kernel, joro, jroedel

From: Joerg Roedel <jroedel@suse.de>

Preallocate between 4 and 8 apertures when a device gets
its dma_mask. With more apertures we significantly reduce
contention on the domain lock.
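
The sizing policy in set_dma_mask() below condenses to a few
comparisons; here is a hypothetical restatement (toy_max_apertures is
not a driver function), assuming the 128 MiB apertures established
earlier in the series:

#include <linux/kernel.h>
#include <linux/cpumask.h>
#include <linux/dma-mapping.h>

/* Hypothetical restatement of the sizing policy in set_dma_mask(). */
static int toy_max_apertures(u64 mask)
{
	int n = 1;

	if (mask == DMA_BIT_MASK(64))
		n = 8;		/* full 64-bit mask: most address room */
	else if (mask > DMA_BIT_MASK(32))
		n = 4;		/* above 32 bits, but not a full 64-bit mask */

	/* More apertures than CPUs cannot reduce contention further. */
	return min_t(int, n, num_online_cpus());
}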

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/iommu/amd_iommu.c | 60 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 53 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index eed355c..6f6502d 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1892,6 +1892,23 @@ static void dma_ops_domain_free(struct dma_ops_domain *dom)
 	kfree(dom);
 }
 
+static int dma_ops_domain_alloc_apertures(struct dma_ops_domain *dma_dom,
+					  int max_apertures)
+{
+	int ret, i, apertures;
+
+	apertures = dma_dom->aperture_size >> APERTURE_RANGE_SHIFT;
+	ret       = 0;
+
+	for (i = apertures; i < max_apertures; ++i) {
+		ret = alloc_new_range(dma_dom, false, GFP_KERNEL);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
 /*
  * Allocates a new protection domain usable for the dma_ops functions.
  * It also initializes the page table and the address allocator data
@@ -2800,14 +2817,43 @@ static int amd_iommu_dma_supported(struct device *dev, u64 mask)
 	return check_device(dev);
 }
 
+static int set_dma_mask(struct device *dev, u64 mask)
+{
+	struct protection_domain *domain;
+	int max_apertures = 1;
+
+	domain = get_domain(dev);
+	if (IS_ERR(domain))
+		return PTR_ERR(domain);
+
+	if (mask == DMA_BIT_MASK(64))
+		max_apertures = 8;
+	else if (mask > DMA_BIT_MASK(32))
+		max_apertures = 4;
+
+	/*
+	 * To prevent lock contention it doesn't make sense to allocate more
+	 * apertures than online cpus
+	 */
+	if (max_apertures > num_online_cpus())
+		max_apertures = num_online_cpus();
+
+	if (dma_ops_domain_alloc_apertures(domain->priv, max_apertures))
+		dev_err(dev, "Can't allocate %d iommu apertures\n",
+			max_apertures);
+
+	return 0;
+}
+
 static struct dma_map_ops amd_iommu_dma_ops = {
-	.alloc = alloc_coherent,
-	.free = free_coherent,
-	.map_page = map_page,
-	.unmap_page = unmap_page,
-	.map_sg = map_sg,
-	.unmap_sg = unmap_sg,
-	.dma_supported = amd_iommu_dma_supported,
+	.alloc		= alloc_coherent,
+	.free		= free_coherent,
+	.map_page	= map_page,
+	.unmap_page	= unmap_page,
+	.map_sg		= map_sg,
+	.unmap_sg	= unmap_sg,
+	.dma_supported	= amd_iommu_dma_supported,
+	.set_dma_mask	= set_dma_mask,
 };
 
 int __init amd_iommu_init_api(void)
-- 
1.9.1



Thread overview: 24+ messages
2015-12-22 22:21 [PATCH 00/23] AMD IOMMU DMA-API Scalability Improvements Joerg Roedel
2015-12-22 22:21 ` [PATCH 01/23] iommu/amd: Warn only once on unexpected pte value Joerg Roedel
2015-12-22 22:21 ` [PATCH 02/23] iommu/amd: Move 'struct dma_ops_domain' definition to amd_iommu.c Joerg Roedel
2015-12-22 22:21 ` [PATCH 03/23] iommu/amd: Introduce bitmap_lock in struct aperture_range Joerg Roedel
2015-12-22 22:21 ` [PATCH 04/23] iommu/amd: Flush IOMMU TLB on __map_single error path Joerg Roedel
2015-12-22 22:21 ` [PATCH 05/23] iommu/amd: Flush the IOMMU TLB before the addresses are freed Joerg Roedel
2015-12-22 22:21 ` [PATCH 06/23] iommu/amd: Pass correct shift to iommu_area_alloc() Joerg Roedel
2015-12-22 22:21 ` [PATCH 07/23] iommu/amd: Add dma_ops_aperture_alloc() function Joerg Roedel
2015-12-22 22:21 ` [PATCH 08/23] iommu/amd: Move aperture_range.offset to another cache-line Joerg Roedel
2015-12-22 22:21 ` [PATCH 09/23] iommu/amd: Retry address allocation within one aperture Joerg Roedel
2015-12-22 22:21 ` [PATCH 10/23] iommu/amd: Flush iommu tlb in dma_ops_aperture_alloc() Joerg Roedel
2015-12-22 22:21 ` [PATCH 11/23] iommu/amd: Remove 'start' parameter from dma_ops_area_alloc Joerg Roedel
2015-12-22 22:21 ` [PATCH 12/23] iommu/amd: Rename dma_ops_domain->next_address to next_index Joerg Roedel
2015-12-22 22:21 ` [PATCH 13/23] iommu/amd: Flush iommu tlb in dma_ops_free_addresses Joerg Roedel
2015-12-22 22:21 ` [PATCH 14/23] iommu/amd: Iterate over all aperture ranges in dma_ops_area_alloc Joerg Roedel
2015-12-22 22:21 ` [PATCH 15/23] iommu/amd: Remove need_flush from struct dma_ops_domain Joerg Roedel
2015-12-22 22:21 ` [PATCH 16/23] iommu/amd: Optimize dma_ops_free_addresses Joerg Roedel
2015-12-22 22:21 ` [PATCH 17/23] iommu/amd: Allocate new aperture ranges in dma_ops_alloc_addresses Joerg Roedel
2015-12-22 22:21 ` [PATCH 18/23] iommu/amd: Build io page-tables with cmpxchg64 Joerg Roedel
2015-12-22 22:21 ` [PATCH 19/23] iommu/amd: Initialize new aperture range before making it visible Joerg Roedel
2015-12-22 22:21 ` [PATCH 20/23] iommu/amd: Relax locking in dma_ops path Joerg Roedel
2015-12-22 22:21 ` [PATCH 21/23] iommu/amd: Make dma_ops_domain->next_index percpu Joerg Roedel
2015-12-22 22:21 ` [PATCH 22/23] iommu/amd: Use trylock to acquire bitmap_lock Joerg Roedel
2015-12-22 22:21 ` [PATCH 23/23] iommu/amd: Preallocate dma_ops apertures based on dma_mask Joerg Roedel
