linux-kernel.vger.kernel.org archive mirror
* [Patch v3 0/3] iommu: reduce spinlock contention on fast path
@ 2019-12-18  4:39 Cong Wang
  2019-12-18  4:39 ` [Patch v3 1/3] iommu: avoid unnecessary magazine allocations Cong Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Cong Wang @ 2019-12-18  4:39 UTC (permalink / raw)
  To: iommu; +Cc: robin.murphy, linux-kernel, joro, Cong Wang

This patchset contains three small optimizations for the global spinlock
contention in the IOVA cache. Our memcache perf test shows this reduced its
p999 latency by 45% on AMD when the IOMMU is enabled.
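
For reference, the rcache layout these patches touch (drivers/iommu/iova.c,
also quoted later in this thread) looks like this:

	struct iova_rcache {
		spinlock_t lock;
		unsigned long depot_size;
		struct iova_magazine *depot[MAX_GLOBAL_MAGS];
		struct iova_cpu_rcache __percpu *cpu_rcaches;
	};

The contended locks are this per-rcache lock guarding the global depot
and the domain-wide iovad->iova_rbtree_lock.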

(Resending v3 at Joerg's request.)

Cong Wang (3):
  iommu: avoid unnecessary magazine allocations
  iommu: optimize iova_magazine_free_pfns()
  iommu: avoid taking iova_rbtree_lock twice

---
v3: improve changelog, no code change
v2: fix a memory leak

 drivers/iommu/iova.c | 75 ++++++++++++++++++++++++++------------------
 1 file changed, 45 insertions(+), 30 deletions(-)

-- 
2.21.0



* [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
  2019-12-18  4:39 [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
@ 2019-12-18  4:39 ` Cong Wang
  2020-01-21 11:11   ` Robin Murphy
  2019-12-18  4:39 ` [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns() Cong Wang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2019-12-18  4:39 UTC (permalink / raw)
  To: iommu; +Cc: robin.murphy, linux-kernel, joro, Cong Wang, John Garry

The IOVA cache algorithm implemented in IOMMU code does not
exactly match the original algorithm described in the paper
"Magazines and Vmem: Extending the Slab Allocator to Many
CPUs and Arbitrary Resources".

In particular, it doesn't need to free the loaded empty magazine
when putting it back into the global depot. To make this work, we
have to pre-allocate magazines in the depot and only recycle them
when all of them are full.

Before this patch, rcache->depot[] contains either full or
freed entries; after this patch, it contains either full or
empty (but allocated) entries.

This also includes a few other changes to make the implementation
exactly match the pseudo code in the paper.
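
As a minimal user-space sketch (names and sizes here are illustrative
only, not the kernel's, and all locking is omitted), the put path this
patch moves to looks roughly like:

	#include <stdbool.h>

	#define MAG_SIZE 128

	struct magazine {
		unsigned long size;
		unsigned long pfns[MAG_SIZE];
	};

	/* Returns true if pfn was cached; on every successful path the
	 * magazines merely rotate, nothing is freed or allocated. */
	static bool mag_put(struct magazine **loaded, struct magazine **prev,
			    struct magazine **depot, unsigned long *depot_size,
			    unsigned long depot_max, unsigned long pfn)
	{
		struct magazine *tmp;

		if ((*loaded)->size < MAG_SIZE)	/* fast path */
			goto push;
		if ((*prev)->size == 0) {	/* prev is empty: reuse it */
			tmp = *loaded; *loaded = *prev; *prev = tmp;
			goto push;
		}
		if (*depot_size < depot_max) {
			/* Park the full prev in the depot slot and take
			 * the pre-allocated empty magazine sitting there. */
			tmp = depot[*depot_size];
			depot[*depot_size] = *prev;
			*prev = *loaded;
			*loaded = tmp;
			(*depot_size)++;
			goto push;
		}
		return false;	/* both magazines and the depot are full */
	push:
		(*loaded)->pfns[(*loaded)->size++] = pfn;
		return true;
	}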

Cc: Joerg Roedel <joro@8bytes.org>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 drivers/iommu/iova.c | 45 +++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 41c605b0058f..cb473ddce4cf 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad)
 	struct iova_cpu_rcache *cpu_rcache;
 	struct iova_rcache *rcache;
 	unsigned int cpu;
-	int i;
+	int i, j;
 
 	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
 		rcache = &iovad->rcaches[i];
 		spin_lock_init(&rcache->lock);
 		rcache->depot_size = 0;
+		for (j = 0; j < MAX_GLOBAL_MAGS; ++j) {
+			rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL);
+			WARN_ON(!rcache->depot[j]);
+		}
 		rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
 		if (WARN_ON(!rcache->cpu_rcaches))
 			continue;
@@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 
 	if (!iova_magazine_full(cpu_rcache->loaded)) {
 		can_insert = true;
-	} else if (!iova_magazine_full(cpu_rcache->prev)) {
+	} else if (iova_magazine_empty(cpu_rcache->prev)) {
 		swap(cpu_rcache->prev, cpu_rcache->loaded);
 		can_insert = true;
 	} else {
-		struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
+		spin_lock(&rcache->lock);
+		if (rcache->depot_size < MAX_GLOBAL_MAGS) {
+			swap(rcache->depot[rcache->depot_size], cpu_rcache->prev);
+			swap(cpu_rcache->prev, cpu_rcache->loaded);
+			rcache->depot_size++;
+			can_insert = true;
+		} else {
+			mag_to_free = cpu_rcache->loaded;
+		}
+		spin_unlock(&rcache->lock);
+
+		if (mag_to_free) {
+			struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
 
-		if (new_mag) {
-			spin_lock(&rcache->lock);
-			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
-				rcache->depot[rcache->depot_size++] =
-						cpu_rcache->loaded;
+			if (new_mag) {
+				cpu_rcache->loaded = new_mag;
+				can_insert = true;
 			} else {
-				mag_to_free = cpu_rcache->loaded;
+				mag_to_free = NULL;
 			}
-			spin_unlock(&rcache->lock);
-
-			cpu_rcache->loaded = new_mag;
-			can_insert = true;
 		}
 	}
 
@@ -963,14 +973,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
 
 	if (!iova_magazine_empty(cpu_rcache->loaded)) {
 		has_pfn = true;
-	} else if (!iova_magazine_empty(cpu_rcache->prev)) {
+	} else if (iova_magazine_full(cpu_rcache->prev)) {
 		swap(cpu_rcache->prev, cpu_rcache->loaded);
 		has_pfn = true;
 	} else {
 		spin_lock(&rcache->lock);
 		if (rcache->depot_size > 0) {
-			iova_magazine_free(cpu_rcache->loaded);
-			cpu_rcache->loaded = rcache->depot[--rcache->depot_size];
+			swap(rcache->depot[rcache->depot_size - 1], cpu_rcache->prev);
+			swap(cpu_rcache->prev, cpu_rcache->loaded);
+			rcache->depot_size--;
 			has_pfn = true;
 		}
 		spin_unlock(&rcache->lock);
@@ -1019,7 +1030,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
 			iova_magazine_free(cpu_rcache->prev);
 		}
 		free_percpu(rcache->cpu_rcaches);
-		for (j = 0; j < rcache->depot_size; ++j)
+		for (j = 0; j < MAX_GLOBAL_MAGS; ++j)
 			iova_magazine_free(rcache->depot[j]);
 	}
 }
-- 
2.21.0



* [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns()
  2019-12-18  4:39 [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
  2019-12-18  4:39 ` [Patch v3 1/3] iommu: avoid unnecessary magazine allocations Cong Wang
@ 2019-12-18  4:39 ` Cong Wang
  2020-01-21  9:52   ` Robin Murphy
  2019-12-18  4:39 ` [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice Cong Wang
  2020-01-20 23:10 ` [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
  3 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2019-12-18  4:39 UTC (permalink / raw)
  To: iommu; +Cc: robin.murphy, linux-kernel, joro, Cong Wang, John Garry

If the magazine is empty, iova_magazine_free_pfns() should
be a no-op; however, the current NULL check misses the case of
mag->size == 0. So we should just call iova_magazine_empty() instead.

This should reduce the contention on iovad->iova_rbtree_lock
a little bit, though not by much.

Cc: Joerg Roedel <joro@8bytes.org>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 drivers/iommu/iova.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cb473ddce4cf..184d4c0e20b5 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -797,13 +797,23 @@ static void iova_magazine_free(struct iova_magazine *mag)
 	kfree(mag);
 }
 
+static bool iova_magazine_full(struct iova_magazine *mag)
+{
+	return (mag && mag->size == IOVA_MAG_SIZE);
+}
+
+static bool iova_magazine_empty(struct iova_magazine *mag)
+{
+	return (!mag || mag->size == 0);
+}
+
 static void
 iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 {
 	unsigned long flags;
 	int i;
 
-	if (!mag)
+	if (iova_magazine_empty(mag))
 		return;
 
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
@@ -820,16 +830,6 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 	mag->size = 0;
 }
 
-static bool iova_magazine_full(struct iova_magazine *mag)
-{
-	return (mag && mag->size == IOVA_MAG_SIZE);
-}
-
-static bool iova_magazine_empty(struct iova_magazine *mag)
-{
-	return (!mag || mag->size == 0);
-}
-
 static unsigned long iova_magazine_pop(struct iova_magazine *mag,
 				       unsigned long limit_pfn)
 {
-- 
2.21.0



* [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice
  2019-12-18  4:39 [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
  2019-12-18  4:39 ` [Patch v3 1/3] iommu: avoid unnecessary magazine allocations Cong Wang
  2019-12-18  4:39 ` [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns() Cong Wang
@ 2019-12-18  4:39 ` Cong Wang
  2019-12-19  9:51   ` John Garry
  2020-01-21  9:56   ` Robin Murphy
  2020-01-20 23:10 ` [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
  3 siblings, 2 replies; 17+ messages in thread
From: Cong Wang @ 2019-12-18  4:39 UTC (permalink / raw)
  To: iommu; +Cc: robin.murphy, linux-kernel, joro, Cong Wang, John Garry

Both find_iova() and __free_iova() take the iova_rbtree_lock, so
there is no reason to take and release it twice inside
free_iova().

Fold them into one critical section by calling the unlocked
versions instead.

Cc: Joerg Roedel <joro@8bytes.org>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 drivers/iommu/iova.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 184d4c0e20b5..f46f8f794678 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
 void
 free_iova(struct iova_domain *iovad, unsigned long pfn)
 {
-	struct iova *iova = find_iova(iovad, pfn);
+	unsigned long flags;
+	struct iova *iova;
 
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	iova = private_find_iova(iovad, pfn);
 	if (iova)
-		__free_iova(iovad, iova);
+		private_free_iova(iovad, iova);
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 
 }
 EXPORT_SYMBOL_GPL(free_iova);
-- 
2.21.0



* Re: [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice
  2019-12-18  4:39 ` [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice Cong Wang
@ 2019-12-19  9:51   ` John Garry
  2020-01-21  9:56   ` Robin Murphy
  1 sibling, 0 replies; 17+ messages in thread
From: John Garry @ 2019-12-19  9:51 UTC (permalink / raw)
  To: Cong Wang, iommu; +Cc: robin.murphy, linux-kernel, joro

On 18/12/2019 04:39, Cong Wang wrote:
> Both find_iova() and __free_iova() take the iova_rbtree_lock, so
> there is no reason to take and release it twice inside
> free_iova().
> 
> Fold them into one critical section by calling the unlocked
> versions instead.
> 
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

FWIW:
Reviewed-by: John Garry <john.garry@huawei.com>

> ---
>   drivers/iommu/iova.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 184d4c0e20b5..f46f8f794678 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
>   void
>   free_iova(struct iova_domain *iovad, unsigned long pfn)
>   {
> -	struct iova *iova = find_iova(iovad, pfn);
> +	unsigned long flags;
> +	struct iova *iova;
>   
> +	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +	iova = private_find_iova(iovad, pfn);
>   	if (iova)
> -		__free_iova(iovad, iova);
> +		private_free_iova(iovad, iova);
> +	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>   
>   }
>   EXPORT_SYMBOL_GPL(free_iova);
> 



* Re: [Patch v3 0/3] iommu: reduce spinlock contention on fast path
  2019-12-18  4:39 [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
                   ` (2 preceding siblings ...)
  2019-12-18  4:39 ` [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice Cong Wang
@ 2020-01-20 23:10 ` Cong Wang
  3 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2020-01-20 23:10 UTC (permalink / raw)
  To: iommu; +Cc: Robin Murphy, LKML, Joerg Roedel

On Tue, Dec 17, 2019 at 8:40 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> This patchset contains three small optimizations for the global spinlock
> contention in the IOVA cache. Our memcache perf test shows this reduced its
> p999 latency by 45% on AMD when the IOMMU is enabled.
>
> (Resending v3 at Joerg's request.)

Hi, Joerg

Can you take these patches?

Thanks!


* Re: [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns()
  2019-12-18  4:39 ` [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns() Cong Wang
@ 2020-01-21  9:52   ` Robin Murphy
  2020-01-21 17:29     ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2020-01-21  9:52 UTC (permalink / raw)
  To: Cong Wang, iommu; +Cc: linux-kernel, joro, John Garry

On 18/12/2019 4:39 am, Cong Wang wrote:
> If the magazine is empty, iova_magazine_free_pfns() should
> be a no-op; however, the current NULL check misses the case of
> mag->size == 0. So we should just call iova_magazine_empty() instead.
> 
> This should reduce the contention on iovad->iova_rbtree_lock
> a little bit, though not by much.

Have you measured that in any way? AFAICS the only time this can get 
called with a non-full magazine is in the CPU hotplug callback, where 
the impact of taking the rbtree lock and immediately releasing it seems 
unlikely to be significant on top of everything else involved in that 
operation.

Robin.

> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>   drivers/iommu/iova.c | 22 +++++++++++-----------
>   1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index cb473ddce4cf..184d4c0e20b5 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -797,13 +797,23 @@ static void iova_magazine_free(struct iova_magazine *mag)
>   	kfree(mag);
>   }
>   
> +static bool iova_magazine_full(struct iova_magazine *mag)
> +{
> +	return (mag && mag->size == IOVA_MAG_SIZE);
> +}
> +
> +static bool iova_magazine_empty(struct iova_magazine *mag)
> +{
> +	return (!mag || mag->size == 0);
> +}
> +
>   static void
>   iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
>   {
>   	unsigned long flags;
>   	int i;
>   
> -	if (!mag)
> +	if (iova_magazine_empty(mag))
>   		return;
>   
>   	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> @@ -820,16 +830,6 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
>   	mag->size = 0;
>   }
>   
> -static bool iova_magazine_full(struct iova_magazine *mag)
> -{
> -	return (mag && mag->size == IOVA_MAG_SIZE);
> -}
> -
> -static bool iova_magazine_empty(struct iova_magazine *mag)
> -{
> -	return (!mag || mag->size == 0);
> -}
> -
>   static unsigned long iova_magazine_pop(struct iova_magazine *mag,
>   				       unsigned long limit_pfn)
>   {
> 


* Re: [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice
  2019-12-18  4:39 ` [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice Cong Wang
  2019-12-19  9:51   ` John Garry
@ 2020-01-21  9:56   ` Robin Murphy
  2020-03-03 11:33     ` John Garry
  1 sibling, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2020-01-21  9:56 UTC (permalink / raw)
  To: Cong Wang, iommu; +Cc: linux-kernel, joro, John Garry

On 18/12/2019 4:39 am, Cong Wang wrote:
> Both find_iova() and __free_iova() take the iova_rbtree_lock, so
> there is no reason to take and release it twice inside
> free_iova().
> 
> Fold them into one critical section by calling the unlocked
> versions instead.

Makes sense to me.

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>   drivers/iommu/iova.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 184d4c0e20b5..f46f8f794678 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
>   void
>   free_iova(struct iova_domain *iovad, unsigned long pfn)
>   {
> -	struct iova *iova = find_iova(iovad, pfn);
> +	unsigned long flags;
> +	struct iova *iova;
>   
> +	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +	iova = private_find_iova(iovad, pfn);
>   	if (iova)
> -		__free_iova(iovad, iova);
> +		private_free_iova(iovad, iova);
> +	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>   
>   }
>   EXPORT_SYMBOL_GPL(free_iova);
> 


* Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
  2019-12-18  4:39 ` [Patch v3 1/3] iommu: avoid unnecessary magazine allocations Cong Wang
@ 2020-01-21 11:11   ` Robin Murphy
  2020-01-21 17:21     ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2020-01-21 11:11 UTC (permalink / raw)
  To: Cong Wang, iommu; +Cc: linux-kernel, joro, John Garry

On 18/12/2019 4:39 am, Cong Wang wrote:
> The IOVA cache algorithm implemented in IOMMU code does not
> exactly match the original algorithm described in the paper
> "Magazines and Vmem: Extending the Slab Allocator to Many
> CPUs and Arbitrary Resources".
> 
> In particular, it doesn't need to free the loaded empty magazine
> when putting it back into the global depot. To make this work, we
> have to pre-allocate magazines in the depot and only recycle them
> when all of them are full.
> 
> Before this patch, rcache->depot[] contains either full or
> freed entries; after this patch, it contains either full or
> empty (but allocated) entries.

How much additional memory overhead does this impose (particularly on 
systems that may have many domains mostly used for large, long-term 
mappings)? I'm wary that trying to micro-optimise for the "churn network 
packets as fast as possible" case may penalise every other case, 
potentially quite badly. Lower-end embedded systems are using IOMMUs in 
front of their GPUs, video codecs, etc. precisely because they *don't* 
have much memory to spare (and thus need to scrape together large 
buffers out of whatever pages they can find).

But on the other hand, if we were to go down this route, then why is 
there any dynamic allocation/freeing left at all? Once both the depot 
and the rcaches are preallocated, then AFAICS it would make more sense 
to rework the overflow case in __iova_rcache_insert() to just free the 
IOVAs and swap the empty mag around rather than destroying and 
recreating it entirely.

Perhaps there's a reasonable compromise wherein we don't preallocate, 
but still 'free' empty magazines back to the depot, such that busy 
domains will quickly reach a steady-state. In fact, having now dug up 
the paper at this point of writing this reply, that appears to be what 
fig. 3.1b describes anyway - I don't see any mention of preallocating 
the depot.

Robin.

> 
> This also includes a few other changes to make the implementation
> exactly match the pseudo code in the paper.
> 
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>   drivers/iommu/iova.c | 45 +++++++++++++++++++++++++++-----------------
>   1 file changed, 28 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 41c605b0058f..cb473ddce4cf 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad)
>   	struct iova_cpu_rcache *cpu_rcache;
>   	struct iova_rcache *rcache;
>   	unsigned int cpu;
> -	int i;
> +	int i, j;
>   
>   	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
>   		rcache = &iovad->rcaches[i];
>   		spin_lock_init(&rcache->lock);
>   		rcache->depot_size = 0;
> +		for (j = 0; j < MAX_GLOBAL_MAGS; ++j) {
> +			rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL);
> +			WARN_ON(!rcache->depot[j]);
> +		}
>   		rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
>   		if (WARN_ON(!rcache->cpu_rcaches))
>   			continue;
> @@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>   
>   	if (!iova_magazine_full(cpu_rcache->loaded)) {
>   		can_insert = true;
> -	} else if (!iova_magazine_full(cpu_rcache->prev)) {
> +	} else if (iova_magazine_empty(cpu_rcache->prev)) {
>   		swap(cpu_rcache->prev, cpu_rcache->loaded);
>   		can_insert = true;
>   	} else {
> -		struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
> +		spin_lock(&rcache->lock);
> +		if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> +			swap(rcache->depot[rcache->depot_size], cpu_rcache->prev);
> +			swap(cpu_rcache->prev, cpu_rcache->loaded);
> +			rcache->depot_size++;
> +			can_insert = true;
> +		} else {
> +			mag_to_free = cpu_rcache->loaded;
> +		}
> +		spin_unlock(&rcache->lock);
> +
> +		if (mag_to_free) {
> +			struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
>   
> -		if (new_mag) {
> -			spin_lock(&rcache->lock);
> -			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> -				rcache->depot[rcache->depot_size++] =
> -						cpu_rcache->loaded;
> +			if (new_mag) {
> +				cpu_rcache->loaded = new_mag;
> +				can_insert = true;
>   			} else {
> -				mag_to_free = cpu_rcache->loaded;
> +				mag_to_free = NULL;
>   			}
> -			spin_unlock(&rcache->lock);
> -
> -			cpu_rcache->loaded = new_mag;
> -			can_insert = true;
>   		}
>   	}
>   
> @@ -963,14 +973,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
>   
>   	if (!iova_magazine_empty(cpu_rcache->loaded)) {
>   		has_pfn = true;
> -	} else if (!iova_magazine_empty(cpu_rcache->prev)) {
> +	} else if (iova_magazine_full(cpu_rcache->prev)) {
>   		swap(cpu_rcache->prev, cpu_rcache->loaded);
>   		has_pfn = true;
>   	} else {
>   		spin_lock(&rcache->lock);
>   		if (rcache->depot_size > 0) {
> -			iova_magazine_free(cpu_rcache->loaded);
> -			cpu_rcache->loaded = rcache->depot[--rcache->depot_size];
> +			swap(rcache->depot[rcache->depot_size - 1], cpu_rcache->prev);
> +			swap(cpu_rcache->prev, cpu_rcache->loaded);
> +			rcache->depot_size--;
>   			has_pfn = true;
>   		}
>   		spin_unlock(&rcache->lock);
> @@ -1019,7 +1030,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
>   			iova_magazine_free(cpu_rcache->prev);
>   		}
>   		free_percpu(rcache->cpu_rcaches);
> -		for (j = 0; j < rcache->depot_size; ++j)
> +		for (j = 0; j < MAX_GLOBAL_MAGS; ++j)
>   			iova_magazine_free(rcache->depot[j]);
>   	}
>   }
> 


* Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
  2020-01-21 11:11   ` Robin Murphy
@ 2020-01-21 17:21     ` Cong Wang
  2020-01-22 17:07       ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2020-01-21 17:21 UTC (permalink / raw)
  To: Robin Murphy; +Cc: iommu, LKML, Joerg Roedel, John Garry

On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 18/12/2019 4:39 am, Cong Wang wrote:
> > The IOVA cache algorithm implemented in IOMMU code does not
> > exactly match the original algorithm described in the paper
> > "Magazines and Vmem: Extending the Slab Allocator to Many
> > CPUs and Arbitrary Resources".
> >
> > In particular, it doesn't need to free the loaded empty magazine
> > when putting it back into the global depot. To make this work, we
> > have to pre-allocate magazines in the depot and only recycle them
> > when all of them are full.
> >
> > Before this patch, rcache->depot[] contains either full or
> > freed entries; after this patch, it contains either full or
> > empty (but allocated) entries.
>
> How much additional memory overhead does this impose (particularly on
> systems that may have many domains mostly used for large, long-term
> mappings)? I'm wary that trying to micro-optimise for the "churn network
> packets as fast as possible" case may penalise every other case,
> potentially quite badly. Lower-end embedded systems are using IOMMUs in
> front of their GPUs, video codecs, etc. precisely because they *don't*
> have much memory to spare (and thus need to scrape together large
> buffers out of whatever pages they can find).

The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
which is roughly 192K per domain.
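
(Unpacking the factors: presumably MAX_GLOBAL_MAGS (32) pre-allocated
depot slots, times IOVA_RANGE_CACHE_MAX_SIZE (6) rcaches, times one
magazine of (1 + IOVA_MAG_SIZE) * 8 = 129 * 8 = 1032 bytes, i.e. a size
word plus 128 pfn slots.)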

>
> But on the other hand, if we were to go down this route, then why is
> there any dynamic allocation/freeing left at all? Once both the depot
> and the rcaches are preallocated, then AFAICS it would make more sense
> to rework the overflow case in __iova_rcache_insert() to just free the
> IOVAs and swap the empty mag around rather than destroying and
> recreating it entirely.

That's because the algorithm requires a swap(), which can't be done with
statically allocated magazines. I had the same thought initially but gave it
up quickly when I realized this.

If you are suggesting changing the algorithm, that is not a goal of this
patchset. I do have a plan to search for a better algorithm, as the IOMMU
performance still sucks (compared to no IOMMU) after this patchset,
but once again, I do not want to change it in this patchset.

(My ultimate goal is to find a spinlock-free algorithm; otherwise there is
no way to get close to no-IOMMU performance.)

>
> Perhaps there's a reasonable compromise wherein we don't preallocate,
> but still 'free' empty magazines back to the depot, such that busy
> domains will quickly reach a steady-state. In fact, having now dug up
> the paper at this point of writing this reply, that appears to be what
> fig. 3.1b describes anyway - I don't see any mention of preallocating
> the depot.

That paper missed a lot of things; it doesn't even recommend a size
for the depot or the percpu cache. For an implementation, we still have to
think about those details, including whether to preallocate memory.

Thanks.


* Re: [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns()
  2020-01-21  9:52   ` Robin Murphy
@ 2020-01-21 17:29     ` Cong Wang
  2020-01-22 17:34       ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2020-01-21 17:29 UTC (permalink / raw)
  To: Robin Murphy; +Cc: iommu, LKML, Joerg Roedel, John Garry

On Tue, Jan 21, 2020 at 1:52 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 18/12/2019 4:39 am, Cong Wang wrote:
> > If the magazine is empty, iova_magazine_free_pfns() should
> > be a no-op; however, the current NULL check misses the case of
> > mag->size == 0. So we should just call iova_magazine_empty() instead.
> >
> > This should reduce the contention on iovad->iova_rbtree_lock
> > a little bit, though not by much.
>
> Have you measured that in any way? AFAICS the only time this can get
> called with a non-full magazine is in the CPU hotplug callback, where
> the impact of taking the rbtree lock and immediately releasing it seems
> unlikely to be significant on top of everything else involved in that
> operation.

This patchset was only tested as a whole; it is not easy to deploy
each patch to production and test it separately.

Is there anything wrong with optimizing a CPU hotplug path? :) And
it is called from alloc_iova_fast() too when, for example, the IOVA
space is over-cached.

Thanks.


* Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
  2020-01-21 17:21     ` Cong Wang
@ 2020-01-22 17:07       ` Robin Murphy
  2020-01-22 17:54         ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2020-01-22 17:07 UTC (permalink / raw)
  To: Cong Wang; +Cc: iommu, LKML, Joerg Roedel, John Garry

On 21/01/2020 5:21 pm, Cong Wang wrote:
> On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 18/12/2019 4:39 am, Cong Wang wrote:
>>> The IOVA cache algorithm implemented in IOMMU code does not
>>> exactly match the original algorithm described in the paper
>>> "Magazines and Vmem: Extending the Slab Allocator to Many
>>> CPUs and Arbitrary Resources".
>>>
>>> In particular, it doesn't need to free the loaded empty magazine
>>> when putting it back into the global depot. To make this work, we
>>> have to pre-allocate magazines in the depot and only recycle them
>>> when all of them are full.
>>>
>>> Before this patch, rcache->depot[] contains either full or
>>> freed entries; after this patch, it contains either full or
>>> empty (but allocated) entries.
>>
>> How much additional memory overhead does this impose (particularly on
>> systems that may have many domains mostly used for large, long-term
>> mappings)? I'm wary that trying to micro-optimise for the "churn network
>> packets as fast as possible" case may penalise every other case,
>> potentially quite badly. Lower-end embedded systems are using IOMMUs in
>> front of their GPUs, video codecs, etc. precisely because they *don't*
>> have much memory to spare (and thus need to scrape together large
>> buffers out of whatever pages they can find).
> 
> The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
> which is roughly 192K, per domain.

Theoretically. On many architectures, kmalloc(1032,...) is going to 
consume rather more than 1032 bytes. Either way, it's rather a lot of 
memory to waste in the many cases where it will never be used at all.

>> But on the other hand, if we were to go down this route, then why is
>> there any dynamic allocation/freeing left at all? Once both the depot
>> and the rcaches are preallocated, then AFAICS it would make more sense
>> to rework the overflow case in __iova_rcache_insert() to just free the
>> IOVAs and swap the empty mag around rather than destroying and
>> recreating it entirely.
> 
> That's because the algorithm requires a swap(), which can't be done with
> statically allocated magazines. I had the same thought initially but gave it
> up quickly when I realized this.

I'm not sure I follow... we're replacing a "full magazine" pointer with 
an "empty magazine" pointer regardless of where that empty magazine came 
from. It would be trivial to preallocate an 'overflow' magazine for the 
one remaining case of handling a full depot, although to be honest, at 
that point it's probably most efficient to just free the pfns directly 
from cpu_rcache->loaded while still under the percpu lock and be done 
with it.
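
In diff form, that last idea would be something along these lines -- an
untested sketch, with mag_to_flush just an illustrative local standing
in for mag_to_free:

		} else {
			mag_to_flush = cpu_rcache->loaded;
		}
		spin_unlock(&rcache->lock);

		if (mag_to_flush) {
			/* Depot full: return the pfns to the rbtree and
			 * reuse the now-empty loaded magazine in place. */
			iova_magazine_free_pfns(mag_to_flush, iovad);
			can_insert = true;
		}

Everything here still runs under the percpu lock, and
iova_magazine_free_pfns() already resets mag->size to 0, so the loaded
magazine could be reused directly.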

> If you are suggesting changing the algorithm, that is not a goal of this
> patchset. I do have a plan to search for a better algorithm, as the IOMMU
> performance still sucks (compared to no IOMMU) after this patchset,
> but once again, I do not want to change it in this patchset.

"Still sucks" is probably the most interesting thing here - the headline 
number for the original patch series was that it reached about 98% of 
bypass performance on Intel VT-d[1]. Sounds like it would be well worth 
digging in to what's different about your system and/or workload.

> (My ultimate goal is to find a spinlock-free algorithm; otherwise there is
> no way to get close to no-IOMMU performance.)
> 
>>
>> Perhaps there's a reasonable compromise wherein we don't preallocate,
>> but still 'free' empty magazines back to the depot, such that busy
>> domains will quickly reach a steady-state. In fact, having now dug up
>> the paper at this point of writing this reply, that appears to be what
>> fig. 3.1b describes anyway - I don't see any mention of preallocating
>> the depot.
> 
> That paper missed a lot of things; it doesn't even recommend a size
> for the depot or the percpu cache. For an implementation, we still have to
> think about those details, including whether to preallocate memory.

Heh, "missed"... To my reading, the original design actually describes a 
depot consisting of two unbounded (but garbage-collected) lists and a 
dynamically-adjusted magazine size - I'd hardly blame the authors for 
not discussing an implementation from 15 years in the future of a 
fixed-size design *based on* their concept ;)
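
(For the record, my reading of the paper's depot is closer to:

	struct depot {
		struct maglist full;	/* list of full magazines */
		struct maglist empty;	/* list of empty magazines */
	};

where "struct maglist" is purely illustrative; the magazine size M is
bumped up whenever a thread detects contention on the depot lock, and a
periodic reaper trims both lists back to their recent working-set size.)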

Robin.

[1] 
https://lists.linuxfoundation.org/pipermail/iommu/2015-December/015271.html


* Re: [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns()
  2020-01-21 17:29     ` Cong Wang
@ 2020-01-22 17:34       ` Robin Murphy
  2020-01-22 17:45         ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2020-01-22 17:34 UTC (permalink / raw)
  To: Cong Wang; +Cc: iommu, LKML, Joerg Roedel, John Garry

On 21/01/2020 5:29 pm, Cong Wang wrote:
> On Tue, Jan 21, 2020 at 1:52 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 18/12/2019 4:39 am, Cong Wang wrote:
>>> If the magazine is empty, iova_magazine_free_pfns() should
>>> be a no-op; however, the current NULL check misses the case of
>>> mag->size == 0. So we should just call iova_magazine_empty() instead.
>>>
>>> This should reduce the contention on iovad->iova_rbtree_lock
>>> a little bit, though not by much.
>>
>> Have you measured that in any way? AFAICS the only time this can get
>> called with a non-full magazine is in the CPU hotplug callback, where
>> the impact of taking the rbtree lock and immediately releasing it seems
>> unlikely to be significant on top of everything else involved in that
>> operation.
> 
> This patchset was only tested as a whole; it is not easy to deploy
> each patch to production and test it separately.
> 
> Is there anything wrong with optimizing a CPU hotplug path? :) And
> it is called from alloc_iova_fast() too when, for example, the IOVA
> space is over-cached.

And if the IOVA space is consumed to the point that we've fallen back to 
that desperate last resort, what do you think the chances are that a 
significant number of percpu magazines will be *empty*? Also bear in 
mind that in that case we've already walked the rbtree once, so any 
notion of still being fast is long, long gone.

As for CPU hotplug, it's a comparatively rare event involving all manner 
of system-wide synchronisation, and the "optimisation" of shaving a few 
dozen CPU cycles off at one point *if* things happen to line up 
correctly is taking a cup of water out of a lake. If the domain is busy 
at the time, then once again chances are the magazines aren't empty and 
having an extra check redundant with the loop condition simply adds 
(trivial, but nonzero) overhead to every call. And if the domain isn't 
busy, then the lock is unlikely to be contended anyway.

Sorry, but without convincing evidence, this change just looks like 
churn for the sake of it.

Robin.


* Re: [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns()
  2020-01-22 17:34       ` Robin Murphy
@ 2020-01-22 17:45         ` Cong Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2020-01-22 17:45 UTC (permalink / raw)
  To: Robin Murphy; +Cc: iommu, LKML, Joerg Roedel, John Garry

On Wed, Jan 22, 2020 at 9:34 AM Robin Murphy <robin.murphy@arm.com> wrote:
> Sorry, but without convincing evidence, this change just looks like
> churn for the sake of it.

The time I have wasted arguing with you is already worth more than
the value this patch brings. So let's just drop it to save some
time.

Thanks.


* Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
  2020-01-22 17:07       ` Robin Murphy
@ 2020-01-22 17:54         ` Cong Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2020-01-22 17:54 UTC (permalink / raw)
  To: Robin Murphy; +Cc: iommu, LKML, Joerg Roedel, John Garry

On Wed, Jan 22, 2020 at 9:07 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 21/01/2020 5:21 pm, Cong Wang wrote:
> > On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 18/12/2019 4:39 am, Cong Wang wrote:
> >>> The IOVA cache algorithm implemented in IOMMU code does not
> >>> exactly match the original algorithm described in the paper
> >>> "Magazines and Vmem: Extending the Slab Allocator to Many
> >>> CPUs and Arbitrary Resources".
> >>>
> >>> In particular, it doesn't need to free the loaded empty magazine
> >>> when putting it back into the global depot. To make this work, we
> >>> have to pre-allocate magazines in the depot and only recycle them
> >>> when all of them are full.
> >>>
> >>> Before this patch, rcache->depot[] contains either full or
> >>> freed entries; after this patch, it contains either full or
> >>> empty (but allocated) entries.
> >>
> >> How much additional memory overhead does this impose (particularly on
> >> systems that may have many domains mostly used for large, long-term
> >> mappings)? I'm wary that trying to micro-optimise for the "churn network
> >> packets as fast as possible" case may penalise every other case,
> >> potentially quite badly. Lower-end embedded systems are using IOMMUs in
> >> front of their GPUs, video codecs, etc. precisely because they *don't*
> >> have much memory to spare (and thus need to scrape together large
> >> buffers out of whatever pages they can find).
> >
> > The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
> > which is roughly 192K, per domain.
>
> Theoretically. On many architectures, kmalloc(1032,...) is going to
> consume rather more than 1032 bytes. Either way, it's rather a lot of
> memory to waste in the many cases where it will never be used at all.

If this is a concern, we can make IOVA_MAG_SIZE tunable in Kconfig.
I myself want a larger IOVA_MAG_SIZE, at least for experiments.
You know, servers now have 100G+ of memory; 192K is nearly nothing...
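
Something like this hypothetical Kconfig knob (it does not exist today)
is what I have in mind:

	config IOMMU_IOVA_MAG_SIZE
		int "IOVA rcache magazine size"
		default 128
		help
		  Number of IOVA PFNs held in each rcache magazine.
		  Larger values trade memory per domain for less
		  depot/rbtree lock contention.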


>
> >> But on the other hand, if we were to go down this route, then why is
> >> there any dynamic allocation/freeing left at all? Once both the depot
> >> and the rcaches are preallocated, then AFAICS it would make more sense
> >> to rework the overflow case in __iova_rcache_insert() to just free the
> >> IOVAs and swap the empty mag around rather than destroying and
> >> recreating it entirely.
> >
> > That's because the algorithm requires a swap(), which can't be done with
> > statically allocated magazines. I had the same thought initially but gave it
> > up quickly when I realized this.
>
> I'm not sure I follow... we're replacing a "full magazine" pointer with
> an "empty magazine" pointer regardless of where that empty magazine came
> from. It would be trivial to preallocate an 'overflow' magazine for the
> one remaining case of handling a full depot, although to be honest, at
> that point it's probably most efficient to just free the pfns directly
> from cpu_rcache->loaded while still under the percpu lock and be done
> with it.

I don't follow you either. I thought you were suggesting completely
getting rid of dynamic memory allocations, like this:

@@ -31,7 +31,7 @@ struct iova_cpu_rcache;
 struct iova_rcache {
        spinlock_t lock;
        unsigned long depot_size;
-       struct iova_magazine *depot[MAX_GLOBAL_MAGS];
+       struct iova_magazine depot[MAX_GLOBAL_MAGS];
        struct iova_cpu_rcache __percpu *cpu_rcaches;
 };

If so, I don't see how I can do swap() with pointers like
cpu_rcache->prev.

More importantly, this doesn't save any memory for your embedded
case either. So I don't know why you brought it up.

>
> > If you are suggesting changing the algorithm, that is not a goal of this
> > patchset. I do have a plan to search for a better algorithm, as the IOMMU
> > performance still sucks (compared to no IOMMU) after this patchset,
> > but once again, I do not want to change it in this patchset.
>
> "Still sucks" is probably the most interesting thing here - the headline
> number for the original patch series was that it reached about 98% of
> bypass performance on Intel VT-d[1]. Sounds like it would be well worth
> digging in to what's different about your system and/or workload.

Just FYI: the latency is 10x-20x worse with the IOMMU enabled on AMD
servers here (mlx5 driver for Ethernet, if it matters). The throughput
is roughly the same. The patchset you linked only measures throughput.


>
> > (My ultimate goal is to find a spinlock-free algorithm; otherwise there is
> > no way to get close to no-IOMMU performance.)
> >
> >>
> >> Perhaps there's a reasonable compromise wherein we don't preallocate,
> >> but still 'free' empty magazines back to the depot, such that busy
> >> domains will quickly reach a steady-state. In fact, having now dug up
> >> the paper at this point of writing this reply, that appears to be what
> >> fig. 3.1b describes anyway - I don't see any mention of preallocating
> >> the depot.
> >
> > That paper missed a lot of things; it doesn't even recommend a size
> > for the depot or the percpu cache. For an implementation, we still have to
> > think about those details, including whether to preallocate memory.
>
> Heh, "missed"... To my reading, the original design actually describes a
> depot consisting of two unbounded (but garbage-collected) lists and a
> dynamically-adjusted magazine size - I'd hardly blame the authors for

I must have missed the dynamic size part, as I tried to tune IOVA_MAG_SIZE
manually back when I initially thought it was over-caching.

> not discussing an implementation from 15 years in the future of a
> fixed-size design *based on* their concept ;)

Are you saying a fixed-size implementation is wrong? I'd like to hear
more! :) I am curious how to dynamically adjust the magazine size
too, as I still don't believe one IOVA_MAG_SIZE fits all, and also how
to balance the percpu cache. Can you elaborate?

Thanks.


* Re: [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice
  2020-01-21  9:56   ` Robin Murphy
@ 2020-03-03 11:33     ` John Garry
  0 siblings, 0 replies; 17+ messages in thread
From: John Garry @ 2020-03-03 11:33 UTC (permalink / raw)
  To: joro; +Cc: Robin Murphy, Cong Wang, iommu, linux-kernel

On 21/01/2020 09:56, Robin Murphy wrote:
> On 18/12/2019 4:39 am, Cong Wang wrote:
>> Both find_iova() and __free_iova() take iova_rbtree_lock,
>> there is no reason to take and release it twice inside
>> free_iova().
>>
>> Fold them into one critical section by calling the unlock
>> versions instead.
> 
> Makes sense to me.
> 
> Reviewed-by: Robin Murphy <robin.murphy@arm.com>

Reviewed-by: John Garry <john.garry@huawei.com>

Could we at least get this patch picked up? It should be ok to take in 
isolation, since there is some debate on the other 2 patches in this 
series. Thanks

> 
>> Cc: Joerg Roedel <joro@8bytes.org>
>> Cc: John Garry <john.garry@huawei.com>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>>   drivers/iommu/iova.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 184d4c0e20b5..f46f8f794678 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -390,10 +390,14 @@ EXPORT_SYMBOL_GPL(__free_iova);
>>   void
>>   free_iova(struct iova_domain *iovad, unsigned long pfn)
>>   {
>> -    struct iova *iova = find_iova(iovad, pfn);
>> +    unsigned long flags;
>> +    struct iova *iova;
>> +    spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>> +    iova = private_find_iova(iovad, pfn);
>>       if (iova)
>> -        __free_iova(iovad, iova);
>> +        private_free_iova(iovad, iova);
>> +    spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>   }
>>   EXPORT_SYMBOL_GPL(free_iova);
>>
> .



* [Patch v3 1/3] iommu: avoid unnecessary magazine allocations
  2019-12-06 21:38 Cong Wang
@ 2019-12-06 21:38 ` Cong Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2019-12-06 21:38 UTC (permalink / raw)
  To: iommu; +Cc: linux-kernel, joro, Cong Wang, John Garry

The IOVA cache algorithm implemented in IOMMU code does not
exactly match the original algorithm described in the paper
"Magazines and Vmem: Extending the Slab Allocator to Many
CPUs and Arbitrary Resources".

In particular, it doesn't need to free the loaded empty magazine
when putting it back into the global depot. To make this work, we
have to pre-allocate magazines in the depot and only recycle them
when all of them are full.

Before this patch, rcache->depot[] contains either full or
freed entries; after this patch, it contains either full or
empty (but allocated) entries.

This also includes a few other changes to make the implementation
exactly match the pseudo code in the paper.

Cc: Joerg Roedel <joro@8bytes.org>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 drivers/iommu/iova.c | 45 +++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 41c605b0058f..cb473ddce4cf 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad)
 	struct iova_cpu_rcache *cpu_rcache;
 	struct iova_rcache *rcache;
 	unsigned int cpu;
-	int i;
+	int i, j;
 
 	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
 		rcache = &iovad->rcaches[i];
 		spin_lock_init(&rcache->lock);
 		rcache->depot_size = 0;
+		for (j = 0; j < MAX_GLOBAL_MAGS; ++j) {
+			rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL);
+			WARN_ON(!rcache->depot[j]);
+		}
 		rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
 		if (WARN_ON(!rcache->cpu_rcaches))
 			continue;
@@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
 
 	if (!iova_magazine_full(cpu_rcache->loaded)) {
 		can_insert = true;
-	} else if (!iova_magazine_full(cpu_rcache->prev)) {
+	} else if (iova_magazine_empty(cpu_rcache->prev)) {
 		swap(cpu_rcache->prev, cpu_rcache->loaded);
 		can_insert = true;
 	} else {
-		struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
+		spin_lock(&rcache->lock);
+		if (rcache->depot_size < MAX_GLOBAL_MAGS) {
+			swap(rcache->depot[rcache->depot_size], cpu_rcache->prev);
+			swap(cpu_rcache->prev, cpu_rcache->loaded);
+			rcache->depot_size++;
+			can_insert = true;
+		} else {
+			mag_to_free = cpu_rcache->loaded;
+		}
+		spin_unlock(&rcache->lock);
+
+		if (mag_to_free) {
+			struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
 
-		if (new_mag) {
-			spin_lock(&rcache->lock);
-			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
-				rcache->depot[rcache->depot_size++] =
-						cpu_rcache->loaded;
+			if (new_mag) {
+				cpu_rcache->loaded = new_mag;
+				can_insert = true;
 			} else {
-				mag_to_free = cpu_rcache->loaded;
+				mag_to_free = NULL;
 			}
-			spin_unlock(&rcache->lock);
-
-			cpu_rcache->loaded = new_mag;
-			can_insert = true;
 		}
 	}
 
@@ -963,14 +973,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
 
 	if (!iova_magazine_empty(cpu_rcache->loaded)) {
 		has_pfn = true;
-	} else if (!iova_magazine_empty(cpu_rcache->prev)) {
+	} else if (iova_magazine_full(cpu_rcache->prev)) {
 		swap(cpu_rcache->prev, cpu_rcache->loaded);
 		has_pfn = true;
 	} else {
 		spin_lock(&rcache->lock);
 		if (rcache->depot_size > 0) {
-			iova_magazine_free(cpu_rcache->loaded);
-			cpu_rcache->loaded = rcache->depot[--rcache->depot_size];
+			swap(rcache->depot[rcache->depot_size - 1], cpu_rcache->prev);
+			swap(cpu_rcache->prev, cpu_rcache->loaded);
+			rcache->depot_size--;
 			has_pfn = true;
 		}
 		spin_unlock(&rcache->lock);
@@ -1019,7 +1030,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
 			iova_magazine_free(cpu_rcache->prev);
 		}
 		free_percpu(rcache->cpu_rcaches);
-		for (j = 0; j < rcache->depot_size; ++j)
+		for (j = 0; j < MAX_GLOBAL_MAGS; ++j)
 			iova_magazine_free(rcache->depot[j]);
 	}
 }
-- 
2.21.0



end of thread

Thread overview: 17+ messages
2019-12-18  4:39 [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
2019-12-18  4:39 ` [Patch v3 1/3] iommu: avoid unnecessary magazine allocations Cong Wang
2020-01-21 11:11   ` Robin Murphy
2020-01-21 17:21     ` Cong Wang
2020-01-22 17:07       ` Robin Murphy
2020-01-22 17:54         ` Cong Wang
2019-12-18  4:39 ` [Patch v3 2/3] iommu: optimize iova_magazine_free_pfns() Cong Wang
2020-01-21  9:52   ` Robin Murphy
2020-01-21 17:29     ` Cong Wang
2020-01-22 17:34       ` Robin Murphy
2020-01-22 17:45         ` Cong Wang
2019-12-18  4:39 ` [Patch v3 3/3] iommu: avoid taking iova_rbtree_lock twice Cong Wang
2019-12-19  9:51   ` John Garry
2020-01-21  9:56   ` Robin Murphy
2020-03-03 11:33     ` John Garry
2020-01-20 23:10 ` [Patch v3 0/3] iommu: reduce spinlock contention on fast path Cong Wang
  -- strict thread matches above, loose matches on Subject: below --
2019-12-06 21:38 Cong Wang
2019-12-06 21:38 ` [Patch v3 1/3] iommu: avoid unnecessary magazine allocations Cong Wang
