All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-18 16:57 ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

Hi all,

In the wake of the ARM SMMU optimisation efforts, it seems that certain
workloads (e.g. storage I/O with large scatterlists) probably remain quite
heavily influenced by IOVA allocation performance. Separately, Ard also
reported massive performance drops for a graphical desktop on AMD Seattle
when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
ops domain getting initialised differently for ACPI vs. DT, and exposing
the overhead of the rbtree slow path. Whilst we could go around trying to
close up all the little gaps that lead to hitting the slowest case, it
seems a much better idea to simply make said slowest case a lot less slow.

I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
the changes rather too hard to follow, so I've taken the liberty here of
picking the whole thing up and reimplementing the main part in a rather
less invasive manner.

Robin.

[1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html

Robin Murphy (1):
  iommu/iova: Extend rbtree node caching

Zhen Lei (3):
  iommu/iova: Optimise rbtree searching
  iommu/iova: Optimise the padding calculation
  iommu/iova: Make dma_32bit_pfn implicit

 drivers/gpu/drm/tegra/drm.c      |   3 +-
 drivers/gpu/host1x/dev.c         |   3 +-
 drivers/iommu/amd_iommu.c        |   7 +--
 drivers/iommu/dma-iommu.c        |  18 +------
 drivers/iommu/intel-iommu.c      |  11 ++--
 drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
 drivers/misc/mic/scif/scif_rma.c |   3 +-
 include/linux/iova.h             |   8 +--
 8 files changed, 60 insertions(+), 105 deletions(-)

-- 
2.12.2.dirty

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-18 16:57 ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A
  Cc: ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
	Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi all,

In the wake of the ARM SMMU optimisation efforts, it seems that certain
workloads (e.g. storage I/O with large scatterlists) probably remain quite
heavily influenced by IOVA allocation performance. Separately, Ard also
reported massive performance drops for a graphical desktop on AMD Seattle
when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
ops domain getting initialised differently for ACPI vs. DT, and exposing
the overhead of the rbtree slow path. Whilst we could go around trying to
close up all the little gaps that lead to hitting the slowest case, it
seems a much better idea to simply make said slowest case a lot less slow.

I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
the changes rather too hard to follow, so I've taken the liberty here of
picking the whole thing up and reimplementing the main part in a rather
less invasive manner.

Robin.

[1] https://www.mail-archive.com/iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org/msg17753.html

Robin Murphy (1):
  iommu/iova: Extend rbtree node caching

Zhen Lei (3):
  iommu/iova: Optimise rbtree searching
  iommu/iova: Optimise the padding calculation
  iommu/iova: Make dma_32bit_pfn implicit

 drivers/gpu/drm/tegra/drm.c      |   3 +-
 drivers/gpu/host1x/dev.c         |   3 +-
 drivers/iommu/amd_iommu.c        |   7 +--
 drivers/iommu/dma-iommu.c        |  18 +------
 drivers/iommu/intel-iommu.c      |  11 ++--
 drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
 drivers/misc/mic/scif/scif_rma.c |   3 +-
 include/linux/iova.h             |   8 +--
 8 files changed, 60 insertions(+), 105 deletions(-)

-- 
2.12.2.dirty

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-18 16:57 ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: linux-arm-kernel

Hi all,

In the wake of the ARM SMMU optimisation efforts, it seems that certain
workloads (e.g. storage I/O with large scatterlists) probably remain quite
heavily influenced by IOVA allocation performance. Separately, Ard also
reported massive performance drops for a graphical desktop on AMD Seattle
when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
ops domain getting initialised differently for ACPI vs. DT, and exposing
the overhead of the rbtree slow path. Whilst we could go around trying to
close up all the little gaps that lead to hitting the slowest case, it
seems a much better idea to simply make said slowest case a lot less slow.

I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
the changes rather too hard to follow, so I've taken the liberty here of
picking the whole thing up and reimplementing the main part in a rather
less invasive manner.

Robin.

[1] https://www.mail-archive.com/iommu at lists.linux-foundation.org/msg17753.html

Robin Murphy (1):
  iommu/iova: Extend rbtree node caching

Zhen Lei (3):
  iommu/iova: Optimise rbtree searching
  iommu/iova: Optimise the padding calculation
  iommu/iova: Make dma_32bit_pfn implicit

 drivers/gpu/drm/tegra/drm.c      |   3 +-
 drivers/gpu/host1x/dev.c         |   3 +-
 drivers/iommu/amd_iommu.c        |   7 +--
 drivers/iommu/dma-iommu.c        |  18 +------
 drivers/iommu/intel-iommu.c      |  11 ++--
 drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
 drivers/misc/mic/scif/scif_rma.c |   3 +-
 include/linux/iova.h             |   8 +--
 8 files changed, 60 insertions(+), 105 deletions(-)

-- 
2.12.2.dirty

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/4] iommu/iova: Optimise rbtree searching
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

From: Zhen Lei <thunder.leizhen@huawei.com>

Checking the IOVA bounds separately before deciding which direction to
continue the search (if necessary) results in redundantly comparing both
pfns twice each. GCC can already determine that the final comparison op
is redundant and optimise it down to 3 in total, but we can go one
further with a little tweak of the ordering (which makes the intent of
the code that much cleaner as a bonus).

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: rewrote commit message to clarify]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iova.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 246f14c83944..8f7552dc4e04 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -289,15 +289,12 @@ private_find_iova(struct iova_domain *iovad, unsigned long pfn)
 	while (node) {
 		struct iova *iova = rb_entry(node, struct iova, node);
 
-		/* If pfn falls within iova's range, return iova */
-		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
-			return iova;
-		}
-
 		if (pfn < iova->pfn_lo)
 			node = node->rb_left;
-		else if (pfn > iova->pfn_lo)
+		else if (pfn > iova->pfn_hi)
 			node = node->rb_right;
+		else
+			return iova;	/* pfn falls within iova's range */
 	}
 
 	return NULL;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 1/4] iommu/iova: Optimise rbtree searching
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A
  Cc: ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
	Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

From: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

Checking the IOVA bounds separately before deciding which direction to
continue the search (if necessary) results in redundantly comparing both
pfns twice each. GCC can already determine that the final comparison op
is redundant and optimise it down to 3 in total, but we can go one
further with a little tweak of the ordering (which makes the intent of
the code that much cleaner as a bonus).

Signed-off-by: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
[rm: rewrote commit message to clarify]
Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
---
 drivers/iommu/iova.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 246f14c83944..8f7552dc4e04 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -289,15 +289,12 @@ private_find_iova(struct iova_domain *iovad, unsigned long pfn)
 	while (node) {
 		struct iova *iova = rb_entry(node, struct iova, node);
 
-		/* If pfn falls within iova's range, return iova */
-		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
-			return iova;
-		}
-
 		if (pfn < iova->pfn_lo)
 			node = node->rb_left;
-		else if (pfn > iova->pfn_lo)
+		else if (pfn > iova->pfn_hi)
 			node = node->rb_right;
+		else
+			return iova;	/* pfn falls within iova's range */
 	}
 
 	return NULL;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 1/4] iommu/iova: Optimise rbtree searching
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: linux-arm-kernel

From: Zhen Lei <thunder.leizhen@huawei.com>

Checking the IOVA bounds separately before deciding which direction to
continue the search (if necessary) results in redundantly comparing both
pfns twice each. GCC can already determine that the final comparison op
is redundant and optimise it down to 3 in total, but we can go one
further with a little tweak of the ordering (which makes the intent of
the code that much cleaner as a bonus).

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: rewrote commit message to clarify]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iova.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 246f14c83944..8f7552dc4e04 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -289,15 +289,12 @@ private_find_iova(struct iova_domain *iovad, unsigned long pfn)
 	while (node) {
 		struct iova *iova = rb_entry(node, struct iova, node);
 
-		/* If pfn falls within iova's range, return iova */
-		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
-			return iova;
-		}
-
 		if (pfn < iova->pfn_lo)
 			node = node->rb_left;
-		else if (pfn > iova->pfn_lo)
+		else if (pfn > iova->pfn_hi)
 			node = node->rb_right;
+		else
+			return iova;	/* pfn falls within iova's range */
 	}
 
 	return NULL;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/4] iommu/iova: Optimise the padding calculation
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

From: Zhen Lei <thunder.leizhen@huawei.com>

The mask for calculating the padding size doesn't change, so there's no
need to recalculate it every loop iteration. Furthermore, Once we've
done that, it becomes clear that we don't actually need to calculate a
padding size at all - by flipping the arithmetic around, we can just
combine the upper limit, size, and mask directly to check against the
lower limit.

For an arm64 build, this alone knocks 16% off the size of the entire
alloc_iova() function!

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: simplified more of the arithmetic, rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iova.c | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 8f7552dc4e04..d094d1ca8f23 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -129,16 +129,6 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
 	rb_insert_color(&iova->node, root);
 }
 
-/*
- * Computes the padding size required, to make the start address
- * naturally aligned on the power-of-two order of its size
- */
-static unsigned int
-iova_get_pad_size(unsigned int size, unsigned int limit_pfn)
-{
-	return (limit_pfn - size) & (__roundup_pow_of_two(size) - 1);
-}
-
 static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		unsigned long size, unsigned long limit_pfn,
 			struct iova *new, bool size_aligned)
@@ -146,7 +136,10 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	struct rb_node *prev, *curr = NULL;
 	unsigned long flags;
 	unsigned long saved_pfn;
-	unsigned int pad_size = 0;
+	unsigned long align_mask = ~0UL;
+
+	if (size_aligned)
+		align_mask <<= __fls(size);
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
@@ -156,31 +149,26 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	while (curr) {
 		struct iova *curr_iova = rb_entry(curr, struct iova, node);
 
-		if (limit_pfn <= curr_iova->pfn_lo) {
+		if (limit_pfn <= curr_iova->pfn_lo)
 			goto move_left;
-		} else if (limit_pfn > curr_iova->pfn_hi) {
-			if (size_aligned)
-				pad_size = iova_get_pad_size(size, limit_pfn);
-			if ((curr_iova->pfn_hi + size + pad_size) < limit_pfn)
-				break;	/* found a free slot */
-		}
+
+		if (((limit_pfn - size) & align_mask) > curr_iova->pfn_hi)
+			break;	/* found a free slot */
+
 		limit_pfn = curr_iova->pfn_lo;
 move_left:
 		prev = curr;
 		curr = rb_prev(curr);
 	}
 
-	if (!curr) {
-		if (size_aligned)
-			pad_size = iova_get_pad_size(size, limit_pfn);
-		if ((iovad->start_pfn + size + pad_size) > limit_pfn) {
-			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-			return -ENOMEM;
-		}
+	if (limit_pfn < size ||
+	    (!curr && ((limit_pfn - size) & align_mask) < iovad->start_pfn)) {
+		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		return -ENOMEM;
 	}
 
 	/* pfn_lo will point to size aligned address if size_aligned is set */
-	new->pfn_lo = limit_pfn - (size + pad_size);
+	new->pfn_lo = (limit_pfn - size) & align_mask;
 	new->pfn_hi = new->pfn_lo + size - 1;
 
 	/* If we have 'prev', it's a valid place to start the insertion. */
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/4] iommu/iova: Optimise the padding calculation
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A
  Cc: ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
	Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

From: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

The mask for calculating the padding size doesn't change, so there's no
need to recalculate it every loop iteration. Furthermore, Once we've
done that, it becomes clear that we don't actually need to calculate a
padding size at all - by flipping the arithmetic around, we can just
combine the upper limit, size, and mask directly to check against the
lower limit.

For an arm64 build, this alone knocks 16% off the size of the entire
alloc_iova() function!

Signed-off-by: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
[rm: simplified more of the arithmetic, rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
---
 drivers/iommu/iova.c | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 8f7552dc4e04..d094d1ca8f23 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -129,16 +129,6 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
 	rb_insert_color(&iova->node, root);
 }
 
-/*
- * Computes the padding size required, to make the start address
- * naturally aligned on the power-of-two order of its size
- */
-static unsigned int
-iova_get_pad_size(unsigned int size, unsigned int limit_pfn)
-{
-	return (limit_pfn - size) & (__roundup_pow_of_two(size) - 1);
-}
-
 static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		unsigned long size, unsigned long limit_pfn,
 			struct iova *new, bool size_aligned)
@@ -146,7 +136,10 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	struct rb_node *prev, *curr = NULL;
 	unsigned long flags;
 	unsigned long saved_pfn;
-	unsigned int pad_size = 0;
+	unsigned long align_mask = ~0UL;
+
+	if (size_aligned)
+		align_mask <<= __fls(size);
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
@@ -156,31 +149,26 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	while (curr) {
 		struct iova *curr_iova = rb_entry(curr, struct iova, node);
 
-		if (limit_pfn <= curr_iova->pfn_lo) {
+		if (limit_pfn <= curr_iova->pfn_lo)
 			goto move_left;
-		} else if (limit_pfn > curr_iova->pfn_hi) {
-			if (size_aligned)
-				pad_size = iova_get_pad_size(size, limit_pfn);
-			if ((curr_iova->pfn_hi + size + pad_size) < limit_pfn)
-				break;	/* found a free slot */
-		}
+
+		if (((limit_pfn - size) & align_mask) > curr_iova->pfn_hi)
+			break;	/* found a free slot */
+
 		limit_pfn = curr_iova->pfn_lo;
 move_left:
 		prev = curr;
 		curr = rb_prev(curr);
 	}
 
-	if (!curr) {
-		if (size_aligned)
-			pad_size = iova_get_pad_size(size, limit_pfn);
-		if ((iovad->start_pfn + size + pad_size) > limit_pfn) {
-			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-			return -ENOMEM;
-		}
+	if (limit_pfn < size ||
+	    (!curr && ((limit_pfn - size) & align_mask) < iovad->start_pfn)) {
+		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		return -ENOMEM;
 	}
 
 	/* pfn_lo will point to size aligned address if size_aligned is set */
-	new->pfn_lo = limit_pfn - (size + pad_size);
+	new->pfn_lo = (limit_pfn - size) & align_mask;
 	new->pfn_hi = new->pfn_lo + size - 1;
 
 	/* If we have 'prev', it's a valid place to start the insertion. */
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/4] iommu/iova: Optimise the padding calculation
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: linux-arm-kernel

From: Zhen Lei <thunder.leizhen@huawei.com>

The mask for calculating the padding size doesn't change, so there's no
need to recalculate it every loop iteration. Furthermore, Once we've
done that, it becomes clear that we don't actually need to calculate a
padding size at all - by flipping the arithmetic around, we can just
combine the upper limit, size, and mask directly to check against the
lower limit.

For an arm64 build, this alone knocks 16% off the size of the entire
alloc_iova() function!

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: simplified more of the arithmetic, rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iova.c | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 8f7552dc4e04..d094d1ca8f23 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -129,16 +129,6 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
 	rb_insert_color(&iova->node, root);
 }
 
-/*
- * Computes the padding size required, to make the start address
- * naturally aligned on the power-of-two order of its size
- */
-static unsigned int
-iova_get_pad_size(unsigned int size, unsigned int limit_pfn)
-{
-	return (limit_pfn - size) & (__roundup_pow_of_two(size) - 1);
-}
-
 static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		unsigned long size, unsigned long limit_pfn,
 			struct iova *new, bool size_aligned)
@@ -146,7 +136,10 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	struct rb_node *prev, *curr = NULL;
 	unsigned long flags;
 	unsigned long saved_pfn;
-	unsigned int pad_size = 0;
+	unsigned long align_mask = ~0UL;
+
+	if (size_aligned)
+		align_mask <<= __fls(size);
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
@@ -156,31 +149,26 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	while (curr) {
 		struct iova *curr_iova = rb_entry(curr, struct iova, node);
 
-		if (limit_pfn <= curr_iova->pfn_lo) {
+		if (limit_pfn <= curr_iova->pfn_lo)
 			goto move_left;
-		} else if (limit_pfn > curr_iova->pfn_hi) {
-			if (size_aligned)
-				pad_size = iova_get_pad_size(size, limit_pfn);
-			if ((curr_iova->pfn_hi + size + pad_size) < limit_pfn)
-				break;	/* found a free slot */
-		}
+
+		if (((limit_pfn - size) & align_mask) > curr_iova->pfn_hi)
+			break;	/* found a free slot */
+
 		limit_pfn = curr_iova->pfn_lo;
 move_left:
 		prev = curr;
 		curr = rb_prev(curr);
 	}
 
-	if (!curr) {
-		if (size_aligned)
-			pad_size = iova_get_pad_size(size, limit_pfn);
-		if ((iovad->start_pfn + size + pad_size) > limit_pfn) {
-			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-			return -ENOMEM;
-		}
+	if (limit_pfn < size ||
+	    (!curr && ((limit_pfn - size) & align_mask) < iovad->start_pfn)) {
+		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		return -ENOMEM;
 	}
 
 	/* pfn_lo will point to size aligned address if size_aligned is set */
-	new->pfn_lo = limit_pfn - (size + pad_size);
+	new->pfn_lo = (limit_pfn - size) & align_mask;
 	new->pfn_hi = new->pfn_lo + size - 1;
 
 	/* If we have 'prev', it's a valid place to start the insertion. */
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/4] iommu/iova: Extend rbtree node caching
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that traversing to the end then walking all the way down
to find free space every time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei <thunder.leizhen@huawei.com>.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iova.c | 59 +++++++++++++++++++++++++---------------------------
 include/linux/iova.h |  3 ++-
 2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index d094d1ca8f23..f5809a2ee6c2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 
 	spin_lock_init(&iovad->iova_rbtree_lock);
 	iovad->rbroot = RB_ROOT;
+	iovad->cached_node = NULL;
 	iovad->cached32_node = NULL;
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
@@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
 static struct rb_node *
 __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
 {
-	if ((*limit_pfn > iovad->dma_32bit_pfn) ||
-		(iovad->cached32_node == NULL))
+	struct rb_node *cached_node = NULL;
+	struct iova *curr_iova;
+
+	if (*limit_pfn <= iovad->dma_32bit_pfn)
+		cached_node = iovad->cached32_node;
+	if (!cached_node)
+		cached_node = iovad->cached_node;
+	if (!cached_node)
 		return rb_last(&iovad->rbroot);
-	else {
-		struct rb_node *prev_node = rb_prev(iovad->cached32_node);
-		struct iova *curr_iova =
-			rb_entry(iovad->cached32_node, struct iova, node);
-		*limit_pfn = curr_iova->pfn_lo;
-		return prev_node;
-	}
+
+	curr_iova = rb_entry(cached_node, struct iova, node);
+	*limit_pfn = curr_iova->pfn_lo;
+
+	return rb_prev(cached_node);
 }
 
 static void
-__cached_rbnode_insert_update(struct iova_domain *iovad,
-	unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
 {
-	if (limit_pfn != iovad->dma_32bit_pfn)
-		return;
-	iovad->cached32_node = &new->node;
+	if (new->pfn_lo > iovad->dma_32bit_pfn)
+		iovad->cached_node = &new->node;
+	else
+		iovad->cached32_node = &new->node;
 }
 
 static void
 __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 {
 	struct iova *cached_iova;
-	struct rb_node *curr;
+	struct rb_node **curr = NULL;
 
-	if (!iovad->cached32_node)
-		return;
-	curr = iovad->cached32_node;
-	cached_iova = rb_entry(curr, struct iova, node);
+	if (free->pfn_hi < iovad->dma_32bit_pfn)
+		curr = &iovad->cached32_node;
+	if (!curr)
+		curr = &iovad->cached_node;
 
-	if (free->pfn_lo >= cached_iova->pfn_lo) {
-		struct rb_node *node = rb_next(&free->node);
-		struct iova *iova = rb_entry(node, struct iova, node);
+	cached_iova = rb_entry(*curr, struct iova, node);
 
-		/* only cache if it's below 32bit pfn */
-		if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
-			iovad->cached32_node = node;
-		else
-			iovad->cached32_node = NULL;
-	}
+	if (free->pfn_lo >= cached_iova->pfn_lo)
+		*curr = rb_next(&free->node);
 }
 
 /* Insert the iova into domain rbtree by holding writer lock */
@@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 {
 	struct rb_node *prev, *curr = NULL;
 	unsigned long flags;
-	unsigned long saved_pfn;
 	unsigned long align_mask = ~0UL;
 
 	if (size_aligned)
@@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-	saved_pfn = limit_pfn;
 	curr = __get_cached_rbnode(iovad, &limit_pfn);
 	prev = curr;
 	while (curr) {
@@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* If we have 'prev', it's a valid place to start the insertion. */
 	iova_insert_rbtree(&iovad->rbroot, new, prev);
-	__cached_rbnode_insert_update(iovad, saved_pfn, new);
+	__cached_rbnode_insert_update(iovad, new);
 
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 
diff --git a/include/linux/iova.h b/include/linux/iova.h
index e0a892ae45c0..0bb8df43b393 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -40,7 +40,8 @@ struct iova_rcache {
 struct iova_domain {
 	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
 	struct rb_root	rbroot;		/* iova domain rbtree root */
-	struct rb_node	*cached32_node; /* Save last alloced node */
+	struct rb_node	*cached_node;	/* Save last alloced node */
+	struct rb_node	*cached32_node; /* Save last 32-bit alloced node */
 	unsigned long	granule;	/* pfn granularity for this domain */
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/4] iommu/iova: Extend rbtree node caching
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A
  Cc: ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
	Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that traversing to the end then walking all the way down
to find free space every time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>.

Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
---
 drivers/iommu/iova.c | 59 +++++++++++++++++++++++++---------------------------
 include/linux/iova.h |  3 ++-
 2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index d094d1ca8f23..f5809a2ee6c2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 
 	spin_lock_init(&iovad->iova_rbtree_lock);
 	iovad->rbroot = RB_ROOT;
+	iovad->cached_node = NULL;
 	iovad->cached32_node = NULL;
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
@@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
 static struct rb_node *
 __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
 {
-	if ((*limit_pfn > iovad->dma_32bit_pfn) ||
-		(iovad->cached32_node == NULL))
+	struct rb_node *cached_node = NULL;
+	struct iova *curr_iova;
+
+	if (*limit_pfn <= iovad->dma_32bit_pfn)
+		cached_node = iovad->cached32_node;
+	if (!cached_node)
+		cached_node = iovad->cached_node;
+	if (!cached_node)
 		return rb_last(&iovad->rbroot);
-	else {
-		struct rb_node *prev_node = rb_prev(iovad->cached32_node);
-		struct iova *curr_iova =
-			rb_entry(iovad->cached32_node, struct iova, node);
-		*limit_pfn = curr_iova->pfn_lo;
-		return prev_node;
-	}
+
+	curr_iova = rb_entry(cached_node, struct iova, node);
+	*limit_pfn = curr_iova->pfn_lo;
+
+	return rb_prev(cached_node);
 }
 
 static void
-__cached_rbnode_insert_update(struct iova_domain *iovad,
-	unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
 {
-	if (limit_pfn != iovad->dma_32bit_pfn)
-		return;
-	iovad->cached32_node = &new->node;
+	if (new->pfn_lo > iovad->dma_32bit_pfn)
+		iovad->cached_node = &new->node;
+	else
+		iovad->cached32_node = &new->node;
 }
 
 static void
 __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 {
 	struct iova *cached_iova;
-	struct rb_node *curr;
+	struct rb_node **curr = NULL;
 
-	if (!iovad->cached32_node)
-		return;
-	curr = iovad->cached32_node;
-	cached_iova = rb_entry(curr, struct iova, node);
+	if (free->pfn_hi < iovad->dma_32bit_pfn)
+		curr = &iovad->cached32_node;
+	if (!curr)
+		curr = &iovad->cached_node;
 
-	if (free->pfn_lo >= cached_iova->pfn_lo) {
-		struct rb_node *node = rb_next(&free->node);
-		struct iova *iova = rb_entry(node, struct iova, node);
+	cached_iova = rb_entry(*curr, struct iova, node);
 
-		/* only cache if it's below 32bit pfn */
-		if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
-			iovad->cached32_node = node;
-		else
-			iovad->cached32_node = NULL;
-	}
+	if (free->pfn_lo >= cached_iova->pfn_lo)
+		*curr = rb_next(&free->node);
 }
 
 /* Insert the iova into domain rbtree by holding writer lock */
@@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 {
 	struct rb_node *prev, *curr = NULL;
 	unsigned long flags;
-	unsigned long saved_pfn;
 	unsigned long align_mask = ~0UL;
 
 	if (size_aligned)
@@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-	saved_pfn = limit_pfn;
 	curr = __get_cached_rbnode(iovad, &limit_pfn);
 	prev = curr;
 	while (curr) {
@@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* If we have 'prev', it's a valid place to start the insertion. */
 	iova_insert_rbtree(&iovad->rbroot, new, prev);
-	__cached_rbnode_insert_update(iovad, saved_pfn, new);
+	__cached_rbnode_insert_update(iovad, new);
 
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 
diff --git a/include/linux/iova.h b/include/linux/iova.h
index e0a892ae45c0..0bb8df43b393 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -40,7 +40,8 @@ struct iova_rcache {
 struct iova_domain {
 	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
 	struct rb_root	rbroot;		/* iova domain rbtree root */
-	struct rb_node	*cached32_node; /* Save last alloced node */
+	struct rb_node	*cached_node;	/* Save last alloced node */
+	struct rb_node	*cached32_node; /* Save last 32-bit alloced node */
 	unsigned long	granule;	/* pfn granularity for this domain */
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/4] iommu/iova: Extend rbtree node caching
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: linux-arm-kernel

The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that traversing to the end then walking all the way down
to find free space every time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei <thunder.leizhen@huawei.com>.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iova.c | 59 +++++++++++++++++++++++++---------------------------
 include/linux/iova.h |  3 ++-
 2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index d094d1ca8f23..f5809a2ee6c2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 
 	spin_lock_init(&iovad->iova_rbtree_lock);
 	iovad->rbroot = RB_ROOT;
+	iovad->cached_node = NULL;
 	iovad->cached32_node = NULL;
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
@@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
 static struct rb_node *
 __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
 {
-	if ((*limit_pfn > iovad->dma_32bit_pfn) ||
-		(iovad->cached32_node == NULL))
+	struct rb_node *cached_node = NULL;
+	struct iova *curr_iova;
+
+	if (*limit_pfn <= iovad->dma_32bit_pfn)
+		cached_node = iovad->cached32_node;
+	if (!cached_node)
+		cached_node = iovad->cached_node;
+	if (!cached_node)
 		return rb_last(&iovad->rbroot);
-	else {
-		struct rb_node *prev_node = rb_prev(iovad->cached32_node);
-		struct iova *curr_iova =
-			rb_entry(iovad->cached32_node, struct iova, node);
-		*limit_pfn = curr_iova->pfn_lo;
-		return prev_node;
-	}
+
+	curr_iova = rb_entry(cached_node, struct iova, node);
+	*limit_pfn = curr_iova->pfn_lo;
+
+	return rb_prev(cached_node);
 }
 
 static void
-__cached_rbnode_insert_update(struct iova_domain *iovad,
-	unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
 {
-	if (limit_pfn != iovad->dma_32bit_pfn)
-		return;
-	iovad->cached32_node = &new->node;
+	if (new->pfn_lo > iovad->dma_32bit_pfn)
+		iovad->cached_node = &new->node;
+	else
+		iovad->cached32_node = &new->node;
 }
 
 static void
 __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 {
 	struct iova *cached_iova;
-	struct rb_node *curr;
+	struct rb_node **curr = NULL;
 
-	if (!iovad->cached32_node)
-		return;
-	curr = iovad->cached32_node;
-	cached_iova = rb_entry(curr, struct iova, node);
+	if (free->pfn_hi < iovad->dma_32bit_pfn)
+		curr = &iovad->cached32_node;
+	if (!curr)
+		curr = &iovad->cached_node;
 
-	if (free->pfn_lo >= cached_iova->pfn_lo) {
-		struct rb_node *node = rb_next(&free->node);
-		struct iova *iova = rb_entry(node, struct iova, node);
+	cached_iova = rb_entry(*curr, struct iova, node);
 
-		/* only cache if it's below 32bit pfn */
-		if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
-			iovad->cached32_node = node;
-		else
-			iovad->cached32_node = NULL;
-	}
+	if (free->pfn_lo >= cached_iova->pfn_lo)
+		*curr = rb_next(&free->node);
 }
 
 /* Insert the iova into domain rbtree by holding writer lock */
@@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 {
 	struct rb_node *prev, *curr = NULL;
 	unsigned long flags;
-	unsigned long saved_pfn;
 	unsigned long align_mask = ~0UL;
 
 	if (size_aligned)
@@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-	saved_pfn = limit_pfn;
 	curr = __get_cached_rbnode(iovad, &limit_pfn);
 	prev = curr;
 	while (curr) {
@@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* If we have 'prev', it's a valid place to start the insertion. */
 	iova_insert_rbtree(&iovad->rbroot, new, prev);
-	__cached_rbnode_insert_update(iovad, saved_pfn, new);
+	__cached_rbnode_insert_update(iovad, new);
 
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 
diff --git a/include/linux/iova.h b/include/linux/iova.h
index e0a892ae45c0..0bb8df43b393 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -40,7 +40,8 @@ struct iova_rcache {
 struct iova_domain {
 	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
 	struct rb_root	rbroot;		/* iova domain rbtree root */
-	struct rb_node	*cached32_node; /* Save last alloced node */
+	struct rb_node	*cached_node;	/* Save last alloced node */
+	struct rb_node	*cached32_node; /* Save last 32-bit alloced node */
 	unsigned long	granule;	/* pfn granularity for this domain */
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui, Thierry Reding, Jonathan Hunter, David Airlie,
	Sudeep Dutt, Ashutosh Dixit

From: Zhen Lei <thunder.leizhen@huawei.com>

Now that the cached node optimisation can apply to all allocations, the
couple of users which were playing tricks with dma_32bit_pfn in order to
benefit from it can stop doing so. Conversely, there is also no need for
all the other users to explicitly calculate a 'real' 32-bit PFN, when
init_iova_domain() can happily do that itself from the page granularity.

CC: Thierry Reding <thierry.reding@gmail.com>
CC: Jonathan Hunter <jonathanh@nvidia.com>
CC: David Airlie <airlied@linux.ie>
CC: Sudeep Dutt <sudeep.dutt@intel.com>
CC: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: use iova_pfn(), rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/gpu/drm/tegra/drm.c      |  3 +--
 drivers/gpu/host1x/dev.c         |  3 +--
 drivers/iommu/amd_iommu.c        |  7 ++-----
 drivers/iommu/dma-iommu.c        | 18 +-----------------
 drivers/iommu/intel-iommu.c      | 11 +++--------
 drivers/iommu/iova.c             |  4 ++--
 drivers/misc/mic/scif/scif_rma.c |  3 +--
 include/linux/iova.h             |  5 ++---
 8 files changed, 13 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 518f4b69ea53..81e9ae1ee90b 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -150,8 +150,7 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags)
 
 		order = __ffs(tegra->domain->pgsize_bitmap);
 		init_iova_domain(&tegra->carveout.domain, 1UL << order,
-				 carveout_start >> order,
-				 carveout_end >> order);
+				 carveout_start >> order);
 
 		tegra->carveout.shift = iova_shift(&tegra->carveout.domain);
 		tegra->carveout.limit = carveout_end >> tegra->carveout.shift;
diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c
index 2c58a390123a..57c8eed0ed71 100644
--- a/drivers/gpu/host1x/dev.c
+++ b/drivers/gpu/host1x/dev.c
@@ -193,8 +193,7 @@ static int host1x_probe(struct platform_device *pdev)
 
 		order = __ffs(host->domain->pgsize_bitmap);
 		init_iova_domain(&host->iova, 1UL << order,
-				 geometry->aperture_start >> order,
-				 geometry->aperture_end >> order);
+				 geometry->aperture_start >> order);
 		host->iova_end = geometry->aperture_end;
 	}
 
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 688e77576e5a..a12e3e12014a 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -63,7 +63,6 @@
 /* IO virtual address start page frame number */
 #define IOVA_START_PFN		(1)
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
 
 /* Reserved IOVA ranges */
 #define MSI_RANGE_START		(0xfee00000)
@@ -2010,8 +2009,7 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
 	if (!dma_dom->domain.pt_root)
 		goto free_dma_dom;
 
-	init_iova_domain(&dma_dom->iovad, PAGE_SIZE,
-			 IOVA_START_PFN, DMA_32BIT_PFN);
+	init_iova_domain(&dma_dom->iovad, PAGE_SIZE, IOVA_START_PFN);
 
 	/* Initialize reserved ranges */
 	copy_reserved_iova(&reserved_iova_ranges, &dma_dom->iovad);
@@ -2912,8 +2910,7 @@ static int init_reserved_iova_ranges(void)
 	struct pci_dev *pdev = NULL;
 	struct iova *val;
 
-	init_iova_domain(&reserved_iova_ranges, PAGE_SIZE,
-			 IOVA_START_PFN, DMA_32BIT_PFN);
+	init_iova_domain(&reserved_iova_ranges, PAGE_SIZE, IOVA_START_PFN);
 
 	lockdep_set_class(&reserved_iova_ranges.iova_rbtree_lock,
 			  &reserved_rbtree_key);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9d1cebe7f6cb..191be9c80a8a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -292,18 +292,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 		/* ...then finally give it a kicking to make sure it fits */
 		base_pfn = max_t(unsigned long, base_pfn,
 				domain->geometry.aperture_start >> order);
-		end_pfn = min_t(unsigned long, end_pfn,
-				domain->geometry.aperture_end >> order);
 	}
-	/*
-	 * PCI devices may have larger DMA masks, but still prefer allocating
-	 * within a 32-bit mask to avoid DAC addressing. Such limitations don't
-	 * apply to the typical platform device, so for those we may as well
-	 * leave the cache limit at the top of their range to save an rb_last()
-	 * traversal on every allocation.
-	 */
-	if (dev && dev_is_pci(dev))
-		end_pfn &= DMA_BIT_MASK(32) >> order;
 
 	/* start_pfn is always nonzero for an already-initialised domain */
 	if (iovad->start_pfn) {
@@ -312,16 +301,11 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 			pr_warn("Incompatible range for DMA domain\n");
 			return -EFAULT;
 		}
-		/*
-		 * If we have devices with different DMA masks, move the free
-		 * area cache limit down for the benefit of the smaller one.
-		 */
-		iovad->dma_32bit_pfn = min(end_pfn + 1, iovad->dma_32bit_pfn);
 
 		return 0;
 	}
 
-	init_iova_domain(iovad, 1UL << order, base_pfn, end_pfn);
+	init_iova_domain(iovad, 1UL << order, base_pfn);
 	if (!dev)
 		return 0;
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 687f18f65cea..afa3b4e765e7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -82,8 +82,6 @@
 #define IOVA_START_PFN		(1)
 
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
-#define DMA_64BIT_PFN		IOVA_PFN(DMA_BIT_MASK(64))
 
 /* page table handling */
 #define LEVEL_STRIDE		(9)
@@ -1874,8 +1872,7 @@ static int dmar_init_reserved_ranges(void)
 	struct iova *iova;
 	int i;
 
-	init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN);
 
 	lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
 		&reserved_rbtree_key);
@@ -1933,8 +1930,7 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
 	int adjust_width, agaw;
 	unsigned long sagaw;
 
-	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
@@ -4989,8 +4985,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
 {
 	int adjust_width;
 
-	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index f5809a2ee6c2..a8ba101ffdfc 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -35,7 +35,7 @@ static void free_iova_rcaches(struct iova_domain *iovad);
 
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn, unsigned long pfn_32bit)
+	unsigned long start_pfn)
 {
 	/*
 	 * IOVA granularity will normally be equal to the smallest
@@ -50,7 +50,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->cached32_node = NULL;
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
-	iovad->dma_32bit_pfn = pfn_32bit + 1;
+	iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
 	init_iova_rcaches(iovad);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 329727e00e97..c824329f7012 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -39,8 +39,7 @@ void scif_rma_ep_init(struct scif_endpt *ep)
 	struct scif_endpt_rma_info *rma = &ep->rma_info;
 
 	mutex_init(&rma->rma_lock);
-	init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN,
-			 SCIF_DMA_64BIT_PFN);
+	init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN);
 	spin_lock_init(&rma->tc_lock);
 	mutex_init(&rma->mmn_lock);
 	INIT_LIST_HEAD(&rma->reg_list);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 0bb8df43b393..58c2a365c45f 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -102,7 +102,7 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
 	unsigned long pfn_hi);
 void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn, unsigned long pfn_32bit);
+	unsigned long start_pfn);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 struct iova *split_and_remove_iova(struct iova_domain *iovad,
@@ -170,8 +170,7 @@ static inline void copy_reserved_iova(struct iova_domain *from,
 
 static inline void init_iova_domain(struct iova_domain *iovad,
 				    unsigned long granule,
-				    unsigned long start_pfn,
-				    unsigned long pfn_32bit)
+				    unsigned long start_pfn)
 {
 }
 
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A
  Cc: Sudeep Dutt, ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A, David Airlie,
	Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jonathan Hunter,
	Ashutosh Dixit,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Thierry Reding, ray.jui-dY08KVG/lbpWk0Htik3J/w,
	thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

From: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

Now that the cached node optimisation can apply to all allocations, the
couple of users which were playing tricks with dma_32bit_pfn in order to
benefit from it can stop doing so. Conversely, there is also no need for
all the other users to explicitly calculate a 'real' 32-bit PFN, when
init_iova_domain() can happily do that itself from the page granularity.

CC: Thierry Reding <thierry.reding-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Jonathan Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
CC: David Airlie <airlied-cv59FeDIM0c@public.gmane.org>
CC: Sudeep Dutt <sudeep.dutt-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
CC: Ashutosh Dixit <ashutosh.dixit-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
[rm: use iova_pfn(), rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
---
 drivers/gpu/drm/tegra/drm.c      |  3 +--
 drivers/gpu/host1x/dev.c         |  3 +--
 drivers/iommu/amd_iommu.c        |  7 ++-----
 drivers/iommu/dma-iommu.c        | 18 +-----------------
 drivers/iommu/intel-iommu.c      | 11 +++--------
 drivers/iommu/iova.c             |  4 ++--
 drivers/misc/mic/scif/scif_rma.c |  3 +--
 include/linux/iova.h             |  5 ++---
 8 files changed, 13 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 518f4b69ea53..81e9ae1ee90b 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -150,8 +150,7 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags)
 
 		order = __ffs(tegra->domain->pgsize_bitmap);
 		init_iova_domain(&tegra->carveout.domain, 1UL << order,
-				 carveout_start >> order,
-				 carveout_end >> order);
+				 carveout_start >> order);
 
 		tegra->carveout.shift = iova_shift(&tegra->carveout.domain);
 		tegra->carveout.limit = carveout_end >> tegra->carveout.shift;
diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c
index 2c58a390123a..57c8eed0ed71 100644
--- a/drivers/gpu/host1x/dev.c
+++ b/drivers/gpu/host1x/dev.c
@@ -193,8 +193,7 @@ static int host1x_probe(struct platform_device *pdev)
 
 		order = __ffs(host->domain->pgsize_bitmap);
 		init_iova_domain(&host->iova, 1UL << order,
-				 geometry->aperture_start >> order,
-				 geometry->aperture_end >> order);
+				 geometry->aperture_start >> order);
 		host->iova_end = geometry->aperture_end;
 	}
 
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 688e77576e5a..a12e3e12014a 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -63,7 +63,6 @@
 /* IO virtual address start page frame number */
 #define IOVA_START_PFN		(1)
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
 
 /* Reserved IOVA ranges */
 #define MSI_RANGE_START		(0xfee00000)
@@ -2010,8 +2009,7 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
 	if (!dma_dom->domain.pt_root)
 		goto free_dma_dom;
 
-	init_iova_domain(&dma_dom->iovad, PAGE_SIZE,
-			 IOVA_START_PFN, DMA_32BIT_PFN);
+	init_iova_domain(&dma_dom->iovad, PAGE_SIZE, IOVA_START_PFN);
 
 	/* Initialize reserved ranges */
 	copy_reserved_iova(&reserved_iova_ranges, &dma_dom->iovad);
@@ -2912,8 +2910,7 @@ static int init_reserved_iova_ranges(void)
 	struct pci_dev *pdev = NULL;
 	struct iova *val;
 
-	init_iova_domain(&reserved_iova_ranges, PAGE_SIZE,
-			 IOVA_START_PFN, DMA_32BIT_PFN);
+	init_iova_domain(&reserved_iova_ranges, PAGE_SIZE, IOVA_START_PFN);
 
 	lockdep_set_class(&reserved_iova_ranges.iova_rbtree_lock,
 			  &reserved_rbtree_key);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9d1cebe7f6cb..191be9c80a8a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -292,18 +292,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 		/* ...then finally give it a kicking to make sure it fits */
 		base_pfn = max_t(unsigned long, base_pfn,
 				domain->geometry.aperture_start >> order);
-		end_pfn = min_t(unsigned long, end_pfn,
-				domain->geometry.aperture_end >> order);
 	}
-	/*
-	 * PCI devices may have larger DMA masks, but still prefer allocating
-	 * within a 32-bit mask to avoid DAC addressing. Such limitations don't
-	 * apply to the typical platform device, so for those we may as well
-	 * leave the cache limit at the top of their range to save an rb_last()
-	 * traversal on every allocation.
-	 */
-	if (dev && dev_is_pci(dev))
-		end_pfn &= DMA_BIT_MASK(32) >> order;
 
 	/* start_pfn is always nonzero for an already-initialised domain */
 	if (iovad->start_pfn) {
@@ -312,16 +301,11 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 			pr_warn("Incompatible range for DMA domain\n");
 			return -EFAULT;
 		}
-		/*
-		 * If we have devices with different DMA masks, move the free
-		 * area cache limit down for the benefit of the smaller one.
-		 */
-		iovad->dma_32bit_pfn = min(end_pfn + 1, iovad->dma_32bit_pfn);
 
 		return 0;
 	}
 
-	init_iova_domain(iovad, 1UL << order, base_pfn, end_pfn);
+	init_iova_domain(iovad, 1UL << order, base_pfn);
 	if (!dev)
 		return 0;
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 687f18f65cea..afa3b4e765e7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -82,8 +82,6 @@
 #define IOVA_START_PFN		(1)
 
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
-#define DMA_64BIT_PFN		IOVA_PFN(DMA_BIT_MASK(64))
 
 /* page table handling */
 #define LEVEL_STRIDE		(9)
@@ -1874,8 +1872,7 @@ static int dmar_init_reserved_ranges(void)
 	struct iova *iova;
 	int i;
 
-	init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN);
 
 	lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
 		&reserved_rbtree_key);
@@ -1933,8 +1930,7 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
 	int adjust_width, agaw;
 	unsigned long sagaw;
 
-	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
@@ -4989,8 +4985,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
 {
 	int adjust_width;
 
-	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index f5809a2ee6c2..a8ba101ffdfc 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -35,7 +35,7 @@ static void free_iova_rcaches(struct iova_domain *iovad);
 
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn, unsigned long pfn_32bit)
+	unsigned long start_pfn)
 {
 	/*
 	 * IOVA granularity will normally be equal to the smallest
@@ -50,7 +50,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->cached32_node = NULL;
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
-	iovad->dma_32bit_pfn = pfn_32bit + 1;
+	iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
 	init_iova_rcaches(iovad);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 329727e00e97..c824329f7012 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -39,8 +39,7 @@ void scif_rma_ep_init(struct scif_endpt *ep)
 	struct scif_endpt_rma_info *rma = &ep->rma_info;
 
 	mutex_init(&rma->rma_lock);
-	init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN,
-			 SCIF_DMA_64BIT_PFN);
+	init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN);
 	spin_lock_init(&rma->tc_lock);
 	mutex_init(&rma->mmn_lock);
 	INIT_LIST_HEAD(&rma->reg_list);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 0bb8df43b393..58c2a365c45f 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -102,7 +102,7 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
 	unsigned long pfn_hi);
 void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn, unsigned long pfn_32bit);
+	unsigned long start_pfn);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 struct iova *split_and_remove_iova(struct iova_domain *iovad,
@@ -170,8 +170,7 @@ static inline void copy_reserved_iova(struct iova_domain *from,
 
 static inline void init_iova_domain(struct iova_domain *iovad,
 				    unsigned long granule,
-				    unsigned long start_pfn,
-				    unsigned long pfn_32bit)
+				    unsigned long start_pfn)
 {
 }
 
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-18 16:57   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-18 16:57 UTC (permalink / raw)
  To: linux-arm-kernel

From: Zhen Lei <thunder.leizhen@huawei.com>

Now that the cached node optimisation can apply to all allocations, the
couple of users which were playing tricks with dma_32bit_pfn in order to
benefit from it can stop doing so. Conversely, there is also no need for
all the other users to explicitly calculate a 'real' 32-bit PFN, when
init_iova_domain() can happily do that itself from the page granularity.

CC: Thierry Reding <thierry.reding@gmail.com>
CC: Jonathan Hunter <jonathanh@nvidia.com>
CC: David Airlie <airlied@linux.ie>
CC: Sudeep Dutt <sudeep.dutt@intel.com>
CC: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: use iova_pfn(), rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/gpu/drm/tegra/drm.c      |  3 +--
 drivers/gpu/host1x/dev.c         |  3 +--
 drivers/iommu/amd_iommu.c        |  7 ++-----
 drivers/iommu/dma-iommu.c        | 18 +-----------------
 drivers/iommu/intel-iommu.c      | 11 +++--------
 drivers/iommu/iova.c             |  4 ++--
 drivers/misc/mic/scif/scif_rma.c |  3 +--
 include/linux/iova.h             |  5 ++---
 8 files changed, 13 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 518f4b69ea53..81e9ae1ee90b 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -150,8 +150,7 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags)
 
 		order = __ffs(tegra->domain->pgsize_bitmap);
 		init_iova_domain(&tegra->carveout.domain, 1UL << order,
-				 carveout_start >> order,
-				 carveout_end >> order);
+				 carveout_start >> order);
 
 		tegra->carveout.shift = iova_shift(&tegra->carveout.domain);
 		tegra->carveout.limit = carveout_end >> tegra->carveout.shift;
diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c
index 2c58a390123a..57c8eed0ed71 100644
--- a/drivers/gpu/host1x/dev.c
+++ b/drivers/gpu/host1x/dev.c
@@ -193,8 +193,7 @@ static int host1x_probe(struct platform_device *pdev)
 
 		order = __ffs(host->domain->pgsize_bitmap);
 		init_iova_domain(&host->iova, 1UL << order,
-				 geometry->aperture_start >> order,
-				 geometry->aperture_end >> order);
+				 geometry->aperture_start >> order);
 		host->iova_end = geometry->aperture_end;
 	}
 
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 688e77576e5a..a12e3e12014a 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -63,7 +63,6 @@
 /* IO virtual address start page frame number */
 #define IOVA_START_PFN		(1)
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
 
 /* Reserved IOVA ranges */
 #define MSI_RANGE_START		(0xfee00000)
@@ -2010,8 +2009,7 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
 	if (!dma_dom->domain.pt_root)
 		goto free_dma_dom;
 
-	init_iova_domain(&dma_dom->iovad, PAGE_SIZE,
-			 IOVA_START_PFN, DMA_32BIT_PFN);
+	init_iova_domain(&dma_dom->iovad, PAGE_SIZE, IOVA_START_PFN);
 
 	/* Initialize reserved ranges */
 	copy_reserved_iova(&reserved_iova_ranges, &dma_dom->iovad);
@@ -2912,8 +2910,7 @@ static int init_reserved_iova_ranges(void)
 	struct pci_dev *pdev = NULL;
 	struct iova *val;
 
-	init_iova_domain(&reserved_iova_ranges, PAGE_SIZE,
-			 IOVA_START_PFN, DMA_32BIT_PFN);
+	init_iova_domain(&reserved_iova_ranges, PAGE_SIZE, IOVA_START_PFN);
 
 	lockdep_set_class(&reserved_iova_ranges.iova_rbtree_lock,
 			  &reserved_rbtree_key);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9d1cebe7f6cb..191be9c80a8a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -292,18 +292,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 		/* ...then finally give it a kicking to make sure it fits */
 		base_pfn = max_t(unsigned long, base_pfn,
 				domain->geometry.aperture_start >> order);
-		end_pfn = min_t(unsigned long, end_pfn,
-				domain->geometry.aperture_end >> order);
 	}
-	/*
-	 * PCI devices may have larger DMA masks, but still prefer allocating
-	 * within a 32-bit mask to avoid DAC addressing. Such limitations don't
-	 * apply to the typical platform device, so for those we may as well
-	 * leave the cache limit at the top of their range to save an rb_last()
-	 * traversal on every allocation.
-	 */
-	if (dev && dev_is_pci(dev))
-		end_pfn &= DMA_BIT_MASK(32) >> order;
 
 	/* start_pfn is always nonzero for an already-initialised domain */
 	if (iovad->start_pfn) {
@@ -312,16 +301,11 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 			pr_warn("Incompatible range for DMA domain\n");
 			return -EFAULT;
 		}
-		/*
-		 * If we have devices with different DMA masks, move the free
-		 * area cache limit down for the benefit of the smaller one.
-		 */
-		iovad->dma_32bit_pfn = min(end_pfn + 1, iovad->dma_32bit_pfn);
 
 		return 0;
 	}
 
-	init_iova_domain(iovad, 1UL << order, base_pfn, end_pfn);
+	init_iova_domain(iovad, 1UL << order, base_pfn);
 	if (!dev)
 		return 0;
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 687f18f65cea..afa3b4e765e7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -82,8 +82,6 @@
 #define IOVA_START_PFN		(1)
 
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
-#define DMA_64BIT_PFN		IOVA_PFN(DMA_BIT_MASK(64))
 
 /* page table handling */
 #define LEVEL_STRIDE		(9)
@@ -1874,8 +1872,7 @@ static int dmar_init_reserved_ranges(void)
 	struct iova *iova;
 	int i;
 
-	init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN);
 
 	lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
 		&reserved_rbtree_key);
@@ -1933,8 +1930,7 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
 	int adjust_width, agaw;
 	unsigned long sagaw;
 
-	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
@@ -4989,8 +4985,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
 {
 	int adjust_width;
 
-	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index f5809a2ee6c2..a8ba101ffdfc 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -35,7 +35,7 @@ static void free_iova_rcaches(struct iova_domain *iovad);
 
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn, unsigned long pfn_32bit)
+	unsigned long start_pfn)
 {
 	/*
 	 * IOVA granularity will normally be equal to the smallest
@@ -50,7 +50,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->cached32_node = NULL;
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
-	iovad->dma_32bit_pfn = pfn_32bit + 1;
+	iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
 	init_iova_rcaches(iovad);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 329727e00e97..c824329f7012 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -39,8 +39,7 @@ void scif_rma_ep_init(struct scif_endpt *ep)
 	struct scif_endpt_rma_info *rma = &ep->rma_info;
 
 	mutex_init(&rma->rma_lock);
-	init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN,
-			 SCIF_DMA_64BIT_PFN);
+	init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN);
 	spin_lock_init(&rma->tc_lock);
 	mutex_init(&rma->mmn_lock);
 	INIT_LIST_HEAD(&rma->reg_list);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 0bb8df43b393..58c2a365c45f 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -102,7 +102,7 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
 	unsigned long pfn_hi);
 void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn, unsigned long pfn_32bit);
+	unsigned long start_pfn);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 struct iova *split_and_remove_iova(struct iova_domain *iovad,
@@ -170,8 +170,7 @@ static inline void copy_reserved_iova(struct iova_domain *from,
 
 static inline void init_iova_domain(struct iova_domain *iovad,
 				    unsigned long granule,
-				    unsigned long start_pfn,
-				    unsigned long pfn_32bit)
+				    unsigned long start_pfn)
 {
 }
 
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-19  8:37   ` Ard Biesheuvel
  0 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2017-07-19  8:37 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, iommu, linux-arm-kernel, linux-kernel,
	David Woodhouse, Zhen Lei, Lorenzo Pieralisi, Jonathan.Cameron,
	nwatters, ray.jui

On 18 July 2017 at 17:57, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi all,
>
> In the wake of the ARM SMMU optimisation efforts, it seems that certain
> workloads (e.g. storage I/O with large scatterlists) probably remain quite
> heavily influenced by IOVA allocation performance. Separately, Ard also
> reported massive performance drops for a graphical desktop on AMD Seattle
> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
> ops domain getting initialised differently for ACPI vs. DT, and exposing
> the overhead of the rbtree slow path. Whilst we could go around trying to
> close up all the little gaps that lead to hitting the slowest case, it
> seems a much better idea to simply make said slowest case a lot less slow.
>
> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
> the changes rather too hard to follow, so I've taken the liberty here of
> picking the whole thing up and reimplementing the main part in a rather
> less invasive manner.
>
> Robin.
>
> [1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html
>
> Robin Murphy (1):
>   iommu/iova: Extend rbtree node caching
>
> Zhen Lei (3):
>   iommu/iova: Optimise rbtree searching
>   iommu/iova: Optimise the padding calculation
>   iommu/iova: Make dma_32bit_pfn implicit
>
>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>  drivers/gpu/host1x/dev.c         |   3 +-
>  drivers/iommu/amd_iommu.c        |   7 +--
>  drivers/iommu/dma-iommu.c        |  18 +------
>  drivers/iommu/intel-iommu.c      |  11 ++--
>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>  include/linux/iova.h             |   8 +--
>  8 files changed, 60 insertions(+), 105 deletions(-)
>

These patches look suspiciously like the ones I have been using over
the past couple of weeks (modulo the tegra and host1x changes) from
your git tree. They work fine on my AMD Overdrive B1, both in DT and
in ACPI/IORT modes, although it is difficult to quantify any
performance deltas on my setup.

Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-19  8:37   ` Ard Biesheuvel
  0 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2017-07-19  8:37 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w, Zhen Lei, David Woodhouse,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 18 July 2017 at 17:57, Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org> wrote:
> Hi all,
>
> In the wake of the ARM SMMU optimisation efforts, it seems that certain
> workloads (e.g. storage I/O with large scatterlists) probably remain quite
> heavily influenced by IOVA allocation performance. Separately, Ard also
> reported massive performance drops for a graphical desktop on AMD Seattle
> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
> ops domain getting initialised differently for ACPI vs. DT, and exposing
> the overhead of the rbtree slow path. Whilst we could go around trying to
> close up all the little gaps that lead to hitting the slowest case, it
> seems a much better idea to simply make said slowest case a lot less slow.
>
> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
> the changes rather too hard to follow, so I've taken the liberty here of
> picking the whole thing up and reimplementing the main part in a rather
> less invasive manner.
>
> Robin.
>
> [1] https://www.mail-archive.com/iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org/msg17753.html
>
> Robin Murphy (1):
>   iommu/iova: Extend rbtree node caching
>
> Zhen Lei (3):
>   iommu/iova: Optimise rbtree searching
>   iommu/iova: Optimise the padding calculation
>   iommu/iova: Make dma_32bit_pfn implicit
>
>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>  drivers/gpu/host1x/dev.c         |   3 +-
>  drivers/iommu/amd_iommu.c        |   7 +--
>  drivers/iommu/dma-iommu.c        |  18 +------
>  drivers/iommu/intel-iommu.c      |  11 ++--
>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>  include/linux/iova.h             |   8 +--
>  8 files changed, 60 insertions(+), 105 deletions(-)
>

These patches look suspiciously like the ones I have been using over
the past couple of weeks (modulo the tegra and host1x changes) from
your git tree. They work fine on my AMD Overdrive B1, both in DT and
in ACPI/IORT modes, although it is difficult to quantify any
performance deltas on my setup.

Tested-by: Ard Biesheuvel <ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-19  8:37   ` Ard Biesheuvel
  0 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2017-07-19  8:37 UTC (permalink / raw)
  To: linux-arm-kernel

On 18 July 2017 at 17:57, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi all,
>
> In the wake of the ARM SMMU optimisation efforts, it seems that certain
> workloads (e.g. storage I/O with large scatterlists) probably remain quite
> heavily influenced by IOVA allocation performance. Separately, Ard also
> reported massive performance drops for a graphical desktop on AMD Seattle
> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
> ops domain getting initialised differently for ACPI vs. DT, and exposing
> the overhead of the rbtree slow path. Whilst we could go around trying to
> close up all the little gaps that lead to hitting the slowest case, it
> seems a much better idea to simply make said slowest case a lot less slow.
>
> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
> the changes rather too hard to follow, so I've taken the liberty here of
> picking the whole thing up and reimplementing the main part in a rather
> less invasive manner.
>
> Robin.
>
> [1] https://www.mail-archive.com/iommu at lists.linux-foundation.org/msg17753.html
>
> Robin Murphy (1):
>   iommu/iova: Extend rbtree node caching
>
> Zhen Lei (3):
>   iommu/iova: Optimise rbtree searching
>   iommu/iova: Optimise the padding calculation
>   iommu/iova: Make dma_32bit_pfn implicit
>
>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>  drivers/gpu/host1x/dev.c         |   3 +-
>  drivers/iommu/amd_iommu.c        |   7 +--
>  drivers/iommu/dma-iommu.c        |  18 +------
>  drivers/iommu/intel-iommu.c      |  11 ++--
>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>  include/linux/iova.h             |   8 +--
>  8 files changed, 60 insertions(+), 105 deletions(-)
>

These patches look suspiciously like the ones I have been using over
the past couple of weeks (modulo the tegra and host1x changes) from
your git tree. They work fine on my AMD Overdrive B1, both in DT and
in ACPI/IORT modes, although it is difficult to quantify any
performance deltas on my setup.

Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] Optimise 64-bit IOVA allocations
  2017-07-19  8:37   ` Ard Biesheuvel
  (?)
@ 2017-07-19 10:23     ` Robin Murphy
  -1 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-19 10:23 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Joerg Roedel, iommu, linux-arm-kernel, linux-kernel,
	David Woodhouse, Zhen Lei, Lorenzo Pieralisi, Jonathan.Cameron,
	nwatters, ray.jui

On 19/07/17 09:37, Ard Biesheuvel wrote:
> On 18 July 2017 at 17:57, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi all,
>>
>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>> heavily influenced by IOVA allocation performance. Separately, Ard also
>> reported massive performance drops for a graphical desktop on AMD Seattle
>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>> the overhead of the rbtree slow path. Whilst we could go around trying to
>> close up all the little gaps that lead to hitting the slowest case, it
>> seems a much better idea to simply make said slowest case a lot less slow.
>>
>> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
>> the changes rather too hard to follow, so I've taken the liberty here of
>> picking the whole thing up and reimplementing the main part in a rather
>> less invasive manner.
>>
>> Robin.
>>
>> [1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html
>>
>> Robin Murphy (1):
>>   iommu/iova: Extend rbtree node caching
>>
>> Zhen Lei (3):
>>   iommu/iova: Optimise rbtree searching
>>   iommu/iova: Optimise the padding calculation
>>   iommu/iova: Make dma_32bit_pfn implicit
>>
>>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>>  drivers/gpu/host1x/dev.c         |   3 +-
>>  drivers/iommu/amd_iommu.c        |   7 +--
>>  drivers/iommu/dma-iommu.c        |  18 +------
>>  drivers/iommu/intel-iommu.c      |  11 ++--
>>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>>  include/linux/iova.h             |   8 +--
>>  8 files changed, 60 insertions(+), 105 deletions(-)
>>
> 
> These patches look suspiciously like the ones I have been using over
> the past couple of weeks (modulo the tegra and host1x changes) from
> your git tree. They work fine on my AMD Overdrive B1, both in DT and
> in ACPI/IORT modes, although it is difficult to quantify any
> performance deltas on my setup.

Indeed - this is a rebase (to account for those new callers) with a
couple of trivial tweaks to error paths and corner cases that normal
usage shouldn't have been hitting anyway. "No longer unusably awful" is
a good enough performance delta for me :)

> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Thanks!

Robin.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-19 10:23     ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-19 10:23 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Joerg Roedel, iommu, linux-arm-kernel, linux-kernel,
	David Woodhouse, Zhen Lei, Lorenzo Pieralisi, Jonathan.Cameron,
	nwatters, ray.jui

On 19/07/17 09:37, Ard Biesheuvel wrote:
> On 18 July 2017 at 17:57, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi all,
>>
>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>> heavily influenced by IOVA allocation performance. Separately, Ard also
>> reported massive performance drops for a graphical desktop on AMD Seattle
>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>> the overhead of the rbtree slow path. Whilst we could go around trying to
>> close up all the little gaps that lead to hitting the slowest case, it
>> seems a much better idea to simply make said slowest case a lot less slow.
>>
>> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
>> the changes rather too hard to follow, so I've taken the liberty here of
>> picking the whole thing up and reimplementing the main part in a rather
>> less invasive manner.
>>
>> Robin.
>>
>> [1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html
>>
>> Robin Murphy (1):
>>   iommu/iova: Extend rbtree node caching
>>
>> Zhen Lei (3):
>>   iommu/iova: Optimise rbtree searching
>>   iommu/iova: Optimise the padding calculation
>>   iommu/iova: Make dma_32bit_pfn implicit
>>
>>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>>  drivers/gpu/host1x/dev.c         |   3 +-
>>  drivers/iommu/amd_iommu.c        |   7 +--
>>  drivers/iommu/dma-iommu.c        |  18 +------
>>  drivers/iommu/intel-iommu.c      |  11 ++--
>>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>>  include/linux/iova.h             |   8 +--
>>  8 files changed, 60 insertions(+), 105 deletions(-)
>>
> 
> These patches look suspiciously like the ones I have been using over
> the past couple of weeks (modulo the tegra and host1x changes) from
> your git tree. They work fine on my AMD Overdrive B1, both in DT and
> in ACPI/IORT modes, although it is difficult to quantify any
> performance deltas on my setup.

Indeed - this is a rebase (to account for those new callers) with a
couple of trivial tweaks to error paths and corner cases that normal
usage shouldn't have been hitting anyway. "No longer unusably awful" is
a good enough performance delta for me :)

> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Thanks!

Robin.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-19 10:23     ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2017-07-19 10:23 UTC (permalink / raw)
  To: linux-arm-kernel

On 19/07/17 09:37, Ard Biesheuvel wrote:
> On 18 July 2017 at 17:57, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi all,
>>
>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>> heavily influenced by IOVA allocation performance. Separately, Ard also
>> reported massive performance drops for a graphical desktop on AMD Seattle
>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>> the overhead of the rbtree slow path. Whilst we could go around trying to
>> close up all the little gaps that lead to hitting the slowest case, it
>> seems a much better idea to simply make said slowest case a lot less slow.
>>
>> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
>> the changes rather too hard to follow, so I've taken the liberty here of
>> picking the whole thing up and reimplementing the main part in a rather
>> less invasive manner.
>>
>> Robin.
>>
>> [1] https://www.mail-archive.com/iommu at lists.linux-foundation.org/msg17753.html
>>
>> Robin Murphy (1):
>>   iommu/iova: Extend rbtree node caching
>>
>> Zhen Lei (3):
>>   iommu/iova: Optimise rbtree searching
>>   iommu/iova: Optimise the padding calculation
>>   iommu/iova: Make dma_32bit_pfn implicit
>>
>>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>>  drivers/gpu/host1x/dev.c         |   3 +-
>>  drivers/iommu/amd_iommu.c        |   7 +--
>>  drivers/iommu/dma-iommu.c        |  18 +------
>>  drivers/iommu/intel-iommu.c      |  11 ++--
>>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>>  include/linux/iova.h             |   8 +--
>>  8 files changed, 60 insertions(+), 105 deletions(-)
>>
> 
> These patches look suspiciously like the ones I have been using over
> the past couple of weeks (modulo the tegra and host1x changes) from
> your git tree. They work fine on my AMD Overdrive B1, both in DT and
> in ACPI/IORT modes, although it is difficult to quantify any
> performance deltas on my setup.

Indeed - this is a rebase (to account for those new callers) with a
couple of trivial tweaks to error paths and corner cases that normal
usage shouldn't have been hitting anyway. "No longer unusably awful" is
a good enough performance delta for me :)

> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Thanks!

Robin.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-19 15:07     ` kbuild test robot
  0 siblings, 0 replies; 30+ messages in thread
From: kbuild test robot @ 2017-07-19 15:07 UTC (permalink / raw)
  To: Robin Murphy
  Cc: kbuild-all, joro, iommu, linux-arm-kernel, linux-kernel, dwmw2,
	thunder.leizhen, lorenzo.pieralisi, ard.biesheuvel,
	Jonathan.Cameron, nwatters, ray.jui, Thierry Reding,
	Jonathan Hunter, David Airlie, Sudeep Dutt, Ashutosh Dixit

[-- Attachment #1: Type: text/plain, Size: 2031 bytes --]

Hi Zhen,

[auto build test WARNING on iommu/next]
[also build test WARNING on v4.13-rc1]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Robin-Murphy/Optimise-64-bit-IOVA-allocations/20170719-060847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: arm-multi_v7_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All warnings (new ones prefixed by >>):

   drivers/iommu/iova.c: In function 'init_iova_domain':
>> drivers/iommu/iova.c:53:41: warning: large integer implicitly truncated to unsigned type [-Woverflow]
     iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
                                            ^~~~

vim +53 drivers/iommu/iova.c

    35	
    36	void
    37	init_iova_domain(struct iova_domain *iovad, unsigned long granule,
    38		unsigned long start_pfn)
    39	{
    40		/*
    41		 * IOVA granularity will normally be equal to the smallest
    42		 * supported IOMMU page size; both *must* be capable of
    43		 * representing individual CPU pages exactly.
    44		 */
    45		BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
    46	
    47		spin_lock_init(&iovad->iova_rbtree_lock);
    48		iovad->rbroot = RB_ROOT;
    49		iovad->cached_node = NULL;
    50		iovad->cached32_node = NULL;
    51		iovad->granule = granule;
    52		iovad->start_pfn = start_pfn;
  > 53		iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
    54		init_iova_rcaches(iovad);
    55	}
    56	EXPORT_SYMBOL_GPL(init_iova_domain);
    57	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 41230 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-19 15:07     ` kbuild test robot
  0 siblings, 0 replies; 30+ messages in thread
From: kbuild test robot @ 2017-07-19 15:07 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Sudeep Dutt, ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A, David Airlie,
	Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jonathan Hunter,
	Ashutosh Dixit,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Thierry Reding, ray.jui-dY08KVG/lbpWk0Htik3J/w,
	kbuild-all-JC7UmRfGjtg, thunder.leizhen-hv44wF8Li93QT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

[-- Attachment #1: Type: text/plain, Size: 2031 bytes --]

Hi Zhen,

[auto build test WARNING on iommu/next]
[also build test WARNING on v4.13-rc1]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Robin-Murphy/Optimise-64-bit-IOVA-allocations/20170719-060847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: arm-multi_v7_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All warnings (new ones prefixed by >>):

   drivers/iommu/iova.c: In function 'init_iova_domain':
>> drivers/iommu/iova.c:53:41: warning: large integer implicitly truncated to unsigned type [-Woverflow]
     iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
                                            ^~~~

vim +53 drivers/iommu/iova.c

    35	
    36	void
    37	init_iova_domain(struct iova_domain *iovad, unsigned long granule,
    38		unsigned long start_pfn)
    39	{
    40		/*
    41		 * IOVA granularity will normally be equal to the smallest
    42		 * supported IOMMU page size; both *must* be capable of
    43		 * representing individual CPU pages exactly.
    44		 */
    45		BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
    46	
    47		spin_lock_init(&iovad->iova_rbtree_lock);
    48		iovad->rbroot = RB_ROOT;
    49		iovad->cached_node = NULL;
    50		iovad->cached32_node = NULL;
    51		iovad->granule = granule;
    52		iovad->start_pfn = start_pfn;
  > 53		iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
    54		init_iova_rcaches(iovad);
    55	}
    56	EXPORT_SYMBOL_GPL(init_iova_domain);
    57	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 41230 bytes --]

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-19 15:07     ` kbuild test robot
  0 siblings, 0 replies; 30+ messages in thread
From: kbuild test robot @ 2017-07-19 15:07 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Zhen,

[auto build test WARNING on iommu/next]
[also build test WARNING on v4.13-rc1]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Robin-Murphy/Optimise-64-bit-IOVA-allocations/20170719-060847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: arm-multi_v7_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All warnings (new ones prefixed by >>):

   drivers/iommu/iova.c: In function 'init_iova_domain':
>> drivers/iommu/iova.c:53:41: warning: large integer implicitly truncated to unsigned type [-Woverflow]
     iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
                                            ^~~~

vim +53 drivers/iommu/iova.c

    35	
    36	void
    37	init_iova_domain(struct iova_domain *iovad, unsigned long granule,
    38		unsigned long start_pfn)
    39	{
    40		/*
    41		 * IOVA granularity will normally be equal to the smallest
    42		 * supported IOMMU page size; both *must* be capable of
    43		 * representing individual CPU pages exactly.
    44		 */
    45		BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
    46	
    47		spin_lock_init(&iovad->iova_rbtree_lock);
    48		iovad->rbroot = RB_ROOT;
    49		iovad->cached_node = NULL;
    50		iovad->cached32_node = NULL;
    51		iovad->granule = granule;
    52		iovad->start_pfn = start_pfn;
  > 53		iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
    54		init_iova_rcaches(iovad);
    55	}
    56	EXPORT_SYMBOL_GPL(init_iova_domain);
    57	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.gz
Type: application/gzip
Size: 41230 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20170719/ed8e7764/attachment-0001.gz>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-20  2:55       ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 30+ messages in thread
From: Leizhen (ThunderTown) @ 2017-07-20  2:55 UTC (permalink / raw)
  To: kbuild test robot, Robin Murphy
  Cc: kbuild-all, joro, iommu, linux-arm-kernel, linux-kernel, dwmw2,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui, Thierry Reding, Jonathan Hunter, David Airlie,
	Sudeep Dutt, Ashutosh Dixit



On 2017/7/19 23:07, kbuild test robot wrote:
> Hi Zhen,
> 
> [auto build test WARNING on iommu/next]
> [also build test WARNING on v4.13-rc1]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Robin-Murphy/Optimise-64-bit-IOVA-allocations/20170719-060847
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
> config: arm-multi_v7_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm 
> 
> All warnings (new ones prefixed by >>):
> 
>    drivers/iommu/iova.c: In function 'init_iova_domain':
>>> drivers/iommu/iova.c:53:41: warning: large integer implicitly truncated to unsigned type [-Woverflow]
>      iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
OK, I see. I think the problem is that "1ULL << 32" exceed the scope of 32bits general register. We should
replace "1ULL << 32" with DMA_BIT_MASK(32), the latter will minus one to keep it can be safely stored in
the general register.

iovad->dma_32bit_pfn = iova_pfn(iovad, DMA_BIT_MASK(32)) + 1;

>                                             ^~~~
> 
> vim +53 drivers/iommu/iova.c
> 
>     35	
>     36	void
>     37	init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>     38		unsigned long start_pfn)
>     39	{
>     40		/*
>     41		 * IOVA granularity will normally be equal to the smallest
>     42		 * supported IOMMU page size; both *must* be capable of
>     43		 * representing individual CPU pages exactly.
>     44		 */
>     45		BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
>     46	
>     47		spin_lock_init(&iovad->iova_rbtree_lock);
>     48		iovad->rbroot = RB_ROOT;
>     49		iovad->cached_node = NULL;
>     50		iovad->cached32_node = NULL;
>     51		iovad->granule = granule;
>     52		iovad->start_pfn = start_pfn;
>   > 53		iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
>     54		init_iova_rcaches(iovad);
>     55	}
>     56	EXPORT_SYMBOL_GPL(init_iova_domain);
>     57	
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 

-- 
Thanks!
BestRegards

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-20  2:55       ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 30+ messages in thread
From: Leizhen (ThunderTown) @ 2017-07-20  2:55 UTC (permalink / raw)
  To: kbuild test robot, Robin Murphy
  Cc: Sudeep Dutt, ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A, David Airlie,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jonathan Hunter,
	Ashutosh Dixit,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Thierry Reding, ray.jui-dY08KVG/lbpWk0Htik3J/w,
	kbuild-all-JC7UmRfGjtg, Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r



On 2017/7/19 23:07, kbuild test robot wrote:
> Hi Zhen,
> 
> [auto build test WARNING on iommu/next]
> [also build test WARNING on v4.13-rc1]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Robin-Murphy/Optimise-64-bit-IOVA-allocations/20170719-060847
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
> config: arm-multi_v7_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm 
> 
> All warnings (new ones prefixed by >>):
> 
>    drivers/iommu/iova.c: In function 'init_iova_domain':
>>> drivers/iommu/iova.c:53:41: warning: large integer implicitly truncated to unsigned type [-Woverflow]
>      iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
OK, I see. I think the problem is that "1ULL << 32" exceed the scope of 32bits general register. We should
replace "1ULL << 32" with DMA_BIT_MASK(32), the latter will minus one to keep it can be safely stored in
the general register.

iovad->dma_32bit_pfn = iova_pfn(iovad, DMA_BIT_MASK(32)) + 1;

>                                             ^~~~
> 
> vim +53 drivers/iommu/iova.c
> 
>     35	
>     36	void
>     37	init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>     38		unsigned long start_pfn)
>     39	{
>     40		/*
>     41		 * IOVA granularity will normally be equal to the smallest
>     42		 * supported IOMMU page size; both *must* be capable of
>     43		 * representing individual CPU pages exactly.
>     44		 */
>     45		BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
>     46	
>     47		spin_lock_init(&iovad->iova_rbtree_lock);
>     48		iovad->rbroot = RB_ROOT;
>     49		iovad->cached_node = NULL;
>     50		iovad->cached32_node = NULL;
>     51		iovad->granule = granule;
>     52		iovad->start_pfn = start_pfn;
>   > 53		iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
>     54		init_iova_rcaches(iovad);
>     55	}
>     56	EXPORT_SYMBOL_GPL(init_iova_domain);
>     57	
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 

-- 
Thanks!
BestRegards

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit
@ 2017-07-20  2:55       ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 30+ messages in thread
From: Leizhen (ThunderTown) @ 2017-07-20  2:55 UTC (permalink / raw)
  To: linux-arm-kernel



On 2017/7/19 23:07, kbuild test robot wrote:
> Hi Zhen,
> 
> [auto build test WARNING on iommu/next]
> [also build test WARNING on v4.13-rc1]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Robin-Murphy/Optimise-64-bit-IOVA-allocations/20170719-060847
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
> config: arm-multi_v7_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm 
> 
> All warnings (new ones prefixed by >>):
> 
>    drivers/iommu/iova.c: In function 'init_iova_domain':
>>> drivers/iommu/iova.c:53:41: warning: large integer implicitly truncated to unsigned type [-Woverflow]
>      iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
OK, I see. I think the problem is that "1ULL << 32" exceed the scope of 32bits general register. We should
replace "1ULL << 32" with DMA_BIT_MASK(32), the latter will minus one to keep it can be safely stored in
the general register.

iovad->dma_32bit_pfn = iova_pfn(iovad, DMA_BIT_MASK(32)) + 1;

>                                             ^~~~
> 
> vim +53 drivers/iommu/iova.c
> 
>     35	
>     36	void
>     37	init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>     38		unsigned long start_pfn)
>     39	{
>     40		/*
>     41		 * IOVA granularity will normally be equal to the smallest
>     42		 * supported IOMMU page size; both *must* be capable of
>     43		 * representing individual CPU pages exactly.
>     44		 */
>     45		BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
>     46	
>     47		spin_lock_init(&iovad->iova_rbtree_lock);
>     48		iovad->rbroot = RB_ROOT;
>     49		iovad->cached_node = NULL;
>     50		iovad->cached32_node = NULL;
>     51		iovad->granule = granule;
>     52		iovad->start_pfn = start_pfn;
>   > 53		iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
>     54		init_iova_rcaches(iovad);
>     55	}
>     56	EXPORT_SYMBOL_GPL(init_iova_domain);
>     57	
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 

-- 
Thanks!
BestRegards

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-21  9:48       ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 30+ messages in thread
From: Leizhen (ThunderTown) @ 2017-07-21  9:48 UTC (permalink / raw)
  To: Robin Murphy, Ard Biesheuvel
  Cc: Joerg Roedel, iommu, linux-arm-kernel, linux-kernel,
	David Woodhouse, Lorenzo Pieralisi, Jonathan.Cameron, nwatters,
	ray.jui



On 2017/7/19 18:23, Robin Murphy wrote:
> On 19/07/17 09:37, Ard Biesheuvel wrote:
>> On 18 July 2017 at 17:57, Robin Murphy <robin.murphy@arm.com> wrote:
>>> Hi all,
>>>
>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>> close up all the little gaps that lead to hitting the slowest case, it
>>> seems a much better idea to simply make said slowest case a lot less slow.
>>>
>>> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
>>> the changes rather too hard to follow, so I've taken the liberty here of
>>> picking the whole thing up and reimplementing the main part in a rather
>>> less invasive manner.
>>>
>>> Robin.
>>>
>>> [1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html
>>>
>>> Robin Murphy (1):
>>>   iommu/iova: Extend rbtree node caching
>>>
>>> Zhen Lei (3):
>>>   iommu/iova: Optimise rbtree searching
>>>   iommu/iova: Optimise the padding calculation
>>>   iommu/iova: Make dma_32bit_pfn implicit
>>>
>>>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>>>  drivers/gpu/host1x/dev.c         |   3 +-
>>>  drivers/iommu/amd_iommu.c        |   7 +--
>>>  drivers/iommu/dma-iommu.c        |  18 +------
>>>  drivers/iommu/intel-iommu.c      |  11 ++--
>>>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>>>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>>>  include/linux/iova.h             |   8 +--
>>>  8 files changed, 60 insertions(+), 105 deletions(-)
>>>
>>
>> These patches look suspiciously like the ones I have been using over
>> the past couple of weeks (modulo the tegra and host1x changes) from
>> your git tree. They work fine on my AMD Overdrive B1, both in DT and
>> in ACPI/IORT modes, although it is difficult to quantify any
>> performance deltas on my setup.
> 
> Indeed - this is a rebase (to account for those new callers) with a
> couple of trivial tweaks to error paths and corner cases that normal
> usage shouldn't have been hitting anyway. "No longer unusably awful" is
> a good enough performance delta for me :)
> 
>> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
I got the same performance data compared with my patch version. It works well.

Tested-by: Zhen Lei <thunder.leizhen@huawei.com>

> 
> Thanks!
> 
> Robin.
> 
> .
> 

-- 
Thanks!
BestRegards

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-21  9:48       ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 30+ messages in thread
From: Leizhen (ThunderTown) @ 2017-07-21  9:48 UTC (permalink / raw)
  To: Robin Murphy, Ard Biesheuvel
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	ray.jui-dY08KVG/lbpWk0Htik3J/w,
	Jonathan.Cameron-hv44wF8Li93QT0dZR+AlfA, David Woodhouse,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r



On 2017/7/19 18:23, Robin Murphy wrote:
> On 19/07/17 09:37, Ard Biesheuvel wrote:
>> On 18 July 2017 at 17:57, Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org> wrote:
>>> Hi all,
>>>
>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>> close up all the little gaps that lead to hitting the slowest case, it
>>> seems a much better idea to simply make said slowest case a lot less slow.
>>>
>>> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
>>> the changes rather too hard to follow, so I've taken the liberty here of
>>> picking the whole thing up and reimplementing the main part in a rather
>>> less invasive manner.
>>>
>>> Robin.
>>>
>>> [1] https://www.mail-archive.com/iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org/msg17753.html
>>>
>>> Robin Murphy (1):
>>>   iommu/iova: Extend rbtree node caching
>>>
>>> Zhen Lei (3):
>>>   iommu/iova: Optimise rbtree searching
>>>   iommu/iova: Optimise the padding calculation
>>>   iommu/iova: Make dma_32bit_pfn implicit
>>>
>>>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>>>  drivers/gpu/host1x/dev.c         |   3 +-
>>>  drivers/iommu/amd_iommu.c        |   7 +--
>>>  drivers/iommu/dma-iommu.c        |  18 +------
>>>  drivers/iommu/intel-iommu.c      |  11 ++--
>>>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>>>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>>>  include/linux/iova.h             |   8 +--
>>>  8 files changed, 60 insertions(+), 105 deletions(-)
>>>
>>
>> These patches look suspiciously like the ones I have been using over
>> the past couple of weeks (modulo the tegra and host1x changes) from
>> your git tree. They work fine on my AMD Overdrive B1, both in DT and
>> in ACPI/IORT modes, although it is difficult to quantify any
>> performance deltas on my setup.
> 
> Indeed - this is a rebase (to account for those new callers) with a
> couple of trivial tweaks to error paths and corner cases that normal
> usage shouldn't have been hitting anyway. "No longer unusably awful" is
> a good enough performance delta for me :)
> 
>> Tested-by: Ard Biesheuvel <ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
I got the same performance data compared with my patch version. It works well.

Tested-by: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

> 
> Thanks!
> 
> Robin.
> 
> .
> 

-- 
Thanks!
BestRegards

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-21  9:48       ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 30+ messages in thread
From: Leizhen (ThunderTown) @ 2017-07-21  9:48 UTC (permalink / raw)
  To: linux-arm-kernel



On 2017/7/19 18:23, Robin Murphy wrote:
> On 19/07/17 09:37, Ard Biesheuvel wrote:
>> On 18 July 2017 at 17:57, Robin Murphy <robin.murphy@arm.com> wrote:
>>> Hi all,
>>>
>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>> close up all the little gaps that lead to hitting the slowest case, it
>>> seems a much better idea to simply make said slowest case a lot less slow.
>>>
>>> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
>>> the changes rather too hard to follow, so I've taken the liberty here of
>>> picking the whole thing up and reimplementing the main part in a rather
>>> less invasive manner.
>>>
>>> Robin.
>>>
>>> [1] https://www.mail-archive.com/iommu at lists.linux-foundation.org/msg17753.html
>>>
>>> Robin Murphy (1):
>>>   iommu/iova: Extend rbtree node caching
>>>
>>> Zhen Lei (3):
>>>   iommu/iova: Optimise rbtree searching
>>>   iommu/iova: Optimise the padding calculation
>>>   iommu/iova: Make dma_32bit_pfn implicit
>>>
>>>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>>>  drivers/gpu/host1x/dev.c         |   3 +-
>>>  drivers/iommu/amd_iommu.c        |   7 +--
>>>  drivers/iommu/dma-iommu.c        |  18 +------
>>>  drivers/iommu/intel-iommu.c      |  11 ++--
>>>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>>>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>>>  include/linux/iova.h             |   8 +--
>>>  8 files changed, 60 insertions(+), 105 deletions(-)
>>>
>>
>> These patches look suspiciously like the ones I have been using over
>> the past couple of weeks (modulo the tegra and host1x changes) from
>> your git tree. They work fine on my AMD Overdrive B1, both in DT and
>> in ACPI/IORT modes, although it is difficult to quantify any
>> performance deltas on my setup.
> 
> Indeed - this is a rebase (to account for those new callers) with a
> couple of trivial tweaks to error paths and corner cases that normal
> usage shouldn't have been hitting anyway. "No longer unusably awful" is
> a good enough performance delta for me :)
> 
>> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
I got the same performance data compared with my patch version. It works well.

Tested-by: Zhen Lei <thunder.leizhen@huawei.com>

> 
> Thanks!
> 
> Robin.
> 
> .
> 

-- 
Thanks!
BestRegards

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-07-21  9:49 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-18 16:57 [PATCH 0/4] Optimise 64-bit IOVA allocations Robin Murphy
2017-07-18 16:57 ` Robin Murphy
2017-07-18 16:57 ` Robin Murphy
2017-07-18 16:57 ` [PATCH 1/4] iommu/iova: Optimise rbtree searching Robin Murphy
2017-07-18 16:57   ` Robin Murphy
2017-07-18 16:57   ` Robin Murphy
2017-07-18 16:57 ` [PATCH 2/4] iommu/iova: Optimise the padding calculation Robin Murphy
2017-07-18 16:57   ` Robin Murphy
2017-07-18 16:57   ` Robin Murphy
2017-07-18 16:57 ` [PATCH 3/4] iommu/iova: Extend rbtree node caching Robin Murphy
2017-07-18 16:57   ` Robin Murphy
2017-07-18 16:57   ` Robin Murphy
2017-07-18 16:57 ` [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit Robin Murphy
2017-07-18 16:57   ` Robin Murphy
2017-07-18 16:57   ` Robin Murphy
2017-07-19 15:07   ` kbuild test robot
2017-07-19 15:07     ` kbuild test robot
2017-07-19 15:07     ` kbuild test robot
2017-07-20  2:55     ` Leizhen (ThunderTown)
2017-07-20  2:55       ` Leizhen (ThunderTown)
2017-07-20  2:55       ` Leizhen (ThunderTown)
2017-07-19  8:37 ` [PATCH 0/4] Optimise 64-bit IOVA allocations Ard Biesheuvel
2017-07-19  8:37   ` Ard Biesheuvel
2017-07-19  8:37   ` Ard Biesheuvel
2017-07-19 10:23   ` Robin Murphy
2017-07-19 10:23     ` Robin Murphy
2017-07-19 10:23     ` Robin Murphy
2017-07-21  9:48     ` Leizhen (ThunderTown)
2017-07-21  9:48       ` Leizhen (ThunderTown)
2017-07-21  9:48       ` Leizhen (ThunderTown)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.