linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] Optimise 64-bit IOVA allocations
@ 2017-07-21 11:41 Robin Murphy
  2017-07-21 11:41 ` [PATCH v2 1/4] iommu/iova: Optimise rbtree searching Robin Murphy
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Robin Murphy @ 2017-07-21 11:41 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

Hi all,

In the wake of the ARM SMMU optimisation efforts, it seems that certain
workloads (e.g. storage I/O with large scatterlists) probably remain quite
heavily influenced by IOVA allocation performance. Separately, Ard also
reported massive performance drops for a graphical desktop on AMD Seattle
when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
ops domain getting initialised differently for ACPI vs. DT, and exposing
the overhead of the rbtree slow path. Whilst we could go around trying to
close up all the little gaps that lead to hitting the slowest case, it
seems a much better idea to simply make said slowest case a lot less slow.

I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
the changes rather too hard to follow, so I've taken the liberty here of
picking the whole thing up and reimplementing the main part in a rather
less invasive manner.

Robin.

Changes from v1:
 - Fix overflow with 32-bit dma_addr_t
 - Add tested-bys

[1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html

Robin Murphy (1):
  iommu/iova: Extend rbtree node caching

Zhen Lei (3):
  iommu/iova: Optimise rbtree searching
  iommu/iova: Optimise the padding calculation
  iommu/iova: Make dma_32bit_pfn implicit

 drivers/gpu/drm/tegra/drm.c      |   3 +-
 drivers/gpu/host1x/dev.c         |   3 +-
 drivers/iommu/amd_iommu.c        |   7 +--
 drivers/iommu/dma-iommu.c        |  18 +------
 drivers/iommu/intel-iommu.c      |  11 ++--
 drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
 drivers/misc/mic/scif/scif_rma.c |   3 +-
 include/linux/iova.h             |   8 +--
 8 files changed, 60 insertions(+), 105 deletions(-)

-- 
2.12.2.dirty

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/4] iommu/iova: Optimise rbtree searching
  2017-07-21 11:41 [PATCH v2 0/4] Optimise 64-bit IOVA allocations Robin Murphy
@ 2017-07-21 11:41 ` Robin Murphy
  2017-07-21 11:41 ` [PATCH v2 2/4] iommu/iova: Optimise the padding calculation Robin Murphy
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Robin Murphy @ 2017-07-21 11:41 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

From: Zhen Lei <thunder.leizhen@huawei.com>

Checking the IOVA bounds separately before deciding which direction to
continue the search (if necessary) results in redundantly comparing both
pfns twice each. GCC can already determine that the final comparison op
is redundant and optimise it down to 3 comparisons in total, but we can go one
further with a little tweak of the ordering (which makes the intent of
the code that much cleaner as a bonus).

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: rewrote commit message to clarify]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: No change

 drivers/iommu/iova.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 246f14c83944..8f7552dc4e04 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -289,15 +289,12 @@ private_find_iova(struct iova_domain *iovad, unsigned long pfn)
 	while (node) {
 		struct iova *iova = rb_entry(node, struct iova, node);
 
-		/* If pfn falls within iova's range, return iova */
-		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
-			return iova;
-		}
-
 		if (pfn < iova->pfn_lo)
 			node = node->rb_left;
-		else if (pfn > iova->pfn_lo)
+		else if (pfn > iova->pfn_hi)
 			node = node->rb_right;
+		else
+			return iova;	/* pfn falls within iova's range */
 	}
 
 	return NULL;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/4] iommu/iova: Optimise the padding calculation
  2017-07-21 11:41 [PATCH v2 0/4] Optimise 64-bit IOVA allocations Robin Murphy
  2017-07-21 11:41 ` [PATCH v2 1/4] iommu/iova: Optimise rbtree searching Robin Murphy
@ 2017-07-21 11:41 ` Robin Murphy
  2017-07-21 11:42 ` [PATCH v2 3/4] iommu/iova: Extend rbtree node caching Robin Murphy
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Robin Murphy @ 2017-07-21 11:41 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

From: Zhen Lei <thunder.leizhen@huawei.com>

The mask for calculating the padding size doesn't change, so there's no
need to recalculate it on every loop iteration. Furthermore, once we've
done that, it becomes clear that we don't actually need to calculate a
padding size at all - by flipping the arithmetic around, we can just
combine the upper limit, size, and mask directly to check against the
lower limit.

For an arm64 build, this alone knocks 16% off the size of the entire
alloc_iova() function!

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: simplified more of the arithmetic, rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: No change

 drivers/iommu/iova.c | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 8f7552dc4e04..d094d1ca8f23 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -129,16 +129,6 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
 	rb_insert_color(&iova->node, root);
 }
 
-/*
- * Computes the padding size required, to make the start address
- * naturally aligned on the power-of-two order of its size
- */
-static unsigned int
-iova_get_pad_size(unsigned int size, unsigned int limit_pfn)
-{
-	return (limit_pfn - size) & (__roundup_pow_of_two(size) - 1);
-}
-
 static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		unsigned long size, unsigned long limit_pfn,
 			struct iova *new, bool size_aligned)
@@ -146,7 +136,10 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	struct rb_node *prev, *curr = NULL;
 	unsigned long flags;
 	unsigned long saved_pfn;
-	unsigned int pad_size = 0;
+	unsigned long align_mask = ~0UL;
+
+	if (size_aligned)
+		align_mask <<= __fls(size);
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
@@ -156,31 +149,26 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	while (curr) {
 		struct iova *curr_iova = rb_entry(curr, struct iova, node);
 
-		if (limit_pfn <= curr_iova->pfn_lo) {
+		if (limit_pfn <= curr_iova->pfn_lo)
 			goto move_left;
-		} else if (limit_pfn > curr_iova->pfn_hi) {
-			if (size_aligned)
-				pad_size = iova_get_pad_size(size, limit_pfn);
-			if ((curr_iova->pfn_hi + size + pad_size) < limit_pfn)
-				break;	/* found a free slot */
-		}
+
+		if (((limit_pfn - size) & align_mask) > curr_iova->pfn_hi)
+			break;	/* found a free slot */
+
 		limit_pfn = curr_iova->pfn_lo;
 move_left:
 		prev = curr;
 		curr = rb_prev(curr);
 	}
 
-	if (!curr) {
-		if (size_aligned)
-			pad_size = iova_get_pad_size(size, limit_pfn);
-		if ((iovad->start_pfn + size + pad_size) > limit_pfn) {
-			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-			return -ENOMEM;
-		}
+	if (limit_pfn < size ||
+	    (!curr && ((limit_pfn - size) & align_mask) < iovad->start_pfn)) {
+		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		return -ENOMEM;
 	}
 
 	/* pfn_lo will point to size aligned address if size_aligned is set */
-	new->pfn_lo = limit_pfn - (size + pad_size);
+	new->pfn_lo = (limit_pfn - size) & align_mask;
 	new->pfn_hi = new->pfn_lo + size - 1;
 
 	/* If we have 'prev', it's a valid place to start the insertion. */
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
  2017-07-21 11:41 [PATCH v2 0/4] Optimise 64-bit IOVA allocations Robin Murphy
  2017-07-21 11:41 ` [PATCH v2 1/4] iommu/iova: Optimise rbtree searching Robin Murphy
  2017-07-21 11:41 ` [PATCH v2 2/4] iommu/iova: Optimise the padding calculation Robin Murphy
@ 2017-07-21 11:42 ` Robin Murphy
  2017-07-29  3:57   ` Nate Watterson
  2017-07-21 11:42 ` [PATCH v2 4/4] iommu/iova: Make dma_32bit_pfn implicit Robin Murphy
  2017-07-26 11:08 ` [PATCH v2 0/4] Optimise 64-bit IOVA allocations Joerg Roedel
  4 siblings, 1 reply; 17+ messages in thread
From: Robin Murphy @ 2017-07-21 11:42 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that traversing to the end then walking all the way down
to find free space every time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei <thunder.leizhen@huawei.com>.

Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: No change

 drivers/iommu/iova.c | 59 +++++++++++++++++++++++++---------------------------
 include/linux/iova.h |  3 ++-
 2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index d094d1ca8f23..f5809a2ee6c2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 
 	spin_lock_init(&iovad->iova_rbtree_lock);
 	iovad->rbroot = RB_ROOT;
+	iovad->cached_node = NULL;
 	iovad->cached32_node = NULL;
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
@@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
 static struct rb_node *
 __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
 {
-	if ((*limit_pfn > iovad->dma_32bit_pfn) ||
-		(iovad->cached32_node == NULL))
+	struct rb_node *cached_node = NULL;
+	struct iova *curr_iova;
+
+	if (*limit_pfn <= iovad->dma_32bit_pfn)
+		cached_node = iovad->cached32_node;
+	if (!cached_node)
+		cached_node = iovad->cached_node;
+	if (!cached_node)
 		return rb_last(&iovad->rbroot);
-	else {
-		struct rb_node *prev_node = rb_prev(iovad->cached32_node);
-		struct iova *curr_iova =
-			rb_entry(iovad->cached32_node, struct iova, node);
-		*limit_pfn = curr_iova->pfn_lo;
-		return prev_node;
-	}
+
+	curr_iova = rb_entry(cached_node, struct iova, node);
+	*limit_pfn = curr_iova->pfn_lo;
+
+	return rb_prev(cached_node);
 }
 
 static void
-__cached_rbnode_insert_update(struct iova_domain *iovad,
-	unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
 {
-	if (limit_pfn != iovad->dma_32bit_pfn)
-		return;
-	iovad->cached32_node = &new->node;
+	if (new->pfn_lo > iovad->dma_32bit_pfn)
+		iovad->cached_node = &new->node;
+	else
+		iovad->cached32_node = &new->node;
 }
 
 static void
 __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 {
 	struct iova *cached_iova;
-	struct rb_node *curr;
+	struct rb_node **curr = NULL;
 
-	if (!iovad->cached32_node)
-		return;
-	curr = iovad->cached32_node;
-	cached_iova = rb_entry(curr, struct iova, node);
+	if (free->pfn_hi < iovad->dma_32bit_pfn)
+		curr = &iovad->cached32_node;
+	if (!curr)
+		curr = &iovad->cached_node;
 
-	if (free->pfn_lo >= cached_iova->pfn_lo) {
-		struct rb_node *node = rb_next(&free->node);
-		struct iova *iova = rb_entry(node, struct iova, node);
+	cached_iova = rb_entry(*curr, struct iova, node);
 
-		/* only cache if it's below 32bit pfn */
-		if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
-			iovad->cached32_node = node;
-		else
-			iovad->cached32_node = NULL;
-	}
+	if (free->pfn_lo >= cached_iova->pfn_lo)
+		*curr = rb_next(&free->node);
 }
 
 /* Insert the iova into domain rbtree by holding writer lock */
@@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 {
 	struct rb_node *prev, *curr = NULL;
 	unsigned long flags;
-	unsigned long saved_pfn;
 	unsigned long align_mask = ~0UL;
 
 	if (size_aligned)
@@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* Walk the tree backwards */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-	saved_pfn = limit_pfn;
 	curr = __get_cached_rbnode(iovad, &limit_pfn);
 	prev = curr;
 	while (curr) {
@@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 
 	/* If we have 'prev', it's a valid place to start the insertion. */
 	iova_insert_rbtree(&iovad->rbroot, new, prev);
-	__cached_rbnode_insert_update(iovad, saved_pfn, new);
+	__cached_rbnode_insert_update(iovad, new);
 
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
 
diff --git a/include/linux/iova.h b/include/linux/iova.h
index e0a892ae45c0..0bb8df43b393 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -40,7 +40,8 @@ struct iova_rcache {
 struct iova_domain {
 	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
 	struct rb_root	rbroot;		/* iova domain rbtree root */
-	struct rb_node	*cached32_node; /* Save last alloced node */
+	struct rb_node	*cached_node;	/* Save last alloced node */
+	struct rb_node	*cached32_node; /* Save last 32-bit alloced node */
 	unsigned long	granule;	/* pfn granularity for this domain */
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 4/4] iommu/iova: Make dma_32bit_pfn implicit
  2017-07-21 11:41 [PATCH v2 0/4] Optimise 64-bit IOVA allocations Robin Murphy
                   ` (2 preceding siblings ...)
  2017-07-21 11:42 ` [PATCH v2 3/4] iommu/iova: Extend rbtree node caching Robin Murphy
@ 2017-07-21 11:42 ` Robin Murphy
  2017-07-26 11:08 ` [PATCH v2 0/4] Optimise 64-bit IOVA allocations Joerg Roedel
  4 siblings, 0 replies; 17+ messages in thread
From: Robin Murphy @ 2017-07-21 11:42 UTC (permalink / raw)
  To: joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui, Thierry Reding, Jonathan Hunter, David Airlie,
	Sudeep Dutt, Ashutosh Dixit

From: Zhen Lei <thunder.leizhen@huawei.com>

Now that the cached node optimisation can apply to all allocations, the
couple of users which were playing tricks with dma_32bit_pfn in order to
benefit from it can stop doing so. Conversely, there is also no need for
all the other users to explicitly calculate a 'real' 32-bit PFN, when
init_iova_domain() can happily do that itself from the page granularity.

CC: Thierry Reding <thierry.reding@gmail.com>
CC: Jonathan Hunter <jonathanh@nvidia.com>
CC: David Airlie <airlied@linux.ie>
CC: Sudeep Dutt <sudeep.dutt@intel.com>
CC: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: use iova_shift(), rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2: Avoid iova_pfn() overflow with 32-bit dma_addr_t

 drivers/gpu/drm/tegra/drm.c      |  3 +--
 drivers/gpu/host1x/dev.c         |  3 +--
 drivers/iommu/amd_iommu.c        |  7 ++-----
 drivers/iommu/dma-iommu.c        | 18 +-----------------
 drivers/iommu/intel-iommu.c      | 11 +++--------
 drivers/iommu/iova.c             |  4 ++--
 drivers/misc/mic/scif/scif_rma.c |  3 +--
 include/linux/iova.h             |  5 ++---
 8 files changed, 13 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 518f4b69ea53..81e9ae1ee90b 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -150,8 +150,7 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags)
 
 		order = __ffs(tegra->domain->pgsize_bitmap);
 		init_iova_domain(&tegra->carveout.domain, 1UL << order,
-				 carveout_start >> order,
-				 carveout_end >> order);
+				 carveout_start >> order);
 
 		tegra->carveout.shift = iova_shift(&tegra->carveout.domain);
 		tegra->carveout.limit = carveout_end >> tegra->carveout.shift;
diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c
index 2c58a390123a..57c8eed0ed71 100644
--- a/drivers/gpu/host1x/dev.c
+++ b/drivers/gpu/host1x/dev.c
@@ -193,8 +193,7 @@ static int host1x_probe(struct platform_device *pdev)
 
 		order = __ffs(host->domain->pgsize_bitmap);
 		init_iova_domain(&host->iova, 1UL << order,
-				 geometry->aperture_start >> order,
-				 geometry->aperture_end >> order);
+				 geometry->aperture_start >> order);
 		host->iova_end = geometry->aperture_end;
 	}
 
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 688e77576e5a..a12e3e12014a 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -63,7 +63,6 @@
 /* IO virtual address start page frame number */
 #define IOVA_START_PFN		(1)
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
 
 /* Reserved IOVA ranges */
 #define MSI_RANGE_START		(0xfee00000)
@@ -2010,8 +2009,7 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
 	if (!dma_dom->domain.pt_root)
 		goto free_dma_dom;
 
-	init_iova_domain(&dma_dom->iovad, PAGE_SIZE,
-			 IOVA_START_PFN, DMA_32BIT_PFN);
+	init_iova_domain(&dma_dom->iovad, PAGE_SIZE, IOVA_START_PFN);
 
 	/* Initialize reserved ranges */
 	copy_reserved_iova(&reserved_iova_ranges, &dma_dom->iovad);
@@ -2912,8 +2910,7 @@ static int init_reserved_iova_ranges(void)
 	struct pci_dev *pdev = NULL;
 	struct iova *val;
 
-	init_iova_domain(&reserved_iova_ranges, PAGE_SIZE,
-			 IOVA_START_PFN, DMA_32BIT_PFN);
+	init_iova_domain(&reserved_iova_ranges, PAGE_SIZE, IOVA_START_PFN);
 
 	lockdep_set_class(&reserved_iova_ranges.iova_rbtree_lock,
 			  &reserved_rbtree_key);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9d1cebe7f6cb..191be9c80a8a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -292,18 +292,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 		/* ...then finally give it a kicking to make sure it fits */
 		base_pfn = max_t(unsigned long, base_pfn,
 				domain->geometry.aperture_start >> order);
-		end_pfn = min_t(unsigned long, end_pfn,
-				domain->geometry.aperture_end >> order);
 	}
-	/*
-	 * PCI devices may have larger DMA masks, but still prefer allocating
-	 * within a 32-bit mask to avoid DAC addressing. Such limitations don't
-	 * apply to the typical platform device, so for those we may as well
-	 * leave the cache limit at the top of their range to save an rb_last()
-	 * traversal on every allocation.
-	 */
-	if (dev && dev_is_pci(dev))
-		end_pfn &= DMA_BIT_MASK(32) >> order;
 
 	/* start_pfn is always nonzero for an already-initialised domain */
 	if (iovad->start_pfn) {
@@ -312,16 +301,11 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 			pr_warn("Incompatible range for DMA domain\n");
 			return -EFAULT;
 		}
-		/*
-		 * If we have devices with different DMA masks, move the free
-		 * area cache limit down for the benefit of the smaller one.
-		 */
-		iovad->dma_32bit_pfn = min(end_pfn + 1, iovad->dma_32bit_pfn);
 
 		return 0;
 	}
 
-	init_iova_domain(iovad, 1UL << order, base_pfn, end_pfn);
+	init_iova_domain(iovad, 1UL << order, base_pfn);
 	if (!dev)
 		return 0;
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 687f18f65cea..afa3b4e765e7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -82,8 +82,6 @@
 #define IOVA_START_PFN		(1)
 
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
-#define DMA_64BIT_PFN		IOVA_PFN(DMA_BIT_MASK(64))
 
 /* page table handling */
 #define LEVEL_STRIDE		(9)
@@ -1874,8 +1872,7 @@ static int dmar_init_reserved_ranges(void)
 	struct iova *iova;
 	int i;
 
-	init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN);
 
 	lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
 		&reserved_rbtree_key);
@@ -1933,8 +1930,7 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
 	int adjust_width, agaw;
 	unsigned long sagaw;
 
-	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
@@ -4989,8 +4985,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
 {
 	int adjust_width;
 
-	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
-			DMA_32BIT_PFN);
+	init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
 	domain_reserve_special_ranges(domain);
 
 	/* calculate AGAW */
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index f5809a2ee6c2..f88acadeebfe 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -35,7 +35,7 @@ static void free_iova_rcaches(struct iova_domain *iovad);
 
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn, unsigned long pfn_32bit)
+	unsigned long start_pfn)
 {
 	/*
 	 * IOVA granularity will normally be equal to the smallest
@@ -50,7 +50,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->cached32_node = NULL;
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
-	iovad->dma_32bit_pfn = pfn_32bit + 1;
+	iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
 	init_iova_rcaches(iovad);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 329727e00e97..c824329f7012 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -39,8 +39,7 @@ void scif_rma_ep_init(struct scif_endpt *ep)
 	struct scif_endpt_rma_info *rma = &ep->rma_info;
 
 	mutex_init(&rma->rma_lock);
-	init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN,
-			 SCIF_DMA_64BIT_PFN);
+	init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN);
 	spin_lock_init(&rma->tc_lock);
 	mutex_init(&rma->mmn_lock);
 	INIT_LIST_HEAD(&rma->reg_list);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 0bb8df43b393..58c2a365c45f 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -102,7 +102,7 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
 	unsigned long pfn_hi);
 void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
 void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
-	unsigned long start_pfn, unsigned long pfn_32bit);
+	unsigned long start_pfn);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 struct iova *split_and_remove_iova(struct iova_domain *iovad,
@@ -170,8 +170,7 @@ static inline void copy_reserved_iova(struct iova_domain *from,
 
 static inline void init_iova_domain(struct iova_domain *iovad,
 				    unsigned long granule,
-				    unsigned long start_pfn,
-				    unsigned long pfn_32bit)
+				    unsigned long start_pfn)
 {
 }
 
-- 
2.12.2.dirty

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations
  2017-07-21 11:41 [PATCH v2 0/4] Optimise 64-bit IOVA allocations Robin Murphy
                   ` (3 preceding siblings ...)
  2017-07-21 11:42 ` [PATCH v2 4/4] iommu/iova: Make dma_32bit_pfn implicit Robin Murphy
@ 2017-07-26 11:08 ` Joerg Roedel
  2017-07-26 11:17   ` Leizhen (ThunderTown)
  4 siblings, 1 reply; 17+ messages in thread
From: Joerg Roedel @ 2017-07-26 11:08 UTC (permalink / raw)
  To: Robin Murphy
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, nwatters,
	ray.jui

Hi Robin.

On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
> Hi all,
> 
> In the wake of the ARM SMMU optimisation efforts, it seems that certain
> workloads (e.g. storage I/O with large scatterlists) probably remain quite
> heavily influenced by IOVA allocation performance. Separately, Ard also
> reported massive performance drops for a graphical desktop on AMD Seattle
> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
> ops domain getting initialised differently for ACPI vs. DT, and exposing
> the overhead of the rbtree slow path. Whilst we could go around trying to
> close up all the little gaps that lead to hitting the slowest case, it
> seems a much better idea to simply make said slowest case a lot less slow.

Do you have some numbers here? How big was the impact before these
patches and how is it with the patches?


	Joerg

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations
  2017-07-26 11:08 ` [PATCH v2 0/4] Optimise 64-bit IOVA allocations Joerg Roedel
@ 2017-07-26 11:17   ` Leizhen (ThunderTown)
  2017-08-08 12:03     ` Ganapatrao Kulkarni
  0 siblings, 1 reply; 17+ messages in thread
From: Leizhen (ThunderTown) @ 2017-07-26 11:17 UTC (permalink / raw)
  To: Joerg Roedel, Robin Murphy
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, lorenzo.pieralisi,
	ard.biesheuvel, Jonathan.Cameron, nwatters, ray.jui



On 2017/7/26 19:08, Joerg Roedel wrote:
> Hi Robin.
> 
> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>> Hi all,
>>
>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>> heavily influenced by IOVA allocation performance. Separately, Ard also
>> reported massive performance drops for a graphical desktop on AMD Seattle
>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>> the overhead of the rbtree slow path. Whilst we could go around trying to
>> close up all the little gaps that lead to hitting the slowest case, it
>> seems a much better idea to simply make said slowest case a lot less slow.
> 
> Do you have some numbers here? How big was the impact before these
> patches and how is it with the patches?
Here are some numbers:

(before)$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
[  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
[  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
[  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec

(after)$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
[  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
[  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
[  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec

> 
> 
> 	Joerg
> 
> 
> .
> 

-- 
Thanks!
Best Regards

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
  2017-07-21 11:42 ` [PATCH v2 3/4] iommu/iova: Extend rbtree node caching Robin Murphy
@ 2017-07-29  3:57   ` Nate Watterson
  2017-07-31 11:42     ` Robin Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Nate Watterson @ 2017-07-29  3:57 UTC (permalink / raw)
  To: Robin Murphy, joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, ray.jui

Hi Robin,
I am seeing a crash when performing very basic testing on this series
with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
patch is the culprit, but this rcache business is still mostly
witchcraft to me.

# ifconfig eth5 up
# ifconfig eth5 down
     Unable to handle kernel NULL pointer dereference at virtual address 00000020
     user pgtable: 64k pages, 48-bit VAs, pgd = ffff8007dbf47c00
     [0000000000000020] *pgd=00000006efab0003, *pud=00000006efab0003, *pmd=00000007d8720003, *pte=0000000000000000
     Internal error: Oops: 96000007 [#1] SMP
     Modules linked in:
     CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
     task: ffff8007da1e5780 task.stack: ffff8007ddcb8000
     PC is at __cached_rbnode_delete_update+0x2c/0x58
     LR is at private_free_iova+0x2c/0x60
     pc : [<ffff00000852c6e4>] lr : [<ffff00000852cdac>] pstate: 204001c5
     sp : ffff8007ddcbba00
     x29: ffff8007ddcbba00 x28: ffff8007c8350210
     x27: ffff8007d1a80000 x26: ffff8007dcc20800
     x25: 0000000000000140 x24: ffff8007c98f0008
     x23: 00000000fffffe4e x22: 0000000000000140
     x21: ffff8007c98f0008 x20: ffff8007c9adb240
     x19: ffff8007c98f0018 x18: 0000000000000010
     x17: 0000000000000000 x16: 0000000000000000
     x15: 4000000000000000 x14: 0000000000000000
     x13: 00000000ffffffff x12: 0000000000000001
     x11: dead000000000200 x10: 00000000ffffffff
     x9 : 0000000000000000 x8 : ffff8007c9adb1c0
     x7 : 0000000040002000 x6 : 0000000000210d00
     x5 : 0000000000000000 x4 : 000000000000c57e
     x3 : 00000000ffffffcf x2 : 00000000ffffffcf
     x1 : ffff8007c9adb240 x0 : 0000000000000000
     [...]
     [<ffff00000852c6e4>] __cached_rbnode_delete_update+0x2c/0x58
     [<ffff00000852cdac>] private_free_iova+0x2c/0x60
     [<ffff00000852cea4>] iova_magazine_free_pfns+0x4c/0xa0
     [<ffff00000852d0a8>] free_iova_fast+0x1b0/0x230
     [<ffff000008529d34>] iommu_dma_free_iova+0x5c/0x80
     [<ffff000008529db4>] __iommu_dma_unmap+0x5c/0x98
     [<ffff00000852aebc>] iommu_dma_unmap_resource+0x24/0x30
     [<ffff00000852aed4>] iommu_dma_unmap_page+0xc/0x18
     [<ffff000008093030>] __iommu_unmap_page+0x40/0x60
     [<ffff0000087401bc>] mlx5e_page_release+0xbc/0x128
     [<ffff000008740498>] mlx5e_dealloc_rx_wqe+0x30/0x40
     [<ffff000008735ef0>] mlx5e_close_channel+0x70/0x1f8
     [<ffff0000087370f4>] mlx5e_close_channels+0x2c/0x50
     [<ffff0000087385dc>] mlx5e_close_locked+0x54/0x68
     [<ffff000008738620>] mlx5e_close+0x30/0x58
     [...]

** Disassembly for __cached_rbnode_delete_update() near the fault **
               92|if (free->pfn_hi < iovad->dma_32bit_pfn)
FFFF00000852C6C4|            ldr     x3,[x1,#0x18]    ; x3,[free,#24]
FFFF00000852C6C8|            ldr     x2,[x0,#0x30]    ; x2,[iovad,#48]
FFFF00000852C6CC|            cmp     x3,x2
FFFF00000852C6D0|            b.cs    0xFFFF00000852C708
                 |        curr = &iovad->cached32_node;
               94|if (!curr)
FFFF00000852C6D4|            adds    x19,x0,#0x18     ; x19,iovad,#24
FFFF00000852C6D8|            b.eq    0xFFFF00000852C708
                 |
                 |cached_iova = rb_entry(*curr, struct iova, node);
                 |
               99|if (free->pfn_lo >= cached_iova->pfn_lo)
FFFF00000852C6DC|            ldr     x0,[x19]         ; xiovad,[curr]
FFFF00000852C6E0|            ldr     x2,[x1,#0x20]    ; x2,[free,#32]
FFFF00000852C6E4|            ldr     x0,[x0,#0x20]    ; x0,[x0,#32]
Apparently cached_iova was NULL so the pfn_lo access faulted.

FFFF00000852C6E8|            cmp     x2,x0
FFFF00000852C6EC|            b.cc    0xFFFF00000852C6FC
FFFF00000852C6F0|            mov     x0,x1            ; x0,free
              100|        *curr = rb_next(&free->node);
After instrumenting the code a bit, this seems to be the culprit. In the
previous call, free->pfn_lo was 0xFFFF_FFFF, which is actually the
dma_limit for the domain, so rb_next() returns NULL.

Let me know if you have any questions or would like additional tests
run. I also applied your "DMA domain debug info" patches and dumped the
contents of the domain at each of the steps above in case that would be
useful. If nothing else, they reinforce how thirsty the CX4 NIC is,
especially when using 64k pages and many CPUs.

-Nate

On 7/21/2017 7:42 AM, Robin Murphy wrote:
> The cached node mechanism provides a significant performance benefit for
> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
> or where the 32-bit space is full, the loss of this benefit can be
> significant - on large systems there can be many thousands of entries in
> the tree, such that traversing to the end then walking all the way down
> to find free space every time becomes increasingly awful.
> 
> Maintain a similar cached node for the whole IOVA space as a superset of
> the 32-bit space so that performance can remain much more consistent.
> 
> Inspired by work by Zhen Lei <thunder.leizhen@huawei.com>.
> 
> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
> 
> v2: No change
> 
>   drivers/iommu/iova.c | 59 +++++++++++++++++++++++++---------------------------
>   include/linux/iova.h |  3 ++-
>   2 files changed, 30 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index d094d1ca8f23..f5809a2ee6c2 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>   
>   	spin_lock_init(&iovad->iova_rbtree_lock);
>   	iovad->rbroot = RB_ROOT;
> +	iovad->cached_node = NULL;
>   	iovad->cached32_node = NULL;
>   	iovad->granule = granule;
>   	iovad->start_pfn = start_pfn;
> @@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
>   static struct rb_node *
>   __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
>   {
> -	if ((*limit_pfn > iovad->dma_32bit_pfn) ||
> -		(iovad->cached32_node == NULL))
> +	struct rb_node *cached_node = NULL;
> +	struct iova *curr_iova;
> +
> +	if (*limit_pfn <= iovad->dma_32bit_pfn)
> +		cached_node = iovad->cached32_node;
> +	if (!cached_node)
> +		cached_node = iovad->cached_node;
> +	if (!cached_node)
>   		return rb_last(&iovad->rbroot);
> -	else {
> -		struct rb_node *prev_node = rb_prev(iovad->cached32_node);
> -		struct iova *curr_iova =
> -			rb_entry(iovad->cached32_node, struct iova, node);
> -		*limit_pfn = curr_iova->pfn_lo;
> -		return prev_node;
> -	}
> +
> +	curr_iova = rb_entry(cached_node, struct iova, node);
> +	*limit_pfn = curr_iova->pfn_lo;
> +
> +	return rb_prev(cached_node);
>   }
>   
>   static void
> -__cached_rbnode_insert_update(struct iova_domain *iovad,
> -	unsigned long limit_pfn, struct iova *new)
> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
>   {
> -	if (limit_pfn != iovad->dma_32bit_pfn)
> -		return;
> -	iovad->cached32_node = &new->node;
> +	if (new->pfn_lo > iovad->dma_32bit_pfn)
> +		iovad->cached_node = &new->node;
> +	else
> +		iovad->cached32_node = &new->node;
>   }
>   
>   static void
>   __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>   {
>   	struct iova *cached_iova;
> -	struct rb_node *curr;
> +	struct rb_node **curr = NULL;
>   
> -	if (!iovad->cached32_node)
> -		return;
> -	curr = iovad->cached32_node;
> -	cached_iova = rb_entry(curr, struct iova, node);
> +	if (free->pfn_hi < iovad->dma_32bit_pfn)
> +		curr = &iovad->cached32_node;
> +	if (!curr)
> +		curr = &iovad->cached_node;
>   
> -	if (free->pfn_lo >= cached_iova->pfn_lo) {
> -		struct rb_node *node = rb_next(&free->node);
> -		struct iova *iova = rb_entry(node, struct iova, node);
> +	cached_iova = rb_entry(*curr, struct iova, node);
>   
> -		/* only cache if it's below 32bit pfn */
> -		if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
> -			iovad->cached32_node = node;
> -		else
> -			iovad->cached32_node = NULL;
> -	}
> +	if (free->pfn_lo >= cached_iova->pfn_lo)
> +		*curr = rb_next(&free->node);
>   }
>   
>   /* Insert the iova into domain rbtree by holding writer lock */
> @@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>   {
>   	struct rb_node *prev, *curr = NULL;
>   	unsigned long flags;
> -	unsigned long saved_pfn;
>   	unsigned long align_mask = ~0UL;
>   
>   	if (size_aligned)
> @@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>   
>   	/* Walk the tree backwards */
>   	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> -	saved_pfn = limit_pfn;
>   	curr = __get_cached_rbnode(iovad, &limit_pfn);
>   	prev = curr;
>   	while (curr) {
> @@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>   
>   	/* If we have 'prev', it's a valid place to start the insertion. */
>   	iova_insert_rbtree(&iovad->rbroot, new, prev);
> -	__cached_rbnode_insert_update(iovad, saved_pfn, new);
> +	__cached_rbnode_insert_update(iovad, new);
>   
>   	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>   
> diff --git a/include/linux/iova.h b/include/linux/iova.h
> index e0a892ae45c0..0bb8df43b393 100644
> --- a/include/linux/iova.h
> +++ b/include/linux/iova.h
> @@ -40,7 +40,8 @@ struct iova_rcache {
>   struct iova_domain {
>   	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
>   	struct rb_root	rbroot;		/* iova domain rbtree root */
> -	struct rb_node	*cached32_node; /* Save last alloced node */
> +	struct rb_node	*cached_node;	/* Save last alloced node */
> +	struct rb_node	*cached32_node; /* Save last 32-bit alloced node */
>   	unsigned long	granule;	/* pfn granularity for this domain */
>   	unsigned long	start_pfn;	/* Lower limit for this domain */
>   	unsigned long	dma_32bit_pfn;
> 

-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
  2017-07-29  3:57   ` Nate Watterson
@ 2017-07-31 11:42     ` Robin Murphy
  2017-08-03 19:41       ` Nate Watterson
  2017-09-19  9:42       ` Leizhen (ThunderTown)
  0 siblings, 2 replies; 17+ messages in thread
From: Robin Murphy @ 2017-07-31 11:42 UTC (permalink / raw)
  To: Nate Watterson, joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, ray.jui

Hi Nate,

On 29/07/17 04:57, Nate Watterson wrote:
> Hi Robin,
> I am seeing a crash when performing very basic testing on this series
> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
> patch is the culprit, but this rcache business is still mostly
> witchcraft to me.
> 
> # ifconfig eth5 up
> # ifconfig eth5 down
>     Unable to handle kernel NULL pointer dereference at virtual address
> 00000020
>     user pgtable: 64k pages, 48-bit VAs, pgd = ffff8007dbf47c00
>     [0000000000000020] *pgd=00000006efab0003, *pud=00000006efab0003,
> *pmd=00000007d8720003, *pte=0000000000000000
>     Internal error: Oops: 96000007 [#1] SMP
>     Modules linked in:
>     CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>     task: ffff8007da1e5780 task.stack: ffff8007ddcb8000
>     PC is at __cached_rbnode_delete_update+0x2c/0x58
>     LR is at private_free_iova+0x2c/0x60
>     pc : [<ffff00000852c6e4>] lr : [<ffff00000852cdac>] pstate: 204001c5
>     sp : ffff8007ddcbba00
>     x29: ffff8007ddcbba00 x28: ffff8007c8350210
>     x27: ffff8007d1a80000 x26: ffff8007dcc20800
>     x25: 0000000000000140 x24: ffff8007c98f0008
>     x23: 00000000fffffe4e x22: 0000000000000140
>     x21: ffff8007c98f0008 x20: ffff8007c9adb240
>     x19: ffff8007c98f0018 x18: 0000000000000010
>     x17: 0000000000000000 x16: 0000000000000000
>     x15: 4000000000000000 x14: 0000000000000000
>     x13: 00000000ffffffff x12: 0000000000000001
>     x11: dead000000000200 x10: 00000000ffffffff
>     x9 : 0000000000000000 x8 : ffff8007c9adb1c0
>     x7 : 0000000040002000 x6 : 0000000000210d00
>     x5 : 0000000000000000 x4 : 000000000000c57e
>     x3 : 00000000ffffffcf x2 : 00000000ffffffcf
>     x1 : ffff8007c9adb240 x0 : 0000000000000000
>     [...]
>     [<ffff00000852c6e4>] __cached_rbnode_delete_update+0x2c/0x58
>     [<ffff00000852cdac>] private_free_iova+0x2c/0x60
>     [<ffff00000852cea4>] iova_magazine_free_pfns+0x4c/0xa0
>     [<ffff00000852d0a8>] free_iova_fast+0x1b0/0x230
>     [<ffff000008529d34>] iommu_dma_free_iova+0x5c/0x80
>     [<ffff000008529db4>] __iommu_dma_unmap+0x5c/0x98
>     [<ffff00000852aebc>] iommu_dma_unmap_resource+0x24/0x30
>     [<ffff00000852aed4>] iommu_dma_unmap_page+0xc/0x18
>     [<ffff000008093030>] __iommu_unmap_page+0x40/0x60
>     [<ffff0000087401bc>] mlx5e_page_release+0xbc/0x128
>     [<ffff000008740498>] mlx5e_dealloc_rx_wqe+0x30/0x40
>     [<ffff000008735ef0>] mlx5e_close_channel+0x70/0x1f8
>     [<ffff0000087370f4>] mlx5e_close_channels+0x2c/0x50
>     [<ffff0000087385dc>] mlx5e_close_locked+0x54/0x68
>     [<ffff000008738620>] mlx5e_close+0x30/0x58
>     [...]
> 
> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>               92|if (free->pfn_hi < iovad->dma_32bit_pfn)
> FFFF00000852C6C4|            ldr     x3,[x1,#0x18]    ; x3,[free,#24]
> FFFF00000852C6C8|            ldr     x2,[x0,#0x30]    ; x2,[iovad,#48]
> FFFF00000852C6CC|            cmp     x3,x2
> FFFF00000852C6D0|            b.cs    0xFFFF00000852C708
>                 |        curr = &iovad->cached32_node;
>               94|if (!curr)
> FFFF00000852C6D4|            adds    x19,x0,#0x18     ; x19,iovad,#24
> FFFF00000852C6D8|            b.eq    0xFFFF00000852C708
>                 |
>                 |cached_iova = rb_entry(*curr, struct iova, node);
>                 |
>               99|if (free->pfn_lo >= cached_iova->pfn_lo)
> FFFF00000852C6DC|            ldr     x0,[x19]         ; xiovad,[curr]
> FFFF00000852C6E0|            ldr     x2,[x1,#0x20]    ; x2,[free,#32]
> FFFF00000852C6E4|            ldr     x0,[x0,#0x20]    ; x0,[x0,#32]
> Apparently cached_iova was NULL so the pfn_lo access faulted.
> 
> FFFF00000852C6E8|            cmp     x2,x0
> FFFF00000852C6EC|            b.cc    0xFFFF00000852C6FC
> FFFF00000852C6F0|            mov     x0,x1            ; x0,free
>              100|        *curr = rb_next(&free->node);
> After instrumenting the code a bit, this seems to be the culprit. In the
> previous call, free->pfn_lo was 0xFFFF_FFFF which is actually the
> dma_limit for the domain so rb_next() returns NULL.
> 
> Let me know if you have any questions or would like additional tests
> run. I also applied your "DMA domain debug info" patches and dumped the
> contents of the domain at each of the steps above in case that would be
> useful. If nothing else, they reinforce how thirsty the CX4 NIC is
> especially when using 64k pages and many CPUs.

Thanks for the report - I somehow managed to reason myself out of
keeping the "no cached node" check in __cached_rbnode_delete_update() on
the assumption that it must always be set by a previous allocation.
However, there is indeed just one case for which that fails: when
you free any IOVA immediately after freeing the very topmost one. Which
is something that freeing an entire magazine's worth of IOVAs back to
the tree all at once has a very real chance of doing...

The obvious, straightforward fix is inline below, but I'm now starting to
understand the appeal of reserving a sentinel node to ensure the tree
can never be empty, so I might have a quick go at that to see if it
results in nicer code overall.

Robin.

>
> -Nate
> 
> On 7/21/2017 7:42 AM, Robin Murphy wrote:
>> The cached node mechanism provides a significant performance benefit for
>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>> or where the 32-bit space is full, the loss of this benefit can be
>> significant - on large systems there can be many thousands of entries in
>> the tree, such that traversing to the end then walking all the way down
>> to find free space every time becomes increasingly awful.
>>
>> Maintain a similar cached node for the whole IOVA space as a superset of
>> the 32-bit space so that performance can remain much more consistent.
>>
>> Inspired by work by Zhen Lei <thunder.leizhen@huawei.com>.
>>
>> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>>
>> v2: No change
>>
>>   drivers/iommu/iova.c | 59
>> +++++++++++++++++++++++++---------------------------
>>   include/linux/iova.h |  3 ++-
>>   2 files changed, 30 insertions(+), 32 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index d094d1ca8f23..f5809a2ee6c2 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>> long granule,
>>         spin_lock_init(&iovad->iova_rbtree_lock);
>>       iovad->rbroot = RB_ROOT;
>> +    iovad->cached_node = NULL;
>>       iovad->cached32_node = NULL;
>>       iovad->granule = granule;
>>       iovad->start_pfn = start_pfn;
>> @@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
>>   static struct rb_node *
>>   __get_cached_rbnode(struct iova_domain *iovad, unsigned long
>> *limit_pfn)
>>   {
>> -    if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>> -        (iovad->cached32_node == NULL))
>> +    struct rb_node *cached_node = NULL;
>> +    struct iova *curr_iova;
>> +
>> +    if (*limit_pfn <= iovad->dma_32bit_pfn)
>> +        cached_node = iovad->cached32_node;
>> +    if (!cached_node)
>> +        cached_node = iovad->cached_node;
>> +    if (!cached_node)
>>           return rb_last(&iovad->rbroot);
>> -    else {
>> -        struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>> -        struct iova *curr_iova =
>> -            rb_entry(iovad->cached32_node, struct iova, node);
>> -        *limit_pfn = curr_iova->pfn_lo;
>> -        return prev_node;
>> -    }
>> +
>> +    curr_iova = rb_entry(cached_node, struct iova, node);
>> +    *limit_pfn = curr_iova->pfn_lo;
>> +
>> +    return rb_prev(cached_node);
>>   }
>>     static void
>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>> -    unsigned long limit_pfn, struct iova *new)
>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova
>> *new)
>>   {
>> -    if (limit_pfn != iovad->dma_32bit_pfn)
>> -        return;
>> -    iovad->cached32_node = &new->node;
>> +    if (new->pfn_lo > iovad->dma_32bit_pfn)
>> +        iovad->cached_node = &new->node;
>> +    else
>> +        iovad->cached32_node = &new->node;
>>   }
>>     static void
>>   __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova
>> *free)
>>   {
>>       struct iova *cached_iova;
>> -    struct rb_node *curr;
>> +    struct rb_node **curr = NULL;
>>   -    if (!iovad->cached32_node)
>> -        return;
>> -    curr = iovad->cached32_node;
>> -    cached_iova = rb_entry(curr, struct iova, node);
>> +    if (free->pfn_hi < iovad->dma_32bit_pfn)
>> +        curr = &iovad->cached32_node;
>> +    if (!curr)
>> +        curr = &iovad->cached_node;

+	if (!*curr)
+		return;

>>   -    if (free->pfn_lo >= cached_iova->pfn_lo) {
>> -        struct rb_node *node = rb_next(&free->node);
>> -        struct iova *iova = rb_entry(node, struct iova, node);
>> +    cached_iova = rb_entry(*curr, struct iova, node);
>>   -        /* only cache if it's below 32bit pfn */
>> -        if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>> -            iovad->cached32_node = node;
>> -        else
>> -            iovad->cached32_node = NULL;
>> -    }
>> +    if (free->pfn_lo >= cached_iova->pfn_lo)
>> +        *curr = rb_next(&free->node);
>>   }
>>     /* Insert the iova into domain rbtree by holding writer lock */
>> @@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct
>> iova_domain *iovad,
>>   {
>>       struct rb_node *prev, *curr = NULL;
>>       unsigned long flags;
>> -    unsigned long saved_pfn;
>>       unsigned long align_mask = ~0UL;
>>         if (size_aligned)
>> @@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct
>> iova_domain *iovad,
>>         /* Walk the tree backwards */
>>       spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>> -    saved_pfn = limit_pfn;
>>       curr = __get_cached_rbnode(iovad, &limit_pfn);
>>       prev = curr;
>>       while (curr) {
>> @@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct
>> iova_domain *iovad,
>>         /* If we have 'prev', it's a valid place to start the
>> insertion. */
>>       iova_insert_rbtree(&iovad->rbroot, new, prev);
>> -    __cached_rbnode_insert_update(iovad, saved_pfn, new);
>> +    __cached_rbnode_insert_update(iovad, new);
>>         spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>   diff --git a/include/linux/iova.h b/include/linux/iova.h
>> index e0a892ae45c0..0bb8df43b393 100644
>> --- a/include/linux/iova.h
>> +++ b/include/linux/iova.h
>> @@ -40,7 +40,8 @@ struct iova_rcache {
>>   struct iova_domain {
>>       spinlock_t    iova_rbtree_lock; /* Lock to protect update of
>> rbtree */
>>       struct rb_root    rbroot;        /* iova domain rbtree root */
>> -    struct rb_node    *cached32_node; /* Save last alloced node */
>> +    struct rb_node    *cached_node;    /* Save last alloced node */
>> +    struct rb_node    *cached32_node; /* Save last 32-bit alloced
>> node */
>>       unsigned long    granule;    /* pfn granularity for this domain */
>>       unsigned long    start_pfn;    /* Lower limit for this domain */
>>       unsigned long    dma_32bit_pfn;
>>
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
  2017-07-31 11:42     ` Robin Murphy
@ 2017-08-03 19:41       ` Nate Watterson
  2017-08-31  9:46         ` Leizhen (ThunderTown)
  2017-09-19  9:42       ` Leizhen (ThunderTown)
  1 sibling, 1 reply; 17+ messages in thread
From: Nate Watterson @ 2017-08-03 19:41 UTC (permalink / raw)
  To: Robin Murphy, joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, thunder.leizhen,
	lorenzo.pieralisi, ard.biesheuvel, Jonathan.Cameron, ray.jui

Hi Robin,

On 7/31/2017 7:42 AM, Robin Murphy wrote:
> Hi Nate,
> 
> On 29/07/17 04:57, Nate Watterson wrote:
>> Hi Robin,
>> I am seeing a crash when performing very basic testing on this series
>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>> patch is the culprit, but this rcache business is still mostly
>> witchcraft to me.
>>
>> # ifconfig eth5 up
>> # ifconfig eth5 down
>>      Unable to handle kernel NULL pointer dereference at virtual address
>> 00000020
>>      user pgtable: 64k pages, 48-bit VAs, pgd = ffff8007dbf47c00
>>      [0000000000000020] *pgd=00000006efab0003, *pud=00000006efab0003,
>> *pmd=00000007d8720003, *pte=0000000000000000
>>      Internal error: Oops: 96000007 [#1] SMP
>>      Modules linked in:
>>      CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>>      task: ffff8007da1e5780 task.stack: ffff8007ddcb8000
>>      PC is at __cached_rbnode_delete_update+0x2c/0x58
>>      LR is at private_free_iova+0x2c/0x60
>>      pc : [<ffff00000852c6e4>] lr : [<ffff00000852cdac>] pstate: 204001c5
>>      sp : ffff8007ddcbba00
>>      x29: ffff8007ddcbba00 x28: ffff8007c8350210
>>      x27: ffff8007d1a80000 x26: ffff8007dcc20800
>>      x25: 0000000000000140 x24: ffff8007c98f0008
>>      x23: 00000000fffffe4e x22: 0000000000000140
>>      x21: ffff8007c98f0008 x20: ffff8007c9adb240
>>      x19: ffff8007c98f0018 x18: 0000000000000010
>>      x17: 0000000000000000 x16: 0000000000000000
>>      x15: 4000000000000000 x14: 0000000000000000
>>      x13: 00000000ffffffff x12: 0000000000000001
>>      x11: dead000000000200 x10: 00000000ffffffff
>>      x9 : 0000000000000000 x8 : ffff8007c9adb1c0
>>      x7 : 0000000040002000 x6 : 0000000000210d00
>>      x5 : 0000000000000000 x4 : 000000000000c57e
>>      x3 : 00000000ffffffcf x2 : 00000000ffffffcf
>>      x1 : ffff8007c9adb240 x0 : 0000000000000000
>>      [...]
>>      [<ffff00000852c6e4>] __cached_rbnode_delete_update+0x2c/0x58
>>      [<ffff00000852cdac>] private_free_iova+0x2c/0x60
>>      [<ffff00000852cea4>] iova_magazine_free_pfns+0x4c/0xa0
>>      [<ffff00000852d0a8>] free_iova_fast+0x1b0/0x230
>>      [<ffff000008529d34>] iommu_dma_free_iova+0x5c/0x80
>>      [<ffff000008529db4>] __iommu_dma_unmap+0x5c/0x98
>>      [<ffff00000852aebc>] iommu_dma_unmap_resource+0x24/0x30
>>      [<ffff00000852aed4>] iommu_dma_unmap_page+0xc/0x18
>>      [<ffff000008093030>] __iommu_unmap_page+0x40/0x60
>>      [<ffff0000087401bc>] mlx5e_page_release+0xbc/0x128
>>      [<ffff000008740498>] mlx5e_dealloc_rx_wqe+0x30/0x40
>>      [<ffff000008735ef0>] mlx5e_close_channel+0x70/0x1f8
>>      [<ffff0000087370f4>] mlx5e_close_channels+0x2c/0x50
>>      [<ffff0000087385dc>] mlx5e_close_locked+0x54/0x68
>>      [<ffff000008738620>] mlx5e_close+0x30/0x58
>>      [...]
>>
>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>>                92|if (free->pfn_hi < iovad->dma_32bit_pfn)
>> FFFF00000852C6C4|            ldr     x3,[x1,#0x18]    ; x3,[free,#24]
>> FFFF00000852C6C8|            ldr     x2,[x0,#0x30]    ; x2,[iovad,#48]
>> FFFF00000852C6CC|            cmp     x3,x2
>> FFFF00000852C6D0|            b.cs    0xFFFF00000852C708
>>                  |        curr = &iovad->cached32_node;
>>                94|if (!curr)
>> FFFF00000852C6D4|            adds    x19,x0,#0x18     ; x19,iovad,#24
>> FFFF00000852C6D8|            b.eq    0xFFFF00000852C708
>>                  |
>>                  |cached_iova = rb_entry(*curr, struct iova, node);
>>                  |
>>                99|if (free->pfn_lo >= cached_iova->pfn_lo)
>> FFFF00000852C6DC|            ldr     x0,[x19]         ; xiovad,[curr]
>> FFFF00000852C6E0|            ldr     x2,[x1,#0x20]    ; x2,[free,#32]
>> FFFF00000852C6E4|            ldr     x0,[x0,#0x20]    ; x0,[x0,#32]
>> Apparently cached_iova was NULL so the pfn_lo access faulted.
>>
>> FFFF00000852C6E8|            cmp     x2,x0
>> FFFF00000852C6EC|            b.cc    0xFFFF00000852C6FC
>> FFFF00000852C6F0|            mov     x0,x1            ; x0,free
>>               100|        *curr = rb_next(&free->node);
>> After instrumenting the code a bit, this seems to be the culprit. In the
>> previous call, free->pfn_lo was 0xFFFF_FFFF which is actually the
>> dma_limit for the domain so rb_next() returns NULL.
>>
>> Let me know if you have any questions or would like additional tests
>> run. I also applied your "DMA domain debug info" patches and dumped the
>> contents of the domain at each of the steps above in case that would be
>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is
>> especially when using 64k pages and many CPUs.
> 
> Thanks for the report - I somehow managed to reason myself out of
> keeping the "no cached node" check in __cached_rbnode_delete_update() on
> the assumption that it must always be set by a previous allocation.
> However, there is indeed just one case for which that fails: when
> you free any IOVA immediately after freeing the very topmost one. Which
> is something that freeing an entire magazine's worth of IOVAs back to
> the tree all at once has a very real chance of doing...
> 
> The obvious straightforward fix is inline below, but I'm now starting to
> understand the appeal of reserving a sentinel node to ensure the tree
> can never be empty, so I might have a quick go at that to see if it
> results in nicer code overall.

After applying your fix, the crash no longer occurs, but the performance
of this use case is only marginally less awful than it was before. I'll
start a new thread to discuss the causes and potential solutions.

Nate
> 
> Robin.
> 
>>
>> -Nate
>>
>> On 7/21/2017 7:42 AM, Robin Murphy wrote:
>>> The cached node mechanism provides a significant performance benefit for
>>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>>> or where the 32-bit space is full, the loss of this benefit can be
>>> significant - on large systems there can be many thousands of entries in
>>> the tree, such that traversing to the end then walking all the way down
>>> to find free space every time becomes increasingly awful.
>>>
>>> Maintain a similar cached node for the whole IOVA space as a superset of
>>> the 32-bit space so that performance can remain much more consistent.
>>>
>>> Inspired by work by Zhen Lei <thunder.leizhen@huawei.com>.
>>>
>>> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
>>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>>> ---
>>>
>>> v2: No change
>>>
>>>    drivers/iommu/iova.c | 59
>>> +++++++++++++++++++++++++---------------------------
>>>    include/linux/iova.h |  3 ++-
>>>    2 files changed, 30 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index d094d1ca8f23..f5809a2ee6c2 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>>> long granule,
>>>          spin_lock_init(&iovad->iova_rbtree_lock);
>>>        iovad->rbroot = RB_ROOT;
>>> +    iovad->cached_node = NULL;
>>>        iovad->cached32_node = NULL;
>>>        iovad->granule = granule;
>>>        iovad->start_pfn = start_pfn;
>>> @@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
>>>    static struct rb_node *
>>>    __get_cached_rbnode(struct iova_domain *iovad, unsigned long
>>> *limit_pfn)
>>>    {
>>> -    if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>>> -        (iovad->cached32_node == NULL))
>>> +    struct rb_node *cached_node = NULL;
>>> +    struct iova *curr_iova;
>>> +
>>> +    if (*limit_pfn <= iovad->dma_32bit_pfn)
>>> +        cached_node = iovad->cached32_node;
>>> +    if (!cached_node)
>>> +        cached_node = iovad->cached_node;
>>> +    if (!cached_node)
>>>            return rb_last(&iovad->rbroot);
>>> -    else {
>>> -        struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>>> -        struct iova *curr_iova =
>>> -            rb_entry(iovad->cached32_node, struct iova, node);
>>> -        *limit_pfn = curr_iova->pfn_lo;
>>> -        return prev_node;
>>> -    }
>>> +
>>> +    curr_iova = rb_entry(cached_node, struct iova, node);
>>> +    *limit_pfn = curr_iova->pfn_lo;
>>> +
>>> +    return rb_prev(cached_node);
>>>    }
>>>      static void
>>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>>> -    unsigned long limit_pfn, struct iova *new)
>>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova
>>> *new)
>>>    {
>>> -    if (limit_pfn != iovad->dma_32bit_pfn)
>>> -        return;
>>> -    iovad->cached32_node = &new->node;
>>> +    if (new->pfn_lo > iovad->dma_32bit_pfn)
>>> +        iovad->cached_node = &new->node;
>>> +    else
>>> +        iovad->cached32_node = &new->node;
>>>    }
>>>      static void
>>>    __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova
>>> *free)
>>>    {
>>>        struct iova *cached_iova;
>>> -    struct rb_node *curr;
>>> +    struct rb_node **curr = NULL;
>>>    -    if (!iovad->cached32_node)
>>> -        return;
>>> -    curr = iovad->cached32_node;
>>> -    cached_iova = rb_entry(curr, struct iova, node);
>>> +    if (free->pfn_hi < iovad->dma_32bit_pfn)
>>> +        curr = &iovad->cached32_node;
>>> +    if (!curr)
>>> +        curr = &iovad->cached_node;
> 
> +	if (!*curr)
> +		return;
> 
>>>    -    if (free->pfn_lo >= cached_iova->pfn_lo) {
>>> -        struct rb_node *node = rb_next(&free->node);
>>> -        struct iova *iova = rb_entry(node, struct iova, node);
>>> +    cached_iova = rb_entry(*curr, struct iova, node);
>>>    -        /* only cache if it's below 32bit pfn */
>>> -        if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>>> -            iovad->cached32_node = node;
>>> -        else
>>> -            iovad->cached32_node = NULL;
>>> -    }
>>> +    if (free->pfn_lo >= cached_iova->pfn_lo)
>>> +        *curr = rb_next(&free->node);
>>>    }
>>>      /* Insert the iova into domain rbtree by holding writer lock */
>>> @@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct
>>> iova_domain *iovad,
>>>    {
>>>        struct rb_node *prev, *curr = NULL;
>>>        unsigned long flags;
>>> -    unsigned long saved_pfn;
>>>        unsigned long align_mask = ~0UL;
>>>          if (size_aligned)
>>> @@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct
>>> iova_domain *iovad,
>>>          /* Walk the tree backwards */
>>>        spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>>> -    saved_pfn = limit_pfn;
>>>        curr = __get_cached_rbnode(iovad, &limit_pfn);
>>>        prev = curr;
>>>        while (curr) {
>>> @@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct
>>> iova_domain *iovad,
>>>          /* If we have 'prev', it's a valid place to start the
>>> insertion. */
>>>        iova_insert_rbtree(&iovad->rbroot, new, prev);
>>> -    __cached_rbnode_insert_update(iovad, saved_pfn, new);
>>> +    __cached_rbnode_insert_update(iovad, new);
>>>          spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>>    diff --git a/include/linux/iova.h b/include/linux/iova.h
>>> index e0a892ae45c0..0bb8df43b393 100644
>>> --- a/include/linux/iova.h
>>> +++ b/include/linux/iova.h
>>> @@ -40,7 +40,8 @@ struct iova_rcache {
>>>    struct iova_domain {
>>>        spinlock_t    iova_rbtree_lock; /* Lock to protect update of
>>> rbtree */
>>>        struct rb_root    rbroot;        /* iova domain rbtree root */
>>> -    struct rb_node    *cached32_node; /* Save last alloced node */
>>> +    struct rb_node    *cached_node;    /* Save last alloced node */
>>> +    struct rb_node    *cached32_node; /* Save last 32-bit alloced
>>> node */
>>>        unsigned long    granule;    /* pfn granularity for this domain */
>>>        unsigned long    start_pfn;    /* Lower limit for this domain */
>>>        unsigned long    dma_32bit_pfn;
>>>
>>
> 

-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations
  2017-07-26 11:17   ` Leizhen (ThunderTown)
@ 2017-08-08 12:03     ` Ganapatrao Kulkarni
  2017-08-09  1:42       ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 17+ messages in thread
From: Ganapatrao Kulkarni @ 2017-08-08 12:03 UTC (permalink / raw)
  To: Leizhen (ThunderTown)
  Cc: Joerg Roedel, Robin Murphy, Lorenzo Pieralisi, ard.biesheuvel,
	linux-kernel, iommu, ray.jui, nwatters, Jonathan.Cameron, dwmw2,
	linux-arm-kernel, Ganapatrao Kulkarni

On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
<thunder.leizhen@huawei.com> wrote:
>
>
> On 2017/7/26 19:08, Joerg Roedel wrote:
>> Hi Robin.
>>
>> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>>> Hi all,
>>>
>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>> close up all the little gaps that lead to hitting the slowest case, it
>>> seems a much better idea to simply make said slowest case a lot less slow.
>>
>> Do you have some numbers here? How big was the impact before these
>> patches and how is it with the patches?
> Here are some numbers:
>
> (before)$ iperf -s
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 85.3 KByte (default)
> ------------------------------------------------------------
> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
> [ ID] Interval       Transfer     Bandwidth
> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>
> (after)$ iperf -s
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 85.3 KByte (default)
> ------------------------------------------------------------
> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
> [ ID] Interval       Transfer     Bandwidth
> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>

Is this testing done on Host or on Guest/VM?

>>
>>
>>       Joerg
>>
>>
>> .
>>
>
> --
> Thanks!
> BestRegards
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

thanks
Ganapat


* Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations
  2017-08-08 12:03     ` Ganapatrao Kulkarni
@ 2017-08-09  1:42       ` Leizhen (ThunderTown)
  2017-08-09  3:24         ` Ganapatrao Kulkarni
  0 siblings, 1 reply; 17+ messages in thread
From: Leizhen (ThunderTown) @ 2017-08-09  1:42 UTC (permalink / raw)
  To: Ganapatrao Kulkarni
  Cc: Joerg Roedel, Robin Murphy, Lorenzo Pieralisi, ard.biesheuvel,
	linux-kernel, iommu, ray.jui, nwatters, Jonathan.Cameron, dwmw2,
	linux-arm-kernel, Ganapatrao Kulkarni



On 2017/8/8 20:03, Ganapatrao Kulkarni wrote:
> On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
> <thunder.leizhen@huawei.com> wrote:
>>
>>
>> On 2017/7/26 19:08, Joerg Roedel wrote:
>>> Hi Robin.
>>>
>>> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>>>> Hi all,
>>>>
>>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>>> close up all the little gaps that lead to hitting the slowest case, it
>>>> seems a much better idea to simply make said slowest case a lot less slow.
>>>
>>> Do you have some numbers here? How big was the impact before these
>>> patches and how is it with the patches?
>> Here are some numbers:
>>
>> (before)$ iperf -s
>> ------------------------------------------------------------
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> ------------------------------------------------------------
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
>> [ ID] Interval       Transfer     Bandwidth
>> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
>> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
>> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>
>> (after)$ iperf -s
>> ------------------------------------------------------------
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> ------------------------------------------------------------
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
>> [ ID] Interval       Transfer     Bandwidth
>> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
>> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
>> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>>
> 
> Is this testing done on Host or on Guest/VM?
Host

> 
>>>
>>>
>>>       Joerg
>>>
>>>
>>> .
>>>
>>
>> --
>> Thanks!
>> BestRegards
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 
> thanks
> Ganapat
> 
> .
> 

-- 
Thanks!
BestRegards


* Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations
  2017-08-09  1:42       ` Leizhen (ThunderTown)
@ 2017-08-09  3:24         ` Ganapatrao Kulkarni
  2017-08-09  4:09           ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 17+ messages in thread
From: Ganapatrao Kulkarni @ 2017-08-09  3:24 UTC (permalink / raw)
  To: Leizhen (ThunderTown)
  Cc: Joerg Roedel, Robin Murphy, Lorenzo Pieralisi, ard.biesheuvel,
	linux-kernel, iommu, ray.jui, nwatters, Jonathan.Cameron, dwmw2,
	linux-arm-kernel, Ganapatrao Kulkarni

On Wed, Aug 9, 2017 at 7:12 AM, Leizhen (ThunderTown)
<thunder.leizhen@huawei.com> wrote:
>
>
> On 2017/8/8 20:03, Ganapatrao Kulkarni wrote:
>> On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
>> <thunder.leizhen@huawei.com> wrote:
>>>
>>>
>>> On 2017/7/26 19:08, Joerg Roedel wrote:
>>>> Hi Robin.
>>>>
>>>> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>>>>> Hi all,
>>>>>
>>>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>>>> close up all the little gaps that lead to hitting the slowest case, it
>>>>> seems a much better idea to simply make said slowest case a lot less slow.
>>>>
>>>> Do you have some numbers here? How big was the impact before these
>>>> patches and how is it with the patches?
>>> Here are some numbers:
>>>
>>> (before)$ iperf -s
>>> ------------------------------------------------------------
>>> Server listening on TCP port 5001
>>> TCP window size: 85.3 KByte (default)
>>> ------------------------------------------------------------
>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
>>> [ ID] Interval       Transfer     Bandwidth
>>> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
>>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
>>> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
>>> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>>
>>> (after)$ iperf -s
>>> ------------------------------------------------------------
>>> Server listening on TCP port 5001
>>> TCP window size: 85.3 KByte (default)
>>> ------------------------------------------------------------
>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
>>> [ ID] Interval       Transfer     Bandwidth
>>> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
>>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
>>> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
>>> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>>>
>>
>> Is this testing done on Host or on Guest/VM?
> Host

As per your log, iperf throughput improved to 938 Mbits/sec
from 6.43 Mbits/sec.
IMO, this seems unrealistic; is something wrong with the testing?

>
>>
>>>>
>>>>
>>>>       Joerg
>>>>
>>>>
>>>> .
>>>>
>>>
>>> --
>>> Thanks!
>>> BestRegards
>>>
>>>
>>> _______________________________________________
>>> linux-arm-kernel mailing list
>>> linux-arm-kernel@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>
>> thanks
>> Ganapat
>>
>> .
>>
>
> --
> Thanks!
> BestRegards
>

thanks
Ganapat


* Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations
  2017-08-09  3:24         ` Ganapatrao Kulkarni
@ 2017-08-09  4:09           ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 17+ messages in thread
From: Leizhen (ThunderTown) @ 2017-08-09  4:09 UTC (permalink / raw)
  To: Ganapatrao Kulkarni
  Cc: Joerg Roedel, Robin Murphy, Lorenzo Pieralisi, ard.biesheuvel,
	linux-kernel, iommu, ray.jui, nwatters, Jonathan.Cameron, dwmw2,
	linux-arm-kernel, Ganapatrao Kulkarni



On 2017/8/9 11:24, Ganapatrao Kulkarni wrote:
> On Wed, Aug 9, 2017 at 7:12 AM, Leizhen (ThunderTown)
> <thunder.leizhen@huawei.com> wrote:
>>
>>
>> On 2017/8/8 20:03, Ganapatrao Kulkarni wrote:
>>> On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
>>> <thunder.leizhen@huawei.com> wrote:
>>>>
>>>>
>>>> On 2017/7/26 19:08, Joerg Roedel wrote:
>>>>> Hi Robin.
>>>>>
>>>>> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>>>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>>>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>>>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>>>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>>>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>>>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>>>>> close up all the little gaps that lead to hitting the slowest case, it
>>>>>> seems a much better idea to simply make said slowest case a lot less slow.
>>>>>
>>>>> Do you have some numbers here? How big was the impact before these
>>>>> patches and how is it with the patches?
>>>> Here are some numbers:
>>>>
>>>> (before)$ iperf -s
>>>> ------------------------------------------------------------
>>>> Server listening on TCP port 5001
>>>> TCP window size: 85.3 KByte (default)
>>>> ------------------------------------------------------------
>>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
>>>> [ ID] Interval       Transfer     Bandwidth
>>>> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
>>>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
>>>> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
>>>> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>>>
>>>> (after)$ iperf -s
>>>> ------------------------------------------------------------
>>>> Server listening on TCP port 5001
>>>> TCP window size: 85.3 KByte (default)
>>>> ------------------------------------------------------------
>>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
>>>> [ ID] Interval       Transfer     Bandwidth
>>>> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
>>>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
>>>> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
>>>> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>>>>
>>>
>>> Is this testing done on Host or on Guest/VM?
>> Host
> 
> As per your log, iperf throughput is improved to 938 Mbits/sec
> from  6.43 Mbits/sec.
> IMO, this seems to be unrealistic, some thing wrong with the testing?
For 64-bit non-PCI devices, the IOVA allocation always starts searching from the last rb-tree node.
When many IOVAs are allocated and held for a long time, the search has to check many rb nodes
before finding a suitable free space. From my tracing, the average count exceeds 10K.
        [free-space][free][used][...][used]
                      ^     ^          ^
                      |     |          |-----rb_last
                      |     |--------- maybe more than 10K allocated iova nodes
                      |------- for 32-bit devices, cached32_node remembers the latest freed node, which helps reduce the number of checks

This patch series adds a new member, "cached_node", to serve 64-bit devices the same way cached32_node serves 32-bit devices.
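The two-cache selection the series introduces can be sketched in plain C. This is a simplified model with hypothetical `toy_*` names, not the kernel code; the real `__get_cached_rbnode()` and `__cached_rbnode_insert_update()` operate on rbtree nodes under iova_rbtree_lock, but the cache-selection logic is the same shape:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical names) of the series' two-cache lookup:
 * requests below dma_32bit_pfn prefer the 32-bit cache, everything
 * falls back to the whole-space cache, and only if both are empty
 * does the search restart from the top of the tree (rb_last()). */
struct toy_node { unsigned long pfn_lo; };

struct toy_domain {
	struct toy_node *cached_node;   /* last alloced node, whole space */
	struct toy_node *cached32_node; /* last alloced node below 4GB */
	unsigned long dma_32bit_pfn;
};

/* Returns the cached node to start the backwards search from, or
 * NULL meaning "fall back to the slow path from rb_last()". */
struct toy_node *toy_get_cached(struct toy_domain *d, unsigned long limit_pfn)
{
	struct toy_node *cached = NULL;

	if (limit_pfn <= d->dma_32bit_pfn)
		cached = d->cached32_node;
	if (!cached)
		cached = d->cached_node;
	return cached;
}

/* Mirrors the insert update: route the new node to the cache
 * matching the space it was allocated in. */
void toy_insert_update(struct toy_domain *d, struct toy_node *new)
{
	if (new->pfn_lo > d->dma_32bit_pfn)
		d->cached_node = new;
	else
		d->cached32_node = new;
}
```

With both caches empty a lookup returns NULL (the old 64-bit slow path); after one allocation in each space, 32-bit and 64-bit requests each resume from their own cached node.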

> 
>>
>>>
>>>>>
>>>>>
>>>>>       Joerg
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>> --
>>>> Thanks!
>>>> BestRegards
>>>>
>>>>
>>>> _______________________________________________
>>>> linux-arm-kernel mailing list
>>>> linux-arm-kernel@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>>
>>> thanks
>>> Ganapat
>>>
>>> .
>>>
>>
>> --
>> Thanks!
>> BestRegards
>>
> 
> thanks
> Ganapat
> 
> .
> 

-- 
Thanks!
BestRegards


* Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
  2017-08-03 19:41       ` Nate Watterson
@ 2017-08-31  9:46         ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 17+ messages in thread
From: Leizhen (ThunderTown) @ 2017-08-31  9:46 UTC (permalink / raw)
  To: Nate Watterson, Robin Murphy, joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, lorenzo.pieralisi,
	ard.biesheuvel, Jonathan.Cameron, ray.jui



On 2017/8/4 3:41, Nate Watterson wrote:
> Hi Robin,
> 
> On 7/31/2017 7:42 AM, Robin Murphy wrote:
>> Hi Nate,
>>
>> On 29/07/17 04:57, Nate Watterson wrote:
>>> Hi Robin,
>>> I am seeing a crash when performing very basic testing on this series
>>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>>> patch is the culprit, but this rcache business is still mostly
>>> witchcraft to me.
>>>
>>> # ifconfig eth5 up
>>> # ifconfig eth5 down
>>>      Unable to handle kernel NULL pointer dereference at virtual address
>>> 00000020
>>>      user pgtable: 64k pages, 48-bit VAs, pgd = ffff8007dbf47c00
>>>      [0000000000000020] *pgd=00000006efab0003, *pud=00000006efab0003,
>>> *pmd=00000007d8720003, *pte=0000000000000000
>>>      Internal error: Oops: 96000007 [#1] SMP
>>>      Modules linked in:
>>>      CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>>>      task: ffff8007da1e5780 task.stack: ffff8007ddcb8000
>>>      PC is at __cached_rbnode_delete_update+0x2c/0x58
>>>      LR is at private_free_iova+0x2c/0x60
>>>      pc : [<ffff00000852c6e4>] lr : [<ffff00000852cdac>] pstate: 204001c5
>>>      sp : ffff8007ddcbba00
>>>      x29: ffff8007ddcbba00 x28: ffff8007c8350210
>>>      x27: ffff8007d1a80000 x26: ffff8007dcc20800
>>>      x25: 0000000000000140 x24: ffff8007c98f0008
>>>      x23: 00000000fffffe4e x22: 0000000000000140
>>>      x21: ffff8007c98f0008 x20: ffff8007c9adb240
>>>      x19: ffff8007c98f0018 x18: 0000000000000010
>>>      x17: 0000000000000000 x16: 0000000000000000
>>>      x15: 4000000000000000 x14: 0000000000000000
>>>      x13: 00000000ffffffff x12: 0000000000000001
>>>      x11: dead000000000200 x10: 00000000ffffffff
>>>      x9 : 0000000000000000 x8 : ffff8007c9adb1c0
>>>      x7 : 0000000040002000 x6 : 0000000000210d00
>>>      x5 : 0000000000000000 x4 : 000000000000c57e
>>>      x3 : 00000000ffffffcf x2 : 00000000ffffffcf
>>>      x1 : ffff8007c9adb240 x0 : 0000000000000000
>>>      [...]
>>>      [<ffff00000852c6e4>] __cached_rbnode_delete_update+0x2c/0x58
>>>      [<ffff00000852cdac>] private_free_iova+0x2c/0x60
>>>      [<ffff00000852cea4>] iova_magazine_free_pfns+0x4c/0xa0
>>>      [<ffff00000852d0a8>] free_iova_fast+0x1b0/0x230
>>>      [<ffff000008529d34>] iommu_dma_free_iova+0x5c/0x80
>>>      [<ffff000008529db4>] __iommu_dma_unmap+0x5c/0x98
>>>      [<ffff00000852aebc>] iommu_dma_unmap_resource+0x24/0x30
>>>      [<ffff00000852aed4>] iommu_dma_unmap_page+0xc/0x18
>>>      [<ffff000008093030>] __iommu_unmap_page+0x40/0x60
>>>      [<ffff0000087401bc>] mlx5e_page_release+0xbc/0x128
>>>      [<ffff000008740498>] mlx5e_dealloc_rx_wqe+0x30/0x40
>>>      [<ffff000008735ef0>] mlx5e_close_channel+0x70/0x1f8
>>>      [<ffff0000087370f4>] mlx5e_close_channels+0x2c/0x50
>>>      [<ffff0000087385dc>] mlx5e_close_locked+0x54/0x68
>>>      [<ffff000008738620>] mlx5e_close+0x30/0x58
>>>      [...]
>>>
>>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>>>                92|if (free->pfn_hi < iovad->dma_32bit_pfn)
>>> FFFF00000852C6C4|            ldr     x3,[x1,#0x18]    ; x3,[free,#24]
>>> FFFF00000852C6C8|            ldr     x2,[x0,#0x30]    ; x2,[iovad,#48]
>>> FFFF00000852C6CC|            cmp     x3,x2
>>> FFFF00000852C6D0|            b.cs    0xFFFF00000852C708
>>>                  |        curr = &iovad->cached32_node;
>>>                94|if (!curr)
>>> FFFF00000852C6D4|            adds    x19,x0,#0x18     ; x19,iovad,#24
>>> FFFF00000852C6D8|            b.eq    0xFFFF00000852C708
>>>                  |
>>>                  |cached_iova = rb_entry(*curr, struct iova, node);
>>>                  |
>>>                99|if (free->pfn_lo >= cached_iova->pfn_lo)
>>> FFFF00000852C6DC|            ldr     x0,[x19]         ; xiovad,[curr]
>>> FFFF00000852C6E0|            ldr     x2,[x1,#0x20]    ; x2,[free,#32]
>>> FFFF00000852C6E4|            ldr     x0,[x0,#0x20]    ; x0,[x0,#32]
>>> Apparently cached_iova was NULL so the pfn_lo access faulted.
>>>
>>> FFFF00000852C6E8|            cmp     x2,x0
>>> FFFF00000852C6EC|            b.cc    0xFFFF00000852C6FC
>>> FFFF00000852C6F0|            mov     x0,x1            ; x0,free
>>>               100|        *curr = rb_next(&free->node);
>>> After instrumenting the code a bit, this seems to be the culprit. In the
>>> previous call, free->pfn_lo was 0xFFFF_FFFF which is actually the
>>> dma_limit for the domain so rb_next() returns NULL.
>>>
>>> Let me know if you have any questions or would like additional tests
>>> run. I also applied your "DMA domain debug info" patches and dumped the
>>> contents of the domain at each of the steps above in case that would be
>>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is
>>> especially when using 64k pages and many CPUs.
>>
>> Thanks for the report - I somehow managed to reason myself out of
>> keeping the "no cached node" check in __cached_rbnode_delete_update() on
>> the assumption that it must always be set by a previous allocation.
>> However, there is indeed just one case for which that fails: when
>> you free any IOVA immediately after freeing the very topmost one. Which
>> is something that freeing an entire magazine's worth of IOVAs back to
>> the tree all at once has a very real chance of doing...
>>
>> The obvious straightforward fix is inline below, but I'm now starting to
>> understand the appeal of reserving a sentinel node to ensure the tree
Hi Robin,
   Have you prepared v3? I think we'd better use sentinel nodes. Otherwise,
there may be many special cases to consider, which could make the code difficult
to understand. Sentinel nodes let the dma32 and dma64 ranges be considered separately,
with both parts handled the same way as before.
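One way to read the sentinel-node suggestion is the following speculative sketch. All names here are hypothetical; this is not code from this series or from any posted v3, just an illustration of why a permanent anchor entry per range removes the empty-cache special case that caused Nate's crash:

```c
#include <assert.h>

/* Speculative sketch: reserve one permanent zero-size "anchor" entry
 * at the top of each range, so the cached pointers are initialised to
 * a real node and can never become NULL. A delete update then only
 * ever advances a valid pointer, and no NULL check is needed. */
struct anchor_node { unsigned long pfn_lo; };

struct anchor_domain {
	struct anchor_node anchor32;      /* sentinel at top of 32-bit range */
	struct anchor_node anchor;        /* sentinel at top of whole space */
	struct anchor_node *cached32_node;
	struct anchor_node *cached_node;
};

void anchor_domain_init(struct anchor_domain *d,
			unsigned long dma_32bit_pfn,
			unsigned long end_pfn)
{
	d->anchor32.pfn_lo = dma_32bit_pfn; /* zero-size marker, never freed */
	d->anchor.pfn_lo = end_pfn;
	d->cached32_node = &d->anchor32;    /* caches start valid, stay valid */
	d->cached_node = &d->anchor;
}
```

Because the sentinels are never freed, both sub-ranges can be searched with the same uniform logic as before, which is presumably what "both parts handled the same way" refers to.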


>> can never be empty, so I might have a quick go at that to see if it
>> results in nicer code overall.
> 
> After applying your fix, the crash no longer occurs, but the performance
> of this use case is only marginally less awful than it was before. I'll
> start a new thread to discuss the causes and potential solutions.
> 
> Nate
>>
>> Robin.
>>
>>>
>>> -Nate
>>>
>>> On 7/21/2017 7:42 AM, Robin Murphy wrote:
>>>> The cached node mechanism provides a significant performance benefit for
>>>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>>>> or where the 32-bit space is full, the loss of this benefit can be
>>>> significant - on large systems there can be many thousands of entries in
>>>> the tree, such that traversing to the end then walking all the way down
>>>> to find free space every time becomes increasingly awful.
>>>>
>>>> Maintain a similar cached node for the whole IOVA space as a superset of
>>>> the 32-bit space so that performance can remain much more consistent.
>>>>
>>>> Inspired by work by Zhen Lei <thunder.leizhen@huawei.com>.
>>>>
>>>> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>>> Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
>>>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>>>> ---
>>>>
>>>> v2: No change
>>>>
>>>>    drivers/iommu/iova.c | 59
>>>> +++++++++++++++++++++++++---------------------------
>>>>    include/linux/iova.h |  3 ++-
>>>>    2 files changed, 30 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>>> index d094d1ca8f23..f5809a2ee6c2 100644
>>>> --- a/drivers/iommu/iova.c
>>>> +++ b/drivers/iommu/iova.c
>>>> @@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>>>> long granule,
>>>>          spin_lock_init(&iovad->iova_rbtree_lock);
>>>>        iovad->rbroot = RB_ROOT;
>>>> +    iovad->cached_node = NULL;
>>>>        iovad->cached32_node = NULL;
>>>>        iovad->granule = granule;
>>>>        iovad->start_pfn = start_pfn;
>>>> @@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
>>>>    static struct rb_node *
>>>>    __get_cached_rbnode(struct iova_domain *iovad, unsigned long
>>>> *limit_pfn)
>>>>    {
>>>> -    if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>>>> -        (iovad->cached32_node == NULL))
>>>> +    struct rb_node *cached_node = NULL;
>>>> +    struct iova *curr_iova;
>>>> +
>>>> +    if (*limit_pfn <= iovad->dma_32bit_pfn)
>>>> +        cached_node = iovad->cached32_node;
>>>> +    if (!cached_node)
>>>> +        cached_node = iovad->cached_node;
>>>> +    if (!cached_node)
>>>>            return rb_last(&iovad->rbroot);
>>>> -    else {
>>>> -        struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>>>> -        struct iova *curr_iova =
>>>> -            rb_entry(iovad->cached32_node, struct iova, node);
>>>> -        *limit_pfn = curr_iova->pfn_lo;
>>>> -        return prev_node;
>>>> -    }
>>>> +
>>>> +    curr_iova = rb_entry(cached_node, struct iova, node);
>>>> +    *limit_pfn = curr_iova->pfn_lo;
>>>> +
>>>> +    return rb_prev(cached_node);
>>>>    }
>>>>      static void
>>>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>>>> -    unsigned long limit_pfn, struct iova *new)
>>>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova
>>>> *new)
>>>>    {
>>>> -    if (limit_pfn != iovad->dma_32bit_pfn)
>>>> -        return;
>>>> -    iovad->cached32_node = &new->node;
>>>> +    if (new->pfn_lo > iovad->dma_32bit_pfn)
>>>> +        iovad->cached_node = &new->node;
>>>> +    else
>>>> +        iovad->cached32_node = &new->node;
>>>>    }
>>>>      static void
>>>>    __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova
>>>> *free)
>>>>    {
>>>>        struct iova *cached_iova;
>>>> -    struct rb_node *curr;
>>>> +    struct rb_node **curr = NULL;
>>>>    -    if (!iovad->cached32_node)
>>>> -        return;
>>>> -    curr = iovad->cached32_node;
>>>> -    cached_iova = rb_entry(curr, struct iova, node);
>>>> +    if (free->pfn_hi < iovad->dma_32bit_pfn)
>>>> +        curr = &iovad->cached32_node;
>>>> +    if (!curr)
>>>> +        curr = &iovad->cached_node;
>>
>> +    if (!*curr)
>> +        return;
>>
>>>>    -    if (free->pfn_lo >= cached_iova->pfn_lo) {
>>>> -        struct rb_node *node = rb_next(&free->node);
>>>> -        struct iova *iova = rb_entry(node, struct iova, node);
>>>> +    cached_iova = rb_entry(*curr, struct iova, node);
>>>>    -        /* only cache if it's below 32bit pfn */
>>>> -        if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>>>> -            iovad->cached32_node = node;
>>>> -        else
>>>> -            iovad->cached32_node = NULL;
>>>> -    }
>>>> +    if (free->pfn_lo >= cached_iova->pfn_lo)
>>>> +        *curr = rb_next(&free->node);
>>>>    }
>>>>      /* Insert the iova into domain rbtree by holding writer lock */
>>>> @@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct
>>>> iova_domain *iovad,
>>>>    {
>>>>        struct rb_node *prev, *curr = NULL;
>>>>        unsigned long flags;
>>>> -    unsigned long saved_pfn;
>>>>        unsigned long align_mask = ~0UL;
>>>>          if (size_aligned)
>>>> @@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct
>>>> iova_domain *iovad,
>>>>          /* Walk the tree backwards */
>>>>        spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>>>> -    saved_pfn = limit_pfn;
>>>>        curr = __get_cached_rbnode(iovad, &limit_pfn);
>>>>        prev = curr;
>>>>        while (curr) {
>>>> @@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct
>>>> iova_domain *iovad,
>>>>          /* If we have 'prev', it's a valid place to start the
>>>> insertion. */
>>>>        iova_insert_rbtree(&iovad->rbroot, new, prev);
>>>> -    __cached_rbnode_insert_update(iovad, saved_pfn, new);
>>>> +    __cached_rbnode_insert_update(iovad, new);
>>>>          spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>>>    diff --git a/include/linux/iova.h b/include/linux/iova.h
>>>> index e0a892ae45c0..0bb8df43b393 100644
>>>> --- a/include/linux/iova.h
>>>> +++ b/include/linux/iova.h
>>>> @@ -40,7 +40,8 @@ struct iova_rcache {
>>>>    struct iova_domain {
>>>>        spinlock_t    iova_rbtree_lock; /* Lock to protect update of
>>>> rbtree */
>>>>        struct rb_root    rbroot;        /* iova domain rbtree root */
>>>> -    struct rb_node    *cached32_node; /* Save last alloced node */
>>>> +    struct rb_node    *cached_node;    /* Save last alloced node */
>>>> +    struct rb_node    *cached32_node; /* Save last 32-bit alloced
>>>> node */
>>>>        unsigned long    granule;    /* pfn granularity for this domain */
>>>>        unsigned long    start_pfn;    /* Lower limit for this domain */
>>>>        unsigned long    dma_32bit_pfn;
>>>>
>>>
>>
> 

-- 
Thanks!
BestRegards


* Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
  2017-07-31 11:42     ` Robin Murphy
  2017-08-03 19:41       ` Nate Watterson
@ 2017-09-19  9:42       ` Leizhen (ThunderTown)
  2017-09-19 10:07         ` Robin Murphy
  1 sibling, 1 reply; 17+ messages in thread
From: Leizhen (ThunderTown) @ 2017-09-19  9:42 UTC (permalink / raw)
  To: Robin Murphy, Nate Watterson, joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, lorenzo.pieralisi,
	ard.biesheuvel, Jonathan.Cameron, ray.jui



On 2017/7/31 19:42, Robin Murphy wrote:
> Hi Nate,
> 
> On 29/07/17 04:57, Nate Watterson wrote:
>> Hi Robin,
>> I am seeing a crash when performing very basic testing on this series
>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>> patch is the culprit, but this rcache business is still mostly
>> witchcraft to me.
>>
>> # ifconfig eth5 up
>> # ifconfig eth5 down
>>     Unable to handle kernel NULL pointer dereference at virtual address
>> 00000020
>>     user pgtable: 64k pages, 48-bit VAs, pgd = ffff8007dbf47c00
>>     [0000000000000020] *pgd=00000006efab0003, *pud=00000006efab0003,
>> *pmd=00000007d8720003, *pte=0000000000000000
>>     Internal error: Oops: 96000007 [#1] SMP
>>     Modules linked in:
>>     CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>>     task: ffff8007da1e5780 task.stack: ffff8007ddcb8000
>>     PC is at __cached_rbnode_delete_update+0x2c/0x58
>>     LR is at private_free_iova+0x2c/0x60
>>     pc : [<ffff00000852c6e4>] lr : [<ffff00000852cdac>] pstate: 204001c5
>>     sp : ffff8007ddcbba00
>>     x29: ffff8007ddcbba00 x28: ffff8007c8350210
>>     x27: ffff8007d1a80000 x26: ffff8007dcc20800
>>     x25: 0000000000000140 x24: ffff8007c98f0008
>>     x23: 00000000fffffe4e x22: 0000000000000140
>>     x21: ffff8007c98f0008 x20: ffff8007c9adb240
>>     x19: ffff8007c98f0018 x18: 0000000000000010
>>     x17: 0000000000000000 x16: 0000000000000000
>>     x15: 4000000000000000 x14: 0000000000000000
>>     x13: 00000000ffffffff x12: 0000000000000001
>>     x11: dead000000000200 x10: 00000000ffffffff
>>     x9 : 0000000000000000 x8 : ffff8007c9adb1c0
>>     x7 : 0000000040002000 x6 : 0000000000210d00
>>     x5 : 0000000000000000 x4 : 000000000000c57e
>>     x3 : 00000000ffffffcf x2 : 00000000ffffffcf
>>     x1 : ffff8007c9adb240 x0 : 0000000000000000
>>     [...]
>>     [<ffff00000852c6e4>] __cached_rbnode_delete_update+0x2c/0x58
>>     [<ffff00000852cdac>] private_free_iova+0x2c/0x60
>>     [<ffff00000852cea4>] iova_magazine_free_pfns+0x4c/0xa0
>>     [<ffff00000852d0a8>] free_iova_fast+0x1b0/0x230
>>     [<ffff000008529d34>] iommu_dma_free_iova+0x5c/0x80
>>     [<ffff000008529db4>] __iommu_dma_unmap+0x5c/0x98
>>     [<ffff00000852aebc>] iommu_dma_unmap_resource+0x24/0x30
>>     [<ffff00000852aed4>] iommu_dma_unmap_page+0xc/0x18
>>     [<ffff000008093030>] __iommu_unmap_page+0x40/0x60
>>     [<ffff0000087401bc>] mlx5e_page_release+0xbc/0x128
>>     [<ffff000008740498>] mlx5e_dealloc_rx_wqe+0x30/0x40
>>     [<ffff000008735ef0>] mlx5e_close_channel+0x70/0x1f8
>>     [<ffff0000087370f4>] mlx5e_close_channels+0x2c/0x50
>>     [<ffff0000087385dc>] mlx5e_close_locked+0x54/0x68
>>     [<ffff000008738620>] mlx5e_close+0x30/0x58
>>     [...]
>>
>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>>               92|if (free->pfn_hi < iovad->dma_32bit_pfn)
>> FFFF00000852C6C4|            ldr     x3,[x1,#0x18]    ; x3,[free,#24]
>> FFFF00000852C6C8|            ldr     x2,[x0,#0x30]    ; x2,[iovad,#48]
>> FFFF00000852C6CC|            cmp     x3,x2
>> FFFF00000852C6D0|            b.cs    0xFFFF00000852C708
>>                 |        curr = &iovad->cached32_node;
>>               94|if (!curr)
>> FFFF00000852C6D4|            adds    x19,x0,#0x18     ; x19,iovad,#24
>> FFFF00000852C6D8|            b.eq    0xFFFF00000852C708
>>                 |
>>                 |cached_iova = rb_entry(*curr, struct iova, node);
>>                 |
>>               99|if (free->pfn_lo >= cached_iova->pfn_lo)
>> FFFF00000852C6DC|            ldr     x0,[x19]         ; xiovad,[curr]
>> FFFF00000852C6E0|            ldr     x2,[x1,#0x20]    ; x2,[free,#32]
>> FFFF00000852C6E4|            ldr     x0,[x0,#0x20]    ; x0,[x0,#32]
>> Apparently cached_iova was NULL so the pfn_lo access faulted.
>>
>> FFFF00000852C6E8|            cmp     x2,x0
>> FFFF00000852C6EC|            b.cc    0xFFFF00000852C6FC
>> FFFF00000852C6F0|            mov     x0,x1            ; x0,free
>>              100|        *curr = rb_next(&free->node);
>> After instrumenting the code a bit, this seems to be the culprit. In the
>> previous call, free->pfn_lo was 0xFFFF_FFFF, which is actually the
>> dma_limit for the domain, so rb_next() returns NULL.
>>
>> Let me know if you have any questions or would like additional tests
>> run. I also applied your "DMA domain debug info" patches and dumped the
>> contents of the domain at each of the steps above in case that would be
>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is,
>> especially when using 64k pages and many CPUs.
> 
> Thanks for the report - I somehow managed to reason myself out of
> keeping the "no cached node" check in __cached_rbnode_delete_update() on
> the assumption that it must always be set by a previous allocation.
> However, there is indeed just one case for which that fails: when
> you free any IOVA immediately after freeing the very topmost one. Which
> is something that freeing an entire magazine's worth of IOVAs back to
> the tree all at once has a very real chance of doing...
> 
> The obvious straightforward fix is inline below, but I'm now starting to
> understand the appeal of reserving a sentinel node to ensure the tree
> can never be empty, so I might have a quick go at that to see if it
> results in nicer code overall.
> 
> Robin.
> 
>>
>> -Nate
>>
>> On 7/21/2017 7:42 AM, Robin Murphy wrote:
>>> The cached node mechanism provides a significant performance benefit for
>>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>>> or where the 32-bit space is full, the loss of this benefit can be
>>> significant - on large systems there can be many thousands of entries in
>>> the tree, such that traversing to the end then walking all the way down
>>> to find free space every time becomes increasingly awful.
>>>
>>> Maintain a similar cached node for the whole IOVA space as a superset of
>>> the 32-bit space so that performance can remain much more consistent.
>>>
>>> Inspired by work by Zhen Lei <thunder.leizhen@huawei.com>.
>>>
>>> Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> Tested-by: Zhen Lei <thunder.leizhen@huawei.com>
>>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>>> ---
>>>
>>> v2: No change
>>>
>>>   drivers/iommu/iova.c | 59
>>> +++++++++++++++++++++++++---------------------------
>>>   include/linux/iova.h |  3 ++-
>>>   2 files changed, 30 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index d094d1ca8f23..f5809a2ee6c2 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>>> long granule,
>>>         spin_lock_init(&iovad->iova_rbtree_lock);
>>>       iovad->rbroot = RB_ROOT;
>>> +    iovad->cached_node = NULL;
>>>       iovad->cached32_node = NULL;
>>>       iovad->granule = granule;
>>>       iovad->start_pfn = start_pfn;
>>> @@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
>>>   static struct rb_node *
>>>   __get_cached_rbnode(struct iova_domain *iovad, unsigned long
>>> *limit_pfn)
>>>   {
>>> -    if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>>> -        (iovad->cached32_node == NULL))
>>> +    struct rb_node *cached_node = NULL;
>>> +    struct iova *curr_iova;
>>> +
>>> +    if (*limit_pfn <= iovad->dma_32bit_pfn)
>>> +        cached_node = iovad->cached32_node;
>>> +    if (!cached_node)
>>> +        cached_node = iovad->cached_node;
>>> +    if (!cached_node)
>>>           return rb_last(&iovad->rbroot);
>>> -    else {
>>> -        struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>>> -        struct iova *curr_iova =
>>> -            rb_entry(iovad->cached32_node, struct iova, node);
>>> -        *limit_pfn = curr_iova->pfn_lo;
>>> -        return prev_node;
>>> -    }
>>> +
>>> +    curr_iova = rb_entry(cached_node, struct iova, node);
>>> +    *limit_pfn = curr_iova->pfn_lo;
>>> +
>>> +    return rb_prev(cached_node);
>>>   }
>>>     static void
>>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>>> -    unsigned long limit_pfn, struct iova *new)
>>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova
>>> *new)
>>>   {
>>> -    if (limit_pfn != iovad->dma_32bit_pfn)
>>> -        return;
>>> -    iovad->cached32_node = &new->node;
>>> +    if (new->pfn_lo > iovad->dma_32bit_pfn)
>>> +        iovad->cached_node = &new->node;
>>> +    else
>>> +        iovad->cached32_node = &new->node;
>>>   }
>>>     static void
>>>   __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova
>>> *free)
>>>   {
>>>       struct iova *cached_iova;
>>> -    struct rb_node *curr;
>>> +    struct rb_node **curr = NULL;
>>>   -    if (!iovad->cached32_node)
>>> -        return;
>>> -    curr = iovad->cached32_node;
>>> -    cached_iova = rb_entry(curr, struct iova, node);

-----------------------------
>>> +    if (free->pfn_hi < iovad->dma_32bit_pfn)
>>> +        curr = &iovad->cached32_node;
>>> +    if (!curr)
>>> +        curr = &iovad->cached_node;
> 
> +	if (!*curr)
> +		return;
Would it be worth trying the following adjustment?
+	if (free->pfn_hi < iovad->dma_32bit_pfn)
+		curr = &iovad->cached32_node;
+	else
+		curr = &iovad->cached_node;
+
+	if (!*curr) {
+		*curr = rb_next(&free->node);
+		return;
+	}


> 
>>>   -    if (free->pfn_lo >= cached_iova->pfn_lo) {
>>> -        struct rb_node *node = rb_next(&free->node);
>>> -        struct iova *iova = rb_entry(node, struct iova, node);
>>> +    cached_iova = rb_entry(*curr, struct iova, node);
>>>   -        /* only cache if it's below 32bit pfn */
>>> -        if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>>> -            iovad->cached32_node = node;
>>> -        else
>>> -            iovad->cached32_node = NULL;
>>> -    }
>>> +    if (free->pfn_lo >= cached_iova->pfn_lo)
>>> +        *curr = rb_next(&free->node);
>>>   }
>>>     /* Insert the iova into domain rbtree by holding writer lock */
>>> @@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct
>>> iova_domain *iovad,
>>>   {
>>>       struct rb_node *prev, *curr = NULL;
>>>       unsigned long flags;
>>> -    unsigned long saved_pfn;
>>>       unsigned long align_mask = ~0UL;
>>>         if (size_aligned)
>>> @@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct
>>> iova_domain *iovad,
>>>         /* Walk the tree backwards */
>>>       spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>>> -    saved_pfn = limit_pfn;
>>>       curr = __get_cached_rbnode(iovad, &limit_pfn);
>>>       prev = curr;
>>>       while (curr) {
>>> @@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct
>>> iova_domain *iovad,
>>>         /* If we have 'prev', it's a valid place to start the
>>> insertion. */
>>>       iova_insert_rbtree(&iovad->rbroot, new, prev);
>>> -    __cached_rbnode_insert_update(iovad, saved_pfn, new);
>>> +    __cached_rbnode_insert_update(iovad, new);
>>>         spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>>   diff --git a/include/linux/iova.h b/include/linux/iova.h
>>> index e0a892ae45c0..0bb8df43b393 100644
>>> --- a/include/linux/iova.h
>>> +++ b/include/linux/iova.h
>>> @@ -40,7 +40,8 @@ struct iova_rcache {
>>>   struct iova_domain {
>>>       spinlock_t    iova_rbtree_lock; /* Lock to protect update of
>>> rbtree */
>>>       struct rb_root    rbroot;        /* iova domain rbtree root */
>>> -    struct rb_node    *cached32_node; /* Save last alloced node */
>>> +    struct rb_node    *cached_node;    /* Save last alloced node */
>>> +    struct rb_node    *cached32_node; /* Save last 32-bit alloced
>>> node */
>>>       unsigned long    granule;    /* pfn granularity for this domain */
>>>       unsigned long    start_pfn;    /* Lower limit for this domain */
>>>       unsigned long    dma_32bit_pfn;
>>>
>>
> 
> 
> .
> 

-- 
Thanks!
Best Regards

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
  2017-09-19  9:42       ` Leizhen (ThunderTown)
@ 2017-09-19 10:07         ` Robin Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Robin Murphy @ 2017-09-19 10:07 UTC (permalink / raw)
  To: Leizhen (ThunderTown), Nate Watterson, joro
  Cc: iommu, linux-arm-kernel, linux-kernel, dwmw2, lorenzo.pieralisi,
	ard.biesheuvel, Jonathan.Cameron, ray.jui

On 19/09/17 10:42, Leizhen (ThunderTown) wrote:
[...]
>>>>     static void
>>>>   __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova
>>>> *free)
>>>>   {
>>>>       struct iova *cached_iova;
>>>> -    struct rb_node *curr;
>>>> +    struct rb_node **curr = NULL;
>>>>   -    if (!iovad->cached32_node)
>>>> -        return;
>>>> -    curr = iovad->cached32_node;
>>>> -    cached_iova = rb_entry(curr, struct iova, node);
> 
> -----------------------------
>>>> +    if (free->pfn_hi < iovad->dma_32bit_pfn)
>>>> +        curr = &iovad->cached32_node;
>>>> +    if (!curr)
>>>> +        curr = &iovad->cached_node;
>>
>> +	if (!*curr)
>> +		return;
> Would it be worth trying the following adjustment?
> +	if (free->pfn_hi < iovad->dma_32bit_pfn)
> +		curr = &iovad->cached32_node;
> +	else
> +		curr = &iovad->cached_node;
> +
> +	if (!*curr) {
> +		*curr = rb_next(&free->node);
> +		return;
> +	}

Yeah, I spotted that this looked a bit wonky after I posted it. It's
already cleaned up in v3, which I'll be posting shortly after I write up
some cover letters.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-09-19 10:07 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-21 11:41 [PATCH v2 0/4] Optimise 64-bit IOVA allocations Robin Murphy
2017-07-21 11:41 ` [PATCH v2 1/4] iommu/iova: Optimise rbtree searching Robin Murphy
2017-07-21 11:41 ` [PATCH v2 2/4] iommu/iova: Optimise the padding calculation Robin Murphy
2017-07-21 11:42 ` [PATCH v2 3/4] iommu/iova: Extend rbtree node caching Robin Murphy
2017-07-29  3:57   ` Nate Watterson
2017-07-31 11:42     ` Robin Murphy
2017-08-03 19:41       ` Nate Watterson
2017-08-31  9:46         ` Leizhen (ThunderTown)
2017-09-19  9:42       ` Leizhen (ThunderTown)
2017-09-19 10:07         ` Robin Murphy
2017-07-21 11:42 ` [PATCH v2 4/4] iommu/iova: Make dma_32bit_pfn implicit Robin Murphy
2017-07-26 11:08 ` [PATCH v2 0/4] Optimise 64-bit IOVA allocations Joerg Roedel
2017-07-26 11:17   ` Leizhen (ThunderTown)
2017-08-08 12:03     ` Ganapatrao Kulkarni
2017-08-09  1:42       ` Leizhen (ThunderTown)
2017-08-09  3:24         ` Ganapatrao Kulkarni
2017-08-09  4:09           ` Leizhen (ThunderTown)
